AT32F403A anyone?

ozcar
Posts: 143
Joined: Wed Apr 29, 2020 9:07 pm
Answers: 5

Re: AT32F403A anyone?

Post by ozcar »

webjorn wrote: Fri Sep 16, 2022 10:10 am ...
I cannot see any good way to unroll the loop since I need to change the array index, but I have to study ARM assembler.
The ldrh.w and the strh could be replicated, but R7 needs to change... +2 I guess...

ldrh.w r1,[r7],#2
strh r1,[r2,#12]
adds r7,#2

this should be 4 instructions per move, so sampling should be 60 MHz...

We'll see....

Gullik
You could look at automatic increment of registers, but if the loop is unrolled, then the offset in each load instruction can just be increased without needing to add to a register. Compiler may well do things like that anyway, perhaps at the expense of making the timing a bit erratic, but then unless you disable interrupts, you could expect it to get a lot worse at times.

Here is some compiler generated code that more-or-less gets it down to 1.5 instructions per move (data array declared as uint32_t). I'm not sure if that really makes much difference though.

 ...
 80001a8:	6890      	ldr	r0, [r2, #8]  ; r2 already setup to point to data array
 80001aa:	b410      	push	{r4}
 80001ac:	e9d2 4100 	ldrd	r4, r1, [r2]
 80001b0:	60dc      	str	r4, [r3, #12]  ; r3 is GPIOx pointer
 80001b2:	60d9      	str	r1, [r3, #12]
 80001b4:	e9d2 1403 	ldrd	r1, r4, [r2, #12]
 80001b8:	60d8      	str	r0, [r3, #12]
 80001ba:	60d9      	str	r1, [r3, #12]
 80001bc:	e9d2 1005 	ldrd	r1, r0, [r2, #20]
 80001c0:	60dc      	str	r4, [r3, #12]
 80001c2:	60d9      	str	r1, [r3, #12]
 80001c4:	e9d2 1407 	ldrd	r1, r4, [r2, #28]
 80001c8:	60d8      	str	r0, [r3, #12]
 80001ca:	60d9      	str	r1, [r3, #12]
 80001cc:	e9d2 1009 	ldrd	r1, r0, [r2, #36]	; 0x24
 80001d0:	60dc      	str	r4, [r3, #12]
 80001d2:	60d9      	str	r1, [r3, #12]
 80001d4:	e9d2 140b 	ldrd	r1, r4, [r2, #44]	; 0x2c
 80001d8:	60d8      	str	r0, [r3, #12]
 80001da:	60d9      	str	r1, [r3, #12]
...
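
To show the constant-offset idea in C terms, here is a minimal sketch. It assumes a stand-in variable `fake_odr` in place of the real GPIO register (so it also runs on a PC), and the helper name `write8` is made up for the illustration. With constant indexes the compiler is free to encode each one as an immediate offset in the load, and the pointer only has to be advanced once per group.

```c
#include <stdint.h>

/* Hypothetical stand-in for the GPIO output register; on the target this
   would be the real, volatile-qualified port register. */
static volatile uint32_t fake_odr;

/* Write eight words using constant indexes, then advance the pointer once.
   Returns the pointer to the next unwritten word. */
static const uint32_t *write8(const uint32_t *p)
{
    fake_odr = p[0];
    fake_odr = p[1];
    fake_odr = p[2];
    fake_odr = p[3];
    fake_odr = p[4];
    fake_odr = p[5];
    fake_odr = p[6];
    fake_odr = p[7];
    return p + 8;
}
```

Whether the compiler actually emits ldrd pairs as in the listing above depends on the optimization level and the element type.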
dannyf
Posts: 447
Joined: Sat Jul 04, 2020 7:46 pm

Re: AT32F403A anyone?

Post by dannyf »

for(k=0;k<maxdata;k++) {
GPIOB->ODR = datarray[k];
}
that has loop overhead in it. given how simple of a task "GPIOx->ODR = xyz;" is, that overhead can be significant.

one way of doing it is to unroll the loop. something like this:

for(k=0, i=maxIteration;i;i--) {
  GPIOB->ODR = datarray[k++];
  ... //repeat many times
  GPIOB->ODR = datarray[k++];
}
and time that whole thing.
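
A slightly fuller sketch of the same unrolling, assuming an unroll factor of 4 and a stand-in variable `fake_odr` instead of GPIOB->ODR so it can be tried on a PC. The name `send_unrolled` is made up for the example. The extra loop at the end handles buffer lengths that are not a multiple of the unroll factor.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for GPIOB->ODR. */
static volatile uint16_t fake_odr;

/* Send n halfwords, four per loop iteration, so the loop overhead
   (counter update and branch) is paid once per four stores. */
static void send_unrolled(const uint16_t *data, size_t n)
{
    size_t k = 0;

    for (size_t i = n / 4; i; i--) {
        fake_odr = data[k++];
        fake_odr = data[k++];
        fake_odr = data[k++];
        fake_odr = data[k++];
    }

    /* Remainder: 0 to 3 halfwords left over. */
    while (k < n) {
        fake_odr = data[k++];
    }
}
```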

Post by dannyf »

DWT is quite handy. I often use something like this for timing:

#define coreticks()  (DWT->CYCCNT)
#if defined(USE_CORETICK)
#define ticks()  coreticks()
#else
#define ticks()  systicks()
#endif
systicks() is essentially a 32-bit composite counter built on SysTick timer.

I can simply use ticks() for timing, knowing that it is 32-bit and freerunning. based on USE_CORETICK, it can be mapped to either DWT or SysTick.
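
The systicks() implementation itself is not shown here, so the following is only a sketch of how such a composite counter could be put together, with made-up names (`compose_ticks`, `elapsed_ticks`). It assumes a down-counting timer with a known reload value and a software count of how many times it has wrapped. A real version also has to guard against the timer wrapping between reading the overflow count and reading the current value, which is left out here.

```c
#include <stdint.h>

/* Combine a software overflow count with the current value of a
   down-counting timer into one free-running 32-bit tick count.
   One timer period is reload + 1 ticks. */
static uint32_t compose_ticks(uint32_t overflows, uint32_t reload,
                              uint32_t current)
{
    return overflows * (reload + 1u) + (reload - current);
}

/* Elapsed ticks between two readings. Unsigned subtraction gives the
   right answer even if the 32-bit count wrapped once in between. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}
```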
webjorn
Posts: 43
Joined: Sat Jul 09, 2022 8:49 pm

Re: AT32F403A anyone?

Post by webjorn »

Hello dannyf (and others)

I have actually done what you suggested, but I have now hit a different problem, which I guess lies with either the C compiler or objdump.

It seems to me that either the compiler does not generate code for all of these output statements,
or objdump goes wrong when interpreting the ELF.

I am using the tools from the weact installation, as shown below, 11.2.1-1.2

------- C code -----------

for(k=0;k<maxdata;k++) { // fill array with alternating 0 and 1
  if(( k & 1) == 0)
    datarray[k] = 0x0;
  else
    datarray[k] = 0xffff;
}

t1 = micros(); // this does not serve any purpose now, it just generates a bl micros, which is easily found in the dump file
snapshot = *DWT_CYCCNT;
datptr = &datarray[0];
(*(uint16_t *) 0x40010c0c ) = *datptr++;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
Serial.print("CYCCNT : ");
Serial.println(*DWT_CYCCNT - snapshot);
-----------------------------

THIS IS dump of the elf of the project taken with

.arduino15/packages/WeActStudio/tools/xpack-arm-none-eabi-gcc/11.2.1-1.2/bin/arm-none-eabi-objdump /tmp/arduino_build_865338/Blink-403a-out-1.ino.elf -d -S -l > dmp14.txt

/home/webjorn/Arduino/Blink-403a-out-1/Blink-403a-out-1.ino:158
snapshot = *DWT_CYCCNT;
80004b2: 4d2d ldr r5, [pc, #180] ; (8000568 <_Z4loopv+0x34c>)
/home/webjorn/Arduino/Blink-403a-out-1/Blink-403a-out-1.ino:157
t1 = micros();
80004b4: f000 f9de bl 8000874 <micros>
/home/webjorn/Arduino/Blink-403a-out-1/Blink-403a-out-1.ino:169
(*(uint16_t *) 0x40010c0c ) = *datptr++;
80004b8: 4b2e ldr r3, [pc, #184] ; (8000574 <_Z4loopv+0x358>)
/home/webjorn/Arduino/Blink-403a-out-1/Blink-403a-out-1.ino:158
snapshot = *DWT_CYCCNT;
80004ba: 686f ldr r7, [r5, #4]
/home/webjorn/Arduino/Blink-403a-out-1/Blink-403a-out-1.ino:169
(*(uint16_t *) 0x40010c0c ) = *datptr++;
80004bc: f8c6 3fa0 str.w r3, [r6, #4000] ; 0xfa0
80004c0: 4b1c ldr r3, [pc, #112] ; (8000534 <_Z4loopv+0x318>)
80004c2: 8a72 ldrh r2, [r6, #18]
80004c4: 819a strh r2, [r3, #12]
/home/webjorn/Arduino/Blink-403a-out-1/Blink-403a-out-1.ino:170
Serial.print("CYCCNT : ");
80004c6: 492c ldr r1, [pc, #176] ; (8000578 <_Z4loopv+0x35c>)
80004c8: 4620 mov r0, r4
80004ca: f000 fda0 bl 800100e <_ZN5Print5printEPKc>
/home/webjorn/Arduino/Blink-403a-out-1/Blink-403a-out-1.ino:171
Serial.println(*DWT_CYCCNT - snapshot);
80004ce: 6869 ldr r1, [r5, #4]
80004d0: 220a movs r2, #10
80004d2: 1bc9 subs r1, r1, r7
80004d4: 4620 mov r0, r4
80004d6: f000 fdfe bl 80010d6 <_ZN5Print7printlnEmi>
/home/webjorn/Arduino/Blink-403a-out-1/Blink-403a-out-1.ino:147
digitalWrite(PC13, HIGH); // turn the LED on (HIGH is the voltage level)
80004da: e7d1 b.n 8000480 <_Z4loopv+0x264>

--------- end of dump

If I grep for datptr in the dmp file, I find only the declaration
and three other references: one at line 170, which shows only 5 of the output statements,
and two at line 169, once with a load and once with a store.

I would have expected the line 169 pattern ten times over. Line 169 is the last
of my stores.

webjorn@webjorn-Lenovo-G70-70:~$ grep datptr dmp14.txt -B 1 -A 1
uint16_t datarray[maxdata];
uint16_t *datptr;
// the loop function runs over and over again forever
--
/home/webjorn/Arduino/Blink-403a-out-1/Blink-403a-out-1.ino:170
(*(uint16_t *) 0x40010c0c ) = *datptr++;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
Serial.print("CYCCNT : ");
--
/home/webjorn/Arduino/Blink-403a-out-1/Blink-403a-out-1.ino:169
(*(uint16_t *) 0x40010c0c ) = *datptr++;
80004b8: 4b2e ldr r3, [pc, #184] ; (8000574 <_Z4loopv+0x358>)
--
/home/webjorn/Arduino/Blink-403a-out-1/Blink-403a-out-1.ino:169
(*(uint16_t *) 0x40010c0c ) = *datptr++;
80004bc: f8c6 3fa0 str.w r3, [r6, #4000] ; 0xfa0
webjorn@webjorn-Lenovo-G70-70:~$

What do you think?

Gullik

Post by dannyf »

did a quick scan and not sure if i got what you were trying to do. but to get accurate timing, don't blend your serial output in there.

datptr = &datarray[0];
snapshot = *DWT_CYCCNT;
(*(uint16_t *) 0x40010c0c ) = *datptr++;
...
(*(uint16_t *) 0x40010c0c ) = *datptr++;
snapshot = *DWT_CYCCNT - snapshot;
//do something else

otherwise your benchmarking is meaningless.

also, try to use the macros in the header file. after a while those magic numbers become hard to understand, even for you.
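
On the header macros: CMSIS-style device headers declare the peripheral registers volatile, which obliges the compiler to emit every single store. A plain `(uint16_t *)` cast gives no such guarantee, so an optimizer may keep only the last of several back-to-back stores to the same address, which would be consistent with the single strh in the dump above. A minimal sketch of the volatile form, using a hypothetical stand-in variable `fake_odr` instead of the real port address so it also runs on a PC (the names `PORT_ODR` and `send4` are made up for the example):

```c
#include <stdint.h>

/* Hypothetical stand-in for the port output register. */
static uint16_t fake_odr;

/* Access through a volatile-qualified pointer, as the vendor header
   macros do. Every assignment to PORT_ODR must be emitted as a store. */
#define PORT_ODR (*(volatile uint16_t *)&fake_odr)

/* Four back-to-back stores; returns the pointer to the next halfword. */
static const uint16_t *send4(const uint16_t *p)
{
    PORT_ODR = *p++;
    PORT_ODR = *p++;
    PORT_ODR = *p++;
    PORT_ODR = *p++;
    return p;
}
```

On the target, the same effect comes from writing GPIOB->ODR via the header, or from adding volatile to the cast.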

Post by dannyf »

to illustrate the point i made earlier re unrolling the loop, i time the two pieces of code on a CM0.

		for (tmp=0; tmp<1000; tmp++) IO_FLP(GPIOB, 1<<7);					//flip led, 10031 ticks /1000 iterations
		for (tmp=0; tmp<1000/5; tmp++) {IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);}				//flip led, 6028 ticks/1000 iterations
they do the same thing, flip PB.7 1000x. in the 1st version, it took 10k ticks to do that. in the "slightly" unrolled version, it took 6K ticks to do the same.

you can unroll more to spread the overhead more but the gain will become more marginal: 5.5K ticks if you do 10 discrete flips in the loop.
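
As a rough check on those numbers, here is a small sketch that fits a simple model to them: total ticks = flips x cost per flip + iterations x overhead per iteration. The model and the function names are assumptions made for the illustration, not something measured separately.

```c
/* Overhead per loop iteration, from two runs that do the same number of
   flips but use a different number of iterations. */
static double loop_overhead(double total_a, double iters_a,
                            double total_b, double iters_b)
{
    return (total_a - total_b) / (iters_a - iters_b);
}

/* Cost of one flip once the loop overhead has been subtracted. */
static double per_flip(double total, double flips, double iters,
                       double overhead)
{
    return (total - iters * overhead) / flips;
}
```

Feeding in 10031 ticks for 1000 iterations and 6028 ticks for 200 iterations gives about 5 ticks of overhead per iteration and about 5 ticks per flip. The same model predicts roughly 5.5K ticks for the 10-flip version, in line with the figure above.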

Post by dannyf »

here is an example of what I do when benchmarking something.

		//benchmark
		tmp0 = ticks();							//stamp tick1
		//do something
		//for (tmp=0; tmp<1000; tmp++) IO_FLP(GPIOB, 1<<7);					//flip led, 10031 ticks /1000 iterations
		//for (tmp=0; tmp<1000/5; tmp++) {IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);}				//flip led, 6028 ticks/1000 iterations
		//for (tmp=0; tmp<1000/10; tmp++) {IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);IO_FLP(GPIOB, 1<<7);}				//flip led, 5500 ticks/1000 iterations
		tmp0 = ticks() - tmp0;					//calculate time elapsed

		//display something
		//u1Print("F_CPU=                    ", F_CPU);
		//u1Print("ticks=                    ", tick0);
		u1Print("tmp0 =                    ", tmp0);
		u1Println();  //add a line return
I often just use a debugger to step through the code.

you can also put that code you want to benchmark into a routine - can be cleaner sometimes.

//benchmark a function
clock_t benchmark(void (*func_ptr)(void)) {
	clock_t t0=ticks();		//timestamp the start of the execution
	func_ptr();				//run the function to be benchmarked
	return ticks() - t0;	//return clock elapsed
}

by defining ticks() to match the system you have, the above code is quite portable.
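
A small usage sketch of that routine, assuming a fake tick counter so the result is repeatable on a PC. The type `tick_t`, the variable `fake_counter` and the function `work` are made up for the example; on a real target ticks() would map to DWT->CYCCNT or the SysTick-based counter instead.

```c
#include <stdint.h>

typedef uint32_t tick_t;        /* hypothetical tick type for this sketch */

static tick_t fake_counter;     /* stand-in for a free-running counter */
#define ticks() (fake_counter)

/* Same shape as the routine above: timestamp, run, return the difference. */
static tick_t benchmark(void (*func_ptr)(void))
{
    tick_t t0 = ticks();        /* timestamp the start of the execution */
    func_ptr();                 /* run the function to be benchmarked */
    return ticks() - t0;        /* unsigned difference survives a wrap */
}

/* Function under test: pretends to take 42 ticks. */
static void work(void)
{
    fake_counter += 42;
}
```

Calling benchmark(work) returns 42 here, including when the counter wraps past zero during the call.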

Post by webjorn »

"did a quick scan and not sure if i got what you were trying to do. but to get accurate timing, don't blend your serial output in there."

I figured the Serial.printf is executed AFTER the calculation of (*DWT_CYCCNT - snapshot), which means it does not matter.

What I am trying to do is determine the maximum transmit rate of a buffer to a 16-bit I/O port,
which should be equivalent to reading a port into a buffer. This is not a project (yet), but rather an experiment
to understand what can be done with a 240 MHz stm32-style MCU.

Best regards,

Gullik

Post by dannyf »

webjorn wrote:
I figured the Serial.printf is executed AFTER the calculation of (*DWT_CYCCNT - snapshot) which means it does not matter,

it doesn't seem to be what your code is doing.

webjorn wrote:
is to determine the maximum transmit rate of a buffer to a 16 bit io port,
which should be equivalent to reading a port into a buffer.

there are a variety of ways of doing it; DMA would be a good way. but if you want to, your approach works. reading from or writing to a port should do.

Post by dannyf »

the following line transmits 1K 16-bit data to GPIOA.

		for (tmp=1000/5, bufPtr=datBuffer; tmp--;) {GPIOA->ODR = *bufPtr++; GPIOA->ODR = *bufPtr++; GPIOA->ODR = *bufPtr++; GPIOA->ODR = *bufPtr++; GPIOA->ODR = *bufPtr++; }

on a CM0, it takes about 6.2K ticks for 1000 words, -O1, Keil MDK.

that should be indicative of what your chip should be able to do - obviously if it is loaded with other work (interrupts for example), it may be longer. and optimization / programming techniques could impact that figure too.
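
To turn tick counts like that into a transfer rate, a one-line sketch is enough (the function name `words_per_second` is made up for the example). It assumes the tick counter runs at the core clock, and it says nothing about whether a different core, bus or flash setup would need the same number of ticks per word.

```c
/* Words transferred per second, given the clock frequency in Hz and the
   number of ticks measured for a given number of words. */
static double words_per_second(double f_cpu_hz, double ticks, double words)
{
    return f_cpu_hz * words / ticks;
}
```

If a 240 MHz part needed the same 6.2 ticks per word, that would come to roughly 38.7 million words per second. The earlier estimate of 4 cycles per move would give 60 million per second at 240 MHz.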