webjorn wrote: Fri Sep 16, 2022 10:10 am
...
I cannot see any good way to unroll the loop since I need to change the array index, but I have to study ARM assembler.
The ldrh.w and the strh could be replicated, but R7 needs to change... +2 I guess...
ldrh.w r1,[r7],#2
strh r1,[r2,#12]
adds r7,#2
This should be 4 instructions per move, so sampling should be 60 MHz...
We'll see....
Gullik
You could look at automatic increment of registers, but if the loop is unrolled, then the offset in each load instruction can just be increased without needing to add to a register. Compiler may well do things like that anyway, perhaps at the expense of making the timing a bit erratic, but then unless you disable interrupts, you could expect it to get a lot worse at times.
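As a rough illustration of the fixed-offset idea in C (only a sketch: `port_write_unrolled4` and `fake_port` are names made up here, and `fake_port` merely stands in for the real 16-bit output register), an unroll by four could look like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the 16-bit output data register (assumption for illustration;
   on real hardware this would be a memory-mapped register). */
volatile uint16_t fake_port;

/* Write `count` halfwords from buf to the port, four per loop iteration.
   `count` is assumed to be a multiple of 4. */
void port_write_unrolled4(volatile uint16_t *port, const uint16_t *buf, size_t count)
{
    for (size_t i = 0; i < count; i += 4) {
        const uint16_t *p = &buf[i];
        /* Constant offsets 0..3 from one base pointer: a compiler can encode
           these as immediate offsets in the loads, so the index only needs
           updating once per four stores instead of once per store. */
        *port = p[0];
        *port = p[1];
        *port = p[2];
        *port = p[3];
    }
}
```

Whether the compiler really emits immediate-offset loads depends on the optimization level, so it is worth checking the listing.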
Here is some compiler generated code that more-or-less gets it down to 1.5 instructions per move (data array declared as uint32_t). I'm not sure if that really makes much difference though.
If I grep for generated code in the dmp file, I only find the declaration
and three other references: one at line 170, which skips 5 output statements,
and two at line 169, once with a load and once with a store.
I would have expected 10 times the behaviour of line 169. This is the last line
of my stores.
they do the same thing, flip PB.7 1000x. in the 1st version, it took 10k ticks to do that. in the "slightly" unrolled version, it took 6K ticks to do the same.
you can unroll more to spread the overhead more but the gain will become more marginal: 5.5K ticks if you do 10 discrete flips in the loop.
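For reference, the three variants could be sketched roughly like this. The names `flip_1x`, `flip_2x`, `flip_10x` and the `fake_odr` variable (standing in for the port output register) are made up for illustration; on a real chip `FLIP()` would toggle PB.7 through the GPIO registers instead.

```c
#include <stdint.h>

#define PIN_MASK (1u << 7)          /* bit 7, standing in for PB.7 */

/* Stand-in for the GPIO output register (assumption for illustration). */
volatile uint32_t fake_odr;

#define FLIP() (fake_odr ^= PIN_MASK)

/* 1000 flips, one per loop iteration: loop overhead paid 1000 times. */
void flip_1x(void)
{
    for (int i = 0; i < 1000; i++) {
        FLIP();
    }
}

/* 1000 flips, two per iteration: loop overhead paid 500 times. */
void flip_2x(void)
{
    for (int i = 0; i < 1000 / 2; i++) {
        FLIP(); FLIP();
    }
}

/* 1000 flips, ten per iteration: loop overhead paid 100 times. */
void flip_10x(void)
{
    for (int i = 0; i < 1000 / 10; i++) {
        FLIP(); FLIP(); FLIP(); FLIP(); FLIP();
        FLIP(); FLIP(); FLIP(); FLIP(); FLIP();
    }
}
```

Each function has the `void (*)(void)` shape, so it can be timed one at a time with whatever tick source the target provides.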
//benchmark a function
clock_t benchmark(void (*func_ptr)(void)) {
    clock_t t0 = ticks();   //timestamp the start of the execution
    func_ptr();             //run the function to be benchmarked
    return ticks() - t0;    //return clock ticks elapsed
}
by defining ticks() to match the system you have, the above code is quite portable.
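One possible way to define ticks(), as a sketch only: the `USE_DWT_CYCCNT` switch, `busy_work` and `sink` are hypothetical names. The first branch reads the DWT cycle counter found on Cortex-M3/M4/M7 cores (a CM0 has no DWT and would need something like SysTick instead), and the fallback uses the standard C clock() so the same code also builds on a PC.

```c
#include <stdint.h>
#include <time.h>

#if defined(USE_DWT_CYCCNT)   /* hypothetical build switch for Cortex-M3/M4/M7 */
/* DWT cycle counter register. Assumes the counter has already been enabled
   (TRCENA in DEMCR, CYCCNTENA in DWT_CTRL). If clock_t is wider than
   32 bits, a counter wrap between the two readings is not handled here. */
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)
clock_t ticks(void) { return (clock_t)DWT_CYCCNT; }
#else
/* Hosted fallback: processor time in units of 1/CLOCKS_PER_SEC. */
clock_t ticks(void) { return clock(); }
#endif

/* Same benchmark helper as above, repeated so the sketch stands alone. */
clock_t benchmark(void (*func_ptr)(void))
{
    clock_t t0 = ticks();   /* timestamp the start of the execution */
    func_ptr();             /* run the function to be benchmarked */
    return ticks() - t0;    /* ticks elapsed */
}

/* Hypothetical workload: the volatile sink keeps the loop from being
   optimized away. */
volatile uint32_t sink;
void busy_work(void)
{
    for (uint32_t i = 0; i < 1000u; i++) {
        sink += i;
    }
}
```

Usage is then just `clock_t elapsed = benchmark(busy_work);`, with any printing done after that call so it does not end up inside the measurement.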
"did a quick scan and not sure if i got what you were trying to do. but to get accurate timing, don't blend your serial output in there."
I figured the Serial.printf is executed AFTER the calculation of (*DWT_CYCCNT - snapshot), which means it does not matter.
What I am trying to do is determine the maximum transmit rate of a buffer to a 16-bit I/O port,
which should be equivalent to reading a port into a buffer. This is not a project (yet), but rather an experiment
to understand what can be done with a 240 MHz STM32-style MCU.
on a CM0, it takes about 6.2K ticks for 1000 words, -O1, Keil MDK.
that should be indicative of what your chip should be able to do - obviously if it is loaded with other work (interrupts for example), it may be longer. and optimization / programming techniques could impact that figure too.
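To turn a tick count into a transfer rate, the arithmetic is just clock frequency divided by ticks per word. A small sketch (`words_per_second` is a made-up helper, and it assumes the ticks-per-word figure carries over unchanged to the faster chip, which a different core, flash wait states or bus timing may well upset):

```c
/* Estimated words transferred per second, given the core clock in Hz and a
   measured tick count for a known number of words. Assumes one tick equals
   one core clock cycle. */
double words_per_second(double clock_hz, double ticks, double words)
{
    return clock_hz * words / ticks;
}
```

With the numbers from this thread, `words_per_second(240e6, 4000, 1000)` gives 60e6 for the 4-ticks-per-move estimate, and `words_per_second(240e6, 6200, 1000)` comes out at roughly 38.7e6 for the 6.2K figure.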