did a google search and stumbled onto these
https://community.arm.com/support-forum ... programmer
https://community.infineon.com/t5/Knowl ... a-p/251309.
The relocatable vector table in the M0+ is interesting (vs the M0), in the sense that the vector table can be copied into sram and relocated there, and then the hooks (IRQ callbacks) can be changed on demand, on the fly. Unfortunately, the vendors' small 'candy' mcus tend to have only 4k or 8k of sram, which is pretty 'squeezy' to start with, so the copy leaves even less room for the stack and heap.
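To make the relocation idea concrete, here is a minimal host-side sketch in C, not real startup code. The names (`flash_vectors`, `ram_vectors`, `relocate_vectors`, `hook_irq`, `simulate_irq`) are made up for illustration, the 48-entry size assumes the full 16 system exceptions + 32 IRQs, and the 256-byte alignment is a conservative assumption that should be checked against the part's reference manual. On real hardware the last step would be writing the table address to the vector table offset register (`SCB->VTOR` in CMSIS), which is only shown as a comment here, and only applies where the vendor implements that register.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 16 system exceptions + 32 IRQs (the most an M0+ can have). */
#define NUM_VECTORS 48u
#define FIRST_IRQ   16u

typedef void (*handler_t)(void);

/* Demo counters so the effect of a hook can be observed on a host PC. */
int calls_default, calls_custom;
void default_handler(void) { calls_default++; }
void custom_handler(void)  { calls_custom++;  }

/* Stand-in for the vector table the linker normally places in flash. */
const handler_t flash_vectors[NUM_VECTORS] = {
    [FIRST_IRQ] = default_handler,   /* IRQ0 */
};

/* The copy in sram; the alignment value is an assumption, see above. */
_Alignas(256) handler_t ram_vectors[NUM_VECTORS];

/* Copy the flash table into sram. On real hardware one would then point
 * the core at the copy, e.g. SCB->VTOR = (uint32_t)ram_vectors; (CMSIS),
 * with interrupts disabled around the switch. */
void relocate_vectors(void)
{
    memcpy(ram_vectors, flash_vectors, sizeof ram_vectors);
}

/* Swap one IRQ callback at run time and return the previous one. */
handler_t hook_irq(unsigned irq, handler_t fn)
{
    assert(FIRST_IRQ + irq < NUM_VECTORS);
    handler_t old = ram_vectors[FIRST_IRQ + irq];
    ram_vectors[FIRST_IRQ + irq] = fn;
    return old;
}

/* Host-side stand-in for the core fetching a vector and jumping to it. */
void simulate_irq(unsigned irq)
{
    handler_t fn = ram_vectors[FIRST_IRQ + irq];
    if (fn) fn();
}
```

On a 32-bit core those 48 entries take 48 x 4 = 192 bytes, which is a noticeable bite out of a 4k sram part.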
another stumble from a google search
https://documentation-service.arm.com/s ... 6f2176f8cc
more stumbles
arm v6m cheatsheet e.g. Cortex M0, M0+
https://gist.github.com/s-aguado/81bc3b ... tsheet-pdf
arm v7m cheat sheet (Cortex M4)
https://www.ic.unicamp.br/~ranido/mc404 ... -sheet.pdf
(incorrect, this seemed to be for v7 Cortex A)
https://courses.cs.washington.edu/cours ... /armv7.pdf)
comparing the v6m (e.g. Cortex M0, M0+) vs v7m (e.g. Cortex M3, as in stm32f103, f1xx etc) cheatsheets,
a key difference seems to be the 3rd 'flexible register' (the last operand, which v7m can optionally shift)
e.g. in M0 (v6m) add is like:
ADD{S} {Rd,} Rn, {#}op2 Add
while in v7m it is like:
add{s}<c><q> {<Rd>,} <Rn>, <Rm> {,<shift>}
and
add{s}<c><q> {<Rd>,} <Rn>, #<const>
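As a small illustration of what the optional `{,<shift>}` buys, here is a sketch in C with the kind of assembly a compiler might emit in the comments. The function name `word_offset` is made up, and the assembly is only indicative; the actual output depends on the compiler and optimisation flags.

```c
#include <stdint.h>

/* base + i * 4, the usual "address of the i-th 32-bit word" computation. */
uint32_t word_offset(uint32_t base, uint32_t i)
{
    /* v7m (Thumb-2) can fold the shift into the add, roughly:
     *     add  r0, r0, r1, lsl #2
     * v6m has no shifted-register form of add, so it takes two
     * instructions, roughly:
     *     lsls r1, r1, #2
     *     adds r0, r0, r1
     */
    return base + (i << 2);
}
```

So the same C line costs one instruction on an M3/M4 and two on an M0/M0+, which adds up in array-indexing loops.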
it is interesting that the Raspberry Pi Pico's RP2040 is after all an M0+, but dual core
https://www.raspberrypi.com/documentati ... p2040.html
'ART accelerator' is the magic that ST adds to its Cortex M4 parts (e.g. F4xx (f401/f411, etc), G4xx), which otherwise would be only a little better than an M3 going purely by M4 vs M3 core.
'ART accelerator' is then like a cache sitting in front of the flash (think of the L2 cache that those 'big' cpus have). It makes a big difference for code run from flash, as flash often needs several wait states. Going from a few wait states to zero wait states would deliver something like 1 instruction per clock cycle (I think the real figure is less optimistic than this); it would be quite incredible if an mcu could deliver a constant throughput of like 80 MIPS
even stm32f401 claims
105 DMIPS/1.25 DMIPS/MHz (Dhrystone 2.1),
https://www.st.com/en/microcontrollers- ... 401ce.html
while for stm32f103 it is
https://www.st.com/en/microcontrollers- ... 103c8.html
72 MHz maximum frequency, 1.25 DMIPS/MHz (Dhrystone 2.1) performance at 0 wait state memory access
that is provided one can even get 0 wait states, which is rather unlikely if the loads are all from flash.
it is possibly feasible if the code is run from sram instead (which is precious).
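A back-of-envelope sketch of those numbers in C. `peak_dmips` just multiplies the quoted MHz and DMIPS/MHz figures; `worst_case_dmips` is a deliberately crude model of my own (every fetch stalls for the full wait states, no prefetch or cache), so it is a lower bound for intuition and not a measurement. The wait state count passed in is an assumption; the real value depends on the clock and supply voltage, see the reference manual.

```c
/* Datasheet-style peak: DMIPS = MHz x DMIPS/MHz (0 wait state figure). */
double peak_dmips(double mhz, double dmips_per_mhz)
{
    return mhz * dmips_per_mhz;
}

/* Crude worst case: every fetch from flash costs (1 + wait_states) cycles,
 * with no prefetch buffer or cache hiding any of it. */
double worst_case_dmips(double mhz, double dmips_per_mhz, unsigned wait_states)
{
    return peak_dmips(mhz, dmips_per_mhz) / (1.0 + (double)wait_states);
}
```

With the figures quoted above, `peak_dmips(84, 1.25)` gives the 105 DMIPS claimed for the f401 and `peak_dmips(72, 1.25)` gives 90 for the f103. Assuming 2 wait states and nothing hiding them, `worst_case_dmips(72, 1.25, 2)` drops to 30, which is why the prefetch / cache matters so much.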
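and I think the F4xx FPU is dual vector (i.e. 2 single precision computations in parallel). As far as I know ARM documents the M4 FPU (FPv4-SP) as a scalar single precision unit, so this is only a guess from the numbers below;
it could after all be true if gcc didn't 'cheat' on the whetstone benchmarks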

viewtopic.php?p=939#p939
stm32f401 overclocked
Beginning Whetstone benchmark at 144 MHz ... -O3 overclocked
ART enabled (prefetch + instr + data cache) on
C Converted Single Precision Whetstones:185.20 Mflops
ART disabled (prefetch + instr + data cache) off
C Converted Single Precision Whetstones:100.09 Mflops
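To put the two results above on a per-clock basis, a tiny sketch in C (the function names are made up; the inputs are just the figures from the log):

```c
/* Mflops delivered per MHz of core clock. */
double mflops_per_mhz(double mflops, double mhz)
{
    return mflops / mhz;
}

/* How many times faster the run with ART on is vs the run with ART off. */
double art_speedup(double mflops_art_on, double mflops_art_off)
{
    return mflops_art_on / mflops_art_off;
}
```

At 144 MHz, 185.20 Mflops works out to about 1.29 Mflops/MHz with ART on and 100.09 Mflops to about 0.70 Mflops/MHz with ART off, i.e. roughly a 1.85x speedup from the prefetch and caches alone.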

http://www.roylongbottom.org.uk/whetstone.htm
but of course comparing against the p4 is a cheat, as the p4 figures are double precision (many more transistors / hardware wires are needed for that)
Pentium 4 1700 Mhz 603 MWps 152 mflops 798 621 2001