Re: Discharging the battery for the FSE clock

Posted: Mon Jul 15, 2024 12:51 pm
by ag123
@GonzoG I think F1 is Cortex M3, while F0 is M0 and G0 is M0+. M0 does not have as complete an Arm instruction set as M3 (e.g. F103), so binaries will be bigger (some instructions have to be re-created from several other instructions), and the F0 and G0 often have less flash than even the F1xx.

did a google and stumbled into this
https://community.arm.com/support-forum ... programmer
https://community.infineon.com/t5/Knowl ... a-p/251309.
the relocatable vector table in m0+ is interesting (vs m0), in the sense that the vector table can be copied into sram and relocated, and then the hooks (IRQ callbacks) can be changed on demand, on the fly. Unfortunately, the vendors' small 'candy' mcus tend to have only 4k or 8k of sram, which is pretty 'squeezy' to start with, and this leaves even less room for the stack and heap.
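
To put a rough number on that squeeze, here is a minimal sketch in Python. It assumes the architectural maximum for an M0+ vector table (16 system entries plus up to 32 external IRQs, 4 bytes each); a real part may implement fewer IRQs, and the function names are made up purely for illustration.

```python
WORD_BYTES = 4          # each vector is one 32-bit address
SYSTEM_VECTORS = 16     # initial SP + 15 system exception slots
MAX_IRQS_M0PLUS = 32    # ASSUMPTION: architectural maximum for Cortex-M0+

def vector_table_bytes(num_irqs=MAX_IRQS_M0PLUS):
    """Size of a vector table copy in sram, in bytes."""
    return (SYSTEM_VECTORS + num_irqs) * WORD_BYTES

def sram_share(sram_bytes, num_irqs=MAX_IRQS_M0PLUS):
    """Fraction of sram that the relocated table would occupy."""
    return vector_table_bytes(num_irqs) / sram_bytes

for sram_kib in (4, 8):
    share = sram_share(sram_kib * 1024)
    print(f"{sram_kib}k sram: table = {vector_table_bytes()} bytes, {share:.1%} of sram")
```

The table copy by itself is a modest chunk; the catch is that it comes on top of a stack and heap budget that is already tight.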

another stumble from a google search
https://documentation-service.arm.com/s ... 6f2176f8cc

more stumbles
arm v6m cheatsheet e.g. Cortex M0, M0+
https://gist.github.com/s-aguado/81bc3b ... tsheet-pdf

arm v7m cheat sheet (Cortex M4)
https://www.ic.unicamp.br/~ranido/mc404 ... -sheet.pdf
(incorrect: this one seems to be for v7 Cortex A,
https://courses.cs.washington.edu/cours ... /armv7.pdf)

comparing the v6m (e.g. Cortex M0, M0+) and v7m (e.g. Cortex M3, as in stm32f103, f1xx etc) cheatsheets,
a key difference seems to be the 3rd 'flexible' register operand (in v7m it can take an optional shift)
e.g. in M0 (v6m) add is like:

Code:

ADD{S} {Rd,} Rn, {#}op2 Add
while in M3 (v7m) it is

Code:

add{s}<c><q> {<Rd>,} <Rn>, <Rm> {,<shift>}
and
add{s}<c><q> {<Rd>,} <Rn>, #<const> 
so Cortex M3 (e.g. stm32f103) has more instruction forms, and the binary could be leaner because one of those more complete instructions can do what takes two or more on M0.
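
As a sketch of what that means in practice, take `a + (b << 2)` (e.g. indexing into an array of 32-bit words). Going by the syntax above, v7m can do it in one ADD with a shifted Rm, while v6m needs a separate shift first. The Python below only models the register arithmetic with 32-bit wraparound; the function names and the assembly shown in the comments are illustrative and not taken from the cheatsheets.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide

def lsl(value, shift):
    """Logical shift left, truncated to 32 bits."""
    return (value << shift) & MASK32

def add_v7m(rn, rm, shift):
    # one instruction:  add rd, rn, rm, lsl #shift
    return (rn + lsl(rm, shift)) & MASK32, 1

def add_v6m(rn, rm, shift):
    # two instructions: lsls rt, rm, #shift ; adds rd, rn, rt
    rt = lsl(rm, shift)          # scratch register holds the shifted value
    return (rn + rt) & MASK32, 2

base, index = 0x20000000, 5
for name, fn in (("v7m", add_v7m), ("v6m", add_v6m)):
    result, count = fn(base, index, 2)
    print(f"{name}: result=0x{result:08X} in {count} instruction(s)")
```

Both paths give the same address; the v6m one just spends an extra instruction and a scratch register to get there.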

it is interesting that the Raspberry Pi Pico's RP2040 is after all an M0+, but dual core
https://www.raspberrypi.com/documentati ... p2040.html

'ART accelerator' is the magic that ST adds to its Cortex M4 parts (e.g. F4xx (f401/f411, etc), G4xx), which otherwise would be only a little better than M3 going purely by M4 vs M3 core.
'ART accelerator' is then like the L2 cache that those 'big' cpus have. It makes a big difference for code run from flash, as flash often has a latency of several wait states. Going from a few wait states to zero wait states would deliver something like 1 instruction per clock cycle (I think reality is less optimistic than this); it would be quite incredible if an mcu could deliver a constant throughput of something like 80 Mips
even stm32f401 claims
105 DMIPS/1.25 DMIPS/MHz (Dhrystone 2.1),
https://www.st.com/en/microcontrollers- ... 401ce.html
while for stm32f103 it is
https://www.st.com/en/microcontrollers- ... 103c8.html
72 MHz maximum frequency, 1.25 DMIPS/MHz (Dhrystone 2.1) performance at 0 wait state memory access
that is provided one can even get 0 wait states, which is rather unlikely if everything is loaded from flash.
it is possibly feasible if code is run from (precious) sram instead.
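
The datasheet figures above are just clock times DMIPS/MHz, so the arithmetic can be sketched as below. The wait-state part is a deliberately crude assumption (every access pays the full wait states, with no prefetch or cache helping), so treat it as a pessimistic illustration and not a measurement; the function names and the 2 wait states used in the example are likewise only for illustration.

```python
def peak_dmips(mhz, dmips_per_mhz=1.25):
    """Datasheet-style peak: clock (MHz) times DMIPS/MHz."""
    return mhz * dmips_per_mhz

def implied_mhz(dmips, dmips_per_mhz=1.25):
    """Clock implied by a quoted DMIPS figure."""
    return dmips / dmips_per_mhz

def crude_dmips_with_wait_states(mhz, wait_states, dmips_per_mhz=1.25):
    # ASSUMPTION: every access costs (1 + wait_states) cycles, no prefetch/cache
    return peak_dmips(mhz, dmips_per_mhz) / (1 + wait_states)

print(peak_dmips(72))                        # stm32f103 at 72 MHz, 0 wait states
print(implied_mhz(105))                      # clock implied by the stm32f401 claim
print(crude_dmips_with_wait_states(72, 2))   # same f103 with 2 wait states on everything
```

Real numbers should land somewhere between the two extremes, since prefetch buffers and caches (like ART) hide part of the flash latency.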

and I think the F4xx FPU is dual vector (i.e. 2 single precision computations in parallel)
it could after all be true if gcc didn't 'cheat' on the whetstone benchmarks :lol:
viewtopic.php?p=939#p939
stm32f401 overclocked

Code:

Beginning Whetstone benchmark at 144 MHz ... -O3 overclocked
ART enabled (prefetch + instr + data cache) on
C Converted Single Precision Whetstones:185.20 Mflops
ART disabled (prefetch + instr + data cache) off
C Converted Single Precision Whetstones:100.09 Mflops
these numbers are better than an old Intel Pentium 4 :lol:
http://www.roylongbottom.org.uk/whetstone.htm
but of course that comparison is a cheat, in that the p4 figures are double precision (many more transistors / hardware wires are needed)

Code:

Pentium 4        1700 Mhz   603 MWps    152 mflops     798     621     2001 
fast FPU matters if one does calcs on the fly, e.g. DSP etc.
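
Working the ratios out of the figures quoted above, as a rough sketch (the numbers are copied from this post; the dictionary layout and names are just for illustration, and the single vs double precision caveat still applies):

```python
runs = {
    "stm32f401 ART on":  {"mhz": 144,  "mflops": 185.20},
    "stm32f401 ART off": {"mhz": 144,  "mflops": 100.09},
    "Pentium 4":         {"mhz": 1700, "mflops": 152.0},
}

def mflops_per_mhz(run):
    """Throughput normalised by clock, to compare very different clocks."""
    return run["mflops"] / run["mhz"]

# how much the ART accelerator buys on the same chip at the same clock
art_speedup = runs["stm32f401 ART on"]["mflops"] / runs["stm32f401 ART off"]["mflops"]
print(f"ART on vs off: {art_speedup:.2f}x")

for name, run in runs.items():
    print(f"{name}: {mflops_per_mhz(run):.3f} Mflops/MHz")
```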