dhrystone and whetstone benchmarks

BennehBoy · Post by **BennehBoy** » Sun Jan 26, 2020 2:31 pm

He means it's up to us to create non ST variants...

Bingo600 · Post by **Bingo600** » Sun Jan 26, 2020 2:41 pm

Ahh ...

I just did it for the BP F411CE (attached in the previous post):
STM32F411CE Blackpill @96MHz , ART Enabled.

@fpiSTM
Will you please pick up that ZIP file? I'm not advanced enough in git to do merge requests etc ... Sorry

Zip file w. F411CE implementation (Closely based on the existing PILL_F401XX variant directory, as only PLL & ART changed)
download/file.php?id=86

Ahh ... someone already did an F411CE version (pull request on git):
https://github.com/stm32duino/Arduino_C ... 8120aac7fa

But the PLL seems a bit odd.
I goofed in the first PLL graphic, as I used N=92, not N=96 as in the PR.

USB can't get 48MHz if 100MHz is selected, as Q is an integer divider.
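
The integer-Q constraint can be checked with a few lines of arithmetic. The sketch below is not taken from any core. It assumes the usual F4 main PLL relations (VCO = HSE / M * N, SYSCLK = VCO / P, USB clock = VCO / Q), P in {2, 4, 6, 8}, Q in 2..15 and a VCO output range of 100 to 432 MHz; these limits should be checked against the reference manual for the exact part. The function name `usb_friendly_pll` is made up for illustration.

```python
# Sketch: which (P, Q) pairs give the target SYSCLK and exactly 48 MHz for USB?
# Assumed limits (verify in the reference manual): P in {2, 4, 6, 8},
# Q in 2..15, VCO output between 100 and 432 MHz.

USB_HZ = 48_000_000
VCO_MIN_HZ = 100_000_000
VCO_MAX_HZ = 432_000_000


def usb_friendly_pll(sysclk_hz):
    """Return (P, Q, VCO) tuples where VCO / P == sysclk_hz and VCO / Q == 48 MHz."""
    hits = []
    for p in (2, 4, 6, 8):
        vco_hz = sysclk_hz * p          # VCO needed to reach SYSCLK with this P
        if not VCO_MIN_HZ <= vco_hz <= VCO_MAX_HZ:
            continue
        for q in range(2, 16):
            if vco_hz == USB_HZ * q:    # Q must divide the VCO down to exactly 48 MHz
                hits.append((p, q, vco_hz))
    return hits


for mhz in (84, 96, 100):
    print(mhz, "MHz:", usb_friendly_pll(mhz * 1_000_000))
```

Under these assumptions 84 MHz and 96 MHz each have an exact 48 MHz solution, while 100 MHz has none, which matches the point above.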

[Attachment: Hmm-1.png]

What's your take on a fractional PLL divide?
I mean HSE / 12 is fractional.

/Bingo

ag123 · Post by **ag123** » Sun Jan 26, 2020 2:49 pm

that ART accelerator is nothing but a combination of prefetch, instruction cache and data cache

in the recent commit, steve is rather careful: instead of disabling ART altogether, he disables only prefetch, leaving the instruction cache and data cache in place
https://github.com/stevstrong/Arduino_S ... a12f57e320
apparently, simply disabling prefetch is the best practice to reduce power consumption while still keeping adequate performance from the instruction and data cache. but with all 3 (prefetch + instr cache + data cache) on vs all 3 off, the bogo mflops
viewtopic.php?p=550#p550
can be significantly different:
Beginning Whetstone benchmark at 144 MHz ... -O3 overclocked
ART enabled (prefetch + instr + data cache) on
C Converted Single Precision Whetstones:185.20 Mflops
ART disabled (prefetch + instr + data cache) off
C Converted Single Precision Whetstones:100.09 Mflops
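
To put a number on the gap between the two runs above, a minimal sketch that only does the arithmetic on the two reported figures (the function name is illustrative, nothing here is measured anew):

```python
# Sketch: ratio between the two Whetstone figures reported above (144 MHz, -O3).

ART_ON_MFLOPS = 185.20    # prefetch + instr cache + data cache on
ART_OFF_MFLOPS = 100.09   # all three off


def speedup(with_art, without_art):
    """How many times faster the run with ART is than the run without it."""
    return with_art / without_art


ratio = speedup(ART_ON_MFLOPS, ART_OFF_MFLOPS)
lost_percent = (1 - ART_OFF_MFLOPS / ART_ON_MFLOPS) * 100
print(f"ART on is {ratio:.2f}x faster; turning it off loses {lost_percent:.1f}% of the score")
```

That works out to roughly 1.85x, i.e. close to, but not quite, a doubling.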

note that this isn't steve's code; i deliberately turned the 3 of them on or off to make a comparison. i think the bogo mflops reflect the effect of the cache and prefetch: the performance nearly doubles (about 1.85x) for a (heavily) overclocked processor
but as it stands ART is cache and prefetch. this means that if your code results in a lot of cache misses and prefetch misses, you may be running off flash without any assistance (no cache, no prefetch etc) and hence *slow*.
of course this still won't explain those 'weird' cases, e.g. where it is slower with ART than without. my guess is that one would need to review the generated assembly or binary code to really get to the bottom of it, e.g. could the code work in such a way that it causes a cache miss + prefetch miss on every loop?
the other thing is: could ART have an unintended effect on the microseconds or milliseconds timing code, e.g. with ART it 'runs faster' (each ms is shorter, so more ms are counted) while without ART it 'runs slower' (each ms is longer, so fewer ms are counted)? that would throw the comparison off completely. either way these mflops are for entertainment purposes, don't take them too seriously

one use of these bogo mflops is that they respond rather linearly to your sysclk speed. hence, while trying to set the rcc prescalers, the bogo mflops give a good feel for whether you set the pll prescalers correctly

Pito · Post by **Pito** » Sun Jan 26, 2020 2:59 pm

This is Roger's core Black407ZET @168MHz

I cannot get it to work with -O2 and -O3

-Os

Code: Select all

##########################################
Single Precision C Whetstone Benchmark
Calibrate
0.11 Seconds   1.00 Passes (x 100)
0.53 Seconds   5.00 Passes (x 100)
2.64 Seconds  25.00 Passes (x 100)
Use 94 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content        Result    MFLOPS      MOPS   Seconds

N1 floating point    -1.12     78.81                0.02
N2 floating point    -1.12     57.43                0.22
N3 if then else       1.00               83.87      0.12
N4 fixed point       12.00              193.53      0.15
N5 sin,cos etc.       0.50                2.89      2.71
N6 floating point     1.00     16.51                3.07
N7 assignments        3.00               13.60      1.28
N8 exp,sqrt etc.      0.75                1.48      2.35
MWIPS                          94.70                9.93

-O1

Code: Select all

##########################################
Single Precision C Whetstone Benchmark
Calibrate
0.10 Seconds   1.00 Passes (x 100)
0.50 Seconds   5.00 Passes (x 100)
2.52 Seconds  25.00 Passes (x 100)
Use 99 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content        Result    MFLOPS      MOPS   Seconds

N1 floating point    -1.12     76.65                0.02
N2 floating point    -1.12     56.14                0.24
N3 if then else       1.00              100.46      0.10
N4 fixed point       12.00              251.49      0.12
N5 sin,cos etc.       0.50                2.79      2.95
N6 floating point     1.00     16.77                3.18
N7 assignments        3.00               16.78      1.09
N8 exp,sqrt etc.      0.75                1.62      2.27
MWIPS                          99.20                9.98

fpiSTM · Post by **fpiSTM** » Sun Jan 26, 2020 3:04 pm

Bingo600 wrote: Sun Jan 26, 2020 2:41 pm
@fpiSTM
Will you please pick that "ZIP file up" , I'm not advanced enough in git , to do merge requests etc ... Sorry

/Bingo

Currently, I have no time to do this, sorry. I already have several PRs to review and test. Several concern the F4xx generic variant, so the Black should rely on it, I guess.

Bingo600 · Post by **Bingo600** » Sun Jan 26, 2020 3:11 pm

fpiSTM wrote: Sun Jan 26, 2020 3:04 pm
Currently, I have no time to do this, sorry. I already have several PR to review and test. Several around F4xx generic variant and so the Black should relies on it, I guess.

Oki, I'll just use it locally

@fpiSTM
The BP F411 pull request on git has an odd PLL setting:
https://github.com/stm32duino/Arduino_C ... 2/pull/890

This one uses the same variant dir as the F401 (PLL is 84MHz). My ZIP adds a PILL_F411XX variant dir, enabling F411 @96MHz & ART.
https://github.com/stm32duino/Arduino_C ... aac768f9cb

ag123 · Post by **ag123** » Sun Jan 26, 2020 3:13 pm

a little note here is that the 'bogo' whetstone benchmark code (the one that gives the high mflops)
is literally derived from the netlib whetstone code
https://www.netlib.org/benchmark/
it is just that these days gcc compilers do all sorts of esoteric -O3 black magic, which gives the high mflops
it would otherwise be the same netlib whetstone code

i still like my bogo netlib whetstone benchmark which gives the high bogo mflops
for one thing, the results scale pretty well with whatever mhz you tweak your stm32 to run at, multiplied proportionally
and they respond in kind to ART (prefetch + instr cache + data cache) being on or off
so it is still very useful if you want to play with the pll prescalers, e.g. overclock, or tweak settings such as ART etc

ag123 · Post by **ag123** » Sun Jan 26, 2020 3:15 pm

maybe we should try going 1 'level' higher, to the linpack benchmark. the trouble of course is that there is too little memory to do all that fast math
maybe it is still possible to run linpack on f4* and up skus
like whetstone, it is on netlib
https://www.netlib.org/benchmark/

Pito · Post by **Pito** » Sun Jan 26, 2020 3:30 pm

So with the libmaple Roger core, F407 @168MHz, with FPU on (gcc directives and asm), I got with the default -Os

MFLOPS 79 57 17 and MWIPS 95

and similar with -O1. It crashed (no serial output) with -O2 and -O3. That is identical to Bingo's -Os result with the STM core on the 407.

We may therefore assume that ART and the FPU are enabled on the 407 in both the STM and libmaple cores.

Now, Bingo got MFLOPS 45 33 9 and MWIPS 46 with the 411 @96MHz, -Os and ART/FPU enabled. That would be OK when compared to the 168MHz figures above.

Code: Select all

MFLOPS  79 57 17 and MWIPS 95               168MHz, -Os, 407, FPU/ART, Pito + Bingo600, libmaple/STM
MFLOPS  45 33  9 and MWIPS 46                96MHz, -Os, 411, FPU/ART, Bingo600, STM

@bingo600: was it really 96MHz?
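
One way to sanity-check the 96MHz question is to scale the 168MHz figures down by the clock ratio. This is a minimal sketch that assumes the scores are simply proportional to the core clock, which is an approximation and not something the benchmark guarantees. The labels N1, N2 and N6 are inferred from the three MFLOPS rows of the Whetstone output; `scale_by_clock` is an illustrative name.

```python
# Sketch: scale the 168 MHz results to 96 MHz, assuming the scores are
# proportional to the core clock (an approximation, not a measured law).

F407_AT_168 = {"N1": 79, "N2": 57, "N6": 17, "MWIPS": 95}   # Pito's figures above
F411_AT_96 = {"N1": 45, "N2": 33, "N6": 9, "MWIPS": 46}     # Bingo600's figures above


def scale_by_clock(scores, from_mhz, to_mhz):
    """Scale every score by to_mhz / from_mhz, rounded to one decimal."""
    return {name: round(value * to_mhz / from_mhz, 1) for name, value in scores.items()}


expected = scale_by_clock(F407_AT_168, 168, 96)
for name, predicted in expected.items():
    print(f"{name}: predicted {predicted}, reported {F411_AT_96[name]}")
```

Under that assumption the first two MFLOPS figures line up (45.1 and 32.6 predicted against 45 and 33 reported), while MWIPS would be expected near 54 rather than 46.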

ag123 · Post by **ag123** » Sun Jan 26, 2020 3:43 pm

on top of just the -O3 optimization,
do note that
viewtopic.php?p=722#p722
there are some compiler flags involved
the first set is nothing new, it just turns on the fpu with hard float

Code: Select all

-mfloat-abi=hard -mfpu=fpv4-sp-d16 -fsingle-precision-constant

this flag as i found

Code: Select all

-mcpu=cortex-m4 -march=armv7e-m+fp

uses an arm optimized floating point library; i think it is the CMSIS DSP set of math libraries
the use of those specialized libraries along with -O3, if it connects well, can have an impact on the mflops

as is illustrated here, without that specialised library and

Code: Select all

-mcpu=cortex-m4 -march=armv7e-m+fp

it results in low mflops
viewtopic.php?p=717#p717

once that special math library is successfully used with -O3
viewtopic.php?p=724#p724
mflops can literally go from low to triple that
so that math library may have a non-trivial impact which we have casually 'taken for granted'

e.g. i read the f4 programming specs and vfp (vector floating point) isn't there. but could there be some optimization of the floating point code that makes it possible to run more than 1 op per cycle? or, for that matter, is vfp like that in arm 11 actually there?
http://infocenter.arm.com/help/topic/co ... DEJJH.html
i read the various docs about the stm32f4 fpu; they do not seem to offer vfp like that in arm 11, at least
https://www.st.com/content/ccc/resource ... 046982.pdf
the f4 programmer's manual does not specify the registers which would make vfp possible (those bits are marked reserved),
but if vfp is after all possible like on the arm 11,
then we are looking at things like quad (x4) or octa (x8) vector parallel floating point ops
if that is indeed possible, benchmarks above 100 mflops would not be surprising at all
