dhrystone and whetstone benchmarks
Re: dhrystone and whetstone benchmarks
He means it's up to us to create non-ST variants...
Re: dhrystone and whetstone benchmarks
Ahh ...
I just did it for the BP F411CE - attached in previous post
STM32F411CE Blackpill @96MHz, ART enabled.
@fpiSTM
Will you please pick that ZIP file up? I'm not advanced enough in git to do merge requests etc ... Sorry
ZIP file with the F411CE implementation (closely based on the existing PILL_F401XX variant directory, as only the PLL & ART settings changed)
download/file.php?id=86
Ahh ... someone already did an F411CE version (pull request on git)
https://github.com/stm32duino/Arduino_C ... 8120aac7fa
But the PLL seems a bit odd.
I goofed in the first PLL graphic, as I used N=92, NOT N=96 as in the PR
USB can't get 48MHz if 100MHz is selected, as Q is an integer divider. What's your take on a fractional PLL divide?
I mean, HSE / 12 is fractional
/Bingo
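The integer-divider point can be checked with a few lines of arithmetic. This is only a sketch under assumptions: the 25 MHz HSE value is not stated in the thread, the function names are made up for illustration, and the usual F4 main-PLL relations (VCO = HSE / M * N, SYSCLK = VCO / P, USB = VCO / Q) are taken as given. VCO frequency limits are not checked.

```python
from fractions import Fraction

HSE_MHZ = 25  # assumption: 25 MHz crystal; not stated in the thread


def pll_outputs(m, n, p, q, hse_mhz=HSE_MHZ):
    """Return (sysclk_mhz, usb_mhz) as exact fractions.

    Uses the usual main-PLL relations: VCO = HSE / M * N,
    SYSCLK = VCO / P, USB clock = VCO / Q.
    """
    vco = Fraction(hse_mhz, m) * n
    return vco / p, vco / q


def usb_capable_dividers(sysclk_mhz, p_values=(2, 4, 6, 8), q_values=range(2, 16)):
    """List (P, Q) pairs for which VCO = SYSCLK * P also divides to exactly 48 MHz.

    VCO frequency limits are deliberately not checked here.
    """
    return [
        (p, q)
        for p in p_values
        for q in q_values
        if Fraction(sysclk_mhz * p, q) == 48
    ]


if __name__ == "__main__":
    # Hypothetical 96 MHz setup: M=25, N=192, P=2, Q=4
    print(pll_outputs(m=25, n=192, p=2, q=4))
    # One possible reading of the values mentioned above (HSE / 12, N=96)
    print(pll_outputs(m=12, n=96, p=2, q=4))
    print("96 MHz :", usb_capable_dividers(96))
    print("100 MHz:", usb_capable_dividers(100))
```

Under these assumptions the 100 MHz search comes back empty, while 96 MHz has integer Q values that land on exactly 48 MHz, which would be one reason to prefer 96 MHz when USB is needed.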
Attachments
Hmm-pll.png (66.79 KiB)
Last edited by Bingo600 on Sun Jan 26, 2020 5:36 pm, edited 7 times in total.
Re: dhrystone and whetstone benchmarks
that ART accelerator is nothing but a combination of prefetch, instruction cache and data cache
in the recent commit, steve is rather careful: instead of disabling ART entirely, he disables only prefetch, leaving the instruction cache and data cache in place
https://github.com/stevstrong/Arduino_S ... a12f57e320
apparently, disabling only prefetch is the recommended practice to reduce power consumption while still keeping adequate performance from the instruction and data caches. but with all 3 (prefetch + instr cache + data cache) on vs all 3 off
the bogo mflops
viewtopic.php?p=550#p550
can be significantly different
Beginning Whetstone benchmark at 144 MHz ... -O3 overclocked
ART enabled (prefetch + instr + data cache) on
C Converted Single Precision Whetstones:185.20 Mflops
ART disabled (prefetch + instr + data cache) off
C Converted Single Precision Whetstones:100.09 Mflops
note that this isn't steve's code, i deliberately turned all 3 of them on or off to make a comparison. i think the bogo mflops reflect the effect of the cache and prefetch: performance nearly doubles (about 1.85x) on a heavily overclocked processor
but as it stands ART is just cache and prefetch. this means that if your code causes a lot of cache misses and prefetch misses, you may be running off flash without any assistance (no cache, no prefetch) and hence *slow*.
of course this still won't explain those 'weird' cases, e.g. where ART on is slower than ART off. my guess is that one would need to review the generated assembly or binary code to really get to the bottom of it. e.g. could the code work in such a way that it causes a cache miss + prefetch miss on every loop? the other thing is: could ART have an unintended effect on the microseconds or milliseconds timing code, e.g. with ART it 'runs faster' (each ms is shorter, more ms are counted) while without ART it 'runs slower' (each ms is longer, fewer ms are counted)? that would throw the whole comparison off. either way these mflops are for entertainment purposes, don't take them too seriously
one use of these bogo mflops is that they respond rather linearly to your sysclk speed; hence, while trying to set the rcc prescalers, they give a good feel for whether you set the pll prescalers correctly
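To put a number on the ART on/off comparison, and to illustrate why a run of cache misses hurts, here is a rough sketch. The speedup uses the two Whetstone figures quoted above. The fetch-cost function is only a toy model: the wait-state count of 4 and the idea that a miss simply costs "1 + wait states" cycles are assumptions for illustration, not measured behaviour of the flash interface.

```python
def art_speedup(on_mflops, off_mflops):
    """Ratio of the ART-on result to the ART-off result."""
    return on_mflops / off_mflops


def avg_fetch_cycles(hit_rate, wait_states):
    """Toy model: a hit costs 1 cycle, a miss costs 1 + wait_states cycles."""
    return hit_rate * 1 + (1 - hit_rate) * (1 + wait_states)


if __name__ == "__main__":
    # Figures quoted in the post above (144 MHz, -O3)
    print(f"ART on vs off: {art_speedup(185.20, 100.09):.2f}x")

    # Hypothetical flash wait states at a high clock; an assumption
    WAIT_STATES = 4
    for hit_rate in (1.0, 0.9, 0.5, 0.0):
        cycles = avg_fetch_cycles(hit_rate, WAIT_STATES)
        print(f"hit rate {hit_rate:.1f}: about {cycles:.1f} cycles per fetch")
```

In this toy model, code that misses on every fetch pays the full flash latency each time, which is the "running off flash without any assistance" case described above.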
Last edited by ag123 on Sun Jan 26, 2020 2:59 pm, edited 1 time in total.
Re: dhrystone and whetstone benchmarks
This is Roger's core Black407ZET @168MHz
I cannot get it to work with -O2 and -O3; the results below are for -Os and then -O1
-Os
Code: Select all
##########################################
Single Precision C Whetstone Benchmark
Calibrate
0.11 Seconds 1.00 Passes (x 100)
0.53 Seconds 5.00 Passes (x 100)
2.64 Seconds 25.00 Passes (x 100)
Use 94 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content        Result   MFLOPS     MOPS  Seconds
N1 floating point    -1.12    78.81              0.02
N2 floating point    -1.12    57.43              0.22
N3 if then else       1.00             83.87     0.12
N4 fixed point       12.00            193.53     0.15
N5 sin,cos etc.       0.50              2.89     2.71
N6 floating point     1.00    16.51              3.07
N7 assignments        3.00             13.60     1.28
N8 exp,sqrt etc.      0.75              1.48     2.35
MWIPS                         94.70              9.93
-O1
Code: Select all
##########################################
Single Precision C Whetstone Benchmark
Calibrate
0.10 Seconds 1.00 Passes (x 100)
0.50 Seconds 5.00 Passes (x 100)
2.52 Seconds 25.00 Passes (x 100)
Use 99 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content        Result   MFLOPS     MOPS  Seconds
N1 floating point    -1.12    76.65              0.02
N2 floating point    -1.12    56.14              0.24
N3 if then else       1.00            100.46     0.10
N4 fixed point       12.00            251.49     0.12
N5 sin,cos etc.       0.50              2.79     2.95
N6 floating point     1.00    16.77              3.18
N7 assignments        3.00             16.78     1.09
N8 exp,sqrt etc.      0.75              1.62     2.27
MWIPS                         99.20              9.98
Last edited by Pito on Sun Jan 26, 2020 3:17 pm, edited 1 time in total.
Pukao Hats Cleaning Services Ltd.
Re: dhrystone and whetstone benchmarks
Bingo600 wrote: Sun Jan 26, 2020 2:41 pm
@fpiSTM
Will you please pick that "ZIP file up" , I'm not advanced enough in git , to do merge requests etc ... Sorry
/Bingo
Currently, I have no time to do this, sorry. I already have several PRs to review and test. Several are around the F4xx generic variant, so the Black should rely on it, I guess.
Re: dhrystone and whetstone benchmarks
fpiSTM wrote: Sun Jan 26, 2020 3:04 pm
Currently, I have no time to do this, sorry. I already have several PRs to review and test. Several are around the F4xx generic variant, so the Black should rely on it, I guess.
Oki, I'll just use it locally

@fpiSTM
The BP F411 pull request on git has an odd PLL setting
https://github.com/stm32duino/Arduino_C ... 2/pull/890
It uses the same variant dir as the F401 (PLL is 84MHz); my ZIP adds a PILL_F411XX variant dir, enabling the F411 @96MHz with ART
https://github.com/stm32duino/Arduino_C ... aac768f9cb
Last edited by Bingo600 on Sun Jan 26, 2020 3:26 pm, edited 3 times in total.
Re: dhrystone and whetstone benchmarks
a little note here: the 'bogo' whetstone benchmark code (the one that gives the high mflops)
is derived directly from the netlib whetstone code
https://www.netlib.org/benchmark/
it's just that these days gcc does all sorts of esoteric -O3 black magic, which gives the high mflops
otherwise it is the same netlib whetstone code

i still like my bogo netlib whetstone benchmark which gives the high bogo mflops
for one thing, the numbers scale pretty well with whatever mhz you tweak your stm32 to run at, roughly proportionally
and they respond in kind to ART (prefetch + instr cache + data cache) being on or off
so it is still very useful if you want to play with the pll prescalers, e.g. overclocking, or tweak settings such as ART

Last edited by ag123 on Sun Jan 26, 2020 3:37 pm, edited 4 times in total.
Re: dhrystone and whetstone benchmarks
maybe we should try going 1 'level' higher to the linpack benchmark; the trouble of course is that there is too little memory to do all that fast math
maybe it is still possible to run linpack on f4* and up skus
like whetstone, it is on netlib
https://www.netlib.org/benchmark/

Re: dhrystone and whetstone benchmarks
So with the libmaple Roger core on the F407 @168MHz, with the FPU on (gcc directives and asm), I've got with the default -Os
MFLOPS 79 57 17 and MWIPS 95
and similar with -O1. It crashed (no serial) with -O2 and -O3. That is identical to bingo's -Os result with the STM core on the 407.
We may therefore assume that ART and the FPU are enabled on the 407 in both the STM core and libmaple.
Now, bingo got MFLOPS 45 33 9 and MWIPS 46 with the 411 @96MHz, -Os and ART/FPU enabled - that would be OK when compared to the 168MHz result above.
@bingo600: was it really 96MHz?
Code: Select all
MFLOPS 79 57 17 and MWIPS 95 168MHz, -Os, 407, FPU/ART, Pito + Bingo600, libmaple/STM
MFLOPS 45 33 9 and MWIPS 46 96MHz, -Os, 411, FPU/ART, Bingo600, STM
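One way to approach the "was it really 96MHz?" question is to scale the 168 MHz figures linearly by clock and compare them with the 96 MHz figures. This is a sketch that assumes the results scale proportionally with sysclk, as suggested earlier in the thread; the function name is made up, and the two boards use different cores and chips, so only rough agreement can be expected.

```python
def scale_to_clock(values, from_mhz, to_mhz):
    """Scale benchmark figures linearly from one clock to another."""
    return [v * to_mhz / from_mhz for v in values]


LABELS = ["N1 MFLOPS", "N2 MFLOPS", "N6 MFLOPS", "MWIPS"]
F407_168MHZ = [79, 57, 17, 95]  # from the summary above
F411_96MHZ = [45, 33, 9, 46]    # from the summary above

expected = scale_to_clock(F407_168MHZ, from_mhz=168, to_mhz=96)

if __name__ == "__main__":
    for label, exp, obs in zip(LABELS, expected, F411_96MHZ):
        print(f"{label:10s} expected {exp:5.1f}  observed {obs:3d}  ratio {obs / exp:.2f}")
```

Under this linear assumption the first two MFLOPS figures line up closely with a 96 MHz clock, while MWIPS comes out somewhat lower than the scaled value.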
Last edited by Pito on Sun Jan 26, 2020 4:32 pm, edited 13 times in total.
Re: dhrystone and whetstone benchmarks
on top of just the -O3 optimization
do note that
viewtopic.php?p=722#p722
there are some compiler flags
this one is nothing new, it turns on fpu hard float
Code: Select all
-mfloat-abi=hard -mfpu=fpv4-sp-d16 -fsingle-precision-constant
this flag, as i found,
Code: Select all
-mcpu=cortex-m4 -march=armv7e-m+fp
uses an arm optimized floating point library, i think it is the CMSIS DSP set of math libraries
the use of those specialized libraries along with -O3, if it links up well, can have an impact on the mflops
as is illustrated here, without that specialised library and
Code: Select all
-mcpu=cortex-m4 -march=armv7e-m+fp
it results in low mflops
viewtopic.php?p=717#p717
once that special math library is successfully used with -O3
viewtopic.php?p=724#p724
mflops can literally go from low to triple that
so that math library may have a non-trivial impact which we have casually 'taken for granted'
e.g. i read the f4 programming specs and vfp (vector floating point) isn't there, but could there be some optimization of the floating point code that makes it possible to run more than 1 op per cycle? or, for that matter, is vfp like that in arm 11 actually there?
http://infocenter.arm.com/help/topic/co ... DEJJH.html
i read the various docs about stm32f4 fpus and they do not seem to offer vfp like that in arm 11, at least
https://www.st.com/content/ccc/resource ... 046982.pdf
the f4 programmer's manual does not specify the registers which would make vfp possible (those bits are marked reserved),
but if vfp is after all possible like on the arm 11
then we are looking at things like quad vector parallel floating point ops or octa (x8) vector parallel floating point ops
if that is indeed possible, benchmarks above 100 mflops would not be surprising at all
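A quick way to frame the "more than 1 op per cycle" question is to divide the reported MFLOPS by the clock in MHz. The sketch below does that for figures quoted elsewhere in this thread; the helper name is made up for illustration. A ratio above 1 would need either more than one floating point op per cycle or, possibly, the compiler removing part of the work at -O3, so it does not by itself show that a vector unit exists.

```python
def flops_per_cycle(mflops, sysclk_mhz):
    """Reported MFLOPS divided by the clock in MHz."""
    return mflops / sysclk_mhz


# (description, reported MFLOPS, sysclk in MHz), all taken from this thread
RESULTS = [
    ("144 MHz, -O3, ART on", 185.20, 144),
    ("144 MHz, -O3, ART off", 100.09, 144),
    ("168 MHz, -Os, N1 loop", 78.81, 168),
]

if __name__ == "__main__":
    for description, mflops, mhz in RESULTS:
        ratio = flops_per_cycle(mflops, mhz)
        print(f"{description:24s} {ratio:.2f} flops per cycle")
```

Read this way, 100 mflops at 144 MHz is below one op per cycle, so only the ART-on -O3 figure actually raises the question.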

Last edited by ag123 on Sun Jan 26, 2020 3:57 pm, edited 3 times in total.