dhrystone and whetstone benchmarks
Re: dhrystone and whetstone benchmarks
@ag123: let us stay with dhrystone and whetstone in this thread..
Pukao Hats Cleaning Services Ltd.
Re: dhrystone and whetstone benchmarks
yay, this sounds like good news
Bingo600 wrote: Sun Jan 26, 2020 2:41 pm Ahh ...
I just did it for the BP F411CE - Attached in previous post
STM32F411CE Blackpill @96MHz , ART Enabled.
@fpiSTM
Will you please pick that ZIP file up? I'm not advanced enough in git to do merge requests etc ... Sorry
Zip file w. F411CE implementation (Closely based on the existing PILL_F401XX variant directory, as only PLL & ART changed)
download/file.php?id=86
Ahh ... Someone did a F411CE vers (Pull on git)
https://github.com/stm32duino/Arduino_C ... 8120aac7fa
But the PLL seems a bit odd.
Hmm-pll.png
/Bingo

But in that PR:
Code: Select all
RCC_OscInitStruct.PLL.PLLM = 12;
RCC_OscInitStruct.PLL.PLLN = 96;
RCC_OscInitStruct.PLL.PLLP = RCC_PLLP_DIV2;
RCC_OscInitStruct.PLL.PLLQ = 4;
viewtopic.php?f=41&t=78
Code: Select all
FHSE: 25 m: 25 n: 192 p: 2 (RCC_PLLP_DIV2) q: 4 fusb: 48.0 fcpu: 96.0
FHSE: 25 m: 25 n: 384 p: 4 (RCC_PLLP_DIV4) q: 8 fusb: 48.0 fcpu: 96.0
FHSE: 25 m: 25 n: 432 p: 4 (RCC_PLLP_DIV4) q: 9 fusb: 48.0 fcpu: 108.0
FHSE: 25 m: 50 n: 384 p: 2 (RCC_PLLP_DIV2) q: 4 fusb: 48.0 fcpu: 96.0
Re: dhrystone and whetstone benchmarks
@ag123
Use my ST-Core BP-F411 fix from here
viewtopic.php?p=936#p936
Zip file
download/file.php?id=86
It's just a copy of boards.txt, plus an added F411XX directory in variants.
/Bingo
Re: dhrystone and whetstone benchmarks
+1, I commented on the PR
https://github.com/stm32duino/Arduino_C ... -348391500
Re: dhrystone and whetstone benchmarks
@Pito
The
Code: Select all
-mcpu=cortex-m4
alone doesn't cut it w. the ST Core & Pito's program.
I just did some "Stuff" in the ST Core with
Code: Select all
-mcpu=cortex-m4 -march=armv7e-m+fp
Code: Select all
-O2
##########################################
Single Precision C Whetstone Benchmark
Calibrate
0.10 Seconds 1 Passes (x 100)
0.50 Seconds 5 Passes (x 100)
2.48 Seconds 25 Passes (x 100)
Use 100 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content Result MFLOPS MOPS Seconds
N1 floating point -1.12475013732910156 46.489 0.041
N2 floating point -1.12274742126464844 33.350 0.403
N3 if then else 1.00000000000000000 95.833 0.108
N4 fixed point 12.00000000000000000 159.898 0.197
N5 sin,cos etc. 0.49909299612045288 2.260 3.681
N6 floating point 0.99999982118606567 35.960 1.500
N7 assignments 3.00000000000000000 41.158 0.449
N8 exp,sqrt etc. 0.75110614299774170 1.051 3.539
MWIPS 100.824 9.918
-O3
##########################################
Single Precision C Whetstone Benchmark
Calibrate
0.10 Seconds 1 Passes (x 100)
0.48 Seconds 5 Passes (x 100)
2.40 Seconds 25 Passes (x 100)
Use 104 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content Result MFLOPS MOPS Seconds
N1 floating point -1.12475013732910156 48.000 0.042
N2 floating point -1.12274742126464844 36.495 0.383
N3 if then else 1.00000000000000000 0.000 0.000
N4 fixed point 12.00000000000000000 180.000 0.182
N5 sin,cos etc. 0.49909299612045288 2.358 3.670
N6 floating point 0.99999982118606567 33.875 1.656
N7 assignments 3.00000000000000000 48.048 0.400
N8 exp,sqrt etc. 0.75110614299774170 1.063 3.639
MWIPS 104.296 9.972
Code: Select all
-O2 (M4 - No -march=armv7e-m+fp)
##########################################
Single Precision C Whetstone Benchmark
Calibrate
0.15 Seconds 1 Passes (x 100)
0.77 Seconds 5 Passes (x 100)
3.85 Seconds 25 Passes (x 100)
Use 64 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content Result MFLOPS MOPS Seconds
N1 floating point -1.12475013732910156 46.545 0.026
N2 floating point -1.12274742126464844 33.340 0.258
N3 if then else 1.00000000000000000 96.000 0.069
N4 fixed point 12.00000000000000000 180.001 0.112
N5 sin,cos etc. 0.49909299612045288 0.915 5.817
N6 floating point 0.99999982118606567 33.878 1.019
N7 assignments 3.00000000000000000 41.210 0.287
N8 exp,sqrt etc. 0.75110614299774170 1.051 2.265
MWIPS 64.952 9.853
The only diff in the ST Core is that N5 is ~3x faster w. -march=armv7e-m+fp, and the MWIPS is around 100.
If N5 weighs a lot in the test, that could explain the high MWIPS with -march=armv7e-m+fp.
I have NOT checked map files or disassembly (I'm not an ARM asm guru), so the 3x boost in N5 is purely from the printf report.
/Bingo
Re: dhrystone and whetstone benchmarks
Some reading on D and W
http://users.ece.utexas.edu/~ljohn/teac ... eicker.pdf
I would recommend staying with -Os, as the higher levels may produce questionable results.
Re: dhrystone and whetstone benchmarks
Pito wrote: Sun Jan 26, 2020 5:36 pm Some reading on D and W
http://users.ece.utexas.edu/~ljohn/teac ... eicker.pdf
So TRIG weighs 21.6%, then a 3x would be +2x 21.6% = 43.2% extra.
64.952 × 1.432 = 93.011 ... And then a bit on the rest ... Would be close to ~100
/Bingo
Re: dhrystone and whetstone benchmarks
Pito wrote: Sun Jan 26, 2020 5:36 pm I would recommend to stay with -Os, as the higher levels may produce questionable results.
Naah ... It's NOT -O2's fault, I added -march=armv7e-m+fp.
Actually it makes me wonder a bit why -mcpu=cortex-m4 didn't invoke -march=armv7e-m, and why -mfpu=fpv4-sp-d16 didn't add +fp,
automatically making it -march=armv7e-m+fp.
-march=armv7e-m should specify a Cortex-M4 or Cortex-M7, according to ARM.
An M7 has a 6-stage pipeline vs 3 stages on an M4 ... Hmmm ...

https://en.wikipedia.org/wiki/ARM_Cortex-M
From arm-gcc doc
https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
-march=
Code: Select all
‘armv7e-m’
‘+fp’
The single-precision VFPv4 floating-point instructions.
‘+fpv5’
The single-precision FPv5 floating-point instructions.
‘+fp.dp’
The single- and double-precision FPv5 floating-point instructions.
‘+nofp’
Disable the floating-point extensions.
Code: Select all
-mfpu=name
This specifies what floating-point hardware (or hardware emulation) is available on the target. Permissible names are: ‘auto’, ‘vfpv2’, ‘vfpv3’, ‘vfpv3-fp16’, ‘vfpv3-d16’, ‘vfpv3-d16-fp16’, ‘vfpv3xd’, ‘vfpv3xd-fp16’, ‘neon-vfpv3’, ‘neon-fp16’, ‘vfpv4’, ‘vfpv4-d16’, ‘fpv4-sp-d16’, ‘neon-vfpv4’, ‘fpv5-d16’, ‘fpv5-sp-d16’, ‘fp-armv8’, ‘neon-fp-armv8’ and ‘crypto-neon-fp-armv8’. Note that ‘neon’ is an alias for ‘neon-vfpv3’ and ‘vfp’ is an alias for ‘vfpv2’.
The setting ‘auto’ is the default and is special. It causes the compiler to select the floating-point and Advanced SIMD instructions based on the settings of -mcpu and -march.
If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=neon), note that floating-point operations are not generated by GCC’s auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.
You can also set the fpu name at function level by using the target("fpu=") function attributes (see ARM Function Attributes) or pragmas (see Function Specific Option Pragmas).
Last edited by Bingo600 on Sun Jan 26, 2020 6:10 pm, edited 2 times in total.
Re: dhrystone and whetstone benchmarks
Hmm, most realistic results are with -Os, imho.
With -O2, -O3 you are getting results which are beyond my understanding.
They indicate you have to be pretty careful with compilers because of optimization. Therefore Fortran is the only language "approved" for Whetstone, they say.
Why optimize the code away? You are measuring the performance of the HW, aren't you?
Btw these are some of the coefficients in the source:
Code: Select all
n1 = 12 * x100;
n2 = 14 * x100;
n3 = 345 * x100;
n4 = 210 * x100;
n5 = 32 * x100;
n6 = 899 * x100;
n7 = 616 * x100;
n8 = 93 * x100;
n1mult = 10;
Re: dhrystone and whetstone benchmarks
Pito wrote: Sun Jan 26, 2020 6:05 pm Hmm, most realistic results are with -Os, imho.
With -O2, -O3 you are getting results which are beyond my understanding.
https://gcc.gnu.org/onlinedocs/gcc-4.7. ... tions.html
Code: Select all
-O
-O1
Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function.
With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
-O turns on the following optimization flags:
-fauto-inc-dec
-fcompare-elim
-fcprop-registers
-fdce
-fdefer-pop
-fdelayed-branch
-fdse
-fguess-branch-probability
-fif-conversion2
-fif-conversion
-fipa-pure-const
-fipa-profile
-fipa-reference
-fmerge-constants
-fsplit-wide-types
-ftree-bit-ccp
-ftree-builtin-call-dce
-ftree-ccp
-ftree-ch
-ftree-copyrename
-ftree-dce
-ftree-dominator-opts
-ftree-dse
-ftree-forwprop
-ftree-fre
-ftree-phiprop
-ftree-sra
-ftree-pta
-ftree-ter
-funit-at-a-time
-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging.
-O2
Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code.
-O2 turns on all optimization flags specified by -O. It also turns on the following optimization flags:
-fthread-jumps
-falign-functions -falign-jumps
-falign-loops -falign-labels
-fcaller-saves
-fcrossjumping
-fcse-follow-jumps -fcse-skip-blocks
-fdelete-null-pointer-checks
-fdevirtualize
-fexpensive-optimizations
-fgcse -fgcse-lm
-finline-small-functions
-findirect-inlining
-fipa-sra
-foptimize-sibling-calls
-fpartial-inlining
-fpeephole2
-fregmove
-freorder-blocks -freorder-functions
-frerun-cse-after-loop
-fsched-interblock -fsched-spec
-fschedule-insns -fschedule-insns2
-fstrict-aliasing -fstrict-overflow
-ftree-switch-conversion -ftree-tail-merge
-ftree-pre
-ftree-vrp
Please note the warning under -fgcse about invoking -O2 on programs that use computed gotos.
-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone options.
-O0
Reduce compilation time and make debugging produce the expected results. This is the default.
-Os
Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
-Os disables the following optimization flags:
-falign-functions -falign-jumps -falign-loops
-falign-labels -freorder-blocks -freorder-blocks-and-partition
-fprefetch-loop-arrays -ftree-vect-loop-version
-Ofast would be "Non compliant"
/Bingo