Re: dhrystone and whetstone benchmarks
Posted: Sun Jan 26, 2020 3:51 pm
@ag123: let us stay with Dhrystone and Whetstone in this thread.
Bingo600 wrote: Sun Jan 26, 2020 2:41 pm Ahh ...
Yay, this sounds like good news.
I just did it for the BP F411CE (attached in the previous post).
STM32F411CE Blackpill @ 96 MHz, ART enabled.
@fpiSTM
Will you please pick up that ZIP file? I'm not advanced enough in git to do merge requests etc. ... Sorry.
ZIP file with the F411CE implementation (closely based on the existing PILL_F401XX variant directory, as only the PLL and ART settings changed):
download/file.php?id=86
Ahh ... someone already did an F411CE version (pull request on git):
https://github.com/stm32duino/Arduino_C ... 8120aac7fa
But the PLL seems a bit odd.
(Attached image: Hmm-pll.png)
/Bingo
Code:
RCC_OscInitStruct.PLL.PLLM = 12;
RCC_OscInitStruct.PLL.PLLN = 96;
RCC_OscInitStruct.PLL.PLLP = RCC_PLLP_DIV2;
RCC_OscInitStruct.PLL.PLLQ = 4;
Code:
FHSE: 25 m: 25 n: 192 p: 2 (RCC_PLLP_DIV2) q: 4 fusb: 48.0 fcpu: 96.0
FHSE: 25 m: 25 n: 384 p: 4 (RCC_PLLP_DIV4) q: 8 fusb: 48.0 fcpu: 96.0
FHSE: 25 m: 25 n: 432 p: 4 (RCC_PLLP_DIV4) q: 9 fusb: 48.0 fcpu: 108.0
FHSE: 25 m: 50 n: 384 p: 2 (RCC_PLLP_DIV2) q: 4 fusb: 48.0 fcpu: 96.0
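The fusb and fcpu columns above are consistent with the usual STM32F4 main-PLL relation: VCO = HSE / M * N, CPU clock = VCO / P, USB clock = VCO / Q. Below is a minimal Python sketch of that arithmetic. The helper name `pll_clocks` is made up for illustration, and the 25 MHz HSE is taken from the FHSE column. Under those assumptions the rows above come out at a 48 MHz USB clock, while the M = 12, N = 96, P = 2, Q = 4 settings listed earlier give 100 MHz for the CPU and 50 MHz for USB, which may be why that PLL setup looked odd.

```python
def pll_clocks(hse_mhz, m, n, p, q):
    """Return (vco_in, vco_out, f_cpu, f_usb) in MHz.

    Assumes the usual STM32F4 main-PLL relation:
    VCO = HSE / M * N, SYSCLK = VCO / P, USB clock = VCO / Q.
    """
    vco_in = hse_mhz / m
    vco_out = hse_mhz * n / m  # multiply first to limit rounding error
    return vco_in, vco_out, vco_out / p, vco_out / q


HSE_MHZ = 25  # assumption: 25 MHz crystal, as in the "FHSE: 25" rows

# (M, N, P, Q) candidates: the odd-looking settings, then the four table rows
candidates = [
    (12, 96, 2, 4),
    (25, 192, 2, 4),
    (25, 384, 4, 8),
    (25, 432, 4, 9),
    (50, 384, 2, 4),
]

for m, n, p, q in candidates:
    vco_in, vco_out, f_cpu, f_usb = pll_clocks(HSE_MHZ, m, n, p, q)
    print(f"m: {m:2d} n: {n:3d} p: {p} q: {q}  "
          f"vco_in: {vco_in:.3f} vco: {vco_out:.1f} "
          f"fusb: {f_usb:.1f} fcpu: {f_cpu:.1f}")
```

The sketch only computes frequencies. It does not check the limits the reference manual places on the VCO input, the VCO output, or the maximum CPU clock.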
Code:
-mcpu=cortex-m4 -march=armv7e-m+fp
-mcpu=cortex-m4 -march=armv7e-m+fp
Code:
-O2
##########################################
Single Precision C Whetstone Benchmark
Calibrate
0.10 Seconds 1 Passes (x 100)
0.50 Seconds 5 Passes (x 100)
2.48 Seconds 25 Passes (x 100)
Use 100 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content                      Result    MFLOPS      MOPS   Seconds
N1 floating point   -1.12475013732910156    46.489               0.041
N2 floating point   -1.12274742126464844    33.350               0.403
N3 if then else      1.00000000000000000              95.833     0.108
N4 fixed point      12.00000000000000000             159.898     0.197
N5 sin,cos etc.      0.49909299612045288               2.260     3.681
N6 floating point    0.99999982118606567    35.960               1.500
N7 assignments       3.00000000000000000              41.158     0.449
N8 exp,sqrt etc.     0.75110614299774170               1.051     3.539
MWIPS                                      100.824               9.918
-O3
##########################################
Single Precision C Whetstone Benchmark
Calibrate
0.10 Seconds 1 Passes (x 100)
0.48 Seconds 5 Passes (x 100)
2.40 Seconds 25 Passes (x 100)
Use 104 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content                      Result    MFLOPS      MOPS   Seconds
N1 floating point   -1.12475013732910156    48.000               0.042
N2 floating point   -1.12274742126464844    36.495               0.383
N3 if then else      1.00000000000000000               0.000     0.000
N4 fixed point      12.00000000000000000             180.000     0.182
N5 sin,cos etc.      0.49909299612045288               2.358     3.670
N6 floating point    0.99999982118606567    33.875               1.656
N7 assignments       3.00000000000000000              48.048     0.400
N8 exp,sqrt etc.     0.75110614299774170               1.063     3.639
MWIPS                                      104.296               9.972
Code:
-O2 (M4 - No -march=armv7e-m+fp)
##########################################
Single Precision C Whetstone Benchmark
Calibrate
0.15 Seconds 1 Passes (x 100)
0.77 Seconds 5 Passes (x 100)
3.85 Seconds 25 Passes (x 100)
Use 64 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content                      Result    MFLOPS      MOPS   Seconds
N1 floating point   -1.12475013732910156    46.545               0.026
N2 floating point   -1.12274742126464844    33.340               0.258
N3 if then else      1.00000000000000000              96.000     0.069
N4 fixed point      12.00000000000000000             180.001     0.112
N5 sin,cos etc.      0.49909299612045288               0.915     5.817
N6 floating point    0.99999982118606567    33.878               1.019
N7 assignments       3.00000000000000000              41.210     0.287
N8 exp,sqrt etc.     0.75110614299774170               1.051     2.265
MWIPS                                       64.952               9.853
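The MWIPS figures in the three runs above appear to follow from the pass count and the total run time. Below is a small Python sketch under that assumption. The formula MWIPS = passes * 100 / (10 * seconds) is inferred from the printed numbers, not taken from the benchmark source, and the helper name `mwips` is made up for illustration. It also prints how much of each run is spent in the N5 (sin, cos) loop, which is where the builds differ most.

```python
X100 = 100  # the "(x 100)" multiplier shown in the calibration lines


def mwips(passes, seconds, x100=X100):
    """Estimate MWIPS from the pass count and the total run time.

    Assumption: MWIPS = passes * x100 / (10 * seconds), inferred from the
    printed results rather than from the benchmark source.
    """
    return passes * x100 / (10.0 * seconds)


# (label, passes, total seconds, N5 seconds, reported MWIPS) copied from the runs
runs = [
    ("-O2 with +fp", 100, 9.918, 3.681, 100.824),
    ("-O3 with +fp", 104, 9.972, 3.670, 104.296),
    ("-O2 without +fp", 64, 9.853, 5.817, 64.952),
]

for label, passes, total_s, n5_s, reported in runs:
    estimate = mwips(passes, total_s)
    n5_share = 100.0 * n5_s / total_s  # percent of the run spent in N5
    print(f"{label:16s} estimated MWIPS {estimate:8.3f} "
          f"(reported {reported:8.3f}), N5 share {n5_share:5.1f}%")
```

The estimates differ from the reported values in the third decimal place, presumably because the printed seconds are rounded.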
Pito wrote: Sun Jan 26, 2020 5:36 pm Some reading on D and W
http://users.ece.utexas.edu/~ljohn/teac ... eicker.pdf
So TRIG weighs 21.6%; a 3x there would then mean +2 x 21.6% = 43.2% extra.
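The 21.6% and 43.2% figures can be checked with a few lines of Python. This is only a sketch of the arithmetic, and it assumes that "3x" means the trig loop takes three times as long while everything else stays the same. The helper name `extra_time_percent` is made up for illustration.

```python
def extra_time_percent(weight_percent, slowdown):
    """Extra total run time, in percent of the baseline run time.

    weight_percent: share of the baseline run time spent in one loop.
    slowdown: factor by which that loop alone becomes slower.
    """
    return weight_percent * (slowdown - 1.0)


trig_weight = 21.6  # percent, the figure quoted in the post
extra = extra_time_percent(trig_weight, 3.0)

# Overall speed relative to the baseline once the extra time is added
relative_speed = 100.0 / (100.0 + extra)

print(f"extra run time: {extra:.1f}% of baseline")
print(f"overall speed:  {relative_speed:.3f} x baseline")
```

Under this reading, the two extra "units" of trig time are what add up to the 43.2%.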
Pito wrote: Sun Jan 26, 2020 5:36 pm I would recommend to stay with -Os, as the higher levels may produce questionable results.
Naah ... it's NOT -O2's fault: I added -march=armv7e-m+fp.
Code:
‘armv7e-m’
‘+fp’
The single-precision VFPv4 floating-point instructions.
‘+fpv5’
The single-precision FPv5 floating-point instructions.
‘+fp.dp’
The single- and double-precision FPv5 floating-point instructions.
‘+nofp’
Disable the floating-point extensions.
Code:
-mfpu=name
This specifies what floating-point hardware (or hardware emulation) is available on the target. Permissible names are: ‘auto’, ‘vfpv2’, ‘vfpv3’, ‘vfpv3-fp16’, ‘vfpv3-d16’, ‘vfpv3-d16-fp16’, ‘vfpv3xd’, ‘vfpv3xd-fp16’, ‘neon-vfpv3’, ‘neon-fp16’, ‘vfpv4’, ‘vfpv4-d16’, ‘fpv4-sp-d16’, ‘neon-vfpv4’, ‘fpv5-d16’, ‘fpv5-sp-d16’, ‘fp-armv8’, ‘neon-fp-armv8’ and ‘crypto-neon-fp-armv8’. Note that ‘neon’ is an alias for ‘neon-vfpv3’ and ‘vfp’ is an alias for ‘vfpv2’.
The setting ‘auto’ is the default and is special. It causes the compiler to select the floating-point and Advanced SIMD instructions based on the settings of -mcpu and -march.
If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=neon), note that floating-point operations are not generated by GCC’s auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.
You can also set the fpu name at function level by using the target("fpu=") function attributes (see ARM Function Attributes) or pragmas (see Function Specific Option Pragmas).
Code:
n1 = 12 * x100;
n2 = 14 * x100;
n3 = 345 * x100;
n4 = 210 * x100;
n5 = 32 * x100;
n6 = 899 * x100;
n7 = 616 * x100;
n8 = 93 * x100;
n1mult = 10;
Pito wrote: Sun Jan 26, 2020 6:05 pm Hmm, most realistic results are with -Os, imho.
With -O2, -O3, you are getting results which are off my understanding.
https://gcc.gnu.org/onlinedocs/gcc-4.7. ... tions.html
Code:
-O
-O1
Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function.
With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
-O turns on the following optimization flags:
-fauto-inc-dec
-fcompare-elim
-fcprop-registers
-fdce
-fdefer-pop
-fdelayed-branch
-fdse
-fguess-branch-probability
-fif-conversion2
-fif-conversion
-fipa-pure-const
-fipa-profile
-fipa-reference
-fmerge-constants
-fsplit-wide-types
-ftree-bit-ccp
-ftree-builtin-call-dce
-ftree-ccp
-ftree-ch
-ftree-copyrename
-ftree-dce
-ftree-dominator-opts
-ftree-dse
-ftree-forwprop
-ftree-fre
-ftree-phiprop
-ftree-sra
-ftree-pta
-ftree-ter
-funit-at-a-time
-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging.
-O2
Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code.
-O2 turns on all optimization flags specified by -O. It also turns on the following optimization flags:
-fthread-jumps
-falign-functions -falign-jumps
-falign-loops -falign-labels
-fcaller-saves
-fcrossjumping
-fcse-follow-jumps -fcse-skip-blocks
-fdelete-null-pointer-checks
-fdevirtualize
-fexpensive-optimizations
-fgcse -fgcse-lm
-finline-small-functions
-findirect-inlining
-fipa-sra
-foptimize-sibling-calls
-fpartial-inlining
-fpeephole2
-fregmove
-freorder-blocks -freorder-functions
-frerun-cse-after-loop
-fsched-interblock -fsched-spec
-fschedule-insns -fschedule-insns2
-fstrict-aliasing -fstrict-overflow
-ftree-switch-conversion -ftree-tail-merge
-ftree-pre
-ftree-vrp
Please note the warning under -fgcse about invoking -O2 on programs that use computed gotos.
-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone options.
-O0
Reduce compilation time and make debugging produce the expected results. This is the default.
-Os
Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
-Os disables the following optimization flags:
-falign-functions -falign-jumps -falign-loops
-falign-labels -freorder-blocks -freorder-blocks-and-partition
-fprefetch-loop-arrays -ftree-vect-loop-version