dhrystone and whetstone benchmarks
Re: dhrystone and whetstone benchmarks
@ag123: let us stay with dhrystone and whetstone in this thread..
Pukao Hats Cleaning Services Ltd.
Re: dhrystone and whetstone benchmarks
yay, this sounds like good news
Bingo600 wrote: Sun Jan 26, 2020 2:41 pm Ahh ...
I just did it for the BP F411CE - Attached in previous post
STM32F411CE Blackpill @96MHz , ART Enabled.
@fpiSTM
Will you please pick that ZIP file up? I'm not advanced enough in git to do merge requests etc ... Sorry
Zip file w. F411CE implementation (Closely based on the existing PILL_F401XX variant directory, as only PLL & ART changed)
download/file.php?id=86
Ahh ... Someone did a F411CE vers (Pull on git)
https://github.com/stm32duino/Arduino_C ... 8120aac7fa
But the PLL seems a bit odd.
Hmm-pll.png
/Bingo

But in that PR:
Code: Select all
RCC_OscInitStruct.PLL.PLLM = 12;
RCC_OscInitStruct.PLL.PLLN = 96;
RCC_OscInitStruct.PLL.PLLP = RCC_PLLP_DIV2;
RCC_OscInitStruct.PLL.PLLQ = 4;
viewtopic.php?f=41&t=78
Code: Select all
FHSE: 25 m: 25 n: 192 p: 2 (RCC_PLLP_DIV2) q: 4 fusb: 48.0 fcpu: 96.0
FHSE: 25 m: 25 n: 384 p: 4 (RCC_PLLP_DIV4) q: 8 fusb: 48.0 fcpu: 96.0
FHSE: 25 m: 25 n: 432 p: 4 (RCC_PLLP_DIV4) q: 9 fusb: 48.0 fcpu: 108.0
FHSE: 25 m: 50 n: 384 p: 2 (RCC_PLLP_DIV2) q: 4 fusb: 48.0 fcpu: 96.0
Re: dhrystone and whetstone benchmarks
@ag123
Use my ST-Core BP-F411 fix from here
viewtopic.php?p=936#p936
Zip file
download/file.php?id=86
It's just a copy of boards.txt, plus an added F411XX directory in variants.
/Bingo
Re: dhrystone and whetstone benchmarks
+1, I commented on the PR
https://github.com/stm32duino/Arduino_C ... -348391500
Re: dhrystone and whetstone benchmarks
@Pito
The
Code: Select all
-mcpu=cortex-m4
alone doesn't cut it w. the ST Core & Pito's program.
I just did some "Stuff" in the ST Core with
Code: Select all
-mcpu=cortex-m4 -march=armv7e-m+fp
Code: Select all
-O2
##########################################
Single Precision C Whetstone Benchmark
Calibrate
0.10 Seconds 1 Passes (x 100)
0.50 Seconds 5 Passes (x 100)
2.48 Seconds 25 Passes (x 100)
Use 100 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content Result MFLOPS MOPS Seconds
N1 floating point -1.12475013732910156 46.489 0.041
N2 floating point -1.12274742126464844 33.350 0.403
N3 if then else 1.00000000000000000 95.833 0.108
N4 fixed point 12.00000000000000000 159.898 0.197
N5 sin,cos etc. 0.49909299612045288 2.260 3.681
N6 floating point 0.99999982118606567 35.960 1.500
N7 assignments 3.00000000000000000 41.158 0.449
N8 exp,sqrt etc. 0.75110614299774170 1.051 3.539
MWIPS 100.824 9.918
-O3
##########################################
Single Precision C Whetstone Benchmark
Calibrate
0.10 Seconds 1 Passes (x 100)
0.48 Seconds 5 Passes (x 100)
2.40 Seconds 25 Passes (x 100)
Use 104 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content Result MFLOPS MOPS Seconds
N1 floating point -1.12475013732910156 48.000 0.042
N2 floating point -1.12274742126464844 36.495 0.383
N3 if then else 1.00000000000000000 0.000 0.000
N4 fixed point 12.00000000000000000 180.000 0.182
N5 sin,cos etc. 0.49909299612045288 2.358 3.670
N6 floating point 0.99999982118606567 33.875 1.656
N7 assignments 3.00000000000000000 48.048 0.400
N8 exp,sqrt etc. 0.75110614299774170 1.063 3.639
MWIPS 104.296 9.972
Code: Select all
-O2 (M4 - No -march=armv7e-m+fp)
##########################################
Single Precision C Whetstone Benchmark
Calibrate
0.15 Seconds 1 Passes (x 100)
0.77 Seconds 5 Passes (x 100)
3.85 Seconds 25 Passes (x 100)
Use 64 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content Result MFLOPS MOPS Seconds
N1 floating point -1.12475013732910156 46.545 0.026
N2 floating point -1.12274742126464844 33.340 0.258
N3 if then else 1.00000000000000000 96.000 0.069
N4 fixed point 12.00000000000000000 180.001 0.112
N5 sin,cos etc. 0.49909299612045288 0.915 5.817
N6 floating point 0.99999982118606567 33.878 1.019
N7 assignments 3.00000000000000000 41.210 0.287
N8 exp,sqrt etc. 0.75110614299774170 1.051 2.265
MWIPS 64.952 9.853
The only diff in the ST Core is that N5 is ~3x faster w. -march=armv7e-m+fp, and the MWIPS is around 100.
If N5 weighs a lot in the test, that could explain the high MWIPS with -march=armv7e-m+fp.
I have NOT checked map files or disassembly (I'm not an ARM asm guru), so the 3x boost in N5 is purely from the printf report.
/Bingo
Re: dhrystone and whetstone benchmarks
Some reading on D and W
http://users.ece.utexas.edu/~ljohn/teac ... eicker.pdf
I would recommend staying with -Os, as the higher levels may produce questionable results.
Re: dhrystone and whetstone benchmarks
Pito wrote: Sun Jan 26, 2020 5:36 pm Some reading on D and W
http://users.ece.utexas.edu/~ljohn/teac ... eicker.pdf
So TRIG weighs 21.6%, then a 3x would be +2x 21.6% = 43.2% extra.
64.952 × 1.432 = 93.011 ... And then a bit on the rest ... Would be close to ~100
/Bingo
Re: dhrystone and whetstone benchmarks
Pito wrote: Sun Jan 26, 2020 5:36 pm I would recommend to stay with -Os, as the higher levels may produce questionable results.
Naah ... It's NOT -O2's fault, I added -march=armv7e-m+fp.
Actually it makes me wonder a bit why -mcpu=cortex-m4 didn't invoke -march=armv7e-m, and why -mfpu=fpv4-sp-d16 didn't add +fp,
automatically making it -march=armv7e-m+fp.
-march=armv7e-m should specify a Cortex-M4 or Cortex-M7, according to ARM.
An M7 has a 6-stage pipeline vs 3 stages on an M4 ... Hmmm ...

https://en.wikipedia.org/wiki/ARM_Cortex-M
From arm-gcc doc
https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
-march=
Code: Select all
‘armv7e-m’
‘+fp’
The single-precision VFPv4 floating-point instructions.
‘+fpv5’
The single-precision FPv5 floating-point instructions.
‘+fp.dp’
The single- and double-precision FPv5 floating-point instructions.
‘+nofp’
Disable the floating-point extensions.
Code: Select all
-mfpu=name
This specifies what floating-point hardware (or hardware emulation) is available on the target. Permissible names are: ‘auto’, ‘vfpv2’, ‘vfpv3’, ‘vfpv3-fp16’, ‘vfpv3-d16’, ‘vfpv3-d16-fp16’, ‘vfpv3xd’, ‘vfpv3xd-fp16’, ‘neon-vfpv3’, ‘neon-fp16’, ‘vfpv4’, ‘vfpv4-d16’, ‘fpv4-sp-d16’, ‘neon-vfpv4’, ‘fpv5-d16’, ‘fpv5-sp-d16’, ‘fp-armv8’, ‘neon-fp-armv8’ and ‘crypto-neon-fp-armv8’. Note that ‘neon’ is an alias for ‘neon-vfpv3’ and ‘vfp’ is an alias for ‘vfpv2’.
The setting ‘auto’ is the default and is special. It causes the compiler to select the floating-point and Advanced SIMD instructions based on the settings of -mcpu and -march.
If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=neon), note that floating-point operations are not generated by GCC’s auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.
You can also set the fpu name at function level by using the target("fpu=") function attributes (see ARM Function Attributes) or pragmas (see Function Specific Option Pragmas).
Last edited by Bingo600 on Sun Jan 26, 2020 6:10 pm, edited 2 times in total.
Re: dhrystone and whetstone benchmarks
Hmm, most realistic results are with -Os, imho.
With -O2, -O3 you are getting results which are beyond my understanding.
They indicate you have to be pretty careful with compilers because of optimization. Therefore Fortran is the only language "approved" for Whetstone, they say.
Why optimize the code away? You are measuring the performance of the HW, aren't you?
Btw these are some of the coefficients in the source:
Code: Select all
n1 = 12 * x100;
n2 = 14 * x100;
n3 = 345 * x100;
n4 = 210 * x100;
n5 = 32 * x100;
n6 = 899 * x100;
n7 = 616 * x100;
n8 = 93 * x100;
n1mult = 10;
Re: dhrystone and whetstone benchmarks
Pito wrote: Sun Jan 26, 2020 6:05 pm Hmm, most realistic results are with -Os, imho.
With -O2, -O3 you are getting results which are beyond my understanding.
https://gcc.gnu.org/onlinedocs/gcc-4.7. ... tions.html
Code: Select all
-O
-O1
Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function.
With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
-O turns on the following optimization flags:
-fauto-inc-dec
-fcompare-elim
-fcprop-registers
-fdce
-fdefer-pop
-fdelayed-branch
-fdse
-fguess-branch-probability
-fif-conversion2
-fif-conversion
-fipa-pure-const
-fipa-profile
-fipa-reference
-fmerge-constants
-fsplit-wide-types
-ftree-bit-ccp
-ftree-builtin-call-dce
-ftree-ccp
-ftree-ch
-ftree-copyrename
-ftree-dce
-ftree-dominator-opts
-ftree-dse
-ftree-forwprop
-ftree-fre
-ftree-phiprop
-ftree-sra
-ftree-pta
-ftree-ter
-funit-at-a-time
-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging.
-O2
Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code.
-O2 turns on all optimization flags specified by -O. It also turns on the following optimization flags:
-fthread-jumps
-falign-functions -falign-jumps
-falign-loops -falign-labels
-fcaller-saves
-fcrossjumping
-fcse-follow-jumps -fcse-skip-blocks
-fdelete-null-pointer-checks
-fdevirtualize
-fexpensive-optimizations
-fgcse -fgcse-lm
-finline-small-functions
-findirect-inlining
-fipa-sra
-foptimize-sibling-calls
-fpartial-inlining
-fpeephole2
-fregmove
-freorder-blocks -freorder-functions
-frerun-cse-after-loop
-fsched-interblock -fsched-spec
-fschedule-insns -fschedule-insns2
-fstrict-aliasing -fstrict-overflow
-ftree-switch-conversion -ftree-tail-merge
-ftree-pre
-ftree-vrp
Please note the warning under -fgcse about invoking -O2 on programs that use computed gotos.
-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone options.
-O0
Reduce compilation time and make debugging produce the expected results. This is the default.
-Os
Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
-Os disables the following optimization flags:
-falign-functions -falign-jumps -falign-loops
-falign-labels -freorder-blocks -freorder-blocks-and-partition
-fprefetch-loop-arrays -ftree-vect-loop-version
-Ofast would be "Non compliant"
/Bingo