Re: ART accelerator
Posted: Fri Jan 10, 2020 11:38 am
by Pito
I have to double-check my results as well, as there seems to be a mess with the printout:
I get the Mflops before the printout of the results, so it could be that the actual Mflops should be yours * 2.
Re: ART accelerator
Posted: Fri Jan 10, 2020 11:41 am
by ag123
and these are the results for the libmaple core, compiled with -O3
ART enabled
Beginning Whetstone benchmark at 84 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:8456.39 millisec
C Converted Single Precision Whetstones:118.25 Mflops
ART disabled
Beginning Whetstone benchmark at 84 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:11455.22 millisec
C Converted Single Precision Whetstones:87.30 Mflops
now libmaple is faster
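The two runs above differ only in the ART setting, so the gain can be read straight off the durations. A minimal sketch of that arithmetic in C; `art_speedup` and `print_84mhz_speedup` are hypothetical helper names, not part of the benchmark:

```c
#include <stdio.h>

/* Hypothetical helper: ratio of the ART-disabled duration to the
 * ART-enabled duration, both in milliseconds. */
double art_speedup(double disabled_ms, double enabled_ms)
{
    return disabled_ms / enabled_ms;
}

/* Print the speedup for the 84 MHz libmaple run quoted above. */
void print_84mhz_speedup(void)
{
    printf("ART speedup at 84 MHz: %.2fx\n",
           art_speedup(11455.22, 8456.39));
}
```

With the 84 MHz figures this comes to roughly 1.35x, the same ratio as 118.25 / 87.30 Mflops.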

Re: ART accelerator
Posted: Fri Jan 10, 2020 12:04 pm
by Pito
Pito wrote: Fri Jan 10, 2020 11:38 am
I have to double-check my results as well, as there seems to be a mess with the printout:
I get the Mflops before the printout of the results, so it could be that the actual Mflops should be yours * 2.
OK, my results are fine; it was just a copy-paste issue (I run it in a loop).
xpack and Roger's core
Code:
Beginning Whetstone benchmark at 168 MHz FPU on ...
0 0 0 1.00 -1.00 -1.00 -1.00 0
120000 140000 120000 -0.00 0.00 -0.00 0.00 120000
140000 120000 120000 -0.00 0.00 0.00 0.00 140000
3450000 1 1 1.00 -1.00 -1.00 -1.00 3450000
2100000 1 2 6.00 6.00 0.00 0.00 2100000
320000 1 2 0.00 0.00 0.00 0.00 320000
8990000 1 2 1.00 1.00 1.00 1.00 8990000
6160000 1 2 3.00 2.00 3.00 0.00 6160000
0 2 3 1.00 -1.00 -1.00 -1.00 0
930000 2 3 1.00 1.00 1.00 1.00 930000
Loops: 10000 Iterations: 1 Duration: 4220 millisec.
C Converted Single Precision Whetstones: 236.97 Mflops
Re: ART accelerator
Posted: Fri Jan 10, 2020 1:04 pm
by ag123
marginal overclock to 96 MHz, libmaple core
STM32F4/variants/blackpill_f401/blackpill_f401.h
STM32F4/cores/maple/libmaple/rccF4.c
Code:
//temporary work around
void rcc_clk_init(void)
{
SystemCoreClock = CYCLES_PER_MICROSECOND * 1000000;
SetupClock84MHz();
return;
}
void SetupClock84MHz()
{
/******************************************************************************/
/* PLL (clocked by HSE) used as System clock source */
/******************************************************************************/
/************************* PLL Parameters *************************************/
// PLL_VCO = (HSE_VALUE or HSI_VALUE / PLL_M) * PLL_N = 25[MHz]/25 * 192 = 192
int PLL_M = 25;
//int PLL_N = 336;
int PLL_N = 192;
// SYSCLK = PLL_VCO / PLL_P = 192 / 2 = 96
//int PLL_P = 4;
int PLL_P = 2;
// USB OTG FS, SDIO and RNG Clock = PLL_VCO / PLLQ = 192 / 4 = 48
//int PLL_Q = 7;
int PLL_Q = 4;
ART enabled
Beginning Whetstone benchmark at 96 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:7413.27 millisec
C Converted Single Precision Whetstones:134.89 Mflops
ART disabled
Beginning Whetstone benchmark at 96 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:10027.59 millisec
C Converted Single Precision Whetstones:99.72 Mflops
interestingly, the F401 runs rather stably at 96 MHz (rated speed is 84 MHz)
Vref int (1.21v):1496
temp sensor:959
mvolt:775
temp:31.00
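The mvolt and temp lines are consistent with scaling the temperature-sensor reading against the internal reference and then applying the usual STM32F4 sensor formula. A sketch assuming the typical datasheet values (1.21 V reference, 0.76 V at 25 °C, 2.5 mV/°C slope) and integer millivolts; the function names are hypothetical, and the code that actually produced these lines is not shown in the thread:

```c
/* Convert a raw temperature-sensor ADC count to millivolts, using the
 * raw count of the 1.21 V internal reference as the scale.
 * Integer division truncates the result. */
int sensor_millivolts(int sensor_raw, int vref_raw)
{
    return sensor_raw * 1210 / vref_raw;
}

/* Assumed typical STM32F4 values: 760 mV at 25 C, 2.5 mV per degree C. */
double sensor_celsius(int millivolts)
{
    return (millivolts - 760) / 2.5 + 25.0;
}
```

For the readings above, 959 and 1496 give 775 mV and 31.0 °C.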
Re: ART accelerator
Posted: Fri Jan 10, 2020 1:23 pm
by ag123
next try: 108 MHz
STM32F4/variants/blackpill_f401/blackpill_f401.h
Code:
#define CYCLES_PER_MICROSECOND 108
STM32F4/cores/maple/libmaple/rccF4.c
Code:
//temporary work around
void rcc_clk_init(void)
{
SystemCoreClock = CYCLES_PER_MICROSECOND * 1000000;
SetupClock84MHz();
return;
}
void SetupClock84MHz()
{
/******************************************************************************/
/* PLL (clocked by HSE) used as System clock source */
/******************************************************************************/
/************************* PLL Parameters *************************************/
// PLL_VCO = (HSE_VALUE or HSI_VALUE / PLL_M) * PLL_N = 25[MHz]/25 * 432 = 432
int PLL_M = 25;
//int PLL_N = 336;
//int PLL_N = 192;
int PLL_N = 432;
// SYSCLK = PLL_VCO / PLL_P = 432 / 4 = 108
//int PLL_P = 4;
//int PLL_P = 2;
int PLL_P = 4;
// USB OTG FS, SDIO and RNG Clock = PLL_VCO / PLLQ = 432 / 9 = 48
//int PLL_Q = 7;
//int PLL_Q = 4;
int PLL_Q = 9;
ART enabled
Beginning Whetstone benchmark at 108 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:6573.10 millisec
C Converted Single Precision Whetstones:152.14 Mflops
ART disabled
Beginning Whetstone benchmark at 108 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:8902.95 millisec
C Converted Single Precision Whetstones:112.32 Mflops
Vref int (1.21v):1496
temp sensor:962
mvolt:778
temp:32.20
runs slightly warmer at 108 MHz
Re: ART accelerator
Posted: Fri Jan 10, 2020 1:36 pm
by Pito
At 96 MHz you should see a 96/84 = 1.143 speedup,
i.e. about 135 Mflops with ART.
Try with 168MHz, I bet it works..

(you should see 236.97 Mflops, xpack 9 compiler, Roger's core, -O3)
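The estimate above assumes the result scales linearly with the core clock. A minimal sketch of that projection; `scaled_mflops` is a hypothetical helper name:

```c
/* Hypothetical helper: Mflops expected at target_mhz if throughput
 * scales linearly with the core clock. */
double scaled_mflops(double base_mflops, double base_mhz, double target_mhz)
{
    return base_mflops * target_mhz / base_mhz;
}
```

Starting from 118.25 Mflops at 84 MHz, `scaled_mflops(118.25, 84, 96)` gives about 135.1, and the same projection for 168 MHz gives about 236.5.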
Re: ART accelerator
Posted: Fri Jan 10, 2020 1:58 pm
by ag123
do i need dry ice? or maybe a big cpu cooler

ok searching for a setup
using my python script to get a clock config
viewtopic.php?f=41&t=78
Code:
FHSE: 25 m: 25 n: 336 p: 2 (RCC_PLLP_DIV2) q: 7 fusb: 48.0 fcpu: 168.0
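The script itself is not shown here, but a search of this kind can be sketched in C. This is not ag123's script: the function and struct names are hypothetical, and the limits used (VCO output 192 to 432 MHz, P in {2, 4, 6, 8}, Q from 2 to 15, exactly 48 MHz for USB) are assumptions based on the F4 reference manuals that should be checked for the specific part. It fixes M to the crystal frequency in MHz so the VCO input is 1 MHz, as in the 25/25 settings used in this thread, and it deliberately does not cap SYSCLK at the rated speed.

```c
#include <stddef.h>

/* Hypothetical container for one PLL setting. */
struct pll_cfg {
    int m, n, p, q;
};

/* Search for N, P and Q that give target_mhz on SYSCLK and exactly 48 MHz
 * on the USB clock. Returns 1 and fills *out on success, 0 otherwise.
 * Assumed limits: VCO output 192..432 MHz, P in {2,4,6,8}, Q in 2..15. */
int find_pll(int hse_mhz, int target_mhz, struct pll_cfg *out)
{
    static const int p_values[] = {2, 4, 6, 8};
    size_t i;

    for (i = 0; i < sizeof p_values / sizeof p_values[0]; i++) {
        int p = p_values[i];
        int n = target_mhz * p;   /* VCO in MHz, because VCO input is 1 MHz */
        int q = n / 48;

        if (n < 192 || n > 432)
            continue;             /* VCO outside the assumed range */
        if (n % 48 != 0 || q < 2 || q > 15)
            continue;             /* no integer Q gives 48 MHz */

        out->m = hse_mhz;         /* M = crystal MHz, so HSE / M = 1 MHz */
        out->n = n;
        out->p = p;
        out->q = q;
        return 1;
    }
    return 0;
}
```

For a 25 MHz crystal and a 168 MHz target this returns n = 336, p = 2, q = 7, matching the line above.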
STM32F4/variants/blackpill_f401/blackpill_f401.h
Code:
#define CYCLES_PER_MICROSECOND 168
STM32F4/cores/maple/libmaple/rccF4.c
Code:
// 168 Mhz !
void SetupClock84MHz()
{
/******************************************************************************/
/* PLL (clocked by HSE) used as System clock source */
/******************************************************************************/
/************************* PLL Parameters *************************************/
// PLL_VCO = (HSE_VALUE or HSI_VALUE / PLL_M) * PLL_N = 25[MHz]/25 * 336 = 336
int PLL_M = 25;
//int PLL_N = 336;
//int PLL_N = 192;
//int PLL_N = 432;
int PLL_N = 336;
// SYSCLK = PLL_VCO / PLL_P = 336 / 2 = 168
//int PLL_P = 4;
//int PLL_P = 2;
//int PLL_P = 4;
int PLL_P = 2;
// USB OTG FS, SDIO and RNG Clock = PLL_VCO / PLLQ = 336 / 7 = 48
//int PLL_Q = 7;
//int PLL_Q = 4;
//int PLL_Q = 9;
int PLL_Q = 7;
it stalls; increase the flash wait states to 7! (this is the max)
Code:
void setup() {
/* enable the ART accelerator */
/* enable prefetch buffer */
FLASH_BASE->ACR |= FLASH_ACR_PRFTEN;
/* Enable flash instruction cache */
FLASH_BASE->ACR |= FLASH_ACR_ICEN;
/* Enable flash data cache */
FLASH_BASE->ACR |= FLASH_ACR_DCEN;
// 7 wait state
FLASH_BASE->ACR |= FLASH_ACR_LATENCY_7WS;
it still stalls and hangs, the LED doesn't blink

Re: ART accelerator
Posted: Fri Jan 10, 2020 2:28 pm
by Pito
Re: ART accelerator
Posted: Fri Jan 10, 2020 2:28 pm
by ag123
well i can't remember where i got the code from, but maybe it is derived from this
https://www.netlib.org/benchmark/
https://www.netlib.org/benchmark/whetstone.c
ok next try
Code:
FHSE: 25 m: 25 n: 288 p: 2 (RCC_PLLP_DIV2) q: 6 fusb: 48.0 fcpu: 144.0
STM32F4/variants/blackpill_f401/blackpill_f401.h
Code:
#define CYCLES_PER_MICROSECOND 144
STM32F4/cores/maple/libmaple/rccF4.c
Code:
void SetupClock84MHz()
{
// 144 Mhz !
/************************* PLL Parameters *************************************/
// PLL_VCO = (HSE_VALUE or HSI_VALUE / PLL_M) * PLL_N = 25[MHz]/25 * 288 = 288
int PLL_M = 25;
//int PLL_N = 336; //84Mhz
//int PLL_N = 192; //96Mhz
//int PLL_N = 432; //108Mhz
int PLL_N = 288; //144Mhz
// SYSCLK = PLL_VCO / PLL_P = 288 / 2 = 144
//int PLL_P = 4;
//int PLL_P = 2;
//int PLL_P = 4;
int PLL_P = 2;
// USB OTG FS, SDIO and RNG Clock = PLL_VCO / PLLQ = 288 / 6 = 48
//int PLL_Q = 7;
//int PLL_Q = 4;
//int PLL_Q = 9;
int PLL_Q = 6;
Code:
void setup() {
// 7 wait state
FLASH_BASE->ACR |= FLASH_ACR_LATENCY_7WS;
ART enabled
Beginning Whetstone benchmark at 144 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:5399.54 millisec
C Converted Single Precision Whetstones:185.20 Mflops
ART disabled
Beginning Whetstone benchmark at 144 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:9991.17 millisec
C Converted Single Precision Whetstones:100.09 Mflops
Vref int (1.21v):1495
temp sensor:958
mvolt:775
temp:31.00
not stable at 144 MHz: it didn't show up in the temperatures, but it hung more than once while trying to run the Whetstone benchmark
this result came only after a few tries; the low Mflops may be due to the 7 wait states
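For comparison, the wait-state tables in the STM32F4 reference manuals step by one wait state per 30 MHz at 2.7 to 3.6 V, and the F401 table stops at 2 WS for 84 MHz. A sketch that extends that pattern beyond the rated clock, which is an extrapolation and not a documented setting; `flash_wait_states` is a hypothetical name:

```c
/* Wait states for a given HCLK in MHz, assuming one extra wait state per
 * started 30 MHz block (2.7-3.6 V supply). Values above the rated clock
 * are extrapolated, not taken from a datasheet. */
int flash_wait_states(int hclk_mhz)
{
    if (hclk_mhz <= 0)
        return 0;
    return (hclk_mhz - 1) / 30;
}
```

By this rule 84 MHz needs 2 WS and 144 MHz would need 4, so 7 WS is more than the rule asks for, which would be consistent with the low ART-disabled figure. The reference manuals also ask for the new latency to be programmed before the clock is raised, so where the write happens relative to the clock setup may matter as well.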
Re: ART accelerator
Posted: Fri Jan 10, 2020 2:41 pm
by Pito
The ART somehow eliminates the flash access wait states.
Even if the code theoretically ran with zero WS, the floating-point unit needs 1 to 14 (1, 3 or 14) CPU clocks to produce a single-precision FP result.
Adding the overhead etc., I would guess the MFLOPS result at 168 MHz cannot be better than, say, 15, like an IBM C6x86 at 150 MHz in the best case.
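The bound in this post can be sketched as the core clock divided by the cycles needed per result. This ignores loads, stores, branches and any overlap between instructions, so it is only a rough ceiling; `peak_mflops` is a hypothetical name:

```c
/* Upper bound on Mflops if every result takes cycles_per_result core
 * clocks and nothing else costs time (no loads, stores or branches). */
double peak_mflops(double clock_mhz, int cycles_per_result)
{
    return clock_mhz / cycles_per_result;
}
```

At 168 MHz this gives 168, 56 and 12 for 1, 3 and 14 cycles respectively; the 14-cycle case is in line with the guess of about 15.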