ART accelerator

Pito
Posts: 94
Joined: Tue Dec 24, 2019 1:53 pm

Re: ART accelerator

Post by Pito »

I have to double-check my results as well, as it seems there is a mess with the printing :twisted: - I get the Mflops before the printout of the results, so the actual Mflops could be yours * 2.
Pukao Hats Cleaning Services Ltd.
ag123
Posts: 1668
Joined: Thu Dec 19, 2019 5:30 am
Answers: 25

Re: ART accelerator

Post by ag123 »

and the results look like this for the libmaple core, compiled with -O3
ART enabled
Beginning Whetstone benchmark at 84 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:8456.39 millisec
C Converted Single Precision Whetstones:118.25 Mflops
ART disabled
Beginning Whetstone benchmark at 84 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:11455.22 millisec
C Converted Single Precision Whetstones:87.30 Mflops
now libmaple is faster :lol:
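The ART gain in these two runs can be read straight off the durations. A minimal sketch in C; `speedup` and the two constants are illustrative names, not part of the benchmark sketch:

```c
/* Ratio of two run times for the same workload.
 * A result above 1.0 means the second configuration is faster.
 * Helper name is illustrative only. */
static double speedup(double slow_ms, double fast_ms)
{
	return slow_ms / fast_ms;
}

/* Durations posted above for the libmaple core at 84 MHz, -O3. */
static const double ART_OFF_MS = 11455.22;
static const double ART_ON_MS  = 8456.39;
```

`speedup(ART_OFF_MS, ART_ON_MS)` comes out at roughly 1.35, i.e. about 35% more throughput with ART enabled at the same clock.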

Post by Pito »

Pito wrote: Fri Jan 10, 2020 11:38 am I have to double-check my results as well, as it seems there is a mess with the printing :twisted: - I get the Mflops before the printout of the results, so the actual Mflops could be yours * 2.
Ok, my results are ok, it was just a copy-paste issue (I run it in a loop).

xpack and Roger's core

Code: Select all

Beginning Whetstone benchmark at 168 MHz FPU on ...
0       0       0       1.00    -1.00   -1.00   -1.00   0
120000  140000  120000  -0.00   0.00    -0.00   0.00    120000
140000  120000  120000  -0.00   0.00    0.00    0.00    140000
3450000 1       1       1.00    -1.00   -1.00   -1.00   3450000
2100000 1       2       6.00    6.00    0.00    0.00    2100000
320000  1       2       0.00    0.00    0.00    0.00    320000
8990000 1       2       1.00    1.00    1.00    1.00    8990000
6160000 1       2       3.00    2.00    3.00    0.00    6160000
0       2       3       1.00    -1.00   -1.00   -1.00   0
930000  2       3       1.00    1.00    1.00    1.00    930000
Loops: 10000 Iterations: 1 Duration: 4220 millisec.
C Converted Single Precision Whetstones: 236.97 Mflops
Last edited by Pito on Fri Jan 10, 2020 1:10 pm, edited 2 times in total.

Post by ag123 »

marginal overclock to 96 MHz, libmaple core
STM32F4/variants/blackpill_f401/blackpill_f401.h

Code: Select all

#define CYCLES_PER_MICROSECOND   96
STM32F4/cores/maple/libmaple/rccF4.c

Code: Select all

//temporary work around
void rcc_clk_init(void)
{
	SystemCoreClock = CYCLES_PER_MICROSECOND * 1000000;

	SetupClock84MHz();
	return;
}

void SetupClock84MHz()
{
	/******************************************************************************/
	/*            PLL (clocked by HSE) used as System clock source                */
	/******************************************************************************/
	/************************* PLL Parameters *************************************/
	// PLL_VCO = (HSE_VALUE or HSI_VALUE / PLL_M) * PLL_N = 25[MHz]/25 * 192 = 192
	int PLL_M = 25;
	//int PLL_N = 336;	// stock 84 MHz setting
	int PLL_N = 192;

	// SYSCLK = PLL_VCO / PLL_P = 192 / 2 = 96
	//int PLL_P = 4;
	int PLL_P = 2;

	// USB OTG FS, SDIO and RNG Clock = PLL_VCO / PLLQ = 192 / 4 = 48
	//int PLL_Q = 7;
	int PLL_Q = 4;
ART enabled
Beginning Whetstone benchmark at 96 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00

Loops:10000, Iterations:1, Duration:7413.27 millisec
C Converted Single Precision Whetstones:134.89 Mflops
ART disabled
Beginning Whetstone benchmark at 96 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00

Loops:10000, Iterations:1, Duration:10027.59 millisec
C Converted Single Precision Whetstones:99.72 Mflops
interestingly, the F401 runs rather stably at 96 MHz (the specified maximum is 84 MHz)
Vref int (1.21v):1496
temp sensor:959
mvolt:775
temp:31.00
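The PLL settings in this post can be checked with the arithmetic from the code comments. A small sketch in C, assuming HSE is an integer multiple of PLL_M; the function names are illustrative and not libmaple API:

```c
#include <stdint.h>

/* Clock tree arithmetic for the STM32F4 main PLL, all values in MHz:
 *   VCO    = (HSE / PLL_M) * PLL_N
 *   SYSCLK = VCO / PLL_P
 *   USB    = VCO / PLL_Q   (needs to be 48 for USB OTG FS)
 * Integer division is used, so HSE is assumed divisible by PLL_M. */
static uint32_t pll_vco_mhz(uint32_t hse, uint32_t m, uint32_t n)
{
	return hse / m * n;
}

static uint32_t pll_sysclk_mhz(uint32_t hse, uint32_t m, uint32_t n, uint32_t p)
{
	return pll_vco_mhz(hse, m, n) / p;
}

static uint32_t pll_usb_mhz(uint32_t hse, uint32_t m, uint32_t n, uint32_t q)
{
	return pll_vco_mhz(hse, m, n) / q;
}
```

With HSE = 25, M = 25, N = 192, P = 2, Q = 4 this gives VCO = 192, SYSCLK = 96 and a USB clock of 48.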
Last edited by ag123 on Fri Jan 10, 2020 1:44 pm, edited 4 times in total.

Post by ag123 »

next try: 108 MHz
STM32F4/variants/blackpill_f401/blackpill_f401.h

Code: Select all

#define CYCLES_PER_MICROSECOND   108
STM32F4/cores/maple/libmaple/rccF4.c

Code: Select all

//temporary work around
void rcc_clk_init(void)
{
	SystemCoreClock = CYCLES_PER_MICROSECOND * 1000000;

	SetupClock84MHz();
	return;
}
void SetupClock84MHz()
{
	/******************************************************************************/
	/*            PLL (clocked by HSE) used as System clock source                */
	/******************************************************************************/
	/************************* PLL Parameters *************************************/
	// PLL_VCO = (HSE_VALUE or HSI_VALUE / PLL_M) * PLL_N = 25[MHz]/25 * 432 = 432
	int PLL_M = 25;
	//int PLL_N = 336;	// 84 MHz
	//int PLL_N = 192;	// 96 MHz
	int PLL_N = 432;

	// SYSCLK = PLL_VCO / PLL_P = 432 / 4 = 108
	//int PLL_P = 4;
	//int PLL_P = 2;
	int PLL_P = 4;

	// USB OTG FS, SDIO and RNG Clock = PLL_VCO / PLLQ = 432 / 9 = 48
	//int PLL_Q = 7;
	//int PLL_Q = 4;
	int PLL_Q = 9;
ART enabled
Beginning Whetstone benchmark at 108 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:6573.10 millisec
C Converted Single Precision Whetstones:152.14 Mflops
ART disabled
Beginning Whetstone benchmark at 108 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:8902.95 millisec
C Converted Single Precision Whetstones:112.32 Mflops
Vref int (1.21v):1496
temp sensor:962
mvolt:778
temp:32.20
runs slightly warmer at 108Mhz
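The sensor lines above can be reproduced from the raw ADC counts. This is a sketch in C under assumptions: the 1210 mV reference comes from the printout, while V25 = 760 mV and a slope of 2.5 mV/°C are typical STM32F4 datasheet values that are not stated in the posts. The function names are illustrative.

```c
/* Convert the raw temperature-sensor reading to millivolts, using the
 * internal reference (nominally 1210 mV) as the yardstick.
 * Integer arithmetic, so the result is truncated. */
static int sensor_mvolt(int raw_temp, int raw_vref)
{
	return raw_temp * 1210 / raw_vref;
}

/* Convert sensor millivolts to degrees Celsius.
 * Assumed constants: V25 = 760 mV, Avg_Slope = 2.5 mV/degC. */
static double sensor_temp_c(int mvolt)
{
	return (mvolt - 760) / 2.5 + 25.0;
}
```

With the readings posted here, `sensor_mvolt(962, 1496)` gives 778 and `sensor_temp_c(778)` gives about 32.2, which matches the printout, so the sketch probably uses similar constants.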
Last edited by ag123 on Fri Jan 10, 2020 1:57 pm, edited 2 times in total.

Post by Pito »

At 96 MHz you should see a 96/84 = 1.143 speedup..
Like 135 Mflops with ART.

Try with 168MHz, I bet it works.. :) (you should see 236.97 Mflops, xpack 9 compiler, Roger's core, -O3)
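The expectation here is plain linear scaling with the core clock. A one-function sketch in C; `scaled_mflops` is an illustrative name, and linear scaling is itself an assumption that only holds while flash wait states do not change:

```c
/* Expected score if throughput scales linearly with the core clock. */
static double scaled_mflops(double base_mflops, double base_mhz, double new_mhz)
{
	return base_mflops * new_mhz / base_mhz;
}
```

Starting from the 118.25 Mflops measured at 84 MHz with ART, `scaled_mflops(118.25, 84, 96)` gives about 135.1, and `scaled_mflops(118.25, 84, 168)` gives 236.5, close to the 236.97 quoted for Roger's core at 168 MHz.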

Post by ag123 »

do i need dry ice? or maybe a big cpu cooler :lol:
ok searching for a setup
using my python script to get a clock config
viewtopic.php?f=41&t=78

Code: Select all

FHSE: 25 m: 25 n: 336 p: 2 (RCC_PLLP_DIV2) q: 7 fusb: 48.0 fcpu: 168.0
STM32F4/variants/blackpill_f401/blackpill_f401.h

Code: Select all

#define CYCLES_PER_MICROSECOND   168
STM32F4/cores/maple/libmaple/rccF4.c

Code: Select all

// 168 Mhz !
void SetupClock84MHz()
{
	/******************************************************************************/
	/*            PLL (clocked by HSE) used as System clock source                */
	/******************************************************************************/
	/************************* PLL Parameters *************************************/
	// PLL_VCO = (HSE_VALUE or HSI_VALUE / PLL_M) * PLL_N = 25[MHz]/25 * 336 = 336
	int PLL_M = 25;
	//int PLL_N = 336;	// 84 MHz
	//int PLL_N = 192;	// 96 MHz
	//int PLL_N = 432;	// 108 MHz
	int PLL_N = 336;

	// SYSCLK = PLL_VCO / PLL_P = 336 / 2 = 168
	//int PLL_P = 4;
	//int PLL_P = 2;
	//int PLL_P = 4;
	int PLL_P = 2;

	// USB OTG FS, SDIO and RNG Clock = PLL_VCO / PLLQ = 336 / 7 = 48
	//int PLL_Q = 7;
	//int PLL_Q = 4;
	//int PLL_Q = 9;
	int PLL_Q = 7;
it stalls, so increase the flash wait states to 7! (this is the max) :lol:

Code: Select all

void setup() {


	/* enable the ART accelerator */
	/* enable prefetch buffer */
	FLASH_BASE->ACR |= FLASH_ACR_PRFTEN;
	/* Enable flash instruction cache */
	FLASH_BASE->ACR |= FLASH_ACR_ICEN;
	/* Enable flash data cache */
	FLASH_BASE->ACR |= FLASH_ACR_DCEN;

	// 7 wait state
	FLASH_BASE->ACR |= FLASH_ACR_LATENCY_7WS;

it stalls and hangs, doesn't even blink :lol:
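One detail of the setup code above: `|=` can only set bits, so it raises the latency field but could never lower it again. A sketch of a read-modify-write on a plain value, so it can be tried off-target. The PRFTEN/ICEN/DCEN bit positions are the STM32F4 ones; the macro names, the function name and the width of the latency mask are assumptions to check against the reference manual for the exact part.

```c
#include <stdint.h>

/* FLASH_ACR bits on STM32F4. Names here are local to this sketch. */
#define ACR_PRFTEN        (1u << 8)   /* prefetch enable */
#define ACR_ICEN          (1u << 9)   /* instruction cache enable */
#define ACR_DCEN          (1u << 10)  /* data cache enable */
#define ACR_LATENCY_MASK  0xFu        /* assumption: field is at most 4 bits wide */

/* Return a new ACR value with ART features enabled and the latency
 * field replaced (not OR-ed) by the requested number of wait states. */
static uint32_t acr_with_art(uint32_t acr, uint32_t wait_states)
{
	acr &= ~ACR_LATENCY_MASK;                 /* clear the old latency */
	acr |= wait_states & ACR_LATENCY_MASK;    /* write the new latency */
	return acr | ACR_PRFTEN | ACR_ICEN | ACR_DCEN;
}
```

On the target this would be used as `FLASH_BASE->ACR = acr_with_art(FLASH_BASE->ACR, 7);`. Whether that alone avoids the hang at 168 MHz is not something this sketch can show.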
Last edited by ag123 on Fri Jan 10, 2020 4:37 pm, edited 1 time in total.

Post by Pito »

BTW - I doubt the results we see are really Whetstone results :)

We should rather call them "Whetduinos" :P

https://www.st.com/content/ccc/resource ... 047230.pdf

http://www.roylongbottom.org.uk/whetstone%20results.htm
Last edited by Pito on Fri Jan 10, 2020 2:34 pm, edited 2 times in total.

Post by ag123 »

well, i can't remember where i got the code from, but it may be derived from this :lol:
https://www.netlib.org/benchmark/
https://www.netlib.org/benchmark/whetstone.c
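In every run posted in this thread, the printed figure times the duration in seconds comes to 1000. That is consistent with the sketch keeping a rating of the form 100 * loops * iterations / seconds in KIPS, which is how I recall the netlib source computing it (an assumption, not checked against the sketch itself). In C, with an illustrative function name:

```c
/* Rating as the posted numbers suggest it is computed:
 *   KIPS = 100 * loops * iterations / seconds
 * returned here in thousands of KIPS (i.e. MIPS). */
static double whetstone_mips(long loops, long iterations, double duration_ms)
{
	double seconds = duration_ms / 1000.0;
	double kips = 100.0 * loops * iterations / seconds;
	return kips / 1000.0;
}
```

`whetstone_mips(10000, 1, 8456.39)` gives about 118.25 and `whetstone_mips(10000, 1, 4220)` about 236.97, the same numbers the printouts label as Mflops. If the assumption holds, the figure is a Whetstone instruction rate rather than a count of floating point operations.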

ok next try

Code: Select all

FHSE: 25 m: 25 n: 288 p: 2 (RCC_PLLP_DIV2) q: 6 fusb: 48.0 fcpu: 144.0
STM32F4/variants/blackpill_f401/blackpill_f401.h

Code: Select all

#define CYCLES_PER_MICROSECOND   144
STM32F4/cores/maple/libmaple/rccF4.c

Code: Select all

void SetupClock84MHz()
{
	// 144 Mhz !
	/************************* PLL Parameters *************************************/
	// PLL_VCO = (HSE_VALUE or HSI_VALUE / PLL_M) * PLL_N = 25[MHz]/25 * 288 = 288
	int PLL_M = 25;
	//int PLL_N = 336; //84Mhz
	//int PLL_N = 192; //96Mhz
	//int PLL_N = 432; //108Mhz
	int PLL_N = 288; //144Mhz

	// SYSCLK = PLL_VCO / PLL_P = 288 / 2 = 144
	//int PLL_P = 4;
	//int PLL_P = 2;
	//int PLL_P = 4;
	int PLL_P = 2;

	// USB OTG FS, SDIO and RNG Clock = PLL_VCO / PLLQ = 288 / 6 = 48
	//int PLL_Q = 7;
	//int PLL_Q = 4;
	//int PLL_Q = 9;
	int PLL_Q = 6;

Code: Select all

void setup() {
	// 7 wait state
	FLASH_BASE->ACR |= FLASH_ACR_LATENCY_7WS;
ART enabled
Beginning Whetstone benchmark at 144 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:5399.54 millisec
C Converted Single Precision Whetstones:185.20 Mflops
ART disabled
Beginning Whetstone benchmark at 144 MHz ...
0 0 0 1.00 -1.00 -1.00 -1.00
120000 140000 120000 -0.00 0.00 -0.00 0.00
140000 120000 120000 -0.00 0.00 0.00 0.00
3450000 1 1 1.00 -1.00 -1.00 -1.00
2100000 1 2 6.00 6.00 0.00 0.00
320000 1 2 0.00 0.00 0.00 0.00
8990000 1 2 1.00 1.00 1.00 1.00
6160000 1 2 3.00 2.00 3.00 0.00
0 2 3 1.00 -1.00 -1.00 -1.00
930000 2 3 1.00 1.00 1.00 1.00
Loops:10000, Iterations:1, Duration:9991.17 millisec
C Converted Single Precision Whetstones:100.09 Mflops
Vref int (1.21v):1495
temp sensor:958
mvolt:775
temp:31.00

not stable at 144 MHz: it didn't show up in the temperatures, but it hung more than once while trying to run the Whetstone benchmark
this result came only after a few tries; the low Mflops may be due to the 7 wait states
Last edited by ag123 on Fri Jan 10, 2020 4:52 pm, edited 7 times in total.

Post by Pito »

The ART somehow eliminates the flash access wait states.
Even if the code theoretically ran with zero WS, the floating point unit needs 1 to 14 (1, 3 or 14) CPU clocks to produce a single precision FP result.
Adding the overhead etc., I would guess the MFLOPS result @168MHz cannot be better than, say, 15, like the IBM C6x86 @150MHz in the best case.
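The cycle counts above translate into a simple upper bound. A sketch in C, assuming every operation costs a fixed number of clocks and nothing else (loads, stores, loop overhead) takes time; `peak_mflops` is an illustrative name:

```c
/* Upper bound on floating point results per microsecond (MFLOPS) when
 * each operation costs a fixed number of CPU clocks. */
static double peak_mflops(double cpu_mhz, double clocks_per_op)
{
	return cpu_mhz / clocks_per_op;
}
```

At 168 MHz this gives 168 for 1-clock operations, 56 for 3-clock operations and 12 for 14-clock operations. A real mix with overhead lands somewhere below those ceilings, which is the point of the estimate in this post.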
Last edited by Pito on Fri Jan 10, 2020 2:46 pm, edited 2 times in total.