well, i think those single precision (sp) figures of around 60 Mflops to 150 Mflops for the F401 at 84 MHz and the F411 at 96 MHz are real after all.
for one thing, the fpu seems rather close to the arm11 vfp fpu, minus the 'vector' floating point part
http://infocenter.arm.com/help/topic/co ... DEJJH.html
nevertheless, fp instructions probably execute at 1 flop per cycle, and there are possibly several alus, e.g. separate ones for multiply, divide and add, plus some kind of 'speculative' (out of order) execution. that is the only way i can see to explain the above 1 flop per hz performance on the stm32f4x cpus
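to put a number on "flops per hz", here is a minimal sketch of the arithmetic in c. the function name `flops_per_cycle` is made up for illustration, and which Mflops figure belongs to which chip is an assumption on my part, not something the benchmark numbers above state.

```c
/* flops per clock cycle = Mflops / MHz.
 * both values carry the same 1e6 factor, so it cancels out.
 * flops_per_cycle is a hypothetical helper name, used only for illustration. */
double flops_per_cycle(double mflops, double clock_mhz)
{
    return mflops / clock_mhz;
}
```

if the 150 Mflops figure is paired with the 96 MHz F411 (an assumed pairing), `flops_per_cycle(150.0, 96.0)` works out to 1.5625, i.e. above one flop per clock, while `flops_per_cycle(60.0, 84.0)` stays below one.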
the optimization using vfp libraries may also have used things like fma (fused multiply-add) instructions for the whetstone benchmarks. an fma does both the multiply and the add in a single instruction, which would make matrix-vector calcs look like they run at 2 flops per instruction
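a rough sketch of why a multiply-accumulate loop gets credited with 2 flops per step. this is plain c, not the actual whetstone or vfp library code; `dot_mac` and `dot_flops` are made-up names. whether the compiler really turns `a[i] * b[i] + acc` into a single multiply-add instruction depends on the compiler, its flags and the target fpu, so treat that part as an assumption.

```c
#include <stddef.h>

/* dot product written as a multiply-accumulate loop.
 * each iteration does one multiply and one add; a compiler targeting an
 * fpu with a multiply-add instruction may merge them into one instruction
 * (not guaranteed, it depends on compiler and flags). */
float dot_mac(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc = a[i] * b[i] + acc;
    }
    return acc;
}

/* flops by the usual counting convention: one multiply plus one add
 * per element, regardless of how many instructions were issued. */
size_t dot_flops(size_t n)
{
    return 2 * n;
}
```

so for 3-element vectors the loop runs 3 multiply-accumulate steps but is counted as `dot_flops(3)`, i.e. 6 flops. if each step really is a single instruction, the flops count is twice the instruction count.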
so after all we do have pretty fast single precision floating point on our little f4 chips
these are probably more advanced than the p4 technology of its time. after all, desktop intel chips these days run more than a single flop per clock, and intel does that in a much more extreme way at 64 bits (in fact 80 bits)
https://en.wikipedia.org/wiki/Extended_precision