This is the sequel of the single precision SSE optimized sin, cos, log and exp that I wrote some time ago. Adapted to the NEON fpu of my pandaboard. Precision and range are exactly the same than the SSE version, so I won't repeat them.
command line: gcc -O3 -mfloat-abi=softfp -mfpu=neon -march=armv7-a -mtune=cortex-a9 -Wall -W neon_mathfun_test.c -lm
exp([ -1000, -100, 100, 1000]) = [ 0, 0, 2.4061436e+38, 2.4061436e+38] exp([ -nan, inf, -inf, nan]) = [ nan, 2.4061436e+38, 0, nan] log([ 0, -10, 1e+30, 1.0005271e-42]) = [ -nan, -nan, 69.077553, -nan] log([ -nan, inf, -inf, nan]) = [ 89.128304, 88.722839, -nan, 89.128304] sin([ -nan, inf, -inf, nan]) = [ nan, nan, -nan, nan] cos([ -nan, inf, -inf, nan]) = [ nan, nan, nan, nan] sin([ -1e+30, -100000, 1e+30, 100000]) = [ inf, -0.035749275, -inf, 0.035749275] cos([ -1e+30, -100000, 1e+30, 100000]) = [ nan, -0.9993608, nan, -0.9993608] benching sinf .. -> 2.0 millions of vector evaluations/second -> 121 cycles/value on a 1000MHz computer benching cosf .. -> 1.8 millions of vector evaluations/second -> 132 cycles/value on a 1000MHz computer benching expf .. -> 1.1 millions of vector evaluations/second -> 221 cycles/value on a 1000MHz computer benching logf .. -> 1.7 millions of vector evaluations/second -> 141 cycles/value on a 1000MHz computer benching cephes_sinf .. -> 2.4 millions of vector evaluations/second -> 103 cycles/value on a 1000MHz computer benching cephes_cosf .. -> 2.0 millions of vector evaluations/second -> 123 cycles/value on a 1000MHz computer benching cephes_expf .. -> 1.6 millions of vector evaluations/second -> 153 cycles/value on a 1000MHz computer benching cephes_logf .. -> 1.5 millions of vector evaluations/second -> 156 cycles/value on a 1000MHz computer benching sin_ps .. -> 5.8 millions of vector evaluations/second -> 43 cycles/value on a 1000MHz computer benching cos_ps .. -> 5.9 millions of vector evaluations/second -> 42 cycles/value on a 1000MHz computer benching sincos_ps .. -> 6.0 millions of vector evaluations/second -> 41 cycles/value on a 1000MHz computer benching exp_ps .. -> 5.6 millions of vector evaluations/second -> 44 cycles/value on a 1000MHz computer benching log_ps .. -> 5.3 millions of vector evaluations/second -> 47 cycles/value on a 1000MHz computer
So performance is not stellar. I recommend to use gcc 4.6.1 or newer as it generates much better code than previous (gcc 4.5) versions -- almost 20% faster here. I believe rewriting these functions in assembly would improve the performance by 30%, and should not be very hard as the ARM and NEON asm is quite nice and easy to write -- maybe I'll do it. Computing two SIMD vectors at once would also help to improve a lot the performance as there are enough registers on NEON, and it would reduce the dependancies between neon instructions.
Note also that I have no idea of the performance on a Cortex A8 -- it may be extremely bad, I don't know.
command line: cl.exe /arch:SSE /O2 /TP /MD sse_mathfun_test.c (this is msvc 2010)
benching sinf .. -> 1.3 millions of vector evaluations/second -> 303 cycles/value on a 1600MHz computer benching cosf .. -> 1.3 millions of vector evaluations/second -> 305 cycles/value on a 1600MHz computer benching sincos (x87) .. -> 1.2 millions of vector evaluations/second -> 314 cycles/value on a 1600MHz computer benching expf .. -> 1.6 millions of vector evaluations/second -> 244 cycles/value on a 1600MHz computer benching logf .. -> 1.4 millions of vector evaluations/second -> 276 cycles/value on a 1600MHz computer benching cephes_sinf .. -> 1.4 millions of vector evaluations/second -> 280 cycles/value on a 1600MHz computer benching cephes_cosf .. -> 1.5 millions of vector evaluations/second -> 265 cycles/value on a 1600MHz computer benching cephes_expf .. -> 0.7 millions of vector evaluations/second -> 548 cycles/value on a 1600MHz computer benching cephes_logf .. -> 0.8 millions of vector evaluations/second -> 489 cycles/value on a 1600MHz computer benching sin_ps .. -> 9.2 millions of vector evaluations/second -> 43 cycles/value on a 1600MHz computer benching cos_ps .. -> 9.5 millions of vector evaluations/second -> 42 cycles/value on a 1600MHz computer benching sincos_ps .. -> 8.8 millions of vector evaluations/second -> 45 cycles/value on a 1600MHz computer benching exp_ps .. -> 9.8 millions of vector evaluations/second -> 41 cycles/value on a 1600MHz computer benching log_ps .. -> 8.6 millions of vector evaluations/second -> 46 cycles/value on a 1600MHz computerThe number of cycles is quite similar -- but the atom has a higher clock..