Simple SSE and SSE2 (and now NEON) optimized sin, cos, log and exp

The story

I have spent quite a while looking for a simple (but fast) SSE version of some basic transcendental functions (sines and exponential). On the mac, you get the vsinf and friends (in the Accelerate framework) which are nice (there is a ppc version and an intel version, Apple rox) but closed-source, and restricted to macos..

Both Intel and AMD have some sort of vector math library with SIMD sines and cosines, but

Some time ago, I found out the Intel Approximate Math library. This one is completely free and open-source, and it provides SSE and SSE2 versions of many functions. But it has two drawbacks:

However, it served as a great source of inspiration for the sin_ps, cos_ps, exp_ps and log_ps provided below.

I chose to write them in pure SSE1+MMX so that they run on the pentium III of your grand mother, and also on my brave athlon-xp, since thoses beast are not SSE2 aware. Intel AMath showed me that the performance gain for using SSE2 for that purpose was not large enough (10%) to consider providing an SSE2 version (but it can be done very quickly). Update: I finally did that SSE2 version -- see below.

The functions use only the _mm_ intrinsics , there is no inline assembly in the code. Advantage: easier to debug, works out of the box on 64 bit setups, let the compiler choose what should be stored in a register, and what is stored in memory. Inconvenient: some versions of gcc 3.x are badly broken with certain intrinsic functions ( _mm_movehl_ps , _mm_cmpeq_ps etc). Mingw's gcc for example -- beware that the brokeness is dependent on the optimization level. A workaround is provided (inline asm replacement for the braindead intrinsics), it is not nice but robust, and broken compilers are detected by the validation program below.

I first tried to improve the AMath functions by using longer minimax polynomial approximations for sine, but of course it failed to achieve full precision because of rounding errors in the polynom, and in the computation of x modulo Pi. So I took a look at the implementation of these functions in the cephes library, noticed that they were simpler than what I imagined and contained very few branches, and just translated them in SSE intrinsics. The sincos_ps is nice because you get magically a free sine for each cosine you compute, so it is almost as fast as the sin_ps and the cos_ps.

Of course it is not IEEE compliant, but the max absolute error on sines is 2^-24 on the range [-8192,8192]. The SSE implementation matches exactly the cephes one if you do not use the x87 fpu when compiling the test program ( -mfpmath=sse with gcc)

The code

The functions below are licensed under the zlib license, so you can do basically what you want with them. There is nothing smart in them, it's just a straight translation of the cephes functions in SSE1+MMX. However it took some time to debug and validate (more than I expected at first). If you find a bug, or a way to improve their performances, or if you add other functions (tan, log2, exp2, pow, asin, ...) I'd be glad to know.

SSE2 version (feature added on 2008/12/15)

Ok, I finally added a pure SSE2 version that does not use any mmx intrinsic. The reason is that the 64 bits MSVC 2008 compiler is not able to generate any MMX intrinsics. This compiler is just stupid. So I had no choice but translate mmx stuff into sse2, which proved to be very easy and boring. The perf improvement is below 10%. In order to use the SSE2 version, just define USE_SSE2 when compiling.

ARM NEON version (feature added on 2011/05/29)

This is a direct translation of the SSE2 code into NEON intrinsics. I have put it on a separate page, which is here.

Performance

If you are compiling with gcc 3.3 or older, and get deceiving performance, you should have a look at the generated assembly because it has some real problems to inline some SSE intrinsics. gcc 4.x should definitively be preferred..

results on a macbook with 1.83GHz Core 1 Duo (apple gcc 4.0.1)

command line: gcc -O2 -Wall -W -DHAVE_VECLIB -msse sse_mathfun_test.c -framework Accelerate

exp([        -1000,          -100,           100,          1000]) = [            0,             0, 2.4061436e+38, 2.4061436e+38]
exp([          nan,           inf,          -inf,           nan]) = [2.4061436e+38, 2.4061436e+38,             0, 2.4061436e+38]
log([            0,           -10,         1e+30, 1.0005271e-42]) = [   -87.336548,    -87.336548,     69.077553,    -87.336548]
log([          nan,           inf,          -inf,           nan]) = [   -87.336548,     88.722839,    -87.336548,    -87.336548]
sin([          nan,           inf,          -inf,           nan]) = [          nan,           nan,           nan,           nan]
cos([          nan,           inf,          -inf,           nan]) = [          nan,           nan,           nan,           nan]
sin([       -1e+30,       -100000,         1e+30,        100000]) = [          inf,  -0.035749275,          -inf,   0.035749275]
cos([       -1e+30,       -100000,         1e+30,        100000]) = [          nan,    -0.9993608,           nan,    -0.9993608]
benching                 sinf .. ->    3.2 millions of vector evaluations/second -> 142 cycles/value on a 1830MHz computer
benching                 cosf .. ->    3.2 millions of vector evaluations/second -> 142 cycles/value on a 1830MHz computer
benching         sincos (x87) .. ->    2.8 millions of vector evaluations/second -> 161 cycles/value on a 1830MHz computer
benching                 expf .. ->    3.0 millions of vector evaluations/second -> 148 cycles/value on a 1830MHz computer
benching                 logf .. ->    3.0 millions of vector evaluations/second -> 150 cycles/value on a 1830MHz computer
benching          cephes_sinf .. ->    4.5 millions of vector evaluations/second -> 100 cycles/value on a 1830MHz computer
benching          cephes_cosf .. ->    4.9 millions of vector evaluations/second ->  92 cycles/value on a 1830MHz computer
benching          cephes_expf .. ->    3.0 millions of vector evaluations/second -> 151 cycles/value on a 1830MHz computer
benching          cephes_logf .. ->    2.6 millions of vector evaluations/second -> 172 cycles/value on a 1830MHz computer
benching               sin_ps .. ->   18.1 millions of vector evaluations/second ->  25 cycles/value on a 1830MHz computer
benching               cos_ps .. ->   18.2 millions of vector evaluations/second ->  25 cycles/value on a 1830MHz computer
benching            sincos_ps .. ->   15.0 millions of vector evaluations/second ->  30 cycles/value on a 1830MHz computer
benching               exp_ps .. ->   17.6 millions of vector evaluations/second ->  26 cycles/value on a 1830MHz computer
benching               log_ps .. ->   15.5 millions of vector evaluations/second ->  29 cycles/value on a 1830MHz computer
benching                vsinf .. ->   14.3 millions of vector evaluations/second ->  32 cycles/value on a 1830MHz computer
benching                vcosf .. ->   14.4 millions of vector evaluations/second ->  32 cycles/value on a 1830MHz computer
benching                vexpf .. ->   12.0 millions of vector evaluations/second ->  38 cycles/value on a 1830MHz computer
benching                vlogf .. ->   13.1 millions of vector evaluations/second ->  35 cycles/value on a 1830MHz computer

Results on a 2600MHz opteron running linux, with the 64 bits acml math vector lib (gcc 4.2)

command line: gcc-4.2 -msse -O3 -Wall -W sse_mathfun_test.c -lm -DHAVE_ACML -I /usr/local/acml3.6.0/pathscale64/include /usr/local/acml3.6.0/pathscale64/lib/libacml_mv.a

benching                 sinf .. ->    6.3 millions of vector evaluations/second -> 103 cycles/value on a 2600MHz computer
benching                 cosf .. ->    5.6 millions of vector evaluations/second -> 115 cycles/value on a 2600MHz computer
benching         sincos (x87) .. ->    4.2 millions of vector evaluations/second -> 153 cycles/value on a 2600MHz computer
benching                 expf .. ->    1.1 millions of vector evaluations/second -> 546 cycles/value on a 2600MHz computer
benching                 logf .. ->    4.7 millions of vector evaluations/second -> 138 cycles/value on a 2600MHz computer
benching          cephes_sinf .. ->   11.6 millions of vector evaluations/second ->  56 cycles/value on a 2600MHz computer
benching          cephes_cosf .. ->    8.7 millions of vector evaluations/second ->  74 cycles/value on a 2600MHz computer
benching          cephes_expf .. ->    3.7 millions of vector evaluations/second -> 172 cycles/value on a 2600MHz computer
benching          cephes_logf .. ->    5.5 millions of vector evaluations/second -> 117 cycles/value on a 2600MHz computer
benching               sin_ps .. ->   26.1 millions of vector evaluations/second ->  25 cycles/value on a 2600MHz computer
benching               cos_ps .. ->   26.1 millions of vector evaluations/second ->  25 cycles/value on a 2600MHz computer
benching            sincos_ps .. ->   23.7 millions of vector evaluations/second ->  27 cycles/value on a 2600MHz computer
benching               exp_ps .. ->   22.9 millions of vector evaluations/second ->  28 cycles/value on a 2600MHz computer
benching               log_ps .. ->   21.6 millions of vector evaluations/second ->  30 cycles/value on a 2600MHz computer
benching       acml vrs4_sinf .. ->   17.9 millions of vector evaluations/second ->  36 cycles/value on a 2600MHz computer
benching       acml vrs4_cosf .. ->   18.3 millions of vector evaluations/second ->  35 cycles/value on a 2600MHz computer
benching       acml vrs4_expf .. ->   28.6 millions of vector evaluations/second ->  23 cycles/value on a 2600MHz computer
benching       acml vrs4_logf .. ->   23.6 millions of vector evaluations/second ->  27 cycles/value on a 2600MHz computer

Results on a 2GHz athlon-xp 2400+ , using mingw (gcc 3.4.5)

command line: gcc -mfpmath=sse -msse -O2 -Wall -W sse_mathfun_test.c

benching                 sinf .. ->    3.4 millions of vector evaluations/second -> 144 cycles/value on a 2000MHz computer
benching                 cosf .. ->    5.1 millions of vector evaluations/second ->  97 cycles/value on a 2000MHz computer
benching         sincos (x87) .. ->    2.3 millions of vector evaluations/second -> 214 cycles/value on a 2000MHz computer
benching                 expf .. ->    1.8 millions of vector evaluations/second -> 272 cycles/value on a 2000MHz computer
benching                 logf .. ->    2.5 millions of vector evaluations/second -> 200 cycles/value on a 2000MHz computer
benching          cephes_sinf .. ->    3.7 millions of vector evaluations/second -> 132 cycles/value on a 2000MHz computer
benching          cephes_cosf .. ->    3.2 millions of vector evaluations/second -> 153 cycles/value on a 2000MHz computer
benching          cephes_expf .. ->    1.2 millions of vector evaluations/second -> 407 cycles/value on a 2000MHz computer
benching          cephes_logf .. ->    1.4 millions of vector evaluations/second -> 355 cycles/value on a 2000MHz computer
benching               sin_ps .. ->   17.2 millions of vector evaluations/second ->  29 cycles/value on a 2000MHz computer
benching               cos_ps .. ->   17.0 millions of vector evaluations/second ->  29 cycles/value on a 2000MHz computer
benching            sincos_ps .. ->   14.7 millions of vector evaluations/second ->  34 cycles/value on a 2000MHz computer
benching               exp_ps .. ->   17.2 millions of vector evaluations/second ->  29 cycles/value on a 2000MHz computer
benching               log_ps .. ->   14.7 millions of vector evaluations/second ->  34 cycles/value on a 2000MHz computer

Results on a 2GHz athlon-xp 2400+ , using cl.exe (visual c++ express 2005)

command line: cl.exe /arch:SSE /O2 /TP /MD sse_mathfun_test.c

benching                 sinf .. ->    3.1 millions of vector evaluations/second -> 160 cycles/value on a 2000MHz computer
benching                 cosf .. ->    3.9 millions of vector evaluations/second -> 127 cycles/value on a 2000MHz computer
benching         sincos (x87) .. ->    2.8 millions of vector evaluations/second -> 175 cycles/value on a 2000MHz computer
benching                 expf .. ->    2.0 millions of vector evaluations/second -> 239 cycles/value on a 2000MHz computer
benching                 logf .. ->    2.6 millions of vector evaluations/second -> 192 cycles/value on a 2000MHz computer
benching          cephes_sinf .. ->    2.5 millions of vector evaluations/second -> 198 cycles/value on a 2000MHz computer
benching          cephes_cosf .. ->    2.8 millions of vector evaluations/second -> 176 cycles/value on a 2000MHz computer
benching          cephes_expf .. ->    0.9 millions of vector evaluations/second -> 546 cycles/value on a 2000MHz computer
benching          cephes_logf .. ->    1.3 millions of vector evaluations/second -> 370 cycles/value on a 2000MHz computer
benching               sin_ps .. ->   17.2 millions of vector evaluations/second ->  29 cycles/value on a 2000MHz computer
benching               cos_ps .. ->   17.3 millions of vector evaluations/second ->  29 cycles/value on a 2000MHz computer
benching            sincos_ps .. ->   15.5 millions of vector evaluations/second ->  32 cycles/value on a 2000MHz computer
benching               exp_ps .. ->   17.8 millions of vector evaluations/second ->  28 cycles/value on a 2000MHz computer
benching               log_ps .. ->   10.4 millions of vector evaluations/second ->  48 cycles/value on a 2000MHz computer

Results on a 1.6GHz atom N270 , using cl.exe (visual c++ express 2010)

command line: cl.exe /arch:SSE /O2 /TP /MD sse_mathfun_test.c

benching                 sinf .. ->    1.3 millions of vector evaluations/second -> 303 cycles/value on a 1600MHz computer
benching                 cosf .. ->    1.3 millions of vector evaluations/second -> 305 cycles/value on a 1600MHz computer
benching         sincos (x87) .. ->    1.2 millions of vector evaluations/second -> 314 cycles/value on a 1600MHz computer
benching                 expf .. ->    1.6 millions of vector evaluations/second -> 244 cycles/value on a 1600MHz computer
benching                 logf .. ->    1.4 millions of vector evaluations/second -> 276 cycles/value on a 1600MHz computer
benching          cephes_sinf .. ->    1.4 millions of vector evaluations/second -> 280 cycles/value on a 1600MHz computer
benching          cephes_cosf .. ->    1.5 millions of vector evaluations/second -> 265 cycles/value on a 1600MHz computer
benching          cephes_expf .. ->    0.7 millions of vector evaluations/second -> 548 cycles/value on a 1600MHz computer
benching          cephes_logf .. ->    0.8 millions of vector evaluations/second -> 489 cycles/value on a 1600MHz computer
benching               sin_ps .. ->    9.2 millions of vector evaluations/second ->  43 cycles/value on a 1600MHz computer
benching               cos_ps .. ->    9.5 millions of vector evaluations/second ->  42 cycles/value on a 1600MHz computer
benching            sincos_ps .. ->    8.8 millions of vector evaluations/second ->  45 cycles/value on a 1600MHz computer
benching               exp_ps .. ->    9.8 millions of vector evaluations/second ->  41 cycles/value on a 1600MHz computer
benching               log_ps .. ->    8.6 millions of vector evaluations/second ->  46 cycles/value on a 1600MHz computer

update (2008/03/14): If you are looking for extreme performance on SSE2 simple precision logf, have a look at the page of Hans R. Battenfeld, who is working on a version twice faster than the ones here.


Last modified: Sun May 29 23:59:10 CEST 2011