I have spent quite a while looking for a simple (but fast) SSE version of some basic transcendental functions (sines and exponential). On the mac, you get the vsinf and friends (in the Accelerate framework) which are nice (there is a ppc version and an intel version, Apple rox) but closed-source, and restricted to macos..

Both Intel and AMD have some sort of vector math library with SIMD sines and cosines, but

- Intel MKL is not free (neither as beer, nor as speech)
- AMD ACML is free, but no source is available. Morever the vector functions are only available in 64bits OSes !
- Would you trust the intel MKL to run at full speed on AMD hardware ?

Some time ago, I found out the Intel Approximate Math library. This one is completely free and open-source, and it provides SSE and SSE2 versions of many functions. But it has two drawbacks:

- It is written as inline assembly, MASM style. The source is very targetted for MSVC/ICC so it is painful to use with gcc
- As the name implies, it is approximated. And, well, I had no use for a sine which has garbage in the ten last bits.

I chose to write them in pure SSE1+MMX so that they run on the pentium III of your grand mother, and also on my brave athlon-xp, since thoses beast are not SSE2 aware. Intel AMath showed me that the performance gain for using SSE2 for that purpose was not large enough (10%) to consider providing an SSE2 version (but it can be done very quickly). Update: I finally did that SSE2 version -- see below.

The functions use only the `_mm_` intrinsics , there is
no inline assembly in the code. Advantage: easier to debug,
works out of the box on 64 bit setups, let the compiler choose
what should be stored in a register, and what is stored in
memory. Inconvenient: some versions of gcc 3.x are badly broken
with certain intrinsic functions ( `_mm_movehl_ps` ,
`_mm_cmpeq_ps` etc). Mingw's gcc for example -- beware
that the brokeness is dependent on the optimization level. A
workaround is provided (inline asm replacement for the braindead
intrinsics), it is not nice but robust, and broken compilers are
detected by the validation program below.

I first tried to improve the AMath functions by using longer minimax polynomial approximations for sine, but of course it failed to achieve full precision because of rounding errors in the polynom, and in the computation of x modulo Pi. So I took a look at the implementation of these functions in the cephes library, noticed that they were simpler than what I imagined and contained very few branches, and just translated them in SSE intrinsics. The sincos_ps is nice because you get magically a free sine for each cosine you compute, so it is almost as fast as the sin_ps and the cos_ps.

Of course it is not IEEE compliant, but the max absolute error on sines is 2^-24 on the range [-8192,8192]. The SSE implementation matches exactly the cephes one if you do not use the x87 fpu when compiling the test program ( `-mfpmath=sse` with gcc)

- sse_mathfun.h source code for sin_ps, cos_ps, sincos_ps, exp_ps, log_ps, as straight C.
- sse_mathfun_test.c Validation+Bench program for those function. Do not forget to run it once.

command line: `gcc -O2 -Wall -W -DHAVE_VECLIB -msse sse_mathfun_test.c -framework Accelerate`

exp([ -1000, -100, 100, 1000]) = [ 0, 0, 2.4061436e+38, 2.4061436e+38] exp([ nan, inf, -inf, nan]) = [2.4061436e+38, 2.4061436e+38, 0, 2.4061436e+38] log([ 0, -10, 1e+30, 1.0005271e-42]) = [ -87.336548, -87.336548, 69.077553, -87.336548] log([ nan, inf, -inf, nan]) = [ -87.336548, 88.722839, -87.336548, -87.336548] sin([ nan, inf, -inf, nan]) = [ nan, nan, nan, nan] cos([ nan, inf, -inf, nan]) = [ nan, nan, nan, nan] sin([ -1e+30, -100000, 1e+30, 100000]) = [ inf, -0.035749275, -inf, 0.035749275] cos([ -1e+30, -100000, 1e+30, 100000]) = [ nan, -0.9993608, nan, -0.9993608] benching sinf .. -> 3.2 millions of vector evaluations/second -> 142 cycles/value on a 1830MHz computer benching cosf .. -> 3.2 millions of vector evaluations/second -> 142 cycles/value on a 1830MHz computer benching sincos (x87) .. -> 2.8 millions of vector evaluations/second -> 161 cycles/value on a 1830MHz computer benching expf .. -> 3.0 millions of vector evaluations/second -> 148 cycles/value on a 1830MHz computer benching logf .. -> 3.0 millions of vector evaluations/second -> 150 cycles/value on a 1830MHz computer benching cephes_sinf .. -> 4.5 millions of vector evaluations/second -> 100 cycles/value on a 1830MHz computer benching cephes_cosf .. -> 4.9 millions of vector evaluations/second -> 92 cycles/value on a 1830MHz computer benching cephes_expf .. -> 3.0 millions of vector evaluations/second -> 151 cycles/value on a 1830MHz computer benching cephes_logf .. -> 2.6 millions of vector evaluations/second -> 172 cycles/value on a 1830MHz computer benching sin_ps .. -> 18.1 millions of vector evaluations/second -> 25 cycles/value on a 1830MHz computer benching cos_ps .. -> 18.2 millions of vector evaluations/second -> 25 cycles/value on a 1830MHz computer benching sincos_ps .. -> 15.0 millions of vector evaluations/second -> 30 cycles/value on a 1830MHz computer benching exp_ps .. -> 17.6 millions of vector evaluations/second -> 26 cycles/value on a 1830MHz computer benching log_ps .. -> 15.5 millions of vector evaluations/second -> 29 cycles/value on a 1830MHz computer benching vsinf .. -> 14.3 millions of vector evaluations/second -> 32 cycles/value on a 1830MHz computer benching vcosf .. -> 14.4 millions of vector evaluations/second -> 32 cycles/value on a 1830MHz computer benching vexpf .. -> 12.0 millions of vector evaluations/second -> 38 cycles/value on a 1830MHz computer benching vlogf .. -> 13.1 millions of vector evaluations/second -> 35 cycles/value on a 1830MHz computer

command line: `gcc-4.2 -msse -O3 -Wall -W sse_mathfun_test.c -lm -DHAVE_ACML -I /usr/local/acml3.6.0/pathscale64/include /usr/local/acml3.6.0/pathscale64/lib/libacml_mv.a`

benching sinf .. -> 6.3 millions of vector evaluations/second -> 103 cycles/value on a 2600MHz computer benching cosf .. -> 5.6 millions of vector evaluations/second -> 115 cycles/value on a 2600MHz computer benching sincos (x87) .. -> 4.2 millions of vector evaluations/second -> 153 cycles/value on a 2600MHz computer benching expf .. -> 1.1 millions of vector evaluations/second -> 546 cycles/value on a 2600MHz computer benching logf .. -> 4.7 millions of vector evaluations/second -> 138 cycles/value on a 2600MHz computer benching cephes_sinf .. -> 11.6 millions of vector evaluations/second -> 56 cycles/value on a 2600MHz computer benching cephes_cosf .. -> 8.7 millions of vector evaluations/second -> 74 cycles/value on a 2600MHz computer benching cephes_expf .. -> 3.7 millions of vector evaluations/second -> 172 cycles/value on a 2600MHz computer benching cephes_logf .. -> 5.5 millions of vector evaluations/second -> 117 cycles/value on a 2600MHz computer benching sin_ps .. -> 26.1 millions of vector evaluations/second -> 25 cycles/value on a 2600MHz computer benching cos_ps .. -> 26.1 millions of vector evaluations/second -> 25 cycles/value on a 2600MHz computer benching sincos_ps .. -> 23.7 millions of vector evaluations/second -> 27 cycles/value on a 2600MHz computer benching exp_ps .. -> 22.9 millions of vector evaluations/second -> 28 cycles/value on a 2600MHz computer benching log_ps .. -> 21.6 millions of vector evaluations/second -> 30 cycles/value on a 2600MHz computer benching acml vrs4_sinf .. -> 17.9 millions of vector evaluations/second -> 36 cycles/value on a 2600MHz computer benching acml vrs4_cosf .. -> 18.3 millions of vector evaluations/second -> 35 cycles/value on a 2600MHz computer benching acml vrs4_expf .. -> 28.6 millions of vector evaluations/second -> 23 cycles/value on a 2600MHz computer benching acml vrs4_logf .. -> 23.6 millions of vector evaluations/second -> 27 cycles/value on a 2600MHz computer

command line: `gcc -mfpmath=sse -msse -O2 -Wall -W sse_mathfun_test.c`

benching sinf .. -> 3.4 millions of vector evaluations/second -> 144 cycles/value on a 2000MHz computer benching cosf .. -> 5.1 millions of vector evaluations/second -> 97 cycles/value on a 2000MHz computer benching sincos (x87) .. -> 2.3 millions of vector evaluations/second -> 214 cycles/value on a 2000MHz computer benching expf .. -> 1.8 millions of vector evaluations/second -> 272 cycles/value on a 2000MHz computer benching logf .. -> 2.5 millions of vector evaluations/second -> 200 cycles/value on a 2000MHz computer benching cephes_sinf .. -> 3.7 millions of vector evaluations/second -> 132 cycles/value on a 2000MHz computer benching cephes_cosf .. -> 3.2 millions of vector evaluations/second -> 153 cycles/value on a 2000MHz computer benching cephes_expf .. -> 1.2 millions of vector evaluations/second -> 407 cycles/value on a 2000MHz computer benching cephes_logf .. -> 1.4 millions of vector evaluations/second -> 355 cycles/value on a 2000MHz computer benching sin_ps .. -> 17.2 millions of vector evaluations/second -> 29 cycles/value on a 2000MHz computer benching cos_ps .. -> 17.0 millions of vector evaluations/second -> 29 cycles/value on a 2000MHz computer benching sincos_ps .. -> 14.7 millions of vector evaluations/second -> 34 cycles/value on a 2000MHz computer benching exp_ps .. -> 17.2 millions of vector evaluations/second -> 29 cycles/value on a 2000MHz computer benching log_ps .. -> 14.7 millions of vector evaluations/second -> 34 cycles/value on a 2000MHz computer

command line: `cl.exe /arch:SSE /O2 /TP /MD sse_mathfun_test.c`

benching sinf .. -> 3.1 millions of vector evaluations/second -> 160 cycles/value on a 2000MHz computer benching cosf .. -> 3.9 millions of vector evaluations/second -> 127 cycles/value on a 2000MHz computer benching sincos (x87) .. -> 2.8 millions of vector evaluations/second -> 175 cycles/value on a 2000MHz computer benching expf .. -> 2.0 millions of vector evaluations/second -> 239 cycles/value on a 2000MHz computer benching logf .. -> 2.6 millions of vector evaluations/second -> 192 cycles/value on a 2000MHz computer benching cephes_sinf .. -> 2.5 millions of vector evaluations/second -> 198 cycles/value on a 2000MHz computer benching cephes_cosf .. -> 2.8 millions of vector evaluations/second -> 176 cycles/value on a 2000MHz computer benching cephes_expf .. -> 0.9 millions of vector evaluations/second -> 546 cycles/value on a 2000MHz computer benching cephes_logf .. -> 1.3 millions of vector evaluations/second -> 370 cycles/value on a 2000MHz computer benching sin_ps .. -> 17.2 millions of vector evaluations/second -> 29 cycles/value on a 2000MHz computer benching cos_ps .. -> 17.3 millions of vector evaluations/second -> 29 cycles/value on a 2000MHz computer benching sincos_ps .. -> 15.5 millions of vector evaluations/second -> 32 cycles/value on a 2000MHz computer benching exp_ps .. -> 17.8 millions of vector evaluations/second -> 28 cycles/value on a 2000MHz computer benching log_ps .. -> 10.4 millions of vector evaluations/second -> 48 cycles/value on a 2000MHz computer

command line: `cl.exe /arch:SSE /O2 /TP /MD sse_mathfun_test.c`

benching sinf .. -> 1.3 millions of vector evaluations/second -> 303 cycles/value on a 1600MHz computer benching cosf .. -> 1.3 millions of vector evaluations/second -> 305 cycles/value on a 1600MHz computer benching sincos (x87) .. -> 1.2 millions of vector evaluations/second -> 314 cycles/value on a 1600MHz computer benching expf .. -> 1.6 millions of vector evaluations/second -> 244 cycles/value on a 1600MHz computer benching logf .. -> 1.4 millions of vector evaluations/second -> 276 cycles/value on a 1600MHz computer benching cephes_sinf .. -> 1.4 millions of vector evaluations/second -> 280 cycles/value on a 1600MHz computer benching cephes_cosf .. -> 1.5 millions of vector evaluations/second -> 265 cycles/value on a 1600MHz computer benching cephes_expf .. -> 0.7 millions of vector evaluations/second -> 548 cycles/value on a 1600MHz computer benching cephes_logf .. -> 0.8 millions of vector evaluations/second -> 489 cycles/value on a 1600MHz computer benching sin_ps .. -> 9.2 millions of vector evaluations/second -> 43 cycles/value on a 1600MHz computer benching cos_ps .. -> 9.5 millions of vector evaluations/second -> 42 cycles/value on a 1600MHz computer benching sincos_ps .. -> 8.8 millions of vector evaluations/second -> 45 cycles/value on a 1600MHz computer benching exp_ps .. -> 9.8 millions of vector evaluations/second -> 41 cycles/value on a 1600MHz computer benching log_ps .. -> 8.6 millions of vector evaluations/second -> 46 cycles/value on a 1600MHz computer

**update** (2008/03/14): If you are looking for extreme performance on SSE2 simple precision logf, have a look at the page of Hans R. Battenfeld, who is working on a version twice faster than the ones here.

Last modified: Sun May 29 23:59:10 CEST 2011