I have spent quite a while looking for a simple (but fast) SSE version of some basic transcendental functions (sines and exponential). On the mac, you get the vsinf and friends (in the Accelerate framework) which are nice (there is a ppc version and an intel version, Apple rox) but closed-source, and restricted to macos..
Both Intel and AMD have some sort of vector math library with SIMD sines and cosines, but
Some time ago, I found out the Intel Approximate Math library. This one is completely free and open-source, and it provides SSE and SSE2 versions of many functions. But it has two drawbacks:
I chose to write them in pure SSE1+MMX so that they run on the pentium III of your grand mother, and also on my brave athlon-xp, since thoses beast are not SSE2 aware. Intel AMath showed me that the performance gain for using SSE2 for that purpose was not large enough (10%) to consider providing an SSE2 version (but it can be done very quickly). Update: I finally did that SSE2 version -- see below.
The functions use only the _mm_ intrinsics , there is no inline assembly in the code. Advantage: easier to debug, works out of the box on 64 bit setups, let the compiler choose what should be stored in a register, and what is stored in memory. Inconvenient: some versions of gcc 3.x are badly broken with certain intrinsic functions ( _mm_movehl_ps , _mm_cmpeq_ps etc). Mingw's gcc for example -- beware that the brokeness is dependent on the optimization level. A workaround is provided (inline asm replacement for the braindead intrinsics), it is not nice but robust, and broken compilers are detected by the validation program below.
I first tried to improve the AMath functions by using longer minimax polynomial approximations for sine, but of course it failed to achieve full precision because of rounding errors in the polynom, and in the computation of x modulo Pi. So I took a look at the implementation of these functions in the cephes library, noticed that they were simpler than what I imagined and contained very few branches, and just translated them in SSE intrinsics. The sincos_ps is nice because you get magically a free sine for each cosine you compute, so it is almost as fast as the sin_ps and the cos_ps.
Of course it is not IEEE compliant, but the max absolute error on sines is 2^-24 on the range [-8192,8192]. The SSE implementation matches exactly the cephes one if you do not use the x87 fpu when compiling the test program ( -mfpmath=sse with gcc)
command line: gcc -O2 -Wall -W -DHAVE_VECLIB -msse sse_mathfun_test.c -framework Accelerate
exp([ -1000, -100, 100, 1000]) = [ 0, 0, 2.4061436e+38, 2.4061436e+38] exp([ nan, inf, -inf, nan]) = [2.4061436e+38, 2.4061436e+38, 0, 2.4061436e+38] log([ 0, -10, 1e+30, 1.0005271e-42]) = [ -87.336548, -87.336548, 69.077553, -87.336548] log([ nan, inf, -inf, nan]) = [ -87.336548, 88.722839, -87.336548, -87.336548] sin([ nan, inf, -inf, nan]) = [ nan, nan, nan, nan] cos([ nan, inf, -inf, nan]) = [ nan, nan, nan, nan] sin([ -1e+30, -100000, 1e+30, 100000]) = [ inf, -0.035749275, -inf, 0.035749275] cos([ -1e+30, -100000, 1e+30, 100000]) = [ nan, -0.9993608, nan, -0.9993608] benching sinf .. -> 3.2 millions of vector evaluations/second -> 142 cycles/value on a 1830MHz computer benching cosf .. -> 3.2 millions of vector evaluations/second -> 142 cycles/value on a 1830MHz computer benching sincos (x87) .. -> 2.8 millions of vector evaluations/second -> 161 cycles/value on a 1830MHz computer benching expf .. -> 3.0 millions of vector evaluations/second -> 148 cycles/value on a 1830MHz computer benching logf .. -> 3.0 millions of vector evaluations/second -> 150 cycles/value on a 1830MHz computer benching cephes_sinf .. -> 4.5 millions of vector evaluations/second -> 100 cycles/value on a 1830MHz computer benching cephes_cosf .. -> 4.9 millions of vector evaluations/second -> 92 cycles/value on a 1830MHz computer benching cephes_expf .. -> 3.0 millions of vector evaluations/second -> 151 cycles/value on a 1830MHz computer benching cephes_logf .. -> 2.6 millions of vector evaluations/second -> 172 cycles/value on a 1830MHz computer benching sin_ps .. -> 18.1 millions of vector evaluations/second -> 25 cycles/value on a 1830MHz computer benching cos_ps .. -> 18.2 millions of vector evaluations/second -> 25 cycles/value on a 1830MHz computer benching sincos_ps .. -> 15.0 millions of vector evaluations/second -> 30 cycles/value on a 1830MHz computer benching exp_ps .. -> 17.6 millions of vector evaluations/second -> 26 cycles/value on a 1830MHz computer benching log_ps .. -> 15.5 millions of vector evaluations/second -> 29 cycles/value on a 1830MHz computer benching vsinf .. -> 14.3 millions of vector evaluations/second -> 32 cycles/value on a 1830MHz computer benching vcosf .. -> 14.4 millions of vector evaluations/second -> 32 cycles/value on a 1830MHz computer benching vexpf .. -> 12.0 millions of vector evaluations/second -> 38 cycles/value on a 1830MHz computer benching vlogf .. -> 13.1 millions of vector evaluations/second -> 35 cycles/value on a 1830MHz computer
command line: gcc-4.2 -msse -O3 -Wall -W sse_mathfun_test.c -lm -DHAVE_ACML -I /usr/local/acml3.6.0/pathscale64/include /usr/local/acml3.6.0/pathscale64/lib/libacml_mv.a
benching sinf .. -> 6.3 millions of vector evaluations/second -> 103 cycles/value on a 2600MHz computer benching cosf .. -> 5.6 millions of vector evaluations/second -> 115 cycles/value on a 2600MHz computer benching sincos (x87) .. -> 4.2 millions of vector evaluations/second -> 153 cycles/value on a 2600MHz computer benching expf .. -> 1.1 millions of vector evaluations/second -> 546 cycles/value on a 2600MHz computer benching logf .. -> 4.7 millions of vector evaluations/second -> 138 cycles/value on a 2600MHz computer benching cephes_sinf .. -> 11.6 millions of vector evaluations/second -> 56 cycles/value on a 2600MHz computer benching cephes_cosf .. -> 8.7 millions of vector evaluations/second -> 74 cycles/value on a 2600MHz computer benching cephes_expf .. -> 3.7 millions of vector evaluations/second -> 172 cycles/value on a 2600MHz computer benching cephes_logf .. -> 5.5 millions of vector evaluations/second -> 117 cycles/value on a 2600MHz computer benching sin_ps .. -> 26.1 millions of vector evaluations/second -> 25 cycles/value on a 2600MHz computer benching cos_ps .. -> 26.1 millions of vector evaluations/second -> 25 cycles/value on a 2600MHz computer benching sincos_ps .. -> 23.7 millions of vector evaluations/second -> 27 cycles/value on a 2600MHz computer benching exp_ps .. -> 22.9 millions of vector evaluations/second -> 28 cycles/value on a 2600MHz computer benching log_ps .. -> 21.6 millions of vector evaluations/second -> 30 cycles/value on a 2600MHz computer benching acml vrs4_sinf .. -> 17.9 millions of vector evaluations/second -> 36 cycles/value on a 2600MHz computer benching acml vrs4_cosf .. -> 18.3 millions of vector evaluations/second -> 35 cycles/value on a 2600MHz computer benching acml vrs4_expf .. -> 28.6 millions of vector evaluations/second -> 23 cycles/value on a 2600MHz computer benching acml vrs4_logf .. -> 23.6 millions of vector evaluations/second -> 27 cycles/value on a 2600MHz computer
command line: gcc -mfpmath=sse -msse -O2 -Wall -W sse_mathfun_test.c
benching sinf .. -> 3.4 millions of vector evaluations/second -> 144 cycles/value on a 2000MHz computer benching cosf .. -> 5.1 millions of vector evaluations/second -> 97 cycles/value on a 2000MHz computer benching sincos (x87) .. -> 2.3 millions of vector evaluations/second -> 214 cycles/value on a 2000MHz computer benching expf .. -> 1.8 millions of vector evaluations/second -> 272 cycles/value on a 2000MHz computer benching logf .. -> 2.5 millions of vector evaluations/second -> 200 cycles/value on a 2000MHz computer benching cephes_sinf .. -> 3.7 millions of vector evaluations/second -> 132 cycles/value on a 2000MHz computer benching cephes_cosf .. -> 3.2 millions of vector evaluations/second -> 153 cycles/value on a 2000MHz computer benching cephes_expf .. -> 1.2 millions of vector evaluations/second -> 407 cycles/value on a 2000MHz computer benching cephes_logf .. -> 1.4 millions of vector evaluations/second -> 355 cycles/value on a 2000MHz computer benching sin_ps .. -> 17.2 millions of vector evaluations/second -> 29 cycles/value on a 2000MHz computer benching cos_ps .. -> 17.0 millions of vector evaluations/second -> 29 cycles/value on a 2000MHz computer benching sincos_ps .. -> 14.7 millions of vector evaluations/second -> 34 cycles/value on a 2000MHz computer benching exp_ps .. -> 17.2 millions of vector evaluations/second -> 29 cycles/value on a 2000MHz computer benching log_ps .. -> 14.7 millions of vector evaluations/second -> 34 cycles/value on a 2000MHz computer
command line: cl.exe /arch:SSE /O2 /TP /MD sse_mathfun_test.c
benching sinf .. -> 3.1 millions of vector evaluations/second -> 160 cycles/value on a 2000MHz computer benching cosf .. -> 3.9 millions of vector evaluations/second -> 127 cycles/value on a 2000MHz computer benching sincos (x87) .. -> 2.8 millions of vector evaluations/second -> 175 cycles/value on a 2000MHz computer benching expf .. -> 2.0 millions of vector evaluations/second -> 239 cycles/value on a 2000MHz computer benching logf .. -> 2.6 millions of vector evaluations/second -> 192 cycles/value on a 2000MHz computer benching cephes_sinf .. -> 2.5 millions of vector evaluations/second -> 198 cycles/value on a 2000MHz computer benching cephes_cosf .. -> 2.8 millions of vector evaluations/second -> 176 cycles/value on a 2000MHz computer benching cephes_expf .. -> 0.9 millions of vector evaluations/second -> 546 cycles/value on a 2000MHz computer benching cephes_logf .. -> 1.3 millions of vector evaluations/second -> 370 cycles/value on a 2000MHz computer benching sin_ps .. -> 17.2 millions of vector evaluations/second -> 29 cycles/value on a 2000MHz computer benching cos_ps .. -> 17.3 millions of vector evaluations/second -> 29 cycles/value on a 2000MHz computer benching sincos_ps .. -> 15.5 millions of vector evaluations/second -> 32 cycles/value on a 2000MHz computer benching exp_ps .. -> 17.8 millions of vector evaluations/second -> 28 cycles/value on a 2000MHz computer benching log_ps .. -> 10.4 millions of vector evaluations/second -> 48 cycles/value on a 2000MHz computer
command line: cl.exe /arch:SSE /O2 /TP /MD sse_mathfun_test.c
benching sinf .. -> 1.3 millions of vector evaluations/second -> 303 cycles/value on a 1600MHz computer benching cosf .. -> 1.3 millions of vector evaluations/second -> 305 cycles/value on a 1600MHz computer benching sincos (x87) .. -> 1.2 millions of vector evaluations/second -> 314 cycles/value on a 1600MHz computer benching expf .. -> 1.6 millions of vector evaluations/second -> 244 cycles/value on a 1600MHz computer benching logf .. -> 1.4 millions of vector evaluations/second -> 276 cycles/value on a 1600MHz computer benching cephes_sinf .. -> 1.4 millions of vector evaluations/second -> 280 cycles/value on a 1600MHz computer benching cephes_cosf .. -> 1.5 millions of vector evaluations/second -> 265 cycles/value on a 1600MHz computer benching cephes_expf .. -> 0.7 millions of vector evaluations/second -> 548 cycles/value on a 1600MHz computer benching cephes_logf .. -> 0.8 millions of vector evaluations/second -> 489 cycles/value on a 1600MHz computer benching sin_ps .. -> 9.2 millions of vector evaluations/second -> 43 cycles/value on a 1600MHz computer benching cos_ps .. -> 9.5 millions of vector evaluations/second -> 42 cycles/value on a 1600MHz computer benching sincos_ps .. -> 8.8 millions of vector evaluations/second -> 45 cycles/value on a 1600MHz computer benching exp_ps .. -> 9.8 millions of vector evaluations/second -> 41 cycles/value on a 1600MHz computer benching log_ps .. -> 8.6 millions of vector evaluations/second -> 46 cycles/value on a 1600MHz computer
update (2008/03/14): If you are looking for extreme performance on SSE2 simple precision logf, have a look at the page of Hans R. Battenfeld, who is working on a version twice faster than the ones here.