Analysis
Consider the following C function (taken from the libsndfile project with only
minor modifications), which converts an array of floats to an array of ints
using the standard C cast.
    void f2i_array (float *fptr, int *buffer, int count)
    {   while (count)
        {   count -- ;
            buffer [count] = fptr [count] ;
        } ;
    }
As will be shown below in the benchmarking section, the standard C cast from
float (or double) to int is slow in comparison to a number of other conversion
methods.
The root cause of this problem becomes obvious when the assembler output of the
GNU C compiler (obtained using gcc -S) is viewed.
Neglecting the stack handling code at the start and end of the function, the
while loop is as follows (comments added):
1: .L363:
2: decl %edx ; decrement the count variable
3: flds (%ebx,%edx,4) ; load a float from input array
4: fnstcw -2(%ebp) ; store FPU control word
5: movw -2(%ebp),%si ; move FPU control word to si register
6: orw $3072,%si ; set the rounding control bits (0x0C00) to select truncation
7: movw %si,-4(%ebp) ; move si to the stack
8: fldcw -4(%ebp) ; load the modified value from stack into FPU control word
9: fistpl -8(%ebp) ; store floating point value as an integer on the stack
10: movl -8(%ebp),%eax ; move the integer value from stack to eax
11: fldcw -2(%ebp) ; restore FPU control word
12: movl %eax,(%ecx,%edx,4) ; move eax to output array
13: testl %edx,%edx ; test if count is zero
14: jne .L363 ; jump to label if not zero
The instruction which causes the real damage in this block is fldcw (FPU load
control word) on lines 8 and 11.
Whenever the FPU encounters this instruction it flushes its pipeline and loads the
control word before continuing operation.
The FPUs of modern CPUs like the Pentium III, Pentium IV and AMD Athlon rely on deep
pipelines to achieve higher peak performance.
Unfortunately, code that flushes the pipeline on every loop iteration, as the cast
above does, can reduce the floating point performance of the CPU to the level of a
non-pipelined FPU.
So why is the fldcw instruction used?
Unfortunately, it is required to make the calculation meet the ISO C Standard which
specifies that casting from floating point to integer is a truncation operation.
However, if the fistpl instruction were executed without changing the mode
of the FPU, the value would be rounded instead of truncated.
The standard rounding mode is required for all normal operations like addition,
subtraction, multiplication etc., while truncation mode is required for the float to
int cast.
Hence if a block of code contains a float to int cast, the FPU will spend a large
amount of its time switching between the two modes.
Removing the instructions dealing with changing the FPU mode results in a loop that
looks like this:
    .L363:
    decl %edx ; decrement the count variable
    flds (%ebx,%edx,4) ; load a float from input array
    fistpl (%ecx,%edx,4) ; store as an int in the output array
    testl %edx,%edx ; is count zero?
    jne .L363 ; if not, jump to label
Instead of using truncation, the above loop performs a rounding operation and
doesn't adversely affect the FPU pipeline.
In addition, since this loop contains far fewer instructions than the previous one,
it executes more quickly.
The programmer must decide whether the rounding operation is an acceptable substitute
for truncation.
In most audio applications it would be.
A C Solution
Fortunately, the 1999 ISO C Standard defines two functions which were not a part
of earlier versions of the standard.
These functions round doubles and floats to long ints and have the following
function prototypes:
long int lrint (double x) ;
long int lrintf (float x) ;
These functions are declared in <math.h> but are only usable with the
GNU C compiler if C99 extensions have been enabled before <math.h> is
included.
This is done as follows:
#define _ISOC9X_SOURCE 1
#define _ISOC99_SOURCE 1
#include <math.h>
Both versions of the define are given so that the required functions are also
picked up with older header files.
In the header files of glibc (the standard C library on Linux), these
functions are defined as inline functions and are in fact inlined by gcc
(the standard C compiler on Linux) when optimisation is switched on.
If optimisation is switched off, the functions are not inlined and an executable calling
these functions will need to be linked with the maths library.
The original C code can now be modified to use one of these functions:
    void f2i_array (float *fptr, int *buffer, int count)
    {   while (count)
        {   count -- ;
            buffer [count] = lrintf (fptr [count]) ;
        } ;
    }
which generates the following assembler (again, just looking at the assembler within
the while loop):
    .L363:
    decl %edx ; decrement the count variable
    flds (%ebx,%edx,4) ; load a float from the input array
    #APP
    fistpl -4(%ebp) ; convert float to int and store on stack
    #NO_APP
    movl -4(%ebp),%eax ; load value from the stack to eax
    movl %eax,(%ecx,%edx,4) ; store eax in output array
    testl %edx,%edx ; is count zero?
    jne .L363 ; if not, jump to label
The new assembler does contain a bit more data shuffling than the optimal
assembler version but the C code is guaranteed to be portable across CPUs and
compilers which meet the C99 standard.