6/10/2010 
Dr. Gongyuan Zhuang, MTS Software Engineer
Ravindra Babu, Software Engineer
Bragadeesh Natarajan, Senior Software Engineer
Abstract
In this article, we discuss a new technique for using conditional elimination. It is suitable for streaming singleinstruction/multipledata (SIMD) extensions (SSE) instructions. We also applied this new technique for integer scaling and general division applications, and gained tremendous performance improvement.
Introduction
The roundtoeven method, also known as unbiased rounding or statistician’s rounding, is a very popular type of rounding floatingpoint numbers to integral numbers. In this procedure, rounding a floatingpoint number up or down depends on the fractional part of that number. If the fractional part is less than 0.5, it is rounded down (round to the greatest integer less than the floatingpoint number). On the other hand, if the fractional part is greater than 0.5, it is rounded up (round to the least integer greater than the floatingpoint number). The interesting scenario is when the fractional part is exactly 0.5. In this case, the number must be rounded either up or down to yield an even integer.
This kind of rounding is very important for statisticians and bankers because the average error caused due to the rounding process tends towards zero. Statistically, this means that the rounding should not introduce a skew in either the positive or the negative direction. In another words, the mean offset should be stochastically null. This is also important in digital signal processing since there are many subsequent rounding operations over a large set of data.
In this article, we discuss a new method for roundtoeven rounding with largescale data processing. We also discuss its application with a typical computational problem in which two integer numbers are added and scaled down (division by another integer), rendering a floatingpointing number that needs to be rounded. We show our new technique, which can be implemented through streaming singleinstruction/multipledata (SIMD) extensions (SSE) instructions. We further expand this technique for the general division (no scaling) case. We compare results between traditional implementation and our optimized version running on x86 and x8664 architectures using C++ on a Microsoft® Windows® platform.
Traditional Method of RoundtoEven Rounding
In the typical reference implementation, rounding to even can be achieved by writing conditional instructions (ifelse statements) to control how a number is rounded depending on its fractional part. A flow chart representing the traditional method follows.
From the flow chart, we might have to make as many as three conditional judgments to get the right answer for each data point. Not only is conditional judgment an expensive operation, but this technique will not be available for SSE optimization since it is impossible to use a single operation to handle multiple data points.
Innovative Technique with Conditional Elimination
To simplify the problem further, we create the following Lemma for explaining our theoretical improvement.
Lemma: If c = 1, Value = a; if c = 0, Value = b. Then, Value = a * c + b * (1 – c)
Proof: If c = 1, Value = a * 1 + b * (1 – 1) = a;
If c = 0, Value = a * 0 + b * (1 – 0) = b;
Both statements are the same.
By this Lemma, one conditional statement could be translated to two multiples — one add and one minus operation — but without judgment.
In a singlethread data set, this technique will seem to be almost useless. But branch instructions (ifelse statements) are slow because processors — even if they have good branch prediction schemes — cannot execute them faster than a regular sequence of code with no branches. Therefore, it is always wise to write code with as few branches as possible.
With SSE instructions, we could handle 128bit data at the same time. With the branch instructions, each data point must go through the conditional test and cannot be processed at the same time. With this Lemma, we have the chance to use more computations with no branch.
Most of the time, it may be possible to simplify the final calculation. We might not really need all operations. Just like our rounding problem, the following flow chart shows our new method of processing a set of data with the simplified final formula.
Proof of the flow chart’s correctness follows:
Assume floatingpoint number f (in A) has fraction part f_fr, integer part f_int.
(1) f_fr > 0.5 → b = f_int + 1, c = 0 → answer = f_int + 1
(2) f_fr < 0.5 → b = f_int, c = 0 → answer = f_int
(3) f_fr = 0.5 → b = f_int + 1
a. f_int last digit is odd → c = 0 → answer = f_int + 1
b. f_int last digit is even → c = 1 → answer = b1 = f_int
This shows the answer is correct for all floatingpoint numbers as long as the conditional value C could be calculated.
Integer Scaling with the New Method
In the preceding section, the calculation of C is not clearly defined, and it might be in different forms when we deal with different issues. To illustrate how to design such a value, we will use the following integer scaling problem as an example.
A set of unsigned char data (as in image processing) needs to be shifted by integer value n. The rounding for the final result will be round to even.
For each input number x, n is the scale factor, assume y is the output, so we have the following for round to even:
y = { x >> n, fraction < 0.5 (x >> n) + 1, fraction > 0.5 (x >> n) + 1, fraction = 0.5 and (x >> n) is odd (1) (x >> n), fraction = 0.5 and (x >> n) is even }
Step 3 of the flow chart has a varied transformation in this case.
Truncate (x / (2^n) + 0.5) → (x + 2^(n – 1)) >> n
This common technique lets us avoid floatingpoint computing, which is much more expensive than integer computing. With this technique, we could rearrange Equation (1) to include one fewer comparison.
y = {(x + 2^(n  1) ) >> n, fraction ! = 0.5 (x + 2^(n  1) ) >> n, fraction = 0.5 and (x >> n) is odd (2) ((x + 2^(n  1)) >> n)  1, fraction = 0.5 and (x >> n) is even }
Since truncation loses precision, we can safely subtract 1, even when fraction ! = 0.5 and (x >> n) is even.
y = {(x + 2^(n  1)  0) >> n, fraction ! = 0.5 and (x >> n) is odd (x + 2^(n  1)  1) >> n, fraction ! = 0.5 and (x >> n) is even (x + 2^(n  1)  0) >> n, fraction = 0.5 and (x >> n) is odd (3) (x + 2^(n  1)  1) >> n, fraction = 0.5 and (x >> n) is even }
Proof of Equation (3):
Assume x = x1 * 2^(n + 1) + x2 * 2^n + x3 * 2^(n – 1) + x4, and that x1 is an integer, x2 and x3 are 0 or 1, and x4 is an integer between 0 and 2^(n – 2).
(1) fraction > 0.5 → x3 = 1, x4 ! = 0, y = (x >> n) + 1 = x1 * 2 + x2 + 1. (x + 2^(n  1)  0) >> n = (x1 * 2^(n + 1) + x2 * 2^n + 2^(n  1) + x4 + 2^(n  1)) >> n = x1 * 2 + x2 + 1 = y; (x + 2^(n  1) 1) >> n = (x1 * 2^(n + 1) + x2 * 2^n + 2^(n  1) + x4 + 2^(n  1)  1) >> n = (x1 * 2^(n + 1) + (x2 + 1) * 2^n + (x4  1) >> n = x1 * 2 + x2 + 1 = y. (2) fraction < 0.5 → x3 = 0, y = (x >> n) = x1 * 2 + x2. (x + 2^(n  1)  0) >> n = (x1 * 2^(n + 1) + x2 * 2^n + x4 + 2^(n  1)) >> n = x1 * 2 + x2 = y; (x + 2^(n  1)  1) >> n = (x1 * 2^(n + 1) + x2 * 2^n + x4 + 2^(n  1)  1) >> n = x1 * 2 + x2 = y, Since x4 + 2^(n  1)  1 < = 2^(n  1) + 2^(n  2) – 1 < 2^n. (3) fraction = 0.5 and (x >> n) is even → x2 = 0, x3 = 1, x4 = 0, y = (x + 2^(n  1) >> n)  1 = x1 * 2 + 1 – 1 = x1 * 2 (x + 2^(n  1)  1) >> n = (x1 * 2^(n + 1) + 2^(n  1) + 2^(n  1)  1) >> n = (x1 * 2^(n + 1) + 2^n  1) >> n = x1 * 2 = y (4) fraction = 0.5 and (x >> n) is odd: we made no change.
This concludes that Equations (2) and (3) are equal.
We want Equation (3) since we want to simplify the conditional process, as shown in the following short equation:
y = { (x + 2^(n  1)  0) >> n, (x >> n) is odd (x + 2^(n  1)  1) >> n, (x >> n) is even (4) }
Then we find the final equation, which is matched with our flow chart method.
y = { (x + 2^(n  1)  C) >> n, (5) } where C = 1, if (x >> n) is even, and C = 0, if otherwise.
Here, the value of conditional judgment C can be obtained without using conditional instructions. After we perform (x >> n), we INVERT (bitwise) the result and AND (bitwise) it with 0x1. The result will be 1 if (x >> n) is even and 0 if (x >> n) is odd.
The optimized code follows (U8 is for unsigned char data, and U32 is for unsigned 32bit integer).
void AddScaleOptimizedRef( const U8 *s1, const U8 *s2, U8 *d, U32 len, U32 scaleFactor ) { U32 scaleFactorT = scaleFactor; U32 scaleHalf = (1U << (scaleFactorT  1)); for(U8 *end = (d +len); d> scaleFactorT)) & 1U))) >> scaleFactorT; *d = (U8)sAddScale; } }
Optimized SSE Implementation for Integer Scaling
This section explains the implementation of our optimized algorithm using SSE compiler intrinsics. The algorithm is similar to the one already described, but takes the additional advantage of using SIMD instructions to enhance performance. The function AddScaleOptimizedSSE takes arguments identical to the optimized reference code function AddScaleOptimizedRef.
The implementation has two helper routines, Load1 and Scale8u. The helper function Load1 merely fills the given constant 16bit integer value at eight locations in the XMM128 type, which can be alternately accessed as a __m128i type required for SSE intrinsic. The helper function Scale8u performs the actual scaling (division by (2^n)) and rounding to the nearest even integer. The optimized code implementation follows.
#define SHUF_IMM( a, b, c, d ) ( a  (b << 2)  (c< union __declspec(align(16)) XMM128 { __m128i i; int s32[4]; short s16[8]; }; __forceinline void Load1( XMM128 & v, const short &value ){ v.s16[0] = value; v.s16[1] = value; v.i = _mm_shuffle_epi32( v.i, SHUF_IMM( 0,0,0,0 ) ); }
The scaling routine scale8u follows these steps to scale the given value and round it to the nearest even integer:
 Add xmm[1](maskRound) to the value to be scaled, which is equivalent to adding 0.5 after scaling to round to the integer.
srcDst.i = _mm_add_epi16(srcDst.i, xmm[1].i);
 Create a mask ANDing the above result with xmm[2](lower scale + 1 bits high) to clear the upper bits from the result.
mask.i = _mm_and_si128(srcDst.i, xmm[2].i);
 Compare the above mask with xmm[3](scale + 1 bit from LSB set high to test odd whole number).
mask.i = _mm_cmpeq_epi16(mask.i, xmm[3].i);
 Rightshift the above result by 15. If the number is an odd whole number, which implies the result after scaling an is even number with fractional part 0.5, then the variable mask for the corresponding result stores a 1, or else it contains 0.
mask.i = _mm_srli_epi16(mask.i, 15);
 Shift the input by scale bits (xmm[0])to perform scaling.
srcDst.i = _mm_srl_epi16(srcDst.i, xmm[0].i);
 Subtract the 1 or 0 present in the mask obtained in Step 4 from the result obtained in Step 5.
srcDst.i = _mm_sub_epi16(srcDst.i, mask.i);
__forceinline void Scale8u(XMM128 &srcDst, const XMM128 xmm[]) { XMM128 mask ; //1 << (scale  1) srcDst.i = _mm_add_epi16(srcDst.i, xmm[1].i); //All 1's for scale + 1 no of bits from right mask.i = _mm_and_si128(srcDst.i, xmm[2].i); //check for odd whole no, scale bits zero, scale + 1 bit 1 from right mask.i = _mm_cmpeq_epi16(mask.i, xmm[3].i); mask.i = _mm_srli_epi16(mask.i, 15); srcDst.i = _mm_srl_epi16(srcDst.i, xmm[0].i);//scale srcDst.i = _mm_sub_epi16(srcDst.i, mask.i); }
The routine AddScaleOptimizedSSE performs the addition and rounds to the nearest even integer on a given buffer of 8bit unsigned integers. The routine is divided into two parts.
The first part of the routine is a set of onetime calculations that are filled in xmm registers. maskRound contains the value 2^(scale – 1), which is added to the input before scaling. This is equivalent to adding 0.5 after scaling. maskCheckOdd is a mask variable that has the lower scale + 1 bits set, which is used to check if the number is odd after scaling. zeroUpperBits contains the lower scale + 1 bits set. All three values, along with the scale value, are filled into the xmm registers used in the scaling algorithm.
The second part of the routine is a loop in which addition and scaling is performed on each value in the buffer. It contains the following steps.
 The code does simple unaligned load for the two given source buffers, which loads 16 unsigned chars (8bit unsigned integers).
 Unpacking the data from 8 bits to 16 bits into two registers. The unpacking is required to avoid overflow of values on addition.
 Perform the 16bit SSE2 addition on the values present in the register.
 Call the scale routine explained previously, passing the result of addition and the precalculated values present in the xmm registers. (The result available in the two registers are packed and saturated from 16 bits to 8 bits using the SSE2 pack instruction.)
 Store the packed result to the given destination buffer using unsigned store operation.
Notes: if the given buffer is 16byte aligned, then aligned load and store is preferred to the unaligned load and store operations to enhance performance.
// Optimized SSE2 code void AddScaleOptimizedSSE( const U8 *s1, const U8 *s2, U8 *d, U32 len, U32 scaleFactor ) { // Onetime calculation int maskRound, maskCheckOdd, zeroUpperBits; maskRound = 1 << (scaleFactor1); maskCheckOdd = 1 << (scaleFactor); M128 xmmData[4]; xmmData[0].i = _mm_setzero_si128(); xmmData[0].s32[0] = scaleFactor; Load1(xmmData[1], (short)maskRound); xmmData[2].i = _mm_set1_epi16(1); zeroUpperBits = 16scaleFactor1; xmmData[2].i = _mm_srli_epi16(xmmData[2].i, zeroUpperBits); Load1(xmmData[3], (short)maskCheckOdd); // onetime calculationend for( U8 *end = (d +len); d<end; d+=16,s1+=16,s2+=16 { M128 src1L, src1H; M128 src2L, src2H; // checking for alignment and using _mm_load_si128 will further improve the performance src1L.i = _mm_loadu_si128( (__m128i*) s1 ); src2L.i = _mm_loadu_si128( (__m128i*) s2 ); //fill with all zero static const __m128i zero = _mm_setzero_si128(); //unpack 8bit unsigned integer to 16bit signed integers to avoid overflow on addition src1H.i= _mm_unpackhi_epi8( src1L.i, zero ); src1L.i = _mm_unpacklo_epi8( src1L.i, zero ); src2H.i = _mm_unpackhi_epi8( src2L.i, zero ); src2L.i = _mm_unpacklo_epi8( src2L.i, zero ); //performing 16bit addition on source values src1L.i = _mm_add_epi16(src1L.i, src2L.i); src1H.i = _mm_add_epi16(src1H.i, src2H.i); Scale8u(src1L, xmmData); Scale8u(src1H, xmmData); src1L.i = _mm_packus_epi16 (src1L.i,src1H.i) // checking for alignment and using _mm_store_si128 will further improve the performance _mm_storeu_si128( (__m128i*) d, src1L.i ); } }
Comparison of Results for Integer Scaling Problem
We tested the algorithm using AMD Athlon™ 64 2800+ (1.8 GHz) and AMD Athlon 64 X2 3800+ processors.The following table shows those chips’ performance:
AMD Athlon 64 2800+ 
AMD Athlon 64 X2 3800+ 

Time(s) 
vs. ref 
Time(s) 
vs. ref 

Reference code 
0.035427 
100% 
0.0315685 
100% 
Ref with new method 
0.001523 
2326% 
0.00134807 
2342% 
SSE with new method 
0.000608 
5830% 
0.00044779 
7050% 
General Division Case
Having applied the new method to integer scaling, the scaling operation was employed and, hence, division was completely avoided. However, if the dividing number (divisor) happens to be a number that is not a power of two, and then division cannot be omitted. The roundtoeven method can be performed optimally in the general division case. In this section, we compare a typical reference implementation and an optimized reference implementation for the roundtoeven method for a general divisor.
void DivideRoundEvenRef( const U8 *s, U8 *d, U32 len, U8 divisor ) { for( U8 *end = (d +len); d<end; ++d,++s ) *d = (U8) dtol(((double)*s)/((double)divisor)); }
In the preceding code, an integer buffer is being divided by a constant divisor and rounded to even. This routine uses the same “dtol” subroutine as the original reference implementation, and the code is selfexplanatory. The optimized version of the code appears following this paragraph and its calculations. The idea is similar to the previous optimized reference code.
Assume number a is divided by number b and the result needs to be rounded to even. Assume x is the rounded result and y is the roundedtoeven result. To get the rounded result, we add 0.5 to (a / b) and truncate. The division operation performed in these formulas is integer division, so the result is always truncated.
x = (a / b + 0.5) = (a / b + 1/2) = ((2a + b) / (2b)) y = { x, fraction(a / b) ! = 0.5 x  1, fraction(a / b) = 0.5 and x is odd x  0, fraction(a / b) = 0.5 and x is even } y = { x  C, where C = 1, fraction(a / b) = 0.5 and x is odd C = 0, otherwise }
To determine whether the fraction of (a / b) is 0.5 or not, we just have to see if 2bx is equal to (2a + b). This will happen only when the fractional part is exactly half. Checking for even or odd is also simple.
void DivideRoundEvenOptimizedRef( const U8 *s, U8 *d, U32 len, U8 divisor ) { U32 b = (U32)divisor; U32 b2 = b << 1; for( U8 *end = (d +len); d<end; ++d,++s ) { U32 a = (U32)*s; U32 a2 = a << 1; U32 val = (a2 + b) / b2; U32 valA = val  ((val & 1U) & ((val*b2  (a2 + b)) == 0 ? 1 : 0)); *d = (U8)valA; } }
In this case, we also gain substantial performance improvements, as shown in the following table.
AMD Athlon 64 2800+  AMD Athlon 64 X2 3800+  
Time(s) 
vs. ref 
Time(s) 
vs. ref 

Division ref 
0.037791 
100% 
0.032482 
100% 
Ref with new method 
0.002951 
1280% 
0.00202441 
1641% 
Conclusions
From the applications described in this paper, we clearly demonstrate a powerful new conditional elimination technique for improving data handling performance. We believe users could follow similar deduction procedures of integer scaling and establish the conditional formula for their own applications. We could also further improve the performance with SSE instructions.
Reference:
 http://en.wikipedia.org/wiki/Round_to_even
 A comparison of three rounding algorithms for IEEE floatingpoint multiplicationEven, G.; Seidel, P.M.;Computers, IEEE Transactions on Volume 49, Issue 7, July 2000, pp. 638–650.