6/10/2010 

Dr. Gongyuan Zhuang, MTS Software Engineer
Ravindra Babu, Software Engineer
Bragadeesh Natarajan, Senior Software Engineer

Abstract

In this article, we discuss a new technique for using conditional elimination. It is suitable for streaming single-instruction/multiple-data (SIMD) extensions (SSE) instructions. We also applied this new technique for integer scaling and general division applications, and gained tremendous performance improvement.

Introduction

The round-to-even method, also known as unbiased rounding or statistician’s rounding, is a very popular type of rounding floating-point numbers to integral numbers. In this procedure, rounding a floating-point number up or down depends on the fractional part of that number. If the fractional part is less than 0.5, it is rounded down (round to the greatest integer less than the floating-point number). On the other hand, if the fractional part is greater than 0.5, it is rounded up (round to the least integer greater than the floating-point number). The interesting scenario is when the fractional part is exactly 0.5. In this case, the number must be rounded either up or down to yield an even integer.

This kind of rounding is very important for statisticians and bankers because the average error caused due to the rounding process tends towards zero. Statistically, this means that the rounding should not introduce a skew in either the positive or the negative direction. In another words, the mean offset should be stochastically null. This is also important in digital signal processing since there are many subsequent rounding operations over a large set of data.

In this article, we discuss a new method for round-to-even rounding with large-scale data processing. We also discuss its application with a typical computational problem in which two integer numbers are added and scaled down (division by another integer), rendering a floating-pointing number that needs to be rounded. We show our new technique, which can be implemented through streaming single-instruction/multiple-data (SIMD) extensions (SSE) instructions. We further expand this technique for the general division (no scaling) case. We compare results between traditional implementation and our optimized version running on x86 and x86-64 architectures using C++ on a Microsoft® Windows® platform.

Traditional Method of Round-to-Even Rounding

In the typical reference implementation, rounding to even can be achieved by writing conditional instructions (if-else statements) to control how a number is rounded depending on its fractional part. A flow chart representing the traditional method follows.

New Round-to-Even Technique for Large-scale Data and Its Application in Integer Scaling  - Traditional Method of Round-to-Even Rounding

From the flow chart, we might have to make as many as three conditional judgments to get the right answer for each data point. Not only is conditional judgment an expensive operation, but this technique will not be available for SSE optimization since it is impossible to use a single operation to handle multiple data points.

Innovative Technique with Conditional Elimination

To simplify the problem further, we create the following Lemma for explaining our theoretical improvement.

Lemma: If c = 1, Value = a; if c = 0, Value = b. Then, Value = a * c + b * (1 – c)

Proof: If c = 1, Value = a * 1 + b * (1 – 1) = a;

If c = 0, Value = a * 0 + b * (1 – 0) = b;

Both statements are the same.

By this Lemma, one conditional statement could be translated to two multiples — one add and one minus operation — but without judgment.

In a single-thread data set, this technique will seem to be almost useless. But branch instructions (if-else statements) are slow because processors — even if they have good branch prediction schemes — cannot execute them faster than a regular sequence of code with no branches. Therefore, it is always wise to write code with as few branches as possible.

With SSE instructions, we could handle 128-bit data at the same time. With the branch instructions, each data point must go through the conditional test and cannot be processed at the same time. With this Lemma, we have the chance to use more computations with no branch.

Most of the time, it may be possible to simplify the final calculation. We might not really need all operations. Just like our rounding problem, the following flow chart shows our new method of processing a set of data with the simplified final formula.

New Round-to-Even Technique for Large-scale Data and Its Application in Integer Scaling  - new method of processing

Proof of the flow chart’s correctness follows:

Assume floating-point number f (in A) has fraction part f_fr, integer part f_int.

(1) f_fr > 0.5 → b = f_int + 1, c = 0 → answer = f_int + 1
(2) f_fr < 0.5 → b = f_int, c = 0 → answer = f_int
(3) f_fr = 0.5 → b = f_int + 1

a. f_int last digit is odd → c = 0 → answer = f_int + 1

b. f_int last digit is even → c = 1 → answer = b-1 = f_int

This shows the answer is correct for all floating-point numbers as long as the conditional value C could be calculated.

Integer Scaling with the New Method

In the preceding section, the calculation of C is not clearly defined, and it might be in different forms when we deal with different issues. To illustrate how to design such a value, we will use the following integer scaling problem as an example.

A set of unsigned char data (as in image processing) needs to be shifted by integer value n. The rounding for the final result will be round to even.

For each input number x, n is the scale factor, assume y is the output, so we have the following for round to even:

y = { x >> n, 		fraction < 0.5
        (x >> n) + 1, 		fraction > 0.5
        (x >> n) + 1, 		fraction = 0.5 and (x >> n) is odd	(1)
        (x >> n),      		fraction = 0.5 and (x >> n) is even	
        }

Step 3 of the flow chart has a varied transformation in this case.

Truncate (x / (2^n) + 0.5) → (x + 2^(n – 1)) >> n

This common technique lets us avoid floating-point computing, which is much more expensive than integer computing. With this technique, we could re-arrange Equation (1) to include one fewer comparison.

y = {(x + 2^(n - 1) ) >> n,     fraction ! = 0.5
        (x + 2^(n - 1) ) >> n,      fraction = 0.5 and (x >> n) is odd	(2)
        ((x + 2^(n - 1)) >> n) - 1, fraction = 0.5 and (x >> n) is even	
        }

Since truncation loses precision, we can safely subtract 1, even when fraction ! = 0.5 and (x >> n) is even.

y =     {(x + 2^(n - 1) - 0) >> n, fraction ! = 0.5 and (x >> n) is odd
	(x + 2^(n - 1) - 1) >> n, fraction ! = 0.5 and (x >> n) is even
	(x + 2^(n - 1) - 0) >> n, fraction = 0.5 and (x >> n) is odd	(3)
	(x + 2^(n - 1) - 1) >> n, fraction = 0.5 and (x >> n) is even	
	}

Proof of Equation (3):

Assume x = x1 * 2^(n + 1) + x2 * 2^n + x3 * 2^(n – 1) + x4, and that x1 is an integer, x2 and x3 are 0 or 1, and x4 is an integer between 0 and 2^(n – 2).

(1)  fraction > 0.5 → x3 = 1, x4 ! = 0, y = (x >> n) + 1 = x1 * 2 + x2 + 1. (x + 2^(n - 1) - 0) >> n = (x1 * 2^(n + 1) + x2 * 2^n + 2^(n - 1) + x4 + 2^(n - 1)) >> n = x1 * 2 + x2 + 1 = y; (x + 2^(n - 1) -1) >> n = (x1 * 2^(n + 1) + x2 * 2^n + 2^(n - 1) + x4 + 2^(n - 1) - 1) >> n = (x1 * 2^(n + 1) + (x2 + 1) * 2^n + (x4 - 1) >> n = x1 * 2 + x2 + 1 = y.

(2)  fraction < 0.5 → x3 = 0, y = (x >> n) = x1 * 2 + x2. (x + 2^(n - 1) - 0) >> n = (x1 * 2^(n + 1) + x2 * 2^n + x4 + 2^(n - 1)) >> n = x1 * 2 + x2 = y; (x + 2^(n - 1) - 1) >> n = (x1 * 2^(n + 1) + x2 * 2^n + x4 + 2^(n - 1) - 1) >> n = x1 * 2 + x2 = y, Since x4 + 2^(n - 1) - 1 < = 2^(n - 1) + 2^(n - 2) – 1 < 2^n.

(3)  fraction = 0.5 and (x >> n) is even  →  x2 = 0, x3 = 1, x4 = 0, y = (x + 2^(n - 1) >> n) - 1 = x1 * 2 + 1 – 1 = x1 * 2 (x + 2^(n - 1) - 1) >> n = (x1 * 2^(n + 1) + 2^(n - 1) + 2^(n - 1) - 1) >> n = (x1 * 2^(n + 1) + 2^n - 1) >> n = x1 * 2 = y

(4)  fraction = 0.5 and (x >> n) is odd: we made no change.

This concludes that Equations (2) and (3) are equal.

We want Equation (3) since we want to simplify the conditional process, as shown in the following short equation:

y  = {	(x + 2^(n - 1) - 0) >> n, (x >> n) is odd
        (x + 2^(n - 1) - 1) >> n, (x >> n) is even	(4)
        }

Then we find the final equation, which is matched with our flow chart method.

y = { (x + 2^(n - 1) - C) >> n,   			 (5) 
}
where C = 1, if (x >> n) is even, and C = 0, if otherwise.

Here, the value of conditional judgment C can be obtained without using conditional instructions. After we perform (x >> n), we INVERT (bit-wise) the result and AND (bit-wise) it with 0×1. The result will be 1 if (x >> n) is even and 0 if (x >> n) is odd.

The optimized code follows (U8 is for unsigned char data, and U32 is for unsigned 32-bit integer).

void AddScaleOptimizedRef( const U8 *s1, const U8 *s2, U8 *d, U32 len, U32 scaleFactor )
{	
	U32 scaleFactorT = scaleFactor;
	U32 scaleHalf = (1U << (scaleFactorT - 1));
	for(U8 *end = (d +len); d> scaleFactorT)) & 1U))) >> scaleFactorT;
		*d = (U8)sAddScale;
	}
}

Optimized SSE Implementation for Integer Scaling

This section explains the implementation of our optimized algorithm using SSE compiler intrinsics. The algorithm is similar to the one already described, but takes the additional advantage of using SIMD instructions to enhance performance. The function AddScaleOptimizedSSE takes arguments identical to the optimized reference code function AddScaleOptimizedRef.

The implementation has two helper routines, Load1 and Scale8u. The helper function Load1 merely fills the given constant 16-bit integer value at eight locations in the XMM128 type, which can be alternately accessed as a __m128i type required for SSE intrinsic. The helper function Scale8u performs the actual scaling (division by (2^n)) and rounding to the nearest even integer. The optimized code implementation follows.

#define SHUF_IMM( a, b, c, d ) ( a | (b << 2) | (c<
union __declspec(align(16)) XMM128
{
	__m128i	i;
	int		s32[4];
	short	s16[8];
};

__forceinline void Load1( XMM128 & v, const short &value ){ 
		v.s16[0] = value; 
		v.s16[1] = value; 	
		v.i = _mm_shuffle_epi32( v.i, SHUF_IMM( 0,0,0,0 ) );
}

The scaling routine scale8u follows these steps to scale the given value and round it to the nearest even integer:

  1. Add xmm[1](maskRound) to the value to be scaled, which is equivalent to adding 0.5 after scaling to round to the integer.

srcDst.i = _mm_add_epi16(srcDst.i, xmm[1].i);

  1. Create a mask ANDing the above result with xmm[2](lower scale + 1 bits high) to clear the upper bits from the result.

mask.i = _mm_and_si128(srcDst.i, xmm[2].i);

  1. Compare the above mask with xmm[3](scale + 1 bit from LSB set high to test odd whole number).

mask.i = _mm_cmpeq_epi16(mask.i, xmm[3].i);

  1. Right-shift the above result by 15. If the number is an odd whole number, which implies the result after scaling an is even number with fractional part 0.5, then the variable mask for the corresponding result stores a 1, or else it contains 0.

mask.i = _mm_srli_epi16(mask.i, 15);

  1. Shift the input by scale bits (xmm[0])to perform scaling.

srcDst.i = _mm_srl_epi16(srcDst.i, xmm[0].i);

  1. Subtract the 1 or 0 present in the mask obtained in Step 4 from the result obtained in Step 5.

srcDst.i = _mm_sub_epi16(srcDst.i, mask.i);

__forceinline void Scale8u(XMM128 &srcDst, const XMM128 xmm[])
{
	XMM128 mask ;

	//1 << (scale - 1)
	srcDst.i	= _mm_add_epi16(srcDst.i, xmm[1].i); 

    	//All 1's for scale + 1 no of bits from right
	mask.i		= _mm_and_si128(srcDst.i, xmm[2].i);

    	//check for odd whole no, scale bits zero, scale + 1 bit 1 from right
	mask.i		= _mm_cmpeq_epi16(mask.i, xmm[3].i);
	mask.i		= _mm_srli_epi16(mask.i, 15);
	srcDst.i		= _mm_srl_epi16(srcDst.i, xmm[0].i);//scale
	srcDst.i		= _mm_sub_epi16(srcDst.i, mask.i);
}

The routine AddScaleOptimizedSSE performs the addition and rounds to the nearest even integer on a given buffer of 8-bit unsigned integers. The routine is divided into two parts.

The first part of the routine is a set of one-time calculations that are filled in xmm registers. maskRound contains the value 2^(scale – 1), which is added to the input before scaling. This is equivalent to adding 0.5 after scaling. maskCheckOdd is a mask variable that has the lower scale + 1 bits set, which is used to check if the number is odd after scaling. zeroUpperBits contains the lower scale + 1 bits set. All three values, along with the scale value, are filled into the xmm registers used in the scaling algorithm.

The second part of the routine is a loop in which addition and scaling is performed on each value in the buffer. It contains the following steps.

  1. The code does simple unaligned load for the two given source buffers, which loads 16 unsigned chars (8-bit unsigned integers).
  2. Unpacking the data from 8 bits to 16 bits into two registers. The unpacking is required to avoid overflow of values on addition.
  3. Perform the 16-bit SSE2 addition on the values present in the register.
  4. Call the scale routine explained previously, passing the result of addition and the pre-calculated values present in the xmm registers. (The result available in the two registers are packed and saturated from 16 bits to 8 bits using the SSE2 pack instruction.)
  5. Store the packed result to the given destination buffer using unsigned store operation.

Notes: if the given buffer is 16-byte aligned, then aligned load and store is preferred to the unaligned load and store operations to enhance performance.

// Optimized SSE2 code
void AddScaleOptimizedSSE( const U8 *s1, const U8 *s2, U8 *d, U32 len, U32 scaleFactor )
{	
	// One-time calculation
	int maskRound, maskCheckOdd, zeroUpperBits;
	maskRound = 1 << (scaleFactor-1);
	maskCheckOdd = 1 << (scaleFactor);

	M128 xmmData[4];
	xmmData[0].i = _mm_setzero_si128();
	xmmData[0].s32[0] = scaleFactor;

	Load1(xmmData[1], (short)maskRound);

	xmmData[2].i = _mm_set1_epi16(-1);
	zeroUpperBits = 16-scaleFactor-1;
	xmmData[2].i = _mm_srli_epi16(xmmData[2].i, zeroUpperBits);

	Load1(xmmData[3], (short)maskCheckOdd);
	// one-time calculation-end

	for( U8 *end = (d +len); d<end; d+=16,s1+=16,s2+=16
	{
	 	M128 src1L, src1H;
	 	M128 src2L, src2H;

	 	// checking for alignment and using _mm_load_si128 will further improve the performance
	 	src1L.i = _mm_loadu_si128( (__m128i*) s1 );
	 	src2L.i = _mm_loadu_si128( (__m128i*) s2 );

	 	//fill with all zero
	 	static const __m128i zero = _mm_setzero_si128();

	 	//unpack 8-bit unsigned integer to 16-bit signed integers to avoid overflow on addition
	 	src1H.i= _mm_unpackhi_epi8( src1L.i, zero );
	 	src1L.i = _mm_unpacklo_epi8( src1L.i, zero );

	 	src2H.i = _mm_unpackhi_epi8( src2L.i, zero );
	 	src2L.i = _mm_unpacklo_epi8( src2L.i, zero );

	 	//performing 16-bit addition on source values
	 	src1L.i = _mm_add_epi16(src1L.i, src2L.i);
	 	src1H.i = _mm_add_epi16(src1H.i, src2H.i);

	 	Scale8u(src1L, xmmData);
	 	Scale8u(src1H, xmmData);

	 	src1L.i = _mm_packus_epi16 (src1L.i,src1H.i) 

	 	// checking for alignment and using _mm_store_si128 will further improve the performance
	 	_mm_storeu_si128( (__m128i*) d, src1L.i );
	}
}

Comparison of Results for Integer Scaling Problem

We tested the algorithm using AMD Athlon™ 64 2800+ (1.8 GHz) and AMD Athlon 64 X2 3800+ processors. The following table shows those chips’ performance:

AMD Athlon 64 2800+

AMD Athlon 64 X2 3800+

Time(s)

vs. ref

Time(s)

vs. ref

Reference code

0.035427

100%

0.0315685

100%

Ref with new method

0.001523

2326%

0.00134807

2342%

SSE with new method

0.000608

5830%

0.00044779

7050%

General Division Case

Having applied the new method to integer scaling, the scaling operation was employed and, hence, division was completely avoided. However, if the dividing number (divisor) happens to be a number that is not a power of two, and then division cannot be omitted. The round-to-even method can be performed optimally in the general division case. In this section, we compare a typical reference implementation and an optimized reference implementation for the round-to-even method for a general divisor.

void DivideRoundEvenRef( const U8 *s, U8 *d, U32 len, U8 divisor )
{
	for( U8 *end = (d +len); d<end; ++d,++s )	
		*d = (U8) dtol(((double)*s)/((double)divisor));
}

In the preceding code, an integer buffer is being divided by a constant divisor and rounded to even. This routine uses the same “dtol” subroutine as the original reference implementation, and the code is self-explanatory. The optimized version of the code appears following this paragraph and its calculations. The idea is similar to the previous optimized reference code.

Assume number a is divided by number b and the result needs to be rounded to even. Assume x is the rounded result and y is the rounded-to-even result. To get the rounded result, we add 0.5 to (a / b) and truncate. The division operation performed in these formulas is integer division, so the result is always truncated.

x = (a / b + 0.5) = (a / b + 1/2) = ((2a + b) / (2b))

y = {	x, fraction(a / b) ! = 0.5
        x - 1, fraction(a / b) = 0.5 and x is odd
        x - 0, fraction(a / b) = 0.5 and x is even	}

y = {	x - C,	where	C = 1, fraction(a / b) = 0.5 and x is odd
			        C = 0, otherwise	}

To determine whether the fraction of (a / b) is 0.5 or not, we just have to see if 2bx is equal to (2a + b). This will happen only when the fractional part is exactly half. Checking for even or odd is also simple.

void DivideRoundEvenOptimizedRef( const U8 *s, U8 *d, U32 len, U8 divisor )
{
	U32 b = (U32)divisor;
	U32 b2 = b << 1;

	for( U8 *end = (d +len); d<end; ++d,++s )
	{
		U32 a = (U32)*s;
		U32 a2 = a << 1;

		U32 val = (a2 + b) / b2;
		U32 valA = val - ((val & 1U) & ((val*b2 - (a2 + b)) == 0 ? 1 : 0));
		*d = (U8)valA;
	}
}

In this case, we also gain substantial performance improvements, as shown in the following table.

AMD Athlon 64 2800+ AMD Athlon 64 X2 3800+

Time(s)

vs. ref

Time(s)

vs. ref

Division ref

0.037791

100%

0.032482

100%

Ref with new method

0.002951

1280%

0.00202441

1641%

Conclusions

From the applications described in this paper, we clearly demonstrate a powerful new conditional elimination technique for improving data handling performance. We believe users could follow similar deduction procedures of integer scaling and establish the conditional formula for their own applications. We could also further improve the performance with SSE instructions.

Reference:

  1. http://en.wikipedia.org/wiki/Round_to_even
  2. http://www.diycalculator.com/popup-m-round.shtml
  3. A comparison of three rounding algorithms for IEEE floating-point multiplicationEven, G.; Seidel, P.-M.;Computers, IEEE Transactions on Volume 49, Issue 7, July 2000, pp. 638–650.