| Sidebar: Know the Code |
| A code profiler can help determine which parts of your application should be optimized. You should always use a profiler both before and after optimization to see if your change actually made an improvementwith C/C++, you can never be sure if your hand-tuning methods are better than those of your optimizing compiler! |
The smart engineers at AMD have spent a lot of time identifying places where C/C++ developers can optimize code for best performance. Some of these optimizations might look familiarthey're generally great coding practice for any 32-bit or 64-bit platform or operating system. In fact, most of these suggestions are appropriate whether you're developing 32-bit or 64-bit apps. Others are specifically for 64-bit applications running on the AMD64 architecture. This article will look at six optimizations; Part 2 will examine eight more.
#1. Declare as static functions that are not used outside the file where they are defined
Declaring a function as static forces an internal linkage within that file, which can improve the performance of the code. Functions that are not declared as static default to external linkage, which may inhibit certain optimizations, such as aggressive inlining, with some C/C++ compilers.
#2: Use array notation instead of pointer notation when working with arrays
You can use either array operators ([]) or pointers to access the elements of an array. However, it's hard for the compiler to optimize pointers because the pointers are, by default, less constrainedthe compiler has to assume that the pointer can read/write to any location in memory. Because array notation is less ambiguous, the compiler may be able to optimize it more effectively.
Because it's hard to know which way is more efficient, you may want to try coding your array reference using both pointers and array notation, and use a code profiler to see which way is fastest for your particular application.
#3. Completely unroll loops that have a small fixed loop count and a small loop body
As you know, many compilers do not aggressively unroll loops. Manually unrolling loops can benefit performance, because you're avoiding the loop overhead; for a small loop, that overhead can be significant.
For example, avoid a small loop like this:
// 3D-transform: Multiply vector V by 4x4 transform matrix M.
for (i = 0; i < 4; i++) {
r[i] = 0;
for (j = 0; j < 4; j++) {
r[i] += m[j][i] * v[j];
}
}
Instead, replace it with its completely unrolled equivalent, as shown here:
r[0] = m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2] + m[3][0] * v[3];
r[1] = m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2] + m[3][1] * v[3];
r[2] = m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2] + m[3][2] * v[3];
r[3] = m[0][3] * v[0] + m[1][3] * v[1] + m[2][3] * v[2] + m[3][3] * v[3];
#4. Consider expression
#4. Consider expression order when coding a compound branch
Complex branchesthose which consist of many Boolean expressions with logical AND (&&) and logical OR (||) operatorsprovide great opportunities for optimization, if you know which AND and OR conditions are more likely to occur.
That's because the C/C++ compiler will optimize the code to terminate execution of that complex branch as soon as it's sure if the operand evaluates to true or false, depending on the logic. For example, the branch conditions are satisfied as soon as it detects a true operand in an OR expression; similarly, the application knows that the branch will fail as soon as it sees a false operand in an AND expression.
You might experiment with changing the order of the operands if you know that a true/false outcome is more probable for a specific operand. For example, in an AND expression, place the operand most likely to be false up front. Of course, you'll need to make sure that changing these order of the operands won't cause any unwanted side effects.
#5. Leverage the early termination abilities of complex If statements
As seen above, as soon as a true operand is found in an OR expression, or a false one is found in an AND expression, execution of that particular expression is terminated because outcome is known. You can leverage that, if you have a good idea as to whether an entire expression is likely to be true or false, to try to short-cut execution of the condition.
For example, say you're testing to see if a and b are going to be within a specified numeric range. If you know that those numbers will generally be within that range, you can use the following because the most likely outcome will evaluate to false and thereby terminate the AND condition:
if (a <= max && a >= min && b <= max && b >= min)
If you know that the data will generally not be in a specified range, you can logically reverse the If statement to optimize execution in the same way, with a true terminating the OR condition:
if (a > max || a < min || b > max || b < min)
#6. Use if-else statements in place of switch statements that have noncontiguous case expressions
If the case expressions are contiguous or nearly contiguous integer values, most compilers translate the switch statement as a jump table instead of a comparison chain. Jump tables generally improve performance because they reduce the number of branches to a single procedure call, and shrink the size of the control-flow code no matter how many cases there are. The amount of control-flow code that the processor must execute is also the same for all values of the switch expression.
However, if the case expressions are noncontiguous values, most compilers translate the switch statement as a comparison chain. Comparison chains are undesirable because they use dense sequences of conditional branches, which interfere with the processor's ability to successfully perform branch prediction. Also, the amount of control-flow code increases with the number of cases, and the amount of control-flow code that the processor must execute varies with the value of the switch expression.
For example, if the case expression are contiguous integers, a switch statement can provide good performance:
switch (grade)
{
case 'A':
...
break;
case 'B':
...
break;
case 'C':
...
break;
case 'D':
...
break;
case 'F':
...
break;
But if the case expression aren't contiguous, the compiler may likely translate the code into a comparison chain instead of a jump table, and that can be slow:
switch (a)
{
case 8:
// Sequence for a==8
break;
case 16:
// Sequence for a==16
break;
...
default:
// Default sequence
break;
}
In cases like this, replace the switch with a series of if-else statements:
if (a==8) {
// Sequence for a==8
}
else if (a==16) {
// Sequence for a==16
}
...
else {
// Default sequence
}
Tune in next time
In the second part of this series, we'll look at eight more optimizations, focused on data structure alignment, integer operations and floating-point math.
A former mainframe software developer and systems analyst, Alan Zeichick is principal analyst at Camden Associates, an independent technology research firm focusing on networking, storage, and software development.