If you've read my previous articles about AMD's performance tuning software,
"Got Bottlenecks? CodeAnalyst
Can Help" and "Cache
or Check? CodeAnalyst Addresses Performance Deficits," you've seen
how to perform timer-based and event-based profiling of your applications. I'm
not going to review what CodeAnalyst does herethat's been covered adequately
(I hope!) in the previous two articles. This time around, let's dispense with
the 50,000-foot overview, and dive into how CodeAnalyst can help you track down
performance problems in applications.
For this example, we're going to start with a Visual C++ application, which
was provided to me by Paul Drongowski, of AMD's Boston Design Center; I asked
Paul to provide me with a short app that could highlight a typical performance
problem. Before we continue, you might print out Listing
1, which implements a simple matrix multiplication algorithm using a nested
loop. It's a simple algorithm.
The main routine, sets up a 1000 by 1000 matrix, calls the multiply_matrices()
function, and writes the elapsed execution time out to a file, so you can see
how long the application takes to run.
Here's the multiply_matrices() functionget to know it, because we'll
be spending a lot of time with it.
void multiply_matrices()
{
// Multiply the two matrices
for (int i = 0 ; i < ROWS ; i++) {
for (int j = 0 ; j < COLUMNS ; j++) {
float sum = 0.0 ;
for (int k = 0 ; k < COLUMNS ; k++) {
sum = sum + matrix_a[i][k] * matrix_b[k][j] ;
}
matrix_r[i][j] = sum ;
}
}
}
Figure 1. The matrix multiplication took 10.39 seconds.
Straightforward, eh? Because the algorithm is easy to read and program, it should
run fast. Let's find out just how efficient it is.
I compiled and linked the program, called classic.cpp, using Visual Studio
.NET 2003. On my dual-processor workstation, the runtime was 10.39 seconds (see
Figure 1). I wonder if we can make it go faster.

Figure 2. CodeAnalyst settings for running a timer-based profile.
To get a handle on execution, we'll use CodeAnalyst. If you haven't already
installed CodeAnalyst, you can download it from here. The application is freeso
there's no reason not to download it and start exploring. The version that I'm
using is 2.45.61.
Figure 2 shows the settings that I used to configure CodeAnalyst to run a timer-based
profile of the application. A timer-based profile takes a snapshot of the system
at regular intervals, and will show you what each processor is doing during
each snapshot.
In order for CodeAnalyst to do its magic, by the way, you'll want to ensure
that Visual C++ is outputting full debug information (of course). You'll also
need to configure CodeAnalyst so that it can find the debug info, under Tools
-> CodeAnalyst Options -> Directories, as shown in Figure 3.
Figure 3. Be sure to set the proper debug file search pathCodeAnalyst
needs to know!
After CodeAnalyst is configured, press the Run button (the sideways triangle)
to begin the test. You'll see something like what's in Figure
4 under the default System Data tab. It shows that during the execution
period of classic.exe, CodeAnalyst took just over 20,000 processor samples.
Nearly all of the time was eaten up, as you'd expect, by the 32-bit classic.exe
application, some of which ran on processor 0, but some of which was allocated
to processor 1 by Windows XP Professional x64 Edition. Some time was also consumed
by the processor driver (amdk8.sys), 32-bit Windows subsystem (win32k.sys) and
64-bit kernel (ntoskrnl.exe), and then it goes down from there.
If you click on the System Graph tab, you get a graphical view of this same
data, and the Processes tab shows you what else is running on your computer.
The Processes tab shows you, well, processes. We don't care about thosewe
want to get into the application and see why it's running so slowly. So, sorry,
no screen shots.
Back on the System Data tab, double-click classic.exe. You'll get a screen
like Figure
5, showing the extended instruction address (CS:EIP), its symbol, and its
sample counts. (Yes, yes, I messed something up and didn't load the symbol table
properly. I don't want to hear about it.) Ignore the lack of symbol information
and double-click the first item, which shows the line of assembly code that
took up the most time883 samples, or 4.33 percent of the execution time.
You'll see an assembly breakdown of the code region around that instruction,
such as in Figure
6.
Do you want to see assembly code? Soon, yes, but first, let's figure out which
line of source code is sucking down those CPU cycles. Unclick the Asm button,
and you'll see the C++ source, as shown on Figure
7. You will see that the line below was responsible for 10007 samples during
the timer-based profilingthat's about half of the entire execution!
sum = sum + matrix_a[i][k] * matrix_b[k][j] ;
Figure 7. Surprise! The matrix multiplier, nested inside the loops, is chewing
up all the processor time.
As you scroll through the source listing, you can see other parts of the code
where there were many samples. (Code that doesn't have a sample count was probably
executed, but not frequently enough to be sampled.) During my sample run, only
two other lines stood out as having more than 10 time samplesa float conversion
and an arithmetic assignment.
To understand why this is running so slowly, click on the Asm button again,
and scroll. You can see the full assembly breakdown of this nasty statement,
as shown in Figure
8. Ouch. The processors seem to be spending a lot of time with a few floating-point-add-and-pop-the-stack
operations. I wonder what's going on with those FADDP instructions, and why
they're running so slowly.
Let's use the event-based profiling features of CodeAnalyst to see more of
what the processor's up to. Under Tools -> Project Options, change to an
event trigger, as shown in Figure
9.

Figure 9. The only change to switch to event-type profiling is to change the
session type, lower left.
Unlike with timer-based profiling, event-based profiling requires that you set
additional parameters, which tells CodeAnalyst which processor events to monitor.
You can monitor as many as four at a time.
We'll begin by following the trail of the FADDP instruction: looking at what
the floating point unit is doing. Under the Event Config tab, select Dispatched
FPU operations (that's event code 0x00) for counter 0, as in Figure
10. Choose dispatch stall when the FPU is full (event code 0xd7) for counter
1, as in Figure
11. And select FPU exceptions (event code 0xdb) for counter 2, as in Figure
12. We're not going to choose a fourth counter. Then run CodeAnalyst again.
The results will be as in Figure
13.
Let's interpret those results. CodeAnalyst recorded about 2.5 million processor
events based on those three counters during our test run. There weren't any
exceptions, but there were a lot of dispatched FPU operations, and more than
2.2 million instances when the FPU stalled.
Folks, I think we've found our problem. Drilling down into the assembly code
confirms it; as shown in Figure
14, those FADDP instructions are definitely stalling out the FPU and delaying
execution. Don't you hate when that happens? My guess was that the FADDP instructions
could be stalling the processor because there's a memory access issue, perhaps
caused by the way the nested loops are accessing the matrix elements. How can
we tell? By checking the data caches and data translation lookaside buffer.
That means going back to CodeAnalyst and setting up another test run.
Figure 14. The floating point operations are the culprit. But why?
Let's set up another event-based profile, this time looking at the data cache
misses (event code 0x41) as counter 0, as in Figure
15, and data translation lookaside buffer misses (event code 0x46) as counter
1, as in Figure
16. Be sure to reset counter 2 to a blank state, since we don't need to
count FPU exceptions again. Then run the test again.
As you can see from Figure
17 and Figure
18, we've sure got a lot of both of those performance robbing events. Why
do we have cache misses? Because the access patterns for the matrix data jump
with big strides, which means that the cache and data translation lookaside
buffer aren't preloaded with the right dataand that causes the FPU to
stall.
So, now we know what to do: Focus our refactoring efforts on how the data is
accessed, so that it's more sequential, and thus the right data will be in the
cache and lookaside buffer. To make a long story short, Paul suggests an alternative
routine that changes the nesting of the innermost loop:
void multiply_matrices()
{
// Multiply the two matrices
for (int i = 0 ; i < ROWS ; i++) {
for (int k = 0 ; k < COLUMNS ; k++) {
for (int j = 0 ; j < COLUMNS ; j++) {
matrix_r[i][j] = matrix_r[i][j] +
matrix_a[i][k] * matrix_b[k][j] ;
}
}
}
}

Figure 19. The rewritten algorithm cuts total runtime five-fold, to 1.92 seconds.
You can download this version of the program, called interchange.cpp, from Listing
2. How does it do? Elapsed runtime: 1.92 secondsa better than five-fold
improvement (see Figure 19).
What does CodeAnalyst think? Figure
20 shows the timer profilethere are fewer total samples, because the
program ran faster. Figure
21 shows the source code breakdownyes, there are still a lot of samples
at the matrix multiplication, but there are far, far fewer than we originally
had.
Figure 22. Changing the order that the loops iterate might make the source code
slightly more complicated, but it sure makes the cache and buffers run more
efficiently.
What about those data cache misses and data translation lookaside buffer misses?
Let's check those as counter 0 and counter 1 using event-based profiling. Compare
Figure 22 to Figure
17. The number of data cache misses has dropped by a factor of five, from
52279 to 12535. But more impressively, the number of data translation lookaside
buffer misses has dropped by three orders of magnitude, from 145,363 to 163.
Yow.
Are we satisfied? Depends on your performance objectives: There's always more
tuning and tweaking you can do. But any time you can drop runtime by a factor
of five, that's a job well done, and a job made a lot easier by using the CodeAnalyst
profiler. Let's call it a day, shall we?
A former mainframe software developer and systems analyst, Alan Zeichick
is principal analyst at Camden Associates, an independent technology research
firm focusing on networking, storage, and software development. Read his blog
at http://ztrek.blogspot.com.