AMD Logo AMD Developer Central

Code Tuning: Seeing Is Believing 

Skip Navigation LinksHome
CodeAnalyst's timer-based and event-based profiling can find out why a simple matrix multiplication is taking such a long time to execute. Here's how.
Alan Zeichick  3/28/2007 

If you've read my previous articles about AMD's performance tuning software, "Got Bottlenecks? CodeAnalyst Can Help" and "Cache or Check? CodeAnalyst Addresses Performance Deficits," you've seen how to perform timer-based and event-based profiling of your applications. I'm not going to review what CodeAnalyst does here—that's been covered adequately (I hope!) in the previous two articles. This time around, let's dispense with the 50,000-foot overview, and dive into how CodeAnalyst can help you track down performance problems in applications.

For this example, we're going to start with a Visual C++ application, which was provided to me by Paul Drongowski, of AMD's Boston Design Center; I asked Paul to provide me with a short app that could highlight a typical performance problem. Before we continue, you might print out Listing 1, which implements a simple matrix multiplication algorithm using a nested loop. It's a simple algorithm.

The main routine, sets up a 1000 by 1000 matrix, calls the multiply_matrices() function, and writes the elapsed execution time out to a file, so you can see how long the application takes to run.

Here's the multiply_matrices() function—get to know it, because we'll be spending a lot of time with it.

void multiply_matrices()
{
	// Multiply the two matrices
	for (int i = 0 ; i < ROWS ; i++) {
		for (int j = 0 ; j < COLUMNS ; j++) {
			float sum = 0.0 ;
			for (int k = 0 ; k < COLUMNS ; k++) {
				sum = sum + matrix_a[i][k] * matrix_b[k][j] ;
			}
			matrix_r[i][j] = sum ;
		}
	}
}

Figure 1
Figure 1. The matrix multiplication took 10.39 seconds.
Straightforward, eh? Because the algorithm is easy to read and program, it should run fast. Let's find out just how efficient it is.

I compiled and linked the program, called classic.cpp, using Visual Studio .NET 2003. On my dual-processor workstation, the runtime was 10.39 seconds (see Figure 1). I wonder if we can make it go faster.

Figure 2
Figure 2. CodeAnalyst settings for running a timer-based profile.
To get a handle on execution, we'll use CodeAnalyst. If you haven't already installed CodeAnalyst, you can download it from here. The application is free—so there's no reason not to download it and start exploring. The version that I'm using is 2.45.61.

Figure 2 shows the settings that I used to configure CodeAnalyst to run a timer-based profile of the application. A timer-based profile takes a snapshot of the system at regular intervals, and will show you what each processor is doing during each snapshot.

In order for CodeAnalyst to do its magic, by the way, you'll want to ensure that Visual C++ is outputting full debug information (of course). You'll also need to configure CodeAnalyst so that it can find the debug info, under Tools -> CodeAnalyst Options -> Directories, as shown in Figure 3.

Figure 3. Be sure to set the proper debug file search path—CodeAnalyst needs to know!

After CodeAnalyst is configured, press the Run button (the sideways triangle) to begin the test. You'll see something like what's in Figure 4 under the default System Data tab. It shows that during the execution period of classic.exe, CodeAnalyst took just over 20,000 processor samples. Nearly all of the time was eaten up, as you'd expect, by the 32-bit classic.exe application, some of which ran on processor 0, but some of which was allocated to processor 1 by Windows XP Professional x64 Edition. Some time was also consumed by the processor driver (amdk8.sys), 32-bit Windows subsystem (win32k.sys) and 64-bit kernel (ntoskrnl.exe), and then it goes down from there.

If you click on the System Graph tab, you get a graphical view of this same data, and the Processes tab shows you what else is running on your computer. The Processes tab shows you, well, processes. We don't care about those—we want to get into the application and see why it's running so slowly. So, sorry, no screen shots.

Back on the System Data tab, double-click classic.exe. You'll get a screen like Figure 5, showing the extended instruction address (CS:EIP), its symbol, and its sample counts. (Yes, yes, I messed something up and didn't load the symbol table properly. I don't want to hear about it.) Ignore the lack of symbol information and double-click the first item, which shows the line of assembly code that took up the most time—883 samples, or 4.33 percent of the execution time. You'll see an assembly breakdown of the code region around that instruction, such as in Figure 6.

Do you want to see assembly code? Soon, yes, but first, let's figure out which line of source code is sucking down those CPU cycles. Unclick the Asm button, and you'll see the C++ source, as shown on Figure 7. You will see that the line below was responsible for 10007 samples during the timer-based profiling—that's about half of the entire execution!

  sum = sum + matrix_a[i][k] * matrix_b[k][j] ;

Figure 7. Surprise! The matrix multiplier, nested inside the loops, is chewing up all the processor time.
As you scroll through the source listing, you can see other parts of the code where there were many samples. (Code that doesn't have a sample count was probably executed, but not frequently enough to be sampled.) During my sample run, only two other lines stood out as having more than 10 time samples—a float conversion and an arithmetic assignment.

To understand why this is running so slowly, click on the Asm button again, and scroll. You can see the full assembly breakdown of this nasty statement, as shown in Figure 8. Ouch. The processors seem to be spending a lot of time with a few floating-point-add-and-pop-the-stack operations. I wonder what's going on with those FADDP instructions, and why they're running so slowly.

Let's use the event-based profiling features of CodeAnalyst to see more of what the processor's up to. Under Tools -> Project Options, change to an event trigger, as shown in Figure 9.

Figure 9
Figure 9. The only change to switch to event-type profiling is to change the session type, lower left.
Unlike with timer-based profiling, event-based profiling requires that you set additional parameters, which tells CodeAnalyst which processor events to monitor. You can monitor as many as four at a time.

We'll begin by following the trail of the FADDP instruction: looking at what the floating point unit is doing. Under the Event Config tab, select Dispatched FPU operations (that's event code 0x00) for counter 0, as in Figure 10. Choose dispatch stall when the FPU is full (event code 0xd7) for counter 1, as in Figure 11. And select FPU exceptions (event code 0xdb) for counter 2, as in Figure 12. We're not going to choose a fourth counter. Then run CodeAnalyst again. The results will be as in Figure 13.

Let's interpret those results. CodeAnalyst recorded about 2.5 million processor events based on those three counters during our test run. There weren't any exceptions, but there were a lot of dispatched FPU operations, and more than 2.2 million instances when the FPU stalled.

Folks, I think we've found our problem. Drilling down into the assembly code confirms it; as shown in Figure 14, those FADDP instructions are definitely stalling out the FPU and delaying execution. Don't you hate when that happens? My guess was that the FADDP instructions could be stalling the processor because there's a memory access issue, perhaps caused by the way the nested loops are accessing the matrix elements. How can we tell? By checking the data caches and data translation lookaside buffer. That means going back to CodeAnalyst and setting up another test run.

Figure 14. The floating point operations are the culprit. But why?

Let's set up another event-based profile, this time looking at the data cache misses (event code 0x41) as counter 0, as in Figure 15, and data translation lookaside buffer misses (event code 0x46) as counter 1, as in Figure 16. Be sure to reset counter 2 to a blank state, since we don't need to count FPU exceptions again. Then run the test again.

As you can see from Figure 17 and Figure 18, we've sure got a lot of both of those performance robbing events. Why do we have cache misses? Because the access patterns for the matrix data jump with big strides, which means that the cache and data translation lookaside buffer aren't preloaded with the right data—and that causes the FPU to stall.

So, now we know what to do: Focus our refactoring efforts on how the data is accessed, so that it's more sequential, and thus the right data will be in the cache and lookaside buffer. To make a long story short, Paul suggests an alternative routine that changes the nesting of the innermost loop:

void multiply_matrices()
{
	// Multiply the two matrices
	for (int i = 0 ; i < ROWS ; i++) {
		for (int k = 0 ; k < COLUMNS ; k++) {
			for (int j = 0 ; j < COLUMNS ; j++) {
				matrix_r[i][j] = matrix_r[i][j] +
					matrix_a[i][k] * matrix_b[k][j] ;
			}
		}
	}
}

Figure 19
Figure 19. The rewritten algorithm cuts total runtime five-fold, to 1.92 seconds.

You can download this version of the program, called interchange.cpp, from Listing 2. How does it do? Elapsed runtime: 1.92 seconds—a better than five-fold improvement (see Figure 19).

What does CodeAnalyst think? Figure 20 shows the timer profile—there are fewer total samples, because the program ran faster. Figure 21 shows the source code breakdown—yes, there are still a lot of samples at the matrix multiplication, but there are far, far fewer than we originally had.

Figure 22. Changing the order that the loops iterate might make the source code slightly more complicated, but it sure makes the cache and buffers run more efficiently.

What about those data cache misses and data translation lookaside buffer misses? Let's check those as counter 0 and counter 1 using event-based profiling. Compare Figure 22 to Figure 17. The number of data cache misses has dropped by a factor of five, from 52279 to 12535. But more impressively, the number of data translation lookaside buffer misses has dropped by three orders of magnitude, from 145,363 to 163. Yow.

Are we satisfied? Depends on your performance objectives: There's always more tuning and tweaking you can do. But any time you can drop runtime by a factor of five, that's a job well done, and a job made a lot easier by using the CodeAnalyst profiler. Let's call it a day, shall we?

A former mainframe software developer and systems analyst, Alan Zeichick is principal analyst at Camden Associates, an independent technology research firm focusing on networking, storage, and software development. Read his blog at http://ztrek.blogspot.com.

Back to top
© 2009 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon, AMD Turion, AMD Sempron, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.

This website may be linked to other websites which are not in the control of and are not maintained by AMD. AMD is not responsible for the content of those sites. AMD provides these links to you only as a convenience, and the inclusion of any link to such sites does not imply endorsement by AMD of those sites. AMD reserves the right to terminate any link or linking program at any time.