We've tried a couple of optimization techniques from our basic
bag-o-tricks, and obtained good results just by guessing. What next? How can
we get even more performance? At this point, we really need to understand the
details of where the code is spending its time. It's time to profile the code
and measure what is really happening.
A profiler is an essential tool for any performance-oriented development
project. It allows you to view details of the program execution. Timer-based
profiling shows where the code spends time. Event-based profiling counts interesting
hardware events, like cache refills and mispredicted branches.
AMD CodeAnalyst does all this, and more. Download CodeAnalyst
from AMD Developer Central, and install it. Run CodeAnalyst, and read through
the step-by-step tutorial under the Help menu. Note that CodeAnalyst needs proper
symbol information, so build with the VS2005 compiler setting /Zi and the linker
setting /DEBUG.
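
To see why timer samples reveal where the time goes, here is a toy sketch of the sampling idea. It is not how CodeAnalyst is implemented; the trace format, the `Span` type, the `sample_trace` function, and the `paint` function name are all made up for illustration. The sketch "interrupts" a synthetic execution trace at a fixed interval and counts which function was running each time.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One entry of a synthetic execution trace: a function name and how many
// clock ticks it ran for. Hypothetical data type, for illustration only.
struct Span {
    std::string function;
    std::size_t ticks;
};

// Take a sample at ticks 0, interval, 2*interval, ... and count which
// function was running at each sample point. The per-function counts
// approximate each function's share of the total run time.
std::map<std::string, std::size_t>
sample_trace(const std::vector<Span>& trace, std::size_t interval) {
    assert(interval > 0);
    std::map<std::string, std::size_t> counts;
    std::size_t next_sample = 0;  // tick at which the next sample fires
    std::size_t now = 0;          // tick at which the current span starts
    for (const Span& s : trace) {
        const std::size_t end = now + s.ticks;  // span covers [now, end)
        while (next_sample < end) {
            ++counts[s.function];
            next_sample += interval;
        }
        now = end;
    }
    return counts;
}
```

With a trace of 90 ticks in `mandel` followed by 10 ticks in `paint` and an interval of 10, `mandel` collects 9 of the 10 samples, mirroring its 90% share of the run time. A real profiler works the same way in spirit, but samples the live instruction pointer instead of a prepared trace.
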
Now create a new CodeAnalyst project for the Mandel application:
Project directory: some temporary directory, like C:\temp
Project name: any name you like, let's use "mvec"
Working directory: the location of the .exe, C:\mandel\x64\release
Launch app: select mandel.exe in the working directory
Use the default "Timer Trigger," set the duration to
20 seconds, and check the "Terminate app?" box. Then click OK. We
have set CodeAnalyst to launch the Mandel program, capture 20 seconds of timer-based
profile data, then stop the program.
Click the triangular "Start" button, and the application
should launch. Be sure to keep clicking the mouse and zooming in to the Mandelbrot
set, so the main loop code is being exercised constantly. This will give relevant
sample data, instead of measuring the idle loop.
After 20 seconds, Mandel will exit and CodeAnalyst will open a
Session window with all your Timer Based Profile (TBP) sample data.
Click the System Graph tab, and you should see Mandel.exe taking
the lion's share of the time. You can double-click the bars and drill down to
the assembly code, with timer counts associated with the individual instructions.
Now click the System Data tab. The modules are ranked according
to activity, with the most active module at the top. Presumably, this will be
Mandel.exe if you kept the program busy while it ran.
Double-click mandel.exe, and you will see which
functions inside Mandel.exe were the most active. The mandel
function (main function) should be the most active, by far. Double-click it,
and you will see the final destination: the source, ASM code, and corresponding
sample counts. There is also a module navigator section showing your current
position in the module, and a top-level graphical overview of the hot spots
in the module.
This view shows a lot of useful info. Scroll through the code,
and see where most of the time is spent. Click the little square box to expand
a source line, and see the corresponding ASM code. Source and ASM don't always
line up perfectly, but it's close. Note: a slow instruction will generally produce
a large sample count on the next instruction, since that is the instruction pointer (IP)
address recorded when the sample is actually taken.
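
One rough way to read such a listing is to shift each sample count up by one instruction. The sketch below does this on a made-up listing; the `InstrSamples` type and the `shift_blame` function are hypothetical, and real skid is not always exactly one instruction, so treat the result as a reading aid rather than a measurement.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One line of a disassembly listing with its sample count.
// Hypothetical type, for illustration only.
struct InstrSamples {
    std::string mnemonic;
    unsigned samples;
};

// Attribute each instruction's samples to the instruction just before it.
// The first instruction's own samples are dropped (they belong to whatever
// preceded the listing) and the last instruction ends up with zero.
std::vector<InstrSamples> shift_blame(const std::vector<InstrSamples>& listing) {
    std::vector<InstrSamples> out = listing;
    for (std::size_t i = 0; i + 1 < out.size(); ++i) {
        out[i].samples = listing[i + 1].samples;
    }
    if (!out.empty()) {
        out.back().samples = 0;
    }
    return out;
}
```

For example, if a listing shows MULPS with 5 samples followed by ADDPS with 120, the shifted view charges the 120 samples to MULPS, which is the instruction the ADDPS was most likely waiting on.
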
Notice that the vector calculations like MULPS
and ADDPS take a lot of time. This is expected. However,
you can also see the SHUFPS/COMISS taking lots of
time. This might be a surprise, but this kind of overhead for rearranging data
is an example of the tradeoffs often encountered when writing vectorized code.
How can it be improved? [note: if you're not really interested in vectorization,
skip ahead to the dual-core section]
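
To make the hot spots easier to recognize, here is a minimal sketch of the kind of SSE intrinsics code that typically compiles to these instructions. It is an assumed shape, not the actual Mandel source: the function names `mandel_step` and `escaped_lanes` are invented here, and the exact instructions emitted depend on the compiler and its settings. The arithmetic step maps to MULPS/ADDPS/SUBPS, while the per-lane escape test moves each element into the low position with a shuffle and compares it as a scalar, which is where SHUFPS and COMISS come from.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// One Mandelbrot iteration, z = z*z + c, for four points at once.
// zr/zi hold the real and imaginary parts of z; cr/ci hold those of c.
inline void mandel_step(__m128& zr, __m128& zi, __m128 cr, __m128 ci) {
    const __m128 zr2 = _mm_mul_ps(zr, zr);        // MULPS
    const __m128 zi2 = _mm_mul_ps(zi, zi);        // MULPS
    const __m128 zri = _mm_mul_ps(zr, zi);        // MULPS
    zr = _mm_add_ps(_mm_sub_ps(zr2, zi2), cr);    // SUBPS, ADDPS
    zi = _mm_add_ps(_mm_add_ps(zri, zri), ci);    // ADDPS, ADDPS
}

// Returns a 4-bit mask with bit i set when lane i has |z|^2 > 4.
// Each lane is tested separately: a shuffle brings the lane into the low
// element, then a scalar ordered compare checks it against 4.0.
inline int escaped_lanes(__m128 zr, __m128 zi) {
    const __m128 mag2 = _mm_add_ps(_mm_mul_ps(zr, zr), _mm_mul_ps(zi, zi));
    const __m128 four = _mm_set_ss(4.0f);
    int mask = 0;
    mask |= _mm_comigt_ss(mag2, four) << 0;
    mask |= _mm_comigt_ss(
                _mm_shuffle_ps(mag2, mag2, _MM_SHUFFLE(1, 1, 1, 1)), four) << 1;
    mask |= _mm_comigt_ss(
                _mm_shuffle_ps(mag2, mag2, _MM_SHUFFLE(2, 2, 2, 2)), four) << 2;
    mask |= _mm_comigt_ss(
                _mm_shuffle_ps(mag2, mag2, _MM_SHUFFLE(3, 3, 3, 3)), four) << 3;
    return mask;
}
```

Note how the useful math is four-wide, but the escape test falls back to one lane at a time. That data rearrangement does no Mandelbrot work at all, yet it runs on every iteration, which is why it shows up so prominently in the sample counts.
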