Two decades ago, fractals burst onto the scene. Fractals are fascinating chaotic geometric shapes that have the characteristic that no matter how much you zoom in on them, they're always uneven. One of the most famous fractals is what's called the Mandelbrot set, named after Benoit Mandelbrot, who discovered fractal geometry in the 1970s. For years, computer scientists, mathematicians and interested hobbyists have used PCs to generate beautiful fractal diagrams, including the Mandelbrot set, as shown in Figure 1.
The Mandelbrot set itself is developed by a recursive formula: Zn = Zn-12 + C. Here, C is an initial starting point on the complex plane, and Z is a spot inside the Mandelbrot set. It's a deceptively simple formula, but executing it many times and plotting the results provides graphically striking images.
Because the Mandelbrot formula is processor-intensive, it makes an excellent test case for seeing the differences between 32-bit and 64-bit computation using the AMD64 architecture. For our adventure through the Mandelbrot set, I've used the Visual Studio 2005 beta and a program I've written called Mandelblaze. You can download the source code and project files here. You'll need to be running one of the betas of 64-bit Windows XP or Windows Server 2003 to run these tests.
|
More About Fractals
There are many great sites on the Web that contain information about fractals and the Mandelbrot set. One is the Fractal Geometry site, at , but a quick browse through your favorite search engine should bring up many pages and cool figures.
Dr. Mandelbrot's original work in this area can be found in "The Fractal Geometry of Nature," from W.H. Freeman & Company, published in 1982; in 2004, he also published, "Fractals in Nature," from Springer-Verlag.
Many of the uses of fractal mathematics have been found in financial analysis. A fascinating paper by Dr. Mandelbrot is A Multifractal Walk Down Wall Street, originally published in Scientific American in February 1999.
|
If you don't have VS 2005, you can use batch files and the AMD64 Platform SDK and command-line tools to run the tests. Or you can just view the source code and the figures in this article.
Note: When installing Visual Studio 2005 beta, you must manually choose to install the x64 compiler and libraries; they are not installed by default. In addition to Visual Studio 2005 (or the PSDK), you also need to install the latest version of the DirectX 9.0 SDK.
Step 1: Compile and benchmark the 32-bit baseline code
Let's start by using Visual Studio 2005 to build the baseline project file, and measure the performance of the original code before optimization. On your 64-bit machine, use Windows Explorer to navigate to the folder that contains the C++ project. Double-click mandelblaze.sln to open the solution in Visual Studio 2005, and open the source file mandelblaze.cpp
Look at the code in the Main Mandelbrot Function which is called BOOL mandel(). There is a baseline main loop, and two alternate versions of the loop. The baseline is called "BASIC DOUBLE PRECISION INNER LOOP", and it should be active (not commented out) in the source. The two alternate loops should be commented out for now.
Read the code for the basic inner loop, especially the 7 lines of code inside the "for" loop. You can see this code is performing double precision floating point calculations. Observe that it repeatedly updates the variables z_real and z_imag with new calculated values, up to 1000 times:
for (int i = 0; i < 1000; i++) {
double a = z_real * z_real;
double b = z_imag * z_imag;
double c = z_real * z_imag;
c = c + c; // c * 2.0
z_real = a - b + c_real;
z_imag = c + c_imag;
if((a + b) > 4.0) { break; }
}
Now, let's compile and benchmark the code for 32-bit mode. Select Release Configuration, and Win32 platform. From the Project menu, click Mandelblaze Properties…, expand C/C++, change Optimization to Maximize Speed (/O2), and click OK. Then, from the Build menu, click Build Solution, and from the Debug menu, click Start Without Debugging. A new window should open, displaying the Mandelbrot Set. Zoom into the main region (solid blue) to generate an accurate gigaflops performance number; the picture should look something like Figure 2.
Observe the text in the window's title bar. It includes benchmark info measured in gigaflops, which will be updated every time the Mandelbrot set is calculated. A gigaflop is 1,000,000,000 double precision floating point operations per second. On my system, the 32-bit result was 1.07 gigaflops.
Step 2: Compile the baseline code for 64-bit performance
|
Mandelblaze Navigation
You can use the mouse to navigate around the Mandelbrot set —there are some interesting areas over at the edges of the set. The controls are:
| Left-click: |
zoom in |
| Right-click: |
zoom out |
| Move mouse to edge of window: |
scroll |
| ESC key: |
exit |
|
Now, let's compile and benchmark the code for 64-bit mode. Select Release Configuration, and Win64 (AMD64) platform. From the Project menu, click Mandelblaze Properties…, expand C/C++, click Optimization, and change the first field, Optimization, from Custom to Maximize Speed (/O2), and click OK.
Then from the Build menu, click Build Solution, and from the Debug menu, click Start Without Debugging. The Mandelblaze window should open. This is now 64-bit code, compiled with the optimized 64-bit compiler.
Zoom into the blue region to generate an the 64-bit gigaflops number. It should be higher than the baseline, indicating the 64-bit code is faster than the 32-bit code. Figure 3 shows my result, which is 1.22 gigaflops.
Step 3: Optimize 32-bit code by adding a second, separate dependency chain
We can make the Mandelblaze code run a lot faster by streamlining the inner loop. We'll do that by adding a second set of calculations which proceeds in parallel with the first set. This will give the CPU more independent floating point instructions to process, to keep the floating point units in the CPU efficiently busy.
First, comment out the baseline loop. You can do that by using "begin comment" characters /* and "end comment" characters */ to comment out the baseline loop (about 20 lines of code).
Next, un-comment the optimized loop. Right below the baseline loop, you should see the optimized loop which is called DOUBLE PRECISION INNER LOOP WITH SEPARATE DEPENDENCY CHAINS. Activate this loop by removing the "begin comment" and "end comment" characters at the start and end of the loop.
As you can see, the optimized code performs the same calculations as the baseline code, but it processes two completely separate Mandelbrot pixels within the loop. The two pixels are calculated with separate variables that have no data inter-dependencies.
To run this code, select Release configuration, and Win32 platform. From the Build menu, click Build Solution. From the Debug menu, click Start Without Debugging. The Mandelblaze window should open. This is the optimized 32-bit build.
Zoom into the blue to get the optimized 32-bit performance score. Figure 4 shows that on my system, the result was 1.56 gigaflops—the highest number so far. That's a good demonstration of the benefit of optimizing an algorithm!
Close the Mandelblaze window using the ESC key. Look at the C/C++ project settings (Project / Mandelblaze Properties / C/C++ / Code Generation). Notice the Enable Enhanced Instruction Set Option, /arch:SSE2. This allows the 32-bit compiler to emit SSE and SSE2 instructions for best performance. Right below it, the /fp:fast allows the 32-bit compiler (and the 64-bit compiler) to perform certain speed optimizations that may produce slightly different numerical floating point results, but run faster.
Step 4: Compile and benchmark the optimized code for AMD64 (64-bit mode)
To run the 64-bit version of the optimized program, select Release Configuration, and Win64 (AMD64) platform. From the Build menu, click Build Solution. From the Debug menu, click Start Without Debugging. The Mandelblaze window should open. This is the optimized 64-bit build.
Zoom into the blue to see the optimized 64-bit performance. This should be much higher than the optimized 32-bit performance. My test, seen in Figure 5, shows 2.06 gigaflops.
You can see why the 64-bit code is so much faster by examining the assembly code output. Compare 64-bit main loop in amd64/release/mandelblaze.cod vs. 32-bit main loop in release/mandelblaze.cod. The 64-bit code can keep all the variables in SSE registers, for fast access. The 32-bit code can only use eight SSE registers; it must use memory variables for the rest, which reside in cache but are still slower than using registers.
The benefits are in the benchmarks
We've just seen how a single source file for both 32-bit and 64-bit targets can run unchanged under 64-bit Windows—and the 64-bit code is definitely faster. Using source-level optimizations to help the performance of the floating point code also offers a tremendous payoff.
Using 64-bit Integers to Make an Extended-Precision Inner Loop
Read on...
|
Indeed, the technique of coding two separate dependency chains in the main loop, which offers more independent variables and operations that can be performed in parallel or out-of-order by the pipelined Floating Point Units in the CPU, works especially well in 64-bit mode, because the 16 SSE registers available in 64-bit mode permit all the variables to be stored in registers. As a result, the 64-bit code performs significantly better than the 32-bit code.