Many programmers don't understand this simple reality: When using mainstream programming languages the only way to take advantage of multiple cores is to explicitly use multithreading. Mainstream language processes are not automatically parallelizable, and there is very little that the compiler is allowed to do to exploit this new and exciting era of concurrent processing.
There are, of course, speedups from the operating system, and some libraries can be rewritten to fulfill their contracts asynchronously, but your code is not going to run faster unless you make it multithreaded. The "free lunch" performance boost from the rising clock speeds of previous processor generations is over, and while single-threaded applications will certainly experience some incremental improvements, performance-oriented programming will increasingly rely on distributing calculations over multiple cores and processors using multiple threads.
But multithreading is hard, right? Race conditions, deadlocks, scheduling algorithms: any discussion of multithreading has to emphasize the insidious assumptions we make, how intermittent and difficult to isolate the defects are, how difficult the debugging is, and so on. Yes, it's all true, but let's shove that out of the way and talk about situations where multithreading is easy.
It's perhaps surprising that C++, with its reputation for difficulty, actually provides one of the easiest ways to exploit multi-core and multiprocessor systems. OpenMP, a multiplatform API for C, C++, and Fortran, uses compiler directives to automatically generate all of the support code needed to parallelize code sections. In the simplest case, which is what we're going to focus on in this article, simply marking a processor-intensive loop with a #pragma can lead to about a 70 percent performance increase on a dual-core or dual-processor system, and the same code can enjoy a similar "free lunch" on the quad-core systems that you build in the future.

Figure 1. The Transform Steps
You can exploit concurrency at any level of granularity from high-altitude system architecture to CPU registers, but the key to avoiding trouble is always the same: minimize data coupling. To err is human, but to really screw up you need shared state. This is why it's often easiest to accomplish concurrency at the highest level (n-tier and service-oriented architectures) where big chunks of functionality can be divided up and coordinated with coarse messages. Meanwhile, successful threading of middle-level abstractions (active functions or classes) is often fraught with difficulty.
The same principle leads to the seeming paradox that the other often-easy win for concurrency is at the lowest level, with a loop working on an array. At the center of many, perhaps even most, processor-intensive scenarios lies a hotspot involving a loop and a big array of data: the pixels, the sound bits, the mesh coordinates, or whatever monstrous array of data you may need for your work. While there's certainly a chance of encountering a concurrency pitfall in such a situation, there are also many cases in which the calculations are nicely independent.

Figure 2. Wavelet-based Image Processing
Image processing is typical of the work you might tackle with OpenMP. Wavelets are fascinating tools for image processing because they can be used for compression, feature detection, enhancement, and probably dozens of other techniques. The simplest wavelet, the Haar wavelet, works by calculating the average and difference-from-average of pairs of input data. These results are then gathered together, reordering the data so that the averages come first and the differences follow. Animation 1 shows the Haar transform in action on a 1-dimensional array. On an image, we can perform one transform horizontally and then another transform vertically. When the difference coefficients are scaled into the grayscale range 0..255, the transform steps look like those shown in Figure 1. The result is that the original image is divided into four quadrants; the top-left is a half-resolution version of the original, and the others are essentially edge maps. If the wavelet is applied recursively, the data quickly becomes uninterpretable to the eye, but details from multiple resolutions are captured in the transformed data. It's also important that the transform is lossless: the original image can be reconstituted perfectly by applying the transform in reverse.
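To make the average and difference-from-average step concrete, here is a minimal sketch of one 1-dimensional Haar step and its inverse. The function names haarStep and haarStepInverse are illustrative assumptions, not names from the sample application, and the sketch uses std::vector rather than the raw arrays the article's listings use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One Haar step on an even-length signal: the averages of each pair go in
// the first half of the output, the differences-from-average in the second.
std::vector<float> haarStep(const std::vector<float>& in)
{
    assert(in.size() % 2 == 0);
    const std::size_t half = in.size() / 2;
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < half; ++i)
    {
        const float v0 = in[2 * i];
        const float v1 = in[2 * i + 1];
        const float ave = (v0 + v1) / 2;
        out[i] = ave;             // average
        out[half + i] = v0 - ave; // difference-from-average
    }
    return out;
}

// Inverse step: since diff = v0 - ave, v0 = ave + diff and v1 = ave - diff.
std::vector<float> haarStepInverse(const std::vector<float>& in)
{
    assert(in.size() % 2 == 0);
    const std::size_t half = in.size() / 2;
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < half; ++i)
    {
        out[2 * i] = in[i] + in[half + i];
        out[2 * i + 1] = in[i] - in[half + i];
    }
    return out;
}
```

For the input {8, 4, 6, 2}, one step yields {6, 4, 2, 2}: the averages 6 and 4 followed by the differences 2 and 2. Applying the inverse to that result gives back {8, 4, 6, 2}.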
I wrote a simple C++/CLI program to experiment with wavelet-based image processing (Figure 2). While it works quickly enough with the famous 512x512 Lena compression benchmark shown, on a 16-megapixel panorama it can take a few seconds. Figure 3 shows CPU utilization during such a challenging run. This type of disappointing perfmon profile is more common than not because so few current applications are written for multi-core machines.

Figure 3. Before: CPU Utilization During a Challenging Run

Figure 4. After: A More Pleasing Profile
I added two lines of code (one to each of the horizontal and vertical transform functions) and generated the far more pleasing profile shown in Figure 4. Both processors run at near 100 percent utilization, resulting in a 70 percent speedup: now that's the sort of performance improvement you'd expect from multi-core! The magic is in those two lines of code.
The horizontal transform is shown below in Listing 1. As you can see, the listing follows the animation in structure. We iterate over the pOriginal array, calculate the average and difference-from-average, and store the results in the appropriate position in the pNew array. (Complete source code and binaries are available for download here.)
Listing 1.
void HorizontalStepDown()
{
    #pragma omp parallel for
    for(int y = 0; y < height; y++)
    {
        int yOffset = y * width;
        //Note x is being stepped 2 at a time
        for(int x = 0; x < width; x+=2)
        {
            float v0 = pOriginal[yOffset + x];
            float v1 = pOriginal[yOffset + x + 1];
            float ave = (v0 + v1) / 2;
            float diff = v0 - ave;
            //Averages fill the left half of the row, differences the right half
            pNew[yOffset + x / 2] = ave;
            pNew[yOffset + x / 2 + width / 2] = diff;
        }
    }
}
The magic sauce is the #pragma line. Although there are a number of OpenMP utility functions, the majority of OpenMP use is done in the form of #pragma commands. These commands, which are ignored by compilers that do not support OpenMP, dramatically change the semantics of the code block to which they are applied.
Loops cannot, in general, be run in parallel. Many loops have loop-carried dependencies, such that correctness requires sequential processing. For example, if your loop contains lines like x[i] = x[i - 1]; you've likely got a loop-carried dependency. Naturally, the language must solve the general case and leave parallelization to either sophisticated compilers or to (presumably equally sophisticated) human beings.
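The distinction can be sketched with two small loops. The function names scaleAll and runningSum are hypothetical and chosen only for illustration. In the first, every iteration touches only its own element, so the programmer can assert that it is safe to parallelize. In the second, each iteration reads the value the previous iteration wrote, which is a loop-carried dependency.

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: element i depends only on element i, so the
// iterations can be divided among threads in any order.
void scaleAll(std::vector<float>& data, float factor)
{
    // A signed int index is used because older OpenMP versions require it.
    const int n = static_cast<int>(data.size());
#pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        data[i] *= factor;
    }
}

// Loop-carried dependency: iteration i reads what iteration i - 1 wrote.
// As written, this loop must run sequentially, so it carries no pragma.
void runningSum(std::vector<float>& data)
{
    for (std::size_t i = 1; i < data.size(); i++)
    {
        data[i] += data[i - 1];
    }
}
```

Putting the parallel pragma on the second loop would still compile, but the result would depend on how the threads happened to interleave. OpenMP takes the programmer's word that the loop is safe.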
An OpenMP parallel block is one that the programmer asserts is safe to run concurrently. As mentioned earlier, this is often easiest at the lowest level of abstraction, when you know that your task is to work your way through some big block of data. The OpenMP pragmas define the blocks that can be run concurrently, but they do have overhead. When execution first encounters a parallel region, some number of threads are started (they were created on program start-up). Parallel regions are generated as out-of-line functions so that their addresses can be passed to the executing threads. So, on a single-core machine, OpenMP introduces overhead and gains nothing.
Even on a multi-core machine, it takes many iterations over a loop before the benefits of distributing the computation overtake the overhead of setting up the threads. In practice, any kind of media processing is likely to involve enough data to make starting threads well worthwhile. Even on the "Lena" image, which is a mere quarter-megapixel file, OpenMP delivered nearly 70 percent better performance. There will always be some overhead to distributing and coordinating concurrent operations, so multiple cores will never lead to perfect performance multiples.
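When the amount of data varies from run to run, one way to avoid paying the threading overhead on small inputs is OpenMP's if clause, which makes the region run in parallel only when a condition holds. The sketch below is an illustration under stated assumptions: the function name halveAll is hypothetical, and the threshold kMinParallelCount is an arbitrary placeholder rather than a measured value, so a real cutoff would have to come from profiling.

```cpp
#include <vector>

// Hypothetical cutoff below which the loop stays on a single thread.
// The value is a placeholder, not a measurement.
const int kMinParallelCount = 10000;

// Halves every element. The if clause asks OpenMP to use multiple threads
// only when there is enough work to be worth the setup cost.
void halveAll(std::vector<float>& data)
{
    const int count = static_cast<int>(data.size());
#pragma omp parallel for if(count >= kMinParallelCount)
    for (int i = 0; i < count; i++)
    {
        data[i] /= 2;
    }
}
```

The results are the same either way; only the execution strategy changes with the size of the input.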
In Visual C++ 2005, using OpenMP is as simple as adding the #pragma and compiling with a command-line switch (/openmp). In Visual Studio 2005, this can be set in the Project Property Pages dialog, under "Configuration Properties|C/C++|Language|OpenMP Support" (see Figure 5). At the moment, Microsoft's OpenMP implementation is not compatible with the /clr:pure or /clr:safe command-line switches, so if you're writing in C++/CLI, you'll have to use the "plain vanilla" /clr switch, which can also be set in the Property Pages dialog, as shown in Figure 6. The OpenMP runtime file vcomp.dll must be shipped with the final executable code.
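Because a forgotten switch fails silently (the pragmas are simply ignored), it can be handy to check at run time whether OpenMP was actually enabled. The sketch below relies on the standard _OPENMP macro, which OpenMP-enabled compilers define, and on the OpenMP library function omp_get_max_threads(). The helper names builtWithOpenMP and availableThreads are hypothetical.

```cpp
#ifdef _OPENMP
#include <omp.h> // only available when OpenMP support is switched on
#endif

// True when the compiler was run with OpenMP support enabled.
bool builtWithOpenMP()
{
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}

// Upper bound on the threads a parallel region may use; 1 without OpenMP.
int availableThreads()
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

If builtWithOpenMP() returns false in a build you expected to be parallel, the compiler switch is the first thing to check.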

Figure 5. Enabling OpenMP Is an Easy Configuration Change

Figure 6. OpenMP Is Not Yet Compatible with /clr:Pure or /clr:Safe
Using OpenMP with C++/CLI is a joy, since you can use .NET's Base Class Library to make such things as GUIs, databases, and Web Services easy, but still be a mere OpenMP pragma away from unleashing your processors. In writing the sample application for this article, though, I noticed that the BCL's Bitmap class had abysmally slow SetPixel() and GetPixel() operations. Listing 2 shows how easy it is in C++/CLI to combine an easy-to-use BCL class like Bitmap with some low-level pointer work to speed things up. The only thing that's a little tricky is that the BitmapData returned by Bitmap->LockBits() should be released as soon as practical and, to be safe, should be wrapped in a try/finally block. A similar function for setting pixels is in the sample application source.
Listing 2.
void ImageToFloatArray()
{
    //getPixel is absurdly slow, so let's do it fast
    GraphicsUnit guPixel = GraphicsUnit::Pixel;
    RectangleF^ boundsF = myBmp->GetBounds(guPixel);
    Rectangle bounds = Rectangle((int) boundsF->X, (int) boundsF->Y,
        (int) boundsF->Width, (int) boundsF->Height);
    BitmapData^ bmpData = nullptr;
    try{
        bmpData = myBmp->LockBits(bounds, ImageLockMode::ReadWrite,
            PixelFormat::Format8bppIndexed);
        unsigned char* pData =
            reinterpret_cast<unsigned char*>(bmpData->Scan0.ToPointer());
        pOriginal = new float[width * height];
        pNew = new float[width * height];
        //Note: indexing rows by width assumes bmpData->Stride == width,
        //which holds for 8bpp images whose width is a multiple of 4
        #pragma omp parallel for
        for(int y = 0; y < height; y++)
        {
            for(int x = 0; x < width; x++)
            {
                unsigned char pixelR = pData[(y * width + x) *
                    sizeof(unsigned char)];
                float grayScale = (float) (pixelR / 255.0);
                pOriginal[y * width + x] = grayScale;
            }
        }
    }finally{
        //Release the locked bits even if an exception was thrown
        if(bmpData != nullptr){
            myBmp->UnlockBits(bmpData);
        }
    }
}
It's important to realize that OpenMP is not a substitute for between-the-ears optimization. For instance, the code presented in this sample application uses floating point numbers, but since the Haar wavelet is limited to division by two, it's an easy-enough matter to implement the transform using integers and shifts. Sure enough, such a change doubles the speed of the transform, outshining OpenMP. The point of the sample application isn't to show pedal-to-the-metal optimization, but rather the opposite: how easy it is to get a significant performance boost with minimal work. Of course you can use an OpenMP pragma on the integer version of the transform and get a free lunch on that, too.
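The integer version is not part of the listings above, so the following is only a sketch of one way it might look, not the author's implementation. It uses a common reversible integer variant of the Haar step (often called the S-transform): the average is computed with a shift and rounds down, and the detail coefficient is the full difference a - b rather than the difference-from-average, which is what keeps the step exactly invertible. The function names are hypothetical. The sketch also assumes that right-shifting a negative int is an arithmetic shift, which C++20 guarantees and mainstream compilers have long provided.

```cpp
#include <cassert>

// Forward step on one row of even width:
//   s = floor((a + b) / 2) goes in the left half,
//   d = a - b goes in the right half.
void intHaarRow(const int* in, int* out, int width)
{
    assert(width % 2 == 0);
    const int half = width / 2;
    for (int x = 0; x < width; x += 2)
    {
        const int a = in[x];
        const int b = in[x + 1];
        out[x / 2] = (a + b) >> 1; // average, rounded down
        out[half + x / 2] = a - b; // full difference
    }
}

// Inverse step: because a + b = 2b + d, s = b + floor(d / 2),
// so b = s - floor(d / 2) and a = b + d.
void intHaarRowInverse(const int* in, int* out, int width)
{
    assert(width % 2 == 0);
    const int half = width / 2;
    for (int i = 0; i < half; i++)
    {
        const int s = in[i];
        const int d = in[half + i];
        const int b = s - (d >> 1); // assumes arithmetic shift for d < 0
        const int a = b + d;
        out[2 * i] = a;
        out[2 * i + 1] = b;
    }
}
```

For the row {7, 2, 3, 8}, the forward step produces {4, 5, 5, -5}, and the inverse restores {7, 2, 3, 8}. Rows are independent of one another, so a loop that calls intHaarRow once per image row is the same kind of candidate for an OpenMP pragma as the floating-point version.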
Right now, when the question is whether you have one core or two, the boost you can hope to get from an OpenMP pragma hovers around 70 percent. However, once we get four cores on our desktops, that same line might give you a 300 percent speedup. Wait a generation, and you might be talking about OpenMP delivering six times the speed of a single-threaded application. Beyond that, what about when machines start having 16 and 32 cores? Today you might be able to get by without parallelizing your code, but that's certainly not going to be the case in the not-so-distant future.
Larry O'Brien is a recognized expert on .NET, and is a frequent writer and speaker on software development.