Most programmers are wise enough to know that software won't automatically run significantly faster just because the user has a dual-core processor. While there will certainly be incremental benefits of the newer processor, the expected benefits of, say, doubling power based on adding a second core, can only be achieved with explicit code changes. No mainstream language can automatically take advantage of multiple cores and multiple processors. This is true of all the.NET languages from Microsoft except for F#. However, it's also true of Java, C++, C, Delphi, Ruby, JavaScript, Perl, and Python, just to name a few.
Resources
· Download the Code for This Article
Should you care? Well, today, the question of changing your programming to exploit multiple cores is a question of whether your target machine has one or two cores available. The boost that you can hope to get from explicit multithreading is something less than 100 percent. Perhaps, for some, that gain is not worth the effort. Tomorrow, it will be common for there to be one, two, and four core targets. In just a few Moore 's generations, the possibilities will stretch from one core to eight, 16, and even 32 cores. Today, we're in a transitional "multi-core" era, but in the subsequent "many-core" era it will be impossible for professional programmers to hide their heads in the sand when it comes to concurrency.
The Common Language Runtime (CLR) of.NET provides a fairly easy-to-use object-oriented threading model. A Thread encapsulates the idea of a "thread of execution." At the.NET level of abstraction, one can say that at any given moment, a single Thread is executing on a single processor core (once you get under the level of the OS and "closer to the metal," the problem of "what's going on" at any given moment becomes more complex). Thread s run within an ApplicationDomain, which is.NET's "unit of isolation for an application." To quickly finish the model, Windows (like most Operating Systems) has the idea of a "Process," each of which has its own memory space, security token, holds handles to resources likes files and windows, and contains one or more threads. Windows "multitasks" by allocating CPU time to different Processes and switching context between their threads them (and, because Windows can interrupt a Process without that Process's cooperation, it is said to be "preemptive"). In addition, Windows natively has the ability to coordinate threads, so it is a "preemptive multitasking, multithreaded" OS. An ApplicationDomain is, in reality, a lighter-weight entity than a Windows Process, but like a Process, has the important detail of sharing memory between its component Thread s.
I like to illustrate the benefits of multithreading on a multiprocessor machine using an image-processing example. Wavelets are a fascinating topic in signal processing: they can be used to perform compression, image enhancement, feature detection, and probably a dozen other things. The Haar wavelet is the simplest wavelet and works by calculating the pair-wise average and difference-from-average of the original signal (sound, image, video, etc.). For instance, if the original data was [8,6], the Haar wavelet would transform it into [7, 1]. [8,6,5,9] would be transformed into [7, 7, 1, -2] (as shown in Animation 1). Typically, the wavelet is applied recursively until the first value in the signal is equal to the average of the entire signal (so [8,6,5,9] becomes [7, -0.5, 0, 1.5] after 2 levels).

Figure 1. A C# Application That Applies the Haar Wavelet to an Image
Figure 1 shows a C# application that applies the Haar wavelet to an image. In this case, the image is the " Lena " compression benchmark image and the Haar wavelet has been applied once in both the horizontal and vertical dimensions, creating a half-resolution version of the original and 1-pixel edge maps. As you can see, applying a single level of transform to a 512 x 512 image takes only a dozen milliseconds or so, but Figure 2 shows that on a large panorama (4096 x 4096), the single-threaded implementation takes almost 8 seconds to complete a complete transform. Worse, Perfmon shows that during those 8 seconds I was hardly using one of my processors at all!
Listing 1 shows the C# code that performs a single Haar "step" in the horizontal direction. (Essentially, this is the same code as in Animation 1.) Listing 2 shows the same code transformed for multithreading. It's complex, but don't worry, we're going to provide an easier-to-use solution in a bit. In Listing 2's HorizontalStepDownMT(), we're going to use a ManualResetEvent to signal when the entire picture has been finished; we'll do that when the value of remainingRowsPlusOne becomes 0 and, at the beginning of the calculation, we set that value to 1.

Figure 2. The Single-Threaded Implementation of a Large Panorama Has Lousy Performance
Then, we define the horizontalTransform instance of.NET's WaitCallback type. This delegate receives as a parameter the offsets to the pixels it should process and then performs the Haar transform (just as in the inner-loop in Listing 1). After the calculation is performed, the value of remainingRowsPlusOne is decremented in a thread-safe manner using Interlocked.Decrement(). The use of remainingItemsPlusOne inside the delegate is an example of what's called a "closure" or "outer variable capture."
Remember, at this point, all we've done is defined the horizontalTransform() function, we haven't called it. So now, we iterate over y calculating the offset corresponding to the first pixel in row y, just like the outer loop in a href="javascript:showSupportItem('listing1');">Listing 1. As we loop, we increment remainingRowsPlusOne. Then, instead of performing the calculation by directly calling horizontalTransform(), we pass that delegate and the offset as parameters to the.NET function ThreadPool.QueueUserWorkItem(). In production code, the bool result of QueueUserWorkItem() should be checked for success and, on failure, either retried or cause an exception to be raised.
Behind the scenes, this will cause the CLR to manage execution of horizontalTransform() delegates with the appropriate offset on a Windows thread. The call to QueueUserWorkItem() returns asynchronously, which is to say that it doesn't wait for horizontalTransform() to actually execute (I get bugged when people say that asynchronous calls "return instantly," but they do return pretty darn quickly). Another strategy would be to explicitly create the Thread s and handle their lifecycles manually, which isn't very hard, but the ThreadPool class is convenient and handles thread creation and destruction for you, based on machine resources.
After the loop is completed, we decrement remainingRowsPlusOne and, if its value is 0, we call set() on our ManualResetEvent. The final line in the function calls the ManualResetEvent.WaitOne() function, which does not return until set() has been called on the ManualResetEvent.
Do you see why remainingRowsPlusOne has to be set to 1 initially and the final Interlocked.Decrement() call is needed outside of the loop on y ? Remember that we're queuing up instances of the horizontalTransform() delegate to be executed as soon as possible on whatever processor is available. It's possible (not likely, but definitely possible) that the very first execution of the horizontalTransform() will reach the call to Interlocked.Decrement() before the y -loop reaches the second call to Interlocked.Increment(). If that happened, the ManualResetEvent would be set() at that time. The y -loop would continue and all the horizontalTransform() delegates would be nicely queued up, but the WaitOne() call wouldn't cause execution to block. Therefore, the HorizontalStepDownMT() function would return, even if there were still calculations in the queue. Any functions that assumed the calculations were finished would be in a "race condition": behavior would depend on the order in which the two threads raced forward. That's a bad thing, since the outcome of the race could very well vary from machine-to-machine and run-to-run.

Figure 3. View the Difference Between the Single-Threaded and Multi-Threaded Version of This Program: 5.2 Seconds vs. 8 Seconds
Figure 3 shows the difference between the single-threaded and multi-threaded version of this program: 5.2 seconds versus 8 seconds (a 54 percent speedup). More importantly, Perfmon shows that both CPUs were pegged to the wall, so if I had quad-cores or more, my performance would continue to increase dramatically. (Download the source code for this application.) It should be noted that the code has functions that are dramatic improvements on the standard.NET GetPixel() and SetPixel() functions; these speedups use pointers and therefore the code needs to be compiled with the "/unsafe" compiler flag. The multithreading aspects of the code are CLR safe.) Further improvements could be made in a number of ways: queuing up larger chunks of work such as half a dozen lines per work item or transforming the algorithm to use integers (since it relies only on dividing by two and addition).
Virtually all of the discussion of Listing 2 was about threading infrastructure: very little was about the actual Wavelet transform. Listing 3 shows a more generic function called ParallelApply() that uses the ThreadPool to queue up execution of a WaitCallback() delegate on any Ienumerable collection. (Thanks go to Barry Kelly for helping improve the efficiency of my initial implementation.)
Any discussion of parallel programming has to acknowledge the difficulties: changing your mental model of how code is executed; race conditions; deadlocks; debugging, etc. These are all true. But often these warnings overshadow the often straightforward and sometimes downright easy scenarios. Since so much performance-oriented programming involves applying a calculation to large amounts of data that isn't interdependent (or whose dependencies can be factored out, as we did here), multithreading can often be done fairly easily.
Larry O'Brien is a recognized expert on.NET, and is a frequent writer and speaker on software development.
Listing 1.
void HorizontalStepDown()
{
for(int y = 0; y < height; y++)
{
int yOffset = y * width;
//Note x is being stepped 2 at a time
for(int x = 0; x < width; x+=2)
{
float v0 = original[yOffset + x];
float v1 = original[yOffset + x + 1];
float ave = (v0 + v1) / 2;
float diff = v0 - ave;
newValues[yOffset + x / 2] = ave;
newValues[yOffset + x / 2 + width / 2] = diff;
}
}
}</
Listing 2.
void HorizontalStepDownMT()
{
ManualResetEvent done = new ManualResetEvent(false);
int remainingRowsPlusOne = 1; //Yes, 1.
WaitCallback horizontalTransform = delegate(object state){
int offset = (int)state;
//Note x is being stepped 2 at a time
for (int x = 0; x < width; x += 2)
{
float v0 = original[offset + x];
float v1 = original[offset + x + 1];
float ave = (v0 + v1) / 2;
float diff = v0 - ave;
newValues[offset + x / 2] = ave;
newValues[offset + x / 2 + width / 2] = diff;
}
//Little bit of closure magic here
int isDone = Interlocked.Decrement(ref remainingRowsPlusOne);
if (isDone == 0)
{
// Debug.WriteLine("Final calc was in loop.");
done.Set();
}
};
for (int y = 0; y < height; y++)
{
int yOffset = y * width;
Interlocked.Increment(ref remainingRowsPlusOne);
ThreadPool.QueueUserWorkItem(horizontalTransform,yOffset);
}
int isDoneAlternative = Interlocked.Decrement(ref remainingRowsPlusOne);
if (isDoneAlternative == 0)
{
//Debug.WriteLine("Final calc occurred after loop.")
done.Set();
}
done.WaitOne();
}
Listing 3.
class Program
{
static void Main(string[] args)
{
int[] nums = new int[100];
for (int i = 0; i < 100; i++)
{
nums[i] = i;
}
Random r = new Random();
WaitCallback func = delegate(object memberOfEnumerable)
{
int i = (int)memberOfEnumerable;
Thread.Sleep(r.Next(1000));
Console.WriteLine(i);
};
ParallelApply(nums, func);
Console.ReadKey();
}
static void ParallelApply(IEnumerable enumerable, WaitCallback function)
{
using (ManualResetEvent done = new ManualResetEvent(false))
{
int doneRefCount = 1;
WaitCallback wrappedBlock = delegate(object state)
{
function.Invoke(state);
int isDone = Interlocked.Decrement(ref doneRefCount);
if (isDone == 0)
{
done.Set();
}
};
WaitCallback callback = new WaitCallback(wrappedBlock);
foreach (object o in enumerable)
{
Interlocked.Increment(ref doneRefCount);
ThreadPool.QueueUserWorkItem(callback, o);
}
int isDoneLate = Interlocked.Decrement(ref doneRefCount);
if (isDoneLate == 0)
{
done.Set();
}
done.WaitOne();
}
}
}