» Overview» Fine-Grained Threading» Explicit Fine-Grained Threading» Implicit Fine-Grained Threading» Tying Up the Threads
In the first part of this story, we discussed the need for threading applications. We discussed why single-threaded, or linear, applications don't scale to take advantage of new multicore processors and multi-processor computers. We also discussed that there are two different ways to scale, coarse-grained and fine-grained.
Coarse-grained threading architecturally separates the application up into separate functional units, each of which runs in its own thread. (That type of thread is generally as a process, if it has its own memory space.) Separating applications into coarse-grained threads will certainly help a complex application, like an e-mail client, more efficiently do many things at the same time, such as retrieving messages from multiple POP3 mailboxes while also sending multiple SMTP messages while conducting a message archive search while rending an HTML e-mail message with an embedded graphic.
However, coarse-grained threading won't accelerate those individual tasks, or providing them with the resources they need to handle a big job. If your message renderer has to scale a big PDF image to fit into a window, and if that message renderer runs in a single process, that task might take 15 seconds – whether you have 2 cores or 8 cores available. How do you make the individual task faster? Through fine-grained threading.
Fine-grained threading is opportunistic parallelization within a logical task. Generally, it's not part of the application's architecture; functionally, it is (or should be!) invisible, with the only benefit being that the application code can “burst” into spawning short-lived threads to tackle a compute-intensive task, decompose the task into pieces that can run across lots of processors to get the job done quickly, then merge those threads back together when the short-term task is done.
So, where coarse-grained threads would handle logical functions of the application like rendering messages or managing the message-search function, fine-grained threads would run within those coarse threads to implement a parallel search algorithms, or break up the scaling of a large graphic into a smaller one. In other words, they'd turn that 15-second search into a 5-second search or a 3-second search.
There are two types of fine-grained threading, explicit and implicit. Both have their pros and cons – or as I like to phrase it, good news and bad news.
Explicit fine-grained threading is hand-coded by the developer using a language like C/C++ or C#, with the help from threading libraries which can help with locks and memory management. The developer needs to identify those places in the code which could benefit from opportunistic parallelization, choose the proper algorithms, and then make the calls to spin off the threads, each with their piece of work, while the main thread manages the code, allocates the memory, integrates the results, frees the resources and then kills the threads.
The good news is that in the hands of an expert developer – and explicit threading is best left to experts with tremendous knowledge of the operating system, threading libraries and computer science – this technique can yield astounding performance improvements. That's why parts of critical applications, like games, or parallel math libraries, or graphics programs, have key routines that have been hand-coded for optimal results. Such sections of code are identified using run-time profiling, or are tackled during refactoring exercises.
The bad news about explicit parallelization is that it does require an expert – and even then, it's difficult. Multi-threaded programs can be non-deterministic. It's very difficult to ensure that they'll give the right results consistently, or that they'll even terminate. This is where race conditions and deadlocks come into play; it's easy for one thread to stomp all over another thread here, and take down the application. Explicit fine-grained threaded routines are also typically coded (either intentionally or subconsciously) for a specific number of processor cores, and run-time results may suffer if the hardware has more or fewer cores than expected, or if the operating system doesn't give the application the optimal resources.
Most testing tools can't do much with explicit fine-grained threading. Source code analyzers can't always detect problem conditions or tell you if the threads will yield the right results. Functional and stress tests may not exercise all the pathways, or detect subtle errors brought about by locking problems or thread sync errors. And don't get me started on memory leaks… explicit threading is a tool best used with caution. But again, it's your only route to the utmost in performance tuning. (We'll be covering explicit threading more on the AMD Developer Center in the future.)
Implicit fine-grained threading is much, much simpler and is the approach that I recommend for most developers. It uses a system like OpenMP, which uses a series of in-code pragmas and code libraries to tell the compiler to automatically parallelize a piece of code.
OpenMP avoids almost all of the difficulties associated with explicit fine-grained threading, and is much easier to perform. OpenMP is also available for every platform and every major compiler, thanks to its standing as a vendor-neutral interface for threading – and it's even free. On the other hand, OpenMP may not be suitable for all tasks, and doesn't offer as tight an optimization as hand-coded explicit threading written by an expert.
To use OpenMP, you link in the appropriate library, and use code pragmas to indicate to the compiler areas where you think that the code can be parallelized. OpenMP excels at parallelizing loops where there aren't external dependencies, and where the compiler can figure out how to optimize the code. It's not so good for situations where there aren't loops.
So, for example, you can use OpenMP to do things like breaking up the task of rendering and scaling a huge PDF file into a small graphic, if you have an algorithm that can break that task into smaller pieces. First, the main thread divides the graphic's data set into chunks that can be rendered and scaled independently. Then, you write the code that does the rendering and coding, as if it were a loop that worked on each chunk in sequence – but use pragmas to tell OpenMP that this can be optimized. (You must take care to ensure that the code truly can be parallelized, and that there are no dependencies that can mess you up if the outer loop interactions runs out of sequence.) After this code, the main thread integrates the results.
At compile time, the compiler will then automatically analyze the code and generate the correct code to spawn the fine-grained threads, let them work, and then integrate the results. If you've designed the algorithm correctly, you get the performance gain of explicit fine-grained threading, without the hard work, and with fewer opportunities for error. Best of all, the code remains easy to read and study. In fact, the compiler can generate both a fine-grained threaded and non-threaded version of the code, so you can stress test them both to ensure that they're providing the same answers. You can also use a code profilers on both to see if everything is working correctly.
An excellent example of OpenMP in action is Performance Optimization of 64-bit Windows® Applications for AMD Athlon 64™ and AMD Opteron™ Processors using Microsoft Visual Studio 2005. It shows how parallelization can accelerate a Mandelbrot generator using C++, assembly code and Windows; the code works on both 32-bit and 64-bit Windows.
Is OpenMP perfect? No. You still can create race conditions or deadlocks, if the logic or algorithms are wrong. OpenMP is also only good for situations where the parallel tasks use the same code; think of it roughly analogous to loop unrolling on a massive scale. (You can thread other things beyond loops in OpenMP, but that's where it's usually applied.) We'll be writing more about OpenMP in the AMD Developer Center as well.
Here are four points to take away:
- Single threaded applications don't scale to take advantage of modern hardware and operating systems.
- Coarse-grained threads, which generally correspond to processes, help your applications perform better and do many things at the same time, but doesn't help them perform specific tasks faster using all hardware resources.
- Explicit fine-grained threading can offer tremendous performance advantages, especially in situations that require complex data or memory management, or where the application doesn't lend itself to implicit fine-grained threading using OpenMP. However, this programming is difficult, requires considerable expertise, and in non-trivial implements can be extremely hard to test or debut.
- Implicit fine-grained threading, such as using OpenMP, makes it easy to improve application performance by getting rid of CPU-intensive bottlenecks. However, it's only suitable for cases where there are available algorithms, clear functional decomposition, and no chance of deadlock or other thread problems.
Alan Zeichick is a technology consultant and analyst who focuses on software development and microprocessor technology. Reach him at zeichick@camdenassociates.com.