Multiple Cores + Shared Caches = Application Performance Boosts—If You Know How
The innovative design used in multicore processors from Advanced Micro Devices can yield unexpected dividends, if you take special advantage of architectural constructs like shared caches. We’ll show you how that works with the new AMD Phenom™ quad-core family.
Anderson Bailey
7/14/2008
How do cores communicate? Most of the time, by using main memory. One core writes data to a common area, where another core can retrieve it. That's one way, but it's not the only way. If you're working in an environment where you'll have two, four or more cores within a processor, you can leverage shared cache to make inter-core communication even faster than it is through main memory.
Bear in mind that system RAM is considerably slower than shared cache. Not only is the memory itself slower, but routing data accesses through the memory controller adds further delay. That's why it's desirable to place data in shared cache, which runs faster and can circumvent the memory controller under the right circumstances. There, all cores on the same chip die can access the data.
Now, let's be clear up front: Not all applications can benefit from this approach. Most of the time, you'll want to use main memory to share information. So, why are we talking about this? For three reasons: First, cache sharing is a tool that advanced programmers will want to have in their toolbox. Second, our discussion will give you a peek under the hood of how multicore processors operate, and that may inspire other ideas. And third, it's interesting!
We'll begin our exploration with a situation in which multiple cores need to share data via a threaded program. One thread produces data and one or more threads process that data. This arrangement is referred to as producer-consumer, and is encountered from time to time in parallel programming.
The data structure commonly used for producer-consumer is the circular queue, in which the producer thread adds blocks of data at one end (the head), while the consumer thread picks them up from the other end (the tail). In queues involving multiple threads, you'll need a series of safeguard mechanisms to control access to the circular queue, and to make sure the two threads never collide when accessing data. The ideal situation is to have the consumer thread lag slightly behind the producer thread, with both moving at roughly the same speed. As long as the lag is smaller than the size of the queue, the two threads will never have to wait on each other. So, you'll need to size the queue to absorb the jitter between the two threads, and also the delay you expect between the moment the producer writes a block and the moment the consumer reads it.
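To make the head and tail bookkeeping concrete, here is a minimal sketch of a single-producer, single-consumer circular queue in C11. It is an illustration under stated assumptions, not code from AMD: the names (spsc_queue, spsc_push, spsc_pop, spsc_lag), the 64-byte block size, and the 1024-block capacity are all hypothetical. The "safeguard mechanism" used here is a pair of atomic counters, which is one common choice among several.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_BYTES 64u    /* hypothetical block size: one cache line */
#define QUEUE_BLOCKS 1024u /* hypothetical capacity; must be a power of two */

static_assert((QUEUE_BLOCKS & (QUEUE_BLOCKS - 1u)) == 0u,
              "QUEUE_BLOCKS must be a power of two");

typedef struct {
    unsigned char blocks[QUEUE_BLOCKS][BLOCK_BYTES];
    atomic_size_t head; /* blocks written so far; advanced only by the producer */
    atomic_size_t tail; /* blocks read so far; advanced only by the consumer */
} spsc_queue;

static void spsc_init(spsc_queue *q) {
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side: copy one block in at the head. Returns false if the queue is full. */
static bool spsc_push(spsc_queue *q, const void *block) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == QUEUE_BLOCKS)
        return false; /* producer has caught up with the consumer from behind */
    memcpy(q->blocks[head & (QUEUE_BLOCKS - 1u)], block, BLOCK_BYTES);
    /* Release: the block contents become visible before the new head value. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: copy one block out at the tail. Returns false if the queue is empty. */
static bool spsc_pop(spsc_queue *q, void *block) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (head == tail)
        return false; /* nothing produced yet */
    memcpy(block, q->blocks[tail & (QUEUE_BLOCKS - 1u)], BLOCK_BYTES);
    /* Release: the slot is handed back to the producer only after the copy. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* How many blocks the consumer currently trails the producer by. */
static size_t spsc_lag(spsc_queue *q) {
    return atomic_load_explicit(&q->head, memory_order_acquire) -
           atomic_load_explicit(&q->tail, memory_order_acquire);
}
```

In use, the producer thread calls spsc_push in a loop and the consumer thread calls spsc_pop; neither blocks, so the caller decides what to do when a call returns false. The value returned by spsc_lag is the quantity the rest of this article is concerned with keeping inside a particular window.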
In our two-thread example, one core will be the producer, the other the consumer. Let's see how we can create a fast queue in shared cache for optimal performance.
AMD's Barcelona architecture, which is the architecture used in the AMD Phenom™ and the quad-core AMD Opteron™ processors, employs a three-level cache design. The innermost cache is 64KB L1 cache located in the processor core; the L2 cache is 512KB. Both of these caches are dedicated, private caches used only by their respective core. In other words, they're not shared. The L3 cache, which on the current generation of chips weighs in at 2MB, is shared by all cores on the processor. This is where we wish the data queue to be located for our exercise.
On the Barcelona architecture, data is always loaded into the L1 cache first, and when L1 fills up it is eventually evicted to L2. When L2 is full, the data is evicted to the L3 cache. This scheme of progressive eviction is referred to as a "victim" cache design. In the case of the Barcelona architecture, it's an 'exclusive' victim cache, meaning that when a cache line is evicted from one cache to another, it is removed from the first cache. In other words, except in specific situations, there should never be more than one copy of a cache line in the entire cache hierarchy.
From this design, you can see that the producer thread must generate more data than can fit in the L1 and L2 caches before any data can reach the shared L3 cache. In the current generation of Barcelona processors, this equals 512KB for the L2 and 64KB for the L1, for a total of 576KB of data. When the producer thread generates more than 576KB of data, the output pushes previously generated data blocks from L2 into the shared L3 cache, where they are accessible by the consumer thread. Bear in mind, of course, that those cache sizes aren't absolute, and certainly can vary depending on processor model. Don't hard-wire them into your application designs! If you don't know the exact cache sizes of your processor, use the CPUID instruction to find out (for more on CPUID check our blog).
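As a rough sketch of the CPUID approach, the C fragment below decodes the cache-size fields that AMD processors report in extended CPUID leaves 0x80000005 (L1) and 0x80000006 (L2 and L3). The helper names (cache_sizes, decode_cache_sizes, query_cache_sizes) are hypothetical, and the query function assumes a GCC or Clang toolchain that ships the cpuid.h header. On processors that do not populate these leaves in AMD's layout, some fields may simply read as zero, so treat a zero as "not reported".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helper header; assumed to be available */
#endif

typedef struct {
    uint32_t l1d_kb; /* L1 data cache size in KB */
    uint32_t l2_kb;  /* L2 cache size in KB */
    uint32_t l3_kb;  /* L3 cache size in KB */
} cache_sizes;

/* Decode the size fields from the raw register values (AMD layout). */
static cache_sizes decode_cache_sizes(uint32_t leaf5_ecx, uint32_t leaf6_ecx,
                                      uint32_t leaf6_edx) {
    cache_sizes s;
    s.l1d_kb = leaf5_ecx >> 24;          /* leaf 0x80000005 ECX[31:24]: KB */
    s.l2_kb = leaf6_ecx >> 16;           /* leaf 0x80000006 ECX[31:16]: KB */
    s.l3_kb = (leaf6_edx >> 18) * 512u;  /* leaf 0x80000006 EDX[31:18]: 512 KB units */
    return s;
}

/* Returns 1 and fills *out if the leaves could be read, 0 otherwise. */
static int query_cache_sizes(cache_sizes *out) {
    assert(out != NULL);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx5, edx5, ecx6, edx6;
    if (!__get_cpuid(0x80000005u, &eax, &ebx, &ecx5, &edx5))
        return 0;
    if (!__get_cpuid(0x80000006u, &eax, &ebx, &ecx6, &edx6))
        return 0;
    *out = decode_cache_sizes(ecx5, ecx6, edx6);
    return 1;
#else
    (void)out; /* not an x86 target: nothing to query */
    return 0;
#endif
}
```

With the figures quoted in this article (64KB L1, 512KB L2, 2MB L3), the decoded structure would hold 64, 512 and 2048, and the private-cache total used in the discussion is l1d_kb + l2_kb = 576.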
As we've discussed, leveraging the L3 cache can be a very efficient way of sharing data items between a processor's cores. As the producer thread begins pushing data into the L3 cache, the consumer thread needs to begin reading the data blocks quickly. If the delay before the consumer thread retrieves the data is too great, the producer threatens to flood the L3 cache, in which case new data evicts unprocessed data and the consumer thread is forced to retrieve the evicted data from system RAM.
Avoiding flooding the L3 cache is one precaution. Providing data fast enough for the consumer thread is another. If the producer thread does not produce enough data to start filling L3, the consumer thread will not find the data there and will try to retrieve it from system RAM. Given the exclusive nature of the cache hierarchy, the right lag between the two threads is about the size of the private caches on a core, which is about 576KB of data on today's processors. Again, that number can vary, which is why you should consider this an exercise, not an absolute recommendation.
Kent Knox, a member of AMD's technical staff who has written about this issue, offers some constraints that can help you in planning, assuming that a single buffer or queue has been defined for the producer and consumer threads to walk (you can read more in the AMD Developer Blog):
- The consumer thread needs to 'lag' the producer thread by at least L1 & L2 cache size (modulo arithmetic)
- The producer thread needs to 'lag' the consumer thread by at least L1 & L2 cache size (modulo arithmetic)
- The buffer should be at least 2*(L1 & L2)
- The producer thread should not get so far ahead of the consumer that it floods the L3, if a large buffer is used
- Add a small fudge factor to the calculated sizes to give the threads some 'slack' when communicating through the caches
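The constraints above reduce to a little arithmetic, sketched here in C. This is one reading of the list rather than a formula from the blog: the names (queue_plan, plan_queue, lag_in_window) are hypothetical, all sizes are in bytes, and the upper bound on lag (private caches plus L3) is an assumption about when unread data would start spilling out of the L3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t min_lag;    /* each thread should trail the other by at least this */
    size_t max_lag;    /* assumed point beyond which unread data floods the L3 */
    size_t min_buffer; /* smallest buffer that satisfies both lags: 2 * min_lag */
} queue_plan;

/* Derive the planning numbers from cache sizes and a fudge factor (bytes). */
static queue_plan plan_queue(size_t l1, size_t l2, size_t l3, size_t slack) {
    queue_plan p;
    p.min_lag = l1 + l2 + slack;  /* private caches plus some slack */
    p.max_lag = l1 + l2 + l3;     /* assumption: everything still cached */
    p.min_buffer = 2 * p.min_lag; /* room for both lags at once */
    return p;
}

/* Check head/tail byte offsets against the plan, using modulo arithmetic. */
static bool lag_in_window(const queue_plan *p, size_t head, size_t tail,
                          size_t buffer) {
    assert(buffer >= p->min_buffer && head < buffer && tail < buffer);
    size_t consumer_lag = (head + buffer - tail) % buffer; /* consumer behind producer */
    size_t producer_lag = buffer - consumer_lag;           /* producer behind consumer */
    return consumer_lag >= p->min_lag &&
           producer_lag >= p->min_lag &&
           consumer_lag <= p->max_lag;
}
```

For the cache sizes quoted in this article and no slack, plan_queue yields a minimum lag of 576KB and a minimum buffer of 1152KB. Note that with a buffer of exactly twice the minimum lag, only one lag value satisfies both of the first two constraints, which is a practical argument for the fudge factor and for a somewhat larger buffer.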
For the whole system to work properly, the consumer thread needs to read blocks quickly and efficiently. To do this, the code should rely on the prefetchw instruction. This instruction takes advantage of the cache coherency protocol that Barcelona uses when reading and writing cached data. Data moves in and out of a cache in fixed-size units called "cache lines."
When a cache line is placed in the L3 cache, a metadata tag identifies it as an M cache line (technically, the M stands for modified, indicating the cache line is different from the data in the corresponding RAM address). When a core reads the block it changes the M state to an O (owned). This tells the cache controller that the cache line is shared by some core and so it cannot overwrite the cache line without cache coherence being enforced (that is, making sure that it is not overwriting data that is still referenced by multiple cores).
At some point, the producer thread circles through the circular queue and needs to overwrite previous blocks of data. If the cache lines for the queue have the metadata bit set to O, the data cannot be over-written without performing a cache coherency check, which is a time-consuming process that destroys the performance benefits of this technique.
The prefetchw instruction, however, retrieves the cache line without changing the M state to an O. As a result, the core can overwrite the old cache lines without performing coherency validation. Here's how prefetch instructions work, as described in the "Software Optimization Guide for AMD Family 10h Processors":
Prefetch instructions take advantage of the high bus bandwidth of AMD Family 10h processors to hide latencies when fetching data from system memory. A prefetch instruction initiates a read request of a specified address and reads the entire cache line that contains that address. AMD Family 10h processors perform three types of prefetches:
- Load: Reads the data into the L1 data cache; the data is later evicted to the L2 cache. The following instructions perform load prefetches: PREFETCH, PREFETCHT0, PREFETCHT1, and PREFETCHT2.
- Store: Reads the data into the L1 data cache and marks the data as modified; the data is later evicted to the L2 cache. The PREFETCHW instruction performs a store prefetch.
- Nontemporal: The PREFETCHNTA instruction performs a nontemporal prefetch. The data is read into the L1 data cache; to avoid cache pollution, when a PREFETCHNTA misses in the L2 cache and reads from memory, the data is never evicted to the L2 cache. When a PREFETCHNTA hits in the L2 cache, the data is evicted back to the L2 cache.
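As an illustration of the store prefetch in a consumer loop, the C sketch below requests each cache line with write intent a few lines ahead of where it is reading. It relies on the GCC/Clang builtin __builtin_prefetch with its second argument set to 1 (prepare for write). Whether the compiler actually emits PREFETCHW for that call depends on the target flags (for example -mprfchw with GCC), so treat that as an assumption to verify in the generated assembly. The function name consume_block, the 64-byte line size, and the prefetch distance of four lines are hypothetical choices; the "work" done on each block is just a byte sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u    /* assumed line size; confirm with CPUID on real hardware */
#define PREFETCH_AHEAD 4u /* hypothetical prefetch distance, in cache lines */

/* Sum the bytes of one block, prefetching upcoming lines with write intent. */
static uint64_t consume_block(const unsigned char *block, size_t bytes) {
    assert(block != NULL || bytes == 0);
    uint64_t sum = 0;
    for (size_t off = 0; off < bytes; off += CACHE_LINE) {
        size_t ahead = off + PREFETCH_AHEAD * CACHE_LINE;
        if (ahead < bytes) {
            /* rw = 1 asks for a store prefetch; locality = 3 keeps the line close. */
            __builtin_prefetch(block + ahead, 1, 3);
        }
        size_t end = (off + CACHE_LINE < bytes) ? off + CACHE_LINE : bytes;
        for (size_t i = off; i < end; i++)
            sum += block[i]; /* stand-in for real processing of the line */
    }
    return sum;
}
```

The prefetch is only a hint, so the function returns the same result with or without it; the difference, if any, shows up in timing. The right prefetch distance depends on how much work is done per line and should be measured rather than assumed.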
Before wrapping up, I'll point out the obvious: the L3 cache is shared only by cores on the same processor. To use this approach on a system with multiple Barcelona chips, it is necessary to make sure the producer and consumer threads are both running on the same processor package. This can be done via processor affinity. Most operating systems have APIs that enable processor affinity so that developers can place specific threads on specific chips. Of course, if you're working on a machine with only a single processor (like a laptop or typical desktop), that's not an issue!
The method described here is applicable only in a few narrow application situations, but it does illustrate how shared-cache systems work and the tradeoffs involved.
The producer-consumer relationship between two threads is a standard, nearly inevitable, part of parallel programming. However, it can be a significant performance bottleneck. Two threads waiting on one another is an inherently sequential arrangement, not a parallel design, so it reduces the performance benefit of parallelizing the code. As a result, anything that makes producer-consumer situations work faster is to be desired. That's why this cache-sharing technique, in the appropriate situations, can be such a big boost.
This article has shown how exploiting processor architecture can greatly improve performance. The method described here may not apply to your code, but whenever you are working on an extended optimization task, you'll find that greater knowledge of how the processor works will greatly assist your pursuit of top-end performance. You can dig more deeply into these caches and the inter-core data transfers in section 11.5 of the "Software Optimization Guide" referenced earlier, which starts on page 199.
For even more fascinating information about how AMD Phenom/Barcelona quad-core processor features can be put to good use in demanding applications, see http://developer.amd.com/documentation/guides/Pages/default.aspx
Anderson Bailey is a developer with a longstanding interest in the techniques for using code to exploit processor features. He can be reached at chip.coder@gmail.com.
© 2009 Advanced Micro Devices, Inc.
AMD, the AMD Arrow logo, AMD
Opteron, AMD Athlon, AMD Turion, AMD
Sempron, AMD LIVE!, and combinations
thereof, are trademarks of Advanced
Micro Devices, Inc. Microsoft and
Windows are registered trademarks of
Microsoft Corporation in the United
States and/or other jurisdictions.
Linux is a registered trademark of
Linus Torvalds. Other names are for
informational purposes only and may
be trademarks of their respective
owners.