AMD Logo AMD Developer Central

Performance Optimization of Windows Applications on AMD Processors, Part II 

Skip Navigation LinksHome > Docs & Articles > Articles & Whitepapers
Part II: Cache and Memory
Michael Wall, Principal Member of Technical Staff, Advanced Micro Devices, Inc.   6/16/2008 

» Introduction
» Memory optimization
» Conclusion

Introduction


» Performance Optimization of Windows Applications on AMD Processors, Part I

 

» Download cacheandmemory.zip

 

Multi-core microprocessors offer more raw CPU horsepower than ever before, but software developers need to face several challenges to really deliver on the promise of higher performance. One of those challenges involves the increased demands on the cache and memory system.

Memory latency and bandwidth historically have been a limiting factor for many applications, even with single-core processors. But when two, four or even more processors (cores) reside in the same CPU package and compete for cache and memory resources, the problem becomes even more acute. This paper describes techniques developers can apply to make more efficient use of those resources.

Memory system basic description

Back in the early days of computers, memory systems were simple. The processor issued a memory request, waited while the data was read or written, then resumed execution. This kind of sequential processing is easy to understand, but is no longer accurate: the never-ending quest for higher performance has produced modern systems with far more complex behavior. Instructions are executed out-of-order, and many memory operations can be "in flight" simultaneously. In other words, the hardware attempts to hide memory latency by implementing greater concurrency in the memory system.

Cache system basic description

To reduce the effective latency of memory even further, processors include a relatively small amount of fast on-chip memory called the cache. Details vary, but virtually every modern processor includes some kind of cache. Typically, memory data which has recently been read or written resides in the cache. The assumption is that recently used data is likely to be needed again soon, and the cache provides faster access to that data.

The smallest unit of data that moves between the cache and the main memory is called a cache line. The cache line is a set of N bytes in sequential memory addresses and is “atomic”, which means that the entire line is moved to and from memory as a unit. In most AMD processors, cache lines are 64 bytes in size and they are "naturally aligned," which means they start at an address which is an integer multiple of 64.

For example, if a program running on an AMD Phenom™ or AMD Opteron™ processor reads a single byte from memory, the requested byte and the 63 other bytes in adjacent memory addresses will be read into the cache:

A rule of thumb

Before diving into details of programming techniques, consider a basic rule of thumb. To help developers think about how data flows through the machine, and to make good judgments about where optimization might help, it can be useful to model the memory system performance using an order-of-magnitude approximation. The Level One or L1 data cache is assumed to have a latency of one unit, or 1x. Higher cache levels such as L2 or L3 cache are modeled as having a latency of 10x, and main DDR memory latency is 100x.

Memory system level

Relative latency

L1 cache

1x

Higher cache levels

10x

Main memory

100x

Although the exact details are more complicated, and differ from one platform to another, keeping these rough approximations in mind can be a useful guide.

Memory optimization

Reading data from main memory is a relatively slow operation. Of course an application must read memory data, but how can the latency be minimized? One answer to this question is simple: to get data sooner, issue the memory request sooner. The processor executes instructions out-of-order, so it has the ability to "look ahead" and issue memory read requests in advance. But in every case, the processor must know the memory address before the memory request can be started. If the addresses are known then the processor can initiate memory operations well in advance, keeping several "in flight" concurrently, effectively hiding the memory latency to some degree.

Arrays vs. linked lists

With this fact in mind, let's look at two very common ways developers store data in their applications: arrays and linked lists. Arrays are conceptually the simplest way to store a collection of similar data items. The items are arranged linearly in memory, and addressed using an index value. They can be accessed sequentially or randomly. Arrays work fine when a fixed number of items are stored, but adding or removing items is slow and awkward.

Linked lists provide versatility for adding, re-ordering, and removing items. The key to implementing this capability is that every item includes at least one pointer, which links to the next element in the list. Items can be allocated as needed, and can reside anywhere in the memory address space. However, basic linked lists are inherently a sequential system: to reach a given element, a series of pointers must be followed. It's not possible to access items in a random order.

Because of the way items are accessed, arrays and linked lists can have very different behavior when accessing memory. We may ask the important question, "how can we generate addresses sooner?" The answer for arrays is simple. Typically, the index of an array is generated by a loop counter or other simple computation. In these cases, when the processor executes instructions out-of-order, it can easily calculate the address and begin the memory operation in advance. Therefore, several array memory operations will often be "in flight" simultaneously.

For linked lists, the story is quite different. The items in a linked list are not arranged neatly in memory; each item can potentially reside at any random address. Therefore, the address of a linked list element cannot be calculated in advance. The address must be read from memory. Traversing a linked list involves reading the first element, waiting for the data from memory, then starting the second memory access, waiting for that data, then starting the next access, etc.

Let's see a benchmark comparing the two ways of storing data.

(please run tests #1a and #1b in the Visual Studio programming demo)

As you can see, traversing an array gives the processor ample opportunity for concurrency in generating memory requests. Overlapping memory operations results in lower effective latency per item. In contrast, traversing a linked list is fundamentally sequential. The penalty is especially bad when the array elements are randomly scattered in memory: the full memory latency cost is incurred for accessing each individual item.

The simple conclusion is that arrays should be preferred over linked lists, when they are capable of doing what's needed. This is especially true when the data will typically be accessed in a sequential manner.

Improving performance by improving locality

But what about the other cases, where the versatility of a linked list is necessary? Fortunately, there are some tricks for making linked lists work faster. These techniques are especially relevant for relatively small data items, in a list that doesn't change too rapidly.

Trick #1 is to lay out the elements of the list in sequential memory locations. In other words, sort the elements in ascending or descending address order, just as they would appear in an array. (Remember how test #1a, fully sorted elements, showed the best linked list performance.) This arrangement helps performance in two ways. First, adjacent elements may share space in the same cache line, so accessing element N will actually bring part (or all) of element N+1 into the cache (more details on cache lines later). Secondly, the processor has automatic data prefetchers which can detect regular address patterns, and speculatively fetch the next data elements before the program actually requests them.

Trick #2 for linked lists is a bit more subtle. It involves creating a separate table (an array) to store the sequential addresses of linked list elements. During subsequent passes through the linked list, this "history table" is traversed concurrently by the code and is used to prefetch the linked list elements that will be needed in the near future.

(please run test #2 in the Visual Studio programming demo)

Look at the source code, and notice how little extra code is required to implement the history table.

void build_history_array(linky* list, linky** history_array) {
       while(list) {
              *(history_array++) = list;
              list = list->next;
       }
}
linky* traverse_list_with_history (linky* list, linky** history_array, int n) { 
       while(list) {
              _mm_prefetch((char*) *(history_array + n), _MM_HINT_NTA);
              history_array++;
              list = list->next;
       }
       return list;
}

Also, notice the use of the_mm_prefetch compiler intrinsic function. This software prefetch instruction tells the processor to read data from the specified address, and store the data in the L1 cache. The goal is to hide the memory latency and have the data sitting in the L1 cache by the time it's actually needed. When software prefetch is used with the hint _MM_HINT_NTA, the data is specially tagged as non-temporal, which means it will not subsequently get evicted to the L2 cache. This option can improve performance when a large data set is used in a "single-pass" manner, because it avoids displacing more useful data (and code) from the L2 cache.

In effect, the code can use the history table to effectively look several steps into the future. It can begin fetching elements before their pointers have actually been read from memory. Of course, entries in the history table will become stale whenever the linked list topology is modified. This means a certain fraction of the software prefetch instructions may not actually fetch the most useful data, although they will not cause a crash or any actual program errors. To keep the history table up to date in a continuous manner, the correct addresses can simply be stored back into the history table on each working pass through the actual linked list.

We've seen a few methods for reducing the latency to access main memory. Along the way, the cache has emerged as an important part of the whole process. Let's look at the cache in more detail, and see how we can use it efficiently to squeeze out even more performance.

Cache optimization

The data cache is the fastest data storage in the system, but it's also very small. AMD Phenom™ and AMD Opteron™ processors typically have only 64 Kbytes of L1 data cache. Using the cache efficiently is essential to optimal application performance. There are two golden rules for managing the cache: (1) minimize latency by reading data into the cache before it's actually needed, and (2) keep only useful data in the cache.

We've already seen how using arrays enables the processor to automatically request memory in advance, and we examined using the software prefetch instruction to manually bring important data into the cache before the code actually tries to use the data. Furthermore, we have seen that the hardware prefetcher can automatically perform the same task when memory accesses form a simple pattern like an ascending or descending sequence. Please see the Software Optimization Guide for further details regarding software and hardware prefetch.

Dense = good

Applications should also take care to avoid filling the cache with data that isn't needed. Remember, the cache is very small compared to main memory and every unnecessary data byte in the cache is taking up space that could be used beneficially by more relevant data. One simple way to avoid wasting cache space is to use smaller data types when possible. For example, a data structure might include an integer numeric field that ranges in value between 1 and 1000. Often a 32-bit INT would be used to store this value, but in this case that is overkill; a SHORT (two bytes) would work as well, and take up half the space.

In a similar vein, developers should be aware that compilers often add padding between elements in a data structure. The padding is added to ensure elements are "naturally aligned" in memory, which sometimes helps performance but other times the padding is just a waste of cache. Use the sizeof() function to check how much space a structure is actually using; you may discover bytes that can be reclaimed. The padding can be controlled by using compiler options like /Zp1 (Microsoftâ Visual Studioâ) and -fpack-struct (gcc). There are also pragmas which can give finer control of padding, for example #pragma pack in Microsoft and __attribute__((packed)) in gcc. Experiment to see if the default alignment behavior gives optimal performance, or if removing the padding helps. See the compiler documentation for more details on padding.

To visually summarize the value of dense data packing:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Sparse data = lower performance L                                            Dense data = higher performance J

Place “hot” data items close together

Even more can be done to optimize data structures. In many applications, just a few structure elements are used frequently, and the rest are rarely used. For example, the link pointers in a linked list of structures are used all the time, whereas other elements in the structures may not be. This fact suggests an extremely simple optimization: arrange the elements in the structure so the "hot" elements are adjacent in the structure definition, hence adjacent in memory!

slow code

struct foo {
             foo* next;  // link pointer is “hot”
             char rarely_used [80];
             int often_used; // other “hot” data on separate cache line!
}


faster code

struct foo {
             foo* next;  // link pointer is “hot”
             int often_used; // other “hot” data likely on same cache line J
             char rarely_used [80];}

Why is this a good idea? Remember, the smallest unit of data in the cache is the cache line, e.g. 64 sequential memory bytes. Placing frequently accessed structure members in adjacent memory locations increases the probability that they'll reside on the same cache line. When the first element is accessed, the others will get a "free ride" from main memory to cache when the first element is accessed. Also, if the required data is found on a single cache line, the cache space is used more efficiently.

Taking this approach to an extreme, the structure might be split into a pair of smaller structures: one for "hot" data, the other for "cold" data. The code can build a separate linked list of each structure type. The code can quickly traverse the densely packed "hot" list, filling the cache with useful data, while only dipping occasionally into the "cold" list.

Classic cache optimization: cache blocking

The cache is so important for performance that a very simple but important optimization technique has been developed called "cache blocking." This technique involves processing data in blocks that fit within the cache. For example, consider processing a large 10MB data set.

The original algorithm
        for each item in the set, do operation A   // must read from memory
        for each item in the set, do operation B   // must read again from memory
        for each item in the set, do operation C   // must read again from memory

The cache-block optimized algorithm
        for each cache-size block in the set
                do operation A on the block   // must read from memory
                do operation B on the block   // it's still in cache! fast!
                do operation C on the block   // it's still in cache! fast!
        next block

The latter approach can dramatically improve performance, by reducing the number of times main memory is accessed.

Data cache associativity and long-stride data access

The L1 data cache is organized in blocks called “ways.” In AMD AthlonTM 64, AMD Phenom™, and AMD Opteron™ processors the L1 data cache has two “ways,” also called two-way associative. This means any given address in memory can potentially be mapped onto two different lines in the cache. If both of these lines already contain useful data, then useful data will be displaced to make room for the new data.

How does the CPU decide which two lines correspond with a given memory address? The assignment is done according to the address of the data. Since the total L1 D-cache size is 64 Kbytes, each “way” is 32 Kbytes in size. So, the address modulo 32K determines which two cache lines can be used. This algorithm works very well for contiguous sequential spans, but there is one case where it works poorly. If many active data items reside at addresses which are equal modulo 32K, then many reads will be contending for the same two cache lines, much data will be evicted and performance will suffer. This situation is called “cache thrashing.”

For example, consider an image processing program which stores data in color planes: one buffer for red, one buffer for green, and one buffer for blue. Think about what happens when the code reads the data representing a single pixel. It must read three values, one from each buffer. All three values can reside in cache, unless the buffers are unfortunately aligned to the same address modulo 32K, to within the size of a cache line. This alignment might happen, for example, if the image size is close to a “round number” in binary, like 512 Kbytes or 1 Megabyte.  Another way it can happen is when the width of an image is a large round number in binary, and the code is processing a neighborhood of pixels on consecutive scan lines, for example when computing a spatial filter.

This kind of cache thrashing is very bad for performance, but fortunately it can easily be fixed. When allocating storage for this kind of data, a small number of padding bytes can be used to offset the different data elements, staggering them so they no longer map onto the same cache lines. The effectiveness of this padding can be tested by running a range of different data sizes, straddling the “round number” values, and measuring performance. There should be no abnormally slow cases if the padding is implemented correctly. 

Optimizing write operations: the Streaming Store

When your code stores data in memory, it's actually storing the data in the cache. This is normally a good thing. However, sometimes you are storing a large block and you don't want to disrupt the cache. Fortunately, there is a simple way to achieve this: the Streaming Store, supported in all modern x86 processors. Let's look at the situation in a bit more detail.

Normally, when your code writes to "memory," the data is actually going into the L1 data cache, then sometime later it will actually get evicted from the cache and written into the DRAM memory. There is overhead associated with this process. First, before the data can get written to cache, the CPU must allocate a cache line. When a cache line is allocated, the corresponding memory region (64 bytes) is READ into that cache line. Yes, even when you are doing a WRITE, the CPU must first execute a READ from memory. Then the new values are written into the cache line by your code. Later, the entire cache line gets evicted from L1 data cache up to higher cache levels, and eventually out to main memory. Clearly, this process is consuming memory bandwidth (to initially fill the cache line, and eventually to write new data to memory) and cache space (to store the data at each level of cache). Data that was displaced from the cache will likely need to be reloaded soon, which takes time.

What if you are writing an array of several Kbytes or Megabytes, and don't need to have the data in cache? Can you avoid the overhead described above? Yes. Streaming Store will write data into a special Write Combine (WC) Buffer, which is the same size as a cache line. When full, the Write Combine Buffer is written out to main memory, bypassing the cache completely. The net result: only half the memory bandwidth is needed (no read), and the cache does not get "polluted" with the new data. So performance can improve significantly.

How do you implement the Streaming Store? There was a time when assembly language was required, but now Microsoft Visual Studio and also GCC both support Streaming Store as intrinsic functions in C/C++. There are two limitations:

  • you need to use a 128-bit (16-bytes) data type, for example __m128i for packed integers or __m128 for packed float
  • you must write to an address that is 16-byte aligned

Let's look at one example. Suppose you have an array of bytes, which you want to copy using Streaming Store. To further reduce cache pollution, we will also use the non-temporal software prefetch we saw previously:

        #include <emmintrin.h> 
        int nontemporal_copy(char* outbuff, char* inbuff, int size) {
                const int step = 64; // a handy unroll factor, equal to WC buffer size
                while(size > step) {
                        _mm_prefetch(inbuff + 320, _MM_HINT_NTA); // non-temporal
                        __m128i A = _mm_loadu_si128((__m128i*) (inbuff + 0));
                        __m128i B = _mm_loadu_si128((__m128i*) (inbuff + 16));
                        __m128i C = _mm_loadu_si128((__m128i*) (inbuff + 32));
                        __m128i D = _mm_loadu_si128((__m128i*) (inbuff + 48));
                        // destination must be 16-byte aligned for streaming store!
                        _mm_stream_si128((__m128i*) (outbuff + 0), A);
                        _mm_stream_si128((__m128i*) (outbuff + 16), B);
                        _mm_stream_si128((__m128i*) (outbuff + 32), C);
                        _mm_stream_si128((__m128i*) (outbuff + 48), D);
                        inbuff += step;
                        outbuff += step;
                        size -= step;
                }
                _mm_mfence(); // ensure last WC buffers get flushed to memory
        }

Note: This code is only copying the major chunks; if you have odd bytes left over, they must be copied separately.

Run the example code, comparing an ordinary cacheable memory copy vs. the non-temporal version. You should see an extreme performance advantage for the latter. The test will also show how the standard LIBC memcpy function stacks up; the one in 64-bit Visual Studio 2005 and 2008 uses non-temporal data for large blocks and does especially well.

(please run test #4 in the Visual Studio programming demo)

Notice that the demo code uses OpenMP to run threads in the cache test. You can edit the code to run a single thread, to see the impact on total achievable memory bandwidth. Just replace the call to omp_get_max_threads with the constant “1”.

For more details on these and related intrinsic functions, see Microsoft's documentation here (gcc should support the same syntax):
http://msdn2.microsoft.com/en-us/library/t467de55(VS.71).aspx

For more details on AMD's implementation of Write Combining, see appendix B of the Software Optimization Guide for AMD Family 10h Processors:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf

and also the Software Optimization video series: http://developer.amd.com/documentation/videos/pages/SoftwareOptimizationVideoSeries.aspx

Conclusion

We have seen an overview of how the cache and memory systems function, and how applications can make best use of these resources for better performance. As multi-core CPUs become more capable, with more cores processing ever-more data, efficient use of cache and memory will yield increasing benefits in performance.

For more details specifically aimed at optimizing code for AMD processors, see the Software Optimization Guide for AMD Family 10h Processors:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf

Also, see related software development information and resources on the AMD Developer Central web site:
http://developer.amd.com

About the Author

Michael Wall became obsessed with software optimization in 1978, and has never fully recovered. He earned the BS degree in Computer Science at M.I.T in 1985, and has done engineering in various areas of software design, VLSI design, and digital hardware. Software optimization platforms included custom microcode engines, Z80, 6502, DSP56000, 68000, and now this new PC thing. He currently works as a developer evangelist and optimization engineer at Advanced Micro Devices in lovely Sunnyvale, CA.

Back to top
© 2009 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon, AMD Turion, AMD Sempron, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.

This website may be linked to other websites which are not in the control of and are not maintained by AMD. AMD is not responsible for the content of those sites. AMD provides these links to you only as a convenience, and the inclusion of any link to such sites does not imply endorsement by AMD of those sites. AMD reserves the right to terminate any link or linking program at any time.