AMD Logo AMD Developer Central
Home
Drivers & Downloads
CPU Tools
GPU Tools
Partner Tools
Tech Zones
Docs & Articles
Samples & Demos
Community
Programs
Support

Implementing AMD cache-optimal coding techniques 

Skip Navigation LinksHome > Docs & Articles > Articles & Whitepapers
Greg Fry  9/4/2008 
» The Need for Cache
» Cache-friendly Coding Techniques: Key Concepts
» Code Execution Performance Example
» Conclusion
» Resources

Modern microprocessors split up and access memory (RAM) in a somewhat complicated, multi-tiered manner. Most microprocessors include one, two, or even three levels of cache, which are used to temporarily hold recently or often accessed areas of main RAM. These different levels of cache run at different speeds and are configured in different sizes depending on the make and model of the microprocessor. All ordinary microprocessor memory access goes through these caches. Although in most cases the cache architecture makes itself transparent and allows software to run optimally without any special care from the software designer, there are cases where taking the cache architecture into consideration can result in significant software performance improvement. This article first explores the AMD microprocessor cache architecture and then covers code construction techniques that are designed to be cache-friendly, thus producing optimal code execution speed.

The Need for Cache

Given how fast micro-processors run, main memory (RAM) is just too slow to be effectively used directly by the processors. Instead, small pieces of main memory are loaded into a special type of very fast memory that resides inside the processor. This very fast memory is called Level 1 (L1) and Level 2 (L2) cache. Every time the microprocessor has to access main memory for a piece of data, unless that piece of data is already in one of these levels of cache, it has to introduce a long series of wait cycles while the reading of main memory takes place. To combat this, the microprocessor uses these two levels of cache to temporarily store the contents of different sections of main memory. In this way, the long wait only has to occur once for any data that gets held in cache.
Cache Structure and Function

Let's take a look at the basic structure of L1 and L2 cache and how they operate with main memory.

L1 cache is very limited in size (12KB to 64KB) but has the fastest memory speed. It runs at close to microprocessor speed (2 or 3 processor cycles per access). Often this cache is broken up into two parts: one holds instructions (code), and the other holds data. We are only concerned about the functioning of the data cache in this article. L2 cache is the next fastest memory. It is many times larger than L1 cache (up to 2MB), but runs 3 to 5 times slower than L1 cache. Main memory is often 1000 times larger than L2 cache (1GB or more) but runs 10 to 30 times slower than L2 cache. The microprocessor always checks L1 cache first to see if a memory location that it needs is found there before checking L2 and, if necessary, going to main memory.

The amount of main memory that gets loaded into cache at any one time is called the "cache line size". This size varies amongst different microprocessor models, but the common values are 32 bytes, 64 bytes, and 128 bytes in length. When a section of main memory is loaded into cache, the section's starting address is always a multiple of the cache line size (in other words, the sections are always "aligned" on cache line boundaries). In AMD microprocessors, the L1 and L2 caches are "exclusive" caches, meaning they never both hold a copy of the same memory location.
Cache operation

Now let's look at an example of cache operation that shows the concepts mentioned in the previous section (see Figure 1 below). As a microprocessor runs, it may encounter instructions that require a memory location to be read. When the microprocessor executes such an instruction, the microprocessor first checks L1 cache to see if the needed memory location is there. If the memory location is found in L1 cache, otherwise known as a "hit", because of its speed, it only takes the microprocessor a couple of cycles to access the data needed by the instruction. But, if the needed memory location isn't found in L1 cache (a L1 cache miss), the processor checks L2 cache. The term "cache miss" describes the situation where a needed memory location is not found in the cache. A cache miss automatically triggers a read of a section of main memory into the cache.

Figure 1: An example cache operation

If the memory location is found in L2 cache, a cache line is thrown out of L1 cache to make room and the memory location is moved into L1 cache. At this point the instruction execution can continue. The locating and moving of the data between the caches costs the processor perhaps a dozen wait cycles. If however the memory location isn't found in either cache (a L1 and L2 cache miss), the processor causes that section of memory to be read from main memory and placed in L1 cache and perhaps portions placed in L2 cache as well. It takes a lot of (typically between 50 and 350) microprocessor wait cycles to get the data from main memory and copy it into the caches.
Automatic Pre-fetching of Main Memory Sections

The above example may seem complicated and perhaps even inefficient at first, but there is no way around the fact that main memory address lookups and reads take too long when compared with how fast microprocessors currently run. The current AMD microprocessor cache allows the microprocessor to only have to wait when a cache miss occurs. In order to minimize these cache misses, the AMD microprocessors have been architected to do one more thing: anticipatory pre-fetching of main memory sections into the cache. The pre-fetch algorithm used in AMD microprocessors is well designed and does a good job of putting those memory locations into cache that will most likely be needed when the microprocessor executes the next several instructions. Because the microprocessor does this pre-fetching in parallel while other instructions are executing, the main memory access times for filling the cache are often hidden. This "background" cache filling is one of the things that really helps boost AMD microprocessor performance. Even when this pre-fetching doesn't end up getting all the needed main memory sections ahead of time, it makes a notable performance difference. But there are cases where lack of appropriate software design can thwart the AMD microprocessor memory pre-fetch mechanisms. In these cases any software being executed by the microprocessor could end up with sub-optimal performance.

Cache-friendly Coding Techniques: Key Concepts

In order to avoid confounding the microprocessor's anticipatory pre-fetch mechanism, a software engineer needs to clearly understand the cache layout and operating concepts, as outlined above. Let's review two of the most important of these concepts.

First, cache is broken up into sections that are each a "cache line size" in length. For L1 cache, this is often 64 bytes meaning each section of memory that the cache holds is 64 bytes in length. Thus, a L1 cache of 32KB with a 64 byte cache line size could hold a maximum of 512 different sections of memory simultaneously. Since an entire cache line is filled as a single operation, executing any instruction that would cause two cache lines to have to be filled from main memory, will be much slower than if only one cache line had to be filled.

Secondly, because of how cache memory address lookups happen, when sections of main memory are brought into the cache, they are always brought in as sections aligned on cache line size boundaries. This means that if an instruction accesses a 32-bit value at location 0x00C0A328 in main memory, an L1 cache line will get filled with bytes 0x00C0A300 thru 0x00C0A33F. Once this has occurred, the original 32-bit access by the instruction can be completed. Furthermore, accessing any other memory locations in that same cache line can be done without paying the price of a main memory access.
Cache-friendly Coding Techniques: Rules to Code By

The two fundamental cache concepts that we just reviewed mean that software performance can be directly affected by how data structures are sized, how they are aligned, and how they are accessed. From this understanding come the following general rules:
  • Rule 1: If you can form your data structures so that 1 or more of them fit evenly within a cache line then you can access more data per cache-line fill. This gives much greater speed than if two or more separate cache lines must be filled to handle the data structure. One way to reduce the size of a structure is to avoid using a large data type (e.g. 32-bit int) when a smaller type is sufficient (e.g. char or short for small integer values).
  • Rule 2: Ensure that your data structures are aligned on cache line boundaries; otherwise your software may have to pay the price of filling multiple cache lines just to do a single memory access. Often the system memory allocation routines as well as the compiler take care of some of this alignment for you. But to gain optimal performance, you can check pointer values for appropriate alignment and adjust it if necessary (see the example below).
  • Rule 3: The order in which the members of a large data structure or array are accessed can affect speed. If data is accessed in memory layout order, there is much better likelihood that the microprocessor's pre-fetch algorithm will have correctly guessed which main memory items to pre-fetch.
  • Rule 4: L1 cache is a very limited resource (often 64KB or smaller). Therefore, once your software has started working with a given data structure, try to get everything done with that data structure before moving on to another data structure that might displace the first one from cache.


Code Execution Performance Example

In the following 32-bit code example, a simple function was written to show how applying the rules mentioned above can affect code execution performance. The function first allocates two arrays of data structures, initializes the first array of StuffA data structures and then runs a timed test of how many linked list search loops can be executed in ten seconds. Then the function does the same thing with the array of StuffB data structures. The key difference between these two timed tests is really just the data structure layout.
Explanation of Data Structure Differences

We start with a data structure (StuffA"”see Listing 1) that breaks all the rules mentioned above: it is oversized (doesn't fit in a cache-line), the code forcibly sets the data structures memory location to not be on a cache-line boundary (i.e. it is non-aligned), the data structure members used in the loop are at opposite ends of the data structure, thus forcing them to be on separate cache lines, and finally, the linking together of these data structures is done in such a way as to subvert the pre-fetch mechanism (each StuffA data structure is linked to another one that is 4 structures away).

Listing 1: Cache unfriendly struct declaration.
#define MAX_BRANCHES 8

struct StuffA // this one doesn't work very well with cache
{
int nodeType; // values from 1 to 5
int nodeClass; // values from 0 to 7
int holdingValue; // values from 0 to 3
int holdTime; // milliseconds since holding
int numberOfConnections; // values from 0 to 255
int branchesCnt; // 0 to 7
struct StuffA *branches[MAX_BRANCHES];
char *description;
int connectionsWaiting; // values from 0 to 255
struct StuffA *next;
};


To overcome the limitations imposed by StuffA we use a second data structure, StuffB (see Listing 2), which is essentially the first data structure, StuffA, that has been reworked to follow the rules indicated above: StuffB fits within a cache line, the data members used in the linked list search loop are contiguous, the data structure's address is forcibly aligned to be on a cache line boundary, and the StuffB linked-list is set up so that it is pre-fetch friendly (each StuffB data structure is linked to the next in the array).

Listing 2: Cache friendly struct declaration.
struct StuffB // this one works well with cache
{
struct StuffB *next;
unsigned char nodeType; // values from 1 to 5
unsigned char nodeClass; // values from 0 to 7
unsigned char holdingValue; // values from 0 to 3
unsigned char numberOfConnections; // values from 0 to 255
unsigned char connectionsWaiting; // values from 0 to 255
unsigned char branchesCnt; // 0 to 7
unsigned short extraStuff1; // room for growth and to keep ptrs aligned
unsigned __int32 holdTime; // milliseconds since holding
struct StuffB *branches[MAX_BRANCHES];
char *description;
void *extraStuff2; // room for growth and to keep cache line aligned
void *extraStuff3; // room for growth and to keep cache line aligned
void *extraStuff4; // room for growth and to keep cache line aligned
};

Both of these structures are put through the test loop (see Listing 3) and the differences are calculated and shown both as the number of times through the loop (i.e. the number of searched that could be performed in 10 seconds) as well as the number of MBs of data that were accessed per second. Both of these numbers allow us to easily see what difference the data structure layout made to execution speed.

Listing 3: The Performance Test Loop.
// change this to see how having more or less affects cache performance
#define MAX_STUFF (1024*32)


void TestStuffAccessSpeed()
{
int idx = 0;
int tmpval = 0;
unsigned __int32 begtime = 0;
unsigned __int32 endtime = 0;
unsigned __int32 totalTimeA = 0;
unsigned __int32 totalTimeB = 0;
unsigned __int64 totalLoopsA = 0;
unsigned __int64 totalLoopsB = 0;
unsigned __int64 megaMemAccessesPerSecondA = 0;
unsigned __int64 megaMemAccessesPerSecondB = 0;
StuffA *origStuffAptr = (StuffA *)malloc((MAX_STUFF*sizeof(StuffA))+64);
StuffA *stuffAptr = origStuffAptr;
StuffA *tmpstuffA = NULL;
StuffB *origStuffBptr = (StuffB *)malloc((MAX_STUFF*sizeof(StuffB))+64);
StuffB *stuffBptr = origStuffBptr;
StuffB *tmpstuffB = NULL;


// SETUP AND TEST THE BAD DATA STRUCTURE
memset(stuffAptr,0,(MAX_STUFF*sizeof(StuffA))+64);
// force it to be unaligned with regards to cache line boundaries
stuffAptr = (StuffA *)(((((INT_PTR)stuffAptr) + 31) & 0xFFFFFFE0)+4);

tmpstuffA = stuffAptr;
// Initialize the data structure – only “nodeType” and “next” members really
// matter for this test.
for ( idx = 0; idx < MAX_STUFF; idx++, tmpstuffA++ )
{
tmpstuffA->nodeType = (idx % 6) + 1;
tmpstuffA->nodeClass = (idx+15) % 8;
tmpstuffA->holdingValue = (idx+93) % 4;
tmpstuffA->holdTime = ((int)tmpstuffA->nodeType *
(int)tmpstuffA->nodeClass *
(int)tmpstuffA->holdingValue) *
(idx % 173) + 1;
tmpstuffA->numberOfConnections = (idx+((char)tmpstuffA->holdTime>>1))%256;
tmpstuffA->branchesCnt = idx % 4;
tmpstuffA->next = &(stuffAptr[(idx+4)%MAX_STUFF]);
if (tmpstuffA->numberOfConnections > 0)
tmpstuffA->connectionsWaiting = idx % (tmpstuffA->numberOfConnections+1);
}

// clear cache before using the above memory...
WarmUpCPU();

printf(_T("\nBad Example: Too Large (%d bytes) and unaligned stuff 0x%p\n"),
sizeof(StuffA),stuffAptr);

tmpval = 0;
totalLoopsA = 0;
begtime = GetTickCount();
do
{
tmpstuffA = stuffAptr;
// search the list for any node of type 5 and count it...
for ( idx = 0; idx < MAX_STUFF; idx++, tmpstuffA = tmpstuffA->next )
{
if ( tmpstuffA->nodeType == 5 ) {
tmpval++;
}
}

totalLoopsA++;
endtime = GetTickCount();
} while ( (endtime - begtime) < 10000 );

totalTimeA = (endtime - begtime);
megaMemAccessesPerSecondA =
(unsigned __int64)((totalLoopsA*MAX_STUFF*2)/(1024*1024));
printf(_T("Total loops: %I64d in %d seconds\n"),
totalLoopsA, (totalTimeA / 1000), tmpval);
printf(_T(" SCORE: %I64d M-Accesses/s\n"),
megaMemAccessesPerSecondA);


// TEST THE GOOD DATA STRUCTURE!!!

memset(stuffBptr,0,(MAX_STUFF*sizeof(StuffB))+64);
// force it to be cache line aligned
stuffBptr = (StuffB *)((((INT_PTR)stuffBptr) + 63) & 0xFFFFFFC0);

tmpstuffB = stuffBptr;
// Initialize the data structure – only “nodeType” and “next” members really
// matter for this test.
for ( idx = 0; idx < MAX_STUFF; idx++, tmpstuffB++ )
{
tmpstuffB->nodeType = (idx % 6) + 1;
tmpstuffB->nodeClass = (idx+15) % 8;
tmpstuffB->holdingValue = (idx+93) % 4;
tmpstuffB->holdTime = ((int)tmpstuffB->nodeType *
(int)tmpstuffB->nodeClass *
(int)tmpstuffB->holdingValue) *
(idx % 173) + 1;
tmpstuffB->numberOfConnections = (idx+((char)tmpstuffB->holdTime>>1))%256;
tmpstuffB->branchesCnt = idx % 4;
tmpstuffB->next = &(stuffBptr[(idx+1)%MAX_STUFF]);
if (tmpstuffB->numberOfConnections > 0)
tmpstuffB->connectionsWaiting = idx % (tmpstuffB->numberOfConnections+1);
}

// clear cache before using the above memory...
WarmUpCPU();

printf(_T("\nGood Example: Exact cache line fit (%d bytes) “
“and aligned - stuff 0x%p\n"),
sizeof(StuffB),stuffBptr);

tmpval = 0;
totalLoopsB = 0;
begtime = GetTickCount();
do
{
tmpstuffB = stuffBptr;
for ( idx = 0; idx < MAX_STUFF; idx++, tmpstuffB = tmpstuffB->next )
{
if ( tmpstuffB->nodeType == 5 ) {
tmpval++;
}
}

totalLoopsB++;
endtime = GetTickCount();
} while ( (endtime - begtime) < 10000 );

totalTimeB = (endtime - begtime);
megaMemAccessesPerSecondB = (unsigned __int64)((totalLoopsB*MAX_STUFF*2)/(1024*1024));
printf(_T("Total loops: %I64d in %d seconds\n"),
totalLoopsB, (totalTimeB / 1000), tmpval);
printf(_T(" SCORE: %I64d M-Accesses/s\n"),
megaMemAccessesPerSecondB);

printf(_T("\nUsing good cache aware design principles “
“gave %d%% performance improvement.\n"),
(megaMemAccessesPerSecondB*100)/megaMemAccessesPerSecondA);

free(origStuffAptr);
free(origStuffBptr);
}

The test shown in Listing 3 was run on a variety of PCs as well as allocating larger and smaller numbers of data structures. In all test runs, the StuffB data structure loops out performed the StuffA data structure loops. The performance difference averaged out with the StuffB loops performing 168% faster than StuffA loops.
Details of the Search Loop

Although a program typically would not execute such a search loop multiple times as this one does (the search is repeated as many times as possible until 10 seconds has passed), doing this allows us to see with large enough numbers just how much of a difference the data structure layouts made. If you look at how the loop operates, you will see that the data structure members "nodeType" and "next" are the only ones used in this loop. This is critical in understanding how Rule 3 above can affect things. In StuffA, these data structure members are at opposite ends of the data structure and therefore are guaranteed to be in different cache lines. In StuffB, these data structure members are right next to each other and therefore end up in the same cache line and will be much faster to access during the linked list search (see Figure 2).

Figure 2: StuffA vs StuffB

Conclusion

We can see that using the above cache-friendly coding rules allows a programmer to obtain better performance from their software. In addition to just using these rules, AMD supplies a very useful tool, AMD CodeAnalyst, which allows you to, among other things, see where your code is incurring processor waits due to cache misses. By using the AMD CodeAnalyst tool on a system with an AMD processor, you can easily determine where in your code to apply these cache-friendly coding rules and help optimize your software's performance. Good luck and may the cache-line always be pre-fetched for you!

Resources

Performance Optimization of Windows Applications on AMD Processors, Part II. This white paper and demo code illustrate performance optimization techniques related to cache and memory.

Cache or Check? CodeAnalyst Addresses Performance Deficits. This article describes five common problems that can be tracked by AMD's CodeAnalyst Performance Analyzer, what causes them, and what you can do about them.

Supersizing Java: Large Pages on the Opteron Processor, Part 1. Learn about how the AMD Opteron processor can handle large memory pages in a Java Virtual Machine (JVM) and why large pages can improve performance.

Supersizing Java: Large Pages on the Opteron Processor, Part 2. Learn how to use large memory pages under 32-bit and 64-bit Windows and Linux, and configure three advanced Java Virtual Machines to take advantage of them.

An Introduction to Analysis and Optimization with AMD CodeAnalyst. This technical note demonstrates how AMD CodeAnalyst can be used to analyze and improve the performance of a compute-bound program.

Basic Performance Measurements for AMD Athlon™ 64 and AMD Opteron™ Processors. This technical note describes a collection of basic measurements which can be taken using the performance measurement features of AMD Athlon 64 and AMD Opteron processors.


Greg Fry of Portland, Oregon, A 20+ year veteran of professional software development, including a decade in commercial software development (large volume, end-user/shrink-wrapped software). B.S. in Computer Science, Instrument rated private pilot, and outdoor enthusiast.
Back to top
© 2009 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon, AMD Turion, AMD Sempron, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.

This website may be linked to other websites which are not in the control of and are not maintained by AMD. AMD is not responsible for the content of those sites. AMD provides these links to you only as a convenience, and the inclusion of any link to such sites does not imply endorsement by AMD of those sites. AMD reserves the right to terminate any link or linking program at any time.