
This is a follow-up to the first post on the SSE4a instruction set.

While shuffling data around in registers is extremely important (as we mentioned in the last entry on SSE4a), one of the primary performance bottlenecks is the loading and storing of data. Even if a processor executes instructions very quickly when working only with registers, a single memory access that goes all the way to DRAM can cost close to 50 nanoseconds, which translates to a hundred or more clock cycles on most processors.
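To put the 50 ns figure in cycles, multiply it by the clock rate in GHz. The following is a minimal sketch; `stall_cycles` is just a name picked for this example, and the clock rates in the usage note are illustrative assumptions, not figures from this post.

```c
#include <assert.h>

/* Convert a stall measured in nanoseconds into core clock cycles.
   A clock running at f GHz ticks f times per nanosecond,
   so cycles = nanoseconds * GHz. */
double stall_cycles(double latency_ns, double clock_ghz)
{
    assert(latency_ns >= 0.0 && clock_ghz > 0.0);
    return latency_ns * clock_ghz;
}
```

For example, `stall_cycles(50.0, 2.0)` gives 100 cycles and `stall_cycles(50.0, 3.0)` gives 150, and that is the cost of a single miss.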

The SSE extensions already have instructions that help in reducing this bottleneck. SSE4a complements these instructions with two of its own.

MOVNTSS
MOVNTSD

Before I move on to the applications of these instructions, let me provide some information to help set the context for what the MOVNT* instructions are really useful for, and why.

Almost all user data lives in what is called “Write Back” memory. This implies several things, but in essence it is the most aggressively cached memory type available. The following description outlines what happens on reads or writes to Write Back memory. (For the sake of simplicity, I am not going to delve into the different combinations of the data being in the L1 or L2 cache. Assume that a cache hit means that data is either in L1 or L2.)

  • Read
    • Cache hit: Data is read from the cache line to the target register
    • Cache miss: Data is moved from memory to the cache, and read into the target register
  • Write
    • Cache hit: Data is moved from the register to the cache line*
    • Cache miss: The cache line is fetched into the cache, and the data from the register is moved to the cache line*

*As per MOESI cache coherency protocol rules, in both cases, the cache line is marked as modified
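The four cases above can be sketched as a toy model. This is an illustration only, not how the hardware is built: it assumes a direct-mapped cache, collapses the MOESI state into a single "modified" bit per line, and uses made-up names (`toy_cache`, `cache_init`, `cache_access`).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { LINE_BYTES = 64, NUM_LINES = 512 };  /* toy 32 KB direct-mapped cache */

typedef struct {
    bool     valid[NUM_LINES];
    bool     modified[NUM_LINES];
    uint64_t line_no[NUM_LINES];  /* full line number kept as the tag, for simplicity */
    unsigned fills;               /* lines fetched from memory */
    unsigned writebacks;          /* modified lines evicted back to memory */
} toy_cache;

void cache_init(toy_cache *c) { memset(c, 0, sizeof *c); }

/* Model one access to Write Back memory. Returns true on a cache hit.
   A miss fetches the line first, for writes as well as reads
   (write-allocate), and every write marks the line as modified. */
bool cache_access(toy_cache *c, uint64_t addr, bool is_write)
{
    uint64_t line = addr / LINE_BYTES;
    unsigned idx  = (unsigned)(line % NUM_LINES);
    bool hit = c->valid[idx] && c->line_no[idx] == line;

    if (!hit) {
        if (c->valid[idx] && c->modified[idx])
            c->writebacks++;          /* the evicted line was dirty */
        c->valid[idx]    = true;
        c->line_no[idx]  = line;
        c->modified[idx] = false;
        c->fills++;                   /* line fetched into the cache */
    }
    if (is_write)
        c->modified[idx] = true;
    return hit;
}
```

Note that in this model a write miss costs a line fill even though the old contents of the line are never read.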

Because of locality of reference, writes typically go to a memory location that has recently been read. With this design, writes to recently read memory become extremely fast.

Unfortunately, the same procedure is followed even when you know that the data is being written to a location that has not been read recently. On every such write, the cache line is fetched into the cache, causing what is called cache pollution.

Cache pollution is bad, bad, bad. Considering that we have only 32 KB for our L1 data cache, if we’re reading data from one location and writing it to another, we are effectively giving up half our cache lines to the destination, which makes the hardware prefetcher rather ineffective. Plus, other memory accesses between this read and write end up competing for the remaining cache lines. Remember, cache lines are 64 bytes, so even if only one byte of data in a megabyte needs to be cached, an entire 64-byte line is used up.
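The arithmetic behind "losing half our cache lines" can be sketched as follows. The helper names are hypothetical, and the model is deliberately crude: it only counts the lines a copy pulls into the cache and ignores associativity and eviction order.

```c
#include <stddef.h>

enum { CACHE_LINE_BYTES = 64 };

/* Number of cache lines touched by `bytes` bytes starting at byte `offset`. */
size_t lines_spanned(size_t offset, size_t bytes)
{
    if (bytes == 0)
        return 0;
    size_t first = offset / CACHE_LINE_BYTES;
    size_t last  = (offset + bytes - 1) / CACHE_LINE_BYTES;
    return last - first + 1;
}

/* Lines brought into the cache by copying `bytes` bytes between two
   line-aligned buffers. Ordinary stores allocate the destination lines
   as well; non-temporal stores leave the destination out of the cache. */
size_t copy_lines_cached(size_t bytes, int nt_stores)
{
    size_t src_lines = lines_spanned(0, bytes);
    size_t dst_lines = nt_stores ? 0 : lines_spanned(0, bytes);
    return src_lines + dst_lines;
}
```

With ordinary stores, `copy_lines_cached(16384, 0)` is 512 lines, which is every line of a 32 KB cache for a copy of only 16 KB. With non-temporal stores the same copy needs 256.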

This is where the non-temporal store comes to our rescue.

These instructions, first of all, do NOT update the cache line, but instead write directly to memory. Along with that, they “write combine” their stores, meaning the data is not written to memory immediately; instead, the processor waits for 64 bytes to accumulate in a write-combining buffer. Once that threshold is reached (or one of the many other triggers fires), the buffer is written to DRAM in one shot.

Of course, this also means that the data may not necessarily reach memory in program order, or at the moment the write was executed. To flush out the write-combining buffers, the SFENCE (store fence) instruction needs to be used.

I’ve noticed gains of up to 2x or more on simple operation loops (something like a load, add, store) working on large pieces of data, when I switched the stores to non-temporal. This is a HUGE gain considering that most of this comes from the store time, which in operations like this, is a major bottleneck. I’ve found this ideal typically for large buffers (~1MB+).
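A loop of the load, add, store kind described above might look like this with streaming stores. This is a sketch, not the code that was measured: the function name `add_stream` is made up, and it assumes 16-byte-aligned buffers whose length is a multiple of four floats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_add_ps, _mm_stream_ps, _mm_sfence */

/* dst[i] = src[i] + addend for i in [0, n).
   Assumes dst and src are 16-byte aligned and n is a multiple of 4. */
void add_stream(float *dst, const float *src, float addend, size_t n)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(n % 4 == 0);

    const __m128 k = _mm_set1_ps(addend);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_add_ps(_mm_load_ps(src + i), k);  /* load, add */
        _mm_stream_ps(dst + i, v);                       /* MOVNTPS: store with NT hint */
    }
    _mm_sfence();  /* flush the write-combining buffers before returning */
}
```

The single SFENCE after the loop is enough here; fencing inside the loop would defeat the write combining.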

If I wanted to write full register after full register to memory, this would work fine. However, if you’re working on only part of the register (e.g. with scalar SSE instructions) and want to write just that part, things get complicated. Until now there has not been an instruction that would store only the SS or SD part of the register, hence any NT* memory write from an XMM register would span a full 16 bytes.

*I often refer to these stores as either NT stores/writes, or stores/writes with the NT hint. Keep in mind, though, that these stores are often referred to as “streaming stores.” All compiler intrinsics that map to these instructions are named _mm_stream*.

Of course, with the AMD “Barcelona” processors, we now have these two new instructions:

MOVNTSS : This instruction will write the least significant 32 bits of a register to memory using the non-temporal hint. For example, a loop that performs scalar single-precision floating point math on a large array can use the SSE registers and MOVNTSS to store results to memory.

MOVNTSD : This instruction will write the lower 64 bits of a register to memory using the non-temporal hint. This instruction can be used for similar purposes as the MOVNTSS instruction, but typically for double-precision floating point data.
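With GCC or Clang, the two instructions are reachable through the `_mm_stream_ss` and `_mm_stream_sd` intrinsics. The sketch below is one possible way to use them safely. The function names are invented for the example, the `target("sse4a")` attribute and `__builtin_cpu_supports` are compiler-specific, and the plain-store fallback is an assumption about what you would want on processors without SSE4a.

```c
#include <assert.h>
#include <stddef.h>
#include <x86intrin.h>  /* pulls in the SSE4a intrinsics on GCC and Clang */

/* Scalar single-precision loop; each result leaves via MOVNTSS. */
__attribute__((target("sse4a")))
static void scale_ss_nt(float *dst, const float *src, float k, size_t n)
{
    const __m128 kv = _mm_set_ss(k);
    for (size_t i = 0; i < n; i++) {
        __m128 v = _mm_mul_ss(_mm_load_ss(src + i), kv);
        _mm_stream_ss(dst + i, v);   /* low 32 bits, NT hint */
    }
    _mm_sfence();
}

/* Scalar double-precision loop; each result leaves via MOVNTSD. */
__attribute__((target("sse4a")))
static void scale_sd_nt(double *dst, const double *src, double k, size_t n)
{
    const __m128d kv = _mm_set_sd(k);
    for (size_t i = 0; i < n; i++) {
        __m128d v = _mm_mul_sd(_mm_load_sd(src + i), kv);
        _mm_stream_sd(dst + i, v);   /* low 64 bits, NT hint */
    }
    _mm_sfence();
}

/* Public entry points: use the NT path only when the CPU reports SSE4a. */
void scale_floats(float *dst, const float *src, float k, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (__builtin_cpu_supports("sse4a")) {
        scale_ss_nt(dst, src, k, n);
    } else {
        for (size_t i = 0; i < n; i++) dst[i] = src[i] * k;
    }
}

void scale_doubles(double *dst, const double *src, double k, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (__builtin_cpu_supports("sse4a")) {
        scale_sd_nt(dst, src, k, n);
    } else {
        for (size_t i = 0; i < n; i++) dst[i] = src[i] * k;
    }
}
```

The runtime check matters: executing MOVNTSS or MOVNTSD on a processor without SSE4a raises an invalid-opcode exception.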

Before these two instructions were available, there really was no way to do either of these stores with the NT hint. With these two new instructions, SSE4a completes the NT instruction set to more completely match our set of normal stores.

Support for these two new instructions, and for all the SSE4a instructions, is detected with the CPUID instruction. Specifically, ECX bit 6 will be set for CPUID function 8000_0001h.
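The check can be sketched in C as follows, assuming GCC or Clang, whose `<cpuid.h>` header provides `__get_cpuid`. The function name `cpu_has_sse4a` is made up for this example.

```c
#include <cpuid.h>  /* GCC/Clang helper for the CPUID instruction */

/* Returns 1 if CPUID function 8000_0001h reports SSE4a (ECX bit 6), else 0. */
int cpu_has_sse4a(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid returns 0 when the requested function is not supported. */
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;

    return (int)((ecx >> 6) & 1u);
}
```

Call it once at startup and cache the result, since CPUID is a slow, serializing instruction.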

-Rahul Chaturvedi


This post is the opinion of the author and may not represent AMD’s positions, strategies or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.
