Writing SIMD code poses several complications. Doing 2 to 16 operations with one instruction is a powerful feature, but unless you have enough support instructions to get your data back and forth between the registers and memory, you may not always be utilizing the full potential that SIMD offers.
The SSE2 and SSE3 instruction sets include many instructions to help with this: packs and unpacks, shuffles, partial register moves, and more. These are pretty big sets, so it seemed like a tall order for SSE4a to actually improve on them. So I looked at the instructions a bit more closely, and learned about the new bit field insertion and extraction instructions.
Before I start, let me mention that both of the following instructions work only on the lower 64 bits of the registers they deal with, and the upper 64 bits of the destination are left undefined. To work on the upper half of a register, you’d first have to shift it down by 64 bits (a PSRLDQ by 8 bytes, for example) and then do the required processing.
EXTRQ: Extract Field from Register
This instruction basically extracts a particular set of bits from a register and moves them to the register’s least significant position, clearing the rest of the lower 64 bits. For example, if you want only the third 16-bit value in xmm0 (bits [47:32]) extracted and left at bits [15:0], you would use EXTRQ like this, giving the field length first and the starting bit index second:
EXTRQ xmm0, 16, 32
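To make the effect concrete, here is a minimal scalar sketch in C of what EXTRQ does to the lower 64 bits. The function name `extrq_model` is made up for illustration, and the sketch assumes a field length of 1 to 64 bits that fits entirely inside the quadword; it does not try to model other encodings or the undefined upper half.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of EXTRQ on the low 64 bits of an XMM register.
   len = field width in bits, idx = bit position of the field's lowest bit.
   Assumption: 1 <= len <= 64 and idx + len <= 64. */
static uint64_t extrq_model(uint64_t low64, unsigned len, unsigned idx)
{
    assert(len >= 1 && len <= 64 && idx + len <= 64);

    /* Build `len` ones in the low bits; a shift by 64 is undefined in C,
       so the full-width case is handled separately. */
    uint64_t ones = (len == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << len) - 1);

    /* Move the field down to bit 0 and clear everything above it. */
    return (low64 >> idx) & ones;
}
```

With the low quadword holding 0x1111222233334444, `extrq_model(x, 16, 32)` returns 0x2222, which is the third 16-bit value.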
The first thought that came to my mind when I saw this instruction was, “Can’t I do the same thing just using a shift instruction? Okay, that wouldn’t clear out the rest of the bits in this 64 bit half, but I could do a mask, and then a shift…but then I’d have to use an extra register for the mask. Well, I could do two shifts, one left and one right, but then that would be two instructions.”
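The two alternatives I was weighing can be sketched in scalar C as below. The function names are hypothetical, and each line is only a stand-in for the SSE instruction named in its comment, under the same assumption that the field lies entirely within the lower 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Two shifts: push the field to the top of the quadword (like PSLLQ),
   then pull it back down to bit 0 (like PSRLQ). No mask register needed,
   but it costs two instructions. */
static uint64_t extract_two_shifts(uint64_t low64, unsigned len, unsigned idx)
{
    assert(len >= 1 && len <= 64 && idx + len <= 64);
    uint64_t top = low64 << (64 - idx - len);  /* shift count is 0..63 */
    return top >> (64 - len);                  /* shift count is 0..63 */
}

/* Mask then shift: isolate the field in place (like PAND with a mask held
   in a second register), then shift it down (like PSRLQ). */
static uint64_t extract_mask_shift(uint64_t low64, unsigned len, unsigned idx)
{
    assert(len >= 1 && len <= 64 && idx + len <= 64);
    uint64_t ones = (len == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << len) - 1);
    uint64_t field_mask = ones << idx;
    return (low64 & field_mask) >> idx;
}
```

Both return 0x2222 for the input 0x1111222233334444 with a length of 16 and an index of 32, the same result as the single EXTRQ above.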
Anyway, you get the idea. EXTRQ can be fairly useful, but not essential. Now INSERTQ, that comes close.
INSERTQ: Insert Field from a Source Register into a Destination Register
This instruction takes a set of bits from the low end of one register and places those bits at ANY offset you specify (within 64 bits of course) within the destination register. For example, if you want to take the 16-bit value in the low bits of xmm0 and move it to the third 16-bit value of xmm1 (bits [47:32]), you would do:
INSERTQ xmm1, xmm0, 16, 32
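As a rough picture of the result, here is a scalar C sketch of the lower 64 bits after an INSERTQ. The name `insertq_model` is hypothetical, and the sketch assumes a field length of 1 to 64 bits that fits inside the quadword.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of INSERTQ on the low 64 bits.
   The low `len` bits of src replace bits [idx+len-1 : idx] of dst;
   every other bit of dst is kept.
   Assumption: 1 <= len <= 64 and idx + len <= 64. */
static uint64_t insertq_model(uint64_t dst, uint64_t src,
                              unsigned len, unsigned idx)
{
    assert(len >= 1 && len <= 64 && idx + len <= 64);
    uint64_t ones = (len == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << len) - 1);
    uint64_t field_mask = ones << idx;
    return (dst & ~field_mask) | ((src & ones) << idx);
}
```

For example, with a destination of 0x1111222233334444 and a source whose low 16 bits are 0xABCD, a length of 16 and an index of 32 give 0x1111ABCD33334444. Anything above the low 16 bits of the source is ignored.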
But if you didn’t have this instruction and wanted to accomplish the same thing, what would you do?
The quickest way would be to keep a mask covering bits [47:32] in one register, and use that mask to zero out those bits in xmm1 (a PANDN, or a PAND with the inverted mask). Then you’d have to shift the data in xmm0 to the correct location (PSLLQ), and then merge those bits into xmm1 (POR).
So essentially INSERTQ is doing the job of three instructions in one!
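The three steps can be sketched in scalar C as follows, for the same fixed position as the INSERTQ example (the third 16-bit value, bits [47:32]). The constant and function names are made up, each line only stands in for the SSE instruction named in its comment, and the sketch assumes the source holds nothing but the 16-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: ones at bits [47:32], the third 16-bit value. */
#define FIELD_MASK_47_32 UINT64_C(0x0000FFFF00000000)

static uint64_t insert_third_word(uint64_t dst, uint64_t src)
{
    /* Assumption: the source holds only the 16-bit value to insert. */
    assert(src <= UINT64_C(0xFFFF));

    uint64_t cleared = dst & ~FIELD_MASK_47_32; /* 1: clear the field (PANDN, or PAND with inverted mask) */
    uint64_t moved   = src << 32;               /* 2: shift the value into place (PSLLQ) */
    return cleared | moved;                     /* 3: merge (POR) */
}
```

With a destination of 0x1111222233334444 and a source of 0xABCD, this returns 0x1111ABCD33334444, and it also ties up a register for the mask.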
If you want to do this entire process for arbitrary bit positions (say, because the bits to insert or extract depend on some other computation), you would add one more instruction here, because now the mask also has to be shifted into place before you do the PAND. Further, if the ‘source’ register holds more data than just the value you want to insert, you need one more PAND to zero out the unwanted bits. Put both together, and you need to add ONE more shift, to move the mask that zeroes out the bits in the ‘source’ register.
This means that, in order to do what INSERTQ provides (inserting a value into a register at any location, based on a value stored in a register), you could potentially need a grand total of six SSE instructions.
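Here is a step-by-step scalar sketch in C of that general case, where the position is only known at run time and the source may carry extra bits. The function name is hypothetical and the comments give only an approximate mapping to SSE instructions. The exact count depends on how the masks are prepared and on whether a register has to be copied before it is shifted, so don't read the sketch as an exact instruction listing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical emulation of a variable-position insert without INSERTQ.
   Assumption: 1 <= len <= 64 and idx + len <= 64. */
static uint64_t insert_field_emulated(uint64_t dst, uint64_t src,
                                      unsigned len, unsigned idx)
{
    assert(len >= 1 && len <= 64 && idx + len <= 64);

    /* Assumed to be sitting in a register already: `len` ones in the low bits. */
    uint64_t ones = (len == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << len) - 1);

    uint64_t field_mask = ones << idx;     /* shift the mask into place (PSLLQ) */
    uint64_t cleared = dst & ~field_mask;  /* clear the destination field (PANDN or PAND) */
    uint64_t value = src & ones;           /* drop unwanted source bits (PAND) */
    uint64_t moved = value << idx;         /* shift the value into place (PSLLQ) */
    return cleared | moved;                /* merge (POR) */
}
```

A single INSERTQ covers all of these steps, and it needs no mask register.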
If you think about it, you’ll probably find a lot of places in your code where this INSERTQ instruction could save you significant time and complexity.
There are two more instructions in the SSE4a instruction set that add some more convenience: MOVNTSS and MOVNTSD, the partial stream (non-temporal hint store) instructions for scalar floating point values. Look for future posts covering these topics.
This post is the opinion of the author and may not represent AMD’s positions, strategies or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.