SIMD (single instruction, multiple data) instructions, also called packed instructions, are widely used in high-performance computing (HPC), multimedia, and security applications. These instructions operate on a set of packed data values simultaneously. The popular SIMD instruction set extensions in the x86 architecture are called SSE (Streaming SIMD Extensions) and range from SSE1 (or simply SSE) to SSE5. Many of these instructions operate on multiple data elements (e.g., a vector) packed into a 128-bit wide register.

Streaming SIMD Extensions 5 (SSE5) is a new, proposed extension to the AMD64 (x86-64 or x64) instruction set. SSE5 would add 170 new instructions that are intended to benefit domains such as HPC, multimedia, and security applications more than the previously released SSE instruction sets do.

SSE5 instructions typically operate on 128 bits of data at a time, as do the previously released SSE instruction sets. The new instructions aim to increase the work done per instruction and, by introducing an additional operand, to remove the overhead of storing and reloading register operands.

The new instructions include:

  • Fused multiply accumulate (FMACxx) instructions
  • Integer multiply accumulate (PMAC, PMADC) instructions
  • Permutation and conditional move instructions
  • Vector compare and test instructions
  • Precision control, rounding, and conversion instructions

Specification: AMD64 Technology 128-Bit SSE5 Instruction Set


Image Converter

Consider a simple multimedia application, for example an image converter that converts a BMP image to the YUV image format. The conversion reads individual pixels from the BMP image and converts each one to YUV. If, instead of operating on individual pixels, we pack several pixels together and process the whole set with a single instruction, performance improves. This is the kind of workload where SSE instructions give a boost. Assume the bitmap image consists of 8-bit monochrome pixels. By packing these pixel values into a 128-bit register (8 bits * 16 pixels), we can operate on 16 values at a time.
Please refer to the AMD SSE5 specification for comprehensive details on the SSE5 instructions.
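The packing idea above can be sketched with existing SSE2 intrinsics. This is only an illustration, not the converter's actual code: the function name brighten_pixels and the brightness-offset operation are hypothetical stand-ins for a per-pixel step, and a scalar loop handles leftover pixels (and the whole buffer when SSE2 is not available).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#endif

/* Hypothetical helper: add `offset` to every 8-bit pixel, saturating at 255. */
void brighten_pixels(const uint8_t *src, uint8_t *dst, size_t count, uint8_t offset)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    /* Broadcast the offset into all 16 byte lanes of a 128-bit register. */
    const __m128i voffset = _mm_set1_epi8((char)offset);
    for (; i + 16 <= count; i += 16) {
        /* Load 16 packed pixels, add with unsigned saturation, store 16 results. */
        __m128i pixels = _mm_loadu_si128((const __m128i *)(src + i));
        pixels = _mm_adds_epu8(pixels, voffset);
        _mm_storeu_si128((__m128i *)(dst + i), pixels);
    }
#endif
    /* Scalar tail: one pixel per iteration. */
    for (; i < count; i++) {
        unsigned sum = (unsigned)src[i] + offset;
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

The packed loop handles 16 pixels per iteration, which is the 8 bits * 16 pixels layout described above; the scalar loop does the same work one pixel at a time.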

FMADDPS – Multiply and add packed single precision floating point instruction

One of the typical operations computed in transformations such as the DFT or FFT is a sum of products of the form

$$p = \sum_{n=0}^{N-1} f(n)\, x(n)$$

Let f(n) and x(n) be two source buffers, for example src1 and src2, and let p be the destination that accumulates the results. All the buffers in this discussion hold single-precision floating-point values. The implementation in plain C for N = 4 (four 32-bit values, 128 bits) is as follows:

for (int i = 0; i < 4; i++)
    p = p + src1[i] * src2[i];


The x86 (x87 floating-point) code generated for each iteration is as follows:

// src1 is on the top of the FPU stack, ST(0); ST(0) = src1 * src2
fmul DWORD PTR _src2$[esp+148]

// p = ST(1), product = ST(0); ST(1) = ST(1) + ST(0), then pop (ST = stack top)
faddp ST(1), ST(0)


The total number of arithmetic instructions generated for 4 iterations is 2 * 4 = 8.

With packed SSE instructions, the same multiply and add for all four elements takes only two instructions:

// xmm0 = p, xmm1 = src1, xmm2 = src2
mulps xmm1, xmm2    // xmm1 = src1 * src2 (four products at once)
addps xmm0, xmm1    // xmm0 = p + src1 * src2
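For readers working in C, the same two-step packed computation can be sketched with the standard SSE intrinsics _mm_mul_ps and _mm_add_ps. The function name dot4 is hypothetical, and the final step that adds the four lanes together is an assumption about how a scalar p would be recovered; the assembly above keeps the four partial results in xmm0. A scalar fallback is included for targets without SSE.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE single-precision intrinsics */
#endif

/* Hypothetical helper: sum of src1[i] * src2[i] for i = 0..3. */
float dot4(const float *src1, const float *src2)
{
    assert(src1 != NULL && src2 != NULL);
#if defined(__SSE__)
    __m128 a = _mm_loadu_ps(src1);             /* four floats from src1 */
    __m128 b = _mm_loadu_ps(src2);             /* four floats from src2 */
    __m128 acc = _mm_setzero_ps();             /* packed accumulator, p = 0 */
    acc = _mm_add_ps(acc, _mm_mul_ps(a, b));   /* mulps followed by addps */

    /* Assumed reduction step: add the four lanes to get a scalar result. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#else
    float p = 0.0f;
    for (int i = 0; i < 4; i++)
        p = p + src1[i] * src2[i];
    return p;
#endif
}
```

The multiply and the add are still separate operations here, which is the pattern a fused multiply-add is meant to collapse.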


With SSE5, however, the FMADDPS instruction performs the same computation in a single instruction:

// xmm0 = p, xmm1 = src1, xmm2 = src2; xmm0 = xmm1 * xmm2 + xmm0
fmaddps xmm0, xmm1, xmm2, xmm0
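To make the operand roles of the four-operand form concrete, here is a minimal scalar model in C. It is a sketch, not an intrinsic or an emulator: the name fmadd4 and the array-based "registers" are hypothetical, and the model rounds after the multiply and again after the add, which a fused instruction would typically avoid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of "fmaddps dest, src1, src2, src3":
 * each of the four lanes computes dest = src1 * src2 + src3.
 * dest may be the same array as src3, mirroring xmm0 in the example above. */
void fmadd4(float dest[4], const float src1[4], const float src2[4],
            const float src3[4])
{
    assert(dest != NULL && src1 != NULL && src2 != NULL && src3 != NULL);
    for (int i = 0; i < 4; i++)
        dest[i] = src1[i] * src2[i] + src3[i];
}
```

Calling fmadd4(p, src1, src2, p) corresponds to the instruction shown above, with p acting as both the accumulator input and the destination.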