For many years, developers writing multimedia software have relied on a series
of special extensions added to the instruction set of x86 processors. These
extensions have evolved across multiple generations of processors, from the
MMX instructions introduced in 1997 to the SSE3 extensions now found on AMD64
chips as well as on chips from other vendors. The principal focus of these
extensions has been increasing the performance of floating-point and integer
arithmetic. Fast math is a critical requirement of the multimedia experience,
whether the task is calculating new shading values for a graphic image or
converting bits into a video clip or a rich audio stream. The extensions that
don't deal directly with arithmetic frequently provide capabilities that
complement or support arithmetic operations.
At the heart of much of this computation lies a design called SIMD (single
instruction, multiple data), which refers to the ability to apply a single
instruction to multiple data items. The typical implementation relies on a set
of 128-bit registers into which multiple smaller data items are placed. These
items can be anything from 8-bit integers to 64-bit floating-point numbers, so
one register holds, for example, sixteen 8-bit integers, four 32-bit floats,
or two 64-bit doubles. Once the registers are loaded with data, a single
instruction is issued and the same operation is performed on all the operands
at once, providing tremendous performance leverage. Even with the cost of
moving data into and out of the registers (technically referred to as load and
store operations, respectively), the resulting arithmetic is far faster than
performing the equivalent computations one operand at a time.
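As a rough illustration of the load/operate/store pattern just described, here is a minimal sketch in C using the SSE intrinsics declared in `<xmmintrin.h>`, which the major x86 compilers (GCC, Clang, MSVC) provide. The function names `add_scalar` and `add_sse` are hypothetical, chosen only for this example, and the sketch assumes an x86 target with SSE support and 32-bit `float` elements, so each 128-bit register holds four values.

```c
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical scalar reference: one addition per loop iteration. */
static void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Hypothetical SIMD version: each 128-bit XMM register holds four 32-bit
   floats, so a single packed-add instruction performs four additions. */
static void add_sse(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* load: memory -> register (4 floats) */
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vs = _mm_add_ps(va, vb);   /* one instruction, four sums */
        _mm_storeu_ps(out + i, vs);       /* store: register -> memory */
    }
    /* Leftover elements when n is not a multiple of 4 are added one at a time. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The unaligned forms `_mm_loadu_ps` and `_mm_storeu_ps` are used so the arrays need not be 16-byte aligned; with aligned data, `_mm_load_ps` and `_mm_store_ps` could be used instead. Calling both functions on the same inputs should produce identical results, with the SIMD loop issuing one add for every four elements.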
The tradition of enhancing SIMD capabilities continues with the upcoming release
of AMD's Barcelona x86 processor, which includes new instructions and speed
improvements that considerably accelerate multimedia calculations. These extensions, known collectively
as SSE128, are the focus of this article. I will discuss what's new and different
about these instructions, and give some pointers regarding their use.
The Coming-Out Party
AMD processors have long been at the forefront of the multimedia revolution.
Most recently, the company's 64-bit AMD64 extensions doubled the number of
floating-point (XMM) registers available for advanced computations, from 8 to
16. And by leading the way in 64-bit x86 extensions, AMD was the first to offer
native 64-bit integer calculations in the core x86 instruction set.
Later this year, AMD will ship the Barcelona processor, which is a groundbreaking
quad-core x86 processor that offers several compelling performance-oriented
features. One feature that has received considerable attention is what AMD calls
"true quad core," a design that provides each processor core with its
own private L2 cache. Another feature, less widely covered but also likely to
deliver visible performance benefits, is the set of SSE128 extensions. These
extensions provide multiple capabilities that move the x86 architecture past
the 64-bit limitations of many current SSE implementations, which execute each
128-bit SSE operation as two 64-bit halves.