Get Extraordinary Performance By Exploiting the GPU  

The processors in video cards are purpose-designed for running thousands of threads simultaneously. So how come applications don’t use them to offload processing? A new toolkit from AMD allows you to do just that.
Anderson Bailey  8/13/2008 

Overview
The chip that drives video cards, the graphics processing unit (GPU), is designed for fast integer and floating-point arithmetic. This capability enables the video adapter to quickly compute color, shading, texture, and other aspects of a changing image and render them to the screen in real time, creating lifelike multimedia experiences. On many PCs, especially business PCs, much of this capability goes unused because business graphics rarely need these advanced video capabilities. The GPU and related hardware are therefore available to be harnessed for non-video computation.

Software Tools
To enable this use of the video hardware, AMD recently released a set of software development tools that let programmers offload arithmetic operations to the GPU. These tools, which collectively form AMD’s Stream SDK, operate at three distinct layers and integrate with C and C++ development products. The SDK comprises:

Brook+, a high-level language based on Brook, which is an extension to the C language that was originally developed at Stanford University for parallel computation. Specifically, Brook+ is an implementation of BrookGPU, which is an abstract processor model used for simplifying computation on graphics chips. Brook+ is implemented using AMD’s Compute Abstraction Layer (CAL), which is discussed next. AMD includes a Brook+ compiler that converts Brook+ files into C for execution on the CPU and specialized output, discussed shortly, for execution on the GPU.

CAL, a compute abstraction layer for the GPU. With CAL, many of the hardware peculiarities of GPUs are abstracted away, so developers can write directly to the GPU without needing to learn graphics-oriented languages or the video and multimedia details of the graphics processor.

IL, or intermediate language, a high-level assembly language for the GPU. It is designed to be not quite as low level as true assembly language, so that it can run unchanged on many different ATI GPUs that might have varying hardware capabilities. IL is ultimately translated into binary instructions for the GPU. AMD does not recommend programming at the IL level because it requires detailed hardware knowledge, but IL is available to developers who want to get down to low-level hardware manipulation.

With this set of tools, developers can offload parallelizable arithmetic computation to the GPU for accelerated execution.

Basic Concepts
To understand how code in Brook+ should be approached, it’s useful to think in terms of two fundamental building blocks: streams and kernels.

Streams are collections of data elements of the same data type that can be operated on in parallel. This data generally takes the form of an array. A stream is read (loaded with data sent from the CPU to the GPU) or written (its contents copied from the GPU back to the CPU).

Kernels are the program functions that operate on input streams to produce every element of the output streams. Ideally, the operations are SIMD-like (SIMD = single instruction, multiple data), meaning that a single operation is performed across multiple data items. Developers who have used compiler intrinsics to take advantage of SSE instructions (including SSE2, SSE3, or SSE5) will recognize this approach.
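To make the stream and kernel idea concrete, here is a minimal sketch in plain C that runs on the CPU only. It is not Brook+ and not part of the Stream SDK; the names binary_kernel, add_kernel, mul_kernel, and map_streams are hypothetical and chosen just for illustration. The point is that a kernel describes the work for one element, and something else applies it across the whole stream.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel type: produces one output element from two inputs. */
typedef float (*binary_kernel)(float, float);

/* Two example kernels. Each one only ever sees a single pair of elements. */
float add_kernel(float x, float y) { return x + y; }
float mul_kernel(float x, float y) { return x * y; }

/* Applies the kernel once per element of the input arrays.
   No iteration depends on another, which is the property that
   lets a GPU run the iterations in parallel instead of in a loop. */
void map_streams(binary_kernel k, const float *in_a, const float *in_b,
                 float *out, size_t count)
{
    assert(k != NULL);
    for (size_t i = 0; i < count; i++) {
        out[i] = k(in_a[i], in_b[i]);
    }
}
```

Calling map_streams(add_kernel, a, b, out, n) fills out with the element-wise sums. Swapping in mul_kernel changes the operation without touching the loop, which mirrors how a kernel is written independently of the stream sizes.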

In a typical operation, the kernels are defined first, followed by the stream definitions, followed by calls to the Brook+ access functions, as shown in the following listing.

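kernel void sum(float a<>, float b<>, out float c<>)
{
    /* Runs once per element: each output is the sum of the matching inputs. */
    c = a + b;
}

int main(int argc, char** argv)
{
    int i, j;

    /* Streams: 10 x 10 collections of floats operated on by the GPU. */
    float a<10, 10>;
    float b<10, 10>;
    float c<10, 10>;

    /* Ordinary C arrays on the CPU side; input_c receives the result. */
    float input_a[10][10];
    float input_b[10][10];
    float input_c[10][10];

    for (i = 0; i < 10; i++) {
        for (j = 0; j < 10; j++) {
            input_a[i][j] = (float) i;
            input_b[i][j] = (float) j;
        }
    }

    streamRead(a, input_a);     /* send input_a to the GPU as stream a */
    streamRead(b, input_b);     /* send input_b to the GPU as stream b */
    sum(a, b, c);               /* run the kernel on the GPU */
    streamWrite(c, input_c);    /* bring stream c back into input_c */

    return 0;
}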

The kernel is defined (at the top of this example) prior to being called. The streams, a, b, and c, are declared using angle brackets in the main() function. Note that the arrays input_a and input_b are loaded with floating-point values and then mapped to the streams a and b via the streamRead() command, which sends the data streams to the GPU where the kernel sum is executed. The resulting stream, c, is then moved from the GPU to the CPU and mapped to input_c via the call to streamWrite().
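The copy-in, compute, copy-out flow described here can be mimicked in plain C for readers who want to check the expected results without a GPU. This is only a CPU-side sketch under stated assumptions: stream2d, fake_stream_read, fake_sum, and fake_stream_write are hypothetical stand-ins, not Brook+ or CAL functions, and the separate buffer merely plays the role of GPU memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ROWS 10
#define COLS 10

/* Hypothetical stand-in for a stream: a separate buffer that plays
   the role of GPU memory in this CPU-only sketch. */
typedef struct {
    float data[ROWS][COLS];
} stream2d;

/* Mimics streamRead(): copies a host array into the stream. */
void fake_stream_read(stream2d *s, float host[ROWS][COLS])
{
    assert(s != NULL);
    memcpy(s->data, host, sizeof s->data);
}

/* Mimics streamWrite(): copies the stream back into a host array. */
void fake_stream_write(const stream2d *s, float host[ROWS][COLS])
{
    assert(s != NULL);
    memcpy(host, s->data, sizeof s->data);
}

/* CPU version of the sum kernel: one addition per output element. */
void fake_sum(const stream2d *a, const stream2d *b, stream2d *c)
{
    for (size_t i = 0; i < ROWS; i++) {
        for (size_t j = 0; j < COLS; j++) {
            c->data[i][j] = a->data[i][j] + b->data[i][j];
        }
    }
}
```

With input_a[i][j] = i and input_b[i][j] = j, as in the listing, running the three steps in order leaves input_c[i][j] equal to i + j. That gives a simple reference to compare GPU output against.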

As you can see, the communication between the CPU and GPU is kept at a very high level with these stream functions, and the work of mapping values to computational data items that the GPU can process is also greatly facilitated. C programmers should be able to quickly get the hang of using Brook+ to accelerate computation of parallelizable operations.

Performance
Although Brook+ is easy to use, developers might wonder whether they will get any significant advantage from it. In tests of a Brook+ implementation of Folding@Home (the Stanford distributed processing project for studying proteins at http://folding.stanford.edu/), the Brook+ client averaged 60 gigaflops per GPU client. This compares with 25 gigaflops for a Sony PlayStation 3 client and 1 gigaflop for a CPU-only PC solution. In other words, offloading the parallel arithmetic to the GPU yielded roughly a 60-fold gain in arithmetic throughput over the CPU-only client.
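The arithmetic behind a comparison like this is easy to check. The sketch below uses two hypothetical helper functions (speedup_factor and percent_increase are illustrative names, not SDK calls) to show that a speedup factor and a percentage increase are different quantities: going from 1 to 60 gigaflops is 60 times the throughput, which strictly speaking is a 5,900% increase.

```c
#include <assert.h>

/* How many times the baseline throughput the new throughput is. */
double speedup_factor(double baseline_gflops, double new_gflops)
{
    assert(baseline_gflops > 0.0);
    return new_gflops / baseline_gflops;
}

/* Percentage increase over the baseline: (new - old) / old * 100. */
double percent_increase(double baseline_gflops, double new_gflops)
{
    assert(baseline_gflops > 0.0);
    return (new_gflops - baseline_gflops) / baseline_gflops * 100.0;
}
```

Using the figures quoted above, speedup_factor(1.0, 60.0) is 60 and percent_increase(1.0, 60.0) is 5900, while the GPU client's advantage over the PlayStation 3 client, speedup_factor(25.0, 60.0), is about 2.4.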

Getting Started
To get started with Brook+, you need a system with an ATI graphics chip that supports Brook+, an up-to-date driver for the GPU, the AMD Stream SDK (available at no charge from AMD), and a supported operating system plus development toolchain. As the requirements below show, most PC systems today with an ATI-based graphics card fit the bill.

For hardware, the graphics card must be an ATI Radeon HD 2400 (R6xx GPU) or later. For double precision support, you will need an ATI Radeon HD 3870 graphics card (RV670 GPU) or later. These cards first started shipping in the spring of 2007. ATI Catalyst™ driver version 7.11 or later is required for the graphics card. (The latest drivers can be downloaded from http://ati.amd.com/support/driver.html). On Microsoft® Windows® XP (32- and 64-bit versions), you will need Microsoft .NET Framework Version 2.0 Redistributable Package plus Microsoft Visual Studio 2005 (also called VC8). Brook+ runs on both 32- and 64-bit versions of Linux® and it requires gcc/g++ 4.1.2.

The complete AMD Stream SDK for all the supported operating systems can be downloaded from http://ati.amd.com/technology/streamcomputing/sdkdwnld.html. Tech support for these tools can be found in an active forum at: http://forums.amd.com/devforum/categories.cfm?catid=328


Wrapping Up
AMD’s acquisition of ATI Technologies in 2006 gave the company a unique ability to leverage both the GPU and the CPU, thanks to its extensive expertise in both semiconductor technologies and its established set of high-performance development tools. The AMD Stream Computing initiative makes it possible for developers to access the benefits of this combined expertise without having to delve deeply into GPU or CPU details, using nothing more than C language extensions that are quickly learned and easily mastered. Future releases of the software are expected to provide even greater performance benefits as a result of advances in both the software and hardware components. So if supercomputing on desktop machines appeals to you, AMD’s Stream SDK is just the ticket!

Additional Resources
FAQ for AMD Stream Computing
http://ati.amd.com/technology/streamcomputing/faq.html#brook1

White Papers and Recent Presentations on AMD Stream Computing
http://ati.amd.com/technology/streamcomputing/resources.html

Stanford paper explaining Brook and the Brook processor
http://graphics.stanford.edu/papers/brookgpu/

 

Anderson Bailey is a developer with a longstanding interest in the techniques for using code to exploit processor features. He can be reached at chip.coder@gmail.com.

© 2009 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon, AMD Turion, AMD Sempron, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.
