Skip navigation links
Tools
SDKs
Libraries
Samples & Demos
Docs
Zones
Community
Support
OpenCL™ Optimization Case Study: Simple Reductions 
Skip Navigation LinksHome > Docs > Articles & Whitepapers

Reductions, which take a vector of data and reduce it to a single element, are widely used in data-parallel programming. In this article, we examine strategies for efficiently mapping reductions onto the ATI Radeon™ HD 5870 GPU and AMD Phenom™ II X4 965 CPU. Taking advantage of properties of the reduction being performed, as well as matching the style of reduction to the hardware platform, can result in performance improvements of up to 15x, compared to native code.

Bryan Catanzaro  8/24/2010 

Introduction

OpenCL™ allows developers to write portable, high-performance code that can target all varieties of parallel processing platforms, including AMD CPUs and GPUs. Like with any parallel programming model for parallel processing, achieving good efficiency requires careful attention to how the computation is mapped to the hardware platform and executed. Since performance is a prime motivation for using OpenCL™, understanding the issues which arise when optimizing OpenCL™ code is a natural part of learning how to use OpenCL™ itself. This article discusses simple reductions. A reduction is a very simple operation that takes an array of data and reduces it down to a single element, for example - by summing all the elements in the array. Consider this simple C code, which sums all the elements in an array:

float reduce_sum(float* input, int length) {
  float accumulator = input[0];
  for(int i = 1; i < length; i++) 
    accumulator += input[i];
  return accumulator;
}

This code is completely sequential! There's no way to parallelize the loop, since every iteration of the loop depends on the iteration before it. How can we parallelize this code? It turns out that we can parallelize many reductions by taking advantage of the properties of the reduction we're trying to perform. As counter-intuitive as it may seem, reductions are a fundamental data-parallel primitive used in many applications - from databases to physical simulation and machine learning. There are many different kinds of reductions, depending on the type of data being reduced and the operator which is being used to perform the reduction. For example, reductions can be used to find the sum of all elements in a vector, find the maximum or minimum element of a vector, or find the index of the maximum or minimum element of a vector.

The performance of parallel reductions can strongly depend on the details of how the reduction is mapped to a parallel platform. In this article, we will see how selecting the right strategy for reduction can be an order of magnitude faster than using a naive reduction algorithm, on both CPU devices, represented by the AMD Phenom™ II X4 965 CPU, as well as GPU devices, represented by the ATI Radeon™ HD 5870 GPU.

Reduction

The simple sequential sum reduction we just saw is not parallel at all: there's a sequential dependency on the accumulator variable that requires this reduction be done in a particular order, from front to back of the input array. However, many operators we want to apply which turn an array into a single element are more flexible, and don't require that the operations be done in one particular order. By taking advantage of this flexibility, we can turn a sequential loop into a parallel kernel by parallelizing the reduction in a variety of ways. We'll take a look at how to do this by starting from the bottom and going up.

Back to top
«1 2 3 4 5 6 7 8 »
2010 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon, AMD Turion, AMD Sempron, AMD Phenom, ATI Radeon, Catalyst, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.

This website may be linked to other websites which are not in the control of and are not maintained by AMD. AMD is not responsible for the content of those sites. AMD provides these links to you only as a convenience, and the inclusion of any link to such sites does not imply endorsement by AMD of those sites. AMD reserves the right to terminate any link or linking program at any time.
Printer Friendly Version
Table Of Contents