Introduction
OpenCL™ allows developers to write portable, high-performance code that can target all varieties of parallel processing platforms, including AMD CPUs and GPUs. Like with any parallel programming model for parallel processing, achieving good efficiency requires careful attention to how the computation is mapped to the hardware platform and executed. Since performance is a prime motivation for using OpenCL™, understanding the issues which arise when optimizing OpenCL™ code is a natural part of learning how to use OpenCL™ itself.
This article discusses simple reductions. A reduction is a very simple operation that takes an array of data and reduces it down to a single element, for example - by summing all the elements in the array. Consider this simple C code, which sums all the elements in an array:
float reduce_sum(float* input, int length) {
float accumulator = input[0];
for(int i = 1; i < length; i++)
accumulator += input[i];
return accumulator;
}
This code is completely sequential! There's no way to parallelize the loop, since every iteration of the loop depends on the iteration before it. How can we parallelize this code?
It turns out that we can parallelize many reductions by taking advantage of the properties of the reduction we're trying to perform. As counter-intuitive as it may seem, reductions are a fundamental data-parallel primitive used in many applications - from databases to physical simulation and machine learning. There are many different kinds of reductions, depending on the type of data being reduced and the operator which is being used to perform the reduction. For example, reductions can be used to find the sum of all elements in a vector, find the maximum or minimum element of a vector, or find the index of the maximum or minimum element of a vector.
The performance of parallel reductions can strongly depend on the details of how the reduction is mapped to a parallel platform. In this article, we will see how selecting the right strategy for reduction can be an order of magnitude faster than using a naive reduction algorithm, on both CPU devices, represented by the AMD Phenom™ II X4 965 CPU, as well as GPU devices, represented by the ATI Radeon™ HD 5870 GPU.
Reduction
The simple sequential sum reduction we just saw is not parallel at all: there's a sequential dependency on the accumulator variable that requires this reduction be done in a particular order, from front to back of the input array.
However, many operators we want to apply which turn an array into a single element are more flexible, and don't require that the operations be done in one particular order. By taking advantage of this flexibility, we can turn a sequential loop into a parallel kernel by parallelizing the reduction in a variety of ways. We'll take a look at how to do this by starting from the bottom and going up.