Reductions, which take a vector of data and reduce it to a single element, are widely used in data-parallel programming. In this article, we examine strategies for efficiently mapping reductions onto the ATI Radeon™ HD 5870 GPU and AMD Phenom™ II X4 965 CPU. Taking advantage of properties of the reduction being performed, as well as matching the style of reduction to the hardware platform, can result in performance improvements of up to 15x, compared to native code.

Bryan Catanzaro |
8/24/2010 |

### Introduction

OpenCL™ allows developers to write portable, high-performance code that can target all varieties of parallel processing platforms, including AMD CPUs and GPUs. Like with any parallel programming model for parallel processing, achieving good efficiency requires careful attention to how the computation is mapped to the hardware platform and executed. Since performance is a prime motivation for using OpenCL™, understanding the issues which arise when optimizing OpenCL™ code is a natural part of learning how to use OpenCL™ itself.

This article discusses simple reductions. A reduction is a very simple operation that takes an array of data and reduces it down to a single element, for example – by summing all the elements in the array. Consider this simple C code, which sums all the elements in an array:

float reduce_sum(float* input, int length) { float accumulator = input[0]; for(int i = 1; i < length; i++) accumulator += input[i]; return accumulator; }

This code is completely sequential! There’s no way to parallelize the loop, since every iteration of the loop depends on the iteration before it. How can we parallelize this code?

It turns out that we can parallelize many reductions by taking advantage of the properties of the reduction we’re trying to perform. As counter-intuitive as it may seem, reductions are a fundamental data-parallel primitive used in many applications – from databases to physical simulation and machine learning. There are many different kinds of reductions, depending on the type of data being reduced and the operator which is being used to perform the reduction. For example, reductions can be used to find the sum of all elements in a vector, find the maximum or minimum element of a vector, or find the index of the maximum or minimum element of a vector.

The performance of parallel reductions can strongly depend on the details of how the reduction is mapped to a parallel platform. In this article, we will see how selecting the right strategy for reduction can be an order of magnitude faster than using a naive reduction algorithm, on both CPU devices, represented by the AMD Phenom™ II X4 965 CPU, as well as GPU devices, represented by the ATI Radeon™ HD 5870 GPU.

### Reduction

The simple sequential sum reduction we just saw is not parallel at all: there’s a sequential dependency on the accumulator variable that requires this reduction be done in a particular order, from front to back of the input array.

However, many operators we want to apply which turn an array into a single element are more flexible, and don’t require that the operations be done in one particular order. By taking advantage of this flexibility, we can turn a sequential loop into a parallel kernel by parallelizing the reduction in a variety of ways. We’ll take a look at how to do this by starting from the bottom and going up.

Associativity and Commutativity

As we just mentioned, if our reduction operator gives some flexibility in terms of what order the operations must be performed, we can parallelize a sequential reduction. Addition is a common reduction operator that gives us a lot of flexibility – let’s consider summing a vector of three numbers: [10, 20, 30]. The sequential sum would do two additions: ((10 + 20) + 30). But, we’d get the same answer if we had grouped the additions differently: (10 + (20 + 30)), or even if we had reordered the additions: ((30 + 10) + 20).

You’ve probably heard of these properties before – if an operator allows us to regroup the operations and still get the same result, we call it associative, and if it allows us to reorder the operations and still get the same result, we call it commutative.

It turns out that these properties are key to parallelizing a reduction. We can take advantage of associativity to divide up the reduction into independent pieces, and then combine results from the independent pieces. For example, a+b+c+d = (a+b)+(c+d). (a+b) can be computed in parallel with (c+d), and then the two partial reductions combined to complete the reduction. This can be generalized to reductions on vectors of arbitrary size by recursively dividing the input vectors, computing partial reductions, and then reducing the partial reductions to form the result.

**Figure 1: Associative Reduction Tree and SIMD Mapping**

Figure 1 illustrates an associative reduction tree on an eight element vector. This reduction tree does not assume commutativity of the operator, since none of the additions are reordered, they are just regrouped.

### Building a reduction for GPU Devices

Let’s build a parallel reduction for the GPU, starting at the OpenCL™ work-group level. We’ll take advantage of associativity to break the vector into small chunks, each of which we’ll build independent reduction trees for each chunk, and execute them independently, in parallel. We’ll make sure each of the chunks is small enough that it fits in local memory, and then we’ll assign one work-item per element.

At each stage of the reduction tree, we’ll be loading and storing partial reductions as we compute, so it’s crucial to use local memory to communicate between work-items in the work group. We’ll then execute the reduction tree by using a for loop in conjunction with OpenCL™ barriers. For example, see the following code, which performs a min reduction to find the smallest element in a vector:

__kernel void reduce( __global float* buffer, __local float* scratch, __const int length, __global float* result) { int global_index = get_global_id(0); int local_index = get_local_id(0); // Load data into local memory if (global_index < length) { scratch[local_index] = buffer[global_index]; } else { // Infinity is the identity element for the min operation scratch[local_index] = INFINITY; } barrier(CLK_LOCAL_MEM_FENCE); for(int offset = 1; offset < get_local_size(0); offset <<= 1) { int mask = (offset << 1) - 1; if ((local_index & mask) == 0) { float other = scratch[local_index + offset]; float mine = scratch[local_index]; scratch[local_index] = (mine < other) ? mine : other; } barrier(CLK_LOCAL_MEM_FENCE); } if (local_index == 0) { result[get_group_id(0)] = scratch[0]; } }

A couple things to note about this code. Firstly, it assumes that the number of work-items in the work-group is a power of 2, otherwise it will try to access illegal elements in the scratch space. For reductions where the size is not a power of two, we’ll just pad the vector with a few extra elements on the host. Secondly, it assumes that local memory has a legal element for every work-item in the work group. To take care of situations where this is not true, it initializes the local memory with the identity element for the operation we’re performing. Thirdly, you’ll see that we have not reordered any of the reduction operations, but only regrouped them. If you look at the code, you’ll see that we’ve encoded the reduction tree very directly: on the first iteration of the reduction loop, the condition

((local_ index & mask) == 0)

will be true for every other work-item in the work-group. On the second iteration, the condition will be true for every 4th work-item in the work group. On the third iteration, the condition is true for every 8th work-item in the work-group, and so on until we’ve completed the reduction. This ensures that the reductions are not reordered, but this implementation leads to a very sparse utilization of the work-items in the work group, which is inefficient when mapped onto SIMD processors such as AMD GPUs.

Figure 1 shows how this reduction tree proceeds in hardware when mapped onto a SIMD processor. Recall that we have assigned one work-item to each element of the input array we’re reducing, and that each work-item is going to be mapped to a SIMD lane in the GPU hardware wavefront. Figure 1 shows how each work-item will be used during the reduction, and how data will be transferred through the reduction tree. As you can see, the work-items are going to be used sparsely, and at each step of the reduction tree, the active work-items get sparser and sparser. This leads to poor SIMD efficiency, in the example in figure 1, we have only about 30% of the work-items active, on average.

### Taking Advantage Of Commutativity

Almost all simple reduction operators, such as addition, min and max, or argmin and argmax used to find the index of the smallest or largest element in a vector, are both associative and commutative. We’ll take advantage of this to improve efficiency.

SIMD reduction efficiency can be improved by using commutativity to reorder the operations into a more SIMD friendly structure, as shown in Figure 2.

**Figure 2: Commutative Reduction and SIMD Mapping**

With a commutative reduction tree, the active work-items are compacted into contiguous blocks. Assuming the work-group will be mapped onto multiple SIMD wavefronts, as a wavefront becomes completely unused during the reduction, it will be dropped from execution. This leads to better SIMD efficiency – if we assume a wavefront size of four work-items for the tree shown in figure 2, our SIMD efficiency is 58%, which is about twice the efficiency of the associative reduction. The code needed to do this is basically identical to the associative version, just with the for loop restructured as follows:

for(int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) { if (local_index < offset) { float other = scratch[local_index + offset]; float mine = scratch[local_index]; scratch[local_index] = (mine < other) ? mine : other; } barrier(CLK_LOCAL_MEM_FENCE); }

Multi-stage Reduction

**Figure 3: Multi-stage Reduction**

So far, we’ve been discussing parallel reductions on vectors which are short enough to fit in a single work-group. In order to reduce over larger vectors, we need to move to a multi-stage reduction, where we do local reductions, followed by larger reductions. Three possible ways of doing this are:

- Recursive multi-stage reductions. In this approach, illustrated in figure 3, the results produced by each local reduction are gathered into a new vector, which is then reduced again. At each stage of the reduction, the amount of data remaining is reduced by a constant reduction factor. For example, if we worked with work-groups of 256 work-items, each stage would reduce the amount of data by a factor of 256. This style of reduction expresses the largest amount of parallelism, and it can be written to only take advantage of associativity, for operators which are not commutative. However, it can be less efficient.
- Two-stage reductions. In this approach,which we will explain in greater detail later, we express just enough parallelism to fill the machine, and then follow with a final global reduction. Taking advantage of commutativity, we can then perform most of the work sequentially, which improves efficiency compared to the fully-parallel multi-stage reduction. Additionally, we only have to launch two kernels per reduction.
- Reductions using atomics. Instead of using an explicitly multi-stage algorithm, you can use atomic memory operations in OpenCL™, such as atom_ add() to reduce the partial results from each local reduction. Of course, atomic transactions will limit you to the operators and data-types which are supported by the platform you’re targeting. Many applications need reductions which are not supported by atomic operations, so we’ll just mention that they can be useful in some situations, but won’t give details in this article.

### Recursive Multi-stage Reduction Performance

The recursive, multi-stage reduction is often the first reduction approach people try, since it’s maximally parallel. We often think that since GPUs are highly parallel processors that can process thousands of work-items, we should parallelize things as much as possible. As we’ll see, sometimes it’s better to choose a more serial approach.

Figure 4 shows the performance achieved on the ATI Radeon™ HD 5870 GPU, for multi-stage reduction kernels which find the minimum element in vectors of floats. The purely associative reduction averages 1.6 GigaReductions/second for large input sizes, while the multi-stage commutative kernel, which permutes the reduction tree to better fit the SIMD nature of the GPU, averages 2.1 GigaReductions/second.

**Figure 4: Multi-stage GPU Performance**

To analyze how well this performs, we note that for a vector of *n* elements, there are *n-1* operations which must be performed to complete the reduction. This means that our reduction kernels are performing at 1.6 or 2.1 GFlops/s. All *n *elements need to be loaded from off-chip memory, which means that the peak performance we could expect if we were off-chip memory bandwidth limited would be:

157.6 GB/s * 1 SPFLOP/4 B = 39.4 SPGFLOP/s

Accordingly, our multi-stage reductions are performing at about 5% of peak performance, which shows we’ve got quite a bit of room to improve performance.

Two-stage Reduction

All those parallel reductions in the multi-stage reductions we’ve been discussing are fairly expensive, because they involve lots of synchronization and data-sharing amongst work-items in the work-group. Perhaps it makes sense to try and do as much of the reduction serially as possible, avoiding parallel reductions across the work-group as much as possible. However, we still need to fill our compute device with useful work.

This observation motivates the two-stage reduction, where the input is divided up into *p* chunks, where *p* is large enough to keep all of our processors busy. In OpenCL™, each chunk will be processed by a work-group. Taking advantage of commutativity, we can have each work-group process its chunk by iterating over work-group sized pieces, and having every work-item keep a running reduction as it goes. After we’ve processed the entire array, each work-group writes out a single reduction result, which we assemble into another array, and then reduce with a final reduction call.

Technically, we could do a two stage reduction without taking advantage of commutativity by having each work-item sequentially reduce a large contiguous block of the array, and then finishing with a parallel reduction in each work-group, followed by a final reduction call. However, it is difficult to make this approach efficient, since each work-item would be loading data from a separate region of memory, which then

reduces bandwidth utilization substantially, since the loads from a wavefront of work-items are not contiguous. We’ll leave it as an exercise to the reader how to create an efficient two-stage, associative reduction, but will note that since a great number of reduction operators are commutative anyway, this problem is mostly just a curiosity.

**Figure 5: Two-stage Reduction**

Figure 5 illustrates how the two-stage reduction works. The vector is divided into work-group size chunks, which are then parceled out to the compute units on the device. In this example, we show a parallelization for two work-groups, a blue work group and a green work-group. The work-groups loop over their chunks of the input vector, by performing sequential reductions. Once the entire vector has been processed, the work-groups perform parallel reductions to finish out the first stage. The second stage of the reduction performs a parallel reduction on all the partial results from each work-group.

__kernel void reduce(__global float* buffer, __local float* scratch, __const int length, __global float* result) { int global_index = get_global_id(0); float accumulator = INFINITY; // Loop sequentially over chunks of input vector while (global_index < length) { float element = buffer[global_index]; accumulator = (accumulator < element) ? accumulator : element; global_index += get_global_size(0); } // Perform parallel reduction int local_index = get_local_id(0); scratch[local_index] = accumulator; barrier(CLK_LOCAL_MEM_FENCE); for(int offset = get_local_size(0) / 2; offset > 0; offset = offset / 2) { if (local_index < offset) { float other = scratch[local_index + offset]; float mine = scratch[local_index]; scratch[local_index] = (mine < other) ? mine : other; } barrier(CLK_LOCAL_MEM_FENCE); } if (local_index == 0) { result[get_group_id(0)] = scratch[0]; } }

By launching only enough work-groups to fill the compute device, we ensure that most of the reductions happen sequentially, which maximizes SIMD efficiency and drastically improves performance. In the fully-parallel, recursive reduction style mentioned above, during the first reduction phase, for a vector of length *n* elements, the number of parallel reductions we have to do is proportional to *n*. Using the two-stage reduction style, we only perform a constant number of parallel reductions, regardless of how large our input is, since we only do parallel reductions for as many work-groups as we need to fill the compute device. This makes the reduction much more efficient.

For this experiment, we found that launching 80 work-groups was a good choice for the ATI Radeon™ HD 5870 GPU. There are 20 compute units on the GPU, so launching 80 work-groups provides for simultaneous scheduling of multiple work-groups on the same compute unit, which can help cover memory latencies during execution.

### Two-stage Reduction Performance

**Figure 6: Two-stage GPU Performance**

Figure 6 shows the performance we achieve using the two-stage approach. As you can see, it’s quite a bit faster, averaging 27.3 GigaReductions/second for large vectors. This is 70% of our peak bandwidth-bound performance, quite a big difference. Vectorizing this kernel so that it performs operations on float4 vectors instead of just float elements improved the performance even more, bringing us to 30.3 GigaReductions/second, or 77% of our bound. It’s important to note that vectorizing reductions requires that the reduction operator be commutative, since the vectorization process inherently reorders the reduction tree.

This result is perhaps somewhat counterintuitive: switching from the most parallel algorithm we could devise for reduction to a more serial algorithm brought us great efficiency benefits. Despite the fact that GPU hardware needs a lot of parallelism to perform well, it turns out that for some algorithms, its best to choose only a degree of parallelism which fills the compute device, but no more.

Reductions on CPU Devices

After our finding that fully-parallel reductions introduce more work than necessary, it follows that targeting CPU devices will require a different style of reduction than targeting GPU devices, since CPU devices require smaller amounts of parallelism than GPU devices.

**Figure 7: CPU Performance**

Figure 7 shows the realized performance of several reduction kernels on the AMD Phenom™ II X4 965 processor, which has four compute units. Running the two-stage kernel which provided the greatest performance on the GPU brings us 79 MegaReductions/second for large inputs. Our bound is 21.3 GB/s * 1 SPFLOP/4 B = 5.3 SPGFLOP/s.

This means that the two-stage kernel is yielding only 1.5% of peak performance, which clearly shows room for improvement.

As we’ve shown, serial reductions are more efficient than parallel reductions, so the following code attempts to serialize the reduction as much as possible. It is still a two-phase reduction, but it expects that each work-group is a serial work-group with only one work-item. Each work-group performs the reduction serially, and then writes out the reduction for their chunk of the input vector. We can then instantiate a few work-groups, just enough to fill all the compute units of our CPU device, and finish the reduction with a very small, sequential reduction in host code. The following is the OpenCL™ kernel for this serial reduction.

__kernel void reduce(__global float* buffer, __const int block, __const int length, __global float* result) { int global_index = get_global_id(0) * block; float accumulator = INFINITY; int upper_bound = (get_global_id(0) + 1) * block; if (upper_bound > length) upper_bound = length; while (global_index < upper_bound) { float element = buffer[global_index]; accumulator = (accumulator < element) ? accumulator : element; global_index++; } result[get_group_id(0)] = accumulator; }

This code performs much better on our CPU device. When we launch a kernel with only one work-group, using just one of the four compute units the AMD Phenom™ CPU provides, we average 1.2 GigaReductions/second, or 22% of our bound for this device. When we parallelize the reduction across all the compute units of our CPU device, we average 3.0 GigaReductions/second, or 57% of our bound. Vectorizing the reductions improves performance to 3.3 GigaReductions/second, or 62% of our bound. It’s important to note that in this case, in order for the vectorized version to improve performance on the CPU, one must use the **min()** intrinsic provided by OpenCL™, rather than using the C ?: ternary operator, as we have been using in these code examples. The ?: operator, when used on the CPU on a SIMD vector, which introduces control flow into the SIMD vector, causing the OpenCL™ compiler to emit non-vectorized code, resulting in performance losses instead of performance gains.

Overall, getting 62% of our bandwidth limited performance shows that OpenCL™ can provide good performance on CPU devices as well as GPU devices.

Conclusion

When we started this article, we were faced with a challenge: how to take a sequential reduction loop and parallelize it. We took a look at how associativity and commutativity allow us to restructure a sequential loop into reduction trees, and then looked at several strategies for building efficient reduction trees. Perhaps surprisingly, we found that the most parallel reduction trees were also very inefficient, because they required lots of communication and synchronization, which is expensive on parallel platforms. We then found that performing the reduction as serially as possible provided the best performance, both on the CPU as well as the GPU. We saw a 15x performance improvement on the GPU by taking advantage of commutativity to reduce the number of local parallel reductions we executed, compared to the fully parallel, recursive reduction. We saw a 2.8x performance improvement on the CPU by using all the cores and the SIMD units, compared to a sequential reduction.

Since reductions are such an important part of data-parallel programming, many OpenCL™ programmers will encounter the need to write them at some point. Hopefully this article has given you some ideas about what approaches will work well for your problem, whether on the CPU or the GPU.

*OpenCL™ and the OpenCL™ logo are trademarks of Apple Inc. used by permission by Khronos*