Conclusion
When we started this article, we were faced with a challenge: how to take a sequential reduction loop and parallelize it. We took a look at how associativity and commutativity allow us to restructure a sequential loop into reduction trees, and then looked at several strategies for building efficient reduction trees. Perhaps surprisingly, we found that the most parallel reduction trees were also very inefficient, because they required lots of communication and synchronization, which is expensive on parallel platforms. We then found that performing the reduction as serially as possible provided the best performance, both on the CPU as well as the GPU. We saw a 15x performance improvement on the GPU by taking advantage of commutativity to reduce the number of local parallel reductions we executed, compared to the fully parallel, recursive reduction. We saw a 2.8x performance improvement on the CPU by using all the cores and the SIMD units, compared to a sequential reduction.
Since reductions are such an important part of data-parallel programming, many OpenCL™ programmers will encounter the need to write them at some point. Hopefully this article has given you some ideas about what approaches will work well for your problem, whether on the CPU or the GPU.
OpenCL™ and the OpenCL™ logo are trademarks of Apple Inc. used by permission by Khronos