Conclusion
A close look at optimizing DIA sparse matrix-vector multiply has illustrated several techniques for getting good performance from OpenCL™ C code:
- Pay attention to the interplay between SIMD execution and your data structure.
- Align and densify accesses as much as possible.
- Use local memory to eliminate off-chip memory accesses.
- Vectorize your code for greater efficiency.
- When targeting the GPU, use OpenCL™ images for intermediate-sized data structures that have hard-to-predict access patterns but lots of reuse.
- When targeting the CPU, consider tailoring the amount of parallelism you express to the natural parallelism of the processor.
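As a reminder of the data structure these points refer to, here is a minimal reference sketch of DIA sparse matrix-vector multiply. It is written in plain C rather than OpenCL™ C so that it can run on its own, and it is not the optimized kernel developed in this article. The names `round_up` and `dia_spmv`, the `pitch` parameter, and the diagonal-major layout `data[d * pitch + row]` are assumptions made for illustration. The outer loop over rows corresponds to what one work-item per row would compute on the GPU; padding each diagonal to a common pitch is one way to keep neighbouring rows reading neighbouring, aligned addresses.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of align (align > 0). Padding every
   diagonal to this pitch makes each diagonal start on an aligned boundary. */
size_t round_up(size_t n, size_t align) {
    return (n + align - 1) / align * align;
}

/* Compute y = A * x for an n x n matrix A stored in DIA format.
   Assumed layout (illustrative): data[d * pitch + row] holds
   A[row][row + offsets[d]]. Slots whose column falls outside the
   matrix are padding and are skipped. */
void dia_spmv(size_t n, size_t ndiags, size_t pitch,
              const int *offsets, const float *data,
              const float *x, float *y) {
    assert(pitch >= n);
    /* On a GPU, each iteration of this loop would be one work-item. */
    for (size_t row = 0; row < n; ++row) {
        float sum = 0.0f;
        for (size_t d = 0; d < ndiags; ++d) {
            long col = (long)row + offsets[d];
            if (col >= 0 && col < (long)n) {
                /* Consecutive rows read consecutive elements of data. */
                sum += data[d * pitch + row] * x[col];
            }
        }
        y[row] = sum;
    }
}
```

To try it, fill `data` one diagonal at a time with a pitch such as `round_up(n, 8)`, pass the matching `offsets` array (for example `{-1, 0, 1}` for a tridiagonal matrix), and call `dia_spmv`. The alignment value 8 is only an example; a suitable value depends on the device and the vector width used.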
With these techniques, we've been able to construct a high-performance DIA sparse matrix vector multiply routine that efficiently uses the resources of the ATI Radeon HD 5870 GPU.
The same code that worked well on the GPU also provides decent performance on the CPU, and slightly adjusting the parallelism of the computation to better fit the CPU improved its performance a little further.
As you write your own code, careful attention to these principles will help you achieve high-performance results. Good luck!
References
[1] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, J. Demmel. Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. Parallel Computing, vol. 35, no. 3, pp. 178-194, 2009.
OpenCL™ and the OpenCL™ logo are trademarks of Apple Inc. used by permission by Khronos.