These are exciting times in the math libraries group here at AMD. We recently released ACL Beta 1 and a new open-source clSparse library. To keep the excitement going, I would like to talk a little about the performance of our clFFT library. clFFT is not new; it has been popular with our users over the past couple of years.
Since many of you are interested in the performance of clFFT, I think you’ll like the information given in this blog. I am providing charts to highlight:
- Performance improvement in v2.6.1 (released as part of ACL Beta 1) over prior versions
- Competitive performance of clFFT on AMD GPU against cuFFT on NVIDIA GPU
Performance Improvements Over Previous Versions
One of the improvements is in real transforms. Real-input discrete Fourier transforms have the advantage of requiring approximately half the compute and storage of a complex transform of the same size. On the other hand, the “1 + n/2” output storage of real transforms causes alignment and branching issues when computing on the GPU. Even with these factors, the library achieves good efficiency and performance.
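The “1 + n/2” figure comes from the conjugate symmetry of a real signal’s spectrum: the upper half of the complex output mirrors the lower half, so only 1 + n/2 complex values need to be stored. A quick sketch with NumPy (not clFFT) illustrates this:

```python
# Sketch of why a real-input transform needs only "1 + n/2" complex
# outputs: for real input, the spectrum satisfies X[k] = conj(X[n-k]),
# so the second half of the complex result is redundant.
import numpy as np

n = 16
x = np.random.rand(n)      # real input of length n

full = np.fft.fft(x)       # complex transform: n complex outputs
half = np.fft.rfft(x)      # real transform: 1 + n/2 complex outputs

assert len(half) == 1 + n // 2
# The real transform's output matches the first half of the full spectrum,
assert np.allclose(half, full[: n // 2 + 1])
# and the discarded half is just the conjugate mirror of the first half.
assert np.allclose(full[1 : n // 2], np.conj(full[-1 : n // 2 : -1]))
```

The awkward odd length (1 + n/2 rather than n/2) is exactly what causes the alignment and branching issues mentioned above when mapping the computation onto GPU work-groups.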
Figure 1 shows the performance improvement we made in real transforms for power-of-2 sizes. Previous versions of the library used an inefficient algorithm that caused performance to drop drastically at larger sizes. As you can see, this has been fixed. The black arrows indicate the significant uplift from this change: order-of-magnitude speedups for the largest transform sizes.
Performance Compared to the Competition
OK, so we’re faster than we used to be. But how do we stand up to the competition?
Take a look at Figures 2 and 3 to see the competitive performance of the clFFT library vs NVIDIA’s cuFFT library, for both complex and real transforms.
Figure 2 shows performance measured in gigaflops for complex transforms for power-of-2 sizes. As you can see, clFFT performs very well on the smaller, more common sizes, beating NVIDIA’s cuFFT by as much as 1.5x. For the larger sizes, the libraries are comparable. At the largest sizes, clFFT experiences a dip that we plan to address in the future.
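A note on the gigaflops metric: FFT benchmarks conventionally derive a flop count from the 5 · N · log2(N) operation estimate for a radix-2 complex transform, rather than counting the operations each library actually performs. The sketch below shows that convention; whether the charts here use exactly this formula is an assumption, and `fft_gflops` is a hypothetical helper, not part of the clFFT client.

```python
import math

def fft_gflops(n, batch, seconds):
    """Estimate GFLOPS for `batch` complex FFTs of length n, using the
    conventional 5 * N * log2(N) operation count per transform.
    (Assumed convention; not necessarily the clFFT client's formula.)"""
    flops = 5.0 * n * math.log2(n) * batch
    return flops / seconds / 1e9

# Example: 32M total elements as 8192 transforms of length 4096.
# If that batch completes in 1 ms, the estimated rate is ~2013 GFLOPS.
print(fft_gflops(4096, 8192, 1e-3))
```

Because the formula is size-based rather than operation-based, it lets transforms of different lengths (and libraries with different internal algorithms) be compared on one axis, which is what Figures 2 and 3 do.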
In Figure 3 you see the performance for real transforms for power-of-2 sizes. Once again the clFFT library is significantly faster than its NVIDIA counterpart, peaking at about 4x the performance. For larger sizes, the libraries are again comparable. In the middle range, where clFFT dips, we have already determined how to improve performance, and we will check in the code as we implement those improvements.
clFFT is under active development. We test performance regularly, identify areas of improvement, and work to optimize the library continually. Our goal is for the performance of the ACL to improve even more in upcoming versions.
The timings reported here measure the ‘execution step’ at the library API level. The total problem size is kept constant at 32M elements across all transform sizes by varying the batch count. The client executables and performance scripts used for measurement are available in the clFFT source repository on GitHub. Not only is the library code open source; so are the performance tests.
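The constant-work scheme described above can be sketched in a few lines: smaller transforms simply run in larger batches so every data point processes the same 32M elements. (`batch_for` is a hypothetical helper for illustration, not part of the clFFT client.)

```python
# Constant total work across transform sizes: 32M elements per data point.
TOTAL = 32 * 1024 * 1024  # 32M elements

def batch_for(transform_size):
    """Number of back-to-back transforms needed so that
    transform_size * batch == TOTAL (power-of-2 sizes divide evenly)."""
    assert TOTAL % transform_size == 0
    return TOTAL // transform_size

for n in (1024, 65536, 1 << 24):
    print(f"N = {n:>10}: batch = {batch_for(n)}")
```

Keeping total work constant means the y-axis differences in the charts reflect per-size efficiency rather than differing amounts of data moved through the GPU.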
The benchmark system details are:
- cuFFT on OpenSUSE 13.2 Linux64, NVIDIA driver version 346.47 and CUDA 7 Toolkit, running on NVIDIA Tesla® K40, with i5-4690K CPU and 16GB RAM
- clFFT on OpenSUSE 13.1 Linux64, AMD FirePro™ driver version 14.502, running on AMD FirePro™ W9100 Professional Graphics card, with i5-4690K CPU and 16GB RAM
I thank Pradeep Rao for getting the benchmark code developed, and Amir Gholami for updating and developing the scripts and collecting the data.
We encourage our users to download the latest version of the binary or build from source. We thank the community for using and supporting our library, and providing valuable feedback.
Bragadeesh Natarajan is a member of the technical staff at AMD. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.