Accelerating Performance: The ACL clFFT Library

These have been exciting times in the math libraries group here at AMD. We recently released ACL Beta 1 and a new open-source clSparse library. To keep the excitement going, I would like to talk a little about the performance of our clFFT library. clFFT is not new; it has been popular with our users over the past couple of years.

Since many of you are interested in the performance of clFFT, I think you’ll like the information given in this blog. I am providing charts to highlight:

  • Performance improvement in v2.6.1 (released as part of ACL Beta 1) over prior versions
  • Competitive performance of clFFT on AMD GPU against cuFFT on NVIDIA GPU

Performance Improvements Over Previous Versions

One of the improvements is in real transforms. Real-input discrete Fourier transforms have the advantage of requiring approximately half the compute and storage of an equal-size complex transform. On the other hand, the output of a length-n real transform occupies “1 + n/2” complex values, and this storage layout causes alignment and branching issues when computing on the GPU. Despite these factors, the library achieves good efficiency and performance.
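To make the “1 + n/2” figure concrete, here is a minimal sketch in plain Python, not clFFT code. It uses a naive O(n²) DFT (the helper names `dft` and `real_dft_half` are made up for this illustration) to show that the output of a real-input transform is Hermitian-symmetric, so only the first 1 + n/2 complex values need to be stored. It illustrates the storage saving only; a real library also avoids computing the redundant half.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a sequence of numbers."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]

def real_dft_half(x):
    """Keep only the 1 + n/2 non-redundant outputs of a real-input DFT.

    Illustration only: this computes the full transform and truncates it.
    """
    n = len(x)
    return dft(x)[: n // 2 + 1]

signal = [0.5, -1.0, 2.0, 3.5, 0.0, 1.25, -2.0, 4.0]  # n = 8, real input
n = len(signal)
full = dft(signal)
half = real_dft_half(signal)

# Hermitian symmetry: X[n - k] is the complex conjugate of X[k],
# so the second half of the spectrum carries no new information.
symmetric = all(
    abs(full[n - k] - full[k].conjugate()) < 1e-9 for k in range(1, n)
)
print(len(full), len(half), symmetric)
```

For n = 8 the full spectrum has 8 complex values while the stored half has 5, which is the 1 + n/2 count that makes buffer sizes awkward to align on a GPU.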

Figure 1 shows the performance improvement we made in real transforms for power-of-2 sizes. Previous versions of the library used an inefficient algorithm that caused performance to drop drastically for larger sizes. As you can see, this has been fixed. The black arrows indicate the significant uplift from this change. For the largest transform sizes, the speedup is orders of magnitude.

Figure 1: Real transforms with larger sizes see orders of magnitude speed improvements in clFFT 2.6.1.

Performance Compared to the Competition

OK, so we’re faster than we used to be. But how do we stand up to the competition?

Take a look at Figures 2 and 3 to see the competitive performance of the clFFT library vs NVIDIA’s cuFFT library, for both complex and real transforms.

Figure 2 shows performance measured in gigaflops for complex transforms for power-of-2 sizes. As you can see, clFFT performs very well on the smaller, more common sizes, beating NVIDIA’s cuFFT by as much as 1.5x. For the larger sizes, the libraries are comparable. At the largest sizes, clFFT experiences a dip that we plan to address in the future.

Figure 2: Relative performance of clFFT and cuFFT for complex transforms

In Figure 3 you see the performance for real transforms for power-of-2 sizes. Once again the clFFT library is significantly better than its NVIDIA counterpart and peaks at about 4x the performance. For larger sizes, the libraries are again comparable. In the middle ranges where clFFT dips, we have already determined how we can improve performance. We will check in the code as we implement the improvements.

Figure 3: Relative performance of clFFT and cuFFT for real transforms

clFFT is under active development. We test performance regularly, identify areas of improvement, and work to optimize the library continually. Our goal is for the performance of the ACL to improve even more in upcoming versions.

Benchmarking Details

The timings reported here measure the ‘execution step’ at the library API level. The total problem size is kept constant at 32M elements at all transform sizes by varying the batch count. The client executables and performance scripts used for measurement are available in the clFFT source repository on GitHub. Not only is the library code open-source, so are the performance tests.
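As a rough sketch of the setup described above, the batch count for each transform size follows from keeping size × batch constant. The snippet below is not taken from the clFFT performance scripts. It assumes that “32M” means 32 × 2^20 elements, and it converts a timing to gigaflops with the common benchmarking convention of 5·N·log2(N) operations per complex transform (2.5·N·log2(N) for real input). The post does not state which operation count was used, so treat that formula and the function names as assumptions.

```python
import math

# Assumption: "32M elements" is read as 32 * 2^20.
TOTAL_ELEMENTS = 32 * 2**20

def batch_count(transform_size, total_elements=TOTAL_ELEMENTS):
    """Number of transforms per run so that size * batch stays constant."""
    if total_elements % transform_size != 0:
        raise ValueError("transform size must divide the total element count")
    return total_elements // transform_size

def gflops(transform_size, batch, seconds, real_input=False):
    """Estimated gigaflops for a batch of 1D transforms.

    Assumption: uses the conventional 5 * N * log2(N) operation count for a
    complex transform, and half of that for a real-input transform.
    """
    factor = 2.5 if real_input else 5.0
    flops = factor * transform_size * math.log2(transform_size) * batch
    return flops / seconds / 1e9

# Batch counts for a few power-of-2 sizes at a constant total problem size.
for exponent in (4, 10, 16, 22):
    size = 2**exponent
    print(size, batch_count(size))
```

Under these assumptions a 1024-point transform runs with a batch of 32768, and the single largest size that fits is 2^25 with a batch of 1.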

The benchmark system details are:

  • cuFFT on OpenSUSE 13.2 Linux64, NVIDIA driver version 346.47 and CUDA 7 Toolkit, running on NVIDIA Tesla® K40, with i5-4690K CPU and 16GB RAM
  • clFFT on OpenSUSE 13.1 Linux64, AMD FirePro™ driver version 14.502, running on AMD FirePro™ W9100 Professional Graphics card, with i5-4690K CPU and 16GB RAM

I thank Pradeep Rao for getting the benchmark code developed and Amir Gholami for updating & developing the scripts and collecting the data.

We encourage our users to download the latest binary release or build from source. We thank the community for using and supporting our library, and for providing valuable feedback.

Bragadeesh Natarajan is a member of the technical staff at AMD. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

5 Responses

  1. jtrudeau

    Hi Nou! The comparison you describe is possible, but the results would be affected by the OpenCL support in the NVIDIA driver. So we decided to show clFFT running on AMD hardware with our current driver, and cuFFT running on NVIDIA hardware with their driver, as the fairest comparison.

  2. Ben

    Impressive results!

    How does clFFT compare to cuFFT on more consumer-level hardware at similar price points? Say R9 Fury X vs GTX 980 Ti, and R9 Fury vs GTX 980?

    I’d also be very interested in a same-price-point (R9 Fury vs GTX 980) comparison of sgemm (matrix multiply) performance of ACML vs NVBLAS and clBLAS vs cuBLAS (I’m guessing AMD may have more of an edge in the former due to transfer bandwidth).

  3. helena

    I want to know whether I can use the clFFT library on an NVIDIA card, if that is possible. I would like to try your benchmark. OpenCL is supposed to run on any device, and I want to know whether the library fulfills that purpose.