Improve FFT post-processing performance using clFFT Post-callback

In my previous blog, I explained the pre-callback feature of the clFFT library that gives a new and faster way to pre-process input data before the FFT operation is performed. Instead of the conventional way of launching a separate kernel to pre-process the input, you can fold the pre-processing logic into the FFT kernel generated by the clFFT library using pre-callback. In this blog, I will introduce you to a new feature called post-callback that works on the same principle as pre-callback, however, for post-processing data.

Post-callback Introduction

The post-callback feature of clFFT gives you the ability to invoke user-provided OpenCL™ inline functions to post-process output data from within the FFT kernel. The inline OpenCL callback function is passed as a string to the library, and it is then incorporated into the generated FFT kernel. This eliminates the need for an additional kernel launch to carry out the post-processing tasks.

Let’s consider the example of magnitude extraction from complex vectors. In general, the output of FFT is complex numbers. Many times though, the application does not need this full information. Instead, only the amplitude of the resulting complex numbers or only their phase might be of interest. Without the post-callback feature, you may have to implement an additional kernel to perform such a post-processing operation. Instead, by using the new post-callback feature of clFFT, you can fold the post-processing logic into the FFT kernel, just as in the case with pre-callback. This improves performance by avoiding additional kernel launch overhead. Also, because only magnitude is of interest, only around half the amount of memory needs to be written to global memory by the FFT kernel.

So let’s look at both the approaches: a separate kernel to extract the magnitude during post-processing, and the same extraction using the clFFT post-callback. We’ll then look at the performance comparison with and without using post-callback. For brevity, the source code is kept high level. You can find the complete implementation on the clFFT GitHub page.

Note : Post-callback feature will be available on GitHub as part of the ACL 1.0 GA release scheduled for Jan-2016.

Post-processing without post-callback

The workflow without post-callbacks would be as follows:

  1. Declare input and output buffers. Declare two output buffers; one to store the result of FFT and the other to store the calculated magnitude in the post-processing kernel.
//input buffer.
cl_mem inputfftbuffer = clCreateBuffer( … );
 
//output FFT buffer
cl_mem outpufftbuffer = clCreateBuffer( … );
 
//output magnitude buffer
cl_mem magoutfftbuffer = clCreateBuffer( … );
 
…
//Initialize input buffer
…
…
  1. Initialize the standard clFFT parameters and invoke the Bake Plan.
 clfftBakePlan( );

This API prepares the clFFT plan for execution. It generates OpenCL kernels based on the arguments passed and compiles them.

  1. Execute FFT.
 clfftEnqueueTransform( plan_handle, CLFFT_FORWARD, 1, &commandQueue, 0, NULL, NULL,
&inputfftbuffer, &outfftbuffer, … );
  1. Pass the FFT output to the post-processing kernel to extract its magnitude.

The kernel takes complex number of float2 type as input and stores the calculated magnitude in another buffer of float type.

 const char* source = “
__kernel void extractMagnitude(__global float2 *output, __global float *magoutput) \n
{ \n
uint outoffset = get_global_id(0); \n
float magnitude = sqrt(output[outoffset].x * output[outoffset].x + output[outoffset].y * output[outoffset].y); \n
*(magoutput + outoffset) = magnitude; \n
} \n “;

 

  1. Compile the post-processing kernel, set the kernel arguments and launch it.
 cl_program program = clCreateProgramWithSource( context, 1, &source, … );
cl_kernel kernel = clCreateKernel( program, "extractMagnitude", … );
…..
clSetKernelArg( kernel, 0, sizeof( cl_mem ), (void*)&outfftbuffer );
clSetKernelArg( kernel, 1, sizeof( cl_mem ), (void*)&magoutfftbuffer);
status = clEnqueueNDRangeKernel( …, kernel, 1, NULL, … );
  1. Read the output buffer magoutfftbuffer.
 clEnqueueReadBuffer( commandQueue, magoutfftbuffer, CL_TRUE, 0, out_size_of_buffers, host_output, … );

 

Post-processing using clFFT post-callback

Now, let’s take a look at how we can implement the magnitude extraction post-processing logic using clFFT post-callback. The workflow would be as follows:

  1. Declare input and output buffers. Note that here a single output buffer is needed as the post-process result is written directly into it.
 //input buffer.
cl_mem inputfftbuffer = clCreateBuffer( … );
//output FFT buffer
cl_mem outpufftbuffer = clCreateBuffer( … );
…
//Initialize input buffer
…

 

  1. Write the post-callback function as an inline OpenCL function and store it in a string.
 //Post-callback inline function
const char* postcallbackstr = “
float extractMagnitude(__global void *output, uint outoffset, __global void *userdata, float2 fftoutput) \n
{ \n
float magnitude = sqrt(fftoutput.x * fftoutput.x + fftoutput.y * fftoutput.y); \n *((__global float*)output + outoffset) = magnitude; \n
} \n ”;

clFFT expects the user-provided post-callback function to be of a specific prototype. I’ll cover some details on the expected function prototype later in this blog.

  1. Register the post-callback function using clfftSetPlanCallback API.
 clfftSetPlanCallback(plan_handle, "extractMagnitude", postcallbackstr, 0, POSTCALLBACK, NULL, 0);

 

This is an important step in using the post-callback feature. The library uses the arguments passed here, including the callback function string, to stitch the post-callback code into the generated FFT kernel. The arguments for clfftSetPlanCallback include:

  • clFFT plan handle
  • Name of the callback function (the argument “extractMagnitude” in the preceding snippet)
  • Callback function in string form (the argument postcallbackstr in the preceding snippet)
  • Optionally, local memory size if needed by the callback function
  • Type of callback (this is an enum)
  • Supplementary user data, if any, used by the callback function
  • Number of user data buffers

 

  1. Bake the clFFT Plan.
 clfftBakePlan( … );

In this case, clFFT inserts the callback code into the main FFT kernel during bake plan and compiles it. If there are any compilation errors caused by syntax or an incompatible callback function prototype, the failure is reported to the user.

  1. Execute FFT.
 clfftEnqueueTransform( plan_handle, CLFFT_FORWARD, 1, &commandQueue, 0, NULL, NULL,       &inputfftbuffer, &outputfftbuffer, … );
  1. Read the FFT output outfftbuffer. This holds the final result after magnitude extraction.
 clEnqueueReadBuffer( commandQueue, outfftbuffer, CL_TRUE, 0, out_size_of_buffers, host_output,       0, NULL, NULL );

As you can see from these steps, all you have to do is pass the required post-processing callback function wrapped in a string to the library. The task of invoking the callback is handled inside the library.

Post-callback function prototype

clFFT expects the user-provided post-callback function to be of a specific prototype depending on the type of transform(real/complex) and whether local memory is used. As an example, consider the post-callback prototype for a Real to Complex FFT as shown in the following:

Post-callback function without local memory usage

void <post-callback func> (__global void *output, uint outoffset, __global void *userdata, float2 fftoutput)

 

Post-callback function with local memory usage

void <post-callback func> (__global void *output, uint outoffset, __global void *userdata, float2 fftoutput, __local void *localmem)

 

The output parameter is the base pointer of the output buffer. The outoffset parameter is the index of the current element in the output buffer. The userdata pointer is useful for passing any supplementary data to the callback function (e.g. convolution filter data or any scalar value). The userdata can be of any custom data type/structure, in which case, the user has to declare the custom data type and include it along with the callback function string. The fftoutput parameter holds the result computed by clFFT for the element corresponding to outoffset argument. The localmem parameter represents a pointer to the local memory. This memory is allocated by the library based on the size specified by the user in the clfftSetPlanCallback API call and is subject to the availability of local memory.

The complete list of compatible pre-callback and post-callback function prototypes is available at the clFFT Library API documentation page. You must write the callback function in adherence to the expected prototype. clFFT may compute a given FFT with multiple kernels. However, it invokes the post-callback function only from the last phase kernel and also only once for each point in the output.

Performance Comparison

The following chart shows the speedup achieved using clFFT post-callbacks.

clFFT_post_callback_blog_img

The chart shows the speedup for a separate post-processing kernel compared with the post-callback version. We did Real to Complex transforms using 32 million input elements. We observed a typical speedup of 1.8x using the post-callback approach. We ran the test on Ubuntu 14.04 LTS Linux64, with AMD FirePro™ driver version 14.502, running on AMD FirePro™ W9100 Professional Graphics card, with an AMD A10-7850K APU and 16GB RAM.

Conclusion

Callback feature of clFFT provides an easy and faster way to pre/post process data. Many applications that use FFT have the need to pre-process data before performing FFT and/or post-process the FFT output. The callback feature saves additional steps in larger algorithm by invoking the user-provided callback functions from within the FFT kernel. This avoids the overhead of a separate kernel launch and also better memory bandwidth.

Pradeep Rao is Senior Member of Technical Staff in the Developer Solutions Team at AMD. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

OpenCL is a trademark of Apple Inc. used by permission by Khronos.

3 Responses

    • Pradeep

      The latest AMD graphics driver can be downloaded from here http://support.amd.com/en-us/download.
      Looks like you have Radeon R5 series card based on the notebook model you have mentioned. Use the selection filters in the above link to download the appropriate driver based on the GPU and operating system.