2D Complex to Complex FFT for AMD GPUs

This is the second article in a series of two (part 1), related to the implementation of Fast Fourier Transform (FFT) on an AMD GPU using OpenCL™. In the first article, we developed a 1D complex-to-complex FFT implementation for AMD GPUs where we explored efficient ways of computing twiddle factors, introduced an indexing scheme for coalesced memory accesses and a multistage framework was used to compute large size FFTs. In this part, we continue to explore further optimization strategies of computing 1D complex-to-complex FFT involving the use of local memory and in-place computations. It is also shown here how to compute double precision FFTs in contrast to the single precision version developed thus far. We next proceed to develop a 2D Complex to Complex FFT implementation that uses the 1D FFT as a building block. This also involves development of an improved indexing scheme to compute FFT of multiple signals in batch mode. Performance results for the GPU computed 2D FFTs are presented at the end of this article.

FFT Kernels using Local Memory

In the first part of this article, we started with a basic FFT kernel that loads two complex numbers (float2 vectors) from the global memory to the private memory (Registers), performs the butterfly computations and writes the result back to the global memory. The first optimization strategy was to reduce the number of kernel launches and global memory accesses. This task was accomplished by performing multiple FFT stages in a single kernel. We saw that we reach a point of diminished returns beyond an 8-point FFT kernel where we were computing three stages in one kernel. No performance gain could be achieved when we tried to compute 4 or more FFT stages in one kernel; as the expected performance advantage was nullified by reduced device occupancy (reduced concurrent threads), as discussed in the first part. One possible method that can potentially reduce global memory accesses (or compute more FFT stages per kernel) without additional resource usage can be the use of local memory. We explore how to make use of this mechanism for our computation of 1D complex-to-complex FFTs below.

Why to use Local Memory?

Local memory or Local Data Share (LDS) is a high-bandwidth memory used for data-sharing among work-items within a work-group. ATI Radeon™ HD 5000 series GPUs have 32 KB of local memory on each compute unit. Figure 1 shows the OpenCL™ memory hierarchy for GPUs [1].

OpenCL™ Optimization Case Study Fast Fourier Transform - Part II - Memory hierarchy of AMD GPU

Figure 1: Memory hierarchy of AMD GPUs

Local memory offers a bandwidth of more than 2 TB/s which is approximately 14x higher than the global memory [2]. Another advantage of LDS is that local memory does not require coalescing; once the data is loaded into local memory, it can be accessed in any pattern without performance degradation. However, LDS only allows sharing data within a work-group and not across the borders (among different work-groups). Furthermore, in order to fully utilize the immense potential of LDS we have to have a flexible control over the data access pattern to avoid bank conflicts. In our case, we used LDS to reduce accesses to global memory by storing the output of 8-point FFT in local memory and then performing next three stages without returning to global memory. Hence, we now return to global memory after 6 stages instead of 3 in the previous case. In the next section we elaborate on the use of local memory and the required data access pattern.

FFT Kernel using Local Memory

In order to incorporate the use of local memory in our implementation, we design an FFT kernel that performs six FFT stages without returning to global memory. In the approach mentioned in part I, the data access pattern is such that in the 8-point FFT kernel, each thread handles 8 input data points to compute 3 FFT stages. Similarly 16, 32 and 64-point FFT kernels require 16, 32 and 64 input data points per thread, respectively. One obvious drawback of such an approach is that the resource usage (registers and local memory) per thread will increase and therefore reduce the number of concurrent threads that can be launched on the device. This can potentially neutralize any performance advantage that can be gained by using local memory. We can remedy this problem by devising a new data access pattern that would allow us to perform 6 FFT stages without necessarily requiring 64 input points per thread. This kernel would employ N/8 threads to compute an N-point FFT, where each thread will work on 8 input data points. By doing this we are not only reducing global memory accesses but also the resource usage, thus allowing for more concurrency. The basic scheme of this kernel is outlined below:

  1. Read data from global memory into the registers.
  2. Perform 3 FFT stages on this data.
  3. Store these intermediate results in local memory.
  4. Read data from local memory for subsequent stages.
  5. Perform next 3 FFT stages.
  6. Write the results back in global memory.

The kernel code for FFT implementation using local memory is listed below:

#define Local_Stride 1
#define BLOCK_SIZE 16
__kernel void FFT_64(__global float2* dataout,  __global float2* datain,  int N, int Ns) 
        __local float sharedx[128*8];
        __local float sharedy[128*8];
	int in_stride = N / (128*8) ;
	int gId = get_global_id(0);
	int Local_Size = get_local_size(0);

	float2 in0, in1, in2, in3, in4, in5, in6, in7;

	int internal_blocksize = 8;
	int block_id = gId/BLOCK_SIZE;
	int internal_block_id = block_id%internal_blocksize;
	int tidx =gId&(BLOCK_SIZE-1);

                int new_block_id = map_id(block_id, in_stride,  internal_blocksize);
		int new_gId = new_block_id*BLOCK_SIZE+tidx;
		in0 = datain[(0*N/8)+ new_gId];
		in1 = datain[(1*N/8)+ new_gId];
		in2 = datain[(2*N/8)+ new_gId];
		in3 = datain[(3*N/8)+ new_gId];
		in4 = datain[(4*N/8)+ new_gId];
		in5 = datain[(5*N/8)+ new_gId];
		in6 = datain[(6*N/8)+ new_gId];
		in7 = datain[(7*N/8)+ new_gId];

		float angle0 = -2*PI*( new_gId%Ns)/(Ns*8);

		twidle_factor(1, angle0, in1);
		twidle_factor(2, angle0, in2);
		twidle_factor(3, angle0, in3);
		twidle_factor(4, angle0, in4);
		twidle_factor(5, angle0, in5);
		twidle_factor(6, angle0, in6);
		twidle_factor(7, angle0, in7);

		FFT8(in0, in1, in2, in3, in4, in5, in6, in7);

		int Idout = internal_block_id*128+(tidx)*8;

		sharedx[Idout+0] = in0.x;
		sharedy[Idout+0] = in0.y;
		sharedx[Idout+1] = in1.x;
		sharedy[Idout+1] = in1.y;
		sharedx[Idout+2] = in2.x;
		sharedy[Idout+2] = in2.y;
		sharedx[Idout+3] = in3.x;
		sharedy[Idout+3] = in3.y;
		sharedx[Idout+4] = in4.x;
		sharedy[Idout+4] = in4.y;
		sharedx[Idout+5] = in5.x;
		sharedy[Idout+5] = in5.y;
		sharedx[Idout+6] = in6.x;
		sharedy[Idout+6] = in6.y;
		sharedx[Idout+7] = in7.x;
		sharedy[Idout+7] = in7.y;

                Ns *= 8;
                int new_block_id = map_id(block_id, Local_Stride,  internal_blocksize);
	        int new_gId = new_block_id*BLOCK_SIZE+tidx;
		int Idin = (internal_block_id*16+tidx);
		in0.x = sharedx[0*Local_Size+Idin];
		in0.y = sharedy[0*Local_Size+Idin];
		in1.x = sharedx[1*Local_Size+Idin];
		in1.y = sharedy[1*Local_Size+Idin];
		in2.x = sharedx[2*Local_Size+Idin];
		in2.y = sharedy[2*Local_Size+Idin];
		in3.x = sharedx[3*Local_Size+Idin];
		in3.y = sharedy[3*Local_Size+Idin];
		in4.x = sharedx[4*Local_Size+Idin];
		in4.y = sharedy[4*Local_Size+Idin];
		in5.x = sharedx[5*Local_Size+Idin];
		in5.y = sharedy[5*Local_Size+Idin];
		in6.x = sharedx[6*Local_Size+Idin];
		in6.y = sharedy[6*Local_Size+Idin];
		in7.x = sharedx[7*Local_Size+Idin];
		in7.y = sharedy[7*Local_Size+Idin];

		float angle0 = -2*PI*(new_gId%Ns)/(Ns*8);

		twidle_factor(1, angle0, in1);
		twidle_factor(2, angle0, in2);
		twidle_factor(3, angle0, in3);
		twidle_factor(4, angle0, in4);
		twidle_factor(5, angle0, in5);
		twidle_factor(6, angle0, in6);
		twidle_factor(7, angle0, in7);

		FFT8(in0, in1, in2, in3, in4, in5, in6, in7);

		int Idout = get_output_Id(new_gId, Ns, 8); 

		dataout[(0*Ns)+Idout] = in0;
		dataout[(1*Ns)+Idout] = in1;
		dataout[(2*Ns)+Idout] = in2;
		dataout[(3*Ns)+Idout] = in3;
		dataout[(4*Ns)+Idout] = in4;
		dataout[(5*Ns)+Idout] = in5;
		dataout[(6*Ns)+Idout] = in6;
		dataout[(7*Ns)+Idout] = in7;

The above shown listing begins with a function map_id() that computes the relative memory offsets within each workgroup. The 64-point FFT kernel comprises of two stages of 8-point FFT thus completing 6 stages required for a 64-point FFT. In the first stage, after computing the relative indices 8-point data is loaded from global memory, twiddle factors are computed using the twiddle_factor() function introduced previously in Part-I of the article and then computes an 8-point FFT. The results are then stored in local memory variables sharedx and sharedy variables that had been created using the _local qualifier earlier on. Note the placement of a thread synchronization barrier at this point using the barrier(CLK_LOCAL_MEM_FENCE ) that ensures that all threads within a workgroup reach this point before proceeding forward with the second stage. The next code block in the kernel, after the barrier, then computes another stage of 8-point FFT that first loads data from local memory and after computation stores the data back in global memory.

Performance Comparison

Below we present the throughput comparison for the LDS based kernel vs. the original registers only version earlier developed in Part-I of the article computed on an Intel Core i7 930 CPU and an ATI Radeon™ HD 5870 GPU running 64-bit Windows 7 operating system and AMD APP SDK v2.4, the same configuration as before. Interestingly, the performance of both the kernels is about the same. The most probable reason for this is that bandwidth for the register memory is 13056 GB/sec in contrast to 2176 GB/sec for local memory which seems to be neutralizing any performance gain from the use of later [4]. It seems that for a kernel with small enough register usage, one can get the best performance from register memory only. Moreover, this also suggests that the execution time is dominated by overheads. The FFT is O(N * log(N)) arithmetic operations so the ratio of operations to memory access is O(log(N)) (arithmetic intensity). Therefore, as the problem size scales, it is not possible to amortize out the kernel execution overhead unless N is very large. The overhead costs are always dominant so many optimizations that are effective will have an unnoticeable effect.OpenCL™ Optimization Case Study Fast Fourier Transform - Part II - Performance Comparison of FFT Kernels with and without Local Memory
Figure 2: Performance Comparison of FFT Kernels with and without Local Memory

In-place vs. Out-of-place Computations

There are two computation styles based on memory allocation: in-place and out-of-place computations. In-place computation means that both input and output data share the same memory space. Whereas, in out-of-place computation input and output data are located at different memory location. In-place algorithms can utilize memory resources more efficiently for obvious reasons. So far all the FFT kernels presented perform out-of-place computations. For example the following listing shows the out-of-place computation of a 2-point FFT kernel:

__kernel void FFT_2( __global float2* datain, __global float2* dataout) 
	int gId = get_global_id(0);
	int gSize = N/2;
	float2 in0, in1;

        in0 = datain[(0*gSize)+gId];
        in1 = datain[(1*gSize)+gId];

if (Ns!=1) 
	float angle = -2*PI*(gId)/(N);
	twidle_factor(1, angle, in1);

     FFT2(in0, in1);

     int Idout = (gId/Ns)*Ns*K_W+(gId%Ns);

     dataout[(0*Ns)+Idout] = in0;
     dataout[(1*Ns)+Idout] = in1;

Here we can see that the data is read in and out from separate locations explicitly reference by datain and dataout pointers. For an in-place computation we would only utilize one pointer i.e. data as shown in the following in-place computation of the same 2-point FFT kernel.

__kernel void FFT_2( __global float2* data ) 

   	int gId = get_global_id(0);
	int tps = N/k_w;
	int input_offset = (gId / tps) * N;
	gId = gId % tps;
	int Idin = gId + input_offset;

	float2 in0, in1;	

        in0 = data[(0*N/K_W)+Idin];
        in1 = data[(1*N/K_W)+Idin];

	if (Ns!=1) 
		float angle = -2*PI*(Idin%Ns)/(Ns*K_W);
		twidle_factor(1, angle, in1);	

    FFT4(in0, in1);

    int Idout = get_output_id(Idin, Ns, K_W)+input_offset;

    data[(0*Ns)+Idout] = in0;
    data[(1*Ns)+Idout] = in1;


The same can be extended to 4 and 8-point FFT kernels. It is important to note here that in this simplistic in-place the FFT kernels do not write the output data in exactly same locations where they get the input from. Therefore, as long as the input size is such that all kernels get dispatched simultaneously, no data overwrites will occur and we should get correct results. However, if the input size exceeds this limit then the threads dispatched earlier would overwrite input array locations from where input has not been read yet, in which case some kind of global synchronization would be required. In the context of this article where we are going to utilize the 1D FFT as a building block for computation of 2D FFT of images as large as 2048 by 2048 we do not encounter this bottleneck.

Double Precision Computations

We now look at another aspect of computing the FFT on the GPU–double precision (64-bit) computations. The most fundamental programmable computational unit is a processing element (PE) in an OpenCL™-capable GPU compute device that can perform integer or single-precision floating point operations. Different GPU compute devices have different number of these compute units. For example, the ATI Radeon™ HD 5870 GPU has 20 compute units, each with 16 stream cores and each stream core contains five processing elements which yield 1600 physical processing elements. Double-precision floating point operations are processed by connecting two or four of the processing elements (excluding the transcendental PE) to perform a single double-precision operations [5]. Therefore, the double precision throughput of a GPU compute device would be considerable smaller then single precision throughput. For example, the ATI Radeon™ HD 5870 GPU has only a 544 GFLOPS peak double-precision throughput as compared to the 2.72 TeraFLOPS peak single-precision throughput [6]. Therefore, double precision computations should be used only when necessary.

Moreover, double precision arithmetic is not natively supported by OpenCL™ 1.1. To use data types double and doublen (doublen is a vector data type where each element of the vector is a double precision floating point value ) we need to use an optional extension # pragma cl_amd_fp64 : enable. Here also only basic arithmetic operations can be performed in double precision and buit-in functions like sine and cosine (required for twiddle factor calculations) can only be preformed in single precision. Therefore, the only changes required would be the pragma directive mentioned above and use of the double and double2 datatypes instead of the float and float2 datatypes respetively.

In Figure 3 below we show a performance comparison in terms of throughput for the single and double precision FFTs. We can clearly see that for input sizes greater than 32K we start to see roughly a 5x throughput drop in double precision computations.

OpenCL™ Optimization Case Study Fast Fourier Transform - Part II - Performance Comparison of Single-Precision and Double-Precision FFT Kernels

Figure 3: Performance Comparison of Single-Precision and Double-Precision FFT Kernels

The 2D FFT Algorithm

Now that we have looked at the 1D FFT algorithm and various optimization strategies to compute it efficiently on the GPU we proceed to develop and algorithm to compute 2D complex to complex FFT that uses the 1D FFT as the building block. The prime application of 2D FFT is in Image Processing where it is heavily used for image filtering and enhancements applications [7].

2D FFT can be implemented by computing first 1D FFT of the input matrix rows followed by 1D FFT of columns of the input matrix. 2D input matrix is generally stored in row major order in computer memory. A matrix is converted to row major order by placing each successive row next to the previous one to get a 1D array, as shown in Figure 4.

OpenCL™ Optimization Case Study Fast Fourier Transform - Part II - Converting a 4x4 Matrix to Row-Major Order
Figure 4: Converting a 4×4 Matrix to Row-Major Order

Particularly for GPUs, computing FFT on columns is very expensive due to large input strides in accessing data for calculations. A modified technique that utilizes a matrix transpose stage after 1D FFT of the row data is introduced here. The 2D FFT can be calculated in following steps.

  1. Perform 1D FFT on rows of input matrix
  2. Transpose the resulting matrix
  3. Perform 1D FFT on rows of transposed matrix
  4. Transpose the matrix again to restore natural order

This method specifically requires performing multiple 1D FFTs simultaneously. This style of computation is called batch mode execution.

Batch-Mode 1D FFT Implementation

Batch mode execution is the key to perform multi-dimensional FFTs. In this mode we can perform FFTs of multiple signals, of equal length, simultaneously. Below we describe how to modify our existing implementation to incorporate batch-mode FFTs.

Tuning the Indexing Scheme

The indexing scheme for batch mode needs careful investigation, as it is more complex than the case of a simple 1D FFT. The 1D FFT kernel, previously described in Part-I of this article, computes the FFT of a single N-point input signal (where N is integral power of 2). In this case, the number of FFT stages and total number of threads required correspond to the signal length N. However, in batch mode we have batch_size number of input signals having equal length N. The signal length N determines the required number of FFT stages and the product of batch size and N determines the number of threads required for the whole batch.

The parameters defined for batch-mode indexing are listed below:

– N = Signal length

– Ns = Number of completed FFT stages

– batch_size = Number of input signals

– gId = Global ID of a work item

– gSize = Global work size (total number of work items)

– Idin = Input index of each work item

– input_offset = The offset of a signal in the batch from first element of first signal

– tps = Number of threads required to compute FFT of individual signal

– Idout = Output index of each work item

– K_W = Kernel Size

Here we shall explain how the batch-mode indexing scheme works using 8-point FFT kernel from Part-I. Let us consider an example where we have a batch of 2 signals each of length (N) 8. The signal length N determines the number of FFT stages required for each signal which is 3 for this example. The number of input signals is 2 (batch_size = 2). Total number of work items (threads) required for the whole batch is determined by the product of N and batch_size.

OpenCL™ Optimization Case Study Fast Fourier Transform - Part II - batch-mode indexing scheme

Each signal requires exactly one thread for its FFT; so tps = 1. The input sequence required for batch-mode FFT with a batch of 2 signals of length of 8 is shown in Figure 5.


OpenCL™ Optimization Case Study Fast Fourier Transform - Part II - Input Sequence for batch_size = 2 and N = 8

Figure 5: Input Sequence for batch_size = 2 and N = 8

The input indexing scheme for batch-mode required only a slight modification. We just add input offset to the input index for simple 1-D FFT, discussed in Part-I. The code segment for batch-mode input indexing scheme is listed below:

#define K_W 8
    int gId = get_global_id(0);
    int tps = N/k_w;
    int input_offset = (gId / tps) * N;
    gId = gId % tps;
    int Idin = gId + input_offset;

    float2 vf0, vf1, vf2, vf3, vf4, vf5, vf6, vf7;	

    vf0 = datain[(0*N/K_W)+Idin];
    vf1 = datain[(1*N/K_W)+Idin];
    vf2 = datain[(2*N/K_W)+Idin];
    vf3 = datain[(3*N/K_W)+Idin];
    vf4 = datain[(4*N/K_W)+Idin];
    vf5 = datain[(5*N/K_W)+Idin];
    vf6 = datain[(6*N/K_W)+Idin];
    vf7 = datain[(7*N/K_W)+Idin];

Just like input indexing, the output index for batch mode is calculated by adding input offset to the output index for 1-D FFT. Here we define a function get_output_id which returns the output index for 1-D FFT. Final output index for batch mode is obtained by adding input offset to this value. The output pattern for batch-mode FFT with a batch of 2 signals of length of 8 is shown in Figure 6.

int get_output_Id(int Idin, int Ns, int R)
			return (Idin/Ns)*Ns*R+( Idin %Ns);

OpenCL™ Optimization Case Study Fast Fourier Transform - Part II - Output Sequence for batch_size = 2 and N = 8

Figure 6: Output Sequence for batch_size = 2 and N = 8

The code segment for output indexing is listed below:

int Idout = get_output_id(Idin, Ns, K_W)+input_offset;
dataout[(0*Ns)+Idout] = vf0;
dataout[(1*Ns)+Idout] = vf1;
dataout[(2*Ns)+Idout] = vf2;
dataout[(3*Ns)+Idout] = vf3;
dataout[(4*Ns)+Idout] = vf4;
dataout[(5*Ns)+Idout] = vf5;
dataout[(6*Ns)+Idout] = vf6;
dataout[(7*Ns)+Idout] = vf7;

Matrix Transpose

Transpose of a matrix A is another matrix At in which each row has been interchanged with the respective column. A matrix transpose, thus swaps the elements along each axis of the matrix as depicted in Figure 7 [3]. Mathematically, transpose of a matrix Ai,j is defined by equation below, where i and j are the indices along X and Y axis respectively.

OpenCL™ Optimization Case Study Fast Fourier Transform - Part II - 4x4 Matrix Transpose Example

OpenCL™ Optimization Case Study Fast Fourier Transform - Part II - 4x4 Matrix Transpose Example

Figure 7: 4×4 Matrix Transpose Example

We start with writing a simple transpose kernel where each thread reads one element of the input matrix from global memory and writes back the same element at its transposed index in the global memory. Figure 7 shows a 4×4 matrix and its transpose.

The result of applying simple matrix transpose kernel to a 4×4 matrix is shown in Figure 8 [3]. All threads read consecutive elements from global memory, so reading from global memory is coalesced. Coalesced loads are the most efficient way of accessing global memory because the data is loaded as a single chunk. All global memory access should be coalesced for optimal performance. Each thread writes out the result in a non-coalesced fashion, as the threads are not writing consecutive elements. Each non-coalesced access to global memory is serviced individually and multiple accesses are serialized causing long latency [2].

OpenCL™ Optimization Case Study Fast Fourier Transform - Part II - Applying Simple Matrix Transpose Kernel to 4x4 Matrix

Figure 8: Applying Simple Matrix Transpose Kernel to 4×4 Matrix

The simple transpose kernel is inefficient due to non-coalesced memory accesses and needs some modifications to coalesce all memory accesses. This can be achieved using the local memory available on GPU.

Use of Local Memory for Matrix Transpose

We have already highlighted the importance of using local memory earlier. Data in LDS can be accessed in any pattern without performance penalty if there are no bank conflicts [2]. Bank conflicts occur when two or more threads try to read the same bank from LDS simultaneously. Using LDS in transpose kernel can improve the performance by a huge factor because it allows us to coalesce all the global memory accesses.

The improved transpose kernel reads the input data from global memory in a coalesced fashion and writes it in the local memory. Figure 9 shows coalesced data transfer from global to local memory.

Figure 9: Coalesced Data Transfer from Global to Local Memory

Once the data is loaded in local memory each thread can access the transposed data element and write it out to the global memory in a coalesced fashion. Figure 10 shows how consecutive threads write the transposed elements to the global memory.

OpenCL™ Optimization Case Study Fast Fourier Transform - Part II - Writing Transposed Elements to Global Memory

Figure 10: Writing Transposed Elements to Global Memory

Optimized Matrix Transpose Kernel

We perform matrix transpose by using a tile size of 16 x 16, as this is the maximum allowable workgroup size on Evergreen series of AMD GPUs. The use of local memory allows us to access data in an un-coalesced fashion but we at all cost have to avoid bank conflicts. A bank conflicts occurs when two or more threads try to read different pieces of data from the same bank simultaneously. In such a case, accesses to that bank are serialized resulting in performance degradation [2]. For a 16 x 16 matrix transpose, we need to access data with a stride of 16. This will cause bank conflicts as multiple threads will try to access same bank for different data elements.

In ATI RadeonTM HD 58xx series GPUs, the thread execution takes place in 64 element-wide wave-fronts. A quarter wave-front of 16 threads move in lock-steps. All even index threads of a quarter wave-front will try to access bank 0 and the odd index threads will access bank 16 simultaneously. This will result in 8 way serialization on each of these banks. To get around this problem, we will use the handy tool of padding the local memory. We will pad the local memory array by one element after every 16 elements. This will result in a dummy element after every 16 data elements. Now every thread will access a different bank in the local memory, thus avoiding the bank conflicts. The optimized transpose kernel with local memory padding is listed below:

#define BLOCK_DIM 16

__kernel void 	Transpose(
			 __global float2* datain, 
			 __global float2* dataout, 
			 int width, int height)

// width = N (signal length) 
// height = batch_size (number of signals in a batch)

// read the matrix tile into shared memory

__local float2 block[BLOCK_DIM * (BLOCK_DIM + 1)] 
unsigned int xIndex = get_global_id(0);
unsigned int yIndex = get_global_id(1);

	if((xIndex < width) && (yIndex < height))
	unsigned int index_in = yIndex * width + xIndex;
        int Idin = get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0);
        block[Idin]=  datain[index_in];


// write the transposed matrix tile to global memory

xIndex = get_group_id(1) * BLOCK_DIM + get_local_id(0);
yIndex = get_group_id(0) * BLOCK_DIM + get_local_id(1);

	if((xIndex < height) && (yIndex < width))
		unsigned int index_out = yIndex * height + xIndex;
		int Idout = get_local_id(0)*(BLOCK_DIM+1)+get_local_id(1);
                dataout[index_out] = block[Idout];


Figure 12 below shows the performance comparison of the matrix transpose kernel with and without the use of local memory. We can clearly see that the local memory optimization works better for all input sizes greater than 128×128.

OpenCL™ Optimization Case Study Fast Fourier Transform - Part II - Comparison of Matrix Transpose Kernels with and without Local Memory

Figure 11: Comparison of Matrix Transpose Kernels with and without Local Memory

2D FFT Performance Results

So far we have described the batch mode 1D FFT and matrix transpose kernels, which are all we need to compute a full 2D FFT. Another pass of a batch mode 1D FFT on the transposed results gives us the final 2D FFT results. We can now look at the results of 2D FFT. For ease of comparison we chose to compare our results to performance of fft2() function in MATLAB which under-the-hood utilizes the FFTW which is known to be a highly optimized library for computing FFTs [8]. The computations were performed on two CPU platforms one an Intel Core i7 930 CPU (2.8 GHz) with 8GB DDR3 RAM and secondly on Intel Core 2 Quad Q8200 (2.33GHz) based machine 4GB DDR2 RAM. Both machines were running 64-bit Windows 7 Operating System. MATLAB Release 2009b was used for CPU based fft2() benches. GPU results were obtained using an ATI Radeon™ HD 5870 GPU and AMD APP SDK v2.5. The results are summarized in the following table:

Device ATI Radeon™ HD 5870 CPU 1 CPU 2
Core-i7 930 Core 2 Quad
Input Size FFT Time(ms) Total Time(ms) FFT Time(ms) FFT Time(ms)
64×64 0.05422 0.51391 0.085 0.1373
128×128 0.08593 0.55885 0.243 0.5045
256×256 0.16417 0.62105 0.965 1.8841
512×512 0.54508 2.88082 4.288 10.647
1024×1024 3.51204 7.34336 11.395 53.656
2048×2048 25.9263 45.8435 72.13 224.24

The results clearly show that 2D FFT implemented on an ATI Radeon™ HD 5870 GPU using OpenCL™ outperforms CPU based implementation. The performance advantage increases with increased data sizes due to resulting more data level parallelism which can be exploited on the GPU. We can also see that the overhead of memory transfer to and from the GPU is significant and limits performance gains. This is because the GPU is connected to the motherboard through a PCIe bus. The PCIe bus has a memory bandwidth of only 5.2 GB/s, therefore, data transfer between CPU and GPU is very expensive. However, the latest processor architectures promise to overcome this limitation e.g. the AMD Fusion family of APUs that have a CPU and GPU on a single-die processor [9].


[1] A. Munshi, “OpenCL™ Parallel Computing on the GPU and CPU”, Proceeding of SIGGRAPH 2008.

[2] ATI Stream Computing OpenCL™ Programming Guide, Ch-4 OpenCL Performance and Optimization, June 2010.

[3] D. Gohara, “OpenCL Tutorial Episode 4-Memory Access and Layout”, macresearch.org/opencl viewed 10 November, 2010.

[4] ATI Stream Computing OpenCL™ Programming Guide, Appendix-D, June 2010.

[5] ATI Stream Computing OpenCL™ Programming Guide, Ch-1 OpenCL Architecture and the ATI Stream Computing System, June 2010.

[6] Heterogeneous Computing OpenCL™ and the Radeon HD 5870 Architecture PDF (201003.pdf)

[7] R. Gonzlez and R. Woods, “Digital Image Processing”, 2nd Edition, Chapter 4, Prentice Hall 2002.

[8] FFTW, http://www.fftw.org, viewed on 10 November, 2010.

[9] AMD family of APUs, viewed on 10 November, 2010

© 2011 Advanced Micro Devices, Inc. All rights reserved.

AMD the AMD arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL is a trademark of Apple Inc., used with permission by Khronos. Windows is a registered trademark of Microsoft Corporation in the United States and/or other countries. All other names used in this document are for informational purposes only and may be trademarks of their respective owners.