____

The AMD Compute Library (ACL) 1.0 GA (General Availability) includes some new features and improvements over beta2 (for more information, click here). The ACL clFFT library includes the ability to perform in-place rectangular transposes of data. The in-place rectangular transpose reduces the memory footprint of large FFT computations and extends the range of the data sizes for which the FFT can be calculated within the memory constraints of the hardware. In addition to continuing the existing support for in-place square transposes, the library now extends support to rectangular matrices having dimensions in the ratio 1:2.

In this blog, we will discuss the need for transposes in FFT computations and then describe the in-place rectangular transpose algorithm.

## Transposes in calculating large FFTs

The clFFT library uses a variation of the Cooley-Tukey algorithm to compute FFTs. To compute an FFT of size N, where N can be expressed as the product of N_{1} and N_{2}, the Cooley-Tukey algorithm re-expresses the problem as two separate FFTs of sizes N_{1} and N_{2 }recursively to reduce the computational complexity from O(N^{2}) to O(NlogN).

This is accomplished by the following four-step algorithm:

- Compute N
_{1}FFTs of length N_{2}along the columns of the input array. - Multiply each element of the array by a twiddle factor of format ω
_{N}^{ij}, where i and j are the coordinates of the element in the array. - Transpose the resulting data array.
- Compute N
_{2}FFTs of length N_{1}along the columns of the array.

To improve the coalescing of memory loads from the global memory, the FFTs that are performed along the columns in the four-step algorithm are converted into row FFTs by transposing the data. This introduces two additional transposes and results in a six-step algorithm. Because three transposes are required, the transpose operation plays an integral part in the FFT computation and performance. In-place transposes reduce the memory foot-print of the algorithm by avoiding the need for extra global memory.

## Characteristics of the rectangular in-place transpose

Performing in-place transpose of matrices, other than squares, is a non-trivial problem because it involves a series of swapping operations in which the source and destination data elements do not replace each other as in the case of a square transpose. To understand the problem better, consider a 2 x 4 matrix.

The data is assumed to be stored in the row major format. The original and transposed data stored in the computer memory are shown in the following table:

From the table it is clear that, to perform in-place transposition of data, two cycles of permutations of length three are required:

(1, 2, 4), (3, 6, 5)

To achieve in-place transposition, the number of permutation cycles to be performed on the data depends on the size of the matrix. For example, a 3 x 6 matrix requires a single cycle of permutation of length 16, while a 64 x 128 matrix requires 640 cycles in total to complete the in-place transpose.

With the exception of the first and the last elements in the original array, which do not change their positions in the transposed array, there is a mathematical formula to calculate the positions of the other elements in the transposed array. For an M x N matrix the permutation mapping is given by:

where k represents the position of the element in the original array and P(k) represents its position in the transposed array.

## Performing 1:2 in-place rectangular transpose in GPU

The amount of parallelism that can be exploited by the GPU in the in-place transpose algorithm is equal to the number of permutation cycles. The GPUs, which rely on coalescing of data while accessing the global memory, have a problem in completely utilizing the memory bandwidth because each permutation cycle acts on unique data elements that may not occupy continuous memory location. To overcome this problem, clFFT performs two in-place square transposes within the rectangular matrix and then permutes the data to their final positions.

The algorithm can be better explained with an example. Consider a 4 x 8 matrix that needs to be transposed.

_{} Step 1, Square transpose:

The 4 x 8 input rectangular matrix consists of two square matrices of dimension 4 x 4 on which the in-place square transpose is performed. During the square transpose, the elements in the source and destination locations replace each other and the swapping of the elements proceeds independently, resulting in a parallel algorithm. Further, to improve coalescing of data, the matrix is split into tiles with each work group working on a tile of data. Both the source and the destination tiles are loaded into the local memory (LDS) and the transposing is done in local memory before writing the data back to the global memory. After this step, the data in the input array is as follows:

### Step 2, Permute:

The N consecutive data elements (rows) in the preceding diagram maintain their locations in the final transpose of the data. Thus the permutation, which is the second step, can be performed on blocks of data of size N instead of element by element. When a work group reads a block of data from the global memory, the GPU coalesces memory accesses into fewer loads, thereby improving the performance. The matrix can be expressed in the vector format as follows:

where represents the data block of size N, representing data from x up to y, which occupy contiguous memory locations in the system memory.

Following are the permutation cycles required to complete the transpose:

- (1, 4, 2)
- (3, 5, 6)

The data stored in the memory, before and after the permutation cycles are applied, are as shown in the following table with different colors representing different permutation cycles:

As you can see, each permutation cycle can be performed independently as there is no overlap in the memory locations being swapped by two different permutation cycles. Hence, multiple work items can be made to work on different permutation cycles to swap the data independently. The size of the temporary memory required for this algorithm is equal to the size of the data block ( ) being swapped. The local memory (LDS) of the GPU is used as a temporary scratch buffer to perform the swaps. If the size of the data block exceeds the available LDS memory, then the data blocks are divided between multiple work groups.

The number of permutation cycles required to complete the transpose and the swapping order are calculated by the CPU as a preprocessing step and this data is placed in a constant table of the kernel generated by the non-square kernel generator of the clFFT library.

## Enabling this feature in library

To enable the feature of in-place transpose, the environment variable CLFFT_REQUEST_LIB_NOMEMALLOC must be set. Setting this variable performs the in-place transforms by decomposing the inputs into square or rectangular matrices of ratio 1:2. Note that the in-place transpose performance can be lower compared to that of the out-of-place transpose because the in-place algorithm requires two steps: square transpose and permutation, which are performed on the global memory. This doubles global memory access and hence reduces the performance, but without allocating extra memory.

## Conclusion

The in-place transpose of rectangular matrix eliminates the need for large temporary global memory allocation in the clFFT library. This is helpful when the memory footprint of the algorithm must be kept as minimal as possible. This library feature is especially beneficial for applications that do not prefer significant global memory allocation by the library. Currently, this works for 1D transforms of sizes that are powers-of-2.

*Santanu Thangaraj is a Senior Software Engineer for AMD Compute Libraries. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.*