Bryan Catanzaro |
5/10/2010 |

### Introduction

This article discusses performance optimization for AMD GPUs and CPUs, using a simple yet widely used, computationally intensive kernel as a case study: Diagonal Sparse Matrix Vector Multiplication. We look at several topics which come up during OpenCL™ performance optimization and apply them to our case study:

- Translating C code to OpenCL™
- Choosing data structures for dense, aligned memory accesses
- Using local, on-chip memory
- Vectorizing the computation for higher efficiency
- Using OpenCL™ images to improve effective memory bandwidth
- Parallelism for multicore processors

At the end of our journey, we’ll have a high-performance kernel for both the AMD Radeon™ HD 5870 GPU, as well as the AMD Phenom™ II X4 965 CPU.

OpenCL™ allows developers to write portable, high-performance code that can target both GPUs and CPUs. OpenCL™ unlocks the performance capabilities of today’s parallel processors, although, as with any other programming environment, achieving high performance requires careful attention to how the code is mapped to the hardware platform and executed. Since performance is a prime motivation for using OpenCL™, performance optimization is a natural part of learning how to program in OpenCL™.

### Diagonal Sparse Matrix Vector Multiplication

Matrix vector multiplication is fundamental to a large number of algorithms in a variety of contexts, ranging from finite element modeling to image processing to machine learning. It’s a simple computation to understand, but optimizing matrix vector multiplication can be instructive, since it requires attention to memory access patterns, data structures, and ALU utilization, all of which are topics that come up in OpenCL™ programming in general.

Consider the following matrix vector multiplication, shown in figure 1:

**Figure 1: Example Matrix Vector Multiplication**

For each element of the result, we simply sum the element-wise product of the corresponding row from the matrix with the given vector.

Notice that most of the elements in the matrix we’re multiplying are zero, so we can achieve better efficiency by treating it as a sparse matrix, avoiding lots of unnecessary multiplication by zero.

There are many kinds of sparse matrices, categorized by the structure of their non-zero elements. Here, we’ll be looking at diagonal sparse matrices (often referred to as DIA), which are efficiently represented as a set of vectors representing the diagonals of the matrix with non-zero elements. Figure 2 shows the structure of a sample matrix derived from a regular 2-D grid, where every element of the grid is connected to its neighbors within a circular radius. The 2-D grid is flattened into a 1-D vector, and each non-zero entry in the matrix connects two elements of the grid – the elements represented by its row and column indices.

**Figure 2: Sparse Matrix Derived From 2-D Grid**

Looking at the structure of the matrix, you can see multiple bands of diagonals, each with a different width. The different widths of the various diagonal bands arise from the rasterized circular neighborhood pattern which created this matrix. If the neighborhood pattern was rectangular, each of the diagonal bands would be the same width. In general, diagonal sparse matrices are useful for representing relationships between elements on regular grids. In this article, we’ll be using matrices derived from an image segmentation problem, which of course is expressed as a 2-D grid. Every element is related to all neighbors within a 5-element radius, which leads to a matrix with 11 bands of diagonals, 81 diagonals in total.
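The band and diagonal counts follow directly from the neighborhood pattern, and a short host-side sketch can verify them (plain Python, our own illustration; the grid width `W` is an assumed value): each distinct row offset in the rasterized circle yields one band, and each (dx, dy) offset yields one diagonal.

```python
# Count the diagonals produced by a circular neighborhood of radius 5
# on a 2-D grid flattened row by row. W is an assumed grid width; any
# width larger than the neighborhood gives the same counts.
R = 5
neighborhood = [(dx, dy)
                for dy in range(-R, R + 1)
                for dx in range(-R, R + 1)
                if dx * dx + dy * dy <= R * R]

bands = {dy for _, dy in neighborhood}        # one band per row offset
W = 1024                                      # assumed grid width
offsets = sorted(dy * W + dx for dx, dy in neighborhood)

print(len(bands), len(offsets))   # 11 bands, 81 diagonals
```

This matches the matrices used in the rest of the article: 11 bands of diagonals, 81 diagonals in total.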

### Memory Access Patterns and Data Structure

The choice of data structure largely determines sparse matrix vector multiplication performance, especially when coding in OpenCL™.

What kind of data structure should we use to represent this matrix? Before answering this question, let’s discuss the need for dense and aligned memory accesses in OpenCL™ programming.

### Memory Access Patterns

OpenCL™ devices, whether from AMD, IBM, or others, prefer memory loads to be dense and aligned. This conserves memory bandwidth, which is often the most crucial resource we can optimize. To see why, let’s look at what happens when we load data from memory.

Figure 3 illustrates a load from memory on an abstract processor. When we request a word of data, the processor loads several consecutive words from memory, which include the word we requested. The processor loads more words than we asked for because modern memory architectures provide peak bandwidth when they return several contiguous words. This is fortuitous, because data accesses are usually correlated in time and space, meaning that bringing in blocks of memory instead of just the word currently being accessed can serve as a prefetch, lowering overall effective memory latency. We’ll call the set of words the processor transfers for a single memory load a cache line. In Figure 3, when we request word 0 from memory, the processor ends up transferring words 0-3 from off-chip memory.

**Figure 3: Cache Lines**

Due to the way memory is accessed, poorly structured memory access patterns can greatly decrease performance, especially for highly-parallel processors with large amounts of computational resources, like AMD GPUs. As Figure 4 shows, sparse memory accesses waste bandwidth, since the unused words the processor loads are immediately discarded, and unaligned accesses waste bandwidth because the processor must load twice as many words as necessary.

**Figure 4: Sparse and Misaligned Memory Accesses**

AMD CPUs such as the AMD Phenom II X4 965 CPU have a 64-byte cache line. AMD GPUs such as the ATI Radeon HD 5870 GPU have the equivalent of a 128-byte cache line. For optimal performance on both architectures, it’s best to have 128-byte aligned data structures. We will examine the implications of these for Diagonal Sparse Matrix Vector Multiplication.
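To see what misalignment and sparsity cost in concrete terms, here is a small illustrative sketch (Python, not part of the article's code) that counts how many 128-byte cache lines a group of 32 consecutive 4-byte float loads touches:

```python
LINE = 128   # cache-line size in bytes (the ATI Radeon HD 5870 equivalent)
WORD = 4     # one float

def lines_touched(first_word, num_words, stride_words=1):
    """Count the distinct cache lines covered by a strided run of word loads."""
    addrs = [(first_word + i * stride_words) * WORD for i in range(num_words)]
    return len({a // LINE for a in addrs})

print(lines_touched(0, 32))      # aligned, dense: 1 line
print(lines_touched(1, 32))      # misaligned by one word: 2 lines
print(lines_touched(0, 32, 4))   # sparse (stride 4): 4 lines for the same data
```

A misaligned access doubles the traffic, and a sparse access multiplies it by the stride, exactly as Figure 4 depicts.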

### Data Structure

Now that we understand the need for aligned, dense memory accesses, let’s design a data structure for DIA sparse matrices. To do this, first note that if we consider the matrix as a set of dense vectors, one for each diagonal of the matrix with non-zeros, we will have converted a sparse data structure into a dense data structure. For example, one such structure for our example matrix is shown in Figure 5. If you compare with the original matrix, you’ll see that we’ve simply taken each of the diagonals and flattened them into a vector, then stacked the vectors to create a dense matrix representation. We’ve also produced an array of offsets, with an entry for each diagonal recording that diagonal’s position; the offsets are used to index the matrix and vector.

**Figure 5: Flattened diagonals + Offset array**
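As a host-side illustration of this layout (a Python sketch, not the article's actual loader; the helper name `dense_to_dia` is ours), the flattened diagonals and offset array of Figure 5 can be built from a dense matrix like so:

```python
def dense_to_dia(M):
    """Convert a dense square matrix (a list of rows) to DIA form:
    a list of diagonal offsets and, for each, a flattened diagonal of
    length `rows`, padded with zeros where the diagonal leaves the matrix."""
    rows = len(M)
    offsets, data = [], []
    for off in range(-(rows - 1), rows):
        diag = [M[r][r + off] if 0 <= r + off < rows else 0.0
                for r in range(rows)]
        if any(diag):                 # keep only diagonals with non-zeros
            offsets.append(off)
            data.append(diag)
    return offsets, data

M = [[1.0, 2.0, 0.0],
     [0.0, 3.0, 4.0],
     [0.0, 0.0, 5.0]]
offsets, data = dense_to_dia(M)
print(offsets)   # [0, 1]
print(data)
```

Element (row, col) of the original matrix lands at `data[diag][row]` with `col = row + offsets[diag]`, which is exactly the indexing the kernels below rely on.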

Let’s consider how this structure performs when we write an OpenCL™ C kernel to do the multiplication. Parallelization of this computation is trivial: the computation for each element of the output vector is independent, so we will simply instantiate a work-item for each element of the output vector. Each work-item then iterates through the set of diagonals, reading from the flattened diagonals and the vector at each iteration, and accumulating the product. Since the work-items are executed in SIMD fashion on AMD GPUs, we can see that memory accesses from the work-items to the matrix will be dense, but they will not be aligned.

We can see this because the position at which each work-item will read from the diagonals will change for each row of this flattened set of diagonals, due to the way we’ve flattened them. Figure 6 shows an alternative way of flattening the diagonals to ensure aligned memory accesses from the matrix, given our parallelization.

**Figure 6: Matrix Representation For Aligned Access**

Accesses to the vector are still unaligned, but that is unavoidable, since the alignment depends on the placement of the original diagonals in the matrix, which arises from the problem we’re trying to solve, and is, therefore, much more difficult to change.

### Initial OpenCL™ C Kernel

The translation to OpenCL™ C is then straightforward.

```c
__kernel void dia_spmv(__global float *A,
                       __const int rows,
                       __const int diags,
                       __global int *offsets,
                       __global float *x,
                       __global float *y)
{
    int row = get_global_id(0);
    float accumulator = 0;
    for (int diag = 0; diag < diags; diag++) {
        int col = row + offsets[diag];
        if ((col >= 0) && (col < rows)) {
            float m = A[diag * rows + row];
            float v = x[col];
            accumulator += m * v;
        }
    }
    y[row] = accumulator;
}
```

The OpenCL™ C code is fairly similar to the original C version, with just the addition of a few OpenCL™ C qualifiers, and with the outermost loop replaced by a parallel kernel.
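Before tuning, it helps to pin the kernel's indexing down with a sequential host-side reference. The following Python sketch (our own illustration; `dia_spmv_ref` is a hypothetical helper, not code from the article) mirrors the kernel one work-item per row and can be checked against a dense multiply:

```python
def dia_spmv_ref(data, offsets, x):
    """Sequential mirror of the OpenCL kernel: data[diag][row] holds the
    flattened diagonals, offsets[diag] the column displacement of each."""
    rows = len(x)
    y = [0.0] * rows
    for row in range(rows):                   # one "work-item" per row
        acc = 0.0
        for diag, off in enumerate(offsets):
            col = row + off
            if 0 <= col < rows:
                acc += data[diag][row] * x[col]
        y[row] = acc
    return y

# The matrix [[1,2,0],[0,3,4],[0,0,5]] in DIA form:
offsets = [0, 1]
data = [[1.0, 3.0, 5.0],    # main diagonal
        [2.0, 4.0, 0.0]]    # first superdiagonal, zero-padded at the end
print(dia_spmv_ref(data, offsets, [1.0, 1.0, 1.0]))   # [3.0, 7.0, 5.0]
```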

### Initial Performance on ATI Radeon HD 5870 GPU

We benchmarked our SpMV code on a system with an AMD Phenom II X4 965 CPU, running at 3.4 GHz, with 21.3 GB/s of peak memory bandwidth, coupled to an ATI Radeon HD 5870 GPU, clocked at 850 MHz, with 153.6 GB/s of peak memory bandwidth on an MSI K9A2 Platinum motherboard with 8GB of DRAM. The system was running Ubuntu Linux® 9.10 (64-bit), the ATI Catalyst™ 10.4 Driver Suite and the ATI Stream SDK v2.1.

Figure 7 shows the performance we achieved across a range of matrix sizes using this basic implementation. For context, a dual-socket, quad-core system using the AMD Opteron™ 2356 CPU, which has the same amount of peak memory bandwidth as our AMD Phenom II X4 965 CPU, achieves 3.5 DP GFLOP/s on similar matrices [1]. To make a fair comparison, we note that we are examining single precision floating-point, which saves us about a factor of 1.5 in performance, assuming we’re bandwidth limited and have perfect caching of the vector. Additionally, we’re using the specialized DIA matrix format, as opposed to the general CSR format used in [1], which saves us about another factor of 2, again assuming perfect caching for the vector and the offset array. A rough estimate for optimized CPU performance with our setup would then be about 10.5 SP GFLOP/s.

Our unoptimized GPU implementation reaches 13.1 SP GFLOP/s, which is respectable in comparison.

Sparse matrix vector multiplication is simple enough that we can compute some bounds on performance to figure out how much performance headroom remains with our setup. First, we note that for every 2 floating-point operations in the inner-most loop, we have to load 12 bytes of data from off-chip memory: 4 bytes for the integer offset of the diagonal we’re computing, 4 bytes for the entry in the matrix, and 4 bytes for the entry in the vector. The ATI Radeon HD 5870 GPU has 153.6 GB/s of memory bandwidth, so if we’re perfectly bandwidth bound, we should be able to sustain 2 FLOP/12 Bytes * 153.6 GBytes/s = 25.6 GFLOP/s. This implementation is getting about 51% of our bound. Maybe we can improve on that.
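The same back-of-the-envelope arithmetic applies to any variant of this kernel: divide the flops per inner-loop iteration by the off-chip bytes it fetches, and multiply by peak bandwidth. As a quick sketch (Python, illustrative; the three traffic levels correspond to fetching the offset, matrix, and vector entries from off-chip memory or not):

```python
def bound_gflops(flops, bytes_per_iter, bw_gb_s):
    """Bandwidth-bound ceiling: FLOP per off-chip byte times peak bandwidth."""
    return flops / bytes_per_iter * bw_gb_s

BW = 153.6   # ATI Radeon HD 5870 peak memory bandwidth, GB/s
print(round(bound_gflops(2, 12, BW), 1))   # offset + matrix + vector: 25.6
print(round(bound_gflops(2, 8, BW), 1))    # offsets kept on-chip:     38.4
print(round(bound_gflops(2, 4, BW), 1))    # vector cached as well:    76.8
```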

**Figure 7: Initial ATI Radeon HD 5870 GPU Performance**

### Simplified, Aligned Kernel

Here is the code for a slightly modified version of the kernel with simplified indexing math, as well as an aligned matrix *A*, so that all the diagonal rows are aligned to 128-byte boundaries. To align *A*, we’ve introduced another argument, pitch_A, which tells us the length of each row of *A* in floats. For example, if rows=154401, pitch_A will be 154432 (rows rounded up to the next multiple of 32). The data representing *A* has to be padded to make each row as long as the pitch. We do this in the host code when loading the sparse matrix *A*.

```c
__kernel void dia_spmv(__global float *A,
                       __const int pitch_A,
                       __const int rows,
                       __const int diags,
                       __global int *offsets,
                       __global float *x,
                       __global float *y)
{
    int row = get_global_id(0);
    float accumulator = 0;
    __global float* matrix_offset = A + row;
    for (int diag = 0; diag < diags; diag++) {
        int col = row + offsets[diag];
        if ((col >= 0) && (col < rows)) {
            float m = *matrix_offset;
            float v = x[col];
            accumulator += m * v;
        }
        matrix_offset += pitch_A;
    }
    y[row] = accumulator;
}
```
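Host-side, the padding described above might look something like the following sketch (Python; `pad_to_pitch` is a hypothetical helper, not code from the article):

```python
def pad_to_pitch(data, rows, align_floats=32):
    """Pad each flattened diagonal from `rows` floats up to the next
    multiple of `align_floats` (32 floats = 128 bytes), so that every
    row of A starts on a 128-byte boundary once the rows are concatenated."""
    pitch = (rows + align_floats - 1) // align_floats * align_floats
    padded = []
    for diag in data:
        padded.extend(diag + [0.0] * (pitch - rows))
    return pitch, padded

pitch, _ = pad_to_pitch([[0.0] * 154401], 154401)
print(pitch)   # 154432, matching the example above
```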

### Simplified, Aligned Performance on ATI Radeon HD 5870 GPU

Figure 8 shows that aligning the data structure and simplifying the indexing provided a slight improvement, to an average of 13.9 GFLOP/s, or 54% of our bound. Although the improvement was modest in this case, aligning data structures is always a good idea in OpenCL™ programming, and should be one of the first things you try when tuning your code.

**Figure 8: Aligned ATI Radeon HD 5870 GPU Performance**

### Kernel Using Local Memory

If we look at the inner loop, we can see that all work-items are loading from the offsets array, which means the same data is being used repeatedly. In addition, the offsets array is small, since there is only one element per diagonal, so it can potentially be stored in local memory. Since our memory bandwidth consumption limits our performance, it’s a good idea to keep as many memory accesses as possible on-chip, which is exactly why local memory is provided in OpenCL™. This version of the code loads the offset array into local memory, synchronizes work-items with an OpenCL™ barrier, and then proceeds with the computation. It’s limited to matrices with 256 or fewer diagonals, since that’s the amount of storage we’ve allocated in local memory. If we wanted a more general version, we could loop over blocks of 256 diagonals until we finish the computation.

```c
__kernel void dia_spmv(__global float *A,
                       __const int pitch_A,
                       __const int rows,
                       __const int diags,
                       __global int *offsets,
                       __global float *x,
                       __global float *y)
{
    __local int l_offsets[256];
    int local_id = get_local_id(0);
    int offset_id = local_id;
    while ((offset_id < 256) && (offset_id < diags)) {
        l_offsets[offset_id] = offsets[offset_id];
        offset_id += get_local_size(0);
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    int row = (int)get_global_id(0);
    float accumulator = 0;
    __global float* matrix_offset = A + row;
    for (int diag = 0; diag < diags; diag++) {
        int col = row + l_offsets[diag];
        if ((col >= 0) && (col < rows)) {
            float m = *matrix_offset;
            float v = x[col];
            accumulator += m * v;
        }
        matrix_offset += pitch_A;
    }
    y[row] = accumulator;
}
```
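The staging loop at the top of this kernel is a common cooperative-load idiom: the work-group's work-items stride through the array together until every entry has been copied. Emulated sequentially (a Python sketch; the work-group size of 64 is an assumed example):

```python
def stage_offsets(offsets, local_size=64, capacity=256):
    """Emulate the work-group copy of `offsets` into local memory:
    work-item i copies elements i, i + local_size, i + 2*local_size, ..."""
    l_offsets = [None] * capacity
    for local_id in range(local_size):        # each work-item's loop
        offset_id = local_id
        while offset_id < capacity and offset_id < len(offsets):
            l_offsets[offset_id] = offsets[offset_id]
            offset_id += local_size
    return l_offsets

staged = stage_offsets(list(range(81)))       # 81 diagonals, as in our matrices
print(staged[:81] == list(range(81)))         # all offsets staged: True
```

No matter how the work-group size relates to the number of diagonals, every entry is copied exactly once, which is why the kernel only needs a single barrier afterwards.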

### Performance of Kernel Using Local Memory

Figure 9 shows the performance improvement from moving offsets array accesses to local memory. The average speed improved by 75%, to 24.5 GFLOP/s. Since we moved some memory accesses from off-chip DRAM to on-chip local memory, our earlier bound no longer applies. Now, for every 2 floating-point operations in the inner loop, we load 8 bytes of data (4 from the matrix, and 4 from the vector). This means that our upper bound on performance is 2 FLOP/8 Bytes * 153.6 GBytes/s = 38.4 GFLOP/s. Our performance is at 64% of bound. This is okay, but let’s try to improve it further.

**Figure 9: ATI Radeon HD 5870 GPU Performance Using Local Memory**

### Vectorized Kernel

The ATI Radeon HD 5870 GPU has 2.72 TFLOP/s peak computational throughput, indicating that our 24.5 GFLOP/s remains far from using all the number crunching capabilities of the processor. Perhaps we can improve the efficiency of our code by employing OpenCL™ C vector types, which will map more neatly onto the architecture of both AMD GPUs and CPUs than the scalar code we’ve been writing.

The following code shows how we do this, by redefining the partitioning of the computation so that each work-item processes four rows of the output. This introduces a few corner cases which we have to handle correctly to ensure that we aren’t going beyond the boundaries of our data structures. Notice also our use of OpenCL™ C vector data types, like int4 and float4, as well as OpenCL™ C vector memory loads and stores (vload4, vstore4).

```c
__kernel void dia_spmv(__global float *A,
                       __const int pitch_A,
                       __const int rows,
                       __const int diags,
                       __global int *offsets,
                       __global float *x,
                       __global float *y)
{
    __local int l_offsets[256];
    int local_id = get_local_id(0);
    int offset_id = local_id;
    while ((offset_id < 256) && (offset_id < diags)) {
        l_offsets[offset_id] = offsets[offset_id];
        offset_id += get_local_size(0);
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    int row = get_global_id(0) * 4;
    float4 accumulator = (float4)(0.0f);
    __global float* matrix_offset = A + row;
    for (int diag = 0; diag < diags; diag++) {
        int col = row + l_offsets[diag];
        float4 m = vload4(0, matrix_offset);
        float4 v;
        if ((col >= 0) && (col < rows - 4)) {
            // All four elements are in bounds: one dense vector load
            v = vload4(0, x + col);
        } else {
            // Near the edges of the vector: gather lane by lane,
            // zeroing lanes that fall outside [0, rows)
            int4 id = col + (int4)(0, 1, 2, 3);
            int4 in_bounds = (id >= 0) && (id < rows);
            v.x = in_bounds.x ? x[id.x] : 0;
            v.y = in_bounds.y ? x[id.y] : 0;
            v.z = in_bounds.z ? x[id.z] : 0;
            v.w = in_bounds.w ? x[id.w] : 0;
        }
        accumulator += m * v;
        matrix_offset += pitch_A;
    }
    vstore4(accumulator, 0, y + row);
}
```
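The boundary handling in the slow path can be checked on the host with a scalar emulation of the float4 gather (Python sketch; `gather4` is our own illustrative helper):

```python
def gather4(x, col):
    """Emulate the kernel's float4 read of x[col..col+3]: lanes that fall
    outside [0, rows) contribute zero, matching the in_bounds selects."""
    rows = len(x)
    return [x[col + lane] if 0 <= col + lane < rows else 0.0
            for lane in range(4)]

x = [10.0, 20.0, 30.0, 40.0, 50.0]
print(gather4(x, 1))    # fully in range: [20.0, 30.0, 40.0, 50.0]
print(gather4(x, -2))   # leading lanes zeroed: [0.0, 0.0, 10.0, 20.0]
print(gather4(x, 3))    # trailing lanes zeroed: [40.0, 50.0, 0.0, 0.0]
```

Zeroing the out-of-range lanes is safe here because those lanes simply contribute 0 to the accumulated dot product.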

### Performance of Vectorized Kernel

Figure 10 shows the resulting performance, which improves 34% on average to 32.9 GFLOP/s. This is 86% of our bound, which tells us we’ve done a pretty good job of using the hardware. But is there room to improve the bound even further?

**Figure 10: Vectorized ATI Radeon HD 5870 GPU Performance**

### Caching the Vector Using Images

The ATI Stream SDK v2.1 contains support for OpenCL™ images, which can provide additional memory bandwidth in certain cases by using the texture caches on GPUs, coupled with texture sampling hardware for reading from these caches. In our case, the vector *x* is an ideal candidate for OpenCL™ images. It is heavily reused, since each work-item loads as many elements from *x* as there are diagonals in our matrix, and *x* is also fairly small. Not all OpenCL™ runtimes and devices support images; image support can be detected by calling clGetDeviceInfo() and examining CL_DEVICE_IMAGE_SUPPORT.

There are strict limits on the size and data types of objects which can be placed into images. In our case, the ATI Radeon 5870 GPU supports 2-D images up to 8192×8192 pixels, where each pixel is a float4 vector (in graphics, RGBA components are often used to describe each pixel). We can repurpose this machinery for holding the vector *x*, as long as *x* is less than 8192*8192*4 elements – which it obviously is, since we have limited memory on the device, and the matrix to match such a vector would be too large to fit in device memory.

The vector *x* is naturally a 1-D object, but we can conceive of it as a 2-D image just by imposing a 2-D coordinate system of float4 elements on the vector. The standard way to do this is to define a width, and then generate coordinates: u, v = (pos % width, pos / width). In our case, since there is no natural width to our 1-D vector, we can choose the width arbitrarily. Noting that the modulus and integer division operations are dramatically simplified to bit-masking and shifting when the denominator is a power of 2, we choose to lay out the image so that it is a power of 2 pixels wide.

Additionally, each pixel holds four entries of the vector, since it is an RGBA image type. This complicates addressing the image somewhat, since each work-item will be accessing four contiguous elements of the vector, but not necessarily the four elements that are held in a particular pixel. The unaligned nature of these accesses is data dependent: it depends on the location of the diagonals in the original matrix (or, with our data structure, the values in the offsets array), so it is not possible to guarantee pixel-aligned accesses to the image. Instead, we must deal with four special cases: aligned, misaligned by 1 float, misaligned by 2 floats, and misaligned by 3 floats. In the misaligned cases, we perform two reads from the image, and then assemble our float4 sampling of the vector by swizzling in the appropriate elements.

In this configuration, the texture sampler takes care of accesses to the image which might be out of bounds by clamping to a default value of (0, 0, 0, 0), which is exactly what we need and removes some conditionals. Below is the code needed to read our vector elements from the image representation of the vector.

```c
float4 unaligned_read_from_image(
    __read_only image2d_t vector,   // OpenCL image holding the vector
    __const int log_vector_width,   // log_2 of the image width (shift for division)
    __const int width_mask,         // the image width - 1 (mask for modulus)
    __const int col)                // unaligned index of the first element we need
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |  // natural coordinates
                          CLK_ADDRESS_CLAMP |            // clamp to zeros
                          CLK_FILTER_NEAREST;            // don't interpolate
    int alignment = col % 4;
    if (alignment < 0) alignment += 4;   // col may be negative
    int rgbcol = col >> 2;
    int2 coord0;
    coord0.x = rgbcol & width_mask;
    coord0.y = rgbcol >> log_vector_width;
    float4 v;
    if (alignment == 0) {
        // Fully aligned to a pixel in the image: one read suffices
        v = read_imagef(vector, smp, coord0);
    } else {
        // Misaligned: read two adjacent pixels and swizzle
        rgbcol++;
        int2 coord1;
        coord1.x = rgbcol & width_mask;
        coord1.y = rgbcol >> log_vector_width;
        float4 v0 = read_imagef(vector, smp, coord0);
        float4 v1 = read_imagef(vector, smp, coord1);
        if (alignment == 1) {          // misaligned by one float
            v.xyz = v0.yzw;
            v.w = v1.x;
        } else if (alignment == 2) {   // misaligned by two floats
            v.xy = v0.zw;
            v.zw = v1.xy;
        } else {                       // misaligned by three floats
            v.x = v0.w;
            v.yzw = v1.xyz;
        }
    }
    return v;
}
```
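The mask-and-shift coordinate math above is easy to verify on the host: for a power-of-2 image width, the bitwise operations reproduce the modulus and division exactly. A quick Python sketch (the width of 1024 pixels is an assumed example):

```python
LOG_W = 10             # log2 of the image width in pixels (assumed)
WIDTH = 1 << LOG_W     # 1024 float4 pixels per image row
MASK = WIDTH - 1

def pixel_coord(rgbcol):
    """Map a linear float4-pixel index to 2-D image coordinates (u, v)."""
    return rgbcol & MASK, rgbcol >> LOG_W

# The bit tricks agree with true modulus and integer division:
for rgbcol in [0, 1023, 1024, 123456]:
    assert pixel_coord(rgbcol) == (rgbcol % WIDTH, rgbcol // WIDTH)
print(pixel_coord(123456))   # (576, 120)
```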

And here is our final code variant. It’s similar to the vectorized version, except for the way we read from the vector *x*.

```c
__kernel void dia_spmv(__global float *A,
                       __const int pitch_A,
                       __const int rows,
                       __const int diags,
                       __global int *offsets,
                       __read_only image2d_t x,
                       __const int log_vector_width,
                       __const int width_mask,
                       __global float *y)
{
    __local int l_offsets[256];
    int local_id = get_local_id(0);
    int offset_id = local_id;
    while ((offset_id < diags) && (offset_id < 256)) {
        l_offsets[offset_id] = offsets[offset_id];
        offset_id += get_local_size(0);
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    int row = get_global_id(0) * 4;
    float4 accumulator = (float4)(0.0f);
    __global float* matrix_offset = A + row;
    if (row < rows) {
        for (int diag = 0; diag < diags; diag++) {
            float4 m = vload4(0, matrix_offset);
            int col = row + l_offsets[diag];
            float4 v = unaligned_read_from_image(x, log_vector_width,
                                                 width_mask, col);
            accumulator += m * v;
            matrix_offset += pitch_A;
        }
        vstore4(accumulator, get_global_id(0), y);
    }
}
```

### Performance of OpenCL™ C Kernel Using Images

Figure 11 shows the resulting performance, which improves by another 50%, to reach an average of 49.2 GFLOP/s. Since accesses to the vector now go through the image hardware, we no longer need to perform so many accesses of the vector from off-chip memory. Assuming perfect caching of the vector, we need only read one float from memory for every 2 floating-point operations. This leads us to a bound of 2 FLOP/4 Bytes * 153.6 GB/s = 76.8 GFLOP/s, putting our final code variant at 64% of its bound.

**Figure 11: ATI Radeon HD 5870 GPU Performance Using Images**

During the optimization process, we were able to improve the bound twice, by eliminating memory accesses to data structures that are relatively small and frequently used, but it’s not possible to eliminate the final memory access, since each element of the matrix *A* is only used once, and the matrix itself is too large to fit into any structure we could use as an on-chip cache. At this point, we can be reasonably confident that our code is close to optimal. There may be other optimizations, like loop unrolling, etc., that we could try to apply, but since the compulsory loads from the matrix *A* will still need to be performed, such optimizations will only yield marginal benefits.

### AMD Phenom II X4 965 CPU Results

As we discussed earlier, a very well optimized implementation of this code for the AMD Phenom II X4 965 CPU might achieve 10.5 SP GFLOP/s on this computation. The upper bound on performance is 10.6 SP GFLOP/s since the processor has 21.3 GB/s of memory bandwidth, assuming perfect caching of the vector and offset array.

After applying our optimizations, except for the use of OpenCL™ images to cache the vector, since the ATI Stream SDK v2.1 does not currently support images on x86 CPUs, we reach 2.9 SP GFLOP/s, as shown in Figure 12. Although we only achieved 27% of bound, OpenCL™ still enabled us to run the same code and utilize all our cores.

**Figure 12: Vectorized AMD Phenom II X4 965 CPU Results**

### Conclusion

Taking a close look at optimizing DIA sparse matrix vector multiply has illustrated several techniques for getting good performance with OpenCL™ C code:

- Pay attention to the interplay between SIMD execution and your data structure.
- Align and densify accesses as much as possible.
- Use local memory to eliminate off-chip memory accesses.
- Vectorize your code for greater efficiency.
- Use OpenCL™ images for intermediate-sized data structures with hard-to-predict access patterns, but lots of reuse, when targeting the GPU.
- When targeting the CPU, consider tailoring the amount of parallelism you express to the natural parallelism of the processor.

With these techniques, we’ve been able to construct a high-performance DIA sparse matrix vector multiply routine that efficiently uses the resources of the ATI Radeon HD 5870 GPU.

The same code that worked well on the GPU also provides decent performance on the CPU, and slightly adjusting the parallelism of the computation to better fit the CPU improved CPU performance a bit as well.

As you write your own code, careful attention to these principles will help you achieve high performance results. Good luck!

### References

[1] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. *Parallel Computing*, vol. 35, no. 3, pp. 178-194, 2009.

*OpenCL™ and the OpenCL™ logo are trademarks of Apple Inc. used by permission by Khronos*.