*by Bryan Catanzaro, 2/11/2011*

In this article, we examine key kernels utilized in a Quadratic Programming solver for Support Vector Machine training. We optimize the evaluation of the Radial Basis Function SVM kernel by examining a variety of different data structures as well as their performance implications, improving performance by a factor of 5 compared to naive code running on an AMD Radeon™ HD 5870 GPU.We discuss general rules of thumb that lead to efficient data structures for OpenCL™ computation.

### Introduction

Support Vector Machines (SVMs) are a widely used binary classification technique used to analyze data, or in other words, to label data

as belonging to one of two categories. For example, SVMs can be used to determine whether email is important or unsolicited, whether an image contains a flower or not, or whether a DNA sequence codes for a particular trait, etc. Researchers in many fields, such as image, speech and gesture recognition, bioinformatics, and computer security have applied SVMs to enable programs to analyze data.

In order to use an SVM, it must first be trained, so that the classifier can “learn” the categories it distinguishes between. This training process is computationally intensive. In this article, we’ll discuss the issues that arise when building an iterative solver for Support Vector Machine training, running on AMD GPUs, using

OpenCL.Most of the article will focus on the proper use of on-chip and off-chip memory resources of the AMD Radeon HD 5870 GPU.The data structures we choose determine how efficiently we can use our memory resources, so this article examines several ways of laying out data structures and using them in OpenCL code, in order to achieve good performance.

### Support Vector Machine Training

To begin this discussion, first we’ll briefly describe the problem we’re trying to solve.As previously mentioned, Support Vector Machines are a widely used classification technique.Support Vector Machines process data as points in some abstract, highly dimensional feature space. For example, I could describe objects by measuring their exterior dimensions (length, width, height), as well as their mass. Each object would then be described by four numbers, which we can think

of as a four-dimensional vector in some feature space. Typical SVM applications use feature spaces with tens to hundreds of dimensions, or sometimes more.

Given a series of measurements of objects, along with their associated labels, I could then train an SVM to distinguish between two different classes of object, for example: between objects made of wood, and those of stone.Since wood objects would tend to reside in a certain region of my four dimensional feature space, and stone objects would tend to reside in another region, we can draw a boundary in our feature space that separates the wood objects from the stone objects. Of course, there are many such boundaries we could draw, but we want to learn the most general boundary we can, so that our classifier will be robust. Training an SVM is a specific approach for learning the boundary which separates our training data with maximal margin, meaning that the training data is as clearly classified as possible. Then, a program can classify new data by simply checking to see which side of the boundary the new data falls on.

SVMs can be trained using a Quadratic Programming optimization process called Sequential Minimal Optimization. In this algorithm, we take many tiny steps to update the boundary between our two object classes.This is done by finding the two training points which are explained most poorly by our current guess of what the boundary is, and then updating the boundary to fit to those points better.

Figure 1: A boundary separating pictures of plants from pictures of people

The main computational load in this algorithm is choosing the two most poorly explained points out of our training set, which may be hundreds of thousands of points in total. This involves evaluating a distance function between all our training points and two of the training points. This distance function, called a kernel function, can take many forms, but one of the most useful is the so-called Radial Basis Function. Our job, then, is to evaluate many copies of this function, for our many training points.

### Initial Analysis

Let’s look at a simple C version of the Radial Basis Function evaluations.We’re going to evaluate all data points against two of the data points, one called high and the other called low.

We’ll store the results in two arrays, one called high_kernel and the other called low_kernel. We have nPoints training points, each of which is a sample from an nDim dimensional feature space.We’ll store the data points in an array called data, where each row represents one training point, and so the resulting matrix has nPoints rows and nDim columns. For now, we’ll store the matrix using the standard C “row-major” data layout, which we’ll discuss in more detail later in the article.

gamma is the scalar parameter from our Radial Basis Function. Listing 1 shows some simple C code to accomplish this.

void svm_kernel(float* data, int nPoints, int nDim, int high, int low, float gamma, float* high_kernel, float* low_kernel) { for(int index = 0; index < nPoints; index++) { float high_accumulator = 0; float low_accumulator = 0; for(int d = 0; d < nDim; d++) { float x_i_d = data; float x_high_d = data[high * nDim + d]; float x_low_d = data[low * nDim + d]; float high_diff = x_i_d – x_high_d; high_accumulator += high_diff * high_diff; float low_diff = x_i_d – x_low_d; low_accumulator += low_diff * low_diff; } float high_result = exp(-gamma * high_accumulator); high_kernel = high_result; float low_result = exp(-gamma * low_accumulator); low_kernel = low_result; } }

Listing 1: Simple C code for RBF Kernel Evaluations

In this code, we’ve fused both the high and low kernel evaluations together into one inner loop, in order to avoid the overhead of loading the vector twice for each iteration of the outer loop. We can see the row-major data layout in the indexing arithmetic in the inner loop: to load data

We can also see that in the inner loop, there are 6 floating-point operations and three memory loads.

However, since the high and low vectors are the same for every iteration of the outer loop, if we write our code correctly, we’ll only have to load them once into on-chip memories, and then we’ll reuse them without having to execute an off-chip memory load in the inner loop. The code in listing 1 does not do this, it is an example of a simple C implementation without any optimizations. Later, we’ll see how to implement this optimization to avoid wasting memory bandwidth.

We can use this information to compute a bound on our expected performance of this code, running on the AMD Radeon HD 5870. The Radeon HD 5870 has 153.6 GB/s of off-chip memory bandwidth. Assuming perfect caching of the high and low vectors, we should see 153.6 GB/second * 6 Single Precision Floating Point Operations/Memory load * 1 Memory load/4 Bytes = 230.4 Single Precision Floating Point Operations/second. We’ll use this bound to evaluate the quality of our implementations as we optimize.

### Basic OpenCL Implementation

Translating the C code directly to OpenCL, we will instantiate a work-item for each iteration of the outermost loop. Listing 2 shows a direct OpenCL translation of our simple C implementation from Listing 1.

__kernel void svm_kernel(__global float* data, __const int nPoints, __const int nDim, __const int high, __const int low, __const float gamma, __global float* high_kernel, __global float* low_kernel ) { int global_index = get_global_id(0); float high_accumulator = 0; float low_accumulator = 0; for(int d = 0; d < nDim; d++) { float x_i_d = data[global_index * nDim + d]; float x_high_d = data[high * nDim + d]; float high_diff = x_i_d - x_high_d; high_accumulator += high_diff * high_diff; float x_low_d = data[low * nDim + d]; float low_diff = x_i_d - x_low_d; low_accumulator += low_diff * low_diff; } float high_kernel = exp(-gamma * high_accumulator); high_kernel[global_index] = high_kernel; float low_kernel = exp(-gamma * low_accumulator); low_kernel[global_index] = low_kernel; }

Listing 2: Direct OpenCL translation of simple C code for RBF kernel evaluation

Comparing Listing 2 to Listing 1, we can see that we have taken the outermost loop and turned it into an OpenCL kernel, where the index space of the OpenCL kernel corresponds to the iterations of the outermost loop.

### Performance

Figure 2: Row Major RBF Kernel Evaluation Performance

Figure 2 shows the performance we achieve with this implementation, with data of dimensionality 1000. For larger problems, we achieve 34 Single Precision Giga Floating Point Operations/Second (SPGFLOP/s). Unfortunately, this is only 15% of our bound. Clearly, there are some things we should investigate to improve

performance.

### Row-major versus Column-major

Since we are clearly bottlenecked by memory performance, we should examine our choice of data structures. Earlier, we decided to keep our training data matrix in the default C row-major layout.Row-major layout represents a two-dimensional matrix as a one-dimensional array by scanning across the rows, like so:

Figure 3: Row Major Layout, color indicates access pattern

Let’s envision how this data structure is being used in our program.We’re instantiating a work-item for each row of the matrix, and each work-item will then iterate

through the columns of the matrix. Consider the very first iteration of the innermost loop for the example matrix illustrated in Figure 3: work-item 0 will load element 0 from the data array, and work-item 1 will load element 3 from the data array. During the next iteration, work-item 0 will load element 1, and work-item

1 will load element 4, etc.This pattern is color coded in Figure 3.

With this data layout, as the computation proceeds, the work-items will not be accessing contiguous locations in off-chip memory. This causes inefficiencies, since the memory subsystems of all modern processors are optimized for loading contiguous vectors from off-chip memory. We can change this behavior by restructuring our data, in this case moving from a “row-major” layout to a “column-major” layout.

Column-major layout represents a two-dimensional matrix as a one dimensional array by scanning across the columns, like so:

Figure 4: Column Major Layout, color indicates access pattern

With column-major layout, adjacent work-items will access adjacent items in memory as the computation proceeds, as shown in Figure 4. This improves efficiency. This is a simple change in our OpenCL code. For an array with nDim elements in each row and nPoints rows, in order to index element data

void svm_kernel(__global float* data, __const int pitch, __const int nPoints, __const int nDim, __const int high, __const int low, __const float gamma, __global float* high_kernel, __global float* low_kernel ) { int global_index = get_global_id(0); __global float* x_i = data + global_index; float high_accumulator = 0; float low_accumulator = 0; __global float* x_high = data + high; __global float* x_low = data + low; for(int d = 0; d < nDim; d++) { float x_i_d = *x_i; float x_high_d = *x_high; float high_diff = x_i_d - x_high_d; high_accumulator += high_diff * high_diff; float x_low_d = *x_low; float low_diff = x_i_d - x_low_d; low_accumulator += low_diff * low_diff; x_i += pitch; x_high += pitch; x_low += pitch; } float high_result = exp(-gamma * high_accumulator); high_kernel[global_index] = high_result; float low_result = exp(-gamma * low_accumulator); low_kernel[global_index] = low_result; global_index += get_global_size(0); }

Listing 3: Column-major data storage

Comparing the code in Listing 3 to the code in Listing 2, we can see that the only change we’ve made is in how we’ve accessed the data structures.Instead of striding by 1 element we loop through the dimensions of the data, we now stride by pitch elements.This simple change has a big impact on performance.

Figure 5: Column-major Performance

Figure 5 shows the performance improvement we see when moving to a column-major data layout. For larger problem sizes, the average performance climbs from 33.7 to 65.7 SPFLOP/s, which is almost a factor of 2 improvement. Comparing to our performance bound, we see we have achieved 29% of our bound.Perhaps there is yet room for improvement.

### Vectorization

AMD GPUs (and CPUs, for that matter) are most efficient when each work-item operates on vectors of data, rather than just scalars. We can vectorize the loop by performing this transformation. Listing 4 shows how to accomplish this.

float sum(float4 in) { return dot(in, (float4)(1.0f, 1.0f, 1.0f, 1.0f)); } __kernel void svm_kernel(__global float* data, __const int pitch, __const int nPoints, __const int nDim, __const int high, __const int low, __const float gamma, __global float* high_kernel, __global float* low_kernel ) { int global_index = get_global_id(0); __global float* x_i = data + global_index * 4; float4 high_accumulator = 0; float4 low_accumulator = 0; __global float* x_high = data + high; __global float* x_low = data + low; for(int d = 0; d < nDim; d++) { float4 x_i_d = vload4(0, x_i); float4 x_high_d = vload4(0, x_high); float4 high_diff = x_i_d - x_high_d; high_accumulator += high_diff * high_diff; float4 x_low_d = vload4(0, x_low); float4 low_diff = x_i_d - x_low_d; low_accumulator += low_diff * low_diff; x_i += pitch; x_high += pitch; x_low += pitch; } float high_kernel = exp(-gamma * sum(high_accumulator)); high_kernel[global_index] = high_kernel; float low_kernel = exp(-gamma * sum(low_accumulator)); low_kernel[global_index] = low_kernel; }

Listing 4: Vectorized Computation

In Listing 4, we have changed the computation so that each work-item works on 4 data elements at a time. We use vector loads and stores from memory, via the vload4 intrinsic provided by OpenCLWe also use OpenCL’s dot intrinsic to execute a horizontal sum operation across the vector in each work-item.

### Performance

Figure 6: Vectorized Performance

This transformation improves average performance by 30%, to 85.9 SPGFLOP/s, which is 37% of our bound. Now that we have improved execution efficiency by vectorization, perhaps we should revisit our memory traffic to reduce off-chip memory loads.

### Using OpenCL Images

OpenCL images enable the use of on-chip memories for read-only data.Since the two vectors which are shared by all work-items are in fact read-only data, perhaps

it makes sense to use OpenCL’s image facilities to use the GPU’s on-chip memories.To do this, we represent each vector as a two-dimensional image, where each pixel holds four floating point values, and there is just one row of pixels in the image. In the OpenCL code, the high and low vectors are now represented by

image2d_t types, which encapsulate references to the high and low vectors, rather than indices into the data matrix. To initialize the proper image sampler for this application, we note that we do not want to use floating-point image addressing (CLK_NORMALIZED_COORDS_FALSE), we want to clamp accesses to the image to the image boundaries (CLK_ADDRESS_CLAMP), and we want to grab the nearest pixel to the coordinates we specify (CLK_FILTER_NEAREST). Reads from the high and low vectors are then performed via read_imagef() calls.Listing 5 shows what this looks like.

float sum(float4 in) { return dot(in, (float4)(1.0f, 1.0f, 1.0f, 1.0f)); } __kernel void svm_kernel(__global float* data, __const int pitch, __const int nPoints, __const int nDim, __read_only image2d_t high, __read_only image2d_t low, __const float gamma, __global float* high_kernel, __global float* low_kernel ) { const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST; int global_index = get_global_id(0); __global float* x_i = data + global_index * 4; float4 high_accumulator = 0; float4 low_accumulator = 0; for(int d = 0; d < nDim; d++) { float4 x_i_d = vload4(0, x_i); int2 coord = (int2)(d, 0); float4 x_high_d = read_imagef(high, smp, coord); float4 high_diff = x_i_d - x_high_d; high_accumulator += high_diff * high_diff; float4 x_low_d = read_imagef(low, smp, coord); float4 low_diff = x_i_d - x_low_d; low_accumulator += low_diff * low_diff; x_i += pitch; } float high_kernel = exp(-gamma * sum(high_accumulator)); high_kernel[global_index] = high_kernel; float low_kernel = exp(-gamma * sum(low_accumulator)); low_kernel[global_index] = low_kernel; global_index += get_global_size(0); }

Listing 5: Using OpenCL images to load data

Listing 5 uses OpenCL images for all loads to the heavily reused low and high vectors. However, using images comes at a price: data must be placed in images

explicitly, and the cost for doing this is much higher than writing to standard OpenCL buffers.We discuss efficient means for solving this problem in the next section.

### Writing to Images

In our application, the high and low vectors change every iteration, which means we need to update the images which contain them. There is an OpenCL API call which can copy an OpenCL buffer, already resident on the device, to an OpenCL Image on the device: **clEnqueueCopyBufferToImage**. However, this function does more work than is strictly necessary for our purposes, and consequently incurs very large overhead. Since we’re calling this function thousands of times in an optimization loop, we must avoid incurring such overhead, or else our overall application will run slowly, regardless of how well we have optimized the kernels.

Instead of using clEnqueueCopyBufferToImage, we can write our own routine that performs the same functionality, since images in OpenCL can be written to directly. We show this routine in Listing 6.

__kernel void copyToImages(__global float *data, __const int pitch, __const int nPoints, __const int nDim, int high, int low, __write_only image2d_t high_image, __write_only image2d_t low_image) { int id = get_global_id(0); if (id < nDim) { __global float* high_pointer = data + high * 4 + id * pitch; float4 high_data = vload4(0, high_pointer); int2 coord = (int2)(id, 1); write_imagef(high_image, coord, high_data); __global float* low_pointer = data + high * 4 + id * pitch; float4 low_data = vload4(0, low_pointer); write_imagef(low_image, coord, low_data); } }

Listing 6: Routine to copy from OpenCL Buffers to OpenCL Images

By invoking this kernel to copy the correct high and low vectors into OpenCL Images every iteration, we incur some overhead, but about 100x less overhead than using the direct OpenCL API call incurs. Next, let’s look at performance with and without this overhead.

### Performance

Figure 7: RBF Kernel Performance using Images

Looking at the computation alone, using images improves our performance by 75%, to 150.9 SPGFLOP/s for large problems. However, to practically use images in this application, we have to include the extra overhead of initializing the images at every iteration, which reduces performance to 115.6 SPGFLOP/s, on average. Still, this is 35% better than our vectorized code without images, and brings us to 50% of our performance bound.

### Local Memory

In this application, we know exactly what data is going to be shared between work-items. Additionally, the size of the data is limited – since most SVM problems have feature space dimensionality which is only in the hundreds to low thousands, we can fit the high and low vectors in on-chip local memory directly. In Listing 7, we allocate space in local memory to contain feature vectors of up to 1000 dimensions.

float sum(float4 in) { return dot(in, (float4)(1.0f, 1.0f, 1.0f, 1.0f)); } __kernel void svm_kernel(__global float* data, __const int pitch, __const int nPoints, __const int nDim, __const int high, __const int low, __const float gamma, __global float* high_kernel, __global float* low_kernel ) { __local float4 l_high[256]; __local float4 l_low[256]; int local_index = get_local_id(0); if(local_index < nDim) { l_high[local_index] = vload4(high, data + pitch * local_index); l_low[local_index] = vload4(low, data + pitch * local_index); } barrier(CLK_LOCAL_MEM_FENCE); int global_index = get_global_id(0); __global float* x_i = data + global_index * 4; float4 high_accumulator = 0; float4 low_accumulator = 0; for(int d = 0; d < nDim; d++) { float4 x_i_d = vload4(0, x_i); float4 x_high_d = l_high[d]; float4 high_diff = x_i_d - x_high_d; high_accumulator += high_diff * high_diff; float4 x_low_d = l_low[d]; float4 low_diff = x_i_d - x_low_d; low_accumulator += low_diff * low_diff; x_i += pitch; } float high_kernel = exp(-gamma * sum(high_accumulator)); high_kernel[global_index] = high_kernel; float low_kernel = exp(-gamma * sum(low_accumulator)); low_kernel[global_index] = low_kernel; global_index += get_global_size(0); }

Listing 7: Using Local Memory to hold reused data

This code is very similar to the code in Listing 4, with the exception that we manually copy over the input vectors into OpenCL Local memory before proceeding.It’s

important to synchronize after we finish loading the data into Local memory, to ensure that all work-items finish loading their data, and that all the loads to local memory are visible to all work-items in the work group. Once we’ve done this, the rest of the computation can proceed using only on-chip memory for the high and low vectors.

### Performance

Figure 8: RBF Kernel Performance using Local Memory

Performance using local memory is still further improved, by 50% compared to our solution using Images. On average large problems, we attain 173.6 SPGFLOP/s, which is 75% of our performance bound.Although further improvement is probably still possible, we know that it will yield diminishing returns compared to our implementation, since we have gotten fairly close to our performance bound.

### Arg min & Arg max Reduction

In order to implement the SMO algorithm, we also need an arg min and arg max reduction.The arg min reduction finds the index of the smallest element in an array, along with the element itself, and the arg max reduction is the same, except it finds the largest element in an array.For example, a min reduction on the array (4.0, 2.0, 5.0, 6.0) would return the value “2.0”, but an arg min reduction on the same array would return the value “2.0”, at index position “1”.

To implement these reductions efficiently, the strategies we outlined our article on simple reductions apply. There are two extra things which need to be done to get good performance. Firstly, use the “Structure of Arrays” format to represent the array and its accompanying indices.In other words, instead of constructing a unified array of (index, value) pairs (the “Array of Structures” format), create two separate arrays, one for the indices, and one for the values.This simplifies memory indexing and generally improves efficiency. Secondly, in order to avoid introducing extraneous control flow into the reduction tree, which can reduce performance, it’s better to use the OpenCL select statement to perform the arg min and arg max operations. For example, the following code snippet uses the C ternary operator ?: to perform an arg min operation on a float4 vector.

float4 a, b;

int4 a_idx, b_idx;

int4 less_than = is_less(a, b);

float4 min = less_than ? a : b;

int4 min_idx = less_than ? a_idx : b_idx;

This code is functionally correct, but is more difficult to compile, and can therefore reduce performance. Instead of using control flow, one can use the OpenCL select intrinsic, which yields optimal performance

float4 a, b;

int4 a_idx, b_idx;

int4 less_than = is_less(a, b);

float4 min = select(b, a, less_than);

int4 min_idx = select(b_idx, a_idx, less_than);

Following these guidelines results in an arg min reduction which performs essentially identically with the simpler min reduction. In our application, we need to perform both an arg min and an arg max operation across the vector.We investigated improving performance by fusing both reductions into a single

OpenCL kernel, but found that the increased complexity of such a kernel outweighed the performance benefits. Specifically, fusing the two reductions together reduced performance by 13%, compared with using two separate reductions.

Figure 9: Min versus Arg min versus Fused Arg Extrema reduction performance

As shown in Figure 9, fusing both the arg min and arg max reductions into a single reduction was counterproductive. The implementation was significantly more complex, and the performance was slightly worse than just performing two simpler reductions.

The bandwidth limited bound on this computation is 39.4 GigaReductions/second, since the device has 157.6 GB/s, and each element of the vector being reduced is 4 bytes, the fastest we could go is 157.6/4=39.4 GigaReduction/second.

For large size problems, we’re performing 33.4 GigaReductions/second, which is 84% of our bandwidth limited bound, and tells us we are close to optimal.

### Conclusion

In this article, we’ve discussed implementation strategies for the main computational portions of SVM training. As is often the case in OpenCL optimization, getting good performance requires careful attention to the memory subsystem. In particular, we found that ensuring contiguous off-chip memory accesses

by choosing the column-major data layout yielded important performance benefits, and making use of on-chip memory resources was also crucial to performance. Following these guidelines, we took our initial RBF kernel evaluation routine from 34 to 174 SPGFLOP/s, achieving 75% of the theoretical memory-limited performance.We also showed how to make the more complicated reduction routines needed for SVM training perform well.

When you optimize your own OpenCL computations, remember these strategies:

- Try to figure out what the performance limits of your

application are on the hardware you’re targeting.

That way you’ll be able to tell how much better you could

potentially get .Memory

bandwidth bounds are often easy to derive and so they’re a good

place to start, although other constraints can also be

important. - Reorganize your data structures so that adjacent work-items

access adjacent words in memory, if possible.

For some work loads this is easier than others, but it

can be a very important optimization.

For two dimensional data structures, think about whether

a row-major or column-major layout would be most efficient for

your problem. - Use on-chip memories as effectively as possible.

Local memory is generally the fastest, but since you have

to explicitly fetch the data into the local memory, it can be

tougher to use than images. - Use Structure of Arrays format rather than Array of Structures

format, as a general rule of thumb.

OpenCL programming can be very exciting – small changes in your code can lead to big performance wins, which means that a little thought about how your code will execute in OpenCL can lead to big performance payoffs.

Best of luck!

**Bibliography**

*Support vector machine*. http://en.wikipedia.org/wiki/Support_vector_machine.

*OpenCL and the Open CL logo are trademarks of Apple Inc. used by permission by Khronos.*