In this article, we examine key kernels utilized in a
Quadratic Programming solver for Support Vector Machine training.
We optimize the evaluation of the Radial Basis Function SVM kernel by
examining a variety of different data structures as well as their performance
implications, improving performance by a factor of 5 compared to naive code
running on an AMD Radeon™ HD 5870 GPU. We
discuss general rules of thumb that lead to efficient data structures for OpenCL
computation.
Introduction
Support Vector Machines (SVMs) are a widely used binary
classification technique used to analyze data, or in other words, to label data
as belonging to one of two categories.
For example, SVMs can be used to determine whether email is important or
unsolicited, whether an image contains a flower or not, or whether a DNA
sequence codes for a particular trait, etc.
Researchers in many fields, such as image, speech and gesture
recognition, bioinformatics, and computer security have applied SVMs to enable
programs to analyze data.
In order to use an SVM, it must first be trained, so that
the classifier can “learn” the categories it distinguishes between.
This training process is computationally intensive.
In this article, we’ll discuss the issues that arise when building an
iterative solver for Support Vector Machine training, running on AMD GPUs, using
OpenCL. Most of the article will
focus on the proper use of on-chip and off-chip memory resources of the AMD
Radeon HD 5870 GPU. The data
structures we choose determine how efficiently we can use our memory resources,
so this article examines several ways of laying out data structures and using
them in OpenCL code, in order to achieve good performance.
Support Vector Machine Training
To begin this discussion, first we’ll briefly describe the
problem we’re trying to solve. As
previously mentioned, Support Vector Machines are a widely used classification
technique. Support Vector Machines
process data as points in some abstract, highly dimensional feature space.
For example, I could describe objects by measuring their exterior
dimensions (length, width, height), as well as their mass.
Each object would then be described by four numbers, which we can think
of as a four-dimensional vector in some feature space.
Typical SVM applications use feature spaces with tens to hundreds of
dimensions, or sometimes more.
Given a series of measurements of objects, along with their
associated labels, I could then train an SVM to distinguish between two
different classes of object, for example: between objects made of wood, and
those of stone. Since wood objects
would tend to reside in a certain region of my four dimensional feature space,
and stone objects would tend to reside in another region, we can draw a boundary
in our feature space that separates the wood objects from the stone objects.
Of course, there are many such boundaries we could draw, but we want to
learn the most general boundary we can, so that our classifier will be robust.
Training an SVM is a specific approach for learning the boundary which
separates our training data with maximal margin, meaning that the training data
is as clearly classified as possible. Then, a program can classify new data by
simply checking to see which side of the boundary the new data falls on.
SVMs can be trained using a Quadratic Programming
optimization process called Sequential Minimal Optimization.
In this algorithm, we take many tiny steps to update the boundary between
our two object classes. This is done
by finding the two training points which are explained most poorly by our
current guess of what the boundary is, and then updating the boundary to fit to
those points better.

Figure 1: A
boundary separating pictures of plants from pictures of people
The main computational load in this algorithm is choosing
the two most poorly explained points out of our training set, which may be
hundreds of thousands of points in total.
This involves evaluating a distance function between all our training
points and two of the training points.
This distance function, called a kernel function, can take many forms,
but one of the most useful is the so-called Radial Basis Function:.
Our job, then, is to evaluate many copies of this function, for our many
training points.
Initial Analysis
Let’s look at a simple C version of the Radial Basis
Function evaluations. We’re going to
evaluate all data points against two of the data points, one called
high and the other called
low.
We’ll store the results in two arrays, one called
high_kernel and the other called
low_kernel.
We have
nPoints training points, each of which is a sample from an
nDim
dimensional feature space. We’ll
store the data points in an array called
data, where each row represents one training
point, and so the resulting matrix has nPoints rows and nDim columns.
For now, we’ll store the matrix using the standard C “row-major” data
layout, which we’ll discuss in more detail later in the article.
gamma is the
scalar parameter from our Radial Basis Function.
Listing 1 shows some simple C code to accomplish this.
void svm_kernel(float* data,
int nPoints,
int nDim,
int high,
int low,
float gamma,
float* high_kernel,
float* low_kernel) {
for(int index = 0; index < nPoints; index++) {
float high_accumulator = 0;
float low_accumulator = 0;
for(int d = 0; d < nDim; d++) {
float x_i_d = data[index * nDim + d];
float x_high_d = data[high * nDim + d];
float x_low_d = data[low * nDim + d];
float high_diff = x_i_d – x_high_d;
high_accumulator += high_diff * high_diff;
float low_diff = x_i_d – x_low_d;
low_accumulator += low_diff * low_diff;
}
float high_result = exp(-gamma * high_accumulator);
high_kernel[index] = high_result;
float low_result = exp(-gamma * low_accumulator);
low_kernel[index] = low_result;
}
}
Listing 1:
Simple C code for RBF Kernel Evaluations
In this code, we’ve fused both the high and low kernel
evaluations together into one inner loop, in order to avoid the overhead of
loading the vector twice for each iteration of the outer loop.
We can see the row-major data layout in the indexing arithmetic in the
inner loop: to load data[row, column] or equivalently data[index, d] in listing
1, we index into data[index * nDim + d], where nDim is the number of columns in
each row. We’ll discuss this in
detail later.
We can also see that in the inner loop, there are 6
floating-point operations and three memory loads.
However, since the high and low vectors are the same for
every iteration of the outer loop, if we write our code correctly, we’ll only
have to load them once into on-chip memories, and then we’ll reuse them without
having to execute an off-chip memory load in the inner loop.
The code in listing 1 does not do this, it is an example of a simple C
implementation without any optimizations.
Later, we’ll see how to implement this optimization to avoid wasting
memory bandwidth.
We can use this information to compute a bound on our
expected performance of this code, running on the AMD Radeon HD 5870.
The Radeon HD 5870 has 153.6 GB/s of off-chip memory bandwidth.
Assuming perfect caching of the high and low vectors, we should see 153.6
GB/second * 6 Single Precision Floating Point Operations/Memory load * 1 Memory
load/4 Bytes = 230.4 Single Precision Floating Point Operations/second.
We’ll use this bound to evaluate the quality of our implementations as we
optimize.
Basic OpenCL Implementation
Translating the C code directly to OpenCL, we will
instantiate a work-item for each iteration of the outermost loop.
Listing 2 shows a direct OpenCL translation of our simple C
implementation from Listing 1.
__kernel
void svm_kernel(__global float* data,
__const int nPoints,
__const int nDim,
__const int high,
__const int low,
__const float gamma,
__global float* high_kernel,
__global float* low_kernel
) {
int global_index = get_global_id(0);
float high_accumulator = 0;
float low_accumulator = 0;
for(int d = 0; d < nDim; d++) {
float x_i_d = data[global_index * nDim + d];
float x_high_d = data[high * nDim + d];
float high_diff = x_i_d - x_high_d;
high_accumulator += high_diff * high_diff;
float x_low_d = data[low * nDim + d];
float low_diff = x_i_d - x_low_d;
low_accumulator += low_diff * low_diff;
}
float high_kernel = exp(-gamma * high_accumulator);
high_kernel[global_index] = high_kernel;
float low_kernel = exp(-gamma * low_accumulator);
low_kernel[global_index] = low_kernel;
}
Listing 2:
Direct OpenCL translation of simple C code for RBF kernel evaluation
Comparing Listing 2 to Listing 1, we can see that we have
taken the outermost loop and turned it into an OpenCL kernel, where the index
space of the OpenCL kernel corresponds to the iterations of the outermost loop.
Performance

Figure 2: Row
Major RBF Kernel Evaluation Performance
Figure 2 shows the performance we achieve with this
implementation, with data of dimensionality 1000.
For larger problems, we achieve 34 Single Precision Giga Floating Point
Operations/Second (SPGFLOP/s).
Unfortunately, this is only 15% of our bound.
Clearly, there are some things we should investigate to improve
performance.
Row-major versus Column-major
Since we are clearly bottlenecked by memory performance, we
should examine our choice of data structures.
Earlier, we decided to keep our training data matrix in the default C
row-major layout. Row-major layout
represents a two-dimensional matrix as a one-dimensional array by scanning
across the rows, like so:

Figure 3: Row
Major Layout, color indicates access pattern
Let’s envision how this data structure is being used in our
program. We’re instantiating a
work-item for each row of the matrix, and each work-item will then iterate
through the columns of the matrix.
Consider the very first iteration of the innermost loop for the example matrix
illustrated in Figure 3: work-item 0 will load element 0 from the data array,
and work-item 1 will load element 3 from the data array.
During the next iteration, work-item 0 will load element 1, and work-item
1 will load element 4, etc. This
pattern is color coded in Figure 3.
With this data layout, as the computation proceeds, the
work-items will not be accessing contiguous locations in off-chip memory.
This causes inefficiencies, since the memory subsystems of all modern
processors are optimized for loading contiguous vectors from off-chip memory.
We can change this behavior by restructuring our data, in this case
moving from a “row-major” layout to a “column-major” layout.
Column-major layout represents a two-dimensional matrix as
a one dimensional array by scanning across the columns, like so:

Figure 4:
Column Major Layout, color indicates access pattern
With column-major layout, adjacent work-items will access
adjacent items in memory as the computation proceeds, as shown in Figure 4.
This improves efficiency.
This is a simple change in our OpenCL code.
For an array with nDim elements in each row and nPoints rows, in order to
index element data[row, col], we change from row-major indexing: data[row * nDim
+ col] to column major-indexing: [column * nPoints + row].
We will also make an additional optimization by padding the data to
ensure that our loads are always aligned (see the OpenCL Optimization Case
Study: Diagonal Sparse Matrix Vector Multiplication article for more details).
This will change our indexing arithmetic to the following: data[row, col]
is found by indexing data[column * pitch + row], where pitch is nPoints rounded
up to the nearest multiple of 32 floats, which is the preferred alignment pitch
for the AMD Radeon 5870 GPU. We’ll
also simplify the indexing arithmetic to stride directly through memory, rather
than recompute the index anew every iteration.
This code is shown in Listing 3.
void svm_kernel(__global float* data,
__const int pitch,
__const int nPoints,
__const int nDim,
__const int high,
__const int low,
__const float gamma,
__global float* high_kernel,
__global float* low_kernel
) {
int global_index = get_global_id(0);
__global float* x_i = data + global_index;
float high_accumulator = 0;
float low_accumulator = 0;
__global float* x_high = data + high;
__global float* x_low = data + low;
for(int d = 0; d < nDim; d++) {
float x_i_d = *x_i;
float x_high_d = *x_high;
float high_diff = x_i_d - x_high_d;
high_accumulator += high_diff * high_diff;
float x_low_d = *x_low;
float low_diff = x_i_d - x_low_d;
low_accumulator += low_diff * low_diff;
x_i += pitch;
x_high += pitch;
x_low += pitch;
}
float high_result = exp(-gamma * high_accumulator);
high_kernel[global_index] = high_result;
float low_result = exp(-gamma * low_accumulator);
low_kernel[global_index] = low_result;
global_index += get_global_size(0);
}
Listing
3:
Column-major data storage
Comparing the code in Listing 3 to the code in Listing 2,
we can see that the only change we’ve made is in how we’ve accessed the data
structures. Instead of striding by 1
element we loop through the dimensions of the data, we now stride by
pitch elements. This simple
change has a big impact on performance.
Figure 5:
Column-major Performance
Figure 5 shows the performance improvement we see when
moving to a column-major data layout.
For larger problem sizes, the average performance climbs from 33.7 to
65.7 SPFLOP/s, which is almost a factor of 2 improvement.
Comparing to our performance bound, we see we have achieved 29% of our
bound. Perhaps there is yet room for
improvement.
Vectorization
AMD GPUs (and CPUs, for that matter) are most efficient
when each work-item operates on vectors of data, rather than just scalars.
We can vectorize the loop by performing this transformation.
Listing 4 shows how to accomplish this.
float sum(float4 in) {
return dot(in, (float4)(1.0f, 1.0f, 1.0f, 1.0f));
}
__kernel
void svm_kernel(__global float* data,
__const int pitch,
__const int nPoints,
__const int nDim,
__const int high,
__const int low,
__const float gamma,
__global float* high_kernel,
__global float* low_kernel
) {
int global_index = get_global_id(0);
__global float* x_i = data + global_index * 4;
float4 high_accumulator = 0;
float4 low_accumulator = 0;
__global float* x_high = data + high;
__global float* x_low = data + low;
for(int d = 0; d < nDim; d++) {
float4 x_i_d = vload4(0, x_i);
float4 x_high_d = vload4(0, x_high);
float4 high_diff = x_i_d - x_high_d;
high_accumulator += high_diff * high_diff;
float4 x_low_d = vload4(0, x_low);
float4 low_diff = x_i_d - x_low_d;
low_accumulator += low_diff * low_diff;
x_i += pitch;
x_high += pitch;
x_low += pitch;
}
float high_kernel = exp(-gamma * sum(high_accumulator));
high_kernel[global_index] = high_kernel;
float low_kernel = exp(-gamma * sum(low_accumulator));
low_kernel[global_index] = low_kernel;
}
Listing
4:
Vectorized Computation
In Listing 4, we have changed the computation so that each
work-item works on 4 data elements at a time.
We use vector loads and stores from memory, via the vload4 intrinsic
provided by OpenCL We also use
OpenCL’s dot intrinsic to execute a horizontal sum operation across the vector
in each work-item.
Performance

Figure
6:
Vectorized Performance
This transformation improves average performance by 30%, to
85.9 SPGFLOP/s, which is 37% of our bound.
Now that we have improved execution efficiency by vectorization, perhaps
we should revisit our memory traffic to reduce off-chip memory loads.
Using OpenCL Images
OpenCL images enable the use of on-chip memories for
read-only data. Since the two
vectors which are shared by all work-items are in fact read-only data, perhaps
it makes sense to use OpenCL’s image facilities to use the GPU’s on-chip
memories. To do this, we represent
each vector as a two-dimensional image, where each pixel holds four floating
point values, and there is just one row of pixels in the image.
In the OpenCL code, the high and low vectors are now represented by
image2d_t types, which encapsulate references to the high and low vectors,
rather than indices into the data matrix.
To initialize the proper image sampler for this application, we note that
we do not want to use floating-point image addressing
(CLK_NORMALIZED_COORDS_FALSE), we want to clamp accesses to the image to the
image boundaries (CLK_ADDRESS_CLAMP), and we want to grab the nearest pixel to
the coordinates we specify (CLK_FILTER_NEAREST).
Reads from the high and low vectors are then performed via read_imagef()
calls. Listing 5 shows what this
looks like.
float sum(float4 in) {
return dot(in, (float4)(1.0f, 1.0f, 1.0f, 1.0f));
}
__kernel
void svm_kernel(__global float* data,
__const int pitch,
__const int nPoints,
__const int nDim,
__read_only image2d_t high,
__read_only image2d_t low,
__const float gamma,
__global float* high_kernel,
__global float* low_kernel
) {
const sampler_t smp =
CLK_NORMALIZED_COORDS_FALSE |
CLK_ADDRESS_CLAMP |
CLK_FILTER_NEAREST;
int global_index = get_global_id(0);
__global float* x_i = data + global_index * 4;
float4 high_accumulator = 0;
float4 low_accumulator = 0;
for(int d = 0; d < nDim; d++) {
float4 x_i_d = vload4(0, x_i);
int2 coord = (int2)(d, 0);
float4 x_high_d = read_imagef(high, smp, coord);
float4 high_diff = x_i_d - x_high_d;
high_accumulator += high_diff * high_diff;
float4 x_low_d = read_imagef(low, smp, coord);
float4 low_diff = x_i_d - x_low_d;
low_accumulator += low_diff * low_diff;
x_i += pitch;
}
float high_kernel = exp(-gamma * sum(high_accumulator));
high_kernel[global_index] = high_kernel;
float low_kernel = exp(-gamma * sum(low_accumulator));
low_kernel[global_index] = low_kernel;
global_index += get_global_size(0);
}
Listing 5:
Using OpenCL images to load data
Listing 5 uses OpenCL images for all loads to the heavily
reused low and high vectors.
However, using images comes at a price: data must be placed in images
explicitly, and the cost for doing this is much higher than writing to standard
OpenCL buffers. We discuss efficient
means for solving this problem in the next section.
Writing to Images
In our application, the high and low vectors change every
iteration, which means we need to update the images which contain them.
There is an OpenCL API call which can copy an OpenCL buffer, already
resident on the device, to an OpenCL Image on the device:
clEnqueueCopyBufferToImage. However, this function does
more work than is strictly necessary for our purposes, and consequently incurs
very large overhead. Since we’re
calling this function thousands of times in an optimization loop, we must avoid
incurring such overhead, or else our overall application will run slowly,
regardless of how well we have optimized the kernels.
Instead of using
clEnqueueCopyBufferToImage, we can write our own routine that performs the same
functionality, since images in OpenCL can be written to directly.
We show this routine in Listing 6.
__kernel
void copyToImages(__global float *data,
__const int pitch,
__const int nPoints,
__const int nDim,
int high,
int low,
__write_only image2d_t high_image,
__write_only image2d_t low_image) {
int id = get_global_id(0);
if (id < nDim) {
__global float* high_pointer = data + high * 4 + id * pitch;
float4 high_data = vload4(0, high_pointer);
int2 coord = (int2)(id, 1);
write_imagef(high_image, coord, high_data);
__global float* low_pointer = data + high * 4 + id * pitch;
float4 low_data = vload4(0, low_pointer);
write_imagef(low_image, coord, low_data);
}
}
Listing 6:
Routine to copy from OpenCL Buffers to OpenCL Images
By invoking this kernel to copy the correct high and low
vectors into OpenCL Images every iteration, we incur some overhead, but about
100x less overhead than using the direct OpenCL API call incurs.
Next, let’s look at performance with and without this overhead.
Performance

Figure 7: RBF
Kernel Performance using Images
Looking at the computation alone, using images improves our
performance by 75%, to 150.9 SPGFLOP/s for large problems.
However, to practically use images in this application, we have to
include the extra overhead of initializing the images at every iteration, which
reduces performance to 115.6 SPGFLOP/s, on average.
Still, this is 35% better than our vectorized code without images, and
brings us to 50% of our performance bound.
Local Memory
In this application, we know exactly what data is going to
be shared between work-items.
Additionally, the size of the data is limited – since most SVM problems have
feature space dimensionality which is only in the hundreds to low thousands, we
can fit the high and low vectors in on-chip local memory directly.
In Listing 7, we allocate space in local memory to contain feature
vectors of up to 1000 dimensions.
float sum(float4 in) {
return dot(in, (float4)(1.0f, 1.0f, 1.0f, 1.0f));
}
__kernel
void svm_kernel(__global float* data,
__const int pitch,
__const int nPoints,
__const int nDim,
__const int high,
__const int low,
__const float gamma,
__global float* high_kernel,
__global float* low_kernel
) {
__local float4 l_high[256];
__local float4 l_low[256];
int local_index = get_local_id(0);
if(local_index < nDim) {
l_high[local_index] = vload4(high, data + pitch * local_index);
l_low[local_index] = vload4(low, data + pitch * local_index);
}
barrier(CLK_LOCAL_MEM_FENCE);
int global_index = get_global_id(0);
__global float* x_i = data + global_index * 4;
float4 high_accumulator = 0;
float4 low_accumulator = 0;
for(int d = 0; d < nDim; d++) {
float4 x_i_d = vload4(0, x_i);
float4 x_high_d = l_high[d];
float4 high_diff = x_i_d - x_high_d;
high_accumulator += high_diff * high_diff;
float4 x_low_d = l_low[d];
float4 low_diff = x_i_d - x_low_d;
low_accumulator += low_diff * low_diff;
x_i += pitch;
}
float high_kernel = exp(-gamma * sum(high_accumulator));
high_kernel[global_index] = high_kernel;
float low_kernel = exp(-gamma * sum(low_accumulator));
low_kernel[global_index] = low_kernel;
global_index += get_global_size(0);
}
Listing 7:
Using Local Memory to hold reused data
This code is very similar to the code in Listing 4, with
the exception that we manually copy over the input vectors into OpenCL Local
memory before proceeding. It’s
important to synchronize after we finish loading the data into Local memory, to
ensure that all work-items finish loading their data, and that all the loads to
local memory are visible to all work-items in the work group.
Once we’ve done this, the rest of the computation can proceed using only
on-chip memory for the high and low vectors.
Performance

Figure 8: RBF
Kernel Performance using Local Memory
Performance using local memory is still further improved,
by 50% compared to our solution using Images.
On average large problems, we attain 173.6 SPGFLOP/s, which is 75% of our
performance bound. Although further
improvement is probably still possible, we know that it will yield diminishing
returns compared to our implementation, since we have gotten fairly close to our
performance bound.
Arg min & Arg max Reduction
In order to implement the SMO algorithm, we also need an
arg min and arg max reduction. The
arg min reduction finds the index of the smallest element in an array, along
with the element itself, and the arg max reduction is the same, except it finds
the largest element in an array. For
example, a min reduction on the array (4.0, 2.0, 5.0, 6.0) would return the
value “2.0”, but an arg min reduction on the same array would return the value
“2.0”, at index position “1”.
To implement these reductions efficiently, the strategies
we outlined our article on simple reductions apply.
There are two extra things which need to be done to get good performance.
Firstly, use the “Structure of Arrays” format to represent the array and
its accompanying indices. In other
words, instead of constructing a unified array of (index, value) pairs (the
“Array of Structures” format), create two separate arrays, one for the indices,
and one for the values. This
simplifies memory indexing and generally improves efficiency.
Secondly, in order to avoid introducing extraneous control flow into the
reduction tree, which can reduce performance, it’s better to use the OpenCL
select statement to perform the arg min and arg max operations.
For example, the following code snippet uses the C ternary operator ?: to
perform an arg min operation on a float4 vector.
float4 a, b;
int4 a_idx, b_idx;
int4 less_than = is_less(a, b);
float4 min = less_than ? a : b;
int4 min_idx = less_than ? a_idx : b_idx;
This code is functionally correct, but is more difficult to
compile, and can therefore reduce performance.
Instead of using control flow, one can use the OpenCL select intrinsic,
which yields optimal performance
float4 a, b;
int4 a_idx, b_idx;
int4 less_than = is_less(a, b);
float4 min = select(b, a, less_than);
int4 min_idx = select(b_idx, a_idx, less_than);
Following these guidelines results in an arg min reduction
which performs essentially identically with the simpler min reduction.
In our application, we need to perform both an arg min and an arg max
operation across the vector. We
investigated improving performance by fusing both reductions into a single
OpenCL kernel, but found that the increased complexity of such a kernel
outweighed the performance benefits.
Specifically, fusing the two reductions together reduced performance by 13%,
compared with using two separate reductions.

Figure 9: Min
versus Arg min versus Fused Arg Extrema reduction performance
As shown in Figure 9, fusing both the arg min and arg max
reductions into a single reduction was counterproductive.
The implementation was significantly more complex, and the performance
was slightly worse than just performing two simpler reductions.
The bandwidth
limited bound on this computation is 39.4 GigaReductions/second, since the
device has 157.6 GB/s, and each element of the vector being reduced is 4 bytes,
the fastest we could go is 157.6/4=39.4 GigaReduction/second.
For large size problems, we’re performing 33.4
GigaReductions/second, which is 84% of our bandwidth limited bound, and tells us
we are close to optimal.
Conclusion
In this article, we’ve discussed implementation strategies
for the main computational portions of SVM training.
As is often the case in OpenCL optimization, getting good performance
requires careful attention to the memory subsystem.
In particular, we found that ensuring contiguous off-chip memory accesses
by choosing the column-major data layout yielded important performance benefits,
and making use of on-chip memory resources was also crucial to performance.
Following these guidelines, we took our initial RBF kernel evaluation
routine from 34 to 174 SPGFLOP/s, achieving 75% of the theoretical
memory-limited performance. We also
showed how to make the more complicated reduction routines needed for SVM
training perform well.
When you optimize your own OpenCL computations, remember
these strategies:
-
Try to figure out what the performance limits of your
application are on the hardware you’re targeting.
That way you’ll be able to tell how much better you could
potentially get . Memory
bandwidth bounds are often easy to derive and so they’re a good
place to start, although other constraints can also be
important.
-
Reorganize your data structures so that adjacent work-items
access adjacent words in memory, if possible.
For some work loads this is easier than others, but it
can be a very important optimization.
For two dimensional data structures, think about whether
a row-major or column-major layout would be most efficient for
your problem.
-
Use on-chip memories as effectively as possible.
Local memory is generally the fastest, but since you have
to explicitly fetch the data into the local memory, it can be
tougher to use than images.
-
Use Structure of Arrays format rather than Array of Structures
format, as a general rule of thumb.
OpenCL programming can be very exciting – small changes in
your code can lead to big performance wins, which means that a little thought
about how your code will execute in OpenCL can lead to big performance payoffs.
Best of luck!
Bibliography
Support vector
machine. http://en.wikipedia.org/wiki/Support_vector_machine.