
Step 1 – CPU


This is a step-by-step tutorial series aimed at beginner to intermediate OpenCL users. I use image convolution as the running example to walk through an initial OpenCL program, OpenCL with C++ bindings, and a few optimizations. The source code for each step is available at the bottom of the page. Each step's code is an incremental addition to the previous step's code, so the changes introduced in a particular step are easy to find. If you spot errors or have suggestions for improvement, please drop us a line. (Note: ATI Stream Technology is now called AMD Accelerated Parallel Processing (APP) Technology.)

Convolution is a fundamental operation in signal and image processing, among other fields. If you already know it, feel free to jump ahead. If you are hearing about it for the first time, convolution is a process that combines one signal with a second signal to produce a third signal. Frequently, we say that an input signal is convolved with a mask (also called a filter) to produce the output (filtered) signal.

The Algorithm

We will go over the algorithm with a specific example (see figure below). Here we have an 8×8 signal (i.e., the input image) and we want to convolve it with a 3×3 signal (i.e., the mask or filter).


Figure 1. Convolution of an 8×8 image with a 3×3 filter to yield a 6×6 output image.

Each sample (pixel) of the resultant signal (i.e., the output image) is generated by

(a) Placing the filter over the input image, centered at the corresponding pixel location

(b) Weighting (multiplying) all the input image pixels covered by the filter with the corresponding filter values, and

(c) Accumulating (adding) all the results of step (b).

We perform these steps for all pixels in the output image. For now, we will ignore pixel locations near the boundary, where the filter placed at the equivalent location in the input image would extend beyond the image boundary. Since the filter is square (filter_width by filter_width), the output image dimensions are therefore (input_image_width – filter_width + 1) by (input_image_height – filter_width + 1). In the example, there are 6×6 locations where the 3×3 filter can be placed over the 8×8 input image without extending beyond the image bounds (8 – 3 + 1 = 6), so the output image has 6×6 pixels.
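Steps (a) to (c) and the size rule can also be written compactly. The symbols here are introduced only for illustration: I is the input image, F the square filter of width w_f, and O the output image. Indices are zero-based and the filter's top-left corner is aligned with the output pixel, which is the same indexing the C code in this article uses.

```latex
\[
O(y, x) = \sum_{r=0}^{w_f-1} \sum_{c=0}^{w_f-1} F(r, c)\, I(y + r,\; x + c),
\qquad 0 \le x < W_{out},\quad 0 \le y < H_{out}
\]
\[
W_{out} = W_{in} - w_f + 1, \qquad H_{out} = H_{in} - w_f + 1
\]
```

For the example in Figure 1, W_in = H_in = 8 and w_f = 3, which gives W_out = H_out = 6.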

C code for convolution is given below. Here the output image dimensions are nWidth by nHeight, the input image width is nInWidth (equal to nWidth + nFilterWidth – 1), and the input image height is nHeight + nFilterWidth – 1.

void Convolve(float * pInput, float * pFilter, float * pOutput,
              const int nInWidth, const int nWidth, const int nHeight,
              const int nFilterWidth, const int nNumThreads)
{
    for (int yOut = 0; yOut < nHeight; yOut++)
    {
        const int yInTopLeft = yOut;
        for (int xOut = 0; xOut < nWidth; xOut++)
        {
            const int xInTopLeft = xOut;
            float sum = 0;
            for (int r = 0; r < nFilterWidth; r++)
            {
                const int idxFtmp = r * nFilterWidth;
                const int yIn = yInTopLeft + r;
                const int idxIntmp = yIn * nInWidth + xInTopLeft;
                for (int c = 0; c < nFilterWidth; c++)
                {
                    const int idxF = idxFtmp + c;
                    const int idxIn = idxIntmp + c;
                    sum += pFilter[idxF]*pInput[idxIn];
                } //for (int c = 0…
            } //for (int r = 0…
            const int idxOut = yOut * nWidth + xOut;
            pOutput[idxOut] = sum;
        } //for (int xOut = 0…
    } //for (int yOut = 0…
}
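To make the indexing concrete, here is a minimal sketch that runs the same loop nest on a tiny input. The function ConvolveExample and the test data (a 4×4 input holding 0 to 15 and a 3×3 filter of ones) are assumptions made for illustration and are not part of the tutorial source. Convolve is repeated so the block compiles on its own.

```cpp
#include <cassert>
#include <vector>

// Same loop nest as the tutorial's Convolve. nNumThreads is unused here
// because this sketch is single-threaded.
void Convolve(float * pInput, float * pFilter, float * pOutput,
              const int nInWidth, const int nWidth, const int nHeight,
              const int nFilterWidth, const int nNumThreads)
{
    (void)nNumThreads;
    for (int yOut = 0; yOut < nHeight; yOut++)
    {
        for (int xOut = 0; xOut < nWidth; xOut++)
        {
            float sum = 0;
            for (int r = 0; r < nFilterWidth; r++)
            {
                const int idxFtmp = r * nFilterWidth;
                const int idxIntmp = (yOut + r) * nInWidth + xOut;
                for (int c = 0; c < nFilterWidth; c++)
                {
                    sum += pFilter[idxFtmp + c] * pInput[idxIntmp + c];
                }
            }
            pOutput[yOut * nWidth + xOut] = sum;
        }
    }
}

// Hypothetical helper: convolves a 4x4 input (values 0..15, row-major)
// with a 3x3 filter of ones and returns the 2x2 output.
std::vector<float> ConvolveExample()
{
    const int nFilterWidth = 3;
    const int nInWidth = 4;
    const int nInHeight = 4;
    const int nWidth = nInWidth - nFilterWidth + 1;    // 2
    const int nHeight = nInHeight - nFilterWidth + 1;  // 2
    assert(nWidth > 0 && nHeight > 0);

    std::vector<float> input(nInWidth * nInHeight);
    for (int i = 0; i < nInWidth * nInHeight; i++)
    {
        input[i] = static_cast<float>(i);
    }
    std::vector<float> filter(nFilterWidth * nFilterWidth, 1.0f);
    std::vector<float> output(nWidth * nHeight, 0.0f);

    Convolve(input.data(), filter.data(), output.data(),
             nInWidth, nWidth, nHeight, nFilterWidth, 1);
    return output;  // {45, 54, 81, 90}
}
```

With a filter of ones, each output pixel is simply the sum of the 3×3 input window whose top-left corner sits at that pixel, which makes the result easy to check by hand.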


Key parameters have been grouped together into a structure:

struct paramStruct
{
    int nWidth;         //Output image width
    int nHeight;        //Output image height
    int nInWidth;       //Input image width
    int nInHeight;      //Input image height
    int nFilterWidth;   //Filter size is nFilterWidth X nFilterWidth
    int nIterations;    //Run timing loop for nIterations
    //Test CPU performance with 1,4,8 etc. OpenMP threads
    std::vector<int> ompThreads;
    int nOmpRuns;       //ompThreads.size()
    bool bCPUTiming;    //Time CPU performance
} params;
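As a small illustration of how the size fields relate, the sketch below fills them from an output size and a filter width, using the relation stated earlier (input size = output size + filter width – 1). SetImageSize is a hypothetical helper, not part of the tutorial source, and the struct is repeated here only so the block compiles on its own.

```cpp
#include <cassert>
#include <vector>

struct paramStruct
{
    int nWidth;         // Output image width
    int nHeight;        // Output image height
    int nInWidth;       // Input image width
    int nInHeight;      // Input image height
    int nFilterWidth;   // Filter size is nFilterWidth X nFilterWidth
    int nIterations;    // Run timing loop for nIterations
    std::vector<int> ompThreads;  // Thread counts to test
    int nOmpRuns;       // ompThreads.size()
    bool bCPUTiming;    // Time CPU performance
};

// Hypothetical helper: sets the output size and filter width, then derives
// the input size needed so that the filter never leaves the input image.
void SetImageSize(paramStruct& p, const int nWidth, const int nHeight,
                  const int nFilterWidth)
{
    assert(nWidth > 0 && nHeight > 0 && nFilterWidth > 0);
    p.nWidth = nWidth;
    p.nHeight = nHeight;
    p.nFilterWidth = nFilterWidth;
    p.nInWidth = nWidth + nFilterWidth - 1;
    p.nInHeight = nHeight + nFilterWidth - 1;
}
```

For the example in Figure 1, a 6×6 output with a 3×3 filter gives an 8×8 input.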


There are command line options to set some parameters:

void Usage(char *name)
{
    printf("\tUsage: %s [-h] [-c] [-f ] [-i ] [-x ] [-y ]\n", name);
    printf(" -h Print this help menu.\n");
    printf(" -c Suppress CPU timing run.\n");
    printf(" -f Sets the filter width.\n");
    printf(" -i Number of iterations.\n");
    printf(" -x Sets the image width.\n");
    printf(" -y Sets the image height.\n");
}
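The parsing code itself is not shown in this article. One way such options could be read is sketched below, assuming that each value follows its flag as a separate argument (for example, -f 5). CmdOptions, ParseArgs, and the default values are hypothetical names and numbers chosen for illustration only.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical container for the options listed by Usage().
// The defaults are assumptions, not values from the tutorial source.
struct CmdOptions
{
    int  nFilterWidth = 3;
    int  nIterations  = 1;
    int  nWidth       = 8;
    int  nHeight      = 8;
    bool bCPUTiming   = true;   // -c turns this off
    bool bShowHelp    = false;  // -h turns this on
};

// Returns false on an unknown option or when a value is missing.
bool ParseArgs(const int argc, const char* const argv[], CmdOptions& opts)
{
    assert(argv != nullptr);
    for (int i = 1; i < argc; i++)
    {
        const char* arg = argv[i];
        if (std::strcmp(arg, "-h") == 0)
        {
            opts.bShowHelp = true;
        }
        else if (std::strcmp(arg, "-c") == 0)
        {
            opts.bCPUTiming = false;
        }
        else if (std::strlen(arg) == 2 && arg[0] == '-' &&
                 std::strchr("fixy", arg[1]) != nullptr)
        {
            if (i + 1 >= argc)
            {
                return false;  // flag given without a value
            }
            const int value = std::atoi(argv[++i]);
            switch (arg[1])
            {
                case 'f': opts.nFilterWidth = value; break;
                case 'i': opts.nIterations  = value; break;
                case 'x': opts.nWidth       = value; break;
                case 'y': opts.nHeight      = value; break;
            }
        }
        else
        {
            return false;  // unknown option
        }
    }
    return true;
}
```

A caller would typically print the usage text and exit when ParseArgs returns false or when bShowHelp is set.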


We will reuse the Timer files from ATI Stream SDK v1.4-beta. The idea is to run the convolution nIterations times and take the average time for one run.

//Timer Started here();
for (int i = 0; i < nIterations; i++)
{
    //Run one convolution per iteration
}
//Timer Stopped here();
//Average Time = ElapsedTime()/nIterations;
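The SDK Timer files are not reproduced here. As a stand-in, the same start, loop, stop, divide pattern can be sketched with std::chrono from the C++ standard library. AverageSeconds is a hypothetical helper and is not the Timer API used by the tutorial source.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: calls work() nIterations times and returns the
// average wall-clock time of one call, in seconds.
template <typename Work>
double AverageSeconds(Work work, const int nIterations)
{
    assert(nIterations > 0);
    const auto start = std::chrono::steady_clock::now();  // timer started
    for (int i = 0; i < nIterations; i++)
    {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();   // timer stopped
    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / nIterations;
}
```

The work argument can be any callable, for example a lambda that wraps one call to Convolve. Averaging over several iterations smooths out run-to-run noise in the measurement.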

Using OpenMP for multi-threaded comparison

We will use OpenMP to get multi-threaded performance of the convolution code. The convolution loop structure lends itself to easy OpenMP scaling – all we have to do is add one line of code before the outer loop with the number of threads over which we want to split the workload.

//This #pragma splits the work between multiple threads
#pragma omp parallel for num_threads(nNumThreads)
for (int yOut = 0; yOut < nHeight; yOut++)

We can time the OpenMP convolution performance with different numbers of threads. Set the thread counts to test in InitParams():

void InitParams(int argc, char* argv[])
{
    //Fill params.ompThreads with the thread counts to test before this line
    params.nOmpRuns = params.ompThreads.size();
}

If your compiler does not support OpenMP, you may need to comment out a few lines from the code.
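To show where the pragma sits in the full function, here is a minimal sketch. ConvolveOMP and CompareThreadCounts are hypothetical names, and the 8×8 test data is made up for illustration. Each output row is computed independently, so the result should not depend on the thread count. If the code is built without OpenMP enabled (for example, without -fopenmp on GCC), the pragma is ignored and the loop runs on one thread.

```cpp
#include <cassert>
#include <vector>

// Convolution loop nest with the outer (row) loop split across threads.
void ConvolveOMP(const float* pInput, const float* pFilter, float* pOutput,
                 const int nInWidth, const int nWidth, const int nHeight,
                 const int nFilterWidth, const int nNumThreads)
{
    assert(nNumThreads > 0);
    // This #pragma splits the rows between nNumThreads threads
    #pragma omp parallel for num_threads(nNumThreads)
    for (int yOut = 0; yOut < nHeight; yOut++)
    {
        for (int xOut = 0; xOut < nWidth; xOut++)
        {
            float sum = 0;
            for (int r = 0; r < nFilterWidth; r++)
            {
                for (int c = 0; c < nFilterWidth; c++)
                {
                    sum += pFilter[r * nFilterWidth + c] *
                           pInput[(yOut + r) * nInWidth + xOut + c];
                }
            }
            pOutput[yOut * nWidth + xOut] = sum;
        }
    }
}

// Hypothetical check: runs an 8x8 input through a 3x3 filter of ones with
// two different thread counts and reports whether the outputs match.
bool CompareThreadCounts(const int nThreadsA, const int nThreadsB)
{
    const int nFilterWidth = 3;
    const int nInWidth = 8;
    const int nInHeight = 8;
    const int nWidth = nInWidth - nFilterWidth + 1;    // 6
    const int nHeight = nInHeight - nFilterWidth + 1;  // 6

    std::vector<float> input(nInWidth * nInHeight);
    for (int i = 0; i < nInWidth * nInHeight; i++)
    {
        input[i] = static_cast<float>(i % 7);
    }
    const std::vector<float> filter(nFilterWidth * nFilterWidth, 1.0f);
    std::vector<float> outA(nWidth * nHeight, 0.0f);
    std::vector<float> outB(nWidth * nHeight, 0.0f);

    ConvolveOMP(input.data(), filter.data(), outA.data(),
                nInWidth, nWidth, nHeight, nFilterWidth, nThreadsA);
    ConvolveOMP(input.data(), filter.data(), outB.data(),
                nInWidth, nWidth, nHeight, nFilterWidth, nThreadsB);
    return outA == outB;
}
```

Parallelizing the outermost loop keeps each thread's work large and avoids any shared writes, since every thread writes to a different set of output rows.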


2010 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon, AMD Turion, AMD Sempron, AMD Phenom, ATI Radeon, Catalyst, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.
