Step 2 – CL

Udeepta Bordoloi, ATI Stream Application Engineer  10/13/2009 

Let us now walk through the convolution code for OpenCL™. First, we will take a look at the kernel code, and then we will walk through the runtime code that will cause the kernel to run.

We have made only incremental changes to the previous files, so you can do a diff to get a quick view of what additional code is present.

The First Kernel

Our OpenCL convolution kernel is an almost exact replica of the C code for convolution. The major difference is that in the OpenCL kernel, we do not need the two outer for-loops that iterate over the output image. Instead, the variables xOut and yOut will be initialized by the get_global_id() call. Also, the variables nWidth and nHeight are left out; they are readily available via the get_global_size() call.

__kernel void Convolve(const __global float * pInput,
                       __constant float * pFilter,
                       __global float * pOutput,
                       const int nInWidth,
                       const int nFilterWidth)
{
    const int nWidth = get_global_size(0);

    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);

    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;

    float sum = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;

        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;

        for (int c = 0; c < nFilterWidth; c++)
        {
            const int idxF  = idxFtmp + c;
            const int idxIn = idxIntmp + c;
            sum += pFilter[idxF] * pInput[idxIn];
        }
    } //for (int r = 0...

    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
}

Initialize OpenCL

All OpenCL initialization happens in the InitCL() function.

1.       We will create a context of a particular device type (in this case, a CPU) and check whether the operation was successfully executed.

 cl_context context = clCreateContextFromType(…,CL_DEVICE_TYPE_CPU,…); 


2.       We ask for a list of the devices present. My four-core AMD Phenom™ processor is returned as one device, not four.

 size_t listSize;

 /* First, get the size of device list */
 clGetContextInfo(context, CL_CONTEXT_DEVICES, …, &listSize);

 /* Now, allocate the device list */
 cl_device_id *devices = (cl_device_id *)malloc(listSize);

 /* Next, get the device list data */
 clGetContextInfo(context, CL_CONTEXT_DEVICES, listSize, devices, …);

3.       Next, we create a command queue: kernel executions and buffer read-write operations will be enqueued to this queue.

 cl_command_queue queue = clCreateCommandQueue(context, devices[0], …); 

4.       Now read the CL kernel file into a string (alternatively, just store the kernel as a string in your code), and create a program from that source string.

 cl_program program = clCreateProgramWithSource(context, 1, &source, …); 

Build the program.

 clBuildProgram(program, 1, devices, …); 

And, finally, create the kernel named “Convolve”.

 cl_kernel kernel = clCreateKernel(program, "Convolve", …); 

If we encounter compilation or other errors when we are building the program, we can get the error messages using the following.

 clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, …); 

Initialize OpenCL Buffers

Just as in the C code, we will need to create three buffers – for the input image, the output image, and the convolution filter. Since we are executing the kernel on the CPU, we will simply use the host buffer (we pass the host buffer pointer and use the CL_MEM_USE_HOST_PTR flag).

 cl_mem inputCL = clCreateBuffer(context,
                                 CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                 host-buffer-size, host-buffer-ptr, …);

If the device is a GPU (CL_DEVICE_TYPE_GPU), we can explicitly copy data to the input image buffer on the device:

 clEnqueueWriteBuffer(queue, inputCL, …, host-buffer-ptr, …); 

And copy back from the output image buffer after the convolution kernel execution.

 clEnqueueReadBuffer(queue, outputCL, …, host-buffer-ptr, …); 

Execute OpenCL Kernel

Now we are all set to run the kernel on the OpenCL device. Instead of invoking the kernel like a function call, our kernel execution will be a two-step process:

1.       Set the arguments for the kernel call. We will need to initialize the values of the arguments to be passed to the kernel execution. This step is required before the first kernel execution, and the arguments do not need to be set again unless their values change. Compare the following with the kernel definition presented at the beginning.

 /* input buffer, arg 0 */
 clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputCL);

 /* filter buffer, arg 1 */
 clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&filterCL);

 /* output buffer, arg 2 */
 clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&outputCL);

 /* input image width, arg 3 */
 clSetKernelArg(kernel, 3, sizeof(int), (void *)&nInWidth);

 /* filter width, arg 4 */
 clSetKernelArg(kernel, 4, sizeof(int), (void *)&nFilterWidth);

2.       Enqueue the kernel to the command queue. It will be executed on the OpenCL device, and the results will be available in the output buffer for readback. We will need to pass in the dimensionality of the data buffer (which is two, since images are 2D). We also need the total number of work-items to be computed, which in our case is the number of pixels in the output buffer (let’s say the image is 2048×2048).

 clEnqueueNDRangeKernel(queue, kernel, data-dimensionality, …, total-work-size, work-group-size, …); 

In the above call, we also need to pass in a workgroup size. During computation, items within a workgroup can share certain data and take advantage of some synchronization mechanisms that are not available to items across workgroups. We do not need any of those features in our current kernel, so it is tempting to use a workgroup of size 1.


While that will work in principle and produce correct results, it can lead to poor performance. There are many considerations when choosing an appropriate workgroup size, including which device (CPU or GPU) the kernel is to be run on. We will not go into those details in this write-up; for our runs on the CPU device, we will use the largest possible workgroup size (32×32).

Release OpenCL Buffers

When we do not need them anymore, we will release the three buffers we created.


Shutdown OpenCL

Before the program is terminated, we also want to release all other OpenCL resources that we acquired.

 clReleaseKernel(kernel); clReleaseProgram(program); clReleaseCommandQueue(queue); clReleaseContext(context); 


The timing code will be similar to what was used to time the C code in the previous write-up. We add a clFinish() call before starting the timer and another before stopping it; this ensures that we time the kernel execution to its completion and nothing else.

 clFinish(queue);
 //Timer started here
 for (int i = 0; i < nIterations; i++)
     clEnqueueNDRangeKernel(…);
 clFinish(queue);
 //Timer stopped here
 //Average time = ElapsedTime()/nIterations


For this discussion, we are using a computer with a 4-core AMD Phenom™ X4 9950 Black Edition processor and 8GB RAM. AMD’s OpenCL implementation exposes this quad-core CPU as a single OpenCL device.*

The accompanying graph shows the computation time for an output image of size 8192×8192. For a filter of width fw, the input image size is (8192+fw−1)×(8192+fw−1). We can see from the kernel code that, for each pixel, the inner loops run (filterWidth)² times. This agrees with the graph: the computation time increases, more or less, as the square of the filter width. It takes about 14.54s for a 20×20 filter, while it takes about 3.73s for a 10×10 filter.

Error Checking

In the attached code, the OpenCL calls are interspersed with error-checking code. In the next write-up, we will take a look at the C++ bindings for OpenCL, which, among other things, make error checking easier.

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc. *Configuration: AMD Phenom™ X4 9950 Black Edition processor (2.6GHz), 8GB RAM, Windows Vista™ 32-bit, ATI Stream SDK v2.0 Beta, Visual Studio 2008.


2010 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon, AMD Turion, AMD Sempron, AMD Phenom, ATI Radeon, Catalyst, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.
