Step 3 – C++ Bindings

In this write-up, we will look at using the C++ bindings for OpenCL™. I prefer them because, among other things, they make error checking easier and the code cleaner. They are available at the Khronos OpenCL API Registry. They are thin wrappers around the OpenCL API; you can easily look up the underlying OpenCL call from the source in cl.hpp.

We have made only incremental and necessary changes to the previous files, so you can do a diff to get a quick view of what code has changed and how.

The First Kernel

The kernel remains exactly the same as before.
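For reference, the work each kernel invocation performs can be sketched as a host-side C++ equivalent (this is not the actual kernel source; the signature is inferred from the kernel arguments set later in this write-up, and the function name is illustrative):

```cpp
#include <vector>

// Host-side sketch of the per-pixel convolution the "Convolve" kernel
// computes: output pixel (x, y) is the weighted sum of a
// nFilterWidth x nFilterWidth window of the input image.
float ConvolvePixel(const std::vector<float>& input, int nInWidth,
                    const std::vector<float>& filter, int nFilterWidth,
                    int x, int y)
{
    float sum = 0.0f;
    for (int r = 0; r < nFilterWidth; ++r)
        for (int c = 0; c < nFilterWidth; ++c)
            sum += filter[r * nFilterWidth + c] *
                   input[(y + r) * nInWidth + (x + c)];
    return sum;
}
```

In the OpenCL kernel, x and y come from get_global_id(0) and get_global_id(1), so each work-item computes one output pixel.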

Initialize OpenCL

All OpenCL initialization happens in the InitCL() function.

1.       We will create a context of a particular device type (in this case, a CPU) and check whether the operation was successfully executed.

 cl_context context = clCreateContextFromType(…,CL_DEVICE_TYPE_CPU,…);
cl::Context context = cl::Context(CL_DEVICE_TYPE_CPU);

2.       We ask for a list of the devices present. My four-core AMD Phenom™ processor is returned as one device, not four.

 size_t listSize;
 /* First, get the size of the device list */
 clGetContextInfo(context, CL_CONTEXT_DEVICES, …, &listSize);
 /* Now, allocate the device list */
 cl_device_id *devices = (cl_device_id *)malloc(listSize);
 /* Next, get the device list data */
 clGetContextInfo(context, CL_CONTEXT_DEVICES, listSize, devices, …);
std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

3.       Next, we create a command queue: kernel executions and buffer read-write operations will be enqueued to this queue.

 cl_command_queue queue = clCreateCommandQueue(context, devices[0], …);
cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);

4.       Now read the CL kernel file into a string (alternatively, just store the kernel as a string in your code), and create a program from that source string.

 cl_program program = clCreateProgramWithSource(context, 1, &source, …);
cl::Program program = cl::Program(context, …);
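Reading the CL kernel file into a string, as mentioned above, can be done with standard iostreams. A minimal sketch (the helper name is ours, not part of the bindings):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Read an entire file (e.g. the .cl kernel source) into a std::string.
std::string ReadSourceFile(const std::string& path)
{
    std::ifstream file(path);
    std::ostringstream buffer;
    buffer << file.rdbuf();   // stream the whole file into the buffer
    return buffer.str();
}
```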

Build the program.

 clBuildProgram(program, 1, devices, …);

And, finally, we create the kernel named “Convolve”.

 cl_kernel kernel = clCreateKernel(program, "Convolve", …);
cl::Kernel kernel = cl::Kernel(program, "Convolve");

If we encounter compilation or other errors when we are building the program, we can get the error messages using the following.

 clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, …);
std::string str = program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);

Initialize OpenCL Buffers

Just as in the C code, we will need to create three buffers – for the input image, the output image, and the convolution filter. Since we are executing the kernel on the CPU, we will simply use the host buffer (we pass the host buffer pointer and use the CL_MEM_USE_HOST_PTR flag).

 cl_mem inputCL = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, host-buffer-size, host-buffer-ptr, …);
cl::Buffer inputCL = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, host-buffer-size, host-buffer-ptr);

If the device is a GPU (CL_DEVICE_TYPE_GPU), we can explicitly copy data to the input image buffer on the device:

 clEnqueueWriteBuffer(queue, inputCL, …, host-buffer-ptr, …);
queue.enqueueWriteBuffer(inputCL, …, host-buffer-ptr, …);

And copy back from the output image buffer after the convolution kernel execution.

 clEnqueueReadBuffer(queue, outputCL, …, host-buffer-ptr, …);
queue.enqueueReadBuffer(outputCL, …, host-buffer-ptr, …);

Execute OpenCL Kernel

Now we are all set to run the kernel on the OpenCL device. Instead of invoking the kernel like a function call, our kernel execution will be a two-step process:

1.       Set the arguments for the kernel call. We will need to initialize the values of the arguments to be passed to the kernel. This step is required before the first kernel execution; the arguments do not need to be set again unless their values change. Compare the following with the kernel definition presented at the beginning.

 /* input buffer, arg 0 */
 clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputCL);
kernel.setArg(0, inputCL);

 /* filter buffer, arg 1 */
 clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&filterCL);
kernel.setArg(1, filterCL);

 /* output buffer, arg 2 */
 clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&outputCL);
kernel.setArg(2, outputCL);

 /* input image width, arg 3 */
 clSetKernelArg(kernel, 3, sizeof(int), (void *)&nInWidth);
kernel.setArg(3, nInWidth);

 /* filter width, arg 4 */
 clSetKernelArg(kernel, 4, sizeof(int), (void *)&nFilterWidth);
kernel.setArg(4, nFilterWidth);

2.       Enqueue the kernel to the command queue. It will be executed on the OpenCL device and the results will be available in the output buffer for readback. We will need to pass in the dimensionality of the data buffer (which is two, images are 2D). We also need the total number of items to be computed, which in our case is the number of pixels in the output buffer (let’s say the image is 2048×2048).

 clEnqueueNDRangeKernel(queue, kernel, data-dimensionality, …, total-work-size, work-group-size, …);
queue.enqueueNDRangeKernel(kernel, …, total-work-size, work-group-size, …);

In the above call, we also need to pass in a workgroup size. During computation, items within a work-group can share certain data and use synchronization mechanisms that are not available to items in different workgroups. We do not need any of those features in our current kernel, so it is tempting to use a workgroup of size 1.


While that will work in principle and produce correct results, it can hurt performance badly. There are many considerations in choosing an appropriate workgroup size, including which device (CPU or GPU) the kernel is to be run on. We will not go into those details in this write-up; for our runs on the CPU device, we will use the largest possible workgroup size (32×32).
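As a quick sanity check on the numbers above (a host-side sketch; the helper is ours): under OpenCL 1.x, each dimension of the global work size must be a multiple of the corresponding work-group dimension, and a 32×32 group over a 2048×2048 range launches 64×64 work-groups.

```cpp
#include <cstddef>

// For one NDRange dimension, OpenCL 1.x requires the global size to be a
// multiple of the work-group size.  Returns the number of work-groups in
// that dimension, or 0 if the sizes do not divide evenly.
size_t WorkGroupsPerDim(size_t globalSize, size_t groupSize)
{
    return (globalSize % groupSize == 0) ? globalSize / groupSize : 0;
}
```

With the 2048×2048 example and a 32×32 workgroup, this gives 64 groups per dimension (64×64 = 4,096 work-groups in total), versus over four million one-item groups if the workgroup size were 1.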

Release OpenCL Buffers

The buffers will be released at the end of the program.


Shutdown OpenCL

Before the program is terminated, all OpenCL resources will be released.
(With the C++ bindings, the wrapper objects are reference counted and release their underlying OpenCL resources automatically in their destructors.)

 clReleaseKernel(kernel);
 clReleaseProgram(program);
 clReleaseCommandQueue(queue);
 clReleaseContext(context);


The timing code is similar to what we used to time the C code in the previous write-up. We add a queue.finish() call before starting the timer and again before stopping it – this ensures that we time the kernel execution to completion and nothing else.

 queue.finish();
 // Timer started here
 for (int i = 0; i < nIterations; i++)
     queue.enqueueNDRangeKernel(…);
 queue.finish();
 // Timer stopped here
 // Average time = elapsed time / nIterations
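The timer itself can be implemented with std::chrono. A sketch (the helper name is ours; in the real code the loop body would be the queue.enqueueNDRangeKernel call shown above, bracketed by queue.finish()):

```cpp
#include <chrono>

// Time an operation over nIterations and return the average time per
// iteration, in seconds.
template <typename Fn>
double AverageSeconds(int nIterations, Fn&& op)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < nIterations; ++i)
        op();
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / nIterations;
}
```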



For this discussion, we are using a computer with a 4-core AMD Phenom™ X4 9950 Black Edition processor and 8GB RAM. AMD’s OpenCL implementation exposes this quad-core CPU as a single OpenCL device.*

This is a graph showing the computation time for an output image of size 8192×8192. For a filter of width fw, the input image size is (8192+fw-1)×(8192+fw-1). We can see from the kernel code that, for each pixel, the loop runs (filterWidth)² times. This is in accordance with the graph – the computation time increases, more or less, as the square of the filter width. It takes about 14.54s for a 20×20 filter, while it takes about 3.73s for a 10×10 filter.


OpenCL™ is a trademark of Apple Inc. used under license to the Khronos Group Inc. *Configuration: AMD Phenom™ X4 9950 Black Edition processor (2.6GHz), 8GB RAM, Windows Vista™ 32-bit, ATI Stream SDK v2.0 Beta, Visual Studio 2008.
