Image Convolution Using OpenCL™ – A Step-by-Step Tutorial Step 3

Step 3 – C++ Bindings
   

In this write-up, we will look at using the C++ bindings for OpenCL™. I prefer them because, among other things, they make error checking easier and the code cleaner. They are available at the Khronos OpenCL API Registry. They are thin wrappers around the OpenCL C API, so you can easily look up the underlying OpenCL call in the cl.hpp source.
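To make the error-checking point concrete, here is a minimal sketch of the kind of status check that tends to follow every call in the C API. The helper name checkStatus is hypothetical and not part of OpenCL; a plain int stands in for cl_int (a 32-bit signed integer, with CL_SUCCESS equal to 0) so that the sketch compiles without the OpenCL headers.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of OpenCL): turn a C-API status code into
// an exception. `status` plays the role of a cl_int; 0 means CL_SUCCESS.
void checkStatus(int status, const char* what)
{
    assert(what != nullptr);
    if (status != 0) {
        std::ostringstream message;
        message << what << " failed with OpenCL error code " << status;
        throw std::runtime_error(message.str());
    }
}
```

With the C API, a call such as checkStatus(err, "clCreateContextFromType") would be repeated after each function. With the C++ bindings, defining __CL_ENABLE_EXCEPTIONS before including cl.hpp makes failing calls throw cl::Error instead, so a single try/catch block can cover the whole initialization.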

We have made only incremental and necessary changes to the previous files, so you can do a diff to get a quick view of what code has changed and how.

The First Kernel

The kernel remains exactly the same as before.

Initialize OpenCL

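All OpenCL initialization happens in the InitCL() function. Where both forms are shown below, the C API call from the previous write-up comes first, followed by its C++ bindings equivalent.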

1.       We will create a context of a particular device type (in this case, a CPU) and check whether the operation was successfully executed.

 cl_context context = clCreateContextFromType(…,CL_DEVICE_TYPE_CPU,…);

cl::Context context = cl::Context(CL_DEVICE_TYPE_CPU);

2.       We ask for a list of the devices present. My four-core AMD Phenom™ processor is returned as one device, not four.

 size_t listSize;
 /* First, get the size of device list */
 clGetContextInfo(context, CL_CONTEXT_DEVICES, …, &listSize);
 /* Now, allocate the device list */
 cl_device_id *devices = (cl_device_id *)malloc(listSize);
 /* Next, get the device list data */
 clGetContextInfo(context, CL_CONTEXT_DEVICES, listSize, devices, …);

std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

3.       Next, we create a command queue: kernel executions and buffer read-write operations will be enqueued to this queue.

 cl_command_queue queue = clCreateCommandQueue(context, devices[0], …);

cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);

4.       Now read the CL kernel file into a string (alternatively, just store the kernel as a string in your code), and create a program from that source string.

 cl_program program = clCreateProgramWithSource(context, 1, &source, …); 

cl::Program program = cl::Program(context, …);

Build the program.

 clBuildProgram(program, 1, devices, …); 

program.build(devices);

And finally, create the kernel named “Convolve”.

 cl_kernel kernel = clCreateKernel(program, "Convolve", …);

cl::Kernel kernel = cl::Kernel(program, "Convolve");

If we encounter compilation or other errors when we are building the program, we can get the error messages using the following.

 clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, …);

std::string str = program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
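Step 4 above reads the kernel file into a string before creating the program. A minimal sketch of that read, using only the standard library, might look like the following; the function name readTextFile is a hypothetical helper, not something taken from the tutorial's source package.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: read an entire text file (for example an OpenCL
// kernel file) into a string. Throws std::runtime_error if it cannot be opened.
std::string readTextFile(const std::string& path)
{
    assert(!path.empty());
    std::ifstream file(path.c_str(), std::ios::in | std::ios::binary);
    if (!file) {
        throw std::runtime_error("cannot open kernel file: " + path);
    }
    std::ostringstream contents;
    contents << file.rdbuf();  // copy the whole file into the string stream
    return contents.str();
}
```

The returned string must stay alive for as long as the program object is being created from it, since the OpenCL call reads the source through a pointer into that string.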

Initialize OpenCL Buffers

Just as in the C code, we will need to create three buffers – for the input image, the output image, and the convolution filter. Since we are executing the kernel on the CPU, we will simply use the host buffer (we pass the host buffer pointer and use the CL_MEM_USE_HOST_PTR flag).

 cl_mem inputCL = clCreateBuffer(context,  CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, host-buffer-size, host-buffer-ptr, …);

cl::Buffer inputCL = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, host-buffer-size, host-buffer-ptr);
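The host-buffer-size placeholder above depends on the image and filter dimensions. As a rough sketch, the three sizes could be computed as below, using the relation from the Performance section that the input is (output width + filter width - 1) pixels wide. The struct and function names are hypothetical, and one float per pixel and per filter coefficient is an assumption, since the kernel is not reproduced in this step.

```cpp
#include <cassert>
#include <cstddef>

// Sizes, in bytes, of the three buffers for a square image and square filter.
// Assumption: one float per pixel and per filter coefficient.
struct BufferSizes {
    std::size_t inputBytes;
    std::size_t filterBytes;
    std::size_t outputBytes;
};

BufferSizes computeBufferSizes(std::size_t outWidth, std::size_t filterWidth)
{
    assert(outWidth > 0 && filterWidth > 0);
    // The input carries a border so that every output pixel sees a full
    // filterWidth x filterWidth window.
    const std::size_t inWidth = outWidth + filterWidth - 1;
    BufferSizes sizes;
    sizes.inputBytes  = inWidth * inWidth * sizeof(float);
    sizes.filterBytes = filterWidth * filterWidth * sizeof(float);
    sizes.outputBytes = outWidth * outWidth * sizeof(float);
    return sizes;
}
```

For the 8192×8192 output with a 20×20 filter used later, the input side is 8211 pixels, so the input buffer is slightly larger than the output buffer.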

If the device is a GPU (CL_DEVICE_TYPE_GPU), we can explicitly copy data to the input image buffer on the device:

 clEnqueueWriteBuffer(queue, inputCL, …, host-buffer-ptr, …);

queue.enqueueWriteBuffer(inputCL, …, host-buffer-ptr, …);

We then copy the result back from the output image buffer after the convolution kernel has executed.

 clEnqueueReadBuffer(queue, outputCL, …, host-buffer-ptr, …);

queue.enqueueReadBuffer(outputCL, …, host-buffer-ptr, …);

Execute OpenCL Kernel

Now we are all set to run the kernel on the OpenCL device. Instead of invoking the kernel like a function call, kernel execution is a two-step process:

1.       Set the arguments for the kernel call. We need to initialize the values of the arguments to be passed to the kernel. The arguments must be set before the first kernel execution, and they do not need to be set again unless their values change. Compare the following with the kernel definition from the previous write-up.

 /* input buffer, arg 0 */
 clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputCL);
 kernel.setArg(0, inputCL);

 /* filter buffer, arg 1 */
 clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&filterCL);
 kernel.setArg(1, filterCL);

 /* output buffer, arg 2 */
 clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&outputCL);
 kernel.setArg(2, outputCL);

 /* input image width, arg 3 */
 clSetKernelArg(kernel, 3, sizeof(int), (void *)&nInWidth);
 kernel.setArg(3, nInWidth);

 /* filter width, arg 4 */
 clSetKernelArg(kernel, 4, sizeof(int), (void *)&nFilterWidth);
 kernel.setArg(4, nFilterWidth);

2.       Enqueue the kernel to the command queue. It will be executed on the OpenCL device, and the results will be available in the output buffer for readback. We need to pass in the dimensionality of the data buffer (two, since images are 2D) and the total number of work-items to be computed, which in our case is the number of pixels in the output buffer (let’s say the image is 2048×2048).

 clEnqueueNDRangeKernel(queue, kernel, data-dimensionality, …, total-work-size, work-group-size, …);

queue.enqueueNDRangeKernel(kernel, …, total-work-size, work-group-size, …);

In the above call, we also need to pass in a work-group size. During computation, items within a work-group can share certain data and use synchronization mechanisms that are not available to items in different work-groups. We do not need any of those features in our current kernel, so it is tempting to use a work-group of size 1.

While that will work in principle and produce correct results, it can perform poorly. There are many considerations in choosing an appropriate work-group size, including which device (CPU or GPU) the kernel is to be run on. We will not go into those details in this write-up; for our runs on the CPU device, we will use the largest possible work-group size (32×32).
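To get a feel for the difference between the two choices, the sketch below counts how many work-groups a 2D range is split into. It also includes a rounding helper, because in OpenCL 1.x the global work size in each dimension must be a multiple of the work-group size. Both function names are hypothetical and are not part of the tutorial's source.

```cpp
#include <cassert>
#include <cstddef>

// Round `value` up to the next multiple of `multiple`.
std::size_t roundUp(std::size_t value, std::size_t multiple)
{
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

// Number of work-groups launched for a 2D range. The global size must be
// evenly divisible by the work-group size in each dimension.
std::size_t workGroupCount(std::size_t globalX, std::size_t globalY,
                           std::size_t localX, std::size_t localY)
{
    assert(localX > 0 && localY > 0);
    assert(globalX % localX == 0 && globalY % localY == 0);
    return (globalX / localX) * (globalY / localY);
}
```

For the 2048×2048 image, a 32×32 work-group gives 4,096 work-groups, while a work-group of size 1 gives 4,194,304 of them. If an image size is not a multiple of the work-group size, the global size can be padded with roundUp, in which case the kernel would need a bounds check to skip the padded items.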

Release OpenCL Buffers

The buffers will be released at the end of the program.

 clReleaseMemObject(inputCL);

Shutdown OpenCL

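Before the program is terminated, all OpenCL resources will be released. The calls below belong to the C version; with the C++ bindings, the wrapper objects release their underlying OpenCL handles in their destructors.

 clReleaseKernel(kernel);
 clReleaseProgram(program);
 clReleaseCommandQueue(queue);
 clReleaseContext(context);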

Timing

The timing code will be similar to what was used to time the C code in the previous write-up. We add a finish() call on the queue (the C++ counterpart of clFinish()) both before starting and before stopping the timer; this ensures that we time the kernel execution to its completion and nothing else.

 queue.finish();
 // Timer started here

 for (int i = 0; i < nIterations; i++)
     queue.enqueueNDRangeKernel(…);

 queue.finish();
 // Timer stopped here
 // Average time = ElapsedTime() / nIterations
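The timing pattern above can be sketched as a small reusable function. This version uses std::chrono as the timer, which is a substitution on my part since the tutorial's own timer is not shown here, and it takes the enqueue and finish steps as callables so that it does not depend on the OpenCL headers. The name averageSeconds is hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Average wall-clock seconds per call of `enqueue`. `finish` is called once
// before the timer starts and once before it stops, as in the pattern above.
template <typename Enqueue, typename Finish>
double averageSeconds(Enqueue enqueue, Finish finish, int nIterations)
{
    assert(nIterations > 0);
    finish();  // drain earlier work so it is not counted
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    for (int i = 0; i < nIterations; i++) {
        enqueue();
    }
    finish();  // wait until every enqueued run has completed
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / nIterations;
}
```

In the tutorial's setting, the first callable would wrap the queue.enqueueNDRangeKernel(…) call and the second would wrap queue.finish(). Without the second finish(), the timer would mostly measure how long it takes to enqueue the work rather than to run it.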

Performance

 

For this discussion, we are using a computer with a 4-core AMD Phenom™ X4 9950 Black Edition processor and 8GB RAM. AMD’s OpenCL implementation exposes this quad-core CPU as a single OpenCL device.*

This graph shows the computation time for an output image of size 8192×8192. For a filter of width fw, the input image size is (8192+fw-1)×(8192+fw-1). We can see from the kernel code that, for each pixel, the loop runs filterWidth² times. This agrees with the graph: the computation time grows roughly as the square of the filter width. It takes about 14.54s for a 20×20 filter and about 3.73s for a 10×10 filter, a ratio of about 3.9, close to the factor of 4 expected from doubling the filter width.
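The quadratic cost is easy to see in a host-side reference version of the computation, which can also be used to check the OpenCL output on small images. The sketch below is not the tutorial's kernel; it assumes row-major float data, a window anchored at the top-left of each output pixel, and the input size relation stated above. The function name convolveReference is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for a square convolution without padding.
// Assumed layout: row-major floats; the input is
// (outWidth + filterWidth - 1) pixels wide and high.
std::vector<float> convolveReference(const std::vector<float>& input,
                                     const std::vector<float>& filter,
                                     std::size_t outWidth,
                                     std::size_t filterWidth)
{
    const std::size_t inWidth = outWidth + filterWidth - 1;
    assert(input.size() == inWidth * inWidth);
    assert(filter.size() == filterWidth * filterWidth);

    std::vector<float> output(outWidth * outWidth, 0.0f);
    for (std::size_t y = 0; y < outWidth; ++y) {
        for (std::size_t x = 0; x < outWidth; ++x) {
            float sum = 0.0f;
            // filterWidth * filterWidth multiply-adds per output pixel
            for (std::size_t r = 0; r < filterWidth; ++r) {
                for (std::size_t c = 0; c < filterWidth; ++c) {
                    sum += filter[r * filterWidth + c] *
                           input[(y + r) * inWidth + (x + c)];
                }
            }
            output[y * outWidth + x] = sum;
        }
    }
    return output;
}
```

The two inner loops perform filterWidth² multiply-adds for each output pixel, so doubling the filter width roughly quadruples the work, which matches the timings quoted above.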

  • » Source Code Package for this Step

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

*Configuration: AMD Phenom™ X4 9950 Black Edition processor (2.6GHz), 8GB RAM, Windows Vista™ 32-bit, ATI Stream SDK v2.0 Beta, Visual Studio 2008.

2010 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon, AMD Turion, AMD Sempron, AMD Phenom, ATI Radeon, Catalyst, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners. This website may be linked to other websites which are not in the control of and are not maintained by AMD. AMD is not responsible for the content of those sites. AMD provides these links to you only as a convenience, and the inclusion of any link to such sites does not imply endorsement by AMD of those sites. AMD reserves the right to terminate any link or linking program at any time.
