Image Convolution Using OpenCL™ – A Step-by-Step Tutorial Step 4

 

Step 4 – Loop unrolling
   

We will now consider loop unrolling as a technique to reduce the computation time for convolution.

For a filter of width fw, the statements in the innermost loop run fw × fw times for each pixel in the output image. That is, the loop condition test and the ensuing branch also happen fw × fw times. While this cost may be tiny for small filters (only 4 iterations for a 2×2 filter), it becomes significant as the filter width increases (1024 iterations for a 32×32 filter). The solution is to reduce the loop count by unrolling. Unrolling speeds things up in most cases, but a gain is not guaranteed.
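The iteration counts above can be checked with a little arithmetic. The following is a minimal sketch in host-side C, not part of the tutorial's source package; the function names `trips_plain` and `trips_unrolled` are made up for illustration. It assumes the unrolled loop handles `unroll` filter taps per trip and that leftover taps are handled one per trip.

```c
/* Inner-loop trips per output pixel when each trip handles one filter tap. */
int trips_plain(int fw)
{
    return fw * fw;
}

/* Inner-loop trips per output pixel when each trip of the main loop handles
   `unroll` taps and a remainder loop handles the leftover taps one at a time.
   (Hypothetical helper, for illustration only.) */
int trips_unrolled(int fw, int unroll)
{
    int per_row = fw / unroll + fw % unroll;  /* main trips + leftover trips */
    return fw * per_row;                      /* fw rows per output pixel */
}
```

For a 32×32 filter, `trips_plain(32)` gives 1024 while `trips_unrolled(32, 4)` gives 256. For a 2×2 filter both give 4, since the unrolled loop body never runs.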

As before, we have made only incremental changes to the previous files, so you can do a diff to get a quick view of what additional code is present. The only significant change is the addition of new kernels to the kernels file.

Unrolled Loop Kernel

Following is the convolution kernel, with four iterations of the innermost loop unrolled. The two changes are that (a) the innermost loop statements are repeated four times, and (b) there is another loop at the end to handle the remaining iterations when the filter width is not an exact multiple of four.

__kernel void Convolve_Unroll(const __global float * pInput,
                              __constant float * pFilter,
                              __global float * pOutput,
                              const int nInWidth,
                              const int nFilterWidth)
{
    const int nWidth = get_global_size(0);

    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);

    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;

    float sum = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;

        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;

        int c = 0;
        while (c <= nFilterWidth-4)
        {
            int idxF  = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum += pFilter[idxF]*pInput[idxIn];
            idxF++; idxIn++;
            sum += pFilter[idxF]*pInput[idxIn];
            idxF++; idxIn++;
            sum += pFilter[idxF]*pInput[idxIn];
            idxF++; idxIn++;
            sum += pFilter[idxF]*pInput[idxIn];
            c += 4;
        }
        for (int c1 = c; c1 < nFilterWidth; c1++)
        {
            const int idxF  = idxFtmp + c1;
            const int idxIn = idxIntmp + c1;
            sum += pFilter[idxF]*pInput[idxIn];
        }
    } //for (int r = 0…

    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
}
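One way to gain confidence in an unrolled kernel is to compare it against a plain loop on the host. The sketch below is an assumption-laden CPU mirror of the kernel's per-pixel arithmetic, not code from the source package: the function names and the `xOut`/`yOut` parameters (standing in for `get_global_id`) are hypothetical. Both functions add the products in the same order, so their results should match exactly.

```c
/* Plain reference: one filter tap per inner-loop trip. */
float convolve_pixel_plain(const float *in, const float *filter,
                           int inWidth, int fw, int xOut, int yOut)
{
    float sum = 0.0f;
    for (int r = 0; r < fw; r++)
        for (int c = 0; c < fw; c++)
            sum += filter[r * fw + c] * in[(yOut + r) * inWidth + xOut + c];
    return sum;
}

/* Same arithmetic with the inner loop unrolled four times, followed by a
   remainder loop, mirroring the structure of Convolve_Unroll. */
float convolve_pixel_unroll4(const float *in, const float *filter,
                             int inWidth, int fw, int xOut, int yOut)
{
    float sum = 0.0f;
    for (int r = 0; r < fw; r++) {
        const int idxFtmp  = r * fw;                      /* start of filter row */
        const int idxIntmp = (yOut + r) * inWidth + xOut; /* start of input row  */

        int c = 0;
        while (c <= fw - 4) {
            sum += filter[idxFtmp + c]     * in[idxIntmp + c];
            sum += filter[idxFtmp + c + 1] * in[idxIntmp + c + 1];
            sum += filter[idxFtmp + c + 2] * in[idxIntmp + c + 2];
            sum += filter[idxFtmp + c + 3] * in[idxIntmp + c + 3];
            c += 4;
        }
        for (; c < fw; c++)                               /* leftover taps */
            sum += filter[idxFtmp + c] * in[idxIntmp + c];
    }
    return sum;
}
```

As in the kernel, the caller is assumed to pass an input image large enough that every index `(yOut + r) * inWidth + xOut + c` is in bounds.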

Performance

We compare the performance of the Convolve_Unroll kernel with the original Convolve kernel as the baseline.*

The results are as expected: the unrolled kernel does not improve the timing when the filter width is small, since unrolling saves only a few iterations. As the filter size grows, unrolling improves speed by as much as 20%.

Speed savings are greatest when the filter width is an integral multiple of four. This sawtooth behavior is due to the iterations that are left over after unrolling. The relative speeds for filter widths of 16 to 20 (i.e., filterWidth%4 is 0, 1, 2, 3, and 0) are 74%, 78%, 83%, 92%, and 79%. A good guess at this point is that the sawtooth behavior is related to the number of iterations of the following loop:

for (int c1 = c; c1 < nFilterWidth; c1++)
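To see why this loop is a plausible suspect, it helps to count how often it runs. The following sketch is illustrative only, and the function name is hypothetical. It counts the trips of the remainder loop per output pixel, assuming an unroll factor of four as in Convolve_Unroll.

```c
/* Trips of the remainder loop per output pixel: the loop runs
   (fw % 4) times per filter row, and there are fw rows. */
int remainder_trips_per_pixel(int fw)
{
    return fw * (fw % 4);
}
```

For filter widths 16 through 20 this gives 0, 17, 36, 57, and 0 trips, which rises and falls in the same pattern as the measured timings.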

Unrolled Loop Kernel (2)

Let us also unroll this second loop completely and replace it with an if-else (or a switch-case) statement.

__kernel void Convolve_UnrollIf(const __global float * pInput,
                                __constant float * pFilter,
                                __global float * pOutput,
                                const int nInWidth,
                                const int nFilterWidth)
{
    const int nWidth = get_global_size(0);

    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);

    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;

    float sum = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;

        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;

        int c = 0;
        while (c <= nFilterWidth-4)
        {
            int idxF  = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum += pFilter[idxF]*pInput[idxIn];
            idxF++; idxIn++;
            sum += pFilter[idxF]*pInput[idxIn];
            idxF++; idxIn++;
            sum += pFilter[idxF]*pInput[idxIn];
            idxF++; idxIn++;
            sum += pFilter[idxF]*pInput[idxIn];
            c += 4;
        }

        int cMod = nFilterWidth - c;
        if (cMod == 1)
        {
            int idxF  = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum += pFilter[idxF]*pInput[idxIn];
        }
        else if (cMod == 2)
        {
            int idxF  = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum += pFilter[idxF]*pInput[idxIn];
            sum += pFilter[idxF+1]*pInput[idxIn+1];
        }
        else if (cMod == 3)
        {
            int idxF  = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum += pFilter[idxF]*pInput[idxIn];
            sum += pFilter[idxF+1]*pInput[idxIn+1];
            sum += pFilter[idxF+2]*pInput[idxIn+2];
        }
    } //for (int r = 0…

    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
}
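The switch-case alternative mentioned above could look roughly like the following. This is a sketch in host-side C for a single filter row, not the tutorial's code; the function name `row_dot_unroll4_switch` is hypothetical. It relies on case fall-through so that each leftover tap is written only once.

```c
/* Dot product of n filter taps with n input samples: main loop unrolled by
   four, leftover taps (0 to 3 of them) handled by a fall-through switch. */
float row_dot_unroll4_switch(const float *f, const float *in, int n)
{
    float sum = 0.0f;
    int c = 0;
    while (c <= n - 4) {
        sum += f[c]     * in[c];
        sum += f[c + 1] * in[c + 1];
        sum += f[c + 2] * in[c + 2];
        sum += f[c + 3] * in[c + 3];
        c += 4;
    }
    switch (n - c) {                              /* number of leftover taps */
    case 3: sum += f[c + 2] * in[c + 2];          /* fall through */
    case 2: sum += f[c + 1] * in[c + 1];          /* fall through */
    case 1: sum += f[c]     * in[c];
    default: break;
    }
    return sum;
}
```

Note that the fall-through form adds the leftover products in reverse order compared with the if-else version, so floating-point results may differ in the last bits.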

Performance (2)

This removes the sawtooth behavior completely.* Interested readers may want to read about Duff’s device, which combines a switch statement with a loop structure in an unusual way that eliminates the need for a trailing if-else test for leftover iterations after the unrolled loop.
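For readers who have not seen it, here is a minimal sketch of Duff’s device applied to a dot product, written in standard C with a hypothetical function name. The switch jumps into the middle of the unrolled loop body, so the leftover iterations are consumed on the first pass and no trailing test is needed. Whether a given OpenCL compiler accepts or benefits from this construct is not something this tutorial has measured; treat it as a host-side illustration.

```c
/* Dot product of n elements using Duff's device with an unroll factor of 4. */
float dot_duff(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    if (n <= 0)
        return sum;

    int passes = (n + 3) / 4;      /* loop passes, counting the partial one */
    switch (n % 4) {               /* jump past the taps we do not need yet */
    case 0: do { sum += *a++ * *b++;
    case 3:      sum += *a++ * *b++;
    case 2:      sum += *a++ * *b++;
    case 1:      sum += *a++ * *b++;
            } while (--passes > 0);
    }
    return sum;
}
```

With n = 7, for example, the first pass enters at `case 3` and performs three multiply-adds, and the second pass performs the remaining four.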

Yet another way to achieve similar results is to write four different versions of the Convolve_Unroll kernel, corresponding to (filterWidth%4) equalling 0, 1, 2, or 3. The version to call can be chosen at run time from the value of filterWidth, which is also passed to the kernel as an argument.
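On the host side, choosing among four kernel versions could be as simple as a table lookup. The sketch below assumes four kernels exist under the names shown; those names are placeholders invented for this example and do not appear in the source package.

```c
#include <stddef.h>

/* Hypothetical kernel names, one per value of filterWidth % 4. */
static const char *const kUnrollKernelNames[4] = {
    "Convolve_Unroll_Rem0",
    "Convolve_Unroll_Rem1",
    "Convolve_Unroll_Rem2",
    "Convolve_Unroll_Rem3"
};

/* Returns the name of the kernel version matching the filter width,
   or NULL if the width is not positive. */
const char *select_unroll_kernel(int filterWidth)
{
    if (filterWidth <= 0)
        return NULL;
    return kUnrollKernelNames[filterWidth % 4];
}
```

The returned string would then be passed as the kernel name to clCreateKernel when the host sets up the run.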

Other Applications

This is a good place to mention that the optimization methods given in this series are general-purpose techniques and are not limited to image convolution. Anywhere you encounter a large loop count, you should be able to optimize it by unrolling loops. How much gain you see in the end depends, sadly, on the other parts of your application that you haven’t optimized yet.

When it comes to convolution, we know that small filters like 5×5 or 7×7 are much more in vogue than larger ones. Unfortunately, we see that unrolling doesn’t help when the filter width is small. There is essentially no speedup until we reach a filter width of 13, and we are actually worse off for very small filters.

In the next write-up, we will explore a solution that helps the small filter scenarios.

  • » Source Code Package for this Step

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc. *Configuration: AMD Phenom™ X4 9950 Black Edition processor (2.6GHz), 8GB RAM, Windows Vista™ 32-bit, ATI Stream SDK v2.0 Beta, Visual Studio 2008.

2010 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon, AMD Turion, AMD Sempron, AMD Phenom, ATI Radeon, Catalyst, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.

This website may be linked to other websites which are not in the control of and are not maintained by AMD. AMD is not responsible for the content of those sites. AMD provides these links to you only as a convenience, and the inclusion of any link to such sites does not imply endorsement by AMD of those sites. AMD reserves the right to terminate any link or linking program at any time.

©2013 Advanced Micro Devices, Inc. OpenCL and the OpenCL logo are trademarks of Apple, Inc., used with permission by Khronos.
