Step 5 – Invariants

Udeepta Bordoloi, ATI Stream Application Engineer  10/13/2009 

In the previous write-up we saw that loop unrolling did not help when the filter width is low. We will now try to reduce the small filter computation time.

Till now, we have written the kernels in a generic way so that they will work for all filter sizes; the filter width is passed as an argument. What if we can focus on a particular filter size? Let’s consider a kernel for a specific filter size, say 5×5. We can now unroll the inner loop five times and get rid of the loop condition altogether. Better still, if we use the invariant in the loop condition, and a good compiler will unroll the loop itself saving us the trouble.

As before, we have made only incremental changes to the previous files, so you can do a diff to get a quick view of what additional code is present. The only significant change is the addition of new kernels to the kernels file.

Kernel with Invariants

Fortunately we do not have to write a separate kernel for each filter size. The same kernel code will work for all filter sizes; but we will have to build the program separately for each filter size. This new kernel Convolve_Def will use the invariant FILTER_WIDTH instead of the argument int filterWidth taken by the kernel Convolve. Compare this kernel with the first kernel Convolve that we wrote.

(For the sake of clarity, we will continue to pass the argument so that the kernel arguments are in sync with what the previous kernels had. We want to keep our code changes to a minimum so that readers can easily diff between one code sample in the series and the next.)

__kernel void Convolve_Def(const __global float * pInput, __constant float * pFilter, __global float * pOutput, const int nInWidth, const int nFilterWidth) { const int nWidth = get_global_size(0);
const int xOut = get_global_id(0); const int yOut = get_global_id(1);
const int xInTopLeft = xOut; const int yInTopLeft = yOut;
float sum = 0; for (int r = 0; r < FILTER_WIDTH; r++) { const int idxFtmp = r * FILTER_WIDTH;
const int yIn = yInTopLeft + r; const int idxIntmp = yIn * nInWidth + xInTopLeft;
for (int c = 0; c < FILTER_WIDTH; c++) { const int idxF = idxFtmp + c; const int idxIn = idxIntmp + c; sum += pFilter[idxF]*pInput[idxIn]; } } //for (int r = 0…
const int idxOut = yOut * nWidth + xOut; pOutput[idxOut] = sum; }

We will pass in a value for FILTER_WIDTH when we build the OpenCL™ program. This can be done offline (in a separate pass, before the application starts), or online (as the application is running and requests a particular filter size). Here is an example, using C++ bindings, that shows how to pass in the value of the invariant.

/* create a cl source string */ std::string sourceStr = Convert-File-To-String(File-Name); cl::Program::Sources sources(1, std::make_pair(sourceStr.c_str(), sourceStr.length())); /* create a cl program object */ program = cl::Program(context, sources); /* build a cl program executable with some #defines */ char options[128]; sprintf(options, "-DFILTER_WIDTH=%d", filter-width);, options); /* create a kernel object for a kernel with the given name */ cl::Kernel kernel = cl::Kernel(program, "Convolve_Def");


Let us compare the Convolve kernel performace with the Convolve_Def kernel compute performance.*

In the above graph, the Convolve kernel performance is the 100% line. Defining the filter width as an invariant helps the Convolve_Def kernel gain about 20% performance, particularly for small kernel sizes.

We can use the same technique in the unrolled kernels (Convolve_Unroll with the invariant becomes Convolve_Def_Unroll, and Convolve_UnrollIf yields Convolve_Def_UnrollIf):

In the next write-up, we will show how to write the kernel so that packed arithmetic (SSE) inctructions are emitted on the CPU device.

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc. *Configuration: AMD Phenom™ X4 9950 Black Edition processor (2.6GHz), 8GB RAM, Windows Vista™ 32-bit, ATI Stream SDK v2.0 Beta, Visual Studio 2008.

2010 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon, AMD Turion, AMD Sempron, AMD Phenom, ATI Radeon, Catalyst, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.This website may be linked to other websites which are not in the control of and are not maintained by AMD. AMD is not responsible for the content of those sites. AMD provides these links to you only as a convenience, and the inclusion of any link to such sites does not imply endorsement by AMD of those sites. AMD reserves the right to terminate any link or linking program at any time.

View more in this series by selecting an option from the left navigation.