## Step 4 – Loop Unrolling

We will now consider loop unrolling as a technique to reduce the computation time for convolution.

For a filter of width *fw*, for each pixel in the output image, the statements in the innermost loop are run (*fw *x *fw*) times. That is, the loop condition test and the ensuing branching happen (*fw *x *fw*) times. While this cost may be tiny for small filters (only 4 iterations for 2×2 filters), it becomes significant as the filter width increases (1024 iterations for 32×32 filter). The solution? Reduce loop count. While it does boost speeds in most cases, it is not a universal guarantee.

As before, we have made only incremental changes to the previous files, so you can do a diff to get a quick view of what additional code is present. The only significant change is the addition of new kernels to the kernels file.

### Unrolled Loop Kernel

Following is the convolution kernel, with four iterations of the innermost loop unrolled. The two changes are that (a) the innermost loop statements are repeated four times, and (b) there is another loop at the end to handle the remainder of the iterations when filter width is not an even multiple of four.

__kernel void Convolve_Unroll(const __global float * pInput, __constant float * pFilter, __global float * pOutput, const int nInWidth, const int nFilterWidth) { const int nWidth = get_global_size(0); const int xOut = get_global_id(0); const int yOut = get_global_id(1); const int xInTopLeft = xOut; const int yInTopLeft = yOut; float sum = 0; for (int r = 0; r < nFilterWidth; r++) { const int idxFtmp = r * nFilterWidth; const int yIn = yInTopLeft + r; const int idxIntmp = yIn * nInWidth + xInTopLeft; int c = 0; while (c <= nFilterWidth-4) { int idxF = idxFtmp + c; int idxIn = idxIntmp + c; sum += pFilter[idxF]*pInput[idxIn]; idxF++; idxIn++; sum += pFilter[idxF]*pInput[idxIn]; idxF++; idxIn++; sum += pFilter[idxF]*pInput[idxIn]; idxF++; idxIn++; sum += pFilter[idxF]*pInput[idxIn]; c += 4; } for (int c1 = c; c1 < nFilterWidth; c1++) { const int idxF = idxFtmp + c1; const int idxIn = idxIntmp + c1; sum += pFilter[idxF]*pInput[idxIn]; } } //for (int r = 0… const int idxOut = yOut * nWidth + xOut; pOutput[idxOut] = sum; }

### Performance

We compare the performace of the Convolve_Unroll kernel with the original Convolve kernel as the baseline.^{*}

The results are expected: the unrolled kernel does not improve the timing when the filter width is small (since there are only a few iterations to be saved by unrolling). As the filter size grows, it helps improve speed by as much as 20%.

Speed savings are maximum when the filter width is an integral multiple of four. This sawthooth kind of behavior is due to the iterations that are left over after unrolling. The relative speeds for filter widths of 16 to 20 (i.e., filterWidth%4 is 0,1,2,3 and 0) are 74%, 78% 83%, 92% and 79%. A good guess at this point is that the sawtooth behavior is related to the number of iterations of the following loop:

for (int c1 = c; c1 < filterWidth; c1++)

### Unrolled Loop Kernel (2)

Let us also unroll this second loop completely and substitute it with a if-else (or a switch-case) statement.

__kernel void Convolve_UnrollIf(const __global float * pInput, __constant float * pFilter, __global float * pOutput, const int nInWidth, const int nFilterWidth) { const int nWidth = get_global_size(0); const int xOut = get_global_id(0); const int yOut = get_global_id(1); const int xInTopLeft = xOut; const int yInTopLeft = yOut; float sum = 0; for (int r = 0; r < nFilterWidth; r++) { const int idxFtmp = r * nFilterWidth; const int yIn = yInTopLeft + r; const int idxIntmp = yIn * nInWidth + xInTopLeft; int c = 0; while (c <= nFilterWidth-4) { int idxF = idxFtmp + c; int idxIn = idxIntmp + c; sum += pFilter[idxF]*pInput[idxIn]; idxF++; idxIn++; sum += pFilter[idxF]*pInput[idxIn]; idxF++; idxIn++; sum += pFilter[idxF]*pInput[idxIn]; idxF++; idxIn++; sum += pFilter[idxF]*pInput[idxIn]; c += 4; } int cMod = nFilterWidth – c; if (cMod == 1) { int idxF = idxFtmp + c; int idxIn = idxIntmp + c; sum += pFilter[idxF]*pInput[idxIn]; } else if (cMod == 2) { int idxF = idxFtmp + c; int idxIn = idxIntmp + c; sum += pFilter[idxF]*pInput[idxIn]; sum += pFilter[idxF+1]*pInput[idxIn+1]; } else if (cMod == 3) { int idxF = idxFtmp + c; int idxIn = idxIntmp + c; sum += pFilter[idxF]*pInput[idxIn]; sum += pFilter[idxF+1]*pInput[idxIn+1]; sum += pFilter[idxF+2]*pInput[idxIn+2]; } } //for (int r = 0… const int idxOut = yOut * nWidth + xOut; pOutput[idxOut] = sum; }

### Performance (2)

This removes the sawtooth behavior completely.^{*} Interested readers may want to read about Duff’s device which combines a switch statement together with a loop structure in a (unusual) way that eliminates the need for a trailing if-else test for leftover iterations after the unrolled loop.

Yet another way to achieve similar results is to write four different versions of the ConvolveUnroll kernel. The four versions will correspond to (filterWidth%4) equalling 0, 1, 2, or 3. The particular version called can be decided at run-time depending on the value of filterWidth, which is being passed to the kernel as a argument also.

### Other Applications

This is a good place to mention that the optimization methods given in this series are general purpose techniques and not limited to image convolution only. Anywhere you encounter a large loop count, you should be able optimize it by unrolling loops. How much gain you see in the end depends, sadly, on the other part of your application that you haven’t optimized yet.

When it comes to convolution, we know that small filters like 5×5 or 7×7 are much more in vogue than larger ones. Unfortunately, we see that unrolling doesn’t help in cases where filter width is low. There is essentially no speedup until we reach filter width of 13, and we are actually worse off for very small filters.

In the next write-up, we will explore a solution that helps the small filter scenarios.

*OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc. *^{*}*Configuration: AMD Phenom™ X4 9950 Black Edition processor (2.6GHz), 8GB RAM, Windows Vista™ 32-bit, ATI Stream SDK v2.0 Beta, Visual Studio 2008.*

View more in this series by selecting an option from the left navigation.

2010 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon,*©AMD Turion, AMD Sempron, AMD Phenom, ATI Radeon, Catalyst, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.*

*This website may be linked to other websites which are not in the control of and are not maintained by AMD. AMD is not responsible for the content of those sites. AMD provides these links to you only as a convenience, and the inclusion of any link to such sites does not imply endorsement by AMD of those sites. AMD reserves the right to terminate any link or linking program at any time.*