### Introduction

One of the most popular and obvious GPU applications in the area of image processing is image filtering. So, what is image filtering? In its simplest form – using a fixed size separable filtering kernel – image filtering generates one output pixel where a fixed number of neighboring raw pixels must be fetched, multiplied with predefined coefficients, and added together to achieve a written result. The operation is commonly called “convolution,” although “convolution” has a wider meaning in the areas of signal processing, and it also has a precise mathematical definition, but we won’t focus on that in this paper.

By far, the most popular convolutions use 3×3 and 5×5 filter kernels. They are employed for smoothening, sharpening, and edge detection, to name a few. To produce the resulting pixel, a filter has to fetch from 9 to 25 pixels around the pixel you want to generate, applying the convolution process – multiplies each by a coefficient, sums them up and writes them out. This is called a two-dimensional convolution. In many cases the 3×3 or 5×5 filter kernels are “separable,” meaning that the filtering or convolution can be done in two separate passes. First, a one-dimensional convolution has to be applied to each row. For a 3×3 convolution, for example, you need to fetch 3 pixels above your target and apply a one-dimensional filter to them. To do this, multiply 3 pixels by 3 filter coefficients and sum them up to get the first horizontally filtered pixel. Then do the same thing with the other two rows – the middle and below — to obtain 3 horizontally filtered pixels. To get a final result, another one-dimensional convolution has to be applied; a vertical one-dimensional convolution. A convolution (or filter) kernel is a common name for a set of filtering coefficients. Please, do not confuse it with an OpenCl kernel, as they are completely different entities.

There are two major flavors of a fixed size separable convolution on the GPU. One is a two pass convolution, coming from an old tradition of filtering on slow one-threaded CPUs. The other is a one pass convolution that uses rather sophisticated facilities of OpenCl such as local memory. Although both convolution algorithms have their own merits, it is possible to mix the attributes of both algorithms to yield some surprising new results, which will be explained next.

This paper presents a fast implementation of 2D convolution over a monochrome image, which can also be applied over any matrix of floats. The algorithm is inspired by the two algorithm approaches just described, but it combines them in more natural way. As a result, the number of IO operations per pixel can be reduced by 4 times, and the number of arithmetic operations can be reduced by 2 times. As a bonus, this paper discusses a sticky point usually left out of any discussion in the field of GPU-based image processing – how do you deal with convolution boundaries to avoid multiple boundary checks? In this paper we will first discuss very briefly one and two pass convolution processes. Then we’ll describe the basic ideas behind the approach to a 2D convolution that is implemented, along with a discussion of the convolution boundary problem and the way it might be solved.

### Convolution and its discontent

Let’s start by briefly describing both two pass and one pass approaches to the 2D GPU-based convolution.

The two pass convolution applies a one-dimensional horizontal filter to the entire image row of pixels. The result is a horizontally filtered input frame. The first pass does an additional trick of transposing the output, and writing out the results column-wise rather than row-wise. The second pass, although it works along the row as well, actually does vertical filtering, since the frame has already been transposed in the first pass. Incidentally, the second pass also transposes the output, and the resulting image appears restored in its original orientation.

Leaving aside a transposition that’s been done to improve a memory pattern, the question is why have we done it this way? You may find an answer by looking at Figure 1. It shows that vertically adjusted pixels can be shared to generate a result pixel. In other words, it make sense first to filter everything horizontally and then reuse horizontally filtered pixels, reducing the computational bandwidth by two times. So, instead of four one-dimensional convolutions (3 horizontal + 1 vertical), it uses only two (1 horizontal and 1 vertical).

The two pass convolution, reduces the memory bandwidth requirement as well, since instead of fetching nine pixels as with the naïve algorithm, it requires six fetches per pixel (3 for horizontal filtering and 3 for vertical). It also requires two writes, which results in a total of eight I/O operations per pixel against ten for the naïve case (9 reads + 1 write).The one pass filter takes advantage of the most advanced feature of GPU (and OpenCl): local memory. Local memory, as it is defined by the OpenCl specification, is a block of memory available for sharing between all work-items belonging to the same OpenCl working group. Usually, in a GPU architecture, this formal language construct is mapped on a very fast on-chip memory. Specifically, an AMD GPU local memory bandwidth is slightly slower than the L1 cache and has a latency of four clocks.

You may wonder how the one pass 2D local memory-based convolution works with a 3×3 filter kernel? Usually it uses a two dimensional working group of 16×16 to process a 16×16 block of pixels. It allocates a 17×17 block of a local memory per group, and at the beginning it reads 17×17 pixels into the local memory. To produce a resulting pixel, each working-item fetches 3×3 pixels from the local memory, computing a result the same way as the naïve implementation does before writing it out.

What’s the advantage? Obviously, the memory bandwidth is saved. It requires 1 read/1 write from/to the global memory per pixel. It’s four times less than the two pass solution and five times less than using the naïve solution

So, what’s missing? The computational bandwidth is exactly the same as in the naïve case and most importantly it adds ten trips to the local memory (1 write and 9 reads). Although the local memory bandwidth is higher than the global memory, those trips are not free and, comparing it to the cache access, might be costly.

### What we can offer instead?

We’ll describe the 3×3 convolution algorithm in general terms here and explain it more precisely later in this paper. The 5×5 algorithm follows the same pattern and is available with the source code.

In particular, the downloadable algorithm uses an OpenCl working group 8×8 in size, instead of 16×16 as the one pass convolution does. However, it produces a 4×4 block of pixels per one working-item (32×32 per group total), and it uses a 32×16 block of local memory. The common term for a locally adjusted rectangular area of pixels is tile. Since we will generate a 32×32 square of pixels with our proposed algorithm, it will be called a tiled convolution.

### What does it do per each working-item?

The algorithm will do the following per each working item:

- It reads six pixels per each row — four raw pixels of the row and one from each side of the row.
- It applies a one-dimensional convolution to all four pixels of the row.
- It saves top and bottom resulting rows (2 out of 4) in the local memory. It completes the first part of the convolution process.
- In the second part, the working-item reads two rows from local memory. Those two rows belong to its adjusted upper and lower working-item neighbors (one from each row). The working-item has everything to complete the convolution of a 4×4 block. See Figure 3 later in this paper for an illustration of the process.
- A work-item does the final stage of the convolution by applying a one-dimensional vertical convolution to all sixteen pixels and writes them out four times by 4.

### What do we gain?

As in the one pass implementation, we have reduced the memory bandwidth requirement almost by the same amount as the one pass solution.

Since we’re already sharing sixteen raw pixels per a work-item, it does not make sense to prefetch them into the local memory and read them back into work-item registers. Instead, we will rely on the high bandwidth L1 cache to implicitly share eight neighboring raw pixels (2 per each of 4 rows). It saves a round trip to the local memory for the remaining sixteen pixels.

We also use the local memory, but only to save rows of already horizontally filtered pixels, and only those that are going to be shared between neighboring vertical 4×4 blocks (2 out of 4). It reduces overall local memory traffic by two times, without losing too much.

As in the two pass algorithm, we are sharing horizontally filtered pixels – sixteen work-item’s internal pixels and another eight from the neighboring vertical working-items stored in local memory. The computational bandwidth has been reduced by two times.

As a result of combining both the 2 pass and 1 pass filters and computing more per work-item, along with using local memory more wisely, we can beat both algorithms by a good margin.

Let’s briefly discuss a convolution border problem, before moving on. There are actually two problems with the border.

One problem comes naturally from the way the convolution is calculated. Given how pixels are taken from the immediate neighborhood of the target location, it is possible that a fetch from outside of the source image may be generated, which could produce an undesirable result. To deal with this problem, OpenCl kernel developers usually insert boundary checks on the read. With our example, however, given the processing of 16 pixels per working-item, the checking exerts a huge toll on the kernel performance.

The second problem is more serious and it’s introduced with our tiled approach to a convolution process. In particular, pixels generated inside of a working group might have addresses outside of the target image buffer. So, instead of writing 4 pixels at a time, we must check each pixel’s coordinates before writing. We could modify write coordinates that are outside of the output range with ones that are inside of the range. In reality, however, doing this could overwrite correct output values with incorrect ones. To prevent this, we must encircle each pixel write instruction with a branch that guarantees a major performance degradation.

A solution for these two problems is to split all working-groups across four classes. Ones that might fetch pixels outside of an input range and ones that cannot, ones that might write outside of the output range and ones that cannot. Those groups that cannot intersect borders on the read and/or on the write, will be called “free on read” and/or “free on write”. It’s obvious that the great majority of groups belong to free on read and/or free on write classes, since their numbers are proportional to the area of the image (order of 2). The number of the two other groups is proportional to the perimeter of the image (order of 1), and is thus much smaller than the previous number.

Let’s now examine this algorithm in more formal terms and try to explain how it works in more detail.

### The convolution and its boundaries

We would like to convolve a 3×3 or 5×5 (separable) filter kernel with a two-dimensional matrix of floats: MxN (source pixels). The result, or target, is also a MxN matrix of floats (filtered pixels).

Let’s define the following: a 4D vector ROI (region of interest) = [Top, Left, Bottom, Right] – all values included. ROI defines a rectangular boundary of a convolution process. The ROI width = Right + 1 – Left and the ROI height = Bottom + 1 – Top.

Our general convolution process requires 2 ROIs – a source ROI (sROI) and a target ROI (tROI) – [sTop, sLeft, sBottom, sRight] and [tTop, tLeft, tBottom, tRight].

Here are the convolution process boundary conditions:

- sWidth == tWidth and sHeight == tHeight
- sLeft >= 0 and sTop >= 0 and tLeft >= 0 and tTop >= 0
- sLeft + sWidth < M and tLeft + tWidth < M and sTop + sHeigh < N and tTop + tHeight < N
- No pixel shall be read from regions outside of the source ROI
- No pixel shall be written into regions outside the target ROI

And here is the simplest boundary condition case:

sTop = sLeft = tTop = tLeft = 0 sRight = tRight = M - 1 sBottom = tBottom = N -1

This is obviously the case of a full source image convolution.

Let’s discuss the filtering process in more detail.

Clearly, the major challenge of the convolution is to tame a memory bandwidth requirement. A naïve implementation of a 3×3 filter, for example, requires a kernel to obtain 9 surrounding pixels. The good news is that source pixels used to calculate neighboring target pixels are also close neighbors, which is important given how the reuse of source pixels can greatly reduce the memory bandwidth. We can also benefit greatly using separable filters, given how results from a one dimensional horizontal (or vertical) filtering can be shared as well.

Let’s take an example of a 3×3 convolution with the coefficients a, b and c. To convolve a pixel at coordinate {x, y} the kernel has to take 9 pixels at the following coordinates:

p_1_1 = {x-1, y-1}, p0_1 = {x, y-1}, p1_1 = {x+1, y-1}, p_10 = {x-1, y}, p00 = {x, y}, p10 = {x+1, y}, p_11 = {x-1, y+1}, p01 = {x, y+1}, p11 = {x+1, y+1},

It then has to compute three one-dimensional filters as shown here:

h0 = p_1_1 * a + p0_1 * b + p1_1 *c h1 = p_10 * a + p00 * b + p10 *c h2 = p_11 * a + p01 * b + p11 *c

And finally to calculate the following result:

f(x,y) = h0 * a + h1 * b + h2 * c

We can reuse h1 and h2 to generate a pixel at coordinate {x, y+1} and it requires only 3 raw pixels:

new_p_11 = {x-1, y+2}, new_p01 = {x, y+2}, new_p11 = {x+1, y+2}

And following two additional dot products:

h3 = new_p_11 * a + new_p01 * b + new_p11 *c

to yield the final result:

f(x,y+1) = h1 * a + h2 * b + h3 * c

See Figure 1 earlier in this paper for an illustration of how to reuse H1 and H2.

Let’s now examine the savings that result from using the described sharing while generating a block of 4×4 = 16 output pixels.

### Fetches

As mentioned earlier, a naïve implementation of a 3×3 filter would require nine fetches per pixel. Since we are going to produce 4×4 adjusting pixels, all sixteen raw pixels are going to be fetched, reused or shared. We also need one raw pixel at both sides of an input row. And we need one row above and one row beneath our 4×4 block input to have enough data to calculate convolution for all 4×4 output pixels. The total pixel number to be fetched is: (4+ 1 +1) – pixels per row – * (4 +1 +1) rows.

The number of pixels to be fetched = 36. Comparing it with the naïve 16*9 = 144, we see a savings of four times the amount.

Figure 2 illustrates the sharing of the six raw pixels in one row. Internal pixels are shared three times.

Figure 2. Horizontal pixel sharing per row; out of 6 raw horizontal pixels, 4 internal pixels are shared.

### Computation

Using a one-dimensional filter (a * p0 + b * p1 + c * p2) requires five arithmetic instructions. A naïve implementation of a single filter requires 5 * 3 (3 horizontal filters) + 5 (1 vertical filter) = 20 arithmetic instructions. 4×4 filters need 20*16= 320.

In our implementation, since we are going to share one-dimensional horizontal filters, we need 6 * 4 of those horizontal filters. Each row has to generate 4 horizontally filtered pixels (we need 6 such rows), and we are going to compute 16 one-dimensional vertical filters to finalize the filter for the entire 4×4 block.

So the computational bandwidth is: 6*4*5 + 16*5 = 200 – a 1.6 times saving. But we can go even further with our computational changes. We can share the horizontal filters between three neighboring vertical 4×4 blocks: current, immediate upper and immediate lower.

We will keep horizontally filtered top and bottom rows of a 4×4 block (0-th and 3-d row) in the local memory and share them with upper and lower neighboring 4×4 blocks, as shown in Figure 3.

Figure 3. Intra row sharing and inter sharing between vertically neighboring 4×4 block.

Now we will recalculate a number of fetches and arithmetic instructions we need to compute a 4×4 block.

Fetches: 4*6 = 24. This is a savings of 144/24 = 6 TIMES!!!

Computation: 16 * 5 + 16 *5 = 160. This is a savings of 320/160 = 2 times.

Interestingly, the increased speed achieved by the proposed algorithm relative to a good 2 pass GPU-based implementation, is about 4-5 times.

Like most things in life, when you gain something you are going to lose something somewhere else. Indeed, in this case, we are losing a very tangible GPU resource – registers, in particular.

The GPU registers are the fastest memory in the entire GPU memory hierarchy. They are also the scarcest resource that directly influence the number of GPU threads in flight per GPU core. Since a register file is limited in size, an increase of a register budget per work-item reduces the number of GPU threads in flight, which in turn reduces the ability by a GPU core to schedule additional HW threads to cover memory latency and could cause ALU stalls. However, the 6X reduction in memory bandwidth and an increase in a computational load per work-item is more important than the lost number of HW threads concurrently running on a GPU core. This is because we still have 8 HW threads running concurrently per a GPU core, which is enough to cover a significantly reduced memory bandwidth requirement.

### Border processing

Now let’s move our attention to border processing. As you recall, the border problems can manifest themselves in two cases: when the working group intersects the sROI and/or intersects a tROI.

Let’s define the coordinate of a 32×32 tile as a coordinate {t, l} of the top-left corner of the tile. Let’s split all 32×32 pixel tiles and incidentally all OpenCl working groups involved in the convolution in four classes as follows.

**Tiles satisfying the following condition:**sLeft <= l – filter radius and sTop <= t – fiter radius and l+32 <= sRight – filter radius and t+32 <= sBottom – filter radius. In other words, these are tiles that fetch raw pixels from within a source boundary.**Tiles not included in the class 1.**These are tiles that might fetch raw pixels outside of the source boundary and need to check the source boundary condition because of that.**Tiles satisfying the following condition:**l+32 <= tRight and t+32 <= tBottom are tiles that might include, but do not cross, target boundaries while writing out resulting pixels.**Tiles not included in the class 3.**These are tiles that might cross target boundaries on output, and need some way to prevent a possible “spill” outside of tROI to satisfy the boundary conditions.

In general, 1st and 2nd classes and 3rd and 4th classes are mutually exclusive. The 1st class, however, might intersect with the 3rd or 4th as the 2nd can do (see Figure 4).

Figure 4. While convolving the source ROI into the target ROI 4 different classes of groups are considered.

### On input

Let’s now examine the four classes that we have introduced in the previous section.

### Class 1

It’s clear from the class definition that groups belonging to it can read raw pixels without checking a boundary condition. None of the raw pixels from these groups are going to be read outside of the sROI. It’s apparent that the number of groups of the 1st class grows with the area, and the number of groups of the 2nd class grows linearly with the length of the image border or perimeter. That’s why the performance of the 1st class group dominates, and it’s exactly that group which is responsible for the significant increase in the performance of our tiled convolution algorithm .

### Class 2

Groups belonging to the 2nd class have to check source boundary conditions for each pixel they are going to read. To reduce the number of checking, we’ll fall back to the 1 pass local mamory-based convolution algorithm discussed earlier.

We will read all N = (32 + filter radius)x(32 + filter radius) raw pixels into the local memory while checking source boundary conditions for each pixel, thus avoiding reading from outside of the sROI. That’s why the size of the local memory is not really 32 x 16, as discussed earlier in this paper, but it is 34×34 to accommodate the class 2 groups.

Note, however, that raw pixels will be read cooperatively. It means that each work item reads only N/64 pixels. After reading all N pixels into the local memory, the 2nd class proceeds to calculate all 4×4 output blocks exactly the same way as the 1st class, including the sharing of horizontally filtered rows with neighboring 4×4 blocks. The only difference is that raw pixels are going to be “fetched”, not from a global, but from a local memory. Note, also, that since the local memory has been used to store raw pixels, it is going to be reused to keep rows of horizontally filtered pixels. Please, carefully consider why it’s possible (see the kernel source code).

### On output

Groups from both the 1st and 2nd classes keep a calculated 4×4 output pixel block in its registers. The algorithm writes them out by 2 different ways carried by group classes 3 and 4.

### Class 3

A group of the 3d class writes its output 4×4 block into the global memory unconditionally. It follows, from the definition of the 3d class, that the output pixels never happen to be outside of the tROI.

### Class 4

For the 4th class, the output process is going to be slightly more complicated and undoubtedly slower. Let’s consider the following property in our problem’s global domain:

Any working group always touches the target ROI at least at one (top-left) pixel- {l,t}.

It’s easily followed from the definition of the global domain and thus repeated here:

(((tROI.width + 31)/32) * 8) x (((tROI.height + 31)/32)*8)

Using the property, a group of the 4th class does the following: It saves in the local memory the value and the output position of the {l,t} output pixel and broadcasts those 2 values to all work-items in a group. Each work-item in a working group writes out 16 pixels (4×4 block), one by one, while checking the output boundary conditions. If, being outside of the target ROI, the pixel breaks the output boundary condition, then the value of the broadcasted {l,t} pixel is written out instead using its broadcasted position.

This completes the boundary discussion. You might be surprised that it took a sizable portion of the entire paper. My hope is that after realizing the importance of avoiding the border checking you may start replicating this technique in your other projects as I do.

### Conclusions

This paper has covered a very fast and highly generic 3×3 and 5×5 convolution. It has also covered fast and robust boundary processing that’s usually the major headache for a general convolution process.

The secret to the algorithm’s high performance is accounted for by the following properties:

- The kernel shares not only raw pixels, but also already computed horizontal 1D filters.
- It shares horizontally filtered pixels across vertically adjusted work-items.
- It saves a raw pixels’ round-trip to the local memory, relying on the highly coherent raw pixel behavior and a very efficient GPU L1 cache.
- It separates working groups by their relation to boundary conditions, taking advantage of the fact that the number of “free on read and free on write” groups are proportional to the area of the image and grow with a square function. Thus the performance will depend upon the performance of “free” working groups.

A similar approach could be taken using a separable filter of an arbitrary size without consuming a lot of local memory. Also the non-separable case could be considered in the same vein. Feel free to experiment with these approaches.

### SW sample

The sample in this paper is capable of running the 3×3 convolution kernel over an image of any resolution with an arbitrary source and a target ROI. The kernel file also includes the 5×5 convolution kernel, but the sample does not run it, although you can easily add a few program lines to make it work.

Three .bmp files of different resolutions are provided as input for a timing comparison.

The sample processes only the intensity plane of the input RBG image. A color convolution has been left for a reader to add. For example, in practice, the intensity plane can be processed by a 5×5 filter, while color planes can be processed with a 3×3 filter.

The resulting output image is a gray intensity TiledConv_Output.bmp image

A reader can run multiple iterations of the convolution for a more precise time measurement, and the timing statistics are provided in ms.

To compile the sample, copy the entire TiledConvolution directory into the …\ AMD APP\samples\opencl\cpp_cl\app folder and fire your VS10 development environment.

To run it, first check the possible parameters by setting a command argument to –h.

Finally, your command arguments might look like this:

-e -s TiledConv_Input512x512.bmp -t -i 10 -sL 3 -sT 3 -tL 2 -tT 2 -e -- force a pixel by pixel verification; -s -- name and location of the input image; -t -- print time statistics -i -- number of iteration; -sL -- source ROI top; -tL -- target ROI left; -tT -- target ROI top;