Abstract

In this paper we present a Run-Length Encoding (RLE) JPEG sample: a heterogeneous application that splits an image decompression process into two parts, one CPU-based and the other GPU-based. We minimize the CPU-to-GPU data transfer time by using a simple but effective RLE scheme. Along the way, we explore the fastest way of performing an OpenCL CPU-GPU data transfer, along with the topics of GPU local memory utilization, instruction-level parallelism, GPU register utilization and global memory utilization. The paper also points to a downloadable JPEG software decoder for verification and modification, an example of using the OCL_EMU emulator/debugger, and more.

Introduction

It is well known that various compression schemes and standards use different types of variable-length encoding (VLE), quantization and decorrelating transforms to achieve different trade-offs between bit rate and quality distortion. Inversely, on the decoding side, decompression employs the corresponding types of variable-length decoding (VLD), inverse quantization and inverse transforms.

Computationally, VLE (and VLD) is usually performed on a low-latency sequential device: a CPU core. On the other hand, an (inverse) quantization with an (inverse) transform can be successfully accelerated on modern massively parallel vector devices such as a GPU. See Figure 1.


Figure 1. Classical CPU/GPU decoding pipeline.
This paper describes one approach to a system-wide performance improvement of a three-stage decompression pipeline: a CPU VLD combined with CPU-based RLE encoding of the JPEG “coefficients”; their GPU-based RLE decoding; and a GPU-based inverse quantization and IDCT (IQIDCT). The paper focuses on image decoding standards, and on JPEG in particular. I believe, however, that the sample presents a pattern for a wide variety of problems in a heterogeneous computational environment, both in a specific application domain (image/video processing/encoding/decoding) and in the generic data communication domain. The latter stems from the perpetual imbalance between the narrow communication path connecting heterogeneous computational nodes and the computational throughput of those nodes.

For this particular case, the problem manifests itself when the matrix of coefficients built as a result of the JPEG Huffman VLD has to be transferred into GPU memory for the IQIDCT stage of JPEG decoding. Here the ratio between the data transfer time (limited by PCI bandwidth) and the IQIDCT kernel execution time varies from 0.7 to 4.5, depending upon the hardware involved. In other words, the time spent moving data through the PCI bus ranges from almost equal to several times higher than the time needed to perform the IQIDCT over an entire HD frame.

You may wonder whether it makes sense to transfer the coefficients to GPU memory and do the IQ/IDCT on the GPU at all, or to do it on the CPU, thus avoiding an additional copy through a (rather slow) PCI bus. The answer very well might be to do it on the CPU, using the CPU’s powerful vectorized multimedia instruction set. However, if the image is going to be processed (or post-processed) by the GPU, which is usually the case, then the answer clearly would be to copy the coefficients and process them on the GPU. By doing this, we might end up bound by the PCI bus bandwidth. For example, with a 6 Mpel image (HD YUV) and a transfer rate of about 6 GB/s, the PCI bus could sustain only about a 1K fps rate, whereas an average modern GPU can do the IQ/IDCT over the same image at more than a 3K fps rate.

To balance the pipeline you might consider a near-zero-cost lossless compression of the coefficients on the CPU side, transferring them at a much higher effective rate (due to compression), followed by a low-cost decompression on the GPU. For such a scheme to be viable, the total time loss (compression time + compressed transfer time + decompression time) should be significantly less than the uncompressed transfer time. See Figure 2 below for the proposed change to the classical pipeline.

What if we devise a lossless compression for coefficients with a compression ratio of 2? This means that instead of 1 ms, we’ll spend 0.5 ms transferring the coefficients. If our GPU decompression kernel takes 0.3 ms, we are still PCI bus-bound, but with a 2K fps rate, i.e. 2 times faster than before. All of this is true only under the assumption that the CPU side of the lossless encoding scheme adds very few CPU instructions per coefficient.
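
To make the balance condition concrete, here is a minimal back-of-the-envelope sketch in plain C of the viability check described above; the function name and numbers are illustrative only, not part of the sample:

#include <stdbool.h>

/* Viability check for the compressed pipeline. All times in milliseconds. */
static bool compression_pays_off(double raw_transfer_ms,
                                 double compression_ratio,
                                 double cpu_encode_ms,
                                 double gpu_decode_ms)
{
    double compressed_transfer_ms = raw_transfer_ms / compression_ratio;
    /* the compressed path must be significantly cheaper than the raw copy */
    return cpu_encode_ms + compressed_transfer_ms + gpu_decode_ms
           < raw_transfer_ms;
}

/* Example from the text: 1 ms raw transfer, ratio 2, ~0 ms CPU cost,
 * 0.3 ms GPU decode -> 0.0 + 0.5 + 0.3 = 0.8 ms < 1.0 ms: it pays off. */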

The paper’s goal is to describe all four stages of the scheme, and it is divided into four parts. First, it describes a lossless encoding scheme for the “coefficients”, to be inserted at the final stage of the Huffman decoding on the CPU side, along with a brief description and discussion of the decompression GPU kernel. Second, we cover the fastest way to move compressed coefficients through the PCI bus provided by the current version of the AMD OpenCL run-time. Third, we touch on a peculiarity of the IQIDCT kernel and its applicability to a wider variety of parallel decoding/encoding schemes. And finally, we briefly look at the OCL_EMU debugger in the project. The conclusion is devoted to the resulting performance numbers on different AMD GPU platforms and possible future research.


Figure 2. CPU/GPU cooperation pipeline. The CPU does a light-weight lossless compression on the back of the VLD, per block. The GPU runs a light-weight decoder to generate a 2D matrix of input coefficients for the GPU-based IQ/IDCT decoding.

Part 1. Lossless for dummies.

If you are wondering why we will get any compression at all with our RLE JPEG sample, the answer lies in the nature of any lossy compression we’re dealing with: JPEG, MPEG, H.264 and so on. In the simplest terms, each of those schemes tries to replace as much input data with 0s as possible, and then drops the resulting 0s altogether. On the decoding end, after the VLD pass is complete, we end up with a lot of 0s. So, our lossless compression idea is not to send the 0s through the PCI bus, but instead to pack only the non-zero data (with some additional control information) into a variable-length buffer, making the buffer as small as possible. In addition to this natural requirement, we don’t want our lossless compression to tax either the CPU side (VLD) or the GPU side (decompression kernel).

Keep in mind that the purpose of this paper is not to create a breakthrough in the area of lossless compression. Instead, the goal is to show the viability of such an approach, and to illustrate the highlights, requirements and limitations of similar schemes in a heterogeneous CPU-GPU environment. Let’s take a look at the description of the encoding format, which is very simple.

Lossless RLE encoding design and implementation details

The basic element of our lossless compressed stream of JPEG coefficients is a token. The token has a length of 12 bytes, or 3 dwords. The first 2 dwords keep 4 non-zero coefficients (16 bits each). Another 24 bits keep the positions of those coefficients in an 8×8 block (6 bits each). Yet another 6 bits keep a block id in a set of 64 blocks, and 2 bits are left unused. The arrangement assumes that the number of non-zero coefficients in an 8×8 block is always a multiple of 4. When that assumption does not hold, the token is padded with the last non-zero coefficient.

We’ve already mentioned a set of 64 blocks; we call it a group (of 8×8 blocks). The group holds all tokens belonging to the set. Evidently, the number of groups is equal to the total number of non-zero 8×8 blocks divided by 64. The length of a group varies. All groups are packed into a contiguous chunk of memory we call a stream of tokens, or a token stream. To separate one group from another we keep a group directory. Each element of the directory consists of 2 dwords: the offset of the group in the token stream and the number of tokens in the group. The directory is located at the start of the token stream. See the Appendix at the end of this paper for a more formal definition of the token stream.
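
For illustration, here is a minimal C sketch of how the CPU side could pack one token under the layout just described; the helper name and the exact bit order within the third dword are assumptions for illustration, not the sample’s verbatim code:

/* Pack 4 coefficients, their 6-bit in-block positions and a 6-bit block id
 * into one 12-byte (3-dword) token. Assumed layout of the third dword:
 * pos0 | pos1<<6 | pos2<<12 | pos3<<18 | block_id<<24 (top 2 bits unused). */
static void pack_token(unsigned int token[3],
                       const short coeff[4],
                       const unsigned int pos[4],
                       unsigned int block_id)
{
    token[0] = (unsigned short)coeff[0]
             | ((unsigned int)(unsigned short)coeff[1] << 16);
    token[1] = (unsigned short)coeff[2]
             | ((unsigned int)(unsigned short)coeff[3] << 16);
    token[2] = (pos[0] & 0x3f)
             | ((pos[1] & 0x3f) << 6)
             | ((pos[2] & 0x3f) << 12)
             | ((pos[3] & 0x3f) << 18)
             | ((block_id & 0x3f) << 24);
}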

The lossless stream structure described above provides a reasonable compression ratio while keeping both the encoding on the CPU side and the decoding on the GPU side fast. Our experiments show that the compression ratio varies from 2.5 to 3.5, increasing the effective PCI bandwidth proportionally.

The structure of the token stream allows us to easily parallelize the stream decompression. For the decompressing kernel, the OpenCL work-group size is defined as 64, which allows an easy mapping of a stream group onto an OpenCL work-group.

The processing starts with the work-items of an OpenCL work-group (not to be confused with the token group) zeroing out a block of local memory of 64×8×8 shorts. This is where all uncompressed coefficients from the current token group will reside before they are written out to global memory. Each work-item then reads the entry of the stream directory indexed by the OpenCL group id. That way each work-item obtains the starting offset of the token group in the stream of tokens and the number of tokens in the token group.

Having that information, the work-items read 64 tokens at a time in parallel (one token per work-item) and put the 4 coefficients of each token into their proper places in one of the 64 8×8 blocks in local memory, until there are no more tokens in the token group. Unfilled places remain 0, as they should. See Listing 1 for the decompressor pseudo-code.

Listing 1. The pseudo-code of GPU-side lossless decompression scheme.

__local short block64x8x8[64*8*8];
uint local_id = get_local_id(0);
uint group_id = get_group_id(0);

// each work-item zeroes its own 64-short slice of the local block
for(uint i = 0; i < 8*8; i++)
{
    block64x8x8[mul24(local_id, 64u) + i] = 0;
}

// make the zeroed local memory visible to the whole work-group
barrier(CLK_LOCAL_MEM_FENCE);

// directory entry: .x = group offset in the token stream, .y = token count
uint2 group_dir = ((__global uint2 *)TokenStream)[group_id];

__global uint *group_start = (__global uint *)&TokenStream[group_dir.x];
uint group_number_of_tokens = group_dir.y;
uint loop_per_item = (group_number_of_tokens + 63)/64;

for(uint i = 0; i < loop_per_item; i++)
{
    // each work-item takes one token per iteration
    uint t = i*64 + local_id;
    if( t < group_number_of_tokens )
    {
        short4 coeff = getCoeff(group_start[t*3], group_start[t*3 + 1]);
        uint4 in_block_pos = getInBlockPos(group_start[t*3 + 2]);
        uint block8x8_id = getBlock8x8Id(group_start[t*3 + 2]);

        block64x8x8[mad24(block8x8_id, 8*8, in_block_pos.x)] = coeff.x;
        block64x8x8[mad24(block8x8_id, 8*8, in_block_pos.y)] = coeff.y;
        block64x8x8[mad24(block8x8_id, 8*8, in_block_pos.z)] = coeff.z;
        block64x8x8[mad24(block8x8_id, 8*8, in_block_pos.w)] = coeff.w;
    }
}
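
The listing relies on three small helpers that are not spelled out above. A plausible implementation in OpenCL C, assuming the token layout from the Appendix (4 coefficients in the first 2 dwords; 4 six-bit positions and a six-bit block id in the third dword, in that bit order), could look like the following sketch; the exact bit order is an assumption for illustration:

// unpack 4 signed 16-bit coefficients from the token's first 2 dwords
short4 getCoeff(uint dw0, uint dw1)
{
    return (short4)((short)(dw0 & 0xffff), (short)(dw0 >> 16),
                    (short)(dw1 & 0xffff), (short)(dw1 >> 16));
}

// unpack 4 six-bit positions inside the 8x8 block from the third dword
uint4 getInBlockPos(uint dw2)
{
    return (uint4)(dw2 & 0x3f, (dw2 >> 6) & 0x3f,
                   (dw2 >> 12) & 0x3f, (dw2 >> 18) & 0x3f);
}

// unpack the 6-bit block id within the group of 64 blocks
uint getBlock8x8Id(uint dw2)
{
    return (dw2 >> 24) & 0x3f;
}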

To reduce local memory bank conflicts, each 8×8 block in local memory is organized as rows of 4×16 coefficients, each with a stride equal to 256*2 bytes. See the “OpenCL Performance and Optimization” chapter of the AMD OpenCL Programming Guide (http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf) for details.

Having a block id inside each token allows a token to be processed by an arbitrary work-item. This, in turn, guarantees a “fair” workload, since the load is independent of the actual number of coefficients per specific 8×8 block. Having 4 coefficients per token creates a higher level of instruction-level parallelism, thus reducing the number of loop iterations.

Because we process groups of 64 blocks, the total number of non-zero blocks may not be a multiple of 64. To handle this, we allocate the output buffer rounded up to the next multiple of 64 blocks. Thus, if an input block is out of range, there is a place to write the dummy output data without checking border conditions. Note also that the output is not organized as a 2D array; instead it forms a sequential stream of block coefficients.

And finally, due to the limited size of local memory per GPU core (32 KB), only 4 work-groups can run on a GPU core concurrently in our implementation: each work-group uses 64×8×8 shorts = 8 KB of local memory, and 32 KB / 8 KB = 4.

Let’s now look at the OpenCL interface we will use for our data transfers.

Part 2. Pinning for a good bus.

“Pinning” is a hardware geek’s term for making system memory pages available to specialized hardware. Using Direct Memory Access (DMA) for a transfer without explicit use of CPU or GPU cores, for example, requires such “pinning”. Let’s look at the variants of pinning presented by the current version of the AMD OpenCL run-time.

There are three commonly used data transfer interfaces in OpenCL (listed here by their commonly known names, not their exact OpenCL names): Write/Read buffer, Map/Unmap buffer and Copy buffer.

See also http://developer.amd.com/sdks/AMDAPPSDK/assets/opencl-1.1-rev33.pdf, chapter 5.2.2.

Write/Read buffer is the slowest interface, due to the cost of “pinning” an arbitrary system memory pointer.

The Map/Unmap buffer has a lower pinning cost than the Write/Read buffer. This is apparent when you look at the parameters of the two interfaces. The Write/Read interface takes an arbitrary system pointer, whereas the Map/Unmap interface returns a system pointer provided by the OpenCL run-time, which has most likely already pinned it.

The use of the Copy buffer interface is slightly more complicated. If you allocate an OpenCL buffer with the flag CL_MEM_ALLOC_HOST_PTR (an AHP buffer) and then call the Map interface, you’ll get a system pointer, BUT it comes from a special heap managed by the OpenCL run-time. So when you Unmap it, there is no implicit transfer of the data through the PCI bus from system to video memory, as happens with a generic buffer; the cost of the Unmap is almost 0, unlike the generic buffer case. To make a real data transfer, however, one has to call the Copy buffer interface explicitly. The “pinning” in this case is much cheaper than for a generic buffer with a Write/Read interface, since the AHP buffer has already been pre-pinned at creation by the OpenCL run-time.

Keep in mind that the transfer rate of a generic buffer using the Map/Unmap interface and of an AHP buffer using the Map/Unmap/Copy interfaces is almost the same. The difference is that with the Map interface a developer has to know the size of the mapped buffer in advance. On the other hand, when using Map/Unmap/Copy, a developer can Map with the maximum size and Copy the data using the actual size. For the example in this paper we DO NOT KNOW in advance the size of the buffer we are going to transfer, since defining that size (making it as small as possible, i.e. compressing the data) IS our task. This leaves us with only one choice: the AHP buffer using Map/Unmap/Copy, as sketched below. See the source code …\jpeg-ocl\fw-jpeg\ocl.c for the exact implementation.
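
The following is a minimal host-side sketch of that Map/Unmap/Copy pattern (error handling omitted; fill_token_stream is an assumed stand-in for the CPU-side Huffman+RLE producer, not the sample’s verbatim code):

#include <CL/cl.h>

/* Stage the compressed token stream in a pre-pinned AHP buffer and
 * DMA only the bytes actually produced. */
size_t fill_token_stream(void *dst); /* assumed CPU-side VLD + RLE packer */

void transfer_tokens(cl_context ctx, cl_command_queue q,
                     cl_mem dev, size_t max_bytes)
{
    /* staging buffer allocated from the run-time's pre-pinned heap */
    cl_mem ahp = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                max_bytes, NULL, NULL);

    /* map once with the maximum size; the pointer is already pinned */
    void *p = clEnqueueMapBuffer(q, ahp, CL_TRUE, CL_MAP_WRITE,
                                 0, max_bytes, 0, NULL, NULL, NULL);

    size_t actual = fill_token_stream(p); /* size known only after packing */

    /* Unmap is almost free for an AHP buffer: no implicit PCI transfer */
    clEnqueueUnmapMemObject(q, ahp, p, 0, NULL, NULL);

    /* the real DMA copy moves only the bytes actually produced */
    clEnqueueCopyBuffer(q, ahp, dev, 0, 0, actual, 0, NULL, NULL);

    clReleaseMemObject(ahp);
}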

Part 3. “Faster, Higher, Wider.”

We are now going to examine the IQIDCT kernel and its applicability to a wide variety of parallel decoding/encoding schemes. The IQIDCT kernel is straightforward, which is why it is very fast (you may see it in …\jpeg-ocl\jpeg-ocl-test\amdIDCT8x8_kernel.cpp, amdIDCT8x8.hpp). The areas we will focus on when discussing the kernel are reading and writing, using GPU registers, and using vectorized data types.

On the read side we effectively compact the input data, taking advantage of the fact that 8 input shorts (exactly one row of an 8×8 block) fit perfectly into a uint4 value. The uint4 (and float4) type represents the most efficient read format on the GPU. On the write side, we fit 8 8-bit pels into a uint2 value. Although this isn’t the most efficient write format, in our case, since neighboring work-items write to sequential addresses, we can rely on the GPU write cache to efficiently write-combine the output data.
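
As a minimal illustration of this packing (an assumed OpenCL C fragment, not the kernel’s verbatim code; src, dst and row_index stand in for the kernel’s real buffers and indexing):

// fetch one 8-coefficient row (8 shorts = 16 bytes) in a single uint4 read
uint4 raw = src[row_index];            // __global uint4 *src
short8 row = as_short8(raw);           // reinterpret as 8 signed coefficients

// ... IQ + IDCT work on the row ...

// pack 8 result pels (8 bytes) into one uint2 and store it in a single write;
// neighboring work-items store consecutive uint2s, so the writes combine well
uchar8 pels = convert_uchar8_sat(row); // clamp to 0..255
dst[row_index] = as_uint2(pels);       // __global uint2 *dst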

The major reason why the kernel is very fast is that it uses registers to store all intermediate results (as opposed to local shared memory, which is the method used in all other implementations the author has studied). Registers are the fastest memory in the entire GPU memory hierarchy. Due to the GPU’s high memory latency, its designers rely on a relatively big number of registers to keep the computation going while memory requests are being served. The AMD GPU has a total of 256 registers per work-item per core. They are split equally between all wavefronts (each wavefront comprises 64 work-items running synchronously) concurrently running on the GPU core. For example, the IQIDCT kernel uses 23 general registers, which means there can be at least 10 wavefronts (256 / 23 ~= 10) in flight per core. I’d advise using registers as much as you can if you’d like to make your kernel fast. The only limitation is the number of wavefronts in flight per core, which has to be as big as possible for the GPU to be able to cover its high memory latency. This tension has to be resolved at the design stage of any kernel you develop.

The 2D IDCT consists of a vertical transform followed by a horizontal one. Since the 8×8 blocks are organized in their scan order, the vertical transform requires a transposition on the input and an inverse transposition on the output. As a possible kernel speed-up, this might be avoided by performing the transposition in the RLE decoding stage (on its output), or even on the CPU side where the compressed stream tokens are constructed. It would also require transposing the quantization coefficients’ table.

An even more promising improvement can be achieved by combining the RLE decoder with the IQIDCT in one kernel. This significantly reduces the memory bandwidth used (skipping one round trip to video memory) and produces the final image as its output in one pass. The drawback could be a reduced number of wavefronts in flight per core, due to the local memory limitation discussed in Part 1 of this paper. The advantages are clear on low-end devices with relatively low global memory bandwidth. You may see the implementation of the combined kernel in …\jpeg-ocl\jpeg-ocl-test\amdRLEIDCT8x8_kernel.cpp, amdRLEJPEG.hpp, amdIDCT8x8.hpp.

Part 4. Testing and using the OCL_EMU emulator debugger

To get started, take a look at the VS10 solution fw-jpeg.sln found in …\jpeg-ocl\fw-jpeg.

The application consists of several major parts:

  1. The CPU-GPU JPEG decoder includes a CPU Huffman decoder modified to extract a compressed stream of coefficient tokens, an OpenCL-based separate RLE decoding, an IQIDCT transform and a combined RLEIQIDCT kernel.
  2. All 3 kernels have a verification framework.
  3. The input/output buffers can be saved for further testing, development and verification.
  4. The saved buffers can be loaded back from HD and be used for testing, development and verification.
  5. A simple timing framework is also available to compare three different ways of doing JPEG decoding on the GPU.

All OpenCL-related modifications of the CPU-based JPEG decoder can be traced and found inside:

#ifdef O_CL 
…
#endif

brackets.

The major CPU portion (the Huffman decode with the coefficient tokens’ construction) is located in …\jpeg-ocl\fw-jpeg\jdhuff.c.

  • The data structures and procedure declarations shared between the host-based decoder and the OpenCL host driver are found in the …\jpeg-ocl\fw-jpeg\ocl.h file.
  • All OpenCL-related host code is in …\jpeg-ocl\fw-jpeg\ocl.c. The OCL_EMU code is called from oclIQIDCTemu.cpp.
  • main.cpp includes an application parameter parser and two major functions: testRLEJPEGDecompress (host-based Huffman, OpenCL RLE, IQIDCT, verification and saving) and testOclOnlyJPEGDecompress (loads data from HD, OpenCL RLE, IQIDCT, verification, timing of the separate and combined RLE/IQIDCT against a “pure” IQIDCT).

Application command line arguments:

-cpu, -gpu – device (mutually exclusive);
-i <file>.jpg – input JPEG-compressed image;
-rlejpeg_emu – emulate the RLE JPEG decoder;
-jpegiqidct_emu – emulate IQIDCT;
-rlejpeg_verify – verify the RLE decoder;
-jpegiqidct_verify – verify the IQIDCT output;
-rlejpeg_onepass – run the combined RLEIQIDCT kernel;
-rlejpeg_onepass_verify – verify the combined RLEIQIDCT kernel;
-rlejpeg_realrun <n> – measure the RLE decoder performance with a specified number of iterations;
-jpegiqidct_realrun <n> – measure the IQIDCT transform performance with a specified number of iterations;
-jpegiqidct_onepass_realrun <n> – measure the combined RLEIQIDCT performance with a specified number of iterations.

Note: if the -rlejpeg_onepass flag has been set, the -rlejpeg_emu, -rlejpeg_verify, -jpegiqidct_emu and -jpegiqidct_verify flags won’t have any effect.

You may use the OCL_EMU debugger (see http://developer.amd.com/zones/opensource/pages/ocl-emu.aspx) along with the paper’s sample application, or you may opt not to. By default, the project is prepared to be compiled and linked with OCL_EMU, but the emulator won’t be used until you set the proper parameter (see above).

To avoid linking to OCL_EMU altogether, remove the preprocessor variable (_GPU_EMU_IMPL) from both projects (fw_jpeg, jpeg-ocl-test: Properties->Configuration Properties->C/C++->Preprocessor).

Be sure to exclude the 3 kernel source files (amdIDCT8x8_kernel.cpp, amdRLEJPEG_kernel.cpp, amdRLEIDCT8x8_kernel.cpp) from the jpeg-ocl-test build (Properties->Configuration Properties->General->Excluded From Build->Yes).

Without the emulator your command arguments look like this:

-gpu -i ThreePalms.jpg -rlejpeg_verify -jpegiqidct_verify -rlejpeg_realrun 10 -jpegiqidct_realrun 10

or

-gpu -i ThreePalms.jpg -rlejpeg_onepass -rlejpeg_onepass_verify -rlejpeg_realrun 10 -jpegiqidct_realrun 10 -jpegiqidct_onepass_realrun 10

If you decide to use OCL_EMU, your command line looks like the following:

-cpu -i ThreePalms.jpg -rlejpeg_verify -rlejpeg_emu -jpegiqidct_verify -jpegiqidct_emu

Now you can set a breakpoint inside both kernels if you run a “gpu_emu” device, or run a mixture of “gpu_emu”, “cpu” and “gpu” devices.

Summary

In conclusion, I’d like to present a few numbers for various HW platforms. Keep in mind that I used a 4K×3K still color image for the measurements; to estimate HD performance you may need to scale all numbers down by approximately 6. All numbers in rows 1 to 6 are in ms.

I added the global memory bandwidth and the PCI transfer rate in GB/s as reference points. These numbers are not theoretical; they were benchmarked on the same computers I ran the RLE JPEG application on.

Several notes:

My claim of 3K fps for an HD frame with the GPU-based JPEG IQ/IDCT turned out to be pessimistic. A high-end GPU (Cayman) can reach a sustained rate of more than 5.5 Kfps with that kernel, and even the mobile part can do 1.2 Kfps. This makes the transfer rate reduction even more relevant.

The jump in overall system performance in the Barts column is caused by an old PCI bus (note the low transfer rate in row 13): the slower the bus, the more the compression helps.

Legend:

  1. Full coefficient matrix transfer and IQ/IDCT kernel (ms).
  2. RLE JPEG compression, token stream transfer, GPU decompression and IQ/IDCT as separate kernels (ms).
  3. RLE JPEG compression, token stream transfer, GPU decompression and IQ/IDCT combined into one kernel (ms).
  4. GPU decompression kernel (ms).
  5. IQ/IDCT kernel (ms).
  6. Combined decompression and IQ/IDCT kernel (ms).
  7. Frame rate of the “classical” pipeline measured in 1 (in fps).
  8. Frame rate of the new pipeline with a single kernel for decompression and IQ/IDCT, measured in 3 (in fps).
  9. Time spent on the global memory round trip in the 2-pass kernels (3 and 4).
  10. Theoretical time needed to make the round trip of 10.
  11. OpenCL global memory bandwidth (in GB/s).
  12. AMD OpenCL run-time PCI bus transfer rate (in GB/s).
HW    Cayman 6970   Barts 6870   Turks 6750M   Llano A8-350
 1    18            37.8         28            23
 2    7.6           15.6         13            19
 3    7.27          15.3         11            14
 4    0.65          0.87         3.7           6.7
 5    0.77          1.07         3.8           7.50
 6    1.0           1.62         5.5           7.60
 7    55            26           35            43
 8    129           65           90            71
 9    6.27          14.3         1             0.84
10    0.42          0.32         2             6.6
11    0.25          0.28         1             2.25
12    140           126          35            16
13    5.9/6.1       1.38/1.45    3.25/3.32     1.91/2.66

Appendix

Lossless RLE encoding definitions

  • A non-zero block is an 8×8 block having at least 1 non-zero coefficient.
  • A coefficient group is a group of 64 non-zero blocks.
  • A coefficient token is a unit of the compressed data stream; it occupies 3 dwords per each 4 non-zero coefficients and has the following structure:
struct coefficient_token {
   short coeff[4];               // 2 dwords: 4 non-zero coefficients
   uint  pos_in_8x8_block : 24;  // 4 packed 6-bit in-block positions
   uint  block8x8_id      : 6;   // block id inside the coefficient group
   uint  reserved         : 2;
};
  • The coeff field keeps 4 coefficients from the block identified by the block8x8_id field.
  • pos_in_8x8_block keeps the 4 positions (6 bits each) of the above coefficients in the block identified by the block8x8_id field.
  • All 4 coefficient values are valid: if the number of non-zero coefficients in the given block is not a multiple of 4, the coeff and pos_in_8x8_block fields are padded with the last valid coefficient.
  • block8x8_id keeps the block id in the coefficient group, and 0 <= block8x8_id < 64.
  • A coefficient token stream is a contiguous stream of coefficient tokens as defined above.
  • A compressed buffer consists of the fixed-size coefficient group directory and the variable-size coefficient token stream.
  • A coefficient group directory is an array of 2 dwords per coefficient group: the offset of the group in the token stream and the number of tokens in the group.

Note: For the CPU side of this implementation, please see the …\fw-jpeg\jdhuff.c file and search for the #ifdef O_CL string. You can find the GPU kernel implementation inside …\jpeg-ocl-test\amdRLEJPEG_kernel.cpp, amdRLEJPEG.hpp.