Salman Ul Haq, Jawad Masood, Aamir Majeed, Usman Aziz

10/11/2011

### Abstract

This article covers the implementation and optimization of the Advanced Encryption Standard (AES) on AMD GPUs using OpenCL™, which is fine-tuned for bulk encryption applications. Reliable encryption schemes are needed to ensure the information security of individuals, organizations and governments by protecting against potential threats. One particular scheme is the AES algorithm-based bulk encryption technique, which is based upon the Rijndael algorithm, a symmetric block cipher with 128-bit, 192-bit and 256-bit cipher keys. OpenCL™ also allows you to tap into the huge parallel processing power of GPUs for data parallel computing applications. This article begins by exploring the AES algorithm, focusing on a parallel breakdown of the problem and explaining suitable indexing schemes. This is followed by GPU-specific optimization strategies, such as using local memory, covering their relation to the memory bandwidth and computational intensity that is required. We finish the article by examining the final benchmarks that signify the acceleration achieved using AMD GPUs.

### Introduction

Information security is becoming increasingly important given the ever-increasing number of new applications in the public and private domain. There is a continuing trend to secure data in all of its uses, ranging from live communication to archived data storage. Unauthorized access to intercepted transmissions can result in the compromise of sensitive and vital information. Data managers around the world are thus facing an interesting dilemma: how to store data securely while still being able to access it quickly. Encryption is an effective solution for protecting valuable data assets against such attacks.

### Encryption

Encryption is the process of transforming information referred to as plain-text into an unintelligible code called cipher-text, using a secret key and an algorithm generally referred to as the cipher [1]. The cipher-text (encrypted data) can be decoded back into its original form using the same cipher algorithm and the secret key. In this process, critical information can be protected from hackers, competitors and others who would use the information for malicious intent.

Common uses for encryption technology are found in the static archiving of large amounts of sensitive data, as well as its communication over the local area network (LAN) or across an Internet gateway in the case of Wide Area Networks (WANs) or Virtual Private Networks (VPNs). Similar applications can also be abundantly found in the telecommunications industry and other proprietary setups dealing with data protection issues.

### Bulk Encryption

Bulk encryption provides a safe and effective way to protect data from compromise and theft, both while it is being transmitted and while it is stored.

Bulk encryption technology provides a method to encrypt large amounts of data during transmission or storage. The sheer amount of information that must be encrypted, however, leads to very long response times. Currently, the processing power requirements for bulk encryption are being met by hardware extensions in the form of cryptographic accelerators [2]. There is potential to use the parallel processing power of a GPU as a co-processor in a role similar to the one that existing hardware cryptographic solutions play.

### Bulk Encryption Methods

Most modern encryption algorithms or ciphers can be categorized in one of the following ways:

- Whether the same key is used for both encryption and decryption (symmetric key algorithms), or a different key is used for each (asymmetric key algorithms). A symmetric algorithm requires only a single key for both the encryption and decryption process. Encryption schemes that do not involve the use of a key are far less secure and subject to compromise. In fact, anyone who is in possession of the decryption algorithm can decipher any transmission written with that particular algorithm.
- Whether they work on blocks of symbols of a fixed size (block ciphers), or on a continuous stream of symbols (stream ciphers). A block cipher is a symmetric key cipher operating on fixed-length groups of bits, called blocks, with an unvarying transformation. A block cipher encryption algorithm might take (for example) a 128-bit block of plain-text as input, and output a corresponding 128-bit block of cipher-text. The exact transformation is controlled using a second input called the secret key. Decryption is similar; the decryption algorithm takes, in this example, a 128-bit block of cipher-text together with the secret key, and yields the original 128-bit block of plain-text. A message longer than the block size (128 bits in the above example) can still be encrypted with a block cipher by breaking the message into blocks and encrypting each block individually. Since all pure block ciphers have independent workloads, they are ideal candidates for parallel implementation.

### Advanced Encryption Standard

The Advanced Encryption Standard (AES) is a symmetric-key encryption standard adopted by the U.S. government and approved by the NSA for top secret information. The standard was adopted from a larger collection originally published as Rijndael [5]. The Rijndael cipher was developed by two Belgian cryptographers, Joan Daemen and Vincent Rijmen, and submitted by them to the AES selection process. AES is based on a design principle known as a substitution-permutation network. The standard comprises three block ciphers: AES-128, AES-192 and AES-256. Each of these ciphers has a 128-bit block size, with key sizes of 128, 192 and 256 bits, respectively. The AES ciphers have been analyzed extensively and are now used worldwide, as was the case with their predecessor, the Data Encryption Standard (DES) [5].

AES was selected due to the level of security it offers and its well documented implementation and optimization techniques [6]. Furthermore, AES is very efficient in terms of both time and memory requirements. The block ciphers have high computation intensity and independent workloads (apply the same steps to different blocks of plain text), so acceleration using a GPU is the next logical step.

### AES Algorithm

In this section, we will provide a brief overview of the AES algorithm and the working of its major constituent computations.

The AES block cipher operates on a 4×4 array of bytes (128 bits), termed the state. For the AES algorithm, the size of the input block, the output block and the state is 128 bits. This is represented by Nb = 4, which reflects the number of 32-bit words (number of columns) in the state array. The permissible lengths of the Cipher Key, K, are 128, 192, and 256 bits. The key length is represented by Nk = 4, 6, or 8, which reflects, again, the number of 32-bit words (number of columns) in the Cipher Key array [6].

The state is encrypted or decrypted by applying byte-oriented transformations for a specific number of rounds. The number of rounds to be performed is dependent on the key size. The number of rounds is represented by Nr, where Nr = 10 when Nk = 4, Nr = 12 when Nk = 6, and Nr = 14 when Nk = 8 [6].
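The three (Nk, Nr) pairs listed above all satisfy Nr = Nk + 6, so the round count can be derived from the key length directly. The following is a minimal C sketch of that relation; the function name `aes_num_rounds` is an illustrative assumption and is not taken from the kernel described in this article.

```c
#include <assert.h>

/* Number of AES rounds (Nr) for a cipher key of key_bits bits.
   Nk is the key length in 32-bit words; the three permitted
   sizes give Nk = 4, 6, 8 and Nr = Nk + 6 = 10, 12, 14. */
int aes_num_rounds(int key_bits)
{
    assert(key_bits == 128 || key_bits == 192 || key_bits == 256);
    int nk = key_bits / 32;
    return nk + 6;
}
```

A kernel that receives the key length as an argument could compute its loop bound this way instead of branching on each key size.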

The AES algorithm specifies both cipher and its inverse for the complete encrypt-decrypt cycle. The Forward Cipher takes plain-text as input along with the cipher-key and its output is the encrypted data or cipher-text. The Inverse Cipher takes this cipher-text as input and decrypts it back to plain-text using the same cipher-key used for encryption.

The AES algorithm consists of the following phases:

- **Key Expansion.** Round keys are derived from the cipher key using Rijndael’s key schedule.
- **Initial Round.**
  - **AddRoundKey:** each byte of the state is combined with the round key using a bit-wise operation.
- **Middle Rounds.** For round = 1 to Nr - 1, repeatedly perform the following transformations:
  - **SubBytes:** a non-linear substitution step where each byte is replaced with another according to a lookup table.
  - **ShiftRows:** a transposition step where each row of the state is shifted cyclically a certain number of steps.
  - **MixColumns:** a mixing operation which operates on the columns of the state, combining the four bytes in each column.
  - **AddRoundKey:** same as described above.
- **Final Round (no MixColumns).**
  - **SubBytes:** same as described above.
  - **ShiftRows:** same as described above.
  - **AddRoundKey:** same as described above.

**Figure 1: AES Forward Cipher Flow Graph**

### AES Transformations

For both the Forward and Inverse Cipher, the AES algorithm uses a round function that is composed of four different byte-oriented transformations [6]:

**SubBytes Transformation.** The SubBytes transformation is a non-linear byte substitution that operates independently on each byte of the state. In this step, each byte of the state array is updated using an 8-bit substitution box, the Rijndael S-box. This operation provides the non-linearity in the cipher and helps avoid attacks based upon algebraic manipulation. The S-box used is derived by combining the multiplicative inverse over GF($2^8$), known to have good non-linearity properties, with an invertible affine transformation. The complete S-box table is displayed below in Figure 2 [6].

**Figure 2: AES S-BOX**

**AddRoundKey Transformation.** In the AddRoundKey transformation, a Round Key is added to the state by a simple bitwise XOR operation. A Round Key is derived for each round from the cipher key using Rijndael’s key schedule. Each Round Key is the same size as the state. AddRoundKey can be mathematically represented as follows [6]:

$$[s'_{0,c},\ s'_{1,c},\ s'_{2,c},\ s'_{3,c}] = [s_{0,c},\ s_{1,c},\ s_{2,c},\ s_{3,c}] \oplus [w_{round \cdot Nb + c}] \quad \text{for } 0 \le c < Nb$$

where $[w_i]$ are the expanded key words, and round is a value in the range 0 ≤ round ≤ Nr. In the Cipher, the initial Round Key addition occurs when round = 0, prior to the first application of the round function.

**ShiftRows Transformation.** The ShiftRows transformation operates on the rows of the state and cyclically shifts the bytes in each row by a different offset (number of bytes). For AES, the first row is left unchanged. Each byte of the second row is shifted one byte to the left. Similarly, the third and fourth rows are shifted by offsets of two and three bytes respectively. In this way, each column of the output state of the ShiftRows step is composed of bytes from each column of the input state. Specifically, the ShiftRows transformation proceeds as follows [6]:

$$s'_{r,c} = s_{r,\,(c + shift(r,\,Nb)) \bmod Nb} \quad \text{for } 0 < r < 4 \text{ and } 0 \le c < Nb$$

where the shift value shift(r, Nb) depends on the row number, r, as follows:

shift(1,4) = 1; shift(2,4) = 2; shift(3,4) = 3.

**MixColumns Transformation.** In the MixColumns transformation, the four bytes of each column of the state are combined using an invertible linear transformation. Each column is treated as a polynomial over GF($2^8$) and is multiplied by the coefficient polynomial $a(x) = 3x^3 + x^2 + x + 2$ (modulo $x^4 + 1$). The coefficients are displayed in their hexadecimal equivalent of the binary representation of bit polynomials from GF($2^8$). This transformation, in conjunction with the ShiftRows transformation, provides diffusion in the original message, spreading out any non-uniform patterns.

The AES transformations discussed above are the forward transformations used by the Forward Cipher to encrypt plain-text. The Inverse Cipher uses the inverse transformations to decrypt cipher-text back to plain-text. AddRoundKey is a bitwise XOR and is therefore its own inverse, so it remains the same for both the Forward and Inverse Cipher. SubBytes is replaced by the InvSubBytes transformation, which takes the substitution values from the Inverse S-box table. ShiftRows and MixColumns are likewise replaced by InvShiftRows, which shifts the rows in the opposite direction, and InvMixColumns, which multiplies each column by the inverse of the coefficient polynomial [6].

Key expansion takes the input key (cipher key) of 128, 192 or 256 bits and produces an expanded key for use in the rounds of subsequent stages. The expanded key’s size is related to the number of rounds to be performed: one 128-bit round key is needed for the initial round and one for each of the Nr rounds. For 128-bit keys, there are 10 rounds and the expanded key size is 1408 bits. For 192-bit and 256-bit keys, the number of rounds increases to 12 and 14 respectively, with an overall expanded key size of 1664 and 1920 bits. During each round, a different portion of the expanded key is used in the AddRoundKey step.
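The key schedule and the sizes quoted above can be cross-checked with a short host-side C sketch that follows the key expansion routine of FIPS-197 [6]. This is an illustration under stated assumptions and not the article's host code: the names `aes_expand_key`, `sub_word` and `sbox_byte` are hypothetical, and the S-box entries are computed on the fly to keep the listing short, where a real implementation would use a table. Words are packed as in FIPS-197, with the first key byte in the most significant position. The article's AddRoundKey code reads row x from bits 8x to 8x+7 of each word, so a schedule prepared for that kernel would presumably pack the bytes in the opposite order.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by x in GF(2^8), reducing by the AES polynomial 0x11b. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* General GF(2^8) multiplication by shift-and-add. */
static uint8_t gmul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return p;
}

/* One S-box entry: multiplicative inverse followed by the affine map. */
static uint8_t sbox_byte(uint8_t v)
{
    uint8_t inv = 0;
    for (int c = 1; c < 256 && v != 0; c++) {
        if (gmul(v, (uint8_t)c) == 1) { inv = (uint8_t)c; break; }
    }
    uint8_t s = inv, r = inv;
    for (int k = 0; k < 4; k++) {
        r = (uint8_t)((r << 1) | (r >> 7));   /* rotate left by one bit */
        s ^= r;
    }
    return (uint8_t)(s ^ 0x63);
}

/* Apply the S-box to each of the four bytes of a word. */
static uint32_t sub_word(uint32_t w)
{
    return ((uint32_t)sbox_byte((uint8_t)(w >> 24)) << 24) |
           ((uint32_t)sbox_byte((uint8_t)(w >> 16)) << 16) |
           ((uint32_t)sbox_byte((uint8_t)(w >> 8)) << 8) |
            (uint32_t)sbox_byte((uint8_t)w);
}

/* Expands a key of nk words into w and returns the number of
   32-bit words written, Nb * (Nr + 1) with Nb = 4, Nr = nk + 6. */
int aes_expand_key(const uint8_t *key, int nk, uint32_t *w)
{
    assert(nk == 4 || nk == 6 || nk == 8);
    int total = 4 * (nk + 6 + 1);
    uint8_t rcon = 0x01;
    for (int i = 0; i < nk; i++) {
        w[i] = ((uint32_t)key[4*i] << 24) | ((uint32_t)key[4*i + 1] << 16) |
               ((uint32_t)key[4*i + 2] << 8) | (uint32_t)key[4*i + 3];
    }
    for (int i = nk; i < total; i++) {
        uint32_t t = w[i - 1];
        if (i % nk == 0) {
            /* RotWord, SubWord, then XOR with the round constant. */
            t = sub_word((t << 8) | (t >> 24)) ^ ((uint32_t)rcon << 24);
            rcon = xtime(rcon);
        } else if (nk > 6 && i % nk == 4) {
            t = sub_word(t);   /* extra substitution for 256-bit keys */
        }
        w[i] = w[i - nk] ^ t;
    }
    return total;
}
```

The return value times 32 gives the expanded key size in bits: 44, 52 and 60 words correspond to the 1408, 1664 and 1920 bits stated above.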

### Modes of Operation

A block cipher by itself allows encryption of a single data block of size equal to the cipher’s block size. Modes of operation enable the repeated and secure use of a block cipher, on multiple data blocks, under a single key [7]. When targeting a variable-length message, the data must first be partitioned into separate cipher blocks. Typically, the last block must also be extended to match the cipher’s block length using a suitable padding scheme. A mode of operation describes the process of encrypting each of these data blocks, and generally uses randomization based on an additional input value, often called an initialization vector.
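As an illustration of the partitioning and padding step, the sketch below pads a message to a whole number of 16-byte blocks. The article does not say which padding scheme was used; PKCS#7 is assumed here purely as a common example, and the names `padded_length` and `pad_pkcs7` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define AES_BLOCK_BYTES 16

/* Length after PKCS#7 padding: always adds between 1 and 16 bytes,
   so a message that is already block-aligned gains a full block. */
size_t padded_length(size_t msg_len)
{
    return (msg_len / AES_BLOCK_BYTES + 1) * AES_BLOCK_BYTES;
}

/* Copies msg into out and appends the padding bytes. out must hold
   padded_length(msg_len) bytes. Returns the number of 16-byte blocks. */
size_t pad_pkcs7(const uint8_t *msg, size_t msg_len, uint8_t *out)
{
    assert(msg != NULL && out != NULL);
    size_t total = padded_length(msg_len);
    uint8_t pad = (uint8_t)(total - msg_len);   /* value in 1..16 */
    memcpy(out, msg, msg_len);
    memset(out + msg_len, pad, pad);            /* each pad byte holds the pad count */
    return total / AES_BLOCK_BYTES;
}
```

The returned block count is also the number of independent cipher invocations the message requires.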

There are different modes under which encryption can take place, where some modes are inherently more secure and some lend themselves more to parallelism. The following table lists various modes of operation along with their inherent level of parallelism. For details on the modes of operation, see [7] in the References section.

| Mode of Operation | Parallelism |
| --- | --- |
| Electronic codebook (ECB) | High |
| Counter (CTR) | High |
| Cipher-block chaining (CBC) | Low |
| Cipher feedback (CFB) | Low |
| Output feedback (OFB) | Low |

The ECB mode is the most parallel of these modes. The message is divided into blocks, each block is encrypted with an identical key, and there is no serial dependence between the blocks. The advantage of ECB mode is this extensive parallelism, which scales well to the GPU architecture. The disadvantage is that identical plain-text blocks are encrypted into identical cipher-text blocks; thus, it does not hide data patterns well and the large scale structures in the plain-text are preserved [7].

In the Counter (CTR) mode, the large scale structures that may have been present in the original plain-text are diminished. The cipher-text blocks obtained by encrypting two identical plain-text blocks using CTR mode are completely different. This provides a better security level than the ECB mode [7]. We have implemented the ECB mode of operation, which is not only parallel but can also be easily extended to CTR.
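The difference between the two modes, and the reason an ECB implementation extends easily to CTR, can be sketched in C. The block function below, `toy_block_encrypt`, is a deliberately trivial placeholder and is neither AES nor secure; it only stands in for the cipher so that the mode logic can be shown in a few lines. All names, the 8-byte nonce and the counter layout are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Placeholder block transform (NOT AES, NOT secure). */
static void toy_block_encrypt(const uint8_t *in, const uint8_t *key, uint8_t *out)
{
    for (int i = 0; i < BLOCK; i++) {
        uint8_t v = (uint8_t)(in[i] ^ key[i]);
        out[i] = (uint8_t)((v << 3) | (v >> 5));   /* rotate left by 3 bits */
    }
}

/* ECB: cipher-text block b depends only on plain-text block b. */
void ecb_encrypt(const uint8_t *pt, uint8_t *ct, size_t nblocks, const uint8_t *key)
{
    assert(pt != NULL && ct != NULL);
    for (size_t b = 0; b < nblocks; b++)
        toy_block_encrypt(pt + BLOCK * b, key, ct + BLOCK * b);
}

/* CTR: encrypt (nonce || block index) and XOR the result with the data.
   Blocks are still independent; the same function also decrypts. */
void ctr_crypt(const uint8_t *in, uint8_t *out, size_t nblocks,
               const uint8_t *key, const uint8_t *nonce)
{
    for (size_t b = 0; b < nblocks; b++) {
        uint8_t ctr[BLOCK], ks[BLOCK];
        memcpy(ctr, nonce, 8);
        for (int i = 0; i < 8; i++)   /* big-endian block index */
            ctr[8 + i] = (uint8_t)((uint64_t)b >> (8 * (7 - i)));
        toy_block_encrypt(ctr, key, ks);
        for (int i = 0; i < BLOCK; i++)
            out[BLOCK * b + i] = (uint8_t)(in[BLOCK * b + i] ^ ks[i]);
    }
}
```

In both loops each iteration touches only its own block, which is what would let one work-item handle one value of `b`. With two identical plain-text blocks, `ecb_encrypt` yields two identical cipher-text blocks, while `ctr_crypt` does not.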

### Exploiting Parallelism in AES

GPUs are massively parallel devices. Their SPMD architecture allows them to perform the same operation on multiple data sets simultaneously. AES is a block cipher that operates on 128-bit data chunks independently. This provides a high level of data parallelism, as the same operation is performed on each state block with no dependencies in between (provided that AES is being used in ECB mode). Multiple blocks can therefore be encrypted simultaneously, one per thread, which is exactly the kind of workload GPUs are built for. As the data size increases so does the level of parallelism, making GPUs more efficient for bulk encryption. The only serial operation in AES is the key expansion, which provides the round keys to be used in subsequent rounds. However, given its serial nature and the fact that it is a one-time operation, key expansion can be safely moved to the CPU host code for better performance. The figure explaining the parallel nature of AES is included in the Design Approach section.

### OpenCL™ Implementation of AES

Now that we have a fair understanding of the AES encryption algorithm and its different ingredients, such as key-expansion and AES-Transformations, we are ready to start an implementation using OpenCL™ specific to an AMD GPU. In the remaining part of this article, we will discuss our design approach in detail, including an efficient indexing scheme for handling input and output data, and device functions for various AES-Transformations. Later on, we will investigate memory optimization strategies, such as using local memory, constant cache and coalesced memory accesses along with their impact on throughput performance and memory bandwidth.

### Design Approach

In this section we will discuss the level of parallelism we want to exploit, the portion of code that should be ported to GPU, and the Host-Device work division.

In the current approach, we will exploit parallelism only at the block level without changing the original algorithm. (Breaking the algorithm down further can improve the results, though that discussion is outside the scope of this article.) Each work-item will take one state block as input and convert it to cipher-text. This implies that the Global Work-Size is directly proportional to the exploited parallelism. Encryption of a single 128-bit state block will remain serial; however, we will use loop unrolling to optimize the code. Another serial operation in AES is the key expansion, which provides the round keys to be used in subsequent rounds. However, given its serial nature and the fact that it is a one-time operation, key expansion can be safely moved to the CPU host code for better performance. Figure 3 below explains the parallelism in AES as well as our design approach.

**Figure 3: Parallel AES Breakdown**
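With one work-item per 128-bit block, the Global Work-Size follows directly from the input size. The sketch below shows one way the host code might compute it; the function name `aes_global_work_size` is hypothetical. It rounds the block count up to a multiple of the work-group size, because OpenCL™ 1.x requires the global size to be evenly divisible by an explicitly specified local size. A kernel launched this way would need a guard so that surplus work-items do nothing.

```c
#include <assert.h>
#include <stddef.h>

#define AES_BLOCK_BYTES 16

/* One work-item per 16-byte state block, rounded up to a whole
   number of work-groups of local_size work-items each. */
size_t aes_global_work_size(size_t input_bytes, size_t local_size)
{
    assert(local_size > 0 && input_bytes % AES_BLOCK_BYTES == 0);
    size_t blocks = input_bytes / AES_BLOCK_BYTES;
    return ((blocks + local_size - 1) / local_size) * local_size;
}
```

For example, 1 MB of padded input with a work-group size of 64 gives 65536 work-items, which illustrates why larger inputs expose more parallelism.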

### Algorithm Phases & GPU Kernels

In this section we will discuss the device functions that implement the AES transformations in OpenCL™. We will also list sample code for each of these transformations.

In the SubBytes transformation, each state value is replaced by the S-box entry whose index equals that state value. For example, if $S_{1,1}$ = {53}, then the substitution value from the S-box is found at the intersection of the row with index ‘5’ and the column with index ‘3’. This process is explained in Figure 4 below.

**Figure 4: AES SubBytes Transformation**

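Sample code for the SubBytes transformation is listed next:

```c
/* SubBytes: replace every state byte with its S-box entry. */
for (i = 0; i < 16; i++) {
    x = i & 0x03;   /* row index */
    y = i >> 2;     /* column index */
    state[4*x + y] = gpu_AES_sbox[state[4*x + y]];
}
```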
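The lookup table itself can be generated from the S-box definition given earlier (multiplicative inverse in GF($2^8$) followed by the affine transformation). The following host-side C sketch builds both the forward and inverse tables; how the article's `gpu_AES_sbox` and `gpu_AES_isbox` buffers were actually filled is not stated, so this is only one plausible way, and the function names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return p;
}

/* Fills the forward S-box and its inverse (256 bytes each). */
void build_sboxes(uint8_t *sbox, uint8_t *inv_sbox)
{
    for (int v = 0; v < 256; v++) {
        /* Multiplicative inverse by exhaustive search; 0 maps to 0. */
        uint8_t inv = 0;
        for (int c = 1; c < 256 && v != 0; c++) {
            if (gf_mul((uint8_t)v, (uint8_t)c) == 1) { inv = (uint8_t)c; break; }
        }
        /* Affine map: inv ^ rotl(inv,1) ^ ... ^ rotl(inv,4) ^ 0x63. */
        uint8_t s = inv, r = inv;
        for (int k = 0; k < 4; k++) {
            r = (uint8_t)((r << 1) | (r >> 7));
            s ^= r;
        }
        s ^= 0x63;
        sbox[v] = s;
        inv_sbox[s] = (uint8_t)v;
    }
    assert(sbox[0x00] == 0x63);   /* sanity check on the first entry */
}
```

With these tables, `sbox[0x53]` evaluates to `0xed`, matching the row 5, column 3 entry used as the example in FIPS-197 [6].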

In the ShiftRows transformation, the bytes in the last three rows of the state are cyclically shifted over different numbers of bytes. The first row, r = 0, is not shifted. Each byte of the second row is shifted one byte to the left. Similarly, the third and fourth rows are shifted by offsets of two and three bytes respectively. This has the effect of moving bytes to “lower” positions in the row, while the “lowest” bytes wrap around into the “top” of the row. Figure 5 below explains the ShiftRows transformation. Here S represents the state array and S′ is the n_state array.

The code for ShiftRows transformation is listed below. The shift rows transformation uses both n_state and the state buffers as it is not an in-place transform:

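```c
/* ShiftRows: row x is rotated left by x bytes; the result goes to n_state. */
for (i = 0; i < 16; i++) {
    x = i & 0x03;   /* row index */
    y = i >> 2;     /* column index */
    n_state[4*x + y] = state[4*x + ((y + x) & 0x03)];
}
```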

**Figure 5: AES ShiftRows Transformation**

In the AddRoundKey transformation, a Round Key is added to the state by a simple bitwise XOR operation. Each Round Key consists of Nb words from the expanded key obtained from the Key-Expansion function, described earlier. Figure 6 depicts the AddRoundKey transformation.

The sample code for AddRoundKey transformation is listed next:

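```c
/* AddRoundKey: XOR each state byte with byte x of round-key word y. */
for (i = 0; i < 16; i++) {
    x = i & 0x03;   /* row index */
    y = i >> 2;     /* column index */
    state[4*x + y] = state[4*x + y] ^ ((keysched[y] & (0xff << (x*8))) >> (x*8));
}
```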

**Figure 6: AES AddRoundKey Transformation**
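The sample code indexes `keysched[y]` for the current round only. As a sketch of how the round number could select the right portion of the expanded key, the C function below applies the FIPS-197 rule that round key `round` consists of words `w[round*Nb]` to `w[round*Nb + Nb - 1]` [6]. The function name and the `round` and `nr` parameters are assumptions for illustration; the byte extraction assumes, as the sample code does, that row x of a column sits in bits 8x to 8x+7 of the word.

```c
#include <assert.h>
#include <stdint.h>

#define NB 4   /* number of 32-bit words (columns) in the state */

/* state uses the article's layout: state[4*row + col].
   w is the expanded key; nr is the total number of rounds. */
void add_round_key(uint8_t *state, const uint32_t *w, int round, int nr)
{
    assert(round >= 0 && round <= nr);
    const uint32_t *rk = w + round * NB;   /* this round's four words */
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;   /* row index */
        int y = i >> 2;     /* column index */
        state[4*x + y] ^= (uint8_t)(rk[y] >> (8 * x));
    }
}
```

Because the operation is a plain XOR, applying it twice with the same round key restores the original state.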

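In the MixColumns transformation, each column is treated as a polynomial over GF($2^8$) and is multiplied modulo $x^4 + 1$ by the coefficient polynomial a(x) [6] shown here:

$$a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$$

The MixColumns transformation updates each column of the state using a matrix multiplication, as explained by the following equation [6]:

$$\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix} \quad \text{for } 0 \le c < Nb$$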

Figure 7 below explains the MixColumns transformation.

**Figure 7: AES MixColumns Transformation**
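No sample code is listed for MixColumns, so the following C sketch shows one common way to implement the matrix multiplication described above, using the article's `state[4*row + col]` layout. It is an illustration and not the authors' kernel code; the helper `xtime` (multiplication by {02}) and the function name `mix_columns` are assumptions. Multiplication by {03} is computed as `xtime(s) ^ s`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply by {02} in GF(2^8), reducing by the AES polynomial 0x11b. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* MixColumns on a state stored as state[4*row + col]. */
void mix_columns(uint8_t *state)
{
    assert(state != NULL);
    for (int c = 0; c < 4; c++) {
        uint8_t s0 = state[c], s1 = state[4 + c];
        uint8_t s2 = state[8 + c], s3 = state[12 + c];
        /* Rows of the matrix: 02 03 01 01 / 01 02 03 01 / 01 01 02 03 / 03 01 01 02 */
        state[c]      = (uint8_t)(xtime(s0) ^ (xtime(s1) ^ s1) ^ s2 ^ s3);
        state[4 + c]  = (uint8_t)(s0 ^ xtime(s1) ^ (xtime(s2) ^ s2) ^ s3);
        state[8 + c]  = (uint8_t)(s0 ^ s1 ^ xtime(s2) ^ (xtime(s3) ^ s3));
        state[12 + c] = (uint8_t)((xtime(s0) ^ s0) ^ s1 ^ s2 ^ xtime(s3));
    }
}
```

The column values are read into temporaries first, so the transform can safely update the state in place.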

### Kernel Execution Stages

By now we have explained how various AES Transformations are implemented in OpenCL™, and we are ready to discuss the kernel codes for both Forward and Inverse AES Ciphers.

**Forward Cipher**

We explain the working of our kernel by considering the simplest case, in which a single work-item operates on a 128-bit state block. The kernel arguments are the input and output buffers, the AES fixed-table buffers and the expanded key buffer. In addition, a key-length parameter, passed to the kernel as an argument, adds the flexibility to use all three allowed key sizes: 128, 192 and 256 bits. The number of rounds to be performed is calculated from the key length. The 128-bit state is copied from the global plain-text buffer into registers for computing. Two blocks of state size are created in the register file, as not all of the AES transformations can be performed in place. The input is copied to the state block in registers using a special access pattern to allow coalescing (more on this later). The forward AES transformations are applied to the state block as described by the AES flow graph. The resulting cipher-text block is copied back to the global cipher-text buffer using the same indexing scheme that was followed while copying plain-text to the state.

**Inverse Cipher**

In this section we will discuss the major changes required to convert the Forward Cipher into the Inverse Cipher used for decryption. The Inverse Cipher essentially runs the Forward Cipher in reverse order. The AES transformations used in the Inverse Cipher are the inverse versions of the previously discussed forward transforms.

The Inverse Cipher incorporates minor changes in the transformations, the order of execution and the required AES-Tables. For example the InvSubBytes transform, which is the inverse of SubBytes transform, requires Inverse S-box table instead of the S-box table. The code for InvSubBytes is shown here:

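```c
/* InvSubBytes: replace every state byte with its inverse S-box entry. */
for (i = 0; i < 16; i++) {
    x = i & 0x03;   /* row index */
    y = i >> 2;     /* column index */
    state[4*x + y] = gpu_AES_isbox[state[4*x + y]];
}
```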

AddRoundKey remains the same, as a bitwise XOR is its own inverse. ShiftRows and MixColumns are likewise replaced by their inverses, InvShiftRows and InvMixColumns [6]. The order in which these transforms are applied also differs from the Forward Cipher. Figure 8 displays the flow graph for the Inverse Cipher.

**Figure 8: AES Inverse Cipher Flow-Graph**
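As an example of an inverse transform other than InvSubBytes, the sketch below pairs the article's ShiftRows loop with an InvShiftRows counterpart as defined in FIPS-197 [6], where row x is rotated to the right by x bytes. The function names are assumptions, and both functions use the article's `state[4*row + col]` layout with separate input and output buffers.

```c
#include <assert.h>
#include <stdint.h>

/* Forward transform, as in the article: row x rotated left by x bytes. */
void shift_rows(const uint8_t *state, uint8_t *n_state)
{
    assert(state != n_state);   /* not an in-place transform */
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;   /* row index */
        int y = i >> 2;     /* column index */
        n_state[4*x + y] = state[4*x + ((y + x) & 0x03)];
    }
}

/* Inverse transform: row x rotated right by x bytes. */
void inv_shift_rows(const uint8_t *state, uint8_t *n_state)
{
    assert(state != n_state);
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;
        int y = i >> 2;
        n_state[4*x + y] = state[4*x + ((y + 4 - x) & 0x03)];
    }
}
```

Applying `inv_shift_rows` to the output of `shift_rows` returns the original state, which is a quick check that a decryption kernel uses the correct direction.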

### Indexing Schemes

Now we will examine the input and output indexing schemes for a simple AES kernel with a single work-item in detail. We will then explain what needs to be added to run the kernel with multiple work-groups and larger work-group sizes. The described indexing scheme applies to both Forward and Inverse Cipher.

At the beginning of the Forward or the Inverse Cipher, the input array is copied to the state array according to the following convention [6]:

$$S[r, c] = In[r + 4c] \quad \text{for } 0 \le r < 4 \text{ and } 0 \le c < Nb$$

where Nb = 4 for our case. This is the column major access pattern as depicted in Figure 9 below. The code for column major access pattern for input is listed here:

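```c
/* Load one 16-byte block into the state: S[x][y] = In[x + 4*y]. */
for (i = 0; i < 16; i++) {
    x = i & 0x03;   /* row index */
    y = i >> 2;     /* column index */
    state[4*x + y] = gpu_input[i];
}
```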

**Figure 9: AES Data Patterns**

At the end of the Forward or the Inverse Cipher, the state array is copied to the output array as follows:

$$Out[r + 4c] = S[r, c] \quad \text{for } 0 \le r < 4 \text{ and } 0 \le c < Nb$$

The code for output indexing pattern is listed here:

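```c
/* Store the state back to the output: Out[x + 4*y] = S[x][y]. */
for (i = 0; i < 16; i++) {
    x = i & 0x03;   /* row index */
    y = i >> 2;     /* column index */
    gpu_output[i] = state[4*x + y];
}
```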

Generalizing this indexing scheme to accommodate more threads requires some mechanism for identifying which thread is being executed. A new variable named idx is introduced, which holds the Global Id of each thread as obtained from the OpenCL™ runtime. Now, the input array will be copied to the state array as follows:

```c
/* Load the block belonging to work-item idx. */
for (i = 0; i < 16; i++) {
    x = i & 0x03;   /* row index */
    y = i >> 2;     /* column index */
    state[4*x + y] = gpu_input[i + 16*idx];
}
```

The net offset for each thread is 16*idx, as each thread handles 16 elements (Bytes) of the input array. The same holds for writing the data to the output array after completion of Encryption or Decryption Process:

```c
/* Store the block belonging to work-item idx. */
for (i = 0; i < 16; i++) {
    x = i & 0x03;   /* row index */
    y = i >> 2;     /* column index */
    gpu_output[i + 16*idx] = state[4*x + y];
}
```
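The indexing scheme can be checked on the host before running it on the device. The C sketch below wraps the two loops in functions and replaces the NDRange with an ordinary loop over `idx`; on the device each value of `idx` would instead come from `get_global_id(0)`. The function names are hypothetical and no cipher is applied, so the output should equal the input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load the block of work-item idx into the state: S[r][c] = In[r + 4c]. */
void load_state(const uint8_t *gpu_input, size_t idx, uint8_t *state)
{
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;   /* row index */
        int y = i >> 2;     /* column index */
        state[4*x + y] = gpu_input[i + 16*idx];
    }
}

/* Store the state of work-item idx: Out[r + 4c] = S[r][c]. */
void store_state(const uint8_t *state, uint8_t *gpu_output, size_t idx)
{
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;
        int y = i >> 2;
        gpu_output[i + 16*idx] = state[4*x + y];
    }
}

/* Host-side stand-in for the NDRange: one loop iteration per work-item. */
void copy_blocks(const uint8_t *in, uint8_t *out, size_t nblocks)
{
    assert(in != NULL && out != NULL);
    for (size_t idx = 0; idx < nblocks; idx++) {
        uint8_t state[16];
        load_state(in, idx, state);
        /* the AES transformations would be applied to state here */
        store_state(state, out, idx);
    }
}
```

Each iteration reads and writes only bytes 16*idx to 16*idx + 15, so no two work-items touch the same data.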

### Memory Optimizations

Here we will discuss the drawbacks in the basic implementation described above and suggest improvements to overcome these.

In the basic implementation we have used only the Global memory available on the GPU. Global memory has the lowest bandwidth of the memory spaces available on the GPU, and its main disadvantage is long access latency. Another drawback is the heavy resource usage per thread, as all the calculations take place in the register file, which limits the number of parallel threads and degrades performance. The possible memory optimizations include the use of local and constant memory. The performance gains obtained with these optimizations are included in the results; however, a detailed discussion is outside the scope of this article.

### Key Expansion using Local Memory

We have also investigated key expansion on the GPU using local memory. In this approach the original, unexpanded cipher key is passed directly to the GPU as a kernel argument. As local memory is consistent only across a single work-group [8], we need to expand the key for each work-group separately. An array named keysched, equal in size to the expanded key, is created in local memory. The first work-item (tid = 0) of each work-group copies the cipher key into keysched and calls the key expansion function. The expanded key in the keysched array is then accessible to all the work-items in that work-group. A barrier is placed in the kernel after the key expansion function to make sure that no thread proceeds with the AES transformations before the key expansion is complete. The incurred overhead is small, as key expansion is not compute intensive, and it is offset by the much faster access to the expanded key in local memory. Figure 10 below explains the key expansion process at the work-group level. Here tid is the local ID of each work-item within a work-group. A condition (tid == 0) is evaluated to direct only the first thread of each work-group to the key expansion procedure. The rest of the threads wait at the barrier until the first thread reaches it after key expansion is complete. All the threads are then ready to proceed with the AES transformations.

**Figure 10: Key Expansion using Local Memory**

### Performance Results

Performance tests were carried out on two different machines, both running a 64-bit version of the Windows® 7 operating system and AMD APP SDK v2.3 with OpenCL™ 1.1 support. The kernel execution times have been measured using the AMD APP Profiler v2.1. The hardware details for both systems are described below.

| SPECIFICATIONS | TEST SYSTEM 1 | TEST SYSTEM 2 |
| --- | --- | --- |
| CPU | Intel Core i7 930 2.80GHz | Intel Core i3 370M 2.40GHz |
| GPU | ATI Radeon™ HD 5870 | ATI Mobility Radeon™ HD 5650 |
| MEMORY (RAM) | 8GB DDR3 | 4GB DDR3 |

Due to the inherent parallelism in the AES algorithm, it shows better performance gains for large data sizes, which are suited for bulk encryption. In the benchmarks, we validated this through performance results taken on various input sizes. The results also show the impact of various optimization techniques applied to the standard implementation to further increase the performance, especially by reducing global memory calls and moving more data to constant and local caches.

Figure 11 shows a performance comparison of various AES kernels. The graph has been plotted with input size (megabytes) on the horizontal axis and the kernel execution time (milliseconds) for 256-bit AES on the vertical axis.

**Figure 11: Performance comparison of AES kernels.**

Figure 12 shows a performance comparison of the fully optimized AES kernels on various hardware. The benchmarks for the Core i7 CPU were obtained using 8 threads in OpenCL™.

**Figure 12: Hardware for AES kernels**

Figure 13 displays the AES speedup chart. These speedups are measured against the Core i7 CPU running 8 OpenCL™ threads. We have achieved up to a 16x speedup at larger input sizes.

**Figure 13: AES speedup chart**

### Conclusion

The results presented in this article demonstrate the viability of implementing the AES algorithm on AMD GPUs, which show considerable speedups compared to the current generation Intel processor or commodity graphics cards. We have obtained a speedup of up to 16 times with the ATI Radeon™ HD 5870 GPU, while the ATI Mobility Radeon™ HD 5650 GPU shows a speedup of up to 3 times.

### References

[1] http://en.wikipedia.org/wiki/Encryption viewed 20 March, 2011.

[2] “AES Encryption Implementation and Analysis on Commodity Graphics Processing Units” Owen Harrison and John Waldron, 2007.

[3] http://en.wikipedia.org/wiki/Cipher viewed 20 March, 2011.

[4] http://en.wikipedia.org/wiki/Block_cipher viewed 20 March, 2011.

[5] http://en.wikipedia.org/wiki/Advanced_Encryption_Standard viewed 20 March, 2011.

[6] “Announcing the ADVANCED ENCRYPTION STANDARD (AES)” Federal Information Processing Standards Publications, November 26, 2001.

[7] http://en.wikipedia.org/wiki/Modes_of_operation viewed 20 March, 2011.

[8] ATI Stream Computing OpenCL™ Programming Guide, Ch-4 OpenCL™ performance and optimization, June 2010.

### GLOSSARY

**Forward Cipher** Series of transformations that converts plain-text to cipher-text using the Cipher Key.

**Cipher Key** Secret, cryptographic key that is used by the Key Expansion routine to generate a set of Round Keys.

**Cipher-Text** Data output from the Cipher or input to the Inverse Cipher.

**Inverse Cipher** Series of transformations that converts cipher-text to plain-text using the Cipher Key.

**Plain-Text** The data to be encrypted: data input to the Forward Cipher or output from the Inverse Cipher.

**Round Key** Round keys are values derived from the Cipher Key using the Key Expansion routine; they are applied to the State in the Forward and Inverse Cipher.

**Host** A standard CPU device running the main operating system.

**Compute Device** A GPU or CPU device that provides the processing power for OpenCL™. In our case, a GPU device.

**Host Code** A C/C++ program executing on the Host to setup the OpenCL™ resources and invoke the Kernel Code.

**Kernel Code** The parallel executable code which is executed on the Compute Device, also called the Device Function.

**Global Memory** The DRAM video memory available on Graphics Cards.

**Local Memory** High bandwidth on-chip memory available to each compute unit.