Porting CUDA Applications to OpenCL
Note: ATI Stream Technology is now called AMD Accelerated Parallel Processing (APP) Technology.
If you have an application already written in C for CUDA, converting it to OpenCL™ is relatively straightforward. Porting a CUDA application to OpenCL™ is often simply a matter of finding the equivalent syntax for the various keywords and built-in functions in your kernels, and converting your runtime API calls to the equivalent OpenCL™ calls. Some things to keep in mind:

  • Pointers in OpenCL kernels must be annotated with their memory space. For example, a pointer to local memory is declared as __local int* p;. This applies to kernel arguments as well: data passed to a kernel is usually in arrays represented by __global pointers (see the sketch after this list).
  • CUDA encourages the use of scalar code in kernels. While this works in OpenCL as well, depending on the desired target architecture, it may be more efficient to write programs operating on OpenCL’s vector types, such as float4, as opposed to pure scalar types. This is useful for both AMD CPUs and AMD GPUs, which can operate efficiently on vector types. OpenCL also provides flexible swizzle/broadcast primitives for efficient creation and rearrangement of vector types.
  • CUDA does not provide rich facilities for task parallelism, and so it may be beneficial to think about how to take advantage of OpenCL’s task parallelism as you port your application.
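
As a quick illustration of the first two points, here is a minimal OpenCL C kernel sketch; the kernel name and arguments are hypothetical, not taken from the white paper. The arguments are __global pointers, and the arithmetic uses the float4 vector type.

```c
/* Minimal sketch: __global address-space qualifiers on kernel arguments
 * and a float4 vector add. All names are illustrative. */
__kernel void vec_add4(__global const float4* a,
                       __global const float4* b,
                       __global float4* c)
{
    size_t i = get_global_id(0);   /* one work-item per float4 element */
    c[i] = a[i] + b[i];            /* adds four float lanes in a single operation */
}
```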

The following information is extracted from the “OpenCL™ and the ATI Stream SDK v2.0” white paper:

Terminology 

C for CUDA terminology | OpenCL terminology
---------------------- | ------------------
Thread                 | Work-item
Thread block           | Work-group
Global memory          | Global memory
Constant memory        | Constant memory
Shared memory          | Local memory
Local memory           | Private memory

Table 1 : General Terminology

Table 1 lists some general terminology for describing computations and memory spaces in both C for CUDA and OpenCL. These terms are fairly similar between the two systems.

Writing Kernels: Qualifiers

C for CUDA terminology | OpenCL terminology
---------------------- | ------------------
__global__ function (callable from host, not callable from device) | __kernel function (callable from device, including CPU device)
__device__ function (not callable from host) | No annotation necessary
__constant__ variable declaration | __constant variable declaration
__device__ variable declaration | __global variable declaration
__shared__ variable declaration | __local variable declaration

Table 2 : Qualifiers for Kernel Functions

Table 2 shows qualifiers that are added to functions and data when writing kernels in both CUDA and OpenCL. The biggest difference between the two is that in CUDA, __global__ functions are GPU entry points and __device__ functions execute on the GPU but are not callable from the host. In OpenCL, entry-point functions are annotated with the __kernel qualifier, while non-entry-point functions need no annotation.
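
To make the mapping concrete, here is a hedged sketch of an OpenCL kernel that calls a helper function; the names are hypothetical. In C for CUDA, the helper would carry __device__ and the entry point __global__, while in OpenCL only the entry point needs a qualifier.

```c
/* Helper function: no qualifier needed in OpenCL (CUDA would use __device__). */
float squared(float x)
{
    return x * x;
}

/* Entry point: marked __kernel in OpenCL (CUDA would use __global__). */
__kernel void square_all(__global const float* in,
                         __global float* out)
{
    size_t i = get_global_id(0);
    out[i] = squared(in[i]);
}
```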

Writing Kernels: Indexing

C for CUDA terminology | OpenCL terminology
---------------------- | ------------------
gridDim   | get_num_groups()
blockDim  | get_local_size()
blockIdx  | get_group_id()
threadIdx | get_local_id()
No direct equivalent. Combine blockDim, blockIdx, and threadIdx to calculate a global index. | get_global_id()
No direct equivalent. Combine gridDim and blockDim to calculate the global size. | get_global_size()

Table 3 : Indexing functions for use in Kernels

Table 3 shows the various indexing mechanisms provided by CUDA and OpenCL. CUDA provides kernel indexing via special pre-defined variables, while OpenCL provides the equivalent information through function calls. OpenCL also provides global indexing information, while CUDA requires manual computation of global indices.
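
For example, the sketch below (hypothetical kernel name and arguments) obtains its global index directly from get_global_id(0); the equivalent C for CUDA code would compute blockIdx.x * blockDim.x + threadIdx.x by hand.

```c
/* Indexing sketch: OpenCL supplies the global work-item index directly. */
__kernel void scale(__global float* data, float factor, uint n)
{
    size_t gid = get_global_id(0);   /* CUDA: blockIdx.x * blockDim.x + threadIdx.x */
    if (gid < n)                     /* guard in case the global size was rounded up */
        data[gid] *= factor;
}
```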

Writing Kernels: Synchronization

C for CUDA terminology | OpenCL terminology
---------------------- | ------------------
__syncthreads()        | barrier()
__threadfence()        | No direct equivalent.
__threadfence_block()  | mem_fence(CLK_GLOBAL_MEM_FENCE \| CLK_LOCAL_MEM_FENCE)
No direct equivalent.  | read_mem_fence()
No direct equivalent.  | write_mem_fence()

Table 4 : Synchronization functions for use in Kernel Functions

Table 4 shows functions provided for synchronization in kernel functions. __syncthreads() and barrier() both provide a mechanism for synchronizing all work-items in a work-group, where calling the function suspends work-item execution until all work-items in the work-group have called the barrier.

__threadfence() and mem_fence() provide a more subtle mechanism for sharing data between work-items. Essentially, they force an ordering on outstanding memory transactions, which can allow for more sophisticated sharing of data. For example, mem_fence() forces all outstanding loads and stores to complete before execution proceeds, preventing the compiler, runtime, and hardware from reordering loads and stores across the mem_fence(). This can be used to ensure that all data produced by a work-group is flushed to global memory before the work-group signals another work-group that execution has completed, which opens up some possibilities for work-groups to communicate without terminating a kernel.
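
The following kernel sketch (hypothetical names) shows barrier() in the role __syncthreads() plays in CUDA: each work-item stages a value in __local memory, and the barrier guarantees every work-item's store is visible before any work-item reads a neighbour's value.

```c
/* Work-group synchronization sketch: stage data in __local memory,
 * then wait at the barrier before reading another work-item's value. */
__kernel void shift_left(__global const float* in,
                         __global float* out,
                         __local float* tile)        /* one element per work-item */
{
    size_t gid  = get_global_id(0);
    size_t lid  = get_local_id(0);
    size_t last = get_local_size(0) - 1;

    tile[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);    /* all work-items in the group arrive here */

    /* Now it is safe to read a value written by a different work-item. */
    out[gid] = (lid == last) ? tile[lid] : tile[lid + 1];
}
```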

Important API objects

C for CUDA terminology | OpenCL terminology
---------------------- | ------------------
CUdevice    | cl_device_id
CUcontext   | cl_context
CUmodule    | cl_program
CUfunction  | cl_kernel
CUdeviceptr | cl_mem
No direct equivalent. Closest approximation would be the CUDA Stream mechanism. | cl_command_queue

Table 5 : Selected API objects used in Host code

Table 5 shows some objects provided by the respective APIs. These objects are used in host code to control execution on various devices, manage data, and so forth. Of note is the cl_command_queue, which provides OpenCL's task-parallelism capabilities by allowing the developer to declare dependencies between tasks executing on a device. CUDA does not provide such flexibility; the closest thing it offers is the Stream mechanism, which allows kernels and memory transactions to be placed in independent streams. Streams are not as general as the task parallelism provided by OpenCL command queues: they do not allow for parallelism within a queue, and synchronizing between streams is difficult, whereas command queues provide parallelism both within and between queues, along with flexible synchronization through OpenCL events.
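
As a hedged host-side fragment (the queue, kernels, and work size are assumed to have been created earlier, and all names are hypothetical), an OpenCL event can express a dependence between two enqueued kernels, even on an out-of-order command queue:

```c
/* Task-parallelism sketch: kernel k_consume must wait for k_produce. */
cl_event produced;
clEnqueueNDRangeKernel(queue, k_produce, 1, NULL, &gsize, NULL,
                       0, NULL, &produced);   /* signals 'produced' on completion */
clEnqueueNDRangeKernel(queue, k_consume, 1, NULL, &gsize, NULL,
                       1, &produced, NULL);   /* waits on 'produced' before running */
clReleaseEvent(produced);
```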

Important API Calls

C for CUDA terminology | OpenCL terminology
---------------------- | ------------------
cuInit()              | No OpenCL initialization required
cuDeviceGet()         | clGetContextInfo()
cuCtxCreate()         | clCreateContextFromType()
No direct equivalent  | clCreateCommandQueue()
cuModuleLoad() [requires pre-compiled binary] | clCreateProgramWithSource() or clCreateProgramWithBinary()
No direct equivalent. CUDA programs are compiled off-line | clBuildProgram()
cuModuleGetFunction() | clCreateKernel()
cuMemAlloc()          | clCreateBuffer()
cuMemcpyHtoD()        | clEnqueueWriteBuffer()
cuMemcpyDtoH()        | clEnqueueReadBuffer()
cuFuncSetBlockShape() | No direct equivalent [functionality in clEnqueueNDRangeKernel()]
cuParamSeti()         | clSetKernelArg()
cuParamSetSize()      | No direct equivalent [functionality in clSetKernelArg()]
cuLaunchGrid()        | clEnqueueNDRangeKernel()
cuMemFree()           | clReleaseMemObject()

Table 6 : Selected API calls used in Host code

Table 6 lists some important API calls used in host code to set up parallel computations, execute them, and manage data on compute devices. For the most part, these functions are fairly similar, although the functionality is sometimes divided slightly differently, as shown in the table. The biggest difference is that OpenCL supports both runtime compilation and offline compilation of programs, whereas C for CUDA programs are compiled only offline. To precompile OpenCL kernels, developers can use the clGetProgramInfo() API call to retrieve a compiled binary and save it for later use, then use clCreateProgramWithBinary() to create an OpenCL program object from that binary.
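
The sketch below strings the OpenCL calls from Table 6 together in the order a port typically uses them, reusing the hypothetical "scale" kernel from the indexing sketch above. It assumes the runtime-compilation path, omits error checking for brevity, and uses made-up names (run_scale and so on); it illustrates the call sequence rather than reproducing code from the white paper.

```c
#include <CL/cl.h>

/* Hypothetical host function: build and run a kernel named "scale"
 * over n floats. Error checking omitted for brevity. */
void run_scale(const char* source, float* host_data, size_t n)
{
    cl_int err;

    /* cuCtxCreate() -> clCreateContextFromType() */
    cl_context ctx = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU,
                                             NULL, NULL, &err);

    /* cuDeviceGet() -> clGetContextInfo(): get the device
     * (assumes the context contains a single device) */
    cl_device_id dev;
    clGetContextInfo(ctx, CL_CONTEXT_DEVICES, sizeof(dev), &dev, NULL);

    /* No CUDA equivalent: create a command queue for the device */
    cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, &err);

    /* cuModuleLoad() -> clCreateProgramWithSource() + clBuildProgram() */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);

    /* cuModuleGetFunction() -> clCreateKernel() */
    cl_kernel kernel = clCreateKernel(prog, "scale", &err);

    /* cuMemAlloc() + cuMemcpyHtoD() -> clCreateBuffer() + clEnqueueWriteBuffer() */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                         host_data, 0, NULL, NULL);

    /* cuParamSet*() -> clSetKernelArg() */
    float factor = 2.0f;
    cl_uint count = (cl_uint)n;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &factor);
    clSetKernelArg(kernel, 2, sizeof(cl_uint), &count);

    /* cuFuncSetBlockShape() + cuLaunchGrid() -> clEnqueueNDRangeKernel() */
    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* cuMemcpyDtoH() -> clEnqueueReadBuffer() */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                        host_data, 0, NULL, NULL);

    /* cuMemFree() -> clReleaseMemObject(), plus the remaining release calls */
    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
}
```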


OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.