Note: ATI Stream Technology is now called AMD Accelerated Parallel Processing (APP) Technology.
If you have an application already written in C for CUDA, converting it to OpenCL™ is relatively straightforward. Porting your CUDA application to OpenCL™ is often simply a matter of finding the equivalent syntax for the various keywords and built-in functions in your kernels, and converting your runtime API calls to the equivalent OpenCL™ calls. Some things to keep in mind:
The following information is extracted from the “OpenCL™ and the ATI Stream SDK v2.0” white paper:
Table 1 lists some general terminology for describing computations and memory spaces in both C for CUDA and OpenCL. These terms are fairly similar between the two systems.
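To make the correspondence concrete, the following is a minimal sketch (not taken from the white paper; the kernel name and arguments are illustrative) of the same vector-add kernel written in C for CUDA and in OpenCL:

```c
// CUDA kernel (C for CUDA):
__global__ void vec_add(const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

// Equivalent OpenCL kernel: __global__ becomes __kernel, pointer
// arguments gain an address-space qualifier, and the pre-defined
// index variables become the get_global_id() built-in.
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
```

The body of the kernel is unchanged; only the qualifiers and the indexing expression differ.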
Writing Kernels: Qualifiers
Table 2 shows qualifiers that are added to functions and data when writing kernels in both CUDA and OpenCL. The biggest difference between the two is that in CUDA, __global__ functions are GPU entry points, and __device__ functions are to be executed on the GPU, but are not callable from the host. In OpenCL, entry point functions are annotated with the __kernel qualifier, but non-entry point functions do not need to be annotated.
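A minimal sketch of that difference (illustrative names, not from the white paper):

```c
// CUDA: __global__ marks a GPU entry point; __device__ marks a
// helper that runs on the GPU but is not callable from the host.
__device__ float square(float x) { return x * x; }
__global__ void squares(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = square(in[i]);
}

// OpenCL: only the entry point needs a qualifier (__kernel);
// the helper is written as ordinary C with no annotation.
float square(float x) { return x * x; }
__kernel void squares(__global float *out, __global const float *in)
{
    int i = get_global_id(0);
    out[i] = square(in[i]);
}
```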
Writing Kernels: Indexing
Table 3 shows the various indexing mechanisms provided by CUDA and OpenCL. CUDA provides kernel indexing via special pre-defined variables, while OpenCL provides the equivalent information through function calls. OpenCL also provides global indexing information, while CUDA requires manual computation of global indices.
Writing Kernels: Synchronization
Table 4 shows the functions provided for synchronization in kernel functions. __syncthreads() (CUDA) and barrier() (OpenCL) both synchronize all work-items in a work-group: a work-item that calls the function suspends execution until every work-item in the work-group has reached the barrier.
__threadfence() and mem_fence() provide a more subtle mechanism for sharing data between work-items. Essentially, they force various orderings on outstanding memory transactions, which can allow for more sophisticated sharing of data. For example, mem_fence() forces all outstanding loads and stores to be completed before execution proceeds, preventing the compiler, runtime, and hardware from reordering any loads and stores through the mem_fence(). This can be used to ensure that all data produced in a work-group is flushed to global memory before proceeding to signal another work-group that execution has completed, which opens up some possibilities for work-groups to communicate without terminating a kernel.
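As an illustration of the barrier mechanism, the following sketch is a common work-group sum reduction in OpenCL (an assumed pattern, not from the white paper; kernel and argument names are illustrative). The CUDA version would substitute __syncthreads() for barrier():

```c
// Work-group sum reduction: barrier() guarantees every work-item's
// partial result is visible in local memory before any work-item
// reads it.
__kernel void wg_sum(__global const float *in,
                     __global float *out,
                     __local float *scratch)
{
    int lid  = get_local_id(0);
    int size = get_local_size(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);        // all stores to scratch complete

    for (int stride = size / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);    // re-synchronize after each step
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}
```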
Important API objects
Table 5 shows some objects provided by the respective APIs, which are used in host code to control execution on various devices, manage data, and so forth. Of note is the cl_command_queue, which provides OpenCL's task-parallelism capabilities by allowing the developer to declare dependencies between tasks executing on a device. CUDA does not provide such flexibility; the closest thing it offers is its Stream mechanism, which allows kernels and memory transactions to be placed in independent streams. This is not as general as the task parallelism provided by OpenCL Command Queues: streams do not allow for parallelism within a queue, and synchronizing between streams is difficult, while Command Queues provide parallelism both within and between queues, as well as flexible synchronization through the use of OpenCL events.
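A host-side sketch of that event-based dependency declaration (illustrative only: `queue`, `producer_kernel`, `consumer_kernel`, and the size variables are assumed to have been created and set up elsewhere):

```c
// Two kernels on one queue, with an OpenCL event expressing the
// dependence of the second on the first.
cl_event produced;
clEnqueueNDRangeKernel(queue, producer_kernel, 1, NULL,
                       &global_size, &local_size,
                       0, NULL, &produced);
// consumer_kernel may not start until `produced` completes,
// even if the queue executes out of order.
clEnqueueNDRangeKernel(queue, consumer_kernel, 1, NULL,
                       &global_size, &local_size,
                       1, &produced, NULL);
clReleaseEvent(produced);
```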
Important API Calls
Table 6 lists some important API calls used in host code to set up parallel computations, execute them, and manage data on compute devices. For the most part, these functions are fairly similar, although the functionality is sometimes divided slightly differently, as shown in the table. The biggest difference is that OpenCL supports both runtime compilation and offline compilation of programs, whereas CUDA only compiles programs offline. To precompile OpenCL, developers can use the clGetProgramInfo() API call to retrieve a compiled binary and save it for later use, along with the clCreateProgramWithBinary() call to create an OpenCL program object from that binary.
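A sketch of that precompilation round-trip for the common single-device case (`program`, `context`, and `device` are assumed to exist, with `program` already built; error checking omitted):

```c
// Retrieve the compiled binary from a built cl_program ...
size_t binary_size;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                 sizeof(binary_size), &binary_size, NULL);

unsigned char *binary = malloc(binary_size);
clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                 sizeof(binary), &binary, NULL);
// ... save `binary` to disk; later, recreate the program from it:
cl_int status;
cl_program precompiled = clCreateProgramWithBinary(
    context, 1, &device, &binary_size,
    (const unsigned char **)&binary, &status, NULL);
```

Note that CL_PROGRAM_BINARIES expects an array of pointers, one per device associated with the program; the single-element form above only covers the one-device case.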
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.