Developer Central
China  |  India
  • Home
  • Tools & SDKs
  • Resources
  • Community
  • Partners
  • Support
  • Home
  • Tools & SDKs
  • Resources
  • Community
  • Partners
  • Support
  • Home
  • Tools & SDKs
  • Resources
  • Community
  • Partners
  • Support
  • Home
  • Tools & SDKs
  • Resources
  • Community
  • Partners
  • Support

Resources

  • Heterogeneous Computing
    • OpenCL™ Zone
      • Getting Started with OpenCL
      • Tools and Libraries
      • Programming in OpenCL™
        • Introductory Exercises and Tutorials
        • Debugging Applications
        • Optimizing Applications
        • Benchmarking Performance
        • Porting CUDA Applications to OpenCL™
        • Image Convolution Using OpenCL™
          • Image Convolution Using OpenCL™ – A Step-by-Step Tutorial Step 2
          • Image Convolution Using OpenCL™ – A Step-by-Step Tutorial Step 3
          • Image Convolution Using OpenCL™ – A Step-by-Step Tutorial Step 4
          • Image Convolution Using OpenCL™ – A Step-by-Step Tutorial Step 5
          • Image Convolution Using OpenCL™ – A Step-by-Step Tutorial Step 6
      • Training & Events
        • OpenCL™ Course: Introduction to OpenCL™ Programming
        • OpenCL™ Course: Introductory Tutorial to OpenCL™ for HPC at SAAHPC’10
        • OpenCL Programming Webinar Series
        • OpenCL™ On-Demand Webinars
      • Articles & Papers
      • Getting Started – Software & Hardware
    • What is Heterogeneous Computing?
    • What is Heterogeneous System Architecture (HSA)?
      • A Heterogenius Architecture
  • Documentation Library
  • Hardware & Drivers
    • CCC Driver Details
    • “Magny-Cours” Zone
    • ATI Catalyst™ PC Vendor ID (1002) LI
  • AFDS Videos
  • Documentation & Articles
    • Develop Blazing Fast Code with Microsoft Visual Studio® 2008 and AMD Tools
    • Exploiting Multi-Core Processors in Windows Vista
    • Performance Optimization of Windows Applications on AMD Processors, Part I
    • Performance Optimization of Windows Applications on AMD Processors, Part II
    • Ten Things Developers Should Know About Windows 7
    • The Windows NUMA API-What It Is and Why You Care
    • Articles & Whitepapers
      • OpenCL™ Optimization Case Study: Diagonal Sparse Matrix Vector Multiplication Test
      • Barcelona’s Innovative Architecture Is Driven by a New Shared Cache
      • Bulk Encryption on GPUs
      • Develop Blazing Fast Code with Microsoft Visual Studio® 2008 and AMD Tools
      • Going to Barcelona: A Modern Architecture for Breakthrough Software Performance
      • Introduction to “Magny-Cours”
      • Java Performance when Debugging is Enabled
      • JPEG Decoding with Run-Length Encoding: A CPU and GPU Approach
      • New Round-to-Even Technique for Large-scale Data and Its Application in Integer Scaling
      • OpenCL™ and the AMD APP SDK
      • OpenCL™ and the AMD APP SDK v2.4
      • OpenCL™ Optimization Case Study Fast Fourier Transform – Part 1
      • OpenCL™ Optimization Case Study Fast Fourier Transform – Part II
      • OpenCL™ Optimization Case Study: Simple Reductions
      • OpenCL™ Optimization Case Study: Support Vector Machine Training
      • Tiled Convolution: Fast Image Filtering
    • Developer Guides & Manuals
    • Specifications & Technical Bulletins
    • Case Studies
    • Conference Presentations
      • GPU Technical Publications
      • GPU Technology Papers
    • Videos
      • AMD Developer Inside Track
      • Intro to CodeAnalyst
      • OpenCL™ Technical Overview
      • GPU Demo Videos
      • AMD & Sun Technology
      • AMD Opteron 6100 Series: A Developer’s Perspective
      • Software Optimization Video Series
      • Xen Summit North America 2010
    • Java™ Zone
    • Knowledge Base
    • OpenGL® Zone
      • OpenGL® Specifications
    • Samples & Demos
      • Processor and Core Enumeration Using CPUID
      • GPU Demos
        • Radeon™ HD 7900 Series Graphics Real-Time Demos
        • Radeon™ HD 6900 Series Graphics Real-Time Demo
        • Radeon™ HD 5000 Series Graphics Real-Time Demos
        • Radeon™ HD 4800 Series Real-Time Demos
        • FireGL™ V8600 PCI-Express Real-Time Demos
        • Radeon™ HD 3000 Series Real-Time Demo
        • Radeon™ HD 2000 Series Real-Time Demos
  • India Developer Zone
    • India University Courses
    • University Kit & Book
    • C-DAC “Think Parallel” participants visits at AMD – 20th June, 2012
    • C-DAC HeGaPa 2012 Conference
    • Heterogeneous computing Jobs in AMD India
  • Archive
    • Events
      • AMD OpenCL Coding Competition
      • Real-Time Image Processing for Autonomous Learning and Control within 3D Virtual Worlds
      • Semi-Supervised Learning-Based Method for Adaptive Shadow Detection
      • AMD OpenCL™ Coding Competition
      • Real-time Video Effects with AMD & Kinect
      • Numerical Simulation of an X-Ray Generator
    • AppShowcase Archive
    • Archived Tools
      • Video Player Test
      • CPU Tools Archive
        • 128-Bit SSE5 Instruction Set
        • AMD String Library
        • Framewave Project
        • SSEPlus Project
      • GPU Tools Archive
        • ATI Stream Software Development Kit (SDK) v2.0 Beta Program
        • AMD Tootle
        • ASHLI – Advanced Shading Language Interface
        • ATI Radeon™ SDK
        • ATI Stream Software Development Kit (SDK) v1.4-beta
          • ATI Stream SDK MD5 Checksums
        • ATI_Compress
        • CubeMapGen
        • AMD GPU MeshMapper
        • GPU PerfStudio
        • Normal Mapper
        • RenderMonkey™ Toolsuite
          • RenderMonkey Toolsuite – IDE Features
          • RenderMonkey™ Toolsuite – Testimonials
          • RenderMonkey™ Toolsuite – SDK
        • The Compressonator
        • TruForm Resources
          • TruForm™ FAQ
      • Installing GCC on Ubuntu 8.04

Home > Resources > Heterogeneous Computing > OpenCL™ Zone > Programming in OpenCL™ > Porting CUDA Applications to OpenCL™

Porting CUDA Applications to OpenCL™

 


Note: ATI Stream Technology is now called AMD Accelerated Parallel Processing (APP) Technology.
If you have an application already written in C for CUDA, it is simple and relatively straightforward to convert the program to OpenCL™.Porting your CUDA applications to OpenCL™ is often simply a matter of finding the equivalent syntax for various keywords and built-in functions in your kernel. You also need to convert your runtime API calls to the equivalent calls in OpenCL™.Some things to keep in mind:

  • Pointers in OpenCL kernels must be annotated with their memory space. For example, a pointer to local memory would be declared as __local int* p; This applies to kernel arguments as well: data being passed to a kernel is usually in arrays represented by __global pointers.
  • CUDA encourages the use of scalar code in kernels. While this works in OpenCL as well, depending on the desired target architecture, it may be more efficient to write programs operating on OpenCL’s vector types, such as float4, as opposed to pure scalar types. This is useful for both AMD CPUs and AMD GPUs, which can operate efficiently on vector types. OpenCL also provides flexible swizzle/broadcast primitives for efficient creation and rearrangement of vector types.
  • CUDA does not provide rich facilities for task parallelism, and so it may be beneficial to think about how to take advantage of OpenCL’s task parallelism as you port your application.

The following information is extracted from the “OpenCL™ and the ATI Stream SDK v2.0” white paper:

Terminology 

C for CUDA terminology OpenCL terminology
Thread Work-item
Thread block Work-group
Global memory Global memory
Constant memory Constant memory
Shared memory Local memory
Local memory Private memory

Table 1 : General Terminology

Table 1 lists some general terminology for describing computations and memory spaces in both C for CUDA and OpenCL. These terms are fairly similar between the two systems.

Writing Kernels: Qualifiers

C for CUDA terminology OpenCL terminology
__global__ function (callable from host, not callable from device) __kernel function (callable fromdevice, includingCPU device)
__device__ function (not callable from host) No annotation necessary
__constant__ variable declaration __constant variable declaration
__device__ variable declaration __global variable declaration
__shared__ variable declaration __local variable declaration

Table 2 : Qualifiers for Kernel Functions

Table 2 shows qualifiers that are added to functions and data when writing kernels in both CUDA and OpenCL. The biggest difference between the two is that in CUDA, __global__ functions are GPU entry points, and __device__ functions are to be executed on the GPU, but are not callable from the host. In OpenCL, entry point functions are annotated with the __kernel qualifier, but non-entry point functions do not need to be annotated.

Writing Kernels: Indexing

C for CUDA terminology OpenCL terminology
gridDim get_num_groups()
blockDim get_local_size()
blockIdx get_group_id()
threadIdx get_local_id
No direct equivalent. Combine blockDim, blockIdx, and threadIdx to calculate a global index. get_global_id()
No direct equivalent. Combine gridDim and blockDim to calculate the global size. get_global_size()

Table 3 : Indexing functions for use in Kernels

Table 3 shows the various indexing mechanisms provided by CUDA and OpenCL. CUDA provides kernel indexing via special pre-defined variables, while OpenCL provides the equivalent information through function calls. OpenCL also provides global indexing information, while CUDA requires manual computation of global indices.

Writing Kernels: Synchronization

C for CUDA terminology OpenCL terminology
__syncthreads() barrier()
__threadfence() No direct equivalent.
__threadfence_block() mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE)
No direct equivalent. read_mem_fence()
No direct equivalent. write_mem_fence()

Table 4 : Synchronization functions for use in Kernel Functions

Table 4 shows functions provided for synchronization in kernel functions. __syncthreads() and barrier() both provide a mechanism for synchronizing all work-items in a work-group, where calling the function suspends work-item execution until all work-items in the work-group have called the barrier.

__threadfence() and mem_fence() provide a more subtle mechanism for sharing data between work-items. Essentially, they force various orderings on outstanding memory transactions, which can allow for more sophisticated sharing of data. For example, mem_fence() forces all outstanding loads and stores to be completed before execution proceeds, disallowing the compiler, runtime, and hardware from reordering any loads and stores through the mem_fence(). This can be used to ensure that all data produced in a work-group is flushed to global memory before proceeding to signal another work-group that execution has completed, which opens up some possibilities for work-groups to communicate without terminating a kernel.

Important API objects

C for CUDA terminology OpenCL terminology
CUdevice cl_device_id
CUcontext cl_context
CUmodule cl_program
CUfunction cl_kernel
CUdeviceptr cl_mem
No direct equivalent. Closest approximation would be the CUDA Stream mechanism. cl_command_queue

Table 5 : Selected API objects used in Host code

Table 5 shows some objects provided by the respective APIs, which are used in host code to control execution on various devices, manage data, and so forth. Of note is the cl_command_queue, which provides OpenCL’s task parallelism capabilities, by allowing the developer to declare dependences between tasks executing on a device. CUDA does not provide such flexibility – the closest thing CUDA provides is their Stream mechanism, which allows kernels and memory transactions to be placed in independent streams. This is not as general as OpenCL’s task parallelism capabilities provided by Command Queues, because it does not allow for parallelism within a queue, and synchronizing between streams is difficult, while Command Queues provide parallelism within and between queues, as well as flexible synchronization capabilities through the use of OpenCL events.

Important API Calls

C for CUDA terminology OpenCL terminology
cuInit() No OpenCL initialization required
cuDeviceGet() clGetContextInfo()
cuCtxCreate() clCreateContextFromType()
No direct equivalent clCreateCommandQueue()
cuModuleLoad() [requires pre-compiled binary] clCreateProgramWithSource() or clCreateProgramWithBinary()
No direct equivalent. CUDA programs are compiled off-line clBuildProgram()
cuModuleGetFunction() clCreateKernel()
cuMemAlloc() clCreateBuffer()
cuMemcpyHtoD() clEnqueueWriteBuffer()
cuMemcpyDtoH() clEnqueueReadBuffer()
cuFuncSetBlockShape() No direct equivalent [functionality in clEnqueueNDRangeKernel()]
cuParamSeti() clSetKernelArg()
cuParamSetSize() No direct equivalent [functionality in clSetKernelArg()]
cuLaunchGrid() clEnqueueNDRangeKernel()
cuMemFree() clReleaseMemObj()

Table 6 : Selected API calls used in Host code

Table 6 lists some important API calls used in host code to set up parallel computations and execute them, as well as manage data on compute devices. For the most part, these functions are fairly similar, although sometimes functionality is divided slightly differently, as shown in the table. The biggest difference is that OpenCL has both a runtime compiled model as well as allowing programs to be compiled offline, whereas CUDA only allows programs to be compiled off-line. To precompile OpenCL, developers can use the clGetProgramInfo() API call to retrieve a compiled binary and save it for later use, along with the clCreateProgramWithBinary() call to create an OpenCL program object from a compiled binary.

 Next Topic: Benchmarking Performance

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

 

Get the hcNewsFlash.

Your email address:

No SPAM.
Easy unsubscribe.

HSA is going to rock your world.

Learn more about Heterogeneous System Architecture.

Got Questions?

Ask the Developer Forums Community. They’ve got answers.

Resources

  • Heterogeneous Computing
    • OpenCL™ Zone
      • Getting Started with OpenCL
      • Tools and Libraries
      • Programming in OpenCL™
        • Introductory Exercises and Tutorials
        • Debugging Applications
        • Optimizing Applications
        • Benchmarking Performance
        • Porting CUDA Applications to OpenCL™
        • Image Convolution Using OpenCL™
          • Image Convolution Using OpenCL™ – A Step-by-Step Tutorial Step 2
          • Image Convolution Using OpenCL™ – A Step-by-Step Tutorial Step 3
          • Image Convolution Using OpenCL™ – A Step-by-Step Tutorial Step 4
          • Image Convolution Using OpenCL™ – A Step-by-Step Tutorial Step 5
          • Image Convolution Using OpenCL™ – A Step-by-Step Tutorial Step 6
      • Training & Events
        • OpenCL™ Course: Introduction to OpenCL™ Programming
        • OpenCL™ Course: Introductory Tutorial to OpenCL™ for HPC at SAAHPC’10
        • OpenCL Programming Webinar Series
        • OpenCL™ On-Demand Webinars
      • Articles & Papers
      • Getting Started – Software & Hardware
    • What is Heterogeneous Computing?
    • What is Heterogeneous System Architecture (HSA)?
      • A Heterogenius Architecture
  • Documentation Library
  • Hardware & Drivers
    • CCC Driver Details
    • “Magny-Cours” Zone
    • ATI Catalyst™ PC Vendor ID (1002) LI
  • AFDS Videos
  • Documentation & Articles
    • Develop Blazing Fast Code with Microsoft Visual Studio® 2008 and AMD Tools
    • Exploiting Multi-Core Processors in Windows Vista
    • Performance Optimization of Windows Applications on AMD Processors, Part I
    • Performance Optimization of Windows Applications on AMD Processors, Part II
    • Ten Things Developers Should Know About Windows 7
    • The Windows NUMA API-What It Is and Why You Care
    • Articles & Whitepapers
      • OpenCL™ Optimization Case Study: Diagonal Sparse Matrix Vector Multiplication Test
      • Barcelona’s Innovative Architecture Is Driven by a New Shared Cache
      • Bulk Encryption on GPUs
      • Develop Blazing Fast Code with Microsoft Visual Studio® 2008 and AMD Tools
      • Going to Barcelona: A Modern Architecture for Breakthrough Software Performance
      • Introduction to “Magny-Cours”
      • Java Performance when Debugging is Enabled
      • JPEG Decoding with Run-Length Encoding: A CPU and GPU Approach
      • New Round-to-Even Technique for Large-scale Data and Its Application in Integer Scaling
      • OpenCL™ and the AMD APP SDK
      • OpenCL™ and the AMD APP SDK v2.4
      • OpenCL™ Optimization Case Study Fast Fourier Transform – Part 1
      • OpenCL™ Optimization Case Study Fast Fourier Transform – Part II
      • OpenCL™ Optimization Case Study: Simple Reductions
      • OpenCL™ Optimization Case Study: Support Vector Machine Training
      • Tiled Convolution: Fast Image Filtering
    • Developer Guides & Manuals
    • Specifications & Technical Bulletins
    • Case Studies
    • Conference Presentations
      • GPU Technical Publications
      • GPU Technology Papers
    • Videos
      • AMD Developer Inside Track
      • Intro to CodeAnalyst
      • OpenCL™ Technical Overview
      • GPU Demo Videos
      • AMD & Sun Technology
      • AMD Opteron 6100 Series: A Developer’s Perspective
      • Software Optimization Video Series
      • Xen Summit North America 2010
    • Java™ Zone
    • Knowledge Base
    • OpenGL® Zone
      • OpenGL® Specifications
    • Samples & Demos
      • Processor and Core Enumeration Using CPUID
      • GPU Demos
        • Radeon™ HD 7900 Series Graphics Real-Time Demos
        • Radeon™ HD 6900 Series Graphics Real-Time Demo
        • Radeon™ HD 5000 Series Graphics Real-Time Demos
        • Radeon™ HD 4800 Series Real-Time Demos
        • FireGL™ V8600 PCI-Express Real-Time Demos
        • Radeon™ HD 3000 Series Real-Time Demo
        • Radeon™ HD 2000 Series Real-Time Demos
  • India Developer Zone
    • India University Courses
    • University Kit & Book
    • C-DAC “Think Parallel” participants visits at AMD – 20th June, 2012
    • C-DAC HeGaPa 2012 Conference
    • Heterogeneous computing Jobs in AMD India
  • Archive
    • Events
      • AMD OpenCL Coding Competition
      • Real-Time Image Processing for Autonomous Learning and Control within 3D Virtual Worlds
      • Semi-Supervised Learning-Based Method for Adaptive Shadow Detection
      • AMD OpenCL™ Coding Competition
      • Real-time Video Effects with AMD & Kinect
      • Numerical Simulation of an X-Ray Generator
    • AppShowcase Archive
    • Archived Tools
      • Video Player Test
      • CPU Tools Archive
        • 128-Bit SSE5 Instruction Set
        • AMD String Library
        • Framewave Project
        • SSEPlus Project
      • GPU Tools Archive
        • ATI Stream Software Development Kit (SDK) v2.0 Beta Program
        • AMD Tootle
        • ASHLI – Advanced Shading Language Interface
        • ATI Radeon™ SDK
        • ATI Stream Software Development Kit (SDK) v1.4-beta
          • ATI Stream SDK MD5 Checksums
        • ATI_Compress
        • CubeMapGen
        • AMD GPU MeshMapper
        • GPU PerfStudio
        • Normal Mapper
        • RenderMonkey™ Toolsuite
          • RenderMonkey Toolsuite – IDE Features
          • RenderMonkey™ Toolsuite – Testimonials
          • RenderMonkey™ Toolsuite – SDK
        • The Compressonator
        • TruForm Resources
          • TruForm™ FAQ
      • Installing GCC on Ubuntu 8.04

©2013 Advanced Micro Devices, Inc. OpenCL and the OpenCL logo are trademarks of Apple, Inc., used with permission by Khronos.

  • Contact Us
  • |
  • Careers
  • |
  • Site Map
  • |
  • Terms and Conditions
  • |
  • Privacy
  • |
  • Trademarks