If you haven’t heard the news already, AMD has just released drivers supporting the latest OpenCL™ 2.0 standard from Khronos. We think this marks a huge milestone on the path toward better heterogeneous compute acceleration. OpenCL 2.0 implements many of the advances AMD has been discussing as part of our Heterogeneous System Architecture (HSA) initiative. Notably, sharing memory that holds pointer-based data structures between GPU and CPU devices can greatly simplify the steps involved in enlisting the GPU for compute acceleration. The ability of the GPU device to launch compute tasks itself, via the OpenCL 2.0 Device Enqueue feature, opens up a much more powerful programming model for compute kernels. The generic address space is another big programmability gain over OpenCL 1.2: pointers can be declared without a named address space qualifier, which simplifies the OpenCL memory model. OpenCL 2.0 also introduces a new memory object called a Pipe, which organizes data as a FIFO and suits applications with a producer-consumer design. These and other advances of OpenCL 2.0 will help you tap into the tremendous performance potential of modern heterogeneous systems.
Comprehensive OpenCL 2.0 support
In concert with the OpenCL 2.0 driver, AMD has made available a Beta version of the AMD APP SDK 3.0 with what we believe is the industry’s most comprehensive OpenCL 2.0 support. AMD APP SDK 3.0 Beta contains a complete set of sample code illustrating how to use each of the major new features of OpenCL 2.0. Some of these samples have been published over the past weeks in a blog series found here.
For the Linux® developers out there targeting AMD APUs, AMD APP SDK 3.0 Beta provides a glimpse into some of the most interesting optional features defined in OpenCL 2.0: fine-grained buffer SVM with platform atomics. These features allow you to share data objects coherently between the CPU and GPU with fine-grained synchronization. This programming construct has long been available on multi-core CPUs, and now it’s the GPU’s turn to join the party. Over time, expect to see this feature much more widely available, but you can check it out today by downloading the AMD APP SDK 3.0 Beta.
One of the cool things about this OpenCL 2.0 announcement is that this new functionality is already supported on recent APUs and GPUs from AMD. See here for the full list of supported products. If you’ve got one of these products, you can start programming and deploying software with OpenCL 2.0 today!
There are a few other items also worth mentioning for AMD APP SDK 3.0 Beta. We’ve added support for the Bolt 1.3 library, with new samples for the Bolt C++ AMP library as well as a sample demonstrating SPIR 1.2 binary consumption. We’ve also improved the installation process by providing a Web-based installer that lets you download only what you choose, while still allowing the downloaded package to be distributed locally to the rest of your team. (It is Windows only for now; stay tuned for announcements on Linux support.) We’ve also updated the OpenCL Programming Guide with many improvements, including full coverage of OpenCL 2.0 features. Check it out: we think it’s the best OpenCL programmer’s guide available.
And if you’re curious, at the end of this post you will find a list of the new and updated samples in the SDK. A quick look at the list will give you an idea of the new power and capability available to you in the AMD APP SDK 3.0 Beta.
We look forward to hearing from you on the forum. You can also comment by replying to this blog. We do listen.
New samples
|Sample||OpenCL™ 2.0 Feature||Description|
|SVMBinaryTreeSearch||SVM Coarse Grain||Demonstrates the coarse-grain Shared Virtual Memory (SVM) feature of OpenCL 2.0 using a Binary Tree search algorithm|
|SimplePipe||Pipe||Demonstrates the Pipe memory object and its APIs|
|PipeProducerConsumerKernels||Pipe||Demonstrates the Pipe as a data-sharing FIFO for a producer kernel and a consumer kernel|
|BuiltInScan||New Workgroup Built-in APIs||Demonstrates the work group level scan and work group level broadcast features introduced in OpenCL 2.0 using the PrefixSum algorithm|
|ImageBinarization||Image Read and Write||Demonstrates using images with read_write qualifier support, which is new in OpenCL 2.0|
|RecursiveGaussian_ProgramScope||Program Scope Variable||Demonstrates Program Scope Variables, a new feature of OpenCL 2.0, using a Recursive Gaussian filter implementation|
|SimpleGenericAddressSpace||Generic Address Space||Demonstrates the Generic Address Space feature introduced in OpenCL 2.0, which allows pointers to be declared without qualifying with a named address space|
|RangeMinimumQuery||Shared Virtual Memory pointer with offset||Demonstrates passing a pointer with an offset as a kernel argument, which is new in OpenCL 2.0, using a Range Minimum Query algorithm|
|SVMAtomicsBinaryTreeInsert||SVM Fine Grain Buffer + Platform Atomics||Demonstrates the Fine Grain SVM buffer with Platform atomics using a Binary Tree node insertion algorithm|
|CalcPie||C++11 Atomics||Demonstrates atomics in OpenCL 2.0 by calculating the value of Pi using Monte Carlo analysis|
|FineGrainSVM||SVM Fine Grain Buffer + C++11 Atomics||Demonstrates the memory model for loads and stores from the C++11 standard, which OpenCL 2.0 adopts (requires Linux APU device)|
|FineGrainSVMCAS||SVM Fine Grain Buffer + C++11 Atomics||Demonstrates the compare-and-swap atomic operation “atomic_compare_exchange”, introduced in OpenCL 2.0 (adopted from the C11 standard; requires Linux APU device)|
|RegionGrowingSegmentation||Device-side Enqueue||Demonstrates how to use the device-side enqueue feature of OpenCL 2.0 for a Region Growing Segmentation algorithm|
|DeviceEnqueueBFS||Device-side Enqueue||Demonstrates Breadth First Search implementation using the device-side enqueue feature of OpenCL 2.0.|
|ExtractPrimes||Device-side Enqueue + New Workgroup Built-in APIs||Demonstrates the new workgroup built-ins and device-side enqueue in a prime-number-finding algorithm|
|SimpleSPIR||SPIR Consumption (Not an OpenCL 2.0 feature)||Demonstrates SPIR code consumption using OpenCL APIs|
|SimpleDepthImage||Depth Image||Demonstrates the depth Image APIs|
Updated samples
|Sample||OpenCL™ 2.0 Feature||Description|
|GlobalMemoryBandwidth||Shared Virtual Memory||Measures the peak bandwidth of the device buffer. For devices on OpenCL version 2.0 and higher, it additionally shows peak bandwidth for the Shared Virtual Memory (SVM) buffer|
|BinarySearch_DeviceSideEnqueue||Device-side Enqueue||Enhanced to use device-side enqueue for Binary Search on an OpenCL 2.0 device. Uses iterative host-side enqueue for OpenCL 1.x devices|
|BufferImageInterop||Buffer Image Interop||Updated to skip checking for the BufferImageInterop extension on an OpenCL 2.0 compliant platform as it is a core feature of OpenCL 2.0|
|BufferBandwidth||Shared Virtual Memory||Now also measures the SVM buffer bandwidth on an OpenCL 2.0 compliant device|
Marty Johnson is Director of Product Engineering at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.