More options for accelerating compute on AMD platforms
Our primary aim with the AMD APP SDK is to enable developers to leverage the processing power of heterogeneous compute (hc). OpenCL™ is a primary mechanism for achieving this, but with the release of the AMD APP SDK 2.8, we are moving to actively support a broader array of paths for accessing that compute power in your applications. This is a theme that you will see extended as we move towards heterogeneous system architecture (HSA) based solutions. OpenCL™ is not going away! Rather, our goal is to enable you to accelerate compute using your preferred programming language, and to provide mechanisms for you to leverage AMD’s compute acceleration while keeping the programming paradigm you already use.
Is your application in Java? No problem. Use the popular open source project Aparapi, and look forward to Project Sumatra. Project Sumatra illustrates a path to the future of heterogeneous compute your way – and this is just the beginning!
Is your application in C++? Again, no problem: use C++ AMP. Does it make heavy use of the C++ Standard Template Library (STL)? No problem, use Bolt (see below for more details on Bolt).
Each of these technologies, Aparapi, C++ AMP, and Bolt, leverages AMD’s hc acceleration to speed up your application. Samples that show you how are included with this SDK.
Bolt is an STL-compatible template library of data parallel primitives. The library is open source, and this release of the SDK includes a preview. Bolt provides a standard way to develop an application that can execute either on a regular CPU² or on any available OpenCL™ capable accelerated compute unit, with a single code path. In the near future we will have a beta release of Bolt; please join the Bolt forum to be eligible. For more information on Bolt, and on the drawing we are running for those who register for the Bolt beta, read my blog on Bolt here.
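To make "STL compatible" concrete, here is a minimal sketch of the kind of data parallel primitive call Bolt is modeled on. It uses only the standard library so that it compiles anywhere; the function name `saxpy` is a hypothetical helper, not part of the SDK samples. The comment about the Bolt equivalent is an assumption based on later Bolt releases and should be checked against the preview shipped with this SDK.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Computes a * x[i] + y[i] for every element with a single STL algorithm call.
// Hypothetical helper for illustration; it is not taken from the SDK samples.
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         const std::vector<float>& y) {
    assert(x.size() == y.size());  // both inputs must have the same length
    std::vector<float> result(x.size());
    // Binary std::transform applies the lambda to each (x[i], y[i]) pair.
    // Assumption: a Bolt version would swap this call for the library's own
    // transform (bolt::cl::transform in later Bolt releases) and would need
    // the functor made visible to the OpenCL compiler; verify in the Bolt docs.
    std::transform(x.begin(), x.end(), y.begin(), result.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
    return result;
}
```

Calling `saxpy(2.0f, x, y)` from your own `main` returns a new vector and leaves the inputs untouched. The point of the sketch is the call shape: iterators in, iterators out, one functor. That shape is what lets a library like Bolt keep a single code path while choosing where the work runs.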
But let’s not forget about OpenCL™. With this SDK release we are continuing to improve and extend our OpenCL™ solution: it adds support for the Khronos Direct3D 11 sharing extension, and it rounds out our atomics support with 64-bit atomics.
The AMD APP SDK 2.8 includes dozens of new and improved samples for OpenCL™, Aparapi and C++ AMP that deliver significantly faster performance than APP SDK 2.7: up to 2.3x faster³ on average in 9 key benchmarks. Furthermore, virtually all samples are licensed under permissive open source licenses.
Let’s not forget about tools. If you are not already using AMD’s latest tools, let me introduce you to CodeXL, AMD’s new unified tool suite for hc. CodeXL is being launched in conjunction with APP SDK 2.8 and can be downloaded here. CodeXL is a complete tool set for debugging, profiling, and analyzing OpenCL™ applications on both the CPU and the GPU; take a look at the CodeXL blog for more information. This is the leading tool set for application development using the AMD APP SDK.
Additionally, both the FFT and BLAS libraries have been updated for APPML 1.8: FFTs now support real-to-complex transforms, and all BLAS 2 and BLAS 3 functions are now supported.
If you have not done so already, join the rapidly expanding OpenCL™ ecosystem: download and install the APP SDK now. Then register for the Bolt forum to gain early access to the Bolt beta once it is available, get access to private forums, be entered into a drawing with a chance to win one of 5 AMD A8 APU based laptops, and be eligible for our Bolt sample competition, which will be announced early next year.
¹ For best results with APP SDK 2.8 we recommend that you update to AMD Catalyst 12.10 drivers or newer.
² Bolt will be supported for standard CPU code starting with the Bolt beta.
³ Tests conducted at AMD using performance optimized code samples from AMD APP SDK 2.8 compared to those from 2.7. On a notebook PC with AMD A10-4600M APU with Radeon™ HD 7660D graphics, 2x2GB of DDR-1600 RAM, and Windows® 7 Pro 64-bit, video driver 9.011, the times to execute the code samples are as follows: AESEncryptDecrypt (.024 seconds for SDK 2.7 vs .005 seconds for SDK 2.8); BinarySearch (.028 vs .003); BitonicSort (5.36 vs 0.14); RadixSort (.015 vs .011); ScanLargeArrays (.16 vs .08); Histogram (.20 vs .10); HistogramAtomics (.026 vs .025); QuasiRandomSequence (.037 vs .017); and MonteCarloAsian (1.84 vs .85) for an average time of .114 seconds for SDK 2.7 vs .034 seconds for SDK 2.8 across the 9 samples.
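For readers who want to check the benchmark arithmetic, the sketch below recomputes the averages from the nine timings listed in the footnote. The footnote does not say which kind of average was used. A plain arithmetic mean of these timings comes out near .85 and .14 seconds, while a geometric mean comes out close to the quoted .114 and .034, so the sketch assumes a geometric mean. That is an inference from the numbers, not something the footnote confirms. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One row of the benchmark footnote: execution time in seconds per SDK version.
struct Sample {
    const char* name;
    double sdk27_seconds;
    double sdk28_seconds;
};

// Timings copied from the footnote; nothing here is newly measured.
std::vector<Sample> footnote_samples() {
    return {
        {"AESEncryptDecrypt", 0.024, 0.005}, {"BinarySearch", 0.028, 0.003},
        {"BitonicSort", 5.36, 0.14},         {"RadixSort", 0.015, 0.011},
        {"ScanLargeArrays", 0.16, 0.08},     {"Histogram", 0.20, 0.10},
        {"HistogramAtomics", 0.026, 0.025},  {"QuasiRandomSequence", 0.037, 0.017},
        {"MonteCarloAsian", 1.84, 0.85},
    };
}

// Geometric mean: exp of the average of the logarithms.
// Assumption: this is the "average" the footnote refers to.
double geometric_mean(const std::vector<double>& values) {
    assert(!values.empty());
    double log_sum = 0.0;
    for (double v : values) {
        log_sum += std::log(v);
    }
    return std::exp(log_sum / static_cast<double>(values.size()));
}

// Collects the SDK 2.7 timings from the table.
std::vector<double> sdk27_times() {
    std::vector<double> out;
    for (const Sample& s : footnote_samples()) out.push_back(s.sdk27_seconds);
    return out;
}

// Collects the SDK 2.8 timings from the table.
std::vector<double> sdk28_times() {
    std::vector<double> out;
    for (const Sample& s : footnote_samples()) out.push_back(s.sdk28_seconds);
    return out;
}

// Ratio of the two geometric means: how many times as fast SDK 2.8 is overall.
double overall_speed_ratio() {
    return geometric_mean(sdk27_times()) / geometric_mean(sdk28_times());
}
```

Under that assumption the overall ratio is roughly 3.3, meaning SDK 2.8 takes a bit less than a third of the time. Read as "2.3x faster", that is the same figure expressed as the improvement over the 2.7 baseline (ratio minus one). The per-sample ratios vary widely, from barely changed (HistogramAtomics) to well over an order of magnitude (BitonicSort), which is the usual reason to prefer a geometric mean for this kind of summary.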
Mark Ireton is the Sr. Product Manager for Compute Solutions at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.