Today we released a preview version of the Bolt C++ Template Library included in the APP SDK 2.8 – get it here. The primary goal of Bolt is to make it easier for developers to utilize the inherent performance and power efficiency benefits of heterogeneous computing (hc). In this first version, we’ve delivered an introductory set of common compute-optimized routines including sort, scan, transform, and reduce. Compared to writing the equivalent functionality in OpenCL™, you’ll find that Bolt requires significantly fewer lines of code and less developer effort. The preview version also provides good acceleration when compared to stock CPU implementations, and we have already identified additional optimizations to be included in future releases. One of our primary goals has been to make the interfaces easy to use, and we have included comprehensive documentation for the library routines, memory management, control interfaces, and host/device code sharing.
The long-term vision for Bolt provides two exciting capabilities that I wish to share with you in more depth:
Single-Code Path Programming:
One of the challenges that discrete GPU developers have historically faced is a need to develop a “special” code path for compute acceleration that is separate and unique from the CPU code path. Typically the Accelerated Compute code path is even written in a different language, such as OpenCL™. While this can be manageable for small research projects or point optimization efforts, it creates a significant challenge in code maintenance for larger ‘real-world’ projects. Bolt provides three powerful concepts that address this challenge:
- Familiar programming model: The Bolt programming model is C++ template programming (similar to C++’s own Standard Template Library) – rather than a new dedicated compute language. Developers are able to code in a familiar environment which is similar to how they program for multi-core CPUs.
- Platform Selection: Bolt will provide a path to both the CPU and to Accelerated Compute Units from the same source code. (Note: the preview version does not include support for the CPU path for accelerated code – stay tuned.) The device selection is done dynamically (at runtime), so developers can ship a single binary that takes advantage of acceleration if available, or falls back to an optimized multi-core CPU implementation if not.
- Higher-level Abstraction: Other compute programming models also support both CPUs and Accelerated Compute Units, but not in a performance-portable way, due to the need to encode micro-architecture-specific details in the kernel implementation. The Bolt template function APIs provide a higher-level abstraction and the freedom to customize the underlying algorithm and data access patterns for the target device. For example, the compute-optimized implementation of Bolt’s reduce function uses local memory and barriers for the reduction, while the CPU uses a completely different algorithm. For both implementations, the developer specifies only the reduction operation (e.g., plus) and the input arrays.
Path to Heterogeneous System Architecture (HSA):
Bolt’s device_vector class provides a convenient vector-like interface for managing device memory. On today’s heterogeneous computing devices, managing device memory (and the associated copies to and from host space) is often necessary to obtain good performance from discrete GPUs, and the device_vector makes this as painless as possible. However, in the future, the Heterogeneous System Architecture will provide a single, shared, coherent heap of memory with fast access from both the CPU and Accelerated Compute Units – this will eliminate the need for developers to manage the separate device memory.
HSA devices will be able to directly access host memory with good performance, and with full support for pageable virtual memory and the associated large memory footprints. This is a powerful concept that will revolutionize the way we program heterogeneous computers, transforming it from the domain of special-purpose “compute” languages into a standard part of popular programming languages. Bolt includes the forward-looking feature of direct access to host memory – i.e., you can use std::vector<> or pointers (e.g., int*) as arguments to Bolt template functions. This will become even more important as hc programming evolves to enable pointers to be shared across host and device, including support for complex pointer-containing data structures such as lists or trees. Bolt’s productivity-oriented development environment enables developers to get good performance on today’s hc platforms. And the exact same code will run on future HSA platforms at increased performance and reduced power.
We are excited to take this first step in making heterogeneous compute programming more accessible. And there is much more to come as we expand the Bolt functionality and prepare for HSA devices.
Ben Sander is a Senior Fellow at AMD and the architect for Bolt, as well as the software architect for AMD’s Heterogeneous System Architecture. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.