Bolt: Now with One Code Path for Multi-Core CPU and Compute Devices
The latest release of Bolt includes several new features, including C++ AMP support, an expanded set of Bolt APIs, and new specialized iterators – and all the code is now available in an open-source distribution on GitHub (https://github.com/HSA-Libraries/Bolt). In this post, I want to focus on perhaps the most exciting new feature: the ability for applications that use Bolt to run on any PC platform, including those that don’t have an accelerated compute device at all.
Bolt now includes a high-performance multi-core path for CPUs, in addition to the accelerated paths for compute devices. Bolt dynamically queries the platform capabilities at startup and selects the accelerated path if possible, but will also run on multi-core CPUs when the accelerated path is not available. Programmers can thus develop a single code base which runs on any available platform and uses accelerated compute paths when they are available. This is a great leap forward from previous compute programming environments, which required dedicated and separate paths for the CPU and the compute device, often written in different languages.
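To make the startup-time selection concrete, here is a minimal sketch in plain C++. It is not Bolt's actual implementation or API: the names RunPath, PlatformInfo, select_run_path and describe are hypothetical, and the platform query itself is assumed to have already filled in the device and core counts.

```cpp
#include <cassert>

// Hypothetical names for illustration only; none of these are Bolt identifiers.
enum class RunPath { AcceleratedDevice, MultiCoreCpu };

// Assumed result of a one-time platform query performed at startup.
struct PlatformInfo {
    int compute_device_count;  // accelerated compute devices that were found
    int cpu_core_count;        // CPU cores available to the fallback path
};

// Prefer the accelerated path when a device exists; otherwise use the CPU.
inline RunPath select_run_path(const PlatformInfo& platform) {
    assert(platform.cpu_core_count >= 1);  // every platform has at least one core
    return platform.compute_device_count > 0 ? RunPath::AcceleratedDevice
                                             : RunPath::MultiCoreCpu;
}

// Human-readable label for logging which path was chosen.
inline const char* describe(RunPath path) {
    return path == RunPath::AcceleratedDevice ? "accelerated device"
                                              : "multi-core CPU";
}
```

The point of the sketch is that the decision is made once, from platform capabilities, so application code never has to branch on the result itself.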
Under the covers, the Bolt CPU path uses Threading Building Blocks (TBB) – a well-known C++ template library for developing applications for multi-core CPUs. Bolt adds only a thin abstraction layer, so the Bolt-on-CPU path provides very similar performance to what developers can get from coding directly in TBB. Generally the CPU is more flexible while the compute accelerator provides higher performance – so any code that can run on the compute device can also run on the CPU, but the compute device is preferred when available. Bolt does not replace the entire TBB API, but rather implements the set of APIs which can run efficiently on both compute and CPU devices. For example, Bolt contains a “bolt::transform” API that efficiently calls tbb::parallel_for_each when running on the CPU, and generates efficient OpenCL code for the compute devices. However, Bolt does not provide an implementation of TBB’s memory allocators or concurrent_hash_map, because these cannot be implemented efficiently on compute devices. Programmers who want these features on the CPU can use the TBB APIs directly.
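As a rough illustration of what a thin CPU-side layer for a transform can look like, the sketch below splits the input into contiguous chunks and processes each chunk on its own thread. It deliberately uses std::thread as a stand-in for TBB so that it depends only on the standard library; the sketch namespace and transform_multicore are hypothetical names, not Bolt or TBB functions, and no device path is shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

namespace sketch {  // hypothetical namespace, not part of Bolt or TBB

// Apply op to every element of in, writing results to out, using up to
// `workers` threads. Each thread handles one contiguous chunk.
template <typename T, typename UnaryOp>
void transform_multicore(const std::vector<T>& in, std::vector<T>& out,
                         UnaryOp op, unsigned workers) {
    assert(out.size() == in.size());  // caller must pre-size the output
    workers = std::max(1u, workers);
    const std::size_t n = in.size();
    const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, w * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        pool.emplace_back([&in, &out, op, begin, end] {
            std::transform(in.begin() + begin, in.begin() + end,
                           out.begin() + begin, op);
        });
    }
    for (auto& t : pool) t.join();  // wait for every chunk to finish
}

}  // namespace sketch
```

A caller would pass something like std::thread::hardware_concurrency() for the worker count. A production library would reuse a thread pool and balance load dynamically rather than spawning threads per call, which is part of what a library such as TBB provides.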
We believe this approach will be a model as other mainstream programming models adopt heterogeneous computing. Specifically, the programming models will allow developers to specify the compute kernels in the same source file as the host code, and in the same language. The compute kernels will provide a relatively rich but still restricted subset of the full programming model supported by the host – the restricted subset is driven by the programming model features which run efficiently on the compute device. These models will also present a form of parallelism that is flexible enough for the code to run efficiently on both multi-core CPUs and accelerated compute devices. In the case of Bolt, the cross-device support will initially be leveraged to massively increase the number of platforms that can run Bolt, by using an initialization-time check of the platform capabilities. In the future, Bolt may utilize a more fine-grained selection process, choosing between CPU and compute device at each Bolt API call based on algorithm appropriateness (e.g., small grain sizes running best on the CPU) or device availability. Stay tuned!
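For readers wondering what a per-call selection might look like, here is a speculative sketch in plain C++. Nothing here describes a shipped or planned Bolt interface: CallContext, choose_path_for_call and the default cutoff of 4096 elements are all assumptions made for illustration, and a real cutoff would have to be measured per algorithm and per device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration only; none of these are Bolt identifiers.
enum class RunPath { AcceleratedDevice, MultiCoreCpu };

// What a single API call might know about itself and the platform.
struct CallContext {
    bool device_available;      // is a compute device currently usable?
    std::size_t element_count;  // size of the work handed to this call
};

// Choose a path for one call. Small inputs stay on the CPU, where launch
// overhead is low; large inputs go to the device when one is available.
// The default cutoff is an arbitrary placeholder, not a measured value.
inline RunPath choose_path_for_call(const CallContext& call,
                                    std::size_t min_elements_for_device = 4096) {
    assert(min_elements_for_device > 0);
    if (!call.device_available) return RunPath::MultiCoreCpu;
    if (call.element_count < min_elements_for_device) return RunPath::MultiCoreCpu;
    return RunPath::AcceleratedDevice;
}
```

The two checks correspond to the two criteria mentioned above: device availability first, then whether the work is large enough to be worth sending to the device.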
Ben Sander is a Senior Fellow at AMD and the architect for Bolt, as well as the software architect for AMD’s Heterogeneous System Architecture. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.