The recently released beta of Bolt introduces several new pieces of technology that users should find both exciting and useful. It adds a C++ AMP interface, broadening the choice of underlying technologies available to developers. Support for TBB has been added, allowing a Bolt application to run across a wider variety of platforms, including CPU-only platforms without heterogeneous compute devices installed. The Bolt OpenCL interface has been expanded with new features and performance optimizations. Last, but not least, the Bolt source is now available on the open source social programming platform GitHub at: https://github.com/HSA-Libraries/Bolt

I describe each of these areas in more detail below:

AMP interface

The Bolt beta ships AMP versions of the original interfaces included in the preview release of Bolt: transform, reduce, scan and sort.  In addition, the AMP device_vector container wraps the native AMP concurrency::array constructs and provides the vector interface that STL-style functions expect.  This includes the ability to declare 0-sized device_vectors, to obtain iterators with the begin()/end() methods, and all the other convenience methods to insert and delete elements.

A distinguishing feature of the Bolt AMP interface is its ability to accept lambda expressions where explicit functor objects were previously required; this makes for less verbose, easier-to-maintain code.  AMP also provides a friendlier development environment, as all device kernel code is compiled at compile time, as opposed to OpenCL’s more flexible but more difficult runtime compilation.  Microsoft’s AMP debugger, integrated into Visual Studio, is also friendly and easy to use.

The introduction of AMP allows Bolt to run across a variety of vendor accelerator hardware, as the AMP runtime ensures compatibility.  AMP support does not yet exist on Linux, but Bolt is ready for it.

TBB runtime

The beta also introduces support for homogeneous multi-core CPUs.  The TBB backend is not a static compile-time feature, but rather a runtime decision that allows a Bolt program to gracefully fall back to multi-core CPU acceleration when no accelerator device is found.  This behavior can also be controlled with the Bolt control object, which defines flags that let developers choose whether to default to the system’s massively parallel accelerator or to the traditional CPU cores, if they know that to be the more appropriate choice.

New OpenCL features

The OpenCL interface and backend also received new features and optimizations in the beta.  Our scan family of functions measures up to approximately a 3x performance boost over the functions included in the preview version of Bolt.  Many new ‘utility’ functions, such as generate(), fill(), copy(), min_element() and max_element(), have been added, all optimized to work with both std::vector and bolt::cl::device_vector containers.

One of the more exciting new additions to Bolt is the inclusion of a small subset of the ‘specialized adaptors’ described in Boost’s iterator library:

http://www.boost.org/doc/libs/1_53_0/libs/iterator/doc/index.html#specialized-adaptors

Bolt has had device_vector iterators since the preview release last December, and now adds the ‘specialized adaptors’ constant_iterator and counting_iterator.  Both of these iterators are ‘fake’ in that they do not actually point to or iterate over memory; rather, they keep the same interface as true iterators and generate the value returned when the iterator is dereferenced, based on an internal algorithm.  This can be convenient, especially when combined with the many algorithms Bolt provides that take iterators as arguments.  It is important to remember, though, that these iterator adaptors are non-mutable: since they do not actually point to memory, they cannot be written to.

As an example, let’s take the new counting_iterator and combine it with the bolt::cl::copy() algorithm to create an easy way to initialize a vector of memory with linearly increasing values.

#include "bolt/cl/device_vector.h"
#include "bolt/cl/iterator/counting_iterator.h"
#include "bolt/cl/copy.h"

int _tmain( int argc, _TCHAR* argv[] )
{
    // A device_vector of 100 ints, zero-initialized by default
    bolt::cl::device_vector< int > devV( 100 );

    // Fill devV with the linearly increasing values 10, 11, ..., 109
    bolt::cl::copy( bolt::cl::make_counting_iterator< int >( 10 ),
                    bolt::cl::make_counting_iterator< int >( 10 + static_cast< int >( devV.size( ) ) ),
                    devV.begin( ) );
    return 0;
}

Let’s take a deeper look at what is happening in this code.  First, the relevant include files appear at the beginning of the program, with the new header file ‘bolt/cl/iterator/counting_iterator.h’ bringing in the definition of the counting iterator.  A local device vector of size 100 is created inside the standard Microsoft tmain() function.  Since no initial value is specified, the vector of ints is initialized to 0’s by default.  We would like the vector to hold a linearly increasing range of values, but unfortunately the vector class provides no standard constructor to specify this.

This is where the counting_iterator provides value: using the copy routine, we create counting_iterators on the fly within the arguments to copy.  Normally, iterators are created by a method of a container, typically the begin() or end() methods.  However, iterator adaptors have no associated container, so a convenience function is provided to help users create them.  In the case of the counting iterator, that function is bolt::cl::make_counting_iterator().  It is a template function that returns a fully formed counting_iterator, which is immediately passed as input to the bolt::cl::copy() routine.

The counting_iterator passed as the first argument is initialized to 10, meaning that dereferencing the iterator at position 0 (iter[ 0 ]) returns 10; the 5th element returns 15, and so on.  The second parameter to bolt::cl::copy denotes the end of the input range, which is the initial value specified in the first iterator plus the size of the vector.  Of course, the device_vector is specified as the target of the copy operation in the last parameter.

As the bolt::cl::copy routine applies the standard iterator increment and dereference operators, the bolt::cl::counting_iterator uses its internal algorithm to return the appropriate value, which bolt::cl::copy then writes into the appropriate index of the bolt::cl::device_vector.  The counting_iterator also provides the appropriate subtraction and equality operators so that the bolt::cl::copy routine stops looping at the appropriate index.  The end result is a very simple method of initializing the device_vector (possibly residing in GPU resident memory) with an incrementing series of values, reusing existing APIs and containers!

Bolt source released through GitHub

I am proud to announce that AMD is making the source for Bolt available on GitHub at: https://github.com/HSA-Libraries/Bolt

The repository includes the source, documentation and the build infrastructure needed to compile Bolt on your own.  The build infrastructure is written with CMake to ease the burden of supporting multiple platforms in the future.  Start the process of building Bolt by downloading the CMake build tool from: http://cmake.org/cmake/resources/software.html.  That should be all that is needed; the CMake scripts take care of all the dependencies on their own!

Users should be able to find all the information they need about Bolt on the GitHub site itself, so I encourage you to visit the repository, download the code or fork it, and contribute to make this a better product for the whole developer community!  Bolt continues to gain momentum and steam as it approaches its 1.0 milestone.  It is actively developed, and we will continue to add new features and optimizations with the goal of making heterogeneous processing development easier and less time-consuming.  Please join us on the forums at http://devgurus.amd.com/groups/bolt, tell us your experiences and give us your feedback!

Kent Knox is a Member of Technical Staff at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only.  Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.
