Monte Carlo Sample in Bolt
Heterogeneous compute platforms (such as APUs) continue to become more and more ubiquitous. Most developers recognize the tremendous compute capabilities of these devices; however, for mainstream developers, tapping this capability has been a challenge.
With Bolt, we are removing many of the challenges that in the past have discouraged mainstream developers from leveraging heterogeneous computing (HC) in the solutions and in the process we wish to change the perception that developing for a heterogeneous platform is only for select code ninjas.
The Bolt template library is compatible with the C++ Standard Template Library (STL) providing a development environment familiar to most C++ developers and at same time enabling developers to easily unlock the exceptional performance potential of HC. Bolt doesn’t require the knowledge of specialized HC programming APIs such as OpenCL, C++ AMP, etc… Furthermore, it simplifies code development and maintenance by having a single code path that will execute efficiently on both the CPU and the compute accelerator.
In this blog, I’m going to present a simple Monte Carlo π estimator implemented in C++ STL. Then, I’m going to show you how, with just a few simple steps, to get HC acceleration by using Bolt. If this is your first time to hear about Bolt, I would recommend you take a look at Ben Sander’s blog to learn more about Bolt’s design goals and its architecture. Also, I would like to refer you to Kent Knox’s blog, which will give you a brief introductory tour on the programming aspect of Bolt.
The sample I’m using here implements an estimator of π using the Monte Carlo method. Monte Carlo methods solve computation problems by observing the outcome of a large number of random samples in a system. To estimate the value π, it uses the knowledge that the area of a circle and the area of its bounding square have a ratio of π/4. Then, it runs an experiment by randomly scattering a large number of points over the area of the bounding square and observing the probability that a point will be dropped inside the circle. Figure 1 shows an example of such experiment performed by the program.

Figure 1. Circle shown in its bounding square, with random points uniformly scattered over the area.
First, let’s take a look at how we implemented this algorithm for the CPU using C++ STL:
#include <vector>
#include <algorithm>
struct Point {
float x;
float y;
};
// Functor to check if a point is inside a circle
struct isInsideCircleFunctor {
isInsideCircleFunctor(float radius):radius(radius) { };
// returns 1 if the point is inside the circle
int operator() (const Point& point) {
...
};
private:
float radius;
};
// a vector container to store the random points
std::vector<Point> points;
// generate random points scattered in the bounding square
generateRandomPoints(points);
...
// store intermediate result from the transform function
std::vector<int> insideCircle(points.size());
// use the transform function to check if a point is inside the circle
std::transform(points.begin(), points.end()
, pointsInsideCircle.begin(),isInsideCircleFunctor(1.0f));
// count the number of points inside the circle
int count = std::accumulate(insideCircle.begin(), insideCircle.end(), 0);
Initially a large number of points scattered over the bounding square are generated and stored into an STL vector container. To determine if a point is inside the circle, the distance between the point and the center of the circle is calculated and then compared to the circle’s radius. This computation has to be performed for every point in the vector and the transform function in C++ STL provides a convenient way to achieve that. The functor isInsideCircleFunctor, which performs computation described previously, is passed in as the transform operator. The results from the transform function are stored into another vector. Then, we use the accumulate function, which is also from C++ STL, to count the number of points that are dropped inside the circle. It requires a few more steps to calculate the estimated value of π but I’m going to omit them here since their computational cost is negligible.
Now, I’m going to show you that with only a few simple changes, you can get this application accelerated with Bolt on a heterogeneous system.
First of all, you have to make Bolt aware of all the classes and functors that are referenced by the Bolt routines. This is necessary to enable Bolt to generate different code for both the CPU and the accelerator behind the scenes. The Bolt API provides a helper macro (BOLT_FUNCTOR) for this purpose. Taking the Monte Carlo CPU example, all it requires is simply to wrap the Point structure definition and the isInsideCircleFunctor definition with this macro:
BOLT_FUNCTOR(Point,
struct Point {
float x;
float y;
};
);
BOLT_FUNCTOR(isInsideCircleFunctor,
struct isInsideCircleFunctor {
...
};
);
The Bolt preview release includes an initial set of commonly used compute routines, such as count_if, reduce, scan, sort and transform. These Bolt routines have a similar interface as their C++ STL counterparts. They support the usage of a C++ STL vector as data container, which significantly enhances the programming experience. Additionally, Bolt supports a device_vector, which is a std::vector-like data container but its storage is allocated from the heterogeneous data parallel compute device resident memory. It provides an additional option to developers who wish to further optimize performance using Bolt. Kent’s blog on BlackScholes has a more extensive coverage on this topic and I encourage you take a look at it. For this sample, I make this change and convert the data containers to device_vector:
bolt::cl::device_vector<Point> points; bolt::cl::device_vector<int> pointsInsideCircle(points.size());
For the transform routine, you can expect the Bolt version to produces the same results as the C++ STL version, so the only source change required is the namespace of the function.
Bolt currently doesn’t offer a function that looks exactly like the accumulate function, which is essentially a reduction ADD operation. The good news is that Bolt has implemented a more generalized reduce routine, which gives developers more flexibility in terms of choosing a reduction operator. By default, reduce performs a reduction ADD. (Alternatively, you may specify a different associative operator by passing it as a forth parameter – not shown in this example):
bolt::cl::transform(points.begin(), points.end()
, pointsInsideCircle.begin(),isInsideCircleFunctor(1.0f));
int count = bolt::cl::reduce(pointsInsideCircle.begin()
, pointsInsideCircle.end(), 0);
Voilà, this sample program has now become Bolt accelerated. Notice how few code changes were required and more importantly, there isn’t any modification to the core computations of the Monte Carlo PI algorithm!
For performance seekers, let me introduce you to another Bolt routine called transform_reduce, which fuses the transform and the reduce operations together to gain additional performance optimization by reducing memory overheads. The change to make use of transform_reduce is trivial:
int count = bolt::cl::transform_reduce(points.begin(), points.end()
, isInsideCircleFunctor(1.0f), 0, bolt::cl::plus<int>());
For those who are familiar with C++ STL algorithms, you’ve probably noticed that the transform & reduce pass could also be implemented with the count_if function, for which Bolt also provides an accelerated version:
int count = bolt::cl::count_if(points.begin(), points.end()
, isInsideCircleFunctor(1.0f));
I’ve tested all the different implementations discussed here on my notebook, which is powered by an A10-4600M APU with an integrated Radeon HD 7660G providing the heterogeneous compute acceleration. The following chart shows the relative speed-up compared to the CPU, transform & accumulate baseline implementation. A higher number indicates better performance:

Figure 2. Relative speedup normalized to the STL CPU, transform & accumulate implementation. Higher is better.
From the chart, you can see that the performance gain delivered by Bolt is very impressive and the Bolt fused transform_reduce implementation almost achieves a 50X speed up in the 10 million-point scenario!
I hope that through this example you are as excited at the capabilities of Bolt as I am, seeing how Bolt offers you: ease of use and performance. Bolt is currently released as a preview and you can get it as part of the the AMD APP SDK v2.8. For correct operation of this sample, please download and use the latest version of the Bolt Preview which can be found here. Download the Monte Carlo PI sample here and try to modify the source code by yourself as an exercise. Join us in the Bolt forum for any Bolt related discussions and stay tuned for more news and updates from us!
Siu Chi Chan is a Sr. Software Engineer in Performance Application Engineering at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.
Share Your Thoughts!