
Implementing Black-Scholes using Bolt

Note to readers: The content of this blog was written against a previous iteration of the Black-Scholes sample shipping in the APP SDK v2.8.  The blog stands on its own, but will be updated in the future to reflect the latest sample source.

The recently launched AMD APP SDK v2.8 contains a preview of Bolt, a C++ template library that helps C++ developers write stable and performance-aware heterogeneous code.  In a nutshell, the goal of Bolt is to make programming for heterogeneous platforms feel as comfortable and familiar to C++ programmers as programming for traditional x86-based platforms, while gaining the performance benefits of heterogeneous compute.  Bolt uses interfaces and concepts familiar to any programmer who uses the STL (the C++ Standard Template Library), implementing a subset of the algorithms, a vector container and the iterators that bridge the gap between them.  For additional architectural details I recommend visiting Ben Sander’s blog or, better yet, joining me on December 11th for a webinar where I’ll dive into the various nuances of Bolt and answer any questions you have in real time; you can register here.

This is a powerful and exciting concept.  The STL separates containers (data) from algorithms, so its library functions can be written without regard to data layout or data location.  Given this, we can provide an abstraction that makes GPU device memory model an STL container, and then use STL algorithms on that memory even when it is not resident in host memory, as is the case on a discrete device!

Bolt provides the container device_vector, whose interface strongly resembles std::vector with small documented exceptions.  Using the device_vector abstraction, one can use standard STL algorithms to interact with OpenCL™ device memory, as illustrated below:

bolt::cl::device_vector< cl_int > boltVector( 1024 );

//  Initialize random data in device_vector
std::generate(boltVector.begin( ), boltVector.end( ), rand );

The code above allocates linear, contiguous device-resident memory on the OpenCL device (a guaranteed property of the vector class) large enough to hold 1024 cl_int values and initializes that memory with random data, in 2 lines of code!  This works because the STL std::generate() function interacts with the container only through the iterator range abstraction, which device_vector supports.  This is truly an exemplary demonstration of the power of data and algorithm separation.
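
To see why an iterator range is all that std::generate() needs, here is a minimal host-only sketch. The toy_buffer class and the fill_random() helper are hypothetical names made up for this illustration; they are not part of Bolt and only model the begin()/end() interface that a container such as device_vector exposes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Hypothetical stand-in for a container that owns linear storage.
// It is NOT bolt::cl::device_vector; it only models the iterator interface.
class toy_buffer
{
public:
    typedef int* iterator;

    explicit toy_buffer(std::size_t n) : size_(n), data_(new int[n]()) {}

    iterator begin() { return data_.get(); }
    iterator end()   { return data_.get() + size_; }
    std::size_t size() const { return size_; }

private:
    std::size_t size_;
    std::unique_ptr<int[]> data_;
};

// Fill the container through its iterator range only; the algorithm never
// learns where or how the storage was allocated.
inline void fill_random(toy_buffer& buf)
{
    std::generate(buf.begin(), buf.end(), [] { return std::rand(); });
}
```

Constructing a toy_buffer of 1024 elements and passing it to fill_random() mirrors the two-line snippet above, with the difference that this storage lives in ordinary host memory.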

To illustrate how much simpler Bolt can make it to develop heterogeneous solutions, I will step through a sample Bolt program shipping with the APP SDK v2.8 and explain how it works.  This example program implements the Black-Scholes algorithm; in the sample, the input is random data.  This sample can be found at …\AMD APP\samples\bolt\examples\BlackScholes.

Program execution starts at the main() function located at the bottom of blackscholes.cpp.  The first few lines of the sample are standard AMD sample boilerplate code, initializing the sample and parsing command line arguments.  Program execution reaches the point where containers are allocated and data is initialized.  The relevant code is pasted below for reference:

/****************************************************************
* BlackScholes                                                 
* Class implements Black-Scholes implementation for European 
* Options 
*****************************************************************/

class BlackScholes : public BoltSample
{
/**< Random generated options */
std::vector<float> cpuOptions;

/**< CPU call & put prices    */
std::vector<blackScholesPrice> cpuPrice;

/****************************************************************
* Create input option and output prices on the device using 
* device_vector to gain a huge performance boost. In this 
* sample the input options used for cpu calculations are 
* replicated to exist on the device              
****************************************************************/

/**< Random generated options */
bolt::cl::device_vector<float> boltOptions;

/**< BOLT call & put prices */
bolt::cl::device_vector<blackScholesPrice> boltPrice;

…

/***************************************************************
* Implementation of setup()                                    
***************************************************************/

int BlackScholes::setup()
{
cpuOptions.resize(samples);
boltOptions.resize(samples);
boltPrice.resize(samples);
cpuPrice.resize(samples);

/***************************************************************
* The .data() method is called to map device (GPU) memory to 
* host memory. The mapped memory is not unmapped until the 
* pointer goes out of scope   
***************************************************************/

bolt::cl::device_vector<float>::pointer boltPtr = 
     boltOptions.data();

for(unsigned i = 0; i < samples; i++)
{
    boltPtr[i] = (float)rand() / (float)RAND_MAX;
    cpuOptions[i] = boltPtr[i];
}

sampleTimer.setup(sampleName, iterations);

if(!quiet)
    std::cout << "Completed setup() of BlackScholes sample" 
        << std::endl;

return SDK_SUCCESS;
}

The BlackScholes class declares two empty vectors (cpuOptions/cpuPrice) that represent the input and output for the host-side calculation, and two empty device_vectors (boltOptions/boltPrice) that represent the input and output for the OpenCL device calculation.  The main program calls BlackScholes::setup() to allocate memory and initialize the vectors with random data.  I would like to call attention to how the sample initializes the random data on the OpenCL device: it calls the device_vector .data() method, which returns a device_vector<float>::pointer type.  Programmers familiar with OpenCL will recognize that memory allocated on current-generation data-parallel devices is not directly accessible to the host processor, so this method maps the device memory into system memory and returns a pointer to the system memory.  The pointer returned is not a naked pointer but a smart pointer (actually a boost::shared_array<>).  The programmer treats it like a normal pointer and uses it to read and write data in system memory, but as soon as the pointer goes out of scope, its custom deleter unmaps the memory and copies the new data to the device.  The smart pointer makes reading and writing device memory as easy as reading and writing system memory, but the developer must take care to reset the pointer, or allow it to go out of scope, before the memory is used in an OpenCL clEnqueueNDRangeKernel() call.
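
The map-then-unmap-on-scope-exit behavior can be sketched on the host alone. The toy_device_buffer class below is a hypothetical model invented for this illustration, and it uses std::shared_ptr with a custom deleter instead of boost::shared_array; it is not how Bolt is implemented, only a small demonstration of the idea that the write-back happens when the last pointer is released.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical model of device memory that the host cannot touch directly.
class toy_device_buffer
{
public:
    explicit toy_device_buffer(std::size_t n) : device_(n, 0.0f) {}

    // "Map": copy the device contents into a host staging array and return
    // a shared pointer whose deleter performs the "unmap" (copy back, free).
    std::shared_ptr<float> map()
    {
        float* staging = new float[device_.size()];
        std::copy(device_.begin(), device_.end(), staging);
        std::vector<float>* target = &device_;
        return std::shared_ptr<float>(staging, [target](float* p) {
            std::copy(p, p + target->size(), target->begin());
            delete[] p;
        });
    }

    // Reads the device-side copy, as a kernel would see it.
    float device_value(std::size_t i) const { return device_[i]; }

private:
    std::vector<float> device_;
};
```

Writes made through the mapped pointer are not visible through device_value() until the pointer is reset or goes out of scope, which is the same hazard described above for launching a kernel while a mapping is still alive.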

After initializing the data, the sample calls the run method:

/****************************************************************
* Implementation of run()                              
****************************************************************/

int BlackScholes::run()
{
    for(unsigned i = 0; i < 1 && iterations != 1; i++)
    {
        if(blackScholesBOLT() != SDK_SUCCESS)
            return SDK_FAILURE;
        if(!quiet)
            std::cout << "Completed Warm up run of Bolt code" 
                << std::endl;
    }

    if(!quiet)
        std::cout << "Executing BlackScholes sample over " 
            << iterations << " iteration(s)." << std::endl;

    for(unsigned i = 0; i < iterations; i++)
    {
        sampleTimer.startTimer();
        if(blackScholesBOLT() != SDK_SUCCESS)
            return SDK_FAILURE;
        sampleTimer.stopTimer();
    }

    if(!quiet)
        std::cout << "Completed Run() of BlackScholes sample" 
            << std::endl;
    return SDK_SUCCESS;
}

The purpose of the run method is to perform the Black-Scholes computation in a timing loop, repeated for the number of iterations the user specifies on the command line.  At the end of the program, timing information is printed to stdout if requested.  There are two loops in this method; the first runs at most once (the loop condition skips it when only a single iteration is requested) and calls into the Bolt library to do the Black-Scholes calculation.  This is a “warm-up” loop, and its purpose is described more completely in the following paragraph.

Users of OpenCL will be familiar with its online compilation model, also known as ‘just-in-time’, on-demand or dynamic compilation.  OpenCL kernels are typically strings stored in memory (read from file or generated at runtime), and the kernels are not compiled until the process calls the OpenCL clBuildProgram() API.  The OpenCL strings in Bolt are statically linked into the Bolt library, and are not compiled until the first call into a Bolt API.  Online kernel compilation can be relatively slow, and not representative of the general speed of the API.  It is a warm-up cost; the kernel is compiled on the first invocation of the API, but calls thereafter all use a cached copy of the kernel binary.  Therefore, the Black Scholes sample calls Bolt once outside of the timing loop to compile the kernels in the library, in what it calls a ‘warm up’ run.
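
The compile-once, reuse-afterwards pattern and the untimed warm-up call can be sketched without OpenCL. Everything in this block (toy_kernel_cache, timed_runs, the integer handle) is a hypothetical stand-in chosen for illustration; the real caching inside Bolt may be organized quite differently.

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical kernel cache: "compiles" a source string once, then reuses it.
class toy_kernel_cache
{
public:
    toy_kernel_cache() : compile_count_(0) {}

    // Returns a handle for the kernel, compiling only on the first request.
    int get(const std::string& source)
    {
        std::map<std::string, int>::const_iterator it = cache_.find(source);
        if (it != cache_.end())
            return it->second;
        ++compile_count_;  // stands in for an expensive clBuildProgram() call
        int handle = static_cast<int>(cache_.size()) + 1;
        cache_[source] = handle;
        return handle;
    }

    int compile_count() const { return compile_count_; }

private:
    std::map<std::string, int> cache_;
    int compile_count_;
};

// Performs one untimed warm-up call, then times `iterations` calls.
// Returns the average number of seconds per timed call.
inline double timed_runs(toy_kernel_cache& cache, const std::string& source,
                         unsigned iterations)
{
    cache.get(source);  // warm-up: pays the one-time compilation cost
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < iterations; ++i)
        cache.get(source);
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return iterations ? elapsed.count() / iterations : 0.0;
}
```

Because the warm-up call happens before the clock starts, compile_count() is already 1 when timing begins and the measured average reflects only cached lookups.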

After computing the Black-Scholes result with Bolt, the sample optionally compares it against the same result computed on the CPU.  The two pieces of code below illustrate how similar calling Bolt is to calling the STL to compute Black-Scholes.

/****************************************************************
* Implementation of blackScholesCPU()            
****************************************************************/
int BlackScholes::blackScholesCPU()
{
    std::transform(cpuOptions.begin(), cpuOptions.end(),
    cpuPrice.begin(), blackScholesFunctor());
    return SDK_SUCCESS;
}

/****************************************************************
* Implementation of blackScholesBOLT()                    
****************************************************************/
int BlackScholes::blackScholesBOLT()
{
    bolt::cl::transform(boltOptions.begin(), boltOptions.end(),
    boltPrice.begin(), blackScholesFunctor());
    return SDK_SUCCESS;
}

The only visible difference between these two implementations is the namespace that prefixes the transform() call.  Of course, the std::vector is passed into the STL API and the bolt::cl::device_vector is passed into the Bolt API.  The Bolt API also accepts iterators from std::vector, but this implies a host-to-device copy on input and a device-to-host copy before returning (all managed within the Bolt API).  Depending on where the app wishes the data to reside on API exit, this may or may not be the optimal usage.
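
For readers who want to experiment with the host-side path without the SDK, the following is a minimal sketch of a functor driven by std::transform. The names toy_price, toy_black_scholes and price_all are hypothetical, the parameter ranges are assumptions picked for illustration rather than the values used in the sample, and the normal CDF is computed with std::erfc instead of the approximation the sample uses.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical result type holding a call/put pair.
struct toy_price
{
    float call;
    float put;
};

// Hypothetical functor; the parameter ranges below are illustrative
// assumptions, not the constants from the SDK sample.
struct toy_black_scholes
{
    // Standard normal cumulative distribution function via erfc.
    static float phi(float x)
    {
        return 0.5f * std::erfc(-x / std::sqrt(2.0f));
    }

    // Maps t in [0, 1] linearly onto [lo, hi].
    static float scale(float lo, float hi, float t) { return lo + (hi - lo) * t; }

    toy_price operator()(const float& t) const
    {
        const float S     = scale(10.0f, 100.0f, t);         // spot price
        const float K     = scale(10.0f, 100.0f, 1.0f - t);  // strike price
        const float T     = scale(1.0f, 10.0f, t);           // years to expiry
        const float r     = scale(0.01f, 0.05f, t);          // risk-free rate
        const float sigma = scale(0.01f, 0.10f, t);          // volatility

        const float sigma_sqrt_t = sigma * std::sqrt(T);
        const float d1 = (std::log(S / K) + (r + 0.5f * sigma * sigma) * T) / sigma_sqrt_t;
        const float d2 = d1 - sigma_sqrt_t;
        const float discounted_k = K * std::exp(-r * T);

        toy_price out;
        out.call = S * phi(d1) - discounted_k * phi(d2);
        out.put  = discounted_k * phi(-d2) - S * phi(-d1);
        return out;
    }
};

// Prices every option on the host with a single std::transform call.
inline std::vector<toy_price> price_all(const std::vector<float>& options)
{
    std::vector<toy_price> prices(options.size());
    std::transform(options.begin(), options.end(), prices.begin(), toy_black_scholes());
    return prices;
}
```

A quick sanity check for a functor like this is put-call parity: the call price minus the put price should equal the spot price minus the discounted strike.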

So, how does the Bolt API compute the Black Scholes formula with a transform call?  The magic happens in the functor that is passed as the last parameter.  blackScholesFunctor() calls the constructor of the blackScholesFunctor class declared at the beginning of blackscholes.hpp.   Let’s take a closer look at this class definition:

/****************************************************************
* blackScholesFunctor                                    
* functor definition that performs the BlackScholes algorithm 
****************************************************************/
BOLT_FUNCTOR(blackScholesFunctor,
    struct blackScholesFunctor
    {
        /*********************************************************
        * @fn phi
        * @brief Abramowitz-Stegun approximation for PHI on the
        *        CPU (Cumulative Normal Distribution Function)
        * @return a float PHI value
        *********************************************************/
        float phi(float X)
        {
            …
        };

        /*********************************************************
        * @fn operator() overload used as functor to calculate 
        * BlackScholes
        * @brief Calculates the call price and put price of a given
        *   option using BlackScholes formula as described 
        *   in wikipedia 
        * http://en.wikipedia.org/wiki/Black%E2%80%93Scholes
        * @return a blackScholesPrice struct variable that
        *   contains the resultant floating point call price and 
        * put price
        *********************************************************/
        blackScholesPrice operator() (const float& inpOption)
        {
            …
        };
    };
);

I’ve edited the implementation of the class out of the code above; what is most interesting to show is the BOLT_FUNCTOR() macro call that wraps the class definition.  BOLT_FUNCTOR() is a helper macro available in Bolt whose first parameter takes the name of a type and whose second parameter takes the definition of that type.  It creates both host code and a string representation of that code, which can be passed to the OpenCL compiler.  Since the code wrapped in this macro has to compile for both host and device, there are restrictions on it.  For instance, using local data share (LDS) memory would cause a host-side C++ compiler failure, as would OpenCL vector data types such as float4.  Likewise, using exceptions would cause the OpenCL compiler to fail.  As long as the code in the macro stays within the intersection of features available in both compilers, it will compile for both targets.  When the user calls the Bolt API and passes in an instance of the functor created with the BOLT_FUNCTOR() macro, the Bolt library appends the string definition of the functor to the internal kernel string (in this case transform), builds it and then calls it.
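
The trick of producing both compiled host code and a source string from one definition can be sketched with the preprocessor's stringizing operator. The SKETCH_FUNCTOR macro, the add_one functor and build_kernel_source() below are hypothetical and written only for this illustration; they are not the actual BOLT_FUNCTOR() definition, which does more work than this.

```cpp
#include <string>

// Hypothetical macro, NOT the real BOLT_FUNCTOR: it emits the type
// definition for the host compiler and also keeps the same text as a string.
#define SKETCH_FUNCTOR(Type, ...)                                 \
    __VA_ARGS__                                                   \
    inline const char* Type##_source() { return #__VA_ARGS__; }

// The definition is written once; the macro expands it into real host code
// plus an add_one_source() function returning its textual form.
SKETCH_FUNCTOR(add_one,
    struct add_one
    {
        int operator()(int x) const { return x + 1; }
    };
)

// Mimics a library appending the functor text to its own kernel template
// before handing the combined string to a runtime compiler.
inline std::string build_kernel_source(const std::string& functor_source)
{
    return functor_source + "\n/* transform kernel body would follow */";
}
```

On the host, add_one()(41) evaluates to 42 like any ordinary functor, while add_one_source() returns the struct definition as text that a runtime compiler could consume. Note that stringizing collapses the original line breaks and indentation into single spaces.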

I hope that stepping through this sample and explaining the code illustrates both the power and the value of Bolt.  To reiterate, Bolt allows a C++ user to write code in a generic and familiar style and still get the inherent performance benefits OpenCL provides by offloading computation to OpenCL devices.  Bolt is being released today as a ‘Preview’ version, to enable you to experiment with the code and runtime, write your own code, and give us early feedback that helps us prioritize features and fix bugs as we make our way to our v1.0 release.  As such, we are not yet promising that we won’t change the Bolt API or break backwards compatibility as we approach our release, but with your help and feedback we will ensure both completeness of the Bolt API and a robust Bolt solution.

Kent Knox is a Member of Technical Staff at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

There are 4 comments.

Anatoliy —

Hello!

Can I use AMD APP with Visual Studio Express?

m1k3 —

Is Bolt going to be open source?

    Kent Knox —

    Hi M1k3~

    The source code distributed with Bolt is provided using an Apache 2.0 license. Once AMD feels that the Bolt solution is sufficiently mature, we plan to make the complete solution available on a Social Programming platform. We anticipate that this will occur the first half of 2013.
