Fine-Grain Shared Virtual Memory Buffer With Examples

As part of our OpenCL™ 2.0 Demystified series, we are going to look at the benefits of using the Fine-Grain Shared Virtual Memory (SVM) buffer, and I’ll give you a couple of examples. Prakash Raghavendra wrote about Coarse-Grain SVM and Fine-Grain SVM; his blogs give a great introduction to SVM.

OpenCL 2.0 is a big leap in heterogeneous computing. It includes several new features that improve programmability and performance compared to earlier OpenCL versions. The Khronos™ Group has an excellent quick reference card for OpenCL 2.0 (found in the reference cards section of their website).

New Examples

The AMD APP SDK 3.0 final release is now available, and includes some new examples that were not present in the beta release. This blog will discuss two of them, SVMAtomicsBinaryTreeInsert and HeatPDE.

One of the prominent new features in OpenCL 2.0 that improve ease of programming is SVM. SVM enables the host and the device to access the same virtual address space within a context. As a result, programmers no longer need to explicitly manage data transfers between host and device. SVM also allows buffers to contain pointers that can be used by both the CPU and GPU, which was not possible in earlier versions of OpenCL.
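To see what this buys you in practice, here is a minimal sketch of the basic SVM workflow on the host (the context, queue, and kernel setup are assumed to exist already; error checking is omitted):

// Minimal SVM workflow sketch: allocate, hand the pointer to a kernel, free.
#include <CL/cl.h>

void run_with_svm(cl_context context, cl_command_queue queue, cl_kernel kernel)
{
    // Allocate 1024 ints of SVM memory; the same pointer is valid on the
    // host and (once passed as a kernel argument) on the device.
    int* data = (int *) clSVMAlloc(context, CL_MEM_READ_WRITE,
                                   1024 * sizeof(int), 0);

    // No clCreateBuffer/clEnqueueWriteBuffer pair is needed; the SVM
    // pointer is passed to the kernel directly.
    clSetKernelArgSVMPointer(kernel, 0, data);

    size_t global_size = 1024;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                           NULL, 0, NULL, NULL);
    clFinish(queue);

    clSVMFree(context, data);
}

Note that with coarse-grain SVM, host access to the buffer must still be bracketed by clEnqueueSVMMap()/clEnqueueSVMUnmap(); with a Fine-Grain SVM buffer, even that requirement goes away.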

There are three types of SVM in the OpenCL 2.0 specification:

  • Coarse-Grain SVM: Sharing occurs at the granularity of regions of OpenCL buffer memory objects.
  • Fine-Grain SVM buffer: Sharing occurs at the granularity of individual loads and stores into bytes within OpenCL buffer memory objects.
  • Fine-Grain SVM system: Sharing occurs at the granularity of individual loads and stores into bytes occurring anywhere within the host memory.

NOTE: Fine-Grain SVM is an optional feature of the OpenCL 2.0 specification. Support for Fine-Grain SVM buffers and atomics is available only on Linux® 64-bit with the latest Catalyst™ driver. There is no support for Fine-Grain SVM system at this time.
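Because the feature is optional, it is worth checking for it at run time before allocating a Fine-Grain SVM buffer. A minimal sketch, assuming device holds a valid cl_device_id:

// Query the device's SVM capabilities (OpenCL 2.0).
cl_device_svm_capabilities caps;
cl_int err = clGetDeviceInfo(device, CL_DEVICE_SVM_CAPABILITIES,
                             sizeof(caps), &caps, NULL);

// Fine-Grain SVM buffer with atomics is usable only if both bits are set.
if (err == CL_SUCCESS &&
    (caps & CL_DEVICE_SVM_FINE_GRAIN_BUFFER) &&
    (caps & CL_DEVICE_SVM_ATOMICS))
{
    // Safe to allocate with CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS.
}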

Fine-Grain SVM Buffer

As described in Prakash’s blog, the Fine-Grain SVM buffer provides fine-grain control within an SVM buffer object. Also, if the buffer is created using the atomics flag, memory consistency is achieved at the level of atomic operations. This is a very useful feature for many real-world applications in which the CPU and GPU must simultaneously access the same buffer and require a consistent view of the data. We will explore two such scenarios where the Fine-Grain SVM buffer enables the CPU and GPU to work in collaboration. In the first example, I will explain node insertion into a binary tree performed simultaneously by the CPU and GPU. In the second example, we will look at simulating a heat field using the Fine-Grain SVM buffer.

Insert Nodes into a Binary Tree

Consider a binary tree as shown in Figure 1. Inserting a large number of nodes into a binary tree can be a very time-consuming operation when done sequentially on a CPU. The tasks include tree traversal, identifying the parent node, and insertion. Using OpenCL on GPUs can accelerate this task significantly by parallelizing the node insertion. To further improve throughput, we can divide the workload so that the GPU inserts some of the nodes and the CPU does the rest.

[Figure 1: A binary tree]

 

Implementing node insertion in a binary tree, however, poses a few challenges on earlier versions of OpenCL (1.2 and earlier):

  • A typical binary tree data structure includes pointers that reference the left and the right child nodes.
struct node
{
    int value;
    node* left;
    node* right;
};

Passing data structures with pointers is not allowed in OpenCL 1.x, so you must pass the data between host and device using an index-based data structure (see the sketch after this list). These data transformation operations introduce additional overhead, which of course affects performance; the code complexity also increases because of the additional transformations.

  • For concurrent node insertion (by both the CPU and GPU), you must synchronize the host and the OpenCL device to ensure that both devices don’t write to the same parent node at the same time. This means that a node insert operation must happen atomically across the CPU and GPU. But OpenCL 1.x does not support atomic operations whose scope spans both the CPU and the GPU, i.e. there is no support for platform atomics.
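To illustrate the first limitation: an OpenCL 1.x version of the node structure would typically replace the pointers with indices into a flat node array, along the lines of this hypothetical sketch (not code from the sample):

// Hypothetical OpenCL 1.x workaround: children are referenced by index
// into a flat node array instead of by pointer; -1 marks a missing child.
struct node_1x
{
    int value;
    int left_idx;   // index of the left child in the node array, or -1
    int right_idx;  // index of the right child in the node array, or -1
};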

The Fine-Grain SVM feature of OpenCL 2.0 helps overcome both limitations. It supports passing pointer-based data structures to kernels and also enables synchronization between the host and the OpenCL device through platform atomics. Let’s look at some code that shows how to use Fine-Grain SVM buffer. The complete implementation is available in AMD APP SDK sample, SVMAtomicsBinaryTreeInsert.

  • Include a mutex variable in the node structure; it will be used for atomic operations:
typedef struct {
    volatile int count;
} svm_mutex;

struct node
{
    int value;
    node* left;
    node* right;
    svm_mutex mutex_node;
};

 

  • Create the tree structure as a Fine-Grain SVM buffer using clSVMAlloc(), passing the atomics flag:
// Initialize the SVM memory with the fine-grain and atomics flags.
cl_svm_mem_flags flags = CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS;
node* svmTreeBuf = (node *) clSVMAlloc(context,
                                       flags,
                                       total_nodes * sizeof(node),
                                       0);
  • Create a binary tree with a given number of initial nodes on the host/CPU.
// Build the tree with the given number of initial nodes, init_tree_insert
cpuMakeBinaryTree(init_tree_insert, svmTreeBuf);
  • Read the number of nodes to be inserted into this tree, num_insert
  • Decide how many nodes the GPU and the CPU should each insert, and assign these counts to the respective variables. You can heuristically arrive at the best split between CPU and GPU based on the observed performance:
// Number of nodes to insert on the host and on the device
host_nodes = (int)(num_insert * ((float)hostCompPercent / 100));
device_nodes = num_insert - host_nodes;
  • Start node insertion on the GPU by launching the OpenCL kernel. Each work-item inserts one node into the tree. Here’s the kernel code.
    • Get the key of the node that has to be inserted by the current work-item:
size_t gidx = get_global_id(0);
node* new_node = &(svmTreeBuf[gidx]);
int key = new_node->value;

 

  • Iterate through the tree to find the parent node for this key
// root is the root node obtained after the initial nodes are inserted
node* tmp_parent = root;
node* tmp_node = root;
int flag = 0;  // declared outside the loop; it is used again during insertion

/* Search for the parent node. */
while (tmp_node)
{
    tmp_parent = tmp_node;
    flag = key - tmp_node->value;
    tmp_node = (flag < 0) ? tmp_node->left : tmp_node->right;
}

 

  • Once the parent node is found, use an atomic compare-exchange-strong (CAS) operation to insert the node:
// Get the mutex of the parent node
svm_mutex* tmp_mutex = &tmp_parent->mutex_node;
int expected = SVM_MUTEX_UNLOCK;

// CAS operation: try to lock the parent node
int status = atomic_compare_exchange_strong_explicit(
    (atomic_int *)&tmp_mutex->count, &expected, SVM_MUTEX_LOCK,
    memory_order_seq_cst, memory_order_seq_cst,
    memory_scope_all_svm_devices);

// If locking the parent node succeeded
if (status)
{
    // Insert the node
    if (flag < 0)
        tmp_parent->left = new_node;
    else
        tmp_parent->right = new_node;

    expected = SVM_MUTEX_LOCK;

    // Reset the lock status
    atomic_compare_exchange_strong_explicit(
        (atomic_int *)&tmp_mutex->count, &expected, SVM_MUTEX_UNLOCK,
        memory_order_seq_cst, memory_order_seq_cst,
        memory_scope_all_svm_devices);

    atomic_work_item_fence(CLK_GLOBAL_MEM_FENCE, memory_order_seq_cst,
                           memory_scope_all_svm_devices);
}

In the CAS operation, if status is set, we insert the new node under the parent node found in the previous step. If status is not set, the parent node is locked (either by another work-item on the GPU or by a thread on the CPU), and the search and CAS must be repeated until the required parent node is found and the lock is obtained. For brevity, that part is not shown here; a sketch of one possible retry loop follows below, and the APP SDK sample SVMAtomicsBinaryTreeInsert has the details of the actual implementation.

Here I have used memory_order_seq_cst for the memory-order arguments to ensure sequential consistency of memory accesses. The memory scope is set to memory_scope_all_svm_devices because both the CPU and GPU atomically access the tree nodes and need a consistent view.
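For reference, here is a minimal sketch of what such a retry loop could look like in the kernel. The structure is illustrative only (it assumes root, key, and new_node as in the earlier snippets); the sample’s actual implementation differs in its details:

// Sketch: retry until the node is inserted.
bool inserted = false;
node* tmp_node = root;
node* tmp_parent = root;
int flag = 0;

while (!inserted)
{
    // (Re-)search for the parent; another thread may have inserted a
    // child below our previous candidate in the meantime.
    while (tmp_node)
    {
        tmp_parent = tmp_node;
        flag = key - tmp_node->value;
        tmp_node = (flag < 0) ? tmp_node->left : tmp_node->right;
    }

    int expected = SVM_MUTEX_UNLOCK;
    if (atomic_compare_exchange_strong_explicit(
            (atomic_int *)&tmp_parent->mutex_node.count, &expected,
            SVM_MUTEX_LOCK, memory_order_seq_cst, memory_order_seq_cst,
            memory_scope_all_svm_devices))
    {
        // Re-check the slot under the lock: it may have been filled
        // between our search and the lock acquisition.
        if (flag < 0 && tmp_parent->left == 0)
        {
            tmp_parent->left = new_node;
            inserted = true;
        }
        else if (flag >= 0 && tmp_parent->right == 0)
        {
            tmp_parent->right = new_node;
            inserted = true;
        }

        // Unlock the parent node.
        expected = SVM_MUTEX_LOCK;
        atomic_compare_exchange_strong_explicit(
            (atomic_int *)&tmp_parent->mutex_node.count, &expected,
            SVM_MUTEX_UNLOCK, memory_order_seq_cst, memory_order_seq_cst,
            memory_scope_all_svm_devices);
    }

    if (!inserted)
        tmp_node = tmp_parent;  // resume the search from the same parent
}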

  • Back on the host, you invoke a similar atomic CAS operation on the CPU to implement its share of the node-insertion workload. We parallelize the node insertion on the CPU using OpenMP (a sketch of the surrounding loop follows this code).
int expected = SVM_MUTEX_UNLOCK;
int status = std::atomic_compare_exchange_strong_explicit(
    (std::atomic<int> *)&lock->count, &expected, SVM_MUTEX_LOCK,
    std::memory_order_seq_cst, std::memory_order_seq_cst);

// If locking the parent node succeeded
if (status)
{
    // Insert the node
    if (flag < 0)
        tmp_parent->left = new_node;
    else
        tmp_parent->right = new_node;

    // Reset the lock status
    expected = SVM_MUTEX_LOCK;
    std::atomic_compare_exchange_strong_explicit(
        (std::atomic<int> *)&lock->count, &expected, SVM_MUTEX_UNLOCK,
        std::memory_order_seq_cst, std::memory_order_seq_cst);
}
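For context, the OpenMP parallelization around this could look roughly like the following sketch; insertNode() and the index arithmetic are placeholders, not the sample’s exact code:

// Hypothetical host-side OpenMP loop: each iteration inserts one node,
// running concurrently with the GPU kernel working on its own share.
#pragma omp parallel for
for (int i = 0; i < host_nodes; i++)
{
    // insertNode() performs the search plus the CAS-based insertion
    // shown above; the host's nodes follow the GPU's in svmTreeBuf.
    insertNode(&svmTreeBuf[init_tree_insert + device_nodes + i], root);
}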

If you compare the OpenCL kernel code with the CPU code, you’ll see they are very similar. SVM-based pointer data structures make programming for the GPU very CPU-like, and the code is much more readable.

Test Results

So let’s look at performance using three different approaches: the CPU does all the work; the GPU does all the work; and both the CPU and GPU insert nodes. Here’s what I got for results.

[Chart: average node-insertion throughput for the three approaches]

Not surprisingly, I got the best average throughput when the CPU and GPU share the load, with the CPU inserting 40% of the nodes and the GPU inserting 60%. I ran the test on an AMD A10-7850K APU with 4 CPU cores and an integrated GPU with 8 compute cores, using the Ubuntu 14.04 64-bit OS and the AMD Catalyst™ Omega 14.12 driver.

Heat Field Simulation

Let’s consider another example of the use of Fine-Grain SVM buffer. Let’s say we want to simulate a controlled heat field on a rectangular area/plate as shown in Figure 2. The plate is heated by a grid of burners (green triangles). Sensors (blue rectangles) measure the heat at their location and ensure that the temperature stays within a given range. We are looking here at a simple feedback loop.

[Figure 2: A hot surface: triangles are heat sources; squares are sensors]

The GPU calculates the temperature at different points on the plate, while the CPU manages the feedback loop that monitors the temperature at the sensor positions. This requires coordination between the CPU and GPU: while the GPU updates the temperature values, the CPU runs the feedback loop in the background and sets (or unsets) a flag for each burner, based on whether the temperature at the corresponding sensor is above or below a predefined value, turning the burner on or off as appropriate. The GPU computes the heat field according to the status of these flags. Because the feedback loop on the CPU accesses the same memory as the GPU, we use a Fine-Grain SVM buffer to ensure a consistent view of memory on both devices.
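To make this concrete, the CPU side of such a feedback loop might look like the following sketch. The names (field, sensorIndex, burnerOn, TEMP_THRESHOLD, and the loop bounds) are illustrative assumptions, not the sample’s actual code; both arrays are assumed to live in the Fine-Grain SVM buffer shared with the GPU:

#include <atomic>

// Hypothetical CPU feedback loop: read sensor temperatures written by
// the GPU and atomically flip the burner flags the GPU kernel reads.
while (simulationRunning)
{
    for (int s = 0; s < numSensors; s++)
    {
        float t = field[sensorIndex[s]];         // temperature at this sensor
        int on = (t < TEMP_THRESHOLD) ? 1 : 0;   // simple on/off control

        // Atomic store so the GPU sees a consistent flag value.
        std::atomic_store_explicit((std::atomic<int> *)&burnerOn[s], on,
                                   std::memory_order_seq_cst);
    }
}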

Check out the APP SDK sample HeatPDE for more details on this sample’s implementation and on how to use a Fine-Grain SVM buffer. The simulation output of the sample is shown in Figure 3.

[Figure 3: Heat field simulation output]

In the real world, what this shows is that you can design and implement a feedback system that is more responsive and has finer control over a process than you could with sequential code on a CPU alone.

Conclusion

Many real-world, performance-critical applications work on data that must be accessed by both a CPU and a GPU. The Fine-Grain SVM buffer, introduced in OpenCL 2.0, is a very useful feature for developing such applications. Programming becomes easier because developers can use the same pointers across the CPU and GPU, and the ability to atomically access data across the devices, without requiring the GPU kernel to complete first, leads to improved program performance.


 

Pradeep Rao is Member of Technical Staff in the Developer Solutions Team at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

 

OpenCL is a trademark of Apple Inc. used by permission by Khronos.
