AMD Joins OpenACC

At the Supercomputing Conference in November 2014 AMD announced that it had joined the OpenACC standards group. With a growing number of members and compiler implementations, OpenACC is quickly establishing itself as an important part of the computing ecosystem.

Threads Everywhere

In early days of high-performance computing, developers coding for multiple processors had to manually manage spawning, monitoring, and synchronization of multiple threads. Operating systems differed in their support, so the code had to be rewritten for each proprietary system. Developers spent hours scrutinizing small sections of code. Bugs sometimes took years to find.

Directive Based Programming and OpenMP

In 1997 the 1.0 version of the OpenMP specification was released. Rather than require explicit thread management, developers could write their algorithm in standard C, C++ or Fortran with hints to the compiler for detecting and exploiting parallelism.

Take for example an OpenMP implementation of matrix multiply (for two square matrices A and B):

#pragma omp parallel {
#pragma omp for
for ( int i = 0; i < size; ++i ){
for ( int j = 0; j < size; ++j ){
int tmp = 0;
for ( int k = 0; k < size; ++k ){
tmp += A[i][k] * B[k][j];
}
C[i][j] = tmp;
}
}
}

The two #pragma directives are sufficient to allow the compiler to parallelize the routine. For a size of 512, the OpenMP version runs 3x faster on my 4 core machine.

GPU Programming – Return to the Wild West

At the turn of the century, the scientific community turned to graphics hardware to perform tasks previously performed by the CPU. A number of proprietary APIs were created to provide efficient access to the hardware.

Much like the earlier days of threaded CPU programming, coders had to write versions of their code for each platform. The development task was burdensome but often yielded substantial performance gains.

In 2009 the 1.0 version of OpenCL was released. It relieved some of the development pain by providing an open standards based approach to GPU computing, allowing the code to be written once and run on any compatible device. Although the OpenCL standard was an improvement over previous closed models, its use is still dominated by expert programmers.

OpenACC

OpenACC does for GPU programming what OpenMP did for CPU programming – providing acceleration with a directives based approach. Returning to the matrix multiply example above, the OpenMP pragmas can be replaced with OpenACC directives:

#pragma acc kernels \
pcopyin(A[0:size][0:size],B[0:size][0:size]) \
pcopyout(C[0:size][0:size])

When I ran this on my 4-core AMD APU system with a size of 4096, the OpenACC implementation provided a significant performance advantage over the OpenMP and serial implementations.

Although not every OpenMP routine will benefit as immediately from a drop-in replacement of OpenACC, research inside AMD found significant performance benefits for many common OpenMP routines simply by switching to equivalent OpenACC pragmas.

High Performance Computing

OpenACC provides developers with the ability to quickly develop GPU accelerated code. We look forward to working with OpenACC standards group to extend its benefits to the high performance computing community.

 

Brent Hollingsworth is a Program Manager at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only.  Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

Comments are closed.