CPU-to-GPU data transfers exceed 15GB/s using APU zero copy path


At the AMD Fusion Developer Summit (AFDS), AMD introduced the AMD A-Series processors, formerly known as the “Sabine” platform and “Llano” APU. The AMD A-Series APU is a capable GPU compute device: the A8-3850 with Radeon™ HD 6550D Graphics delivers up to 480 GFLOPS of performance. To benefit fully from this performance, it is essential to transfer data efficiently between the host CPU and the GPU. In this blog I discuss the zero copy data transfer path introduced to support the tighter coupling between CPU and GPU memory afforded by the APU architecture.

As you may be aware, in the AMD A-Series APU the CPU and GPU share a common DDR3 memory, which is partitioned between the two devices. Unlike with a discrete GPU, this means data can be transferred more efficiently between the CPU and the GPU, as there is no longer a constraint imposed by a PCIe bus. In AMD APP SDK 2.5 we introduce a zero copy transfer path for buffers created with the standard OpenCL™ memory buffer flags CL_MEM_ALLOC_HOST_PTR and CL_MEM_READ_ONLY when used on an APU.

For maximum-performance transfers from CPU to GPU, create a memory buffer using the clCreateBuffer command with the flags CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY.  This buffer is allocated in host-accessible GPU memory and provides a true zero copy path for transferring data between CPU and GPU.  To make this buffer accessible by the host CPU, use the clEnqueueMapBuffer command to map the buffer into host memory; control over the memory buffer is transferred logically between the devices.  The CPU may now write directly to this buffer and, when done, return its contents to the GPU using the clEnqueueUnmapMemObject command.  When the GPU subsequently accesses the buffer, as initiated by a clEnqueueNDRangeKernel command, it reads it directly.  In this way data is transferred between the CPU and the GPU substantially faster than is possible to a discrete GPU over a PCIe bus; using multiple CPU cores, transfer rates of over 15 GB/s¹ have been achieved on mid-range AMD A-Series platforms.  When the GPU no longer needs the buffer, it may be mapped back to the CPU using the clEnqueueMapBuffer command, so that additional input data can be provided to the GPU.

In my next blog I will discuss our BufferBandwidth example with Marc Romankewicz, one of our OpenCL performance engineers, and how it leverages this enhancement among others to illustrate how to effectively move data between the CPU and GPU.

For more details on best practices for creating high-performing code on AMD APUs and GPUs, I refer you to the AMD APP SDK OpenCL Programming Guide.

¹ A8-3800 with Radeon™ HD 6550D graphics, 8 GB DDR3-1333

Mark Ireton is the Product Manager for Compute Solutions at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

16 Responses

    • Mark Ireton

      The specific test was for Windows 7 and Catalyst 11.7 drivers using the BufferBandwidth sample from SDK 2.5.

  1. Tweenk

    The relevant mapping functions are called clEnqueueMapBuffer and clEnqueueMapImage. There is no function clEnqueueMapMemObject, only the unmapping function (clEnqueueUnmapMemObject) is generic.

  2. Chenming

    I tested on an AMD A8-3850 with 8 GB DDR3-1333 using the AMD SDK v2.5 BufferBandwidth workload,
    with map and unmap EnqueueBuffer, but I only get 8 GB/s CPU-to-GPU, not 15 GB/s. Is there anything more I should do?

  3. rahul garg

    Having some consistency issues testing this on an A8-3500M with the BufferBandwidth test. Sometimes I reach 14 GB/s (great) copy with nearly instantaneous map/unmap as expected. But sometimes the map/unmap operation takes a lot of time (reported b/w of 1 GB/s for map/unmap) and the copy is slower as well (about 4 GB/s).

    Posted some details on the forum here: http://devgurus.amd.com/thread/167085