Note: ATI Stream Technology is now called AMD Accelerated Parallel Processing (APP) Technology.
Now that you have created your application, you’ll want to optimize its performance. Some useful measures include:
- Execution time and launch time
The OpenCL runtime provides a built-in mechanism for timing kernel execution, enabled by setting the CL_QUEUE_PROFILING_ENABLE flag when the command queue is created. Once profiling is enabled, the runtime automatically records timestamp information for every kernel and memory operation submitted to the queue. The timestamps can be read back from the associated event with clGetEventProfilingInfo.
- Memory bandwidth
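Effective bandwidth is calculated as:
Effective Bandwidth = (Br + Bw)/T
Br = total number of bytes read from global memory.
Bw = total number of bytes written to global memory.
T = time required to run the kernel, specified in nanoseconds.
With T in nanoseconds, the result is in bytes per nanosecond, which is numerically equal to GB/s (10^9 bytes per second).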
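As a rough illustration of the two measures above, the sketch below turns profiling timestamps into an elapsed time and an effective bandwidth. It is plain C so it can be checked without a GPU; the function names are hypothetical and not part of any AMD or OpenCL API. In a real host program the timestamps would be read with clGetEventProfilingInfo, using CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_START, and CL_PROFILING_COMMAND_END.

```c
#include <stdint.h>

/* Kernel execution time in nanoseconds.
   Assumption: start_ns and end_ns are the values returned for
   CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END. */
uint64_t execution_time_ns(uint64_t start_ns, uint64_t end_ns)
{
    return end_ns - start_ns;
}

/* Launch time in nanoseconds: delay between enqueueing the command
   (CL_PROFILING_COMMAND_QUEUED) and the kernel starting to run. */
uint64_t launch_time_ns(uint64_t queued_ns, uint64_t start_ns)
{
    return start_ns - queued_ns;
}

/* Effective Bandwidth = (Br + Bw) / T.
   Bytes per nanosecond equals 10^9 bytes per second, so with T in
   nanoseconds the returned value is in GB/s. */
double effective_bandwidth_gbps(uint64_t bytes_read,
                                uint64_t bytes_written,
                                uint64_t time_ns)
{
    if (time_ns == 0) {
        return 0.0; /* avoid dividing by zero for an unmeasured kernel */
    }
    return (double)(bytes_read + bytes_written) / (double)time_ns;
}
```

For example, a kernel that reads 4000 bytes, writes 4000 bytes, and runs for 2000 ns has an effective bandwidth of 8000 / 2000 = 4 GB/s.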
Some general tips:
- Avoid declaring global arrays on the kernel’s stack frame as these typically cannot be allocated in registers and require expensive global memory operations.
- Use predication rather than control flow. Predication allows the GPU to execute both paths of execution in parallel, which can be faster than attempting to minimize the work through clever control flow.
- If possible, create a reduced-size version of your data set for easier debugging and faster turn-around on performance experimentation.
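The predication tip can be sketched as follows. This is an illustrative example in plain C, not code from the AMD guide: the function names and the scaling rule are made up for the demonstration. The first version branches per element; the second computes both candidate results and selects one, which is the shape that maps well to predicated execution. Whether a given compiler actually emits predicated or select instructions is not guaranteed; in OpenCL C, the built-in select() function is one way to express the same idea.

```c
#include <stddef.h>

/* Control-flow version: each element follows one of two branches. */
void scale_branch(float *data, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (data[i] >= 0.0f) {
            data[i] = data[i] * 2.0f;
        } else {
            data[i] = data[i] * 0.5f;
        }
    }
}

/* Predicated version: both paths are evaluated unconditionally and
   the condition only selects which result is kept. */
void scale_predicated(float *data, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float v = data[i];
        float if_non_negative = v * 2.0f; /* result of the "then" path */
        float if_negative = v * 0.5f;     /* result of the "else" path */
        data[i] = (v >= 0.0f) ? if_non_negative : if_negative;
    }
}
```

Both functions produce the same output; the difference is only in how the work is expressed, so it is worth measuring each form on the target device.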
See Chapter 4 in the AMD APP SDK OpenCL™ Programming Guide for extensive details on optimization.
The following tutorials also contain valuable optimization tips:
- Image Convolution Using OpenCL™ – A Step-by-Step Tutorial
- ATI Stream Computing – Histogram Optimization Illustration (Histogram_optimized.zip)
- OpenCL™ Optimization Case Study: Diagonal Sparse Matrix Vector Multiplication
- OpenCL™ Optimization Case Study: Simple Reductions
- OpenCL™ Optimization Case Study: GATLAS – Designing Kernels with Auto-Tuning
Performance Analysis Tools
AMD APP Profiler
The AMD APP Profiler is included with the AMD APP SDK v2 release and is also available as a separate download. It is a Microsoft® Visual Studio® integrated runtime profiler that gathers performance data from the GPU as your application runs. Developers can use this information to find the bottlenecks in their OpenCL™ application and optimize its performance.
Updates to the AMD APP Profiler (already packaged with the AMD APP SDK v2) are available from:
Also available for download is the APP KernelAnalyzer, a tool for statically analyzing the performance of OpenCL™ C kernels. APP KernelAnalyzer compiles your OpenCL™ C kernels down to the actual instructions used to program the GPU. It then performs a static analysis of the instruction stream and reports a variety of information, including register usage, ALU utilization, and memory contention, all without having to run the application on actual hardware.
The APP KernelAnalyzer is currently available as a separate download from:
View the AMD Fusion Developer Summit tutorial “OpenCL Application Analysis and Optimization Made Easy With AMD APP Profiler and KernelAnalyzer”, which demonstrates advanced techniques to visualize your application’s workloads, discover hard-to-find bugs and bottlenecks, and determine the performance characteristics of your application.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.