ATI Stream Technology is now called AMD Accelerated Parallel Processing (APP) Technology.
Now that you have created your application, you’ll want to optimize its performance. Some useful measures include:
Execution time and launch time
The OpenCL™ runtime provides a built-in mechanism for timing the execution of kernels, which is enabled by setting the CL_QUEUE_PROFILING_ENABLE flag when the command queue is created. Once profiling is enabled, the OpenCL™ runtime automatically records timestamp information for every kernel and memory operation submitted to the queue.
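As a minimal sketch of how the recorded timestamps might be turned into an execution time and a launch time: the struct and function names below (command_times, execution_ms, launch_ms, read_command_times) and the HAVE_OPENCL macro are hypothetical, introduced only for illustration. Treating launch time as START minus QUEUED is an assumption about what "launch time" means here. The OpenCL calls used are clWaitForEvents and clGetEventProfilingInfo.

```c
#include <assert.h>
#include <stdint.h>

#ifdef HAVE_OPENCL          /* hypothetical opt-in macro, not part of OpenCL */
#include <CL/cl.h>          /* <OpenCL/opencl.h> on macOS */
#endif

/* The four timestamps, in nanoseconds, recorded for a profiled command. */
typedef struct {
    uint64_t queued;   /* CL_PROFILING_COMMAND_QUEUED: enqueued by the host */
    uint64_t submit;   /* CL_PROFILING_COMMAND_SUBMIT: submitted to device  */
    uint64_t start;    /* CL_PROFILING_COMMAND_START:  execution began      */
    uint64_t end;      /* CL_PROFILING_COMMAND_END:    execution finished   */
} command_times;

/* Execution time in milliseconds: END minus START. */
double execution_ms(const command_times *t)
{
    assert(t->end >= t->start);
    return (double)(t->end - t->start) / 1.0e6;
}

/* Launch time in milliseconds, taken here as START minus QUEUED (assumption). */
double launch_ms(const command_times *t)
{
    assert(t->start >= t->queued);
    return (double)(t->start - t->queued) / 1.0e6;
}

#ifdef HAVE_OPENCL
/* Fill *t from an event returned by an enqueue call on a queue that was
 * created with CL_QUEUE_PROFILING_ENABLE. Blocks until the command is done. */
cl_int read_command_times(cl_event evt, command_times *t)
{
    const cl_profiling_info what[4] = {
        CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT,
        CL_PROFILING_COMMAND_START,  CL_PROFILING_COMMAND_END
    };
    cl_ulong v[4] = {0, 0, 0, 0};
    cl_int err = clWaitForEvents(1, &evt);
    for (int i = 0; i < 4 && err == CL_SUCCESS; ++i)
        err = clGetEventProfilingInfo(evt, what[i], sizeof v[i], &v[i], NULL);
    if (err == CL_SUCCESS) {
        t->queued = v[0]; t->submit = v[1]; t->start = v[2]; t->end = v[3];
    }
    return err;
}
#endif
```

In a real host program, the queue would be created with something like clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err), and the event would come from the last argument of clEnqueueNDRangeKernel. The timing helpers compile without OpenCL; the guarded part needs the OpenCL headers and library.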
To calculate the effective bandwidth of a kernel:
Effective Bandwidth = (Br + Bw)/T
Br = total number of bytes read from global memory.
Bw = total number of bytes written to global memory.
T = time required to run the kernel, specified in nanoseconds. With bytes and nanoseconds, the result is in GB/s.
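The formula can be sketched as a small helper. The function name effective_bandwidth_gbs is hypothetical, and the byte counts are assumed to be worked out by hand from what the kernel reads and writes.

```c
#include <assert.h>
#include <stdint.h>

/* Effective bandwidth in GB/s (1 GB = 10^9 bytes).
 * Bytes divided by nanoseconds already equals gigabytes per second,
 * so no further unit conversion is needed. */
double effective_bandwidth_gbs(uint64_t bytes_read,
                               uint64_t bytes_written,
                               uint64_t time_ns)
{
    assert(time_ns > 0);  /* a zero duration would divide by zero */
    return ((double)bytes_read + (double)bytes_written) / (double)time_ns;
}
```

For example, a kernel that copies 1024 x 1024 floats reads 4194304 bytes and writes 4194304 bytes. If it runs in 100000 ns, effective_bandwidth_gbs(4194304, 4194304, 100000) gives about 83.9 GB/s.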
Some general tips:
- Avoid declaring arrays on the kernel’s stack frame (that is, in private memory), as these typically cannot be allocated in registers and require expensive global memory operations.
- Use predication rather than control flow. Predication allows the GPU to execute both paths of execution in parallel, which can be faster than attempting to minimize the work through clever control flow.
- If possible, create a reduced-size version of your data set for easier debugging and faster turn-around on performance experimentation.
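To illustrate the predication tip above, here is a minimal sketch of the same computation written with control flow and in a predicated form. It is plain C so that it can be compiled on the host; in an OpenCL™ C kernel the loop would disappear and the index would come from get_global_id(0). The function names and the arithmetic are made up for illustration, and whether the compiler actually emits predicated instructions depends on the compiler and the hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Control-flow version: each element takes one of two branches. */
void scale_branchy(const float *x, float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (x[i] < 0.0f)
            y[i] = -2.0f * x[i];
        else
            y[i] = 3.0f * x[i];
    }
}

/* Predicated version: both results are computed, then one is selected. */
void scale_predicated(const float *x, float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        float if_negative = -2.0f * x[i];
        float otherwise   = 3.0f * x[i];
        y[i] = (x[i] < 0.0f) ? if_negative : otherwise;
    }
}
```

Both functions produce the same output. The predicated form does slightly more arithmetic per element but has no divergent branch, which is the trade-off the tip describes.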
See Chapter 4 in the AMD APP SDK OpenCL™ Programming Guide for extensive details on optimization.
The following tutorials also contain valuable optimization tips:
- ATI Stream Computing – Histogram Optimization Illustration (Histogram_optimized.zip)
- OpenCL™ Optimization Case Study: Diagonal Sparse Matrix Vector Multiplication
- OpenCL™ Optimization Case Study: Simple Reductions
- OpenCL™ Optimization Case Study: GATLAS – Designing Kernels with Auto-Tuning
Performance Analysis Tools
AMD APP Profiler
The AMD APP Profiler is included with the AMD APP SDK release and is also available as a separate download. It is a Microsoft® Visual Studio® integrated runtime profiler that gathers performance data from the GPU as your application runs. Developers can use this information to discover where the bottlenecks are in their OpenCL™ application and to find ways to optimize its performance.
Updates to the AMD APP Profiler (already packaged with the AMD APP SDK) are available from the AMD APP Profiler product page.
Also available for download is the APP KernelAnalyzer, a tool for statically analyzing the performance of OpenCL™ C kernels. APP KernelAnalyzer compiles your OpenCL™ C kernels into the actual instructions used to program the GPU. It then performs a static analysis of the instruction stream and reports a variety of information, including register usage, ALU utilization and memory contention, all without having to run the application on actual hardware. The APP KernelAnalyzer is currently available as a separate download from the APP KernelAnalyzer product page.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.