Developer Central

Optimizing Applications

Note: ATI Stream Technology is now called AMD Accelerated Parallel Processing (APP) Technology.

Now that you have created your application, you’ll want to optimize its performance. Some useful measures include:

- Execution time and launch time
The OpenCL runtime provides a built-in mechanism for timing the execution of kernels. To enable it, set the CL_QUEUE_PROFILING_ENABLE flag when the queue is created. Once profiling is enabled, the OpenCL runtime automatically records timestamp information for every kernel and memory operation submitted to the queue.
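As a rough sketch of how the recorded timestamps can be turned into the two measures above: the four fields below stand for the CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END values that clGetEventProfilingInfo reports for an event, all in nanoseconds. The struct and function names are hypothetical, and treating launch time as START minus QUEUED is an assumption made for illustration, not a definition taken from this page.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four per-event timestamps (nanoseconds),
   as they would be read back with clGetEventProfilingInfo. */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} event_times;

/* Execution time: device time spent running the command. */
uint64_t execution_ns(const event_times *t) {
    assert(t->end >= t->start);
    return t->end - t->start;
}

/* Launch time (assumed definition): delay between enqueueing the
   command and the moment it starts executing on the device. */
uint64_t launch_ns(const event_times *t) {
    assert(t->start >= t->queued);
    return t->start - t->queued;
}
```

Keeping the two values separate helps show whether a short kernel is dominated by launch overhead rather than by its own execution.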

- Memory bandwidth
To calculate this:

Effective Bandwidth = (Br + Bw)/T

where:

Br = total number of bytes read from global memory.
Bw = total number of bytes written to global memory.
T = time required to run the kernel, specified in nanoseconds. With these units the result is in bytes per nanosecond, which is numerically equal to GB/s.
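The formula can be sketched as a small helper. The function name and the example figures are hypothetical; the kernel time would normally come from the profiling timestamps described earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Effective bandwidth = (Br + Bw) / T.
   With T in nanoseconds the quotient is bytes per nanosecond,
   which is the same number as GB/s (10^9 bytes per second). */
double effective_bandwidth_gbps(uint64_t bytes_read,
                                uint64_t bytes_written,
                                uint64_t time_ns) {
    assert(time_ns > 0);
    return (double)(bytes_read + bytes_written) / (double)time_ns;
}
```

For example, a kernel that copies 1024 x 1024 floats reads 4,194,304 bytes and writes 4,194,304 bytes. If it runs in 1,000,000 ns (1 ms), `effective_bandwidth_gbps(4194304, 4194304, 1000000)` gives about 8.39 GB/s. Comparing that figure with the device's peak memory bandwidth indicates how much headroom remains.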

Some general tips:

  • Avoid declaring arrays on the kernel’s stack frame, as these typically cannot be allocated in registers and require expensive global memory operations.
  • Use predication rather than control flow. Predication allows the GPU to execute both paths of execution in parallel, which can be faster than attempting to minimize the work through clever control flow.
  • If possible, create a reduced-size version of your data set for easier debugging and faster turn-around on performance experimentation.
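To illustrate the predication tip, here is a minimal sketch in plain C of the same computation written with control flow and in a predicated style. The function names and the scaling factors are hypothetical, and whether a compiler actually emits predicated instructions for the second form depends on the compiler and the target device.

```c
#include <assert.h>
#include <stddef.h>

/* Control-flow version: each element follows one of two branches. */
void scale_branch(const float *in, float *out, size_t n, float threshold) {
    for (size_t i = 0; i < n; ++i) {
        if (in[i] > threshold) {
            out[i] = in[i] * 2.0f;
        } else {
            out[i] = in[i] * 0.5f;
        }
    }
}

/* Predicated version: both results are computed unconditionally,
   then one is selected, so there is no divergent branch in the loop body. */
void scale_predicated(const float *in, float *out, size_t n, float threshold) {
    for (size_t i = 0; i < n; ++i) {
        float scaled_up = in[i] * 2.0f;
        float scaled_down = in[i] * 0.5f;
        out[i] = (in[i] > threshold) ? scaled_up : scaled_down;
    }
}
```

Both functions produce the same output. The predicated form does more arithmetic per element, so it tends to pay off only when both paths are cheap.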

See Chapter 4 in the AMD APP SDK OpenCL™ Programming Guide for extensive details on optimization.

The following tutorials also contain valuable optimization tips:

  • Image Convolution Using OpenCL™ – A Step-by-Step Tutorial
  • ATI Stream Computing – Histogram Optimization Illustration (Histogram_optimized.zip)
  • OpenCL™ Optimization Case Study: Diagonal Sparse Matrix Vector Multiplication
  • OpenCL™ Optimization Case Study: Simple Reductions
  • OpenCL™ Optimization Case Study: GATLAS – Designing Kernels with Auto-Tuning

Performance Analysis Tools

AMD APP Profiler

Included with the AMD APP SDK v2 release, but also available as a separate download, is the AMD APP Profiler. The AMD APP Profiler is a Microsoft® Visual Studio® integrated runtime profiler that gathers performance data from the GPU as your application runs. This information can then be used by developers to discover where the bottlenecks are in their OpenCL™ application and find ways to optimize their application’s performance.

Updates to the AMD APP Profiler (already packaged with the AMD APP SDK v2) are available from:

  • the AMD APP Profiler product page.

APP KernelAnalyzer

Also available for download is the APP KernelAnalyzer, a tool for statically analyzing the performance of OpenCL™ C kernels. APP KernelAnalyzer compiles your OpenCL™ C kernels down to the actual instructions used to program the GPU. It then performs a static analysis of the instruction stream and reports a variety of information, including register usage, ALU utilization, and memory contention, all without having to run the application on actual hardware.

The APP KernelAnalyzer is currently available as a separate download from:

  • the APP KernelAnalyzer product page.

View the AMD Fusion Developer Summit “OpenCL Application Analysis and Optimization Made Easy With AMD APP Profiler and KernelAnalyzer” tutorial that demonstrates advanced techniques to visualize your application’s workloads, discover hard-to-find bugs and bottlenecks, and determine the performance characteristics of your application.
» Download PDF


OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

 

©2013 Advanced Micro Devices, Inc. OpenCL and the OpenCL logo are trademarks of Apple, Inc., used with permission by Khronos.
