This panel shows the GPU performance counters for a profile session. To get the .csv file containing the results, browse to the file location shown in the panel title. To navigate there quickly, right-click the session in the Session Explorer and select “Open Containing Folder” from the menu.
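
Once the .csv file has been located, it can be post-processed outside the profiler. The following is a minimal sketch, not part of the APP Profiler itself, that sums the Time column for each Method value. It assumes the file has a header row whose names match the columns described on this page and that any line starting with `#` is metadata to skip; both are assumptions, and the sample rows (including the device name) are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def total_time_per_kernel(csv_text):
    """Sum the Time column (milliseconds) for each Method value."""
    # Assumption: lines starting with '#' are metadata, not data rows.
    data_lines = [
        line for line in csv_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    reader = csv.DictReader(io.StringIO("\n".join(data_lines)))
    totals = defaultdict(float)
    for row in reader:
        totals[row["Method"].strip()] += float(row["Time"])
    return dict(totals)


# Hypothetical sample data; a real session file has more columns.
sample = """Method,ExecutionOrder,Time
MatMul__k1_Cypress1,1,0.50
MatMul__k1_Cypress1,2,0.25
Reduce__k2_Cypress1,3,1.00
"""
print(total_time_per_kernel(sample))
```

To run this on a real session, read the file contents into a string first and pass that string to `total_time_per_kernel`.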

At the top of the panel, a button opens the profiling result in an external application (for example, if Microsoft Excel is associated with the .csv file extension, clicking the button launches Excel). Three checkboxes select the view options for the session result:

  • Show Kernel Dispatch: If this option is selected, then the session result will show the kernel dispatch operations.
  • Show Data Transfer: If this option is selected, then the session result will show the data transfer operations.
  • Show Zero Column: If this option is selected, then the session result will show all columns, including columns that contain only zero or empty values.
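
The effect of clearing Show Zero Column can be approximated when post-processing the .csv file. The sketch below drops every column whose values are all zero or empty. The exact rule the panel applies is not documented here, so the `is_zero_or_empty` test is an assumption, and the function name and sample rows are hypothetical.

```python
def drop_zero_columns(header, rows):
    """Return (header, rows) without columns that are all zero or empty."""
    if not rows:
        return header, rows  # nothing to judge the columns by

    def is_zero_or_empty(value):
        text = value.strip()
        if text == "":
            return True
        try:
            return float(text) == 0.0
        except ValueError:
            return False  # non-numeric text (e.g. a kernel name) is data

    keep = [
        i for i in range(len(header))
        if not all(is_zero_or_empty(row[i]) for row in rows)
    ]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


# Hypothetical sample: ScratchRegs is all zero, FCStacks is all empty.
header = ["Method", "Time", "ScratchRegs", "FCStacks"]
rows = [
    ["A__k1_Dev1", "0.5", "0", ""],
    ["B__k2_Dev1", "1.0", "0", ""],
]
print(drop_zero_columns(header, rows))
```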

The first several columns in the session are always displayed even if no performance counters are selected for the profile. A description of these columns for OpenCL™ applications is given in the table below:

| Name | Description |
| --- | --- |
| Method | The kernel name (appended by __k[KernelID]_[DeviceName][DeviceID] to differentiate unique kernels with the same name). |
| ExecutionOrder | The order of execution for the kernel dispatch operations in the program. |
| ThreadID | The thread ID of the host thread that made the OpenCL™ API call that initiated the kernel dispatch operation. |
| CallIndex | The call index of the OpenCL™ API call that initiated the kernel dispatch operation. |
| GlobalWorkSize | The global work-item size of the kernel. |
| WorkGroupSize | The work-group size of the kernel. |
| Time | The time spent executing the kernel, in milliseconds (does not include the kernel setup time). |
| LocalMemSize | The amount of local memory (LDS for GPU), in bytes, used by the kernel. |
| VGPRs | The number of general-purpose vector registers used by the kernel (valid only for GPU devices). |
| SGPRs | The number of general-purpose scalar registers used by the kernel (valid only for AMD Radeon HD 7000 series GPU devices based on Graphics Core Next Architecture/Southern Islands or newer). |
| ScratchRegs | The number of scratch registers used by the kernel (valid only for GPU devices). If nonzero, this is typically the main bottleneck. To reduce this number, reduce the number of GPRs used by the kernel. |
| FCStacks | The size of the flow control stack used by the kernel (valid only for AMD Radeon HD 6000 series GPU devices or older). This number may affect the number of wavefronts in flight. To reduce the stack size, reduce the amount of flow control nesting in the kernel. |
| KernelOccupancy | The kernel occupancy (valid only for GPU devices). This is an estimate of the number of in-flight wavefronts on a compute unit as a percentage of the theoretical maximum number of wavefronts that the compute unit can support. |
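
The Method value can be split back into its parts when analyzing results in a script. The sketch below follows the __k[KernelID]_[DeviceName][DeviceID] suffix described for the Method column; the regular expression, the function name, and the sample value are illustrative assumptions. Because the device name and device ID are not separated by a delimiter, a device name that itself ends in digits cannot be split unambiguously; this sketch treats all trailing digits as the device ID.

```python
import re

# Pattern for "<kernel>__k<KernelID>_<DeviceName><DeviceID>".
# Assumption: the device ID is the run of digits at the very end.
METHOD_PATTERN = re.compile(
    r"^(?P<kernel>.+)__k(?P<kernel_id>\d+)_(?P<device>.+?)(?P<device_id>\d+)$"
)


def split_method(method):
    """Split a Method value into its parts, or return None if it does not match."""
    match = METHOD_PATTERN.match(method)
    if match is None:
        return None
    return {
        "kernel": match.group("kernel"),
        "kernel_id": int(match.group("kernel_id")),
        "device": match.group("device"),
        "device_id": int(match.group("device_id")),
    }


# Hypothetical Method value used only for illustration.
print(split_method("MatMul__k1_Cypress1"))
```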

A description of these columns for a DirectCompute application is given in the table below:

| Name | Description |
| --- | --- |
| Identifier | The kernel name (appended by a pointer value that is unique for each kernel instance) or the data transfer operation name. |
| ExecutionOrder | The order of execution for the kernel and data transfer operations in the program. |
| ThreadGroup | The Thread Group size of the kernel. |
| WorkGroupSize | The work-group size of the kernel. |
| Time | For a kernel dispatch operation, the time spent executing the kernel, in milliseconds (does not include the kernel setup time). For a data transfer operation, the time spent transferring data, in milliseconds. |

The Counter Selection page of the APP Profiler Settings Window contains the description of the performance counters. This description is also shown if you hover the mouse cursor over the counter name in the Session panel.

To show the source, IL, or ISA code of an OpenCL™ kernel or the DXASM code of a DirectCompute kernel, click the kernel name in the first column to open the Code Viewer Panel.

For OpenCL™ applications, if a kernel is run on a CPU device, only the global work size, work-group size, local memory size, and execution time for the kernel are available.

Using the performance counters, you can:

  • Find the amount of resources (General Purpose Registers, Local Memory size, and Flow Control Stack size) allocated for the kernel. These resources limit the number of wavefronts that can be in flight on the GPU; a higher number of in-flight wavefronts better hides data latency.
  • Determine the number of ALU, global memory, and local memory instructions executed by the GPU.
  • Determine the number of bytes fetched from and written to the global memory.
  • Determine the utilization of the SIMD engines and memory units in the system.
  • View the efficiency of the Shader Compiler in packing ALU instructions into the VLIW instructions used by AMD GPUs.
  • View any local memory (Local Data Share – LDS) bank conflicts.
  • View the kernel occupancy percentage, which estimates the number of in-flight wavefronts on a compute unit as a percentage of the theoretical maximum number of wavefronts that the compute unit can support.
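
The occupancy definition above amounts to a simple ratio. The sketch below illustrates the arithmetic under the assumption that the number of in-flight wavefronts is bounded by whichever resource is most constraining; it is not the profiler's actual calculation, and the function name, resource limits, and maximum of 40 wavefronts are hypothetical values chosen for illustration.

```python
def occupancy_percent(resource_limits, max_wavefronts):
    """Estimate occupancy as in-flight wavefronts / theoretical maximum, in percent.

    resource_limits maps a resource name to the number of wavefronts that
    resource alone would allow on one compute unit (hypothetical inputs).
    """
    if max_wavefronts <= 0:
        raise ValueError("max_wavefronts must be positive")
    # Assumption: the most constraining resource sets the in-flight count.
    in_flight = min(min(resource_limits.values()), max_wavefronts)
    return 100.0 * in_flight / max_wavefronts


# Illustrative numbers only: register usage allows 16 wavefronts,
# local memory allows 24, and the compute unit supports at most 40.
limits = {"VGPRs": 16, "LocalMemSize": 24}
print(occupancy_percent(limits, 40))
```

In this illustration, register usage is the limiting resource, so reducing the registers used by the kernel would be the first step toward raising occupancy.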

To view more information about the kernel occupancy figure for an OpenCL™ kernel, click the percentage value in the Kernel Occupancy column to open the Kernel Occupancy Viewer Panel.