General Settings

This page allows you to configure some general profiler settings.

  • Show profile setting before profiling When checked, the APP Profiler Session Parameters dialog will be shown before each profile session, allowing you to view and edit the project settings that will be used for profiling. When unchecked, the settings dialog will only be shown if the profiler is unable to get enough information from the project to initiate a profile session.
  • Generate occupancy information for each OpenCL kernel profiled When checked, the profiler will generate kernel occupancy data for each OpenCL™ kernel dispatched to a GPU device. This setting affects both Performance Counter mode and Application Trace mode.
  • Delete session files when a solution is closed This setting allows you to tell the profiler to automatically delete all files and directories related to sessions that belong to projects in a solution when that solution is closed in Visual Studio.
    Setting Description
    Always delete session files The profiler will automatically delete session files when a solution is closed.
    Never delete session files The profiler will not delete session files when a solution is closed.
    Ask user every time The profiler will display a prompt when a solution is closed, asking the user if session files should be deleted.

    If the option is set to Always delete session files or Ask user every time, you can also enable Show details of deletion. When enabled, the profiler will display a dialog showing all the files and directories that were deleted, as well as any errors that occurred when trying to delete files or directories.

Counter Selection

This page allows you to select the counters to capture for the next profile session.

Below is a list of available counters with a brief description of each. The exact counters shown depend on the type of GPU installed on the system.

The supported counters on AMD Radeon™ HD 6000 series graphics cards or older:

Name Description
Wavefronts The total number of wavefronts.
ALUInsts The average number of ALU instructions executed per work-item (affected by flow control).
FetchInsts The average number of Fetch instructions from the video memory executed per work-item (affected by flow control).
WriteInsts The average number of Write instructions to the video memory executed per work-item (affected by flow control).
ALUBusy The percentage of GPUTime during which ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
ALUFetchRatio The ratio of ALU to Fetch instructions. If the number of Fetch instructions is zero, then one will be used instead.
ALUPacking The ALU vector packing efficiency (as a percentage). This value indicates how well the Shader Compiler packs the scalar or vector ALU instructions in your kernel into 5-way VLIW instructions. Value range: 0% (bad) to 100% (optimal). Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.
FetchSize The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
CacheHit The percentage of fetches that hit the data cache. Value range: 0% (no hit) to 100% (optimal).
FetchUnitBusy The percentage of GPUTime the Fetch unit is active. The result includes the stall time (FetchUnitStalled). This is measured with all extra fetches and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
FetchUnitStalled The percentage of GPUTime the Fetch unit is stalled. Try reducing the number of fetches or reducing the amount per fetch if possible. Value range: 0% (optimal) to 100% (bad).
WriteUnitStalled The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).
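
The ALUFetchRatio and ALUPacking descriptions above each contain a small rule: a zero Fetch count is replaced by one, and packing below 70 percent suggests dependency chains. The following is a minimal sketch of those two rules applied to counter values you have already exported. The function names are hypothetical and are not part of the profiler; the profiler computes these counters itself.

```python
def alu_fetch_ratio(alu_insts: float, fetch_insts: float) -> float:
    """Ratio of ALU to Fetch instructions.

    As described in the counter table, a Fetch count of zero is
    replaced by one so that the ratio is always defined.
    """
    divisor = fetch_insts if fetch_insts != 0 else 1
    return alu_insts / divisor


def packing_needs_attention(alu_packing_percent: float,
                            threshold: float = 70.0) -> bool:
    """True when ALUPacking is below the 70 percent guideline."""
    return alu_packing_percent < threshold


print(alu_fetch_ratio(120, 8))        # 15.0
print(alu_fetch_ratio(120, 0))        # 120.0 (zero fetches treated as one)
print(packing_needs_attention(62.5))  # True
```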

Additional performance counters for AMD Radeon™ HD 5000 and AMD Radeon™ HD 6000 series graphics cards:

Name Description
FastPath The total kilobytes written to the video memory through the FastPath which only supports basic operations: no atomics or sub-32 bit types. This is an optimized path in the hardware.
CompletePath The total kilobytes written to the video memory through the CompletePath, which supports atomics and sub-32 bit types (byte, short). This number includes bytes for load, store, and atomic operations on the buffer. This number may indicate a big performance impact (higher number equals lower performance). If possible, remove the usage of this path by moving atomics to the local memory or by partitioning the kernel.
PathUtilization The percentage of bytes written through the FastPath or CompletePath compared to the total number of bytes transferred over the bus. To increase the path utilization, use the FastPath. Value range: 0% (bad) to 100% (optimal).
LDSFetchInsts The average number of Fetch instructions from the LDS executed per work-item (affected by flow control). This counter is a subset of the ALUInsts counter.
LDSWriteInsts The average number of Write instructions to the LDS executed per work-item (affected by flow control). This counter is a subset of the ALUInsts counter.
LDSBankConflict The percentage of GPUTime the LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).

The full set of counters for AMD Radeon™ HD 7000 series GPU devices (based on Graphics Core Next Architecture/Southern Islands) or newer:

Name Description
Wavefronts Total wavefronts.
VALUInsts The average number of vector ALU instructions executed per work-item (affected by flow control).
SALUInsts The average number of scalar ALU instructions executed per work-item (affected by flow control).
VFetchInsts The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control).
SFetchInsts The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
VWriteInsts The average number of vector write instructions to the video memory executed per work-item (affected by flow control).
LDSInsts The average number of instructions to/from the LDS executed per work-item (affected by flow control).
VALUUtilization The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad) to 100% (ideal, no thread divergence).
VALUBusy The percentage of GPUTime during which vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
SALUBusy The percentage of GPUTime during which scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
FetchSize The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
CacheHit The percentage of fetch, write, atomic, and other instructions that hit the data cache. Value range: 0% (no hit) to 100% (optimal).
MemUnitBusy The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
MemUnitStalled The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
WriteUnitStalled The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).
LDSBankConflict The percentage of GPUTime the LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).
WriteSize The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
GDSInsts The average number of instructions to/from the GDS executed per work-item (affected by flow control). This counter is a subset of the VALUInsts counter.
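
The VALUUtilization description notes that a work-group size that is not a multiple of 64 lowers the counter. As a rough illustration, the sketch below estimates the best utilization a given work-group size could reach, assuming that a partially filled last wavefront leaves its remaining lanes idle and that there is no thread divergence. The function name is hypothetical, and the value the hardware reports may differ.

```python
import math

WAVEFRONT_SIZE = 64  # the multiple named in the VALUUtilization description


def max_valu_utilization(work_group_size: int) -> float:
    """Upper bound (in percent) on active lanes for one work-group.

    Assumes the work-group is split into ceil(size / 64) wavefronts and
    that unused lanes in the last wavefront count as inactive.
    """
    if work_group_size <= 0:
        raise ValueError("work_group_size must be positive")
    wavefronts = math.ceil(work_group_size / WAVEFRONT_SIZE)
    return 100.0 * work_group_size / (wavefronts * WAVEFRONT_SIZE)


print(max_valu_utilization(256))  # 100.0
print(max_valu_utilization(100))  # 78.125
```

Under these assumptions, rounding a work-group size of 100 up to 128 (with bounds checks in the kernel) or down to 64 would remove the idle lanes.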

You can also hover over the counter names to get the descriptions.

To load counter selections from a file or save them to a file, click the Load Selection or Save Selection button.

OpenCL™ Trace Page

This page contains two subpages that allow you to configure the behavior of the profiler when it performs an application trace.

  • Always show API error codes When checked, the profiler will always report the return codes for all OpenCL™ API calls. Some OpenCL™ API functions return an error code via a passed-in parameter. If the host application passes in NULL for that parameter, then the OpenCL™ runtime will not report an error code. The profiler will substitute a non-null parameter in this case and the API Trace will be able to show the return code.
  • Collapse consecutive identical clGetEventInfo calls Some OpenCL™ applications will wait for certain Enqueue API calls to complete by continuously checking the status of the event returned by the Enqueue API. These applications do this by calling clGetEventInfo within a loop until the event status reaches a certain state (typically CL_COMPLETE). For these applications, the timeline and API trace could contain thousands of clGetEventInfo calls, making the timeline and trace data difficult to analyze. To make analysis easier, the profiler can collapse consecutive clGetEventInfo calls that have the same parameters and return values into a single entry in the timeline and API trace.
  • Enable navigation to source code (high overhead) When checked, the profiler will generate a symbol information file from an application’s .pdb file, containing one entry for each OpenCL™ API that is called. This symbol information file will allow the user to navigate from an item in the API Trace in the APP Profiler Timeline panel to the source location of the API call.
  • Maximum number of APIs to trace This setting controls how many APIs will be traced over an application’s lifetime. The default number of APIs to trace is 1 million. Limiting the number of APIs traced helps to prevent running out of memory while profiling. After the limit is reached, additional APIs will not be traced, and the trace results will not include any information about them. Because of this, information provided in the Summary pages may not be correct, as a complete trace is required in order to provide a fully accurate application summary.
  • Generate summary pages When checked, the profiler will automatically generate Summary Pages using the API trace and timeline data. You can further configure the summary pages by selecting rules to be used when generating the Warning(s)/Error(s) Summary page. Rules currently supported are shown in the table below:
    Rule Description
    Detect resource leaks Tracks the reference count for all OpenCL™ objects, and reports any objects which are never released.
    Detect deprecated API calls Detects calls to OpenCL™ API functions that have been deprecated in recent versions of OpenCL™.
    Detect unnecessary blocking writes Detects unnecessary blocking write operations.
    Detect non-optimized work size Detects clEnqueueNDRangeKernel calls that specify a global or local work-group size that is non-optimal for AMD hardware.
    Detect non-optimized data transfer 1. Detects Non-Fusion APU access to Device-Visible Host Memory directly. 2. Detects Host-Visible Device Memory read back to CPU directly.
    Detect redundant synchronization Detects redundant synchronization that results in low host and device utilization.
    Detect failed API calls Detects OpenCL™ API calls that do not return CL_SUCCESS. Some of the return codes may not be detected unless the Always show API error codes option is checked.
  • APIs to trace When checked, you can tell the profiler which APIs you want it to trace. By limiting the APIs to trace, you can focus attention on particular APIs when analyzing trace data, while also reducing the overhead of performing a trace. Because a full trace is required in order to generate the Summary pages, this option is mutually exclusive with the Generate summary pages option. Use the treeview below the option to select which APIs the profiler should trace.
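
To illustrate the Collapse consecutive identical clGetEventInfo calls option described above, here is a minimal sketch of the idea applied to a list of trace records. The ApiCall record layout and the collapse_event_info function are hypothetical and only model the behavior described; they do not reflect the profiler's internal data structures.

```python
from itertools import groupby
from typing import NamedTuple


class ApiCall(NamedTuple):
    """Hypothetical trace record: API name, parameters, return value."""
    name: str
    params: tuple
    result: str


def collapse_event_info(calls):
    """Collapse runs of identical consecutive clGetEventInfo calls.

    Returns a list of (call, count) pairs. Only clGetEventInfo runs are
    merged; every other call keeps its own entry with a count of 1.
    """
    collapsed = []
    for call, run in groupby(calls):  # groups adjacent equal records
        count = sum(1 for _ in run)
        if call.name == "clGetEventInfo":
            collapsed.append((call, count))
        else:
            collapsed.extend((call, 1) for _ in range(count))
    return collapsed


status = "CL_EVENT_COMMAND_EXECUTION_STATUS"
trace = [
    ApiCall("clEnqueueNDRangeKernel", ("queue0", "kernel0"), "CL_SUCCESS"),
    ApiCall("clGetEventInfo", ("evt0", status, "CL_RUNNING"), "CL_SUCCESS"),
    ApiCall("clGetEventInfo", ("evt0", status, "CL_RUNNING"), "CL_SUCCESS"),
    ApiCall("clGetEventInfo", ("evt0", status, "CL_RUNNING"), "CL_SUCCESS"),
    ApiCall("clGetEventInfo", ("evt0", status, "CL_COMPLETE"), "CL_SUCCESS"),
]

for call, count in collapse_event_info(trace):
    print(call.name, call.params[-1], "x", count)
```

In this example the three polls that report CL_RUNNING become one entry, while the final poll that reports CL_COMPLETE stays separate because its parameters differ.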

Display and Output Options Page

Display Options

  • Enhanced visual effects in timeline When checked, items in the API Trace timeline will be drawn using rounded corners and gradient colors. When unchecked, items will be drawn with square corners and solid colors.
  • Zoom pivot follows mouse cursor When checked, the zoom pivot line in the timeline will automatically follow the mouse cursor as the mouse is moved. When unchecked, the zoom pivot line remains in a fixed location, and you must click somewhere within the timeline to reposition it.

Output Options

  • Write trace data at program termination When selected, the profiler will wait until the application terminates before writing any trace data to disk.
  • Write trace data periodically during program execution When selected, the profiler will periodically write all queued-up trace data to disk during program execution. Note: this mode may introduce extra overhead, which will appear in the profile data as periodic gaps in the timeline each time the queued-up data is written to disk.
  • Interval at which to write trace data (in milliseconds) This specifies how often any queued-up profile data will be written to disk.
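
The difference between the two output modes above can be sketched as a queue that is flushed either only at the end or whenever a configured interval has elapsed. The PeriodicTraceWriter class below is a hypothetical illustration, not the profiler's implementation; the clock and sink are passed in as parameters so the behavior is easy to follow.

```python
class PeriodicTraceWriter:
    """Queues trace records and flushes them once per interval.

    interval_ms: how often queued data should be written (milliseconds).
    sink:        callable that receives a list of queued records.
    clock:       callable returning the current time in milliseconds.
    """

    def __init__(self, interval_ms, sink, clock):
        self.interval_ms = interval_ms
        self.sink = sink
        self.clock = clock
        self.queue = []
        self.last_flush = clock()

    def record(self, item):
        """Queue one record; flush if the interval has elapsed."""
        self.queue.append(item)
        if self.clock() - self.last_flush >= self.interval_ms:
            self.flush()

    def flush(self):
        """Write everything queued so far and restart the interval."""
        if self.queue:
            self.sink(list(self.queue))
            self.queue.clear()
        self.last_flush = self.clock()


now = [0]        # fake clock so the example is deterministic
written = []
writer = PeriodicTraceWriter(100, written.append, lambda: now[0])
writer.record("clCreateBuffer")
now[0] = 120     # more than 100 ms later
writer.record("clEnqueueWriteBuffer")
writer.flush()   # final flush at program termination
print(written)
```

Each flush corresponds to one of the periodic gaps mentioned in the note: a shorter interval keeps less data queued in memory but interrupts the application more often.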

Updates Page

This page allows you to configure whether the APP Profiler will automatically check for updates, as well as how often it will check for updates.

It also allows you to check for updates manually.

  • Check for updates on startup When checked, the APP Profiler will automatically check for updates during startup of the Visual Studio plugin. When unchecked, no automatic check for updates will be performed.
  • Frequency of update check Allows you to specify how often you want the APP Profiler to check for updates.
    Frequency Description
    Every startup The APP Profiler will check for an update each time Visual Studio is started.
    Every day The APP Profiler will check for an update once each day when Visual Studio is started.
    Every 7 days The APP Profiler will check for an update once every 7 days when Visual Studio is started.
    Every 30 days The APP Profiler will check for an update once every 30 days when Visual Studio is started.
  • Last update check Displays the date of the most recent update check.
  • Next scheduled update check Displays the date of the next scheduled update check.
  • You have chosen to skip build X This will be shown if you choose to skip an available update.
  • Check for Updates Click this button to immediately check for an update.
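
The relationship between Last update check, Frequency of update check, and Next scheduled update check can be sketched with simple date arithmetic. The mapping and function names below are hypothetical, and the sketch assumes the next check is the last check date plus the selected number of days; the profiler's actual scheduling logic is not described on this page.

```python
from datetime import date, timedelta

# Days between checks for each option in the Frequency table above.
FREQUENCY_DAYS = {
    "Every startup": 0,
    "Every day": 1,
    "Every 7 days": 7,
    "Every 30 days": 30,
}


def next_update_check(last_check: date, frequency: str) -> date:
    """Earliest date on which the next automatic check would run."""
    return last_check + timedelta(days=FREQUENCY_DAYS[frequency])


def is_check_due(today: date, last_check: date, frequency: str) -> bool:
    """True if a check should run when Visual Studio starts today."""
    return today >= next_update_check(last_check, frequency)


last = date(2013, 1, 15)
print(next_update_check(last, "Every 7 days"))         # 2013-01-22
print(is_check_due(date(2013, 1, 20), last, "Every 7 days"))  # False
```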