Skip to content
Filters & more
1h
Advanced
pytorchrocmgpukernelsoptimization
Device Family
Device
OS

Building Custom GPU Kernels with PyTorch and AMD ROCm™

Write and optimize custom GPU kernels using PyTorch and AMD ROCm™ software on AMD Ryzen™ AI

Building Custom GPU Kernels with PyTorch and AMD ROCm™

Overview

Write a GPU kernel from scratch, compile it, launch it on an AMD GPU, and watch utilization spike. This playbook shows how GPU computation actually works: write the kernel code, and execute it in parallel across thousands of threads.

What You’ll Learn

  • How GPU kernels work: grids, blocks, threads, and the indexing model that maps them to data
  • How the AMD ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification
  • How to compile a kernel at runtime using torch.cuda._compile_kernel
  • How to build a native C++ kernel extension with CUDAExtension + pybind11, importable from Python
  • How GPU kernels work: grids, blocks, threads, and the indexing model that maps them to data
  • How the AMD ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification
  • How to compile a kernel at runtime using torch.cuda._compile_kernel
  • How to build a native C++ kernel extension with CUDAExtension + pybind11, importable from Python
  • How to measure kernel execution time and monitor live GPU utilization with amd-smi

This playbook covers two approaches for kernel development:

ApproachEntry point
JIT Compilationtorch.cuda._compile_kernel, write a kernel as a Python string, with no build step
C++ ExtensionCUDAExtension + pybind11: compile a .cu file into a native .pyd and import it
ApproachEntry point
JIT Compilationtorch.cuda._compile_kernel, write a kernel as a Python string, with no build step
C++ ExtensionCUDAExtension + pybind11: compile a .cu file into a native .so and import it

Both approaches run on AMD GPUs. This is possible because PyTorch’s ROCm build maps the entire CUDA API surface to HIP. This means torch.cuda, CUDAExtension, and CUDA kernel syntax all work on AMD hardware transparently.


Background

What is a GPU Kernel?

A GPU kernel is a function that runs in parallel across thousands of GPU threads simultaneously. Unlike a CPU function that executes once per call, a kernel is launched with a grid of blocks, each containing many threads, all executing the same code on different data.

Thread Indexing Model

When launching a kernel you specify two dimensions:

VariableMeaning
gridDimNumber of blocks in the grid
blockDimNumber of threads per block

Each thread has access to three built-in read-only variables:

VariableMeaning
blockIdx.xWhich block this thread belongs to
blockDim.xNumber of threads in one block
threadIdx.xThread index within its block

Global Thread ID

These variables are combined to compute a globally unique thread index:

int idx = blockIdx.x * blockDim.x + threadIdx.x;

Total threads = gridDim.x * blockDim.x. Each thread processes one element independently. This is the foundation of data parallelism. The same operation runs on many elements at once, with no inter-thread dependency.


GPU Execution Model: Wavefronts

AMD GPUs execute threads in groups of 32 called wavefronts. All threads in a wavefront run the same instruction simultaneously. This affects optimal block size choices (256 threads = 8 wavefronts = good scheduling efficiency).

AMD GPU Programming: HIP + ROCm

ROCm is AMD’s open-source GPU compute stack (drivers, compilers, libraries, runtime). HIP sits on top, designed to be syntactically identical to CUDA. PyTorch’s ROCm build transparently maps torch.cuda.* to HIP, so the same code works on AMD GPUs.


PyTorch + AMD/HIP

PyTorch ships a ROCm build where the CUDA API surface (torch.cuda.*) is transparently backed by HIP. This means:

  • torch.cuda.is_available() works on AMD GPUs with ROCm
  • tensor.to("cuda") allocates on the AMD GPU
  • torch.version.hip exposes the HIP version

PyTorch also exposes torch.cuda._compile_kernel(), a high-level shortcut to JIT-compile a raw kernel string and get back a callable, without needing a separate build step.


Check for Software Updates

Before starting, ensure your Ryzen AI Halo has the latest software installed. Open the AMD Ryzen™ AI Developer Center and check for available updates, both to the app itself and additional software.

Go to the Updates tab. If updates are available, install them and reboot before continuing.

AMD Ryzen AI Developer Center — Updates tab on Windows

Go to the Manage tab. If updates are available, install them and reboot before continuing.

AMD Ryzen AI Developer Center — Manage tab on Linux

Installing Software Prerequisites

Prerequisites - Windows

Create a Virtual Environment

On Linux, open a terminal in the directory of your choice and follow the commands to create a venv with ROCm+Pytorch already installed.

Terminal window
sudo apt update
sudo apt install -y python3-venv
python3 -m venv kernel-env --system-site-packages
source kernel-env/bin/activate

Grant your user access to GPU devices (log out and back in for this to take effect):

Terminal window
sudo usermod -aG render,video $LOGNAME

On Linux, open a terminal in the directory of your choice and follow the commands to create a venv.

Terminal window
sudo apt update
sudo apt install -y python3-venv
python3 -m venv kernel-env
source kernel-env/bin/activate

On Windows, open a terminal in the directory of your choice and follow the commands to create a venv.

Terminal window
python -m venv kernel-env
kernel-env\Scripts\activate

Installing Basic Dependencies

ROCm

Add the current user to the render and video groups.

Terminal window
sudo usermod -a -G render,video $LOGNAME

Restart your system to apply the settings.

Terminal window
sudo reboot

Install ROCm in the created virtual environment.

Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ "rocm[libraries,devel]"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1152/ "rocm[libraries,devel]"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1150/ "rocm[libraries,devel]"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx120x-all/ "rocm[libraries,devel]"

For further installation help, please see this link.


PyTorch

Install PyTorch with AMD ROCm™ software support in the created virtual environment:

Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ "torch==2.11.0+rocm7.13.0" "torchvision==0.26.0+rocm7.13.0" "torchaudio==2.11.0+rocm7.13.0"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1150/ "torch==2.11.0+rocm7.13.0" "torchvision==0.26.0+rocm7.13.0" "torchaudio==2.11.0+rocm7.13.0"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1152/ "torch==2.11.0+rocm7.13.0" "torchvision==0.26.0+rocm7.13.0" "torchaudio==2.11.0+rocm7.13.0"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1200-all/ torch torchvision torchaudio

For other devices, please refer to this link for full instructions.

AMD GPU Driver

Update to the latest AMD GPU driver using AMD Software: Adrenalin Edition™.

  1. Open AMD Software: Adrenalin Edition from your Start menu or system tray.
  2. Navigate to Driver and Software, click Manage Updates.
  3. If an update is available, follow the prompts to download and install.

AMD GPU Driver

Download and install the latest AMD GPU driver for Linux:

  1. Visit the AMD Linux Drivers page.
  2. Follow the installation instructions provided on the download page.

ROCm

Add the current user to the render and video groups.

Terminal window
sudo usermod -a -G render,video $LOGNAME

Restart your system to apply the settings.

Terminal window
sudo reboot

Install ROCm in the created virtual environment.

Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ "rocm[libraries,devel]"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1152/ "rocm[libraries,devel]"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1150/ "rocm[libraries,devel]"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx120x-all/ "rocm[libraries,devel]"

For further installation help, please see this link.


PyTorch

Install PyTorch with AMD ROCm™ software support in the created virtual environment:

Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ "torch==2.11.0+rocm7.13.0" "torchvision==0.26.0+rocm7.13.0" "torchaudio==2.11.0+rocm7.13.0"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1150/ "torch==2.11.0+rocm7.13.0" "torchvision==0.26.0+rocm7.13.0" "torchaudio==2.11.0+rocm7.13.0"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1152/ "torch==2.11.0+rocm7.13.0" "torchvision==0.26.0+rocm7.13.0" "torchaudio==2.11.0+rocm7.13.0"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1200-all/ torch torchvision torchaudio

For other devices, please refer to this link for full instructions.

Install ROCm:

Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ "rocm[libraries,devel]"

Install PyTorch:

Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ "torch==2.11.0+rocm7.13.0" "torchvision==0.26.0+rocm7.13.0" "torchaudio==2.11.0+rocm7.13.0"

Installing Additional Dependencies

Install the Linux C/C++ build toolchain. This is a system-level dependency and is required for the C++ extension walkthroughs because CUDAExtension builds native .so modules from .cu files.

Run this once on the Linux machine, outside the created Python virtual environment:

Terminal window
sudo apt update
sudo apt install -y build-essential gcc g++

After activating the kernel-env virtual environment, install the Python build dependencies:

Terminal window
python -m pip install "setuptools<82" wheel ninja

Please ensure Visual Studio 2022 or newer is installed with the Desktop development with C++ workload.

Open a PowerShell terminal and run the following commands before building the C++ extension.

Step 1: Find the installed Visual Studio C++ environment

(A) Locate vswhere.exe, which is installed with the Visual Studio Installer

Terminal window
$VsWhere = "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe"
if (-not (Test-Path $VsWhere)) {throw "vswhere.exe was not found. Install Visual Studio 2022 or newer with the Desktop development with C++ workload."}

(B) Find vcvars64.bat from Visual Studio 2022 or newer with C++ build tools

Terminal window
$Vcvars = & $VsWhere `
-latest `
-products * `
-requires Microsoft.VisualStudio.Component.VC.Tools.x86.x64 `
-find "VC\Auxiliary\Build\vcvars64.bat" |
Select-Object -First 1
if (-not $Vcvars) {throw "Could not find vcvars64.bat. Install Visual Studio 2022 or newer with the Desktop development with C++ workload."}

(C) Print the Visual Studio C++ Environment being used

Terminal window
Write-Host "Using Visual Studio C++ environment: $Vcvars"

Step 2: Activate the Visual Studio C++ build environment

(A) Run vcvars64.bat and capture the environment it sets

This makes cl.exe, INCLUDE, LIB, LIBPATH, and Windows SDK paths available.

Terminal window
$VsEnv = cmd /c "`"$Vcvars`" && where cl && set" 2>&1
$ExitCode = $LASTEXITCODE
if ($ExitCode -ne 0) {
$VsEnv | Out-Host
throw "Failed to activate the Visual Studio C++ environment. Exit code: $ExitCode"
}

(B) Import the Visual Studio environment variables into this PowerShell session

Terminal window
$VsEnv | ForEach-Object {
if ($_ -match '^([^=]+)=(.*)$') {
[System.Environment]::SetEnvironmentVariable($matches[1], $matches[2], 'Process')
}
}

Step 3: Verify that the Microsoft C++ compiler is available

Terminal window
where.exe cl

Set Environment Variables

Terminal window
rocm-sdk init # Initialize the devel libraries
PY_MM="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
export ROCM_HOME="$VIRTUAL_ENV/lib/python${PY_MM}/site-packages/_rocm_sdk_devel"
export LD_LIBRARY_PATH="$ROCM_HOME/lib:$LD_LIBRARY_PATH"
export PATH="$ROCM_HOME/bin:$PATH"
# Set compiler and build settings
export CC=clang
export CXX=clang
export DISTUTILS_USE_SDK=1
Terminal window
rocm-sdk init # Initialize the devel libraries
$ROCM_ROOT = (rocm-sdk path --root).Trim()
$ROCM_BIN = (rocm-sdk path --bin).Trim()
$RocmPathEntries = @(
$ROCM_BIN,
"$ROCM_ROOT\bin",
"$ROCM_ROOT\lib",
"$ROCM_ROOT\lib\llvm\bin"
) | Where-Object { $_ -and (Test-Path $_) }
$env:PATH = (($RocmPathEntries + @($env:PATH)) -join ";")
$env:ROCM_HOME = $ROCM_ROOT
$env:HIP_PATH = $ROCM_ROOT
$env:ROCM_BIN = $ROCM_BIN
$env:HIP_PLATFORM = "amd"
# Set compiler and build settings
$env:CC = "clang-cl"
$env:CXX = "clang-cl"
$env:DISTUTILS_USE_SDK = "1"

Verify that the AMD GPU is visible with:

Terminal window
amd-smi

Download Required Files

Create the following directory structure by making the 2 new folders and downloading the corresponding files:

DirectoryFiles to DownloadDescription
Vector_Addition/


JIT and C++ extension files for vector addition kernel
Matrix_Multiplication/


JIT and C++ extension files for matrix multiplication kernel

Walkthroughs

Walkthrough 1: Vector Addition

Approach A: JIT Compilation

JIT (Just-In-Time) compilation means the kernel is written as a raw C++ string inside Python and compiled at runtime, without needing extra build steps.

To use , make sure it’s downloaded and run:

Terminal window
cd Vector_Addition # if not already inside the directory
python add_one_kernel.py

Key Code Snippets

import torch
# Snippet 1: Kernel source as a string
KERNEL_SOURCE = """
extern "C"
__global__ void add_one(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
for (int i = 0; i < 1000; i++)
data[idx] += 1.0f;
}
}
"""
# Snippet 2: Compile the kernel string. PyTorch calls hipcc under the hood with ROCm
add_one_kernel = torch.cuda._compile_kernel(KERNEL_SOURCE, "add_one")
x = torch.ones(100_000_000, dtype=torch.float32, device="cuda")
n = x.numel()
block_size = 256
grid_size = (n + block_size - 1) // block_size
# Snippet 3: Launch: specify the grid/block dimensions and pass tensor arguments directly
for _ in range(200):
add_one_kernel(
grid=(grid_size, 1, 1),
block=(block_size, 1, 1),
args=[x, n],
)
# Snippet 4: Test the output
print("First 5 elements:", x[:5].cpu())
#Expected output: tensor([200001., 200001., 200001., 200001., 200001.])

What the workload does:

The kernel artificially adds extra work to demonstrate GPU utilization:

  • 100,000,000 elements in the tensor
  • Inner loop runs 1,000 times per element per kernel launch
  • 200 kernel launches total

Math:

  • Each element: gets incremented by 1 × 1,000 iterations × 200 launches = 200,000
  • Final result: 1.0 (starting value) + 200,000 (additions) = 200,001.0

Why the inner loop?

  • Without the for (int i = 0; i < 1000; i++) loop, 200 launches would finish instantly and the monitoring tools wouldn’t capture meaningful GPU utilization. The artificial work makes each kernel run long enough for monitoring tools to measure performance.

Expected output:[The performance numbers will vary]

First 5 elements: tensor([200001., 200001., 200001., 200001., 200001.])
Elapsed time: 2.753s
Peak GPU Utilization: 93%
Average GPU Utilization: 65.94%

Expected output:

First 5 elements: tensor([200001., 200001., 200001., 200001., 200001.])
Elapsed time: 2.753s
No GPU Usage captured.

Nice work! You just ran your first GPU kernel.


Approach B: C++ Extension

The second approach is more manual: write the kernel and Python binding to a single .cu file, compile it natively using PyTorch’s build system, and import it into Python.

Download the following files if you haven’t already:

FileRole
Kernel + launcher + pybind11 binding, everything in one file
Build script, uses CUDAExtension to compile the .cu into a .pyd
Python script that runs the built artifacts
FileRole
Kernel + launcher + pybind11 binding, everything in one file
Build script, uses CUDAExtension to compile the .cu into a .so
Python script that runs the built artifacts

Step 1: The kernel, launcher, and binding ():

#include <torch/extension.h>
#include <hip/hip_runtime.h>
// GPU kernel, one thread per element
__global__ void add_one(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) data[idx] += 1.0f;
}
// Launcher, bridges torch::Tensor to raw pointer, sets grid/block, runs kernel
void add_one_launcher(torch::Tensor tensor) {
int n = tensor.numel();
float* data = tensor.data_ptr<float>();
int block_size = 256;
int grid_size = (n + block_size - 1) / block_size;
add_one<<<grid_size, block_size>>>(data, n);
hipDeviceSynchronize();
}
// Python binding, exposes add_one_launcher as add_one_ext.add_one
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("add_one", &add_one_launcher, "Add one kernel (HIP)");
}

Step 2: Build

Terminal window
pip install --no-build-isolation -v .

CUDAExtension is a CUDA build helper from torch.utils.cpp_extension. With ROCm, PyTorch remaps CUDAExtension to use hipcc instead of nvcc. ROCm intercepts the build path and routes it through the HIP compiler, porting CUDA code to AMD.

This produces the following files:

  • build/: directory with the .pyd files
  • add_one_kernel.hip: the HIP source generated by hipifying the .cu file; this is what hipcc actually compiled
  • build/: directory with the .so files
  • add_one_kernel.hip: the HIP source generated by hipifying the .cu file; this is what hipcc actually compiled

Step 3: Use from Python ():

Execute this script to see the kernel in action:

Terminal window
cd Vector_Addition # if not already in directory
python run_compiled_addition.py

Expected output:

Before: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0')
After: tensor([2., 2., 2., 2., 2., 2., 2., 2., 2., 2.], device='cuda:0')

Walkthrough 2: Matrix Multiplication

Matrix multiplication computes C = A × B where:

  • A is M×N (rows × columns)
  • B is N×K
  • C is M×K (the result)

Each output element is defined as: C[row,col]=n=0N1A[row,n]B[n,col]C[row, col] = \sum_{n=0}^{N-1} A[row, n] \cdot B[n, col]

Each element of C is calculated independently, making this perfect for GPU parallelism.

How It Maps to GPU Threads

Unlike vector addition (1D), matrix multiplication produces a 2D output, so we use a 2D grid of threads:

Vector AdditionMatrix Multiplication
Output shape1D array2D matrix (M×K)
Thread mapping1 thread → 1 element1 thread → 1 output element
Launch pattern1D grid: (grid_x, 1, 1)2D grid: (grid_x, grid_y, 1)
Block size(256, 1, 1)(16, 16, 1) = 256 threads

Each thread computes one element of the output matrix C. Thread at position (row, col) computes C[row][col] by multiplying the corresponding row of A with the corresponding column of B.

Memory Layout: GPU memory is flat (1D), but matrices are stored row-by-row. To access A[row][col], the kernel uses A[row * N + col].

Approach A: JIT Compilation:

Like Walkthrough 1, the kernel is written as a raw C++ string inside Python and compiled at runtime via PyTorch’s built-in JIT.

To use , make sure it’s downloaded and run:

Terminal window
cd Matrix_Multiplication # if not already inside the directory
python matmul_kernel.py

Key Code Snippets

import torch
# Snippet 1: Kernel source as a string
KERNEL_SOURCE = """
extern "C"
__global__ void matmul(float* A, float* B, float* C, int M, int N, int K) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < M && col < K) {
float sum = 0.0f;
for (int n = 0; n < N; n++) {
sum += A[row * N + n] * B[n * K + col];
}
C[row * K + col] = sum;
}
}
"""
# Snippet 2: Creating the Matrix - 2D indexing to map threads onto the M×K output matrix
# Inputs: A is M x N, B is N x K, C is M x K
M, N, K = 1024, 512, 768
A = torch.randn(M, N, dtype=torch.float32, device="cuda")
B = torch.randn(N, K, dtype=torch.float32, device="cuda")
C = torch.zeros(M, K, dtype=torch.float32, device="cuda")
BLOCK = 16
grid_x = (K + BLOCK - 1) // BLOCK
grid_y = (M + BLOCK - 1) // BLOCK
# Snippet 3: Compile the kernel string
matmul_kernel = torch.cuda._compile_kernel(KERNEL_SOURCE, "matmul")
# Snippet 4:. Launch with a 2D grid, grid_x covers columns (K), grid_y covers rows (M)
BLOCK = 16
matmul_kernel(
grid=(grid_x, grid_y, 1),
block=(BLOCK, BLOCK, 1),
args=[A, B, C, M, N, K],
)
C_ref = torch.mm(A, B)
max_err = (C - C_ref).abs().max().item()
print(f"Max error vs torch.mm: {max_err:.6f}")

The script verifies the result against torch.mm with a small tolerance. Floating-point arithmetic on GPUs may produce small numerical differences compared to CPU implementations due to parallel reduction order.

Expected output:[The performance numbers will vary]

Elapsed time: 2.753s
Max error vs torch.mm: 0.000160
Peak GPU Utilization: 93%
Average GPU Utilization: 65.94%

Expected output:

Elapsed time: 2.753s
Max error vs torch.mm: 0.000160
No GPU Usage captured.

Approach B: C++ Extension

The second approach is more manual: write the kernel and Python binding to a single .cu file, compile it natively using PyTorch’s build system, and import it into Python.

Download the following files if you haven’t already:

FileRole
Kernel + launcher + pybind11 binding
Build script, uses CUDAExtension to compile the .cu into a .pyd
Python script that runs the built artifacts
FileRole
Kernel + launcher + pybind11 binding
Build script, uses CUDAExtension to compile the .cu into a .so
Python script that runs the built artifacts

Step 1: The kernel, launcher, and binding ():

#include <torch/extension.h>
#include <hip/hip_runtime.h>
#define BLOCK 16
// GPU kernel, one thread per output element of C
__global__ void matmul(float* A, float* B, float* C, int M, int N, int K) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < M && col < K) {
float sum = 0.0f;
for (int n = 0; n < N; n++) {
sum += A[row * N + n] * B[n * K + col];
}
C[row * K + col] = sum;
}
}
// Launcher, extracts dims from torch::Tensor, allocates C, sets 2D grid/block
torch::Tensor matmul_launcher(torch::Tensor A, torch::Tensor B) {
int M = A.size(0), N = A.size(1), K = B.size(1);
auto C = torch::zeros({M, K}, A.options());
dim3 block(BLOCK, BLOCK);
dim3 grid((K + BLOCK - 1) / BLOCK, (M + BLOCK - 1) / BLOCK);
matmul<<<grid, block>>>(A.data_ptr<float>(), B.data_ptr<float>(),
C.data_ptr<float>(), M, N, K);
hipDeviceSynchronize();
return C;
}
// Python binding, exposes matmul_launcher as matmul_ext.matmul
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("matmul", &matmul_launcher, "Naive matmul kernel (HIP): A(M,N) @ B(N,K) -> C(M,K)");
}

Compared to add_one_launcher in Walkthrough 1, the launcher here:

  • Takes two input tensors instead of one
  • Derives all three dimensions (M, N, K) from tensor shapes, no manual size passing from Python
  • Allocates and returns the output tensor C, rather than mutating in-place
  • Uses dim3 for both grid and block to express the 2D launch shape

Step 2: Build

Terminal window
pip install --no-build-isolation -v .

This produces the following files:

  • build/: directory with the .pyd files
  • matmul_kernel.hip: the HIP source generated by hipifying the .cu file; this is what hipcc actually compiled
  • build/: directory with the .so files
  • matmul_kernel.hip: the HIP source generated by hipifying the .cu file; this is what hipcc actually compiled

Step 3: Use from Python ():

Execute this script to see the kernel in action:

Terminal window
cd Matrix_Multiplication # if not already in directory
python run_compiled_multiply.py

Expected output:

Result: tensor([[19., 22.],
[43., 50.]])

Awesome! You just implemented matrix multiplication on the GPU. This is a major milestone because matrix multiplication is the backbone of modern machine learning operations like:

  • Neural network layers
  • Attention mechanisms
  • Embeddings
  • Transformers

Next Steps

You’ve learned to write, compile, and launch GPU kernels using both JIT compilation and C++ extensions for basic parallel operations.

Performance optimizations:

  • Shared memory tiling - Cache data blocks to reduce global memory access
  • Memory coalescing - Optimize memory access patterns for bandwidth

Real-world algorithms:

  • 2D Convolution - A small filter (kernel) slides across an image, computing each output pixel from a weighted sum of neighboring pixels. This introduces stencil computations and shared memory tiling, where threads reuse overlapping image regions to reduce global memory access.
  • Softmax Function: Softmax converts a vector of numbers into probabilities that sum to 1, commonly used in neural network outputs. Implementing it efficiently on GPU introduces parallel reductions and numerical stability techniques while processing large vectors.

Production considerations:

  • Error handling - Bounds checking and device management
  • PyTorch integration - Custom operators with autograd support

Need help with this playbook?

Run into an issue or have a question? Open a GitHub issue and our team will take a look.