Skip to content
Filters & more
1h 30m
Advanced
clusteringrpcdistributedllminferencerocm
Device Family
Device
OS

Clustering Two Ryzen™ AI Halos with RPC

Set up distributed inference using RPC server across two Ryzen™ AI Halo devices with llama.cpp to run 350B+ models

Overview

Your Ryzen™ AI Halo is already capable of running large language models locally. Clustering takes this further by combining the GPU memory of multiple systems over a local network, giving you access to even larger models with stronger reasoning, better code generation, and deeper multilingual understanding, all entirely on your own hardware.

This playbook teaches you how to cluster two Ryzen AI Halo systems using llama.cpp’s RPC engine and run GLM 4.7, a 358B parameter model, across both machines with AMD ROCm™ acceleration.

What You’ll Learn

  • How to extend VRAM allocation on Ryzen AI Halo systems
  • Installing llama.cpp with ROCm and RPC support
  • Configuring an RPC worker and launching distributed inference across two nodes
  • Running a 358B parameter model across two networked Ryzen AI Halo systems

Setting the Memory Configuration

On Windows, to run larger models that require higher memory, we need to use the AMD Variable Graphics Memory (iGPU VRAM) allocation.

This can be done by opening AMD Software: Adrenalin Edition control panel and navigating to: Performance > Tuning > AMD Variable Graphics Memory. Set the value to 96 GB. Please reboot the system for the changes to take effect.

AMD Software Adrenalin Edition — AMD Variable Graphics Memory panel

On Linux, ROCm utilizes a shared system memory pool, and this pool is configured by default to half the system memory.

This amount can be increased by changing the kernel’s Translation Table Manager (TTM) page setting, with the following instructions. AMD recommends setting the minimum dedicated VRAM in the BIOS (0.5 GB).

  • Install the pipx utility and add the path for pipx installed wheels into the system search path.
Terminal window
sudo apt install pipx
pipx ensurepath
  • Install the amd-debug-tools wheel from PyPI.
Terminal window
pipx install amd-debug-tools
  • Run the amd-ttm tool to query the current settings for shared memory.
Terminal window
amd-ttm
  • Reconfigure shared memory settings to 120 GB:
Terminal window
amd-ttm --set 120
  • Reboot the system for changes to take effect.

Check for Software Updates

Before starting, ensure your Ryzen AI Halo has the latest software installed. Open the AMD Ryzen™ AI Developer Center and check for available updates, both to the app itself and additional software.

Go to the Updates tab. If updates are available, install them and reboot before continuing.

AMD Ryzen AI Developer Center — Updates tab on Windows

Go to the Manage tab. If updates are available, install them and reboot before continuing.

AMD Ryzen AI Developer Center — Manage tab on Linux

Prerequisites

Hardware

This playbook requires two Ryzen AI Halo units and one Ethernet switch, connected in a star topology with each unit wired directly to the switch.

ComponentQuantityDescription
Ryzen AI Halo2Compute nodes that form the cluster
10Gbps Ethernet switch1Central switch to allow multi node Ryzen AI Halo communication (at least 2 ports)
Ethernet cable2Connects each Halo unit to the switch (Cat 7 or higher recommended)

Software

AMD GPU Driver

Update to the latest AMD GPU driver using AMD Software: Adrenalin Edition™.

  1. Open AMD Software: Adrenalin Edition from your Start menu or system tray.
  2. Navigate to Driver and Software, click Manage Updates.
  3. If an update is available, follow the prompts to download and install.

AMD GPU Driver

Download and install the latest AMD GPU driver for Linux:

  1. Visit the AMD Linux Drivers page.
  2. Follow the installation instructions provided on the download page.

Please install:

Terminal window
sudo apt install git cmake python3 python3-pip

Physical Hardware Setup

Connect each Ryzen AI Halo unit to the Ethernet switch using a Cat 7 (or higher) cable. This establishes the 10Gbps link used for high-speed communication between the nodes.

1. Determine Network Interfaces

On each machine, find the name of its network interface and note it down (it will be referred to below as IFNAME). Run:

Terminal window
ip route get 1.1.1.1 | grep -oP 'dev \K\S+'

This prints the interface name directly, for example:

Terminal window
enp191s0

Confirm the link is active and running at full speed by checking the speed of your interface:

Terminal window
sudo ethtool <IFNAME> | grep Speed

You should see a speed of 10000Mb/s:

Terminal window
Speed: 10000Mb/s

On each machine, check the link speed of your network interfaces:

Terminal window
Get-NetAdapter | Select-Object Name, Status, LinkSpeed

Your Ethernet interface should be Up and running at 10 Gbps:

Terminal window
Name Status LinkSpeed
---- ------ ---------
Ethernet Up 10 Gbps

Installing llama.cpp

Two installation options are available:

The Lemonade SDK provides nightly builds of llama.cpp with AMD ROCm 7 acceleration, targeting GPUs such as gfx1151 (Strix Halo / Ryzen AI Max+ 395) and other recent Radeon architectures.

Step 1: Download the Pre-Built Binaries

Navigate to the latest release page and download the archive matching your platform and GPU target:

https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/

Download the file named llama-bxxxx-windows-rocm-gfx1151-x64.zip (where xxxx is the build number).

Step 2: Extract the Binaries

Unzip the downloaded archive:

Terminal window
llama-bxxxx-windows-rocm-gfx1151-x64.zip

This directory now contains ROCm-enabled builds of llama-cli.exe, llama-server.exe, and rpc-server.exe, precompiled for your Ryzen AI Halo system.

Step 3: Verify GPU Detection

Terminal window
.\llama-cli.exe --list-devices

Expected output:

Terminal window
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon(TM) Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
ROCm0: AMD Radeon(TM) Graphics (110511 MiB, 110357 MiB free)

Step 1: Download the Pre-Built Binaries

Navigate to the latest release page and download the archive matching your platform and GPU target:

https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/

Download the file named llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip (where xxxx is the build number).

Step 2: Extract and Prepare the Binaries

Terminal window
unzip llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip
cd llama-bxxxx-ubuntu-rocm-gfx1151-x64
chmod +x llama-cli llama-server rpc-server

This directory now contains ROCm-enabled builds of llama-cli, llama-server, and rpc-server, precompiled for your Ryzen AI Halo system.

Step 3: Verify GPU Detection

Terminal window
./llama-cli --list-devices

Expected output:

Terminal window
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 127697544
ROCm0: AMD Radeon Graphics (120000 MiB, 124704 MiB free)

With llama.cpp prepared on each node, proceed to Downloading the Model.

Option 2: Manual Source Build

Step 1: Build llama.cpp

Open the x64 Native Tools Command Prompt (installed with Visual Studio Build Tools) and clone the repository:

Terminal window
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Add HIP to your path and build with ROCm and RPC support:

Terminal window
set PATH=%HIP_PATH%\bin;%PATH%
cmake -S . -B rocm -G Ninja -DGGML_HIP=ON -DGGML_RPC=ON -DGPU_TARGETS=gfx1151 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release
cmake --build rocm --config Release
Build FlagPurpose
-DGGML_HIP=ONEnables the ROCm/HIP software stack
-DGGML_RPC=ONEnables RPC for distributed inference
-DGPU_TARGETS=gfx1151Targets the Ryzen AI Halo GPU (Radeon 8060s)
-G NinjaUses the Ninja build system

Step 2: Verify GPU Detection

Terminal window
cd rocm\bin
.\llama-cli.exe --list-devices

Expected output:

Terminal window
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon(TM) Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
ROCm0: AMD Radeon(TM) Graphics (110511 MiB, 110357 MiB free)

Step 3: Add HIP to Your User Path

The build step above set %HIP_PATH%\bin for the current session only. To make the HIP libraries available in any terminal (not just the x64 Native Tools Command Prompt), add it to your user PATH permanently:

Terminal window
powershell -Command "[System.Environment]::SetEnvironmentVariable('Path', [System.Environment]::GetEnvironmentVariable('Path', 'User') + ';%HIP_PATH%\bin', 'User')"

With llama.cpp prepared on each node, proceed to Downloading the Model.

Step 1: Build llama.cpp

Clone the repository:

Terminal window
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Build with ROCm and RPC support:

Terminal window
cmake -B rocm -DGGML_HIP=ON -DGGML_RPC=ON -DGGML_HIP_ROCWMMA_FATTN=ON -DAMDGPU_TARGETS="gfx1151"
cmake --build rocm --config Release -j$(nproc)
Build FlagPurpose
-DGGML_HIP=ONEnables the ROCm software stack
-DGGML_RPC=ONEnables RPC for distributed inference
-DGGML_HIP_ROCWMMA_FATTN=ONEnables rocWMMA for enhanced Flash Attention on AMD GPUs
-DAMDGPU_TARGETS="gfx1151"Targets the Ryzen AI Halo GPU (Radeon 8060s)

For more build options, refer to the llama.cpp build documentation.

Step 2: Verify GPU Detection

Terminal window
cd rocm/bin
./llama-cli --list-devices

Expected output:

Terminal window
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 127697544
ROCm0: AMD Radeon Graphics (120000 MiB, 124704 MiB free)

With llama.cpp prepared on each node, proceed to Downloading the Model.

Downloading the Model

This playbook uses GLM 4.7, a 358B parameter model in the Q4_K_XL quantization from Unsloth. At this quantization the model requires approximately 205GB of storage and fits within the combined GPU memory of two Ryzen AI Halo nodes.

Download the GGUF files using the Hugging Face CLI:

Terminal window
pip install huggingface-hub
hf download unsloth/GLM-4.7-GGUF --include "UD-Q4_K_XL/*" --local-dir GLM-4.7-GGUF
Terminal window
python -m pip install -U huggingface-hub
$hfScripts = python -c "import sysconfig; print(sysconfig.get_path('scripts'))"
$env:Path = "$hfScripts;$env:Path"
hf download unsloth/GLM-4.7-GGUF --include "UD-Q4_K_XL/*" --local-dir GLM-4.7-GGUF

Launching the Model on the Cluster

The llama.cpp RPC (Remote Procedure Call) engine allows a single llama.cpp instance to offload model layers to remote workers over the network. One machine acts as the controller (Machine 1), handling tokenization, scheduling, and orchestration. The other machine runs a lightweight RPC server (Machine 2) that exposes its GPU memory and compute to the controller.

At load time, llama.cpp shards the model across both nodes. Once loaded, inference proceeds as if running on a single accelerator. RPC handles tensor transfers and synchronization behind the scenes.

Step 1: Start the RPC Server (Machine 2)

On Machine 2, start the RPC server to expose its GPU resources to the controller:

Terminal window
./rpc-server -p 50053 -c --host 0.0.0.0
Terminal window
.\rpc-server.exe -p 50053 -c --host 0.0.0.0
FlagPurpose
-pPort to broadcast the RPC server on
-cEnables a local cache for large tensors, avoiding repeated network transfers during model loading
--hostIP address to bind the RPC server to (0.0.0.0 for all interfaces)

For more options, refer to the llama.cpp RPC documentation.

Step 2: Launch the Model (Machine 1)

With the RPC server running on Machine 2, launch inference from Machine 1 using either llama-cli or llama-server.

llama-cli

llama-cli provides a terminal-based interface for interacting directly with the model. It is ideal for benchmarking, debugging, and low-level experimentation.

Terminal window
./llama-cli \
-m /path/to/GLM-4.7-GGUF/UD-Q4_K_XL/GLM-4.7-UD-Q4_K_XL-00001-of-00005.gguf \
-c 32768 \
-fa on \
-ngl 999 \
--no-mmap \
--rpc <RPC_WORKER_IP>:50053

Finding <RPC_WORKER_IP>: On Machine 2, run hostname -I | awk '{print $1}' to find its local IP address.

Terminal window
.\llama-cli.exe `
-m C:\path\to\GLM-4.7-GGUF\UD-Q4_K_XL\GLM-4.7-UD-Q4_K_XL-00001-of-00005.gguf `
-c 32768 `
-fa on `
-ngl 999 `
--no-mmap `
--rpc <RPC_WORKER_IP>:50053

Finding <RPC_WORKER_IP>: On Machine 2, run ipconfig | findstr /C:"IPv4" in Terminal (Powershell) to find its local IP address.

Once running, llama-cli displays model loading progress and enters an interactive prompt where you can chat directly with the model:

llama-cli running GLM 4.7 across two nodes

llama-server

llama-server exposes the same inference engine through a persistent server process with an integrated web UI and an OpenAI-compatible HTTP API. This is the preferred interface for longer-running deployments, multi-user access, and integration with external tooling.

Terminal window
./llama-server \
-m /path/to/GLM-4.7-GGUF/UD-Q4_K_XL/GLM-4.7-UD-Q4_K_XL-00001-of-00005.gguf \
-c 32768 \
-fa on \
-ngl 999 \
--no-mmap \
--host 0.0.0.0 \
--port 8081 \
--rpc <RPC_WORKER_IP>:50053

Finding <RPC_WORKER_IP>: On Machine 2, run hostname -I | awk '{print $1}' to find its local IP address.

Terminal window
.\llama-server.exe `
-m C:\path\to\GLM-4.7-GGUF\UD-Q4_K_XL\GLM-4.7-UD-Q4_K_XL-00001-of-00005.gguf `
-c 32768 `
-fa on `
-ngl 999 `
--no-mmap `
--host 0.0.0.0 `
--port 8081 `
--rpc <RPC_WORKER_IP>:50053

Finding <RPC_WORKER_IP>: On Machine 2, run ipconfig | findstr /C:"IPv4" in Terminal (Powershell) to find its local IP address.

Once started, open http://<HOST_IP>:8081 in your browser to access the built-in web UI. This provides a browser-based chat interface for interacting with the model:

llama-server web UI running GLM 4.7 across two nodes

Finding <HOST_IP>: On Machine 1, run hostname -I | awk '{print $1}' to find its local IP address.

Finding <HOST_IP>: On Machine 1, run ipconfig | findstr /C:"IPv4" in Terminal (Powershell) to find its local IP address.

Parameter Reference

FlagPurpose
-mPath to the GGUF model file (use the first shard, 00001-of-00005)
-cContext size in tokens. Larger values use more memory
-fa onEnables rocWMMA Flash Attention for improved performance on AMD GPUs
-ngl 999Offloads all model layers to the GPU
--no-mmapDisables memory-mapping, reducing load times when model size exceeds system RAM but fits in VRAM
--hostIP to bind llama-server to (llama-server only)
--portPort to serve the HTTP API on (llama-server only)
--rpcComma-separated list of RPC worker endpoints (IP:port)

For full parameter usage, refer to the llama-cli documentation and llama-server documentation.

Next Steps

  • Connect third-party applications: llama-server exposes an OpenAI-compatible API. Point any OpenAI-compatible application (such as Open WebUI) at http://<HOST_IP>:8081 with any placeholder API key (e.g., none) to connect to your cluster
  • Explore other models: Browse quantized GGUFs on Hugging Face to find models that fit within your cluster’s combined GPU memory
  • Scale to four nodes: Add two more Ryzen AI Halo systems as additional RPC workers to access models at the 1 trillion parameter scale. Pass additional endpoints to --rpc as a comma-separated list (e.g., --rpc <IP1>:50053,<IP2>:50053,<IP3>:50053)

Need help with this playbook?

Run into an issue or have a question? Open a GitHub issue and our team will take a look.