Skip to content
Filters & more
1h 30m
Advanced
clusteringdistributedmulti-gpurcclvllm
Device Family
Device
OS

Clustering Two Ryzen™ AI Halos with RCCL

Set up a multi-node cluster using two Ryzen™ AI Halo devices with RCCL for distributed workloads

Overview

Your Ryzen™ AI Halo is already capable of running large language models locally. Clustering takes this further by combining the GPU memory of multiple systems over a local network, giving you access to even larger models with stronger reasoning, better code generation, and deeper multilingual understanding, all entirely on your own hardware.

This playbook teaches you how to cluster two Ryzen AI Halo systems using RCCL (ROCm Communication Collectives Library) with vLLM and run Qwen3.5-397B, a 397B parameter model, across both machines with ROCm acceleration.

What You’ll Learn

  • How to extend VRAM allocation on Ryzen AI Halo systems
  • Launching vLLM with ROCm support
  • Configuring RCCL for multi-node tensor-parallel inference across two Ryzen AI Halo systems
  • Running a 397B parameter model across two networked Ryzen AI Halo systems

Prerequisites

Hardware

This playbook requires two Ryzen AI Halo units and one Ethernet switch, connected in a star topology with each unit wired directly to the switch.

ComponentQuantityDescription
Ryzen AI Halo2Compute nodes that form the cluster
10Gbps Ethernet switch1Central switch to allow multi node Ryzen AI Halo communication (at least 2 ports)
Ethernet cable2Connects each Halo unit to the switch (Cat 7 or higher recommended)

Software

Terminal window
sudo apt install curl

Physical Hardware Setup

Connect each Ryzen AI Halo unit to the Ethernet switch using a Cat 7 (or higher) cable. This establishes the 10Gbps link used for high-speed communication between the nodes.

1. Determine Network Interfaces

On each machine, find the name of its network interface and note it down (it will be referred to in the rest of the instructions as IFNAME). Run:

Terminal window
ip route get 1.1.1.1 | grep -oP 'dev \K\S+'

This prints the interface name directly, for example:

Terminal window
enp191s0

Confirm the link is active and running at full speed by checking the speed of your interface:

Terminal window
sudo ethtool <IFNAME> | grep Speed

You should see a speed of 10000Mb/s:

Terminal window
Speed: 10000Mb/s

Extending VRAM Allocation

Memory Configuration for Running Large Models

On Linux, ROCm utilizes a shared system memory pool, and this pool is configured by default to half the system memory.

This amount can be increased by changing the kernel’s Translation Table Manager (TTM) page setting, with the following instructions. AMD recommends setting the minimum dedicated VRAM in the BIOS (0.5 GB).

  • Install the pipx utility and add the path for pipx installed wheels into the system search path.
Terminal window
sudo apt install pipx
pipx ensurepath
  • Install the amd-debug-tools wheel from PyPI.
Terminal window
pipx install amd-debug-tools
  • Run the amd-ttm tool to query the current settings for shared memory.
Terminal window
amd-ttm
  • Reconfigure shared memory settings to 120 GB:
Terminal window
amd-ttm --set 120
  • Reboot the system for changes to take effect.

vLLM Container Initialization

Your Ryzen AI Halo ships with vLLM packaged inside a prebuilt container image, which you run using Podman, a free and open source container tool.

1. Create the Model Download Directory

When you serve the Qwen3.5-397B model in this playbook, vLLM will automatically download the model weights to your system. To make sure those weights are accessible from inside the container, first create a models directory that the container can mount:

Terminal window
mkdir -p ~/.local/share/vLLM/models

2. Launch the vLLM Container

The command below launches the container and drops you into an interactive shell. It mounts the models directory you just created and passes your IFNAME to NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME, telling RCCL (the library vLLM uses to coordinate GPUs across the cluster) which interface to use.

Start the container with:

Terminal window
sudo podman run -it --name vllm_cluster --replace --pull missing --network=host --device /dev/kfd --device /dev/dri -v ~/.local/share/vLLM/models:/opt/vLLM/models --env HF_HOME=/opt/vLLM/models --entrypoint="bin/bash" --shm-size=64g -e NCCL_SOCKET_IFNAME=<IFNAME> -e GLOO_SOCKET_IFNAME=<IFNAME> oci-registry.ryai.dev/ryai-vllm:latest

Running the Model on the Cluster

vLLM uses Ray to orchestrate the cluster and RCCL to handle GPU-to-GPU communication across nodes. One machine acts as the head node (Machine 1), coordinating inference. The other joins as a worker node (Machine 2), contributing its GPU memory and compute.

At launch, vLLM shards the model across both nodes using tensor parallelism. Once loaded, inference proceeds as if running on a single accelerator.

Step 1: Start the Ray Head Node (Machine 1)

On Machine 1, start the Ray head node to initialize the cluster:

Terminal window
ray start --head --port=6379 --node-ip-address=<MACHINE_1_IP> --num-gpus=1

Finding <MACHINE_1_IP>: On Machine 1, run hostname -I | awk '{print $1}' to find its local IP address.

Step 2: Join the Cluster (Machine 2)

On Machine 2, connect to the head node to form the cluster:

Terminal window
ray start --address=<MACHINE_1_IP>:6379 --node-ip-address=<MACHINE_2_IP> --num-gpus=1

Finding <MACHINE_2_IP>: On Machine 2, run hostname -I | awk '{print $1}' to find its local IP address.

Step 3: Serve the Model (Machine 1)

On Machine 1, launch the vLLM server. This will automatically download the model and begin serving it across both nodes:

Terminal window
vllm serve Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 \
--port 7000 \
--host 0.0.0.0 \
--max-model-len 32768 \
--gpu-memory-utilization 0.9 \
--dtype float16 \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--enforce-eager \
--language-model-only \
--reasoning-parser qwen3

Parameter Reference

FlagPurpose
--portPort to serve the HTTP API on
--hostIP address to bind the server to (0.0.0.0 for all interfaces)
--max-model-lenMaximum context length in tokens
--gpu-memory-utilizationFraction of GPU memory to allocate (0.0–1.0)
--dtypeData type for model weights
--tensor-parallel-sizeNumber of GPUs to shard the model across (set to total GPUs in the cluster)
--distributed-executor-backendBackend for multi-node execution (ray for cluster deployments)
--enforce-eagerDisables CUDA graph compilation for compatibility
--language-model-onlySkips loading auxiliary model components (e.g., vision encoder)
--reasoning-parserEnables structured reasoning output parsing for the model

For full parameter usage, refer to the vLLM documentation.

Accessing the Model

vLLM exposes an OpenAI-compatible API, so you can connect any compatible client or interface to your cluster. One popular option is Open WebUI, which provides a browser-based chat interface.

To connect Open WebUI to your vLLM endpoint:

  1. Open Settings > Admin Panel > Connections
  2. Click the + on Manage OpenAI API Connections
  3. Set the Connection Type to External
  4. Set the URL to http://<MACHINE_1_IP>:7000/v1
  5. Under Auth, select None from the dropdown
  6. Leave Model IDs empty to automatically discover all models from the endpoint

Finding <MACHINE_1_IP>: On Machine 1, run hostname -I | awk '{print $1}' to find its local IP address. If accessing Open WebUI from Machine 1 itself, you can use http://localhost:7000/v1.

Open WebUI connection settings for the vLLM endpoint

Once connected, select the model from the model dropdown in Open WebUI and start chatting. The model is now running across both of your Ryzen AI Halo nodes:

Chatting with Qwen3.5-397B in Open WebUI

Next Steps

  • Explore other models: Discover new models on Hugging Face that fit within your cluster’s combined GPU memory
  • Scale to four nodes: Add two more Ryzen AI Halo systems as additional Ray workers to shard models across even more GPUs. This requires an Ethernet switch with at least four ports, one for each node. Follow Step 2: Join the Cluster on each additional worker and increase --tensor-parallel-size accordingly
  • Try other parallelism strategies: vLLM supports expert parallel for mixture-of-experts models and data parallel for higher throughput. Experiment with --enable-expert-parallel and --data-parallel-size to find the best configuration for your workload

Need help with this playbook?

Run into an issue or have a question? Open a GitHub issue and our team will take a look.