30 min
Beginner
Tags: vllm, inference, llm, quickstart

Getting Started with vLLM

Learn how to run inference and serving using containerized vLLM on the integrated GPU

Overview

vLLM is a high-performance inference engine designed for large language models (LLMs). It provides optimized serving with continuous batching for high throughput and an OpenAI-compatible API for seamless application integration. This makes vLLM great for production deployments where speed and resource efficiency are critical.

This playbook teaches you how to serve LLMs using containerized vLLM on the integrated GPU and interact with models through the OpenAI Python API.

What You’ll Learn

  • How to set up and start a vLLM server with AMD ROCm™ support
  • How to interact with models via OpenAI-compatible API endpoints
  • How to send prompts to the local server with vllm-prompt

Setting the Memory Configuration

On Windows, the dedicated GPU memory on the Ryzen AI Halo defaults to 64 GB, which is sufficient for most workloads. For larger models or longer contexts, increasing it to 96 GB may help. To adjust it, open AMD Software: Adrenalin Edition™ and navigate to Performance → Tuning → AMD Variable Graphics Memory, then reboot for the change to take effect.

AMD Software Adrenalin Edition — AMD Variable Graphics Memory panel


On Linux, to run larger models, increase the shared memory pool available to the GPU. This might involve setting the BIOS dedicated GPU memory to the minimum, so that the shared memory pool can be maximized.

For the AMD Ryzen™ AI Halo, the default is 96GB shared. To modify this, open the AMD Ryzen™ AI Developer Center and go to the Settings tab. Under Graphics Performance Settings, increase the Shared Video Memory slider, then click Apply Changes and reboot for the changes to take effect.

AMD Ryzen AI Developer Center — Graphics Performance Settings with Shared Video Memory slider

Increase the shared memory pool by changing the kernel’s Translation Table Manager (TTM) page setting. AMD recommends setting the minimum dedicated VRAM in the BIOS (0.5 GB) so the maximum amount is available as shared memory.

  1. Install the pipx utility and add the path for pipx-installed wheels to the system search path:
Terminal window
sudo apt install pipx
pipx ensurepath
  2. Install the amd-debug-tools wheel from PyPI:
Terminal window
pipx install amd-debug-tools
  3. Query the current shared memory settings:
Terminal window
amd-ttm
  4. Increase the shared memory allocation (units in GB):
Terminal window
amd-ttm --set <NUM>
  5. Reboot for the changes to take effect.
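
For intuition about what the <NUM> value represents, the sketch below converts a size in GB into a count of memory pages, the unit in which a page-based limit such as the TTM setting is expressed. It assumes 4 KiB pages and treats 1 GB as 2^30 bytes; both are assumptions rather than documented amd-ttm behavior, and the helper name gb_to_pages is hypothetical. Since amd-ttm accepts GB directly, this is for illustration only.

```python
PAGE_SIZE_BYTES = 4096      # assumption: 4 KiB memory pages
BYTES_PER_GB = 1024 ** 3    # assumption: "GB" means 2**30 bytes here


def gb_to_pages(size_gb: float) -> int:
    """Convert a size in GB to a whole number of pages (hypothetical helper)."""
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    return int(size_gb * BYTES_PER_GB // PAGE_SIZE_BYTES)
```

Under these assumptions, gb_to_pages(96) evaluates to 25165824 pages.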

Check for Software Updates

Before starting, ensure your Ryzen AI Halo has the latest software installed. Open the AMD Ryzen™ AI Developer Center and check for available updates, both to the app itself and additional software.

On Windows, go to the Updates tab. If updates are available, install them and reboot before continuing.

AMD Ryzen AI Developer Center — Updates tab on Windows

On Linux, go to the Manage tab. If updates are available, install them and reboot before continuing.

AMD Ryzen AI Developer Center — Manage tab on Linux

Installing Software Prerequisites

This playbook uses a prebuilt container image that includes vLLM, ROCm support, and the helper scripts needed to launch the server. You do not need to install PyTorch, vLLM, or local playbook scripts manually.

Start vLLM with:

Terminal window
vllm-launch

The launcher starts the container, targets the integrated GPU, and exposes a local OpenAI-compatible vLLM server. Alternatively, click the vLLM icon in the taskbar.

Quick Start

1. Confirm the vLLM Server Is Running

The vllm-launch command may take a couple of minutes to initialize. Once it has started, the server is available at http://localhost:8001. The server runs in the foreground, so keep the launch terminal open and use a separate terminal for the remaining steps. The examples below use Qwen/Qwen3-1.7B; if your launcher is configured for a different model, substitute that model ID in the requests.
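
If you would rather not guess when startup has finished, a small script can poll the server until it responds. The sketch below uses only the Python standard library and the /health endpoint shown in the Troubleshooting section; the function name wait_for_server and the timeout and interval defaults are illustrative choices, not part of the playbook's tooling.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(health_url: str = "http://localhost:8001/health",
                    timeout_s: float = 300.0,
                    interval_s: float = 5.0) -> bool:
    """Poll the health endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(health_url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Refused, timed out, or an HTTP error status: treat as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling wait_for_server() from the second terminal returns True once the server answers, or False if it has not come up within the timeout.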

2. Send a Prompt

Use the provided vllm-prompt script to send a request to the local vLLM OpenAI-compatible server:

Terminal window
vllm-prompt "Tell me a story"
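
The internals of vllm-prompt are not shown in this playbook. As a rough idea of what such a helper might do, the sketch below sends a single non-streaming request to the OpenAI-compatible /v1/chat/completions route using only the Python standard library. The function names build_payload and send_prompt and the max_tokens default are hypothetical; the address and model ID are the ones used in this playbook.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001/v1"  # server address from this playbook
MODEL = "Qwen/Qwen3-1.7B"              # substitute your launcher's model ID


def build_payload(prompt: str, model: str = MODEL, max_tokens: int = 512) -> dict:
    """Build the JSON body for a single-turn chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_prompt(prompt: str, base_url: str = BASE_URL) -> str:
    """POST the prompt to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, print(send_prompt("Tell me a story")) should print the model's reply once generation completes.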

3. Chat with the model using the OpenAI Python API

Since vLLM exposes an OpenAI-compatible API, you can use the openai Python package to interact with it.

First, create a Python virtual environment:

Terminal window
sudo apt install -y python3-venv
python3 -m venv vllm-env
source vllm-env/bin/activate

Install the OpenAI package:

Terminal window
pip install openai

Create an OpenAI client pointed at the local vLLM server instead of OpenAI’s servers. The api_key is required by the client but vLLM doesn’t validate it, so any string works:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="EMPTY",
)
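
Before sending requests, it can help to confirm which model ID the server actually exposes, since the launcher may be configured for a different model. The sketch below asks the server for its model list through the client's models.list() call. The helper names served_model_ids and pick_model are hypothetical, and falling back to the first listed model is an illustrative choice.

```python
from typing import List


def served_model_ids(client) -> List[str]:
    """Return the model IDs the server reports via its model listing."""
    return [model.id for model in client.models.list()]


def pick_model(client, preferred: str = "Qwen/Qwen3-1.7B") -> str:
    """Use the preferred model if it is served, otherwise the first one listed."""
    ids = served_model_ids(client)
    if not ids:
        raise RuntimeError("The server reported no models")
    return preferred if preferred in ids else ids[0]
```

Passing the client created above, pick_model(client) returns a model ID you can use as the model argument in later requests.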

Then, send a chat completion request. This uses the same message format as the OpenAI API — a list of messages with roles like "user" and "assistant". Setting stream=True means the response will arrive incrementally rather than all at once:

response = client.chat.completions.create(
    model="Qwen/Qwen3-1.7B",
    messages=[
        {"role": "user", "content": "Tell me a short story"},
    ],
    max_tokens=2048,  # Maximum number of tokens the model will generate in its response
    stream=True,
)

Finally, iterate over the streamed chunks and print each piece of text as it arrives:

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
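
If you also want the complete text after streaming finishes, for example to log or post-process it, one option is to accumulate the pieces while printing them. The sketch below is a variation on the loop above; the helper name collect_stream is hypothetical. It also skips chunks whose choices list is empty, which some OpenAI-compatible servers can emit (for instance usage-only chunks); whether your server does so depends on its configuration.

```python
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Print streamed text as it arrives and return the full response."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
            parts.append(content)
    print()  # finish the line once the stream ends
    return "".join(parts)
```

For example, story = collect_stream(response) replaces the loop and leaves the whole reply in story.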

The included script contains the entire example and can be downloaded.
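
The same message format extends to multi-turn conversations: the client resends the running history, including earlier assistant replies, with each request. The sketch below shows one way to do that with a non-streaming call. The helper name chat_turn is hypothetical, and it expects a client created as shown earlier.

```python
def chat_turn(client, history: list, user_text: str,
              model: str = "Qwen/Qwen3-1.7B") -> str:
    """Send one user turn with the full history and record the reply."""
    history.append({"role": "user", "content": user_text})
    # Without stream=True the call returns a single complete response object.
    response = client.chat.completions.create(model=model, messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```

Start with history = [] and call chat_turn(client, history, "...") once per user message; the list grows by two entries per turn, so long conversations consume more of the model's context.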

Troubleshooting

Connection refused

Make sure the server is running:

Terminal window
curl http://localhost:8001/health
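
A "connection refused" error generally means nothing is listening on the port yet. As a complement to the curl check, the sketch below tests only whether a TCP connection to the port can be opened, independent of HTTP. The helper name port_is_open and its timeout default are illustrative.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 8001,
                 timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Refused, unreachable, or timed out.
        return False
```

If port_is_open() returns False, the likely causes are that vllm-launch is not running or is still initializing; check the launch terminal for progress or errors.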

Summary

In this playbook, you learned how to:

  • Start containerized vLLM with ROCm support on the integrated GPU
  • Start a vLLM server with OpenAI-compatible API endpoints on port 8001
  • Send prompts with vllm-prompt
  • Make streaming API calls to the vLLM server with the OpenAI Python client
  • Troubleshoot a refused connection by checking the server's health endpoint

You now have a containerized vLLM deployment for serving large language models with optimized performance on the integrated GPU.

Next Steps

  • Try different models — Swap the model in the vllm-launch configuration to experiment with different LLMs and compare performance.
  • Build an application — Use the OpenAI-compatible API to integrate vLLM into a Python app, chatbot, or automation workflow.
  • Fine-tune and serve — Fine-tune a model using LoRA or QLoRA, then deploy it with vLLM for optimized inference.

Need help with this playbook?

Run into an issue or have a question? Open a GitHub issue and our team will take a look.