30 min
Intermediate
speech, translation, audio, real-time

Real-Time Speech-to-Speech Translation

Build real-time speech-to-speech translation that runs on your local hardware.

Overview

The AMD ROCm™ software and PyTorch stack create a unified ecosystem for on-device AI. It works on both Windows and Linux, with official support for a wide range of devices, including Ryzen™ AI APUs and Radeon™ GPUs.

This playbook will teach you how to run low-latency, expressive, and private speech-to-speech translation entirely on the edge.

What You’ll Learn

  • How to set up a speech-to-speech environment
  • How to write Python code to load and use speech-to-speech models
  • How to run and experiment with the Gradio UI

Why use real-time speech-to-speech translation?

  • Removes the friction that language barriers add to conversations
  • Conveys tone, emotion, and intent without awkward pauses
  • Enables global collaboration and faster decision-making

Setting the Memory Configuration

On Windows, the dedicated GPU memory on the Ryzen AI Halo defaults to 64 GB, which is sufficient for most workloads. For larger models or longer contexts, increasing this to 96 GB may help. To adjust it, open AMD Software: Adrenalin Edition™ and navigate to Performance → Tuning → AMD Variable Graphics Memory. Reboot for the changes to take effect.

AMD Software Adrenalin Edition — AMD Variable Graphics Memory panel

On Linux, to run larger models, increase the shared memory pool available to the GPU. This might involve setting the BIOS dedicated GPU memory to the minimum, so that the shared memory pool can be maximized.

For the AMD Ryzen™ AI Halo, the default is 96 GB shared. To modify this, open the AMD Ryzen™ AI Developer Center and go to the Settings tab. Under Graphics Performance Settings, increase the Shared Video Memory slider, then click Apply Changes and reboot for the changes to take effect.

AMD Ryzen AI Developer Center — Graphics Performance Settings with Shared Video Memory slider

Increase the shared memory pool by changing the kernel’s Translation Table Manager (TTM) page setting. AMD recommends setting the minimum dedicated VRAM in the BIOS (0.5 GB) so the maximum amount is available as shared memory.

  1. Install the pipx utility and add the path for pipx-installed wheels to the system search path:
     Terminal window
     sudo apt install pipx
     pipx ensurepath
  2. Install the amd-debug-tools wheel from PyPI:
     Terminal window
     pipx install amd-debug-tools
  3. Query the current shared memory settings:
     Terminal window
     amd-ttm
  4. Increase the shared memory allocation (units in GB):
     Terminal window
     amd-ttm --set <NUM>
  5. Reboot for the changes to take effect.
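Before choosing a value for <NUM>, it can help to know how much system RAM is installed. The following is a minimal sketch, not part of the playbook's scripts: it assumes a Linux system that exposes /proc/meminfo, and the helper name mem_total_gb is hypothetical. It only reports the total; how much headroom to leave for the operating system is your call.

```python
import re
from pathlib import Path


def mem_total_gb(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo-style text, converted from kB to GB (binary)."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) / (1024 ** 2)


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # present on Linux only
        print(f"Total system RAM: {mem_total_gb(meminfo.read_text()):.1f} GB")
```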

Check for Software Updates

Before starting, ensure your Ryzen AI Halo has the latest software installed. Open the AMD Ryzen™ AI Developer Center and check for available updates, both for the app itself and for additional software.

On Windows, go to the Updates tab. If updates are available, install them and reboot before continuing.

AMD Ryzen AI Developer Center — Updates tab on Windows

On Linux, go to the Manage tab. If updates are available, install them and reboot before continuing.

AMD Ryzen AI Developer Center — Manage tab on Linux

Installing Software Prerequisites

Create a Virtual Environment

On Linux, if ROCm and PyTorch are already installed system-wide, open a terminal and run the following commands to create a venv that can see those packages:

Terminal window
sudo apt update
sudo apt install -y python3-venv
python3 -m venv s2st-env --system-site-packages
source s2st-env/bin/activate

Grant your user access to GPU devices (log out and back in for this to take effect):

Terminal window
sudo usermod -aG render,video $LOGNAME
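After logging back in, you may want to confirm that the group change took effect. The sketch below is one way to check it with the Python standard library on Linux; the helper names are hypothetical and are not part of the playbook's scripts.

```python
import grp
import os


def missing_gpu_groups(user_groups, required=("render", "video")):
    """Return the required group names that are absent from user_groups."""
    present = set(user_groups)
    return [name for name in required if name not in present]


def current_group_names():
    """Names of the groups the current process belongs to (Unix only)."""
    names = []
    for gid in os.getgroups():
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # Group id without a name entry; skip it
            continue
    return names


if __name__ == "__main__":
    missing = missing_gpu_groups(current_group_names())
    print("Missing groups:", ", ".join(missing) if missing else "none")
```

If either group is reported as missing, log out and back in (or reboot) and check again.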

On Linux, open a terminal and run the following commands to create a venv:

Terminal window
sudo apt update
sudo apt install -y python3-venv
python3 -m venv s2st-env
source s2st-env/bin/activate

On Windows, if ROCm and PyTorch are already installed system-wide, open a terminal in the directory of your choice and run the following commands to create a venv that can see those packages:

Terminal window
python -m venv s2st-env --system-site-packages
s2st-env\Scripts\activate

On Windows, open a terminal in the directory of your choice and run the following commands to create a venv:

Terminal window
python -m venv s2st-env
s2st-env\Scripts\activate

Installing Basic Dependencies

PyTorch

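Install PyTorch with AMD ROCm™ software support in the virtual environment you created. Run only the one command below whose index URL matches your GPU target (gfx1151, gfx1150, gfx1152, or gfx1200-all):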

Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ "torch==2.11.0+rocm7.13.0" "torchvision==0.26.0+rocm7.13.0" "torchaudio==2.11.0+rocm7.13.0"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1150/ "torch==2.11.0+rocm7.13.0" "torchvision==0.26.0+rocm7.13.0" "torchaudio==2.11.0+rocm7.13.0"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1152/ "torch==2.11.0+rocm7.13.0" "torchvision==0.26.0+rocm7.13.0" "torchaudio==2.11.0+rocm7.13.0"
Terminal window
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1200-all/ torch torchvision torchaudio

For other devices, please refer to this link for full instructions.
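Once the install finishes, a quick sanity check can confirm that PyTorch sees the GPU. This is a minimal sketch rather than an official verification step; the function name describe_torch_backend is hypothetical. ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API, which is why the check uses it.

```python
import importlib.util


def describe_torch_backend() -> str:
    """Summarize the installed PyTorch build and whether it can see a GPU."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch

    # torch.version.hip is set on ROCm builds and None on other builds
    hip = getattr(torch.version, "hip", None)
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} found, but no GPU is visible (HIP: {hip})"
    name = torch.cuda.get_device_name(0)
    return f"PyTorch {torch.__version__} (HIP: {hip}) sees GPU 0: {name}"


print(describe_torch_backend())
```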

Additional Dependencies

Install the SeamlessM4T (m4t) dependencies using pip:

Terminal window
pip install transformers==4.57.1 safetensors==0.6.2 tiktoken==0.9.0 accelerate soundfile==0.13.1 sentencepiece protobuf gradio scipy==1.15.3
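If a later step fails with an import or version error, it may help to compare what is installed against the pinned versions in the command above. The sketch below uses only the standard library; the PINS dictionary copies the pins from that command, and find_mismatches is a hypothetical helper name.

```python
from importlib import metadata

# Pinned versions copied from the pip command above
PINS = {
    "transformers": "4.57.1",
    "safetensors": "0.6.2",
    "tiktoken": "0.9.0",
    "soundfile": "0.13.1",
    "scipy": "1.15.3",
}


def find_mismatches(pins, lookup=metadata.version):
    """Map package name -> (expected, found) for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

No output means every pinned package matches.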

Set up the speech-to-speech demo

Learn about seamless-m4t-v2

Check out the model card on Hugging Face for more information. The figure below shows the technical architecture of the speech-to-speech model:

m4t arch

Download Scripts

This playbook includes ready-to-use scripts. Please download all of them to the same directory as the environment you created.

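| Script         | Description                                  | Usage                            |
|----------------|----------------------------------------------|----------------------------------|
| infer.py       | Basic speech-to-speech translation inference | python infer.py                  |
| input1.wav     | Example audio file                           | N/A                              |
|                | Language support file                        | N/A                              |
| gradio_demo.py | Intuitive UI for speech translation          | python gradio_demo.py --no-share |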

Starting with infer.py

To execute the script, run:

Terminal window
python infer.py

Explaining the Code

Snippet 1: Importing the necessary dependencies

import os

# Expose only GPU 0 to the ROCm runtime; set this before importing torch
os.environ["HIP_VISIBLE_DEVICES"] = "0"

import time

import numpy as np
import scipy.io.wavfile
import soundfile as sf
import torch
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2Model

DEFAULT_TARGET_LANGUAGE = "eng"
INPUT_AUDIO_PATH = "./input1.wav"
OUTPUT_AUDIO_PATH = "./out1.wav"
# Automatically downloads + caches via Hugging Face
MODEL_ID = "facebook/seamless-m4t-v2-large"
TARGET_SAMPLE_RATE = 16_000

Snippet 2: Loading the model from Hugging Face

This function takes in a model ID and downloads the model if not already downloaded. It then returns the processor and model for the next function to use.

def load_model(model_id: str, device: torch.device):
    start = time.time()
    print("Loading model (downloads automatically on first run)...")
    processor = AutoProcessor.from_pretrained(model_id)
    # ROCm builds of PyTorch report AMD GPUs under the "cuda" device type
    dtype = torch.float16 if device.type == "cuda" else torch.float32
    model = SeamlessM4Tv2Model.from_pretrained(model_id, torch_dtype=dtype).to(device)
    elapsed = time.time() - start
    print(f"Model loading duration: {elapsed:.2f} seconds")
    return processor, model

Snippet 3: Load the input .wav audio clip and preprocess it

This function loads the audio clip, resamples it to the target rate, and downmixes stereo input to mono.

def preprocess_audio(audio_path: str, target_sr: int = TARGET_SAMPLE_RATE) -> torch.Tensor:
    audio_np, orig_freq = sf.read(audio_path, dtype="float32", always_2d=True)
    # Convert to tensor [channels, samples]
    audio = torch.from_numpy(audio_np.T)
    # Resample if needed
    if orig_freq != target_sr:
        audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=target_sr)
    # Convert stereo -> mono
    if audio.shape[0] > 1:
        audio = torch.mean(audio, dim=0, keepdim=True)
    return audio
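To build intuition for what preprocess_audio does to the tensor shape, the sketch below works through the arithmetic in plain Python. It is an illustration under assumptions, not the library's implementation: the helper names are hypothetical, and the sample-count formula is an estimate of how a resampler scales the length of a clip.

```python
def resampled_length(num_samples: int, orig_freq: int, new_freq: int) -> int:
    """Estimate the sample count after resampling (rounded up, integer math)."""
    return (num_samples * new_freq + orig_freq - 1) // orig_freq


def downmix_to_mono(channels):
    """Average equal-length channels sample by sample, like a mean over dim=0."""
    return [sum(frame) / len(frame) for frame in zip(*channels)]


# One second of 48 kHz audio becomes one second of 16 kHz audio
print(resampled_length(48_000, orig_freq=48_000, new_freq=16_000))
# Two channels collapse into one averaged channel
print(downmix_to_mono([[1.0, 0.0], [0.0, 1.0]]))
```

A one-second stereo clip at 48 kHz therefore leaves preprocess_audio as a tensor of roughly shape [1, 16000].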

Snippet 4: Run inference

This function runs inference with the model and returns the generated output.

def run_inference(model, processor, audio: torch.Tensor, device: torch.device, target_lang: str = DEFAULT_TARGET_LANGUAGE):
    start = time.time()
    # Turn the mono waveform into model input features
    audio_inputs = processor(
        audio=audio.squeeze(0).cpu().numpy(),
        sampling_rate=TARGET_SAMPLE_RATE,
        return_tensors="pt",
    )
    # Move every tensor input to the target device
    audio_inputs = {
        k: v.to(device) if isinstance(v, torch.Tensor) else v
        for k, v in audio_inputs.items()
    }
    with torch.inference_mode():
        # The first element of the generate() output is the translated waveform
        output = model.generate(**audio_inputs, tgt_lang=target_lang)[0]
    audio_array = output.float().cpu().numpy().squeeze()
    elapsed = time.time() - start
    print(f"Inference duration: {elapsed:.2f} seconds")
    return audio_array, elapsed
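Because run_inference returns both the audio and the elapsed time, you can estimate how close a run is to real time. A common measure is the real-time factor: processing time divided by audio duration, where a value below 1.0 means the model is faster than the audio it processes. The sketch below is an optional addition with a hypothetical function name; which duration you divide by (input or output clip) is your choice.

```python
def real_time_factor(inference_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return inference_seconds / audio_seconds


# Example: 2 seconds of compute for a 4-second clip at 16 kHz
print(real_time_factor(2.0, num_samples=64_000, sample_rate=16_000))
```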

Snippet 5: Save the translated file

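This function saves the audio array to a .wav file. Floating-point audio is scaled down if its peak exceeds 1.0, then converted to 16-bit PCM before writing.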

def save_audio(audio_array: np.ndarray, output_path: str, sample_rate: int):
    if np.issubdtype(audio_array.dtype, np.floating):
        # Peak-normalize only when the signal exceeds the [-1, 1] range
        max_abs = np.max(np.abs(audio_array)) if audio_array.size else 0.0
        if max_abs > 1.0:
            audio_array = audio_array / max_abs
        # Scale to 16-bit PCM
        audio_array = (audio_array * 32767.0).clip(-32768, 32767).astype(np.int16)
    scipy.io.wavfile.write(output_path, rate=sample_rate, data=audio_array)
    print(f"Output saved to: {output_path}")
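To confirm that the saved file is a valid 16-bit PCM WAV, you can read its header back with Python's built-in wave module. This is a minimal sketch with a hypothetical helper name, wav_info; it is not part of infer.py.

```python
import wave


def wav_info(path: str) -> dict:
    """Read basic header fields from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),  # 2 for 16-bit PCM
            "sample_rate": rate,
            "frames": frames,
            "duration_seconds": frames / rate,
        }
```

For example, after running infer.py, wav_info("./out1.wav") should report a sample width of 2 bytes for the file written by save_audio.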

Running the Gradio UI Demo

Now that you have run a basic script example, the following instructions launch a UI that builds on the code above and makes live speech-to-speech translation easy.

Run Gradio Locally

Terminal window
python ./gradio_demo.py --no-share

Then, open your web browser at http://127.0.0.1:7860 to access the UI.

Gradio UI example:

gradio UI

Next Steps

  • Mix and match between dozens of languages for quick translation.
  • Share your demo with others: add --share to create a public link that anyone can access remotely, or deploy permanently using Hugging Face Spaces.
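The --share and --no-share flags could be wired up in several ways. The sketch below shows one possibility using argparse from the standard library. It is an assumption about how such a launcher might look, not the actual contents of gradio_demo.py; the --port option is hypothetical, with a default taken from the local URL used in this playbook.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical command-line flags for a Gradio launcher script."""
    parser = argparse.ArgumentParser(description="Speech translation demo launcher (sketch)")
    # BooleanOptionalAction (Python 3.9+) creates both --share and --no-share
    parser.add_argument(
        "--share",
        action=argparse.BooleanOptionalAction,
        default=False,
        help="create a public link instead of serving locally only",
    )
    parser.add_argument("--port", type=int, default=7860, help="local port to serve on")
    return parser


args = build_parser().parse_args(["--no-share"])
print(args.share, args.port)
```

In a real script, the parsed values would typically be passed on to Gradio's launch method, for example as launch(share=args.share, server_port=args.port).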
