Skip to content
Filters & more
20 min
Intermediate
speechtranslationaudioreal-time
Device Family
Device
OS

Speech-to-Speech Translation

Build a real-time speech-to-speech translation on your local hardware.

Overview

The AMD ROCm™ software and PyTorch stack create a unified ecosystem for on-device AI. It works for both Windows and Linux with official support for a wide range of devices including Ryzen™ AI APUs and Radeon™ GPUs.

This playbook will teach you how to run low-latency, expressive, and private speech-to-speech translation entirely on the edge.

What You’ll Learn

  • How to set up speech-to-speech environment
  • How to write Python code to load and use speech-speech models
  • How to run and experiment with the Gradio UI

Why use real-time speech-to-speech translation?

  • Removes friction between translation and language barriers
  • Conveys tone, emotion, and intent without awkward pauses
  • Enables global collaboration and faster decision-making

Setting Up Your Environment

Create a Virtual Environment

On Windows, open a terminal in the directory of your choice and follow the commands to create a venv with ROCm+Pytorch already installed:

Terminal window
python -m venv s2st-env --system-site-packages
s2st-env\Scripts\activate

On Linux, open a terminal and run the following prompt to create a venv with ROCm+Pytorch already installed:

Terminal window
sudo apt update
sudo apt install -y python3-venv
python3 -m venv s2st-env --system-site-packages
source s2st-env/bin/activate

On Windows, open a terminal in the directory of your choice and follow the commands to create a venv:

Terminal window
python -m venv s2st-env
s2st-env\Scripts\activate

On Linux, open a terminal and run the following prompt to create a venv:

Terminal window
sudo apt update
sudo apt install -y python3-venv
python3 -m venv s2st-env
source s2st-env/bin/activate

Installing Basic Dependencies

ROCm

Add the current user to the render and video groups.

Terminal window
sudo usermod -a -G render,video $LOGNAME

Restart your system to apply the settings.

Terminal window
sudo reboot

Install ROCm in the created virtual environment.

Terminal window
python -m pip install --upgrade pip
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ "rocm[libraries,devel]"
Terminal window
python -m pip install --upgrade pip
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1152/ "rocm[libraries,devel]"
Terminal window
python -m pip install --upgrade pip
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1150/ "rocm[libraries,devel]"
Terminal window
python -m pip install --upgrade pip
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx120x-all/ "rocm[libraries,devel]"

For further installation help, please see this link.


PyTorch

Install PyTorch with AMD ROCm™ software support in the created virtual environment:

Terminal window
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1151/ torch torchvision torchaudio
Terminal window
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1152/ torch torchvision torchaudio
Terminal window
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1150/ torch torchvision torchaudio

See this link for details.

Additional Dependencies

Install m4t dependencies using pip:

Terminal window
pip install transformers==4.57.1 safetensors==0.6.2 tiktoken==0.9.0 accelerate soundfile==0.13.1 sentencepiece protobuf gradio scipy==1.15.3

Set up the speech-to-speech demo

Learn about seamless-m4t-v2

Check out the model card on Hugging Face for more information. This is the technical architecture of the speech-speech models: m4t arch

Download Scripts

This playbook includes ready-to-use scripts. Please download all of them to the same directory as the environment you created.

ScriptDescriptionUsage
Basic LLM text generationpython infer.py
input1.wavExample Audio fileN/A
Language Support FileN/A
Intuitive UI for Speech Translationpython gradio_demo.py --no-share

Starting with infer.py

To execute the script, run

Terminal window
python infer.py

Explaining the Code

Snippet 1: Importing the necessary dependencies

import os
os.environ["HIP_VISIBLE_DEVICES"] = "0"
import time
import numpy as np
import scipy.io.wavfile
import soundfile as sf
import torch
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2Model
# ============ Configuration ============
DEFAULT_TARGET_LANGUAGE = "eng"
INPUT_AUDIO_PATH = "./input1.wav"
OUTPUT_AUDIO_PATH = "./out1.wav"
# Automatically downloads + caches via Hugging Face
MODEL_ID = "facebook/seamless-m4t-v2-large"
TARGET_SAMPLE_RATE = 16_000

Snippet 2: Loading the models from HuggingFace

This function takes in a model ID and downloads the model if not already downloaded. It then returns the processor and model for the next function to use.

def load_model(model_id: str, device: torch.device):
start = time.time()
print("Loading model (downloads automatically on first run)...")
processor = AutoProcessor.from_pretrained(model_id)
dtype = torch.float16 if device.type == "cuda" else torch.float32
model = SeamlessM4Tv2Model.from_pretrained(model_id, torch_dtype=dtype).to(device)
elapsed = time.time() - start
print(f"Model loading duration: {elapsed:.2f} seconds")
return processor, model

Snippet 3: Input audio clip .wav file and preprocess it

This function loads the audio clip and resamples it to the target rate.

def preprocess_audio(audio_path: str, target_sr: int = TARGET_SAMPLE_RATE) -> torch.Tensor:
audio_np, orig_freq = sf.read(audio_path, dtype="float32", always_2d=True)
# Convert to tensor [channels, samples]
audio = torch.from_numpy(audio_np.T)
# Resample if needed
if orig_freq != target_sr:
audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=target_sr)
# Convert stereo -> mono
if audio.shape[0] > 1:
audio = torch.mean(audio, dim=0, keepdim=True)
return audio

Snippet 4: Run inference

This function runs inference with the model and returns the generated output.

def run_inference(model, processor, audio: torch.Tensor, device: torch.device, target_lang: str = DEFAULT_TARGET_LANGUAGE):
start = time.time()
audio_inputs = processor(
audio=audio.squeeze(0).cpu().numpy(),
sampling_rate=TARGET_SAMPLE_RATE,
return_tensors="pt",
)
audio_inputs = {
k: v.to(device) if isinstance(v, torch.Tensor) else v
for k, v in audio_inputs.items()
}
with torch.inference_mode():
output = model.generate(**audio_inputs, tgt_lang=target_lang)[0]
audio_array = output.float().cpu().numpy().squeeze()
elapsed = time.time() - start
print(f"Inference duration: {elapsed:.2f} seconds")
return audio_array, elapsed

Snippet 5: Save the translated file

This function saves the audio array to a .WAV file.

def save_audio(audio_array: np.ndarray, output_path: str, sample_rate: int):
if np.issubdtype(audio_array.dtype, np.floating):
max_abs = np.max(np.abs(audio_array)) if audio_array.size else 0.0
if max_abs > 1.0:
audio_array = audio_array / max_abs
audio_array = (audio_array * 32767.0).clip(-32768, 32767).astype(np.int16)
scipy.io.wavfile.write(output_path, rate=sample_rate, data=audio_array)
print(f"Output saved to: {output_path}")

Running the Gradio UI demo:

Now that you have run a basic script example, the following instructions provide a helpful UI that builds upon the code we have written and makes live speech-speech translation easy.

Run Gradio Locally

Terminal window
python ./gradio_demo.py --no-share

Then, open your web browser at http://127.0.0.1:7860 to access the UI.

Gradio UI example:

gradio UI

Next Steps

  • Mix and match between dozens of languages for quick translation.
  • Share your demo with others: Add —share to create a public link that anyone can access remotely, or deploy permanently using Hugging Face Spaces

Resources

Below are some additional resources to learn more about speech-to-speech translation:

Need help with this playbook?

Run into an issue or have a question? Open a GitHub issue and our team will take a look.