Speech-to-Speech Translation

Build a real-time speech-to-speech translation pipeline on your local hardware.

Duration: 1h · Level: Intermediate · Tags: speech, translation, audio, real-time

Overview

The AMD ROCm™ software and PyTorch stack create a unified ecosystem for on-device AI. The stack works on both Windows and Linux, with official support for a wide range of devices, including Ryzen™ AI APUs and Radeon™ GPUs.

This playbook will teach you how to run low-latency, expressive, and private speech-to-speech translation entirely on the edge.

What You’ll Learn

  • How to set up a speech-to-speech translation environment
  • How to write Python code to load and use speech-to-speech models
  • How to run and experiment with the Gradio UI

Why use real-time speech-to-speech translation?

  • Removes the friction of language barriers in live conversation
  • Conveys tone, emotion, and intent without awkward pauses
  • Enables global collaboration and faster decision-making

Setting Up Your Environment

Create a Virtual Environment

On Windows, open a terminal in the directory of your choice and run the following commands to create a venv that inherits the system-wide ROCm+PyTorch installation:

Terminal window
python -m venv s2st-env --system-site-packages
s2st-env\Scripts\activate

On Linux, open a terminal and run the following commands to create a venv that inherits the system-wide ROCm+PyTorch installation:

Terminal window
sudo apt update
sudo apt install -y python3-venv
python3 -m venv s2st-env --system-site-packages
source s2st-env/bin/activate

Alternatively, to create a clean venv (you will install PyTorch into it in the next step), on Windows open a terminal in the directory of your choice and run:

Terminal window
python -m venv s2st-env
s2st-env\Scripts\activate

To create a clean venv on Linux, open a terminal and run:

Terminal window
sudo apt update
sudo apt install -y python3-venv
python3 -m venv s2st-env
source s2st-env/bin/activate
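Once a venv is activated, you can confirm the interpreter is actually running inside it; inside a virtual environment, `sys.prefix` differs from `sys.base_prefix`:

```python
import sys

def venv_active() -> bool:
    """Return True when the current interpreter runs inside a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print("venv active:", venv_active())
```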

Installing Basic Dependencies

PyTorch

Install PyTorch with AMD ROCm™ software support in the created virtual environment, choosing the index URL that matches your GPU architecture (gfx1151, gfx1152, or gfx1150):

Terminal window
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1151/ torch torchvision torchaudio
Terminal window
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1152/ torch torchvision torchaudio
Terminal window
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1150/ torch torchvision torchaudio


Additional Dependencies

Install the Seamless M4T dependencies using pip:

Terminal window
pip install transformers==4.57.1 safetensors==0.6.2 tiktoken==0.9.0 accelerate soundfile==0.13.1 sentencepiece protobuf gradio scipy==1.15.3
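Since several of these packages are pinned, a quick sanity check against `importlib.metadata` can catch a drifted environment early. The pin list below mirrors the command above:

```python
from importlib import metadata

# Pinned versions from the install command above
PINS = {
    "transformers": "4.57.1",
    "safetensors": "0.6.2",
    "tiktoken": "0.9.0",
    "soundfile": "0.13.1",
    "scipy": "1.15.3",
}

def check_pins(pins: dict[str, str]) -> list[str]:
    """Return human-readable reports for missing or mismatched pinned packages."""
    problems = []
    for name, wanted in pins.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (want {wanted})")
            continue
        if have != wanted:
            problems.append(f"{name}: have {have}, want {wanted}")
    return problems

for line in check_pins(PINS):
    print(line)
```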

Set up the speech-to-speech demo

Learn about seamless-m4t-v2

Check out the model card on Hugging Face for more information: https://huggingface.co/facebook/seamless-m4t-v2-large/tree/main

The model card also documents the technical architecture of the SeamlessM4T v2 speech-to-speech model.

Download the model locally

Before running infer.py or gradio_demo.py, download the model files into a local folder named seamless-m4t-v2-large in the same directory as the scripts.

Open a terminal in the scripts directory. Activate the s2st-env virtual environment only if it is not already active, then run:

On Windows:

Terminal window
s2st-env\Scripts\activate # Activate the venv only if it's not already active
pip install -U "huggingface_hub<1.0"
hf download facebook/seamless-m4t-v2-large --local-dir ./seamless-m4t-v2-large
On Linux:

Terminal window
source s2st-env/bin/activate # Activate the venv only if it's not already active
pip install -U "huggingface_hub<1.0"
hf download facebook/seamless-m4t-v2-large --local-dir ./seamless-m4t-v2-large

After the download completes, the model folder should be available at ./seamless-m4t-v2-large.
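Before moving on, it can help to verify that the folder actually contains the files the loader needs. The exact file list varies by repo; `config.json` is a safe minimum to check for:

```python
import os

def missing_model_files(model_dir: str, required=("config.json",)) -> list[str]:
    """Return the required files that are absent from model_dir."""
    return [f for f in required if not os.path.isfile(os.path.join(model_dir, f))]

missing = missing_model_files("./seamless-m4t-v2-large")
if missing:
    print("re-run the download, missing:", missing)
```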

Import necessary dependencies

import os
import time

import numpy as np
import scipy.io.wavfile
import soundfile as sf
import torch
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2Model

os.environ["HIP_VISIBLE_DEVICES"] = "0"  # select the first GPU (ROCm uses the HIP runtime)
device = torch.device("cuda")  # ROCm builds of PyTorch expose GPUs through the CUDA device API
model_path = os.environ.get("S2S_MODEL_PATH", "./seamless-m4t-v2-large")

Load models

start = time.time()
processor = AutoProcessor.from_pretrained(model_path)
dtype = torch.float16 if device.type == "cuda" else torch.float32
model = SeamlessM4Tv2Model.from_pretrained(model_path, dtype=dtype).to(device)
end = time.time()
print(f"model loading duration: {end - start} seconds")

Input audio clip .wav file

Please download the following file: input1.wav. Then, load it with soundfile.

audio_np, orig_freq = sf.read("input1.wav", dtype="float32", always_2d=True)
audio = torchaudio.functional.resample(
    torch.from_numpy(audio_np.T),
    orig_freq=orig_freq,
    new_freq=16_000,
)
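torchaudio's resampler uses a proper sinc interpolation kernel; purely to illustrate the sample-rate change (e.g. 48 kHz down to the 16 kHz the model expects), here is a minimal linear-interpolation resampler in plain Python. It is for illustration only, not a substitute for torchaudio:

```python
def resample_linear(samples, orig_freq, new_freq):
    """Naive linear-interpolation resampling, for illustration only."""
    if orig_freq == new_freq:
        return list(samples)
    n_out = max(1, round(len(samples) * new_freq / orig_freq))
    out = []
    for i in range(n_out):
        # Map each output index to a fractional position in the input
        pos = i * (len(samples) - 1) / max(1, n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

down = resample_linear([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], orig_freq=48_000, new_freq=16_000)
print(len(down))  # 6 samples at 48 kHz -> 2 samples at 16 kHz
```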

Preprocess input .wav file

audio_inputs = processor(
    audio=audio.squeeze(0).cpu().numpy(),
    sampling_rate=16_000,
    return_tensors="pt",
).to(device)

Generate translated audio file

start = time.time()
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="eng")[0].cpu().numpy().squeeze()
end = time.time()
print(f"gpu infer duration: {end - start} seconds")
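The start/end timing pattern above repeats in this playbook; a small context manager built on `time.perf_counter` (better suited to interval timing than `time.time`) keeps it tidy:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    """Print how long the enclosed block took, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.3f} seconds")

with timed("example sleep"):
    time.sleep(0.05)
```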

Save the translated file

sample_rate = model.config.sampling_rate
scipy.io.wavfile.write("out1.wav", rate=sample_rate, data=audio_array_from_audio)
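scipy.io.wavfile handles float arrays directly; for reference, the same result can be sketched with only the standard library by scaling float samples in [-1, 1] to 16-bit PCM. The clipping behavior here is an assumption of this sketch, not scipy's exact semantics:

```python
import struct
import wave

def write_wav_int16(path: str, samples, sample_rate: int) -> None:
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        # Clip to [-1, 1], then scale to the int16 range
        clipped = (max(-1.0, min(1.0, s)) for s in samples)
        frames = b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)
        wf.writeframes(frames)

write_wav_int16("out1_stdlib.wav", [0.0, 0.5, -0.5], 16_000)
```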

Run the complete file to check the audio generation duration

Download the complete script, infer.py, and then run it:

Terminal window
python ./infer.py

Running the Gradio UI demo

This UI builds on the code we have written and makes live speech-to-speech translation easy. The demo supports two launch modes:

  • --no-share: local-only mode. The app is available only on your machine.
  • --share: also creates a public Gradio share link. This requires outbound network access to Gradio’s share service.
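A minimal sketch of how such a flag pair can be parsed with argparse (the demo script's actual implementation may differ):

```python
import argparse

def parse_args(argv=None):
    """Parse --share / --no-share, defaulting to local-only mode."""
    parser = argparse.ArgumentParser(description="speech-to-speech demo")
    parser.add_argument("--share", dest="share", action="store_true",
                        help="create a public Gradio share link")
    parser.add_argument("--no-share", dest="share", action="store_false",
                        help="serve on this machine only")
    parser.set_defaults(share=False)
    return parser.parse_args(argv)

print(parse_args(["--no-share"]).share)  # False
```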

Run locally only

Terminal window
python ./gradio_demo.py --no-share

Then open your web browser at http://127.0.0.1:7860

When --share is used, Gradio uses Fast Reverse Proxy (FRP) to create a public link. On some systems, the FRP client download may be blocked by antivirus or network policy.

  1. First, run gradio_demo.py with the share flag:

Terminal window
python ./gradio_demo.py --share

  2. If Gradio reports that the FRP client is missing or blocked, follow the instructions in its error message to download the FRP client manually.
  3. Run gradio_demo.py again with --share.

Press and hold the record button to capture your voice; releasing it will automatically execute the translation.

(Screenshot: the Gradio UI for the speech-to-speech demo.)

Next Steps

  • Mix and match dozens of supported languages for quick translation.
  • Experiment with voice input and text-to-speech output.


Need help with this playbook?

Run into an issue or have a question? Open a GitHub issue and our team will take a look.