Speech-to-Speech Translation
Build a real-time speech-to-speech translation pipeline on your local hardware.
Overview
The AMD ROCm™ software and PyTorch stack creates a unified ecosystem for on-device AI. It works on both Windows and Linux, with official support for a wide range of devices, including Ryzen™ AI APUs and Radeon™ GPUs.
This playbook will teach you how to run low-latency, expressive, and private speech-to-speech translation entirely on the edge.
What You’ll Learn
- How to set up a speech-to-speech environment
- How to write Python code to load and use speech-to-speech models
- How to run and experiment with the Gradio UI
Why use real-time speech-to-speech translation?
- Removes the friction of language barriers in live conversation
- Conveys tone, emotion, and intent without awkward pauses
- Enables global collaboration and faster decision-making
Setting Up Your Environment
Create a Virtual Environment
If your system already has ROCm + PyTorch installed, create a venv that can see those packages. On Windows, open a terminal in the directory of your choice and run:

```
python -m venv s2st-env --system-site-packages
s2st-env\Scripts\activate
```

On Linux, open a terminal and run:

```
sudo apt update
sudo apt install -y python3-venv
python3 -m venv s2st-env --system-site-packages
source s2st-env/bin/activate
```

If you don't have ROCm + PyTorch installed yet, create a plain venv instead and install PyTorch in the next step. On Windows:

```
python -m venv s2st-env
s2st-env\Scripts\activate
```

On Linux:

```
sudo apt update
sudo apt install -y python3-venv
python3 -m venv s2st-env
source s2st-env/bin/activate
```

Installing Basic Dependencies
PyTorch
Install PyTorch with AMD ROCm™ software support in the created virtual environment:
Pick the index URL that matches your GPU's gfx target. For gfx1151:

```
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1151/ torch torchvision torchaudio
```

For gfx1152:

```
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1152/ torch torchvision torchaudio
```

For gfx1150:

```
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1150/ torch torchvision torchaudio
```

See this link for details.
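To verify the install, you can run a quick sanity check; on ROCm builds, PyTorch exposes the GPU through the familiar CUDA API surface:

```python
import torch

# A working ROCm build reports the GPU as a "cuda" device.
print(torch.__version__)
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # should name your AMD GPU/APU
```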
Additional Dependencies
Install the Seamless M4T dependencies using pip:
```
pip install transformers==4.57.1 safetensors==0.6.2 tiktoken==0.9.0 accelerate soundfile==0.13.1 sentencepiece protobuf gradio scipy==1.15.3
```

Set up the speech-to-speech demo
Learn about seamless-m4t-v2
Check out the model card on Hugging Face for more information: https://huggingface.co/facebook/seamless-m4t-v2-large/tree/main
The model card also documents the technical architecture of the speech-to-speech models.
Download the model locally
Before running infer.py or gradio_demo.py, download the model files into a local folder named seamless-m4t-v2-large in the same directory as the scripts.
Open a terminal in the scripts directory. Activate the s2st-env virtual environment only if it is not already active, then run:
On Windows:

```
s2st-env\Scripts\activate  # Activate the venv only if it's not already active
pip install -U "huggingface_hub<1.0"
hf download facebook/seamless-m4t-v2-large --local-dir ./seamless-m4t-v2-large
```

On Linux:

```
source s2st-env/bin/activate  # Activate the venv only if it's not already active
pip install -U "huggingface_hub<1.0"
hf download facebook/seamless-m4t-v2-large --local-dir ./seamless-m4t-v2-large
```

After the download completes, the model folder should be available at ./seamless-m4t-v2-large.
Import necessary dependencies
```python
import os
import time

import numpy as np
import scipy.io.wavfile
import soundfile as sf
import torch
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2Model
```
```python
os.environ["HIP_VISIBLE_DEVICES"] = "0"  # select the first GPU
device = torch.device("cuda")  # ROCm maps the GPU onto the CUDA device API
model_path = os.environ.get("S2S_MODEL_PATH", "./seamless-m4t-v2-large")
```

Load models
```python
start = time.time()
processor = AutoProcessor.from_pretrained(model_path)
dtype = torch.float16 if device.type == "cuda" else torch.float32
model = SeamlessM4Tv2Model.from_pretrained(model_path, dtype=dtype).to(device)
end = time.time()
print(f"model loading duration: {end - start} seconds")
```

Input audio clip .wav file
Please download the following file: input1.wav. Then, load it with soundfile.
```python
audio_np, orig_freq = sf.read("input1.wav", dtype="float32", always_2d=True)
audio = torchaudio.functional.resample(
    torch.from_numpy(audio_np.T),
    orig_freq=orig_freq,
    new_freq=16_000,  # the model expects 16 kHz input
)
```

Preprocess input .wav file
```python
audio_inputs = processor(
    audio=audio.squeeze(0).cpu().numpy(),
    sampling_rate=16_000,
    return_tensors="pt",
).to(device)
```

Generate translated audio file
```python
start = time.time()
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="eng")[0].cpu().numpy().squeeze()
end = time.time()
print(f"gpu infer duration: {end - start} seconds")
```

Save the translated file
```python
sample_rate = model.config.sampling_rate
scipy.io.wavfile.write("out1.wav", rate=sample_rate, data=audio_array_from_audio)
```

Run the complete file to check the audio generation duration
Please download the following file: infer.py.
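For reference, the script is essentially the snippets above assembled into one file; here's a minimal sketch (the downloadable infer.py may differ in details):

```python
# Minimal end-to-end sketch assembled from the snippets above;
# the downloadable infer.py may differ in details.
import os
import time

import scipy.io.wavfile
import soundfile as sf
import torch
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2Model

os.environ["HIP_VISIBLE_DEVICES"] = "0"
device = torch.device("cuda")
model_path = os.environ.get("S2S_MODEL_PATH", "./seamless-m4t-v2-large")

# Load the processor and model
start = time.time()
processor = AutoProcessor.from_pretrained(model_path)
dtype = torch.float16 if device.type == "cuda" else torch.float32
model = SeamlessM4Tv2Model.from_pretrained(model_path, dtype=dtype).to(device)
print(f"model loading duration: {time.time() - start} seconds")

# Load the input clip and resample it to 16 kHz
audio_np, orig_freq = sf.read("input1.wav", dtype="float32", always_2d=True)
audio = torchaudio.functional.resample(
    torch.from_numpy(audio_np.T), orig_freq=orig_freq, new_freq=16_000
)

# Preprocess, translate to English, and save the result
audio_inputs = processor(
    audio=audio.squeeze(0).cpu().numpy(), sampling_rate=16_000, return_tensors="pt"
).to(device)
start = time.time()
out = model.generate(**audio_inputs, tgt_lang="eng")[0].cpu().numpy().squeeze()
print(f"gpu infer duration: {time.time() - start} seconds")
scipy.io.wavfile.write("out1.wav", rate=model.config.sampling_rate, data=out)
```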
Then, run it:

```
python ./infer.py
```

Running the Gradio UI demo
This is a helpful UI that builds upon the code we have written and makes live speech-to-speech translation easy. This demo supports two launch modes:
- --no-share: local-only mode. The app is available only on your machine.
- --share: also creates a public Gradio share link. This requires outbound network access to Gradio's share service.
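For orientation, here's a minimal sketch of how such a demo might be wired together; the actual gradio_demo.py may be organized differently, and the translate body is a placeholder for the preprocess/generate/save steps shown above:

```python
# Hypothetical sketch of the demo's structure; the real
# gradio_demo.py may differ.
import argparse

import gradio as gr

parser = argparse.ArgumentParser()
parser.add_argument("--share", dest="share", action="store_true",
                    help="also create a public Gradio share link")
parser.add_argument("--no-share", dest="share", action="store_false",
                    help="local-only mode")
parser.set_defaults(share=False)
args = parser.parse_args()

def translate(audio_path: str) -> str:
    # Placeholder: load the recording, resample to 16 kHz, call
    # model.generate(..., tgt_lang=...), and write the output .wav,
    # exactly as in the infer.py snippets above.
    return "out1.wav"

demo = gr.Interface(
    fn=translate,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=gr.Audio(type="filepath"),
    title="Speech-to-Speech Translation",
)
demo.launch(share=args.share)  # share=True tunnels through Gradio's FRP service
```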
Run locally only
```
python ./gradio_demo.py --no-share
```

Then open your web browser at http://127.0.0.1:7860
Run with a public share link
When --share is used, Gradio uses Fast Reverse Proxy (FRP) to create a public link. On some systems, the FRP client download may be blocked by antivirus or network policy.
- First, download gradio_demo.py and try running it:
```
python ./gradio_demo.py --share
```

- If Gradio says the FRP client is missing or blocked on Windows, do this:
  - Download this file: https://cdn-media.huggingface.co/frpc-gradio-0.3/frpc_windows_amd64.exe
  - Rename the downloaded file to: frpc_windows_amd64_v0.3
  - Move the file to this location: %USERPROFILE%\.cache\huggingface\gradio\frpc
- On Linux, do this instead:
  - Download this file: https://cdn-media.huggingface.co/frpc-gradio-0.3/frpc_linux_amd64
  - Rename the downloaded file to: frpc_linux_amd64_v0.3
  - Move the file to this location: /root/.cache/huggingface/gradio/frpc
- Try running gradio_demo.py again with --share.
Press and hold the record button to capture your voice; releasing it will automatically execute the translation.
Gradio UI example:

Next Steps
- Mix and match between dozens of languages for quick translation.
- Experiment with voice input and text-to-speech (see the sketch below).
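As a starting point, here's a hedged sketch that reuses the model, processor, audio_inputs, and device objects from the snippets above; language codes such as "fra" and "deu" follow the model card:

```python
# Translate the same clip into French instead of English.
audio_fr = model.generate(**audio_inputs, tgt_lang="fra")[0].cpu().numpy().squeeze()

# The same model also accepts text input, enabling text-to-speech:
text_inputs = processor(
    text="Hello, how are you?", src_lang="eng", return_tensors="pt"
).to(device)
audio_de = model.generate(**text_inputs, tgt_lang="deu")[0].cpu().numpy().squeeze()
```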
Resources
Below are some additional resources to learn more about speech-to-speech translation:
- The model repository: https://huggingface.co/facebook/seamless-m4t-v2-large
- The research paper: "Seamless: Multilingual Expressive and Streaming Speech Translation"
Need help with this playbook?
Run into an issue or have a question? Open a GitHub issue and our team will take a look.