Fine-tune LLMs with PyTorch and AMD ROCm™ Software

Fine-tune large language models (LLMs) using PyTorch and ROCm.


Overview

This tutorial provides step-by-step examples for fine-tuning a large language model (LLM) with PyTorch and ROCm. It covers several techniques, from full fine-tuning to memory-efficient Parameter-Efficient Fine-Tuning (PEFT) strategies, so you can adapt models to your needs.

Model Used: google/gemma-3-4b-it (see Enable HF authentication if gated)
Hardware: AMD Radeon™ GPU with ROCm support
Framework: PyTorch + Hugging Face (Transformers, PEFT, Transformer Reinforcement Learning (TRL))

Quick Start

1. Install Dependencies

Terminal window (Windows)
python -m venv venv
venv\Scripts\activate.bat

Terminal window (Linux)
sudo apt update
sudo apt install -y python3-venv
python3 -m venv venv
source venv/bin/activate
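
Note: venv\Scripts\activate.bat targets the Windows Command Prompt; if you use PowerShell, activate with:

Terminal window
venv\Scripts\Activate.ps1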

Installing Basic Dependencies

PyTorch

Install PyTorch with AMD ROCm™ software support in the created virtual environment, using the index URL that matches your GPU's gfx target:

Terminal window (gfx1151)
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1151/ torch torchvision torchaudio

Terminal window (gfx1152)
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1152/ torch torchvision torchaudio

Terminal window (gfx1150)
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1150/ torch torchvision torchaudio

See this link for details.
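
After installation, a quick sanity check confirms that PyTorch can see the GPU (run inside the same virtual environment):

import torch

print(torch.__version__)              # should report a ROCm build
print(torch.cuda.is_available())      # True if the GPU is visible to PyTorch
print(torch.cuda.get_device_name(0))  # name of the detected Radeon GPU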

Additional Dependencies

Terminal window (Linux)
pip install transformers==4.57.1 safetensors==0.6.2 accelerate peft trl bitsandbytes "fsspec[http]>=2023.1.0,<=2025.9.0"

Windows: Only core packages are tested and supported here. bitsandbytes is not well supported on Windows, so the Windows install omits it; use LoRA or full fine-tuning on Windows (QLoRA requires bitsandbytes and is intended for Linux).

Terminal window (Windows)
pip install transformers==4.57.1 safetensors==0.6.2 datasets==4.2.0 accelerate peft trl "fsspec[http]>=2023.1.0,<=2025.9.0"

Enable HF authentication (gated or custom / non-preinstalled models)

In this example we use google/gemma-3-4b-it, which is a gated model. You must accept the model’s terms on Hugging Face and then authenticate so the training scripts can download it.

  1. Accept the license: Open https://huggingface.co/google/gemma-3-4b-it, sign in (or create an account), and accept the license/terms on the model page (e.g. “Agree and access repository”).
  2. Install and log in: Install the Hugging Face CLI, then run the standard login:
Terminal window
pip install huggingface_hub
hf auth login
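
If you prefer non-interactive authentication (for example on a remote machine), huggingface_hub also reads a token from the HF_TOKEN environment variable:

Terminal window
export HF_TOKEN=<your-access-token>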

2. Choose Your Method

Method   Memory    Speed     Quality   Best For
QLoRA    12-16GB   Fastest   90-95%    Low memory usage
LoRA     24-32GB   Fast      95-98%    Balanced approach
Full     80GB+     Slowest   100%      Maximum quality

3. Run Training

Dataset and what the model learns

The scripts turn the dataset into chat examples. For example, the QLoRA script uses Abirate/english_quotes: each example becomes a user–assistant pair like:

  • User: “Give me a quote about: <tag>”
  • Assistant: “<quote> – <author>”

Fine-tuning teaches the model to respond to prompts asking for quotes about a topic and to return them in the format <quote text> - <author>. The LoRA and full fine-tuning scripts use databricks/databricks-dolly-15k (general instruction/response pairs), so the exact task varies by script; the idea is the same: adapt the model to your chosen dataset and output format.
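
For illustration, a minimal sketch of that mapping (the tutorial's actual scripts may format examples differently):

from datasets import load_dataset

# Sketch: turn Abirate/english_quotes rows into chat-style training pairs.
ds = load_dataset("Abirate/english_quotes", split="train")

def to_chat(example):
    # Fall back to a generic tag if a quote has none (hypothetical default).
    tag = example["tags"][0] if example["tags"] else "life"
    return {
        "messages": [
            {"role": "user", "content": f"Give me a quote about: {tag}"},
            {"role": "assistant", "content": f"{example['quote']} - {example['author']}"},
        ]
    }

ds = ds.map(to_chat)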

Below is a summary of the available training methods. Each method links to its script and provides a brief description for choosing the right approach.

  • LoRA: Trains small adapter matrices while freezing the base model. 3–5x faster; ~95–98% of full quality. Typical VRAM: 24–32GB. Recommended for advanced users, multiple adapters, or when more VRAM is available.
  • QLoRA (Linux only): 4-bit quantization + LoRA adapters. Lowest memory use, fastest, small quality trade-off. Requires bitsandbytes (Linux only). Typical VRAM: 12–16GB. Recommended for most users, fast experiments, and limited VRAM.
  • Full fine-tuning: Updates all model parameters. Maximum quality; highest memory and compute usage. Typical VRAM: 40GB+. Recommended for maximum quality, research, and large-VRAM systems.

Understanding the Techniques

What is LoRA?

LoRA (Low-Rank Adaptation) keeps the base model frozen and only trains small “adapter” matrices that get added to certain layers.

  • The key idea: instead of updating a huge weight matrix with millions of parameters, we learn a low-rank update (two small matrices whose product has far fewer parameters). That gives a large reduction in trainable parameters and VRAM while keeping most of full fine-tuning's quality.
import torch

d, r = 4096, 32                 # hidden size, LoRA rank
W = torch.randn(d, d)           # frozen base weight: ~16.8M params
# Full fine-tuning would learn a dense update dW of the same size:
#   W_updated = W + dW
# LoRA instead learns two small matrices whose product is the update:
B = torch.zeros(d, r)           # trainable, initialized to zero
A = torch.randn(r, d) * 0.01    # trainable
W_updated = W + B @ A           # 2 * d * r = 262K trainable params (~98% reduction)
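
In practice you rarely build this by hand: the PEFT library wraps a model with LoRA adapters for you. A minimal sketch (the rank and target modules here are illustrative, not necessarily the tutorial scripts' exact settings):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,                      # rank of the low-rank update
    lora_alpha=64,             # scaling applied to B @ A
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # 'model' is a loaded transformers model
model.print_trainable_parameters()          # prints trainable vs. total parameter counts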

What is QLoRA?

QLoRA combines 4-bit quantization with LoRA. The base model is loaded in 4-bit precision (a large memory saving), and only the LoRA adapters are trained in higher precision, so you get the parameter efficiency of LoRA plus much lower VRAM use, with a small quality trade-off compared to full-precision LoRA. Note that 4-bit quantization can cause numerical instabilities (loss spikes or NaNs), so prefer plain LoRA if you have enough VRAM.

Base Model (4-bit): 10GB ← Frozen, quantized
LoRA Adapters (BF16): 2GB ← Trainable, full precision
Total: 12GB (vs 40GB full precision)
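
Loading the base model in 4-bit goes through bitsandbytes. A minimal sketch (NF4 with bfloat16 compute is a common choice; your training script's exact settings may differ):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in higher precision
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-it",
    quantization_config=bnb_config,
    device_map="auto",
)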

Using your Fine-Tuned Model

After Full Fine-Tuning

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "output-gemma-3-4b-full",  # directory containing your fully fine-tuned checkpoint
    device_map="auto",
    torch_dtype="auto",        # or torch.bfloat16 explicitly if your GPU supports it
)
tokenizer = AutoTokenizer.from_pretrained("output-gemma-3-4b-full")

# Generate text
prompt = "Explain quantum computing:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After LoRA/QLoRA Training

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load model with LoRA or QLoRA adapters
model = AutoPeftModelForCausalLM.from_pretrained(
    "output-gemma-3-4b-qlora",  # or "output-gemma-3-4b-lora" depending on your training
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("output-gemma-3-4b-qlora")

# Generate text
prompt = "Explain quantum computing:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Merge LoRA Adapter into Base Model

# Merge LoRA/QLoRA adapter weights into the base model for standalone inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("gemma-3-4b-merged")
tokenizer.save_pretrained("gemma-3-4b-merged")
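
The merged checkpoint then loads like any regular model, with no PEFT dependency:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gemma-3-4b-merged", device_map="auto")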

For more custom settings (padding tokens, device placement, etc.), refer to the script you used for training.


Customization Guide

Use your Own Dataset

All scripts use the same dataset format. Replace the loading section:

from datasets import load_dataset

# Option 1: Local JSON/JSONL file
dataset = load_dataset('json', data_files='your_data.json')
# Option 2: Hugging Face Hub dataset
dataset = load_dataset('username/dataset-name')
# Option 3: CSV file
dataset = load_dataset('csv', data_files='data.csv')

# Format for chat models
def format_instruction(example):
    return {
        "messages": [
            {"role": "user", "content": example['instruction']},
            {"role": "assistant", "content": example['response']}
        ]
    }

dataset = dataset.map(format_instruction)

Dataset Format:

[
  {
    "messages": [
      {"role": "user", "content": "Your instruction here"},
      {"role": "assistant", "content": "Expected response here"}
    ]
  }
]

Adjust Training Parameters

Edit the training script and change the variables to match your goals: learning rate (LR), epochs (EPOCHS), batch size (BATCH_SIZE), gradient accumulation (GRAD_ACCUM_STEPS), and, for LoRA/QLoRA, the adapter rank (LORA_R). For faster runs use fewer epochs and a higher learning rate; for better quality use more epochs and a lower one. Reduce the batch size or sequence length if you hit out-of-memory errors.
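
For reference, these tunables typically sit near the top of each script as plain variables; a sketch with illustrative starting values (not recommendations):

LR = 2e-4             # learning rate
EPOCHS = 3            # passes over the dataset
BATCH_SIZE = 4        # per-device batch size
GRAD_ACCUM_STEPS = 4  # effective batch = BATCH_SIZE * GRAD_ACCUM_STEPS
LORA_R = 16           # LoRA rank (LoRA/QLoRA scripts only)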

Memory Optimization Tips

If you encounter out-of-memory errors:

1. Reduce Batch Size:

BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 16 # keeps the effective batch (BATCH_SIZE x GRAD_ACCUM_STEPS) unchanged

2. Reduce Sequence Length:

max_seq_length=256 # Instead of 512

3. Use More Aggressive Quantization:

Full → LoRA → QLoRA

4. Enable Gradient Checkpointing (Full fine-tuning only):

model.gradient_checkpointing_enable()

Monitoring & Debugging

Watch GPU Memory

Terminal window
# Check ROCm GPU status
watch -n 1 rocm-smi
# Show memory info
rocm-smi --showmeminfo vram
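
You can also check memory from inside Python; ROCm builds of PyTorch expose the GPU through the familiar torch.cuda API:

import torch

print(torch.cuda.is_available())                                 # True if the GPU is visible
print(torch.cuda.memory_allocated() / 1024**3, "GiB allocated")  # live tensors
print(torch.cuda.memory_reserved() / 1024**3, "GiB reserved")    # caching-allocator total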

(Optional) Track Experiments with Weights & Biases

To log runs and metrics to Weights & Biases:

Terminal window
pip install wandb
wandb login

In the training script, set report_to="wandb" and optionally run_name="your-experiment-name" in the trainer config. If you prefer not to use W&B, leave report_to at its default or set it to "none".
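
A sketch of where these fields go, using transformers' TrainingArguments (TRL's SFTConfig accepts the same fields; the values are placeholders):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output-gemma-3-4b-lora",  # placeholder output path
    report_to="wandb",                    # or "none" to disable experiment logging
    run_name="gemma3-lora-experiment-1",  # hypothetical run name shown in the W&B UI
)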

Common Issues

Out of Memory (OOM)

Solution: Reduce batch size and/or use QLoRA

BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 16
# Or: python train_qlora.py

Loss Not Decreasing

Solution: Adjust learning rate

LR = 1e-4 # Try lower
# or
LR = 5e-4 # Try higher

Slow Training

Solution: Increase batch size if memory allows

BATCH_SIZE = 8

Next Steps

After you have completed successful fine-tuning, consider the following next steps to get more from your model:

  1. Evaluate thoroughly on held-out test data to measure generalization and avoid overfitting.
  2. Experiment by trying different hyperparameter values for better accuracy, speed, and memory trade-offs.
  3. Track all your experiments (and corresponding metrics) with Weights & Biases for reproducible research.
  4. Try training on your own custom datasets to adapt the model specifically to your use case.
  5. Deploy your fine-tuned model for fast inference using efficient backends such as vLLM on compatible hardware.
  6. Explore advanced techniques including prompt engineering, mixed precision, and longer sequence lengths.
  7. Train multiple LoRA adapters for different tasks or domains and swap them as needed.

Good luck with your fine-tuning journey! 🎉

Need help with this playbook?

Run into an issue or have a question? Open a GitHub issue and our team will take a look.