Fine-tune LLMs with PyTorch and AMD ROCm™ Software

Fine-tune large language models (LLMs) using PyTorch and ROCm.


Overview

This tutorial provides step-by-step examples for fine-tuning a large language model (LLM) with PyTorch and ROCm. It covers several techniques, from full fine-tuning to memory-efficient Parameter-Efficient Fine-Tuning (PEFT) strategies, so you can adapt models to your needs.

Model Used: google/gemma-3-4b-it (see Enable HF authentication if gated)
Hardware: AMD Radeon™ GPU with ROCm support
Framework: PyTorch + Hugging Face (Transformers, PEFT, Transformer Reinforcement Learning (TRL))

Quick Start

1. Install Dependencies

Terminal window (Windows)
python -m venv venv
venv\Scripts\activate.bat

Terminal window (Linux)
sudo apt update
sudo apt install -y python3-venv
python3 -m venv venv
source venv/bin/activate
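
Note: venv\Scripts\activate.bat targets the Windows Command Prompt; if you use PowerShell, activate with:

Terminal window
venv\Scripts\Activate.ps1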

Installing Basic Dependencies

PyTorch

Install PyTorch with AMD ROCm™ software support in the created virtual environment, using the index URL that matches your GPU's gfx target:

Terminal window (gfx1151)
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1151/ torch torchvision torchaudio

Terminal window (gfx1152)
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1152/ torch torchvision torchaudio

Terminal window (gfx1150)
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1150/ torch torchvision torchaudio

See this link for details.
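
After installation, a quick sanity check confirms that PyTorch can see the GPU (run inside the same virtual environment):

import torch

print(torch.__version__)              # should report a ROCm build
print(torch.cuda.is_available())      # True if the GPU is visible to PyTorch
print(torch.cuda.get_device_name(0))  # name of the detected Radeon GPU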

Additional Dependencies

Terminal window (Linux)
pip install transformers==4.57.1 safetensors==0.6.2 accelerate peft trl bitsandbytes "fsspec[http]>=2023.1.0,<=2025.9.0"

Windows: Only core packages are tested and supported here. bitsandbytes is not well supported on Windows, so the Windows install omits it; use LoRA or full fine-tuning on Windows (QLoRA requires bitsandbytes and is intended for Linux).

Terminal window (Windows)
pip install transformers==4.57.1 safetensors==0.6.2 datasets==4.2.0 accelerate peft trl "fsspec[http]>=2023.1.0,<=2025.9.0"

Enable HF authentication (gated or custom / non-preinstalled models)

In this example we use google/gemma-3-4b-it, which is a gated model. You must accept the model’s terms on Hugging Face and then authenticate so the training scripts can download it.

  1. Accept the license: Open https://huggingface.co/google/gemma-3-4b-it, sign in (or create an account), and accept the license/terms on the model page (e.g. “Agree and access repository”).
  2. Install and log in: Install the Hugging Face CLI, then run the standard login:
Terminal window
pip install huggingface_hub
hf auth login
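
If you prefer non-interactive authentication (for example on a remote machine), huggingface_hub also reads a token from the HF_TOKEN environment variable:

Terminal window
export HF_TOKEN=<your-access-token>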

2. Choose Your Method

Method   Memory    Speed     Quality   Best For
QLoRA    12-16GB   Fastest   90-95%    Low memory usage
LoRA     24-32GB   Fast      95-98%    Balanced approach
Full     80GB+     Slowest   100%      Maximum quality

3. Run Training

Dataset and what the model learns

The scripts turn the dataset into chat examples. For example, the QLoRA script uses Abirate/english_quotes: each example becomes a user–assistant pair like:

  • User: “Give me a quote about: <tag>”
  • Assistant: “<quote> – <author>”

Fine-tuning teaches the model to respond to prompts asking for quotes about a topic and to return them in the format <quote text> - <author>. The LoRA and full fine-tuning scripts use databricks/databricks-dolly-15k (general instruction/response pairs), so the exact task varies by script; the idea is the same: adapt the model to your chosen dataset and output format.
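
For illustration, a minimal sketch of that mapping (the tutorial's actual scripts may format examples differently):

from datasets import load_dataset

# Sketch: turn Abirate/english_quotes rows into chat-style training pairs.
ds = load_dataset("Abirate/english_quotes", split="train")

def to_chat(example):
    # Fall back to a generic tag if a quote has none (hypothetical default).
    tag = example["tags"][0] if example["tags"] else "life"
    return {
        "messages": [
            {"role": "user", "content": f"Give me a quote about: {tag}"},
            {"role": "assistant", "content": f"{example['quote']} - {example['author']}"},
        ]
    }

ds = ds.map(to_chat)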

Below is a summary of the available training methods. Each method links to its script and provides a brief description for choosing the right approach.

  • LoRA: Trains small adapter matrices while freezing the base model. 3–5x faster; ~95–98% of full quality. Typical VRAM: 24–32GB. Recommended for advanced users, multiple adapters, or when more VRAM is available.
  • QLoRA (Linux only): 4-bit quantization + LoRA adapters. Lowest memory use, fastest, small quality trade-off. Requires bitsandbytes (Linux only). Typical VRAM: 12–16GB. Recommended for most users, fast experiments, and limited VRAM.
  • Full fine-tuning: Updates all model parameters. Maximum quality; highest memory and compute usage. Typical VRAM: 40GB+. Recommended for maximum quality, research, and large-VRAM systems.

Understanding the Techniques

What is LoRA?

LoRA (Low-Rank Adaptation) keeps the base model frozen and only trains small “adapter” matrices that get added to certain layers.

  • The key idea: instead of updating a huge weight matrix with millions of parameters, we learn a low-rank update (two small matrices whose product has far fewer parameters). That gives a large reduction in trainable parameters and VRAM while keeping most of full fine-tuning's quality.
import torch

d, r = 4096, 32                 # hidden size, LoRA rank
W = torch.randn(d, d)           # frozen base weight: ~16.8M params
# Full fine-tuning would learn a dense update dW of the same size:
#   W_updated = W + dW
# LoRA instead learns two small matrices whose product is the update:
B = torch.zeros(d, r)           # trainable, initialized to zero
A = torch.randn(r, d) * 0.01    # trainable
W_updated = W + B @ A           # 2 * d * r = 262K trainable params (~98% reduction)
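
In practice you rarely build this by hand: the PEFT library wraps a model with LoRA adapters for you. A minimal sketch (the rank and target modules here are illustrative, not necessarily the tutorial scripts' exact settings):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,                      # rank of the low-rank update
    lora_alpha=64,             # scaling applied to B @ A
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # 'model' is a loaded transformers model
model.print_trainable_parameters()          # prints trainable vs. total parameter counts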

What is QLoRA?

QLoRA combines 4-bit quantization with LoRA. The base model is loaded in 4-bit precision (a large memory saving), and only the LoRA adapters are trained in higher precision, so you get the parameter efficiency of LoRA plus much lower VRAM use, with a small quality trade-off compared to full-precision LoRA. Note that 4-bit quantization can cause numerical instabilities (loss spikes or NaNs), so prefer plain LoRA if you have enough VRAM.

Base Model (4-bit): 10GB ← Frozen, quantized
LoRA Adapters (BF16): 2GB ← Trainable, full precision
Total: 12GB (vs 40GB full precision)
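
Loading the base model in 4-bit goes through bitsandbytes. A minimal sketch (NF4 with bfloat16 compute is a common choice; your training script's exact settings may differ):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in higher precision
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-it",
    quantization_config=bnb_config,
    device_map="auto",
)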

Using your Fine-Tuned Model

After Full Fine-Tuning

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "output-gemma-3-4b-full",  # directory containing your fully fine-tuned checkpoint
    device_map="auto",
    torch_dtype="auto",        # or torch.bfloat16 explicitly if your GPU supports it
)
tokenizer = AutoTokenizer.from_pretrained("output-gemma-3-4b-full")

# Generate text
prompt = "Explain quantum computing:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After LoRA/QLoRA Training

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load model with LoRA or QLoRA adapters
model = AutoPeftModelForCausalLM.from_pretrained(
    "output-gemma-3-4b-qlora",  # or "output-gemma-3-4b-lora" depending on your training
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("output-gemma-3-4b-qlora")

# Generate text
prompt = "Explain quantum computing:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Merge LoRA Adapter into Base Model

# Merge LoRA/QLoRA adapter weights into the base model for standalone inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("gemma-3-4b-merged")
tokenizer.save_pretrained("gemma-3-4b-merged")
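
The merged checkpoint then loads like any regular model, with no PEFT dependency:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gemma-3-4b-merged", device_map="auto")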

For more custom settings (padding tokens, device placement, etc.), refer to the script you used for training.


Customization Guide

Use your Own Dataset

All scripts use the same dataset format. Replace the loading section:

from datasets import load_dataset

# Option 1: Local JSON/JSONL file
dataset = load_dataset('json', data_files='your_data.json')
# Option 2: Hugging Face Hub dataset
dataset = load_dataset('username/dataset-name')
# Option 3: CSV file
dataset = load_dataset('csv', data_files='data.csv')

# Format for chat models
def format_instruction(example):
    return {
        "messages": [
            {"role": "user", "content": example['instruction']},
            {"role": "assistant", "content": example['response']}
        ]
    }

dataset = dataset.map(format_instruction)

Dataset Format:

[
  {
    "messages": [
      {"role": "user", "content": "Your instruction here"},
      {"role": "assistant", "content": "Expected response here"}
    ]
  }
]

Adjust Training Parameters

Edit the training script and change the variables to match your goals: learning rate (LR), epochs (EPOCHS), batch size (BATCH_SIZE), gradient accumulation (GRAD_ACCUM_STEPS), and, for LoRA/QLoRA, the adapter rank (LORA_R). For faster runs use fewer epochs and a higher learning rate; for better quality use more epochs and a lower one. Reduce the batch size or sequence length if you hit out-of-memory errors.
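
For reference, these tunables typically sit near the top of each script as plain variables; a sketch with illustrative starting values (not recommendations):

LR = 2e-4             # learning rate
EPOCHS = 3            # passes over the dataset
BATCH_SIZE = 4        # per-device batch size
GRAD_ACCUM_STEPS = 4  # effective batch = BATCH_SIZE * GRAD_ACCUM_STEPS
LORA_R = 16           # LoRA rank (LoRA/QLoRA scripts only)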

Memory Optimization Tips

If you encounter out-of-memory errors:

1. Reduce Batch Size:

BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 16 # keeps the effective batch (BATCH_SIZE x GRAD_ACCUM_STEPS) unchanged

2. Reduce Sequence Length:

max_seq_length=256 # Instead of 512

3. Use More Aggressive Quantization:

Full → LoRA → QLoRA

4. Enable Gradient Checkpointing (Full fine-tuning only):

model.gradient_checkpointing_enable()

Monitoring & Debugging

Watch GPU Memory

Terminal window
# Check ROCm GPU status
watch -n 1 rocm-smi
# Show memory info
rocm-smi --showmeminfo vram
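
You can also check memory from inside Python; ROCm builds of PyTorch expose the GPU through the familiar torch.cuda API:

import torch

print(torch.cuda.is_available())                                 # True if the GPU is visible
print(torch.cuda.memory_allocated() / 1024**3, "GiB allocated")  # live tensors
print(torch.cuda.memory_reserved() / 1024**3, "GiB reserved")    # caching-allocator total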

(Optional) Track Experiments with Weights & Biases

To log runs and metrics to Weights & Biases:

Terminal window
pip install wandb
wandb login

In the training script, set report_to="wandb" and optionally run_name="your-experiment-name" in the trainer config. If you prefer not to use W&B, leave report_to at its default or set it to "none".
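
A sketch of where these fields go, using transformers' TrainingArguments (TRL's SFTConfig accepts the same fields; the values are placeholders):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output-gemma-3-4b-lora",  # placeholder output path
    report_to="wandb",                    # or "none" to disable experiment logging
    run_name="gemma3-lora-experiment-1",  # hypothetical run name shown in the W&B UI
)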

Common Issues

Out of Memory (OOM)

Solution: Reduce batch size and/or use QLoRA

BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 16
# Or: python train_qlora.py

Loss Not Decreasing

Solution: Adjust learning rate

LR = 1e-4 # Try lower
# or
LR = 5e-4 # Try higher

Slow Training

Solution: Increase batch size if memory allows

BATCH_SIZE = 8

Next Steps

After you have completed successful fine-tuning, consider the following next steps to get more from your model:

  1. Evaluate thoroughly on held-out test data to measure generalization and avoid overfitting.
  2. Experiment by trying different hyperparameter values for better accuracy, speed, and memory trade-offs.
  3. Track all your experiments (and corresponding metrics) with Weights & Biases for reproducible research.
  4. Try training on your own custom datasets to adapt the model specifically to your use case.
  5. Deploy your fine-tuned model for fast inference using efficient backends such as vLLM on compatible hardware.
  6. Explore advanced techniques including prompt engineering, mixed precision, and longer sequence lengths.
  7. Train multiple LoRA adapters for different tasks or domains and swap them as needed.

Good luck with your fine-tuning journey! 🎉

Need help with this playbook?

Run into an issue or have a question? Open a GitHub issue and our team will take a look.