Optimized LLM Fine-Tuning with Unsloth
Use Unsloth for memory-efficient LLM fine-tuning
Overview
This playbook shows how to fine-tune a language model locally with Unsloth on AMD hardware.
It uses a short Supervised Fine-Tuning (SFT) example with LoRA adapters on unsloth/gemma-4-E4B-it, using a subset of the mlabonne/FineTome-100k dataset. The goal is to give you a simple end-to-end workflow that covers setup, training, inference, and saving the fine-tuned result.
The example is designed to be practical and easy to modify, so you can use it as a starting point for your own datasets and models.
What You’ll Learn
- How to set up the Unsloth environment
- How to fine-tune an LLM using SFT with Unsloth
- How to save the fine-tuned result in local storage
Why Unsloth?
Unsloth makes LLM fine-tuning easier to run on local hardware by reducing memory usage and speeding up training compared to a standard setup.
In this playbook, we use Unsloth together with LoRA-based SFT. That means the base model stays mostly frozen, while a much smaller set of adapter weights is trained. This is a good fit for local development because it is lighter than full fine-tuning and faster to iterate on.
Unsloth also supports other training approaches, including QLoRA and reinforcement learning workflows. This playbook focuses on the simplest path first: a small LoRA fine-tuning example that users can run, understand, and extend.
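To see why adapter-based training is so much lighter, compare trainable parameter counts for a single projection layer. This is a back-of-the-envelope sketch; the layer shape and LoRA rank below are illustrative, not taken from the model:

```
# Illustrative LoRA parameter arithmetic for one d x k weight matrix.
d, k, r = 4096, 4096, 16      # hypothetical layer shape and LoRA rank
full_ft = d * k               # parameters updated by full fine-tuning
lora = d * r + r * k          # LoRA trains two small low-rank factors instead
print(full_ft, lora, full_ft // lora)  # 16777216 131072 128 -> ~128x fewer
```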
Set up your environment
If AMD ROCm™ software and PyTorch are already installed system-wide, open a terminal and run the following commands to create a venv that can use them:

```
sudo apt update
sudo apt install -y python3-venv
python3 -m venv unsloth-env --system-site-packages
source unsloth-env/bin/activate
```

Otherwise, open a terminal and run the following commands to create a clean venv:

```
sudo apt update
sudo apt install -y python3-venv
python3 -m venv unsloth-env
source unsloth-env/bin/activate
```

Installing Basic Dependencies
ROCm
Add the current user to the render and video groups:

```
sudo usermod -a -G render,video $LOGNAME
```

Restart your system to apply the settings:

```
sudo reboot
```

Install ROCm in the created virtual environment. Use the index URL that matches your GPU architecture (gfx1151, gfx1152, gfx1150, or gfx120x-all); the example below uses gfx1151:

```
python -m pip install --upgrade pip
python -m pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ "rocm[libraries,devel]"
```

For further installation help, see the ROCm installation documentation.
PyTorch
Install PyTorch with AMD ROCm™ software support in the created virtual environment, again using the index URL that matches your GPU architecture (gfx1151, gfx1152, or gfx1150); the example below uses gfx1151:

```
python -m pip install --upgrade pip
python -m pip install --force-reinstall --no-cache-dir --index-url https://repo.amd.com/rocm/whl/gfx1151/ torch torchvision torchaudio
```

See the PyTorch on ROCm documentation for details.
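To confirm the installation, check that PyTorch can see your GPU (ROCm builds of PyTorch expose the device through the torch.cuda API):

```
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
```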
AMD GPU Driver (Windows)
Update to the latest AMD GPU driver using AMD Software: Adrenalin Edition™:
- Open AMD Software: Adrenalin Edition from your Start menu or system tray.
- Navigate to Driver and Software, then click Manage Updates.
- If an update is available, follow the prompts to download and install it.
AMD GPU Driver (Linux)
Download and install the latest AMD GPU driver for Linux:
- Visit the AMD Linux Drivers page.
- Follow the installation instructions provided on the download page.
Additional Dependencies
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth.git"pip install --no-deps git+https://github.com/unslothai/unsloth-zoo.gitpip install --no-deps --upgrade timmpip install datasets transformers trlDownload the Unsloth Fine-Tuning Script
Download the Unsloth Fine-Tuning Script
Instead of manually executing each step, this playbook provides a clean, end-to-end script: test_unsloth.py.
Run the following command to execute the script:

```
python test_unsloth.py
```

The rest of the playbook will conceptually go through each major step of the script.
How It Works
The test_unsloth.py script performs the following steps (a condensed sketch follows the list):
- Load Model: Loads unsloth/gemma-4-E4B-it using FastModel.
- Prepare Data: Standardizes the dataset (e.g., FineTome-100k) and applies the Gemma-4 chat template.
- Apply LoRA: Adds adapters to language, attention, and MLP modules for efficient training.
- Train: Uses SFTTrainer with response-only loss masking.
- Inference: Runs a quick generation test to verify performance.
- Save: Exports LoRA adapters locally.
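A condensed sketch of the load and LoRA steps, assuming Unsloth's FastModel API; the LoRA rank and other argument values are illustrative rather than copied verbatim from the script:

```
# Condensed sketch of the script's load and LoRA steps (illustrative values).
from unsloth import FastModel

# 1) Load the base model and tokenizer.
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-4-E4B-it",
    max_seq_length=1024,
)

# 2) Attach LoRA adapters to the language, attention, and MLP modules;
#    only these small adapter matrices receive gradient updates.
model = FastModel.get_peft_model(
    model,
    r=16,            # LoRA rank (assumed for illustration)
    lora_alpha=16,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
)
```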
Key Configuration
You can modify the following constants to customize your run:
```
MODEL_NAME = "unsloth/gemma-4-E4B-it"
MAX_SEQ_LEN = 1024
DATASET_NAME = "mlabonne/FineTome-100k"
OUTPUT_DIR = "gemma_4_lora"
```

Example of the Unsloth welcome message and output when loading the model weights:
[Screenshot: Unsloth welcome message and model-loading output]
Prepare Dataset
We use a subset of mlabonne/FineTome-100k. The dataset is (see the preparation sketch after this list):
- Converted into chat format
- Processed using the Gemma-4 chat template
- Cleaned to remove duplicate BOS tokens
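A minimal sketch of this preparation, assuming the Hugging Face datasets API and reusing the tokenizer from the loading sketch above; the helper function and column handling below are illustrative rather than copied from the script:

```
from datasets import load_dataset

# Load a small subset of the dataset for a quick run.
dataset = load_dataset("mlabonne/FineTome-100k", split="train[:1000]")

# FineTome stores ShareGPT-style messages ({"from": ..., "value": ...});
# convert them to the {"role": ..., "content": ...} form chat templates expect.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def to_text(example):
    messages = [
        {"role": ROLE_MAP.get(m["from"], m["from"]), "content": m["value"]}
        for m in example["conversations"]
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    # Drop a duplicate BOS token if the template already adds one.
    if tokenizer.bos_token and text.startswith(tokenizer.bos_token * 2):
        text = text[len(tokenizer.bos_token):]
    return {"text": text}

dataset = dataset.map(to_text)
```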
Train the Model
The script runs a short training demo with the following parameters (a sketch of the trainer setup follows the list):
- ~50 steps
- Small batch size
- Gradient accumulation
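A sketch of the trainer setup, assuming the trl SFTTrainer API as used by Unsloth plus Unsloth's response-only masking helper; the hyperparameter values and Gemma turn markers below are assumptions for illustration:

```
from trl import SFTConfig, SFTTrainer
from unsloth.chat_templates import train_on_responses_only

# Reuses model, tokenizer, and dataset from the sketches above.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,   # small batch size
        gradient_accumulation_steps=4,   # effective batch of 8
        max_steps=50,                    # ~50-step demo run
        learning_rate=2e-4,
        output_dir="gemma_4_lora",
    ),
)

# Mask the loss so only assistant responses contribute to training.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",   # assumed Gemma turn markers
    response_part="<start_of_turn>model\n",
)

trainer.train()
```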
During training, you will see logs such as:
[Screenshot: example training log output]
Saving and Deployment
Local Saving (LoRA)
The script automatically saves LoRA adapters to the OUTPUT_DIR.
```
model.save_pretrained("gemma_4_lora")
tokenizer.save_pretrained("gemma_4_lora")
```

Save merged model (for vLLM)
For deployment with vLLM, merge the adapters into a full model:
```
model.save_pretrained_merged("gemma-4-finetune", tokenizer)
```
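The merged folder can then be served directly; a minimal command, assuming vLLM is installed in your environment:

```
vllm serve gemma-4-finetune
```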
Export GGUF (for llama.cpp)
Convert directly to GGUF for local inference:
```
model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method="Q8_0")
```
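You can then run the exported file with llama.cpp; a minimal example, assuming the llama-cli binary is on your PATH and using a placeholder for the generated filename:

```
llama-cli -m gemma_4_finetune/<exported-model>.gguf -p "Hello!"
```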
Explore Lower-Memory (4-bit) Fine-Tuning
This playbook uses standard LoRA fine-tuning. If you need lower memory usage with minimal quality loss, a natural next step is to explore QLoRA with a supported 4-bit model and runtime stack.
QLoRA keeps the same adapter-based training idea as LoRA, but uses a quantized 4-bit base model underneath. This can reduce memory usage further, but compatibility depends on the model, backend, and low-bit kernel support in the software stack.
Before switching to QLoRA, verify that your chosen model and your AMD hardware/software environment support the required quantized runtime path.
An example of how you can enable 4-bit quantization by using a 4-bit quantized model:

```
load_in_4bit = True
model_name = "unsloth/gemma-4-E4B-it-unsloth-bnb-4bit"
```
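Plugged into the model-loading step, this looks like the following sketch (same FastModel assumption as earlier; argument values are illustrative):

```
# Load a pre-quantized 4-bit base model for QLoRA-style adapter training.
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-4-E4B-it-unsloth-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,   # 4-bit base weights under the LoRA adapters
)
```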
Next Steps
- Train on your own specific datasets
- Try fine-tuning with different hyperparameters
- Experiment with different quantization levels to understand the tradeoff between memory usage and quality
- Deploy with vLLM or llama.cpp
- Try QLoRA for a lower-memory setup
Resources
Below are some additional resources to learn more about Unsloth and fine-tuning.
Need help with this playbook?
Run into an issue or have a question? Open a GitHub issue and our team will take a look.