
Using Lemonade Across CPU, GPU, and NPU

Learn to run Gen AI models locally with Lemonade, an open-source local AI server.

Overview

🍋 Lemonade is an open-source local AI server that lets you run large language models (LLMs), image generators, and audio models directly on your own hardware. It exposes these models through the industry-standard OpenAI API, so any app that works with OpenAI can instantly work with Lemonade. By the end of this playbook, you’ll be using Lemonade to run models locally on your machine.

What You’ll Learn

By the end of this playbook you will be able to:

  • Install Lemonade Server and verify it is running.
  • Download and chat with an LLM using a single command.
  • Explore the web UI and try different modalities such as vision, speech-to-text, and image generation.
  • Switch GPU backends between Vulkan and AMD ROCm™ software.
  • Build a Python app powered by a local LLM using the OpenAI-compatible API.
  • Run models on the AMD Neural Processing Unit (NPU) using Hybrid and FLM execution modes on AMD Ryzen™ AI hardware.

Prerequisites

Before you begin, make sure you have:

  • A PC running Windows 11 or a supported Linux distribution (Ubuntu 24.04+, Fedora, Debian)
  • At least 16 GB of RAM (64 GB recommended for larger models)
  • ~4–32 GB of free disk space, depending on the models you download. The largest model in this guide requires ~20 GB.
  • Python 3.10–3.13 (used in the Python app section)
  • [Optional] An AMD XDNA 2 NPU (Ryzen AI 300/400/Max 300 series or Z2 Extreme) with the latest driver installed from Ryzen AI Software Installation Instructions if you want to run a model on the NPU.

Lemonade

Installing Lemonade

On Windows, download the latest installer from lemonade-server.ai and run the .msi file.

After installation:

  • The lemonade CLI is added to your system PATH automatically
  • The Lemonade server starts automatically and runs in the background

You can also install silently from the command line:

msiexec /i lemonade-server-minimal.msi /qn

Ubuntu:

sudo add-apt-repository ppa:lemonade-team/stable
sudo apt install lemonade-server

Arch Linux (AUR):

yay -S lemonade-server

For other distributions or to install from source, see the full installation options.

Verifying Lemonade Installation

Open a terminal and run:

lemonade --version

You should see output like:

lemonade version x.y.z

If you see a version number, Lemonade is installed correctly and ready to go.

For quick reference, here are common Lemonade CLI commands:

| Command | What it does |
| --- | --- |
| lemonade --help | Shows all available commands and flags. |
| lemonade --version | Prints the installed Lemonade version. |
| lemonade status | Confirms whether the Lemonade server is running and reachable. The default OpenAI-compatible API base URL is http://localhost:13305/api/v1. |
| lemonade list | Lists models available to your Lemonade setup. |
| lemonade pull <MODEL_NAME> | Downloads a model without launching it. |
| lemonade run <MODEL_NAME> | Downloads the model if needed, then starts it for inference/chat. |
| lemonade run <MODEL_NAME> --llamacpp rocm | Starts a llama.cpp model with the ROCm backend. |
| lemonade run <MODEL_NAME> --llamacpp vulkan | Starts a llama.cpp model with the Vulkan backend. |
| lemonade config | Displays the current Lemonade configuration values. |
| lemonade config set llamacpp.backend=rocm | Sets the default llama.cpp backend to ROCm. |

For the latest Lemonade server options or troubleshooting, please refer to the official Lemonade documentation.


Core Concepts — How Local AI Servers Work

Before we run a model, it is worth understanding why things are set up this way. Lemonade is a local model server, a process that loads AI models into memory and exposes them to applications over HTTP, just like a cloud AI service would.

Why a Server?

| Benefit | What It Means for You |
| --- | --- |
| Simplified integration | Apps talk to one HTTP API instead of dealing with hardware-specific C++ or Python libraries. |
| Shared models | A single loaded model can serve multiple apps at once; no duplicate copies eating your RAM. |
| Cloud-to-local portability | Code written for OpenAI’s cloud API works with Lemonade by changing one URL. |
| Separation of concerns | Model management, streaming, and fault tolerance are handled by the server, so developers can focus on their app. |

The OpenAI API Standard

Lemonade implements the OpenAI API, the same interface used by ChatGPT, Azure OpenAI, and dozens of other services. The conversation model is simple:

| Role | Who Is Talking |
| --- | --- |
| system | Instructions to the model (persona, constraints, available tools) |
| user | Messages from the human (or application) to the model |
| assistant | Responses generated by the model |

This means any library or app that supports OpenAI can talk to Lemonade by pointing it at http://localhost:13305/api/v1 while Lemonade Server is running.
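Concretely, the roles above map one-to-one onto the messages array of a chat/completions request. Here is a minimal sketch using only the Python standard library; the model name is illustrative (use whatever model you have loaded), and it assumes Lemonade Server is running on the default port:

```python
import json
import urllib.request

# The three conversation roles, expressed as an OpenAI-style request payload.
payload = {
    "model": "Gemma-4-E2B-it-GGUF",  # illustrative; any loaded model works
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Name one citrus fruit."},
        # "assistant" turns are what the model returns; append them
        # yourself to carry context across a multi-turn conversation.
    ],
    "stream": False,
}

def chat(base_url="http://localhost:13305/api/v1"):
    """POST the payload to a running Lemonade server; return the reply text."""
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat())
```

Because the request shape is the OpenAI standard, the same payload works unchanged against any OpenAI-compatible service.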

Main Activity — Your First Local AI Chat

Let’s download an LLM and have a conversation with it, running the AI entirely on your own machine.

Step 1: Download and Run a Model

Lemonade ships with a curated model library. Let’s start with Gemma 4 E2B, a capable and compact model that includes vision support:

lemonade run Gemma-4-E2B-it-GGUF

This single command does three things:

  1. Downloads the model (~3 GB) from Hugging Face, if it is not already downloaded.
  2. Starts the Lemonade Server process on port 13305.
  3. Opens Lemonade App so you can start chatting with the model.

On Windows, the Lemonade App launches automatically and you can begin chatting immediately. If you installed the minimal .msi package, the app is not included; to start chatting, open your web browser and go to http://localhost:13305.

On Linux, open your browser and navigate to http://localhost:13305 to access the web app.

Try typing a question:

What are three fun facts about lemons?

The model will respond directly in the chat window. Congratulations! You are running a large language model locally.

Lemonade App with Logs displayed

In the Server Logs pane in the Lemonade App, you can find telemetry data about the model’s performance after each response. For example:

=== Telemetry ===
Input tokens: 24
Output tokens: 527
TTFT (s): 0.052
TPS: 95.99
=================
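As a sanity check on these numbers (assuming, as is common, that TPS counts output tokens over the decode phase, i.e., after the first token): 527 tokens at 95.99 TPS implies roughly 5.49 s of decoding, so the whole response took about 5.54 s once the 0.052 s TTFT is added. A tiny helper makes the arithmetic explicit:

```python
def decode_tps(output_tokens: int, total_s: float, ttft_s: float) -> float:
    """Tokens per second during decode (everything after the first token)."""
    return output_tokens / (total_s - ttft_s)

# Telemetry above: 527 output tokens, TTFT 0.052 s, total ~5.542 s (implied)
print(round(decode_tps(527, 5.542, 0.052), 2))  # -> 95.99
```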

Step 2: Explore the Web Interface and Different Modalities

Lemonade includes a built-in web interface where you can:

  • Interact with the loaded model in a familiar chat window
  • Browse models in the Model Manager tab
  • Download new models with one click

Try other modalities from the Model Manager tab in the web UI, where you can browse models by Recipe or by Category:

  1. Vision: The Gemma-4-E2B-it-GGUF model you already have loaded supports vision. Paste an image into the chat box and ask the model to describe it.
  2. Image generation: In the Image category, download an image model such as SDXL-Turbo from the Model Manager, then use the Lemonade Image Generator to type a prompt and generate an image locally.
  3. Audio: In the Audio category, download an audio model such as Whisper-Tiny, which can do speech-to-text. Provide an audio recording to transcribe it locally. For text-to-speech, try one of the models in the Speech category, such as kokoro-v1.
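If you would rather script speech-to-text than use the web UI, the server spec includes an OpenAI-style audio transcription endpoint. Here is a hedged sketch using only the standard library; the endpoint path, the Whisper-Tiny model name, and the recording.wav file are assumptions to adjust to your setup:

```python
import json
import mimetypes
import urllib.request
import uuid

def transcribe(path, model="Whisper-Tiny",
               base_url="http://localhost:13305/api/v1"):
    """POST a local audio file to the OpenAI-style transcription endpoint."""
    boundary = uuid.uuid4().hex
    ctype = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        audio = f.read()
    # Minimal multipart/form-data body with "model" and "file" fields
    body = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="audio"\r\nContent-Type: {ctype}\r\n\r\n'
    ).encode() + audio + f"\r\n--{boundary}--\r\n".encode()
    req = urllib.request.Request(
        base_url + "/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]

if __name__ == "__main__":
    print(transcribe("recording.wav"))  # hypothetical local file
```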

Multi-Modality with Lemonade

Step 3: Try a Model with a Different Backend

If you hover over a model in the Lemonade App, you’ll see a gear icon. Clicking this allows you to select options for the model, including choosing your desired backend.

By default, Lemonade uses Vulkan for GPU acceleration. If you have a supported AMD discrete GPU, you can switch to ROCm.

Lemonade Select Backend

To manage your installed backends, click the backend button in the leftmost column.

Alternatively, you can specify the backend using the following command:

lemonade run Gemma-4-E2B-it-GGUF --llamacpp rocm

You can also set your default backend using the environment variable LEMONADE_LLAMACPP with the values: vulkan, rocm, or cpu.


Going Deeper — Build an AI-Powered App with Python

The real power of a local AI server is that any application can connect to it using just a few lines of code. To prove it, let’s build a small but functional study flashcard generator: you give it a topic, it generates flashcards, and you can quiz yourself interactively.

Step 4: Start the Server

Make sure the Lemonade server is running. It typically starts automatically in the background after installation. To verify, run:

lemonade status

You should see a message like: Server is running on port 13305.

If the server isn’t running, start it by opening the Lemonade app. Use the default port 13305 (you can confirm or select this from the tray icon).

Step 5: Install the OpenAI Python Client

In a terminal, install the OpenAI Python client:

pip install openai

Step 6: Build the Flashcard App

Load the Qwen3-Coder-30B-A3B-Instruct model in Lemonade and use the following prompt to generate a simple flashcard app. Save the generated code in a file called flashcards.py.

Generate a simple python script that uses the following to call an llm (Qwen3.5-4B-GGUF) to generate flashcards on a subject that is entered by the user.
curl -X POST http://localhost:13305/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-4B-GGUF",
    "messages": [
      {"role": "user", "content": "What is the population of Paris?"}
    ],
    "stream": false
  }'
The generated script should look something like this:

from openai import OpenAI
import json, random

# Connect to Lemonade
client = OpenAI(
    base_url="http://localhost:13305/api/v1",
    api_key="lemonade",
)

def generate_flashcards(topic, count=5):
    """Ask the LLM to generate flashcards on any topic."""
    print(f"\n✨ Generating {count} flashcards on: {topic}\n")
    response = client.chat.completions.create(
        model="Qwen3.5-4B-GGUF",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a study assistant. Generate flashcards as a JSON array. "
                    "Each object must have a 'question' and 'answer' field. "
                    "Keep answers to 1-2 sentences. Return ONLY valid JSON, no markdown."
                ),
            },
            {
                "role": "user",
                "content": f"Create {count} flashcards about: {topic}",
            },
        ],
    )
    # Parse the structured output from the LLM
    text = response.choices[0].message.content.strip()
    # Handle cases where the model wraps JSON in markdown code fences
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return json.loads(text)

def quiz(cards):
    """Run an interactive quiz session with the generated flashcards."""
    random.shuffle(cards)
    score = 0
    for i, card in enumerate(cards, 1):
        print(f"--- Card {i}/{len(cards)} ---")
        print(f"Q: {card['question']}\n")
        input("Press Enter to reveal the answer...")
        print(f"A: {card['answer']}\n")
        got_it = input("Did you get it right? (y/n): ").strip().lower()
        if got_it == "y":
            score += 1
        print()
    print(f"🏆 Score: {score}/{len(cards)}")

# --- Main loop ---
print("🍋 Lemonade Flashcard Generator")
print("================================")
print("Powered by a local LLM running on your own hardware.\n")

while True:
    topic = input('Enter a topic (or "quit" to exit): ').strip()
    if topic.lower() in ("quit", "q", "exit"):
        print("Happy studying! 👋")
        break
    try:
        cards = generate_flashcards(topic)
        print(f"Generated {len(cards)} cards!\n")
        for i, card in enumerate(cards, 1):
            print(f"  {i}. {card['question']}")
        print()
        if input("Start quiz? (y/n): ").strip().lower() == "y":
            quiz(cards)
            print()
    except (json.JSONDecodeError, KeyError):
        print("The model returned an unexpected format. Try again!\n")

Step 7: Run It

python flashcards.py

Here’s what you should see:

🍋 Lemonade Flashcard Generator
================================
Powered by a local LLM running on your own hardware.
Enter a topic (or "quit" to exit): the solar system
✨ Generating 5 flashcards on: the solar system
Generated 5 cards!
1. Which planet is closest to the Sun?
2. What is the largest planet in our solar system?
3. Which planet is known as the "Red Planet"?
4. How many moons does Earth have?
5. What separates the inner planets from the outer planets?
Start quiz? (y/n): y
--- Card 1/5 ---
Q: What is the largest planet in our solar system?
Press Enter to reveal the answer...
A: Jupiter is the largest planet, with a diameter of about 139,820 km.
Did you get it right? (y/n): y
...
🏆 Score: 4/5

In about 50 lines of code you have built a fully functional study tool powered by a local LLM. There is no API key to manage, no usage costs, and no data ever leaves your machine.

Key insight: Notice the client = OpenAI(base_url=...) line is the only thing tying this app to Lemonade instead of OpenAI’s cloud. The rest of the code is identical to what you would write against any OpenAI-compatible service. If you have ever used the OpenAI Python library, you already know how to build apps with Lemonade.

What This Demonstrates

This small app exercises several real-world integration patterns:

| Pattern | Where It Appears |
| --- | --- |
| System prompts | The "system" message tells the LLM to output structured JSON |
| Structured output | The app parses the LLM’s response as JSON to build flashcards |
| Stateless requests | Each generate_flashcards() call is independent |
| Error handling | The try/except gracefully handles cases where the LLM’s output is not valid JSON |

These same patterns scale to any application: chatbots, code assistants, content generators, automation tools.

Bonus Challenge

  • For an added challenge, try updating the app to have the flashcards read to the user by referencing the example provided here.

Running Models on the NPU

If you have a Ryzen AI 300/400/Max 300 series or Z2 Extreme, your device has a built-in Neural Processing Unit (NPU), a dedicated chip designed specifically for AI workloads. Running models on the NPU is more power-efficient than using the GPU, which makes it ideal for background AI tasks, longer sessions, and battery-powered use.

Lemonade supports three NPU execution modes, all transparent behind the same OpenAI API:

| Mode | How It Works | Recipe | Example Models |
| --- | --- | --- | --- |
| Hybrid (NPU + iGPU) | NPU processes the prompt, iGPU generates tokens | OGA (oga-hybrid) | Qwen3-8B-Hybrid, Phi-4-Mini-Instruct-Hybrid |
| NPU-only | Entire inference runs on the NPU | Ryzen AI LLM (ryzenai-llm) | Qwen-2.5-7B-Instruct-NPU |
| FLM | Uses the FastFlowLM engine on the NPU, optimized for AMD XDNA2 | FLM (flm) | Gemma3-4b-it-FLM, Qwen3-8b-FLM |

Requirements

  • AMD Ryzen AI 300/400 series or Z2 series processor
  • For FLM models: the FLM runtime can be installed from within the Lemonade app, or Lemonade will install it automatically the first time you run an FLM model. To learn more about FastFlowLM, see here.

Step 8: Run a Hybrid Model

Hybrid models split work between the NPU and iGPU for a good balance of speed and efficiency. In the Lemonade App, select a model from the Ryzen AI LLM list, for example, Qwen3-4B-Hybrid, or run it using the following command:

lemonade run Qwen3-4B-Hybrid

Lemonade detects your NPU automatically and uses the right backend.

What is happening under the hood? When you send a message, the NPU processes your entire prompt in parallel (this is called “prefill”). Then the iGPU takes over to generate the response one token at a time (this is called “decode”). This hybrid approach plays to each chip’s strengths.
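You can observe the two phases yourself by streaming a response and timing the gap before the first chunk (prefill) separately from the chunk rate afterwards (decode). A rough sketch, assuming the server emits OpenAI-style server-sent events on the default port and that a Hybrid model such as Qwen3-4B-Hybrid is available:

```python
import json
import time
import urllib.request

def time_phases(prompt, model="Qwen3-4B-Hybrid",
                base_url="http://localhost:13305/api/v1"):
    """Stream one completion and report prefill (TTFT) vs decode timing."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    first_chunk_at = None
    chunks = 0
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # OpenAI-style SSE: lines beginning "data: "
            line = raw.strip()
            if not line.startswith(b"data:") or line == b"data: [DONE]":
                continue
            if first_chunk_at is None:
                first_chunk_at = time.perf_counter()  # prefill finished here
            chunks += 1
    if first_chunk_at is None:
        print("no streamed chunks received")
        return
    total = time.perf_counter() - start
    ttft = first_chunk_at - start
    print(f"prefill (TTFT): {ttft:.3f}s  "
          f"decode: {chunks} chunks in {total - ttft:.2f}s")

if __name__ == "__main__":
    time_phases("Explain prefill vs decode in one sentence.")
```

On a Hybrid model, a long prompt mostly stretches the TTFT figure (NPU prefill), while the chunk rate afterwards reflects iGPU decode speed.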

Step 9: Run an FLM Model

FastFlowLM (FLM) models are specifically optimized for AMD’s XDNA2 NPU architecture and can be very fast for their size. For example, select Gemma3-4b from the FastFlowLM NPU list or use the following command:

lemonade run gemma3-4b-FLM

FLM models include some of the most popular architectures (Gemma 3, Qwen 3, Llama 3, and DeepSeek R1) and range from under 1 GB to over 13 GB.

Switching Models

The flashcard app from Step 6 works with NPU models too; just change the model name:

# In flashcards.py, swap the model to run on NPU instead of GPU
response = client.chat.completions.create(
    model="Qwen3-4B-Hybrid",  # swap in any NPU/Hybrid/FLM model
    messages=messages,
)
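To pick a model programmatically rather than hard-coding a name, you can list what the server advertises. This sketch assumes the standard OpenAI-compatible /models route on the default port:

```python
import json
import urllib.request

def list_models(base_url="http://localhost:13305/api/v1"):
    """Return the model IDs the local Lemonade server advertises."""
    with urllib.request.urlopen(base_url + "/models") as resp:
        return [m["id"] for m in json.load(resp)["data"]]

if __name__ == "__main__":
    print("\n".join(list_models()))
```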

Next Steps

You have a local AI server running on your own hardware. Here is where to go next:

  1. Connect your favorite apps: Lemonade works out of the box with VS Code Copilot, Open WebUI, Continue, n8n, and many more.

  2. Browse more models: Explore the full model library to find models optimized for coding, reasoning, vision, and more. Use the Lemonade App or lemonade list to see what is available.

  3. Unlock ROCm GPU acceleration: If you have a supported AMD GPU, switch to the ROCm backend: lemonade config set llamacpp.backend=rocm. See supported AMD GPUs.

  4. Read the full API spec: Lemonade supports chat completions, embeddings, audio transcription, image generation, text-to-speech, and more. See the Server Spec for every endpoint.

  5. Contribute: Lemonade is open source. Check out the contribution guide and look for Good First Issues.

Need help with this playbook?

Run into an issue or have a question? Open a GitHub issue and our team will take a look.