Running and serving LLMs with LM Studio

Set up LM Studio and LM Studio Server to run and serve large language models locally.

Overview

LM Studio is a GUI-based wrapper for llama.cpp that also provides an OpenAI-compatible endpoint for local model serving. It offers a simple but powerful interface for downloading and deploying models, and ships both Vulkan and AMD ROCm™ software backends (called runtimes) for AMD users.

What You’ll Learn

  • How to configure and use LM Studio to leverage your local hardware
  • How to test and manage LLMs in a completely offline environment
  • How to serve models via an OpenAI-compatible API to power custom workflows and apps

Installing Dependencies

LM Studio

On Windows:

  1. Download the installer from here: https://lmstudio.ai/download
  2. Run the installer.

On Linux:

  1. Download the AppImage from here: https://lmstudio.ai/download?os=linux
  2. Run sudo apt install libfuse2
  3. Run cd ~/Downloads
  4. Run chmod +x LM-Studio-*.AppImage
  5. Run ./LM-Studio-*.AppImage

Memory configuration for running large models

On Windows, running larger models that require more memory means increasing the AMD Variable Graphics Memory (iGPU VRAM) allocation. Although 64 GB is adequate for most workloads, running the largest models with a high context length may require 96 GB.

This can be done by opening the AMD Software: Adrenalin™ Edition control panel and navigating to Performance > Tuning > AMD Variable Graphics Memory. Reboot the system for the changes to take effect.

On Linux, ROCm utilizes a shared system memory pool, and this pool is configured by default to half the system memory.

This amount can be increased by changing the kernel’s Translation Table Manager (TTM) page settings, as described in the following instructions. AMD recommends setting the dedicated VRAM in the BIOS to the minimum value (0.5 GB).

  • Install the pipx utility and add the path for pipx installed wheels into the system search path.
Terminal window
sudo apt install pipx
pipx ensurepath
  • Install the amd-debug-tools wheel from PyPI.
Terminal window
pipx install amd-debug-tools
  • Run the amd-ttm tool to query the current settings for shared memory.
Terminal window
amd-ttm
  • Reconfigure the shared memory settings using the --set argument (units are in GB).
Terminal window
amd-ttm --set <NUM>
  • Reboot the system for changes to take effect.

For amd-ttm usage examples, see the ROCm documentation.
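As a quick illustration, on a machine with 128 GB of system memory you might dedicate 96 GB to the shared pool; the value below is purely illustrative, so choose one that fits your installed memory and the models you plan to run.

Terminal window
amd-ttm --set 96

Reboot afterwards so the new limit takes effect.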

Downloading Models

Downloading GPT-OSS 120B on LM Studio

To download the GPT-OSS 120B model:

  1. Press “Ctrl” + “Shift” + “M” on your keyboard or click on the “Discover” tab (Magnifying Glass icon) on the left sidebar
  2. Search for ggml-org/gpt-oss-120b-GGUF
  3. Select mxfp4 and click Download

LM Studio Download Models

LM Studio will automatically download and place the model in the correct directory.

Should you wish to download additional models, you can search for them in the Discover tab and LM Studio will handle the rest.

Chatting with an LLM

Learn how to start chatting with a ChatGPT-grade LLM completely locally.

  1. Open LM Studio.
  2. Press Ctrl + L to open the Model Loader, select Manually choose model load parameters, and click on GPT-OSS 120B.
  3. Make sure “show advanced settings” is checked.
  4. Change Context Length as desired. A higher context length lets the model keep more of the conversation in memory, but uses more system memory. The recommended value for this playbook is 4096.
  5. Make sure GPU Offload is set to maximum and Flash Attention is On
  6. Check Remember settings and click on Load Model.
  7. If not in the chat window, press Ctrl + 1 or click on the 👾 button on the top left of the screen.
  8. Send a message and start interacting with the model!

Chatting with gpt-oss-120b on LM Studio

Serve LLMs through an OpenAI compatible endpoint

LM Studio also offers an OpenAI-compatible endpoint in the form of LM Studio Server. This has already been demonstrated in an agentic coding workflow with Cline here. Another common use case is connecting LM Studio Server to a web application (React, Node.js, Python) by sending standard HTTP requests to the inference endpoint.

To set up LM Studio Server, use the following instructions:

  1. On the left-hand side, click on the Developer tab (command line icon) or press Ctrl + 2, and then click on Server Settings.
  2. (Optional): If you want to serve the model over your LAN, check Serve on Local Network. If you want to use the server from a website or from extensions calling it within VS Code, check Enable CORS.
  3. In the upper left corner, make sure the server is running by clicking the toggle button in front of Status.
  4. An OpenAI-compatible endpoint will now be running. The address is typically http://127.0.0.1:1234
  5. If a model is not already loaded, you can load it by clicking Load Model and following the previously mentioned steps.

This model will now be accessible through the LM Studio Server endpoint and will support OpenAI endpoints including:

Endpoint                Method   Docs
/v1/models              GET      Models
/v1/responses           POST     Responses
/v1/chat/completions    POST     Chat Completions
/v1/embeddings          POST     Embeddings
/v1/completions         POST     Completions
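As a quick smoke test before writing any application code, you can query the /v1/models endpoint directly. The sketch below uses only the Python standard library and assumes the server is running at the default address of http://127.0.0.1:1234:

import json
import urllib.request

# Ask LM Studio Server which models it currently exposes
# (assumes the default server address of http://127.0.0.1:1234)
with urllib.request.urlopen("http://127.0.0.1:1234/v1/models") as response:
    payload = json.load(response)

# The OpenAI-compatible response lists models under the "data" key
for model in payload.get("data", []):
    print(model["id"])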

Example: Pinging your Endpoint

Having just created the OpenAI-compatible endpoint, let’s look at how to integrate it into a Python development environment (such as VS Code) and use your system as a local API provider.

  1. Create a Python virtual environment:

On Windows, open a terminal in the directory of your choice and follow the commands to create a venv.

Terminal window
python -m venv llm-env --system-site-packages
llm-env\Scripts\activate

On Linux, open a terminal in the directory of your choice and follow the commands to create a venv.

Terminal window
sudo apt update
sudo apt install -y python3-venv
python3 -m venv llm-env --system-site-packages
source llm-env/bin/activate

  2. Install the OpenAI package:
Terminal window
pip install openai
  3. Run the following script to ping the endpoint we have just created:
from openai import OpenAI

# Initialize the client specifically for your local server
# The API key is required by the library but ignored by LM Studio
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"
)

print("Attempting to connect to local LM Studio server...")

try:
    # Create a simple chat completion request
    completion = client.chat.completions.create(
        model="local-model",  # The model identifier is optional in local mode
        messages=[
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": "Explain Python decorators in 1 sentence"}
        ],
        temperature=0.7,
    )

    # Print the response
    print("\nConnection Successful! Server Response:\n")
    print(completion.choices[0].message.content)

except Exception as e:
    print(f"\nConnection Failed: {e}. Ensure LM Studio server is running on port 1234.")
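The same endpoint also supports streaming through the standard OpenAI client interface, which is useful when you want tokens to appear as they are generated instead of waiting for the full reply. Below is a minimal sketch under the same assumptions as the script above (local server on port 1234, any model loaded; the "local-model" identifier is a placeholder):

from openai import OpenAI

# Same local server as before; the API key is required by the library but ignored by LM Studio
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# stream=True returns an iterator of chunks as the model generates tokens
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Explain Python decorators in 1 sentence"}],
    stream=True,
)

# Print each token fragment as soon as it arrives
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()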

(Optional): Swapping between Runtimes

  1. Press Ctrl + Shift + R on your keyboard. Alternatively, click on the Discover tab (Magnifying Glass) on the left-hand side and then click on Runtime in the pop-up.
  2. You should then see Runtime Selections, where the dropdown menu can be used to change the runtime.

Next Steps

  • Custom App Integration: Integrate your own Python scripts or applications using the local OpenAI-compatible API.
  • Advanced Frontends: Connect powerful interfaces like Open WebUI to your server for chat history and persona management.

For more documentation, please visit: https://lmstudio.ai/docs/developer

Need help with this playbook?

Run into an issue or have a question? Open a GitHub issue and our team will take a look.