Running and serving LLMs with LM Studio
Set up LM Studio and LM Studio Server to run and serve large language models locally.
Overview
LM Studio is a GUI-based wrapper for llama.cpp that also provides an OpenAI-compatible endpoint for local model serving. It offers a simple but powerful interface for downloading and deploying models, and it ships both Vulkan and AMD ROCm™ software backends (called runtimes) for AMD users.
What You’ll Learn
- Configure and use LM Studio to leverage your local hardware
- Test and manage LLMs in a completely offline environment
- Serve models via an OpenAI-compatible API to power custom workflows and apps
Installing Dependencies
LM Studio
On Windows:
- Download the installer from here: https://lmstudio.ai/download
- Run the installer.

On Linux:
- Download the AppImage from here: https://lmstudio.ai/download?os=linux
- Run:

```bash
sudo apt install libfuse2
cd ~/Downloads
chmod +x LM-Studio-*.AppImage
./LM-Studio-*.AppImage
```
Memory configuration for running large models
On Windows, running larger models that need more memory requires increasing the AMD Variable Graphics Memory (iGPU VRAM) allocation. Although 64 GB is adequate for most workloads, running the largest models with a high context length may require 96 GB.
This can be done by opening the AMD Software: Adrenalin™ Edition control panel and navigating to: Performance > Tuning > AMD Variable Graphics Memory. Reboot the system for the changes to take effect.
On Linux, ROCm uses a shared system memory pool, which by default is sized to half of system memory.
This amount can be increased by changing the kernel's Translation Table Manager (TTM) page settings, using the instructions below. AMD recommends setting the dedicated VRAM in the BIOS to the minimum (0.5 GB).
- Install the pipx utility and add the path for pipx-installed wheels to the system search path.

```bash
sudo apt install pipx
pipx ensurepath
```

- Install the amd-debug-tools wheel from PyPI.

```bash
pipx install amd-debug-tools
```

- Run the amd-ttm tool to query the current settings for shared memory.

```bash
amd-ttm
```

- Reconfigure the shared memory settings using the --set argument (units in GB).

```bash
amd-ttm --set <NUM>
```

- Reboot the system for the changes to take effect.
For amd-ttm usage examples, see the ROCm documentation.
Downloading Models
Downloading GPT-OSS 120B on LM Studio
To download the GPT-OSS 120B model:
- Press `Ctrl + Shift + M` on your keyboard or click on the “Discover” tab (Magnifying Glass icon) on the left sidebar
- Search for `ggml-org/gpt-oss-120b-GGUF`
- Select `mxfp4` and click Download

LM Studio will automatically download and place the model in the correct directory.
Should you wish to download additional models, you can search for them in the Discover tab and LM Studio will handle the rest.
Chatting with an LLM
Learn how to start chatting with a ChatGPT-grade LLM completely locally.
- Open LM Studio.
- Press `Ctrl + L` to open the Model Loader, select `Manually choose model load parameters`, and click on `GPT-OSS 120B`.
- Make sure “Show advanced settings” is checked.
- Change `Context Length` as desired. A higher context length gives the model more memory to work with, but uses more system memory. The recommended value for this playbook is 4096.
- Make sure `GPU Offload` is set to maximum and `Flash Attention` is On.
- Check `Remember settings` and click on `Load Model`.
- If you are not in the chat window, press `Ctrl + 1` or click on the 👾 button at the top left of the screen.
- Send a message and start interacting with the model!

Serve LLMs through an OpenAI compatible endpoint
LM Studio also offers an OpenAI-compliant endpoint in the form of LM Studio Server. This has already been demonstrated in an agentic coding workflow with Cline here. Another common use case is connecting LM Studio Server to any web application (React, Node.js, Python) by sending standard HTTP requests to the inference endpoint.
To set up LM Studio Server, use the following instructions:
- On the left-hand side, click on the `Developer` tab (command line icon) or press `Ctrl + 2`, then click on `Server Settings`.
- (Optional): If you want to serve the model over your LAN, check `Serve on Local Network`. If you want to use it with a website or make extensive calls from within VS Code, check `Enable CORS`.
- In the upper left corner, make sure the server is running by clicking on the toggle button in front of `Status`.
- An OpenAI-compliant endpoint will now be running. The address is typically http://127.0.0.1:1234
- If a model is not already loaded, you can load it by clicking `Load Model` and following the previously mentioned steps.
This model will now be accessible through the LM Studio Server endpoint and will support OpenAI endpoints including:
| Endpoint | Method | Docs |
|---|---|---|
| /v1/models | GET | Models |
| /v1/responses | POST | Responses |
| /v1/chat/completions | POST | Chat Completions |
| /v1/embeddings | POST | Embeddings |
| /v1/completions | POST | Completions |
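Before moving on, you can sanity-check the server with a plain HTTP request, no client library required. The sketch below queries the `/v1/models` endpoint using only the Python standard library; it assumes the server is running at the default address shown above and that at least one model is loaded.

```python
import json
import urllib.request

# Assumes LM Studio Server is running at the default local address
url = "http://127.0.0.1:1234/v1/models"

with urllib.request.urlopen(url) as response:
    data = json.load(response)

# The response follows the OpenAI "list" format; print each model identifier
for model in data.get("data", []):
    print(model.get("id"))
```

If the request fails, double-check that the `Status` toggle in the Developer tab is on and that nothing else is bound to port 1234.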
Example: Pinging your Endpoint
Having just created the OpenAI Compatible endpoint, let’s look at how to integrate this into a Python developer environment (such as VSCode) and use your system as a local API Provider.
- Create a Python virtual environment:

On Windows, open a terminal in the directory of your choice and run the following commands to create and activate a venv.

```bash
python -m venv llm-env
llm-env\Scripts\activate
```

On Linux, open a terminal in the directory of your choice and run the following commands to create and activate a venv.

```bash
sudo apt update
sudo apt install -y python3-venv
python3 -m venv llm-env
source llm-env/bin/activate
```

- Install the OpenAI package:

```bash
pip install openai
```

- Run the following script to ping the endpoint we have just created.
```python
from openai import OpenAI

# Initialize the client specifically for your local server
# The API key is required by the library but ignored by LM Studio
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"
)

print("Attempting to connect to local LM Studio server...")

try:
    # Create a simple chat completion request
    completion = client.chat.completions.create(
        model="local-model",  # The model identifier is optional in local mode
        messages=[
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": "Explain Python decorators in 1 sentence"}
        ],
        temperature=0.7,
    )

    # Print the response
    print("\nConnection Successful! Server Response:\n")
    print(completion.choices[0].message.content)

except Exception as e:
    print(f"\nConnection Failed: {e}. Ensure LM Studio server is running on port 1234.")
```

(Optional): Swapping between Runtimes
- Press `Ctrl + Shift + R` on your keyboard. Alternatively, click on the `Discover` tab (Magnifying Glass) on the left-hand side and then click on `Runtime` in the pop-up.
- You should then see `Runtime Selections`, where the dropdown menu can be used to change the runtime.
Next Steps
- Custom App Integration: Integrate your own Python scripts or applications using the local OpenAI-compatible API.
- Advanced Frontends: Connect powerful interfaces like Open WebUI to your server for chat history and persona management.
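As a starting point for custom app integration, the sketch below streams a chat completion token by token using the openai package installed earlier. The model name and prompt are placeholders, not part of this playbook; LM Studio serves whichever model is currently loaded.

```python
from openai import OpenAI

# Point the client at the local LM Studio Server; the API key is ignored
client = OpenAI(base_url="http://127.0.0.1:1234/v1", api_key="lm-studio")

# Request a streamed response so tokens can be displayed as they arrive
stream = client.chat.completions.create(
    model="local-model",  # placeholder; the currently loaded model is used
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental piece of the reply (may be empty)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

Streaming keeps interactive frontends responsive; the same pattern applies whether the consumer is a script, a web app, or a chat UI.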
For more documentation, please visit: https://lmstudio.ai/docs/developer
Need help with this playbook?
Run into an issue or have a question? Open a GitHub issue and our team will take a look.