Running Local LLMs on an AMD APU Laptop with 56GB Unified Memory

I recently got a Lenovo ThinkPad P14s Gen 6 with the AMD Ryzen AI 9 HX PRO 370 and 56GB of LPDDR5x RAM. I wanted to see what I could actually run on it for local LLM inference, and it turns out the answer is “quite a lot” if you know how to get around a couple of gotchas with the AMD iGPU memory model.

The short version: I’m running Qwen3.5-35B-A3B (a 35 billion parameter MoE model) at about 14-16 tokens/second generation and 80-126 tok/s prompt processing, served as an OpenAI-compatible API accessible from other machines on my network. No discrete GPU required.

The Hardware

The key specs that matter here:

  • CPU: AMD Ryzen AI 9 HX PRO 370 (Zen 5, 12 cores / 24 threads)
  • GPU: AMD Radeon 890M (RDNA 3.5 integrated, 16 compute units)
  • RAM: 56GB LPDDR5x-5600 (2x 32GB Ramaxel sticks)
  • Storage: 2TB SK Hynix NVMe

The important thing to understand is that the Radeon 890M has zero discrete VRAM. It’s all unified memory: the same physical LPDDR5x chips serve both CPU and GPU. Windows reports about 4GB as “dedicated GPU memory”, but that’s just a reservation policy, not a separate pool of faster memory.

The VRAM Problem (and How to Solve It)

This is where it gets tricky. There are basically three tiers of GPU memory on AMD APUs:

  1. UMA Frame Buffer (“dedicated VRAM”): carved out at boot time in BIOS. Usually 512MB-4GB. Changing it requires a reboot.
  2. WDDM Shared Memory: the Windows driver dynamically maps system RAM for GPU use at runtime. About 50% of total RAM (~28GB on my system). This is truly dynamic; no reboot needed.
  3. VGM (Variable Graphics Memory): AMD’s newer Strix Point feature, configured in AMD Adrenalin. Can allocate up to 75% of RAM as “dedicated”. Still requires a reboot though.

The problem is that most LLM tools (ollama, ROCm-based llama.cpp, vllm) only query the “dedicated VRAM” amount and ignore the shared pool entirely. They see 512MB of VRAM, decide the GPU is useless, and refuse to offload anything.

The fix: use the Vulkan backend. Vulkan accesses both dedicated VRAM and the GTT (Graphics Translation Table) dynamically. On my 56GB system this means about 48-50GB of GPU-addressable memory with no reboot, no VGM changes, no BIOS fiddling.
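
As a back-of-the-envelope check on those numbers, here’s a sketch of the pool sizes in Python. The 50% WDDM figure matches the driver’s usual default; the Vulkan-addressable fraction is an assumption fitted to my observed ~48-50GB ceiling, not a value queried from the driver:

```python
# Rough sketch of the three APU memory tiers described above.
# The fractions are heuristics/assumptions, not driver-queried values.

def memory_pools_gb(total_ram_gb: float, dedicated_gb: float = 0.5,
                    vulkan_fraction: float = 0.875) -> dict:
    """Estimate memory tiers on an AMD APU.

    vulkan_fraction is a hypothetical ceiling for Vulkan-addressable
    memory (dedicated VRAM + GTT); ~0.875 of RAM roughly reproduces
    the ~49GB budget observed on this 56GB machine.
    """
    return {
        "dedicated": dedicated_gb,                      # BIOS carve-out, fixed at boot
        "wddm_shared": total_ram_gb * 0.5,              # driver-managed, dynamic
        "vulkan_addressable": total_ram_gb * vulkan_fraction,  # dedicated + GTT
    }

pools = memory_pools_gb(56)
print(pools)  # {'dedicated': 0.5, 'wddm_shared': 28.0, 'vulkan_addressable': 49.0}
```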

Also worth noting: ollama has no native gfx1150 support for the 890M, and forcing HSA_OVERRIDE_GFX_VERSION results in performance that’s slower than CPU-only. Don’t bother with it on this hardware.

Installing Lemonade Server

AMD’s Lemonade project is the easiest path to get a Vulkan-backed llama.cpp running on Windows. It’s basically a thin orchestrator that auto-detects your hardware, downloads the right llama.cpp build, and serves it as an OpenAI-compatible API.

Step 1: Download and Install

Grab the MSI from the GitHub releases page. At time of writing the latest is v10.2.0.

The installer is straightforward; it installs to %LOCALAPPDATA%\lemonade_server\. The server auto-starts and listens on localhost:13305.

Step 2: Install the Vulkan Backend

lemonade backends install llamacpp:vulkan

This downloads the llama.cpp Vulkan build (about 54MB). On my system it grabbed build b8668 which auto-detected the Zen 4+ CPU backend (ggml-cpu-zen4.dll) alongside the Vulkan GPU backend.

The binaries end up in C:\Users\<you>\.cache\lemonade\bin\llamacpp\vulkan\ and include the full llama.cpp suite: llama-server.exe, llama-cli.exe, llama-bench.exe, llama-quantize.exe etc. These are all directly callable if you want to bypass Lemonade entirely.

Step 3: Pull a Model

lemonade pull Qwen3.5-35B-A3B-GGUF

This grabs the UD-Q4_K_XL quantisation from unsloth’s HuggingFace repo. About 21.6GB for the weights plus an 858MB vision projector. You can also pull any GGUF from HuggingFace directly:

lemonade pull "SomeUser/SomeModel-GGUF:model.Q8_0.gguf"

Step 4: Configure for Your Workload

The default config works but there are a couple of things worth tuning. Set them with:

lemonade config set ctx_size=8192 "llamacpp.args=--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --batch-size 4096 --ubatch-size 4096 --threads 4"

What these do:

  • ctx_size=8192: context window. Adjust to your needs; the DeltaNet hybrid architecture on Qwen3.5 models has tiny KV cache overhead, so you can go much higher without issue
  • --cache-type-k q8_0 --cache-type-v q8_0: quantise the KV cache to 8-bit. Near-lossless in practice on Qwen3.5’s hybrid attention (only 10 of 40 layers use full KV attention)
  • --flash-attn on: efficient memory access patterns during attention
  • --batch-size 4096 --ubatch-size 4096: process up to 4K tokens in one batch rather than smaller chunks
  • --threads 4: CPU thread count. Over-threading hurts on MoE models since the CPU threads contend for the same memory bus as the iGPU. 4 tends to be the sweet spot
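
The KV-cache bullet above is easy to sanity-check with arithmetic. The head/dimension numbers below are illustrative placeholders (only the 10-of-40 KV-layer split comes from the text), so treat this as a sketch of the effect, not Qwen3.5’s real config:

```python
# Approximate KV cache size: K and V tensors for each layer that keeps a cache.
def kv_cache_bytes(ctx: int, kv_layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: float) -> int:
    # On a hybrid DeltaNet model only a fraction of layers use KV-cached
    # attention, so kv_layers is much smaller than the total layer count.
    return int(2 * ctx * kv_layers * kv_heads * head_dim * bytes_per_elem)

# Hypothetical dims: 8 KV heads, head_dim 128. q8_0 stores roughly one byte
# per element plus per-block scales (~1.06 bytes in practice; 1.0 used here).
full = kv_cache_bytes(8192, 40, 8, 128, 2.0)    # all 40 layers cached at fp16
hybrid = kv_cache_bytes(8192, 10, 8, 128, 1.0)  # only the 10 KV layers, 8-bit
print(f"{full / 2**20:.0f} MiB vs {hybrid / 2**20:.0f} MiB")  # 1280 MiB vs 160 MiB
```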

I tested --parallel 2 and --parallel 4 for concurrent request handling, but GPU contention made per-request throughput worse than sequential. On a single iGPU, stick with a single slot.

Step 5: Configure for Remote Access

By default Lemonade only listens on localhost. To make it accessible from other machines:

lemonade config set host=0.0.0.0

Then restart the server (stop the LemonadeServer process and re-launch it). You’ll also need to check Windows Firewall; Lemonade tends to get auto-blocked on first launch. Look for rules named lemonadeserver.exe and set them to Allow for Private/Domain networks.

After this the API is available at:

  • http://localhost:13305/v1 (local)
  • http://<your-lan-ip>:13305/v1 (LAN)
  • If you’re running Tailscale, http://<tailscale-ip>:13305/v1 works too

Step 6: Load the Model

You can either explicitly load it:

lemonade run Qwen3.5-35B-A3B-GGUF

Or just send a request and it auto-loads on first use.

Controlling Thinking Mode Per-Request

Qwen3.5 is a thinking model: it generates chain-of-thought reasoning before the actual response. For some workloads (classification, extraction, simple Q&A) this is pure overhead. For complex reasoning tasks it improves quality.

You can control this per-request without any server-side changes using chat_template_kwargs:

# Thinking disabled (fast, minimal output)
curl http://localhost:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-35B-A3B-GGUF",
    "messages": [{"role": "user", "content": "Classify this as positive or negative: Great product!"}],
    "max_tokens": 50,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

# Thinking enabled (slower, better reasoning)
curl http://localhost:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-35B-A3B-GGUF",
    "messages": [{"role": "user", "content": "Explain the trade-offs of MoE vs dense architectures"}],
    "max_tokens": 2000,
    "chat_template_kwargs": {"enable_thinking": true}
  }'

When thinking is enabled the reasoning appears in message.reasoning_content and the actual response in message.content. When disabled, only content is populated.
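
A small helper for consuming both shapes, assuming the two fields described above (this is my own convenience wrapper, not part of Lemonade):

```python
def split_response(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat completion message dict.

    reasoning_content is only present when thinking mode is enabled;
    with thinking disabled, only content is populated.
    """
    return message.get("reasoning_content") or "", message.get("content") or ""

# Thinking enabled: both fields populated.
reasoning, answer = split_response({
    "role": "assistant",
    "reasoning_content": "The user wants a sentiment classification...",
    "content": "Positive",
})
print(bool(reasoning), answer)
```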

What Models Fit

With ~48-50GB usable via Vulkan GTT, here’s what fits at practical quantisations:

Model                Type             Quant    Size     KV @ 256K   Fits?
Qwen3.5-35B-A3B      3B active MoE    Q8_0     36.9GB   ~0.7GB      Yes
Qwen3.5-27B          27B dense        Q8_0     28.6GB   ~1.0GB      Yes
Gemma-4-26B-A4B      3.8B active MoE  Q8_0     26.9GB   ~5.2GB      Yes
Qwen3-Next-80B-A3B   3B active MoE    Q3_K_M   38.3GB   ~0.8GB      Yes, lower quant
Gemma-4-31B          31B dense        Q6_K     25.2GB   ~20.8GB     Tight at 256K

The MoE models with hybrid DeltaNet attention (Qwen3.5 series) are particularly good here because their KV cache overhead is negligible: only a fraction of layers actually use KV-cached attention. You can run Qwen3.5-35B-A3B at Q8_0 with full 256K context and still have headroom.

Testing It: A Quick Chat Interface

The API is OpenAI-compatible so any client that supports a custom endpoint works. Here are a couple of ways to test.

curl

The simplest smoke test:

curl http://localhost:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-35B-A3B-GGUF",
    "messages": [{"role": "user", "content": "Hello, what model are you?"}],
    "max_tokens": 200,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Python with openai library

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:13305/v1",
    api_key="not-needed"  # lemonade doesn't require auth by default
)

response = client.chat.completions.create(
    model="Qwen3.5-35B-A3B-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the capital of Australia?"}
    ],
    max_tokens=500,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

print(response.choices[0].message.content)

Open WebUI

Open WebUI gives you a full ChatGPT-style interface. Run it in Docker and point it at your Lemonade server:

docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:13305/v1 \
  -e OPENAI_API_KEY=not-needed \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 in a browser. The model should show up automatically in the model selector.

Simple streaming chat in the terminal

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/v1", api_key="x")
messages = []

while True:
    user_input = input("\nYou: ")
    if user_input.lower() in ("quit", "exit"):
        break

    messages.append({"role": "user", "content": user_input})

    stream = client.chat.completions.create(
        model="Qwen3.5-35B-A3B-GGUF",
        messages=messages,
        max_tokens=2000,
        stream=True,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}}
    )

    print("\nAssistant: ", end="", flush=True)
    full_response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
            full_response += chunk.choices[0].delta.content

    messages.append({"role": "assistant", "content": full_response})
    print()

Performance Notes

On my system in Lenovo’s “Quiet” thermal mode (which throttles the iGPU) I get pretty variable results:

  • Prompt processing: 60-126 tok/s depending on thermal state
  • Generation: 14-16 tok/s
  • Per-request latency for ~4K token input: about 35-50 seconds (prompt-dominated)
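
Those bullets back out the observed per-request latency. Assuming a short 50-token reply and the warmer 80-126 tok/s end of the prompt-processing range, the arithmetic lands in roughly the same ballpark as the 35-50 second figure:

```python
def request_seconds(prompt_tokens: int, gen_tokens: int,
                    pp_tok_s: float, tg_tok_s: float) -> float:
    """Rough end-to-end latency: prompt processing time plus generation time."""
    return prompt_tokens / pp_tok_s + gen_tokens / tg_tok_s

# Assumed 50-token reply; speeds taken from the measured ranges above.
best = request_seconds(4096, 50, 126, 16)   # warm GPU, fast prompt processing
worst = request_seconds(4096, 50, 80, 14)   # throttled state
print(f"{best:.0f}-{worst:.0f} seconds per request")
```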

Switching Lenovo Intelligent Cooling to Performance mode (via Lenovo Vantage or Fn+T) should give more consistent speeds at the higher end. You can check your current mode from PowerShell:

(Get-ItemProperty "HKLM:\SOFTWARE\Lenovo\MultiMode").CurrentMode
# 0 = Quiet, 1 = Balanced, 2 = Performance

The generation speed (~15 tok/s) is fine for interactive chat. For batch workloads with thousands of requests the prompt processing time dominates, and disabling thinking mode via chat_template_kwargs saves a heap of wasted tokens.

What Didn’t Work

A few things I tried that aren’t worth pursuing on this hardware:

  • ollama: no native gfx1150 support. Forcing the GFX version override gives 6.4 tok/s, which is 2.5x slower than CPU-only
  • ROCm / HIP backend: hipMalloc only sees BIOS-allocated dedicated VRAM, ignores GTT. The model can’t fit
  • vllm: same ROCm dependency, same problem
  • --parallel 2/4: concurrent inference slots share the same iGPU compute; contention makes per-request throughput worse than sequential

The Vulkan backend is the only viable GPU path on AMD APU iGPUs right now. This may change when ROCm gets proper UMA/GTT support (there are patches in progress) but for now Vulkan + llama.cpp is the way to go.

Wrapping Up

I’m pretty happy with this setup. Running a 35B parameter model locally on a laptop with no discrete GPU and having it accessible as a standard OpenAI API is genuinely useful. The generation speed is fast enough for interactive use and the prompt processing is reasonable for batch work. The DeltaNet hybrid architecture on Qwen3.5 is a good fit for unified memory systems since the KV cache stays tiny even at long context lengths.

The main takeaway: if you have an AMD APU with a decent amount of RAM, use the Vulkan backend. Everything else on the AMD side is broken for iGPU inference right now.