I recently got a Lenovo ThinkPad P14s Gen 6 with the AMD Ryzen AI 9 HX PRO 370 and 56GB of LPDDR5x RAM. I wanted to see what I could actually run on it for local LLM inference, and it turns out you can run pretty large models if you know how to get around a couple of gotchas with the AMD iGPU memory model.

The short version: I’m running Qwen3.5-35B-A3B (a 35 billion parameter MoE model) and Gemma-4-26B-A4B (26B, also MoE) locally, served as an OpenAI-compatible API accessible from other machines on my network. No discrete GPU required. I’ve since put them to work on a real batch classification task (triaging ~4000 GitHub issues for the MicroPython project) and compared their output quality against Claude Sonnet.

The Hardware

The key specs that matter here:

  • CPU: AMD Ryzen AI 9 HX PRO 370 (Zen 5, 12 cores / 24 threads)
  • GPU: AMD Radeon 890M (RDNA 3.5 integrated, 16 compute units)
  • RAM: 56GB LPDDR5x-5600 (2x 32GB Ramaxel sticks)
  • Storage: 2TB SK Hynix NVMe

The important thing to understand is that the Radeon 890M has zero discrete VRAM. It’s all unified memory: the same physical LPDDR5x chips serve both CPU and GPU. Windows reports about 4GB as “dedicated GPU memory”, but that’s just a reservation policy, not a separate pool of faster memory.

The VRAM Problem (and How to Solve It)

This is where it gets tricky. There are basically three tiers of GPU memory on AMD APUs:

  1. UMA Frame Buffer (“dedicated VRAM”): carved out at boot time in BIOS. Usually 512MB-4GB. Changing it requires a reboot.
  2. WDDM Shared Memory: the Windows driver dynamically maps system RAM for GPU use at runtime. About 50% of total RAM (~28GB on my system). This is truly dynamic, no reboot needed.
  3. VGM (Variable Graphics Memory): AMD’s newer Strix Point feature, configured in AMD Adrenalin. Can allocate up to 75% of RAM as “dedicated”. Still requires a reboot though.

The problem is that most LLM tools (ollama, ROCm-based llama.cpp, vllm) only query the “dedicated VRAM” amount and ignore the shared pool entirely. They see 512MB of VRAM, decide the GPU is useless, and refuse to offload anything.

The fix: use the Vulkan backend. Vulkan accesses both dedicated VRAM and the GTT (Graphics Translation Table) dynamically. On my 56GB system this means about 48-50GB of GPU-addressable memory with no reboot, no VGM changes, no BIOS fiddling.
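
You can sanity-check what Vulkan actually sees by dumping the driver’s memory heaps. A rough sketch, assuming the vulkaninfo tool that ships with the Vulkan SDK/driver is on your PATH (the output format varies between driver versions, so treat the parsing as approximate):

import re
import subprocess

# Dump the Vulkan device report and pull out the memory heap sizes.
# On an APU the small device-local heap is the BIOS carve-out; the large
# host-visible heap is the GTT-backed pool the Vulkan backend can use.
out = subprocess.run(["vulkaninfo"], capture_output=True, text=True).stdout

sizes = sorted({int(s) for s in re.findall(r"^\s*size\s*=\s*(\d+)", out, re.M)},
               reverse=True)
for size in sizes:
    print(f"{size / 2**30:5.1f} GiB heap")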

Also worth noting: ollama has no native support for gfx1150 (AMD’s internal architecture ID for the 890M), and forcing HSA_OVERRIDE_GFX_VERSION results in performance that’s slower than CPU-only. Don’t bother with it on this hardware.

Installing Lemonade Server

AMD’s Lemonade project is the easiest path to get a Vulkan-backed llama.cpp running on Windows. It’s basically a thin orchestrator that auto-detects your hardware, downloads the right llama.cpp build, and serves it as an OpenAI-compatible API.

Step 1: Download and Install

Grab the MSI from the GitHub releases page. At time of writing the latest is v10.2.0.

The installer is pretty straightforward; it installs to %LOCALAPPDATA%\lemonade_server\. The server auto-starts and listens on localhost:13305.

Step 2: Install the Vulkan Backend

lemonade backends install llamacpp:vulkan

This downloads the llama.cpp Vulkan build (about 54MB). On my system it grabbed build b8668, which auto-detected the Zen 4+ CPU backend (ggml-cpu-zen4.dll) alongside the Vulkan GPU backend.

The binaries end up in C:\Users\<you>\.cache\lemonade\bin\llamacpp\vulkan\ and include the full llama.cpp suite: llama-server.exe, llama-cli.exe, llama-bench.exe, llama-quantize.exe, and so on. These are all directly callable if you want to bypass Lemonade entirely.
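
If you do want to skip Lemonade’s orchestration, launching llama-server from that directory yourself works. A minimal sketch via Python’s subprocess; the model path is a placeholder and the flags are just the usual llama.cpp server options, so check llama-server --help for your build:

import subprocess
from pathlib import Path

# Illustrative paths -- adjust the user profile and model file to your setup.
bin_dir = Path.home() / ".cache/lemonade/bin/llamacpp/vulkan"
model = Path(r"C:\path\to\your\model.gguf")  # hypothetical placeholder

# Standard llama.cpp server flags: -ngl 99 offloads every layer to the
# (Vulkan) GPU, -c sets the context window, --port picks the listen port.
subprocess.run([
    str(bin_dir / "llama-server.exe"),
    "-m", str(model),
    "-ngl", "99",
    "-c", "8192",
    "--port", "8080",
])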

Step 3: Pull a Model

lemonade pull Qwen3.5-35B-A3B-GGUF
lemonade pull Gemma-4-26B-A4B-it-GGUF

Qwen grabs the UD-Q4_K_XL quantisation (~21.6GB of weights); Gemma gets the UD-Q4_K_M (~16.1GB). You can also pull any GGUF from HuggingFace directly:

lemonade pull "SomeUser/SomeModel-GGUF:model.Q8_0.gguf"

Step 4: Configure for Your Workload

The default config works but there are a couple of things worth tuning. Set them with:

lemonade config set ctx_size=8192 "llamacpp.args=--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --batch-size 4096 --ubatch-size 4096 --threads 4"

What these do:

  • ctx_size=8192: context window. Adjust to your needs, though reducing this from 32K to 8K gave me a 3x prompt processing improvement since my workload only needed ~4K tokens per request
  • --cache-type-k q8_0 --cache-type-v q8_0: quantise the KV cache to 8-bit. Effectively lossless on Qwen3.5’s hybrid attention (only 10 of 40 layers use full KV attention). Works on Gemma too, though the benefit is less obvious
  • --flash-attn on: efficient memory access patterns during attention. Note that Gemma 4 has 4 layers with head_dim=512, which exceeds flash attention’s limit, so it silently falls back to the regular attention path on those layers
  • --batch-size 4096 --ubatch-size 4096: process up to 4K tokens in one batch rather than smaller chunks
  • --threads 4: CPU thread count. Over-threading hurts on MoE models since they share the memory bus with the iGPU. 4 tends to be the sweet spot

I tested --parallel 2 and --parallel 4 for concurrent request handling, but GPU contention made per-request throughput worse than sequential. On a single iGPU, stick with a single slot.

Step 5: Configure for Remote Access

By default Lemonade only listens on localhost. To make it accessible from other machines:

lemonade config set host=0.0.0.0

Then restart the server (stop the LemonadeServer process and re-launch it). You’ll also need to check Windows Firewall, since Lemonade tends to get auto-blocked on first launch. Look for rules named lemonadeserver.exe and set them to Allow for Private/Domain networks.

After this the API is available at:

  • http://localhost:13305/v1 (local)
  • http://<your-lan-ip>:13305/v1 (LAN)
  • If you’re running Tailscale, http://<tailscale-ip>:13305/v1 works too
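
To confirm the server is reachable from another machine, hitting the standard OpenAI-style /v1/models listing is the quickest test (the IP below is a placeholder for your LAN or Tailscale address):

import requests

BASE_URL = "http://192.168.1.50:13305/v1"  # placeholder LAN/Tailscale IP

# List the models the server exposes; any HTTP error here usually means
# the firewall rule or the host binding still needs attention.
resp = requests.get(f"{BASE_URL}/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])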

Step 6: Load the Model

You can either explicitly load it:

lemonade run Qwen3.5-35B-A3B-GGUF

Or just send a request and it auto-loads on first use. Switching models means unloading the current one first (lemonade unload <name>) since max_loaded_models defaults to 1.

Controlling Thinking Mode Per-Request

Both Qwen3.5 and Gemma 4 are thinking models that generate chain-of-thought reasoning before the actual response. For some workloads (classification, extraction, simple Q&A) this is pure overhead: the model can burn 200-400 tokens on reasoning before emitting a single-word answer. For harder reasoning tasks, though, it’s worth the cost.

You can control this per-request without any server-side changes using chat_template_kwargs:

# Thinking disabled (fast, minimal output)
curl http://localhost:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-35B-A3B-GGUF",
    "messages": [{"role": "user", "content": "Classify this as positive or negative: Great product!"}],
    "max_tokens": 50,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

When thinking is enabled the reasoning appears in message.reasoning_content and the actual response in message.content. When disabled, only content is populated.
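
With the openai client the same fields are accessible on the message object, although reasoning_content is a server-side extension rather than part of the official schema, so it’s worth reading it defensively:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3.5-35B-A3B-GGUF",
    messages=[{"role": "user", "content": "Is 1013 a prime number? Answer yes or no."}],
    max_tokens=500,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

msg = resp.choices[0].message
# reasoning_content is non-standard, so fall back gracefully if your client
# version doesn't surface it as an attribute.
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:   ", msg.content)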

One thing I found: Qwen3.5-35B-A3B works fine with thinking disabled and still produces good output. A distilled 9B reasoning model I tested (Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled) couldn’t produce any output at all without thinking enabled; it just returned empty content. So this varies by model.

What Models Fit

With ~48-50GB usable via Vulkan GTT, here’s what fits at practical quantisations:

Model | Type | Quant | Size | KV @ 256K | Fits?
Qwen3.5-35B-A3B | 3B active MoE | Q8_0 | 36.9GB | ~0.7GB | Yes
Qwen3.5-27B | 27B dense | Q8_0 | 28.6GB | ~1.0GB | Yes
Gemma-4-26B-A4B | 3.8B active MoE | Q8_0 | 26.9GB | ~5.2GB | Yes
Qwen3-Next-80B-A3B | 3B active MoE | Q3_K_M | 38.3GB | ~0.8GB | Yes, lower quant
Gemma-4-31B | 31B dense | Q6_K | 25.2GB | ~20.8GB | Tight at 256K

The MoE models with hybrid DeltaNet attention (the Qwen3.5 series) are particularly good here because their KV cache overhead is negligible: only a fraction of layers actually use KV-cached attention. You can run Qwen3.5-35B-A3B at Q8_0 with the full 256K context and still have headroom.
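
The KV figures in the table follow the usual back-of-envelope formula: bytes ≈ 2 (K and V) × KV layers × KV heads × head dim × context length × bytes per element. A rough sketch with illustrative parameters; the head counts and dimensions below are placeholders, not the real model configs:

def kv_cache_gib(kv_layers: int, kv_heads: int, head_dim: int,
                 ctx: int, bytes_per_elem: float) -> float:
    """Rough KV cache size: one K and one V tensor per KV-attention layer."""
    return 2 * kv_layers * kv_heads * head_dim * ctx * bytes_per_elem / 2**30

# Hypothetical hybrid-attention MoE: only 10 of 40 layers keep a full KV
# cache, and a q8_0 cache is roughly 1 byte per element plus block overhead.
print(kv_cache_gib(kv_layers=10, kv_heads=4, head_dim=128,
                   ctx=262_144, bytes_per_elem=1.07))   # ~2.7 GiB

# Hypothetical dense model with KV in every layer: grows proportionally.
print(kv_cache_gib(kv_layers=40, kv_heads=8, head_dim=128,
                   ctx=262_144, bytes_per_elem=1.07))   # ~21.4 GiB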

Real-World Use: Triaging 4000 GitHub Issues

I put this setup to work on a real task: classifying ~4000 GitHub issue/PR pairs from the MicroPython project as duplicates, related, or unrelated. Each request sent about 3-4K tokens of assembled issue text (title, body, labels, comments, diff excerpts) and asked for a short JSON classification. I ran Qwen3.5-35B-A3B as the primary classifier, then validated the results with Claude Sonnet via API, and used Gemma-4-26B-A4B as a tiebreaker on disagreements.
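
Structurally, each request is just the same OpenAI-style call with a system prompt describing the labels and the assembled pair text as the user message. A simplified sketch (the prompt wording and helper are mine, not the exact pipeline code):

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/v1", api_key="not-needed")

SYSTEM = ("You are triaging GitHub issues. Classify the issue/PR pair as "
          "DUPLICATE, LIKELY_DUPLICATE, RELATED, or UNRELATED. "
          'Reply with JSON only: {"label": "...", "reason": "..."}')

def classify(pair_text: str) -> dict:
    # pair_text is the assembled title/body/labels/comments/diff excerpt,
    # roughly 3-4K tokens per request.
    resp = client.chat.completions.create(
        model="Qwen3.5-35B-A3B-GGUF",
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": pair_text}],
        max_tokens=300,
        temperature=0.1,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )
    return json.loads(resp.choices[0].message.content)

In practice you’d also want to guard that json.loads against the model wrapping its answer in markdown fences or extra prose.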

The Pipeline

  1. Qwen first pass on all 4051 pairs with thinking disabled, took about 40 hours at 35-50 seconds per pair
  2. Sonnet validation on the 519 pairs Qwen classified as DUPLICATE or LIKELY_DUPLICATE (the actionable ones), about 15s/pair via API
  3. Gemma tiebreaker on 50 pairs where Qwen and Sonnet disagreed. Ran it twice with different sampling configs to check sensitivity

How They Compared

Qwen’s DUPLICATE classifications were pretty solid: Sonnet confirmed 62% as DUPLICATE and another 30% as LIKELY_DUPLICATE (still actionable). So 92% of Qwen’s DUPLICATE calls were useful.

Qwen’s LIKELY_DUPLICATE calls were much weaker though. Sonnet downgraded 78% of these to RELATED, meaning Qwen was over-promoting at that confidence level. On the positive side, Qwen’s errors were one-directional: it over-classified things as duplicates but never missed actual duplicates. So nothing was lost, just over-counted.

Gemma as tiebreaker was interesting. On the main disagreement bucket (Qwen says LIKELY_DUPLICATE, Sonnet says RELATED) Gemma sided with Sonnet 76% of the time. But when Qwen said DUPLICATE and Sonnet softened to LIKELY_DUPLICATE, Gemma sided with Qwen 100% of the time (8/8). So Sonnet may actually be overly conservative on the clearer cases.

I also tested Gemma with two different sampling configs (temp=0.1 vs Google’s recommended temp=1.0 with top_p=0.95, top_k=64) and there was no meaningful difference in classification decisions. Sampling parameters don’t affect speed, only which token gets picked from the probability distribution, so there’s no cost to using Google’s recommended defaults.
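
Switching between the two configs is just a matter of changing the request body. Temperature and top_p are standard OpenAI parameters; top_k isn’t, so it’s passed through extra_body on the assumption that the llama.cpp-backed server accepts it:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/v1", api_key="not-needed")

# Config A: near-greedy. Config B: Google's recommended Gemma defaults.
configs = {
    "near_greedy":     {"temperature": 0.1},
    "google_defaults": {"temperature": 1.0, "top_p": 0.95,
                        "extra_body": {"top_k": 64}},
}

for name, cfg in configs.items():
    extra = cfg.pop("extra_body", {})
    resp = client.chat.completions.create(
        model="Gemma-4-26B-A4B-it-GGUF",
        messages=[{"role": "user", "content": "PAIR TEXT HERE"}],  # placeholder
        max_tokens=50,
        extra_body={**extra,
                    "chat_template_kwargs": {"enable_thinking": False}},
        **cfg,
    )
    print(name, "->", resp.choices[0].message.content)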

Throughput

Model | Avg latency | Prompt tok/s | Gen tok/s | Cost
Qwen3.5-35B-A3B | 35-50s/pair | 80-130 | 10-16 | Free
Gemma-4-26B-A4B | 42-45s/pair | 14-16 | 6-16 | Free
Claude Sonnet | ~15s/pair | N/A (API) | N/A (API) | ~$0.01/pair

Qwen’s prompt processing was faster than Gemma’s, partly because of the DeltaNet hybrid architecture and partly because Qwen benefits from llama.cpp’s prompt prefix caching (repeated system prompts are processed once). Gemma 4 has a known issue where its shared KV cache architecture breaks prefix caching, so every request re-evaluates the full prompt. If that gets fixed Gemma should be a fair bit faster for batch work.

Generation speed was comparable between the two models, both in the 10-16 tok/s range with variation due to thermal throttling.

Cost

Running Qwen across all 4051 pairs took about 40 hours. Sonnet then validated the 519 positive cases in a couple of hours for about $5, versus the ~$40 a full Sonnet pass over all 4051 pairs would have cost at ~$0.01/pair. The combination caught the same issues that a full Sonnet run would have. For a project with ~1374 open issues, that’s a pretty practical workflow.

The local models aren’t fast enough to replace a cloud API for interactive use, but for batch classification where you can let it run overnight they’re basically free compute.

Server Tuning: What Helped and What Didn’t

Configuration change | Effect
ctx_size 32768 → 8192 | Prompt processing 40 → 130 t/s (3x improvement)
--cache-type-k/v q8_0 | Effectively lossless on Qwen3.5 hybrid attention, halves KV memory
--flash-attn on | Marginal improvement
--batch-size/--ubatch-size 4096 | Faster prompt ingestion for ~4K token inputs
--parallel 2 --cont-batching | Slower than single slot (memory bandwidth bound)
Thinking mode enabled | 4x slower per request, ~350 tokens of reasoning for a 1-word answer

The ctx_size reduction was the single biggest win. If your prompts are ~4K tokens with ~300 token output, there’s no reason to allocate 32K of context. The KV cache memory reduction means more bandwidth for the actual computation.

Lenovo Intelligent Cooling also matters. I was getting pretty variable prompt speeds (60-126 tok/s), and it turned out the laptop was in Quiet mode, actively throttling the iGPU. You can check this from PowerShell:

(Get-ItemProperty "HKLM:\SOFTWARE\Lenovo\MultiMode").CurrentMode
# 0 = Quiet, 1 = Balanced, 2 = Performance

Gemma 4 Gotchas

A couple of things to be aware of with Gemma 4 on llama.cpp Vulkan:

  • There’s a Vulkan-specific bug where Gemma 4 can generate infinite <unused8> tokens. I didn’t hit this on b8668 but it’s been reported
  • Prompt prefix caching is broken due to the shared KV cache architecture. Every request re-evaluates the full prompt. Big impact on batch workloads with a shared system prompt
  • Builds from b8744 onward have a ~29% throughput regression on Vulkan Windows. The b8668 build used here is unaffected. Don’t upgrade to b8744 or later until that’s resolved
  • Google recommends specific sampling defaults (temp=1.0, top_p=0.95, top_k=64). These only affect output quality not speed, so they can be set per-request

Testing It: A Quick Chat Interface

The API is OpenAI-compatible so any client that supports a custom endpoint works. Here are a couple of ways to test.

curl

The simplest smoke test:

curl http://localhost:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-35B-A3B-GGUF",
    "messages": [{"role": "user", "content": "Hello, what model are you?"}],
    "max_tokens": 200,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Python with openai library

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:13305/v1",
    api_key="not-needed"  # lemonade doesn't require auth by default
)

response = client.chat.completions.create(
    model="Qwen3.5-35B-A3B-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the capital of Australia?"}
    ],
    max_tokens=500,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

print(response.choices[0].message.content)

Open WebUI

Open WebUI gives you a full ChatGPT-style interface. Run it in Docker and point it at your Lemonade server:

docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:13305/v1 \
  -e OPENAI_API_KEY=not-needed \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 in a browser. The model should show up automatically in the model selector.

Simple streaming chat in the terminal

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/v1", api_key="x")
messages = []

while True:
    user_input = input("\nYou: ")
    if user_input.lower() in ("quit", "exit"):
        break

    messages.append({"role": "user", "content": user_input})

    stream = client.chat.completions.create(
        model="Qwen3.5-35B-A3B-GGUF",
        messages=messages,
        max_tokens=2000,
        stream=True,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}}
    )

    print("\nAssistant: ", end="", flush=True)
    full_response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
            full_response += chunk.choices[0].delta.content

    messages.append({"role": "assistant", "content": full_response})
    print()

What Didn’t Work

A few things I tried that aren’t worth pursuing on this hardware:

  • ollama: no native gfx1150 support. Forcing the GFX version override gives 6.4 tok/s, which is 2.5x slower than CPU-only
  • ROCm / HIP backend: hipMalloc only sees BIOS-allocated dedicated VRAM, ignores GTT. The model can’t fit
  • vllm: same ROCm dependency, same problem
  • --parallel 2/4: concurrent inference slots share the same iGPU compute, and contention makes per-request throughput worse than sequential
  • Distilled reasoning models: tested a 9B model distilled from Claude Opus reasoning. It can’t produce output without thinking enabled, making it ~100x more expensive per classification in token budget. The 35B MoE with thinking disabled was both faster and more accurate

Basically, Vulkan is the only viable GPU path on AMD APU iGPUs right now. Everything else on the AMD side is broken for iGPU inference.

What’s Next

I’m pretty happy with this setup. Next step is automating the Qwen-then-Sonnet pipeline so it can run unattended on new MicroPython issues as they come in. There are a couple of llama.cpp fixes I’m watching that could improve things too, particularly the Gemma 4 prompt cache fix which would make batch workloads on Gemma a fair bit faster.

If ROCm ever gets proper GTT/UMA support on APU iGPUs, the whole Vulkan workaround becomes unnecessary. There are patches in progress but nothing has landed yet. For what it’s worth, a dedicated box with a used RTX 3090 (~AUD$1500) would give about 7x the throughput, but for overnight batch work the laptop iGPU is honestly fine.