I recently got a Lenovo ThinkPad P14s Gen 6 with the AMD Ryzen AI 9 HX PRO 370 and 56GB of LPDDR5x RAM. I wanted to see what I could actually run on it for local LLM inference, and it turns out you can run pretty large models if you know how to get around a couple of gotchas with the AMD iGPU memory model.
The short version: I’m running Qwen3.5-35B-A3B (a 35 billion parameter MoE model) and Gemma-4-26B-A4B (26B, also MoE) locally, served as an OpenAI-compatible API accessible from other machines on my network. No discrete GPU required. I’ve since put them to work on a real batch classification task (triaging ~4000 GitHub issues for the MicroPython project) and compared their output quality against Claude Sonnet.
The Hardware
The key specs that matter here:
- CPU: AMD Ryzen AI 9 HX PRO 370 (Zen 5, 12 cores / 24 threads)
- GPU: AMD Radeon 890M (RDNA 3.5 integrated, 16 compute units)
- RAM: 56GB LPDDR5x-5600 (2x 32GB Ramaxel sticks)
- Storage: 2TB SK Hynix NVMe
The important thing to understand is that the Radeon 890M has zero discrete VRAM. It’s all unified memory: the same physical LPDDR5x chips serve both CPU and GPU. Windows reports about 4GB as “dedicated GPU memory”, but that’s just a reservation policy, not a separate pool of faster memory.
The VRAM Problem (and How to Solve It)
This is where it gets tricky. There are basically three tiers of GPU memory on AMD APUs:
- UMA Frame Buffer (“dedicated VRAM”): carved out at boot time in BIOS. Usually 512MB-4GB. Changing it requires a reboot.
- WDDM Shared Memory: the Windows driver dynamically maps system RAM for GPU use at runtime. About 50% of total RAM (~28GB on my system). This is truly dynamic, no reboot needed.
- VGM (Variable Graphics Memory): AMD’s newer Strix Point feature, configured in AMD Adrenalin. Can allocate up to 75% of RAM as “dedicated”. Still requires a reboot though.
The problem is that most LLM tools (ollama, ROCm-based llama.cpp, vllm) only query the “dedicated VRAM” amount and ignore the shared pool entirely. They see 512MB of VRAM, decide the GPU is useless, and refuse to offload anything.
The fix: use the Vulkan backend. Vulkan accesses both dedicated VRAM and the GTT (Graphics Translation Table) dynamically. On my 56GB system this means about 48-50GB of GPU-addressable memory with no reboot, no VGM changes, no BIOS fiddling.
Also worth noting: ollama has no native support for gfx1150 (AMD’s internal architecture ID for the 890M), and forcing HSA_OVERRIDE_GFX_VERSION results in performance that’s slower than CPU-only. Don’t bother with it on this hardware.
Installing Lemonade Server
AMD’s Lemonade project is the easiest path to get a Vulkan-backed llama.cpp running on Windows. It’s basically a thin orchestrator that auto-detects your hardware, downloads the right llama.cpp build, and serves it as an OpenAI-compatible API.
Step 1: Download and Install
Grab the MSI from the GitHub releases page. At time of writing the latest is v10.2.0.
The installer is pretty straightforward; it installs to %LOCALAPPDATA%\lemonade_server\. The server auto-starts and listens on localhost:13305.
Step 2: Install the Vulkan Backend
```shell
# Subcommand syntax here is reconstructed and may differ between Lemonade
# releases; check `lemonade --help` if this doesn't match your version.
lemonade install --backend vulkan
```
This downloads the llama.cpp Vulkan build (about 54MB). On my system it grabbed build b8668 which auto-detected the Zen 4+ CPU backend (ggml-cpu-zen4.dll) alongside the Vulkan GPU backend.
The binaries end up in C:\Users\<you>\.cache\lemonade\bin\llamacpp\vulkan\ and include the full llama.cpp suite: llama-server.exe, llama-cli.exe, llama-bench.exe, llama-quantize.exe etc. These are all directly callable if you want to bypass Lemonade entirely.
Step 3: Pull a Model
```shell
# Registry model names are illustrative; check Lemonade's model list
# if these don't resolve on your version.
lemonade pull Qwen3.5-35B-A3B-GGUF
lemonade pull Gemma-4-26B-A4B-GGUF
```
Qwen grabs the UD-Q4_K_XL quantisation (~21.6GB weights), Gemma gets the UD-Q4_K_M (~16.1GB). You can also pull any GGUF from HuggingFace directly:
```shell
# HuggingFace pull syntax assumed; substitute the repo and quant file you want
lemonade pull <hf-user>/<hf-repo>:<quant>.gguf
```
Step 4: Configure for Your Workload
The default config works but there are a couple of things worth tuning. Set them with:
```shell
# Config key names are assumed; the llama.cpp flags themselves are real
lemonade config set ctx_size=8192
lemonade config set llamacpp_flags="--cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on --batch-size 4096 --ubatch-size 4096 --threads 4"
```
What these do:
- ctx_size=8192: context window. Adjust to your needs; reducing this from 32K to 8K gave me a 3x prompt processing improvement, since my workload only needed ~4K tokens per request
- --cache-type-k q8_0 --cache-type-v q8_0: quantise the KV cache to 8-bit. Provably lossless on Qwen3.5’s hybrid attention (only 10 of 40 layers use full KV attention). Works on Gemma too, though the benefit is less obvious
- --flash-attn on: efficient memory access patterns during attention. Note that Gemma 4 has 4 layers with head_dim=512, which exceeds flash attention’s limit, so it silently falls back on those layers
- --batch-size 4096 --ubatch-size 4096: process up to 4K tokens in one batch rather than smaller chunks
- --threads 4: CPU thread count. Over-threading hurts on MoE models since they share the memory bus with the iGPU; 4 tends to be the sweet spot
I tested --parallel 2 and --parallel 4 for concurrent request handling, but the GPU contention made per-request throughput worse than sequential. On a single iGPU, stick with a single slot.
Step 5: Configure for Remote Access
By default Lemonade only listens on localhost. To make it accessible from other machines:
```shell
# Config key name assumed; the goal is to bind to 0.0.0.0 instead of localhost
lemonade config set host=0.0.0.0
```
Then restart the server (stop the LemonadeServer process and re-launch it). You’ll also need to check Windows Firewall; Lemonade tends to get auto-blocked on first launch. Look for rules named lemonadeserver.exe and set them to Allow for Private/Domain networks.
After this the API is available at:
- http://localhost:13305/v1 (local)
- http://<your-lan-ip>:13305/v1 (LAN)
- http://<tailscale-ip>:13305/v1 (if you’re running Tailscale)
Step 6: Load the Model
You can either explicitly load it:
```shell
# Model name must match what you pulled
lemonade load Qwen3.5-35B-A3B-GGUF
```
Or just send a request and it auto-loads on first use. Switching models means unloading the current one first (lemonade unload <name>) since max_loaded_models defaults to 1.
Controlling Thinking Mode Per-Request
Both Qwen3.5 and Gemma 4 are thinking models that generate chain-of-thought reasoning before the actual response. For some workloads (classification, extraction, simple Q&A) this is pure overhead: it can burn 200-400 tokens on reasoning before emitting a single-word answer. For harder reasoning tasks, though, it’s worth the cost.
You can control this per-request without any server-side changes using chat_template_kwargs:
```shell
# Model name is whatever you loaded; chat_template_kwargs is passed through
# to the model's chat template by the server.
curl http://localhost:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-35B-A3B",
    "messages": [{"role": "user", "content": "Is this issue a duplicate? Answer in one word."}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```
When thinking is enabled the reasoning appears in message.reasoning_content and the actual response in message.content. When disabled, only content is populated.
One thing I found: Qwen3.5-35B-A3B works fine with thinking disabled and still produces good output. A distilled 9B reasoning model I tested (Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled) couldn’t produce any output at all without thinking enabled; it just returned empty content. So this varies by model.
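Given the response shape described above (reasoning in reasoning_content, answer in content), a tiny helper normalises both cases so downstream code doesn’t care whether thinking was on. This is a sketch; the field names are the ones Lemonade’s API returned in my testing:

```python
def split_response(message: dict) -> tuple[str, str]:
    """Split a chat completion message into (reasoning, answer).

    With thinking enabled both fields are populated; with thinking
    disabled only `content` is. Missing fields come back as "".
    """
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer

# Thinking enabled: both fields present
thinking = {"reasoning_content": "Compared titles...", "content": "DUPLICATE"}
# Thinking disabled: only content
direct = {"content": "DUPLICATE"}

print(split_response(thinking))  # → ('Compared titles...', 'DUPLICATE')
print(split_response(direct))    # → ('', 'DUPLICATE')
```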
What Models Fit
With ~48-50GB usable via Vulkan GTT, here’s what fits at practical quantisations:
| Model | Type | Quant | Size | KV @ 256K | Fits? |
|---|---|---|---|---|---|
| Qwen3.5-35B-A3B | 3B active MoE | Q8_0 | 36.9GB | ~0.7GB | Yes |
| Qwen3.5-27B | 27B dense | Q8_0 | 28.6GB | ~1.0GB | Yes |
| Gemma-4-26B-A4B | 3.8B active MoE | Q8_0 | 26.9GB | ~5.2GB | Yes |
| Qwen3-Next-80B-A3B | 3B active MoE | Q3_K_M | 38.3GB | ~0.8GB | Yes, lower quant |
| Gemma-4-31B | 31B dense | Q6_K | 25.2GB | ~20.8GB | Tight at 256K |
The MoE models with hybrid DeltaNet attention (Qwen3.5 series) are particularly good here because their KV cache overhead is negligible, only a fraction of layers actually use KV-cached attention. You can run Qwen3.5-35B-A3B at Q8_0 with full 256K context and still have headroom.
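To see why hybrid attention is so cheap on KV cache, here’s a back-of-envelope estimator. The head counts below are made up for illustration (they are not the real Qwen3.5 config, so the absolute numbers won’t match the table); the point is that caching KV for 10 layers instead of 40 cuts the cache by exactly 4x:

```python
def kv_cache_bytes(kv_layers: int, kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: float = 1.0) -> float:
    """Rough KV cache size: one K and one V entry per position, per KV layer.

    bytes_per_elem: 2.0 for f16, ~1.0 for q8_0.
    """
    return 2 * kv_layers * kv_heads * head_dim * ctx * bytes_per_elem

# Hybrid: only 10 of 40 layers keep a KV cache (head config is illustrative)
hybrid = kv_cache_bytes(kv_layers=10, kv_heads=4, head_dim=128, ctx=256 * 1024)
# Dense baseline: all 40 layers keep one
dense = kv_cache_bytes(kv_layers=40, kv_heads=4, head_dim=128, ctx=256 * 1024)

print(f"hybrid: {hybrid / 2**30:.1f} GiB, dense: {dense / 2**30:.1f} GiB")
# → hybrid: 2.5 GiB, dense: 10.0 GiB
```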
Real-World Use: Triaging 4000 GitHub Issues
I put this setup to work on a real task: classifying ~4000 GitHub issue/PR pairs from the MicroPython project as duplicates, related, or unrelated. Each request sent about 3-4K tokens of assembled issue text (title, body, labels, comments, diff excerpts) and asked for a short JSON classification. I ran Qwen3.5-35B-A3B as the primary classifier, then validated the results with Claude Sonnet via API, and used Gemma-4-26B-A4B as a tiebreaker on disagreements.
The Pipeline
- Qwen first pass on all 4051 pairs with thinking disabled, took about 40 hours at 35-50 seconds per pair
- Sonnet validation on the 519 pairs Qwen classified as DUPLICATE or LIKELY_DUPLICATE (the actionable ones), about 15s/pair via API
- Gemma tiebreaker on 50 pairs where Qwen and Sonnet disagreed. Ran it twice with different sampling configs to check sensitivity
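The cascade logic itself is simple. Here’s a sketch of the decision flow using the label names from my runs (the function names are mine, not from any library):

```python
def needs_validation(qwen_label):
    """Only actionable positives get escalated to Sonnet (519 of 4051 pairs)."""
    return qwen_label in {"DUPLICATE", "LIKELY_DUPLICATE"}

def final_label(qwen_label, sonnet_label=None, gemma_label=None):
    """Resolve the cascade: Qwen first pass -> Sonnet validation -> Gemma tiebreak."""
    if sonnet_label is None:           # never escalated: keep Qwen's call
        return qwen_label
    if sonnet_label == qwen_label:     # validator agrees
        return qwen_label
    if gemma_label is not None:        # disagreement: Gemma breaks the tie
        return gemma_label
    return sonnet_label                # no tiebreak run: trust the validator

print(final_label("LIKELY_DUPLICATE", "RELATED", "RELATED"))  # → RELATED
```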
How They Compared
Qwen’s DUPLICATE classifications were pretty solid: Sonnet confirmed 62% as DUPLICATE and another 30% as LIKELY_DUPLICATE (still actionable). So 92% of Qwen’s DUPLICATE calls were useful.
Qwen’s LIKELY_DUPLICATE calls were much weaker though. Sonnet downgraded 78% of these to RELATED, meaning Qwen was over-promoting at that confidence level. On the positive side, Qwen’s errors were one-directional: it over-classified things as duplicates but never missed actual duplicates. So nothing was lost, just over-counted.
Gemma as tiebreaker was interesting. On the main disagreement bucket (Qwen says LIKELY_DUPLICATE, Sonnet says RELATED) Gemma sided with Sonnet 76% of the time. But when Qwen said DUPLICATE and Sonnet softened to LIKELY_DUPLICATE, Gemma sided with Qwen 100% of the time (8/8). So Sonnet may actually be overly conservative on the clearer cases.
I also tested Gemma with two different sampling configs (temp=0.1 vs Google’s recommended temp=1.0 with top_p=0.95, top_k=64) and there was no meaningful difference in classification decisions. Sampling parameters don’t affect speed, only which token gets picked from the probability distribution, so there’s no cost to using Google’s recommended defaults.
Throughput
| Model | Avg latency | Prompt tok/s | Gen tok/s | Cost |
|---|---|---|---|---|
| Qwen3.5-35B-A3B | 35-50s/pair | 80-130 | 10-16 | Free |
| Gemma-4-26B-A4B | 42-45s/pair | 14-16 | 6-16 | Free |
| Claude Sonnet | ~15s/pair | N/A (API) | N/A (API) | ~$0.01/pair |
Qwen’s prompt processing was faster than Gemma’s, partly because of the DeltaNet hybrid architecture and partly because Qwen benefits from llama.cpp’s prompt prefix caching (repeated system prompts are processed once). Gemma 4 has a known issue where its shared KV cache architecture breaks prefix caching, so every request re-evaluates the full prompt. If that gets fixed Gemma should be a fair bit faster for batch work.
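The effect of losing prefix caching is easy to quantify. With a shared system prompt of similar size to the unique issue text (the token counts below are illustrative), the server does roughly twice the prompt work per batch:

```python
def prompt_tokens_evaluated(n_requests, shared_prefix, unique_suffix, prefix_cached):
    """Total prompt tokens the server must evaluate across a batch."""
    if prefix_cached:
        # shared system prompt evaluated once, then only the per-request tail
        return shared_prefix + n_requests * unique_suffix
    # no caching: every request re-evaluates the full prompt
    return n_requests * (shared_prefix + unique_suffix)

# Illustrative: 4051 requests, ~2K shared system prompt, ~2K unique issue text
with_cache = prompt_tokens_evaluated(4051, 2000, 2000, True)
without = prompt_tokens_evaluated(4051, 2000, 2000, False)
print(f"{without / with_cache:.2f}x more prompt work without prefix caching")
# → 2.00x more prompt work without prefix caching
```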
Generation speed was comparable between the two models, both in the 10-16 tok/s range with variation due to thermal throttling.
Cost
Running Qwen across all 4051 pairs took about 40 hours. Sonnet validated the 519 positive cases in a couple of hours for about $5. The combination caught the same issues that a full Sonnet run would have, at ~69% lower API cost. For a project with ~1374 open issues, that’s a pretty practical workflow.
The local models aren’t fast enough to replace a cloud API for interactive use, but for batch classification where you can let it run overnight they’re basically free compute.
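As a sanity check, the headline numbers above are consistent with each other:

```python
# 4051 pairs over ~40 hours of local inference
pairs_total = 4051
hours_local = 40
print(f"{hours_local * 3600 / pairs_total:.0f}s per pair locally")
# → 36s per pair locally (within the observed 35-50s range)

# 519 escalated pairs at roughly $0.01/pair via the Sonnet API
validated = 519
cost_per_pair = 0.01
print(f"Sonnet validation: ${validated * cost_per_pair:.2f}")
# → Sonnet validation: $5.19
```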
Server Tuning: What Helped and What Didn’t
| Configuration change | Effect |
|---|---|
| ctx_size 32768 → 8192 | Prompt processing 40 → 130 t/s (3x improvement) |
| --cache-type-k/v q8_0 | Lossless on Qwen3.5 hybrid attention, halves KV memory |
| --flash-attn on | Marginal improvement |
| --batch-size/ubatch-size 4096 | Faster prompt ingestion for ~4K token inputs |
| --parallel 2 --cont-batching | Slower than single slot (memory bandwidth bound) |
| Thinking mode enabled | 4x slower per request, ~350 tokens of reasoning for a 1-word answer |
The ctx_size reduction was the single biggest win. If your prompts are ~4K tokens with ~300 token output, there’s no reason to allocate 32K of context. The KV cache memory reduction means more bandwidth for the actual computation.
Lenovo Intelligent Cooling also matters. I was getting pretty variable prompt speeds (60-126 tok/s) and it turned out the laptop was in Quiet mode, actively throttling the iGPU. You can check this from PowerShell:
```powershell
# Lenovo's modes map onto Windows power schemes; Vantage/Commercial Vantage
# shows the Intelligent Cooling mode directly.
powercfg /getactivescheme
```
Gemma 4 Gotchas
A couple of things to be aware of with Gemma 4 on llama.cpp Vulkan:
- There’s a Vulkan-specific bug where Gemma 4 can generate infinite <unused8> tokens. I didn’t hit this on b8668 but it’s been reported
- Prompt prefix caching is broken due to the shared KV cache architecture. Every request re-evaluates the full prompt. Big impact on batch workloads with a shared system prompt
- Builds from b8744 onward have a ~29% throughput regression on Vulkan Windows. Our b8668 is safe. Don’t upgrade past b8742 until that’s resolved
- Google recommends specific sampling defaults (temp=1.0, top_p=0.95, top_k=64). These only affect output quality not speed, so they can be set per-request
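Since sampling is per-request, Google’s recommended defaults can go straight into the request body. A sketch of the payload construction (the helper name is mine; the model name is whatever you pulled):

```python
import json

def gemma_payload(messages, model="Gemma-4-26B-A4B"):
    """Chat completion body with Google's recommended Gemma sampling.

    top_k isn't part of the OpenAI schema, but llama.cpp's server accepts
    it as an extra field on /v1/chat/completions.
    """
    return {
        "model": model,
        "messages": messages,
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,
    }

body = json.dumps(gemma_payload([{"role": "user", "content": "Classify: ..."}]))
```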
Testing It: A Quick Chat Interface
The API is OpenAI-compatible so any client that supports a custom endpoint works. Here are a couple of ways to test.
curl
The simplest smoke test:
```shell
# Substitute the model name you loaded
curl http://localhost:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.5-35B-A3B", "messages": [{"role": "user", "content": "Say hello"}]}'
```
Python with openai library
```python
from openai import OpenAI

# Lemonade doesn't check the API key, but the client requires one to be set
client = OpenAI(base_url="http://localhost:13305/v1", api_key="none")

resp = client.chat.completions.create(
    model="Qwen3.5-35B-A3B",  # whichever model you loaded
    messages=[{"role": "user", "content": "Say hello"}],
)
print(resp.choices[0].message.content)
```
Open WebUI
Open WebUI gives you a full ChatGPT-style interface. Run it in Docker and point it at your Lemonade server:
```shell
# host.docker.internal lets the container reach Lemonade on the Windows host
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:13305/v1 \
  -e OPENAI_API_KEY=none \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main
```
Then open http://localhost:3000 in a browser. The model should show up automatically in the model selector.
Simple streaming chat in the terminal
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/v1", api_key="none")
history = []
while True:
    user = input("you> ")
    history.append({"role": "user", "content": user})
    stream = client.chat.completions.create(
        model="Qwen3.5-35B-A3B",  # whichever model you loaded
        messages=history,
        stream=True,
    )
    reply = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
        reply += delta
    print()
    history.append({"role": "assistant", "content": reply})
```
What Didn’t Work
A few things I tried that aren’t worth pursuing on this hardware:
- ollama: no native gfx1150 support. Forcing the GFX version override gives 6.4 tok/s, which is 2.5x slower than CPU-only
- ROCm / HIP backend: hipMalloc only sees BIOS-allocated dedicated VRAM and ignores GTT, so the model can’t fit
- vllm: same ROCm dependency, same problem
- --parallel 2/4: concurrent inference slots share the same iGPU compute; contention makes per-request throughput worse than sequential
- Distilled reasoning models: tested a 9B model distilled from Claude Opus reasoning. It can’t produce output without thinking enabled, making it ~100x more expensive per classification in token budget. The 35B MoE with thinking disabled was both faster and more accurate
Basically, Vulkan is the only viable GPU path on AMD APU iGPUs right now. Everything else on the AMD side is broken for iGPU inference.
What’s Next
I’m pretty happy with this setup. Next step is automating the Qwen-then-Sonnet pipeline so it can run unattended on new MicroPython issues as they come in. There are a couple of llama.cpp fixes I’m watching that could improve things too, particularly the Gemma 4 prompt cache fix which would make batch workloads on Gemma a fair bit faster.
If ROCm ever gets proper GTT/UMA support on APU iGPUs, the whole Vulkan workaround becomes unnecessary. There are patches in progress but nothing has landed yet. For what it’s worth, a dedicated box with a used RTX 3090 (~AUD$1500) would give about 7x the throughput, but for overnight batch work the laptop iGPU is honestly fine.