Running Local LLMs on an AMD APU Laptop with 56GB Unified Memory
I recently got a Lenovo ThinkPad P14s Gen 6 with the AMD Ryzen AI 9 HX PRO 370 and 56GB of LPDDR5x RAM. I wanted to see what I could actually run on it for local LLM inference, and it turns out the answer is “quite a lot” if you know how to get around a couple of gotchas with the AMD iGPU memory model.
The short version: I’m running Qwen3.5-35B-A3B (a 35 billion parameter MoE model) at about 14-16 tokens/second generation and 80-126 tok/s prompt processing, served as an OpenAI-compatible API accessible from other machines on my network. No discrete GPU required.
The Hardware
The key specs that matter here:
- CPU: AMD Ryzen AI 9 HX PRO 370 (Zen 5, 12 cores / 24 threads)
- GPU: AMD Radeon 890M (RDNA 3.5 integrated, 16 compute units)
- RAM: 56GB LPDDR5x-5600 (2x 32GB Ramaxel sticks)
- Storage: 2TB SK Hynix NVMe
The important thing to understand is that the Radeon 890M has zero discrete VRAM. It’s all unified memory: the same physical LPDDR5x chips serve both CPU and GPU. Windows reports about 4GB as “dedicated GPU memory” but that’s just a reservation policy, not a separate pool of faster memory.
The VRAM Problem (and How to Solve It)
This is where it gets tricky. There are basically three tiers of GPU memory on AMD APUs:
- UMA Frame Buffer (“dedicated VRAM”): carved out at boot time in BIOS. Usually 512MB-4GB. Changing it requires a reboot.
- WDDM Shared Memory: the Windows driver dynamically maps system RAM for GPU use at runtime. About 50% of total RAM (~28GB on my system). This is truly dynamic, no reboot needed.
- VGM (Variable Graphics Memory): AMD’s newer Strix Point feature, configured in AMD Adrenalin. Can allocate up to 75% of RAM as “dedicated”. Still requires a reboot though.
The problem is that most LLM tools (ollama, ROCm-based llama.cpp, vllm) only query the “dedicated VRAM” amount and ignore the shared pool entirely. They see 512MB of VRAM, decide the GPU is useless, and refuse to offload anything.
The fix: use the Vulkan backend. Vulkan accesses both dedicated VRAM and the GTT (Graphics Translation Table) dynamically. On my 56GB system this means about 48-50GB of GPU-addressable memory with no reboot, no VGM changes, no BIOS fiddling.
Also worth noting: ollama has no native gfx1150 support for the 890M, and forcing HSA_OVERRIDE_GFX_VERSION results in performance that’s slower than CPU-only. Don’t bother with it on this hardware.
Installing Lemonade Server
AMD’s Lemonade project is the easiest path to get a Vulkan-backed llama.cpp running on Windows. It’s basically a thin orchestrator that auto-detects your hardware, downloads the right llama.cpp build, and serves it as an OpenAI-compatible API.
Step 1: Download and Install
Grab the MSI from the GitHub releases page. At time of writing the latest is v10.2.0.
The installer is pretty straightforward; it installs to %LOCALAPPDATA%\lemonade_server\. The server auto-starts and listens on localhost:13305.
Step 2: Install the Vulkan Backend
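The subcommand name below is an assumption from memory, so verify it against lemonade-server --help for your version before relying on it:

```shell
# Assumed subcommand for fetching the Vulkan llama.cpp backend;
# verify with: lemonade-server --help
lemonade-server install --backend vulkan
```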
This downloads the llama.cpp Vulkan build (about 54MB). On my system it grabbed build b8668 which auto-detected the Zen 4+ CPU backend (ggml-cpu-zen4.dll) alongside the Vulkan GPU backend.
The binaries end up in C:\Users\<you>\.cache\lemonade\bin\llamacpp\vulkan\ and include the full llama.cpp suite: llama-server.exe, llama-cli.exe, llama-bench.exe, llama-quantize.exe etc. These are all directly callable if you want to bypass Lemonade entirely.
Step 3: Pull a Model
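The model identifier below is illustrative, not verified against Lemonade's catalog; lemonade-server list shows what names are actually available:

```shell
# Identifier is illustrative; check lemonade-server list for the real catalog name.
lemonade-server pull Qwen3.5-35B-A3B-GGUF
```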
This grabs the UD-Q4_K_XL quantisation from unsloth’s HuggingFace repo. About 21.6GB for the weights plus an 858MB vision projector. You can also pull any GGUF from HuggingFace directly:
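The repo:quant identifier style here is an assumption; check the Lemonade docs for the exact format it expects for direct HuggingFace pulls:

```shell
# Assumed repo:quant syntax for pulling a GGUF straight from HuggingFace.
lemonade-server pull unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL
```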
Step 4: Configure for Your Workload
The default config works but there are a couple of things worth tuning. Set them with:
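These are standard llama-server options passed through Lemonade's model settings. Since the llama.cpp binaries in the cache directory are directly callable, the equivalent standalone invocation makes the flags explicit (model filename illustrative):

```shell
:: Standard llama-server flags; model filename is illustrative.
llama-server -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf ^
  --ctx-size 8192 ^
  --cache-type-k q8_0 --cache-type-v q8_0 ^
  --flash-attn on ^
  --batch-size 4096 --ubatch-size 4096 ^
  --threads 4
```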
What these do:
- ctx_size=8192: context window. Adjust to your needs; the DeltaNet hybrid architecture on Qwen3.5 models has tiny KV cache overhead, so you can go much higher without issue
- --cache-type-k q8_0 --cache-type-v q8_0: quantise the KV cache to 8-bit. Effectively lossless on Qwen3.5’s hybrid attention (only 10 of 40 layers use full KV attention)
- --flash-attn on: efficient memory access patterns during attention
- --batch-size 4096 --ubatch-size 4096: process up to 4K tokens in one batch rather than smaller chunks
- --threads 4: CPU thread count. Over-threading hurts on MoE models since they share the memory bus with the iGPU. 4 tends to be the sweet spot
I tested --parallel 2 and --parallel 4 for concurrent request handling, but GPU contention made per-request throughput worse than sequential. On a single iGPU, stick with a single slot.
Step 5: Configure for Remote Access
By default Lemonade only listens on localhost. To make it accessible from other machines:
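Assuming the server takes standard host/port options at launch (the flag names are unverified; check lemonade-server serve --help):

```shell
:: Bind to all interfaces instead of loopback.
:: Flag names assumed; verify with: lemonade-server serve --help
lemonade-server serve --host 0.0.0.0 --port 13305
```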
Then restart the server (stop the LemonadeServer process and re-launch it). You’ll also need to check Windows Firewall: Lemonade tends to get auto-blocked on first launch. Look for rules named lemonadeserver.exe and set them to Allow for Private/Domain networks.
After this the API is available at:
- http://localhost:13305/v1 (local)
- http://<your-lan-ip>:13305/v1 (LAN)
- If you’re running Tailscale, http://<tailscale-ip>:13305/v1 works too
Step 6: Load the Model
You can either explicitly load it:
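The load endpoint path below is from memory and may differ by version (the OpenAI-compatible routes under /v1 are the stable part); model name illustrative:

```shell
# Endpoint path assumed from memory; adjust if your version differs.
curl -X POST http://localhost:13305/api/v1/load -H "Content-Type: application/json" -d "{\"model_name\": \"Qwen3.5-35B-A3B-GGUF\"}"
```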
Or just send a request and it auto-loads on first use.
Controlling Thinking Mode Per-Request
Qwen3.5 is a thinking model: it generates chain-of-thought reasoning before the actual response. For some workloads (classification, extraction, simple Q&A) this is pure overhead. For complex reasoning tasks it improves quality.
You can control this per-request without any server-side changes using chat_template_kwargs:
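A sketch in plain Python using only the standard library. The model name is illustrative; enable_thinking is the kwarg Qwen’s chat template recognises, and llama-server forwards chat_template_kwargs through to the template:

```python
import json
import urllib.request

def build_payload(prompt: str, thinking: bool) -> dict:
    # chat_template_kwargs is passed through to the chat template by the
    # server; enable_thinking is the switch Qwen's template understands.
    return {
        "model": "Qwen3.5-35B-A3B-GGUF",  # illustrative name
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

def ask(prompt: str, thinking: bool, base="http://localhost:13305/v1"):
    req = urllib.request.Request(
        base + "/chat/completions",
        data=json.dumps(build_payload(prompt, thinking)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        # Returns the message dict: with thinking on, it carries both
        # reasoning_content and content; with thinking off, only content.
        return json.load(r)["choices"][0]["message"]
```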
When thinking is enabled the reasoning appears in message.reasoning_content and the actual response in message.content. When disabled, only content is populated.
What Models Fit
With ~48-50GB usable via Vulkan GTT, here’s what fits at practical quantisations:
| Model | Type | Quant | Size | KV @ 256K | Fits? |
|---|---|---|---|---|---|
| Qwen3.5-35B-A3B | 3B active MoE | Q8_0 | 36.9GB | ~0.7GB | Yes |
| Qwen3.5-27B | 27B dense | Q8_0 | 28.6GB | ~1.0GB | Yes |
| Gemma-4-26B-A4B | 3.8B active MoE | Q8_0 | 26.9GB | ~5.2GB | Yes |
| Qwen3-Next-80B-A3B | 3B active MoE | Q3_K_M | 38.3GB | ~0.8GB | Yes, lower quant |
| Gemma-4-31B | 31B dense | Q6_K | 25.2GB | ~20.8GB | Tight at 256K |
The MoE models with hybrid DeltaNet attention (Qwen3.5 series) are particularly good here because their KV cache overhead is negligible: only a fraction of layers actually use KV-cached attention. You can run Qwen3.5-35B-A3B at Q8_0 with full 256K context and still have headroom.
Testing It: A Quick Chat Interface
The API is OpenAI-compatible so any client that supports a custom endpoint works. Here are a couple of ways to test.
curl
The simplest smoke test:
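Quoting below is for a Unix-style shell (Git Bash or WSL); the model name is illustrative, and GET /v1/models lists what the server actually has loaded:

```shell
curl http://localhost:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-35B-A3B-GGUF",
    "messages": [{"role": "user", "content": "Reply with exactly: pong"}]
  }'
```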
Python with openai library
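A sketch assuming the official openai package (pip install openai). The api_key is a placeholder since local servers generally ignore it, and the model name is illustrative; listing models first shows the real names:

```python
from openai import OpenAI

# api_key is a placeholder; Lemonade serves locally without auth.
client = OpenAI(base_url="http://localhost:13305/v1", api_key="lemonade")

# See what the server actually calls the model before hardcoding a name.
print([m.id for m in client.models.list()])

resp = client.chat.completions.create(
    model="Qwen3.5-35B-A3B-GGUF",  # illustrative name
    messages=[{"role": "user", "content": "Give me one fun fact about RDNA 3.5."}],
)
print(resp.choices[0].message.content)
```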
Open WebUI
Open WebUI gives you a full ChatGPT-style interface. Run it in Docker and point it at your Lemonade server:
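OPENAI_API_BASE_URL and OPENAI_API_KEY are Open WebUI’s standard environment variables; host.docker.internal resolves to the Windows host from inside the container under Docker Desktop:

```shell
# The API key is a dummy value; Lemonade doesn't check it.
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:13305/v1 \
  -e OPENAI_API_KEY=lemonade \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```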
Then open http://localhost:3000 in a browser. The model should show up automatically in the model selector.
Simple streaming chat in the terminal
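A minimal REPL sketch using the openai package’s streaming API; model name illustrative, server assumed up on :13305:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/v1", api_key="lemonade")
history = []

while True:
    user = input("you> ").strip()
    if user in ("exit", "quit"):
        break
    history.append({"role": "user", "content": user})
    stream = client.chat.completions.create(
        model="Qwen3.5-35B-A3B-GGUF",  # illustrative name
        messages=history,
        stream=True,
    )
    reply = ""
    for chunk in stream:
        if not chunk.choices:  # final usage-only chunks have no choices
            continue
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
        reply += delta
    print()
    # Keep the assistant turn so the model sees full conversation context.
    history.append({"role": "assistant", "content": reply})
```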
Performance Notes
On my system in Lenovo’s “Quiet” thermal mode (which throttles the iGPU) I get pretty variable results:
- Prompt processing: 60-126 tok/s depending on thermal state
- Generation: 14-16 tok/s
- Per-request latency for ~4K token input: about 35-50 seconds (prompt-dominated)
Switching Lenovo Intelligent Cooling to Performance mode (via Lenovo Vantage or Fn+T) should give more consistent speeds at the higher end. You can check your current mode from PowerShell:
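I’m not aware of a documented way to query Lenovo’s mode directly, but on ThinkPads the modes generally track the active Windows power scheme, which serves as a rough proxy (assumption: the mapping varies by model and driver):

```shell
:: Shows the active power scheme GUID and name; a rough proxy for
:: Lenovo's thermal mode (mapping varies by model).
powercfg /getactivescheme
```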
The generation speed (~15 tok/s) is fine for interactive chat. For batch workloads with thousands of requests the prompt processing time dominates, and disabling thinking mode via chat_template_kwargs saves a heap of wasted tokens.
What Didn’t Work
A few things I tried that aren’t worth pursuing on this hardware:
- ollama: no native gfx1150 support. Forcing the GFX version override gives 6.4 tok/s, which is 2.5x slower than CPU-only
- ROCm / HIP backend: hipMalloc only sees BIOS-allocated dedicated VRAM and ignores GTT. The model can’t fit
- vllm: same ROCm dependency, same problem
- --parallel 2/4: concurrent inference slots share the same iGPU compute; contention makes per-request throughput worse than sequential
The Vulkan backend is the only viable GPU path on AMD APU iGPUs right now. This may change when ROCm gets proper UMA/GTT support (there are patches in progress) but for now Vulkan + llama.cpp is the way to go.
Wrapping Up
I’m pretty happy with this setup. Running a 35B parameter model locally on a laptop with no discrete GPU and having it accessible as a standard OpenAI API is genuinely useful. The generation speed is fast enough for interactive use and the prompt processing is reasonable for batch work. The DeltaNet hybrid architecture on Qwen3.5 is a good fit for unified memory systems since the KV cache stays tiny even at long context lengths.
The main takeaway: if you have an AMD APU with a decent amount of RAM, use the Vulkan backend. Everything else on the AMD side is broken for iGPU inference right now.