MicroPython has about 1500 open issues across its repos. Some of them have been there for years. A bunch are duplicates of each other, a bunch more are already fixed by PRs that got merged without anyone linking them back, and a pretty solid chunk are just noise (support questions, cross-posts, wrong-repo stuff). Nobody’s going to sit down and manually review 1500 issues against 8000+ PRs looking for connections though.
So I built a pipeline to do it. Over about three weeks, working with Claude Code, I went from “this would be nice” to a working system that scanned every open issue, found 4051 candidate matches, had three different LLMs classify them, and produced a browseable workbench with side-by-side rendered markdown for reviewing the results. The whole thing runs on a SQLite database, a GPU machine I already had ssh access to, and a laptop with an AMD iGPU running inference on the LAN.
I’m writing this partly because the process itself was pretty interesting, but mostly for three reasons: the model comparison results were genuinely surprising, the infrastructure constraints drove a lot of the design decisions in ways I didn’t expect, and I think the pattern of “cheap local LLM first pass, expensive API model for validation” is going to become standard for this kind of bulk classification work.
tldr: 229 confirmed duplicates, 233 likely duplicates, built with Qwen3.5 + Sonnet + Gemma-4 across three machines. Browse all the results, or read on for how it works.
The pipeline
The architecture ended up as a seven-stage pipeline, each stage feeding into the next:
collect → summarize → assemble → embed → scan → assess → export
Collect mirrors GitHub issues and PRs into a local SQLite database via the GitHub API. 6481 issues, 8269 PRs, 3736 comments. Full incremental sync so it can be re-run without re-fetching everything.
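The incremental part of the sync is the usual watermark pattern. A minimal sketch, assuming an illustrative schema (the table and column names here are not the project’s actual ones): upserts keyed on the issue number make re-runs idempotent, and the newest `updated_at` already stored becomes the `since` parameter for the next GitHub API call.

```python
import sqlite3

# Illustrative schema, not the project's actual one.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE items (
    number INTEGER PRIMARY KEY,
    title TEXT,
    updated_at TEXT  -- ISO 8601, so lexicographic order == chronological order
)""")

def upsert(number, title, updated_at):
    # Keyed on the issue number: re-fetching an item updates it in place.
    con.execute(
        "INSERT INTO items VALUES (?, ?, ?) "
        "ON CONFLICT(number) DO UPDATE SET title=excluded.title, "
        "updated_at=excluded.updated_at",
        (number, title, updated_at),
    )

def sync_watermark():
    # Value to pass as `since` on the next API call, so nothing is re-fetched.
    return con.execute("SELECT MAX(updated_at) FROM items").fetchone()[0]

# Hypothetical sample data.
upsert(101, "UART glitch", "2024-01-05T10:00:00Z")
upsert(102, "I2C timeout", "2024-03-12T09:30:00Z")
upsert(101, "UART glitch on RP2", "2024-04-01T08:00:00Z")  # updated, not duplicated
```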
Summarize optionally runs each item through Haiku or a local LLM for a structured summary (components, category, affected code, error signatures). I ended up skipping this for the bulk scan; it wasn’t adding enough value over the raw text.
Assemble builds a structured XML representation per item, budget-capped at 4000 characters. This is what gets embedded and compared. The budget cap matters because the embedding model and the GPU both have limits, and I found through a fair bit of trial and error that 4K chars gives the best trade-off between context and stability.
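The budget-capping logic is roughly “drop the lowest-value fields until it fits.” A sketch of the idea, with a made-up field layout (the project’s actual XML structure and truncation order may differ): trailing comments go first, then the body gets trimmed.

```python
from xml.sax.saxutils import escape

BUDGET = 4000  # characters, per the scan described above

def assemble(title, body, comments, budget=BUDGET):
    """Hypothetical sketch: render a simple XML view of an item, dropping
    trailing comments and then trimming the body until it fits the budget."""
    def render(body_text, comment_texts):
        parts = [f"<item><title>{escape(title)}</title>",
                 f"<body>{escape(body_text)}</body>"]
        parts += [f"<comment>{escape(c)}</comment>" for c in comment_texts]
        parts.append("</item>")
        return "".join(parts)

    comments = list(comments)
    xml = render(body, comments)
    while len(xml) > budget and comments:
        comments.pop()                      # comments are the lowest-value field
        xml = render(body, comments)
    if len(xml) > budget:                   # still too big: trim the body itself
        overflow = len(xml) - budget
        xml = render(body[: max(0, len(body) - overflow)], [])
    return xml
```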
Embed encodes every assembled item into a 1024-dimensional vector using Qwen3-Embedding-0.6B, stored in sqlite-vec, plus a parallel FTS5 full-text index for keyword matching.
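The vector half of that needs the sqlite-vec extension and a GPU, but the FTS5 half is plain stdlib SQLite. A minimal sketch of the keyword index (the issue text and ids here are made up); note that SQLite’s `bm25()` returns a rank where lower is better, so you order ascending:

```python
import sqlite3

# Keyword half of the hybrid index, using stdlib SQLite's FTS5.
# (The vector half lives in sqlite-vec and is queried analogously.)
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE items_fts USING fts5(item_id UNINDEXED, text)")
docs = [  # hypothetical sample items
    (1, "UART RX glitch on RP2040 at high baud rates"),
    (2, "I2C bus timeout when scanning with no pull-ups"),
    (3, "Fix UART glitch by resetting the RX FIFO"),
]
con.executemany("INSERT INTO items_fts VALUES (?, ?)", docs)

# bm25() returns a smaller value for a better match, so ORDER BY ascending.
rows = con.execute(
    "SELECT item_id, bm25(items_fts) FROM items_fts "
    "WHERE items_fts MATCH ? ORDER BY bm25(items_fts)",
    ("uart glitch",),
).fetchall()
```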
Scan is where the interesting stuff happens. For each of the 1374 open issues, it runs a hybrid retrieval query: cosine-similarity KNN from the vector index plus BM25 keyword matching from FTS5, fused with Reciprocal Rank Fusion, then re-ranked with a cross-encoder. The top candidates get stored in a scan_results table with value scores that weight merged PRs higher (an open issue matched to a merged PR is the most actionable finding).
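Reciprocal Rank Fusion itself is only a few lines. A sketch, using k=60 (the constant from the original RRF paper; the project’s exact settings may differ) and hypothetical candidate ids:

```python
def rrf(rankings, k=60):
    """Fuse several ranked candidate lists (best first) into one.

    Each list contributes 1/(k + rank) per candidate; k=60 is the
    conventional constant. Items ranked well by both lists rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, cand in enumerate(ranking, start=1):
            scores[cand] = scores.get(cand, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ids: one list from vector KNN, one from BM25.
knn  = ["pr_812", "issue_44", "pr_203"]
bm25 = ["pr_812", "issue_9", "issue_44"]
fused = rrf([knn, bm25])
```

The fused list is what goes to the cross-encoder for re-ranking; only the survivors of that step are stored as candidates.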
Assess sends each candidate pair to an LLM for classification: DUPLICATE, LIKELY_DUPLICATE, RELATED, OFF_TOPIC, or UNRELATED.
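Getting a free-text model reply pinned to one of five labels is fiddlier than it looks, because some labels contain others as substrings. A sketch of the parsing idea (the project’s actual prompt and parsing may differ): check the containing labels first so a plain substring match can’t misfire.

```python
def parse_verdict(reply):
    """Return the first classification label found in a model reply, or None.

    Order matters: LIKELY_DUPLICATE before DUPLICATE, and UNRELATED before
    RELATED, so a substring match can't mistake one label for another.
    """
    upper = reply.upper()
    for label in ("LIKELY_DUPLICATE", "UNRELATED", "OFF_TOPIC",
                  "DUPLICATE", "RELATED"):
        if label in upper:
            return label
    return None
```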
Export produces CSV, Markdown, HTML index, and a full static site with per-pair detail pages.
Infrastructure constraints shaped everything
I didn’t set out to build a multi-host distributed system; it just happened because of what hardware was available.
The GPU host is an older machine called step with a GTX 1650 SUPER (4GB VRAM). It runs the embedding model and the cross-encoder reranker. It has no internet access, though, so models have to be cached locally and all interaction goes through SSH. This meant setting HF_HUB_OFFLINE=1 for every invocation and syncing model files by hand.
The inference host is pilap2, an AMD Ryzen AI 9 HX PRO 370 laptop with a Radeon 890M iGPU sharing 64GB system RAM. It runs Lemonade (an AMD inference server wrapping llama.cpp) hosting Qwen3.5-35B-A3B for the assessment pass. No dedicated GPU, all inference runs on the integrated graphics, which means it’s memory-bandwidth bound rather than compute-bound.
My local machine has no GPU at all. It runs the scripts, the SQLite database, and the Claude CLI for Sonnet calls.
The scan itself was the first operation that forced me to deal with this split. Embedding 14,750 items and then running KNN + reranking for 1374 queries needed the GPU. I’d SSH into step, run the scan in tmux (in case the network dropped), then rsync the database back. The whole scan took about 50 minutes on the GTX 1650 with the final reranker model, producing 4051 candidate pairs.
The reranker discovery
The original reranker was BAAI/bge-reranker-large, a 560M parameter cross-encoder. It worked great for quality but was absurdly slow for a batch scan: 559 seconds per issue, which would have meant 8.5 days for a full scan.
I spent a while researching alternatives and discovered something I didn’t expect: on GPU, bge-reranker-base (278M params) takes almost identical time to bge-reranker-large (560M params). The bottleneck is memory bandwidth, not compute. Dropping from 560M to 278M parameters doesn’t help when the GPU spends most of its time shuttling data from VRAM.
What did help was switching architecture entirely. cross-encoder/ms-marco-MiniLM-L-6-v2 is a 22.7M parameter model (6 layers, 384 hidden dimension) and it brought the per-issue time down from 559 seconds to 2.2 seconds. 250x faster. The quality was fine for my use case because Sonnet was going to re-assess the results anyway, so the reranker just needed to be good enough to filter noise, not be the final arbiter.
Three models, one prompt, very different results
With 4051 candidate pairs to classify, I couldn’t justify running them all through Sonnet. I’m on the Claude Pro plan so it’s not a direct per-call cost, but 4051 pairs at 15 seconds each would chew through a fair bit of my usage quota, and 17 hours of wall time isn’t nothing either.
So I ran a tiered evaluation. Qwen3.5-35B-A3B (a MoE model with 3B active parameters) went first on all 4051 pairs running locally on pilap2. This took about 40 hours at 35-50 seconds per pair, depending on the laptop’s thermal state. That’s slower than Sonnet in wall-clock terms, but it ran unattended over the weekend and the only cost was the laptop’s iGPU power draw rather than eating into my online quota.
Then Sonnet validated the 1269 pairs that Qwen flagged as DUPLICATE or LIKELY_DUPLICATE, the actionable classifications where a mistake means closing an issue that shouldn’t be closed.
Then Gemma-4-26B-A4B ran as a tiebreaker on the 50 most interesting disagreement pairs between Qwen and Sonnet.
What I found
Qwen’s RELATED and UNRELATED calls were 100% reliable when spot-checked against Sonnet. It never under-classified: it never called something RELATED that Sonnet would call DUPLICATE. That’s the important direction for this use case: you don’t want the first-pass filter to miss real duplicates.
Where Qwen fell down was over-promotion. Out of 899 pairs it called LIKELY_DUPLICATE, Sonnet downgraded 750 of them (83%) to just RELATED. It was being optimistic about relationships that were thematically connected but not actually duplicates.
Qwen’s DUPLICATE calls were more reliable though. 88% of them were still actionable per Sonnet (either confirmed DUPLICATE or at least LIKELY_DUPLICATE).
Gemma mostly agreed with Sonnet when they disagreed with Qwen, siding with Sonnet 64% of the time. The one exception: when Sonnet softened Qwen’s DUPLICATE to LIKELY_DUPLICATE, Gemma agreed with Qwen 100% of the time (8/8). Sonnet might actually be overly conservative on those.
I also tested Gemma with thinking mode enabled. It was 3.7x slower, had a 16% failure rate from timeouts, and produced lower agreement with Sonnet. Thinking mode made the model more independent (more “neither” verdicts) but not more accurate. For a structured classification task like this, extra reasoning tokens are wasted; the model just talks itself into alternative answers.
Tuning the inference server
I spent a couple of sessions optimising the Lemonade/llama.cpp configuration and found some things that are probably useful for anyone running MoE models on iGPUs:
- Reducing `ctx_size` from 32K to 8K gave a 3x improvement in prompt processing speed (130 t/s vs 40 t/s). Our prompts were only ~4K tokens, so the extra context window was just wasting memory.
- KV cache quantisation to q8_0 was completely lossless on Qwen3.5. This model has hybrid attention (only 8 of 32 layers use full KV caches) so quantisation noise gets absorbed by the non-attention layers.
- Parallel slots (`--parallel 2` with `--cont-batching`) were actually slower than single-slot on the iGPU. Memory bandwidth is the bottleneck, not GPU utilisation, so two requests competing for the same memory bus just made both slower.
- The `enable_thinking` flag is per-request in `chat_template_kwargs`, not a global server flag. No server restart needed to switch between thinking and non-thinking mode on the same model.
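That last point means mode switching is just a field in the request body. A sketch of what such a request looks like against an OpenAI-compatible endpoint like Lemonade/llama.cpp (the message content here is made up, and the model name is however your server has it configured):

```python
# Per-request thinking control: no server restart, just a request field.
payload = {
    "model": "Qwen3.5-35B-A3B",  # whatever name the server exposes
    "messages": [
        {"role": "user", "content": "Are these two issues duplicates? ..."},
    ],
    # Travels through to the chat template; flip to True for thinking mode.
    "chat_template_kwargs": {"enable_thinking": False},
}
```

POST that to the server’s `/v1/chat/completions` endpoint as JSON and the template renders without the thinking preamble for that request only.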
The workbench
The raw data isn’t much use without a way to review it. I built a web UI served from a single Python http.server that renders each pair as a side-by-side comparison with full markdown rendering of the issue/PR bodies and comments.
On desktop it’s a split-pane layout. On phones it switches to a tab interface showing one pane at a time since there’s no point trying to read two issues at ~280px each. There’s keyboard navigation (j/k for next/prev pair, 1/2 for query/candidate pane, c to copy a suggested close-comment to clipboard, r to toggle the model’s reasoning).
The aesthetic is intentionally editorial rather than dashboard: warm cream paper background, serif typography (Fraunces for headings, Source Serif 4 for body), classification colours on a vertical “seam” between the panes. I wanted it to feel like reading a scholarly critical edition of two parallel texts, not like another Jira board.
The whole thing also exports as a static directory (mpy-triage export-html -d site/) with relative links, so it can be hosted anywhere or just opened from the filesystem.
Results
The final count after Sonnet validation:
| Classification | Count |
|---|---|
| Duplicates (close the issue) | 229 |
| Likely Duplicates (confirm first) | 233 |
| Related (linked but not resolved) | 2416 |
| Off-topic (noise) | 1109 |
229 confirmed duplicates is a pretty solid result. These are open issues that either have merged PRs explicitly fixing them, or are duplicate reports of the same bug filed months or years apart. The 233 likely duplicates need a human to confirm but they’re at least flagged and prioritised.
What I’d do differently
The biggest time sink was the assessment phase: 40 hours of Qwen inference followed by 6 hours of Sonnet validation. If I were starting again I’d probably skip the local LLM first pass entirely for the DUPLICATE/LIKELY_DUPLICATE classifications and just run Sonnet directly on the ~1200 highest-scoring pairs. The savings from Qwen’s filtering (~70% of pairs not sent to Sonnet) were real, but 1269 Sonnet calls is pretty manageable within the Pro plan quota compared to 40 hours of local inference time.
Where the local model did add value was classifying the RELATED and OFF_TOPIC pairs that Sonnet never needed to see. Running all 4051 pairs through Sonnet would have burned through a lot more of my quota, and Qwen’s accuracy on non-duplicate classifications was 100% in validation.
The reranker choice mattered more than I expected. Dropping from bge-reranker-large to MiniLM-L-6-v2 was the single biggest speed improvement in the project (250x), and it came from understanding that GPU inference is memory-bandwidth bound, not compute-bound, for these models.
Try it
The whole project is at github.com/andrewleech/mpy-triage. The full results are browseable here.