tests: cmdline/repl_lock.py and repl_cont.py intermittent failures
Two REPL tests fail intermittently on CI:
cmdline/repl_lock.py — fails on QEMU ARM and RISCV64. The expected
output shows >>> micropython.heap_lock() but the actual output drops
the >>> prompt prefix. Observed 3 times in 20 runs with logs. This is
a REPL prompt timing issue under QEMU emulation.
cmdline/repl_cont.py — fails on macOS. Differences in quote escaping
in REPL continuation prompts ("'" vs '\''). Observed once in 20 runs.
The macOS job has historically been the second most failure-prone job
(4.3% failure rate, 25 failures over 14 months) with all failures
attributed to REPL-related issues. The August 2025 spike (11 macOS
failures) correlates with the GitHub Actions macOS 15 runner migration.
PR #18861 now ignores these failures in CI.
See analysis: https://gist.github.com/andrewleech/5686ed5242e0948d8679c432579e002e
tests/run-tests.py: Add automatic retry for known-flaky tests.
Summary
The unix port CI workflow on master fails ~18% of the time due to a handful of intermittent test failures (mostly thread-related). This has normalised red CI to the point where contributors ignore it, masking real regressions.
I added a flaky-test allowlist and retry-until-pass mechanism to run-tests.py. When a listed test fails, it's re-run up to 2 more times and passes on the first success. The 8 tests on the list account for essentially all spurious CI failures on master.
tools/ci.sh is updated to remove --exclude patterns for tests now handled by the retry logic, so they get actual coverage again. thread/stress_aes.py stays excluded on RISC-V QEMU where it runs close to the 200s timeout and retries would burn excessive CI time.
Retries are on by default; --no-retry disables them. The end-of-run summary reports tests that passed via retry separately from clean passes.
Testing
Full CI passed on fork (all workflows green). The codecov upload step is also gated on CODECOV_TOKEN being available so it skips cleanly on forks.
Trade-offs and Alternatives
Retry-until-pass can mask a test that has become intermittently broken. The allowlist is deliberately small and each entry documents a reason and optional platform restriction. An alternative would be to mark these tests as expected-fail, but that removes coverage entirely which is worse than the current situation.
The test result status strings ("pass", "fail", "skip") are bare string literals throughout run-tests.py — the retry code follows this existing convention. Worth considering a follow-up to consolidate these into constants for the whole file.
Generative AI
I used generative AI tools when creating this PR, but a human has checked the code and is responsible for the description above.