tests: Intermittent failures in thread stress tests under QEMU (stress_aes, stress_recurse, stress_schedule)
Three thread stress tests fail intermittently under QEMU emulation on CI:
thread/stress_aes.py — times out on QEMU ARM/MIPS/RISCV64. Execution
time approaches or exceeds the configured timeout (70-180s depending on
arch). Observed 7 times in a 20-run log window, and responsible for ~28
of 103 failed CI runs over 14 months. On RISCV64 it is excluded entirely
because it takes ~180s against a 200s timeout.
thread/stress_recurse.py — was already excluded from qemu_mips,
qemu_arm and qemu_riscv64 with "is flaky" comments. There are no direct
log observations in the sample window, but only because the exclusion
predates the analysis period.
thread/stress_schedule.py — crashed once (expected PASS, got
CRASH) on qemu_riscv64 in the 20-run window. The frequency is low, but a
crash rather than a timeout suggests a real issue.
These may be QEMU-specific timing/emulation issues rather than bugs in
MicroPython's threading, but the crash in stress_schedule suggests at
least some of these are real.
PR #18861 now ignores these failures in CI. stress_aes.py is additionally
excluded on RISCV64 to avoid burning ~180s of CI time on each timeout.
See analysis: https://gist.github.com/andrewleech/5686ed5242e0948d8679c432579e002e
tests/run-tests.py: Add automatic retry for known-flaky tests.
Summary
The unix port CI workflow on master fails ~18% of the time due to a handful of intermittent test failures (mostly thread-related). This has normalised red CI to the point where contributors ignore it, masking real regressions.
I added a flaky-test allowlist and retry-until-pass mechanism to run-tests.py. When a listed test fails, it is re-run up to two more times and counted as passing on the first success. The 8 tests on the list account for essentially all spurious CI failures on master.
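A minimal sketch of the retry-until-pass idea, assuming the test runner is a callable returning a status string; the names here (RETRY_TESTS, run_with_retry, the "pass-on-retry" status) are illustrative and not the actual identifiers used in run-tests.py:

```python
# Allowlist: test path -> documented reason for retrying.
RETRY_TESTS = {
    "thread/stress_aes.py": "times out under QEMU emulation",
    "thread/stress_schedule.py": "intermittent crash on qemu_riscv64",
}

def run_with_retry(test_path, run_one_test, max_retries=2):
    """Run a test; if it is on the allowlist and fails, re-run it
    up to max_retries more times, stopping at the first pass."""
    result = run_one_test(test_path)
    if result != "fail" or test_path not in RETRY_TESTS:
        return result
    for _ in range(max_retries):
        if run_one_test(test_path) == "pass":
            # Distinguished from a clean pass so the summary can
            # report retried passes separately.
            return "pass-on-retry"
    return "fail"
```

Keeping the allowlist as a dict forces each entry to carry a reason, which keeps the list honest and reviewable.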
tools/ci.sh is updated to remove --exclude patterns for tests now handled by the retry logic, so they get actual coverage again. thread/stress_aes.py stays excluded on RISC-V QEMU where it runs close to the 200s timeout and retries would burn excessive CI time.
Retries are on by default; --no-retry disables them. The end-of-run summary reports tests that passed via retry separately from clean passes.
Testing
Full CI passed on fork (all workflows green). The codecov upload step is also gated on CODECOV_TOKEN being available so it skips cleanly on forks.
Trade-offs and Alternatives
Retry-until-pass can mask a test that has become intermittently broken. The allowlist is deliberately small and each entry documents a reason and optional platform restriction. An alternative would be to mark these tests as expected-fail, but that removes coverage entirely which is worse than the current situation.
The test result status strings ("pass", "fail", "skip") are bare string literals throughout run-tests.py — the retry code follows this existing convention. Worth considering a follow-up to consolidate these into constants for the whole file.
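The follow-up could be as simple as module-level constants; the names below are hypothetical, since run-tests.py currently uses the bare literals:

```python
# Hypothetical constant names for the status strings used in run-tests.py.
STATUS_PASS = "pass"
STATUS_FAIL = "fail"
STATUS_SKIP = "skip"

def is_failure(status):
    # Comparing against a constant turns a typo into a NameError
    # rather than a silently-false string comparison.
    return status == STATUS_FAIL
```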
Generative AI
I used generative AI tools when creating this PR, but a human has checked the code and is responsible for the description above.