tests: thread stress tests fail intermittently under QEMU (stress_aes, stress_recurse, stress_schedule)
Three thread stress tests fail intermittently under QEMU emulation on CI:
- thread/stress_aes.py: times out on QEMU ARM/MIPS/RISCV64. Execution
  time approaches or exceeds the configured timeout (70-180s depending on
  arch). Observed 7 times in a 20-run log window, and attributed to ~28 of
  103 failed runs over 14 months. On RISCV64 it is excluded entirely
  because it takes ~180s against a 200s timeout.
- thread/stress_recurse.py: was already excluded from qemu_mips,
  qemu_arm, qemu_riscv64 with "is flaky" comments. There are no direct log
  observations in the sample window because the test was already excluded;
  the exclusion predates the analysis period.
- thread/stress_schedule.py: crashed once (expected PASS, got
  CRASH) on qemu_riscv64 in the 20-run window. The frequency is low, but a
  crash rather than a timeout suggests a real issue.
These may be QEMU-specific timing/emulation issues rather than bugs in
MicroPython's threading, but the crash in stress_schedule suggests at
least some of these are real.
PR #18861 now ignores these failures in CI. stress_aes.py is additionally
excluded on RISCV64 to avoid burning ~180s of CI time on each timeout.
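To illustrate what ignoring known-flaky failures per target could look like, here is a minimal sketch. It is not the implementation in PR #18861: `FLAKY_ON_TARGET` and `partition_failures` are hypothetical names, the per-target entries only mirror the observations listed in this issue, and `example/other_test.py` is a placeholder test name.

```python
# Hypothetical per-target table of tests whose failures are tolerated.
# Entries mirror the observations in this issue, not the real CI configuration.
FLAKY_ON_TARGET = {
    "qemu_arm": {"thread/stress_aes.py", "thread/stress_recurse.py"},
    "qemu_mips": {"thread/stress_aes.py", "thread/stress_recurse.py"},
    "qemu_riscv64": {
        "thread/stress_aes.py",
        "thread/stress_recurse.py",
        "thread/stress_schedule.py",
    },
}


def partition_failures(target, failed_tests):
    """Split failed tests into (fatal, ignored) lists for the given target."""
    flaky = FLAKY_ON_TARGET.get(target, set())
    fatal = [t for t in failed_tests if t not in flaky]
    ignored = [t for t in failed_tests if t in flaky]
    return fatal, ignored


# "example/other_test.py" is a placeholder, not a real test in the suite.
fatal, ignored = partition_failures(
    "qemu_riscv64", ["thread/stress_schedule.py", "example/other_test.py"]
)
print("fatal:", fatal)
print("ignored:", ignored)
```

With this approach a run fails only if `fatal` is non-empty, while ignored failures can still be logged so that a real regression in the threading code remains visible.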
See analysis: https://gist.github.com/andrewleech/5686ed5242e0948d8679c432579e002e
tests/thread/thread_gc1: Skip unreliable test in GitHub CI.
Summary
thread/thread_gc1.py is a constant source of spurious failures in GitHub CI.
This PR adds it to the list of tests skipped when running on GitHub CI on macos, qemu_riscv64, qemu_mips, or qemu_arm. This reduces the overall false positive rate and improves the predictive value of a test failure.
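The skip condition described above can be sketched as follows. This is an illustration under assumptions, not the actual change to the test runner: `should_skip`, `CI_SKIP_TARGETS`, and `CI_SKIP_TESTS` are hypothetical names, and the target is passed in explicitly rather than detected. The only external fact relied on is that GitHub Actions sets the `GITHUB_ACTIONS` environment variable to `true`.

```python
import os

# Targets on which thread/thread_gc1.py is skipped, per this PR's description.
CI_SKIP_TARGETS = {"macos", "qemu_riscv64", "qemu_mips", "qemu_arm"}
CI_SKIP_TESTS = {"thread/thread_gc1.py"}


def should_skip(test, target, environ=os.environ):
    """Return True if `test` should be skipped for `target` on GitHub CI."""
    on_github_ci = environ.get("GITHUB_ACTIONS") == "true"
    return on_github_ci and target in CI_SKIP_TARGETS and test in CI_SKIP_TESTS


# Skipped on GitHub CI for an affected target, but still run locally.
print(should_skip("thread/thread_gc1.py", "macos", {"GITHUB_ACTIONS": "true"}))
print(should_skip("thread/thread_gc1.py", "macos", {}))
```

Gating on the CI environment keeps the test active for local runs, where the failures may be easier to reproduce and investigate.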
Testing
I examined a sample of the last 25 unix port GitHub Actions runs, tabulated their outcomes and the causes attributable to any failures, and examined relevant statistics over the results to
| Action Run | Outcome | Failed Job(s) | Cause |
|---|---|---|---|
| 17256297609 | FAIL | macos | thread/thread_gc1.py |
| 17248166934 | PASS | ||
| 17239364181 | FAIL | macos | thread/thread_gc1.py |
| 17239346523 | PASS | ||
| 17232449204 | FAIL | macos | thread/thread_gc1.py (possibly valid) |
| 17230929159 | FAIL | qemu_arm | cmdline/repl_sys_ps1_ps2.py |
| 17230082929 | PASS | ||
| 17226283109 | FAIL | settrace_stackless<br>macos<br>qemu_mips | thread/thread_gc1.py<br>thread/thread_gc1.py<br>thread/thread_gc1.py |
| 17226266333 | PASS | ||
| 17225202917 | FAIL | macos<br>qemu_riscv64 | thread/thread_gc1.py<br>thread/thread_gc1.py |
| 17224743621 | FAIL | settrace_stackless<br>macos | thread/thread_gc1.py<br>thread/thread_gc1.py |
| 17224739270 | FAIL | macos | thread/thread_gc1.py |
| 17220251949 | FAIL | macos | thread/thread_gc1.py |
| 17218037418 | PASS | ||
| 17218024199 | PASS | ||
| 17212060390 | FAIL | macos | thread/thread_gc1.py |
| 17211892105 | CANCEL | ||
| 17209911695 | FAIL | macos | thread/thread_gc1.py |
| 17209904205 | FAIL | macos | thread/thread_gc1.py |
| 17196446007 | PASS | ||
| 17196132542 | FAIL | macos | thread/thread_gc1.py |
| 17180766768 | PASS | ||
| 17175320257 | PASS | ||
| 17175019154 | FAIL | macos<br>qemu_mips | thread/thread_gc1.py<br>thread/thread_gc1.py |
| 17175013008 | CANCEL | ||
Of the 14 failed runs observed in this sample, all but one were attributable to thread/thread_gc1.py, with all but one of these failures happening on macos or qemu. (Note that one of these changes did touch thread code, so to be conservative I've treated that failure as a true positive in my analysis.)
With this test included, the test suite has a false positive rate of 59% over this sample, an F1 score of 0.13, and a positive predictive value of 7.14%. That is, when the test suite reports a failure for a PR, the chance that the failure is due to the PR's change is only about 7%, almost entirely because of this test.
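The quoted statistics can be reproduced from a confusion matrix. The counts below are an assumption based on one reading of the table, not a method stated in this PR: the 2 cancelled runs are excluded, the 9 passing runs are true negatives, the one "possibly valid" failure is the only true positive, the other 13 failed runs are false positives, and no regressions were missed.

```python
# Assumed confusion matrix for the 23 completed runs (2 cancelled runs excluded).
# "Positive" means the CI run reported a failure.
tp = 1   # failed run treated as a genuine failure (the "possibly valid" row)
fp = 13  # remaining failed runs, treated as spurious
tn = 9   # passing runs
fn = 0   # assumption: no regressions slipped through a passing run

false_positive_rate = fp / (fp + tn)
precision = tp / (tp + fp)  # also called positive predictive value (PPV)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"FPR = {false_positive_rate:.0%}")  # FPR = 59%
print(f"PPV = {precision:.2%}")            # PPV = 7.14%
print(f"F1  = {f1:.2f}")                   # F1  = 0.13
```

Under these assumed counts the output matches the figures above: 13/22 for the false positive rate, 1/14 for the positive predictive value, and 2/15 for the F1 score.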
This test should therefore be disabled in these scenarios where it's unreliable.