tests: thread/thread_gc1.py intermittent failure on CI
The thread_gc1.py test fails intermittently on CI, producing False instead
of the expected True. It is the single biggest contributor to CI flakiness
on master, accounting for ~62 of 103 failed runs over 14 months (575 runs
sampled).
In a 20-run window with available logs, the failure was observed in the
settrace_stackless job (6 times) and the coverage job (3 times). The test
was already excluded from the macos, qemu_mips, qemu_arm, and qemu_riscv64
jobs prior to PR #18861.
The test spawns threads that perform garbage collection and checks a
boolean result. The failure pattern suggests a race condition in the GC
or thread interaction, not a test logic issue — the test is correctly
detecting a real bug.
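For readers unfamiliar with the test, here is a rough CPython analogue of the kind of check described above. It is not the contents of thread_gc1.py: the function names, thread count, and iteration count are hypothetical, and it uses the standard `threading` module rather than MicroPython's `_thread`.

```python
import gc
import threading


def worker(iterations, results, index):
    # Hypothetical workload: keep one long-lived buffer alive while
    # churning short-lived allocations and forcing collections.
    data = bytearray(range(256))
    scratch = [0] * 8
    for _ in range(iterations):
        scratch = [0] * 8
        gc.collect()
    # The long-lived buffer must survive every collection unchanged.
    intact = all(i == b for i, b in enumerate(data))
    results[index] = intact and len(scratch) == 8


def run_gc_threads(n_threads=4, iterations=20):
    # One result slot per thread; each thread writes only its own slot.
    results = [False] * n_threads
    threads = [
        threading.Thread(target=worker, args=(iterations, results, i))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Single boolean outcome, analogous to the True/False the test reports.
    return all(results)


print(run_gc_threads())
```

A False result from a check of this shape would mean that some thread's live data was disturbed by a collection, which is why an intermittent False points at the runtime rather than at the test logic.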
Estimated per-execution failure rate: ~1.3% across the 8 CI jobs that
run it.
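The estimate above can be reproduced with a short calculation. This is a sketch that assumes each of the 575 sampled runs executed the test once in each of the 8 jobs, and that each of the ~62 attributed failed runs counts as a single failing execution; both are simplifications, and the function names are illustrative only.

```python
def per_execution_rate(failed_runs, sampled_runs, jobs_per_run):
    # Assumption: one test execution per job per sampled run.
    executions = sampled_runs * jobs_per_run
    return failed_runs / executions


def run_failure_probability(rate, jobs_per_run):
    # Chance that at least one job in a run hits the failure,
    # assuming the jobs fail independently of each other.
    return 1 - (1 - rate) ** jobs_per_run


rate = per_execution_rate(failed_runs=62, sampled_runs=575, jobs_per_run=8)
print(f"per-execution failure rate: {rate:.1%}")
print(f"chance a run is affected:   {run_failure_probability(rate, 8):.1%}")
```

Even a per-execution rate this low compounds across jobs, so roughly one run in ten would be expected to fail on this test alone under these assumptions.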
PR #18861 now ignores this failure in CI so it doesn't block other work,
but the underlying issue should be fixed.
See analysis: https://gist.github.com/andrewleech/5686ed5242e0948d8679c432579e002e
tests/thread/thread_gc1: Skip unreliable test in GitHub CI.
Summary
thread/thread_gc1.py is a constant source of spurious failures in GitHub CI.
This PR adds it to the list of tests skipped on GitHub CI in the macos, qemu_riscv64, qemu_mips, and qemu_arm jobs, to reduce the overall false positive rate and improve the predictive value of a test failure.
Testing
I examined a sample of the last 25 unix port GitHub Actions runs, tabulated their outcomes and the causes of any failures, and computed statistics over the results:
| Action Run | Outcome | Failed Job(s) | Cause |
|---|---|---|---|
| 17256297609 | FAIL | macos | thread/thread_gc1.py |
| 17248166934 | PASS | ||
| 17239364181 | FAIL | macos | thread/thread_gc1.py |
| 17239346523 | PASS | ||
| 17232449204 | FAIL | macos | thread/thread_gc1.py (possibly valid) |
| 17230929159 | FAIL | qemu_arm | cmdline/repl_sys_ps1_ps2.py |
| 17230082929 | PASS | ||
| 17226283109 | FAIL | settrace_stackless<br>macos<br>qemu_mips | thread/thread_gc1.py<br>thread/thread_gc1.py<br>thread/thread_gc1.py |
| 17226266333 | PASS | ||
| 17225202917 | FAIL | macos<br>qemu_riscv64 | thread/thread_gc1.py<br>thread/thread_gc1.py |
| 17224743621 | FAIL | settrace_stackless<br>macos | thread/thread_gc1.py<br>thread/thread_gc1.py |
| 17224739270 | FAIL | macos | thread/thread_gc1.py |
| 17220251949 | FAIL | macos | thread/thread_gc1.py |
| 17218037418 | PASS | ||
| 17218024199 | PASS | ||
| 17212060390 | FAIL | macos | thread/thread_gc1.py |
| 17211892105 | CANCEL | ||
| 17209911695 | FAIL | macos | thread/thread_gc1.py |
| 17209904205 | FAIL | macos | thread/thread_gc1.py |
| 17196446007 | PASS | ||
| 17196132542 | FAIL | macos | thread/thread_gc1.py |
| 17180766768 | PASS | ||
| 17175320257 | PASS | ||
| 17175019154 | FAIL | macos<br>qemu_mips | thread/thread_gc1.py<br>thread/thread_gc1.py |
| 17175013008 | CANCEL | ||
Of the 14 failed runs in this sample, all but one were attributable to thread/thread_gc1.py, with all but one of these failures happening on macos or qemu. (One of the failing changes did touch thread code, marked "possibly valid" in the table, so for the sake of robustness I've treated it as a true positive in my analysis.)
With this test enabled, the test suite's failure indication has a false positive rate of 59% over this sample, an F1 score of 0.13, and a positive predictive value of 7.14%. In other words, when the test suite reports a failure for a PR, the chance that the failure is caused by the PR's change is only about 7%, largely because of this test.
This test should therefore be disabled in these scenarios where it's unreliable.
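The statistics quoted above can be checked against the table with a small script. The counts are read off the table with cancelled runs excluded: 14 failed runs, of which one is treated as a genuine failure, and 9 passing runs. The sketch additionally assumes there were no false negatives in the sample, which the table cannot confirm.

```python
def confusion_metrics(tp, fp, tn, fn):
    # Standard definitions over a binary confusion matrix.
    fpr = fp / (fp + tn)        # false positive rate
    ppv = tp / (tp + fp)        # positive predictive value (precision)
    recall = tp / (tp + fn)     # true positive rate
    f1 = 2 * ppv * recall / (ppv + recall)
    return {"fpr": fpr, "ppv": ppv, "recall": recall, "f1": f1}


# tp: the one "possibly valid" failure, treated as real
# fp: the remaining 13 failed runs
# tn: the 9 passing runs
# fn: assumed to be zero (not observable from the table)
metrics = confusion_metrics(tp=1, fp=13, tn=9, fn=0)

print(f"false positive rate:       {metrics['fpr']:.0%}")
print(f"positive predictive value: {metrics['ppv']:.2%}")
print(f"F1 score:                  {metrics['f1']:.2f}")
```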