tests: thread/stress_heap.py intermittent failure on macOS
The stress_heap.py test was excluded from the macOS CI job with a "is
flaky" comment prior to the analysis period. No direct log observations
are available since it was already excluded, but it's listed in
FLAKY_TESTS restricted to the darwin platform.
PR #18861 now handles this via ignore-on-failure instead of exclusion,
so the test runs and its output is visible but doesn't block CI.
See analysis: https://gist.github.com/andrewleech/5686ed5242e0948d8679c432579e002e
tests/run-tests.py: Add automatic retry for known-flaky tests.
Summary
The unix port CI workflow on master fails ~18% of the time due to a handful of intermittent test failures (mostly thread-related). This has normalised red CI to the point where contributors ignore it, masking real regressions.
I added a flaky-test allowlist and retry-until-pass mechanism to run-tests.py. When a listed test fails, it's re-run up to 2 more times and passes on the first success. The 8 tests on the list account for essentially all spurious CI failures on master.
tools/ci.sh is updated to remove --exclude patterns for tests now handled by the retry logic, so they get actual coverage again. thread/stress_aes.py stays excluded on RISC-V QEMU where it runs close to the 200s timeout and retries would burn excessive CI time.
Retries are on by default; --no-retry disables them. The end-of-run summary reports tests that passed via retry separately from clean passes.
Testing
Full CI passed on fork (all workflows green). The codecov upload step is also gated on CODECOV_TOKEN being available so it skips cleanly on forks.
Trade-offs and Alternatives
Retry-until-pass can mask a test that has become intermittently broken. The allowlist is deliberately small and each entry documents a reason and optional platform restriction. An alternative would be to mark these tests as expected-fail, but that removes coverage entirely which is worse than the current situation.
The test result status strings ("pass", "fail", "skip") are bare string literals throughout run-tests.py — the retry code follows this existing convention. Worth considering a follow-up to consolidate these into constants for the whole file.
Generative AI
I used generative AI tools when creating this PR, but a human has checked the code and is responsible for the description above.