Debugging flaky tests

A flaky test is a test that sometimes passes and sometimes fails with no change in the product or the test code. Retries and quarantine are short-term bandaids, the real fix is understanding why.

Step 1: Look at the video

Open the failing run in the dashboard and watch the video. 80% of flakes have an obvious visual cause: a modal that wasn’t dismissed, a toast that covered a button, an animation that hadn’t settled.

Step 2: Compare to a passing run

Open a passing run of the same test and diff the step timings. Big spikes usually point at:

Network timing: a request is faster/slower than usual
Animation timing: a transition is now blocking interaction
Ordering: a race condition between two state updates

Step 3: Check auto-heal history

If auto-heal fired in recent runs, something in the UI is drifting. Review the before/after locators to see what changed and whether the product change was intentional.

Step 4: Watch the trace

Each failing run has a full trace: DOM, network, console. Common signals:

Console errors from the product around the failed step
Pending network requests when the agent acted
Unhandled promise rejections that broke the page after a step

Step 5: Isolate the step

Pull the problematic step into a minimal test and run it 20 times locally:

for i in {1..20}; do npx momentic run my-test || break; done

If it always fails in isolation, the step itself is wrong. If it only fails when run with others, you have test pollution, shared state from a previous test.

Common fixes

Add an explicit wait for a specific element or URL before acting
Use an AI check to confirm intermediate state
Don’t assert on transient UI. Toasts, growlers, and snackbars often disappear within a few seconds, so a check that targets them races with the animation. Assert the durable result instead (the row that was added, the URL that changed), or capture it with a page check immediately after the action
Clear stale input. If a field keeps its previous value, set clear on the Type step (clear: always for custom widgets that aren’t a plain <input>)
Reset state between tests with a login/setup module in the before section
Ban dynamic attributes that bust the cache every run. If a framework generates random for, aria-controls, or data-* attributes, add them to bannedAttributes so they are stripped from both the AI context and cache comparisons
Mock flaky upstream services: see Request mocking
Split the test if it’s doing too much

When to quarantine

Quarantine if:

You know the fix but can’t ship it today
The test is blocking a critical merge
You’re investigating and don’t want to distract the team

See Quarantine and Failure recovery.

Documentation Index

​Step 1: Look at the video

​Step 2: Compare to a passing run

​Step 3: Check auto-heal history

​Step 4: Watch the trace

​Step 5: Isolate the step

​Common fixes

​When to quarantine