Is This a Flaky Test or a Real Bug? A Developer's Guide to Triaging Flaky Tests

September 1, 2025

The CI pipeline glows red. A critical end-to-end test has failed on the main branch, blocking the deployment of a time-sensitive feature. Panic sets in, followed by the inevitable, soul-crushing question: is it a real bug, or just another flaky test? This single moment of uncertainty can derail an entire team, eroding trust in the test suite and slowing development velocity to a crawl. The ability to efficiently triage flaky tests is no longer a niche skill but a fundamental competency for modern software engineering teams. It's the critical process that separates high-performing teams with reliable, fast feedback loops from those bogged down in a perpetual cycle of rerunning jobs and debugging the debugger. According to a study by Google engineers, flaky tests are a pervasive problem even at scale, necessitating a systematic approach to identification and resolution. This guide provides a comprehensive framework for developers to confidently navigate this gray area, make decisive calls, and restore stability to their development lifecycle.

Understanding the Enemy: What Exactly is a Flaky Test?

Before we can effectively triage flaky tests, we must establish a clear definition. A flaky test is a test that can both pass and fail for the same code without any changes. Its outcome is non-deterministic. Unlike a consistently failing test, which clearly indicates a regression, a flaky test introduces noise and uncertainty. The root causes of flakiness are often subtle and can be notoriously difficult to pin down. They typically stem from dependencies on uncontrolled external factors. Martin Fowler's analysis of non-deterministic tests highlights several common culprits:

  • Asynchronicity and Race Conditions: This is arguably the most common cause. A test might make an asynchronous call (e.g., an API request or a database write) and then immediately assert a result without properly waiting for the operation to complete. Depending on network latency or thread scheduling, the assertion might run before or after the operation finishes, leading to inconsistent results (a sketch of this failure mode appears after this list).
  • Infrastructure and Environment Issues: The test environment is not a pristine laboratory. Flakiness can be introduced by network hiccups, database connection timeouts, container startup delays, or resource contention (CPU, memory) on the CI runner. A test that passes reliably on a powerful developer machine may fail intermittently under the constrained resources of a shared CI agent.
  • Order Dependency: Some tests implicitly rely on other tests running before them to set up a specific state. When tests are run in a different order (a common practice for parallelization), this implicit dependency is broken, and the test fails.
  • Concurrency: In multi-threaded applications, improper synchronization can lead to race conditions not just in the application code, but within the test itself. Shared state between parallel test executions is a classic recipe for flakiness.
  • Third-Party Dependencies: Tests that rely on external APIs, services, or even the system clock (DateTime.Now) are susceptible to flakiness. The third-party service could be down, rate-limiting requests, or returning unexpected data. Relying on the current time is problematic because elapsed-time assertions depend on how fast the test happens to run, and date logic can break when a run straddles midnight, a month boundary, or a daylight-saving change.
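
To make the asynchronicity category concrete, here is a minimal Jest-style sketch. The saveOrder and getOrder helpers are hypothetical stand-ins for any asynchronous write followed by a read; the first test races the write behind a fixed delay, while the second awaits it and is deterministic.

    // Hypothetical async helpers: saveOrder writes to a store, getOrder reads it back.
    const { saveOrder, getOrder } = require('./orders');

    it('flaky: hopes a fixed delay is long enough', async () => {
      saveOrder({ id: 'abc' });                                // write is not awaited
      await new Promise((resolve) => setTimeout(resolve, 50)); // 50ms may or may not suffice
      expect(await getOrder('abc')).toBeTruthy();              // fails whenever the write is slow
    });

    it('stable: waits for the operation it depends on', async () => {
      await saveOrder({ id: 'abc' });                          // deterministic: the write completes first
      expect(await getOrder('abc')).toBeTruthy();
    });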

Identifying the category of flakiness is the first step in the triage process. As documented in a Microsoft Research paper, understanding these common patterns dramatically accelerates the debugging process. The paper found that improper waiting on asynchronous operations ('async wait' issues) accounted for a significant share of the flaky tests in the analyzed projects, underscoring the importance of mastering asynchronous test patterns.

The High Cost of Indecision: Why You Must Triage Flaky Tests Aggressively

Ignoring flaky tests or simply hitting 'rerun' accrues high-interest technical debt. The accumulated cost goes far beyond the annoyance of a failed build. The decision not to triage flaky tests immediately has compounding negative effects on the entire engineering organization.

1. Erosion of Trust and 'Test Blindness' When the CI pipeline fails frequently due to flakiness, developers begin to lose trust in it. A red build ceases to be an urgent signal of a real problem and becomes background noise. This phenomenon, often called 'test blindness' or 'alert fatigue', is incredibly dangerous. Teams start to automatically assume a failure is 'just a flake' and rerun the build without investigation. This is precisely when a genuine, critical regression can slip through unnoticed into production. The core principle of Continuous Integration is to provide fast, reliable feedback, a principle that flaky tests directly undermine.

2. Decreased Development Velocity Every minute a developer spends waiting for a build to be rerun, investigating a false positive, or trying to merge a PR against a broken main branch is a minute not spent building features. GitHub's engineering blog has detailed how they invest heavily in CI stability because flaky tests are a direct impediment to developer productivity. A 10-minute build that fails 30% of the time doesn't just cost 10 minutes; it costs the context-switching time, the investigation time, and the delay for all other developers waiting to merge their own changes.

3. Masking of Real Bugs A flaky test can intermittently pass even when a real, underlying bug exists. For example, a race condition in the production code might only manifest under specific load conditions that the test only occasionally reproduces. By dismissing the test failure as 'flaky', the team misses the signal that there's a latent, critical bug waiting to happen in production. Properly triaging the test forces a deeper look that might uncover these hidden issues.

4. Increased Onboarding and Cognitive Load For new team members, a flaky test suite is a nightmare. They can't be sure if their changes broke the build or if they just got unlucky. This creates a culture of fear around making changes and significantly increases the cognitive load required to contribute. A clean, reliable test suite acts as living documentation and a safety net, both of which are compromised by flakiness. The effort required to triage flaky tests is an investment in team productivity and psychological safety, a concept explored in depth by Google's Project Aristotle research on team effectiveness.

A Step-by-Step Framework to Triage Flaky Tests

When faced with a failed test, resist the urge to immediately hit 'rerun'. Instead, follow a structured process to make an informed decision. This framework will guide you through how to triage flaky tests systematically.

Step 1: Isolate and Reproduce The first goal is to confirm the flakiness. A test that fails 100% of the time on a specific commit is not flaky; it's broken. Your task is to see if you can make it both pass and fail.

  • Run it in a loop: The simplest method is to run the specific test multiple times. A shell script can be very effective for this; a runner-level retry (sketched after this list) provides the same signal from inside Jest.
    # Example for a Jest test
    for i in {1..20}; do
      echo "Run $i"
      npx jest --testNamePattern="My Flaky Test" || echo "---> FAILED on run $i"
    done
  • Run it locally vs. CI: Try to reproduce the failure on your local machine. If it only fails in the CI environment, this points towards environmental differences (e.g., resource constraints, network configuration, different service versions).
  • Analyze the CI artifacts: Do not discard the failed run. Download the logs, screenshots, video recordings, and any other artifacts generated by your test runner. These are your primary clues. A tool like Playwright's Trace Viewer is invaluable here, as it provides a complete DOM snapshot, console logs, and network requests for every step of the test.
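
If you prefer to stay inside the test runner, Jest's retry support gives a complementary signal. This is a minimal sketch assuming the default jest-circus runner: a test that fails and then passes on an in-process retry of identical code is, by definition, flaky.

    // Retry failing tests in this file up to 3 times before reporting a failure.
    // A pass on retry against the same code is direct evidence of flakiness.
    jest.retryTimes(3);

    it('suspected flaky test', async () => {
      // original test body goes here unchanged
    });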

Step 2: Gather Context and Historical Data A single test failure is a data point. A pattern of failures is evidence.

  • Check Git History: Who was the last person to touch this test file or the related application code? What changes were included in the PR that merged just before the failures started?
  • Review CI History: Look at the history for this specific test. Is this its first failure? Does it fail once a day? Does it only fail on Tuesdays? Modern CI platforms and test analytics tools can often provide a 'flakiness score' or history for each test, as discussed in best practices from CircleCI.
  • Check for Infrastructure Changes: Was there a recent deployment of a downstream service? Was the base container image for the CI runners updated? Was there a database migration? A change outside the application repository is a common external cause.

Step 3: Analyze the Failure Pattern and Hypothesize Now, dig into the 'how'. Examine the artifacts from the failed runs and look for patterns.

  • Timing-Related? (Async/Race Condition): Does the failure message indicate an element was not found, but the screenshot shows it's clearly there? This often means the assertion ran a millisecond too early. Does the test fail more often when the system is under load? This points to a race condition.
    • Hypothesis: The test isn't properly waiting for an API call to resolve before asserting on the UI (one way to confirm this is sketched after this list).
  • Data-Related? (State Contamination): Does the test fail when run after a specific other test? Does it use a hardcoded value (like a username test-user-1) that might be in a dirty state from a previous run?
    • Hypothesis: The test relies on a clean database state, but a previous test is not cleaning up after itself.
  • Environment-Related? (Resource Contention): Does the test only fail when the full suite is run in parallel on CI, but never when run in isolation?
    • Hypothesis: The test is sensitive to CPU or memory pressure, causing a timeout that doesn't occur locally.
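
One way to promote the timing hypothesis from a guess to a diagnosis is to exaggerate the latency the test races against. The sketch below is hedged: ./api, fetchItems, loadDashboard, and getDashboardState are hypothetical names standing in for your real module and test body. If the occasional failure becomes a failure on every run once the mocked call is slowed down, the missing await is confirmed.

    // Auto-mock the dependency the test is suspected of racing against.
    jest.mock('./api');
    const api = require('./api');
    const { loadDashboard, getDashboardState } = require('./dashboard');

    it('turns an intermittent failure into a deterministic one', () => {
      // Slow the mocked call to 500ms. If the assertion below now fails on
      // every run, the test was never waiting for the fetch to resolve.
      api.fetchItems.mockImplementation(
        () => new Promise((resolve) => setTimeout(() => resolve(['item']), 500))
      );

      loadDashboard();                                     // kicks off the async fetch
      expect(getDashboardState().items).toHaveLength(1);   // the original, now reliably failing, assertion
    });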

Step 4: The Verdict - Classify and Act Based on your investigation, you must make a call. There are three primary outcomes:

  • Verdict: Confirmed Bug. The test is failing consistently because of a genuine regression in the application code. The test is doing its job. Action: Do not merge. Revert the offending commit or create a high-priority ticket to fix the bug immediately. The test itself should not be changed.
  • Verdict: Confirmed Flaky Test. You have reproduced both a pass and a fail under the same commit and have a strong hypothesis about the source of non-determinism (e.g., a race condition). Action: Quarantine and Fix. The immediate priority is to unblock the pipeline. Disable or quarantine the test (e.g., move it to a separate 'quarantine' job that doesn't block merges; a minimal example appears after this list). Create a new, dedicated tech debt ticket to fix the test's flakiness. This ticket should include all your triage notes, logs, and reproduction steps. According to Google's testing philosophy, a flaky test is often treated with the same severity as a broken one and must be removed from the critical path immediately.
  • Verdict: Inconclusive. You cannot reliably reproduce the failure, and the cause is not obvious after a time-boxed investigation (e.g., 30-60 minutes). Action: Isolate and Observe. Rerun the blocking job to get the merge through, but immediately create a ticket to investigate further. Add more detailed logging or instrumentation around the suspicious test to gather more data on its next failure. Do not let it be forgotten.
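
What quarantining looks like varies by team, but a minimal version in Jest is to skip the test while leaving a loud pointer to the follow-up ticket (FLAKY-1234 is a hypothetical ticket ID), so the pipeline is unblocked without the problem being silently forgotten. Teams with a dedicated quarantine job typically tag or relocate the test instead of skipping it, so it keeps running without blocking merges.

    // Quarantined: intermittent failure under parallel CI runs. See FLAKY-1234
    // for triage notes, logs, and reproduction steps before re-enabling.
    it.skip('updates the dashboard after saving a report', async () => {
      // original test body left intact so the eventual fix starts from real code
    });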

Advanced Techniques and Tooling for Flaky Test Triage

For persistent and hard-to-diagnose flaky tests, standard triage methods may not be enough. Engineering teams can adopt more advanced strategies and leverage specialized tools to gain the upper hand.

Leveraging Deterministic Fakes and Mocks One of the most powerful strategies to eliminate flakiness is to remove sources of non-determinism. Instead of relying on a live database or a real network call to a third-party service, use deterministic fakes.

  • Mock Servers: Instead of hitting a real API, your test can hit a local mock server (like msw or nock) that returns predictable responses instantly. This eliminates network latency and service availability as variables (a sketch follows the timer example below).
  • Time Manipulation: For tests involving timeouts or time-sensitive logic, use libraries that allow you to control the clock. For example, in JavaScript, sinon.js or Jest's built-in timer mocks can be used to advance time programmatically, making setTimeout or setInterval behavior perfectly predictable.

    // Jest example of controlling time
    it('should call the callback after 1 second', () => {
      jest.useFakeTimers();
      const callback = jest.fn();
    
      // myFunctionWithTimeout is a hypothetical function assumed to schedule `callback` via setTimeout(callback, 1000)
      myFunctionWithTimeout(callback);
    
      // Fast-forward time by 1000ms
      jest.advanceTimersByTime(1000);
    
      expect(callback).toHaveBeenCalledTimes(1);
      jest.useRealTimers();
    });
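
As a concrete illustration of the mock-server approach above, here is a hedged sketch using msw's Node integration (the handler syntax shown is the v1 rest API; v2 differs). The endpoint and payload are hypothetical.

    const { rest } = require('msw');
    const { setupServer } = require('msw/node');

    // Every request to this endpoint gets an instant, predictable response,
    // removing network latency and third-party availability as variables.
    const server = setupServer(
      rest.get('https://api.example.com/users/:id', (req, res, ctx) =>
        res(ctx.status(200), ctx.json({ id: req.params.id, name: 'Test User' }))
      )
    );

    beforeAll(() => server.listen());
    afterEach(() => server.resetHandlers());
    afterAll(() => server.close());

    it('loads the user profile without touching the real network', async () => {
      // test body exercises application code that calls the endpoint above
    });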

Test Analytics and Observability Platforms Modern development practices advocate for treating your test suite as a production system that requires monitoring. Several platforms have emerged to help triage flaky tests proactively.

  • Flaky Test Detection: Tools like Buildkite's Test Analytics or dedicated platforms like Trunk can automatically identify tests with a history of flip-flopping between pass and fail. They analyze test results over time to surface the most problematic tests, allowing you to prioritize fixing them before they block a critical release. These platforms often provide rich dashboards showing failure rates, timing, and historical context.
  • Integration with Observability: Connecting test failures to your application's observability platform (e.g., DataDog, Honeycomb) can be a game-changer. When a test fails, you can automatically correlate it with application traces, logs, and metrics from that exact moment. This can reveal underlying issues like database query bottlenecks or resource exhaustion that were the true root cause of the test failure. The concept of Observability-Driven Development extends this idea to the entire development lifecycle, including testing.

Code-Level Strategies for Robustness Finally, some flakiness can be addressed by writing more resilient tests.

  • Explicit Waits over Fixed Delays: Never use a fixed sleep like Thread.sleep(2000). This is a primary source of flakiness: it either waits too long, slowing down the suite, or not long enough, causing a failure. Instead, use explicit waits that poll for a condition with a timeout. All major testing frameworks (Selenium, Cypress, Playwright) provide robust APIs for this (see the sketch after this list).
  • Atomic and Idempotent Tests: Each test should be a self-contained unit. It should set up its own required state and, crucially, clean up after itself. Tests should not depend on the state left behind by other tests. This principle, detailed in Microsoft's testing best practices documentation, is fundamental to creating a stable, parallelizable test suite.
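
For example, here is a hedged Playwright sketch of an explicit wait replacing a fixed delay; the URL and UI text are hypothetical.

    const { test, expect } = require('@playwright/test');

    test('shows a confirmation after saving', async ({ page }) => {
      await page.goto('https://app.example.com/reports'); // hypothetical app URL
      await page.getByRole('button', { name: 'Save' }).click();

      // Instead of a fixed delay like page.waitForTimeout(2000), poll for the
      // actual condition with an explicit upper bound.
      await expect(page.getByText('Report saved')).toBeVisible({ timeout: 10000 });
    });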

The battle against flaky tests is a continuous process, not a one-time fix. It requires a cultural shift where the entire team takes ownership of test suite health. A red build should be treated as a genuine emergency, and the process to triage flaky tests must be as disciplined and rigorous as the process for triaging production bugs. By implementing a systematic triage framework—isolating, gathering context, hypothesizing, and acting decisively—teams can transform their CI pipeline from a source of frustration into a reliable, trusted asset. The initial time investment in thorough triage pays exponential dividends in increased developer velocity, higher code quality, and greater confidence in deployments. Stop rerunning failed jobs and start investigating. Your future self, and your entire team, will thank you.
