Is This a Flaky Test or a Real Bug? A Developer's Guide to Triage

August 5, 2025

The dreaded red 'X' appears on your pull request. The continuous integration (CI) pipeline has failed. Your heart sinks for a moment, but then a familiar, cynical thought emerges: 'Is this a real bug, or just that flaky test again?' This moment of uncertainty is one of the most significant drains on modern software development productivity. Answering the question quickly and accurately is the core challenge of flaky test triage. Ignoring it allows instability to fester, eroding trust in your test suite and slowing down your entire team. A flaky test, which passes and fails intermittently without any underlying code change, can be more insidious than a consistently failing one. This guide provides a systematic framework for dissecting these failures, determining their root cause with confidence, and building a more resilient testing culture. We will move beyond simple re-runs and into the diagnostic mindset required to conquer test flakiness for good.

The High Cost of Indecision: Why Flaky Tests Cripple Development

Before diving into the 'how' of triage, it's crucial to understand the 'why'. Flaky tests are not a minor annoyance; they are a significant impediment to engineering velocity and product quality. A flaky test is formally defined as a test that exhibits non-deterministic behavior, passing at times and failing at others for the same version of the code. This unpredictability is their most damaging characteristic.

The primary casualty of test flakiness is trust. When developers can no longer trust the test suite's output, they begin to ignore it. A red build ceases to be an urgent signal of a regression and becomes background noise. This leads to a dangerous culture where developers reflexively re-run failed jobs, hoping for a green light. Research from Google Engineering highlights that this behavior is common, but it masks real issues and delays feedback loops. If a genuine bug is introduced, it might be dismissed as 'just another flake,' allowing it to slip through to production.

The economic impact is substantial. A Forrester report on developer productivity indirectly points to the cost of such interruptions, noting that context switching and resolving pipeline issues are major drains on focused work. Imagine a team of ten engineers where each person loses just 30 minutes a day to investigating or re-running flaky builds. This amounts to over 100 hours of lost productivity per month—time that could have been spent building features. Furthermore, flaky tests can artificially increase CI/CD costs due to repeated runs and the need for more powerful, but underutilized, build agents.

This erosion of trust and efficiency creates a vicious cycle. As the pipeline becomes less reliable, developers are less motivated to write new tests, fearing they will only add to the problem. The test suite's coverage stagnates or degrades, increasing the risk of shipping bugs. The process to triage flaky tests isn't just a technical exercise; it's a critical practice for maintaining a healthy and effective software development lifecycle. As noted in a McKinsey study on developer velocity, top-quartile companies excel because of superior tools and practices, and a stable, reliable test suite is a foundational element of that excellence. Ignoring flakiness is a direct choice to operate at a lower level of performance.

The Triage Mindset: First Principles for Effective Investigation

Effective triage begins not with a tool, but with a mindset. When faced with a non-deterministic failure, the immediate goal is to reduce uncertainty. A disciplined approach prevents wasted hours chasing ghosts and ensures that real issues receive the attention they deserve. The guiding principle should be: Assume it's a real bug until you have systematically proven it is a test flake. This approach forces rigor and prevents the premature dismissal of potentially critical failures.

Your initial investigation should follow three fundamental steps:

  1. Reproducibility Check: This is the first and most critical question. Can the failure be reproduced? And under what conditions? Don't just re-run the entire suite. Attempt to re-run only the single failed test. If it passes, run it in a loop (10, 50, or 100 times). The goal is to get a 'flakiness ratio.' Does it fail 1 in 10 times? Or 1 in 100? This data is invaluable. Document this ratio. Try to reproduce it on your local machine. If it only fails in the CI environment, you've already narrowed the problem space to environmental differences. This methodical approach is a core tenet of debugging, as described in foundational software texts like The Pragmatic Programmer.

  2. Scope Analysis: Is this an isolated incident or part of a larger pattern? Look at the build history. Has this specific test failed before? Are other, similar tests also failing? Does the failure occur on a specific operating system, browser version, or node in your CI cluster? Understanding the blast radius helps differentiate a localized test code issue from a systemic environmental problem. A failure across multiple, unrelated tests might point to an infrastructure issue, like a database connection pool being exhausted.

  3. Recent Changes Correlation: Failures rarely occur in a vacuum. The most likely culprit for a new flaky behavior is a recent change. Use version control to your advantage. What pull requests were merged just before the flakiness started? The git bisect command is an incredibly powerful tool for this. As explained in the official Git documentation, it can automatically perform a binary search through your commit history to pinpoint the exact commit that introduced the issue. This is often the fastest path to identifying a problematic code change, whether it's in the application logic or the test itself.
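
For instance, here is a minimal sketch of a bisect session that uses the failing test itself as the probe; the tag name and test path are placeholders for your own values:

# Start a bisect session between the current (bad) commit and a known-good point
git bisect start
git bisect bad HEAD
git bisect good v1.4.2   # placeholder: the last release or commit you trust

# Let Git drive the binary search: any command that exits non-zero marks a commit as bad
git bisect run npx jest --testPathPattern=tests/components/MyFlakyComponent.test.js

# Return to your original branch once the culprit commit is reported
git bisect reset

One caveat for intermittent failures: a single lucky pass can send the search down the wrong path, so consider wrapping the test command in a retry loop (like the one shown in the next section) and only treating a commit as good after several consecutive passes.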

Crucially, as soon as a test is identified as potentially flaky, create a ticket. This act of documentation is non-negotiable. The ticket should include the test name, a link to the failed build, the reproducibility ratio, and any initial findings from the scope and correlation analysis. This creates a record, prevents knowledge from being lost, and allows the team to prioritize the fix instead of allowing it to become part of the accepted 'tribal knowledge' of CI quirks. This aligns with Martin Fowler's observations that making difficult processes frequent and visible is the key to improving them.

The Flaky Test Triage Toolkit: A Step-by-Step Diagnostic Flow

With the right mindset established, you can now apply a systematic diagnostic process. This toolkit combines command-line techniques, code analysis, and strategic thinking to efficiently triage flaky tests and uncover their root cause. This flow moves from broad isolation to specific pattern analysis.

Step 1: Isolate and Amplify

The first step is to get the test to fail reliably, or at least more frequently. Running the entire test suite is slow and noisy. Isolate the single problematic test and run it repeatedly.

For example, using a shell command with a testing framework like Jest or Mocha:

# Run a specific test file 100 times to check for flakiness
for i in {1..100}; do
  echo "Running iteration $i"
  npx jest --testPathPattern=tests/components/MyFlakyComponent.test.js || echo "FAILURE on iteration $i"
done

This simple loop is your best friend. It confirms the flake, gives you a rough flakiness ratio (count the FAILURE lines), and provides a quick feedback cycle once you start attempting fixes. If the test fails on every iteration, you are probably looking at a real bug; if it fails only sporadically, you are on the trail of a flake.

Step 2: Analyze the Failure Signature

Not all failures are equal. The type of error provides crucial clues:

  • Timeout Error: This is a classic sign of an asynchronous issue. The test likely finished executing, but some background process (an API call, a timer, an animation) didn't complete within the test runner's time limit. Your investigation should focus on async/await usage, Promises, and event loops.
  • Assertion Failure with Varying Data: If a test asserts expect(result).toBe(5) but intermittently gets 4 or 6, this points to a race condition or a data pollution issue. Two processes might be writing to the same variable or database record concurrently, as illustrated in the sketch after this list.
  • Null Pointer / Undefined Reference: This often happens when an asynchronous operation to fetch data hasn't completed before the test tries to access a property on that data. The test logic is running ahead of the data it depends on.
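
To make the second signature above concrete, here is a minimal sketch of a race condition that produces varying assertion results; allocateId is a hypothetical helper, and the random delays stand in for real I/O:

it('allocates sequential ids', async () => {
  let nextId = 0;

  // Hypothetical helper: read-modify-write on shared state, with awaits in between
  async function allocateId() {
    await new Promise((resolve) => setTimeout(resolve, Math.random() * 5)); // simulated I/O
    const current = nextId;                                                 // read
    await new Promise((resolve) => setTimeout(resolve, Math.random() * 5)); // simulated I/O
    nextId = current + 1;                                                   // write
  }

  // Two concurrent callers: the interleaving of their delays decides the outcome
  await Promise.all([allocateId(), allocateId()]);

  // Passes when the calls happen to serialize; intermittently fails with nextId === 1
  // because both callers read nextId before either wrote it back
  expect(nextId).toBe(2);
});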

Step 3: Investigate Common Flaky Test Patterns

Most flaky tests fall into a few common categories. Systematically check for these anti-patterns in your test code.

  • Asynchronous Operations: This is the number one cause of flakiness in modern applications. Ensure every Promise-based operation is properly handled with async/await or .then(). A common mistake is forgetting to await a function that performs an update before making an assertion. Bad Example (potential flake):

    it('should update the user profile', async () => {
      // This function returns a Promise, but we don't wait for it
      updateUserProfile({ name: 'New Name' });
      const user = await getUserProfile();
      // This assertion can run before the update has been applied
      expect(user.name).toBe('New Name');
    });

    Good Example:

    it('should update the user profile', async () => {
      // Await the async operation to ensure it completes
      await updateUserProfile({ name: 'New Name' }); 
      const user = await getUserProfile();
      expect(user.name).toBe('New Name');
    });

    The MDN Web Docs on Promises are an essential resource for understanding these concepts deeply.

  • Test Isolation and Data Pollution: Tests should be atomic and independent. A test should not rely on the state created by a previous test. According to the Pytest documentation on fixtures, a core principle is setting up and tearing down state for each test individually. If one test modifies a shared resource (like a user in a database) and doesn't clean up after itself, it can cause subsequent tests to fail intermittently depending on execution order. A sketch of per-test cleanup appears after this list.

  • Environment Inconsistency: Check for differences between your local setup and the CI environment. This can include environment variables, network latency simulation, system time zones, or even CPU/memory resources. A test that passes locally where an API responds in 50ms may time out in CI where the same call takes 500ms.

  • External Service Dependencies: Tests that rely on live, third-party APIs are inherently fragile. The service could be down, rate-limiting you, or returning unexpected data. The best practice is to mock these external dependencies. Tools like nock for Node.js or unittest.mock for Python allow you to create stable, predictable responses for your tests, as detailed in many API development best practices.
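
To illustrate the last pattern, here is a minimal sketch of mocking an external API with nock; it assumes a Node test environment with axios for the HTTP call, and the endpoint and fetchUser helper are hypothetical:

const nock = require('nock');
const axios = require('axios');

// Hypothetical application code under test: fetches a user from a third-party API
async function fetchUser(id) {
  const response = await axios.get(`https://api.example.com/users/${id}`);
  return response.data;
}

afterEach(() => {
  // Remove any leftover interceptors so one test cannot leak into the next
  nock.cleanAll();
});

it('returns the user from the external API', async () => {
  // Intercept the outbound request and return a canned, deterministic response
  nock('https://api.example.com')
    .get('/users/1')
    .reply(200, { id: 1, name: 'Ada' });

  const user = await fetchUser(1);
  expect(user.name).toBe('Ada');
});

Because the response is now fully under the test's control, the result no longer depends on network conditions or the third party's uptime.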
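
And returning to the test isolation pattern above, here is a Jest-style sketch of per-test setup and teardown; createUser, deleteAllUsers, and countUsers are hypothetical helpers around your own test database:

// Hypothetical helpers around a shared test database
const { createUser, deleteAllUsers, countUsers } = require('./testDb');

beforeEach(async () => {
  // Start every test from a known-empty state, not whatever the previous test left behind
  await deleteAllUsers();
});

it('creates a user', async () => {
  await createUser({ name: 'Ada' });
  expect(await countUsers()).toBe(1);
});

it('starts with no users', async () => {
  // Passes regardless of execution order because of the beforeEach cleanup
  expect(await countUsers()).toBe(0);
});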

Step 4: Leverage Advanced Tooling

When manual inspection fails, turn to more powerful tools. Observability platforms like Honeycomb or Datadog can provide distributed traces that show the exact lifecycle of a request through your application, often revealing hidden timing issues. Furthermore, many modern CI platforms, like CircleCI or Buildkite, have built-in analytics that can automatically detect and flag your flakiest tests, helping you focus your efforts where they are most needed.

Differentiating Flakes from Heisenbugs and Real Defects

The process to triage flaky tests sometimes uncovers issues that are more complex than a simple test implementation error. It's important to be able to distinguish between a standard flaky test, a 'Heisenbug', and a straightforward, reproducible bug.

  • Reproducible Bug: This is the simplest case. The test fails 100% of the time under a specific set of conditions. The cause is a clear defect in the application code. Your triage process ends quickly, and a standard bug-fixing workflow begins.

  • Flaky Test: As we've discussed, this fails intermittently for the same code version. The root cause is typically in the test's implementation or its interaction with its environment (e.g., async timing, test data, external services). The application code itself is often correct, but the test is a poor verifier of that correctness.

  • Heisenbug: This is a more subtle and challenging category. Named after the Heisenberg Uncertainty Principle, a Heisenbug is a bug in the production code that changes its behavior or disappears when you try to study it. For example, adding log statements or attaching a debugger can change the timing of concurrent operations, causing a race condition to vanish. These are often complex concurrency or memory-related issues. A flaky test can be a symptom of a Heisenbug. The test fails intermittently because it's managed to create the precise, rare conditions needed to trigger the underlying bug in the application code.

So how do you tell them apart? Here's a decision framework:

  1. Can you fix the failure by only changing the test code? If you can make the test robust by adding a proper await, improving a selector, or mocking a dependency, you were dealing with a standard flaky test. The application code was sound.

  2. Does the failure disappear when you add debugging tools? If adding console.log statements or running the code in a debugger makes the test pass consistently, you are likely looking at a Heisenbug. The act of observation is affecting the outcome. This points to a deep issue in the application code, very likely a race condition. Academic research on non-deterministic bugs confirms these are among the hardest to diagnose and often require specialized tools for concurrency analysis.

  3. Is the failure tied to specific data inputs or sequences? If you find that the test only fails when a specific value is used or after a particular sequence of operations, it might be a complex, but ultimately reproducible, bug. The flakiness was just a result of the test randomly hitting that specific edge case.

Your triage process is valuable regardless of the outcome. By attempting to stabilize a 'flaky test,' you may uncover a much deeper Heisenbug in your application that would have been nearly impossible to find otherwise. This is why the initial principle—'assume it's a real bug'—is so powerful. It pushes you to investigate deeply enough to differentiate between a superficial test issue and a critical application defect. As noted by experts at observability firms like Lightstep, understanding these complex failure modes is essential for building reliable distributed systems.

Beyond Triage: Building a Culture of Test Stability

Reactively triaging flaky tests is a necessary skill, but the ultimate goal is to create an environment where they rarely occur. This requires a cultural shift from a purely reactive to a proactive approach to test quality. It's about instilling shared ownership and implementing policies that prioritize stability.

Here are key strategies to build this culture:

  • Quarantine and Prioritize: When a test is confirmed as flaky, don't just leave it to disrupt the pipeline. As practiced by engineering teams at Spotify and other top companies, the test should be immediately quarantined. This means temporarily disabling it (e.g., renaming it to xtest or marking it with test.skip) and creating a high-priority ticket; a minimal example follows this list. The key is to attach a strict Service Level Agreement (SLA) to these tickets. A flaky test should not be allowed to remain in quarantine for more than a few days or one sprint. This unblocks the CI pipeline for everyone else while ensuring the issue is not forgotten.

  • Implement a Flakiness Budget: Treat test debt like any other form of technical debt. Allocate a specific percentage of each sprint's engineering capacity—a 'flakiness budget'—to fixing quarantined tests and improving test infrastructure. This formalizes the commitment to stability and prevents it from being perpetually de-prioritized in favor of new features. This approach is championed by DevOps thought leaders on platforms like InfoQ as a way to balance velocity with quality.

  • Promote Ownership: Tests without clear owners are destined to become flaky. Assign ownership of test suites to the same teams that own the corresponding feature code. When a test in their suite becomes flaky, they are responsible for the triage and fix. This aligns responsibility with expertise and fosters a sense of pride in the quality of the team's tests, a practice often highlighted by the Netflix engineering team.

  • Educate and Standardize: Proactively educate developers on the common causes of flakiness. Hold brown-bag sessions on writing stable asynchronous tests, proper test data management, and effective mocking strategies. Provide clear, documented standards and examples in your project's repository. A well-written test should be as clean and readable as production code.
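
As a concrete example of the quarantine step above, the change can be as small as the following; the ticket reference and SLA note are placeholders for your own process:

// Quarantined 2025-08-05: fails roughly 1 in 20 runs, tracked in FLAKY-123 (placeholder ticket)
// SLA: fix or delete within one sprint
test.skip('should update the user profile', async () => {
  await updateUserProfile({ name: 'New Name' });
  const user = await getUserProfile();
  expect(user.name).toBe('New Name');
});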

By adopting these practices, you transform the triage of flaky tests from a recurring fire drill into a rare, well-defined process. The focus shifts from fixing broken windows to building a solid foundation, ensuring your test suite remains a valuable asset rather than a source of frustration.

The challenge of a failing test pipeline is a pivotal moment for any development team. Choosing to dismiss it as 'just a flake' is a step towards diminished productivity and eroding quality. Conversely, embracing the challenge with a structured approach to triaging flaky tests reinforces a culture of excellence and accountability. The distinction between a flaky test, a Heisenbug, and a real defect is not merely academic; it is fundamental to building reliable software. By adopting a rigorous mindset, utilizing a systematic diagnostic toolkit, and proactively fostering a culture of test stability, you can transform your CI/CD pipeline from a source of anxiety into a trusted guardian of your codebase. The goal is not just to get a green build, but to have unwavering confidence in what that green build represents: high-quality, well-tested software, ready to be delivered to your users.

