# Build vs. buy

> What the testing platform you build in-house has to own as your app changes, and which parts Momentic provides built in.

Writing a first E2E test is cheap. The expensive part is the testing platform
you build and run around those tests to keep them reliable as the product
changes. **It becomes a system your team maintains for the life of the
product.** This page lays out what that platform has to do, so you can scope it
against adopting Momentic.

The question is not whether you can build it. With current models, you can. The
real question is whether you want a team building, running, and maintaining an
internal testing product indefinitely instead of shipping your own product.
Getting tests to pass once is easy; keeping them reliable enough to gate
releases on is the hard part, and that is almost all of the cost.

## What you actually own

Authoring tests is the part most people picture. The platform you build around
them, the one you gate releases on, also has to:

* **Replay fast, and heal when it can't.** Calling an LLM on every step is slow,
  non-deterministic, and costs money per action. Cache the resolved locator and
  it is fast again, until the UI moves and the cached locator breaks. You own
  the logic that decides when to replay, when to re-resolve, and how to confirm
  the result was right.
* **Survive UI changes.** A redesign, a renamed ID, a moved button: each one
  breaks locators, and someone has to fix them. That work grows with the app,
  not the team.
* **Separate real bugs from flakes.** If a failing build might be noise, people
  stop trusting it, and a platform no one trusts cannot gate releases.
* **Keep up with the product.** New behavior needs new tests. Skip it and
  coverage falls behind.
* **Run as a service.** Runner, assertions, sharding, retries, reporting,
  browser and driver upgrades, mobile. Operating all of it is a job on its own.
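
The "separate real bugs from flakes" responsibility above can be sketched as a rerun policy. This is a minimal illustration under stated assumptions, not Momentic's classifier: `run_test`, the rerun count, and the three labels are all hypothetical.

```python
from typing import Callable


def triage(run_test: Callable[[], bool], reruns: int = 3) -> str:
    """Label one test on an unchanged commit as 'pass', 'flaky', or 'failing'.

    run_test is a hypothetical callable that returns True when the test passes.
    """
    if run_test():
        return "pass"
    # The first attempt failed, so rerun against the same build.
    results = [run_test() for _ in range(reruns)]
    if any(results):
        return "flaky"  # Same code, different outcomes: noise, not a regression.
    return "failing"  # Consistent failure: worth a person's attention.
```

Even this small policy carries cost: every rerun adds runtime, and a consistent failure still needs someone to decide whether the app or the test is wrong.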

<Note>
  AI makes authoring tests cheaper. It does little for maintenance, triage, or
  operation, which are the recurring costs.
</Note>

## The runtime-AI tradeoff

Running an LLM inside the test is useful: it adapts when the page changes. It is
also expensive, since you pay for a model call on every action. In practice you
are left tuning one dial:

|            | Adapts to change | Speed and cost      | Determinism                               |
| ---------- | ---------------- | ------------------- | ----------------------------------------- |
| **AI**     | Yes              | Slow, per-action \$ | Non-deterministic                         |
| **Cached** | No               | Fast, near-free     | Deterministic; breaks when the UI changes |

Tuning that dial (replay when safe, re-resolve only when needed) and then
confirming a passing run is actually correct is work most teams underestimate.
For the tool-specific version of this tradeoff, see the
[Stagehand](/comparisons/stagehand) and
[Playwright MCP](/comparisons/playwright-mcp) comparisons.
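
The dial described above can be sketched as a cache-first resolver. This is a rough sketch, not how any particular tool implements it: `locator_exists` and `ask_model` are hypothetical stand-ins for a page probe and an LLM call, and the cache is a plain dictionary.

```python
from typing import Callable, Dict


def resolve_step(
    description: str,
    cache: Dict[str, str],
    locator_exists: Callable[[str], bool],
    ask_model: Callable[[str], str],
) -> str:
    """Return a locator for a step, calling the model only on a cache miss.

    locator_exists and ask_model are hypothetical helpers, not a real API.
    """
    cached = cache.get(description)
    if cached is not None and locator_exists(cached):
        return cached  # Fast path: deterministic replay, no model call.
    # Slow path: the locator is missing or stale, so pay for one model call.
    fresh = ask_model(description)
    cache[description] = fresh
    return fresh
```

Note what the sketch leaves out: a cached locator that still exists may now point at the wrong element, so a check that the step did the right thing is still needed on top of this.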

## Estimating the cost

Most of this cost is engineer time, so it is straightforward to size. The build
is a one-time number. Maintenance, triage, and operation repeat every year and
grow with the app.

| Cost                 | How to size it                                                                                  |
| -------------------- | ----------------------------------------------------------------------------------------------- |
| Build (v1)           | Engineers × months to reach a platform you trust, amortized over its life.                      |
| Maintenance + triage | Engineer-hours per week fixing broken tests and sorting real failures from flakes, times 52.    |
| Operation            | Engineer-hours per week on the framework itself (runner, sharding, upgrades, mobile), times 52. |

Say fixing broken tests and triaging failures takes **one engineer-day per
week**, and operating the framework another **half-day**. At 8 hours per day,
that is **12 hours per week**, or **624 hours per year**. At a fully-loaded
**\$150 per hour**, that is roughly **\$94k per year**, before the build, and
it grows as you add tests.
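
The arithmetic above can be reproduced with a few lines, which makes it easy to swap in your own figures. The 8-hour engineer-day and the function name are assumptions made for this sketch.

```python
from typing import Tuple

HOURS_PER_DAY = 8  # Assumed length of an engineer-day.


def annual_cost(
    days_per_week: float, hourly_rate: float, weeks: int = 52
) -> Tuple[float, float]:
    """Return (hours per year, cost per year) for a recurring weekly effort."""
    hours = days_per_week * HOURS_PER_DAY * weeks
    return hours, hours * hourly_rate


# One day of maintenance and triage plus half a day of operation per week.
hours, dollars = annual_cost(days_per_week=1.5, hourly_rate=150)
print(f"{hours:.0f} hours/year, ${dollars:,.0f}/year")
# prints "624 hours/year, $93,600/year"
```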

<Note>
  These numbers are illustrative. Use the bake-off below to replace them with
  measured ones from your own app.
</Note>

## What Momentic provides built in

Each row below is its own long-lived project if you build it in-house.

| Capability                                                    | What it does                                                                                                                                                       |
| ------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [Step cache](/reliability/step-cache)                         | Replays a step's resolved locator with no LLM call until something changes, so steady-state runs cost about the same as plain Playwright.                          |
| [In-run auto-heal](/reliability/auto-heal)                    | Re-resolves a locator from its natural-language description on a cache miss and waits for the page to settle before acting, so the run recovers on its own.        |
| [Post-run heal agent](/guides/auto-heal/in-ci)                | After a failed run, rewrites the failing tests and opens a pull request or patch, so test fixes come to you as a code review.                                      |
| [Failure classification](/cli-reference/momentic/commands/ai) | Sorts a failed run into a category (bug, application change, test issue, infra) with reasoning, so you can tell a real failure from a flaky or broken test.        |
| [Explore agent](/ai/explore) (beta)                           | Reads a diff, finds the user journeys that changed, and authors or edits tests to cover them, so new behavior gets tests without anyone remembering to write them. |
| [App graph](/ai/app-graph) (beta)                             | A coverage model of your user journeys built from run traces, so you can see what is covered, partial, or missing instead of guessing from a test list.            |
| [Knowledge base](/ai/knowledge-base) (beta)                   | Org-level memory of your terminology, rules, and flows, retrieved per AI step so steps resolve more accurately as it learns your product.                          |

## When building in-house makes sense

It is the right call in a few cases:

<CardGroup cols={3}>
  <Card title="A small, stable test suite" icon="list-check">
    A handful of flows that rarely change and that you can maintain by hand.
  </Card>

  <Card title="Hard constraints" icon="lock">
    OSS with no SaaS dependency, or compliance and isolation rules that rule out
    external tooling.
  </Card>

  <Card title="Testing as a product" icon="screwdriver-wrench">
    You want to own testing as a core competency and have the platform capacity
    to staff it.
  </Card>
</CardGroup>

If none of these hold, the recurring maintenance and reliability cost usually
outweighs what you save by building. Those savings are in authoring, which is
exactly where AI already helps, not in running and maintaining the platform.

## Decide with a bake-off

The way to settle this is to run both approaches on the same work.

<Steps>
  <Step title="Pick two hard cases">
    A recent UI redesign and a flow with a history of breaking. These are where
    maintenance cost actually shows up.
  </Step>

  <Step title="Build it both ways">
    Cover the same journeys in-house and in Momentic.
  </Step>

  <Step title="Run through real churn">
    Let both run for a few weeks of normal merges, not a one-time trial run.
  </Step>

  <Step title="Count human interventions">
    Momentic's heal and explore agents also edit tests, but those edits arrive
    as pull requests you review, so count the work a person actually had to do:

    * Commits whose only purpose was fixing tests.
    * Tests that broke and needed editing per UI change.
    * Failures fixed automatically vs. by hand.
    * Flake rate against an unchanged app.
    * Time to triage each failure, and whether that triage is automatic.
    * Engineer-hours per week spent operating the framework itself.
  </Step>
</Steps>
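
One way to keep the counts above comparable between the two approaches is to log them in the same shape and reduce them to a few ratios. The field names and the choice of ratios below are hypothetical, a sketch rather than a prescribed scorecard.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BakeoffLog:
    ui_changes: int          # UI changes merged during the window
    tests_edited: int        # Tests that needed editing because of them
    auto_fixed: int          # Failures fixed without a person
    hand_fixed: int          # Failures a person fixed
    unchanged_runs: int      # Runs against an app that did not change
    unchanged_failures: int  # Failures among those runs


def summarize(log: BakeoffLog) -> Dict[str, float]:
    """Reduce raw counts to ratios; a zero denominator yields 0.0."""
    fixed = log.auto_fixed + log.hand_fixed
    return {
        "tests_edited_per_ui_change": (
            log.tests_edited / log.ui_changes if log.ui_changes else 0.0
        ),
        "auto_fix_share": log.auto_fixed / fixed if fixed else 0.0,
        "flake_rate": (
            log.unchanged_failures / log.unchanged_runs
            if log.unchanged_runs
            else 0.0
        ),
    }
```

Fill in one `BakeoffLog` per approach per week; comparing the weekly summaries side by side shows whether each ratio is rising or falling.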

<Tip>
  Run the bake-off long enough to see the trend, not a single week. A short
  window usually favors building; what matters is whether maintenance per test
  rises or falls as the app grows.
</Tip>