Writing a first E2E test is cheap. The expensive part is the testing platform you build and run around those tests to keep them reliable as the product changes. It becomes a system your team maintains for the life of the product. This page lays out what that platform has to do, so you can scope it against adopting Momentic. The question is not whether you can build it. With current models, you can. The real question is whether you want a team building, running, and maintaining an internal testing product forever instead of shipping your own product. Getting tests to pass once is easy; keeping them reliable enough to gate releases on is the hard part, and that is almost all of the cost.

What you actually own

Authoring tests is the part most people picture. The platform you build around them, the one you gate releases on, also has to:
  • Replay fast, and heal when it can’t. Calling an LLM on every step is slow, non-deterministic, and costs money per action. Cache the resolved locator and it is fast again, until the UI moves and the cached locator breaks. You own the logic that decides when to replay, when to re-resolve, and how to confirm the result was right.
  • Survive UI changes. A redesign, a renamed ID, a moved button: each one breaks locators, and someone has to fix them. That work grows with the app, not the team.
  • Separate real bugs from flakes. If a failing build might just be noise, people stop trusting it, and a platform no one trusts cannot gate releases.
  • Keep up with the product. New behavior needs new tests. Skip it and coverage falls behind.
  • Run as a service. Runner, assertions, sharding, retries, reporting, browser and driver upgrades, mobile. Operating all of it is a job on its own.
AI makes authoring tests cheaper. It does little for maintenance, triage, or operation, which are the recurring costs.
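One common way to start separating noise from real failures is to rerun a failed test against the same build. The sketch below is an illustration under assumptions, not a complete classifier: `classify_failure` and the `run_test` callback are hypothetical names, and a test run is modeled as a function that returns `True` on a pass.

```python
from typing import Callable


def classify_failure(run_test: Callable[[], bool], reruns: int = 3) -> str:
    """Rerun a failed test on the same build.

    Returns "flake" if any rerun passes, otherwise "consistent".
    """
    for _ in range(reruns):
        if run_test():
            return "flake"
    return "consistent"


# Hypothetical test doubles: one fails once and then passes, one always fails.
outcomes = iter([False, True])
flaky_result = classify_failure(lambda: next(outcomes))
broken_result = classify_failure(lambda: False)
print(flaky_result, broken_result)  # prints: flake consistent
```

A "consistent" result only rules out noise. It does not say whether the cause is a product bug, an intentional UI change, or a stale locator, so it still needs triage.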

The runtime-AI tradeoff

Running an LLM inside the test is useful: it adapts when the page changes. It is also expensive, since you pay for a model call on every action. In practice you are left tuning one dial:

| | Adapts to change | Speed and cost | Determinism |
| --- | --- | --- | --- |
| AI | Yes | Slow, per-action $ | Non-deterministic |
| Cached | No | Fast, near-free | Deterministic; breaks when the UI changes |

Tuning that dial (replay when safe, re-resolve only when needed) and then confirming a passing run is actually correct is work most teams underestimate. For the tool-specific version of this tradeoff, see the Stagehand and Playwright MCP comparisons.
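The dial can be sketched as a small replay-or-re-resolve loop. This is a minimal illustration, not Momentic's implementation: `StepCache`, `run_step`, and the `resolve` callback (standing in for an LLM call) are hypothetical names, and the page is modeled as a plain set of selectors that currently exist.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class StepCache:
    """Maps a step description to the selector it last resolved to."""
    selectors: Dict[str, str] = field(default_factory=dict)
    llm_calls: int = 0  # how many times the slow path was taken


def run_step(
    description: str,
    page_selectors: Set[str],
    cache: StepCache,
    resolve: Callable[[str, Set[str]], str],
) -> str:
    """Return the selector to act on, replaying from cache while it is still present."""
    cached = cache.selectors.get(description)
    if cached is not None and cached in page_selectors:
        return cached  # fast, deterministic path: no model call

    # Cache miss or stale selector: take the slow path and re-resolve.
    cache.llm_calls += 1
    selector = resolve(description, page_selectors)
    if selector not in page_selectors:
        raise LookupError(f"could not resolve step: {description!r}")
    cache.selectors[description] = selector
    return selector


def fake_resolve(description: str, page_selectors: Set[str]) -> str:
    """Stand-in for an LLM: pick the selector sharing the most words with the description."""
    words = description.lower().split()
    return max(sorted(page_selectors), key=lambda s: sum(w in s for w in words))


cache = StepCache()
page_v1 = {"#checkout-btn", "#cart"}
page_v2 = {"#checkout-button", "#cart"}  # the ID was renamed in a redesign

first = run_step("checkout", page_v1, cache, fake_resolve)   # resolves: 1 call
second = run_step("checkout", page_v1, cache, fake_resolve)  # replays: still 1 call
third = run_step("checkout", page_v2, cache, fake_resolve)   # re-resolves: 2 calls
```

The presence check here is the weakest possible validation. A cached selector that still exists can point at the wrong element after a redesign, which is why confirming that a passing run is correct is a separate piece of work.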

Estimating the cost

Most of this cost is engineer time, so it is straightforward to size. The build is a one-time number. Maintenance, triage, and operation repeat every year and grow with the app.

| Cost | How to size it |
| --- | --- |
| Build (v1) | Engineers × months to reach a platform you trust, amortized over its life. |
| Maintenance + triage | Engineer-hours per week fixing broken tests and sorting real failures from flakes, times 52. |
| Operation | Engineer-hours per week on the framework itself (runner, sharding, upgrades, mobile), times 52. |

Say fixing broken tests and triaging failures takes one engineer-day (8 hours) per week, and operating the framework another half-day (4 hours). That is 12 hours per week, or 624 hours per year. At a fully loaded $150 per hour, that comes to about $94k per year, before the build, and it grows as you add tests.
These numbers are illustrative. Use the bake-off below to replace them with measured ones from your own app.
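The recurring part of the estimate is simple enough to keep as a small script and rerun with measured inputs. A minimal sketch, assuming an 8-hour engineer-day and 52 working weeks; the function name and parameters are illustrative, and the inputs are the example figures from this section.

```python
WEEKS_PER_YEAR = 52


def annual_recurring_cost(
    maintenance_hours_per_week: int,
    operation_hours_per_week: int,
    hourly_rate: int,
) -> tuple:
    """Return (hours per year, cost per year) for maintenance, triage, and operation."""
    hours = (maintenance_hours_per_week + operation_hours_per_week) * WEEKS_PER_YEAR
    return hours, hours * hourly_rate


# One engineer-day (8 h) of maintenance and triage, half a day (4 h) of operation.
hours, cost = annual_recurring_cost(8, 4, 150)
print(hours, cost)  # prints: 624 93600
```

The one-time build cost is not included; add it separately, amortized over the years you expect to run the platform.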

What Momentic gives you out of the box

Each row below is its own long-lived project if you build it in-house.

| Capability | What it does |
| --- | --- |
| Step cache | Replays a step’s resolved locator with no LLM call until something changes, so steady-state runs cost about the same as plain Playwright. |
| In-run auto-heal | Re-resolves a locator from its natural-language description on a cache miss and waits for the page to settle before acting, so the run recovers on its own. |
| Post-run heal agent | After a failed run, rewrites the failing tests and opens a pull request or patch, so test fixes come to you as a code review. |
| Failure classification | Sorts a failed run into a category (bug, application change, test issue, infra) with reasoning, so you can tell a real failure from a flaky or broken test. |
| Explore agent (beta) | Reads a diff, finds the user journeys that changed, and authors or edits tests to cover them, so new behavior gets tests without anyone remembering to write them. |
| App graph (beta) | A coverage model of your user journeys built from run traces, so you can see what is covered, partial, or missing instead of guessing from a test list. |
| Knowledge base (beta) | Org-level memory of your terminology, rules, and flows, retrieved per AI step so steps resolve more accurately as it learns your product. |

When building in-house makes sense

It is the right call in a few cases:

A small, stable suite

A handful of flows that rarely change and that you can maintain by hand.

Hard constraints

OSS with no SaaS dependency, or compliance and isolation rules that rule out external tooling.

Testing as a product

You want to own testing as a core competency and have the platform capacity to staff it.
If none of these hold, the recurring maintenance and reliability cost usually outweighs what you save by building. Those savings are in authoring, which is exactly where AI already helps, not in running and maintaining the platform.

Decide with a bake-off

The way to settle this is to run both approaches on the same work.
1. Pick two hard cases

A recent UI redesign and a flow with a history of breaking. These are where maintenance cost actually shows up.
2. Build it both ways

Cover the same journeys in-house and in Momentic.
3. Run through real churn

Let both run for a few weeks of normal merges, not a one-time trial run.
4. Count human interventions

Momentic’s heal and explore agents also edit tests, but those edits arrive as pull requests you review, so count the work a person actually had to do:
  • Commits whose only purpose was fixing tests.
  • Tests that broke and needed editing per UI change.
  • Failures fixed automatically vs. by hand.
  • Flake rate against an unchanged app.
  • Time to triage each failure, and whether triage was automatic or manual.
  • Engineer-hours per week spent operating the framework itself.
Run the bake-off long enough to see the trend, not a single week. A short window usually favors building; what matters is whether maintenance per test rises or falls as the app grows.
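To read the trend rather than a single week, the intervention counts can be logged per week and reduced to two numbers: the slope of manual fixes per test, and the share of fixes that needed no manual editing. A minimal sketch with hypothetical names (`WeekLog`, `slope`, `auto_fix_share`) and made-up data; run it once per approach and compare.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WeekLog:
    tests: int         # tests in the suite that week (assumed to be non-zero)
    manual_fixes: int  # test fixes a person had to make by hand
    auto_fixes: int    # fixes that needed no manual editing


def manual_fixes_per_test(weeks: List[WeekLog]) -> List[float]:
    """Manual maintenance load, normalized by suite size, for each week."""
    return [w.manual_fixes / w.tests for w in weeks]


def slope(values: List[float]) -> float:
    """Least-squares slope of values against week index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two weeks to see a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def auto_fix_share(weeks: List[WeekLog]) -> float:
    """Fraction of all fixes that were applied without manual editing."""
    auto = sum(w.auto_fixes for w in weeks)
    total = auto + sum(w.manual_fixes for w in weeks)
    return auto / total if total else 0.0


# Made-up numbers for illustration only.
weeks = [WeekLog(40, 4, 2), WeekLog(50, 4, 3), WeekLog(60, 3, 5), WeekLog(80, 2, 6)]
per_test = manual_fixes_per_test(weeks)
trend = slope(per_test)       # negative: manual work per test is falling
share = auto_fix_share(weeks)
```

A negative slope means maintenance per test is falling as the suite grows; a positive one means the approach gets more expensive with every test you add.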