What you actually own
Authoring tests is the part most people picture. The platform you build around them, the one you gate releases on, also has to:

- Replay fast, and heal when it can’t. Calling an LLM on every step is slow, non-deterministic, and costs money per action. Cache the resolved locator and it is fast again, until the UI moves and the cached locator breaks. You own the logic that decides when to replay, when to re-resolve, and how to confirm the result was right.
- Survive UI changes. A redesign, a renamed ID, a moved button: each one breaks locators, and someone has to fix them. That work grows with the app, not the team.
- Separate real bugs from flakes. If a failing build might just be noise, people stop trusting it, and a platform no one trusts cannot gate releases.
- Keep up with the product. New behavior needs new tests. Skip it and coverage falls behind.
- Run as a service. Runner, assertions, sharding, retries, reporting, browser and driver upgrades, mobile. Operating all of it is a job on its own.
AI makes authoring tests cheaper. It does little for maintenance, triage, or
operation, which are the recurring costs.
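
The replay-or-re-resolve decision from the first bullet can be sketched in a few lines. This is a minimal illustration, not Momentic's implementation: `StepRunner`, `resolve`, `act`, `verify`, `LocatorNotFound`, and `StepFailed` are hypothetical names, and `resolve` stands in for whatever slow path (such as an LLM call) turns a step description into a locator.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


class LocatorNotFound(Exception):
    """Hypothetical error: the locator matches nothing on the current page."""


class StepFailed(Exception):
    """Hypothetical error: the action ran but the expected result did not appear."""


@dataclass
class StepRunner:
    resolve: Callable[[str], str]   # slow path: description -> locator (e.g. an LLM call)
    act: Callable[[str], None]      # performs the action; raises LocatorNotFound if stale
    verify: Callable[[str], bool]   # confirms the step had the intended effect
    cache: Dict[str, str] = field(default_factory=dict)
    resolve_calls: int = 0          # how many times the slow path was used

    def run(self, description: str) -> str:
        """Run one step and report whether it was "replayed" or "resolved"."""
        outcome = "replayed"
        locator = self.cache.get(description)
        if locator is not None:
            try:
                self.act(locator)                  # fast path: no resolve call
            except LocatorNotFound:
                self.cache.pop(description, None)  # cached locator is stale
                locator = None
        if locator is None:
            outcome = "resolved"
            self.resolve_calls += 1
            locator = self.resolve(description)
            self.act(locator)                      # a second miss is a genuine failure
        if not self.verify(description):
            self.cache.pop(description, None)      # never keep a locator that gave a wrong result
            raise StepFailed(description)
        self.cache[description] = locator
        return outcome


# Toy page: the Submit button was renamed from "#submit" to "#submit-v2".
present = {"#submit-v2"}
clicked = []


def fake_act(locator: str) -> None:
    if locator not in present:
        raise LocatorNotFound(locator)
    clicked.append(locator)


runner = StepRunner(
    resolve=lambda description: "#submit-v2",
    act=fake_act,
    verify=lambda description: bool(clicked) and clicked[-1] in present,
    cache={"click Submit": "#submit"},  # stale entry from before the rename
)
first = runner.run("click Submit")
second = runner.run("click Submit")
print(first, second, runner.resolve_calls)  # resolved replayed 1
```

The first run pays for one re-resolve because the cached locator is stale; the second replays from cache. The hard part in practice is `verify`: without it, a cached locator that still matches the wrong element replays quickly and silently.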
The runtime-AI tradeoff
Running an LLM inside the test is useful: it adapts when the page changes. It is also expensive, since you pay for a model call on every action. In practice you are left tuning one dial:

| Approach | Adapts to change | Speed and cost | Determinism |
|---|---|---|---|
| AI | Yes | Slow, per-action $ | Non-deterministic |
| Cached | No | Fast, near-free | Deterministic; breaks when the UI changes |
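
One way to reason about the dial is the share of steps that fall through to the model. The sketch below is back-of-the-envelope arithmetic, not a pricing model: `expected_llm_cost_per_run` is a hypothetical helper, and the step count and per-call price are made-up placeholders to replace with your own.

```python
def expected_llm_cost_per_run(steps: int, miss_rate: float, cost_per_call: float) -> float:
    """Expected model spend for one run.

    miss_rate is the fraction of steps that need a model call:
    1.0 means AI on every step, 0.0 means every step replays from cache.
    """
    if steps < 0 or cost_per_call < 0:
        raise ValueError("steps and cost_per_call must be non-negative")
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return steps * miss_rate * cost_per_call


# Placeholder inputs: a 40-step test and a made-up price of $0.01 per model call.
for miss_rate in (1.0, 0.25, 0.0):
    cost = expected_llm_cost_per_run(steps=40, miss_rate=miss_rate, cost_per_call=0.01)
    print(f"miss rate {miss_rate:.0%}: ${cost:.2f} per run")
```

The same miss rate also drives run time and non-determinism, so lowering it helps on all three columns, as long as something heals the steps that miss.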
Estimating the cost
Most of this cost is engineer time, so it is straightforward to size. The build is a one-time number. Maintenance, triage, and operation repeat every year and grow with the app.

| Cost | How to size it |
|---|---|
| Build (v1) | Engineers × months to reach a platform you trust, amortized over its life. |
| Maintenance + triage | Engineer-hours per week fixing broken tests and sorting real failures from flakes, times 52. |
| Operation | Engineer-hours per week on the framework itself (runner, sharding, upgrades, mobile), times 52. |
These numbers are illustrative. Use the bake-off below to replace them with
measured ones from your own app.
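
The sizing rules in the table reduce to a little arithmetic. The sketch below is one way to write it down, under stated assumptions: `InHouseEstimate` and its field names are hypothetical, the 160 hours per engineer-month is an assumed conversion, and every input in the example is a placeholder, not a measurement.

```python
from dataclasses import dataclass

WEEKS_PER_YEAR = 52


@dataclass
class InHouseEstimate:
    build_engineers: float
    build_months: float
    platform_life_years: float           # period the v1 build is amortized over
    maintenance_hours_per_week: float    # fixing broken tests and triaging failures
    operation_hours_per_week: float      # runner, sharding, upgrades, mobile
    hours_per_engineer_month: float = 160.0  # assumption: adjust to your organization

    def build_hours_per_year(self) -> float:
        """One-time build effort, spread evenly over the platform's life."""
        total = self.build_engineers * self.build_months * self.hours_per_engineer_month
        return total / self.platform_life_years

    def recurring_hours_per_year(self) -> float:
        """Maintenance, triage, and operation: weekly hours times 52."""
        weekly = self.maintenance_hours_per_week + self.operation_hours_per_week
        return weekly * WEEKS_PER_YEAR

    def total_hours_per_year(self) -> float:
        return self.build_hours_per_year() + self.recurring_hours_per_year()


# Placeholder inputs only: 2 engineers for 6 months, amortized over 3 years,
# plus 10 h/week of maintenance and triage and 5 h/week of operation.
estimate = InHouseEstimate(
    build_engineers=2,
    build_months=6,
    platform_life_years=3,
    maintenance_hours_per_week=10,
    operation_hours_per_week=5,
)
print(estimate.build_hours_per_year())      # 640.0
print(estimate.recurring_hours_per_year())  # 780.0
print(estimate.total_hours_per_year())      # 1420.0
```

Even with these placeholder inputs the recurring hours exceed the amortized build, and only the recurring terms grow with the app.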
What Momentic gives you out of the box
Each row below is its own long-lived project if you build it in-house.

| Capability | What it does |
|---|---|
| Step cache | Replays a step’s resolved locator with no LLM call until something changes, so steady-state runs cost about the same as plain Playwright. |
| In-run auto-heal | Re-resolves a locator from its natural-language description on a cache miss and waits for the page to settle before acting, so the run recovers on its own. |
| Post-run heal agent | After a failed run, rewrites the failing tests and opens a pull request or patch, so test fixes come to you as a code review. |
| Failure classification | Sorts a failed run into a category (bug, application change, test issue, infra) with reasoning, so you can tell a real failure from a flaky or broken test. |
| Explore agent (beta) | Reads a diff, finds the user journeys that changed, and authors or edits tests to cover them, so new behavior gets tests without anyone remembering to write them. |
| App graph (beta) | A coverage model of your user journeys built from run traces, so you can see what is covered, partial, or missing instead of guessing from a test list. |
| Knowledge base (beta) | Org-level memory of your terminology, rules, and flows, retrieved per AI step so steps resolve more accurately as it learns your product. |
When building in-house makes sense
It is the right call in a few cases:

A small, stable suite
A handful of flows that rarely change and that you can maintain by hand.
Hard constraints
OSS with no SaaS dependency, or compliance and isolation rules that rule out
external tooling.
Testing as a product
You want to own testing as a core competency and have the platform capacity
to staff it.
Decide with a bake-off
The way to settle this is to run both approaches on the same work.

Pick two hard cases
A recent UI redesign and a flow with a history of breaking. These are where
maintenance cost actually shows up.
Count human interventions
Momentic’s heal and explore agents also edit tests, but those edits arrive
as pull requests you review, so count the work a person actually had to do:
- Commits whose only purpose was fixing tests.
- Tests that broke and needed editing per UI change.
- Failures fixed automatically vs. by hand.
- Flake rate against an unchanged app.
- Time to triage each failure, and whether the triage was automatic.
- Engineer-hours per week spent operating the framework itself.
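
A simple record per approach keeps the counts above comparable. This is a minimal sketch: `BakeOffTally` and its field names are hypothetical, and the values in the example are made up to show the shape of the data, not results from any real bake-off.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BakeOffTally:
    approach: str
    test_fix_commits: int               # commits whose only purpose was fixing tests
    tests_edited_per_change: List[int]  # tests that needed editing, one entry per UI change
    auto_fixed: int                     # failures fixed without a person editing a test
    hand_fixed: int                     # failures a person had to fix
    flaky_runs: int                     # failed runs against an unchanged app
    unchanged_runs: int                 # total runs against an unchanged app
    triage_minutes: List[float]         # time to triage each failure
    ops_hours_per_week: float           # time spent operating the framework itself

    def auto_fix_share(self) -> float:
        total = self.auto_fixed + self.hand_fixed
        return self.auto_fixed / total if total else 0.0

    def flake_rate(self) -> float:
        return self.flaky_runs / self.unchanged_runs if self.unchanged_runs else 0.0

    def mean_triage_minutes(self) -> float:
        if not self.triage_minutes:
            return 0.0
        return sum(self.triage_minutes) / len(self.triage_minutes)


# Made-up values for illustration only.
tally = BakeOffTally(
    approach="approach A",
    test_fix_commits=12,
    tests_edited_per_change=[5, 9],
    auto_fixed=3,
    hand_fixed=9,
    flaky_runs=6,
    unchanged_runs=200,
    triage_minutes=[20, 30, 10],
    ops_hours_per_week=4,
)
print(f"{tally.approach}: auto-fix share {tally.auto_fix_share():.0%}, "
      f"flake rate {tally.flake_rate():.1%}, "
      f"mean triage {tally.mean_triage_minutes():.0f} min")
```

Fill in one tally per approach over the same two hard cases and the same time window, so the comparison reflects the tooling and not the workload.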