Why we built vision-grounded locators instead of training a better DOM model.

Six months into building Momentic, we realized every meaningful test failure had the same root cause — and it wasn't the AI. Here's the architectural decision that changed our reliability story.

Sam Stern

Engineering Manager, Retool

May 27, 2026

The problem with DOM-coupled tests

When we started building Momentic in 2023, the obvious approach was to make the AI better at understanding the DOM. Generate smarter selectors. Use ML to predict element identity. Make data-testid attributes a thing of the past.

We tried it. It almost worked. And then it broke every single time a designer touched a component.

The thing nobody talks about with DOM-coupled testing — whether the selector is hand-written or AI-generated — is that the DOM is the wrong abstraction layer. A button isn't its div.btn--primary.btn--lg classes. It's "the thing the user clicks to check out." The semantic identity lives in the visual presentation, not in the implementation detail.
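A minimal Python sketch of that gap, under stated assumptions: `Element`, `find_by_classes`, and `find_by_visible_text` are hypothetical names, and matching on rendered text is only a stand-in for vision-grounded resolution, not how Momentic implements it.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Element:
    tag: str
    classes: FrozenSet[str]
    visible_text: str


def find_by_classes(page: List[Element], required: FrozenSet[str]) -> List[Element]:
    """DOM-coupled lookup: match on implementation detail (CSS classes)."""
    return [el for el in page if required <= el.classes]


def find_by_visible_text(page: List[Element], text: str) -> List[Element]:
    """Stand-in for a user-facing lookup: match on what is rendered."""
    wanted = text.strip().lower()
    return [el for el in page if el.visible_text.strip().lower() == wanted]


# The same checkout button before and after a hypothetical redesign
# that renames its CSS classes but leaves the rendered label alone.
before = [Element("button", frozenset({"btn--primary", "btn--lg"}), "Check out")]
after = [Element("button", frozenset({"button", "button-primary"}), "Check out")]

selector = frozenset({"btn--primary", "btn--lg"})

print(len(find_by_classes(before, selector)))        # class selector matches before the redesign
print(len(find_by_classes(after, selector)))         # and finds nothing afterwards
print(len(find_by_visible_text(after, "check out")))  # the label-based lookup still matches
```

The class-based lookup depends on names a designer is free to change. The label-based lookup depends only on what the user sees.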

“Cypress was supposed to solve our reliability concerns, but it couldn't.”

Sam Stern, Engineering Manager, Runway

Two options we considered

By month six, we had two viable paths forward:

Option A: keep resolving elements through the DOM and train a better model to do it.

Option B: ground locators in what is visually rendered on the page.

Both options had real engineering costs. We picked Option B, and the reason was less about model accuracy and more about failure modes.

Why failure modes matter more than accuracy

Here's the thing about a 96% accurate DOM model: in the 4% of cases where it's wrong, it fails silently. The locator points at the wrong element. The test passes. Production breaks anyway.

Vision-grounded locators fail differently. When they fail, they fail visibly — the AI can't see the element it was told to interact with, the test errors out, and you get a screenshot showing exactly what the AI saw at runtime. The failure modes are debuggable in a way DOM failures aren't.
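The two failure modes can be sketched in a few lines of Python. This is an illustration, not the actual resolver: the `Candidate` scores, the `min_score` threshold, the `LocatorNotFound` exception, and the screenshot path are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float  # hypothetical model confidence in [0, 1]


class LocatorNotFound(Exception):
    """Raised when no candidate is convincing; carries the runtime screenshot."""

    def __init__(self, description: str, screenshot: str):
        super().__init__(f"could not see {description!r}; screenshot: {screenshot}")
        self.description = description
        self.screenshot = screenshot


def resolve_silently(candidates: List[Candidate]) -> Candidate:
    """Always return the best guess, however weak. A wrong pick goes unnoticed."""
    return max(candidates, key=lambda c: c.score)


def resolve_loudly(description: str, candidates: List[Candidate],
                   screenshot: str, min_score: float = 0.8) -> Candidate:
    """Return a match only above a threshold; otherwise raise with evidence."""
    best = max(candidates, key=lambda c: c.score, default=None)
    if best is None or best.score < min_score:
        raise LocatorNotFound(description, screenshot)
    return best


# The checkout button is missing from the page, so every candidate is a poor match.
weak = [Candidate("newsletter signup button", 0.41), Candidate("footer link", 0.20)]

print(resolve_silently(weak).name)  # the test would go on to click the wrong element

try:
    resolve_loudly("checkout button", weak, "run-1.png")
except LocatorNotFound as err:
    print(err)  # the test errors out and points at the screenshot
```

The silent resolver turns a missing element into a wrong click. The loud one turns it into an error with something to look at.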

What happened next

Eight weeks after shipping vision-grounded locators to early customers:

Selector breakage rate dropped from 14 per 1,000 test runs to 0.4.
Engineer time on test maintenance fell by 70% across the cohort.
CI flake rate — the rate of false-positive failures — went from 8% to under 0.5%.
Customer NPS on test reliability moved from +12 to +71.
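For readers who want the relative changes behind those figures, a short Python check follows. The helper name `relative_reduction` is ours. The flake-rate result is a lower bound, because the new rate is only given as "under 0.5%".

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a metric fell, e.g. 0.5 means it halved."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Figures quoted in the list above.
breakage = relative_reduction(14, 0.4)    # selector breakages per 1,000 test runs
flake_floor = relative_reduction(8, 0.5)  # lower bound on the CI flake-rate drop
nps_gain = 71 - 12                        # NPS is a point score, so report the difference

print(f"selector breakage: {breakage:.2%} lower")
print(f"CI flake rate: at least {flake_floor:.2%} lower")
print(f"NPS: +{nps_gain} points")
```

In relative terms, selector breakage fell by roughly 97% (a 35x reduction) and the flake rate by at least 93.75%.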

The biggest unexpected win: engineers stopped opting out of CI gates. When you trust your test suite, you actually let it block deploys. We didn't anticipate that one.

The tradeoffs we accepted

Nothing is free. Vision-grounded locators are more expensive at runtime — each resolution call hits a multimodal model. We've offset that with aggressive caching of compiled locator plans, but for high-volume CI it's a real cost.
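One way such a cache could work is sketched below in Python. Everything here is an assumption for illustration: the `LocatorPlanCache` class, the choice of (description, screenshot hash) as the cache key, and the `fake_model` stand-in for the multimodal call. The real cache key and plan format are not described in this post.

```python
import hashlib
from typing import Callable, Dict, Tuple


class LocatorPlanCache:
    """Memoize locator plans so the expensive model runs only on a cache miss."""

    def __init__(self, resolve: Callable[[str, bytes], str]):
        self._resolve = resolve  # the expensive multimodal call (hypothetical signature)
        self._plans: Dict[Tuple[str, str], str] = {}
        self.model_calls = 0

    def get_plan(self, description: str, screenshot: bytes) -> str:
        # Key on the description plus a fingerprint of what was rendered,
        # so a visual change to the page invalidates the cached plan.
        key = (description, hashlib.sha256(screenshot).hexdigest())
        if key not in self._plans:
            self.model_calls += 1
            self._plans[key] = self._resolve(description, screenshot)
        return self._plans[key]


def fake_model(description: str, screenshot: bytes) -> str:
    """Stand-in for the multimodal model; returns a dummy plan string."""
    return f"plan for {description!r} on a {len(screenshot)}-byte screenshot"


cache = LocatorPlanCache(fake_model)
cache.get_plan("checkout button", b"pixels-v1")
cache.get_plan("checkout button", b"pixels-v1")  # served from cache
cache.get_plan("checkout button", b"pixels-v2")  # page changed, so the model runs again
print(cache.model_calls)
```

An exact hash is the strictest possible key: any pixel change is a miss. A production cache would likely want something more tolerant, which is where most of the design work sits.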

The other tradeoff is invisible elements. If something is visually hidden but DOM-present (a screen-reader-only label, an unrendered tab panel), vision can't find it. We fall back to DOM-only resolution in those cases, and we surface a warning so engineers know the test is in a less-reliable mode.
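The fallback path might look something like the following Python sketch. The function name, the two lookup callables, and the warning text are hypothetical, and the sketch assumes the fallback triggers whenever vision finds nothing but the DOM does. How the real system decides when to fall back is not covered here.

```python
import warnings
from typing import Callable, Optional, Tuple

Lookup = Callable[[str], Optional[str]]


def resolve_with_fallback(description: str, vision_lookup: Lookup,
                          dom_lookup: Lookup) -> Tuple[str, str]:
    """Try vision first; fall back to the DOM and warn that the test is less reliable."""
    target = vision_lookup(description)
    if target is not None:
        return target, "vision"

    target = dom_lookup(description)
    if target is None:
        raise LookupError(f"{description!r} not found visually or in the DOM")

    warnings.warn(
        f"{description!r} resolved via DOM only; this step is in a less-reliable mode",
        UserWarning,
    )
    return target, "dom"


# A screen-reader-only label: present in the DOM, invisible on screen.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = resolve_with_fallback(
        "screen-reader-only label",
        vision_lookup=lambda d: None,            # vision cannot see it
        dom_lookup=lambda d: "#sr-only-label",   # hypothetical DOM handle
    )

print(result)
print(len(caught))
```

Returning the resolution mode alongside the target lets a test report flag every step that ran without visual grounding.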
