The problem with DOM-coupled tests
When we started building Momentic in 2023, the obvious approach was to make the AI better at understanding the DOM. Generate smarter selectors. Use ML to predict element identity. Make data-testid attributes a thing of the past.
We tried it. It almost worked. And then it broke every single time a designer touched a component.
The thing nobody talks about with DOM-coupled testing — whether the selector is hand-written or AI-generated — is that the DOM is the wrong abstraction layer. A button isn't its div.btn--primary.btn--lg classes. It's "the thing the user clicks to check out." The semantic identity lives in the visual presentation, not in the implementation detail.
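To make the abstraction argument concrete, here is a minimal sketch in TypeScript. It uses a toy page model rather than a real browser. The names `UiElement`, `byClass`, and `byIntent` are hypothetical, and `byIntent` is only a stand-in for semantic resolution (it matches on visible label text), not a description of how Momentic resolves elements.

```typescript
// Hypothetical, simplified model of a rendered element; not a real DOM API.
interface UiElement {
  classes: string[];
  visibleLabel: string;
}

// DOM-coupled lookup: identity is tied to implementation-detail class names.
function byClass(page: UiElement[], className: string): UiElement | undefined {
  return page.find((el) => el.classes.includes(className));
}

// Intent-based lookup (stand-in for semantic resolution): identity is tied
// to what the user actually sees.
function byIntent(page: UiElement[], label: string): UiElement | undefined {
  const wanted = label.toLowerCase();
  return page.find((el) => el.visibleLabel.toLowerCase() === wanted);
}

const beforeRedesign: UiElement[] = [
  { classes: ["btn--primary", "btn--lg"], visibleLabel: "Check out" },
];

// A designer renames the classes; the user still sees the same button.
const afterRedesign: UiElement[] = [
  { classes: ["button", "button-cta"], visibleLabel: "Check out" },
];

const classHitBefore = byClass(beforeRedesign, "btn--primary") !== undefined;
const classHitAfter = byClass(afterRedesign, "btn--primary") !== undefined;
const intentHitAfter = byIntent(afterRedesign, "check out") !== undefined;
// classHitBefore is true, classHitAfter is false, intentHitAfter is true.
```

The class-based lookup stops matching as soon as the markup changes, while the lookup keyed on what the user sees survives the redesign.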
“Cypress was supposed to solve our reliability concerns, but it couldn't.”
Two options we considered
By month six, we had two viable paths forward:
- Option A: keep investing in the DOM model, with smarter selectors and ML-predicted element identity.
- Option B: vision-grounded locators that identify elements from what is rendered on screen rather than from the markup.
Both options had real engineering costs. We picked Option B, and the reason was less about model accuracy and more about failure modes.
Why failure modes matter more than accuracy
Here's the thing about a 96% accurate DOM model: the other 4% of the time, it fails silently. The locator points at the wrong element. The test passes. Production breaks anyway.
Vision-grounded locators fail differently. When they fail, they fail visibly — the AI can't see the element it was told to interact with, the test errors out, and you get a screenshot showing exactly what the AI saw at runtime. The failure modes are debuggable in a way DOM failures aren't.
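The difference between the two failure modes can be sketched with a toy page model. Everything here is an assumption for illustration: `RenderedElement`, `locateBySelector`, `locateByAppearance`, and the screenshot file name are hypothetical, and the "vision" function is a simple visibility-and-label check standing in for a multimodal model.

```typescript
// Toy page model for illustration only; not a real DOM or Momentic API.
interface RenderedElement {
  selector: string;
  label: string;
  visible: boolean;
}

type LocateResult =
  | { ok: true; element: RenderedElement }
  | { ok: false; reason: string; screenshot: string };

// DOM-style lookup: returns the first selector match, whether or not it is
// the element the test author meant.
function locateBySelector(
  page: RenderedElement[],
  selector: string
): RenderedElement | undefined {
  return page.find((el) => el.selector === selector);
}

// Stand-in for a vision-grounded lookup: it only accepts an element that is
// visibly on screen with the expected label, and otherwise reports a failure
// that carries the screenshot it looked at.
function locateByAppearance(
  page: RenderedElement[],
  label: string,
  screenshot: string
): LocateResult {
  const match = page.find((el) => el.visible && el.label === label);
  if (match) return { ok: true, element: match };
  return { ok: false, reason: `could not see "${label}"`, screenshot };
}

// After a redesign, "Check out" sits in a collapsed drawer and a new
// "Cancel" button now matches the old selector first.
const redesignedPage: RenderedElement[] = [
  { selector: "form button", label: "Cancel", visible: true },
  { selector: "form button", label: "Check out", visible: false },
];

const domHit = locateBySelector(redesignedPage, "form button");
const visionHit = locateByAppearance(redesignedPage, "Check out", "run-42.png");
// domHit is the "Cancel" button: a silent wrong match.
// visionHit.ok is false, and the result carries the screenshot to inspect.
```

The selector lookup hands back an element without complaint, so the test carries on against the wrong button. The appearance-based lookup returns an explicit failure with evidence attached.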
What happened next
Eight weeks after shipping vision-grounded locators to early customers:
- Selector breakage rate dropped from 14 per 1,000 test runs to 0.4.
- Engineer time on test maintenance fell by 70% across the cohort.
- CI flake rate — the rate of false-positive failures — went from 8% to under 0.5%.
- Customer NPS on test reliability moved from +12 to +71.
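For a sense of scale, the first and third figures can be turned into relative reductions. This is only arithmetic on the numbers quoted above. The helper name `relativeReduction` is ours, and because the flake rate is reported as "under 0.5%", its result is a lower bound.

```typescript
// Fraction by which a rate dropped, e.g. 0.5 means "cut in half".
function relativeReduction(before: number, after: number): number {
  return (before - after) / before;
}

// Selector breakage: 14 per 1,000 runs down to 0.4 per 1,000 runs.
const breakageReduction = relativeReduction(14, 0.4); // about 0.971

// CI flake rate: 8% down to under 0.5%, so this is a lower bound.
const flakeReductionFloor = relativeReduction(8, 0.5); // 0.9375

// Same breakage figures expressed per 10,000 runs.
const breakagesBeforePer10k = (14 / 1000) * 10000;
const breakagesAfterPer10k = (0.4 / 1000) * 10000;
```

In other words, selector breakage fell by roughly 97% and flaky failures by at least 93%.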
The biggest unexpected win: engineers stopped opting out of CI gates. When you trust your test suite, you actually let it block deploys. We didn't anticipate that one.
The tradeoffs we accepted
Nothing is free. Vision-grounded locators are more expensive at runtime — each resolution call hits a multimodal model. We've offset that with aggressive caching of compiled locator plans, but for high-volume CI it's a real cost.
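A rough sketch of what caching locator plans might look like follows. This is not Momentic's implementation. `LocatorPlan`, `ModelResolver`, and `LocatorPlanCache` are hypothetical names, and keying the cache on a screenshot hash plus the element description is our assumption. The resolver is synchronous here to keep the example short; a real model call would be asynchronous.

```typescript
// Hypothetical shape of a resolved locator plan.
interface LocatorPlan {
  description: string;
  x: number;
  y: number;
}

// Stand-in for the expensive multimodal model call.
type ModelResolver = (description: string, screenshotHash: string) => LocatorPlan;

class LocatorPlanCache {
  private plans = new Map<string, LocatorPlan>();
  private resolveWithModel: ModelResolver;
  modelCalls = 0;

  constructor(resolveWithModel: ModelResolver) {
    this.resolveWithModel = resolveWithModel;
  }

  resolve(description: string, screenshotHash: string): LocatorPlan {
    // Assumed cache key: same page appearance + same description => same plan.
    const key = `${screenshotHash}::${description}`;
    const cached = this.plans.get(key);
    if (cached !== undefined) return cached;

    // Cache miss: pay for one model call and remember the result.
    this.modelCalls += 1;
    const plan = this.resolveWithModel(description, screenshotHash);
    this.plans.set(key, plan);
    return plan;
  }
}

// Usage with a fake model that always "finds" the element at a fixed point.
const fakeModel: ModelResolver = (description) => ({ description, x: 100, y: 200 });
const planCache = new LocatorPlanCache(fakeModel);

planCache.resolve("checkout button", "hash-a"); // miss: calls the model
planCache.resolve("checkout button", "hash-a"); // hit: served from cache
planCache.resolve("checkout button", "hash-b"); // miss: the page looks different
// planCache.modelCalls is now 2
```

Under this scheme the model is only paid for when the page's appearance or the requested element changes, which is why unchanged screens in repeated CI runs are cheap and heavily changing ones are not.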
The other tradeoff is invisible elements. If something is visually hidden but DOM-present (a screen-reader-only label, an unrendered tab panel), vision can't find it. We fall back to DOM-only resolution in those cases, and we surface a warning so engineers know the test is in a less-reliable mode.
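The fallback described above could be sketched like this. The types and the `resolveTarget` function are hypothetical and only show the control flow: prefer vision, fall back to the DOM when the element is present but not rendered, and attach a warning when that happens.

```typescript
// Hypothetical summary of where a target element can be found.
interface TargetState {
  inDom: boolean;
  visuallyRendered: boolean;
}

type ResolutionMode = "vision" | "dom-fallback";

interface Resolution {
  mode: ResolutionMode;
  warnings: string[];
}

// Returns undefined when the element cannot be found by either route.
function resolveTarget(description: string, state: TargetState): Resolution | undefined {
  if (state.visuallyRendered) {
    return { mode: "vision", warnings: [] };
  }
  if (state.inDom) {
    // Present in the DOM but not visible: resolve from markup and say so.
    return {
      mode: "dom-fallback",
      warnings: [`"${description}" is not visible; resolved from the DOM only (less reliable)`],
    };
  }
  return undefined;
}

// A screen-reader-only label: in the DOM, but nothing is painted on screen.
const srOnlyLabel: TargetState = { inDom: true, visuallyRendered: false };
const fallbackResult = resolveTarget("cart total label", srOnlyLabel);
// fallbackResult.mode is "dom-fallback" and carries one warning.

const visibleButton: TargetState = { inDom: true, visuallyRendered: true };
const visionResult = resolveTarget("checkout button", visibleButton);
// visionResult.mode is "vision" with no warnings.
```

Returning the warning alongside the result, instead of only logging it, lets a test report show which steps ran in the less reliable mode.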