This is the second part of our guide on agentic testing. It covers how agentic testing works in practice and best practices for implementing it inside your organization.


Underneath the four-verb definition, the technical architecture is converging fast. A few patterns are now well-established across vendors and open-source projects.
The dominant pattern is accessibility-tree-first with a vision fallback. The a11y tree is the semantic representation of the page (the same one screen readers use), which makes it more stable across CSS and DOM refactors than raw selectors.
The agent reads the page by capturing a snapshot, a serialized text version of the tree at that moment. It looks something like this:
- heading "Todos" [level=1]
- textbox "What needs to be done?" [ref=e5]
- button "Add" [ref=e6]

The LLM reads that snapshot, decides "I want to type in the textbox," and acts against ref=e5. Those reference IDs are local to this single snapshot: the next time the agent observes the page, it gets a fresh tree with fresh refs. Selector drift, the source of most flakiness in scripted suites, can't happen because there's no persistent selector to drift.
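To make the snapshot-and-ref idea concrete, here is a minimal sketch that parses the three-line snapshot above and looks up the textbox's ref. The line grammar, the `Node` type, and `parse_snapshot` are assumptions made for illustration; real tools define their own snapshot formats.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed line shape, modeled on the example snapshot:
#   - role "accessible name" [key=value]
LINE_RE = re.compile(r'^-\s+(\w+)\s+"([^"]*)"(?:\s+\[(\w+)=([^\]]+)\])?\s*$')

@dataclass
class Node:
    role: str
    name: str
    ref: Optional[str]  # set only when the line carries a [ref=...] attribute

def parse_snapshot(text: str) -> List[Node]:
    """Turn a serialized snapshot into a flat list of nodes (hypothetical helper)."""
    nodes = []
    for line in text.strip().splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines this toy grammar does not understand
        role, name, key, value = match.groups()
        nodes.append(Node(role, name, value if key == "ref" else None))
    return nodes

SNAPSHOT = '''
- heading "Todos" [level=1]
- textbox "What needs to be done?" [ref=e5]
- button "Add" [ref=e6]
'''

nodes = parse_snapshot(SNAPSHOT)
textbox = next(n for n in nodes if n.role == "textbox")
print(textbox.ref)  # e5
```

Because the refs are read from the current snapshot on every observation, nothing in this sketch persists a selector between runs.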
For surfaces the a11y tree can't describe well (HTML canvas, custom-rendered components, charting libraries), the agent falls back to a multimodal vision model interpreting the screenshot directly.
Rather than page.click('#submit-btn-v2'), the test author specifies the intent: click the submit button. The agent's resolver, an LLM with multi-modal access to page state, picks the actual element at runtime. Intent is what the test author writes. Selectors are implementation details the agent works out for itself.
This matters because intent survives refactors. “Click the submit button” stays valid when the button changes from a <button> to an <a>, when its class name is regenerated, when it's wrapped in a new component, or when it moves up a level in the component tree. None of those changes touch the intent, so none of them require the test author to do anything.
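A small sketch of why intent survives refactors: the resolver below matches on semantic role and accessible name only, so it finds the same control before and after the markup changes. The `Element` type and `resolve` function are hypothetical, and the string match is a toy stand-in for the LLM resolver. The sketch also assumes the refactored `<a>` still exposes the button role.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Element:
    tag: str        # implementation detail: may change in a refactor
    css_class: str  # implementation detail: may be regenerated
    role: str       # semantic role exposed through the accessibility tree
    name: str       # accessible name

def resolve(intent_role: str, intent_name: str, page: List[Element]) -> Element:
    """Toy stand-in for the LLM resolver: match on semantics, never on markup."""
    matches = [
        e for e in page
        if e.role == intent_role and intent_name.lower() in e.name.lower()
    ]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one match, found {len(matches)}")
    return matches[0]

# The same control before and after a refactor (assumed example data).
before = [Element("button", "submit-btn-v2", "button", "Submit")]
after = [Element("a", "css-x9f3k", "button", "Submit")]

# The intent "click the submit button" is unchanged across both versions.
print(resolve("button", "submit", before).tag)  # button
print(resolve("button", "submit", after).tag)   # a
```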
After each action, the agent re-reads the page and decides whether to proceed, retry, or revise the plan. The pattern follows ReAct:
When the loop stalls or fails, Reflexion-style self-critique kicks in: the agent writes a textual reflection on what went wrong and seeds the next attempt with it. Momentic exposes this as failure recovery, where the agent proposes corrected steps after a failed run that an engineer accepts or rejects in the run viewer. The broader pattern most production tools use is a hybrid: a planner emits a step-indexed sequence up front, then a per-step reasoning loop runs inside each step.
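The per-step loop with Reflexion-style feedback can be sketched as follows. This is a minimal illustration under stated assumptions, not any vendor's implementation: `run_step`, `observe`, and `act` are hypothetical names, and the hard-coded policy in `act` stands in for the LLM.

```python
from typing import Callable, Dict, List, Optional, Tuple

def run_step(
    goal: str,
    observe: Callable[[], Dict[str, bool]],
    act: Callable[[str, Dict[str, bool], List[str]], Optional[str]],
    max_attempts: int = 3,
) -> Tuple[int, List[str]]:
    """Observe, act, and on failure record a reflection that seeds the next attempt."""
    reflections: List[str] = []
    for attempt in range(1, max_attempts + 1):
        state = observe()                       # re-read the page before acting
        error = act(goal, state, reflections)   # None means the step succeeded
        if error is None:
            return attempt, reflections
        reflections.append(f"attempt {attempt} failed: {error}")
    raise RuntimeError(f"step {goal!r} failed after {max_attempts} attempts")

# Toy page: a modal covers the submit button.
page = {"modal_open": True, "submitted": False}

def observe() -> Dict[str, bool]:
    return dict(page)

def act(goal: str, state: Dict[str, bool], reflections: List[str]) -> Optional[str]:
    # Hard-coded policy standing in for the model's reasoning.
    if state["modal_open"]:
        if not any("modal" in r for r in reflections):
            return "click intercepted by modal"
        page["modal_open"] = False  # revised plan: dismiss the modal first
    page["submitted"] = True
    return None

attempts, notes = run_step("click the submit button", observe, act)
print(attempts)  # 2
print(notes)     # ['attempt 1 failed: click intercepted by modal']
```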
Pure agentic execution is slow and expensive. Every step costs at least one LLM call, and a hundred-step flow with model latency adds up to minutes per test. The solution that every serious tool has converged on is to cache the resolved locator after the first successful run, replay it deterministically on subsequent runs, and fall back to AI resolution only when the cached path misses.
This is the technically honest version of “self-healing tests.” It's a cache-and-revalidate loop. The cache stores enough to replay the action deterministically. When the cached path misses, the agent re-resolves using the same multimodal grounding, updates the cache, and continues.
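The cache-and-revalidate loop can be sketched in a few lines. This is a simplified model under stated assumptions: a page is a dictionary from locator to accessible name, "revalidation" is just a presence check, and `LocatorCache` and `toy_resolver` are hypothetical names. The resolver is a string match standing in for the AI call.

```python
from typing import Callable, Dict

Page = Dict[str, str]  # locator -> accessible name (toy page model)

class LocatorCache:
    def __init__(self, resolve_with_ai: Callable[[str, Page], str]) -> None:
        self._resolve = resolve_with_ai
        self._cache: Dict[str, str] = {}
        self.ai_calls = 0  # counts how often the expensive path runs

    def locate(self, intent: str, page: Page) -> str:
        cached = self._cache.get(intent)
        if cached is not None and cached in page:
            return cached                       # hit: deterministic replay
        self.ai_calls += 1                      # miss: re-resolve with the agent
        locator = self._resolve(intent, page)
        self._cache[intent] = locator           # update the cache and continue
        return locator

def toy_resolver(intent: str, page: Page) -> str:
    """Stand-in for AI resolution: pick the element whose name appears in the intent."""
    for locator, name in page.items():
        if name.lower() in intent.lower():
            return locator
    raise LookupError(f"nothing on the page matches {intent!r}")

cache = LocatorCache(toy_resolver)
intent = "click the submit button"

page_v1 = {"#submit-btn-v2": "Submit"}
first = cache.locate(intent, page_v1)    # first run pays the AI cost
second = cache.locate(intent, page_v1)   # repeat run replays from the cache

page_v2 = {"#submit-btn-v3": "Submit"}   # the id was regenerated
third = cache.locate(intent, page_v2)    # cached path misses, agent re-resolves

print(first, second, third, cache.ai_calls)
# #submit-btn-v2 #submit-btn-v2 #submit-btn-v3 2
```

Three lookups cost two resolver calls: one for the first run and one for the "heal" after the locator changed.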
The speed difference is significant: cached steps run only 52ms slower on average than equivalent Playwright commands. Only first-run, uncached steps pay the full AI cost, and over 99% of steps execute in under 500ms once cached.
What this means operationally:
The cache is one layer of the agent's run-to-run state. Momentic’s memory is a record of past runs that captures which assertions tended to need retries, where the agent took alternative paths, and which signals proved authoritative for which flows. Caching keeps repeat runs cheap. Memory helps the agent reason consistently when conditions shift between runs.
The combination of intent-based specification, multi-modal verification, and cache-and-revalidate execution is what makes agentic tests cheap enough at steady state to actually deploy. Without the cache, the cost model doesn't work.
Three integration surfaces matter, depending on where the agent is running.
Model Context Protocol lets any IDE-resident coding agent (Claude Code, Cursor, Codex CLI, Copilot Chat, Windsurf) call tools as native tool calls. For agentic testing, this is the integration surface where the coding agent verifies its own work inside its working loop, before any code reaches CI.
What the MCP exposes matters. Playwright MCP and Stagehand expose browser primitives (click, type, navigate, snapshot). Momentic's MCP exposes the test layer directly: the agent can create a test, run it against a live browser, inspect traces, and classify failures without leaving the editor.
For high-throughput coding agents, CLI invocations can beat MCP on context cost. CLI calls don't load large tool schemas or verbose accessibility-tree snapshots into the model's window, leaving more room for code. Microsoft recommends this pattern for Playwright; agents discover the CLI syntax through structured markdown skill files that ship with the tool.
Momentic ships the same pattern. The Momentic CLI runs and manages tests. Matching skill files, installed with npx skills add momentic-ai/skills, tell the coding agent how to invoke it: momentic-test for authoring and execution, momentic-result-classification for failure analysis. More setup than the MCP route, less context overhead per action.
Agentic testing in CI works like any other test suite. Platforms integrate with GitHub Actions, GitLab CI, and CircleCI through standard CLI calls: install browsers, run tests, and upload results. Sharding splits large test sets across parallel jobs, and a merge step aggregates outputs into a single run group for review.
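The shard-then-merge flow can be sketched as below. It assumes round-robin assignment over sorted test names and a simple pass/fail result per test. `shard` and `merge` are hypothetical helpers, and real platforms may balance shards differently (for example, by historical duration).

```python
from typing import Dict, List

def shard(tests: List[str], index: int, total: int) -> List[str]:
    """Return the tests assigned to shard `index` out of `total` (round-robin)."""
    ordered = sorted(tests)  # a stable order so every job computes the same split
    return [t for i, t in enumerate(ordered) if i % total == index]

def merge(shard_results: List[Dict[str, bool]]) -> Dict[str, object]:
    """Aggregate per-shard results into one summary for review."""
    combined: Dict[str, bool] = {}
    for result in shard_results:
        combined.update(result)
    failed = sorted(name for name, passed in combined.items() if not passed)
    return {"total": len(combined), "failed": failed}

tests = ["signup", "checkout", "onboarding", "search", "profile"]
shards = [shard(tests, i, 2) for i in range(2)]
print(shards)
# [['checkout', 'profile', 'signup'], ['onboarding', 'search']]

# Pretend each parallel job ran its shard; "search" fails in this toy run.
results = [{name: name != "search" for name in s} for s in shards]
summary = merge(results)
print(summary)  # {'total': 5, 'failed': ['search']}
```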
What's distinct is what attaches to the pull request. Beyond pass/fail, the agent's trace (the recorded sequence of what it saw and did at each step), along with screenshots, videos, and the agent's classification of any failures, accompanies the PR. The reviewer doesn't need to re-run the test locally to understand what happened.
The strategic point is that agentic testing is most valuable inside the coding agent's working loop, not as a post-merge gate. A coding agent that writes a feature, opens a browser through MCP, navigates its own change, and verifies behavior before opening a PR is doing the modern version of shift-left.
Testing ownership has moved several times over the last twenty years. Dedicated QA organizations gave way to SDETs. SDETs gave way to “every engineer owns their tests” in the shift-left era. Each shift was driven by changes in how software was built. Agentic testing is the latest such shift.
When the agent owns the execution and maintenance of tests, the human's leverage shifts to a different layer. The work isn't writing the script anymore. It's:
That set of responsibilities doesn't map cleanly onto any existing job title. It pulls from QA, SDET, devex, and staff engineering. In practice, whoever takes it on is doing something closer to the work of a “verification engineer” than to any of the historical roles. The dev who built the feature and the QA who tested it collapse into the same engineer: the one who specified what the agent should verify.
Failure triage is the clearest place where the role is already taking shape. Momentic's ai classify tool sorts failed runs into categories with the agent's confidence and reasoning attached:
The agent does the first pass. The human validates, marks intentional product changes, and tunes the test where the agent's verdict was off. Automatable enough to be a product feature, not trustworthy enough to run unattended.
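One way to picture that division of labor is a review queue in which every verdict still reaches a human, with the least confident ones first. This is a sketch only: the `Verdict` fields, the category labels, and the sample data are hypothetical and do not reflect the output format of any real classifier.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Verdict:
    test: str
    category: str      # hypothetical label, for illustration only
    confidence: float  # 0.0 to 1.0, as reported by the classifier
    reasoning: str

def review_queue(verdicts: List[Verdict]) -> List[Verdict]:
    """Nothing is auto-resolved: order verdicts so the least certain come first."""
    return sorted(verdicts, key=lambda v: v.confidence)

verdicts = [
    Verdict("checkout", "product_bug", 0.92, "server error on order submit"),
    Verdict("signup", "intentional_change", 0.41, "button label changed"),
    Verdict("search", "environment_issue", 0.67, "timeout on first load"),
]

queue = review_queue(verdicts)
print([v.test for v in queue])  # ['signup', 'search', 'checkout']
```

Sorting rather than filtering keeps the human in the loop for every failure while still spending attention where the agent is least sure.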
If AI coding agents are in active use on your team, you're past the point where scripted automation can keep pace. Other diagnostic frameworks (test maintenance burden, suite trust, headcount allocation) are real, but they're all downstream of this one.
A few reasons coding agents change the math:
If your engineers are still hand-writing most of their code, scripted automation is fine. The moment Copilot, Cursor, Claude Code, or similar agents start landing meaningful volume in your codebase, that calculus changes.
Agentic testing fits well with E2E user journeys. Long flows that hit multiple screens, depend on stable application intent, and change shape every few weeks are precisely where the cache-and-revalidate model pays off. It's less suited to deep backend logic, tightly regulated assertions, and reproducibility-critical work, all of which still belong in your existing harness. Two practices make the rollout substantially cheaper.
Every fresh agentic test pays the full AI cost on its first execution: locating elements, evaluating assertions, and reasoning about page state. A 100-step suite can take 10-15 minutes on the first pass. Convert incrementally rather than bulk-migrating, prioritize the highest-churn E2E flows first (signup, checkout, onboarding, primary feature paths), and let the cache warm before benchmarking steady-state speed. Teams that try to flip a 500-test suite in one sprint burn tokens and end up with a slow first run nobody planned for.
Some categories don't belong in the agentic pipeline:
Codify this upfront so engineers know which surfaces to skip when authoring agentic tests. The exclude list also doubles as a teaching artifact: it's the clearest way to communicate where agentic verification is reliable and where it isn't.
Done well, the rollout looks less like rip-and-replace and more like a planned shift of test ownership. Agentic handles the high-churn E2E surface, where the maintenance dividend is largest. Scripted, unit, and integration tests hold the high-precision rest. Most production stacks land on this hybrid model as a permanent steady state.
Three things make agentic testing a distinct category rather than a feature bolted onto existing tools.
This is the bet behind what Momentic builds. The test author specifies intent in natural language. The agent handles planning, running, observing, and adapting across multiple signals. The locator cache speeds up repeat runs. And the whole thing is reachable from Claude Code, Cursor, Copilot, or your CI without a separate orchestration layer.
Don't forget to read our guide to what agentic testing is and what it isn’t to understand the theory behind agentic testing and why it’s so important now: https://momentic.ai/blog/agentic-testing-guide.