Most developers never look at the internals of how a compiler does its magic. It's a tool we rely on to translate our code from a high-level language into something machine-readable; we only care whether it succeeds or fails.
This should be the goal of testing. Developers shouldn't need to understand, maintain, and scrutinize their test suites, or treat test code as a first-class citizen that gets the same attention as production code. The vision for testing is that you shouldn't have to micro-manage it or be intimately aware of the intricacies of the testing process. Like compilation, testing should "just work": the coverage is comprehensive, the results are reliable, and if any issues are found, they're flagged or even automatically fixed.
With the availability of generative AI agents, machine learning-based coverage, and techniques to analyze user behavior and code paths, we're at an inflection point where automated testing can be far more sophisticated. AI agents will be able to intelligently infer usage patterns and edge cases and perhaps even submit PRs to fix the bugs they encounter.
You'll press test and go have a sword fight.
(source: xkcd 303)
Today, compilation is automatic and reliable; developers focus on writing code, not on the minutiae of translation.
But it wasn't always so. In the early days of programming, software was written in assembly or machine code, with every detail handled manually. High-level languages and compilers existed, but they weren't trusted. They were janky, and early compilers often produced code that did not perform as well as hand-written assembly.
Developers would spend countless hours hand-optimizing their assembly code, arguing that automated compilation couldn't possibly understand the nuances of their specific use case. Sound familiar? It's the same argument we hear today about AI-driven testing not being able to understand the "special requirements" of our applications.
But compilers got better. A lot better. Not perfect, but to the point that if someone told you they were rolling their own compiler, you'd think they'd been rolling something else. This is because:
This is where we are with AI testing; the same tenets hold true.
We are not at the sword fight stage yet. But it is coming. And it will be good.
Garry Tan wants black box testing.
You should too. Testing sucks, and that isn't a secret. It is time- and resource-intensive. Manual test writing and maintenance consume a lot of time, especially for large codebases, and QA teams or developers spend hours writing and rewriting test cases for every change. It just isn't an efficient system.
Test coverage is also often incomplete. Even with dedicated testing teams, it's easy to miss edge cases, and changes in requirements or code can leave tests outdated. Traditional testing might only happen at certain stages (e.g., nightly builds, staging environments), and, to Garry's point, bug detection can be slow, and bug fixing is even slower.
But AI testing changes everything. It offers near-infinite scalability. As your application grows in complexity, traditional testing approaches break down: more features mean exponentially more test cases to write and maintain. AI-based testing systems scale automatically with your codebase. The AI agent simply adapts and generates new test scenarios as your application evolves, maintaining comprehensive coverage without additional effort from your team.
This continuous nature of AI testing is a game-changer:
Perhaps most importantly, AI testing transforms how we allocate our human resources. Instead of having skilled developers and QA engineers spend countless hours on repetitive regression testing, they can focus on security audits, exploratory testing, and the complex edge cases that require human insight. These valuable tasks become their primary focus, while the AI handles the grunt work of comprehensive testing.
But that is all just about general AI testing. Why do we think the "black box" will prevail?
This goes back to our original point: you just don't need to know any of the internals. Developers shouldn't have to think about the details of coverage or test generation. They should write code and rely on the black box to do comprehensive testing. And just like with a compiler, you care about the output (does the code compile? Do the tests pass?), not about the intricacies of lexical analysis, parsing, optimization, etc. Just as we trust our IDEs to handle syntax highlighting and code completion without understanding their internals, we should trust our testing systems to handle coverage and validation autonomously.
It is about cognitive load. Developers are simple folk; they need that headspace for other things.
(source: xkcd 2510)
We don't know, that's the magic!
However, we can sketch out how such a system would operate. At its core, the black box testing system would have several key components that work in harmony.
An AI constantly observes user sessions, identifying which paths through the code are most frequently traveled. It collects data on feature usage patterns and logs any anomalies or unexpected behaviors. This creates a dataset of real-world usage that goes beyond what traditional test coverage metrics can capture.
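To make that concrete, here's a minimal sketch in TypeScript of the kind of usage data such an observer might collect and how it could be rolled up per path. The names (UsageEvent, PathStats, aggregate) are made up for illustration, not any real Momentic API.

```typescript
// Hypothetical shapes for the usage data an observing agent might collect.
interface UsageEvent {
  sessionId: string;
  route: string;     // e.g. "/checkout/confirm"
  action: string;    // e.g. "click:submit-order"
  timestamp: number;
  anomaly?: string;  // set when something unexpected was logged
}

interface PathStats {
  hits: number;
  anomalies: string[];
}

// Roll raw events up into per-path statistics so a test generator can see
// which paths are hot and which are rarely exercised.
function aggregate(events: UsageEvent[]): Map<string, PathStats> {
  const stats = new Map<string, PathStats>();
  for (const e of events) {
    const key = `${e.route} ${e.action}`;
    const entry = stats.get(key) ?? { hits: 0, anomalies: [] };
    entry.hits += 1;
    if (e.anomaly) entry.anomalies.push(e.anomaly);
    stats.set(key, entry);
  }
  return stats;
}
```

Everything downstream, from test generation to prioritization, would key off real usage like this rather than raw coverage numbers alone.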
Using the collected data as a foundation, the AI doesn't just test what it sees; it infers what it might see. It identifies less-traveled or edge paths in the code and automatically generates test scenarios for them. This is where AI outpaces traditional testing approaches: it can imagine and test scenarios that haven't happened yet but could.
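Here's one hedged sketch of what that inference step could look like. generateScenario stands in for an LLM-backed planner; every name is illustrative.

```typescript
// Hypothetical sketch: paths that real users rarely exercise become
// candidates for generated tests.
interface TestScenario {
  description: string;
  steps: string[];
}

// Stand-in for a model-backed planner that drafts concrete steps for a path.
async function generateScenario(path: string): Promise<TestScenario> {
  return {
    description: `Exercise rarely used path: ${path}`,
    steps: [`navigate to ${path}`, "assert no errors are thrown"],
  };
}

async function inferEdgeCases(
  pathHits: Map<string, number>, // path -> observed hit count
  hotThreshold = 10
): Promise<TestScenario[]> {
  // Paths below the threshold are "less traveled" and worth testing even
  // though production traffic hasn't covered them yet.
  const rarePaths = [...pathHits.entries()]
    .filter(([, hits]) => hits < hotThreshold)
    .map(([path]) => path);
  return Promise.all(rarePaths.map(generateScenario));
}
```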
It runs these generated tests in staging or development environments, carefully flagging any failing scenarios. The AI can go beyond simple pass/fail reporting. It can analyze failures and propose fixes, understanding the context of the failure and potential solutions based on patterns it has learned.
When the AI identifies a new path through the code or discovers a potential bug, it automatically tests, logs, and proposes solutions. This process does not need human orchestration; it just happens, much like how your compiler works without you needing to think about its internals.
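A rough sketch of that run-and-report loop, again with hypothetical stand-ins (runInStaging, suggestFix) rather than any real runner or API:

```typescript
// Hypothetical stand-ins for the runner and the failure analysis.
interface TestResult {
  scenario: string;
  passed: boolean;
  log: string;
}

interface Finding {
  scenario: string;
  log: string;
  suggestedFix: string;
}

// Pretend runner: executes a generated scenario against staging.
async function runInStaging(scenario: string): Promise<TestResult> {
  return { scenario, passed: true, log: "(captured output)" };
}

// Pretend analysis: turns a failure log into a proposed remediation.
async function suggestFix(result: TestResult): Promise<string> {
  return `Proposed change for "${result.scenario}" based on its log`;
}

// Run everything, flag failures, and attach a suggested fix to each one.
async function runAll(scenarios: string[]): Promise<Finding[]> {
  const findings: Finding[] = [];
  for (const scenario of scenarios) {
    const result = await runInStaging(scenario);
    if (!result.passed) {
      findings.push({
        scenario,
        log: result.log,
        suggestedFix: await suggestFix(result),
      });
    }
  }
  return findings;
}
```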
Imagine that your testing process is as straightforward as running npm run test or clicking a "build & test" button. Under the hood, an AI orchestrates thousands of test scenarios.
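For illustration only, that test script could point at a thin entry point like the one below; planScenarios and runScenario are placeholders for the orchestration that actually fans out to thousands of generated scenarios.

```typescript
// Hypothetical single entry point that a "test" script could invoke.
async function planScenarios(): Promise<string[]> {
  // In a full system these would come from the observing and inferring agents.
  return ["checkout with an expired card", "concurrent edits on one document"];
}

async function runScenario(name: string): Promise<boolean> {
  // Placeholder runner; a real one would drive the app in a staging environment.
  return true;
}

async function main(): Promise<void> {
  const scenarios = await planScenarios();
  const results = await Promise.all(scenarios.map(runScenario));
  const failures = results.filter((passed) => !passed).length;
  console.log(`${scenarios.length} scenarios run, ${failures} failed`);
  process.exitCode = failures > 0 ? 1 : 0;
}

main();
```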
When code changes break existing assumptions or introduce new behaviors, the AI adjusts or regenerates tests automatically. In the (not too distant) future, you might see automated PRs with suggested patches, plus an entire pipeline that merges these fixes if they pass further checks. This closes the loop between "find a bug" and "fix the bug" with minimal human intervention.
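One way to picture that closed loop, with every function below a hypothetical placeholder rather than a real pipeline: detect what broke, regenerate the affected tests, propose a patch, and only open a PR when the patch passes the regenerated suite.

```typescript
// A hedged sketch of the "find a bug" -> "fix the bug" loop.
interface Patch {
  summary: string;
  diff: string;
}

async function detectBrokenAssumptions(changedFiles: string[]): Promise<string[]> {
  // Stand-in: scenarios whose assumptions the code change invalidated.
  return changedFiles.map((file) => `behavior covered by ${file}`);
}

async function regenerateTests(broken: string[]): Promise<string[]> {
  // Stand-in: re-planned scenarios reflecting the new behavior.
  return broken.map((scenario) => `${scenario} (regenerated)`);
}

async function proposePatch(scenario: string): Promise<Patch | null> {
  // Stand-in: model-proposed fix, or null if nothing sensible was found.
  return { summary: `Fix for ${scenario}`, diff: "(placeholder diff)" };
}

async function passesChecks(patch: Patch, tests: string[]): Promise<boolean> {
  // A real pipeline would run the regenerated tests (and other gates) here.
  return tests.length > 0;
}

async function openPullRequest(patch: Patch): Promise<void> {
  console.log(`Would open a PR: ${patch.summary}`);
}

async function closeTheLoop(changedFiles: string[]): Promise<void> {
  const broken = await detectBrokenAssumptions(changedFiles);
  const regenerated = await regenerateTests(broken);
  for (const scenario of broken) {
    const patch = await proposePatch(scenario);
    if (patch && (await passesChecks(patch, regenerated))) {
      await openPullRequest(patch); // human review still gates the merge
    }
  }
}
```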
This is what we're working on at Momentic. If you want to black-box your testing, reach out here. If you want to build the black box, we're hiring (swords provided).