Most developers never look at the internals of how a compiler does its magic.
It's a tool we rely on to translate our code from a high-level language to
something machine-readable; we only care if it succeeds or fails.
This should be the goal of testing. Developers shouldn't have to understand,
maintain, and scrutinize their test suites; they shouldn't have to treat test
code as a first-class citizen that gets the same attention as production code. The
vision for testing is that you shouldn't have to micro-manage or be intimately
aware of the intricacies of the testing process. Like compilation, testing
should "just work"—the coverage is comprehensive, the results are reliable, and
if any issues are found, they're flagged or even automatically fixed.
With the availability of generative AI agents, machine learning-based coverage,
and techniques to analyze user behavior and code paths, we're at an inflection
point where automated testing can be far more sophisticated. AI agents will be
able to intelligently infer usage patterns and edge cases and perhaps even
submit PRs to fix the bugs they encounter.
You'll press "test" and go have a sword fight.

(source: xkcd 303)
What Testing Can Learn From Compilation
Today, compilation is automatic and reliable; developers focus on writing code,
not on the minutiae of translation.
But it wasn't always so. In the early days of programming, software was written
in assembly or machine code, with every detail handled manually. High-level
languages and compilers existed, but they weren't trusted. They were janky, and
early compilers often produced code that did not perform as well as hand-written
assembly.
Developers would spend countless hours hand-optimizing their assembly code,
arguing that automated compilation couldn't possibly understand the nuances of
their specific use case. Sound familiar? It's the same argument we hear today
about AI-driven testing not being able to understand the "special requirements"
of our applications.
But compilers got better. A lot better. Not perfect, but to the point that if
someone told you they were rolling their own compiler, you'd think they'd been
rolling something else. This is because:
- Academic research and industry investment led to dramatic improvements in
optimization techniques, making compiler-generated code often better than
hand-written assembly.
- The development of robust intermediate representations and analysis
frameworks allowed compilers to understand and optimize code at a deeper
level than humans could manually track.
- The standardization of compilation techniques and the emergence of
battle-tested compiler infrastructures (like LLVM) created a foundation of
trust and reliability.
- Market forces drove the evolution. Companies couldn't afford to keep
hand-optimizing code as applications grew in complexity, so they trusted and
helped improve compiler technology.
This is where we are with AI testing; the same forces are at work.
- The rapid advancement of large language models and code understanding has led
to AI-generated tests that catch edge cases that human developers miss. Just
as compilers outperformed hand-written assembly, AI testing is beginning to
surpass manual test writing in both coverage and effectiveness.
- Modern AI systems can build sophisticated internal representations of
codebases, understanding the syntax and semantic relationships between
components, user behaviors, and system dependencies.
- The emergence of standardized AI testing frameworks (cough-cough, Momentic)
and increasingly capable AI coding tools (like Cursor, Claude, and similar
tools) is creating a foundation of reliability that will only strengthen as
these systems mature and become battle-tested in production environments.
- Economic pressures are already forcing this transition. Companies simply
cannot afford to maintain comprehensive manual test suites and QA teams. The
cost of missed bugs in production far outweighs any perceived benefits of
hand-crafted tests.
We are not at the swordfight stage yet. But it is coming. And it will be good.
Why You Want Black Box Testing
Garry Tan wants black box testing.
You should too. Testing sucks, and that isn't a secret. It is time- and
resource-intensive. Manual test writing and maintenance consume a lot of time,
especially for large codebases, and QA teams or developers spend hours writing
and re-writing test cases for every change. It just isn't an efficient system.
Test coverage is also often incomplete. Even with dedicated testing teams, it's
easy to miss edge cases, and changes in requirements or code can leave tests
outdated. Traditional testing might only happen in certain stages (e.g., nightly
builds, staging environments), and, to Garry's point, bug detection can be slow,
and bug fixing is even slower.
But AI testing changes everything. It offers near-infinite scalability. As your
application grows in complexity, traditional testing approaches break down: more
features mean exponentially more test cases to write and maintain. AI-based
testing systems scale automatically with your codebase. The AI agent simply
adapts and generates new test scenarios as your application evolves, maintaining
comprehensive coverage without additional effort from your team.
This continuous nature of AI testing is a game-changer:
- Unlike human testers who need breaks and can only work in defined shifts, AI
testing systems operate 24/7, constantly iterating and improving their test
coverage. Every code change becomes an opportunity for immediate testing and
validation rather than waiting for the next test cycle.
- This continuous operation leads to dramatically faster bug detection. Instead
of discovering issues during scheduled testing phases or, worse, from user
reports, potential defects can be flagged immediately as they're introduced.
The system can catch, in real time, subtle regressions and edge cases that
might slip past human testers.
Perhaps most importantly, AI testing transforms how we allocate our human
resources. Instead of having skilled developers and QA engineers spend countless
hours on repetitive regression testing, they can focus on security audits,
exploratory testing, and complex edge cases that require human insight. These
valuable tasks become their primary focus, while the AI handles the grunt work
of comprehensive testing.
But why a "black box"?
That is all just about AI testing in general, though. Why do we think the
"black box" will prevail?
This goes back to our original point: you just don't need to know any of the
internals. Developers shouldn't have to think about the details of coverage or
test generation. They should write code and rely on the black box to do
comprehensive testing. And just like with a compiler, you care about the output
(does the code compile? Do tests pass?), not about the intricacies of lexical
analysis, parsing, optimization, etc. Just as we trust our IDEs to handle syntax
highlighting and code completion without understanding their internals, we
should trust our testing systems to handle coverage and validation autonomously.
It is about cognitive load. Developers are simple folk; they need that headspace
for other things.

(source: xkcd 2510)
How The Black Box Will Work
We don't know, that's the magic!
However, we can sketch out how such a system would operate. At its core, the
black box testing system would have several key components that work in harmony.
Monitoring
An AI constantly observes user sessions, identifying which paths through the
code are most frequently traveled. It collects data on feature usage patterns
and logs any anomalies or unexpected behaviors. This creates a dataset of
real-world usage that goes beyond what traditional test coverage metrics can
capture.
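
To make this concrete, here is a minimal sketch of what the monitoring layer's
bookkeeping could boil down to. The event shape, the path notation, and the
names (`UsageEvent`, `PathStats`, `aggregateUsage`) are assumptions for
illustration, not a description of any existing API.

```typescript
// Hypothetical shape of a single observed user action.
interface UsageEvent {
  sessionId: string;
  path: string;           // e.g. "checkout > apply-coupon > pay"
  timestamp: number;
  anomaly?: string;       // unexpected error, console warning, failed request, etc.
}

// Aggregated view of how often each path is exercised and what went wrong on it.
interface PathStats {
  hits: number;
  anomalies: string[];
}

// Fold raw session events into per-path frequency and anomaly counts.
function aggregateUsage(events: UsageEvent[]): Map<string, PathStats> {
  const stats = new Map<string, PathStats>();
  for (const event of events) {
    const entry: PathStats = stats.get(event.path) ?? { hits: 0, anomalies: [] };
    entry.hits += 1;
    if (event.anomaly) entry.anomalies.push(event.anomaly);
    stats.set(event.path, entry);
  }
  return stats;
}
```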
Extrapolation and modeling
Using the collected data as a foundation, the AI doesn't just test what it
sees-it infers what it might see. It identifies less-traveled or edge paths in
the code and automatically generates test scenarios for them. This is where AI
outpaces traditional testing approaches: it can imagine and test scenarios that
haven't happened yet but could.
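
One way to picture that extrapolation step, sketched with invented names and a
made-up threshold rather than any real algorithm: compare the paths the
application exposes against the hit counts observed in monitoring, and propose
scenarios for whatever users never or rarely touch.

```typescript
// Hypothetical test scenario proposed for a path real users barely exercise.
interface ProposedScenario {
  path: string;
  reason: "never-traveled" | "rarely-traveled";
}

// Compare known paths with observed hit counts and propose scenarios for the gap.
function proposeScenarios(
  knownPaths: string[],
  observedHits: Map<string, number>,
  rareThreshold = 5,
): ProposedScenario[] {
  const proposals: ProposedScenario[] = [];
  for (const path of knownPaths) {
    const hits = observedHits.get(path) ?? 0;
    if (hits === 0) {
      proposals.push({ path, reason: "never-traveled" });
    } else if (hits < rareThreshold) {
      proposals.push({ path, reason: "rarely-traveled" });
    }
  }
  return proposals;
}
```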
Execution and reporting
It runs these generated tests in staging or development environments, carefully
flagging any failing scenarios. The AI can go beyond simple pass/fail reporting.
It can analyze failures and suggest fixes, understanding
the context of the failure and potential solutions based on patterns it has
learned.
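
A minimal sketch of that execute-and-report loop, assuming a runner that can
replay a scenario against staging and a model call that turns failure context
into a suggested fix. `runInStaging` and `suggestFix` are stand-ins here, not
real functions from any product.

```typescript
// Result of running one generated scenario, with optional failure context and fix.
interface ScenarioResult {
  scenario: string;
  passed: boolean;
  failureContext?: string;  // stack trace, DOM snapshot, request log, etc.
  suggestedFix?: string;    // model-proposed remediation
}

// Placeholders for the real staging runner and the real model call.
async function runInStaging(scenario: string): Promise<{ passed: boolean; context?: string }> {
  return { passed: true }; // placeholder
}
async function suggestFix(scenario: string, context: string): Promise<string> {
  return `Investigate ${scenario}: ${context}`; // placeholder
}

// Run every proposed scenario and go beyond pass/fail by attaching a proposed fix.
async function executeAndReport(scenarios: string[]): Promise<ScenarioResult[]> {
  const results: ScenarioResult[] = [];
  for (const scenario of scenarios) {
    const outcome = await runInStaging(scenario);
    const result: ScenarioResult = { scenario, passed: outcome.passed };
    if (!outcome.passed) {
      result.failureContext = outcome.context;
      result.suggestedFix = await suggestFix(scenario, outcome.context ?? "no context captured");
    }
    results.push(result);
  }
  return results;
}
```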
CI/CD pipeline
When the AI identifies a new path through the code or discovers a potential bug,
it automatically tests, logs, and proposes solutions. This process does not need
human orchestration; it just happens, much like how your compiler works without
you needing to think about its internals.
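
Wired into a pipeline, that could be nothing more than a post-push hook that
hands the change to the agent and records what it found, with no one scheduling
or babysitting it. The `TestingAgent` interface below is an assumption made for
illustration, not an actual integration.

```typescript
// Hypothetical agent interface the pipeline talks to.
interface TestingAgent {
  discoverNewPaths(commitSha: string): Promise<string[]>;
  testPath(path: string): Promise<{ passed: boolean; proposedFix?: string }>;
}

// Runs after every push: test whatever the change introduced, log the results,
// and surface proposed fixes. No human orchestration required.
async function onPush(agent: TestingAgent, commitSha: string): Promise<void> {
  const newPaths = await agent.discoverNewPaths(commitSha);
  for (const path of newPaths) {
    const result = await agent.testPath(path);
    if (!result.passed) {
      console.warn(`[${commitSha}] ${path} failed:`, result.proposedFix ?? "no fix proposed");
    }
  }
}
```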
Beyond the Black Box
Imagine that your testing process is as straightforward as running
npm run test
or clicking a "build & test" button. Under the hood, an AI
orchestrates thousands of test scenarios.
When code changes break existing assumptions or introduce new behaviors, the AI
adjusts or regenerates tests automatically. In the (not too distant) future, you
might see automated PRs with suggested patches, plus an entire pipeline that
merges these fixes if they pass further checks. This closes the loop between
"find a bug" and "fix the bug" with minimal human intervention.
This is what we're working on at Momentic. If you want to black-box your
testing, reach out here. If you want to build the black box,
we're hiring (swords provided).