There is a new slop on the block: product slop. If you can ship a feature today, tomorrow, or even tonight, then why not slopship? Any feature that can come to mind can come to fruition. Move fast, break minds.
There are some good reasons to follow this strategy, but there are also many reasons to pull code generation and shipping back to just warp 0.9. Chief among them is a simple concept:
You cannot verify what you cannot reason about
Before you can test whether a system behaves correctly, you must be able to say what "correctly" means. You must know what must be true. When you are generating and shipping at the speed of light, this clarity disappears. The codebase, edge cases, and interactions grow too quickly for anyone to develop the mental model required to say what should be true in the first place.
Testing is a tool for confirming that a system behaves as intended. But if no one can articulate what "intended" means at a given boundary, or if the system's behavior emerges from interactions no one designed, testing becomes an exercise in confusion.
The good news is that, though history doesn’t repeat, it does rhyme. Software engineering has seen a crisis like this before, when complexity outpaced our ability to reason about the systems we were building. What can we learn from the past to better prepare us for the future?
The “software crisis” was a period in the 60s and 70s when software projects failed at alarming rates. Budgets exploded. Timelines slipped. Systems that worked in testing failed in production.
This wasn’t a quality problem. The root cause was unmanaged complexity, not incompetence.
Because programming had suddenly become so much more powerful, system behavior exceeded human comprehension. Codebases outpaced the ability to reason about them, and productivity collapsed despite more engineers, more tools, and more money.
No one could reason about what the system would do. Short (path) king Edsger Dijkstra put it best:
As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.
The resolution came from architecture, not from testing: structured programming, modular design, information hiding behind explicit interfaces, contracts between components, and layered abstractions.
These innovations made systems comprehensible again. Once teams could reason about their systems, testing became effective. Tests could verify that modules honored their contracts. Integration tests could confirm that composed systems behaved as designed. Coverage metrics meant something because there was a "something" to cover.
Testing improved after architecture improved. Not before.
Sudden all-powerful computer technology? Hmm, sounds familiar.
AI is recreating the conditions of the software crisis, but compressed into a single January instead of stretched over decades. Three properties of modern AI accelerate the problem: it generates code at a volume no human can review, its behavior emerges from models no one can fully inspect, and its output changes in ways no one can predict.
The result is systems that work, shipped incredibly fast, until they don't, failing in ways no one can explain or reproduce.
Momentic is a testing platform. You should expect our answer to this problem to be “more tests.”
Testing alone cannot solve architectural ambiguity. When you test a system you don't understand, you end up testing symptoms rather than invariants. Your tests encode whatever the system happens to do, rather than what it should do. When behavior shifts (as it will), tests break, but the breakage doesn't tell you whether something went wrong or whether the system simply changed in ways your tests didn't anticipate.
This produces a specific pathology: coverage increases while confidence decreases. Teams accumulate thousands of test cases. The test suite becomes a brittle mirror of the system's emergent behavior. Maintaining the tests becomes a significant engineering burden. But no one trusts the tests more, because the tests don't verify anything fundamental. They just document what happened last time.
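To make the difference concrete, here is a minimal sketch in a Vitest-style TypeScript test. The `calculateInvoice` function and its numbers are hypothetical, not from any real codebase; the point is the contrast between pinning last observed behavior and asserting an invariant.

```typescript
import { describe, expect, test } from "vitest";

// Hypothetical function under test, used only for illustration.
import { calculateInvoice } from "./billing";

describe("invoice totals", () => {
  // Symptom test: pins whatever the system happened to produce last time.
  // Any change to pricing or rounding breaks it, and the breakage says
  // nothing about whether the change was wrong.
  test("matches last observed output", () => {
    expect(calculateInvoice({ seats: 3, plan: "pro" })).toEqual({
      subtotal: 87,
      tax: 7.83,
      total: 94.83,
    });
  });

  // Invariant test: states what must be true of any implementation.
  // It survives refactors and pricing changes, and fails only when the
  // contract itself is violated.
  test("total is never less than subtotal, and never negative", () => {
    const invoice = calculateInvoice({ seats: 3, plan: "pro" });
    expect(invoice.subtotal).toBeGreaterThanOrEqual(0);
    expect(invoice.total).toBeGreaterThanOrEqual(invoice.subtotal);
  });
});
```

The first test is the brittle mirror: it documents what happened last time. The second narrows the space of acceptable behaviors without caring how the result is computed.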
Testing is most effective when intended behavior can be articulated, when boundaries and contracts are explicit, and when the invariants being checked are stable across change.
Without these conditions, testing doesn’t work. To get to “more tests,” we need “more structure.”
Architectural constraints narrow the space of possible system behaviors. They make explicit what was implicit. They convert unknown failure modes into known, handleable cases.
Consider what architectural constraints do in traditional software. Type systems restrict the values that can flow through a program, catching errors at compile time rather than runtime. Memory safety guarantees eliminate entire classes of bugs by making them unrepresentable. Process isolation ensures that one component's failure cannot corrupt another. Transaction boundaries define where operations succeed or fail atomically.
They are structural decisions that make systems comprehensible and therefore verifiable.
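As a small illustration of the type-system point, here is a TypeScript sketch with an invented `Payment` type, showing how a discriminated union makes an invalid state unrepresentable, so the constraint is enforced at compile time rather than discovered in testing:

```typescript
// A discriminated union narrows the space of representable states.
// A "refunded" payment cannot exist without a refund reason, and a
// "pending" payment cannot carry a settlement date.
type Payment =
  | { status: "pending"; amount: number }
  | { status: "settled"; amount: number; settledAt: Date }
  | { status: "refunded"; amount: number; refundReason: string };

function describePayment(p: Payment): string {
  switch (p.status) {
    case "pending":
      return `Pending charge of ${p.amount}`;
    case "settled":
      return `Settled on ${p.settledAt.toISOString()}`;
    case "refunded":
      return `Refunded: ${p.refundReason}`;
    // No default branch needed: the compiler knows every case is covered,
    // and (with strict checks) adding a new status forces every switch
    // like this one to be updated.
  }
}

// Does not compile: the invalid state is unrepresentable.
// const bad: Payment = { status: "refunded", amount: 10 };
```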
How do we achieve the same effect for AI?
Traditional testing verifies that code behaves as intended. But that framing assumes engineers write the code and therefore know what "intended" means. When AI generates code at a volume humans cannot review, that assumption breaks. The bottleneck shifts from writing code to verifying it, and verification requires a stable reference.
That stable reference is not the code. Code changes constantly. It is a definition of what must be true about the system's behavior.
A test that says "a logged-out user cannot access /billing" is not a verification step. It is a constraint definition. It does not check implementation details. It specifies an invariant that any implementation must satisfy. Written at this level, tests function as architecture. They narrow the state space. They define boundaries. They convert "anything could happen" into "these specific things must hold."
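A sketch of that constraint as an executable test, written here with Playwright. It assumes a configured `baseURL` and an app that redirects unauthenticated visitors to `/login`; the route and selector are illustrative:

```typescript
import { expect, test } from "@playwright/test";

// Invariant: a logged-out user can never reach /billing.
// The test says nothing about how auth is implemented; any implementation
// that upholds the boundary passes.
test("logged-out users cannot access /billing", async ({ page }) => {
  // No login step: this is the logged-out state by construction.
  await page.goto("/billing"); // resolved against the configured baseURL

  await expect(page).toHaveURL(/\/login/);
  await expect(page.getByText("Payment method")).not.toBeVisible();
});
```

Because the assertion is about the boundary rather than the implementation, the same test remains valid whether the code behind it is hand-written, refactored, or AI-generated.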
This reframes the problem. The question is not "how do we test this system?" The question is "what must be true about this system?" The test is the truth.
This is harder than it sounds. It requires clear thinking about boundaries: what the system is allowed to do, what it must do, and what it must never do. It requires distinguishing between essential and incidental behaviors. It requires precision about failure modes, not just success cases.
When engineers can answer that question, they have defined the system's architecture in a form that is executable and enforceable. Tests become contracts. CI enforces them. Nothing ships unless the constraints are satisfied. Refactoring becomes safe because the definition of correctness is external to the implementation.
When engineers cannot answer that question, no amount of testing will help. Tests will encode whatever the system happens to do. They will break when the system changes, but the breakage will not tell anyone whether something went wrong. Coverage will increase while confidence decreases.
The software crisis ended when the industry learned to design systems that humans could reason about. Modularity, interfaces, contracts, and abstractions were not concessions to limited human cognition. They were engineering responses to a fundamental constraint: verification requires comprehension.
AI is testing this lesson again. We are shipping systems whose behavior emerges from processes we cannot inspect, that are built from components we cannot isolate, and that change in ways we cannot predict. The instinct to test harder is understandable. It is also missing the point.
The fix is not more testing in isolation. It is architecture that makes systems comprehensible, and rigorous testing on that foundation.
This is an old lesson. Ignoring it would be the truly modern mistake.