Shrinking the Verification Gap: Practical Patterns for AI-Assisted Development
If AI scales execution and verification is the bottleneck, the winning move is to make verification cheaper. Here are the patterns that actually work.
In the previous post, I wrote about human verification bandwidth as the binding constraint on AI productivity. The cost to automate falls. The cost to verify stays biologically bounded. # of humans << # of AIs.
The natural follow-up question: if verification is the bottleneck, how do you make verification cheaper?
Not by hiring more reviewers. Not by trusting AI more. By restructuring how work gets done so that less verification is needed in the first place, and the verification that remains is faster.
Here are the patterns that are working.
Red/green TDD
Simon Willison calls this “a pleasingly succinct way to get better results out of a coding agent.” Four words in a prompt — “use red/green TDD” — and the agent shifts to test-first development.
The red phase writes tests and confirms they fail. The green phase implements until the tests pass. This matters for verification because it converts a subjective question (“does this code work?”) into a binary one (“do the tests pass?”). Pass/fail is cheaper to verify than reading 200 lines of generated code and reasoning about correctness.
The red phase is the part people skip, and it’s the part that matters most. Without confirming tests fail first, you risk an agent writing tests that pass vacuously — tests that verify nothing. The red phase is your proof that the tests actually exercise the new code.
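A minimal sketch of the loop, using a hypothetical slugify function as the feature under development: in the red phase the test below is written and run before slugify exists, so it fails; in the green phase the implementation is added until it passes.

```python
# Red phase: write the test first and RUN it. It must fail
# (slugify does not exist yet), which proves the test actually
# exercises the new code rather than passing vacuously.
def test_slugify_lowercases_and_hyphenates():
    assert slugify("Hello World!") == "hello-world"

# Green phase: implement just enough for the test to pass.
def slugify(text: str) -> str:
    cleaned = "".join(c for c in text.lower() if c.isalnum() or c == " ")
    return "-".join(cleaned.split())
```

The payoff is the binary signal: the reviewer checks that the red run failed and the green run passed, instead of reasoning about the implementation line by line.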
A related four-word prompt: “first run the tests.” Before the agent changes anything, it discovers the test suite, sees the project’s current state, and establishes a baseline. Code that has never been executed works in production only by luck.
Define the outcome, not the solution
There’s a temptation to constrain how the agent solves a problem. That’s the wrong lever. What works is constraining what the agent should produce and validating against that.
Pete Hodgson points out that AI writes code at the level of a solid senior engineer but makes design decisions like a fairly junior one. The fix isn’t to micromanage the implementation. It’s to be precise about the expected outcome so you can verify against it.
Instead of “implement user authentication,” say “add JWT validation middleware that rejects expired tokens and returns 401. Here’s the test case it should pass.” Now review shifts from “is this the right approach?” to “does it meet the acceptance criteria?” — a question with a clear answer.
Three ways to make outcomes verifiable:
- State acceptance criteria, not implementation steps. “Each change should be backwards compatible” is better than dictating which pattern to use.
- Point to existing patterns. “There’s an existing example in UpdateCompany” gives the agent a reference without dictating a solution.
- Store conventions permanently. A CLAUDE.md or rules file that encodes preferred libraries, commit size expectations, and how to run tests. The agent adapts its approach to match your standards without you specifying the how every time.
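To make the last point concrete, here is a hypothetical CLAUDE.md fragment (the library names, file paths, and commands are illustrative, not a recommendation):

```markdown
# CLAUDE.md
## Conventions
- Use httpx for HTTP calls; do not add requests as a dependency.
- Keep each commit under ~200 changed lines.
- Run the suite with `pytest -x` before reporting a task as done.
- New service modules follow the pattern in services/update_company.py.
```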
Linters as executable law
Factory.ai’s guide has the best framing I’ve seen: “Agents write the code; linters write the law.”
Documentation provides hints. Linters provide rules. AI can choose to ignore documentation. It cannot ignore a linting error that blocks CI.
The practical pattern: every time you catch a mistake in review, don’t just fix it. Codify it as a lint rule with an autofix. Then run agents to fix all existing violations across the codebase. Wire the rule into pre-commit hooks, CI, and agent toolchains. The codebase self-heals against future drift.
If you adopt only one category of lint rules, prioritize grep-ability: named exports, consistent naming, absolute imports. A codebase that’s easy to search is easy to verify — for both humans and agents.
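As a sketch of turning a review catch into an executable rule, here is a minimal standalone check that flags relative imports, enforcing the absolute-imports convention above in CI instead of in review. In practice you would ship this as a plugin for your linter (Ruff, ESLint, etc.) with an autofix; the function name is hypothetical.

```python
import ast

def find_relative_imports(source: str) -> list[int]:
    """Return line numbers of relative imports (`from . import x`).

    The pattern: a reviewer caught one relative import, so the rule
    is codified once and blocks every future occurrence in CI.
    """
    return [
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.ImportFrom) and node.level > 0
    ]
```

Wired into a pre-commit hook, a non-empty result exits non-zero and blocks the commit, for humans and agents alike.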
Small tasks, not large features
The numbers are not encouraging: AI-generated code creates roughly 1.7x more issues than human-written code, PRs are running ~18% larger, and incidents per PR are up ~24%.
The math on large AI-generated PRs is brutal. A 500-line PR with a 1.7x defect rate doesn’t just have more bugs — it has bugs hiding in a larger haystack. The verification cost scales worse than linearly.
The fix: decompose work into small, bounded tasks. Each task should produce a change small enough to review in minutes, not hours. This also maps to how agents work best — they perform better on focused, well-scoped problems than on open-ended feature builds.
Spec before code
Lock intent before the agent touches code. A spec doesn’t need to be a formal document. It needs a problem statement, non-goals, acceptance criteria, and architecture constraints.
The spec is the source code. The generated implementation is the compiled output. As one practitioner put it: “We never commit the compiler output — we commit the source and regenerate the binary from scratch. Do the same with AI coding.”
When the spec is clear, verification becomes: does the output match the spec? When the spec is vague, verification becomes: is this what we wanted? The first question is answerable. The second is a conversation.
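A minimal spec skeleton covering those four parts. The section names are one reasonable layout rather than a standard, and the feature described is a made-up example:

```markdown
# Spec: rate-limit the public search endpoint
## Problem
Unauthenticated clients can hammer /search and degrade the service.
## Non-goals
No per-customer quota tiers; no changes to authenticated endpoints.
## Acceptance criteria
- More than 30 requests/minute from one IP returns 429.
- Existing integration tests pass unchanged.
## Architecture constraints
- Use the existing Redis instance; no new infrastructure.
```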
Automated quality gates
Configure your tools to run checks on every edit, not just at commit time. In Claude Code, hooks can run type checking after every TypeScript edit, lint after every file modification, and security scans on auth code changes. Non-zero exit codes block the agent from proceeding.
The stronger your automated gates, the less you need to verify manually. A type checker that runs on every edit catches a class of bugs that would otherwise land in your review queue. A linter that blocks on every save means you never review formatting or import ordering. These aren’t glamorous. They’re verification that runs at machine speed instead of human speed.
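A sketch of what such a gate can look like in Claude Code’s settings file (the hooks schema as documented at the time of writing; the matcher and command are illustrative):

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npx tsc --noEmit" }
        ]
      }
    ]
  }
}
```

Here the type checker runs after every file edit, and a failing check is surfaced back to the agent to fix before it moves on.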
Fresh-context review
Have the model review its own output in a separate context window. Addy Osmani recommends this because the model approaches the code without the assumptions it accumulated during generation. It sees what a reviewer would see.
Willison takes this further with “linear walkthroughs” — asking the agent to narrate a structured walkthrough of the generated code, but forcing it to use shell commands (grep, cat, sed) to extract actual code into the document. This prevents the agent from copying snippets from memory, which could introduce hallucinations.
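A sketch of the mechanics with stand-in file paths: every snippet in the walkthrough document is extracted verbatim from disk by shell commands, never recalled from the model’s context.

```shell
# Stand-in source file for the demo (in practice, the repo's real files).
printf 'def validate_token(tok):\n    return False\n' > /tmp/auth_demo.py

# Build the walkthrough from actual shell output, not from memory:
# grep -n pulls the real line, with its real line number, off disk.
echo '## Step 1: token validation entry point' > /tmp/walkthrough.md
grep -n 'def validate_token' /tmp/auth_demo.py >> /tmp/walkthrough.md
cat /tmp/walkthrough.md
```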
The verification flywheel
These patterns compound. Better linters catch more issues automatically, which reduces what humans need to review. Smaller tasks produce smaller diffs, which are faster to review. TDD provides binary pass/fail signals, which replace subjective code reading. Specs lock intent, which makes acceptance criteria testable.
Each pattern individually saves a little verification time. Stacked together, they shift the ratio. Addy Osmani reports that developers succeeding with AI agents spend 70% of their time on problem definition and verification strategy and 30% on execution, an inversion of the traditional split.
That inversion is the point. You can’t make humans read faster. You can make it so there’s less they need to read.
This is a follow-up to The Verification Bottleneck: Why AI’s Real Cost Is Human Attention.