March 24, 2026
Something interesting is happening in the world of AI-assisted development. Practitioners working independently, in different contexts, with different constraints, keep arriving at the same structural insight: the model that writes the code should not be the model that reviews it.
Boris Cherny, the creator of Claude Code, recently shared his personal workflow, and this principle sits at its center. He achieves separation through subagents running in fresh, isolated contexts. Meanwhile, the Chief Engineer workflow arrives at the same destination through a different mechanism — using a separate model entirely for the review step. The structure is the same. The only philosophical difference is whether "independent reviewer" means a fresh instance of the same model or a different model altogether.
When two very different practitioners land on the same answer without coordinating, that's usually a signal the answer is tracking something real about the problem. In this case, what it's tracking is something genuinely true about how AI cognition works — and understanding it will change how you build with these tools.
The core intuition is almost embarrassingly simple once you hear it: if a model writes a piece of code, it has a mental model of what it intended that code to do — and that same mental model is exactly what will blind it to the ways the code might be wrong. It's the AI equivalent of proofreading your own writing. Your brain autocorrects the typos because it already knows what you meant to say.
This isn't a new idea. It's the same reason code review exists in human teams. We've long understood that the person who wrote the code is the worst person to catch its bugs — not because they're careless, but because they're too close to it. The novelty is that we're now applying that same organizational insight to AI systems themselves.
Builders like Boris discovered this empirically, by watching agentic systems fail in the wild. When you give one model instance the job of generating and reviewing its own work in the same context window, the error rate doesn't drop nearly as much as you'd hope. But introduce a second pass — a separate context, sometimes a different model entirely — and suddenly you're catching a whole class of subtle bugs the first pass walked right past.
The isolated context piece is the key mechanism, and it's more subtle than it sounds. When a model runs in a fresh context, it's not carrying any of the reasoning that led to the original code. It hasn't made the micro-decisions, the tradeoffs, the "eh, close enough" judgments that happen during generation. So when it reviews that code cold, it's genuinely asking "does this do what it claims?" rather than "does this match what I remember wanting?"
Boris's approach with subagents leans into this structurally — each agent invocation is genuinely isolated, no shared memory, no accumulated context that could bias the review. He actually shipped a feature where Plan Mode automatically clears context when you accept a plan, specifically because he recognized that context contamination degrades review quality. That's the creator of the tool telling you something important about how AI cognition works.
Using a completely different model for review takes this a step further. Different models have different training emphases, different tendencies, different blind spots. So you're not just getting a fresh perspective — you're getting a different perspective. If the generating model has a particular weakness, the reviewing model might have been trained in a way that makes it more attentive to exactly that area.
So how should a developer choose between these approaches? Start by asking what kind of errors you're most afraid of.
If your main concern is logical consistency — does this function do what the comments say, are the null cases handled, is the variable naming coherent — a fresh context with the same model is often sufficient and meaningfully cheaper. You're not fighting a knowledge gap; you're just breaking the momentum of the original generation context.
But if the stakes are high enough that you need to catch things a model is systematically inclined to miss — security vulnerabilities, subtle race conditions, architectural decisions that look fine locally but create problems at scale — that's when a different model earns its keep. You're not just resetting context at that point. You're genuinely diversifying your coverage.
There's also a practical heuristic worth keeping in your back pocket: if you're running the same model for both passes and the reviewer is consistently agreeing with the generator at a rate that feels suspiciously high, that's a signal you might need model diversity rather than just context isolation. Some degree of productive disagreement is what you actually want out of a review step. If the reviewer rubber-stamps everything, you haven't added a check — you've just added latency.
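That heuristic is easy to operationalize: log each review verdict and flag when the approval rate over a recent window looks suspiciously high. A small sketch, with the window size and 95% threshold as illustrative defaults rather than empirically derived numbers:

```python
from collections import deque

class AgreementMonitor:
    """Track reviewer verdicts and flag a rubber-stamping reviewer.

    The defaults (50-verdict window, 0.95 threshold, 20 minimum samples)
    are assumptions to tune, not recommendations from measurement.
    """

    def __init__(self, window: int = 50, threshold: float = 0.95):
        self.verdicts: deque = deque(maxlen=window)  # rolling window of bools
        self.threshold = threshold

    def record(self, reviewer_approved: bool) -> None:
        self.verdicts.append(reviewer_approved)

    def approval_rate(self) -> float:
        if not self.verdicts:
            return 0.0
        return sum(self.verdicts) / len(self.verdicts)

    def needs_model_diversity(self, min_samples: int = 20) -> bool:
        # If the reviewer approves nearly everything, context isolation
        # alone may not be enough: consider a different reviewing model.
        return (len(self.verdicts) >= min_samples
                and self.approval_rate() >= self.threshold)
```

A healthy review step should keep `needs_model_diversity()` returning False; when it trips, that's your cue to try a different model for the second pass.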
The lowest-friction starting point is also the most unglamorous one: just add a second prompt. Before you ship any AI-generated code, paste it into a fresh conversation — no history, no prior context — and ask "what could go wrong here?" You'll be surprised how often that cold read surfaces something the generating pass walked right by. You don't need a multi-agent framework to get the core benefit. You just need the discipline to break the context.
From there, if you're using Claude Code or another agentic setup, the next step is making that second pass structural rather than optional. Build it into the workflow so it happens automatically — not just when you remember to do it. Humans are bad at consistently applying discretionary checks under time pressure, and AI-assisted workflows aren't much better if they're not designed to enforce the check. Make the review step load-bearing, not aspirational.
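One way to make the step load-bearing is to make the review gate the only path to shipped code. A minimal sketch, where `review_fn` stands in for your actual reviewer (a subagent, a second model, or a human) and the `REJECT` marker convention is a hypothetical choice, not a standard:

```python
from typing import Callable, Tuple

class ReviewBlockedError(RuntimeError):
    """Raised when the mandatory review pass rejects the code."""

def gated_ship(code: str,
               review_fn: Callable[[str], str],
               block_on: Tuple[str, ...] = ("REJECT",)) -> str:
    """Run the mandatory review pass and return only code that survives it.

    The point is structural: there is no code path that returns `code`
    without `review_fn` having run, so skipping the review is not a
    matter of discipline -- it is impossible by construction.
    """
    verdict = review_fn(code)
    if any(marker in verdict for marker in block_on):
        raise ReviewBlockedError(f"Review blocked shipment: {verdict}")
    return code
```

The same shape works at any granularity: wrap your agent's final output, a commit hook, or a CI step, as long as the gate sits on the only route to "done."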
Once you've got that habit established and you're seeing real catches come out of it, that's when you ask whether you need model diversity on top of context isolation. At that point you'll have actual data — you'll know what kinds of issues your review step is catching and what it's missing — and you can make that call with evidence rather than intuition.
The deeper shift underneath all of this requires reframing how you think about AI tools. Stop asking "did the AI get it right?" and start asking "how do I know if the AI got it right?" That second question is more productive, and everything else follows from it.
Stop thinking of the model as a magic box that produces correct output, and start thinking of it as a very fast collaborator who genuinely benefits from having their work checked.
When the creator of a major AI coding tool and independent workflow designers arrive at the same structure without coordinating, it tells us the structure is probably load-bearing, not incidental. The verification principle isn't a quirky personal preference — it reflects a real asymmetry in how these models evaluate their own output.
A model that holds the implementation context also holds the rationalizations, the shortcuts, the assumptions that went into it. A fresh reviewer does not. That asymmetry is structural, not accidental, and the workflows that perform best are the ones that account for it explicitly.
What's genuinely optimistic about where this is heading is that as these practices mature, they get embedded in tooling. And once something is in the tooling, it stops requiring individual discipline to apply consistently. The verification principle stops being a best practice that thoughtful developers follow and becomes just how the tools work — a default you'd have to actively override.
The developers who will be most fluent in agentic AI systems two years from now are building these instincts right now. The starting point is simpler than most people think: break the context, treat review as a real step, and notice when your reviewer is being too agreeable.
If you want to hear these ideas explored in conversation, check out the "Claude Code Conversations with Claudine" radio show. Available on all major podcast sites.