Anthropic's multi-agent harness uses a generator and evaluator to improve code quality. But when both agents share the same model, they share the same blind spots. External evaluation changes everything.
Last week, Anthropic published their recommended architecture for building production applications with Claude Code. The core idea is a multi-agent harness: instead of one agent writing code and hoping for the best, you split the work across specialized roles.
A Planner expands your prompt into a detailed spec. A Generator implements features in sprints. An Evaluator runs tests and grades the output against predefined criteria. If the code fails evaluation, it goes back to the Generator for another pass.
The pattern draws from GANs (Generative Adversarial Networks) - the same adversarial dynamic that made image generation so effective. One system creates, another critiques, and the tension between them drives quality up.
It's a genuinely good architecture. And it targets a real problem: models cannot reliably evaluate their own work. But there's a gap in this approach that nobody seems to be talking about.
Note
Anthropic's harness uses Playwright for end-to-end testing in the Evaluator stage, grading sprints on product depth, functionality, visual design, and code quality.
AI models, regardless of how capable they are, have consistent failure patterns when generating code. These aren't random bugs - they're systematic blind spots that show up across models and across codebases.
Happy-path optimization. AI agents write code that handles the expected input perfectly. Edge cases, malformed data, concurrent access, network timeouts - these get skipped because the model optimizes for the scenario described in the prompt, not the scenarios that production will throw at it.
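As a quick illustration - the endpoint and fields here are made up - this is the shape of code that handles exactly the input the prompt described and nothing else:

```typescript
// Illustrative only - a typical "happy path" handler an agent might produce.
export async function getUserProfile(
  userId: string
): Promise<{ name: string; email: string }> {
  const res = await fetch(`https://api.example.com/users/${userId}`); // no timeout, no retry
  const body = await res.json(); // assumes a 200 response with valid JSON
  return { name: body.name, email: body.email }; // assumes both fields exist
}
```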
Security as an afterthought. Models treat security the way junior developers often do: it's something you add after the feature works. Hardcoded secrets, missing CSRF protection, SQL injection vectors - the model knows these are problems in theory, but doesn't prioritize them during generation.
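A sketch of the pattern, with an invented schema and the `pg` client standing in for whatever database layer you actually use:

```typescript
import { Client } from "pg";

// Illustrative: the feature "works", but user input flows straight into SQL.
export async function findUserByEmail(db: Client, email: string) {
  return db.query(`SELECT * FROM users WHERE email = '${email}'`); // injection vector

  // The fix is one line away - a parameterized query - but it only happens
  // if something in the loop is scoring for it:
  // return db.query("SELECT * FROM users WHERE email = $1", [email]);
}
```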
Blast radius blindness. When an agent modifies authentication middleware, it doesn't naturally reason about how many services depend on that module. A change to a shared utility that "just adds a parameter" can break 15 downstream consumers. Models think locally, not systemically.
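A small illustration - the utility and its callers are invented - of how a "harmless" signature change propagates:

```typescript
// Before: every caller uses formatAmount(cents) and assumes USD.
// After: the agent adds a required parameter to support one new feature.
export function formatAmount(cents: number, currency: string): string {
  return new Intl.NumberFormat("en-US", {
    style: "currency",
    currency,
  }).format(cents / 100);
}

// Every existing call site - formatAmount(1999) - now fails to compile,
// or in untyped callers, throws at runtime because `currency` is undefined.
```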
Test coverage gaps. AI-generated tests tend to mirror the implementation. If the code has a bug, the test often encodes that same bug as expected behavior. The test passes, the CI is green, and the vulnerability ships.
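Here's the shape of the failure, with an invented spec and a Vitest-style test standing in for whatever runner you use:

```typescript
import { expect, test } from "vitest"; // assumed test runner, for illustration

// Spec: "orders of 10 or more items get the bulk discount."
// The implementation uses a strict comparison - an off-by-one bug.
export function qualifiesForBulkDiscount(quantity: number): boolean {
  return quantity > 10; // should be >= 10
}

// An AI-generated test written from the implementation, not the spec,
// encodes the bug as expected behavior - so CI stays green.
test("bulk discount threshold", () => {
  expect(qualifiesForBulkDiscount(10)).toBe(false); // passes, but the requirement was `true`
  expect(qualifiesForBulkDiscount(11)).toBe(true);
});
```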
An evaluator built on the same model will consistently miss these patterns because they're baked into how the model thinks about code.
The fix isn't better prompting or more agents. It's different evaluation criteria applied by a different system.
Consider how mature engineering organizations handle this. They don't ask the developer who wrote the code to also write the security review. They have separate teams with separate checklists looking at the same code from different angles.
Each reviewer applies criteria that the original author didn't optimize for. That's what makes the review valuable.
The same principle applies to AI-generated code. An external evaluation system should score code across dimensions the generator wasn't optimizing for.
When your evaluation criteria are orthogonal to your generation criteria, you catch the problems the generator structurally cannot see.
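As a rough sketch of what orthogonal evaluation looks like - the dimension names and scoring below are illustrative, not the actual pipeline:

```typescript
// Hypothetical dimensions - chosen because none of them restate the
// generator's own objective ("implement the feature in the prompt").
type RiskDimension =
  | "edgeCaseHandling"
  | "securitySurface"
  | "blastRadius"
  | "testIndependence"
  | "dependencyRisk"
  | "operationalReadiness";

interface DimensionScore {
  dimension: RiskDimension;
  score: number; // 0 (low risk) to 10 (high risk)
  findings: string[];
}

// One design choice worth making explicit: take the worst dimension, not the
// average, so a single high-risk finding can't be diluted by good scores elsewhere.
export function overallRisk(scores: DimensionScore[]): number {
  return scores.reduce((worst, s) => Math.max(worst, s.score), 0);
}
```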
Tip
The most effective code review catches problems the author didn't think about - not the ones they did. The same applies to AI evaluation: different criteria matter more than a different agent.
Anthropic's harness treats every sprint the same. The first feature the Generator produces gets the same evaluation as the fiftieth. There's no memory, no learning, no adaptation.
But in real teams, trust is earned over time. A developer who consistently ships clean code gets less scrutiny on routine changes. A new hire gets thorough reviews on everything until they've demonstrated judgment.
AI agents should work the same way. An agent that has produced 200 clean PRs with zero security findings should earn different treatment than one that's new to your codebase. The evaluation bar should adapt to the agent's track record - not because you trust the agent blindly, but because you have data about its reliability.
This is what a trust scoring system enables.
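A minimal sketch of the idea, with invented field names and thresholds:

```typescript
// Hypothetical track record for one agent identity in one repository.
interface AgentTrackRecord {
  mergedPrs: number;          // PRs merged without being reverted
  securityFindings: number;   // confirmed findings attributed to this agent
  postMergeIncidents: number; // production issues traced back to its changes
}

// Trust accumulates slowly with a clean history and collapses on a bad signal.
export function trustScore(r: AgentTrackRecord): number {
  if (r.securityFindings > 0 || r.postMergeIncidents > 0) return 0;
  return Math.min(1, r.mergedPrs / 200); // full trust only after a long clean run
}
```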
The harness pattern gives you quality control on a per-sprint basis. Trust scoring gives you quality control that compounds over time.
Anthropic's harness is a strong foundation. The Planner-Generator-Evaluator pattern is a real improvement over single-agent code generation. But it's solving one layer of the problem: code quality within a single session.
There are layers it doesn't address.
The generator-evaluator loop handles the inner feedback cycle. Governance handles everything outside that loop - the organizational policies, the trust relationships, the risk-based routing that determines what merges automatically and what needs human judgment.
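In practice the routing layer can be small. A sketch, with made-up paths and thresholds, that combines a risk score (0-10) with a trust score (0-1):

```typescript
// Hypothetical policy: protected paths always get a human, everything else
// is routed on the combination of risk score and the agent's trust score.
const policy = {
  humanReviewPrefixes: ["src/auth/", "infra/", "db/migrations/"],
  maxAutoMergeRisk: 3,   // on a 0-10 scale
  minAutoMergeTrust: 0.9, // on a 0-1 scale
};

export function route(
  changedPaths: string[],
  riskScore: number,
  trust: number
): "auto-merge" | "human-review" {
  const touchesProtectedPath = changedPaths.some((path) =>
    policy.humanReviewPrefixes.some((prefix) => path.startsWith(prefix))
  );
  if (touchesProtectedPath) return "human-review";
  return riskScore <= policy.maxAutoMergeRisk && trust >= policy.minAutoMergeTrust
    ? "auto-merge"
    : "human-review";
}
```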
Note
The best setup is layered: Anthropic's harness for inner-loop quality, external risk scoring for cross-cutting concerns, and trust-based governance for organizational policies.
If your team is using AI agents to generate code today, the practical framework is simple: pair the harness with governance.
The harness gives you better code. Governance gives you confidence that what ships is actually safe. You need both.
Tip
Start with external risk scoring on every PR. Add trust scoring and auto-merge rules once you have enough data to calibrate thresholds for your codebase.
Dive deeper with interactive walkthroughs
Understanding Risk Scores - Learn how the two-stage AI pipeline scores PRs across 6 risk dimensions.
Agent Detection & Trust - How MergeShield identifies AI agents and builds trust scores over time.
Configuring Auto-Merge - Set up auto-merge rules for low-risk PRs so your team can focus on what matters.