Anthropic's multi-agent harness uses a generator and evaluator to improve code quality. But when both agents share the same model, they share the same blind spots. External evaluation changes everything.
Last week, Anthropic published their recommended architecture for building production applications with Claude Code. The core idea is a multi-agent harness: instead of one agent writing code and hoping for the best, you split the work across specialized roles.
A Planner expands your prompt into a detailed spec. A Generator implements features in sprints. An Evaluator runs tests and grades the output against predefined criteria. If the code fails evaluation, it goes back to the Generator for another pass.
The pattern draws from GANs (Generative Adversarial Networks) - the same adversarial dynamic that made image generation so effective. One system creates, another critiques, and the tension between them drives quality up.
It's a genuinely good architecture. And it targets a real problem: models cannot reliably evaluate their own work. But there's a gap in this approach that nobody seems to be talking about.
Note
Anthropic's harness uses Playwright for end-to-end testing in the Evaluator stage, grading sprints on product depth, functionality, visual design, and code quality.
AI models, regardless of how capable they are, have consistent failure patterns when generating code. These aren't random bugs - they're systematic blind spots that show up across models and across codebases.
Happy-path optimization. AI agents write code that handles the expected input perfectly. Edge cases, malformed data, concurrent access, network timeouts - these get skipped because the model optimizes for the scenario described in the prompt, not the scenarios that production will throw at it.
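As a quick illustration - the endpoint and fields here are made up - this is the shape of code that handles exactly the input the prompt described and nothing else:

```typescript
// Illustrative only - a typical "happy path" handler an agent might produce.
export async function getUserProfile(
  userId: string
): Promise<{ name: string; email: string }> {
  const res = await fetch(`https://api.example.com/users/${userId}`); // no timeout, no retry
  const body = await res.json(); // assumes a 200 response with valid JSON
  return { name: body.name, email: body.email }; // assumes both fields exist
}
```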
Security as an afterthought. Models treat security the way junior developers often do: it's something you add after the feature works. Hardcoded secrets, missing CSRF protection, SQL injection vectors - the model knows these are problems in theory, but doesn't prioritize them during generation.
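A sketch of the pattern, with an invented schema and the `pg` client standing in for whatever database layer you actually use:

```typescript
import { Client } from "pg";

// Illustrative: the feature "works", but user input flows straight into SQL.
export async function findUserByEmail(db: Client, email: string) {
  return db.query(`SELECT * FROM users WHERE email = '${email}'`); // injection vector

  // The fix is one line away - a parameterized query - but it only happens
  // if something in the loop is scoring for it:
  // return db.query("SELECT * FROM users WHERE email = $1", [email]);
}
```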
Blast radius blindness. When an agent modifies authentication middleware, it doesn't naturally reason about how many services depend on that module. A change to a shared utility that "just adds a parameter" can break 15 downstream consumers. Models think locally, not systemically.
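A small illustration - the utility and its callers are invented - of how a "harmless" signature change propagates:

```typescript
// Before: every caller uses formatAmount(cents) and assumes USD.
// After: the agent adds a required parameter to support one new feature.
export function formatAmount(cents: number, currency: string): string {
  return new Intl.NumberFormat("en-US", {
    style: "currency",
    currency,
  }).format(cents / 100);
}

// Every existing call site - formatAmount(1999) - now fails to compile,
// or in untyped callers, throws at runtime because `currency` is undefined.
```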
Test coverage gaps. AI-generated tests tend to mirror the implementation. If the code has a bug, the test often encodes that same bug as expected behavior. The test passes, the CI is green, and the vulnerability ships.
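Here's the shape of the failure, with an invented spec and a Vitest-style test standing in for whatever runner you use:

```typescript
import { expect, test } from "vitest"; // assumed test runner, for illustration

// Spec: "orders of 10 or more items get the bulk discount."
// The implementation uses a strict comparison - an off-by-one bug.
export function qualifiesForBulkDiscount(quantity: number): boolean {
  return quantity > 10; // should be >= 10
}

// An AI-generated test written from the implementation, not the spec,
// encodes the bug as expected behavior - so CI stays green.
test("bulk discount threshold", () => {
  expect(qualifiesForBulkDiscount(10)).toBe(false); // passes, but the requirement was `true`
  expect(qualifiesForBulkDiscount(11)).toBe(true);
});
```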
An evaluator built on the same model will consistently miss these patterns because they're baked into how the model thinks about code.
The fix isn't better prompting or more agents. It's different evaluation criteria applied by a different system.
Consider how mature engineering organizations handle this. They don't ask the developer who wrote the code to also write the security review. They have separate teams with separate checklists looking at the same code from different angles.
Each reviewer applies criteria that the original author didn't optimize for. That's what makes the review valuable.
The same principle applies to AI-generated code. An external evaluation system should score code across dimensions the generator wasn't optimizing for.
When your evaluation criteria are orthogonal to your generation criteria, you catch the problems the generator structurally cannot see.
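As a rough sketch of what orthogonal evaluation looks like - the dimension names and scoring below are illustrative, not the actual pipeline:

```typescript
// Hypothetical dimensions - chosen because none of them restate the
// generator's own objective ("implement the feature in the prompt").
type RiskDimension =
  | "edgeCaseHandling"
  | "securitySurface"
  | "blastRadius"
  | "testIndependence"
  | "dependencyRisk"
  | "operationalReadiness";

interface DimensionScore {
  dimension: RiskDimension;
  score: number; // 0 (low risk) to 10 (high risk)
  findings: string[];
}

// One design choice worth making explicit: take the worst dimension, not the
// average, so a single high-risk finding can't be diluted by good scores elsewhere.
export function overallRisk(scores: DimensionScore[]): number {
  return scores.reduce((worst, s) => Math.max(worst, s.score), 0);
}
```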
Tip
The most effective code review catches problems the author didn't think about - not the ones they did. The same applies to AI evaluation: different criteria matter more than a different agent.
Anthropic's harness treats every sprint the same. The first feature the Generator produces gets the same evaluation as the fiftieth. There's no memory, no learning, no adaptation.
But in real teams, trust is earned over time. A developer who consistently ships clean code gets less scrutiny on routine changes. A new hire gets thorough reviews on everything until they've demonstrated judgment.
AI agents should work the same way. An agent that has produced 200 clean PRs with zero security findings should earn different treatment than one that's new to your codebase. The evaluation bar should adapt to the agent's track record - not because you trust the agent blindly, but because you have data about its reliability.
This is what a trust scoring system enables.
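A minimal sketch of the idea, with invented field names and thresholds:

```typescript
// Hypothetical track record for one agent identity in one repository.
interface AgentTrackRecord {
  mergedPrs: number;          // PRs merged without being reverted
  securityFindings: number;   // confirmed findings attributed to this agent
  postMergeIncidents: number; // production issues traced back to its changes
}

// Trust accumulates slowly with a clean history and collapses on a bad signal.
export function trustScore(r: AgentTrackRecord): number {
  if (r.securityFindings > 0 || r.postMergeIncidents > 0) return 0;
  return Math.min(1, r.mergedPrs / 200); // full trust only after a long clean run
}
```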
The harness pattern gives you quality control on a per-sprint basis. Trust scoring gives you quality control that compounds over time.
Anthropic's harness is a strong foundation. The Planner-Generator-Evaluator pattern is a real improvement over single-agent code generation. But it's solving one layer of the problem: code quality within a single session.
There are layers it doesn't address.
The generator-evaluator loop handles the inner feedback cycle. Governance handles everything outside that loop - the organizational policies, the trust relationships, the risk-based routing that determines what merges automatically and what needs human judgment.
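In practice the routing layer can be small. A sketch, with made-up paths and thresholds, that combines a risk score (0-10) with a trust score (0-1):

```typescript
// Hypothetical policy: protected paths always get a human, everything else
// is routed on the combination of risk score and the agent's trust score.
const policy = {
  humanReviewPrefixes: ["src/auth/", "infra/", "db/migrations/"],
  maxAutoMergeRisk: 3,   // on a 0-10 scale
  minAutoMergeTrust: 0.9, // on a 0-1 scale
};

export function route(
  changedPaths: string[],
  riskScore: number,
  trust: number
): "auto-merge" | "human-review" {
  const touchesProtectedPath = changedPaths.some((path) =>
    policy.humanReviewPrefixes.some((prefix) => path.startsWith(prefix))
  );
  if (touchesProtectedPath) return "human-review";
  return riskScore <= policy.maxAutoMergeRisk && trust >= policy.minAutoMergeTrust
    ? "auto-merge"
    : "human-review";
}
```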
Note
The best setup is layered: Anthropic's harness for inner-loop quality, external risk scoring for cross-cutting concerns, and trust-based governance for organizational policies.
If your team is using AI agents to generate code today, the practical framework is simple: pair the harness with governance.
The harness gives you better code. Governance gives you confidence that what ships is actually safe. You need both.
Tip
Start with external risk scoring on every PR. Add trust scoring and auto-merge rules once you have enough data to calibrate thresholds for your codebase.
Dive deeper with interactive walkthroughs
Understanding Risk Scores - Learn how the two-stage AI pipeline scores PRs across 6 risk dimensions.
Agent Detection & Trust - How MergeShield identifies AI agents and builds trust scores over time.
Configuring Auto-Merge - Set up auto-merge rules for low-risk PRs so your team can focus on what matters.