AI Agents · Governance · Engineering

The Three Signals Undercover Mode Kills (and What to Use Instead)

Undercover Mode strips the three signals most teams rely on to detect AI-generated code. Here's what actually works when attribution is gone.

MergeShield Team · April 1, 2026 · 5 min read

The Wrong Priority

When the Claude Code source leaked last week, most of the attention went to KAIROS and Coordinator Mode - the autonomous daemon and multi-agent orchestration features that haven't shipped yet.

Undercover Mode got less coverage. That's the wrong priority.

KAIROS and Coordinator Mode are future problems. Undercover Mode is a present one. It's already in the codebase. And what it does is quietly eliminate every signal most teams rely on to know when an AI agent touched their code.

Note

Undercover Mode was designed for Anthropic employees contributing to public open-source repos. But the mechanism is identical in private repos - there is no scope restriction in the leaked source.

What Undercover Mode Actually Does

The leaked source is specific. When operating in Undercover Mode, Claude Code strips four categories of attribution:

Co-author git trailers - removes Co-Authored-By: Claude from every commit
Commit message attribution - no AI-generated message patterns or signatures
Branch naming conventions - avoids agent-identifiable prefixes like claude/, ai-fix-, copilot-
PR description signatures - suppresses tool headers and automated footers

The stated purpose is open-source contribution - Anthropic did not want Claude Code flagged as a bot when contributing to public repos. The intent is reasonable. The mechanism is the problem: it's identical whether Claude is contributing to a public library or opening a PR in your private monorepo. And according to the leaked code, there is no off switch.

Warning

There is no off switch for Undercover Mode in the leaked source. It activates based on context, not user preference.

The Three Signals It Kills

Most teams detecting AI-generated code - consciously or not - rely on three signals. Undercover Mode eliminates all three.

Signal 1: Git attribution. Co-author tags, commit trailer fields, the author field itself. Standard Claude Code practice is to add Co-Authored-By: Claude to commits. Undercover Mode removes this. The commit reads as purely human-authored.

Signal 2: Commit message patterns. AI-generated commit messages have recognizable structure - specific phrasing, consistent formatting, particular scope descriptions. Undercover Mode generates messages designed to match human conventions, not AI defaults.

Signal 3: Branch naming conventions. Most agent workflows create identifiable branches: claude/fix-auth-bug, copilot-refactor-db, sweep/update-deps. These are trivial to filter for. Undercover Mode uses whatever naming convention your repo already uses.

Strip all three and you have nothing to filter on at the metadata layer.
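
To make that concrete, here is a minimal sketch of the kind of metadata check many teams run today. The branch prefixes come from the examples in this post; the function name, the trailer regex, and the placeholder email are assumptions for illustration, not anything taken from the leaked source.

```python
import re

# Hypothetical prefix list, built from the examples in this post.
AGENT_BRANCH_PREFIXES = ("claude/", "copilot-", "sweep/", "ai-fix-")

# Matches a "Co-Authored-By: Claude ..." trailer on its own line.
CO_AUTHOR_TRAILER = re.compile(
    r"^co-authored-by:\s*claude\b", re.IGNORECASE | re.MULTILINE
)


def metadata_signals(commit_message: str, branch: str) -> list[str]:
    """Return the names of the metadata signals present on a change."""
    found = []
    if CO_AUTHOR_TRAILER.search(commit_message):
        found.append("co_author_trailer")
    if branch.lower().startswith(AGENT_BRANCH_PREFIXES):
        found.append("agent_branch_prefix")
    return found


# The email below is a placeholder, not a real address.
attributed = metadata_signals(
    "Fix token refresh\n\nCo-Authored-By: Claude <claude@example.com>",
    "claude/fix-auth-bug",
)
# Same change with the trailer removed and a repo-style branch name.
stripped = metadata_signals("Fix token refresh", "fix/auth-token-refresh")

print(attributed)  # ['co_author_trailer', 'agent_branch_prefix']
print(stripped)    # []
```

The second call is the point: once the trailer and the prefix are gone, a check like this returns nothing, and it has no way to tell "no agent" apart from "agent with stripped attribution."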

Undercover Mode strips all metadata-based signals. Behavioral detection in the diff is what remains.

What Actually Works

The diff doesn't lie. Metadata is strippable. What an agent writes into the code itself is significantly harder to mask.

File-level risk patterns. An agent touching auth code behaves differently than one touching a UI component. The structural changes it makes to session management, token handling, and permission checks follow patterns that don't disappear when you remove the co-author tag. Scoring risk by what files changed and how they changed works regardless of what the commit metadata claims.
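
A rough sketch of path-based scoring might look like the following. The patterns, weights, and function names are hypothetical and would need tuning to your own repo layout; the only thing the sketch is meant to show is that the score is computed from the changed files alone and never reads commit metadata.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns and weights; adjust to your own repo.
RISK_RULES = [
    ("*auth*", 5),
    ("*session*", 4),
    ("*token*", 4),
    ("*permission*", 4),
    ("*.css", 1),
]
DEFAULT_WEIGHT = 2  # assumed weight for paths no rule matches


def file_risk(path: str) -> int:
    """Highest matching rule weight for a path, else the default."""
    lowered = path.lower()
    weights = [w for pattern, w in RISK_RULES if fnmatchcase(lowered, pattern)]
    return max(weights, default=DEFAULT_WEIGHT)


def pr_risk(changed: dict[str, int]) -> int:
    """Score a PR from {path: lines_changed}. Metadata is never consulted."""
    return sum(file_risk(path) * lines for path, lines in changed.items())


print(pr_risk({"src/auth/session.py": 40}))    # 200
print(pr_risk({"web/styles/button.css": 40}))  # 40
```

Forty changed lines in an auth module score five times higher than forty lines of CSS, whatever the author field says.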

Diff entropy analysis. AI-generated code tends to have different entropy characteristics than human-written code: more consistent formatting, more predictable variable naming, more symmetric error handling. These patterns survive Undercover Mode because they're in the substance of the change, not the wrapper around it.
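
One simple way to put a number on this is Shannon entropy over the tokens a diff adds. The sketch below is an assumption about how such a measure could be computed, not a description of any particular product's method; the tokenizer is deliberately crude, and a single entropy value is not a classifier on its own. It only becomes useful when compared against a baseline for the same repo or contributor.

```python
import math
import re
from collections import Counter

# Crude tokenizer: identifiers, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\S")


def added_lines(unified_diff: str) -> list[str]:
    """Lines added by a unified diff, without the leading '+'.

    File headers ("+++ b/path") are skipped. An added line whose own
    content starts with "++ " would also be skipped; acceptable here.
    """
    return [
        line[1:]
        for line in unified_diff.splitlines()
        if line.startswith("+") and not line.startswith("+++ ")
    ]


def token_entropy(lines: list[str]) -> float:
    """Shannon entropy, in bits per token, of the token distribution."""
    counts = Counter(tok for line in lines for tok in TOKEN.findall(line))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


sample_diff = """\
--- a/app/errors.py
+++ b/app/errors.py
@@ -1,2 +1,4 @@
 import logging
+def handle_read_error(err):
+    logging.error(err)
-pass
"""
print(added_lines(sample_diff))
print(token_entropy(added_lines(sample_diff)))
```

Lower values mean the added tokens are more repetitive. Whether a given value is unusual depends entirely on what is normal for that codebase.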

Change scope signals. Agents tend to change more files than humans on equivalent tasks. They refactor things they weren't asked to refactor. They update tests in predictable ways humans often skip. The breadth and coherence of a diff is a signal that attribution stripping doesn't touch.
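
As an illustration, breadth can be summarized from nothing more than the list of changed paths. The test-file naming convention and the field names below are assumptions; swap in whatever your repo actually uses.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class ScopeSignals:
    files_changed: int
    top_level_dirs: int
    test_file_share: float


def is_test_file(path: str) -> bool:
    """Assumed convention: under tests/, or named test_* / *_test."""
    p = PurePosixPath(path)
    return (
        "tests" in p.parts
        or p.name.startswith("test_")
        or p.stem.endswith("_test")
    )


def scope_signals(paths: list[str]) -> ScopeSignals:
    """Summarize how broad a change is from its changed paths alone."""
    if not paths:
        return ScopeSignals(0, 0, 0.0)
    top_dirs = {PurePosixPath(p).parts[0] for p in paths}
    tests = sum(is_test_file(p) for p in paths)
    return ScopeSignals(len(paths), len(top_dirs), tests / len(paths))


signals = scope_signals(
    [
        "src/auth/login.py",
        "src/db/pool.py",
        "tests/test_login.py",
        "docs/auth.md",
    ]
)
print(signals)
# ScopeSignals(files_changed=4, top_level_dirs=3, test_file_share=0.25)
```

None of these numbers means much in isolation. They become a signal when a PR's breadth is far outside what the same kind of task usually produces in your repo.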

Cross-PR trust scoring. A single PR from an unknown author is hard to classify. A pattern of PRs from the same contributor over time builds a behavioral profile. If patterns across PRs match known agent behavior - even with stripped attribution - trust scoring catches what single-PR analysis misses.
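
A minimal sketch of the idea, assuming each PR has already been reduced to a single behavioral score between 0 and 1 by some upstream analysis. The class name, the smoothing factor, and the exponentially weighted average are illustrative choices, not a description of how any specific tool computes trust.

```python
from collections import defaultdict


class ContributorBaseline:
    """Exponentially weighted running score per contributor."""

    def __init__(self, alpha: float = 0.3):
        # Weight on the newest PR. 0.3 is an arbitrary starting point.
        self.alpha = alpha
        self.scores: dict[str, float] = {}
        self.pr_counts: dict[str, int] = defaultdict(int)

    def record(self, contributor: str, pr_score: float) -> float:
        """Fold one PR's behavioral score (0..1) into the profile."""
        previous = self.scores.get(contributor)
        if previous is None:
            updated = pr_score  # first PR seeds the baseline
        else:
            updated = self.alpha * pr_score + (1 - self.alpha) * previous
        self.scores[contributor] = updated
        self.pr_counts[contributor] += 1
        return updated


baseline = ContributorBaseline(alpha=0.5)
baseline.record("dev-a", 0.2)
latest = baseline.record("dev-a", 0.8)
print(round(latest, 3))  # 0.5
```

The history is what carries the weight here: one PR moves the profile only partway, so a sustained shift in a contributor's pattern shows up even when no single PR looks unusual.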

Tip

Behavioral detection in the diff is more durable than metadata detection. Metadata is one config change away from disappearing. Behavioral patterns are embedded in the code itself.

The KAIROS Multiplier

Undercover Mode is a present concern. KAIROS makes it a harder future one.

KAIROS is the background daemon in the leaked source - an agent that runs continuously, monitors your repo, and opens PRs based on conditions you've configured, without waiting for you to invoke it. No terminal session. No obvious trigger. A PR that appears on its own schedule.

When KAIROS ships, you won't have the signal of "someone ran Claude Code right before this PR appeared." The PR arrives from a process that's been running quietly in the background. Undercover Mode plus KAIROS means the PR looks human-initiated, human-attributed, and arrives without a visible trigger.

Behavioral detection at the diff layer isn't optional in that world. It's the only layer left.

Warning

Undercover Mode + KAIROS = a PR that looks human-authored and arrives without a visible trigger. Every governance tool that relies on attribution metadata is blind to this scenario.

What Teams Should Do Right Now

Undercover Mode is in the current codebase. You don't need to wait for KAIROS to act on this.

Audit your detection assumptions. If your process for knowing whether an AI touched a PR relies on co-author tags or branch prefixes, document that dependency explicitly. It's already breakable with a single config change.

Shift to diff-level analysis. Whatever risk assessment process you have - manual or automated - the primary input should be what changed, not who the commit claims authored it. File categories, change scope, entropy patterns in the diff.

Build behavioral baselines now. Trust scoring improves with history. The sooner you start tracking behavioral patterns per contributor, the more signal you have when attribution gets stripped. Start before you need it.

Tip

The teams best positioned for the Undercover Mode world are the ones already running diff-level risk scoring today. Every PR analyzed now is a data point in the behavioral baseline.
