25 March 2026 · 6 min read

Anthropic Built a Three-Person Team. They Called It a Harness.

Anthropic's engineering team published a technical deep-dive on multi-agent harness design. What they actually published was an org chart, and it explains exactly why most AI initiatives produce underwhelming results.

ai-native · engineering · leadership · agents

Anthropic published a post about how they design harnesses for long-running AI applications. I think what they actually published is an org chart. And the fact that they had to write it in code is the most honest explanation I have seen for why most AI initiatives produce disappointing results.

The structure your AI workflow is missing, and what to build instead.

What a Harness Actually Is

A harness is not a model. It is not a prompt. It is the architecture you build around an AI model to structure how it approaches a complex task: what it does, in what order, with what inputs, and how its output gets evaluated before anything ships.

Anthropic's architecture is a three-agent system:

A Planner that expands a minimal input into a detailed specification. What to build, at what quality bar, with what constraints. The Planner sets the contract before any execution begins.

A Generator that implements against that specification, treating it as binding. The Generator does not decide what to build. It builds what was decided.

An Evaluator that tests the output against explicit criteria, grades it, and feeds the result back into the loop. Calibrated to be skeptical, not to rubber-stamp whatever the Generator produced.

The results: 20x more capable output than a single agent working alone. The cost: $200 and six hours versus $9 and twenty minutes for the baseline.

Now look at those three roles again. Planner. Generator. Evaluator.

That is a product manager, an engineer, and a QA lead. That is a product trio: the foundational unit of functional software delivery for decades. Anthropic did not invent it. They had to write it in code because they could not assume it would exist naturally in the workflow.

That is the insight worth sitting with.

You Have Met These People

The product trio is not a new idea. The reason it is hard to sustain is that each function creates friction for the others.

The Planner slows things down at the start. Under delivery pressure, the spec becomes a vague ticket. Acceptance criteria become "make it work." The quality bar becomes whatever the engineer had time for.

The Evaluator creates friction at the end. Honest review against pre-agreed criteria surfaces gaps and sends work back. Most teams have review gates in theory and approval ceremonies in practice.

The Generator, caught between a weak Planner and a weak Evaluator, invents its own definition of done.

This dynamic is not new. What is new is what happens when you add AI to it.

AI amplifies every function it is given. A Generator running on AI with a strong Planner and Evaluator produces dramatically better output: that is the 20x result Anthropic demonstrated. A Generator running on AI with a weak Planner and no real Evaluator produces dramatically more output, at the same or worse quality. That is the experience most teams are having.

The model is the same. The structure is different. The results are incomparable.

The $9 Version Is What Most Enterprises Are Running

Teams buy the access, deploy the tools, give engineers Claude or Copilot or Cursor, and wait for productivity numbers to move. Then they track PRs merged and tickets closed and try to work out whether the investment is paying off.

What they are running is the $9 version.

One agent. One context. No Planner generating the spec before the Generator runs. No Evaluator testing the output before it ships. The model generates, the engineer reviews with variable attention against variable criteria, and the output is whatever falls out of that process.

The results are inconsistent. Some engineers extract real leverage. Others produce more volume without more value. The conclusion most teams reach is that they need a better model, or clearer prompts, or more training.

They need the harness.

The $200 version is not more expensive because Anthropic used more compute. It is more expensive because structure costs something to build and maintain. That cost is the price of quality. Most teams are trying to avoid paying it.

Self-Evaluation Is the Wrong Design

Anthropic's key finding: when a model evaluates its own output, it "confidently praises the work, even when it is obviously mediocre." They had to invest significant effort in adversarial prompting before the Evaluator became genuinely useful rather than performatively useful.

This is not a quirk of language models. It is a structural failure that appears wherever the same mistake is made: collapsing generation and evaluation into the same role.

The engineer who writes the code and reviews the code is running this. The team that ships a feature and assesses whether it solved the user problem using the same people who built it is running this.

When the same function performs both, evaluation degrades toward approval. Every time. In models and in people.

Anthropic's fix was not a more self-critical model. It was a separate Evaluator, with explicit criteria, calibrated to be skeptical. Quality improved immediately when the structure changed.
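The structural difference shows up in the prompts themselves. Anthropic has not published their actual adversarial prompts, so the wording below is entirely illustrative; what matters is the shape: fresh context, criteria fixed before generation, and failure as the default.

```python
# Illustrative contrast between self-evaluation and a separate, skeptical
# Evaluator. Prompt wording is hypothetical, not Anthropic's.

SELF_EVAL = (
    # Same context that produced the work also judges it: the model has
    # all the framing, and every incentive, to approve its own output.
    "Review the code you just wrote and say whether it is ready to ship."
)

SKEPTICAL_EVALUATOR = (
    # Fresh context, explicit criteria agreed before the build, and a
    # default of failure that the output must overcome.
    "You did not write this code. Grade it only against these acceptance "
    "criteria, one by one:\n{criteria}\n"
    "For each criterion, cite the exact lines that satisfy it. "
    "If you cannot cite lines, mark that criterion FAILED.\n"
    "Verdict: PASS only if every criterion is satisfied; otherwise FAIL."
)
```

The second prompt does with wording what good review gates do with structure: it removes the evaluator's stake in the output passing.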

AI has made this principle harder to ignore because it generates output fast enough that the cost of bad evaluation becomes visible quickly.

The Question Is Not Which Model

Two diagnostics. Neither is about the model.

Do you have a Planner function? Not a role with that title, a function. Something that produces clear, agreed specifications before the build starts, including what done looks like, what quality means, and what the acceptance criteria are. If engineers are making those calls themselves at implementation time because the upstream input is insufficient, you are missing the Planner.

Do you have an Evaluator function? Not code review in the ceremonial sense. A deliberate stage in which output is assessed against criteria defined before the build, by someone not invested in the output passing. If review is typically "does this look right to me," you are missing the Evaluator.

If both exist and operate clearly, you have a harness. AI will amplify it.

If either is weak, you are running the single-agent model. AI will amplify that too, in the direction you do not want.

Anthropic spent $200 and six hours proving that structure produces better software than raw generation. They had to build the structure in code because the default, in most workflows, is no structure at all.

The question is not which AI model your team is using. It is whether you have built the harness.


I help engineering teams close the gap between "we use AI tools" and "AI actually changed how we deliver." Book a 20-minute call and I'll tell you where the leverage is.

Working on something similar?

I work with founders and engineering leaders who want to close the gap between what their technology can do and what it's actually delivering.