In March 2026, Anthropic engineer Prithvi Rajasekaran published results from a simple experiment. He asked Claude to build a retro game maker. A single agent produced a broken interface in 20 minutes for $9. Then he wrapped the same model in a multi-agent harness with a planner, a generator, and an evaluator. The harness produced a fully functional application with working physics, sprite animation, sound effects, and AI-assisted content generation. It took 6 hours and cost $200. Same model. Different architecture. The difference was not the AI. It was the harness. Ryan Lopopolo from OpenAI's Codex team put it plainly: "Agents are not hard. The harness is hard." This shift in focus from the model to the system around it has a name now. Harness engineering.

From Prompts to Context to Harnesses

AI engineering has gone through three distinct phases in three years. Each phase expanded the scope of what engineers controlled.

Prompt engineering (2022-2024) focused on crafting individual instructions. Engineers learned few-shot prompting, chain-of-thought reasoning, and persona-based framing. The unit of work was a single interaction. You wrote a good prompt, you got a good response.

Context engineering (2025) expanded the scope from a single prompt to the full information environment around the model. Shopify CEO Tobi Lutke defined it as "the art of providing all the context for the task to be plausibly solvable by the LLM." Andrej Karpathy described it as "the delicate art and science of filling the context window with just the right information for the next step." The insight was that most agent failures were context failures, not model failures. Engineers started building information pipelines, not just instructions.

Harness engineering (2026) goes further. The harness is everything in an AI agent system except the model itself: system prompts, code retrieval mechanisms, feedback loops, evaluation criteria, inter-agent communication protocols, context management strategies, and lifecycle orchestration. Birgitta Bockeler at Thoughtworks, writing on Martin Fowler's blog, defined the harness as the complete outer system that governs how agents receive work, produce output, get evaluated, and iterate. Mitchell Hashimoto, creator of Terraform, crystallized the engineering mindset: "Every time you discover an agent has made a mistake, engineer a solution preventing recurrence."

The progression follows a clear pattern. Prompt engineering controlled the input. Context engineering controlled the information environment. Harness engineering controls the entire execution lifecycle.

Why Models Alone Fail at Complex Tasks

Two persistent problems make harnesses necessary.

The first is context degradation. As conversations grow longer, models lose coherence. Anthropic's Applied AI team documented that as tokens in the context window increase, the model's ability to accurately recall information from that context decreases. Nearly 65% of enterprise AI failures in 2025 were attributed to context drift or memory loss during multi-step reasoning. Models also exhibit what Rajasekaran calls "context anxiety," prematurely wrapping up work as they approach perceived context limits, cutting corners instead of completing the task.

The second is self-evaluation bias. When asked to assess their own work, agents consistently overpraise it. Rajasekaran observed this directly: "When asked to evaluate work they have produced, agents tend to respond by confidently praising the work, even when, to a human observer, the quality is obviously mediocre." This is particularly damaging for subjective tasks like UI design, writing, and architectural decisions where no automated test can objectively verify quality.

These are not bugs that will be fixed in the next model release. They are structural properties of how autoregressive language models work. Harness engineering does not try to fix the model. It builds systems that compensate for these limitations.

The Generator-Evaluator Pattern

The core architectural pattern in harness engineering borrows from generative adversarial networks (GANs). In a GAN, a generator creates outputs and a discriminator evaluates them, and the two improve through adversarial feedback. Harness engineering applies the same principle to AI agents.

Rajasekaran's system uses three agents, each with a distinct role.

The Planner Agent converts brief prompts of 1-4 sentences into comprehensive product specifications. It emphasizes ambitious scope while avoiding granular technical prescriptions that could cascade errors downstream. Over-specification at the planning stage constrains the generator and propagates bad assumptions.

The Generator Agent implements features iteratively. In Rajasekaran's system, it works with React, Vite, FastAPI, and PostgreSQL, building in sprints with self-evaluation before handing off to QA. It uses git for version control and writes progress to structured files.

The Evaluator Agent is the most important component. It uses Playwright to interact with the running application the way an end user would, testing UI elements, API endpoints, and database states against predefined criteria. It does not read the code. It uses the product. This separation of concerns is what makes the feedback honest. The evaluator has no investment in the code it is reviewing because it did not write it.
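The three-agent pipeline can be sketched as a simple iteration loop. This is a minimal illustration under stated assumptions, not Anthropic's actual implementation; the `call_*` functions are hypothetical stand-ins for real model invocations.

```python
# Minimal sketch of the planner -> generator -> evaluator loop.
# The call_* functions are illustrative stand-ins for model calls.

def call_planner(prompt: str) -> str:
    """Expand a brief prompt into a product specification."""
    return f"SPEC for: {prompt}"

def call_generator(spec: str, feedback) -> str:
    """Produce (or revise) an implementation from the spec and prior feedback."""
    return f"BUILD({spec}, feedback={feedback})"

def call_evaluator(build: str):
    """Score the running product against criteria; returns (passed, feedback)."""
    return True, "meets criteria"

def run_harness(prompt: str, max_iterations: int = 5) -> str:
    spec = call_planner(prompt)       # planner runs once, up front
    feedback = None
    build = ""
    for _ in range(max_iterations):
        build = call_generator(spec, feedback)   # generator builds
        passed, feedback = call_evaluator(build) # evaluator pushes back
        if passed:
            return build
    return build  # best effort after the iteration budget is spent
```

The key structural point is that feedback flows from a separate evaluator back into the next generation pass, rather than from the generator grading itself.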

Sprint contracts

Before implementation begins, the generator and evaluator negotiate "done" definitions for each sprint. These sprint contracts bridge high-level specifications to testable requirements. Communication between agents occurs through structured files rather than conversational message passing. This prevents the drift and ambiguity that accumulates in long conversations while maintaining fidelity to the original specification.
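A sprint contract written to disk might look like the following. The schema and field names are illustrative assumptions, not Anthropic's actual format; the point is that both agents read the same persisted "done" definition rather than trusting conversational memory.

```python
# Sketch: a sprint contract negotiated between generator and evaluator,
# persisted as a structured file both agents re-read. Schema is illustrative.
import json, pathlib, tempfile

contract = {
    "sprint": 1,
    "feature": "sprite animation",
    "done_criteria": [
        "sprite sheet loads without errors",
        "animation plays at 12 fps in the preview pane",
        "evaluator can trigger playback via the UI",
    ],
}

workdir = pathlib.Path(tempfile.mkdtemp())
contract_path = workdir / "sprint_01.contract.json"
contract_path.write_text(json.dumps(contract, indent=2))

# Each agent re-reads the file rather than relying on message history.
loaded = json.loads(contract_path.read_text())
```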

Grading criteria for subjective work

For frontend design, Rajasekaran developed four measurable criteria that transformed vague aesthetic judgment into a scoring function.

Design Quality measures coherent aesthetic identity across color, typography, and layout. Originality penalizes generic AI patterns and rewards evidence of deliberate creative choices. Craft evaluates technical execution of hierarchy, spacing, and contrast. Functionality assesses task completion independent of aesthetics.

Weighting design and originality heavily pushed models away from the safe, predictable outputs they default to. In one test, a Dutch art museum website went from a predictable dark theme to a 3D CSS perspective spatial navigation experience by iteration 10. Specific language in the criteria directly influenced output convergence. Including "museum quality" in the rubric changed the aesthetic the generator targeted.
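The rubric-as-scoring-function idea can be made concrete. The weights below are illustrative of the "weight design and originality heavily" principle, not Rajasekaran's actual numbers.

```python
# Sketch: a rubric as a weighted scoring function. Weights are illustrative.

RUBRIC = {
    "design_quality": 0.35,   # coherent aesthetic identity
    "originality":    0.30,   # penalize generic AI patterns
    "craft":          0.20,   # hierarchy, spacing, contrast
    "functionality":  0.15,   # task completion
}

def rubric_score(dimension_scores: dict) -> float:
    """Combine per-dimension judge scores (0-10) into one weighted total."""
    assert set(dimension_scores) == set(RUBRIC)
    return sum(RUBRIC[d] * dimension_scores[d] for d in RUBRIC)

# A generic-but-working build scores below an ambitious one:
safe = rubric_score({"design_quality": 5, "originality": 3, "craft": 7, "functionality": 9})
bold = rubric_score({"design_quality": 8, "originality": 8, "craft": 6, "functionality": 8})
```

With these weights, the safe build scores 5.4 and the bold build 7.6, which is exactly the gradient that pushes a generator away from its defaults.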

Context Management: Resets vs. Compaction

How you manage context across long-running agent tasks determines whether the system degrades or remains coherent. Two primary strategies have emerged, each with trade-offs.

Context resets

Context resets clear the context window entirely and start a fresh agent with structured handoff artifacts. Anthropic found this essential for Claude Sonnet 4.5, which exhibited context anxiety as conversations grew long. A clean slate eliminates degradation at the cost of orchestration overhead and additional tokens.

The approach works well for the generator-evaluator pattern. Each agent receives only what it needs: the planner gets the user prompt, the generator gets the spec and sprint contracts, the evaluator gets the contracts and access to the running application. No agent inherits the accumulated noise of another agent's reasoning.
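The role-scoped handoff described above can be sketched as a mapping from role to artifacts. Artifact and role names are illustrative assumptions.

```python
# Sketch: context resets via role-scoped handoffs. Each agent starts fresh
# with only the artifacts its role needs; names are illustrative.

ARTIFACTS = {
    "user_prompt": "build a retro game maker",
    "spec": "full product specification ...",
    "sprint_contracts": ["sprint_01.contract.json"],
    "app_url": "http://localhost:5173",
}

HANDOFF = {
    "planner":   ["user_prompt"],
    "generator": ["spec", "sprint_contracts"],
    "evaluator": ["sprint_contracts", "app_url"],
}

def fresh_context(role: str) -> dict:
    """Build the minimal starting context for a newly reset agent."""
    return {key: ARTIFACTS[key] for key in HANDOFF[role]}

# The evaluator never inherits the generator's reasoning trace or spec:
assert "spec" not in fresh_context("evaluator")
```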

Context compaction

Context compaction summarizes earlier conversation segments to preserve continuity within a single agent's context. Factory.ai evaluated this across 36,000 real engineering session messages and found that "anchored iterative summarization," where new summaries merge into a persistent state document, outperformed full-reconstruction approaches. The ACON framework (Agent Context Optimization) demonstrated 26-54% memory reduction while preserving 95%+ task accuracy.
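The anchored-summarization idea reduces to: fold each new batch of messages into a persistent state document instead of re-summarizing the whole transcript. A minimal sketch, with a trivial string merge standing in for the LLM summarization call:

```python
# Sketch of anchored iterative summarization: new messages are merged into
# a persistent state document. merge_summary is a stand-in for an LLM call.

def merge_summary(anchor: str, new_messages: list) -> str:
    """Fold recent messages into the persistent summary (LLM call in practice)."""
    return anchor + " | " + "; ".join(new_messages)

state = "Task: build API."   # the anchored state document
history_chunks = [
    ["chose FastAPI", "added /users endpoint"],
    ["tests failing on auth", "fixed token expiry"],
]

for chunk in history_chunks:
    state = merge_summary(state, chunk)  # old anchor + new delta -> new anchor

# The agent's context holds `state`, not the full transcript.
```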

Just-in-time context loading

A third approach avoids the problem entirely. Instead of loading everything into context upfront, agents maintain lightweight identifiers like file paths and database queries, then dynamically load data at runtime using tools. Claude Code uses this pattern to analyze large codebases without ever loading full data objects into context. The model holds a map, not the territory.
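The "map, not the territory" pattern can be sketched as identifiers in context plus a tool that dereferences them on demand. The in-memory `CODEBASE` dict is a stand-in for files on disk; names are illustrative.

```python
# Sketch of just-in-time context loading: the context carries lightweight
# identifiers; a tool dereferences them at runtime. CODEBASE fakes the disk.

CODEBASE = {
    "src/physics.py": "def apply_gravity(v): return v - 9.8",
    "src/sprites.py": "class Sprite: ...",
}

# What actually sits in the model's context: paths, not contents.
context = {"relevant_files": ["src/physics.py", "src/sprites.py"]}

def read_file_tool(path: str) -> str:
    """Tool the agent calls only when a file becomes relevant."""
    return CODEBASE[path]

# The agent dereferences one identifier when it needs the detail:
body = read_file_tool(context["relevant_files"][0])
```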

Structured note-taking

For tasks spanning thousands of steps, agents can write persistent notes outside the context window and retrieve them later. Anthropic demonstrated this with Claude playing Pokemon, where the model maintained precise tallies across thousands of game steps using self-developed maps and strategic notes. The notes function as external memory, surviving context resets and preventing the information loss that context degradation causes.
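External memory of this kind is mechanically simple: persist notes to disk, reload them after a reset. A minimal sketch with an illustrative schema (the Pokemon-flavored fields are hypothetical):

```python
# Sketch: structured note-taking as external memory. Notes persist to disk,
# so they survive a context reset; the schema is illustrative.
import json, pathlib, tempfile

notes_path = pathlib.Path(tempfile.mkdtemp()) / "agent_notes.json"

def save_notes(notes: dict) -> None:
    notes_path.write_text(json.dumps(notes))

def load_notes() -> dict:
    return json.loads(notes_path.read_text()) if notes_path.exists() else {}

# Before a reset, the agent writes down what it must not forget:
save_notes({"badges_won": 3, "next_goal": "reach Celadon City"})

# ... context reset: a fresh agent starts with an empty window ...
recovered = load_notes()
```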

In practice, production harnesses combine these strategies. Context resets between agents, just-in-time loading within agents, and structured notes for cross-session persistence.

Bockeler's Harness Taxonomy

Bockeler proposed a systematic framework for classifying harness components along two dimensions.

The first dimension is timing. Guides are feedforward controls that anticipate agent behavior before it happens, such as system prompts, coding standards, and architectural constraints. Sensors are feedback controls that observe agent output after execution, such as linters, test suites, and evaluator agents.

The second dimension is mechanism. Computational controls are deterministic and fast, running in milliseconds. Linting, type checking, and formatting fall here. Inferential controls use AI to make semantic judgments and are slower but handle ambiguity. Code review agents, design evaluators, and architectural fitness checks fall here.

Bockeler groups these into three regulation categories. The Maintainability Harness governs code quality (style, complexity, readability). The Architecture Fitness Harness enforces architectural characteristics (dependency rules, API contracts, module boundaries). The Behaviour Harness verifies functional correctness (tests, acceptance criteria, end-to-end validation).

This taxonomy matters because it reveals gaps. Most teams implement computational guides (linters, formatters) and computational sensors (unit tests) but lack inferential sensors (evaluator agents that judge quality semantically). The generator-evaluator pattern fills that gap.
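The gap-finding use of the taxonomy can be made mechanical: classify each component on both dimensions and check which quadrants are empty. Component names here are illustrative.

```python
# Sketch: Bockeler's two dimensions as data, used to spot empty quadrants.
# Component labels are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class HarnessComponent:
    name: str
    timing: str      # "guide" (feedforward) or "sensor" (feedback)
    mechanism: str   # "computational" (deterministic) or "inferential" (AI)

team_harness = [
    HarnessComponent("linter",          "guide",  "computational"),
    HarnessComponent("formatter",       "guide",  "computational"),
    HarnessComponent("unit tests",      "sensor", "computational"),
    HarnessComponent("evaluator agent", "sensor", "inferential"),
]

def missing_quadrants(components):
    present = {(c.timing, c.mechanism) for c in components}
    all_quadrants = {(t, m) for t in ("guide", "sensor")
                            for m in ("computational", "inferential")}
    return all_quadrants - present

# Without the evaluator agent, both inferential quadrants are empty:
gap = missing_quadrants(team_harness[:3])
assert gap == {("guide", "inferential"), ("sensor", "inferential")}
```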

Bockeler also introduced the concept of harnessability: the degree to which a codebase supports harness implementation. Strongly-typed languages naturally support type-checking sensors. Clear module boundaries enable architectural constraints. Legacy systems with high technical debt are harder to harness because the implicit rules governing them cannot be expressed as explicit controls.

Production Implementations

Stripe's Minions

Stripe ships 1,300+ pull requests per week generated entirely by AI agents with zero human-written code. The system, called Minions, uses "blueprints" that alternate between deterministic nodes (linting, branch creation, formatting) and agentic nodes (feature implementation, CI failure resolution). Each agent operates in an isolated "devbox" with 10-second spin-up. Agents access 500+ tools via a centralized MCP server called Toolshed.

Minions enforces a "two-strike rule": if an agent fails CI twice, it stops and hands off to a human. This prevents the infinite retry loops that plague simpler agent architectures. Human engineers at Stripe have shifted from writing code to reviewing code. The harness does the production; humans do quality control.
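The two-strike rule is a retry loop with a hard budget and a human escalation path. A minimal sketch of the control flow, with lambdas standing in for the agent and for CI (the specifics of Stripe's implementation are not public in this detail):

```python
# Sketch of a two-strike failure budget: after two CI failures the agent
# stops and escalates to a human instead of retrying forever.

def run_with_failure_budget(attempt_fn, run_ci_fn, max_failures: int = 2) -> str:
    failures = 0
    while failures < max_failures:
        change = attempt_fn(failures)    # agent produces (or repairs) a change
        if run_ci_fn(change):            # deterministic gate
            return f"merged: {change}"
        failures += 1
    return "escalated to human review"   # budget spent: hand off

# Stand-ins: the agent's second attempt passes CI.
outcome = run_with_failure_budget(
    attempt_fn=lambda n: f"patch-v{n + 1}",
    run_ci_fn=lambda change: change == "patch-v2",
)
assert outcome == "merged: patch-v2"
```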

Harvey's legal AI

Harvey, a legal AI company, applied harness engineering across 12 legal tasks including lease review, complaint drafting, tax memos, and due diligence. Baseline agent performance averaged 40.8% success. After implementing harness engineering with autoresearch loops, task-specific rubrics, and iterative evaluation, average success rose to 87.7%. Seven of twelve tasks exceeded 90% accuracy. One reached 100%. Five tasks that initially scored between 2% and 7% became production-viable.

Harvey's approach uses source documents, task instructions, and detailed grading rubrics. After an agent attempts a task, an LLM judge scores it with written feedback on what the agent got right, what it missed, and where its reasoning was incorrect. The agent then incorporates that feedback into subsequent attempts.
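The judged-iteration loop can be sketched as follows. The judge and agent below are trivial stand-ins for model calls, and the lease-review content is hypothetical; the structure mirrors Harvey's described loop of scored attempts with written feedback folded into the next attempt.

```python
# Sketch of judged iteration: an LLM judge returns structured written
# feedback, which the next attempt incorporates. All calls are stand-ins.

def judge(answer: str) -> dict:
    """Score the attempt and explain what was right, missed, or wrong."""
    missed = [] if "assignment clause" in answer else ["analyze the assignment clause"]
    return {"score": 1.0 if not missed else 0.4,
            "right": ["identified the parties"],
            "missed": missed}

def attempt(instructions: str, feedback) -> str:
    answer = "Memo: identified the parties."
    if feedback and feedback["missed"]:
        # Fold the judge's written feedback into the next attempt.
        answer += " Also covers the assignment clause."
    return answer

feedback = None
answer = ""
for _ in range(3):
    answer = attempt("review the lease", feedback)
    feedback = judge(answer)
    if feedback["score"] >= 0.9:
        break
```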

OpenAI Codex

OpenAI's Codex team reported building a production application with over one million lines of code where zero lines were written by human hands, at approximately 10x the speed of manual development. The harness handled task decomposition, code generation, testing, and integration while humans focused on specification and review.

Evaluator Design: Rubrics as Loss Functions

The evaluator is where most harness engineering effort concentrates, because making an evaluator appropriately skeptical is more tractable than making a generator self-critical. Rajasekaran found this through direct experimentation: prompt iteration on standalone evaluators produced better results than attempts to make generators assess their own work.

Rubrics in harness engineering function like loss functions in machine learning. They define what "good" means in measurable terms so the generator has a target to optimize against. Key practices from production implementations include grading each dimension with an isolated LLM-as-judge rather than one judge for all dimensions, calibrating LLM-based judges frequently against expert human judgment, and using specific language in criteria since vague rubrics produce inconsistent scores.

Anthropic's evaluator goes beyond scoring. It uses Playwright MCP to actively navigate the running application, clicking buttons, submitting forms, and verifying database states. It evaluates the product as a user would, not as a code reviewer would. This distinction matters because code that looks correct can produce a broken user experience, and code that looks messy can work perfectly.

How Model Improvements Change Harness Design

Rajasekaran documented this directly. When Claude Opus 4.6 released during his research, he systematically removed harness components to test what was still necessary.

Sprint decomposition was the first to go. Opus 4.6's improved planning and extended context handling eliminated the need for forced sprint structures. The model handled coherent builds of two hours or more without imposed breakpoints.

Evaluator necessity became task-dependent. For work within the model's baseline capability, the evaluator added overhead without proportional benefit. For frontier-difficulty tasks beyond what the model could reliably produce alone, QA feedback remained essential.

The principle Rajasekaran derived: every harness component encodes beliefs about model limitations, and these beliefs become stale as models improve. The recommendation is to stress-test assumptions and simplify methodically. Remove one component at a time. Measure the impact. Keep only what is load-bearing.

This does not mean harness complexity is becoming unnecessary. As models improve, the problems you can throw at them get harder. Stripe is not using AI to write simple scripts. It is shipping production features across a massive codebase. Harvey is not summarizing single documents. It is drafting legal complaints from case law. The design space shifts rather than shrinks. Simpler harnesses handle what used to require complex ones, and complex harnesses tackle what used to be impossible.

Practical Architecture Patterns

Permission pruning

When agents have access to too many tools, execution precision drops. The model spends tokens reasoning about which tool to use instead of using the right one. Restricting sub-agents to the 2-3 tools needed for their immediate sub-task pushes precision toward 100%. A generator agent does not need database admin tools. An evaluator does not need code editing tools. Scope the permissions to the role.
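Permission pruning reduces to a role-to-tools mapping enforced at agent startup. Tool and role names below are illustrative assumptions:

```python
# Sketch of permission pruning: each role sees only the tools it needs.
# Tool names and roles are illustrative.

ALL_TOOLS = {
    "edit_file": "modify source files",
    "run_tests": "execute the test suite",
    "browse_app": "drive the running app",
    "query_db": "inspect database state",
    "drop_table": "destructive database admin",
}

ROLE_TOOLS = {
    "generator": ["edit_file", "run_tests"],
    "evaluator": ["browse_app", "query_db"],
}

def tools_for(role: str) -> dict:
    """Return the pruned toolset an agent of this role may call."""
    return {name: ALL_TOOLS[name] for name in ROLE_TOOLS[role]}

# The evaluator cannot edit code; no role gets destructive admin tools:
assert "edit_file" not in tools_for("evaluator")
assert all("drop_table" not in tools_for(r) for r in ROLE_TOOLS)
```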

File-based inter-agent communication

Agents communicate through structured files rather than conversational messages. Sprint contracts, progress logs, and evaluation reports are written to disk. This creates an auditable trail, survives context resets, and prevents the semantic drift that accumulates in long message chains.

Deterministic checkpoints

Alternate between agentic nodes (where AI makes decisions) and deterministic nodes (where code enforces rules). Stripe's blueprints use this pattern. After an agent generates code, a deterministic node runs the linter, formatter, and type checker before the next agentic node begins. This catches errors cheaply before they propagate.
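The alternation can be sketched as a deterministic gate between agentic steps. The checks below are trivial stand-ins for a real linter and formatter; in production these would be tools like ruff or a type checker.

```python
# Sketch: alternating agentic and deterministic nodes. The checks are
# trivial stand-ins for a real linter/formatter.

def agentic_generate() -> str:
    """Stand-in for a model producing code."""
    return "def score(x):\n    return x * 2\n"

def deterministic_checks(code: str) -> list:
    """Cheap, fast rule enforcement between agentic steps."""
    problems = []
    if "\t" in code:
        problems.append("lint: tabs are forbidden")
    if not code.endswith("\n"):
        problems.append("format: missing trailing newline")
    return problems

code = agentic_generate()               # agentic node
problems = deterministic_checks(code)   # deterministic node gates the handoff
assert problems == []                   # only clean code reaches the next agent
```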

Failure budgets

Set explicit limits on how many times an agent can retry. Stripe's two-strike rule on CI is one example. Without failure budgets, agents can enter infinite loops, burning tokens and time while producing increasingly degraded output. A failure budget forces the system to escalate to a human when automated remediation is not working.

Scaling multi-agent systems

Research from January 2026 formalized that star topologies, where one coordinator manages all agents, saturate at a finite number of agents determined by the coordinator's context window. Hierarchical trees bypass this constraint by enforcing context limits locally at each aggregation node. Compaction at each level becomes a mathematical requirement for scaling beyond a handful of agents.
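The saturation argument is easy to work through with back-of-envelope numbers (the token figures below are illustrative, not from the paper): a star coordinator can read at most window-size divided by per-child summary size, while a tree caps fan-in locally and grows leaf agents geometrically with depth.

```python
# Sketch of the scaling argument: a star coordinator saturates when child
# summaries fill its context window; a tree caps fan-in at every node.
# All numbers are illustrative.

CONTEXT_WINDOW = 200_000   # tokens available to any one node
SUMMARY_TOKENS = 8_000     # compacted report each child sends upward

# Star topology: one coordinator reads every agent's summary directly.
star_max_agents = CONTEXT_WINDOW // SUMMARY_TOKENS   # hard ceiling

# Tree topology: each node aggregates at most `fan_in` children, so leaf
# agents grow geometrically with depth instead of hitting a ceiling.
fan_in = star_max_agents

def tree_leaf_agents(depth: int) -> int:
    return fan_in ** depth

assert star_max_agents == 25
assert tree_leaf_agents(2) == 625   # two levels already beat the star limit
```

This is why compaction at each aggregation level is a requirement, not an optimization: without it, each parent's window fills just as the star coordinator's does.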

Limitations and Open Problems

Harness engineering is not free. The retro game maker cost $200 with a harness versus $9 without one. Multi-agent systems add orchestration complexity, token overhead, and latency. Every additional agent multiplies the surface area for failure.

Harness portability is an unsolved problem. The Tsinghua University and HIT research group found that harness logic is typically scattered across controller code, hidden framework defaults, tool adapters, verifier scripts, and runtime-specific assumptions. There is no standard way to express a harness as a portable, reusable artifact. Their proposed Natural-Language Agent Harness (NLAH) specification attempts to address this, formalizing six components: contracts, roles, stage structure, adapters, state semantics, and failure taxonomy.

Verification has diminishing returns. The same Tsinghua paper found that additional verification layers sometimes created acceptance misalignment. In SWE-bench experiments, adding a dedicated verifier module actually reduced performance by 0.8%. More evaluation does not always mean better output.

Natural language specifications are imprecise. A harness described in plain English is inherently less exact than one expressed in code. Some mechanisms cannot be faithfully recovered from text alone. The tension between the flexibility of natural language descriptions and the precision of programmatic enforcement remains unresolved.

Human oversight remains essential. Every production system documented still requires human review. Stripe's Minions still needs engineers to review PRs. Harvey's system still needs lawyers to establish tasks and rubrics. The harness reduces human effort. It does not eliminate it.

What This Means for Enterprise AI Teams

If your organization is deploying AI agents for anything beyond single-turn interactions, harness engineering is where your engineering effort should concentrate.

Start with evaluation. Build an evaluator for your most important agent workflow. Define grading criteria that are specific enough to produce consistent scores. Test the evaluator against human judgment. Make it appropriately skeptical. This single step will improve output quality more than any amount of prompt tuning on the generator.

Implement context management early. Choose a strategy based on your task duration. Short tasks (under 30 minutes of agent work) can use compaction. Long tasks need context resets with structured handoffs. Very long tasks need external note-taking systems. Do not wait for context degradation to become a problem. It will.

Add deterministic checkpoints between agentic steps. Linting, type checking, and automated tests are cheap and fast. Run them after every code generation step. This catches errors before they compound.

Set failure budgets. Define how many retries each agent gets before escalating to a human. Two is a good default. One means you are not giving the agent a chance to self-correct. Five means you are burning tokens on increasingly degraded attempts.

Reassess your harness as models improve. What needed explicit structure six months ago may now be handled by the model natively. Test by removing components one at a time. Keep what is load-bearing. Discard what is not. The goal is the simplest harness that produces the required output quality.

"Every harness component encodes beliefs about model limitations. Those beliefs become stale as models improve. Find the simplest solution possible, and only increase complexity when needed."

Prithvi Rajasekaran, Anthropic Labs

Key Takeaways

  • Harness engineering is the successor to prompt engineering and context engineering. It encompasses everything around the model: orchestration, evaluation, context management, and inter-agent communication
  • The generator-evaluator pattern, inspired by GANs, separates production from assessment. The evaluator must be a different agent than the generator because models cannot reliably evaluate their own work
  • Stripe ships 1,300+ AI-generated PRs per week using a harness with deterministic checkpoints, failure budgets, and human review gates. Harvey improved legal AI accuracy from 40.8% to 87.7% with task-specific harness engineering
  • Context management is the technical foundation. Context resets between agents, just-in-time loading within agents, and structured notes for cross-session persistence form the standard approach
  • Rubrics function as loss functions. Specific, weighted grading criteria transform vague quality judgments into scoring functions that generators can optimize against
  • Harness complexity should track model capability. As models improve, remove components that are no longer load-bearing. The goal is the simplest harness that produces the required quality
  • Human oversight remains essential. Every production harness documented in 2026 retains human review as a critical safeguard
