
Harness Engineering: Why 88% of AI Agents Fail

Written by Jay Kim

Nearly 9 out of 10 AI agent projects die before production — and the failure rate hasn't improved as models have gotten smarter. The bottleneck isn't intelligence. It's the absence of a production-grade harness: the constraints, feedback loops, and observability systems that wrap around an agent to make it reliable. This post breaks down harness engineering — the emerging discipline that separates the 12% who ship from the 88% who don't.

The Number Nobody Wants to Talk About

While nearly all enterprises are exploring AI agents, only about 11% have actually deployed them in production — roughly an 88% failure rate from pilot to production.[3] That is not a failure to build something that works in a demo; it is a failure to get the demo into production.[3]

And here's the part that should alarm every engineering leader: 88% of AI agent projects never reach production, and that number has not improved as models have gotten more capable.[1]

Read that again. Models are getting better every quarter. GPT-5, Claude 4, Gemini 2.5 — all meaningfully more intelligent than their predecessors. Yet the failure rate hasn't budged. The bottleneck is not the model. It is the absence of a production-grade harness.[1]

This post explains what that means, where the term came from, and — most importantly — how the 12% who succeed are actually building their agent systems differently.


What Exactly Is Harness Engineering?

Harness engineering is the discipline of designing the systems, constraints, and feedback loops that wrap around AI agents to make them reliable in production.[6]


The metaphor is borrowed from horse tack — reins, saddle, bit — the complete set of equipment for channeling a powerful but unpredictable animal in the right direction. The choice is deliberate: the horse is the AI model — powerful, fast, but with no idea where to go on its own.[3]

A harness is not the agent itself. It is the complete infrastructure that governs how the agent operates: the tools it can access, the guardrails that keep it safe, the feedback loops that help it self-correct, and the observability layer that lets humans monitor its behavior.[6]

Or as the now-canonical formula puts it: Agent = Model + Harness.


Agent = Model + Harness, popularized by Mitchell Hashimoto, has become the foundational AI agent formula.[1] The model reasons. The harness does everything else.[5]


Where Harness Engineering Came From

The term didn't emerge from an academic paper or an industry consortium. It emerged from pain.

Mitchell Hashimoto — co-founder of HashiCorp and creator of Ghostty — coined the phrase in a blog post published on February 5, describing it as "the idea that anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."[3] OpenAI's Ryan Lopopolo followed on February 11 with a longer writeup about building a production application entirely with AI agents.[4]

What makes the Hashimoto–OpenAI convergence interesting isn't just the shared terminology. It's that they arrived from opposite directions. Hashimoto is a self-described skeptic who forced himself through the painful early phases of adoption, doing his work twice — once manually, once with an agent — until he developed intuition for what agents were good at. OpenAI's team started with the radical constraint that humans would never touch the code, and then figured out what infrastructure was needed to make that work. Both ended up in the same place: the engineer's job is to build the harness, not to write the code.[4]

Then came the taxonomy. The framing was extended on martinfowler.com, where Thoughtworks engineer Birgitta Böckeler published a rigorous guide introducing the guides-and-sensors taxonomy: a vocabulary precise enough that it became the canonical way practitioners talk about harness components today.[1]

The term spread rapidly because it gave teams something "prompt engineering" never could: a name for everything outside the model.[1]


Why the Model Isn't the Problem

This is the single most counterintuitive insight in AI engineering right now: the model is no longer the bottleneck.

In OpenAI's Codex experiment, GPT-4 was the reasoning engine at the start and at the end. What changed, and what produced that extraordinary throughput, was the harness. The implication was immediate: model quality had become table stakes. The harness was the differentiator.[1]

The evidence now spans multiple teams and benchmarks. The clearest case for harness primacy comes from SWE-bench, the standard benchmark for coding agents. The same model scores dramatically differently depending on the scaffold wrapping it — gaps of 20–30 percentage points between harness implementations on identical underlying models. SWE-bench is not just testing the model; it is simultaneously evaluating the harness. Teams treating model choice as the primary reliability variable are measuring the wrong thing.[8]

LangChain demonstrated the power of feedback loops when their coding agent jumped from 52.8% to 66.5% on Terminal Bench 2.0 by only changing the harness, not the model. Adding a self-verification loop and loop detection transformed a middling performer into a top-five result.[6]

The same model with a bad harness produces poor results. The same model with a great harness produces incredible results.[5]


The Anatomy of a Harness: Guides and Sensors

Birgitta Böckeler's framework on Martin Fowler's site is the clearest mental model the industry has produced. Every harness component falls into one of two categories.


Guides (Feedforward Controls)

Guides anticipate the agent's behavior and steer it before it acts, increasing the probability that the agent produces good results on the first attempt.[1]

In practice, guides include AGENTS.md or CLAUDE.md files that document project conventions, system prompts that define the agent's role and constraints, architectural constraint documents, and coding conventions the agent must follow.

The key insight about guides is that they are cheap to implement and high-impact. Writing a good CLAUDE.md file takes 30 minutes. The improvement in agent output quality can be dramatic and immediate. This is why Anthropic recommends starting your harness engineering journey with guides.[5]

Mitchell Hashimoto's personal workflow demonstrates this perfectly. For simple problems — the agent repeatedly running the wrong commands or finding the wrong APIs — he updates the AGENTS.md (or equivalent). His Ghostty AGENTS.md is the canonical example: each line in the file is based on an observed bad agent behavior, and together those lines almost completely resolved them.[3]

Sensors (Feedback Controls)

Sensors are feedback controls — they observe and validate the agent's behavior after it acts. Evals, validation loops, and output parsers are all sensors.[1]

While guides try to prevent errors, sensors accept that errors will happen and focus on detecting them quickly. The faster you detect an error, the cheaper it is to fix.[5]

The guides-and-sensors split maps directly to control systems theory, and Böckeler further subdivides each into computational and inferential variants. Computational guides increase the probability of good results with deterministic tooling, and computational sensors are cheap and fast enough to run on every change, alongside the agent. Inferential controls are more expensive and non-deterministic, but they allow for richer guidance and additional semantic judgment. Despite that non-determinism, inferential sensors can markedly increase trust when backed by a strong model.[9]
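
As a toy illustration (function names and the policy question are hypothetical, not from Böckeler's guide), the two sensor types differ like this:

```python
import re

# Computational sensor: deterministic and cheap -- safe to run on every change.
def todo_lint(src: str) -> list[str]:
    return [f"line {i}: unresolved TODO"
            for i, line in enumerate(src.splitlines(), 1)
            if re.search(r"\bTODO\b", line)]

# Inferential sensor: delegates judgment to a model. Non-deterministic and
# costly, so reserved for semantic questions that lint cannot answer.
def semantic_review(src: str, ask_model) -> str:
    return ask_model(f"Does this change follow our error-handling policy?\n{src}")

print(todo_lint("x = 1\n# TODO: handle timeout\n"))  # ['line 2: unresolved TODO']
```

The computational check gates every commit; the inferential check runs where its cost is justified, such as at PR review time.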

The Steering Loop

Together, guides and sensors create what practitioners call the steering loop — the continuous cycle of running the agent, observing results, and improving the harness.


1. Issue occurs: The agent produces a sub-par solution or violates a pattern.
2. Harness gap analysis: The human identifies why the harness failed to prevent or detect this.
3. Regulation improvement: The human updates the guides (feedforward) or sensors (feedback).
4. Verification: The agent reruns the task, now governed by the improved harness.

This steering loop ensures that the engineering team's collective intelligence is externalized into the system, making the codebase increasingly agent-friendly over time.[10]


How the 88% Actually Fail

The failure patterns are remarkably consistent. Analysis of hundreds of AI agent initiatives — cross-referenced against industry research from Gartner, McKinsey, and primary case study data — finds that seven failure patterns account for 94% of all pre-production stalls. These patterns are not random: they are predictable, identifiable early, and largely preventable.[5]

Here are the most critical ones, reframed through the lens of harness engineering.


1. No Guides: The Agent Is Guessing

Red Hat's first attempts went like this: paste a Jira feature description into an AI coding tool, tell it to "implement this," and hope for the best. The results were unpredictable. Sometimes the AI would succeed. Other times it would hallucinate file paths, invent APIs that didn't exist, or modify the wrong module entirely. The failure mode was always the same: the AI was guessing about the codebase instead of looking at it.[7]

This is the single most common failure: teams deploy agents without guides, expecting the model to intuit project conventions from thin air. Red Hat's enterprise experience confirmed the fix: the AI writes better code when you design the environment it works in. The secret is structured context rather than free-form tickets.[7]

2. No Sensors: Errors Compound Silently

Compound reliability is unforgiving. A 10-step agent process where each step succeeds 99% of the time still fails roughly one in ten complete runs — a ~90.4% end-to-end success rate.[2]
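
The arithmetic is worth checking directly:

```python
# End-to-end success rate of a multi-step agent process where each
# step succeeds independently with the same probability.
def end_to_end_success(per_step: float, steps: int) -> float:
    return per_step ** steps

# 10 steps at 99% per-step reliability: ~90.4% of runs finish clean,
# i.e. roughly one in ten complete runs fails.
print(round(end_to_end_success(0.99, 10), 3))  # 0.904
# At 50 steps the same per-step reliability drops to ~60.5%.
print(round(end_to_end_success(0.99, 50), 3))  # 0.605
```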

Without sensors, agents don't just make mistakes — they make mistakes on top of mistakes. The Anthropic engineering team's work on harness design for long-running applications identifies an important pattern: context window degradation (sometimes called "context rot") is a sensor problem. Without sensors that monitor context quality over time, agents accumulate stale, noisy information in their context window, and their outputs degrade. The fix is not a better model. It is a sensor that detects when context quality has dropped below the threshold for reliable operation.[1]

3. No Governance Before Demo Day

Agent systems that reach the demo stage without governance controls almost never get them added later. The architecture decisions that make demos fast — broad tool access, no approval gates, minimal logging — become technical debt that blocks production deployment.[6]

Governance is not a feature you add at the end. It is an architectural decision that shapes every component from the beginning.[6]

4. Data Quality Is the Hidden Killer

27% of AI agent failures trace to data quality, not harness architecture or model limitations.[1] A 2026 arXiv study found that LLM-generated context files caused performance drops in 5 of 8 tested settings when documentation already existed, because the guide content duplicated or contradicted existing docs. Context quality, not context presence, is the variable.[2]

5. Over-Engineering the Harness

There's a counterbalancing risk that deserves attention. Developers must build harnesses that allow them to rip out the "smart" logic they wrote yesterday. If you over-engineer the control flow, the next model update will break your system.[8]

Capabilities that required complex, hand-coded pipelines in 2024 are now handled by a single context-window prompt in 2026.[8] The harness must be adaptive, not brittle.


Real-World Proof: Who's Doing It Right

OpenAI Codex: 1M Lines, Zero Human-Written Code

OpenAI's Codex team built a production application with over 1 million lines of code where zero lines were written by human hands. The engineers didn't write code. They designed the system that let AI write code reliably. That system — the constraints, feedback loops, documentation, linters, and lifecycle management — is what the industry now calls a harness.[3]

The primary job of their engineering team became enabling the agents to do useful work. In practice, this meant working depth-first: breaking down larger goals into smaller building blocks, prompting the agent to construct those blocks, and using them to unlock more complex tasks. When something failed, the fix was almost never "try harder."[2]

When code drift became a problem, they started encoding what they call "golden principles" directly into the repository and built a recurring cleanup process. These principles are opinionated, mechanical rules that keep the codebase legible and consistent for future agent runs.[2]

Microsoft Azure SRE Agent: 40 Hours → 3 Minutes

Microsoft's Azure SRE agent has handled 35,000+ production incidents autonomously, reducing Azure App Service time-to-mitigation from 40.5 hours to 3 minutes. The case study documents the integration of MCP tools, telemetry, code repositories, and incident management platforms into a single agent harness with human-in-the-loop governance.[6]

Microsoft shifted from 100+ bespoke tools and a prescriptive prompt to a filesystem-based context engineering system for their SRE agent. Key finding: exposing everything (source code, runbooks, query schemas, past investigation notes) as files and letting the agent use read_file, grep, find, and shell outperformed specialized tooling — "Intent Met" score rose from 45% to 75% on novel incidents.[6]
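
A minimal sketch of that filesystem-first pattern — the tool names mirror the generic primitives Microsoft lists, but the files and helper code here are purely illustrative:

```python
import subprocess
from pathlib import Path

# Instead of one bespoke tool per data source, everything (runbooks,
# schemas, past investigation notes) is laid out as files and the agent
# navigates with generic primitives.
def read_file(root: Path, rel: str) -> str:
    return (root / rel).read_text()

def grep(root: Path, pattern: str) -> list[str]:
    # Delegate to system grep, matching how a shell-capable agent works.
    out = subprocess.run(["grep", "-rn", pattern, "."],
                         cwd=root, capture_output=True, text=True)
    return out.stdout.splitlines()

root = Path("incident-context")
root.mkdir(exist_ok=True)
(root / "runbooks.md").write_text("If 502s spike, check the upstream pool.\n")
print(grep(root, "502"))  # one hit pointing into runbooks.md
```

The design choice is that generic navigation generalizes to novel incidents, where a fixed catalog of specialized tools does not.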

Anthropic: Multi-Agent Harness for Long-Running Tasks

Anthropic tested this with a three-agent harness — Planner, Generator, Evaluator — against a solo agent on the task of building a 2D retro game engine.[10] The Planner expands a short prompt into a full product spec, deliberately leaving implementation details unspecified — early over-specification cascades into downstream errors. The Generator implements features in sprints, but before writing code, it signs a sprint contract with the Evaluator: a shared definition of "done." The Evaluator uses Playwright to click through the application like a real user, testing UI, API, and database behavior. If anything fails, the sprint fails.[10]

The solo agent produced a game that technically launched, but entity-to-runtime connections were broken at the code level — discoverable only by reading the source. The three-agent harness produced a superior result.[10]


Harness Engineering vs. Everything Else


If harness engineering sounds like it overlaps with other disciplines, that's because it deliberately subsumes parts of them. Here's how they relate.

Prompt Engineering optimizes the quality of a single exchange — phrasing, structure, examples. One conversation, one output.[10]

Context Engineering manages how much information the model can see at once — which documents to retrieve, how to compress history, what fits in the context window and what gets dropped.[10]

Harness Engineering builds the world the agent operates in. Tools, knowledge sources, validation logic, architectural constraints — everything that determines whether an agent can run reliably across hundreds of decisions without human supervision.[10]

Context engineering focuses on what information goes into the context window, specifically the content of what the model sees. Harness engineering focuses on how the entire agent environment operates: tools, constraints, feedback loops, memory, and lifecycle management. Context engineering is a component inside the harness; the harness contains and orchestrates the context engineering layer alongside all other agent subsystems.[7]


How to Start: A Practical First Harness

You don't need a million-line codebase to benefit from this, and when applying harness engineering for the first time there is no need to build every mechanism at once. The following starting points — especially the first three — usually produce the fastest practical return.[8]

Step 1: Create Your Guide File. Create CLAUDE.md or AGENTS.md at the project root and include the project structure, build commands, and coding rules. Start small, then add rules when the agent repeatedly fails in the same place. This is the same pattern Mitchell Hashimoto described: every time the agent makes a mistake, add the instruction that prevents that mistake from repeating.[8]
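
A starter guide file might look something like this (the layout, commands, and rules are illustrative, not taken from any real project):

```markdown
# AGENTS.md

## Project layout
- `src/` — application code; `tests/` — pytest suite. Never edit `gen/`.

## Commands
- Build: `make build` · Test: `make test` · Lint: `make lint`

## Rules (one per observed agent mistake)
- Use the logger in `src/log.py`; never call `print()` in library code.
- Run `make test` before declaring any task complete.
```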

Step 2: Wire Up Computational Sensors. Add pre-commit hooks that run linters and type checkers on every change. Move quality checks left, distributing them across pre-commit (fast linters), PR integration (type checking, architecture fitness functions), and continuous monitoring (drift detection).[8]
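
One sketch of that shift-left distribution, with placeholder `echo` commands standing in for real linters and type checkers:

```python
import subprocess

# Stage names follow the text; the commands are placeholders --
# substitute your project's actual linter and type checker.
STAGES: dict[str, list[list[str]]] = {
    "pre-commit":     [["echo", "fast-lint"]],        # seconds: every commit
    "pr-integration": [["echo", "type-check"],        # minutes: every PR
                       ["echo", "fitness-functions"]],
}

def failing_commands(stage: str) -> list[str]:
    """Run a stage's commands; return the ones that exited non-zero."""
    return [
        " ".join(cmd)
        for cmd in STAGES[stage]
        if subprocess.run(cmd, capture_output=True).returncode != 0
    ]

print(failing_commands("pre-commit"))  # [] -- all placeholder checks pass
```

The point of the split is cost: the cheapest sensors run most often, so the agent gets feedback within seconds rather than at PR time.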

Step 3: Close the Feedback Loop. At minimum, the agent should run tests after making changes and attempt to fix failures before declaring success. A write-test-fix cycle is the simplest effective feedback loop.[6]
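
A sketch of that loop, with stubs standing in for the real test command and model call:

```python
# `run_tests` and `ask_agent_to_fix` are stand-ins: in a real harness the
# first shells out to your test command and the second is a model call
# that receives the failure output as context.
def write_test_fix(run_tests, ask_agent_to_fix, max_attempts: int = 3) -> bool:
    for _ in range(max_attempts):
        passed, output = run_tests()
        if passed:
            return True           # only now may the agent declare success
        ask_agent_to_fix(output)  # feed the failure back, try again
    passed, _ = run_tests()
    return passed

# Demo: the stub "agent" fixes the bug after seeing one failure.
state = {"bug": True}
def fake_tests():
    return (not state["bug"], "AssertionError: expected 200, got 500")
def fake_fix(failure_output):
    state["bug"] = False
print(write_test_fix(fake_tests, fake_fix))  # True
```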

Step 4: Add Guardrails. Restrict file access to relevant directories. Require linting before commits.[6] Define what the agent should never do just as clearly as what it should do.
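
For example, a deny-by-default write guardrail might look like this (directory names are illustrative):

```python
from pathlib import Path

# Before executing an agent's file edit, check the target is inside an
# allow-listed directory. Deny by default; escapes via ".." are caught
# by resolving the path first.
ALLOWED = [Path("src"), Path("tests")]

def may_write(target: str) -> bool:
    resolved = Path.cwd().joinpath(target).resolve()
    return any(
        resolved.is_relative_to(Path.cwd().joinpath(d).resolve())
        for d in ALLOWED
    )

print(may_write("src/app.py"))      # True
print(may_write("../secrets/key"))  # False: escapes the sandbox
```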

Step 5: Observe Everything. You cannot improve what you cannot see. Observability in harness engineering means logging every agent action, tracking token usage and costs, recording decision points, and surfacing anomalies. This is what separates a research prototype from a production system.[6]
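
A minimal sketch of such a trace (field names are illustrative):

```python
import time

# Every agent action becomes a structured record, so cost, latency, and
# failure patterns can be queried after the fact.
class AgentTrace:
    def __init__(self):
        self.records = []

    def log(self, action: str, tokens: int, ok: bool, **detail):
        self.records.append({
            "ts": time.time(), "action": action,
            "tokens": tokens, "ok": ok, "detail": detail,
        })

    def total_tokens(self) -> int:
        return sum(r["tokens"] for r in self.records)

    def failures(self) -> list[dict]:
        return [r for r in self.records if not r["ok"]]

trace = AgentTrace()
trace.log("read_file", tokens=820, ok=True, path="src/app.py")
trace.log("run_tests", tokens=64, ok=False, failing="test_login")
print(trace.total_tokens(), len(trace.failures()))  # 884 1
```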


The Future: Harness Engineers, Not Software Engineers


Among the teams profiled above, the guiding principle was clear: humans design environments, specify intent, and build feedback loops; agents write the code. The engineer's job shifted from implementation to system design.[6]

Harness engineering is emerging as a distinct role, especially at companies building agent-powered products. The skillset combines traditional software engineering with AI-specific knowledge.[6]

What does a harness engineer actually do? They design the environments where agents operate. They write configuration files (AGENTS.md, CLAUDE.md) that give agents the context they need. They build and tune feedback loops. They analyze agent logs to find failure patterns. They define and enforce architectural constraints. They decide where human checkpoints belong.[6]

Building this outer harness is emerging as an ongoing engineering practice, not a one-time configuration.[1] And harness engineering compounds: every improvement applies to every future agent run.[6]


Conclusion

The 88% failure-before-production statistic is not an anomaly. It is a structural feature of how organizations currently approach AI agent development.[5]

The organizations stuck in that 88% are doing the same thing: building impressive demos, picking the latest model, and hoping the intelligence alone will carry them to production. It won't.

Harness engineering is the answer to a simple question: how do you make AI agents work reliably enough to trust in production? The answer is not better models.[6]

The principles are already well-established: constrain what agents can do, inform them about what they should do, verify their work, correct their mistakes, and keep humans in the loop at high-stakes decision points.[6]


The 88% treat production as something that happens after the pilot succeeds. The 12% treat production as the goal that shapes every decision from day one.[3]

The model is the engine. The harness is the car. Stop tuning the engine. Start building the car.


Frequently Asked Questions

What is harness engineering?

Harness engineering is the discipline of designing the systems, constraints, and feedback loops that wrap around AI agents to make them reliable in production. The term borrows from horse tack — reins, saddle, bit — the complete set of equipment for channeling a powerful but unpredictable animal in the right direction. In practice, a harness includes everything outside the model itself: tool access policies, guardrails, validation sensors, documentation guides (like CLAUDE.md or AGENTS.md files), observability layers, and human-in-the-loop checkpoints. The canonical formula is Agent = Model + Harness — the model reasons, the harness does everything else.

Why do 88% of AI agents fail before reaching production?

The 88% failure rate reflects a structural problem, not a talent problem. Most teams focus on model selection and prompt optimization while neglecting the production infrastructure — governance, feedback loops, sensors, and context management — that agents need to operate reliably at scale. Seven predictable failure patterns account for the vast majority of stalls: missing guides (no project context for the agent), missing sensors (errors compound silently), absent governance, poor data quality, over-engineered control flows, underestimated integration complexity, and misaligned success metrics. The bottleneck is not model intelligence; it is the absence of a production-grade harness.

Who coined the term "harness engineering"?

Mitchell Hashimoto, co-founder of HashiCorp and creator of Ghostty, coined the phrase in a blog post published on February 5, 2025. He described it as "the idea that anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again." OpenAI's Ryan Lopopolo independently converged on the same concept six days later. Martin Fowler's site then extended the framework through Birgitta Böckeler's guides-and-sensors taxonomy, which became the canonical vocabulary for harness components.

What is the difference between harness engineering and prompt engineering?

Prompt engineering optimizes the quality of a single exchange — phrasing, structure, and examples within one conversation. Context engineering manages what information the model can see — retrieval, compression, and context window management. Harness engineering builds the entire world the agent operates in: tools, knowledge sources, validation logic, architectural constraints, feedback loops, memory, and lifecycle management. Prompt engineering and context engineering are components inside the harness. The harness contains and orchestrates them alongside all other agent subsystems.

What are guides and sensors in harness engineering?

Guides and sensors are the two fundamental categories of harness components, introduced by Birgitta Böckeler on martinfowler.com. Guides are feedforward controls — they steer the agent before it acts. Examples include AGENTS.md files, system prompts, architectural constraint documents, and coding conventions. Sensors are feedback controls — they observe and validate the agent's behavior after it acts. Examples include linters, type checkers, test suites, output parsers, and evaluation loops. Each category further divides into computational (deterministic, cheap, fast) and inferential (LLM-based, richer but non-deterministic) variants.

How do I start with harness engineering?

Start with three high-impact, low-effort steps. First, create a guide file (CLAUDE.md or AGENTS.md) at your project root documenting project structure, build commands, and coding rules — add a new rule every time the agent repeats a mistake. Second, wire up computational sensors like pre-commit hooks running linters and type checkers on every change. Third, close the feedback loop by ensuring the agent runs tests after making changes and attempts to fix failures before declaring success. You don't need to build every mechanism at once; these three starting points typically produce the fastest practical return.

Is harness engineering only for coding agents?

No. While the most visible early examples come from coding agents (OpenAI Codex, SWE-bench, Claude Code), the principles apply to any AI agent operating autonomously. Microsoft's Azure SRE agent uses harness engineering for incident response, handling 35,000+ production incidents with time-to-mitigation dropping from 40.5 hours to 3 minutes. The guides-and-sensors framework applies equally to customer service agents, data pipeline agents, research agents, and any system where an AI model takes multi-step actions with real-world consequences.

What is an AGENTS.md or CLAUDE.md file?

These are guide files placed at the root of a repository (or in specific subdirectories) that give AI agents the context they need to work effectively within a codebase. They typically include project structure, build and test commands, coding conventions, architectural patterns to follow, and anti-patterns to avoid. Mitchell Hashimoto's AGENTS.md file for Ghostty is a well-known example — each line in the file corresponds to a specific bad agent behavior that the guide almost completely resolved. Writing a good guide file takes about 30 minutes and often produces immediate, dramatic improvements in agent output quality.

Does a better model eliminate the need for harness engineering?

No — and this is the most counterintuitive insight in AI engineering today. SWE-bench results show the same model scoring 20–30 percentage points differently depending on the harness wrapping it. LangChain's coding agent jumped from 52.8% to 66.5% on Terminal Bench 2.0 by changing only the harness, not the model. Model quality is now table stakes. The harness is the differentiator. As models improve, some harness components can be simplified, but the need for guides, sensors, governance, and feedback loops does not disappear — it shifts.

What is the "steering loop" in harness engineering?

The steering loop is the continuous improvement cycle at the heart of harness engineering. It has four steps: (1) an issue occurs — the agent produces a suboptimal result or violates a pattern; (2) the human performs a harness gap analysis to identify why the harness failed to prevent or detect the issue; (3) the human updates guides (feedforward) or sensors (feedback) to close the gap; (4) the agent reruns the task under the improved harness for verification. This loop externalizes the engineering team's collective intelligence into the system, making the codebase increasingly agent-friendly over time. Every improvement compounds across all future agent runs.


References

  1. Harness engineering for coding agent users
  2. What Is Harness Engineering AI? The Definitive 2026 Guide
  3. Why Most Agentic AI Projects Fail (And How to Succeed in 2026) | AI Agent Corps
  4. Harness engineering: leveraging Codex in an agent-first world | OpenAI
  5. Why AI Agents Fail in Production | Data Science Collective
  6. Agent Harness Explained: Guides, Sensors, and Components
  7. Skill Issue: Harness Engineering for Coding Agents | HumanLayer Blog
  8. Harness Engineering: The Complete Guide to Building Systems That Make AI Agents Actually Work (2026) | NxCode
  9. Why 88% of AI Agents Never Make It to Production (And How to Be the 12%)
  10. Harness Engineering - first thoughts
  11. My AI Adoption Journey – Mitchell Hashimoto
  12. AI Agent Scaling Gap March 2026: Pilot to Production
  13. Aihola
  14. GitHub - aiming-lab/AutoHarness: AutoHarness: Automated Harness Engineering for AI Agents · GitHub
  15. Why 88% of AI Agents Fail Production: Analysis Guide
  16. Harness Engineering Explained: What It Is and How Claude Code’s Harness Makes AI Agents Actually Work - AI Code Invest
  17. Mass Programming Resistance – Harness Engineering
  18. What Is Harness Engineering? Complete Guide for AI Agent Development (2026) | NxCode
  19. GitHub - ai-boost/awesome-harness-engineering · GitHub
  20. Why 88% of AI Agents Fail in Production - Enterprise Guide
  21. Harness Engineering — The New Discipline Powering Software Development in the AI Agent Era
  22. Harness engineering: Structured workflows for AI-assisted development | Red Hat Developer
  23. AI Project Failure Rate 2026: 80% Fail | Pertama Partners
  24. Prompt vs Context vs Harness Engineering: Key Differences
  25. What we miss when we talk about "AI Harnesses"
  26. The importance of Agent Harness in 2026
  27. 45 AI Agent Statistics You Need to Know in 2026
  28. Harness Engineering: The Discipline That Determines Whether Your AI Agents Actually Work - TianPan.co
  29. Beyond Prompts and Context: Harness Engineering for AI Agents | MadPlay🚀
  30. Agentic AI Statistics 2026: 150+ Data Points Collection
  31. Harness engineering for coding agent users - geekfence.com
  32. Harness Engineering: The Missing Layer Behind AI Agents
  33. What Is Harness Engineering for AI Agents? | Milvus - Milvus Blog
  34. The Agentic Reality Check: Why 40% of AI Projects are failing in 2026 📉🩹 - DEV Community
  35. A Deep Dive into Harness Engineering - by João Silva
  36. The Emerging "Harness Engineering" Playbook