The most important layer in AI is not the model; it's everything around it.
As foundation models approach commodity status and agent capabilities expand, a new discipline has emerged that may prove more valuable than model development itself: harness engineering. This is the art and science of building the infrastructure that wraps around foundation models—transforming raw intelligence into reliable, production-grade autonomous systems.
If you want to succeed in the agentic AI era, this is the skill you need to master.
What Is an Agent Harness?
An agent harness is the complete infrastructure layer that surrounds a foundation model, enabling it to operate as an autonomous agent. Think of it as the difference between an engine and a complete vehicle: the model provides raw capability, but the harness provides everything needed to actually get somewhere useful.
Phil Schmid, a prominent AI engineer, offers the clearest analogy: the model is the CPU—raw processing power. The context window is RAM—limited, volatile working memory. The agent harness is the operating system—it curates context, handles the boot sequence, provides standard drivers, and manages resources.
The harness is not the model (which provides raw reasoning capability), not the prompt (a single input), not a tool (a specific capability), and not a framework like LangChain or CrewAI (which provides reusable abstractions). The harness is the specific, configured assembly of all these elements—the orchestration logic, tool definitions, context management strategy, feedback loops, and guardrails that together channel a model's intelligence toward reliable work.
The harness encompasses:
Tool integration and action spaces — The interfaces through which agents interact with external systems
Context management — How information flows to and from the model
Feedback loops — Mechanisms for learning and self-correction
Orchestration logic — The workflows that coordinate complex multi-step operations
Safety and guardrails — Constraints that keep agents operating within acceptable boundaries
When Anthropic released their landmark research on building effective agents, they emphasized that "the most successful implementations... aren't using complex frameworks or specialized libraries. Instead, they're building with simple, composable patterns." This insight cuts to the heart of harness engineering: it's not about the sophistication of individual components, but about how elegantly they work together.
The Four Pillars of Harness Engineering
1. Action Space Design
The action space defines what an agent can do—its tools, APIs, and interfaces with the external world. This is where harness engineering diverges most dramatically from traditional software development.
Claude Code ships with a deliberately minimal set of primitives: file reading, directory listing, pattern search (grep), surgical edits, and a Bash shell that serves as a "universal adapter." Rather than hundreds of specialized integrations, the philosophy is "primitives over integrations."
Anthropic's team found they "actually spent more time optimizing tools than the overall prompt" when building their SWE-bench agent. Their concept of the Agent-Computer Interface (ACI)—deliberately analogous to Human-Computer Interface (HCI)—frames tool design as a first-class discipline. A tool-testing agent that rewrote tool descriptions achieved a 40% decrease in task completion time. Claude Code's tool responses are capped at 25,000 tokens by default—because context is the scarcest resource an agent has.
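A cap like that is easy to reproduce in any harness. Here is a minimal sketch; the token counter is a crude stand-in for the model's real tokenizer, and the truncation notice is an invented convention, not Claude Code's actual behavior:

```python
TOOL_OUTPUT_TOKEN_CAP = 25_000

def count_tokens(text: str) -> int:
    # Crude heuristic stand-in; a real harness would use the model's tokenizer.
    return len(text) // 4

def cap_tool_output(output: str, cap: int = TOOL_OUTPUT_TOKEN_CAP) -> str:
    """Truncate tool output so a single call cannot flood the context window."""
    if count_tokens(output) <= cap:
        return output
    # Keep the head and tell the agent the output was cut, so it can
    # re-query with a narrower request instead of reasoning over a gap.
    return output[: cap * 4] + "\n[output truncated: retry with a narrower query]"
```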
Claude Code itself is built with extraordinary efficiency: 90% of its code is written by Claude, the team consists of roughly 12 engineers, and they ship 60–100 internal releases per day.
OpenAI's experience scaling Codex offers a complementary masterclass. Their system processes over one million lines of code daily, and they discovered that success depends heavily on tool design philosophy:
"We want each tool to handle one thing and handle it well. We want tools to be difficult to misuse by the model."
The key principles (see the sketch after this list):
Single responsibility — Each tool should do one thing exceptionally well
Misuse resistance — Design interfaces that make incorrect usage difficult
Graceful degradation — Tools should fail informatively, giving agents clear signals for recovery
Composability — Simple tools that combine into complex behaviors outperform monolithic solutions
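A minimal sketch of the first three principles, built around a hypothetical file-reading tool; the names, bounds, and `ToolResult` shape are illustrative, not any product's API:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ToolResult:
    ok: bool
    content: str  # payload on success, recovery hint on failure

def read_file_slice(path: str, start_line: int = 1, max_lines: int = 200) -> ToolResult:
    """Read a bounded slice of a text file. Bounded on purpose: the agent
    must ask again for more, which keeps context consumption deliberate."""
    if start_line < 1:
        return ToolResult(False, "start_line must be >= 1")
    p = Path(path)
    if not p.is_file():
        # Fail informatively: give the agent a clear signal for recovery.
        return ToolResult(False, f"{path} is not a file; try listing its parent directory")
    lines = p.read_text(errors="replace").splitlines()
    if start_line > len(lines):
        return ToolResult(False, f"{path} has only {len(lines)} lines")
    window = lines[start_line - 1 : start_line - 1 + max_lines]
    return ToolResult(True, "\n".join(window))
```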
The Model Context Protocol (MCP), introduced by Anthropic in late 2024, represents an industry attempt to standardize these interfaces. MCP provides a universal protocol for connecting AI assistants to external data sources and tools—addressing the fragmentation that previously plagued agent deployments.
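The protocol is small enough that a working server fits in a few lines. A minimal sketch using the `FastMCP` helper from the official Python SDK (`pip install mcp`); the `search_notes` tool and its data are invented for illustration:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("notes")  # server name advertised to clients

@mcp.tool()
def search_notes(query: str) -> str:
    """Search the team's notes and return matching lines."""
    # A real server would query a database or API here.
    notes = ["harness > model", "cap tool output at 25k tokens"]
    hits = [n for n in notes if query.lower() in n.lower()]
    return "\n".join(hits) or "no matches"

if __name__ == "__main__":
    mcp.run()  # speaks MCP over stdio by default
```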
2. Context Management
Context is the lifeblood of effective agents. How you manage what information reaches the model—and when—often determines success or failure more than model capability itself.
LangChain's benchmark data illustrates this dramatically. When evaluating coding agents on Terminal Bench 2.0, they found that context engineering improvements boosted task completion from 52.8% to 66.5%—jumping from 30th place to 5th place—a gain achieved entirely through better information flow, with zero changes to the underlying model. This 13.7 percentage point improvement from harness changes alone is among the clearest empirical evidence that infrastructure can matter more than model selection.
Effective context management requires:
Progressive disclosure — Revealing information incrementally as needed rather than overwhelming the model upfront
Relevance filtering — Ensuring only pertinent information enters the context window
Memory hierarchies — Distinguishing between working memory, episodic memory, and long-term knowledge
Context compression — Techniques for representing more information in fewer tokens
Anthropic's Agent Skills system takes progressive disclosure further with three explicit layers: Level 1 loads only YAML frontmatter (name and description) into the system prompt at startup. Level 2 loads the full SKILL.md body when Claude determines a skill is relevant. Level 3 accesses additional bundled reference files only when needed for deep work.
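A sketch of how Levels 1 and 2 might be wired, assuming skills live under `skills/<name>/SKILL.md` with frontmatter delimited by `---` lines; the parsing here is deliberately simplified (real frontmatter is YAML):

```python
from pathlib import Path

def skill_frontmatter(skill_md: Path) -> dict:
    """Level 1: read only the frontmatter block between the first two '---' lines."""
    _, fm, _body = skill_md.read_text().split("---", 2)
    return dict(line.split(":", 1) for line in fm.strip().splitlines() if ":" in line)

def skill_body(skill_md: Path) -> str:
    """Level 2: load the full body only once the skill is judged relevant."""
    return skill_md.read_text().split("---", 2)[2]

def startup_index(skills_dir: Path) -> str:
    """What actually lands in the system prompt at boot: a few lines per skill."""
    entries = []
    for f in sorted(skills_dir.glob("*/SKILL.md")):
        fm = skill_frontmatter(f)
        entries.append(f"- {fm.get('name', '?').strip()}: {fm.get('description', '').strip()}")
    return "Available skills:\n" + "\n".join(entries)
```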
The most dramatic demonstration came from Anthropic's code execution approach to MCP tools. Instead of loading all tool definitions into context (consuming ~150,000 tokens), they present MCP servers as code APIs on a filesystem that agents discover by navigating directory structures. This achieved a 98.7% reduction in token usage—from 150,000 to 2,000 tokens—while maintaining full capability.
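A sketch of the discovery side, assuming tool definitions are materialized as files under `./servers/<server>/<tool>.py`; the layout and helpers are illustrative, not Anthropic's implementation:

```python
from pathlib import Path

TOOLS_ROOT = Path("./servers")  # e.g. servers/github/create_issue.py

def list_servers() -> list[str]:
    """Costs a few dozen tokens instead of ~150k of inlined schemas."""
    return [p.name for p in TOOLS_ROOT.iterdir() if p.is_dir()]

def list_tools(server: str) -> list[str]:
    return [p.stem for p in (TOOLS_ROOT / server).glob("*.py")]

def read_tool_api(server: str, tool: str) -> str:
    """Only now does a full tool definition enter the context window."""
    return (TOOLS_ROOT / server / f"{tool}.py").read_text()
```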
The AGENTS.md pattern—a structured file (typically ~100 lines) that provides agents with project-specific context—exemplifies this principle. Rather than expecting agents to discover project conventions through trial and error, harness engineers encode critical knowledge directly into the agent's starting context.
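The loading side is small; a sketch, assuming the harness simply prepends the file to the system prompt (the header text is an arbitrary choice):

```python
from pathlib import Path

def starting_context(repo_root: Path, base_system_prompt: str) -> str:
    """Inject project conventions from AGENTS.md into the agent's starting context."""
    agents_md = repo_root / "AGENTS.md"
    if agents_md.is_file():
        # ~100 lines of encoded conventions replaces trial-and-error discovery.
        return (base_system_prompt
                + "\n\n# Project conventions (AGENTS.md)\n\n"
                + agents_md.read_text())
    return base_system_prompt
```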
3. Feedback Loops
Production agents need mechanisms for learning from their actions without requiring constant human supervision. This is where harness engineering becomes genuinely novel.
Anthropic's research highlights the importance of what they call "observation loops":
"Agents can be given the ability to monitor their actions and self-correct, watching for common failure modes and correcting course when detected."
Effective feedback systems include (see the sketch after this list):
Execution verification — Confirming that actions produced intended effects
Error classification — Distinguishing between recoverable failures and hard stops
Self-reflection prompts — Structured opportunities for agents to evaluate their own reasoning
Human escalation paths — Clear triggers for when autonomous operation should pause
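A sketch of the first two mechanisms, execution verification and error classification; the failure categories and keyword heuristics are illustrative assumptions:

```python
from enum import Enum, auto

class Verdict(Enum):
    OK = auto()
    RETRY = auto()      # recoverable: agent should self-correct and try again
    ESCALATE = auto()   # hard stop: pause autonomy and ask a human

HARD_STOP_HINTS = ("permission denied", "authentication", "destructive")

def classify_failure(error_message: str, attempts: int, max_attempts: int = 3) -> Verdict:
    """Error classification: distinguish recoverable failures from hard stops."""
    if any(hint in error_message.lower() for hint in HARD_STOP_HINTS):
        return Verdict.ESCALATE
    # Unknown errors default to self-correction, but a retry budget keeps
    # the loop from spinning forever.
    return Verdict.RETRY if attempts < max_attempts else Verdict.ESCALATE

def verify_write(path: str, expected_substring: str, attempts: int = 0) -> Verdict:
    """Execution verification: don't trust the action, check its effect."""
    try:
        text = open(path, encoding="utf-8").read()
    except OSError as exc:
        return classify_failure(str(exc), attempts)
    return Verdict.OK if expected_substring in text else Verdict.RETRY
```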
The challenge is designing feedback that improves performance without creating infinite loops or excessive caution. OpenAI reports spending significant engineering effort on "ensuring agents know when to ask for help versus when to proceed autonomously."
4. Orchestration Patterns
As agent tasks grow more complex, orchestration—the logic governing how agents coordinate across multi-step workflows—becomes critical infrastructure.
Anthropic identifies several orchestration patterns of increasing sophistication (the orchestrator-workers pattern is sketched after this list):
Prompt chaining — Sequential decomposition where each step's output feeds the next
Routing — Dynamic dispatch to specialized handlers based on input characteristics
Parallelization — Concurrent execution of independent subtasks
Orchestrator-workers — A central agent coordinates specialist subagents
Evaluator-optimizer loops — Iterative refinement through generation and critique cycles
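A sketch of orchestrator-workers with parallel fan-out, assuming `call_model` is any prompt-to-completion function; in practice the lead and workers may be different models, as in the Anthropic system described next:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_research(question: str, call_model: Callable[[str], str]) -> str:
    """Lead agent plans, workers execute in parallel, lead synthesizes."""
    # Orchestrator: decompose into independent subtasks. One subquestion
    # per line is an assumed output format the lead prompt asks for.
    plan = call_model(
        "Split this question into independent subquestions, one per line:\n" + question
    )
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    # Workers: fan out concurrently, each call starting from a fresh context.
    with ThreadPoolExecutor(max_workers=4) as pool:
        findings = list(pool.map(call_model, subtasks))
    # Orchestrator again: synthesize worker findings into one answer.
    return call_model(
        f"Question: {question}\n\nWorker findings:\n" + "\n\n".join(findings)
        + "\n\nSynthesize a final answer."
    )
```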
Anthropic's multi-agent research system demonstrates the power of orchestration at scale. Using an orchestrator-worker pattern with Claude Opus 4 as lead and Claude Sonnet 4 as subagents, the system outperformed single-agent Opus 4 by 90.2% on internal research evaluations. Token usage explained 80% of performance variance—multi-agent systems used approximately 15× more tokens than chat interactions, but parallel tool calling cut research time by up to 90% for complex queries.
Claude Code uses a single-threaded master loop with support for subagents—Plan, Explore, and Task types—each running in its own context window. This "context firewall" approach prevents "context rot" where intermediate noise accumulates in the parent thread.
The principle that should guide architectural decisions: "Agents can be thought of as an LLM using tools in a loop, with the LLM deciding at each step what to do next... the key is to provide the model with as much relevant context as possible while keeping the workflow as simple as possible."
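That principle compresses into very little code. A minimal sketch of the loop, assuming `call_model` returns either a tool call or a final answer; the reply shape is an invented convention, not any vendor's API:

```python
def agent_loop(task: str, tools: dict, call_model, max_steps: int = 20) -> str:
    """An LLM using tools in a loop: the model decides each step,
    the harness executes tools and feeds results back as context."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(context)               # model decides what to do next
        if reply["type"] == "final":
            return reply["content"]
        tool = tools[reply["tool_name"]]          # harness resolves the tool...
        result = tool(**reply["arguments"])       # ...executes it...
        context.append({"role": "tool", "content": str(result)})  # ...and loops
    return "stopped: step budget exhausted"       # guardrail against runaway loops
```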
Why 2025–2026 Became the Inflection Point
Three converging forces made harness engineering suddenly urgent.
First, models crossed a capability threshold where they could reliably use tools, follow multi-step plans, and recover from errors—making autonomous agent behavior practical rather than theoretical.
Second, the Model Context Protocol (MCP), launched by Anthropic in November 2024 and adopted by OpenAI in March 2025, created a universal standard for connecting agents to external systems. MCP now has over 10,000 active public servers and 97 million monthly SDK downloads. In December 2025 it was donated to the Agentic AI Foundation, a Linux Foundation body co-founded by Anthropic, OpenAI, and Block, signaling industry alignment on interoperability.
Third, the economics shifted decisively. Sequoia Capital's Sonya Huang articulated the thesis: "The cloud transition was software-as-a-service—a $350B opportunity. Thanks to agentic reasoning, the AI transition is service-as-a-software. The addressable market is not the software market, but the services market measured in trillions."
The investment numbers confirm the shift:
AI captured 61% of all global venture capital in 2025—$258.7 billion of $427.1 billion total (OECD)
AI infrastructure firms raised $109.3 billion, up from $47.4 billion in 2024
Agentic AI startups raised $2.8 billion in H1 2025, nearly matching full-year 2024
85% of senior executives plan to increase AI agent spending, with 171% average expected ROI
The practical catalyst, however, was simpler: teams building agents discovered that model intelligence wasn't the bottleneck—the surrounding infrastructure was. Vercel removed 80% of its agent's tools and got better results. Manus refactored its harness five times in six months. OpenAI's engineers found that "when something failed, the fix was almost never 'try harder.'"
The AI agent market tells this story in dollars. Currently valued at $7–8 billion, projections suggest growth to $140–200 billion by 2034—a 30%+ compound annual growth rate. This growth won't come from better models alone. It will come from better harnesses that make models actually useful in production.
OpenAI's February 2026 "Harness Engineering" post describes the most extreme experiment yet documented. A team of three engineers (growing to seven) produced approximately 1,500 pull requests over five months—averaging 3.5 PRs per engineer per day—with throughput that increased as the team grew. Single Codex runs regularly worked for upwards of six hours, often while humans slept.
The term "harness engineering" itself was coined by Mitchell Hashimoto (creator of Terraform and Ghostty): "It is the idea that anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."
OpenAI's harness rests on three pillars: context engineering (the repository as single source of truth), architectural constraints (a rigid layered architecture—Types → Config → Repo → Service → Runtime → UI—enforced by structural tests), and entropy management (background agents that scan for deviations and open targeted refactoring PRs). The team initially spent every Friday—20% of their engineering time—cleaning up "AI slop" before automating it away.
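Structural tests of that kind are straightforward to write. A sketch that enforces the import direction of the layer order above, assuming one Python package per layer under `src/`; the post describes OpenAI's system, whose actual language and tooling aren't specified here:

```python
import ast
from pathlib import Path

LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]  # low -> high
RANK = {name: i for i, name in enumerate(LAYERS)}

def imports_of(pyfile: Path) -> list[str]:
    """Collect every module name a file imports."""
    names = []
    for node in ast.walk(ast.parse(pyfile.read_text())):
        if isinstance(node, ast.Import):
            names += [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.append(node.module)
    return names

def test_layering(src_root: Path = Path("src")):
    """Fail if any layer imports from a layer above it."""
    for layer in LAYERS:
        for pyfile in (src_root / layer).rglob("*.py"):
            for mod in imports_of(pyfile):
                top = mod.split(".")[0]
                if top in RANK and RANK[top] > RANK[layer]:
                    raise AssertionError(
                        f"{pyfile}: layer '{layer}' imports from higher layer '{top}'"
                    )
```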