Harness Engineering: The Most Important Skill in the Agentic AI Era

As foundation models become commodities, the infrastructure wrapping them—tools, context management, feedback loops, and orchestration—determines whether AI agents succeed or fail in production. This comprehensive guide covers the emerging discipline of harness engineering.

Stan Sedberry

The most important layer in AI is not the model; it's everything around it.

As foundation models approach commodity status and agent capabilities expand, a new discipline has emerged that may prove more valuable than model development itself: harness engineering. This is the art and science of building the infrastructure that wraps around foundation models—transforming raw intelligence into reliable, production-grade autonomous systems.

If you want to succeed in the agentic AI era, this is the skill you need to master.

What Is an Agent Harness?

An agent harness is the complete infrastructure layer that surrounds a foundation model, enabling it to operate as an autonomous agent. Think of it as the difference between an engine and a complete vehicle: the model provides raw capability, but the harness provides everything needed to actually get somewhere useful.

Phil Schmid, a prominent AI engineer, offers the clearest analogy: the model is the CPU—raw processing power. The context window is RAM—limited, volatile working memory. The agent harness is the operating system—it curates context, handles the boot sequence, provides standard drivers, and manages resources.

The harness is not the model (which provides raw reasoning capability), not the prompt (a single input), not a tool (a specific capability), and not a framework like LangChain or CrewAI (which provides reusable abstractions). The harness is the specific, configured assembly of all these elements—the orchestration logic, tool definitions, context management strategy, feedback loops, and guardrails that together channel a model's intelligence toward reliable work.

The harness encompasses:

  • Tool integration and action spaces — The interfaces through which agents interact with external systems

  • Context management — How information flows to and from the model

  • Feedback loops — Mechanisms for learning and self-correction

  • Orchestration logic — The workflows that coordinate complex multi-step operations

  • Safety and guardrails — Constraints that keep agents operating within acceptable boundaries

When Anthropic released their landmark research on building effective agents, they emphasized that "the most successful implementations... aren't using complex frameworks or specialized libraries. Instead, they're building with simple, composable patterns." This insight cuts to the heart of harness engineering: it's not about the sophistication of individual components, but about how elegantly they work together.

The Four Pillars of Harness Engineering

1. Action Space Design

The action space defines what an agent can do—its tools, APIs, and interfaces with the external world. This is where harness engineering diverges most dramatically from traditional software development.

Claude Code ships with a deliberately minimal set of primitives: file reading, directory listing, pattern search (grep), surgical edits, and a Bash shell that serves as a "universal adapter." Rather than hundreds of specialized integrations, the philosophy is "primitives over integrations." Anthropic's team found they "actually spent more time optimizing tools than the overall prompt" when building their SWE-bench agent. Their concept of the Agent-Computer Interface (ACI)—deliberately analogous to Human-Computer Interaction (HCI)—frames tool design as a first-class discipline. A tool-testing agent that rewrote tool descriptions achieved a 40% decrease in task completion time. Claude Code's tool responses are capped at 25,000 tokens by default—because context is the scarcest resource an agent has.
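
A token cap on tool output can be sketched in a few lines. This is an illustrative helper, not Claude Code's implementation; the 4-characters-per-token ratio is a rough heuristic standing in for a real tokenizer.

```python
MAX_TOOL_TOKENS = 25_000  # default cap described above

def truncate_tool_output(text: str, max_tokens: int = MAX_TOOL_TOKENS) -> str:
    """Cap a tool response so it cannot flood the context window."""
    # Assumption: ~4 characters per token; a production harness would
    # measure with the model's actual tokenizer.
    max_chars = max_tokens * 4
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return text[:max_chars] + f"\n[output truncated: {omitted} characters omitted]"
```

The truncation marker matters as much as the cap itself: it tells the agent that more output exists, so it can narrow its next query instead of assuming it saw everything.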

Claude Code itself is built with extraordinary efficiency: 90% of its code is written by Claude, the team consists of roughly 12 engineers, and they ship 60–100 internal releases per day.

OpenAI's experience scaling Codex offers a complementary masterclass. Their system processes over one million lines of code daily, and they discovered that success depends heavily on tool design philosophy:

"We want each tool to handle one thing and handle it well. We want tools to be difficult to misuse by the model."

The key principles:

  • Single responsibility — Each tool should do one thing exceptionally well

  • Misuse resistance — Design interfaces that make incorrect usage difficult

  • Graceful degradation — Tools should fail informatively, giving agents clear signals for recovery

  • Composability — Simple tools that combine into complex behaviors outperform monolithic solutions
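
These principles can be made concrete with a single tool. The sketch below is a hypothetical file-reading tool, not OpenAI's or Anthropic's actual interface: it does one thing, validates its arguments so misuse is hard, and fails with a message the agent can act on rather than raising.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolResult:
    ok: bool
    output: str  # on failure, a recovery hint the agent can read

def read_file_range(path: str, start: int, end: int) -> ToolResult:
    """Single responsibility: read one line range from one file."""
    # Misuse resistance: reject bad arguments with an informative message.
    if start < 1 or end < start:
        return ToolResult(False, f"invalid range {start}-{end}: need 1 <= start <= end")
    try:
        with open(path, encoding="utf-8") as f:
            lines = f.readlines()
    except OSError as exc:
        # Graceful degradation: a clear signal instead of a stack trace.
        return ToolResult(False, f"cannot read {path}: {exc}")
    if start > len(lines):
        return ToolResult(False, f"{path} has only {len(lines)} lines")
    return ToolResult(True, "".join(lines[start - 1:end]))
```

Because every failure path returns a `ToolResult` with a plain-language explanation, the model gets the same structured interface whether the call succeeded or not.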

The Model Context Protocol (MCP), introduced by Anthropic in late 2024, represents an industry attempt to standardize these interfaces. MCP provides a universal protocol for connecting AI assistants to external data sources and tools—addressing the fragmentation that previously plagued agent deployments.

2. Context Management

Context is the lifeblood of effective agents. How you manage what information reaches the model—and when—often determines success or failure more than model capability itself.

LangChain's benchmark data illustrates this dramatically. When evaluating coding agents on Terminal Bench 2.0, they found that context engineering improvements boosted task completion from 52.8% to 66.5%—jumping from 30th place to 5th place—a gain achieved entirely through better information flow, with zero changes to the underlying model. This 13.7 percentage point improvement from harness changes alone is the clearest empirical proof that infrastructure matters more than model selection.

Effective context management requires:

  • Progressive disclosure — Revealing information incrementally as needed rather than overwhelming the model upfront

  • Relevance filtering — Ensuring only pertinent information enters the context window

  • Memory hierarchies — Distinguishing between working memory, episodic memory, and long-term knowledge

  • Context compression — Techniques for representing more information in fewer tokens
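
The middle two bullets can be combined into one small routine. The sketch below is illustrative only: it uses naive term overlap as the relevance score and the same rough characters-per-token heuristic; a real harness would use embeddings and an actual tokenizer.

```python
def assemble_context(chunks: list[str], query_terms: list[str],
                     budget_tokens: int) -> list[str]:
    """Relevance-filter chunks, then pack greedily under a token budget."""
    # Naive relevance score: how many query terms each chunk contains.
    scored = sorted(chunks, key=lambda c: -sum(t in c.lower() for t in query_terms))
    selected, used = [], 0
    for chunk in scored:
        cost = len(chunk) // 4 + 1  # rough ~4-chars-per-token estimate
        if used + cost > budget_tokens:
            continue  # doesn't fit; keep scanning for smaller chunks
        selected.append(chunk)
        used += cost
    return selected
```

Even this crude version encodes the key idea: the budget is enforced by the harness, not left to whatever happens to fit in the prompt.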

Anthropic's Agent Skills system takes progressive disclosure further with three explicit layers: Level 1 loads only YAML frontmatter (name and description) into the system prompt at startup. Level 2 loads the full SKILL.md body when Claude determines a skill is relevant. Level 3 accesses additional bundled reference files only when needed for deep work.
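
The first two levels amount to splitting one file into a cheap header and a deferred body. This is a simplified sketch of that split, not Anthropic's loader; the skill name and contents are invented for illustration.

```python
def parse_skill(skill_md: str) -> tuple[dict, str]:
    """Split a SKILL.md-style file into frontmatter (level 1) and body (level 2)."""
    _, header, body = skill_md.split("---", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()

# Hypothetical skill file: only the frontmatter costs tokens at startup.
SKILL_MD = """---
name: pdf-report
description: Turn tabular data into a formatted PDF report
---
Full step-by-step instructions live here and are loaded only
when the model decides this skill is relevant (level 2).
"""
```

At boot, the harness puts only `meta["name"]` and `meta["description"]` into the system prompt; the body stays on disk until the model asks for it.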

The most dramatic demonstration came from Anthropic's code execution approach to MCP tools. Instead of loading all tool definitions into context (consuming ~150,000 tokens), they present MCP servers as code APIs on a filesystem that agents discover by navigating directory structures. This achieved a 98.7% reduction in token usage—from 150,000 to 2,000 tokens—while maintaining full capability.

The AGENTS.md pattern—a structured file (typically ~100 lines) that provides agents with project-specific context—exemplifies this principle. Rather than expecting agents to discover project conventions through trial and error, harness engineers encode critical knowledge directly into the agent's starting context.
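
A minimal AGENTS.md might look like the following. This is a made-up example for a hypothetical Node project, not a file from any real repository:

```markdown
# AGENTS.md

## Build & test
- Install dependencies: `npm install`
- Run tests: `npm test` (always run before proposing a change)

## Conventions
- TypeScript strict mode; avoid `any`
- All database access goes through `src/repo/`

## Boundaries
- Never edit generated files under `dist/`
- Ask a human before changing public API signatures
```

The point is density: a hundred lines of conventions like these replace hours of trial-and-error exploration on every agent run.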

3. Feedback Loops

Production agents need mechanisms for learning from their actions without requiring constant human supervision. This is where harness engineering becomes genuinely novel.

Anthropic's research highlights the importance of what they call "observation loops":

"Agents can be given the ability to monitor their actions and self-correct, watching for common failure modes and correcting course when detected."

Effective feedback systems include:

  • Execution verification — Confirming that actions produced intended effects

  • Error classification — Distinguishing between recoverable failures and hard stops

  • Self-reflection prompts — Structured opportunities for agents to evaluate their own reasoning

  • Human escalation paths — Clear triggers for when autonomous operation should pause
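
The second and fourth bullets often share one mechanism: a classifier that maps failures to recovery policies. The sketch below is a minimal illustration with invented policy names, assuming Python exception types as the failure signal.

```python
RECOVERABLE = (TimeoutError, ConnectionError)  # transient failures

def classify_failure(exc: Exception) -> str:
    """Map an exception to a coarse recovery policy for the agent loop."""
    if isinstance(exc, RECOVERABLE):
        return "retry"      # transient: the agent may try again
    if isinstance(exc, PermissionError):
        return "escalate"   # clear trigger to pause and ask a human
    return "abort"          # hard stop: surface the error and halt
```

A real harness would also bound retries and log every classification, but even this skeleton prevents the two worst outcomes: retrying a hopeless action forever, and silently swallowing a failure that needed a human.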

The challenge is designing feedback that improves performance without creating infinite loops or excessive caution. OpenAI reports spending significant engineering effort on "ensuring agents know when to ask for help versus when to proceed autonomously."

4. Orchestration Patterns

As agent tasks grow more complex, orchestration—the logic governing how agents coordinate across multi-step workflows—becomes critical infrastructure.

Anthropic identifies several orchestration patterns of increasing sophistication:

  • Prompt chaining — Sequential decomposition where each step's output feeds the next

  • Routing — Dynamic dispatch to specialized handlers based on input characteristics

  • Parallelization — Concurrent execution of independent subtasks

  • Orchestrator-workers — A central agent coordinates specialist subagents

  • Evaluator-optimizer loops — Iterative refinement through generation and critique cycles
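
The parallelization and orchestrator-workers patterns share the same skeleton: fan independent subtasks out, then merge results in the lead agent. A minimal sketch, with `worker` standing in for a subagent call (an assumption for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(subtasks: list[str], worker) -> dict[str, str]:
    """Run independent subtasks concurrently; merge results for the orchestrator."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = pool.map(worker, subtasks)  # each worker gets its own subtask
        return dict(zip(subtasks, results))
```

In a real system each worker would be a subagent with its own context window; the orchestrator's job is deciding which subtasks are truly independent before fanning out.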

Anthropic's multi-agent research system demonstrates the power of orchestration at scale. Using an orchestrator-worker pattern with Claude Opus 4 as lead and Claude Sonnet 4 as subagents, the system outperformed single-agent Opus 4 by 90.2% on internal research evaluations. Token usage explained 80% of performance variance—multi-agent systems used approximately 15× more tokens than chat interactions, but parallel tool calling cut research time by up to 90% for complex queries.

Claude Code uses a single-threaded master loop with support for subagents—Plan, Explore, and Task types—each running in its own context window. This "context firewall" approach prevents "context rot" where intermediate noise accumulates in the parent thread.

The principle that should guide architectural decisions: "Agents can be thought of as an LLM using tools in a loop, with the LLM deciding at each step what to do next... the key is to provide the model with as much relevant context as possible while keeping the workflow as simple as possible."
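
That definition fits in a dozen lines. The sketch below uses a stub in place of a real model call (an assumption, for illustration); everything else is the loop itself.

```python
def agent_loop(model, tools, task, max_steps=10):
    """'An LLM using tools in a loop': each step, the model picks the next action."""
    history = [("task", task)]
    for _ in range(max_steps):
        action, arg = model(history)        # the model decides what to do next
        if action == "finish":
            return arg
        observation = tools[action](arg)    # execute the chosen tool
        history.append((action, observation))
    return None  # step budget exhausted without a final answer

# Stub standing in for a real model call (hypothetical, for illustration):
def stub_model(history):
    if len(history) == 1:
        return ("search", "harness")            # first step: call a tool
    return ("finish", history[-1][1])           # then: return the observation

tools = {"search": lambda q: f"results for {q}"}
```

Every pillar discussed above lives inside this loop: the tool dict is the action space, `history` is the context management problem, and the `max_steps` bound is the simplest possible guardrail.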

Why 2025–2026 Became the Inflection Point

Three converging forces made harness engineering suddenly urgent.

First, models crossed a capability threshold where they could reliably use tools, follow multi-step plans, and recover from errors—making autonomous agent behavior practical rather than theoretical.

Second, the Model Context Protocol (MCP), launched by Anthropic in November 2024 and adopted by OpenAI in March 2025, created a universal standard for connecting agents to external systems. MCP now has over 10,000 active public servers and 97 million monthly SDK downloads. Its donation to the Agentic AI Foundation under the Linux Foundation in December 2025—co-founded by Anthropic, OpenAI, and Block—signaled industry alignment on interoperability.

Third, the economics shifted decisively. Sequoia Capital's Sonya Huang articulated the thesis: "The cloud transition was software-as-a-service—a $350B opportunity. Thanks to agentic reasoning, the AI transition is service-as-a-software. The addressable market is not the software market, but the services market measured in trillions."

The investment numbers confirm the shift:

  • AI captured 61% of all global venture capital in 2025—$258.7 billion of $427.1 billion total (OECD)

  • AI infrastructure firms raised $109.3 billion, up from $47.4 billion in 2024

  • Agentic AI startups raised $2.8 billion in H1 2025, nearly matching full-year 2024

  • 85% of senior executives plan to increase AI agent spending, with 171% average expected ROI

The practical catalyst, however, was simpler: teams building agents discovered that model intelligence wasn't the bottleneck—the surrounding infrastructure was. Vercel removed 80% of its agent's tools and got better results. Manus refactored its harness five times in six months. OpenAI's engineers found that "when something failed, the fix was almost never 'try harder.'"

The AI agent market tells this story in dollars. Currently valued at $7–8 billion, projections suggest growth to $140–200 billion by 2034—a 30%+ compound annual growth rate. This growth won't come from better models alone. It will come from better harnesses that make models actually useful in production.

OpenAI's February 2026 "Harness Engineering" post describes the most extreme experiment yet documented. A team of three engineers (growing to seven) produced approximately 1,500 pull requests over five months—averaging 3.5 PRs per engineer per day—with throughput that increased as the team grew. Single Codex runs regularly worked for upwards of six hours, often while humans slept.

The term "harness engineering" itself was coined by Mitchell Hashimoto (creator of Terraform and Ghostty): "It is the idea that anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."

OpenAI's harness rests on three pillars: context engineering (the repository as single source of truth), architectural constraints (a rigid layered architecture—Types → Config → Repo → Service → Runtime → UI—enforced by structural tests), and entropy management (background agents that scan for deviations and open targeted refactoring PRs). The team initially spent every Friday—20% of their engineering time—cleaning up "AI slop" before automating it away.

The Emerging Practice

Harness engineering is developing its own distinct practices and principles. Several patterns are emerging as industry standards:

Start Simple, Complexify Reluctantly

Both Anthropic and OpenAI emphasize beginning with the simplest possible harness and adding complexity only when necessary. Anthropic states this directly: "Developers often try to optimize prematurely before understanding what's actually needed."

Invest in Observability

You cannot improve what you cannot measure. Production harnesses require comprehensive logging, tracing, and monitoring infrastructure. Every decision point, tool call, and context injection should be observable.

Design for Failure

Agents will fail. Harness engineering is largely the art of making failures graceful, recoverable, and informative. The best harnesses turn failures into learning opportunities rather than system crashes.

Treat Context as Code

The prompts, system instructions, and context documents that shape agent behavior deserve the same rigor as application code: version control, testing, review processes, and documentation.

The Framework Landscape

The agent harness ecosystem has fractured along a spectrum from low-level frameworks to opinionated products.

Frameworks:

  • LangChain/LangGraph (80,000+ GitHub stars, 38 million+ monthly PyPI downloads) — Graph-based orchestration with explicit state management and no hidden prompts

  • CrewAI (44,600+ GitHub stars, 450 million monthly workflows) — Role-based multi-agent collaboration optimized for rapid prototyping. PwC improved code-generation accuracy from 10% to 70% using it

  • Microsoft Agent Framework (merging AutoGen with Semantic Kernel, 54,700+ GitHub stars) — Enterprise deployments with Azure integration

Products:

  • Cursor — Became the fastest software product to reach $100 million ARR, achieving it within 12 months with an agent-first IDE approach

  • Devin by Cognition — Grew from $1 million to $73 million ARR between September 2024 and June 2025, with a 67% PR merge rate. Goldman Sachs deployed Devin as its first AI employee

These products represent the harness as the entire value proposition—the model is interchangeable, but the orchestration, tool design, and user experience are what customers pay for.

The Business Implications

The Harness as Competitive Moat

Salesforce stated it directly: "As we move deeper into the era of AI-driven business, the models themselves will become a commodity. The true competitive moat for an organization will be its agentic infrastructure."

Three properties make harness moats durable:

  • Model-agnostic — A well-designed harness can swap underlying models while maintaining orchestration, constraints, and feedback infrastructure

  • Natural switching costs — Agentic workflows with multi-step dependencies make it increasingly difficult to change models (per a16z's CIO survey of 100+ enterprises)

  • Accumulated institutional knowledge — Every agent failure captured and resolved represents organizational learning competitors cannot replicate by licensing the same foundation model

For organizations deploying AI agents, harness engineering capability will increasingly determine competitive advantage. The models are available to everyone. The harnesses are not.

This creates interesting strategic implications:

  • Hiring — The most valuable AI engineers may be those who excel at harness design, not model development

  • Build vs. Buy — Harness infrastructure is becoming a key differentiator worth building in-house

  • Technical Debt — Poor early harness decisions will compound as agent deployments scale

  • New Roles — Agent Infrastructure Engineers, AI Reliability Engineers, and Capability Architects are crystallizing as distinct disciplines

  • Competitive Moats — Proprietary harness innovations may prove more defensible than model advantages

McKinsey reports that AI-centric organizations are achieving 20% to 40% reductions in operating costs and 12–14 point increases in EBITDA margins. Greg Brockman's February 2026 recommendations for OpenAI engineering teams crystallize the new operating model: designate an "agents captain" on every team, maintain a list of tools made agent-accessible via CLI or MCP, create and maintain AGENTS.md for every project, ensure human accountability for all merged code, and write tests with high-quality interfaces between components.

Looking Forward

Build to Delete: The Bitter Lesson for Agent Infrastructure

The most important strategic principle for harness engineering is that every piece of harness logic should have an expiration date. Capabilities that required complex, hand-coded pipelines in 2024 are now handled by a single context-window prompt in 2026. Manus rewrote its harness five times in six months. LangChain re-architected Open Deep Research four times in a year. LangChain explicitly warns that its LoopDetectionMiddleware "is a design heuristic that engineers around today's perceived model issues. As models improve, these guardrails will likely be unnecessary."

This creates a paradox: teams must invest heavily in harness infrastructure while designing it to be modular enough to dismantle. The resolution is to encode invariants rather than implementations—specify what must be true (dependency directions, data shape validation, test coverage) without prescribing how to achieve it.
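
An invariant like "dependencies only point downward" can be enforced mechanically without prescribing any implementation. The sketch below is a hypothetical structural test, reusing the layer names from the OpenAI example above; it parses source text rather than importing anything.

```python
import ast

LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]  # low -> high

def layer_violations(module_layer: str, source: str) -> list[str]:
    """Flag imports that point 'upward', e.g. service importing ui."""
    rank = LAYERS.index(module_layer)
    violations = []
    for node in ast.walk(ast.parse(source)):
        names = []
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        for name in names:
            top = name.split(".")[0]
            if top in LAYERS and LAYERS.index(top) > rank:
                violations.append(name)  # higher layer imported by a lower one
    return violations
```

The test encodes what must be true about the architecture; an agent (or a human) remains free to restructure anything that keeps the invariant holding.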

The Harness-Training Feedback Loop

The most consequential long-term development may be the feedback loop between harnesses and model training. As Phil Schmid argues, "Labs will use the harness to detect exactly when a model stops following instructions or reasoning correctly after the 100th step. This data will be fed directly back into training to create models that don't get 'tired' during long tasks."

This creates a virtuous cycle: better harnesses produce better training data, better training data produces more capable models, and more capable models allow simpler harnesses.

We are witnessing the birth of a new engineering discipline. Like DevOps emerged from the collision of development and operations, harness engineering is emerging from the collision of AI capabilities and production requirements.

The practitioners who master this discipline—who can design elegant action spaces, engineer effective context flows, build robust feedback loops, and orchestrate complex agent workflows—will shape how AI actually gets deployed in the real world.

The models will continue to improve. But increasingly, the question won't be "which model are you using?" It will be "how good is your harness?"

That question will define the agentic AI era. And answering it well is now the most important skill you can develop.
