Harness Engineering: The Complete Guide
The model is the horse. The harness is what makes it useful. This is the practitioner's guide to the discipline that replaced prompt engineering as the defining skill of AI-native development — from the formula that started it all to the multi-agent orchestration patterns that make it real.
What Is Harness Engineering?
Harness engineering is the discipline of designing everything around an AI model that makes it a useful agent — context pipelines, guides, sensors, tools, memory, orchestration, permissions, and observability.
The term emerged in February 2026 from two near-simultaneous publications. Mitchell Hashimoto (co-founder of HashiCorp, creator of Terraform and Ghostty) described a discipline he'd developed while working with AI coding agents: "Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again." Six days later, Ryan Lopopolo at OpenAI published a case study where a small team built and shipped a product with zero lines of manually-written code — 1 million lines generated by Codex agents, 1,500 PRs merged by 3 engineers at 3.5 PRs/engineer/day.
The insight from both: when AI writes all the code, the craft shifts from writing code to designing the system around the code writer.
Birgitta Böckeler at Thoughtworks then published the canonical framework on Martin Fowler's site, establishing the vocabulary that the industry now uses: guides (feedforward controls) and sensors (feedback controls), borrowed from cybernetics.
The Formula: Agent = Model + Harness
A model alone is not an agent. GPT-4o, Claude, Gemini — these are reasoning engines. They become agents only when a harness gives them: state, tool execution, feedback loops, and enforceable constraints.
The metaphor comes from horse tack — reins, saddle, bit — the complete set of equipment for channeling a powerful but unpredictable animal. The model is the horse. The harness is what makes it useful.
What's in the harness? Everything that isn't the model:
- Context pipelines — the information the agent reasons over (CLAUDE.md, schemas, retrieved docs)
- Guides — constraints that steer the agent before it acts (system prompts, rules files, tool definitions)
- Sensors — checks that validate the agent after it acts (linters, tests, evals, output parsers)
- Tool interfaces — controlled access to external systems (MCP servers, filesystem, APIs)
- Memory — durable state across sessions (MEMORY.md, vector stores, conversation history)
- Orchestration — multi-agent coordination (task boards, session management, worktree isolation)
- Hooks — programmatic interception points (PreToolUse, PostToolUse, pre-commit)
- Permissions — what the agent can and cannot do (allowlists, tool-risk ratings, human-in-the-loop triggers)
- Sandboxes — safe execution environments (Docker, E2B, Firecracker)
- Observability — monitoring and debugging (traces, token usage, error rates)
Prompt → Context → Harness: The Evolution
Harness engineering didn't replace prompt engineering — it absorbed it. The hierarchy is additive, not competitive.
Prompt Engineering
Shapes behavior within a single interaction. Crafting instructions, examples, and formatting for a single model call.
Still essential — but now one component inside the harness, not the whole skill.
Context Engineering
Shapes reasoning by architecting the complete information environment — conversation history, retrieved docs, tools, output formats.
Recognized that what the model sees matters more than how you ask. See our complete guide.
Harness Engineering
Shapes execution by designing the entire operational environment — context, tools, sensors, orchestration, guardrails.
Contains both prompt and context engineering. The complete system around the model.
As Atlan puts it: prompt engineering "did not die; it was reclassified." It's now one tool inside the harness, alongside a dozen others. 82% of IT leaders say prompt engineering alone is insufficient for production agents.
Guides and Sensors: The Two Controls
Böckeler's framework splits the harness into two types of controls, borrowed from cybernetics:
Guides
Constrain and direct the agent before it acts. These are proactive controls that shape what the agent does.
- CLAUDE.md / AGENTS.md — project-level rules, build commands, forbidden patterns
- System prompts — role definitions, output format constraints
- Skills / reference docs — on-demand knowledge loaded when relevant
- Tool definitions — what tools exist, their parameters, when to use them
- Task specs — per-task instructions (goal, requirements, acceptance criteria)
- Architectural constraints — file structure rules, API contracts, schema definitions
Guides are followed ~70% of the time. They influence, they don't enforce.
Sensors
Observe and validate the agent's behavior after it acts. These are reactive controls that catch mistakes.
- Computational sensors (deterministic) — linters, type checkers, formatters, test suites, syntax validators
- Inferential sensors (LLM-based) — semantic code review, output quality scoring, drift detection
- Hooks — PreToolUse and PostToolUse that block dangerous actions with exit code 2
- Evals — automated grading of agent output against expected behavior
- UI automation — browser-based verification that the app actually works
Sensors enforce at ~100%. Böckeler argues sensors are the "powerful and underdiscussed" component — most teams over-invest in guides and under-invest in sensors.
The key insight: guides tell the agent what to do; sensors verify it did it right. A harness with only guides is like a car with a steering wheel but no brakes. A harness with only sensors catches errors but can't prevent them. You need both.
The 10 Components of a Harness
Aggregated from OpenAI, Thoughtworks, LangChain, Addy Osmani, and Red Hat, here are the building blocks of a production harness:
| Component | Role | Claude Code Implementation |
|---|---|---|
| Context pipelines | Supply information the agent reasons over | CLAUDE.md, .claude/rules/, subdirectory CLAUDE.md files |
| Guides / Skills | Feedforward constraints that steer before action | .claude/skills/, system prompts, session instructions |
| Sensors | Feedback controls that validate after action | Hooks (PreToolUse, PostToolUse), linters, test suites |
| Tool interfaces | Controlled access to external systems | MCP servers, bash execution, browser automation |
| Memory | Durable state across sessions | MEMORY.md, session memory, shared notes |
| Sandboxes | Safe execution environments | Docker, E2B, Firecracker, permissions allowlists |
| Orchestration | Multi-agent coordination | amux — session management, task boards, worktrees |
| Hooks / Lifecycle | Programmatic interception at key events | PreToolUse, PostToolUse, pre-commit hooks |
| Permissions | What the agent can and cannot do | settings.json allowlists, tool-risk ratings, YOLO mode |
| Observability | Monitoring and debugging agent behavior | Session peek, SSE real-time events, token tracking |
The Ratchet Principle
Every agent mistake becomes a permanent fix.
Addy Osmani calls this the ratchet principle: the harness only tightens, never loosens. When an agent makes an error, you don't fix the output — you fix the harness so the error can never happen again.
This is the core workflow of harness engineering. When an agent produces bad output, diagnose which harness component failed:
- Agent didn't know a rule? → Add it to CLAUDE.md (guide)
- Agent knew the rule but violated it? → Add a hook to enforce it (sensor)
- Agent lacked information? → Add a skill or MCP server (context pipeline)
- Agent used a dangerous tool? → Restrict permissions (guardrail)
- Agent's context was polluted? → Use a subagent for isolation (orchestration)
- Agent crashed and nobody noticed? → Add monitoring (observability)
Hashimoto's rule: every line in your AGENTS.md should trace to a real agent failure. If you can't point to the specific mistake that prompted the line, delete it. Zero aspirational rules.
Building Your First Harness
You don't need a framework. Start with the tools you already have.
Step 1: Write a lean CLAUDE.md
This is your primary guide — the always-loaded project rules. Keep it under 500 lines. Include only what the agent cannot infer from code: build commands, test commands, style rules, forbidden patterns, and pointers to deeper docs. See our CLAUDE.md templates.
# CLAUDE.md
## Build & test
- `npm run build` — builds the project
- `npm test` — runs all tests (must pass before committing)
- `npm run lint` — ESLint + Prettier (auto-fix with --fix)
## Rules
- Never modify files in `src/generated/` — these are auto-generated
- All API endpoints must have request validation with zod
- Use `pnpm` not `npm` for package management
- Database migrations go in `prisma/migrations/` — never edit existing ones
Step 2: Add sensors (hooks)
Move any CLAUDE.md rule that must be enforced 100% of the time into a hook. CLAUDE.md instructions are followed ~70% of the time. Hooks enforce at 100%.
// .claude/settings.json
{
"hooks": {
"PostToolUse": [{
"matcher": "Write|Edit",
"command": "if echo '$TOOL_INPUT' | grep -q 'src/generated/'; then echo 'BLOCKED: never modify generated files' >&2; exit 2; fi"
}]
}
}
Step 3: Add skills for on-demand context
Large reference docs, API specs, and workflow instructions go in .claude/skills/. The agent sees only the skill's name and description at startup (~200 tokens) and loads the full content on demand.
Step 4: Wire in fast feedback loops
Tests, linters, type checkers — these are computational sensors. Make them run automatically after every edit. The faster the feedback, the fewer bad outputs cascade into more bad outputs.
Step 5: Add memory for cross-session knowledge
MEMORY.md stores what the agent learned across sessions — user preferences, project conventions, past decisions. This prevents the same mistakes from recurring in new conversations.
Step 6: Iterate based on failures
Apply the ratchet: every failure tightens the harness. After a week of running agents, your CLAUDE.md will be twice as long, your hooks will catch twice as many edge cases, and your agents will be dramatically more reliable.
Harness Engineering for Multi-Agent Systems
A harness for one agent is configuration. A harness for ten agents is an orchestration platform. This is where amux comes in.
When you scale from 1 agent to 10+, the harness gains new components that single-agent setups don't need:
| Challenge | Single-Agent Harness | Multi-Agent Harness (amux) |
|---|---|---|
| Task assignment | You type the prompt | Kanban board with atomic task claiming via REST API |
| Git isolation | One branch | One worktree per agent, branched from main |
| Crash recovery | You restart manually | Self-healing watchdog: auto-restart, context compaction, stuck-prompt resolution |
| Coordination | N/A | Inter-session messaging, shared board, session peek |
| Monitoring | Watch the terminal | Web dashboard with real-time SSE status for all sessions |
| Cost tracking | Check your API dashboard | Per-session token accounting in the dashboard |
| Context pipelines | One CLAUDE.md | Root CLAUDE.md + per-session steering + shared memory via REST API |
| Overnight operation | Screen + hope | Unattended operation with health monitoring and auto-recovery |
OpenAI's Symphony project demonstrated this at scale: 1 million lines of code, 1 billion tokens per day, zero human-written code, all driven by agents managed through a harness. Stripe's "Minions" ship 1,300 AI-generated PRs per week using a harness that separates deterministic nodes (run a linter, push a commit) from agentic nodes (implement a feature, fix CI).
You don't need to be Stripe or OpenAI. A solo developer with amux and 5-10 Claude Code sessions has the same harness architecture: orchestration, sensors, guides, memory, and self-healing — in a single Python file.
The Evidence: Harness > Model
The most surprising finding of 2026 is that the harness matters more than the model. The data is consistent across benchmarks:
- SWE-bench: Swapping the harness changes scores by 22 points. Swapping the model changes scores by 1 point.
- Terminal Bench 2.0: LangChain gained 13.7 points by redesigning the harness alone — same model, 52.8% → 66.5%.
- Atlan's data pipelines: Without governed context, bare schema accuracy is 10-31%. With a proper harness: 94-99%.
- Princeton research: Harness configurations can improve solve rates by 64% compared to basic setups.
This is why Augment Code, Red Hat, and SIG are all investing in harness engineering as a discipline — the returns are dramatically higher than chasing the latest model.
10 Best Practices
- Start simple. A good CLAUDE.md and pre-commit hooks are more impactful than complex middleware. Add complexity only when simple controls fail.
- Apply the ratchet. Every agent mistake becomes a permanent harness fix. The harness only tightens, never loosens. (Osmani)
- Zero aspirational rules. Every line in CLAUDE.md should trace to a real agent failure. If you can't point to the mistake, delete the line. (Hashimoto)
- Invest in sensors, not just guides. Most teams over-invest in markdown files and under-invest in automated checks. Sensors are the underdiscussed component.
- Separate planning from execution. A planner agent expands prompts into specs. A generator implements. An evaluator tests and grades. Agents rate their own work too generously — separation creates honest feedback.
- Context structure > prompt wording. Most practitioners spend hours on prompt wording and minutes on context structure. This is backwards.
- Achieve information parity. If it's available to humans but not agents, the harness has a hole. Encode every convention, shortcut, and tribal knowledge.
- Wire fast feedback loops. Tests, linters, type checkers running after every edit. The faster the feedback, the fewer cascading failures.
- Use computational sensors before inferential ones. A linter is faster, cheaper, and more reliable than an LLM-based code reviewer. Use deterministic tools first; add LLM-based judgment where deterministic tools can't reach. (Böckeler)
- Design for overnight operation. The harness should handle crashes, context exhaustion, and stuck states without human intervention. If you can't walk away from your agents, your harness has gaps. See our self-healing guide.
Tools and Resources
Agent platforms and orchestrators
- amux — open-source multi-agent orchestration for Claude Code. Session management, kanban board, self-healing watchdog, REST API. The complete harness for parallel agents.
- Claude Code — terminal-native coding agent by Anthropic. Built-in harness primitives: CLAUDE.md, hooks, skills, subagents, MCP.
- OpenAI Codex — cloud-sandboxed coding agent. AGENTS.md for guides, containerized execution for isolation.
- Cursor — AI-native IDE.
.cursorrulesfor guides, built-in linting for sensors.
Harness components
- CLAUDE.md templates — starter configs for different project types
- Hooks cookbook — 20 production-ready sensor recipes
- MCP servers — tool interfaces for agents
- Sandboxing guide — Docker, E2B, Firecracker, gVisor compared
- Config files compared — CLAUDE.md vs .cursorrules vs AGENTS.md
Further reading
- Mitchell Hashimoto: My AI Adoption Journey — the origin of Agent = Model + Harness
- OpenAI: Harness Engineering — the Codex case study (0% human code)
- Thoughtworks / Martin Fowler: Harness Engineering — the guides-and-sensors framework
- Thoughtworks: Exploring AI Coding Sensors — the case for investing in sensors
- Addy Osmani: Agent Harness Engineering — the ratchet principle and harness primitives
- LangChain: The Anatomy of an Agent Harness — components defined by working backwards from behaviors
- Red Hat: Harness Engineering for AI-Assisted Development — structured workflows, symbol analysis
- Latent Space: Extreme Harness Engineering — 1M LOC, 1B tokens/day, 0% human code
- Awesome Harness Engineering — curated list of tools, patterns, and resources
- HumanLayer: Skill Issue — Harness Engineering for Coding Agents
FAQ
What is harness engineering?
Harness engineering is the discipline of designing everything around an AI model that makes it a useful agent — context pipelines, guides, sensors, tools, memory, orchestration, permissions, and observability. The formula is Agent = Model + Harness. The term was coined by Mitchell Hashimoto in February 2026 and formalized by OpenAI, Thoughtworks, and others.
How is harness engineering different from prompt engineering?
Prompt engineering shapes behavior within a single interaction. Context engineering shapes reasoning by architecting the complete information environment. Harness engineering shapes execution by designing the entire operational environment — it contains both prompt and context engineering alongside tool interfaces, sensors, memory, orchestration, and permissions. Prompt engineering didn't die; it was reclassified as one component inside the harness.
What are guides and sensors?
Guides are feedforward controls that constrain the agent before it acts — CLAUDE.md, system prompts, tool definitions. Sensors are feedback controls that validate the agent after it acts — linters, tests, hooks, evals. Together they form a control system. Framework by Birgitta Böckeler at Thoughtworks.
What is the ratchet principle?
Coined by Addy Osmani: every agent mistake should become a permanent fix in the harness. The harness only tightens, never loosens. This means every line in your CLAUDE.md traces to a real failure, every hook traces to a real violation, and the harness accumulates institutional knowledge about what goes wrong.
Why does the harness matter more than the model?
On SWE-bench, swapping the harness changes scores by 22 points; swapping the model changes scores by 1 point. LangChain improved 13.7 points on Terminal Bench 2.0 by changing only the harness. The model is increasingly a commodity; the harness is the competitive moat.
How does amux implement harness engineering?
amux is a harness for multi-agent systems. It provides orchestration (parallel session management, atomic task claiming via kanban board), sensors (health monitoring, crash detection, context exhaustion detection), context pipelines (CLAUDE.md, session memory, MCP servers), tool interfaces (REST API, browser automation), and self-healing (automatic crash recovery, context compaction, stuck-prompt resolution). It wraps Claude Code sessions with the complete operational environment that turns raw model access into a coordinated agent fleet.
Do I need amux to practice harness engineering?
No. Harness engineering starts with a CLAUDE.md file and a pre-commit hook — tools you already have. amux becomes valuable when you scale to 3+ parallel agents and need orchestration, monitoring, and self-healing that a single terminal session can't provide. Start simple, scale when the single-agent harness isn't enough.
amux is the harness
Orchestration, sensors, guides, memory, and self-healing for your AI agent fleet. Open source. One Python file.
git clone https://github.com/mixpeek/amux && cd amux && ./install.sh
amux register myproject --dir ~/Dev/myproject --yolo
amux start myproject
amux serve # → https://localhost:8822View on GitHub