Harness engineering

An agent harness is the execution environment and control framework that wraps an AI agent so it can operate reliably against tools, systems, and tasks. Harness engineering is the discipline of designing, building, and operating these harnesses.

What a harness does

If the model is the decision-maker, the harness is the runtime infrastructure that lets it act — the scaffolding around the model that handles:

Tool invocation — shell, APIs, browser, database, code execution.
State management — memory, context, scratchpads.
Input/output mediation — formatting prompts, validating outputs.
Safety guardrails — permissions, sandboxing, rate limits.
Observation and feedback loops — capturing tool results and feeding them back.
Task orchestration — retry logic, planning, branching, checkpoints.

A useful analogy: if the agent is the driver, the harness is the car, the dashboard, the rules of the road, and the telemetry system. The driver decides; everything around them determines what they can safely do and how well they can see the consequences.

Concrete examples

Coding agent harness

A coding model that can edit files and run commands. The harness provides:

Filesystem access
Shell execution
Git integration
Patch application
Output capture
Rollback if commands fail

Examples include OpenAI Codex-style execution sandboxes, Anthropic computer-use environments, and CLI agents such as OpenHands.

A typical loop:

The agent decides: "run the tests".
The harness executes pytest.
The harness captures stdout and stderr.
The results are fed back into the agent’s context.
The agent decides its next action.

Without the harness, the model can only suggest commands — it cannot run them, see their output, or act on the result.

Browser automation harness

An agent that interacts with websites. The harness provides:

Browser session lifecycle
DOM access
Click, type, and navigation primitives
Screenshot capture
Page-state serialization

Examples include Playwright wrapped for agents, Browser Use, and web-task agents built on Selenium.

The harness translates a high-level intent — "click the login button" — into the deterministic browser API calls that carry it out.

Why it’s called a harness

The term is borrowed from an older practice: the test harness, the scaffolding of drivers, stubs, and fixtures that wraps a unit of code so it can be exercised in isolation — supplying its inputs, invoking it repeatedly, and capturing its outputs for inspection. An agent harness plays the same role for a model: it wraps the model in a controlled, observable environment, feeds it inputs (prompts, context, tools), and captures its outputs (tool calls and their results) so they can be validated and fed back. In both cases the harness is not the thing being run — it is the rig that makes running it safe, repeatable, and observable.

Why it matters

Recall that Agent = Model + Harness: the model is supplied by a third party and improves on its own schedule, so the harness is the part of the system a team actually controls — and, in practice, the part that most determines whether an agent is useful and trustworthy. The harness shapes what an agent can do far more than the choice of model alone; two agents powered by the same model but run inside different harnesses will behave very differently.

The motivation is trust and reliability. A capable model let loose without constraints produces work that is hard to review and easy to get wrong. A well-designed harness constrains the solution space so the agent is more likely to do the right thing, and so that when it does go wrong, the problem is caught early and cheaply.

Versus prompt and context engineering

Harness engineering is easily confused with two neighbouring practices that also shape an LLM’s behaviour. The difference is one of altitude:

Prompt engineering shapes the instructions given to the model — how a task is phrased, what examples and output format are supplied. It operates on a single request.
Context engineering is the broader practice of curating everything in the context window: system prompts, retrieved documents, conversation history, and tool outputs. Prompt engineering is a subset of it.
Harness engineering operates one level out again. Where prompt and context engineering shape what the model sees (its inputs), harness engineering shapes what the model can do and how it is controlled — its action space, tools, guardrails, and feedback loops.

The three nest rather than compete: the harness is what assembles and manages the context on each turn, so context engineering is one of the concerns a harness handles. A rough division of labour — prompt engineering tunes the words, context engineering tunes the information, and harness engineering builds the machine that delivers both and acts on the results.

Guides and sensors

Birgitta Böckeler (Thoughtworks) frames the controls in a harness as two complementary kinds:

Guides (feedforward) — anticipatory steering applied before the agent acts: documentation, rules, conventions, skills, scaffolding, and code-transformation tools.
Sensors (feedback) — observational controls that detect problems after the agent acts and feed signals back for self-correction: linters, tests, type checks, and semantic reviews.

Both kinds come in two flavours: computational (deterministic tools such as linters and test suites) and inferential (semantic checks performed by an LLM). When an agent struggles, the harness-engineering response is to treat it as a signal — identify what is missing (a tool, a guardrail, a piece of documentation) and add it.

Harness engineering as a discipline

For production and multi-agent systems, harness engineering broadens into something close to platform engineering for agents: defining agent lifecycles, enforcing permission boundaries, wiring observability pipelines, and coordinating multi-agent workflows. The concern is the reliability, governance, and operational characteristics of agentic systems, rather than the raw capability of the underlying model. It is an emerging discipline, analogous to platform engineering in the DevOps world.

The payoff shows up at the orchestration layer. Orchestration tools such as OpenAI Symphony — which spawn an autonomous agent per task and expect each to return a validated deliverable — are explicitly designed to run against codebases built on harness-engineering principles. Without the guides and sensors a harness provides, autonomous, unsupervised agents have nothing to keep their output trustworthy.

Useful links

Harness Engineering — Birgitta Böckeler’s article on the system of controls (guides and sensors) that surround a coding agent.
Harness engineering memo — a shorter companion memo introducing the idea.