Most agent failures aren’t reasoning failures. They’re harness failures.
Picture an agent that passes every test while looping between two failing strategies because the harness has no dead-end detection.
A new 100-page survey from UIUC, Meta, and Stanford spells out why.
It’s called “Code as Agent Harness.” 40+ researchers wrote it, and it synthesizes 400+ papers under a single taxonomy with the harness, not the model, as the subject.
The anchor systems are the familiar ones: Claude Code, Codex, SWE-agent, Voyager, MetaGPT, OpenHands. What’s been a Twitter-thread topic for the last six months now has academic scaffolding around it. The contribution is the synthesis, not the discovery.
Long-running agents fail at state, feedback, and verification. Not at reasoning. The bottleneck of autonomy is whether the system around the model can hold its outputs accountable to something executable.
The paper splits any agent system into three coupled pieces.
First is model-internal capabilities: reasoning, planning, perception.
The second is system-provided infrastructure: tools, sandboxes, memory, permission tiers, telemetry.
And third, the underexplored one, is agent-initiated code artifacts: regression tests, temporary tools, DSL programs, executable workflows, and reusable skills the agent itself authors mid-task. Voyager’s skill library and Claude Code’s skill files are early instances.
Above this distinction sit three layers.
Harness Interface puts code at the center as the medium for reasoning, action, and environment state.
Harness Mechanisms covers planning, memory, tool use, and the plan-execute-verify loop.
Scaling the Harness extends the picture to multi-agent systems coordinating over shared code artifacts.
Three questions, one per layer. They map to where most stacks actually break.
Interface.
Does your agent’s reasoning, action, and environment state pass through code something can actually execute and inspect? A healthy stack has tool calls, generated programs, repo state, traces, and tests. An unhealthy stack runs on natural-language plans the agent never has to defend against execution.
If unhealthy: have the model output executable code as its reasoning, give the agent a structured Agent-Computer Interface like SWE-agent’s shell + edit + search commands, and let it operate on real repo state rather than text descriptions of it.
Mechanisms.
When something fails, what does the harness do about it? A healthy stack runs a plan-execute-verify loop with named verifiers (unit tests, type checks, linters, runtime monitors), durable memory across sessions, and feedback that closes the loop. An unhealthy stack retries with more tokens and a longer context window.
If unhealthy: add named verifiers as gates between generation steps, not just at the end. Most agents only have working memory, which is whatever sits in the current prompt. The paper names four more types that decide whether yesterday’s debugging session helps today: semantic memory of the repo, experiential memory of past trajectories, long-term memory with a compression policy, and multi-agent memory for shared state. OpenHands’ stateful workspace and CodeMem’s budgeted memory slots are reference implementations to study.
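To make "named verifiers as gates" concrete, here is a minimal sketch in Python. It is not taken from the paper or from any of the systems named above. The names Verifier, run_gated, and scripted_generator are hypothetical, the "model" is a scripted stand-in that returns canned drafts, and the single unit test for a function called add is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A check returns None on success or a short failure message.
Check = Callable[[str], Optional[str]]


@dataclass
class Verifier:
    name: str
    check: Check


def syntax_check(code: str) -> Optional[str]:
    """Gate 1: does the candidate even parse?"""
    try:
        compile(code, "<candidate>", "exec")
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    return None


def unit_test(code: str) -> Optional[str]:
    """Gate 2: one hypothetical test for a function named `add`."""
    namespace: dict = {}
    exec(code, namespace)  # a real harness would run this in a sandbox
    add = namespace.get("add")
    if add is None:
        return "add is not defined"
    if add(2, 3) != 5:
        return "add(2, 3) did not return 5"
    return None


VERIFIERS = [Verifier("syntax_check", syntax_check), Verifier("unit_test", unit_test)]


def run_gated(
    generate: Callable[[Optional[str]], str],
    verifiers: List[Verifier],
    max_attempts: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Generate, run every gate in order, and feed the first named failure back."""
    feedback: Optional[str] = None
    history: List[str] = []
    for _ in range(max_attempts):
        candidate = generate(feedback)
        failure = None
        for verifier in verifiers:
            message = verifier.check(candidate)
            if message is not None:
                failure = f"{verifier.name}: {message}"
                break  # stop at the first failing gate
        history.append(failure or "pass")
        if failure is None:
            return candidate, history
        feedback = failure  # the next attempt sees which gate failed and why
    return None, history


def scripted_generator() -> Callable[[Optional[str]], str]:
    """Stand-in for a model call: returns three canned drafts in order."""
    drafts = iter([
        "def add(a, b) return a + b",        # fails the syntax gate
        "def add(a, b):\n    return a - b",  # parses, fails the unit test
        "def add(a, b):\n    return a + b",  # passes both gates
    ])
    return lambda feedback: next(drafts)


candidate, history = run_gated(scripted_generator(), VERIFIERS)
```

The design choice worth noticing is that every failure carries the name of the gate that produced it, so the retry is driven by specific feedback instead of more tokens.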
Scaling.
When two agents work on the same task, what’s the shared substrate? A healthy stack uses shared code artifacts (repos, tests, traces, structured workflows) with a conflict policy. An unhealthy stack passes messages back and forth with no shared state both can safely modify.
If unhealthy: replace direct message-passing with shared artifacts both agents can read and write. AgentCoder’s programmer-tester-executor split and MetaGPT’s role-specialized agents over a shared message pool are the patterns the paper highlights.
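One way to picture a shared artifact with a conflict policy is a versioned store where a write only lands if the writer saw the latest version. The sketch below is an assumption for illustration, not how AgentCoder or MetaGPT implement coordination. SharedArtifacts and Conflict are hypothetical names, and the policy shown (reject stale writes, make the writer re-read) is only one of several possible choices.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


class Conflict(Exception):
    """Raised when a write is based on a stale version of the artifact."""


@dataclass
class SharedArtifacts:
    # path -> (version, content); version 0 means "does not exist yet"
    _files: Dict[str, Tuple[int, str]] = field(default_factory=dict)

    def read(self, path: str) -> Tuple[int, str]:
        return self._files.get(path, (0, ""))

    def write(self, path: str, content: str, base_version: int) -> int:
        """Accept the write only if the writer read the current version."""
        current_version, _ = self.read(path)
        if base_version != current_version:
            raise Conflict(
                f"{path}: write based on v{base_version}, current is v{current_version}"
            )
        new_version = current_version + 1
        self._files[path] = (new_version, content)
        return new_version


store = SharedArtifacts()

# Both agents read the same starting state.
programmer_base, _ = store.read("calc.py")
tester_base, _ = store.read("calc.py")

# The programmer writes first and wins.
store.write("calc.py", "def add(a, b):\n    return a + b\n", programmer_base)

# The tester's write is now stale and gets rejected instead of clobbering.
try:
    store.write("calc.py", "# tester notes\n", tester_base)
    tester_conflicted = False
except Conflict:
    tester_conflicted = True

# Conflict policy: re-read, merge on top of the latest content, retry.
latest_version, latest_content = store.read("calc.py")
final_version = store.write(
    "calc.py", latest_content + "# reviewed by tester\n", latest_version
)
```

Both agents modify the same state, and the store decides whose write lands, which is the property plain message-passing lacks.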
If any of these answers feel unhealthy, the paper has already named the failure mode.
Five application domains. Code assistants, GUI/OS agents, scientific discovery, embodied agents, personalization.
Self-evolving harnesses. AutoHarness, Meta-Harness, Agentic Harness Engineering (AHE), GEPA, EvoMAC, and SEW. The harness itself as the object of optimization, with the agent’s task code as a downstream effect.
Workflow topologies. Five patterns for multi-agent code work: waterfall, cyclic, hierarchical, star, adaptive.
Planning paradigms. Four categories, from ReAct-style linear decomposition to tree-search-based exploration across candidate paths.
Three more open problems. Harness evolution that doesn’t break old behaviors, shared state across agents with safe coordination, and multimodal harnesses for screenshots and physical state.
The paper is the most useful vocabulary the field has had for what practitioners are already building. It is just not a build plan. Three gaps from its open problems stand out, and each one is a design warning for what’s already in your stack.
Oracle adequacy.
If your eval is pass/fail on unit tests, you’re measuring the wrong thing. Every agent evaluation today collapses model quality, tool reliability, and harness quality into one end-task number. The paper names this as the central bottleneck and offers no metric that fixes it.
The verification gap.
Green tests are not a correct specification. Every accepted action should ship with an evidence bundle: which checks ran, which assumptions held, which parts of the code stayed untested, what risks remain. No current harness does this. The architecture pattern is sitting there, waiting for someone to ship it.
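The evidence bundle can be sketched as a plain record attached to each accepted action. Everything below is hypothetical: the paper lists what the bundle should contain, and the field names, the EvidenceBundle class, and the accept policy are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class EvidenceBundle:
    action: str
    checks_run: List[str] = field(default_factory=list)        # which checks ran
    assumptions_held: List[str] = field(default_factory=list)  # which assumptions held
    untested_code: List[str] = field(default_factory=list)     # what stayed untested
    remaining_risks: List[str] = field(default_factory=list)   # what risks remain

    def to_json(self) -> str:
        """Serialize so the bundle can be stored next to the commit or trace."""
        return json.dumps(asdict(self), indent=2)


def accept(bundle: EvidenceBundle) -> bool:
    """Hypothetical policy: refuse any action that arrives with no checks at all."""
    return len(bundle.checks_run) > 0


bundle = EvidenceBundle(
    action="apply patch to calc.py",
    checks_run=["unit tests", "type check"],
    assumptions_held=["inputs are integers"],
    untested_code=["calc.py: error-handling branch"],
    remaining_risks=["overflow behavior not exercised"],
)
unverified = EvidenceBundle(action="edit deploy script")
```

Here accept(bundle) is true and accept(unverified) is false. The point is that green tests become one field among several, and the untested parts are written down instead of implied.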
Approvals that don’t reset.
If approvals vanish after the session ends, your agent will repeat the same unsafe action next time. Permission rules should mutate in response to human decisions, not reset. The paper flags this and stops there.
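A minimal sketch of approvals that survive the session, assuming the simplest possible design: human decisions are written to a JSON file and loaded at the start of the next session. ApprovalStore, the file name approvals.json, and the exact-match lookup on the action string are all hypothetical. A real harness would need pattern matching, scoping, and expiry.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


class ApprovalStore:
    """Persists human allow/deny decisions so they outlive a single session."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.rules: Dict[str, str] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def check(self, action: str) -> str:
        """Return 'allow', 'deny', or 'ask' when no human has decided yet."""
        return self.rules.get(action, "ask")

    def record(self, action: str, decision: str) -> None:
        """Store a human decision and write it to disk immediately."""
        if decision not in ("allow", "deny"):
            raise ValueError("decision must be 'allow' or 'deny'")
        self.rules[action] = decision
        self.path.write_text(json.dumps(self.rules, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    rules_file = Path(tmp) / "approvals.json"

    # Session one: nothing is known, the human denies the action.
    session_one = ApprovalStore(rules_file)
    first_answer = session_one.check("git push --force")
    session_one.record("git push --force", "deny")

    # Session two: a fresh object, but the decision is still there.
    session_two = ApprovalStore(rules_file)
    second_answer = session_two.check("git push --force")
```

In this run first_answer is "ask" and second_answer is "deny": the rule changed because of a human decision and did not reset with the session.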
Read it as a vocabulary, not a roadmap. The harness is the layer teams now invest in optimizing. The taxonomy will sharpen how you talk about your stack. It won’t tell you what to build on Monday.