In ~9 mins: the six-component framework, the three bottlenecks every agent runs into, why “agent score” is now a system score, and a runnable appendix to inspect the reference harness yourself.
The model is the part you can pick. Everything else is the part that breaks.
Shangding Gu (UC Berkeley) just argued that long-horizon agent performance now depends as much on the surrounding system as on stronger foundation models.
The paper names this work “scaling the harness” and decomposes any agent into six interacting components, with three of them already at saturation.
A runnable appendix at the end installs the reference harness and points each step back to one of those six components.
The research is authored by Shangding Gu (UC Berkeley) and titled “From Model Scaling to System Scaling: Scaling the Harness in Agentic AI“ (arXiv 2605.26112v1, May 25 2026).
It lands in a year where “agent score” has stopped tracking model score. SWE-agent showed that redesigning the tool schema alone moves SWE-bench accuracy with the same model underneath. tau-bench’s pass^k metric exposed that agents looking strong at single-shot success collapse under repeated trials.
Anthropic reported a 90.2% gain from a multi-agent system (Opus 4 lead with Sonnet 4 subagents) over single-agent Opus 4 on internal research, with token usage alone explaining 80% of BrowseComp performance variance and tool-call count plus model choice taking that to 95%.
The frame is now mainstream. OpenAI calls it “harness engineering.” Anthropic calls it “context engineering.” AMA-Bench (ICML 2026) and RealMem both target memory hygiene. Gu’s paper is the first to put a clean name and a decomposition on what those efforts have in common.
An agent is six things, not one. The model reasons. The harness around it picks what to remember, what context to assemble, which tool to call, how to verify each step, and what trace to record. Most current evaluation only measures the model.
Reader test. When someone says “this agent is better,” ask whether the model improved or whether the harness changed. Tool schemas, file inspection order, retry policies, and memory rules all sit inside the headline number and currently cannot be separated from it.
Gu defines agent performance over a horizon H as:
P_H = Φ(R, M, C, S, O, G)
Model scaling improves R. System scaling improves the other five.
The paper’s main claim is conditional. Once R clears a capability threshold, marginal gains shift to M, C, S, O, and G. Gu also flags that Φ has no closed form and the six factors are not strictly orthogonal. They are six distinct points of intervention, each one engineering effort can change while holding the rest fixed.
The hard problem of context is not capacity. It’s governance.
Gu factors context quality into four subaxes: relevance (matches the current subproblem), compactness (no more than the minimum sufficient set), traceability (provenance can be inspected), and refresh policy (stale context gets rechecked).
Failure mode: exposure without access. As context grows, the model sees more tokens but does not necessarily attend to the right ones. Relevant evidence competes with low-value padding. Attention dilutes over long inputs. Models prefer evidence at the start or end of the window rather than the middle.
System move: treat each turn’s context as the output of a selection policy, not a fixed buffer. Weight relevance, penalize verbosity against a token budget, prefer recently validated content, record provenance.
Claude Code is the production version of this. Persistent project context lives in CLAUDE.md. Just-in-time access happens through glob, grep, and file reads. The harness gets stable priors plus environment refresh on demand.
The right systems question is no longer how many tokens the model can hold. It is what the minimum sufficient context is for the current subproblem.
The hard problem of agent memory is not storage. It’s trust.
Four subaxes again: precision (the claim has a narrow scope), durability (the target has not silently changed), retrievability (the memory can be found at acceptable cost), verifiability (the claim can be checked against the live environment). Retrievability is a precondition for using trust, not a source of it.
Failure mode: stale-but-confident. Gu’s example: a note like “the data loader is defined in utils/loader.py.” After a refactor, this is flatly wrong, but semantic search still ranks it highly. If the agent acts on it without re-checking, the error is operational.
The safety variant is worse. “Remembering More, Risking More” (Al-Tawaha et al., 2026) finds that memory-enabled agents exceed a NullMemory baseline on safety violations, and that violation rates trend upward with exposure length.
System move: trust is a runtime decision, not a property of the stored item. Store confidence, source, age. Penalize staleness in retrieval. Re-check environment-dependent claims before acting. Treat retrieved memory as a hypothesis until verified.
CheetahClaws implements this directly. Each memory entry stores confidence, source, last_used_at, and conflict_group as first-class fields. Search re-ranks by confidence × recency. The other two harnesses in the paper’s comparison derive trust implicitly from access patterns.
Durable memory without verification accumulates undetected drift. Environment-only search throws away every prior verification. A working harness keeps both.
The hard problem of skill is not having skills. It’s routing and checking them.
Four subaxes: specificity (each skill states what it can and cannot do), selectivity (the router invokes the right skill at the right time), composability (one skill’s post-conditions feed the next), verifiability (every skill output has an explicit check).
Failure mode: confident-but-unchecked. As specialized subagents multiply, the risk shifts from missing capability to present-but-unverified capability. A subagent returns plausible output. No downstream layer validates. This is the symmetric form of stale-but-confident memory.
System move: routing as a learned policy, not a fixed rule set, paired with post-condition checks at every step. S and G are not independent. Scaling skill quality without scaling verification produces faster but less reliable progress.
Adjacent work is converging. OpenAI’s skills docs define skills as repeatable procedures, separated from prompts, attached to the execution environment. SkillOpt (Yang et al., 2026) treats the skill file like neural-network weights and accepts edits only when held-out validation improves.
The open research direction is adaptive allocation across skills based on subtask type, confidence, and verification cost. Production routing today is mostly hand-coded.
Outcome metrics answer whether the task was solved. Process metrics answer how. Two agents may both pass the same benchmark while differing in tokens, tool calls, retries, failed edits, and human interventions. Endpoint accuracy cannot see that.
Gu lists eight dimensions a harness-aware benchmark should track:
Only the first is well-covered today. tau-bench’s pass^k is an early step on reliability. AMA-Bench (ICML 2026) is an early step on long-horizon memory. The other six dimensions need work.
Safe agent evolution is the second open question. Gu proposes four sub-questions for any agent that adapts over time: what persists, what updates, what is measured, what is auditable. Memory, skills, preferences, and guardrails should not collapse into one undifferentiated state.
The evidence from Anthropic’s multi-agent post lands here. Token usage alone explained 80% of BrowseComp performance variance. Token usage plus tool calls plus model choice took it to 95%. Compute allocation across the harness is the dominant factor, not which model is wrapped.
Six questions, one per component, that work against any agent a reader ships today (Claude Code, Codex CLI, Hermes, OpenClaw, custom).
R: which model, and what’s its pass^k under repeated trials?
M: what’s stored, how is staleness handled, is there a re-check before action?
C: is context a buffer or a policy? What gets evicted first?
S: how many tools and skills, who routes, who post-condition-checks?
O: retry policy, handoff protocol, max-turn cap?
G: what’s logged, what’s auditable, which permissions exist independently of model output?
If your stack has no answer for one of these, that’s where the next harness gain will come from.
The framework is sharp. The honest gaps are three.
No metric yet. Gu names the levers but offers no quantitative measure for “harness quality.” Without that, system scaling is direction, not science. The evaluation agenda is the right shape, but every dimension after one-shot completion is still under-benchmarked today.
The reference harness is illustration, not benchmark. CheetahClaws stores confidence and recency as first-class fields and gets credit for that. The Claude Code / OpenClaw / CheetahClaws comparison table is deliberately non-ranking. The paper says explicitly: the point is not to declare a winner. Don’t read it as one.
The evaluation agenda is a wishlist. Of the eight benchmark dimensions, only one is well-covered today. tau-bench, AMA-Bench, and RealMem are early steps on three of the others. Four dimensions have no public benchmark. Scaling the Harness v2 is the paper that turns this framework into measurable scores, and it doesn’t exist yet.
Today, the practical value is vocabulary. Six components, three bottlenecks, four subaxes each, eight benchmark dimensions, four evolution questions. That’s a checklist a reader can run on Monday morning against whatever harness they ship.
Which of the six components is the weakest in the agent stack you ship today?
All source links are in the first reply. Full breakdown of recent updates + daily signals in our newsletter (link in bio).
The paper is conceptual but its reference implementation is runnable in 40K lines of Python. Each step below points back to which paper component it instantiates.
curl -fsSL https://raw.githubusercontent.com/SafeRL-Lab/cheetahclaws/main/scripts/install.sh | bash
source ~/.zshrc
cheetahclawsOr via pip: pip install cheetahclaws. Python 3.10+ required. Installs the CLI binary, the harness modules, and the default config dir at ~/.cheetahclaws/.
cheetahclaws --model claude-opus-4-6Switch between claude-opus-4-6, gpt-4o, gemini-2.5-pro-preview-03-25, deepseek-chat, qwen-max, or any local Ollama model with one flag. No recompile. Component: R (reasoning substrate).
ls ~/.cheetahclaws/memory/
cat ~/.cheetahclaws/memory/MEMORY.mdEach entry is a Markdown file under ~/.cheetahclaws/memory/<slug>.md with frontmatter for confidence, source, last_used_at, and conflict_group. MEMORY.md is the rebuilt-on-write index. Search re-ranks by confidence × recency. Run /memory consolidate for a manual deduplication pass. Component: M (memory store).
Open context.py. The harness assembles each turn from: env block, git info, prompt assets, memory index, registered commands, tmux state, and plan-mode fragments. Compaction fires at 70% of the model’s context window with a two-layer rule-based plus AI-summarization stack. Component: C (context constructor).
/skillsList Markdown skills under skill/ with frontmatter for allowed_tools, model_override, and inline-vs-fork execution. 27 built-in tools register at startup via tool_registry.py. Subagents live under multi_agent/ with built-in roles, tool restrictions, and worktree isolation. Component: S (skill router).
/verboseThe agent loop yields typed events: TextChunk, ToolStart, ToolEnd, TurnDone. The full loop lives in agent.py with retry, handoff, and termination logic in one file. Component: O (orchestration loop).
/permissions manualPermission modes (auto, accept-all, manual, plan) are set via /permissions <mode> inside the REPL. The append-only event store lives at ~/.cheetahclaws/kernel.db (SQLite). Schema is defined in cc_kernel/event_log.py with fields event_id, ts, kind, payload, causation_id, and correlation_id. Component: G (verification and governance).
cheetahclaws --webAll six components surface in one dashboard: model selector, memory browser, context preview, tool registry, event log. Useful for comparing the same task across two models or two permission modes. Spans all components.
This is not a recommendation that CheetahClaws is the right harness for production. It is the one in the paper’s comparison table a reader can read end-to-end in an afternoon.