LangChain Blog · 12일 전 · 원문 보기

에이전트 하네스의 해부

The Anatomy of an Agent Harness

TLDR: Agent = Model + Harness. Harness engineering은 모델 주변에 시스템을 구축하여 이를 작업 엔진으로 바꾸는 방식입니다. 모델은 지능을 담고 있고, harness는 그 지능을 유용하게 만듭니다. 오늘 harness가 무엇인지 정의하고 현재와 미래의 agent들이 필요한 핵심 요소들을 도출해 봅시다.

누군가 "Harness"를 정의해 줄 수 있을까요?

Agent = Model + Harness

당신이 모델이 아니라면, 당신이 harness입니다.

Harness는 모델 자체를 제외한 모든 코드, 설정, 실행 로직입니다. 순수한 모델만으로는 agent가 아닙니다. 하지만 harness가 상태, 도구 실행, 피드백 루프, 강제 가능한 제약 조건 같은 것들을 제공할 때 agent가 됩니다.

구체적으로, harness는 다음과 같은 것들을 포함합니다:

System Prompts
Tools, Skills, MCPs와 그들의 설명
번들된 인프라 (파일시스템, 샌드박스, 브라우저)
오케스트레이션 로직 (subagent 생성, 핸드오프, 모델 라우팅)
결정론적 실행을 위한 Hooks/Middleware (압축, 연속성, lint 체크)

Agent 시스템의 경계를 모델과 harness 사이에서 나누는 방법은 많고 복잡합니다. 하지만 제 의견으로는, 이것이 가장 깨끗한 정의입니다. 왜냐하면 이것이 우리에게 모델 지능 주변의 시스템을 설계하는 것에 대해 생각하도록 강요하기 때문입니다.

이 포스트의 나머지 부분은 핵심 harness 요소들을 거쳐가고, 모델이라는 핵심 원시 요소로부터 역으로 작동하면서 왜 각 부분이 존재하는지를 도출합니다.

모델의 관점에서 우리가 Harnesses를 필요로 하는 이유

Agent가 할 수 있기를 원하지만 모델은 기본적으로 할 수 없는 것들이 있습니다. 여기서 harness가 들어옵니다.모델은 주로 텍스트, 이미지, 오디오, 비디오 같은 데이터를 입력으로 받고 텍스트를 출력합니다. 그게 전부입니다. 기본적으로 모델은 다음을 할 수 없습니다:

상호작용 간에 지속 가능한 상태를 유지하기
코드를 실행하기
실시간 지식에 접근하기
작업을 완료하기 위해 환경을 설정하고 패키지를 설치하기

이 모든 것들은 harness 수준 기능입니다. LLM의 구조는 이들을 유용한 작업으로 변환하기 위해 이들을 감싸는 어떤 종류의 기계 장치를 필요로 합니다. 예를 들어, "채팅"이라는 제품 UX를 얻기 위해, 우리는 모델을 이전 메시지들을 추적하고 새로운 사용자 메시지를 추가하는 while 루프로 감싸줍니다. 이것을 읽고 있는 모든 사람이 이미 이런 종류의 harness를 사용했을 것입니다. 주요 아이디어는 우리가 원하는 agent 동작을 harness의 실제 기능으로 변환하고 싶다는 것입니다.

원하는 Agent 동작으로부터 역으로 Harness Engineering으로 가기

Harness Engineering은 인간이 agent 동작을 안내할 유용한 선행 지식을 주입하는 것을 도와줍니다. 그리고 모델이 더 능력 있어져 왔기 때문에, harness는 모델을 외과적으로 확장하고 이전에 불가능했던 작업을 완료하도록 수정하는 데 사용되어 왔습니다.

우리는 모든 harness 기능의 완전한 목록을 거쳐가지 않을 것입니다. 목표는 모델이 유용한 작업을 할 수 있도록 도와주는 시작점으로부터 기능 집합을 도출하는 것입니다. 우리는 이런 패턴을 따를 것입니다:

우리가 원하는 (또는 수정하길 원하는) 동작 → 모델이 이를 달성하도록 돕는 Harness Design.

지속 가능한 저장소와 컨텍스트 관리를 위한 파일시스템

우리는 agent가 지속 가능한 저장소를 가지고 있어서 실제 데이터와 인터페이스 하고, 컨텍스트에 맞지 않는 정보를 오프로드하고, 세션을 걸쳐 작업을 지속할 수 있기를 원합니다.

모델은 자신의 컨텍스트 윈도우 내의 지식에만 직접 작동할 수 있습니다. 파일시스템이 나오기 전에, 사용자는 콘텐츠를 직접 모델에 복사/붙여넣기 해야 했는데, 이것은 어색한 UX이고 자율 agent에게는 작동하지 않습니다. 세상은 이미 파일시스템을 사용하여 작업을 하고 있었으므로 모델은 자연스럽게 파일시스템 사용 방법에 대한 수십억 토큰으로 학습되었습니다. 자연스러운 해결책이 되었습니다:

Harness는 파일시스템 추상화와 fs-ops 도구를 제공합니다.

파일시스템은 아마도 가장 기초적인 harness 원시 요소입니다. 왜냐하면 그것이 잠금을 해제하는 것 때문입니다:

Agent는 데이터, 코드, 문서를 읽을 수 있는 작업 공간을 얻습니다.
작업은 점진적으로 추가되고 오프로드될 수 있는 대신 모든 것을 컨텍스트에 보관합니다. Agent는 중간 출력을 저장하고 하나의 세션보다 더 오래 지속되는 상태를 유지할 수 있습니다.
파일시스템은 자연스러운 협업 표면입니다. 여러 agent와 인간이 공유 파일을 통해 조정할 수 있습니다. Agent Teams과 같은 아키텍처는 이것에 의존합니다.

Git은 파일시스템에 버전 관리를 추가하므로 agent는 작업을 추적하고, 오류를 롤백하고, 실험을 분기할 수 있습니다. 우리는 아래에서 파일시스템을 다시 살펴봅니다. 왜냐하면 그것이 우리가 필요한 다른 기능들에 대한 핵심 harness 원시 요소가 되기 때문입니다.

일반 목적 도구로서의 Bash + Code

우리는 agent가 인간이 모든 도구를 사전에 설계할 필요 없이 문제를 자율적으로 해결하기를 원합니다.

오늘날의 주요 agent 실행 패턴은 모델이 추론하고, 도구 호출을 통해 행동하고, 결과를 관찰하고, while 루프에서 반복하는 ReAct loop입니다. 하지만 harness는 그것들이 로직을 가진 도구들만 실행할 수 있습니다. 사용자들이 모든 가능한 행동을 위해 도구를 만들도록 강요하는 대신, 더 나은 해결책은 agent에게 bash처럼 일반 목적 도구를 제공하는 것입니다.

Harness는 모델이 코드를 작성하고 실행하여 자율적으로 문제를 해결할 수 있도록 bash 도구를 제공합니다.

Bash + code exec는 모델에게 컴퓨터를 주는 것에 대한 큰 진전이고 나머지는 자율적으로 파악하게 놔두는 것입니다. 모델은 사전에 구성된 고정된 도구 집합으로 제한되는 대신 코드를 통해 즉석에서 자신의 도구를 설계할 수 있습니다.

Harness는 여전히 다른 도구들을 제공하지만, 코드 실행이 자율적 문제 해결을 위한 기본 일반 목적 전략이 되었습니다.

샌드박스와 작업을 실행하고 검증하는 도구

Agent는 안전하게 행동하고, 결과를 관찰하고, 진전을 이룰 수 있도록 올바른 기본값을 가진 환경이 필요합니다.

우리는 모델에게 저장소와 코드를 실행할 수 있는 능력을 주었지만, 이 모든 것이 어딘가에서 일어나야 합니다. Agent가 생성한 코드를 로컬에서 실행하는 것은 위험하고, 단일 로컬 환경은 대규모 agent 작업에 확장되지 않습니다.

샌드박스는 agent에게 안전한 운영 환경을 제공합니다. 로컬에서 실행하는 대신, harness는 코드를 실행하기 위해 샌드박스에 연결되고, 파일을 검사하고, 의존성을 설치하고, 작업을 완료합니다. 이것은 코드의 안전하고 격리된 실행을 만듭니다. 더 많은 보안을 위해, harness는 명령을 화이트리스트하고 네트워크 격리를 강제할 수 있습니다. 샌드박스는 또한 환경을 요청 시 생성하고, 많은 작업에 걸쳐 펼치고, 작업이 완료되면 삭제할 수 있기 때문에 규모를 잠금 해제합니다.

좋은 환경은 좋은 기본 도구와 함께 옵니다. Harness는 agent가 유용한 작업을 할 수 있도록 도구를 구성하는 책임이 있습니다. 여기에는 언어 런타임과 패키지의 사전 설치, git과 테스트를 위한 CLI, 웹 상호작용과 검증을 위한 브라우저가 포함됩니다.

브라우저, 로그, 스크린샷, 테스트 러너 같은 도구들은 agent에게 자신의 작업을 관찰하고 분석할 수 있는 방법을 제공합니다. 이것은 자기 검증 루프를 만드는 데 도움이 되는데, 여기서 agent는 애플리케이션 코드를 작성하고, 테스트를 실행하고, 로그를 검사하고, 오류를 수정할 수 있습니다.

모델은 기본적으로 자신의 실행 환경을 구성하지 않습니다. Agent가 실행되는 곳, 어떤 도구를 사용할 수 있는지, 무엇에 접근할 수 있는지, 그리고 어떻게 자신의 작업을 검증하는지 결정하는 것은 모두 harness 수준의 설계 결정입니다.

지속적인 학습을 위한 메모리 & 검색

Agent는 자신이 본 것을 기억하고 학습되었을 때 존재하지 않았던 정보에 접근할 수 있어야 합니다.

모델은 자신의 가중치와 현재 컨텍스트에 있는 것 이상의 추가 지식이 없습니다. 모델 가중치를 수정할 수 있는 방법이 없으면, "지식을 추가하는" 유일한 방법은 컨텍스트 주입입니다.

메모리의 경우, 파일시스템이 다시 핵심 원시 요소입니다. Harness는 AGENTS.md 같은 메모리 파일 표준을 지원하며, 이는 agent 시작 시 컨텍스트에 주입됩니다. Agent가 이 파일을 추가하고 수정할 때, harness는 업데이트된 파일을 컨텍스트에 로드합니다. 이것은 agent가 한 세션에서 지속적으로 지식을 저장하고 미래 세션에 그 지식을 주입하는 형태의 지속적 학습입니다.

지식 기한은 모델이 업데이트된 라이브러리 버전처럼 새로운 데이터에 직접 접근할 수 없다는 것을 의미합니다. 최신 지식을 위해, 웹 검색과 Context7 같은 MCP 도구가 agent가 지식 기한처럼 새로운 라이브러리 버전이나 학습이 멈춘 때 존재하지 않았던 현재 데이터와 같은 지식 기한 너머의 정보에 접근하도록 도와줍니다.

웹 검색과 최신 컨텍스트를 쿼리하는 도구는 harness에 구워 넣기에 유용한 원시 요소입니다.

컨텍스트 로트와 싸우기

Agent 성능은 작업 과정에서 악화되지 않아야 합니다.

컨텍스트 로트는 모델이 컨텍스트 윈도우가 채워지면서 추론과 작업 완료를 더 못하게 되는 방식을 설명합니다. 컨텍스트는 귀하고 부족한 자원이므로, harness는 그것을 관리하기 위한 전략이 필요합니다.

오늘날의 Harness는 대체로 좋은 컨텍스트 엔지니어링을 위한 전달 메커니즘입니다.

압축은 컨텍스트 윈도우가 가득 차가려 할 때 무엇을 할 것인가를 다룹니다. 압축 없이, 대화가 컨텍스트 윈도우를 초과하면 어떻게 될까요? 한 가지 옵션은 API가 오류가 되는 것인데, 그것은 좋지 않습니다. Harness는 이 경우에 대한 어떤 전략을 사용해야 합니다. 따라서 압축은 기존 컨텍스트 윈도우를 지능적으로 오프로드하고 요약하므로 agent가 계속 작업할 수 있습니다.

도구 호출 오프로딩은 유용한 정보를 제공하지 않으면서 컨텍스트 윈도우를 시끄럽게 어지럽히는 큰 도구 출력의 영향을 줄이는 데 도움이 됩니다. Harness는 일정 토큰 수 위의 도구 출력의 헤드와 테일 토큰을 유지하고 필요에 따라 모델이 접근할 수 있도록 전체 출력을 파일시스템에 오프로드합니다.

Skill은 agent 시작 시 컨텍스트에 로드된 너무 많은 도구 또는 MCP 서버의 문제를 해결하는데, 이는 agent가 작업을 시작하기 전에 성능을 악화시킵니다. Skill은 점진적 공개를 통해 이 문제를 해결하는 harness 수준의 원시 요소입니다. 모델은 시작 시 Skill 전면부 메타데이터가 컨텍스트에 로드되기로 선택하지 않았지만, harness는 모델을 컨텍스트 로트에 대해 보호하도록 이것을 지원할 수 있습니다.

장기 지평 자율 실행

우리는 agent가 복잡한 작업을 자율적으로 정확하게 완료하길 원합니다. 장기 시간 지평에 걸쳐.

자율 소프트웨어 생성은 코딩 agent의 성배입니다. 하지만 오늘날의 모델은 조기 정지, 복잡한 문제를 분해하는 데 있는 문제, 그리고 작업이 여러 컨텍스트 윈도우에 걸쳐 확장될 때 부조화로 인해 고통받습니다. 좋은 harness는 이 모든 것 주변을 설계해야 합니다.

이것은 앞의 harness 원시 요소들이 복합되기 시작하는 곳입니다. 장기 작업은 여러 컨텍스트 윈도우에 걸쳐 작동을 유지하기 위해 지속 가능한 상태, 계획, 관찰, 검증이 필요합니다.

세션에 걸쳐 작업을 추적하기 위한 파일시스템과 git. Agent는 장기 작업에서 수백만 토큰을 생성하므로 파일시스템은 지속적으로 작업을 캡처하여 시간에 걸쳐 진전을 추적합니다. Git을 추가하면 새로운 agent가 프로젝트의 최신 작업과 역사를 빠르게 파악할 수 있습니다. 여러 agent가 함께 작업할 때, 파일시스템은 또한 agent가 협업할 수 있는 공유 장부 역할을 합니다.

작업을 계속하기 위한 Ralph Loop. Ralph Loop는 hook을 통해 모델의 종료 시도를 차단하고 원래 프롬프트를 깨끗한 컨텍스트 윈도우에 재주입하는 harness 패턴인데, 이는 완료 목표에 대해 작업을 계속하도록 agent를 강제합니다. 파일시스템이 이것을 가능하게 합니다. 왜냐하면 각 반복이 깨끗한 컨텍스트로 시작하지만 이전 반복에서 상태를 읽기 때문입니다.

계획과 자기 검증으로 추적 상태를 유지합니다. 계획은 모델이 목표를 일련의 단계로 분해하는 것입니다. Harness는 좋은 프롬프팅과 파일시스템에서 계획 파일을 사용하는 방법에 대한 알림을 주입하여 이것을 지원합니다. 각 단계를 완료한 후, agent는 자기 검증을 통해 자신의 작업의 정확성을 확인하는 것으로부터 이점을 얻습니다. Harness의 Hook은 사전 정의된 테스트 제품군을 실행하고 오류 메시지와 함께 실패 시 모델에 루프백할 수 있으며, 모델은 독립적으로 자신의 코드를 자기 평가하라는 프롬프트를 받을 수 있습니다. 검증은 테스트에서 솔루션을 접지하고 자기 개선을 위해 피드백 신호를 만듭니다.

Harness의 미래

모델 학습과 Harness 설계의 결합

Claude Code와 Codex 같은 오늘날의 agent 제품은 모델과 harness가 루프에 있는 post-trained입니다. 이것은 harness 설계자가 파일시스템 연산, bash 실행, 계획, subagent와 함께 작업을 병렬화하는 것처럼 자체적으로 잘하는 것이 나타나기를 원하는 행동에서 모델을 개선하는 데 도움이 됩니다.

이것은 피드백 루프를 만듭니다. 유용한 원시 요소들이 발견되고, harness에 추가되고, 다음 세대 모델 학습 시 사용됩니다. 이 사이클이 반복되면서, 모델은 학습된 harness 내에서 더 능력이 있어집니다.

하지만 이 공진화는 일반화에 대해 흥미로운 부작용을 가지고 있습니다. 이것은 도구 로직 변화가 모델 성능 저하를 초래하는 방식으로 나타나요. 좋은 예는 Codex-5.3 프롬팅 가이드의 apply_patch 도구 로직에서 파일 편집을 위한 설명입니다. 진정으로 지능 있는 모델은 patch 방법 사이에서 전환하는 데 거의 어려움이 없어야 하지만, harness 루프와 함께 학습하는 것이 이 오버피팅을 만듭니다.

하지만 이것이 당신의 작업에 가장 좋은 harness가 모델이 post-trained된 것이라는 것을 의미하지는 않습니다. Terminal Bench 2.0 리더보드는 좋은 예입니다. Claude Code의 Opus 4.6은 다른 harness의 Opus 4.6보다 훨씬 낮은 점수를 얻습니다. 이전 블로그에서, 우리는 harness만 변경함으로써 우리의 코딩 agent를 Terminal Bench 2.0에서 Top 30에서 Top 5로 개선하는 방법을 보여주었습니다. 당신의 작업에 대해 harness를 최적화하는 것에서 짜낼 수 있는 많은 주스가 있습니다.

Harness Engineering이 어디로 가는가

모델이 더 능력이 있어질 때, 오늘 harness에 있는 것의 일부는 모델로 흡수될 것입니다. 모델은 계획, 자기 검증, 그리고 긴 지평 정합성을 본적으로 더 잘 해질 것이고, 따라서 예를 들어 컨텍스트 주입을 덜 필요로 할 것입니다.

이것은 harness가 시간에 따라 덜 중요해야 한다는 것을 시사합니다. 하지만 prompt engineering이 오늘 계속 가치 있듯이, harness engineering이 좋은 agent를 구축하는 데 계속 유용할 것 같습니다.

harness는 오늘 모델 결함을 패치하는 것이 사실이지만, 모델 지능 주변의 시스템을 엔지니어링하여 더 효과적으로 만드는 것도 합니다. 잘 구성된 환경, 올바른 도구, 지속 가능한 상태, 검증 루프는 기본 지능에 관계없이 어떤 모델이라도 더 효율적으로 만듭니다.

Harness engineering은 우리가 harness 구축 라이브러리 deepagents를 LangChain에서 개선하는 데 사용하는 매우 활발한 연구 영역입니다. 우리가 오늘 탐색하고 있는 흥미로운 몇 가지 미해결 문제입니다:

공유 코드베이스에서 병렬로 작동하는 수백 개의 agent 오케스트레이션
자신의 추적을 분석하여 harness 수준의 실패 모드를 식별하고 수정하는 agent
주어진 작업에 대해 정확하게 올바른 도구와 컨텍스트를 동적으로 조립하는 harness. 사전에 구성되지 않은 대신

이 블로그는 harness가 무엇인지 정의하고 우리가 모델이 할 수 있기를 원하는 작업으로부터 어떻게 형성되는지에 대한 연습이었습니다.

모델은 지능을 담고 있고, harness는 그 지능을 유용하게 만드는 시스템입니다.

더 많은 harness 구축, 더 나은 시스템, 더 나은 agent에게.

1-클릭으로 Deep Agent를 배포하세요. LangSmith 배포. 우리 팀의 전문가와 대화하세요.

TLDR: Agent = Model + Harness. Harness engineering is how we build systems around models to turn them into work engines. The model contains the intelligence and the harness makes that intelligence useful. We define what a harness is and derive the core components today's and tomorrow's agents need.

Can Someone Please Define a "Harness"?

Agent = Model + Harness

If you're not the model, you're the harness.

A harness is every piece of code, configuration, and execution logic that isn't the model itself. A raw model is not an agent. But it becomes one when a harness gives it things like state, tool execution, feedback loops, and enforceable constraints.

Concretely, a harness includes things like:

System Prompts
Tools, Skills, MCPs + and their descriptions
Bundled Infrastructure (filesystem, sandbox, browser)
Orchestration Logic (subagent spawning, handoffs, model routing)
Hooks/Middleware for deterministic execution (compaction, continuation, lint checks)

There are many messy ways to split the boundaries of an agent system between the model and the harness. But in my opinion, this is the cleanest definition because it forces us to think about designing systems around model intelligence.

The rest of this post walks through core harness components and derives why each piece exists working backwards from the core primitive of a model.

Why Do We Need Harnesses. From a Model's Perspective

There are things we want an agent to do that a model cannot do out of the box. This is where a harness comes in.Models (mostly) take in data like text, images, audio, video and they output text. That's it. Out of the box they cannot:

Maintain durable state across interactions
Execute code
Access realtime knowledge
Setup environments and install packages to complete work

These are all harness level features. The structure of LLMs requires some sort of machinery that wraps them to do useful work.For example, to get a product UX like "chatting", we wrap the model in a while loop to track previous messages and append new user messages. Everyone reading this has already used this kind of harness. The main idea is that we want to convert a desired agent behavior into an actual feature in the harness.

Working Backwards from Desired Agent Behavior to Harness Engineering

Harness Engineering helps humans inject useful priors to guide agent behavior. And as models have gotten more capable, harnesses have been used to surgically extend and correct models to complete previously impossible tasks.

We won’t go over an exhaustive list of every harness feature. The goal is to derive a set of features from the starting point of helping models do useful work. We’ll follow a pattern like this:

Behavior we want (or want to fix) → Harness Design to help the model achieve this.

Filesystems for Durable Storage and Context Management

We want agents to have durable storage to interface with real data, offload information that doesn't fit in context, and persist work across sessions.

Models can only directly operate on knowledge within their context window. Before filesystems, users had to copy/paste content directly to the model, that’s clunky UX and doesn't work for autonomous agents. The world was already using filesystems to do work so models were naturally trained on billions of tokens of how to use them. The natural solution became:

Harnesses ship with filesystem abstractions and tools for fs-ops.

The filesystem is arguably the most foundational harness primitive because of what it unlocks:

Agents get a workspace to read data, code, and documentation.
Work can be incrementally added and offloaded instead of holding everything in context. Agents can store intermediate outputs and maintain state that outlasts a single session.
The filesystem is a natural collaboration surface. Multiple agents and humans can coordinate through shared files. Architectures like Agent Teams rely on this.

Git adds versioning to the filesystem so agents can track work, rollback errors, and branch experiments. We revisit the filesystem more below, because it turns out to be a key harness primitive for other features we need.

Bash + Code as a General Purpose Tool

We want agents to autonomously solve problems without humans needing to pre-design every tool.

The main agent execution pattern today is a ReAct loop, where a model reasons, takes an action via a tool call, observes the result, and repeats in a while loop. But harnesses can only execute the tools they have logic for. Instead of forcing users to build tools for every possible action, a better solution is to give agents a general purpose tool like bash.

Harnesses ship with a bash tool so models can solve problems autonomously by writing & executing code.

Bash + code exec is a big step towards giving models a computer and letting them figure out the rest autonomously. The model can design its own tools on the fly via code instead of being constrained to a fixed set of pre-configured tools.

Harnesses still ship with other tools, but code execution has become the default general-purpose strategy for autonomous problem solving.

Sandboxes and Tools to Execute & Verify Work

Agents need an environment with the right defaults so they can safely act, observe results, and make progress.

We've given models storage and the ability to execute code, but all of that needs to happen somewhere. Running agent-generated code locally is risky and a single local environment doesn’t scale to large agent workloads.

Sandboxes give agents safe operating environments. Instead of executing locally, the harness can connect to a sandbox to run code, inspect files, install dependencies, and complete tasks. This creates secure, isolated execution of code. For more security, harnesses can allow-list commands and enforce network isolation. Sandboxes also unlock scale because environments can be created on demand, fanned out across many tasks, and torn down when the work is done.

Good environments come with good default tooling. Harnesses are responsible for configuring tooling so agents can do useful work. This includes pre-installing language runtimes and packages, CLIs for git and testing, browsers for web interaction and verification.

Tools like browsers, logs, screenshots, and test runners give agents a way to observe and analyze their work. This helps them create self-verification loops where they can write application code, run tests, inspect logs, and fix errors.

The model doesn’t configure its own execution environment out of the box. Deciding where the agent runs, what tools are available, what it can access, and how it verifies its work are all harness-level design decisions.

Memory & Search for Continual Learning

Agents should remember what they've seen and access information that didn't exist when they were trained.

Models have no additional knowledge beyond their weights and what's in their current context. Without access to edit model weights, the only way to "add knowledge" is via context injection.

For memory, the filesystem is again a core primitive. Harnesses support memory file standards like AGENTS.md which get injected into context on agent start. As agents add and edit this file, harnesses load the updated file into context. This is a form of continual learning where agents durably store knowledge from one session and inject that knowledge into future sessions.

Knowledge cutoffs mean that models can't directly access new data like updated library versions without the user providing them directly. For up-to-date knowledge, Web Search and MCP tools like Context7 help agents access information beyond the knowledge cutoff like new library versions or current data that didn't exist when training stopped.

Web Search and tools for querying up-to-date context are useful primitives to bake into a harness.

Battling Context Rot

Agent performance shouldn’t degrade over the course of work.

Context Rot describes how models become worse at reasoning and completing tasks as their context window fills up. Context is a precious and scarce resource, so harnesses need strategies to manage it.

Harnesses today are largely delivery mechanisms for good context engineering.

Compaction addresses what to do when the context window is close to filling up. Without compaction, what happens when a conversation exceeds the context window? One option is that the API errors, that’s not good. The harness has to use some strategy for this case. So compaction intelligently offloads and summarizes the existing context window so the agent can continue working.

Tool call offloading helps reduce the impact of large tool outputs that can noisily clutter the context window without providing useful information. The harness keeps the head and tail tokens of tool outputs above a threshold number of tokens and offloads the full output to the filesystem so the model can access it if needed.

Skills address the issue of too many tools or MCP servers loaded into context on agent start which degrades performance before the agent can start working. Skills are a harness level primitive that solve this via progressive disclosure. The model didn't choose to have Skill front-matter loaded into context on start but the harness can support this to protect the model against context rot.

Long Horizon Autonomous Execution

We want agents to complete complex work, autonomously, correctly, over long time horizons.

Autonomous software creation is the holy grail for coding agents. But today's models suffer from early stopping, issues decomposing complex problems, and incoherence as work stretches across multiple context windows. A good harness has to design around all of this.

This is where the earlier harness primitives start to compound. Long-horizon work requires durable state, planning, observation, and verification to keep working across multiple context windows.

Filesystems and git for tracking work across sessions. Agents produce millions of tokens over a long task so the filesystem durably captures work to track progress over time. Adding git allows new agents to quickly get up to speed on the latest work and history of the project. For multiple agents working together, the filesystem also acts as a shared ledger of work where agents can collaborate.

Ralph Loops for continuing work. The Ralph Loop is a harness pattern that intercepts the model's exit attempt via a hook and reinjects the original prompt in a clean context window, forcing the agent to continue its work against a completion goal. The filesystem makes this possible because each iteration starts with fresh context but reads state from the previous iteration.

Planning and self-verification to stay on track. Planning is when a model decomposes a goal into a series of steps. Harnesses support this via good prompting and injecting reminders how to use a plan file in the filesystem. After completing each step, agents benefit from the checking correctness of their work via self-verification. Hooks in harnesses can run a pre-defined test suite and loop back to the model on failure with the error message or models can be prompted to self-evaluate their code independently. Verification grounds solution in tests and creates a feedback signal for self-improvement.

The Future of Harnesses

The Coupling of Model Training and Harness Design

Today's agent products like Claude Code and Codex are post-trained with models and harnesses in the loop. This helps models improve at actions that the harness designers think they should be natively good at like filesystem operations, bash execution, planning, or parallelizing work with subagents.

This creates a feedback loop. Useful primitives are discovered, added to the harness, and then used when training the next generation of models. As this cycle repeats, models become more capable within the harness they were trained in.

But this co-evolution has interesting side effects for generalization. It shows up in ways like how changing tool logic leads to worse model performance. A good example is described here in the Codex-5.3 prompting guide with the apply_patch tool logic for editing files. A truly intelligent model should have little trouble switching between patch methods, but training with a harness in the loop creates this overfitting.

But this doesn't mean that the best harness for your task is the one a model was post-trained with. The Terminal Bench 2.0 Leaderboard is a good example. Opus 4.6 in Claude Code scores far below Opus 4.6 in other harnesses. In a previous blog, we showed how we improved our coding agent Top 30 to Top 5 on Terminal Bench 2.0 by only changing the harness. There's a lot of juice to be squeezed out of optimizing the harness for your task.

Where Harness Engineering is Going

As models get more capable, some of what lives in the harness today will get absorbed into the model. Models will get better at planning, self-verification, and long horizon coherence natively, thus requiring less context injection for example.

That suggests harnesses should matter less over time. But just as prompt engineering continues to be valuable today, it’s likely that harness engineering will continue to be useful for building good agents.

It’s true that harnesses today patch over model deficiencies, but they also engineer systems around model intelligence to make them more effective. A well-configured environment, the right tools, durable state, and verification loops make any model more efficient regardless of its base intelligence.

Harness engineering is a very active area of research that we use to improve our harness building library deepagents at LangChain. Here are a few open and interesting problems we’re exploring today:

orchestrating hundreds of agents working in parallel on a shared codebase
agents that analyze their own traces to identify and fix harness-level failure modes
harnesses that dynamically assemble the right tools and context just-in-time for a given task instead of being pre-configured

This blog was an exercise in defining what a harness is and how it’s shaped by the work we want models to do.

The model contains the intelligence and the harness is the system that makes that intelligence useful.

To more harness building, better systems, and better agents.

Deploy Deep Agents in 1-click with LangSmith Deployment. Speak with an expert from our team.

#ai-agents #agent-harness #autonomous-systems #sandboxing #agent-memory #ai-infrastructure