LangChain Blog · 13 days ago

Interpreters in Deep Agents: Code Between Tool Calls and Sandboxes

TL;DR We’re adding interpreters to Deep Agents: small embedded runtimes where agents can write and execute code inside the agent loop. They give agents a middle ground between one-at-a-time tool calls and full sandboxes, so agents can express multi-step work, keep intermediate state out of model context, and execute code and actions in a more predictable way.

What’s an interpreter?

An interpreter is a small embedded runtime that an agent can write code against while it is working. Functionally, it feels like giving the agent a Python or Node REPL: it can define variables, inspect values, write helper functions, and reuse state across calls.

Many agents today already execute code by issuing commands to a host or sandbox environment. This is great when the task is environment-level work: running commands, installing dependencies, or operating over a filesystem. Interpreters are aimed at a different layer: the agent writes code that runs inside the agent loop to coordinate delegation, compose tool calls, transform structured data, and decide what information should come back to the model.

// agent writes code like this
const rows = [
  { team: "support", tickets: 18 },
  { team: "infra", tickets: 7 },
  { team: "sales", tickets: 11 },
];

const total = rows.reduce((sum, row) => sum + row.tickets, 0);
// copy before sorting so the original `rows` order is left untouched
const busiest = [...rows].sort((a, b) => b.tickets - a.tickets)[0];

`${busiest.team} has the most tickets. ${total} tickets total.`;

This gives agents a new place to express behavior that doesn't fit cleanly into a sequence of tool calls. The agent gets a working space for multi-step logic, while the harness still controls what that working space can touch. The interpreter can hold temporary state and return only the part that matters.

Where interpreters fit

When you think of an agent, you usually think of attaching tools.

In the simplest form of an agent, the agent uses those tools in a loop: the model calls one tool, inspects the observation, then decides what to do next. That one-step-at-a-time style is straightforward to debug and evaluate, and a lot of workflows do require a way to reason over immediate observations.

Sandboxes build on top of that by giving the agent a bash tool that works against an environment to run commands, install dependencies, and work with files.

But both ends have downsides. Sandboxes can handle local procedures (the agent can simply write code to do so), but they can be harder to provision and scale. Purely serial tool loops can be awkward when the intermediate steps mostly just feed the next step.

Some agent work sits between those two extremes, and that is where interpreters slot in nicely. They give the agent code-level composition over scoped capabilities without giving it a whole environment. The model can write a small program to express control flow over existing capabilities, while the harness decides which capabilities are available through the host.

More limited by design

We call this an interpreter, not just a code runtime, because the interpreter is intentionally limited. By default it does not have the APIs you would expect from a normal programming environment: no filesystem, no network, no shell, no package installation, and no wall-clock time access. The agent starts with basic control flow and object manipulation: objects, arrays, maps, JSON, and the rest of the small language runtime.

Those capabilities are exposed through explicit bridges to the host runtime. If the agent needs to call a tool, read from a scoped filesystem API, fetch a URL, or delegate to a subagent, the harness has to expose that capability deliberately. For instance, this script only works when we explicitly bridge the fetch, read_file and task tools directly to the interpreter:

// bridged tools are async functions, so each call is awaited
// calls the `fetch` tool to make a network request
const response = await tools.fetch("https://docs.langchain.com");
// calls the `readFile` tool to read a file from the agent's filesystem
const file = await tools.readFile("SPEC.md");
// calls the `task` tool to spawn a subagent
const subagentOutput = await tools.task({
  description: "Do you know the muffin man?"
});

The host runtime (the same one that runs the harness) contains all the actions an agent can take using the interpreter, and explicitly decides which ones the interpreter code can call. The interpreter is the agent’s programmable side of that boundary.

By default, the interpreter starts with language features only, not generic host access like a sandbox gives you. Anything that touches the outside world has to cross an explicit bridge that you specify.
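
One way to picture the bridge is as a default-deny lookup: the host owns every tool, and interpreter code only ever receives the subset that was explicitly allowlisted. The sketch below is illustrative only. `HostTool`, `buildToolsNamespace`, and the stub tools are hypothetical names made up for this example, not the Deep Agents API.

```typescript
// Hypothetical sketch of a default-deny bridge; not the Deep Agents implementation.
type HostTool = (input: string) => Promise<string>;

// Everything the host is able to do. Interpreter code never sees this object directly.
const hostTools: Record<string, HostTool> = {
  fetch: async (url) => `<html for ${url}>`,
  readFile: async (path) => `contents of ${path}`,
  deleteFile: async (path) => `deleted ${path}`,
};

// Build the `tools` namespace handed to interpreter code from an explicit allowlist.
function buildToolsNamespace(
  allTools: Record<string, HostTool>,
  allowlist: string[],
): Record<string, HostTool> {
  const bridged: Record<string, HostTool> = {};
  for (const toolName of allowlist) {
    const tool = allTools[toolName];
    if (tool === undefined) {
      throw new Error(`Cannot bridge unknown tool: ${toolName}`);
    }
    bridged[toolName] = tool;
  }
  // Freeze so interpreter code cannot attach new capabilities to the namespace.
  return Object.freeze(bridged);
}

// Only `fetch` and `readFile` cross the boundary; `deleteFile` stays host-only.
const tools = buildToolsNamespace(hostTools, ["fetch", "readFile"]);
const bridgedNames = Object.keys(tools);
```

Because the namespace is built from the allowlist rather than filtered down from the full tool set, a tool that nobody listed is simply absent on the interpreter side.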

We do this for a few reasons:

Smaller action surface: With bash or a sandbox, the starting point is broad: the agent has something shaped like a computer, and you restrict what it can do from there. With an interpreter, the starting point is narrow: the agent has a language runtime, and capabilities are added back deliberately. That does not replace sandboxing when your threat model requires process or VM isolation, but it does mean the agent is not inheriting broad host access by default.
Predictability: A small, fixed runtime makes agent behavior easier to anticipate and evaluate. If the interpreter had broad host access or a rich library surface, the same goal could be achieved through many different strategies, which makes outputs less consistent and harder to test. By keeping the default environment minimal and forcing extra capabilities to cross explicit bridges, you make the agent’s action space narrower, the failure modes clearer, and the results more repeatable.

You see the same architectural shape in systems from Figma, Shopify, AWS, and others: constrained code runs on one side, while the host exposes a controlled API boundary on the other.

What interpreters unlock

A few recent systems have converged on similar patterns: give the model a small, scoped runtime where it can write a bit of code to manage control flow and intermediate state. Cloudflare’s Code Mode, Anthropic’s Programmatic Tool Calling (PTC), and RLM-style workflows each point at that idea from different angles. In Deep Agents, an interpreter is how you get that pattern in a model-agnostic way. Here are a few places it’s already been useful:

Interpreter state as a context surface

Agent harnesses already organize context across a few surfaces:

Message history is the context immediately available to the model.
- It is expensive and attention-constrained: just because a model can accept a million tokens does not mean it will reason over every token equally well (a degradation often called context rot).
A filesystem gives the agent somewhere to store durable artifacts, notes, intermediate files, and longer-lived working memory.
- It is durable and flexible, but it forces the agent to serialize working state into files and then reconstruct it later.
- Part of the job of the harness is to control the flow of context between the filesystem and the message history.

Interpreter state gives the agent another option. Values can stay in the runtime as arrays, objects, maps, counters, queues, and helper functions. The model does not need to see every intermediate value as prompt text, but it can still ask the interpreter to inspect or reuse those values later.

This is similar to why a REPL feels different from running a one-off command. If you define a variable in a REPL, it is still there on the next command you submit. You do not have to turn it into stdout, write it to a file, or reconstruct it before doing the next thing. The same principle applies when an agent calls the interpreter multiple times, since it can just reuse the value from a previous call.

That makes interpreters useful for agent-loop state. Message history is for what the model needs to reason over now, the filesystem is for durable artifacts and environment-level work, and interpreter state is for live working values that may be useful later but do not need to become model input yet.
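
To make the REPL analogy concrete, here is a minimal sketch of state that outlives a single call. It assumes nothing about how Deep Agents stores interpreter state: `InterpreterSession`, `Scope`, and `evaluate` are hypothetical names, and each `evaluate` call simply stands in for one interpreter invocation.

```typescript
// Hypothetical sketch: a session whose scope outlives each individual call.
type Scope = Map<string, unknown>;

class InterpreterSession {
  private readonly scope: Scope = new Map<string, unknown>();

  // Each call stands in for one interpreter invocation.
  // Only the return value would travel back to the model.
  evaluate<T>(step: (scope: Scope) => T): T {
    return step(this.scope);
  }
}

const session = new InterpreterSession();

// Call 1: build a large intermediate value, but return only a short summary.
const firstResult = session.evaluate((scope) => {
  const rows = Array.from({ length: 1000 }, (_, i) => ({ id: i, tickets: i % 7 }));
  scope.set("rows", rows);
  return `loaded ${rows.length} rows`;
});

// Call 2: reuse the stored rows without sending them through the prompt again.
const secondResult = session.evaluate((scope) => {
  const rows = scope.get("rows") as { id: number; tickets: number }[];
  return rows.filter((row) => row.tickets === 6).length;
});
```

The thousand rows never appear in either return value; the model would see one short string and one number.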

Programmatic tool calling

Anthropic’s Programmatic Tool Calling (PTC) is another version of this pattern: tool calls happen from inside code the agent writes, rather than as a sequence of model-mediated actions.

If the model calls a tool, receives the full result, reasons over it, and calls the next tool, every small step becomes another model round trip. If the agent can write code that calls tools directly, it can keep intermediate outputs in the runtime and return only the final result or selected evidence.

In Deep Agents, PTC is implemented as middleware rather than as a model-provider behavior. The developer passes an allowlist, allowlisted tools appear under the global tools namespace, and each tool is exposed as an async function the interpreter can call with await. This means that you can enable PTC for any model (including open source ones).

const topics = ["retrieval", "memory", "evaluation"];

const reports = await Promise.all(
  topics.map((topic) =>
    tools.task({
      description: `Research ${topic} in Deep Agents and return three concise findings.`,
      subagent_type: "general-purpose",
    }),
  ),
);

reports.join("\n\n");

In some of our early testing, this style of tool calling used up to 35% fewer tokens on some tasks. (We evaluated this on a set of tasks collected from the OOLONG trec-coarse dataset.)

Working over large datasets

Take a document-heavy task: an agent needs to classify, extract, or synthesize information from 10,000 documents.

With a standard tool-calling agent, the natural shape is a long sequence of model-mediated actions. The model searches, gets results back in context, decides what to inspect next, calls another tool, gets more results back, and repeats. For small tasks, that loop is sufficient. But at scale it starts to break down:

It is hard to verify that the agent actually followed the intended procedure.
Too much intermediate context gets routed back through the model.
It is easy to run into latency, context, or tool-call limits.
The response can degrade because the model is forced to manage too much working state through history.

An interpreter-shaped version looks different. The model can write code that keeps document and search state in the runtime, iterates through batches programmatically, scores or filters candidates, and calls subagents only on selected slices. Instead of returning every intermediate result to the model, the interpreter returns a compact evidence set: the documents that matched, the fields that were extracted, the unresolved cases, or the few summaries worth reasoning over.

The interpreter is not magically reasoning over all 10,000 documents. It gives the agent a better way to control the search space and decide what should enter model context.

const candidates = documents
  .map((doc) => ({ doc, score: scoreDocument(doc, query) }))
  .filter(({ score }) => score > 0.75)
  .sort((a, b) => b.score - a.score)
  .slice(0, 10);

const reports = await Promise.all(
  candidates.map(({ doc }) =>
    tools.task({
      description: `Extract evidence from ${doc.id} for: ${query}`,
      subagent_type: "general-purpose",
    }),
  ),
);

reports.join("\n\n");
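
The snippet above leaves `scoreDocument` undefined, and the post does not say how scoring is done. As a minimal sketch, assuming a simple term-overlap heuristic, a helper and a batch loop might look like the following. `Doc`, `selectCandidates`, and the sample data are illustrative names, not part of Deep Agents.

```typescript
// Illustrative only: the scoring heuristic here is an assumption.
interface Doc {
  id: string;
  text: string;
}

// Fraction of distinct query terms that appear in the document text (0 to 1).
function scoreDocument(doc: Doc, query: string): number {
  const terms = new Set(query.toLowerCase().split(/\s+/).filter(Boolean));
  if (terms.size === 0) return 0;
  const words = new Set(doc.text.toLowerCase().split(/\W+/).filter(Boolean));
  let hits = 0;
  terms.forEach((term) => {
    if (words.has(term)) hits += 1;
  });
  return hits / terms.size;
}

// Walk the corpus in fixed-size batches and keep only the top-scoring candidates.
function selectCandidates(
  docs: Doc[],
  query: string,
  threshold: number,
  limit: number,
  batchSize = 500,
): { doc: Doc; score: number }[] {
  const kept: { doc: Doc; score: number }[] = [];
  for (let start = 0; start < docs.length; start += batchSize) {
    for (const doc of docs.slice(start, start + batchSize)) {
      const score = scoreDocument(doc, query);
      if (score > threshold) kept.push({ doc, score });
    }
  }
  return kept.sort((a, b) => b.score - a.score).slice(0, limit);
}

const sampleDocs: Doc[] = [
  { id: "a", text: "Refund policy for enterprise billing" },
  { id: "b", text: "Office lunch menu" },
  { id: "c", text: "Billing refund escalation steps" },
];

// Documents "a" and "c" contain both query terms; "b" contains neither.
const selected = selectCandidates(sampleDocs, "billing refund", 0.75, 10);
```

A cheap, deterministic filter like this runs entirely in the runtime, so only the shortlisted documents would need a subagent call.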

Recursive orchestration

Another related idea is Recursive Language Models (RLMs). RLMs treat long prompts as part of an external REPL environment, then let the model write code to inspect, decompose, and recursively call models over selected snippets.

Deep Agents interpreters are not implementing RLMs at the model layer, but there is still a relevant connection at the harness level: code can hold working state outside the model context, select a slice of that state, and pass only that slice into the next model or subagent call.

In Deep Agents, tools.task is the bridge for this. Interpreter code can select a slice of work, delegate that slice to a subagent, combine the result with existing runtime state, and return only the synthesized output to the main model.
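
The select, delegate, and combine steps described above can be sketched as a small recursive function. This is a sketch under stated assumptions rather than a definitive implementation: `TaskFn` is a stand-in for a bridged `tools.task` (the real signature may differ), and `summarizeRecursively` and `stubTask` are hypothetical helpers written for this example.

```typescript
// Hypothetical stand-in for a bridged `tools.task`; the real signature may differ.
type TaskFn = (args: { description: string }) => Promise<string>;

// Delegate small slices to subagents, then summarize the summaries.
async function summarizeRecursively(
  chunks: string[],
  task: TaskFn,
  fanout = 3,
): Promise<string> {
  const width = Math.max(2, fanout); // a width below 2 would never shrink the input
  if (chunks.length === 0) return "";
  if (chunks.length <= width) {
    return task({ description: `Summarize:\n${chunks.join("\n")}` });
  }
  const groups: string[][] = [];
  for (let i = 0; i < chunks.length; i += width) {
    groups.push(chunks.slice(i, i + width));
  }
  // Each group becomes one delegated call; only the short partial summaries
  // stay in runtime state before the next level of the recursion.
  const partials = await Promise.all(
    groups.map((group) => summarizeRecursively(group, task, width)),
  );
  return summarizeRecursively(partials, task, width);
}

// Stub used only to make the sketch runnable without a real subagent.
let taskCalls = 0;
const stubTask: TaskFn = async ({ description }) => {
  taskCalls += 1;
  return `summary#${taskCalls} (${description.length} chars in)`;
};

// Seven chunks with a fanout of 3: three leaf calls, then one call to merge them.
const finalSummary: Promise<string> = summarizeRecursively(
  ["a", "b", "c", "d", "e", "f", "g"],
  stubTask,
);
```

The main model would only ever see the string that `finalSummary` resolves to; the partial summaries live and die inside the runtime.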

How it works in Deep Agents

At the harness level, the interpreter is middleware between the agent loop and a small runtime. The middleware:

adds an eval tool to the agent
creates and maintains a QuickJS context
executes the agent’s TypeScript code
captures console.log output when configured
returns the final expression back into model context

The eval tool is not “run arbitrary code on the host.” The code runs inside the interpreter context. If it needs to communicate with the outside world, it does so through bridges the host runtime exposes.

Programmatic tool calling is one of those host bridges. The developer passes a ptc allowlist, allowlisted tools appear inside the interpreter under the tools namespace (e.g. tools.getWeather(...)), and each tool is exposed as an async function the interpreter can call with await. The host runtime still executes the real tool call.

The rough flow looks like this:

the model writes code and calls eval
QuickJS evaluates the code inside the interpreter context
interpreter code optionally calls allowlisted tools
the host runtime executes the real tool calls
results cross back into the interpreter
the final expression crosses back into model context

Repeated eval calls in a run can share the same live interpreter context, which is what lets values behave like REPL state. Snapshotting between conversation turns is also available, but it should be treated as a way to preserve serializable working data rather than live handles or host resources.
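
The distinction between serializable data and live handles is easy to see with a plain JSON round trip. The post does not describe the snapshot format, so JSON is purely an assumption here, and `snapshotState` and `restoreState` are hypothetical helper names.

```typescript
// Illustrative only: JSON stands in for whatever serialization a real snapshot uses.
function snapshotState(state: Record<string, unknown>): string {
  return JSON.stringify(state);
}

function restoreState(saved: string): Record<string, unknown> {
  return JSON.parse(saved) as Record<string, unknown>;
}

const liveState: Record<string, unknown> = {
  processedIds: [1, 2, 3], // plain data: survives the round trip
  cursor: { page: 4, done: false }, // plain data: survives the round trip
  helper: (x: number) => x * 2, // function: omitted by JSON.stringify
  pending: new Map([["a", 1]]), // Map: comes back as an empty object
};

const restored = restoreState(snapshotState(liveState));
```

After the round trip, `processedIds` and `cursor` are intact, `helper` is gone, and `pending` has lost its entries, which is why working data kept across turns is best stored as plain arrays and objects.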

Runtime controls live at this boundary too:

memory limits
per-eval timeouts
maximum programmatic tool calls
maximum result size
console capture
snapshotting between turns
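
As a rough sketch of how two of these controls could be enforced at the boundary, the code below wraps bridged tools in a call budget and clamps the size of what is returned. The option names (`maxToolCalls`, `maxResultChars`) and the helpers are hypothetical; the actual middleware options are listed in the Deep Agents docs.

```typescript
// Hypothetical option names; the real middleware options may differ.
interface RuntimeLimits {
  maxToolCalls: number;
  maxResultChars: number;
}

type Tool = (input: string) => Promise<string>;

// Wrap bridged tools so a single run cannot exceed its tool-call budget.
function withCallBudget(
  bridged: Record<string, Tool>,
  limits: RuntimeLimits,
): Record<string, Tool> {
  let used = 0;
  const wrapped: Record<string, Tool> = {};
  Object.keys(bridged).forEach((toolName) => {
    wrapped[toolName] = async (input) => {
      if (used >= limits.maxToolCalls) {
        throw new Error(`Tool-call budget of ${limits.maxToolCalls} exceeded`);
      }
      used += 1;
      return bridged[toolName](input);
    };
  });
  return wrapped;
}

// Cap what crosses back into model context.
function clampResult(result: string, limits: RuntimeLimits): string {
  if (result.length <= limits.maxResultChars) return result;
  return result.slice(0, limits.maxResultChars) + "...[truncated]";
}

const limits: RuntimeLimits = { maxToolCalls: 2, maxResultChars: 10 };
const clamped = clampResult("abcdefghijklmnop", limits);
```

Both checks run on the host side of the bridge, so interpreter code cannot opt out of them.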

How to use it in Deep Agents

You can install the interpreter and add the middleware using create_deep_agent:

uv add "deepagents[quickjs]"

from deepagents import create_deep_agent
from langchain_quickjs import CodeInterpreterMiddleware

agent = create_deep_agent(
    model="openai:gpt-5.5",
    middleware=[CodeInterpreterMiddleware()],
)

(and in TypeScript)

pnpm install deepagents @langchain/quickjs

import { createDeepAgent } from "deepagents";
import { createCodeInterpreterMiddleware } from "@langchain/quickjs";

const agent = createDeepAgent({
  model: "openai:gpt-5.5",
  middleware: [createCodeInterpreterMiddleware()],
});

To let interpreter code call agent tools, enable programmatic tool calling with an allowlist. Tools are not automatically exposed to interpreter code; you must choose which tools can cross the host-runtime bridge.

agent = create_deep_agent(
    model="openai:gpt-5.5",
    middleware=[CodeInterpreterMiddleware(ptc=["task"])],
)

const agent = createDeepAgent({
  model: "openai:gpt-5.5",
  middleware: [createCodeInterpreterMiddleware({ ptc: ["task"] })],
});

Once PTC is enabled, allowlisted tools appear under the global tools namespace. Each tool is an async function, and the model receives the final interpreter output rather than every intermediate tool result.

Deep Agents is available in Python and TypeScript. See the docs for more information on interpreters, as well as the full set of middleware options and runtime controls.