Harness, Scaffold, and the AI Agent Terms Worth Getting Right

When a field evolves quickly, its vocabulary often evolves faster than its shared understanding. Terms start to blur, get reused in different contexts, or become shorthand for ideas that are never fully explained. We are currently seeing this happen in the field of AI Agents, where concepts are getting mixed together, some are renamed, and others are widely used for a few months before quietly disappearing.

This can be overwhelming for newcomers, and even for practitioners trying to keep up with the latest developments. After ICLR 2026, one of us (@ariG23498) posted a question that captured this confusion well:

"What do you mean by the terms 'harness' and 'scaffold' in the context of agents? I have heard a lot of explanations while I was at ICLR, but I could not understand why they did not converge to a single explanation."

This glossary is our attempt to ground the terms that keep coming up without clear, consistent explanations. It is not meant to be a comprehensive dictionary of every term in the field. Instead, we focus on the concepts that are often mixed up, reused in different ways, or assumed to be obvious when they are not.

Most of these terms come up whether you're building an agent, deploying one, or just using tools like Claude Code, Codex, or Hermes Agent. The last section covers concepts specific to training models, which is more relevant if you work on that side of things.

Many of these terms don't have universally accepted definitions yet, and different frameworks use the same word differently. The goal here is not to enforce one correct vocabulary, but to provide a practical mental model that makes discussions easier to follow.

Let's get started.

Model

The model is the LLM: it takes text in and produces text out (e.g., Claude, Qwen, GPT, Kimi, DeepSeek…). On its own, it has no memory between calls, and no loop. The model can express the intent to call a tool, but it needs a harness to actually execute it. It answers one prompt and stops. Wrap it in scaffolding and a harness and it becomes an agent.

Scaffolding

The behavior-defining layer around the model: system prompt, tool descriptions, how the model's responses get parsed, what it remembers across steps (context management). It shapes how the model sees the world and acts in it, whether during training or at inference.

Products like Claude Code, Codex, and Antigravity CLI call the whole thing a harness. Claude Code's own docs say it directly: "Claude Code serves as the agentic harness around Claude." That's the broad use: harness means everything that isn't the model. The scaffold/harness distinction matters most when you need to reason about them separately, as in a training pipeline. You'll also hear "scaffold" used more broadly to cover any infrastructure the harness relies on: hooks, runtime configuration, even directory structure.

Some products like Claude Code and Codex are tightly coupled to their provider's models. Others like Antigravity CLI and Hermes Agent let you plug in any model.

Harness

The execution layer inside the agent: it calls the model, handles its tool calls, decides when to stop. The harness is what makes the agent run. Scaffolding, defined above, is what the model works from: its instructions, its tools, its format.
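The split between the two layers can be sketched in a few lines. This is a minimal illustration, not how any particular product implements it: `Scaffold`, `fake_model`, `run_harness`, and the `CALL`/`FINAL` text protocol are all hypothetical names invented for this example, and `fake_model` stands in for a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scaffold:
    """What the model works from: its instructions and its tools."""
    system_prompt: str
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)


def fake_model(messages: list[str]) -> str:
    """Hypothetical stand-in for an LLM: requests a tool once, then finishes."""
    if not any(m.startswith("tool_result:") for m in messages):
        return "CALL echo hello"
    return "FINAL done"


def run_harness(scaffold: Scaffold, task: str, model=fake_model, max_steps: int = 5) -> str:
    """The harness: calls the model, handles its tool calls, decides when to stop."""
    messages = [f"system: {scaffold.system_prompt}", f"user: {task}"]
    for _ in range(max_steps):
        reply = model(messages)
        messages.append(f"assistant: {reply}")
        if reply.startswith("FINAL "):
            # The model signalled completion, so the loop ends here.
            return reply[len("FINAL "):]
        if reply.startswith("CALL "):
            # Parse the tool request, execute it, and feed the result back.
            _, name, arg = reply.split(" ", 2)
            result = scaffold.tools[name](arg)
            messages.append(f"tool_result: {result}")
    # Guardrail: never loop forever.
    return "stopped: step budget exhausted"


scaffold = Scaffold("You are a helpful agent.", tools={"echo": lambda s: s})
answer = run_harness(scaffold, "say hello")
print(answer)
```

Swapping the `Scaffold` changes what the model is told and which tools it has; swapping `run_harness` changes how the loop runs and when it stops.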

Harness engineering is the discipline of designing this layer well: deciding when the agent should stop, how errors get handled, and what guardrails keep it on track. It applies at both training and inference. Addy Osmani's piece and OpenAI's account of building with Codex both cover this from the inference side.

At evaluation time, the same pattern shows up as an eval harness: instead of collecting training data, it runs a fixed set of scenarios at a model checkpoint and records metrics rather than updating weights.
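As a rough sketch of that pattern, an eval harness can be as small as a loop over fixed scenarios that returns metrics. The `evaluate` function, the scenario format, and `toy_checkpoint` below are assumptions made up for illustration; real eval harnesses are considerably richer.

```python
from typing import Callable


def evaluate(model: Callable[[str], str], scenarios: list[dict]) -> dict:
    """Run every fixed scenario once and record metrics. No weights are touched."""
    passed = [model(s["prompt"]) == s["expected"] for s in scenarios]
    return {"n": len(passed), "accuracy": sum(passed) / len(passed)}


def toy_checkpoint(prompt: str) -> str:
    """Hypothetical stand-in for a model checkpoint."""
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


SCENARIOS = [
    {"prompt": "2+2", "expected": "4"},
    {"prompt": "capital of France", "expected": "Paris"},
    {"prompt": "3*3", "expected": "9"},
    {"prompt": "capital of Peru", "expected": "Lima"},
]

metrics = evaluate(toy_checkpoint, SCENARIOS)
print(metrics)
```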

Some frameworks use the term orchestrator for a higher-level controller that coordinates work across multiple agents. Unlike a harness, which drives a single model through its execution loop, an orchestrator manages agents as units, each running its own harness (see Sub-agents below).

Agent

The term comes from reinforcement learning, where an agent is simply a function that takes an observation and returns an action. The environment takes that action and returns a new observation, and the loop repeats. That loop is still at the core of how LLM agents work.
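The classic loop is short enough to write out in full. The sketch below uses a toy counter environment; `agent`, `CounterEnv`, and the action names are hypothetical and chosen only to make the observation-action cycle visible.

```python
def agent(observation: int) -> str:
    """The agent as a plain function: observation in, action out."""
    return "increment" if observation < 3 else "stop"


class CounterEnv:
    """Toy environment: takes an action, updates state, returns an observation."""

    def __init__(self) -> None:
        self.state = 0

    def step(self, action: str) -> int:
        if action == "increment":
            self.state += 1
        return self.state


env = CounterEnv()
observation = env.state
actions = []
while True:
    action = agent(observation)
    actions.append(action)
    if action == "stop":
        break
    # The environment consumes the action and produces the next observation.
    observation = env.step(action)

print(actions, env.state)
```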

In the LLM world, the term has expanded. An agent is a model plus everything around it that lets it act, not just respond. It turns raw text generation into something that can act in a loop: taking in information, deciding what to do, and acting on the results.

Take a coding agent as a concrete example. The system prompt, tool descriptions, and the output format the model follows form the scaffolding. The loop that calls the model, handles its tool calls, and decides when to stop is the harness. At training time, the harness also runs many of these loops in parallel and feeds the results back to update the model.

In the community, this is usually put as Agent = Model + Harness (see @Vtrivedy10 and Will Brown's tweet for reference). If you're not the model, you're the harness. The subtler distinction between harness and scaffold, which creates most of the confusion, is what the Scaffolding and Harness sections above address.

When people talk about products like Claude Code, Codex, or Cursor, they're referring to a specific harness built on top of a specific model, designed and optimized together. Two products using the same underlying model can feel completely different because their harnesses make different choices. And swapping a better model into the same harness also changes the experience. The model, the harness, and the product are three different things.

Context Engineering

Designing what goes into the agent's context window, that is, what the model sees at each step: system prompt, tool descriptions, conversation history, retrieved knowledge. It's not a one-time decision: as the agent runs, previous turns shape what goes into future calls, and the harness actively manages this throughout the run. It applies at both training and inference, but the cost of getting it wrong is very different. At training time, what the model sees shapes what gets learned, so getting it wrong means retraining. At inference, it's just text: change a prompt and redeploy. The HF Context Engineering Course covers this in depth.

Memory is part of this picture. Short-term memory is what stays in the context window during a single run: conversation history, tool results, previous reasoning. Long-term memory persists across sessions, stored externally and retrieved on demand, then injected back into context when relevant.
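One way to picture the two kinds of memory is as two stores that feed the same context. The `Memory` class below is a hypothetical sketch: the sliding window and the keyword lookup are deliberately naive stand-ins for real truncation and retrieval strategies.

```python
class Memory:
    """Toy memory: a sliding window (short-term) plus a persistent store (long-term)."""

    def __init__(self, window: int = 4) -> None:
        self.window = window
        self.short_term: list[str] = []      # lives only for the current run
        self.long_term: dict[str, str] = {}  # would be stored externally in practice

    def add_turn(self, text: str) -> None:
        # Keep only the most recent turns so the context stays bounded.
        self.short_term.append(text)
        self.short_term = self.short_term[-self.window:]

    def remember(self, key: str, fact: str) -> None:
        self.long_term[key] = fact

    def build_context(self, system_prompt: str, query: str) -> list[str]:
        # Naive retrieval (an assumption): inject facts whose key appears in the query.
        retrieved = [fact for key, fact in self.long_term.items() if key in query.lower()]
        return [system_prompt, *retrieved, *self.short_term, query]


memory = Memory(window=2)
memory.remember("python", "User prefers Python 3.12.")
for turn in ["turn 1", "turn 2", "turn 3"]:
    memory.add_turn(turn)

context = memory.build_context("You are a coding agent.", "Which Python version?")
print(context)
```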

Policy

A policy is the behavior an agent follows: given any situation, it defines the probability of taking each possible action. In LLM systems, part of that policy is learned in the model weights, but the behavior also depends on the surrounding scaffolding and harness. The same model can behave very differently depending on its prompts, tools, memory, and execution loop.
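In code, a policy is a mapping from a situation to a probability distribution over actions. The sketch below hard-codes the scores for illustration; in an LLM agent they would come from the model weights as shaped by the scaffolding. The action names and the `policy` function are hypothetical.

```python
import math


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {action: math.exp(score - peak) for action, score in scores.items()}
    total = sum(exps.values())
    return {action: value / total for action, value in exps.items()}


def policy(situation: str) -> dict[str, float]:
    """Toy policy: probability of each possible action given the situation."""
    scores = {"read_file": 1.0, "run_tests": 1.0, "finish": 0.0}
    if "tests failing" in situation:
        scores["run_tests"] = 3.0
    return softmax(scores)


probs = policy("tests failing on main")
print(probs)
```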

A policy is not an agent. The policy defines behavior; the agent is the full system that acts in an environment. Wrap a checkpoint in scaffolding and a harness and deploy it, and you get an agent whose behavior is the policy.

Tool Use

How agents reach outside themselves: APIs, code interpreters, databases, web search, file systems. The model expresses the intent to use a tool in a structured format. Modern inference APIs surface this as a first-class object: the harness receives the call directly and routes it to the right function. The result gets fed back into context and the loop continues.
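The routing step can be sketched as a lookup table plus a dispatcher. The JSON shape used here (`name` and `arguments`) is an assumption for illustration, not the wire format of any specific inference API, and the two tools are toys.

```python
import json


def get_weather(city: str) -> str:
    return f"Sunny in {city}"


def add(a: float, b: float) -> float:
    return a + b


# Registry the harness uses to route a structured call to the right function.
TOOLS = {"get_weather": get_weather, "add": add}


def dispatch(tool_call_json: str) -> str:
    """Execute one structured tool call and return the result as JSON text."""
    call = json.loads(tool_call_json)
    func = TOOLS.get(call["name"])
    if func is None:
        # Report the failure back to the model instead of crashing the loop.
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    return json.dumps({"result": func(**call["arguments"])})


observation = dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}')
print(observation)
```

The returned string is what gets appended to the context before the next model call.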

Skills

Reusable, structured packages of knowledge that enable multi-step tasks. Where a tool is an action ("run this command"), a skill bundles everything needed to accomplish a goal ("investigate this bug, form a hypothesis, write a fix"). They are portable across agents and loaded on demand. The line between tool, skill, and sub-agent shifts across frameworks. The HF Context Engineering Course covers skills in depth.

Sub-agents

An agent called by another agent to handle a specific subtask. It has its own model and scaffold, reasons independently, and returns a result. The calling agent doesn't need to know how it works internally. This is what separates a sub-agent from a tool (a function call) or a skill (packaged knowledge): a sub-agent can itself reason, use tools, and call further sub-agents. The calling agent is sometimes called an orchestrator.
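From the caller's side, a sub-agent looks like an opaque callable: a task goes in, a result comes out. The sketch below is purely illustrative, with hypothetical names, and the sub-agent's internal loop is reduced to two fixed steps.

```python
from typing import Callable

# From the caller's point of view, an agent is just "task in, result out".
Agent = Callable[[str], str]


def research_agent(subtask: str) -> str:
    """Sub-agent: runs its own (here trivial) multi-step loop, then reports back."""
    steps = [f"search({subtask})", f"summarize({subtask})"]
    return f"report on {subtask} after {len(steps)} steps"


def orchestrator(goal: str, sub_agents: dict[str, Agent]) -> str:
    """Calling agent: delegates a subtask and only ever sees the returned result."""
    report = sub_agents["research"](goal)
    return f"plan based on: {report}"


result = orchestrator("flaky test", {"research": research_agent})
print(result)
```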

Training

The terms above apply whether you're training or deploying. The four below are specific to training, where the agent runs through tasks, gets scored, and its model's weights get updated. Every RL training system for LLMs is built around the same pipeline, made up of the four pieces defined next: environment, trainer, rollout, and reward.

RL Environment

The environment is anything you can interact with: a stateful object that takes an action as input, updates its internal state, and returns an observation. In the LLM context, actions are typically tool calls. A filesystem is a simple example: the action touch foo.txt updates the state by creating the file, and the observation might be the updated file listing. Definitions vary across frameworks.
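The filesystem example translates almost directly into code. This is a minimal in-memory sketch, assuming a hypothetical `FileSystemEnv` class with a single `step` method; real frameworks define richer interfaces.

```python
class FileSystemEnv:
    """Toy stateful environment: an in-memory set of file names."""

    def __init__(self) -> None:
        self.files: set[str] = set()

    def step(self, action: str) -> list[str]:
        """Apply an action, update internal state, return the observation."""
        command, _, path = action.partition(" ")
        if command == "touch" and path:
            self.files.add(path)
        elif command == "rm":
            self.files.discard(path)
        # Observation: the updated file listing.
        return sorted(self.files)


env = FileSystemEnv()
observation = env.step("touch foo.txt")
print(observation)
```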

We recently published a dedicated guide on this, so rather than compress it here, see The Ultimate Guide to RL Environments for a complete breakdown of types, frameworks, and examples.

Trainer

The trainer is what makes the agent better: it runs many agent episodes, scores the results, and uses them to update the underlying model's weights. TRL's GRPOTrainer is a concrete example: a single class that handles episode generation, reward scoring, and weight updates.
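To give a feel for how scores become a learning signal, the sketch below computes group-relative advantages, the idea GRPO is built on: each completion is scored relative to the others sampled for the same prompt. This is plain Python written for illustration, not TRL's code, and the exact normalization inside GRPOTrainer may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each completion relative to its group: (reward - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four completions sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that beat the group average get a positive advantage and are reinforced; the rest are pushed down.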

Rollout

A rollout is one full agent run from start to finish: what the agent saw, what it did, and what reward it got at each step. It's also called a trajectory or a trace, depending on the context. This is the raw data RL algorithms learn from.
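As a data structure, a rollout is just an ordered list of steps. The `Step` and `Rollout` classes below are a hypothetical sketch; frameworks differ in which fields they record.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    observation: str  # what the agent saw
    action: str       # what it did
    reward: float     # what reward it got at this step


@dataclass
class Rollout:
    """One full agent run, recorded step by step."""
    steps: list[Step] = field(default_factory=list)

    def total_reward(self) -> float:
        return sum(step.reward for step in self.steps)


rollout = Rollout()
rollout.steps.append(Step("empty repo", "touch foo.txt", 0.0))
rollout.steps.append(Step("foo.txt exists", "run tests", 1.0))
print(len(rollout.steps), rollout.total_reward())
```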

Reward

The score that tells the training algorithm whether the model is getting better. It varies along two axes. By source, it can be verifiable (tests pass/fail, answer matches) or learned (human preferences, LLM-as-judge). By timing, it can be sparse (one score at the end of an episode) or dense (a score at each step). This is what the trainer uses to actually update the underlying model's weights. For a thorough breakdown of each type, see the Reward Architecture section in Adithya's guide.
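The sparse versus dense distinction is easy to see on a single episode. The sketch below assumes a hypothetical coding task where we know how many tests pass after each step; both reward functions are verifiable, and only the timing differs.

```python
def sparse_rewards(passed_per_step: list[int], total_tests: int) -> list[float]:
    """One score at the end of the episode: did all tests pass?"""
    if not passed_per_step:
        return []
    rewards = [0.0] * len(passed_per_step)
    rewards[-1] = 1.0 if passed_per_step[-1] == total_tests else 0.0
    return rewards


def dense_rewards(passed_per_step: list[int], total_tests: int) -> list[float]:
    """A score at each step: the fraction of tests newly passing."""
    rewards, previous = [], 0
    for passed in passed_per_step:
        rewards.append((passed - previous) / total_tests)
        previous = passed
    return rewards


history = [1, 3, 4]  # tests passing after each of three steps, out of 4
print(sparse_rewards(history, 4))
print(dense_rewards(history, 4))
```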

Rubrics break the reward into explicit dimensions with weights, rather than a single number. OpenEnv and Verifiers implement rubrics as objects you can combine (WeightedSum, Sequential, Gate).
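A weighted rubric reduces to a normalized weighted sum over named dimensions. The function below is a standalone sketch and does not use the OpenEnv or Verifiers APIs; the dimension names and weights are made up for illustration.

```python
def weighted_sum(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores into one reward using explicit weights."""
    total_weight = sum(weights.values())
    return sum(weights[dim] * scores[dim] for dim in weights) / total_weight


scores = {"correctness": 1.0, "style": 0.5, "safety": 1.0}
weights = {"correctness": 2.0, "style": 1.0, "safety": 1.0}

reward = weighted_sum(scores, weights)
print(reward)
```

Keeping the dimensions explicit makes it possible to see which part of the rubric a low reward came from.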

Learn More

If any definition feels imprecise or you've encountered a term we've missed, we'd love to hear from you.

Thanks to Pedro Cuenca, Quentin Gallouédec, Shaun Smith, and Adithya S Kolavi for reviewing this post.
