LangChain Blog

Introducing Rubrics: Build Agents that Evaluate and Correct Their Work

Agents are taking on more complex tasks than ever. More often than not, they fall short of the finish line. We've added RubricMiddleware to Deep Agents to fix that: you define a rubric, and the agent self-evaluates and iterates until it satisfies every criterion, or hits a configured cap.

If you're familiar with /goal in Claude Code or Codex, this is a similar pattern. This implementation is a bit more flexible because evaluation is handled by a dedicated grader sub-agent that can call tools, reason over the full transcript, and return per-criterion feedback.

The problem

Some agent tasks have a clear definition of "done." A code refactor is finished when the test suite passes. A report is complete when every required section is covered.

But agents don't always get there on the first try. As context grows, ambiguous instructions, tool misuse, and non-deterministic errors compound: output quality deteriorates, and developers are left to diagnose and re-run tasks manually.

How it works

Before the agent run finishes, a separate grader sub-agent reviews it against the rubric. If everything passes, the run concludes. If anything falls short, the grader's per-criterion feedback is injected back into the conversation and the agent runs again. The loop terminates when the rubric is satisfied, or when a configured iteration limit is hit.

Each run ends with one of four statuses: satisfied, max_iterations_reached, failed, or grader_error.
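To make the control flow concrete, here is a minimal plain-Python sketch of a grade-and-retry loop. It is not the RubricMiddleware implementation: `Verdict`, `run_with_rubric`, `run_agent`, and `grade` are hypothetical names invented for illustration, and the sketch assumes `max_iterations` counts grading passes. It models three of the four statuses (the conditions that produce `failed` are not described in this post, so it is left out).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    """Hypothetical per-criterion result returned by a grader."""
    criterion: str
    satisfied: bool
    feedback: str = ""


def run_with_rubric(
    run_agent: Callable[[List[str]], str],
    grade: Callable[[str, List[str]], List[Verdict]],
    rubric: str,
    max_iterations: int = 5,
) -> Tuple[str, str]:
    """Run the agent, grade its output, and retry with feedback until done.

    Returns (status, last_output). This is an illustrative loop only.
    """
    # Treat each non-empty rubric line as one criterion, dropping the "- " prefix.
    criteria = [line.lstrip("- ").strip() for line in rubric.splitlines() if line.strip()]
    messages: List[str] = []  # feedback injected back into the conversation
    output = ""
    for _ in range(max_iterations):
        output = run_agent(messages)
        try:
            verdicts = grade(output, criteria)
        except Exception:
            # Assumption: any grader failure ends the run with grader_error.
            return "grader_error", output
        unmet = [v for v in verdicts if not v.satisfied]
        if not unmet:
            return "satisfied", output
        # Feed back only the criteria that fell short, one line per criterion.
        messages.append("\n".join(f"{v.criterion}: {v.feedback}" for v in unmet))
    return "max_iterations_reached", output
```

Keeping the agent and the grader as separate callables mirrors the split described above: the agent never grades itself, it only sees the grader's per-criterion feedback on the next pass.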

Wiring it up

Here's the minimal setup, broken into steps. The key idea: define a RubricMiddleware once, attach it to a deep agent, then pass a rubric string at invoke time. (If rubric is absent, the middleware does nothing.)

1) Define `RubricMiddleware`

This middleware adds a grader loop on top of the base agent. The grader is configured with:

model: the LLM used for grading (often smaller/cheaper than your main agent model)
system_prompt: instructions that define the grader’s role and what “good” looks like
tools: optional tools the grader can call to gather evidence (e.g., run tests, lint, validate outputs)
max_iterations: the maximum number of fix → re-grade loops before the run stops

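from deepagents import RubricMiddleware

# `run_test_suite` is a tool you define yourself; its definition is not shown in this post.
rubric_middleware = RubricMiddleware(
    model="anthropic:claude-haiku-4-5",  # grader model
    system_prompt="You are a code reviewer grading generated code against a rubric.",
    tools=[run_test_suite],  # evidence-gathering tools available to the grader
    max_iterations=5,  # cap on fix -> re-grade loops
)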

2) Pass it to a deep agent

Your deep agent should also have its own “operating instructions.” The agent’s system_prompt tells it how to do the work, while the rubric tells the grader how to judge the work.

In the snippet below:

model: the LLM used to generate the solution
system_prompt: coding conventions + constraints for the agent
middleware: attaches rubric_middleware so the agent can be iteratively corrected

from deepagents import create_deep_agent

agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    system_prompt=(
        "You are a careful Python engineer. Write correct, readable code. "
        "Follow the user's instructions exactly."
    ),
    middleware=[rubric_middleware],
)

3) Invoke with a human message + rubric

At invocation time you provide:

messages: the human request (and optionally prior turns)
rubric: a newline-delimited checklist that the grader must mark satisfied

from langchain.messages import HumanMessage

result = agent.invoke(
    {
        "messages": [
            HumanMessage(
                content=(
                    "Write a Python function `find_duplicates(lst)` that returns a list of "
                    "all elements that appear more than once in the input list, in the order "
                    "they first appear."
                )
            )
        ],
        "rubric": (
            "- All tests pass in run_test_suite\n"
            "- The function is named `find_duplicates` and accepts a single list argument\n"
        ),
    },
    config={"configurable": {"thread_id": "code-generation-session"}},
)
print(result["messages"][-1].text)
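If your criteria live in a list rather than a hand-written string, a small helper can produce the newline-delimited checklist shown above. This is only a sketch: `build_rubric` is a hypothetical helper, not part of deepagents, and the "- " prefix simply follows the formatting used in the example.

```python
from typing import Iterable


def build_rubric(criteria: Iterable[str]) -> str:
    """Join criteria into a newline-delimited checklist, one "- item" per line."""
    # Drop blank entries and surrounding whitespace so each line is one criterion.
    cleaned = [c.strip() for c in criteria if c.strip()]
    return "".join(f"- {c}\n" for c in cleaned)


rubric = build_rubric(
    [
        "All tests pass in run_test_suite",
        "The function is named `find_duplicates` and accepts a single list argument",
    ]
)
```

The resulting string can be passed as the "rubric" value at invoke time.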

Rather than asking the grader to reason abstractly about correctness, we give it a run_test_suite tool to verify behavior directly. The grader can call tools to gather hard evidence before producing a verdict, and it falls back to reasoning from the transcript when no tools are provided.

Seeing it in practice

In the code generation example above, the agent's first attempt looked correct but failed one test. The grader returned:

"One test fails: test_unhashable. The function crashes with TypeError when encountering unhashable types like lists within the input list."

The agent revised its implementation and passed all tests on the second iteration. The feedback isn't a generic "try again": each criterion gets its own verdict, so the agent knows exactly what to fix.
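The agent's revised code is not reproduced in the post. For illustration only, here is one way a `find_duplicates` could avoid the TypeError on unhashable elements: it compares items by equality instead of hashing them. This is an assumed implementation, not the one from the trace, and it trades speed (quadratic time) for generality.

```python
from typing import Any, List


def find_duplicates(lst: List[Any]) -> List[Any]:
    """Return elements that occur more than once, in first-appearance order.

    Uses equality comparison only, so unhashable items such as lists work.
    """
    uniques: List[Any] = []  # distinct elements in first-appearance order
    counts: List[int] = []   # counts[i] is how many times uniques[i] occurs
    for item in lst:
        for i, seen in enumerate(uniques):
            if seen == item:
                counts[i] += 1
                break
        else:
            # No earlier element compared equal: this is a first appearance.
            uniques.append(item)
            counts.append(1)
    return [u for u, c in zip(uniques, counts) if c > 1]
```

A set- or dict-based version would be faster for hashable inputs, but it is exactly the kind of implementation that would crash on a test like test_unhashable.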

See the full example in this trace.

Why it matters

Agent outputs are probabilistic: the same prompt can succeed on one run and fall short on the next. RubricMiddleware shifts the burden of catching that variance away from the developer and onto the system.

Instead of manually inspecting outputs and re-running failed tasks, you define what "done" looks like once and the loop handles the rest. Each retry is informed: the grader identifies exactly what's wrong and produces targeted, per-criterion feedback.

The result: more reliable agents on tasks where correctness matters.

Learn more

RubricMiddleware is in beta and the API may change. For a full walkthrough including configuration, observability, and rubric persistence, see the documentation.