LangChain Blog · 2026-04-30

Aligning LLM-as-a-Judge with Human Preferences

Evaluation is the process of continuously improving your LLM application. This requires a way to measure your application’s performance.

LLM applications often produce outputs in natural language, which are difficult to judge using hard-coded rules. For example, attributes like conciseness or correctness relative to a reference output are difficult to express as typical unit tests.

Using an “LLM-as-a-Judge” is a popular way to grade natural language outputs from LLM applications. This involves passing the generated output (and other information) to a separate LLM and asking it to judge the output. Although this has proven useful in several contexts, it raises an interesting problem: you now have to do another round of prompt engineering to make sure the LLM-as-a-Judge is performing well.

LangSmith presents a novel solution to this rising problem. LangSmith evaluators now feature “self-improvement” whereby human corrections to LLM-as-a-Judge outputs are stored as few-shot examples, which are then fed back into the prompt in future iterations.

The net impact is that it is easier to create LLM-as-a-Judge evaluators that accurately reflect your preferences with no prompt engineering, and have them adapt over time as you interact natively with LangSmith.

In this post we will talk about the rise of LLM-as-a-Judge evaluators, some motivating research that led us to this solution, and then deep dive into how exactly this is implemented. If you want to try it out, you can sign up for LangSmith for free here.

LLM-as-a-Judge

It’s often hard to evaluate LLM output programmatically. A big part of this is a lack of good metrics. Sure, if you are doing classification, named entity extraction, or other “traditional” ML tasks, then there are standard ML metrics to use. But if you are doing more “generative” tasks (as most applications are), then there aren’t many great options.

And evaluation is super important! You can’t just launch an application and hope for the best - you should be evaluating its performance on real data, making changes to your app, and then making sure the changes don’t cause regressions. We see builders spending significant time on both the online and offline evaluation stages - and built LangSmith accordingly.
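To make the regression check concrete, here is a minimal sketch in plain Python. It assumes you already have per-example scores for the old and new versions of your app; the function name `find_regressions`, the example IDs, and the scores are all hypothetical and not part of LangSmith.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.0) -> list[str]:
    """Return IDs of examples whose score dropped by more than `tolerance`."""
    return sorted(
        example_id
        for example_id, old_score in baseline.items()
        # Only compare examples that were scored in both runs.
        if example_id in candidate
        and candidate[example_id] < old_score - tolerance
    )


# Hypothetical scores for the same three examples, before and after a change.
baseline_scores = {"q1": 1.0, "q2": 0.5, "q3": 1.0}
candidate_scores = {"q1": 1.0, "q2": 1.0, "q3": 0.0}

regressed = find_regressions(baseline_scores, candidate_scores)
print(regressed)  # ['q3']
```

Comparing example by example, rather than only comparing averages, surfaces cases like `q3` that an improved average would otherwise hide.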

LangSmith is not opinionated about which metrics you use (we see that nearly everyone defines their own custom metric). We've worked with fantastic teams like Elastic and Rakuten and seen firsthand how they do evaluation. One of the things we’ve noticed is a rise in the usage of “LLM-as-a-Judge” evaluators.

An “LLM-as-a-Judge” evaluator is simply an evaluator that uses an LLM to score the output. This is often useful when it’s tough to programmatically evaluate an application and the only other recourse would be human labels. Key use cases we’ve seen involve:

Detecting RAG hallucinations (online evaluation)
Detecting RAG correctness (offline evaluation)
Detecting whether the LLM generated a toxic or inappropriate answer (offline and online evaluation)
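As a rough illustration of what such an evaluator looks like, here is a minimal sketch in Python. It is not LangSmith's implementation: the prompt template, the `judge` function, and the `fake_llm` stand-in (used so the example runs without a model) are all assumptions made for this example.

```python
from typing import Callable

JUDGE_TEMPLATE = """You are grading the output of an application.
Criterion: {criterion}

Input:
{question}

Output to grade:
{answer}

Reply with a single word: PASS or FAIL."""


def judge(question: str, answer: str, criterion: str,
          llm: Callable[[str], str]) -> bool:
    """Ask a separate LLM to grade an answer; True means the answer passed."""
    prompt = JUDGE_TEMPLATE.format(
        criterion=criterion, question=question, answer=answer
    )
    reply = llm(prompt).strip().upper()
    if reply.startswith("PASS"):
        return True
    if reply.startswith("FAIL"):
        return False
    # Judges do not always follow the format, so fail loudly rather than guess.
    raise ValueError(f"Unparseable judge reply: {reply!r}")


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call so the sketch runs offline."""
    return "FAIL" if "idiot" in prompt.lower() else "PASS"


criterion = "The output is not toxic or insulting."
print(judge("How do I reset my password?", "Click 'Forgot password'.", criterion, fake_llm))
print(judge("How do I reset my password?", "Figure it out, idiot.", criterion, fake_llm))
```

In a real setup, `llm` would wrap a call to whichever model you use as the judge.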

Why does this work? If LLMs are generating the answers in the first place, then why does using an LLM to score the results actually work at all?

There are two factors at play. First, during evaluation the LLM may have access to information it didn’t have at the time of generation. For example, when judging RAG correctness, you give the evaluator LLM the ground truth answer and ask it to compare the output against that. The application did not have this information at generation time. Second, judging the correctness of an answer is easier for an LLM than generating a correct answer itself. This “simplifying” of the task makes LLM-as-a-Judge feasible.
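The first factor can be sketched as a prompt builder that takes an optional reference answer. This is a hypothetical helper written for illustration (the name `build_correctness_prompt` and the prompt wording are assumptions); the point is only that the judge's prompt can carry a ground truth that the application never saw.

```python
from typing import Optional


def build_correctness_prompt(question: str, answer: str,
                             reference: Optional[str] = None) -> str:
    """Build a judge prompt; the reference is only known at evaluation time."""
    parts = [
        "Grade whether the submitted answer is correct.",
        f"Question: {question}",
        f"Submitted answer: {answer}",
    ]
    if reference is not None:
        # Extra information the generating LLM did not have.
        parts.append(f"Reference answer: {reference}")
        parts.append("Compare the submitted answer to the reference answer.")
    parts.append("Reply with CORRECT or INCORRECT.")
    return "\n".join(parts)


prompt = build_correctness_prompt(
    question="What is the capital of France?",
    answer="Lyon",
    reference="Paris",
)
print(prompt)
```

Without the reference, the judge has to rely on its own knowledge; with it, the task reduces to a comparison, which is the "simplifying" described above.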

While this process can work well, it has complications. You still have to do another round of prompt engineering for the evaluator prompt, which can be time-consuming and can hinder teams from setting up a proper evaluation system. With LangSmith, we've aimed to streamline this evaluation process.

Motivating research

There were two pieces of motivating research that led us to implement a solution.

The first piece is nothing new: language models are adept at few-shot learning. If you give LLMs examples of things done correctly, they will imitate the correct behavior. This method is widely adopted in our client LLM applications; it's particularly effective in cases where it's tough to explain in instructions how the LLM should behave, and where the output is expected to have a particular format. Evaluations fit both these criteria!

The other piece of research is new: a paper out of Berkeley by Shreya Shankar titled Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. This paper addresses the same problem, and, though it proposes a different solution than ours, it helped motivate our usage of feedback collection as a way to programmatically align LLM evaluations with human preferences.

So - how did we take these two ideas and build our “self-improving” evaluators?

Our solution: Self-improving evaluation in LangSmith

Building on recent research and the widespread adoption of LLM-as-a-Judge evaluators, we've developed a novel "self-improving" system for LangSmith evaluators. This approach aims to streamline the process of aligning LLM evaluations with human preferences, eliminating the need for extensive prompt engineering. Here's how it works:

To start, set up an LLM as a judge (online or offline): Users can easily set up an LLM-as-a-Judge evaluator in LangSmith for either online or offline evaluation. This initial setup requires minimal configuration, as the system is designed to improve over time. When setting it up you can specify how few-shot examples should be formatted into the prompt.
Have it leave feedback: The LLM evaluator provides feedback on the generated outputs, assessing factors such as correctness, relevance, or any other criteria specified as part of the judge in the prior step.
Users can make corrections on that feedback natively in the app: As users review the LLM's evaluations, they can directly modify or correct the feedback within the LangSmith interface. This step is crucial for capturing human preferences and judgments.
These corrections will be stored as few-shot examples: LangSmith automatically stores these human corrections as few-shot examples. This creates a growing dataset of human-aligned evaluations that reflect the specific preferences and standards of your team or application. You can also leave explanations for your corrections as part of this flow.
The next time an evaluator runs, it will pull in those stored examples (and optionally, the explanations) and use them to inform its generation: In subsequent evaluation runs, the system incorporates these stored examples into the prompt for the LLM-as-a-Judge. By leveraging the few-shot learning capabilities of language models, the evaluator becomes increasingly aligned with human preferences over time.
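The steps above can be sketched in a few lines of plain Python. This is an illustration of the idea, not LangSmith's implementation or API: the `Correction` and `SelfImprovingJudge` classes, the prompt layout, and the choice to use the most recent corrections are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Correction:
    """A human-corrected judgment, kept as a few-shot example."""
    output: str
    corrected_score: str
    explanation: Optional[str] = None


@dataclass
class SelfImprovingJudge:
    instructions: str
    max_examples: int = 5  # assumption: cap how many examples go into the prompt
    corrections: list[Correction] = field(default_factory=list)

    def record_correction(self, output: str, corrected_score: str,
                          explanation: Optional[str] = None) -> None:
        """Store a human correction so later runs can use it."""
        self.corrections.append(Correction(output, corrected_score, explanation))

    def build_prompt(self, output: str) -> str:
        """Format stored corrections into the judge prompt as few-shot examples."""
        lines = [self.instructions]
        # Assumption: use the most recent corrections; other selection
        # strategies are possible.
        recent = self.corrections[-self.max_examples:] if self.max_examples > 0 else []
        if recent:
            lines.append("Examples of correct grading:")
        for example in recent:
            lines.append(f"Output: {example.output}")
            lines.append(f"Score: {example.corrected_score}")
            if example.explanation:
                lines.append(f"Explanation: {example.explanation}")
        lines.append(f"Output: {output}")
        lines.append("Score:")
        return "\n".join(lines)


judge = SelfImprovingJudge("Grade each output for conciseness as PASS or FAIL.")
judge.record_correction(
    output="Well, to answer your question, I would say that the answer is 4.",
    corrected_score="FAIL",
    explanation="The filler before the answer is unnecessary.",
)
print(judge.build_prompt("The answer is 4."))
```

Each correction changes the next prompt without anyone editing the instructions, which is the sense in which the evaluator "improves itself".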

For more of a technical walkthrough, see this how-to guide.

This self-improving cycle allows the LLM-as-a-Judge to adapt and refine its evaluations based on real-world feedback, eliminating manual prompt adjustments or time-consuming prompt engineering. Now, teams can focus on reviewing and correcting evaluations when necessary, knowing that their input directly improves the system's performance over time.
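One simple way to check whether the judge is actually getting closer to your preferences is to track how often its scores match the human-reviewed ones. The sketch below is an assumption on our part rather than a LangSmith feature; `agreement_rate` and the labels are hypothetical.

```python
def agreement_rate(judge_scores: list[str], human_scores: list[str]) -> float:
    """Fraction of examples where the judge's score matches the human's."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("Score lists must have the same length.")
    if not judge_scores:
        raise ValueError("Need at least one scored example.")
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)


# Hypothetical labels for four reviewed runs; the human corrected the third.
judge_labels = ["PASS", "FAIL", "PASS", "PASS"]
human_labels = ["PASS", "FAIL", "FAIL", "PASS"]
print(agreement_rate(judge_labels, human_labels))  # 0.75
```

If this number rises across evaluation runs as corrections accumulate, the few-shot examples are doing their job.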

Conclusion

LLM-as-a-Judge evaluators are powerful tools for assessing generative AI systems but have posed new challenges in prompt engineering and human preference alignment. LangSmith's self-improving evaluators provide an elegant solution to this problem, leveraging few-shot learning and user corrections to integrate human feedback for accurate, relevant evaluations without constant manual intervention.

As AI rapidly advances, self-improving evaluators will be crucial in bridging the gap between machine capabilities and human expectations. With LangSmith, we're empowering teams to build, assess, and refine their AI applications with greater confidence and efficiency. If you haven’t already - sign up for LangSmith for free here.
