Introducing LangSmith Engine

Today we're launching LangSmith Engine.

Until now, improving your agent has been a manual process of reading traces, looking for patterns, writing evals, and creating fixes. Now LangSmith Engine can run that cycle for you. It watches your production traces, clusters failures into named issues, diagnoses root causes against your code, and proposes fixes and eval coverage to keep regressions from coming back. You just review and merge improvements.

LangSmith Engine is available today in public beta.

Try LangSmith Engine

Every agent team runs the same cycle

The typical agent development cycle looks like this:

1. Trace your agent to understand what it's doing
2. Identify patterns in failures or gaps in functionality
3. Make changes to prompts, tools, logic, or structure
4. Create ground truth datasets from production traces
5. Run experiments to confirm improvements and check for regressions
6. Ship and repeat

LangSmith already gives you trace views, fast dataset creation, and experimentation to support each step. But we kept hearing the same friction points from customers:

- Knowing what to fix is hard because individual trace review doesn't reveal patterns.
- Seeing how often an error recurs across traces is difficult at scale.
- Creating ground truth examples for offline evals from production data is tedious and easy to skip.
- Once a fix ships, there's often no targeted evaluator in place to catch the same problem if it comes back.

Engine works across the entire cycle. Teams see clustered failures in a prioritized list, get fixes drafted automatically, and have offline eval examples proposed for their test suite.

What an issue looks like in practice

Say your agent is a customer support bot. Engine detects a cluster of traces where users ask about canceling their subscription. Your agent responds, but online evals score the responses as failures and user feedback is negative. Latency is normal, so no system alert fired.

Engine surfaces this as a single named issue, "Agent fails to handle subscription cancellation requests accurately." It shows you the severity (high, affecting 12% of support sessions this week), the timeline (started four days ago, correlating with a recent deployment), and links to the specific traces as evidence.
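To make the shape of an issue concrete, here is a minimal Python sketch of the fields described above. Everything in it is an assumption for illustration: the `Issue` class, the severity thresholds, the session counts, and the trace IDs are hypothetical and are not Engine's actual data model or rules.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Issue:
    """Hypothetical record for one clustered issue (not a LangSmith type)."""
    name: str
    affected_sessions: int
    total_sessions: int
    first_seen: date
    trace_ids: list[str] = field(default_factory=list)

    @property
    def affected_share(self) -> float:
        # Fraction of sessions in the window that hit this issue.
        if self.total_sessions == 0:
            return 0.0
        return self.affected_sessions / self.total_sessions

    def severity(self, high: float = 0.10, medium: float = 0.02) -> str:
        # Thresholds are illustrative assumptions, not Engine's rules.
        share = self.affected_share
        if share >= high:
            return "high"
        if share >= medium:
            return "medium"
        return "low"


today = date(2025, 1, 10)  # fixed date so the example is reproducible
issue = Issue(
    name="Agent fails to handle subscription cancellation requests accurately",
    affected_sessions=120,   # made-up counts that give the 12% in the text
    total_sessions=1000,
    first_seen=today - timedelta(days=4),
    trace_ids=["trace-001", "trace-002"],  # placeholder IDs
)
days_open = (today - issue.first_seen).days
print(f"{issue.severity()} ({issue.affected_share:.0%}), started {days_open} days ago")
```

With these made-up counts, the summary line reads "high (12%), started 4 days ago", mirroring the severity and timeline in the example.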

With your repository connected, Engine reads the relevant code and identifies the root cause: the cancellation tool description is ambiguous, causing the agent to attempt cancellation when users are only asking about their options. Engine drafts a PR with a targeted fix to the tool description.
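As a rough illustration of what a fix to a tool description could look like, the sketch below diffs an ambiguous description against a clarified one using Python's standard `difflib`. The tool name `cancel_subscription`, the file name `tools.py`, and both description strings are hypothetical; the actual change Engine drafts depends on your code.

```python
import difflib

# Hypothetical "before": the description doesn't say when NOT to call the tool.
before = [
    'name = "cancel_subscription"',
    'description = "Use this tool for subscription cancellation."',
]

# Hypothetical "after": the description separates acting from answering questions.
after = [
    'name = "cancel_subscription"',
    'description = (',
    '    "Cancel the subscription. Call ONLY after the user explicitly confirms. "',
    '    "Do NOT call when the user is only asking about options or policies."',
    ')',
]

# Produce a unified diff, similar in shape to what a reviewer sees in a PR.
diff = list(
    difflib.unified_diff(
        before, after, fromfile="a/tools.py", tofile="b/tools.py", lineterm=""
    )
)
print("\n".join(diff))
```

The change is confined to the description text, which is what makes this kind of fix quick to review.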

To keep tracking this behavior going forward, Engine proposes a custom online evaluator scoped to this exact issue, so if the failure pattern recurs after the fix ships, the issue gets resurfaced automatically with updated details.
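A narrowly scoped evaluator for this issue could be sketched as below. This is a keyword heuristic written for brevity, not the evaluator Engine would propose (which may well be model-based); the function name, the phrase list, the tool name, and the result shape are all assumptions.

```python
from typing import TypedDict


class EvalResult(TypedDict):
    key: str
    score: int
    comment: str


# Phrases suggesting the user wants information, not an action (illustrative).
INFO_PHRASES = ("what are my options", "how do i", "what happens if", "can i pause")


def premature_cancellation(user_message: str, tool_calls: list[str]) -> EvalResult:
    """Fail (score 0) when the agent cancels in response to an informational question."""
    asked_for_info = any(p in user_message.lower() for p in INFO_PHRASES)
    cancelled = "cancel_subscription" in tool_calls  # hypothetical tool name
    failed = asked_for_info and cancelled
    return {
        "key": "premature_cancellation",
        "score": 0 if failed else 1,
        "comment": "cancelled on an informational question" if failed else "ok",
    }


result = premature_cancellation(
    "What are my options if I cancel?", ["cancel_subscription"]
)
print(result["score"], result["comment"])
```

Because the check targets one failure pattern, a score of 0 after the fix ships is a direct signal that the same issue has come back.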

Engine also pulls the failing traces into a dataset for your offline eval suite, with per-example criteria that define what the correct output should contain. The failures that made it to production become the test cases that keep them out.
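One way to picture per-example criteria is the sketch below, which turns a failing trace into a dataset example and checks outputs against it. The `FailingTrace` and `DatasetExample` classes are hypothetical, and "criteria" is simplified here to phrases the output must contain; real criteria may be richer than substring checks.

```python
from dataclasses import dataclass


@dataclass
class FailingTrace:
    trace_id: str
    user_input: str
    agent_output: str


@dataclass
class DatasetExample:
    inputs: dict[str, str]
    must_contain: list[str]   # simplified stand-in for per-example criteria
    source_trace_id: str


def to_example(trace: FailingTrace, must_contain: list[str]) -> DatasetExample:
    # Keep the production input and record which trace the example came from.
    return DatasetExample({"question": trace.user_input}, must_contain, trace.trace_id)


def passes(output: str, example: DatasetExample) -> bool:
    # An output passes only if it mentions every required phrase.
    text = output.lower()
    return all(phrase.lower() in text for phrase in example.must_contain)


trace = FailingTrace(
    "trace-001",  # placeholder ID
    "What are my options if I want to cancel?",
    "Your subscription has been cancelled.",
)
example = to_example(trace, must_contain=["pause", "downgrade"])

original_ok = passes(trace.agent_output, example)
fixed_ok = passes("You can pause or downgrade your plan instead of cancelling.", example)
print(original_ok, fixed_ok)
```

The original production output fails the example and a corrected answer passes, which is the property that lets the example guard against the same regression later.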

That's the full cycle, run autonomously and surfaced for your review. Production signal becomes a clustered issue, then a diagnosed root cause, a proposed fix, and eval coverage.

What Engine does about each issue

For every issue it surfaces, Engine proposes three resolution actions.

Open a PR. With repository access, Engine drafts a targeted code or prompt change and opens it against your repo. You review and merge.

Create a custom online evaluator. Engine proposes an evaluator scoped to the exact problem. If the evaluator fires again, the issue gets resurfaced automatically with updated details.

Add to your offline eval suite. Engine pulls the failing production traces into a dataset of ground truth examples, ready to run in your offline eval suite.

Every resolved issue improves your eval coverage along the way. When you confirm a fix, you're also generating an evaluator that monitors performance going forward. Over time, the issues you've already resolved make your eval suite more complete, which makes future improvements more robust.

How Engine works

LangSmith Engine is powered by a deep agent that has access to your trace data, evaluator feedback, and your agent's source code (if connected to your repo).

It monitors traces for several signal types: explicit errors (tool call failures, timeouts), online evaluator failures, trace anomalies (latency spikes, token blowouts, unexpected step counts), negative user feedback, and unusual behaviors like users asking questions the agent wasn't built to answer. When Engine spots a pattern across multiple traces, it clusters them into a single named issue rather than surfacing each failure individually.
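The idea of collecting signals per trace and then grouping traces that share a pattern can be sketched as follows. This is a deliberately simple stand-in: it groups on an exact (topic, signals) key, whereas Engine's actual clustering is not described here. The `Trace` fields, the pre-labeled `topic`, the latency limit, and the minimum cluster size are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    trace_id: str
    topic: str                          # assumed to be labeled upstream
    error: Optional[str] = None         # explicit error message, if any
    eval_passed: Optional[bool] = None  # online evaluator result, if any
    latency_ms: float = 0.0
    feedback: Optional[int] = None      # +1 / -1 user feedback, if any


def signals(t: Trace, latency_limit_ms: float = 10_000.0) -> tuple[str, ...]:
    """Return the signal types present on a single trace."""
    found = []
    if t.error is not None:
        found.append("explicit_error")
    if t.eval_passed is False:
        found.append("evaluator_failure")
    if t.latency_ms > latency_limit_ms:
        found.append("latency_spike")
    if t.feedback is not None and t.feedback < 0:
        found.append("negative_feedback")
    return tuple(found)


def cluster(traces: list[Trace], min_size: int = 2) -> dict[tuple[str, tuple[str, ...]], list[str]]:
    """Group traces sharing a topic and signal set; drop one-off failures."""
    groups: dict[tuple[str, tuple[str, ...]], list[str]] = defaultdict(list)
    for t in traces:
        found = signals(t)
        if found:
            groups[(t.topic, found)].append(t.trace_id)
    return {key: ids for key, ids in groups.items() if len(ids) >= min_size}


traces = [
    Trace("t1", "cancellation", eval_passed=False, feedback=-1),
    Trace("t2", "cancellation", eval_passed=False, feedback=-1),
    Trace("t3", "billing", error="tool call timed out"),
    Trace("t4", "cancellation", eval_passed=True, feedback=1),
]
issues = cluster(traces)
print(issues)
```

Here the two failing cancellation traces collapse into one group, the healthy trace is ignored, and the single billing error stays below the minimum cluster size, so it is not surfaced as a pattern.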

LangSmith Engine is built on top of LangSmith's existing tracing and evaluation infrastructure. It uses your existing evaluator results as inputs, so failures your evals catch feed directly into issue detection. When Engine proposes a new evaluator, it's because it detected a gap in your current coverage. When it creates a dataset example, it goes directly into your existing offline eval workflow.

What customers are seeing

Teams like Cogent, Harmonic, and Campfire have already used Engine to resolve issues affecting thousands of traces. They're catching regressions earlier, shipping fixes faster, and spending less time on triage.

"We love it so far. Our deepagent traces can contain dozens or hundreds of turns, which makes review and identifying patterns tedious. LangSmith Engine saves our team hours of digging by not only identifying emerging failure modes, but also proactively suggesting evals and code changes to resolve them quickly."

- Austin Berke, Founding Eng @ Harmonic

Where this is going

The agent development lifecycle has been manual for too long, and we're working toward a future where more of it runs continuously without manual triggers, where well-understood issue types resolve without human review, and where the harness gets smarter about your specific agent over time. LangSmith Engine is the first step.

Get started

LangSmith Engine is available now in public beta. Connect a tracing project, optionally connect your repo, and Engine will begin surfacing issues from your production traces automatically.


#langsmith #production-monitoring #error-clustering #llm-agents #automated-fixes #observability