How Lyft Built a Self-Serve AI Agent Platform with LangGraph and LangSmith

This is a guest post from our friends at Lyft, where the SCX Data Science and MLE team built a multi-agent customer support system that enables non-technical domain experts to ship AI agents. The work was led by Akshay Sharma, Machine Learning Engineer. Thank you for your contribution.

TL;DR

By leveraging LangGraph to orchestrate a sophisticated multi-agent system, Lyft has transformed its customer support operations, managing millions of interactions for riders and drivers. Our "self-serve" platform integrates LangGraph’s subgraph architecture with LangSmith’s robust tracing and monitoring tools, empowering non-technical domain experts to develop and refine AI agents independently. This shift has accelerated agent development from roughly six months to just a few weeks, all while upholding high standards through an automated LLM-as-a-judge evaluation system.

Lyft’s Goal: Speeding Up Agent Iteration, Safely

Across numerous categories including account access, damage claims, charge reviews, and earnings disputes, Lyft's AI Assist manages customer support for riders and drivers. Our journey began in 2023, but the process was labor-intensive; developing each AI agent demanded months of dedicated work from Machine Learning Engineers (MLEs) and engineering teams. Although we successfully launched agents for riders and drivers with increasing efficiency, the overall pace remained a significant bottleneck.

By 2026, our existing operating model faced an unsustainable surge in demand driven by new user segments, additional issue types, autonomous vehicle support, and more. The development cycle relied on a slow, iterative loop: domain experts would define workflow behaviors, which MLEs then translated into tool configurations and prompts. This back-and-forth of reviewing traces, flagging problems, and adjusting code required weeks of collaboration for every single agent. Consequently, those with the deepest understanding of customer issues were unable to implement solutions without a technical middleman.

This led us to a pivotal question: Could we empower ops teams, VoC leads, and product managers to construct and refine agents directly using natural language? Our goal was to eliminate the technical intermediary from the daily iteration process to accelerate learning and deployment. Crucially, this shift toward self-service could not compromise our rigorous standards for experience, accuracy, and safety; every agent still had to match the quality of our manually engineered systems.

Architecture: A Multi-Agent System Built on LangGraph

The Router Multi-Agent Pattern

Our system follows LangGraph's router multi-agent architecture. A meta agent acts as a stateful router: it classifies the incoming request and uses Command(goto=...) to dispatch to the appropriate specialized subagent. Each subagent is a full LangGraph StateGraph, registered as a subgraph node in the meta agent.

We run separate router instances for riders and drivers. When a rider contacts support, the meta agent routes to the rider_intent subagent, which classifies across rider-specific intents (e.g. lost items, charge disputes, trip issues). For drivers, it routes to the driver_intent subagent, which handles driver-specific intents (e.g. earnings, account access, damage claims). If the intent agent determines during a conversation that the user needs a more specialized agent, it uses Command(goto=..., graph=Command.PARENT) to hand control back to the meta agent, which re-routes to the appropriate specialist, for example, jumping from the driver intent agent to the damage claim agent mid-conversation.
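The routing and hand-back flow described above can be modeled in plain Python. This is a minimal sketch of the control flow only: it does not use LangGraph, and the `Handoff` type, the keyword rule, and the agent functions are hypothetical stand-ins for the LLM classifier and for `Command(goto=..., graph=Command.PARENT)`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Handoff:
    """Hypothetical stand-in for handing control back to the parent router."""
    target: str


# A subagent returns either a final reply (str) or a Handoff to another agent.
AgentResult = Union[str, Handoff]


def driver_intent(message: str) -> AgentResult:
    # Toy keyword rule standing in for LLM intent classification.
    if "damage" in message.lower():
        return Handoff(target="damage_claim")
    return "driver_intent: handled general driver question"


def damage_claim(message: str) -> AgentResult:
    return "damage_claim: started a damage claim workflow"


SUBAGENTS: Dict[str, Callable[[str], AgentResult]] = {
    "driver_intent": driver_intent,
    "damage_claim": damage_claim,
}


def meta_agent(message: str, entry: str = "driver_intent", max_hops: int = 3) -> str:
    """Dispatch to a subagent and re-route whenever it hands control back."""
    current = entry
    for _ in range(max_hops):
        result = SUBAGENTS[current](message)
        if isinstance(result, Handoff):
            current = result.target  # the parent re-routes to the specialist
            continue
        return result
    raise RuntimeError("too many handoffs")
```

The `max_hops` bound is a design choice of this sketch, added so that two agents that keep handing off to each other cannot loop forever.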

Each subagent, regardless of specialization, follows a consistent node pattern:

This gives us two important properties. First, safety runs in parallel at every turn: malicious intent detection and safety issue detection execute concurrently via LangGraph's Command(goto=[...]) fan-out before any LLM reasoning happens. Second, subagents are modular and independently deployable: adding a new agent means defining a new subgraph and registering it with the meta agent.

Specialized vs. Configurable Agents

We have two categories of agents:

Specialized agents are hand-built by MLEs for complex, high-stakes workflows. Our damage claim agent, for example, assists with image processing, fraud detection, multi-step classification, and automation calls that are too complex for a low-code approach.

Configurable agents are the self-serve layer. They're initialized at runtime from JSON configuration stored in our internal config service, with prompts pulled from LangSmith's Prompt Hub. A domain expert writes the prompt following our structured template (role, scope, workflow phases, content guidelines), and the ConfigurableAgent class handles the rest: graph construction, tool binding, safety gates, and state management.

# Configurable agents are loaded dynamically at startup
for configurable_agent in load_configurable_agents():
    self.configurable_subagents[configurable_agent.config.intent] = configurable_agent

# Each one registers as a subgraph in the meta agent
for configurable_subagent in self.configurable_subagents.values():
    graph_builder.add_node(
        configurable_subagent.config.intent,
        configurable_subagent.get_state_graph()
    )
    graph_builder.add_edge(configurable_subagent.config.intent, "finalize")

This means a product manager can define a new agent, such as for driver tax questions, by writing a prompt and a JSON config. No MLE code changes are required. The platform handles graph construction, tool execution, checkpointing, tracing, and safety.
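To make the "prompt plus JSON config" idea concrete, here is a minimal sketch of parsing and validating one such config. The post does not describe the real schema, so every field name here (`intent`, `user_type`, `prompt_name`, `tools`) and the example values are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import List


# Hypothetical config shape; the real schema is not described in this post.
@dataclass(frozen=True)
class AgentConfig:
    intent: str        # node name used when registering the subgraph
    user_type: str     # "rider" or "driver"
    prompt_name: str   # identifier of the prompt to pull from the prompt store
    tools: List[str] = field(default_factory=list)


def parse_agent_config(raw: str) -> AgentConfig:
    """Parse and validate one JSON config, failing fast on bad input."""
    data = json.loads(raw)
    missing = {"intent", "user_type", "prompt_name"} - set(data)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if data["user_type"] not in {"rider", "driver"}:
        raise ValueError(f"unknown user_type: {data['user_type']!r}")
    return AgentConfig(
        intent=data["intent"],
        user_type=data["user_type"],
        prompt_name=data["prompt_name"],
        tools=list(data.get("tools", [])),
    )


example = (
    '{"intent": "driver_tax", "user_type": "driver", '
    '"prompt_name": "driver-tax-v1", "tools": ["lookup_tax_docs"]}'
)
config = parse_agent_config(example)
```

Validating at load time means a malformed config fails at startup rather than in the middle of a customer conversation.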

State Persistence with DynamoDB

Multi-turn conversations require durable state. We built a custom DynamoDBSaver that implements LangGraph's BaseCheckpointSaver interface, giving us persistent conversation state across turns without any in-memory assumptions. Each checkpoint stores the full graph state, execution metadata, and parent checkpoint references, enabling conversation replay, debugging, and state inspection in production.
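The value of storing parent references can be shown with a small sketch. This is not the DynamoDBSaver and does not implement LangGraph's checkpointer interface: an in-memory dictionary stands in for the table, and the `Checkpoint` fields and key layout are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Checkpoint:
    checkpoint_id: str
    parent_id: Optional[str]   # None for the first turn of a thread
    state: dict                # full graph state at this point
    metadata: dict             # e.g. which node produced the checkpoint


class InMemoryCheckpointStore:
    """Stand-in for a durable table keyed by (thread_id, checkpoint_id)."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], Checkpoint] = {}

    def put(self, thread_id: str, checkpoint: Checkpoint) -> None:
        self._items[(thread_id, checkpoint.checkpoint_id)] = checkpoint

    def history(self, thread_id: str, checkpoint_id: str) -> List[Checkpoint]:
        """Walk parent references back to the first turn, oldest first."""
        chain: List[Checkpoint] = []
        current: Optional[str] = checkpoint_id
        while current is not None:
            checkpoint = self._items[(thread_id, current)]
            chain.append(checkpoint)
            current = checkpoint.parent_id
        return list(reversed(chain))


store = InMemoryCheckpointStore()
store.put("thread-1", Checkpoint("c1", None, {"turn": 1}, {"node": "router"}))
store.put("thread-1", Checkpoint("c2", "c1", {"turn": 2}, {"node": "driver_intent"}))
replay = [c.state["turn"] for c in store.history("thread-1", "c2")]
```

Because each checkpoint points at its parent, replaying a conversation is a walk up the chain from the latest checkpoint.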

LangSmith: From Tracing to Production Monitoring

Tracing Every Agent Turn

Every agent invocation across all environments (development, staging, production) is traced to LangSmith with LANGSMITH_TRACING=true. Each trace captures the full graph execution: which nodes ran, what the LLM saw, which tools were called, token usage, and latency at every step.

We enrich traces with custom metadata (user type, agent name, intent, conversation ID) using a utility that builds runtime metadata for filtering:

# Metadata flows through to LangSmith for filtering and debugging
tags = build_langsmith_metadata(
    agent_name=self.name,
    user_type=context.user_type,
    interaction_id=context.interaction_id
)

This has been invaluable. When a driver reports a confusing response, we can pull the exact trace, see every node's input/output, identify whether the issue was in intent classification, tool execution, or the final LLM response, and fix it within hours.

LLM-as-a-Judge Evaluation Pipeline

Before any agent rolls out to 100% of traffic, it must pass our evaluation pipeline. The process:

Small production rollout (5–10%) — the agent serves real traffic at low volume.
Sample production traces — we capture real conversations as evaluation datasets.
Run LLM-as-a-Judge evaluators — using a shared judge prompt template from LangSmith's Prompt Hub, extended with agent-specific metrics.

Our baseline evaluation metrics (applied to every agent):

We then add domain-specific metrics for each specialized agent. For example, for the core earnings agent, the judge checks whether the agent followed or deviated from the relevant policies and whether it used any illogical or inconsistent reasoning.

The evaluators run automatically on production traces using LangSmith's multi-turn evaluator, configured with thread filters (e.g., run name is ride_earnings) and sampling rates that start high during initial rollout and taper as confidence grows.
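A tapering sampling rate can be sketched as follows. The schedule (100%, then 50%, then 10%) and the hash-based selection are assumptions made for illustration; the post does not give the actual rates, and in practice the rate is set in the LangSmith evaluator configuration rather than in application code.

```python
import hashlib


def sampling_rate(days_since_launch: int) -> float:
    """Hypothetical taper: evaluate everything at first, then less over time."""
    if days_since_launch < 7:
        return 1.0
    if days_since_launch < 30:
        return 0.5
    return 0.1


def should_evaluate(thread_id: str, rate: float) -> bool:
    """Deterministically pick a stable fraction of threads by hashing the ID."""
    digest = hashlib.sha256(thread_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate
```

Hashing the thread ID instead of drawing a random number keeps the decision stable, so every turn of a given conversation is either evaluated or skipped as a unit.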

Production Monitoring Dashboards

Every agent in production has a cloned LangSmith monitoring dashboard tracking:

Run volume and error rates — are we seeing unexpected spikes or failures?
p50/p95 latency — is the agent responding fast enough for real-time support?
Token usage — are costs within budget?
Tool call success rates — are external API integrations healthy?
LLM-as-a-Judge scores over time — is quality trending up or down?

We also set up PagerDuty alerts triggered by LangSmith metrics. If the error rate exceeds 5% or p95 latency crosses 10 seconds over a 15-minute window, the on-call engineer is paged automatically.
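The paging rule above reduces to a small predicate over the runs in one window. This is a sketch under assumptions: the `Run` record and the nearest-rank percentile are illustrative choices, and the real alerts are driven by LangSmith metrics rather than by code like this.

```python
import math
from dataclasses import dataclass
from typing import List

ERROR_RATE_THRESHOLD = 0.05      # 5%, from the text
P95_LATENCY_THRESHOLD_S = 10.0   # 10 seconds, from the text


@dataclass
class Run:
    latency_s: float
    is_error: bool


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def should_page(window: List[Run]) -> bool:
    """Return True if the runs in one 15-minute window breach either threshold."""
    if not window:
        return False
    error_rate = sum(r.is_error for r in window) / len(window)
    latency_p95 = p95([r.latency_s for r in window])
    return error_rate > ERROR_RATE_THRESHOLD or latency_p95 > P95_LATENCY_THRESHOLD_S
```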

An example chart of error rate by tool (part of monitoring dashboard in production)

LLM Judge evaluation with custom (agent-specific) metrics running in production. Tip: Use binary outputs (True/False or Pass/Fail) instead of numeric scores, which are inaccurate and non-actionable.
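One practical benefit of binary verdicts is that they aggregate into a pass rate per metric with no calibration step. A minimal sketch, with hypothetical metric names:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rates(verdicts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (metric, passed) pairs into a pass rate per metric."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for metric, ok in verdicts:
        total[metric] += 1
        passed[metric] += int(ok)
    return {metric: passed[metric] / total[metric] for metric in total}


# Hypothetical metric names, for illustration only.
verdicts = [
    ("followed_policy", True),
    ("followed_policy", False),
    ("followed_policy", True),
    ("followed_policy", True),
    ("consistent_reasoning", True),
    ("consistent_reasoning", True),
]
rates = pass_rates(verdicts)
```

A failing verdict also points at a specific trace to inspect, which is harder to say of a score that moved from 3.8 to 3.6.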

The Hard Lesson: Prompt Quality Is the Bottleneck, Not Infrastructure

When we first opened agent building to non-technical teammates, we assumed the hardest part would be the platform itself: getting tool bindings right, handling edge cases in the graph, and managing state. We were wrong.

The hardest part was prompt quality. Domain experts knew their issue types deeply but didn't always know how to translate that knowledge into instructions an LLM would follow reliably. We saw agents that handled the happy path beautifully but fell apart on edge cases. A prompt might define what the agent should do when a driver disputes a fare, but say nothing about what happens when the driver changes topic mid-conversation. Or the tone section would say "be empathetic" without specifying what that actually means, so the LLM would interpret it differently every time.

The failure modes were surprisingly consistent: missing out-of-scope definitions (so the agent tried to answer questions it had no tools for), ambiguous branching logic (phases with no explicit entry or exit conditions), and vague content guidelines that sounded good on paper but gave the LLM too much room to improvise.

We attacked this on two fronts.

First, a structured prompt writing framework. We created a template with five required components: identity (who is this agent, what user type, what topic area), primary objective (concrete verbs, not vague "help" or "handle"), scope (both in-scope AND out-of-scope with explicit routing actions), phased workflow (numbered steps with entry conditions, branching for every if/else, and a terminal action for every phase), and content guidelines (concrete do/don't rules with example phrases, not abstract principles). We paired this with a review checklist that every prompt must pass before activation, things like "does every phase have an exit?" and "are there instructions for what to do when a tool is unavailable?"
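Part of such a checklist can be automated. The sketch below only checks that each of the five components appears as a heading line. The heading spellings and the "Name:" convention are assumptions, since the post does not show the template itself.

```python
from typing import List

# The five components named above; the heading spellings are an assumption.
REQUIRED_SECTIONS = [
    "Identity",
    "Primary Objective",
    "Scope",
    "Phased Workflow",
    "Content Guidelines",
]


def missing_sections(prompt: str) -> List[str]:
    """Return required sections that have no 'Name:' heading line in the prompt."""
    headings = {
        line.strip().rstrip(":").lower()
        for line in prompt.splitlines()
        if line.strip().endswith(":")
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


draft = "Identity:\nYou are a driver tax agent.\nScope:\nTax documents only.\n"
missing = missing_sections(draft)
```

A presence check like this catches only omissions. Questions such as "does every phase have an exit?" still need a human reviewer or an LLM-based rule.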

Second, automated prompt validation. We're building a Git-backed prompt linting pipeline that runs before any prompt reaches production. When a domain expert finishes writing a prompt in our builder UI, it opens a pull request against our config repository. A CI pipeline then runs two layers of checks: fast static rules (catching malformed template variables, duplicate intent slugs, spelling errors) followed by LLM-powered rules that detect prompt injection vulnerabilities, contradictory instructions, and structural dead-ends where a conversation flow has no way out. All violations block the merge. The author gets inline feedback in the UI and can fix issues themselves without pulling in an MLE.
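The static layer can be sketched with two of the rules mentioned above. This is an illustration rather than Lyft's pipeline: it assumes template variables are written as `{snake_case_name}` and that each entry is a dictionary with `intent` and `prompt` keys.

```python
import re
from collections import Counter
from typing import Dict, List

# Assumption: variables are written as {snake_case_name}.
VALID_VARIABLE = re.compile(r"\{[a-z_][a-z0-9_]*\}")


def has_malformed_variables(prompt: str) -> bool:
    """True if any brace remains after removing well-formed variables."""
    remainder = VALID_VARIABLE.sub("", prompt)
    return "{" in remainder or "}" in remainder


def lint(entries: List[Dict[str, str]]) -> List[str]:
    """Run static checks over a list of {'intent': ..., 'prompt': ...} entries."""
    violations: List[str] = []
    counts = Counter(entry["intent"] for entry in entries)
    for intent, count in sorted(counts.items()):
        if count > 1:
            violations.append(f"duplicate intent slug: {intent}")
    for entry in entries:
        if has_malformed_variables(entry["prompt"]):
            violations.append(f"malformed template variable in: {entry['intent']}")
    return violations


entries = [
    {"intent": "driver_tax", "prompt": "Hello {driver_name}"},
    {"intent": "driver_tax", "prompt": "Hello {driver name"},
    {"intent": "lost_item", "prompt": "Item: {item}"},
]
violations = lint(entries)
```

In a CI job, a non-empty `violations` list would fail the check and block the merge, matching the "all violations block the merge" policy described above.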

The key insight behind all of this: treat prompts like product specs, not code comments. The more explicit the prompt, the more consistent the agent. And the earlier you catch quality issues, ideally before a single real customer ever sees the output, the faster the whole system improves.

Results

Since launching the self-serve agent platform:

Agent development time: Reduced from ~6 months (first driver agent) to ~2 weeks for new configurable agents.
Agent coverage: A growing number of configurable agents in production covering multiple issue types, alongside several specialized agents.
Evaluation coverage: 100% of production agents have automated LLM-as-a-Judge pipelines running on live traces.
Quality: Hallucination and contradiction rates have decreased by 20% thanks to hallucination guardrails we set up based on LangSmith evaluation metrics.
Operational efficiency: Many non-engineering team members are now building and iterating on agents independently.
AI Resolution Rate: Up by 16% since we launched a few agents using our self-serve platform.

What's Next

We're looking at several areas to push this platform further:

Completing the prompt linting pipeline — the Git-backed CI validation described above is actively in development. Once fully rolled out, every configurable agent prompt will pass through automated static and LLM-powered checks before it can reach production, with zero manual MLE review needed for common errors.
Mocking and simulation infrastructure — building a simulation layer that lets agent builders test against synthetic conversations and mocked tool responses before deploying to real traffic, dramatically shortening the feedback loop for new agents.
Pairwise evaluation — using LangSmith's Pairwise Annotation Queues to A/B test prompt revisions with human reviewers before shipping.
Expanding to more geographies and user types — bringing the platform to Freenow customers in Europe and autonomous vehicle support scenarios.
Deeper eval automation — moving from sampled evaluation to continuous scoring on all production traces, with automatic prompt degradation alerts.
