Hugging Face Blog · 2026-05-06

vLLM V0 to V1: Correctness Before Corrections in RL

PipelineRL uses vLLM as the inference engine for rollout generation. The inference engine samples tokens and returns token logprobs; the trainer uses those logprobs to compute policy ratios, KL, clip rate, entropy, and reward. Any discrepancy in how those logprobs are computed can change the training dynamics. This is the train-inference mismatch we needed to eliminate during the vLLM V0 to V1 migration.

TL;DR. vLLM V1 matched our vLLM V0 reference after we fixed four things: processed rollout logprobs, V1-specific runtime defaults, the inflight weight-update path, and the fp32 lm_head used for the final projection. We fixed the backend behavior before changing the RL objective.

The reference run used vLLM 0.8.5; the V1 runs used vLLM 0.18.1. Figure 1 shows the final result. The red run is the initial V1 attempt, and the green run is the final V1 run after the fixes described below.

Figure 1. Trainer-side metrics for the vLLM V0 reference (blue), the initial vLLM V1 attempt (red), and the final vLLM V1 run after our fixes (green), including the fp32 `lm_head`. The final V1 run returns close to the V0 trajectory across clip rate, KL, entropy, and reward.

Migration Objective

vLLM V1 is a substantial rewrite of the V0 engine. Our migration target was therefore deliberately narrow:

verify that V1 returned rollout logprobs in the form the trainer expected
rerun the same workload against the V0 reference
evaluate objective-level changes only after backend parity was restored

The first visible symptoms appeared in:

clamp_log_ratio_new_old_indicator
kl_new_old
entropy
reward

Those metrics came from a GSPO training run, the objective used for this experiment. The same class of mismatch can surface in PPO, GRPO, or any online RL system that treats rollout-side logprobs as part of the optimization target.
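
To make the link between logprobs and these metrics concrete, here is a minimal token-level sketch of how a gap between trainer-side and rollout-side logprobs turns into a ratio, a clip rate, and a KL estimate. The function name, the `clip_eps` value, and the choice of KL estimator are assumptions for illustration; they are not PipelineRL's metric definitions, and GSPO itself defines its ratio at the sequence level.

```python
import math


def mismatch_metrics(trainer_logprobs, rollout_logprobs, clip_eps=0.2):
    """Summarize the trainer/rollout logprob gap on the sampled tokens.

    Illustrative only: the names and the estimator are assumptions,
    not PipelineRL's metric code.
    """
    if len(trainer_logprobs) != len(rollout_logprobs) or not trainer_logprobs:
        raise ValueError("need two non-empty lists of equal length")

    log_ratios = [t - r for t, r in zip(trainer_logprobs, rollout_logprobs)]
    ratios = [math.exp(d) for d in log_ratios]
    n = len(ratios)

    # Fraction of tokens whose ratio falls outside [1 - eps, 1 + eps].
    clip_rate = sum(1 for x in ratios if abs(x - 1.0) > clip_eps) / n

    # Non-negative per-token KL estimate on rollout samples: (r - 1) - log r.
    kl = sum((x - 1.0) - d for x, d in zip(ratios, log_ratios)) / n

    return {"mean_ratio": sum(ratios) / n, "clip_rate": clip_rate, "kl": kl}
```

With identical logprobs on both sides, the ratio is exactly 1 and both clip rate and KL are zero; any backend-induced offset shows up in all three numbers before the objective has done anything.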

The initial V1 run showed the problem clearly. The trainer-side logprobs and reward moved away from the V0 reference early in training.

Figure 2. Current-policy logprobs computed by the trainer during updates (left) and reward (right). The initial vLLM V1 run (red) separates from the vLLM V0 reference (blue).

The same pattern appears in the trainer metrics. Clip rate is the easiest signal to read in the initial comparison.

Figure 3. Trainer-side metrics for the vLLM V0 reference (blue) and the initial vLLM V1 attempt (red). Clip rate tracks the rollout/trainer policy gap; entropy and reward show how that gap propagates into training.

Failure Modes

We separated the possible causes into three layers:

Semantic mismatch: the backend returns logprobs whose meaning differs from what the trainer expects.
Inference-path mismatch: the backend uses different runtime defaults for caching, scheduling, or request handling, so the same prompts follow a different execution path.
Objective mismatch: the RL objective needs correction for the amount of staleness or backend mismatch that remains.

We suspected the third category too early. The useful diagnosis came from treating the first two as backend behavior problems and ruling them out first.

V1 Backend Fixes

Logprob Semantics

The first issue was semantic. vLLM V1 returns logprobs from the raw model outputs by default, before logits post-processing such as temperature scaling, penalties, and top-k/top-p filtering. PipelineRL expected logprobs from the processed distribution used by the sampler.
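
The difference between the two conventions can be sketched in a few lines of plain Python. This is a toy model of the semantics, assuming only temperature scaling and top-k as post-processing steps; the function names are illustrative and this is not vLLM's sampler code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def raw_logprobs(logits):
    """Logprobs of the unmodified model distribution."""
    return log_softmax(logits)


def processed_logprobs(logits, temperature=1.0, top_k=None):
    """Logprobs of the distribution the sampler actually draws from (toy version)."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        # Keep the top_k largest logits; ties at the cutoff are all kept.
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    return log_softmax(scaled)
```

For any temperature other than 1.0, or any active filter, the two functions disagree on the sampled token's logprob, which is the kind of systematic offset a trainer would see if it assumed one convention and received the other.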

The required setting was:

logprobs-mode=processed_logprobs

This removed the obvious mean offset in rollout logprobs. The training curves still showed a gap relative to the known-good reference, so the next issue had to be in the inference path.

The policy-ratio plot shows this directly. Once processed_logprobs is on for V1, the mean policy ratio stays centered extremely close to 1.0 across all three runs. That establishes the mean-bias fix. The remaining mismatch shows up in clip rate, KL, entropy, and downstream training behavior.

Figure 4. Per-step deviation of the rollout/trainer policy ratio from 1.0, scaled by 10,000, for the vLLM V0 reference (blue), the initial vLLM V1 run (red), and the corrected vLLM V1 run (green).

Runtime Defaults

The early V1 run changed more than the engine version. It also picked up V1 runtime defaults and an ad-hoc override:

prefix caching, left unset in the early run so the vLLM 0.18.1 default applied
async scheduling, left unset in the early run so the vLLM 0.18.1 default applied
an ad-hoc disable-cascade-attn override that was set through launch-time kwarg passthrough and is not part of the parity recipe in the committed config

For the parity run, we made these choices explicit:

vllm_config:
  use_v1: true
  vllm_kwargs:
    logprobs-mode: processed_logprobs
    enable-prefix-caching: false
    async-scheduling: false
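
One way to keep such settings from silently falling back to engine defaults is a small guard at launch time. The sketch below is hypothetical: `PARITY_KEYS` and `unset_parity_keys` are names introduced here for illustration, and the dictionary simply mirrors the `vllm_kwargs` block above.

```python
# Keys from the parity config above that must never be left to engine defaults.
PARITY_KEYS = ("logprobs-mode", "enable-prefix-caching", "async-scheduling")


def unset_parity_keys(vllm_kwargs, required=PARITY_KEYS):
    """Return the parity-critical keys that the config does not set explicitly."""
    return [key for key in required if key not in vllm_kwargs]


# An early-run style config that leaves everything unset.
early_kwargs = {}

# The explicit parity config from the text.
parity_kwargs = {
    "logprobs-mode": "processed_logprobs",
    "enable-prefix-caching": False,
    "async-scheduling": False,
}
```

A launcher could refuse to start a parity run when `unset_parity_keys(...)` returns a non-empty list.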

Prefix caching deserves a separate note. It is normally a correctness-preserving inference optimization for a fixed model state. In this online RL setup, it was a V1-only difference in cache lifetime and reuse relative to the V0 reference path. The actor was also handling repeated prefixes, concurrent requests, async scheduling, and inflight weight updates.

A prefix-cache hit can reuse state computed before a weight update when the cache policy ignores the weight-update boundary. Disabling prefix caching removed one V1-only degree of freedom from the parity comparison.
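
The stale-reuse scenario can be illustrated with a toy cache. This is a deliberately simplified model, not vLLM's prefix cache: `ToyPrefixCache`, `fake_prefill`, and the integer weight version are all assumptions made for the example.

```python
class ToyPrefixCache:
    """Toy prefix cache keyed by prefix, optionally also by weight version."""

    def __init__(self, version_aware):
        self.version_aware = version_aware
        self._store = {}

    def _key(self, prefix, weight_version):
        return (prefix, weight_version) if self.version_aware else prefix

    def get_or_compute(self, prefix, weight_version, compute):
        key = self._key(prefix, weight_version)
        if key not in self._store:
            self._store[key] = compute(prefix, weight_version)
        return self._store[key]


def fake_prefill(prefix, weight_version):
    # Stand-in for cached prefix state: record which weights produced it.
    return {"prefix": prefix, "computed_with": weight_version}


# A cache that ignores the weight-update boundary serves state from version 0
# to a request that arrives after the update to version 1.
naive = ToyPrefixCache(version_aware=False)
naive.get_or_compute((1, 2, 3), 0, fake_prefill)
stale = naive.get_or_compute((1, 2, 3), 1, fake_prefill)
```

Here `stale["computed_with"]` is still 0 after the update, while a version-aware cache would recompute. Turning the cache off sidesteps the question for a parity comparison.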

Inflight Weight Updates

Weight synchronization also had to match the online-RL update model. One option was to make V1 stricter than V0 by draining requests and clearing caches at every update. That would answer a separate question. We first needed to verify that V1 could match the existing V0 behavior.

What V0 effectively did was closer to:

block execution at an engine boundary
load the new weights
resume without an explicit cached-state invalidation

The nearest V1 analogue was:

await engine.pause_generation(mode="keep", clear_cache=False)
await engine_client.collective_rpc_async(
    "receive_weight_update",
    args=(request.model_dump_json(),),
)
await engine.resume_generation()

Two details matter:

mode="keep" matches the old inflight update model more closely than wait or abort
clear_cache=False matches the V0 wrapper behavior, which left cached state intact on update
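
The ordering of the three calls can be exercised without a real engine. The sketch below uses a hypothetical `StubEngine` that only records calls; its method names mirror the snippet above, but the signatures are assumptions, and a single stub stands in for both `engine` and `engine_client`. The `try/finally` is a design choice added here, not something the original snippet shows.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in that records call order; not the vLLM client."""

    def __init__(self):
        self.calls = []

    async def pause_generation(self, mode, clear_cache):
        self.calls.append(("pause", mode, clear_cache))

    async def collective_rpc_async(self, method, args=()):
        self.calls.append(("rpc", method))
        if args and args[0] == "bad":
            raise RuntimeError("weight update failed")

    async def resume_generation(self):
        self.calls.append(("resume",))


async def update_weights(engine, payload):
    """Pause, push new weights, and resume, keeping inflight requests and caches."""
    await engine.pause_generation(mode="keep", clear_cache=False)
    try:
        await engine.collective_rpc_async("receive_weight_update", args=(payload,))
    finally:
        # Resume even if the update raises, so the server does not stay paused.
        await engine.resume_generation()
```

Whether resuming after a failed update is the right policy depends on the system; the point of the sketch is only that pause and resume bracket the update and that no cache invalidation happens in between.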

Lag was a useful runtime diagnostic. The initial V1 path carries more persistent lag later in training than the corrected V1 run.

Figure 5. Number of steps the weights in the rollout server are behind the trainer policy, for the vLLM V0 reference (blue), the initial vLLM V1 run (red), and the corrected vLLM V1 run (green).
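
As a rough sketch of how such a lag diagnostic could be computed, assume each rollout sample is tagged with the integer version of the weights that generated it. The function and argument names below are hypothetical.

```python
def weight_lag(trainer_step, rollout_weight_versions):
    """Mean and max number of steps the rollout weights trail the trainer.

    Assumes each sample carries the weight version that produced it.
    """
    if not rollout_weight_versions:
        raise ValueError("need at least one sample")
    lags = [trainer_step - version for version in rollout_weight_versions]
    if any(lag < 0 for lag in lags):
        raise ValueError("rollout weights cannot be newer than the trainer step")
    return {"mean": sum(lags) / len(lags), "max": max(lags)}
```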

The Remaining Gap: fp32 lm_head

The V1 backend fixes above removed the obvious migration issues, but final parity still required matching the numerical path used to compute logits. The trainer used an fp32 lm_head for the final projection. The rollout backend had to match that behavior.

A closely related issue appears in the MiniMax-M1 technical report: their RL run showed a training/inference token-probability mismatch that they traced to the LM output head and fixed by computing the head in fp32.

This matters because the RL update consumes token logprobs directly. Small changes in logits can become visible in policy ratios, KL, and clipping. The final projection precision is therefore part of the correctness surface for online RL. The ScaleRL paper later includes fp32 logits/head computation as part of its RL recipe and ablates it as a useful design choice for large-scale RL.
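
A small numeric experiment shows the scale of the effect. The sketch below emulates a lower-precision head by rounding final logits to bfloat16; this is an assumption made for illustration, since the real discrepancy arises inside the head projection itself. The helper names are hypothetical.

```python
import math
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value (ties to even); finite inputs only."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round on the 16 bits being dropped
    bits &= 0xFFFF0000  # keep sign, exponent, and the top 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def head_precision_ratio(logits, token):
    """Policy ratio between full-precision and bf16-rounded logits for one token."""
    exact = log_softmax(logits)[token]
    rounded = log_softmax([to_bf16(x) for x in logits])[token]
    return math.exp(exact - rounded)


ratio = head_precision_ratio([10.123, 9.876, 3.5, -2.25], token=0)
```

Even with identical weights, the ratio for this token is not 1. The per-token deviation is small, but it is the same quantity that clipping and KL terms are built from.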

With the fp32 lm_head path included, reward gives a compact view of the final parity result. In Figure 6, the final V1 run tracks the V0 reference; the initial V1 attempt produces a clearly different reward curve.

Figure 6. Reward for the vLLM V0 reference (blue), the initial vLLM V1 attempt (red), and the final vLLM V1 run with the fp32 `lm_head` path (green). With the fp32 head included, the final V1 run tracks the V0 reference.

Ablations

The negative results are important because they rule out common explanations.

processed_logprobs alone: fixed the semantic logprob bug; the training mismatch remained.
Batch invariance: the mismatch remained in a separate test, with higher lag, higher clip rate, and NCCL complications.
Treating the first V1 run as a fair baseline: the first V1 run had multiple V1-only defaults enabled, so it was a confounded migration comparison.

Why We Fixed Backend Correctness First

Objective-side corrections such as truncated importance sampling, importance-ratio reweighting, and related methods are useful tools. If rollouts are intentionally stale, generated asynchronously, or produced by a backend where equivalence to the trainer-side policy is unavailable, then some form of correction is often the right thing to add.
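
For reference, truncated importance sampling in its simplest per-token form looks like the sketch below. The cap value and the function name are illustrative assumptions; this is not the correction used in any particular trainer.

```python
import math


def truncated_is_weights(trainer_logprobs, rollout_logprobs, cap=2.0):
    """Per-token importance weights exp(trainer - rollout), truncated at `cap`."""
    return [
        min(math.exp(t - r), cap)
        for t, r in zip(trainer_logprobs, rollout_logprobs)
    ]


weights = truncated_is_weights([-1.0, -1.0, -3.0], [-1.0, -3.0, -1.0])
```

The cap bounds the variance contributed by tokens that the rollout policy sampled with much lower probability than the trainer assigns them.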

The first problem here was inference correctness. After moving to V1, the rollout backend returned logprobs and exhibited runtime behavior that broke the trainer's assumptions. Adding an objective-side correction at that point would have mixed two questions:

is the inference backend producing the right logprobs?
given correct logprobs, does the objective still need an off-policy or async correction?

Those questions need to be separated. Otherwise an objective-side correction can compensate for broken inference-backend behavior, which makes the training curve harder to interpret.

The current objective can still improve. After inference parity is restored, the next improvement is the usual async/off-policy cleanup:

keep explicit behavior-policy logprobs from rollout time
recompute trainer-side old-policy logprobs at optimization time
separate backend mismatch correction from the policy-update ratio
track diagnostics such as effective sample size (ESS) for the correction term alongside aggregate trainer metrics
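
The items above can be sketched as two separate ratios plus an ESS diagnostic on the correction term. The function and argument names are hypothetical, and the ESS formula is the standard (sum of weights) squared over the sum of squared weights.

```python
import math


def split_ratios(new_logprobs, old_trainer_logprobs, behavior_logprobs):
    """Separate the policy-update ratio from the backend-mismatch ratio.

    update   = pi_new / pi_old, both computed by the trainer
    mismatch = pi_old (trainer) / pi_behavior (rollout backend)
    """
    update = [math.exp(n - o) for n, o in zip(new_logprobs, old_trainer_logprobs)]
    mismatch = [
        math.exp(o - b) for o, b in zip(old_trainer_logprobs, behavior_logprobs)
    ]
    return update, mismatch


def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum(w^2); equals len(weights) when all weights match."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0:
        raise ValueError("weights must not all be zero")
    return total * total / total_sq
```

An ESS on the mismatch weights that stays close to the number of tokens indicates the backend and trainer agree; a falling ESS points at the backend rather than at the policy update.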

The main lesson from this migration is narrower: fix backend correctness first, then add corrections for the mismatch that remains.
