Hugging Face Blog · 2026-05-06

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

"When a measure becomes a target, it ceases to be a good measure." (Goodhart’s Law)

TLDR: Appen Inc. and DataoceanAI have provided high-quality English ASR datasets covering scripted and conversational speech across multiple accents. To guard against benchmaxxing and test-set contamination, we will keep these datasets private so that they remain a high-quality measure of performance on multiple tasks.

We’re not updating the average WER at this time: by default, the leaderboard’s Average WER remains computed on public datasets only. You can optionally include the private datasets using the toggle to see their impact 👀

Since its launch in September 2023, the Open ASR Leaderboard has been visited over 710K times. We’re blown away by the community’s interest and motivation to keep pushing speech recognition 🗣️

Two words sum up the objectives (but also challenges) in maintaining a benchmark like the Open ASR Leaderboard:

Standardization: models can have different conventions for their usage and outputs, e.g. with/without punctuation and casing. Datasets have the same challenges and can be structured differently. To this end, all test sets have been gathered into a single dataset on the Hub for easy access and previewing. Moreover, to standardize model outputs and dataset transcripts, we use a normalizer that (among other things) removes punctuation and casing, and maps to American spelling. It is based on the normalizer of Whisper.
Openness: the UI code and evaluation scripts are open-sourced. This has helped not only to incorporate new models, but also to improve the quality of the evaluation procedure through community feedback and contributions.
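
To make the effect of normalization concrete, here is a minimal sketch of a normalizer and a word-level WER computation. It is not the leaderboard's or Whisper's actual normalizer: the `normalize` and `wer` helpers and the three-entry spelling map are simplified assumptions for illustration only.

```python
import re

# Tiny illustrative British -> American spelling map (an assumption; a real
# normalizer relies on a much larger mapping and many more rules).
BRITISH_TO_AMERICAN = {"colour": "color", "favourite": "favorite", "centre": "center"}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace, map spellings."""
    text = text.lower()
    # Replace everything except word characters, whitespace and apostrophes.
    text = re.sub(r"[^\w\s']", " ", text)
    words = [BRITISH_TO_AMERICAN.get(word, word) for word in text.split()]
    return " ".join(words)


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


reference = "My favourite colour is red."
hypothesis = "my favorite color is red"

raw_wer = wer(reference, hypothesis)
normalized_wer = wer(normalize(reference), normalize(hypothesis))
print(f"raw WER: {raw_wer:.2f}, normalized WER: {normalized_wer:.2f}")
```

Without normalization, four of the five reference words count as errors purely because of casing, punctuation and spelling conventions; after normalization the two strings match exactly.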

Standardization and openness are essential for meaningful benchmarking, but they also make benchmarks more susceptible to benchmark-specific optimization ("benchmaxxing"), where models improve leaderboard performance without corresponding gains in real-world robustness. As models and use cases evolve, the Open ASR Leaderboard will continue incorporating high-quality datasets and new evaluation settings to better reflect real-world performance and improve robustness against benchmark-specific optimization.

As discussed in our report, there is no single "catch-all" ASR model: some perform better on American English, others on diverse accents and multilingual settings, while others are optimized for speed or conversational audio. Different applications also prioritize different capabilities, so a model that performs less well on one dimension is not necessarily a worse model overall. The goal of the Open ASR Leaderboard is to capture these nuances and provide a more holistic view of ASR performance.

New high-quality, private datasets

To this end, we have worked with Appen Inc. and DataoceanAI to curate high-quality datasets for ASR benchmarking. Below is some information on the various splits.

Dataset	Accent	Duration [h]	Male (%) / Female (%)	Style	Transcription
Appen Scripted AU	Australian	1.42	49 / 51	Read	Punctuated, cased.
Appen Scripted CA	Canadian	1.53	52 / 48	Read	Punctuated, cased.
Appen Scripted IN	Indian	1.02	49 / 51	Read	Punctuated, cased.
Appen Scripted US	American	1.45	49 / 51	Read	Punctuated, cased.
Appen Conversational IN	Indian	1.37	51 / 49	Conversational, spontaneous	Punctuated, disfluencies.
Appen Conversational US003	American	1.64	49 / 51	Conversational, spontaneous	Punctuated, cased, disfluencies.
Appen Conversational US004	American	1.65	49 / 51	Conversational, spontaneous	Punctuated, disfluencies.
DataoceanAI Scripted US	American	2.43	54 / 46	Read	Punctuated, cased (proper nouns), disfluencies.
DataoceanAI Scripted GB	British	2.43	47 / 53	Read	Punctuated, disfluencies.
DataoceanAI Conversational US	American	8.82	NA	Conversational, spontaneous	Punctuated, disfluencies.
DataoceanAI Conversational GB	British	5.96	NA	Conversational, spontaneous	Punctuated, disfluencies.

Below are audio samples showing the variety of content (scripted, conversational, acronyms, disfluencies, proper nouns).

While private datasets may sound contrary to the spirit of openness, we believe that incorporating them will increase the trustworthiness of the Open ASR Leaderboard. They are less likely to be exploited for benchmaxxing, whether by model developers who explicitly use the public test sets or by those who look for training data that closely resembles a particular dataset to boost their score in the macroaverage.

With these datasets, we can also provide targeted metrics to highlight gaps and biases between controlled and often saturated settings (scripted, American accent) and more nuanced conditions (conversational and non-American accents). Below is a screenshot of the new "Private data" tab.

Below is how each column is computed.

"Average WER" computes a macroaverage of the data provider averages, so that they are weighted equally.
"Avg Scripted" performs a macroaverage of all scripted datasets.
"Avg Conversational" performs a macroaverage of all conversational datasets.
"Avg US" performs a macroaverage of all datasets with American accents.
"Avg non-US" performs a macroaverage of all datasets with non-American accents.
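
The column definitions above can be sketched as follows. This is an illustration under stated assumptions, not the leaderboard's code: the split names come from the table above, every WER value is invented, the `macroaverage` helper is hypothetical, and "macroaverage" is read here as an unweighted mean of per-split WERs (or of per-provider averages for "Average WER").

```python
from statistics import mean

# (provider, style, accent, WER in %) for each private split. Split names
# follow the table above; every WER value is invented for illustration.
SPLITS = {
    "Appen Scripted AU": ("Appen", "scripted", "AU", 5.0),
    "Appen Scripted CA": ("Appen", "scripted", "CA", 3.0),
    "Appen Scripted IN": ("Appen", "scripted", "IN", 6.0),
    "Appen Scripted US": ("Appen", "scripted", "US", 2.0),
    "Appen Conversational IN": ("Appen", "conversational", "IN", 12.0),
    "Appen Conversational US003": ("Appen", "conversational", "US", 8.0),
    "Appen Conversational US004": ("Appen", "conversational", "US", 6.0),
    "DataoceanAI Scripted US": ("DataoceanAI", "scripted", "US", 4.0),
    "DataoceanAI Scripted GB": ("DataoceanAI", "scripted", "GB", 4.0),
    "DataoceanAI Conversational US": ("DataoceanAI", "conversational", "US", 10.0),
    "DataoceanAI Conversational GB": ("DataoceanAI", "conversational", "GB", 12.0),
}


def macroaverage(keep):
    """Unweighted mean WER over all splits for which keep(...) is true."""
    return mean(
        wer for provider, style, accent, wer in SPLITS.values()
        if keep(provider, style, accent)
    )


# One average per data provider, then an unweighted mean of those averages,
# so each provider counts equally regardless of how many splits it has.
providers = sorted({provider for provider, _, _, _ in SPLITS.values()})
provider_avgs = {
    name: macroaverage(lambda p, s, a, name=name: p == name)
    for name in providers
}

columns = {
    "Average WER": mean(list(provider_avgs.values())),
    "Avg Scripted": macroaverage(lambda p, s, a: s == "scripted"),
    "Avg Conversational": macroaverage(lambda p, s, a: s == "conversational"),
    "Avg US": macroaverage(lambda p, s, a: a == "US"),
    "Avg non-US": macroaverage(lambda p, s, a: a != "US"),
}

for name, value in columns.items():
    print(f"{name}: {value:.2f}")
```

With these invented numbers, Appen averages 6.0 over seven splits and DataoceanAI 7.5 over four, so "Average WER" is 6.75. A plain mean over all eleven splits would instead be about 6.55, which gives Appen more weight simply because it contributes more splits.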

We intentionally do not provide a score for each split, to prevent model developers from boosting their score by targeting a specific data provider or accent.

How can I evaluate my model on this data?

Get your model on the Open ASR Leaderboard, and we'll run the evaluation! As before, the process for adding a model to the leaderboard takes place on the Open ASR Leaderboard GitHub:

Open a pull request, and a model checklist will appear. As before, you should report your results on the public datasets.
We will verify the results on the public sets and compute the metrics on the private ones.
Confirm the results we’ve obtained.

While you wait for your model to be added to the Open ASR Leaderboard, you can self-report your metrics on the public sets by adding a YAML file like this to your model card. Your model will then be listed on the (unverified) leaderboard shown on the dataset page (see screenshot below). More on this approach to decentralized evaluation can be read here.

Do models trained on the data providers have an advantage?

They could. We’ve asked Appen and DataoceanAI not to provide this data to their clients. But even without this exact data, data from a similar distribution could still help a model on the corresponding evaluation set (similar to benchmaxxing by optimizing for a challenging task from the public sets). To mitigate this, having multiple data providers balances out the advantage a model may get from having used data from one of them. And we are open to more data providers and eval sets for the "Private data" tab!

Moreover, so that the private sets do not affect the default model ranking, the Average WER excludes them from its macroaverage by default.

In the screenshot below, you can see that "Private data" is toggled off. This means that the macroaverage across datasets does not include it.

Simply toggle on "Private data" splits to include them in the macroaverage.

The "Rank Δ" column shows how the ordering changes relative to the default macroaverage configuration. Including or excluding public datasets also changes the macroaverage, allowing users to tailor the evaluation to the use cases and data distributions most relevant to their application.
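
As a rough sketch of how such a rank change can be computed, the example below ranks three made-up models under the default selection and again with the private data toggled on. Everything here is an assumption for illustration: the model and dataset names, the WER values, the treatment of "private" as one extra entry in the macroaverage, and the sign convention (positive means the model moved up).

```python
from statistics import mean

# Hypothetical per-dataset WERs (%) for three made-up models.
scores = {
    "model-a": {"public-1": 5.0, "public-2": 7.0, "private": 14.0},
    "model-b": {"public-1": 6.0, "public-2": 7.0, "private": 6.5},
    "model-c": {"public-1": 7.0, "public-2": 8.0, "private": 9.0},
}


def ranking(selected):
    """Rank models (1 = best) by macroaverage WER over the selected datasets."""
    averages = {
        model: mean(wers[dataset] for dataset in selected)
        for model, wers in scores.items()
    }
    ordered = sorted(averages, key=averages.get)  # lower WER ranks higher
    return {model: position for position, model in enumerate(ordered, start=1)}


default_rank = ranking(["public-1", "public-2"])
toggled_rank = ranking(["public-1", "public-2", "private"])

# Positive delta: the model climbed relative to the default configuration.
rank_delta = {model: default_rank[model] - toggled_rank[model] for model in scores}

for model in scores:
    print(f"{model}: {default_rank[model]} -> {toggled_rank[model]} "
          f"(delta {rank_delta[model]:+d})")
```

In this toy setup, model-a leads on the public sets but drops to last place once the private entry is included, which is the kind of shift the "Rank Δ" column is meant to surface.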

What's next?

We’re excited to hear the community’s feedback on how the new track and dataset toggling features help users identify the model(s) that best fit their application(s). We’re also looking into evaluations that better reflect real-world noisy conditions, and you can expect some news on that 😉

While preparing the private evaluation sets, we took extra care to ensure consistent audio and transcript quality across datasets, including developing tooling to identify challenging cases such as low signal-to-noise conditions or transcript mismatches, since these factors can meaningfully affect WER. More on that in a future post!
