Hugging Face Blog · 2026-05-19 · 원문 보기

Ettin Reranker 패밀리 소개

Introducing the Ettin Reranker Family

Ettin 리랭커 패밀리 소개

오늘 저는 6개의 새로운 Sentence Transformers CrossEncoder 리랭커를 공개합니다. 이들은 각 크기 범주에서 최첨단이며, Ettin ModernBERT 인코더를 기반으로 구축되었고, 이를 만든 데이터와 완전한 학습 레시피도 함께 제공합니다:

모델들은 증류(distillation) 레시피로 학습되었습니다: cross-encoder/ettin-reranker-v1-data에 대한 mixedbread-ai/mxbai-rerank-large-v2 점수를 이용한 포인트와이즈 MSE로, 이는 lightonai/embeddings-pre-training의 부분 집합과 lightonai/embeddings-fine-tuning의 리랭킹된 부분 집합이 섞여 있습니다.

우리의 6개 리랭커가 google/embeddinggemma-300m과 쌍을 이루어 MTEB(eng, v2) 검색에서의 성능. 더 많은 임베딩 모델 쌍에 대해서는 결과를 참조하세요.

리랭커가 처음이고 먼저 "왜"를 알고 싶다면 리랭커란 무엇이고 왜 임베더와 짝을 이루는가?로 이동하세요. 모델을 바로 적용하고 싶다면 사용법으로 이동하세요. 직접 학습하고 싶다면 학습으로 이동하세요.

아래의 학습 레시피는 Sentence Transformers v5.5.0에서 출시된 새로운 train-sentence-transformers Agent Skill로 시작했습니다. 이를 hf skills add train-sentence-transformers [--global] [--claude]로 설치하고 AI 코딩 에이전트(Claude Code, Codex, Cursor, Gemini CLI, ...)에게 SentenceTransformer, CrossEncoder, 또는 SparseEncoder 모델을 데이터에 미세 조정하도록 요청할 수 있습니다.

리랭커란 무엇이고 왜 임베더와 짝을 이루는가?

리랭커(포인트와이즈 크로스 인코더라고도 함)는 (질의, 문서) 쌍을 입력받아 단일 관련성 점수를 출력하는 신경망 모델입니다. 질의와 문서를 별도로 인코딩하고 두 임베딩 벡터에서 유사도를 계산하는 임베딩 모델과 달리, 리랭커는 두 텍스트가 모든 트랜스포머 레이어를 통해 서로 주의(attention)를 기울이도록 합니다. 이러한 조인트 인코딩이 더 정확하지만 비용이 더 큽니다: 텍스트당 한 번이 아니라 (질의, 문서) 쌍마다 한 번 모델을 실행해야 합니다.

크로스 인코더는 전체 코퍼스에서 실행하기에 너무 비용이 크기 때문에, 일반적인 프로덕션 패턴은 검색-후-리랭킹입니다: 빠른 임베딩 모델이 상위 K개 후보를 검색하고(저렴), 크로스 인코더는 정확하게 이 K개만 재정렬합니다. 전체 비용은 제한된 상태로 유지되면서 최종 순위는 철저한 크로스 인코더 패스가 생성할 수 있는 것에 훨씬 더 가깝습니다.

이 블로그 전체에서 "리랭커"와 "크로스 인코더"를 상호 교환적으로 사용합니다.

사용법

공개된 모델들은 일반적인 Sentence Transformers CrossEncoder 모델이므로 단 3줄의 코드로 사용할 수 있습니다:

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ettin-reranker-32m-v1")
scores = model.predict([
    ("Apple은 어디에서 설립되었는가?", "Apple Inc.는 1976년 Steve Jobs, Steve Wozniak, Ronald Wayne에 의해 캘리포니아 쿠퍼티노에서 설립되었습니다."),
    ("Apple은 어디에서 설립되었는가?", "Fuji 사과는 1930년대 후반에 개발되어 1962년에 시장에 출시된 사과 품종입니다."),
])
print(scores)

질의와 후보 목록에 대해 정렬된 인덱스와 점수를 반환하는 rank를 사용할 수도 있습니다:

ranked = model.rank(
    query="어느 행성이 붉은 행성으로 알려져 있는가?",
    documents=[
        "금성은 크기와 근접성이 유사하여 지구의 쌍둥이로 불리곤 합니다.",
        "붉은 색깔로 알려진 화성은 종종 붉은 행성이라고 불립니다.",
        "우리 태양계에서 가장 큰 행성인 목성은 눈에 띄는 붉은 반점을 가지고 있습니다.",
        "고리로 유명한 토성은 때때로 붉은 행성으로 오인됩니다.",
    ],
    top_k=4,
    return_documents=True,
)
for r in ranked:
    print(f"({r['score']:.2f}): {r['text']}")

cross-encoder/ettin-reranker-32m-v1를 다른 크기로 바꿔서 품질을 속도와 교환할 수 있습니다. 모두 6개는 ModernBERT의 장문맥 사전 학습 덕분에 최대 8K 토큰의 맥락을 수락합니다(장문서 리랭킹에 유용).

kernels를 설치하고 model_kwargs={"dtype": "bfloat16", "attn_implementation": "flash_attention_2"}를 설정하여 최고의 처리량을 얻을 것을 권장합니다. 아래 속도 섹션을 참조하세요. 일반적으로 모델 크기와 시퀀스 길이에 따라 기본 로딩보다 1.7배~8.3배 속도 향상을 기대할 수 있습니다.

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ettin-reranker-32m-v1",
    model_kwargs={"dtype": "bfloat16", "attn_implementation": "flash_attention_2"},
)

엔드-투-엔드 검색-후-리랭킹 파이프라인

검색용 빠른 임베더와 최종 정렬용 리랭커를 포함한 완전한 예제:

from sentence_transformers import SentenceTransformer, CrossEncoder


embedder = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1")
reranker = CrossEncoder("cross-encoder/ettin-reranker-68m-v1")

corpus = [
    "Apple Inc.는 1976년 Steve Jobs, Steve Wozniak, Ronald Wayne에 의해 캘리포니아 쿠퍼티노에서 설립되었습니다.",
    "Fuji 사과는 1930년대 후반에 개발된 사과 품종입니다.",
    "Steve Jobs는 2007년 Macworld에서 iPhone을 소개했습니다.",
    "Macintosh 컴퓨터는 1984년부터 Apple에서 판매했습니다.",
    
]
query = "Apple은 어디에서 설립되었는가?"


query_emb = embedder.encode_query(query, convert_to_tensor=True)
corpus_emb = embedder.encode_document(corpus, convert_to_tensor=True)
scores = embedder.similarity(query_emb, corpus_emb)[0]
top_k_idx = scores.topk(min(100, len(corpus))).indices.tolist()


top_k_docs = [corpus[i] for i in top_k_idx]
ranked = reranker.rank(query, top_k_docs, top_k=5, return_documents=True)
for r in ranked:
    print(f"({r['score']:.2f}): {r['text']}")

이는 대부분의 현대 검색 시스템에서 사용되는 구조입니다. 검색기는 무엇이 깔때기에 들어갈지 결정하고, 리랭커는 무엇이 우승할지 결정합니다.

아키텍처 세부사항

모든 6개 리랭커는 동일한 아키텍처를 공유하며 백본 크기만 다릅니다. 백본은 Johns Hopkins University의 Ettin 스위트에서 나온 6개 Ettin 인코더 중 하나입니다. 이들은 패딩 없는 주의, RoPE 위치 인코딩, GeGLU, 그리고 2T 토큰의 오픈 라이선스 사전 학습을 지원하는 ModernBERT 스타일 모델로, 최대 8192 토큰의 맥락을 지원합니다.

각 인코더 위에 리랭커는 Sentence Transformers의 모듈식 구성 요소로 구축된 ModernBertForSequenceClassification을 반영하는 4개 모듈 분류 헤드를 사용합니다. 기본 Transformer는 AutoModelForSequenceClassification 대신 순수한 AutoModel이므로, Flash Attention 2에 대해 가변 길이 입력을 위한 시퀀스 언패딩을 사용할 수 있습니다. 중간 길이 문서에서 이는 fp32+SDPA에 비해 1.7배~8.3배 속도 향상을 가져옵니다(모델 크기에 따라 전체 벤치마크는 속도 참조):

1. Transformer(FA2)
2. Pooling(cls)
3. Dense(H, H, bias=False, GELU)
4. LayerNorm(H)
5. Dense(H, 1, scores)

제 실험에서 CLS 풀링이 평균 풀링을 능가했습니다. 이것은 조금 놀라웠습니다. ModernBERT는 3번마다 글로벌 주의만 사용하고 나머지 2/3는 먼 위치에서 CLS에 도달할 수 없는 로컬 윈도우 주의를 사용합니다. 경험적으로 이러한 몇 글로벌 레이어는 CLS를 더 나은 풀링 선택으로 만들기에 충분한 신호를 전달합니다.

모든 6개 모델은 Ettin 인코더와 일치하는 Apache 2.0 라이선스로 공개됩니다.

결과

MTEB(eng, v2) 검색

각 공개 모델을 전체 MTEB(eng, v2) 검색 벤치마크(10개 작업, 상위 100 리랭킹)를 통해 MTEB의 2단계 리랭킹 흐름을 사용하여 실행했으며, 각 리랭커를 속도/품질 스펙트럼에 걸친 6개 임베딩 모델과 쌍을 이루었습니다:

아래 각 차트의 점선 검색기 전용 라인은 이를 이겨야 할 중요한 수치입니다. 그 아래에 있는 것은 리랭커가 평균적으로 파이프라인을 실제로 해치고 있다는 뜻입니다:

결과의 전체 표(확인하려면 클릭)

6개 임베더 쌍 위의 평균 NDCG@10, 내림차순으로 정렬됨. 우리의 6개 모델은 굵게표시되고 교사 mixedbread-ai/mxbai-rerank-large-v2는 밑줄 표시됩니다.

^† max_seq_length=8192로 상한 설정됨(4B Qwen3 기반 리랭커는 단일 H100 80GB에서 네이티브 맥락에 맞지 않음). 네이티브 맥락 평가는 더 높을 가능성이 있습니다.

NanoBEIR 결과의 전체 표(확인하려면 클릭)

NanoBEIR은 BEIR의 빠른 13개 데이터셋 부분으로 데이터셋당 최대 5000개 문서에 대해 50개 질의를 사용합니다. NanoBEIR은 학습 중 metric_for_best_model으로 설정된 것입니다(평가 참조)이며 실험을 안내하는 데 사용한 것입니다.

공개하는 가장 작은 모델인 우리의 17M은 MTEB에서 33M ms-marco-MiniLM-L12-v2를 +0.051 NDCG@10(0.5576 vs 0.5066)로, NanoBEIR에서 +0.038(0.6746 vs 0.6369)로 능가하며 약 절반의 파라미터 수입니다. 32M은 MTEB에서 568M BAAI/bge-reranker-v2-m3을 +0.025(0.5779 vs 0.5526)로 능가합니다. 17배의 파라미터 격차입니다. 검색-후-리랭킹 스택에서 기본값으로 레거시 MiniLM 리랭커 중 하나를 사용했다면, 우리의 17M(또는 32M)으로 교체하는 것은 저위험 드롭인 교체이며 두 벤치마크에서 눈에 띄는 품질 향상을 가져옵니다.

표를 올라가면, 우리의 150M은 MTEB에서 테스트한 600M 미만 범위의 가장 강력한 리랭커이며, 최근 Qwen/Qwen3-Reranker-0.6B(596M)을 +0.005(0.5994 vs 0.5940)로 능가하고 모든 BAAI bge-reranker 변형을 0.03~0.05로 능가합니다. 68M도 언급할 가치가 있습니다: 0.5915에서 파라미터의 1/9을 사용하면서 Qwen3-Reranker-0.6B(0.5940)와 거의 정확히 일치합니다.

공개된 범위의 상단에서 우리의 1B 모델은 교사를 밀접하게 추적합니다. 자신보다 54% 더 큰 모델에서 증류됨에도 불구하고 MTEB에서 1.54B mxbai-rerank-large-v2의 0.0001 범위 내(0.6114 vs 0.6115)이고 NanoBEIR에서 0.008 범위 내입니다. 증류는 교사에 대한 격차를 효과적으로 종료했으며 이는 이 공개에 들어가기 전에 보고 싶었던 것입니다.

비교에서 전반적으로 가장 강력한 리랭커는 MTEB에서 0.6367인 Qwen/Qwen3-Reranker-4B이며, 우리의 1B 모델보다 +0.025입니다. 현재 레시피에서 이 격차를 종료하려면 아마도 더 강력한 교사에서 증류해야 할 것입니다(우리의 교사는 자신이 Qwen3-Reranker-4B 아래에 앉음). 대부분의 검색-후-리랭킹 워크로드의 경우 우리의 1B이 파라미터의 1/4(속도 참조)로 훨씬 더 실용적인 선택입니다.

속도

품질 수치는 리랭커에서 중요한 것의 절반일 뿐입니다. 다른 절반은 검색과 사용자에게 결과를 보여주는 사이의 예산에 지연 시간이 맞는지 여부입니다. 제가 측정한 내용을 걸어가겠습니다.

단일 NVIDIA H100 80GB에서 공개된 6개 모델을 13개의 공개 리랭커(약 1B 파라미터까지의 강력한 기준선)에 대해 벤치마크했습니다. 질의와 문서는 자연 문서 길이 분포의 sentence-transformers/natural-questions에서 옵니다: 대부분의 NQ 답변은 짧고, 일부는 깁니다. 문서는 이전 모델에 불공정한 이점을 주지 않도록 max_length=512로 잘립니다. 각 모델은 가장 잘 지원되는 주의 구현을 사용합니다: 아키텍처가 지원하는 곳에 Flash Attention 2(BERT, XLM-RoBERTa, ModernBERT, Qwen2), 그렇지 않은 곳에 SDPA, DeBERTa-v2는 eager(현재 transformers에서 FA2 또는 SDPA 지원이 없음).

모든 모델에 대해 자동 배치 검색은 배치 크기 8에서 시작하여 GPU 메모리가 부족할 때까지 두 배로 늘어납니다. 각 배치 크기에서 3회 시간 측정 패스를 실행하고 중앙값 처리량을 유지하므로 불운한 단일 실행이 수치를 끌어내리지 않습니다. 보고된 처리량은 처리량이 가장 높은 배치 크기입니다.

표 1. 쌍당 처리량(초), 모두 bfloat16. 우리의 6개 리랭커는 굵게 표시됩니다.

우리의 17M은 초당 7517쌍으로 전체 비교에서 가장 빠른 리랭커입니다. 이는 ms-marco-MiniLM-L6-v2(3817)의 처리량의 거의 2배이며 더 작은 ms-marco-MiniLM-L4-v2(4029)보다도 빠릅니다. 그리고 앞서 MTEB 표에서 본 것처럼 우리의 17M은 모든 MiniLM 변형보다도 더 정확합니다. 현재 MiniLM 크로스 인코더를 실행 중이라면 우리의 17M으로 교체하는 것은 한 줄 변경으로 지연 시간과 검색 품질을 모두 개선합니다.

우리의 150M은 더 흥미로운 비교입니다. 정확히 150M 파라미터의 두 가지 직접 아키텍처 동료가 있기 때문입니다: Alibaba-NLP/gte-reranker-modernbert-base와 ibm-granite/granite-embedding-reranker-english-r2. 둘 다 동일한 ModernBERT-base 백본에 구축됩니다. 우리의 150M은 초당 3237쌍으로 실행되는 반면 두 동료는 각각 1418과 1404로 나오며 2.3배 속도 격차입니다.

모든 3개의 150M 모델은 Flash Attention 2를 사용하지만 두 동료는 AutoModelForSequenceClassification을 통해 로드하므로 입력을 패딩된 채로 유지합니다. 따라서 주의 자체는 FA2 커널을 실행하지만 모델의 나머지는 기여하지 않는 패딩 토큰에 대해 여전히 밀집 계산을 수행합니다. 우리의 모듈식 Transformer 모듈(아키텍처 세부사항 위 참조)은 모델 전체를 통해 언패딩된 입력을 전파하므로 모든 레이어는 실제 토큰에만 계산을 소비합니다. 그것이 FA2 이점의 일부를 얻는 것과 모두를 얻는 것의 차이입니다.

표의 하단에서 우리의 1B 모델은 초당 928쌍으로 1.54B 교사 mxbai-rerank-large-v2(초당 387쌍)보다 2.4배 빠르면서 MTEB 점수를 0.0001 범위 내에서 일치시킵니다. 교사는 Qwen2 기반으로 쌍당 프롬프트 템플릿 오버헤드가 있으므로 증류된 학생은 교사의 보정과 판단을 상속하지만 모든 런타임 부하를 건너뜁니다. 이것은 정직하게 말해서 전체 공개에서 가장 만족스러운 단일 수치입니다.

불행한 점: DeBERTa-v2 기반 mxbai-rerank-{xsmall,base,large}-v1 시리즈는 DeBERTa-v2가 현재 transformers에서 Flash Attention 2 또는 SDPA를 지원하지 않기 때문에 표의 나머지보다 훨씬 느립니다. 70M mxbai-rerank-xsmall-v1은 초당 2636쌍에서 실행되며 거의 동일한 파라미터 수에서 우리의 68M의 처리량의 약 절반입니다. 모델들 자체는 완벽하지만 현대적 주의 커널을 사용할 수 없습니다.

소비자 GPU(RTX 3090, 24GB)에서의 동일한 벤치마크

데이터센터 GPU 대신 소비자 카드에서 자체 호스팅하는 경우, 표 1과 동일한 설정에서 RTX 3090의 동일한 처리량 스윕입니다: bfloat16, 모델당 최고 지원 주의, 맞는 가장 큰 배치에서의 3회 시도 중앙값 처리량.

우리의 17M은 여전히 초당 9008쌍에서 표의 가장 빠른 모델이며 실제로 H100 수치보다 높으므로 작은 크기에서 순수 계산이 병목 현상이 아니고 H100의 추가 근육이 변환되지 않음을 시사합니다. 표의 중간부는 약간 섞이며 MiniLM 리랭커가 우리의 32M과 68M을 능가하고 1B은 mxbai-rerank-base-v2(189 vs 221쌍 초)보다 뒤떨어집니다. 우리의 150M 모델은 여전히 두 150M ModernBERT 기반 동료에 비해 견고한 리드를 유지하고 교사 교체 이야기는 여전히 유지되며 우리의 1B이 1.5B mxbai-rerank-large-v2(189 vs 69쌍 초)의 처리량의 2.7배입니다.

CPU(Intel Core i7-13700K)에서의 동일한 벤치마크

CPU에서 bf16, Flash Attention 2, 언패딩의 이점을 활용할 수 없으므로 지연 시간 이야기는 조금 더 간단합니다: 파라미터 수가 높을수록 모델은 느립니다. 17M 모델은 ms-marco-MiniLM-L6-v2(267.4 vs 143.9쌍 초)보다 상당히 빠르고 더 작은 ms-marco-MiniLM-L4-v2(206.2)보다도 빠릅니다. 예상대로 우리의 150M 모델은 언패딩이 더 이상 적용되지 않으므로 두 150M 동료 옆에 위치합니다(14.0 vs 14.5 및 14.7쌍 초). CPU 바운드라면 우리의 17M과 32M이 실용적인 선택입니다.

속도가 어디서 나오는지 설명하기 위해 다음 표는 fp32+SDPA, bf16+SDPA, bf16+FA2를 동일한 벤치 구성을 사용하여 우리의 6개 모델에 대해 스윕합니다. FA2 열은 2개로 나뉘어 있습니다: 입력이 여전히 패딩된 것(래핑된 모델이 보는 것)과 언패딩된 입력(우리의 모듈식 Transformer이 실제로 하는 것). 가장 오른쪽 열은 FA2를 활성화할 때 우리의 모델이 기본적으로 사용하는 것입니다.

표 2. max_length=512에서 자연 NQ 문서에 대한 6개 공개 크기의 정밀도 및 주의 절제. 각 셀은 fp32+SDPA의 상대 승수와 함께 쌍/초를 보여주고 두 번째 줄에 피크 GPU 메모리를 표시합니다. 가장 오른쪽 열(**굵게**)은 FA2를 활성화할 때 우리의 모델이 기본적으로 사용하는 구성입니다.

모델	파라미터	fp32+SDPA	bf16+SDPA	bf16+FA2 w. padding	bf16+FA2 w.o. padding
`ettin-reranker-17m-v1`	17M	4402 (1.00x) 0.8 GB	4523 (1.03x) 2.2 GB	3744 (0.85x) 1.9 GB	7517 (1.71x) 1.4 GB
`ettin-reranker-32m-v1`	32M	3307 (1.00x) 1.2 GB	4357 (1.32x) 1.6 GB	3040 (0.92x) 2.9 GB	6602 (2.00x) 1.1 GB
`ettin-reranker-68m-v1`	68M	1364 (1.00x) 1.0 GB	2861 (2.10x) 2.2 GB	2003 (1.47x) 2.0 GB	4913 (3.60x) 1.5 GB
`ettin-reranker-150m-v1`	150M	671 (1.00x) 1.6 GB	1942 (2.90x) 1.8 GB	1396 (2.08x) 3.1 GB	3237 (4.83x) 1.4 GB
`ettin-reranker-400m-v1`	400M	266 (1.00x) 2.5 GB	1113 (4.18x) 1.8 GB	864 (3.25x) 2.7 GB	1738 (6.53x) 2.2 GB
`ettin-reranker-1b-v1`	1B	112 (1.00x) 4.6 GB	630 (5.60x) 2.8 GB	522 (4.64x) 3.6 GB	928 (8.26x) 4.5 GB

bf16+FA2 w.o. padding에서 fp32+SDPA 기준선으로의 전체 속도 향상은 모델 크기에 따라 급격히 증가합니다. 17M에서 1.71배에서 1B에서 8.26배로 증가합니다. 대부분의 증가는 bf16 단독에서 나옵니다: fp32+SDPA에서 bf16+SDPA 단계는 17M에 1.03배 속도 향상만 제공하지만 1B에 5.60배 속도 향상을 제공하며 더 높은 배치 크기를 허용하는 낮아진 메모리 비용 때문에 발생합니다. 간단히 말해 bfloat16은 전체 속도 향상의 가장 큰 단일 기여자입니다.

예상 외로 입력이 여전히 패딩되어 있는 동안 FA2를 켜는 것은 실제로 모든 크기에서 bf16+SDPA보다 느립니다. FA2 커널은 언패딩된 형식을 선호하며 패딩된 입력을 제공하면 형식 간 변환의 오버헤드를 지불하면서도 패딩 토큰 자체에 계산을 소비합니다. 따라서 bf16+FA2 w. padding 열은 모델 로더에서 다른 것을 변경하지 않고 sdpa를 flash_attention_2로 바꾼 경우 측정할 내용입니다. 이것은 gte-reranker-modernbert-base와 granite-embedding-reranker-english-r2이 표 1에 있는 상황입니다.

마지막으로 bf16+FA2 w. padding에서 bf16+FA2 w.o. padding으로 이동하는 것은 1B에서 1.78배에서 68M에서 2.45배의 추가 처리량을 가치 있게 만들고 피크 메모리도 상당히 줄이므로 더 높은 배치 크기를 허용합니다.

따라서 제 권장사항은 간단합니다: bf16과 FA2를 함께 활성화하세요. 6개의 Ettin 리랭커는 기본적으로 언패딩된 입력을 사용합니다. 아키텍처 세부사항 섹션의 모듈식 Transformer 모듈이 설정되어 있기 때문입니다. 전체 스니펫은 위 사용법 섹션과 동일합니다:

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ettin-reranker-150m-v1",
    model_kwargs={
        "dtype": "bfloat16",
        "attn_implementation": "flash_attention_2",  
    },
)

FA2를 설치하려면 pip install kernels를 사용하세요. 다양한 GPU 아키텍처, CUDA 버전, 운영 체제에 대해 사전 구축된 커널을 제공합니다.

다른 CrossEncoders에 대한 주의: 전체 속도 향상은 Ettin 리랭커처럼 모듈식 Transformer로 구축된 모델에만 사용할 수 있습니다. 동일한 두 플래그를 AutoModelForSequenceClassification을 통해 로드되는 CrossEncoder에 적용하면 표 2의 느린 bf16+FA2 w. padding 열에 도달합니다.

학습

아래 학습 스크립트는 Sentence Transformers v5.5.0에서 출시된 새로운 train-sentence-transformers Agent Skill의 출력으로 시작했습니다. AI 코딩 에이전트(Claude Code, Codex, Cursor, Gemini CLI, ...)를 사용한다면 스킬을 설치하고 SentenceTransformer, CrossEncoder, 또는 SparseEncoder 모델을 데이터에 미세 조정하도록 요청할 수 있습니다. 스킬은 기본 모델 선택, 손실 및 평가 선택, 하드 네거티브 마이닝, 증류, LoRA, Matryoshka, 다국어 학습, 정적 임베딩에 대한 버전 인식 가이드를 제공합니다.

hf skills add train-sentence-transformers --claude   
hf skills add train-sentence-transformers --global

"(질의, 문서) 쌍의 데이터셋에서 크로스 인코더 리랭커를 미세 조정하고, 하드 네거티브를 마이닝하고, 내 Hub 저장소로 푸시"와 같은 프롬프트는 반복할 수 있는 스크립트를 생성합니다. 이것이 아래 레시피 작업을 시작한 방법입니다.

모든 6개 리랭커는 동일한 단일 단계 레시피로 학습되었습니다. 모델 크기당 학습률과 장치별 배치 크기만 다릅니다. 전체 학습 스크립트는 약 150줄이고 하나의 공개 데이터셋을 사용합니다.

레시피는 모델 크기에 걸친 단일 스윕 후 수렴했습니다. 각 크기의 학습률은 최종 학습 데이터의 약 15% 부분 집합에서 작은 그리드 검색을 통해 조정되었고, 결과 학습률은 재조정 없이 전체 데이터 실행으로 깔끔하게 전달되었습니다. 학습률 이상으로 크기별 조정이 필요 없었습니다.

증류 레시피

대부분의 공개 리랭커 레시피는 인간이 레이블한 관련성 삼중항(질의, 하나의 긍정 문서, 그리고 선택적으로 어려운 네거티브)에서 학습하고 MultipleNegativesRankingLoss, BinaryCrossEntropyLoss, RankNetLoss, 또는 LambdaLoss와 같은 대조, 포인트와이즈, 페어와이즈, 리스트와이즈 손실을 사용합니다. 예를 들어 이전의 Sentence Transformers로 리랭커 모델 학습 및 미세 조정 블로그 게시물을 참조하세요.

하지만 이 접근 방식에는 몇 가지 실용적이고 이론적 단점이 있습니다. 첫째, 긍정은 인간이 레이블을 붙여야 하는데 이는 비싸고 많은 도메인에 걸쳐 확장하기 느립니다. 둘째, 모델은 누군가가 거쳐간 (질의, 문서) 쌍의 작은 부분에 대한 레이블만 봅니다. 특히 어려운 네거티브 마이닝 후 거짓 음수가 많이 남아 있습니다. 예를 들어 어려운 네거티브, 어려운 교훈에서 보는 것처럼. 셋째, 이 레이블의 이진 특성은 현실과 일치하지 않습니다. 일부 문서는 단순히 다른 것보다 더 관련이 있습니다.

저는 다른 경로를 택했습니다: 기존의 강력한 교사 리랭커에서 포인트와이즈 MSE 증류. 설정은 3줄로 설명하기에 충분히 간단합니다:

교사: mixedbread-ai/mxbai-rerank-large-v2 (1.54B 파라미터).
손실: 원시 교사 로짓(범위 ~[−12, 22])에 대한 MSELoss, 즉 재스케일링 없음.
학습 데이터: ~143M (질의, 문서, 교사_점수) 삼중항.

데이터셋

학습 데이터를 단일 Hugging Face 데이터셋 cross-encoder/ettin-reranker-v1-data로 공개했습니다. 이는 두 소스에서 조립됩니다. 각 소스는 출처가 투명하도록 자신의 분할로 유지됩니다:

LightOn 사전 학습 데이터(lightonai/embeddings-pre-training, 큐레이션되지 않음): 광범위한 도메인 텍스트 유사도 신호(MTP, FW-EDU, Reddit, PAQ, S2ORC, Amazon, Wikipedia, MS MARCO 등)를 다루는 32개 분할. 일부 분할의 샘플 수를 제한하면 총 약 110M (질의, 문서, 유사도) 삼중항이 됩니다.
lightonai/embeddings-fine-tuning에서 재점수 매긴 검색 데이터: 7개 분할(msmarco, hotpotqa, trivia, nq, squadv2, fiqa, fever). 소스 데이터셋은 질의당 최대 2048개 후보 문서(초기에 Alibaba-NLP/gte-modernbert-base로 점수 매김)를 가지고 있고, mixedbread-ai/mxbai-rerank-large-v2로 재점수 매기고 cross-encoder/lightonai-embeddings-fine-tuning-reranked-v1로 업로드했습니다. 해당 데이터셋은 Jang et al. 분위수-앵커 레시피를 사용하여 각 질의의 2048개 후보를 256개로 부분 표본화합니다(모든 긍정 + 상위 16개 어려움 + ~239 분위수-앵커 계층화됨). 학습을 위해 질의당 256개 중 64개를 선택합니다: 점수 정렬 헤드(긍정 + 가장 어려운 네거티브)에서 32개, 교사의 순위에서 더 아래 범위의 중간 난이도 네거티브 32개. 정확한 순위 위치는 데이터셋 카드를 참조하세요.

합계: ~143M (질의, 문서, 점수) 삼중항, 더하기 보유된 5K 행 평가 분할(quora의 꼬리)은 학습 중 평가 손실을 주도합니다.

학습 인수

대부분의 하이퍼파라미터는 모델 크기에 걸쳐 상수입니다:

CrossEncoderTrainingArguments(
    num_train_epochs=1,                    
    per_device_train_batch_size=...,       
    gradient_accumulation_steps=1,
    learning_rate=...,                     
    warmup_ratio=0.03,                     
    bf16=True,                             
    eval_strategy="steps",
    eval_steps=0.05,                       
    save_strategy="steps",
    save_steps=0.05,
    save_total_limit=5,
    load_best_model_at_end=True,
    metric_for_best_model="eval_NanoBEIR_R100_mean_ndcg@10",
    seed=12,
)

모델 크기별로 학습률과 전역 배치 크기만 다릅니다.

크기	학습률	전역 배치 크기
17m	2.4e-4	1024
32m	1.2e-4	512
68m	3e-5	256
150m	1.5e-5	192
400m	7e-6	256
1b	3e-6	512

global_batch_size는 per_device_batch_size x world_size x gradient_accumulation_steps입니다. 단일 8-GPU 노드에서 17m의 1024 전역 배치는 per_device=128을 의미합니다. 8개 노드에서 per_device=8을 의미합니다. 학습 스크립트는 global_batch_size // world_size에서 per_device_batch_size를 계산하므로 동일한 스크립트가 어떤 노드 수에서든 작동합니다. 전역 배치 크기는 더 일관되게 만들어질 수 있지만 위 값이 잘 작동했고 일관성만을 위해 재조정하고 싶지 않았습니다.

평가

학습 중 NanoBEIR 평균 NDCG@10을 모니터링했습니다(단계의 5%마다 평가). 이를 load_best_model_at_end를 위한 metric_for_best_model으로 사용했습니다. NanoBEIR은 빠르므로 학습 실행당 20번 감당할 수 있었습니다. 학습 후 최고 체크포인트(NanoBEIR에 따르면)와 마지막 체크포인트를 모두 전체 MTEB(eng, v2) 검색 벤치마크에서 평가했습니다. 최종 공개 체크포인트는 MTEB에서 가장 잘 수행한 것이었습니다. NanoBEIR이 선호하는 체크포인트가 68m을 제외한 모든 크기에서 우승했습니다. 마지막 체크포인트가 약간 더 강력했습니다.

전체 학습 스크립트

완전한 스크립트(모든 공개 모델이 학습된 것)는 단일 파일입니다. ENCODER_SIZE만 실행마다 변경되고 다른 모든 것은 자동입니다:

from __future__ import annotations

import logging
import os
from pathlib import Path

import torch
import torch.nn as nn
from datasets import concatenate_datasets, get_dataset_config_names, load_dataset

from sentence_transformers import CrossEncoder
from sentence_transformers.base.modules import Dense
from sentence_transformers.cross_encoder import (
    CrossEncoderModelCardData,
    CrossEncoderTrainer,
    CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.evaluation import CrossEncoderNanoBEIREvaluator
from sentence_transformers.cross_encoder.losses import MSELoss
from sentence_transformers.sentence_transformer.modules import LayerNorm, Pooling, Transformer

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s", datefmt="%H:%M:%S")
logging.getLogger("httpx").setLevel(logging.WARNING)



CONFIGS: dict[str, dict] = {
    "17m":  {"base_model_name": "jhu-clsp/ettin-encoder-17m",  "learning_rate": 2.4e-4, "global_batch_size": 1024},
    "32m":  {"base_model_name": "jhu-clsp/ettin-encoder-32m",  "learning_rate": 1.2e-4, "global_batch_size": 512},
    "68m":  {"base_model_name": "jhu-clsp/ettin-encoder-68m",  "learning_rate": 3e-5,   "global_batch_size": 256},
    "150m": {"base_model_name": "jhu-clsp/ettin-encoder-150m", "learning_rate": 1.5e-5, "global_batch_size": 192},
    "400m": {"base_model_name": "jhu-clsp/ettin-encoder-400m", "learning_rate": 7e-6,   "global_batch_size": 256},
    "1b":   {"base_model_name": "jhu-clsp/ettin-encoder-1b",   "learning_rate": 3e-6,   "global_batch_size": 512},
}
ENCODER_SIZE = "17m"

def main() -> None:
    config = CONFIGS[ENCODER_SIZE]
    encoder_id = config["base_model_name"]
    learning_rate = config["learning_rate"]
    global_batch_size = config["global_batch_size"]

    world_size = int(os.environ.get("WORLD_SIZE", 1))
    per_device_batch_size = global_batch_size // world_size
    dataloader_workers = 0 if world_size > 8 else 4
    run_name = f"ettin-reranker-{ENCODER_SIZE}-lr{learning_rate:.0e}"

    
    
    
    
    torch.manual_seed(12)
    transformer = Transformer(encoder_id, model_kwargs={"attn_implementation": "flash_attention_2"})
    transformer.model.config.num_labels = 1
    embedding_dimension = transformer.get_embedding_dimension()
    pooling = Pooling(embedding_dimension=embedding_dimension, pooling_mode="cls")
    dense_inner = Dense(
        in_features=embedding_dimension, out_features=embedding_dimension, bias=False,
        activation_function=nn.GELU(),
        module_input_name="sentence_embedding", module_output_name="sentence_embedding",
    )
    norm = LayerNorm(dimension=embedding_dimension)
    dense_score = Dense(
        in_features=embedding_dimension, out_features=1, bias=True,
        activation_function=nn.Identity(),
        module_input_name="sentence_embedding", module_output_name="scores",
    )
    model = CrossEncoder(
        modules=[transformer, pooling, dense_inner, norm, dense_score],
        num_labels=1,
        activation_fn=nn.Identity(),
        model_card_data=CrossEncoderModelCardData(
            model_name=f"Ettin Reranker {ENCODER_SIZE} distilled from mxbai-rerank-large-v2",
            language="en",
            license="apache-2.0",
        ),
    )
    actual_attn = getattr(model[0].model.config, "_attn_implementation", None)
    if not (actual_attn and "flash" in actual_attn.lower()):
        logging.warning(f"FA2 may not be active (attn_impl={actual_attn!r}); training will be slower.")

    
    
    dataset_repo = "cross-encoder/ettin-reranker-v1-data"
    train_pieces = []
    eval_dataset = None
    for config_name in get_dataset_config_names(dataset_repo):
        dataset = load_dataset(dataset_repo, config_name)
        train_pieces.append(dataset["train"])
        if "validation" in dataset:
            eval_dataset = dataset["validation"]
    train_dataset = concatenate_datasets(train_pieces)
    print(train_dataset)

    
    loss = MSELoss(model)

    
    args = CrossEncoderTrainingArguments(
        output_dir=f"models/{run_name}",
        num_train_epochs=1,
        per_device_train_batch_size=per_device_batch_size,
        per_device_eval_batch_size=per_device_batch_size,
        gradient_accumulation_steps=1,
        learning_rate=learning_rate,
        warmup_ratio=0.03,
        bf16=True,
        eval_strategy="steps",
        eval_steps=0.05,
        save_strategy="steps",
        save_steps=0.05,
        save_total_limit=5,
        logging_steps=0.025,
        logging_first_step=True,
        load_best_model_at_end=True,
        metric_for_best_model="eval_NanoBEIR_R100_mean_ndcg@10",
        dataloader_num_workers=dataloader_workers,
        run_name=run_name,
        seed=12,
    )

    
    evaluator = CrossEncoderNanoBEIREvaluator(
        dataset_names=["msmarco", "nfcorpus", "nq", "fiqa2018", "touche2020", "scifact",
                       "hotpotqa", "arguana", "fever", "dbpedia", "climatefever", "scidocs",
                       "quoraretrieval"],
        batch_size=per_device_batch_size,
        always_rerank_positives=False,
        show_progress_bar=False,
    )

    
    trainer = CrossEncoderTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=loss,
        evaluator=evaluator,
    )

    
    if trainer.is_world_process_zero():
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            evaluator(model)

    
    trainer.train()

    
    if trainer.is_world_process_zero():
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            evaluator(model)

    
    final_dir = f"models/{run_name}/final"
    model.save_pretrained(final_dir)


if __name__ == "__main__":
    main()

다중 노드 학습(17m/32m 이후)의 경우 동일한 스크립트를 torchrun으로 실행합니다:


python train.py


torchrun --nproc_per_node=8 --nnodes=4 ... train.py

결론

ettin-reranker-v1 패밀리는 단일 간단한 레시피로 학습되어 1B 파라미터까지의 모든 공개된 크기에서 최첨단입니다. 강력한 교사에서의 포인트와이즈 MSE 증류는 광범위한 도메인과 검색 특정 혼합에서 17M에서 1B 파라미터까지 깔끔하게 확장되며 크기 간에 학습률과 장치별 배치 크기만 변경됩니다.

모든 ettin-reranker-v1 모델은 MTEB와 NanoBEIR에서 ms-marco-MiniLM-L*-v2 패밀리를 편한 여유로 능가합니다. cross-encoder/ettin-reranker-150m-v1은 MTEB에서 테스트한 600M 미만 범위의 가장 강력한 중간 리랭커이고, cross-encoder/ettin-reranker-400m-v1은 1.54B 교사의 MTEB 점수의 0.0024 범위 내에 도착하며, cross-encoder/ettin-reranker-1b-v1은 해당 교사와 0.0001 범위 내에서 일치합니다.

한곳에 모든 것이:

이들 위에 뭔가를 구축하면 알려주세요! 사람들이 무엇을 하는지 정말 궁금합니다. 공개된 데이터를 사용하여 더 나은 리랭커를 학습할 수 있다면 더욱 좋습니다. 레시피는 의도적으로 간단하므로 누군가 다른 사람이 개선할 여지가 많습니다. 더 강력한 교사를 학습하면 동일한 스크립트가 계속 더 나은 학생을 생산할 수 있습니다.

감사의 말

이 리랭커들이 구축된 기본 인코더 구축한 Ettin 팀(Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, Benjamin Van Durme), 학습 데이터 수집 작업을 한 LightOn 팀(Antoine Chaffin, Raphael Sourty, Paulo Moura, Amélie Chatelain), 그리고 교사 모델 작업을 한 Mixedbread AI 팀(Xianming Li, Aamir Shakir, Rui Huang, Tsz-fung Andrew Lee, Julius Lipp, Benjamin Clavié, Jing Li)에 감사드립니다.

인용

ettin-reranker-v1 패밀리 또는 공개된 아티팩트를 사용하면 이 블로그 게시물을 인용하세요:

@misc{aarsen2026ettin-reranker,
    title = "Introducing the Ettin Reranker Family",
    author = "Aarsen, Tom",
    year = "2026",
    publisher = "Hugging Face",
    url = "https://huggingface.co/blog/ettin-reranker",
}

Introducing the Ettin Reranker Family

Today I'm releasing six new Sentence Transformers CrossEncoder rerankers, state-of-the-art at their respective sizes, built on top of the Ettin ModernBERT encoders, together with the data and full training recipe that produced them:

The models were trained with a distillation recipe: pointwise MSE on mixedbread-ai/mxbai-rerank-large-v2 scores over cross-encoder/ettin-reranker-v1-data, which is a subset of lightonai/embeddings-pre-training mixed with a reranked subset of lightonai/embeddings-fine-tuning.

Our six rerankers paired with google/embeddinggemma-300m on MTEB(eng, v2) Retrieval. See Results for five more embedder pairings.

If you're new to rerankers and want the "why" first, jump to What is a reranker, and why pair one with an embedder?. If you just want to plug a model in, jump to Usage. If you want to train your own, jump to Training.

I bootstrapped the training recipe below with the new train-sentence-transformers Agent Skill shipped in Sentence Transformers v5.5.0. Install it with hf skills add train-sentence-transformers [--global] [--claude] and ask your AI coding agent (Claude Code, Codex, Cursor, Gemini CLI, ...) to fine-tune a SentenceTransformer, CrossEncoder, or SparseEncoder model on your data.

What is a reranker, and why pair one with an embedder?

A reranker (a.k.a. pointwise cross-encoder) is a neural model that takes a (query, document) pair and outputs a single relevance score. Unlike an embedding model, which encodes the query and document separately and computes their similarity from the two embedding vectors, a reranker lets the two texts attend to each other through every transformer layer. That joint encoding is more accurate but also more expensive: the model has to be run once per (query, document) pair rather than once per text.

Because cross-encoders are too expensive to run over a full corpus, the common production pattern is retrieve-then-rerank: a fast embedding model retrieves the top-K candidates (cheap), then a cross-encoder re-orders just those K with high accuracy. The total cost stays bounded while the final ranking is much closer to what an exhaustive cross-encoder pass would produce.

Throughout this blogpost I'll use "reranker" and "cross-encoder" interchangeably.

Usage

The released models are normal Sentence Transformers CrossEncoder models, so you can use them with just 3 lines of code:

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ettin-reranker-32m-v1")
scores = model.predict([
    ("Where was Apple founded?", "Apple Inc. was founded in Cupertino, California in 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne."),
    ("Where was Apple founded?", "The Fuji apple is an apple cultivar developed in the late 1930s and brought to market in 1962."),
])
print(scores)

For a query and a list of candidates, you can also use rank to get back sorted indices and scores:

ranked = model.rank(
    query="Which planet is known as the Red Planet?",
    documents=[
        "Venus is often called Earth's twin because of its similar size and proximity.",
        "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
        "Jupiter, the largest planet in our solar system, has a prominent red spot.",
        "Saturn, famous for its rings, is sometimes mistaken for the Red Planet.",
    ],
    top_k=4,
    return_documents=True,
)
for r in ranked:
    print(f"({r['score']:.2f}): {r['text']}")

You can swap cross-encoder/ettin-reranker-32m-v1 for any other size to trade quality for speed. All six accept up to 8K tokens of context (useful for long-document reranking) thanks to ModernBERT's long-context pre-training.

It is recommended to install kernels and set model_kwargs={"dtype": "bfloat16", "attn_implementation": "flash_attention_2"} for the highest throughput. See the Speed section below for more details, but in general you can expect a 1.7x-8.3x speedup over default loading depending on model size and sequence length.

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ettin-reranker-32m-v1",
    model_kwargs={"dtype": "bfloat16", "attn_implementation": "flash_attention_2"},
)

End-to-end retrieve-then-rerank pipeline

A complete example with a fast embedder for retrieval and the reranker for the final ordering:

from sentence_transformers import SentenceTransformer, CrossEncoder


embedder = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1")
reranker = CrossEncoder("cross-encoder/ettin-reranker-68m-v1")

corpus = [
    "Apple Inc. was founded in Cupertino, California in 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne.",
    "The Fuji apple is an apple cultivar developed in the late 1930s.",
    "Steve Jobs introduced the iPhone in 2007 at Macworld.",
    "Macintosh computers were sold by Apple from 1984 onward.",
    
]
query = "Where was Apple founded?"


query_emb = embedder.encode_query(query, convert_to_tensor=True)
corpus_emb = embedder.encode_document(corpus, convert_to_tensor=True)
scores = embedder.similarity(query_emb, corpus_emb)[0]
top_k_idx = scores.topk(min(100, len(corpus))).indices.tolist()


top_k_docs = [corpus[i] for i in top_k_idx]
ranked = reranker.rank(query, top_k_docs, top_k=5, return_documents=True)
for r in ranked:
    print(f"({r['score']:.2f}): {r['text']}")

This is the same shape used by most modern search systems. The retriever decides what enters the funnel, the reranker decides what wins.

Architecture Details

All six rerankers share the same architecture and differ only in their backbone size. The backbone is one of the six Ettin encoders from Johns Hopkins University's Ettin suite. These are ModernBERT-style models with unpadded attention, RoPE positional encodings, GeGLU, and 2T tokens of open-license pre-training, supporting up to 8192 tokens of context.

On top of each encoder, the reranker uses a 4-module classification head that mirrors ModernBertForSequenceClassification but is built from Sentence Transformers' modular components. The underlying Transformer is a plain AutoModel rather than AutoModelForSequenceClassification, which lets us use sequence unpadding for variable-length inputs for Flash Attention 2. At medium-document sequence lengths this is a 1.7x-8.3x speedup over fp32+SDPA depending on model size (see Speed for the full benchmark):

1. Transformer(FA2)
2. Pooling(cls)
3. Dense(H, H, bias=False, GELU)
4. LayerNorm(H)
5. Dense(H, 1, scores)

In my ablations, CLS pooling outperformed mean pooling. That was a little surprising. ModernBERT uses global attention only every third layer and the other two-thirds use local-window attention that cannot reach CLS from distant positions. Empirically, those few global layers carry enough signal to make CLS the better pooling choice.

All six models are released under the Apache 2.0 license, matching the Ettin encoders.

Results

MTEB(eng, v2) Retrieval

I ran each released model through the full MTEB(eng, v2) Retrieval benchmark (10 tasks, top-100 reranked) using MTEB's two-stage reranking flow, pairing each reranker with six embedding models that span the speed/quality spectrum:

The dashed retriever-only line in each chart below is the headline number to beat. Anything below it means the reranker actively hurts the pipeline on average:

Full table of results (click to expand)

Mean NDCG@10 over the 6 embedder pairings, sorted descending. Our six models are in bold, and the teacher mixedbread-ai/mxbai-rerank-large-v2 is underlined.

^† Capped to max_seq_length=8192 (the 4B Qwen3-based rerankers don't fit on a single H100 80GB at native context). Native-context evaluation is likely higher.

Full table of NanoBEIR results (click to expand)

NanoBEIR is a fast 13-dataset subset of BEIR that uses 50 queries per dataset against up to 5000 documents each. NanoBEIR is what metric_for_best_model was set to during training (see Evaluation), and what I used to guide the experimentation.

The smallest model I'm releasing, our 17M, beats the 33M ms-marco-MiniLM-L12-v2 by +0.051 NDCG@10 (0.5576 vs 0.5066) on MTEB and +0.038 (0.6746 vs 0.6369) on NanoBEIR at roughly half the parameter count. The 32M beats the 568M BAAI/bge-reranker-v2-m3 by +0.025 (0.5779 vs 0.5526) on MTEB, a 17x parameter gap. If you've been using one of the legacy MiniLM rerankers as the default in your retrieve-then-rerank stack, swapping in our 17M (or 32M) is a low-risk drop-in replacement, with a noticeable quality bump on both benchmarks.

Moving up the table, our 150M is the strongest reranker I tested in the under-600M range on MTEB, edging out the recent Qwen/Qwen3-Reranker-0.6B (596M) by +0.005 (0.5994 vs 0.5940) and beating every BAAI bge-reranker variant by 0.03 to 0.05. The 68M is also worth a mention: at 0.5915 it lands almost exactly on Qwen3-Reranker-0.6B (0.5940) while using a ninth of the parameters.

At the top of the released range, our 1B model closely tracks its teacher. It comes within 0.0001 of the 1.54B mxbai-rerank-large-v2 on MTEB (0.6114 vs 0.6115) and within 0.008 on NanoBEIR, despite distilling from a model 54% larger than itself. The distillation effectively closes the gap to the teacher, which is what I was hoping to see going into this release.

The overall strongest reranker in the comparison is Qwen/Qwen3-Reranker-4B at 0.6367 MTEB, +0.025 above our 1B model. Closing that gap from the current recipe would likely require distilling from a stronger teacher (our teacher itself sits below Qwen3-Reranker-4B). For most retrieve-then-rerank workloads, our 1B at a quarter of the parameters (see Speed) is a much more practical pick.

Speed

Quality numbers are only half of what matters for a reranker. The other half is whether its latency fits inside the budget you have between retrieval and showing results to the user. Let me walk through what I measured.

I benchmarked all six released models against thirteen public rerankers (strong baselines up to about 1B parameters) on a single NVIDIA H100 80GB. The queries and documents come from sentence-transformers/natural-questions at its natural document-length distribution: most NQ answers are short, some are long. Documents are truncated at max_length=512 to avoid giving the older models an unfair advantage. Each model uses its best supported attention implementation: Flash Attention 2 wherever the architecture supports it (BERT, XLM-RoBERTa, ModernBERT, Qwen2), SDPA where it doesn't, and eager for DeBERTa-v2 (which currently has neither FA2 nor SDPA support in transformers).

For every model an auto-batch search starts at batch size 8 and doubles until the GPU runs out of memory. At each batch size I run three timed passes and keep the median throughput, so a single unlucky run doesn't drag the number around. The reported throughput is at whichever batch size won.

Table 1. Throughput in pairs per second, all in bfloat16. Our six rerankers are in bold.

Our 17M is the fastest reranker in the whole comparison, at 7517 pairs per second. That's almost twice the throughput of ms-marco-MiniLM-L6-v2 (3817) and faster even than the smaller ms-marco-MiniLM-L4-v2 (4029). And as you saw in the MTEB table earlier, our 17M is also more accurate than every MiniLM variant. If you're currently running a MiniLM cross-encoder, swapping to our 17M is a one-line change that improves both your latency and search quality.

Our 150M is an even more interesting comparison, because there are two direct architectural peers at exactly 150M parameters: Alibaba-NLP/gte-reranker-modernbert-base and ibm-granite/granite-embedding-reranker-english-r2. Both are built on the same ModernBERT-base backbone. Our 150M runs at 3237 pairs per second, while the two peers come in at 1418 and 1404 respectively, for a 2.3x speed gap.

All three 150M models use Flash Attention 2, but the two peers load through AutoModelForSequenceClassification, which keeps the inputs padded. So attention itself runs the FA2 kernel, but the rest of the model is still doing dense compute on padding tokens that don't contribute anything. Our modular Transformer module (see Architecture Details above) propagates unpadded inputs all the way through the model, so every layer only spends compute on real tokens. That's the difference between getting some of FA2's benefit and getting all of it.

At the bottom of the table, our 1B model hits 928 pairs per second, which is 2.4x faster than the 1.54B teacher mxbai-rerank-large-v2 (387 pairs per second) while matching its MTEB score within 0.0001. The teacher is Qwen2-based with a prompt-template overhead per pair, so the distilled student inherits the teacher's calibration and judgement but skips all the runtime baggage. This is honestly the most satisfying single number in the whole release for me.

One unfortunate note: the DeBERTa-v2-based mxbai-rerank-{xsmall,base,large}-v1 series ends up much slower than the rest of the table because DeBERTa-v2 currently supports neither Flash Attention 2 nor SDPA in transformers. The 70M mxbai-rerank-xsmall-v1 runs at 2636 pairs per second, about half the throughput of our 68M at almost the same parameter count. The models themselves are perfectly fine, they just don't get to use modern attention kernels.

Same benchmark on a consumer GPU (RTX 3090, 24 GB)

If you're self-hosting on a consumer card rather than a datacenter GPU, here's the same throughput sweep on an RTX 3090. Same benchmark setup as Table 1: bfloat16, best-supported attention per model, three-trial median throughput at the largest batch that fits.

Our 17M is still the fastest model in the table at 9008 pairs per second, actually higher than its H100 number, which suggests that at tiny sizes raw compute isn't the bottleneck and the H100's extra muscle doesn't translate. The middle of the table reshuffles a bit, with the MiniLM rerankers overtaking our 32M and 68M, and the 1B slipping behind mxbai-rerank-base-v2 (189 vs 221 pairs per second). Our 150M model still holds a solid lead over the two 150M ModernBERT-based peers, and the teacher-replacement story still holds, with our 1B at 2.7x the throughput of the 1.5B mxbai-rerank-large-v2 (189 vs 69 pairs per second).

Same benchmark on CPU (Intel Core i7-13700K)

On CPU, we can't take advantage of bf16, Flash Attention 2, or unpadding, so the latency story is a bit simpler: the higher the parameter count, the slower the model. The 17M model is considerably faster than ms-marco-MiniLM-L6-v2 (267.4 vs 143.9 pairs per second) and even faster than the smaller ms-marco-MiniLM-L4-v2 (206.2). As expected, our 150M model lands alongside the two 150M peers (14.0 vs 14.5 and 14.7 pairs per second) now that unpadding no longer applies. If you're CPU-bound, our 17M and 32M are the practical picks.

To explain where the speed comes from, the next table sweeps fp32+SDPA, bf16+SDPA, and bf16+FA2 for our six models using the same bench config. The FA2 column is split in two: one with the inputs still padded (what a wrapped model would see) and one with unpadded inputs (what our modular Transformer actually does). The rightmost column is what our models use by default when FA2 is enabled.

Table 2. Precision and attention ablation for the six released sizes at max_length=512 on natural NQ documents. Each cell shows pairs / second with the multiplier relative to fp32+SDPA in parentheses, and peak GPU memory on the second line. The rightmost column (in bold) is the configuration our models use by default when FA2 is enabled.

Model	Params	fp32+SDPA	bf16+SDPA	bf16+FA2 w. padding	bf16+FA2 w.o. padding
`ettin-reranker-17m-v1`	17M	4402 (1.00x) 0.8 GB	4523 (1.03x) 2.2 GB	3744 (0.85x) 1.9 GB	7517 (1.71x) 1.4 GB
`ettin-reranker-32m-v1`	32M	3307 (1.00x) 1.2 GB	4357 (1.32x) 1.6 GB	3040 (0.92x) 2.9 GB	6602 (2.00x) 1.1 GB
`ettin-reranker-68m-v1`	68M	1364 (1.00x) 1.0 GB	2861 (2.10x) 2.2 GB	2003 (1.47x) 2.0 GB	4913 (3.60x) 1.5 GB
`ettin-reranker-150m-v1`	150M	671 (1.00x) 1.6 GB	1942 (2.90x) 1.8 GB	1396 (2.08x) 3.1 GB	3237 (4.83x) 1.4 GB
`ettin-reranker-400m-v1`	400M	266 (1.00x) 2.5 GB	1113 (4.18x) 1.8 GB	864 (3.25x) 2.7 GB	1738 (6.53x) 2.2 GB
`ettin-reranker-1b-v1`	1B	112 (1.00x) 4.6 GB	630 (5.60x) 2.8 GB	522 (4.64x) 3.6 GB	928 (8.26x) 4.5 GB

The total speedup from bf16+FA2 w.o. padding over the fp32+SDPA baseline grows sharply with model size, from 1.71x on the 17M to 8.26x on the 1B. Most of that growth comes from bf16 alone: the fp32+SDPA to bf16+SDPA step gives the 17M only a 1.03x speedup but gives the 1B a full 5.60x speedup, also due to the lowered memory cost allowing for bigger batch sizes. In short, bfloat16 is the biggest single contributor to the overall speedup.

Unexpectedly, turning on FA2 while the inputs are still padded is actually slower than bf16+SDPA at every size in the release. The FA2 kernel prefers an unpadded format, and when you feed it padded inputs you pay the bookkeeping overhead of converting between formats while still spending compute on the padding tokens themselves. So the bf16+FA2 w. padding column is roughly what you'd measure if you swapped sdpa for flash_attention_2 in model_kwargs without changing anything else about the model loader. This is the situation that gte-reranker-modernbert-base and granite-embedding-reranker-english-r2 from Table 1 are in.

Lastly, going from bf16+FA2 w. padding to bf16+FA2 w.o. padding is worth between 1.78x (1B) and 2.45x (68M) of additional throughput, and it also cuts peak memory considerably, allowing for higher batch sizes.

So my recommendation is simple: enable bf16 and FA2 together. The six Ettin rerankers will use unpadded inputs by default, since that's what the modular Transformer module from the Architecture Details section is set up for. The full snippet is the same as in the Usage section above:

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ettin-reranker-150m-v1",
    model_kwargs={
        "dtype": "bfloat16",
        "attn_implementation": "flash_attention_2",  
    },
)

Use pip install kernels to install FA2. It ships pre-built kernels for a wide range of GPU architectures, CUDA versions, and operating systems.

One caveat for other CrossEncoders: the full speedup is only available for models built with a modular Transformer like the Ettin rerankers. Applying the same two flags to a CrossEncoder that loads through AutoModelForSequenceClassification lands you in the slower bf16+FA2 w. padding column of Table 2 instead.

Training

The training script below started as the output of the new train-sentence-transformers Agent Skill, shipped in Sentence Transformers v5.5.0. If you use an AI coding agent (Claude Code, Codex, Cursor, Gemini CLI, ...), you can install the skill and ask it to fine-tune a SentenceTransformer, CrossEncoder, or SparseEncoder model on your data. The skill carries version-aware guidance for base model selection, loss and evaluator choice, hard-negative mining, distillation, LoRA, Matryoshka, multilingual training, and static embeddings, plus template scripts for each model type.

hf skills add train-sentence-transformers --claude   
hf skills add train-sentence-transformers --global

A prompt like "Fine-tune a cross-encoder reranker on (query, document) pairs from my dataset, mine hard negatives, and push to my Hub repo" will produce a runnable script you can then iterate on. That's how I started working on the recipe below.

All six rerankers were trained with the same single-stage recipe. Only the learning rate and the per-device batch size vary per model size. The full training script is ~150 lines and uses one published dataset.

The recipe converged after a single sweep across model sizes. Each size's learning rate was tuned by a small grid search on a ~15% subset of the final training data, and the resulting LRs transferred cleanly to the full-data runs without re-tuning. No per-size tuning beyond LR was needed.

Distillation recipe

Most published reranker recipes train on human-labeled relevance triples (a query, one positive document, and optionally hard negatives) with a contrastive, pointwise, pairwise, or listwise loss like MultipleNegativesRankingLoss, BinaryCrossEntropyLoss, RankNetLoss, or LambdaLoss, respectively. See my earlier Training and Finetuning Reranker Models with Sentence Transformers blogpost, for example.

But this approach has a few practical and theoretical drawbacks. First, positives need to be human-labeled, which is expensive and slow to scale across many domains. Second, the model only ever sees a label for the small subset of (query, document) pairs that someone went through. Especially after hard negative mining, you end up with a lot of false negatives, e.g. as shown in Hard Negatives, Hard Lessons. Third, the binary nature of this labeling doesn't match reality, where some documents are simply more relevant than others.

I took a different route here: pointwise MSE distillation from an existing strong teacher reranker. The setup is simple enough to describe in three lines:

Teacher: mixedbread-ai/mxbai-rerank-large-v2 (1.54B parameters).
Loss: MSELoss on the raw teacher logits (range ~[−12, 22]), i.e. without rescaling.
Training data: ~143M (query, document, teacher_score) triples.

Dataset

I've released the training data as a single Hugging Face dataset, cross-encoder/ettin-reranker-v1-data, assembled from two sources. Each source is kept as its own split so the provenance is transparent:

LightOn pre-training data (lightonai/embeddings-pre-training, non-curated): 32 splits covering broad-domain text similarity signal (MTP, FW-EDU, Reddit, PAQ, S2ORC, Amazon, Wikipedia, MS MARCO, etc.). I limit the number of samples for some of the splits, resulting in ~110M (query, document, similarity) triples in total.
Rescored retrieval data from lightonai/embeddings-fine-tuning: 7 splits (msmarco, hotpotqa, trivia, nq, squadv2, fiqa, fever). The source dataset has up to 2048 candidate documents per query (initially scored with Alibaba-NLP/gte-modernbert-base), which I rescored with mixedbread-ai/mxbai-rerank-large-v2 and uploaded as cross-encoder/lightonai-embeddings-fine-tuning-reranked-v1. That dataset subsamples each query's 2048 candidates down to 256 using the Jang et al. quantile-anchor recipe (all positives + top-16 hard + ~239 quantile-anchor stratified). For training, I pick 64 of those 256 per query: 32 from the score-sorted head (the positive plus the hardest negatives) and 32 medium-difficulty negatives sampled from a band further down the teacher's ranking. See the dataset card for the exact rank positions.

Total: ~143M (query, document, score) triples, plus a held-out 5K-row eval split (the tail of quora) that drives the in-training eval loss.

Training Arguments

Most hyperparameters are constant across model sizes:

CrossEncoderTrainingArguments(
    num_train_epochs=1,                    
    per_device_train_batch_size=...,       
    gradient_accumulation_steps=1,
    learning_rate=...,                     
    warmup_ratio=0.03,                     
    bf16=True,                             
    eval_strategy="steps",
    eval_steps=0.05,                       
    save_strategy="steps",
    save_steps=0.05,
    save_total_limit=5,
    load_best_model_at_end=True,
    metric_for_best_model="eval_NanoBEIR_R100_mean_ndcg@10",
    seed=12,
)

Only the learning rate and global batch size very per model size.

Size	Learning rate	Global batch size
17m	2.4e-4	1024
32m	1.2e-4	512
68m	3e-5	256
150m	1.5e-5	192
400m	7e-6	256
1b	3e-6	512

global_batch_size is per_device_batch_size x world_size x gradient_accumulation_steps. On a single 8-GPU node, the 1024 global batch for 17m means per_device=128. On 8 nodes, it means per_device=8. The training script computes per_device_batch_size from global_batch_size // world_size so the same script works at any node count. The global batch size could be made more consistent, but I found that the above values worked well and didn't want to retune them just for the sake of consistency.

Evaluation

I monitored NanoBEIR mean NDCG@10 during training (eval every 5% of steps) and used it as the metric_for_best_model for load_best_model_at_end. NanoBEIR is fast, so I could afford it 20 times per training run. After training, I evaluated both the best checkpoint (according to NanoBEIR) and the last checkpoint on the full MTEB(eng, v2) Retrieval benchmark. The final release checkpoint was the one that did best on MTEB. The NanoBEIR-preferred checkpoint won for all sizes except 68m, where the last checkpoint was slightly stronger.

Overall Training Script

The complete script (what every released model was trained with) is a single file. Only ENCODER_SIZE changes per run, and everything else is automatic:

from __future__ import annotations

import logging
import os
from pathlib import Path

import torch
import torch.nn as nn
from datasets import concatenate_datasets, get_dataset_config_names, load_dataset

from sentence_transformers import CrossEncoder
from sentence_transformers.base.modules import Dense
from sentence_transformers.cross_encoder import (
    CrossEncoderModelCardData,
    CrossEncoderTrainer,
    CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.evaluation import CrossEncoderNanoBEIREvaluator
from sentence_transformers.cross_encoder.losses import MSELoss
from sentence_transformers.sentence_transformer.modules import LayerNorm, Pooling, Transformer

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s", datefmt="%H:%M:%S")
logging.getLogger("httpx").setLevel(logging.WARNING)



CONFIGS: dict[str, dict] = {
    "17m":  {"base_model_name": "jhu-clsp/ettin-encoder-17m",  "learning_rate": 2.4e-4, "global_batch_size": 1024},
    "32m":  {"base_model_name": "jhu-clsp/ettin-encoder-32m",  "learning_rate": 1.2e-4, "global_batch_size": 512},
    "68m":  {"base_model_name": "jhu-clsp/ettin-encoder-68m",  "learning_rate": 3e-5,   "global_batch_size": 256},
    "150m": {"base_model_name": "jhu-clsp/ettin-encoder-150m", "learning_rate": 1.5e-5, "global_batch_size": 192},
    "400m": {"base_model_name": "jhu-clsp/ettin-encoder-400m", "learning_rate": 7e-6,   "global_batch_size": 256},
    "1b":   {"base_model_name": "jhu-clsp/ettin-encoder-1b",   "learning_rate": 3e-6,   "global_batch_size": 512},
}
ENCODER_SIZE = "17m"

def main() -> None:
    config = CONFIGS[ENCODER_SIZE]
    encoder_id = config["base_model_name"]
    learning_rate = config["learning_rate"]
    global_batch_size = config["global_batch_size"]

    world_size = int(os.environ.get("WORLD_SIZE", 1))
    per_device_batch_size = global_batch_size // world_size
    dataloader_workers = 0 if world_size > 8 else 4
    run_name = f"ettin-reranker-{ENCODER_SIZE}-lr{learning_rate:.0e}"

    
    
    
    
    torch.manual_seed(12)
    transformer = Transformer(encoder_id, model_kwargs={"attn_implementation": "flash_attention_2"})
    transformer.model.config.num_labels = 1
    embedding_dimension = transformer.get_embedding_dimension()
    pooling = Pooling(embedding_dimension=embedding_dimension, pooling_mode="cls")
    dense_inner = Dense(
        in_features=embedding_dimension, out_features=embedding_dimension, bias=False,
        activation_function=nn.GELU(),
        module_input_name="sentence_embedding", module_output_name="sentence_embedding",
    )
    norm = LayerNorm(dimension=embedding_dimension)
    dense_score = Dense(
        in_features=embedding_dimension, out_features=1, bias=True,
        activation_function=nn.Identity(),
        module_input_name="sentence_embedding", module_output_name="scores",
    )
    model = CrossEncoder(
        modules=[transformer, pooling, dense_inner, norm, dense_score],
        num_labels=1,
        activation_fn=nn.Identity(),
        model_card_data=CrossEncoderModelCardData(
            model_name=f"Ettin Reranker {ENCODER_SIZE} distilled from mxbai-rerank-large-v2",
            language="en",
            license="apache-2.0",
        ),
    )
    actual_attn = getattr(model[0].model.config, "_attn_implementation", None)
    if not (actual_attn and "flash" in actual_attn.lower()):
        logging.warning(f"FA2 may not be active (attn_impl={actual_attn!r}); training will be slower.")

    
    
    dataset_repo = "cross-encoder/ettin-reranker-v1-data"
    train_pieces = []
    eval_dataset = None
    for config_name in get_dataset_config_names(dataset_repo):
        dataset = load_dataset(dataset_repo, config_name)
        train_pieces.append(dataset["train"])
        if "validation" in dataset:
            eval_dataset = dataset["validation"]
    train_dataset = concatenate_datasets(train_pieces)
    print(train_dataset)

    
    loss = MSELoss(model)

    
    args = CrossEncoderTrainingArguments(
        output_dir=f"models/{run_name}",
        num_train_epochs=1,
        per_device_train_batch_size=per_device_batch_size,
        per_device_eval_batch_size=per_device_batch_size,
        gradient_accumulation_steps=1,
        learning_rate=learning_rate,
        warmup_ratio=0.03,
        bf16=True,
        eval_strategy="steps",
        eval_steps=0.05,
        save_strategy="steps",
        save_steps=0.05,
        save_total_limit=5,
        logging_steps=0.025,
        logging_first_step=True,
        load_best_model_at_end=True,
        metric_for_best_model="eval_NanoBEIR_R100_mean_ndcg@10",
        dataloader_num_workers=dataloader_workers,
        run_name=run_name,
        seed=12,
    )

    
    evaluator = CrossEncoderNanoBEIREvaluator(
        dataset_names=["msmarco", "nfcorpus", "nq", "fiqa2018", "touche2020", "scifact",
                       "hotpotqa", "arguana", "fever", "dbpedia", "climatefever", "scidocs",
                       "quoraretrieval"],
        batch_size=per_device_batch_size,
        always_rerank_positives=False,
        show_progress_bar=False,
    )

    
    trainer = CrossEncoderTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=loss,
        evaluator=evaluator,
    )

    
    if trainer.is_world_process_zero():
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            evaluator(model)

    
    trainer.train()

    
    if trainer.is_world_process_zero():
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            evaluator(model)

    
    final_dir = f"models/{run_name}/final"
    model.save_pretrained(final_dir)


if __name__ == "__main__":
    main()

For multi-node training (anything past 17m/32m), launch the same script with torchrun:


python train.py


torchrun --nproc_per_node=8 --nnodes=4 ... train.py

Conclusion

The ettin-reranker-v1 family, trained with a single simple recipe, is state-of-the-art at every released size up to 1B parameters. Pointwise MSE distillation from a strong teacher onto a broad-domain and retrieval-specific mix scales cleanly from 17M to 1B parameters, with only the learning rate and per-device batch size changing between sizes.

Every ettin-reranker-v1 model beats the ms-marco-MiniLM-L*-v2 family by a comfortable margin on MTEB and NanoBEIR. cross-encoder/ettin-reranker-150m-v1 is the strongest mid-tier reranker I tested in the under-600M range, cross-encoder/ettin-reranker-400m-v1 lands within 0.0024 of the 1.54B teacher's MTEB score, and cross-encoder/ettin-reranker-1b-v1 matches that teacher within 0.0001.

Everything in one place:

If you build something on top of these, please let me know! I'd genuinely love to see what people do with them, and if you manage to train better rerankers using the released data, even better. The recipe is intentionally simple, partly so that there's plenty of headroom for someone else to improve it. Train a stronger teacher and the same script can keep producing better students.

Acknowledgements

I'd like to thank the Ettin team (Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, and Benjamin Van Durme) for building the base encoders that these rerankers are built on, the LightOn team (Antoine Chaffin, Raphael Sourty, Paulo Moura, and Amélie Chatelain) for their work on the training data collection, and the Mixedbread AI team (Xianming Li, Aamir Shakir, Rui Huang, Tsz-fung Andrew Lee, Julius Lipp, Benjamin Clavié, and Jing Li) for their work on the teacher model.

Citation

If you use the ettin-reranker-v1 family or any of the released artifacts, please cite this blogpost:

@misc{aarsen2026ettin-reranker,
    title = "Introducing the Ettin Reranker Family",
    author = "Aarsen, Tom",
    year = "2026",
    publisher = "Hugging Face",
    url = "https://huggingface.co/blog/ettin-reranker",
}

#reranking #machine-learning #information-retrieval #nlp #neural-networks #semantic-search

Ettin Reranker 패밀리 소개

Ettin 리랭커 패밀리 소개

목차

리랭커란 무엇이고 왜 임베더와 짝을 이루는가?

사용법

엔드-투-엔드 검색-후-리랭킹 파이프라인

아키텍처 세부사항

결과

MTEB(eng, v2) 검색

속도

학습

증류 레시피

데이터셋

학습 인수

평가

전체 학습 스크립트

결론

감사의 말

인용

Introducing the Ettin Reranker Family

Table of contents

What is a reranker, and why pair one with an embedder?

Usage

End-to-end retrieve-then-rerank pipeline

Architecture Details

Results

MTEB(eng, v2) Retrieval

Speed

Training

Distillation recipe

Dataset

Training Arguments

Evaluation

Overall Training Script

Conclusion

Acknowledgements

Citation