Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains

Mellum2 is a 12B-parameter Mixture-of-Experts model trained from scratch on natural language and code.
The model activates only 2.5B parameters per token, making it efficient for high-throughput, low-latency inference. Mellum2 can be used for routing, RAG, summarization, sub-agents, high-throughput coding features, and private deployments.
It is released under the Apache 2.0 license.
Compared with similar-sized models, Mellum2 delivers competitive benchmark performance while achieving more than 2x faster inference.
Download the model on Hugging Face: https://huggingface.co/collections/JetBrains/mellum-2
For architecture details, training setup, benchmarks, and evaluation methodology, read the full technical report: https://arxiv.org/pdf/2605.31268

Today we’re releasing Mellum2, an open Mixture-of-Experts model optimized for low-latency text-and-code workloads. Mellum originally started as a code completion model. With Mellum2, we extend that foundation to a broader set of natural language and software engineering tasks while keeping the model focused on efficient inference and deployability. Modern AI systems increasingly rely on multiple model calls: routing, retrieval, summarization, planning, validation, and tool use. Many of these operations are latency-sensitive and do not require the largest available model. Mellum2 targets these workloads.

Benchmark highlights

In our technical report, we evaluate Mellum2 across code generation, reasoning, science, and math benchmarks. Mellum2 is competitive with similarly sized open models while delivering more than 2x faster inference, making it suitable for high-throughput production workloads.

Model architecture

Mellum2 is a Mixture-of-Experts model:

Model	Total parameters	Active parameters per token	Modality	License
Mellum2	12B	2.5B	Text and code	Apache 2.0
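
As a rough back-of-the-envelope reading of the table, the sketch below computes the share of parameters active per token and the approximate memory needed for the weights alone. The parameter counts come from the table; the bytes-per-parameter values are illustrative assumptions, not published formats, and the estimate assumes all experts stay resident in memory, which is typical for MoE serving but is not stated in this post.

```python
TOTAL_PARAMS = 12e9     # total parameters, from the table
ACTIVE_PARAMS = 2.5e9   # active parameters per token, from the table


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params * bytes_per_param / 1e9


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"active per token: {fraction:.1%}")

# Assumed storage widths; activations and KV cache are not counted.
for name, width in (("16-bit", 2), ("8-bit", 1)):
    print(f"{name} weights: ~{weight_memory_gb(TOTAL_PARAMS, width):.0f} GB")
```

Under these assumptions, the saving shows up in per-token compute (roughly a fifth of the parameters are touched), while the memory footprint still reflects all 12B parameters.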

The MoE architecture keeps total model capacity high while activating only a subset of parameters for each token. This makes inference more efficient and helps reduce serving cost for real-time workloads. Mellum2 is intentionally focused on text and code rather than multimodal tasks. This specialization keeps the model compact and efficient for software engineering workloads.
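
To make "activating only a subset of parameters for each token" concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not Mellum2's actual router: the number of experts, the value of `top_k`, the softmax gating, and the toy scalar "experts" are all assumptions chosen for illustration. The technical report is the place to look for the real design.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_layer(x, experts, router_logits, top_k):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    weights = route_token(router_logits, top_k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy setup (assumed numbers): 8 scalar "experts", 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

weights = route_token(logits, top_k=2)
print(sorted(weights))  # indices of the experts that ran: [1, 4]
print(moe_layer(1.0, experts, logits, top_k=2))
```

Only two of the eight expert functions are called for this token, which is the source of the compute saving; the other six still exist and contribute capacity for other tokens.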

Key use cases

Routing and orchestration

Mellum2 works well as a lightweight routing and orchestration model in multi-model systems, including prompt classification, tool selection, and intermediate control-flow steps.
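
A prompt-classification router of this kind could look roughly like the sketch below. Everything named here is hypothetical: the route labels, the prompt wording, and the `small_model` callable, which stands in for however you call a served model (it takes a prompt string and returns generated text). A keyword-based stub replaces the model so the example runs on its own.

```python
from typing import Callable, Dict

ROUTES = ("code", "search", "chat")  # hypothetical route labels


def build_prompt(user_message: str) -> str:
    """Ask the small model for a single label and nothing else."""
    labels = ", ".join(ROUTES)
    return (
        f"Classify the request into one of: {labels}.\n"
        "Answer with the label only.\n\n"
        f"Request: {user_message}\nLabel:"
    )


def route(
    user_message: str,
    small_model: Callable[[str], str],
    handlers: Dict[str, Callable[[str], str]],
    default: str = "chat",
) -> str:
    """Classify with the small model, then dispatch to the matching handler."""
    label = small_model(build_prompt(user_message)).strip().lower()
    if label not in handlers:  # guard against free-form or unexpected output
        label = default
    return handlers[label](user_message)


def fake_small_model(prompt: str) -> str:
    """Stub standing in for a real model call (assumption, for the demo only)."""
    request = prompt.split("Request:", 1)[1]
    return " Code\n" if "function" in request else "unsure"


handlers = {name: (lambda msg, n=name: f"{n}-handler:{msg}") for name in ROUTES}

print(route("Write a function that reverses a list", fake_small_model, handlers))
print(route("hello there", fake_small_model, handlers))
```

Normalizing the label and falling back to a default route keeps the control flow deterministic even when the classifier returns something outside the expected set.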

RAG pipelines

The model is well suited for latency-sensitive retrieval pipelines, including context compression, summarization, and retrieval post-processing.
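
One way to picture context compression in such a pipeline is sketched below: retrieved chunks are kept verbatim while they fit a character budget, and anything too long is passed through a summarizer first. The `summarize` callable is a hypothetical stand-in for a call to a small model, and the character-based budget is a simplifying assumption (a real pipeline would more likely count tokens).

```python
from typing import Callable, List


def compress_context(
    question: str,
    chunks: List[str],
    summarize: Callable[[str, str], str],
    budget_chars: int,
) -> List[str]:
    """Fit ranked chunks into a budget, summarizing the ones that are too long."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:  # chunks are assumed to arrive ranked best-first
        remaining = budget_chars - used
        if remaining <= 0:
            break
        text = chunk if len(chunk) <= remaining else summarize(question, chunk)
        text = text[:remaining]  # hard cap in case the summary is still too long
        kept.append(text)
        used += len(text)
    return kept


def first_sentence(question: str, chunk: str) -> str:
    """Stub summarizer for the demo; a real one would call the model."""
    return chunk.split(". ")[0] + "."


chunks = ["short fact.", "A long passage. " * 10]
context = compress_context("what is stated?", chunks, first_sentence, budget_chars=40)
print(context)  # ['short fact.', 'A long passage.']
```

Because the summarizer runs once per oversized chunk, its latency is paid on every query, which is why a fast model matters at this step.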

Sub-agents

Mellum2 can be used for agent subtasks such as planning, validation, transformation, and context preparation, reducing the need to invoke larger models for intermediate operations.
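
The division of labor described here can be sketched as follows: a small model handles planning, context preparation, and validation, and the larger model is called once for the main answer. Both `small_model` and `large_model` are hypothetical callables (prompt in, text out), and the prompts are illustrative; counting stubs are used so the call pattern is visible when the example runs.

```python
from typing import Callable, Dict


def run_task(
    task: str,
    small_model: Callable[[str], str],
    large_model: Callable[[str], str],
) -> Dict[str, object]:
    """Use the small model for intermediate steps and the large one for the answer."""
    plan = small_model(f"List the steps needed for: {task}")
    context = small_model(f"Extract the facts relevant to: {task}")
    draft = large_model(f"Task: {task}\nPlan: {plan}\nContext: {context}")
    verdict = small_model(
        "Does this answer address the task? Reply yes or no.\n"
        f"Task: {task}\nAnswer: {draft}"
    )
    accepted = verdict.strip().lower().startswith("yes")
    return {"answer": draft, "accepted": accepted}


calls = {"small": 0, "large": 0}


def stub_small(prompt: str) -> str:
    """Counting stub for the small model (demo only)."""
    calls["small"] += 1
    return "Yes" if prompt.startswith("Does this answer") else "ok"


def stub_large(prompt: str) -> str:
    """Counting stub for the large model (demo only)."""
    calls["large"] += 1
    return "final answer"


result = run_task("rename a variable across a module", stub_small, stub_large)
print(result, calls)
```

In this sketch, three of the four model calls go to the small model, so most of the latency and cost of the workflow depends on how quickly it can be served.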

Private deployment

Because Mellum2 is open and efficient to serve, it can be deployed in self-hosted environments involving proprietary code or internal data.

Why well-scoped models matter

As AI systems mature, the most effective architectures are becoming less monolithic. A single frontier model can be powerful, but production systems often need several specialized components working together: retrievers, routers, code-aware models, validators, tool callers, and larger reasoning models. We think of Mellum2 as a “focal” model: a fast, well-scoped model optimized for high-frequency tasks inside larger AI systems. The goal is not to replace every model in the stack. The goal is to make the stack faster, cheaper, and easier to control.

Getting started with Mellum2

If you are building AI systems for software engineering – inside an IDE, in a RAG pipeline, as part of an agent workflow, or on private infrastructure – Mellum2 is ready to try.
