AI News · #benchmark

GeekNews · 11 days ago

Kimi K3 is competitive with Fable, and combining the two models achieves state-of-the-art performance

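- In a comparison of Kimi K3 and Fable 5 on about 1,030 agentic tasks, per-task routing with 93% accuracy achieved higher quality than either model alone.
- Overall performance was similar across SWE, terminal, algorithm, multilingual, and legal tasks, but the two models…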

#benchmark #fable #model-routing #kimi-k3 #sota #agentic-tasks

Hacker News (front) · 11 days ago

Kimi K3 Is Competitive with Fable; Kimi K3 and Fable Is SoTA

Also: Kimi K3: second only to Fable 5 on AA-Briefcase https://artificialanalysis.ai/articles/kimi-k3-agentic-knowledge-benchmark …

#agent-models #benchmark #model-comparison #fable #kimi-k3 #sota

Hacker News (front) · 2026-07-18

Fable 5 vs. GPT-5.6 Sol on an NP-Hard Problem: Does /goal Help? - Charles AZAM

Article URL: https://charlesazam.com/blog/fable-5-gpt-5-6-sol-goal/ Comments URL: https://ne…

#benchmark #model-comparison #gpt-5 #optimization #fable-5 #np-hard-problem

GeekNews · 2026-07-13

Apple SpeechAnalyzer API benchmarked against Whisper and the previous API

- Processing 5,559 LibriSpeech utterances on an Apple M2 Pro with the same production code, SpeechAnalyzer was more accurate than every other engine tested, with a word error rate (WER) of 2.12% on clean speech and 4.56% on noisy speech.
- The existing…

#machine-learning #benchmark #whisper #speech-recognition #audio-processing #apple-api

GeekNews · 2026-07-13

Claude Code uses 33K tokens before reading the prompt; OpenCode uses 7K

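- Measured at the API boundary with the same model, machine, and task, the fixed overhead of the first request on Sonnet 4.5 was about 32,800 tokens for Claude Code and about 6,900 tokens for OpenCode, a 4.7x difference that narrowed to about 3.3x on Fable 5.
- Most of the gap…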

#claude-code #benchmark #token-overhead #api-efficiency #opencode

Hacker News (front) · 2026-07-12

Claude Code Sends 4.7x More Tokens Than OpenCode Before Reading Your Prompt

This started based off of a hunch. We usually use OpenCode, but were 'forced' to use Claude Code for a while due to issues with Meridian. In that time, we saw the usage meter ri…

#claude-code #ai-coding-tools #benchmark #token-usage #api-cost #efficiency-comparison

GeekNews · 2026-07-01

Claude Sonnet 5 released

- Anthropic released Claude Sonnet 5 on June 30, 2026, aiming to deliver agentic execution close to that of the more expensive Opus-class models at Sonnet-class cost.
- Compared with Sonnet 4.6, reasoning, tool use, coding, knowledge work…

#anthropic #claude #benchmark #reasoning #agents #cost-efficiency

GeekNews · 2026-06-29

GLM 5.2 beats Claude on Semgrep's IDOR benchmark

- On Semgrep's IDOR vulnerability detection benchmark, Zhipu AI's open-weight model GLM 5.2 scored a higher F1 than Claude Code using only a simple prompt.
- The experiment fixed the dataset, evaluation method, and system prompt…

#benchmark #code-security #glm #semgrep #idor

Hacker News (front) · 2026-06-28

We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks

Article URL: https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/ …

#claude #benchmark #code-security #glm

GeekNews · 2026-06-27

The gap between open-weight and closed LLMs

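- On the Artificial Analysis Intelligence Index, the time it takes open-weight LLMs to catch up to the past performance of closed LLMs has been shrinking steadily since summer 2024.
- Drawing a trend line through this single metric, the gap in December 2026…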

#llm #open-weight #ai-research #benchmark #model-performance

Anthropic Research · 2026-06-18

Evaluating Claude’s bioinformatics research capabilities with BioMysteryBench

#language-models #bioinformatics #claude-ai #ai-evaluation #benchmark #research-capabilities

Anthropic Research · 2026-06-17

Claude does cyber competitions

#claude #benchmark #cybersecurity #competitions #capability-evaluation

Anthropic Research · 2026-06-17

Cyber evaluations of Claude 4

#claude #benchmark #cybersecurity #evaluation #security-assessment

Hacker News (front) · 2026-06-17

GLM-5.2 is the new leading open weights model on the Artificial Analysis Intelligence Index

Article URL: https://artificialanalysis.ai/articles/glm-5-2-is-the-new-leading-open-weights-model-on-the-artificial-analysis-intelligence-index …

#benchmark #glm-5-2 #open-source-model #performance-ranking #zhipu

OpenAI Blog · 2026-06-17

Introducing LifeSciBench

Introducing LifeSciBench, an expert-authored, expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions.

#ai-evaluation #benchmark #lifescibench #life-science #research-methodology

Hacker News (front) · 2026-06-14

Rio-3.5-Open-397B ≈ 0.6 x Nex-N2_pro + 0.4 x Qwen · Issue #4 · nex-agi/Nex-N2

Article URL: https://github.com/nex-agi/Nex-N2/issues/4 Comments URL: https://news.ycombinator.com/item?…

#ai-models #model-evaluation #benchmark #model-comparison #open-source-llm

GeekNews · 2026-06-04

How well do VLMs read Korean public-institution documents? KOLongDoc benchmark released

🔥 We have released KOLongDoc (https://github.com/Marker-Inc-Korea/KOLongDoc), a Korean long-document VLM benchmark! Multimodal AI such as ChatGPT, Claude, and Gemini has recently started to be used in public-sector and administrative work, but…

#multimodal-ai #benchmark #vision-language-models #korean-documents #document-understanding