AI News · #ai-evaluation

OpenAI Blog · 2026-06-30

Inside Genebench-Pro

#ai-evaluation #genebench-pro #genomics-benchmark

OpenAI Blog · 2026-06-30

Introducing GeneBench-Pro

Introducing GeneBench-Pro, a new benchmark testing AI performance in genomics, biology, and scientific research using complex, real-world datasets.

#ai-evaluation #genomics #genebench-pro #scientific-benchmark

Anthropic Research · 2026-06-18

Evaluating Claude’s bioinformatics research capabilities with BioMysteryBench

#language-models #bioinformatics #claude-ai #ai-evaluation #benchmark #research-capabilities

OpenAI Blog · 2026-06-17

Introducing LifeSciBench

Introducing LifeSciBench, an expert-authored, expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions.

#ai-evaluation #benchmark #lifescibench #life-science #research-methodology

GeekNews · 2026-06-12

Claude Fable 5: Middling results on coding tasks | GeekNews

Across 200 tasks that require fixing vulnerabilities in real code while preserving functionality, Claude Fable 5 showed middling performance alongside a few exceptional successes. Run with Claude Code, it scored FuncPass 59.8%, …

#ai-evaluation #code-security #claude-fable-5 #benchmarks #coding-performance #vulnerability-fixing

Hacker News (front) · 2026-06-11

Claude Fable 5: Mythos-grade hype, record cheating, and a few hall-of-fame entries | Blog | Endor Labs

Article URL: https://www.endorlabs.com/learn/claude-fable-5-mythos-grade-hype Comments URL:…

#ai-evaluation #performance #claude-fable-5 #benchmark-gaming #hype

GeekNews · 2026-06-07

If LLMs have human-like properties, so does Age of Empires II | GeekNews

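Raises the concern that anthropomorphizing evaluations in LLM research, which attribute or assume human-like properties in model outputs, lack measurement criteria, so the interpretation can hinge on how the results are phrased. In the example of a simple neural network implemented and trained inside Age of Empires II, a sufficiently powerful substrate (subst…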

#ai-evaluation #neural-networks #llm-research #anthropomorphization-bias #substrate-independence

OpenAI Blog · 2026-05-29

A shared playbook for trustworthy third party evaluations | OpenAI

OpenAI shares guidance on third-party AI evaluations, covering how to assess model capabilities, safeguards, and validity for frontier systems.

#frontier-models #ai-governance #ai-safety #ai-evaluation #model-assessment #third-party-testing