Holo3.1: Fast & Local Computer Use Agents

Last March, we released Holo3, our state-of-the-art computer-use model. Adoption was immediate. Developers, enterprises, and partners started deploying Holo3 across a wide range of workflows, from browser automation and business software to internal tools and desktop applications. As adoption grew, we realized performance alone was no longer enough.

Users want to run the same computer-use capabilities across desktop and mobile environments, with seamless integration with different agent frameworks. They want deployment flexibility, from cloud inference to fully local execution on end-user devices.

This is why we are releasing the Holo3.1 family. Holo3.1 improves robustness across the three dimensions that matter most in production: environments (web, desktop, mobile), agent frameworks, and deployment targets. For the first time, we release quantized checkpoints optimized for local inference, including FP8, Q4 GGUF, and NVFP4.

Holo3.1 is a major step toward our vision of universal computer-use agents: systems that can operate across environments, integrate into any agent stack, and run wherever the workflow lives.

Computer Use Across GUI Environments and Agent Harnesses

Based on the Qwen family, Holo3.1 was designed to improve robustness across the environments where computer-use agents are actually deployed, while retaining state-of-the-art performance.

As teams moved Holo3 from evaluation to production, we repeatedly observed the same challenge: strong performance in one setting does not necessarily transfer to another. Mobile devices, alternative agent harnesses, and different execution frameworks all introduce their own sources of distribution shift.

Mobile Automation

Holo3.1 extends Holo3's capabilities beyond browser and desktop control, with major gains in mobile environments. On AndroidWorld, our 35B-A3B model improves from 67% to 79.3%, while the smaller 4B and 9B variants improve from 58% to 72%.

Cross-Harness Performance

To better support teams deploying Holo inside third-party agent stacks, Holo3.1 introduces native support for function-calling protocols in addition to the structured JSON outputs already available in Holo3.
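As a rough illustration of what supporting both protocols means for a harness, the sketch below normalizes a structured JSON reply and an OpenAI-style function call into the same action record. The `click` tool, its `x`/`y` parameters, and both payload shapes are hypothetical and chosen only for illustration; they are not Holo3.1's documented schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Harness-internal representation of one GUI action."""
    name: str
    arguments: dict


# Hypothetical tool definition in the widely used OpenAI-style "tools" format.
CLICK_TOOL = {
    "type": "function",
    "function": {
        "name": "click",
        "description": "Click at screen coordinates.",
        "parameters": {
            "type": "object",
            "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
            "required": ["x", "y"],
        },
    },
}


def from_structured_json(text: str) -> Action:
    """Parse a structured JSON reply such as {"action": "click", "x": 1, "y": 2}."""
    payload = json.loads(text)
    name = payload.pop("action")
    return Action(name=name, arguments=payload)


def from_function_call(tool_call: dict) -> Action:
    """Parse an OpenAI-style tool call, whose arguments arrive as a JSON string."""
    fn = tool_call["function"]
    return Action(name=fn["name"], arguments=json.loads(fn["arguments"]))


# Two hypothetical model replies describing the same click.
json_reply = '{"action": "click", "x": 120, "y": 340}'
tool_call = {
    "type": "function",
    "function": {"name": "click", "arguments": '{"x": 120, "y": 340}'},
}

same_action = from_structured_json(json_reply) == from_function_call(tool_call)
print(same_action)
```

Routing both formats through one `Action` type keeps the executor independent of the protocol a given agent stack happens to use.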

Across OSWorld and our internal benchmark suite covering e-commerce, business software, and collaboration workflows, function-calling and native execution now achieve near-parity performance. Holo3.1 also delivers more than a 25% improvement over Holo3 when evaluated inside our Holotab product harness.

Smaller Sizes for Cost-Performance Tradeoffs

To further enable local and on-device inference, we are also releasing smaller models (0.8B, 4B, and 9B) for cost-effective, private deployment, alongside the larger 35B-A3B model for state-of-the-art performance.

Performance versus cost for the Holo3.1 and Qwen 3.5 families. Overall performance averages the four H Corporate benchmarks first (so each family is equally weighted), then takes the mean across OSWorld, AndroidWorld, H Corporate, ScreenSpot-Pro, and OSWorld-G.
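The two-stage average described in the caption can be sketched as follows. All scores below are made-up placeholders, not reported results; they only show how collapsing the four H Corporate benchmarks first changes the outcome compared with a flat mean over all eight numbers.

```python
from statistics import mean


def overall_score(h_corporate: list[float], other: dict[str, float]) -> float:
    """Average the H Corporate benchmarks first, then average across families."""
    h_corporate_mean = mean(h_corporate)
    return mean([*other.values(), h_corporate_mean])


# Placeholder scores for illustration only.
h_corporate_scores = [40.0, 50.0, 60.0, 70.0]
other_scores = {
    "OSWorld": 60.0,
    "AndroidWorld": 70.0,
    "ScreenSpot-Pro": 50.0,
    "OSWorld-G": 65.0,
}

two_stage = overall_score(h_corporate_scores, other_scores)
flat = mean([*h_corporate_scores, *other_scores.values()])
print(f"two-stage: {two_stage:.3f}, flat: {flat:.3f}")
```

With these placeholders the flat mean gives the H Corporate suite four of eight votes, while the two-stage mean gives it one of five.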

Fast & Local Inference

This is our first release to ship quantized weights. We’re starting with 35B-A3B checkpoints, available in FP8, Q4 GGUF, and NVFP4.

For NVFP4, we used NVIDIA's Model Optimizer in a W4A16 configuration (4-bit weights, 16-bit activations). These checkpoints enable fast local inference for computer-use agents with little to no degradation in model performance: FP8 and NVFP4 achieve the same OSWorld score, only about two points below the full-precision BF16 checkpoint.
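To give a feel for why lower-precision weights matter for local inference, here is a back-of-the-envelope estimate of weight storage per format. It assumes the "35B" in the model name means roughly 35 billion total parameters and that every weight is stored at the nominal bit width. Real checkpoints also carry quantization scales and may keep some layers at higher precision, and the KV cache and activations are ignored, so actual footprints will be larger.

```python
# Nominal bits per stored weight; real formats add per-block scale overhead.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4, "Q4 GGUF": 4}

# Assumption: "35B" refers to about 35 billion total parameters.
TOTAL_PARAMS = 35e9


def weight_gigabytes(total_params: float, bits: int) -> float:
    """Weight-only storage in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits / 8 / 1e9


estimates = {
    name: weight_gigabytes(TOTAL_PARAMS, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}

for name, gigabytes in estimates.items():
    print(f"{name:8s} ~{gigabytes:.1f} GB of weights")
```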

The speedups are substantial: on DGX Spark, NVFP4 W4A16 delivers 1.41× the total token throughput of FP8 and 1.74× that of BF16.

Towards Local Agents on Consumer Hardware

We are also releasing Q4 GGUF checkpoints aimed at local deployment of computer-use agents on consumer hardware.

The agent itself runs locally on a Windows or Mac machine, while the model can run either on that same machine (we include reference numbers for Apple Silicon) or on a DGX Spark on the same network. In both cases, execution stays fully private and local, with nothing leaving the user's network.
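One way a harness could guard the "nothing leaves the network" property is to check the model endpoint before sending screenshots to it. The sketch below assumes the model is served over HTTP and accepts only loopback or private-range IP addresses. The example URLs and the function name are hypothetical, and hostnames other than `localhost` are rejected rather than resolved.

```python
import ipaddress
from urllib.parse import urlparse


def stays_on_local_network(base_url: str) -> bool:
    """Return True if the endpoint is on this machine or a private network."""
    host = urlparse(base_url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: resolving it is out of scope here, so treat as not local.
        return False
    return address.is_loopback or address.is_private


# Hypothetical endpoints: same machine, a box on the LAN, and a cloud API.
candidates = [
    "http://127.0.0.1:8000/v1",
    "http://192.168.1.50:8000/v1",
    "https://api.example.com/v1",
]

for url in candidates:
    print(url, stays_on_local_network(url))
```

Rejecting unresolved hostnames is a deliberately conservative choice; a real deployment might resolve the name and check every returned address instead.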

On Spark, the agent harness optimizations we developed with NVIDIA, combined with the NVFP4 quantization above, deliver a compound ~2× end-to-end speedup over the FP8 baseline, cutting average step time from 6.8 s to 3.3 s.
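The arithmetic behind the ~2× figure, using only the numbers quoted in this post, is sketched below. The last ratio is a loose reading rather than a reported result: it assumes the 1.41× token-throughput gain of NVFP4 over FP8 translates directly into step time and that the two effects multiply cleanly, which may not hold in practice.

```python
FP8_STEP_SECONDS = 6.8        # average step time, FP8 baseline (from the post)
NVFP4_STEP_SECONDS = 3.3      # average step time, optimized harness + NVFP4
QUANT_THROUGHPUT_GAIN = 1.41  # NVFP4 vs FP8 token throughput (from the post)


def speedup(before_seconds: float, after_seconds: float) -> float:
    """Ratio of old to new latency; values above 1 mean faster."""
    if before_seconds <= 0 or after_seconds <= 0:
        raise ValueError("latencies must be positive")
    return before_seconds / after_seconds


end_to_end = speedup(FP8_STEP_SECONDS, NVFP4_STEP_SECONDS)

# Assumption: effects compound multiplicatively, so the remainder after
# dividing out the quantization gain is attributed to the harness changes.
implied_harness_gain = end_to_end / QUANT_THROUGHPUT_GAIN

steps_per_minute_before = 60 / FP8_STEP_SECONDS
steps_per_minute_after = 60 / NVFP4_STEP_SECONDS

print(f"end-to-end speedup: {end_to_end:.2f}x")
print(f"implied harness-only gain: {implied_harness_gain:.2f}x")
print(f"steps per minute: {steps_per_minute_before:.1f} -> {steps_per_minute_after:.1f}")
```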

Agent request rate across platforms and precisions. On DGX Spark, vLLM with NVFP4 achieves the highest request rate in both Default and Fast modes, followed by Q4 GGUF and FP8.

These improvements and more will land in an upcoming desktop agent harness.

Availability

The Holo3.1 family is available in four sizes:

Model	Deployment Target
Holo3.1-0.8B	Ultra-lightweight local agents
Holo3.1-4B	Cost-efficient deployment
Holo3.1-9B	Balanced performance and latency
Holo3.1-35B-A3B	State-of-the-art performance

We are also releasing optimized FP8, NVFP4, and Q4 GGUF checkpoints for local and edge deployment.

Get Started

We look forward to seeing what developers build with Holo3.1.
