Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action

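NVIDIA Cosmos 3 is here, and it is available on Hugging Face today. Cosmos 3 represents a major leap forward in world foundation models (WFMs) for physical AI: a single, unified omni-model that combines world generation, physical reasoning, and action generation. No more juggling different models and inference pipelines: Cosmos 3 handles all of it.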

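Whether you're building for robotics, autonomous vehicles, or smart spaces, Cosmos 3 gives you the foundation to simulate and understand the physical world.

Here's what's shipping with this release:

Cosmos 3 Super and Cosmos 3 Nano on Hugging Face, with model cards and licensing
Cosmos 3 Diffusers integration for generation pipelines
Post-training scripts for training Cosmos 3 on your own data (on GitHub)
Open synthetic data generation (SDG) datasets for physical AI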

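SECTION 1: What's new with Cosmos 3?

The biggest change in Cosmos 3 compared to previous Cosmos releases is that it is an omni-model built on a Mixture-of-Transformers (MoT) architecture. Previously, developers had to work with separate models for different capabilities: world generation (Cosmos Predict), controlled generation (Cosmos Transfer), scene understanding (Cosmos Reason), and policy generation (Cosmos Policy). Cosmos 3 enables all of this in a single model that can reason over and generate different modalities in one unified forward pass.

This means you can now do all of the following with one model:

Generate realistic and physically plausible video worlds from text, image, video, or action inputs
Reason about physical properties like motion, causality, and spatial relationships
Predict future video and action sequences based on the current state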

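Why this matters for physical AI

Cosmos 3 helps build physical AI systems capable of understanding the real world: not just pixels and tokens, but motion, causality, physics, and action. Whether you're training a robot to fold laundry, building an autonomous driving simulation, or generating synthetic training data for warehouse safety scenarios, Cosmos 3 is the foundation model designed for exactly these use cases.

Video generated by Cosmos 3 for robotics pick-and-place use cases.

Video generated by Cosmos 3 for long-tail driving scenarios.

Image-to-video generation using Cosmos 3 for warehouse safety data.

Cosmos 3 chain-of-thought reasoning in an autonomous driving application.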

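Architecture

Cosmos 3 is built on an MoT backbone that processes all modalities (text, image, video, audio, and action) within a single unified architecture. Each modality is first encoded by a dedicated encoder (a ViT for visual understanding, a VAE for visual/audio generation, and domain-aware vectors for actions), then projected into a shared representation space.

The input sequence is split into two subsequences: an autoregressive (AR) subsequence that handles reasoning and understanding via next-token prediction, and a diffusion (DM) subsequence that handles generation via iterative denoising. AR and DM tokens use separate parameter sets within each transformer layer but interact through joint attention. This is what lets a single model switch between acting as a VLM, a video generator, a forward/inverse dynamics model, or a robot policy without any architectural changes.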
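To make the "separate parameter sets, joint attention" idea concrete, here is a minimal toy sketch in pure Python. It is not the Cosmos 3 implementation: the function names, the single attention head, the tiny dimensions, and the random weights are all assumptions made for illustration, and it leaves out details a real layer would have (causal masking for the AR tokens, feed-forward blocks, normalization).

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(rng, dim):
    """One independent set of Q/K/V/output projections (hypothetical layout)."""
    def mat():
        return [[rng.gauss(0.0, 0.2) for _ in range(dim)] for _ in range(dim)]
    return {"q": mat(), "k": mat(), "v": mat(), "o": mat()}


def mot_joint_attention(tokens, kinds, params):
    """tokens: list of vectors; kinds: "ar" or "dm" per token;
    params: dict mapping each kind to its own projection matrices."""
    dim = len(tokens[0])
    # Each token is projected with the parameter set of its own subsequence.
    q = [matvec(params[kind]["q"], x) for x, kind in zip(tokens, kinds)]
    k = [matvec(params[kind]["k"], x) for x, kind in zip(tokens, kinds)]
    v = [matvec(params[kind]["v"], x) for x, kind in zip(tokens, kinds)]
    outputs = []
    for i, kind in enumerate(kinds):
        # Joint attention: every token scores against all tokens, AR and DM alike.
        scores = [sum(a * b for a, b in zip(q[i], k_j)) / math.sqrt(dim) for k_j in k]
        weights = softmax(scores)
        mixed = [sum(w * v_j[c] for w, v_j in zip(weights, v)) for c in range(dim)]
        # The output projection is again specific to the token's subsequence.
        outputs.append(matvec(params[kind]["o"], mixed))
    return outputs


rng = random.Random(0)
dim = 4
params = {"ar": init_params(rng, dim), "dm": init_params(rng, dim)}
tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(5)]
kinds = ["ar", "ar", "dm", "dm", "dm"]  # two AR tokens followed by three DM tokens
out = mot_joint_attention(tokens, kinds, params)
print(len(out), len(out[0]))
```

The point of the sketch is the routing: the projections are chosen per token by subsequence type, while the attention scores are computed over the whole sequence, so a change to a DM token also changes what the AR tokens see.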

Model Versions

This release of Cosmos 3 includes two model sizes, optimized for different deployment scenarios:

Cosmos 3 Nano - A 16B parameter model (8B reasoner and 8B generator), optimized for efficient inference. Cosmos 3 Nano is designed to run on workstation-grade compute such as the RTX PRO 6000 GPU, and is available on Hugging Face at nvidia/Cosmos3-Nano.
Cosmos 3 Super - A 64B parameter model (32B reasoner and 32B generator), designed for large-scale synthetic data generation (SDG) and research. It runs on NVIDIA Hopper and Blackwell GPUs and is available on Hugging Face at nvidia/Cosmos3-Super.
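As a rough sizing aid, the parameter counts above translate into a lower bound on weight memory. The sketch below assumes the weights are held in bfloat16 (2 bytes per parameter) and counts only the stated 16B and 64B parameters; it ignores activations, caches, and any encoder or VAE weights not included in those totals, so real memory use will be higher.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each parameter in 16 bits

MODEL_PARAMS = {
    "Cosmos 3 Nano": 16e9,   # 8B reasoner + 8B generator
    "Cosmos 3 Super": 64e9,  # 32B reasoner + 32B generator
}


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, num_params in MODEL_PARAMS.items():
    print(f"{name}: ~{weight_memory_gb(num_params):.0f} GB of weights in bfloat16")
```

Under these assumptions the weights alone come to about 32 GB for Nano and about 128 GB for Super, which is consistent with the split between workstation-grade and data-center GPUs described above.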

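SECTION 2: Cosmos 3 Capabilities

Cosmos 3 supports multiple input and generation modalities through a single unified model:

Input Modality	Output Modality	Application
Text | Image | Video	Video	Video Model
Text | Video	Text	Vision Language Model (VLM)
Action | Image | Text	Video	Forward Dynamics Model
Text | Video	Action	Inverse Dynamics Model
Image | Text	Video & Action	Policy Model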
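The table above can also be read as a small lookup structure. The sketch below encodes it as plain Python data and answers one question: which applications produce a given output modality. The `Capability` type and the `applications_producing` helper are hypothetical names for illustration and are not part of any Cosmos 3 API.

```python
from typing import NamedTuple


class Capability(NamedTuple):
    inputs: frozenset   # input modalities listed in the table row
    outputs: frozenset  # output modalities listed in the table row
    application: str


# One entry per table row, in the same order.
CAPABILITIES = [
    Capability(frozenset({"text", "image", "video"}), frozenset({"video"}), "Video Model"),
    Capability(frozenset({"text", "video"}), frozenset({"text"}), "Vision Language Model (VLM)"),
    Capability(frozenset({"action", "image", "text"}), frozenset({"video"}), "Forward Dynamics Model"),
    Capability(frozenset({"text", "video"}), frozenset({"action"}), "Inverse Dynamics Model"),
    Capability(frozenset({"image", "text"}), frozenset({"video", "action"}), "Policy Model"),
]


def applications_producing(modality):
    """Return the applications whose outputs include the given modality."""
    return [c.application for c in CAPABILITIES if modality in c.outputs]


print(applications_producing("action"))
print(applications_producing("video"))
```

Reading the table this way shows that action output comes from two modes (inverse dynamics and policy), while video output is shared by the video, forward dynamics, and policy modes.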

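Prompt Guide

For video generation, we recommend detailed prompts written as narrative paragraphs. For example:

The video begins with a view from inside a vehicle traveling on a multi-lane highway under a clear blue sky. The road is bordered by dense green trees on both sides, creating a tranquil environment. Several vehicles, including a prominent white semi-truck and various cars, are visible ahead, maintaining a steady pace. The highway features multiple lanes separated by concrete barriers, and the scene is bathed in bright sunlight, indicating a clear day. As the video progresses, a large amount of debris suddenly appears on the lane ahead. With little time to avoid it, the ego vehicle has to drive over the debris and continue moving forward. A noticeable jolt occurs as the ego vehicle passes over the scattered objects. A point-of-view shot from inside the vehicle, capturing the road ahead and the surrounding environment.

For action generation, prompts should be concise and provide spatial references. For example:

Put the pot to the left of the purple item. This video is captured from a first-person perspective looking at the scene.

The prompting guide on GitHub contains the prompt upsampling template and best practices for writing high-quality prompts.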
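The action-prompt guidance (concise, with a spatial reference) can be turned into a quick sanity check before running a batch of prompts. The sketch below is a heuristic only: the word budget and the list of spatial terms are illustrative assumptions, not values taken from the Cosmos 3 prompting guide, and `check_action_prompt` is a hypothetical helper.

```python
import re

# Hypothetical defaults chosen for illustration; tune them for your own data.
SPATIAL_TERMS = (
    "left", "right", "above", "below", "behind",
    "in front of", "next to", "on top of", "inside", "between",
)
MAX_ACTION_WORDS = 40


def check_action_prompt(prompt):
    """Return a list of warnings for an action-generation prompt."""
    warnings = []
    if len(prompt.split()) > MAX_ACTION_WORDS:
        warnings.append("prompt is longer than the word budget")
    lowered = prompt.lower()
    has_spatial_term = any(
        re.search(rf"\b{re.escape(term)}\b", lowered) for term in SPATIAL_TERMS
    )
    if not has_spatial_term:
        warnings.append("no spatial reference found")
    return warnings


example = (
    "Put the pot to the left of the purple item. "
    "This video is captured from a first-person perspective looking at the scene."
)
print(check_action_prompt(example))
print(check_action_prompt("Move the pot."))
```

The example prompt from the guide passes both checks, while a bare instruction such as "Move the pot." is flagged for lacking a spatial reference.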

SECTION 3: Using Cosmos 3 with Diffusers

Cosmos 3 is integrated with the Hugging Face Diffusers library, so world generation pipelines take only a few lines of code. You can run Cosmos 3 through the familiar DiffusionPipeline interface via Cosmos3OmniPipeline. The goal is frictionless adoption of Cosmos 3 and easy integration with your existing pipelines.

Here is a text-to-image example that generates a single frame with the Cosmos 3 Nano model:

import torch
from diffusers import Cosmos3OmniPipeline

# Load the Cosmos 3 Nano checkpoint in bfloat16 and place it on the GPU.
pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano", torch_dtype=torch.bfloat16, device_map="cuda"
)

# Detailed, narrative-style prompt describing the scene to generate.
prompt = (
    "A medium shot of a modern robotics research laboratory with white walls and a gray floor. "
    "A robotic arm with a metallic finish is mounted on a clean white workbench, its gripper positioned "
    "above a row of small colored objects. A laptop and neatly arranged tools sit beside the robot. "
    "A large monitor on the wall behind displays a software interface. The scene is brightly lit by "
    "overhead fluorescent lights."
)

# num_frames=1 requests a single frame, i.e. a text-to-image generation at 1280x720.
result = pipe(prompt=prompt, num_frames=1, height=720, width=1280)
# Save the generated frame as a JPEG.
result.video[0].save("cosmos3_t2i.jpg", format="JPEG", quality=85)

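Here is the image generated by the Cosmos 3 Nano model for the given prompt:

The documentation also has examples for text-to-video, image-to-video, and more. See the Cosmos 3 Diffusers documentation for details and API usage.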

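SECTION 4: Datasets for physical AI

As part of the Cosmos 3 launch, NVIDIA is releasing a set of synthetic data generation (SDG) datasets to help the physical AI community train and evaluate world foundation models. These datasets were generated by various NVIDIA teams and are available on Hugging Face.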

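SECTION 5: Cosmos Framework

Cosmos Framework is an end-to-end framework for training and serving WFMs like Cosmos 3. This is where you'll find the inference and post-training scripts, as well as agent skills for development.

Post-training Cosmos 3

Out of the box, Cosmos 3 understands and generates world videos and actions for robotics, autonomous vehicles, and smart spaces, but some applications may require further post-training on specific datasets to get the best results. We encourage post-training Cosmos 3 for different robots, environments, and tasks. Check out the post-training guide in the repo.

Agent Skills

The repo also comes with agent skills to make development fast and easy. These skills help validate requirements and set up the environment with its dependencies. You can also use them to learn about the repo structure and examples, draft good prompts, or run the inference and post-training scripts.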

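SECTION 6: Resources

Read the Cosmos 3 technical blog to learn about Cosmos 3 capabilities, performance, post-training, and deployment with NIM microservices.

Acknowledgments

Cosmos 3 is the result of an amazing collaboration between many teams and people across NVIDIA: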

Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alex Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski.

#physical-ai #ai-reasoning #omni-model #nvidia #machine-learning #robotics