Hugging Face Blog · 2026-05-11


Building Blocks for Foundation Model Training and Inference on AWS


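For a long time, "scaling" in foundation models mostly meant one thing: spend more compute on pre-training and capabilities rise. That intuition was supported by empirical work such as Kaplan et al. (2020), which reported predictable power-law trends in loss as model parameters, dataset size, and training compute scale. In practice, these trends justified sustained investment in large-scale accelerator capacity and the surrounding distributed infrastructure needed to keep it efficiently utilized. But the frontier has evolved, and scaling is no longer a single curve. NVIDIA's "from one to three scaling laws" framing usefully emphasizes that, beyond pre-training, performance increasingly scales through post-training (e.g., supervised fine-tuning (SFT) and reinforcement learning (RL)-based methods) and through test-time compute ("long thinking," search/verification, multi-sample strategies).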

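Figure: Adapted from "AI's Three Scaling Laws, Explained" (NVIDIA Blog).

Taken together, these scaling regimes push the foundation model lifecycle (pre-training, post-training, and inference) toward convergent infrastructure requirements: tightly coupled accelerator compute, a high-bandwidth low-latency network, and a distributed storage backend. They also raise the importance of orchestration for resource management, and of application- and hardware-level observability to maintain cluster health and diagnose performance pathologies at scale.

Another key trend is the increasing reliance of the foundation model lifecycle on an open-source software (OSS) ecosystem that spans model development frameworks, cluster resource management, and operational tooling. At the cluster layer, resource management is typically provided by systems such as Slurm and Kubernetes. Model development and distributed training are commonly implemented in frameworks such as PyTorch and JAX. Monitoring and visualization (that is, observability) are often achieved using Prometheus for metrics collection and Grafana for visualization and alerting, positioned as an operational layer atop infrastructure and resource management. Figure 1 illustrates this layered architecture: hardware infrastructure supports resource orchestration, which in turn enables ML frameworks, with observability spanning all layers.

Figure 1: The layered architecture of open-source software stacks for foundation model training and inference

This post is intended for machine learning engineers and researchers involved in foundation model training and inference, with particular attention to workflows built atop OSS frameworks. It analyzes how AWS infrastructure (including multi-node accelerator compute, high-bandwidth low-latency networking, distributed shared storage, and associated managed services) interacts with common OSS stacks across the foundation model lifecycle. The primary goal is to provide a technical foundation for understanding systems bottlenecks and scaling characteristics spanning pre-training, post-training, and inference. This introductory post surfaces the overall system architecture, emphasizing the integration points between AWS infrastructure components and OSS tools that underpin large-scale distributed training and inference.

The AWS Building Blocks

The remainder of this series examines how this layered architecture is realized on AWS, progressing through infrastructure, resource orchestration, the ML software stack, and observability. The following sections preview each layer.

Infrastructure: Compute, Network, and Storage

As illustrated in Figure 1, infrastructure is anchored by three coupled building blocks: accelerated compute with large device memory, wide-bandwidth interconnect for collective communication, and scalable distributed storage for data and checkpoints.

Accelerated compute forms the foundation of large-scale foundation model pre-training, post-training, and inference. AWS offers several generations of NVIDIA GPUs as part of its Amazon EC2 accelerated computing instances, including the Amazon EC2 P instance family. The P5 instance family includes p5.48xlarge with eight NVIDIA H100 GPUs, p5.4xlarge with a single H100 GPU for smaller-scale workloads, and the p5e.48xlarge/p5en.48xlarge variants with NVIDIA H200 GPUs. The P6 instance family introduces the NVIDIA Blackwell architecture, with B200 GPUs in p6-b200.48xlarge and Blackwell Ultra B300 GPUs in p6-b300.48xlarge. Across these generations, the dominant scaling axes are peak Tensor throughput, HBM capacity and bandwidth, and interconnect bandwidth (within and across nodes).

As a first-order approximation, peak Tensor Core throughput, measured in floating-point operations per second (FLOPS), helps situate these accelerators on a common axis. The table below summarizes per-GPU peak throughput for dense BF16/FP16, FP8, and FP4 Tensor operations, along with HBM capacity and HBM bandwidth, using SXM/HGX-class specifications that align with NVSwitch/NVLink-based multi-GPU nodes.

GPU (representative variant)	BF16/FP16 Tensor peak (dense)	FP8 Tensor peak (dense)	FP4 Tensor peak (dense)	HBM capacity	HBM bandwidth
H100 (SXM)	0.9895 PFLOPS	1.979 PFLOPS	n/a	80 GB HBM3	3.35 TB/s
H200 (SXM)	0.9895 PFLOPS	1.979 PFLOPS	n/a	141 GB HBM3e	4.8 TB/s
B200 (HGX, per GPU)	2.25 PFLOPS	4.5 PFLOPS	9 PFLOPS	180 GB HBM3e	8 TB/s
B300 (HGX, per GPU)	2.25 PFLOPS	4.5 PFLOPS	13.5 PFLOPS	288 GB HBM3e	8 TB/s

Note: NVIDIA product tables often report Tensor throughput "with sparsity"; this table reports dense throughput. Where applicable, dense throughput is taken as half of sparse throughput, following NVIDIA's guidance for HGX-class platforms. DGX figures are system-level; the B200 HBM capacity and bandwidth values are expressed per GPU by dividing NVIDIA's DGX totals by eight.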
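One way to read the table is through the ratio of peak compute to memory bandwidth. The sketch below copies the dense BF16 and HBM bandwidth values from the table and computes FLOPs per byte for each GPU. The `machine_balance` helper is a name made up for this illustration, and the inputs are peak figures, so the result is an upper-bound ratio rather than a measured one.

```python
# Values copied from the table above:
# dense BF16 peak in PFLOPS and HBM bandwidth in TB/s.
GPUS = {
    "H100": {"bf16_pflops": 0.9895, "hbm_tbps": 3.35},
    "H200": {"bf16_pflops": 0.9895, "hbm_tbps": 4.8},
    "B200": {"bf16_pflops": 2.25, "hbm_tbps": 8.0},
    "B300": {"bf16_pflops": 2.25, "hbm_tbps": 8.0},
}


def machine_balance(bf16_pflops: float, hbm_tbps: float) -> float:
    """Peak FLOPs available per byte moved to or from HBM (hypothetical helper)."""
    return (bf16_pflops * 1e15) / (hbm_tbps * 1e12)


for name, spec in GPUS.items():
    print(f"{name}: {machine_balance(**spec):.0f} FLOPs/byte")
```

Under a simple roofline view, a kernel whose arithmetic intensity falls below this ratio is limited by HBM bandwidth rather than by Tensor Core throughput, which is why HBM bandwidth appears next to FLOPS in the table.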

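As models scale, step time is often dominated by collective communication and memory movement rather than raw compute throughput, which motivates explicit scale-up and scale-out bandwidth accounting. For the multi-GPU instances, GPU communication spans two regimes. Internal scale-up (NVLink/NVSwitch) provides high-bandwidth, low-latency GPU-to-GPU connectivity within a node, enabling collectives such as all-reduce and all-gather to execute without traversing the host networking stack. External scale-out (EFA) provides OS-bypass networking across nodes, which AWS uses as a building block for Amazon EC2 UltraClusters, where communication-heavy collectives span thousands of instances. The following table summarizes key specifications across these instance types:

Instance type	GPU	GPU count	GPU memory	NVLink	NVLink BW (aggregate)	EFA	EFA BW (aggregate)
p5.4xlarge	H100	1	80 GB HBM3	n/a	n/a	v2	12.5 GB/s
p5.48xlarge	H100	8	640 GB HBM3	4th gen	7.2 TB/s	v2	400 GB/s
p5e.48xlarge	H200	8	1,128 GB HBM3e	4th gen	7.2 TB/s	v2	400 GB/s
p5en.48xlarge	H200	8	1,128 GB HBM3e	4th gen	7.2 TB/s	v3	400 GB/s
p6-b200.48xlarge	B200	8	1,440 GB HBM3e	5th gen	14.4 TB/s	v4	400 GB/s
p6-b300.48xlarge	B300	8	2,100 GB HBM3e	5th gen	14.4 TB/s	v4	800 GB/s

Note: EFA bandwidth is converted from Gbps to GB/s (÷8) for consistency with the other bandwidth metrics; see the EC2 accelerated computing networking specifications. NVLink and EFA bandwidth figures are aggregate per-instance values rather than per-link values; see the P5 instance family page and the P6 instance family page for the corresponding intra-node interconnect and networking characteristics.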
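To make the bandwidth accounting concrete, the sketch below applies the ÷8 conversion from the note and then estimates a bandwidth-only lower bound for a ring all-reduce across nodes. It treats each node as a single rank with the aggregate EFA bandwidth, which is a simplifying assumption. The 10 GB gradient size and the 16-node count are hypothetical inputs, and both function names are invented for this example.

```python
def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (divide by 8)."""
    return gbps / 8.0


def ring_allreduce_seconds(message_bytes: float, n_ranks: int,
                           link_gbytes_per_s: float) -> float:
    """Bandwidth-only lower bound for a ring all-reduce.

    Each rank sends (and receives) 2 * (n - 1) / n of the message size.
    Latency and protocol overhead are ignored, so real collectives are slower.
    """
    if n_ranks < 2:
        return 0.0
    traffic_bytes = 2.0 * (n_ranks - 1) / n_ranks * message_bytes
    return traffic_bytes / (link_gbytes_per_s * 1e9)


# p5.48xlarge: 400 GB/s in the table corresponds to 3,200 Gbps aggregate EFA.
efa_p5 = gbps_to_gbytes_per_s(3200)

# Hypothetical example: 10 GB of gradients reduced across 16 nodes.
t = ring_allreduce_seconds(10e9, n_ranks=16, link_gbytes_per_s=efa_p5)
print(f"EFA aggregate: {efa_p5:.0f} GB/s, all-reduce lower bound: {t * 1e3:.1f} ms")
```

The same function can be called with the NVLink aggregate figures to compare the scale-up and scale-out regimes for a given message size.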

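Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 that provides OS-bypass remote direct memory access (RDMA) capability using the Scalable Reliable Datagram (SRD) protocol. By letting applications communicate directly with the network device through the Libfabric API, bypassing the operating system kernel, EFA reduces latency and improves throughput for collective operations in distributed training.

Multiple generations of EFA are available on different instance families. Amazon EC2 P5 and P5e instances are equipped with EFA version 2 (EFAv2). EFA version 3 (EFAv3), provided on P5en instances, reduces packet latency by approximately 35% compared to EFAv2. EFA version 4 (EFAv4), available on P6 instances, delivers an additional 18% improvement in collective communication performance relative to EFAv3.

At scale, both distributed training (streaming corpora and writing multi-terabyte checkpoints) and large-scale inference (staging weights and managing KV cache growth) motivate a tiered storage hierarchy: local NVMe SSD for hot data, Lustre for shared high-throughput access, and Amazon S3 for durable persistence.

On the primary multi-GPU instances in this series, local NVMe is provided as instance store (ephemeral) with 30.72 TB of raw capacity (8 × 3.84 TB NVMe SSD); see the EC2 accelerated computing instance store specifications.

Lustre is an open-source, POSIX-compliant distributed file system widely used in high-performance computing (HPC) to provide a shared namespace with high aggregate throughput across many clients. Amazon FSx for Lustre provides Lustre as a fully managed service and exposes it as a parallel file system capable of terabytes per second of throughput, millions of IOPS, and sub-millisecond latencies. Data Repository Associations enable integration with Amazon S3, supporting lazy loading of training datasets and automatic checkpoint export for durability.

At cluster scale, these instances are deployed in Amazon EC2 UltraClusters, which provision thousands of accelerated instances as a single, tightly placed cluster within an Availability Zone and interconnect them using a petabit-scale nonblocking network.

Figure: 2nd-generation Amazon EC2 UltraClusters (example P5 UltraCluster).

For workloads with high per-step communication intensity (e.g., expert parallelism in MoE models, where all-to-all token dispatch spans many GPUs), the size of the NVLink domain can become a first-order constraint. As an extension of the internal scale-up axis, a larger NVLink domain reduces how often performance-critical communication must leave the NVLink fabric.

Amazon EC2 UltraServers extend the NVLink domain beyond a single EC2 instance by connecting multiple component instances through a dedicated accelerator interconnect. AWS reports that P6e-GB200 UltraServers are built on the NVIDIA GB200 NVL72 platform and expose up to 72 Blackwell GPUs and 13.4 TB of aggregate HBM3e within one NVLink domain. At larger scales, EFA remains the cross-node fabric for multi-UltraServer jobs, while the larger intra-domain GPU count keeps more of the performance-critical communication on NVLink.

These systems are built from NVIDIA Grace Blackwell superchips, which couple Grace CPU memory and Blackwell GPU HBM via cache-coherent NVLink-C2C, enabling direct access across CPU- and GPU-attached memory without explicit host-device copies. In practice, this can extend the effective memory available to GPU workloads (e.g., by placing colder model state or KV cache in CPU-attached memory) while avoiding PCIe-scale copy overheads, albeit with higher latency and lower bandwidth than local HBM.

The component instance type for P6e-GB200 UltraServers is p6e-gb200.36xlarge, which provides four GPUs and Elastic Fabric Adapter (EFA) v4 networking. The table below summarizes the composed UltraServer configurations.

Note: The per-instance p6e-gb200.36xlarge EFA bandwidth is the published aggregate EFA networking (4 × 400 Gbps) converted to GB/s (÷8); see the EC2 accelerated computing networking specifications.

UltraServer	Component instance type	GPUs (NVLink domain)	HBM3e (aggregate)	EFA	EFA BW
u-p6e-gb200x36	p6e-gb200.36xlarge	36	6.7 TB	v4	1,800 GB/s
u-p6e-gb200x72	p6e-gb200.36xlarge	72	13.4 TB	v4	3,600 GB/s

Note: UltraServer EFA bandwidth is converted from terabits per second (Tbps), as reported by AWS, to GB/s (÷8); see the P6e-GB200 UltraServers announcement and the P6 instance family page.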
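The table rows can be cross-checked with a few lines of arithmetic. The sketch below assumes, as the figures suggest, that an UltraServer's EFA bandwidth is simply the sum over its component instances. The `ultraserver` function and its return keys are invented for this example.

```python
GPUS_PER_INSTANCE = 4             # p6e-gb200.36xlarge provides four GPUs
EFA_GBPS_PER_INSTANCE = 4 * 400   # published aggregate EFA networking per instance


def ultraserver(gpus_in_domain: int) -> dict:
    """Derive instance count and aggregate EFA bandwidth for an NVLink domain.

    Assumption: aggregate EFA bandwidth is the sum over component instances.
    """
    if gpus_in_domain % GPUS_PER_INSTANCE:
        raise ValueError("GPU count must be a multiple of the per-instance GPU count")
    instances = gpus_in_domain // GPUS_PER_INSTANCE
    efa_gbps = instances * EFA_GBPS_PER_INSTANCE
    return {
        "instances": instances,
        "efa_gbps": efa_gbps,
        "efa_gbytes_per_s": efa_gbps / 8,  # same ÷8 conversion as the notes
    }


for gpus in (36, 72):
    print(gpus, ultraserver(gpus))
```

With these assumptions the 36-GPU and 72-GPU configurations come out to 9 and 18 component instances, and the derived GB/s values match the table.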

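Resource Orchestration: Slurm and Kubernetes

When training spans hundreds or thousands of accelerators, manual resource management becomes intractable. For example, a training job requiring 512 GPUs must co-schedule 64 eight-GPU nodes (P instances) simultaneously and release resources atomically upon completion or failure. Both Slurm and Kubernetes address this challenge through a control-plane architecture: a centralized scheduler maintains cluster state and makes allocation decisions, while worker nodes execute assigned workloads.

Figure 2: High-level architecture of Slurm-based and Kubernetes-based resource orchestration on AWS

Slurm (Simple Linux Utility for Resource Management) is the dominant workload manager in high-performance computing, built on a modular plugin architecture that allows the scheduling algorithm, topology model, resource types, and accounting backend to be configured independently. Its scheduling model organizes resources into partitions (logical groupings of nodes), accepts job submissions via sbatch, and launches parallel tasks via srun with synchronized startup across allocated nodes. Critically for distributed training, Slurm schedules at the job level: it allocates an entire multi-node job atomically before any task launches. A backfill scheduler starts lower-priority jobs in idle slots without delaying higher-priority ones, while a multi-factor priority system weighs fair-share usage, job age, and QOS tiers to order the queue across tenants. Slurm also supports topology-aware placement through plugins that model network switch hierarchies (on AWS, encoding the EFA fabric topology to co-locate jobs on nodes with minimal switch hops) and native GPU scheduling through its Generic Resource (GRES) interface, which tracks GPU types and enforces device affinity.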
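Because srun starts every task of a job together, a training script can derive its distributed rank from the environment Slurm sets for each task. The sketch below reads the standard `SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`, and `SLURM_NODEID` variables. The `slurm_rank_info` helper is a name chosen for this example, and the code assumes one task per GPU, which is a common convention rather than a requirement.

```python
import os


def slurm_rank_info(env=os.environ) -> dict:
    """Map Slurm's per-task environment variables to distributed-training ranks.

    Falls back to single-process defaults when run outside of srun.
    """
    return {
        "rank": int(env.get("SLURM_PROCID", 0)),         # global task index
        "world_size": int(env.get("SLURM_NTASKS", 1)),   # total tasks in the job
        "local_rank": int(env.get("SLURM_LOCALID", 0)),  # task index on this node
        "node_rank": int(env.get("SLURM_NODEID", 0)),    # node index in the allocation
    }


# Simulated environment for the last task of the 512-GPU example above
# (64 nodes with 8 tasks each); a real job would use os.environ directly.
fake_env = {
    "SLURM_PROCID": "511",
    "SLURM_NTASKS": "512",
    "SLURM_LOCALID": "7",
    "SLURM_NODEID": "63",
}
print(slurm_rank_info(fake_env))
```

The resulting values are what a launcher would typically pass on to the process-group initialization of the ML framework.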

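AWS provides multiple deployment options for Slurm-based orchestration. AWS ParallelCluster is an open-source cluster management tool that automates the deployment of Slurm clusters on EC2, handling head node provisioning, compute fleet scaling, and integration with shared storage. AWS Parallel Computing Service (PCS) offers an alternative that provides a managed control plane. For distributed training workloads specifically, Amazon SageMaker HyperPod supports a Slurm mode with additional capabilities tailored to large-scale training, such as continuous node health monitoring and job auto-resume.

Kubernetes takes a declarative, API-driven approach: users specify desired state through resource manifests, and controllers reconcile actual state to match. While Kubernetes excels at model deployment, its native scheduling model exposes several gaps for tightly coupled distributed training. Kubernetes schedules at the pod level; without job-level atomicity, a multi-node training job can partially start (some ranks running while others remain Pending), wasting GPUs or causing deadlocks. Vanilla Kubernetes also lacks batch queue semantics with priority-based backfill, and it has no built-in awareness of network fabric topology (NVLink domains, EFA interconnects) for placing communication-heavy collectives.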
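The partial-start problem can be shown with a toy model. The sketch below is not how the Kubernetes scheduler or any gang scheduler is implemented. It compares pod-by-pod placement with all-or-nothing admission for two hypothetical jobs, each pod needing one GPU, and the round-robin interleaving is a deliberately unlucky ordering chosen for illustration.

```python
def pod_level_schedule(free_gpus: int, jobs: dict[str, int]) -> dict[str, int]:
    """Place pods one at a time, interleaved across jobs, with no gang constraint.

    Returns how many pods of each job were placed.
    """
    placed = {name: 0 for name in jobs}
    progress = True
    while free_gpus > 0 and progress:
        progress = False
        for name, needed in jobs.items():
            if free_gpus > 0 and placed[name] < needed:
                placed[name] += 1
                free_gpus -= 1
                progress = True
    return placed


def gang_schedule(free_gpus: int, jobs: dict[str, int]) -> dict[str, int]:
    """Admit a job only if all of its pods fit; otherwise it stays queued."""
    placed = {}
    for name, needed in jobs.items():
        if needed <= free_gpus:
            placed[name] = needed
            free_gpus -= needed
        else:
            placed[name] = 0
    return placed


jobs = {"job-a": 6, "job-b": 6}  # hypothetical jobs, 6 single-GPU pods each
print("pod-level:", pod_level_schedule(8, jobs))
print("gang:     ", gang_schedule(8, jobs))
```

In the pod-level case both jobs hold GPUs but neither has all of its ranks, so neither can make progress; with gang admission one job runs to completion while the other waits in the queue.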

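Several Kubernetes-native projects address these gaps at different layers. Kueue operates as an admission controller atop the default scheduler, managing job-level gang admission, multi-tenant quotas with hierarchical fair sharing, and priority-based preemption, while delegating pod placement to the underlying scheduler. Volcano and NVIDIA KAI Scheduler take a different approach, replacing or augmenting the default scheduler to integrate gang scheduling directly with topology-aware pod placement: Volcano as a general-purpose batch scheduler, KAI Scheduler with deep NVLink/NVSwitch awareness for GPU-optimized placement. These layers are complementary: Kueue can manage admission and quota policy while passing admitted jobs to a topology-aware scheduler for placement.

For Kubernetes-based orchestration on AWS, Amazon Elastic Kubernetes Service (EKS) provides managed Kubernetes with GPU scheduling via the NVIDIA device plugin. Amazon SageMaker HyperPod also supports an EKS mode, combining Kubernetes orchestration with HyperPod's training-specific capabilities. HyperPod EKS extends EKS with features designed for foundation model training at scale. Task governance provides compute allocation and policy enforcement across teams, integrating managed Kueue for admission control and Karpenter for just-in-time node provisioning. Checkpointless training addresses the recovery latency inherent in traditional checkpoint-based fault tolerance. Rather than periodically serializing model state to shared storage, checkpointless training maintains continuous peer-to-peer state replication across GPUs. When a failure occurs, surviving nodes reconstruct the lost state through EFA-based communication rather than reading multi-terabyte checkpoints from FSx for Lustre or S3. Elastic training enables jobs to scale automatically based on resource availability. When additional accelerators become available (e.g., from completed jobs or newly provisioned capacity), elastic jobs can expand to use them; when higher-priority workloads require resources, jobs can contract while maintaining training progress.

The ML Software Stack

Distributed training and inference involve multiple software layers that must be correctly configured and tuned. A useful mental model treats the runtime stack as five layers, ordered from hardware-adjacent components (which must work correctly for anything to run) to framework-level abstractions (which determine programmer productivity and model throughput): hardware enablement, accelerator runtime and math libraries, communication substrate, ML framework, and distributed training/inference frameworks.

Figure 3: The ML software stack for distributed training and inference on EC2 instances

Hardware Enablement: Kernel Drivers

At the foundation, Linux kernel drivers provide direct hardware access. The NVIDIA GPU driver exposes compute capabilities and supports GPUDirect RDMA, enabling direct data transfer between GPUs and network adapters. The GDRCopy driver (gdrdrv) enables low-latency, CPU-initiated copies to and from GPU memory, which NCCL uses for small-message transfers. The EFA driver provides OS-bypass networking through the libfabric API, and the Lustre client driver enables POSIX access to FSx for Lustre parallel file systems.

Accelerator Runtime, Compilers, and Kernel Libraries

The CUDA platform provides the programming model and runtime for GPU computing. Applications compiled against CUDA can launch kernels on NVIDIA GPUs, manage device memory, and coordinate execution across multiple devices. The current release line is CUDA Toolkit 13.x, which supports the Blackwell architecture (compute capability 10.x).

Modern training and inference performance is increasingly driven by specialized optimization libraries and custom kernels rather than generic vendor primitives. Kernels such as FlashAttention fuse attention into a single memory-efficient pass, reducing HBM traffic and improving throughput. Many teams also write shape- and precision-specialized fused kernels (e.g., layernorm/residual/activation, quantized GEMM, MoE dispatch, KV cache operations) tailored to their exact models. This is made possible by programmable toolchains such as Triton (a Python GPU kernel compiler) and NVIDIA's CuTe (a tensor layout and warp-level DSL), while libraries such as CUTLASS provide highly optimized GEMM and fusion building blocks. In practice, this kernel and compiler layer often determines end-to-end performance as much as the ML framework does.

Communication Substrate: NCCL and Transport Plugins

Multi-GPU training depends on efficient collective communication. The NVIDIA Collective Communications Library (NCCL) implements collective operations (all-reduce, all-gather, reduce-scatter, all-to-all, and broadcast) as well as point-to-point send/receive. Its topology-aware algorithms use NVLink for intra-node communication and network transports for inter-node traffic. NCCL detects the communication topology dynamically and selects ring or tree algorithms based on message size and available bandwidth. Data-parallel and tensor-parallel strategies rely mainly on all-reduce and all-gather, whereas Mixture-of-Experts (MoE) models with expert parallelism rely on all-to-all collectives to route tokens between GPUs: a dispatch all-to-all sends each token to the GPU hosting its assigned expert, and a combine all-to-all returns expert outputs to the originating GPU (NVIDIA Developer Blog). Because every GPU exchanges data with every other GPU in the expert-parallel group, all-to-all communication volume scales with the number of experts and can become the dominant bottleneck at high expert-parallel degrees.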
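To illustrate what a ring algorithm does, the following pure-Python simulation runs an all-reduce as a reduce-scatter phase followed by an all-gather phase. It is a teaching sketch, not NCCL's implementation: ranks are list indices, each rank's buffer holds exactly one value per chunk, and "sending" is a list assignment.

```python
def ring_all_reduce(buffers: list[list[float]]) -> list[list[float]]:
    """Simulate a sum all-reduce over n ranks arranged in a ring.

    buffers[r] is rank r's input; each buffer must have exactly n elements
    (one element per chunk) to keep the toy version short.
    """
    n = len(buffers)
    if any(len(b) != n for b in buffers):
        raise ValueError("toy version expects one element per chunk per rank")
    data = [list(b) for b in buffers]

    # Phase 1, reduce-scatter: after n - 1 steps, rank r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all sends first so the step behaves as if simultaneous.
        sends = [((r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            data[(r + 1) % n][chunk] += value

    # Phase 2, all-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            data[(r + 1) % n][chunk] = value

    return data


inputs = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
print(ring_all_reduce(inputs))
```

Each rank sends only one chunk per step to its ring neighbor, which is why the per-rank traffic of a ring all-reduce stays close to twice the message size regardless of the number of ranks.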

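On AWS, NCCL's inter-node communication is enabled by the aws-ofi-nccl plugin, which maps NCCL's transport API to the libfabric interface. This lets NCCL take advantage of EFA's OS-bypass and Scalable Reliable Datagram (SRD) protocol without application changes.

For inference workloads, collective operations do not cover every communication pattern. Disaggregated inference architectures (which separate the prefill and decode phases into distinct GPU pools) require efficient point-to-point data movement, in particular for transferring KV cache state between instances. The NVIDIA Inference Xfer Library (NIXL) addresses this requirement by providing a unified API for point-to-point transfers across memory tiers (HBM, DRAM, NVMe, distributed storage) and interconnects (NVLink, InfiniBand, Ethernet). NIXL integrates with inference frameworks such as NVIDIA Dynamo and supports backends including UCX and GPUDirect Storage.

ML Framework: PyTorch

The two dominant frameworks for foundation model development are PyTorch and JAX. JAX takes an SPMD (Single Program, Multiple Data) approach through XLA, in which the same program runs across devices with automatic data sharding and lowering of collectives. This blog focuses on PyTorch, which sees broader adoption in the open-source ecosystem and forms the foundation of the distributed training and inference frameworks discussed below.

PyTorch provides tensor computation with GPU acceleration, automatic differentiation, and a flexible eager execution model. For distributed workloads, PyTorch's torch.distributed module provides the core primitives: process groups for collective communication, and data-parallel abstractions including Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP2). DDP replicates the model across GPUs and synchronizes gradients via all-reduce, whereas FSDP2 uses techniques from the ZeRO algorithm to shard parameters, gradients, and optimizer states across workers, enabling training of models that exceed the memory capacity of a single GPU.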
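The memory difference between replication and sharding can be estimated with back-of-the-envelope arithmetic. The sketch below uses the 16-bytes-per-parameter accounting for mixed-precision Adam from the ZeRO paper and ignores activations, buffers, and fragmentation. The 70-billion-parameter model and the 64-GPU job are hypothetical inputs, and `per_gpu_state_gb` is a name invented for this example.

```python
def per_gpu_state_gb(n_params: float, n_gpus: int, sharded: bool) -> float:
    """Approximate per-GPU memory (GB) for parameters, gradients, and optimizer state.

    Assumes mixed-precision Adam: 2 bytes (16-bit parameters) + 2 bytes
    (16-bit gradients) + 12 bytes (fp32 master weights, momentum, variance)
    per parameter. Activations and temporary buffers are not counted.
    """
    bytes_per_param = 2 + 2 + 12
    total_bytes = n_params * bytes_per_param
    if sharded:
        total_bytes /= n_gpus  # full sharding splits all three state types
    return total_bytes / 1e9


n_params = 70e9  # hypothetical 70B-parameter model
replicated = per_gpu_state_gb(n_params, n_gpus=64, sharded=False)
fully_sharded = per_gpu_state_gb(n_params, n_gpus=64, sharded=True)
print(f"replicated (DDP-style): {replicated:.1f} GB per GPU")
print(f"fully sharded:          {fully_sharded:.1f} GB per GPU")
```

Under these assumptions the replicated state is far larger than the 80 GB of HBM on an H100, while the fully sharded state fits with room left for activations, which is the situation sharded data parallelism is designed for.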

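Distributed Training and Inference Frameworks

The top layer consists of frameworks built on PyTorch that provide higher-level abstractions for distributed training and inference at scale. For training, three categories of frameworks address different points on the complexity-performance trade-off. A few examples follow.

Hugging Face Transformers provides a Trainer class with built-in distributed training support through Accelerate, which abstracts over DDP, FSDP, and DeepSpeed. This path prioritizes ease of use and broad model compatibility, making it well suited to fine-tuning and mid-scale training where configuration simplicity matters more than maximum throughput.

NVIDIA Megatron Core targets maximum efficiency at scale, implementing 3D parallelism (tensor, pipeline, and expert parallelism) along with optimizations including FP8 mixed precision through Transformer Engine. NeMo Framework builds on Megatron Core to provide end-to-end workflows for pre-training and fine-tuning.

For reinforcement learning from human feedback (RLHF) and related post-training methods, veRL (Volcano Engine Reinforcement Learning) provides a flexible framework that implements algorithms including PPO, GRPO, and REINFORCE++. veRL's HybridFlow architecture mixes training backends (FSDP2, Megatron) and inference engines (vLLM, SGLang) in the same job, sharing model weights in memory between the actor and rollout components to avoid weight-synchronization overhead.

For inference serving, vLLM implements PagedAttention, which manages the KV cache as paged virtual memory to reduce fragmentation and enable larger batch sizes. SGLang extends this with RadixAttention for automatic prefix reuse across requests, a zero-overhead batch scheduler that overlaps CPU scheduling with GPU computation, and a cache-aware load balancer that routes requests based on predicted cache hit rates. Both frameworks support tensor parallelism for serving models that exceed single-GPU memory, and both integrate with NVIDIA Dynamo for disaggregated serving architectures that separate the prefill and decode phases.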
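The paging idea can be sketched with a small block-table allocator. This is an illustration in the spirit of PagedAttention, not vLLM's actual data structures: the `PagedKVCache` class, its method names, and the block counts are all invented for this example, and only the bookkeeping is modeled, not the key/value tensors themselves.

```python
class PagedKVCache:
    """Toy block-table allocator: sequences grow one fixed-size block at a time."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))     # physical block ids
        self.block_tables: dict[str, list[int]] = {}   # sequence id -> block ids
        self.lengths: dict[str, int] = {}              # sequence id -> token count

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):
    cache.append_token("request-a")   # 17 tokens need two 16-token blocks
for _ in range(5):
    cache.append_token("request-b")   # 5 tokens need one block
print(cache.block_tables, "free:", len(cache.free_blocks))
cache.free("request-a")
print("free after release:", len(cache.free_blocks))
```

Because blocks are allocated on demand and need not be contiguous, at most one partially filled block per sequence is wasted, which is the fragmentation benefit the paragraph above describes.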

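Observability

Observability is a prerequisite for debugging and operating distributed training systems at scale. When a training job hangs or throughput degrades, practitioners need visibility into whether the cause is a hardware failure, network congestion, a storage bottleneck, or an application-level inefficiency. At the infrastructure scale discussed in this series (thousands of GPUs, petabits of interconnect bandwidth, and terabytes of checkpoint data), the challenge shifts from simple monitoring to systematic telemetry collection, storage, and analysis. Observability spans three categories of telemetry: infrastructure metrics (GPU, network, storage), workload metrics (training throughput, queue latency), and alerting for proactive fault detection.

The Core Stack: Prometheus and Grafana

The de facto standard for observability in Kubernetes and HPC environments combines Prometheus for metrics collection with Grafana for visualization and alerting. Prometheus operates on a pull-based model, periodically scraping HTTP endpoints exposed by metric exporters. Collected metrics are stored in a time-series database (TSDB) and queried through PromQL, a flexible query language for aggregation, filtering, and alert-rule evaluation. Grafana uses Prometheus as a data source, rendering dashboards and triggering alerts based on PromQL expressions.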
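In the pull model, an exporter only has to serve its current values as plain text when scraped. The sketch below renders gauge samples in the Prometheus text exposition format using the standard library alone. The metric and label names are hypothetical, label-value escaping is omitted for brevity, and a real exporter would more likely use an official Prometheus client library and serve the text over HTTP.

```python
def render_gauges(samples: list[tuple[str, dict[str, str], float]]) -> str:
    """Render (name, labels, value) samples as Prometheus text exposition lines.

    Assumes samples of the same metric are adjacent in the input and that
    label values contain no characters that need escaping.
    """
    lines = []
    seen = set()
    for name, labels, value in samples:
        if name not in seen:
            lines.append(f"# TYPE {name} gauge")
            seen.add(name)
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical application-level training metrics.
text = render_gauges([
    ("training_step_seconds", {"job": "pretrain", "rank": "0"}, 1.25),
    ("training_tokens_per_second", {"job": "pretrain", "rank": "0"}, 52000.0),
])
print(text)
```

Prometheus scrapes such an endpoint on a fixed interval and stores each line as a sample in the corresponding time series, after which the values can be queried with PromQL.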

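For production deployments, Amazon Managed Service for Prometheus (AMP) provides a fully managed, Prometheus-compatible time-series database that scales to ingest millions of samples per second without operators having to manage storage, replication, or high availability. Amazon Managed Grafana (AMG) provides managed Grafana workspaces with native AMP integration and AWS authentication through IAM Identity Center. Together, these services remove operational overhead while preserving compatibility with existing Prometheus exporters and Grafana dashboards.

GPU, Network, and Application Telemetry

DCGM-Exporter exposes NVIDIA GPU metrics in Prometheus format, including utilization, memory usage, power, temperature, and hardware health indicators such as ECC errors and XID events. For training workloads, SM activity (DCGM_FI_PROF_SM_ACTIVE) often gives a more accurate measure of compute efficiency than the basic utilization metric.

EFA exposes driver-level statistics (bytes, packets, retransmissions, timeouts) that help diagnose collective-operation bottlenecks in distributed training. Because the aws-ofi-nccl plugin connects NCCL to the libfabric interface, operators can combine EFA counters with NCCL diagnostics (NCCL_DEBUG=INFO) to isolate network-layer problems.

Amazon FSx for Lustre exposes client-side metrics including throughput and metadata latency, while application-level metrics (step time, tokens per second, and loss values for training; TTFT and inter-token latency for inference) can be exported through the Prometheus client libraries.

GPU Health Monitoring and Alerting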

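Proactive fault detection prevents hardware issues from propagating into training interruptions. A typical workflow monitors DCGM health metrics and triggers alerts when error counts exceed thresholds. ECC single-bit errors (SBE) can be tolerated in small numbers, but an accelerating SBE rate often precedes double-bit errors (DBE) or other faults. XID 64 (row-remapping failure), XID 79 (GPU has fallen off the bus), and XID 94/95 (contained/uncontained ECC errors) typically warrant immediate node replacement.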
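The alerting logic described above can be sketched as a small classification function. Everything here is illustrative: the function name, the three status labels, the "strictly rising" test for an accelerating SBE rate, and the numeric threshold are assumptions, and the set of XID codes treated as critical is passed in by the caller rather than hard-coded.

```python
def classify_node(sbe_per_hour: list[float], xid_events: list[int],
                  critical_xids: set[int], sbe_rate_limit: float) -> str:
    """Return 'replace', 'watch', or 'ok' for one node (hypothetical policy).

    sbe_per_hour: recent single-bit ECC error rates, oldest first.
    xid_events:   XID codes observed on the node in the same window.
    """
    # Any critical XID is treated as grounds for immediate replacement.
    if any(x in critical_xids for x in xid_events):
        return "replace"

    # A strictly rising SBE rate is used here as a simple proxy for "accelerating".
    rising = len(sbe_per_hour) >= 2 and all(
        later > earlier for earlier, later in zip(sbe_per_hour, sbe_per_hour[1:])
    )
    over_limit = bool(sbe_per_hour) and sbe_per_hour[-1] > sbe_rate_limit
    if rising or over_limit:
        return "watch"
    return "ok"


CRITICAL = {94, 95}  # example policy only; extend per your own runbook
print(classify_node([0, 0], [95], CRITICAL, sbe_rate_limit=10))
print(classify_node([1, 3, 7], [], CRITICAL, sbe_rate_limit=10))
print(classify_node([2, 2, 1], [], CRITICAL, sbe_rate_limit=10))
```

In a Prometheus and Grafana deployment the same conditions would normally be expressed as alert rules over DCGM metrics; the function form simply makes the decision order explicit.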

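The GPU Health - Cluster dashboard (Grafana dashboard ID 21645) provides a reference visualization of common GPU error patterns. The dashboard aggregates ECC errors, XID events, thermal violations, and row-remapping status across all cluster nodes so that operators can identify failing hardware before it affects training jobs.

Figure 4: GPU Health - Cluster dashboard showing GPU error patterns and per-instance reporting

Conclusion

The shift from a single pre-training scaling law to three complementary regimes (pre-training, post-training, and test-time compute) has not fragmented infrastructure requirements; it has reinforced them. All three regimes demand tightly coupled accelerator compute, high-bandwidth low-latency networking, and scalable distributed storage, and they differ mainly in workload profile and resource scheduling patterns.

This post surfaced a four-layer architecture that addresses these requirements on AWS: infrastructure building blocks (EC2 P instances, EFA networking, and tiered storage), resource orchestration (Slurm and Kubernetes, with SageMaker HyperPod), the ML software stack (from kernel drivers and CUDA through NCCL to PyTorch), and observability (Prometheus, Grafana, and GPU health monitoring). Each layer both constrains and enables the layers above it: a misconfigured driver or a saturated network link can bottleneck a training run as effectively as a suboptimal parallelization strategy.

Understanding these integration points is the basis for diagnosing performance bottlenecks and making informed scaling decisions across the foundation model lifecycle.

Authors

Aman Shanbhag is an AI Performance and Infrastructure Engineer on the MARS MLOps team at NVIDIA, where he helps research teams build scalable, high-performance ML training and inference systems. Previously, he was a Specialist Solutions Architect at AWS, supporting customers worldwide with ML training and inference optimization on AWS. Aman holds degrees in computer science, mathematics, and entrepreneurship from Rice University and focuses on AI infrastructure, performance optimization, and distributed training and inference.

Pavel Belevich is a Senior Applied Scientist on the GenAI ML Frameworks team at Amazon Web Services. He applies his research on distributed training and large-model inference to real customer workloads at production scale. Before joining AWS, Pavel worked on the PyTorch Distributed team, where he contributed to core distributed training techniques such as FSDP and pipeline parallelism. At AWS, he works on MoE communication patterns and large-scale serving and training workflows. He also regularly shares best practices through technical deep dives on expert parallelism and large-model systems.

Keita Watanabe is a Principal Solutions Architect on the GenAI ML Frameworks team at Amazon Web Services, where he specializes in ML systems performance engineering and supports customers worldwide with ML training and inference optimization on AWS. His background is in machine learning research and development. Before joining AWS, Keita worked as a research scientist at Rakuten, where he developed image-based product search systems. Keita holds a Ph.D. in Science from the University of Tokyo.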

Building Blocks for Foundation Model Training and Inference on AWS

For a long time, "scaling" in foundation models mostly meant one thing: spend more compute on pre-training and capabilities rise. That intuition was supported by empirical work such as

Kaplan et al. (2020)

, which reported predictable power-law trends in loss as you scale

model parameters

dataset size

, and

training compute

. In practice, these trends justified sustained investment in large-scale accelerator capacity and the surrounding distributed infrastructure needed to keep it efficiently utilized. But the frontier has evolved—and scaling is no longer a single curve. NVIDIA's "from one to three scaling laws" framing usefully emphasizes that, beyond pre-training, performance increasingly scales through

post-training

(e.g., supervised fine-tuning (SFT) and reinforcement learning (RL)-based methods) and through

test-time compute

("long thinking," search/verification, multi-sample strategies).

Figure: Adapted from "AI's Three Scaling Laws, Explained" (NVIDIA Blog).

Taken together, these scaling regimes push the foundation-model lifecycle—pre-training, post-training, and inference—toward convergent infrastructure requirements: tightly coupled accelerator compute, a high-bandwidth low-latency network, and a distributed storage backend. They also raise the importance of orchestration for resource management, and of application- and hardware-level observability to maintain cluster health and diagnose performance pathologies at scale.

Another key trend is the increasing reliance of the foundation-model lifecycle on an open-source software (OSS) ecosystem that spans model development frameworks, cluster resource management, and operational tooling. At the cluster layer, resource management is typically provided by systems such as Slurm and Kubernetes. Model development and distributed training are commonly implemented in frameworks such as PyTorch and JAX. Monitoring and visualization—that is, observability—are often achieved using Prometheus for metrics collection and Grafana for visualization and alerting, positioned as an operational layer atop infrastructure and resource management. Figure 1 illustrates this layered architecture, showing how hardware infrastructure supports resource orchestration, which in turn enables ML frameworks, with observability spanning across all layers.

Figure 1: The layered architecture of open-source software stacks for foundation model training and inference

This post is intended for machine learning engineers and researchers involved in foundation model training and inference, with particular attention to workflows built atop OSS frameworks. It analyzes how AWS infrastructure—including multi-node accelerator compute, high-bandwidth low-latency networking, distributed shared storage, and associated managed services—interacts with common OSS stacks across the foundation model lifecycle. The primary goal is to provide a technical foundation for understanding systems bottlenecks and scaling characteristics spanning pre-training, post-training, and inference. This introductory post surfaces the overall system architecture, emphasizing the integration points between AWS infrastructure components and OSS tools that underpin large-scale distributed training and inference.

The AWS Building Blocks

The remainder of this series examines how this layered architecture is realized on AWS, progressing through infrastructure, resource orchestration, the ML software stack, and observability. The following sections preview each layer.

Infrastructure: Compute, Network, and Storage

As illustrated in Figure 1, infrastructure is anchored by three coupled building blocks—accelerated compute with large device memory, wide-bandwidth interconnect for collective communication, and scalable distributed storage for data and checkpoints.

Accelerated compute forms the foundation of large-scale foundation model pre-training, post-training, and inference. AWS offers several generations of NVIDIA GPUs as part of its Amazon EC2 accelerated computing instances, including the Amazon EC2 P instance family. The P5 instance family includes p5.48xlarge with eight NVIDIA H100 GPUs, p5.4xlarge with a single H100 GPU for smaller-scale workloads, and p5e.48xlarge/p5en.48xlarge variants with NVIDIA H200 GPUs. The P6 instance family introduces NVIDIA Blackwell B200 architecture with p6-b200.48xlarge and Blackwell Ultra B300 with p6-b300.48xlarge. Across these generations, the dominant scaling axes are peak Tensor throughput, HBM capacity and bandwidth, and interconnect bandwidth (within and across nodes).

As a first-order approximation, peak Tensor Core throughput—measured in floating point operations per second (FLOPS)—helps situate these accelerators on a common axis. The table below summarizes per-GPU peak throughput for dense BF16/FP16 and FP8 Tensor operations, along with HBM capacity and HBM bandwidth, using SXM/HGX-class specifications that align with NVSwitch/NVLink-based multi-GPU nodes.

GPU (representative variant)	BF16/FP16 Tensor peak (dense)	FP8 Tensor peak (dense)	FP4 Tensor peak (dense)	HBM capacity	HBM bandwidth
H100 (SXM)	0.9895 PFLOPS	1.979 PFLOPS	—	80 GB HBM3	3.35 TB/s
H200 (SXM)	0.9895 PFLOPS	1.979 PFLOPS	—	141 GB HBM3e	4.8 TB/s
B200 (HGX, per GPU)	2.25 PFLOPS	4.5 PFLOPS	9 PFLOPS	180 GB HBM3e	8 TB/s
B300 (HGX, per GPU)	2.25 PFLOPS	4.5 PFLOPS	13.5 PFLOPS	288 GB HBM3e	8 TB/s

Note: NVIDIA product tables often report Tensor throughput “with sparsity”; this table reports dense throughput. Where applicable, dense throughput is taken as half of sparse throughput, following NVIDIA’s guidance for HGX-class platforms (NVIDIA). DGX figures are system-level; the B200 HBM capacity and bandwidth values are expressed per GPU by dividing DGX totals by eight (NVIDIA).

As models scale, step time is often dominated by collective communication and memory movement rather than raw compute throughput, motivating explicit scale-up and scale-out bandwidth accounting. For the multi-GPU instances, GPU communication spans two regimes. Internal scale-up (NVLink/NVSwitch) provides high-bandwidth, low-latency GPU-to-GPU connectivity within a node, enabling collectives such as all-reduce and all-gather to execute without traversing the host networking stack. External scale-out (EFA) provides OS-bypass networking across nodes, which AWS uses as a building block for Amazon EC2 UltraClusters where communication-heavy collectives span thousands of instances. The following table summarizes key specifications across these instance types:

Instance Type	GPU	GPUs	GPU Memory	NVLink	NVLink BW (aggregate)	EFA	EFA BW (aggregate)
p5.4xlarge	H100	1	80 GB HBM3	—	—	v2	12.5 GB/s
p5.48xlarge	H100	8	640 GB HBM3	4th	7.2 TB/s	v2	400 GB/s
p5e.48xlarge	H200	8	1,128 GB HBM3e	4th	7.2 TB/s	v2	400 GB/s
p5en.48xlarge	H200	8	1,128 GB HBM3e	4th	7.2 TB/s	v3	400 GB/s
p6-b200.48xlarge	B200	8	1,440 GB HBM3e	5th	14.4 TB/s	v4	400 GB/s
p6-b300.48xlarge	B300	8	2,100 GB HBM3e	5th	14.4 TB/s	v4	800 GB/s

Note: EFA bandwidth is converted from Gbps to GB/s (÷8) for consistency with other bandwidth metrics; see the EC2 accelerated computing networking specifications. NVLink and EFA bandwidth figures are shown as aggregate per-instance values rather than per-link values; see the P5 instance family page and the P6 instance family page for the corresponding intra-node interconnect and networking characteristics.

Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 that provides OS-bypass remote direct memory access (RDMA) capability using the Scalable Reliable Datagram (SRD) protocol. By enabling applications to communicate directly with the network device through the Libfabric API—bypassing the operating system kernel—EFA reduces latency and improves throughput for collective operations in distributed training.

Multiple generations of EFA are available on different instance families. Amazon EC2 P5 and P5e instances are equipped with EFA version 2 (EFAv2). EFA version 3 (EFAv3), provided on P5en instances, reduces packet latency by approximately 35% compared to EFAv2. EFA version 4 (EFAv4), available on P6 instances, delivers an additional 18% improvement in collective communication performance relative to EFAv3.

At scale, both distributed training (streaming corpora and writing multi-terabyte checkpoints) and large-scale inference (staging weights and managing KV cache growth) motivate a tiered storage hierarchy—local NVMe SSD for hot data, Lustre for shared high-throughput access, and Amazon S3 for durable persistence.

In this series’ primary multi-GPU instances, local NVMe is provided as instance store (ephemeral) with 30.72 TB raw capacity (8 × 3.84 TB NVMe SSD); see the EC2 accelerated-computing instance store specifications.

Lustre is an open-source, POSIX compliant distributed file system widely used in high-performance computing (HPC) to provide a shared namespace with high aggregate throughput across many clients. Amazon FSx for Lustre provides Lustre as a fully managed service and exposes it as a parallel file system capable of terabytes per second of throughput, millions of IOPS, and sub-millisecond latencies. Data Repository Associations enable integration with Amazon S3, supporting lazy loading of training datasets and automatic checkpoint export for durability.

At cluster scale, these instances are deployed in Amazon EC2 UltraClusters, which provision thousands of accelerated instances as a single, tightly placed cluster within an Availability Zone and interconnect them using a petabit-scale nonblocking network.

Figure: 2nd-generation Amazon EC2 UltraClusters (example P5 UltraCluster).

For workloads with high per-step communication intensity (e.g., expert parallelism in MoE models where all-to-all token dispatch spans many GPUs), the size of the NVLink domain can become a first-order constraint. As an extension of the internal scale-up axis, increasing the NVLink domain reduces how often performance-critical communication must leave the NVLink fabric.

Amazon EC2 UltraServers extend the NVLink domain beyond a single EC2 instance by connecting multiple component instances through a dedicated accelerator interconnect. AWS reports that P6e-GB200 UltraServers are built on the NVIDIA GB200 NVL72 platform and expose up to 72 Blackwell GPUs and 13.4 TB of aggregate HBM3e within one NVLink domain. At larger scales, EFA remains the cross-node fabric for multi-UltraServer jobs, but increasing the intra-domain GPU count can reduce how often performance-critical communication must leave the NVLink fabric.

These systems are built from NVIDIA Grace–Blackwell superchips, which couple Grace CPU memory and Blackwell GPU HBM via cache-coherent NVLink-C2C, enabling direct access across CPU- and GPU-attached memory without explicit host–device copies. In practice, this can extend the effective memory available to GPU workloads (e.g., by placing colder model state or KV cache in CPU-attached memory) while avoiding PCIe-scale copy overheads, albeit with higher latency and lower bandwidth than local HBM.

The component instance type for P6e-GB200 UltraServers is p6e-gb200.36xlarge, which provides four GPUs and Elastic Fabric Adapter (EFA) v4 networking. The tables below summarize the per-instance and composed UltraServer configurations.

Note: The p6e-gb200.36xlarge EFA bandwidth is converted from the published aggregate EFA networking (4 × 400 Gbps) to GB/s (÷8); see the EC2 accelerated computing networking specifications.

UltraServer	Component instance type	GPUs (NVLink domain)	HBM3e (aggregate)	EFA	EFA BW
u-p6e-gb200x36	p6e-gb200.36xlarge	36	6.7 TB	v4	1,800 GB/s
u-p6e-gb200x72	p6e-gb200.36xlarge	72	13.4 TB	v4	3,600 GB/s

Note: UltraServer EFA bandwidth is converted from terabits per second (Tbps), as reported by AWS, to GB/s (÷8); see the P6e-GB200 UltraServers announcement and the P6 instance family page.

Resource Orchestration: Slurm and Kubernetes

When training spans hundreds or thousands of accelerators, manual resource management becomes intractable. For example, a training job requiring 512 GPUs must co-schedule 64 eight-GPU nodes (P-instances) simultaneously, and release resources atomically upon completion or failure. Both Slurm and Kubernetes address this challenge through a control-plane architecture: a centralized scheduler maintains cluster state and makes allocation decisions, while worker nodes execute assigned workloads.

Figure 2: High-level architecture of Slurm-based and Kubernetes-based resource orchestration on AWS

Slurm (Simple Linux Utility for Resource Management) is the dominant workload manager in high-performance computing, built on a modular plugin architecture that allows the scheduling algorithm, topology model, resource types, and accounting backend to be configured independently. Its scheduling model organizes resources into partitions (logical groupings of nodes), accepts job submissions via sbatch, and launches parallel tasks via srun with synchronized startup across allocated nodes. Critically for distributed training, Slurm schedules at the job level—allocating entire multi-node jobs atomically before any task launches. A backfill scheduler starts lower-priority jobs in idle slots without delaying higher-priority ones, while a multi-factor priority system weighs fair-share usage, job age, and QOS tiers to order the queue across tenants. Slurm also supports topology-aware placement through plugins that model network switch hierarchies—on AWS, encoding the EFA fabric topology to co-locate jobs on nodes with minimal switch hops—and native GPU scheduling through its Generic Resource (GRES) interface, which tracks GPU types and enforces device affinity.

AWS provides multiple deployment options for Slurm-based orchestration. AWS ParallelCluster is an open-source cluster management tool that automates the deployment of Slurm clusters on EC2, handling head node provisioning, compute fleet scaling, and integration with shared storage. AWS Parallel Computing Service (PCS) offers an alternative that provides the managed control plane. For distributed training workloads specifically, Amazon SageMaker HyperPod supports Slurm mode with additional capabilities tailored to large-scale training, such as continuous node health monitoring and job auto-resume functionality.

Kubernetes takes a declarative, API-driven approach: users specify desired state through resource manifests, and controllers reconcile actual state to match. While Kubernetes excels at model deployment, its native scheduling model exposes several gaps for tightly coupled distributed training. Kubernetes schedules at the pod level; without job-level atomicity, a multi-node training job can partially start, with some ranks running while others remain Pending, which wastes GPUs or causes deadlocks. Vanilla Kubernetes also lacks batch queue semantics with priority-based backfill, as well as built-in awareness of network fabric topology (NVLink domains, EFA interconnects) for placing jobs that run communication-heavy collectives.

Several Kubernetes-native projects address these gaps at different layers. Kueue operates as an admission controller atop the default scheduler, managing job-level gang admission, multi-tenant quotas with hierarchical fair sharing, and priority-based preemption—while delegating pod placement to the underlying scheduler. Volcano and NVIDIA KAI Scheduler take a different approach, replacing or augmenting the default scheduler to integrate gang scheduling directly with topology-aware pod placement—Volcano as a general-purpose batch scheduler, KAI Scheduler with deep NVLink/NVSwitch awareness for GPU-optimized placement. These layers are complementary: Kueue can manage admission and quota policy while passing admitted jobs to a topology-aware scheduler for placement.
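The difference between pod-level scheduling and gang admission can be illustrated with a toy simulation. This is a sketch under simplifying assumptions (one pod per node, identical nodes, hypothetical job names) and does not reflect the actual Kueue, Volcano, or KAI Scheduler logic.

```python
def pod_level_schedule(capacity: int, jobs: dict) -> dict:
    """Place pods one at a time, round-robin across jobs, with no job-level atomicity."""
    placed = {name: 0 for name in jobs}
    free = capacity
    progress = True
    while free > 0 and progress:
        progress = False
        for name, need in jobs.items():
            if free > 0 and placed[name] < need:
                placed[name] += 1
                free -= 1
                progress = True
    return placed


def gang_admit(capacity: int, jobs: dict) -> dict:
    """Admit a job only if all of its pods fit; otherwise leave it queued with zero pods."""
    placed = {}
    free = capacity
    for name, need in jobs.items():
        if need <= free:
            placed[name] = need
            free -= need
        else:
            placed[name] = 0
    return placed


def runnable(jobs: dict, placed: dict) -> list:
    """A tightly coupled job can only make progress once every rank is placed."""
    return [name for name, need in jobs.items() if placed[name] == need]


jobs = {"job-a": 4, "job-b": 4}   # each job needs 4 nodes
capacity = 6                      # the cluster has 6 free nodes

partial = pod_level_schedule(capacity, jobs)
gang = gang_admit(capacity, jobs)
print(partial, runnable(jobs, partial))
print(gang, runnable(jobs, gang))
```

In the pod-level case both jobs hold three nodes and neither can start, so all six nodes sit idle. With gang admission, one job runs at full size while the other waits in the queue without holding resources.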

For Kubernetes-based orchestration on AWS, Amazon Elastic Kubernetes Service (EKS) provides managed Kubernetes with GPU scheduling via the NVIDIA device plugin. Amazon SageMaker HyperPod also supports EKS mode, combining Kubernetes orchestration with HyperPod's training-specific capabilities. HyperPod EKS extends EKS with features designed for foundation model training at scale.

Task governance provides compute allocation and policy enforcement across teams, integrating managed Kueue for admission control and Karpenter for just-in-time node provisioning.

Checkpointless training addresses the recovery latency inherent in traditional checkpoint-based fault tolerance. Rather than periodically serializing model state to shared storage, checkpointless training maintains continuous peer-to-peer state replication across GPUs. When a failure occurs, surviving nodes reconstruct the lost state through EFA-based communication rather than reading multi-terabyte checkpoints from FSx for Lustre or S3.

Elastic training enables jobs to automatically scale based on resource availability. When additional accelerators become available (e.g., from completed jobs or newly provisioned capacity), elastic jobs can expand to utilize them; when higher-priority workloads require resources, jobs can contract while maintaining training progress.

ML Software Stack

Distributed training and inference involve multiple software layers that must be correctly configured and coordinated. A useful model treats the runtime stack as five layers, ordered from hardware-adjacent components (which must function correctly for anything to run) to framework-level abstractions (which determine programmer productivity and model throughput): hardware enablement, accelerator runtime and math libraries, communication substrate, ML frameworks, and distributed training/inference frameworks.

Figure 3: The ML software stack for distributed training and inference on EC2 instances

Hardware enablement: kernel drivers

At the foundation, Linux kernel drivers provide direct hardware access. The NVIDIA GPU driver exposes compute capabilities and supports GPUDirect RDMA for direct data transfers between GPUs and network adapters. The GDRCopy driver (gdrdrv) enables low-latency CPU-initiated copies to and from GPU memory, used by NCCL for small-message transfers. The EFA driver provides OS-bypass networking through the libfabric API, and the Lustre client driver enables POSIX access to FSx for Lustre parallel file systems.

Accelerator runtime, compilers, and kernel libraries

The CUDA platform provides the programming model and runtime for GPU compute. Applications compiled against CUDA can launch kernels on NVIDIA GPUs, manage device memory, and coordinate execution across multiple devices. At the time of writing, the current release line is CUDA Toolkit 13.x, which supports the Blackwell architecture (compute capability 10.x).

Modern training and inference performance is increasingly driven by specialized optimization libraries and custom kernels, not just general-purpose vendor primitives. Kernels like FlashAttention fuse attention into a single memory-efficient pass, cutting HBM traffic and improving throughput. Many teams also write shape- and precision-specialized fused kernels (e.g., layernorm/residual/activation, quantized GEMMs, MoE dispatch, KV-cache ops) tuned to their exact models. This is enabled by programmable toolchains such as Triton (Python GPU kernel compiler) and NVIDIA's CuTe (tensor layout and warp-level DSL), with libraries like CUTLASS providing highly optimized GEMM and fusion building blocks. In practice, this kernel and compiler layer often determines end-to-end performance as much as the ML framework.

Communication substrate: NCCL and transport plugins

Multi-GPU training depends on efficient collective communication. NVIDIA Collective Communications Library (NCCL) implements collective operations—all-reduce, all-gather, reduce-scatter, all-to-all, broadcast, and point-to-point send/receive—with topology-aware algorithms that exploit NVLink for intra-node communication and network transports for inter-node traffic. NCCL dynamically detects the communication topology and selects ring or tree algorithms depending on message size and available bandwidth. While data-parallel and tensor-parallel strategies rely primarily on all-reduce and all-gather, Mixture-of-Experts (MoE) models with expert parallelism depend on all-to-all collectives to route tokens between GPUs: a dispatch all-to-all sends each token to the GPU hosting its assigned expert, and a combine all-to-all returns expert outputs to the originating GPUs (NVIDIA Developer Blog). Because every GPU exchanges data with every other GPU in the expert-parallel group, all-to-all communication volume scales with the number of experts and can become a dominant bottleneck at high expert-parallelism degrees.
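To make the ring algorithm mentioned above concrete, the following pure-Python sketch simulates a ring all-reduce as a reduce-scatter phase followed by an all-gather phase. It is an illustration of the textbook algorithm, not NCCL's implementation: ranks are simulated as lists in one process, and each rank's buffer is assumed to be pre-split into exactly as many chunks as there are ranks.

```python
def ring_all_reduce(buffers: list) -> list:
    """Sum-reduce the per-rank buffers so every rank ends with the same result.

    buffers[r][c] is chunk c held by rank r; there must be one chunk per rank.
    """
    n = len(buffers)
    data = [list(b) for b in buffers]
    assert all(len(b) == n for b in data), "expected one chunk per rank"

    # Phase 1: reduce-scatter. In each step every rank sends one chunk to its
    # right neighbor, which accumulates it. Messages are snapshotted first so
    # that all sends in a step happen "simultaneously".
    for step in range(n - 1):
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for src, idx, value in sends:
            data[(src + 1) % n][idx] += value
    # Now rank r holds the fully reduced chunk (r + 1) % n.

    # Phase 2: all-gather. Each rank forwards the reduced chunk it most
    # recently completed or received, and the neighbor overwrites its copy.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for src, idx, value in sends:
            data[(src + 1) % n][idx] = value
    return data


ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_all_reduce(ranks))
```

Each rank sends and receives n - 1 chunks in each phase, so per-rank traffic is roughly 2(n - 1)/n times the buffer size regardless of the number of ranks, which is why ring algorithms suit large, bandwidth-bound messages.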

On AWS, NCCL's inter-node communication is enabled through the aws-ofi-nccl plugin, which maps NCCL's transport APIs to libfabric interfaces. This allows NCCL to leverage EFA's OS-bypass and Scalable Reliable Datagram (SRD) protocol without application changes.

For inference workloads, collective operations do not capture all communication patterns. Disaggregated inference architectures—which separate prefill and decode phases onto distinct GPU pools—require efficient point-to-point data movement, particularly for transferring KV cache state between instances. NVIDIA Inference Xfer Library (NIXL) addresses this requirement by providing a unified API for point-to-point transfers across memory tiers (HBM, DRAM, NVMe, distributed storage) and interconnects (NVLink, InfiniBand, Ethernet). NIXL integrates with inference frameworks such as NVIDIA Dynamo and supports backends including UCX and GPUDirect Storage.
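A back-of-the-envelope calculation shows why KV cache transfer matters in disaggregated serving. The sketch below uses the standard size formula for a transformer KV cache; the model dimensions and the effective bandwidth figure are hypothetical values chosen for illustration, not measurements of any particular model or interconnect.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes: int, gigabytes_per_second: float) -> float:
    """Idealized transfer time, ignoring latency and protocol overhead."""
    return num_bytes / (gigabytes_per_second * 1e9)


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit cache.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(size / 2**30, "GiB per 8192-token sequence")

# Assumed effective point-to-point bandwidth of 50 GB/s.
print(transfer_seconds(size, 50.0), "seconds to move it from prefill to decode")
```

Under these assumptions a single long prompt produces about 1 GiB of state, so the handoff between prefill and decode pools adds tens of milliseconds per request unless the transfer path is efficient.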

ML frameworks: PyTorch

The two dominant frameworks for foundation model development are PyTorch and JAX. JAX takes an SPMD (Single Program Multiple Data) approach through XLA, where the same program executes across devices with automatic data distribution and collective lowering. This blog focuses on PyTorch, which sees broader adoption in the open-source ecosystem and forms the basis for the distributed training and inference frameworks discussed below.

PyTorch provides tensor computation with GPU acceleration, automatic differentiation, and a flexible eager-execution model. For distributed workloads, PyTorch's torch.distributed module provides the core primitives: process groups for collective communication, and distributed data-parallel abstractions including Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP2). DDP replicates models across GPUs and synchronizes gradients via all-reduce, while FSDP2 shards parameters, gradients, and optimizer states across workers using techniques from the ZeRO algorithm, enabling training of models that exceed single-GPU memory capacity.
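The memory benefit of sharding can be estimated with the accounting used in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for fp32 optimizer state. The sketch below applies that accounting; it ignores activations, buffers, and fragmentation, and the 7-billion-parameter model and 64-GPU cluster are illustrative assumptions.

```python
def model_state_bytes_per_gpu(num_params: int, num_gpus: int, sharded: bool) -> float:
    """Model-state memory per GPU for mixed-precision Adam, excluding activations."""
    half_params = 2 * num_params   # 16-bit parameters
    half_grads = 2 * num_params    # 16-bit gradients
    optimizer = 12 * num_params    # fp32 master weights + Adam momentum + variance
    total = half_params + half_grads + optimizer
    # Replicated (DDP-style): every GPU holds the full state.
    # Fully sharded (FSDP-style): the state is split evenly across GPUs.
    return total / num_gpus if sharded else float(total)


params = 7_000_000_000  # hypothetical 7B-parameter model
gpus = 64               # hypothetical cluster size

replicated = model_state_bytes_per_gpu(params, gpus, sharded=False)
sharded = model_state_bytes_per_gpu(params, gpus, sharded=True)
print(replicated / 1e9, "GB per GPU when replicated")
print(sharded / 1e9, "GB per GPU when fully sharded")
```

In this estimate the replicated model state alone (112 GB) exceeds the memory of many single accelerators, while full sharding across 64 GPUs reduces it to under 2 GB per GPU at the cost of extra all-gather and reduce-scatter traffic.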

Distributed training and inference frameworks

The top layer comprises frameworks that build on PyTorch to provide higher-level abstractions for distributed training and inference at scale. For training, three categories of frameworks address different points in the complexity-performance tradeoff. Below are a few examples.

Hugging Face Transformers provides the Trainer class with built-in support for distributed training via Accelerate, which abstracts over DDP, FSDP, and DeepSpeed. This path prioritizes ease of use and broad model compatibility, making it suitable for fine-tuning and moderate-scale training where configuration simplicity matters more than maximum throughput.

NVIDIA Megatron Core targets maximum efficiency at scale, combining tensor, pipeline, and data parallelism (commonly called 3D parallelism) with expert parallelism for MoE models, along with optimizations including FP8 mixed precision via Transformer Engine. The NeMo Framework builds on Megatron Core to provide end-to-end workflows for pre-training and fine-tuning.

For reinforcement learning from human feedback (RLHF) and related post-training methods, veRL (Volcano Engine Reinforcement Learning) provides a flexible framework that implements algorithms including PPO, GRPO, and REINFORCE++. veRL's HybridFlow architecture allows mixing training backends (FSDP2, Megatron) with inference engines (vLLM, SGLang) in the same job, avoiding weight synchronization overhead by sharing model weights in memory between actor and rollout components.

For inference serving, vLLM implements PagedAttention, managing the KV cache as paged virtual memory to reduce fragmentation and enable higher batch sizes. SGLang extends this with RadixAttention for automatic prefix reuse across requests, a zero-overhead batch scheduler that overlaps CPU scheduling with GPU computation, and a cache-aware load balancer that routes requests based on predicted cache hit rates. Both frameworks support tensor parallelism for serving models that exceed single-GPU memory, and both integrate with NVIDIA Dynamo for disaggregated serving architectures that separate prefill and decode phases.
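The paging idea behind PagedAttention can be sketched with a small block-table allocator. This is a conceptual illustration only: the class and method names are hypothetical and do not correspond to vLLM's internal API, and the sketch tracks block bookkeeping without storing any tensors.

```python
class PagedKVCache:
    """Toy block-table allocator: sequences map to fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> tuple:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # First token, or the last block is full: allocate on demand.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):
    last_slot = cache.append_token("a")   # 17 tokens span two blocks
cache.append_token("b")                   # a second sequence takes one block
print(cache.block_tables, len(cache.free_blocks))
cache.free("a")
print(len(cache.free_blocks))
```

Because blocks are allocated only when a sequence grows into them, the wasted space per sequence is bounded by one partially filled block, rather than by a worst-case contiguous reservation for the maximum sequence length.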

Observability

Observability is a prerequisite for debugging and operating distributed training systems at scale. When a training job stalls or throughput degrades, practitioners need visibility into whether the cause is hardware failure, network congestion, storage bottlenecks, or application-level inefficiency. At the infrastructure scale discussed in this series—thousands of GPUs, petabits of interconnect bandwidth, and terabytes of checkpoint data—the challenge shifts from simple monitoring to systematic telemetry collection, storage, and analysis. Observability spans three telemetry categories: infrastructure metrics (GPU, network, storage), workload metrics (training throughput, queue latency), and alerting for proactive fault detection.

Core Stack: Prometheus and Grafana

The de facto standard for observability in Kubernetes and HPC environments combines Prometheus for metrics collection with Grafana for visualization and alerting. Prometheus operates on a pull-based model, periodically scraping HTTP endpoints exposed by metric exporters. Collected metrics are stored in a time-series database (TSDB) and queried via PromQL, a flexible query language for aggregation, filtering, and alerting rule evaluation. Grafana consumes Prometheus as a data source, rendering dashboards and triggering alerts based on PromQL expressions.

For production deployments, Amazon Managed Service for Prometheus (AMP) provides a fully managed, Prometheus-compatible time-series database that scales to ingest millions of samples per second without requiring operators to manage storage, replication, or high availability. Amazon Managed Grafana (AMG) offers a managed Grafana workspace with native integration to AMP and AWS authentication via IAM Identity Center. Together, these services eliminate operational overhead while preserving compatibility with existing Prometheus exporters and Grafana dashboards.

GPU, Network, and Application Telemetry

DCGM-Exporter exposes NVIDIA GPU metrics in Prometheus format, including utilization, memory usage, power, temperature, and hardware health indicators such as ECC errors and XID events. For training workloads, SM activity (DCGM_FI_PROF_SM_ACTIVE) often provides a more accurate measure of compute efficiency than basic utilization metrics.

EFA exposes driver-level statistics (bytes, packets, retransmits, timeouts) that help diagnose collective operation bottlenecks in distributed training. The aws-ofi-nccl plugin bridges NCCL to the libfabric interface, and operators can combine EFA counters with NCCL diagnostics (NCCL_DEBUG=INFO) to isolate network-layer issues.

Amazon FSx for Lustre exposes client-side metrics including throughput and metadata latency, while application-level metrics (step time, tokens per second, loss values for training; TTFT, inter-token latency for inference) can be exported via Prometheus client libraries.
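As a rough illustration of what an application-level exporter emits, the sketch below renders training metrics in the Prometheus text exposition format using only the standard library. The metric names, label names, and values are hypothetical; a production job would more likely use an official Prometheus client library and serve this text from an HTTP endpoint for scraping.

```python
def render_metrics(metrics: dict, labels: dict) -> str:
    """Render gauges in the Prometheus text exposition format.

    metrics maps a metric name to a (help_text, value) pair.
    """
    label_str = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric names and values for one training rank.
training_metrics = {
    "training_step_seconds": ("Wall-clock time of the last training step", 1.25),
    "training_tokens_per_second": ("Training throughput", 41250.0),
}
exposition = render_metrics(training_metrics, {"job": "demo", "rank": "0"})
print(exposition)
```

Labels such as the job name and rank let dashboards aggregate throughput across a job or isolate a single slow rank with a label filter.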

GPU Health Monitoring and Alerting

Proactive fault detection prevents hardware issues from propagating into extended training interruptions. A typical workflow monitors DCGM health metrics and triggers alerts when error counts exceed thresholds. ECC single-bit errors (SBE) may be tolerable in small numbers, but accelerating SBE rates often precede double-bit errors (DBE) or other failures. XID 64 (row-remapping failure), XID 79 (GPU has fallen off the bus), and XID 94/95 (contained/uncontained ECC errors) typically warrant immediate node replacement.
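The alerting workflow described above can be sketched as a small classification function. The thresholds, the sample format, and the function names are hypothetical; in practice these rules would usually be expressed as Prometheus alerting rules over DCGM metrics, and the set of critical XID codes should follow the operator's own runbook.

```python
def sbe_rate_per_hour(samples: list) -> float:
    """Average SBE rate over a window.

    samples is a list of (timestamp_seconds, cumulative_sbe_count), oldest first.
    """
    (t_start, c_start), (t_end, c_end) = samples[0], samples[-1]
    hours = (t_end - t_start) / 3600
    return (c_end - c_start) / hours if hours > 0 else 0.0


def evaluate_gpu(samples: list, dbe_count: int, xids_seen: list,
                 critical_xids: set, max_sbe_per_hour: float) -> str:
    """Return 'replace', 'warn', or 'ok' for one GPU."""
    if dbe_count > 0 or set(xids_seen) & set(critical_xids):
        return "replace"   # uncorrectable error or critical XID: drain the node
    if sbe_rate_per_hour(samples) > max_sbe_per_hour:
        return "warn"      # correctable errors are climbing: watch closely
    return "ok"


CRITICAL = {64, 94, 95}   # example subset of XID codes treated as critical
LIMIT = 10.0              # hypothetical SBE-per-hour threshold

healthy = evaluate_gpu([(0, 0), (3600, 2)], 0, [], CRITICAL, LIMIT)
degrading = evaluate_gpu([(0, 0), (3600, 50)], 0, [], CRITICAL, LIMIT)
failed = evaluate_gpu([(0, 0), (3600, 2)], 0, [94], CRITICAL, LIMIT)
print(healthy, degrading, failed)
```

This sketch compares an average rate against a fixed limit; detecting an accelerating rate would require comparing successive windows, which is omitted here for brevity.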

The GPU Health - Cluster dashboard (Grafana dashboard ID 21645) provides a reference visualization for common GPU error patterns. The dashboard aggregates ECC errors, XID events, thermal violations, and row remapping status across all cluster nodes, enabling operators to identify failing hardware before it impacts training jobs.

Figure 4: GPU Health - Cluster dashboard showing GPU error patterns and instance reporting

Conclusion

The shift from a single pre-training scaling law to three complementary regimes—pre-training, post-training, and test-time compute—has not fragmented infrastructure requirements; it has reinforced them. All three regimes demand tightly coupled accelerator compute, high-bandwidth low-latency networking, and scalable distributed storage, differing mainly in workload profile and resource scheduling patterns.

This post surfaced the four-layer architecture that addresses those requirements on AWS: infrastructure building blocks (EC2 P-instances, EFA networking, and tiered storage), resource orchestration (Slurm and Kubernetes with SageMaker HyperPod), the ML software stack (from kernel drivers and CUDA through NCCL to PyTorch), and observability (Prometheus, Grafana, and GPU health monitoring). Each layer constrains and enables the layers above it—a misconfigured driver or saturated network link can bottleneck an otherwise well-tuned training run just as effectively as a suboptimal parallelism strategy.

Understanding these integration points is the foundation for diagnosing performance bottlenecks and making informed scaling decisions across the foundation model lifecycle.

Authors

Aman Shanbhag is an AI Performance and Infrastructure Engineer on the MARS MLOps team at NVIDIA, where he helps research teams build scalable, high-performance ML training and inference systems. He previously worked as a Specialist Solutions Architect at AWS, supporting customers worldwide with ML training and inference optimization on AWS. Aman holds degrees in computer science, mathematics, and entrepreneurship from Rice University and focuses on AI infrastructure, performance optimization, and distributed training and inference.

Pavel Belevich is a Senior Applied Scientist in the GenAI ML Frameworks team at Amazon Web Services. He applies his research in distributed training and large-model inference to real customer workloads at production scale. Before joining AWS, Pavel worked on the PyTorch Distributed team, contributing to core distributed training techniques such as FSDP and Pipeline Parallelism. At AWS, he works on MoE communication patterns and large-scale serving/training workflows. He also regularly shares best practices through technical deep dives on expert parallelism and large-model systems.

Keita Watanabe is a Principal Solutions Architect in the GenAI ML Frameworks team at Amazon Web Services, where he specializes in ML systems performance engineering and supporting customers worldwide with ML training and inference optimization on AWS. His background is in machine learning research and development. Prior to joining AWS, Keita worked at Rakuten as a research scientist, developing an image-based product search system. Keita holds a PhD in Science from the University of Tokyo.
