PyTorch 프로파일링 (1부): torch.profiler 초보자 가이드

Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

PyTorch 프로파일링 (Part 1): torch.profiler 초보자 가이드

프로파일링할 수 없는 것은 최적화할 수 없다.

Large Language Model(LLM)에서 더 많은 토큰을 초당 처리하려고 하든, 추론 시간을 밀리초 단위로 줄이려고 하든, 아니면 학습 루프가 스펙시트에서 약속한 것보다 느린 이유를 이해하려고 하든, 결국 프로파일링이라는 길로 들어서게 됩니다.

문제는 프로파일링이 **가파른** 진입 장벽을 가지고 있다는 것입니다. 트레이스는 색상 있는 사각형들이 촘촘히 모인 벽입니다. 이벤트들은 겁을 주는 이름들을 가지고 있습니다. 대부분의 튜토리얼은 이미 트레이스를 읽을 수 있다고 가정합니다. 그래서 프로파일링을 **해야 한다**는 것을 알면서도, 트레이스를 열기는 나중으로(또는 다른 사람을 위해) 미루고 싶은 느낌을 받게 됩니다. 이 게시물과 이를 시작으로 하는 시리즈는 그 진입 장벽을 낮추려는 우리의 노력입니다.

이것은 **PyTorch 프로파일링** 시리즈의 첫 번째 게시물입니다. 우리는 프로파일러 트레이스를 읽는 기술을 천천히 구축하고 이를 최적화를 추진하는 데 사용할 것입니다. 계획은 다음과 같습니다:

Part 1 (이 게시물): 가장 단순한 연산인 행렬 곱셈 다음 바이어스 더하기로 시작하여 프로파일러가 반환한 결과를 읽는 방법을 배웁니다.
Part 2: `nn.Linear`와 작은 MLP로 확장하고, 트레이스를 사용하여 최적화를 동기화하고, 그 아래의 `kernels`를 살짝 봅니다.
Part 3: 모든 것을 `transformers`를 사용한 대형 언어 모델에 적용합니다.

우리는 초보자의 관점에서 이 여정을 기록합니다. 기본 PyTorch 외에 전제 조건은 없습니다. 이것을 몇몇 "아하!" 순간들이 있는 여유로운 읽기로 생각하세요. 이 게시물의 구조는 의도적으로 질문에 주도적입니다: 우리는 트레이스를 열고 "잠깐, 왜 **그런** 일이 일어나는 걸까?"라고 묻고, 무언가가 클릭될 때까지 답을 추적합니다. 끝에서 당신은 다음을 알아야 합니다:

`torch.profiler`를 설정하는 방법과 그것이 실제로 무엇을 반환하는지,
프로파일러 테이블과 트레이스를 읽는 방법 (CPU 레인, GPU 레인, 그리고 그 사이의 의심스러운 간격),
Python 호출에서 CUDA 커널까지의 이벤트 체인,
`torch.compile`을 얹을 때 무엇이 변하는지, 그리고 더 흥미롭게는 무엇이 **변하지 않는지**.

시작하기 전에, 아래의 모든 내용을 더 잘 읽을 수 있도록 두 가지 정의가 있습니다:

GPU **kernel**은 GPU의 많은 스레드에서 병렬로 실행되는 프로그램입니다.
CPU는 이러한 커널을 **스케줄하고 실행합니다**.

일반적으로 GPU 커널을 직접 작성할 필요가 없습니다. PyTorch 연산을 사용할 때, 자동으로 GPU에서 작업을 수행하는 하나 이상의 커널로 변환됩니다.

이 두 가지 아이디어를 준비해두고, 질문을 시작해봅시다.

이것이 이 게시물에 사용하는 전체 스크립트입니다: 01_matmul_add.py. 이 스크립트를 별도의 탭에서 열고 코드를 단계별로 살펴보기를 권합니다. 우리는 `NVIDIA A100-SXM4-80GB` GPU를 사용하여 스크립트를 실행합니다.

행렬 곱셈과 덧셈 연산

Dr. Sara Hooker가 정확하게 언급한 바와 같이, 우리가 주로 물로 이루어진 것처럼 심층 신경망은 주로 행렬 곱셈으로 이루어져 있습니다. 그들이 얼마나 근본적인지를 감안할 때, 우리의 프로파일링 여정을 다른 것으로 시작하는 것은 유감스러울 것입니다.

def fn(x, w, b):
  return torch.add(torch.matmul(x, w), b)

행렬 덧셈과 행렬 곱셈은 뉴런에서 가중치와 바이어스가 상호작용하는 방식을 모방합니다. 이 덧셈(말장난입니다)은 나중에 컴파일로 어떻게 길을 열어주는지 이해하는 데 도움이 될 것입니다 이 게시물의 후반부에서.

프로파일링하기 위해 우리는 `torch.profiler` 모듈을 사용할 것입니다. 관련된 단계는:

프로파일링할 코드를 준비 (여기서 `def fn`, 행렬 곱셈과 행렬 덧셈을 감싸는)
알고리즘에 주석을 달기. 이것은 완전히 선택적이지만, 이렇게 하기를 권합니다. `record_function`은 우리 함수를 `matmul_add`로 주석 처리하여, 트레이스에서 쉽게 탐색할 수 있습니다 (나중에 언급하는 바와 같이)

def step():
  with torch.profiler.record_function("matmul_add"):
    return fn(x, w, b)

`torch.profiler.profile` 컨텍스트 매니저로 코드를 감싸기

  with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,  
        torch.profiler.ProfilerActivity.CUDA, 
    ],
  ) as prof:
    
    for _ in range(5):
      step()
      prof.step()

프로파일 내보내기


prof.key_averages().table(sort_by="cuda_time_total", row_limit=15)


prof.export_chrome_trace(trace_path)

프로파일러는 두 가지 서로 다른 아티팩트를 내보냅니다:

프로파일러 테이블: 알고리즘의 통계적 요약을 제공합니다. "무엇이 가장 많은 시간을 차지하는가"에 답합니다. 이는 핫스팟을 파악하는 데 정말 도움이 됩니다. 핫스팟은 가장 많은 시간을 소비하는 이벤트, 파이프라인의 병목이 될 수 있거나 많은 횟수로 트리거되는 이벤트입니다.
프로파일러 트레이스: 시간적 실행 보기를 제공합니다. "언제, 그리고 왜 연산이 발생했는가"에 답합니다, CPU와 GPU에서 발생하는 활동을 나타냅니다. 이는 실행된 커널, 이들을 실행하는 데 걸린 지연, CPU와 GPU 활동 사이의 겹침 등을 조사하려고 할 때 도움이 됩니다.

두 가지가 실제로 작동하는 것을 살펴봅시다. (전체 `01_matmul_add.py` 스크립트는 여기)

GPU가 있는 머신에서 이 스크립트를 실행하는 것을 권합니다.

uv run 01_matmul_add.py --size 64

위 스크립트를 (GPU 머신에서) 실행하면 `traces/01_matmul_add` 폴더에서 두 가지 아티팩트를 찾을 수 있습니다:

64_bf16_cold_eager.json
64_bf16_cold_eager.txt


그림 1: 64 크기 행렬에 대한 행렬곱 더하기의 프로파일러 테이블

`.txt` 파일은 프로파일러 테이블을 보유합니다. 파일을 열면, 그림 1에 표시된 대로, 첫 번째 열이 프로파일 범위 내에서 트리거된 이벤트로 구성된 큰 테이블이 나타납니다.

다른 열은 이벤트가 CPU 또는 GPU 또는 `torch.profiler.profile` 내 `activities`에서 지정된 다른 장치에서 걸리는 시간과 관련이 있습니다. 가장 많은 시간을 소비하는 이벤트를 찾고, 그 이벤트가 실제로 그 정도의 시간을 소비해야 하는지 직관적으로 이해하려고 시도하세요. "# of Calls" 열도 살펴보는 것이 중요합니다. 이는 이벤트가 트리거된 횟수를 나타냅니다.

우리가 이미 이야기하고 있는 동안, "Self CPU/CUDA" vs "CPU/CUDA total"에 대해 이야기해 봅시다. "Self" 열은 자식을 제외하고 이벤트 자체 내에서만 소비한 시간을 측정합니다. "total" 열은 이벤트와 그것이 트리거한 모든 자식을 포함합니다. 그래서 만약 `matmul_add`의 "CPU total"을 보면, 그것은 자신이 소비한 시간 더하기 그것이 트리거한 자식 이벤트의 시간으로 구성됩니다. 이것은 주의해야 할 중요한 뉘앙스입니다.

테이블의 마지막 두 줄을 보면 프로파일러가 다음을 알려줍니다

Self CPU time total: 2.314ms
Self CUDA time total: 23.104us

CPU 시간은 `ms`에 있고 GPU 시간은 `us`에 있습니다. 관점을 위해, GPU에서 소비한 시간 (커널 `ampere_bf16_s16816gemm...`)은 CPU에서 소비한 시간 (`matmul_add` 연산)의 1% 미만입니다. GPU는 대부분의 시간을 유휴 상태로 있습니다. 이는 즉시 빨간 깃발입니다. 이런 일이 발생하는 이유는 GPU가 작은 matmul을 매우 빠르게 계산할 수 있어서, 우리 코드는 커널을 준비하고, GPU에서 실행하고, 곱할 데이터를 보내고, 결과를 수집하는 데 대부분의 시간을 소비합니다. 이 개념을 **오버헤드 제약** 알고리즘이라고 합니다.

이 영역에서 벗어나는 가장 쉬운 방법은 더 큰 행렬 곱셈을 사용하는 것입니다.

uv run 01_matmul_add.py --size 4096


그림 2: 4096 크기 행렬에 대한 행렬곱 더하기의 프로파일러 테이블

그림 2의 마지막 두 줄은:

Self CPU time total: 4.908ms
Self CUDA time total: 4.495ms

두 시간 모두 ms에 있습니다. 이는 우리가 행렬 곱셈의 크기를 증가시킴으로써 더 많은 GPU 시간을 구현했다는 의미입니다. 그림 2를 보면 대부분의 CUDA 시간이 이제 GPU 커널 (`ampere_bf16_s16816gemm_..`)에 의해 차지되고 그것을 실행한 CPU 연산 (`matmul_add`)에 의해서가 아님을 알 수 있습니다. 이는 우리가 정말로 오버헤드 제약에서 계산 제약으로 이동할 수 있었다는 것을 의미합니다.

이제 우리는 `.json` 아티팩트에 있는 디스패치 체인을 시각화하는 것으로 이동합니다. 당신은 이들을 Perfetto UI에 업로드하여 트레이스를 볼 수 있거나, `uvx trace-util traces -b traces`를 사용하여 Perfetto 링크를 직접 생성할 수 있습니다.

64x64 트레이스


그림 3: 64 크기 행렬의 행렬곱 덧셈에 대한 프로파일러 트레이스

그림 3에서, 우리는 행렬 곱셈과 덧셈에 대한 프로파일러 트레이스를 봅니다. 여기서 막대의 너비는 이벤트의 지속 시간을 나타내고, 수직 중첩은 호출 계층을 나타내고, CPU 레인은 CPU에서 발생하는 이벤트를 나타내고, GPU 레인은 실제 커널 실행을 보여줍니다. 또한 기다리거나 유휴 시간인 빈 공간도 있을 수 있습니다.

스크립트는 기본 구성으로 실행되었습니다:

크기 64: 입력, 가중치 및 바이어스는 (64, 64) 크기입니다
dtype bf16: 데이터 유형은 bfloat16입니다
컴파일 없음: 우리는 torch 연산을 컴파일하지 않았습니다
워밍업 없음: 프로파일링하기 전에 GPU를 워밍업하지 않았습니다

Perfetto를 사용할 때, 우리는 트레이스에 더 빠르게 접근하기 위해 키보드 사용을 제안합니다. "W A S D"를 트레이스 탐색에 사용할 수 있습니다.


그림 4: PyTorch 프로파일러 트레이스의 CPU 레인과 GPU 레인

그림 4에는 두 개의 레인이 있습니다. 하나는 CPU 활동용이고 하나는 GPU 활동용입니다. CPU 레인에서 세 가지 프로파일 단계를 볼 수 있습니다 (`ProfilerStep#2`부터 시작). 이것은 `schedule`에서 나옵니다.

schedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)

`wait`는 노이즈가 많은 초기화를 건너뜁니다 (`ProfilerStep#0`), `warmup`은 녹음 없이 프로파일러를 통해 실행합니다 (`ProfilerStep#1`), 그리고 `active`는 트레이스에 표시되는 것입니다. 여기의 스크립트에서 사용 중인 스케줄을 찾을 수 있습니다.

탐정 모자를 쓰고 트레이스를 조사하고 몇 가지 질문을 해봅시다.

ProfilerStep#2는 왜 이렇게 오래 걸리나요?


그림 5: `ProfileStep#2`는 뒤에 오는 단계보다 눈에 띄게 더 넓습니다

그림 5에서, 우리는 `ProfileStep#2`가 다른 단계에 비해 더 많은 시간을 소비하고, 자세히 보면 `matmul_add` 주석도 유사한 패턴을 보일 것입니다. 결정적인 증거는 주석 자체가 아니라 주석 내부에 있습니다:

단계	`matmul_add` 시작	`aten::matmul` 시작	간격
#2	138.736	366.493	227.757 µs
#3	517.926	523.447	5.521 µs
#4	610.039	614.527	4.488 µs


그림 6: `record_function("matmul_add")`와 `aten::matmul` 사이의 ~228 µs 데드 윈도우

그림 6에 표시된 약 228 µs는 `record_function("matmul_add")`를 입력하는 것과 PyTorch가 실제로 `aten::matmul`을 디스패치하는 사이의 "데드 윈도우"입니다. 이는 작업 공간 할당, cuBLAS (NVIDIA의 기본 선형 대수 연산을 수행하기 위한 GPU 가속 라이브러리) 휴리스틱, 또는 지연 모듈 로딩을 포함한 여러 이유로 인해 발생할 수 있습니다. 우리는 눈을 감거나 프로파일링 전에 더 많은 워밍업 단계를 실행할 수 있습니다 (표준입니다)

프로파일링 측면에서, 워밍업은 실제로 프로파일링하기 전에 이벤트를 몇 번 실행하는 것입니다. GPU가 수행하는 사전 작업 (위의 포인터 포함)은 우리가 프로파일링하지 않으려는 일회성 노력입니다. 우리의 예에서, 우리는 두 가지 워밍업 단계가 있습니다. 하나는 프로파일러에 들어가기 전에 실제로 함수를 반복하고, 두 개는 프로파일러 내부에 있으며 `warmup` 인수로 달성됩니다. 이 섹션에서, 우리는 실제 반복을 스케줄과 함께 활성화했습니다.

uv run 01_matmul_add.py --warmup

워밍업이 있는 64x64 Perfetto 트레이스


그림 7: 워밍업 후, 각 프로파일 단계는 유사한 양의 시간을 소비합니다

그림 7에서 우리는 각 프로파일 단계가 유사한 시간을 소비합니다. 하지만 이것은 일회성 오버헤드를 최적화할 수 있었다는 의미가 아닙니다. 우리는 실행을 워밍업하여 오버헤드가 프로파일링되지 않도록 했습니다. 우리는 이 섹션을 갑자기 닫으면서 이 문제를 해결하는 힌트를 주지 않는 것이 독자에게 부당할 것이라고 생각하므로, 실행 오버헤드를 더 최적화하는 것에 대해 읽을 수 있는 링크가 있습니다.

CPU 레인과 GPU 레인 사이에 ~2.5ms의 오프셋이 있는 이유는?


그림 8: CPU 레인과 GPU 레인 사이의 ~2.5ms 오프셋

그림 8에서, 우리는 CPU 레인과 GPU 레인 사이에 약 2.5ms의 오프셋이 있음을 봅니다. 이것은 CPU가 CUDA 커널을 제출한 후 실제로 실행을 시작할 때까지의 지연입니다. 워밍업 단계와 스케줄의 `wait`와 `warmup`이 GPU를 바쁘게 유지해야 하고 오프셋을 줄여야 한다고 생각할 수도 있습니다.

정말로 무슨 일이 일어나고 있는지 밝혀내기 위해, 우리의 스케줄을 조금 변경해봅시다:

- schedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)
+ schedule = torch.profiler.schedule(wait=0, warmup=0, active=3, repeat=1)


그림 9: `wait=0`과 `warmup=0`으로, 트레이스는 `Activity Buffer Request`를 드러냅니다

그림 9는 GPU 레인에서 어떤 연산 전에 `Activity Buffer Request`가 있음을 보여줍니다. 조금 더 자세히 살펴봅시다.


그림 10: 프로파일 단계 1에서 matmul과 add 커널 사이에 간격이 나타남

GPU 트레이스로 확대하면, `ProfileStep#0` (CPU 트레이스는 그림에 표시되지 않음)의 matmul과 add 커널이 서로 다음에 실행되는 동안, `ProfileStep#1`의 커널 사이에 윈도우가 있음을 알 수 있습니다. 이에 대한 최선의 설명은 버퍼 오버플로우가 있었고, 다른 버퍼 요청 (GPU VRAM에 메모리를 할당하는 요청)이 커널 실행 중에 발급되었다는 것입니다.

다른 가능성을 배제하는 가장 좋은 방법은 더 많은 반복에 대해 프로파일링하고 트레이스의 다른 부분에서 유사한 윈도우가 나타나는지 확인하는 것입니다. 이를 위해 우리는 `active=20`으로 실행합니다.


그림 11: 20 활성 단계를 사용하면 간격이 한 번만 나타나므로 버퍼 요청임을 확인합니다

그림 11에 표시된 대로, 우리는 `ProfileStep#1`에서 유사한 경향을 봅니다. 이는 우리 이전 발견과 일치하며, 우리는 실제로 다른 버퍼 요청이었다고 안전하게 결론지을 수 있습니다.

이벤트의 체인


그림 12: 디스패치의 체인

그림 12에서, 우리는 중첩된 CPU 호출을 봅니다. 이것은 중요한 시각화입니다. 하나는 디스패치의 체인이 정말로 무엇인지 이해할 수 있습니다.

우리는 프로파일링 단계를 캡슐화하는 `ProfileStep#`로 시작합니다. 단계를 주석 처리하기 때문에, 우리는 `matmul_add` 행을 봅니다. `matmul_add`는 두 개의 `aten` 호출로 구성되어 있습니다. 하나는 행렬 곱셈용이고 하나는 행렬 덧셈용입니다.

`aten::matmul`은 사용자 대면 PyTorch matmul 호출이 도달하는 ATen 수준 디스패치입니다. `aten::mm`은 2D 행렬-행렬 곱셈 백엔드입니다.

PyTorch가 행렬에 배치 축을 추가하면 `aten::bmm` (배치 행렬 곱셈)을 호출하는 방법이 매우 흥미롭습니다. 우리는 우회를 하고 `aten::bmm`이 실제로 작동하는 것을 봅시다.

- x = torch.randn(args.size, args.size, device=device, dtype=dtype)
- w = torch.randn( args.size, args.size, device=device, dtype=dtype)
- b = torch.randn(args.size, args.size, device=device, dtype=dtype)

+ # 배치 크기 8 추가
+ x = torch.randn(8, args.size, args.size, device=device, dtype=dtype)
+ w = torch.randn(8, args.size, args.size, device=device, dtype=dtype)
+ b = torch.randn(8, args.size, args.size, device=device, dtype=dtype)


그림 13: 배치 행렬 곱셈

그림 13에서, 입력에 배치 축을 추가한 후, `aten::matmul`은 이제 여러 다른 사전 필수 CUDA 런타임 호출과 함께 `aten::bmm` (`aten::mm` 대신)을 캡슐화합니다. 이는 또한 cuBLAS가 프로그램에 맞는 올바른 (가장 적합한) 커널을 디스패치하기 위해 수행해야 하는 휴리스틱을 나타냅니다.

이 게시물의 나머지 부분에서, 우리는 다르게 언급하지 않는 한 단순한 2D 행렬로 작업할 것입니다.

왜 matmul에는 추가 CUDA 런타임 호출이 있나요?


그림 14: matmul 커널 시작 전에 CUDA 점유 쿼리가 발생합니다

우리는 `aten::mm`에 대해 두 개의 CUDA 런타임 호출이 있음을 알 수 있습니다. 즉, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (그림 14에 상자로 표시) 및 `cudaLaunchKernel`이고, `aten::add`에 대해서는 `cudaLaunchKernel`만 있습니다.

`cudaOccupancyMaxActiveBlocksPerMultiprocessor`는 계획 호출이며 순전히 CPU 측입니다. 이는 다음을 묻습니다: "커널 함수, 선택한 블록 크기 및 선택한 동적 공유 메모리 크기가 주어졌을 때, 이 커널의 몇 개 블록이 동시에 하나의 SM (스트리밍 멀티프로세서)에 있을 수 있나요?"

이는 질문을 제기합니다. 왜 우리는 matmul을 위한 계획이 필요한데 덧셈을 위해서는 필요하지 않을까요?

이를 이해하려면 커널의 리소스 발자국을 봐야 합니다. GPU 커널을 클릭하면 각 커널의 리소스 발자국을 검사할 수 있습니다.


그림 15: Matmul 발자국	그림 16: Add 발자국

그림 15에서, 행렬 곱셈의 경우 `스레드당 레지스터`와 `공유 메모리`가 동적 (행렬 크기에 따라)임을 알 수 있습니다. cuBLAS는 수백 개의 커널 변형을 제공하며, 각각은 하드웨어 용량에 대한 런타임 정보가 필요한 휴리스틱 기반 실행 경로를 가집니다. 점유 쿼리는 그 휴리스틱의 일부입니다. 개념적으로, 우리는 GPU 가속 matmuls을 독립적인 타일에서 작동하는 것으로 생각할 수 있습니다: 우리가 사용하는 타일 수와 각 타일의 크기는 행렬과 하드웨어에 따라 다릅니다. 현대 알고리즘은 그것보다 훨씬 더 복잡하지만, 이것은 여전히 좋은 참조 프레임워크입니다.

그림 16에서 우리는 덧셈의 발자국이 32 레지스터와 0 공유 메모리라고 말합니다. 그것은 자명하게 맞습니다. 쿼리할 것이 없습니다. 왜냐하면 하드웨어 리소스가 점유율을 제한할 수 없을 것이기 때문입니다. 커널은 설계상 리소스가 가볍습니다.

이것을 임의의 트레이스를 읽을 때 빠른 진단으로 사용할 수 있습니다. CPU 레인에서 `cudaOccupancyMaxActiveBlocksPerMultiprocessor`를 스캔하세요. 각 발생은 "무거운, 적응적으로 실행되는" 커널을 나타냅니다. 보통 GEMM (일반 행렬 곱셈), conv 또는 유사한 것입니다. 앞에 점유 쿼리가 없는 커널은 엘리먼트별/축소 집단으로, PyTorch가 기계적으로 실행합니다.

cudaDeviceSynchronize가 왜 이렇게 큼 (~1.78 ms)?

`cudaDeviceSynchronize`는 모든 GPU 작업이 이 장치에서 완료될 때까지 CPU를 차단합니다. 프로파일러는 이 동기화를 활성 윈도우의 끝에서 이벤트를 플러시하기 위해 내보냅니다. 그 없이는 커널 타이밍이 누락될 것입니다.

1.78 ms 동기화가 26 µs의 실제 GPU 작업을 커버한다는 것은 이 실행이 98% 유휴였다는 것을 말합니다. 이는 오버헤드 제약 증상의 교과서입니다.

4096x4096 트레이스

우리는 이미 프로파일러 테이블 분석 (위)에서 우리의 알고리즘에 더 큰 행렬을 제공하면 오버헤드 제약 영역에서 벗어나 계산 제약이 되는 것을 알고 있습니다.

명령을 실행하고 트레이스를 더 깊이 있게 살펴봅시다.

uv run 01_matmul_add.py --size 4096 --warmup

같은 커널이 다른 커널에 비해 왜 더 많은 시간을 소비하나요?


그림 17: 한 matmul 커널이 다른 커널과 동일한 입력에도 불구하고 GPU에서 더 오래 실행됨

그림 17에서, 우리는 `ProfileStep#3`의 matmul 커널이 다른 단계의 커널보다 GPU에서 더 오래 걸린다는 것을 알 수 있습니다. 이는 특히 흥미롭습니다. 왜냐하면 실행된 다른 커널들은 정확히 같았기 때문입니다. 이는 cuBLAS 휴리스틱이 개입되지 않았다는 의미입니다. 스케줄 간격이 없고, CPU 시작은 정상이고, 프로파일러 인공물이 아닙니다.

그림 17의 이 트레이스는 이상적인 예제에서 놓치기 쉬운 유용한 점을 만듭니다: 커널 런타임은 상수가 아니며, 동일한 하드웨어 환경에서 동일한 코드를 실행하고 동일한 데이터도 그렇습니다.

이를 더 구체적으로 만들기 위해 스크립트를 조금 수정합시다. 우리는 각 단계를 캡처하면서 반복을 20번 실행합니다.

- schedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)
+ schedule = torch.profiler.schedule(wait=0, warmup=0, active=20, repeat=1)

- for _ in range(5):
+ for _ in range(20):


그림 18: 20번의 반복에 걸쳐 같은 matmul 커널은 다른 속도로 실행됨

그림 18은 유사한 발견을 드러냅니다. 각 커널이 정확히 같았지만, 시간이 다르게 걸렸습니다. 다른 계산 시간은 다음을 탓할 수 있습니다:

GPU 클록 (유휴 및 부스트)
GPU 가열
GPU 전력 관리
드라이버 측 정리 작업

평균만 본 독자는 matmul이 ~1 ms (5의 평균 = 1084 µs)를 소비한다고 결론 내릴 것입니다; 트레이스를 본 독자는 matmul이 ~580 µs를 소비하지만 GPU가 맞지 않을 때는 다르다고 볼 것입니다. 이는 매우 다른 정신 모델이며, 하나만 올바릅니다.

torch.compile이 실제로 작동하는 것을 봅시다

`torch.compile`으로 작업하는 것은 항상 우리를 놀라게 했습니다. 하나는 일반적인 eager PyTorch 코드를 작성하지만, PyTorch는 텐서가 많은 영역을 캡처하고 그것을 그래프로 변환하고 최적화하고 생성된 코드를 실행하려고 시도합니다. 기본 백엔드는 보통 `TorchInductor`이며, 광범위한 파이프라인은:

`TorchDynamo`는 Python 실행을 FX 그래프로 캡처합니다
`AOTAutograd`는 그래디언트가 관련될 때 forward/backward 그래프를 준비합니다
`Inductor`는 그래프를 최적화된 CPU 또는 GPU 코드로 낮춥니다.

이 섹션에서, 우리는 컴파일에 대해 이야기하고 프로파일러 트레이스를 살펴봅시다.

uv run 01_matmul_add.py --size 4096 --warmup --compile

`args.compile` 플래그는 다음 코드를 트리거합니다:

def fn(x, w, b):
  return torch.add(torch.matmul(x, w), b)

fn = torch.compile(fn) if args.compile else fn


그림 19: 컴파일된 영역은 트레이스에 TorchDynamo 및 Inductor 프레임으로 나타남

그림 19에서, 우리는 `Torch-Compiled Region: 0/0`이라는 새로운 CPU 행을 봅니다. 이는 우리를 사용 중인 컴파일된 함수로 안내합니다.

matmul과 add 커널을 하나로 융합했나요?


그림 20: 컴파일된 실행은 단일 `aten::addmm`을 디스패치합니다

그림 20을 보면서 우리는 질문합니다. 우리는 정말로 곱셈과 덧셈 연산을 하나로 융합했나요?

이것은 그래프 수준에서의 연산 융합입니다. Inductor는 우리의 `torch.add(torch.matmul(x, w), b)`를 가져가 단일 `aten::addmm(b, x, w)` 호출로 다시 작성했습니다. 주의할 중요한 점은 그것이 **새로운** 융합된 CUDA 커널을 생성하지 **않았다**는 것입니다. 실제 GPU 작업은 여전히 `ampere_bf16_s16816gemm_bf16_128x256_ldg8_f2f_stages_64x3_nn`입니다. eager 모드가 사용한 것과 같은 cuBLAS 커널입니다. 그래서 여기서 "융합"은 커널 수준이 아닌 디스패처 수준에서입니다.

PyTorch는 `torch.addmm` 함수를 제공합니다. 이는 우리가 두 단계로 한 것을 합니다. 즉, 곱하고 더합니다. 우리는 독자가 이 함수의 트레이스를 보고 아래 댓글에서 당신의 관찰을 의견을 말하기를 권합니다!

torch.compile의 런타임 아키텍처

우리는 이론상 컴파일할 때 일어나는 일을 알고 있지만, 실제로 작동하는 것을 보는 것도 똑같이 중요합니다. CPU 측 계층을 살펴봅시다. 이것은 `torch.compile`의 런타임 아키텍처를 반영합니다.

TorchDynamo 캐시 조회는 Dynamo가 현재 호출이 여전히 동일한 입력 모양, dtype, 장치 및 텐서 메타데이터로 컴파일된 것과 일치하는지 확인하는 위치입니다. 무언가라도 불일치하면 Dynamo는 재컴파일할 것입니다. 이 비용은 모든 호출에서 지불되며, 컴파일 후에도 마찬가지입니다.

Torch-Compiled Region은 컴파일된 버전을 "들어가는" 래퍼입니다. AOTDispatcher Runtime Wrapper Prologue는 AOT Autograd의 런타임 래퍼입니다. 여기서 우리는 그래디언트가 필요하지 않지만, AOTDispatcher는 항상 스택에 있습니다. 텐서 메타데이터, 뷰 추적을 처리하고, `requires_grad`가 참이면 역전파를 설정할 것입니다.

## Call CompiledFxGraph 는 실제 생성된 코드가 실행되는 곳입니다. "CompiledFxGraph" 다음의 문자열은 FX 그래프의 컨텐츠 해시입니다. 모든 세 활성 단계 전체에서 동일하므로, 캐시 히트를 확인합니다.

생성된 코드를 `/tmp/torchinductor_/fxgraph`에서 디스크에서 찾을 수 있습니다. 이 해시로 키가 된 것은 Inductor가 실제로 생성한 Triton/C++을 읽으려고 할 때 유용합니다.

CUDA 시작이 절반으로 줄어들까요?


그림 21: 각 컴파일된 단계는 여전히 두 개의 GPU 커널을 시작합니다. Device-to-Device memcpy와 GEMM

그림 21의 트레이스를 보면서, 우리는 단계당 오직 하나의 `cudaLaunchKernel`만 있다고 정말 기뻤습니다. 이 관찰은 GPU 트레이스에서 보는 것과 직접 모순되었습니다. 여전히 단계당 두 개의 커널이 실행되고 있습니다. 즉, `Memcpy DtoD (Device -> Device)`와 GEMM입니다. CPU 트레이스로 돌아가서, 우리는 `cudaMemcpyAsync` 디스패치를 완전히 놓쳤다는 것을 알 수 있었습니다.

`addmm`은 `out = α·A·B + β·C`를 계산합니다. 그리고 cuBLAS의 GEMM-with-bias-add epilogue는 목적지 버퍼에 바이어스를 이미 포함해야 하는 버퍼에 작성합니다. Epilogues는 GEMM **다음에** 발생하는 모든 연산으로 생각할 수 있습니다. 깊은 학습의 세계에서 우리는 활성화, 바이어스 덧셈, 정규화 등과 같은 GEMM-Epilogue를 지속적으로 생각해냅니다. 이것이 cuBLAS GEMM-with-* 커널 변형이 있는 이유입니다.

당신이 `torch.compile`에 대해 다른 `mode`를 사용한다면 다른 커널 변형이 실행되는 것을 알 수 있습니다. 당신은 스스로 시도해보고 아래 댓글에서 당신의 관찰을 추가할 수 있습니다!

그래서 Inductor의 생성된 코드는:

`out = copy(C)` ← 그것이 DtoD memcpy입니다 (32 MB, ~33 µs 소비)
`out = α·(A·B) + β·out` ← `α=β=1`로 GEMM, 바이어스 더하기를 쓰기 회선으로 융합

결과는 수학적으로 여전히 같습니다. 바이어스 덧셈은 무료가 아니며, 우리는 앞에 memcpy 비용을 지불하고 약간 더 비싼 GEMM 에필로그를 지불합니다.

우리가 바라던 융합. 여기서 `x·w + b` (`out = α·A·B + β·C`)는 추가 메모리 트래픽이 없는 단일 커널로 축소됩니다. 이는 일어나지 않았습니다. Inductor는 두 메모리 접촉 연산을 보존하고, 그것은 바이어스 복사를 memcpy로 다시 레이블했고 더하기를 GEMM 에필로그로 다시 레이블했습니다.

정말로 융합된 구현은 memcpy를 건너뜁니다. 이것이 FlashAttention 스타일의 손으로 쓴 커널이 하는 것이고, Inductor가 Triton 코드생성을 통해 할 수 있는 것입니다. 하지만 `4096×4096 bf16 matmul`의 경우, Inductor는 분명히 "cuBLAS를 사용하고, 에필로그를 통해 바이어스를 수행"이 최고의 경로였다고 결정했습니다.

CPU 오버헤드가 올라갔습니다. 내려가지 않았습니다

이것은 eager와 컴파일된 실행을 비교할 때 가장 놓치기 쉬운 것입니다:

단계	eager dur (ms)	compile dur (ms)
#2	0.1	0.2
#3	0.07	0.1
#4	0.07	0.1

컴파일은 단계당 CPU에서 대략 2배 더 비쌉니다. 이것은 모든 호출이 어쨌든 동일한 `aten::addmm` 디스패치 위에 Dynamo > AOTAutograd > Inductor 스택을 걷기 때문입니다. 컴파일 파이프라인은 수십 개의 연산이 있는 ML 모델용으로 제작되었으며. 호출당 오버헤드가 상쇄됩니다 (단일 연산의 경우 세금입니다).

`torch.compile`은 `mode` 인수를 가지고 있습니다. CPU 오버헤드를 줄일 수 있는 `mode`를 찾아내 독자가 가져갈 숙제로 문서를 읽는 것입니다. 🤗

트레이스 읽기 핸디북

우리가 걸어간 패턴에 대한 빠른 참조입니다. 아이디어는: 트레이스에서 이것을 본다면, 일반적으로 이것이 의미한다는 것입니다.

프로파일러 테이블

당신이 보는 것	일반적으로 의미하는 것
`Self CPU time total` ≫ `Self CUDA time total` (CPU는 ms, GPU는 µs)	오버헤드 제약. CPU는 디스패치에 더 많은 시간을 소비하고 GPU는 계산에 소비합니다. 작업을 더 크게 만들거나 (더 큰 행렬, 배치 연산) 호출을 융합합니다.
`Self CPU time total` ≈ `Self CUDA time total`, 둘 다 ms	계산 제약. GPU가 병목입니다. 일반적으로 이것이 원하는 것입니다.
한 이벤트가 `CUDA total`을 지배합니다	그것이 당신의 핫스팟입니다. 거기서 최적화를 시작하세요.
한 이벤트가 엄청난 `# of Calls`를 가집니다	각 호출이 저렴해도 잠재적 병목입니다. 융합하거나 배치할 수 있는지 확인하세요.
`CPU total` ≫ `Self CPU` (행)	대부분의 비용은 자식에 있습니다. 부모가 아닌 중첩된 이벤트를 살펴보세요.

CPU 레인

당신이 보는 것	일반적으로 의미하는 것
첫 번째 `ProfileStep`이 나머지보다 훨씬 넓습니다	콜드 스타트 오버헤드: 작업 공간 할당, cuBLAS 휴리스틱, 지연 모듈 로딩. 워밍업 반복 및/또는 스케줄의 `warmup` 인수를 추가합니다.
`record_function("...")` 시작과 그 안의 첫 `aten::*` 사이의 큰 간격	같은 콜드 스타트 세금, 확대된 것입니다. 주석이 들어갔지만 디스패치가 아직 일어나지 않았습니다.
`cudaLaunchKernel` 앞의 `cudaOccupancyMaxActiveBlocksPerMultiprocessor`	무거운, 적응적으로 실행되는 커널 (GEMM, conv 등). cuBLAS가 드라이버에 얼마나 많은 블록이 SM에 맞는지 묻고 있어서 커널 변형을 선택할 수 있습니다.
선행 점유 쿼리 없는 `cudaLaunchKernel`	리소스가 가볍고 고정된 발자국을 가진 엘리먼트별 또는 축소 커널입니다. 계획할 것이 없습니다.
활성 윈도우의 끝에서 긴 `cudaDeviceSynchronize`	프로파일러가 이벤트를 플러시 중입니다. 지속 시간은 주로 GPU가 대기 중인 작업을 마치고 있으며, 실제 CPU 비용이 아닙니다. 작은 GPU 작업을 커버하는 동기화는 클래식 오버헤드 제약 증상입니다.
당신이 작성하지 않은 `cudaMemcpyAsync`	종종 숨겨진 Device-to-Device 복사입니다. `addmm`이 바이어스를 사용해 목적지 버퍼를 초기화하고 GEMM 에필로그를 수행할 때 일반적입니다.

GPU 레인

당신이 보는 것	일반적으로 의미하는 것
GPU 레인의 `Activity Buffer Request`	프로파일러가 자신의 이벤트 버퍼를 할당/재작성 중입니다. 첫 번째는 일반적으로 초기 CPU↔GPU 레인 오프셋을 차지합니다.
단일 단계의 두 커널 사이의 간격	실행 중 다른 버퍼 요청이 가능합니다. 더 많은 반복을 실행하여 확인: 한 번만 나타나면 프로파일러이지, 당신의 코드가 아닙니다.
같은 커널이 단계 전체에서 다르게 타이밍됩니다	GPU 클록, 열, 전력 관리, 드라이버 정리. 평균이 아닌 트레이스를 읽으세요.
`ampere_bf16_s16816gemm_...` 같은 커널 이름	matmul의 실제 cuBLAS GPU 작업입니다. 커널 이름은 일반적으로 동일 모양/dtype의 eager와 컴파일된 모드에서 동일합니다.
GEMM 바로 앞의 `Memcpy DtoD`	`addmm` 에필로그를 위한 바이어스 복사입니다. "융합"은 커널이 아닌 디스패처 수준에서입니다.

디스패치 체인

당신이 보는 것	일반적으로 의미하는 것
`ProfileStep#N` → `` → `aten::*` → `aten::mm` / `aten::bmm` / `aten::add`	정식 중첩 호출 계층입니다. 자신 시간은 자식을 제외합니다; 전체 시간은 그것을 포함합니다.
`aten::matmul`이 `aten::mm`으로 해결됩니다	2D × 2D 행렬 곱셈입니다.
`aten::matmul`이 `aten::bmm` (추가 CUDA 런타임 호출 포함)으로 해결됩니다	3D+ 텐서에 대한 배치 matmul입니다. cuBLAS는 변형을 선택할 더 많은 휴리스틱 작업을 수행합니다.
별도의 `aten::add` + `aten::mm` 쌍 대신 `aten::addmm(b, x, w)`	디스패처 수준에서의 연산 융합입니다. GPU 커널은 여전히 동일한 GEMM이며, 바이어스 더하기는 에필로그로 접혀 있습니다.

torch.compile

당신이 보는 것	일반적으로 의미하는 것
CPU 레인의 `Torch-Compiled Region: K/M` 행	당신은 컴파일된 함수 내부에 있습니다.
모든 단계에서 `TorchDynamo Cache Lookup`	Dynamo가 모양/dtype/장치가 캐시된 컴파일과 일치하는지 확인 중입니다. 컴파일 후에도 모든 호출에서 지불됩니다.
그래디언트 없이도 `AOTDispatcher Runtime Wrapper Prologue`	AOTAutograd의 런타임 래퍼는 항상 스택에 있으며, 텐서 메타데이터와 뷰 추적을 처리합니다.
`## Call CompiledFxGraph `이 단계 전체에서 동일한 해시	생성된 코드에 대한 캐시 히트입니다. 생성된 소스는 `/tmp/torchinductor_/fxgraph/` 아래에 있습니다.
작은 연산의 경우 eager보다 `torch.compile` 아래 높은 단계당 CPU 시간	예상됩니다. Dynamo → AOTAutograd → Inductor 스택은 많은 연산에서만 상쇄되는 세금입니다.

결론

우리는 작은 `matmul + add`로 시작했고 PyTorch 프로파일러를 읽는 방법을 배우는 핑계로 사용했습니다. 그 길에서 우리는 더 큰 워크로드로 잘 이동하는 몇 가지 정신 모델을 집었습니다. 이것은 **PyTorch 프로파일링** 시리즈의 첫 번째 정거장입니다. 다음 게시물에서 우리는 이 두 연산 장난감을 천천히 떠나 복잡성의 계단을 걸어 올라갈 것이며, 더 큰 빌딩 블록을 살펴보고 결국 실제 모델을 살펴볼 것입니다.

Noe Flandre, Suvaditya Mukherjee, 그리고 Vidit Ostwal의 게시물 초안에 대한 리뷰에 감사합니다!

Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

What you cannot profile, you cannot optimize.

Whether you are trying to squeeze more tokens per second out of a Large Language Model (LLM), shave milliseconds off inference, or just understand why your training loop runs slower than the spec sheet promises, the path eventually runs through profiling.

The catch is that profiling has a steep on-ramp. The traces are dense walls of colored rectangles. The events carry intimidating names. Most tutorials assume you can already read them. So even when we know we should be profiling, opening a trace can feel like a chore best left for later (or for someone else). This post, and the series it kicks off, is our attempt to lower that on-ramp.

This is the opening post of Profiling in PyTorch, a series where we slowly build the skill of reading profiler traces and use it to drive optimization. The plan:

Part 1 (this post): start with the simplest possible operation, a matrix multiplication followed by a bias add, and learn how to read what the profiler hands back.
Part 2: scale up to nn.Linear and a small MLP, use the traces to motivate optimizations, and peek at the kernels underneath.
Part 3: put it all together on Large Language Models with transformers.

We document the journey from a beginner's point of view. No prerequisites apart from basic PyTorch. Treat this as a leisurely read with some "Aha!" moments. The structure of the post is intentionally question-led: we open a trace, ask "wait, why is that happening?", and chase the answer until something clicks. By the end you should know:

how to set up torch.profiler and what it actually hands back,
how to read the profiler table and the trace (CPU lane, GPU lane, and the suspicious gaps in between),
the chain of events from a Python call all the way down to a CUDA kernel,
what changes (and, more interestingly, what does not change) when you slap torch.compile on top.

Before we begin, two definitions that will make everything below read better:

A GPU kernel is a program that runs in parallel on many threads of the GPU.
The CPU schedules and launches these kernels.

You don't usually have to write GPU kernels yourself; when you use a PyTorch operation, it is automatically translated to one or more kernels that do the job on GPU.

With those two ideas in your back pocket, let's start asking questions.

Here is the entire script that we use for the post: 01_matmul_add.py. We recommend opening this script in a separate tab and walk through the code step by step. We use the NVIDIA A100-SXM4-80GB GPU to run the scripts.

The matrix multiplication and addition operation

As correctly quipped by Dr. Sara Hooker, just as we are primarily made up of water, Deep Neural Networks are primarily made up of matrix multiplies. As fundamental as they are, it would be a shame to start our profiling journey with anything else.

def fn(x, w, b):
  return torch.add(torch.matmul(x, w), b)

The matrix addition along with the matrix multiplication mimics how weights and biases interact in a neuron. This addition (pun intended) will help us understand how it paves the way for compilation later in the post.

To profile, we will be using the torch.profiler module. The steps involved are:

Have the code to profile ready (here def fn, which wraps the matrix multiplication and matrix addition)
Annotate the algorithm. While this is completely optional, we recommend doing this. The record_function annotates our function as matmul_add, which will be easy to navigate in the traces (as we note later)

def step():
  with torch.profiler.record_function("matmul_add"):
    return fn(x, w, b)

Wrap the code with the torch.profiler.profile context manager

  with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,  
        torch.profiler.ProfilerActivity.CUDA, 
    ],
  ) as prof:
    
    for _ in range(5):
      step()
      prof.step()

Export the profile


prof.key_averages().table(sort_by="cuda_time_total", row_limit=15)


prof.export_chrome_trace(trace_path)

The profiler exports two distinct artifacts:

The profiler table: Provides the statistical summary of the algorithm. It answers "What is taking the most time". This becomes really helpful to figure out hotspots. A hotspot would be events that take the most amount of time, can be a bottleneck of the pipeline, or an event that is triggered a lot of times.
The profiler trace: Provides the temporal execution view. Answers "When and Why an operation happened", depicting the activities taking place on the CPU and the GPU. This is helpful when we want to investigate the kernel(s) that were launched, any delays in launching them, any overlap between CPU and GPU activities, etc.

Let's see the two in action with our first execution. (Here is the entire 01_matmul_add.py script)

It is recommended to run this script on a machine with a GPU.

uv run 01_matmul_add.py --size 64

If you run the above script (on a GPU machine) you will find a folder traces/01_matmul_add with the two artifacts:

64_bf16_cold_eager.json
64_bf16_cold_eager.txt


Figure 1: Profiler table for matmul add on 64 sized matrices

The .txt file holds the profiler table. Upon opening the file, as shown in Figure 1, one would be greeted with a big table with the first column consisting of the events that were triggered inside the scope of profile.

The other columns are related to the time the event takes on the CPU or GPU or any other device(s) specified in activities within torch.profiler.profile. Look at which events take the most amount of time, and try to intuitively understand if that event should in fact take that time. It is also important to look at the column "# of Calls" which dictates how many times the event was triggered.

While we are at it, let's also talk about "Self CPU/CUDA" vs "CPU/CUDA total". The "Self" columns measure time spent only inside the event itself, excluding its children. The "total" columns include the event and all of its children together. So if you look at the "CPU total" of matmul_add, it consists of the time it took on self plus the children events it triggered. This is an important nuance to note.

If you look at the last two lines out of the table you would notice that the profiler tells us that

Self CPU time total: 2.314ms
Self CUDA time total: 23.104us

The CPU time is in ms while the GPU time is in us. To put things in perspective, the time spent on GPUs (the kernel ampere_bf16_s16816gemm...) is less than 1% of the time spent on the CPU (the matmul_add operation). The GPU stays idle most of the time, which is an immediate red flag. The reason this happens is that the GPU can compute a small matmul very quickly, so our code spends most of the time preparing the kernels, launching them on the GPU, sending the data to multiply and gathering the results. This concept is known as an overhead-bound algorithm.

The easiest way to move out of this regime is to use bigger matrix multiplications.

uv run 01_matmul_add.py --size 4096


Figure 2: Profiler table for matmul add on 4096 sized matrices

The last two lines in Figure 2 are:

Self CPU time total: 4.908ms
Self CUDA time total: 4.495ms

Both times are in ms, which means we have materialized more GPU time just by increasing the size of the matrix multiplications. If you look at Figure 2 you would also notice that the most CUDA time is now taken by the GPU kernel (ampere_bf16_s16816gemm_..) and not by the CPU operation that launched it (matmul_add). This means that we were indeed able to move from overhead bound to compute bound.

We now move into visualising the dispatch chain, which lives inside the .json artifacts. You can upload them to Perfetto UI and see the traces, or you can use uvx trace-util traces -b traces to generate the Perfetto links directly.

64x64 traces


Figure 3: Profiler trace for matmul and add on 64 sized matrices

In Figure 3, we see the profiler trace for the matrix multiplication and addition. Here the bar width indicates the duration of an event, the vertical nesting is the call hierarchy, the CPU lane denotes the events that happen on the CPU, while the GPU lane shows the actual kernel executions. One might also notice the empty spaces which are the waiting or idle time.

The script was run with default configurations which are:

size 64: The inputs, weights and biases are sized (64, 64)
dtype bf16: The data type is bfloat16
no compile: We have not compiled the torch operations
no warmup: We have not warmed up the GPU before profiling

With Perfetto we suggest using the keyboard for quicker access to the trace. One could use "W A S D" for navigating the trace.


Figure 4: The CPU and GPU lanes of a PyTorch profiler trace

There are two lanes in Figure 4, one for the CPU activity and one for the GPU activity. In the CPU lane one would notice three profile steps (starting from ProfilerStep#2). This comes from the schedule.

schedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)

The wait skips noisy initializations (ProfilerStep#0), warmup runs through the profiler without recording (ProfilerStep#1), and active is what shows up in trace. One can find the schedule being used in the script here.

Let's put on our detective hats and investigate the trace and ask some questions.

Why does the ProfilerStep#2 take so long?


Figure 5: `ProfileStep#2` is visibly wider than the steps that follow it

In Figure 5, we notice that ProfileStep#2 takes more time compared to the other steps, and upon looking closely you would see a similar pattern with the matmul_add annotation as well. The smoking gun is inside the annotation, not the annotation itself:

Step	`matmul_add` start	`aten::matmul` start	gap
#2	138.736	366.493	227.757 µs
#3	517.926	523.447	5.521 µs
#4	610.039	614.527	4.488 µs


Figure 6: The ~228 µs dead window between `record_function("matmul_add")` and `aten::matmul`

That ~228 µs shown in Figure 6 is the "dead window" between entering record_function("matmul_add") and PyTorch actually dispatching aten::matmul. This can happen for multiple reasons, including workspace allocations, cuBLAS (NVIDIA’s proprietary, GPU-accelerated library for performing fundamental linear algebra operations) heuristics, or lazy module loading. We can either look away or run some more warmup steps before we profile (which is the standard)

In terms of profiling, warmup is when you run the events a couple of times before actually profiling it. The pre-work done by the GPU (including the above pointers) are one time efforts which we do not want to profile. In our example, we have two warmup stages, one where we actually loop over the function before entering the profiler, and two inside the profiler which is achieved by the warmup argument. In this section, we have enabled the actual iterations along with the schedule.

uv run 01_matmul_add.py --warmup

Perfetto Trace for 64x64 with Warmup


Figure 7: After warming up, every profile step takes a similar amount of time

In Figure 7 we see that each profile step takes a similar time, but this does not mean we were able to optimize the one time overheads. We warmed up the runs so that the overheads were not profiled. We think that closing this section abruptly without a hint to solving this would do injustice to the reader, so here is a link to read about further optimizing launch overheads.

Why is there an offset of ~2.5 ms between the CPU and GPU lanes?


Figure 8: The ~2.5 ms offset between the CPU and GPU lanes

In Figure 8, we see that the CPU and GPU lanes have an offset of around 2.5 ms: this is the delay after the CPU submits the CUDA kernels and the time they actually start executing. One might think the warmup stage combined with the schedule's wait and warmup should keep a GPU busy and would diminish the offset.

To uncover what is really happening, let's change our schedule a little:

- schedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)
+ schedule = torch.profiler.schedule(wait=0, warmup=0, active=3, repeat=1)


Figure 9: With `wait=0` and `warmup=0`, the trace reveals an `Activity Buffer Request`

Figure 9 shows us that there is an Activity Buffer Request in the GPU lane before any operation. Let's zoom in a little more.


Figure 10: A gap appears between the matmul and add kernels on profile step 1

Upon zooming into the GPU trace, we notice that the matmul and add kernels for ProfileStep#0 (the CPU trace of which is not visible in the Figure) happen one after the other, while the kernels for ProfileStep#1 have a window in between. The best explanation for this is that there was an overflow of buffers, and another buffer request (a request to allocate some memory on the GPU VRAM) was issued during the kernel execution.

The best way to rule out other possibilities is to profile for more iterations and see whether a similar window appears in other parts of the trace. To do that we run with active=20.


Figure 11: With 20 active steps the gap only shows up once, confirming it is a buffer request

As shown in Figure 11, we see a similar trend in ProfileStep#1. This is aligned with our previous findings, and we can safely conclude that it was indeed another buffer request.

The chain of events


Figure 12: The chain of dispatch

In Figure 12, we see the nested CPU calls. This is an important visualization, where one gets to understand what a chain of dispatch really looks like.

We begin with ProfileStep#<id> which encapsulates the profiling step. Due to us annotating the step, we see the matmul_add row. The matmul_add consists of two aten calls, one for matrix multiplication and one for matrix addition.

The aten::matmul is the ATen-level dispatch that those user-facing PyTorch matmul calls land on. aten::mm is the 2D matrix-matrix multiply backend.

It is very interesting to note how PyTorch calls aten::bmm (batched matrix multiplication) if we add the batch axis to our matrices. Let's take a detour and see the aten::bmm in action.

- x = torch.randn(args.size, args.size, device=device, dtype=dtype)
- w = torch.randn( args.size, args.size, device=device, dtype=dtype)
- b = torch.randn(args.size, args.size, device=device, dtype=dtype)

+ # adding a batch size of 8
+ x = torch.randn(8, args.size, args.size, device=device, dtype=dtype)
+ w = torch.randn(8, args.size, args.size, device=device, dtype=dtype)
+ b = torch.randn(8, args.size, args.size, device=device, dtype=dtype)


Figure 13: Batched Matrix Multiplication

In Figure 13, upon adding the batch axis to the inputs, aten::matmul now encapsulates a bunch of other prerequisite CUDA runtime calls along with aten::bmm (instead of aten::mm). This also hints at the heuristics that cuBLAS needs to do in order to dispatch the right (most suitable) kernel for the program.

In the rest of the post, we will be working with simple 2D matrices, unless otherwise mentioned.

Why does matmul have an extra CUDA runtime call?


Figure 14: A CUDA occupancy query fires before the matmul kernel launch

We notice that for aten::mm there are two CUDA Runtime calls, namely cudaOccupancyMaxActiveBlocksPerMultiprocessor (boxed in Figure 14) and cudaLaunchKernel, while for aten::add there is only the cudaLaunchKernel.

cudaOccupancyMaxActiveBlocksPerMultiprocessor is a planning call and is purely CPU side. It asks: "given a kernel function, a chosen block size, and a chosen dynamic shared memory size, how many blocks of this kernel can simultaneously reside on one SM (Streaming Multiprocessor)?"

This begs the question, why do we need planning for matmul and not for add?

To understand this, we have to look at the kernel's resource footprint. If you click on the GPU kernels, you will be able to inspect the resource footprint for the respective kernel.


Figure 15: Matmul footprint	Figure 16: Add footprint

In Figure 15, we note that for matrix multiplication the registers per thread and shared memory are dynamic (based on the size of the matrix). cuBLAS ships hundreds of kernel variants, and each has a heuristic-driven launch path that needs runtime information about hardware capacity. The occupancy query is part of that heuristic. Conceptually, we can think of GPU-accelerated matmuls as working on independent tiles: how many tiles we use and how big each tile needs to be depends on the matrices and the hardware. Modern algorithms are way more complicated than that, but this is still a good reference framework.

From Figure 16 we see that the footprint of addition says 32 registers and zero shared memory. That fits trivially. There's nothing to query, because no hardware resource is going to limit occupancy. The kernel is, by design, resource-light.

You can use this as a quick diagnostic when reading any trace. Scan the CPU lane for cudaOccupancyMaxActiveBlocksPerMultiprocessor. Each occurrence flags a "heavyweight, adaptively launched" kernel, usually a GEMM (GEneral Matrix Multiplication), conv, or similar. The kernels without a preceding occupancy query are the elementwise/reduction crowd that PyTorch launches mechanically.

Why is cudaDeviceSynchronize so big (~1.78 ms)?

cudaDeviceSynchronize blocks the CPU until all GPU work on this device finishes. The profiler emits this sync at the end of the active window to flush events. Without it, kernel timings would be missing.

A 1.78 ms sync covering 26 µs of real GPU work tells you this run was 98% idle. That's the textbook overhead-bound symptom.

4096x4096 traces

We already know from the profiler table analysis (above) that providing bigger matrices to our algorithm moves it out from the overhead-bound region to being compute-bound.

Let's run the command and dive deeper into the traces.

uv run 01_matmul_add.py --size 4096 --warmup

Why does the same kernel take more time compared to others?


Figure 17: One matmul kernel runs longer than the others despite identical inputs

In Figure 17, we notice that the matmul kernel for ProfileStep#3 takes longer on the GPU than the other steps. This is particularly interesting to note, because the other kernels launched were the exact same, which means there were no cuBLAS heuristics involved. There are no scheduling gaps, the CPU launches are normal, and it is not a profiler artifact.

This trace in Figure 17 makes a useful point that's easy to miss in idealized examples: kernel runtimes are not constants, even on the same hardware environment running identical code on identical data.

Let's make this more concrete by modifying the script a little. We run the iteration 20 times, capturing each of the steps.

- schedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)
+ schedule = torch.profiler.schedule(wait=0, warmup=0, active=20, repeat=1)

- for _ in range(5):
+ for _ in range(20):


Figure 18: Across 20 iterations the same matmul kernel runs at different speeds

Figure 18 reveals a similar finding. While each kernel was the exact same, they time differently. The different compute times can be blamed on a bunch of reasons:

GPU clocks on idle and boost
GPU heating
GPU power management
Driver side housekeeping

A reader who only saw the average would conclude that a matmul took ~1 ms (mean of 5 = 1084 µs); a reader who looked at the trace would see that the matmul takes ~580 µs except when the GPU throws a fit. Those are very different mental models, and only one of them is correct.

Let's see some torch compile at work

Working with torch.compile has always amazed us. One writes normal eager PyTorch code, but PyTorch tries to capture tensor-heavy regions, turn them into graphs, optimize them, and run generated code. The default backend is usually TorchInductor, and the broad pipeline is:

TorchDynamo captures Python execution into an FX graph
AOTAutograd prepares forward/backward graphs when gradients are involved
Inductor lowers the graph into optimized CPU or GPU code.

In this section, we talk about compilation and look at the profiler traces.

uv run 01_matmul_add.py --size 4096 --warmup --compile

The args.compile flag triggers the following code:

def fn(x, w, b):
  return torch.add(torch.matmul(x, w), b)

fn = torch.compile(fn) if args.compile else fn


Figure 19: The compiled regions show up as TorchDynamo and Inductor frames in the trace

In Figure 19, we see the new CPU rows named Torch-Compiled Region: 0/0 which points us to the compiled functions being used.

Did we fuse the matmul and add kernels into one?


Figure 20: Compiled run dispatches a single `aten::addmm`

Looking at Figure 20 we ask the question, did we actually fuse the multiplication and addition operations together into one?

This is operator fusion at the graph level. Inductor took our torch.add(torch.matmul(x, w), b) and rewrote it into a single aten::addmm(b, x, w) call. The important thing to note here is that it did not produce a new fused CUDA kernel. The actual GPU work is still ampere_bf16_s16816gemm_bf16_128x256_ldg8_f2f_stages_64x3_nn, the same cuBLAS kernel eager mode used. So the "fusion" here is at the dispatcher level, not at the kernel level.

PyTorch provides the torch.addmm function that does what we did into two steps, that is multiply and add. We encourage the reader to look at the traces of this function and comment your observations in the comments below!

torch.compile's runtime architecture

While we know in theory what happens when we compile our functions it is equally important to see it in action. Let's look at the CPU-side hierarchy which reflects torch.compile's runtime architecture.

TorchDynamo Cache Lookup is where Dynamo checks that the current call still matches what was compiled with the same input shapes, dtypes, devices, and tensor metadata. If anything mismatched, Dynamo would recompile. This cost is paid every call, even after compilation.

Torch-Compiled Region is the wrapper that "enters" the compiled version. AOTDispatcher Runtime Wrapper Prologue is AOT Autograd's runtime wrapper. Even though we don't need gradients here, AOTDispatcher is always in the stack handling tensor metadata, view tracking, and would set up the backward pass if requires_grad were true.

## Call CompiledFxGraph is where the actual generated code runs. The string after "CompiledFxGraph" is the content hash of the FX graph. It's the same across all three active steps, confirming cache hits.

You can find the generated code on disk under /tmp/torchinductor_<user>/fxgraph keyed by this hash, useful when you want to read the Triton/C++ that Inductor actually produced.

Do the CUDA launches go down by half?


Figure 21: Each compiled step still launches two GPU kernels, a Device-to-Device memcpy and the GEMM

Looking at the traces in Figure 21, we were really happy to notice only one cudaLaunchKernel per step. This observation was directly contradicting what we were seeing in the GPU trace. There were still two kernels being launched per step, namely the Memcpy DtoD (Device -> Device) and the GEMM. Going back to the CPU trace, we noticed that we had completely missed the cudaMemcpyAsync dispatch.

addmm computes out = α·A·B + β·C, and cuBLAS's GEMM-with-bias-add epilogue writes into a destination buffer that needs to already contain the bias. An epilogues can be thought of all the operations that happen after a GEMM. In the world of deep-learning we constantly come up with GEMM-Epilogues like activations, bias addition, normalization and many more. This is why there are cuBLAS GEMM-with- kernel variants.

If you use different modes for torch.compile you would notice different kernel variants being launched. You can try it for yourself and add a comment below about your observations!

So Inductor's generated code does:

out = copy(C) ← that's the DtoD memcpy (32 MB, takes ~33 µs)
out = α·(A·B) + β·out ← GEMM with α=β=1, fusing the bias add into the writeback

The result is mathematically still the same. The bias add isn't free, as we pay a memcpy upfront plus a slightly more expensive GEMM epilogue.

The fusion one might have hoped for, where x·w + b (here out = α·A·B + β·C) collapses into a single kernel with no extra memory traffic, isn't what happened. Inductor preserved the two memory-touching operations, it just relabeled the bias copy as a memcpy and the addition as a GEMM epilogue.

A truly fused implementation would skip the memcpy. That's what FlashAttention-style hand-written kernels do, and what Inductor can do via Triton codegen, but for a 4096×4096 bf16 matmul, Inductor evidently decided "use cuBLAS, do the bias via epilogue setup" was the best path.

CPU overhead went up, not down

This is the easiest thing to miss when comparing an eager and a compiled run:

step	eager dur (ms)	compile dur (ms)
#2	0.1	0.2
#3	0.07	0.1
#4	0.07	0.1

Compile is roughly 2× more expensive on the CPU per step. That's because every call walks the full Dynamo > AOTAutograd > Inductor stack, on top of the same aten::addmm dispatch we have anyway. The compile pipeline is built for ML models with dozens of ops where the per-call overhead amortizes (for a single op it's a tax).

torch.compile has a mode argument. It is for the reader to take home as an assignment to read the documentation and come up with a mode that could take the CPU overhead down. 🤗

Trace reading cheatsheet

A quick reference for the patterns we walked through. The idea is: if you see this in a trace, this is what it usually means.

Profiler table

What you see	What it usually means
`Self CPU time total` ≫ `Self CUDA time total` (CPU in ms, GPU in µs)	Overhead-bound. The CPU spends more time dispatching than the GPU spends computing. Make the work bigger (larger matrices, batched ops) or fuse calls.
`Self CPU time total` ≈ `Self CUDA time total`, both in ms	Compute-bound. The GPU is the bottleneck, which is usually what you want.
One event dominates `CUDA total`	That's your hotspot. Start the optimization there.
One event has a huge `# of Calls`	A potential bottleneck even if each call is cheap. Check whether it can be fused or batched.
`CPU total` ≫ `Self CPU` for a row	Most of the cost lives in children. Drill into the nested events, not the parent.

CPU lane

What you see	What it usually means
First `ProfileStep` much wider than the rest	Cold-start overhead: workspace allocation, cuBLAS heuristics, lazy module loading. Add warmup iterations and/or the schedule's `warmup` argument.
Big gap between `record_function("...")` start and the first `aten::*` inside it	Same cold-start tax, just zoomed in. The annotation entered, but the dispatch hadn't happened yet.
`cudaOccupancyMaxActiveBlocksPerMultiprocessor` before a `cudaLaunchKernel`	A heavyweight, adaptively-launched kernel (GEMM, conv, etc.). cuBLAS is asking the driver how many blocks fit on an SM so it can pick a kernel variant.
`cudaLaunchKernel` with no preceding occupancy query	An elementwise or reduction kernel with a fixed, resource-light footprint. Nothing to plan.
A long `cudaDeviceSynchronize` at the end of the active window	The profiler flushing events. Its duration is mostly the GPU finishing pending work, not a real CPU cost. A sync covering tiny GPU work is a classic overhead-bound symptom.
A `cudaMemcpyAsync` you didn't write	Often a hidden Device-to-Device copy. Common when `addmm` seeds its destination buffer with the bias before the GEMM epilogue.

GPU lane

What you see	What it usually means
`Activity Buffer Request` on the GPU lane	The profiler is allocating/refilling its own event buffer. The first one usually accounts for the initial CPU↔GPU lane offset.
A gap between two kernels in a single step	Likely another buffer request mid-execution. Confirm by running more iterations: if it appears only once, it's the profiler, not your code.
The same kernel timing differently across steps	GPU clocks, thermals, power management, driver housekeeping. Read the trace, not just the mean.
A kernel named like `ampere_bf16_s16816gemm_...`	The actual cuBLAS GPU work for a matmul. The kernel name is typically the same in eager and compiled mode for the same shapes/dtypes.
`Memcpy DtoD` immediately before a GEMM	The bias copy for an `addmm` epilogue. The "fusion" is at the dispatcher level, not in the kernel.

Dispatch chain

What you see	What it usually means
`ProfileStep#N` → `<record_function name>` → `aten::*` → `aten::mm` / `aten::bmm` / `aten::add`	The canonical nested call hierarchy. Self time excludes children; Total time includes them.
`aten::matmul` resolving to `aten::mm`	2D × 2D matrix multiply.
`aten::matmul` resolving to `aten::bmm` (with extra CUDA runtime calls)	Batched matmul on 3D+ tensors. cuBLAS does more heuristic work to pick the variant.
`aten::addmm(b, x, w)` instead of a separate `aten::add` + `aten::mm` pair	Operator fusion at the dispatcher level. The GPU kernel is still the same GEMM, with the bias add folded into the epilogue.

torch.compile

What you see	What it usually means
A `Torch-Compiled Region: K/M` row in the CPU lane	You're inside a compiled function.
`TorchDynamo Cache Lookup` on every step	Dynamo is verifying shapes/dtypes/devices match the cached compile. Paid on every call, even after compilation.
`AOTDispatcher Runtime Wrapper Prologue` even with no grads	AOTAutograd's runtime wrapper is always in the stack, handling tensor metadata and view tracking.
`## Call CompiledFxGraph <hash>` with the same hash across steps	Cache hits on the generated code. The generated source lives under `/tmp/torchinductor_<user>/fxgraph/<hash>`.
Per-step CPU time higher under `torch.compile` than eager for a tiny op	Expected. The Dynamo → AOTAutograd → Inductor stack is a tax that only amortizes over many ops.

Conclusion

We started with a tiny matmul + add and used it as an excuse to learn how to read a PyTorch profiler. Along the way we picked up a few mental models that travel well to bigger workloads. This was the first stop in the Profiling PyTorch series. In the posts that follow, we will gradually leave this two-op toy behind and walk up the ladder of complexity, looking at larger building blocks and, eventually, real models.

Thanks to Noe Flandre, Suvaditya Mukherjee, and Vidit Ostwal for their reviews on the early draft of the post!

#pytorch #profiling #performance-optimization #deep-learning #torch-profiler #beginners-guide