GPU Training Stack Explorer

Explore the complete GPU distributed training stack - from individual latency components to full-system optimization strategies. This interactive tool breaks down where every microsecond goes in modern large-scale model training.

What You'll Learn

The Latency Waterfall

Understand every component of inter-node communication latency:

Key Insight: PCIe traversal dominates total latency (40-50%), making the IB vs EFA debate less significant than commonly believed.
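The waterfall can be sketched as a simple component sum. The component names and microsecond values below are illustrative assumptions chosen to match the ~40-50% PCIe share described above, not measurements from any specific cluster.

```python
# Illustrative inter-node latency waterfall for one small RDMA message.
# Values are assumptions for demonstration, not measured data.
components = {
    "GPU kernel launch / doorbell": 1.0,  # us
    "PCIe traversal (GPU -> NIC)": 3.2,   # us
    "NIC send processing": 1.0,           # us
    "Wire + switch hops": 2.8,            # us
    "NIC receive processing": 1.0,        # us
    "PCIe traversal (NIC -> GPU)": 3.2,   # us
    "Completion handling": 1.2,           # us
}

total = sum(components.values())
pcie = sum(v for k, v in components.items() if "PCIe" in k)

print(f"Total one-way latency: {total:.1f} us")
print(f"PCIe share: {pcie / total:.0%}")
for name, us in sorted(components.items(), key=lambda kv: -kv[1]):
    print(f"  {name:30s} {us:4.1f} us  ({us / total:.0%})")
```

With these numbers, the two PCIe traversals together account for roughly 48% of the total - which is why shaving a few hundred nanoseconds off the switch fabric moves the needle so little.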

Parallelism Topology

See how different parallelism strategies communicate:

Key Insight: Only tensor parallelism (TP) runs over NVLink and cannot be overlapped. Everything else crosses the network, but modern systems hide most of it behind compute.
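One way to internalize the insight above is as a small classification table. The strategies beyond TP (data, pipeline, and expert parallelism) and their overlap behavior are common-case assumptions for illustration, not an exhaustive taxonomy.

```python
# Sketch: which fabric each parallelism strategy uses and whether its
# communication can hide behind compute. Illustrative mental model only.
from dataclasses import dataclass

@dataclass
class Strategy:
    name: str
    fabric: str         # "NVLink" (intra-node) or "network" (inter-node)
    overlappable: bool  # can its communication hide behind compute?

STRATEGIES = [
    Strategy("TP (tensor parallel)",   "NVLink",  False),  # on the critical path
    Strategy("DP (data parallel)",     "network", True),   # grad all-reduce overlaps backward
    Strategy("PP (pipeline parallel)", "network", True),   # activation sends overlap micro-batches
    Strategy("EP (expert parallel)",   "network", True),   # all-to-all overlaps expert compute
]

exposed = [s.name for s in STRATEGIES if not s.overlappable]
print("Latency-critical (not overlappable):", exposed)
```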

Compute-Communication Overlap

Watch the evolution from naive blocking to fully overlapped execution:

Key Insight: Clever scheduling drives the progression from "every nanosecond matters" to "latency disappears entirely."
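The progression reduces to a one-line model: only the non-overlapped fraction of communication adds to step time. The 100 ms / 30 ms figures below are illustrative assumptions.

```python
# Toy model of blocking -> overlapped execution. Only exposed (non-hidden)
# communication contributes to step time; hidden communication is free.
def step_time(compute_ms: float, comm_ms: float, overlap_fraction: float) -> float:
    exposed = comm_ms * (1.0 - overlap_fraction)
    return compute_ms + exposed

for frac in (0.0, 0.5, 0.9, 1.0):
    t = step_time(100.0, 30.0, frac)
    print(f"{frac:4.0%} overlap -> step time {t:6.1f} ms")
```

At 0% overlap (naive blocking) the step pays the full 130 ms; at 90%+ overlap it approaches the pure compute time of 100 ms, and the fabric's raw latency stops mattering.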

Profiling Toolkit

Top-down methodology for finding bottlenecks:

Complete with actual profiling commands for every layer (nsys, ncu, dcgmi, perfquery, etc.).
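The top-down idea is to start at the framework level and drill down only when a layer looks suspicious. A minimal sketch of that drill-down order, using the tools named above (the layer names and ordering are assumptions for illustration):

```python
# Top-down profiling plan: each layer pairs with the tool used to inspect it.
# Drill down one layer at a time instead of profiling everything at once.
from typing import Optional

PLAN = [
    ("Framework step time", "torch.profiler / nsys profile"),
    ("GPU kernel behavior", "ncu (Nsight Compute)"),
    ("GPU health and telemetry", "dcgmi dmon"),
    ("Fabric port counters", "perfquery (InfiniBand)"),
]

def next_layer(current: str) -> Optional[str]:
    """Return the layer to inspect after `current`, or None at the bottom."""
    names = [layer for layer, _ in PLAN]
    i = names.index(current)
    return names[i + 1] if i + 1 < len(names) else None

print(next_layer("GPU kernel behavior"))
```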

The Real Optimization Frontier

Ranked list of what actually delivers speedups:

  1. SM Contention Elimination (1.5-2.6×): NCCL steals SMs from GEMMs
  2. Overlap Scheduling (hide 90%+): Communication during compute is free
  3. Reducing Total Bytes (2× AllReduce BW): FP8, SHARP, compression
  4. Workload Balancing (eliminates 65% of CP latency): It's not the network!
  5. Fabric Congestion (P99.9 tails): The legitimate IB vs EFA debate - but #5 of 5

Key Insight: The fabric debate is the LEAST important optimization lever. SM contention, overlap, and workload balance dominate.
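Item #1 on the list is easy to model: when NCCL kernels occupy SMs during a GEMM, the GEMM slows roughly in proportion to the SMs it loses. The 132-SM figure matches an H100-class GPU; the 32 stolen SMs and 10 ms GEMM time are illustrative assumptions.

```python
# Toy model of SM contention: a GEMM's runtime scales inversely with the
# SMs left to it while NCCL kernels occupy the rest. Illustrative only.
def contended_time(gemm_ms: float, total_sms: int, stolen_sms: int) -> float:
    return gemm_ms * total_sms / (total_sms - stolen_sms)

base = 10.0                               # ms, uncontended GEMM (assumed)
slow = contended_time(base, 132, 32)      # 132 SMs, NCCL grabbing 32
print(f"GEMM: {base:.1f} ms -> {slow:.2f} ms ({slow / base:.2f}x slowdown)")
```

Even this simple proportional model yields a 1.32x slowdown on a single kernel; compounded across a full training step, it is easy to see how eliminating contention reaches the 1.5-2.6x range cited above.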

Interactive Features

Tab Navigation

Layer-by-Layer Exploration

Click on individual layers to see:

Sources & References

Data synthesized from:

Use Cases

For ML Engineers: Understand where training time actually goes and what to optimize first

For Infrastructure Teams: Make informed decisions about fabric selection (IB vs EFA) based on measured impact

For Researchers: Deep-dive into the gap between theoretical wire latency and actual exposed cost

For Capacity Planners: See which hardware components matter most for different training patterns