GPU Collective Communication Visualizer
An interactive web visualization for understanding collective communication — the fundamental communication patterns used in distributed GPU computing. GPU communication libraries (NVIDIA NCCL, AMD RCCL) and distributed computing frameworks (MPI, Gloo) implement these patterns, and this tool helps you learn how they work.
See how data moves between GPUs in Ring, Tree, and Naive topologies — step by step, with real-time animation.
What is Collective Communication?
Collective communication refers to operations in which a group of GPUs participates together — for example, summing a gradient tensor across every GPU in a training job. These patterns are the backbone of distributed deep learning training. This tool visualizes how each algorithm works internally, step by step, so you can understand the core concepts.
AllReduce Visualization
AllReduce is the most widely used collective operation in distributed training. It sums values across all GPUs and distributes the result back. This tool visualizes Ring AllReduce, Tree AllReduce, and Naive AllReduce step by step, showing exactly how data flows between GPUs at each stage.
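The semantics can be sketched in a few lines of plain Python (an illustrative model only — `allreduce_sum` and the list-of-lists representation are our assumptions, not the visualizer's code):

```python
# Hedged, pure-Python sketch of AllReduce semantics (no real GPUs or NCCL
# calls): after the operation, every rank holds the element-wise sum of
# all ranks' buffers.
def allreduce_sum(buffers):
    """buffers[r] is rank r's local data; return the post-AllReduce state."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]  # identical result on every rank

ranks = [[1, 2], [3, 4], [5, 6], [7, 8]]  # 4 simulated GPUs, 2 values each
print(allreduce_sum(ranks))  # every rank ends with [16, 20]
```

In real training, the summed values are gradients, so after AllReduce every GPU can apply the same optimizer step.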
Supported Operations
- AllReduce (Ring, Tree, Naive) — Reduce all GPU values and broadcast the result to every GPU. Essential for gradient synchronization in distributed training (PyTorch DDP, Horovod).
- Broadcast (Tree, Naive) — Copy one GPU's data to all other GPUs. Used for model parameter initialization and synchronization.
- Reduce (Tree, Naive) — Aggregate all GPU data to a single root GPU. Used in parameter server architectures.
- AllGather (Ring) — Each GPU collects all unique data chunks from every other GPU. Used in FSDP/ZeRO for parameter restoration before forward pass.
- ReduceScatter (Ring) — Reduce and distribute result chunks across GPUs. Core operation in FSDP/ZeRO gradient sharding for memory-efficient training.
- AllToAll (Naive) — Personalized data exchange between all GPU pairs. Used in Mixture of Experts (MoE) models for token routing.
- Gather (Naive) — Collect unique chunks from all GPUs to root.
- Scatter (Naive) — Distribute root's chunks to each GPU.
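The AllGather and ReduceScatter entries above pair up naturally: a ReduceScatter followed by an AllGather of the reduced chunks is equivalent to an AllReduce, which is exactly how Ring AllReduce is structured. A plain-Python sketch of their semantics (function names and the one-value-per-chunk representation are illustrative assumptions, not this tool's API):

```python
def allgather(chunks):
    """chunks[r] is rank r's unique chunk; every rank ends with all of them."""
    gathered = [value for chunk in chunks for value in chunk]
    return [list(gathered) for _ in chunks]

def reduce_scatter_sum(buffers):
    """buffers[r] holds one chunk per rank; rank r keeps the reduced chunk r."""
    n = len(buffers)
    return [sum(buffers[src][r] for src in range(n)) for r in range(n)]

# 3 simulated GPUs: ReduceScatter leaves rank r with the sum of column r.
print(reduce_scatter_sum([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # [12, 15, 18]
```

This pairing is also why FSDP/ZeRO uses them together: ReduceScatter shards the reduced gradients across GPUs, and AllGather later reassembles full parameters when they are needed.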
Algorithm Topologies
- Ring AllReduce — Data flows in a circular pattern among GPUs. Bandwidth-optimal for large messages; NCCL typically selects a ring-based algorithm at large message sizes.
- Tree AllReduce — Binary tree structure for logarithmic-depth communication. Lower latency for small messages.
- Naive (Direct) — Star topology with direct GPU-to-GPU transfers. Simplest to follow, but the root becomes a bandwidth bottleneck as the GPU count grows.
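The Ring topology above can be simulated end to end in plain Python. The chunk schedule below is one common textbook formulation of the two-phase algorithm (N−1 reduce-scatter steps, then N−1 all-gather steps); real NCCL kernels pipeline and schedule differently, so treat this as a sketch:

```python
# Illustrative Ring AllReduce on n simulated ranks, each holding n chunks.
def ring_allreduce(buffers):
    n = len(buffers)
    buf = [list(b) for b in buffers]
    # Phase 1 (reduce-scatter): in step s, rank r sends chunk (r - s) mod n
    # one hop around the ring; the receiver adds it into its own copy.
    for s in range(n - 1):
        outgoing = [(r, (r - s) % n, buf[r][(r - s) % n]) for r in range(n)]
        for r, c, val in outgoing:
            buf[(r + 1) % n][c] += val
    # Now rank r owns the fully reduced chunk (r + 1) mod n.
    # Phase 2 (all-gather): forward finished chunks one hop per step.
    for s in range(n - 1):
        outgoing = [(r, (r + 1 - s) % n, buf[r][(r + 1 - s) % n]) for r in range(n)]
        for r, c, val in outgoing:
            buf[(r + 1) % n][c] = val
    return buf

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # all ranks: [12, 15, 18]
```

Each rank transmits 2(N−1) chunks of size 1/N of the buffer, so the data sent per rank approaches 2× the buffer size regardless of N — this is why the ring is considered bandwidth-optimal for large messages.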
Use Cases
Understanding these collective communication primitives helps you reason about how distributed deep learning frameworks (PyTorch DDP, PyTorch FSDP, DeepSpeed ZeRO, Megatron-LM, Horovod) and communication libraries (NCCL, RCCL, MPI, Gloo) work under the hood.