A Preliminary Report on DisTrO
Bowen Peng, Jeffrey Quesnelle, Dillon Rolnick, Ari Lotter, Umer H. Adil, Esteban La Rocca – 2024
Status quo
- Distributed data parallelism (DDP) and fully sharded data parallelism (FSDP) are used to split model training across GPUs
- Typically only a single model is trained; each GPU holds a full copy (DDP) or a shard (FSDP) of the weights, and gradients are synchronized across all GPUs after every training step
- Synchronization means transferring data on the order of the full model size between GPUs at every step, a highly I/O-intensive operation that requires high-bandwidth interconnects and GPUs placed physically close to one another (sketched below)
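For context, here is a minimal sketch of the per-step synchronization that DDP-style training performs, written in plain PyTorch. The `train_step` function and the batch keys are illustrative, and it assumes `torch.distributed` has already been initialized (e.g. via `torchrun`). The point is that the full gradient set crosses the interconnect on every step, so traffic scales with parameter count.

```python
import torch
import torch.distributed as dist

def train_step(model, batch, loss_fn, optimizer):
    """One data-parallel step: local backward pass, then all-reduce the FULL gradient set."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()

    world_size = dist.get_world_size()
    for p in model.parameters():              # one all-reduce per parameter tensor
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)           # average gradients across ranks

    optimizer.step()
    return loss.item()
```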
Contribution
- Demonstrates a four-to-five order-of-magnitude reduction in inter-GPU communication, enabling LLM training over low-bandwidth, internet-grade links
- Introduces a new drop-in optimizer that matches AdamW with all-reduce in convergence (see the illustrative sketch after this list)
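The report deliberately does not describe the mechanism behind this reduction. Purely to illustrate the shape of the claim (a drop-in replacement for the full-gradient all-reduce that exchanges only a small fraction of the data per step), the sketch below uses plain top-k gradient sparsification as a stand-in; this is not DisTrO's method, and `sparse_sync` / `k_fraction` are hypothetical names.

```python
import torch
import torch.distributed as dist

def sparse_sync(param, k_fraction=1e-3):
    """Stand-in low-communication sync: exchange only the k largest-magnitude
    gradient entries per tensor instead of the full gradient."""
    g = param.grad.flatten()
    k = max(1, int(g.numel() * k_fraction))
    _, indices = torch.topk(g.abs(), k)        # pick the largest entries by magnitude
    values = g[indices]                        # keep their signs

    world_size = dist.get_world_size()
    gathered_vals = [torch.empty_like(values) for _ in range(world_size)]
    gathered_idx = [torch.empty_like(indices) for _ in range(world_size)]
    dist.all_gather(gathered_vals, values)     # ~k values per rank ...
    dist.all_gather(gathered_idx, indices)     # ... instead of g.numel() values

    merged = torch.zeros_like(g)
    for vals, idx in zip(gathered_vals, gathered_idx):
        merged.index_add_(0, idx, vals)        # sum the sparse contributions
    param.grad = (merged / world_size).view_as(param.grad)
```

Real sparsification schemes also track the residual of the dropped entries; that bookkeeping is omitted here to keep the communication pattern visible.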
Implications
- Cheaper GPU clusters
- Changes the relative trade-off between GPU VRAM and interconnect: with far less I/O per step, spend can shift away from expensive high-bandwidth networking (I/O-heavy) toward compute and memory (interconnect is much more expensive than VRAM)
- Decouples required interconnect bandwidth from model size, so models can scale without a matching increase in network bandwidth (see the back-of-envelope sketch after this list)
- Enables distributed, decentralized training networks built over heterogeneous networking hardware
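A back-of-envelope illustration of the decoupling claim; the parameter count and gradient precision below are hypothetical, not measurements from the report.

```python
# Per-step all-reduce traffic scales with parameter count, so a 4-5
# order-of-magnitude reduction turns gigabytes per step into roughly
# tens to hundreds of kilobytes -- a volume commodity internet links can carry.
params = 1.2e9                         # hypothetical model size (parameters)
bytes_per_grad = 2                     # fp16/bf16 gradients
full_sync_bytes = params * bytes_per_grad
print(f"full gradient sync per step: ~{full_sync_bytes / 1e9:.1f} GB")
for oom in (4, 5):
    reduced = full_sync_bytes / 10**oom
    print(f"after {oom} orders of magnitude: ~{reduced / 1e6:.2f} MB per step")
```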