A Preliminary Report on DisTrO
Bowen Peng, Jeffrey Quesnelle, Dillon Rolnick, Ari Lotter, Umer H. Adil, Esteban La Rocca – 2024
Status quo
- Distributed data parallelism (DDP) and fully sharded data parallelism (FSDP) are used to split model training across GPUs
- Typically only a single model is trained; each GPU holds a full copy (DDP) or a shard (FSDP) of the weights, and gradients are synchronized across all GPUs after every training step
- Synchronization means transferring data on the order of the full model size between GPUs at every step, a highly I/O-intensive operation that requires high-bandwidth interconnects and GPUs placed physically close to one another (sketched below)
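For context, here is a minimal sketch of the per-step synchronization that DDP-style training performs, written in plain PyTorch. The `train_step` function and the batch keys are illustrative, and it assumes `torch.distributed` has already been initialized (e.g. via `torchrun`). The point is that the full gradient set crosses the interconnect on every step, so traffic scales with parameter count.

```python
import torch
import torch.distributed as dist

def train_step(model, batch, loss_fn, optimizer):
    """One data-parallel step: local backward pass, then all-reduce the FULL gradient set."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()

    world_size = dist.get_world_size()
    for p in model.parameters():              # one all-reduce per parameter tensor
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)           # average gradients across ranks

    optimizer.step()
    return loss.item()
```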
Contribution
- Demonstrates a four-to-five order-of-magnitude reduction in inter-GPU communication, enabling LLM training over low-bandwidth, internet-grade links
- Introduces a new drop-in optimizer that matches AdamW with all-reduce in convergence (see the illustrative sketch after this list)
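The report deliberately does not describe the mechanism behind this reduction. Purely to illustrate the shape of the claim (a drop-in replacement for the full-gradient all-reduce that exchanges only a small fraction of the data per step), the sketch below uses plain top-k gradient sparsification as a stand-in; this is not DisTrO's method, and `sparse_sync` / `k_fraction` are hypothetical names.

```python
import torch
import torch.distributed as dist

def sparse_sync(param, k_fraction=1e-3):
    """Stand-in low-communication sync: exchange only the k largest-magnitude
    gradient entries per tensor instead of the full gradient."""
    g = param.grad.flatten()
    k = max(1, int(g.numel() * k_fraction))
    _, indices = torch.topk(g.abs(), k)        # pick the largest entries by magnitude
    values = g[indices]                        # keep their signs

    world_size = dist.get_world_size()
    gathered_vals = [torch.empty_like(values) for _ in range(world_size)]
    gathered_idx = [torch.empty_like(indices) for _ in range(world_size)]
    dist.all_gather(gathered_vals, values)     # ~k values per rank ...
    dist.all_gather(gathered_idx, indices)     # ... instead of g.numel() values

    merged = torch.zeros_like(g)
    for vals, idx in zip(gathered_vals, gathered_idx):
        merged.index_add_(0, idx, vals)        # sum the sparse contributions
    param.grad = (merged / world_size).view_as(param.grad)
```

Real sparsification schemes also track the residual of the dropped entries; that bookkeeping is omitted here to keep the communication pattern visible.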
Implications
- Cheaper GPU clusters
- Changes the relative trade-off between GPU VRAM and interconnect: with far less I/O per step, spend can shift away from expensive high-bandwidth networking (I/O-heavy) toward compute and memory (interconnect is much more expensive than VRAM)
- Decouples required interconnect bandwidth from model size, so models can scale without a matching increase in network bandwidth (see the back-of-envelope sketch after this list)
- Enables distributed, decentralized training networks built over heterogeneous networking hardware
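A back-of-envelope illustration of the decoupling claim; the parameter count and gradient precision below are hypothetical, not measurements from the report.

```python
# Per-step all-reduce traffic scales with parameter count, so a 4-5
# order-of-magnitude reduction turns gigabytes per step into roughly
# tens to hundreds of kilobytes -- a volume commodity internet links can carry.
params = 1.2e9                         # hypothetical model size (parameters)
bytes_per_grad = 2                     # fp16/bf16 gradients
full_sync_bytes = params * bytes_per_grad
print(f"full gradient sync per step: ~{full_sync_bytes / 1e9:.1f} GB")
for oom in (4, 5):
    reduced = full_sync_bytes / 10**oom
    print(f"after {oom} orders of magnitude: ~{reduced / 1e6:.2f} MB per step")
```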