A Preliminary Report on DisTrO

Bowen Peng, Jeffrey Quesnelle, Dillon Rolnick, Ari Lotter, Umer H. Adil, Esteban La Rocca – 2024

Status quo

  • Distributed data parallelism (DDP) and fully sharded data parallelism (FSDP) are used to split model training across GPUs
  • Typically a single model is trained with its weights replicated (or sharded) across GPUs, and the gradients, and hence the weights, must be synchronized after every training step
  • Synchronization means sending the entire set of gradients between GPUs every step, an I/O-intensive operation that requires high-bandwidth interconnects and GPUs placed physically close to one another (a rough per-step traffic estimate is sketched below)
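
As a rough illustration of the I/O cost, the sketch below estimates per-step all-reduce traffic for a data-parallel setup. The parameter count, gradient dtype, GPU count, and the ring all-reduce cost model are illustrative assumptions, not figures from the report.

    # Back-of-the-envelope estimate of per-step gradient sync traffic under DDP.
    # Assumptions (illustrative, not from the report): 1.2B parameters, 2-byte
    # (bf16) gradients, 8 GPUs, and a ring all-reduce in which each GPU transfers
    # roughly 2 * (N - 1) / N of the total gradient bytes per step.

    def ddp_allreduce_bytes_per_gpu(num_params: int,
                                    bytes_per_grad: int = 2,
                                    num_gpus: int = 8) -> float:
        """Approximate bytes each GPU sends per optimizer step with ring all-reduce."""
        gradient_bytes = num_params * bytes_per_grad
        return 2 * (num_gpus - 1) / num_gpus * gradient_bytes

    params = 1_200_000_000  # illustrative 1.2B-parameter LLM
    per_step = ddp_allreduce_bytes_per_gpu(params)
    print(f"~{per_step / 1e9:.1f} GB transferred per GPU per training step")  # ~4.2 GB

Repeated over the hundreds of thousands of steps in a pre-training run, this volume is why DDP/FSDP clusters rely on NVLink- or InfiniBand-class interconnects.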

Contribution

  • Demonstrate a four-to-five order-of-magnitude reduction in inter-GPU communication, enabling training of LLMs in low-bandwidth settings (see the loop sketch after this list)
  • New optimizer that matches AdamW in convergence performance
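
The preliminary report does not disclose DisTrO's actual mechanism, so the sketch below is a hypothetical stand-in: it uses top-k sparsification (the invented helpers compress_update / decompress_update) purely to show where a small exchanged payload would replace a full gradient all-reduce in the training loop.

    # Conceptual sketch only; the compressor here is NOT the DisTrO algorithm.
    import numpy as np

    def compress_update(grad, k):
        """Hypothetical compressor: keep only the k largest-magnitude entries."""
        idx = np.argpartition(np.abs(grad), -k)[-k:]
        return idx, grad[idx]

    def decompress_update(idx, vals, size):
        """Expand a sparse payload back to a dense update vector."""
        out = np.zeros(size)
        out[idx] = vals
        return out

    # Simulate one step on two "workers": each exchanges a compressed payload
    # instead of its full gradient, then applies the same aggregated update.
    rng = np.random.default_rng(0)
    params = np.zeros(10_000)
    lr, k = 0.01, 100  # communicate roughly 1% of the gradient entries

    local_grads = [rng.normal(size=params.size) for _ in range(2)]
    payloads = [compress_update(g, k) for g in local_grads]   # small messages on the wire
    shared = sum(decompress_update(i, v, params.size) for i, v in payloads) / len(payloads)
    params -= lr * shared  # every worker applies the identical update, keeping replicas in sync

Because all workers apply the same aggregated update, the replicas stay synchronized while the traffic per step is only a small fraction of the full gradient; the report's claim is that a scheme of this general shape can match AdamW's convergence.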

Implications

  • Cheaper GPU clusters
  • Shifts the trade-off between high-VRAM GPUs and fast interconnect: interconnect is far more expensive than VRAM, so cutting communication lets spending go toward compute- and memory-heavy capacity rather than I/O
  • Decouples interconnect requirements from model size, so models can scale without a matching increase in network bandwidth (see the arithmetic sketch after this list)
  • Enables distributed and decentralized training networks over heterogeneous, internet-grade connections
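
To make the decoupling concrete, the arithmetic below compares how long one full-gradient sync would take over a few illustrative links, with and without a four-order-of-magnitude reduction in communication; the payload size and link speeds are assumed round numbers, not measurements from the report.

    # Illustrative sync-time arithmetic (assumed round numbers, not report data).
    LINK_BYTES_PER_SEC = {
        "NVLink-class interconnect (~900 GB/s)": 900e9,
        "Datacenter Ethernet (~100 Gbit/s)": 12.5e9,
        "Consumer broadband (~100 Mbit/s)": 12.5e6,
    }

    PAYLOAD_BYTES = 2.4e9  # ~2.4 GB: bf16 gradients of an illustrative 1.2B-parameter model
    REDUCTION = 10_000     # lower end of a four-to-five order-of-magnitude reduction

    for link, bandwidth in LINK_BYTES_PER_SEC.items():
        full_sync = PAYLOAD_BYTES / bandwidth
        reduced_sync = (PAYLOAD_BYTES / REDUCTION) / bandwidth
        print(f"{link}: full sync ~{full_sync:.3g} s, reduced sync ~{reduced_sync:.3g} s")

Over consumer broadband a full sync would take minutes per step, while the reduced payload takes tens of milliseconds, which is what makes training over slow, heterogeneous links plausible.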
