ZeRO-Infinity and DeepSpeed: Unlocking unprecedented model scale for deep learning training

ZeRO-Infinity (Zero Redundancy Optimizer)

  • A novel deep learning (DL) training technology for scaling model training from a single GPU to massive supercomputers with thousands of GPUs

  • Highlights

    • Trains models with over 30 trillion parameters on 512 NVIDIA V100 GPUs, 50x larger than the prior state of the art (SOTA)

    • Training efficiency: super-linear throughput scaling through novel data partitioning and mapping that exploit the aggregate CPU/NVMe (Non-Volatile Memory Express) memory bandwidth and CPU compute

    • Democratizes large model training by allowing data scientists with a single GPU to fine-tune models larger than OpenAI's GPT-3 (175 billion parameters)

    • Eliminates the barrier to entry for large model training by making it simpler and easier (no need to combine multiple parallelism techniques or change user code)

  • Steps

    • Partitioning each model layer across all data parallel processes

    • Placing the partitions on the corresponding data parallel NVMe devices

    • Coordinating the data movement needed to compute forward/backward propagation and weight updates on the data-parallel GPUs and CPUs, respectively
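As a toy illustration of the partitioning and reassembly in the steps above, here is a minimal single-process sketch of ZeRO-style parameter sharding. The helper names are hypothetical; real training uses torch.distributed collectives, with shards living on GPU, CPU, or NVMe:

```python
def partition(flat_params, world_size):
    # Pad so the flattened parameter list divides evenly, then give one shard per rank.
    pad = (-len(flat_params)) % world_size
    padded = flat_params + [0.0] * pad
    shard = len(padded) // world_size
    return [padded[i * shard:(i + 1) * shard] for i in range(world_size)]

def all_gather(shards, n_params):
    # Just-in-time reassembly of a full layer from every rank's shard, dropping
    # the padding (stands in for a torch.distributed all-gather before compute).
    full = [p for s in shards for p in s]
    return full[:n_params]

layer = [float(i) for i in range(10)]   # toy "layer" with 10 parameters
shards = partition(layer, 4)            # 4 data-parallel ranks hold 3 params each
assert all_gather(shards, len(layer)) == layer
```

Each rank permanently stores only its own shard; a layer is gathered right before its forward or backward pass and released afterward, which is what keeps per-GPU memory flat as model size grows.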

Addressing the needs of large model training now

  • SOTA large model training technology: 3D parallelism

    • Combines model parallelism and pipeline parallelism with data parallelism

      • Used in DeepSpeed and NVIDIA Megatron-LM

      • But requires 320 GPUs (~80 GB memory each) to fit a trillion-parameter model for training

      • Requires significant code refactoring (large barrier to entry)
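The GPU count can be sanity-checked with the ZeRO paper's per-parameter memory accounting. This is a back-of-envelope sketch; activations and working buffers, not counted here, push the requirement up toward the 320-GPU figure:

```python
# Mixed-precision Adam keeps, per parameter (ZeRO paper accounting, an assumption here):
#   fp16 weights (2 B) + fp16 grads (2 B) + fp32 master weights, momentum, variance (4+4+4 B)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4            # 16 bytes of model state per parameter
params = 10**12                                 # a trillion-parameter model
model_state_tb = params * BYTES_PER_PARAM / 10**12
gpus_for_state = params * BYTES_PER_PARAM / (80 * 10**9)
print(model_state_tb, gpus_for_state)           # 16.0 TB of model states; 200 80GB GPUs minimum
```

Model states alone occupy 16 TB, so hundreds of 80 GB GPUs are needed just to hold them, before any activation memory is considered.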

  • Questions arise

    • Support the next 1000x growth in model size?

    • Make large models of today accessible to more data scientists?

    • Make large model training easier by eliminating the need for model refactoring?

  • ZeRO-Infinity

    • New innovations: data mapping and high-performance heterogeneous memory access

      • Allows ZeRO-Infinity to support massive model sizes on limited GPU resources by exploiting CPU and NVMe memory simultaneously, unencumbered by their limited bandwidth

    • Trains models without the need to combine multiple forms of parallelism, using a memory-centric computation-tiling approach to handle very large individual operators

    • Makes large model training easy by identifying and automating all the communication required to train any arbitrary model architecture, eliminating the need for model refactoring

    • Uses a compute-and-communication overlap engine to push training efficiency to the limit by hiding as much communication latency as possible
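In DeepSpeed, these capabilities are enabled through the JSON config rather than model code changes. A minimal sketch (the NVMe path is a placeholder, and tuning knobs such as buffer sizes are omitted; see the DeepSpeed configuration docs):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param":     { "device": "nvme", "nvme_path": "/local_nvme" },
    "offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme" }
  }
}
```

Passing a config like this to deepspeed.initialize() activates ZeRO stage 3 partitioning with parameter and optimizer-state offload to NVMe; dropping the offload sections keeps model states in GPU memory.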

  • Conclusions

    • Unprecedented model scale

    • Accessible

    • Easy to use

    • Excellent training efficiency
