ZeRO-Infinity and DeepSpeed: Unlocking unprecedented model scale for deep learning training

ZeRO-Infinity (Zero Redundancy Optimizer)

  • A novel deep learning (DL) training technology for scaling model training from a single GPU to massive supercomputers with thousands of GPUs

  • Highlights

    • Trains models with over 30 trillion parameters on 512 NVIDIA V100 GPUs, 50x larger than the prior state of the art (SOTA)

    • Training efficiency: super-linear throughput scaling through novel data partitioning and mapping that exploit the aggregate CPU/NVMe (Non-Volatile Memory Express) memory bandwidth and CPU compute

    • Democratizes large model training by allowing data scientists with a single GPU to fine-tune models larger than OpenAI's GPT-3 (175 billion parameters)

    • Eliminates the barrier to entry for large model training by making it simpler and easier (no need to combine multiple parallelism techniques or change user code)

  • Steps

    • Partitioning each model layer across all data parallel processes

    • Placing the partitions on the corresponding data parallel NVMe devices

    • Coordinating the data movement needed to compute forward/backward propagation and weight updates on the data-parallel GPUs and CPUs, respectively
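As a toy illustration of the partitioning and reassembly in the steps above, here is a minimal single-process sketch of ZeRO-style parameter sharding. The helper names are hypothetical; real training uses torch.distributed collectives, with shards living on GPU, CPU, or NVMe:

```python
def partition(flat_params, world_size):
    # Pad so the flattened parameter list divides evenly, then give one shard per rank.
    pad = (-len(flat_params)) % world_size
    padded = flat_params + [0.0] * pad
    shard = len(padded) // world_size
    return [padded[i * shard:(i + 1) * shard] for i in range(world_size)]

def all_gather(shards, n_params):
    # Just-in-time reassembly of a full layer from every rank's shard, dropping
    # the padding (stands in for a torch.distributed all-gather before compute).
    full = [p for s in shards for p in s]
    return full[:n_params]

layer = [float(i) for i in range(10)]   # toy "layer" with 10 parameters
shards = partition(layer, 4)            # 4 data-parallel ranks hold 3 params each
assert all_gather(shards, len(layer)) == layer
```

Each rank permanently stores only its own shard; a layer is gathered right before its forward or backward pass and released afterward, which is what keeps per-GPU memory flat as model size grows.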

Addressing the needs of large model training now

  • SOTA large model training technology: 3D parallelism

    • Combines model parallelism and pipeline parallelism with data parallelism

      • Used in DeepSpeed and NVIDIA Megatron-LM

      • But requires 320 GPUs (~80 GB memory each) to fit a trillion-parameter model for training

      • Requires significant code refactoring (large barrier to entry)
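The GPU count can be sanity-checked with the ZeRO paper's per-parameter memory accounting. This is a back-of-envelope sketch; activations and working buffers, not counted here, push the requirement up toward the 320-GPU figure:

```python
# Mixed-precision Adam keeps, per parameter (ZeRO paper accounting, an assumption here):
#   fp16 weights (2 B) + fp16 grads (2 B) + fp32 master weights, momentum, variance (4+4+4 B)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4            # 16 bytes of model state per parameter
params = 10**12                                 # a trillion-parameter model
model_state_tb = params * BYTES_PER_PARAM / 10**12
gpus_for_state = params * BYTES_PER_PARAM / (80 * 10**9)
print(model_state_tb, gpus_for_state)           # 16.0 TB of model states; 200 80GB GPUs minimum
```

Model states alone occupy 16 TB, so hundreds of 80 GB GPUs are needed just to hold them, before any activation memory is considered.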

  • Questions arise

    • Support the next 1000x growth in model size?

    • Make large models of today accessible to more data scientists?

    • Make large model training easier by eliminating the need for model refactoring?

  • ZeRO-Infinity

    • New innovations: data mapping and high-performance heterogeneous memory access

      • Allows ZeRO-Infinity to support massive model sizes on limited GPU resources by exploiting CPU and NVMe memory simultaneously, unencumbered by their limited bandwidth

    • Trains models without the need to combine multiple forms of parallelism, using a memory-centric computation-tiling approach to handle very large individual operators

    • Makes large model training easy by identifying and automating all the communication required to train any arbitrary model architecture, eliminating the need for model refactoring

    • Uses a compute-and-communication overlap engine to push training efficiency to the limit by hiding as much communication latency as possible
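In DeepSpeed, these capabilities are enabled through the JSON config rather than model code changes. A minimal sketch (the NVMe path is a placeholder, and tuning knobs such as buffer sizes are omitted; see the DeepSpeed configuration docs):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param":     { "device": "nvme", "nvme_path": "/local_nvme" },
    "offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme" }
  }
}
```

Passing a config like this to deepspeed.initialize() activates ZeRO stage 3 partitioning with parameter and optimizer-state offload to NVMe; dropping the offload sections keeps model states in GPU memory.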

  • Conclusions

    • Unprecedented model scale

    • Accessible

    • Easy to use

    • Excellent training efficiency
