ZeRO-Infinity and DeepSpeed: Unlocking unprecedented model scale for deep learning training

ZeRO-Infinity (Zero Redundancy Optimizer)

  • A novel deep learning (DL) training technology for scaling model training from a single GPU to massive supercomputers with thousands of GPUs

  • Highlights

    • Trains models with over 30 trillion parameters on 512 NVIDIA V100 GPUs, 50x larger than the prior state of the art (SOTA)

    • Training efficiency: super-linear throughput scaling through novel data partitioning and mapping that exploit the aggregate CPU/NVMe (Non-Volatile Memory Express) memory bandwidth and CPU compute

    • Democratizes large model training by allowing data scientists with a single GPU to fine-tune models larger than OpenAI's GPT-3 (175 billion parameters)

    • Eliminates the barrier to entry for large model training by making it simpler and easier (no need to combine multiple parallelism techniques or change user code)

  • Steps

    • Partitioning each model layer across all data parallel processes

    • Placing the partitions on the corresponding data parallel NVMe devices

    • Coordinating the data movement needed to compute forward/backward propagation and weight updates on the data-parallel GPUs and CPUs, respectively
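As a toy illustration of the partitioning and reassembly in the steps above, here is a minimal single-process sketch of ZeRO-style parameter sharding. The helper names are hypothetical; real training uses torch.distributed collectives, with shards living on GPU, CPU, or NVMe:

```python
def partition(flat_params, world_size):
    # Pad so the flattened parameter list divides evenly, then give one shard per rank.
    pad = (-len(flat_params)) % world_size
    padded = flat_params + [0.0] * pad
    shard = len(padded) // world_size
    return [padded[i * shard:(i + 1) * shard] for i in range(world_size)]

def all_gather(shards, n_params):
    # Just-in-time reassembly of a full layer from every rank's shard, dropping
    # the padding (stands in for a torch.distributed all-gather before compute).
    full = [p for s in shards for p in s]
    return full[:n_params]

layer = [float(i) for i in range(10)]   # toy "layer" with 10 parameters
shards = partition(layer, 4)            # 4 data-parallel ranks hold 3 params each
assert all_gather(shards, len(layer)) == layer
```

Each rank permanently stores only its own shard; a layer is gathered right before its forward or backward pass and released afterward, which is what keeps per-GPU memory flat as model size grows.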

Addressing the needs of large model training now

  • SOTA large model training technology: 3D parallelism

    • Combines model parallelism and pipeline parallelism with data parallelism

      • Used in DeepSpeed and NVIDIA Megatron-LM

      • But requires 320 GPUs (~80 GB memory each) to fit a trillion-parameter model for training

      • Requires significant code refactoring (large barrier to entry)
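The GPU count can be sanity-checked with the ZeRO paper's per-parameter memory accounting. This is a back-of-envelope sketch; activations and working buffers, not counted here, push the requirement up toward the 320-GPU figure:

```python
# Mixed-precision Adam keeps, per parameter (ZeRO paper accounting, an assumption here):
#   fp16 weights (2 B) + fp16 grads (2 B) + fp32 master weights, momentum, variance (4+4+4 B)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4            # 16 bytes of model state per parameter
params = 10**12                                 # a trillion-parameter model
model_state_tb = params * BYTES_PER_PARAM / 10**12
gpus_for_state = params * BYTES_PER_PARAM / (80 * 10**9)
print(model_state_tb, gpus_for_state)           # 16.0 TB of model states; 200 80GB GPUs minimum
```

Model states alone occupy 16 TB, so hundreds of 80 GB GPUs are needed just to hold them, before any activation memory is considered.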

  • Questions arise

    • Support the next 1000x growth in model size?

    • Make large models of today accessible to more data scientists?

    • Make large model training easier by eliminating the need for model refactoring?

  • ZeRO-Infinity

    • New innovations: data mapping and high-performance heterogeneous memory access

      • Allows ZeRO-Infinity to support massive model sizes on limited GPU resources by exploiting CPU and NVMe memory simultaneously, unencumbered by their limited bandwidth

    • Trains models without the need to combine multiple forms of parallelism, using a memory-centric computation-tiling approach to handle very large individual operators

    • Makes large model training easy by identifying and automating all the communication required to train any arbitrary model architecture, eliminating the need for model refactoring

    • Uses a compute-and-communication overlap engine to push training efficiency to the limit by hiding as much communication latency as possible
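In DeepSpeed, these capabilities are enabled through the JSON config rather than model code changes. A minimal sketch (the NVMe path is a placeholder, and tuning knobs such as buffer sizes are omitted; see the DeepSpeed configuration docs):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param":     { "device": "nvme", "nvme_path": "/local_nvme" },
    "offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme" }
  }
}
```

Passing a config like this to deepspeed.initialize() activates ZeRO stage 3 partitioning with parameter and optimizer-state offload to NVMe; dropping the offload sections keeps model states in GPU memory.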

  • Conclusions

    • Unprecedented model scale

    • Accessible

    • Easy to use

    • Excellent training efficiency
