> For the complete documentation index, see [llms.txt](https://sliu583.gitbook.io/blog/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://sliu583.gitbook.io/blog/specific-work/shivarams-group/group-papers/lists/zero-infinity-and-deepspeed-unlocking-unprecedented-model-scale-for-deep-learning-training.md).

# ZeRO-Infinity and DeepSpeed: Unlocking unprecedented model scale for deep learning training

### ZeRO-Infinity (Zero Redundancy Optimizer)&#x20;

* A novel deep learning (DL) training technology for scaling model training, from a single GPU to massive supercomputers with thousands of GPUs&#x20;
* Highlights&#x20;
  * Train model over 30 trillion parameters on 512 V100 GPUs, 50x larger than SOTA&#x20;
  * Training efficiency: super-linear throughput scaling through novel data partitioning and mapping that can exploit the aggregate CPU/NVMe (Non-Volatile Memory Express) memory bandwidths and CPU compute&#x20;
  * Democratize large model training by allowing data scientists with a single GPU to fine-tune models larger than Open AI GPT-3 (175 billion parameters)
  * Eliminating the barrier to entry for large model training by making it simpler and easier (w/o complexity of combining parallelism techniques or user code changes)&#x20;
* Steps&#x20;
  * Partitioning each model layer across all data parallel processes&#x20;
  * Placing the partitions on the corresponding data parallel NVMe devices&#x20;
  * Coordinating the data movement needed to compute forward / backward propagation and weight updates on the data parallel CPUs and GPUs, respectively&#x20;

![](/files/-MZZoZ0dTWA3NTbWV34W)

### Addressing the needs of large model training now&#x20;

* SOTA large model training technology: 3D parallelism&#x20;
  * Combines model parallelism, pipeline parallelism, with data parallelism&#x20;
    * Used in DeepSpeed and [NVIDIA Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
    * But requires 320 GPU (\~80GB memory) to fit a trillion-parameter model for training&#x20;
    * Requires significant code refactoring (large barrier to entry)&#x20;
* Questions arise&#x20;
  * Support the next 1000x growth in model size?
  * Make large models of today accessible to more data scientists?&#x20;
  * Make large model training easier by eliminating the need for model refactoring?&#x20;
* ZeRO-Infinity&#x20;
  * New innovations: data mapping and high-performance heterogeneous memory access&#x20;
    * Allows ZeRO-Infinity to support massive model sizes on limited GPU resources by exploiting CPU and NVMe memory simultaneously, unencumbered by their limited bandwidth&#x20;
  * Train models w/o the need to recombine forms of parallelism using a memory-centric computation-tiling approach
  * Makes large model training easy by identifying and automating all the communication required for training any arbitrary model architecture, eliminate the need for model refactoring&#x20;
  * Compute-and-communication-overlap engine to push training efficiency to the limits by hiding as much communication latency as possible&#x20;
* Concludes&#x20;
  * Unprecedented model scale&#x20;
  * Accessible&#x20;
  * Easy to use&#x20;
  * Excellent training efficiency&#x20;
