Scaling Distributed Machine Learning with In-Network Aggregation

https://www.usenix.org/conference/nsdi21/presentation/sapio

Abstract

  • SwitchML: reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network

Motivation

  • Network-bound workload

    • Advances in GPUs

    • As computation gets faster, the ratio of communication to computation in the workload shifts toward communication

Challenge

  • Switch:

    • Limited computation

    • Limited storage

    • No floating-point arithmetic

    • Packet loss

Design

  • Combined switch-host architecture

  • Pool-based streaming aggregation (see the first sketch after this list)

  • Quantized integer operations (see the second sketch after this list)

  • Failure-recovery protocol

  • In-switch RDMA implementation
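The idea behind pool-based streaming aggregation is that the switch cannot hold a whole model update, so it keeps only a small, reusable pool of aggregation slots; workers stream their update in fixed-size chunks, and each slot sums one chunk from every worker before being recycled. A minimal sketch, not the paper's data-plane implementation; the names `Slot`, `SwitchPool`, `NUM_WORKERS`, `POOL_SIZE`, and `CHUNK_SIZE` are hypothetical:

```python
# Minimal host-side simulation of pool-based streaming aggregation.
# A small pool of slots accumulates fixed-size integer chunks; once a slot
# has heard from every worker it "broadcasts" the sum and is reused for the
# next chunk offset in the stream.

NUM_WORKERS = 4
POOL_SIZE = 8        # concurrent aggregation slots available on the switch
CHUNK_SIZE = 64      # integers aggregated per packet

class Slot:
    def __init__(self):
        self.acc = [0] * CHUNK_SIZE   # running integer sum
        self.seen = 0                 # workers that contributed so far

    def add(self, chunk):
        for i, v in enumerate(chunk):
            self.acc[i] += v
        self.seen += 1
        if self.seen == NUM_WORKERS:
            # Slot complete: hand back the aggregate and recycle the slot.
            result, self.acc, self.seen = self.acc, [0] * CHUNK_SIZE, 0
            return result
        return None

class SwitchPool:
    def __init__(self):
        self.slots = [Slot() for _ in range(POOL_SIZE)]

    def receive(self, slot_id, chunk):
        # Packets carry a slot id; chunk offsets map onto slots round-robin,
        # so only POOL_SIZE chunks are ever in flight at once.
        return self.slots[slot_id % POOL_SIZE].add(chunk)

# Example: one chunk from each of 4 workers through slot 0.
pool = SwitchPool()
for w in range(NUM_WORKERS):
    out = pool.receive(slot_id=0, chunk=[w + 1] * CHUNK_SIZE)
print(out[:4])   # [10, 10, 10, 10] once the last worker's chunk arrives
```

The paper's failure-recovery protocol additionally makes these slot updates idempotent, so a retransmitted packet after loss is not double-counted; that bookkeeping is omitted from this sketch.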
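Because the switch pipeline has no floating-point arithmetic, gradients are converted to fixed-point integers before aggregation and converted back at the workers. A minimal sketch assuming a single, fixed scaling factor; `SCALE`, `to_fixed_point`, and `from_fixed_point` are hypothetical names, and the real system handles scaling more carefully than shown here:

```python
import numpy as np

SCALE = 2 ** 20   # hypothetical fixed-point scaling factor

def to_fixed_point(grad: np.ndarray) -> np.ndarray:
    """Worker side: scale float32 gradients and round to integers before sending."""
    return np.round(grad * SCALE).astype(np.int32)

def from_fixed_point(agg: np.ndarray, num_workers: int) -> np.ndarray:
    """Worker side: convert the integer aggregate back to floats and average."""
    return agg.astype(np.float32) / (SCALE * num_workers)

# The "switch" only ever performs integer addition:
grads = [np.random.randn(1024).astype(np.float32) for _ in range(4)]
ints = [to_fixed_point(g) for g in grads]
agg = np.sum(np.stack(ints), axis=0, dtype=np.int64)   # integer sum in the network
avg = from_fixed_point(agg, num_workers=4)

# The round trip matches the float average up to rounding error.
assert np.allclose(avg, np.mean(grads, axis=0), atol=1e-5)
```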
