Scaling Distributed Machine Learning with In-Network Aggregation
https://www.usenix.org/conference/nsdi21/presentation/sapio
Abstract
SwitchML: reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network
Motivation
Network-bound workload
Advances in GPU compute
The ratio of communication to computation in the workload has shifted
Challenge
Switch:
Limited computation
Limited storage
No floating-point support
Packet loss
Design
Combined switch-host architecture
Pool-based streaming aggregation
Quantized integer operations
Failure-recovery protocol
In-switch RDMA implementation
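The pool-based streaming aggregation and quantized integer operations above can be sketched together. This is a minimal Python model under simplifying assumptions (no packet loss, one in-flight value per slot, a single tensor element per packet); the fixed-point scale and all names are illustrative, not the paper's implementation:

```python
# Illustrative fixed-point scale for quantizing float gradients to integers,
# since the switch can only add integers (assumption: 16 fractional bits).
SCALE = 1 << 16

def quantize(x: float) -> int:
    return int(round(x * SCALE))

def dequantize(q: int) -> float:
    return q / SCALE

class SwitchSlot:
    """One aggregation slot in the switch's fixed-size pool.

    Accumulates integer contributions; once all workers have
    contributed, it returns the aggregate and resets for reuse.
    """
    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        self.acc = 0
        self.seen = 0

    def add(self, q: int):
        self.acc += q
        self.seen += 1
        if self.seen == self.num_workers:
            result = self.acc
            self.acc = 0   # slot is recycled for the next chunk
            self.seen = 0
            return result
        return None

def allreduce(grads_per_worker, pool_size=2):
    """Stream each worker's gradient vector through a small slot pool,
    aggregating element-wise on the 'switch' and draining results."""
    num_workers = len(grads_per_worker)
    length = len(grads_per_worker[0])
    pool = [SwitchSlot(num_workers) for _ in range(pool_size)]
    out = [0.0] * length
    for i in range(length):
        slot = pool[i % pool_size]  # slots are reused as results drain
        for w in range(num_workers):
            res = slot.add(quantize(grads_per_worker[w][i]))
            if res is not None:
                out[i] = dequantize(res)
    return out
```

Because the pool is much smaller than the model, workers only keep `pool_size` chunks in flight at a time; the real system layers the failure-recovery protocol on top so a lost packet does not corrupt a reused slot.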