Scaling Distributed Machine Learning with In-Network Aggregation
https://www.usenix.org/conference/nsdi21/presentation/sapio
Abstract
SwitchML: reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network
Motivation
Network-bound workload
Advances in GPU compute
The ratio of communication to computation in the workload has shifted
Challenge
Switch:
Limited computation
Limited storage
No floating-point support
Packet loss
Design
Combined switch-host architecture
Pool-based streaming aggregation
Quantized integer operations
Failure-recovery protocol
In-switch RDMA implementation
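The pool-based streaming aggregation and quantized integer operations above can be sketched together. This is a minimal Python model under simplifying assumptions (no packet loss, one in-flight value per slot, a single tensor element per packet); the fixed-point scale and all names are illustrative, not the paper's implementation:

```python
# Illustrative fixed-point scale for quantizing float gradients to integers,
# since the switch can only add integers (assumption: 16 fractional bits).
SCALE = 1 << 16

def quantize(x: float) -> int:
    return int(round(x * SCALE))

def dequantize(q: int) -> float:
    return q / SCALE

class SwitchSlot:
    """One aggregation slot in the switch's fixed-size pool.

    Accumulates integer contributions; once all workers have
    contributed, it returns the aggregate and resets for reuse.
    """
    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        self.acc = 0
        self.seen = 0

    def add(self, q: int):
        self.acc += q
        self.seen += 1
        if self.seen == self.num_workers:
            result = self.acc
            self.acc = 0   # slot is recycled for the next chunk
            self.seen = 0
            return result
        return None

def allreduce(grads_per_worker, pool_size=2):
    """Stream each worker's gradient vector through a small slot pool,
    aggregating element-wise on the 'switch' and draining results."""
    num_workers = len(grads_per_worker)
    length = len(grads_per_worker[0])
    pool = [SwitchSlot(num_workers) for _ in range(pool_size)]
    out = [0.0] * length
    for i in range(length):
        slot = pool[i % pool_size]  # slots are reused as results drain
        for w in range(num_workers):
            res = slot.add(quantize(grads_per_worker[w][i]))
            if res is not None:
                out[i] = dequantize(res)
    return out
```

Because the pool is much smaller than the model, workers only keep `pool_size` chunks in flight at a time; the real system layers the failure-recovery protocol on top so a lost packet does not corrupt a reused slot.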