# Scaling Distributed Machine Learning with In-Network Aggregation

### Abstract

* SwitchML: reduces the volume of data exchanged during training by aggregating model updates from multiple workers inside the network, on a programmable switch

### Motivation

* Network-bound workload
  * Advances in GPU performance have outpaced network bandwidth
  * The ratio of communication to computation in the workload has shifted toward communication

### Challenge

* Programmable switch constraints:
  * Limited computation per packet
  * Limited on-chip storage
  * No floating-point arithmetic (see the integer quantization sketch after this list)
  * Packet loss must be tolerated
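
Because the switch can only add integers, workers must convert gradient values to fixed point before sending and back to floats after aggregation. Below is a minimal sketch of that conversion, assuming a simple shared scaling factor; the function names and the block-wise scheme are illustrative, not SwitchML's exact quantization.

```python
import numpy as np

def to_fixed_point(grad_block: np.ndarray, scale: float) -> np.ndarray:
    """Convert a block of float32 gradients to int32 so the switch can sum them."""
    return np.round(grad_block * scale).astype(np.int32)

def from_fixed_point(agg_block: np.ndarray, scale: float) -> np.ndarray:
    """Convert the aggregated int32 block back to float32 on the worker."""
    return agg_block.astype(np.float32) / scale

# Illustrative example: the switch adds the integer blocks from all workers;
# each worker then rescales the sum. The shared `scale` is assumed here,
# not taken from the paper.
scale = 2.0 ** 16
workers = [np.random.randn(8).astype(np.float32) for _ in range(4)]
switch_sum = sum(to_fixed_point(g, scale) for g in workers)   # integer-only add
aggregated = from_fixed_point(switch_sum, scale)              # ~ sum of float gradients
```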

### Design

* Combined switch-host architecture
* Pool-based streaming aggregation (see the sketch after this list)
* Quantized integer operations
* Failure-recovery protocol to tolerate packet loss
* In-switch RDMA implementation
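
A rough sketch of the pool-based streaming idea: the switch keeps a small pool of aggregation slots instead of state for the whole model, workers stream gradient chunks tagged with a slot index, and a slot is broadcast and reused once every worker's contribution has arrived. The class and constant names here are illustrative assumptions; SwitchML implements this in the switch data plane with a retransmission protocol layered on top.

```python
from dataclasses import dataclass, field

NUM_WORKERS = 4
POOL_SIZE = 8          # slots of switch memory, far smaller than the model
CHUNK_LEN = 64         # integer values aggregated per packet

@dataclass
class Slot:
    values: list[int] = field(default_factory=lambda: [0] * CHUNK_LEN)
    seen: int = 0      # how many workers have contributed to this slot

pool = [Slot() for _ in range(POOL_SIZE)]

def on_packet(slot_idx: int, chunk: list[int]) -> list[int] | None:
    """Aggregate one worker's chunk; return the sum once all workers have arrived."""
    slot = pool[slot_idx]
    slot.values = [a + b for a, b in zip(slot.values, chunk)]
    slot.seen += 1
    if slot.seen == NUM_WORKERS:
        result = slot.values
        pool[slot_idx] = Slot()   # free the slot for the next chunk in the stream
        return result             # broadcast the aggregate back to all workers
    return None                   # still waiting for other workers
```

Each worker only sends its next chunk for a given slot after receiving the broadcast for that slot, so the switch memory in use stays bounded by `POOL_SIZE * CHUNK_LEN` integers regardless of model size.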
