# SwitchML

### Problem

* Distributed training is increasingly network-bound: network performance has a substantial impact on training time.
* The paper proposes an in-network aggregation primitive to accelerate distributed ML workloads, implemented on programmable switch hardware. The result is SwitchML, a co-design of in-switch processing with an end-host transport layer and ML frameworks.

### Insight

* The main insights are:
  1. Aggregation involves a simple arithmetic operation and is amenable to parallelization and pipelined execution on programmable network devices: decompose the parameter updates into appropriately sized chunks that can be individually processed by the switch pipeline.
  2. Aggregation for SGD can be applied to different portions of the input data in any order without affecting correctness: tolerate packet loss using a lightweight switch scoreboard and a retransmission mechanism driven solely by the end hosts.
  3. ML training is robust to modest approximations in its computation: have the workers scale and convert floating-point values to fixed-point using an adaptive scaling factor, with negligible approximation loss.
* The main techniques SwitchML uses include:
  * Combined switch-host architecture: the switch handles integer aggregation, while the end hosts handle reliability and more complex computations.
  * Pool-based streaming aggregation: reuses a small pool of switch memory slots to stream the aggregation of large model updates through the switch.
  * Lightweight fault-tolerance protocols to recover from packet loss.
  * Quantized integer-based aggregation: converts floating-point values to 32-bit integers using a block floating-point-like approach.
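The interplay of pool-based aggregation, the scoreboard, and host-driven retransmission can be illustrated with a toy software model. This is only a sketch of the idea under simplifying assumptions (a Python object standing in for the P4 dataplane, a `seen` set standing in for the per-slot bitmap, and no pool "version" bit, which the real design uses to make reuse safe against stale retransmissions); the class and method names are invented for illustration.

```python
NUM_WORKERS = 2
POOL_SIZE = 4  # a tiny pool of aggregator slots, reused as the stream advances


class Slot:
    """One aggregator slot: an integer accumulator plus a scoreboard."""

    def __init__(self):
        self.seen = set()  # which workers have contributed this round
        self.acc = None    # running element-wise integer sum


class Switch:
    """Toy stand-in for the switch dataplane (not the actual P4 program)."""

    def __init__(self):
        self.pool = [Slot() for _ in range(POOL_SIZE)]

    def receive(self, worker_id, slot_idx, chunk):
        """Aggregate one chunk; return the full sum once every worker arrived.

        Duplicates (host-driven retransmissions after a suspected loss) are
        detected via the scoreboard and dropped, making aggregation idempotent.
        """
        slot = self.pool[slot_idx]
        if worker_id in slot.seen:
            return None  # duplicate packet: already counted, ignore
        slot.seen.add(worker_id)
        slot.acc = chunk if slot.acc is None else [a + b for a, b in zip(slot.acc, chunk)]
        if len(slot.seen) == NUM_WORKERS:
            result = slot.acc
            # Reset so the slot can serve the next chunk of the stream.
            slot.seen, slot.acc = set(), None
            return result
        return None


sw = Switch()
assert sw.receive(0, 0, [1, 2]) is None      # first contribution: wait
assert sw.receive(0, 0, [1, 2]) is None      # retransmission: ignored
assert sw.receive(1, 0, [3, 4]) == [4, 6]    # all workers in: sum released
assert sw.receive(0, 0, [10]) is None        # slot reused for the next chunk
assert sw.receive(1, 0, [20]) == [30]
```

The key point the sketch captures is that the switch keeps only O(pool size) state regardless of model size, while all loss detection and retransmission logic lives at the end hosts.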
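The quantized integer aggregation can also be sketched in a few lines. This is a simplified model, not the paper's implementation: `shared_scale` and `quantize` are hypothetical helper names, and the scale is computed directly from the block's maximum magnitude (in SwitchML the scaling exponent is negotiated among workers alongside the data). The scale is chosen with headroom so that summing all workers' contributions cannot overflow a signed 32-bit integer.

```python
import numpy as np


def shared_scale(max_abs: float, num_workers: int) -> float:
    """Pick a scale so the sum of num_workers scaled values fits in int32."""
    if max_abs == 0.0:
        return 1.0
    return (np.iinfo(np.int32).max / num_workers) / max_abs


def quantize(values: np.ndarray, scale: float) -> np.ndarray:
    """Float -> fixed-point int32, using the block's shared scale."""
    return np.round(values.astype(np.float64) * scale).astype(np.int32)


# Two workers' gradient chunks for the same block.
w1 = np.array([0.5, -1.25, 2.0], dtype=np.float32)
w2 = np.array([1.0, 0.75, -0.5], dtype=np.float32)

# All workers agree on one scale derived from the block's max magnitude.
scale = shared_scale(max(np.abs(w1).max(), np.abs(w2).max()), num_workers=2)

q_sum = quantize(w1, scale) + quantize(w2, scale)  # what the switch computes
restored = q_sum.astype(np.float64) / scale        # hosts convert back to float

assert np.allclose(restored, w1 + w2, atol=1e-4)
```

Because the switch only ever sees integers, its limited ALUs suffice, and the approximation error stays tiny as long as the values in a block share a similar magnitude.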

### Main Strengths

* Identifies the key challenges of implementing an aggregation primitive in a programmable dataplane switch, especially the limited computation and storage capacities.
* Breaks down the challenges and tackles them with a set of techniques, appropriately dividing functionality between end hosts and the switch for an efficient and reliable streaming aggregation protocol.
* The pseudocode of the algorithms is nice!

### Main Weaknesses

* Intel has now halted development of the Tofino networking chips on which SwitchML is implemented.
* In the packet-loss recovery experiments, some slots are unevenly affected by random losses, but SwitchML does not apply any form of work stealing to rebalance load among aggregators.

### Comments

* Training large models is typically more compute-bound than network-bound; is that why Google uses TPU pods for training, to avoid communication overhead?
