SwitchML
Problem
Distributed training is increasingly network-bound; network performance has a substantial impact on training time.
The paper proposes an in-network aggregation primitive to accelerate distributed ML workloads, implemented on programmable switch hardware. The authors propose SwitchML, a co-design of in-switch processing with an end-host transport layer and ML frameworks.
Insight
The main insights are:
1) aggregation involves a simple arithmetic operation and is amenable to parallelization and pipelined execution on programmable network devices: decompose the parameter updates into appropriately sized chunks that can be individually processed by the switch pipeline
2) aggregation for SGD can be applied separately to different portions of the input data, in any order, without affecting correctness: tolerate packet loss using a lightweight switch scoreboard and a retransmission mechanism driven solely by end hosts
3) ML training is robust to modest approximations in compute: have the workers scale and convert floating-point values to fixed-point using an adaptive scaling factor, with negligible approximation loss (see the sketch below)
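A minimal sketch of the quantization idea in NumPy, with hypothetical helper names (quantize_block, dequantize_block, choose_scale). In the paper the workers agree on a common, adaptively chosen scaling factor; the sketch below just bounds the worst-case sum so that aggregating int32 values across workers cannot overflow.

```python
import numpy as np

def choose_scale(max_abs_grad: float, num_workers: int) -> float:
    """Pick a scale so the sum over all workers stays within int32 range.

    SwitchML adapts the factor from gradient statistics; here we simply
    bound the worst case num_workers * max|g| with some headroom.
    """
    if max_abs_grad == 0.0:
        return 1.0
    return (2 ** 30) / (num_workers * max_abs_grad)

def quantize_block(grads: np.ndarray, scale: float) -> np.ndarray:
    """Convert a block of float32 gradients to int32 fixed-point values."""
    return np.round(grads * scale).astype(np.int32)

def dequantize_block(agg: np.ndarray, scale: float) -> np.ndarray:
    """Convert the aggregated int32 sum back to float32."""
    return (agg / scale).astype(np.float32)

# Example: 4 workers quantize a block, the switch sums the int32 values,
# and each worker converts the aggregated sum back to floats.
workers = [np.random.randn(8).astype(np.float32) for _ in range(4)]
scale = choose_scale(max(float(np.max(np.abs(g))) for g in workers), num_workers=4)
aggregated = sum(quantize_block(g, scale) for g in workers)  # what the switch computes
result = dequantize_block(aggregated, scale)                 # ~ elementwise float sum
```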
The main techniques SwitchML uses include:
Combined switch-host architecture: the switch handles integer aggregation, while end hosts handle reliability and more complex computations
Pool-based streaming aggregation: model updates are streamed through a small, reused pool of aggregator slots in switch memory, since the switch cannot hold an entire model update at once (see the sketch after this list)
Lightweight fault-tolerance protocol to recover from packet loss
Quantized integer-based aggregation: converts floating-point values to 32-bit integers using a block floating-point-like approach
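To make the pool-based streaming and the scoreboard concrete, here is a small host-side model in Python. The class and field names are mine, not the paper's, and the real design is a P4 switch pipeline that tracks pool "versions" rather than explicit chunk ids; the point is that each slot accumulates one fixed-size chunk, records which workers have contributed, and answers retransmissions for the previously completed chunk from a cached result, so loss recovery is driven entirely by end-host timeouts.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Slot:
    """One aggregator slot in the switch's pool (simplified model)."""
    chunk_id: int = 0                                 # chunk currently being aggregated
    values: list = field(default_factory=list)        # running int32 sums
    seen: set = field(default_factory=set)            # workers that contributed
    last_result: Optional[list] = None                # cached result of chunk_id - 1

class InNetworkAggregator:
    """Sketch of pool-based streaming aggregation with a per-slot scoreboard."""

    def __init__(self, pool_size: int, num_workers: int, chunk_len: int):
        self.num_workers = num_workers
        self.pool = [Slot(values=[0] * chunk_len) for _ in range(pool_size)]

    def on_packet(self, slot_id: int, chunk_id: int, worker_id: int,
                  chunk: list) -> Optional[list]:
        """Process one (slot_id, chunk_id, worker_id, chunk) packet.

        Returns the aggregated chunk when it is complete (modeling the
        switch's multicast back to the workers), otherwise None.
        """
        slot = self.pool[slot_id]
        if chunk_id == slot.chunk_id - 1:
            return slot.last_result            # retransmission of a completed chunk
        if chunk_id != slot.chunk_id or worker_id in slot.seen:
            return None                        # stale packet or duplicate contribution
        slot.seen.add(worker_id)
        slot.values = [a + b for a, b in zip(slot.values, chunk)]
        if len(slot.seen) < self.num_workers:
            return None                        # wait for the remaining workers
        # All workers contributed: cache the result, reuse the slot for the next chunk.
        slot.last_result = list(slot.values)
        slot.chunk_id += 1
        slot.values = [0] * len(chunk)
        slot.seen.clear()
        return slot.last_result
```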
Main Strength
Identifies the key challenges of implementing an aggregation primitive in a programmable dataplane switch, especially the limited computation and storage capacities
Breaks down the challenges and tackles them with a set of techniques, appropriately dividing end-host and switch functionality to build an efficient and reliable streaming aggregation protocol
The pseudocode of the algorithms is nice!
Main Weakness
Intel has now halted development of the Tofino networking chips, which clouds the future of the hardware platform SwitchML targets
In the packet loss recovery experiments, some slots are unevenly affected by random losses, but SwitchML does not apply any form of work stealing to rebalance load among aggregator slots
Comments
Training large models is typically more compute-bound than network-bound; Google uses TPU pods for training to avoid communication overhead?