• Distributed training is increasingly network-bound: network performance has a substantial impact on training time.

  • The paper proposes an in-network aggregation primitive to accelerate distributed ML workloads, implemented on programmable switch hardware. The authors propose SwitchML, a co-design of in-switch processing with an end-host transport layer and ML frameworks.


  • The main insights are

    • 1) aggregation involves a simple arithmetic operation and is amenable to parallelization and pipelined execution on programmable network devices: decompose the parameter updates into appropriately sized chunks that can be individually processed by the switch pipeline

    • 2) aggregation for SGD can be applied separately to different portions of the input data in any order without affecting correctness: tolerate packet loss using a lightweight switch scoreboard and a retransmission mechanism driven solely by end hosts

    • 3) ML training is robust to modest approximations in compute: have the workers scale and convert floating-point values to fixed-point using an adaptive scaling factor, with negligible approximation loss
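A minimal sketch of insight 3 (the function names, the exact scaling rule, and the coordination of the scale factor are my assumptions, not the paper's implementation): workers scale float gradients into 32-bit integers with a shared scaling factor, the switch sums only integers, and the end hosts rescale the sum back to floats.

```python
import numpy as np

def quantize(grads: np.ndarray, scale: float) -> np.ndarray:
    """Scale float gradients to int32 (block floating-point style)."""
    return np.round(grads * scale).astype(np.int32)

def dequantize(agg: np.ndarray, scale: float, n_workers: int) -> np.ndarray:
    """Convert the switch's integer sum back to a float average."""
    return agg.astype(np.float64) / (scale * n_workers)

# Illustrative scale choice: bound by the max magnitude across workers so
# the int32 sum cannot overflow (the real system adapts this per chunk).
workers = [np.array([0.5, -1.25, 3.0]), np.array([2.0, 0.75, -0.5])]
max_abs = max(np.abs(w).max() for w in workers)
scale = (2**31 - 1) / (max_abs * len(workers))

agg = sum(quantize(w, scale) for w in workers)   # switch-side integer add
avg = dequantize(agg, scale, len(workers))       # host-side rescale
```

With a scale this large the quantization error is far below typical gradient noise, which is why the approximation is negligible for training.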

  • The main techniques SwitchML uses include

    • Combined switch-host architecture: the switch handles integer aggregation, while end hosts handle reliability and more complex computations

    • Pool-based streaming aggregation: streams aggregation through the switch

    • Lightweight fault tolerant protocols to recover from packet loss

    • Quantized integer-based aggregation: convert floating-point values to 32-bit integers using a block floating-point-like approach
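The pool-based design can be sketched as follows (a simplification under my own assumptions: one accumulator plus a set-based scoreboard per slot; the paper's actual design uses pool versioning and bitmaps in switch registers):

```python
# Sketch of one slot in the switch's aggregation pool: it accumulates one
# chunk from every worker, releases the aggregate, then is reused for the
# next chunk streaming through. Names and sizes are illustrative.
N_WORKERS = 2

class Slot:
    def __init__(self, width: int):
        self.acc = [0] * width     # integer accumulator
        self.seen = set()          # scoreboard: workers counted this round

def on_packet(slot: Slot, worker: int, chunk: list[int]):
    """Idempotent aggregation: a retransmission from an already-seen worker
    is dropped, which makes end-host-driven loss recovery safe."""
    if worker in slot.seen:
        return None                           # duplicate -> ignore
    slot.seen.add(worker)
    slot.acc = [a + v for a, v in zip(slot.acc, chunk)]
    if len(slot.seen) == N_WORKERS:
        result = slot.acc
        slot.acc, slot.seen = [0] * len(result), set()
        return result                         # broadcast; slot is reusable
    return None

slot = Slot(width=3)
assert on_packet(slot, 0, [1, 2, 3]) is None       # waiting for worker 1
assert on_packet(slot, 0, [1, 2, 3]) is None       # retransmit is ignored
assert on_packet(slot, 1, [4, 5, 6]) == [5, 7, 9]  # aggregate released
```

Because each slot resets itself once every worker has contributed, a small fixed pool of slots can stream aggregation over an arbitrarily large model.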

Main Strength

  • Identifies the key challenges of implementing an aggregation primitive in a programmable dataplane switch, especially its limited computation and storage capacities

  • Breaks the challenges down and tackles them with a set of techniques, appropriately dividing functionality between end hosts and the switch for an efficient and reliable streaming aggregation protocol

  • The pseudocode for the algorithms is nice!

Main Weakness

  • Intel has now halted Tofino networking chip development

  • In the packet-loss recovery experiments, some slots are unevenly affected by random losses, but SwitchML does not apply any form of work stealing to rebalance load among aggregation slots


  • Training large models is typically more compute-bound than network-bound; does Google use TPU pods for training precisely to avoid this communication overhead?
