Wavelet: Efficient DNN Training with Tick-Tock Scheduling

https://mlsys.org/virtual/2021/oral/1586

  1. All-reduce

  2. Parameter server
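The two schemes above can be sketched in plain Python (my illustration, not code from the talk): `ring_all_reduce` follows the standard reduce-scatter + all-gather ring, while `parameter_server` does the same reduction through a central node:

```python
def ring_all_reduce(grads):
    """Ring all-reduce: reduce-scatter, then all-gather.
    grads: one gradient list per worker; for simplicity each of the
    n chunks is a single scalar, one chunk per worker."""
    n = len(grads)
    buf = [list(g) for g in grads]
    # Reduce-scatter: after n-1 steps, worker i holds the full sum of chunk (i+1) % n.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            buf[(i + 1) % n][c] += buf[i][c]
    # All-gather: circulate each completed chunk once around the ring.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            buf[(i + 1) % n][c] = buf[i][c]
    return buf

def parameter_server(grads):
    """Workers push gradients to one server; the server broadcasts the sum."""
    total = [sum(col) for col in zip(*grads)]
    return [list(total) for _ in grads]

grads = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # three workers
print(ring_all_reduce(grads))   # every worker ends with [12, 15, 18]
print(parameter_server(grads))  # same result, different traffic pattern
```

The ring keeps per-link traffic roughly constant as workers are added, which is why all-reduce usually scales better than funneling everything through one server.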

  • Why a new approach?

    • Prior schedulers work at the cluster level

    • They can introduce more resource fragmentation

    • They say nothing about utilization within a single task

  • A task does not use all of its GPU resources all the time

Gandiva:

  • Operates at the cluster level

  • Does not improve a single job's performance

  • A single job still takes the same time

  • Wavelet's idea instead: increase inter-batch parallelism
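A toy sketch of the interleaving idea (my illustration, made-up numbers): a task's GPU demand dips during gradient communication, so a second wave of tasks phase-shifted by half an iteration can fill the valley on the same GPU:

```python
ITER = 10  # timesteps per training iteration (hypothetical)

def mem_usage(t):
    """Toy per-task GPU memory profile: high during forward/backward,
    low while gradients are being communicated."""
    return 8 if (t % ITER) < ITER // 2 else 2

# One wave alone vs. two waves shifted by half an iteration ("tick" and "tock").
single_peak = max(mem_usage(t) for t in range(ITER))
ticktock_peak = max(mem_usage(t) + mem_usage(t + ITER // 2) for t in range(ITER))
print(single_peak, ticktock_peak)  # 8 10
```

The combined peak (10) is well under twice the single-wave peak (16) because the two waves' peaks never coincide; that headroom is what lets a second wave run without doubling the resource requirement.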

Gandiva:

  • Multi-job and single-job settings

PipeDream: focuses on minimizing the communication (only the activations and gradients at stage boundaries cross workers)
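A back-of-envelope sketch of that communication argument (hypothetical sizes, not numbers from the talk): data parallelism all-reduces every gradient each step, while pipeline parallelism only ships the activations crossing stage boundaries:

```python
# Hypothetical 4-layer model, one layer per pipeline stage.
layer_params = [1_000_000, 1_000_000, 1_000_000, 1_000_000]
boundary_activations = 4_096   # activation size at each stage boundary
batch = 32

# Data parallelism: every worker exchanges gradients for all parameters.
data_parallel_comm = sum(layer_params)

# Pipeline parallelism: only the 3 stage boundaries carry traffic,
# one activation tensor per sample in the batch.
pipeline_comm = boundary_activations * batch * 3

print(data_parallel_comm, pipeline_comm)
```

With these (made-up) sizes the pipeline moves roughly an order of magnitude less data per step; the actual ratio depends on model and batch shape.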

Question: which version of the model weights is read by an in-flight minibatch?
