Wavelet: Efficient DNN Training with Tick-Tock Scheduling

https://mlsys.org/virtual/2021/oral/1586
  1. All-reduce
  2. Parameter server
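The two aggregation schemes can be contrasted with a toy simulation (my own sketch; `parameter_server` and `all_reduce` are illustrative names, not real library APIs — both end with every worker holding the averaged gradient, but via different communication patterns):

```python
def parameter_server(worker_grads):
    """Centralized: every worker sends its gradient to one server,
    which averages and broadcasts the result back to all workers."""
    n = len(worker_grads)
    avg = [sum(vals) / n for vals in zip(*worker_grads)]
    return [list(avg) for _ in range(n)]

def all_reduce(worker_grads):
    """Decentralized: reduce-scatter (worker i reduces chunk i of the
    gradient), then all-gather the reduced chunks. No central server."""
    n = len(worker_grads)
    d = len(worker_grads[0])
    chunk = d // n  # simplifying assumption: d divisible by n
    owned = []
    for i in range(n):
        lo, hi = i * chunk, (i + 1) * chunk
        owned.append([sum(g[j] for g in worker_grads) / n
                      for j in range(lo, hi)])
    merged = [x for c in owned for x in c]  # all-gather: concatenate chunks
    return [list(merged) for _ in range(n)]

# Both schemes yield the same averaged gradient on every worker.
grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
```

The practical difference is bandwidth and fault-tolerance: the server is a central bottleneck, while ring all-reduce spreads the traffic evenly across workers.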
  • Why?
    • Existing approaches operate at the cluster level
    • They can introduce more fragmentation
    • They do not address the utilization of a single task
  • A single job does not use all of its resources all the time
Gandiva:
  • Cluster-level scheduler
  • But does not improve a single job's performance
  • A single job still takes the same amount of time
  • Targets multi-job vs. single-job settings
  • Wavelet instead increases inter-batch parallelism
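One way to picture inter-batch parallelism is a toy memory model (an illustration I made up, not the paper's actual profiler): an iteration's footprint ramps up through the forward pass as activations accumulate and ramps down through the backward pass as they are freed, so launching a second "tock" wave half an iteration behind the "tick" wave lets one wave's peak land in the other's valley:

```python
def iteration_memory(steps):
    """Toy memory profile of one training iteration: ramp up during the
    forward pass, ramp back down during the backward pass."""
    half = steps // 2
    return [t + 1 for t in range(half)] + [half - t for t in range(half)]

def peak_memory(profile_a, profile_b, offset):
    """Peak combined memory when wave B starts `offset` steps after wave A."""
    length = max(len(profile_a), offset + len(profile_b))
    total = [0] * length
    for t, m in enumerate(profile_a):
        total[t] += m
    for t, m in enumerate(profile_b):
        total[offset + t] += m
    return max(total)

profile = iteration_memory(8)                # [1, 2, 3, 4, 4, 3, 2, 1]
aligned = peak_memory(profile, profile, 0)   # both waves peak together: 8
ticktock = peak_memory(profile, profile, 4)  # tock offset by half: peak 5
```

Offsetting the waves keeps the combined peak well under twice a single iteration's peak, which is what makes it feasible to overlap a second wave on the same GPU.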
PipeDream: pipeline parallelism, minimizing the communication between stages
Question: which version of the model weights does each mini-batch read?
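PipeDream answers the weight-version question with weight stashing: a mini-batch's backward pass uses the same weight version its forward pass read, even if newer updates have been applied in between. A minimal single-stage sketch (class and method names are mine, and the "weights" are a single scalar for illustration):

```python
class Stage:
    """One pipeline stage with weight stashing, keyed by mini-batch id."""

    def __init__(self, w):
        self.w = w
        self.stash = {}  # mini-batch id -> weight version read at forward time

    def forward(self, mb_id):
        self.stash[mb_id] = self.w   # record the version this batch reads
        return self.w

    def backward(self, mb_id):
        return self.stash.pop(mb_id)  # reuse that exact version, not self.w

    def update(self, delta):
        self.w += delta               # later updates don't disturb the stash

stage = Stage(1.0)
stage.forward(0)       # mini-batch 0 reads version 1.0
stage.update(0.5)      # weights advance to 1.5
stage.forward(1)       # mini-batch 1 reads version 1.5
```

Without the stash, mini-batch 0's backward pass would see the newer weights and compute an inconsistent gradient; stashing trades extra memory for consistency.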

Slides