Wavelet: Efficient DNN Training with Tick-Tock Scheduling
https://mlsys.org/virtual/2021/oral/1586




Background: two ways to synchronize gradients in data-parallel training:
All-reduce (decentralized, every worker ends with the averaged gradient)
Parameter server (workers push gradients to a central server and pull updates back)
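To make the all-reduce side concrete, here is a toy NumPy simulation of ring all-reduce (reduce-scatter followed by all-gather). This is an illustrative sketch, not code from the paper; a parameter server would instead gather all gradients at one node and broadcast the average.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring all-reduce over n workers' gradient vectors.
    Each gradient is split into n chunks; after a reduce-scatter
    (n-1 steps) and an all-gather (n-1 steps), every worker holds
    the element-wise sum of all gradients."""
    n = len(grads)
    chunks = [list(np.array_split(g.astype(float), n)) for g in grads]

    # Reduce-scatter: at step s, worker i sends chunk (i - s) % n to its
    # right neighbor, which accumulates it. Copy all "sent" chunks first
    # to model simultaneous exchange around the ring.
    for s in range(n - 1):
        sent = [chunks[i][(i - s) % n].copy() for i in range(n)]
        for i in range(n):
            j = (i - 1) % n          # left neighbor
            c = (j - s) % n          # chunk index it sent
            chunks[i][c] += sent[j]

    # Now worker i holds the fully reduced chunk (i + 1) % n.
    # All-gather: pass the completed chunks around the ring.
    for s in range(n - 1):
        sent = [chunks[i][(i + 1 - s) % n].copy() for i in range(n)]
        for i in range(n):
            j = (i - 1) % n
            c = (j + 1 - s) % n
            chunks[i][c] = sent[j]

    return [np.concatenate(chunks[i]) for i in range(n)]
```

Each worker sends 2(n-1) chunks of size |g|/n, so per-worker traffic is roughly 2|g| regardless of n, which is why all-reduce scales better than a central parameter server.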

Why?
Existing schedulers work at the cluster level, which can still leave resource fragmentation.
Cluster-level scheduling says nothing about utilization within a single task.
A single training job does not use all GPU resources (compute and memory) all the time.

Gandiva:
Operates at the cluster level.
Does not improve a single job's performance; a single job still takes the same time.

Wavelet's idea: increase inter-batch parallelism.
Unlike Gandiva, this helps in both multi-job and single-job settings.
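The tick-tock idea above can be sketched with a toy timeline (an illustration, not Wavelet's implementation): a second "tock" wave of tasks is launched half an iteration after the "tick" wave, so its compute peak overlaps the tick wave's low-utilization gradient-sync phase instead of leaving the GPU idle.

```python
COMPUTE, SYNC = "C", "s"   # high- vs low-utilization halves of an iteration

def wave(phase_offset, iters):
    # One wave: alternating compute and sync slots, shifted by phase_offset.
    return [" "] * phase_offset + [COMPUTE, SYNC] * iters

def schedule(iters=4):
    tick = wave(0, iters)
    tock = wave(1, iters)  # launched half an iteration later
    width = max(len(tick), len(tock))
    tick += [" "] * (width - len(tick))
    tock += [" "] * (width - len(tock))
    # Fraction of time slots in which at least one wave is computing.
    busy = sum(1 for a, b in zip(tick, tock) if COMPUTE in (a, b))
    return "".join(tick), "".join(tock), busy / width

tick, tock, util = schedule()
print("tick:", tick)
print("tock:", tock)
print(f"slots with compute active: {util:.0%}")
```

With a single wave, compute runs only half the time; interleaving the offset tock wave fills most of the sync gaps, pushing compute occupancy close to 100%.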

PipeDream: pipeline parallelism, partitioning the model across workers to minimize the communication between them.


Question: which version of the model weights does a task read?


