Wavelet: Efficient DNN Training with Tick-Tock Scheduling
https://mlsys.org/virtual/2021/oral/1586
Background: gradient synchronization in data-parallel training
All-reduce: workers average gradients peer-to-peer; every worker ends up with the same result
Parameter server: workers push gradients to central servers, which aggregate them and send back updated weights
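A minimal sketch (ours, not the paper's code) contrasting the two aggregation patterns, with plain Python lists standing in for gradient tensors:

```python
def all_reduce(worker_grads):
    """Peer-to-peer averaging: every worker ends up with the mean gradient."""
    n = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / n for i in range(len(worker_grads[0]))]
    return [list(avg) for _ in range(n)]  # each worker holds the same result

def parameter_server(worker_grads, weights, lr=0.1):
    """Central aggregation: the server averages gradients and returns updated weights."""
    n = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / n for i in range(len(weights))]
    return [w - lr * g for w, g in zip(weights, avg)]

grads = [[1.0, 2.0], [3.0, 4.0]]            # gradients from two workers
print(all_reduce(grads))                     # [[2.0, 3.0], [2.0, 3.0]]
print(parameter_server(grads, [0.0, 0.0]))   # roughly [-0.2, -0.3]
```

In the all-reduce pattern the result lives on every worker; in the parameter-server pattern only the server holds the authoritative weights.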
Why?
Prior schedulers improve utilization at the cluster level
Cluster-level sharing can introduce more fragmentation
They do not address single-task utilization
A single training job does not use all of its GPU resources (compute, memory) all the time
Gandiva:
Works at the cluster level (time-slicing and migrating jobs across GPUs)
But it does not improve a single job's performance
A single job still takes the same time to finish
Wavelet's idea: increase inter-batch parallelism by interleaving two waves of training tasks ("tick" and "tock") on the same GPUs, so one wave's compute/memory valley overlaps the other's peak
Gandiva targets the multi-job setting; Wavelet's interleaving applies to both multi-job and single-job settings
PipeDream: pipeline parallelism, partitioning the model across GPUs so as to minimize the communication between stages
Which version of the model weights does each minibatch read? PipeDream keeps this consistent with weight stashing: the backward pass uses the same weight version as the forward pass
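A minimal sketch of the weight-stashing idea (our illustration, not PipeDream's code, with a toy scalar update rule): the stage remembers which weight version each microbatch read in its forward pass and reuses it for that microbatch's backward pass, even after the live weights have been updated.

```python
class Stage:
    """One pipeline stage with weight stashing for microbatch consistency."""

    def __init__(self, weights):
        self.weights = weights   # "live" weights, updated after each backward pass
        self.stash = {}          # microbatch id -> weight version read in forward

    def forward(self, mb_id, x):
        self.stash[mb_id] = self.weights         # remember the version read
        return x * self.weights

    def backward(self, mb_id, grad):
        w = self.stash.pop(mb_id)                # same version as the forward pass
        self.weights -= 0.1 * grad * w           # toy scalar update rule
        return w

stage = Stage(weights=1.0)
stage.forward(0, 2.0)              # microbatch 0 reads version 1.0
stage.forward(1, 2.0)              # microbatch 1 also reads 1.0 (no update yet)
print(stage.backward(0, 1.0))      # 1.0 — backward uses the stashed version
print(stage.backward(1, 1.0))      # 1.0 — still the old version, despite the update
```

Without the stash, microbatch 1's backward pass would see the weights already updated by microbatch 0, making its gradient inconsistent with its own forward pass.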