Beyond Data and Model Parallelism for Deep Neural Networks
https://www.youtube.com/watch?v=81l6kkV-OkE
Parallelizing DNN training is hard
Complex DNN models --> complex machine architectures
Existing approaches: data and model parallelism
Data parallelism is the default strategy in existing DNN frameworks
Manually-designed strategies
Combine data and model parallelism to accelerate DNNs
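As a concrete illustration (an assumed sketch, not notation from the talk or FlexFlow), a manually-designed hybrid strategy might assign data parallelism to convolutional layers and model parallelism to large fully-connected layers:

```python
# Minimal sketch of a manually-designed hybrid strategy: data-parallel conv
# layers, model-parallel fully-connected layers. Layer names and the dict
# format are hypothetical, chosen only for illustration.
manual_strategy = {
    "conv1": {"parallelism": "data",  "degree": 4},   # replicate weights, split the batch
    "conv2": {"parallelism": "data",  "degree": 4},
    "fc1":   {"parallelism": "model", "degree": 4},   # split the weight matrix across 4 GPUs
    "fc2":   {"parallelism": "model", "degree": 4},
}
```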
Automatically generated strategies
ColocRL uses RL to find device placement for model parallelism
Exploring dimensions beyond data and model parallelism can further accelerate DNN training (by up to 3.3x)
A search-based approach
Define the SOAP search space of possible parallelization strategies
A cost model and a search algorithm
Combining them: optimized strategies
The SOAP search space
Samples, operators, attributes, parameters
Samples: partitioning training samples (data parallelism)
Operators: partitioning DNN operators (model parallelism)
Attributes: partitioning attributes in a sample (e.g., different pixels)
Parameters: partitioning parameters in an operator
Hybrid parallelism: different strategies perform the same computation, so accuracy is unchanged and the focus is purely on runtime performance
This work: by considering a larger search space, it is able to find better solutions
Example: data parallelism, model parallelism, hybrid
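A hedged sketch of how a per-operator configuration over the SOAP dimensions could be represented; the class and field names are assumptions for illustration, not FlexFlow's actual API:

```python
from dataclasses import dataclass
from typing import Dict, List

# Sketch of a per-operator parallelization config over the SOAP dimensions.
@dataclass
class ParallelConfig:
    sample_parts: int      # partition the batch of training samples (data parallelism)
    attribute_parts: int   # partition attribute dims of a sample (e.g. image height)
    parameter_parts: int   # partition the operator's parameters (model parallelism)
    devices: List[int]     # one device per resulting partition

    def num_parts(self) -> int:
        return self.sample_parts * self.attribute_parts * self.parameter_parts

# A strategy assigns one config to every operator in the graph.
Strategy = Dict[str, ParallelConfig]

# Pure data parallelism on 4 GPUs for a convolution ...
data_parallel = ParallelConfig(sample_parts=4, attribute_parts=1,
                               parameter_parts=1, devices=[0, 1, 2, 3])
# ... versus a hybrid config that also partitions a dense layer's parameters.
hybrid = ParallelConfig(sample_parts=2, attribute_parts=1,
                        parameter_parts=2, devices=[0, 1, 2, 3])
```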
A cost model and a search algorithm
Optimized solution in this search space
FlexFlow
Input: operator graph (computation in DNN model), device topology (set of available devices, and their inter-connections)
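A minimal sketch of these two inputs as plain data structures (illustrative stand-ins, not FlexFlow's internal representation):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class OperatorGraph:
    operators: Dict[str, str]                 # operator name -> type, e.g. "conv2d", "matmul"
    edges: List[Tuple[str, str]]              # tensor edges: (producer, consumer)

@dataclass
class DeviceTopology:
    devices: List[int]                        # available GPUs/CPUs
    bandwidth: Dict[Tuple[int, int], float]   # (src, dst) -> link bandwidth in GB/s
```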
Execution optimizer
MCMC: search algorithm
Iteratively generates candidate strategies
Execution simulator: cost model
Simulates the execution of each candidate strategy and sends the predicted performance back to the search algorithm
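A minimal sketch of such a search loop, assuming a Metropolis-style acceptance rule; `simulate`, `random_neighbor`, and `beta` are placeholder names, not FlexFlow's interface:

```python
import math
import random

def mcmc_search(initial_strategy, simulate, random_neighbor,
                iterations=10_000, beta=1.0):
    current = initial_strategy
    current_cost = simulate(current)            # predicted per-iteration runtime
    best, best_cost = current, current_cost

    for _ in range(iterations):
        candidate = random_neighbor(current)    # e.g. change one operator's config
        cost = simulate(candidate)
        # Always accept improvements; accept regressions with probability
        # exp(-beta * (cost - current_cost)) so the walk can escape local minima.
        if cost < current_cost or random.random() < math.exp(-beta * (cost - current_cost)):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
    return best, best_cost
```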
Challenge: measuring distributed executions on real hardware is slow
Two observations
The performance of DNN operators is highly predictable
DNN models only use a small number of distinct operators (redundancy)
Execution simulator
Measure each distinct operator once
Use the measurements to estimate different parallelization strategies
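A sketch of the measure-once idea, assuming an operator's runtime depends only on its type and input shape; `measure_on_device` is a hypothetical hook, left unimplemented here:

```python
from functools import lru_cache

def measure_on_device(op_type: str, input_shape: tuple) -> float:
    # Hypothetical hook: run the operator a few times on the real hardware and
    # return the average execution time. Left unimplemented in this sketch.
    raise NotImplementedError

@lru_cache(maxsize=None)
def op_time(op_type: str, input_shape: tuple) -> float:
    # Measure each distinct (operator type, input shape) pair exactly once;
    # every later query across all candidate strategies hits the cache.
    return measure_on_device(op_type, input_shape)

def estimate_compute_time(tasks) -> float:
    # Crude illustration: sum the memoized per-task compute times.
    # (A full simulator also models data-transfer tasks and their overlap.)
    return sum(op_time(t.op_type, t.input_shape) for t in tasks)
```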
Delta simulation algorithm
Idea: do not have to build the task graph from scratch
Observation
The MCMC search proposes a new strategy by updating the previous strategy
Most of the task graph does not change
Solution: simulate a new strategy using incremental updates to previous simulations
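A self-contained toy sketch of this incremental update, under strong simplifying assumptions (one compute task per operator, no communication tasks or overlap); names and structure are illustrative, not FlexFlow's task graph:

```python
def simulate(ops, deps, cost, finish=None, start_from=None):
    # ops must be in topological order; finish[op] = cost[op] + latest predecessor finish.
    # With start_from set, only operators downstream of it are re-simulated;
    # all other finish times are reused from the previous simulation.
    finish = {} if finish is None else dict(finish)
    todo = set(ops) if start_from is None else downstream(ops, deps, start_from)
    for op in ops:
        if op in todo:
            finish[op] = cost[op] + max((finish[p] for p in deps[op]), default=0.0)
    return finish

def downstream(ops, deps, root):
    # All operators reachable from `root` (inclusive), in one topological pass.
    reached = {root}
    for op in ops:
        if any(p in reached for p in deps[op]):
            reached.add(op)
    return reached

# Usage: the proposal changes one operator's cost, so only its downstream cone is re-simulated.
ops  = ["embed", "conv", "fc1", "fc2"]                        # topological order
deps = {"embed": [], "conv": ["embed"], "fc1": ["conv"], "fc2": ["fc1"]}
cost = {"embed": 1.0, "conv": 4.0, "fc1": 2.0, "fc2": 2.0}

baseline = simulate(ops, deps, cost)                          # full simulation, done once
cost["fc1"] = 1.2                                             # proposal changes fc1's config
updated = simulate(ops, deps, cost, finish=baseline, start_from="fc1")
print(updated["fc2"])                                         # new predicted end-to-end time
```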
The best strategy found is sent to the distributed runtime to parallelize training
Evaluation
Simulation reduces the search time by 2-7x
The search only takes a few minutes
Two clusters, six DNN benchmarks