Beyond Data and Model Parallelism for Deep Neural Networks

https://www.youtube.com/watch?v=81l6kkV-OkE

Parallelizing DNN training is hard
- Complex DNN models must be mapped onto complex machine architectures
Existing approaches: data and model parallelism
- Data parallelism is the default strategy in existing DNN frameworks
- Manually-designed strategies
  - Combine data and model parallelism to accelerate DNNs
- Automatically generated strategies
  - ColocRL uses RL to find device placements for model parallelism
- Exploring dimensions beyond data and model parallelism can further accelerate DNN training (by up to 3.3x)
A search-based approach
- Define the SOAP search space of possible parallelization strategies
- A cost model and a search algorithm
- Combining them: optimized strategies
The SOAP search space
- Samples, operators, attributes, parameters
  - Samples: partitioning training samples (data parallelism)
  - Operators: partitioning DNN operators (model parallelism)
  - Attributes: partitioning attributes in a sample (e.g., different pixels)
  - Parameters: partitioning parameters in an operator
- Hybrid parallelism: different strategies perform the same computation, so accuracy is unchanged and the focus is on runtime performance
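
The four SOAP dimensions can be sketched as a small data structure. This is only an illustration of the idea, not FlexFlow's actual API: `ParallelConfig`, `enumerate_configs`, the operator names, and the candidate degrees are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ParallelConfig:
    """Hypothetical per-operator config: partition degree along each dimension."""
    sample: int = 1      # S: split the training samples in a batch
    attribute: int = 1   # A: split within a sample (e.g., different pixels)
    parameter: int = 1   # P: split the operator's parameters

    @property
    def num_tasks(self) -> int:
        # Each combination of partitions becomes one independent task.
        return self.sample * self.attribute * self.parameter


def enumerate_configs(num_devices, degrees=(1, 2, 4, 8)):
    """List every config for one operator whose task count fits on the devices."""
    return [
        ParallelConfig(s, a, p)
        for s, a, p in product(degrees, repeat=3)
        if s * a * p <= num_devices
    ]


# O dimension: a strategy assigns a (possibly different) config to each operator.
strategy = {
    "conv1": ParallelConfig(sample=4),               # pure data parallelism
    "fc1": ParallelConfig(parameter=4),              # split the parameters
    "conv2": ParallelConfig(sample=2, attribute=2),  # hybrid
}

if __name__ == "__main__":
    print(len(enumerate_configs(num_devices=4)), "candidate configs per operator")
    for name, cfg in strategy.items():
        print(name, cfg, "tasks:", cfg.num_tasks)
```

Even in this toy form the space grows quickly: with a handful of candidate configs per operator, the number of whole-model strategies is exponential in the number of operators, which is why a search algorithm is needed rather than enumeration.
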
This work: by considering a larger search space, it can find better solutions
- Example: data parallelism, model parallelism, hybrid
A cost model and a search algorithm
- Find an optimized strategy within this search space
- FlexFlow
  - Inputs: operator graph (the computation in a DNN model) and device topology (the set of available devices and their interconnections)
  - Execution optimizer
    - MCMC: search algorithm
      - Iteratively generates candidate strategies
    - Execution simulator: cost model
      - Simulates the execution of each candidate strategy and sends the simulated performance back to the search algorithm
    - Challenge: measuring distributed executions on real hardware is slow
    - Two observations
      - The performance of DNN operators is highly predictable
      - DNN models use only a small number of distinct operators (redundancy)
    - Execution simulator
      - Measure each distinct operator once
      - Use the measurements to estimate the cost of different parallelization strategies
    - Delta simulation algorithm
      - Idea: do not have to build the task graph from scratch
      - Observation
        - The MCMC search proposes a new strategy by updating the previous strategy
        - Most of the task graph does not change
      - Solution: simulate a new strategy using incremental updates to previous simulations
  - The best strategy found is sent to the distributed runtime to parallelize training
- Evaluation
  - Simulation reduces the search time by 2-7x
  - The search only takes a few minutes
  - Two clusters, six DNN benchmarks
