Beyond Data and Model Parallelism for Deep Neural Networks

https://www.youtube.com/watch?v=81l6kkV-OkE

  • Parallelizing DNN training is hard

    • Complex DNN models must be mapped onto complex machine architectures

  • Existing approaches: data and model parallelism

    • Data parallelism is the default strategy in existing DNN frameworks

    • Manually designed strategies

      • Combine data and model parallelism to accelerate DNNs

    • Automatically generated strategies

      • ColocRL uses RL to find device placements for model parallelism

    • Exploring dimensions beyond data and model parallelism can further accelerate DNN training (by up to 3.3x)

  • A search-based approach

    • Define the SOAP search space of possible parallelization strategies

    • A cost model and a search algorithm

    • Combining them yields optimized parallelization strategies

  • The SOAP search space

    • Samples, operators, attributes, parameters

      • Samples: partitioning training samples (data parallelism)

      • Operators: partitioning DNN operators (model parallelism)

      • Attributes: partitioning attributes in a sample (e.g., different pixels)

      • Parameters: partitioning parameters in an operator

    • Hybrid parallelism: different strategies perform the same computation, so model accuracy is unchanged and the comparison focuses on runtime performance (see the sketch below)
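
A minimal sketch, assuming hypothetical names (this is not FlexFlow's actual API), of how one parallelization strategy over the SOAP dimensions could be represented: each operator gets partition degrees for the sample, attribute, and parameter dimensions plus a device assignment for the operator dimension.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class OpConfig:
    """Hypothetical per-operator parallelization configuration (SOAP)."""
    sample_degree: int = 1      # S: split the training batch (data parallelism)
    attribute_degree: int = 1   # A: split within a sample, e.g. image rows/columns
    parameter_degree: int = 1   # P: split the operator's parameters, e.g. output channels
    devices: List[int] = field(default_factory=lambda: [0])  # O: device placement

# A strategy maps every operator in the operator graph to a configuration.
data_parallel: Dict[str, OpConfig] = {
    "conv1": OpConfig(sample_degree=4, devices=[0, 1, 2, 3]),
    "fc1":   OpConfig(sample_degree=4, devices=[0, 1, 2, 3]),
}
hybrid: Dict[str, OpConfig] = {
    "conv1": OpConfig(sample_degree=2, attribute_degree=2, devices=[0, 1, 2, 3]),
    "fc1":   OpConfig(parameter_degree=2, devices=[0, 1]),  # split the weight matrix
}
```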

  • This work: by considering a larger search space, it can find better solutions

    • Example: data parallelism, model parallelism, hybrid

  • A cost model and a search algorithm

    • Optimized solution in this search space

    • FlexFlow

      • Input: operator graph (computation in DNN model), device topology (set of available devices, and their inter-connections)

      • Execution optimizer

        • MCMC: search algorithm

          • Iteratively generates candidate strategies

        • Execution simulator: cost model

          • Simulate the execution of the strategies and send the simulated performance back to the search algorithm

          • Challenge: measuring distributed executions on real hardware is slow

          • Two observations

            • The performance of DNN operators is highly predictable

            • DNN models only use a small number of distinct operators (redundancy)

          • Execution simulator

            • Measure each distinct operator once

            • Use the measurements to estimate different parallelization strategies

            • Delta simulation algorithm

              • Idea: the task graph does not have to be rebuilt from scratch

              • Observation

                • The MCMC search proposes a new strategy by updating the previous strategy

                • Most of the task graph does not change

              • Solution: simulate a new strategy using incremental updates to the previous simulation (sketched below)
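
A rough sketch of the delta idea under simplifying assumptions: `prev_sim` caches per-operator timings from the last simulated strategy, `measured_cost` looks up the one-time measurements of each distinct operator, and the total runtime is collapsed to a sum rather than a full task-graph replay. All names are hypothetical.

```python
def delta_simulate(prev_sim, strategy, changed_op, op_graph, measured_cost):
    """Estimate the runtime of a new strategy by patching the previous simulation.

    prev_sim      -- dict: operator -> estimated time under the previous strategy
    strategy      -- new strategy; identical to the previous one except at changed_op
    changed_op    -- the operator whose configuration the proposal modified
    op_graph      -- operator graph; op_graph.neighbors(op) gives adjacent operators
    measured_cost -- (operator, config) -> time, built from one-time measurements
    """
    sim = dict(prev_sim)  # reuse every timing that the change does not affect
    affected = {changed_op} | set(op_graph.neighbors(changed_op))
    for op in affected:
        # re-estimate compute time plus the communication implied by the new partitioning
        sim[op] = measured_cost(op, strategy[op])
    # Simplification: total runtime as a sum of per-operator times; the real
    # simulator replays task dependencies and device/network contention.
    return sum(sim.values()), sim
```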

        • The best strategy found is sent to the distributed runtime to parallelize training (a sketch of the overall search loop follows)
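
A compact sketch of the optimizer loop under the same assumptions, with hypothetical helpers `simulate_full` (one complete simulation to seed the timing cache) and `random_config` (draws a new configuration for one operator); proposals are accepted with a Metropolis criterion on simulated runtime so the search can escape local minima.

```python
import math
import random

def mcmc_search(init_strategy, op_graph, measured_cost, steps=10_000, beta=1.0):
    """Search for a fast parallelization strategy using simulated costs."""
    current = dict(init_strategy)
    cost, sim = simulate_full(current, op_graph, measured_cost)  # hypothetical helper
    best, best_cost = dict(current), cost

    for _ in range(steps):
        # Proposal: change the configuration of a single randomly chosen operator.
        op = random.choice(list(current))
        candidate = dict(current)
        candidate[op] = random_config(op)  # hypothetical helper
        new_cost, new_sim = delta_simulate(sim, candidate, op, op_graph, measured_cost)

        # Always accept improvements; accept regressions with small probability.
        if new_cost < cost or random.random() < math.exp(beta * (cost - new_cost)):
            current, cost, sim = candidate, new_cost, new_sim
            if cost < best_cost:
                best, best_cost = dict(current), cost

    return best, best_cost
```

The best strategy returned by such a loop is what the distributed runtime would then execute.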

    • Evaluation

      • Simulation reduces the search time by 2-7x

      • The search only takes a few minutes

      • Two clusters, six DNN benchmarks
