IOS: Inter-Operator Scheduler for CNN Acceleration

https://arxiv.org/abs/2011.01302

  • Executive summary

    • Motivation: sequential execution --> under-utilization problem

  • Inter-Operator Scheduler

    • Inter-operator parallelism

    • Dynamic programming --> optimal schedule

    • 1.1-1.5x speedup

  • Efficient deployment of CNNs is important

    • Does CNN inference in current DL libraries utilize the underlying hardware well?

  • Motivation for Inter-Operator Parallelization

    • More small convs in CNN design

    • GPU peak performance increased

    • Intra- and inter-operator parallelization

      • Sequential execution (intra-operator parallelization only): device under-utilization (small operators on a powerful GPU)

      • Inter-Op Parallel Execution: better device utilization

  • Background: wavefront schedule policy

    • Execute all available operators stage by stage

    • A better schedule

      • Adding an op to an already saturated stage: marginal benefit

      • Under-utilization problem

      • Wavefront schedule policy is sub-optimal
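
The wavefront policy described above can be sketched in a few lines. This is a minimal illustration, not the IOS code: the function name `wavefront_schedule` and the toy graph are assumptions made for the example.

```python
from collections import defaultdict


def wavefront_schedule(ops, edges):
    """Greedy wavefront policy: each stage contains every operator whose
    predecessors have all finished in earlier stages."""
    preds = defaultdict(set)
    for u, v in edges:
        preds[v].add(u)

    done, stages = set(), []
    remaining = list(ops)
    while remaining:
        # All operators that are currently available go into the same stage.
        ready = [op for op in remaining if preds[op] <= done]
        if not ready:
            raise ValueError("the computation graph contains a cycle")
        stages.append(ready)
        done.update(ready)
        remaining = [op for op in remaining if op not in done]
    return stages


# Hypothetical toy graph: two parallel branches (a -> c, b -> d) joined by e.
ops = ["a", "b", "c", "d", "e"]
edges = [("a", "c"), ("b", "d"), ("c", "e"), ("d", "e")]
stages = wavefront_schedule(ops, edges)
print(stages)
```

The policy never looks at operator cost, which is why it can pack an extra operator into a stage that is already saturated.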

  • Inter-operator scheduler (IOS)

    • General idea: explore the schedule space exhaustively

    • Challenge: the number of schedules is exponential in the number of operators

      • Prohibitive to enumerate

    • Observation 1: optimal schedule for a subgraph can be reused

      • Key idea: dynamic programming

    • Observation 2: the width of the computation graph (the maximum number of parallelizable operators) is usually small

      • Key result: time complexity is only exponential in the width
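
The two observations can be combined into a dynamic program over operator subsets: the best schedule for a set S is the best schedule for S minus its last stage, plus the latency of that last stage. The sketch below is an illustration under stated assumptions, not the IOS implementation. The names `optimal_schedule`, `PROFILE` and `toy_stage_latency` are hypothetical, the cost model is invented (a real scheduler would profile each stage on the device), and as a simplification it only considers last stages whose operators are mutually independent.

```python
from functools import lru_cache
from itertools import combinations


def optimal_schedule(ops, edges, stage_latency):
    """Return (total_latency, stages) minimizing the sum of stage latencies."""
    succs = {op: set() for op in ops}
    for u, v in edges:
        succs[u].add(v)

    def endings(S):
        # Simplification (assumption): a last stage is any non-empty set of
        # sinks of S, i.e. operators with no successor inside S.
        sinks = sorted(op for op in S if not (succs[op] & S))
        for k in range(1, len(sinks) + 1):
            for combo in combinations(sinks, k):
                yield frozenset(combo)

    @lru_cache(maxsize=None)  # Observation 1: reuse the result for each subset
    def best(S):
        if not S:
            return 0.0, ()
        options = []
        for last in endings(S):
            cost, stages = best(S - last)
            options.append((cost + stage_latency(last),
                            stages + (tuple(sorted(last)),)))
        return min(options)

    return best(frozenset(ops))


# Hypothetical profile: op -> (latency when run alone, fraction of GPU used).
PROFILE = {"a": (2.0, 1.0), "b": (1.0, 0.4), "c": (1.0, 0.4)}


def toy_stage_latency(stage):
    # Invented cost model: a stage is bounded by its longest operator and by
    # the total work divided by a device capacity of 1.
    longest = max(PROFILE[op][0] for op in stage)
    total_work = sum(PROFILE[op][0] * PROFILE[op][1] for op in stage)
    return max(longest, total_work)


# Toy graph: a and b are independent, c depends on a.
cost, stages = optimal_schedule(["a", "b", "c"], [("a", "c")], toy_stage_latency)
print(cost, stages)
```

Under this toy model the DP runs the saturating operator `a` alone and then `b` and `c` together, for a total of 3.0, whereas grouping `a` and `b` first (the wavefront choice) would cost 3.4.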

    • Parallelization strategy selection

      • Concurrent execution --> multiple GPU kernels run at the same time

      • Operator merge --> merged convolution, usually better performance

      • Profile & select
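
A "profile & select" step could look roughly like the following. This is a generic sketch, not the IOS runtime: `profile`, `select_strategy` and the stand-in workloads are hypothetical names. On a real GPU the timed callable would also have to wait for the device to finish, because kernel launches are asynchronous.

```python
import statistics
import time


def profile(run, repeats=5, warmup=1):
    """Median wall-clock time of calling run(), after a few warm-up calls."""
    for _ in range(warmup):
        run()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def select_strategy(strategies, measure=profile):
    """Measure every candidate strategy and return (best_name, latencies)."""
    latencies = {name: measure(run) for name, run in strategies.items()}
    best_name = min(latencies, key=latencies.get)
    return best_name, latencies


# Stand-in workloads (assumption): in practice these would launch the stage
# either as concurrent kernels or as one merged operator.
strategies = {
    "concurrent": lambda: sum(range(20000)),
    "merge": lambda: sum(range(10000)),
}
best_name, latencies = select_strategy(strategies)
print(best_name, latencies)
```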

    • Last stage candidates

      • S' can be the last stage of S <--> there is no edge from S' to S - S'
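
The last-stage condition can be checked directly on the edge list. The helper names below (`is_last_stage`, `last_stage_candidates`) are hypothetical, and the brute-force enumeration is only meant to make the condition concrete on a tiny graph.

```python
from itertools import combinations


def is_last_stage(S, Sp, edges):
    """S' can be the last stage of S iff no edge goes from S' to S - S'."""
    S, Sp = set(S), set(Sp)
    rest = S - Sp
    if not Sp or not Sp <= S:
        return False
    return not any(u in Sp and v in rest for u, v in edges)


def last_stage_candidates(S, edges):
    """Enumerate every non-empty subset of S that satisfies the condition."""
    items = sorted(S)
    return [set(combo)
            for k in range(1, len(items) + 1)
            for combo in combinations(items, k)
            if is_last_stage(S, combo, edges)]


# Toy graph: a -> b, with c independent of both.
edges = [("a", "b")]
candidates = last_stage_candidates({"a", "b", "c"}, edges)
print(candidates)
```

In this toy graph `{a}` and `{a, c}` are rejected because `b`, which depends on `a`, would be left in S - S'.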

    • Transition graph and time complexity

    • Methodology

      • Benchmarks

        • Inception V3, SqueezeNet, Randwire, NasNet

      • Baselines: state-of-the-art frameworks, different schedules on IOS Runtime

      • Environment: NVIDIA V100, CUDA, cuDNN

    • More active warps improve utilization
