IOS: Inter-Operator Scheduler for CNN Acceleration
https://arxiv.org/abs/2011.01302
Executive summary
Motivation: sequential execution --> under-utilization problem
Inter-Operator Scheduler
Inter-operator parallelism
Dynamic programming --> optimal schedule
1.1-1.5x speedup
Efficient deployment of CNNs is important
Is CNN inference in current DL libraries well utilizing underlying hardware?
Motivation for Inter-Operator Parallelization
More small convs in CNN design
GPU peak performance increased
Intra- and inter-operator parallelization
Sequential execution with only intra-operator parallelization: device under-utilization (small ops on a powerful GPU)
Inter-Op Parallel Execution: better device utilization
Background: wavefront schedule policy
Execute all available operators stage by stage
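The wavefront policy above is just a stage-by-stage topological grouping. A minimal sketch (function and graph encoding are illustrative, not IOS's actual API):

```python
from collections import defaultdict

def wavefront_stages(ops, edges):
    """Wavefront policy: at each stage, execute every operator whose
    predecessors have all already been scheduled."""
    indeg = {op: 0 for op in ops}
    succ = defaultdict(list)
    for u, v in edges:
        indeg[v] += 1
        succ[u].append(v)
    stages = []
    ready = [op for op in ops if indeg[op] == 0]
    while ready:
        stages.append(ready)
        nxt = []
        for op in ready:
            for v in succ[op]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        ready = nxt
    return stages
```

For a diamond graph a -> {b, c} -> d this yields the stages [a], [b, c], [d]: b and c run concurrently, but the policy never considers any other grouping.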
A better schedule
Putting an op into an already-saturated stage yields only marginal benefit
Under-utilization problem
Wavefront schedule policy is sub-optimal
Inter-operator scheduler (IOS)
General idea: explore the schedule space exhaustively
Challenge: the number of schedules is exponential in the number of operators
Prohibitive to enumerate
Observation 1: optimal schedule for a subgraph can be reused
Key idea: dynamic programming
Observation 2: the width of the computation graph is usually small (max number of parallelizable operators)
Key result: time complexity is only exponential in the width
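The DP idea can be sketched as follows: cost(S) = min over valid last stages S' of cost(S - S') + stage_cost(S'). This is a simplified illustration, not IOS's implementation: a stage here holds only mutually independent operators run concurrently, stage_cost stands in for a hardware profiler, and the full scheduler additionally considers merged groups and prunes the enumeration so it is exponential only in the graph width.

```python
from itertools import combinations
from functools import lru_cache

def optimal_schedule_cost(ops, edges, stage_cost):
    """DP over subgraphs: cost(S) = min over last stages S' of
    cost(S - S') + stage_cost(S').  stage_cost is a user-supplied
    profiling function (an assumption for this sketch)."""
    succ = {op: set() for op in ops}
    for u, v in edges:
        succ[u].add(v)

    @lru_cache(maxsize=None)
    def cost(S):
        if not S:
            return 0.0
        # Valid last-stage members are the "sinks" of S: operators with
        # no remaining successor inside S (no edge from S' to S - S').
        sinks = sorted(op for op in S if not (succ[op] & S))
        best = float("inf")
        for r in range(1, len(sinks) + 1):
            for combo in combinations(sinks, r):
                Sp = frozenset(combo)
                best = min(best, cost(S - Sp) + stage_cost(Sp))
        return best

    return cost(frozenset(ops))
```

With a toy cost model where a single op costs 1.0 but a two-op concurrent stage costs only 1.2, the DP correctly picks the parallel stage for two independent ops over running them sequentially.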
Parallelization strategy selection
Concurrent execution --> multiple GPU kernels running at the same time
Operator merge --> merged convolution, usually better performance
Profile & select
Last stage candidates
S' can be the last stage of S <--> there is no edge from S' to S - S'
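The condition above is a direct edge check; a small sketch (function name and set encoding are illustrative assumptions):

```python
def can_be_last_stage(S_prime, S, edges):
    """S' is a valid last stage of subgraph S iff no edge goes from S'
    into S - S'; otherwise some op in S' would have to finish before an
    op that the schedule places in an earlier stage."""
    S_prime, rest = set(S_prime), set(S) - set(S_prime)
    return all(not (u in S_prime and v in rest) for u, v in edges)
```

For a -> {b, c}, the set {b, c} is a valid last stage of {a, b, c}, while {a} is not, since the edges a -> b and a -> c point back into the remainder.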
Transition graph and time complexity
Methodology
Benchmarks
Inception V3, SqueezeNet, RandWire, NASNet
Baselines: state-of-the-art frameworks, different schedules on IOS Runtime
Environment: NVIDIA V100, CUDA, cuDNN
More active warps improve utilization