# IOS: Inter-Operator Scheduler for CNN Acceleration

* Executive summary
  * Motivation: sequential execution --> under-utilization problem
* Inter-Operator Scheduler
  * Inter-operator parallelism
  * Dynamic programming --> optimal schedule
  * 1.1-1.5x speedup
* Efficient deployment of CNNs is important
  * Does CNN inference in current DL libraries utilize the underlying hardware well?
* Motivation for Inter-Operator Parallelization
  * CNN designs use more small convolutions
  * GPU peak performance has increased
  * Intra- and inter-operator parallelization
    * Sequential execution (intra-operator parallelization only): device under-utilization (small ops on a powerful GPU)
    * Inter-operator parallel execution: better device utilization
* Background: wavefront schedule policy
  * Execute all available operators stage by stage
  * A better schedule
    * Adding an op to an already saturated stage: marginal benefit
    * Under-utilization problem
    * Wavefront schedule policy is sub-optimal
* Inter-operator scheduler (IOS)
  * General idea: explore the schedule space exhaustively
  * Challenge: the number of schedules is exponential in the number of operators
    * Prohibitive to enumerate
  * Observation 1: the optimal schedule for a subgraph can be reused
    * Key idea: dynamic programming
  * Observation 2: the width of the computation graph (the maximum number of parallelizable operators) is usually small
    * Key result: time complexity is only exponential in the width
  * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FkX4mTTCukb5YMppJGCNt%2Fimage.png?alt=media\&token=f0d5f7a8-1a27-4694-b64c-891e6968c523)
  * Parallelization strategy selection
    * Concurrent execution --> multiple GPU kernels run at the same time
    * Operator merge --> merged convolution, usually better performance
    * Profile & select
  * Last stage candidates
    * S' can be the last stage of S <--> there is no edge from S' to S - S'
  * Transition graph and time complexity
  * Methodology
    * Benchmarks
      * Inception V3, SqueezeNet, RandWire, NASNet
    * Baselines: state-of-the-art frameworks, different schedules on IOS Runtime
    * Environment: NVIDIA V100, CUDA, cuDNN
  * More active warps improve utilization
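
The notes above describe the wavefront baseline, the "last stage" condition, and the dynamic-programming idea only in words, so here is a minimal Python sketch that puts the three together on a toy graph. Everything concrete in it is an assumption made for illustration: the graph `EDGES`, the per-operator numbers in `PROFILE`, and especially the cost model `stage_latency` (IOS obtains stage latencies by profiling on the real device, not from a formula). The sketch also restricts each stage to mutually independent operators and ignores operator merge, which keeps the code short but explores a smaller schedule space.

```python
from functools import lru_cache
from itertools import combinations

# Hypothetical toy graph: op -> list of successors (a feeds b; c and d are independent).
EDGES = {"a": ["b"], "b": [], "c": [], "d": []}
# Hypothetical profile: op -> (latency in ms when run alone, fraction of the GPU it occupies).
PROFILE = {op: (1.0, 0.5) for op in EDGES}
ALL_OPS = frozenset(EDGES)


def stage_latency(stage):
    """Assumed cost model: a stage is bounded by its longest op and by its total GPU work."""
    longest = max(PROFILE[op][0] for op in stage)
    work = sum(PROFILE[op][0] * PROFILE[op][1] for op in stage)
    return max(longest, work)


def is_valid_ending(ending, ops):
    """S' can be the last stage of S iff there is no edge from S' to S - S'."""
    rest = ops - ending
    return all(succ not in rest for op in ending for succ in EDGES[op])


def is_independent(stage):
    """Simplification of this sketch: ops inside one stage must not depend on each other."""
    return all(succ not in stage for op in stage for succ in EDGES[op])


@lru_cache(maxsize=None)
def best_schedule(ops):
    """Return (latency, stages) of the cheapest schedule for the operator set `ops`."""
    if not ops:
        return 0.0, ()
    best = (float("inf"), ())
    members = sorted(ops)
    for size in range(1, len(members) + 1):
        for combo in combinations(members, size):
            ending = frozenset(combo)
            if not (is_valid_ending(ending, ops) and is_independent(ending)):
                continue
            # Reuse the memoized optimum of the remaining subgraph (Observation 1).
            prefix_cost, prefix_stages = best_schedule(ops - ending)
            total = prefix_cost + stage_latency(ending)
            if total < best[0]:
                best = (total, prefix_stages + (tuple(sorted(ending)),))
    return best


def wavefront_schedule():
    """Baseline: each stage runs every operator whose predecessors have finished."""
    preds = {op: set() for op in EDGES}
    for op, succs in EDGES.items():
        for succ in succs:
            preds[succ].add(op)
    done, stages = set(), []
    while len(done) < len(EDGES):
        ready = sorted(op for op in EDGES if op not in done and preds[op] <= done)
        stages.append(tuple(ready))
        done.update(ready)
    return sum(stage_latency(s) for s in stages), tuple(stages)


if __name__ == "__main__":
    print("wavefront:", wavefront_schedule())
    print("dp       :", best_schedule(ALL_OPS))
```

Under this toy model the wavefront baseline packs `a`, `c`, and `d` into an already saturated first stage and then runs `b` alone, for 2.5 ms in total, while the dynamic program finds a two-stage schedule with two operators per stage that takes 2.0 ms. The sketch enumerates candidate endings by brute force over subsets, so it is only meant for graphs of a handful of operators.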
