Server-Driven Video Streaming for Deep Learning Inference


  • Video streaming for analytics is pervasive

    • Wildlife camera, traffic camera, drone camera

  • Our goal: scale out video streaming for analytics

  • Design goals:

    • Preserve high accuracy

    • Save bandwidth

      • reduces bandwidth cost

      • reduces the response delay

  • Bandwidth saving opportunity

    • Traditionally: camera --> video server --> human viewer

      • Sends pixels that are irrelevant to the inference results

    • In this scenario: camera directly to the server-side DNN

      • Video analytics enables aggressive compression on non-object pixels

  • Prior works: real-time camera-side heuristics

    • Filter to drop pixels

    • Capture the frame --> heuristics --> filter --> send the remaining part to the server

    • Sub-optimal: misses many objects; camera-side compute is too limited to support accurate heuristics
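
The camera-side heuristic pipeline above (capture --> heuristic --> filter --> send) can be sketched as a toy frame-difference filter. This is a made-up stand-in for "real-time camera-side heuristics", not any system's actual code; frames are plain nested lists of grayscale blocks and the threshold is arbitrary:

```python
# Toy sketch of a camera-side filtering heuristic (illustrative only):
# compare each block of the current frame to the previous frame and
# send only blocks whose average pixel change exceeds a threshold.

def block_changed(prev_block, cur_block, threshold=10):
    """Cheap motion heuristic: mean absolute pixel difference."""
    diffs = [abs(a - b) for a, b in zip(prev_block, cur_block)]
    return sum(diffs) / len(diffs) > threshold

def filter_blocks(prev_frame, cur_frame, threshold=10):
    """Return indices of blocks worth sending to the server."""
    return [i for i, (p, c) in enumerate(zip(prev_frame, cur_frame))
            if block_changed(p, c, threshold)]

prev = [[100, 100, 100], [50, 50, 50], [0, 0, 0]]
cur  = [[100, 101, 100], [90, 90, 90], [0, 0, 0]]   # only block 1 moved
print(filter_blocks(prev, cur))  # -> [1]
```

A heuristic this cheap is exactly why prior work misses objects: a stationary but important object produces no frame difference and gets filtered out.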

  • Challenge: server-side-DNN-driven + in real time

    • Difficult: encoding the current video requires DNN feedback on that same video (chicken-and-egg problem)

    • Solution: see the video first and iterate on how to encode the video

      • DNN-driven streaming (DDS)

        • Camera buffers several frames to form a video segment

        • Encode in low quality --> server DNN --> inference results + feedback regions --> camera re-encodes the feedback regions in higher quality --> server re-runs inference and updates the result

        • Recovers (recalls) previously undetected objects through iteration

          • How?

            • Regions that may have objects

            • Eliminate confidently-detected regions

            • Re-encode in higher quality

            • Finds the regions that are almost detected but not!
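
The feedback-region step above can be sketched as a confidence split: confidently-detected boxes become final results, while "almost detected" boxes become regions to re-encode in higher quality. The thresholds and the `Detection` record here are illustrative assumptions, not the paper's exact values:

```python
# Hedged sketch of DDS-style feedback-region selection (illustrative):
# pass 1 runs the DNN on the low-quality segment; high-confidence boxes
# are kept as results, mid-confidence boxes ("almost detected but not")
# are fed back for high-quality re-encoding, and near-zero-confidence
# boxes are dropped as background.

from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple        # (x, y, w, h) in frame coordinates
    score: float      # DNN confidence

HIGH = 0.8   # confidently detected -> final result   (assumed value)
LOW  = 0.3   # below this -> likely background, drop  (assumed value)

def split_detections(dets):
    results  = [d for d in dets if d.score >= HIGH]
    feedback = [d.box for d in dets if LOW <= d.score < HIGH]
    return results, feedback

dets = [Detection((0, 0, 10, 10), 0.95),   # confident: keep
        Detection((20, 5, 8, 8), 0.55),    # uncertain: re-encode region
        Detection((40, 40, 4, 4), 0.05)]   # noise: drop
results, feedback = split_detections(dets)
print(len(results), feedback)  # -> 1 [(20, 5, 8, 8)]
```

The key design point: eliminating confidently-detected regions means the second iteration spends bandwidth only on the ambiguous pixels.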

  • Bandwidth-accuracy trade-off

    • Saves up to 59% bandwidth while achieving higher accuracy

  • Conclusion

    • Contribution:

      • Bandwidth-accuracy trade-off

      • Real-time, server-driven streaming

      • DDS iterative workflows

  • Questions

    • Why not an NN-driven camera?

      • Depends on the video content: if the scene is difficult, a small camera-side NN may not be sufficient to handle its complexity

    • Dynamic: connectivity between camera and server?

      • Saving bandwidth helps the video stream survive poor connectivity

    • What if the low-quality pass detects nothing at all?

      • Limitation: the first-iteration quality can't be so low that the feedback is useless and accuracy suffers, so a middle quality is used

    • Interest-based encoding?

      • Region-of-interest encoding assumes the encoder already has the raw video in hand; DDS must propose regions without access to the raw source

      • In DDS, the video source and the region-proposal algorithm sit on opposite ends of a bandwidth-constrained link

    • Overhead of additional iterations?

      • Encoding capability of camera (little)

      • Server: 2 inference passes per frame --> ~2x compute overhead

        • But can save bandwidth --> save a huge cost

        • Other server-side compute-saving methods have even more overhead

          • 3-4x overhead on the server side
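
The overhead argument above is back-of-envelope arithmetic: a 2x inference overhead can be outweighed by bandwidth savings when bandwidth dominates cost. The prices below are hypothetical placeholders, not measured numbers from the talk:

```python
# Back-of-envelope sketch of the overhead trade-off (all prices are
# hypothetical): stream cost = GB streamed * price/GB plus
# inference passes * price/pass. DDS pays ~2x inference but, per the
# talk, saves up to 59% bandwidth.

def total_cost(gb_streamed, inference_passes,
               price_per_gb=0.08, price_per_pass=0.01):
    return gb_streamed * price_per_gb + inference_passes * price_per_pass

baseline = total_cost(gb_streamed=100, inference_passes=1)
# DDS: 59% less bandwidth, two inference passes.
dds = total_cost(gb_streamed=41, inference_passes=2)
print(dds < baseline)  # -> True
```

Whether this holds in practice depends on the actual ratio of bandwidth price to compute price in a given deployment.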

    • Latency induced?

      • Camera data storage?

        • Store the video on camera until feedback arrives?

        • DDS buffers frames (buffer time) + runs extra iterations (to improve accuracy)

          • Worst-case delay: sub-optimal

          • ~90% of results are delivered before the extra iteration; since they come from the low-quality pass, their streaming delay is shorter

          • Average delay pretty low

        • Storage: only needs to buffer 2-3s of video

      • Quickly changing video?

        • Rationale: DDS iterates on the current segment only, looking at each frame twice; it does not try to learn from previous video, so it handles quickly changing content without any adaptation

      • Multiple-cameras

        • Extend DDS to work with multiple cameras and save more bandwidth in multi-camera setting?

        • Opportunity of research

          • Inter-camera redundancy

            • VID identification

            • Another design dimension

Some of my thoughts / questions:

  1. Worst-case iteration counts might induce a lot of overhead, right? Is there a stopping criterion?

  2. Another idea: since performance depends on video complexity, is it possible to use a mechanism that

    1. Runs a small DNN on the camera side

    2. Runs a large DNN on the server side

    3. And, depending on the video complexity, smartly redistributes compute responsibility to the compute servers?
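
The idea in (2) could be sketched as a complexity-based router. Everything here is a hypothetical stand-in (the complexity proxy, the threshold, and the two model names are made up) just to make the mechanism concrete:

```python
# Sketch of complexity-based compute routing (hypothetical design):
# easy segments are handled by a small camera-side DNN; segments
# scored as complex are offloaded to a large server-side DNN.

def segment_complexity(num_objects, motion):
    # Toy proxy: more objects and more motion -> harder segment.
    return num_objects * motion

def route(segment, threshold=10.0):
    score = segment_complexity(segment["objects"], segment["motion"])
    return "server_large_dnn" if score > threshold else "camera_small_dnn"

print(route({"objects": 2, "motion": 1.5}))   # -> camera_small_dnn
print(route({"objects": 12, "motion": 2.0}))  # -> server_large_dnn
```

A real version would need a complexity estimator cheap enough to run on the camera, which circles back to the talk's point that camera-side compute is the bottleneck.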
