Server-Driven Video Streaming for Deep Learning Inference

https://people.cs.uchicago.edu/~junchenj/docs/DDS-Sigcomm20.pdf

Talk

  • Video streaming for analytics is pervasive

    • Wild-life camera, traffic camera, drone camera

  • Our goal: scale out video streaming for analytics

  • Design goals:

    • Preserve high accuracy

    • Save bandwidth

      • reduces bandwidth cost

      • reduces the response delay

  • Bandwidth saving opportunity

    • Traditionally: video source --> human viewer

      • Sends pixels that are not relevant to the inference results

    • In this scenario: camera directly to the server-side DNN

      • Video analytics enables aggressive compression on non-object pixels

  • Prior works: real-time camera-side heuristics

    • Filter to drop pixels

    • Capture the frame --> heuristics --> filter --> send the remaining part to the server

    • Sub-optimal: misses many objects; camera-side compute is too limited to support accurate heuristics (a rough sketch of this kind of filter follows below)
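As a rough illustration of the kind of cheap camera-side heuristic the talk refers to (not the specific filters evaluated in the paper), here is a minimal frame-differencing filter: keep only the regions that changed since the previous frame and black out the rest before encoding. All thresholds and function names are invented for the sketch, and it assumes OpenCV 4.

```python
import cv2
import numpy as np

def changed_regions(prev_frame, frame, diff_thresh=25, min_area=400):
    """Cheap camera-side heuristic: keep only regions that changed
    since the previous frame; everything else is assumed to be
    background and is dropped before transmission."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev_gray)
    _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours
            if cv2.contourArea(c) >= min_area]      # (x, y, w, h) boxes

def filter_frame(prev_frame, frame):
    """Black out everything outside the changed regions so the encoder
    spends almost no bits on the dropped pixels."""
    keep = np.zeros(frame.shape[:2], dtype=bool)
    for x, y, w, h in changed_regions(prev_frame, frame):
        keep[y:y + h, x:x + w] = True
    filtered = frame.copy()
    filtered[~keep] = 0
    return filtered
```

This also shows why such heuristics are sub-optimal: a stationary object, or one whose appearance barely changes between frames, never produces a large-enough difference region and is silently dropped.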

  • Challenge: server-side-DNN-driven + in real time

    • Difficult: encoding the current video requires feedback computed on that same video (chicken-and-egg problem)

    • Solution: let the server see the video first, then iterate on how to encode it

      • DNN-driven streaming (DDS)

        • Camera buffers several frames to form a video segment

        • Camera encodes the segment in low quality --> server DNN --> results + feedback regions --> feedback regions sent back to the camera --> camera re-encodes those regions in higher quality --> server runs the DNN again and updates the inference results (see the sketch after this list)

        • Recovers (re-detects) the initially missed objects through this iteration

          • How?

            • Regions that may have objects

            • Eliminate confidently-detected regions

            • Re-encode in higher quality

            • Finds the regions that were almost detected in the first pass but missed
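A minimal sketch of this iterative workflow, using placeholder names (`encode`, `server_pass`, the confidence thresholds, etc.) for the camera encoder and the server DNN; it captures the control flow described above, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Region = Tuple[float, float, float, float]   # x, y, w, h (normalized)

@dataclass
class Detection:
    box: Region
    label: str
    score: float

# Hypothetical confidence thresholds for this sketch.
HIGH_CONF = 0.8   # confidently detected: keep as final results
LOW_CONF = 0.3    # below this: treat as background noise

def iou_overlaps(a: Region, b: Region, thresh: float = 0.5) -> bool:
    """Rough IoU test used to drop proposals that sit on top of an
    already-confident detection."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return union > 0 and inter / union >= thresh

def feedback_regions(detections: List[Detection]) -> List[Region]:
    """Server side: keep regions that *may* contain objects (medium
    confidence) and eliminate confidently-detected regions."""
    confident = [d.box for d in detections if d.score >= HIGH_CONF]
    maybe = [d for d in detections if LOW_CONF <= d.score < HIGH_CONF]
    return [d.box for d in maybe
            if not any(iou_overlaps(d.box, c) for c in confident)]

def server_pass(encoded_video) -> List[Detection]:
    """Placeholder for the server-side DNN inference (assumed API)."""
    raise NotImplementedError

def encode(frames, quality: str, regions: Optional[List[Region]] = None):
    """Placeholder for the camera encoder; regions=None means encode the
    whole frames, otherwise only the given regions (assumed API)."""
    raise NotImplementedError

def stream_segment(frames) -> List[Detection]:
    """Two-pass, server-driven streaming of one buffered segment."""
    # Iteration 1: whole segment in low quality.
    results = server_pass(encode(frames, quality="low"))
    regions = feedback_regions(results)        # feedback sent to the camera

    # Iteration 2: only the feedback regions, re-encoded in high quality.
    if regions:
        extra = server_pass(encode(frames, quality="high", regions=regions))
        results = [d for d in results if d.score >= HIGH_CONF] + extra
    return results
```

The design point worth noting is that the region proposal is driven by the server DNN's own output: medium-confidence boxes that do not overlap an already-confident detection are exactly the "almost detected" regions that get a second, higher-quality look.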

  • Bandwidth-accuracy trade-off

    • Save up to 59% bandwidth and achieve higher accuracy

  • Conclusion

    • Contribution:

      • Bandwidth-accuracy trade-off

      • Real-time, server-driven streaming

      • DDS iterative workflows

  • Questions

    • Could the camera itself run an NN to drive the streaming?

      • Depends on the video content (if it is difficult, a small camera-side NN may not be sufficient to handle the complexity)

    • What about dynamic connectivity between camera and server?

      • Saving bandwidth helps the video stream survive bandwidth fluctuations

    • What if the low-quality stream is so degraded that the server detects nothing at all?

      • Limitation: in the first iteration the quality cannot be too low, or the feedback is unreliable and accuracy suffers (use a middle quality)

    • Interest-based encoding?

      • Region-of-interest encoding methods already have the raw video in hand at the source; DDS has to propose regions away from the video source, on the server

      • In DDS, the video source and the region-proposal algorithm sit on opposite sides of the network, which is what creates the bandwidth cost to manage

    • Overhead of additional iterations?

      • Camera: some extra encoding, but little overhead

      • Server: runs inference twice on each frame, so roughly 2x compute overhead

        • But it saves bandwidth --> a large cost saving

        • Other server-side compute methods have more overhead

          • 3-4x overhead on the server side

    • Latency induced?

      • Camera data storage?

        • Store the video on camera until feedback arrives?

        • DDS adds delay from buffering frames (buffer time) plus the extra iterations (to improve accuracy); see the back-of-envelope sketch after this questions list

          • Worst-case delay is sub-optimal

          • ~90% of the results are delivered from the low-quality first pass, before the extra iteration, so their streaming delay is shorter

          • Average delay is pretty low

        • Storage: needs to buffer about 2-3 s of video

      • Quickly changing video?

        • Rationale: DDS iterates only on the current frames, looking at each frame twice; it does not try to learn from previous video, so it needs no adaptation and simply looks at the current segment

      • Multiple cameras?

        • Extend DDS to work with multiple cameras and save more bandwidth in multi-camera setting?

        • An opportunity for future research

          • Inter-camera redundancy

            • VID identification

            • Another design dimension
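To make the delay discussion above concrete, here is a back-of-envelope calculation of per-segment response delay. The 2-3 s buffer comes from the Q&A; every other number is a made-up assumption, not a measurement from the paper.

```python
# Back-of-envelope per-segment delay for the two-pass, server-driven workflow.
buffer_s = 2.5        # camera buffers ~2-3 s of frames per segment (from the Q&A)
upload_low_s = 0.4    # hypothetical: uploading the low-quality segment
infer_s = 0.3         # hypothetical: one DNN pass over the segment
feedback_rtt_s = 0.1  # hypothetical: server -> camera feedback
upload_high_s = 0.2   # hypothetical: the re-encoded regions are small

# Results from the first (low-quality) pass are available after:
first_pass_delay = buffer_s + upload_low_s + infer_s

# Results refined by the second pass arrive after the extra round trip:
second_pass_delay = first_pass_delay + feedback_rtt_s + upload_high_s + infer_s

print(f"first-pass results after  ~{first_pass_delay:.1f} s")
print(f"second-pass results after ~{second_pass_delay:.1f} s")
```

Under these made-up numbers most of the delay is the buffering itself, which matches the point above that results delivered from the low-quality pass (the majority) see the shorter delay.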

Some of my thoughts / questions:

  1. Worst-case scenarios with many iterations might induce a lot of overhead, right? What is the stopping criterion?

  2. Another idea: since it depends on the video complexity, is it possible to use a mechanism that (a speculative sketch follows below):

    1. Runs a small DNN on the camera side

    2. Runs a large DNN on the server side

    3. And, depending on the video complexity, smartly redistributes compute responsibility to the compute servers?
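A very rough sketch of idea 2 (entirely my own speculation, not something from the talk): route each segment based on a cheap complexity estimate, keeping easy segments on the camera's small DNN and escalating hard ones to the server's large DNN. All function names and thresholds here are invented.

```python
from typing import List, Tuple

def small_dnn(frames) -> List[dict]:
    """Placeholder for a lightweight camera-side model (assumed API,
    returning dicts with at least a "score" field)."""
    raise NotImplementedError

def large_dnn(frames) -> List[dict]:
    """Placeholder for the heavyweight server-side model (assumed API)."""
    raise NotImplementedError

def route_segment(frames, conf_floor=0.6, max_objects=5) -> Tuple[List[dict], str]:
    """Speculative complexity-based routing: trust the camera's small DNN
    when the scene looks easy, otherwise escalate to the server."""
    local = small_dnn(frames)
    scores = [d["score"] for d in local]
    crowded = len(local) > max_objects            # many objects => hard scene
    uncertain = bool(scores) and min(scores) < conf_floor
    if crowded or uncertain:
        # Hard segment: spend bandwidth and server compute on it.
        return large_dnn(frames), "server"
    # Easy segment: keep it on the camera and save the bandwidth entirely.
    return local, "camera"
```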
