Server-Driven Video Streaming for Deep Learning Inference
https://people.cs.uchicago.edu/~junchenj/docs/DDS-Sigcomm20.pdf
Video streaming for analytics is pervasive
Wildlife camera, traffic camera, drone camera
Our goal: scale out video streaming for analytics
Design goals:
Preserve high accuracy
Save bandwidth
reduces bandwidth cost
reduces the response delay
Bandwidth saving opportunity
Traditionally: video is streamed for human viewing (video server --> human viewer)
All pixels are sent, even those not relevant to the inference results
In this scenario: camera directly to the server-side DNN
Video analytics enables aggressive compression on non-object pixels
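A minimal sketch (my own illustration, not the paper's method) of why non-object pixels are cheap to send: if only a DNN consumes the frames, everything outside the relevant regions can be blacked out before encoding, so a standard codec spends almost no bits on the background. The frame size and boxes below are made up.

```python
import numpy as np

def mask_non_object_pixels(frame, boxes):
    """Zero out every pixel outside the given (x0, y0, x1, y1) boxes.

    A regular codec then spends very few bits on the large uniform
    background, so only the object regions cost bandwidth.
    """
    masked = np.zeros_like(frame)
    for x0, y0, x1, y1 in boxes:
        masked[y0:y1, x0:x1] = frame[y0:y1, x0:x1]
    return masked

# Hypothetical 1080p frame with two relevant regions.
frame = np.random.randint(0, 256, size=(1080, 1920, 3), dtype=np.uint8)
relevant = [(100, 200, 400, 500), (900, 300, 1200, 700)]
to_encode = mask_non_object_pixels(frame, relevant)
```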
Prior works: real-time camera-side heuristics
Filter to drop pixels
Capture the frame --> heuristics --> filter --> send the remaining part to the server
Sub-optimal: misses many objects; camera-side compute is too limited to support accurate heuristics (see the sketch below)
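A rough sketch of the kind of lightweight camera-side heuristic prior work relies on (my own illustration, not any specific system): frame differencing keeps only pixels that changed since the last frame. It is cheap enough for a camera, but it silently drops static or slow-moving objects, which is the sub-optimality noted above.

```python
import numpy as np

def filter_frame(prev, curr, thresh=25):
    """Cheap camera-side heuristic: send only pixels that changed.

    prev, curr : HxWx3 uint8 frames; `thresh` is a made-up intensity threshold.
    Static objects never trigger the mask, so they are never sent.
    """
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    moved = diff.max(axis=-1) > thresh      # True where something changed
    out = np.zeros_like(curr)
    out[moved] = curr[moved]
    return out
```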
Challenge: server-side-DNN-driven + in real time
Difficult: encoding the current video requires DNN feedback computed from that same video (chicken-and-egg problem)
Solution: let the server see the video first (cheaply), then iterate on how to encode it
DNN-driven streaming (DDS)
Camera buffers several frames to form a video segment
Encode in low quality --> server --> DNN --> results + feedback regions sent back to the camera --> camera re-encodes the feedback regions in higher quality --> server runs the DNN on them again and updates the inference results (loop sketch below)
The iteration recovers (recalls) objects that the first low-quality pass missed
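A skeleton of that two-pass loop as I understand it; the function names, callables, and QP values are my own placeholders, not the paper's API.

```python
from typing import Callable, List, Tuple

# A feedback region, e.g. (frame_index, x0, y0, x1, y1) -- the format is hypothetical.
Region = Tuple[int, int, int, int, int]

def dds_stream_segment(
    frames: list,
    encode: Callable[[list, int], bytes],                        # whole segment at a given QP
    encode_regions: Callable[[list, List[Region], int], bytes],  # only the requested regions
    server_infer: Callable[[bytes], Tuple[list, List[Region]]],  # returns (detections, feedback)
    low_qp: int = 36,
    high_qp: int = 26,
) -> list:
    """Two-pass DDS-style loop over one buffered segment (my sketch).

    Pass 1: the whole segment in low quality -> detections + feedback regions.
    Pass 2: only the feedback regions re-encoded in high quality; the server
            re-runs the DNN there and the new detections are merged in.
    """
    results, feedback = server_infer(encode(frames, low_qp))
    if feedback:  # optional second iteration, only if the server asked for regions
        extra, _ = server_infer(encode_regions(frames, feedback, high_qp))
        results = results + extra  # merge low- and high-quality detections
    return results
```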
How?
Regions that may have objects
Eliminate confidently-detected regions
Re-encode in higher quality
Finds regions that were almost detected but missed (region-selection sketch below)
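A sketch of how those feedback regions might be picked (my reading of the idea, with made-up thresholds): keep region proposals whose score sits between "clearly background" and "confidently detected", and drop proposals that mostly overlap an object the DNN already found.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def select_feedback_regions(proposals, detections,
                            low_score=0.1, high_score=0.5, max_overlap=0.3):
    """Pick regions that may hold objects but were not confidently detected.

    proposals  : list of ((x0, y0, x1, y1), score) from the region-proposal stage
    detections : list of (x0, y0, x1, y1) boxes already reported confidently
    The threshold values are illustrative, not the paper's.
    """
    feedback = []
    for box, score in proposals:
        if not (low_score <= score < high_score):
            continue  # clearly background, or already detected confidently
        if any(iou(box, det) > max_overlap for det in detections):
            continue  # mostly covered by an object the DNN already found
        feedback.append(box)
    return feedback
```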
Bandwidth-accuracy trade-off
Save up to 59% bandwidth and achieve higher accuracy
Conclusion
Contribution:
Bandwidth-accuracy trade-off
Real-time server-driven streaming
DDS iterative workflows
Questions
Why not an NN-driven approach on the camera side?
Depends on the video content: if it is difficult, a small camera-side NN may not be sufficient to handle the complexity
What about dynamic connectivity between the camera and the server?
Saving bandwidth helps the video stream survive when connectivity degrades
What if the low-quality stream is so poor that the server detects nothing at all?
Limitation: in the first iteration the quality can't be too low, or the feedback is unreliable and accuracy suffers (so a middle quality is used)
How does this differ from interest-based (region-of-interest) encoding?
DDS has to propose regions without the raw video in hand (the proposer sits across the network from the source), whereas region-of-interest methods already have the raw video
In DDS, the video source and the region-proposal algorithm are separated by the network, so bandwidth is the cost being optimized
Overhead of additional iterations?
Camera side: only extra encoding, which costs little
Server side: inference runs up to twice on each frame, roughly 2x compute overhead
But the bandwidth savings offset this, and bandwidth is the much larger cost
Other server-side compute methods have even more overhead
3-4x overhead on the server side
Latency induced?
Camera data storage?
Store the video on camera until feedback arrives?
Delay = DDS frame buffering (the segment buffer time) + any extra iterations (which improve accuracy)
Worst-case delay: sub-optimal
~90% of the results arrive before the extra iteration; they come from the low-quality pass, so their streaming delay is shorter
Average delay is pretty low (back-of-the-envelope below)
Storage: the camera only needs to hold 2-3 s of video
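To make the "average delay is low" point concrete, a back-of-the-envelope calculation; every number except the ~90% share is hypothetical.

```python
# Hypothetical per-segment timings; only the 90% share comes from the discussion above.
buffer_s  = 2.5   # camera buffers a 2-3 s segment before sending
pass1_s   = 0.4   # low-quality upload + first inference (made-up)
pass2_s   = 0.6   # region re-encode + upload + second inference (made-up)
share_p1  = 0.9   # ~90% of results come straight out of the first pass

avg_extra = share_p1 * pass1_s + (1 - share_p1) * (pass1_s + pass2_s)
print(f"worst-case extra delay: {pass1_s + pass2_s:.2f} s")
print(f"average extra delay   : {avg_extra:.2f} s (on top of {buffer_s} s of buffering)")
```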
Quickly changing video?
Rationale: DDS iterates on the current segment and looks at each frame twice; it does not try to learn from previous video, so without any adaptation it only needs the current frames and can handle quickly changing content
Multiple cameras?
Extend DDS to work with multiple cameras and save more bandwidth in multi-camera setting?
A research opportunity
Inter-camera redundancy
VID identification
Another design dimension
Some of my thoughts / questions:
Worst-case numbers of iterations could add a lot of overhead, right? What is the stopping criterion?
Another idea: since performance depends on the video complexity, could there be a mechanism with
a small DNN on the camera side
and a large DNN on the server side,
that, depending on the video complexity, smartly re-distributes compute responsibility to the servers? (speculative sketch below)
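A tiny sketch of what that re-distribution could look like (purely my speculation, not anything in the paper): run a small DNN on the camera and escalate only frames whose local detections look uncertain to the large server-side model.

```python
def route_frame(frame, small_dnn, large_dnn_rpc, conf_floor=0.6):
    """Speculative camera/server split: escalate only the 'hard' frames.

    small_dnn(frame)     -> list of (box, score) from an on-camera model
    large_dnn_rpc(frame) -> same, from the big server-side model over the network
    `conf_floor` is a made-up confidence threshold.
    """
    local = small_dnn(frame)
    if any(score < conf_floor for _, score in local):
        return large_dnn_rpc(frame)   # content too complex for the small model
    return local                      # cheap path: keep the local result
```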