Pantheon: the training ground for Internet congestion-control research
https://www.usenix.org/system/files/conference/atc18/atc18-yan-francis.pdf
Last updated
Was this helpful?
https://www.usenix.org/system/files/conference/atc18/atc18-yan-francis.pdf
Last updated
Was this helpful?
Congestion control
Cornerstone problem in computer networking
Avoids congestion collapse
Allocates resources among users
Affects every application using TCP socket
BBR, Sprout, PCC
Every emerging algorithm claims to be the SOTA
Compared with other algorithms that they picked
Must acquire, compile, and execute prior algorithms
Evaluated on their own testbed
Large service operators: risky to deploy, long turnaround time
Researchers: on a much smaller scales, results may not generalize
On simulators / emulators with their settings
How to configure the settings?
Based on specific results they collected
The internet is diverse and evolving
Other fields:
Database: TPC
Computer systems: SPEC
CV: ImageNet
Lesson: shared, reproducible benchmarks can lead to huge leaps performance and transform technologies by making them scientific
Pantheon: a community resource
A common language in CC
Benchmark algorithms
Shared testbeds
Public data
A training ground for congestion control
Enables faster innovation and more reproducible research
e.g. Vivace (NSDI '18), Copa (NSDI '18), Indigo: a ML-based congestion control
15+ algorithms
Common testing interface
Measure performance faithfully without modifications
Performance varies across types of network path, path direction, and time
Limitation
Only tests schemes at full throttle
Nodes are not necessarily representative
Does not measure interactions between different schemes (fairness, TCP-friendliness)
Calibrated emulators and pathological emulators
Simulator / emulator: reproducible and allows rapid experimentation
Open problem: what is the choice of parameter values to faithfully emulate a particular target network
Replication errors
Five parameters: a bottleneck link rate, a constant propagation delay, a DropTail threshold for the sender's queue, a loss rate, a bit that selects constant rate or Poisson-governed rate
Steps
Collect a set of results over a particular network path on Pantheon
Avg throughput and 95th percentile delay of a dozen algorithms
Run Bayesian optimization
Run twice: constant rate and Poisson-governed rate
Objective function f(x): mean replication error
Prior: Guassian process
Acquisition function: expected improvements
Pathological emulators
Very small buffer sizes
Severe ACK aggregation
Token-bucket policers
Ongoing projects: Vivace, Copa, and more; Indigo
Vivace: validating a new scheme in the real world
Copa: iterative design with measurements
Indigo: a machine learning design enabled by Pantheon
Model the problem as a sequential decision making problem
Sender observes CC signal at every step, and then it takes an action to adjust the CC window
Goal: learn the mapping from state to action, and encode the mapping into a model
Design
State: queueing delay, sending rate, receiving rate, window size, previous action
Model: 1-layer LSTM network (for history)
CC-oracle
Outputs an action that brings congestion window closest to the ideal size
Ideal size
Only exists in emulators (global view of the network)
BDP: simple emulated links with a fixed bandwidth and min RRT
Bandwidth delay product and use it as the CC window
Search around BDP otherwise
Imitation learning algorithm with DAgger