Pantheon: the training ground for Internet congestion-control research

https://www.usenix.org/system/files/conference/atc18/atc18-yan-francis.pdf

Talk

  • Congestion control

    • Cornerstone problem in computer networking

      • Avoids congestion collapse

      • Allocates resources among users

      • Affects every application that uses a TCP socket

    • BBR, Sprout, PCC

  • Every emerging algorithm claims to be the state of the art

    • Compared with other algorithms that they picked

      • Must acquire, compile, and execute prior algorithms

    • Evaluated on their own testbed

      • Large service operators: risky to deploy, long turnaround time

      • Researchers: operate at much smaller scales, so results may not generalize

    • On simulators / emulators with their settings

      • How to configure the settings?

    • Based on specific results they collected

      • The internet is diverse and evolving

Other fields:

  • Database: TPC

  • Computer systems: SPEC

  • CV: ImageNet

  • Lesson: shared, reproducible benchmarks can lead to huge leaps in performance and transform technologies by making them scientific

Pantheon: a community resource

  • A common language in CC

    • Benchmark algorithms

    • Shared testbeds

    • Public data

  • A training ground for congestion control

    • Enables faster innovation and more reproducible research

    • e.g., Vivace (NSDI '18), Copa (NSDI '18), and Indigo, an ML-based congestion-control scheme

  • 15+ algorithms

  • Common testing interface

  • Measures performance faithfully, without modifying the schemes

    • Performance varies across types of network path, path direction, and time
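The common testing interface can be pictured as a thin per-scheme wrapper that the harness drives uniformly, so every algorithm runs unmodified. A minimal sketch (class and method names are mine, not Pantheon's actual wrapper API):

```python
import abc

class SchemeWrapper(abc.ABC):
    """Hypothetical uniform interface for one congestion-control scheme."""

    @abc.abstractmethod
    def run_first(self):
        """Return which side must start first: 'sender' or 'receiver'."""

    @abc.abstractmethod
    def sender(self, ip, port):
        """Start the sending side, connecting to ip:port."""

    @abc.abstractmethod
    def receiver(self, port):
        """Start the receiving side, listening on port."""

class EchoScheme(SchemeWrapper):
    # Toy stand-in so the sketch runs; a real wrapper would exec the
    # scheme's own, unmodified binaries.
    def run_first(self):
        return "receiver"

    def sender(self, ip, port):
        return "sender -> %s:%d" % (ip, port)

    def receiver(self, port):
        return "receiver on %d" % port
```

With every scheme behind the same few calls, the harness can launch, measure, and tear down any of the 15+ algorithms in exactly the same way.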

  • Limitation

    • Only tests schemes at full throttle

    • Nodes are not necessarily representative

    • Does not measure interactions between different schemes (fairness, TCP-friendliness)

  • Calibrated emulators and pathological emulators

    • Simulator / emulator: reproducible and allows rapid experimentation

    • Open problem: what choice of parameter values faithfully emulates a particular target network?

    • Replication errors

      • Five parameters: a bottleneck link rate, a constant propagation delay, a DropTail threshold for the sender's queue, a loss rate, and a bit that selects a constant or Poisson-governed delivery rate

    • Steps

      • Collect a set of results over a particular network path on Pantheon

        • Avg throughput and 95th percentile delay of a dozen algorithms

      • Run Bayesian optimization

        • Run twice: constant rate and Poisson-governed rate

        • Objective function f(x): mean replication error

        • Prior: Gaussian process

        • Acquisition function: expected improvement
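Concretely, the calibration loop searches emulator parameters to minimize the mean replication error against measurements taken on the real path. A toy, self-contained sketch (illustrative numbers, a fake emulator, and plain random search standing in for GP-based Bayesian optimization with expected improvement; the constant-vs-Poisson bit is handled by running the whole search twice, as above):

```python
import random

# Measurements from a real Pantheon path to be matched: per scheme,
# (avg throughput in Mbit/s, 95th-percentile delay in ms).
# Schemes and numbers here are illustrative, not from the paper.
real = {"cubic": (80.0, 120.0), "bbr": (92.0, 90.0), "vegas": (60.0, 70.0)}

def emulate(scheme, params):
    # Stand-in for running the scheme over an emulator built from `params`;
    # a toy closed-form model so this sketch runs end to end.
    rate, delay, qsize, loss = params
    tput = min(rate, real[scheme][0]) * (1.0 - loss)
    p95_delay = 2.0 * delay + qsize
    return tput, p95_delay

def replication_error(params):
    # Mean relative error in throughput and p95 delay across schemes --
    # one plausible metric; the paper defines its own replication error.
    errs = []
    for scheme, (t_real, d_real) in real.items():
        t_emu, d_emu = emulate(scheme, params)
        errs.append(abs(t_emu - t_real) / t_real)
        errs.append(abs(d_emu - d_real) / d_real)
    return sum(errs) / len(errs)

# Search bounds: link rate (Mbit/s), propagation delay (ms),
# DropTail threshold (toy units), loss rate.
BOUNDS = [(10.0, 200.0), (5.0, 100.0), (10.0, 500.0), (0.0, 0.1)]

random.seed(0)
best = min((tuple(random.uniform(lo, hi) for lo, hi in BOUNDS)
            for _ in range(200)),
           key=replication_error)
```

A real run would replace emulate() with actual emulator experiments and the random sampler with a Gaussian-process optimizer.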

      • Pathological emulators

        • Very small buffer sizes

        • Severe ACK aggregation

        • Token-bucket policers
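Of the pathological behaviors, a token-bucket policer is easy to sketch: it drops (rather than queues) any packet that arrives when the bucket is empty. A minimal model, assuming one token per packet:

```python
class TokenBucketPolicer:
    """Toy token-bucket policer: a packet passes only if a token is free."""

    def __init__(self, rate_pps, burst):
        self.rate = rate_pps   # token refill rate (tokens/second)
        self.burst = burst     # bucket depth (max tokens)
        self.tokens = burst    # bucket starts full
        self.last = 0.0

    def admit(self, now):
        # Refill based on elapsed time, capped at the bucket depth,
        # then spend one token per admitted packet.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # policers drop excess packets instead of queuing them
```

With rate 10 packets/s and burst 5, a back-to-back burst of 20 packets at t=0 sees only the first 5 admitted, which is exactly the behavior a well-tuned scheme must survive.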

  • Ongoing projects: Vivace, Copa, Indigo, and more

    • Vivace: validating a new scheme in the real world

    • Copa: iterative design with measurements

    • Indigo: a machine learning design enabled by Pantheon

      • Model the problem as a sequential decision making problem

      • The sender observes congestion signals at every step, then takes an action to adjust the congestion window

      • Goal: learn the mapping from state to action, and encode the mapping into a model

      • Design

        • State: queueing delay, sending rate, receiving rate, window size, previous action

        • Model: 1-layer LSTM network (to capture history)
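The state above can be assembled from raw per-step measurements; one guess at how (field names and the queuing-delay estimate are mine):

```python
from collections import namedtuple

# Fields follow the talk's state list; the names are my own.
State = namedtuple("State",
                   ["queuing_delay", "send_rate", "recv_rate",
                    "cwnd", "prev_action"])

def make_state(rtt, min_rtt, send_rate, recv_rate, cwnd, prev_action):
    # Queuing delay estimated as current RTT minus the minimum RTT
    # observed so far -- a standard delay-based congestion signal.
    return State(rtt - min_rtt, send_rate, recv_rate, cwnd, prev_action)
```

Each step, a vector like this is fed to the 1-layer LSTM, whose hidden state carries the history across steps.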

      • CC-oracle

        • Outputs the action that brings the congestion window closest to the ideal size

        • Ideal size

          • Only exists in emulators (global view of the network)

          • BDP: for simple emulated links with a fixed bandwidth and min RTT

            • Compute the bandwidth-delay product and use it as the congestion window

          • Search around BDP otherwise

        • Imitation learning algorithm with DAgger
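A sketch of the oracle's labeling step in the simple-link case: compute the BDP as the ideal window, then pick the action that lands closest to it. The discrete action set here is hypothetical (the talk does not list Indigo's actual actions):

```python
# Hypothetical discrete cwnd adjustments a learned policy might choose among.
ACTIONS = [
    lambda w: w / 2.0,   # halve the window
    lambda w: w - 10.0,  # subtract 10 packets
    lambda w: w,         # no change
    lambda w: w + 10.0,  # add 10 packets
    lambda w: w * 2.0,   # double the window
]

def ideal_cwnd(bandwidth_pps, min_rtt_s):
    # Bandwidth-delay product in packets, valid for a fixed-rate link.
    return bandwidth_pps * min_rtt_s

def oracle_action(cwnd, bandwidth_pps, min_rtt_s):
    # The emulator's global view gives the true bandwidth and min RTT,
    # so the oracle can label any state with the best available action.
    target = ideal_cwnd(bandwidth_pps, min_rtt_s)
    return min(range(len(ACTIONS)),
               key=lambda i: abs(ACTIONS[i](cwnd) - target))
```

DAgger then repeatedly rolls out the current learner, labels the states it actually visits with oracle_action, aggregates those (state, action) pairs into the training set, and refits the LSTM on the aggregate.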
