Pantheon: the training ground for Internet congestion-control research

https://www.usenix.org/system/files/conference/atc18/atc18-yan-francis.pdf

Talk

  • Congestion control

    • Cornerstone problem in computer networking

      • Avoids congestion collapse

      • Allocates resources among users

      • Affects every application that uses a TCP socket

    • BBR, Sprout, PCC

  • Every emerging algorithm claims to be the state of the art

    • Compared with other algorithms that they picked

      • Must acquire, compile, and execute prior algorithms

    • Evaluated on their own testbed

      • Large service operators: risky to deploy, long turnaround time

      • Researchers: operate at much smaller scales, so results may not generalize

    • On simulators / emulators with their settings

      • How to configure the settings?

    • Based on specific results they collected

      • The internet is diverse and evolving

Other fields:

  • Database: TPC

  • Computer systems: SPEC

  • CV: ImageNet

  • Lesson: shared, reproducible benchmarks can lead to huge leaps in performance and transform technologies by making them scientific

Pantheon: a community resource

  • A common language in CC

    • Benchmark algorithms

    • Shared testbeds

    • Public data

  • A training ground for congestion control

    • Enables faster innovation and more reproducible research

    • e.g., Vivace (NSDI '18), Copa (NSDI '18), and Indigo, an ML-based congestion-control scheme

  • 15+ algorithms

  • Common testing interface

  • Measures performance faithfully, without modifying the schemes

    • Performance varies across types of network path, path direction, and time
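The common testing interface can be pictured as a thin per-scheme wrapper that the harness drives uniformly, so every algorithm runs unmodified. A minimal sketch (class and method names are mine, not Pantheon's actual wrapper API):

```python
import abc

class SchemeWrapper(abc.ABC):
    """Hypothetical uniform interface for one congestion-control scheme."""

    @abc.abstractmethod
    def run_first(self):
        """Return which side must start first: 'sender' or 'receiver'."""

    @abc.abstractmethod
    def sender(self, ip, port):
        """Start the sending side, connecting to ip:port."""

    @abc.abstractmethod
    def receiver(self, port):
        """Start the receiving side, listening on port."""

class EchoScheme(SchemeWrapper):
    # Toy stand-in so the sketch runs; a real wrapper would exec the
    # scheme's own, unmodified binaries.
    def run_first(self):
        return "receiver"

    def sender(self, ip, port):
        return "sender -> %s:%d" % (ip, port)

    def receiver(self, port):
        return "receiver on %d" % port
```

With every scheme behind the same few calls, the harness can launch, measure, and tear down any of the 15+ algorithms in exactly the same way.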

  • Limitation

    • Only tests schemes at full throttle

    • Nodes are not necessarily representative

    • Does not measure interactions between different schemes (fairness, TCP-friendliness)

  • Calibrated emulators and pathological emulators

    • Simulator / emulator: reproducible and allows rapid experimentation

    • Open problem: what choice of parameter values faithfully emulates a particular target network?

    • Replication errors

      • Five parameters: a bottleneck link rate, a constant propagation delay, a DropTail threshold for the sender's queue, a loss rate, and a bit that selects a constant or Poisson-governed delivery rate

    • Steps

      • Collect a set of results over a particular network path on Pantheon

        • Avg throughput and 95th percentile delay of a dozen algorithms

      • Run Bayesian optimization

        • Run twice: constant rate and Poisson-governed rate

        • Objective function f(x): mean replication error

        • Prior: Gaussian process

        • Acquisition function: expected improvement
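Concretely, the calibration loop searches emulator parameters to minimize the mean replication error against measurements taken on the real path. A toy, self-contained sketch (illustrative numbers, a fake emulator, and plain random search standing in for GP-based Bayesian optimization with expected improvement; the constant-vs-Poisson bit is handled by running the whole search twice, as above):

```python
import random

# Measurements from a real Pantheon path to be matched: per scheme,
# (avg throughput in Mbit/s, 95th-percentile delay in ms).
# Schemes and numbers here are illustrative, not from the paper.
real = {"cubic": (80.0, 120.0), "bbr": (92.0, 90.0), "vegas": (60.0, 70.0)}

def emulate(scheme, params):
    # Stand-in for running the scheme over an emulator built from `params`;
    # a toy closed-form model so this sketch runs end to end.
    rate, delay, qsize, loss = params
    tput = min(rate, real[scheme][0]) * (1.0 - loss)
    p95_delay = 2.0 * delay + qsize
    return tput, p95_delay

def replication_error(params):
    # Mean relative error in throughput and p95 delay across schemes --
    # one plausible metric; the paper defines its own replication error.
    errs = []
    for scheme, (t_real, d_real) in real.items():
        t_emu, d_emu = emulate(scheme, params)
        errs.append(abs(t_emu - t_real) / t_real)
        errs.append(abs(d_emu - d_real) / d_real)
    return sum(errs) / len(errs)

# Search bounds: link rate (Mbit/s), propagation delay (ms),
# DropTail threshold (toy units), loss rate.
BOUNDS = [(10.0, 200.0), (5.0, 100.0), (10.0, 500.0), (0.0, 0.1)]

random.seed(0)
best = min((tuple(random.uniform(lo, hi) for lo, hi in BOUNDS)
            for _ in range(200)),
           key=replication_error)
```

A real run would replace emulate() with actual emulator experiments and the random sampler with a Gaussian-process optimizer.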

      • Pathological emulators

        • Very small buffer sizes

        • Severe ACK aggregation

        • Token-bucket policers
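Of the pathological behaviors, a token-bucket policer is easy to sketch: it drops (rather than queues) any packet that arrives when the bucket is empty. A minimal model, assuming one token per packet:

```python
class TokenBucketPolicer:
    """Toy token-bucket policer: a packet passes only if a token is free."""

    def __init__(self, rate_pps, burst):
        self.rate = rate_pps   # token refill rate (tokens/second)
        self.burst = burst     # bucket depth (max tokens)
        self.tokens = burst    # bucket starts full
        self.last = 0.0

    def admit(self, now):
        # Refill based on elapsed time, capped at the bucket depth,
        # then spend one token per admitted packet.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # policers drop excess packets instead of queuing them
```

With rate 10 packets/s and burst 5, a back-to-back burst of 20 packets at t=0 sees only the first 5 admitted, which is exactly the behavior a well-tuned scheme must survive.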

  • Ongoing projects: Vivace, Copa, Indigo, and more

    • Vivace: validating a new scheme in the real world

    • Copa: iterative design with measurements

    • Indigo: a machine learning design enabled by Pantheon

      • Model the problem as a sequential decision making problem

      • The sender observes congestion signals at every step, then takes an action to adjust the congestion window

      • Goal: learn the mapping from state to action, and encode the mapping into a model

      • Design

        • State: queueing delay, sending rate, receiving rate, window size, previous action

        • Model: 1-layer LSTM network (to capture history)
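The state above can be assembled from raw per-step measurements; one guess at how (field names and the queuing-delay estimate are mine):

```python
from collections import namedtuple

# Fields follow the talk's state list; the names are my own.
State = namedtuple("State",
                   ["queuing_delay", "send_rate", "recv_rate",
                    "cwnd", "prev_action"])

def make_state(rtt, min_rtt, send_rate, recv_rate, cwnd, prev_action):
    # Queuing delay estimated as current RTT minus the minimum RTT
    # observed so far -- a standard delay-based congestion signal.
    return State(rtt - min_rtt, send_rate, recv_rate, cwnd, prev_action)
```

Each step, a vector like this is fed to the 1-layer LSTM, whose hidden state carries the history across steps.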

      • CC-oracle

        • Outputs the action that brings the congestion window closest to the ideal size

        • Ideal size

          • Only exists in emulators (global view of the network)

          • BDP: for simple emulated links with a fixed bandwidth and min RTT

            • Compute the bandwidth-delay product and use it as the congestion window

          • Search around BDP otherwise

        • Imitation learning algorithm with DAgger
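A sketch of the oracle's labeling step in the simple-link case: compute the BDP as the ideal window, then pick the action that lands closest to it. The discrete action set here is hypothetical (the talk does not list Indigo's actual actions):

```python
# Hypothetical discrete cwnd adjustments a learned policy might choose among.
ACTIONS = [
    lambda w: w / 2.0,   # halve the window
    lambda w: w - 10.0,  # subtract 10 packets
    lambda w: w,         # no change
    lambda w: w + 10.0,  # add 10 packets
    lambda w: w * 2.0,   # double the window
]

def ideal_cwnd(bandwidth_pps, min_rtt_s):
    # Bandwidth-delay product in packets, valid for a fixed-rate link.
    return bandwidth_pps * min_rtt_s

def oracle_action(cwnd, bandwidth_pps, min_rtt_s):
    # The emulator's global view gives the true bandwidth and min RTT,
    # so the oracle can label any state with the best available action.
    target = ideal_cwnd(bandwidth_pps, min_rtt_s)
    return min(range(len(ACTIONS)),
               key=lambda i: abs(ACTIONS[i](cwnd) - target))
```

DAgger then repeatedly rolls out the current learner, labels the states it actually visits with oracle_action, aggregates those (state, action) pairs into the training set, and refits the LSTM on the aggregate.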
