pFabric: Minimal Near-Optimal Datacenter Transport

https://web.stanford.edu/~skatti/pubs/sigcomm13-pfabric.pdf

Abstract

  • pFabric: minimalistic datacenter transport design

    • Provides near theoretically optimal flow completion times even at the 99th percentile for short flows

    • Still minimizes average flow completion time for long flows

  • Key: datacenter transport should decouple flow scheduling from rate control

Introduction

Problem

  • Interactive soft real-time workloads demand low latency for each of the short request/response flows

    • But currently, flow completion time (FCT) is high

    • Reason: short flows get queued up behind bursts of packets from large flows of co-existing workloads

Existing solutions: rate control

  • Implicit: keep queues nearly empty through mechanisms like adaptive congestion control, ECN-based feedback, and pacing

    • Con: cannot precisely determine the right flow rates to optimally schedule flows

  • Explicit: compute and assign rates from the network to each flow in order to schedule the flows based on sizes or deadlines

    • Con: rather complex to implement because it requires detailed flow state at switches and coordination among switches to identify the bottleneck for each flow

Key:

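  • Priority: each packet carries a single priority number in its header

  • Switches: scheduling and dropping decisions depend only on the packet priority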

    • Priority scheduling: when a port is idle, the packet with the highest priority buffered at the port is dequeued and sent out

      • Starvation prevention: dequeue the earliest packet from the flow that has the highest priority packet in the queue

    • Priority dropping: when a packet arrives at a port with a full buffer, it is dropped if its priority is less than or equal to that of the lowest-priority packet in the buffer. Otherwise, the lowest-priority buffered packet is dropped to make room.
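The two switch rules above (priority scheduling with starvation prevention, and priority dropping) can be sketched as a single port model. This is only an illustration, not the paper's implementation: the names `Packet` and `PFabricPort` are hypothetical, a smaller `priority` number is assumed to mean higher priority (for example, remaining flow size), and "earliest packet" is assumed to mean earliest in arrival order at the port.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    flow_id: str
    seq: int       # position of the packet within its flow
    priority: int  # ASSUMPTION: smaller number = higher priority


class PFabricPort:
    """Hypothetical model of one switch port with a small buffer."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.buffer: List[Packet] = []  # kept in arrival order

    def enqueue(self, pkt: Packet) -> Optional[Packet]:
        """Buffer pkt. Return the packet that was dropped, or None."""
        if len(self.buffer) < self.capacity:
            self.buffer.append(pkt)
            return None
        # Buffer is full: locate the lowest-priority buffered packet
        # (largest priority number under our convention).
        worst = max(self.buffer, key=lambda p: p.priority)
        if pkt.priority >= worst.priority:
            # Arrival has priority less than or equal to the worst one: drop it.
            return pkt
        # Otherwise evict the worst buffered packet to make room.
        self.buffer.remove(worst)
        self.buffer.append(pkt)
        return worst

    def dequeue(self) -> Optional[Packet]:
        """Called when the port is idle. Return the next packet to send."""
        if not self.buffer:
            return None
        # Find the flow that owns the highest-priority packet...
        best = min(self.buffer, key=lambda p: p.priority)
        # ...then send that flow's earliest buffered packet
        # (starvation prevention).
        for i, p in enumerate(self.buffer):
            if p.flow_id == best.flow_id:
                return self.buffer.pop(i)
        return None
```

Scanning the whole buffer on every operation is acceptable in this sketch because the buffer is assumed to be small; a real switch would do the comparison in hardware.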

  • Lazy rate control

    • Flows start at line rate

    • Uses SACKs; for every packet ACK, does additive increase as in standard TCP

    • No fast retransmits. Packet drops are detected only by timeouts.

    • If a fixed threshold number of consecutive timeouts occurs, the flow enters probe mode, where it periodically retransmits a minimum-sized packet with a 1-byte payload, and it re-enters slow start once it receives an ACK.
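The lazy rate control rules above can be sketched as a small sender-side state machine. This is a hedged illustration only: the class name `LazyRateControl`, the parameter `line_rate_window` (a window, in packets, assumed large enough to fill the link), and the default `probe_threshold` of 5 are assumptions made for the example. The reaction to an ordinary timeout (halve `ssthresh`, reset the window to 1) follows standard TCP and is also an assumption, since these notes do not spell it out.

```python
class LazyRateControl:
    """Hypothetical sender-side sketch of pFabric-style rate control."""

    def __init__(self, line_rate_window: float, probe_threshold: int = 5) -> None:
        # Flows start at line rate: open the window fully from the start.
        self.cwnd = line_rate_window
        # ASSUMPTION: start in additive-increase mode rather than slow start.
        self.ssthresh = line_rate_window
        self.probe_threshold = probe_threshold  # ASSUMPTION: illustrative value
        self.consecutive_timeouts = 0
        self.probe_mode = False

    def on_ack(self) -> None:
        """Handle one acknowledged packet."""
        self.consecutive_timeouts = 0
        if self.probe_mode:
            # A probe got through: leave probe mode and re-enter slow start.
            self.probe_mode = False
            self.cwnd = 1.0
            return
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0              # slow start
        else:
            self.cwnd += 1.0 / self.cwnd  # additive increase, as in standard TCP

    def on_timeout(self) -> None:
        """Handle a retransmission timeout (the only loss signal)."""
        if self.probe_mode:
            return  # keep sending periodic min-sized probes
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts >= self.probe_threshold:
            # Too many timeouts in a row: fall back to probe mode.
            self.probe_mode = True
            return
        # ASSUMPTION: standard TCP reaction to a timeout.
        self.ssthresh = max(self.cwnd / 2.0, 1.0)
        self.cwnd = 1.0
```

There is no duplicate-ACK or fast-retransmit path in the sketch, which mirrors the notes: the switch priorities do the scheduling, so the sender only needs to back off when drops become persistent.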
