pFabric

Interactive soft real-time applications demand low latency for every short request and response flow. The flow completion time (FCT) for short flows in deployed TCP-based fabrics is too long: tens of milliseconds, compared to 10-20 microseconds in theory. To reduce FCT, recent datacenter transport designs either use implicit signals for rate control or compute explicit rates to schedule flows. These approaches are either imprecise in determining the flow rates needed to schedule flows optimally, or complex to implement in practice.
The problem is how to minimize the average FCT (or maximize the number of deadlines met) for datacenter workloads with a minimal, practical design, i.e. one that ensures short, high-priority flows see very low latency while long flows fully utilize the network.

The paper presents a minimalistic datacenter transport design that provides near theoretically optimal flow completion times even at the 99th percentile for short flows, while still minimizing average flow completion time for long flows.
The key insight is to decouple flow scheduling from rate control.
On the scheduling side, each packet carries a single priority number in its header, set independently by each flow (for example, from the flow's remaining size or its deadline). Switches have very small buffers. Each switch schedules and drops packets greedily and locally, based only on the priority carried in the packets: it sends the highest-priority packet first and drops the lowest-priority packet when the buffer is full. Priority scheduling lets short flows complete with low latency.
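The switch behavior described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `Packet` and `PFabricPort` are hypothetical, the priority is assumed to be the flow's remaining size (a lower number is more urgent), and the buffer is a plain list scanned linearly, which is only reasonable because the buffer is assumed to be very small.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    flow_id: str
    priority: int  # assumption: remaining flow size, lower = more urgent


class PFabricPort:
    """Hypothetical model of one switch port with a tiny priority buffer."""

    def __init__(self, capacity):
        self.capacity = capacity  # buffer size in packets (illustrative)
        self.buffer = []

    def enqueue(self, pkt):
        """Buffer pkt; return the packet that was dropped, or None."""
        if len(self.buffer) < self.capacity:
            self.buffer.append(pkt)
            return None
        # Buffer is full: find the least urgent packet already buffered.
        worst = max(self.buffer, key=lambda p: p.priority)
        if pkt.priority >= worst.priority:
            return pkt  # the arriving packet is the least urgent one
        self.buffer.remove(worst)
        self.buffer.append(pkt)
        return worst

    def dequeue(self):
        """Send the most urgent buffered packet, or None if empty."""
        if not self.buffer:
            return None
        best = min(self.buffer, key=lambda p: p.priority)
        self.buffer.remove(best)
        return best


port = PFabricPort(capacity=2)
port.enqueue(Packet("long", 1000))
port.enqueue(Packet("long", 999))
dropped = port.enqueue(Packet("short", 3))  # evicts the priority-1000 packet
first_out = port.dequeue()                  # the short flow's packet goes first
```

The port keeps no per-flow state and makes every decision from the priorities of the packets it currently holds, which is the sense in which scheduling is greedy and local.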
On the rate control side, the mechanism is minimal: all flows start at line rate and throttle their sending rate only when they observe high and persistent loss. This is enough to avoid congestion collapse while keeping the network fully utilized.
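To make the "start at line rate, back off only on persistent loss" idea concrete, here is a rough sender-side sketch. Everything specific in it is an assumption made for illustration: the class name `MinimalRateControl`, the use of consecutive timeouts as the signal for persistent loss, the threshold value, and the one-packet-per-ACK recovery rule are not taken from the paper.

```python
class MinimalRateControl:
    """Hypothetical sender: full window from the start, throttle rarely."""

    def __init__(self, line_rate_window, loss_threshold=5):
        # Window (in packets) assumed large enough to send at line rate.
        self.line_rate_window = line_rate_window
        # Consecutive timeouts treated as "persistent loss" (illustrative).
        self.loss_threshold = loss_threshold
        self.window = line_rate_window  # no ramp-up phase
        self.consecutive_timeouts = 0

    def on_ack(self):
        # Any progress clears the loss streak; grow back toward line rate.
        self.consecutive_timeouts = 0
        self.window = min(self.window + 1, self.line_rate_window)

    def on_timeout(self):
        # Isolated losses are ignored; only a sustained streak throttles.
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts >= self.loss_threshold:
            self.window = 1


rc = MinimalRateControl(line_rate_window=12, loss_threshold=3)
rc.on_timeout()
rc.on_timeout()
window_after_two = rc.window    # still 12: loss is not yet persistent
rc.on_timeout()
window_after_three = rc.window  # 1: throttled after the third timeout
```

The point of the sketch is what it leaves out: there is no rate computation and no reaction to individual drops, because the switches already decide which packets matter.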

Simplicity: in principle, pFabric requires no flow state or complex rate calculations at the switches, no large switch buffers, no explicit network feedback, and no complex congestion control mechanisms at the end hosts.
Performance: extensive simulations show that pFabric achieves near-optimal flow completion times, both on average and at the 99th percentile for short flows, at loads as high as 80% of the network fabric capacity.

All results come from simulations; no results from a prototype implementation are shown, though there is a section analyzing the feasibility of an implementation.
Strictly prioritizing small flows may starve long flows, and a malicious sender can split a large flow into small ones to gain an advantage; the current design handles neither case.
It requires modifying both the end hosts and the switches. Switches need to maintain a priority queue. End-host applications need to indicate the flow size or deadline, which might not be known when the flow starts.

For the benchmarking workloads, I wonder whether the web search and data mining workloads are representative enough.
What if the priorities are arbitrarily set by the applications (not corresponding to the deadline or size)?

Yes! Clearly structured, with intuitive and elegant design principles, a clean presentation, and extensive experimental results. I also enjoyed reading the conceptual view of flow scheduling over a datacenter fabric.

