Data center TCP (DCTCP)

https://dl.acm.org/doi/10.1145/1851182.1851192

Problems being solved

Motivation

  • Cloud data centers, especially those hosting soft real-time applications, generate a mix of workloads:

    • Small predictable latency

    • Large sustained throughput

  • In this environment, state-of-the-art TCP falls short

Problems solved / improved

  • Higher throughput using less buffer space

  • High burst tolerance and low latency for short flows

  • Handles 10x the current background traffic, without impacting foreground traffic

Metrics of success

Partition/Aggregate workflow pattern:

  • Low latency for short flows

  • High burst tolerance

Continuous updates to the applications' internal data structures:

  • High utilization / throughput for long flows

Key innovations

  • Measure and analyze production traffic from data centers whose networks are built from commodity switches

    • Impairments that hurt performance are identified and linked to properties of the traffic and the switches

  • DCTCP, a protocol that addresses these impairments to meet the needs of these applications

    • Goal: keep switch buffer occupancies persistently low while maintaining high throughput for long flows

    • Use Explicit Congestion Notification (ECN)

    • Combine ECN with a novel control scheme at the sources

      • Extract multi-bit feedback on the extent of congestion from the single-bit stream of ECN marks

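A minimal sketch of that source-side control scheme, in Python (the update rules follow the paper's equations; the class scaffolding and parameter defaults are illustrative, not the authors' code): the sender tracks alpha, an estimate of the fraction of marked packets, and cuts its window in proportion to alpha instead of halving it.

```python
# Sketch of the DCTCP sender reaction (update rules per the paper; the
# scaffolding here is illustrative, not the authors' implementation).

class DCTCPSender:
    def __init__(self, cwnd=10.0, g=1 / 16):
        self.cwnd = cwnd    # congestion window, in packets
        self.g = g          # EWMA gain for the alpha estimate
        self.alpha = 0.0    # estimated fraction of marked packets

    def on_window_acked(self, acked: int, marked: int) -> None:
        """Called once per window of data: `acked` packets were ACKed,
        `marked` of them carried the ECN-Echo flag."""
        frac = marked / acked if acked else 0.0
        # alpha <- (1 - g) * alpha + g * F  -- the multi-bit congestion
        # estimate extracted from single-bit ECN marks
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked:
            # cwnd <- cwnd * (1 - alpha / 2): a gentle cut when marks are
            # rare, a TCP-like halving when every packet is marked
            self.cwnd *= 1 - self.alpha / 2
        else:
            self.cwnd += 1  # standard additive increase
```

When every packet in a window is marked, alpha approaches 1 and the cut converges to TCP's halving; under light marking the window barely shrinks, which is what keeps long-flow throughput high while queues stay short.
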
Communications in data centers

Partition / Aggregate (Query)

  1. Motivates why latency is a critical metric

    1. Delay sensitive

  2. all-up SLA

    1. Lagging instances of partition/aggregate can thus add up to threaten the all-up SLAs for queries

    2. When a node misses its deadline, the computation continues without that response, lowering the quality of the result.

    3. Many applications find it difficult to meet these deadlines using state-of-the-art TCP, so developers often resort to complex, ad-hoc solutions

  3. Missed deadline: lower quality result

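A hypothetical sketch of this pattern (the function names, worker count, and 10 ms deadline are made up for illustration): the aggregator fans the query out, waits only until the deadline, and returns whatever has arrived, so a lagging worker lowers result quality instead of stalling the query.

```python
# Hypothetical illustration of partition/aggregate under an all-up deadline:
# late worker responses are dropped, lowering result quality rather than
# blocking the query.
from concurrent.futures import ThreadPoolExecutor, wait

def query_worker(worker_id: int, query: str) -> str:
    # In practice this is a network call to a worker node.
    return f"partial-result-{worker_id}"

def aggregate(query: str, n_workers: int = 40, deadline_s: float = 0.010):
    pool = ThreadPoolExecutor(max_workers=n_workers)
    futures = [pool.submit(query_worker, i, query) for i in range(n_workers)]
    done, _ = wait(futures, timeout=deadline_s)
    pool.shutdown(wait=False)          # do not wait for stragglers
    return [f.result() for f in done]  # only answers that made the deadline
```
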
Short messages (50KB-1MB) (Coordination, Control State)

  • Delay sensitive

Large flows (1MB-50MB) (Data update)

  • Throughput sensitive

Performance impairments

  • Shallow packet buffers cause three performance impairments

    • Incast

      • If many flows converge on the same interface of a switch over a short period of time, the packets may exhaust either the switch memory or the maximum permitted buffer for that interface, resulting in packet losses.

      • This can occur even if the flow sizes are small

      • The Partition/Aggregate pattern causes incast by design: the request for data synchronizes the workers' responses, which converge on the queue of the switch port connected to the aggregator

  • Queue buildup

    • Long flows build up queues at shared ports, so short flows sharing a port suffer extra queuing delay even when no packets are lost

  • Buffer pressure

    • Long flows on one port consume the switch's shared buffer pool, leaving less headroom to absorb bursts arriving at other ports

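To make the incast arithmetic concrete, a hypothetical back-of-the-envelope check (all numbers below are made up, not measurements from the paper): synchronized responses overflow a port once they exceed the buffer plus what the link can drain within one RTT.

```python
# Hypothetical back-of-the-envelope incast check: do N synchronized
# responses overflow a shallow port buffer within one RTT?
def incast_overflow_bytes(n_workers: int, response_bytes: int,
                          buffer_bytes: int, link_bps: float,
                          rtt_s: float) -> float:
    arriving = n_workers * response_bytes  # bytes converging on the port
    drained = (link_bps / 8) * rtt_s       # bytes the link drains in one RTT
    return max(0.0, arriving - drained - buffer_bytes)

# 40 workers x 2 KB responses into a 1 Gbps port with ~30 KB of buffer:
# even tiny flows overflow a shallow buffer when they are synchronized.
print(incast_overflow_bytes(40, 2_000, 30_000, 1e9, 100e-6))  # ~37500.0
```
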
Requirements

Algorithm

  • The TCP/ECN control loop

    • Conventional TCP reacts to ECN at most once per window of data: a single mark triggers the same halving of cwnd as a loss, however mild the congestion (the marking rule that feeds this loop is sketched after this list)

  • DC vs. WAN

    • Round-trip times (RTTs) are far smaller than in the WAN

    • Applications simultaneously need extremely high bandwidths and very low latencies

    • Little statistical multiplexing

    • Network: largely homogeneous and under a single administrative control

      • Backward compatibility, incremental deployment and fairness to legacy protocols are not major concerns

  • Resulting rule of thumb: low variance in sending rates means small buffers suffice

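The marking rule referenced above: per the paper, the switch CE-marks an arriving packet whenever the instantaneous queue occupancy exceeds a threshold K, rather than dropping it. The toy queue model below is illustrative, not the authors' implementation.

```python
# Sketch of DCTCP's switch-side marking (the threshold rule is from the
# paper; this toy queue model is not). Packets are CE-marked, not dropped,
# whenever the instantaneous queue exceeds K.
from collections import deque

K = 20  # marking threshold in packets (tuned per link speed in the paper)

def enqueue(queue: deque, packet: dict, capacity: int) -> bool:
    if len(queue) >= capacity:
        return False         # buffer exhausted: the packet is dropped
    if len(queue) >= K:
        packet["ce"] = True  # instantaneous queue over K: mark, don't drop
    queue.append(packet)
    return True
```

Because marking starts well before the buffer fills, senders learn of congestion early and keep their rates steady, which is exactly why small buffers suffice when sending-rate variance is low.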