Data center TCP (DCTCP)
https://dl.acm.org/doi/10.1145/1851182.1851192
Problems being solved
Motivation
Cloud data centers, especially those hosting soft real-time applications, generate a mix of workloads that need:
Small predictable latency
Large sustained throughput
In this environment, state-of-the-art TCP falls short
Problems solved / improved
Higher throughput using less buffer space
High burst tolerance and low latency for short flows
Handles 10x the current background traffic, without impacting foreground traffic
Metrics of success
Partition/Aggregate workflow pattern:
Low latency for short flows
High burst tolerance
Continuously updating the applications' internal data structures requires:
High utilization / throughput for long flows
Key innovations
Measure and analyze production traffic from data centers whose networks are built from commodity switches
Impairments that hurt performance are identified and linked to properties of the traffic and the switches
Propose DCTCP, which addresses these impairments to meet the needs of the applications
Goal: keep switch buffer occupancies persistently low while maintaining high throughput for the long flows
Use Explicit Congestion Notification (ECN)
Combine ECN with a novel control scheme at the sources
Extract multibit feedback on congestion in the network from the single bit stream of ECN marks
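Concretely, the sender turns the mark stream into multibit feedback with a per-window update (F is the fraction of packets marked in the last window of data, g a small weight):

```latex
\alpha \leftarrow (1-g)\,\alpha + g\,F
\qquad\text{and, on a congestion signal,}\qquad
\mathrm{cwnd} \leftarrow \mathrm{cwnd}\left(1-\frac{\alpha}{2}\right)
```

A lightly marked window (small α) causes only a mild back-off, while α ≈ 1 recovers TCP's halving.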
Communications in data centers
Partition / Aggregate (Query)
Motivates why latency is a critical metric
Delay sensitive
all-up SLA
lagging instances of partition / aggregate can thus add up to threaten the all-up SLAs for queries
When a node misses its deadline, the computation continues without that response, lowering the quality of the result.
Many applications find it difficult to meet these deadlines using state-of-the-art TCP, so developers often resort to complex, ad-hoc solutions
Missed deadline: lower quality result
Short messages (50KB-1MB) (Coordination, Control State)
Delay sensitive
Large flows (1MB-50MB) (Data update)
Throughput sensitive
Performance impairments
Shallow packet buffers cause three performance impairments
Incast
If many flows converge on the same interface of a switch over a short period of time, the packets may exhaust either the switch memory or the maximum permitted buffer for that interface, resulting in packet losses.
This can occur even if the flow sizes are small
The Partition/Aggregate design pattern naturally causes incast: the request for data synchronizes the workers' responses, which all converge on the queue of the switch port connected to the aggregator (see the back-of-envelope sketch after the three impairments)
Queue buildup: long flows sharing a port with short flows drive up queue occupancy, so short-flow packets wait behind a long queue and see increased latency even when nothing is dropped
Buffer pressure: the packet buffer is shared across ports, so long flows on other ports eat into the shared pool and leave less headroom to absorb bursts on the ports serving short flows
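A back-of-envelope sketch of incast on a shallow-buffered port; every number below is an illustrative assumption, not a measurement from the paper:

```python
# Illustrative incast arithmetic. WORKERS, RESPONSE_BYTES and
# PORT_BUFFER_PKTS are assumed values chosen for the example, not
# figures from the paper.

WORKERS = 40                # workers answering a single aggregator request
RESPONSE_BYTES = 10 * 1024  # ~10 KB response per worker (assumed)
MTU = 1500                  # bytes per packet
PORT_BUFFER_PKTS = 100      # packets of buffering at the aggregator-facing
                            # port of a shallow-buffered switch (assumed)

pkts_per_response = -(-RESPONSE_BYTES // MTU)   # ceiling division
burst = WORKERS * pkts_per_response             # synchronized burst size
print(f"synchronized burst: {burst} packets, buffer: {PORT_BUFFER_PKTS} packets")
if burst > PORT_BUFFER_PKTS:
    print("burst exceeds the buffer -> drops -> retransmission timeouts (incast)")
```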
Requirements: low latency and burst tolerance for short flows, high throughput for long flows, all while keeping switch buffer occupancy low
Algorithm
Builds on the TCP/ECN control loop, with changes in three places:
At the switch: set the CE codepoint on an arriving packet whenever the instantaneous queue length exceeds a threshold K
At the receiver: with delayed ACKs, one cumulative ACK covers a batch of packets but carries only one ECN-Echo bit, so the receiver runs a small state machine (sending an immediate ACK whenever the CE marking state changes) to convey the exact sequence of marks back to the sender
At the sender: maintain a running estimate of the fraction of marked packets and cut the congestion window in proportion to it, instead of halving it on every congestion signal
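A minimal toy simulation of this loop for a single flow; the marking threshold K, the weight g, and the drain rate are assumed values, not the paper's parameters:

```python
# Toy simulation of the DCTCP control loop for one flow. The switch marks
# packets when its instantaneous queue exceeds K; once per RTT the sender
# updates alpha from the fraction of marked packets and cuts cwnd by
# alpha/2. K, G and CAPACITY are illustrative assumptions.

K = 20            # marking threshold, in packets (assumed)
G = 1.0 / 16      # EWMA weight g for the alpha estimate (assumed small value)
CAPACITY = 10     # packets the link drains per RTT (assumed)

def run(rtts: int = 40) -> None:
    cwnd, alpha, queue = 10.0, 0.0, 0

    for rtt in range(rtts):
        sent = int(cwnd)
        marked = 0
        for _ in range(sent):
            queue += 1
            if queue > K:                  # switch sets CE when queue exceeds K
                marked += 1
        queue = max(queue - CAPACITY, 0)   # link drains CAPACITY packets per RTT

        frac = marked / sent if sent else 0.0
        alpha = (1 - G) * alpha + G * frac       # running estimate of marked fraction
        if marked:
            cwnd = max(cwnd * (1 - alpha / 2), 1.0)  # cut in proportion to alpha
        else:
            cwnd += 1.0                              # additive increase otherwise

        print(f"rtt={rtt:2d} sent={sent:3d} marked={marked:3d} "
              f"alpha={alpha:.3f} cwnd={cwnd:5.1f}")

if __name__ == "__main__":
    run()
```

The point of the proportional cut: a window with few marks triggers only a small back-off, so long flows keep their throughput while queues stay short; a fully marked window degenerates to TCP's halving.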
Related assumptions (or limitations?)
DC vs. WAN
Round-trip times (RTTs) are far smaller than in the WAN (on the order of hundreds of microseconds)
Applications simultaneously need extremely high bandwidths and very low latencies
Little statistical multiplexing
Network: largely homogeneous and under a single administrative control
Backward compatibility, incremental deployment and fairness to legacy protocols are not major concerns
Rule of thumb that follows: low variance in sending rates → small buffers suffice
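A rough arithmetic sketch of the contrast with the classic buffer-sizing rule; the link speed, RTT, and flow count below are assumed values:

```python
# Rough buffer-sizing arithmetic. C_BPS, RTT_S and N_FLOWS are assumed,
# illustrative values, not measurements from the paper.

C_BPS = 10e9      # 10 Gbps link (assumed)
RTT_S = 100e-6    # ~100 microsecond data center RTT (assumed)
N_FLOWS = 2       # little statistical multiplexing inside a rack (assumed)

bdp = C_BPS * RTT_S / 8                 # bandwidth-delay product, in bytes
tcp_buffer = bdp / (N_FLOWS ** 0.5)     # classic C*RTT/sqrt(n) sizing rule
print(f"bandwidth-delay product : {bdp / 1e3:.0f} KB")
print(f"C*RTT/sqrt(n) buffer    : {tcp_buffer / 1e3:.0f} KB")
# With few flows, sqrt(n) barely shrinks the requirement, so classic TCP
# still wants close to a full BDP of buffering. DCTCP's low sending-rate
# variance lets it keep queues near a marking threshold that is only a
# small fraction of C*RTT.
```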