# Data center TCP (DCTCP)

### Problems being solved

#### Motivation

* Cloud data centers, especially those hosting soft real-time applications, generate a mix of workloads requiring:
  * Small, predictable latency
  * Large, sustained throughput
* In this environment, state-of-the-art TCP falls short

#### Problems solved / improved

* Higher throughput using less buffer space
* High burst tolerance and low latency for short flows
* Handles 10x the current background traffic without impacting foreground traffic

### Metrics of success

Partition/Aggregate workflow pattern:

* Low latency for short flows
* High burst tolerance

Continuously updating the applications' internal data structures requires:

* High utilization / throughput for long flows

### Key innovations

* Measure and analyze production traffic from data centers whose networks are built from commodity switches
  * Impairments that hurt performance are identified and linked to the properties of the traffic and switches
* DCTCP, a protocol that addresses these impairments to meet the needs of applications
  * Goal: keep switch buffer occupancies persistently low while maintaining high throughput for long flows
  * Use Explicit Congestion Notification (ECN)
  * Combine ECN with a novel control scheme at the sources
    * Extract multibit feedback on congestion in the network from the single-bit stream of ECN marks
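The "multibit feedback from single-bit marks" idea can be sketched as a simple estimator: each window, the sender counts what fraction of its ACKed packets carried an ECN mark, then smooths that fraction with an exponential weighted moving average. A minimal sketch (the gain `g = 1/16` is one of the small values the paper discusses; the loop below is just an illustration):

```python
def update_alpha(alpha, marked, total, g=1.0 / 16):
    """One window's worth of feedback, smoothed with an EWMA.
    alpha  -- current congestion estimate in [0, 1]
    marked -- ACKed packets that carried an ECN-Echo mark this window
    total  -- all ACKed packets this window
    g      -- EWMA gain (small, so alpha reacts over ~1/g windows)
    """
    frac = marked / total if total else 0.0
    return (1 - g) * alpha + g * frac

# If 25% of packets are marked window after window,
# alpha converges toward 0.25 -- a fine-grained congestion level,
# recovered from nothing but per-packet yes/no marks.
alpha = 0.0
for _ in range(200):
    alpha = update_alpha(alpha, marked=5, total=20)
print(round(alpha, 3))  # 0.25
```

The single bit per packet carries almost no information on its own; averaging over a window of packets is what turns it into a multibit congestion estimate.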

### Communications in data centers

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYbkpyf2OKZfRMo0Z4U%2F-MYbqmw571_TnBd7A8Zh%2Fimage.png?alt=media\&token=9dd36a28-e9b4-4ae2-94ee-3b8089a67c3d)

#### Partition / Aggregate (Query)

1. Motivates why latency is a critical metric
   1. Delay sensitive
2. All-up SLA
   1. Lagging instances of partition / aggregate can add up to threaten the all-up SLAs for queries
   2. When a node misses its deadline, the computation continues without that response, lowering the quality of the result
   3. Many applications find it difficult to meet these deadlines using state-of-the-art TCP, so developers often resort to complex, ad-hoc solutions
3. Missed deadline → lower-quality result

#### Short message (50KB-1MB) (Coordination, Control State)

* Delay sensitive

#### Large flows (1MB-50MB) (Data update)

* Throughput sensitive

### Performance impairments

* Shallow packet buffers cause three performance impairments
  * Incast
    * If many flows converge on the same interface of a switch over a short period of time, the packets may exhaust either the switch memory or the maximum permitted buffer for that interface, resulting in packet losses
    * This can occur even if the flow sizes are small
    * The Partition/Aggregate design pattern induces this: the request for data synchronizes the workers' responses, creating incast at the queue of the switch port connected to the aggregator

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYbkpyf2OKZfRMo0Z4U%2F-MYc0Ubv7KPtgVmtzq8f%2Fimage.png?alt=media\&token=72a9f790-05ed-4151-9d22-1ef279b0be9a)
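The incast scenario is, at its core, simple arithmetic: N synchronized responses must fit in the buffer of the single port facing the aggregator. A toy calculation (the worker count, response size, and buffer size below are illustrative values, not numbers from the paper; the model ignores buffer drain during the burst):

```python
def incast_overflow(n_workers, response_bytes, port_buffer_bytes):
    """Bytes that overflow a switch port buffer when n_workers'
    synchronized responses arrive at once (toy model: the burst is
    treated as instantaneous, so no draining happens during it)."""
    burst = n_workers * response_bytes
    return max(0, burst - port_buffer_bytes)

# 40 workers x 2 KB responses against a 64 KB shallow port buffer:
print(incast_overflow(40, 2048, 64 * 1024))  # 16384 bytes lost
```

Even tiny responses overflow a shallow buffer once enough workers answer in lockstep, which is why incast hurts despite small flow sizes.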

* Queue buildup: long flows build up queue occupancy at a port, so short flows sharing that queue suffer increased latency even when no packets are lost

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYbkpyf2OKZfRMo0Z4U%2F-MYc1-0F9gImQV86WBPY%2Fimage.png?alt=media\&token=0cec4799-d520-41e9-91a4-b0e072f9814c)

* Buffer pressure: on shared-memory switches, long flows on one port consume buffer space shared across ports, leaving less headroom to absorb bursts on other ports

### Requirements

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYbkpyf2OKZfRMo0Z4U%2F-MYc1vU3r20RB3hsNdDd%2Fimage.png?alt=media\&token=85341b0d-1e7c-48bb-80d3-e4e787f797db)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYbkpyf2OKZfRMo0Z4U%2F-MYc26hzmCLbJoPJKfBm%2Fimage.png?alt=media\&token=e329c944-37db-431e-b607-6b9db8be37ba)

### Algorithm

* Builds on the TCP/ECN control loop
  * Switch: marks arriving packets with CE once the instantaneous queue exceeds a threshold K
  * Receiver: echoes the marks back so the sender learns which packets were marked
  * Sender: each window (one batch of packets yields one round of ECN feedback), estimates the fraction of marked packets and cuts cwnd in proportion to that fraction, rather than halving it
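A minimal sketch of the sender side of this loop. The per-window α update and the proportional cut `cwnd ← cwnd × (1 − α/2)` follow the paper's description; everything else (the class shape, initial values, the one-packet additive increase) is a simplified stand-in for a real kernel TCP implementation:

```python
class DctcpSender:
    """Toy DCTCP sender: per-window alpha update and proportional cut."""

    def __init__(self, cwnd=10.0, g=1.0 / 16):
        self.cwnd = cwnd    # congestion window, in packets
        self.alpha = 0.0    # smoothed estimate of fraction marked
        self.g = g          # EWMA gain

    def on_window_acked(self, marked, total):
        """Called once per window of ACKs."""
        frac = marked / total if total else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked:
            # Proportional cut: gentle when alpha is small,
            # approaching TCP's halving only as alpha -> 1.
            self.cwnd *= 1 - self.alpha / 2
        else:
            self.cwnd += 1  # standard additive increase


def switch_marks(queue_len, k):
    """Switch side: mark CE when the instantaneous queue exceeds K."""
    return queue_len > k
```

Example: an unmarked window grows `cwnd` from 10 to 11; a fully marked window then nudges α up to 1/16 and trims `cwnd` by only ~3%, rather than halving it. This mild reaction to the first signs of congestion is what keeps throughput high while queues stay short.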

### Related assumptions (or limitations?)

* DC vs. WAN
  * Round-trip times (RTTs) are much shorter
  * Applications simultaneously need extremely high bandwidths and very low latencies
  * Little statistical multiplexing
  * Network is largely homogeneous and under a single administrative control
    * Backward compatibility, incremental deployment, and fairness to legacy protocols are not major concerns
* Rule of thumb: low variance in sending rates → small buffers suffice
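The buffering rule of thumb can be made concrete. Classic TCP sawtooths need buffers on the order of the bandwidth-delay product to stay busy, whereas the paper's analysis shows DCTCP's low-variance rates let a marking threshold of about C×RTT/7 packets suffice. A quick calculation (the 10 Gbps link and 100 µs RTT below are illustrative data-center numbers, not taken from the paper):

```python
def bdp_packets(link_gbps, rtt_us, pkt_bytes=1500):
    """Bandwidth-delay product of a link, in full-size packets."""
    bits = link_gbps * 1e9 * (rtt_us * 1e-6)
    return bits / 8 / pkt_bytes

# 10 Gbps link, 100 us data-center RTT:
bdp = bdp_packets(10, 100)
print(round(bdp))       # 83 packets of BDP
print(round(bdp / 7))   # marking threshold K of ~12 packets suffices
```

A marking threshold of roughly a dozen packets is well within even a shallow commodity switch buffer, which is why DCTCP can deliver full throughput without deep buffers.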
