# Understanding host network stack overheads

* Network & Host Hardware Trends
  * For the internet and early-generation datacenters
    * Bottlenecks were in the network
    * Challenge: sharing network resources
      * Switch buffers, switch bandwidth
* Exponentially increasing bandwidth
  * Stagnant CPU capacity (slowdown of Moore's Law)
  * For high-speed networks
    * Bottlenecks have moved to the host
    * Challenge: sharing host resources
      * CPU cores
      * DRAM bandwidth
      * LLC capacity
* Solutions: Linux stack optimization, RDMA, hardware offloads, userspace stacks
* Goal: guide the design space with a detailed understanding of today's stack
* Methodology & Experimental Scenarios
  * Goal: understand the CPU overheads of the host network stack
  * Want to push the bottlenecks to the network stack
  * Measure: throughput, CPU utilization & breakdown, cache miss rate
  * Impact of various factors:
    * Optimization techniques: TSO/GRO, jumbo frames, aRFS
    * HW configurations: DDIO, IOMMU
    * Traffic patterns: single flow, incast, one-to-one, outcast, all-to-all
    * Flow types: long flows, short flows, mixture
    * Network drops
    * Congestion control protocols
* Main lessons from our study
  * **Bottlenecks have shifted from packet processing to data copy**
    * For 40 Gbps NICs, a single CPU core could saturate the access link bandwidth
    * Multiple cores needed to saturate 100 Gbps access link bandwidth
    * Possible solution: **zero-copy techniques** like TCP mmap/AF\_XDP
      * Implementation overhead
  * **The NIC DMA pipeline has become inefficient**
    * The NIC overwrites data before the application reads it
    * A high cache miss rate is the core reason for the inefficiency of the NIC DMA pipeline
    * Large TCP buffers increase the delay from packet RX to data copy
    * More NIC Rx descriptors lead to a higher chance of cache eviction
    * Enabling IOMMU further degrades performance
      * Additional per-page operations
    * Possible solution: **TCP buffer size calculation** must take host resources (like L3 cache, packet processing latency) into account
    * Possible solution: **decouple** data copy and packet processing (so as to scale them independently)
    * Possible solution: an efficient cache replacement policy
  * **Host resource sharing leads to further performance degradation**
    * Contention among multiple flows for host resources aggravates the host bottleneck
    * Cache contention degrades the throughput per core
      * Possible solution: receiver-driven protocols for orchestrating the receiver's caches
    * Bandwidth contention further degrades performance
    * Higher scheduling overheads
    * GRO benefits shrink as the number of flows per core increases
    * Possible solution: **receiver-driven** transport protocols for orchestrating the receiver's bandwidth
  * **Colocation of short and long flows further degrades performance**
    * When flows are colocated, both long and short flows suffer (performance degrades by 48% / 42%)
    * Long and short flows have different bottlenecks
      * TCP/IP: overheads increase as we decrease the flow size
      * Data copy: optimizing it won't improve short flows as much as long flows
    * Possible solution: design different packet processing pipelines for short and long flows
    * Possible solution: design **application- and network-aware** CPU schedulers
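
The suggestion that TCP buffer sizing should account for host resources can be illustrated with a small back-of-the-envelope sketch. It is not the paper's algorithm: the function names, the idea of capping the buffer at a per-flow share of the cache, and all numbers (100 Gbps link, 100 µs RTT, 3 MiB of cache usable for NIC DMA, 1024 Rx descriptors of 9000 bytes) are assumptions chosen only for illustration.

```python
def bdp_bytes(link_gbps: int, rtt_us: int) -> int:
    """Bandwidth-delay product in bytes (Gbps * microseconds = kilobits)."""
    return link_gbps * rtt_us * 1000 // 8


def dma_footprint_bytes(rx_descriptors: int, frame_bytes: int) -> int:
    """Upper bound on bytes the NIC can DMA before Rx buffers are recycled."""
    return rx_descriptors * frame_bytes


def cache_aware_rcvbuf(bdp: int, dma_cache_bytes: int, flows_per_core: int = 1) -> int:
    """Hypothetical policy: cap the receive buffer at this flow's share of
    the cache available for NIC DMA, instead of using the BDP alone."""
    cache_share = dma_cache_bytes // max(flows_per_core, 1)
    return min(bdp, cache_share)


# Assumed, illustrative values (not measurements from the study).
DMA_CACHE = 3 * 1024 * 1024            # cache capacity usable for NIC DMA
bdp = bdp_bytes(link_gbps=100, rtt_us=100)

print("BDP:", bdp)
print("Rx ring footprint:", dma_footprint_bytes(1024, 9000))
print("1 flow/core buffer:", cache_aware_rcvbuf(bdp, DMA_CACHE, 1))
print("4 flows/core buffer:", cache_aware_rcvbuf(bdp, DMA_CACHE, 4))
```

Under these assumptions the Rx ring alone can hold more bytes than the cache, which matches the note that more Rx descriptors raise the chance of eviction, and the per-flow cap tightens as more flows share a core.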

![A Linux Network Stack Data Path  ](/files/Ng3TwJOmkDYlBRmb8zi1)
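
The lesson that GRO benefits shrink with more flows per core can be seen in a toy model of the receive path. The sketch below assumes a fixed batch of packets per poll, packets spread evenly across flows, merging only within a flow, and a 64 KiB cap per merged segment. The batch size, MSS, and function name are illustrative assumptions, not kernel parameters taken from the study.

```python
import math

GRO_MAX_BYTES = 65536  # assumed upper bound on one merged segment


def avg_packets_per_gro_segment(flows: int, batch_pkts: int = 64, mss: int = 1448) -> float:
    """Average number of packets merged into one segment in a single batch.

    Toy model: the batch is split evenly across flows, and packets are only
    merged with other packets of the same flow.
    """
    max_pkts = GRO_MAX_BYTES // mss          # packets that fit in one segment
    base, extra = divmod(batch_pkts, flows)
    segments = 0
    for i in range(flows):
        pkts = base + (1 if i < extra else 0)  # this flow's packets in the batch
        segments += math.ceil(pkts / max_pkts)
    return batch_pkts / segments


for n in (1, 8, 16, 64):
    print(n, "flows ->", avg_packets_per_gro_segment(n), "packets per segment")
```

In this model, fewer packets per flow land in each batch as the flow count grows, so each merged segment amortizes per-packet stack costs over fewer packets.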

