Understanding host network stack overheads

https://dl.acm.org/doi/abs/10.1145/3452296.3472888

Network & Host Hardware Trends
- For the Internet and early-generation datacenters
  - Bottlenecks were in the network
  - Challenge: sharing network resources
    Switch buffers, switch bandwidth
- Exponentially increasing bandwidth
- Stagnant CPU capacity (slowdown of Moore's Law)
- For high-speed networks
  - Bottlenecks have moved to the host
  - Challenge: sharing host resources
    CPU cores
    DRAM bandwidth
    LLC capacity
Solutions: Linux stack optimization, RDMA, hardware offloads, userspace stacks
Goal: guide exploration of the design space with a detailed understanding of today's stack
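
To make the shift toward host bottlenecks concrete, the following back-of-the-envelope sketch computes how much time (and how many CPU cycles) a single core has per frame if it must keep up with line rate. This arithmetic is not taken from the paper. It assumes 1500-byte frames, ignores Ethernet preamble and inter-frame gap, and uses a 3 GHz core clock as an illustrative assumption.

```python
def per_packet_budget_ns(link_gbps: float, frame_bytes: int = 1500) -> float:
    """Time between back-to-back frames at line rate, in nanoseconds."""
    # 1 Gbit/s equals 1 bit per nanosecond, so bits / Gbps gives nanoseconds.
    return frame_bytes * 8 / link_gbps


def cycles_per_packet(link_gbps: float, frame_bytes: int = 1500,
                      cpu_ghz: float = 3.0) -> float:
    """Cycles one core can spend per frame while keeping up with line rate."""
    # 1 GHz equals 1 cycle per nanosecond (cpu_ghz is an assumed clock rate).
    return per_packet_budget_ns(link_gbps, frame_bytes) * cpu_ghz


if __name__ == "__main__":
    for gbps in (10, 40, 100):
        ns = per_packet_budget_ns(gbps)
        cycles = cycles_per_packet(gbps)
        print(f"{gbps:>3} Gbps: {ns:6.0f} ns/frame, {cycles:6.0f} cycles/frame")
```

Under these assumptions the per-frame budget shrinks from 1200 ns at 10 Gbps to 120 ns at 100 Gbps, while the per-core cycle rate stays roughly flat.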
Methodology & Experimental Scenarios
- Goal: understand CPU overheads of host network stack
- Push the bottleneck onto the network stack, so that the stack itself limits performance
- Measure: throughput, CPU utilization & breakdown, cache miss rate
- Impact of various factors:
  - Optimization techniques: TSO/GRO, Jumbo frames, aRFS
  - HW configurations: DDIO, IOMMU
  - Traffic pattern: single, incast, one-to-one, outcast, all-to-all
  - Flow types: long flows, short flows, a mixture of both
  - Network drops
  - Congestion control protocols
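
The measurements listed above can be combined into two derived figures: throughput per core and a normalized CPU breakdown. The sketch below is one possible way to compute them. The function names and the component names used in the example are illustrative assumptions, and the sample numbers are made up for demonstration; they are not results from the paper.

```python
def throughput_per_core(total_gbps: float, core_utilizations: list[float]) -> float:
    """Total throughput divided by the number of fully used cores.

    core_utilizations holds one utilization fraction (0.0 to 1.0) per core,
    so their sum is the equivalent number of fully busy cores.
    """
    cores_used = sum(core_utilizations)
    if cores_used <= 0:
        raise ValueError("at least one core must show non-zero utilization")
    return total_gbps / cores_used


def cycle_breakdown(cycles_by_component: dict[str, float]) -> dict[str, float]:
    """Convert raw per-component cycle counts into fractions of the total."""
    total = sum(cycles_by_component.values())
    if total <= 0:
        raise ValueError("cycle counts must sum to a positive value")
    return {name: cycles / total for name, cycles in cycles_by_component.items()}


if __name__ == "__main__":
    # Hypothetical sample inputs, not measurements.
    print(throughput_per_core(60.0, [1.0, 0.5]))
    print(cycle_breakdown({"data_copy": 50, "tcp_ip": 30, "other": 20}))
```

Normalizing by utilization makes runs with different core counts comparable, which matters when comparing traffic patterns such as incast and all-to-all.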
Main lessons from our study
- Bottlenecks have shifted from packet processing to data copy
  - For 40 Gbps NICs, a single CPU core could saturate the access link bandwidth
  - Multiple cores needed to saturate 100 Gbps access link bandwidth
  - Possible solution: zero-copy techniques like TCP mmap/AF_XDP
    Caveat: implementation overhead
- The NIC DMA pipeline has become inefficient
  - The NIC overwrites data before the application reads it
  - A high cache miss rate is the core reason for the inefficiency of the NIC DMA pipeline
  - Large TCP buffers increase the delay from packet RX to data copy
  - More NIC Rx descriptors lead to a higher chance of cache eviction
  - Enabling IOMMU further degrades the performance
    Additional per-page operations
  - Possible solution: TCP buffer size calculation must take host resources (like L3 cache, packet processing latency) into account
  - Possible solution: decouple data copy and packet processing (so as to scale them independently)
  - Possible solution: an efficient cache replacement policy
- Host resource sharing leads to further performance degradation
  - Multiple flows contending for host resources aggravate the host bottleneck
  - Cache contention degrades the throughput per core
    Possible solution: receiver-driven protocols for orchestrating receiver's caches
  - Bandwidth contention further degrades performance
  - Higher scheduling overheads
  - GRO benefits diminish as the number of flows per core increases
  - Possible solution: receiver-driven transport protocols for orchestrating receiver's bandwidth
- Colocation of short and long flows further degrades performance
  - When flows are colocated, both long and short flows suffer (performance degrades by 48% / 42%, respectively)
  - Long and short flows have different bottlenecks
    TCP/IP processing: overheads increase as flow size decreases
    Data copy: optimizing it won't improve the performance of short flows as much as that of long flows
  - Possible solution: design different packet processing pipelines for short and long flows
  - Possible solution: design application- and network-aware CPU schedulers
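
The buffer-sizing suggestion above (take host resources such as the L3 cache into account) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm or a kernel API: it starts from the bandwidth-delay product and caps it by an equal share of a cache budget. The function names, the equal-share policy, and the 64 KiB floor are all hypothetical choices.

```python
def bdp_bytes(link_gbps: float, rtt_us: float) -> int:
    """Bandwidth-delay product in bytes."""
    # Gbit/s times microseconds gives kilobits (1e9 * 1e-6 = 1e3 bits).
    return round(link_gbps * rtt_us * 1000 / 8)


def host_aware_rcvbuf(link_gbps: float, rtt_us: float,
                      cache_budget_bytes: int,
                      flows_sharing_cache: int = 1,
                      min_bytes: int = 64 * 1024) -> int:
    """Receive buffer size capped by the flow's share of a cache budget.

    cache_budget_bytes is an assumed amount of cache usable for in-flight
    data; the equal split across flows is an illustrative policy only.
    """
    per_flow_cache = cache_budget_bytes // max(flows_sharing_cache, 1)
    capped = min(bdp_bytes(link_gbps, rtt_us), per_flow_cache)
    # Keep a floor so that very many flows do not shrink the buffer to nothing.
    return max(min_bytes, capped)


if __name__ == "__main__":
    budget = 4 * 1024 * 1024  # hypothetical 4 MiB cache budget
    for flows in (1, 8, 128):
        size = host_aware_rcvbuf(100, 100, budget, flows_sharing_cache=flows)
        print(f"{flows:>3} flows: {size} bytes per flow")
```

A value computed this way could be applied per socket with `setsockopt(SOL_SOCKET, SO_RCVBUF, ...)`. On Linux the kernel doubles the requested value, bounds it by `net.core.rmem_max`, and stops autotuning the receive buffer for that socket, so any such policy would need to account for that.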
