Understanding host network stack overheads
https://dl.acm.org/doi/abs/10.1145/3452296.3472888
Network & Host Hardware Trends
For the Internet and early-generation datacenters
Bottlenecks were in the network
Challenge: sharing network resources
Switch buffers, switch bandwidth
Exponentially increasing bandwidth
Stagnant CPU capacity (slowdown of Moore's Law)
For high-speed networks
Bottlenecks have moved to the host
Challenge: sharing host resources
CPU cores
DRAM bandwidth
LLC capacity
Solutions: Linux stack optimization, RDMA, hardware offloads, userspace stacks
Goal: guide the design space through a detailed understanding of today's stack
Methodology & Experimental Scenarios
Goal: understand CPU overheads of host network stack
Want to push the bottlenecks to the host network stack
Measure: throughput, CPU utilization & breakdown, cache miss rate
Impact of various factors:
Optimization techniques: TSO/GRO, Jumbo frames, aRFS (see the sketch after this list)
HW configurations: DDIO, IOMMU
Traffic patterns: single flow, incast, one-to-one, outcast, all-to-all
Flow types: long flows, short flows, and a mixture of the two
Network drops
Congestion control protocols
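These offload knobs are normally flipped with ethtool and sysctl; as a rough illustration of what the TSO/GRO switches involve, the sketch below toggles them through the legacy SIOCETHTOOL ioctl (the equivalent of `ethtool -K <iface> tso on gro on`). The interface name is a placeholder and CAP_NET_ADMIN is required.

```c
/* Sketch: enable TSO and GRO programmatically via the SIOCETHTOOL ioctl
 * (equivalent to `ethtool -K <iface> tso on gro on`). */
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int set_offload(const char *iface, __u32 ethtool_cmd, __u32 on)
{
    struct ethtool_value ev = { .cmd = ethtool_cmd, .data = on };
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, iface, IFNAMSIZ - 1);
    ifr.ifr_data = (void *)&ev;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    int ret = ioctl(fd, SIOCETHTOOL, &ifr);
    close(fd);
    return ret;
}

int main(void)
{
    /* "eth0" is a placeholder interface name. */
    if (set_offload("eth0", ETHTOOL_STSO, 1) < 0)
        perror("enable TSO");
    if (set_offload("eth0", ETHTOOL_SGRO, 1) < 0)
        perror("enable GRO");
    return 0;
}
```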
Main lessons from our study
Bottlenecks have shifted from packet processing to data copy
For 40 Gbps NICs, a single CPU core could saturate the access link bandwidth
Multiple cores needed to saturate 100 Gbps access link bandwidth
Possible solution: zero-copy techniques like TCP mmap/AF_XDP (sketch below)
Caveat: implementation overhead
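As a reference point for the zero-copy direction, here is a minimal sketch of the MSG_ZEROCOPY transmit path (Linux >= 4.14); TCP receive zerocopy (TCP_ZEROCOPY_RECEIVE, the "TCP mmap" mechanism) is analogous but more involved. Completion handling is heavily simplified; the need to keep buffers untouched until the kernel's notification arrives is exactly the kind of implementation overhead noted above.

```c
/* Sketch: zero-copy transmit with MSG_ZEROCOPY (Linux >= 4.14).
 * The user-to-kernel data copy is elided; the kernel pins the user pages
 * and reports completion on the socket error queue. */
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/errqueue.h>   /* struct sock_extended_err (parsing omitted) */

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

ssize_t send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;
    /* Opt in once per socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;

    /* The pages backing buf must stay untouched until the kernel signals
     * completion on the error queue. */
    ssize_t n = send(fd, buf, len, MSG_ZEROCOPY);
    if (n < 0)
        return -1;

    /* Wait for the completion notification (simplified: a real sender would
     * poll() for POLLERR, batch sends, and match notification ranges). */
    struct msghdr msg = {0};
    char control[128];
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    while (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0 && errno == EAGAIN)
        ;   /* error-queue reads never block; busy-wait only for brevity */
    return n;
}
```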
The NIC DMA pipeline has become inefficient
The NIC overwrites data before the application reads it
A high cache miss rate is the core reason for the inefficiency of the NIC DMA pipeline
Large TCP buffers increase the delay from packet RX to data copy
More NIC Rx descriptors lead to a higher chance of cache eviction
Enabling the IOMMU further degrades performance
Additional per-page operations
Possible solution: TCP buffer sizing must take host resources (like L3 cache capacity, packet processing latency) into account (see the sketch after this list)
Possible solution: decouple data copy and packet processing (so as to scale them independently)
Possible solution: an efficient cache replacement policy
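As one way to make the host-resource-aware buffer sizing idea concrete (illustrative only, not the paper's mechanism), the sketch below caps SO_RCVBUF so that the aggregate buffering of the expected connections fits in a fraction of the LLC, rather than relying purely on BDP-driven autotuning. The fraction, connection count, and fallback LLC size are assumed values.

```c
/* Sketch: cache-aware receive buffer cap (illustrative).
 * Bound per-connection buffering so that expected_conns connections
 * together fit in a fraction of the last-level cache. */
#include <sys/socket.h>
#include <unistd.h>

static long rcvbuf_cap_bytes(long llc_bytes, int expected_conns, double llc_fraction)
{
    if (expected_conns < 1)
        expected_conns = 1;
    long cap = (long)(llc_bytes * llc_fraction) / expected_conns;
    return cap > 0 ? cap : 64 * 1024;           /* fall back to a sane minimum */
}

int apply_rcvbuf_cap(int fd, int expected_conns)
{
    long llc = sysconf(_SC_LEVEL3_CACHE_SIZE);  /* glibc extension; LLC size in bytes */
    if (llc <= 0)
        llc = 32L << 20;                        /* assume 32 MiB if not reported */

    /* Note: setting SO_RCVBUF disables receive-buffer autotuning for this
     * socket; the kernel doubles the value internally for metadata. */
    int cap = (int)rcvbuf_cap_bytes(llc, expected_conns, 0.5);
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &cap, sizeof(cap));
}
```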
Host resource sharing leads to further performance degradation
Multiple flows contending for host resources aggravate the host bottleneck
Cache contention degrades the throughput per core (see the pinning sketch after this list)
Possible solution: receiver-driven protocols for orchestrating the receiver's caches
Bandwidth contention further degrades performance
Higher scheduling overheads
GRO benefits diminish as the number of flows per core increases
Possible solution: receiver-driven transport protocols for orchestrating the receiver's bandwidth
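Receiver-driven transports are a research direction rather than something to sketch here; a smaller, present-day illustration of reducing cache contention is to keep a connection's application thread on the CPU that processed its packets, mirroring what aRFS does from the NIC side. The policy below (query SO_INCOMING_CPU, pin the calling thread) is only an assumed example.

```c
/* Sketch: align the application thread with the CPU that handled the
 * connection's RX processing, to cut cross-core cache traffic. */
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49
#endif

int pin_to_incoming_cpu(int fd)
{
    int cpu = -1;
    socklen_t len = sizeof(cpu);
    /* CPU on which the last packet for this socket was processed. */
    if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0 || cpu < 0)
        return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* Pin the calling thread; a real server would do this per worker thread. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```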
Colocation of short and long flows further degrades performance
When long and short flows are colocated, both suffer (performance degrades by 48% and 42%, respectively)
Long and short flows have different bottlenecks
TCP/IP processing: overheads increase as flow size decreases
Data copy: reducing copy overhead helps short flows far less than long flows
Possible solution: design different packet processing pipelines for short and long flows (sketch below)
Possible solution: design application- and network-aware CPU schedulers
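A rough, socket-level illustration of giving short and long flows different treatment today (not the paper's proposal): short, latency-sensitive flows opt into busy polling and TCP_NODELAY, while long, throughput-bound flows keep large buffers so GRO and autotuning can do their work. All values are placeholders.

```c
/* Sketch: separate socket-level handling for short vs. long flows. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46
#endif

/* Short flows: trade CPU for latency with busy polling, send immediately. */
int configure_short_flow(int fd)
{
    int busy_usec = 50;      /* busy-poll budget per read, placeholder */
    int nodelay = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy_usec, sizeof(busy_usec)) < 0)
        return -1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof(nodelay));
}

/* Long flows: large receive buffer sized to the path BDP, but ideally
 * bounded by the cache-aware cap discussed earlier. */
int configure_long_flow(int fd)
{
    int rcvbuf = 8 << 20;    /* 8 MiB, placeholder */
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
}
```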