Understanding host network stack overheads

https://dl.acm.org/doi/abs/10.1145/3452296.3472888

  • Network & Host Hardware Trends

    • For the internet and early-generation datacenters

      • Bottlenecks were in the network

      • Challenge: sharing network resources

        • Switch buffers, switch bandwidth

    • Exponentially increasing bandwidth

    • Stagnant CPU capacity (slowdown of Moore's Law)

    • For high speed networks

      • Bottlenecks have moved to the host

      • Challenge: sharing host resources

        • CPU cores

        • DRAM bandwidth

        • LLC capacity

  • Solutions: Linux stack optimization, RDMA, hardware offloads, userspace stacks

  • Goal: guide the design space with a detailed understanding of today's stack

  • Methodology & Experimental Scenarios

    • Goal: understand CPU overheads of host network stack

    • Want to push the bottleneck into the host network stack (so host overheads, not the network, limit throughput)

    • Measure: throughput, CPU utilization & breakdown, cache miss rate (see the measurement sketch after this list)

    • Impact of various factors:

      • Optimization techniques: TSO/GRO, Jumbo frames, aRFS

      • HW configurations: DDIO, IOMMU

      • Traffic patterns: single flow, incast, one-to-one, outcast, all-to-all

      • Flow types: long flows, short flows, and a mixture of both

      • Network drops

      • Congestion control protocols
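
Below is a minimal sketch, not the paper's measurement harness, of the kind of single-flow TCP receiver such experiments can be run against; the port number and 64 KB read size are arbitrary choices. Throughput falls out of the byte count, while the CPU breakdown and cache-miss rate can be collected by running the process under a profiler (e.g. `perf stat -e cycles,LLC-load-misses`).

```c
/* Minimal single-flow TCP sink: accepts one connection and reports receive
 * throughput once per second. A sketch only -- the paper's methodology also
 * breaks down where CPU cycles go and how often the LLC misses. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(5201);                 /* arbitrary port choice */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(lfd, 1) < 0) {
        perror("bind/listen");
        return 1;
    }

    int fd = accept(lfd, NULL, NULL);
    static char buf[1 << 16];                    /* 64 KB application reads */
    long long bytes = 0;
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));  /* the data copy happens here */
        if (n <= 0)
            break;
        bytes += n;
        clock_gettime(CLOCK_MONOTONIC, &now);
        double secs = (now.tv_sec - start.tv_sec) + (now.tv_nsec - start.tv_nsec) / 1e9;
        if (secs >= 1.0) {
            printf("%.2f Gbps\n", bytes * 8 / secs / 1e9);
            bytes = 0;
            start = now;
        }
    }
    close(fd);
    close(lfd);
    return 0;
}
```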

  • Main lessons from our study

    • Bottlenecks have shifted from packet processing to data copy

      • For 40 Gbps NICs, a single CPU core could saturate the access link bandwidth

      • Multiple cores needed to saturate 100 Gbps access link bandwidth

      • Possible solution: zero-copy techniques like TCP mmap/AF_XDP (see the sketch after this list)

        • Implementation overhead
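
As a concrete reference point for the zero-copy bullet above, here is a minimal sketch of Linux's transmit-side zero-copy path (MSG_ZEROCOPY, available since kernel 4.14); the receive-side TCP mmap interface follows a similar but more involved pattern. The sketch illustrates the API, not the paper's proposal, and makes the implementation overhead visible: the caller must wait for a completion on the socket error queue before reusing the buffer.

```c
/* Sketch of Linux transmit-side zero copy (MSG_ZEROCOPY, kernel >= 4.14,
 * recent libc headers). It removes the copy out of the application buffer,
 * but the buffer stays pinned until a completion notification is read from
 * the socket error queue -- an example of the extra bookkeeping that
 * zero-copy interfaces impose. */
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>

int send_zerocopy(int fd, const void *buf, size_t len) {
    int one = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;                               /* option not supported */

    if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
        return -1;                               /* buf's pages are now pinned */

    /* buf must not be reused until the kernel signals completion. */
    for (;;) {
        struct pollfd p = { .fd = fd, .events = 0 };   /* POLLERR is always reported */
        if (poll(&p, 1, -1) < 0)
            return -1;

        char control[256];
        struct msghdr msg = { .msg_control = control,
                              .msg_controllen = sizeof(control) };
        if (recvmsg(fd, &msg, MSG_ERRQUEUE) >= 0)
            return 0;                            /* completion read: buf reusable */
        if (errno != EAGAIN && errno != EINTR)
            return -1;
    }
}
```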

    • The NIC DMA pipeline has become inefficient

      • The NIC overwrites data before the application reads it

      • A high cache miss rate is the core reason for the inefficiency of the NIC DMA pipeline

      • Large TCP buffers increase the delay from packet RX to data copy

      • More NIC Rx descriptors lead to a higher chance of cache eviction

      • Enabling IOMMU further degrades the performance

        • Additional per-page operations

      • Possible solution: TCP buffer size calculation must take host resources (like L3 cache capacity and packet processing latency) into account (a buffer-sizing sketch follows this list)

      • Possible solution: decouple data copy and packet processing (so as to scale them independently)

      • Possible solution: an efficient cache replacement policy
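
To make the first "possible solution" above concrete, a rough sketch of host-resource-aware buffer sizing is shown below: cap each flow's receive buffer so the aggregate in-flight data fits within part of the LLC. The 32 MB LLC size, the half-cache budget, and the even per-flow split are assumptions for illustration, not a policy from the paper.

```c
/* Sketch: cap a socket's receive buffer so that the combined in-flight data
 * of all active flows plausibly fits in part of the LLC. LLC_BYTES and the
 * half-cache budget are illustrative assumptions, not values from the paper. */
#include <sys/socket.h>

#define LLC_BYTES (32u << 20)                    /* assumed 32 MB last-level cache */

int cap_rcvbuf_for_llc(int fd, unsigned num_active_flows) {
    unsigned budget = LLC_BYTES / 2;             /* reserve half the LLC for the app */
    if (num_active_flows == 0)
        num_active_flows = 1;
    int rcvbuf = (int)(budget / num_active_flows);
    /* The kernel doubles the value passed to SO_RCVBUF for its own bookkeeping. */
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
}
```

Note that explicitly setting SO_RCVBUF disables Linux's receive-buffer autotuning for that socket, so a production version of this idea would more likely adjust the net.ipv4.tcp_rmem bounds or the autotuning logic itself.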

    • Host resource sharing leads to further performance degradation

      • Contention among multiple flows for host resources aggravates the host bottleneck

      • Cache contention degrades the throughput per core

        • Possible solution: receiver-driven protocols for orchestrating receiver's caches

      • Bandwidth contention further degrades performance

      • Higher scheduling overheads

      • GRO benefits diminish as the number of flows per core increases

      • Possible solution: receiver-driven transport protocols for orchestrating receiver's bandwidth

    • Colocation of short and long flows further degrades performance

      • When flows are colocated, both long and short flows suffer (throughput degrades by 48% and 42%, respectively)

      • Long and short flows have different bottlenecks

        • TCP/IP: overheads increase as we decrease the flow size

        • Data copy: reducing copy overhead won't improve short-flow performance as much as long-flow performance

      • Possible solution: design different packet processing pipelines for short and long flows

      • Possible solution: design application- and network-aware CPU schedulers (a core-pinning sketch follows)
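
One low-level building block such a scheduler could use, assuming the application can distinguish its long-flow (copy-heavy) and short-flow (processing-heavy) handlers, is to pin the two classes of worker threads to disjoint cores. The sketch below does only that and is not a design taken from the paper.

```c
/* Sketch: pin worker threads to specific cores so that long-flow (data-copy
 * heavy) and short-flow (per-packet-processing heavy) handlers run on
 * disjoint cores. The long/short split itself is an assumption made for
 * illustration, not a mechanism described in the paper. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_to_core(pthread_t thread, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}

/* Hypothetical usage: keep copy-heavy and latency-sensitive work apart.
 *   pin_to_core(long_flow_worker, 2);
 *   pin_to_core(short_flow_worker, 3);
 */
```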
