Dagger: Efficient and Fast RPCs in Cloud Microservices with Near-Memory Reconfigurable NICs

https://www.csl.cornell.edu/~delimitrou/papers/2021.asplos.sinan.pdf

Presentation

  • Trends in cloud computing

    • Monoliths: tightly-coupled application logic in a single statically or dynamically linked binary

    • Shift towards microservices

      • Loosely-coupled application logic split into many independent small applications

    • Shift towards serverless

      • Fine-grained applications

      • Fine-grained (short) lifetimes

  • Cloud applications today are interactive

    • Frequent interaction with large sets of users

    • Strict performance requirements expressed as SLOs (service-level objectives)

      • Low tail latency under high load

      • Performance predictability
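SLOs like the ones above are usually stated as a high percentile (tail) of request latency rather than the mean. The sketch below uses hypothetical latency samples and the nearest-rank percentile definition (an assumption; other definitions interpolate between samples) to show how a service can look fast on average while still violating a p99 target. The function names and the 500 us target are illustrative only.

```python
import math


def percentile(samples_us, p):
    """Nearest-rank p-th percentile (0 < p <= 100) of latency samples."""
    if not samples_us or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(samples_us)
    # Smallest sample such that at least p% of all samples are <= it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def meets_slo(samples_us, slo_us, p=99):
    """True if the p-th percentile latency is within the SLO target."""
    return percentile(samples_us, p) <= slo_us


# Hypothetical samples (microseconds): mostly fast, with a few slow outliers.
latencies = [50] * 98 + [900, 2000]
print(sum(latencies) / len(latencies))   # 78.0 (mean looks healthy)
print(percentile(latencies, 99))         # 900
print(meets_slo(latencies, slo_us=500))  # False: the p99 target is violated
```

Two slow requests out of a hundred barely move the mean but set the p99, which is why tail latency, not average latency, is the metric that interactive services must keep under control at high load.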

  • Focus: improving the communication stack of microservices

    • Microservices communicate over RPCs

    • RPC requests in microservices are small, and their sizes vary across tiers

    • Takeaways

      • Per-request communication overheads are large

      • Cannot tune communication stacks for small messages only

      • Need an adaptive stack

    • RPC stacks run on the same CPUs as highly concurrent applications

      • Already high pressure on CPUs from applications

      • Intensive traffic of small messages adds even more per-request processing load
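To see why per-request overheads dominate small RPCs, a minimal sketch with a linear cost model helps: each request pays a fixed cost plus a per-byte cost. The model and both constants are hypothetical placeholders, not numbers from the Dagger paper; only the trend as payloads shrink matters here.

```python
def overhead_fraction(payload_bytes, per_request_ns, ns_per_byte):
    """Share of one RPC's processing time spent on fixed per-request work.

    Assumed linear model: time = per_request_ns + payload_bytes * ns_per_byte.
    """
    total_ns = per_request_ns + payload_bytes * ns_per_byte
    return per_request_ns / total_ns


# Hypothetical placeholder constants, not measurements.
PER_REQUEST_NS = 1000.0  # fixed cost paid once per RPC regardless of size
NS_PER_BYTE = 0.1        # marginal cost of each additional payload byte

for size in (64, 512, 4096, 65536):
    share = overhead_fraction(size, PER_REQUEST_NS, NS_PER_BYTE)
    print(f"{size:>6} B payload: {share:.0%} of the time is per-request overhead")
```

With these placeholder constants, nearly all of the time for a 64-byte payload goes to the fixed per-request cost, while a 64 KiB payload amortizes it. That matches the takeaway that per-request overheads matter most for the small RPCs typical of microservices, and every such request burns those cycles on the same CPUs the application needs.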

  • Dagger: a HW/SW co-designed end-host RPC stack

    • Design principles

      • Hardware offload

        • Existing techniques to improve the efficiency of cloud networking

          • Kernel bypass: IX, eRPC, mTCP, and many others

            • Removes per-packet kernel overheads and tightly couples networking stacks with applications, but still runs everything in SW

          • RDMA systems

            • Offload networking stacks to hardware

            • But:

              • only provide low-level abstractions; the RPC layer still runs in SW

              • require specialized adapters

        • Dagger: offload the entire end-host communication stack to the NIC hardware, from the L1 (PHY) layer all the way up to the application (RPC) layer

          • Completely frees the CPU from any work related to data exchange

      • Reconfigurability

        • Networking protocols, load balancers, threading, data representation, data manipulation: the HW should also be reconfigurable across all of these

        • Dagger is based on an FPGA!

          • Configurable transport: UDP, TCP, mTCP, Homa, Tonic

          • Configurable load balancer / flow controller: static, round-robin, random, application-specific

          • Configurable host-NIC interface: PCIe doorbells, PCIe MMIOs, coherency-based

          • Configurable threading model: connection/thread/queue/flow mapping, number of NIC flows / queues

      • Tight coupling

        • Dagger is based on a cache-coherent FPGA tightly-coupled with the host CPU

          • Inspired by soNUMA and a series of RDMA studies

          • The FPGA acts as a NUMA node

            • No DMAs are required to exchange data between NUMA nodes

            • No explicit MMIO requests

            • Minimal software overhead

            • NUMA interconnects have lower latency than PCIe

        • Existing SmartNICs are attached over PCIe! (this introduces overheads)

          • Doorbell scheme

            • Multiple PCIe roundtrips

            • Expensive and CPU-inefficient rings based on MMIOs

          • Existing optimizations: combined descriptors and packets, packet writes with MMIOs, doorbell batching... (but they fail to eliminate these overheads)
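The configurable load balancer / flow controller options listed under Reconfigurability (static, round-robin, random, application-specific) can be modeled in software as a request-to-queue steering policy. This is a hypothetical Python sketch: Dagger implements such policies in FPGA logic, and the `LoadBalancer` class, its policy strings, and `pick_queue` are illustrative names, not Dagger's API.

```python
import itertools
import random
from typing import Callable, Optional

POLICIES = ("static", "round_robin", "random", "application")


class LoadBalancer:
    """Hypothetical software model of configurable request-to-queue steering."""

    def __init__(self, policy: str, num_queues: int,
                 app_fn: Optional[Callable[[int], int]] = None, seed: int = 0):
        if policy not in POLICIES:
            raise ValueError(f"unknown policy: {policy}")
        if num_queues <= 0:
            raise ValueError("num_queues must be positive")
        if policy == "application" and app_fn is None:
            raise ValueError("the application policy needs app_fn")
        self.policy = policy
        self.num_queues = num_queues
        self.app_fn = app_fn
        self._rr = itertools.cycle(range(num_queues))  # round-robin cursor
        self._rng = random.Random(seed)                 # seeded for reproducibility

    def pick_queue(self, connection_id: int) -> int:
        """Return the NIC queue that should receive the next request."""
        if self.policy == "static":
            # Every request of a connection lands on the same queue.
            return connection_id % self.num_queues
        if self.policy == "round_robin":
            # Requests rotate over all queues, ignoring the connection.
            return next(self._rr)
        if self.policy == "random":
            return self._rng.randrange(self.num_queues)
        # Application-specific: a user-supplied mapping, wrapped into range.
        return self.app_fn(connection_id) % self.num_queues


rr = LoadBalancer("round_robin", num_queues=4)
print([rr.pick_queue(7) for _ in range(6)])      # [0, 1, 2, 3, 0, 1]
static = LoadBalancer("static", num_queues=4)
print([static.pick_queue(7) for _ in range(3)])  # [3, 3, 3]
```

Static steering keeps all requests of a connection on one queue, round-robin and random spread requests regardless of connection, and the application-specific hook leaves the mapping to the service; making the policy a configuration choice is the point of the reconfigurability principle.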
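The PCIe doorbell overheads and the existing optimizations listed under Tight coupling can be made concrete with a deliberately simplified accounting of the transmit path. The sketch assumes a conventional descriptor-ring NIC: the CPU writes a descriptor into a host-memory ring and rings an MMIO doorbell, after which the NIC DMA-reads the descriptor and then the payload. The scheme names, the 64-byte line size, and the per-message counts are illustrative assumptions (real NICs also prefetch descriptors, write completions, and so on), not measurements of any specific device or of Dagger.

```python
import math
from dataclasses import dataclass

CACHE_LINE = 64  # bytes; assumed granularity of MMIO writes and coherent transfers


@dataclass
class TxCost:
    """Per-message host-NIC traffic for sending one RPC (simplified model)."""
    mmio_writes: float       # CPU-issued MMIO writes to the NIC
    dma_reads: float         # NIC-initiated PCIe read round trips
    coherent_lines: int = 0  # cache lines moved over a coherent interconnect


def tx_cost(scheme: str, packet_bytes: int, batch: int = 1) -> TxCost:
    if batch <= 0:
        raise ValueError("batch must be positive")
    lines = math.ceil(packet_bytes / CACHE_LINE)
    if scheme == "doorbell":
        # MMIO doorbell, then the NIC DMA-reads the descriptor, then the payload.
        return TxCost(mmio_writes=1, dma_reads=2)
    if scheme == "doorbell_batching":
        # One doorbell is shared by `batch` messages; DMA reads stay per message.
        return TxCost(mmio_writes=1 / batch, dma_reads=2)
    if scheme == "combined_descriptor":
        # Descriptor and packet are adjacent, so one DMA read fetches both.
        return TxCost(mmio_writes=1, dma_reads=1)
    if scheme == "mmio_packet_write":
        # CPU pushes the whole packet through MMIO: no DMA, one write per line.
        return TxCost(mmio_writes=lines, dma_reads=0)
    if scheme == "coherent":
        # NUMA-style NIC: no doorbell or DMA; lines move via the coherence protocol.
        return TxCost(mmio_writes=0, dma_reads=0, coherent_lines=lines)
    raise ValueError(f"unknown scheme: {scheme}")


for name in ("doorbell", "doorbell_batching", "combined_descriptor",
             "mmio_packet_write", "coherent"):
    print(name, tx_cost(name, packet_bytes=128, batch=8))
```

Under these assumptions, each optimization removes only part of the cost: batching amortizes the doorbell but keeps two DMA round trips per message, combining descriptors and packets saves one read, and MMIO packet writes trade DMA reads for CPU-issued writes that grow with packet size. Only the coherent model drops both MMIO and DMA, moving the data as cache-line transfers instead.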
