Dagger: Efficient and Fast RPCs in Cloud Microservices with Near-Memory Reconfigurable NICs
https://www.csl.cornell.edu/~delimitrou/papers/2021.asplos.dagger.pdf
Presentation
Trends in cloud computing (monoliths)
Tightly-coupled application logic in a single statically / dynamically linked library
Shift towards microservices
Loosely-coupled application logic split into many independent small applications
Shift towards serverless
Fine application granularity
Fine lifetime granularity
Cloud applications today are interactive
Frequent interaction with large sets of users
Strict performance requirements expressed as SLOs
Low tail latency under high load
Performance predictability
Focus: improving the communication stack of microservices, which communicate over RPCs
RPC requests in microservices are small and vary across tiers
Takeaways
Per-request communication overheads are large (a back-of-the-envelope sketch follows this list)
Cannot tune communication stacks for small messages only
Need an adaptive stack
RPC stacks run on the same CPUs as highly concurrent applications
Already high pressure on CPUs from applications
Intensive traffic of small messages
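A back-of-the-envelope illustration of why fixed per-request stack costs dominate when RPC payloads are small. The per-request cost and link speed below are assumed, illustrative numbers, not measurements from the paper:

```cpp
// Illustrative estimate: with small payloads, a fixed per-request software
// stack cost dwarfs the wire time of the payload itself (numbers are assumptions).
#include <cstdio>
#include <initializer_list>

int main() {
    const double stack_overhead_us = 2.0;   // assumed fixed per-request SW stack cost
    const double link_gbps = 100.0;         // assumed link speed
    for (double payload_bytes : {64.0, 512.0, 4096.0, 65536.0}) {
        double wire_time_us = payload_bytes * 8.0 / (link_gbps * 1e3);  // bits / (bits per us)
        double overhead_frac = stack_overhead_us / (stack_overhead_us + wire_time_us);
        std::printf("%8.0f B payload: %.1f%% of the time spent in the stack\n",
                    payload_bytes, 100.0 * overhead_frac);
    }
    return 0;
}
```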
Dagger: a HW/SW co-designed end-host RPC stack
Design principles
Hardware offload
Existing techniques to improve efficiency of cloud networking
Kernel bypass: IX, eRPC, mTCP, and many others
Removes per-packet kernel overheads and tightly couples networking stacks with applications, but still runs everything in SW
RDMA systems
Offloads networking stacks to hardware
But:
only provide low-level abstractions; the RPC layer still runs in SW
require specialized adapters
Dagger offloads the entire end-host communication stack to a hardware NIC, from the physical (PHY) layer all the way up to the application (RPC) layer
Completely frees the CPU from any work related to data exchange (see the API sketch below)
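A minimal sketch of what the host-side interface to such a fully offloaded stack could look like. The class and method names here are illustrative assumptions, not Dagger's actual API; the point is that the CPU only stages the RPC object, while marshalling, transport, and networking happen on the NIC:

```cpp
// Hypothetical host-side view of a fully offloaded RPC stack (names are
// illustrative, not taken from Dagger's codebase).
#include <cstdint>
#include <cstring>

struct RpcRequest {
    uint32_t rpc_id;        // which remote procedure to invoke
    uint32_t payload_size;  // size of the marshalled arguments
    char     payload[64];   // small inline payload, typical for microservice RPCs
};

class OffloadedRpcClient {
public:
    // Hands the request to the NIC-resident stack and returns immediately;
    // the hardware performs marshalling, transport, and networking.
    void CallAsync(const RpcRequest& /*req*/) { /* enqueue to the NIC; no kernel involved */ }

    // Polls a response queue that the NIC fills in.
    bool PollResponse(RpcRequest* /*resp*/) { /* read from a NIC-owned queue */ return false; }
};

int main() {
    OffloadedRpcClient client;
    RpcRequest req{};
    req.rpc_id = 1;
    req.payload_size = 5;
    std::memcpy(req.payload, "hello", 5);
    client.CallAsync(req);   // CPU cost: just staging the request object
    RpcRequest resp{};
    client.PollResponse(&resp);
    return 0;
}
```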
Reconfigurability
Networking protocols, load balancers, threading models, data representation, and data manipulation keep changing, so the HW should be reconfigurable as well (a configuration sketch follows this list)
Dagger is based on an FPGA!
Configurable transport: UDP, TCP, mTCP, Homa, Tonic
Configurable load balancer / flow controller: static, round-robin, random, application-specific
Configurable host-NIC interface: PCIe doorbells, PCIe MMIOs, coherency-based
Configurable threading model: connection/thread/queue/flow mapping, number of NIC flows / queues
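A hedged sketch of how these knobs might be expressed as configuration parameters. The enum and field names are assumptions for illustration, not Dagger's actual parameters, which are selected when the FPGA image is built:

```cpp
// Illustrative configuration of the reconfigurable knobs listed above
// (names and defaults are assumptions, not Dagger's actual parameters).
enum class Transport    { UDP, TCP, Homa, Tonic };
enum class LoadBalancer { Static, RoundRobin, Random, AppSpecific };
enum class HostIface    { PcieDoorbell, PcieMmio, CacheCoherent };

struct NicConfig {
    Transport    transport         = Transport::UDP;
    LoadBalancer load_balancer     = LoadBalancer::RoundRobin;
    HostIface    host_interface    = HostIface::CacheCoherent;
    int          num_flows         = 64;   // NIC flows exposed to software
    int          num_queues        = 16;   // hardware queues (threading model)
    int          threads_per_queue = 1;    // connection/thread/queue mapping
};

int main() {
    NicConfig cfg;                       // defaults above
    cfg.transport = Transport::Homa;     // e.g., swap the transport when re-synthesizing
    (void)cfg;
    return 0;
}
```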
Tight coupling
Dagger is based on a cache-coherent FPGA tightly-coupled with the host CPU
Inspired by soNUMA and a series of RDMA studies
An FPGA acting as a NUMA node
No DMAs are required to exchange data between NUMA nodes
No explicit MMIO requests
Minimal software overhead (see the coherence sketch after this list)
NUMA interconnects have lower latency
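A minimal sketch of the idea behind the coherence-based host-NIC interface. The layout and flag protocol here are assumptions for illustration (the real interface uses a coherent CPU-FPGA platform); the point is that the CPU writes the request into memory the NIC observes through the coherence protocol, so no DMA descriptors or MMIO doorbells are needed:

```cpp
// Simplified illustration of cache-coherent CPU -> NIC data exchange
// (layout and protocol are assumptions, not the actual interface).
#include <atomic>
#include <cstdint>
#include <cstring>

struct alignas(64) RpcCacheLine {        // one small RPC per cache line
    std::atomic<uint8_t> valid;          // written last; observed by the NIC via coherence
    uint8_t payload[63];
};

void send_rpc(RpcCacheLine* slot, const void* data, std::size_t len) {
    std::memcpy(slot->payload, data, len < 63 ? len : 63);
    // Release store: once the line is snooped or written back, the NIC sees a
    // complete request; no DMA descriptor and no explicit MMIO doorbell are needed.
    slot->valid.store(1, std::memory_order_release);
}

int main() {
    static RpcCacheLine slot{};          // stands in for the shared, NIC-visible region
    const char msg[] = "get_user";
    send_rpc(&slot, msg, sizeof(msg));
    return 0;
}
```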
Existing SmartNICs are based on PCIe! (introduces overheads)
Doorbell scheme
Multiple PCIe roundtrips
Expensive and CPU-inefficient rings based on MMIOs
Existing optimizations: combined descriptors and packets, packet writes with MMIOs, doorbell batching... (but they fail to eliminate these overheads; the doorbell flow is sketched below)
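For contrast, a sketch of the conventional PCIe doorbell flow. The structure and names are illustrative, not any specific NIC's driver; it shows the descriptor write, the expensive uncached MMIO doorbell, and the NIC-initiated DMA reads that Dagger's coherent interface avoids:

```cpp
// Conventional PCIe doorbell/descriptor-ring flow, sketched for contrast
// (structure and names are illustrative assumptions).
#include <cstdint>

struct TxDescriptor {
    uint64_t buffer_addr;   // host memory address holding the packet
    uint32_t length;
    uint32_t flags;
};

struct TxRing {
    TxDescriptor desc[256];
    uint32_t     tail = 0;
};

// Stand-in for a write to the NIC's memory-mapped doorbell register.
void mmio_write(volatile uint32_t* doorbell, uint32_t value) {
    *doorbell = value;      // uncached store over PCIe; expensive for the CPU
}

void post_packet(TxRing& ring, volatile uint32_t* doorbell,
                 uint64_t buf, uint32_t len) {
    ring.desc[ring.tail] = {buf, len, /*flags=*/0};   // 1. write the descriptor
    ring.tail = (ring.tail + 1) % 256;
    mmio_write(doorbell, ring.tail);                  // 2. ring the doorbell (MMIO)
    // 3. the NIC DMA-reads the descriptor, then 4. DMA-reads the payload:
    //    additional PCIe round trips before the packet leaves the host.
}

int main() {
    TxRing ring;
    uint32_t fake_doorbell = 0;          // stands in for the NIC's BAR register
    post_packet(ring, &fake_doorbell, /*buf=*/0x1000, /*len=*/64);
    return 0;
}
```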