Dagger: Efficient and Fast RPCs in Cloud Microservices with Near-Memory Reconfigurable NICs
https://www.csl.cornell.edu/~delimitrou/papers/2021.asplos.dagger.pdf
Presentation
Trends in cloud computing (monoliths)
Tightly-coupled application logic in a single statically / dynamically linked library
Shift towards microservices
Loosely-coupled application logic split into many independent small applications
Shift towards serverless
Fine application granularity
Fine lifetime granularity
Cloud applications today are interactive
Frequent interaction with large sets of users
Strict performance requirements expressed as SLOs
Low tail latency under high load
Performance predictability
Focus: improving the communication stack of microservices, which communicate over RPCs
RPC requests in microservices are small and vary across tiers
Takeaways
Per-request communication overheads are large (a back-of-the-envelope sketch follows this list)
Cannot tune communication stacks for small messages only
Need an adaptive stack
RPC stacks run on the same CPUs as highly concurrent applications
Already high pressure on CPUs from applications
Intensive traffic of small messages
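A back-of-the-envelope illustration of why fixed per-request stack costs dominate when RPC payloads are small. The per-request cost and link speed below are assumed, illustrative numbers, not measurements from the paper:

```cpp
// Illustrative estimate: with small payloads, a fixed per-request software
// stack cost dwarfs the wire time of the payload itself (numbers are assumptions).
#include <cstdio>
#include <initializer_list>

int main() {
    const double stack_overhead_us = 2.0;   // assumed fixed per-request SW stack cost
    const double link_gbps = 100.0;         // assumed link speed
    for (double payload_bytes : {64.0, 512.0, 4096.0, 65536.0}) {
        double wire_time_us = payload_bytes * 8.0 / (link_gbps * 1e3);  // bits / (bits per us)
        double overhead_frac = stack_overhead_us / (stack_overhead_us + wire_time_us);
        std::printf("%8.0f B payload: %.1f%% of the time spent in the stack\n",
                    payload_bytes, 100.0 * overhead_frac);
    }
    return 0;
}
```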
Dagger: a HW/SW co-designed end-host RPC stack
Design principles
Hardware offload
Existing techniques to improve efficiency of cloud networking
Kernel bypass: IX, eRPC, mTCP, and many others
Removes per-packet kernel overheads and tightly couples networking stacks with applications, but still runs everything in SW
RDMA systems
Offloads networking stacks to hardware
But:
only provide low-level abstractions; the RPC layer still runs in SW
require specialized adapters
Dagger offloads the entire end-host communication stack to a hardware NIC, from the physical (PHY) layer all the way up to the application (RPC) layer
Completely frees the CPU from any work related to data exchange (see the API sketch below)
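A minimal sketch of what the host-side interface to such a fully offloaded stack could look like. The class and method names here are illustrative assumptions, not Dagger's actual API; the point is that the CPU only stages the RPC object, while marshalling, transport, and networking happen on the NIC:

```cpp
// Hypothetical host-side view of a fully offloaded RPC stack (names are
// illustrative, not taken from Dagger's codebase).
#include <cstdint>
#include <cstring>

struct RpcRequest {
    uint32_t rpc_id;        // which remote procedure to invoke
    uint32_t payload_size;  // size of the marshalled arguments
    char     payload[64];   // small inline payload, typical for microservice RPCs
};

class OffloadedRpcClient {
public:
    // Hands the request to the NIC-resident stack and returns immediately;
    // the hardware performs marshalling, transport, and networking.
    void CallAsync(const RpcRequest& /*req*/) { /* enqueue to the NIC; no kernel involved */ }

    // Polls a response queue that the NIC fills in.
    bool PollResponse(RpcRequest* /*resp*/) { /* read from a NIC-owned queue */ return false; }
};

int main() {
    OffloadedRpcClient client;
    RpcRequest req{};
    req.rpc_id = 1;
    req.payload_size = 5;
    std::memcpy(req.payload, "hello", 5);
    client.CallAsync(req);   // CPU cost: just staging the request object
    RpcRequest resp{};
    client.PollResponse(&resp);
    return 0;
}
```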
Reconfigurability
Networking protocols, load balancers, threading models, data representation, and data manipulation keep changing, so the HW should be reconfigurable as well (a configuration sketch follows this list)
Dagger is based on an FPGA!
Configurable transport: UDP, TCP, mTCP, Homa, Tonic
Configurable load balancer / flow controller: static, round-robin, random, application-specific
Configurable host-NIC interface: PCIe doorbells, PCIe MMIOs, coherency-based
Configurable threading model: connection/thread/queue/flow mapping, number of NIC flows / queues
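A hedged sketch of how these knobs might be expressed as configuration parameters. The enum and field names are assumptions for illustration, not Dagger's actual parameters, which are selected when the FPGA image is built:

```cpp
// Illustrative configuration of the reconfigurable knobs listed above
// (names and defaults are assumptions, not Dagger's actual parameters).
enum class Transport    { UDP, TCP, Homa, Tonic };
enum class LoadBalancer { Static, RoundRobin, Random, AppSpecific };
enum class HostIface    { PcieDoorbell, PcieMmio, CacheCoherent };

struct NicConfig {
    Transport    transport         = Transport::UDP;
    LoadBalancer load_balancer     = LoadBalancer::RoundRobin;
    HostIface    host_interface    = HostIface::CacheCoherent;
    int          num_flows         = 64;   // NIC flows exposed to software
    int          num_queues        = 16;   // hardware queues (threading model)
    int          threads_per_queue = 1;    // connection/thread/queue mapping
};

int main() {
    NicConfig cfg;                       // defaults above
    cfg.transport = Transport::Homa;     // e.g., swap the transport when re-synthesizing
    (void)cfg;
    return 0;
}
```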
Tight coupling
Dagger is based on a cache-coherent FPGA tightly-coupled with the host CPU
Inspired by soNUMA and a series of RDMA studies
An FPGA acting as a NUMA node
No DMAs are required to exchange data between NUMA nodes
No explicit MMIO requests
Minimal software overhead (see the coherence sketch after this list)
NUMA interconnects have lower latency
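A minimal sketch of the idea behind the coherence-based host-NIC interface. The layout and flag protocol here are assumptions for illustration (the real interface uses a coherent CPU-FPGA platform); the point is that the CPU writes the request into memory the NIC observes through the coherence protocol, so no DMA descriptors or MMIO doorbells are needed:

```cpp
// Simplified illustration of cache-coherent CPU -> NIC data exchange
// (layout and protocol are assumptions, not the actual interface).
#include <atomic>
#include <cstdint>
#include <cstring>

struct alignas(64) RpcCacheLine {        // one small RPC per cache line
    std::atomic<uint8_t> valid;          // written last; observed by the NIC via coherence
    uint8_t payload[63];
};

void send_rpc(RpcCacheLine* slot, const void* data, std::size_t len) {
    std::memcpy(slot->payload, data, len < 63 ? len : 63);
    // Release store: once the line is snooped or written back, the NIC sees a
    // complete request; no DMA descriptor and no explicit MMIO doorbell are needed.
    slot->valid.store(1, std::memory_order_release);
}

int main() {
    static RpcCacheLine slot{};          // stands in for the shared, NIC-visible region
    const char msg[] = "get_user";
    send_rpc(&slot, msg, sizeof(msg));
    return 0;
}
```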
Existing SmartNICs are based on PCIe! (introduces overheads)
Doorbell scheme
Multiple PCIe roundtrips
Expensive and CPU-inefficient rings based on MMIOs
Existing optimizations: combined descriptors and packets, packet writes with MMIOs, doorbell batching... (but they fail to eliminate these overheads; the doorbell flow is sketched below)
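For contrast, a sketch of the conventional PCIe doorbell flow. The structure and names are illustrative, not any specific NIC's driver; it shows the descriptor write, the expensive uncached MMIO doorbell, and the NIC-initiated DMA reads that Dagger's coherent interface avoids:

```cpp
// Conventional PCIe doorbell/descriptor-ring flow, sketched for contrast
// (structure and names are illustrative assumptions).
#include <cstdint>

struct TxDescriptor {
    uint64_t buffer_addr;   // host memory address holding the packet
    uint32_t length;
    uint32_t flags;
};

struct TxRing {
    TxDescriptor desc[256];
    uint32_t     tail = 0;
};

// Stand-in for a write to the NIC's memory-mapped doorbell register.
void mmio_write(volatile uint32_t* doorbell, uint32_t value) {
    *doorbell = value;      // uncached store over PCIe; expensive for the CPU
}

void post_packet(TxRing& ring, volatile uint32_t* doorbell,
                 uint64_t buf, uint32_t len) {
    ring.desc[ring.tail] = {buf, len, /*flags=*/0};   // 1. write the descriptor
    ring.tail = (ring.tail + 1) % 256;
    mmio_write(doorbell, ring.tail);                  // 2. ring the doorbell (MMIO)
    // 3. the NIC DMA-reads the descriptor, then 4. DMA-reads the payload:
    //    additional PCIe round trips before the packet leaves the host.
}

int main() {
    TxRing ring;
    uint32_t fake_doorbell = 0;          // stands in for the NIC's BAR register
    post_packet(ring, &fake_doorbell, /*buf=*/0x1000, /*len=*/64);
    return 0;
}
```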