CliqueMap: Productionizing an RMA-Based Distributed Caching System
https://dl.acm.org/doi/pdf/10.1145/3452296.3472934
Basics
Caching: distributed caching
Hierarchical structure
Distributed caching
At the application layer
RPC
Remote procedure call: lets machines in a distributed system talk to each other
gRPC: built on protocol buffers
Fewer restrictions, ease of programming
RMA / RDMA
Remote memory access
Offloads the execution path from the CPU to a hardware or software NIC
Restrictions on the size of memory that can be accessed
Primitives are not easy for programmers to use
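The contrast between the two paths can be sketched in a few lines. This is only a toy model in one process, not real RMA: the `Server` class, the 16-byte slot layout, and the `rma_read` helper are all hypothetical names chosen for illustration. The point is that the RPC path runs arbitrary server code, while the RMA-style path only copies a bounded byte range and leaves the client to interpret it.

```python
import struct


class Server:
    """Toy server owning a fixed-size memory region (hypothetical layout)."""

    SLOT = 16  # bytes per slot: 2-byte length prefix + up to 14 bytes of value

    def __init__(self, slots=4):
        self.region = bytearray(self.SLOT * slots)
        self.cpu_calls = 0  # how many times server-side code has run

    def rpc_set(self, slot, value: bytes):
        """RPC path: flexible, but executes on the server CPU."""
        self.cpu_calls += 1
        if len(value) > self.SLOT - 2:
            raise ValueError("value too large for slot")
        off = slot * self.SLOT
        # Pack length + value; struct pads the value with zero bytes.
        self.region[off:off + self.SLOT] = struct.pack(
            f"<H{self.SLOT - 2}s", len(value), value
        )


def rma_read(server, offset, length):
    """Stand-in for a one-sided read: copies bytes, runs no server code."""
    if offset < 0 or offset + length > len(server.region):
        raise IndexError("read outside the registered region")
    return bytes(server.region[offset:offset + length])


def client_get(server, slot):
    """The client must know the memory layout and decode it itself."""
    raw = rma_read(server, slot * Server.SLOT, Server.SLOT)
    (n,) = struct.unpack_from("<H", raw)
    return raw[2:2 + n]
```

In this sketch a `client_get` never increments `cpu_calls`, which mirrors the efficiency argument for RMA, and the fixed slot size and manual decoding mirror its restrictions.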
Intro / Summary
In-memory KV caching / serving systems are crucial building blocks of user-facing services throughout the industry (e.g., Twemcache [OSDI20], CacheLib [OSDI20])
Remote memory access (RMA)
Benefits: performance / efficiency benefits
Downsides: limited programmability / narrow primitives
Production challenges
Deliver high availability and low cost
Balance CPU- and RAM-efficiency
Evolving the system over time
Multi-language serving ecosystems
Navigating heterogeneous datacenters
How do we productionize an RMA-based distributed caching system?
Less compute + latency benefits (10s of ms)
Throughput? Different customers have different challenges
Replication: multiple copies of the same piece of data
Aware of the topology
Lookups: RMA (the accelerated path)
Mutations and other metadata management: RPC
Extensibility and ease of programming
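The split described above (lookups over RMA, mutations over RPC, with replicated data) can be illustrated with a small in-process sketch. It is a simplification under stated assumptions, not the paper's actual protocol: `Backend`, `Client`, `rma_lookup`, the hash-based replica placement, and the client-assigned version numbers are all hypothetical.

```python
import hashlib


class Backend:
    """Toy cache backend; `memory` stands in for an RMA-readable region."""

    def __init__(self):
        self.memory = {}  # key -> (version, value)
        self.rpcs = 0     # number of RPCs handled by this backend's CPU

    def rpc_set(self, key, value, version):
        """Mutation path: runs on the backend, keeps the newest version."""
        self.rpcs += 1
        current = self.memory.get(key)
        if current is None or version > current[0]:
            self.memory[key] = (version, value)


def rma_lookup(backend, key):
    """Stand-in for a one-sided read: no backend code runs, no RPC counted."""
    return backend.memory.get(key)


class Client:
    def __init__(self, backends, replicas=2):
        self.backends = backends
        self.replicas = replicas
        self.version = 0  # assumption: a single client assigns versions

    def _replicas_for(self, key):
        # Hypothetical placement: hash the key, take consecutive backends.
        digest = hashlib.sha256(key.encode()).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.backends)
        n = len(self.backends)
        return [self.backends[(start + i) % n] for i in range(self.replicas)]

    def set(self, key, value):
        """Mutations go over RPC to every replica of the key."""
        self.version += 1
        for backend in self._replicas_for(key):
            backend.rpc_set(key, value, self.version)

    def get(self, key):
        """Lookups read all replicas directly and keep the newest version."""
        hits = []
        for backend in self._replicas_for(key):
            entry = rma_lookup(backend, key)
            if entry is not None:
                hits.append(entry)
        if not hits:
            return None
        return max(hits, key=lambda entry: entry[0])[1]
```

With three backends and two replicas per key, a `get` still succeeds if one replica has lost the entry, and only `set` calls ever consume backend CPU in this model.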