CliqueMap: Productionizing an RMA-Based Distributed Caching System

https://dl.acm.org/doi/pdf/10.1145/3452296.3472934

Basics

  • Caching: distributed caching

    • Caches form a hierarchy (fast local tiers in front of slower backing stores)

    • Distributed caching spreads cached data across many machines

      • Operates at the application layer

  • RPC

    • Remote procedure call: lets machines in a distributed system invoke operations on one another

    • gRPC: RPC framework that serializes messages with protocol buffers

    • Fewer restrictions and greater ease of programming (relative to RMA)

  • RMA / RDMA

    • Remote memory access: a client reads or writes a server's memory directly

    • Offloads the access path from the server CPU to a hardware or software NIC

      • Restrictions on the size and layout of memory that can be accessed per operation

    • Primitives are narrow and not easy for programmers to use (contrasted with RPC in the sketch after this list)
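
To make the RPC-vs-RMA contrast above concrete, here is a minimal, hypothetical Python sketch (not from the paper): the RPC path asks the server's CPU to run the lookup, while the RMA path is simulated as the client reading a fixed-size slot straight out of a pre-registered memory region, with no server-side code on the access path.

```python
# Hypothetical sketch contrasting RPC-style and RMA-style access.
# "RMA" is simulated with direct reads of a pre-registered byte buffer;
# a real deployment would use RDMA verbs or a software NIC instead.
import struct

SLOT_SIZE = 64  # RMA reads target fixed-size, pre-registered slots


class Server:
    def __init__(self, num_slots: int):
        # Memory region the client may read remotely ("registered" for RMA).
        self.region = bytearray(num_slots * SLOT_SIZE)
        self.table = {}  # server-side index used only by the RPC path

    # RPC path: the server's CPU executes the lookup on every GET.
    def rpc_get(self, key: str):
        return self.table.get(key)

    # Writes keep both the RPC index and the RMA-readable region in sync.
    def rpc_set(self, key: str, value: bytes):
        slot = hash(key) % (len(self.region) // SLOT_SIZE)
        payload = struct.pack("!H", len(value)) + value  # length-prefixed
        start = slot * SLOT_SIZE
        self.region[start:start + len(payload)] = payload
        self.table[key] = value


# RMA path: the client reads remote memory directly; no server code runs.
def rma_get(region: bytearray, key: str) -> bytes:
    slot = hash(key) % (len(region) // SLOT_SIZE)
    raw = region[slot * SLOT_SIZE:(slot + 1) * SLOT_SIZE]  # one "RMA READ"
    (length,) = struct.unpack("!H", raw[:2])
    return bytes(raw[2:2 + length])


server = Server(num_slots=1024)
server.rpc_set("user:42", b"profile-blob")
assert rma_get(server.region, "user:42") == server.rpc_get("user:42")
```

The fixed slot size stands in for RMA's restriction that reads target bounded, pre-registered regions; a real deployment would issue RDMA or software-NIC operations instead of slicing a local buffer.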

Intro / Summary

  • In-memory KV caching / serving systems are crucial building blocks of user-facing services throughout the industry (e.g., Twemcache [OSDI20], CacheLib [OSDI20])

  • Remote memory access (RMA)

    • Benefits: performance and efficiency gains

    • Downsides: limited programmability / narrow primitives

    • Production challenges

      • Delivering high availability at low cost

      • Balancing CPU and RAM efficiency

      • Evolving the system over time

      • Supporting multi-language serving ecosystems

      • Navigating heterogeneous datacenters

  • How do we productionize an RMA-based distributed caching system?

    • Lower compute cost plus latency benefits (RMA GETs complete in tens of microseconds)

    • Throughput needs vary: different customers bring different workloads and challenges

  • Replication: multiple copies of the same piece of data

    • Placement is aware of datacenter topology (see the placement sketch after this list)

  • Lookups: served over RMA to accelerate the hot GET path

  • RPC: handles mutations and other metadata management

    • Preserves extensibility and ease of programming (see the hybrid GET/SET sketch after this list)
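
The replication bullets are terse, so here is a minimal, hypothetical sketch of what "aware of the topology" can mean in practice: pick R replicas per key such that no two land in the same failure domain. The rendezvous-style hashing and the Backend / failure_domain names are illustrative assumptions, not the paper's placement algorithm.

```python
# Hypothetical topology-aware replica placement: choose r backends per
# key so that no two replicas share a failure domain (e.g., a rack).
# Names and fields are illustrative, not CliqueMap's actual scheme.
from dataclasses import dataclass
import zlib


@dataclass(frozen=True)
class Backend:
    name: str
    failure_domain: str  # e.g., rack or power-domain identifier


def place_replicas(key: str, backends: list[Backend], r: int) -> list[Backend]:
    """Pick r backends for `key`, spread across distinct failure domains."""
    h = zlib.crc32(key.encode())
    # Deterministic, key-dependent ordering over all backends
    # (rendezvous-style hashing).
    ordered = sorted(backends, key=lambda b: zlib.crc32(f"{b.name}:{h}".encode()))
    chosen, used_domains = [], set()
    for b in ordered:
        if b.failure_domain in used_domains:
            continue  # a replica already lives in this failure domain
        chosen.append(b)
        used_domains.add(b.failure_domain)
        if len(chosen) == r:
            break
    return chosen


backends = [
    Backend("cache-0", "rack-a"), Backend("cache-1", "rack-a"),
    Backend("cache-2", "rack-b"), Backend("cache-3", "rack-c"),
]
replicas = place_replicas("user:42", backends, r=2)
assert len({b.failure_domain for b in replicas}) == 2  # distinct domains
```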
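
The lookup/mutation split above can be sketched as a hybrid backend: GETs take an RMA-style path (hash to an index bucket, read the bucket, follow it to the data, validate), while SETs and metadata changes go through an RPC handler that keeps the index consistent. The bucket layout, checksum validation, and retry-on-mismatch behavior below are illustrative assumptions, not CliqueMap's wire format; remote reads are again simulated with direct buffer access.

```python
# Hypothetical hybrid cache backend: RMA-style GETs over an index region
# plus a data region, RPC-style SETs. Layout and validation are
# illustrative assumptions, not CliqueMap's actual data structures.
import struct
import zlib

NUM_BUCKETS = 256
ENTRY_FMT = "!IIII"          # key hash, data offset, data length, checksum
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)


class Backend:
    def __init__(self):
        self.index = bytearray(NUM_BUCKETS * ENTRY_SIZE)  # RMA-readable index
        self.data = bytearray(1 << 20)                     # RMA-readable data
        self.cursor = 0

    # RPC path: mutations write the data first, then publish the index entry.
    def rpc_set(self, key: str, value: bytes):
        off = self.cursor
        self.data[off:off + len(value)] = value
        self.cursor += len(value)
        h = zlib.crc32(key.encode())
        bucket = h % NUM_BUCKETS
        entry = struct.pack(ENTRY_FMT, h, off, len(value), zlib.crc32(value))
        self.index[bucket * ENTRY_SIZE:(bucket + 1) * ENTRY_SIZE] = entry


# Client-side GET: two simulated RMA reads (index bucket, then data),
# validated with the stored checksum; a mismatch would trigger a retry
# or a fallback to the RPC path.
def rma_get(backend: Backend, key: str):
    h = zlib.crc32(key.encode())
    bucket = h % NUM_BUCKETS
    raw = backend.index[bucket * ENTRY_SIZE:(bucket + 1) * ENTRY_SIZE]  # read 1
    stored_hash, off, length, checksum = struct.unpack(ENTRY_FMT, raw)
    if stored_hash != h:
        return None                       # empty bucket or a different key
    value = bytes(backend.data[off:off + length])                       # read 2
    if zlib.crc32(value) != checksum:
        return None                       # torn read: caller retries
    return value


backend = Backend()
backend.rpc_set("session:7", b"token-abc")
assert rma_get(backend, "session:7") == b"token-abc"
```

In production, the SET/RPC path would also drive replication across the topology-aware placement sketched above and handle re-indexing and other metadata work, which is exactly the kind of logic that is easier to evolve over RPC than over raw RMA primitives.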
