CliqueMap: Productionizing an RMA-Based Distributed Caching System

https://dl.acm.org/doi/pdf/10.1145/3452296.3472934

Basics

  • Caching: distributed caching

    • Caches form a hierarchy (fast local tiers in front of slower backing stores)

    • Distributed caching spreads cached data across many machines

      • Operates at the application layer

  • RPC

    • Remote procedure call: lets machines in a distributed system invoke operations on one another

    • gRPC: RPC framework that serializes messages with protocol buffers

    • Fewer restrictions and greater ease of programming (relative to RMA)

  • RMA / RDMA

    • Remote memory access: a client reads or writes a server's memory directly

    • Offloads the access path from the server CPU to a hardware or software NIC

      • Restrictions on the size and layout of memory that can be accessed per operation

    • Primitives are narrow and not easy for programmers to use (contrasted with RPC in the sketch after this list)
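
To make the RPC-vs-RMA contrast above concrete, here is a minimal, hypothetical Python sketch (not from the paper): the RPC path asks the server's CPU to run the lookup, while the RMA path is simulated as the client reading a fixed-size slot straight out of a pre-registered memory region, with no server-side code on the access path.

```python
# Hypothetical sketch contrasting RPC-style and RMA-style access.
# "RMA" is simulated with direct reads of a pre-registered byte buffer;
# a real deployment would use RDMA verbs or a software NIC instead.
import struct

SLOT_SIZE = 64  # RMA reads target fixed-size, pre-registered slots


class Server:
    def __init__(self, num_slots: int):
        # Memory region the client may read remotely ("registered" for RMA).
        self.region = bytearray(num_slots * SLOT_SIZE)
        self.table = {}  # server-side index used only by the RPC path

    # RPC path: the server's CPU executes the lookup on every GET.
    def rpc_get(self, key: str):
        return self.table.get(key)

    # Writes keep both the RPC index and the RMA-readable region in sync.
    def rpc_set(self, key: str, value: bytes):
        slot = hash(key) % (len(self.region) // SLOT_SIZE)
        payload = struct.pack("!H", len(value)) + value  # length-prefixed
        start = slot * SLOT_SIZE
        self.region[start:start + len(payload)] = payload
        self.table[key] = value


# RMA path: the client reads remote memory directly; no server code runs.
def rma_get(region: bytearray, key: str) -> bytes:
    slot = hash(key) % (len(region) // SLOT_SIZE)
    raw = region[slot * SLOT_SIZE:(slot + 1) * SLOT_SIZE]  # one "RMA READ"
    (length,) = struct.unpack("!H", raw[:2])
    return bytes(raw[2:2 + length])


server = Server(num_slots=1024)
server.rpc_set("user:42", b"profile-blob")
assert rma_get(server.region, "user:42") == server.rpc_get("user:42")
```

The fixed slot size stands in for RMA's restriction that reads target bounded, pre-registered regions; a real deployment would issue RDMA or software-NIC operations instead of slicing a local buffer.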

Intro / Summary

  • In-memory KV caching / serving systems are crucial building blocks of user-facing services throughout the industry (e.g., Twemcache [OSDI20], CacheLib [OSDI20])

  • Remote memory access (RMA)

    • Benefits: performance and efficiency gains

    • Downsides: limited programmability / narrow primitives

    • Production challenges

      • Delivering high availability at low cost

      • Balancing CPU and RAM efficiency

      • Evolving the system over time

      • Supporting multi-language serving ecosystems

      • Navigating heterogeneous datacenters

  • How do we productionize an RMA-based distributed caching system?

    • Lower compute cost plus latency benefits (RMA GETs complete in tens of microseconds)

    • Throughput needs vary: different customers bring different workloads and challenges

  • Replication: multiple copies of the same piece of data

    • Placement is aware of datacenter topology (see the placement sketch after this list)

  • Lookups: served over RMA to accelerate the hot GET path

  • RPC: handles mutations and other metadata management

    • Preserves extensibility and ease of programming (see the hybrid GET/SET sketch after this list)
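
The replication bullets are terse, so here is a minimal, hypothetical sketch of what "aware of the topology" can mean in practice: pick R replicas per key such that no two land in the same failure domain. The rendezvous-style hashing and the Backend / failure_domain names are illustrative assumptions, not the paper's placement algorithm.

```python
# Hypothetical topology-aware replica placement: choose r backends per
# key so that no two replicas share a failure domain (e.g., a rack).
# Names and fields are illustrative, not CliqueMap's actual scheme.
from dataclasses import dataclass
import zlib


@dataclass(frozen=True)
class Backend:
    name: str
    failure_domain: str  # e.g., rack or power-domain identifier


def place_replicas(key: str, backends: list[Backend], r: int) -> list[Backend]:
    """Pick r backends for `key`, spread across distinct failure domains."""
    h = zlib.crc32(key.encode())
    # Deterministic, key-dependent ordering over all backends
    # (rendezvous-style hashing).
    ordered = sorted(backends, key=lambda b: zlib.crc32(f"{b.name}:{h}".encode()))
    chosen, used_domains = [], set()
    for b in ordered:
        if b.failure_domain in used_domains:
            continue  # a replica already lives in this failure domain
        chosen.append(b)
        used_domains.add(b.failure_domain)
        if len(chosen) == r:
            break
    return chosen


backends = [
    Backend("cache-0", "rack-a"), Backend("cache-1", "rack-a"),
    Backend("cache-2", "rack-b"), Backend("cache-3", "rack-c"),
]
replicas = place_replicas("user:42", backends, r=2)
assert len({b.failure_domain for b in replicas}) == 2  # distinct domains
```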
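
The lookup/mutation split above can be sketched as a hybrid backend: GETs take an RMA-style path (hash to an index bucket, read the bucket, follow it to the data, validate), while SETs and metadata changes go through an RPC handler that keeps the index consistent. The bucket layout, checksum validation, and retry-on-mismatch behavior below are illustrative assumptions, not CliqueMap's wire format; remote reads are again simulated with direct buffer access.

```python
# Hypothetical hybrid cache backend: RMA-style GETs over an index region
# plus a data region, RPC-style SETs. Layout and validation are
# illustrative assumptions, not CliqueMap's actual data structures.
import struct
import zlib

NUM_BUCKETS = 256
ENTRY_FMT = "!IIII"          # key hash, data offset, data length, checksum
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)


class Backend:
    def __init__(self):
        self.index = bytearray(NUM_BUCKETS * ENTRY_SIZE)  # RMA-readable index
        self.data = bytearray(1 << 20)                     # RMA-readable data
        self.cursor = 0

    # RPC path: mutations write the data first, then publish the index entry.
    def rpc_set(self, key: str, value: bytes):
        off = self.cursor
        self.data[off:off + len(value)] = value
        self.cursor += len(value)
        h = zlib.crc32(key.encode())
        bucket = h % NUM_BUCKETS
        entry = struct.pack(ENTRY_FMT, h, off, len(value), zlib.crc32(value))
        self.index[bucket * ENTRY_SIZE:(bucket + 1) * ENTRY_SIZE] = entry


# Client-side GET: two simulated RMA reads (index bucket, then data),
# validated with the stored checksum; a mismatch would trigger a retry
# or a fallback to the RPC path.
def rma_get(backend: Backend, key: str):
    h = zlib.crc32(key.encode())
    bucket = h % NUM_BUCKETS
    raw = backend.index[bucket * ENTRY_SIZE:(bucket + 1) * ENTRY_SIZE]  # read 1
    stored_hash, off, length, checksum = struct.unpack(ENTRY_FMT, raw)
    if stored_hash != h:
        return None                       # empty bucket or a different key
    value = bytes(backend.data[off:off + length])                       # read 2
    if zlib.crc32(value) != checksum:
        return None                       # torn read: caller retries
    return value


backend = Backend()
backend.rpc_set("session:7", b"token-abc")
assert rma_get(backend, "session:7") == b"token-abc"
```

In production, the SET/RPC path would also drive replication across the topology-aware placement sketched above and handle re-indexing and other metadata work, which is exactly the kind of logic that is easier to evolve over RPC than over raw RMA primitives.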
