Owl: Scale and Flexibility in Distribution of Hot Content

https://www.usenix.org/conference/osdi22/presentation/flinn

What is the problem that is being solved?

  • Content distribution

    • Meta’s private cloud, large objects read by many consumers

  • Not a “one-size-fits-all” problem

    • Scale: an object may be read by a few or millions of processes

    • Size: objects may be a few MBs or a few TBs

    • Hot data: clients may read the same object seconds or hours apart

  • Exacting requirements of content distribution: three dimensions

    • Fast

      • deliver at available network or disk bandwidth

    • Efficient

      • Scalability: # of clients that can have their distribution needs met by a given number of servers

        • ~ millions of client processes

      • Network usage

        • Bytes transmitted ( --> tput?)

        • Communication locality (i.e. in-rack less costly than cross-region transfer)

      • Resource usage on client machines: e.g. CPU cycles, memory, disk I/O

        • Q: what does client mean? multi-tenant? do clients agree to share spare resources for transfers?

    • Reliable

      • Highly available

      • React quickly to workload changes and outages

      • Metric: percentage of download requests that the distribution system satisfies within a latency SLA (even under partial outages and performance faults)

What are the key innovations over prior work?

Prior works

  • At least 3 prior distribution systems used at Meta

    • Hierarchical caching

    • BitTorrent

    • Static distribution tree

  • Common problems

    • Poor balance between centralization and decentralization

    • Not enough flexibility to meet all requirements of many different types of services at Meta

      • Customized for particular use cases

      • Hard to manage & operate

Hierarchical caching

  • Quotas: if set too high, not enough statistical multiplexing; if set too low, clients get throttled and can’t get data when they need it

    • Second and third question? Are they solved by this paper's approach?

BitTorrent

  • Seeder: getting data from data source, cache it locally

  • Other peer: p2p distribution

  • Cons

    • Decentralized: no global view of the overall distribution

    • Changes take a long time to propagate

    • Hard to aggregate decisions

Key points: centralized control plane + decentralized data plane

  1. Centralization vs. decentralization (see the sketch after this list)

    1. Data plane: highly decentralized

    2. Control plane: centralized

    3. Components: peers (client library), superpeers (dedicated caching nodes), trackers (control plane)

  2. Ephemeral distribution tree
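
Rough sketch of the split (hypothetical stubs and names, not Owl’s actual API): the tracker is the centralized control plane that decides, per chunk, where each peer fetches; peers form the decentralized data plane and move the bytes themselves.

```python
# Sketch of Owl's control/data split -- hypothetical stubs, not Owl's actual API.
from dataclasses import dataclass

@dataclass
class FetchInstruction:
    chunk_id: str
    source: str        # another peer, a superpeer, or the external data source
    retry_limit: int   # the tracker also dictates retry behavior


class Tracker:
    """Centralized control plane: decides where each peer fetches each chunk."""
    def __init__(self, chunk_map):
        self.chunk_map = chunk_map      # object_id -> [chunk_id, ...]
        self.cached = {}                # chunk_id -> set of peers caching it

    def list_chunks(self, object_id):
        return self.chunk_map[object_id]

    def get_fetch_instruction(self, peer_id, chunk_id):
        holders = self.cached.get(chunk_id, set()) - {peer_id}
        source = next(iter(holders), "external-data-source")  # prefer a caching peer
        return FetchInstruction(chunk_id, source, retry_limit=3)

    def report_cached(self, peer_id, chunk_id):
        self.cached.setdefault(chunk_id, set()).add(peer_id)


class Peer:
    """Decentralized data plane: bytes move peer-to-peer, never through the tracker."""
    def __init__(self, peer_id, tracker, fetch_fn):
        self.peer_id, self.tracker, self.fetch_fn = peer_id, tracker, fetch_fn
        self.cache = {}

    def download(self, object_id):
        data = []
        for chunk_id in self.tracker.list_chunks(object_id):
            instr = self.tracker.get_fetch_instruction(self.peer_id, chunk_id)
            chunk = self.fetch_fn(instr.source, chunk_id)       # p2p (or source) transfer
            self.cache[chunk_id] = chunk
            self.tracker.report_cached(self.peer_id, chunk_id)  # tracker sees all state
            data.append(chunk)
        return b"".join(data)


if __name__ == "__main__":
    tracker = Tracker({"model.bin": ["c0", "c1"]})
    fetch = lambda src, cid: f"{cid}@{src}".encode()
    print(Peer("peer-A", tracker, fetch).download("model.bin"))
    print(Peer("peer-B", tracker, fetch).download("model.bin"))  # B is steered to peer-A
```

The point of the split: the tracker sees all cache/download state and can build ephemeral distribution trees, while data never flows through it.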

Scaling the control plane

  • Scale the tracker

    • For each chunk, tracker metadata specifies which peers are caching the chunk and which are downloading it. Tracker metadata also specifies the source of each peer’s download (e.g., an external source or another peer)

    • Optimizations to scale the tracker

      • Large chunk size (via streaming and saving partial work)

      • Geographically-sorted indexes (for quick lookups)

    • Benchmark results for a single tracker: 1.5-2.4 TB/s

    • Scale further

      • Shard detailed state across multiple trackers

      • Sharding by peer lets a single tracker manage all cache state for a given peer

      • But an ephemeral distribution tree may now span peers sharded across trackers

  • Tracker sharding

    • Key idea: provide two levels of resolution for metadata (see the sketch after this list)

      • Peers managed by local tracker: chunks --> peers

      • Peers managed by remote tracker: chunks --> trackers

    • Each tracker broadcasts which chunks its peers have

    • Trackers can delegate finding a source to another tracker

    • 97.5% of delegations succeed, reducing cache misses by 3x
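
Toy sketch of the two-level metadata + delegation idea (data structures and names are mine, not the paper’s): each tracker keeps chunk → peers only for the peers it manages, keeps just chunk → trackers for everyone else, and delegates source-finding one hop when only a remote tracker knows the holders.

```python
# Two-level tracker metadata and delegation -- hypothetical sketch, not Owl's real structures.
from collections import defaultdict

class Tracker:
    def __init__(self, name, peer_region):
        self.name = name
        self.peer_region = peer_region    # peer_id -> region, for geo-sorted source selection
        self.local = defaultdict(set)     # fine-grained:   chunk_id -> peer_ids this tracker manages
        self.remote = defaultdict(set)    # coarse-grained: chunk_id -> names of other trackers
        self.tracker_rpc = {}             # tracker name -> Tracker (stands in for an RPC stub)

    def report_cached(self, peer_id, chunk_id, all_trackers):
        """A peer tells its own tracker that it now caches chunk_id."""
        self.local[chunk_id].add(peer_id)
        # Broadcast only "some of my peers hold this chunk", not which peers (keeps state small).
        for t in all_trackers:
            if t is not self:
                t.remote[chunk_id].add(self.name)

    def find_source(self, chunk_id, requester_region, may_delegate=True):
        # 1. Prefer a peer this tracker manages, same-region holders first.
        holders = sorted(self.local[chunk_id],
                         key=lambda p: self.peer_region[p] != requester_region)
        if holders:
            return holders[0]
        # 2. Otherwise delegate (one hop) to a tracker that advertised the chunk.
        if may_delegate:
            for tracker_name in self.remote[chunk_id]:
                peer = self.tracker_rpc[tracker_name].find_source(
                    chunk_id, requester_region, may_delegate=False)
                if peer is not None:
                    return peer
        # 3. Fall back to the external data source (a cache miss for the p2p layer).
        return None

if __name__ == "__main__":
    a, b = Tracker("A", {"p1": "us-east"}), Tracker("B", {"p2": "us-west"})
    a.tracker_rpc, b.tracker_rpc = {"B": b}, {"A": a}
    b.report_cached("p2", "chunk-7", all_trackers=[a, b])
    print(a.find_source("chunk-7", "us-east"))   # -> "p2", found via delegation to tracker B
```

This only shows the lookup path; real trackers also handle churn, failures, and geographically-sorted indexes over many peers.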

Tracker: the need for flexibility (policy design)

  • Customizable policy modules

    • Complex behavior by composing simple peer commands

    • Selection: where to fetch, how to retry, when to give up (see the selection-policy sketch after this list)

      • Need to consider: max outflows, bandwidth constraints etc.

      • Examples

        • Location-aware: choose the nearest peer / superpeer

        • Hot/cold: hot data --> superpeer, cold data --> directly from the data source

    • Cache: memory or disk, replacement, sharing data with apps

      • Examples

        • Local: client has its own local policy

        • Replacement policies: LRU, least-rare (i.e. based on # of clients holding the chunk), TTL, random, etc.

  • Emulation for finding the right policy (see the replay sketch after this list)

    • Emulation: recording

      • Recording captures raw demand (as if peers fetched directly from the data source): which chunks were fetched, and peer lifetimes

    • Emulation: replay

      • Emulates peers, superpeers, and the network

      • Searches the space of policies and reports key metrics; runs on all new workloads, and periodically on every existing workload

  • Scales to 106 client types with 55 unique policies
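
Sketch of what pluggable selection policies could look like (interface, names, and thresholds are made up for illustration, not Owl’s policy API): the tracker composes simple peer commands (fetch from X, retry, give up) and swaps in policies such as location-aware or hot/cold routing.

```python
# Pluggable selection policies -- illustrative only; interface and names are not Owl's.
from dataclasses import dataclass
from typing import List, Optional, Protocol, Tuple

@dataclass
class ChunkState:
    chunk_id: str
    holders: List[Tuple[str, int]]   # (peer_id, network distance), nearest first
    recent_downloads: int            # demand signal the tracker already has

class SelectionPolicy(Protocol):
    def pick_source(self, state: ChunkState) -> Optional[str]: ...

class LocationAware:
    """Prefer the nearest peer/superpeer that already caches the chunk."""
    def pick_source(self, state: ChunkState) -> Optional[str]:
        return state.holders[0][0] if state.holders else None

class HotCold:
    """Hot chunks go through a superpeer; cold chunks go straight to the data source."""
    def __init__(self, superpeer: str, hot_threshold: int = 100):
        self.superpeer, self.hot_threshold = superpeer, hot_threshold
    def pick_source(self, state: ChunkState) -> Optional[str]:
        return self.superpeer if state.recent_downloads >= self.hot_threshold else None

def next_command(policy: SelectionPolicy, state: ChunkState, attempt: int, max_attempts: int = 3):
    """Tracker-side composition of simple peer commands: fetch, retry, or give up."""
    if attempt >= max_attempts:
        return ("give_up", state.chunk_id)
    source = policy.pick_source(state) or "external-data-source"
    return ("fetch", state.chunk_id, source)

if __name__ == "__main__":
    state = ChunkState("c9", holders=[("rack-peer-3", 1), ("region-peer-8", 50)],
                       recent_downloads=240)
    print(next_command(LocationAware(), state, attempt=0))       # fetch from rack-peer-3
    print(next_command(HotCold("superpeer-us-east"), state, 0))  # hot object -> superpeer
```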
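
And a toy version of the record/replay idea (trace format, metrics, and policies are simplified guesses; the real emulator also models superpeers, the network, and peer lifetimes): replay the recorded chunk requests against candidate cache-replacement policies and compare the resulting metrics.

```python
# Trace-driven policy emulation -- simplified sketch; not the paper's emulator.
import random
from collections import namedtuple

TraceEvent = namedtuple("TraceEvent", "time peer_id chunk_id")  # recorded in production

def replay(trace, evict, cache_size):
    """Replay one recorded trace against a candidate replacement policy; report key metrics."""
    caches = {}                  # peer_id -> {chunk_id: last access time}
    hits = misses = 0
    for ev in trace:
        cache = caches.setdefault(ev.peer_id, {})
        if ev.chunk_id in cache:
            hits += 1
        else:
            misses += 1          # emulated fetch from another peer or the external source
            if len(cache) >= cache_size:
                cache.pop(evict(cache))
        cache[ev.chunk_id] = ev.time
    return {"hit_rate": round(hits / max(hits + misses, 1), 3), "misses": misses}

# Candidate replacement policies to search over (LRU and random shown; least-rare/TTL omitted).
def lru(cache):
    return min(cache, key=cache.get)

def random_evict(cache):
    return random.choice(list(cache))

if __name__ == "__main__":
    trace = [TraceEvent(t, f"peer-{t % 3}", f"chunk-{t % 5}") for t in range(300)]
    for name, policy in [("lru", lru), ("random", random_evict)]:
        print(name, replay(trace, policy, cache_size=3))
```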

What are the key results?

Traffic vs. Servers

BitTorrent vs. Owl

Other

  • Distribute 800PB of hot content per day to millions of peers at Meta

  • Decentralized p2p data plane with a highly-centralized control plane in which the trackers make detailed decisions for peers such as

    • Where to fetch each chunk of data

    • How to retry failed fetches

    • Which chunks to cache in peer memory and storage

  • Highly customizable through tracker policies

Conclusions

  • Lessons

    • Decentralized data plane, centralized control plane

    • Policy / mechanism split --> keep clients simple

    • Flexible policies --> different use cases

    • Emulation --> policy selection

  • Future

    • Streaming data / push notifications (vs. pull)

    • Incremental updates of AI models

    • Distribution to the edge
