Owl: Scale and Flexibility in Distribution of Hot Content
https://www.usenix.org/conference/osdi22/presentation/flinn
Content distribution
Meta’s private cloud, large objects read by many consumers
Not a “one-size-fits-all” problem
Scale: an object may be read by a few or millions of processes
Size: objects may be a few MBs or a few TBs
Hot data: clients may read the same object seconds apart or hours apart
Exacting requirements of content distribution: three dimensions
Fast
deliver at available network or disk bandwidth
Efficient
Scalability: # of clients that can have their distribution needs met by a given number of servers
~ millions of client processes
Network usage
Bytes transmitted ( --> tput?)
Communication locality (i.e. in-rack less costly than cross-region transfer)
Resource usage on client machines: e.g. CPU cycles, memory, disk I/O
Q: what does "client" mean here? Multi-tenant? Does the client agree to share spare resources for transfers?
Reliable
Highly available
React quickly to workload changes and outages
I.e. the percentage of download requests that the distribution system satisfies within a latency SLA, even under partial outages and performance faults
Prior works
At least 3 prior distribution systems used at Meta
Hierarchical caching
BitTorrent
Static distribution tree
Common problems
Poor balance between centralization and decentralization
Not enough flexibility to meet all requirements of many different types of services at Meta
Customized for particular use cases
Hard to manage & operate
Quotas: if set too high, not enough statistical multiplexing; if set too low, clients are throttled and can't get data when they need it
Second and third question? Are they solved by this paper's approach?
Seeder: gets data from the data source and caches it locally
Other peers: p2p distribution
Cons
Decentralized: no view of the overall distribution state
Changes take a long time to propagate
Hard to aggregate decisions
Centralization vs. Decentralization
Data plane: highly decentralized
Control plane: centralized
Components
Ephemeral distribution tree
Scale the tracker
For each chunk, tracker metadata specifies which peers are caching the chunk and which are downloading it; it also records the source of each peer's download (e.g., an external source or another peer); see the sketch below
Optimizations to scale the tracker
Large chunk size (via streaming and saving partial work)
Geographically-sorted indexes (for quick lookups)
Benchmark results for a single tracker: 1.5-2.4 TB/s
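A minimal sketch of what this per-chunk tracker state and a geographically-indexed lookup might look like; names such as TrackerMetadata, record_cached, and pick_source are illustrative assumptions, not Owl's actual API:

```python
from collections import defaultdict

class ChunkState:
    """Tracker-side state for one chunk."""
    def __init__(self):
        self.cachers = defaultdict(list)   # region -> [peer_id, ...] caching the chunk
        self.downloaders = {}              # peer_id -> assigned source ("external" or a peer_id)

class TrackerMetadata:
    def __init__(self):
        self.chunks = defaultdict(ChunkState)  # chunk_id -> ChunkState

    def record_download(self, chunk_id, peer_id, source):
        # Remember where this peer was told to fetch the chunk from.
        self.chunks[chunk_id].downloaders[peer_id] = source

    def record_cached(self, chunk_id, peer_id, region):
        # Download finished: the peer now caches the chunk.
        state = self.chunks[chunk_id]
        state.downloaders.pop(peer_id, None)
        state.cachers[region].append(peer_id)

    def pick_source(self, chunk_id, requester_region):
        """Prefer a cacher in the requester's region, then any cacher;
        None means the caller falls back to the external data source."""
        state = self.chunks.get(chunk_id)
        if state is None:
            return None
        local = state.cachers.get(requester_region)
        if local:
            return local[0]
        for peers in state.cachers.values():
            if peers:
                return peers[0]
        return None

# Usage: peer-A downloads from the external source, then serves later requests.
md = TrackerMetadata()
md.record_download("chunk-0", "peer-A", "external")
md.record_cached("chunk-0", "peer-A", region="us-east")
assert md.pick_source("chunk-0", requester_region="us-east") == "peer-A"
```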
Scale further
Shard detailed state across multiple trackers
Sharding by peer lets a single tracker manage the cache state for a given peer
But ephemeral distribution trees are now sharded across trackers
Tracker sharding
Key idea: provide two levels of resolution for metadata
Peers managed by local tracker: chunks --> peers
Peers managed by remote tracker: chunks --> trackers
Each tracker broadcasts which chunks its peers have
Trackers can delegate finding a source to another tracker
97.5% of delegations succeed, improving the cache-miss rate by 3x
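A sketch of this two-level resolution plus delegation; the Tracker class and the broadcast/find_source names are hypothetical, and real trackers would use RPCs rather than direct method calls:

```python
class Tracker:
    def __init__(self, name):
        self.name = name
        self.local_chunks = {}    # fine resolution: chunk_id -> set of locally managed peers caching it
        self.remote_chunks = {}   # coarse resolution: chunk_id -> set of tracker names that reported it
        self.trackers = {}        # tracker name -> Tracker, stand-in for the delegation RPC channel

    def broadcast(self, others):
        """Tell other trackers which chunks our peers have (chunk -> tracker only)."""
        for other in others:
            other.trackers[self.name] = self
            for chunk_id in self.local_chunks:
                other.remote_chunks.setdefault(chunk_id, set()).add(self.name)

    def find_source(self, chunk_id):
        # 1. Fine resolution: a peer this tracker manages has the chunk.
        local = self.local_chunks.get(chunk_id)
        if local:
            return next(iter(local))
        # 2. Coarse resolution: delegate to a tracker that reported the chunk.
        for name in self.remote_chunks.get(chunk_id, ()):
            source = self.trackers[name].find_source(chunk_id)
            if source is not None:
                return source        # the paper reports ~97.5% of delegations succeed
        return None                  # fall back to the external data source

# Usage: tracker B resolves a chunk held by a peer managed by tracker A.
a, b = Tracker("A"), Tracker("B")
a.local_chunks["chunk-7"] = {"peer-1"}
a.broadcast([b])
assert b.find_source("chunk-7") == "peer-1"
```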
Customizable policy modules
Complex behavior by composing simple peer commands
Selection: where to fetch, how to retry, when to give up
Need to consider: max outflows, bandwidth constraints etc.
Examples
Location-aware: choose the nearest peer or superpeer
Hot/cold: hot data --> superpeer, cold data --> directly from the data source (see the sketch after this list)
Cache: memory or disk, replacement, sharing data with apps
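A hedged sketch of how the location-aware and hot/cold selection policies above could be composed; select_source, the hot_threshold parameter, and the "data-source" sentinel are illustrative assumptions, not Owl's interface:

```python
def select_source(requester_region, cachers, superpeers, request_rate, hot_threshold=100):
    """Decide where a requesting peer should fetch a chunk from.

    cachers:      list of (peer_id, region) pairs already caching the chunk
    superpeers:   list of (superpeer_id, region) dedicated cache-only peers
    request_rate: tracker-side estimate of recent requests/sec for this chunk
    """
    # Hot/cold policy: cold chunks go straight to the origin data source,
    # hot chunks are routed through superpeers/peers to shield the origin.
    if request_rate < hot_threshold:
        return "data-source"
    candidates = superpeers + cachers
    # Location-aware policy: prefer a candidate in the requester's region.
    for peer_id, region in candidates:
        if region == requester_region:
            return peer_id
    return candidates[0][0] if candidates else "data-source"

# Usage: a hot chunk is served by the in-region superpeer instead of the origin.
src = select_source(
    requester_region="eu-west",
    cachers=[("peer-9", "us-east")],
    superpeers=[("super-3", "eu-west")],
    request_rate=500,
)
assert src == "super-3"
```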
Examples
Local: client has its own local policy
Replacement policy: LRU, least-rare (rarity measured by the # of clients caching a chunk), TTL, random, etc.
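A small sketch of the least-rare replacement idea, assuming the tracker can supply per-chunk holder counts; the function and variable names are hypothetical:

```python
def evict_least_rare(cache, holder_counts):
    """Pick a victim chunk when the peer cache is full.

    cache:         dict chunk_id -> locally cached bytes
    holder_counts: dict chunk_id -> number of peers currently caching the chunk
                   (in practice supplied by the tracker)
    """
    # The chunk held by the most peers is the least rare, hence the cheapest
    # to re-fetch from other peers later, so it is evicted first.
    return max(cache, key=lambda cid: holder_counts.get(cid, 0))

# Usage: "c" is cached by 12 other peers, so it is the first to go.
cache = {"a": b"...", "b": b"...", "c": b"..."}
holders = {"a": 1, "b": 4, "c": 12}
assert evict_least_rare(cache, holders) == "c"
```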
Emulations for finding the right policy
Emulation: recording
Tracker sends peers directly to the data source; records chunks fetched and peer lifetimes
Emulation: replay
Replay simulates peers, superpeers, and the network
Sweep the search space of policies and report key metrics; runs on all new workloads and periodically on every workload (replay loop sketched below)
Scale to 106 client types, 55 unique policies
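A rough sketch of the record-and-replay emulation loop, assuming a trace of (timestamp, peer, chunk) fetch events has been recorded; the policy interface (on_fetch) and the single peer-hit metric are simplifications, since a real emulator would also report network usage, locality, and load:

```python
def replay(trace, policy):
    """Replay a recorded trace against one candidate policy.

    trace:  list of (timestamp, peer_id, chunk_id) fetch events recorded while
            peers were sent directly to the data source
    policy: object whose on_fetch(...) returns "peer-hit" or "origin"
    Returns the fraction of fetches the policy would have served from peers.
    """
    peer_hits = 0
    for ts, peer_id, chunk_id in trace:
        if policy.on_fetch(ts, peer_id, chunk_id) == "peer-hit":
            peer_hits += 1
    return peer_hits / len(trace) if trace else 0.0

def pick_policy(trace, candidate_policies):
    # Sweep the policy search space and keep the best-scoring policy; key
    # metrics beyond peer-hit ratio are omitted here for brevity.
    return max(candidate_policies, key=lambda p: replay(trace, p))
```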
Distribute 800PB of hot content per day to millions of peers at Meta
Decentralized p2p data plane with a highly-centralized control plane in which the trackers make detailed decisions for peers such as
Where to fetch each chunk of data
How to retry failed fetches
Which chunks to cache in peer memory and storage
Highly customizable through tracker policies (see the peer-command sketch below)
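A sketch of the policy/mechanism split in which the centralized tracker composes simple peer commands and the peer just executes them; the Command dataclass and the command names ("fetch", "retry", "cache", "evict") are illustrative assumptions, not Owl's actual protocol:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Command:
    op: str                       # "fetch", "retry", "cache", or "evict"
    chunk_id: str
    source: Optional[str] = None  # peer id, superpeer id, or "data-source"
    store: Optional[str] = None   # "memory" or "disk" for cache commands

def handle_fetch_failure(chunk_id, failed_source, fallback_sources):
    """Tracker-side retry policy: try the next candidate source, and if none
    remain, tell the peer to fetch from the external data source."""
    for src in fallback_sources:
        if src != failed_source:
            return Command("retry", chunk_id, source=src)
    return Command("fetch", chunk_id, source="data-source")

# Usage: a failed fetch from peer-2 is retried against superpeer super-1.
cmd = handle_fetch_failure("chunk-3", "peer-2", ["peer-2", "super-1"])
assert cmd.op == "retry" and cmd.source == "super-1"
```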
Lessons
Decentralized data plane, centralized control plane
Policy / mechanism split --> keep clients simple
Flexible policies --> different use cases
Emulation --> policy selection
Future
Streaming data / push notifications (vs. pull)
Incremental updates of AI models
Distribution to the edge