Owl: Scale and Flexibility in Distribution of Hot Content

https://www.usenix.org/conference/osdi22/presentation/flinn

What is the problem that is being solved?

  • Content distribution

    • Meta’s private cloud, large objects read by many consumers

  • Not a “one-size-fits-all” problem

    • Scale: an object may be read by a few or millions of processes

    • Size: objects may be a few MBs or a few TBs

    • Hot data: clients may read the same object seconds or hours apart

  • Exacting requirements of content distribution: three dimensions

    • Fast

      • deliver at available network or disk bandwidth

    • Efficient

      • Scalability: # of clients that can have their distribution needs met by a given number of servers

        • ~ millions of client processes

      • Network usage

        • Bytes transmitted ( --> tput?)

        • Communication locality (i.e. in-rack less costly than cross-region transfer)

      • Resource usage on client machines: e.g. CPU cycles, memory, disk I/O

        • Q: what does client mean? multi-tenant? do clients agree to share spare resources for transfers?

    • Reliable

      • Highly available

      • React quickly to workload changes and outages

      • Metric: percentage of download requests that the distribution system satisfies within a latency SLA (even under partial outages and performance faults)

What are the key innovations over prior work?

Prior works

  • At least 3 prior distribution systems used at Meta

    • Hierarchical caching

    • BitTorrent

    • Static distribution tree

  • Common problems

    • Poor balance between centralization and decentralization

    • Not enough flexibility to meet all requirements of many different types of services at Meta

      • Customized for particular use cases

      • Hard to manage & operate

Hierarchical caching

  • Quotas: if set too high, not enough statistical multiplexing; if set too low, clients get throttled and can’t get data when they need it

    • Second and third question? Are they solved by this paper's approach?

BitTorrent

  • Seeder: getting data from data source, cache it locally

  • Other peer: p2p distribution

  • Cons

    • Decentralized: no global view of the overall distribution

    • Changes take a long time to propagate

    • Hard to aggregate decisions

Key points: centralized control plane + decentralized data plane

  1. Centralization vs. decentralization (see the sketch after this list)

    1. Data plane: highly decentralized

    2. Control plane: centralized

    3. Components: peers (client library), superpeers (dedicated caching nodes), trackers (control plane)

  2. Ephemeral distribution tree
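
Rough sketch of the split (hypothetical stubs and names, not Owl’s actual API): the tracker is the centralized control plane that decides, per chunk, where each peer fetches; peers form the decentralized data plane and move the bytes themselves.

```python
# Sketch of Owl's control/data split -- hypothetical stubs, not Owl's actual API.
from dataclasses import dataclass

@dataclass
class FetchInstruction:
    chunk_id: str
    source: str        # another peer, a superpeer, or the external data source
    retry_limit: int   # the tracker also dictates retry behavior


class Tracker:
    """Centralized control plane: decides where each peer fetches each chunk."""
    def __init__(self, chunk_map):
        self.chunk_map = chunk_map      # object_id -> [chunk_id, ...]
        self.cached = {}                # chunk_id -> set of peers caching it

    def list_chunks(self, object_id):
        return self.chunk_map[object_id]

    def get_fetch_instruction(self, peer_id, chunk_id):
        holders = self.cached.get(chunk_id, set()) - {peer_id}
        source = next(iter(holders), "external-data-source")  # prefer a caching peer
        return FetchInstruction(chunk_id, source, retry_limit=3)

    def report_cached(self, peer_id, chunk_id):
        self.cached.setdefault(chunk_id, set()).add(peer_id)


class Peer:
    """Decentralized data plane: bytes move peer-to-peer, never through the tracker."""
    def __init__(self, peer_id, tracker, fetch_fn):
        self.peer_id, self.tracker, self.fetch_fn = peer_id, tracker, fetch_fn
        self.cache = {}

    def download(self, object_id):
        data = []
        for chunk_id in self.tracker.list_chunks(object_id):
            instr = self.tracker.get_fetch_instruction(self.peer_id, chunk_id)
            chunk = self.fetch_fn(instr.source, chunk_id)       # p2p (or source) transfer
            self.cache[chunk_id] = chunk
            self.tracker.report_cached(self.peer_id, chunk_id)  # tracker sees all state
            data.append(chunk)
        return b"".join(data)


if __name__ == "__main__":
    tracker = Tracker({"model.bin": ["c0", "c1"]})
    fetch = lambda src, cid: f"{cid}@{src}".encode()
    print(Peer("peer-A", tracker, fetch).download("model.bin"))
    print(Peer("peer-B", tracker, fetch).download("model.bin"))  # B is steered to peer-A
```

The point of the split: the tracker sees all cache/download state and can build ephemeral distribution trees, while data never flows through it.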

Scaling the control plane

  • Scale the tracker

    • For each chunk, tracker metadata specifies which peers are caching the chunk and which are downloading it. Tracker metadata also specifies the source of each peer’s download (e.g., an external source or another peer)

    • Optimizations to scale the tracker

      • Large chunk size (via streaming and saving partial work)

      • Geographically-sorted indexes (for quick lookups)

    • Benchmark results for a single tracker: 1.5-2.4 TB/s

    • Scale further

      • Shard detailed state across multiple trackers

      • Sharding by peer lets a single tracker manage all cache state for a given peer

      • But an ephemeral distribution tree may now span peers sharded across trackers

  • Tracker sharding

    • Key idea: provide two levels of resolution for metadata (see the sketch after this list)

      • Peers managed by local tracker: chunks --> peers

      • Peers managed by remote tracker: chunks --> trackers

    • Each tracker broadcasts which chunks its peers have

    • Trackers can delegate finding a source to another tracker

    • 97.5% of delegations succeed, reducing cache misses by 3x
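
Toy sketch of the two-level metadata + delegation idea (data structures and names are mine, not the paper’s): each tracker keeps chunk → peers only for the peers it manages, keeps just chunk → trackers for everyone else, and delegates source-finding one hop when only a remote tracker knows the holders.

```python
# Two-level tracker metadata and delegation -- hypothetical sketch, not Owl's real structures.
from collections import defaultdict

class Tracker:
    def __init__(self, name, peer_region):
        self.name = name
        self.peer_region = peer_region    # peer_id -> region, for geo-sorted source selection
        self.local = defaultdict(set)     # fine-grained:   chunk_id -> peer_ids this tracker manages
        self.remote = defaultdict(set)    # coarse-grained: chunk_id -> names of other trackers
        self.tracker_rpc = {}             # tracker name -> Tracker (stands in for an RPC stub)

    def report_cached(self, peer_id, chunk_id, all_trackers):
        """A peer tells its own tracker that it now caches chunk_id."""
        self.local[chunk_id].add(peer_id)
        # Broadcast only "some of my peers hold this chunk", not which peers (keeps state small).
        for t in all_trackers:
            if t is not self:
                t.remote[chunk_id].add(self.name)

    def find_source(self, chunk_id, requester_region, may_delegate=True):
        # 1. Prefer a peer this tracker manages, same-region holders first.
        holders = sorted(self.local[chunk_id],
                         key=lambda p: self.peer_region[p] != requester_region)
        if holders:
            return holders[0]
        # 2. Otherwise delegate (one hop) to a tracker that advertised the chunk.
        if may_delegate:
            for tracker_name in self.remote[chunk_id]:
                peer = self.tracker_rpc[tracker_name].find_source(
                    chunk_id, requester_region, may_delegate=False)
                if peer is not None:
                    return peer
        # 3. Fall back to the external data source (a cache miss for the p2p layer).
        return None

if __name__ == "__main__":
    a, b = Tracker("A", {"p1": "us-east"}), Tracker("B", {"p2": "us-west"})
    a.tracker_rpc, b.tracker_rpc = {"B": b}, {"A": a}
    b.report_cached("p2", "chunk-7", all_trackers=[a, b])
    print(a.find_source("chunk-7", "us-east"))   # -> "p2", found via delegation to tracker B
```

This only shows the lookup path; real trackers also handle churn, failures, and geographically-sorted indexes over many peers.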

Tracker: the need for flexibility (policy design)

  • Customizable policy modules

    • Complex behavior by composing simple peer commands

    • Selection: where to fetch, how to retry, when to give up (see the selection-policy sketch after this list)

      • Need to consider: max outflows, bandwidth constraints etc.

      • Examples

        • Location-aware: choose the nearest peer / superpeer

        • Hot/cold: hot data --> superpeer, cold data --> directly from the data source

    • Cache: memory or disk, replacement, sharing data with apps

      • Examples

        • Local: client has its own local policy

        • Replacement policies: LRU, least-rare (i.e. based on # of clients holding the chunk), TTL, random, etc.

  • Emulation for finding the right policy (see the replay sketch after this list)

    • Emulation: recording

      • Recording captures raw demand (as if peers fetched directly from the data source): which chunks were fetched, and peer lifetimes

    • Emulation: replay

      • Emulates peers, superpeers, and the network

      • Searches the space of policies and reports key metrics; runs on all new workloads, and periodically on every existing workload

  • Scales to 106 client types with 55 unique policies
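
Sketch of what pluggable selection policies could look like (interface, names, and thresholds are made up for illustration, not Owl’s policy API): the tracker composes simple peer commands (fetch from X, retry, give up) and swaps in policies such as location-aware or hot/cold routing.

```python
# Pluggable selection policies -- illustrative only; interface and names are not Owl's.
from dataclasses import dataclass
from typing import List, Optional, Protocol, Tuple

@dataclass
class ChunkState:
    chunk_id: str
    holders: List[Tuple[str, int]]   # (peer_id, network distance), nearest first
    recent_downloads: int            # demand signal the tracker already has

class SelectionPolicy(Protocol):
    def pick_source(self, state: ChunkState) -> Optional[str]: ...

class LocationAware:
    """Prefer the nearest peer/superpeer that already caches the chunk."""
    def pick_source(self, state: ChunkState) -> Optional[str]:
        return state.holders[0][0] if state.holders else None

class HotCold:
    """Hot chunks go through a superpeer; cold chunks go straight to the data source."""
    def __init__(self, superpeer: str, hot_threshold: int = 100):
        self.superpeer, self.hot_threshold = superpeer, hot_threshold
    def pick_source(self, state: ChunkState) -> Optional[str]:
        return self.superpeer if state.recent_downloads >= self.hot_threshold else None

def next_command(policy: SelectionPolicy, state: ChunkState, attempt: int, max_attempts: int = 3):
    """Tracker-side composition of simple peer commands: fetch, retry, or give up."""
    if attempt >= max_attempts:
        return ("give_up", state.chunk_id)
    source = policy.pick_source(state) or "external-data-source"
    return ("fetch", state.chunk_id, source)

if __name__ == "__main__":
    state = ChunkState("c9", holders=[("rack-peer-3", 1), ("region-peer-8", 50)],
                       recent_downloads=240)
    print(next_command(LocationAware(), state, attempt=0))       # fetch from rack-peer-3
    print(next_command(HotCold("superpeer-us-east"), state, 0))  # hot object -> superpeer
```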
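
And a toy version of the record/replay idea (trace format, metrics, and policies are simplified guesses; the real emulator also models superpeers, the network, and peer lifetimes): replay the recorded chunk requests against candidate cache-replacement policies and compare the resulting metrics.

```python
# Trace-driven policy emulation -- simplified sketch; not the paper's emulator.
import random
from collections import namedtuple

TraceEvent = namedtuple("TraceEvent", "time peer_id chunk_id")  # recorded in production

def replay(trace, evict, cache_size):
    """Replay one recorded trace against a candidate replacement policy; report key metrics."""
    caches = {}                  # peer_id -> {chunk_id: last access time}
    hits = misses = 0
    for ev in trace:
        cache = caches.setdefault(ev.peer_id, {})
        if ev.chunk_id in cache:
            hits += 1
        else:
            misses += 1          # emulated fetch from another peer or the external source
            if len(cache) >= cache_size:
                cache.pop(evict(cache))
        cache[ev.chunk_id] = ev.time
    return {"hit_rate": round(hits / max(hits + misses, 1), 3), "misses": misses}

# Candidate replacement policies to search over (LRU and random shown; least-rare/TTL omitted).
def lru(cache):
    return min(cache, key=cache.get)

def random_evict(cache):
    return random.choice(list(cache))

if __name__ == "__main__":
    trace = [TraceEvent(t, f"peer-{t % 3}", f"chunk-{t % 5}") for t in range(300)]
    for name, policy in [("lru", lru), ("random", random_evict)]:
        print(name, replay(trace, policy, cache_size=3))
```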

What are the key results?

Traffic vs. Servers

BitTorrent vs. Owl

Other

  • Distribute 800PB of hot content per day to millions of peers at Meta

  • Decentralized p2p data plane with a highly-centralized control plane in which the trackers make detailed decisions for peers such as

    • Where to fetch each chunk of data

    • How to retry failed fetches

    • Which chunks to cache in peer memory and storage

  • Highly customizable through tracker policies

Conclusions

  • Lessons

    • Decentralized data plane, centralized control plane

    • Policy / mechanism split --> keep clients simple

    • Flexible policies --> different use cases

    • Emulation --> policy selection

  • Future

    • Streaming data / push notifications (vs. pull)

    • Incremental updates of AI models

    • Distribution to the edge
