Building An Elastic Query Engine on Disaggregated Storage
Presents Snowflake's design and implementation, along with a discussion of how recent changes in cloud infrastructure (emerging hardware, fine-grained billing, etc.) have altered many of the assumptions that guided Snowflake's original design
Traditional shared-nothing architectures
Data partitioned across servers
Each server handles its own partition
Hardware-workload mismatch
Data re-shuffle during elasticity
Fundamental issue in shared-nothing architectures: tight coupling of compute and storage
Solution: decouple compute and persistent storage
Independent scaling of resources
Analyze: query engine on disaggregated storage
Design aspects
Data-driven insights
Future directions
System: Snowflake --> statistics from ~70 million queries over a 14-day period
Characteristics
Diversity of queries
Read-only: 28%
Write-only: 13%
Read-write: 59%
Three distinct query classes (vary in orders of magnitude)
Query distribution over time
Read-only query load varies significantly over time
High-level architecture
Virtual warehouse
Abstraction for computational resources
Under the hood --> set of VMs (sketch below)
Distributed execution of queries
Different customers --> different warehouses
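A minimal sketch of the virtual-warehouse abstraction described above, assuming hypothetical names (VirtualWarehouse, WorkerVM, run_query) rather than Snowflake's actual API: a warehouse is a named set of VMs that a query's tasks fan out over.

```python
# Hypothetical sketch: a virtual warehouse is just a set of VMs that a query
# is distributed over; different customers get different warehouses.
from dataclasses import dataclass, field


@dataclass
class WorkerVM:
    """One VM in a virtual warehouse: compute plus DRAM/SSD for ephemeral storage."""
    vm_id: str
    dram_gb: int
    ssd_gb: int


@dataclass
class VirtualWarehouse:
    """Customer-facing abstraction for compute; under the hood, a set of VMs."""
    name: str
    vms: list[WorkerVM] = field(default_factory=list)

    def run_query(self, sql: str) -> None:
        # A query is split into tasks executed across all VMs in the warehouse.
        for task_id, vm in enumerate(self.vms):
            print(f"{self.name}: task {task_id} of {sql!r} -> {vm.vm_id}")


wh_a = VirtualWarehouse("customer_a_wh",
                        [WorkerVM("vm-1", 64, 500), WorkerVM("vm-2", 64, 500)])
wh_a.run_query("SELECT COUNT(*) FROM sales")
```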
Ephemeral storage system key features
Co-located with compute in VWs
Use both DRAM and local SSDs
Opportunistic caching of persistent data
Hide S3 access latency (sketch below)
Elasticity without data re-shuffle
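A toy sketch of the opportunistic two-tier cache (DRAM first, then local SSD) fronting S3, as described in the list above; EphemeralCache, its naive eviction, and the capacities are illustrative assumptions, not Snowflake internals.

```python
# Illustrative two-tier cache: serve from DRAM, then SSD, and only fall back to
# persistent storage (S3) on a miss in both tiers, hiding S3 latency for hot data.
class EphemeralCache:
    def __init__(self, dram_capacity: int, ssd_capacity: int):
        self.dram: dict[str, bytes] = {}
        self.ssd: dict[str, bytes] = {}
        self.dram_capacity = dram_capacity
        self.ssd_capacity = ssd_capacity

    def get(self, key: str, fetch_from_s3) -> bytes:
        if key in self.dram:
            return self.dram[key]
        if key in self.ssd:
            value = self.ssd[key]
            self._put_dram(key, value)      # promote hot data back to DRAM
            return value
        value = fetch_from_s3(key)          # slow path: persistent storage
        self._put_dram(key, value)          # cache opportunistically for next time
        return value

    def _put_dram(self, key: str, value: bytes) -> None:
        if len(self.dram) >= self.dram_capacity:
            old_key, old_val = self.dram.popitem()      # naive eviction policy
            if len(self.ssd) < self.ssd_capacity:
                self.ssd[old_key] = old_val             # spill evicted data to SSD
        self.dram[key] = value


cache = EphemeralCache(dram_capacity=2, ssd_capacity=8)
data = cache.get("table/part-0001", fetch_from_s3=lambda k: b"column-bytes-from-s3")
```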
Intermediate data characteristics
Intermediate data sizes --> variation over 5 orders of magnitude
Difficult to predict intermediate data sizes upfront
A bit different than ours
Possibility: decouple compute & ephemeral storage
Persistent data caching
Ephemeral storage capacity sized for peak intermediate data volume
Remaining capacity used for opportunistic caching of persistent data
How to ensure consistency?
Each file assigned to unique node
Consistent hashing
Write-through caching (sketch below)
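A rough sketch, assuming a standard hash-ring layout, of the two ideas above: each file is assigned to a unique node via consistent hashing, and writes go through to S3 so a cached copy is never newer than the durable copy. ConsistentHashRing and write_through are made-up names for illustration only.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable hash used for both node positions and file placement on the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes: list[str]):
        self._ring = sorted((_hash(n), n) for n in nodes)

    def owner(self, file_name: str) -> str:
        """Each file is assigned to a unique node: the first node clockwise on the ring."""
        keys = [h for h, _ in self._ring]
        idx = bisect.bisect(keys, _hash(file_name)) % len(keys)
        return self._ring[idx][1]


def write_through(file_name: str, data: bytes, ring: ConsistentHashRing,
                  caches: dict[str, dict], s3: dict) -> None:
    # Write-through: update the owning node's cache and S3 together, which keeps
    # cached persistent data consistent with the durable copy.
    node = ring.owner(file_name)
    caches.setdefault(node, {})[file_name] = data
    s3[file_name] = data


ring = ConsistentHashRing(["vm-1", "vm-2", "vm-3"])
caches: dict[str, dict] = {}
s3: dict[str, bytes] = {}
write_through("table/part-0001", b"column-bytes", ring, caches, s3)
```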
Elasticity
Persistent storage - easy, offloaded to S3
Compute - easy, pre-warmed pool of VMs
Request --> take VMs from the pool
Ephemeral storage - challenging, due to co-location with compute
Back to shared-nothing architecture problem (data re-shuffle)
Reshuffle --> re-hashing to ensure consistency
Lazy consistent hashing
Locality aware task scheduling
Elasticity without data re-shuffle
Waits to move a file to its new owner node until a task that actually reads it is scheduled (sketch below)
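A toy illustration of lazy consistent hashing plus locality-aware scheduling, under the simplifying assumptions that S3 is a dict and a task is a single read: resizing only changes the node list, and a file is pulled to its new owner the first time a task reads it there, so no bulk re-shuffle is needed.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


def owner(file_name: str, nodes: list[str]) -> str:
    """Consistent-hash owner: first node clockwise of the file on the ring."""
    ring = sorted((_hash(n), n) for n in nodes)
    idx = bisect.bisect([h for h, _ in ring], _hash(file_name)) % len(ring)
    return ring[idx][1]


def schedule_read(file_name: str, nodes: list[str],
                  caches: dict[str, dict], s3: dict) -> bytes:
    node = owner(file_name, nodes)          # locality-aware: run the task on the owner
    cache = caches.setdefault(node, {})
    if file_name not in cache:
        # Lazy migration: the (possibly new) owner pulls the file from S3 on first
        # read, instead of copying all cached data the moment nodes are added/removed.
        cache[file_name] = s3[file_name]
    return cache[file_name]


s3 = {"table/part-0001": b"column-bytes"}
caches: dict[str, dict] = {}
schedule_read("table/part-0001", ["vm-1", "vm-2"], caches, s3)          # before resize
schedule_read("table/part-0001", ["vm-1", "vm-2", "vm-3"], caches, s3)  # after resize
```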
Do customers exploit elasticity in the wild?
Resource scaling by up to 100x needed at times
At what time-scales are warehouses resized?
Warehouses are resized at much coarser time scales than changes in query load
Need finer-grained elasticity in order to better match demand
Resource utilization
System-wide resource utilization: CPU, memory, network-tx, network-rx
Virtual warehouse abstraction
Good performance isolation
Trade-off: low resource utilization
Significant room for improvement in resource utilization
Solution #1: finer-grained elasticity with current design
Solution #2: resource-shared model
Resource sharing
Per-second pricing --> pre-warmed pool not cost effective!
Sol #2: move to a resource-shared model
Statistical multiplexing
better resource utilization
helps support elasticity
Challenges
Maintaining isolation guarantees
Shared ephemeral storage system
Sharing cache
No pre-determined lifetime
Co-existence with intermediate data
Elasticity without violating isolation
Cross-tenant interference
Need private address spaces for tenants (sketch below)
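One possible shape, purely a sketch of the isolation idea rather than the paper's design, of a shared ephemeral store that gives each tenant a private address space with per-tenant capacity and eviction, so tenants sharing the same VMs cannot see or evict each other's data.

```python
from typing import Optional


class SharedEphemeralStore:
    """Illustrative multi-tenant store: one private key namespace per tenant."""

    def __init__(self, capacity_per_tenant: int):
        self._spaces: dict[str, dict[str, bytes]] = {}   # tenant -> private namespace
        self._capacity = capacity_per_tenant

    def put(self, tenant: str, key: str, value: bytes) -> None:
        space = self._spaces.setdefault(tenant, {})
        if len(space) >= self._capacity:
            space.popitem()          # evict within this tenant only (naive policy)
        space[key] = value

    def get(self, tenant: str, key: str) -> Optional[bytes]:
        # A tenant only ever sees keys in its own address space, limiting
        # cross-tenant interference even though the underlying VMs are shared.
        return self._spaces.get(tenant, {}).get(key)


store = SharedEphemeralStore(capacity_per_tenant=1000)
store.put("tenant-a", "table/part-0001", b"column-bytes")
store.get("tenant-b", "table/part-0001")   # None: tenant-b cannot see tenant-a's data
```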
Findings are generally applicable to other distributed systems built on top of disaggregated storage