Building An Elastic Query Engine on Disaggregated Storage

  • Presents Snowflake's design and implementation, along with a discussion of how recent changes in cloud infrastructure (emerging hardware, fine-grained billing, etc.) have altered many of the assumptions that guided its design

Presentation

  • Traditional shared-nothing architectures

    • Data partitioned across servers

    • Each server handles its own partition

  • Hardware-workload mismatch

  • Data re-shuffle during elasticity

  • Fundamental issue in shared-nothing architectures: tight coupling of compute and storage

  • Solution: decouple compute and persistent storage

    • Independent scaling of resources

    • Analyze: query engine on disaggregated storage

      • Design aspects

      • Data-driven insights

      • Future directions

    • System: Snowflake --> statistics from ~70 million queries over a 14-day period

  • Characteristics

    • Diversity of queries

        • Read-only: 28%

        • Write-only: 13%

        • Read-write: 59%

        • Three distinct query classes (amount of data read/written varies by orders of magnitude)

    • Query distribution over time

      • Read-only query load varies significantly over time

  • High-level architecture

    • Virtual warehouse

      • Abstractions for computational resources

      • Under the hood --> set of VMs

      • Distributed execution of queries
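
A minimal sketch of the virtual warehouse (VW) abstraction described above: a per-customer unit of compute that is, under the hood, a set of VMs over which a query's tasks are distributed. The class, method names, and round-robin assignment here are assumptions for illustration, not Snowflake's actual API or scheduler.

```python
from dataclasses import dataclass, field

@dataclass
class VirtualWarehouse:
    """Toy model of a VW: a named, per-customer set of worker VMs."""
    customer: str
    vms: list = field(default_factory=list)   # worker VM identifiers

    def run_query(self, tasks):
        # Distributed execution: spread the query's tasks round-robin across VMs.
        assignment = {vm: [] for vm in self.vms}
        for i, task in enumerate(tasks):
            assignment[self.vms[i % len(self.vms)]].append(task)
        return assignment

# e.g. VirtualWarehouse("customer-a", ["vm-1", "vm-2"]).run_query(["t0", "t1", "t2"])
# -> {"vm-1": ["t0", "t2"], "vm-2": ["t1"]}
```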

    • Different customers --> different warehouses

      • Ephemeral storage system key features

        • Co-located with compute in VWs

          • Use both DRAM and local SSDs

        • Opportunistic caching of persistent data

          • Hide S3 access latency

        • Elasticity without data re-shuffle
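
A toy sketch of the ephemeral storage points above: a DRAM tier backed by local SSD, opportunistic caching of persistent-data files, and a fall-back to S3 (whose latency the cache is meant to hide) on a miss. The LRU tiering policy, the file-count capacities, and the `s3` client are illustrative assumptions, not Snowflake's implementation.

```python
from collections import OrderedDict

class EphemeralStore:
    """Toy two-tier ephemeral store on a worker VM: DRAM backed by local SSD."""

    def __init__(self, s3, dram_capacity=4, ssd_capacity=16):
        self.s3 = s3                       # placeholder persistent-store client
        self.dram = OrderedDict()          # hot tier, kept in LRU order
        self.ssd = OrderedDict()           # warm tier, kept in LRU order
        self.dram_capacity = dram_capacity
        self.ssd_capacity = ssd_capacity

    def read(self, file_name):
        if file_name in self.dram:                 # DRAM hit
            self.dram.move_to_end(file_name)
            return self.dram[file_name]
        if file_name in self.ssd:                  # SSD hit: promote to DRAM
            data = self.ssd.pop(file_name)
        else:                                      # miss: pay the S3 latency once
            data = self.s3.get(file_name)
        self._cache(file_name, data)               # opportunistic caching
        return data

    def _cache(self, file_name, data):
        self.dram[file_name] = data
        if len(self.dram) > self.dram_capacity:    # demote LRU DRAM entry to SSD
            old_name, old_data = self.dram.popitem(last=False)
            self.ssd[old_name] = old_data
            if len(self.ssd) > self.ssd_capacity:  # evict LRU SSD entry entirely
                self.ssd.popitem(last=False)
```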

    • Intermediate data characteristics

      • Intermediate data sizes --> variation over 5 orders of magnitude

      • Difficult to predict intermediate data sizes upfront

        • A bit different than ours

      • Possibility: decouple compute & ephemeral storage

    • Persistent data caching

      • Ephemeral storage capacity is sized for peak intermediate data volume --> spare capacity most of the time

      • Opportunistic caching

        • How to ensure consistency?

          • Each file assigned to unique node

            • Consistent hashing

          • Write-through caching
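
A minimal sketch of the consistency scheme above: consistent hashing assigns each persistent-data file to a unique node, and a write updates both the local copy and S3 (write-through). The class names, virtual-node count, and the `CacheNode`/`s3.put` interfaces are assumptions for illustration.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Maps each persistent-data file name to a single owning cache node."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._keys = []      # sorted virtual-node hashes
        self._owner = {}     # virtual-node hash -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            h = _hash(f"{node}#{i}")
            bisect.insort(self._keys, h)
            self._owner[h] = node

    def node_for(self, file_name):
        h = _hash(file_name)
        idx = bisect.bisect(self._keys, h) % len(self._keys)
        return self._owner[self._keys[idx]]

class CacheNode:
    """A worker's local cache of persistent-data files (stand-in for DRAM/SSD)."""

    def __init__(self):
        self.files = {}

    def write_through(self, file_name, data, s3):
        # Write-through: update the local copy and S3 together, so the cache
        # never holds data that persistent storage does not.
        self.files[file_name] = data
        s3.put(file_name, data)   # `s3` is a placeholder client

# ring = ConsistentHashRing(["node-0", "node-1", "node-2"])
# ring.node_for("part-00042") always returns the same node, so there is only
# one cached copy of each file to keep consistent.
```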

    • Elasticity

      • Persistent storage - easy, offloaded to S3

      • Compute - easy, pre-warmed pool of VMs

        • Request --> take VMs from the pool

      • Ephemeral storage - challenging, due to co-location with compute

        • Back to shared-nothing architecture problem (data re-shuffle)

        • Reshuffle --> re-hashing to ensure consistency

      • Lazy consistent hashing

        • Locality-aware task scheduling

        • Elasticity without data re-shuffle

          • A file is re-cached on its new node only when a task actually reads it (a miss falls back to S3)
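
A sketch of lazy consistent hashing as described above, reusing the `ConsistentHashRing` from the earlier sketch; the scheduling and cache structures are assumptions for illustration, not Snowflake's code.

```python
class LazyCacheCluster:
    """Resizing only rebuilds the hash ring; cached files are never moved
    eagerly, so elasticity does not trigger a data re-shuffle."""

    def __init__(self, nodes, s3):
        self.ring = ConsistentHashRing(nodes)   # ring from the earlier sketch
        self.caches = {n: {} for n in nodes}    # per-node local file cache
        self.s3 = s3                            # placeholder persistent store

    def resize(self, new_nodes):
        # Elasticity without data re-shuffle: no files are copied here.
        for n in new_nodes:
            self.caches.setdefault(n, {})
        self.ring = ConsistentHashRing(new_nodes)

    def run_scan_task(self, file_name):
        # Locality-aware scheduling: run the task on the file's current owner.
        owner = self.ring.node_for(file_name)
        cache = self.caches[owner]
        if file_name not in cache:
            # Miss (cold file, or it was cached on a pre-resize owner):
            # read from S3 and cache on the new owner only now, i.e. lazily.
            cache[file_name] = self.s3.get(file_name)
        return owner, cache[file_name]
```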

    • Do customers exploit elasticity in the wild?

      • Resource scaling by up to 100x needed at times

      • At what time-scales are warehouses resized?

        • Time-scale of warehouse resizing >> time-scale of changes in query load

        • Need finer-grained elasticity in order to better match demand

      • Resource utilization

        • System-wide resource utilization: CPU, memory, network-tx, network-rx

        • Virtual warehouse abstraction

          • Good performance isolation

          • Trade-off: low resource utilization

        • Significant room for improvement in resource utilization

        • Solution #1: finer-grained elasticity with current design

        • Solution #2: resource-shared model

    • Resource sharing

      • Per-second pricing --> pre-warmed pool not cost-effective!

      • Solution #2: move to a resource-shared model

        • Statistical multiplexing (see the toy example at the end of this section)

          • better resource utilization

          • helps support elasticity

        • Challenges

          • Maintaining isolation guarantees

          • Shared ephemeral storage system

          • Sharing cache

            • Cached persistent data has no pre-determined lifetime (unlike intermediate data)

            • Co-existence with intermediate data

          • Elasticity without violating isolation

            • Cross-tenant interference

            • Need private address spaces for tenants
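
A toy illustration of the statistical-multiplexing point above (synthetic numbers, not Snowflake data): with bursty and largely uncorrelated tenant demand, a shared pool sized for the peak of the aggregate needs much less capacity than isolated warehouses each sized for their own peak.

```python
import random

random.seed(0)
T, TENANTS = 1_000, 50
# Each tenant is idle most of the time and occasionally bursts to 1 or 4 units.
demand = [[random.choice([0, 0, 0, 1, 4]) for _ in range(T)] for _ in range(TENANTS)]

sum_of_peaks = sum(max(d) for d in demand)                       # isolated warehouses
peak_of_sum = max(sum(d[t] for d in demand) for t in range(T))   # shared pool

print("capacity if each tenant is provisioned alone:", sum_of_peaks)
print("capacity if all tenants share one pool:      ", peak_of_sum)
# With uncorrelated bursts, peak_of_sum << sum_of_peaks, which is the
# utilization gain (and elasticity headroom) of a resource-shared design.
```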

    • Findings are generally applicable to other distributed systems built on top of disaggregated storage
