# Building An Elastic Query Engine on Disaggregated Storage

* Presents the Snowflake design and implementation, along with a discussion of how recent changes in cloud infrastructure (emerging hardware, fine-grained billing, etc.) have altered many of its original design assumptions

### Presentation

* Traditional shared-nothing architectures
  * Data partitioned across servers
  * Each server handles its own partition
* Hardware-workload mismatch: each node ships a fixed compute-to-storage ratio
* Data re-shuffle during elasticity
* Fundamental issue in shared-nothing architectures: tight coupling of compute and storage
* Solution: decouple compute and persistent storage
  * Independent scaling of resources
  * Analyze: query engine on disaggregated storage
    * Design aspects
    * Data-driven insights
    * Future directions
  * System: Snowflake --> statistics from ~70 million queries over a 14-day period
* Characteristics
  * Diversity of queries
    * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FzqJO8tLvL1VtQG9UcL4s%2Fimage.png?alt=media\&token=d53c9719-54ff-4395-8de2-6566c2031e21)
      * Read-only: 28%
      * Write-only: 13%
      * Read-write: 59%
      * Three distinct query classes (varying by orders of magnitude)
  * Query distribution over time
    * Read-only query load varies significantly over time
    * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FP6zUbiu2jsr9oDLCbkcx%2Fimage.png?alt=media\&token=3f9aefb6-d404-4fb1-8b98-4c54fc1741ea)
* High-level architecture
  * Virtual warehouse (VW)
    * Abstraction for computational resources
    * Under the hood --> set of VMs
    * Distributed execution of queries
  * Different customers --> different warehouses
    * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FDm3gQKDPyVHPPb6LIbvk%2Fimage.png?alt=media\&token=cdce6aa1-29f4-4917-8120-1a6bcbc9563d)
  * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FWVX21uZzJ4C7k5iSblns%2Fimage.png?alt=media\&token=146e45ce-1266-49bd-b9b4-3db0ea73a982)
    * Ephemeral storage system key features
      * Co-located with compute in VWs
        * Use both DRAM and local SSDs
      * Opportunistic caching of persistent data
        * Hide S3 access latency
      * Elasticity without data re-shuffle
  * Intermediate data characteristics
    * Intermediate data sizes --> variation over 5 orders of magnitude
      * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FgEMLBxKaExiLdv1B3gHq%2Fimage.png?alt=media\&token=b803f611-d5cb-40d1-baf8-6e3dd97cc2ae)
    * Difficult to predict intermediate data sizes upfront
      * A bit different from ours
    * Possibility: decouple compute & ephemeral storage
  * Persistent data caching
    * Ephemeral storage provisioned for peak intermediate data volume --> spare capacity available for caching persistent data
    * Opportunistic caching
      * How to ensure consistency?
        * Each file assigned to a unique node
          * Consistent hashing
        * Write-through caching
  * Elasticity
    * Persistent storage - easy, offloaded to S3
    * Compute - easy, pre-warmed pool of VMs
      * Request --> take VMs from the pool
    * Ephemeral storage - challenging, due to co-location with compute
      * Back to the shared-nothing architecture problem (data re-shuffle)
      * Resize changes the hash ring --> naive consistent hashing would re-shuffle cached data
    * Lazy consistent hashing
      * **Locality-aware task** scheduling
      * Elasticity without data re-shuffle
        * A file is loaded onto its new owner node only when a task actually reads it
  * Do customers exploit elasticity in the wild?
    * Resource scaling by up to 100x needed at times
    * At what time-scales are warehouses resized?
      * Granularity of warehouse elasticity >> time-scale of changes in query load
      * Need finer-grained elasticity to better match demand
    * Resource utilization
      * System-wide resource utilization: CPU, memory, network-tx, network-rx
      * Virtual warehouse abstraction
        * Good performance isolation
        * Trade-off: low resource utilization
      * Significant room for improvement in resource utilization
      * Solution #1: finer-grained elasticity with current design
      * Solution #2: resource-shared model
  * Resource sharing
    * Per-second pricing --> pre-warmed pool no longer cost-effective!
    * Solution #2: move to a resource-shared model
      * Statistical multiplexing
        * Better resource utilization
        * Helps support elasticity
      * Challenges
        * Maintaining isolation guarantees
        * Shared ephemeral storage system
        * Sharing cache
          * No pre-determined lifetime
          * Co-existence with intermediate data
        * Elasticity without violating isolation
          * Cross-tenant interference
          * Need private address spaces for tenants
  * Findings are generally applicable to other distributed systems built on top of disaggregated storage
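The caching scheme in the notes above (one owner node per file via consistent hashing, plus write-through on updates) can be sketched as follows. `ConsistentHashRing` and the in-memory `WriteThroughCache` are hypothetical stand-ins for illustration, not Snowflake's actual code; a plain dict plays the role of S3:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash (md5 here) so file-to-node assignment is deterministic.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Assigns each persistent file to a unique owner node."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node) virtual points
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node, vnodes=64):
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def owner(self, file_name: str) -> str:
        # The first virtual point clockwise of the file's hash owns the file.
        idx = bisect.bisect(self._ring, (_hash(file_name), ""))
        return self._ring[idx % len(self._ring)][1]

class WriteThroughCache:
    """Local cache of persistent files on an owner node; every write also
    reaches S3, so cached data never diverges from persistent storage."""

    def __init__(self, s3: dict):
        self.local = {}
        self.s3 = s3  # dict stands in for the S3 object store

    def write(self, file_name, data):
        self.local[file_name] = data
        self.s3[file_name] = data  # write-through keeps S3 authoritative

    def read(self, file_name):
        if file_name in self.local:
            return self.local[file_name]   # cache hit hides S3 latency
        data = self.s3[file_name]          # miss --> fetch from S3
        self.local[file_name] = data       # opportunistically cache
        return data
```

Because each file has exactly one owner and writes go through to persistent storage, cached copies can be served without any cross-node invalidation protocol.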
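The lazy consistent hashing idea noted above (a resize updates only the file-to-node mapping; cached data moves when a task next reads it) can be sketched roughly like this. `LazyElasticCluster` is hypothetical, and rendezvous hashing stands in for a full consistent-hash ring to keep the sketch short:

```python
import hashlib

def owner(nodes, file_name: str) -> str:
    # Rendezvous (highest-random-weight) hashing: a simple stand-in for
    # the hash ring -- each file still gets exactly one owner node.
    return max(nodes,
               key=lambda n: hashlib.md5(f"{n}/{file_name}".encode()).hexdigest())

class LazyElasticCluster:
    def __init__(self, nodes, s3: dict):
        self.nodes = list(nodes)
        self.caches = {n: {} for n in self.nodes}  # per-node ephemeral cache
        self.s3 = s3  # dict stands in for the S3 object store

    def resize(self, new_node):
        # Elasticity without re-shuffle: only the mapping changes here;
        # no cached file is copied at resize time.
        self.nodes.append(new_node)
        self.caches[new_node] = {}

    def run_scan_task(self, file_name):
        # Locality-aware scheduling: run the task on the file's owner node.
        node = owner(self.nodes, file_name)
        cache = self.caches[node]
        if file_name not in cache:
            cache[file_name] = self.s3[file_name]  # lazy pull on first read
        return node, cache[file_name]
```

After a resize the new node's cache starts empty; files whose ownership moved migrate one at a time, each on the first task that actually reads it.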
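The statistical-multiplexing argument for the resource-shared model can be made concrete with a toy example (the demand numbers are illustrative, not from the paper): isolated virtual warehouses must each be provisioned for their own peak, while a shared pool only needs the peak of the combined load:

```python
def provisioning(tenant_demands):
    """tenant_demands: per-tenant lists of demand (VM-units) per time slot.
    Returns (isolated, shared) capacity required."""
    # Isolated warehouses: sum of each tenant's individual peak.
    isolated = sum(max(d) for d in tenant_demands)
    # Shared pool: peak of the summed demand across tenants.
    shared = max(map(sum, zip(*tenant_demands)))
    return isolated, shared

# Two tenants with anti-correlated load over eight time slots.
tenant_a = [8, 6, 2, 1, 1, 2, 6, 8]
tenant_b = [1, 2, 6, 8, 8, 6, 2, 1]
print(provisioning([tenant_a, tenant_b]))  # (16, 9)
```

The gap between the two numbers is the utilization headroom the notes point at: the less correlated the tenants' peaks, the bigger the win from sharing.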
