# Building An Elastic Query Engine on Disaggregated Storage

* Traditional shared-nothing architectures&#x20;
  * Data partitioned across servers
  * Each server handles its own partition&#x20;
  * Problems
    * Hardware-workload mismatch
    * Data re-shuffle during elasticity&#x20;
    * Fundamental issue: tight coupling of compute & storage&#x20;
* Query Engines on Disaggregated Storage&#x20;
  * Decouple compute and persistent storage
  * Independent scaling of resources&#x20;
* Work explores&#x20;
  * Design aspects&#x20;
  * Data-driven insights&#x20;
  * Future directions&#x20;
* System focus on: snowflake
  * Warehousing as a service, in production for over 5 years, 1000s of customers, millions of queries per day&#x20;
  * Statistics from 70 million queries over 14 day&#x20;
* Diversity of Queries&#x20;
  * Read only --> 28%&#x20;
  * Write only --> 13%
  * R/W --> 59%&#x20;
  * Persistent data read/written varies over several orders of magnitude within each class&#x20;
* Query distribution over time&#x20;
  * Read-only query load varies significantly over time&#x20;
* High-level architecture&#x20;
  * Virtual warehouse: abstraction for computational resources, under the hood --> set of VMs, distributed execution of queries &#x20;
  * Customers --> different warehouses&#x20;
  * Decouples compute from persistent storage (S3)&#x20;
    * Intermediate data: distributed join, or group by --> ephemeral storage&#x20;
      * Ephemeral storage system key features&#x20;
        * Co-located with compute in VWs
        * Opportunistic caching of persistent data (hide access latency)
        * Elasticity without data re-shuffle&#x20;
* Intermediate data characteristics&#x20;
  * Data size: variation over 5 orders of magnitude&#x20;
  * Difficult to predict intermediate data sizes upfront&#x20;
  * Decouple compute & ephemeral storage?&#x20;
* Persistent data caching&#x20;
  * Intermediate data volume: peak, average&#x20;
  * Opportunistic caching of persistent data in ephemeral storage system, hides latency of S3 access
  * How to ensure consistency?&#x20;
    * Each file assigned to unique node, consistent hashing&#x20;
    * Write-through caching&#x20;
    * Analysis, future directions in paper&#x20;
* Elasticity&#x20;
  * Persistent storage: easy, offloaded to S3&#x20;
  * Compute: easy, pre-warmed pool of VMs&#x20;
  * Ephemeral storage - challenging, due to co-location with compute&#x20;
    * Back to shared-nothing architecture problem (data re-shuffle)&#x20;
* Lazy consistent hashing&#x20;
  * Locality aware task scheduling&#x20;
  * Elasticity without data re-shuffle&#x20;
  * Lazy consistent hashing&#x20;
* Elasticity&#x20;
  * Resource scaling by up to 100x needed at times&#x20;
  * What time-scale?&#x20;
    * Granularity of warehouse elasticity >> changes in query load&#x20;
    * Take-away: need finer-grained elasticity in order to better match demand&#x20;
* Resource utilization&#x20;
  * System-wide resource utilizations: significant room for improvement in resource utilizaiton&#x20;
  * Virtual Warehouse abstraction
    * Good performance isolation
    * Trade-off: low resource utilization
  * Solution: finer-grained elasticity with current design, or move to resource shared model&#x20;
* Alternate design: resource sharing&#x20;
  * Pricing model: per-hour usage for warehouses&#x20;
  * Move to per-second pricing --> less cost-effective&#x20;
  * Solution: move to resource shared model
    * Better resource utilization&#x20;
    * Helps support elasticity&#x20;
    * Challenges&#x20;
      * Maintaining isolation guarantees&#x20;
      * Shared ephemeral storage system
        * Sharing cache&#x20;
          * No pre-determined lifetime&#x20;
          * Co-existence with int. data
        * &#x20;Elasticity without violating isolation
          * Possible cross-tenant interference
          * Need private address-spaces for tenants &#x20;


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://sliu583.gitbook.io/blog/specific-work/seminar-and-talk/fall-21-reading-list/building-an-elastic-query-engine-on-disaggregated-storage.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
