# External Merge Sort for Top-K Queries: Eager input filtering guided by histograms

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWywqVYsqrSFKbbmgmO%2Fimage.png?alt=media\&token=a44449d5-f8f8-4673-92cf-91749e344477)

* Focus of this work: the case where K is large

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWywz4jhAOE6PcRUs_R%2Fimage.png?alt=media\&token=27fec089-599e-4b9c-8dbd-cbc68b6948f7)

* The running-time figures show this approach is the fastest

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWyxGHa3wapkqjvS-ld%2Fimage.png?alt=media\&token=436f82ef-53c4-4059-a673-37e04c630895)

* Returns the exact top-K tuples
  * Other algorithms are approximate; approximation is out of scope for this work

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWyxRfoE2giYDGD5JnJ%2Fimage.png?alt=media\&token=e4f6725d-f6bb-4e00-abdf-5239b1402f40)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWyxUy-z4cM-9EHvxvY%2Fimage.png?alt=media\&token=da1cc2c4-de35-4550-a7e6-e076cadc32fc)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWyxahv-QYYjfLtAka0%2Fimage.png?alt=media\&token=38575707-6609-46b0-9110-389b7762960e)

* Queries
  * Over a single table
  * Or over the result of previous operators
* Push the join

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWyxzZLhqMqARdbfOLC%2Fimage.png?alt=media\&token=608fc253-2227-48a5-8cc9-a47e2d2d97d9)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWyy5-2HLGnE44cwkdz%2Fimage.png?alt=media\&token=eab0fb72-4c3a-4df0-b1ba-70275a71d613)

### Existing Top-K Algorithms

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWyyBmt6-8lpXovicq1%2Fimage.png?alt=media\&token=298db428-5798-4967-8803-24a484d6d269)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWyyKXeJllwUkEBAs_n%2Fimage.png?alt=media\&token=cad2aa29-2d6a-4b98-a62d-b4f8d557bb00)

* No full sort is needed; only keep track of the K best tuples
* Can this be optimized further?
  * Is the priority queue expensive?
    * This technique only requires a memory allocation, and fully sorts only the K tuples
  * Optimization goals?
    * Do more scanning and optimize the space?
    * If K is large, is maintaining the queue expensive?
      * Other techniques still need to sort; this one does not
* Cost is relative to K
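The priority-queue approach above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation (it assumes the query asks for the K smallest keys, and all names are ours): a max-heap of size K holds the current candidates, so the input is never fully sorted and only the K survivors are sorted at the end.

```python
import heapq

def top_k_smallest(tuples, k, key=lambda t: t[0]):
    """Keep only the K best tuples seen so far; the input is never fully
    sorted.  A max-heap of size K (simulated by negating keys) holds the
    current candidates; a tuple whose key is worse than the heap's root
    is rejected immediately.  Only the K survivors are sorted at the end."""
    heap = []  # entries are (-key, arrival_index, tuple); the index breaks ties
    for i, t in enumerate(tuples):
        entry = (-key(t), i, t)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:  # strictly better than the current worst
            heapq.heapreplace(heap, entry)
    return sorted((t for _, _, t in heap), key=key)
```

Each input tuple costs at most O(log K), and tuples worse than the heap's root are rejected in O(1), which is why the cost is relative to K rather than to the input size.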

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWyyR8bgx6ooCY4ZBvf%2Fimage.png?alt=media\&token=bb17f319-7469-4f6b-9d2c-2d567e143655)

* Problem
  * Cost is relative to the input size
  * PostgreSQL: an in-memory priority queue, switching to an external sort if the input exceeds the memory budget

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWyzO4nN3H9dh5wEl5E%2Fimage.png?alt=media\&token=a6fd5820-3bbc-4a73-abc0-30549cb4f910)

* In-memory priority queue → external sort
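A rough sketch of this fallback, under our own simplifying assumptions (the budget is counted in keys rather than bytes, and sorted runs are kept in memory instead of on disk):

```python
import heapq
import itertools

def external_top_k(stream, k, budget):
    """Sketch of the PostgreSQL-style fallback: keep an in-memory max-heap
    of the K smallest keys while K fits in the budget; otherwise fall back
    to an external sort (sorted runs, then a merge that stops after K).
    The budget is counted in keys, and runs stay in memory for brevity."""
    if k <= budget:  # fast path: a heap of size K fits in memory
        heap = []
        for x in stream:
            if len(heap) < k:
                heapq.heappush(heap, -x)
            elif -x > heap[0]:
                heapq.heapreplace(heap, -x)
        return sorted(-v for v in heap)
    # Slow path: external sort.  Build sorted runs of at most `budget`
    # keys (a real system writes each run to disk), then merge the runs
    # lazily and keep only the first K results.
    runs, run = [], []
    for x in stream:
        run.append(x)
        if len(run) == budget:
            runs.append(sorted(run))
            run = []
    if run:
        runs.append(sorted(run))
    return list(itertools.islice(heapq.merge(*runs), k))
```

The slow path still sorts and spills the entire input, which is exactly the inefficiency the histogram-guided filtering below targets.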

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWyzZEgT9mNPF4Hwz1h%2Fimage.png?alt=media\&token=5734766c-ecac-4b78-8007-be5749036870)

* Many threads work at the same time; memory is split across threads, queries, and operators
* Different queries have different priorities, so memory may be taken away from this operator

### Our Top-K Algorithm

* When K fits in memory: behaves just like the in-memory priority queue

#### Top-K when K > Memory

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz-1YuCa9Lf43BwSBy%2Fimage.png?alt=media\&token=1e40733e-1c23-4703-bdaf-3f23d585e2e1)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz-AJ_UrZwcV6QLjSC%2Fimage.png?alt=media\&token=7c32b412-3366-4763-81b9-0d5e4efb669b)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz-IQt48dmXn2eR5UI%2Fimage.png?alt=media\&token=4afa794e-69d7-43ad-9a52-01256b56c93d)

* Key question: can we tell that a tuple will not be part of the final output?
* What information is needed to calculate this kind of cutoff key?
  * A histogram: use its information to calculate the cutoff

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz-S8dA4ocsUjiTJr5%2Fimage.png?alt=media\&token=ede3a664-5d96-436c-8306-57dea9ce895d)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz-o_OZ8MR5ksk4p-l%2Fimage.png?alt=media\&token=55bb6f02-961f-4001-8807-5ca59f6c1a23)

* As more input is processed, the cutoff key keeps being refined

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz0CKrYykoN-uKo17H%2Fimage.png?alt=media\&token=72b54a61-d188-4b3e-9416-9b0c3a2d0099)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz0FAq4f5emmx0G9X7%2Fimage.png?alt=media\&token=5a719ccb-3e56-447c-9593-0f78f7636a12)

* Example: 10 input tuples
  * The histogram shows that values greater than 30 can be safely discarded
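The slide's example can be reproduced with a small sketch (bucket boundaries, counts, and names are illustrative, not the paper's exact scheme): walk the histogram until the cumulative count covers K; everything above that boundary is safe to discard.

```python
def cutoff_from_histogram(buckets, k):
    """Walk a histogram given as (upper_bound, count) pairs sorted by
    bound, and return the smallest upper bound whose cumulative count
    reaches K.  Tuples with keys above that bound cannot be among the
    K smallest, so they can be discarded before sorting or spilling."""
    seen = 0
    for upper_bound, count in buckets:
        seen += count
        if seen >= k:
            return upper_bound
    return float("inf")  # fewer than K tuples summarized: nothing can be cut

# 10 input keys summarized into three buckets; with K = 5 the first two
# buckets already cover 5 tuples, so keys above 30 are safely discarded.
buckets = [(10, 2), (30, 3), (90, 5)]
cutoff = cutoff_from_histogram(buckets, 5)  # → 30
```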

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz0XXc6348qUeqGnn_%2Fimage.png?alt=media\&token=5c5fbd10-ad56-4567-b3a9-d0c12ab60fd7)

* Keep just enough information to represent K tuples
* The histogram could be kept sorted
  * But that adds extra cost when inserting a new histogram entry
  * Moving entries around might be more costly and allocate more memory
    * Since the bottleneck is I/O, histogram maintenance will not become the bottleneck

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz1UuNRDd03Nn0rva_%2Fimage.png?alt=media\&token=543624df-a005-4810-83f0-951739561db1)

* The same cutoff key is used even before sorting
* So there are two filtering points:
  * Before sorting a run
  * Before writing a run to secondary storage

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz1puVOhuM8IS67-PC%2Fimage.png?alt=media\&token=89d7bd75-af01-4562-a3ac-a801da4fd19f)

* Take the value at the top of the priority queue
* Create the histogram bucket from it, with the bucket size set to the total count of queued values
  * This empties the priority queue quickly

### Experimental Evaluation

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz2hJToXitXD7ki8WP%2Fimage.png?alt=media\&token=056881cc-4db2-4bed-afda-a2389a9d6c85)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz32NLJE0JW621zD88%2Fimage.png?alt=media\&token=6418f027-feae-4ce4-93a7-765388b46759)

* Increasing the number of histogram buckets achieves a speedup
  * But the benefit diminishes when going from 50 to 100 buckets
  * 50 buckets (2 integers each)
* The two plots are directly related
  * These are I/O-bound operations
* Equal-size histogram buckets?
  * 50 buckets for each run
  * Replacement selection
  * Buckets of different sizes? Only the last one
* Why not increase the number of buckets further?
  * Memory?
    * Stop once the budget (a few megabytes) would be exceeded
  * Problematic
    * As the memory budget decreases, ...
  * Play with the budget?
* When is there more benefit?
  * It depends on the data distribution, among other factors
  * More buckets help in some cases, but not in others

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz3kHTLPn9NNHULwh8%2Fimage.png?alt=media\&token=a8fed528-74c3-46a6-a094-574d5d2ef6a4)

* Collect enough information
* Eliminate most of the input before writing to secondary storage

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz55Qm7UJqI8mzLj_U%2Fimage.png?alt=media\&token=92f8623b-1c93-4c8b-bf36-c93930b6111c)

* Adversarial case

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWyw9mS72ai4v140Ntq%2F-MWz5Ni9l7YhJqOXtC8i%2Fimage.png?alt=media\&token=36ebb41d-5108-4b5a-8aa2-f503d7887c03)

* The system has enough memory available, but chooses not to use all of it
* Cost metric: memory size × execution time
* For bigger input sizes:
  * Roughly 3x cost improvement (cheaper)
  * No significant slowdown in performance
  * Compared to the baseline where all memory is used
* You are charged for what you actually use
  * Why does the cost go down?
    * Tuples are filtered out early, so the entire input does not have to be sorted and spilled
    * Going to disk increases latency, but not by much, because only a small fraction is sorted
  * Baseline
    * The baseline maintains 30 million tuples in memory
    * Our approach keeps flushing to disk
      * This saves memory consumption
      * And, by the cost calculation above, saves cost
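A back-of-the-envelope check of this cost metric, with illustrative numbers rather than measurements from the paper: using a third of the memory at roughly the same runtime yields about a 3x cost reduction.

```python
def billed_cost(memory_gb, seconds):
    """Cost metric from the talk: memory size * execution time."""
    return memory_gb * seconds

# Illustrative numbers only: the baseline keeps everything resident,
# while early flushing uses ~1/3 of the memory at almost the same runtime.
baseline = billed_cost(30, 100)
ours = billed_cost(10, 102)
ratio = baseline / ours  # ~2.9x cheaper
```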

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWz5QS_Ngi7ZECv70qF%2F-MWz6V6vREhkSstzSiYf%2Fimage.png?alt=media\&token=26b99479-1db8-494d-90ff-76921fb0c722)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MWz6ZKsYaXNLZ-UHMvz%2F-MWz6t8Vfru7dfGuWIAx%2Fimage.png?alt=media\&token=201b2ed9-2e99-456e-883b-a4c61ad29320)

* Varying K
  * Baseline: external sort without the cutoff
  * It takes some time to consume the input (later, the time decreases)

* Lines in the plot
  * Correspond to different data distributions

* Comparison with other baselines?
  * A priority queue? Only possible when K fits in memory

* What about a sampling mechanism?
  * Sampling reduces the operations needed to find the K-th value, while the result must still be correct
  * The probabilistic guarantee determines the number of operations required
  * It takes many probes into the data, and at what point can you guarantee the answer is accurate?
  * It also requires an extra scan
  * In this setting the data is remote, so a two-scan approach is expensive
