External Merge Sort for Top-K Queries: Eager Input Filtering Guided by Histograms

https://dl.acm.org/doi/pdf/10.1145/3318464.3389729

  • Focus: the case where K is large

  • In the running-time figures, their algorithm is the fastest

  • Computes the exact top-K tuples

    • Other algorithms are approximate; approximation is out of scope in this work

  • Queries

    • Over a single table

    • Or over the result of previous operators

  • Joins can be pushed below the top-K operator

Existing Top-K Algorithms

  • No need to sort everything; only keep track of the current K tuples (see the sketch at the end of this section)

  • Can this be optimized further?

    • Is maintaining the priority queue expensive?

      • This technique only requires one memory allocation, and fully sorts only the K result tuples

      • What would further optimization target?

        • Do more scanning in exchange for less space?

        • If K is large, keeping K tuples around gets expensive

          • Other techniques still need to sort; this one does not

  • Memory use is relative to K

  • Problem

    • Memory is limited relative to the input size

    • PostgreSQL: an in-memory priority queue, then a switch to external sort if it exceeds the memory budget

  • In-memory priority queue --> external sort

  • Threads work at the same time; memory is split across threads, queries, and operators

  • Different queries have different priorities, so the resource might be taken away from us

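Below is a minimal sketch (mine, not the paper's) of the bounded priority-queue approach described above, assuming top-K means the K smallest keys; `topk_smallest` and the toy input are illustrative.

```python
import heapq

def topk_smallest(keys, k):
    """Keep the K smallest keys in a bounded max-heap (negated min-heap).

    No full sort of the input is needed: each key is compared against
    the current worst of the kept K and either replaces it or is
    discarded on the spot.
    """
    heap = []  # holds at most k negated keys
    for key in keys:
        if len(heap) < k:
            heapq.heappush(heap, -key)
        elif key < -heap[0]:              # beats the current K-th best
            heapq.heapreplace(heap, -key)
        # else: discard immediately
    return sorted(-x for x in heap)       # only the K result keys get sorted

print(topk_smallest([9, 4, 7, 1, 8, 2, 6], k=3))  # [1, 2, 4]
```

When K exceeds the memory budget this bounded heap no longer fits, which is where the external-sort fallback above comes in.
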
Our Top-K Algorithm

  • When K < memory: just like the in-memory priority queue

Top-K when K > Memory

  • Key idea: a cutoff key that tells whether a tuple cannot be part of the final output

  • What kind of information is needed to compute this cutoff key?

    • A histogram: use its counts to compute it (see the sketch at the end of this section)

  • As more input is processed, the cutoff keeps being refined

  • Example: K = 10 input tuples

    • Once at least 10 keys are known to be at or below 30, values greater than 30 can be safely discarded

  • Keep enough information to represent K tuples

  • The histogram could be kept sorted

    • But that adds extra cost when inserting a new histogram entry

    • Moving entries around may be more costly and may allocate more memory

      • But the bottleneck is I/O, so this CPU work does not become the bottleneck

  • The same cutoff key is used even before sorting

  • So it is applied at two points

    • Before sorting a run

    • Before writing runs to secondary storage

  • Take the value at the top of the priority queue as the initial cutoff

  • Build the histogram from the queue's contents, with each bucket's count set to the number of values it covers

    • This quickly empties the priority queue

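A rough sketch of the histogram-guided cutoff as I understand it (the equi-width bucketing, the `CutoffHistogram` name, and the driver loop are my assumptions, not the paper's code): each bucket is 2 integers, an upper boundary and a count, and the cutoff is the smallest boundary at or below which at least K tuples are already known to exist.

```python
import bisect

class CutoffHistogram:
    """Equi-width histogram; each bucket is 2 integers:
    an upper boundary and a count."""

    def __init__(self, lo, hi, n_buckets=50):
        width = (hi - lo) / n_buckets
        self.bounds = [lo + width * (i + 1) for i in range(n_buckets)]
        self.counts = [0] * n_buckets

    def add(self, key):
        # clamp keys beyond `hi` into the last bucket
        i = min(bisect.bisect_left(self.bounds, key), len(self.bounds) - 1)
        self.counts[i] += 1

    def cutoff(self, k):
        """Smallest boundary at or below which >= k keys are known to lie.

        Any later key above this boundary provably cannot be among the
        k smallest, so it can be discarded before sorting or spilling.
        """
        seen = 0
        for bound, count in zip(self.bounds, self.counts):
            seen += count
            if seen >= k:
                return bound
        return float("inf")  # not enough tuples seen yet: keep everything

# Hypothetical driver: filter eagerly before sorting a run and before
# writing it to secondary storage (cutoff recomputed per tuple for simplicity).
hist = CutoffHistogram(lo=0, hi=100, n_buckets=50)
k = 10
run = []
for key in [5, 72, 18, 93, 33, 8, 61, 27, 14, 2, 41, 9, 55, 3, 30]:
    hist.add(key)
    if key <= hist.cutoff(k):   # eager input filtering
        run.append(key)
run.sort()                       # only the surviving fraction is sorted
# ...write `run` to secondary storage as one sorted run...
```

As I read the notes above, the mode switch fits the same structure: when the in-memory priority queue outgrows the budget, its contents seed the histogram's counts and the queue is drained.
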
Experimental Evaluation

  • Increasing the number of histogram buckets achieves a speedup

    • But the benefit diminishes when going from 50 to 100 buckets

    • They use 50 buckets (2 integers each)

  • Two of the plots are directly related

    • The operations are I/O-bound

  • Equal-size histogram buckets?

    • 50 buckets for each run

    • Runs are generated via replacement selection

    • Do buckets have different sizes? Only the last one

  • What about increasing the number of buckets?

    • Memory?

      • Stop once the budget would be exceeded (a few megabytes)

    • Problematic

      • As the memory budget decreases, ...

    • Could you play with the budget?

  • When is there more benefit?

    • Depends on a change of distribution, or ...

    • More buckets help in some cases, but not in others

  • Once enough information is collected

  • Most of the input is eliminated before it is written to secondary storage

  • Adversarial case

  • Enough memory is available, but the algorithm does not use all of it

  • Cost metric: memory size * execution time (see the worked example below)

  • At bigger input sizes:

    • 3x improvement in cost (cheaper)

    • No significant slowdown in performance

    • Compared to the baseline where all of the memory is being used

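A hedged worked example of this cost metric (the numbers are hypothetical, picked only to match the 3x figure above): if the baseline finishes in time T using memory M, its cost is M * T. If eager filtering finishes in roughly the same time using about a third of the memory, its cost is (M/3) * T ≈ 0.33 * M * T, i.e. about 3x cheaper even though wall-clock performance is essentially unchanged.
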
  • You are charged for what is being used

    • Why does this reduce cost?

      • Tuples are filtered out early, so the entire input does not have to be sorted and spilled

      • Going to disk increases latency, but not by much, because only a small fraction is sorted

    • Baseline

      • Baseline: maintains 30 million tuples in memory

      • Their case: keeps flushing runs to disk

        • Saving memory consumption

        • And, by the cost calculation above, saving cost

  • Varying K

    • Baseline: external sort without the cutoff

    • It takes some time to consume the input (this decreases later)

  • Lines in the plots

    • Correspond to different data distributions

  • Compare with other baselines?

    • A priority queue? Only possible when K fits in memory

  • What about a sampling mechanism?

    • It spends extra operations to find the K-th value, but the result is still exact

    • The probabilistic guarantee dictates the number of operations required

    • It takes many probes into the data; at what point can you guarantee accuracy?

    • It needs an extra scan

    • In this setting the data is remote, so a two-scan approach is expensive
