External Merge Sort for Top-K Queries: Eager input filtering guided by histograms
https://dl.acm.org/doi/pdf/10.1145/3318464.3389729
Focus: K is large
Keeping a running top-K is the fastest approach
Goal: exact top-K tuples
Other algorithms are approximate; not considered in this work
Queries
Over a single table
Or over the result of previous operators
Can the top-K be pushed below a join?
No need to sort the whole input; only keep track of K tuples (a heap sketch follows)
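A minimal Python sketch of that running top-K, assuming smallest-keys-win semantics (ORDER BY ... LIMIT k); the names are illustrative, not the paper's code:

```python
import heapq

def top_k(rows, k, key):
    """Running top-K with a bounded max-heap (via negated keys):
    the input is never sorted; only the K survivors are sorted at
    the very end."""
    heap = []                            # root = weakest survivor = cutoff
    for i, row in enumerate(rows):
        entry = (-key(row), i, row)      # i breaks ties between equal keys
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:      # key(row) beats the current cutoff
            heapq.heapreplace(heap, entry)
    return [row for _, _, row in sorted(heap, reverse=True)]
```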
Can this be optimized further? Is the priority queue the expensive part?
This technique only needs a single memory allocation, and fully sorts only the K result tuples
Optimization goals: scan more, but optimize the space used?
If K is large, keeping K tuples around is expensive
Other techniques still need to sort; this one does not
Problem: memory is small relative to K, and relative to the input size
PostgreSQL: in-memory priority queue, switching to an external sort once it exceeds the memory budget
In-memory priority queue --> external sort
Threads work at the same time; memory is split across threads, queries, and operators
Queries have different priorities, so the resource might be taken away from us
K < memory: behaves just like the in-memory priority queue (a schematic of the switch follows)
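A schematic of that switch, reusing the top_k sketch above. Runs are modeled as in-memory lists for brevity (a real system spills each run to disk), and this is an assumed reading of the PostgreSQL behavior, not its actual code:

```python
import heapq
from itertools import islice

def top_k_with_budget(rows, k, key, budget):
    """If K fits in the memory budget (counted in tuples here),
    use the in-memory priority queue; otherwise fall back to an
    external sort over bounded sorted runs."""
    if k <= budget:
        return top_k(rows, k, key)            # in-memory path
    runs, buf = [], []
    for row in rows:                          # run generation phase
        buf.append(row)
        if len(buf) == budget:
            runs.append(sorted(buf, key=key))
            buf = []
    if buf:
        runs.append(sorted(buf, key=key))
    merged = heapq.merge(*runs, key=key)      # merge phase
    return list(islice(merged, k))            # only the first K matter
```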
Key question: can we tell that a tuple will not be part of the final output?
What information is needed to compute such a cutoff key?
A histogram: its counts are enough to compute the cutoff
As more input is processed, the cutoff keeps being refined
Example: with 10 input tuples, values greater than 30 can safely be discarded
The histogram keeps enough information to account for K tuples (a worked sketch follows)
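A worked sketch of the cutoff computation; the layout (bounds/counts arrays) and the K = 3 concretization of the notes' example are assumptions:

```python
def cutoff_from_histogram(bounds, counts, k):
    """Illustrative cutoff computation (not the paper's code).
    bounds[i] is the upper edge of bucket i and counts[i] the number
    of tuples that fell into it, in ascending key order. The cutoff
    is the first boundary below which the histogram already accounts
    for at least K tuples; any later tuple with a larger key can
    never make the top-K and is discarded on arrival."""
    seen = 0
    for upper, count in zip(bounds, counts):
        seen += count
        if seen >= k:
            return upper    # >= K tuples have key <= upper: safe cutoff
    return None             # fewer than K tuples seen: no safe cutoff yet

# One way the notes' example could play out (K = 3 is assumed): with
# 10 input tuples and one tuple in each of the first three buckets,
# every value greater than 30 is safely discarded.
assert cutoff_from_histogram([10, 20, 30, 40], [1, 1, 1, 7], 3) == 30
```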
The histogram buckets could be kept sorted
But that adds cost when inserting a new tuple into the histogram
Moving entries around may be costly, and may force extra memory allocation
The bottleneck is I/O, though, so histogram maintenance does not become the bottleneck
The same cutoff key is usable even before sorting
So the cutoff is applied at two points
Before sorting
Before writing tuples to secondary storage
On spilling, take the value on top of the priority queue as the initial cutoff
Build the histogram over the queue's contents, setting each bucket count to the number of values that fall into it
Then quickly empty the priority queue (sketched below)
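A sketch of that handoff, using the same heap layout as top_k() above; the equal-width bucketing is an assumption (it matches the note below that only the last bucket differs):

```python
def histogram_from_heap(heap, num_buckets):
    """Spill-time handoff (details assumed): the in-memory priority
    queue seeds the histogram, so the cutoff is usable from the very
    first run. `heap` holds (-key, i, row) entries as in top_k()
    above; its root already gives the initial cutoff, -heap[0][0]."""
    keys = sorted(-entry[0] for entry in heap)  # recover the K keys, ascending
    lo, hi = keys[0], keys[-1]
    width = (hi - lo) / num_buckets or 1        # equal-width buckets ...
    bounds = [lo + width * (i + 1) for i in range(num_buckets)]
    counts = [0] * num_buckets
    for x in keys:
        idx = min(int((x - lo) / width), num_buckets - 1)  # ... last one catches hi
        counts[idx] += 1
    heap.clear()                                # "quickly empty the priority queue"
    return bounds, counts
```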
Increasing the number of histogram buckets achieves a speedup
But the gains diminish going from 50 to 100 buckets
50 buckets (2 integers each)
The two plots are directly related: the operations are I/O bound
Equal-width histogram buckets?
A 50-bucket histogram for each run
Replacement selection
Buckets of different sizes? Only the last one differs
Why not increase the bucket count further? Memory
Stop once the budget is exceeded (a few megabytes; back-of-envelope below)
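A back-of-envelope check of that budget, assuming 8-byte integers and an illustrative 4 MiB budget:

```python
BYTES_PER_BUCKET = 2 * 8              # one boundary + one count per bucket
budget = 4 * 1024 * 1024              # "a few megabytes" (4 MiB assumed)
print(budget // BYTES_PER_BUCKET)     # 262144 buckets fit: far more than 50
```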
Problematic if the memory budget decreases
Could one play with the budget for more benefit?
Depends on changes in the distribution: more buckets help in some cases but not in others
Collect enough information to eliminate most of the input before writing it to secondary storage
Adversarial case: there is enough memory, but the point is not to use all of it
Cost: memory size * execution time (a toy calculation follows)
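A toy calculation under that cost model; the memory and time figures are invented for illustration, only the roughly 3x ratio echoes the results quoted below:

```python
# Cost model from the notes: cost = memory size * execution time.
def cost(memory_gb, seconds):
    return memory_gb * seconds

baseline = cost(10.0, 100.0)   # uses the full memory budget
filtered = cost(3.0, 105.0)    # ~30% of the memory, ~5% slower
print(baseline / filtered)     # ~3.2x cheaper
```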
Bigger input sizes:
~3x improvement in cost (cheaper)
No significant slowdown in performance
Compared to the baseline where all of the memory is used
You are charged for what is being used
Why does the cost drop? Filtered-out tuples never have to be sorted or spilled, so the entire input is not processed
Going to disk increases latency, but not by much, because only a small fraction gets sorted
Baseline: maintain 30 million tuples in memory
This approach: keep flushing runs out to disk instead
Saves memory consumption
And, by the cost calculation above, saves cost
Varying K
Baseline: external sort without the cutoff
Takes some time to consume the input (decreasing later on)
The lines correspond to different data distributions
Compare with other baselines?
A priority queue is only possible when K fits in memory
A sampling mechanism? Extra operations to estimate the K-th value, while the result must stay exact
The probabilistic guarantee dictates how many operations are required
Many probes into the data; at what point can accuracy be guaranteed?
Sampling needs an extra scan
In this setting the data is remote, so a two-scan approach is expensive