External Merge Sort for Top-K Queries: Eager Input Filtering Guided by Histograms

https://dl.acm.org/doi/pdf/10.1145/3318464.3389729

  • Focus: the case where K is large

  • In the running-time figures, their algorithm is the fastest

  • Computes the exact top-K tuples

    • Other algorithms are approximate; approximation is out of scope in this work

  • Queries

    • Over a single table

    • Or over the result of previous operators

  • Joins can be pushed below the top-K operator

Existing Top-K Algorithms

  • No need to sort everything; only keep track of the current K tuples (see the sketch at the end of this section)

  • Can this be optimized further?

    • Is maintaining the priority queue expensive?

      • This technique only requires one memory allocation, and fully sorts only the K result tuples

      • What would further optimization target?

        • Do more scanning in exchange for less space?

        • If K is large, keeping K tuples around gets expensive

          • Other techniques still need to sort; this one does not

  • Memory use is relative to K

  • Problem

    • Memory is limited relative to the input size

    • PostgreSQL: an in-memory priority queue, then a switch to external sort if it exceeds the memory budget

  • In-memory priority queue --> external sort

  • Threads work at the same time; memory is split across threads, queries, and operators

  • Different queries have different priorities, so the resource might be taken away from us

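Below is a minimal sketch (mine, not the paper's) of the bounded priority-queue approach described above, assuming top-K means the K smallest keys; `topk_smallest` and the toy input are illustrative.

```python
import heapq

def topk_smallest(keys, k):
    """Keep the K smallest keys in a bounded max-heap (negated min-heap).

    No full sort of the input is needed: each key is compared against
    the current worst of the kept K and either replaces it or is
    discarded on the spot.
    """
    heap = []  # holds at most k negated keys
    for key in keys:
        if len(heap) < k:
            heapq.heappush(heap, -key)
        elif key < -heap[0]:              # beats the current K-th best
            heapq.heapreplace(heap, -key)
        # else: discard immediately
    return sorted(-x for x in heap)       # only the K result keys get sorted

print(topk_smallest([9, 4, 7, 1, 8, 2, 6], k=3))  # [1, 2, 4]
```

When K exceeds the memory budget this bounded heap no longer fits, which is where the external-sort fallback above comes in.
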
Our Top-K Algorithm

  • When K < memory: just like the in-memory priority queue

Top-K when K > Memory

  • Key idea: a cutoff key that tells whether a tuple cannot be part of the final output

  • What kind of information is needed to compute this cutoff key?

    • A histogram: use its counts to compute it (see the sketch at the end of this section)

  • As more input is processed, the cutoff keeps being refined

  • Example: K = 10 input tuples

    • Once at least 10 keys are known to be at or below 30, values greater than 30 can be safely discarded

  • Keep enough information to represent K tuples

  • The histogram could be kept sorted

    • But that adds extra cost when inserting a new histogram entry

    • Moving entries around may be more costly and may allocate more memory

      • But the bottleneck is I/O, so this CPU work does not become the bottleneck

  • The same cutoff key is used even before sorting

  • So it is applied at two points

    • Before sorting a run

    • Before writing runs to secondary storage

  • Take the value at the top of the priority queue as the initial cutoff

  • Build the histogram from the queue's contents, with each bucket's count set to the number of values it covers

    • This quickly empties the priority queue

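A rough sketch of the histogram-guided cutoff as I understand it (the equi-width bucketing, the `CutoffHistogram` name, and the driver loop are my assumptions, not the paper's code): each bucket is 2 integers, an upper boundary and a count, and the cutoff is the smallest boundary at or below which at least K tuples are already known to exist.

```python
import bisect

class CutoffHistogram:
    """Equi-width histogram; each bucket is 2 integers:
    an upper boundary and a count."""

    def __init__(self, lo, hi, n_buckets=50):
        width = (hi - lo) / n_buckets
        self.bounds = [lo + width * (i + 1) for i in range(n_buckets)]
        self.counts = [0] * n_buckets

    def add(self, key):
        # clamp keys beyond `hi` into the last bucket
        i = min(bisect.bisect_left(self.bounds, key), len(self.bounds) - 1)
        self.counts[i] += 1

    def cutoff(self, k):
        """Smallest boundary at or below which >= k keys are known to lie.

        Any later key above this boundary provably cannot be among the
        k smallest, so it can be discarded before sorting or spilling.
        """
        seen = 0
        for bound, count in zip(self.bounds, self.counts):
            seen += count
            if seen >= k:
                return bound
        return float("inf")  # not enough tuples seen yet: keep everything

# Hypothetical driver: filter eagerly before sorting a run and before
# writing it to secondary storage (cutoff recomputed per tuple for simplicity).
hist = CutoffHistogram(lo=0, hi=100, n_buckets=50)
k = 10
run = []
for key in [5, 72, 18, 93, 33, 8, 61, 27, 14, 2, 41, 9, 55, 3, 30]:
    hist.add(key)
    if key <= hist.cutoff(k):   # eager input filtering
        run.append(key)
run.sort()                       # only the surviving fraction is sorted
# ...write `run` to secondary storage as one sorted run...
```

As I read the notes above, the mode switch fits the same structure: when the in-memory priority queue outgrows the budget, its contents seed the histogram's counts and the queue is drained.
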
Experimental Evaluation

  • Increasing the number of histogram buckets achieves a speedup

    • But the benefit diminishes when going from 50 to 100 buckets

    • They use 50 buckets (2 integers each)

  • Two of the plots are directly related

    • The operations are I/O-bound

  • Equal-size histogram buckets?

    • 50 buckets for each run

    • Runs are generated via replacement selection

    • Do buckets have different sizes? Only the last one

  • What about increasing the number of buckets?

    • Memory?

      • Stop once the budget would be exceeded (a few megabytes)

    • Problematic

      • As the memory budget decreases, ...

    • Could you play with the budget?

  • When is there more benefit?

    • Depends on a change of distribution, or ...

    • More buckets help in some cases, but not in others

  • Once enough information is collected

  • Most of the input is eliminated before it is written to secondary storage

  • Adversarial case

  • Enough memory is available, but the algorithm does not use all of it

  • Cost metric: memory size * execution time (see the worked example below)

  • At bigger input sizes:

    • 3x improvement in cost (cheaper)

    • No significant slowdown in performance

    • Compared to the baseline where all of the memory is being used

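A hedged worked example of this cost metric (the numbers are hypothetical, picked only to match the 3x figure above): if the baseline finishes in time T using memory M, its cost is M * T. If eager filtering finishes in roughly the same time using about a third of the memory, its cost is (M/3) * T ≈ 0.33 * M * T, i.e. about 3x cheaper even though wall-clock performance is essentially unchanged.
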
  • You are charged for what is being used

    • Why does this reduce cost?

      • Tuples are filtered out early, so the entire input does not have to be sorted and spilled

      • Going to disk increases latency, but not by much, because only a small fraction is sorted

    • Baseline

      • Baseline: maintains 30 million tuples in memory

      • Their case: keeps flushing runs to disk

        • Saving memory consumption

        • And, by the cost calculation above, saving cost

  • Varying K

    • Baseline: external sort without the cutoff

    • It takes some time to consume the input (this decreases later)

  • Lines in the plots

    • Correspond to different data distributions

  • Compare with other baselines?

    • A priority queue? Only possible when K fits in memory

  • What about a sampling mechanism?

    • It spends extra operations to find the K-th value, but the result is still exact

    • The probabilistic guarantee dictates the number of operations required

    • It takes many probes into the data; at what point can you guarantee accuracy?

    • It needs an extra scan

    • In this setting the data is remote, so a two-scan approach is expensive
