# Data Management

[DMon: Efficient Detection and Correction of Data Locality Problems Using Selective Profiling](https://www.usenix.org/conference/osdi21/presentation/khan)

[CLP: Efficient and Scalable Search on Compressed Text Logs](https://www.usenix.org/conference/osdi21/presentation/rodrigues)

[Polyjuice: High-Performance Transactions via Learned Concurrency Control](https://www.usenix.org/conference/osdi21/presentation/wang-jiachen)

[Retrofitting High Availability Mechanism to Tame Hybrid Transaction/Analytical Processing](https://www.usenix.org/conference/osdi21/presentation/shen)

### DMon

Motivation

* A single search query (~700 ms) occupies about 8 processor cores
  * At datacenter scale, that means millions of cores
  * Significant management and energy costs
* CPU performance breakdown
  * Only ~32% of cycles do useful work
  * The rest is spent fetching, decoding, on bad speculation, on unavailable execution units, and on memory stalls (poor data locality)

Existing techniques

* Compiler optimizations
  * Improve data locality via program transformations
  * Con: rely on static heuristics and can sometimes hurt performance
* Dynamic profilers
  * Provide accurate execution information to identify and resolve poor locality
  * Con: repair is mostly manual, and profiling overhead is high

Contribution

* Selective profiling: keep cheap, always-on monitoring, and enable detailed profiling only for code that shows symptoms of poor locality
* Automatically apply targeted compiler optimizations based on the profiling results (see the sketch below)
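
A minimal sketch of the selective-profiling control loop, not DMon's actual implementation (all names, thresholds, and the diagnosis step here are invented; the real system drives hardware performance counters and compiler passes):

```python
from dataclasses import dataclass

MISS_RATE_THRESHOLD = 0.10  # illustrative: escalate above a 10% miss rate

@dataclass
class RegionStats:
    name: str
    accesses: int   # coarse access count from an always-on counter
    misses: int     # coarse cache-miss count from an always-on counter

def detailed_profile(region: RegionStats) -> str:
    """Stand-in for precise-event sampling, enabled only for suspect regions."""
    # A real tool would inspect which objects/fields miss; here we fake it.
    return "poor_spatial_locality"

def targeted_optimization(diagnosis: str) -> str:
    """Map a diagnosed locality problem to a compiler transformation."""
    return {
        "poor_spatial_locality": "structure splitting / field packing",
        "poor_temporal_locality": "loop interchange / tiling",
        "irregular_access":      "software prefetching",
    }.get(diagnosis, "no-op")

def monitor(regions):
    for r in regions:
        miss_rate = r.misses / max(r.accesses, 1)   # cheap, always-on check
        if miss_rate > MISS_RATE_THRESHOLD:          # symptom detected
            yield r.name, targeted_optimization(detailed_profile(r))

if __name__ == "__main__":
    regions = [RegionStats("hot_loop", 1_000_000, 180_000),
               RegionStats("cold_init", 50_000, 500)]
    for name, fix in monitor(regions):
        print(f"{name}: apply {fix}")
```

The key property is that the expensive step only ever runs for regions the cheap counters flag, so steady-state profiling overhead stays low.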


![](/files/-MeaTnEfYbpaoWon_K01)

Evaluation

* Efficiency: profiling overhead
* Effectiveness: locality problems detected and the resulting speedups
* Real-world case studies

Q: Are there data locality problems that these optimizations cannot solve?

* Tail latency problems
  * DMon cannot identify and repair every locality issue

### CLP

Workload / problems

* Typical pipeline: logs --> search tools --> compression (e.g., gzip)
  * Problem
    * Petabytes of logs
    * Uncompressed logs consume a lot of storage and compute resources
  * Current solutions
    * Build indexes: adds storage overhead
      * Total usage stays the same order of magnitude as the raw logs
    * Indexed logs can only be retained for a few weeks
    * Compression (to ~2.27% of the original size)
      * Logs are unsearchable once compressed
      * Searching them requires decompression first

Key ideas

* Lossless log compression that beats general-purpose compressors
* Search directly on compressed logs
  * Without decompressing them
    * Only matching results are decompressed
  * With good performance

Method

* A domain-specific algorithm extracts the parts that repeat across messages (see the sketch below)
  * De-duplicate the static text into a log type dictionary
  * De-duplicate repetitive variable values into a variable dictionary
  * Store each message in a compact encoded form (log type ID + variables)
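
A toy sketch of that encoding and search (the tokenizer and placeholder format are simplified stand-ins; real CLP distinguishes variable types, uses compact binary encodings, and stores archives on disk):

```python
import re

# Toy tokenizer: any token containing a digit is treated as a variable.
VAR_PATTERN = re.compile(r"\b[\w./-]*\d[\w./-]*\b")

logtype_dict, var_dict = {}, {}   # string -> id
encoded = []                      # (logtype_id, (var_id, ...)) per message

def intern(table, s):
    return table.setdefault(s, len(table))

def compress(message):
    variables = VAR_PATTERN.findall(message)
    logtype = VAR_PATTERN.sub("\x11", message)   # template with placeholders
    encoded.append((intern(logtype_dict, logtype),
                    tuple(intern(var_dict, v) for v in variables)))

def search(substring):
    """Find matches without decompressing non-matching messages."""
    lt_ids = {i for s, i in logtype_dict.items() if substring in s}
    v_ids  = {i for s, i in var_dict.items() if substring in s}
    by_lt  = {i: s for s, i in logtype_dict.items()}
    by_v   = {i: s for s, i in var_dict.items()}
    for lt, vs in encoded:                        # scan only compact IDs
        if lt in lt_ids or any(v in v_ids for v in vs):
            it = iter(by_v[v] for v in vs)        # decompress the match only
            yield re.sub("\x11", lambda _: next(it), by_lt[lt])

compress("INFO Task task_12 assigned to container container_15")
compress("INFO Task task_12 finished in 3045 ms")
print(list(search("task_12")))   # both messages reconstructed
```

Matching against the small dictionaries first is what makes compressed search fast: the bulk of the data (the encoded messages) is only ever scanned as integer IDs.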

Comparison

* Search tools: ripgrep, Elasticsearch
* Archive tools: lzma, ...

Q: How does CLP handle changes in log structure, and keep the dictionaries from blowing up?

* Logs are compressed into multiple archives
  * Each archive has its own set of dictionaries
* Q: Does searching across archives hurt performance?
  * Search time is negligible compared to the time spent reading the data
* Q: Does CLP need an accurate schema?
  * Yes
  * But the variable patterns tend to be quite generic, so a generic schema works
    * Though a generic schema may not give the best search performance

### Polyjuice

Motivation

* Concurrency control (CC) governs how concurrent executions interleave
  * Allowing more interleaving --> better performance
* No single CC algorithm is best across all workloads (e.g., high vs. low contention)

Current solutions

* Coarse-grained approaches combine a few known CC algorithms
* Cons
  * Cumbersome: requires manually partitioning the workload
  * Suboptimal: uses a single CC algorithm within each partition

Approach

* Frame CC as a learned (RL-style) policy
  * State
    * Must distinguish situations that require different CC actions
    * Two parts
      * The type of the transaction
      * The data accesses made so far
  * Action (see the sketch after this list)
    * Should be able to encode most existing CC algorithms
    * Exerts fine-grained control over the interleaving
    * Four knobs per access
      * Whether to wait, and for how long
        * OCC: no wait
        * 2PL: wait until dependent transactions commit
        * IC3, RP, DRP: wait until dependent transactions finish execution up to some point
      * Which version of the data to read
        * Latest committed version
        * Latest published dirty version
      * Whether to make a dirty write visible
        * Keep it in a private buffer
        * Publish the dirty write
      * Whether to validate now, prior to commit
        * Whether or not to validate right after this access
        * Based on Silo's validation protocol
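
A schematic of the policy's shape, with invented field names and values (Polyjuice's real policy table is trained offline and its wait actions are finer-grained): the state (transaction type, access index) selects a per-access action vector.

```python
from dataclasses import dataclass

# Illustrative action vector covering the four knobs described above.
@dataclass(frozen=True)
class Action:
    wait: str            # "none" (OCC-like), "commit" (2PL-like), "access_k" (IC3-like)
    read_dirty: bool     # read latest published dirty version vs. latest committed
    expose_write: bool   # publish the dirty write vs. keep it in a private buffer
    validate_now: bool   # run Silo-style validation right after this access

# The learned policy is a lookup table: (tx type, access index) -> Action.
# Entries below are made up for illustration.
policy = {
    ("payment",   0): Action("none",     read_dirty=True,  expose_write=True,  validate_now=False),
    ("payment",   1): Action("commit",   read_dirty=False, expose_write=False, validate_now=True),
    ("new_order", 0): Action("access_2", read_dirty=True,  expose_write=True,  validate_now=False),
}

def choose_action(tx_type: str, access_idx: int) -> Action:
    # Fall back to a conservative OCC-like action for unseen states.
    return policy.get((tx_type, access_idx),
                      Action("none", read_dirty=False, expose_write=False, validate_now=True))

print(choose_action("payment", 1))   # 2PL-like: wait for dependencies to commit
```

Because the table is indexed per access, a single transaction can behave OCC-like on one record and 2PL-like on another, which is exactly what fixed algorithms cannot do.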

![](/files/-MeacE2AHKFx0XuYYmGE)

### VEGITO

Workload

* Typical workloads
  * OLTP: short-running transactions
  * OLAP: long-running analytical queries
* New type of workload: HTAP (hybrid transaction/analytical processing)

Requirements

* Performance
  * Minimize degradation of the transactional side
  * Sustain millions of transactions per second
* Freshness
  * Keep the delay between a commit and its visibility to analytical queries small

Traditional data analysis pipeline: TP + ETL + AP

* HTAP alternatives
* Alternative 1: dual system
  * Good performance
  * Large time delay (stale data)
* Alternative 2: single system, single layout
  * Short time delay
  * Huge performance degradation
* Alternative 3: single system, dual layout
  * Lightweight synchronization

Idea: VEGITO

* Retrofit the high-availability mechanism of fast in-memory OLTP
  * Replication-based, with synchronous log shipping
  * Let the backup replicas serve the analytical queries
* Challenges
  * Log cleaning
    * Must be consistent yet highly parallel
  * Epoch-based design
    * Gossip-style protocol
      * Guarantees that epoch assignment is consistent
  * MVCS
    * Multi-version column store (MVCS)
    * Reads and writes favor different locality
      * Column-wise vs. row-wise
    * Use a block-based MVCS (see the first sketch after this list)
      * Row-split
      * Column-merge
  * Tree-based index
    * Needs to be both write-optimized and read-optimized
    * Tree insert = in-place insert + rebalancing (costly)
      * Rebalance in batches (see the second sketch after this list)
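
A rough sketch of the block-based MVCS idea, covering only the row-split and versioning parts (block size, names, and version layout are invented; VEGITO additionally ties versions to shipped epochs, applies column-merge to group co-written columns, and garbage-collects old versions):

```python
BLOCK_ROWS = 4  # illustrative; real systems use much larger blocks

class ColumnBlock:
    """One column's values for a contiguous row range, with per-row versions."""
    def __init__(self):
        # versions[row] is a list of (epoch, value) pairs, newest last.
        self.versions = [[] for _ in range(BLOCK_ROWS)]

    def write(self, row, epoch, value):
        self.versions[row].append((epoch, value))    # append a new version

    def read(self, row, epoch):
        # Return the newest value visible at `epoch` (stable snapshot reads).
        for e, v in reversed(self.versions[row]):
            if e <= epoch:
                return v
        return None

class Column:
    """Row-split: the column is a list of fixed-size blocks, so a write
    touches one small block instead of one huge versioned array."""
    def __init__(self):
        self.blocks = []

    def _block(self, row):
        b, r = divmod(row, BLOCK_ROWS)
        while len(self.blocks) <= b:
            self.blocks.append(ColumnBlock())
        return self.blocks[b], r

    def write(self, row, epoch, value):
        blk, r = self._block(row)
        blk.write(r, epoch, value)

    def scan(self, epoch):
        # Analytical scans walk the blocks in order; a real column store
        # lays each block's values out contiguously for sequential reads.
        for b, blk in enumerate(self.blocks):
            for r in range(BLOCK_ROWS):
                v = blk.read(r, epoch)
                if v is not None:
                    yield b * BLOCK_ROWS + r, v

col = Column()
col.write(0, epoch=1, value=10)
col.write(0, epoch=3, value=20)
col.write(5, epoch=2, value=99)
print(list(col.scan(epoch=2)))   # -> [(0, 10), (5, 99)]: the epoch-2 snapshot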

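The second sketch illustrates "balance in batch". It is not VEGITO's actual index structure; it uses a sorted array plus an insert buffer to show the amortization idea: each insert is O(1), and the costly restructuring happens once per batch instead of once per insert.

```python
import bisect

class BatchBalancedIndex:
    """Sketch of 'balance in batch': inserts go to a small unordered
    buffer immediately, and the expensive restructuring (standing in
    for tree rebalancing) runs once per batch."""
    def __init__(self, batch_size=1024):
        self.sorted_keys = []          # read-optimized, fully ordered part
        self.buffer = []               # write-optimized, unordered tail
        self.batch_size = batch_size

    def insert(self, key):
        self.buffer.append(key)        # O(1), no balancing on the hot path
        if len(self.buffer) >= self.batch_size:
            self.rebalance()           # amortized restructuring

    def rebalance(self):
        # One sort + merge replaces many per-insert rotations.
        self.sorted_keys = sorted(self.sorted_keys + self.buffer)
        self.buffer.clear()

    def contains(self, key):
        i = bisect.bisect_left(self.sorted_keys, key)
        if i < len(self.sorted_keys) and self.sorted_keys[i] == key:
            return True
        return key in self.buffer      # linear scan of the small buffer

idx = BatchBalancedIndex(batch_size=4)
for k in [42, 7, 19, 3, 88]:           # the 4th insert triggers a rebalance
    idx.insert(k)
print(idx.contains(19), idx.contains(100))   # -> True False
```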
