Episode 18

Structure is all you need

  • Software 2.0 for data quality management

  • Theodoros Rekatsinas | UW-Madison

The notorious data quality problem

Data quality management tasks

Error detection tasks

  • Tuple (sample) validation

  • Cell-value validation

Data repairs

  • Missing data imputation

  • Data repairs (value replacement)

  • Push the repair decision to the ML model?

ML models are sensitive to low-quality data

Goal: Streamline data quality management

  • Marius: graphs, heterogeneous-structure data

Example: data validation for mean estimation

  • Discrepancies between two estimates

  • Filling in some of the missing values and leaving the others

  • If we know this dependency in advance, we can make a better estimate (see the sketch below)
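A minimal sketch (my own illustration, not from the talk) of why the dependency matters: when missingness depends on a hidden group, the naive mean over observed values is biased, and knowing the dependency lets us correct it with inverse-probability weights. All numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# True values: income depends on a binary group attribute.
group = rng.integers(0, 2, size=10_000)
income = np.where(group == 1, 80.0, 40.0) + rng.normal(0, 5, size=10_000)

# Missingness depends on the group: group 1 hides 80% of its values.
observed = rng.random(10_000) > np.where(group == 1, 0.8, 0.1)

naive = income[observed].mean()  # biased toward group 0

# If we know the dependency (observation rate per group), reweight
# each observed value by the inverse probability of being observed.
p_obs = np.where(group == 1, 0.2, 0.9)
w = 1.0 / p_obs[observed]
reweighted = np.average(income[observed], weights=w)

print(f"true mean        {income.mean():.2f}")
print(f"naive estimate   {naive:.2f}")       # off by several units
print(f"structure-aware  {reweighted:.2f}")  # close to the true mean
```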

Structure-aware data cleaning is necessary

Heterogeneous types of structure

Contextual ML for automated data quality

  • In an unsupervised manner?

    • Can these inference queries run fast?

HoloClean: Probabilistic Data Repairs

("Take two")

Schema-level Attention

Why Attention?

Naturally occurring missing data

Use case: Data Categorization

Other use cases

  • Error detection in demographic data used for policy decisions

  • KPI tracking

  • Imputation of numerical data for industrial machinery monitoring

Picket: self-supervised transformers for data validation in ML pipelines

Loss-based Outlier Detection and Filtering

  1. Go back to the idea of learning a model that captures the clean data, and use this model to drive decisions (see the sketch after this list)

    1. PicketNet: transformer

    2. Outlier detection problem
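A minimal sketch of the loss-based filtering pattern, with a rank-k PCA reconstruction standing in for the learned model; Picket itself uses the self-supervised PicketNet transformer and its own loss, so treat this only as the filtering idea, not the actual system.

```python
import numpy as np

def loss_based_filter(X: np.ndarray, keep_frac: float = 0.9) -> np.ndarray:
    """Flag samples with the highest reconstruction loss under a model
    of the clean data. Stand-in model: rank-k PCA reconstruction."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Fit a low-rank model on all data; the clean structure dominates
    # as long as outliers are a minority.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = max(1, X.shape[1] // 2)
    recon = (Xc @ Vt[:k].T) @ Vt[:k] + mu
    loss = ((X - recon) ** 2).sum(axis=1)  # per-sample loss
    # Keep the keep_frac fraction of samples with the lowest loss.
    threshold = np.quantile(loss, keep_frac)
    return loss <= threshold  # boolean keep-mask

X = np.random.default_rng(1).normal(size=(1000, 8))
X[:20] += 10.0  # inject outliers
mask = loss_based_filter(X)
print(mask[:20].sum(), "of 20 injected outliers kept")  # expect ~0
```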

PicketNet: two-stream transformer for tabular data

  • Benefits (sketched below):

    • Value stream: flexibility

    • Schema stream: regularization
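A minimal sketch of the two-stream intuition, assuming a simple additive combination of value and schema embeddings; PicketNet's actual two-stream attention differs, and all names and dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class TwoStreamEncoder(nn.Module):
    """Illustrative two-stream encoder for tabular tuples.
    Value stream: embeds the cell content (flexible per-cell signal).
    Schema stream: a learned embedding per attribute position
    (regularizes attention toward column-level structure)."""
    def __init__(self, n_attrs: int, value_dim: int, d_model: int = 64):
        super().__init__()
        self.value_proj = nn.Linear(value_dim, d_model)   # value stream
        self.schema_emb = nn.Embedding(n_attrs, d_model)  # schema stream
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: (batch, n_attrs, value_dim) pre-featurized cells
        b, n, _ = values.shape
        attr_ids = torch.arange(n).expand(b, n)
        x = self.value_proj(values) + self.schema_emb(attr_ids)
        return self.encoder(x)  # (batch, n_attrs, d_model)

enc = TwoStreamEncoder(n_attrs=6, value_dim=16)
out = enc(torch.randn(32, 6, 16))
print(out.shape)  # torch.Size([32, 6, 64])
```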

Experimental Highlights: Poisoning Attacks

  • The attacker aims to degrade the trained model

Contextual ML for automated data quality ops

Scalable no-code graph learning

  • Systematic variation of the data?

    • Types of noise: for real-world data, no assumptions are made about the noise; it can be random or systematic (e.g., errors from integrating data across different sources)

      • Random

      • Systematic: repeated instances of the noise; conditioned on the right context, it is not random

      • Adversarial: the attacker is aware of the downstream task and deliberately attacks that system

  • HoloClean

    • Attention handles this gracefully and picks up strong systematic biases

  • Picket:

    • Targets the worst possible case (adversarial noise)

    • Avoids overfitting to individual examples

  • Distinguish between out-of-distribution points and systematic change

    • Solution: two streams (schema, value)

      • Value stream: robust to this case; a kernel-like structure operates at this level

    • Profiling mechanism

  • Heterogeneous data types

    • Tabular data: higher-level constraints

      • Encode them as functional dependencies in the DB (see the HoloClean sketch near the end)

      • or pick them up through the attention mechanism

      • Back in the day: users had to specify them manually

    • Structure learning over the data

      • Recovers essentially the attention matrix, in a faster and cheaper way

    • Heterogeneous

      • The same mechanism can potentially hold for a graph

      • Running structure-learning-style profiling

        • Identify homogeneous areas as a pre-processing step

        • And preprocessing ...

        • Filter parts away and keep the rest

      • This is done heavily in HoloClean

  • Monitor data and see if something is going on in the data pipeline

    • Reliable data

    • How is that setting different? Do some of the goals change?

      • Reconstruction (see the sketch after this list)

        • Signal and context

        • If reconstruction fails with high confidence, the point should be flagged as an outlier

      • The model also gives information about the likelihood
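A minimal sketch of the monitoring idea, assuming a generic density model with a `log_prob` method; the toy diagonal Gaussian and the threshold choice here are placeholders, and a real system would use a learned contextual model and a calibrated threshold.

```python
import numpy as np

def monitor_batch(model, batch: np.ndarray, threshold: float) -> np.ndarray:
    """Flag records whose negative log-likelihood under a model of
    clean data exceeds a threshold. `model` is any object exposing a
    per-record `log_prob` method (placeholder interface)."""
    nll = -model.log_prob(batch)
    return nll > threshold  # True = raise an alert for this record

class GaussianModel:
    """Toy stand-in: diagonal Gaussian fit on clean historical data."""
    def __init__(self, X: np.ndarray):
        self.mu, self.sigma = X.mean(0), X.std(0) + 1e-6
    def log_prob(self, X: np.ndarray) -> np.ndarray:
        z = (X - self.mu) / self.sigma
        return -0.5 * (z ** 2 + np.log(2 * np.pi * self.sigma ** 2)).sum(1)

rng = np.random.default_rng(2)
clean = rng.normal(size=(5000, 4))
model = GaussianModel(clean)
# Calibrate the threshold on clean data (e.g., 99th percentile).
threshold = np.quantile(-model.log_prob(clean), 0.99)
drifted = clean[:100] + np.array([3.0, 0, 0, 0])  # simulated drift
print(monitor_batch(model, drifted, threshold).mean())  # alert rate
```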

Goal:

  • Applying the rules at scale (ETL, ...)

  • Start-ups and companies each target a specific problem:

    • Identify duplicates in records

    • Infer rules to prepare and standardize data

    • AI --> platform for error detection and fixing

  • Nobody is targeting

    • Automating this

    • Position: reasoning about noisy structured data

Challenges and what the issue looks like:

  • What is the model doing? Key aspects:

    • Attention is interpretable (we know the semantics of the attributes and can see where the model puts more or less weight)

    • Allow people to not immediately accept repairs; give them confidence over the model's predictions

      • Accept the ones that make sense

  • Also, allow users to introduce business/external features into the model

HoloClean

  • Accepts logic rules and converts them to features (see the sketch below)

  • Supports matching functions
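A minimal sketch of turning a logic rule into a feature, in the spirit of this featurization; the relation, the functional dependency zip -> city, and the scoring are invented for illustration and are not HoloClean's actual code.

```python
from collections import Counter

# Hypothetical relation with a functional dependency: zip -> city.
rows = [
    {"zip": "53703", "city": "Madison"},
    {"zip": "53703", "city": "Madison"},
    {"zip": "53703", "city": "Madiso"},  # dirty cell
    {"zip": "60601", "city": "Chicago"},
]

def fd_feature(rows, row_idx, candidate_city):
    """Feature for a candidate repair of rows[row_idx]['city']:
    fraction of other tuples with the same zip that agree with the
    candidate. High value = the repair satisfies the FD zip -> city."""
    zipcode = rows[row_idx]["zip"]
    peers = [r["city"] for i, r in enumerate(rows)
             if i != row_idx and r["zip"] == zipcode]
    if not peers:
        return 0.0
    return Counter(peers)[candidate_city] / len(peers)

for cand in ["Madison", "Madiso"]:
    print(cand, fd_feature(rows, 2, cand))
# Madison 1.0  -> favored by the FD feature
# Madiso  0.0
```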

Which one should I trust?

  • Ensemble (weighted vote; see the sketch below)

  • In real cases, people tend to believe their own rules...
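A minimal sketch of the weighted-vote idea: each source (a rule or a model) suggests a repair value, and trust weights decide which suggestion wins. The sources and weights here are made up; in practice the weights could be learned from each source's past accuracy.

```python
from collections import defaultdict

def weighted_vote(predictions: dict[str, str],
                  weights: dict[str, float]) -> str:
    """Combine repair suggestions from several sources (rules, models)
    by weighted majority vote."""
    scores: dict[str, float] = defaultdict(float)
    for source, value in predictions.items():
        scores[value] += weights.get(source, 1.0)
    return max(scores, key=scores.get)

suggestions = {"fd_rule": "Madison", "ml_model": "Madison",
               "user_rule": "Madiso"}
trust = {"fd_rule": 0.6, "ml_model": 0.8, "user_rule": 0.9}
print(weighted_vote(suggestions, trust))  # Madison (0.6 + 0.8 > 0.9)
```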
