Episode 18
Structure is all you need
Software 2.0 for data quality management
Theodoros Rekatsinas | UW-Madison
The notorious data quality problem
Data quality management task
Error detection tasks
Tuple (sample) validation
Cell-value validation
Data repairs
Missing data imputation
Data repairs (value replacement)
Push to ML model?
ML models are sensitive to low-quality data
Goal: Streamline data quality management
Marius: graphs, heterogeneous-structure data
Example: data validation for mean estimation
Discrepancies between the two estimates
Filling in some of the values and leaving the others missing
If we know this dependency in advance, we can make a better estimate
Structure-aware data cleaning is necessary
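The mean-estimation example above can be sketched in a few lines (all numbers and the missingness rule are assumed for illustration): when values go missing depending on the value itself, the naive mean over the observed data is biased, but knowing the dependency in advance lets us correct the estimate, here via inverse-probability weighting.

```python
import random

random.seed(0)

# Assumed setup: sensor readings where high values go missing more often
# (missingness depends on the value itself, i.e. not at random).
true_values = [random.gauss(20, 5) for _ in range(10_000)]
observed = [v for v in true_values if not (v > 22 and random.random() < 0.8)]

true_mean = sum(true_values) / len(true_values)
naive_mean = sum(observed) / len(observed)  # biased low: high readings dropped

# If we know the dependency in advance, each surviving high reading can
# stand in for the 80% that were dropped (inverse-probability weighting).
weights = [1 / 0.2 if v > 22 else 1.0 for v in observed]
ipw_mean = sum(w * v for w, v in zip(weights, observed)) / sum(weights)

print(f"true {true_mean:.2f}  naive {naive_mean:.2f}  corrected {ipw_mean:.2f}")
```

The naive estimate is pulled well below the true mean, while the structure-aware estimate recovers it.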
Heterogeneous types of structure
Contextual ML for automated data quality
Can this run in an unsupervised manner?
Can we run these inference queries fast?
HoloClean: Probabilistic Data Repairs
("Take two")
Schema-level Attention
Why Attention?
Naturally-occurring missing data
Use case: Data Categorization
Other use cases
Error detection in demographic data used for policy decisions
KPI tracking
Imputation of numerical data for industrial machinery monitoring
Picket: self-supervised transformers for data validation in ML pipelines
Loss-based Outlier Detection and Filtering
Go back to the idea of learning a model to capture the clean data, and use this model in decisions
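A minimal sketch of that idea (the Gaussian "model", data, and function name are stand-ins for illustration, not the actual PicketNet model): fit a simple model of the data, score each sample by its loss under the model, and filter out the highest-loss samples as suspected outliers.

```python
import statistics

def filter_by_loss(samples, keep_fraction=0.9):
    # Stand-in "clean data model": a Gaussian fit (mean and stdev).
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    # Per-sample loss: squared standardized distance, i.e. the Gaussian
    # negative log-likelihood up to a constant.
    losses = [((x - mu) / sigma) ** 2 for x in samples]
    # Keep the keep_fraction of samples with the lowest loss.
    cutoff = sorted(losses)[int(keep_fraction * len(losses)) - 1]
    return [x for x, loss in zip(samples, losses) if loss <= cutoff]

data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 55.0, 9.7, 10.4, -40.0]
clean = filter_by_loss(data, keep_fraction=0.8)
print(clean)  # the two extreme values are filtered out
```

The same loop applies with any learned model in place of the Gaussian fit: high loss under a model of the clean data is the outlier signal.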
PicketNet: transformer
Outlier detection problem
PicketNet: two-stream transformer for tabular data
Benefits:
Value stream: flexibility
Schema stream: regularization
Experimental Highlights: Poisoning Attacks
Poisoning attacks aim to degrade the downstream model
Contextual ML for automated data quality ops
Scalable no-code graph learning
Systematic variation of the data?
Types of noise: real-world data, no assumptions about the noise. Noise can be random, or a systematic error (e.g., from integrating data across different sources)
Random
Systematic: repeated instances of the noise; if we condition on context, it is not random
Adversarial noise: attackers are aware of the downstream task, and go and attack that system
HoloClean
Attention handles this gracefully; it picks up strong biases of this type
Picket:
Worst-possible case (Adversarial)
Not overfitting to examples
Distinguish between out-of-distribution and systematic change
Solution: two streams (schema, value)
Value stream: robust to this case; a kernel structure operates at this level
Profiling mechanism
Heterogeneous data types
Tabular data: higher-level constraints
Encode them as functional dependencies in DB
or pick them up through attention mechanism
Back in the day: user specification
Structure learning over the data
This is exactly the attention matrix, computed in a faster and cheaper way
Heterogeneous
The same mechanism can potentially hold for a graph
Running structural-learning-style profiling
Identify homogeneous areas as a pre-processing step
And preprocessing ...
Filtering away and keep
Doing this heavily in HoloClean
Monitor data and see if something is going on in the data pipeline
Reliable data
How is that setting different? Do some of the goals change?
Reconstruction
Signal and context
If reconstruction fails with high confidence, then it should be an outlier
Information about the likelihood
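A minimal sketch of reconstruction-based scoring (the per-column frequency table here is a stand-in for a learned reconstruction model, and the data is invented): score each cell by how unlikely its observed value is under the model; values the model cannot reconstruct with high likelihood get high outlier scores.

```python
from collections import Counter

def reconstruction_scores(rows, column):
    # Stand-in "model": the empirical value distribution of the column.
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    # Score = 1 - likelihood of the observed value; rare values
    # reconstruct poorly and receive high outlier scores.
    return [1 - counts[row[column]] / total for row in rows]

# Hypothetical data: one inconsistent spelling among consistent values.
rows = [{"country": "US"}] * 8 + [{"country": "U.S."}]
scores = reconstruction_scores(rows, "country")
print(scores)  # the rare "U.S." spelling scores far higher than "US"
```

With a context-aware model, the likelihood would additionally be conditioned on the other attributes of the row, which is the "signal and context" point above.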
Goal:
applying the rules at scale (ETL, ...)
Start-up and companies: specific problem
Identify duplicates in records
Infer rules to prepare and standardize
AI --> platform for error detection and fixing
Nobody is targeting automating this
Position: reasoning about noisy structured data
Challenges, and what the issues look like:
What is the model doing? Aspects
Attention: interpretable (we know the semantics of the attributes, and can see where the model puts more or less weight)
Allow people to not immediately accept predictions; have confidence over the predictions of the model
Accept the ones that make sense
Also, allow users to introduce business / external features
HoloClean
Accept logic rules, convert them to features
Support matching functions
Which one should I trust?
Ensemble (weighted vote)
In real cases, people believe their rules...
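The weighted-vote idea can be sketched as follows (all source names and weights are hypothetical): each rule and the model propose a repair value with a trust weight, and the value with the highest total weight wins.

```python
from collections import defaultdict

def weighted_vote(candidates, weights):
    """candidates: {source: proposed_value}; weights: {source: trust}."""
    totals = defaultdict(float)
    for source, value in candidates.items():
        totals[value] += weights.get(source, 1.0)
    # Return the proposed value with the highest total trust weight.
    return max(totals, key=totals.get)

# Hypothetical conflict: two rules and the model disagree on a repair.
repair = weighted_vote(
    {"rule_zip_city": "Madison", "model": "Madson", "rule_format": "Madison"},
    {"rule_zip_city": 0.9, "model": 0.6, "rule_format": 0.5},
)
print(repair)  # "Madison": total weight 1.4 beats the model's 0.6
```

In practice the weights themselves can be learned, though as noted above users often insist their own rules be trusted outright.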