Episode 18
Structure is all you need
The Notorious data quality problem

Data quality management task
Error detection tasks
Data repairs

ML models are sensitive to low-quality data

Goal: Streamline data quality management

Example: data validation for mean estimation




Structure-aware data cleaning is necessary

Heterogeneous types of structure

Contextual ML for automated data quality


HoloClean: Probabilistic Data Repairs


("Take two")

Schema-level Attention

Why Attention?

Naturally-occurring missing data

Use case: Data Categorization

Other use cases
Picket: self-supervised transformers for data validation in ML pipelines



Loss-based Outlier Detection and Filtering

PicketNet: two-stream transformer for tabular data

Experimental Highlights: Poisoning Attacks

Contextual ML for automated data quality ops

Scalable no-code graph learning

