Episode 18

Structure is all you need

  • Software 2.0 for data quality management

  • Theodoros Rekatsinas | UW-Madison

The notorious data quality problem

Data quality management tasks

Error detection tasks

  • Tuple (sample) validation

  • Cell-value validation

Data repairs

  • Missing data imputation

  • Data repairs (value replacement)

  • Push the repair decision to the ML model?

ML models are sensitive to low-quality data

Goal: Streamline data quality management

  • Marius: graphs, heterogeneous-structure data

Example: data validation for mean estimation

  • Discrepancies between two estimates

  • Filling in some of the missing values and leaving the others

  • If we know this dependency in advance, we can make a better estimate (see the sketch below)
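A minimal sketch (my own illustration, not from the talk) of why the dependency matters: when missingness depends on a hidden group, the naive mean over observed values is biased, and knowing the dependency lets us correct it with inverse-probability weights. All numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# True values: income depends on a binary group attribute.
group = rng.integers(0, 2, size=10_000)
income = np.where(group == 1, 80.0, 40.0) + rng.normal(0, 5, size=10_000)

# Missingness depends on the group: group 1 hides 80% of its values.
observed = rng.random(10_000) > np.where(group == 1, 0.8, 0.1)

naive = income[observed].mean()  # biased toward group 0

# If we know the dependency (observation rate per group), reweight
# each observed value by the inverse probability of being observed.
p_obs = np.where(group == 1, 0.2, 0.9)
w = 1.0 / p_obs[observed]
reweighted = np.average(income[observed], weights=w)

print(f"true mean        {income.mean():.2f}")
print(f"naive estimate   {naive:.2f}")       # off by several units
print(f"structure-aware  {reweighted:.2f}")  # close to the true mean
```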

Structure-aware data cleaning is necessary

Heterogeneous types of structure

Contextual ML for automated data quality

  • In an unsupervised manner?

    • Can these inference queries run fast?

HoloClean: Probabilistic Data Repairs

("Take two")

Schema-level Attention

Why Attention?

Naturally occurring missing data

Use case: Data Categorization

Other use cases

  • Error detection in demographic data used for policy decisions

  • KPI tracking

  • Imputation of numerical data for industrial machinery monitoring

Picket: self-supervised transformers for data validation in ML pipelines

Loss-based Outlier Detection and Filtering

  1. Go back to the idea of learning a model that captures the clean data, and use this model to drive decisions (see the sketch after this list)

    1. PicketNet: transformer

    2. Outlier detection problem
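A minimal sketch of the loss-based filtering pattern, with a rank-k PCA reconstruction standing in for the learned model; Picket itself uses the self-supervised PicketNet transformer and its own loss, so treat this only as the filtering idea, not the actual system.

```python
import numpy as np

def loss_based_filter(X: np.ndarray, keep_frac: float = 0.9) -> np.ndarray:
    """Flag samples with the highest reconstruction loss under a model
    of the clean data. Stand-in model: rank-k PCA reconstruction."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Fit a low-rank model on all data; the clean structure dominates
    # as long as outliers are a minority.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = max(1, X.shape[1] // 2)
    recon = (Xc @ Vt[:k].T) @ Vt[:k] + mu
    loss = ((X - recon) ** 2).sum(axis=1)  # per-sample loss
    # Keep the keep_frac fraction of samples with the lowest loss.
    threshold = np.quantile(loss, keep_frac)
    return loss <= threshold  # boolean keep-mask

X = np.random.default_rng(1).normal(size=(1000, 8))
X[:20] += 10.0  # inject outliers
mask = loss_based_filter(X)
print(mask[:20].sum(), "of 20 injected outliers kept")  # expect ~0
```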

PicketNet: two-stream transformer for tabular data

  • Benefits (sketched below):

    • Value stream: flexibility

    • Schema stream: regularization
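A minimal sketch of the two-stream intuition, assuming a simple additive combination of value and schema embeddings; PicketNet's actual two-stream attention differs, and all names and dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class TwoStreamEncoder(nn.Module):
    """Illustrative two-stream encoder for tabular tuples.
    Value stream: embeds the cell content (flexible per-cell signal).
    Schema stream: a learned embedding per attribute position
    (regularizes attention toward column-level structure)."""
    def __init__(self, n_attrs: int, value_dim: int, d_model: int = 64):
        super().__init__()
        self.value_proj = nn.Linear(value_dim, d_model)   # value stream
        self.schema_emb = nn.Embedding(n_attrs, d_model)  # schema stream
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: (batch, n_attrs, value_dim) pre-featurized cells
        b, n, _ = values.shape
        attr_ids = torch.arange(n).expand(b, n)
        x = self.value_proj(values) + self.schema_emb(attr_ids)
        return self.encoder(x)  # (batch, n_attrs, d_model)

enc = TwoStreamEncoder(n_attrs=6, value_dim=16)
out = enc(torch.randn(32, 6, 16))
print(out.shape)  # torch.Size([32, 6, 64])
```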

Experimental Highlights: Poisoning Attacks

  • The attacker aims to degrade the trained model

Contextual ML for automated data quality ops

Scalable no-code graph learning

  • Systematic variation of the data?

    • Types of noise: for real-world data, no assumptions are made about the noise; it can be random or systematic (e.g., errors from integrating data across different sources)

      • Random

      • Systematic: repeated instances of the noise; conditioned on the right context, it is not random

      • Adversarial: the attacker is aware of the downstream task and deliberately attacks that system

  • HoloClean

    • Attention handles this gracefully and picks up strong systematic biases

  • Picket:

    • Targets the worst possible case (adversarial noise)

    • Avoids overfitting to individual examples

  • Distinguish between out-of-distribution points and systematic change

    • Solution: two streams (schema, value)

      • Value stream: robust to this case; a kernel-like structure operates at this level

    • Profiling mechanism

  • Heterogeneous data types

    • Tabular data: higher-level constraints

      • Encode them as functional dependencies in the DB (see the HoloClean sketch near the end)

      • or pick them up through the attention mechanism

      • Back in the day: users had to specify them manually

    • Structure learning over the data

      • Recovers essentially the attention matrix, in a faster and cheaper way

    • Heterogeneous

      • The same mechanism can potentially hold for a graph

      • Running structure-learning-style profiling

        • Identify homogeneous areas as a pre-processing step

        • And preprocessing ...

        • Filter parts away and keep the rest

      • This is done heavily in HoloClean

  • Monitor data and see if something is going on in the data pipeline

    • Reliable data

    • How is that setting different? Do some of the goals change?

      • Reconstruction (see the sketch after this list)

        • Signal and context

        • If reconstruction fails with high confidence, the point should be flagged as an outlier

      • The model also gives information about the likelihood
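A minimal sketch of the monitoring idea, assuming a generic density model with a `log_prob` method; the toy diagonal Gaussian and the threshold choice here are placeholders, and a real system would use a learned contextual model and a calibrated threshold.

```python
import numpy as np

def monitor_batch(model, batch: np.ndarray, threshold: float) -> np.ndarray:
    """Flag records whose negative log-likelihood under a model of
    clean data exceeds a threshold. `model` is any object exposing a
    per-record `log_prob` method (placeholder interface)."""
    nll = -model.log_prob(batch)
    return nll > threshold  # True = raise an alert for this record

class GaussianModel:
    """Toy stand-in: diagonal Gaussian fit on clean historical data."""
    def __init__(self, X: np.ndarray):
        self.mu, self.sigma = X.mean(0), X.std(0) + 1e-6
    def log_prob(self, X: np.ndarray) -> np.ndarray:
        z = (X - self.mu) / self.sigma
        return -0.5 * (z ** 2 + np.log(2 * np.pi * self.sigma ** 2)).sum(1)

rng = np.random.default_rng(2)
clean = rng.normal(size=(5000, 4))
model = GaussianModel(clean)
# Calibrate the threshold on clean data (e.g., 99th percentile).
threshold = np.quantile(-model.log_prob(clean), 0.99)
drifted = clean[:100] + np.array([3.0, 0, 0, 0])  # simulated drift
print(monitor_batch(model, drifted, threshold).mean())  # alert rate
```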

Goal:

  • Applying the rules at scale (ETL, ...)

  • Start-ups and companies each target a specific problem:

    • Identify duplicates in records

    • Infer rules to prepare and standardize data

    • AI --> platform for error detection and fixing

  • Nobody is targeting

    • Automating this

    • Position: reasoning about noisy structured data

Challenges and what the issue looks like:

  • What is the model doing? Key aspects:

    • Attention is interpretable (we know the semantics of the attributes and can see where the model puts more or less weight)

    • Allow people to not immediately accept repairs; give them confidence over the model's predictions

      • Accept the ones that make sense

  • Also, allow users to introduce business/external features into the model

HoloClean

  • Accepts logic rules and converts them to features (see the sketch below)

  • Supports matching functions
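A minimal sketch of turning a logic rule into a feature, in the spirit of this featurization; the relation, the functional dependency zip -> city, and the scoring are invented for illustration and are not HoloClean's actual code.

```python
from collections import Counter

# Hypothetical relation with a functional dependency: zip -> city.
rows = [
    {"zip": "53703", "city": "Madison"},
    {"zip": "53703", "city": "Madison"},
    {"zip": "53703", "city": "Madiso"},  # dirty cell
    {"zip": "60601", "city": "Chicago"},
]

def fd_feature(rows, row_idx, candidate_city):
    """Feature for a candidate repair of rows[row_idx]['city']:
    fraction of other tuples with the same zip that agree with the
    candidate. High value = the repair satisfies the FD zip -> city."""
    zipcode = rows[row_idx]["zip"]
    peers = [r["city"] for i, r in enumerate(rows)
             if i != row_idx and r["zip"] == zipcode]
    if not peers:
        return 0.0
    return Counter(peers)[candidate_city] / len(peers)

for cand in ["Madison", "Madiso"]:
    print(cand, fd_feature(rows, 2, cand))
# Madison 1.0  -> favored by the FD feature
# Madiso  0.0
```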

Which one should I trust?

  • Ensemble (weighted vote; see the sketch below)

  • In real cases, people tend to believe their own rules...
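A minimal sketch of the weighted-vote idea: each source (a rule or a model) suggests a repair value, and trust weights decide which suggestion wins. The sources and weights here are made up; in practice the weights could be learned from each source's past accuracy.

```python
from collections import defaultdict

def weighted_vote(predictions: dict[str, str],
                  weights: dict[str, float]) -> str:
    """Combine repair suggestions from several sources (rules, models)
    by weighted majority vote."""
    scores: dict[str, float] = defaultdict(float)
    for source, value in predictions.items():
        scores[value] += weights.get(source, 1.0)
    return max(scores, key=scores.get)

suggestions = {"fd_rule": "Madison", "ml_model": "Madison",
               "user_rule": "Madiso"}
trust = {"fd_rule": 0.6, "ml_model": 0.8, "user_rule": 0.9}
print(weighted_vote(suggestions, trust))  # Madison (0.6 + 0.8 > 0.9)
```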
