Episode 18
Software 2.0 for data quality management
Theodoros Rekatsinas | UW-Madison
Tuple (sample) validation
Cell-value validation
Missing data imputation
Data repairs (value replacement)
Push to ML model?
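To make the four tasks above concrete, here is a toy pandas sketch; the table, columns, and values are made up for illustration, not from the talk:

```python
import pandas as pd

# Hypothetical toy table.
df = pd.DataFrame({
    "city":  ["Madison", "Madison", "Chicago", "Chicago"],
    "state": ["WI",      "WI",      "IL",      None],     # missing cell
    "zip":   ["53703",   "53703",   "60601",   "99999"],  # suspicious value
})

# Tuple (sample) validation: score whole rows, e.g. flag rows with any missing cell.
row_has_issue = df.isna().any(axis=1)

# Cell-value validation: score individual cells, e.g. zip codes that disagree
# with the majority zip seen for the same city.
zip_by_city = df.groupby("city")["zip"].agg(lambda s: s.mode().iloc[0])
bad_zip = df["zip"] != df["city"].map(zip_by_city)

# Missing-data imputation: fill the missing state from the city it co-occurs with.
state_by_city = df.dropna().groupby("city")["state"].agg(lambda s: s.mode().iloc[0])
df["state"] = df["state"].fillna(df["city"].map(state_by_city))

# Data repair (value replacement): overwrite the suspicious zip with the
# majority value observed for that city.
df.loc[bad_zip, "zip"] = df.loc[bad_zip, "city"].map(zip_by_city)
```

The open question in the notes is whether these signals go to people or are pushed directly into the downstream ML model.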
Marius: graphs, heterogeneous-structure data
Discrepancies between two estimates
Filling in some of the values and leaving the others
If we know this dependency in advance, we can make a better estimate (sketch below)
In an unsupervised manner?
Can these inference queries run fast?
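On the point about knowing a dependency in advance: a marginal estimate and a dependency-conditioned estimate of the same missing cell can disagree, and the known dependency gives the better one. A minimal sketch with made-up columns (not from the talk):

```python
import pandas as pd

# Hypothetical example: `income` depends on `occupation`.
df = pd.DataFrame({
    "occupation": ["nurse", "nurse", "engineer", "engineer", "engineer"],
    "income":     [55_000,  57_000,  95_000,     None,       98_000],
})

# Estimate 1: marginal (ignores any dependency) -> global mean.
marginal = df["income"].mean()

# Estimate 2: conditioned on the known dependency occupation -> income.
conditional = df.groupby("occupation")["income"].transform("mean")

# The two estimates disagree for the missing cell; knowing the dependency
# in advance lets us prefer the conditional one.
df["income_marginal"] = df["income"].fillna(marginal)
df["income_conditional"] = df["income"].fillna(conditional)
print(df[["income_marginal", "income_conditional"]])
```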
Error detection in demographic data used for policy decisions
KPI tracking
Imputation of numerical data for industrial machinery monitoring
Go back to the idea of learning a model that captures the clean data, and using this model in decisions
PicketNet: transformer
Outlier detection problem
Benefits:
Value stream: flexibility
Schema stream: regularization
Does the noise aim to destroy systematic variation of the data?
Types of noise: real-world data, no assumptions about the noise. Noise can be random, or a systematic error (e.g., from integrating data across different sources)
Random
Systematic: repeated instances of the noise; if we condition on something, it is not random (sketch below)
Adversarial noise: the attacker is aware of the downstream task and attacks that system
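A small sketch of the difference between random and systematic noise when injected into a toy table (the columns and corruption rules are assumptions for illustration); adversarial noise would additionally be crafted against a specific downstream model, which is not shown here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "source": rng.choice(["A", "B"], size=1000),
    "price":  rng.normal(100, 10, size=1000).round(2),
})

# Random noise: each cell is corrupted independently with small probability.
random_mask = rng.random(len(df)) < 0.02
df.loc[random_mask, "price"] = rng.normal(100, 100, size=random_mask.sum()).round(2)

# Systematic noise: repeated, conditioned on something (e.g. every record
# integrated from source "B" reports price in cents instead of dollars).
systematic_mask = df["source"] == "B"
df.loc[systematic_mask, "price"] = df.loc[systematic_mask, "price"] * 100

# Conditioning on `source` explains the systematic errors but not the random ones.
print(df.groupby("source")["price"].median())
```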
HoloClean
Attention: handles this gracefully, picks up this type of strong bias
Picket:
Worst-possible case (Adversarial)
Not overfitting to examples
Distinguish between out-of-distribution data and systematic change
Solution: two streams (schema, value)
Value stream: robust to this case; a kernel structure operates at this level
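A minimal PyTorch sketch of the two-stream idea as recorded in these notes: each cell gets a value embedding (the flexible, per-cell content) plus a learned schema embedding shared per column (the regularizer), and self-attention mixes them across the row. This is an illustration only, not the actual PicketNet architecture; all sizes and names are made up:

```python
import torch
import torch.nn as nn

class TwoStreamRowEncoder(nn.Module):
    """Sketch of a two-stream (schema + value) encoder for tabular rows."""

    def __init__(self, num_columns: int, vocab_size: int, d_model: int = 64):
        super().__init__()
        self.value_embed = nn.Embedding(vocab_size, d_model)    # value stream
        self.schema_embed = nn.Embedding(num_columns, d_model)  # schema stream
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.reconstruct = nn.Linear(d_model, vocab_size)

    def forward(self, cell_tokens: torch.Tensor) -> torch.Tensor:
        # cell_tokens: (batch, num_columns) integer ids of cell values
        batch, num_cols = cell_tokens.shape
        col_ids = torch.arange(num_cols, device=cell_tokens.device).expand(batch, -1)
        x = self.value_embed(cell_tokens) + self.schema_embed(col_ids)
        h = self.encoder(x)             # attention across the row's cells
        return self.reconstruct(h)      # per-cell logits over values


# Toy usage: self-supervised training would corrupt/mask cells and ask the model
# to reconstruct them; the reconstruction loss doubles as an outlier signal.
model = TwoStreamRowEncoder(num_columns=5, vocab_size=1000)
rows = torch.randint(0, 1000, (8, 5))   # 8 rows, 5 columns
logits = model(rows)
loss = nn.functional.cross_entropy(logits.view(-1, 1000), rows.view(-1))
```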
Profiling mechanism
Heterogeneous data types
Tabular data: higher-level constraints
Encode them as functional dependencies in the DB,
or pick them up through the attention mechanism
Back in the day: user specification
Structure learning over the data
The attention matrix captures exactly this, in a faster and cheaper way
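One cheap stand-in for that column-to-column structure (an assumption for illustration, not the attention mechanism itself) is an empirical dependency score such as normalized conditional entropy, which equals 1.0 exactly when a functional dependency holds:

```python
import numpy as np
import pandas as pd

def dependency_strength(df: pd.DataFrame, a: str, b: str) -> float:
    """How well column `a` determines column `b` (1.0 = exact functional dependency).

    Computed as 1 - H(b | a) / H(b) using empirical entropies.
    """
    def entropy(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    h_b = entropy(df[b].value_counts(normalize=True).to_numpy())
    if h_b == 0.0:
        return 1.0
    # Conditional entropy H(b | a) from the empirical joint distribution.
    joint = df.groupby([a, b]).size() / len(df)
    p_a = df[a].value_counts(normalize=True)
    h_b_given_a = -sum(p * np.log2(p / p_a[key_a]) for (key_a, _), p in joint.items())
    return 1.0 - h_b_given_a / h_b

# Toy example: zip -> city is a functional dependency, the reverse is not.
df = pd.DataFrame({
    "zip":  ["53703", "53703", "53715", "60601", "60601"],
    "city": ["Madison", "Madison", "Madison", "Chicago", "Chicago"],
})
print(dependency_strength(df, "zip", "city"))   # 1.0
print(dependency_strength(df, "city", "zip"))   # lower: Madison maps to two zips
```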
Heterogeneous
Same mechanism can potentially hold for a graph
Running structural learning type of profiling
Identify homogeneous areas as a pre-processing step
And preprocessing ...
Filtering some data away and keeping the rest
Doing this heavily in HoloClean
Monitor data and see if something is going on in the data pipeline
Reliable data
How is that setting different? Do some of the goals change?
Reconstruction
Signal and context
If the model is highly confident in a different value, then it should be an outlier (sketch below)
Information about the likelihood
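A sketch of that reconstruction-based scoring, reusing the hypothetical TwoStreamRowEncoder and rows from the earlier sketch: the score for a cell is the negative log-likelihood the model assigns to the observed value given the rest of the row, so a cell the model confidently predicts differently gets a high score:

```python
import torch
import torch.nn.functional as F

def cell_outlier_scores(model, rows: torch.Tensor) -> torch.Tensor:
    """Per-cell outlier scores from a reconstruction model (sketch only)."""
    model.eval()
    with torch.no_grad():
        logits = model(rows)                               # (batch, cols, vocab)
        log_probs = F.log_softmax(logits, dim=-1)
        observed = rows.unsqueeze(-1)                      # (batch, cols, 1)
        nll = -log_probs.gather(-1, observed).squeeze(-1)  # (batch, cols)
    return nll

# Cells with the largest scores are the most suspicious.
scores = cell_outlier_scores(model, rows)
suspicious = scores > scores.mean() + 2 * scores.std()
```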
Goal:
Applying the rules at scale (ETL, ...)
Start-ups and companies: target a specific problem
Identify duplicates in records
Infer rules to prepare and standardize
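A toy sketch of one such specific problem, duplicate detection after rule-based standardization (the records, rules, and the 0.85 threshold are all made up; real systems block/index before comparing pairs):

```python
import difflib
import pandas as pd

df = pd.DataFrame({
    "name": ["Acme Corp.", "ACME Corporation", "Globex LLC"],
    "city": ["madison, wi", "Madison WI", "Chicago"],
})

# Apply simple standardization rules before matching.
def normalize(s: str) -> str:
    s = s.lower().replace(".", "").replace(",", "")
    for suffix in (" corporation", " corp", " llc"):
        s = s.removesuffix(suffix)
    return s.strip()

df["name_norm"] = df["name"].map(normalize)

# Pairwise fuzzy matching on the normalized key.
pairs = []
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        sim = difflib.SequenceMatcher(None, df["name_norm"][i], df["name_norm"][j]).ratio()
        if sim > 0.85:
            pairs.append((i, j, round(sim, 2)))

print(pairs)   # [(0, 1, 1.0)] -> rows 0 and 1 are duplicates
```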
AI --> platform for error detection and fixing
Nobody is targeting automating this
Position: reasoning about noisy structured data
Challenges and what that issue looks like:
What is the model doing? Aspects
Attention: interpretable (we know the semantics of the attributes and whether more or less weight is put on them)
Allow people to not immediately accept the repairs; give them confidence over the model's predictions
Accept the ones that make sense
Also, allow users to introduce business/external features
HoloClean
Accept logic rules, convert them to features
Support matching functions
Which one should I trust?
Ensemble (weighted vote); sketch below
In real cases, people believe their rules...
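A minimal sketch of the weighted-vote ensemble mentioned above: each rule, matcher, or model votes for a repair value, and votes are weighted by how much each signal is trusted. The weights here are hand-set as an assumption; in practice they would be learned or calibrated, and users who believe their rules can simply give them large weights:

```python
from collections import defaultdict

def weighted_vote(candidates, weights):
    """Combine repair suggestions from several signals by weighted vote.

    candidates: {signal_name: suggested_value}
    weights:    {signal_name: trust weight}
    Returns the value with the largest total weight.
    """
    totals = defaultdict(float)
    for signal, value in candidates.items():
        totals[value] += weights.get(signal, 1.0)
    return max(totals, key=totals.get)

# Hypothetical example: three signals disagree about a cell's correct value.
candidates = {
    "fd_rule_zip_city": "Madison",   # logic rule encoded as a feature
    "matching_function": "Madson",   # external matcher
    "learned_model": "Madison",      # statistical model's prediction
}
weights = {"fd_rule_zip_city": 2.0, "matching_function": 1.0, "learned_model": 1.5}
print(weighted_vote(candidates, weights))   # "Madison"
```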