# Automatically Discovering Machine Learning Optimizations

https://www.youtube.com/watch?v=XyXzzjbuXCs&t=2s

ML computation is increasing exponentially

Amount of computation in the largest ML training runs doubles every 3.4 months, i.e., roughly 2^(12/3.4) ≈ 11.5× per year

Computation Graph of an ML Model

Nodes: tensor algebra operators (e.g., convolution, matrix multiplication)

Edges: tensors (high-dimensional arrays)
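A minimal sketch of this graph representation in Python (the class and field names are illustrative, not any framework's actual API):

```python
# Nodes are operators, edges are tensors; all names here are made up
# for illustration only.
from dataclasses import dataclass, field

@dataclass
class Tensor:                # edge: a high-dimensional array
    shape: tuple

@dataclass
class Operator:              # node: a tensor algebra operator
    op_type: str             # e.g. "conv2d" or "matmul"
    inputs: list             # incoming Tensor edges
    outputs: list = field(default_factory=list)

# Example graph for y = relu(x @ W)
x = Tensor(shape=(1024, 64))
W = Tensor(shape=(64, 64))
mm = Operator("matmul", inputs=[x, W])
mm.outputs = [Tensor(shape=(1024, 64))]
act = Operator("relu", inputs=mm.outputs)
act.outputs = [Tensor(shape=(1024, 64))]
```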

ML systems are a key ingredient in ML

Distributed, heterogeneous hardware architectures

Challenges of building ML systems

Increasingly diverse models

Increasingly heterogeneous hardware

Summary

Massively parallel: tensor algebra is parallelizable along many dimensions (see the matmul sketch after this list)

New ML operators: continuously introduced into ML systems

Heterogeneous hardware: different kinds of processors and complex memory hierarchies
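A small sketch of the "many parallel dimensions" point, using matrix multiplication in NumPy:

```python
# C[i, j] = sum_k A[i, k] * B[k, j]: every output element is
# independent, so work can be split over rows, columns, or 2-D tiles;
# even the reduction dimension k can be split into partial sums.
import numpy as np

A = np.random.rand(128, 64)
B = np.random.rand(64, 256)
C_ref = A @ B

# Split the reduction dimension k across two workers, then combine.
C_split = A[:, :32] @ B[:32, :] + A[:, 32:] @ B[32:, :]
assert np.allclose(C_ref, C_split)
```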

Limitations

Hard to design manually

Suboptimal performance

Correctness relies on manual proof

Catalyst: automated learning systems lab

Mission: automate the design and optimization of ML systems

Combines statistical and mathematical properties of ML algorithms with domain knowledge of modern hardware platforms

Graph optimizations: TASO, PET --> automated generation of graph optimizations

Parallelization optimizations: FlexFlow, Dorylus (automated discovery of fast parallelization strategies)

Data layout & placement: Lux, Roc

Advantages of automated ML systems

Better runtime performance: discovers novel optimizations that are hard to design manually; 3-10x speedup over manual optimizations

Less engineering effort: the code that discovers optimizations is generally much smaller than a manual implementation of those optimizations

Stronger correctness guarantees: using formal verification techniques

TASO, PET

Current practice: hand-written, rule-based graph optimizations

Challenge of graph optimizations for ML: it is infeasible to manually design graph optimizations for every combination of operators, graph architectures, and hardware backends
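As a concrete example, one well-known substitution of the kind TASO discovers fuses two matrix multiplications that share an input into a single larger multiplication; a minimal NumPy sketch (shapes are illustrative):

```python
# Graph substitution: {Y1 = X @ W1, Y2 = X @ W2}  ==>
# {Y = X @ concat(W1, W2), then split Y}: fewer kernel launches and
# better hardware utilization.
import numpy as np

X = np.random.rand(32, 64)
W1 = np.random.rand(64, 128)
W2 = np.random.rand(64, 128)

# Source pattern: two separate operators.
Y1, Y2 = X @ W1, X @ W2

# Target pattern: one fused operator, then split the output.
Y = X @ np.concatenate([W1, W2], axis=1)
assert np.allclose(Y[:, :128], Y1) and np.allclose(Y[:, 128:], Y2)
```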

TASO: Tensor Algebra SuperOptimizer

Automated generation and verification of graph substitutions

Workflow

Operator specifications

Graph subset generator: enumerates a huge number of candidate graphs from the operator specifications

Explicitly comparing all pairs of graphs does not scale --> compute an output fingerprint for each graph (a hash of its output on random inputs); pairs of graphs with identical fingerprints become candidate substitutions
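A minimal sketch of fingerprinting (graphs are stand-in functions here, and the hashing scheme is simplified):

```python
# Fingerprint = hash of a graph's output on fixed random inputs.
# Equal fingerprints only *suggest* equivalence; candidates still go
# through the formal verifier.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((16, 16))          # shared random test input

def fingerprint(graph_fn, x=X):
    out = np.round(graph_fn(x), decimals=6)   # tolerate float noise
    return hash(out.tobytes())

g1 = lambda x: (x.T).T            # transpose applied twice
g2 = lambda x: x                  # identity
g3 = lambda x: x + 1.0

assert fingerprint(g1) == fingerprint(g2)     # candidate substitution
assert fingerprint(g1) != fingerprint(g3)
```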

Graph substitution verifier: formally checks each candidate against first-order properties of the operators
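A hedged sketch of this style of verification using Z3's Python bindings (the operator axioms and the substitution being checked are illustrative):

```python
# Operators become uninterpreted functions over an abstract Tensor
# sort; operator properties become universally quantified axioms; a
# substitution is valid iff its negation is unsatisfiable.
from z3 import Solver, DeclareSort, Function, Const, ForAll, Not

T = DeclareSort("Tensor")
transpose = Function("transpose", T, T)
matmul = Function("matmul", T, T, T)
A, B = Const("A", T), Const("B", T)

s = Solver()
# Axiom: (A @ B)^T == B^T @ A^T
s.add(ForAll([A, B],
             transpose(matmul(A, B)) == matmul(transpose(B), transpose(A))))
# Axiom: transposing twice is the identity.
s.add(ForAll([A], transpose(transpose(A)) == A))

# Candidate substitution: A @ B  ==>  (B^T @ A^T)^T
claim = matmul(A, B) == transpose(matmul(transpose(B), transpose(A)))
s.add(Not(claim))
print(s.check())    # unsat -> no counterexample -> substitution verified
```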

Graph optimizer: applies the verified substitutions to the input graph

Cost model based on individual operators' costs: measure the cost of each operator on the target hardware

Cost-based backtracking search over sequences of substitutions
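A simplified sketch of such a search (the substitution and cost interfaces are hypothetical; the real optimizer uses operator costs measured on hardware):

```python
# Best-first backtracking search: keep a priority queue of graphs,
# apply every substitution everywhere it matches, and prune graphs
# whose cost exceeds alpha times the best cost found so far.
import heapq

def optimize(graph, substitutions, cost, alpha=1.05):
    best, best_cost = graph, cost(graph)
    frontier = [(best_cost, 0, graph)]   # (cost, tiebreak, graph)
    seen, counter = {graph}, 0           # graphs must be hashable here
    while frontier:
        c, _, g = heapq.heappop(frontier)
        if c < best_cost:
            best, best_cost = g, c
        for subst in substitutions:
            for g2 in subst(g):          # all ways to apply subst to g
                c2 = cost(g2)
                if g2 not in seen and c2 <= alpha * best_cost:
                    seen.add(g2)
                    counter += 1
                    heapq.heappush(frontier, (c2, counter, g2))
    return best
```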

PET: the first tensor program optimizer with partially equivalent transformations and automated corrections

Larger optimization space

Better performance

Correctness: restored by automatically generated corrections
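A toy illustration of a partially equivalent transformation (invented for this note, not taken from PET): replace a zero-padded 1-D convolution with an FFT-based circular convolution, which agrees everywhere except the two boundary outputs, then correct those:

```python
import numpy as np

x = np.random.rand(16)
k = np.random.rand(3)

# Reference program: zero-padded "same" 1-D convolution.
ref = np.convolve(x, k, mode="same")

# Transformed program: circular convolution via FFT, aligned to the
# "same" output by rolling one position. It is only *partially*
# equivalent: the two ends wrongly wrap around.
circ = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k, n=x.size)))
out = np.roll(circ, -1)

# Correction step removes the two wrap-around terms.
out[0] -= x[-1] * k[2]
out[-1] -= x[0] * k[0]
assert np.allclose(out, ref)
```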
