Episode 17

Guest: Savin Goyal

Title: Taming the Long Tail of Industrial ML Applications

Abstract: Data Science usage at Netflix goes much beyond our eponymous recommendation systems. It touches almost all aspects of our business - from optimizing content delivery and informing buying decisions to fighting fraud. Our unique culture affords our data scientists extraordinary freedom of choice in ML tools and libraries, all of which results in an ever-expanding set of interesting problem statements and a diverse set of ML approaches to tackle them. Our data scientists, at the same time, are expected to build, deploy, and operate complex ML workloads autonomously without the need to be significantly experienced with systems or data engineering. In this talk, I will discuss some of the challenges involved in improving the development and deployment experience for ML workloads. I will focus on Metaflow, our ML framework, which offers useful abstractions for managing the model’s lifecycle end-to-end, and how a focus on human-centric design positively affects our data scientists' velocity.

Data Scientist

Jupyter notebook, R Studio, or some other IDE
OpenCV, TensorFlow, PyTorch
Deal with data: flow, Google Cloud Storage, S3, local disk
- Keep track of states, data
Create a workflow out of the work
- Offload compute into the cloud, hyperparameter optimization, sigopt, slurm
- Familiar with all these concepts
- Inputs to the workflow change over time
  - Get the latest data and retrain
Orchestrate the workflow (pipeline)
- Fit into this new paradigm
- Containers (docker etc.)
  - Take unit of compute and package that into container
Flask, Shiny, Custom application, Commercial and open source solutions to monitor the service
Result
- If business stakeholders are not satisfied
- Otherwise
  - More progress: more ideas, more features; how do you handle staged deployment of services?

Structure your code as a DAG
- A natural way to express ML pipelines
- Many technical benefits follow when you do this
- Start two models A, B; join and decide which is better; end
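
The branch-and-join shape in the last bullet can be sketched with the standard library alone. This is not Metaflow's API; the step functions, the toy data, and the fixed slopes standing in for "models" are all assumptions made for illustration. Each step receives the results of its parent steps, and `graphlib.TopologicalSorter` (Python 3.9+) decides the execution order.

```python
from graphlib import TopologicalSorter

def _mse(slope, data):
    # Mean squared error of the line y = slope * x on the toy data.
    pairs = list(zip(data["xs"], data["ys"]))
    return sum((slope * x - y) ** 2 for x, y in pairs) / len(pairs)

def start(parents):
    # Toy data following y = 2x (made up for this sketch).
    return {"xs": [1.0, 2.0, 3.0], "ys": [2.0, 4.0, 6.0]}

def model_a(parents):
    # Hypothetical "model A": a fixed slope of 1.5.
    return {"name": "A", "error": _mse(1.5, parents["start"])}

def model_b(parents):
    # Hypothetical "model B": a fixed slope of 2.0.
    return {"name": "B", "error": _mse(2.0, parents["start"])}

def join(parents):
    # Join step: keep whichever branch has the lower error.
    return min(parents["model_a"], parents["model_b"], key=lambda r: r["error"])

def end(parents):
    return parents["join"]["name"]

STEPS = {"start": start, "model_a": model_a, "model_b": model_b,
         "join": join, "end": end}

# Each step maps to the set of steps it depends on.
DAG = {"start": set(), "model_a": {"start"}, "model_b": {"start"},
       "join": {"model_a", "model_b"}, "end": {"join"}}

def run(dag, steps):
    results = {}
    for name in TopologicalSorter(dag).static_order():
        parents = {p: results[p] for p in dag[name]}
        results[name] = steps[name](parents)
    return results
```

Because the two model steps only depend on `start`, a scheduler is free to run them in parallel; the sketch runs them one after the other for simplicity.
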
Start with an ML script

With a small bit of refactoring with Metaflow

Develop locally like any other script
In each step, anything you store is saved by Metaflow

The stored data can be retrieved and inspected later

It is also versioned and namespaced
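
One way to picture "stored, versioned, and namespaced" is a toy store that writes one file per namespace, run, and step. This is a sketch of the idea only, not how Metaflow stores data; the class `ArtifactStore`, its directory layout, and the example names and scores are hypothetical.

```python
import json
import tempfile
from pathlib import Path

class ArtifactStore:
    """Toy store: one JSON file per (namespace, run_id, step)."""

    def __init__(self, root):
        self.root = Path(root)

    def _path(self, namespace, run_id, step):
        return self.root / namespace / str(run_id) / f"{step}.json"

    def save(self, namespace, run_id, step, artifacts):
        path = self._path(namespace, run_id, step)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(artifacts))

    def load(self, namespace, run_id, step):
        return json.loads(self._path(namespace, run_id, step).read_text())

    def runs(self, namespace):
        # Run IDs recorded under one namespace, as strings.
        ns_dir = self.root / namespace
        if not ns_dir.exists():
            return []
        return sorted(p.name for p in ns_dir.iterdir() if p.is_dir())

def demo():
    with tempfile.TemporaryDirectory() as root:
        store = ArtifactStore(root)
        # Two runs by one user are kept side by side (versioning).
        store.save("alice", 1, "train", {"alpha": 0.1, "score": 0.71})
        store.save("alice", 2, "train", {"alpha": 0.5, "score": 0.78})
        # Another user's run 1 does not collide with alice's (namespacing).
        store.save("bob", 1, "train", {"alpha": 0.1, "score": 0.60})
        return (store.load("alice", 1, "train")["score"],
                store.load("alice", 2, "train")["score"],
                store.runs("alice"),
                store.runs("bob"))
```
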

Can restart from any step
- ? Checkpoint overhead?
- Metaflow snapshots the previous step
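
A minimal sketch of restarting from a step, assuming a linear pipeline whose state is snapshotted after every step. The step functions and the in-memory `snapshots` dict are hypothetical and are not Metaflow's mechanism. The deep copy made after each step is where a checkpoint overhead like the one questioned above would show up.

```python
import copy

def load(state):
    state["data"] = [3, 1, 2]

def clean(state):
    state["data"] = sorted(state["data"])

def fit(state):
    # Stand-in for model fitting: the mean of the data.
    state["model"] = sum(state["data"]) / len(state["data"])

STEPS = [("load", load), ("clean", clean), ("fit", fit)]

def run(steps, snapshots, resume_from=None):
    names = [name for name, _ in steps]
    first = names.index(resume_from) if resume_from else 0
    # A fresh run starts empty; a resumed run starts from the snapshot
    # taken after the step that precedes `resume_from`.
    state = copy.deepcopy(snapshots[names[first - 1]]) if first > 0 else {}
    executed = []
    for name, fn in steps[first:]:
        fn(state)
        snapshots[name] = copy.deepcopy(state)  # checkpoint after each step
        executed.append(name)
    return state, executed
```

Calling `run(STEPS, snapshots)` once and then `run(STEPS, snapshots, resume_from="fit")` re-executes only the last step, reusing the state that `clean` left behind.
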
Straightforward grid search
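
As a rough illustration of a grid search, the sketch below tries every combination of parameter values and keeps the best one. It uses only the standard library, not Metaflow; `train_and_score` is a hypothetical stand-in for training a model, and the grid values are made up.

```python
from itertools import product

def train_and_score(learning_rate, depth):
    # Hypothetical objective: highest at learning_rate=0.1, depth=4.
    return -((learning_rate - 0.1) ** 2) - 0.01 * (depth - 4) ** 2

def grid_search(grid, objective):
    names = list(grid)
    results = []
    # One evaluation per point in the Cartesian product of the value lists.
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((objective(**params), params))
    return max(results, key=lambda r: r[0])

GRID = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best_score, best_params = grid_search(GRID, train_and_score)
```

Each grid point is independent of the others, which is what makes this pattern easy to fan out across parallel branches of a DAG.
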

Offload compute to the cloud

Annotate the step and resources
1. How would a user know which compute resources are appropriate?
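
The idea of annotating a step with its resource needs can be sketched with a plain decorator. The decorator below is written for this sketch and is not Metaflow's implementation; its arguments, the `LOCAL_LIMITS` capacity, and the placement rule are all assumptions for illustration.

```python
def resources(cpu=1, memory_mb=4096):
    # Attach the requested resources to the step function as metadata.
    def decorate(fn):
        fn.resources = {"cpu": cpu, "memory_mb": memory_mb}
        return fn
    return decorate

# Hypothetical capacity of the developer's own machine.
LOCAL_LIMITS = {"cpu": 4, "memory_mb": 16000}

def placement(fn, local_limits=LOCAL_LIMITS):
    # Run locally when every request fits; otherwise offload the step.
    request = getattr(fn, "resources", {"cpu": 1, "memory_mb": 4096})
    fits = all(request[key] <= local_limits[key] for key in request)
    return "local" if fits else "cloud"

@resources(cpu=2, memory_mb=8000)
def featurize():
    pass

@resources(cpu=16, memory_mb=64000)
def train():
    pass
```

Here `placement(featurize)` yields `"local"` and `placement(train)` yields `"cloud"`. The step body stays unchanged; only the annotation differs, which is the property the notes highlight.
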

Specify compute dependencies easily
- Packages and environments, what are the things they want?
Ready for production?
- Workflow executes asynchronously
- Publish your workflow to workflow orchestrator
  - Gap: own programming paradigm, rewrite...
  - Good thing: workflow orchestrator executes a DAG
    Allows users to seamlessly deploy their prototype work

Ready for integration?
- Web services etc.
