Episode 17
Guest: Savin Goyal
Title: Taming the Long Tail of Industrial ML Applications
Abstract: Data Science usage at Netflix goes well beyond our eponymous recommendation systems. It touches almost all aspects of our business, from optimizing content delivery and informing buying decisions to fighting fraud. Our unique culture affords our data scientists extraordinary freedom of choice in ML tools and libraries, all of which results in an ever-expanding set of interesting problem statements and a diverse set of ML approaches to tackle them. At the same time, our data scientists are expected to build, deploy, and operate complex ML workloads autonomously without needing significant experience in systems or data engineering. In this talk, I will discuss some of the challenges involved in improving the development and deployment experience for ML workloads. I will focus on Metaflow, our ML framework, which offers useful abstractions for managing the model’s lifecycle end-to-end, and how a focus on human-centric design positively affects our data scientists' velocity.
Data Scientist
Jupyter notebooks, RStudio, or some other IDE
OpenCV, TensorFlow, PyTorch
Deal with data flows: Google Cloud Storage, S3, local disk
Keep track of state and data
Create a workflow out of the work
Offload compute into the cloud; hyperparameter optimization (SigOpt, Slurm)
Expected to be familiar with all of these concepts
The input of the workflow changes over time
Get the latest data and retrain
Orchestrate the workflow (pipeline)
The code has to fit into this new paradigm
Containers (Docker, etc.)
Take a unit of compute and package it into a container
Serve via Flask, Shiny, or a custom application; commercial and open-source solutions to monitor the service
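A minimal sketch of what serving a trained model behind Flask might look like (the model file path and payload shape are assumptions for illustration, not the talk's actual service):

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumed artifact produced earlier by training; the path is hypothetical.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [1.0, 2.0, 3.0]}.
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```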
Result
If business stakeholders aren't satisfied, iterate
Otherwise
More progress: more ideas, more features to add; how do you manage deploying and staging the services?
Structure your code as a DAG
A natural way to express ML pipelines
Many technical benefits follow when you do this
Start; train two models A and B in parallel; join and decide which is better; end
Start with an ML script
With a small bit of refactoring it becomes a Metaflow flow, as sketched below
Develop locally like any other script
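A minimal sketch of such a flow in Metaflow (the training bodies and scores are placeholders, not the talk's actual code):

```python
from metaflow import FlowSpec, step


class TwoModelFlow(FlowSpec):
    """Train two candidate models in parallel branches and keep the better one."""

    @step
    def start(self):
        # Branch into two training steps that run independently.
        self.next(self.train_a, self.train_b)

    @step
    def train_a(self):
        self.score = 0.82  # placeholder: train model A and record its score
        self.next(self.join)

    @step
    def train_b(self):
        self.score = 0.87  # placeholder: train model B and record its score
        self.next(self.join)

    @step
    def join(self, inputs):
        # Compare the two branches and keep the better score.
        self.best = max(inputs, key=lambda inp: inp.score).score
        self.next(self.end)

    @step
    def end(self):
        print("best score:", self.best)


if __name__ == "__main__":
    TwoModelFlow()
```

Running `python two_model_flow.py run` executes the DAG locally, like any other script.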
In each step, anything you assign to the flow object is stored as an artifact
These artifacts can be retrieved and inspected later
And they are versioned and namespaced
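For example, the results of the hypothetical flow above could be inspected later, e.g. from a notebook, through Metaflow's Client API:

```python
from metaflow import Flow

# Grab the latest successful run of the (hypothetical) flow above
# and read back its artifacts.
run = Flow("TwoModelFlow").latest_successful_run
print(run.id, run.finished_at)
print("best score:", run.data.best)  # 'best' was stored via self.best in the join step
```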
Can restart from any step
Question: what is the checkpointing overhead?
The previous step's output is snapshotted by Metaflow
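Because each step's output is snapshotted, a run can be resumed from a given step instead of re-running from scratch; with the hypothetical flow above, `python two_model_flow.py resume join` would restart execution at the join step, reusing the stored results of the earlier steps.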
Straightforward grid search
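A sketch of how a grid search might be expressed with Metaflow's foreach construct, fanning one step out over a list of parameter settings (the grid and scoring here are made up):

```python
from metaflow import FlowSpec, step


class GridSearchFlow(FlowSpec):

    @step
    def start(self):
        # A hypothetical grid of hyperparameter settings to fan out over.
        self.grid = [{"lr": lr, "depth": d} for lr in (0.01, 0.1) for d in (3, 5)]
        self.next(self.train, foreach="grid")

    @step
    def train(self):
        self.params = self.input  # the grid point assigned to this branch
        self.score = 0.0          # placeholder: train and evaluate here
        self.next(self.join)

    @step
    def join(self, inputs):
        # Keep the parameter setting with the highest score.
        self.best = max(inputs, key=lambda inp: inp.score).params
        self.next(self.end)

    @step
    def end(self):
        print("best params:", self.best)


if __name__ == "__main__":
    GridSearchFlow()
```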
Offload compute to the cloud
Annotate the step with the resources it needs
How would a user know what compute resources are appropriate?
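A sketch of such an annotation, with arbitrary numbers for illustration:

```python
from metaflow import FlowSpec, step, resources


class HeavyFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @resources(memory=32000, cpu=8, gpu=1)  # honored when the step runs on a cloud backend
    @step
    def train(self):
        # Placeholder: heavy training would happen here.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    HeavyFlow()
```

In the open-source release, running with `--with batch` (e.g. `python heavy_flow.py run --with batch`) executes the steps on AWS Batch with these requests, while a plain local run still works.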
Specify software dependencies easily
Packages and environments: declare the things you want
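A sketch using Metaflow's conda decorators to declare dependencies per flow and per step (the library and versions are arbitrary):

```python
from metaflow import FlowSpec, conda, conda_base, step


@conda_base(python="3.8.10")  # interpreter pinned for the whole flow
class DepsFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.featurize)

    @conda(libraries={"pandas": "1.3.5"})  # step-level dependency, resolved by Metaflow
    @step
    def featurize(self):
        import pandas as pd  # available inside the isolated environment
        self.n = pd.DataFrame({"x": [1, 2, 3]}).shape[0]
        self.next(self.end)

    @step
    def end(self):
        print(self.n)


if __name__ == "__main__":
    DepsFlow()
```

Run with `python deps_flow.py run --environment=conda` so Metaflow resolves and isolates the declared environments.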
Ready for production?
Workflow executes asynchronously
Publish your workflow to a workflow orchestrator
Gap: orchestrators have their own programming paradigm, often forcing a rewrite...
Good thing: a workflow orchestrator also executes a DAG
Allow users to seamlessly deploy their prototyping work
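Because a Metaflow flow is already a DAG, it can be translated to the orchestrator's format without a rewrite. For example, the open-source release can publish a flow to AWS Step Functions with `python two_model_flow.py step-functions create`; inside Netflix the same deployment targets the in-house orchestrator (Meson).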
Ready for integration?
Web services etc.
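One integration pattern (a sketch, not the talk's specific setup) is for a web service to pull the latest results of a flow through the Client API instead of redeploying artifacts by hand:

```python
from metaflow import Flow


def latest_best_score():
    """Fetch the newest result of the hypothetical TwoModelFlow.

    A web service could call this periodically, or per request, to pick up
    whatever the latest successful training run produced.
    """
    run = Flow("TwoModelFlow").latest_successful_run
    return run.data.best
```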