Episode 17

Guest: Savin Goyal

Title: Taming the Long Tail of Industrial ML Applications

Abstract: Data science usage at Netflix goes well beyond our eponymous recommendation systems. It touches almost all aspects of our business - from optimizing content delivery and informing buying decisions to fighting fraud. Our unique culture affords our data scientists extraordinary freedom of choice in ML tools and libraries, which results in an ever-expanding set of interesting problem statements and a diverse set of ML approaches to tackle them. At the same time, our data scientists are expected to build, deploy, and operate complex ML workloads autonomously, without needing significant experience in systems or data engineering. In this talk, I will discuss some of the challenges involved in improving the development and deployment experience for ML workloads. I will focus on Metaflow, our ML framework, which offers useful abstractions for managing the model lifecycle end-to-end, and on how a focus on human-centric design positively affects our data scientists' velocity.

Data Scientist

  • Jupyter Notebook, RStudio, or some other IDE

  • OpenCV, TensorFlow, PyTorch

  • Deal with data flow: Google Cloud Storage, S3, local disk

    • Keep track of state and data

  • Create a workflow out of the work

    • Offload compute into the cloud; hyperparameter optimization (e.g., SigOpt, Slurm)

    • Users must become familiar with all these concepts

    • Inputs of the workflow change over time

      • Get the latest data and retrain

  • Orchestrate the workflow (pipeline)

    • Existing work has to fit into this new paradigm

    • Containers (docker etc.)

      • Take a unit of compute and package it into a container

  • Serve results via Flask, Shiny, or a custom application; commercial and open-source solutions to monitor the service

  • Result

    • If business stakeholders aren't satisfied, iterate

    • Otherwise

      • More progress: more ideas, more features; how do you manage staging and deployment of the services?

  • Structure your code as a DAG

    • A natural way to express ML pipelines

    • Many technical benefits follow when you do this

    • Example DAG: start; train two models A and B in parallel; join and decide which is better; end (see the sketch below)
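
A minimal sketch of that branch-and-join pattern, assuming Metaflow's open-source Python API (the flow name, models, and scores are hypothetical placeholders):

```python
from metaflow import FlowSpec, step


class BranchFlow(FlowSpec):
    """start -> train models A and B in parallel -> join -> end."""

    @step
    def start(self):
        # Load or prepare training data here.
        self.next(self.train_a, self.train_b)  # fan out to both branches

    @step
    def train_a(self):
        # Placeholder: train model A and record an evaluation score.
        self.score = 0.82
        self.next(self.join)

    @step
    def train_b(self):
        # Placeholder: train model B and record an evaluation score.
        self.score = 0.88
        self.next(self.join)

    @step
    def join(self, inputs):
        # Compare the branches and keep the winner's score.
        self.best_score = max(inp.score for inp in inputs)
        self.next(self.end)

    @step
    def end(self):
        print("best score:", self.best_score)


if __name__ == "__main__":
    BranchFlow()
```

`python branch_flow.py run` executes the DAG locally; every `self.*` assignment becomes an artifact of that run.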

  • Start with an ML script

  • Develop locally like any other script

  • In each step, anything you store on the flow instance is persisted

  • This data can be retrieved and inspected later

  • And it is versioned and namespaced
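
A sketch of inspecting those artifacts through Metaflow's Client API (the flow and artifact names refer to the hypothetical `BranchFlow` above):

```python
from metaflow import Flow

# Fetch the most recent successful execution of the flow.
run = Flow("BranchFlow").latest_successful_run
print(run.id)               # runs are versioned: each execution has its own id
print(run.data.best_score)  # any artifact a step assigned to self.<name>
```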

  • Can restart from any step

    • Question: what is the checkpointing overhead?

    • Previous steps are snapshotted by Metaflow
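
Because every step's state is snapshotted, Metaflow's `resume` command (e.g. `python branch_flow.py resume train_b`, with the flow name hypothetical as above) restarts execution from a failed or chosen step, reusing the previous run's artifacts instead of recomputing them.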

  • Straightforward grid search
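
Grid search maps naturally onto Metaflow's foreach fan-out; a sketch with a hypothetical learning-rate grid:

```python
from metaflow import FlowSpec, step


class GridFlow(FlowSpec):

    @step
    def start(self):
        # Hypothetical hyperparameter grid; each value becomes its own task.
        self.grid = [0.001, 0.01, 0.1]
        self.next(self.train, foreach="grid")  # fan out over the grid

    @step
    def train(self):
        self.lr = self.input        # this task's grid value
        self.score = 1.0 - self.lr  # placeholder: train and evaluate here
        self.next(self.join)

    @step
    def join(self, inputs):
        # Pick the hyperparameter with the best score across the fan-out.
        self.best_lr = max(inputs, key=lambda inp: inp.score).lr
        self.next(self.end)

    @step
    def end(self):
        print("best learning rate:", self.best_lr)


if __name__ == "__main__":
    GridFlow()
```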

  • Offload compute to the cloud

  • Annotate the step with the resources it needs (see the sketch below)

    • Question: how would users know what compute resources are appropriate?
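
In Metaflow the annotation is a decorator on the step; a sketch using the `@resources` decorator (the numbers are illustrative, not recommendations):

```python
from metaflow import FlowSpec, step, resources


class TrainFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # Declare what the step needs. Locally the declaration is ignored;
    # run with e.g. `python train_flow.py run --with batch` and each step
    # executes in a cloud container with at least these resources.
    @resources(memory=16000, cpu=4, gpu=1)
    @step
    def train(self):
        # Heavy training happens here.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    TrainFlow()
```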

  • Specify compute dependencies easily

    • Packages and environments: which libraries and versions does the step need?
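
A sketch of per-step dependencies with Metaflow's `@conda` decorator (the version pins are only examples):

```python
from metaflow import FlowSpec, step, conda


class DepsFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # Metaflow resolves an isolated environment with exactly these versions,
    # both locally and in the cloud.
    @conda(python="3.9", libraries={"scikit-learn": "1.1.0"})
    @step
    def train(self):
        import sklearn  # imported inside the step, from the managed env
        self.sklearn_version = sklearn.__version__
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    DepsFlow()
```

Executed with `python deps_flow.py --environment=conda run`.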

  • Ready for production?

    • Workflow executes asynchronously

    • Publish your workflow to a workflow orchestrator

      • Gap: each orchestrator has its own programming paradigm, forcing a rewrite...

      • Good thing: a workflow orchestrator executes a DAG

        • Allows users to seamlessly deploy their prototyping work
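
Because a flow is already a DAG, Metaflow's open-source CLI can compile the same file to a production scheduler without a rewrite, e.g. `python branch_flow.py step-functions create` for AWS Step Functions or `python branch_flow.py argo-workflows create` for Argo Workflows (flow name hypothetical as above).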

  • Ready for integration?

    • Web services etc.
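
One integration pattern the Flask bullet above suggests: a web service that reads the latest training results through Metaflow's Client API (the flow and artifact names refer to the hypothetical sketches above):

```python
from flask import Flask, jsonify
from metaflow import Flow

app = Flask(__name__)


@app.route("/best-model")
def best_model():
    # Serve the freshest artifacts from the most recent successful run.
    run = Flow("BranchFlow").latest_successful_run
    return jsonify({"run_id": run.id, "best_score": run.data.best_score})


if __name__ == "__main__":
    app.run(port=5000)
```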
