Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices

https://www.csl.cornell.edu/~delimitrou/papers/2021.asplos.sage.pdf

Presentation

  • Motivation

    • Microservices are becoming increasingly popular in cloud systems

      • Fine-grained, loosely-coupled, and single-concerned

      • Communicate with RPCs or RESTful APIs

      • Pros

        • Agile development

        • Better modularity and elasticity

        • Testing and debugging in isolation

      • Cons

        • Different HW & SW constraints

        • Dependencies between services complicate cluster management

    • SLOs govern interactive microservices

  • Challenges in microservice performance debugging

    • Microservices are more sensitive to performance unpredictability

    • Complex network dependences

    • Complex tracing and monitoring

  • Automated, data-driven debugging techniques are critical

    • Existing work scales poorly and is not practical in production

  • Design principle

    • No need to label data

      • Requires a causal model

    • Robust to sampling frequency

      • Suitable for production instrumentation; does not rely on temporal patterns for inference

    • No need for kernel-level tracing

    • Practical adjustment to service updates

    • Focuses on resource provisioning-related performance issues

  • Sage: root cause analysis system using unsupervised learning

    • Causal Bayesian Networks (CBN) for causal relationships among microservices

      • Edges indicate causal relationships; a tool for structural causal inference that is interpretable and explainable
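
As a rough illustration of the CBN idea above, the sketch below builds a toy causal graph from an RPC dependency map. It assumes three kinds of nodes per service (observed metrics `X`, latent factors `Z`, and latency `Y`) and that a callee's latency is a parent of its caller's latency. The service names, the `rpc_calls` map, and the `build_cbn` helper are hypothetical and are not taken from the paper's implementation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical three-tier call chain: frontend -> logic -> db.
# Maps each caller to the services it invokes over RPC.
rpc_calls = {"frontend": ["logic"], "logic": ["db"], "db": []}


def build_cbn(rpc_calls):
    """Return {node: set_of_parents} for a toy causal Bayesian network."""
    parents = {}
    for svc in rpc_calls:
        # Assumption: metrics (X) and latent factors (Z) have no parents,
        # and both influence the latency (Y) of the same service.
        parents[f"X_{svc}"] = set()
        parents[f"Z_{svc}"] = set()
        parents[f"Y_{svc}"] = {f"X_{svc}", f"Z_{svc}"}
    for svc, callees in rpc_calls.items():
        for callee in callees:
            # Assumption: a callee's latency causally affects its caller's latency.
            parents[f"Y_{svc}"].add(f"Y_{callee}")
    return parents


cbn = build_cbn(rpc_calls)
# A valid CBN must be acyclic; static_order() raises CycleError otherwise.
order = list(TopologicalSorter(cbn).static_order())
```

Because every edge has an explicit causal meaning, the parent sets can be read directly: here the end-to-end latency `Y_frontend` depends on the frontend's own metrics and latent factors plus `Y_logic`.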

  • Use counterfactuals to detect root causes (services and resources) of SLO violations

    • Counterfactual queries

      • Queries of hypothetical end-to-end latency if some metrics had been "normal"

      • Root causes: metrics that hypothetically solve the end-to-end performance issue

    • Generating counterfactuals with generative models

      • CVAE

        • Prior network, encoder, decoder (MLP)

      • GVAE: factorize CVAE according to the CBN model

        • Connection pruning to force the network to follow the causal model

        • Better interpretability

        • Faster retraining upon microservice updates

        • Root cause detection with GVAE

          • Learn the latent variable (Z) from the encoder

          • Calculate the "normal" values of metrics and latent variables

          • Two-level intervention for root cause detection

        • Incremental & partial retraining
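
The counterfactual queries and the two-level intervention above can be sketched as follows. This is a minimal illustration, not Sage's implementation: a hand-written formula (`predict_latency`) stands in for the trained generative model, and the SLO, metric names, and all values are hypothetical. The first level resets every metric of one service to its "normal" value; the second level resets one resource at a time within a service that passed the first level.

```python
import copy

SLO_MS = 100.0  # hypothetical end-to-end latency target

# Hypothetical "normal" metric values and values observed during an SLO violation.
normal = {"frontend": {"cpu": 0.40, "mem": 0.5}, "db": {"cpu": 0.30, "mem": 0.4}}
observed = {"frontend": {"cpu": 0.45, "mem": 0.5}, "db": {"cpu": 0.95, "mem": 0.4}}


def predict_latency(metrics):
    """Toy stand-in for the generative model: latency grows with utilization."""
    return sum(20.0 + 80.0 * m["cpu"] ** 2 + 10.0 * m["mem"] for m in metrics.values())


def intervene(metrics, service, resource=None):
    """Return a copy with one service (or one of its resources) set to normal."""
    cf = copy.deepcopy(metrics)
    if resource is None:
        cf[service] = dict(normal[service])  # service-level intervention
    else:
        cf[service][resource] = normal[service][resource]  # resource-level
    return cf


def find_root_causes(metrics):
    """Two-level search: culprit services first, then culprit resources."""
    if predict_latency(metrics) <= SLO_MS:
        return []  # no SLO violation, nothing to explain
    causes = []
    for svc in metrics:
        # Level 1: would latency have met the SLO if this service were normal?
        if predict_latency(intervene(metrics, svc)) > SLO_MS:
            continue
        # Level 2: which single resource of that service restores the SLO?
        fixing = [r for r in metrics[svc]
                  if predict_latency(intervene(metrics, svc, r)) <= SLO_MS]
        causes.append((svc, fixing))
    return causes


root_causes = find_root_causes(observed)
```

With these made-up numbers, only resetting the `db` service brings the hypothetical latency back under the SLO, and within `db` only the `cpu` metric does so on its own. Searching services before resources keeps the number of counterfactual queries small, since resource-level queries are issued only for services that already passed the first level.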
