# Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices

### Presentation

* Motivation
  * Microservices are increasingly popular in cloud systems
    * Fine-grained, loosely coupled, and single-concern
    * Communicate via RPCs or RESTful APIs
    * Pros
      * Agile development
      * Better modularity and elasticity
      * Testing and debugging in isolation
    * Cons
      * Different HW & SW constraints
      * Dependencies --> complicate cluster management
  * SLOs govern interactive microservices
* Challenges in microservice performance debugging
  * Microservices are more sensitive to performance unpredictability
  * Complex network dependencies
  * Complex tracing and monitoring
* Critical need: automated, data-driven techniques
  * Existing work: poor scalability; not practical
* Design principles
  * No need to label data
    * Requires a causal model
  * Robust to sampling frequency
    * Suitable for production instrumentation; does not rely on temporal patterns for inference
  * No need for kernel-level tracing
  * Practical adjustment to service updates
  * Focuses on resource provisioning-related performance issues
* Sage: root cause analysis system using unsupervised learning
  * Causal Bayesian Networks (CBN) model causal relationships among microservices
    * Edges indicate causal relationships; a CBN is a tool for structural causal inference and is interpretable and explainable

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FM23Sh1SjEqGLyKodid7f%2Fimage.png?alt=media\&token=67052533-0b2a-42e2-b312-d7e39461d2c7)
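To make the CBN idea concrete, here is a minimal sketch (not Sage's actual construction) that derives a causal graph from an RPC call graph. It rests on two assumptions that are not stated in these notes: a service's own resource metrics are causes of its latency, and a callee's latency is a cause of its caller's latency. The service names, metric names, and the `build_cbn` helper are all hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def build_cbn(call_graph, metrics):
    """Return {node: set(parent nodes)} for a toy causal graph.

    call_graph: {caller: [callees]}  (hypothetical RPC dependency graph)
    metrics:    {service: [metric names]}  (hypothetical per-service metrics)
    """
    services = set(call_graph) | {c for cs in call_graph.values() for c in cs}
    parents = {}
    for svc in services:
        lat = f"lat:{svc}"
        parents.setdefault(lat, set())
        # Assumption: a service's own resource metrics influence its latency.
        for m in metrics.get(svc, []):
            node = f"{m}:{svc}"
            parents.setdefault(node, set())
            parents[lat].add(node)
        # Assumption: a caller waits on its callees, so callee latency
        # is a cause of caller latency.
        for callee in call_graph.get(svc, []):
            parents[lat].add(f"lat:{callee}")
    return parents


call_graph = {"frontend": ["cart", "search"], "cart": ["db"]}
metrics = {"cart": ["cpu"], "db": ["cpu", "io"]}

cbn = build_cbn(call_graph, metrics)
# Causes come before effects in a topological order of the DAG.
order = list(TopologicalSorter(cbn).static_order())
```

Because the graph is a DAG, a topological order gives a valid sequence for evaluating each node from its parents, which is what makes the structure usable for interventions later on.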

* Use counterfactuals to detect root causes (services and resources) of SLO violations
  * Counterfactual queries
    * Query the hypothetical end-to-end latency had some metrics been "normal"
    * Root causes: metrics whose restoration to "normal" would hypothetically resolve the end-to-end performance issue
  * Generating counterfactuals with generative models
    * CVAE (conditional variational autoencoder)
      * Prior network, encoder, decoder (MLP)
    * GVAE (graphical variational autoencoder): factorize the CVAE according to the CBN model
      * Connection pruning forces the network to follow the causal model
      * Better interpretability
      * Faster retraining upon microservice updates
      * Root cause detection with GVAE
        * Learn the latent variable (Z) from the encoder
        * Calculate the "normal" values of metrics and latent variables
        * Two-level intervention for root cause detection
      * Incremental & partial retraining
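The counterfactual search described above can be sketched as follows. This is an illustration under stated assumptions, not Sage's implementation: `predict_latency` is a hypothetical hand-written stand-in for the trained generative model, the metric values and the 80 ms SLO are made up, and the "two-level intervention" is read here as service-level first, then resource-level.

```python
import copy


def predict_latency(metrics):
    """Hypothetical stand-in for the trained model: end-to-end latency (ms)
    grows once a service's CPU or memory utilization exceeds 0.5."""
    return sum(
        20 + 100 * max(0.0, m["cpu"] - 0.5) + 50 * max(0.0, m["mem"] - 0.5)
        for m in metrics.values()
    )


def find_root_causes(observed, normal, predict, slo_ms):
    """Return (service, resource) pairs whose restoration meets the SLO."""
    causes = []
    for svc in observed:
        # Level 1: counterfactual with ALL metrics of one service set to normal.
        cf = copy.deepcopy(observed)
        cf[svc] = dict(normal[svc])
        if predict(cf) > slo_ms:
            continue  # restoring this service would not fix the violation
        # Level 2: restore one resource of the suspect service at a time.
        for res in observed[svc]:
            cf = copy.deepcopy(observed)
            cf[svc][res] = normal[svc][res]
            if predict(cf) <= slo_ms:
                causes.append((svc, res))
    return causes


observed = {
    "frontend": {"cpu": 0.4, "mem": 0.3},
    "cart": {"cpu": 0.9, "mem": 0.4},  # CPU-saturated in this toy example
    "db": {"cpu": 0.45, "mem": 0.6},
}
normal = {svc: {"cpu": 0.4, "mem": 0.3} for svc in observed}

root_causes = find_root_causes(observed, normal, predict_latency, slo_ms=80)
```

In this toy setup only restoring the CPU of `cart` brings the predicted latency back under the SLO, so `root_causes` contains the single pair `("cart", "cpu")`. In the real system the prediction would come from the GVAE decoder conditioned on the learned latent variables rather than from a fixed formula.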

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FIUhM8COuwhNSaxUGDFuj%2Fimage.png?alt=media\&token=18e94c8e-825d-4653-baf5-7e7e8fd7c996)
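One way to picture the connection pruning mentioned for GVAE is a binary mask over a layer's weights, so that an output can only depend on inputs that are its causal parents. The sketch below is a plain-Python illustration of that masking idea; the `causal_mask` and `masked_linear` helpers, the node names, and the weights are hypothetical, and the actual network layout in Sage may differ.

```python
def causal_mask(inputs, outputs, parents):
    """mask[i][j] is 1 only if inputs[j] is a causal parent of outputs[i]."""
    return [
        [1 if src in parents.get(dst, ()) else 0 for src in inputs]
        for dst in outputs
    ]


def masked_linear(weights, mask, x):
    """Linear layer whose pruned (masked-out) connections contribute nothing."""
    return [
        sum(w * m * v for w, m, v in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


inputs = ["cpu:cart", "cpu:db"]
outputs = ["lat:cart", "lat:db"]
# Hypothetical causal parents: each latency depends only on its own CPU metric.
parents = {"lat:cart": {"cpu:cart"}, "lat:db": {"cpu:db"}}

mask = causal_mask(inputs, outputs, parents)
weights = [[2.0, 5.0], [3.0, 4.0]]  # dense weights before pruning
y = masked_linear(weights, mask, [1.0, 10.0])
```

Here the off-diagonal weights are masked out, so `lat:cart` ignores `cpu:db` and vice versa. A side effect of this factorization is that a change to one service touches only the weights connected to its nodes, which is consistent with the faster, partial retraining noted above.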

