# Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices

### Presentation

* Motivation
  * Microservices are increasingly popular in cloud systems
    * Fine-grained, loosely coupled, and single-concern
    * Communicate via RPCs or RESTful APIs
    * Pros
      * Agile development
      * Better modularity and elasticity
      * Testing and debugging in isolation
    * Cons
      * Different HW & SW constraints
      * Dependencies between services complicate cluster management
  * SLOs govern interactive microservices
* Challenges in microservice performance debugging
  * Microservices are more sensitive to performance unpredictability
  * Complex network dependencies
  * Complex tracing and monitoring
* Critical: an automatic, data-driven technique
  * Existing work: poor scalability; not practical
* Design principles
  * No need to label data
    * Requires a causal model
  * Robust to sampling frequency
    * Suitable for instrumentation in production; does not rely on temporal patterns for inference
  * No need for kernel-level tracing
  * Practical adjustment to service updates
  * Focuses on resource provisioning-related performance issues
* Sage: root cause analysis system using unsupervised learning
  * Causal Bayesian Networks (CBN) model causal relationships among microservices
    * Edges indicate causal relationships; a CBN is a tool for structural causal inference that is interpretable and explainable

![](/files/4IhT3RdFl4fAyLwNzY7n)
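To make the CBN idea concrete, here is a minimal sketch of a causal graph stored as a parent map. The node names (`db.cpu`, `frontend.latency`, and so on) and the two-tier topology are hypothetical and not taken from the paper; the sketch only shows that, once edges encode cause-to-effect relationships, the candidate root causes of an end-to-end latency node are its ancestors in the DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical CBN for a two-tier application. Each key is a node and each
# value lists its causal parents (edges point from cause to effect).
CBN_PARENTS = {
    "db.cpu": [],
    "cache.mem": [],
    "db.latency": ["db.cpu"],
    "cache.latency": ["cache.mem"],
    "frontend.latency": ["cache.latency", "db.latency"],
}


def ancestors(node, parents):
    """Return every node with a directed causal path into `node`."""
    seen = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


# TopologicalSorter takes node -> predecessors, so causes come before effects.
# It raises graphlib.CycleError if the graph is not a DAG.
CAUSAL_ORDER = list(TopologicalSorter(CBN_PARENTS).static_order())
CANDIDATES = ancestors("frontend.latency", CBN_PARENTS)
```

Storing parents rather than children keeps both the ancestor walk and the topological sort direct, since `graphlib.TopologicalSorter` also expects a node-to-predecessors mapping.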

* Use counterfactuals to detect root causes (services and resources) of SLO violations
  * Counterfactual queries
    * Ask what the end-to-end latency would have been if some metrics had been "normal"
    * Root causes: metrics whose restoration to normal would hypothetically resolve the end-to-end performance issue
  * Generating counterfactuals with generative models
    * CVAE (conditional variational autoencoder)
      * Prior network, encoder, decoder (MLP)
    * GVAE (graphical variational autoencoder): factorize CVAE according to the CBN model
      * Connection pruning forces the network to follow the causal model
      * Better interpretability
      * Faster retraining upon microservice updates
      * Root cause detection with GVAE
        * Learn the latent variable (Z) from the encoder
        * Calculate the "normal" values of metrics and latent variables
        * Two-level intervention for root cause detection
      * Incremental & partial retraining
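The counterfactual query and the two-level intervention can be sketched as follows. This is not Sage's implementation: the hand-written `latency_ms` function stands in for the trained GVAE decoder, the latent-variable step is skipped entirely, and all service names, metric values, and the SLO are made up. Reading "two-level" as "first services, then resources within the culprit service" is an assumption based on the notes above.

```python
import copy

# Hypothetical utilization metrics during an SLO violation, and their
# "normal" values. All names and numbers are illustrative.
OBSERVED = {
    "cache": {"cpu": 0.30, "mem": 0.40},
    "db": {"cpu": 0.95, "mem": 0.50},
}
NORMAL = {
    "cache": {"cpu": 0.30, "mem": 0.40},
    "db": {"cpu": 0.40, "mem": 0.50},
}
SLO_MS = 50.0


def latency_ms(metrics):
    """Toy stand-in for the learned generative model (assumption)."""
    total = 5.0
    for svc in metrics.values():
        total += 20.0 * svc["cpu"] + 10.0 * svc["mem"]
        # Latency grows sharply once CPU utilization passes 80%.
        total += 200.0 * max(0.0, svc["cpu"] - 0.8)
    return total


def intervene(observed, normal, service, resource=None):
    """Copy the observation with one service (or one resource) set to normal."""
    counterfactual = copy.deepcopy(observed)
    if resource is None:
        counterfactual[service] = dict(normal[service])
    else:
        counterfactual[service][resource] = normal[service][resource]
    return counterfactual


def find_root_causes(observed, normal, slo_ms):
    """Level 1: find culprit services. Level 2: find culprit resources."""
    causes = []
    for service in observed:
        if latency_ms(intervene(observed, normal, service)) > slo_ms:
            continue  # restoring this service would not have met the SLO
        for resource in observed[service]:
            if latency_ms(intervene(observed, normal, service, resource)) <= slo_ms:
                causes.append((service, resource))
    return causes


ROOT_CAUSES = find_root_causes(OBSERVED, NORMAL, SLO_MS)
```

Checking whole services first keeps the number of counterfactual queries small: the per-resource queries run only for services whose restoration would have met the SLO.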

![](/files/wmFnSIIzc3LwMu3ew8Dw)
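One way to picture connection pruning is as a binary mask over a dense layer's weights, so that an output node only sees inputs the causal model marks as its parents. The sketch below is an assumption about how such pruning could look, not the paper's code; the node names, weights, and the helper functions `build_mask` and `masked_linear` are hypothetical, and plain Python lists are used instead of a deep learning framework.

```python
# Hypothetical causal parents: output node -> input nodes allowed to feed it.
ALLOWED_INPUTS = {
    "cache.latency": ["cache.cpu"],
    "db.latency": ["db.cpu"],
}
INPUTS = ["cache.cpu", "db.cpu"]
OUTPUTS = ["cache.latency", "db.latency"]


def build_mask(inputs, outputs, allowed):
    """mask[o][i] is 1.0 only if input i is a causal parent of output o."""
    return [[1.0 if i in allowed[o] else 0.0 for i in inputs] for o in outputs]


def masked_linear(weights, mask, x):
    """Dense layer without bias whose pruned connections contribute nothing."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


MASK = build_mask(INPUTS, OUTPUTS, ALLOWED_INPUTS)
WEIGHTS = [[2.0, 7.0], [3.0, 4.0]]  # dense weights before pruning (made up)
PRUNED_OUTPUT = masked_linear(WEIGHTS, MASK, [1.0, 1.0])
```

With the mask applied, changing `db.cpu` cannot affect the `cache.latency` output. The same locality would let a service update retrain only the weights attached to the affected nodes, which fits the notes' point about faster, partial retraining.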
