Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices
https://www.csl.cornell.edu/~delimitrou/papers/2021.asplos.sage.pdf
Presentation
Motivation
Microservices are increasingly popular in cloud systems
Fine-grained, loosely-coupled, and single-concerned
Communicate with RPCs or RESTful APIs
Pros
Agile development
Better modularity and elasticity
Testing and debugging in isolation
Cons
Different HW & SW constraints
Dependencies between services complicate cluster management
SLOs govern interactive microservices
Challenges in microservice performance debugging
Microservices are more sensitive to performance unpredictability
Complex network dependencies
Complex tracing and monitoring
Automated, data-driven techniques are critical
Existing work scales poorly and is not practical to deploy
Design principles
No need to label data
Requires a causal model
Robust to sampling frequency
Suits production instrumentation: does not rely on temporal patterns for inference
No need for kernel-level tracing
Practical adjustment to service updates
Focuses on resource provisioning-related performance issues
Sage: root cause analysis system using unsupervised learning
Causal Bayesian Networks (CBN) model the causal relationships among microservices
Edges indicate causal relationships; a CBN is a tool for structural causal inference that is interpretable and explainable
Use counterfactuals to detect root causes (services and resources) of SLO violations
Counterfactual queries
Queries of hypothetical end-to-end latency if some metrics had been "normal"
Root causes: metrics that hypothetically solve the end-to-end performance issue
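The counterfactual query above can be sketched on a toy model. This is a minimal illustration, not Sage's implementation: the linear latency model, the service names, and all numbers (`SERVICES`, `NORMAL_CPU`, `SLO_MS`) are assumptions made up for the example, standing in for the learned generative model. It keeps each service's unexplained noise from the observed trace, resets one service's CPU utilization to a "normal" value, and checks whether the hypothetical end-to-end latency would have met the SLO.

```python
# Toy structural model (hypothetical numbers, not from the paper):
#   latency_ms = base + slope * cpu_util + noise
SERVICES = {
    "frontend": {"base": 5.0, "slope": 20.0},
    "cache": {"base": 2.0, "slope": 10.0},
    "db": {"base": 10.0, "slope": 100.0},
}
NORMAL_CPU = 0.3  # assumed "normal" CPU utilization
SLO_MS = 100.0    # assumed end-to-end latency SLO


def abduct_noise(name, cpu, latency_ms):
    """Recover the noise term that explains the observed latency."""
    p = SERVICES[name]
    return latency_ms - (p["base"] + p["slope"] * cpu)


def counterfactual_e2e(observed, fixed):
    """End-to-end latency had the services in `fixed` run at NORMAL_CPU.

    `observed` maps service name -> (cpu_util, latency_ms).
    Services are assumed to be called sequentially, so latencies add up.
    """
    total = 0.0
    for name, (cpu, latency_ms) in observed.items():
        noise = abduct_noise(name, cpu, latency_ms)  # keep what we observed
        cf_cpu = NORMAL_CPU if name in fixed else cpu  # intervene on one metric
        p = SERVICES[name]
        total += p["base"] + p["slope"] * cf_cpu + noise
    return total


def root_causes(observed):
    """Services whose metric, if normal, would have avoided the SLO violation."""
    return [
        name for name in observed
        if counterfactual_e2e(observed, {name}) < SLO_MS
    ]


observed = {
    "frontend": (0.35, 13.0),
    "cache": (0.30, 5.5),
    "db": (0.95, 108.0),
}
print(root_causes(observed))
```

In this toy trace only `db` is reported: resetting its CPU utilization brings the hypothetical end-to-end latency under the SLO, while resetting `frontend` or `cache` does not.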
Generating counterfactuals with generative models
CVAE
Prior network, encoder, decoder (MLP)
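For reference, the three networks listed above map onto the standard CVAE training objective. The notation here is an assumption for illustration (x for the conditioning inputs, y for the variables to generate, z for the latent variable); the exact loss used in the paper may differ.

```latex
% Standard CVAE evidence lower bound (notation assumed, not taken from the notes):
%   prior network:  p_\theta(z \mid x)
%   encoder:        q_\phi(z \mid x, y)
%   decoder:        p_\theta(y \mid x, z)
\log p_\theta(y \mid x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[ \log p_\theta(y \mid x, z) \right]
  \;-\; \mathrm{KL}\!\left( q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x) \right)
```

The first term rewards accurate reconstruction by the decoder; the KL term keeps the encoder's posterior close to the prior network's output, so the prior alone can be used when y is not available.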
GVAE: factorize CVAE according to the CBN model
Connection pruning forces the network to follow the causal model
Better interpretability
Faster retraining upon microservice updates
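Connection pruning can be pictured as a binary mask over layer weights that is derived from the CBN edges. The sketch below is a simplified single-layer illustration under assumed names (`NODES`, `EDGES`, `build_mask`, `masked_linear` are hypothetical); the GVAE in the paper factorizes the encoder, decoder, and prior networks rather than masking one dense layer.

```python
# Hypothetical CBN: each edge points from cause to effect.
NODES = ["cpu_db", "lat_db", "lat_frontend"]
EDGES = [("cpu_db", "lat_db"), ("lat_db", "lat_frontend")]


def build_mask(nodes, edges):
    """mask[out][in] is 1 only if `in` is a parent of `out` in the CBN."""
    idx = {name: i for i, name in enumerate(nodes)}
    mask = [[0] * len(nodes) for _ in nodes]
    for src, dst in edges:
        mask[idx[dst]][idx[src]] = 1
    return mask


def masked_linear(weights, mask, x):
    """Linear layer whose pruned connections (mask == 0) contribute nothing."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


mask = build_mask(NODES, EDGES)
dense = [[1.0] * len(NODES) for _ in NODES]  # placeholder weights

out_a = masked_linear(dense, mask, [0.5, 2.0, 7.0])
out_b = masked_linear(dense, mask, [0.9, 2.0, 7.0])  # only cpu_db changed

# lat_frontend has no direct edge from cpu_db, so its output is unchanged.
print(out_a[2] == out_b[2])
```

Because each output only connects to its causal parents, an update to one microservice touches only the weights attached to that service's nodes, which is what makes partial retraining cheaper than retraining a fully connected model.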
Root cause detection with GVAE
Learn the latent variable (Z) from the encoder
Calculate the "normal" values of metrics and latent variables
Two-level intervention for root cause detection
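One way to read the two-level intervention is as a coarse-to-fine search: first reset all metrics of one service to their normal values to find culprit services, then reset individual resources within each culprit. The sketch below assumes that reading; `predict_latency` is a hypothetical stand-in for the trained generative model, and the weights, metrics, and SLO are invented for the example.

```python
def two_level_root_cause(observed, normal, predict_latency, slo):
    """Return [(service, [resources])] whose normalization restores the SLO.

    observed / normal: {service: {resource: value}}
    predict_latency:   callable mapping such a dict to end-to-end latency
    """
    culprits = []
    for service in observed:
        # Level 1: intervene on every metric of one service at once.
        cf = {s: dict(m) for s, m in observed.items()}
        cf[service] = dict(normal[service])
        if predict_latency(cf) > slo:
            continue  # normalizing this service does not fix the violation

        # Level 2: intervene on one resource of the culprit service at a time.
        resources = []
        for resource in observed[service]:
            cf_r = {s: dict(m) for s, m in observed.items()}
            cf_r[service][resource] = normal[service][resource]
            if predict_latency(cf_r) <= slo:
                resources.append(resource)
        culprits.append((service, resources))
    return culprits


def toy_latency(metrics):
    """Hypothetical predictor: latency grows with CPU and memory pressure."""
    return sum(100 * m["cpu"] + 20 * m["mem"] for m in metrics.values())


observed = {"db": {"cpu": 0.9, "mem": 0.4}, "cache": {"cpu": 0.3, "mem": 0.3}}
normal = {"db": {"cpu": 0.3, "mem": 0.3}, "cache": {"cpu": 0.3, "mem": 0.3}}

print(two_level_root_cause(observed, normal, toy_latency, slo=100))
```

Searching at the service level first keeps the number of counterfactual queries proportional to the number of services, with per-resource queries issued only for the few services flagged in the first pass.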
Incremental & partial retraining