Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices

https://www.csl.cornell.edu/~delimitrou/papers/2021.asplos.sage.pdf

Motivation
- Microservices are becoming increasingly popular in cloud systems
  - Fine-grained, loosely coupled, and single-concern
  - Communicate with RPCs or RESTful APIs
  - Pros
    - Agile development
    - Better modularity and elasticity
    - Testing and debugging in isolation
  - Cons
    - Different HW & SW constraints
    - Dependencies between services complicate cluster management
- SLOs govern interactive microservices
Challenges in microservice performance debugging
- Microservices are more sensitive to performance unpredictability
- Complex network dependencies
- Complex tracing and monitoring
Critical need: an automatic, data-driven technique
- Existing work scales poorly and is not practical
Design principles
- No need to label data
  - Requires a causal model
- Robust to sampling frequency
  - Suitable for production instrumentation; does not rely on temporal patterns for inference
- No need for kernel-level tracing
- Practical adjustment to service updates
- Focuses on resource provisioning-related performance issues
Sage: root cause analysis system using unsupervised learning
- Causal Bayesian Networks (CBNs) model causal relationships among microservices
  - Edges indicate causal relationships; CBNs are a tool for structural causal inference and are interpretable and explainable
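
To make the "edges indicate causal relationships" idea concrete, here is a minimal sketch in Python. It is not Sage's implementation: the service names, the `EDGES` list, and the helper functions are all hypothetical, and it only covers the graph structure (no probabilities, no learning). It assumes an edge `(cause, effect)` means the latency of `cause` can influence the latency of `effect`, so the candidate root causes for a slow service are its ancestors in the graph.

```python
from collections import defaultdict

# Hypothetical causal edges (cause, effect): the latency of `cause`
# can influence the latency of `effect`. Names are made up for illustration.
EDGES = [
    ("cache", "user-service"),
    ("database", "user-service"),
    ("user-service", "frontend"),
    ("search", "frontend"),
]


def build_parents(edges):
    """Map each node to the set of its direct causes (parents)."""
    parents = defaultdict(set)
    for cause, effect in edges:
        parents[effect].add(cause)
    return parents


def candidate_causes(parents, node):
    """Return every ancestor of `node`, i.e. every service that can causally affect it."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


PARENTS = build_parents(EDGES)
print(sorted(candidate_causes(PARENTS, "frontend")))
# → ['cache', 'database', 'search', 'user-service']
```

A real CBN would also attach a conditional distribution to each node; this sketch only shows why an explicit causal graph is interpretable: the search space for a root cause is limited to the ancestors of the affected service.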
