Understanding and Detecting Software Upgrade Failures in Distributed Systems

https://www.cs.purdue.edu/homes/yonglezh/pub/upgrade-sosp21.pdf

Presentation

Most internet services live on the cloud
Software upgrades are disruptive yet inevitable
- A full-stop upgrade makes the service unavailable for a period
- A rolling upgrade leaves the service only partially available for a period
  - Even if performed across clusters (fewer active workers make the distributed system vulnerable to workload spikes)
Upgrade failures aggravate the disruption
- Why they are problematic
  - They have a large-scale impact
  - They can have a persistent impact
    - Cannot be recovered by a simple rollback
Dilemma: safe vs. fast upgrade
- A safe upgrade requires a slow rollout
  - Takes hours
- A fast upgrade is desired to deploy new features and patches
Idea: first study focusing on upgrade failures
- Study on symptom, severity, and time of detection
  - E.g., the majority of upgrade failures were caught only after software release
- Root-cause study
  - E.g., data-format and data-semantics incompatibilities are the major causes
- Triggering-condition study
  - E.g., ~90% of upgrade failures can be triggered by testing consecutive versions
Deep dive
- Root-cause study
  - DUPChecker: A static checking tool for data format incompatibilities
    - E.g., incompatible Enum type checking
      - Finds Enum definitions that changed between versions
      - Tracks dataflow to decide whether Enum data can be communicated between versions
  - Semantic incompatibilities caused by incomplete version checking and handling
    - CASS-6678: the version number was checked for one message but not for another
    - Good practices to avoid this issue
- Triggering-condition study
  - 90% can be triggered between consecutive major/minor versions
  - Half can be triggered using operations in stress testing with default configuration
  - Most non-default configurations and operations can be found in unit tests
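
The Enum check described under DUPChecker can be illustrated with a small sketch. This is not DUPChecker's implementation: it covers only the "changed Enum definitions" half (no dataflow tracking), and it assumes that Enum values are serialized by their ordinal position, so inserting, removing, or reordering members changes what the other version decodes. The function name, the dictionary representation, and the sample `State` Enum are all hypothetical.

```python
from typing import Dict, List

# Assumed representation: enum name -> members in declaration order.
EnumDefs = Dict[str, List[str]]


def find_enum_incompatibilities(old: EnumDefs, new: EnumDefs) -> List[str]:
    """Report enum changes that drop or shift ordinals between two versions."""
    problems = []
    for name, old_members in old.items():
        if name not in new:
            problems.append(f"{name}: enum removed")
            continue
        new_members = new[name]
        for ordinal, member in enumerate(old_members):
            if member not in new_members:
                problems.append(f"{name}.{member}: member removed")
            elif new_members.index(member) != ordinal:
                new_ordinal = new_members.index(member)
                problems.append(
                    f"{name}.{member}: ordinal changed {ordinal} -> {new_ordinal}"
                )
    return problems


# Hypothetical definitions extracted from consecutive releases.
v1 = {"State": ["INIT", "RUNNING", "DONE"]}
v2 = {"State": ["INIT", "PAUSED", "RUNNING", "DONE"]}  # PAUSED inserted mid-list
v3 = {"State": ["INIT", "RUNNING", "DONE", "PAUSED"]}  # PAUSED appended at the end

report_v2 = find_enum_incompatibilities(v1, v2)
report_v3 = find_enum_incompatibilities(v1, v3)
print(report_v2)
print(report_v3)
```

Under the ordinal assumption, the mid-list insertion in `v2` shifts `RUNNING` and `DONE`, while the append in `v3` leaves every existing ordinal unchanged and produces an empty report. A real checker would also need the dataflow step from the notes above to decide whether the Enum ever crosses a version boundary.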
DUPTester: automated testing tool for upgrade failures
- Testing scenarios
  - Full-stop: execute a workload before the upgrade
  - Rolling: execute a workload during the upgrade
- Workloads
  - Default stress testing workload
  - Workload synthesized by picking client-side operations in unit tests
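
The test matrix implied by the scenarios and workloads above can be sketched as follows. This is an illustration under stated assumptions, not DUPTester's code: it only builds test plans as lists of step descriptions and executes nothing. The version strings, the step wording, the three-node default, and the final error-check step are hypothetical.

```python
from itertools import product
from typing import List, Tuple


def consecutive_pairs(versions: List[str]) -> List[Tuple[str, str]]:
    """Pair each release with its immediate successor (input is in release order)."""
    return list(zip(versions, versions[1:]))


def plan_steps(scenario: str, old: str, new: str, workload: str, nodes: int = 3) -> List[str]:
    """Return the ordered steps of one upgrade test; nothing is executed here."""
    steps = [f"start {nodes} nodes on {old}"]
    if scenario == "full-stop":
        # The workload runs before the upgrade, then the whole cluster restarts.
        steps += [f"run {workload}", "stop all nodes", f"start all nodes on {new}"]
    elif scenario == "rolling":
        # The workload runs while nodes are upgraded one at a time.
        steps.append(f"begin {workload} in background")
        steps += [f"restart node {i} on {new}" for i in range(nodes)]
        steps.append(f"wait for {workload} to finish")
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    # Assumed oracle: look for errors after the upgrade completes.
    steps.append("check logs and client results for errors")
    return steps


versions = ["3.0", "3.11", "4.0"]  # hypothetical release history
scenarios = ["full-stop", "rolling"]
workloads = ["stress", "unit-test-derived"]

plans = [
    (old, new, scenario, workload)
    for (old, new), scenario, workload in product(
        consecutive_pairs(versions), scenarios, workloads
    )
]
print(len(plans))
for step in plan_steps("rolling", "3.0", "3.11", "stress"):
    print(step)
```

Restricting the matrix to consecutive version pairs follows the triggering-condition finding above; testing every pair of versions would grow quadratically with the number of releases.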


