Understanding and Detecting Software Upgrade Failures in Distributed Systems
https://www.cs.purdue.edu/homes/yonglezh/pub/upgrade-sosp21.pdf
Presentation
Most internet services live on the cloud
Software upgrades are disruptive yet inevitable
A full-stop upgrade results in a period of unavailability
A rolling upgrade results in a partially available period
Even when performed across clusters, the reduced number of workers leaves the distributed system vulnerable to workload spikes
Upgrade failures aggravate the disruption
Problematic
They have a large-scale impact
Could have persistent impact
Cannot be recovered by simple roll-back
Dilemma: safe vs. fast upgrade
Safe upgrade requires slow rollout
Hours
Fast upgrade is desired to deploy new features and patches
Idea: first study focusing on upgrade failures
Study on symptom, severity, and time of detection
E.g., the majority of upgrade failures were caught only after software release
Root-cause study
E.g., data format and data semantic incompatibilities are the major cause
Triggering-condition study
E.g., ~90% of upgrade failures can be triggered by testing consecutive versions
Deep dive
Root-cause study
DUPChecker: A static checking tool for data format incompatibilities
E.g., checking for incompatible enum types
Changed enum definitions
Whether enum data can be communicated between versions, determined by tracking data flow
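The first of the two enum checks can be illustrated with a minimal sketch. This is not DUPChecker's implementation: it assumes the enum is serialized by ordinal (its position in the definition), the enum names are hypothetical, and it skips the data-flow tracking step entirely.

```python
from typing import Dict, List


def enum_incompatibilities(old: List[str], new: List[str]) -> List[str]:
    """Report old members that were removed or whose ordinal changed.

    Assumes serialization by ordinal, so an index shift silently changes
    what a value means to a node running the other version.
    """
    new_index: Dict[str, int] = {name: i for i, name in enumerate(new)}
    problems = []
    for i, name in enumerate(old):
        if name not in new_index:
            problems.append(f"{name}: removed (old ordinal {i})")
        elif new_index[name] != i:
            problems.append(f"{name}: ordinal {i} -> {new_index[name]}")
    return problems


# Hypothetical enum definitions from two consecutive versions.
OLD_STATE = ["INIT", "RUNNING", "DONE"]
NEW_APPENDED = ["INIT", "RUNNING", "DONE", "FAILED"]  # existing ordinals unchanged
NEW_INSERTED = ["INIT", "PAUSED", "RUNNING", "DONE"]  # shifts RUNNING and DONE

print(enum_incompatibilities(OLD_STATE, NEW_APPENDED))
print(enum_incompatibilities(OLD_STATE, NEW_INSERTED))
```

An empty result only means that no existing ordinal moved. A member appended in the new version can still be unknown to an old node that receives it, which is why the second check (whether the value actually crosses versions) matters.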
Semantic incompatibilities caused by incomplete version checking and handling
CASS-6678: the version number was checked for one message, but the check was forgotten for another
Good practices to avoid this issue
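The incomplete-version-check pattern can be sketched as follows. Everything here is hypothetical (the message types, field names, and version constants are not taken from Cassandra or the paper), and the final check is just one possible safeguard, not necessarily the practice the paper recommends.

```python
from typing import Callable, Dict, List, Set

# Hypothetical protocol versions.
V1, V2 = 1, 2


def encode_read(payload: dict, peer_version: int) -> dict:
    # Version-gated: the field added in V2 is dropped for older peers.
    out = {"key": payload["key"]}
    if peer_version >= V2:
        out["trace_id"] = payload["trace_id"]
    return out


def encode_write(payload: dict, peer_version: int) -> dict:
    # Bug pattern: the V2-only field is sent regardless of peer_version.
    return {"key": payload["key"], "value": payload["value"],
            "trace_id": payload["trace_id"]}


ENCODERS: Dict[str, Callable[[dict, int], dict]] = {
    "read": encode_read,
    "write": encode_write,
}

# Fields a V1 peer understands for each message type.
V1_FIELDS: Dict[str, Set[str]] = {"read": {"key"}, "write": {"key", "value"}}


def unchecked_messages(sample: dict) -> List[str]:
    """Return message types that send post-V1 fields to a V1 peer."""
    return [name for name, encode in ENCODERS.items()
            if set(encode(sample, V1)) - V1_FIELDS[name]]


sample = {"key": "k", "value": "v", "trace_id": "t"}
print(unchecked_messages(sample))
```

Iterating over every registered encoder, rather than testing messages one by one, is what catches the message whose check was forgotten.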
Triggering-condition study
90% can be triggered between consecutive major/minor versions
Half can be triggered using operations in stress testing with default configuration
Most non-default configurations and operations can be found in unit test
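Since most failures show up between consecutive versions, a test plan can start from adjacent version pairs rather than all pairs. A minimal sketch, assuming dotted numeric version strings and assuming (my choice, not the paper's) that the latest patch release represents each major.minor line:

```python
from typing import Dict, List, Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn '3.11.4' into (3, 11, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def consecutive_pairs(releases: List[str]) -> List[Tuple[str, str]]:
    """Pair each major.minor line with the next one, using latest patches."""
    latest: Dict[Tuple[int, ...], str] = {}
    for release in releases:
        line = parse(release)[:2]
        if line not in latest or parse(release) > parse(latest[line]):
            latest[line] = release
    ordered = [latest[line] for line in sorted(latest)]
    return list(zip(ordered, ordered[1:]))


# Hypothetical release list, deliberately unsorted.
releases = ["2.1.0", "2.0.3", "2.0.1", "3.0.0", "2.1.2"]
print(consecutive_pairs(releases))
```

For n release lines this yields n - 1 upgrade paths to test instead of n(n - 1)/2.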
DUPTester: automated testing tool for upgrade failures
Testing scenarios
Full-stop: execute a workload before upgrade
Rolling: execute a workload during the upgrade
Workloads
Default stress testing workload
Workload synthesized by picking client-side operations in unit tests
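The two testing scenarios differ mainly in when the workload runs relative to the upgrade. A minimal sketch of that ordering, not DUPTester's actual code: `Node` is a hypothetical in-memory stand-in (a real harness would manage processes or containers), and the workload is any callable.

```python
from typing import Callable, List


class Node:
    """Hypothetical stand-in for one cluster member."""

    def __init__(self, name: str, version: str) -> None:
        self.name = name
        self.version = version

    def upgrade(self, version: str) -> None:
        self.version = version


Workload = Callable[[List[Node]], None]


def snapshot(nodes: List[Node]) -> List[str]:
    return [node.version for node in nodes]


def full_stop_upgrade(nodes: List[Node], new_version: str,
                      workload: Workload) -> List[List[str]]:
    """Run the workload on the old version, then upgrade every node."""
    seen = [snapshot(nodes)]
    workload(nodes)
    for node in nodes:
        node.upgrade(new_version)
    return seen


def rolling_upgrade(nodes: List[Node], new_version: str,
                    workload: Workload) -> List[List[str]]:
    """Upgrade one node at a time, running the workload after each step."""
    seen = []
    for node in nodes:
        node.upgrade(new_version)
        seen.append(snapshot(nodes))
        workload(nodes)
    return seen


calls: List[int] = []
cluster = [Node(f"n{i}", "1.0") for i in range(3)]
states = rolling_upgrade(cluster, "1.1", lambda ns: calls.append(len(ns)))
print(states)
```

The returned list records which version mix the workload ran against. In the rolling case, every state except the last is a mixed-version cluster, which is where cross-version communication bugs can surface.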