Understanding and Detecting Software Upgrade Failures in Distributed Systems

https://www.cs.purdue.edu/homes/yonglezh/pub/upgrade-sosp21.pdf

Presentation

  • Most internet services live on the cloud

  • Software upgrades are disruptive yet inevitable

    • A full-stop upgrade results in a period of unavailability

    • A rolling upgrade results in a partially available period

      • Even if performed across clusters (fewer active workers leave the distributed system vulnerable to workload spikes)

  • Upgrade failures aggravate the disruption

    • Problematic

      • They have a large-scale impact

      • They can have a persistent impact

        • Cannot be recovered by simple roll-back

  • Dilemma: safe vs. fast upgrade

    • A safe upgrade requires a slow rollout

      • Hours

    • A fast upgrade is desired for deploying new features and patches

  • Idea: first study focusing on upgrade failures

    • Study of symptoms, severity, and time of detection

      • E.g., the majority of upgrade failures were caught only after software release

    • Root-cause study

      • E.g., data-format and data-semantics incompatibilities are the major causes

    • Triggering-condition study

      • E.g., ~90% of upgrade failures can be triggered by testing consecutive versions

  • Deep dive

    • Root-cause study

      • DUPChecker: A static checking tool for data format incompatibilities

        • e.g., incompatible enum type checking

          • changed Enum definitions

          • whether Enum data could be communicated between versions, determined by tracking data flow

      • Semantic incompatibilities caused by incomplete version checking and handling

        • CASS-6678: the version number was checked for one message but not for another

        • Good practices to avoid this issue

    • Triggering-condition study

      • 90% can be triggered between consecutive major/minor versions

      • Half can be triggered using stress-testing operations with the default configuration

      • Most non-default configurations and operations can be found in unit tests
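
The "changed Enum definitions" check noted under DUPChecker above can be illustrated with a minimal sketch. This is not DUPChecker's implementation: it assumes enum members are serialized by ordinal (their position in the definition), it omits the data-flow tracking step entirely, and the function name and enum members are hypothetical.

```python
from typing import List


def enum_incompatibilities(old: List[str], new: List[str]) -> List[str]:
    """Compare two versions of an enum, each given as an ordered list of member names.

    Assumption: members are serialized by ordinal, so a member that is removed
    or changes position would be misread by a node running the other version.
    """
    new_pos = {name: i for i, name in enumerate(new)}
    problems = []
    for i, name in enumerate(old):
        if name not in new_pos:
            problems.append(f"{name}: removed (old ordinal {i})")
        elif new_pos[name] != i:
            problems.append(f"{name}: ordinal {i} -> {new_pos[name]}")
    return problems


# Hypothetical enum: a new member inserted in the middle shifts later ordinals.
old_enum = ["HEAP", "DISK", "ARCHIVE"]
new_enum = ["HEAP", "SSD", "DISK", "ARCHIVE"]
report = enum_incompatibilities(old_enum, new_enum)
for line in report:
    print(line)
```

Under the same assumption, appending the new member at the end instead of inserting it in the middle leaves every existing ordinal unchanged, so this check would report nothing.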

  • DUPTester: automated testing tool for upgrade failures

    • Testing scenarios

      • Full-stop: execute a workload before upgrade

      • Rolling: execute a workload during the upgrade

    • Workloads

      • Default stress testing workload

      • Workload synthesized by picking client-side operations in unit tests
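
The difference between the two testing scenarios above lies in when the workload runs relative to the version switch. The following toy in-memory sketch shows only that ordering. It is not DUPTester's implementation, and `Node`, `run_full_stop`, and `run_rolling` are hypothetical names. A real harness would start actual system processes and check for failures after the upgrade.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy stand-in for one cluster node; records which version served each op."""
    version: str
    log: List[str] = field(default_factory=list)

    def apply(self, op: str) -> None:
        self.log.append(f"{self.version}:{op}")


def run_full_stop(nodes: List[Node], new_version: str, workload: List[str]) -> None:
    # Full-stop: the whole workload runs on the old version, then every node is upgraded.
    for i, op in enumerate(workload):
        nodes[i % len(nodes)].apply(op)
    for node in nodes:
        node.version = new_version


def run_rolling(nodes: List[Node], new_version: str, workload: List[str]) -> None:
    # Rolling: nodes are upgraded one at a time while the workload keeps running,
    # so old-version and new-version nodes serve operations side by side.
    pending = list(workload)
    for node in nodes:
        node.version = new_version
        for target in nodes:
            if pending:
                target.apply(pending.pop(0))


ops = [f"op{i}" for i in range(6)]

full_stop_cluster = [Node("v1") for _ in range(3)]
run_full_stop(full_stop_cluster, "v2", ops)

rolling_cluster = [Node("v1") for _ in range(3)]
run_rolling(rolling_cluster, "v2", ops)
```

In the rolling run, the second node serves one operation on the old version and one on the new version. That mixed-version window is what the rolling scenario exercises and the full-stop scenario does not.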
