# Understanding and Detecting Software Upgrade Failures in Distributed Systems

### Presentation

* Most internet services run in the cloud
* Software upgrades are disruptive yet inevitable
  * A full-stop upgrade causes a period of unavailability
  * A rolling upgrade causes a period of partial availability
    * This holds even when the upgrade is performed across clusters (fewer active workers leave the distributed system vulnerable to workload spikes)
* Upgrade failures aggravate the disruption
  * Why they are problematic
    * They have a large-scale impact
    * They can have a persistent impact
      * They cannot be recovered by a simple rollback
* Dilemma: safe vs. fast upgrade
  * A safe upgrade requires a slow rollout
    * Takes hours
  * A fast upgrade is desired to deploy new features and patches
* Idea: the first study focusing on upgrade failures
  * Study of **symptoms**, severity, and time of detection
    * E.g., the majority of upgrade failures were caught only after software release
  * **Root-cause** study
    * E.g., data-format and data-semantics incompatibilities are the major causes
  * **Triggering**-condition study
    * E.g., \~90% of upgrade failures can be triggered by testing consecutive versions
* Deep dive
  * Root-cause study
    * **DUPChecker: a static checking tool for data-format incompatibilities**
      * E.g., checking for incompatible enum types
        * Enum definitions that changed between versions
        * Whether enum data can be communicated between versions, determined by tracking dataflow
    * Semantic incompatibilities caused by incomplete version checking and handling
      * CASS-6678: the version number was checked for one message, but the check was forgotten for another
      * Good practices to avoid this issue
  * Triggering-condition study
    * 90% can be triggered between consecutive major/minor versions
    * Half can be triggered using stress-testing operations with the default configuration
    * Most of the non-default configurations and operations can be found in unit tests
* **DUPTester: an automated testing tool for upgrade failures**
  * Testing scenarios
    * Full-stop: execute a workload before the upgrade
    * Rolling: execute a workload during the upgrade
  * Workloads
    * Default stress-testing workload
    * Workload synthesized by picking client-side operations from unit tests
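
The two testing scenarios listed for DUPTester can be illustrated with a toy model. The sketch below is not DUPTester itself: `ToyNode`, its two record formats, and both test functions are hypothetical names invented for illustration, and a shared dictionary stands in for persistent or exchanged data. It only shows where the workload runs relative to the upgrade in each scenario.

```python
class ToyNode:
    """Hypothetical in-memory node; `version` selects the record format."""

    def __init__(self, version, store):
        self.version = version
        self.store = store  # shared dict standing in for persistent data

    def write(self, key, value):
        # Version 1 stores the raw value; version 2 adds a "v2:" prefix.
        self.store[key] = value if self.version == 1 else f"v2:{value}"

    def read(self, key):
        raw = self.store[key]
        if self.version == 1:
            return raw
        # Deliberate bug: version 2 assumes every record has the new prefix,
        # which is false for data written before the upgrade.
        if not raw.startswith("v2:"):
            raise ValueError(f"unreadable record for {key!r}")
        return raw[3:]


def check(node, key, expected):
    """Return True if `node` reads back the expected value for `key`."""
    try:
        return node.read(key) == expected
    except ValueError:
        return False


def full_stop_upgrade_test(workload, old=1, new=2):
    """Run the workload on the old version, restart on the new one, verify."""
    store = {}
    node = ToyNode(old, store)
    for key, value in workload:
        node.write(key, value)
    node = ToyNode(new, store)  # whole system stops and restarts upgraded
    return [key for key, value in workload if not check(node, key, value)]


def rolling_upgrade_test(workload, num_nodes=3, old=1, new=2):
    """Upgrade one node at a time and run the workload after each step."""
    store = {}
    nodes = [ToyNode(old, store) for _ in range(num_nodes)]
    failures = []
    for step in range(num_nodes):
        nodes[step] = ToyNode(new, store)  # upgrade a single node
        for n, (key, value) in enumerate(workload):
            writer = nodes[n % num_nodes]
            reader = nodes[(n + 1) % num_nodes]  # may run another version
            writer.write(key, value)
            if not check(reader, key, value):
                failures.append((step, key))
    return failures


workload = [("a", "1"), ("b", "2")]
print(full_stop_upgrade_test(workload))
print(rolling_upgrade_test(workload))
```

In this toy setup both scenarios report failures: the full-stop run fails because the new version cannot read old data, and the rolling run fails while nodes of different versions exchange records.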

![](/files/bfX5WsFkLpg12yi7DyhK)
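
The enum check mentioned under DUPChecker can be sketched in a few lines. This is a minimal illustration, not DUPChecker's implementation: it assumes enum values are serialized by their ordinal position (an assumption not stated in these notes), it skips the dataflow tracking entirely, and `enum_index_changes` and the member names are hypothetical.

```python
def enum_index_changes(old_members, new_members):
    """Compare two versions of an enum definition, given as ordered name lists.

    Returns (name, old_index, new_index) for every member of the old version
    whose position changed, with new_index set to None if it was removed.
    Assumes values are exchanged by ordinal, so any shift changes meaning.
    """
    new_index = {name: i for i, name in enumerate(new_members)}
    problems = []
    for i, name in enumerate(old_members):
        j = new_index.get(name)
        if j != i:  # covers both "moved" and "removed" (j is None)
            problems.append((name, i, j))
    return problems


# Hypothetical enum definitions from two consecutive versions.
old = ["CREATED", "RUNNING", "FINISHED"]
inserted_in_middle = ["CREATED", "PAUSED", "RUNNING", "FINISHED"]
appended_at_end = ["CREATED", "RUNNING", "FINISHED", "PAUSED"]

print(enum_index_changes(old, inserted_in_middle))
# [('RUNNING', 1, 2), ('FINISHED', 2, 3)]
print(enum_index_changes(old, appended_at_end))
# []
```

Under the ordinal assumption, inserting a member in the middle shifts every later member, so an old node would misread values sent by a new node, while appending at the end leaves existing positions intact.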

