Efficient and Scalable Thread-Safety Violation Detection

SOSP 19: https://dl.acm.org/doi/pdf/10.1145/3341301.3359638?casa_token=tJoAJXALN4MAAAAA:JUd0XcAhj-AXAA8ORPeHmOS2xaLGSI-vwBGVMjVsixOq2KIXpuypu0V9ENOzpswb39bFZoXJs2n-

Presentation

  • Concurrent programming are everywhere --> bugs

  • Thread-safety violation (TSV)

    • Thread-safety contract

    • Thread-safety violation (TSV) occurs if two threads concurrently work on two conflicting methods on the same object at the same time

      • Outcome: crashes, etc.

  • Reasoning about TSVs is difficult

    • Too many possibilities of concurrency

      • Multithreading, event-driven, message passing

    • Too many forms of synchronizations

  • Data-race analysis on small scale, it's fine

    • Program source code --> integration --> static data-race analysis --> pruning

    • Or dynamic data-race analysis

      • Run the program with test inputs and program binary (integration)

      • Dynamic data-race analysis will slow down by 10x or more

  • Large-scale: existing techniques fail

    • CloudBuild: 100M tests from 4K teams, up to 10K machines / day

    • Tens of thousands of extra machines are needed

  • Challenges: integration, overhead, and false positives

  • TSVD: a scalable dynamic analysis tool for TSVs

  • Design

    • How to achieve zero false positive?

      • Report after violation + inject delays to trigger violations

        • Question: how to identify potential unsafe calls to insert delays?

          • Happen-before analysis (expensive, and require integration effort to annotate all sync primitives)

        • Racing calls

          • Two conflict methods, called from different threads, accessing the same object, having close-by physical timestamps

          • Remove unlikely race calls

            • Insights for synchronization inference

              • Insights: many ways to implement synchronization, one common effect of all synchronizations: if m1 synchronized before m2 and m1, m2 are nearby, delay to m1 will cause delay to m2

              • Transitive delay

                • If another thread cannot make progress, ..., infer synchronization

      • TSVD inference

        • Is there exist a sync?

        • Is the program running in a sequential mode?

        • Is a likely racing call less likely than another?

      • TSVD multiple runs

        • Earlier runs' knowledge will be inherited by later runs

  • TSVD review

    • instrumentation: push-button design

    • runtime:

      • likely racing list: add to list based on the physical timestamp, read the list to inject delay, remove based on the delay feedback, and report bugs

      • Remaining entries can be passed to the next run

  • Evaluation: test in Microsoft

    • Compared with Random, data collider, and HB-Tracking

  • Overall: push-button TSV detection tool.

    • Infers sync and uses them in the same run

      • Easy integration: oblivious to synchronization patterns

      • Lightweight

      • Accuracy

      • Coverage

Last updated