Efficient and Scalable Thread-Safety Violation Detection

SOSP 19: https://dl.acm.org/doi/pdf/10.1145/3341301.3359638?casa_token=tJoAJXALN4MAAAAA:JUd0XcAhj-AXAA8ORPeHmOS2xaLGSI-vwBGVMjVsixOq2KIXpuypu0V9ENOzpswb39bFZoXJs2n-

Presentation

Concurrent programming are everywhere --> bugs
Thread-safety violation (TSV)
- Thread-safety contract
- Thread-safety violation (TSV) occurs if two threads concurrently work on two conflicting methods on the same object at the same time
  - Outcome: crashes, etc.
Reasoning about TSVs is difficult
- Too many possibilities of concurrency
  - Multithreading, event-driven, message passing
- Too many forms of synchronizations
Data-race analysis on small scale, it's fine
- Program source code --> integration --> static data-race analysis --> pruning
- Or dynamic data-race analysis
  - Run the program with test inputs and program binary (integration)
  - Dynamic data-race analysis will slow down by 10x or more
Large-scale: existing techniques fail
- CloudBuild: 100M tests from 4K teams, up to 10K machines / day
- Tens of thousands of extra machines are needed
Challenges: integration, overhead, and false positives
TSVD: a scalable dynamic analysis tool for TSVs

Design
- How to achieve zero false positive?
  - Report after violation + inject delays to trigger violations
    Question: how to identify potential unsafe calls to insert delays?
    Happen-before analysis (expensive, and require integration effort to annotate all sync primitives)
    Racing calls
    Two conflict methods, called from different threads, accessing the same object, having close-by physical timestamps
    Remove unlikely race calls
    Insights for synchronization inference
    Insights: many ways to implement synchronization, one common effect of all synchronizations: if m1 synchronized before m2 and m1, m2 are nearby, delay to m1 will cause delay to m2
    Transitive delay
    If another thread cannot make progress, ..., infer synchronization
  - TSVD inference
    Is there exist a sync?
    Is the program running in a sequential mode?
    Is a likely racing call less likely than another?
  - TSVD multiple runs
    Earlier runs' knowledge will be inherited by later runs
TSVD review
- instrumentation: push-button design
- runtime:
  - likely racing list: add to list based on the physical timestamp, read the list to inject delay, remove based on the delay feedback, and report bugs
  - Remaining entries can be passed to the next run
Evaluation: test in Microsoft
- Compared with Random, data collider, and HB-Tracking
Overall: push-button TSV detection tool.
- Infers sync and uses them in the same run
  - Easy integration: oblivious to synchronization patterns
  - Lightweight
  - Accuracy
  - Coverage

PreviousConsistency NextUnderstanding and Detecting Software Upgrade Failures in Distributed Systems

Last updated 3 years ago

Was this helpful?