TSVD
====

Paper: "Efficient Scalable Thread-Safety-Violation Detection" by Li et al. (SOSP 2019)
  https://github.com/microsoft/tsvd

Why read this paper?
  Concurrency is a good area for verification.
    Subtle bugs.
    Testing is difficult (e.g., incomplete).
  Core challenge: reasoning about many orderings.
  Insight: delaying operations on shared objects helps find racing accesses.
  Cool tool:
    Works on real systems.
    Finds bugs.
    No false positives, by their definition.
    Gives stack traces of the threads performing the conflicting operations.

Problem: non-deterministic bugs.
  The result of a race condition in a parallel program.
  Bugs that show up only in some runs of a program.
    E.g., a test case fails once in a few months.
  Difficult to find.
    Testing for the bug may change behavior and not trigger it.
  Difficult to reproduce.
    Something went wrong; how to debug it?

Hard to find concurrency bugs with static analysis.
  Often not a local property: depends on callers holding locks, ordering, etc.
  Requires sophisticated inter-procedural analysis.
  Difficult to infer specs for intermediate procedures the way INFER did.

How to find concurrency bugs with runtime testing?
  Define what constitutes a bug:
    Returning a wrong result, crashing, violating some intermediate invariant, ...
    Crash: null pointer, exception, etc.
    Deadlock/livelock: test never finishes, times out.
    Intermediate specialized checkers: memory-leak checker, use-after-free.
    Similar flavor to any other testing strategy we've seen.
  Explore different schedules.
    This is the source of non-determinism for concurrency.

=== Strawman exploration plan 1: run many times.

Each run might have a different schedule.
  E.g., run a stress test with many threads, run it many times, etc.
  E.g., this is done for critical concurrent code in Linux:
    [ https://www.kernel.org/doc/Documentation/locking/locktorture.txt ]
    [ https://www.kernel.org/doc/Documentation/RCU/torture.txt ]
    Shows the importance of looking for bugs in concurrent code.
    Huge effort for application code.
Downsides, as described above:
  Might be hard to get a test to fail in interesting ways.
  Hard to debug even if a test does fail.

=== Strawman exploration plan 2: exhaustive.

Run the test, but with a special scheduler.
  The scheduler forces all interleavings of the threads' instructions:
    run the 1st thread for one instruction,
    then the 2nd thread for one instruction,
    then the 3rd thread for one instruction,
    back out the 2nd thread's step and run the 3rd thread for one instruction instead,
    etc.
  Different ways you might implement backout:
    Snapshot the state of the process.
    Or restart from the beginning: runs should be deterministic now that we control scheduling.
Good property: repeatable.
  If we hit a bug, we have a schedule that triggers it!
Good property: checks all bugs.
Bad property: too slow for serious programs.
  Lots of interleavings to check.
  Many similar interleavings.
There's been lots of work on this schedule-exploration plan:
  CHESS from Microsoft
  PCT (ASPLOS 2010)
  DPOR (Dynamic Partial Order Reduction)
  BPOR (Bounded Partial Order Reduction)
  CB-DPOR
  ...
Some insights from that line of work:
  Find "preemption points" where interleaving is interesting.
    E.g., thread synchronization, accesses to shared variables, etc.
  A few thread switches are usually enough to find bugs.
  Optimizations for eliminating equivalent schedules.

=== Another approach: check for intermediate invariants that imply concurrency bugs.

Above, we were looking for an "end-to-end" bug that demonstrates a problem.
Instead, we could formulate invariants that should be true of good code.
Classical example: must hold a lock while accessing shared memory.
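To make the classical invariant concrete, here is a minimal Go sketch (Go only as a
stand-in; the paper's setting is C#, and the function names here are made up for the
example).  Two goroutines update a shared counter: without a lock, the accesses race
and the detectors described next (e.g., "go test -race") flag them; holding a
sync.Mutex around every access satisfies the invariant.

  package main

  import (
      "fmt"
      "sync"
  )

  // unsafeCount violates the invariant: two goroutines update the shared
  // counter with no lock held.  The race detector flags this even on runs
  // where the final value happens to look right.
  func unsafeCount() int {
      n := 0
      var wg sync.WaitGroup
      for i := 0; i < 2; i++ {
          wg.Add(1)
          go func() {
              defer wg.Done()
              for j := 0; j < 1000; j++ {
                  n++ // unsynchronized read-modify-write of shared memory
              }
          }()
      }
      wg.Wait()
      return n
  }

  // safeCount holds a lock around every access, satisfying the
  // "some lock is consistently held" invariant.
  func safeCount() int {
      n := 0
      var mu sync.Mutex
      var wg sync.WaitGroup
      for i := 0; i < 2; i++ {
          wg.Add(1)
          go func() {
              defer wg.Done()
              for j := 0; j < 1000; j++ {
                  mu.Lock()
                  n++
                  mu.Unlock()
              }
          }()
      }
      wg.Wait()
      return n
  }

  func main() {
      // unsafeCount may print less than 2000; safeCount always prints 2000.
      fmt.Println(unsafeCount(), safeCount())
  }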
Many tools along these lines.
  Some focus on locks.
    Invariant: at least one lock must be consistently held when accessing any given shared-memory location.
    [ http://cseweb.ucsd.edu/~savage/papers/Tocs97.pdf ]
  Others are more general "happens-before" trackers.
    Invariant: there must be a strict "happens-before" order between two threads' accesses to the same shared-memory location.
    E.g., "go test -race":
    [ https://golang.org/doc/articles/race_detector.html ]
    [ https://clang.llvm.org/docs/ThreadSanitizer.html ]
    [ https://github.com/google/ktsan ]
    Widely used in practice.
Good property: need far fewer runs (maybe just one).
  Why? Intuition: it is easier to observe an invariant violation than to hit a precise race.
  No need to actually hit the race; it suffices to violate the invariant.
Bad property: false positives.
  The invariant might be too strong.
  The tool might not detect some reason why the access is actually safe.
The TSVD paper says precise HB tracking is too hard/costly to get right.

=== Breakout Q: what is the TSVD plan, in contrast to the above approaches?

What defines a bug?
How does it explore schedules to find the bug?
Does it miss bugs?
Does it report "bugs" that are not real issues?

=== TSVD approach:

Check an intermediate invariant.
  All about shared data structures in C#: maps, lists, queues, ...
Tweak the schedule by inserting delays, hoping to trigger an invariant violation.
  Insert a delay just before an invocation of a method on a shared data structure.
    Same insight as schedule exploration: this is where the schedule matters!
    Delay-injection idea from DataCollider (OSDI 2010), overlapping authors.
  Wait to see if another thread comes along and violates the invariant.
What's the intermediate invariant on data structures?
  Read/write sets.
    OK to have many read operations running concurrently.
    A write operation cannot run concurrently with any other write or read.
    [ https://github.com/microsoft/TSVD/blob/master/TestApps/DataRace/Configurations/runtime-config.cfg ]
  Effective because violations of this invariant can almost always cause a crash/panic/exception.
Interesting contrast with data-race checkers.
  Data-race checkers look for unsynchronized reads/writes to a shared memory location.
  Similar to TSVD's read/write sets on a data structure.
  But that turns out not to work nearly as well: many races on memory locations are benign.
  In contrast, racing reads/writes on a data structure cause data-structure corruption.
Optimizations for deciding when to inject delays (a rough sketch follows this list).
  Near misses.
    Watch for threads that had conflicting accesses within 100 msec of each other.
    A global buffer tracks all recent accesses.
    If we find such a pair of accesses, delay the next such access by 100 msec.
  Concurrent-phase detection.
    Seems targeted at tests that do sequential setup, then parallel execution.
    Don't want to slow down the sequential phase.
  Inferring happens-before.
    Problem: concurrent accesses might be perfectly well serialized using locks, etc.
    No use trying to delay one thread, because the other thread won't race.
    Clever trick: empirically observe what happens as a result of a delay.
      Inject a delay in one thread.
      If another thread gets delayed by ~the same amount, assume the two are serialized.
      Don't bother trying to get these two threads to race.
    (A rough sketch of this heuristic is at the end of these notes.)
  Decay.
    If TSVD doesn't see a race quickly, stop trying to make it happen.
    The probability of injecting a delay starts at 1 and decays over time.
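A loose Go sketch of this delay-injection loop, under stated assumptions: the tracker
type, the window and delay sizes, the decay factor, and the beforeOp hook are all
invented for illustration, and TSVD itself instruments compiled C# binaries rather
than using a hand-written wrapper like this.  The sketch records recent accesses,
injects a delay (with decaying probability) when it sees a near miss, and reports a
violation if a conflicting access from another thread shows up during the delay.

  package main

  import (
      "fmt"
      "math/rand"
      "sync"
      "time"
  )

  const (
      nearMissWindow = 100 * time.Millisecond // "conflicting accesses within 100 msec"
      injectedDelay  = 100 * time.Millisecond
  )

  type access struct {
      threadID int
      isWrite  bool
      when     time.Time
  }

  // tracker stands in for TSVD's global buffer of recent accesses to one
  // tracked object, plus the decaying delay probability.
  type tracker struct {
      mu        sync.Mutex
      recent    []access // a real implementation would prune old entries
      delayProb float64  // starts at 1, decays as delays fail to expose a race
  }

  func newTracker() *tracker { return &tracker{delayProb: 1.0} }

  // conflictSince reports whether another thread recorded a conflicting access
  // (at least one of the pair is a write) after cutoff.  Caller holds t.mu.
  func (t *tracker) conflictSince(cutoff time.Time, threadID int, isWrite bool) bool {
      for _, a := range t.recent {
          if a.when.After(cutoff) && a.threadID != threadID && (a.isWrite || isWrite) {
              return true
          }
      }
      return false
  }

  // beforeOp is what the instrumentation would call just before a read or
  // write on the tracked shared data structure.
  func (t *tracker) beforeOp(threadID int, isWrite bool) {
      start := time.Now()

      t.mu.Lock()
      // Near miss: another thread made a conflicting access within the window.
      nearMiss := t.conflictSince(start.Add(-nearMissWindow), threadID, isWrite)
      t.recent = append(t.recent, access{threadID, isWrite, start})
      prob := t.delayProb
      t.delayProb *= 0.95 // decay: gradually give up if no violation shows up
      t.mu.Unlock()

      if !nearMiss || rand.Float64() >= prob {
          return
      }

      // Delay this call.  If another thread makes a conflicting call while we
      // sleep, the two calls could have run concurrently: report a violation
      // (TSVD would record both stack traces here).
      time.Sleep(injectedDelay)

      t.mu.Lock()
      violation := t.conflictSince(start, threadID, isWrite)
      t.mu.Unlock()
      if violation {
          fmt.Printf("thread %d: thread-safety violation on tracked object\n", threadID)
      }
  }

  func main() {
      tr := newTracker()
      m := map[int]int{} // not thread-safe; the bug the delays try to expose

      var wg sync.WaitGroup
      for id := 1; id <= 2; id++ {
          wg.Add(1)
          go func(id int) {
              defer wg.Done()
              for i := 0; i < 20; i++ {
                  tr.beforeOp(id, true) // instrumented "write" call site
                  m[i] = id             // racy write; a run may also crash with
                  //                       Go's own "concurrent map writes" error
              }
          }(id)
      }
      wg.Wait()
      fmt.Println("done")
  }

Note how the conflict test encodes the read/write-set invariant: two accesses conflict
only if they come from different threads and at least one of them is a write.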
Results.
  Effective: finds bugs in C# code, and developers think the bugs are real.
  Easy to apply to C# code.
    Instruments compiled code; no source changes needed.
  Low overhead: Table 2.
    33% vs. 2-4x for other approaches.
  Misses some bugs that other tools find: Figure 8.
    Mostly the heuristics going wrong:
      near-miss window too short, HB inference wrong, injected delay too short.
  Misses some bugs by design.
    Bugs that produce wrong results rather than a race on a data-structure read/write.
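To make the happens-before inference heuristic concrete (and to show how it can go
wrong, as in the "HB inference wrong" misses above), here is a toy Go sketch; the
probe delay, the threshold, and the measurement code are invented for illustration.
Delaying worker 1 inside its critical section also stalls worker 2, because both
serialize on the same lock; observing a stall of about the injected amount, the tool
infers happens-before and stops trying to race this pair.

  package main

  import (
      "fmt"
      "sync"
      "time"
  )

  const probeDelay = 200 * time.Millisecond // the injected probe delay

  func main() {
      var mu sync.Mutex       // serializes both workers' map operations
      shared := map[int]int{} // the tracked shared data structure

      var timesMu sync.Mutex
      var worker2Times []time.Time // completion times of worker 2's operations

      var wg sync.WaitGroup
      worker := func(id int, inject bool) {
          defer wg.Done()
          for i := 0; i < 20; i++ {
              mu.Lock()
              if inject && i == 5 {
                  time.Sleep(probeDelay) // delay injected while holding the lock
              }
              shared[i] = id // the instrumented data-structure call
              mu.Unlock()

              if id == 2 {
                  timesMu.Lock()
                  worker2Times = append(worker2Times, time.Now())
                  timesMu.Unlock()
              }
              time.Sleep(5 * time.Millisecond) // normal per-iteration cadence
          }
      }

      wg.Add(2)
      go worker(1, true)  // inject the delay into worker 1...
      go worker(2, false) // ...and watch what happens to worker 2
      wg.Wait()

      // If worker 2's largest gap between operations is about probeDelay, the
      // delay propagated through the shared lock: the two are serialized, so
      // there is no point trying to race this pair.  Without the shared lock,
      // the gap would stay near 5 msec and we would keep trying.
      var maxGap time.Duration
      for i := 1; i < len(worker2Times); i++ {
          if gap := worker2Times[i].Sub(worker2Times[i-1]); gap > maxGap {
              maxGap = gap
          }
      }
      if maxGap > probeDelay/2 {
          fmt.Printf("inferred happens-before; skip this pair (max gap %v)\n", maxGap)
      } else {
          fmt.Printf("no propagation observed; keep trying to race (max gap %v)\n", maxGap)
      }
  }

If the stall came from somewhere else (say, the OS scheduler) or the threshold is
mis-tuned, the heuristic can wrongly conclude the pair is serialized and skip a real
race, which is one source of the misses in Figure 8.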