L16: CHESS

Paper: "Finding and Reproducing Heisenbugs in Concurrent Programs" by Musuvathi et al. (OSDI 2008)

Why read this paper?
  Concurrency is a good area for verification
    Subtle bugs
    Testing is difficult (e.g., incomplete)
  Core challenge: reasoning about many orderings
  Insight: few threads and few preemption points are necessary to find bugs
  Cool tool
    Works on real systems
    Finds bugs
    Reproduces the buggy interleaving

Problem: Heisenbugs
  Result of a race condition in a parallel program
  Bugs that show up only in some runs of a program
    E.g., a test case fails once in a few months
  Difficult to find
    Testing for it may change behavior and not trigger the bug
  Difficult to reproduce

Example: bug in the Concurrency and Coordination Runtime (CCR)
  Test didn't fail for many months, then failed
  CHESS found it in 20 secs, after 6737 interleavings
  Bug: CCR ports
    ports: https://msdn.microsoft.com/en-us/library/bb648755.aspx
    spec: atomically process tasks associated with a port
    race in a two-phase commit protocol:
      Thread A: registers it, but is preempted before grabbing tasks
      Thread B: cancels tasks, but is preempted after the first task
      Thread A: wakes up and also processes the tasks
  CHESS can reproduce it

Naive approach:
  Run test
  Scheduler forces all interleavings of the threads' instructions
    run first thread for one instruction
    then 2nd thread for one instruction
    then 3rd thread for one instruction
    back out 2nd thread, run 3rd thread for one instruction
    etc.
  To back out:
    Record the program state of a thread, so that you can roll it back
    If the program re-executes a system call
      Replay the recorded results from the first invocation
  If the test fails
    A race happened
    Report a bug for the recorded interleaving
  This approach is too expensive
    OK for small programs
    Not good for whole systems

Observation: interleave at *preemption points*
  preemption point: point where a thread may be rescheduled
    Thread synchronizes with another thread
    Thread releases the processor
    Thread acquires locks
    ...
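The CCR port race described above can be reproduced deterministically once a scheduler controls the interleaving, which is exactly what CHESS does. A minimal sketch (hypothetical names, not CCR's actual API) using events to force the bad schedule:

```python
import threading

# Sketch of the CCR-style race: Thread A registers on a port but is
# preempted before grabbing the queued tasks; Thread B cancels and drains
# the tasks itself; Thread A then wakes up and processes them again.
# Events stand in for the scheduler forcing this exact interleaving.
tasks = ["t1", "t2"]
processed = []            # each task should appear here exactly once
lock = threading.Lock()   # protects `processed`

a_registered = threading.Event()  # A registered, hasn't grabbed tasks yet
b_cancelled = threading.Event()   # B finished its cancellation

def thread_a():
    a_registered.set()          # phase 1: register interest in the port
    # --- preemption point: A loses the CPU here ---
    b_cancelled.wait()
    with lock:                  # phase 2: process the port's tasks...
        processed.extend(tasks) # ...which B already consumed: double-process
def thread_b():
    a_registered.wait()
    with lock:                  # cancel the port, draining its tasks
        processed.extend(tasks)
    b_cancelled.set()

ta = threading.Thread(target=thread_a)
tb = threading.Thread(target=thread_b)
ta.start(); tb.start()
ta.join(); tb.join()
print(processed)  # ['t1', 't2', 't1', 't2'] -- tasks processed twice
```

Under the normal OS scheduler this interleaving almost never occurs, which is why the test passed for months; forcing the preemption makes the bug fire every run.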
  Run a thread atomically between preemption points

Approach: interleave only at preemption points
  Take a test for a program P
  Link with a wrapper library
    Intercepts all preemption points
      When a thread creates another thread
      When a thread acquires a lock
      When a thread performs a blocking operation
      ...
      Code that is likely involved in a data race
    Calls the scheduler at preemption points
  Basic operation of the scheduler
    Replay phase
      Replay scheduling decisions up to some preemption point
      Select another thread to run, and go to the record phase
    Record phase
      Run a thread until it yields, then the scheduler runs another thread
        Scheduler prefers *not* to preempt a thread
      But record, at each preemption point, the enabled threads
    Search phase
      If the test completes, determine a new schedule
      Go to the replay phase

Design of the scheduler
  Don't want the scheduler to know the details of each synchronization primitive
  Don't want to reason about threads in detail (e.g., execution speeds)
  Idea: happens-before graph
    nodes = instructions
    edges = enforce a partial order on instructions
    each node is annotated with (task, sync variable, operation)
      task = thread, callback, etc.
      sync variable = resource (lock, semaphore, etc.) used by the task
      operation = isWrite and isRelease
    if isWrite for node n:
      add edges from preceding nodes with the same resource to n
      add edges from n to all subsequent nodes with the same resource
    if isRelease is set by a node n:
      add waiting tasks to the set of enabled tasks
    If the program is data-race free, edges order accesses
      For example, if threads share a variable protected by a lock

Implementation
  Wrapper
    Dynamically-linked library
    Binary rewriting
      For hardware-level instructions
    Win32, .NET, Singularity
  How does a wrapper determine that a call will block?
    Use the "try" version of the call
  Conservative:
    set isWrite to True
    set isRelease to True

Challenges
  Benign races
    How to capture them?
    Enforce single-threaded execution!
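The single-threaded execution idea can be sketched as a cooperative scheduler: each thread owns a semaphore and runs only while it holds the single virtual CPU, so no two threads ever execute user code concurrently. This is an illustrative sketch with assumed names, not CHESS's actual implementation:

```python
import threading

log = []  # order in which thread steps actually execute

class SerialScheduler:
    """Hands a single virtual CPU between threads at preemption points."""
    def __init__(self, tids):
        self.sems = {t: threading.Semaphore(0) for t in tids}

    def start(self, first):      # give the virtual CPU to the first thread
        self.sems[first].release()

    def wait_for_cpu(self, me):  # block until this thread is scheduled
        self.sems[me].acquire()

    def preempt(self, me, nxt):  # preemption point: switch me -> nxt
        self.sems[nxt].release()
        self.sems[me].acquire()

    def exit(self, nxt):         # hand the CPU off before terminating
        self.sems[nxt].release()

sched = SerialScheduler(["A", "B"])

def worker(me, other):
    sched.wait_for_cpu(me)
    log.append(me + "1")
    sched.preempt(me, other)     # scheduler-chosen context switch
    log.append(me + "2")
    sched.exit(other)

ta = threading.Thread(target=worker, args=("A", "B"))
tb = threading.Thread(target=worker, args=("B", "A"))
ta.start(); tb.start()
sched.start("A")
ta.join(); tb.join()
print(log)  # deterministic: ['A1', 'B1', 'A2', 'B2']
```

Because only one thread holds the virtual CPU at a time, data races can never fire, and by varying which thread `preempt` hands the CPU to, the scheduler can systematically enumerate interleavings.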
      No concurrent access, no data race
      Slower, but run CHESS instances in parallel
    May miss Heisenbugs
      Solution: run a data-race detector on the first few runs
        add preemption points at the data races
  Handling all preemption points
    Many of them (Win32 has many sync primitives)
    At different levels of abstraction (e.g., Win32 and HW)
    Careful to pick the right level of abstraction
      Locks vs. atomic instructions
    Difficult, but not too bad
      Tests help them
  Imperfect replay
    Relies on replay to construct the state at a preemption point
    Replay is light-weight
    Misses sources of non-determinism
      Lazy initialization
      Interference from the environment
      Non-deterministic calls
  Reduce interleavings
    Avoid unfair schedules
    Ignore preemptions in libraries
    Special-case accesses to volatile variables
    Scheduler recognizes already-visited states
      Same happens-before graph
    Run with few threads

Results
  Found many known bugs
  Two unknown bugs

Can CHESS find all Heisenbugs?
  No, driven by a specific set of tests
  No, imperfect replay
  Difficulty handling benign races
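The happens-before edge rules from the scheduler-design section can be sketched directly: each executed sync operation is a node annotated with (task, sync variable, isWrite, isRelease), and a write on a variable gets edges from every earlier access and to every later access of that variable. The data shapes below are assumptions for illustration, not the paper's code:

```python
from collections import namedtuple

# One executed synchronization operation in the trace.
Node = namedtuple("Node", "idx task var is_write is_release")

def happens_before_edges(trace):
    """Apply the isWrite rule: order every write on a sync variable
    against all other accesses to the same variable."""
    edges = set()
    for n in trace:
        if n.is_write:
            for m in trace:
                if m.var == n.var and m.idx < n.idx:
                    edges.add((m.idx, n.idx))  # earlier access -> write
                if m.var == n.var and m.idx > n.idx:
                    edges.add((n.idx, m.idx))  # write -> later access
    return edges

# Lock acquire/release both count as writes on the lock variable, so two
# critical sections on the same lock end up totally ordered:
trace = [
    Node(0, "T1", "lock", True, False),  # T1 acquires lock
    Node(1, "T1", "lock", True, True),   # T1 releases lock
    Node(2, "T2", "lock", True, False),  # T2 acquires lock
    Node(3, "T2", "lock", True, True),   # T2 releases lock
]
print(sorted(happens_before_edges(trace)))
```

This is why the scheduler never needs to know what a "lock" is: it only sees writes and releases on sync variables, and the edges alone order the accesses (for data-race-free programs).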