IronFleet: Proving Practical Distributed Systems Correct
========================================================

[ source code: https://github.com/Microsoft/Ironclad ]

lock example:
  ironfleet/src/Dafny/Distributed/Impl/Lock
  ironfleet/src/Dafny/Distributed/Protocol/Lock
  ironfleet/src/Dafny/Distributed/Protocol/Lock/RefinementProof
  ironfleet/src/Dafny/Distributed/Services/Lock

why this paper?
  distributed systems have subtle bugs
    similar to the bugs discussed in the Amazon paper earlier this semester
    but IronFleet extends verification to the implementation too
  one of a few examples of verified distributed systems
    must handle concurrency, partial failures, ..
  practical focus
    goal is a working, reasonably performant system
    not necessarily focused on 100% machine-checked proofs
  combines several approaches
    reduction (movers), reminiscent of the CSPEC paper
  different tools: Dafny, Z3
    more automated, vs. less direct control?
    Dafny is hard-coded to sequential verification
      but can mostly be coaxed into verifying distributed systems

complexity in verifying a distributed system
  concurrency
  partial failure

top-level plan: separate reasoning about the distributed protocol vs. the
  implementation
  "concurrency containment strategy"
    don't run things concurrently on a single host
      i.e., no CPU concurrency, no threads, etc.
      sequential, single-threaded packet processing
    we still have concurrency because of multiple hosts, but try to separate
      reasoning about concurrency+failure from reasoning about sequential code
  allows sequential reasoning (Hoare logic)
    much like the 6.826 lab assignments
  reason about distributed-systems aspects (concurrency, failure) separately

concrete steps for this plan [paper figure 3]
  1. write down a specification for how each host behaves in the protocol:
     abstract state for each host + what the host does with messages
  2. prove that the host implementation refines this protocol specification:
     use sequential reasoning, like [step_proc] in the labs, with better
     automation from the Dafny tool and the Z3 solver
  3. prove that the combination of hosts (represented by their protocol-level
     specs) together refines the system-wide specification

what do their specifications and theorems look like?
  closely matches Butler's view: initial state and step relation
    both for the impl and for the spec
  minor difference: no notion of a "trace" or "externally visible actions"
    instead, the abstraction function must connect whatever the developer
      considers to be "externally visible actions" (pretty much packets here)
      see "SpecRelation" in section 3.1
    the low-level network model (UDP) exposes a ghost variable: all packets
      sent/received
      convenient choice of externally visible action (in the protocol
        refinement)
      convenient for writing the abstraction function to whole-system spec
        state
    likely a side-effect of using the Dafny tool as the baseline
      Dafny is designed for sequential code verification
      no built-in notion of actions or a trace
      instead, procedures have pre- and post-conditions ("requires",
        "ensures")
    the authors have found a way to "retrofit" traces into Dafny
      a ghost variable contains the trace
      pre- and post-conditions say the state corresponds to the current trace
      top-level "step loop" (figure 8) that is "assumed to be correct"
  refinement proof structure is the same as Butler's plan
    each implementation step corresponds to taking 0 or more spec steps
    trace subsumed by state: no separate requirement about externally
      visible actions
      but the abstraction function is part of the TCB, because it defines
        what is externally visible

reduction: achieving sequential reasoning [section 3.6]
  the real world runs code concurrently on multiple hosts
  however, if there aren't any interactions between the hosts, we can
    ignore the concurrency
  more formally, re-order the concurrent bits of execution to make it look
    sequential [figure 7]
    similar to Butler's commutativity reasoning, and to right/left movers
      from the CSPEC paper
  the authors don't do a machine-checked proof of this commutativity/
    reduction argument
    instead, they prove a "reduction-enabling obligation"
    they proved on paper that, if this obligation holds, they can do the
      reduction
    but reduction is likely hard to code up in Dafny
      Dafny has a hard-coded sequential, single-host execution model
      contrast with Coq: the exec model is defined inside Coq, easy to change
        downside of the Coq approach: have to rebuild the entire tooling
          ecosystem each time..
    the IronFleet authors later proved, in Dafny, that this obligation is
      sufficient
  the reduction-enabling obligation is basically the same as the mover plan
    from the CSPEC paper
    each implementation step must first receive all messages (right-movers),
      then do zero or one time-dependent operations (non-movers), followed
      by sending out all messages (left-movers)
    other local-computation steps can move arbitrarily
      they commute with all actions on other hosts
    looks like the mover pattern from CSPEC
  mechanizing this obligation:
    the implementation of a step returns a list of events
      presumably things like "recv pkt", "send pkt", "read time"
    the event loop requires that this list be properly ordered, as above
      (figure 8)

what's abstracted away in the protocol spec?
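The event-ordering obligation from the reduction discussion above can be sketched as a simple check over the event list a step produces. This is a hypothetical Python illustration, not IronFleet's actual Dafny formulation; the event names are made up to match the examples in the notes.

```python
# Hypothetical sketch of the reduction-enabling obligation, not IronFleet's
# actual Dafny code: a step's event list must be all receives (right-movers),
# then at most one time-dependent operation (non-mover), then all sends
# (left-movers); pure local computation may appear anywhere.
def obeys_reduction_obligation(events):
    NON_MOVERS = {"read time"}
    phase = 0            # 0 = receiving, 1 = after the non-mover, 2 = sending
    for ev in events:
        if ev == "recv pkt":
            if phase != 0:
                return False      # a receive after a non-mover or a send
        elif ev in NON_MOVERS:
            if phase != 0:
                return False      # at most one non-mover, before any send
            phase = 1
        elif ev == "send pkt":
            phase = 2
        # anything else: local computation, commutes with other hosts' actions
    return True
```

An event loop like the one in figure 8 would enforce a property of this shape on the list of events returned by each host step.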
  [section 3.2]
  finite data structures
  functional / relational description, rather than code with mutable memory
  atomicity of steps

proving the overall system spec
  abstraction function connects the multiple host protocol states into a
    single system state
  lock example: history of lock holders
    the current lock holder is the host that believes it holds the lock
    but there might also be an in-flight packet transferring the lock
      need to define who is said to own the lock in that case
  IronRSL: the spec executes when a quorum of replicas agrees
    the actual replica execution is a nop as far as the spec is concerned
  explicitly state a theorem like figure 1

using Dafny and Z3
  triggers: hint at what values to choose for the quantifier when using a
    [forall] statement
  manual unfolding / opaque annotations
  explicit lemma invocation
    lemmas are just functions without any executable code in the body (nop)
    forall quantifiers: arguments of the lemma
    premise of a lemma: "requires" clause, or pre-condition
    conclusion of a lemma: "ensures" clause, or post-condition
    exists conclusion: return value of the lemma function
  example: figure 6
    pass in arguments to avoid Z3 having to guess the quantifier assignment
      for a "forall"
    explicitly tell Z3 to "destruct" the [exists] promised by a lemma
  can do inductive proofs by recursively calling proof lemmas (fig 6)
    similar to manual proof-term construction in Coq

reasoning about liveness: Temporal Logic of Actions (TLA)
  goal is to reason about a sequence of states
  predicates talk about current + future states
    i.e., a predicate applies at a particular point in the state sequence,
      and describes what's going on in that state + going forward in time
  simple predicates: describe just the current state (much like in the lab
    assignments)
    P_h : host h currently holds the lock
  square (now and forever) and diamond (eventually) combinators
    [square] P_h
    [diamond] P_h
    [square] [diamond] P_h
    forall h \in Hosts, [square] [diamond] P_h
  helpful for stating liveness theorems
    [square] (if a request comes in, then [diamond] a response is sent out)

liveness proof
  roughly speaking, WF1 says: if action A takes the system from P to Q, and
    the system is fair (will eventually run A), then P eventually leads to Q
    minor technicality: WF1 requires that P keep holding, at least until Q
      holds
    more significant technicality: WF1 requires that A be enabled, and
      fairness is allowed to not run A if it's not enabled
      being "enabled" is an artifact of how TLA models actions
        think of Butler's description, where a step might not be enabled
          if PC <> alpha
  "always-enabled actions": no such restrictions; can always take a step
    easy transformation: just say if PC <> alpha, then nop
    it turns out this might lead to some theoretical problems (like having
      no possible impl)
    but that doesn't matter for IronFleet: there is a working impl,
      by construction
  with always-enabled actions, the host impl does a round-robin loop over
    the possible actions
    ensures fairness: every action is tried
    allows for a proof of liveness
  proof structure
    apply WF1 to chain together actions that eventually lead to
      progress/liveness
    similar to reasoning about sequential composition, with some
      non-determinism
    example: section 4.4
  liveness assumes each HostStep finishes (no proof)
    not clear why the paper claims this is a proof of liveness for the
      implementation
    explicit assumption that the host impl does not run out of memory
      running out of memory is not explicitly modeled in Dafny
      the process terminates if it runs out of memory, rather than having
        malloc() return NULL
    seems more like a proof of protocol liveness

interesting tidbits
  no recovery: crashed hosts do not rejoin the system
    would require modifying Dafny to reason about crashes, recovery, ..
  in fact, even crashes are not modeled explicitly
    instead, a crash seems to be a side-effect of the weak UDP spec
      might never receive messages sent by some host
      so we can consider that host as having crashed
  parts of the proof are not mechanized
    reduction for re-ordering
    step termination for liveness
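The always-enabled, round-robin host loop described above can be sketched as follows. This is a hypothetical Python toy, not IronFleet's Dafny main loop; the actions (`deliver`, `tick`) and the state fields are made up for illustration.

```python
# Hypothetical sketch of an always-enabled, round-robin host main loop:
# every action is attempted on each pass, so each action runs infinitely
# often -- the fairness that the WF1-based liveness proof relies on.
# An action whose guard doesn't hold simply behaves as a nop.
def deliver(state):
    # made-up action: process one queued message; nop if the queue is empty
    if state["inbox"] > 0:
        return dict(state, inbox=state["inbox"] - 1, done=state["done"] + 1)
    return state

def tick(state):
    # made-up action: advance a local clock; always applicable
    return dict(state, clock=state["clock"] + 1)

def round_robin(actions, state, rounds):
    attempts = {a.__name__: 0 for a in actions}
    for _ in range(rounds):
        for act in actions:            # fixed order: no action is starved
            attempts[act.__name__] += 1
            state = act(state)
    return state, attempts

final, attempts = round_robin([deliver, tick],
                              {"inbox": 3, "done": 0, "clock": 0}, rounds=5)
```

Because every action is attempted on every pass, a liveness argument in the WF1 style can assume each action is run infinitely often, without needing a separate "enabled" side condition.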