IronFleet: Proving Practical Distributed Systems Correct
========================================================

[ source code: https://github.com/Microsoft/Ironclad ]

lock example:
  ironfleet/src/Dafny/Distributed/Impl/Lock
  ironfleet/src/Dafny/Distributed/Protocol/Lock
  ironfleet/src/Dafny/Distributed/Protocol/Lock/RefinementProof
  ironfleet/src/Dafny/Distributed/Services/Lock

why this paper?
  distributed systems have subtle bugs
    similar to the bugs discussed in the Amazon paper earlier this semester
    but IronFleet extends verification to the implementation too
  one of a few examples of verified distributed systems
    must handle concurrency, partial failures, ..
  practical focus
    goal is a working, reasonably performant system
    not necessarily focused on 100% machine-checked proofs
  combines several approaches
    reduction (movers), reminiscent of the CSPEC paper
  different tools: Dafny, Z3
    more automated, vs. less direct control?
    Dafny is hard-coded to sequential verification
      but can mostly be coaxed into verifying distributed systems

complexity in verifying a distributed system
  concurrency
  partial failure

top-level plan: separate reasoning about the distributed protocol vs. the
  implementation
  "concurrency containment strategy"
    don't run things concurrently on a single host
      i.e., no CPU concurrency, no threads, etc.
      sequential, single-threaded packet processing
    we still have concurrency because of multiple hosts, but try to separate
      reasoning about concurrency+failure from reasoning about sequential code
  allows sequential reasoning (Hoare logic)
    much like the 6.826 lab assignments
  reason about distributed-systems aspects (concurrency, failure) separately

concrete steps for this plan [paper figure 3]
  1. write down a specification for how each host behaves in the protocol:
     abstract state for each host + what the host does with messages
  2. prove that the host implementation refines this protocol specification:
     use sequential reasoning, like [step_proc] in the labs, with better
     automation from the Dafny tool and the Z3 solver
  3. prove that the combination of hosts (represented by their protocol-level
     specs) together refines the system-wide specification

what do their specifications and theorems look like?
  closely matches Butler's view: initial state and step relation
    both for the impl and for the spec
  minor difference: no notion of a "trace" or "externally visible actions"
    instead, the abstraction function must connect whatever the developer
      considers to be "externally visible actions" (pretty much packets here)
      see "SpecRelation" in section 3.1
    the low-level network model (UDP) exposes a ghost variable: all packets
      sent/received
      convenient choice of externally visible action (in the protocol
        refinement)
      convenient for writing the abstraction function to whole-system spec
        state
    likely a side-effect of using the Dafny tool as the baseline
      Dafny is designed for sequential code verification
      no built-in notion of actions or a trace
      instead, procedures have pre- and post-conditions ("requires",
        "ensures")
    the authors have found a way to "retrofit" traces into Dafny
      a ghost variable contains the trace
      pre- and post-conditions say the state corresponds to the current trace
      top-level "step loop" (figure 8) that is "assumed to be correct"
  refinement proof structure is the same as Butler's plan
    each implementation step corresponds to taking 0 or more spec steps
    trace subsumed by state: no separate requirement about externally
      visible actions
      but the abstraction function is part of the TCB, because it defines
        what is externally visible

reduction: achieving sequential reasoning [section 3.6]
  the real world runs code concurrently on multiple hosts
  however, if there aren't any interactions between the hosts, we can
    ignore the concurrency
  more formally, re-order the concurrent bits of execution to make it look
    sequential [figure 7]
    similar to Butler's commutativity reasoning, and to right/left movers
      from the CSPEC paper
  the authors don't do a machine-checked proof of this commutativity/
    reduction argument
    instead, they prove a "reduction-enabling obligation"
    they proved on paper that, if this obligation holds, they can do the
      reduction
    but reduction is likely hard to code up in Dafny
      Dafny has a hard-coded sequential, single-host execution model
      contrast with Coq: the exec model is defined inside Coq, easy to change
        downside of the Coq approach: have to rebuild the entire tooling
          ecosystem each time..
    the IronFleet authors later proved, in Dafny, that this obligation is
      sufficient
  the reduction-enabling obligation is basically the same as the mover plan
    from the CSPEC paper
    each implementation step must first receive all messages (right-movers),
      then do zero or one time-dependent operations (non-movers), followed
      by sending out all messages (left-movers)
    other local-computation steps can move arbitrarily
      they commute with all actions on other hosts
    looks like the mover pattern from CSPEC
  mechanizing this obligation:
    the implementation of a step returns a list of events
      presumably things like "recv pkt", "send pkt", "read time"
    the event loop requires that this list be properly ordered, as above
      (figure 8)

what's abstracted away in the protocol spec?
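The event-ordering obligation from the reduction discussion above can be sketched as a simple check over the event list a step produces. This is a hypothetical Python illustration, not IronFleet's actual Dafny formulation; the event names are made up to match the examples in the notes.

```python
# Hypothetical sketch of the reduction-enabling obligation, not IronFleet's
# actual Dafny code: a step's event list must be all receives (right-movers),
# then at most one time-dependent operation (non-mover), then all sends
# (left-movers); pure local computation may appear anywhere.
def obeys_reduction_obligation(events):
    NON_MOVERS = {"read time"}
    phase = 0            # 0 = receiving, 1 = after the non-mover, 2 = sending
    for ev in events:
        if ev == "recv pkt":
            if phase != 0:
                return False      # a receive after a non-mover or a send
        elif ev in NON_MOVERS:
            if phase != 0:
                return False      # at most one non-mover, before any send
            phase = 1
        elif ev == "send pkt":
            phase = 2
        # anything else: local computation, commutes with other hosts' actions
    return True
```

An event loop like the one in figure 8 would enforce a property of this shape on the list of events returned by each host step.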
  [section 3.2]
  finite data structures
  functional / relational description, rather than code with mutable memory
  atomicity of steps

proving the overall system spec
  abstraction function connects the multiple host protocol states into a
    single system state
  lock example: history of lock holders
    the current lock holder is the host that believes it holds the lock
    but there might also be an in-flight packet transferring the lock
      need to define who is said to own the lock in that case
  IronRSL: the spec executes when a quorum of replicas agrees
    the actual replica execution is a nop as far as the spec is concerned
  explicitly state a theorem like figure 1

using Dafny and Z3
  triggers: hint at what values to choose for the quantifier when using a
    [forall] statement
  manual unfolding / opaque annotations
  explicit lemma invocation
    lemmas are just functions without any executable code in the body (nop)
    forall quantifiers: arguments of the lemma
    premise of a lemma: "requires" clause, or pre-condition
    conclusion of a lemma: "ensures" clause, or post-condition
    exists conclusion: return value of the lemma function
  example: figure 6
    pass in arguments to avoid Z3 having to guess the quantifier assignment
      for a "forall"
    explicitly tell Z3 to "destruct" the [exists] promised by a lemma
  can do inductive proofs by recursively calling proof lemmas (fig 6)
    similar to manual proof-term construction in Coq

reasoning about liveness: Temporal Logic of Actions (TLA)
  goal is to reason about a sequence of states
  predicates talk about current + future states
    i.e., a predicate applies at a particular point in the state sequence,
      and describes what's going on in that state + going forward in time
  simple predicates: describe just the current state (much like in the lab
    assignments)
    P_h : host h currently holds the lock
  square (now and forever) and diamond (eventually) combinators
    [square] P_h
    [diamond] P_h
    [square] [diamond] P_h
    forall h \in Hosts, [square] [diamond] P_h
  helpful for stating liveness theorems
    [square] (if a request comes in, then [diamond] a response is sent out)

liveness proof
  roughly speaking, WF1 says: if action A takes the system from P to Q, and
    the system is fair (will eventually run A), then P eventually leads to Q
    minor technicality: WF1 requires that P keep holding, at least until Q
      holds
    more significant technicality: WF1 requires that A be enabled, and
      fairness is allowed to not run A if it's not enabled
      being "enabled" is an artifact of how TLA models actions
        think of Butler's description, where a step might not be enabled
          if PC <> alpha
  "always-enabled actions": no such restrictions; can always take a step
    easy transformation: just say if PC <> alpha, then nop
    it turns out this might lead to some theoretical problems (like having
      no possible impl)
    but that doesn't matter for IronFleet: there is a working impl,
      by construction
  with always-enabled actions, the host impl does a round-robin loop over
    the possible actions
    ensures fairness: every action is tried
    allows for a proof of liveness
  proof structure
    apply WF1 to chain together actions that eventually lead to
      progress/liveness
    similar to reasoning about sequential composition, with some
      non-determinism
    example: section 4.4
  liveness assumes each HostStep finishes (no proof)
    not clear why the paper claims this is a proof of liveness for the
      implementation
    explicit assumption that the host impl does not run out of memory
      running out of memory is not explicitly modeled in Dafny
      the process terminates if it runs out of memory, rather than having
        malloc() return NULL
    seems more like a proof of protocol liveness

interesting tidbits
  no recovery: crashed hosts do not rejoin the system
    would require modifying Dafny to reason about crashes, recovery, ..
  in fact, even crashes are not modeled explicitly
    instead, a crash seems to be a side-effect of the weak UDP spec
      might never receive messages sent by some host
      so we can consider that host as having crashed
  parts of the proof are not mechanized
    reduction for re-ordering
    step termination for liveness
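The always-enabled, round-robin host loop described above can be sketched as follows. This is a hypothetical Python toy, not IronFleet's Dafny main loop; the actions (`deliver`, `tick`) and the state fields are made up for illustration.

```python
# Hypothetical sketch of an always-enabled, round-robin host main loop:
# every action is attempted on each pass, so each action runs infinitely
# often -- the fairness that the WF1-based liveness proof relies on.
# An action whose guard doesn't hold simply behaves as a nop.
def deliver(state):
    # made-up action: process one queued message; nop if the queue is empty
    if state["inbox"] > 0:
        return dict(state, inbox=state["inbox"] - 1, done=state["done"] + 1)
    return state

def tick(state):
    # made-up action: advance a local clock; always applicable
    return dict(state, clock=state["clock"] + 1)

def round_robin(actions, state, rounds):
    attempts = {a.__name__: 0 for a in actions}
    for _ in range(rounds):
        for act in actions:            # fixed order: no action is starved
            attempts[act.__name__] += 1
            state = act(state)
    return state, attempts

final, attempts = round_robin([deliver, tick],
                              {"inbox": 3, "done": 0, "clock": 0}, rounds=5)
```

Because every action is attempted on every pass, a liveness argument in the WF1 style can assume each action is run infinitely often, without needing a separate "enabled" side condition.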