FSCQ
====

Paper: "Using Crash Hoare Logic for Certifying the FSCQ File System",
  Chen et al. (SOSP'15)
Source code: https://github.com/mit-pdos/fscq

Why are we reading this paper?
  Example of systems verification
  File systems are critical infrastructure
  File systems have bugs
  Reasoning about crashes
    Crashes show up in lab 3 and 4
    Abstraction to capture crash non-determinism
    Asynchronous disk
  File system proven in Coq
  A lot of the class infrastructure was inspired by this paper / line of work
  Revisit some big database ideas
    Logging (will show up in lab 3)
  Somewhat different style of paper from most of the other papers we've seen
    Not about a deployed system
    Not used in industry
    Prototype demonstrates ideas but is not in itself the point of the paper
    Similar to SibylFS; very different from the Amazon, Google, Facebook papers
    As we push into state-of-the-art verification, expect more papers like this

System overview
  Similar to labs, but FUSE instead of nbd
  Spec, program, and proof written in Coq
  Extract to Haskell
  Link with the FUSE library and run as a server on Linux
  Good: run standard Linux programs
  Bad: large TCB (Linux, Haskell, etc.)

Goal: crash safety
  File system runs, power fails, computer reboots
  User can use file system again, as if crash didn't happen
    Storage devices power up correctly
    File system's internal data structures are correct
    User data is present (except recent files that haven't been saved yet)

Risk: crash exposes subtle bugs
  Append to a file requires three disk writes:
    1. write data
    2. write inode
    3. write free block map
  Crash might expose intermediate states (see the sketch after the
      contribution list below)
    E.g., crash after 1 and 2 happened, but before 3
    What to do in this case?  Is the block free or not?
  Changing the order doesn't help
    3, 1, 2
    What if crash after 3?

Standard solution: logging
  Clean, principled solution used by many file systems
  Never update the file system in place
  First write to the log, commit, apply log to the file system
  Recovery after crash: read log, apply to the file system
    If crash before commit, then don't apply (empty log)
  Not easy to implement (in particular if you want performance)
    e.g., bugs in ext4

Contribution of paper
  Verifying crash safety of a file system using logging
  Verifying logging
    What does correct mean?
    How to model the disk?
    How to model crashes?
    How to model reboot?
    How to prove correctness?
  Without failures
    Must prove core subset of POSIX correct
      open, close, read, write, create, mkdir, etc.
    Requires reasoning about directories, inodes, etc.
  With failures
    Correctness of logging
      Atomicity: all writes happen or none happen
        Prevents incomplete updates (like the file-append example above)
        Unit of granularity is a system call (write, rename, ...)
      Durability: changes will persist after a crash
        Simple logging provides immediate durability
        Later work (DFSCQ) implements deferred durability for performance
      Concurrency: ignored in this work; assume sequential execution
        Still an open problem, though there is some work in this space
        (e.g., Perennial)
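To make the append example concrete, here is a toy Python sketch of the
three-write append. This is not FSCQ code: the one-file "disk" layout and
all names are invented for illustration. The generator yields after each
write so the caller can stop ("crash") in between.

  def append_block(disk, data):
      # One append = three writes: data block, inode, free block map.
      disk["data"].append(data)       # 1. write data
      yield
      disk["inode_len"] += 1          # 2. write inode
      yield
      disk["free_blocks"] -= 1        # 3. write free block map
      yield

  def state_if_crash_after(steps):
      disk = {"data": [], "inode_len": 0, "free_blocks": 8}
      op = append_block(disk, "x")
      for _ in range(steps):
          next(op)
      return disk

  print(state_if_crash_after(2))
  # {'data': ['x'], 'inode_len': 1, 'free_blocks': 8}
  # The inode claims the new block, but the free map still says it is free:
  # exactly the inconsistent intermediate state described above. Reordering
  # the writes (3, 1, 2) just moves the problem: a crash after step 3 alone
  # leaks a block.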
Modeling disk
  Simplified compared to real disks
  Operations: read, write, sync
    The labs have only synchronous read and write
  Specifications for operations
    Pre- and postconditions
    Example: synchronous write
      {a |-> v0}
        disk_write(a, v)
      {a |-> v}
    Separation logic: |->, as seen in previous paper
      Only address a changed and nothing else
    We don't use separation logic in the labs
      The labs' specs talk about the whole state
      Separation logic allows for more succinct specs
        E.g., when updating several locations

Asynchronous disks
  Volatile buffer of pending writes
    write() adds to buffer
    read() gets latest value from buffer
    sync() flushes buffer to durable storage
    crash() loses this buffer
    Disk can also flush in the background, not just on sync()
  [[ Breakout Q: How does FSCQ model this? ]]

Modeling asynchronous disk
  { a |-> 0 }
    write(a, 10); write(a, 11)
  { a |-> ? }
  What can be on disk if crash in this sequence?
    0, 10, and 11
  What will read(a) return?
    { a |-> 0 }
      write(a, 10); r = read(a)
    { r = ?? }
    r = 10 (i.e., last value written)
  Idea: model with value sets (sketched in code below)
    In the model, a disk sector is a list of values; the latest write is last
    { a |-> [0] }
      write(a, 10)
    { a |-> [0, 10] }
      write(a, 11)
    { a |-> [0, 10, 11] }
    Subsequent read returns 11
  Sync flushes the set
    { a |-> [0] }
      write(a, 10); write(a, 11); sync()
    { a |-> [11] }
  After crash
    Any value: either the latest write or any prior write

Describing disk crashes
  Idea: add a crash condition to the pre- and postcondition
  Example crash condition:
    { a |-> [0] }
      write(a, 10)
    { a |-> [0, 10] }
    CRASH { a |-> [0] \/ a |-> [0, 10] }
  Models atomic sector write
    Common assumption in file systems
    Disks and SSDs try to ensure sector atomicity
  Two steps to reasoning about crashes:
    Crash condition describes the point at which we might crash,
      not the state in which we would then reboot the computer
    On recovery, the model chooses a specific value for each sector
      E.g., either a |-> [0] or a |-> [10]
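Below is a minimal Python model of the value-set idea above, for intuition
only; AsyncDisk and its API are names invented here, not FSCQ's Coq
definitions. A crash keeps some value from each sector's set, which also
covers background flushing (a flush merely drops older values from a set).

  import random

  class AsyncDisk:
      """Each sector holds a list of values; the last entry is the latest write."""
      def __init__(self, nsectors):
          self.sectors = {a: [0] for a in range(nsectors)}

      def write(self, a, v):
          self.sectors[a].append(v)    # buffered: older values may still be on disk

      def read(self, a):
          return self.sectors[a][-1]   # reads always see the latest write

      def sync(self):
          for a in self.sectors:       # flush: only the latest value survives
              self.sectors[a] = [self.sectors[a][-1]]

      def crash(self):
          for a in self.sectors:       # crash keeps an arbitrary value per sector
              self.sectors[a] = [random.choice(self.sectors[a])]

  # Mirrors the { a |-> [0, 10, 11] } example: read returns 11, but a
  # crash may leave 0, 10, or 11 on disk; after sync(), only 11 survives.
  d = AsyncDisk(1)
  d.write(0, 10); d.write(0, 11)
  assert d.read(0) == 11
  d.crash()
  assert d.read(0) in (0, 10, 11)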
Atomicity
  atomic_two_write spec
    Topic of lab 3
    Spec in figure 4, using logging
  One implementation of commit:
    write (a1, v1) and (a2, v2) to the log
    write the commit block
    apply the log: write to a1, a2
  Correctness of implementation
    no crash: a1 and a2 are updated
    crash: log is intact
      must recover the log after crash:
        if the header says committed, write a1, a2

Log abstraction
  Abstract state on top of the log: two logical disks
    First logical disk represents the state that's committed
    Second logical disk represents the state that's not committed yet
    These disks no longer have asynchronous sectors; just one value for
      each address
    log_read and log_write operate on the second logical disk
  Figure 10
    Separation logic at multiple levels: raw disk, logical disk
  log_commit copies the second logical disk to the first
  recovery copies the first logical disk to the second
  log_rep(ActiveTxn, start_state, cur_state) is a separation-logic predicate
      describing the raw disk
    Figure 5
    Log header
    Log itself: not described, because empty (not committed)
    Applied disk writes
  Other states (instead of ActiveTxn) reflect when the log has been written to
    LogFlushed, CommittedUnsync, CommittedSync, AppliedUnsync, AppliedSync
  log_intact(start_state, new_state)
    Combination of these other states

Proving procedures
  Sequences of disk reads, writes, and syncs
  Computation in Gallina, as in labs
  Example:
    recovery():  // simplified!
      r = read(COMMIT)
      if r then:
        write(a1, v1)
        write(a2, v2)
        sync()
        write(COMMIT, 0)
  Postcondition (no crashes)
    Chain pre- and postconditions as in lab [step_proc]
  Crash condition
    OR of the crash conditions of all operations in the procedure

Reboot and recovery
  Figure 6
  What is the precondition of recovery()?
    log_intact(last_state, committed_state)
  What is the postcondition of recovery()?
    log_rep(NoTxn, last_state) \/ log_rep(NoTxn, committed_state)
  What is the crash condition of recovery()?
    log_intact(last_state, committed_state)
  Recovery is idempotent (exercised in the sketch at the end of these notes)
    crash condition => precondition
    -> OK to crash during recovery and run recovery again

File system implementation
  Inodes, directories, block allocators, etc.
  Built on top of the log abstraction
    Everything stored in the log's logical disk
    Log provides atomicity + crash safety
  File system code provides the abstraction between disk and FS tree
  Example: specification of rename (fig 13)
    Putting normal and recovery execution together

Performance
  OK compared to synchronous file systems
  Bad compared to asynchronous file systems
    How to specify asynchronous file systems?
    See the follow-on DFSCQ paper (SOSP'17):
      https://people.csail.mit.edu/nickolai/papers/chen-dfscq.pdf
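To exercise the idempotence argument, here is a toy, fully synchronous
Python model of atomic_two_write plus recovery. The disk layout and all
names are invented, and syncs are elided; real FSCQ reasons about this in
Coq over asynchronous sectors. The test crashes after every write of the
operation, crashes again during the first recovery, reruns recovery, and
checks that the two writes are all-or-nothing.

  LOG1, LOG2, COMMIT, A1, A2 = 0, 1, 2, 3, 4   # toy disk layout

  def atomic_two_write(disk, v1, v2):
      """Yields after each disk write so the caller can inject a crash."""
      disk[LOG1] = v1; yield                   # write log
      disk[LOG2] = v2; yield
      disk[COMMIT] = 1; yield                  # commit point
      disk[A1] = disk[LOG1]; yield             # apply log
      disk[A2] = disk[LOG2]; yield
      disk[COMMIT] = 0; yield                  # truncate log

  def recovery(disk):
      if disk[COMMIT]:                         # committed: re-apply the log
          disk[A1] = disk[LOG1]; yield
          disk[A2] = disk[LOG2]; yield
          disk[COMMIT] = 0; yield

  def run_prefix(proc, steps):
      """Run a procedure, stopping ("crashing") after `steps` writes."""
      for _ in range(steps):
          try:
              next(proc)
          except StopIteration:
              break

  for crash_at in range(7):                    # crash anywhere in the operation
      for recrash_at in range(4):              # ... and anywhere in recovery
          disk = [0, 0, 0, 10, 20]             # initially a1 = 10, a2 = 20
          run_prefix(atomic_two_write(disk, 11, 21), crash_at)
          run_prefix(recovery(disk), recrash_at)   # first recovery may crash too
          run_prefix(recovery(disk), 99)           # rerun recovery to completion
          assert (disk[A1], disk[A2]) in [(10, 20), (11, 21)]  # all or nothing
  print("atomicity holds at every crash point")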