FSCQ
====

Paper: "Using Crash Hoare Logic for Certifying the FSCQ File System",
  Chen et al. (SOSP'15)
Source code: https://github.com/mit-pdos/fscq

Why are we reading this paper?
  Example of systems verification
  File systems are critical infrastructure
  File systems have bugs
  Reasoning about crashes
    Crashes show up in lab 3 and 4
    Abstraction to capture crash non-determinism
    Asynchronous disk
  File system proven in Coq
  A lot of the class infrastructure was inspired by this paper / line of work
  Revisit some big database ideas
    Logging (will show up in lab 3)
  Somewhat different style of paper from most of the other papers we've seen
    Not about a deployed system
    Not used in industry
    Prototype demonstrates ideas but is not in itself the point of the paper
    Similar to SibylFS; very different from the Amazon, Google, Facebook papers
    As we push into state-of-the-art verification, expect more papers like this

System overview
  Similar to labs, but FUSE instead of nbd
  Spec, program, and proof written in Coq
  Extract to Haskell
  Link with the FUSE library and run as a server on Linux
  Good: run standard Linux programs
  Bad: large TCB (Linux, Haskell, etc.)

Goal: crash safety
  File system runs, power fails, computer reboots
  User can use file system again, as if crash didn't happen
    Storage devices power up correctly
    File system's internal data structures are correct
    User data is present (except recent files that haven't been saved yet)

Risk: crash exposes subtle bugs
  Append to a file requires three disk writes:
    1. write data
    2. write inode
    3. write free block map
  Crash might expose intermediate states (see the sketch after the
      contribution list below)
    E.g., crash after 1 and 2 happened, but before 3
    What to do in this case?  Is the block free or not?
  Changing the order doesn't help
    3, 1, 2
    What if crash after 3?

Standard solution: logging
  Clean, principled solution used by many file systems
  Never update the file system in place
  First write to the log, commit, apply log to the file system
  Recovery after crash: read log, apply to the file system
    If crash before commit, then don't apply (empty log)
  Not easy to implement (in particular if you want performance)
    e.g., bugs in ext4

Contribution of paper
  Verifying crash safety of a file system using logging
  Verifying logging
    What does correct mean?
    How to model the disk?
    How to model crashes?
    How to model reboot?
    How to prove correctness?
  Without failures
    Must prove core subset of POSIX correct
      open, close, read, write, create, mkdir, etc.
    Requires reasoning about directories, inodes, etc.
  With failures
    Correctness of logging
      Atomicity: all writes happen or none happen
        Prevents incomplete updates (like the file-append example above)
        Unit of granularity is a system call (write, rename, ...)
      Durability: changes will persist after a crash
        Simple logging provides immediate durability
        Later work (DFSCQ) implements deferred durability for performance
      Concurrency: ignored in this work; assume sequential execution
        Still an open problem, though there is some work in this space
        (e.g., Perennial)
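To make the append example concrete, here is a toy Python sketch of the
three-write append. This is not FSCQ code: the one-file "disk" layout and
all names are invented for illustration. The generator yields after each
write so the caller can stop ("crash") in between.

  def append_block(disk, data):
      # One append = three writes: data block, inode, free block map.
      disk["data"].append(data)       # 1. write data
      yield
      disk["inode_len"] += 1          # 2. write inode
      yield
      disk["free_blocks"] -= 1        # 3. write free block map
      yield

  def state_if_crash_after(steps):
      disk = {"data": [], "inode_len": 0, "free_blocks": 8}
      op = append_block(disk, "x")
      for _ in range(steps):
          next(op)
      return disk

  print(state_if_crash_after(2))
  # {'data': ['x'], 'inode_len': 1, 'free_blocks': 8}
  # The inode claims the new block, but the free map still says it is free:
  # exactly the inconsistent intermediate state described above. Reordering
  # the writes (3, 1, 2) just moves the problem: a crash after step 3 alone
  # leaks a block.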
Modeling disk
  Simplified compared to real disks
  Operations: read, write, sync
    The labs have only synchronous read and write
  Specifications for operations
    Pre- and postconditions
    Example: synchronous write
      {a |-> v0}
        disk_write(a, v)
      {a |-> v}
    Separation logic: |->, as seen in previous paper
      Only address a changed and nothing else
    We don't use separation logic in the labs
      The labs' specs talk about the whole state
      Separation logic allows for more succinct specs
        E.g., when updating several locations

Asynchronous disks
  Volatile buffer of pending writes
    write() adds to buffer
    read() gets latest value from buffer
    sync() flushes buffer to durable storage
    crash() loses this buffer
    Disk can also flush in the background, not just on sync()
  [[ Breakout Q: How does FSCQ model this? ]]

Modeling asynchronous disk
  { a |-> 0 }
    write(a, 10); write(a, 11)
  { a |-> ? }
  What can be on disk if crash in this sequence?
    0, 10, and 11
  What will read(a) return?
    { a |-> 0 }
      write(a, 10); r = read(a)
    { r = ?? }
    r = 10 (i.e., last value written)
  Idea: model with value sets (sketched in code below)
    In the model, a disk sector is a list of values; the latest write is last
    { a |-> [0] }
      write(a, 10)
    { a |-> [0, 10] }
      write(a, 11)
    { a |-> [0, 10, 11] }
    Subsequent read returns 11
  Sync flushes the set
    { a |-> [0] }
      write(a, 10); write(a, 11); sync()
    { a |-> [11] }
  After crash
    Any value: either the latest write or any prior write

Describing disk crashes
  Idea: add a crash condition to the pre- and postcondition
  Example crash condition:
    { a |-> [0] }
      write(a, 10)
    { a |-> [0, 10] }
    CRASH { a |-> [0] \/ a |-> [0, 10] }
  Models atomic sector write
    Common assumption in file systems
    Disks and SSDs try to ensure sector atomicity
  Two steps to reasoning about crashes:
    Crash condition describes the point at which we might crash,
      not the state in which we would then reboot the computer
    On recovery, the model chooses a specific value for each sector
      E.g., either a |-> [0] or a |-> [10]
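Below is a minimal Python model of the value-set idea above, for intuition
only; AsyncDisk and its API are names invented here, not FSCQ's Coq
definitions. A crash keeps some value from each sector's set, which also
covers background flushing (a flush merely drops older values from a set).

  import random

  class AsyncDisk:
      """Each sector holds a list of values; the last entry is the latest write."""
      def __init__(self, nsectors):
          self.sectors = {a: [0] for a in range(nsectors)}

      def write(self, a, v):
          self.sectors[a].append(v)    # buffered: older values may still be on disk

      def read(self, a):
          return self.sectors[a][-1]   # reads always see the latest write

      def sync(self):
          for a in self.sectors:       # flush: only the latest value survives
              self.sectors[a] = [self.sectors[a][-1]]

      def crash(self):
          for a in self.sectors:       # crash keeps an arbitrary value per sector
              self.sectors[a] = [random.choice(self.sectors[a])]

  # Mirrors the { a |-> [0, 10, 11] } example: read returns 11, but a
  # crash may leave 0, 10, or 11 on disk; after sync(), only 11 survives.
  d = AsyncDisk(1)
  d.write(0, 10); d.write(0, 11)
  assert d.read(0) == 11
  d.crash()
  assert d.read(0) in (0, 10, 11)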
Atomicity
  atomic_two_write spec
    Topic of lab 3
    Spec in figure 4, using logging
  One implementation of commit:
    write (a1, v1) and (a2, v2) to the log
    write the commit block
    apply the log: write to a1, a2
  Correctness of implementation
    no crash: a1 and a2 are updated
    crash: log is intact
      must recover the log after crash:
        if the header says committed, write a1, a2

Log abstraction
  Abstract state on top of the log: two logical disks
    First logical disk represents the state that's committed
    Second logical disk represents the state that's not committed yet
    These disks no longer have asynchronous sectors; just one value for
      each address
    log_read and log_write operate on the second logical disk
  Figure 10
    Separation logic at multiple levels: raw disk, logical disk
  log_commit copies the second logical disk to the first
  recovery copies the first logical disk to the second
  log_rep(ActiveTxn, start_state, cur_state) is a separation-logic predicate
      describing the raw disk
    Figure 5
    Log header
    Log itself: not described, because empty (not committed)
    Applied disk writes
  Other states (instead of ActiveTxn) reflect when the log has been written to
    LogFlushed, CommittedUnsync, CommittedSync, AppliedUnsync, AppliedSync
  log_intact(start_state, new_state)
    Combination of these other states

Proving procedures
  Sequences of disk reads, writes, and syncs
  Computation in Gallina, as in labs
  Example:
    recovery():  // simplified!
      r = read(COMMIT)
      if r then:
        write(a1, v1)
        write(a2, v2)
        sync()
        write(COMMIT, 0)
  Postcondition (no crashes)
    Chain pre- and postconditions as in lab [step_proc]
  Crash condition
    OR of the crash conditions of all operations in the procedure

Reboot and recovery
  Figure 6
  What is the precondition of recovery()?
    log_intact(last_state, committed_state)
  What is the postcondition of recovery()?
    log_rep(NoTxn, last_state) \/ log_rep(NoTxn, committed_state)
  What is the crash condition of recovery()?
    log_intact(last_state, committed_state)
  Recovery is idempotent (exercised in the sketch at the end of these notes)
    crash condition => precondition
    -> OK to crash during recovery and run recovery again

File system implementation
  Inodes, directories, block allocators, etc.
  Built on top of the log abstraction
    Everything stored in the log's logical disk
    Log provides atomicity + crash safety
  File system code provides the abstraction between disk and FS tree
  Example: specification of rename (fig 13)
    Putting normal and recovery execution together

Performance
  OK compared to synchronous file systems
  Bad compared to asynchronous file systems
    How to specify asynchronous file systems?
    See the follow-on DFSCQ paper (SOSP'17):
      https://people.csail.mit.edu/nickolai/papers/chen-dfscq.pdf
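To exercise the idempotence argument, here is a toy, fully synchronous
Python model of atomic_two_write plus recovery. The disk layout and all
names are invented, and syncs are elided; real FSCQ reasons about this in
Coq over asynchronous sectors. The test crashes after every write of the
operation, crashes again during the first recovery, reruns recovery, and
checks that the two writes are all-or-nothing.

  LOG1, LOG2, COMMIT, A1, A2 = 0, 1, 2, 3, 4   # toy disk layout

  def atomic_two_write(disk, v1, v2):
      """Yields after each disk write so the caller can inject a crash."""
      disk[LOG1] = v1; yield                   # write log
      disk[LOG2] = v2; yield
      disk[COMMIT] = 1; yield                  # commit point
      disk[A1] = disk[LOG1]; yield             # apply log
      disk[A2] = disk[LOG2]; yield
      disk[COMMIT] = 0; yield                  # truncate log

  def recovery(disk):
      if disk[COMMIT]:                         # committed: re-apply the log
          disk[A1] = disk[LOG1]; yield
          disk[A2] = disk[LOG2]; yield
          disk[COMMIT] = 0; yield

  def run_prefix(proc, steps):
      """Run a procedure, stopping ("crashing") after `steps` writes."""
      for _ in range(steps):
          try:
              next(proc)
          except StopIteration:
              break

  for crash_at in range(7):                    # crash anywhere in the operation
      for recrash_at in range(4):              # ... and anywhere in recovery
          disk = [0, 0, 0, 10, 20]             # initially a1 = 10, a2 = 20
          run_prefix(atomic_two_write(disk, 11, 21), crash_at)
          run_prefix(recovery(disk), recrash_at)   # first recovery may crash too
          run_prefix(recovery(disk), 99)           # rerun recovery to completion
          assert (disk[A1], disk[A2]) in [(10, 20), (11, 21)]  # all or nothing
  print("atomicity holds at every crash point")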