L1: POCS intro

* POCS: Principles of Computer Systems
  Discuss key ideas in computer systems
    Isolation, locking, logging, consistency, Paxos
    Ideas you have seen in 6.033, 6.828, 6.824, ...
  But in a principled manner
    With specifications and proofs

* Motivation: avoid bugs
  Systems are subtle and complex
    Concurrency
    Distributed
    Security
    Fault tolerance
    Performance
  Complexity leads to bugs
    Benign bugs
    But also serious ones:
      users may loose their data
      systems leaks confidential data

* Focus in POCS labs on fault tolerance
    but papers will touch on other challenges
  two kinds of failures 
    crash (e.g., power failures)
    defects/faults (e.g., dead disk, dead sector)
  assumption: fail-stop
    after failure, computer systems stops immediately
    on crash, computer reboots and one can run repair

* Example systems that must handle failures
  Single-node systems
    File systems
    Databases
    Key/value stores
  Distributed systems
    Distributed KVs
    ...
  Embedded systems
    Flash-translation layer (FTL) in SSDs

* A key challenge: crash-safety of persistent state
  Both in single-node and distributed systems
  System runs updating persistent state
  Crash happens
  Must recover to a correct state
    perhaps ok to loose some user data
    but internal system structure must survive

* Why is achieving crash safety hard?
  Persistent state becomes inconsistent
    E.g., failure during a two-step operation (e.g., updating free list and file)
    -> Block is marked allocated, but isn't in use
    -> Block is part of a file, but not marked allocated
  Storage runs concurrent with program
    Programs sends several disk request to disk, disk works on them
    After disk failure, which requests have been processed?
  Storage devices are slower than main memory
    Many complex optimizations to avoid disks writes
    Complexity leads to more opportunities for problems

* Real-world examples
  Linux file system study [http://pages.cs.wisc.edu/~ll/papers/fsstudy.pdf]
    FSCQ slides [https://pdos.csail.mit.edu/papers/fscq:sosp15-slides.pdf]
  SSD study
    [https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf]
  Bugs can have bad consequences

* Explanation of comments in slide [bugs.pdf]
  * Standard logging (slow but correct)
    data + metadata through log
    flush log, flush commit hdr
  * Log-bypass (first code comment)
    metadata through log, data direct
    flush data, flush log, flush commit hdr
    [can combine flush of data+log]
  * checksums (second code comment)
    put checksum(log) in hdr
    flush log+hdr together
    incompatible with log-bypass!

* How to find/avoid these bugs?
  Testing
    Works well!
    But doesn't demonstrate absence of bugs
  Model checking
    Check all symbolic states
    Hard to capture key properties without state explosion
  Verification
    Prove implementation has no bugs
  All approaches are active areas of research
    6.826 focuses on verification
    But we will read papers about the other approaches too

* Verification
  Three parts to verification
    Specification: a precise description of the desired behavior of a system
    Implementation: an implementation of a system
    Proof: mathematical proof showing implementation meets specification
  Old area, but has become hot again
    Verification technology has become good enough to tackle complex systems

* Real-world successes of verification
  CompCert
  Amazon's use of specs
  Maybe more in the future

* Specifications
  Properties of a good spec:
    Describe the behavior of the system
    Tell client and implementer what each need to know
    Precise enough to be able to prove theorems based on spec
    Leave out unnecessary requirements that
       may confuse clients
       make proofs intractable
  This is hard!
    Many ways to describe behavior of a system
    Easy to miss subtle cases
    Bugs in the specs are problematic
  Specifications on their own have much value
    See Amazon paper next week
    The old POCS notes [http://bwlampson.site/48-POCScourse/48-POCS2006.pdf]

* Implementations/Code
  Implementations are not much code in POCS
  POCS focuses on the core of a system
    with subtle issues
    sometimes complex ones too

* Proofs
  By hand or machine-checked
    Old POCS was pen&paper proofs
  New POCS focuses on machine-checked
    Write spec, code, and proof in proof assistant (Coq)
    Coq checks proof
  Doing a proof in Coq requires developing a proof argument
    Often done on paper
    Old POCS notes influenced much recent research in verified systems

* Class structure
  Labs to teach you to write specs and prove implementations correct
  Readings to teach you about the state of the art verification
    Many recent papers, fast developing area of research
  Grading: labs and class participation
    No quizzes
  Staff

===

Coq overview
  proof assistant
  expressive:
    functional programs
    abstract mathematical definitions
    theorems
    proofs
    scripting language for proofs
  assistant: checks proofs rather than generates all proofs
    can write scripts to automate proof generation
  can generate running code
    coq -> haskell -> executable binary

Demo
  Bank.v
    bank definition
      nat balances
      transfer
      extraction

    "Separate Extraction"
    look at Top.hs
    look at Main.hs

    ghc Main
    ./Main

    5 works
    20 is nonsense

    fix and prove no more bugs
      transfer'
      total
      theorem: total remains the same
      cases
      omega

toy example; real programs are more sophisticated in many dimensions
  mutable state (variables)
  I/O (disk, network, user)
  crashes
  concurrency
  assembly code
  .. but each dimension adds complexity

difficult to verify existing programs (say, Linux kernel)
  specs are complicated and messy, or not well-defined at all
  language is not a great fit for verification
    complex semantics of what C statements mean
    might run differently depending on which C compiler or flags are used
    ...
  code not structured for verification: too much complexity all at once
    mutable shared memory, IO, crashes, concurrency, assembly, ...
  verification effort is typically larger than effort to build new impl
    so, makes sense to rewrite impl to simplify verification

labs will explore how to deal with some of these complexities
  mutable state
  I/O
  crashes
  concurrency (limited form)
  no assembly code
  .. but carefully introducing each form of complexity

final lab: disk mirroring (RAID 0)

demo
  [ run everything as root ]
  ./bin/replicate-nbd init
  ./bin/replicate-nbd start &

  modprobe nbd
  nbd-client localhost /dev/nbd0
  mkfs -t ext4 /dev/nbd0
  mount /dev/nbd0 /mnt/x
  cd /mnt/x

  echo hello world > test.txt
  sync
  strings disk0.img
  strings disk1.img

  rm disk0.img
  echo another test > foo.txt
  sync
  strings disk1.img

  cat test.txt

  umount /mnt/x
  nbd-client -d /dev/nbd0