6.826: Principles of Computer Systems (POCS)
============================================

Administrivia: Recording, will post online.

Q to start discussion: where are you at, what's your background?

This class: computer systems from a verification perspective.
  Isolation, locking, logging, consistency, Paxos, non-interference, ..
  Key ideas you have seen in 6.033, 6.828, 6.824, ...
  6.826 will try to discuss them in a principled manner.
  Particular focus on specification and proofs.
    What does it mean for a system to be correct?
    What exact properties does it provide?
    Is the implementation correct, or does it have bugs?

Motivation: avoid bugs.
  Systems have to deal with complex issues and requirements
    Concurrency
    Distribution
    Security
    Fault tolerance
    Performance
    Evolution over time
  Complexity leads to bugs
  Wide range of bugs
    Some bugs are "simple" mistakes like memory errors or deadlock
    Some bugs are subtle issues with consistency, crash recovery, corner cases, ...
  Wide range of issues caused by bugs
    Some bugs are benign, but others cause problems for users
      User loses their data
      User data corrupted
      System leaks confidential data
      Adversary gets control of the system

Example: complexity due to fault tolerance in storage systems.
  Widely studied problem.
  This class will have lab assignments that focus on fault tolerance.
  Papers in lecture will touch on other challenges.

Two kinds of failures
  system crashes and reboots (e.g., power failures)
  system has a defect and keeps running (e.g., dead disk, dead sector)

Assumption: clear failure model, typically "fail-stop"
  on failure, stop immediately
    when the system crashes, the computer stops immediately, doesn't run any more code
    computer reboots and runs a repair procedure before running again
    when a disk crashes, the disk stops immediately, doesn't service requests
  (Alternative: Byzantine failures: keep running but produce garbage.)

Example systems that must handle failures:
  Single-node systems
    File systems
    Databases
    Key/value stores
  Distributed systems
    Distributed key-value stores
    ...
  Embedded systems
    Flash-translation layer (FTL) in SSDs

Key challenge: crash-safety of persistent state.
  Both in single-node and distributed systems
  System runs, updating persistent state
  Crash happens
  Must recover to a correct state
    perhaps ok to lose some user data (but should be precise about guarantees)
    internal system structure must survive (enough to keep operating)

Why is achieving crash safety hard?
  Persistent state becomes inconsistent
    E.g., failure during a two-step operation (e.g., updating free list and file)
      (see the sketch after this list)
    -> Block is marked allocated, but isn't in use
    -> Block is part of a file, but not marked allocated
  Storage runs concurrently with the program
    Program sends several disk requests to the disk, disk works on them
    After a disk failure, which requests have been processed?
  Storage devices are slower than main memory
    Many complex optimizations to avoid disk writes
    Complexity leads to more opportunities for problems
  Only one computer in a distributed system fails
    Key-value store seemed to process a write
      Reads seem to return the new value
    In the meantime, the KV server crashes and loses the in-memory write!
      Now reads appear to jump back to the old value
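A minimal Coq sketch of the two-step problem above.  The names (fs_state,
mark_allocated, add_to_inode, append_block) are made up for illustration;
the labs and real file systems use different representations.

    (* A file-system state: an allocation bitmap plus the blocks of one file. *)
    Record fs_state := mkFS {
      free_bitmap  : nat -> bool;   (* block number -> allocated? *)
      inode_blocks : list nat       (* blocks belonging to the file *)
    }.

    (* Step 1: mark block [blk] as allocated in the bitmap. *)
    Definition mark_allocated (blk : nat) (s : fs_state) : fs_state :=
      mkFS (fun b => if Nat.eqb b blk then true else free_bitmap s b)
           (inode_blocks s).

    (* Step 2: record that [blk] now belongs to the file. *)
    Definition add_to_inode (blk : nat) (s : fs_state) : fs_state :=
      mkFS (free_bitmap s) (blk :: inode_blocks s).

    (* The intended operation is both steps together. *)
    Definition append_block (blk : nat) (s : fs_state) : fs_state :=
      add_to_inode blk (mark_allocated blk s).

    (* On disk, each step is a separate write.  A crash after step 1 but
       before step 2 leaves [mark_allocated blk s]: the block is marked
       allocated but is not part of any file.  Doing the steps in the other
       order risks the opposite anomaly: the block is part of the file but
       not marked allocated. *)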
Real-world examples
  [ Slide: l01-bugs.pdf ]
  Linux file system study [http://pages.cs.wisc.edu/~ll/papers/fsstudy.pdf]
  FSCQ slides [https://pdos.csail.mit.edu/papers/fscq:sosp15-slides.pdf]
  SSD study [https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf]
  Bugs can have bad consequences

Bugs due to complexity in file system logging
  Standard logging (slow but correct)
    data + metadata written through the log
    flush log, flush commit hdr
    Disk buffers and re-orders writes
    [ Slide: l01-bugs.pdf first bugfix commit ]
    possible problem: applied log writes, then cleared log
      clearing of log made it to disk, but applied log writes didn't
      a crash while applying log contents could result in writes being lost
    solution: wait for disk writes to become durable ("barrier") when flushing
  Log-bypass data writes
    metadata through log, data written directly
    flush data, flush log, BARRIER, flush commit hdr
      [can combine flush of data+log]
  Checksums
    put checksum(log) in hdr
    flush log+hdr together
    on crash, can determine if log is intact by comparing checksum in hdr
    [ Slide: l01-bugs.pdf second bugfix commit ]
    incompatible with log-bypass!
    subtle problem existed in Linux for years, in a widely used file system

How to find/avoid bugs?
  Testing
    Works well!  But doesn't demonstrate absence of bugs
  Model checking
    Check all symbolic states
    Hard to capture key properties without state explosion
  Verification
    Prove the implementation has no bugs
  All approaches are active areas of research
  6.826 focuses on verification
    But we will read papers about the other approaches too

Verification: three parts
  Specification: a precise description of the desired behavior of a system
  Implementation: an implementation of the system
  Proof: a mathematical proof showing the implementation meets the specification

                      [ Proof ]
                          |
                          V
                 +--------------+
   [ Spec ] ---> | Verification | <--- [ Code ] ---> Execute
                 |    system    |
                 +--------------+

  Many variations on this workflow:
    Some approaches check lightweight specs (don't require writing a spec)
    Some approaches use automated provers (don't require writing a proof)
    Some approaches focus on pseudocode/design (don't require executable code)
      Next lecture is more in this style.
      Specifications are important in their own right.
        Especially in layered systems.
        Especially in infrastructure software that applications are built on top of.
    Some approaches rely on paper proofs (don't require a verification system)

Verification is an old area, but has become hot again
  Verification technology has become good enough to tackle complex systems
    Powerful machinery: Provers (Coq, Lean); SAT/SMT solvers (Z3, CVC, ..)
    Powerful ideas for how to use the machinery: Separation logic, layering, ..
  Demand: security, complexity
  Real-world successes of verification:
    CompCert: verified C compiler.
    Amazon's use of specs (next lecture).
    Verified cryptography in Chrome, Firefox.
    Verified crypto in the Linux kernel (via WireGuard).
    Maybe more in the future.

Specifications
  Properties of a good spec:
    Describe the behavior of the system
    Tell client and implementer what each needs to know
    Precise enough to be able to prove theorems based on the spec
    Leave out unnecessary requirements that may
      confuse clients
      make proofs intractable
  This is hard!
    Many ways to describe the behavior of a system
    Easy to miss subtle cases
    Bugs in the spec are problematic (see the small Coq example below)
  Specifications on their own have much value
    See Amazon paper next week
    The old POCS notes [http://bwlampson.site/48-POCScourse/48-POCS2006.pdf]
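To make spec/implementation/proof concrete, here is a minimal Coq sketch in
the spirit of the Bank.v demo later in these notes (the names bank, transfer,
total, transfer_total are invented for illustration; the real Bank.v differs):
the implementation is transfer, the spec is the theorem that transfer
preserves the total balance, and Coq checks the proof.

    Require Import Arith Lia.

    (* Implementation: two accounts with nat balances. *)
    Record bank := mkBank { acct1 : nat; acct2 : nat }.

    Definition total (b : bank) : nat := acct1 b + acct2 b.

    (* Move n from account 1 to account 2; refuse overdrafts so that
       nat subtraction does not silently truncate. *)
    Definition transfer (n : nat) (b : bank) : bank :=
      if n <=? acct1 b
      then mkBank (acct1 b - n) (acct2 b + n)
      else b.

    (* Spec (one property of it): transfer preserves the total balance. *)
    Theorem transfer_total : forall n b, total (transfer n b) = total b.
    Proof.
      intros n b; unfold transfer, total.
      destruct (n <=? acct1 b) eqn:H; simpl.
      - apply Nat.leb_le in H. lia.
      - reflexivity.
    Qed.

Note that this spec alone is weak: a transfer that does nothing at all also
satisfies transfer_total.  That is the "spec might not be quite what you want"
point in the demo.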
Implementation/code
  Implementations in POCS are not much code
  POCS focuses on the core of a system with subtle issues
    sometimes complex ones too

Proofs
  By hand or machine-checked
  Old POCS used pen & paper proofs
    Old POCS notes influenced much recent research in verified systems
  New POCS focuses on machine-checked proofs
    Write spec, code, and proof in a proof assistant (Coq)
    Coq checks the proof
  Doing a proof in Coq requires developing a proof argument
    Often done on paper

Class structure
  Labs to teach you to write specs and prove implementations correct
  Readings to teach you about the state of the art in verification
    Many recent papers; fast-developing area of research
  Grading:
    Labs
    Paper summaries (in lieu of a midterm/final)
    Class participation (lecture questions, etc.)
    No quizzes
  Staff
    Butler Lampson
    Nickolai Zeldovich
    Tej Chajed
    Stella Lau
  Lab collaboration
    Must complete and submit your own labs
    Collaboration in figuring out how to approach the lab, debug issues, etc. is encouraged
    Post on Piazza to find collaboration group mates that fit your schedule

Coq overview
  Proof assistant
  Expressive:
    functional programs
    abstract mathematical definitions
    theorems
    proofs
    scripting language for proofs
  Assistant:
    checks proofs rather than generating all proofs
    can write scripts to automate proof generation
  Can generate running code
    Coq -> Haskell -> executable binary

Demo: Bank.v
  bank definition
    nat balances
    transfer
  extraction
    "Separate Extraction"
    look at Top.hs
    look at Main.hs
    ghc Main
    ./Main
      5 works
      20 is nonsense
  fix and prove no more bugs
    transfer total theorem: total remains the same
    cases, lia
    ghc Main
    ./Main
      5, 20, 10
  spec might not be quite what you want

Toy example; real programs are more sophisticated in many dimensions
  mutable state (variables)
  I/O (disk, network, user)
  crashes
  concurrency
  assembly code
  ..
  but each dimension adds complexity

Difficult to verify existing programs (say, the Linux kernel)
  specs are messy, or not well-defined at all
  language is not a great fit for verification
    complex semantics of what C statements mean
    might run differently depending on which C compiler or flags are used
    ...
  code not structured for verification: too much complexity all at once
    mutable shared memory, I/O, crashes, concurrency, assembly, ...
  verification effort is typically larger than the effort to build a new implementation
    so, it makes sense to rewrite the implementation to simplify verification

Labs will explore how to deal with some of these complexities
  mutable state
  I/O
  crashes
  concurrency (limited form)
  no assembly code
  ..
  but carefully introducing each form of complexity

Final lab: disk mirroring (RAID 1)
  (see the Coq sketch at the end of these notes)

[ Ran out of time, didn't do the second demo below. ]

Demo
  [ run everything as root ]
  ./bin/replicate-nbd init
  ./bin/replicate-nbd start &
  modprobe nbd
  nbd-client localhost /dev/nbd0
  mkfs -t ext4 /dev/nbd0
  mount /dev/nbd0 /mnt/x
  cd /mnt/x
  echo hello world > test.txt
  sync
  strings disk0.img
  strings disk1.img
  rm disk0.img
  echo another test > foo.txt
  sync
  strings disk1.img
  cat test.txt
  umount /mnt/x
  nbd-client -d /dev/nbd0
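A minimal Coq sketch of the mirroring idea behind the final lab.  The names
(disk, mirror, upd, write, read) are invented for illustration and are not the
lab's actual interface: every write goes to both disks, so a read can fall
back to the surviving disk when one fails.

    Require Import List.

    (* A disk is a list of block values; None models a failed disk. *)
    Definition disk := list nat.

    Record mirror := mkMirror { d0 : option disk; d1 : option disk }.

    (* Write value v at address a on one disk. *)
    Fixpoint upd (a v : nat) (d : disk) : disk :=
      match a, d with
      | _, nil => nil
      | O, _ :: d' => v :: d'
      | S a', x :: d' => x :: upd a' v d'
      end.

    (* Writes go to both disks, skipping a disk that has failed. *)
    Definition write (a v : nat) (m : mirror) : mirror :=
      mkMirror (option_map (upd a v) (d0 m)) (option_map (upd a v) (d1 m)).

    (* Reads prefer disk 0 and fall back to disk 1 if disk 0 has failed. *)
    Definition read (a : nat) (m : mirror) : option nat :=
      match d0 m, d1 m with
      | Some d, _ => nth_error d a
      | None, Some d => nth_error d a
      | None, None => None
      end.

In the demo above, removing disk0.img plays the role of one of the mirrored
disks failing: subsequent reads and writes are served by the remaining disk.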