SSD fault tolerance
===================

preface:
  lab 2 questions?
  abstraction relation example
  abstractions

why this paper?
  important problem: disks should retain data across power faults
    tangentially related anecdote: BU MOC cluster, Ceph bug, lost VM disks
    .. though it's important to have backups, replication, etc as well
  interesting results: many SSDs exhibit various incorrect behaviors
    of course, not terribly surprising that there are issues
    but the extent of the problem was surprisingly high
    many more issues than rotational drives
  widely used interface: important to understand what is correct behavior
  relevant to labs: you will implement disk remapping / mirroring in labs
  extended version of this paper from ACM TOCS, with some more data
    https://dl.acm.org/citation.cfm?id=2992782
  another paper studied SSD errors in the absence of power faults
    https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf
    but for this lecture, focus on tolerating power faults

what does a typical system with an SSD look like?
  CPU, DRAM
  AHCI controller
  SATA link
  SSD:
    its own CPU, running FTL (flash translation layer) code
    DRAM cache of written data that hasn't made it to non-volatile flash yet
    multiple flash chips
    FTL code stored on flash chips, along with regular data
  flash chip operations: read, erase large block, write to erased area
    limited number of erase cycles per block (1k -- 100k)
    MLC vs SLC
      MLC: stripe the two bits across different sectors
        error-prone: fate sharing between different sectors
      MLC: complex iterative write process
        performance penalty for writes
  perhaps a capacitor for clean shutdown
  aside: might look somewhat different in the near future
    NVMe interface: lower latency, higher bandwidth than SATA
    some SSDs provide raw access to flash ("open-channel SSDs")

abstract disk state, approximately: array of sectors
  common sector sizes: 512 bytes, 1024 bytes, 4096 bytes
  bad sectors?
    typically remapping happens inside the disk's own firmware
    but what happens when a sector goes bad during execution?

typical FTL plan (rough C sketch below)
  goal: avoid running out of erase cycles
  remap "user-visible" sectors to flash chip locations
    table: sector# <-> flash location
  maintain large chunks of erased flash locations
  when a new write comes in, write new data to a pre-erased location
    update table for the sector number
    mark the old sector data as unneeded
  when we get an entire erase unit of unneeded data, erase it
    may need to proactively move data to get an erasable unit
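
  a minimal C sketch of this remapping plan, with made-up names and sizes
  (a real FTL also persists its table across power loss, does wear leveling,
  and runs on the SSD's own CPU; "flash" here is just an in-memory array):

    /* illustrative in-memory model of the FTL plan above; not real firmware */
    #include <stdint.h>
    #include <string.h>

    #define SECTOR_SIZE    4096
    #define SECTORS_PER_EB 64                       /* sectors per erase block */
    #define ERASE_BLOCKS   32
    #define FLASH_LOCS     (ERASE_BLOCKS * SECTORS_PER_EB)
    #define USER_SECTORS   1024                     /* user-visible sectors */
    #define UNMAPPED       UINT32_MAX

    static uint8_t  flash[FLASH_LOCS][SECTOR_SIZE]; /* stand-in for the flash chips */
    static uint32_t sector_map[USER_SECTORS];       /* sector# -> flash location */
    static uint8_t  live[FLASH_LOCS];               /* 1 = holds current data for some sector */
    static uint32_t next_erased;                    /* next pre-erased location (simplified) */

    void ftl_init(void)
    {
        for (uint32_t s = 0; s < USER_SECTORS; s++)
            sector_map[s] = UNMAPPED;
    }

    /* a write never overwrites flash in place: program a pre-erased location,
     * point the table at it, and mark the old copy as unneeded */
    void ftl_write(uint32_t sec, const void *data)
    {
        uint32_t loc = next_erased++;               /* assume one is always available */
        memcpy(flash[loc], data, SECTOR_SIZE);
        if (sector_map[sec] != UNMAPPED)
            live[sector_map[sec]] = 0;
        sector_map[sec] = loc;
        live[loc] = 1;
    }

    void ftl_read(uint32_t sec, void *data)
    {
        if (sector_map[sec] == UNMAPPED)
            memset(data, 0, SECTOR_SIZE);           /* never-written sector */
        else
            memcpy(data, flash[sector_map[sec]], SECTOR_SIZE);
    }

    /* reclaim erase blocks whose sectors are all unneeded; a real FTL would
     * also copy live sectors out of nearly-empty blocks ("proactive moves")
     * and track per-block erase counts to spread out the 1k--100k cycles */
    void ftl_gc(void)
    {
        for (uint32_t eb = 0; eb < ERASE_BLOCKS; eb++) {
            int has_live = 0;
            for (uint32_t i = 0; i < SECTORS_PER_EB; i++)
                has_live |= live[eb * SECTORS_PER_EB + i];
            if (!has_live)
                memset(flash[eb * SECTORS_PER_EB], 0xff,
                       (size_t)SECTORS_PER_EB * SECTOR_SIZE);   /* model "erase" */
        }
    }

  note the window in ftl_write between programming the new data and updating
  the table: in a real drive both the table and recently written data may sit
  in the SSD's DRAM, which is one way a power fault at the wrong moment can
  leave the kinds of inconsistencies the paper looks for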

what does the disk interface look like?
  http://www.t13.org/Documents/UploadedDocuments/docs2013/d2161r5-ATAATAPI_Command_Set_-_3.pdf
    577 pages long, just for the command set (not the physical layer like SATA)
  read/write
    READs multiple sectors
      might fail due to physical errors
    WRITEs multiple sectors
      might fail due to physical errors
      what happens to the state of the disk if failure occurs in the
        middle of a multi-sector write?  not precisely specified.
  write caching
    disk may cache writes
      SET FEATURES write caching
    disk can flush modified sectors in the background, in any order
    FLUSH CACHE
      what if the write to non-volatile storage fails at this point?
        spec says: report back the failed address, don't flush more
        unclear what happens to the data at the failed sector
    WRITE FUA
      how does this interact with existing cache entries?
        oddly enough, spec doesn't say
    READ FUA
      forces cache flush before reading from non-volatile storage
  pipelining / concurrency
    disk controller vs inside the disk
      SATA AHCI: disk controller
      NCQ: inside the disk
  unnecessary data
    DATA SET MANAGEMENT: TRIM
      subsequent reads will return zero ("RZAT")
      subsequent reads will return some non-det value ("DRAT")
      subsequent reads can return different non-det values
      disk reports which behavior it will implement
    security requirement in the ATA spec:
      "data read from an LBA that has been trimmed shall not be retrieved
       from data that was previously received from a host addressed to
       any other LBA"
  file system format snooping
    DOS file system implementations don't issue TRIM command
    some disks guess the file system format
      guess that deleted file contents will not be needed
      perform implicit TRIM
  other interesting tidbits
    power management: spec requires writing cache to non-volatile storage
      before entering low-power state
    firmware itself stored on non-volatile storage
      upgraded via DOWNLOAD MICROCODE command
    many more complex commands
      on-disk encryption, "secure erase", write-read verify, ..
  not obvious what is the precise specification for this interface
    need more complex abstract state than just an array of sectors
    non-deterministic reads after trim
    not clear what's the behavior in case of write failures
    ..

how does this paper test for failures?
  issue writes with a particular pattern
    no concurrency, wait for one write to complete before issuing the next
    FUA or flush for each write (Linux translates O_SYNC into FUA writes)
    block pattern contains checksums, counters, etc.
      helps determine what happened afterwards
    (a sketch of such a workload appears at the end of these notes)
  physically disconnect power at some point during this workload

what did the authors expect might go wrong?
  (and why might it go wrong?)
  (and how might we detect this?)
  bit corruption
    something going wrong with NAND flash gates
      maybe this is where SLC vs MLC flash would help?
    detect: checksum in the data pattern
  flying writes (data ends up in the wrong place)
    probably a bug in the FTL implementation
    detect: expected address in the data pattern
  shorn writes (partially updated sector)
    probably a bug in the FTL implementation, or misunderstanding of semantics
    detect: individual patterns in each 512-byte chunk of the 4KB write?
  metadata corruption
    low-level flash corruption, or bug in FTL implementation
  dead device
    low-level flash corruption, or bug in FTL implementation
  unserializability (later writes visible but earlier writes are not)
    probably a bug in the FTL implementation
    detect: sequence number in the data pattern
  read disturb (added in a later version of the paper)
    might be that "half-programmed" NAND gates don't have enough electrons?
    easy to disturb by reading repeatedly
    detect: repeated reads, check if the data changes

what were the results?
  most SSDs had trouble
    both expensive "enterprise" drives and "regular" drives had trouble
    backup capacitor seems to help, but not perfect
  no flying writes
  dead device
    corrupted storage of FTL firmware, perhaps?
  metadata corruption: device lost space
  unserializable writes: how is this possible?
    write caching does not respect multiple flash chips
  shorn writes
    physical device uses 512-byte sector sizes internally
    even though the supposed atomic unit is 4096-byte sectors
  no read disturb errors (in subsequent version of paper)
  more reliable drives (e.g., 6, 10, 14, 15) had lower write throughput
    than less reliable drives (e.g., 2, 7, 8)
    [ from subsequent version of paper ]
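
appendix: sketch of a patterned power-fault write workload
  for concreteness, a C sketch of the kind of synchronous, patterned workload
  described above; the record layout, device path, and sizes are made up for
  illustration (this is not the authors' tool).  each 512-byte chunk of every
  4 KB write carries a sequence number, the intended block number, and a
  checksum, and O_SYNC makes each write durable before the next is issued

    /* illustrative power-fault write workload; layout, names, and sizes are
     * made up for this sketch -- run only against a scratch device */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK 4096                  /* supposed atomic write unit */
    #define CHUNK 512                   /* to detect shorn writes inside a block */

    struct chunk_hdr {                  /* repeated at the start of every 512-byte chunk */
        uint64_t seq;                   /* global write sequence number */
        uint64_t lba;                   /* block number this write was aimed at */
        uint64_t check;                 /* toy checksum over seq and lba */
    };

    static uint64_t toy_checksum(uint64_t seq, uint64_t lba)
    {
        return (seq * 1000003u) ^ (lba * 777767u);
    }

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "/dev/sdX";  /* placeholder scratch device */
        const uint64_t nblocks = 1 << 20;                   /* region to cycle over */

        /* O_SYNC: each write is durable before pwrite returns (Linux uses FUA
         * or a cache flush); O_DIRECT: bypass the kernel page cache */
        int fd = open(dev, O_WRONLY | O_SYNC | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, BLOCK, BLOCK)) return 1;
        memset(buf, 0, BLOCK);

        for (uint64_t seq = 0; ; seq++) {                   /* run until power is cut */
            uint64_t lba = seq % nblocks;
            for (int off = 0; off < BLOCK; off += CHUNK) {  /* same header in each chunk */
                struct chunk_hdr *h = (struct chunk_hdr *)((char *)buf + off);
                h->seq = seq;
                h->lba = lba;
                h->check = toy_checksum(seq, lba);
            }
            /* one synchronous write at a time: no concurrency, no queueing */
            if (pwrite(fd, buf, BLOCK, (off_t)(lba * BLOCK)) != BLOCK) {
                perror("pwrite");       /* a failed write is itself a data point */
                return 1;
            }
        }
    }

  after power is cut and restored, reading the region back and checking these
  fields distinguishes the failure modes listed above: a bad checksum suggests
  bit corruption, an lba mismatch a flying write, chunks that disagree within
  one 4 KB block a shorn write, and a surviving write whose earlier,
  already-acknowledged writes are missing indicates unserializability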