SSD fault tolerance
===================

preface:
  lab 2 questions?
  abstraction relation example
  abstractions

why this paper?
  important problem: disks should retain data across power faults
    tangentially related anecdote: BU MOC cluster, Ceph bug, lost VM disks
    .. though it's important to have backups, replication, etc as well
  interesting results: many SSDs exhibit various incorrect behaviors
    of course, not terribly surprising that there are issues
    but the extent of the problem was surprisingly high
    many more issues than rotational drives
  widely used interface: important to understand what is correct behavior
  relevant to labs: you will implement disk remapping / mirroring in labs
  extended version of this paper from ACM TOCS, with some more data
    https://dl.acm.org/citation.cfm?id=2992782
  another paper studied SSD errors in the absence of power faults
    https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf
    but for this lecture, focus on tolerating power faults

what does a typical system with an SSD look like?
  CPU, DRAM
  AHCI controller
  SATA link
  SSD:
    its own CPU, running FTL (flash translation layer) code
    DRAM cache of written data that hasn't made it to non-volatile flash yet
    multiple flash chips
    FTL code stored on flash chips, along with regular data
  flash chip operations: read, erase large block, write to erased area
    limited number of erase cycles per block (1k -- 100k)
    MLC vs SLC
      MLC: stripe the two bits across different sectors
        error-prone: fate sharing between different sectors
      MLC: complex iterative write process
        performance penalty for writes
  perhaps a capacitor for clean shutdown
  aside: might look somewhat different in the near future
    NVMe interface: lower latency, higher bandwidth than SATA
    some SSDs provide raw access to flash ("open-channel SSDs")

abstract disk state, approximately: array of sectors
  common sector sizes: 512 bytes, 1024 bytes, 4096 bytes
  bad sectors?
    typically remapping happens inside the disk's own firmware
    but what happens when a sector goes bad during execution?

typical FTL plan (rough C sketch below)
  goal: avoid running out of erase cycles
  remap "user-visible" sectors to flash chip locations
    table: sector# <-> flash location
  maintain large chunks of erased flash locations
  when a new write comes in, write new data to a pre-erased location
    update table for the sector number
    mark the old sector data as unneeded
  when we get an entire erase unit of unneeded data, erase it
    may need to proactively move data to get an erasable unit
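
  a minimal C sketch of this remapping plan, with made-up names and sizes
  (a real FTL also persists its table across power loss, does wear leveling,
  and runs on the SSD's own CPU; "flash" here is just an in-memory array):

    /* illustrative in-memory model of the FTL plan above; not real firmware */
    #include <stdint.h>
    #include <string.h>

    #define SECTOR_SIZE    4096
    #define SECTORS_PER_EB 64                       /* sectors per erase block */
    #define ERASE_BLOCKS   32
    #define FLASH_LOCS     (ERASE_BLOCKS * SECTORS_PER_EB)
    #define USER_SECTORS   1024                     /* user-visible sectors */
    #define UNMAPPED       UINT32_MAX

    static uint8_t  flash[FLASH_LOCS][SECTOR_SIZE]; /* stand-in for the flash chips */
    static uint32_t sector_map[USER_SECTORS];       /* sector# -> flash location */
    static uint8_t  live[FLASH_LOCS];               /* 1 = holds current data for some sector */
    static uint32_t next_erased;                    /* next pre-erased location (simplified) */

    void ftl_init(void)
    {
        for (uint32_t s = 0; s < USER_SECTORS; s++)
            sector_map[s] = UNMAPPED;
    }

    /* a write never overwrites flash in place: program a pre-erased location,
     * point the table at it, and mark the old copy as unneeded */
    void ftl_write(uint32_t sec, const void *data)
    {
        uint32_t loc = next_erased++;               /* assume one is always available */
        memcpy(flash[loc], data, SECTOR_SIZE);
        if (sector_map[sec] != UNMAPPED)
            live[sector_map[sec]] = 0;
        sector_map[sec] = loc;
        live[loc] = 1;
    }

    void ftl_read(uint32_t sec, void *data)
    {
        if (sector_map[sec] == UNMAPPED)
            memset(data, 0, SECTOR_SIZE);           /* never-written sector */
        else
            memcpy(data, flash[sector_map[sec]], SECTOR_SIZE);
    }

    /* reclaim erase blocks whose sectors are all unneeded; a real FTL would
     * also copy live sectors out of nearly-empty blocks ("proactive moves")
     * and track per-block erase counts to spread out the 1k--100k cycles */
    void ftl_gc(void)
    {
        for (uint32_t eb = 0; eb < ERASE_BLOCKS; eb++) {
            int has_live = 0;
            for (uint32_t i = 0; i < SECTORS_PER_EB; i++)
                has_live |= live[eb * SECTORS_PER_EB + i];
            if (!has_live)
                memset(flash[eb * SECTORS_PER_EB], 0xff,
                       (size_t)SECTORS_PER_EB * SECTOR_SIZE);   /* model "erase" */
        }
    }

  note the window in ftl_write between programming the new data and updating
  the table: in a real drive both the table and recently written data may sit
  in the SSD's DRAM, which is one way a power fault at the wrong moment can
  leave the kinds of inconsistencies the paper looks for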

what does the disk interface look like?
  http://www.t13.org/Documents/UploadedDocuments/docs2013/d2161r5-ATAATAPI_Command_Set_-_3.pdf
    577 pages long, just for the command set (not the physical layer like SATA)
  read/write
    READs multiple sectors
      might fail due to physical errors
    WRITEs multiple sectors
      might fail due to physical errors
      what happens to the state of the disk if failure occurs in the
        middle of a multi-sector write?  not precisely specified.
  write caching
    disk may cache writes
      SET FEATURES write caching
    disk can flush modified sectors in the background, in any order
    FLUSH CACHE
      what if the write to non-volatile storage fails at this point?
        spec says: report back the failed address, don't flush more
        unclear what happens to the data at the failed sector
    WRITE FUA
      how does this interact with existing cache entries?
        oddly enough, spec doesn't say
    READ FUA
      forces cache flush before reading from non-volatile storage
  pipelining / concurrency
    disk controller vs inside the disk
      SATA AHCI: disk controller
      NCQ: inside the disk
  unnecessary data
    DATA SET MANAGEMENT: TRIM
      subsequent reads will return zero ("RZAT")
      subsequent reads will return some non-det value ("DRAT")
      subsequent reads can return different non-det values
      disk reports which behavior it will implement
    security requirement in the ATA spec:
      "data read from an LBA that has been trimmed shall not be retrieved
       from data that was previously received from a host addressed to
       any other LBA"
  file system format snooping
    DOS file system implementations don't issue TRIM command
    some disks guess the file system format
      guess that deleted file contents will not be needed
      perform implicit TRIM
  other interesting tidbits
    power management: spec requires writing cache to non-volatile storage
      before entering low-power state
    firmware itself stored on non-volatile storage
      upgraded via DOWNLOAD MICROCODE command
    many more complex commands
      on-disk encryption, "secure erase", write-read verify, ..
  not obvious what is the precise specification for this interface
    need more complex abstract state than just an array of sectors
    non-deterministic reads after trim
    not clear what's the behavior in case of write failures
    ..

how does this paper test for failures?
  issue writes with a particular pattern
    no concurrency, wait for one write to complete before issuing the next
    FUA or flush for each write (Linux translates O_SYNC into FUA writes)
    block pattern contains checksums, counters, etc.
      helps determine what happened afterwards
    (a sketch of such a workload appears at the end of these notes)
  physically disconnect power at some point during this workload

what did the authors expect might go wrong?
  (and why might it go wrong?)
  (and how might we detect this?)
  bit corruption
    something going wrong with NAND flash gates
      maybe this is where SLC vs MLC flash would help?
    detect: checksum in the data pattern
  flying writes (data ends up in the wrong place)
    probably a bug in the FTL implementation
    detect: expected address in the data pattern
  shorn writes (partially updated sector)
    probably a bug in the FTL implementation, or misunderstanding of semantics
    detect: individual patterns in each 512-byte chunk of the 4KB write?
  metadata corruption
    low-level flash corruption, or bug in FTL implementation
  dead device
    low-level flash corruption, or bug in FTL implementation
  unserializability (later writes visible but earlier writes are not)
    probably a bug in the FTL implementation
    detect: sequence number in the data pattern
  read disturb (added in a later version of the paper)
    might be that "half-programmed" NAND gates don't have enough electrons?
    easy to disturb by reading repeatedly
    detect: repeated reads, check if the data changes

what were the results?
  most SSDs had trouble
    both expensive "enterprise" drives and "regular" drives had trouble
    backup capacitor seems to help, but not perfect
  no flying writes
  dead device
    corrupted storage of FTL firmware, perhaps?
  metadata corruption: device lost space
  unserializable writes: how is this possible?
    write caching does not respect multiple flash chips
  shorn writes
    physical device uses 512-byte sector sizes internally
    even though the supposed atomic unit is 4096-byte sectors
  no read disturb errors (in subsequent version of paper)
  more reliable drives (e.g., 6, 10, 14, 15) had lower write throughput
    than less reliable drives (e.g., 2, 7, 8)
    [ from subsequent version of paper ]
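
appendix: sketch of a patterned power-fault write workload
  for concreteness, a C sketch of the kind of synchronous, patterned workload
  described above; the record layout, device path, and sizes are made up for
  illustration (this is not the authors' tool).  each 512-byte chunk of every
  4 KB write carries a sequence number, the intended block number, and a
  checksum, and O_SYNC makes each write durable before the next is issued

    /* illustrative power-fault write workload; layout, names, and sizes are
     * made up for this sketch -- run only against a scratch device */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK 4096                  /* supposed atomic write unit */
    #define CHUNK 512                   /* to detect shorn writes inside a block */

    struct chunk_hdr {                  /* repeated at the start of every 512-byte chunk */
        uint64_t seq;                   /* global write sequence number */
        uint64_t lba;                   /* block number this write was aimed at */
        uint64_t check;                 /* toy checksum over seq and lba */
    };

    static uint64_t toy_checksum(uint64_t seq, uint64_t lba)
    {
        return (seq * 1000003u) ^ (lba * 777767u);
    }

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "/dev/sdX";  /* placeholder scratch device */
        const uint64_t nblocks = 1 << 20;                   /* region to cycle over */

        /* O_SYNC: each write is durable before pwrite returns (Linux uses FUA
         * or a cache flush); O_DIRECT: bypass the kernel page cache */
        int fd = open(dev, O_WRONLY | O_SYNC | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, BLOCK, BLOCK)) return 1;
        memset(buf, 0, BLOCK);

        for (uint64_t seq = 0; ; seq++) {                   /* run until power is cut */
            uint64_t lba = seq % nblocks;
            for (int off = 0; off < BLOCK; off += CHUNK) {  /* same header in each chunk */
                struct chunk_hdr *h = (struct chunk_hdr *)((char *)buf + off);
                h->seq = seq;
                h->lba = lba;
                h->check = toy_checksum(seq, lba);
            }
            /* one synchronous write at a time: no concurrency, no queueing */
            if (pwrite(fd, buf, BLOCK, (off_t)(lba * BLOCK)) != BLOCK) {
                perror("pwrite");       /* a failed write is itself a data point */
                return 1;
            }
        }
    }

  after power is cut and restored, reading the region back and checking these
  fields distinguishes the failure modes listed above: a bad checksum suggests
  bit corruption, an lba mismatch a flying write, chunks that disagree within
  one 4 KB block a shorn write, and a surviving write whose earlier,
  already-acknowledged writes are missing indicates unserializability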