x86-TSO
=======

Why are we reading this paper?
  Shared memory is an important problem where specifications matter.
    Concurrent software is challenging to write.
    Correctness depends on the details of shared-memory behavior.
    Different architectures provide different guarantees.
  Concurrency with weak memory is extremely complicated.
    Concurrency means many possible interleavings.
    Weak memory means even more cases to consider for memory behavior.
  x86-TSO helped clarify the shared-memory model for x86 CPUs.
    Before x86-TSO, it was challenging to understand the x86 shared-memory system.
    On-paper specs differ between Intel and AMD.
    Even Intel changed its mind in its specs over time.
  Example of a spec that cuts out details to get at the essence.
    Helpful as a thought exercise even if not doing formal verification.
  Start of a line of papers on concurrency in 6.826.

Sequential consistency (SC).
  Reads, writes.
  Two roles:
    1. Actually provided by some systems.
       E.g., OS threads on a single CPU, or simplistic multicores.
    2. Convenient abstraction to provide for data-race-free programs.
       E.g., ensured by the implementation of locks + programmer discipline.
  Example program (a runnable sketch appears at the end of this section):
    T1: x = 1; ysnapshot = y;
    T2: y = 1; xsnapshot = x;
    main:
      x = 0; y = 0;
      spawn T1, T2 and wait for completion
      print xsnapshot, ysnapshot
  On SC, this should never print "0, 0".
    Acceptable outcomes are "1, 0", "0, 1", and "1, 1".
    Suppose T1 gets ysnapshot=0.
      That means T2 hasn't executed its first write (y=1) yet.
      So it hasn't executed its snapshot operation (xsnapshot=x) either.
      Its snapshot operation will see the x=1 value that T1 wrote.
      Thus, T2 should get xsnapshot=1.

High-performance multicore systems do not provide SC.
  Surprising behavior when running code on weak memory.
  On many real multicore machines, the above example can print "0, 0".

Hardware implementation of shared memory.
  Caches: L1/L2, L3, sockets.
    Each core has its own L1/L2.
    Each socket has its own L3.
    The same memory address can be present in multiple caches (in shared state).
    Cache coherence is implemented through a message-based protocol between caches.
  Store buffers.
    Each core has its own store buffer.
    A write instruction finishes after putting its value in the store buffer.
    The store buffer eventually flushes the write to the cache system.
  Out-of-order execution.
    Many instructions executing at the same time.
      store(x, 1)
      t := fetch(y)
      store(ysnapshot, t)
    The memory read might happen before the instruction commits.
  Speculative execution.
    More aggressive out-of-order execution: CPU guesses branches, addresses, etc.
      T1: *ptr = x; shared = ptr;
      T2: myptr = shared; y = *myptr;
    With speculative execution on some architectures:
      T2 could guess the address in myptr, even though it hasn't loaded it yet.
      It could perform the load (*myptr) before T1 stores x there.
      T2 eventually loads the shared address and confirms the speculation was correct.
      (A safe way to write this publication pattern is sketched at the end of this section.)
  SMT / hyperthreading.
    Hyperthreads on the same core share resources.
      Shared L1/L2, maybe a shared store buffer.
    What if hyperthreads shared a store buffer?
      Might observe IRIW (Independent Reads of Independent Writes):
        T1: write(x, 1)
        T2: write(y, 1)
        T3: x3=read(x); y3=read(y); print x3, y3
        T4: y4=read(y); x4=read(x); print x4, y4
      Could we ever see T3: (1, 0) and T4: (0, 1)?
      Weird behavior: T3 and T4 don't agree on the order of the updates from T1 and T2.
      "Partial store order", in contrast to x86-TSO ("total store order").
      Can happen if hyperthreads share store buffers.
        T1, T3 are hyperthreads on the same core (so T3 sees x updates sooner).
        T2, T4 are hyperthreads on another core (so T4 sees y updates sooner).
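
The snapshot example above can be run directly.  Here is a hedged C11 sketch (mine,
not from the paper) that spawns the two threads repeatedly and counts how often the
SC-forbidden "0, 0" outcome appears.  The relaxed atomic accesses compile to plain
MOVs on x86; how often (or whether) the weak outcome shows up depends on the hardware
and scheduler.  Uncommenting the seq_cst fences (typically an MFENCE on x86) is one
way to rule it out.

  /* sb_litmus.c (hypothetical name); build with: cc -O2 -pthread sb_litmus.c */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  static atomic_int x, y;
  static int xsnapshot, ysnapshot;

  static void *t1(void *arg) {
      (void)arg;
      atomic_store_explicit(&x, 1, memory_order_relaxed);           /* x = 1 */
      /* atomic_thread_fence(memory_order_seq_cst); */              /* drain store buffer */
      ysnapshot = atomic_load_explicit(&y, memory_order_relaxed);   /* ysnapshot = y */
      return NULL;
  }

  static void *t2(void *arg) {
      (void)arg;
      atomic_store_explicit(&y, 1, memory_order_relaxed);           /* y = 1 */
      /* atomic_thread_fence(memory_order_seq_cst); */
      xsnapshot = atomic_load_explicit(&x, memory_order_relaxed);   /* xsnapshot = x */
      return NULL;
  }

  int main(void) {
      int weak = 0;
      for (int i = 0; i < 100000; i++) {
          atomic_store(&x, 0);
          atomic_store(&y, 0);
          pthread_t a, b;
          pthread_create(&a, NULL, t1, NULL);
          pthread_create(&b, NULL, t2, NULL);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          if (xsnapshot == 0 && ysnapshot == 0)
              weak++;                     /* forbidden under SC, allowed under x86-TSO */
      }
      printf("saw \"0, 0\" in %d of 100000 runs\n", weak);
      return 0;
  }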
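
The speculation example is the same "publish a pointer" pattern that RCU (discussed
below) relies on.  Here is a hedged C11 sketch of the safe version; the struct and
function names are made up for illustration.  The release store keeps the data writes
ordered before the pointer write (for the compiler and for weaker architectures);
under x86-TSO the FIFO store buffer already provides that order.

  #include <stdatomic.h>
  #include <stdlib.h>

  struct node { int a, b; };
  static _Atomic(struct node *) shared;   /* NULL until published */

  void publisher(void) {
      struct node *n = malloc(sizeof *n);
      n->a = 1;                           /* fill in the data structure first... */
      n->b = 2;
      /* ...then publish the pointer.  If the pointer write could become visible
         before the data writes, a reader could see an uninitialized node. */
      atomic_store_explicit(&shared, n, memory_order_release);
  }

  void reader(void) {
      struct node *n = atomic_load_explicit(&shared, memory_order_acquire);
      if (n != NULL) {
          int sum = n->a + n->b;          /* guaranteed to see a=1, b=2 */
          (void)sum;
      }
  }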

Weak memory models.
  Many relaxations possible.
  x86 has several different memory types.
    [ Ref: section 11.3 of https://software.intel.com/sites/default/files/managed/a4/60/325384-sdm-vol-3abcd.pdf ]
    This paper considers the default "write-back" mode on x86.
    This x86 "default" behavior seems similar to SPARC v8.
    x86 write-combining is weaker (no ordering guarantees) and does not satisfy the x86-TSO spec.
      With write-combining, the store buffer is not guaranteed to flush in order.
  ARM and PowerPC: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf

How does any software ever work on weak memory?
  Goal: simulate SC on top of weak memory, for well-behaved code.
    "Data-race freedom."
  Not always possible (i.e., unusable specs, according to this paper).

Does weak memory matter?
  Matters in the implementation of locking primitives.
  Matters for lock-free algorithms.
    Linux kernel RCU: lock-free accesses.
      Key idea: atomic pointer update.
      Fill in the data structure before "publishing" the pointer.
      If the pointer-"publishing" write becomes visible before the data writes, inconsistency.
      (The publication sketch above shows the ordering this relies on.)
    Postgres had a work queue where this reordering caused lost wakeups.
      [ Ref: https://www.postgresql.org/message-id/24241.1312739269@sss.pgh.pa.us ]

Intel, AMD informal specs.
  Principles.
  Litmus tests.
  Insufficient and imprecise.
  Axiomatic: the spec states invariants that should hold over an execution.
    E.g., Intel P9: "Any two stores are seen in a consistent order by processors
    other than those performing the stores".
    Relatively good fit for checking a particular trace of execution.
    Harder to use as a specification that helps programmers understand what can happen.
  What do Intel and AMD specs say now?
    Still similar to what the x86-TSO paper describes.
    AMD: section 7.2 of https://www.amd.com/system/files/TechDocs/24593.pdf
    Intel: section 8.2.2 of https://software.intel.com/sites/default/files/managed/a4/60/325384-sdm-vol-3abcd.pdf

x86-TSO model.
  Abstract machine (in contrast to an axiomatic spec).
    Per-CPU store buffers.
    Shared memory.
    A global lock (for LOCK'd instructions).
  Instructions:
    Reads come from the store buffer + shared memory.
    Writes go to the store buffer.
    LOCK'd operations are atomic (no concurrent reads/writes to shared memory)
      and flush the store buffer afterwards.
    A fence (MFENCE) flushes the store buffer.
    Internal transition \tau_p: flush one item from p's store buffer.
  (A toy executable version of this machine is sketched at the end of this section.)
  Understanding our example on x86-TSO.
    Each store buffer holds the "1".
    The snapshotting reads see the 0 from shared memory.

Validating x86-TSO.
  Rare events: amd3 observed only on a specific hardware config, 4 times out of 3.2*10^9.
    amd3:
      CPU0        CPU1
      x = 1       y = 1
      x = 2       y = 2
      print(y)    print(x)
    Sometimes (even though rarely) prints 1, 1.

Implementing locks on top of x86-TSO (section 4).
  Is it safe for release to omit the LOCK prefix on the MOV?
    Linux developers argued for 2 weeks, 119 email messages!
  The x86-TSO model clarifies: the LOCK prefix means flushing the store buffer.
    Safe from a correctness perspective.
    Unlock doesn't guarantee a concurrent lock acquire will immediately succeed.
      But that seems within spec.  (And processors seem to flush eagerly.)
    Why no MFENCE anywhere?  Implicit in the LOCK prefix.
  (A spinlock in this style is sketched at the end of this section.)

Implementing SC on top of x86-TSO.
  Locks.
  Data-race-free program: requires the programmer to follow some rules.
    The program does not access memory protected by a lock without holding the lock.
    The view of protected memory from other cores doesn't matter while the lock is held.
    Once the lock is released, store-buffer order guarantees that the protected-memory
      updates are flushed before the lock release becomes visible.
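
The abstract machine above is small enough to execute.  Below is a toy C encoding of
it (my own sketch, not the paper's HOL definition; the sizes and names are invented).
It replays the snapshot example: both writes sit in store buffers, so both reads
return the initial 0, until a LOCK'd operation or MFENCE (or enough tau steps) drains
the buffers.

  #include <assert.h>
  #include <stdio.h>

  #define NCPU  2
  #define NADDR 2
  #define BUFSZ 16

  static int memory[NADDR];                           /* shared memory */
  static struct { int addr, val; } buf[NCPU][BUFSZ];  /* per-CPU FIFO store buffers */
  static int head[NCPU], tail[NCPU];                  /* pop at head, push at tail */
  static int lock_owner = -1;                         /* -1: the machine's lock is free */

  /* p is blocked if some other CPU holds the machine's lock. */
  static int blocked(int p) { return lock_owner != -1 && lock_owner != p; }

  /* Write: enqueue into p's store buffer; not yet visible to other CPUs. */
  static void do_write(int p, int addr, int val) {
      buf[p][tail[p] % BUFSZ].addr = addr;
      buf[p][tail[p] % BUFSZ].val  = val;
      tail[p]++;
  }

  /* Read: newest matching entry in p's own store buffer wins, else shared memory. */
  static int do_read(int p, int addr) {
      assert(!blocked(p));
      for (int i = tail[p] - 1; i >= head[p]; i--)
          if (buf[p][i % BUFSZ].addr == addr)
              return buf[p][i % BUFSZ].val;
      return memory[addr];
  }

  /* tau_p: internal step that flushes the oldest store-buffer entry to memory. */
  static void tau(int p) {
      assert(!blocked(p) && head[p] < tail[p]);
      memory[buf[p][head[p] % BUFSZ].addr] = buf[p][head[p] % BUFSZ].val;
      head[p]++;
  }

  /* MFENCE: drain p's store buffer completely. */
  static void do_mfence(int p) { while (head[p] < tail[p]) tau(p); }

  /* A LOCK'd read-modify-write: take the lock, operate, drain, release. */
  static int do_locked_xchg(int p, int addr, int val) {
      assert(lock_owner == -1);
      lock_owner = p;
      int old = do_read(p, addr);
      do_write(p, addr, val);
      do_mfence(p);
      lock_owner = -1;
      return old;
  }

  int main(void) {
      enum { X = 0, Y = 1 };              /* two memory addresses, initially 0 */

      do_write(0, X, 1);                  /* CPU0: x = 1 */
      do_write(1, Y, 1);                  /* CPU1: y = 1 */
      int ysnap = do_read(0, Y);          /* both writes still sit in store buffers... */
      int xsnap = do_read(1, X);          /* ...so both reads return 0 */
      printf("xsnapshot=%d ysnapshot=%d\n", xsnap, ysnap);

      do_locked_xchg(0, X, 2);            /* LOCK'd op: atomic, then drains CPU0's buffer */
      do_mfence(1);                       /* drain CPU1's buffer */
      printf("memory: x=%d y=%d\n", memory[X], memory[Y]);   /* x=2 y=1 */
      return 0;
  }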
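
And a minimal C11 sketch of the spinlock shape from the section-4 debate (not Linux's
actual code; the type and function names are made up).  Acquire uses an atomic
exchange, which on x86 compiles to XCHG with an implicit LOCK prefix and so drains the
store buffer; release compiles to a plain MOV with no LOCK prefix, which is exactly
the point the Linux thread argued over and which x86-TSO shows to be safe.

  #include <stdatomic.h>
  #include <stdbool.h>

  typedef struct { atomic_bool held; } spinlock_t;

  static void spin_lock(spinlock_t *l) {
      /* LOCK'd XCHG on x86: atomic, and flushes the store buffer afterwards. */
      while (atomic_exchange_explicit(&l->held, true, memory_order_acquire))
          ;   /* spin; a real lock would back off or sleep here */
  }

  static void spin_unlock(spinlock_t *l) {
      /* Plain MOV (no LOCK prefix).  Safe under x86-TSO: the store buffer is FIFO,
         so writes made while holding the lock become visible to other CPUs before
         the unlock write does. */
      atomic_store_explicit(&l->held, false, memory_order_release);
  }

A caller would declare `spinlock_t l = { false };` and bracket the critical section
with spin_lock(&l) / spin_unlock(&l).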

Hardware threads vs. OS threads.
  Suppose we are implementing a lock in user-space code.
    Need to reason about low-level shared-memory operations, a la x86-TSO.
  A single OS thread could get moved between CPU cores by the OS kernel.
  What's the abstract view of shared memory for OS threads?
  x86-TSO is correct even for OS threads.
    If threads are not migrated, x86-TSO applies directly.
    If threads are migrated, the kernel needs to communicate state between cores.
      Kernel communication happens through the same hardware store buffer.
      The application-level stores get flushed before the kernel's stores.
      So by the time an OS thread is migrated, its store buffer has been flushed.
        Effectively allowed by the tau transition.
    What if the store buffer isn't empty when the thread resumes on a new core?
      Could have the same effect as sharing a store buffer between two threads.
      Could we get IRIW?
      The OS kernel needs to flush the store buffer (e.g., MFENCE) when switching threads.
      Luckily, the usual way to switch between threads (IRET) incurs a fence.