Static analysis tools
=====================

Papers:
  "Lessons from Building Static Analysis Tools at Google"
  "Scaling Static Analyses at Facebook"

Nicely complementing papers
  Google paper focuses on "breadth"
  Facebook paper focuses on "depth"
  Lots of similarities between Google, Facebook, beyond what papers focus on
    Similar environment: common unified repository, review workflow, CI tools
    Facebook paper hints Facebook does lint-style checks too, like Google
    Google also has more in-depth static analysis in some specific projects
  But still interesting to contrast "breadth" vs "depth" approaches, even for static analysis tools that don't go to full verification

Focus of these papers: static analysis
  Both Google, Facebook make extensive use of tests
  Both also do runtime checking / instrumentation
    AddressSanitizer, ThreadSanitizer, ...
  Static analysis used in ways that complement test suites, runtime checking

Strength of static analysis
  Corner cases that are hard to catch through tests or runtime instrumentation
  Don't need to actually get an execution that triggers potential issue
  Same argument as "full" verification: reason about all possible executions

What are the specs that these tools are checking for?
  Cross-cutting specs, not app-specific
    Developer does not write a spec for their code
    Spec must be a general property expected to hold of all code
    Often focused on internal aspects, not externally-observable behavior
      More of an invariant than a spec for external behavior
  Localized specs
    Ideally specs capture something that must be true of code at small scale
    Facebook goes a bit further: inter-procedural analysis, with clear summaries
      But must be able to mechanically infer specs for each procedure
  Partial specs that catch common bugs or specific aspect, not prove correctness
  Co-designed specs together with developer, code base
    Developers must believe this spec is worth checking for
    Google: "motherhood and apple pie" kind of specs, everyone agrees, low FP
    Facebook: important specs where developers believe this is tricky stuff
  Focus on practical specs that can be checked, as opposed to complete specs

Google case study: FindBugs (the kind of pattern it flags is sketched after this case study)
  Surprisingly common themes of how analyses can fail across Google, Facebook
  Attempt 1: dashboard
    Couldn't get developers to care about reports from the analysis tool
    Facebook also had the same problem
      0% fix rate for batch reports
      70% fix rate for reports at review time ("diff time")
    Interestingly, Facebook made this work eventually: "master" bug list
      Security engineers own dashboard, in effect
  Attempt 2: file bugs based on analysis reports
    Developers didn't think the reports were worth fixing
    Facebook similarly reports: "near silence" when assigning reports to engineers
      Perhaps a case of the "context switch effort" problem that Facebook mentions
      Perhaps less clear who should fix the bug (Facebook paper)
      In contrast, review-time reports are more likely seen by relevant developer
    Facebook made this work by having security engineers triage analysis reports
      When security engineer files bug, developers take it more seriously
  Attempt 3: integrated into code review
    A bit unclear what exactly went wrong
    Sounds like developers of different modules customized the tool too much
    Everyone got different reports, developers confused about what the tool does?
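  Example of the kind of pattern involved (a made-up sketch, not from the papers):
    a localized, cross-cutting check that FindBugs-style tools flag in Java,
    with no app-specific spec needed ("do not compare string contents with ==")

      // Hypothetical code, for illustration only
      class RoleCheck {
          static boolean isAdmin(String role) {
              // FindBugs-style warning: reference comparison of strings;
              // almost certainly meant role.equals("admin")
              return role == "admin";
          }
      }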
Lessons learned for analysis tools?

Google paper: avoiding pitfalls
  Biggest point: integration into developer workflow
    Compile-time, review-time
    Both Google and Facebook do larger-scale "master" bug list, batch reports
      Facebook's workflow: security team figures out what to do with master report
      Google's workflow: try to get the master list to zero before integrating tool
    Facebook had a good characterization: "mental effort of context switch"
      Once change is merged, lots of overhead to go back and fix things
      Much easier to fix while writing code (Google's focus on compile-time checks)
      Also not too bad to fix things while code is being reviewed
  Actionable warnings
    Google: review-time checks if too many false positives in legacy code;
      not viable as a compile-time check because it would overwhelm the developer
  Users trust the tool
    Google: low effective FP
    Facebook: focus on hard problems appreciated by some teams
      Security
      Concurrency
      Closer to the Amazon story (distributed systems also a hard problem)
  Bugs must be relevant in practice / Fixing the issue must be easy enough
    Facebook: focus on important bugs
    Google: surface the warning sooner (compile-time)
  Developers understand warnings
    Google example: developers misunderstood printf warning

Common: leverage centralized CI
  Incremental analysis
    Perhaps too expensive to run at compile-time
    Perhaps generates too many false positives for existing code
    Incremental mode possible due to centralized cache of analysis state (Infer)

Different: focus, big metrics
  Google: focus on "effective false positive"
    If developer changes the code, that's not a false positive
    If developer doesn't fix an actual fault, that's effectively a false positive
    Review tool has a button to flag analysis reports that are not useful
    Paper describes how Google targets broad range of "common" bugs
    First "lesson learned" is that, for any analysis you run, you find bugs
  Facebook: focus on "missed bugs"
    Paper targets bugs in specific areas of importance
      Crashes
      Security
      Concurrency
    Different mindset
      "How could we have caught it with static analysis?"
    Surprisingly low missed bug counts, for categories they target!
      3 data race bugs (due to error in analyzer itself)
      11 missed security bugs (some due to error in model or tool)
    Developers tolerate false positives
      50% fix rate for concurrency bug reports
      But developers still like the reports: hard to write correct concurrent code

Different: kinds of specs/analyses
  Google: style checkers, error-prone patterns, type-based checks, unused vars
  Facebook: memory safety, data flow for security, concurrency race conditions

What makes analyses scale?
  Local analysis
  Incremental analysis
  Some bugs require interprocedural analysis
    OpenSSL null pointer bug (a Java sketch of this bug shape follows these examples):
      revtm = X509_gmtime_adj(...);
        .. X509_gmtime_adj calls other procedures
        .. some procedure, 5 calls deep, can return null
        .. this eventually propagates to revtm
      i = revtm->length + 1;
    Figure 3: data path from user input to form submission URL
      render() calls getIDs()
        gets member_id from HTTP request
      passes result to getConfirmationForm()
        member_id gets used as the action URL of a form
      returns resulting form to the HTTP client
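    Java sketch of the OpenSSL-shaped bug above (hypothetical code, not from
      either paper): the null originates a couple of calls deep and is only
      dereferenced by the top-level caller, so no single procedure looks wrong
      in isolation; this is the kind of report an Infer-style interprocedural
      analysis can produce

      // All names made up for illustration
      class RevocationTime {
          static String lookupTimestamp(String id) {
              return null;                // deep callee: may return null
          }
          static String adjustTime(String id) {
              return lookupTimestamp(id); // just forwards the possibly-null value
          }
          static int revokedLength(String id) {
              String revtm = adjustTime(id);
              return revtm.length() + 1;  // NullPointerException if revtm is null
          }
      }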
  Compositional analysis
    Procedure summary: need to infer procedure specs!
      Null pointer: need to infer what args, return values might be null?
        null return value might depend on arguments
      Facebook's Infer deduces pre/postcondition specs
        Deduced specs are checked to be correct, but might be too coarse
      Data flow: which values contain user input, which values are used in web page
        Similarly, these inferred specs/summaries are correct but maybe coarse
    Compositional analysis naturally good fit for scale
      Relatively easy to parallelize: independently check each procedure
      Relatively easy to do incremental analysis: reuse existing summaries
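  To make the last two points concrete, a toy sketch (not Infer's actual
    algorithm; all names made up) of compositional analysis with per-procedure
    summaries: each procedure gets a "may this return null?" summary computed
    from its callees' summaries, summaries are cached, and checking new or
    changed code reuses cached summaries instead of re-analyzing everything

      import java.util.*;

      public class SummarySketch {
          // Toy summary: can this procedure return null?
          record Summary(boolean mayReturnNull) {}
          // Toy procedure body: returns null directly, or forwards one callee's result
          record Procedure(boolean returnsNullLiteral, String callee) {}

          private final Map<String, Procedure> code = new HashMap<>();
          private final Map<String, Summary> summaries = new HashMap<>();
          private int analyzed = 0;          // how many procedure bodies we examined

          void define(String name, Procedure p) {
              code.put(name, p);
              summaries.remove(name);        // a changed procedure loses its summary
              // (a real tool would also invalidate summaries of transitive callers)
          }

          Summary summarize(String name) {
              Summary cached = summaries.get(name);
              if (cached != null) return cached;   // reuse: no re-analysis
              analyzed++;
              Procedure p = code.get(name);
              boolean mayNull = p.returnsNullLiteral()
                      || (p.callee() != null && summarize(p.callee()).mayReturnNull());
              Summary s = new Summary(mayNull);
              summaries.put(name, s);
              return s;
          }

          public static void main(String[] args) {
              SummarySketch a = new SummarySketch();
              a.define("leaf", new Procedure(true, null));   // deep callee may return null
              a.define("mid",  new Procedure(false, "leaf"));
              a.define("top",  new Procedure(false, "mid"));
              System.out.println(a.summarize("top"));        // Summary[mayReturnNull=true]
              System.out.println("analyzed: " + a.analyzed); // 3

              // Incremental: a new caller is checked against cached callee summaries
              a.define("caller2", new Procedure(false, "top"));
              System.out.println(a.summarize("caller2"));    // Summary[mayReturnNull=true]
              System.out.println("analyzed: " + a.analyzed); // 4 (only the new body)
          }
      }

    Parallelizing is similarly natural in this style: procedures whose callees
      are already summarized can be analyzed independently of one another, since
      a summary depends only on callee summaries, not on callers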