Static analysis tools
=====================

Papers:
  "Lessons from Building Static Analysis Tools at Google"
  "Scaling Static Analyses at Facebook"

Nicely complementing papers
  Google paper focuses on "breadth"
  Facebook paper focuses on "depth"
  Lots of similarities between Google, Facebook, beyond what papers focus on
    Similar environment: common unified repository, review workflow, CI tools
    Facebook paper hints Facebook does lint-style checks too, like Google
    Google also has more in-depth static analysis in some specific projects
  But still interesting to contrast "breadth" vs "depth" approaches, even for static analysis tools that don't go to full verification

Focus of these papers: static analysis
  Both Google, Facebook make extensive use of tests
  Both also do runtime checking / instrumentation
    AddressSanitizer, ThreadSanitizer, ...
  Static analysis used in ways that complement test suites, runtime checking

Strength of static analysis
  Corner cases that are hard to catch through tests or runtime instrumentation
  Don't need to actually get an execution that triggers potential issue
  Same argument as "full" verification: reason about all possible executions

What are the specs that these tools are checking for?
  Cross-cutting specs, not app-specific
    Developer does not write a spec for their code
    Spec must be a general property expected to hold of all code
    Often focused on internal aspects, not externally-observable behavior
      More of an invariant than a spec for external behavior
  Localized specs
    Ideally specs capture something that must be true of code at small scale
    Facebook goes a bit further: inter-procedural analysis, with clear summaries
      But must be able to mechanically infer specs for each procedure
  Partial specs that catch common bugs or specific aspect, not prove correctness
  Co-designed specs together with developer, code base
    Developers must believe this spec is worth checking for
    Google: "motherhood and apple pie" kind of specs, everyone agrees, low FP
    Facebook: important specs where developers believe this is tricky stuff
  Focus on practical specs that can be checked, as opposed to complete specs

Google case study: FindBugs (the kind of pattern it flags is sketched after this case study)
  Surprisingly common themes of how analyses can fail across Google, Facebook
  Attempt 1: dashboard
    Couldn't get developers to care about reports from the analysis tool
    Facebook also had the same problem
      0% fix rate for batch reports
      70% fix rate for reports at review time ("diff time")
    Interestingly, Facebook made this work eventually: "master" bug list
      Security engineers own dashboard, in effect
  Attempt 2: file bugs based on analysis reports
    Developers didn't think the reports were worth fixing
    Facebook similarly reports: "near silence" when assigning reports to engineers
      Perhaps a case of the "context switch effort" problem that Facebook mentions
      Perhaps less clear who should fix the bug (Facebook paper)
      In contrast, review-time reports are more likely seen by relevant developer
    Facebook made this work by having security engineers triage analysis reports
      When security engineer files bug, developers take it more seriously
  Attempt 3: integrated into code review
    A bit unclear what exactly went wrong
    Sounds like developers of different modules customized the tool too much
    Everyone got different reports, developers confused about what the tool does?
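  Example of the kind of pattern involved (a made-up sketch, not from the papers):
    a localized, cross-cutting check that FindBugs-style tools flag in Java,
    with no app-specific spec needed ("do not compare string contents with ==")

      // Hypothetical code, for illustration only
      class RoleCheck {
          static boolean isAdmin(String role) {
              // FindBugs-style warning: reference comparison of strings;
              // almost certainly meant role.equals("admin")
              return role == "admin";
          }
      }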
Lessons learned for analysis tools?

Google paper: avoiding pitfalls
  Biggest point: integration into developer workflow
    Compile-time, review-time
    Both Google and Facebook do larger-scale "master" bug list, batch reports
      Facebook's workflow: security team figures out what to do with master report
      Google's workflow: try to get the master list to zero before integrating tool
    Facebook had a good characterization: "mental effort of context switch"
      Once change is merged, lots of overhead to go back and fix things
      Much easier to fix while writing code (Google's focus on compile-time checks)
      Also not too bad to fix things while code is being reviewed
  Actionable warnings
    Google: review-time checks if too many false positives in legacy code;
      not viable as a compile-time check because it would overwhelm the developer
  Users trust the tool
    Google: low effective FP
    Facebook: focus on hard problems appreciated by some teams
      Security
      Concurrency
      Closer to the Amazon story (distributed systems also a hard problem)
  Bugs must be relevant in practice / Fixing the issue must be easy enough
    Facebook: focus on important bugs
    Google: surface the warning sooner (compile-time)
  Developers understand warnings
    Google example: developers misunderstood printf warning

Common: leverage centralized CI
  Incremental analysis
    Perhaps too expensive to run at compile-time
    Perhaps generates too many false positives for existing code
    Incremental mode possible due to centralized cache of analysis state (Infer)

Different: focus, big metrics
  Google: focus on "effective false positive"
    If developer changes the code, that's not a false positive
    If developer doesn't fix an actual fault, that's effectively a false positive
    Review tool has a button to flag analysis reports that are not useful
    Paper describes how Google targets broad range of "common" bugs
    First "lesson learned" is that, for any analysis you run, you find bugs
  Facebook: focus on "missed bugs"
    Paper targets bugs in specific areas of importance
      Crashes
      Security
      Concurrency
    Different mindset
      "How could we have caught it with static analysis?"
    Surprisingly low missed bug counts, for categories they target!
      3 data race bugs (due to error in analyzer itself)
      11 missed security bugs (some due to error in model or tool)
    Developers tolerate false positives
      50% fix rate for concurrency bug reports
      But developers still like the reports: hard to write correct concurrent code

Different: kinds of specs/analyses
  Google: style checkers, error-prone patterns, type-based checks, unused vars
  Facebook: memory safety, data flow for security, concurrency race conditions

What makes analyses scale?
  Local analysis
  Incremental analysis
  Some bugs require interprocedural analysis
    OpenSSL null pointer bug (a Java sketch of this bug shape follows these examples):
      revtm = X509_gmtime_adj(...);
        .. X509_gmtime_adj calls other procedures
        .. some procedure, 5 calls deep, can return null
        .. this eventually propagates to revtm
      i = revtm->length + 1;
    Figure 3: data path from user input to form submission URL
      render() calls getIDs()
        gets member_id from HTTP request
      passes result to getConfirmationForm()
        member_id gets used as the action URL of a form
      returns resulting form to the HTTP client
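    Java sketch of the OpenSSL-shaped bug above (hypothetical code, not from
      either paper): the null originates a couple of calls deep and is only
      dereferenced by the top-level caller, so no single procedure looks wrong
      in isolation; this is the kind of report an Infer-style interprocedural
      analysis can produce

      // All names made up for illustration
      class RevocationTime {
          static String lookupTimestamp(String id) {
              return null;                // deep callee: may return null
          }
          static String adjustTime(String id) {
              return lookupTimestamp(id); // just forwards the possibly-null value
          }
          static int revokedLength(String id) {
              String revtm = adjustTime(id);
              return revtm.length() + 1;  // NullPointerException if revtm is null
          }
      }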
  Compositional analysis
    Procedure summary: need to infer procedure specs!
      Null pointer: need to infer what args, return values might be null?
        null return value might depend on arguments
      Facebook's Infer deduces pre/postcondition specs
        Deduced specs are checked to be correct, but might be too coarse
      Data flow: which values contain user input, which values are used in web page
        Similarly, these inferred specs/summaries are correct but maybe coarse
    Compositional analysis naturally good fit for scale
      Relatively easy to parallelize: independently check each procedure
      Relatively easy to do incremental analysis: reuse existing summaries
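  To make the last two points concrete, a toy sketch (not Infer's actual
    algorithm; all names made up) of compositional analysis with per-procedure
    summaries: each procedure gets a "may this return null?" summary computed
    from its callees' summaries, summaries are cached, and checking new or
    changed code reuses cached summaries instead of re-analyzing everything

      import java.util.*;

      public class SummarySketch {
          // Toy summary: can this procedure return null?
          record Summary(boolean mayReturnNull) {}
          // Toy procedure body: returns null directly, or forwards one callee's result
          record Procedure(boolean returnsNullLiteral, String callee) {}

          private final Map<String, Procedure> code = new HashMap<>();
          private final Map<String, Summary> summaries = new HashMap<>();
          private int analyzed = 0;          // how many procedure bodies we examined

          void define(String name, Procedure p) {
              code.put(name, p);
              summaries.remove(name);        // a changed procedure loses its summary
              // (a real tool would also invalidate summaries of transitive callers)
          }

          Summary summarize(String name) {
              Summary cached = summaries.get(name);
              if (cached != null) return cached;   // reuse: no re-analysis
              analyzed++;
              Procedure p = code.get(name);
              boolean mayNull = p.returnsNullLiteral()
                      || (p.callee() != null && summarize(p.callee()).mayReturnNull());
              Summary s = new Summary(mayNull);
              summaries.put(name, s);
              return s;
          }

          public static void main(String[] args) {
              SummarySketch a = new SummarySketch();
              a.define("leaf", new Procedure(true, null));   // deep callee may return null
              a.define("mid",  new Procedure(false, "leaf"));
              a.define("top",  new Procedure(false, "mid"));
              System.out.println(a.summarize("top"));        // Summary[mayReturnNull=true]
              System.out.println("analyzed: " + a.analyzed); // 3

              // Incremental: a new caller is checked against cached callee summaries
              a.define("caller2", new Procedure(false, "top"));
              System.out.println(a.summarize("caller2"));    // Summary[mayReturnNull=true]
              System.out.println("analyzed: " + a.analyzed); // 4 (only the new body)
          }
      }

    Parallelizing is similarly natural in this style: procedures whose callees
      are already summarized can be analyzed independently of one another, since
      a summary depends only on callee summaries, not on callers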