CS7580

Intensive Construction of Software and its Engineering

Scaling Software Engineering to Big Code with Static Analysis, Dynamic Analysis and Machine Learning

Syllabus

One of the aims of modern software engineering is to automate the process of understanding software artifacts. This course focuses on techniques for analyzing programs to find, e.g., security vulnerabilities and errors, as well as understanding the software development process. You will gain a working knowledge of program analysis and its theoretical foundations rooted in abstract interpretation. The class starts with lectures on key concepts of program analysis including data flow analysis, constraint-based analysis and abstract interpretation. Applications to security and bug finding will be used as motivation. Topics in dynamic analysis and machine-learning will be covered to explore the tradeoff between soundness and scalability. The second part of the course focuses on reading, presenting and validating research papers.

Readings

You are expected to read the following before each lecture, it will increase your understanding of the material.

Week Readings
1
2
  • PoPA Chapter 2 (stop before interprocedural analyis)
3
  • Identify project candidates and write project abstracts.
4
5
6
7
8
  • Midterm
9
10
11
12
13
  • Thanksgiving
14
  • Project presentations


Workload and Grading

The expected workload is one day a week. Mostly reading and writing. Some opportunities for coding are in the second half but they are optional. The deliverables are:

  • One midterm exam
  • Two, one-page, paper reviews
  • One, in-class, lecture
  • One project

The project is the only open-ended task. While grades are besides the point in a research class, they are required and the expectation is for them to fall the range [B, B+, A-, A], where a B is below average, B+ and A- mean expectations were met, and A+ means awesome.

Project

Working in groups students, you will select a research paper, obtain its data and code and perform a reproduction study, write a report, and give a lecture.

Reproduction study

Depending on time and energy students can perform a repetition, reanalysis or reproduction. A repetition is the simplest kind of reproduction where you simply run the code provided by the authors on their data and report on the results. A reanalysis starts from the author’s code and data, but attempts to evaluate if the results are robust in the presence of changes to the methodology. A reproduction is an attempt to replicate an experiment from first principles.

Report

Your report should do the following: (1) summarize the paper’s approach and results, (2) describe how you validated the results, and (3) summarize what you found. In simple cases, when all is as claimed, the report can be just a couple of pages. If there are interesting differences a report could be extended into a full paper.

Lecture

A lecture is 90 minutes long including time for questions and discusion. The goal is to give an in-depth presentation of the paper, placing it in context of previous research and talk about is strenghs and limitation.

Textbook and Papers

The textbook is Principle of Program Analysis (PPA) by F. Nielson, H.R. Nielson and C. Hankin. (An old version of the text is on ResearchGate here. The newest edition has 476 pages.)

Testing
  • Fairness Testing: Testing Software for Discrimination FSE 2017. link
  • Do Automatically Generated Unit Tests Find Real Faults? ASE 2015. link
  • QuickChecking Static Analysis Properties STVR 2017. link
Bug finding
  • Finding Broken Promises in Asynchronous JavaScript Programs OOPSLA 2018. link
  • ExceLint: Automatically Finding Spreadsheet Formula Errors OOPSLA 2018. link
  • To Type or Not to Type: Quantifying Detectable Bugs in JavaScript ICSE 2017. link
Security
  • Automated Identification of Security Issues from Commit Messages and Bug Reports FSE 2017. link
  • ZENIDS: Introspective Intrusion Detection for PHP Applications ICSE 2017.. link
Algorithms
  • Active Learning of Points-To Specifications PLDI 2018. link
  • P/Taint: Unified Points-to and Taint Analysis OOPSLA 2017. link
  • Modular static analysis of string manipulations in C programs SAS 2018. link
Big Code
  • Automatically Diagnosing and Repairing Error Handling Bugs in C FSE 2017. link
  • Recovering Clear, Natural Identifiers from Obfuscated JS Names FSE 2017. link

Reviews

“Jan Vitek is the most horrible professor I’ve ever met. […] he made students read and learn all by themselves. […] He kind of enjoys watching students working so hard over night and night” Be warned.