Schedule Thursdays, 6–9pm @ WVH 110
Staff Jan Vitek, with invited lectures by Konrad Siek.
Contact Piazza for all communication.
Experts John Tristan (ML Lead, Oracle Labs), Jiahao Chen (DS Manager, CapitalOne)

Do you want to: Build a machine learning portfolio? Solve real data science problems? Improve your programming skills? Meet data science experts?


The class has the following data science prerequisites: DS5110, DS5220, and DS5230.


This course encourages students to solve real-world data science problems by applying the skills they obtained in previous classes of their Data Science program. Students will gain practical experience with the key steps of any data science project, namely data import, data tidying and transformation, statistical modeling, and visualization. The course combines a programming component in R with a machine learning component. Repeatability and reproducibility of results will be emphasized.


Data Science is a discipline that combines computing with statistics. A data analysis problem is solved in a series of data-centric steps: data acquisition and representation (Import), data cleaning (Tidy), and an iterative sequence of data transformation (Transform), data modeling (Model), and data visualization (Visualize). The end result of the process is to communicate insights obtained from the data (Communicate). This class will take you through all the steps in the process and will teach you how to approach such problems in a systematic manner. You will learn how to design and implement data analysis pipelines. The class will also emphasize how elegant code leads to reproducible science.
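The Import → Tidy → Transform → Model → Visualize steps above can be sketched as a small tidyverse pipeline. This is a minimal illustration with made-up data and column names, not part of any assignment:

```r
# Sketch of the data analysis pipeline using the tidyverse.
# The data frame and column names here are hypothetical.
library(tidyverse)

# Import: normally read_csv("commits.csv"); here we build a toy tibble.
df <- tibble(project = c("a", "a", "b"),
             loc     = c(100, 150, 80),
             bugs    = c(3, 5, 1))

# Tidy/Transform: aggregate per project and derive a new column.
summary_df <- df %>%
  group_by(project) %>%
  summarize(total_loc = sum(loc), total_bugs = sum(bugs)) %>%
  mutate(bug_density = total_bugs / total_loc)

# Model: a simple linear model relating defect counts to code size.
fit <- lm(total_bugs ~ total_loc, data = summary_df)

# Visualize/Communicate: plot the relationship.
p <- ggplot(summary_df, aes(total_loc, total_bugs)) + geom_point()
```

Each stage hands a plain data frame to the next, which is what makes the pipeline easy to rerun end to end when the data changes.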

The class will feature a couple of small starter projects and one class project in which students will deliver a solution to a real-world data analysis challenge. All projects will be done individually. Starter projects will be done in the R programming language; the class project can be done using any language or tool, but class staff will only be able to provide detailed support for technologies they are familiar with.


The goal of this class is to reinforce the Data Science practices that were discussed in previous classes. These are summarized by the six activities listed below (adapted from “The levels of data science class” by Jeff Leek).

  • Asking: How to define a question, turn that question into a statement about data, identify data sets that may be applicable, and design the experiment.
  • Telling: How to write about data science, express models qualitatively and in mathematical notation, interpret results of models and make figures.
  • Practicing: Learn the basic tools of R, load data of various types, read data, plot data.
  • Scaling: Manipulate different file formats, work with “messy” data, organize multiple data sets into one data set, deal with real-world large datasets.
  • Solving: Use real data examples, but work them through from start to finish as case studies, with a clear path from the beginning of the problem to the end.
  • Science: Formulate your own questions, and try to solve them, at scale, using the tools you are familiar with, and communicate your conclusions effectively.


The projects will focus on analyzing a large dataset extracted from GitHub. The challenges to overcome include large-scale data acquisition (the target is 1 million projects), data cleaning (the data is not guaranteed to be clean; on the contrary, it is known to be messy), data representation (how do we model the data?), data storage (defining a storage format, e.g., a relational database), data analysis, choosing machine learning techniques (NLP, clustering, etc.), and visualizing the results.
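One way the acquisition step might begin is by paging through GitHub's public REST API. The sketch below uses the httr and jsonlite packages against the real `GET /repositories` endpoint, but it deliberately omits authentication and rate-limit handling, which any run at the 1-million-project scale would need:

```r
# Sketch: fetch public-repository metadata from the GitHub REST API.
# Assumes the httr and jsonlite packages; authentication, retries, and
# rate-limit handling are left out of this illustration.
library(httr)
library(jsonlite)

fetch_repos <- function(since = 0) {
  resp <- GET("https://api.github.com/repositories",
              query = list(since = since),
              add_headers(Accept = "application/vnd.github+json"))
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

# Each call returns a page of repositories; to page through, pass the
# largest id seen so far as `since`:
# repos <- fetch_repos()
# next_page <- fetch_repos(since = max(repos$id))
```

Raw JSON responses like these would then be flattened into the chosen storage format (e.g., tables in a relational database) before any cleaning or analysis.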

One paper that is an example of an analysis of GitHub data is “A large-scale study of programming languages and code quality in GitHub” by Ray et al.



Lectures will review the basics of data science and reproducibility and, depending on student interest and need, topics in machine learning and programming: R, Markdown, data frames, ggplot, the tidyverse, and Shiny.
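As a taste of the Shiny material, a complete app fits in a few lines: a UI declaration, a server function, and a call to `shinyApp`. This toy example (assuming the shiny package is installed) plots a user-chosen number of random points:

```r
# Minimal Shiny sketch: an interactive scatter plot of random points.
# Purely illustrative; not a course assignment.
library(shiny)

ui <- fluidPage(
  sliderInput("n", "Number of points", min = 10, max = 1000, value = 100),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n), pch = 19)
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # starts the app and opens it in a browser
```

The same reactive pattern scales up to dashboards for communicating project results.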

  • Lecture 1/10: Introduction and placement exam.
  • Lecture 1/17: GitHub; “Large Scale Study”; bash; exam review; Git API notes
  • Lecture 1/24: Data management and databases. Papers: DéjàVu: A Map of Code Duplicates on GitHub by Alex Bender and What is the Truck Factor of Popular GitHub Apps? by Nick Tyler.
  • Lecture 1/31: Data wrangling with dplyr. Papers: An Algorithm to Classify GitHub Repos by Justin Littman, DéjàVu: A Map of Code Duplicates on GitHub by Alex Gomez, and What is the Truck Factor of Popular GitHub Apps? by Noul Singla.
  • Lecture 2/07: Repeatable science with Rmd.
    Visualization with ggplot and Shiny. Papers: Promises and perils of mining GitHub by Tyler Brown, Mining the Network of the Programmers by Jingci Wang, and An Algorithm to Classify GitHub Repos by Troy Yang.
  • Lecture 2/14: No class
  • Lecture 2/21: Advanced R: Profiling and Debugging. Papers: What’s in a GitHub Star? by Tajas Bala and Mining the Network of the Programmers by Tim Sauchuk.
  • Lecture 2/28: Guest Lecture on Bias in Machine Learning by Tina Eliassi-Rad. Test-driven development for data science. Papers: What’s in a GitHub Star? by Qinyu. A Large Scale Study of Programming Languages and Code Quality by Xubo Tang.
  • Lecture 3/07: Spring break
  • Lecture 3/14: Industry panel. Project pitches.
  • Lecture 3/21: Guest Lecture on Model interpretation by Olga Vitek.
  • Lecture 3/28: Advanced R: OOP, Packages, Scalability
  • Lecture 4/04: Julia and other data science languages and pipelines. Adoption. Security.
  • Lecture 4/11: No class
  • Lecture 4/18: Final projects due.


The grade in this class will be entirely based on the final deliverables, which consist of (a) a 1-minute elevator pitch, (b) a 10-minute presentation, (c) a project report, and (d) a GitHub repository with software.

Projects are individual.

Expert panel

Projects will be presented to a panel of experts in data science. Currently two experts are confirmed: John Tristan and Jiahao Chen.