DS 6050 – Expeditions in Data Science

Schedule	Thur, 6–9pm @ WVH110
Staff	Jan Vitek and invited lectures by Konrad Siek.
Contact	Piazza for all communication.
Experts	John Tristan (ML Lead, Oracle Labs), Jiahao Chen (DS Manager, CapitalOne)

Do you want to: Build a machine learning portfolio? Solve real data science problems? Improve your programming skills? Meet data science experts?

Pre-requisites

The class has the following data science pre-requisites: DS5110, DS5220 and DS5230.

Syllabus

This course encourages students to solve real-world data science problems by applying the skills they obtained in previous classes of their Data Science program. Students will gain practical experience with the key steps of any data science project, namely, data import, data tidying and transformation, statistical modelling, and visualization. The course combines a programming component in R with a machine learning component. Repeatability and reproducibility of results will be emphasized.

Overview

Data Science is a discipline that combines computing with statistics. A data analysis problem is solved in a series of data-centric steps: data acquisition and representation (Import), data cleaning (Tidy), and an iterative sequence of data transformation (Transform), data modelling (Model) and data visualization (Visualize). The end result of the process is to communicate insights obtained from the data (Communicate). This class will take you through all the steps in the process and will teach you how to approach such problems in a systematic manner. You will learn how to design and implement data analysis pipelines. The class will also emphasize how elegant code leads to reproducible science.
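As a rough illustration of the Import, Tidy, Transform, Model and Visualize steps, here is a minimal sketch in Python using only the standard library. The data, column names and function names are invented for this example and do not come from the course material; starter projects in the class itself use R.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data: a tiny CSV with one missing value and one malformed value.
RAW = """language,commits,bugs
R,100,10
Python,200,18
Julia,,4
C,300,31
Go,abc,7
"""

def import_data(text):
    """Import: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def tidy(rows):
    """Tidy: keep only rows whose numeric fields parse cleanly."""
    clean = []
    for row in rows:
        try:
            clean.append({"language": row["language"],
                          "commits": int(row["commits"]),
                          "bugs": int(row["bugs"])})
        except ValueError:
            continue  # drop rows with missing or malformed numbers
    return clean

def transform(rows):
    """Transform: derive a bug rate per commit."""
    return [{**r, "bug_rate": r["bugs"] / r["commits"]} for r in rows]

def model(rows):
    """Model: least-squares slope and intercept of bugs against commits."""
    xs = [r["commits"] for r in rows]
    ys = [r["bugs"] for r in rows]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def visualize(rows):
    """Visualize: a crude text bar chart of the bug rate (one '#' per percent)."""
    return [f"{r['language']:<8}{'#' * round(r['bug_rate'] * 100)}" for r in rows]

rows = transform(tidy(import_data(RAW)))
slope, intercept = model(rows)
print("\n".join(visualize(rows)))
print(f"bugs = {slope:.3f} * commits + {intercept:.2f}")
```

Each step is a small function that takes the previous step's output, so any one of them can be rerun or replaced without touching the others, which is one way to keep an analysis repeatable.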

The class will feature a couple of small starter projects and one class project in which students will deliver a solution to a real-world data analysis challenge. All projects will be done individually. Starter projects will be done in the R programming language. The class project can be done using any language or tool, but class staff will only be able to provide detailed support for technologies they are familiar with.

Philosophy

The goal of this class is to reinforce the Data Science practices that were discussed in previous classes. These are summarized by the six activities listed next (adapted from “The levels of data science class” by Jeff Leek).

Asking: How to define a question, turn that question into a statement about data, identify data sets that may be applicable, and design the experiment.
Telling: How to write about data science, express models qualitatively and in mathematical notation, interpret results of models and make figures.
Practicing: Learn the basic tools of R, load data of various types, read data, plot data.
Scaling: Manipulate different file formats, work with “messy” data, organize multiple data sets into one data set, deal with real-world large datasets.
Solving: Use real data examples, but work them through from start to finish as case studies, with a clear path from the beginning of the problem to the end.
Science: Formulate your own questions, and try to solve them, at scale, using the tools you are familiar with, and communicate your conclusions effectively.

Projects

The projects will focus on analyzing a large dataset extracted from GitHub. The challenges to overcome include large-scale data acquisition (the target is 1 million projects), data cleaning (the data is not guaranteed to be clean; on the contrary, it is known to be messy), data representation (how do we model the data), data storage (define a storage format, e.g. a relational database), data analysis, the choice of machine learning techniques (NLP, clustering, etc.), and visualization of the results.
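To make the cleaning, representation and storage challenges a little more concrete, here is a minimal sketch that loads a few made-up project records into an in-memory SQLite database with Python's standard sqlite3 module. The schema, table names and sample records are assumptions made for illustration, not a prescribed format for the class project.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE project (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE,
    language TEXT
);
CREATE TABLE commit_log (
    sha        TEXT PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES project(id),
    message    TEXT
);
"""

def build_db(records):
    """Clean the records and store them in an in-memory relational database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for name, language, commits in records:
        # Cleaning: normalize whitespace and case; treat an empty language as unknown.
        name = name.strip().lower()
        language = (language or "").strip() or None
        # Duplicate projects and commits are silently skipped.
        conn.execute(
            "INSERT OR IGNORE INTO project (name, language) VALUES (?, ?)",
            (name, language),
        )
        (pid,) = conn.execute(
            "SELECT id FROM project WHERE name = ?", (name,)
        ).fetchone()
        conn.executemany(
            "INSERT OR IGNORE INTO commit_log VALUES (?, ?, ?)",
            [(sha, pid, msg) for sha, msg in commits],
        )
    conn.commit()
    return conn

# Made-up sample records, including a duplicate project and a missing language.
SAMPLE = [
    ("alice/parser", "R", [("a1", "fix bug in lexer"), ("a2", "add tests")]),
    (" Alice/Parser ", "R", [("a2", "add tests")]),
    ("bob/viz", "", [("b1", "initial commit")]),
]

conn = build_db(SAMPLE)
counts = conn.execute("""
    SELECT p.name, COUNT(c.sha)
    FROM project p LEFT JOIN commit_log c ON c.project_id = p.id
    GROUP BY p.id ORDER BY p.name
""").fetchall()
print(counts)
```

Putting uniqueness constraints in the schema lets the database reject duplicates, so the loading code stays short even when the input is messy. At the scale of 1 million projects, a file-backed database and batched inserts would likely be needed.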

One paper that is an example of an analysis of GitHub data is “A large-scale study of programming languages and code quality in GitHub” by Ray et al.

Assignments

Readings

Lectures

Lectures will review the basics of data science and reproducibility and, depending on student interest and need, topics in machine learning and programming: R, Markdown, dataframes, ggplot, tidyverse, Shiny.

Lecture 1/10: Introduction and placement exam.
Lecture 1/17: GitHub; “Large Scale Study”; bash; exam review; Git API notes
Lecture 1/24: Data management and databases. Papers: DéjàVu: A Map of Code Duplicates on GitHub by Alex Bender and What is the Truck Factor of Popular GitHub Apps? by Nick Tyler.
Lecture 1/31: Data wrangling with dplyr. Papers: An Algorithm to Classify GitHub Repos by Justin Littman, DéjàVu: A Map of Code Duplicates on GitHub by Alex Gomez, and What is the Truck Factor of Popular GitHub Apps? by Noul Singla.
Lecture 2/07: Repeatable science with Rmd. Visualization with ggplot and Shiny. Papers: Promises and perils of mining GitHub by Tyler Brown, Mining the Network of the Programmers by Jingci Wang, and An Algorithm to Classify GitHub Repos by Troy Yang.
Lecture 2/14: No class
Lecture 2/21: Advanced R: Profiling and Debugging. Papers: What’s in a GitHub Star? by Tajas Bala and Mining the Network of the Programmers by Tim Sauchuk.
Lecture 2/28: Guest Lecture on Bias in Machine Learning by Tina Eliassi-Rad. Test-driven development for data science. Papers: What’s in a GitHub Star? by Qinyu and A Large Scale Study of Programming Languages and Code Quality by Xubo Tang.
Lecture 3/07: Spring break
Lecture 3/14: Industry panel. Project pitches.
Lecture 3/21: Guest Lecture on Model interpretation by Olga Vitek.
Lecture 3/28: Advanced R: OOP, Packages, Scalability
Lecture 4/04: Julia and other data science languages and pipelines. Adoption. Security.
Lecture 4/11: No class
Lecture 4/18: Final projects due.

Grading

The grade in this class will be based entirely on the final deliverables, which consist of (a) a 1-minute elevator pitch, (b) a 10-minute presentation, (c) a project report, and (d) a GitHub repository with software.

Projects are individual.

Expert panel

Projects will be presented to a panel of experts in data science. Currently two experts are confirmed: John Tristan and Jiahao Chen.

	Joseph Goldbeck is a Data Engineer at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Prior to Manifold, Joe was a Senior Software Engineer at TetraScience, a life sciences data integration and analytics platform company. Joe earned his Master of Arts degree in Neuroscience from the University of California–Berkeley, supported by a National Science Foundation Graduate Research Fellowship, and his Bachelor of Science degree in Brain and Cognitive Science from the Massachusetts Institute of Technology (MIT), with awards for outstanding academics and research.
	Shantam Gupta earned an MSc in Data Science from Northeastern in 2018. He works as a Machine Learning Engineer at Quantiphi Inc. (https://www.quantiphi.com/). He is also a Data Analytics Recitation Instructor for Higher-Level Education. He enjoys helping industry professionals learn and apply the concepts of Probability & Statistics, Business Intelligence, Data Mining, Data Visualization and Machine Learning by harnessing the power of analytical tools.
	John Tristan leads a research team in the Machine Learning group at Oracle in Boston. He is interested in parallel and distributed computing, scalable machine learning, probabilistic modeling and inference, and language and compiler design for parallel architectures. He was a post-doctoral fellow at Harvard University, received a Ph.D. in computer science from University Denis Diderot (Paris 7) and a master's degree in computer science from the Ecole Normale Superieure of Paris.

Sample projects

To give an idea of what a final project could look like, consider the following projects (from other sources):

Textbooks

While there are no prescribed textbooks, we may review, if needed, material from the following:

R for Data Science by Wickham & Grolemund
Advanced R by Wickham
Machine Learning with R by Lantz

Reviews from past classes

“Jan Vitek is the most horrible professor I’ve ever met. […] he made students read and learn all by themselves. […] He emphasize[s] too much on coding style instead of teaching knowledge […] he wants students to write a so-called report to help him better understand the code submitted. […] He gave students incredibly many assignments to do. He kind of enjoys watching students working so hard over night and night”