Schedule W/18-21
Location Ryder Hall 124
Staff Jan Vitek, Filip Krikava, Aviral Goel
Contact Piazza for all communication. Registration is open; post sensitive questions as private notes.
Textbook RfDS: R for Data Science by Wickham and Grolemund; AR: Advanced R by Wickham.

Syllabus

This course discusses the practical issues and techniques for data importing, tidying, transforming, and modeling. The course will offer a gentle introduction to techniques for processing big data. Programming is a cross-cutting aspect of the course. Students will gain experience with data science tools through short assignments. The course work includes a term project based on real-world data. Required topics include:

  • Data management and processing: definition & background
  • Data transformation
  • Data import
  • Data cleaning
  • Data modeling
  • Relational and analytic databases
  • Basics of SQL
  • Programming in R and/or Python
  • MapReduce fundamentals and distributed data management
  • Data processing pipelines, connecting multiple data management and analysis components
  • Interaction between the capabilities and requirements of data analysis methods (data structures, algorithms, memory requirements) and the choice of data storage and management tools
  • Repeatable and reproducible data analysis

Overview

Data Science is a discipline that combines computing with statistics. A data analysis problem is solved in a series of data-centric steps: data acquisition and representation (Import), data cleaning (Tidy), and an iterative sequence of data transformation (Transform), data modelling (Model) and data visualization (Visualize). The end result of the process is to communicate insights obtained from the data (Communicate). This class will take you through all the steps in the process and will teach you how to approach such problems in a systematic manner. You will learn how to design data analysis pipelines and how to implement them in a widely used industrial language. The class will also emphasize how elegant code leads to reproducible science.
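As a rough illustration of these steps, the sketch below chains Import, Tidy, Transform, Model and Communicate over a tiny in-memory data set. It is written in Python with only the standard library; the function names, the column names and the benchmark timings are all invented for illustration and are not course material. The Visualize step is left out to keep the example dependency-free.

```python
import csv
import io
from statistics import mean

# Hypothetical benchmark timings; the values are made up for illustration.
RAW = """benchmark,run,seconds
sort,1,2.0
sort,2,2.4
sort,3,
hash,1,1.0
hash,2,1.2
"""

def import_data(text):
    """Import: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def tidy(rows):
    """Tidy: drop rows with a missing measurement and convert types."""
    return [
        {"benchmark": r["benchmark"], "run": int(r["run"]), "seconds": float(r["seconds"])}
        for r in rows
        if r["seconds"].strip()
    ]

def transform(rows):
    """Transform: group the measurements by benchmark."""
    groups = {}
    for r in rows:
        groups.setdefault(r["benchmark"], []).append(r["seconds"])
    return groups

def model(groups):
    """Model: summarize each group by its mean running time."""
    return {name: mean(values) for name, values in groups.items()}

def communicate(summary):
    """Communicate: render the result as a small plain-text report."""
    return "\n".join(f"{name}: {avg:.2f} s" for name, avg in sorted(summary.items()))

# Each step consumes the output of the previous one.
summary = model(transform(tidy(import_data(RAW))))
report = communicate(summary)
print(report)
```

Keeping each step as a separate function means any stage can be rerun or replaced on its own, which is one way a pipeline stays reproducible.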

Along the way you will pick up a number of skills: you will become proficient in the R programming language and in concepts such as data frames, vectorized computation, and lazy functional and object-oriented programming; you will learn how to use the grammar of graphics and literate programming; and you will be exposed to big data problems. We will also look at distributed processing with Spark and, time permitting, discuss other data processing languages such as Python.

The class will feature several small projects and will lead to a larger group project in which you will deliver a solution to a real-world data analysis challenge.

Case Studies

The class will be structured around four case studies. The first will focus on analyzing data coming from computer benchmarks (Computer Performance Evaluation), the second will look at the Airline On-Time Performance data (OTP), the third will have the student access a database containing Google Play data (Play), and the last will feature the student obtaining data directly from GitHub (Git).

Schedule

Date | Topic | Deliverable
01.11 | Introduction to Data Science. Case study: Computer performance. ToT: R, Markdown. |
01.18 | Designing Data Pipelines (I). Case study: Computer performance. ToT: ggplot, dataframes. Read: RfDS “Functions”, RfDS “Vectors”, RfDS “RMarkdown”, R3 | Exo 1 (Simple R, performance data)
01.25 | Visualizing and Presenting Data. Case study: Airline OTP. ToT: dataframes. Read: AR “Data Structures”, AR “Style Guide”, AR “Subsetting” | Exo 2 (ggplot, simple R, performance data)
02.01 | Designing Data Pipelines (II). Case study: Airline OTP. ToT: dplyr. Read: RfDS “Data visualization with ggplot”, RfDS “Workflow Basics”, RfDS “Data Transformation with dplyr” | Exo 3 (data.frames, OTP)
02.08 | Relational Data. Case study: Airline OTP. ToT: readr. Read: RfDS “Exploratory Data Analysis”, RfDS “Relational Data with dplyr” | Exo 4 (OTP)
02.15 | Relational Data. Case study: Google Play. ToT: SQL. Read: RfDS “Tibbles”, RfDS “Data Import”, RfDS “Tidy data” | Exo 5 (OTP)
02.22 | Midterm Review | Exo 6 (Play)
03.01 | Modelling. Case study: Google Play. Read: RfDS “Strings”, “Factors”, “Dates and times”, “Pipes”, “Iteration” | Exo 7 (Play)
03.08 | No Class - Spring break |
03.15 | Big Data. ToT: Hadoop | Exo 8 (optional Hadoop)
03.22 | Big Data. ToT: Spark | Exo 9 (Git)
03.29 | Graphics and Interaction. Markdown and Shiny | Exo 10 (Git)
04.05 | Performance, Profiling and Debugging | Project report draft
04.12 | Data Sprints Week | Project slides and presentation

Grading

The grade in this class will be based on (a) the final exam (20%), (b) the project presentation (20%), (c) the final project report (20%), (d) exercise sets (20%), and (e) in-class participation (20%). Possible grades are A, A-, B+, B, B-, C, D and F. Cheating is discouraged. Students involved in a cheating case will be given an F, expelled from class and reported.

Exercise sets

The class will feature 10 exercise sets (exosets). Exosets are due before class the week they appear on the schedule. Students are required to complete them as they reinforce the material discussed in class. Four randomly selected exosets will be graded; the best three scores will be retained.

In-class participation

Each student will get at least 5 in-class participation grades. These grades are based on answers to questions about assigned reading, codewalks, and presentations. Students are selected randomly for in-class participation. The four best grades are retained.
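The arithmetic behind these rules can be sketched as follows. The five 20% weights, the "best three of four" rule for exosets and the "four best" rule for participation come from this page. Everything else is an assumption made for illustration: the 0-100 scale, averaging the retained scores, and all function names. The mapping from a numeric score to a letter grade is not specified here, so the sketch stops at the weighted score.

```python
# Weights taken from the Grading section; each component counts for 20%.
WEIGHTS = {
    "final_exam": 0.20,
    "presentation": 0.20,
    "report": 0.20,
    "exercises": 0.20,
    "participation": 0.20,
}

def best_mean(scores, keep):
    """Average the `keep` highest scores (assumption: retained scores are averaged)."""
    if len(scores) < keep:
        raise ValueError("not enough scores")
    top = sorted(scores, reverse=True)[:keep]
    return sum(top) / keep

def course_score(final_exam, presentation, report, graded_exosets, participation):
    """Combine the five components into one weighted score (assumed 0-100 scale)."""
    components = {
        "final_exam": final_exam,
        "presentation": presentation,
        "report": report,
        "exercises": best_mean(graded_exosets, keep=3),     # best three of four
        "participation": best_mean(participation, keep=4),  # four best grades
    }
    return sum(WEIGHTS[name] * value for name, value in components.items())

# Hypothetical student; all numbers are invented for illustration.
score = course_score(
    final_exam=80,
    presentation=90,
    report=85,
    graded_exosets=[70, 100, 90, 80],
    participation=[100, 60, 80, 90, 70],
)
print(round(score, 1))
```

In this example the lowest exoset score (70) and the lowest participation grade (60) are dropped before the weighted sum is taken.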

Why R?

The following is excerpted from the Revolution blog.

O’Reilly has released the results of the 2016 Data Science Salary Survey. […] The median salary reported in the survey was US$87,000; amongst data scientists in the US, the median salary was US$106,000. […] The survey also reports on use of tools, and the top 3 in each category were as follows (respondents could select multiple tools in each category):

  • Operating Systems: Windows 74%; Linux 49%; Mac OS X 42%
  • Databases: MySQL 37%; Microsoft SQL Server 33%; Oracle 23%
  • Programming languages: SQL 70%; R 57%; Python 54%

Interestingly, the survey also reported on the tasks that data scientists perform: over 90% reported some kind of coding in their day-to-day work.

IEEE Spectrum has just published its third annual ranking with its 2016 Top Programming Languages, and the R Language is once again near the top of the list, moving up one place to fifth position. As I said last year (when R moved up to take sixth place), this is an extraordinary result for a domain-specific language. The other four languages in the top 5 (C, Java, Python and C++) are all general-purpose languages, suitable for just about any programming task. R by contrast is a language specifically for data science, and its high ranking here reflects both the critical importance of data science as a discipline today, and of R as the language of choice for data scientists.

Resources

Video Tutorials
Copyright Northeastern University, 2017