Schedule: TBD
Staff: Prof. Jan Vitek
Contact: Piazza for all communication. Registration is open.

Do you want to

  • create an impressive machine learning portfolio?
  • solve real data science problems?
  • improve your programming skills?


This course encourages students to solve real-world data science problems by applying the skills they obtained in previous classes of the CCIS Data Science program. Students will gain practical experience with the key steps of any data science project, namely data import, data tidying and transformation, statistical modelling, and communication. The course combines a programming component (in either R or Python) with a machine learning and statistical modelling component. Repeatability and reproducibility of results will be emphasized.


Data Science is a discipline that combines computing with statistics. A data analysis problem is solved in a series of data-centric steps: data acquisition and representation (Import), data cleaning (Tidy), and an iterative sequence of data transformation (Transform), data modelling (Model) and data visualization (Visualize). The end result of the process is to communicate insights obtained from the data (Communicate). This class will take you through all the steps in the process and will teach you how to approach such problems in a systematic manner. You will learn how to both design and implement data analysis pipelines. The class will also emphasize how elegant code leads to reproducible science.
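As a rough illustration of the steps above, the following Python sketch runs a toy pipeline from Import to Communicate using only the standard library. The data set, the function names, and the choice of a least-squares line as the model are all hypothetical and made up for this example; they are not course material. The Visualize step is left out to keep the sketch free of plotting libraries.

```python
import csv
import io

# Hypothetical raw data: hours studied vs. exam score, with one incomplete row.
RAW_CSV = """hours,score
1,52
2,54
3,
4,58
5,60
"""


def import_data(text):
    """Import: parse CSV text into a list of dict rows (values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def tidy(rows):
    """Tidy: drop rows with missing values and convert the rest to floats."""
    clean = []
    for row in rows:
        if all(value not in (None, "") for value in row.values()):
            clean.append({key: float(value) for key, value in row.items()})
    return clean


def transform(rows):
    """Transform: split the rows into a predictor column and a response column."""
    xs = [row["hours"] for row in rows]
    ys = [row["score"] for row in rows]
    return xs, ys


def model(xs, ys):
    """Model: fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def communicate(slope, intercept):
    """Communicate: turn the fitted parameters into a one-sentence summary."""
    return (
        f"Each extra hour is associated with {slope:.1f} more points "
        f"(baseline {intercept:.1f})."
    )


def run_pipeline(text):
    """Run every step in order so the whole analysis can be repeated with one call."""
    rows = tidy(import_data(text))
    xs, ys = transform(rows)
    slope, intercept = model(xs, ys)
    return communicate(slope, intercept)


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV))
```

Keeping each step in its own small function, with a single entry point that chains them, is one way to make an analysis easy to rerun and check, which is the sense of reproducibility the class emphasizes.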

The class will feature a couple of small starter projects and one class project in which students will deliver a solution to a real-world data analysis challenge. All projects will be done individually. Starter projects will be done in the R programming language. The class project can be done using any language or tool, but class staff will only be able to provide detailed support for technologies they are familiar with.


Lectures will review the basics of data science and reproducibility and, depending on student interest and need, topics in machine learning and programming.


The grade in this class will be based entirely on the final deliverables, which consist of (a) a 1-minute elevator pitch, (b) a 10-minute presentation, (c) a project report, and (d) a GitHub repository with the software.

Projects are individual.

Sample projects

To give an idea of what a final project could look like, consider the following projects (from other sources):


While there are no prescribed textbooks, we may review, if needed, material from the following:

Reviews from past classes

“Jan Vitek is the most horrible professor I’ve ever met. […] he made students read and learn all by themselves. […] He emphasize[s] too much on coding style instead of teaching knowledge […] he wants students to write a so-called report to help him better understand the code submitted. […] He gave students incredibly many assignments to do. He kind of enjoys watching students working so hard over night and night”

Copyright Northeastern University, 2018