Schedule | Thur, 6–9pm @ WVH110 |
Staff | Jan Vitek, with invited lectures by Konrad Siek. |
Contact | Piazza for all communication. |
Experts | John Tristan (ML Lead, Oracle Labs), Jiahao Chen (DS Manager, CapitalOne) |
Do you want to: Build a machine learning portfolio? Solve real data science problems? Improve your programming skills? Meet data science experts?
The class has the following data science pre-requisites: DS5110, DS5220 and DS5230.
This course encourages students to solve real-world data science problems by applying the skills they obtained in previous classes of their Data Science program. Students will gain practical experience with the key steps of any data science project, namely data import, data tidying and transformation, statistical modelling, and visualization. The course combines a programming component in R with a machine learning component. Repeatability and reproducibility of results will be emphasized.
Data Science is a discipline that combines computing with statistics. A data analysis problem is solved in a series of data-centric steps: data acquisition and representation (Import), data cleaning (Tidy), and an iterative sequence of data transformation (Transform), data modelling (Model) and data visualization (Visualize). The end result of the process is to communicate insights obtained from the data (Communicate). This class will take you through all the steps in the process and will teach you how to approach such problems in a systematic manner. You will learn how to design and implement data analysis pipelines. The class will also emphasize how elegant code leads to reproducible science.
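The steps above can be sketched as a chain of small functions, one per step. This is a minimal illustration only, written in Python with the standard library (the starter projects themselves use R). The sample data, the column names, and every function name here are hypothetical, and the "model" is deliberately trivial: a mean per group.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data: language, commit count, bug-fix commit count.
# It is intentionally messy (inconsistent names, a missing value).
RAW = """language,commits,bug_fixes
Python,120,30
python ,80,10
Haskell,50,5
Haskell,,4
C,200,80
"""


def import_data(text):
    """Import: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def tidy(rows):
    """Tidy: normalize names, drop rows with missing values, cast to int."""
    clean = []
    for row in rows:
        if not row["commits"] or not row["bug_fixes"]:
            continue  # skip incomplete observations
        clean.append({
            "language": row["language"].strip().lower(),
            "commits": int(row["commits"]),
            "bug_fixes": int(row["bug_fixes"]),
        })
    return clean


def transform(rows):
    """Transform: derive the bug-fix ratio for each row."""
    return [dict(r, ratio=r["bug_fixes"] / r["commits"]) for r in rows]


def model(rows):
    """Model: a trivial stand-in, the mean bug-fix ratio per language."""
    groups = {}
    for r in rows:
        groups.setdefault(r["language"], []).append(r["ratio"])
    return {lang: mean(vals) for lang, vals in groups.items()}


def visualize(summary):
    """Visualize: render a text bar chart, one line per language."""
    return [
        f"{lang:<8} {'#' * round(value * 20)} {value:.2f}"
        for lang, value in sorted(summary.items())
    ]


def run_pipeline(text):
    """Run all steps in order; each step consumes the previous step's output."""
    return visualize(model(transform(tidy(import_data(text)))))


if __name__ == "__main__":
    print("\n".join(run_pipeline(RAW)))
```

Keeping each step a separate function that takes data in and returns data out makes it easy to rerun the whole analysis from the raw input, which is one simple way to support reproducibility.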
The class will feature a couple of small starter projects and one class project in which students will deliver a solution to a real-world data analysis challenge. All projects will be done individually. Starter projects will be done in the R programming language. The class project can be done using any language or tool, but class staff will only be able to provide detailed support for technologies they are familiar with.
The goal of this class is to reinforce the Data Science practices that were discussed in previous classes. These are summarized by the six activities listed next (adapted from “The levels of data science class” by Jeff Leek).
The projects will focus on analyzing a large dataset extracted from GitHub. The challenges to overcome include large-scale data acquisition (the target is 1 million projects), data cleaning (the data is not guaranteed to be clean; on the contrary, it is known to be messy), data representation (how to model the data), data storage (defining a storage format, e.g. a relational database), data analysis, the choice of machine learning techniques (NLP, clustering, etc.), and visualization of the results.
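As one possible illustration of the data representation and storage questions, the sketch below stores projects and commits in a relational database using Python's built-in sqlite3 module. The schema, table names, column names, and helper functions are assumptions made for this example, not a prescribed format, and the sample repository names are invented.

```python
import sqlite3

# Hypothetical schema: one row per project, one row per commit.
SCHEMA = """
CREATE TABLE projects (
    id        INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL UNIQUE,   -- e.g. "owner/repo"
    language  TEXT
);
CREATE TABLE commits (
    sha        TEXT PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES projects(id),
    message    TEXT NOT NULL
);
"""


def create_store():
    """Create an in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_project(conn, full_name, language):
    """Insert a project if it is new and return its id either way."""
    conn.execute(
        "INSERT OR IGNORE INTO projects (full_name, language) VALUES (?, ?)",
        (full_name, language),
    )
    row = conn.execute(
        "SELECT id FROM projects WHERE full_name = ?", (full_name,)
    ).fetchone()
    return row[0]


def add_commit(conn, project_id, sha, message):
    """Insert a commit; duplicates (same sha) are silently ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO commits (sha, project_id, message) VALUES (?, ?, ?)",
        (sha, project_id, message),
    )


def commits_per_language(conn):
    """Count stored commits per project language."""
    rows = conn.execute(
        """
        SELECT p.language, COUNT(c.sha)
        FROM projects p
        LEFT JOIN commits c ON c.project_id = p.id
        GROUP BY p.language
        ORDER BY p.language
        """
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = create_store()
    pid = add_project(conn, "alice/demo", "R")       # invented repository name
    add_commit(conn, pid, "abc123", "fix off-by-one bug")
    add_project(conn, "bob/tool", "Python")          # invented repository name
    print(commits_per_language(conn))
```

Making inserts idempotent (INSERT OR IGNORE on a unique key) is a design choice that suits messy, repeated downloads: rerunning an acquisition step does not create duplicate rows.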
One example of an analysis of GitHub data is the paper “A large-scale study of programming languages and code quality in GitHub” by Ray et al.
Lectures will review the basics of data science and reproducibility and, depending on student interest and need, topics in machine learning and programming: R, Markdown, dataframes, ggplot, tidyverse, Shiny.
The grade in this class will be based entirely on the final deliverables, which consist of (a) a 1-minute elevator pitch, (b) a 10-minute presentation, (c) a project report, and (d) a GitHub repository with software.
Projects are individual.
Projects will be presented to a panel of experts in data science. Currently two experts are confirmed: John Tristan and Jiahao Chen.