Big data is a catchall term for datasets that are resource-intensive to process. This course will introduce students to parallel and distributed processing technologies for analyzing ‘big data’. The course will cover programming paradigms and abstractions for data analysis at scale. Students will gain an understanding of the performance and usability trade-offs of various technologies and paradigms. Students will become familiar with technologies such as Hadoop, Spark, H2O and TensorFlow, among others. Hands-on assignments will focus on machine learning and data analysis tasks. The class builds on known principles such as the design recipe, testing and code reviews.


Students should have taken the Algorithms course, should enjoy programming and strive to write beautiful code.


Instructors: Jan Vitek and Konrad Siek
Teaching Assistant: Aviral Goel

Communication with staff is exclusively through Piazza. Use private notes for messages that should not be visible to other students.


Expect to spend approximately fifteen hours a week on the course, with regular codewalks and in-class tests.


The class grade is out of 100 points: 33 points for in-class participation (quizzes, questions, paper presentation, note taking), 34 points for the final project (software, report and oral defense) and 33 points for codewalks. The final grade is given on a scale of A=90, A-=85, B+=80, B=75, B-=70, C=65, D=55. Codewalks are conducted in class, focus on the student’s code output and are graded on code quality.
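As a rough illustration of how the point breakdown and the letter scale fit together, here is a minimal Python sketch. It assumes each value on the scale is the minimum total needed for that letter and that totals below 55 map to F; the syllabus states neither explicitly. The names `MAX_POINTS`, `CUTOFFS`, `total_points` and `letter_grade` are made up for this example.

```python
# Maximum points per component, as listed in the syllabus (sums to 100).
MAX_POINTS = {"participation": 33, "project": 34, "codewalks": 33}

# Letter scale from the syllabus, highest first.
# Assumption: each number is the minimum total for that letter.
CUTOFFS = [
    ("A", 90), ("A-", 85), ("B+", 80), ("B", 75),
    ("B-", 70), ("C", 65), ("D", 55),
]


def total_points(earned):
    """Sum the earned points after checking each component's range."""
    for name, points in earned.items():
        if name not in MAX_POINTS:
            raise KeyError(f"unknown component: {name}")
        if not 0 <= points <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(earned.values())


def letter_grade(total):
    """Map a total out of 100 to a letter using the cutoffs above."""
    for letter, cutoff in CUTOFFS:
        if total >= cutoff:
            return letter
    return "F"  # Assumption: anything below the D cutoff is an F.


if __name__ == "__main__":
    earned = {"participation": 30, "project": 30, "codewalks": 28}
    total = total_points(earned)
    print(total, letter_grade(total))
```

Under these assumptions, a student with 30, 30 and 28 points in the three components has a total of 88, which falls in the A- band.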


Every code element (file or function) must have a note ascribing authorship to an initial author and, possibly, multiple maintainers. Sharing code is allowed. Changing authorship is cheating.
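The syllabus does not prescribe a format for authorship notes. The Python sketch below shows one possible convention, with a file-level comment and a per-function note in the docstring. The author names, the `Author:` / `Maintainers:` labels and the `word_count` function are all hypothetical.

```python
# Author: Alice Example            (hypothetical initial author of this file)
# Maintainers: Bob Example         (hypothetical; added when Bob edited the file)


def word_count(lines):
    """Count how often each whitespace-separated word occurs in `lines`.

    Author: Alice Example
    Maintainers: Bob Example, Carol Example
    """
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts
```

In a scheme like this, someone who reuses or modifies the function adds themselves as a maintainer and leaves the original `Author:` line untouched.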


Cheating means an F for all involved parties, expulsion from the class and notification of ORAF.


There will be few traditional lectures; instead, we will discuss code, books and papers. Students will answer questions about assigned reading during class. Students should notify the instructor of absences. A scribe will record discussions in a shared document.


Office hours are held Monday, Wednesday and Thursday from 11 am to 12 pm. Class meets in the Forsyth Building, room 129, Tuesday and Friday, 3:30 – 5:00.


All regrade requests and grade challenges must be submitted in writing, as private posts on Piazza, no more than 7 days after the grade was awarded.


A code review is conducted after each task. Teams review other teams’ work, commenting on code quality, design, documentation and tests. The goal is to produce actionable suggestions for improvements. The output of a review is a report. Codewalks are in-class discussions of programming tasks that start from a code review and explore the implementation of a student’s project. Reviews are emailed as PDFs to the instructor and the code authors. They comment on the report, code and packaging. The report should give sufficient information to understand what was achieved. It should be clear and complete. The code should be well documented, avoid repetition, have no obvious bugs and carry authorship comments. The packaging should make the results easily reproducible. Be curious about the code. Question the results. Don’t waste space in your report on what works well; spend your time on what could be improved. Avoid generalities, be precise, give examples.


HTDG Hadoop: The Definitive Guide, 4th Edition, White, O’Reilly
KKWZ Learning Spark, Zaharia M., et al., O’Reilly
MR04 MapReduce: Simplified Data Processing on Large Clusters, Dean, Ghemawat, OSDI’04
FJ10 FlumeJava: Easy, Efficient Data-Parallel Pipelines, Chambers+, PLDI’10
SK12 Possible Hadoop Trajectories, Stonebraker, Kepner, CACM’12
S14 Hadoop at a Crossroads?, Stonebraker, CACM’14
M3R12 Increased Performance for In-Memory Hadoop Jobs, Shinnar+, VLDB’12
Z+12 A Fault-Tolerant Abstraction for In-Memory Cluster Comp, Zaharia+, NSDI’12
J+12 The Performance of MapReduce: An In-depth Study, Jiang+, VLDB’10
MAS11 Evaluating MapReduce Performance Using Workload Suites, Chen+, MASCOTS’11
OS06 Bigtable: A Distributed Storage System for Structured Data, Chang+, OSDI’06
R+12 Nobody ever got fired for using Hadoop on a cluster, Rowstron+
A+15 Spark SQL: Relational Data Processing in Spark, Armbrust+, SIGMOD’15
O+14 Processing Theta-Joins using MapReduce, Okcan+, SIGMOD’11
AD15 Scaling Spark in the Real World: Performance and Usability, Armbrust+, VLDB’15
JMM The Java Memory Model FAQ
B+16 SystemML: Declarative Machine Learning on Spark, Boehm+, VLDB’16
M+16 dmapply: A functional primitive to express distributed machine learning algorithms in R, Ma+, VLDB’16
S+16 Titian: Data Provenance Support in Spark, Shah+, VLDB’16