Big data is a catchall term for datasets that are resource intensive to process. This course introduces students to parallel and distributed processing technologies for analyzing 'big data'. The course covers programming paradigms and abstractions for data analysis at scale. Students will gain an understanding of the performance and usability trade-offs of various technologies and paradigms, and will become familiar with technologies such as Hadoop, Spark, H2O, and TensorFlow, among others. Hands-on assignments focus on machine learning and data analysis tasks. The class builds on known principles such as the design recipe, testing, and code reviews.
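As a taste of the map-reduce paradigm the course builds on, here is a minimal word-count sketch in plain Python (no Hadoop or Spark required) that makes the map, shuffle, and reduce phases explicit; the function names are illustrative, not part of any assignment.

```python
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in every document.
def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

# Shuffle phase: group values by key, as a framework would do
# between the map and reduce stages.
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine the grouped values for each key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data analysis at scale"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

Frameworks such as Hadoop and Spark run the same three stages across many machines, with the shuffle moving data over the network; the course examines what that distribution costs and buys.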
Students should have taken the Algorithms course, enjoy programming, and strive to write beautiful code.
Instructors: Jan Vitek and Konrad Siek. Teaching Assistant: Aviral Goel.
Communication with staff is exclusively through Piazza. Use private notes for messages that should not be visible to other students.
Expect approximately fifteen hours of work a week, with regular codewalks and in-class tests.
The class is graded out of 100 points: 33 points for in-class participation (quizzes, questions, paper presentations, note taking), 34 points for the final project (software, report, and oral defense), and 33 points for codewalks. The final grade is given on a scale of A=90, A-=85, B+=80, B=75, B-=70, C=65, D=55. Codewalks conducted in class focus on the student's code output and are graded on code quality.
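Read as inclusive lower bounds (e.g., 90 points or more earns an A), the scale above can be sketched as the following Python helper; the inclusive-bound interpretation is an assumption, not an official rubric.

```python
# Hypothetical helper: map a point total (0-100) to a letter grade,
# assuming each cutoff in the syllabus is an inclusive lower bound.
SCALE = [(90, "A"), (85, "A-"), (80, "B+"), (75, "B"),
         (70, "B-"), (65, "C"), (55, "D")]

def letter_grade(points):
    # Cutoffs are listed from highest to lowest, so the first match wins.
    for cutoff, letter in SCALE:
        if points >= cutoff:
            return letter
    return "F"

print(letter_grade(92))  # A
print(letter_grade(67))  # C
```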
Cheating results in an F for all involved parties, expulsion from the class, and notification of ORAF.
There will be few traditional lectures; instead we discuss code, books, and papers. Students will answer questions about the assigned reading during class. Students should notify the instructor of absences. A scribe will record discussions in a shared document.
Office hours are held M, W, R from 11 am to noon. Class meets in the Forsyth Building, room 129, Tuesday and Friday, 3:30 – 5:00.
All regrade requests and grade challenges must be submitted in writing, as private posts on Piazza, no more than 7 days after the grade was awarded.
A code review is conducted after each task: teams review other teams' work, commenting on code quality, design, documentation, and tests. The goal is to produce actionable suggestions for improvement; the output of a review is a report. Code walks are in-class discussions of programming tasks that start from a code review and explore the implementation of a student's project. Reviews are emailed as PDFs to the instructor and the code's authors. They comment on the report, the code, and the packaging. The report should give sufficient information to understand what was achieved; it should be clear and complete. The code should be well documented, avoid repetition, have no obvious bugs, and carry authorship comments. The packaging should make the results easily reproducible. Be curious about the code. Question the results. Don't waste space in your report on what works well; spend your time on what could be improved. Avoid generalities, be precise, and give examples.
HTDG | Hadoop: The Definitive Guide, 4th Edition, White, O'Reilly | |
KKWZ | Learning Spark, Zaharia et al., O'Reilly | |
MR04 | MapReduce: Simplified Data Processing on Large Clusters, Dean, Ghemawat, OSDI'04 | link |
FJ10 | FlumeJava: Easy, Efficient Data-Parallel Pipelines, Chambers+, PLDI'10 | link |
SK12 | Possible Hadoop Trajectories, Stonebraker, Kepner, CACM'12 | link |
S14 | Hadoop at a Crossroads?, Stonebraker, CACM'14 | link |
M3R12 | Increased Performance for In-Memory Hadoop Jobs, Shinnar+, VLDB'12 | link |
Z+12 | A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Zaharia+, NSDI'12 | link |
J+12 | The Performance of MapReduce: An In-depth Study, Jiang+, VLDB'10 | link |
MAS11 | Evaluating MapReduce Performance Using Workload Suites, Chen+, MASCOTS'11 | link |
OS06 | Bigtable: A Distributed Storage System for Structured Data, Chang+, OSDI'06 | link |
R+12 | Nobody Ever Got Fired for Using Hadoop on a Cluster, Rowstron+ | link |
A+15 | Spark SQL: Relational Data Processing in Spark, Armbrust+, SIGMOD'15 | link |
O+14 | Processing Theta-Joins Using MapReduce, Okcan+, SIGMOD'11 | link |
AD15 | Scaling Spark in the Real World: Performance and Usability, Armbrust+, VLDB'15 | link |
JMM | The Java Memory Model FAQ | link |
B+16 | SystemML: Declarative Machine Learning on Spark, Boehm+, VLDB'16 | link |
M+16 | dmapply: A Functional Primitive to Express Distributed Machine Learning Algorithms in R, Ma+, VLDB'16 | link |
S+16 | Titian: Data Provenance Support in Spark, Shah+, VLDB'16 | link |
O+04 | An Overview of the Scala Programming Language, Odersky+, LAMP TR | link |
A+16 | TensorFlow: A System for Large-Scale Machine Learning, Abadi+, USENIX'16 | link |