Big data is a catchall term for datasets that are too resource-intensive to process with conventional tools. This course will introduce students to parallel and distributed processing technologies for analyzing 'big data'. The course will cover programming paradigms and abstractions for data analysis at scale. Students will gain an understanding of the performance and usability trade-offs of various technologies and paradigms, and will become familiar with technologies such as Hadoop, Spark, H2O, and TensorFlow, among others. Hands-on assignments will focus on machine learning and data analysis tasks. The class builds on known principles such as the design recipe, testing, and code reviews.
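To give a flavor of the paradigms covered, here is a minimal sketch of the MapReduce model in plain Python. This is an illustration of the idea only, not code from Hadoop or Spark; all function names are our own.

```python
# A word-count in the MapReduce style: a map phase emits (key, value)
# pairs, a shuffle groups values by key, and a reduce phase aggregates.
from collections import defaultdict
from itertools import chain

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document.
    return chain.from_iterable(
        ((word, 1) for word in doc.split()) for doc in documents)

def shuffle(pairs):
    # Shuffle: group values by key, as a framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "data at scale"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"], counts["data"])  # 2 2
```

In a real framework the map and reduce phases run in parallel across a cluster and the shuffle moves data over the network; the program above only captures the programming model.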


Students should have taken the Algorithms course, should enjoy programming, and should strive to write beautiful code.


Instructors: Jan Vitek and Konrad Siek. Teaching Assistant: Aviral Goel.

Communication with staff is exclusively through Piazza. Use private notes for messages that should not be visible to other students.


Approximately fifteen hours a week, with regular codewalks and in-class tests.


The class grade is out of 100 points: 33 points for in-class participation (quizzes, questions, paper presentations, note taking), 34 points for the final project (software, report, and oral defense), and 33 points for codewalks. The final grade is given on a scale of A=90, A-=85, B+=80, B=75, B-=70, C=65, D=55. Codewalks are conducted in class, focus on the student's code, and are graded based on code quality.


Every code element (file or function) must have a note ascribing authorship to an initial author and, possibly, multiple maintainers. Sharing code is allowed. Changing authorship is cheating.


Cheating means an F for all involved parties, expulsion from the class, and notification of ORAF.


There will be few traditional lectures; instead, we will discuss code, books, and papers. Students will answer questions about the assigned reading during class. Students should notify the instructor of absences. A scribe will record discussions in a shared document.


Office hours are held Mondays, Wednesdays, and Thursdays from 11 am to noon. Class is in the Forsyth Building, room 129, Tuesdays and Fridays, 3:30–5:00 pm.


All regrade requests and grade challenges must be submitted in writing, as private posts on Piazza, no more than 7 days after the grade was awarded.


A code review is conducted after each task. Teams review other teams' work, commenting on code quality, design, documentation, and tests; the goal is to produce actionable suggestions for improvement. The output of a review is a report, emailed as a PDF to the instructor and the code's authors. Code walks are in-class discussions of programming tasks that start from a code review and explore the implementation of a student's project. Reviews comment on the report, the code, and the packaging. The report should give sufficient information to understand what was achieved; it should be clear and complete. The code should be well documented, avoid repetition, have no obvious bugs, and carry authorship comments. The packaging should make the results easily reproducible. Be curious about the code. Question the results. Don't waste space in your report on what works well; spend your time on what could be improved. Avoid generalities: be precise and give examples.


HTDG Hadoop: The Definitive Guide, 4th Edition, White, O'Reilly
KKWZ Learning Spark, Zaharia+, O'Reilly
MR04 MapReduce: Simplified Data Processing on Large Clusters, Dean, Ghemawat, OSDI'04 link
FJ10 FlumeJava: Easy, Efficient Data-Parallel Pipelines, Chambers+, PLDI'10 link
SK12 Possible Hadoop Trajectories, Stonebraker, Kepner, CACM'12 link
S14 Hadoop at a Crossroads?, Stonebraker, CACM'14 link
M3R12 Increased Performance for In-Memory Hadoop Jobs, Shinnar+, VLDB'12 link
Z+12 Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Zaharia+, NSDI'12 link
J+12 The Performance of MapReduce: An In-depth Study, Jiang+, VLDB'10 link
MAS11 Evaluating MapReduce Performance Using Workload Suites, Chen+, MASCOTS'11 link
OS06 Bigtable: A Distributed Storage System for Structured Data, Chang+, OSDI'06 link
R+12 Nobody Ever Got Fired for Using Hadoop on a Cluster, Rowstron+ link
A+15 Spark SQL: Relational Data Processing in Spark, Armbrust+, SIGMOD'15 link
O+14 Processing Theta-Joins Using MapReduce, Okcan+, SIGMOD'11 link
AD15 Scaling Spark in the Real World: Performance and Usability, Armbrust+, VLDB'15 link
JMM The Java Memory Model FAQ link
B+16 SystemML: Declarative Machine Learning on Spark, Boehm+, VLDB'16 link
M+16 dmapply: A Functional Primitive to Express Distributed Machine Learning Algorithms in R, Ma+, VLDB'16 link
S+16 Titian: Data Provenance Support in Spark, Shah+, VLDB'16 link
O+04 An Overview of the Scala Programming Language, Odersky+, LAMP TR link
A+16 TensorFlow: A System for Large-Scale Machine Learning, Abadi+, OSDI'16 link