“Is it wrong, wanting to be at home with your record collection? It’s not like collecting records is like collecting stamps, or beermats, or antique thimbles. There’s a whole world in here (…)”
― Nick Hornby, High Fidelity
Learning the dataset and the basics of Spark
We are using a dataset based on the metadata in the Million Song Dataset. The data can be found here:
Use the subset for development and testing. Use the full dataset for your final submission.
A description of the dataset can be found here. Watch out for dirty data!
Write a Spark program that retrieves:
artist_term
Evaluate your solution on your local machine. Optionally, evaluate your solution using AWS.
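As a starting point, a PySpark sketch of such a query might look like the following. The tab-separated artist_id/term layout and every name here (is_clean, term_counts, the driver wiring) are assumptions, not the required solution; check the dataset description for the real schema. The is_clean filter is one way to guard against the dirty rows mentioned above:

```python
def is_clean(parts):
    """A row is usable only if it has exactly two non-empty fields.

    This guards against the dirty data warned about in the assignment:
    truncated lines, missing fields, or stray separators.
    """
    return len(parts) == 2 and all(parts)

def term_counts(sc, path):
    """Count how many artists carry each term.

    `sc` is a SparkContext; `path` points at the (possibly gzipped)
    artist_term text file. Malformed rows are silently dropped.
    """
    return (sc.textFile(path)
              .map(lambda line: line.rstrip("\n").split("\t"))
              .filter(is_clean)
              .map(lambda parts: (parts[1], 1))
              .reduceByKey(lambda a, b: a + b))
```

These functions only define the pipeline; in a driver script you would create a SparkContext and call, for example, term_counts(sc, path).takeOrdered(10, key=lambda kv: -kv[1]) to get the most common terms.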
Write a report discussing the results and the observed performance. Pay attention to how the data and the operations impact performance.
The assignment is performed in groups of two (assigned by the TA). Authors must be clearly marked on all deliverables.
Required files:

- src/ (directory containing the sources of the implementation)
- README.md (Markdown file containing the description of the implementations)
- Makefile
- report.Rmd (the report as described in the previous section)
- report.pdf (a PDF rendering of report.Rmd)
- input/ (a directory containing sample input files; please gzip all text files to save space)

Provide a Makefile with the following rules:
- build: builds all the implementations
- run: runs the variants
- report: generates the reports
- all: build, run, and report

Prepare the report in R Markdown and generate a PDF.
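A minimal Makefile satisfying these rules might look like the sketch below. The target names come from the assignment, but the entry-point script (src/job.py) and the input file name are assumptions you would replace with your own:

```make
# Sketch only: src/job.py and the input path are placeholders.
.PHONY: build run report all

build:
	# Nothing to compile for PySpark; for a Scala implementation,
	# run your build tool (e.g. sbt package) here instead.
	@echo "nothing to build"

run: build
	spark-submit src/job.py input/artist_term.txt.gz

report:
	Rscript -e 'rmarkdown::render("report.Rmd", output_file = "report.pdf")'

all: build run report
```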
You can earn extra credit for this assignment by also providing a complete Hadoop implementation and comparing the performance of the two solutions.
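If you attempt the Hadoop comparison, one lightweight route is Hadoop Streaming, where the mapper and reducer are plain scripts reading stdin and writing stdout. The sketch below mirrors the term count in Streaming form; the file layout and all names are our assumptions, and the native Java MapReduce API is equally acceptable:

```python
import itertools

def mapper(lines):
    """Emit 'term<TAB>1' for each clean artist_id<TAB>term input row.

    In a Streaming job this would run over sys.stdin in mapper.py.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and all(parts):
            yield f"{parts[1]}\t1"

def reducer(lines):
    """Sum the counts per term.

    Hadoop Streaming delivers reducer input sorted by key, so
    itertools.groupby is enough to collect each term's values.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for term, group in itertools.groupby(rows, key=lambda kv: kv[0]):
        yield f"{term}\t{sum(int(v) for _, v in group)}"
```

Wrapped in mapper.py and reducer.py scripts, this would be launched with the hadoop-streaming jar (-mapper, -reducer, -input, -output), giving you a like-for-like job to benchmark against the Spark version.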