“Is it wrong, wanting to be at home with your record collection? It’s not like collecting records is like collecting stamps, or beermats, or antique thimbles. There’s a whole world in here (…)”

― Nick Hornby, High Fidelity

Objectives

Learning the dataset and the basics of Spark

Dataset

We are using a dataset based on the metadata in the Million Song Dataset. The data can be found here:

Use the subset for development and testing. Use the full dataset for your final submission.

Description of the dataset can be found here. Watch out for dirty data!

Functional requirements

Write a Spark program that retrieves:

Evaluate your solution using your a local machine. Optionally evaluate your solution using AWS.

Write a report discussing the results and the performance results. Pay attention to how the data and operations impact performance.

Non-functional requirements

Assignment is performed in groups of 2 (assigned by TA). Authors are clearly marked on all deliverables.

Required files:

Provide a Makefile with the following rules:

Prepare the report in R Markdown and generate a PDF.

Extra credit

It is possible to get extra credit for the assignment by providing a complete Hadoop implementation of this assignment and comparing the performance of the two solutions.