A large part of every data science project is getting your hands on a dataset. In this assignment, you’ll do just that, using GitHub as your data source.
There are two objectives to this assignment:
Experimental work should strive for reproducibility. Because of this, your system must be completely automated. This means that the system extracts the data and formats it into the desired output schema all on its own, without user input during run-time.
Since you may be required to re-run the data collection, the automated system has to be able to run the data collection from start to finish in less than 24 hours.
The system can be implemented using any technology. Given the number of system calls the implementation is likely to require, we recommend using bash (but anything else that does the job is fine too).
Research implementations have a tendency to expand and increase in complexity as the researchers explore the problem. You will likely need to change the output format of the data you gather and download additional data to meet the requirements of future research questions. Keep this in mind while implementing the current system.
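One way to keep the system easy to extend is to describe each output table as data rather than hard-coding it. The following is a minimal sketch in Python, not a required design: the names TableSpec, TABLES and git_log_command are hypothetical, and the choice of the %x1f and %x1e bytes as field and record separators is an assumption. The placeholders themselves (%H, %an, %s, and so on) are standard git log pretty-format placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TableSpec:
    """Hypothetical description of one output table."""
    filename: str
    columns: Tuple[str, ...]      # CSV header names
    git_format: Tuple[str, ...]   # one git pretty-format placeholder per column


# Adding a table or a column later means editing this list only.
TABLES: List[TableSpec] = [
    TableSpec(
        "commits.csv",
        ("hash", "author", "author email", "author timestamp",
         "committer", "committer email", "committer timestamp"),
        ("%H", "%an", "%ae", "%at", "%cn", "%ce", "%ct"),
    ),
    TableSpec(
        "messages.csv",
        ("hash", "subject", "message"),
        ("%H", "%s", "%B"),
    ),
]


def git_log_command(spec: TableSpec) -> List[str]:
    """Build the git log argument list for one table.

    Assumption: fields are separated by the 0x1f byte and records are
    terminated by the 0x1e byte, which are unlikely to occur in commit data.
    """
    fmt = "%x1f".join(spec.git_format) + "%x1e"
    return ["git", "log", f"--pretty=format:{fmt}"]
```

With this layout, a new research question that needs, say, an extra column only requires a new entry in the table description, while the code that runs git and writes CSV files stays unchanged.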
The output dataset should consist of commit histories of main branches of repositories specified here. Format the dataset to match the following schema:
commits.csv collects basic information about commits and contains the following columns:
hash
author
author email
author timestamp
committer
committer email
committer timestamp
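The columns above map directly onto git log format placeholders. The sketch below shows one possible way to turn such output into commits.csv rows. It assumes (this is not prescribed by the assignment) that git was invoked with the format string shown in the comment, using the 0x1f and 0x1e bytes as separators; the function names are hypothetical.

```python
import csv
import io

# Column order follows the commits.csv schema.
COMMIT_COLUMNS = [
    "hash", "author", "author email", "author timestamp",
    "committer", "committer email", "committer timestamp",
]

# Assumed invocation that produces the text parsed below:
#   git log --pretty=format:'%H%x1f%an%x1f%ae%x1f%at%x1f%cn%x1f%ce%x1f%ct%x1e'
FIELD_SEP = "\x1f"
RECORD_SEP = "\x1e"


def parse_commit_log(raw):
    """Split raw git log output into one list of fields per commit."""
    rows = []
    for record in raw.split(RECORD_SEP):
        record = record.strip("\n")  # git puts a newline between commits
        if not record:
            continue
        fields = record.split(FIELD_SEP)
        if len(fields) != len(COMMIT_COLUMNS):
            raise ValueError(f"unexpected field count in record: {record!r}")
        rows.append(fields)
    return rows


def write_commits_csv(rows, out):
    """Write a header line followed by one line per commit."""
    writer = csv.writer(out)
    writer.writerow(COMMIT_COLUMNS)
    writer.writerows(rows)
```

Using separators that cannot be typed in a name or email address avoids mis-splitting when an author name contains a comma; the csv module then takes care of quoting when the rows are written.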
messages.csv collects commit messages and their subjects as follows:
hash
subject
message
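Commit messages often span several lines and contain commas or quotes, so messages.csv needs proper CSV quoting. The following sketch illustrates one way to handle this with Python's csv module. It assumes the input comes from git log with the format '%H%x1f%s%x1f%B%x1e' (hash, subject, raw message body); the separators and function names are assumptions made for illustration.

```python
import csv
import io

MESSAGE_COLUMNS = ["hash", "subject", "message"]

# Assumed invocation: git log --pretty=format:'%H%x1f%s%x1f%B%x1e'
FIELD_SEP = "\x1f"
RECORD_SEP = "\x1e"


def parse_message_log(raw):
    """Return (hash, subject, message) tuples from raw git log output."""
    rows = []
    for record in raw.split(RECORD_SEP):
        record = record.lstrip("\n")  # newline that git puts between commits
        if not record:
            continue
        # maxsplit=2 keeps any stray separator inside the message itself.
        commit_hash, subject, message = record.split(FIELD_SEP, 2)
        rows.append((commit_hash, subject, message.rstrip("\n")))
    return rows


def write_messages_csv(rows, out):
    """Write the rows; the csv module quotes fields with newlines or commas."""
    writer = csv.writer(out)
    writer.writerow(MESSAGE_COLUMNS)
    writer.writerows(rows)


def read_messages_csv(text):
    """Read the file back, preserving newlines embedded in quoted fields."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    return header, [tuple(row) for row in reader]
```

Reading the file back with a CSV parser, rather than counting lines, is a simple check that multi-line messages did not break the row structure.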
files.csv informs which files were modified by commits. If a commit modifies multiple files, files.csv will contain multiple lines referencing that commit’s hash (one per modified file).
hash
file path
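The one-line-per-modified-file layout can be derived from git log --name-only, which prints the changed paths under each commit. The sketch below assumes the invocation shown in the comment, where each commit is introduced by a 0x1e byte followed by its hash; this marker and the function name are assumptions, not part of the assignment.

```python
import csv
import io

FILE_COLUMNS = ["hash", "file path"]

# Assumed invocation: git log --name-only --pretty=format:'%x1e%H'
# Each commit then appears as: <0x1e><hash>, followed by one path per line.
RECORD_START = "\x1e"


def parse_name_only_log(raw):
    """Return one (hash, file path) pair per file modified by each commit."""
    rows = []
    for chunk in raw.split(RECORD_START):
        lines = [line for line in chunk.split("\n") if line]
        if not lines:
            continue
        commit_hash, paths = lines[0], lines[1:]
        # A commit that lists no files contributes no rows.
        rows.extend((commit_hash, path) for path in paths)
    return rows


def write_files_csv(rows, out):
    """Write a header line followed by one line per (hash, file path) pair."""
    writer = csv.writer(out)
    writer.writerow(FILE_COLUMNS)
    writer.writerows(rows)
```

Two caveats are worth checking against your own data: by default git log lists no files for merge commits, and git may quote paths that contain unusual characters, so a real implementation may need extra options.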
Write a short report that describes the architecture of your solution. In addition report on the following:
Create a repository on GitHub called EDS19 to host the report and the source code of your solution. The repository should have the following structure and contents:
01-data-acqu/ - the source code for your system,
01-data-acqu/README.md - brief instructions explaining how to install and run your system,
01-data-acqu/report.rmd - the report for this assignment.
Due to its size, do not commit the dataset you generated.
The work should be completed by Thursday, January 24th. If you get stuck or have difficulties with any part of this, do not hesitate to ask the course staff or colleagues.