We now know how to get the data we need. But how do we store it? Choosing a schema for data will have wide-ranging implications down the line: it will inform how much disk space we need and how much work specific tasks will require (in terms of both programmer effort and computational power). The objective of the assignment is to design and prototype a relational database schema. The design should facilitate exploring commit histories in GitHub repositories.
The data contains the following information:
Design a schema to contain all of this data:
As you design the schema you will need to make design decisions about how to model specific entities or relationships between them. Motivate each decision you make.
Create an SQL script that creates an SQLite database implementing the schema you designed. This script should create the logical structure of the database without populating it, i.e., it should not contain any insert statements. Make sure you comment the SQL script so that others can read it and understand it.
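As a rough illustration of what a commented, structure-only script might look like, here is a minimal sketch. The tables `repositories` and `commits` and all of their columns are hypothetical placeholders chosen for this example, not the schema the assignment expects; your own design has to cover all of the data listed above. The SQL inside `SCHEMA_SQL` is the part that would live in the SQL script, and the small Python wrapper (standard-library `sqlite3`) is only there to check that the script runs against an empty database.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
# Note that it contains only CREATE statements and no inserts.
SCHEMA_SQL = """
-- One row per GitHub repository, e.g. owner 'mochajs' and name 'mocha'.
CREATE TABLE IF NOT EXISTS repositories (
    id    INTEGER PRIMARY KEY,
    owner TEXT NOT NULL,
    name  TEXT NOT NULL,
    UNIQUE (owner, name)
);

-- One row per commit. The same hash can appear in several repositories
-- (forks), so uniqueness is enforced per repository.
CREATE TABLE IF NOT EXISTS commits (
    id            INTEGER PRIMARY KEY,
    repository_id INTEGER NOT NULL REFERENCES repositories (id),
    hash          TEXT NOT NULL,
    authored_at   INTEGER NOT NULL,  -- Unix timestamp, seconds
    UNIQUE (repository_id, hash)
);
"""


def create_schema(db_path):
    """Create the (empty) logical structure of the database at db_path."""
    conn = sqlite3.connect(db_path)
    try:
        # SQLite does not enforce foreign keys unless asked to.
        conn.execute("PRAGMA foreign_keys = ON")
        conn.executescript(SCHEMA_SQL)
        conn.commit()
    finally:
        conn.close()
```

Calling `create_schema("prototype.sqlite")` (a hypothetical file name) would produce an empty database with this structure. Storing timestamps as integers rather than text is one example of a design decision that would need to be motivated in the report.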
Break in the prototype by populating it with data for the following projects (a random sample from repos.list from the previous assignment):
mochajs/mocha
paularmstrong/normalizr
php-fig/fig-standards
HubSpot/pace
spring-projects/spring-framework
torvalds/linux
react-boilerplate/react-boilerplate
magicalpanda/MagicalRecord
googlesamples/android-architecture-components
hammerjs/hammer.js
Make sure the process of populating the prototype is automated. That is, write scripts that do it from beginning to end. Note how long it takes to import the data into the database and how much the data weighs, both in terms of added rows and in terms of disk space. Based on these measurements, project the time necessary to import the data for the entirety of repos.list and the resulting size of the database.
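One possible way to take these measurements is sketched below, using only the Python standard library. The function names `count_rows`, `measure_import`, and `project` are made up for this example, and `import_fn` stands in for whatever your own populate scripts do. The projection assumes that cost scales linearly with the number of repositories, which is a crude assumption: a sample that happens to include a very large repository such as torvalds/linux will skew the average considerably, so it may be worth discussing this in the report.

```python
import os
import sqlite3
import time


def count_rows(conn):
    """Return {table_name: row_count} for every user-defined table."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    return {t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
            for t in tables}


def file_size(path):
    """Size of the database file in bytes (0 if it does not exist yet)."""
    return os.path.getsize(path) if os.path.exists(path) else 0


def measure_import(db_path, import_fn):
    """Run import_fn(conn) and report elapsed seconds, added rows and bytes."""
    size_before = file_size(db_path)
    conn = sqlite3.connect(db_path)
    try:
        rows_before = sum(count_rows(conn).values())
        start = time.perf_counter()
        import_fn(conn)   # hypothetical: your own import logic goes here
        conn.commit()
        seconds = time.perf_counter() - start
        rows_after = sum(count_rows(conn).values())
    finally:
        conn.close()
    # Measure the file only after closing, so everything is flushed to disk.
    return {"seconds": seconds,
            "rows": rows_after - rows_before,
            "bytes": file_size(db_path) - size_before}


def project(measured, sample_repos, total_repos):
    """Linearly scale sample measurements up to the full list of repos."""
    factor = total_repos / sample_repos
    return {key: value * factor for key, value in measured.items()}
```

With the ten sample projects imported, `project(measured, 10, n)` gives a first estimate for a repos.list containing `n` repositories.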
Write a short report describing both the schema and the prototype. This should include:
Add a new directory to your EDS19 repository on GitHub with the following contents:
02-schema/report.rmd - the report for this assignment,
02-schema/schema.sqlite - the definition of the database,
02-schema/populate/ - a directory containing the script or collection of scripts that were used to populate the prototype,
02-schema/populate/README.md - instructions about running the scripts that populate the prototype.
Due to its size, do not commit the prototype you generated.
The work should be completed by Thursday, January 31st. If you get stuck or have difficulties with any part of this, do not hesitate to ask the course staff or colleagues.