As a huge repository of source code and software-development–related data, GitHub is a natural subject for research scientists to study. However, GitHub is by no means a pre-prepared dataset that one can just download and start playing with. As with most real-life systems, the data must first be extracted, and that is not as easy as it sounds.
There are four basic methods of retrieving data, especially metadata, from repositories hosted on GitHub. One can use the GitHub RESTful API, as documented on the GitHub developer website. Alternatively, one can clone individual repositories and use the git toolchain to extract information that way. There is also a periodically updated collection of metadata from GitHub projects known as GHTorrent that one can download and analyze. Finally, one can scrape the GitHub website. This short tutorial will cover the first three methods, give examples of their usage, and explain the pros and cons of each.
The first method, and the one GitHub itself endorses, is to use the GitHub RESTful API. The API allows a programmer to issue queries requesting specific data. Through queries one can limit the requested data to only the subset one needs. For instance, one can request only the commit history of a repo without downloading anything else. One can also issue queries that search through the entire service rather than single repos.
The API method comes with some serious drawbacks, though. First, it is designed to aid app developers rather than to provide data for research. Thus, the API employs pagination: a query returning more than 30 items will split the result into chunks, and these chunks have to be retrieved separately by individual queries. The number of items per page can be adjusted in some queries, but cannot exceed 100. As a result, retrieving the commit history of a repo such as torvalds/linux requires more than 8000 requests.
The problem is compounded by the number of requests an individual user is allowed to issue. GitHub is rate limited: an anonymous user can issue up to 60 requests per hour, while an authenticated user can issue up to 5000. If the limit is exceeded, the API returns 403 errors instead of results. This means that the aforementioned torvalds/linux commit history will take close to two hours to download (for a single user).
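Just to spell out the arithmetic behind that estimate in R, using the figures quoted above:
requests <- 8000   # a little over 8000 requests at 100 commits per page
limit <- 5000      # authenticated requests per hour
requests / limit   # about 1.6, i.e. close to two hours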
Given that the API is a RESTful interface, it is implemented using HTTP and returns results in the form of JSON structures. This means that it can be accessed using just about any programming language, as well as with commandline tools like curl. The documentation at developer.github.com provides detailed examples using curl. This tutorial will concentrate on using the API in the R language.
In order to run HTTP queries in R one needs to install httr and httpuv. The tutorial also makes use of stringr for string processing.
install.packages(c("httr", "httpuv", "stringr"))
Then load httr (stringr will be loaded later when it is needed; httpuv is used by httr behind the scenes).
library(httr)
Since the rate limit for unauthorized users is so low, it’s best to start with authentication to get the 5000-requests-per-hour rate. In order to authenticate one needs a Client ID and Client Secret for the GitHub account. These can be obtained from the developer settings tab in the account settings on the GitHub website: register a new OAuth app there (the website and callback URL can be set to http://github.com/). Then use the pre-existing GitHub configuration for httr:
oauth_endpoints("github")
Afterwards, use the Client ID as key and the Client Secret as secret to authenticate with GitHub.
myapp <- oauth_app("github",
key = "XXXXXXXXXXXXXXXXXXXX",
secret = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
)
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
This should cause the browser to open and ask for authentication.
Let’s issue a simple request to the GitHub API; it will also let us check our rate limit. To do that, one simply issues an HTTP GET request to a resource such as https://api.github.com/users/USERNAME (the rate limit information comes back in the response headers).
The httr package provides a helpful function to do that, called GET, that takes the URL as an argument:
req <- GET("https://api.github.com/users/kondziu")
The GET function returns a response object from which the results of the query can be read. Before doing that, it is good practice to make sure the request actually succeeded; stop_for_status raises an error if the server responded with an error code.
stop_for_status(req)
Then one can retrieve the status, the headers, and the contents of the response using the appropriate functions:
status_code(req)
headers(req)
content(req)
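content parses the JSON response into a nested R list, so individual fields can be picked out directly; for example, a few of the fields returned by the users endpoint:
user <- content(req)
user$login           # the account's login name
user$public_repos    # number of public repositories
user$created_at      # when the account was created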
Among other things, the headers report the rate limit the query is subject to:
headers(req)$`x-ratelimit-limit`
Currently it is 60, because the query was executed without authentication. In order to authenticate the query, one must pass the OAuth configuration object along with the query like so:
gtoken <- config(token=github_token)
req <- GET("https://api.github.com/users/kondziu", config=gtoken)
stop_for_status(req)
headers(req)$`x-ratelimit-limit`
Now the rate is at 5000. We’re ready for serious business.
Let’s issue a slightly more complex query. Let’s try to get the commit history of that torvalds/linux repo one has heard so much about. The GitHub documentation instructs that to do so, one must make an HTTP GET request to https://api.github.com/repos/USER/PROJECT/commits. That sounds simple enough.
req <- GET("https://api.github.com/repos/torvalds/linux/commits", config=gtoken)
stop_for_status(req)
content(req)
The content is a list of 30 records, each describing one commit. Let’s iterate through the results and print out some of the more interesting information.
lapply(content(req), function(item) {
cat(paste0("hash: ", item$sha))
cat(paste0("author name: ", item$commit$author$name, "\n"))
cat(paste0("author email: ", item$commit$author$email, "\n"))
cat(paste0("author date: ", item$commit$author$date, "\n"))
cat(paste0("committer name: ", item$commit$committer$name, "\n"))
cat(paste0("committer email: ", item$commit$committer$email, "\n"))
cat(paste0("committer date: ", item$commit$committer$date, "\n"))
cat(paste0("commit message: ", item$commit$message, "\n"))
cat("----------------------------------------------\n\n")
})
We can also stick this data into a dataframe for later analysis!
df <- do.call(rbind, lapply(content(req), function(item) {
data.frame(`hash` = item$sha,
`author name` = item$commit$author$name,
`author email` = item$commit$author$email,
`author date` = item$commit$author$date,
`committer name` = item$commit$committer$name,
`committer email` = item$commit$committer$email,
`committer date` = item$commit$committer$date,
`commit message` = item$commit$message)
}))
View(df)
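The dates come back as ISO 8601 strings, so for any time-based analysis it is worth converting them to proper timestamps. Note that data.frame mangles the column names, so author date becomes author.date; a quick sketch:
df$author.date <- as.POSIXct(df$author.date, format="%Y-%m-%dT%H:%M:%SZ", tz="UTC")
df$committer.date <- as.POSIXct(df$committer.date, format="%Y-%m-%dT%H:%M:%SZ", tz="UTC")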
These records contain a lot of useful things, but why are there only 30? The results are paginated, and what we have retrieved is only the first page. So how does one get more? The answer lies in the header:
headers(req)$link
Here one can see two links, one to the next page and one to the last page of results. That latter number is of particular interest: if one knows how many pages there are, one is able to ask for every single one of them in succession.
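For reference, the value of that header has roughly the following shape (URLs abbreviated here; the last page number depends on the repository and the page size):
<https://api.github.com/...?page=2>; rel="next", <https://api.github.com/...?page=N>; rel="last"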
A few judiciously applied string operations can recover that number from the headers:
library(stringr)
read_page_count<- function(req) {
# Get the commit-specific link header
git_header <- headers(req)$link
# Split the header into individual lines and select the reference to last page
references <- unlist(str_split(git_header, ","))
last_reference <- references[grepl("rel=\"last\"", references)]
# Extract last page number and convert to int
as.integer(str_extract(str_extract(last_reference, pattern="[&?]page=[0-9]+"), pattern="[0-9]+"))
}
read_page_count(req)
There is one more thing worth paying attention to here. Each page contains only 30 results by default but, according to the documentation, a page can be made to contain up to 100. Let’s modify the query to take advantage of that by adding the per_page parameter:
req <- GET("https://api.github.com/repos/torvalds/linux/commits?per_page=100", config=gtoken)
stop_for_status(req)
length(content(req))
read_page_count(req)
Now each result page contains one hundred items, and there are 8103 pages in total. So how does one iterate over these pages? A query can ask for a specific page, so one can request every page in sequence and aggregate the results. (This takes forever, by the way.)
# Our essential query:
query <- "https://api.github.com/repos/torvalds/linux/commits?per_page=100"
# Grab headers returned by the query and use them to figure out how many pages of results there will be.
req <- GET(query, config=gtoken)
stop_for_status(req)
pages <- read_page_count(req)
# Iterate over all pages and make a request.
data_from_pages <- lapply(1:pages, function(page) {
# Create a request URL by appending a page specification.
paginated_query <- paste0(query, '&page=', page)
# Issue the request.
req <- GET(paginated_query, config=gtoken)
stop_for_status(req)
cat(paste0("Downloading page ", page, "\n"), file=stderr())
# Process the data.
do.call(rbind, lapply(content(req), function(item) {
data.frame(`hash` = item$sha,
`author name` = item$commit$author$name,
`author email` = item$commit$author$email,
`author date` = item$commit$author$date,
`committer name` = item$commit$committer$name,
`committer email` = item$commit$committer$email,
`committer date` = item$commit$committer$date,
`commit message` = item$commit$message)
}))
})
# Aggregate the results from all pages.
data <- do.call(rbind, data_from_pages)
And there you have it. In summary, while the RESTful API is definitely the way GitHub wants people to interact with the service, it has both pros and cons.
Pros:
- Queries can be targeted precisely, so only the data one actually needs is downloaded.
- Results come back as JSON over HTTP and can be consumed from virtually any language or tool.
- Searches can span the entire service rather than individual repositories.
Cons:
- Results are paginated, so retrieving large datasets requires many requests.
- Requests are rate limited (5000 per hour for authenticated users), which makes bulk downloads slow.
- The API is designed to aid application developers rather than to provide data for research.
So what are the alternatives?
Another method of getting data out of GitHub is simply to clone repositories and analyze them locally. This downloads the repositories in their entirety, meaning that one not only has access to data like commit histories, but can also analyze the raw source files themselves. The downside is that this data takes a lot of time to download and takes up a lot of disk space as well. On the other hand, there is no hard limit on cloning repositories, so one can gather specific information at a quicker pace than with the API. In addition, given that both the cloning itself and the subsequent extraction of data are done using the git toolchain, retrieving data from a cloned repository is much less fiddly than using the API.
Given that the git toolchain consists of commandline tools, this part of the tutorial is presented in bash.
In order to clone a repository one needs its user name and project name to construct its URL, which has the form https://github.com/USER/REPO.git. So for the torvalds/linux repo that is https://github.com/torvalds/linux.git. We clone it using the following commands (this will take a moment):
git clone https://github.com/torvalds/linux.git
cd linux
Afterwards, one can proceed to data extraction. Let’s start by getting the commit history as with the API. One can retrieve it by using git log.
git log
When executed, git log will produce a human-readable history. But human-readable is not very useful for the purposes of data science. Instead, let’s reconfigure git log to return the information in a more machine-friendly format: a CSV file containing information about the author, the committer, and the timestamps.
echo '"hash","author name","author email","author timestamp","committer name","committer email","committer timestamp"' > commits.csv
git log --pretty=format:'"%H","%an","%ae","%at","%cn","%ce","%ct"' >> commits.csv
This will return the commit history in the current branch. One can expand the log to all branches by setting the --all flag in the git log command.
echo '"hash","author name","author email","author timestamp","committer name","committer email","committer timestamp"' > commits.csv
git log --pretty=format:'"%H","%an","%ae","%at","%cn","%ce","%ct"' --all >> commits.csv
More information about formatting the log can be found by executing the command git log --help.
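The resulting file can be loaded straight back into R for analysis. A quick sanity check, assuming the commits.csv produced above (read.csv turns the spaces in the header into dots, so author name becomes author.name):
commits <- read.csv("commits.csv")
nrow(commits)               # total number of commits in the log
head(commits$author.name)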
The information we have gotten so far is pretty boring and clinical though. Let’s instead look at the files the commits modify. One can also use git log to extract them but, unfortunately, the pretty printer will not make this data as machine-friendly as one would wish. Let’s look at it by printing the hash of each commit and the list of files it modifies:
git log --pretty=format:%H --numstat
In the results one can see that the modified files are listed below each hash, along with the numbers of added and removed lines, respectively. This format is not regular, so let’s add a newline before each hash to make it more regular:
git log --pretty=format:%n%H --numstat
Now the commits are separated by an empty line, so one can write a fairly simple script to read it in one’s language of choice.
For instance, one can write a Python script that reads the output of git log, collects each record in a buffer, and writes out the results whenever it encounters an empty line:
#!/usr/bin/env python3
import sys

def flush(buffer):
    # The first buffered line is the commit hash; the rest are numstat lines.
    if not buffer:
        return
    hash = buffer[0]
    for statline in buffer[1:]:
        # Each numstat line has the form: added<TAB>deleted<TAB>path
        added, deleted, path = statline.split(maxsplit=2)
        print('%s,%s,%s,"%s"' % (hash, added, deleted, path))
    buffer.clear()

if __name__ == '__main__':
    buffer = []
    for line in sys.stdin:
        # An empty line marks the start of a new record (the %n in the format above).
        if line.strip() == '':
            flush(buffer)
        else:
            buffer.append(line.strip())
    # Flush the last record, which is not followed by an empty line.
    flush(buffer)
One can then run git log and this script in a pipeline:
git log --pretty=format:%n%H --numstat | python3 numstat_to_csv.py > commit_files.csv
A similar technique can be employed to get commit subject lines and messages. This is easy in principle, since the pretty printer can produce them directly; here the hash and subject go on one line and the raw message body (%B) follows on the lines below:
git log --pretty=format:"%H %s%n%B"
Again, the problem here is to somehow convert this into machine-friendly output. It is enough to figure out a way to distinguish between fields and records. For instance, one can terminate each record by adding an atypical character to the format like so:
git log --pretty=format:"%H %s%n%B🐱"
Then all that is left is to write a small script in one’s language of choice again.
#!/usr/bin/env python3
import sys

if __name__ == '__main__':
    buffer = []
    for line in sys.stdin:
        # The cat emoji terminates each record (see the git log format above).
        if line.strip() == u'🐱':
            if not buffer:
                continue
            # The first buffered line holds the hash and the subject;
            # the remaining lines are the raw message body.
            parts = buffer[0].split(maxsplit=1)
            hash = parts[0]
            subject = parts[1] if len(parts) > 1 else ''
            message = '\n'.join(buffer[1:])
            # repr() escapes newlines and other special characters so that
            # each record fits on a single CSV line.
            subject = repr(subject).replace('"', r'\"')
            message = repr(message).replace('"', r'\"')
            print('%s,"%s","%s"' % (hash, subject, message))
            buffer.clear()
        else:
            buffer.append(line.strip())
Finally, one pipes the commands together:
git log --pretty=format:"%H %s%n%B🐱" | python3 messages_to_csv.py > commit_messages.csv
(In case of encoding woes, try iconv -t utf-8//IGNORE to forcibly convert everything to UTF-8.)
On the whole, cloning GitHub repositories is a pretty decent, if expensive, way to collect data from the service.
Pros:
- Full access to the repository, including the raw source files, not just metadata.
- No hard rate limit, so specific information can be gathered at a quicker pace than with the API.
- Data extraction uses the familiar git toolchain, which is less fiddly than the API.
Cons:
- Cloning takes a lot of time and a lot of disk space, since entire repositories are downloaded.
- Only the repositories one actually clones are covered; there is no way to search across the whole service.
- The output of git log usually needs additional scripting to become machine-friendly.
The third alternative is to use the data gathered by GHTorrent, available from http://ghtorrent.org. GHTorrent observes the public events generated on GitHub and logs them into a database. The database is a relational MySQL database whose schema consists of a number of interconnected tables describing GitHub entities such as projects, commits, and users (the projects and project_commits tables are used in the example below).
One can then download a snapshot of the database to use in research. In that case, I hear you ask, why aren’t we just using this database? That’s because the database is distributed as monthly or daily compressed snapshots. These snapshots have the breadth of the entire database schema, but only contain the events from that period. This means that if one wants to observe, say, just a few repos, one must download an entire snapshot and extract just the interesting information from it. Given that each snapshot tends to be between 50GB and 100GB of data, and there are 43 of them, this in itself is tricky. Of course, if one wants to observe the entire activity of GitHub since 2014 and has both the bandwidth and the disk space to download and store all that information, this is not a bad way of obtaining the data.
The GHTorrent project also allows people to query their complete database through an online interface at http://ghtorrent.org/dblite/, but to the best of my knowledge, getting big data out of that is not possible.
What the database is good for, though, is getting data about a particular month’s activity on GitHub.
Just as an example, let’s extract commit information for all projects from one of the snapshots on GHTorrent. To do that, first download a snapshot.
curl https://ghtstorage.blob.core.windows.net/downloads/mysql-2017-01-19.tar.gz -o snapshot.tar.gz
After the snapshot is downloaded, one can then extract specific tables in the form of CSV files:
tar --extract --file=snapshot.tar.gz mysql-2017-01-19/projects.csv
tar --extract --file=snapshot.tar.gz mysql-2017-01-19/project_commits.csv
That is simple enough.
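The extracted tables can then be loaded into R or any other tool. Note that, as far as I can tell, the CSV dumps carry no header row, so column names have to be taken from the schema documentation; a minimal sketch:
# header=FALSE because the GHTorrent dumps do not include column names.
projects <- read.csv("mysql-2017-01-19/projects.csv", header=FALSE)
nrow(projects)
# project_commits.csv is far larger; a faster reader such as data.table::fread,
# or chunked processing, may be needed for it.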
In summary, the GHTorrent database is a good approach for getting comprehensive data about all of GitHub or monthly activity snapshots. However, getting deep data about specific projects raises issues.
Pros:
- Comprehensive, already-structured data about public activity across all of GitHub.
- Well suited to studying GitHub as a whole, or the activity of a particular period.
Cons:
- Snapshots are huge (between 50GB and 100GB each, and there are 43 of them), so downloading and storing them is a challenge in itself.
- Each snapshot only covers the events of its period, so getting deep data about a handful of specific projects means downloading far more than one needs.
In order to perform many sorts of analyses on GitHub, one needs a list of projects to analyze. One natural source of projects is the list of the most popular repositories, as indicated by the number of stars they have. How would one go about downloading that list?
The easiest way to do so is to go through the GitHub API again. Let’s download the USER/PROJECT names of the 1000 most starred repositories. Let’s start by composing a search query through the API. The query should ask for repositories with more than 0 stars, sorted by stars in descending order. Let’s ask for 100 repositories per page and grab the first 10 pages (the search API returns at most 1000 results, which is exactly 10 pages of 100).
most_starred_query <- 'https://api.github.com/search/repositories?q=stars:>0&sort=stars&order=desc&per_page=100'
Now, let us execute the query iteratively as usual:
req <- GET(most_starred_query, config=gtoken)
stop_for_status(req)
pages <- read_page_count(req)
data_from_pages <- lapply(1:pages, function(page) {
paginated_query <- paste0(most_starred_query, '&page=', page)
req <- GET(paginated_query, config=gtoken)
stop_for_status(req)
cat(paste0("Downloading page ", page, "\n"), file=stderr())
do.call(rbind, lapply(content(req)$items, function(item) {
#print(item)
data.frame(repository=item$full_name)
}))
})
repositories <- do.call(rbind, data_from_pages)
Et voilà!
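The resulting data frame can be saved for later use, or plugged back into the commit-history queries from earlier in the tutorial; for instance (the file name is just an example):
write.csv(repositories, "most_starred_repositories.csv", row.names=FALSE)
head(repositories)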