Software version control systems contain a huge amount of evolutionary data. It’s very common to mine these repositories to gain insights into how the development of a software product works. But that data needs some preprocessing to avoid flawed analyses.
That’s why I’ll show you how to read the commit information of a Git repository into a Pandas DataFrame!
Idea
The main idea is to use an existing Git library for Python that provides the necessary (and hopefully efficient) access to a Git repository. In this notebook, we’ll use GitPython, because at first glance it seems easy to use and to do the things we need.
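If you don’t have the two libraries yet, both are just a pip install away (GitPython’s package name on PyPI is GitPython):
pip install GitPython pandas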
Our implementation strategy is straightforward: We avoid writing our own loops and helper functions as much as possible and instead use all the processing power Pandas delivers. So let’s get started.
Create an initial DataFrame
First, we import our two main libraries for analysis: Pandas and GitPython.
import pandas as pd
import git
With GitPython, you can access a Git repository via a Repo object. That’s your entry point to the world of Git.
For this notebook, we analyze the Spring PetClinic repository, which can easily be cloned to your local computer with
git clone https://github.com/spring-projects/spring-petclinic.git
Repo needs at least the directory of your Git repository. I’ve added the additional argument odbt with git.GitCmdObjectDB. With this, GitPython uses a more performant approach for retrieving all the data (see the documentation for more details).
repo = git.Repo(r'C:\dev\repos\spring-petclinic', odbt=git.GitCmdObjectDB)
repo
To transform the complete repository into a Pandas DataFrame, we simply iterate over all commits of the master branch.
commits = pd.DataFrame(repo.iter_commits('master'), columns=['raw'])
commits.head()
Our raw column now contains all the commits as GitPython’s Commit objects (to be more accurate: references to these objects). Coincidentally, the string representation is the SHA-1 hash of the commit.
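If you ever need that hash explicitly, a Commit object also exposes it directly via its hexsha attribute:
commits.loc[0, 'raw'].hexsha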
Investigate commit data
Let’s have a look at the last commit.
last_commit = commits.loc[0, 'raw']
last_commit
Such a Commit object is our entry point for retrieving further data.
print(last_commit.__doc__)
It provides all data we need:
last_commit.__slots__
E.g., basic data like the commit message.
last_commit.message
Or the date of the commit.
last_commit.committed_datetime
Some information about the author.
last_commit.author.name
last_commit.author.email
Or file statistics about the commit.
last_commit.stats.files
Fill the DataFrame with data
Let’s check how fast we can retrieve all the authors from the commit’s data.
%%time
commits['author'] = commits['raw'].apply(lambda x: x.author.name)
commits.head()
Let’s go further and retrieve some more data (the DataFrame is transposed via T for display purposes).
%%time
commits['email'] = commits['raw'].apply(lambda x: x.author.email)
commits['committed_date'] = commits['raw'].apply(lambda x: pd.to_datetime(x.committed_datetime))
commits['message'] = commits['raw'].apply(lambda x: x.message)
commits['sha'] = commits['raw'].apply(lambda x: str(x))
commits.head(2).T
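One caveat I should mention: committed_datetime is timezone-aware. If the commits were made in different timezones, the committed_date column can end up with the generic object dtype instead of a real datetime dtype. A minimal sketch for normalizing everything to UTC, assuming the original local offsets aren’t needed:
# assumption: local commit timezones aren't needed, normalize everything to UTC
commits['committed_date'] = pd.to_datetime(
    commits['raw'].apply(lambda x: x.committed_datetime), utc=True)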
Dead easy and reasonably fast, but what about the modified files? Let’s challenge our computer a little bit more by extracting the statistics data of every commit. The Stats object contains all the touched files per commit, including the number of lines that were inserted or deleted.
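To give you an idea of its shape: stats.files maps each touched file’s path to a small dictionary with the keys insertions, deletions and lines (the concrete path and numbers in this sketch are made up for illustration):
{'src/main/java/org/springframework/samples/petclinic/PetClinic.java':
    {'insertions': 4, 'deletions': 1, 'lines': 5}}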
Additionally, we need some tricks to get the data we need. For this, I’ll guide you step by step through the approach. The main idea is to retrieve the real statistics data (not only the objects’ references) and temporarily store this statistics information as a Pandas Series. Then we take another round to transform this data for use in the DataFrame.
Cracking the stats.files object
This step is a little bit tricky and was found only after a good amount of trial and error. But it works in the end, as we will see. The goal is to unpack the information in the stats object into nice columns of our DataFrame via the Series#apply method. I’ll show you step by step how this works in principle (although it will work a little bit differently when using the apply approach).
As seen above, we have access to every file modification of each commit. In the end, it’s a dictionary with the filename as the key and a dictionary of the change attributes as values.
some_commit = commits.loc[56, 'raw']
some_commit.stats.files
We extract the dictionary of dictionaries in two steps. We have to keep in mind that all this tricky data transformation depends heavily on the right index. But first things first.
First, the outer dictionary: We create a Series from the dictionary.
dict_as_series = pd.Series(some_commit.stats.files)
dict_as_series
Second, we wrap that series into a DataFrame (for index reasons):
dict_as_series_wrapped_in_dataframe = pd.DataFrame(dict_as_series)
dict_as_series_wrapped_in_dataframe
After that, some magic occurs. We stack the DataFrame, meaning that we move our columns into our index, which becomes a MultiIndex.
stacked_dataframe = dict_as_series_wrapped_in_dataframe.stack()
stacked_dataframe
stacked_dataframe.index
With some manipulation of the index, we achieve what we need: an expansion of the rows for each file in a commit.
stacked_dataframe.reset_index().set_index('level_1')
With this (dirty?) trick, all files from the stats object can be assigned to the original index of our DataFrame.
In the context of a call with the apply method, the command looks a little bit different, but in the end, the result is the same (I picked a commit with multiple modified files from the DataFrame just to show the transformation a little bit better):
pd.DataFrame(commits[64:65]['raw'].apply(
lambda x: pd.Series(x.stats.files)).stack()).reset_index(level=1)
%%time
stats = pd.DataFrame(commits['raw'].apply(
lambda x: pd.Series(x.stats.files)).stack()).reset_index(level=1)
stats = stats.rename(columns={ 'level_1' : 'filename', 0 : 'stats_modifications'})
stats.head()
Unfortunately, this takes almost 30 seconds on my machine 🙁 (help needed! Maybe there is a better way to do this).
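One alternative I can think of is to skip the stack trick and build the rows with a plain nested list comprehension (a sketch; the stats_alternative name and the column layout are my own choice). Most of the time is probably spent by Git computing the per-commit diffs anyway, so I wouldn’t expect miracles:
# sketch: one row per (commit, touched file), extracted without apply/stack
rows = [(index, filename, info['insertions'], info['deletions'], info['lines'])
        for index, commit in commits['raw'].items()
        for filename, info in commit.stats.files.items()]
stats_alternative = pd.DataFrame(
    rows, columns=['index', 'filename', 'insertions', 'deletions', 'lines']
).set_index('index')
For the rest of this notebook, we stick with the apply variant.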
Next, we extract the data from the stats_modifications column. We do this by simply wrapping the dictionary in a Series, which will return the data needed.
pd.Series(stats['stats_modifications'].iloc[0])
With apply, it looks a little bit different because we are applying the lambda function along the DataFrame‘s index.
We get a warning because there seems to be a problem with the ordering of the index. But I haven’t found any errors with this approach so far.
stats_modifications = stats['stats_modifications'].apply(lambda x: pd.Series(x))
stats_modifications.head(7)
We join the newly created data with the existing one via the join method.
stats = stats.join(stats_modifications)
stats.head()
After we get rid of the now obsolete stats_modifications column…
del(stats['stats_modifications'])
stats.head()
…we join the existing DataFrame with the stats information (transposed for display purposes)…
commits = commits.join(stats)
commits.head(2).T
…and come to an end by deleting the raw data column, too (again transposed for display purposes).
del(commits['raw'])
commits.head(2).T
So we’re finished! A DataFrame that contains all the repository information needed for further analysis!
commits.info()
In the end, we still have our commits from the beginning, but now enriched with all the information we can work with in another notebook.
len(commits.index.unique())
Store for later usage
For now, we just store the DataFrame in HDF5 format with compression for later use (we get a warning because of the string objects we’re using, but as far as I know, that’s no problem).
commits.to_hdf("data/commits.h5", 'commits', mode='w', complevel=9, complib='zlib')
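In a follow-up notebook, the stored DataFrame can then simply be read back with
commits = pd.read_hdf("data/commits.h5", 'commits')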
All in one code block
This notebook is really long because it includes a lot of explanations. But if you just need the code to extract the data from a Git repository, here it is:
import pandas as pd
import git

# connect to the repository (using the more performant GitCmdObjectDB backend)
repo = git.Repo(r'C:\dev\repos\spring-petclinic', odbt=git.GitCmdObjectDB)

# read all commits of the master branch into a DataFrame
commits = pd.DataFrame(repo.iter_commits('master'), columns=['raw'])

# extract basic metadata from each Commit object
commits['author'] = commits['raw'].apply(lambda x: x.author.name)
commits['email'] = commits['raw'].apply(lambda x: x.author.email)
commits['committed_date'] = commits['raw'].apply(lambda x: pd.to_datetime(x.committed_datetime))
commits['message'] = commits['raw'].apply(lambda x: x.message)
commits['sha'] = commits['raw'].apply(lambda x: str(x))

# expand the per-commit file statistics into one row per touched file
stats = pd.DataFrame(commits['raw'].apply(lambda x: pd.Series(x.stats.files)).stack()).reset_index(level=1)
stats = stats.rename(columns={ 'level_1' : 'filename', 0 : 'stats_modifications'})

# unpack the insertions/deletions/lines dictionaries into separate columns
stats_modifications = stats['stats_modifications'].apply(lambda x: pd.Series(x))
stats = stats.join(stats_modifications)
del(stats['stats_modifications'])

# join everything together, drop the raw objects and store the result
commits = commits.join(stats)
del(commits['raw'])
commits.to_hdf("data/commits.h5", 'commits', mode='w', complevel=9, complib='zlib')
Summary
I hope you aren’t demotivated now by my Pandas approach for extracting data from Git repositories. Agreed, the stats object is a little unconventional to work with (and there may be better ways of doing it), but I think the result is pretty useful in the end.