In the past, I did a lot of Git log analysis on my blog. The main reason is that developers know what Git is and what kind of data it provides, so it is easy to connect with developers before moving on to more advanced analyses of Git data.
But there is a whole area of problems with these kinds of analyses when you want to do file-based analysis in a long-running repository: deletions, merges, splits, and renames.
For the last one, renames, I want to show you in this notebook the kinds of problems that can occur:
Example Git repository
For this analysis, we want to use a small but long-lived repository: the Spring PetClinic project (anti-refactored by me to show some interesting things).
We first clone this repository locally.
%%bash
git clone https://github.com/JavaOnAutobahn/spring-petclinic
Next, we export the Git history using a special command (background explained here).
%%bash
# change into the cloned repository
cd spring-petclinic
# export per-file additions and deletions (--numstat) plus the author date of each commit
# for all Java files; %x09 stands for a tab character, %ai for the author date
git log --numstat --pretty=format:"%x09%x09%x09%ai" -- *.java > git_log.csv
With a little helper function, we import the exported data (see link above for details on that as well).
import pandas as pd

def parse_git_log(path):

    # read the tab-separated export
    git_log = pd.read_csv(
        path,
        sep="\t",
        header=None,
        names=[
            'additions',
            'deletions',
            'filename',
            'timestamp'])

    # forward-fill the commit timestamps onto the file entries of each commit
    git_log = git_log[['additions', 'deletions', 'filename']]\
        .join(git_log[['timestamp']]\
        .fillna(method='ffill'))\
        .dropna().reset_index(drop=True)

    # data type conversions and churn calculation
    git_log['additions'] = pd.to_numeric(git_log['additions'], errors='coerce')
    git_log['deletions'] = pd.to_numeric(git_log['deletions'], errors='coerce')
    churn = git_log['additions'] - git_log['deletions']
    git_log.insert(2, "churn", churn)
    git_log['timestamp'] = pd.to_datetime(git_log['timestamp'])

    return git_log.set_index('timestamp')
timed_log = parse_git_log("spring-petclinic/git_log.csv")
timed_log.head()
So what we get is a nicely parsed pandas DataFrame that we can use for further analysis.
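If you want to double-check that the numeric and datetime conversions worked, a quick look at the data types helps (a small sanity check that is not part of the original notebook):

# additions, deletions, and churn should be numeric now,
# and the timestamps should form a DatetimeIndex
print(timed_log.dtypes)
print(timed_log.index.dtype)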
Analysis
Let’s dive into the actual problem analysis. Say we want to do some file-based analysis of the software project based on Git data. So we group the log entries by filename and sum up the additions, deletions, and churn per file.
(Note that we also keep each file’s most recent timestamp for a later analysis based on the most recent data: git log lists the newest commits first, so taking the "first" timestamp per file gives us the latest change.)
file_churns = timed_log.reset_index().groupby('filename').agg({
    "additions": "sum",
    "deletions": "sum",
    "churn": "sum",
    "timestamp": "first"
})
file_churns.head()
So, at this point, something weird happens: There are files that have a negative number of lines!
How can this happen?
weird_churns = file_churns[file_churns['churn'] < 0].sort_values(by="timestamp", ascending=False)
weird_churns.head()
Let’s look at the most recent file with such a negative number of lines ("recent" because then it is more likely that it still exists in the repository).
weird_churn_filename = weird_churns.iloc[0].name
weird_churn_filename
For this file, we want to follow its development. Using Git’s --follow option, we can trace the evolution of this single file. As with the first Git data export, we store the result in a file.
%%bash
cd spring-petclinic
git log --numstat --pretty=format:"%x09%x09%x09%ai" --follow src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java > ../weird_churn_filename_log.csv
Let’s read in the data with our little helper function from above.
weird_file_churn = parse_git_log("weird_churn_filename_log.csv")
weird_file_churn.head()
Insights
OK, what is the problem with the negative number of lines?
Let’s look at the history of this one specific file: It was renamed several times!
weird_file_churn['filename'].value_counts()
Although Git provides rename detection, some of these renames are not ones that Git can detect (only the entries containing => were recorded as renames by Git), which makes it difficult to track the file’s history with standard means.
If we now sum up all the churn values across these filenames, we get the actual number of lines of the file based on pure Git repository data.
weird_file_churn['churn'].sum()
Let’s compare this with the actual number of lines in the real file using the word count command wc.
%%bash
wc -l spring-petclinic/src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java
Cool, this one matches! This might not always be the case, for example if you do some weird renaming in your code base or if files get merged or split up.
weird_file_churn[weird_file_churn['filename'] == weird_churn_filename]['churn'].sum()
Visualization
Let’s look at the number of lines for this specific file over time to get a feeling for whether the data is right at all.
%matplotlib inline
weird_file_churn[['additions', 'deletions', 'churn']].cumsum().plot(figsize=[20,5]);
We see that we somehow get a negative number of lines of code at the beginning, which could be an indication that something went wrong with the earlier rename detection. But later on, we get a positive number of lines.
Conclusion
So there are limitations to Git repository analysis if you don’t want to dive deep into a more sophisticated model of a project’s evolution.
Here are some ideas to mitigate this problem around renames:
1. Use more advanced Git repository mining tools: There are tools like the open-source tool PyDriller or commercial tools like CodeScene or TeamScale (from the latter I know that they’ve invested significant brain power into solving file renaming and merging problems). A minimal PyDriller sketch follows after this list.
2. Leverage Git’s rename detection: Git provides rename detection by default. You might be able to tweak some parameters to get the results you need. I once used this, but I can’t remember any further details 🙁
3. Avoid file-based Git analysis: There are plenty of other interesting analyses waiting for you out there that could be more valuable in your specific context.
4. Use the actual lines of code: You might use tools like cloc to get the real number of lines of the files that currently exist in the repository.
As of today, I’ve chosen the latter two options (with a tendency to 3. ;-)).
Using Git repository data together with the actual number of lines of code (option 4.) is good enough for me to get a first glimpse at the evolution of a software project.
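A sketch of how that combination could look: let cloc count the real lines per file and join them onto the Git-based file_churns table (this assumes cloc is installed and offers the --by-file --json output; the path handling may need adjustments in your setup):

import json
import subprocess
import pandas as pd

# count the real lines of code per file in the current working tree
cloc_run = subprocess.run(
    ["cloc", "--by-file", "--json", "spring-petclinic/src"],
    capture_output=True, text=True, check=True)
cloc_result = json.loads(cloc_run.stdout)

# keep only the per-file entries (cloc also emits "header" and "SUM" entries)
# and strip the repository prefix so the paths match the filenames from the Git log
lines_today = pd.Series(
    {path.replace("spring-petclinic/", "", 1): entry["code"]
     for path, entry in cloc_result.items()
     if path not in ("header", "SUM")},
    name="lines_today")

# files without a match were deleted or renamed, which are exactly the suspicious cases
comparison = file_churns.join(lines_today)
comparison[['churn', 'lines_today']].head()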
Your context could be a different one where you have to choose more sophisticated techniques to handle all the problems around Git analysis. It would be very interesting to get to know your specific context!