Introduction¶
In this blog post, I want to show you a nice complexity metric that works for most major programming languages that we use for our software systems – the indentation-based complexity metric. Adam Tornhill describes this metric in his fabulous book Software Design X-Rays on page 25 as follows:
With indentation-based complexity we count the leading tabs and whitespaces to convert them into logical indentations. … This works because indentations in code carry meaning. Indentations are used to increase readability by separating code blocks from each other.
He further says that the trend of the indentation-based complexity, in combination with the lines of code metric, is a good indicator of complexity growth. If the lines of code don’t change but the indentation does, you have a complexity problem in your code.
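To get a feel for what is measured, here is a minimal sketch (my own illustration, not code from the book) that sums up the leading whitespace of a few C-like lines, just like we will do for the Linux kernel below:

sample = [
    "int main() {",
    "    if (x > 0) {",
    "        return 1;",
    "    }",
    "}",
]
# indentation of a line = number of leading whitespace characters
indents = [len(line) - len(line.lstrip()) for line in sample]
print(indents)       # [0, 4, 8, 4, 0]
print(sum(indents))  # 16 -> the file's total indentation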
In this blog post, we are just interested in a static view of the code’s indentation (the evolution of the indentation-based complexity will be discussed in another blog post). I want to show you how you can spot complex areas in your application using the Pandas data analysis framework. As the application under analysis, I chose the Linux kernel project because we want to apply our analysis at a large scale.
The Idea¶
This analysis works as follows:
- We search for relevant source code files using `glob`
- We read in the source code by hacking Pandas’ `read_csv()` method
- We extract the preceding whitespaces/tabs in the code
- We calculate the ratio between the lines of code of a file and the indentation complexity
- We spot the areas in the Linux kernel that are most complex
- We visualize the complete result with a treemap
So let’s start!
Analysis¶
First, let’s get all the files with a recursive file search by using `glob`. Because the Linux kernel code is mostly written in the C programming language, we search for the corresponding file endings `.c` (C program code files) and `.h` (header files).
import glob

# recursively collect all C source (*.c) and header (*.h) files of the kernel
file_list = glob.glob("../../linux/**/*.[c|h]", recursive=True)
file_list[:5]
Read in the content¶
With the `file_list` at hand, we can now read in the content of the source code files. We could have used a standard Python `with open(<path>) as f` and read in the content with `f.read()`, but we want to get the data into a Pandas DataFrame (abbreviated “df”) as early as possible to leverage the power of the data analysis framework.
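For comparison, the plain-Python variant mentioned above could look roughly like this (a sketch, not the approach we take here):

contents = {}
for path in file_list:
    # read each file as one big string; we would still have to split and
    # structure the lines ourselves afterwards
    with open(path, encoding='latin-1') as f:
        contents[path] = f.read()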
For each path in our file list, we create a single DataFrame `content_df`. We use a line break `\n` as column separator, ensuring that we create a single row for each source code line. We also specify the `encoding` parameter with the file encoding `latin-1` to make sure that we can read in weird file contents, too (this is totally normal when working in an international team). We also set `skip_blank_lines` to `False` to keep blank lines. Keeping the blank lines is necessary for debugging purposes (you can inspect certain source code lines with a text editor more easily) as well as for later analysis.
The process of reading in all the source code could take a while for the thousands of files of the Linux kernel. After looping through and reading in all the data, we concatenate all DataFrames with the `concat` method. This gives us a single DataFrame with all source code content for all source code files.
import pandas as pd

content_dfs = []

for path in file_list:
    content_df = pd.read_csv(
        path,
        encoding='latin-1',
        sep='\n',
        skip_blank_lines=False,
        names=['line']
    )
    content_df.insert(0, 'filepath', path)
    content_dfs.append(content_df)

content = pd.concat(content_dfs)
content.head()
content.info()
Clean the data¶
We convert `filepath` into a categorical data type to optimize performance and memory consumption. We then get the operating-system-specific directory separator right and get rid of the superfluous parts of the path.
content['filepath'] = pd.Categorical(content['filepath'])
content['filepath'] = content['filepath']\
    .str.replace("\\", "/")\
    .str.replace("../../linux/", "")
content.head(1)
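If you want to check that the categorical conversion actually pays off, a quick comparison of the memory footprint could look like this (a sketch; the exact numbers depend on your kernel checkout):

# deep=True also counts the memory of the underlying strings/categories
print(content['filepath'].memory_usage(deep=True))
print(content['filepath'].astype(str).memory_usage(deep=True))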
Next, we make sure that the blank lines (which Pandas read in as `NaN` values because of `skip_blank_lines=False`) become empty strings, and we expand tabs into four spaces so that indentation is measured consistently.
content['line'] = content['line'].fillna("")
FOUR_SPACES = " " * 4
content['line'] = content['line'].str.replace("\t", FOUR_SPACES)
content.head(1)
Get the measures¶
Let’s get some measures that can help us to spot complex code. We add some additional information to make further analysis and debugging easier: We keep track of the line number of each source code file and create a single continuous index for all source code lines.
content['line_number'] = content.index + 1
content = content.reset_index(drop=True)
content.head(1)
We also mark comment lines (using a simple regular expression heuristic for C-style comments) and empty lines so that we can exclude them from the analysis later on.
content['is_comment'] = content['line'].str.match(r'^ *(//|/\*|\*).*')
content['is_empty'] = content['line'].str.replace(" ","").str.len() == 0
content.head(1)
The indentation of a line is then simply the number of leading whitespace characters, i.e. the length of the raw line minus the length of the line with the leading whitespace stripped off.
content['indent'] = content['line'].str.len() - content['line'].str.lstrip().str.len()
content.head(1)
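A quick sanity check of these heuristics on a few hand-made example lines could look like this (a sketch, independent of the DataFrame above):

import re

examples = ["\tint a = 0;", "    // a comment", "", "        return a;"]
for line in examples:
    line = line.replace("\t", " " * 4)
    is_comment = bool(re.match(r'^ *(//|/\*|\*).*', line))
    indent = len(line) - len(line.lstrip())
    print(repr(line), is_comment, indent)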
Get the source code¶
We make sure to only inspect the real source code lines. Because we have the information about the blank lines and comments, we can filter these out very easily. We immediately aggregate the indentations with a `count` and a `sum` to get the number of source code lines as well as the sum of all indents for each `filepath` aka source code file. We also rename the columns of our new DataFrame accordingly.
source_code_content = content[~content['is_comment'] & ~content['is_empty']]
source_code = source_code_content.groupby('filepath')['indent'].agg(['count', 'sum'])
source_code.columns = ['lines', 'indents']
source_code.head()
Let’s take a first look at the relationship between the lines of code and the indentations of the files with a scatter plot.
%matplotlib inline
source_code.plot.scatter('lines', 'indents', alpha=0.3);
Analyze complexity¶
Let’s build the ratio between the indentations and the lines of code to kind of normalize the data. This is the complexity measure that we are using further on.
source_code['complexity'] = source_code['indents'] / source_code['lines']
source_code.head(1)
A histogram shows us how this complexity measure is distributed across all files.
source_code['complexity'].hist(bins=50)
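By the way, if you are curious which individual files stand out before we aggregate by component, a quick peek could look like this (a sketch using the same DataFrame):

# the ten files with the highest indentation-per-line ratio
source_code['complexity'].nlargest(10)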
Complexity per component¶
Next, we execute an analysis per component to find out where the most complex areas are in the application. For this, we first sum up the metrics for each component. We can identify a component in the Linux kernel roughly by using the first two parts of the file path.
source_code['component'] = source_code.index\
    .str.split("/", n=2)\
    .str[:2].str.join(":")
source_code.head(1)
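To make the mapping concrete, here is what the split produces for an example kernel file path (plain Python for illustration):

path = "drivers/gpu/drm/radeon/r600.c"
# split at the first two slashes, keep the first two parts and join them with ":"
":".join(path.split("/", 2)[:2])  # -> 'drivers:gpu'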
measures_per_component = source_code.groupby('component').sum()
measures_per_component.head()
measures_per_component['complexity'].nlargest(10)
Visualization¶
Finally, we plot the complexity per component with a treemap. We use the Python visualization library `pygal` (http://www.pygal.org), which is very easy to use and fits our use case perfectly. As the size of a treemap rectangle, we use the lines of code of the component. As color, we choose red and use the red color’s alpha level (aka the normalized complexity `comp_norm` between 0 and 1) to determine how red a rectangle of the treemap should be.
This gives us a treemap with the following properties:
- The bigger the component/rectangle, the more lines of code.
- The redder a rectangle, the more complex the component.
We render the treemap as a PNG image (to save space) and display it directly in the notebook.
import pygal
from IPython.display import Image
config = pygal.Config(show_legend=False)
treemap = pygal.Treemap(config)
max_comp = measures_per_component['complexity'].max()
for row in measures_per_component.iterrows():
    filename = row[0]
    entry = row[1]
    comp_norm = entry['complexity'] / max_comp
    data = {}
    data['value'] = entry['lines']
    data['color'] = 'rgba(255,0,0,' + str(comp_norm) + ')'
    treemap.add(filename, [data])
Image(treemap.render_to_png())
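If `render_to_png()` complains about missing dependencies on your machine (as far as I know it relies on additional packages such as cairosvg), you can fall back to writing the chart to an SVG file instead; the file name below is just an example:

# write an interactive SVG that can be opened in any browser
treemap.render_to_file('linux_complexity_treemap.svg')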
Conclusion¶
In this blog post, I showed you how we can obtain an almost programming-language-agnostic measure of complexity by measuring the indentation of source code. We also spotted the TOP 10 most complex components and visualized the complete result as a treemap.
All this was relatively easy and straightforward to do by using standard data science tooling. I hope you liked it, and I would be happy if you could provide some feedback in the comment section.