groupby
, you can group data in a DataFrame and apply calculations on those groups in various ways.This Cheatbook (Cheatsheet + Notebook) introduces you to the core functionality of pandas’ groupby
function. Here can find the executable Jupyter Notebook version to directly play around with it!
References¶
Here you can find out more about this function.
Example Scenario¶
This is an excerpt of a file list from a directory with the following information as separate columns / Series:
file
: The name of the filedir
: The name of the directory where the file lives inbytes
: The size of the file in bytes
This data is stored into a pandas’ DataFrame named df
.
import pandas as pd
df = pd.DataFrame({
"file" : ['tutorial.md', 'hello.java', 'controller.java', "build.sh", "deploy.sh"],
"dir" : ["docs", "src", "src", "src", "src"],
"bytes" : [124, 54, 36, 78, 62]
})
df
When to use it¶
groupby
is a great way to summarize data in a specific way to build a more higher-level view on your data (e.g., to go from code level to module level). E.g., in our scenario, we could count the number of files per directory. Let’s take a look at this use case step by step.
Basic Principles¶
groupby
function on our DataFrame df
. As parameter, you can put in the name (or a list of names) of the Series you want to group. In our case, we want to group the directories / the Series dir
.df.groupby('dir')
This gives you a GroupBy object. We can take a look at the built groups by inspecting the groups
object of the GroupBy object.
df.groupby('dir').groups
The groups
object shows you the groups and their members, using their indexes.
Aggregating Values¶
Now we have built some groups, but now what? The next step is to decide what we want to do with the values that belong to a group. This means we need to tell the GroupBy object how we want to group the values. We can apply a multitude of aggregating functions here, e.g.
count
: count the number of entries of each groupfirst
: take the first entry of each groupmax
: take the entry with the highest value
We can also apply dedicated functions on each group using e.g.,
agg
: apply a variety of aggregating functions on the groups (e.g., building the sum as well as counting the values at once)apply
: apply a custom function on each group to execute calculations as you liketransform
: calculate summarizing values for each group (e.g., the sum of all entries for each group)
But step by step!
Counting values¶
We group by the directories and use count
on the built groups.
df.groupby('dir').count()
This gives us the number of entries of files and bytes information in each directory.
Summing up¶
We group by the directories and use sum
on the built groups.
df.groupby('dir').sum()
This gives us the number of bytes of all files that reside in a directory. Note that there is no more file
Series because it doesn’t contain any values we could sum up. So this Series was thrown away.
More Advanced Use Cases¶
Let’s dig deeper into our example scenario. We want to find out which kind of files occupy what space in which directory. For this, we extract the files’ extensions from the file
series. We use the string split
function to split by the .
sign and keep just the last piece of the split file name (which is the file’s extension).
df['ext'] = df["file"].str.split(".").str[-1]
df
We can then group this data in a more sophisticated way by using two Series for our groups. We sum up the numeric values (= the bytes
) for each file for each group.
dir_ext_bytes = df.groupby(['dir', 'ext']).sum()
dir_ext_bytes
Last, we want to calculate the ratio of the files’ bytes for each extension. We first calculate the overall size for each extension in each directory by using transform
. The transform
function doesn’t compute results for each value of a group. Instead, it provides results for all values of a group.
bytes_per_dir = dir_ext_bytes.groupby('dir').transform('sum')
bytes_per_dir
In our case, we summed up all the files’ bytes of the file extensions per directory. We can add this new information to our existing DataFrame.
dir_ext_bytes['all'] = bytes_per_dir
dir_ext_bytes
Now we are able to calculate the ratio.
dir_ext_bytes['ratio'] = dir_ext_bytes['bytes'] / dir_ext_bytes['all']
dir_ext_bytes
The result gives us the ratios of file sizes per file type for each directory.
Summary¶
The groupby
function is an excellent way to summarize data. It will create a higher-level view of your fine-grained raw data.