Git activity punchcard
2021-02-22
Visualising Git commit frequency on a calendar using Matplotlib.
GitHub and Codeberg display the daily number of code contributions over the past year on members’ profile pages. For an example, see the “Public activity” tab on my Codeberg page [requires JavaScript]. GitHub apparently calls this a “contribution calendar”; I’ve also seen it called a “punchcard”. This post describes how to produce such a visualisation yourself.
First, a caveat: I take it as understood that commit frequency is not a meaningful metric in and of itself, since commits are not all of the same “size”. A commit may be a trivial bug fix, e.g. a typo, or version bump, or it may be the result of a week of research and experimentation. But the procedure described below can be modified to explore more interesting questions. Some examples:
- Do I make more commits later in the week, e.g. because there are lots of meetings on Mondays, or because sprints tend to start with experimentation and implementations only show up on the master branch later on?
- How many different projects am I working on in a given week?
- Do I tend to work on hobby projects during weeks when I’m not writing much code professionally?
- Am I consistent about updating the documentation repository towards the end of a sprint, or about writing feature specifications and integration tests before writing the implementations and unit tests?
In this post, however, I will simply recreate the punchcard of the popular code forges.
Extracting the data from Git
The simplest way to get the number of commits per day in a given repo is to use git log with a custom format to print the date for every commit, then count the occurrences of each date with uniq -c:
$ git log --format="%as" | sort | uniq -c
15 2019-12-02
5 2019-12-03
18 2019-12-04
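As a rough illustration of what the sort | uniq -c step does, the same tally can be sketched in Python with collections.Counter. The sample dates below are made up for the example and are not taken from a real repository.

```python
from collections import Counter

# Hypothetical output of `git log --format="%as"`: one date per commit.
log_output = """\
2019-12-04
2019-12-02
2019-12-03
2019-12-02
2019-12-04
2019-12-04
"""

# Tally identical lines, then list them in date order (like `sort | uniq -c`).
counts = Counter(log_output.split())
for day, n in sorted(counts.items()):
    print(f"{n:7d} {day}")
```

Because ISO dates sort lexicographically in chronological order, sorting the strings is enough to get the days in sequence.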
We need to make a few modifications to this command, though. First, let’s make sure we’re only pulling out our own commits by adding the --author=<pattern> filter.
Since my author name and email aren’t always consistent throughout the commit history, I like to do a sanity check on the pattern by comparing the output of these two commands:
$ git shortlog -se
$ git shortlog -se --author=<pattern>
If the pattern matches all the versions of your name and not anyone else’s, it is ready to be used with git log.
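The idea behind this sanity check can be sketched in Python: treat the pattern as a regular expression and see which identities it selects. The identities below are hypothetical, and Python's re dialect is not identical to the one Git uses, so this only approximates what --author does.

```python
import re

# Hypothetical identities, as `git shortlog -se` might list them.
identities = [
    "Jane Doe <jane@example.org>",
    "jdoe <jane.doe@example.com>",
    "John Smith <john@example.net>",
]


def matching(pattern, names):
    """Return the identities in which `pattern` occurs (regex search)."""
    return [name for name in names if re.search(pattern, name)]


# A good pattern selects every spelling of one author and nobody else.
print(matching("jane", identities))
```

Here the lowercase pattern matches the first identity through its email address rather than its name, which is exactly the kind of detail the comparison of the two shortlog outputs brings to light.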
Optionally, add the --all flag to get the commits on all branches, not just master.
$ git log --all --author="javiljoen" --format="%as" \
| sort | uniq -c
11 2019-12-02
3 2019-12-03
18 2019-12-04
4 2019-12-05
Now, because we’re going to be plotting the calendar with weeks along one axis and weekdays along the other, it would be more convenient to turn our standard ISO dates into ISO week date format. We can use dateconv from dateutils.
$ git log --all --author="javiljoen" --format="%as" \
| dateconv -f "%G-W%V-%u" \
| sort | uniq -c
11 2019-W49-1
3 2019-W49-2
18 2019-W49-3
4 2019-W49-4
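If dateutils is not available, the same conversion can be sketched with Python's standard library. The function name to_week_date is made up for this example; it relies on date.isocalendar(), which returns the ISO year, week number and weekday.

```python
from datetime import date


def to_week_date(iso_date):
    """Convert 'YYYY-MM-DD' to ISO week date format 'YYYY-Www-D'."""
    year, week, weekday = date.fromisoformat(iso_date).isocalendar()
    return f"{year}-W{week:02d}-{weekday}"


print(to_week_date("2019-12-02"))  # 2019-W49-1
print(to_week_date("2018-12-31"))  # 2019-W01-1
```

The second call shows why the ISO year (%G) is used rather than the calendar year: the last day of 2018 belongs to the first ISO week of 2019.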
Next, to make it easier to parse, let’s emit the data in CSV format by replacing the uniq -c step with an equivalent awk command and inserting a header line with sed:
$ git log --all --author="javiljoen" --format="%as" \
| dateconv -f "%G-W%V-%u" \
| awk '{cnt[$0]++} END {for (ln in cnt) print ln","cnt[ln]}' \
| sort \
| sed "1iDate,Commits"
Date,Commits
2019-W49-1,11
2019-W49-2,3
2019-W49-3,18
2019-W49-4,4
To run this command on a number of repos, I put it in a shell function in the following script.
#!/usr/bin/bash

# Print a CSV of commits per ISO week date for one repository.
count-commits() {
    local repo=$1
    git -C "$repo" log --all --author="javiljoen" --format="%as" \
        | dateconv -f "%G-W%V-%u" \
        | awk '{cnt[$0]++} END {for (ln in cnt) print ln","cnt[ln]}' \
        | sort \
        | sed "1iDate,Commits"
}

repos=(~/projects/ProjectA \
       ~/projects/ProjectB)

# Quote the expansions so that paths containing spaces survive intact.
for repo in "${repos[@]}"
do
    echo "Counting commits in $repo ..."
    outcsv=$(basename "$repo").csv
    count-commits "$repo" > "$outcsv" && \
        echo "... saved to $(realpath "$outcsv")"
    echo
done
This would produce two files in my working directory, ProjectA.csv and ProjectB.csv (depending on the contents of the repos array).
On to the visualisation!
Reading the data into Python
Let’s read the data into a single dataframe using Pandas. To retain the repo name in the index, we can use pd.concat(dfs, keys=repos):
from pathlib import Path

import pandas as pd

csvs = sorted(Path().glob("*.csv"))
data = pd.concat(
    (pd.read_csv(f, index_col="Date") for f in csvs),
    keys=[f.stem for f in csvs],
    names=["Project", "Date"],
)
[The full script is given in the Appendix.]
The dataframe data now looks something like this, with “Project” and “Date” as the index:
data.groupby("Project").tail(3)
| Project | Date | Commits |
|---|---|---|
| ProjectA | 2020-W41-3 | 6 |
| | 2020-W41-4 | 1 |
| | 2020-W42-3 | 6 |
| ProjectB | 2020-W40-1 | 2 |
| | 2020-W42-3 | 1 |
| | 2020-W49-4 | 1 |
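To try out the keys and names arguments without any CSV files on disk, here is a minimal sketch with two small in-memory dataframes standing in for the files. The values are made up.

```python
import pandas as pd

# Two small stand-ins for the per-repo CSV files (values are made up).
a = pd.DataFrame(
    {"Commits": [6, 1]},
    index=pd.Index(["2020-W41-3", "2020-W41-4"], name="Date"),
)
b = pd.DataFrame(
    {"Commits": [2]},
    index=pd.Index(["2020-W40-1"], name="Date"),
)

# `keys` becomes the outer index level; `names` labels both levels.
combined = pd.concat([a, b], keys=["ProjectA", "ProjectB"], names=["Project", "Date"])
print(combined)

# Selecting on the outer level gives back one project's rows.
print(combined.loc["ProjectA"])
```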
To turn this single “Commits” column into a 2D week-by-weekday grid for plotting, we need to do a few transformations:
- Split the “Date” values into week number and weekday number (day of the week).
- Add up the commits for the various projects that fall on the same day.
- Insert the commits into the right cells of a matrix of week vs. day of the week.
- Fill in zeroes for the remaining cells, i.e. the days on which there were no commits in our git log.
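Before turning to the Pandas version, the four steps can be sketched in plain Python to make the intended result concrete. The records and the name grid_rows are made up for this illustration.

```python
from collections import defaultdict

# Hypothetical (project, ISO week date, commits) records.
records = [
    ("ProjectA", "2019-W06-1", 1),
    ("ProjectA", "2019-W06-3", 6),
    ("ProjectB", "2019-W06-3", 2),
    ("ProjectA", "2019-W07-1", 5),
]

# One row of seven zeroes per week; days without commits stay at zero.
grid_rows = defaultdict(lambda: [0] * 7)
for _project, week_date, commits in records:
    week, day = week_date[:8], int(week_date[-1])  # split "2019-W06-3"
    grid_rows[week][day - 1] += commits            # sum across projects

for week in sorted(grid_rows):
    print(week, grid_rows[week])
```

Note that weeks without any records do not appear at all in this sketch; only days within a recorded week are zero-filled.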
The first step can be done with DataFrame.assign():
dates = data.index.to_frame()["Date"]
data = data.assign(
    Week=dates.str[:8],
    Day=dates.str[-1].astype(int),
)
This adds two new columns to the dataframe:
data.head(6)
| Project | Date | Commits | Week | Day |
|---|---|---|---|---|
| ProjectA | 2019-W06-1 | 1 | 2019-W06 | 1 |
| | 2019-W06-3 | 6 | 2019-W06 | 3 |
| | 2019-W06-4 | 6 | 2019-W06 | 4 |
| | 2019-W06-5 | 8 | 2019-W06 | 5 |
| | 2019-W06-7 | 6 | 2019-W06 | 7 |
| | 2019-W07-1 | 5 | 2019-W07 | 1 |
The next three steps can be done in a single call to DataFrame.pivot_table():
grid = data.pivot_table(
    values="Commits", index="Week", columns="Day",
    aggfunc="sum",  # the string name avoids a warning in recent Pandas versions
    fill_value=0,
)
which transforms the commits column into a matrix like this:
grid.head(2)
| Week / Day | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| 2019-W06 | 1 | 0 | 6 | 6 | 8 | 0 | 6 |
| 2019-W07 | 5 | 8 | 4 | 3 | 11 | 3 | 0 |
Note, however, that this matrix only contains weeks for which we have data. To insert rows filled with zeroes for weeks in which there weren’t any commits — and also to remove weeks outside our range of interest — we can reindex this dataframe with the range of weeks we want to plot:
jan_to_sep = [f"2019-W{i:02d}" for i in range(1, 40)]
to_plot = grid.reindex(index=jan_to_sep, fill_value=0)
We now also have rows of zeroes for the weeks with no commits, and the matrix is ready for plotting:
to_plot
| Week / Day | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| 2019-W01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2019-W02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2019-W03 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2019-W04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2019-W05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2019-W06 | 1 | 0 | 6 | 6 | 8 | 0 | 6 |
| 2019-W07 | 5 | 8 | 4 | 3 | 11 | 3 | 0 |
| […] | | | | | | | |
| 2019-W35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2019-W36 | 0 | 0 | 11 | 0 | 1 | 0 | 2 |
| 2019-W37 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2019-W38 | 0 | 2 | 0 | 1 | 0 | 0 | 0 |
| 2019-W39 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Plotting the punchcard with Matplotlib
We can use Axes.imshow() to plot the grid as a heatmap. But to make sure that the cells are correctly aligned and that the tick labels on the x axis make sense, we need to pass in an extent array listing the first and last week (x extent), and the first and last day of the week, offset by 0.5 (y extent).
One complication is that the values for the x extent need to be floats representing Matplotlib dates. So we have to:
- turn the week number into a full date by appending a “null” day of 1, i.e. Monday;
- parse the date string into a datetime object with datetime.strptime();
- convert the datetime into a timestamp with matplotlib.dates.date2num().
from datetime import datetime

import matplotlib.dates as mdates


def week2num(w):
    dt = datetime.strptime(f"{w}-1", "%G-W%V-%u")
    return mdates.date2num(dt)


weeks = to_plot.index.values
x_min = week2num(weeks[0])
x_max = week2num(weeks[-1])
y_min, y_max = -0.5, 6.5
extent = [x_min, x_max, y_min, y_max]
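As a sanity check on the first two steps (leaving out the Matplotlib conversion), the following sketch parses a week string into the date of its Monday and compares it with date.fromisocalendar(), which needs Python 3.8 or later. The helper name week_start is made up for this example.

```python
from datetime import date, datetime


def week_start(week):
    """Return the Monday of an ISO week given as 'YYYY-Www'."""
    return datetime.strptime(f"{week}-1", "%G-W%V-%u").date()


print(week_start("2019-W06"))  # 2019-02-04

# Cross-check against the standard library's ISO calendar constructor.
assert week_start("2019-W06") == date.fromisocalendar(2019, 6, 1)
```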
The data can then be plotted onto this extent:
import matplotlib.pyplot as plt

# Note: from Matplotlib 3.6 on, this style is named "seaborn-v0_8-dark".
plt.style.use("seaborn-dark")

# This aspect ratio produces more-or-less square cells:
figsize = len(weeks) / 7 * 1, 1.3
commits = to_plot.values.T

fig, ax = plt.subplots(figsize=figsize)
im = ax.imshow(
    commits,
    extent=extent, aspect="auto",
    cmap="Blues",
)
To get readable tick labels on the x axis, we set the following:
locator = mdates.AutoDateLocator()
formatter = mdates.ConciseDateFormatter(locator, show_offset=False)
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(formatter)
And to label the days of the week in the correct order on the y axis (tweaking the offset values until it looks right):
days = to_plot.columns.values
ax.set_yticks(days[::-1] - 0.55)
ax.set_yticklabels("MTWTFSS", va="top", ha="center", x=-0.01)
Let’s also add the colour scale, so that we can see the number of commits represented by each hue:
fig.colorbar(im)
Note that this uses a continuous colour scale, with a unique hue for each value. Although this represents the number of commits per day faithfully, it makes it hard to read the corresponding value off the scale bar.
We can instead group the values into bins, which can be represented with a discrete colour scale with a smaller number of distinct hues. This makes the scale bar more useful, at the cost of reducing the colour resolution of the heatmap.
We could divide the range of values into bins of equal sizes, by specifying the bin boundaries with a linear series, such as [0, 5, 10, 15, 20]. But since there is a bigger qualitative difference between coding sessions that result in 1 vs. 2 vs. 5 commits than between those that produce 21 vs. 22 vs. 25 commits, I think it makes more sense to use a geometric series, resulting in bins of successively increasing sizes, such as [0, 1, 2, 4, 9, 18], with higher resolution at the lower end of the range.
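To see which commit counts end up sharing a colour under such boundaries, here is a small standard-library sketch using the example series from above. It treats each bin as half-open, [lower, upper), and is a simplified stand-in for boundary-based colour normalisation, not Matplotlib's implementation.

```python
from bisect import bisect_right

bounds = [0, 1, 2, 4, 9, 18]  # the example boundaries from the text


def bin_index(count):
    """Index of the half-open bin [bounds[i], bounds[i + 1]) containing count."""
    return bisect_right(bounds, count) - 1


for count in (0, 1, 3, 5, 17):
    i = bin_index(count)
    print(f"{count:2d} commits -> bin {i}: [{bounds[i]}, {bounds[i + 1]})")
```

Days with no commits get a bin of their own, and single-commit days are kept apart from days with two or three commits. For the plot itself, the boundaries are computed from the data rather than hard-coded: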
import numpy as np
import matplotlib.colors as mcolors

n_bins = 5
# Boundaries: 0, followed by a geometric series from 1 to the maximum count + 1.
bounds = np.zeros(n_bins + 1, dtype=int)
bounds[1:] = np.geomspace(1, commits.max() + 1, n_bins, dtype=int)

fig, ax = plt.subplots(figsize=figsize)
im = ax.imshow(
    commits,
    extent=extent, aspect="auto",
    cmap=plt.get_cmap("Blues", n_bins),
    norm=mcolors.BoundaryNorm(bounds, n_bins),
)
Then, after applying the same axis formatting as before, we get this:
Appendix: Plotting script
Here are the above commands for reading and plotting the commit data, consolidated into a single script:
#!/usr/bin/env python3
from datetime import datetime
from pathlib import Path

import matplotlib.colors as mcolors
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Read the per-repo CSVs into one dataframe indexed by (Project, Date).
csvs = sorted(Path().glob("*.csv"))
data = pd.concat(
    (pd.read_csv(f, index_col="Date") for f in csvs),
    keys=[f.stem for f in csvs],
    names=["Project", "Date"],
)

# Split the ISO week dates into week and weekday.
dates = data.index.to_frame()["Date"]
data = data.assign(
    Week=dates.str[:8],
    Day=dates.str[-1].astype(int),
)

# Sum the commits per day across projects into a week-by-weekday grid.
grid = data.pivot_table(
    values="Commits", index="Week", columns="Day",
    aggfunc="sum",
    fill_value=0,
)

# Restrict to the weeks of interest, with zeroes for weeks without commits.
jan_to_sep = [f"2019-W{i:02d}" for i in range(1, 40)]
to_plot = grid.reindex(index=jan_to_sep, fill_value=0)

weeks = to_plot.index.values
days = to_plot.columns.values
commits = to_plot.values.T


def week2num(w):
    """Convert an ISO week such as "2019-W06" to a Matplotlib date number."""
    dt = datetime.strptime(f"{w}-1", "%G-W%V-%u")
    return mdates.date2num(dt)


x_min = week2num(weeks[0])
x_max = week2num(weeks[-1])
y_min, y_max = -0.5, 6.5
extent = [x_min, x_max, y_min, y_max]

# Bin boundaries: 0, followed by a geometric series up to the maximum count + 1.
n_bins = 5
bounds = np.zeros(n_bins + 1, dtype=int)
bounds[1:] = np.geomspace(1, commits.max() + 1, n_bins, dtype=int)

# Note: from Matplotlib 3.6 on, this style is named "seaborn-v0_8-dark".
plt.style.use("seaborn-dark")
figsize = len(weeks) / 7 * 1, 1.3

fig, ax = plt.subplots(figsize=figsize)
im = ax.imshow(
    commits,
    extent=extent, aspect="auto",
    cmap=plt.get_cmap("Blues", n_bins),
    norm=mcolors.BoundaryNorm(bounds, n_bins),
)

# Date ticks on the x axis, weekday initials on the y axis.
locator = mdates.AutoDateLocator()
formatter = mdates.ConciseDateFormatter(locator, show_offset=False)
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(formatter)
ax.set_yticks(days[::-1] - 0.55)
ax.set_yticklabels("MTWTFSS", va="top", ha="center", x=-0.01)

fig.colorbar(im)
fig.tight_layout()
fig.savefig("punchcard.svg", transparent=True)