Git activity punchcard

2021-02-22

Visualising Git commit frequency on a calendar using Matplotlib.

GitHub and GitLab display the daily number of code contributions over the past year on members’ profile pages. For an example, see the “Overview” section on my GitLab.com page: https://gitlab.com/javiljoen [requires JavaScript]. GitHub apparently calls this a “contribution calendar”; I’ve also seen it called a “punchcard”. This post describes how to produce such a visualisation yourself.

Contents

  1. Extracting the data from Git
  2. Reading the data into Python
  3. Plotting the punchcard with Matplotlib
  4. Appendix: Plotting script

First, a caveat: I take it as understood that commit frequency is not a meaningful metric in and of itself, since commits are not all of the same “size”. A commit may be a trivial bug fix, e.g. a typo or version bump, or it may be the result of a week of research and experimentation. But the procedure described below can be modified to explore more interesting questions.

In this post, however, I will simply recreate the punchcard of the popular code forges.

Extracting the data from Git

The simplest way to get the number of commits per day in a given repo is to use git log with a custom format to print the timestamp of every commit, then discard the time portion (keeping only the date), and finally count the occurrences of each date with uniq -c:

$ git log --format="%aI" | cut -d"T" -f1 | sort | uniq -c
     15 2019-12-02
      5 2019-12-03
     18 2019-12-04

We need to make a few modifications to this command, though. First, let’s make sure we’re only pulling out our own commits by adding the --author=<pattern> filter.

Since my author name and email aren’t always consistent throughout the commit history, I like to do a sanity check on the pattern by comparing the output of these two commands:

$ git shortlog -se
$ git shortlog -se --author=<pattern>

If the pattern matches all the versions of your name and not anyone else’s, it is ready to be used with git log.

Optionally, add the --all flag to get the commits on all branches, not just the currently checked-out one.

$ git log --all --author="javiljoen" --format="%aI" \
    | cut -d"T" -f1 | sort | uniq -c
     11 2019-12-02
      3 2019-12-03
     18 2019-12-04
      4 2019-12-05

Because we’re going to plot the calendar with weeks along one axis and weekdays along the other, it is more convenient to convert the standard ISO dates into ISO week date format. For this we can use dateconv from the dateutils package.

$ git log --all --author="javiljoen" --format="%aI" \
    | dateconv -f "%G-W%V-%u" \
    | sort | uniq -c
     11 2019-W49-1
      3 2019-W49-2
     18 2019-W49-3
      4 2019-W49-4

Next, to make the output easier to parse, let’s emit the data in CSV format by replacing the uniq -c step with an equivalent awk command and inserting a header line with sed:

$ git log --all --author="javiljoen" --format="%aI" \
    | dateconv -f "%G-W%V-%u" \
    | awk '{cnt[$0]++} END {for (ln in cnt) print ln","cnt[ln]}' \
    | sort \
    | sed "1iDate,Commits"
Date,Commits
2019-W49-1,11
2019-W49-2,3
2019-W49-3,18
2019-W49-4,4
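
If dateutils isn’t installed, the dateconv/awk/sed steps can be replaced with a small Python filter. Here is a minimal sketch (the file name count_commits.py is my own) that reads the timestamps from stdin and prints the same CSV:

#!/usr/bin/env python3
# count_commits.py -- minimal alternative to the dateconv | awk | sed pipeline:
# reads "git log --format=%aI" output from stdin and prints per-day commit
# counts as CSV, with the dates in ISO week date format.
import sys
from collections import Counter
from datetime import date

counts = Counter()
for line in sys.stdin:
    if not line.strip():
        continue
    d = date.fromisoformat(line[:10])      # keep only the date part of the timestamp
    year, week, weekday = d.isocalendar()  # ISO week date components
    counts[f"{year}-W{week:02d}-{weekday}"] += 1

print("Date,Commits")
for day, n in sorted(counts.items()):
    print(f"{day},{n}")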

To run this command on a number of repos, I put it in a shell function in the following script. Note that this script is for the Fish shell; it should be straightforward to convert it to Bash/POSIX.

#!/usr/bin/fish
function count-commits
    set repo $argv[1]
    cd $repo
    git log --all --author="javiljoen" --format="%aI" \
    | dateconv -f "%G-W%V-%u" \
    | awk '{cnt[$0]++} END {for (ln in cnt) print ln","cnt[ln]}' \
    | sort \
    | sed "1iDate,Commits"
    cd -
end

set repos ~/projects/ProjectA \
          ~/projects/ProjectB

for repo in $repos
    echo Counting commits in $repo ...
    set outcsv (basename $repo).csv
    count-commits $repo > $outcsv
    echo ... saved to (realpath $outcsv)
    echo
end

Running this script produces one CSV file per repository in my working directory: here, ProjectA.csv and ProjectB.csv, depending on the contents of the $repos list.
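
For readers who don’t use fish, here is a hedged sketch of the same loop in Python, folding in the counting logic from the sketch above; the repository paths and author pattern are placeholders to be adjusted:

#!/usr/bin/env python3
# Hypothetical Python version of the fish script: write one CSV per repo.
import subprocess
from collections import Counter
from datetime import date
from pathlib import Path

REPOS = [Path.home() / "projects/ProjectA", Path.home() / "projects/ProjectB"]
AUTHOR = "javiljoen"

for repo in REPOS:
    # git -C <repo> runs the command inside the repo without cd'ing into it.
    log = subprocess.run(
        ["git", "-C", str(repo), "log", "--all", f"--author={AUTHOR}", "--format=%aI"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter()
    for line in log.splitlines():
        year, week, weekday = date.fromisoformat(line[:10]).isocalendar()
        counts[f"{year}-W{week:02d}-{weekday}"] += 1
    rows = "".join(f"{d},{n}\n" for d, n in sorted(counts.items()))
    outcsv = Path(f"{repo.name}.csv")
    outcsv.write_text("Date,Commits\n" + rows)
    print(f"{repo}: saved to {outcsv.resolve()}")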

On to the visualisation!

Reading the data into Python

Let’s read the data into a single dataframe using Pandas. To retain the repo name in the index, we can use pd.concat(dfs, keys=repos):

from pathlib import Path
import pandas as pd

csvs = sorted(Path().glob("*.csv"))
data = pd.concat(
    (pd.read_csv(f, index_col="Date") for f in csvs),
    keys=(f.stem for f in csvs),
    names=["Project", "Date"],
)

[The full script is given in the Appendix.]

The dataframe data now looks something like this, with “Project” and “Date” as the index:

data.groupby("Project").tail(3)
                     Commits
Project  Date
ProjectA 2020-W41-3        6
         2020-W41-4        1
         2020-W42-3        6
ProjectB 2020-W40-1        2
         2020-W42-3        1
         2020-W49-4        1

To turn this single “Commits” column into a 2D week-by-weekday grid for plotting, we need to do a few transformations:

  • Split the “Date” values into week number and weekday number (day of the week).
  • Add up the commits for the various projects that fall on the same day.
  • Insert the commits into the right cells of a matrix of week vs. day of the week.
  • Fill in zeroes for the remaining cells, i.e. the days on which there were no commits in our git log.

The first step can be done with DataFrame.assign():

dates = data.index.to_frame()["Date"]
data = data.assign(
    Week=dates.str[:8],
    Day=dates.str[-1].astype(int),
)

This adds two new columns to the dataframe:

data.head(6)
                     Commits      Week  Day
Project  Date
ProjectA 2019-W06-1        1  2019-W06    1
         2019-W06-3        6  2019-W06    3
         2019-W06-4        6  2019-W06    4
         2019-W06-5        8  2019-W06    5
         2019-W06-7        6  2019-W06    7
         2019-W07-1        5  2019-W07    1

The next three steps can be done in a single call to DataFrame.pivot_table():

grid = data.pivot_table(
    values="Commits", index="Week", columns="Day",
    aggfunc=sum,
    fill_value=0,
)

which transforms the commits column into a matrix like this:

grid.head(2)
Day        1   2   3   4   5   6   7
Week
2019-W06   1   0   6   6   8   0   6
2019-W07   5   8   4   3  11   3   0

Note, however, that this matrix only contains weeks for which we have data. To insert rows filled with zeroes for weeks in which there weren’t any commits — and also to remove weeks outside our range of interest — we can reindex this dataframe with the range of weeks we want to plot:

jan_to_sep = [f"2019-W{i:02d}" for i in range(1, 40)]
to_plot = grid.reindex(index=jan_to_sep, fill_value=0)

We now have rows of zeroes for the weeks without commits, and the matrix is ready for plotting:

to_plot
Day        1   2   3   4   5   6   7
Week
2019-W01   0   0   0   0   0   0   0
2019-W02   0   0   0   0   0   0   0
2019-W03   0   0   0   0   0   0   0
2019-W04   0   0   0   0   0   0   0
2019-W05   0   0   0   0   0   0   0
2019-W06   1   0   6   6   8   0   6
2019-W07   5   8   4   3  11   3   0
[…]
2019-W35   0   0   0   0   0   0   0
2019-W36   0   0  11   0   1   0   2
2019-W37   0   0   0   0   0   0   0
2019-W38   0   2   0   1   0   0   0
2019-W39   0   0   0   0   0   0   0
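
As an aside, instead of hard-coding the list of weeks, we could derive the full range from the data itself. A sketch, reusing the grid dataframe from above (the week_range helper is my own):

from datetime import datetime, timedelta

def week_range(first, last):
    """Yield ISO week labels ("YYYY-Www") from first to last, inclusive."""
    current = datetime.strptime(f"{first}-1", "%G-W%V-%u")  # Monday of the first week
    end = datetime.strptime(f"{last}-1", "%G-W%V-%u")
    while current <= end:
        year, week, _ = current.isocalendar()
        yield f"{year}-W{week:02d}"
        current += timedelta(weeks=1)

all_weeks = list(week_range(grid.index.min(), grid.index.max()))
to_plot = grid.reindex(index=all_weeks, fill_value=0)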

Plotting the punchcard with Matplotlib

We can use Axes.imshow() to plot the grid as a heatmap. But to make sure that the cells are correctly aligned and that the tick labels on the x axis make sense, we need to pass in an extent array listing the first and last week (x extent), and the first and last day of the week, offset by 0.5 (y extent).

One complication is that the values for the x extent need to be floats representing Matplotlib dates. So we have to:

  1. turn the week number into a full date by appending a “null” day of 1, i.e. Monday;
  2. parse the date string into a datetime object with datetime.strptime();
  3. convert the datetime into a timestamp with matplotlib.dates.date2num().

from datetime import datetime
import matplotlib.dates as mdates

def week2num(w):
    dt = datetime.strptime(f"{w}-1", "%G-W%V-%u")
    return mdates.date2num(dt)

weeks = to_plot.index.values

x_min = week2num(weeks[0])
x_max = week2num(weeks[-1])

y_min, y_max = -0.5, 6.5
extent = [x_min, x_max, y_min, y_max]

The data can then be plotted onto this extent:

import matplotlib.pyplot as plt
plt.style.use("seaborn-dark")

# This aspect ratio produces more-or-less square cells:
figsize = len(weeks) / 7 * 1, 1.3

commits = to_plot.values.T

fig, ax = plt.subplots(figsize=figsize)
im = ax.imshow(
    commits,
    extent=extent, aspect="auto",
    cmap="Blues",
)

To get readable tick labels on the x axis, we set the following:

locator = mdates.AutoDateLocator()
formatter = mdates.ConciseDateFormatter(locator, show_offset=False)

ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(formatter)

And to label the days of the week in the correct order on the y axis (tweaking the offset values until it looks right):

days = to_plot.columns.values
ax.set_yticks(days[::-1] - 0.55)
ax.set_yticklabels("MTWTFSS", va="top", ha="center", x=-0.01)

Let’s also add the colour scale, so that we can see the number of commits represented by each hue:

fig.colorbar(im)

[Figure: Commit frequency punchcard (continuous scale)]

Note that this uses a continuous colour scale, with a unique hue for each value. Although the number of commits per day is represented faithfully, it does make it hard to read the corresponding value off the scale bar.

We can instead group the values into bins, which can be represented with a discrete colour scale with a smaller number of distinct hues. This makes the scale bar more useful, at the cost of reducing the colour resolution of the heatmap.

We could divide the range of values into bins of equal sizes, by specifying the bin boundaries with a linear series, such as [0, 5, 10, 15, 20]. But since there is a bigger qualitative difference between coding sessions that result in 1 vs. 2 vs. 5 commits than between those that produce 21 vs. 22 vs. 25 commits, I think it makes more sense to use a geometric series, resulting in bins of successively increasing sizes, such as [0, 1, 2, 4, 9, 18], with higher resolution at the lower end of the range.
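
For intuition, here is how the two binning strategies compare, using a hypothetical maximum of 20 commits per day (not taken from the real data):

import numpy as np

max_commits = 20  # hypothetical value, for illustration only

linear_bounds = np.linspace(0, max_commits, 5)          # equal-width bins
geometric_bounds = np.geomspace(1, max_commits + 1, 5)  # bins widen towards the top

print(linear_bounds)     # [ 0.  5. 10. 15. 20.]
print(geometric_bounds)  # approximately [1.0, 2.14, 4.58, 9.81, 21.0]

Applied to the actual commit counts, the binned heatmap can be produced as follows: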

import numpy as np
import matplotlib.colors as mcolors

n_bins = 5

bounds = np.zeros(n_bins + 1, dtype=int)
bounds[1:] = np.geomspace(1, commits.max() + 1, n_bins, dtype=int)

fig, ax = plt.subplots(figsize=figsize)
im = ax.imshow(
    commits,
    extent=extent, aspect="auto",
    cmap=plt.get_cmap("Blues", n_bins),
    norm=mcolors.BoundaryNorm(bounds, n_bins),
)

Then, after applying the same axis formatting as before, we get this:

[Figure: Commit frequency punchcard (discrete scale)]

Appendix: Plotting script

Here are the above commands for reading and plotting the commit data, consolidated into a single script:

#!/usr/bin/env python3
from datetime import datetime
from pathlib import Path

import matplotlib.colors as mcolors
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

csvs = sorted(Path().glob("*.csv"))
data = pd.concat(
    (pd.read_csv(f, index_col="Date") for f in csvs),
    keys=(f.stem for f in csvs),
    names=["Project", "Date"],
)
dates = data.index.to_frame()["Date"]
data = data.assign(
    Week=dates.str[:8],
    Day=dates.str[-1].astype(int),
)
grid = data.pivot_table(
    values="Commits", index="Week", columns="Day",
    aggfunc=sum,
    fill_value=0,
)
jan_to_sep = [f"2019-W{i:02d}" for i in range(1, 40)]
to_plot = grid.reindex(index=jan_to_sep, fill_value=0)
weeks   = to_plot.index.values
days    = to_plot.columns.values
commits = to_plot.values.T

def week2num(w):
    dt = datetime.strptime(f"{w}-1", "%G-W%V-%u")
    return mdates.date2num(dt)

x_min = week2num(weeks[0])
x_max = week2num(weeks[-1])
y_min, y_max = -0.5, 6.5
extent = [x_min, x_max, y_min, y_max]

n_bins = 5
bounds = np.zeros(n_bins + 1, dtype=int)
bounds[1:] = np.geomspace(1, commits.max() + 1, n_bins, dtype=int)

plt.style.use("seaborn-dark")
figsize = len(weeks) / 7 * 1, 1.3
fig, ax = plt.subplots(figsize=figsize)
im = ax.imshow(
    commits,
    extent=extent, aspect="auto",
    cmap=plt.get_cmap("Blues", n_bins),
    norm=mcolors.BoundaryNorm(bounds, n_bins),
)

locator = mdates.AutoDateLocator()
formatter = mdates.ConciseDateFormatter(locator, show_offset=False)
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(formatter)

ax.set_yticks(days[::-1] - 0.55)
ax.set_yticklabels("MTWTFSS", va="top", ha="center", x=-0.01)

fig.colorbar(im)

fig.tight_layout()
fig.savefig("punchcard.svg", transparent=True)