Binary data serialisation vs. JSON in Python
Is it more efficient to load data serialised in binary format or JSON in Python?
Formats (modules in the stdlib) that Jason C. McDonald recommends against in *Dead Simple Python*, ch. 12, because they allow arbitrary code execution and because they are inefficient:
- pickle
- marshal
- shelve
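A minimal illustration of the arbitrary-code-execution problem: pickle reconstructs objects by calling whatever `__reduce__` returns, so a malicious payload can smuggle in any callable. The example below uses a harmless `str.upper` as the callable, but it could just as easily be `os.system`.

```python
import pickle

class Payload:
    # __reduce__ tells pickle how to rebuild the object: a callable plus
    # arguments, which pickle invokes at *load* time
    def __reduce__(self):
        return (str.upper, ("this ran during unpickling",))

blob = pickle.dumps(Payload())
pickle.loads(blob)  # → 'THIS RAN DURING UNPICKLING'
```

This is why pickle (and marshal/shelve, which share the weakness) should never be used on data from untrusted sources.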
The other format available in the standard library is the Apple plist format, which comes in both XML and binary versions.
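A quick sketch of the two plist flavours, selected via the `fmt` argument to `plistlib.dumps` (the sample dict here is made up):

```python
import plistlib

data = {"title": "Intro to C++", "tags": ["cpp", "notes"]}  # sample metadata

# plistlib supports both flavours via the fmt argument
xml_bytes = plistlib.dumps(data, fmt=plistlib.FMT_XML)
bin_bytes = plistlib.dumps(data, fmt=plistlib.FMT_BINARY)

# both round-trip to the same dict
assert plistlib.loads(xml_bytes) == plistlib.loads(bin_bytes) == data
```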
Writing the metadata for an arbitrary note to JSON, plist, and pickle, then comparing the performance of loading each:
```python
import json
import pickle
import plistlib
import time

def loadj(fn):
    with open(fn) as f:
        return json.load(f)

def loadp(fn):
    with open(fn, "rb") as f:
        return plistlib.load(f)

def loadk(fn):
    with open(fn, "rb") as f:
        return pickle.load(f)

def rec(func, n=1000):
    # time n repeated calls; perf_counter is better suited to
    # benchmarking than time.time
    start = time.perf_counter()
    for _ in range(n):
        func()
    return time.perf_counter() - start

base = ".metadata/intro-to-cpp"
fj = base + ".json"
fp = base + ".plist"
fk = base + ".pkl"

# all three files should deserialise to the same dict
assert loadp(fp) == loadj(fj)
assert loadk(fk) == loadj(fj)

print(rec(lambda: loadj(fj), 10_000))
print(rec(lambda: loadp(fp), 10_000))
print(rec(lambda: loadk(fk), 10_000))
```
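For reference, the three files above could have been produced along these lines (the metadata dict and the output directory here are hypothetical; the real files live under `.metadata/`):

```python
import json
import os
import pickle
import plistlib
import tempfile

# hypothetical note metadata
meta = {"title": "Intro to C++", "tags": ["cpp"]}

outdir = tempfile.mkdtemp()
paths = {
    "json": os.path.join(outdir, "intro-to-cpp.json"),
    "plist": os.path.join(outdir, "intro-to-cpp.plist"),
    "pkl": os.path.join(outdir, "intro-to-cpp.pkl"),
}

with open(paths["json"], "w") as f:
    json.dump(meta, f)
with open(paths["plist"], "wb") as f:
    plistlib.dump(meta, f, fmt=plistlib.FMT_BINARY)  # FMT_XML for the XML flavour
with open(paths["pkl"], "wb") as f:
    pickle.dump(meta, f)
```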
Results (10,000 loads each):

| Format       | Time (s) |
|--------------|----------|
| pickle       | 0.050    |
| JSON         | 0.085    |
| plist (bin)  | 0.150    |
| plist (xml)  | 0.260    |
My actual motivation was to decide whether it's worth caching the metadata when converting this notebook to HTML, instead of parsing each note's markdown header block multiple times. It turns out that loading the metadata from the markdown headers with my custom function is slightly faster than loading it from a cache of JSON files (0.12 ms vs. 0.14 ms for all 18 existing pages).
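The custom parser isn't shown here, but the idea can be sketched roughly as follows; the `---`-delimited `key: value` header format is an assumption, and the real notes may use a different layout:

```python
def parse_header(text):
    """Parse a ----delimited key: value header block at the top of a note."""
    meta = {}
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return meta
    for line in lines[1:]:
        if line.strip() == "---":  # end of the header block
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta

sample = "---\ntitle: Intro to C++\ndate: 2023-01-01\n---\n# Notes\n"
parse_header(sample)  # → {'title': 'Intro to C++', 'date': '2023-01-01'}
```

Since a parse like this is already a single pass over a few lines of text, it's unsurprising that a JSON cache adds little.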