Binary data serialisation vs. JSON in Python
Is it more efficient to load data serialised in binary format or JSON in Python?
Formats (modules in the stdlib) that Jason C. McDonald recommends against in *Dead Simple Python*, ch. 12, because they allow arbitrary code execution and because they are inefficient:
- pickle
- marshal
- shelve
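A minimal illustration of the arbitrary-code-execution problem: pickle reconstructs objects by calling whatever `__reduce__` returns, so a malicious payload can smuggle in any callable. The example below uses a harmless `str.upper` as the callable, but it could just as easily be `os.system`.

```python
import pickle

class Payload:
    # __reduce__ tells pickle how to rebuild the object: a callable plus
    # arguments, which pickle invokes at *load* time
    def __reduce__(self):
        return (str.upper, ("this ran during unpickling",))

blob = pickle.dumps(Payload())
pickle.loads(blob)  # → 'THIS RAN DURING UNPICKLING'
```

This is why pickle (and marshal/shelve, which share the weakness) should never be used on data from untrusted sources.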
The other format available in the standard library is the Apple plist format, which comes in both XML and binary versions.
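A quick sketch of the two plist flavours, selected via the `fmt` argument to `plistlib.dumps` (the sample dict here is made up):

```python
import plistlib

data = {"title": "Intro to C++", "tags": ["cpp", "notes"]}  # sample metadata

# plistlib supports both flavours via the fmt argument
xml_bytes = plistlib.dumps(data, fmt=plistlib.FMT_XML)
bin_bytes = plistlib.dumps(data, fmt=plistlib.FMT_BINARY)

# both round-trip to the same dict
assert plistlib.loads(xml_bytes) == plistlib.loads(bin_bytes) == data
```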
Writing the metadata for an arbitrary note to JSON, plist, and pickle, then comparing the performance of loading each:
```python
import json
import pickle
import plistlib
import time

def loadj(fn):
    with open(fn) as f:
        return json.load(f)

def loadp(fn):
    with open(fn, "rb") as f:
        return plistlib.load(f)

def loadk(fn):
    with open(fn, "rb") as f:
        return pickle.load(f)

def rec(func, n=1000):
    # time n repeated calls; perf_counter is better suited to
    # benchmarking than time.time
    start = time.perf_counter()
    for _ in range(n):
        func()
    return time.perf_counter() - start

base = ".metadata/intro-to-cpp"
fj = base + ".json"
fp = base + ".plist"
fk = base + ".pkl"

# all three files should deserialise to the same dict
assert loadp(fp) == loadj(fj)
assert loadk(fk) == loadj(fj)

print(rec(lambda: loadj(fj), 10_000))
print(rec(lambda: loadp(fp), 10_000))
print(rec(lambda: loadk(fk), 10_000))
```
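For reference, the three files above could have been produced along these lines (the metadata dict and the output directory here are hypothetical; the real files live under `.metadata/`):

```python
import json
import os
import pickle
import plistlib
import tempfile

# hypothetical note metadata
meta = {"title": "Intro to C++", "tags": ["cpp"]}

outdir = tempfile.mkdtemp()
paths = {
    "json": os.path.join(outdir, "intro-to-cpp.json"),
    "plist": os.path.join(outdir, "intro-to-cpp.plist"),
    "pkl": os.path.join(outdir, "intro-to-cpp.pkl"),
}

with open(paths["json"], "w") as f:
    json.dump(meta, f)
with open(paths["plist"], "wb") as f:
    plistlib.dump(meta, f, fmt=plistlib.FMT_BINARY)  # FMT_XML for the XML flavour
with open(paths["pkl"], "wb") as f:
    pickle.dump(meta, f)
```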
Results (10,000 loads each):

| Format       | Time (s) |
|--------------|----------|
| pickle       | 0.050    |
| JSON         | 0.085    |
| plist (bin)  | 0.150    |
| plist (xml)  | 0.260    |
My actual motivation was to decide whether it's worth caching the metadata when converting this notebook to HTML, instead of parsing each note's markdown header block multiple times. It turns out that loading the metadata from the markdown headers with my custom function is slightly faster than loading it from a cache of JSON files (0.12 ms vs. 0.14 ms for all 18 existing pages).
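The custom parser isn't shown here, but the idea can be sketched roughly as follows; the `---`-delimited `key: value` header format is an assumption, and the real notes may use a different layout:

```python
def parse_header(text):
    """Parse a ----delimited key: value header block at the top of a note."""
    meta = {}
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return meta
    for line in lines[1:]:
        if line.strip() == "---":  # end of the header block
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta

sample = "---\ntitle: Intro to C++\ndate: 2023-01-01\n---\n# Notes\n"
parse_header(sample)  # → {'title': 'Intro to C++', 'date': '2023-01-01'}
```

Since a parse like this is already a single pass over a few lines of text, it's unsurprising that a JSON cache adds little.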