Reproduction artifact for "Practical Limits of Lossless Compression for bf16 Transformer LLM Weights"
CPU-only benchmark: the smoke test passes 3/3 in seconds, and the full benchmark reproduces in ~6.5 h on a c3-highcpu-88 instance. The artifact includes the evaluation scripts, manifests, trained-codec artifacts, smoke-test fixtures, and the figure-generation pipeline used for the results reported in the paper.
Paper: paper.md. Code: github.com/NimoRotem/llm-compression-limits (Apache-2.0 for code; CC-BY-4.0 for data, results, profiles, figures). Reproduction notes: NOTES.md. Software Heritage: swh:1:rel:08d597be136278838e8cc2fef2a68303d990208d.
Headline empirical findings
| Domain | Best lossless ratio | Atlas-derived ceiling |
| --- | --- | --- |
| bf16 (290-tensor test corpus) | 1.499× byte-weighted (trained OpenZL le-u16, 617 B profile; 1.4986 geomean per-tensor) | 1.495× byte-weighted (R_marginal; median per-tensor 1.498) |
| Q4_K tensor stream (530-tensor test corpus) | 1.052× byte-weighted (qb_mixture_k4; 1.0517 geomean) | 1.076× |
| GGUF whole file (3 held-out models) | 1.0422× geomean (decomp_perstream_zstd19_bg); deployable artifact-level 1.041–1.045× | — |
| Cross-layer bf16 redundancy (250 pairs, Qwen2.5 family) | net null (median Pearson +0.0004) | — |
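The table mixes two conventions: a byte-weighted ratio divides total raw bytes by total compressed bytes, so large tensors dominate, while the per-tensor geomean weights every tensor equally. A minimal sketch of both, with illustrative function and variable names that are not taken from the harness:

```python
import math

def byte_weighted_ratio(raw_sizes, compressed_sizes):
    """Total raw bytes over total compressed bytes (large tensors dominate)."""
    return sum(raw_sizes) / sum(compressed_sizes)

def geomean_ratio(raw_sizes, compressed_sizes):
    """Geometric mean of per-tensor ratios (each tensor counts equally)."""
    logs = [math.log(r / c) for r, c in zip(raw_sizes, compressed_sizes)]
    return math.exp(sum(logs) / len(logs))

# Two tensors with different sizes: the two conventions disagree.
print(byte_weighted_ratio([1000, 10], [660, 7]))  # ≈1.514
print(geomean_ratio([1000, 10], [660, 7]))        # ≈1.471
```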
Browse
- paper.md — the paper
- README.md — repo entry point: layout, citation, reproduce recipe
- NOTES.md — what is and is not in this artifact set
- code/ — Apache-2.0 core harness sources (browse via GitHub; per-file URLs at /code/<file> on this mirror, e.g. code/atlas.py)
- methods/bf16-split/ — bf16 byte-split predictor + mixture-CDF (a byte-split sketch follows this list)
- methods/qk-mixture-cdf/ — Q4_K mixture-CDF coder + 277 B qb_k4.profile
- reproductions/dfloat11/ — pointer to the official DFloat11 kernels (not measured here)
- reproductions/unweight/ — pointer to the official Cloudflare Unweight kernels (not measured here)
- reproductions/zipnn/ — pointer to the official ZipNN repository (not measured here)
- results/results.jsonl.zst — 7,960 rows / 11,942 verified method-evals (a reader sketch follows this list)
- manifest/model-manifest.json — 13 source models actually measured
- manifest/expected-checksums.json — three measured prefixes; smoke test passes 3/3
- manifest/dependencies.txt — pinned versions + official reproduction URLs
- prompts/inference-sanity-prompts.json — 8 prompts; gpt2 verified 8/8 (sanity_check_gpt2.jsonl)
- profiles/bf16.zl — 617 B trained OpenZL le-u16 bf16 profile (fp16 profile not vendored; see the lead of paper §6)
- profiles/bench_bf16zl.jsonl — 290-tensor benchmark, 1.4986 geomean
- fixtures/ — TinyLlama bf16 (8 MiB) + Llama-3.2-3B Q4_K (13.5 MiB) + Qwen2.5-0.5B GGUF (379 MiB)
- figures/ — 5 figures × {pdf, png}
- notebooks/figures.ipynb — populated; render_figures.py CLI alongside
- notebooks/lomo-tables.ipynb — LOMO aggregation cells
- cross-layer-pairs.csv — 250 (model, role, layer-pair) rows from the cross-layer correlation atlas (paper §8)
- provenance/iter4/FINAL_VERDICT.md — bf16 supplementary stage verdict
- provenance/iter5/FINAL_VERDICT.md — Q4_K + cross-layer stage verdict
- provenance/iter6/FINAL_VERDICT.md — GGUF artifact stage verdict
- smoke-test/ — fixtures + checksum gate; ./reproducibility_smoke_test.sh exits 0
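To inspect results/results.jsonl.zst without unpacking it, here is a minimal reader using the zstandard package; the per-row schema (field names such as method or ratio) is an assumption, not something documented on this page:

```python
import io
import json

import zstandard  # pip install zstandard

rows = []
dctx = zstandard.ZstdDecompressor()
with open("results/results.jsonl.zst", "rb") as fh, dctx.stream_reader(fh) as reader:
    # Decompress on the fly and parse one JSON object per line.
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        rows.append(json.loads(line))

print(len(rows))  # expected: 7,960 rows
```

The byte-split in methods/bf16-split/ rests on a standard observation: in little-endian bf16, the high byte of each value carries the sign and the top exponent bits (highly skewed, very compressible), while the low byte is mantissa-dominated (near-uniform), so splitting the stream lets each half be modeled separately. A minimal sketch of the split and its exact inverse, assuming NumPy; this is not the trained predictor or the mixture-CDF coder itself:

```python
import numpy as np

def split_bf16_bytes(buf: bytes) -> tuple[bytes, bytes]:
    """Split little-endian bf16 bytes into (low-byte, high-byte) streams."""
    pairs = np.frombuffer(buf, dtype=np.uint8).reshape(-1, 2)
    lo = pairs[:, 0].tobytes()  # low byte: exponent LSB + mantissa, near-uniform
    hi = pairs[:, 1].tobytes()  # high byte: sign + top exponent bits, highly skewed
    return lo, hi

def merge_bf16_bytes(lo: bytes, hi: bytes) -> bytes:
    """Exact inverse: re-interleave the two streams, byte for byte."""
    pairs = np.empty((len(lo), 2), dtype=np.uint8)
    pairs[:, 0] = np.frombuffer(lo, dtype=np.uint8)
    pairs[:, 1] = np.frombuffer(hi, dtype=np.uint8)
    return pairs.tobytes()
```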
How to reproduce
```sh
git clone https://github.com/NimoRotem/llm-compression-limits.git
cd llm-compression-limits
git checkout v1.0.0
./reproducibility_smoke_test.sh   # 3/3 sha256[:16] checks pass
```

Expected output (sha256[:16] of the compressed bytes):

```
bf16_split(TinyLlama_layer3_q_proj.bin)              → fc3544c35489eb94
qb_mixture_k4(Llama32_3B_layer14_gate.q4k.bin)       → c79972a64d11b717
decomp_perstream_zstd19_bg(Qwen2.5-0.5B-Q4_K_M.gguf) → 1c6f077f8c228cf2
```
The script auto-fetches the 379 MiB GGUF from this mirror if absent. End-to-end roundtrip is byte-exact for all three methods.
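To re-verify a produced artifact against manifest/expected-checksums.json by hand, the gate is just a sha256 prefix comparison. A minimal sketch, assuming the manifest maps output filenames to 16-hex-character prefixes; the authoritative schema is whatever reproducibility_smoke_test.sh actually reads:

```python
import hashlib
import json

def sha256_prefix(path: str, n: int = 16) -> str:
    """sha256[:n] of a file's bytes, streamed in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:n]

# Hypothetical manifest layout: {"<output file>": "<16-hex prefix>"}
with open("manifest/expected-checksums.json") as fh:
    expected = json.load(fh)
for name, prefix in expected.items():
    status = "OK" if sha256_prefix(name) == prefix else "MISMATCH"
    print(f"{name}: {status}")
```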
License
Code: Apache-2.0 (LICENSE). Data / results / profiles / figures: CC-BY-4.0 (LICENSE-DATA).