knowva.ai · 2026
Draft v4.0.6 (final). Citation target: v4.0.3 (immutable, SWHID-archived). Primary venue: DCC 2027.
Abstract. We measure the lossless compressibility of bf16 transformer LLM weights and of Q4_K-typed tensors in GGUF Q4_K_M files under a strict fairness contract that counts profile bytes against trained methods on every file, verifies byte-exact roundtrips on every run, and uses a model-level train/test split. Across 12 source models and 11,942 verified roundtrip method-evaluations over 7,960 unique benchmark rows (zero roundtrip failures), we establish three results.
First, the measured practical ceiling for bf16 weights under iid marginal-byte coding is ≈1.495× byte-weighted (model-level 95% CI [1.487, 1.502]), derivable from a one-line Shannon proposition at the measured byte entropies (median H_high = 2.74 bits, H_low = 7.97 bits) and consistent with α-stable theory of SGD-trained networks. Our bf16_split method reaches 1.488×; the official OpenZL le-u16 result marginally exceeds the bound (1.499×) by using context coding within byte planes, a regime the proposition does not cover. The result was validated on Qwen2.5-7B-Instruct, whose byte-weighted ratio is within +0.0001 of the small-model corpus.
Second, the measured practical ceiling for Q4_K-typed tensors in GGUF Q4_K_M files is ≈1.076× at tensor-stream level (best method 1.052×). At the full GGUF artifact level the achievable ratio drops to ≈1.041–1.045× (ceiling ≈1.05×), because Q4_K_M files also contain Q6_K tensors that compress at only ~1.01×. Mechanism: optimized per-block scaling produces near-uniform nibble distributions (median entropy 3.86 of a possible 4.0 bits/symbol).
Third, we find no evidence of usable linear redundancy between adjacent same-role transformer-layer weight matrices at bf16 precision (median Pearson +0.0004 across 250 pairs in two source models from the Qwen2.5 family), ruling out simple scalar-affine cross-layer compression but not nonlinear schemes or cross-checkpoint deltas.
This paper does not propose a new compressor. The two methods we built (bf16_split and a Q4_K block mixture-CDF coder) are used only to test whether the measured entropy ceilings are practically reachable; both cluster at the ceiling rather than beating it. The methodology contribution is the fairness contract under which prior methods can be compared directly.
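A back-of-the-envelope check of the Prop. 1 number: under iid marginal-byte coding each bf16 weight costs at least H_high + H_low bits, so the ceiling is 16 bits divided by that sum. The sketch below is illustrative and is not the benchmark harness; the helper names and file handling are assumptions, and only the quoted median entropies come from the results above.

```python
# Illustrative sketch (not the benchmark harness): estimate the Prop. 1
# byte-marginal ceiling from a raw little-endian bf16 tensor dump.
import numpy as np

def marginal_entropy_bits(symbols: np.ndarray, alphabet: int = 256) -> float:
    """Empirical Shannon entropy (bits/symbol) under an iid marginal model."""
    counts = np.bincount(symbols, minlength=alphabet).astype(np.float64)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def bf16_byte_marginal_ceiling(raw: bytes) -> float:
    """Prop. 1: 16 bits per weight over the summed byte-plane entropies."""
    b = np.frombuffer(raw, dtype=np.uint8)
    lo, hi = b[0::2], b[1::2]   # little-endian: low byte (mostly mantissa), high byte (sign/exponent)
    return 16.0 / (marginal_entropy_bits(lo) + marginal_entropy_bits(hi))

# With the median entropies reported above: 16 / (7.97 + 2.74) ≈ 1.49x,
# consistent with the byte-weighted 1.495x headline and its CI.
# (The same estimator applied to Q4_K nibbles, alphabet=16, is what yields the
# near-uniform ~3.86-of-4.0 bits/symbol figure behind the GGUF results.)
print(round(16.0 / (7.97 + 2.74), 3))
```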
| Domain | Best lossless ratio | Measured ceiling | Sample |
|---|---|---|---|
| bf16 weights (Prop. 1 byte-marginal) | 1.488–1.499× | 1.495× | 290 test tensors, ≤7B |
| Q4_K tensor stream | 1.052× | 1.076× | 530 Q4_K-typed test tensors |
| GGUF Q4_K_M whole file | 1.041–1.045× | ≈1.05× | 5 held-out GGUF files |
| Adjacent-layer linear residuals (bf16) | net loss | median Pearson +0.0004 | 250 layer pairs, 2 Qwen2.5 models |
Headline numbers are byte-weighted corpus ratios. Per-tensor distributions and 95% bootstrap CIs in paper §5–§7.
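The last table row can be probed in the same spirit: flatten two adjacent same-role weight matrices, take their Pearson correlation, and check whether the best scalar-affine fit shrinks the residual variance. A minimal sketch, assuming the paired tensors are already loaded as equal-shape arrays; tensor loading and the 250-pair enumeration are omitted.

```python
# Minimal sketch of the adjacent-layer linear-redundancy check (Result 3).
# Assumes w_a, w_b are same-role weight matrices from adjacent layers,
# already loaded (e.g. via safetensors) and of equal shape.
import numpy as np

def scalar_affine_redundancy(w_a: np.ndarray, w_b: np.ndarray) -> dict:
    x = w_a.ravel().astype(np.float64)
    y = w_b.ravel().astype(np.float64)
    r = float(np.corrcoef(x, y)[0, 1])        # Pearson correlation
    a, b = np.polyfit(x, y, 1)                # best scalar-affine fit y ≈ a*x + b
    resid = y - (a * x + b)
    return {
        "pearson": r,
        # Variance ratio of the affine residual vs. predicting the mean alone.
        # Values ≈ 1.0 mean the affine predictor buys essentially nothing,
        # matching the median Pearson of +0.0004 reported above.
        "residual_var_ratio": float(resid.var() / y.var()),
    }
```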
The smoke test runs three lossless compressors against bundled fixtures and verifies that the resulting sha256[:16] prefixes match the values measured at v4.0.3 release time. Every method asserts a byte-exact roundtrip before the hash is taken. Expected to pass 3/3:
    git clone https://github.com/NimoRotem/llm-compression-limits.git
    cd llm-compression-limits
    git checkout v4.0.3
    ./reproducibility_smoke_test.sh
| Method | Fixture | Ratio | sha256[:16] |
|---|---|---|---|
| bf16_split | TinyLlama_layer3_q_proj.bin (8 MiB) | 1.4820× | fc3544c35489eb94 |
| qb_mixture_k4 + 277 B profile | Llama32_3B_layer14_gate.q4k.bin (13.5 MiB) | 1.0507× | c79972a64d11b717 |
| decomp_perstream_zstd19_bg | Qwen2.5-0.5B-Q4_K_M.gguf (379 MiB) | 1.0367× | 1c6f077f8c228cf2 |
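Conceptually, the contract the smoke test enforces looks like the loop below: each method must reproduce the fixture byte-for-byte before its output is hashed, and any trained profile is charged against the compressed size (the 277 B above). This is a schematic re-expression, not the shell script itself; the `compress`/`decompress` callables and the checksum-file handling are assumptions.

```python
# Schematic of the smoke-test contract (not the actual shell script).
# `compress`/`decompress` stand in for each benchmarked method; the real
# harness and the expected-checksums.json layout may differ in detail.
import hashlib, json

def verify_fixture(fixture_path, expected_prefix, compress, decompress, profile_bytes=0):
    original = open(fixture_path, "rb").read()
    packed = compress(original)   # assumed deterministic, as the hash check requires

    # Fairness contract: byte-exact roundtrip asserted on every run,
    # and any trained profile counted against the compressed size.
    assert decompress(packed) == original, "roundtrip failed"
    ratio = len(original) / (len(packed) + profile_bytes)

    digest = hashlib.sha256(packed).hexdigest()[:16]
    assert digest == expected_prefix, f"checksum drift: {digest} != {expected_prefix}"
    return ratio

# expected = json.load(open("expected-checksums.json"))  # per-fixture sha256[:16] prefixes
```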
| Code (Apache-2.0) | github.com/NimoRotem/llm-compression-limits |
| Tagged release | v4.0.3 (immutable; release ZIP byte-identical to mirror) |
| SWHID (persistent identifier) | swh:1:rel:55d910f5af170c22719cc9346f4d8a5029f09164 |
| Artifact mirror | knowva.ai/CompressionV4 |
| Results table | results.jsonl.zst — 7,960 rows / 11,942 verified evals |
| Trained profile | bf16.zl (617 B, OpenZL le-u16, regenerated v4.0.3) |
| Fixtures | TinyLlama bf16 · Llama-3.2-3B Q4_K · Qwen2.5-0.5B GGUF |
| Smoke checksums | expected-checksums.json |
| Cross-layer pairs | cross-layer-pairs.csv — 250 rows |
| Provenance | per-stage benchmark verdicts: bf16 supplementary · Q4_K + cross-layer · GGUF artifact |
| Drift catalog | DEVIATIONS.md |
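The results table decompresses to one JSON object per benchmark row. A minimal reader sketch, assuming the `zstandard` Python package; field names should be taken from the file itself rather than from this snippet.

```python
# Minimal reader for results.jsonl.zst (one JSON object per benchmark row).
import io, json
import zstandard as zstd

def iter_results(path="results.jsonl.zst"):
    with open(path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            if line.strip():
                yield json.loads(line)

rows = list(iter_results())
print(len(rows))  # expected: 7,960 rows at v4.0.3
```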
@misc{rotem2026compression,
title = {Practical Limits of Lossless Compression for bf16 Transformer LLM Weights},
author = {Rotem, Nimo},
year = {2026},
url = {https://github.com/NimoRotem/llm-compression-limits/releases/tag/v4.0.3},
howpublished = {GitHub release v4.0.3},
swhid = {swh:1:rel:55d910f5af170c22719cc9346f4d8a5029f09164},
note = {Reproduction artifact v4.0.3 (canonical citation target; head v4.0.6).}
}
License: Apache-2.0 (code) and CC-BY-4.0 (data, profiles, figures). Smoke test passes 3/3. No Zenodo DOI was minted; the GitHub release tag and the Software Heritage SWHID together provide a verifiable archive and a citation-grade persistent identifier.