Practical Limits of Lossless Compression for bf16 Transformer LLM Weights With Companion Measurements on fp16 and on Q4_K-typed Tensors in GGUF Q4_K_M Files, Under Strict Profile-Byte Accounting

Nimo Rotem

knowva.ai · 2026

Draft v4.0.6 (final). Citation target: v4.0.3 (immutable, SWHID-archived). Primary venue: DCC 2027.

Abstract. We measure the lossless compressibility of bf16 transformer LLM weights and of Q4_K-typed tensors in GGUF Q4_K_M files under a strict fairness contract that counts profile bytes against trained methods on every file, verifies byte-exact roundtrips on every run, and uses a model-level train/test split. Across 12 source models and 11,942 verified roundtrip method-evaluations over 7,960 unique benchmark rows (zero roundtrip failures), we establish three results.

First, the measured practical ceiling for bf16 weights under iid marginal-byte coding is ≈1.495× byte-weighted (model-level 95% CI [1.487, 1.502]), derivable from a one-line Shannon proposition at the measured byte entropies (median H_high = 2.74 bits, H_low = 7.97 bits) and consistent with α-stable theory of SGD-trained networks. Our bf16_split predictor reaches 1.488×; the official OpenZL le-u16 result marginally exceeds the bound (1.499×) by using context coding within byte planes, a regime our proposition does not cover. The result is validated on Qwen2.5-7B-Instruct (byte-weighted ratio within +0.0001 of the small-model corpus).
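The byte-marginal ceiling follows from splitting each bf16 value into its two byte planes and summing their marginal entropies: 16 bits per weight divided by H_low + H_high. A minimal sketch of that accounting (NumPy; the function names and the little-endian byte order are illustrative assumptions, not the repo's API):

```python
import numpy as np

def byte_entropy(b: np.ndarray) -> float:
    """Shannon entropy (bits/byte) of a uint8 array's marginal distribution."""
    counts = np.bincount(b, minlength=256)
    p = counts[counts > 0] / b.size
    return float(-(p * np.log2(p)).sum())

def bf16_marginal_ceiling(raw: bytes) -> float:
    """iid byte-marginal compression ceiling for a bf16 tensor: 16 bits per
    weight divided by the summed entropies of the two byte planes.
    Assumes little-endian storage (low mantissa byte first)."""
    a = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 2)
    h_low = byte_entropy(a[:, 0])   # mantissa byte: near-uniform in practice
    h_high = byte_entropy(a[:, 1])  # sign/exponent byte: strongly peaked
    return 16.0 / (h_low + h_high)
```

At the paper's median plane entropies this gives 16 / (2.74 + 7.97) ≈ 1.494×, matching the ≈1.495× headline.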

Second, the measured practical ceiling for Q4_K-typed tensors in GGUF Q4_K_M files is ≈1.076× at the tensor-stream level (best method 1.052×). At the full GGUF artifact level this drops to ≈1.041–1.045×, because Q4_K_M files also include Q6_K tensors that compress at only ~1.01×. The mechanism: optimized per-block scaling produces near-uniform nibble distributions (median nibble entropy 3.86 of a possible 4.0 bits/symbol).
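The nibble-entropy mechanism can be checked directly by unpacking the 4-bit symbols of a packed stream and measuring their marginal entropy. A sketch assuming simple two-nibbles-per-byte packing (real Q4_K blocks also interleave per-block scale/min bytes, which this deliberately ignores):

```python
import numpy as np

def nibble_entropy(packed: bytes) -> float:
    """Shannon entropy (bits/symbol) over the 16 possible values of a
    packed 4-bit weight stream (two nibbles per byte)."""
    b = np.frombuffer(packed, dtype=np.uint8)
    nibbles = np.concatenate([b & 0x0F, b >> 4])
    counts = np.bincount(nibbles, minlength=16)
    p = counts[counts > 0] / nibbles.size
    return float(-(p * np.log2(p)).sum())
```

Near-uniform nibbles at the median 3.86 bits bound the nibble payload alone near 4 / 3.86 ≈ 1.036×; the ≈1.076× tensor-stream ceiling is higher because the per-block scale/min bytes are more compressible than the nibbles.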

Third, we find no evidence of usable linear redundancy between adjacent same-role transformer-layer weight matrices at bf16 precision (median Pearson +0.0004 across 250 pairs in two source models from the Qwen2.5 family), ruling out simple scalar-affine cross-layer compression but not nonlinear schemes or cross-checkpoint deltas.
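The scalar-affine test reduces to a Pearson correlation between flattened same-role weight matrices of layers K and K+1: near-zero correlation means the best predictor w_{K+1} ≈ s·w_K + t leaves a residual with essentially full variance, so coding the residual saves nothing. A minimal sketch (the function name is illustrative, not the repo's API):

```python
import numpy as np

def adjacent_layer_pearson(w_k: np.ndarray, w_k1: np.ndarray) -> float:
    """Pearson correlation between two same-shaped weight matrices,
    flattened to vectors (e.g. layer K and layer K+1 q_proj)."""
    a = w_k.ravel().astype(np.float64)
    b = w_k1.ravel().astype(np.float64)
    a -= a.mean()
    b -= b.mean()
    return float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A perfectly affine-related pair returns 1.0; the paper's median of +0.0004 across 250 pairs sits at the independent-matrices baseline.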

This paper does not propose a new compressor. The two methods we built (bf16_split and a Q4_K block mixture-CDF coder) are used only to test whether the measured entropy ceilings are practically reachable; both cluster at the ceiling rather than beating it. The methodology contribution is the fairness contract under which prior methods can be compared directly.

Headline Results

| Domain | Best lossless ratio | Atlas ceiling | Sample |
|---|---|---|---|
| bf16 weights (Prop. 1 byte-marginal) | 1.488–1.499× | 1.495× | 290 test tensors, ≤7B |
| Q4_K tensor stream | 1.052× | 1.076× | 530 Q4_K-typed test tensors |
| GGUF Q4_K_M whole file | 1.041–1.045× | ≈1.05× | 5 held-out GGUF files |
| Adjacent-layer linear residuals (bf16) | net loss | median Pearson +0.0004 | 250 layer pairs, 2 Qwen2.5 models |

Headline numbers are byte-weighted corpus ratios. Per-tensor distributions and 95% bootstrap CIs in paper §5–§7.
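For concreteness, a byte-weighted corpus ratio is total original bytes over total compressed bytes, and model-level CIs can be obtained by a percentile bootstrap that resamples whole models (respecting the train/test split). A sketch with hypothetical names; the repo's actual CI procedure may differ in detail:

```python
import numpy as np

def byte_weighted_ratio(orig_bytes, comp_bytes) -> float:
    """Byte-weighted corpus ratio: total original / total compressed size."""
    return float(sum(orig_bytes)) / float(sum(comp_bytes))

def bootstrap_ci(orig_bytes, comp_bytes, n_boot=10_000, seed=0):
    """Percentile-bootstrap 95% CI for the byte-weighted ratio.
    Each list entry is one model's totals, so resampling is model-level."""
    rng = np.random.default_rng(seed)
    o = np.asarray(orig_bytes, dtype=float)
    c = np.asarray(comp_bytes, dtype=float)
    idx = rng.integers(0, o.size, size=(n_boot, o.size))
    ratios = o[idx].sum(axis=1) / c[idx].sum(axis=1)
    return np.percentile(ratios, [2.5, 97.5])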
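For concreteness, a byte-weighted corpus ratio is total original bytes over total compressed bytes, and model-level CIs can be obtained by a percentile bootstrap that resamples whole models (respecting the train/test split). A sketch with hypothetical names; the repo's actual CI procedure may differ in detail:

```python
import numpy as np

def byte_weighted_ratio(orig_bytes, comp_bytes) -> float:
    """Byte-weighted corpus ratio: total original / total compressed size."""
    return float(sum(orig_bytes)) / float(sum(comp_bytes))

def bootstrap_ci(orig_bytes, comp_bytes, n_boot=10_000, seed=0):
    """Percentile-bootstrap 95% CI for the byte-weighted ratio.
    Each list entry is one model's totals, so resampling is model-level."""
    rng = np.random.default_rng(seed)
    o = np.asarray(orig_bytes, dtype=float)
    c = np.asarray(comp_bytes, dtype=float)
    idx = rng.integers(0, o.size, size=(n_boot, o.size))
    ratios = o[idx].sum(axis=1) / c[idx].sum(axis=1)
    return np.percentile(ratios, [2.5, 97.5])
```

Byte-weighting makes large tensors dominate the headline number, which is why the paper reports per-tensor distributions separately.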

Figures

Figure 1 — bf16 byte-marginal entropy decomposition. Per-tensor stacked bar of bf16 byte-marginal entropy. (PDF)
Figure 2 — Per-tensor ratio vs R_marginal scatter. Per-tensor best-method ratio vs R_marginal. (PDF)
Figure 3 — Q4_K nibble entropy histogram. Per-tensor H(nibble) across 1,043 Q4_K tensors. (PDF)
Figure 4 — Adjacent-layer Pearson correlation histogram. Pearson correlation across 250 (model, role, K) pairs. (PDF)
Figure 5 — Throughput / ratio Pareto frontier. Decompress MB/s (log scale) vs byte-weighted geomean ratio across all benchmark methods. (PDF)

Reproducibility

The smoke test runs three lossless compressors against bundled fixtures and verifies that their sha256[:16] prefixes match the values measured at v4.0.3 release time. Every method asserts a byte-exact roundtrip before the hash is taken. The test is expected to pass 3/3:

```sh
git clone https://github.com/NimoRotem/llm-compression-limits.git
cd llm-compression-limits
git checkout v4.0.3
./reproducibility_smoke_test.sh
```
| Method | Fixture | Ratio | sha256[:16] |
|---|---|---|---|
| bf16_split | TinyLlama_layer3_q_proj.bin (8 MiB) | 1.4820× | fc3544c35489eb94 |
| qb_mixture_k4 + 277 B profile | Llama32_3B_layer14_gate.q4k.bin (13.5 MiB) | 1.0507× | c79972a64d11b717 |
| decomp_perstream_zstd19_bg | Qwen2.5-0.5B-Q4_K_M.gguf (379 MiB) | 1.0367× | 1c6f077f8c228cf2 |
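The checksum logic amounts to: compress the fixture, assert a byte-exact roundtrip, then compare sha256[:16] of the compressed blob against the recorded prefix. A sketch with generic compress/decompress callables standing in for the repo's method entry points (hypothetical names):

```python
import hashlib

def verify_roundtrip(compress, decompress, fixture: bytes, expected_prefix: str):
    """Byte-exact roundtrip check, then compare sha256[:16] of the
    compressed output against the prefix recorded at release time.
    Returns (prefix_matches, compression_ratio)."""
    blob = compress(fixture)
    assert decompress(blob) == fixture, "roundtrip is not byte-exact"
    digest = hashlib.sha256(blob).hexdigest()[:16]
    return digest == expected_prefix, len(fixture) / len(blob)
```

Any deterministic compressor can be dropped in for testing; the recorded prefixes above only match the exact codec builds pinned at v4.0.3.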

Resources

| Resource | Details |
|---|---|
| Code (Apache-2.0) | github.com/NimoRotem/llm-compression-limits |
| Tagged release | v4.0.3 (immutable; release ZIP byte-identical to mirror) |
| SWHID (persistent identifier) | swh:1:rel:55d910f5af170c22719cc9346f4d8a5029f09164 |
| Artifact mirror | knowva.ai/CompressionV4 |
| Results table | results.jsonl.zst — 7,960 rows / 11,942 verified evals |
| Trained profile | bf16.zl (617 B, OpenZL le-u16, regenerated at v4.0.3) |
| Fixtures | TinyLlama bf16 · Llama-3.2-3B Q4_K · Qwen2.5-0.5B GGUF |
| Smoke checksums | expected-checksums.json |
| Cross-layer pairs | cross-layer-pairs.csv — 250 rows |
| Provenance | Per-stage benchmark verdicts: bf16 supplementary · Q4_K + cross-layer · GGUF artifact |
| Drift catalog | DEVIATIONS.md |

Citation

```bibtex
@misc{rotem2026compression,
  title  = {Practical Limits of Lossless Compression for bf16 Transformer LLM Weights},
  author = {Rotem, Nimo},
  year   = {2026},
  url    = {https://github.com/NimoRotem/llm-compression-limits/releases/tag/v4.0.3},
  howpublished = {GitHub release v4.0.3},
  swhid  = {swh:1:rel:55d910f5af170c22719cc9346f4d8a5029f09164},
  note   = {Reproduction artifact v4.0.3 (canonical citation target; head v4.0.6).}
}
```

License: Apache-2.0 (code) and CC-BY-4.0 (data, profiles, figures). Smoke test passes 3/3. No Zenodo DOI was minted; the GitHub release tag and the Software Heritage SWHID together provide a verifiable archive and a citation-grade persistent identifier.