Practical Limits of Lossless Compression for bf16 Transformer LLM Weights

Draft v4.0.6 (final). Citation target: v4.0.3 (immutable, SWHID-archived). Primary venue: DCC 2027.

Abstract. We measure the lossless compressibility of bf16 transformer LLM weights and of Q4_K-typed tensors in GGUF Q4_K_M files under a strict fairness contract that counts profile bytes against trained methods on every file, verifies byte-exact roundtrips on every run, and uses a model-level train/test split. Across 12 source models and 11,942 verified roundtrip method-evaluations over 7,960 unique benchmark rows (zero roundtrip failures), we establish three results.

First, the measured practical ceiling for bf16 weights under iid marginal-byte coding is ≈1.495× byte-weighted (model-level 95% CI [1.487, 1.502]), derivable from a one-line Shannon proposition at measured byte entropies (median H_high = 2.74 bits, H_low = 7.97 bits) and consistent with α-stable theory of SGD-trained networks. Our bf16_split predictor reaches 1.488×; the official OpenZL le-u16 result marginally exceeds the bound (1.499×) by using context coding within byte planes, a regime our proposition does not cover. Result validated through Qwen2.5-7B-Instruct (byte-weighted +0.0001 vs the small-model corpus).

Second, the measured practical ceiling for Q4_K-typed tensors in GGUF Q4_K_M files is ≈1.076× at tensor-stream level (best method 1.052×). At the full GGUF artifact level this becomes ≈1.041–1.045× because Q4_K_M files include Q6_K tensors that compress at ~1.01×. Mechanism: optimized per-block scaling produces near-uniform nibble distributions (median 3.86 of 4.0 bits/symbol).

Third, we find no evidence of usable linear redundancy between adjacent same-role transformer-layer weight matrices at bf16 precision (median Pearson +0.0004 across 250 pairs in two source models from the Qwen2.5 family), ruling out simple scalar-affine cross-layer compression but not nonlinear schemes or cross-checkpoint deltas.

This paper does not propose a new compressor. The two methods we built (bf16_split and a Q4_K block mixture-CDF coder) are used only to test whether the measured entropy ceilings are practically reachable; both cluster at the ceiling rather than beating it. The methodology contribution is the fairness contract under which prior methods can be compared directly.

Headline Results

Headline numbers are byte-weighted corpus ratios. Per-tensor distributions and 95% bootstrap CIs in paper §5–§7.

Figures

Reproducibility

The smoke test runs three lossless compressors against bundled fixtures and verifies their sha256[:16] prefixes match the values measured at v4.0.3 release time. Roundtrip is byte-exact-asserted in every method before the hash is taken. Expected to pass 3/3:

Resources

Citation

License: Apache-2.0 (code) and CC-BY-4.0 (data, profiles, figures). Smoke test passes 3/3. No Zenodo DOI was minted; the GitHub release tag and the Software Heritage SWHID together provide a verifiable archive and a citation-grade persistent identifier.

Domain	Best lossless ratio	Atlas ceiling	Sample
bf16 weights (Prop. 1 byte-marginal)	1.488–1.499×	1.495×	290 test tensors, ≤7B
Q4_K tensor stream	1.052×	1.076×	530 Q4_K-typed test tensors
GGUF Q4_K_M whole file	1.041–1.045×	≈1.05×	5 held-out GGUF files
Adjacent-layer linear residuals (bf16)	net loss	median Pearson +0.0004	250 layer pairs, 2 Qwen2.5 models

Method	Fixture	Ratio	sha256[:16]
bf16_split	TinyLlama_layer3_q_proj.bin (8 MiB)	1.4820×	`fc3544c35489eb94`
qb_mixture_k4 + 277 B profile	Llama32_3B_layer14_gate.q4k.bin (13.5 MiB)	1.0507×	`c79972a64d11b717`
decomp_perstream_zstd19_bg	Qwen2.5-0.5B-Q4_K_M.gguf (379 MiB)	1.0367×	`1c6f077f8c228cf2`

Code (Apache-2.0)	github.com/NimoRotem/llm-compression-limits
Tagged release	v4.0.3 (immutable; release ZIP byte-identical to mirror)
SWHID (persistent identifier)	`swh:1:rel:55d910f5af170c22719cc9346f4d8a5029f09164`
Artifact mirror	knowva.ai/CompressionV4
Results table	results.jsonl.zst — 7,960 rows / 11,942 verified evals
Trained profile	bf16.zl (617 B, OpenZL le-u16, regenerated v4.0.3)
Fixtures	TinyLlama bf16 · Llama-3.2-3B Q4_K · Qwen2.5-0.5B GGUF
Smoke checksums	expected-checksums.json
Cross-layer pairs	cross-layer-pairs.csv — 250 rows
Provenance	per-stage benchmark verdicts: bf16 supplementary · Q4_K + cross-layer · GGUF artifact
Drift catalog	DEVIATIONS.md

Practical Limits of Lossless Compression for bf16 Transformer LLM Weights With Companion Measurements on fp16 and on Q4_K-typed Tensors in GGUF Q4_K_M Files, Under Strict Profile-Byte Accounting

Headline Results

Figures

Reproducibility

Resources

Citation