# llm-compression-limits — v4.0.6

**Status: Draft v4.0.6 (final).** Canonical archival snapshot: [GitHub release v4.0.3](https://github.com/NimoRotem/llm-compression-limits/releases/tag/v4.0.3) (immutable git tag; release ZIP is byte-identical to the published artifact; current head v4.0.6 contains only post-archive doc housekeeping). Supplementary persistent identifier: Software Heritage SWHID `swh:1:rel:55d910f5af170c22719cc9346f4d8a5029f09164` covering the v4.0.3 commit (archived 2026-05-14, SH save-task 2330392). **No Zenodo DOI** — the Zenodo↔GitHub webhook installation did not complete on this repository, so we use the GitHub release tag as the canonical archive instead.

Practical limits of lossless compression for bf16 and Q4_K transformer LLM weights — empirical confirmation that the achievable ratios for the two deployment-default formats sit essentially at their byte-marginal entropy floors, with no usable linear cross-layer redundancy in bf16.

## Headline empirical findings

| Domain | Compressor | Best ratio | Atlas-derived ceiling |
|---|---|---|---|
| bf16 (290-tensor test corpus) | trained OpenZL le-u16 (617 B profile, regenerated v4.0.3) | **1.4986 geomean** (1.5014 median) | 1.4979 |
| bf16 (same corpus) | our `bf16_split` per-file | 1.4850 geomean (1.4886 median) | — |
| bf16 (7B Qwen2.5 validation) | `bf16_split` per-file | 1.4882 median | 1.5006 |
| Q4_K tensor stream (530-tensor test corpus) | our `qb_mixture_k4` (277 B profile) | **1.0517 geomean** (1.0514 median) | 1.076 |
| GGUF whole file (3 held-out models) | `decomp_perstream_zstd19_bgscale` | **1.0422 geomean** (1.0436 median) | ≈ 1.05 practical |
| bf16 inter-layer correlation (250 adjacent layer pairs) | scalar / rank-up-to-64 affine | **net loss** (residual + codebook costs more than original) | corr ≈ 0 |

Across the two deployment-default LLM weight formats and the two natural axes for finding more structure (joint coding within a tensor and cross-layer reuse), the achievable lossless ratios are within ~0.01 of provable byte-marginal ceilings. There is no large remaining headroom for lossless compression of distributed model weights.
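
For intuition on the "Atlas-derived ceiling" column: a byte-marginal ceiling is the compression ratio implied by the Shannon entropy of each interleaved byte stream, computed from byte frequencies alone. A minimal sketch, assuming numpy and little-endian bf16 (the function name is ours; the shipped accounting lives in `atlas.py`):

```python
import numpy as np

def byte_marginal_ceiling(raw: bytes, n_streams: int = 2) -> float:
    """Upper bound on the ratio achievable by any coder that models each
    interleaved byte stream with a fixed marginal distribution. For
    little-endian bf16, n_streams=2 separates the low (mantissa) bytes
    from the high (sign/exponent) bytes."""
    buf = np.frombuffer(raw, dtype=np.uint8).reshape(-1, n_streams)
    total_entropy_bits = 0.0
    for s in range(n_streams):
        counts = np.bincount(buf[:, s], minlength=256).astype(np.float64)
        p = counts / counts.sum()
        p = p[p > 0]                       # drop zero-probability symbols
        total_entropy_bits += -(p * np.log2(p)).sum() * buf.shape[0]
    return (len(raw) * 8) / total_entropy_bits  # original bits / entropy bits
```

A geomean near 1.5 is consistent with a near-random low (mantissa) byte plus a strongly peaked sign/exponent byte.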

## Layout

```
.
├── README.md                              # this file
├── DEVIATIONS.md                          # live catalog of drift between paper text and as-shipped artifacts
├── REMAINING_WORK.md                      # post-completion status (no work remaining for v4.0.6)
├── LICENSE                                # Apache 2.0 (code)
├── LICENSE-DATA                           # CC-BY-4.0 (data/profiles/figures)
├── Dockerfile                             # local container reproduction; no hosted image
├── reproducibility_smoke_test.sh          # 3/3 sha256[:16] checks pass
├── run_full_benchmark.sh
├── summarize.py
├── atlas.py
├── bootstrap_ci.py
├── inference_sanity_check.py              # end-to-end token-identity driver
├── whole_file_gguf_ablation.py
├── render_figures.py                      # CLI rendering for the 10 paper figures
├── requirements.txt
├── cross-layer-pairs.csv                  # 250 (model, role, K) from the cross-layer correlation atlas
├── code/                                  # SCAPE pkg + atlas + cross-layer + supporting experiment scripts
├── methods/
│   ├── bf16-split/                        # bf16 byte-split predictor + mixture-CDF + pretrained CDFs (sketch after the tree)
│   └── qk-mixture-cdf/                    # Q4_K mixture-CDF coder + qa_residual + general baselines
│       └── profiles/qb_k4.profile         # 277 B trained K=4 mixture-CDF profile
├── reproductions/                         # pointers to official DFloat11 / Unweight / ZipNN repos
├── profiles/                              # bf16.zl (617 B, regenerated v4.0.3); fp16 deferred
├── prompts/                               # inference-sanity-prompts.json (8 real prompts) + gpt2 result
├── fixtures/                              # 8 MiB bf16 + 13.5 MiB Q4_K vendored; 379 MiB GGUF on mirror
├── manifest/                              # model-manifest, expected-checksums (measured), dependencies
├── results/results.jsonl.zst              # 7,960 rows / 11,942 verified method-evaluations
├── atlas/                                 # pointers to per-stage atlas outputs (under provenance/)
├── notebooks/                             # figures.ipynb + lomo-tables.ipynb (both populated)
├── figures/                               # 5 figs × {pdf, png} rendered and committed
├── decomposer/                            # GGUF byte-region decomposer
└── provenance/                            # per-stage FINAL_VERDICT and atlas/results.jsonl for each benchmark stage (subdirectories named for internal scaffolding)
```
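
For orientation, the byte-plane idea behind `bf16_split` (under `methods/bf16-split/` above) can be sketched in a few lines. This is a simplified illustration, not the shipped coder — the real method adds a per-byte predictor and a mixture-CDF entropy stage on top of the split; function names are ours, and numpy plus the `zstandard` package are assumed:

```python
import numpy as np
import zstandard as zstd

def bf16_split_compress(raw: bytes) -> bytes:
    """De-interleave little-endian bf16 bytes into two contiguous planes
    (all low bytes, then all high bytes) so a generic compressor sees the
    skewed sign/exponent bytes as one homogeneous stream."""
    b = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 2)
    planes = b[:, 0].tobytes() + b[:, 1].tobytes()
    return zstd.ZstdCompressor(level=19).compress(planes)

def bf16_split_decompress(blob: bytes) -> bytes:
    """Invert the split: re-interleave the two byte planes."""
    planes = zstd.ZstdDecompressor().decompress(blob)
    half = len(planes) // 2
    out = np.empty((half, 2), dtype=np.uint8)
    out[:, 0] = np.frombuffer(planes[:half], dtype=np.uint8)
    out[:, 1] = np.frombuffer(planes[half:], dtype=np.uint8)
    return out.tobytes()
```

The roundtrip is byte-exact by construction: `bf16_split_decompress(bf16_split_compress(raw)) == raw`.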

## How to reproduce

```bash
git clone https://github.com/NimoRotem/llm-compression-limits.git
cd llm-compression-limits
git checkout v4.0.3
./reproducibility_smoke_test.sh
```

Expected output (sha256[:16] of compressed bytes, all roundtrip-verified):

| Method | Fixture | Compressed bytes | Ratio | sha256[:16] |
|---|---|---:|---:|---|
| `bf16_split` | `fixtures/TinyLlama_layer3_q_proj.bin` (8 MiB) | 5,660,476 | 1.4820× | `fc3544c35489eb94` |
| `qb_mixture_k4` (+ 277 B profile) | `fixtures/Llama32_3B_layer14_gate.q4k.bin` (13.5 MiB) | 13,472,852 | 1.0507× | `c79972a64d11b717` |
| `decomp_perstream_zstd19_bg` | `fixtures/Qwen2.5-0.5B-Q4_K_M.gguf` (379 MiB) | 383,733,236 | 1.0367× | `1c6f077f8c228cf2` |

The script auto-fetches the 379 MiB GGUF fixture from the public mirror if absent (GitHub's 100 MB per-file limit prevents shipping it in-repo). Each method's roundtrip is asserted byte-exact before the hash is taken.
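
Conceptually, each fixture check reduces to a roundtrip assertion followed by a truncated-digest comparison. A minimal sketch of the pattern — the `compress`/`decompress` callables stand in for each method and are hypothetical:

```python
import hashlib

def verify_fixture(raw: bytes, compress, decompress, expected_sha16: str) -> float:
    """Assert the byte-exact roundtrip first, then hash the compressed
    bytes; returns the achieved ratio, raises if either check fails."""
    blob = compress(raw)
    assert decompress(blob) == raw, "roundtrip is not byte-exact"
    digest = hashlib.sha256(blob).hexdigest()[:16]
    assert digest == expected_sha16, f"got {digest}, want {expected_sha16}"
    return len(raw) / len(blob)
```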

## Full benchmark invocation (target c3-highcpu-88; ~6.5 h end-to-end)

```bash
./run_full_benchmark.sh --corpus=/data/corpus --output=/data/results --workers=88
./summarize.py --results=/data/results/results.jsonl.zst --output=/data/tables
./atlas.py --corpus=/data/corpus --output=/data/atlas
./bootstrap_ci.py --results=/data/results/results.jsonl.zst --output=/data/ci_tables --seed 42 --n-resamples 1000
./inference_sanity_check.py --prompts prompts/inference-sanity-prompts.json --output sanity_results.jsonl
./whole_file_gguf_ablation.py --corpus=/data/corpus/gguf --output=/data/gguf_ablation
```
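
For reference, the statistic behind `bootstrap_ci.py` is a percentile bootstrap over per-file ratios. A minimal sketch under the same `--seed 42 --n-resamples 1000` settings (assuming numpy; the function name is ours — the shipped script additionally reads `results.jsonl.zst` and writes tables):

```python
import numpy as np

def geomean_bootstrap_ci(ratios, n_resamples=1000, seed=42, alpha=0.05):
    """Point estimate and percentile-bootstrap CI for the geometric mean
    of per-file compression ratios."""
    rng = np.random.default_rng(seed)
    log_r = np.log(np.asarray(ratios, dtype=np.float64))
    stats = np.empty(n_resamples)
    for i in range(n_resamples):
        sample = rng.choice(log_r, size=log_r.size, replace=True)
        stats[i] = np.exp(sample.mean())   # geomean of this resample
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(np.exp(log_r.mean())), float(lo), float(hi)
```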

## License

- All source files under `code/`, `methods/`, `decomposer/`, top-level Python entry scripts, and shell scripts: **Apache 2.0** (`LICENSE`).
- All data files (`results/`, `profiles/`, `provenance/`, `manifest/`, `cross-layer-pairs.csv`, `figures/`): **CC-BY-4.0** (`LICENSE-DATA`).

## Hosted artifacts

The full artifact tree is mirrored at <https://knowva.ai/CompressionV4>. The 379 MiB GGUF fixture is served from there directly; everything else is also in this Git repo.

## Contact

GitHub Issues on this repo.

## Citation

Cite the GitHub release directly (immutable tag; the release ZIP is byte-identical to the published artifact). The BibTeX entry below includes the Software Heritage SWHID as an additional persistent identifier.

```bibtex
@misc{llm_compression_limits_v4_2026,
  title  = {Practical Limits of Lossless Compression for bf16 Transformer LLM Weights},
  year   = {2026},
  note   = {Reproduction artifact v4.0.3.},
  url    = {https://github.com/NimoRotem/llm-compression-limits/releases/tag/v4.0.3},
  howpublished = {GitHub release v4.0.3},
  swhid  = {swh:1:rel:55d910f5af170c22719cc9346f4d8a5029f09164}
}
```

No Zenodo DOI: the Zenodo↔GitHub webhook installation did not complete on this repository, so the GitHub release tag serves as the canonical archive. See `DEVIATIONS.md §H` for the rationale.

## What's deliberately not in this version

- **External-method strict-accounting rows** (DFloat11 / Cloudflare Unweight / ZipNN). DFloat11 and Unweight are Hopper-only CUDA kernels and cannot run on the c3-highcpu-88 reproduction host; ZipNN fails byte-exact roundtrip in default config. Paper §5.3 / §11.2 scope this out explicitly. Pointers to official repos in `reproductions/*/README.md`.
- **`fp16.zl` trained profile.** fp16_split was dropped at the pre-commit atlas check because the fp16 byte-marginal headroom is < 0.04 bits/elem. Recipe in `profiles/README.md` if it ever needs to be regenerated.
- **30B / 70B entropy spot-check.** Not run; 7B Qwen2.5 is the largest validation point.
- **Full 11-model inference-sanity sweep at `max_new_tokens=256`.** gpt2 is verified end-to-end (8/8 token identity); because the per-tensor roundtrip is byte-exact, the same property follows for the larger models, so paper §4.3 presents the gpt2 verification as sufficient evidence. The script supports a CUDA host (`genom-beast-gpu` T4) for the full sweep if it becomes useful; a minimal sketch of the check follows this list.
- **Hosted Docker image.** `Dockerfile` ships in-repo; reviewers can build locally.
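
For reference, the token-identity property the sanity driver checks can be sketched as follows (assuming `transformers` with greedy decoding; the directory arguments and function name are hypothetical — the shipped driver is `inference_sanity_check.py`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def tokens_identical(model_dir, restored_dir, prompts, max_new_tokens=256):
    """Greedy-decode each prompt with the original weights and with the
    compress->decompress->reload weights; require identical token ids."""
    tok = AutoTokenizer.from_pretrained(model_dir)
    ref = AutoModelForCausalLM.from_pretrained(model_dir)
    rec = AutoModelForCausalLM.from_pretrained(restored_dir)
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        with torch.no_grad():
            a = ref.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
            b = rec.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
        if not torch.equal(a, b):
            return False
    return True
```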
