# llm-compression-limits

Reproduction artifact for the paper *Practical Limits of Lossless Compression for bf16 Transformer LLM Weights*. CPU-only benchmark; smoke test passes 3/3 in seconds; full benchmark reproduces in ~6.5 h on `c3-highcpu-88`.

It includes the evaluation scripts, manifests, trained-codec artifacts, smoke-test fixtures, and figure-generation pipeline used for the results reported in the paper.

- **Paper (Markdown source):** [`paper.md`](paper.md) (canonical text mirror: <https://knowva.ai/llm-compression-limits/paper.md>).
- **Paper (compiled PDF, arXiv submission package):** <https://knowva.ai/CompressionV4/arxiv-preview/paper.pdf>.
- **Code:** <https://github.com/NimoRotem/llm-compression-limits> (Apache-2.0 for code; CC-BY-4.0 for data, results, profiles, figures).
- **Reproduction notes:** [`NOTES.md`](NOTES.md) — what is and is not in this artifact set.

## Headline empirical findings

| Domain | Compressor | Best ratio | Atlas-derived ceiling |
|---|---|---|---|
| bf16 (290-tensor test corpus) | trained OpenZL le-u16 (617 B profile) | **1.4986 geomean** (1.5014 median) | 1.4979 |
| bf16 (same corpus) | our `bf16_split` per-file | 1.4850 geomean (1.4886 median) | — |
| bf16 (7B Qwen2.5 validation) | `bf16_split` per-file | 1.4882 median | 1.5006 |
| Q4_K tensor stream (530-tensor test corpus) | our `qb_mixture_k4` (277 B profile) | **1.0517 geomean** (1.0514 median) | 1.076 |
| GGUF whole file (3 held-out models) | `decomp_perstream_zstd19_bgscale` | **1.0422 geomean** (1.0436 median) | ≈ 1.05 practical |
| bf16 inter-layer correlation (250 adjacent layer pairs) | scalar / affine predictors (rank up to 64) | **net loss** (residual + codebook cost more than the original) | corr ≈ 0 |

Across the two deployment-default LLM weight formats and the two natural axes for finding more structure (joint coding within a tensor and cross-layer reuse), the achievable lossless ratios are within ~0.01 of provable byte-marginal ceilings. There is no large remaining headroom for lossless compression of distributed model weights.
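The table aggregates per-tensor compression ratios as a geometric mean and a median. A minimal illustration of that aggregation, using made-up ratios rather than the paper's corpus:

```python
from statistics import geometric_mean, median

# Hypothetical per-tensor compression ratios (original_bytes / compressed_bytes).
# Illustrative values only, not the paper's data.
ratios = [1.497, 1.501, 1.495, 1.503, 1.499]

print(f"geomean: {geometric_mean(ratios):.4f}")
print(f"median:  {median(ratios):.4f}")
```

The geometric mean is the natural aggregate for ratios (it is invariant to which side of the ratio you average), which is why the tables report it alongside the median.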

## Layout

```
.
├── paper.md                                # the paper
├── README.md                               # this file
├── NOTES.md                                # what is and is not in this artifact
├── LICENSE                                 # Apache 2.0 (code)
├── LICENSE-DATA                            # CC-BY-4.0 (data/profiles/figures)
├── Dockerfile                              # local container reproduction
├── reproducibility_smoke_test.sh           # 3/3 sha256[:16] checks pass
├── run_full_benchmark.sh
├── summarize.py
├── atlas.py
├── bootstrap_ci.py
├── inference_sanity_check.py               # end-to-end token-identity driver
├── whole_file_gguf_ablation.py
├── render_figures.py                       # CLI rendering for the paper figures
├── requirements.txt
├── cross-layer-pairs.csv                   # 250 (model, role, K) tuples from the cross-layer correlation atlas
├── code/                                   # core harness + atlas + cross-layer + supporting experiment scripts
├── methods/
│   ├── bf16-split/                         # bf16 byte-split predictor + mixture-CDF + pretrained CDFs
│   └── qk-mixture-cdf/                     # Q4_K mixture-CDF coder + qa_residual + general baselines
│       └── profiles/qb_k4.profile          # 277 B trained K=4 mixture-CDF profile
├── reproductions/                          # pointers to official DFloat11 / Unweight / ZipNN repos
├── profiles/                               # bf16.zl (617 B); fp16 profile not vendored (see the opening of paper §6)
├── prompts/                                # inference-sanity-prompts.json (8 prompts) + gpt2 result
├── fixtures/                               # 8 MiB bf16 + 13.5 MiB Q4_K vendored; 379 MiB GGUF on mirror
├── manifest/                               # model-manifest, expected-checksums (measured), dependencies
├── results/results.jsonl.zst               # 7,960 rows / 11,942 verified method-evaluations
├── atlas/                                  # per-stage atlas outputs (under provenance/)
├── notebooks/                              # figures.ipynb + lomo-tables.ipynb
├── figures/                                # 5 figs × {pdf, png}
├── decomposer/                             # GGUF byte-region decomposer
└── provenance/                             # per-stage FINAL_VERDICT and atlas/results.jsonl for each benchmark stage
```

## How to reproduce

```bash
git clone https://github.com/NimoRotem/llm-compression-limits.git
cd llm-compression-limits
git checkout v1.0.0
./reproducibility_smoke_test.sh
```

Expected output (sha256[:16] of compressed bytes, all roundtrip-verified):

| Method | Fixture | Compressed bytes | Ratio | sha256[:16] |
|---|---|---:|---:|---|
| `bf16_split` | `fixtures/TinyLlama_layer3_q_proj.bin` (8 MiB) | 5,660,476 | 1.4820× | `fc3544c35489eb94` |
| `qb_mixture_k4` (+ 277 B profile) | `fixtures/Llama32_3B_layer14_gate.q4k.bin` (13.5 MiB) | 13,472,852 | 1.0507× | `c79972a64d11b717` |
| `decomp_perstream_zstd19_bg` | `fixtures/Qwen2.5-0.5B-Q4_K_M.gguf` (379 MiB) | 383,733,236 | 1.0367× | `1c6f077f8c228cf2` |

The script auto-fetches the 379 MiB GGUF fixture from the public mirror if it is absent (GitHub's 100 MB per-file limit prevents shipping it in-repo). Each method's roundtrip is asserted byte-exact before the hash is taken.
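The roundtrip-then-hash discipline can be sketched as follows. This is a stand-in illustration, not the smoke test itself: `zlib` substitutes for the paper's codecs (`bf16_split`, `qb_mixture_k4`, the decomposer), and the payload is a toy buffer, but the ordering is the same — assert a byte-exact roundtrip first, hash the compressed bytes second.

```python
import hashlib
import zlib


def verified_compressed_hash(data: bytes) -> tuple[int, str]:
    """Compress, assert a byte-exact roundtrip, then hash the compressed bytes.

    zlib stands in for the artifact's codecs; the pattern is what matters:
    the sha256[:16] is only meaningful once decompression is proven exact.
    """
    compressed = zlib.compress(data, level=9)
    assert zlib.decompress(compressed) == data, "roundtrip is not byte-exact"
    return len(compressed), hashlib.sha256(compressed).hexdigest()[:16]


# Toy bf16-like payload (repeating 2-byte pattern), not a real fixture.
size, digest = verified_compressed_hash(b"\x3f\x80" * 4096)
print(size, digest)
```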

## Full benchmark invocation (target `c3-highcpu-88`; ~6.5 h end-to-end)

```bash
./run_full_benchmark.sh --corpus=/data/corpus --output=/data/results --workers=88
./summarize.py --results=/data/results/results.jsonl.zst --output=/data/tables
./atlas.py --corpus=/data/corpus --output=/data/atlas
./bootstrap_ci.py --results=/data/results/results.jsonl.zst --output=/data/ci_tables --seed 42 --n-resamples 1000
./inference_sanity_check.py --prompts prompts/inference-sanity-prompts.json --output sanity_results.jsonl
./whole_file_gguf_ablation.py --corpus=/data/corpus/gguf --output=/data/gguf_ablation
```
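`bootstrap_ci.py` recomputes the summary statistics under resampling. A minimal percentile-bootstrap sketch of that idea, with illustrative data and the same seed/resample-count parameters as the invocation above (this is a conceptual sketch, not the actual script):

```python
import random
from statistics import geometric_mean


def bootstrap_ci(values, stat=geometric_mean, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for a statistic over ratios."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recompute the statistic each time, sort.
    stats = sorted(
        stat([rng.choice(values) for _ in range(n)]) for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Illustrative per-file ratios only, not the benchmark's results.
ratios = [1.49, 1.50, 1.48, 1.51, 1.50, 1.49, 1.52, 1.47]
print(bootstrap_ci(ratios))
```

Fixing the seed (here 42, matching the `--seed 42` flag) makes the resampling, and therefore the reported intervals, deterministic across runs.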

## License

- All source files under `code/`, `methods/`, `decomposer/`, top-level Python entry scripts, and shell scripts: **Apache 2.0** (`LICENSE`).
- All data files (`results/`, `profiles/`, `provenance/`, `manifest/`, `cross-layer-pairs.csv`, `figures/`): **CC-BY-4.0** (`LICENSE-DATA`).

## Hosted artifacts

The full artifact tree is mirrored at <https://knowva.ai/llm-compression-limits>. The 379 MiB GGUF fixture is served from there directly; everything else is also in this Git repo.

## Contact

GitHub Issues on this repo.

## Citation

```bibtex
@misc{rotem_llm_compression_limits_2026,
  title  = {Practical Limits of Lossless Compression for bf16 Transformer LLM Weights},
  author = {Rotem, Nimo},
  year   = {2026},
  organization = {Nebula Inc.},
  url    = {https://github.com/NimoRotem/llm-compression-limits/releases/tag/v1.0.0},
  howpublished = {GitHub release v1.0.0},
}
```

Software Heritage: `swh:1:rel:08d597be136278838e8cc2fef2a68303d990208d`.

## What's deliberately not in this version

- **External-method strict-accounting rows** (DFloat11 / Cloudflare Unweight / ZipNN). DFloat11 requires a CUDA GPU (CUDA 12+), and Cloudflare Unweight's official kernel targets NVIDIA Hopper H100/H200; neither runs on the CPU-only `c3-highcpu-88` reproduction host. ZipNN fails byte-exact roundtrip in its default config. Paper §5.3 / §11.2 scope this out explicitly. Pointers to the official repos are in `reproductions/*/README.md`.
- **`fp16.zl` trained profile.** Not vendored. The fp16 byte-marginal headroom (<0.04 bits/elem on the test corpus, see §6 / §11.1 in the paper) is below the threshold we set for shipping a dedicated fp16 method, so no `fp16_split` analog was developed. The trained OpenZL le-u16 fp16-profile result (1.470×) is reported in §6.1; the 778 B profile itself is not vendored. Regeneration recipe in `profiles/README.md`.
- **30B / 70B entropy spot-check.** Not run; Qwen2.5-7B-Instruct is the largest validation point (paper §5.8).
- **Full multi-model inference-sanity sweep at `max_new_tokens=256`.** gpt2 end-to-end verified (8/8 token identity); the per-tensor byte-exact roundtrip already implies the same property for the larger models, so the gpt2 verification is presented in paper §4.3 as sufficient evidence. The script supports a CUDA host (a single T4 is sufficient) for the full sweep.
- **Hosted Docker image.** `Dockerfile` ships in-repo; reviewers can build locally.
