# Reproduction notes

This file documents the scope and limitations of the public reproduction artifact for *Practical Limits of Lossless Compression for bf16 Transformer LLM Weights* (Nimo Rotem, 2026). Paper canonical text: [`paper.md`](paper.md).

## A. HF revision prefix mismatches

Paper §3.1 and Appendix A.2 list 8-character HF revision prefixes for the 13 source-model rows. Twelve of the thirteen were prefix-pinned at corpus-build time (the 13th, `Qwen/Qwen2.5-1.5B`, is the cross-layer-only model and was resolved from HF main on the benchmark date). **Seven of the twelve prefix-pinned rows do not match the full SHAs that were actually resolved**: `distilbert-base-uncased`, `sentence-transformers/all-MiniLM-L6-v2`, and all five bartowski Q4_K_M GGUF rows. The authoritative SHAs are in [`manifest/model-manifest.json`](manifest/model-manifest.json), with `prefix_match: true/false/null` recorded per model. The paper text already documents this drift (note under the §3.1 table).
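
A minimal re-check of the prefix comparison is sketched below. It assumes a manifest layout with a `models` list carrying `repo_id`, `pinned_prefix`, and `resolved_sha` fields; those field names are illustrative guesses, and the manifest's own `prefix_match` field remains authoritative.

```python
import json

# Recompute prefix_match from the vendored manifest. The field names used
# here (models, repo_id, pinned_prefix, resolved_sha) are assumptions about
# the schema; the manifest's prefix_match field is the authoritative record.
with open("manifest/model-manifest.json") as f:
    manifest = json.load(f)

for entry in manifest["models"]:
    prefix = entry.get("pinned_prefix")   # 8-char prefix quoted in the paper
    full_sha = entry["resolved_sha"]      # full SHA resolved at corpus build
    if prefix is None:
        # The cross-layer-only model was resolved from HF main, not pinned.
        print(f"{entry['repo_id']}: prefix_match=null")
        continue
    print(f"{entry['repo_id']}: prefix_match={full_sha.startswith(prefix)}")
```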

## B. fp16 trained profile not vendored

Paper §6 and §11.1 scope fp16 as a companion measurement. The trained OpenZL le-u16 fp16-profile result (1.470× byte-weighted on the fp16 test corpus) is reported as fact in §6.1, with a provenance pointer to `provenance/iter3_precommit/results_fp16_openzl.jsonl`. Only the 778 B fp16 profile itself is not vendored, per the deprioritization decision in the §6 lead paragraph (fp16's byte-marginal headroom falls below the 0.04 bits/elem threshold for committing to a format-specific method). The bf16 profile (`bf16.zl`, 617 B) is vendored.
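
For orientation, the conversion behind that threshold: storing $b$ bits per element at compression ratio $R$ costs $b/R$ bits/elem, so the marginal headroom of a format-specific method at ratio $R_2$ over a baseline at ratio $R_1$ is

$$\Delta = b\left(\frac{1}{R_1} - \frac{1}{R_2}\right), \qquad b = 16 \ \text{for fp16}.$$

With the reported $R_2 = 1.470$ and a purely hypothetical baseline $R_1 = 1.465$ (illustrative only; the paper's actual baseline ratio is not restated here), $\Delta = 16\,(1/1.465 - 1/1.470) \approx 0.037$ bits/elem, under the 0.04 cutoff.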

## C. External methods not run under the strict contract

Paper §11.2 and §5.3 explicitly scope this out. DFloat11 ([`LeanModels/DFloat11`](https://github.com/LeanModels/DFloat11)) is a CUDA GPU kernel (CUDA 12+ per its README); Cloudflare Unweight ([`cloudflareresearch/unweight-kernels`](https://github.com/cloudflareresearch/unweight-kernels)) is a CUDA GPU kernel whose README explicitly targets NVIDIA Hopper H100/H200. Neither runs on the `c3-highcpu-88` CPU-only reference host. ZipNN ([`zipnn/zipnn`](https://github.com/zipnn/zipnn)) installs on CPU but fails the byte-exact roundtrip in its default configuration. `reproductions/{dfloat11,unweight,zipnn}/` ship pointer READMEs only.
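
For reference, a minimal sketch of the strict contract those directories point at: a method passes only if decompression reproduces the input byte-for-byte. `zstandard` stands in for the method under test; the paper's actual harness is not vendored in this form.

```python
import zstandard  # stand-in codec; any candidate method plugs in here

def roundtrip_ok(payload: bytes) -> bool:
    """Strict contract: decompress(compress(x)) must equal x byte-for-byte."""
    blob = zstandard.ZstdCompressor().compress(payload)
    restored = zstandard.ZstdDecompressor().decompress(blob)
    return restored == payload  # byte-exact, not merely value-equal

# Usage on any tensor payload read from disk (path is hypothetical):
assert roundtrip_ok(open("fixtures/example.bin", "rb").read())
```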

**External-method published ratios — verified scopes.** Paper §5.3 observation (5) verifies these against the primary sources:

- **Cloudflare Unweight** — Cloudflare Technical Report Cf-TR-2026.04.v1: ~30% compression on MLP weights. Paper cites ≈1.43× on MLP weights and 1.15–1.28× whole-model on Llama-3.1-8B depending on bundle scope (≈1.15× for inference bundles, ≈1.28× for distribution bundles per Cf-TR-2026.04.v1). The MLP-only scope is by design (non-MLP layers stored verbatim).
- **DFloat11** — arXiv:2504.11651 / NeurIPS 2025 Table 1: whole-model compression ratios on 9 LLMs range 1.467×–1.480× (% of original size 67.58%–68.17%), averaging ≈1.475×. Paper cites ≈1.47–1.48×.
- **ZipNN** — arXiv:2411.05239 abstract: ≈33% reduction on bf16 popular models → 1/0.67 ≈ 1.49×. Paper cites ≈1.49×. (The reduction-to-ratio conversion used in these rows is spelled out below.)
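
The percent-reduction to ratio conversion, stated once for checking:

$$R = \frac{\text{original}}{\text{compressed}} = \frac{1}{1 - r}; \qquad r = 0.30 \Rightarrow R \approx 1.43, \qquad r = 0.33 \Rightarrow R \approx 1.49.$$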

## D. 30B / 70B entropy spot-check not run

Paper §5.8 explicitly notes that the scope is ≤7B (validated through Qwen2.5-7B-Instruct). No 30B/70B measurement is claimed or shipped.

## E. Inference sanity check verified only on gpt2

Paper §4.3 reports gpt2 verified (149 bf16 tensor objects after tied-alias expansion, 311 MiB unique tensor payload, 8/8 token identity). The full multi-model `max_new_tokens=256` sweep (the 8 bf16/fp16-loadable rows plus the 3 Q4_K test models loadable through llama.cpp, giving 11 inference-runnable models) is run separately on a GPU host and is not part of this CPU-only paper. The byte-exact tensor-level roundtrip assertion (§4.2), verified across all 11,942 method-evaluations in `results/results.jsonl.zst`, implies token identity for the other 10 inference-runnable models at temperature 0 / seed 42: greedy decoding is a deterministic function of the weights and the prompt, so byte-identical weights must reproduce byte-identical tokens.
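
A hedged sketch of what that argument means operationally, using the `transformers` greedy-decoding API (prompt and generation length are illustrative, not the paper's harness settings):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Greedy decoding (temperature 0) is a deterministic function of (weights,
# prompt): byte-identical weights must reproduce byte-identical tokens.
torch.manual_seed(42)
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The quick brown fox", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=8, do_sample=False)
print(out[0].tolist())  # same token ids on every run over identical weights
```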

## F. Smoke-test sha256 prefixes

Paper Appendix A.6 quotes three sha256[:16] prefixes from the smoke-test run: `fc3544c35489eb94`, `c79972a64d11b717`, `1c6f077f8c228cf2`. [`manifest/expected-checksums.json`](manifest/expected-checksums.json) carries these values. `reproducibility_smoke_test.sh` runs against the fixture set bundled at `fixtures/`.
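
A minimal recomputation sketch, assuming `expected-checksums.json` maps file paths to sha256[:16] prefixes (the flat `{path: prefix}` layout is an assumption; adjust to the vendored schema):

```python
import hashlib
import json

# Recompute sha256[:16] for each smoke-test output and compare it to the
# expected prefix. The flat {path: prefix} layout is an assumed schema.
with open("manifest/expected-checksums.json") as f:
    expected = json.load(f)

for path, want in expected.items():
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    status = "OK" if digest.startswith(want) else "MISMATCH"
    print(f"{status}  {path}  {digest[:16]}")
```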

## G. Docker image not built

Paper Appendix A.1 notes that no hosted Docker image is provided. The `Dockerfile` ships in-repo for local builds. The image is not published to `ghcr.io` (the build account lacks the `write:packages` scope).

## H. Cross-layer test phase scoped out by L_RED pre-check

The cross-layer (§8) experiment was scoped at design time to use Qwen2.5-{0.5B, 1.5B} as the train models and a separate held-out test model (Llama-3.2-3B, substituted by HuggingFaceTB/SmolLM2-1.7B because Llama-3.2-3B is HF-gated).

The pre-check verdict (`provenance/iter5/FINAL_VERDICT.md` §"Pre-check L outcome") was unambiguously **L_RED**: median |Pearson correlation| = 0.0004 across the 250 training-set adjacent-layer pairs (Qwen2.5-0.5B + Qwen2.5-1.5B), versus the 0.5 L_GREEN threshold. Per the iter5 design spec, the full benchmark phase was **SKIPPED** for L_RED outcomes.
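
A minimal reconstruction of the pre-check statistic, assuming each pair is two equal-shaped weight tensors from adjacent layers (the pairing and flattening details are an assumed reconstruction; the verdict file is authoritative):

```python
import numpy as np

def precheck_L(layer_pairs):
    """Median |Pearson r| over adjacent-layer weight pairs.

    layer_pairs: iterable of (w_i, w_next) equal-shaped arrays. The 0.5
    L_GREEN threshold and the skip-on-L_RED rule follow the iter5 design
    spec; pairing/flattening details here are illustrative assumptions.
    """
    rs = []
    for a, b in layer_pairs:
        x = np.asarray(a, dtype=np.float32).ravel()
        y = np.asarray(b, dtype=np.float32).ravel()
        rs.append(abs(float(np.corrcoef(x, y)[0, 1])))
    median_r = float(np.median(rs))
    verdict = "L_GREEN" if median_r >= 0.5 else "L_RED"  # L_RED => skip phase
    return median_r, verdict
```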

Consequences:
- The 250 cross-layer pairs reported in §8 are all from the two train models (Qwen2.5-0.5B and Qwen2.5-1.5B, both Qwen2.5 family). The paper's "both Qwen2.5 family" framing reflects what was actually measured.
- SmolLM2-1.7B was downloaded and prepared as the test-set substitute but **no cross-layer measurements were taken on it**. The manifest reflects "prepared but not measured" for the cross-layer test phase.
- The §8 result's scope is therefore "two source models from the Qwen2.5 family at 0.5B and 1.5B". §11.7 already states this scope ("does not address whether nonlinear schemes, cross-checkpoint deltas, or other model architectures would show different structure"); the cross-family generalization remains an explicit open direction.

## I. Bibliography style

The bibliography follows a JMLR-leaning style with one consistent structural form across §B.1: `Authors. *Title.* Venue, Year. arXiv:ID. Code: link.` All entries use period-separated fields, year-only dates, and italic titles. The JMLR house rule is full author lists for seven or fewer authors, with "et al." reserved for eight or more; a small number of multi-author entries still use the compact "et al." form when the underlying author count is at or near the cutoff, which is verifiable against the primary source for each citation.
