Investigation · Storage efficiency

On the storage cost of the LDF v0.1 extraction pipeline relative to source PDF and PPTX documents

Generated 2026-05-09T20:18:58.264Z · 36 documents

Abstract. The cleaned extraction occupies 93.1% of source size as loose files and 65.7% when archived as a single tar.gz. Page-protobuf bytes alone compress to 29.7% of their raw size under gzip-9, which indicates substantial residual redundancy, much of it repeated Frame metadata and class-id varints. ContentItem (44.0%) and TextBox (36.8%) dominate page-proto bytes; per-Frame metadata accounts for a further 11.3%.

§1 Aggregate sizes

Table 1. Aggregate sizes by corpus.

Corpus    n   Source    Extracted (raw)  Cleaned   tar.gz    Cleaned/Source  tar.gz/Source  Σ page.pb  page.pb gz/raw
PDF       24  9.67 MB   9.02 MB          9.02 MB   5.66 MB   93.3%           58.6%          2.29 MB    31.4%
PPTX      12  6.37 MB   5.91 MB          5.91 MB   4.88 MB   92.9%           76.6%          561.4 KB   22.2%
Combined  36  16.03 MB  14.93 MB         14.93 MB  10.54 MB  93.1%           65.7%          2.84 MB    29.7%
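
The two ratio columns in Table 1 are the Cleaned and tar.gz sizes divided by the Source size. A minimal TypeScript sketch of that arithmetic follows. The helper name `pct` is hypothetical, and the inputs are the rounded MB values printed in the table; the report presumably works from exact byte counts, so the last digit can differ from the published figure.

```typescript
// Ratio of an output size to the source size, as a one-decimal percentage.
// `pct` is a hypothetical helper, not part of the LDF pipeline.
function pct(part: number, whole: number): string {
  if (whole <= 0) throw new RangeError("whole must be positive");
  return ((part / whole) * 100).toFixed(1) + "%";
}

// Rounded sizes in MB, copied from Table 1.
const rows = [
  { corpus: "PDF", source: 9.67, cleaned: 9.02, targz: 5.66 },
  { corpus: "PPTX", source: 6.37, cleaned: 5.91, targz: 4.88 },
  { corpus: "Combined", source: 16.03, cleaned: 14.93, targz: 10.54 },
];

for (const r of rows) {
  // Prints: corpus, Cleaned/Source, tar.gz/Source
  console.log(r.corpus, pct(r.cleaned, r.source), pct(r.targz, r.source));
}
```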

§2 Per-document results

Legend (Composition column): page.proto, sidecar.proto, font, img, vec, other

Table 2. Per-PDF sizes and composition.

Document                                          Source    Cleaned   tar.gz    Cleaned/Source  tar.gz/Source
05.pdf                                            132.1 KB  46.8 KB   31.1 KB   35.4%           23.5%
06.pdf                                            167.2 KB  664.2 KB  652.4 KB  397.2%          390.2%
07.pdf                                            123.0 KB  42.7 KB   28.0 KB   34.8%           22.8%
08.pdf                                            152.4 KB  557.3 KB  544.6 KB  365.7%          357.4%
12-12.pdf                                         191.1 KB  94.1 KB   67.4 KB   49.3%           35.3%
2505.18706v3-9.pdf                                105.8 KB  54.2 KB   42.0 KB   51.2%           39.7%
2505.18706v3.pdf                                  403.1 KB  332.2 KB  185.1 KB  82.4%           45.9%
Ali-Argun-Sayilgan-CV-ML.pdf                      117.4 KB  51.2 KB   37.6 KB   43.6%           32.0%
Ali_Argun_Sayilgan_CV.pdf                         104.1 KB  44.4 KB   33.1 KB   42.7%           31.8%
Chapter 5 Model Predictive Control-somepages.pdf  298.5 KB  223.6 KB  104.2 KB  74.9%           34.9%
Chapter 5 Model Predictive Control.pdf            1.32 MB   1.24 MB   521.7 KB  93.7%           38.5%
Chapter 5-11111.pdf                               126.9 KB  61.5 KB   44.2 KB   48.5%           34.8%
DIJJI.ai.pdf                                      2.60 MB   1.29 MB   649.0 KB  49.7%           24.4%
cal1-somepage1.pdf                                88.7 KB   40.3 KB   22.0 KB   45.4%           24.8%
cal1-somepages.pdf                                1.06 MB   1.57 MB   1.42 MB   148.7%          134.4%
cal1-somepages11.pdf                              191.3 KB  94.4 KB   67.6 KB   49.3%           35.3%
data0-1.pdf                                       22.3 KB   55.1 KB   53.9 KB   246.8%          241.6%
data0-10.pdf                                      24.8 KB   28.2 KB   24.2 KB   113.9%          97.5%
data0.pdf                                         153.2 KB  607.0 KB  341.1 KB  396.2%          222.6%
diji01.pdf                                        805.1 KB  1.35 MB   549.6 KB  171.1%          68.3%
letterspcwht.pdf                                  131.2 KB  40.1 KB   21.8 KB   30.6%           16.6%
matrixful1.pdf                                    120.8 KB  60.9 KB   43.1 KB   50.4%           35.7%
rubin-pdf5.pdf                                    1.12 MB   473.0 KB  232.0 KB  41.3%           20.3%
sat-complexnumbers0.pdf                           201.3 KB  86.8 KB   50.1 KB   43.1%           24.9%

[Comparison and Composition (cleaned) columns were bar charts; not reproduced here.]

Table 3. Per-PPTX sizes and composition.

Document                                             Source    Cleaned   tar.gz     Cleaned/Source  tar.gz/Source
1-Introduction.pptx                                  1.14 MB   1.07 MB   1020.0 KB  94.3%           87.4%
Chapter8-Pres.pptx                                   1.39 MB   1.48 MB   1.21 MB    106.9%          87.2%
PrimeFactorisation.pptx                              705.8 KB  172.1 KB  159.6 KB   24.4%           22.6%
RUBIN UX UI.pptx                                     510.5 KB  738.6 KB  440.5 KB   144.7%          86.3%
Recordkeeping_Software_Presentation.pptx             972.0 KB  929.4 KB  872.1 KB   95.6%           89.7%
charts-generated-basic.pptx                          68.7 KB   50.4 KB   12.8 KB    73.4%           18.6%
charts-generated-extra.pptx                          68.6 KB   50.3 KB   12.9 KB    73.3%           18.8%
cloud.pptx                                           314.0 KB  254.1 KB  194.4 KB   80.9%           61.9%
ink-maybedraw.pptx                                   106.7 KB  62.3 KB   56.9 KB    58.4%           53.4%
onenote-math-features.pptx                           841.5 KB  802.4 KB  774.5 KB   95.4%           92.0%
split_presentations_2.pptx                           154.1 KB  222.4 KB  143.9 KB   144.3%          93.4%
teach-a-level-computing-1-data-structures-2018.pptx  188.9 KB  156.3 KB  70.3 KB    82.7%           37.2%

[Comparison and Composition (cleaned) columns were bar charts; not reproduced here.]

§3 Page-proto field analysis

We classify every entry of the repeated PageContent content = 3 field in every page proto by its oneof variant. Because Frame is a transparent grouping container, we recurse into Frame.children (field 10) and attribute its repeated ContentItem content (field 8) to item; the residual Frame metadata is reported as frame-overhead. The squeezable note in each header restates the gzip-9 ratio of the concatenated page-proto stream as its reciprocal (for example, 29.7% corresponds to 3.4×).
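
The attribution rule above can be sketched as a recursive walk. The types below are a simplified, hypothetical in-memory model, not the LDF schema: only the field numbers quoted above come from this report. Each `bytes` value is assumed to be the encoded size of that entry including its tag and length prefix, and page-meta (bytes outside the content field) is left out.

```typescript
// Hypothetical, simplified model of decoded page content (not the real schema).
type LeafKind = "item" | "textbox" | "shape" | "table" | "hardchar";

interface Leaf { kind: LeafKind; bytes: number }
interface Frame {
  kind: "frame";
  bytes: number;        // full encoded size of this Frame entry
  itemBytes: number[];  // sizes of its ContentItem entries (field 8)
  children: Entry[];    // nested content (field 10)
}
type Entry = Leaf | Frame;

type Bucket = LeafKind | "frame-overhead";
type Totals = Record<Bucket, number>;

function emptyTotals(): Totals {
  return { item: 0, textbox: 0, shape: 0, table: 0, hardchar: 0, "frame-overhead": 0 };
}

// Attribute bytes to buckets, recursing through Frames.
function attribute(entries: Entry[], totals: Totals = emptyTotals()): Totals {
  for (const e of entries) {
    if (e.kind !== "frame") {
      totals[e.kind] += e.bytes;
      continue;
    }
    const items = e.itemBytes.reduce((a, b) => a + b, 0);
    const nested = e.children.reduce((a, c) => a + c.bytes, 0);
    totals.item += items;
    // Whatever the Frame holds beyond its items and children counts as metadata.
    totals["frame-overhead"] += e.bytes - items - nested;
    attribute(e.children, totals);
  }
  return totals;
}

// Usage with made-up sizes: one top-level textbox and one Frame.
const page: Entry[] = [
  { kind: "textbox", bytes: 120 },
  {
    kind: "frame", bytes: 100, itemBytes: [40, 30],
    children: [{ kind: "hardchar", bytes: 10 }],
  },
];
console.log(attribute(page));
```

With this accounting the bucket totals sum to the top-level entry sizes, so no byte is counted twice when Frames nest.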

Combined

366 pb files · 2.84 MB → 862.6 KB gz (29.7%; 3.4× squeezable)

item 1.25 MB 44.0%
textbox 1.05 MB 36.8%
shape 11.3 KB 0.4%
table 78.3 KB 2.7%
hardchar 95.8 KB 3.3%
frame-overhead 329.6 KB 11.3%
page-meta 11.0 KB 0.4%
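
The header figures (raw size, gzip size, ratio, squeezable factor) could be reproduced along the following lines. This is a sketch under assumptions: it uses Node's built-in `zlib.gzipSync` at level 9, the function name `squeeze` is hypothetical, and it takes the squeezable factor to be raw size over gzip size, which agrees with the Combined header (2.84 MB / 862.6 KB ≈ 3.4). The input buffers are stand-ins, not real page protos.

```typescript
import { gzipSync } from "node:zlib";

interface SqueezeStats { raw: number; gz: number; ratio: number; factor: number }

// Concatenate page-proto buffers, gzip the stream at level 9, and report sizes.
function squeeze(pages: Uint8Array[]): SqueezeStats {
  const stream = Buffer.concat(pages);
  if (stream.length === 0) throw new RangeError("no page bytes to measure");
  const gz = gzipSync(stream, { level: 9 }).length;
  const raw = stream.length;
  return { raw, gz, ratio: gz / raw, factor: raw / gz };
}

// Stand-in data: 50 identical records mimic repeated Frame metadata.
const fake = Array.from({ length: 50 }, () => Buffer.from("frame{class_id:7}"));
const stats = squeeze(fake);
console.log(`${stats.raw} B -> ${stats.gz} B gz ` +
  `(${(stats.ratio * 100).toFixed(1)}%; ${stats.factor.toFixed(1)}x squeezable)`);
```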

PDF

167 pb files · 2.29 MB → 737.8 KB gz (31.4%; 3.2× squeezable)

item 1.16 MB 50.6%
textbox 831.3 KB 35.4%
hardchar 95.8 KB 4.1%
frame-overhead 207.4 KB 8.8%
page-meta 2.1 KB 0.1%

PPTX

199 pb files · 561.4 KB → 124.8 KB gz (22.2%; 4.5× squeezable)

item 91.4 KB 16.3%
textbox 238.8 KB 42.5%
shape 11.3 KB 2.0%
table 78.3 KB 13.9%
frame-overhead 122.2 KB 21.8%
page-meta 8.9 KB 1.6%

§4 Pruning rules

PDF — clean-pdf-extraction.ts

  • debug/, _profile/
  • blob-raster-sources/ (rasterization types)
  • progress.txt, font-route.jsonl, font-capabilities.jsonl, font_observation_pack.json, font_manifest.json, style_summary.json, design_system.json
  • styles.css (CSS is reconstructed from reftable.pb + classtable.pb)
  • preview_*.webp (regeneratable rasterized previews)
  • font/*.native-font.log, font/*_preview.png
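
As an illustration only, the PDF rule list could be expressed as a path predicate. This is not the contents of clean-pdf-extraction.ts: the names `PDF_PRUNE_DIRS`, `PDF_PRUNE_FILES`, `PDF_PRUNE_PATTERNS` and `shouldPrunePdf` are hypothetical, and the sketch assumes POSIX-style paths relative to the extraction root with every rule anchored at that root.

```typescript
// Hypothetical encoding of the PDF pruning rules listed above.
const PDF_PRUNE_DIRS = ["debug/", "_profile/", "blob-raster-sources/"];

const PDF_PRUNE_FILES = new Set([
  "progress.txt", "font-route.jsonl", "font-capabilities.jsonl",
  "font_observation_pack.json", "font_manifest.json",
  "style_summary.json", "design_system.json", "styles.css",
]);

const PDF_PRUNE_PATTERNS: RegExp[] = [
  /^preview_[^/]*\.webp$/,           // preview_*.webp
  /^font\/[^/]*\.native-font\.log$/, // font/*.native-font.log
  /^font\/[^/]*_preview\.png$/,      // font/*_preview.png
];

// True when a file (path relative to the extraction root) should be removed.
function shouldPrunePdf(relPath: string): boolean {
  return (
    PDF_PRUNE_DIRS.some((dir) => relPath.startsWith(dir)) ||
    PDF_PRUNE_FILES.has(relPath) ||
    PDF_PRUNE_PATTERNS.some((re) => re.test(relPath))
  );
}

console.log(shouldPrunePdf("debug/trace.txt")); // true
console.log(shouldPrunePdf("reftable.pb"));     // false
```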

PPTX — clean-pptx-extraction.ts

  • progress.txt, stray *.log
  • closure-matrix.json, promotion-candidates.json
  • blob-raster-sources/
  • extra/<id>/t<N>.pb (rasterization types)
  • preview_*.webp
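
To estimate how much the PPTX rules remove before deleting anything, a dry run over a file listing is one option. The sketch below is hypothetical and is not taken from clean-pptx-extraction.ts: `FileEntry`, `PPTX_PRUNE_PATTERNS` and `dryRun` are illustrative names, "stray *.log" is assumed to match at any depth, and `<N>` in `extra/<id>/t<N>.pb` is assumed to be a decimal number.

```typescript
interface FileEntry { path: string; bytes: number }

// Hypothetical encoding of the PPTX pruning rules listed above.
const PPTX_PRUNE_PATTERNS: RegExp[] = [
  /^progress\.txt$/,
  /\.log$/,                      // stray *.log, assumed at any depth
  /^closure-matrix\.json$/,
  /^promotion-candidates\.json$/,
  /^blob-raster-sources\//,
  /^extra\/[^/]+\/t\d+\.pb$/,    // extra/<id>/t<N>.pb, N assumed decimal
  /^preview_[^/]*\.webp$/,
];

// Split a listing into kept files and the total bytes that pruning would free.
function dryRun(files: FileEntry[]): { kept: FileEntry[]; prunedBytes: number } {
  const kept: FileEntry[] = [];
  let prunedBytes = 0;
  for (const f of files) {
    if (PPTX_PRUNE_PATTERNS.some((re) => re.test(f.path))) {
      prunedBytes += f.bytes;
    } else {
      kept.push(f);
    }
  }
  return { kept, prunedBytes };
}

// Made-up listing; the file names other than the rule targets are illustrative.
const listing: FileEntry[] = [
  { path: "extra/a1/t3.pb", bytes: 10 },
  { path: "extra/a1/data.pb", bytes: 5 },
  { path: "logs/run.log", bytes: 2 },
];
const result = dryRun(listing);
console.log(result.prunedBytes, result.kept.map((f) => f.path));
```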