On the storage cost of the LDF v0.1 extraction pipeline relative to source PDF and PPTX documents
Generated 2026-05-09T20:18:58.264Z · 36 documents
Frame metadata and class-id varints.
ContentItem (44.0%) and TextBox (36.8%) dominate page-proto bytes; per-Frame
metadata accounts for an additional 11.3%.
§1 Aggregate sizes
Table 1. Aggregate sizes by corpus.
| Corpus | n | Source | Extracted (raw) | Cleaned | tar.gz | Cleaned/Source | tar.gz/Source | Σ page.pb | page.pb gz/raw |
|---|---|---|---|---|---|---|---|---|---|
| PDF | 24 | 9.67 MB | 9.02 MB | 9.02 MB | 5.66 MB | 93.3% | 58.6% | 2.29 MB | 31.4% |
| PPTX | 12 | 6.37 MB | 5.91 MB | 5.91 MB | 4.88 MB | 92.9% | 76.6% | 561.4 KB | 22.2% |
| Combined | 36 | 16.03 MB | 14.93 MB | 14.93 MB | 10.54 MB | 93.1% | 65.7% | 2.84 MB | 29.7% |
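The ratio columns in Table 1 are plain quotients of the size columns. The following is a minimal sketch of that arithmetic, assuming binary units (1 MB = 1024 KB), which is consistent with the sums in this report. `parseSize` and `ratioPercent` are hypothetical helper names and are not part of the pipeline.

```typescript
// Hypothetical helpers for re-deriving the ratio columns of Table 1.
// Assumption: sizes use binary units (1 KB = 1024 B, 1 MB = 1024 KB).
const UNIT_BYTES: Record<string, number> = { B: 1, KB: 1024, MB: 1024 ** 2 };

// Convert a printed size such as "14.93 MB" into bytes.
function parseSize(text: string): number {
  const match = /^([\d.]+)\s*(B|KB|MB)$/.exec(text.trim());
  if (!match) throw new Error(`unrecognized size: ${text}`);
  return parseFloat(match[1]) * UNIT_BYTES[match[2]];
}

// Express one printed size as a percentage of another.
function ratioPercent(part: string, whole: string): number {
  return (parseSize(part) / parseSize(whole)) * 100;
}

// Combined row: Source 16.03 MB, Cleaned 14.93 MB, tar.gz 10.54 MB.
const cleanedOverSource = ratioPercent("14.93 MB", "16.03 MB");
const tarGzOverSource = ratioPercent("10.54 MB", "16.03 MB");
console.log(cleanedOverSource.toFixed(1), tarGzOverSource.toFixed(1));
```

Because the table prints rounded sizes, ratios recomputed this way can differ from the printed percentages by roughly 0.1 percentage points.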
§2 Per-document results
Composition chart legend (per-document charts not reproduced here): page.proto, sidecar.proto, font, img, vec, other.
Table 2. Per-PDF sizes.
| Document | Source | Cleaned | tar.gz | Cleaned/Source | tar.gz/Source |
|---|---|---|---|---|---|
| 05.pdf | 132.1 KB | 46.8 KB | 31.1 KB | 35.4% | 23.5% |
| 06.pdf | 167.2 KB | 664.2 KB | 652.4 KB | 397.2% | 390.2% |
| 07.pdf | 123.0 KB | 42.7 KB | 28.0 KB | 34.8% | 22.8% |
| 08.pdf | 152.4 KB | 557.3 KB | 544.6 KB | 365.7% | 357.4% |
| 12-12.pdf | 191.1 KB | 94.1 KB | 67.4 KB | 49.3% | 35.3% |
| 2505.18706v3-9.pdf | 105.8 KB | 54.2 KB | 42.0 KB | 51.2% | 39.7% |
| 2505.18706v3.pdf | 403.1 KB | 332.2 KB | 185.1 KB | 82.4% | 45.9% |
| Ali-Argun-Sayilgan-CV-ML.pdf | 117.4 KB | 51.2 KB | 37.6 KB | 43.6% | 32.0% |
| Ali_Argun_Sayilgan_CV.pdf | 104.1 KB | 44.4 KB | 33.1 KB | 42.7% | 31.8% |
| Chapter 5 Model Predictive Control-somepages.pdf | 298.5 KB | 223.6 KB | 104.2 KB | 74.9% | 34.9% |
| Chapter 5 Model Predictive Control.pdf | 1.32 MB | 1.24 MB | 521.7 KB | 93.7% | 38.5% |
| Chapter 5-11111.pdf | 126.9 KB | 61.5 KB | 44.2 KB | 48.5% | 34.8% |
| DIJJI.ai.pdf | 2.60 MB | 1.29 MB | 649.0 KB | 49.7% | 24.4% |
| cal1-somepage1.pdf | 88.7 KB | 40.3 KB | 22.0 KB | 45.4% | 24.8% |
| cal1-somepages.pdf | 1.06 MB | 1.57 MB | 1.42 MB | 148.7% | 134.4% |
| cal1-somepages11.pdf | 191.3 KB | 94.4 KB | 67.6 KB | 49.3% | 35.3% |
| data0-1.pdf | 22.3 KB | 55.1 KB | 53.9 KB | 246.8% | 241.6% |
| data0-10.pdf | 24.8 KB | 28.2 KB | 24.2 KB | 113.9% | 97.5% |
| data0.pdf | 153.2 KB | 607.0 KB | 341.1 KB | 396.2% | 222.6% |
| diji01.pdf | 805.1 KB | 1.35 MB | 549.6 KB | 171.1% | 68.3% |
| letterspcwht.pdf | 131.2 KB | 40.1 KB | 21.8 KB | 30.6% | 16.6% |
| matrixful1.pdf | 120.8 KB | 60.9 KB | 43.1 KB | 50.4% | 35.7% |
| rubin-pdf5.pdf | 1.12 MB | 473.0 KB | 232.0 KB | 41.3% | 20.3% |
| sat-complexnumbers0.pdf | 201.3 KB | 86.8 KB | 50.1 KB | 43.1% | 24.9% |
Table 3. Per-PPTX sizes.
| Document | Source | Cleaned | tar.gz | Cleaned/Source | tar.gz/Source |
|---|---|---|---|---|---|
| 1-Introduction.pptx | 1.14 MB | 1.07 MB | 1020.0 KB | 94.3% | 87.4% |
| Chapter8-Pres.pptx | 1.39 MB | 1.48 MB | 1.21 MB | 106.9% | 87.2% |
| PrimeFactorisation.pptx | 705.8 KB | 172.1 KB | 159.6 KB | 24.4% | 22.6% |
| RUBIN UX UI.pptx | 510.5 KB | 738.6 KB | 440.5 KB | 144.7% | 86.3% |
| Recordkeeping_Software_Presentation.pptx | 972.0 KB | 929.4 KB | 872.1 KB | 95.6% | 89.7% |
| charts-generated-basic.pptx | 68.7 KB | 50.4 KB | 12.8 KB | 73.4% | 18.6% |
| charts-generated-extra.pptx | 68.6 KB | 50.3 KB | 12.9 KB | 73.3% | 18.8% |
| cloud.pptx | 314.0 KB | 254.1 KB | 194.4 KB | 80.9% | 61.9% |
| ink-maybedraw.pptx | 106.7 KB | 62.3 KB | 56.9 KB | 58.4% | 53.4% |
| onenote-math-features.pptx | 841.5 KB | 802.4 KB | 774.5 KB | 95.4% | 92.0% |
| split_presentations_2.pptx | 154.1 KB | 222.4 KB | 143.9 KB | 144.3% | 93.4% |
| teach-a-level-computing-1-data-structures-2018.pptx | 188.9 KB | 156.3 KB | 70.3 KB | 82.7% | 37.2% |
§3 Page-proto field analysis
Each entry of the repeated field PageContent content = 3 in every page
proto is classified by its oneof variant. Because Frame is a
transparent grouping container, we recurse into Frame.children
(field 10) and attribute its repeated ContentItem content
(field 8) to item; the residual Frame metadata is reported as
frame-overhead. The squeezable note in each header is the raw-to-gzip
size ratio of the concatenated page-proto stream at gzip level 9, that is,
the reciprocal of the gz/raw percentage printed beside it.
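The attribution rule described above can be sketched as a short recursion. This is an illustration under stated assumptions, not the report's actual analyzer: the `Entry` types are a hypothetical, already-decoded view in which every message carries its encoded size, and field tags and length prefixes are ignored for simplicity.

```typescript
// Hypothetical decoded view of a page proto; not the LDF schema itself.
type Leaf = { kind: "item" | "textbox" | "shape" | "table" | "hardchar"; bytes: number };
type Frame = {
  kind: "frame";
  bytes: number;                 // encoded size of the whole Frame message
  children: Entry[];             // Frame.children (field 10)
  content: { bytes: number }[];  // repeated ContentItem content (field 8)
};
type Entry = Leaf | Frame;
type Totals = Record<string, number>;

function attribute(entries: Entry[], totals: Totals = {}): Totals {
  const add = (key: string, n: number) => {
    totals[key] = (totals[key] ?? 0) + n;
  };
  for (const entry of entries) {
    if (entry.kind !== "frame") {
      add(entry.kind, entry.bytes);
      continue;
    }
    // ContentItem entries held directly by the Frame count as "item".
    const contentBytes = entry.content.reduce((sum, c) => sum + c.bytes, 0);
    add("item", contentBytes);
    // Children are classified recursively, as if the Frame were transparent.
    const childBytes = entry.children.reduce((sum, c) => sum + c.bytes, 0);
    attribute(entry.children, totals);
    // Whatever remains of the Frame's own bytes is reported as metadata.
    add("frame-overhead", entry.bytes - contentBytes - childBytes);
  }
  return totals;
}

// Synthetic page: one TextBox and one Frame that nests a Shape and a Frame.
const page: Entry[] = [
  { kind: "textbox", bytes: 120 },
  {
    kind: "frame",
    bytes: 300,
    content: [{ bytes: 80 }, { bytes: 40 }],
    children: [
      { kind: "shape", bytes: 50 },
      { kind: "frame", bytes: 100, content: [{ bytes: 70 }], children: [] },
    ],
  },
];
const totals = attribute(page);
console.log(totals);
```

Every byte of the top-level entries lands in exactly one bucket, so the bucket totals sum to the size of the input entries. The page-meta bucket from the tables is not modeled here.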
Combined
366 pb files · 2.84 MB → 862.6 KB gz (29.7%; 3.4× squeezable)
| Variant | Size | Share |
|---|---|---|
| item | 1.25 MB | 44.0% |
| textbox | 1.05 MB | 36.8% |
| shape | 11.3 KB | 0.4% |
| table | 78.3 KB | 2.7% |
| hardchar | 95.8 KB | 3.3% |
| frame-overhead | 329.6 KB | 11.3% |
| page-meta | 11.0 KB | 0.4% |
PDF
167 pb files · 2.29 MB → 737.8 KB gz (31.4%; 3.2× squeezable)
| Variant | Size | Share |
|---|---|---|
| item | 1.16 MB | 50.6% |
| textbox | 831.3 KB | 35.4% |
| hardchar | 95.8 KB | 4.1% |
| frame-overhead | 207.4 KB | 8.8% |
| page-meta | 2.1 KB | 0.1% |
PPTX
199 pb files · 561.4 KB → 124.8 KB gz (22.2%; 4.5× squeezable)
| Variant | Size | Share |
|---|---|---|
| item | 91.4 KB | 16.3% |
| textbox | 238.8 KB | 42.5% |
| shape | 11.3 KB | 2.0% |
| table | 78.3 KB | 13.9% |
| frame-overhead | 122.2 KB | 21.8% |
| page-meta | 8.9 KB | 1.6% |
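The two figures in each header above are reciprocals of one another: for the combined corpus, 862.6 KB / 2.84 MB ≈ 29.7% and 1 / 0.297 ≈ 3.4×. A minimal sketch of how such a measurement could be taken with Node's `zlib` module follows. `squeezeStats` is a hypothetical helper name, and the input is synthetic rather than real page.pb data.

```typescript
import { gzipSync } from "node:zlib";

// Hypothetical helper: concatenate page-proto buffers, gzip at level 9,
// and report both the gz/raw percentage and the raw/gz ratio.
function squeezeStats(pagePbs: Uint8Array[]) {
  const raw = Buffer.concat(pagePbs);
  const gz = gzipSync(raw, { level: 9 });
  return {
    rawBytes: raw.length,
    gzBytes: gz.length,
    gzOverRawPercent: (gz.length / raw.length) * 100,
    squeezable: raw.length / gz.length,
  };
}

// Synthetic, highly repetitive input standing in for real page.pb files.
const fakePages = Array.from({ length: 50 }, () =>
  Buffer.from("TextBox{x:1,y:2,w:3,h:4}"),
);
const stats = squeezeStats(fakePages);
console.log(
  `${stats.rawBytes} B -> ${stats.gzBytes} B gz ` +
    `(${stats.gzOverRawPercent.toFixed(1)}%; ${stats.squeezable.toFixed(1)}x squeezable)`,
);
```

Compressing the concatenated stream, rather than each file separately, lets gzip exploit redundancy that repeats across pages.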
§4 Pruning rules
PDF — clean-pdf-extraction.ts
- debug/, _profile/
- blob-raster-sources/ (rasterization types)
- progress.txt, font-route.jsonl, font-capabilities.jsonl, font_observation_pack.json, font_manifest.json, style_summary.json, design_system.json
- styles.css (CSS is reconstructed from reftable.pb + classtable.pb)
- preview_*.webp (regeneratable rasterized previews)
- font/*.native-font.log, font/*_preview.png
PPTX — clean-pptx-extraction.ts
- progress.txt, stray *.log
- closure-matrix.json, promotion-candidates.json
- blob-raster-sources/
- extra/<id>/t<N>.pb (rasterization types)
- preview_*.webp
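As an illustration of how rules like these can be applied, the sketch below transcribes the PDF rules into a pattern list and filters a file listing. It is an assumption-laden example, not the contents of clean-pdf-extraction.ts: `PDF_PRUNE_PATTERNS`, `shouldPrune`, and the sample paths are hypothetical, and the patterns assume paths relative to the extraction root.

```typescript
// Hypothetical pattern list transcribed from the PDF pruning rules above.
// The real script may match paths differently.
const PDF_PRUNE_PATTERNS: RegExp[] = [
  /^debug\//,
  /^_profile\//,
  /^blob-raster-sources\//,
  /^(progress\.txt|font-route\.jsonl|font-capabilities\.jsonl)$/,
  /^(font_observation_pack|font_manifest|style_summary|design_system)\.json$/,
  /^styles\.css$/,
  /^preview_[^/]*\.webp$/,
  /^font\/[^/]*\.native-font\.log$/,
  /^font\/[^/]*_preview\.png$/,
];

// True when a root-relative path matches any pruning pattern.
function shouldPrune(relativePath: string): boolean {
  return PDF_PRUNE_PATTERNS.some((pattern) => pattern.test(relativePath));
}

// Sample listing with made-up file names.
const files = [
  "page/0001.pb",
  "debug/trace.json",
  "styles.css",
  "preview_1.webp",
  "font/f1.ttf",
  "font/f1.native-font.log",
];
const kept = files.filter((file) => !shouldPrune(file));
console.log(kept);
```

Keeping the rules as data rather than as branching logic makes it straightforward to maintain a second list for the PPTX rules.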