LDF v0.1 · page-protobuf reference

A compact, editable representation of PDFs and PPTX presentations as per-page protocol buffers.

LDF (Layout Data Format) v0.1 is the shared schema written by eddocu when extracting PDFs and by eddocupptx when extracting PowerPoint presentations. Across a corpus of 36 documents (24 PDFs, 12 PPTX), the cleaned extraction occupies 93.1% of source size as loose files and 65.7% when archived as a single tar.gz, i.e. about 34% smaller than the originals while preserving full edit fidelity.
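As a quick check on the headline figures, the ratios can be recomputed from the rounded MB totals shown on this page. This is only a sketch: the function name is illustrative, and the page itself presumably works from exact byte counts, so the last digit can differ by 0.1 point.

```python
# Headline totals copied from this page, in MB (rounded values).
SOURCE_MB = 16.03
CLEANED_MB = 14.93
TARGZ_MB = 10.54


def pct_of_source(size_mb: float, source_mb: float = SOURCE_MB) -> float:
    """Return size_mb as a percentage of the source corpus size."""
    return 100.0 * size_mb / source_mb


cleaned_pct = pct_of_source(CLEANED_MB)  # loose files vs source
targz_pct = pct_of_source(TARGZ_MB)      # single archive vs source
saving_pct = 100.0 - targz_pct           # how much smaller the archive is

print(f"cleaned {cleaned_pct:.2f}%  tar.gz {targz_pct:.2f}%  saving {saving_pct:.2f}%")
```

From the rounded inputs the tar.gz share comes out near 65.75%, which sits right at the rounding boundary of the reported 65.7%; the "about 34% smaller" claim follows as 100% minus that share.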

Documents: 36 (24 PDF · 12 PPTX)
Source corpus: 16.03 MB (extracted 14.93 MB raw)
Cleaned ÷ source: 93.1% (14.93 MB)
tar.gz ÷ source: 65.7% (10.54 MB)
LDF v0.1 schemas: 20 files (131 messages indexed)
page.pb gzip ratio: 29.7% (3.4× squeezable)
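The gzip ratio and the "squeezable" factor are reciprocals of each other. The sketch below shows one way such a ratio could be measured for a single page.pb payload. It assumes the ratio is compressed size divided by original size; the page does not say whether its 29.7% is a corpus-wide aggregate or a mean of per-file ratios, and the function names are illustrative.

```python
import gzip


def gzip_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (lower = more compressible)."""
    if not payload:
        raise ValueError("cannot compute a ratio for an empty payload")
    return len(gzip.compress(payload)) / len(payload)


def squeeze_factor(ratio: float) -> float:
    """How many times smaller the payload gets, given its gzip ratio."""
    return 1.0 / ratio


# A ratio of 29.7% corresponds to roughly a 3.4x reduction.
print(round(squeeze_factor(0.297), 1))
```

To apply it to a real file, read the bytes first, for example `gzip_ratio(open(path, "rb").read())`, where `path` points at a page-level protobuf.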

§1 Source vs cleaned vs gzipped

Figure 1. Total bytes per stage, by format. Source documents are PDFs (Flate streams) and PPTX (ZIP). Cleaned = loose files after pruning logs / telemetry / raster sources / preview rasters / styles.css. tar.gz = same files archived. Lower bars are better.

Combined

n = 36
source 16.03 MB · cleaned 14.93 MB · tar.gz 10.54 MB
tar.gz is 65.7% of source

PDFs

n = 24
source 9.67 MB · cleaned 9.02 MB · tar.gz 5.66 MB
tar.gz is 58.6% of source

PPTXs

n = 12
source 6.37 MB · cleaned 5.91 MB · tar.gz 4.88 MB
tar.gz is 76.6% of source
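The three stages in Figure 1 could be measured for one document roughly as follows. This is a minimal sketch, not the pipeline's own tooling: it assumes the cleaned extraction is already a directory of loose files, uses Python's default gzip level for the archive (the page does not state the archiver settings), and all function names are hypothetical.

```python
import io
import tarfile
from pathlib import Path


def loose_size(root: Path) -> int:
    """Total bytes of all regular files under root (the 'cleaned' stage)."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def targz_size(root: Path) -> int:
    """Bytes of a tar.gz of root, built in memory (the 'tar.gz' stage)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(root, arcname=root.name)
    # The gzip trailer is written when the tarfile closes, so measure afterwards.
    return len(buf.getvalue())


def stage_report(source_bytes: int, cleaned_dir: Path) -> dict:
    """Sizes of both derived stages, plus each as a percentage of the source."""
    cleaned = loose_size(cleaned_dir)
    archived = targz_size(cleaned_dir)
    return {
        "cleaned_bytes": cleaned,
        "targz_bytes": archived,
        "cleaned_pct": 100 * cleaned / source_bytes,
        "targz_pct": 100 * archived / source_bytes,
    }
```

Summing `source_bytes`, `cleaned_bytes` and `targz_bytes` per format before dividing gives aggregate ratios of the kind shown above, rather than a mean of per-document ratios.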

§2 What lives inside the cleaned extraction

Figure 2. Cleaned extraction composition by asset class. page.proto (the per-page editable payloads) and img (raster assets) together account for about 76% of the combined cleaned bytes. Sidecars (theme.pb, reftable.pb, classtable.pb, metadata.pb) are flattened reference tables shared across pages.

Combined

14.93 MB cleaned
page.proto 2.84 MB · 19.0%
sidecar.proto 200.8 KB · 1.3%
font 1.97 MB · 13.2%
img 8.55 MB · 57.3%
vec 1.19 MB · 8.0%
other 181.4 KB · 1.2%

PDFs

9.02 MB cleaned
page.proto 2.29 MB · 25.4%
sidecar.proto 144.4 KB · 1.6%
font 1.97 MB · 21.9%
img 3.58 MB · 39.6%
vec 1.03 MB · 11.5%
other 702 B · 0.0%

PPTXs

5.91 MB cleaned
page.proto 564.7 KB · 9.3%
sidecar.proto 56.4 KB · 0.9%
img 4.98 MB · 84.2%
vec 157.5 KB · 2.6%
other 180.7 KB · 3.0%
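The bucket breakdown in Figure 2 amounts to classifying each file and summing bytes per class. The sketch below illustrates the idea under stated assumptions: the sidecar names come from the figure caption, but the extension sets and the rule "any other .pb file is a page payload" are guesses made for illustration, not the extractor's actual rules.

```python
from collections import Counter
from pathlib import Path

# Sidecar names as listed in the Figure 2 caption.
SIDECARS = {"theme.pb", "reftable.pb", "classtable.pb", "metadata.pb"}
# ASSUMED extension sets; the real pipeline may classify differently.
FONT_EXT = {".ttf", ".otf", ".woff", ".woff2"}
IMG_EXT = {".png", ".jpg", ".jpeg", ".gif", ".webp"}
VEC_EXT = {".svg"}


def bucket_for(path: Path) -> str:
    """Map a file path to one of the asset classes used in Figure 2."""
    name, ext = path.name.lower(), path.suffix.lower()
    if name in SIDECARS:
        return "sidecar.proto"
    if ext == ".pb":  # assumption: every non-sidecar .pb is a page payload
        return "page.proto"
    if ext in FONT_EXT:
        return "font"
    if ext in IMG_EXT:
        return "img"
    if ext in VEC_EXT:
        return "vec"
    return "other"


def composition(files):
    """files: iterable of (path, size_in_bytes). Returns {bucket: (bytes, pct)}."""
    totals = Counter()
    for path, size in files:
        totals[bucket_for(Path(path))] += size
    grand_total = sum(totals.values())
    return {b: (n, 100 * n / grand_total) for b, n in totals.items()}
```

Feeding it `(path, path.stat().st_size)` pairs from a walk of the cleaned directory would give a table of the same shape as the lists above.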

§3 What's inside page.pb

Figure 3. Internal composition of every page-level protobuf in the corpus, after recursively descending the Frame wrapper so bytes are attributed to leaf payloads. ContentItem (img / vec / chart / media references) and TextBox dominate, together accounting for about 81% of the combined 2.84 MB.

Combined · 2.84 MB

item 1.25 MB
textbox 1.05 MB
shape 11.3 KB
table 78.3 KB
hardchar 95.8 KB
frame-overhead 329.6 KB
page-meta 11.0 KB

PDF · 2.29 MB

item 1.16 MB
textbox 831.3 KB
hardchar 95.8 KB
frame-overhead 207.4 KB
page-meta 2.1 KB

PPTX · 561.4 KB

item 91.4 KB
textbox 238.8 KB
shape 11.3 KB
table 78.3 KB
frame-overhead 122.2 KB
page-meta 8.9 KB
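The attribution behind Figure 3 (descend through Frame wrappers, charge bytes to leaf payloads, count the rest as overhead) can be sketched with a small protobuf wire-format walker. The field numbers below are HYPOTHETICAL placeholders chosen for illustration; the real ones are defined in the schema files under /proto. Whether the page's own analysis counts tag and length prefixes as frame-overhead is also an assumption here.

```python
from collections import Counter

# HYPOTHETICAL field numbers, for illustration only (not taken from the LDF schema).
FRAME_CHILD = 1  # nested Frame
LEAF_BUCKETS = {2: "textbox", 3: "item", 4: "shape", 5: "table", 6: "hardchar"}


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def attribute_frame(buf: bytes, totals: Counter) -> None:
    """Walk one serialized Frame and add its bytes to per-bucket totals."""
    pos = 0
    while pos < len(buf):
        start = pos
        tag, pos = read_varint(buf, pos)
        field, wire_type = tag >> 3, tag & 7
        if wire_type == 0:      # varint scalar
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:    # fixed 64-bit
            pos += 8
        elif wire_type == 5:    # fixed 32-bit
            pos += 4
        elif wire_type == 2:    # length-delimited: sub-message, string or bytes
            length, pos = read_varint(buf, pos)
            payload = buf[pos:pos + length]
            header = pos - start  # tag + length prefix
            pos += length
            if field == FRAME_CHILD:
                totals["frame-overhead"] += header
                attribute_frame(payload, totals)  # recurse into the nested Frame
                continue
            if field in LEAF_BUCKETS:
                totals["frame-overhead"] += header
                totals[LEAF_BUCKETS[field]] += length
                continue
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        # Scalars and unrecognized fields are charged to the wrapper itself.
        totals["frame-overhead"] += pos - start
```

Because every byte is charged to exactly one bucket, the totals for a well-formed buffer always sum to its length, which makes the breakdown easy to sanity-check against the file size.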

§4 Where to go next

/proto — schema reference

20 schema files, 131 messages, cross-linked. Start at page.proto (the per-page top-level protobuf) or browse the index.

/storage — full investigation

Per-document tables (source / cleaned / tar.gz), bucket composition, page-proto field-level breakdown, pruning rules, and discussion of why the loose-files extraction is larger than the source.

/performance — extraction wall-clock

Per-document extraction time and throughput in pages/second.

/methodology — pipeline

Phases (parse → layout → emit), output contract, sidecars, and the cleaning rules that define "what counts as the editable extraction".