A compact, editable representation of PDFs and PPTX presentations as per-page protocol buffers.
LDF (Layout Data Format) v0.1 is the shared schema written by eddocu when extracting PDFs and by eddocupptx when extracting PowerPoint presentations. Across a corpus of 36 documents (24 PDFs, 12 PPTX), the cleaned extraction occupies 93.1% of the source size as loose files and 65.7% when archived as a single tar.gz. The archive is therefore about 34% smaller than the originals while preserving full edit fidelity.
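The two percentages above are size ratios against the source documents. As a rough sketch of how such figures could be measured for one extraction directory (the function names here are illustrative assumptions, not part of eddocu or eddocupptx, and the default gzip settings are also an assumption):

```python
import io
import tarfile
from pathlib import Path


def loose_size(root: Path) -> int:
    """Sum the sizes of all regular files under root (the 'loose files' figure)."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def targz_size(root: Path) -> int:
    """Size in bytes of a gzip-compressed tar of root, built in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(root, arcname=root.name)
    # The archive is finalized once the with-block exits.
    return len(buf.getvalue())


def percent_of_source(stage_bytes: int, source_bytes: int) -> float:
    """Stage size expressed as a percentage of the source size."""
    return 100.0 * stage_bytes / source_bytes


def percent_saved(stage_bytes: int, source_bytes: int) -> float:
    """Reduction relative to the source, in percent."""
    return 100.0 - percent_of_source(stage_bytes, source_bytes)
```

With these helpers, a stage at 65.7% of the source corresponds to a saving of 34.3%, which matches the "about 34% smaller" figure quoted above.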
§1 Source vs cleaned vs gzipped
Figure 1. Total bytes per stage, by format. Source documents are PDFs (Flate streams) and PPTX (ZIP). Cleaned = loose files after pruning logs / telemetry / raster sources / preview rasters / styles.css. tar.gz = same files archived. Lower bars are better.
Panels: Combined (n = 36), PDFs (n = 24), PPTXs (n = 12).
§2 What lives inside the cleaned extraction
Figure 2. Cleaned extraction composition by asset class. page.proto (the per-page editable payloads) and img (raster assets) account for the majority. Sidecars (theme.pb, reftable.pb, classtable.pb, metadata.pb) are flattened reference tables shared across pages.
Panels: Combined, PDFs, PPTXs.
§3 What's inside page.pb
Figure 3. Internal composition of every page-level protobuf in the corpus, after recursively descending the Frame wrapper so bytes are attributed to leaf payloads. ContentItem (img / vec / chart / media references) and TextBox dominate.
Panels: Combined (2.84 MB), PDF (2.29 MB), PPTX (561.4 KB).
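The "recursively descending the Frame wrapper" step in Figure 3 can be pictured with a small sketch. The `Node` type below is a hypothetical stand-in for a decoded message with a known serialized size; it is not the LDF schema, and how wrapper overhead is counted in the real analysis is an assumption here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a decoded message; not the LDF schema."""
    kind: str   # e.g. "Frame", "TextBox", "ContentItem"
    size: int   # serialized size in bytes, including any children
    children: list["Node"] = field(default_factory=list)


def attribute_bytes(node: Node, wrapper: str = "Frame") -> Counter:
    """Attribute serialized bytes to leaf payload kinds.

    Wrapper nodes with children are descended into; only their own
    overhead (size minus the children's sizes) stays under the wrapper name.
    """
    totals: Counter = Counter()
    if node.kind == wrapper and node.children:
        totals[wrapper] += node.size - sum(c.size for c in node.children)
        for child in node.children:
            totals.update(attribute_bytes(child, wrapper))
    else:
        totals[node.kind] += node.size
    return totals


# A 120-byte page: an outer Frame holding a TextBox and a nested Frame.
page = Node("Frame", 120, [
    Node("TextBox", 50),
    Node("Frame", 60, [Node("ContentItem", 55)]),
])
breakdown = attribute_bytes(page)
```

In this toy page, 50 bytes go to TextBox, 55 to ContentItem, and the remaining 15 bytes are Frame overhead, so the per-kind totals still sum to the page size.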
§4 Where to go next
/proto — schema reference
20 schema files, 131 messages, cross-linked. Start at page.proto, the per-page top-level protobuf, or browse the index.
/storage — full investigation
Per-document tables (source / cleaned / tar.gz), bucket composition, page-proto field-level breakdown, pruning rules, and discussion of why the loose-files extraction is larger than the source.
/performance — extraction wall-clock
Per-document extraction time, throughput in pages/second.
/methodology — pipeline
Phases (parse → layout → emit), output contract, sidecars, and the cleaning rules that define "what counts as the editable extraction".