Extraction methodology
How a PDF or PPTX becomes a tree of editable protobufs on disk.
§1 Pipeline at a glance
MuPDF for PDFs; ZIP + DrawingML XML for PPTX. Produces opaque page objects we can iterate.
Glyph runs → spans → lines → text-boxes; vector paint → SVG; native DrawingML shapes; tables; charts.
Colours, fonts, sizes, line-heights, letter-spacings deduped into reftable.pb; per-class property bags into classtable.pb.
One page.proto per page; assets to img/, vec/, font/; metadata.pb / theme.pb / theme-layouts.pb.
Charts, drawings, smartart, ink, raster sources written under extra/<id>/ and blob-raster-sources/<hash>/ for lazy loading.
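
The deduplication step can be pictured as value interning: each distinct colour, font or size is stored once and every later occurrence refers to it by index. The following is a minimal sketch of that idea only; `InternTable`, `Rgba` and the key function are hypothetical names and say nothing about the actual layout of reftable.pb.

```typescript
// Hypothetical interning table: each distinct value is stored once and
// referenced by its index. Illustrative only, not the real reftable schema.
class InternTable<T> {
  private readonly ids = new Map<string, number>();
  private readonly keyOf: (value: T) => string;
  readonly values: T[] = [];

  constructor(keyOf: (value: T) => string) {
    this.keyOf = keyOf;
  }

  // Return the id of an equal, already stored value, or append a new entry.
  intern(value: T): number {
    const key = this.keyOf(value);
    const existing = this.ids.get(key);
    if (existing !== undefined) return existing;
    const id = this.values.length;
    this.values.push(value);
    this.ids.set(key, id);
    return id;
  }
}

interface Rgba { r: number; g: number; b: number; a: number }

// Assumed key: the four channels joined into a string.
const colours = new InternTable<Rgba>((c) => `${c.r},${c.g},${c.b},${c.a}`);
const red1 = colours.intern({ r: 255, g: 0, b: 0, a: 255 });
const blue = colours.intern({ r: 0, g: 0, b: 255, a: 255 });
const red2 = colours.intern({ r: 255, g: 0, b: 0, a: 255 });
```

Here `red1` and `red2` resolve to the same id, so a page that uses the same red many times stores only a small index per use.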
§2 Output contract
Every extraction produces a directory the proto-based renderer
(proto2html) can consume directly without re-reading the
original document. The on-disk layout is identical for PDF and PPTX so a
single rendering pipeline serves both.
<output>/
├── 0.pb 1.pb 2.pb … # PPTX: per-slide page.proto
├── <id>/0.pb # PDF: per-page page.proto under content-id dir
├── extra/<id>/0.pb # PPTX: typed sidecars (charts.pb, drawing.pb, …)
├── metadata.pb # document title, author, page count, …
├── theme.pb # design system: palette / fonts / sizes
├── theme-layouts.pb # PPTX: master+layout templates
├── reftable.pb # interned colours, fonts, sizes, spacings
├── classtable.pb # span/line CSS classes (id-keyed)
├── pagenotes.pb # PPTX: speaker notes per slide
├── img/<hash>.<ext> # raster assets (webp/png/jpg)
├── vec/<hash>.svg # vector assets (SVG)
├── font/<id>.woff2 + .charmap # font subsets + glyph maps
└── blob-raster-sources/<hash>/ # cold typed raster sources (rasterization types)

§3 What "cleaned" means
The storage investigation in /storage measures the extraction after pruning artefacts that are diagnostic-only or regeneratable. Two TypeScript programs implement the cleaning rules:
| File | Removed |
|---|---|
| clean-pdf-extraction.ts | debug/, _profile/, blob-raster-sources/, telemetry JSONL/JSON (font-route, font-capabilities, font_observation_pack, style_summary, design_system), font/*.native-font.log, font/*_preview.png, preview_*.webp, styles.css, progress.txt |
| clean-pptx-extraction.ts | progress.txt, closure-matrix.json, promotion-candidates.json, blob-raster-sources/, extra/<id>/t<N>.pb, preview_*.webp, stray *.log |
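
As a rough illustration of how such rules might be applied, the sketch below turns part of the clean-pdf-extraction.ts row into path patterns. It is not the real program. The function name, the regular expressions, and the assumption that paths are relative to the extraction root with "/" separators are all illustrative. The telemetry JSONL/JSON files are left out because their exact file names are not given here.

```typescript
// Illustrative patterns derived from the clean-pdf-extraction.ts row.
// Assumption: paths are relative to the extraction root and use "/".
const PDF_REMOVABLE: RegExp[] = [
  /^debug\//,                         // debug/
  /^_profile\//,                      // _profile/
  /^blob-raster-sources\//,           // blob-raster-sources/
  /^font\/[^/]+\.native-font\.log$/,  // font/*.native-font.log
  /^font\/[^/]+_preview\.png$/,       // font/*_preview.png
  /(^|\/)preview_[^/]*\.webp$/,       // preview_*.webp (any depth, assumed)
  /^styles\.css$/,                    // styles.css (root location assumed)
  /^progress\.txt$/,                  // progress.txt (root location assumed)
];

// True when the path matches one of the removable-artefact patterns.
function isRemovablePdfArtefact(relPath: string): boolean {
  return PDF_REMOVABLE.some((re) => re.test(relPath));
}
```

A cleaner built this way would walk the output directory and delete every path for which `isRemovablePdfArtefact` returns true, leaving files such as reftable.pb and font/<id>.woff2 untouched.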
§4 The shared schema
Both extractors emit the same protobuf wire format. The schema is the
LDF v0.1 namespace and lives in /proto. The top-level page envelope is
Page; its repeated PageContent content field holds entries that are each
a oneof over frame, textbox, item, hardchar, table, shape.
The reference catalogues are reftable.proto, classtable.proto, and the
theme tree.
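
To make the oneof concrete, here is a minimal sketch of how a consumer might model a decoded Page as a TypeScript discriminated union. The `kind` tag, the `unknown` payloads and `countByKind` are assumptions for illustration. The types actually generated from the files in /proto may look different.

```typescript
// Hypothetical TypeScript view of the PageContent oneof.
// Payload shapes are placeholders, not the real message definitions.
type PageContent =
  | { kind: "frame"; frame: unknown }
  | { kind: "textbox"; textbox: unknown }
  | { kind: "item"; item: unknown }
  | { kind: "hardchar"; hardchar: unknown }
  | { kind: "table"; table: unknown }
  | { kind: "shape"; shape: unknown };

interface Page {
  content: PageContent[]; // mirrors `repeated PageContent content`
}

// Count how many entries of each oneof variant a page contains.
function countByKind(page: Page): Record<PageContent["kind"], number> {
  const counts = { frame: 0, textbox: 0, item: 0, hardchar: 0, table: 0, shape: 0 };
  for (const entry of page.content) {
    counts[entry.kind] += 1;
  }
  return counts;
}
```

Because every entry carries exactly one variant, a renderer can switch on the tag and handle PDF and PPTX pages with the same code path.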