LDF v0.1 · pipeline

Extraction methodology

How a PDF or PPTX becomes a tree of editable protobufs on disk.

§1 Pipeline at a glance

1. Parse

Source decode

MuPDF for PDFs; ZIP + DrawingML XML for PPTX. Produces opaque page objects we can iterate.

2. Layout

Structural typing

Glyph runs → spans → lines → text-boxes; vector paint → SVG; native DrawingML shapes; tables; charts.

3. Theme

Style flattening

Colours, fonts, sizes, line-heights, letter-spacings deduped into reftable.pb; per-class property bags into classtable.pb.

4. Emit

page.proto + assets

One page.proto per page; assets to img/, vec/, font/; metadata.pb / theme.pb / theme-layouts.pb.

5. Sidecars

Cold typed payloads

Charts, drawings, smartart, ink, raster sources written under extra/<id>/ and blob-raster-sources/<hash>/ for lazy loading.
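
The deduplication in step 3 can be pictured as value interning: each distinct colour, font, or size gets a small integer id, and spans store the id instead of the value. The sketch below is illustrative only. `RefTable`, `intern`, and the lower-casing key function are hypothetical names and choices, not the extractor's actual code or the reftable.pb schema.

```typescript
// Hypothetical sketch of style interning (step 3, Theme).
// Names and the canonicalisation rule are assumptions for illustration.
class RefTable<T> {
  private readonly ids = new Map<string, number>();
  private readonly key: (value: T) => string;
  readonly values: T[] = [];

  // `key` canonicalises a value so that equal styles collapse to one entry.
  constructor(key: (value: T) => string) {
    this.key = key;
  }

  // Returns the existing id for a value, or appends it and returns the new id.
  intern(value: T): number {
    const k = this.key(value);
    const existing = this.ids.get(k);
    if (existing !== undefined) return existing;
    const id = this.values.length;
    this.values.push(value);
    this.ids.set(k, id);
    return id;
  }
}

// Colours that differ only in case share one table entry.
const colours = new RefTable<string>((c) => c.trim().toLowerCase());
const spanColourIds = ["#FF0000", "#ff0000", "#00ff00"].map((c) => colours.intern(c));
```

Here `spanColourIds` refers to two table entries for three spans; the first spelling seen is the one stored.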

§2 Output contract

Every extraction produces a directory the proto-based renderer (proto2html) can consume directly without re-reading the original document. The on-disk layout is identical for PDF and PPTX so a single rendering pipeline serves both.

<output>/
├── 0.pb 1.pb 2.pb …             # PPTX: per-slide page.proto
├── <id>/0.pb                    # PDF:  per-page page.proto under content-id dir
├── extra/<id>/0.pb              # PPTX: typed sidecars (charts.pb, drawing.pb, …)
├── metadata.pb                  # document title, author, page count, …
├── theme.pb                     # design system: palette / fonts / sizes
├── theme-layouts.pb             # PPTX: master+layout templates
├── reftable.pb                  # interned colours, fonts, sizes, spacings
├── classtable.pb                # span/line CSS classes (id-keyed)
├── pagenotes.pb                 # PPTX: speaker notes per slide
├── img/<hash>.<ext>             # raster assets (webp/png/jpg)
├── vec/<hash>.svg               # vector assets (SVG)
├── font/<id>.woff2 + .charmap   # font subsets + glyph maps
└── blob-raster-sources/<hash>/  # cold typed raster sources (rasterization types)
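
As a reading aid, the layout above can be expressed as a small path classifier. This is a sketch under assumptions: the `Role` names and `classifyEntry` are hypothetical, page files are assumed to be named `<n>.pb`, and any two-segment `<id>/<n>.pb` path outside the known asset directories is assumed to be a PDF page.

```typescript
// Hypothetical classifier for relative paths inside an extraction directory.
// Role names and matching rules are assumptions drawn from the tree above.
type Role =
  | "page" | "sidecar" | "catalogue" | "notes"
  | "image" | "vector" | "font" | "raster-source" | "unknown";

const CATALOGUES = new Set([
  "metadata.pb", "theme.pb", "theme-layouts.pb", "reftable.pb", "classtable.pb",
]);

function classifyEntry(relPath: string): Role {
  const parts = relPath.split("/");
  const head = parts[0] ?? "";
  if (parts.length === 1) {
    if (CATALOGUES.has(head)) return "catalogue";
    if (head === "pagenotes.pb") return "notes";
    if (/^\d+\.pb$/.test(head)) return "page"; // PPTX: per-slide page.proto
    return "unknown";
  }
  switch (head) {
    case "extra": return "sidecar";
    case "img": return "image";
    case "vec": return "vector";
    case "font": return "font";
    case "blob-raster-sources": return "raster-source";
  }
  // PDF: page.proto under a content-id directory (assumed shape <id>/<n>.pb).
  if (parts.length === 2 && /^\d+\.pb$/.test(parts[1] ?? "")) return "page";
  return "unknown";
}
```

A renderer-side check could use this to confirm that every entry in a directory has a known role before loading it.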

§3 What "cleaned" means

The storage investigation in /storage measures the extraction after pruning artefacts that are diagnostic-only or regeneratable. Two TypeScript programs implement the cleaning rules:

File	Removed
clean-pdf-extraction.ts	`debug/`, `_profile/`, `blob-raster-sources/`, telemetry JSONL/JSON (`font-route`, `font-capabilities`, `font_observation_pack`, `style_summary`, `design_system`), `font/.native-font.log`, `font/_preview.png`, `preview_*.webp`, `styles.css`, `progress.txt`.
clean-pptx-extraction.ts	`progress.txt`, `closure-matrix.json`, `promotion-candidates.json`, `blob-raster-sources/`, `extra/<id>/t<N>.pb`, `preview_*.webp`, stray `*.log`.
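
A minimal sketch of how such a rule list might be applied to a directory listing follows. It is not the contents of clean-pptx-extraction.ts: `PPTX_REMOVE`, `shouldRemove`, and `partition` are hypothetical names, and the regular expressions are one reading of the PPTX row, treating the preview and log entries as glob-style patterns and `<N>` as a decimal number.

```typescript
// Hypothetical removal rules for a PPTX extraction, matched against
// forward-slash relative paths. Pattern details are assumptions.
const PPTX_REMOVE: RegExp[] = [
  /(^|\/)progress\.txt$/,
  /(^|\/)closure-matrix\.json$/,
  /(^|\/)promotion-candidates\.json$/,
  /^blob-raster-sources\//,        // whole directory
  /^extra\/[^/]+\/t\d+\.pb$/,      // extra/<id>/t<N>.pb
  /(^|\/)preview_[^/]*\.webp$/,    // preview images
  /\.log$/,                        // stray logs anywhere
];

function shouldRemove(relPath: string): boolean {
  return PPTX_REMOVE.some((rule) => rule.test(relPath));
}

// Splits a listing into files to keep and files to delete, without touching disk.
function partition(paths: readonly string[]): { keep: string[]; remove: string[] } {
  const keep: string[] = [];
  const remove: string[] = [];
  for (const p of paths) (shouldRemove(p) ? remove : keep).push(p);
  return { keep, remove };
}

const cleaned = partition([
  "0.pb", "extra/c1/0.pb", "extra/c1/t3.pb", "preview_0.webp", "font/build.log",
]);
```

Keeping the decision separate from the deletion makes a dry run trivial: print `cleaned.remove` and inspect it before removing anything.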

§4 The shared schema

Both extractors emit the same protobuf wire format. The schema is defined in the LDF v0.1 namespace and lives in /proto. The top-level page envelope is Page; its repeated content field holds PageContent messages, each of which is a oneof over frame, textbox, item, hardchar, table, and shape. The reference catalogues are reftable.proto, classtable.proto, and the theme tree.
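
For readers more at home in TypeScript than protobuf, the Page envelope can be approximated as a list of tagged entries. The sketch below is an illustration only: the `kind` tag, the opaque `payload` field, and `countByKind` are hypothetical and do not reflect the field names or generated types in /proto. Only the six variant names come from the schema description.

```typescript
// Illustrative stand-in for the Page / PageContent shape.
// `kind` and `payload` are placeholders, not real schema fields.
const CONTENT_KINDS = ["frame", "textbox", "item", "hardchar", "table", "shape"] as const;
type ContentKind = (typeof CONTENT_KINDS)[number];

interface PageContent {
  kind: ContentKind; // which oneof variant is set
  payload: unknown;  // variant body, left opaque in this sketch
}

interface Page {
  content: PageContent[]; // repeated PageContent
}

// Tallies how many entries of each variant a page contains.
function countByKind(page: Page): Record<ContentKind, number> {
  const counts: Record<ContentKind, number> = {
    frame: 0, textbox: 0, item: 0, hardchar: 0, table: 0, shape: 0,
  };
  for (const entry of page.content) counts[entry.kind] += 1;
  return counts;
}

const samplePage: Page = {
  content: [
    { kind: "textbox", payload: null },
    { kind: "shape", payload: null },
    { kind: "textbox", payload: null },
  ],
};
const tally = countByKind(samplePage);
```

Because exactly one variant is set per entry, a consumer can dispatch on the tag alone, which is what lets one renderer walk PDF and PPTX pages with the same code.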