LDF v0.1 · pipeline

Extraction methodology

How a PDF or PPTX becomes a tree of editable protobufs on disk.

§1 Pipeline at a glance

1. Parse
Source decode

MuPDF for PDFs; ZIP + DrawingML XML for PPTX. Produces opaque page objects we can iterate.

2. Layout
Structural typing

Glyph runs → spans → lines → text-boxes; vector paint → SVG; native DrawingML shapes; tables; charts.

3. Theme
Style flattening

Colours, fonts, sizes, line-heights, letter-spacings deduped into reftable.pb; per-class property bags into classtable.pb.

4. Emit
page.proto + assets

One page.proto per page; assets to img/, vec/, font/; metadata.pb / theme.pb / theme-layouts.pb.

5. Sidecars
Cold typed payloads

Charts, drawings, smartart, ink, raster sources written under extra/<id>/ and blob-raster-sources/<hash>/ for lazy loading.
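
The deduplication in the Theme stage can be pictured as value interning: each distinct colour, font, or size is stored once and referenced by a small id. The following is a minimal sketch under that assumption, not the actual reftable.pb writer; `InternTable`, its methods, and the lower-case hex normalisation are hypothetical names and choices introduced for illustration.

```typescript
// Hypothetical interning table: stores each distinct value once and hands
// back a stable numeric id. Not the actual reftable.pb writer.
class InternTable<T> {
  private readonly ids = new Map<string, number>();
  private readonly values: T[] = [];
  private readonly keyOf: (value: T) => string;

  // keyOf turns a value into a canonical string used for equality.
  constructor(keyOf: (value: T) => string) {
    this.keyOf = keyOf;
  }

  // Returns the existing id for an equal value, or appends a new entry.
  intern(value: T): number {
    const key = this.keyOf(value);
    const existing = this.ids.get(key);
    if (existing !== undefined) return existing;
    const id = this.values.length;
    this.values.push(value);
    this.ids.set(key, id);
    return id;
  }

  // Looks up the first value that was stored under this id.
  get(id: number): T | undefined {
    return this.values[id];
  }

  get size(): number {
    return this.values.length;
  }
}

// Colours are keyed by lower-case hex, so "#FF0000" and "#ff0000" share an id.
const colours = new InternTable<string>((c) => c.toLowerCase());
const spanColours = ["#FF0000", "#00ff00", "#ff0000"].map((c) => colours.intern(c));
console.log(spanColours, colours.size); // ids 0, 1, 0 and two distinct entries
```

Spans then carry only the id, which is why the catalogue has to ship alongside every page.proto.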

§2 Output contract

Every extraction produces a directory that the proto-based renderer (proto2html) can consume directly, without re-reading the original document. PDF and PPTX share the same on-disk layout, apart from the format-specific entries marked in the tree below, so a single rendering pipeline serves both.

<output>/
├── 0.pb 1.pb 2.pb …             # PPTX: per-slide page.proto
├── <id>/0.pb                    # PDF:  per-page page.proto under content-id dir
├── extra/<id>/0.pb              # PPTX: typed sidecars (charts.pb, drawing.pb, …)
├── metadata.pb                  # document title, author, page count, …
├── theme.pb                     # design system: palette / fonts / sizes
├── theme-layouts.pb             # PPTX: master+layout templates
├── reftable.pb                  # interned colours, fonts, sizes, spacings
├── classtable.pb                # span/line CSS classes (id-keyed)
├── pagenotes.pb                 # PPTX: speaker notes per slide
├── img/<hash>.<ext>             # raster assets (webp/png/jpg)
├── vec/<hash>.svg               # vector assets (SVG)
├── font/<id>.woff2 + .charmap   # font subsets + glyph maps
└── blob-raster-sources/<hash>/  # cold typed raster sources (rasterization types)
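
One way to make the contract concrete is to classify each path in the tree above by its role. The sketch below is illustrative only: the `Role` names and the `classify` function are hypothetical, and it assumes forward-slash relative paths, `<id>` and `<hash>` segments that contain no slash, and numeric file names for PDF pages (the tree shows only 0.pb).

```typescript
// Role names are hypothetical labels for the entries in the directory tree.
type Role =
  | "page"
  | "sidecar"
  | "catalogue"
  | "raster"
  | "vector"
  | "font"
  | "cold-raster-source"
  | "unknown";

// Root-level protobuf files named in the output contract.
const CATALOGUES = new Set([
  "metadata.pb",
  "theme.pb",
  "theme-layouts.pb",
  "reftable.pb",
  "classtable.pb",
  "pagenotes.pb",
]);

// Classifies a path relative to <output>/, using forward slashes.
function classify(relPath: string): Role {
  if (CATALOGUES.has(relPath)) return "catalogue";
  if (/^extra\/[^/]+\/[^/]+\.pb$/.test(relPath)) return "sidecar";
  if (/^img\/[^/]+\.(webp|png|jpg)$/.test(relPath)) return "raster";
  if (/^vec\/[^/]+\.svg$/.test(relPath)) return "vector";
  if (/^font\/[^/]+\.(woff2|charmap)$/.test(relPath)) return "font";
  if (/^blob-raster-sources\/[^/]+\/.+$/.test(relPath)) return "cold-raster-source";
  // PPTX pages sit at the root as <n>.pb.
  if (/^\d+\.pb$/.test(relPath)) return "page";
  // PDF pages sit one directory down, under a content-id directory that is
  // not one of the reserved asset directories (assumed numeric file name).
  if (/^(?!(?:extra|img|vec|font|blob-raster-sources)\/)[^/]+\/\d+\.pb$/.test(relPath)) {
    return "page";
  }
  return "unknown";
}

console.log(classify("0.pb"), classify("extra/7/charts.pb"), classify("img/deadbeef.webp"));
```

A renderer-side check along these lines could flag unexpected files before rendering, though whether proto2html validates its input this way is not stated here.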

§3 What "cleaned" means

The storage investigation in /storage measures the extraction after pruning artefacts that are diagnostic-only or regeneratable. Two TypeScript programs implement the cleaning rules:

File                      Removed
clean-pdf-extraction.ts   debug/, _profile/, blob-raster-sources/, telemetry JSONL/JSON (font-route, font-capabilities, font_observation_pack, style_summary, design_system), font/*.native-font.log, font/*_preview.png, preview_*.webp, styles.css, progress.txt.
clean-pptx-extraction.ts  progress.txt, closure-matrix.json, promotion-candidates.json, blob-raster-sources/, extra/<id>/t<N>.pb, preview_*.webp, stray *.log.
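
The PPTX rules can be read as a predicate over paths relative to the output directory. The sketch below is an illustration of that reading, not the contents of clean-pptx-extraction.ts: `isPrunedFromPptx` is a hypothetical name, file names are assumed to match at any depth, and `<N>` in t<N>.pb is assumed to be a decimal integer.

```typescript
// File names removed wherever they appear (matching at any depth is an assumption).
const REMOVED_NAMES = new Set([
  "progress.txt",
  "closure-matrix.json",
  "promotion-candidates.json",
]);

// Returns true when a relative, forward-slash path would be pruned
// under the PPTX cleaning rules listed in the table.
function isPrunedFromPptx(relPath: string): boolean {
  const name = relPath.split("/").pop() ?? "";
  if (REMOVED_NAMES.has(name)) return true;
  if (relPath.startsWith("blob-raster-sources/")) return true;
  // extra/<id>/t<N>.pb, assuming N is a decimal integer.
  if (/^extra\/[^/]+\/t\d+\.pb$/.test(relPath)) return true;
  if (/^preview_.*\.webp$/.test(name)) return true;
  if (name.endsWith(".log")) return true;
  return false;
}

const files = [
  "0.pb",
  "progress.txt",
  "extra/4/t2.pb",
  "extra/4/0.pb",
  "preview_0.webp",
  "img/ab12.webp",
  "run.log",
];
const kept = files.filter((f) => !isPrunedFromPptx(f));
console.log(kept); // 0.pb, extra/4/0.pb and img/ab12.webp survive
```

Expressing the rules as a pure predicate keeps them testable without touching the filesystem; the actual deletion step would be a separate walk over the directory.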

§4 The shared schema

Both extractors emit the same protobuf wire format. The schema is defined in the LDF v0.1 namespace and lives in /proto. The top-level page envelope is Page; its repeated content field holds PageContent messages, each of which is a oneof over frame, textbox, item, hardchar, table, and shape. The reference catalogues are reftable.proto, classtable.proto, and the theme tree.
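
On the consumer side, a oneof maps naturally onto a TypeScript discriminated union. The sketch below shows only that mapping: the six variant names come from the schema description, while the `kind` discriminator, the payload fields (`text`, `char`), and the `describe` helper are placeholders and do not reflect the real LDF message definitions or any generated code.

```typescript
// Variant names follow the schema; every payload field is a placeholder.
type PageContent =
  | { kind: "frame" }
  | { kind: "textbox"; text: string }
  | { kind: "item" }
  | { kind: "hardchar"; char: string }
  | { kind: "table" }
  | { kind: "shape" };

// Page envelope with its repeated content field.
interface Page {
  content: PageContent[];
}

// Exhaustive switch: adding a new oneof case without handling it here
// becomes a compile-time error through the `never` assignment.
function describe(c: PageContent): string {
  switch (c.kind) {
    case "frame":
      return "frame";
    case "textbox":
      return `textbox(${c.text})`;
    case "item":
      return "item";
    case "hardchar":
      return `hardchar(${c.char})`;
    case "table":
      return "table";
    case "shape":
      return "shape";
    default: {
      const unreachable: never = c;
      throw new Error(`unhandled content: ${JSON.stringify(unreachable)}`);
    }
  }
}

const page: Page = {
  content: [{ kind: "textbox", text: "Title" }, { kind: "shape" }, { kind: "hardchar", char: "•" }],
};
const summary = page.content.map(describe).join(" | ");
console.log(summary);
```

Because each PageContent carries exactly one case, a renderer can dispatch on the discriminator without probing for which fields are set.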