LDF v0.1 · vocabulary

Glossary

Terms used across the schema reference, storage investigation, and methodology pages.

page.proto

The per-page editable protobuf — the top-level Page message in LDF v0.1. Holds repeated PageContent (frame / textbox / item / hardchar / table / shape).

→ /proto/page

PageContent

Oneof envelope distinguishing the kinds of children a page can hold. The byte distribution across these variants is the focus of /storage §3.

→ /proto/page#m-PageContent

Frame

Transparent grouping container with id, original_frame_id, opacity, blend_mode, clip_rect, clip_path, kind, flags. Children are repeated PageContent.

→ /proto/page#m-Frame
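
As a rough illustration of how the PageContent envelope and Frame nest, the sketch below models them with plain Python dataclasses instead of generated protobuf classes. The class layout, the `kind()` helper, and `count_leaves` are assumptions made for this example, only three of the variants are modelled, and the payloads are placeholder strings rather than the real messages.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Frame:
    # Subset of the Frame fields named above; children are PageContent entries.
    id: int
    opacity: float = 1.0
    children: List["PageContent"] = field(default_factory=list)


@dataclass
class PageContent:
    # Stand-in for the oneof: exactly one of these is expected to be set.
    frame: Optional[Frame] = None
    textbox: Optional[str] = None  # placeholder payload (hypothetical)
    item: Optional[str] = None     # placeholder payload (hypothetical)

    def kind(self) -> str:
        """Return the name of the variant that is set."""
        for name in ("frame", "textbox", "item"):
            if getattr(self, name) is not None:
                return name
        raise ValueError("empty PageContent")


def count_leaves(contents: List[PageContent]) -> Dict[str, int]:
    """Count leaf variants, descending through Frame containers."""
    counts: Dict[str, int] = {}
    for pc in contents:
        k = pc.kind()
        if k == "frame":
            # Frames are transparent groups: recurse and merge child counts.
            for name, n in count_leaves(pc.frame.children).items():
                counts[name] = counts.get(name, 0) + n
        else:
            counts[k] = counts.get(k, 0) + 1
    return counts
```

Because a Frame is only a grouping container, the walk never counts it as content; it only contributes the leaves found beneath it.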

ContentItem

Reference to an asset payload — an image (img/<hash>.webp), vector (vec/<hash>.svg), chart, smartart, drawing, ink, media, or rasterization source. Smallest editable unit for a non-text visual.

→ /proto/page#m-ContentItem

TextBox

Positioned text body. Carries paragraphs, runs, class_id refs into classtable.pb, and inline geometry (position, line height, gaps).

→ /proto/page#m-TextBox

HardChar

PDF fallback for a glyph that the text-flow layout could not safely place inside a TextBox — emitted as its own positioned PageContent so the rendered output still pixel-matches.

→ /proto/page#m-HardChar

reftable.pb

Interned reference catalogue — colours, fonts, font sizes, line heights, letter and word spacings. Every page proto refers to entries by ref_id rather than repeating literal values.

→ /proto/reftable
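
The interning idea can be sketched in a few lines of Python. This is not the reftable.pb schema: the `RefTable` class, its method names, and the choice to number ref_ids from 0 are all assumptions for illustration.

```python
class RefTable:
    """Interns values and returns small integer ref_ids (hypothetical API)."""

    def __init__(self):
        self._ids = {}     # value -> ref_id
        self._values = []  # ref_id -> value

    def intern(self, value):
        """Return the existing ref_id for value, or allocate a new one."""
        ref_id = self._ids.get(value)
        if ref_id is None:
            ref_id = len(self._values)
            self._ids[value] = ref_id
            self._values.append(value)
        return ref_id

    def resolve(self, ref_id):
        """Look up the literal value behind a ref_id."""
        return self._values[ref_id]
```

A page that uses the same colour in many runs would then store one small integer per run, with the literal value kept once in the table.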

classtable.pb

Per-document table of CSS classes (span-level and line-level). page.proto fields like text-run / paragraph carry class_id references into this table; CSS is reconstructed from the joined view at render time.

→ /proto/classtable
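
The render-time join might look roughly like the following Python sketch. It assumes the class table has already been decoded into a dict from class_id to CSS declarations, and it emits inline styles for brevity; `render_run` and that dict shape are hypothetical, not proto2html's actual interface.

```python
import html


def render_run(text, class_id, classtable):
    """Join a run's class_id against the class table and emit a styled span."""
    decls = classtable[class_id]
    style = "; ".join(f"{prop}: {val}" for prop, val in decls.items())
    # Escape both the attribute value and the text so the output is valid HTML.
    return f'<span style="{html.escape(style, quote=True)}">{html.escape(text)}</span>'
```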

theme.pb

Design-system summary: palette (primary / secondary / accent / background), font roles, typography scales. Sourced from PowerPoint themes or PDF style analysis.

→ /proto/theme

theme-layouts.pb

PowerPoint-only — master and layout templates referenced by individual slides. Lets the renderer reproduce per-layout placeholder geometry without re-reading the original .pptx.

→ /proto/theme-layouts

metadata.pb

Document-level metadata: title, authors, page count, last-edited timestamp, output contract version, capability flags.

→ /proto/metadata

rasterization type / blob-raster-sources

When an editable native render is not pixel-exact, the extractor stores a typed raster source under blob-raster-sources/<hash>/t<N>.pb (PDF) or extra/<id>/t<N>.pb (PPTX). The type N indexes the source category (e.g. 4 = exact image source). These files are pruned from the cleaned extraction because they are cold and can be regenerated.
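
The two path layouts above can be captured in a small helper. The function name and the `source_format` argument are assumptions for this sketch; only the two path templates come from the definition.

```python
def raster_source_path(source_format: str, key: str, type_n: int) -> str:
    """Build the relative path of a typed raster source.

    key is the <hash> for PDF extractions or the <id> for PPTX extractions.
    """
    if source_format == "pdf":
        return f"blob-raster-sources/{key}/t{type_n}.pb"
    if source_format == "pptx":
        return f"extra/{key}/t{type_n}.pb"
    raise ValueError(f"unknown source format: {source_format!r}")
```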

frame-overhead

Synthetic byte bucket in the page-proto field analysis. Captures all per-Frame metadata (id, original_frame_id, opacity, blend_mode, clip_rect, clip_path, kind, flags) so it isn't conflated with leaf-content payloads.

→ /storage#pb-frame-overhead
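
One way such a bucket could be computed is sketched below. It assumes the analysis already has per-field byte counts as (message, field, nbytes) tuples; that input shape and the `bucket_bytes` name are hypothetical, and the actual analysis may be implemented differently.

```python
# The Frame metadata fields listed in the definition above.
FRAME_META_FIELDS = {
    "id", "original_frame_id", "opacity", "blend_mode",
    "clip_rect", "clip_path", "kind", "flags",
}


def bucket_bytes(samples):
    """Sum byte counts, folding all Frame metadata into one synthetic bucket."""
    buckets = {}
    for message, field_name, nbytes in samples:
        if message == "Frame" and field_name in FRAME_META_FIELDS:
            key = "frame-overhead"
        else:
            key = f"{message}.{field_name}"
        buckets[key] = buckets.get(key, 0) + nbytes
    return buckets
```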

tar.gz/source ratio

Apples-to-apples storage comparison: source PDFs (Flate-compressed streams) and PPTX (ZIP) versus the cleaned extraction archived as a single tar.gz blob. The relevant metric when storing extractions in S3/SeaweedFS with at-rest compression.

→ /storage
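
A minimal way to compute this ratio with the Python standard library is sketched here. It assumes the cleaned extraction is available as a mapping of file names to bytes and that the source size is known; the helper names are made up for the example.

```python
import io
import tarfile


def targz_bytes(files: dict) -> bytes:
    """Pack {name: bytes} into a single in-memory tar.gz blob."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def targz_source_ratio(files: dict, source_size: int) -> float:
    """Size of the archived extraction divided by the source document size."""
    if source_size <= 0:
        raise ValueError("source_size must be positive")
    return len(targz_bytes(files)) / source_size
```

A ratio below 1.0 means the archived extraction is smaller than the source document it was extracted from.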

page.pb gz/raw ratio

How much residual redundancy lives inside the page protobuf bytes themselves. A value near 30% means a generic compressor such as gzip would shrink page.pb to roughly a third of its raw size. Most of that redundancy is repeated Frame metadata, repeated coordinate ints, and repeated class-id varints.

→ /storage#pb
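
The ratio itself is simple to measure. The sketch below uses Python's gzip module at its default compression level, which is an assumption; the storage investigation may use different settings.

```python
import gzip


def gz_raw_ratio(raw: bytes) -> float:
    """Gzipped size divided by raw size; lower means more residual redundancy."""
    if not raw:
        raise ValueError("raw must be non-empty")
    return len(gzip.compress(raw)) / len(raw)
```

Applied to the serialized bytes of a page.pb file, a result near 0.30 corresponds to the 30% case described above.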

eddocu

C/C++ PDF extractor built on MuPDF. Emits LDF v0.1 protobufs.

eddocupptx

C/C++ PPTX extractor. Reads ZIP/DrawingML directly and emits the same LDF v0.1 protobufs as eddocu.

proto2html

Reference renderer that consumes the LDF v0.1 output tree (without re-reading the source document) and produces editable HTML.