Glossary
Terms used across the schema reference, storage investigation, and methodology pages.
page.proto
The per-page editable protobuf — the top-level Page message in LDF v0.1. Holds repeated PageContent (frame / textbox / item / hardchar / table / shape).
PageContent
Oneof envelope distinguishing the kinds of children a page can hold. The byte distribution across these variants is the focus of /storage §3.
Frame
Transparent grouping container with id, original_frame_id, opacity, blend_mode, clip_rect, clip_path, kind, flags. Children are repeated PageContent.
ContentItem
Reference to an asset payload — an image (img/<hash>.webp), vector (vec/<hash>.svg), chart, smartart, drawing, ink, media, or rasterization source. Smallest editable unit for a non-text visual.
TextBox
Positioned text body. Carries paragraphs, runs, class_id refs into classtable.pb, and inline geometry (position, line height, gaps).
HardChar
PDF fallback for a glyph that the text-flow layout could not safely place inside a TextBox — emitted as its own positioned PageContent so the rendered output still pixel-matches.
reftable.pb
Interned reference catalogue — colours, fonts, font sizes, line heights, letter and word spacings. Every page proto refers to entries by ref_id rather than repeating literal values.
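As a rough illustration of interning, the sketch below assigns one integer id per distinct value and hands back the same id on repeat. The RefTable class and its method names are hypothetical; the actual wire layout of reftable.pb and the way ref_id values are allocated are not described here.

```python
class RefTable:
    """Toy interning table: each distinct value gets one stable integer ref_id.

    Hypothetical in-memory model only; it does not mirror the real reftable.pb schema.
    """

    def __init__(self):
        self._ids = {}     # value -> ref_id
        self._values = []  # ref_id -> value

    def intern(self, value):
        """Return the existing ref_id for value, or allocate the next one."""
        ref_id = self._ids.get(value)
        if ref_id is None:
            ref_id = len(self._values)
            self._ids[value] = ref_id
            self._values.append(value)
        return ref_id

    def lookup(self, ref_id):
        """Resolve a ref_id back to its literal value."""
        return self._values[ref_id]

    def __len__(self):
        return len(self._values)


colours = RefTable()
dark = colours.intern("#1a1a1a")
red = colours.intern("#ff0000")
dark_again = colours.intern("#1a1a1a")  # repeated literal reuses the first id
```

A page that uses the same colour many times would then carry only the small integer, which is the point of referring to entries by ref_id.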
classtable.pb
Per-document table of CSS classes (span-level and line-level). page.proto fields like text-run / paragraph carry class_id references into this table; CSS is reconstructed from the joined view at render time.
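The join between class_id references and the class table can be pictured with plain dictionaries. Everything in this sketch is assumed for illustration: the dictionary shapes, the ".c<id>" selector naming, and the helper names are not taken from the LDF schema or from proto2html.

```python
# Hypothetical stand-in for classtable.pb: class_id -> CSS declarations.
class_table = {
    1: {"font-weight": "bold", "color": "#1a1a1a"},
    2: {"font-style": "italic"},
}

# Hypothetical stand-in for text runs on a page, each carrying a class_id.
runs = [
    {"text": "Hello ", "class_id": 1},
    {"text": "world", "class_id": 2},
    {"text": "!", "class_id": 1},
]


def css_rule(class_id, declarations):
    """Format one class entry as a CSS rule (selector naming is assumed)."""
    body = "; ".join(f"{prop}: {value}" for prop, value in declarations.items())
    return f".c{class_id} {{ {body} }}"


def used_rules(page_runs, table):
    """Join runs against the table, emitting each referenced class once."""
    seen = []
    for run in page_runs:
        if run["class_id"] not in seen:
            seen.append(run["class_id"])
    return [css_rule(cid, table[cid]) for cid in seen]


rules = used_rules(runs, class_table)
```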
theme.pb
Design-system summary: palette (primary / secondary / accent / background), font roles, typography scales. Sourced from PowerPoint themes or PDF style analysis.
theme-layouts.pb
PowerPoint-only — master and layout templates referenced by individual slides. Lets the renderer reproduce per-layout placeholder geometry without re-reading the original .pptx.
metadata.pb
Document-level metadata: title, authors, page count, last-edited timestamp, output contract version, capability flags.
rasterization type / blob-raster-sources
When an editable native render is not pixel-exact, the extractor stores a typed raster source under blob-raster-sources/<hash>/t<N>.pb (PDF) or extra/<id>/t<N>.pb (PPTX). The type number N identifies the source category (e.g. 4 = exact image source). These files are pruned from the cleaned extraction because they are cold and can be regenerated.
frame-overhead
Synthetic byte bucket in the page-proto field analysis. Captures all per-Frame metadata (id, original_frame_id, opacity, blend_mode, clip_rect, clip_path, kind, flags) so it isn't conflated with leaf-content payloads.
tar.gz/source ratio
Apples-to-apples storage comparison: source PDFs (Flate-compressed streams) and PPTX (ZIP) versus the cleaned extraction archived as a single tar.gz blob. The relevant metric when storing extractions in S3/SeaweedFS with at-rest compression.
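One way to compute such a ratio, as a minimal sketch: archive the extraction as a single in-memory tar.gz and divide its size by the source file size. The function names, the flat name-to-bytes input, and the sample payloads are assumptions for illustration, not the tooling behind the reported figures.

```python
import io
import tarfile


def targz_size(files: dict[str, bytes]) -> int:
    """Pack name -> bytes entries into one in-memory tar.gz; return its byte size."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    # The archive is finalized once the with-block exits; buf itself stays open.
    return len(buf.getvalue())


def targz_source_ratio(extraction: dict[str, bytes], source_size: int) -> float:
    """Archived extraction size divided by the original PDF/PPTX size."""
    if source_size <= 0:
        raise ValueError("source_size must be positive")
    return targz_size(extraction) / source_size


# Placeholder payloads standing in for a cleaned extraction tree.
extraction = {
    "page.pb": b"\x0a\x02hi" * 1000,
    "reftable.pb": b"\x01" * 200,
}
ratio = targz_source_ratio(extraction, source_size=10_000)
```

Compressing the whole tree as one stream lets the compressor exploit redundancy across files, which per-file compression would miss.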
page.pb gz/raw ratio
Gzipped size of page.pb divided by its raw size: a measure of how much residual redundancy remains inside the page protobuf bytes themselves. A value near 30% means a generic compressor would shrink page.pb to roughly a third of its current size. Most of that redundancy is repeated Frame metadata, repeated coordinate ints, and repeated class-id varints.
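A minimal sketch of the measurement, assuming the ratio is simply gzip-compressed bytes over raw bytes for one serialized page. The gz_raw_ratio helper and the repetitive sample bytes are illustrative stand-ins, not real page.pb content.

```python
import gzip


def gz_raw_ratio(raw: bytes, level: int = 6) -> float:
    """Return gzip-compressed size divided by raw size for one payload."""
    if not raw:
        raise ValueError("empty payload")
    return len(gzip.compress(raw, compresslevel=level)) / len(raw)


# Highly repetitive stand-in bytes, loosely mimicking repeated small varint fields.
sample = b"\x08\x01\x10\x02\x18\x03" * 500
ratio = gz_raw_ratio(sample)
print(f"gz/raw = {ratio:.1%}")
```

A low ratio signals that the serialized form still repeats itself heavily; a ratio close to 100% means there is little left for a generic compressor to remove.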
eddocu
C/C++ PDF extractor built on MuPDF. Emits LDF v0.1 protobufs.
eddocupptx
C/C++ PPTX extractor. Reads ZIP/DrawingML directly and emits the same LDF v0.1 protobufs as eddocu.
proto2html
Reference renderer that consumes the LDF v0.1 output tree (without re-reading the source document) and produces editable HTML.