Validating Text Layout: Corpora, Browser Sweeps, and Accuracy at Scale

Intermediate

Prerequisites

  • Article 1: Architecture and the Two-Phase Model
  • Familiarity with testing concepts and CI/CD
  • Basic understanding of cross-browser differences

How do you know that a text layout library produces correct results? For Pretext, "correct" means "matches the browser's native layout" — same line count, same height, same line breaks at every container width. Across Chrome, Safari, and Firefox. Across English, Arabic, Chinese, Japanese, Korean, Khmer, Myanmar, Urdu, Hebrew, and mixed-script text. Across font sizes from 10px to 28px. At every integer pixel width from 1 to 900.

This is an unusually ambitious correctness target, and Pretext has built an unusually thorough validation infrastructure to maintain it. This final article examines the three pillars of that infrastructure: deterministic unit tests with a fake canvas, automated browser accuracy sweeps, and multilingual corpus validation with a taxonomy of mismatch types.

Unit Tests: Deterministic Fake Canvas

The unit test suite in layout.test.ts takes a pragmatic approach to the fundamental problem of testing against Canvas: it replaces measureText() with a deterministic width function.

src/layout.test.ts#L1-L8

The opening comment captures the philosophy:

// Keep the permanent suite small and durable. These tests exercise the shipped
// prepare/layout exports with a deterministic fake canvas backend.

The fake measureWidth() function assigns deterministic widths by character type:

src/layout.test.ts#L71-L92

function measureWidth(text: string, font: string): number {
  const fontSize = parseFontSize(font)
  let width = 0
  for (const ch of text) {
    if (ch === ' ')           width += fontSize * 0.33
    else if (ch === '\t')     width += fontSize * 1.32
    else if (isEmoji(ch))     width += fontSize
    else if (isWideChar(ch))  width += fontSize  // CJK
    else if (isPunctuation(ch)) width += fontSize * 0.4
    else                      width += fontSize * 0.6
  }
  return width
}

This gives tests reproducible widths without depending on any font engine. The real Intl.Segmenter is still used for text analysis — only the measurement backend is faked. This means tests exercise the full analysis pipeline (segmentation, merge cascade, break-kind classification) while keeping line-breaking behavior deterministic.
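
To make this concrete, here is a self-contained sketch of a deterministic fake backend in the same spirit. The helper implementations (parseFontSize, the wide-character check) are simplified assumptions for illustration, not Pretext's actual code:

```typescript
// Sketch: a deterministic fake measurement backend. Widths are a pure
// function of the input string and font size, so results are identical
// on every machine and every run.

function parseFontSize(font: string): number {
  // e.g. "16px sans-serif" -> 16; fall back to 16 if no px size found.
  const m = font.match(/(\d+(?:\.\d+)?)px/)
  return m ? parseFloat(m[1]) : 16
}

function isWideChar(ch: string): boolean {
  // Rough CJK / full-width check, for illustration only.
  return /[\u3000-\u9fff\uac00-\ud7af]/.test(ch)
}

function fakeMeasureWidth(text: string, font: string): number {
  const fontSize = parseFontSize(font)
  let width = 0
  for (const ch of text) {
    if (ch === ' ')           width += fontSize * 0.33
    else if (ch === '\t')     width += fontSize * 1.32
    else if (isWideChar(ch))  width += fontSize
    else                      width += fontSize * 0.6
  }
  return width
}
```

Because the widths depend only on the input string and the parsed font size, any line-breaking test built on this backend is fully reproducible across machines and CI runs.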

flowchart TD
    A[Test input text] --> B[Real Intl.Segmenter]
    B --> C[Real analysis pipeline]
    C --> D[Fake Canvas measureText]
    D --> E[Real line walker]
    E --> F[Deterministic results]
    style D fill:#ff9,stroke:#333

The test suite validates several categories of invariants:

  • Reconstruction: Lines produced by all three rich APIs (layoutWithLines, walkLineRanges, layoutNextLine) reconstruct back to the original normalized text
  • Cursor monotonicity: Line end cursors are strictly increasing
  • API consistency: layout() and layoutWithLines() agree on line counts
  • Streaming equivalence: layoutNextLine() produces identical lines to layoutWithLines() for fixed-width input
  • Variable-width streaming: layoutNextLine() with per-line width arrays produces valid, reconstructable output

The helpers reconstructFromLineBoundaries() and collectStreamedLines() are the workhorses:

src/layout.test.ts#L136-L159

Tip: The "small and durable" test philosophy means new browser-specific edge cases are investigated with throwaway probes and browser checker scripts, not by growing the permanent test suite. Only stable invariants get promoted to permanent tests.

Browser Accuracy Sweeps

While unit tests verify internal consistency, browser accuracy sweeps verify external correctness — does Pretext's output match the actual browser's layout?

The accuracy-check.ts script automates this comparison:

scripts/accuracy-check.ts#L1-L14

The sweep process:

  1. Start a temporary local Bun server serving the accuracy page
  2. Launch a browser (Chrome, Safari, or Firefox) via automation
  3. For each test case, measure the actual DOM text height at various container widths
  4. Compare against Pretext's predicted line count
  5. Report mismatches with per-line diagnostics
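
Steps 3–5 reduce to a comparison loop like the following sketch, where the DOM and Pretext measurements are stand-in callbacks (a real sweep drives a browser and calls layout()):

```typescript
// Sketch of the sweep comparison loop. The two callbacks are
// placeholders for "measure the DOM" and "ask Pretext", respectively.

interface Mismatch { caseId: string; width: number; dom: number; predicted: number }

function sweep(
  cases: string[],
  widths: number[],
  domLineCount: (id: string, width: number) => number,
  predictedLineCount: (id: string, width: number) => number,
): Mismatch[] {
  const mismatches: Mismatch[] = []
  for (const caseId of cases) {
    for (const width of widths) {
      const dom = domLineCount(caseId, width)
      const predicted = predictedLineCount(caseId, width)
      if (dom !== predicted) mismatches.push({ caseId, width, dom, predicted })
    }
  }
  return mismatches
}
```

Recording the full (case, width, dom, predicted) tuple for each mismatch is what makes the checked-in JSON snapshots diffable: a regression shows up as new rows, and a fix shows up as rows disappearing.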

The results are checked into the repository as JSON snapshots:

File                   Contents
accuracy/chrome.json   Full accuracy rows for Chrome
accuracy/safari.json   Full accuracy rows for Safari
accuracy/firefox.json  Full accuracy rows for Firefox
status/dashboard.json  Machine-readable aggregate dashboard

flowchart TD
    A["accuracy-check.ts"] --> B["Start Bun server"]
    B --> C["Launch browser via automation"]
    C --> D["Navigate to accuracy page"]
    D --> E["For each test case × width"]
    E --> F["DOM: measure actual height"]
    E --> G["Pretext: layout() predicted height"]
    F --> H{Match?}
    G --> H
    H -->|yes| I[Record match]
    H -->|no| J[Record mismatch with diagnostics]
    I --> K["Write accuracy/*.json"]
    J --> K
    K --> L["Update status/dashboard.json"]

The per-line diagnostics in mismatch reports are critical for debugging — they show exactly which line the browser broke differently from Pretext, enabling targeted investigation rather than guesswork.

The sweep runs on each browser separately because browser-specific flags (as we saw in Part 4) produce intentionally different results. A Chrome mismatch might be a carryCJKAfterClosingQuote issue; a Safari mismatch might be a preferPrefixWidthsForBreakableRuns issue.

Multilingual Corpus Validation

Beyond the curated accuracy test cases, Pretext validates against real-world multilingual text using a corpus system:

scripts/corpus-sweep.ts#L1-L11

The corpus system includes:

  • Sources: Real text in Arabic, Chinese, Japanese, Korean, Khmer, Myanmar, Urdu, Hebrew, and mixed-script
  • Representative canaries: A curated subset (corpora/representative.json) of texts known to exercise particular script-specific edge cases
  • Sweep snapshots: Checked-in results at sampled widths and fine-grained 10px-step widths

The methodology is sweep cheap, diagnose narrow:

  1. Sweep: Run all corpus texts across a range of container widths (e.g., 300–900px in 10px steps). This is fast because layout() is fast.
  2. Identify: Find widths where predicted and actual line counts disagree.
  3. Diagnose: For mismatching widths only, run the expensive per-line diagnostic comparison to identify the exact break difference.
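
The three steps above can be sketched as a two-pass function; the callback names are illustrative assumptions, not the real corpus-sweep.ts API:

```typescript
// "Sweep cheap, diagnose narrow": a fast line-count pass over every
// width, then expensive per-line diffing only where the counts disagree.

function sweepThenDiagnose(
  widths: number[],
  cheapCounts: (width: number) => { predicted: number; actual: number },
  expensiveDiff: (width: number) => string, // per-line break diff (slow)
): Map<number, string> {
  // Pass 1: cheap comparison at every width.
  const suspect = widths.filter((w) => {
    const { predicted, actual } = cheapCounts(w)
    return predicted !== actual
  })
  // Pass 2: run the expensive diagnostic only at mismatching widths.
  const report = new Map<number, string>()
  for (const w of suspect) report.set(w, expensiveDiff(w))
  return report
}
```

The payoff is that the cost of the expensive pass scales with the number of mismatches, not with the number of widths swept, so widening the sweep range stays cheap as accuracy improves.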

The RESEARCH.md file captures the current steering summary for each script:

RESEARCH.md#L9-L19

- Japanese: two real canaries (羅生門, 蜘蛛の糸), both clean at anchor widths
- Chinese: two long-form canaries (祝福, 故鄉) with real font sensitivity
- Myanmar: two canaries with residual Chrome/Safari disagreement
- Urdu: Nastaliq/Naskh canary with narrow-width negative field
- Arabic: coarse corpora are clean; remaining work is fine-width edge-fit

The taxonomy system (scripts/corpus-taxonomy.ts) classifies mismatches into categories, helping distinguish between:

  • Preprocessing issues: The analysis pipeline segments differently from the browser
  • Edge-fit issues: Floating-point accumulation pushes a segment just past the line edge
  • Font-sensitivity issues: Different fonts produce different results at the same text/width
  • Browser-specific quirks: One browser behaves differently from others
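
A toy classifier in this spirit might look like the following; the record fields and the 0.5px threshold are assumptions for illustration, not the real corpus-taxonomy.ts schema:

```typescript
// Toy mismatch classifier mirroring the four categories above.
// Field names and the sub-pixel threshold are invented for this sketch.

type Category = 'preprocessing' | 'edge-fit' | 'font-sensitivity' | 'browser-quirk'

interface MismatchRecord {
  segmentsDiffer: boolean   // analysis pipeline segmented differently from the browser
  overflowPx: number        // how far past the line edge the segment landed
  fontsDisagree: boolean    // result differs across fonts at the same text/width
  browsersDisagree: boolean // result differs across browsers
}

function classify(m: MismatchRecord): Category {
  if (m.segmentsDiffer) return 'preprocessing'
  if (Math.abs(m.overflowPx) < 0.5) return 'edge-fit' // sub-pixel accumulation
  if (m.fontsDisagree) return 'font-sensitivity'
  return 'browser-quirk'
}
```

The ordering matters: a segmentation difference explains the mismatch regardless of widths, so it is checked first, while "browser quirk" is the residual bucket once the other explanations are ruled out.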

Build, Release, and Status Dashboard

Pretext ships as ESM via tsc with a minimal build configuration:

tsconfig.build.json#L1-L13

{
  "extends": "./tsconfig.json",
  "compilerOptions": {
    "noEmit": false,
    "allowImportingTsExtensions": false,
    "rootDir": "./src",
    "outDir": "./dist",
    "declaration": true
  },
  "include": ["src/**/*.ts"],
  "exclude": ["src/layout.test.ts", "src/test-data.ts"]
}

The .js specifier convention in source imports (import { analyzeText } from './analysis.js') means plain tsc emit produces correct JavaScript and .d.ts files without a declaration rewrite step. This is deliberate — no Webpack, no Rollup, no Vite in the build chain.

package.json#L1-L20

The package.json exports map points "." at ./dist/layout.js with TypeScript types at ./dist/layout.d.ts. A smoke test script (scripts/package-smoke-test.ts) verifies the tarball works for both JS and TS consumers — catching issues like missing type declarations or incorrect export paths before publish.
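
An exports map of that shape looks roughly like this (a sketch; everything beyond the two dist paths named above, including the package name, is an assumption):

```json
{
  "name": "pretext",
  "type": "module",
  "exports": {
    ".": {
      "types": "./dist/layout.d.ts",
      "import": "./dist/layout.js"
    }
  }
}
```

Listing "types" before "import" matters: Node and TypeScript resolve conditional exports in order, and TypeScript expects the "types" condition to come first.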

The status dashboard (status/dashboard.json) aggregates checked-in accuracy and benchmark snapshots into a machine-readable summary. This serves as a single source of truth for the library's current accuracy and performance state.

Tip: If you're contributing to Pretext, always check the checked-in snapshots in accuracy/ and corpora/ before and after changes. The AGENTS.md file contains the canonical instructions for when to regenerate these snapshots and how to interpret mismatches.

RESEARCH.md as Institutional Memory

Perhaps the most unusual artifact in the repository is RESEARCH.md — a research log capturing everything tried, measured, and learned while building the library:

RESEARCH.md#L1-L8

# Research Log
Everything we tried, measured, and learned while building this library.

It documents:

  • Rejected approaches: DOM-based measurement in the hot path, SVG getComputedTextLength(), string reconstruction during layout — all tried and rejected with clear reasoning
  • Durable discoveries: The system-ui font resolution mismatch between Canvas and DOM on macOS, the word-by-word sum accuracy characteristics of Canvas measureText()
  • Design decisions: Why layout() must stay arithmetic-only, why bidi levels are metadata-only on the rich path
  • Script-specific notes: The current accuracy status of each script family and what remains unresolved

This is institutional memory in its most useful form. A new contributor reading RESEARCH.md learns not just what was built, but what was tried and discarded — avoiding months of re-discovering dead ends.

The AGENTS.md file complements this with implementation-level notes for contributors:

AGENTS.md#L35-L60

Key rules include:

  • Keep layout() fast and allocation-light
  • Keep script-specific fixes in preprocessing, not the line walker
  • Accuracy pages should be green in all three browsers on fresh runs
  • Prefer throwaway probes over growing the permanent test suite
  • Sweep widths cheaply first, diagnose mismatching widths in detail

Series Conclusion

Across six articles, we've traced Pretext from its motivating problem (DOM layout thrashing) through its architectural answer (two-phase prepare/layout), its deepest internals (the merge cascade, the line-breaking engine, browser shims), its consumer-facing APIs (shrinkwrap, obstacle routing, editorial layout), and finally its validation infrastructure (fake canvas tests, browser sweeps, multilingual corpora).

The design is opinionated: measurement happens once, layout is pure arithmetic, the public handle is opaque, and browser differences are handled with explicit flags rather than feature detection. These opinions are backed by extensive empirical validation — the checked-in accuracy snapshots across three browsers and a dozen scripts provide confidence that the approach works in practice, not just in theory.

For a ~3,200-line library (excluding tests and demos), Pretext packs a remarkable amount of text layout sophistication. The codebase rewards careful reading — particularly analysis.ts for its merge cascade architecture, line-break.ts for its simple/full walker dispatch, and measurement.ts for its elegant cache-and-correct approach to cross-browser emoji widths.