Inside the Text Analysis Pipeline: From Raw String to Measured Segments

Advanced

Prerequisites

  • Article 1: Architecture and the Two-Phase Model
  • Basic Unicode awareness (CJK ranges, combining marks)
  • Intl.Segmenter API concepts

In Part 1, we saw that Pretext's prepare() call does all the expensive work once so that layout() can stay arithmetic-only. But what exactly happens inside prepare()? The answer is a surprisingly deep pipeline: whitespace normalization, word segmentation via Intl.Segmenter, classification of each segment into one of eight break kinds, an elaborate multi-pass merge cascade that handles everything from Japanese kinsoku rules to URL detection, CJK grapheme splitting during measurement, and finally assembly of the parallel arrays.

This article walks through that pipeline from start to finish, following the actual code path from raw string to prepared handle.

The Orchestrator: analyzeText()

The text analysis phase is internally separated from measurement. The prepareInternal() function in layout.ts calls two functions in sequence:

src/layout.ts#L424-L432

function prepareInternal(text, font, includeSegments, options) {
  const analysis = analyzeText(text, getEngineProfile(), options?.whiteSpace)
  return measureAnalysis(analysis, font, includeSegments)
}

analyzeText() produces a TextAnalysis — a struct containing the normalized string, an array of segment texts, break kinds, and hard-break chunks. The measurement phase then iterates these to produce the final parallel arrays. Let's follow the analysis path first.

src/analysis.ts#L993-L1019

flowchart TD
    A[Raw text string] --> B{WhiteSpace mode?}
    B -->|normal| C[normalizeWhitespaceNormal]
    B -->|pre-wrap| D[normalizeWhitespacePreWrap]
    C --> E[buildMergedSegmentation]
    D --> E
    E --> F[compileAnalysisChunks]
    F --> G[TextAnalysis]

Whitespace Normalization: normal vs pre-wrap

Before segmentation, whitespace must be normalized to match CSS rendering behavior. Pretext supports two modes:

normal mode (default): Collapse all runs of whitespace (spaces, tabs, newlines, form feeds) into a single space, then trim leading and trailing spaces. This matches white-space: normal in CSS.

src/analysis.ts#L56-L67

The implementation is defensive — it first tests whether normalization is even needed with a quick regex, avoiding unnecessary string allocations for already-clean text:

export function normalizeWhitespaceNormal(text: string): string {
  if (!needsWhitespaceNormalizationRe.test(text)) return text
  let normalized = text.replace(collapsibleWhitespaceRunRe, ' ')
  // trim leading/trailing space
  ...
}
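As a standalone illustration of the collapse-and-trim behavior, here is a minimal sketch; the regex names and exact patterns are assumptions for this example, not the library's actual definitions:

```ts
// Illustrative sketch of white-space: normal collapsing. The regexes are
// assumptions, not the real patterns from analysis.ts.
const needsNormalizationRe = /[\t\n\f\r]|^ | $| {2}/

function collapseWhitespace(text: string): string {
  // Fast path: already-clean text is returned without allocating.
  if (!needsNormalizationRe.test(text)) return text
  return text
    .replace(/[ \t\n\f\r]+/g, ' ') // collapse each whitespace run to one space
    .replace(/^ | $/g, '')         // trim a single leading/trailing space
}
```

For example, collapseWhitespace('  hello\n\tworld ') yields 'hello world', which is what a browser renders for white-space: normal.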

pre-wrap mode: Preserves ordinary spaces, tabs, and hard breaks. Only line endings are normalized (\r\n becomes \n; a stray \r or \f becomes \n).

src/analysis.ts#L69-L74

Tip: The pre-wrap mode targets editor/input-oriented use cases where space preservation matters. It's not the full CSS pre-wrap surface — it handles ordinary spaces, \t tabs, and \n hard breaks with browser-style tab stops.
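A minimal sketch of that line-ending normalization, assuming exactly the conversions described above (this is not the library's actual implementation):

```ts
// Normalize line endings only: \r\n -> \n, stray \r -> \n, \f -> \n.
// Spaces, tabs, and existing \n characters pass through untouched.
function normalizeLineEndings(text: string): string {
  return text.replace(/\r\n|\r|\f/g, '\n')
}
```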

Segmentation and Break-Kind Classification

After normalization, the text passes through Intl.Segmenter configured for word-granularity segmentation:

src/analysis.ts#L79-L84

Each segment from the segmenter is then split by break-kind character. The splitSegmentByBreakKind() function classifies each character into one of eight SegmentBreakKind values:

src/analysis.ts#L1-L11

| SegmentBreakKind | Character(s) | Line-breaking behavior |
| --- | --- | --- |
| text | Regular characters | Break before if overflowing |
| space | Collapsible space | Break after; hangs past line edge |
| preserved-space | Space in pre-wrap | Break after; visible width included |
| tab | Tab in pre-wrap | Break after; advance computed from tab stops |
| glue | NBSP, NNBSP, WJ, ZWNBSP | Non-breaking; measured as visible content |
| zero-width-break | ZWSP (U+200B) | Break opportunity with zero width |
| soft-hyphen | SHY (U+00AD) | Break opportunity; shows a hyphen if chosen |
| hard-break | \n in pre-wrap | Forced line break |

The classification function classifySegmentBreakChar() handles the mode-dependent behavior — in normal mode, spaces become space; in pre-wrap, they become preserved-space:

src/analysis.ts#L321-L334
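Its shape can be approximated as follows; the type name matches the article, but the function body is a hedged reconstruction rather than the actual source:

```ts
type SegmentBreakKind =
  | 'text' | 'space' | 'preserved-space' | 'tab'
  | 'glue' | 'zero-width-break' | 'soft-hyphen' | 'hard-break'

// Mode-dependent classification of a single character (sketch).
function classifyBreakChar(ch: string, preWrap: boolean): SegmentBreakKind {
  switch (ch) {
    case ' ': return preWrap ? 'preserved-space' : 'space'
    case '\t': return 'tab'        // reachable only in pre-wrap; normal mode collapsed tabs earlier
    case '\n': return 'hard-break' // likewise pre-wrap only
    case '\u00A0': case '\u202F':  // NBSP, NNBSP
    case '\u2060': case '\uFEFF':  // WJ, ZWNBSP
      return 'glue'
    case '\u200B': return 'zero-width-break' // ZWSP
    case '\u00AD': return 'soft-hyphen'      // SHY
    default: return 'text'
  }
}
```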

flowchart TD
    A["Intl.Segmenter output"] --> B[splitSegmentByBreakKind]
    B --> C{Character type?}
    C -->|"U+0020"| D{pre-wrap?}
    D -->|yes| E[preserved-space]
    D -->|no| F[space]
    C -->|"NBSP/NNBSP"| G[glue]
    C -->|"U+200B"| H[zero-width-break]
    C -->|"U+00AD"| I[soft-hyphen]
    C -->|"tab"| J{pre-wrap?}
    J -->|yes| K[tab]
    C -->|other| L[text]

The Merge Cascade in buildMergedSegmentation()

The heart of the analysis pipeline is buildMergedSegmentation(). It takes the split pieces and applies a series of merges that match browser CSS behavior for how segments cluster together.

src/analysis.ts#L795-L956

The primary merge loop processes each piece from Intl.Segmenter and attempts to merge it with the previous segment. Six merge rules fire in priority order within this loop:

  1. CJK closing-quote carry (Chromium-specific): When the previous CJK segment ends with a closing quote and the next segment is CJK, merge them. This matches Chromium's behavior of keeping sequences like 」東 together.

  2. Kinsoku shori (line-start prohibition): Characters prohibited at line start (such as 、, 。, and closing brackets like 」) are merged with the preceding CJK segment. This prevents punctuation from being orphaned at the start of a line.

  3. Myanmar medial-glue: When the preceding segment ends with a Myanmar medial character, merge the next segment into it.

  4. Arabic no-space clusters: When the preceding segment ends with Arabic trailing punctuation (:, ., ،, ؛) and the next word contains Arabic script, merge them.

  5. Repeated-character runs: Single non-hyphen, non-em-dash characters repeated in a row (like ... or ===) merge into one unit.

  6. Left-sticky punctuation: Punctuation that attaches to the preceding word (periods, commas, exclamation marks, closing parentheses, quotes, and CJK closing quotes) merges with the preceding text. This is why "better." is measured as one unit.

After the primary loop, two post-loop passes handle forward-sticky behavior:

  • Escaped-quote cluster merging: Forward pass attaching \" sequences to their neighbors.
  • Forward-sticky carry: Reverse pass moving opening brackets and quotes (such as ( and ", plus CJK openers like 「) to the front of the next text segment.

Both passes are followed by compaction to remove empty entries.
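To make rule 6 concrete, here is a toy version of left-sticky merging; the set contents and the function name are illustrative, and the real rule is more selective about what it merges into:

```ts
// Toy left-sticky merge: a piece made entirely of trailing punctuation
// attaches to the previous piece, so 'better' + '.' becomes 'better.'.
const leftSticky = new Set(['.', ',', '!', '?', ')', ']', '"'])

function mergeLeftSticky(pieces: string[]): string[] {
  const merged: string[] = []
  for (const piece of pieces) {
    const sticky = piece.length > 0 && [...piece].every(c => leftSticky.has(c))
    if (sticky && merged.length > 0) merged[merged.length - 1] += piece
    else merged.push(piece)
  }
  return merged
}
```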

Post-Merge Passes: URLs, Numbers, and Punctuation Chains

After the primary merge cascade, a further chain of passes refines the segmentation:

const compacted = mergeGlueConnectedTextRuns(...)
const withMergedUrls = carryTrailingForwardStickyAcrossCJKBoundary(
  mergeAsciiPunctuationChains(
    splitHyphenatedNumericRuns(
      mergeNumericRuns(
        mergeUrlQueryRuns(
          mergeUrlLikeRuns(compacted))))))

src/analysis.ts#L924-L935

flowchart TD
    A[Primary merge output] --> B[mergeGlueConnectedTextRuns]
    B --> C[mergeUrlLikeRuns]
    C --> D[mergeUrlQueryRuns]
    D --> E[mergeNumericRuns]
    E --> F[splitHyphenatedNumericRuns]
    F --> G[mergeAsciiPunctuationChains]
    G --> H[carryTrailingForwardStickyAcrossCJKBoundary]
    H --> I[Arabic space+mark splitting]
    I --> J[Final MergedSegmentation]

Glue-connected text runs: NBSP characters sandwiched between text segments are absorbed into one unbreakable unit. This matches how word NBSP word behaves in CSS.

src/analysis.ts#L691-L765

URL-like run merging: Segments starting with a URL scheme (https:) or www. absorb all subsequent non-whitespace segments up to the first ?. A second pass (mergeUrlQueryRuns) handles the query portion. This ensures https://example.com/path?q=1 remains one breakable segment rather than fragmenting at slashes.

src/analysis.ts#L417-L465
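A simplified version of this absorption, folding both passes into one so the query string is absorbed too (function name, scheme list, and structure are assumptions for illustration):

```ts
// Toy URL-run absorption: a piece starting with a scheme or "www."
// absorbs every following non-whitespace piece.
function mergeUrlLike(pieces: string[]): string[] {
  const out: string[] = []
  let absorbing = false
  for (const piece of pieces) {
    const isSpace = /^\s+$/.test(piece)
    if (absorbing && !isSpace) {
      out[out.length - 1] += piece // keep extending the URL run
      continue
    }
    absorbing = !isSpace && (/^https?:/.test(piece) || piece.startsWith('www.'))
    out.push(piece)
  }
  return out
}
```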

Numeric run merging: Segments like 7:00 or 3/4 where digits are connected by joiner characters (:, -, /, ×, ,, ., +) merge into one unit. But hyphenated numeric runs like 2024-01-15 are then split back at hyphens to allow line breaking at date separators.

src/analysis.ts#L541-L689
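The merge-then-split idea can be sketched as below; whether the real splitHyphenatedNumericRuns keeps the hyphen with the preceding piece is an assumption here:

```ts
// Sketch: join digit runs through joiner characters, then re-split
// hyphenated results so dates keep their break opportunities.
const joiners = new Set([':', '-', '/', '×', ',', '.', '+'])
const isDigits = (s: string) => /^[0-9]+$/.test(s)

function mergeNumericRuns(pieces: string[]): string[] {
  const out: string[] = []
  for (const p of pieces) {
    const prev = out[out.length - 1]
    const joinable =
      prev !== undefined &&
      ((joiners.has(p) && /[0-9]$/.test(prev)) ||
        (isDigits(p) && joiners.has(prev[prev.length - 1])))
    if (joinable) out[out.length - 1] += p
    else out.push(p)
  }
  return out
}

// Split a merged run after each hyphen so a line break can follow it.
const splitAtHyphens = (run: string): string[] => run.split(/(?<=-)/)
```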

ASCII punctuation chains: Word-like segments connected by trailing commas or colons (like item1,item2,item3) merge into one unit matching browser behavior.

A final Arabic-specific pass splits leading space+combining-marks so that the space stays a break opportunity while the marks attach to the following Arabic word.

Measurement: Canvas, CJK Splitting, and Caching

Once analyzeText() returns the TextAnalysis, measureAnalysis() in layout.ts iterates the segments and builds the parallel arrays that the line walker will consume.

src/layout.ts#L191-L392

The measurement loop handles each segment type differently:

Soft hyphens: Zero width normally, but store discretionaryHyphenWidth (the width of -) in both fit and paint advances so the line walker can account for the visible hyphen if it chooses this break.

Hard breaks and tabs: Zero width — the line walker handles their special behavior.

CJK text segments: This is where the most interesting splitting happens. When a text segment contains CJK characters, it's split into individual graphemes using Intl.Segmenter with grapheme granularity. But the split respects kinsoku rules — opening brackets stay with the next grapheme, closing punctuation stays with the preceding one:

src/layout.ts#L279-L319

if (
  kinsokuEnd.has(unitText) ||         // Opening bracket stays with next
  kinsokuStart.has(grapheme) ||       // Closing punct stays with prev
  leftStickyPunctuation.has(grapheme) // Period/comma stays with prev
) {
  unitText += grapheme
  continue
}

Regular text: Measured via getSegmentMetrics(), which uses the shared Canvas context. For word-like segments with multiple graphemes, per-grapheme widths are pre-computed for overflow-wrap: break-word support. On Safari, cumulative prefix widths are also computed since Safari's text shaping produces different results than summing individual grapheme widths.

lineEndFitAdvance vs lineEndPaintAdvance: For spaces and zero-width breaks, the fit advance is zero (trailing whitespace doesn't contribute to line-fit decisions) but the paint advance varies (collapsible spaces are invisible, preserved spaces are visible). We'll explore this critical distinction in Part 3.

flowchart TD
    A[TextAnalysis segment] --> B{Segment kind?}
    B -->|soft-hyphen| C["width=0, fitAdvance=hyphenWidth"]
    B -->|hard-break/tab| D["width=0, advance=0"]
    B -->|"text (CJK)"| E[Split into graphemes]
    E --> F[Apply kinsoku merging]
    F --> G[Measure each unit]
    B -->|"text (other)"| H[Measure whole segment]
    H --> I{word-like + multi-grapheme?}
    I -->|yes| J[Pre-compute grapheme widths]
    I -->|no| K[Store null for breakableWidths]
    G --> L[Push to parallel arrays]
    J --> L
    K --> L
    C --> L
    D --> L

Each measured segment is pushed via pushMeasuredSegment(), which appends to all parallel arrays simultaneously and also flips the simpleLineWalkFastPath flag to false if any non-simple segment kinds are encountered.

src/layout.ts#L220-L241

Tip: The segment metric cache is keyed by (font, segment_text) and shared across all prepare() calls with the same font. If your app renders 1,000 chat messages in the same font, common words like "the" are measured only once via Canvas. Call clearCache() only when changing fonts entirely.
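The caching idea reduces to a map keyed by font and text. A minimal sketch, where measure stands in for the Canvas measureText call and the key scheme is an assumption:

```ts
// (font, text)-keyed width cache shared across calls.
const widthCache = new Map<string, number>()

function cachedWidth(
  font: string,
  text: string,
  measure: (font: string, text: string) => number,
): number {
  const key = `${font}\u0000${text}` // NUL separator avoids key collisions
  let width = widthCache.get(key)
  if (width === undefined) {
    width = measure(font, text) // cache miss: measure once
    widthCache.set(key, width)
  }
  return width
}
```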

From Analysis to Prepared Handle

The final steps wire up the analysis-level chunk boundaries to the prepared-level segment indices (since CJK splitting can expand one analysis segment into multiple prepared segments) and optionally compute bidi levels:

src/layout.ts#L361-L392

The chunk mapping is handled by mapAnalysisChunksToPreparedChunks(), which translates hard-break boundaries from analysis-space indices to prepared-space indices using lookup arrays built during the measurement loop.

Bidi levels are computed only on the rich path (prepareWithSegments), and only when the text actually contains right-to-left characters — the computeBidiLevels() function returns null early if it finds no R, AL, or AN characters, avoiding work for pure-LTR text.
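The early-out can be approximated with a Unicode script-property regex; the real check presumably inspects bidi character classes directly, so this is only an approximation:

```ts
// Cheap pre-check: does the text contain characters from common RTL
// scripts? Script properties approximate the R/AL/AN class test.
const rtlLikeRe = /[\p{Script=Hebrew}\p{Script=Arabic}\p{Script=Syriac}\p{Script=Thaana}]/u

function mayNeedBidi(text: string): boolean {
  return rtlLikeRe.test(text)
}
```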

Looking Ahead

We've now traced the full path from raw string to parallel arrays: normalization → segmentation → break-kind classification → merge cascade → post-merge passes → Canvas measurement → CJK splitting → array assembly. The result is a PreparedText handle containing everything the line-breaking engine needs.

In Part 3, we'll enter that engine — the hot-path code in line-break.ts that walks these arrays with pure arithmetic to produce line counts in microseconds. We'll see why the simple vs full walker split matters, how trailing whitespace hangs past the line edge, and how soft hyphens interact with overflow-wrap breaking.