Inside the Text Analysis Pipeline: From Raw String to Measured Segments
Prerequisites
- Article 1: Architecture and the Two-Phase Model
- Basic Unicode awareness (CJK ranges, combining marks)
- Intl.Segmenter API concepts
In Part 1, we saw that Pretext's prepare() call does all the expensive work once so that layout() can stay arithmetic-only. But what exactly happens inside prepare()? The answer is a surprisingly deep pipeline: whitespace normalization, word segmentation via Intl.Segmenter, classification of each segment into one of eight break kinds, an elaborate multi-pass merge cascade that handles everything from Japanese kinsoku rules to URL detection, CJK grapheme splitting during measurement, and finally assembly of the parallel arrays.
This article walks through that pipeline from start to finish, following the actual code path from raw string to prepared handle.
The Orchestrator: analyzeText()
The text analysis phase is internally separated from measurement. The prepareInternal() function in layout.ts calls two functions in sequence:
function prepareInternal(text, font, includeSegments, options) {
const analysis = analyzeText(text, getEngineProfile(), options?.whiteSpace)
return measureAnalysis(analysis, font, includeSegments)
}
analyzeText() produces a TextAnalysis — a struct containing the normalized string, an array of segment texts, break kinds, and hard-break chunks. The measurement phase then iterates these to produce the final parallel arrays. Let's follow the analysis path first.
flowchart TD
A[Raw text string] --> B{WhiteSpace mode?}
B -->|normal| C[normalizeWhitespaceNormal]
B -->|pre-wrap| D[normalizeWhitespacePreWrap]
C --> E[buildMergedSegmentation]
D --> E
E --> F[compileAnalysisChunks]
F --> G[TextAnalysis]
Whitespace Normalization: normal vs pre-wrap
Before segmentation, whitespace must be normalized to match CSS rendering behavior. Pretext supports two modes:
normal mode (default): Collapse all runs of whitespace (spaces, tabs, newlines, form feeds) into a single space, then trim leading and trailing spaces. This matches white-space: normal in CSS.
The implementation is defensive — it first tests whether normalization is even needed with a quick regex, avoiding unnecessary string allocations for already-clean text:
export function normalizeWhitespaceNormal(text: string): string {
if (!needsWhitespaceNormalizationRe.test(text)) return text
let normalized = text.replace(collapsibleWhitespaceRunRe, ' ')
// trim leading/trailing space
...
}
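A self-contained sketch of the normal-mode behavior may help; the regexes below are assumptions standing in for the elided ones, not Pretext's actual patterns:

```typescript
// Sketch: collapse runs of collapsible whitespace to a single space,
// then trim leading/trailing spaces. The quick pre-test mirrors the
// defensive check above. (Assumed regexes, not Pretext's actual ones.)
const needsNormalization = /[\t\n\f\r]|^ | $| {2}/
const collapsibleRun = /[ \t\n\f\r]+/g

function normalizeNormalSketch(text: string): string {
  if (!needsNormalization.test(text)) return text
  return text.replace(collapsibleRun, ' ').replace(/^ | $/g, '')
}
```

For already-clean input, the pre-test fails and the original string is returned without allocating anything new.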
pre-wrap mode: Preserves ordinary spaces, tabs, and hard breaks. Only normalizes line endings (\r\n → \n, stray \r or \f → \n).
Tip: The `pre-wrap` mode targets editor/input-oriented use cases where space preservation matters. It's not the full CSS `pre-wrap` surface: it handles ordinary spaces, `\t` tabs, and `\n` hard breaks with browser-style tab stops.
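The pre-wrap path can be sketched roughly like this; it is a minimal approximation of the behavior described above, not Pretext's actual implementation:

```typescript
// Rough sketch of pre-wrap normalization: spaces and tabs pass
// through untouched; only line endings are rewritten.
// (Assumed simplification of Pretext's normalizeWhitespacePreWrap.)
function normalizeWhitespacePreWrapSketch(text: string): string {
  return text
    .replace(/\r\n/g, '\n') // Windows line endings first
    .replace(/[\r\f]/g, '\n') // then stray \r or \f
}
```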
Segmentation and Break-Kind Classification
After normalization, the text passes through Intl.Segmenter configured for word-granularity segmentation.
Each segment from the segmenter is then split by break-kind character. The splitSegmentByBreakKind() function classifies each character into one of eight SegmentBreakKind values:
| SegmentBreakKind | Character(s) | Line-breaking behavior |
|---|---|---|
| `text` | Regular characters | Break before if overflowing |
| `space` | Collapsible space | Break after; hangs past line edge |
| `preserved-space` | Space in pre-wrap | Break after; visible width included |
| `tab` | Tab in pre-wrap | Break after; advance computed from tab stops |
| `glue` | NBSP, NNBSP, WJ, ZWNBSP | Non-breaking; measured as visible content |
| `zero-width-break` | ZWSP (U+200B) | Break opportunity with zero width |
| `soft-hyphen` | SHY (U+00AD) | Break opportunity; shows `-` if chosen |
| `hard-break` | `\n` in pre-wrap | Forced line break |
The classification function classifySegmentBreakChar() handles the mode-dependent behavior: in normal mode, spaces become `space`; in pre-wrap, they become `preserved-space`:
flowchart TD
A["Intl.Segmenter output"] --> B[splitSegmentByBreakKind]
B --> C{Character type?}
C -->|"U+0020"| D{pre-wrap?}
D -->|yes| E[preserved-space]
D -->|no| F[space]
C -->|"NBSP/NNBSP"| G[glue]
C -->|"U+200B"| H[zero-width-break]
C -->|"U+00AD"| I[soft-hyphen]
C -->|"tab"| J{pre-wrap?}
J -->|yes| K[tab]
C -->|other| L[text]
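The classification in the flowchart above can be sketched as a switch over code points; the function name and the exact membership of the glue set here are assumptions:

```typescript
type SegmentBreakKind =
  | 'text' | 'space' | 'preserved-space' | 'tab'
  | 'glue' | 'zero-width-break' | 'soft-hyphen' | 'hard-break'

// Sketch of per-character classification. In normal mode, tabs and
// newlines were already collapsed to spaces by normalization, so
// only pre-wrap sees them here. (Assumed shape, not the real code.)
function classifyBreakCharSketch(ch: string, preWrap: boolean): SegmentBreakKind {
  switch (ch) {
    case ' ': return preWrap ? 'preserved-space' : 'space'
    case '\t': return 'tab' // reaches here only in pre-wrap
    case '\n': return 'hard-break' // reaches here only in pre-wrap
    case '\u00A0': // NBSP
    case '\u202F': // NNBSP
    case '\u2060': // WJ
    case '\uFEFF': // ZWNBSP
      return 'glue'
    case '\u200B': return 'zero-width-break' // ZWSP
    case '\u00AD': return 'soft-hyphen' // SHY
    default: return 'text'
  }
}
```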
The Merge Cascade in buildMergedSegmentation()
The heart of the analysis pipeline is buildMergedSegmentation(). It takes the split pieces and applies a series of merges that match browser CSS behavior for how segments cluster together.
The primary merge loop processes each piece from Intl.Segmenter and attempts to merge it with the previous segment. Six merge rules fire in priority order within this loop:
1. **CJK closing-quote carry** (Chromium-specific): when the previous CJK segment ends with a closing quote and the next segment is CJK, merge them. This matches Chromium's specific behavior, where `」東` stays together.
2. **Kinsoku shori (line-start prohibition):** characters prohibited at line start (like `、`, `。`, `)`) are merged with the preceding CJK segment. This prevents punctuation from being orphaned at the start of a line.
3. **Myanmar medial-glue:** when the preceding segment ends with a Myanmar medial character, merge the next segment into it.
4. **Arabic no-space clusters:** when the preceding segment ends with Arabic trailing punctuation (`:`, `.`, `،`, `؛`) and the next word contains Arabic script, merge them.
5. **Repeated-character runs:** single non-hyphen, non-em-dash characters repeated in a row (like `...` or `===`) merge into one unit.
6. **Left-sticky punctuation:** punctuation that attaches to the preceding word (`.`, `,`, `!`, `)`, `"`, `…`, closing quotes) merges with the preceding text. This is why "better." is measured as one unit.
After the primary loop, two post-loop passes handle forward-sticky behavior:
- **Escaped-quote cluster merging:** a forward pass attaching `\"` sequences to their neighbors.
- **Forward-sticky carry:** a reverse pass moving opening brackets and quotes (`(`, `「`, `"`) to the front of the next text segment.
Both passes are followed by compaction to remove empty entries.
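As a toy illustration of the left-sticky rule, a merge pass over the split pieces might look like the sketch below; the punctuation set and the loop shape are assumptions, not Pretext's code:

```typescript
// Assumed subset of the left-sticky punctuation set.
const leftSticky = new Set(['.', ',', '!', ')', '"', '…', '」', '’'])

// Sketch: a piece made entirely of left-sticky punctuation joins the
// preceding segment instead of standing alone.
function mergeLeftSticky(pieces: string[]): string[] {
  const out: string[] = []
  for (const piece of pieces) {
    if (out.length > 0 && piece.length > 0 &&
        [...piece].every(c => leftSticky.has(c))) {
      out[out.length - 1] += piece
    } else {
      out.push(piece)
    }
  }
  return out
}
// mergeLeftSticky(['better', '.']) → ['better.']
```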
Post-Merge Passes: URLs, Numbers, and Punctuation Chains
After the primary merge cascade, a further chain of passes refines the segmentation:
const compacted = mergeGlueConnectedTextRuns(...)
const withMergedUrls = carryTrailingForwardStickyAcrossCJKBoundary(
mergeAsciiPunctuationChains(
splitHyphenatedNumericRuns(
mergeNumericRuns(
mergeUrlQueryRuns(
mergeUrlLikeRuns(compacted))))))
flowchart TD
A[Primary merge output] --> B[mergeGlueConnectedTextRuns]
B --> C[mergeUrlLikeRuns]
C --> D[mergeUrlQueryRuns]
D --> E[mergeNumericRuns]
E --> F[splitHyphenatedNumericRuns]
F --> G[mergeAsciiPunctuationChains]
G --> H[carryTrailingForwardStickyAcrossCJKBoundary]
H --> I[Arabic space+mark splitting]
I --> J[Final MergedSegmentation]
Glue-connected text runs: NBSP characters sandwiched between text segments are absorbed into one unbreakable unit. This matches how `word NBSP word` behaves in CSS.
URL-like run merging: Segments starting with a URL scheme (`https:`) or `www.` absorb all subsequent non-whitespace segments up to the first `?`. A second pass (mergeUrlQueryRuns) handles the query portion. This ensures `https://example.com/path?q=1` remains one breakable segment rather than fragmenting at slashes.
Numeric run merging: Segments like `7:00` or `3/4`, where digits are connected by joiner characters (`:`, `-`, `/`, `×`, `,`, `.`, `+`), merge into one unit. But hyphenated numeric runs like `2024-01-15` are then split back at hyphens to allow line breaking at date separators.
ASCII punctuation chains: Word-like segments connected by trailing commas or colons (like `item1,item2,item3`) merge into one unit, matching browser behavior.
A final Arabic-specific pass splits leading space+combining-marks so that the space stays a break opportunity while the marks attach to the following Arabic word.
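The numeric-run behavior can be sketched like this; the joiner set comes from the text above, but where the hyphen attaches after the split is an assumption:

```typescript
// Assumed joiner set for numeric runs (from the description above).
const numericJoiners = new Set([':', '-', '/', '×', ',', '.', '+'])

function isNumericRun(s: string): boolean {
  return /^\d/.test(s) &&
    [...s].every(c => /\d/.test(c) || numericJoiners.has(c))
}

// Sketch: split a hyphenated numeric run back into pieces. Keeping
// each hyphen attached to the preceding piece (an assumption) puts
// the break opportunity after the separator: '2024-' | '01-' | '15'.
function splitHyphenatedNumericRun(s: string): string[] {
  if (!isNumericRun(s) || !s.includes('-')) return [s]
  return s.split(/(?<=-)/)
}
```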
Measurement: Canvas, CJK Splitting, and Caching
Once analyzeText() returns the TextAnalysis, measureAnalysis() in layout.ts iterates the segments and builds the parallel arrays that the line walker will consume.
The measurement loop handles each segment type differently:
Soft hyphens: Zero width normally, but discretionaryHyphenWidth (the width of `-`) is stored in both fit and paint advances so the line walker can account for the visible hyphen if it chooses this break.
Hard breaks and tabs: Zero width — the line walker handles their special behavior.
CJK text segments: This is where the most interesting splitting happens. When a text segment contains CJK characters, it's split into individual graphemes using Intl.Segmenter with grapheme granularity. But the split respects kinsoku rules — opening brackets stay with the next grapheme, closing punctuation stays with the preceding one:
if (
kinsokuEnd.has(unitText) || // Opening bracket stays with next
kinsokuStart.has(grapheme) || // Closing punct stays with prev
leftStickyPunctuation.has(grapheme) // Period/comma stays with prev
) {
unitText += grapheme
continue
}
Regular text: Measured via getSegmentMetrics(), which uses the shared Canvas context. For word-like segments with multiple graphemes, per-grapheme widths are pre-computed for overflow-wrap: break-word support. On Safari, cumulative prefix widths are also computed since Safari's text shaping produces different results than summing individual grapheme widths.
lineEndFitAdvance vs lineEndPaintAdvance: For spaces and zero-width breaks, the fit advance is zero (trailing whitespace doesn't contribute to line-fit decisions) but the paint advance varies (collapsible spaces are invisible, preserved spaces are visible). We'll explore this critical distinction in Part 3.
flowchart TD
A[TextAnalysis segment] --> B{Segment kind?}
B -->|soft-hyphen| C["width=0, fitAdvance=hyphenWidth"]
B -->|hard-break/tab| D["width=0, advance=0"]
B -->|"text (CJK)"| E[Split into graphemes]
E --> F[Apply kinsoku merging]
F --> G[Measure each unit]
B -->|"text (other)"| H[Measure whole segment]
H --> I{word-like + multi-grapheme?}
I -->|yes| J[Pre-compute grapheme widths]
I -->|no| K[Store null for breakableWidths]
G --> L[Push to parallel arrays]
J --> L
K --> L
C --> L
D --> L
Each measured segment is pushed via pushMeasuredSegment(), which appends to all parallel arrays simultaneously and also flips the simpleLineWalkFastPath flag to false if any non-simple segment kinds are encountered.
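A rough shape for that helper is sketched below; the field names and the definition of "simple" kinds are assumptions, since Pretext's actual arrays differ:

```typescript
// Sketch of parallel-array assembly: every per-segment property
// lands at the same index across all arrays. (Assumed field names.)
interface PreparedArrays {
  texts: string[]
  widths: number[]
  breakKinds: string[]
  simpleLineWalkFastPath: boolean
}

function pushMeasuredSegment(
  arrays: PreparedArrays, text: string, width: number, kind: string
): void {
  arrays.texts.push(text)
  arrays.widths.push(width)
  arrays.breakKinds.push(kind)
  // Any non-simple kind (here assumed to be anything other than
  // plain text or a collapsible space) disables the fast-path walker.
  if (kind !== 'text' && kind !== 'space') {
    arrays.simpleLineWalkFastPath = false
  }
}
```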
Tip: The segment metric cache is keyed by `(font, segment_text)` and shared across all `prepare()` calls with the same font. If your app renders 1,000 chat messages in the same font, common words like "the" are measured only once via Canvas. Call `clearCache()` only when changing fonts entirely.
From Analysis to Prepared Handle
The final steps wire up the analysis-level chunk boundaries to the prepared-level segment indices (since CJK splitting can expand one analysis segment into multiple prepared segments) and optionally compute bidi levels.
The chunk mapping is handled by mapAnalysisChunksToPreparedChunks(), which translates hard-break boundaries from analysis-space indices to prepared-space indices using lookup arrays built during the measurement loop.
Bidi levels are computed only on the rich path (prepareWithSegments), and only when the text actually contains right-to-left characters — the computeBidiLevels() function returns null early if it finds no R, AL, or AN characters, avoiding work for pure-LTR text.
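That early-out can be approximated with a single regex test; the ranges below are a loose approximation of the R/AL/AN bidi character classes, not Pretext's exact check:

```typescript
// Sketch: scan for any character in the RTL-ish Unicode blocks
// (Hebrew, Arabic, and related presentation forms) before doing
// real bidi work. (Approximate ranges, assumed for illustration.)
const rtlRe = /[\u0590-\u08FF\uFB1D-\uFDFF\uFE70-\uFEFF]/

function mightNeedBidi(text: string): boolean {
  return rtlRe.test(text)
}
```

Pure-LTR text fails the test immediately, so most Latin-script inputs skip bidi resolution entirely.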
Looking Ahead
We've now traced the full path from raw string to parallel arrays: normalization → segmentation → break-kind classification → merge cascade → post-merge passes → Canvas measurement → CJK splitting → array assembly. The result is a PreparedText handle containing everything the line-breaking engine needs.
In Part 3, we'll enter that engine — the hot-path code in line-break.ts that walks these arrays with pure arithmetic to produce line counts in microseconds. We'll see why the simple vs full walker split matters, how trailing whitespace hangs past the line edge, and how soft hyphens interact with overflow-wrap breaking.