Browser Quirks, Emoji Correction, and World Script Support
Prerequisites
- Article 1: Architecture and the Two-Phase Model
- Article 2: Text Analysis Pipeline
- Article 3: Line Breaking Engine
- Basic Unicode knowledge (bidi, grapheme clusters, emoji presentation)
The first three articles traced the pipeline from architecture through analysis to line breaking, treating browser differences as parameters (epsilon values, boolean flags). This article examines those differences directly: how they're detected, why they exist, and what Pretext does to maintain line-count parity across Chrome, Safari, and Firefox.
We'll also look at the internationalization challenges — the bidi implementation inherited from pdf.js, the emoji correction system that compensates for Canvas/DOM width discrepancies, and how CJK, Arabic, Thai, and Myanmar each flow through the pipeline with their own script-specific rules.
Engine Profile Detection
Pretext identifies the browser engine via user agent string parsing and sets four behavioral flags that propagate through the entire pipeline:
```ts
export type EngineProfile = {
  lineFitEpsilon: number
  carryCJKAfterClosingQuote: boolean
  preferPrefixWidthsForBreakableRuns: boolean
  preferEarlySoftHyphenBreak: boolean
}
```
The detection uses navigator.userAgent and navigator.vendor, which is deliberately low-tech:
```mermaid
flowchart TD
  A[navigator.userAgent] --> B{vendor = Apple<br/>+ Safari/ present<br/>+ no Chrome/Chromium?}
  B -->|yes| C[Safari/WebKit]
  B -->|no| D{UA contains<br/>Chrome/ or Chromium/<br/>or CriOS/ or Edg/?}
  D -->|yes| E[Chromium]
  D -->|no| F[Gecko/Firefox or fallback]
  C --> G["lineFitEpsilon: 1/64<br/>carryCJK: false<br/>prefixWidths: true<br/>earlySoftHyphen: true"]
  E --> H["lineFitEpsilon: 0.005<br/>carryCJK: true<br/>prefixWidths: false<br/>earlySoftHyphen: false"]
  F --> I["lineFitEpsilon: 0.005<br/>carryCJK: false<br/>prefixWidths: false<br/>earlySoftHyphen: false"]
```
Why UA sniffing instead of feature detection? Because these aren't features — they're subtle behavioral differences in how each engine implements text layout. There's no reliable way to detect whether a browser uses 1/64 fixed-point precision for line fitting, or whether it carries CJK characters after closing quotes, without actually performing layout and comparing results. UA sniffing is the pragmatic choice.
The engine profile is cached as a module-level singleton and retrieved via getEngineProfile(). Server-side environments (where navigator is undefined) get the Gecko defaults.
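The branching in the flowchart can be sketched as follows. This is an illustrative re-implementation, not Pretext's actual source: the function name `detectProfile` and the explicit `ua`/`vendor` parameters are assumptions for testability, while the flag values come from the flowchart above.

```ts
type EngineProfile = {
  lineFitEpsilon: number
  carryCJKAfterClosingQuote: boolean
  preferPrefixWidthsForBreakableRuns: boolean
  preferEarlySoftHyphenBreak: boolean
}

// Illustrative sketch of the UA-sniffing branches shown in the flowchart.
function detectProfile(ua: string, vendor: string): EngineProfile {
  const isWebKit =
    vendor.includes('Apple') &&
    /Safari\//.test(ua) &&
    !/Chrome|Chromium/.test(ua)
  if (isWebKit) {
    return {
      lineFitEpsilon: 1 / 64,
      carryCJKAfterClosingQuote: false,
      preferPrefixWidthsForBreakableRuns: true,
      preferEarlySoftHyphenBreak: true,
    }
  }
  const isChromium = /Chrome\/|Chromium\/|CriOS\/|Edg\//.test(ua)
  // Chromium and Gecko share everything except the CJK carry flag.
  return {
    lineFitEpsilon: 0.005,
    carryCJKAfterClosingQuote: isChromium,
    preferPrefixWidthsForBreakableRuns: false,
    preferEarlySoftHyphenBreak: false,
  }
}
```

Passing the UA and vendor in as parameters (rather than reading `navigator` directly) keeps the branching testable outside a browser.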
The Emoji Correction System
One of the more surprising cross-browser issues is that Canvas measureText() and DOM layout report different widths for emoji on Chrome and Firefox on macOS. The Canvas width is consistently inflated by a constant per emoji grapheme at font sizes below ~24px.
The correction mechanism works in three steps:
Step 1: Detection. For each font, measure 😀 via both Canvas and DOM. If the Canvas width exceeds the DOM width by more than 0.5px, record the difference:
```ts
const canvasW = ctx.measureText('\u{1F600}').width
// ...
const domW = span.getBoundingClientRect().width
if (canvasW - domW > 0.5) {
  correction = canvasW - domW
}
```
This is the only DOM read in the entire measurement system, and it's cached per font — so it happens once per font size, not once per text block.
Step 2: Per-segment correction. When measuring any segment, if emoji correction is active for the font, the corrected width is:
```ts
export function getCorrectedSegmentWidth(seg, metrics, emojiCorrection) {
  if (emojiCorrection === 0) return metrics.width
  return metrics.width - getEmojiCount(seg, metrics) * emojiCorrection
}
```
Step 3: Emoji counting. The emoji count per segment is computed lazily via grapheme segmentation and cached on the SegmentMetrics object.
Safari doesn't need correction — its Canvas and DOM emoji widths agree (both wider than fontSize), so the correction is 0.
```mermaid
flowchart TD
  A["prepare() called"] --> B{Text may contain emoji?}
  B -->|no| C["emojiCorrection = 0"]
  B -->|yes| D["getEmojiCorrection(font, fontSize)"]
  D --> E{Canvas 😀 width > DOM 😀 width + 0.5?}
  E -->|yes| F["correction = canvasW - domW"]
  E -->|no| G["correction = 0"]
  F --> H["Apply per-segment:<br/>width - emojiCount × correction"]
  G --> I["Use Canvas width directly"]
  C --> I
```
Tip: The emoji detection fast path uses `textMayContainEmoji()`, a regex test for emoji-related Unicode properties. If the entire text contains no emoji indicators, the correction system is bypassed entirely, avoiding the DOM read and the per-segment counting.
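The exact regex isn't shown in the article; a plausible version of the fast path, assuming Unicode property escapes (the real test may cover additional properties):

```ts
// Extended_Pictographic catches most emoji directly; VS-16 (U+FE0F) and
// the keycap combiner (U+20E3) catch text-default characters that only
// become emoji in sequences.
const emojiIndicator = /\p{Extended_Pictographic}|\uFE0F|\u20E3/u

function textMayContainEmoji(text: string): boolean {
  return emojiIndicator.test(text)
}
```

A single regex test over the whole text is cheap relative to grapheme segmentation, which is the point of the fast path.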
Segment Metric Cache and Prefix Widths
The segment metric cache is a two-level Map<font, Map<segment, SegmentMetrics>> stored at module scope:
```ts
const segmentMetricCaches = new Map<string, Map<string, SegmentMetrics>>()
```
The SegmentMetrics type carries more than just width:
```ts
export type SegmentMetrics = {
  width: number
  containsCJK: boolean
  emojiCount?: number // Lazily populated
  graphemeWidths?: number[] | null // Lazily populated
  graphemePrefixWidths?: number[] | null // Lazily populated
}
```
The lazy fields are populated on first access and cached on the same object — a form of memoization at the field level. This avoids computing grapheme widths for segments that never need overflow-wrap breaking.
```mermaid
erDiagram
  FONT_STRING ||--o{ SEGMENT_STRING : "outer Map"
  SEGMENT_STRING ||--|| SEGMENT_METRICS : "inner Map"
  SEGMENT_METRICS {
    number width
    boolean containsCJK
    number emojiCount "optional"
    numberArray graphemeWidths "optional"
    numberArray graphemePrefixWidths "optional"
  }
```
Safari's prefix-width measurement (getSegmentGraphemePrefixWidths()) deserves special attention. Instead of measuring each grapheme individually and summing, it measures cumulative prefixes: "h", "he", "hel", "hell", "hello". The per-grapheme width is then computed as the difference between consecutive prefix widths.
Why? Because text shaping can produce different widths when characters are adjacent versus isolated — ligatures, kerning, and contextual alternates all affect the result. Prefix widths capture these effects, making sub-word breaking more accurate on Safari. Chrome and Firefox are consistent enough that individual grapheme widths suffice.
Simplified Bidi: UAX #9 from pdf.js
The bidi implementation in bidi.ts is a simplified version of the Unicode Bidirectional Algorithm (UAX #9), originally from pdf.js and adapted through Sebastian Markbåge's text-layout research.
The implementation classifies characters into bidi types (L, R, AL, AN, EN, etc.) using two lookup tables — one for the Latin-1 range (0x00–0xFF) and one for Arabic (0x0600–0x06FF). Characters outside these ranges use simple range checks:
```ts
function classifyChar(charCode) {
  if (charCode <= 0x00ff) return baseTypes[charCode]
  if (0x0590 <= charCode && charCode <= 0x05f4) return 'R'
  if (0x0600 <= charCode && charCode <= 0x06ff) return arabicTypes[charCode & 0xff]
  if (0x0700 <= charCode && charCode <= 0x08ac) return 'AL'
  return 'L'
}
```
The level computation implements W1-W7 (weak type resolution) and N1-N2 (neutral type resolution) rules, then I1-I2 for level assignment. It uses a simple heuristic for the paragraph direction: when any bidi characters (R, AL, AN) are present, the start level is set to 1 (RTL). If none are present, the function returns null early.
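The direction heuristic can be condensed as follows. This sketch substitutes coarse range checks for the full lookup tables, so it is an approximation of the R/AL/AN test, not the library's actual classifier:

```ts
// Coarse stand-in for the classifier: treat the Hebrew and Arabic
// blocks as strong RTL. The real tables are finer-grained.
function isRtlChar(code: number): boolean {
  return (0x0590 <= code && code <= 0x05f4) || (0x0600 <= code && code <= 0x08ac)
}

// Returns 1 (RTL start level) if any strong RTL character is present,
// or null to signal that no bidi processing is needed.
function paragraphStartLevel(text: string): 1 | null {
  for (const ch of text) {
    if (isRtlChar(ch.codePointAt(0)!)) return 1
  }
  return null
}
```

The early `null` return is what lets purely-LTR text skip level computation entirely.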
A critical design choice: bidi levels are metadata only. The line-breaking engine never reads them. They exist solely for consumers of the rich prepareWithSegments() path who need to render text with correct visual ordering. The opaque prepare() path skips bidi computation entirely, and computeSegmentLevels() maps per-character levels to per-segment levels only on the rich path.
World Script Support: CJK, Arabic, Thai, Myanmar
Each major script family flows through the pipeline differently. Here's a summary of the script-specific behavior:
| Script | Segmentation | Analysis Merges | Measurement | Line Breaking |
|---|---|---|---|---|
| CJK | Intl.Segmenter groups by word | Kinsoku start/end merges; closing-quote carry (Chromium) | Split to individual graphemes with kinsoku re-merging | Break between any two graphemes |
| Arabic | Intl.Segmenter for words | No-space trailing punctuation clusters; space+mark splitting | Measured as units (shaping-sensitive) | Standard word-boundary breaking |
| Thai | Intl.Segmenter (dictionary-based) | No special merges | Measured as segments | Break at segmenter boundaries |
| Myanmar | Intl.Segmenter for words | Medial-glue merging (U+104F) | Measured as units | Break at merged segment boundaries |
The isCJK() function covers an extensive range of CJK-related blocks — not just CJK Unified Ideographs but also Hiragana, Katakana, Hangul, CJK Compatibility, Fullwidth Forms, and multiple extension blocks including Supplementary Ideographs up to Plane 3.
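A representative subset of those range checks looks like this (an illustrative sketch; the library's actual range list is longer):

```ts
// Illustrative subset of CJK-related blocks; the real isCJK()
// covers additional compatibility and extension ranges.
function isCJK(code: number): boolean {
  return (
    (0x3040 <= code && code <= 0x30ff) || // Hiragana + Katakana
    (0x3400 <= code && code <= 0x4dbf) || // CJK Extension A
    (0x4e00 <= code && code <= 0x9fff) || // CJK Unified Ideographs
    (0xac00 <= code && code <= 0xd7af) || // Hangul Syllables
    (0xf900 <= code && code <= 0xfaff) || // CJK Compatibility Ideographs
    (0xff00 <= code && code <= 0xffef) || // Halfwidth and Fullwidth Forms
    (0x20000 <= code && code <= 0x3ffff)  // Extensions B+ (Planes 2-3)
  )
}
```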
The kinsoku sets define exactly which characters are prohibited at line start (closing punctuation like ), 。) and line end (opening punctuation like (, 「).
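An illustrative subset of such sets (these are common kinsoku members; the library's full sets are larger and the variable names here are assumptions):

```ts
// Characters that must not begin a line: closing punctuation, sentence
// terminators, small kana, and the prolonged sound mark.
const prohibitedAtLineStart = new Set([
  '、', '。', '，', '．', '）', ')', '」', '』', '！', '？', 'ー', 'っ', 'ゃ', 'ゅ', 'ょ',
])

// Characters that must not end a line: opening punctuation.
const prohibitedAtLineEnd = new Set(['（', '(', '「', '『', '【', '〈'])
```

During analysis, a grapheme in the start set is merged backward onto the previous segment, and one in the end set is merged forward, so neither can land on the wrong side of a break.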
For Arabic, the pipeline must handle a subtle interaction: spaces followed by combining marks (diacritics) that visually attach to the next word. The post-merge pass in buildMergedSegmentation() detects " " + marks before Arabic text and splits the space from the marks, keeping the space as a break opportunity while moving the marks to the next word.
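A sketch of that split, assuming combining marks are detected with the `\p{Mn}` property (the function name is illustrative; in the real pass the marks are then re-attached to the following Arabic word):

```ts
// If a segment is a single space followed by combining marks, split it:
// the space stays a break opportunity, the marks rejoin the next word.
function splitSpacePlusMarks(segment: string): [string, string] | null {
  const match = /^( )(\p{Mn}+)$/u.exec(segment)
  return match ? [match[1], match[2]] : null
}
```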
Tip: If you're working with a specific locale, call `setLocale()` before `prepare()` to retarget the `Intl.Segmenter`. This affects how Thai, Khmer, Myanmar, and other dictionary-based segmenters produce word boundaries.
Looking Ahead
We've now covered the browser-facing infrastructure: engine detection, emoji correction, the metric cache, bidi levels, and per-script handling. Together with the architecture (Part 1), analysis pipeline (Part 2), and line walker (Part 3), we have a complete picture of how text flows from raw string to line counts.
In Part 5, we'll shift perspective from the library internals to the consumer side: how the rich layout APIs power real applications — chat bubble shrinkwrap using binary search, editorial obstacle routing with layoutNextLine(), and the wrap-geometry system that rasterizes SVG alpha to extract polygon hulls for text flow.