Browser Quirks, Emoji Correction, and World Script Support

Advanced

Prerequisites

  • Article 1: Architecture and the Two-Phase Model
  • Article 2: Text Analysis Pipeline
  • Article 3: Line Breaking Engine
  • Basic Unicode knowledge (bidi, grapheme clusters, emoji presentation)

The first three articles traced the pipeline from architecture through analysis to line breaking, treating browser differences as parameters (epsilon values, boolean flags). This article examines those differences directly: how they're detected, why they exist, and what Pretext does to maintain line-count parity across Chrome, Safari, and Firefox.

We'll also look at the internationalization challenges — the bidi implementation inherited from pdf.js, the emoji correction system that compensates for Canvas/DOM width discrepancies, and how CJK, Arabic, Thai, and Myanmar each flow through the pipeline with their own script-specific rules.

Engine Profile Detection

Pretext identifies the browser engine by parsing the user agent string, then sets four behavioral parameters (one epsilon and three boolean flags) that propagate through the entire pipeline:

src/measurement.ts#L11-L16

export type EngineProfile = {
  lineFitEpsilon: number
  carryCJKAfterClosingQuote: boolean
  preferPrefixWidthsForBreakableRuns: boolean
  preferEarlySoftHyphenBreak: boolean
}

The detection uses navigator.userAgent and navigator.vendor, which is deliberately low-tech:

src/measurement.ts#L65-L101

flowchart TD
    A[navigator.userAgent] --> B{vendor = Apple<br/>+ Safari/ present<br/>+ no Chrome/Chromium?}
    B -->|yes| C[Safari/WebKit]
    B -->|no| D{UA contains<br/>Chrome/ or Chromium/<br/>or CriOS/ or Edg/?}
    D -->|yes| E[Chromium]
    D -->|no| F[Gecko/Firefox or fallback]

    C --> G["lineFitEpsilon: 1/64<br/>carryCJK: false<br/>prefixWidths: true<br/>earlySoftHyphen: true"]
    E --> H["lineFitEpsilon: 0.005<br/>carryCJK: true<br/>prefixWidths: false<br/>earlySoftHyphen: false"]
    F --> I["lineFitEpsilon: 0.005<br/>carryCJK: false<br/>prefixWidths: false<br/>earlySoftHyphen: false"]

Why UA sniffing instead of feature detection? Because these aren't features — they're subtle behavioral differences in how each engine implements text layout. There's no reliable way to detect whether a browser uses 1/64 fixed-point precision for line fitting, or whether it carries CJK characters after closing quotes, without actually performing layout and comparing results. UA sniffing is the pragmatic choice.

The engine profile is cached as a module-level singleton and retrieved via getEngineProfile(). Server-side environments (where navigator is undefined) get the Gecko defaults.
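The detection, branching, and caching described above can be sketched as follows. The profile shape, flag values, and the `getEngineProfile()` name mirror the article; the UA patterns and parsing details are illustrative, not Pretext's exact code.

```typescript
// Sketch of UA-based engine detection with a cached module-level profile.
// Values follow the article's flowchart; parsing details are assumptions.
type EngineProfile = {
  lineFitEpsilon: number
  carryCJKAfterClosingQuote: boolean
  preferPrefixWidthsForBreakableRuns: boolean
  preferEarlySoftHyphenBreak: boolean
}

const GECKO_DEFAULTS: EngineProfile = {
  lineFitEpsilon: 0.005,
  carryCJKAfterClosingQuote: false,
  preferPrefixWidthsForBreakableRuns: false,
  preferEarlySoftHyphenBreak: false,
}

export function detectProfile(ua: string, vendor: string): EngineProfile {
  // Safari: Apple vendor, "Safari/" token, and no Chromium markers.
  const isSafari =
    vendor === 'Apple Computer, Inc.' &&
    /Safari\//.test(ua) &&
    !/Chrome|Chromium/.test(ua)
  if (isSafari) {
    return {
      lineFitEpsilon: 1 / 64, // WebKit's fixed-point line-fit precision
      carryCJKAfterClosingQuote: false,
      preferPrefixWidthsForBreakableRuns: true,
      preferEarlySoftHyphenBreak: true,
    }
  }
  if (/Chrome\/|Chromium\/|CriOS\/|Edg\//.test(ua)) {
    return { ...GECKO_DEFAULTS, carryCJKAfterClosingQuote: true }
  }
  return GECKO_DEFAULTS // Gecko/Firefox, or unknown-engine fallback
}

let cachedProfile: EngineProfile | null = null

export function getEngineProfile(): EngineProfile {
  if (cachedProfile) return cachedProfile
  // Server-side (no navigator, or a non-browser UA): Gecko defaults.
  const nav = (globalThis as any).navigator
  cachedProfile =
    nav && nav.userAgent
      ? detectProfile(nav.userAgent, nav.vendor ?? '')
      : GECKO_DEFAULTS
  return cachedProfile
}
```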

The Emoji Correction System

One of the more surprising cross-browser issues is that on Chrome and Firefox on macOS, Canvas measureText() and DOM layout report different widths for emoji. The Canvas width is consistently inflated by a constant amount per emoji grapheme at font sizes below ~24px.

The correction mechanism works in three steps:

Step 1: Detection. For each font, measure 😀 via both Canvas and DOM. If the Canvas width exceeds the DOM width by more than 0.5px, record the difference:

src/measurement.ts#L123-L151

const canvasW = ctx.measureText('\u{1F600}').width
// ...
const domW = span.getBoundingClientRect().width
if (canvasW - domW > 0.5) {
  correction = canvasW - domW
}

This is the only DOM read in the entire measurement system, and it's cached per font — so it happens once per font size, not once per text block.
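The per-font caching can be sketched as follows, with the two measurement callbacks injected so the flow is visible outside a browser. The function name mirrors the article; its real signature in measurement.ts is an assumption.

```typescript
// Illustrative per-font cache for the emoji correction. In the real
// system, measureCanvas is Canvas measureText and measureDOM is a span's
// getBoundingClientRect width; here they are injected for testability.
const emojiCorrectionCache = new Map<string, number>()

export function getEmojiCorrection(
  font: string,
  measureCanvas: (text: string) => number,
  measureDOM: (text: string) => number,
): number {
  const cached = emojiCorrectionCache.get(font)
  if (cached !== undefined) return cached
  const canvasW = measureCanvas('\u{1F600}')
  const domW = measureDOM('\u{1F600}') // the system's only DOM read
  // Only record a correction when the discrepancy is meaningful (> 0.5px).
  const correction = canvasW - domW > 0.5 ? canvasW - domW : 0
  emojiCorrectionCache.set(font, correction)
  return correction
}
```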

Step 2: Per-segment correction. When measuring any segment, if emoji correction is active for the font, the corrected width is:

src/measurement.ts#L169-L172

export function getCorrectedSegmentWidth(seg, metrics, emojiCorrection) {
  if (emojiCorrection === 0) return metrics.width
  return metrics.width - getEmojiCount(seg, metrics) * emojiCorrection
}

Step 3: Emoji counting. The emoji count per segment is computed lazily via grapheme segmentation and cached on the SegmentMetrics object:

src/measurement.ts#L152-L167

Safari doesn't need correction — its Canvas and DOM emoji widths agree (both wider than fontSize), so the correction is 0.

flowchart TD
    A["prepare() called"] --> B{Text may contain emoji?}
    B -->|no| C["emojiCorrection = 0"]
    B -->|yes| D["getEmojiCorrection(font, fontSize)"]
    D --> E{Canvas 😀 width > DOM 😀 width + 0.5?}
    E -->|yes| F["correction = canvasW - domW"]
    E -->|no| G["correction = 0"]
    F --> H["Apply per-segment:<br/>width - emojiCount × correction"]
    G --> I["Use Canvas width directly"]
    C --> I

Tip: The emoji detection fast path uses textMayContainEmoji() — a regex test for emoji-related Unicode properties. If the entire text contains no emoji indicators, the correction system is bypassed entirely, avoiding the DOM read and the per-segment counting.

Segment Metric Cache and Prefix Widths

The segment metric cache is a two-level Map<font, Map<segment, SegmentMetrics>> stored at module scope:

src/measurement.ts#L19

const segmentMetricCaches = new Map<string, Map<string, SegmentMetrics>>()

The SegmentMetrics type carries more than just width:

src/measurement.ts#L3-L9

export type SegmentMetrics = {
  width: number
  containsCJK: boolean
  emojiCount?: number                    // Lazily populated
  graphemeWidths?: number[] | null       // Lazily populated
  graphemePrefixWidths?: number[] | null // Lazily populated
}

The lazy fields are populated on first access and cached on the same object — a form of memoization at the field level. This avoids computing grapheme widths for segments that never need overflow-wrap breaking.
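The two-level lookup can be sketched as follows, with the width measurement injected for testability. The cache shape follows the article; the narrow CJK regex is a stand-in for the much broader isCJK() described later.

```typescript
// Sketch of the two-level metric cache: outer key is the font string,
// inner key is the segment text. Metrics objects are created once and
// reused, so lazily added fields stick to the cached object.
type SegmentMetrics = { width: number; containsCJK: boolean }

const segmentMetricCaches = new Map<string, Map<string, SegmentMetrics>>()

export function getSegmentMetrics(
  font: string,
  segment: string,
  measureWidth: (font: string, text: string) => number,
): SegmentMetrics {
  let inner = segmentMetricCaches.get(font)
  if (!inner) {
    inner = new Map()
    segmentMetricCaches.set(font, inner)
  }
  let metrics = inner.get(segment)
  if (!metrics) {
    metrics = {
      width: measureWidth(font, segment),
      // Narrow check for the sketch; the real isCJK covers many more blocks.
      containsCJK: /[\u3040-\u30ff\u4e00-\u9fff]/.test(segment),
    }
    inner.set(segment, metrics)
  }
  return metrics
}
```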

erDiagram
    FONT_STRING ||--o{ SEGMENT_STRING : "outer Map"
    SEGMENT_STRING ||--|| SEGMENT_METRICS : "inner Map"
    SEGMENT_METRICS {
        number width
        boolean containsCJK
        number emojiCount "optional"
        numberArray graphemeWidths "optional"
        numberArray graphemePrefixWidths "optional"
    }

Safari's prefix-width measurement (getSegmentGraphemePrefixWidths()) deserves special attention. Instead of measuring each grapheme individually and summing, it measures cumulative prefixes: "h", "he", "hel", "hell", "hello". The per-grapheme width is then computed as the difference between consecutive prefix widths.

src/measurement.ts#L193-L211

Why? Because text shaping can produce different widths when characters are adjacent versus isolated — ligatures, kerning, and contextual alternates all affect the result. Prefix widths capture these effects, making sub-word breaking more accurate on Safari. Chrome and Firefox are consistent enough that individual grapheme widths suffice.
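The prefix-then-diff idea can be sketched in a few lines; the function name and the injected measure callback are illustrative, not the measurement.ts signature.

```typescript
// Sketch of prefix-width measurement: measure cumulative prefixes so
// ligatures, kerning, and contextual alternates are captured, then diff
// consecutive prefix widths to recover per-grapheme widths.
export function graphemePrefixWidths(
  graphemes: string[],
  measure: (text: string) => number,
): { prefixWidths: number[]; graphemeWidths: number[] } {
  const prefixWidths: number[] = []
  let prefix = ''
  for (const g of graphemes) {
    prefix += g
    prefixWidths.push(measure(prefix)) // "h", "he", "hel", ...
  }
  const graphemeWidths = prefixWidths.map(
    (w, i) => (i === 0 ? w : w - prefixWidths[i - 1]),
  )
  return { prefixWidths, graphemeWidths }
}
```

With a shaping-sensitive `measure`, the diffs give each grapheme its in-context width, which is what sub-word breaking needs.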

Simplified Bidi: UAX #9 from pdf.js

The bidi implementation in bidi.ts is a simplified version of the Unicode Bidirectional Algorithm (UAX #9), originally from pdf.js and adapted via Sebastian Markbåge's text-layout research.

src/bidi.ts#L1-L6

The implementation classifies characters into bidi types (L, R, AL, AN, EN, etc.) using two lookup tables — one for the Basic Latin range (0x00–0xFF) and one for Arabic (0x0600–0x06FF). Characters outside these ranges use simple range checks:

src/bidi.ts#L57-L63

function classifyChar(charCode) {
  if (charCode <= 0x00ff) return baseTypes[charCode] // Basic Latin table
  if (0x0590 <= charCode && charCode <= 0x05f4) return 'R' // Hebrew
  if (0x0600 <= charCode && charCode <= 0x06ff) return arabicTypes[charCode & 0xff]
  if (0x0700 <= charCode && charCode <= 0x08ac) return 'AL' // Syriac through Arabic Extended-A
  return 'L'
}

The level computation implements W1-W7 (weak type resolution) and N1-N2 (neutral type resolution) rules, then I1-I2 for level assignment. It uses a simple heuristic for the paragraph direction: when any bidi characters (R, AL, AN) are present, the start level is set to 1 (RTL). If none are present, the function returns null early.

src/bidi.ts#L65-L162
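The early-exit heuristic can be sketched as follows. The `classify` helper here is a tiny stand-in for the real lookup tables; the null-on-pure-LTR behavior follows the article.

```typescript
// Sketch of the paragraph-direction heuristic: if any strong RTL type
// (R, AL, AN) appears, start at embedding level 1 (RTL paragraph);
// otherwise return null early, since pure-LTR text needs no levels.
function classify(code: number): string {
  if (0x0590 <= code && code <= 0x05f4) return 'R' // Hebrew
  if (0x0600 <= code && code <= 0x06ff) return 'AL' // Arabic (simplified)
  return 'L'
}

export function startLevel(text: string): number | null {
  for (const ch of text) {
    const t = classify(ch.codePointAt(0)!)
    if (t === 'R' || t === 'AL' || t === 'AN') return 1
  }
  return null // caller skips the W/N/I rule passes entirely
}
```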

A critical design choice: bidi levels are metadata only. The line-breaking engine never reads them. They exist solely for consumers of the rich prepareWithSegments() path who need to render text with correct visual ordering. The opaque prepare() path skips bidi computation entirely, and computeSegmentLevels() maps per-character levels to per-segment levels only on the rich path.

src/bidi.ts#L164-L173

World Script Support: CJK, Arabic, Thai, Myanmar

Each major script family flows through the pipeline differently. Here's a summary of the script-specific behavior:

| Script | Segmentation | Analysis Merges | Measurement | Line Breaking |
|---|---|---|---|---|
| CJK | Intl.Segmenter groups by word | Kinsoku start/end merges; closing-quote carry (Chromium) | Split to individual graphemes with kinsoku re-merging | Break between any two graphemes |
| Arabic | Intl.Segmenter for words | No-space trailing punctuation clusters; space+mark splitting | Measured as units (shaping-sensitive) | Standard word-boundary breaking |
| Thai | Intl.Segmenter (dictionary-based) | No special merges | Measured as segments | Break at segmenter boundaries |
| Myanmar | Intl.Segmenter for words | Medial-glue merging (U+104F) | Measured as units | Break at merged segment boundaries |

The isCJK() function covers an extensive range of CJK-related blocks — not just CJK Unified Ideographs but also Hiragana, Katakana, Hangul, CJK Compatibility, Fullwidth Forms, and multiple extension blocks including Supplementary Ideographs up to Plane 3:

src/analysis.ts#L105-L127

The kinsoku sets define exactly which characters are prohibited at line start (closing punctuation such as 」 and 。) and at line end (opening punctuation such as 「 and （):

src/analysis.ts#L129-L172
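A toy version of the kinsoku check makes the rule concrete; the sets here are far smaller than the real ones in analysis.ts, and the function name is illustrative.

```typescript
// Minimal kinsoku sketch: a line must not begin with a closing character
// and must not end with an opening one. Real sets cover many more
// punctuation, small-kana, and prolonged-sound characters.
const prohibitedAtLineStart = new Set([...'」』）、。！？'])
const prohibitedAtLineEnd = new Set([...'「『（'])

export function canBreakBetween(prev: string, next: string): boolean {
  if (prohibitedAtLineStart.has(next)) return false // would start the line
  if (prohibitedAtLineEnd.has(prev)) return false // would end the line
  return true
}
```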

For Arabic, the pipeline must handle a subtle interaction: spaces followed by combining marks (diacritics) that visually attach to the next word. The post-merge pass in buildMergedSegmentation() detects " " + marks before Arabic text and splits the space from the marks, keeping the space as a break opportunity while moving the marks to the next word.

src/analysis.ts#L937-L953
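The split can be sketched as a post-pass over segment strings. The function name is hypothetical, and the Mn-category test for combining marks is an assumption about how the real pass identifies them.

```typescript
// Illustrative post-merge pass: when a segment is a space followed by
// combining marks that visually attach to the next (Arabic) word, split
// it so the space stays a break opportunity and the marks move forward.
const spacePlusMarksRe = /^\u0020(\p{Mn}+)$/u // assumed mark test

export function splitSpacePlusMarks(segments: string[]): string[] {
  const out: string[] = []
  for (let i = 0; i < segments.length; i++) {
    const m = segments[i].match(spacePlusMarksRe)
    if (m && i + 1 < segments.length) {
      out.push(' ') // space remains a break opportunity
      out.push(m[1] + segments[i + 1]) // marks glue onto the next word
      i++ // consume the following word
    } else {
      out.push(segments[i])
    }
  }
  return out
}
```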

Tip: If you're working with a specific locale, use setLocale() before prepare() to retarget the Intl.Segmenter. This affects how Thai, Khmer, Myanmar, and other dictionary-based segmenters produce word boundaries.

Looking Ahead

We've now covered the browser-facing infrastructure: engine detection, emoji correction, the metric cache, bidi levels, and per-script handling. Together with the architecture (Part 1), analysis pipeline (Part 2), and line walker (Part 3), we have a complete picture of how text flows from raw string to line counts.

In Part 5, we'll shift perspective from the library internals to the consumer side: how the rich layout APIs power real applications — chat bubble shrinkwrap using binary search, editorial obstacle routing with layoutNextLine(), and the wrap-geometry system that rasterizes SVG alpha to extract polygon hulls for text flow.