Language Definitions: Monarch Grammars and the Lazy Loading System

Intermediate

Prerequisites

  • Article 1: Architecture and Project Overview
  • Understanding of tokenization and syntax highlighting concepts
  • Familiarity with dynamic import() and code splitting

Monaco Editor supports syntax highlighting for roughly 85 programming languages out of the box. Loading every grammar upfront would be wasteful, since most users need only a handful. The repository therefore implements a lazy loading system: language metadata (ID, extensions, aliases, MIME types) is registered eagerly, while the grammar itself is loaded only when a language is first used. This article traces the full lifecycle from language registration to tokenizer creation.

The Language Registration Framework

The heart of the lazy loading system lives in a single file: _.contribution.ts. Yes, the filename starts with an underscore — it's a convention signaling this is framework-level infrastructure, not a specific language.

src/languages/definitions/_.contribution.ts#L6-L18

import { languages, editor } from '../../editor';

interface ILang extends languages.ILanguageExtensionPoint {
    loader: () => Promise<ILangImpl>;
}

interface ILangImpl {
    conf: languages.LanguageConfiguration;
    language: languages.IMonarchLanguage;
}

const languageDefinitions: { [languageId: string]: ILang } = {};
const lazyLanguageLoaders: { [languageId: string]: LazyLanguageLoader } = {};

Two dictionaries — languageDefinitions and lazyLanguageLoaders — form the backbone. Each language registration stores its metadata in the first and gets a singleton loader in the second.

The ILang interface extends Monaco's ILanguageExtensionPoint (which carries the language ID, file extensions, aliases, and MIME types) with a loader function that returns a promise resolving to the actual grammar module. The ILangImpl interface specifies what that module must export: a conf (language configuration — brackets, comments, auto-closing pairs) and a language (the Monarch grammar itself).

The LazyLanguageLoader Class

The LazyLanguageLoader is a deceptively simple singleton cache:

src/languages/definitions/_.contribution.ts#L20-L53

class LazyLanguageLoader {
    public static getOrCreate(languageId: string): LazyLanguageLoader {
        if (!lazyLanguageLoaders[languageId]) {
            lazyLanguageLoaders[languageId] = new LazyLanguageLoader(languageId);
        }
        return lazyLanguageLoaders[languageId];
    }

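    // Declared here because the constructor assigns it and load() reads it.
    private readonly _languageId: string;
    private _loadingTriggered: boolean;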
    private _lazyLoadPromise: Promise<ILangImpl>;
    private _lazyLoadPromiseResolve!: (value: ILangImpl) => void;
    private _lazyLoadPromiseReject!: (err: any) => void;

    constructor(languageId: string) {
        this._languageId = languageId;
        this._loadingTriggered = false;
        this._lazyLoadPromise = new Promise((resolve, reject) => {
            this._lazyLoadPromiseResolve = resolve;
            this._lazyLoadPromiseReject = reject;
        });
    }

    public load(): Promise<ILangImpl> {
        if (!this._loadingTriggered) {
            this._loadingTriggered = true;
            languageDefinitions[this._languageId].loader().then(
                (mod) => this._lazyLoadPromiseResolve(mod),
                (err) => this._lazyLoadPromiseReject(err)
            );
        }
        return this._lazyLoadPromise;
    }
}

The pattern is worth noting: the promise is created in the constructor, but loading is triggered only on the first load() call. Every subsequent call returns the same promise, so even if several consumers request the same language at once, the grammar module is loaded only once.
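To make the load-once behavior concrete, here is a minimal, self-contained sketch of the same deferred-promise idea. The names LoadOnce, GrammarModule, and loaderCalls are hypothetical and exist only for this illustration; this is not the repository's class.

```typescript
// Hypothetical stand-in for a grammar module; not Monaco's real types.
interface GrammarModule {
    name: string;
}

class LoadOnce<T> {
    private triggered = false;
    private readonly loader: () => Promise<T>;
    private readonly promise: Promise<T>;
    private resolve!: (value: T) => void;
    private reject!: (err: unknown) => void;

    constructor(loader: () => Promise<T>) {
        this.loader = loader;
        // The promise exists from construction, but nothing settles it yet.
        this.promise = new Promise<T>((resolve, reject) => {
            this.resolve = resolve;
            this.reject = reject;
        });
    }

    load(): Promise<T> {
        if (!this.triggered) {
            // Only the first call starts the loader.
            this.triggered = true;
            this.loader().then(this.resolve, this.reject);
        }
        return this.promise;
    }
}

let loaderCalls = 0;
const demoGrammar = new LoadOnce<GrammarModule>(() => {
    loaderCalls++;
    return Promise.resolve({ name: "demo-grammar" });
});

// Constructing the wrapper must not start the load.
const callsBeforeLoad = loaderCalls;

// Two "simultaneous" consumers share one promise and one loader call.
const first = demoGrammar.load();
const second = demoGrammar.load();
```

Because both callers receive the very same promise object, a failed load is also shared: every consumer sees the same rejection, and this sketch makes no attempt to retry.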

The registerLanguage Function

The registerLanguage() function ties everything together:

src/languages/definitions/_.contribution.ts#L63-L80

export function registerLanguage(def: ILang): void {
    const languageId = def.id;

    languageDefinitions[languageId] = def;
    languages.register(def);

    const lazyLanguageLoader = LazyLanguageLoader.getOrCreate(languageId);
    languages.registerTokensProviderFactory(languageId, {
        create: async (): Promise<languages.IMonarchLanguage> => {
            const mod = await lazyLanguageLoader.load();
            return mod.language;
        }
    });
    languages.onLanguageEncountered(languageId, async () => {
        const mod = await lazyLanguageLoader.load();
        languages.setLanguageConfiguration(languageId, mod.conf);
    });
}

Three things happen at registration time:

  1. languages.register(def) — tells Monaco core that this language exists (metadata only)
  2. languages.registerTokensProviderFactory() — registers a factory that will create the Monarch tokenizer on demand
  3. languages.onLanguageEncountered() — sets up a callback to install the language configuration (brackets, comments, folding markers) when the language is first encountered

None of these steps triggers the actual grammar load. That happens only when Monaco core calls the factory's create() method or invokes the onLanguageEncountered callback. Both paths go through the same LazyLanguageLoader, so the module is fetched once.

sequenceDiagram
    participant App as Application
    participant Reg as registerLanguage()
    participant Core as Monaco Core
    participant Loader as LazyLanguageLoader
    participant Mod as Grammar Module

    App->>Reg: registerLanguage({ id: 'typescript', loader: ... })
    Reg->>Core: languages.register({ id: 'typescript', ... })
    Reg->>Core: registerTokensProviderFactory('typescript', factory)
    Reg->>Core: onLanguageEncountered('typescript', callback)
    Note over Core: No grammar loaded yet

    App->>Core: editor.createModel('...', 'typescript')
    Core->>Loader: factory.create()
    Loader->>Mod: import('./typescript')
    Mod-->>Loader: { conf, language }
    Loader-->>Core: language (Monarch grammar)
    Core->>Core: Create Monarch tokenizer
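The registration flow above can be modelled without Monaco at all. The sketch below uses hypothetical, simplified stand-ins (registerLanguageSketch, factories, onEncountered, and the LangDef and LangModule types are all invented for this example and are not Monaco APIs) to show that registering stores only metadata and callbacks, and that the loader runs once no matter which callback fires first.

```typescript
// Hypothetical, simplified stand-ins; these are NOT Monaco's real APIs.
type Grammar = { keywords: string[] };
type LangConf = { lineComment: string };
type LangModule = { conf: LangConf; language: Grammar };
type LangDef = { id: string; extensions: string[]; loader: () => Promise<LangModule> };

const events: string[] = [];
const metadata = new Map<string, LangDef>();
const factories = new Map<string, () => Promise<Grammar>>();
const onEncountered = new Map<string, () => Promise<void>>();
const installedConfs = new Map<string, LangConf>();

function registerLanguageSketch(def: LangDef): void {
    // Eager part: metadata only.
    metadata.set(def.id, def);
    events.push("registered " + def.id);

    // Lazy part: both callbacks share a single pending load.
    let pending: Promise<LangModule> | undefined;
    const loadOnce = (): Promise<LangModule> => {
        if (!pending) pending = def.loader();
        return pending;
    };
    factories.set(def.id, async () => (await loadOnce()).language);
    onEncountered.set(def.id, async () => {
        installedConfs.set(def.id, (await loadOnce()).conf);
    });
}

registerLanguageSketch({
    id: "demo",
    extensions: [".demo"],
    loader: () => {
        events.push("grammar loaded for demo");
        return Promise.resolve({ conf: { lineComment: "//" }, language: { keywords: ["let"] } });
    },
});

// Snapshot taken before anything asks for the language.
const eventsAfterRegistration = [...events];

// Simulate the core encountering the language and requesting a tokenizer.
const confInstalled = onEncountered.get("demo")!();
const grammarPromise = factories.get("demo")!();
```

After registration the event log contains only the "registered demo" entry. The "grammar loaded" entry appears once, even though two separate callbacks asked for the module.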

Anatomy of a Language Definition

Let's trace a concrete example. Here's the TypeScript language registration:

src/languages/definitions/typescript/register.ts#L6-L16

import { registerLanguage } from '../_.contribution';

registerLanguage({
    id: 'typescript',
    extensions: ['.ts', '.tsx', '.cts', '.mts'],
    aliases: ['TypeScript', 'ts', 'typescript'],
    mimetypes: ['text/typescript'],
    loader: (): Promise<any> => {
        return import('./typescript');
    }
});

The loader function uses a dynamic import() — this is the code-splitting boundary. When bundled with Rollup or webpack, the ./typescript module (which contains the full Monarch grammar) becomes a separate chunk that's only fetched when TypeScript is first needed.

The grammar module itself exports two things. The conf provides language configuration:

src/languages/definitions/typescript/typescript.ts#L8-L75

export const conf: languages.LanguageConfiguration = {
    wordPattern: /(-?\d*\.\d\w*)|([^\`\~\!\@\#\%\^\&\*\(\)\-\=\+\[\{\]\}\\\|\;\:\'\"\,\.\<\>\/\?\s]+)/g,
    comments: {
        lineComment: '//',
        blockComment: ['/*', '*/']
    },
    brackets: [
        ['{', '}'],
        ['[', ']'],
        ['(', ')']
    ],
    // ... auto-closing pairs, folding markers, etc.
};

And the language provides the Monarch grammar — a state machine for tokenization:

src/languages/definitions/typescript/typescript.ts#L77-L80

export const language = {
    defaultToken: 'invalid',
    tokenPostfix: '.ts',
    keywords: ['abstract', 'any', 'as', /* ... */],
    // ... remaining fields, including the tokenizer
};

The Monarch Grammar Format

Monarch is Monaco's built-in tokenizer format. It's a declarative state machine where each state contains an array of rules, and each rule is a regex pattern paired with an action (a token type and optionally a state transition).

src/languages/definitions/typescript/typescript.ts#L224-L255

tokenizer: {
    root: [[/[{}]/, 'delimiter.bracket'], { include: 'common' }],

    common: [
        [/#?[a-z_$][\w$]*/, {
            cases: {
                '@keywords': 'keyword',
                '@default': 'identifier'
            }
        }],
        [/[A-Z][\w\$]*/, 'type.identifier'],
        { include: '@whitespace' },
        [/\/(?=([^\\\\/]|\\.)+\/([dgimsuy]*)(\s*)(\.|;|,|\)|\]|\}|$))/,
            { token: 'regexp', bracket: '@open', next: '@regexp' }
        ],
        // ...
    ],
    // ... remaining states
}

stateDiagram-v2
    [*] --> root
    root --> common: include
    common --> whitespace: include @whitespace
    common --> regexp: /regex_start/
    regexp --> common: /regex_end/
    common --> string_double: opening "
    string_double --> common: closing "
    common --> string_single: opening '
    string_single --> common: closing '
    common --> string_backtick: opening backtick
    string_backtick --> common: closing backtick

Key Monarch concepts:

  • States: Named groups of rules (root, common, whitespace, regexp, string_double, string_single, string_backtick)
  • Includes: { include: 'common' } pulls in rules from another state
  • Cases: @keywords references the keywords array declared earlier, enabling compact rules
  • State transitions: next: '@regexp' switches the tokenizer to the regexp state
  • Token types: Strings like 'keyword', 'identifier', 'type.identifier' that map to CSS classes for coloring

Tip: The defaultToken: 'invalid' setting is a development aid. Any text that doesn't match any rule gets the invalid token type, making untokenized content visually obvious. Production grammars should eventually cover all cases, but this default helps you spot gaps during development.
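To see how states, cases, transitions, and defaultToken interact, here is a toy interpreter for a Monarch-shaped grammar. It is a deliberately reduced sketch: the ToyGrammar type, the tokenizeLine function, and the sample grammar are hypothetical, it only understands the @keywords and @default cases, and Monaco's actual Monarch engine supports far more than this.

```typescript
// A toy interpreter for a Monarch-shaped grammar. Hypothetical and heavily
// simplified; this is not how Monaco implements Monarch.
type Action = string | { token: string; next?: string } | { cases: Record<string, string> };
type Rule = [RegExp, Action];

interface ToyGrammar {
    defaultToken: string;
    keywords: string[];
    tokenizer: Record<string, Rule[]>;
}

interface ToyToken {
    text: string;
    type: string;
}

const toy: ToyGrammar = {
    defaultToken: "invalid",
    keywords: ["if", "else", "return"],
    tokenizer: {
        root: [
            [/[a-z_]\w*/, { cases: { "@keywords": "keyword", "@default": "identifier" } }],
            [/\d+/, "number"],
            [/\s+/, "white"],
            [/"/, { token: "string", next: "@string" }],
        ],
        string: [
            [/[^"]+/, "string"],
            [/"/, { token: "string", next: "@pop" }],
        ],
    },
};

function tokenizeLine(grammar: ToyGrammar, line: string): ToyToken[] {
    const tokens: ToyToken[] = [];
    const stack = ["root"]; // the top of the stack is the current state
    let pos = 0;
    while (pos < line.length) {
        const rules = grammar.tokenizer[stack[stack.length - 1]];
        const rest = line.slice(pos);
        let matched = false;
        for (const [pattern, action] of rules) {
            // Anchor each pattern at the current position; first match wins.
            const m = new RegExp("^(?:" + pattern.source + ")", pattern.flags).exec(rest);
            if (!m || m[0].length === 0) continue;
            let type: string;
            if (typeof action === "string") {
                type = action;
            } else if ("cases" in action) {
                const key = grammar.keywords.includes(m[0]) ? "@keywords" : "@default";
                type = action.cases[key];
            } else {
                type = action.token;
                if (action.next === "@pop") stack.pop();
                else if (action.next) stack.push(action.next.replace(/^@/, ""));
            }
            tokens.push({ text: m[0], type });
            pos += m[0].length;
            matched = true;
            break;
        }
        if (!matched) {
            // No rule applied: emit one character with the default token type.
            tokens.push({ text: line[pos], type: grammar.defaultToken });
            pos += 1;
        }
    }
    return tokens;
}

const toyTokens = tokenizeLine(toy, 'if x "hi" ?');
```

In this run the trailing ? matches no rule in the root state, so it is emitted with the invalid type. That is the kind of gap the defaultToken setting is meant to expose.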

The Barrel Registration Pattern

All ~85 language registrations are aggregated through a barrel import:

src/languages/definitions/register.all.ts#L6-L87

import './abap/register';
import './apex/register';
import './azcli/register';
// ... remaining language registrations
import './yaml/register';

Each import executes the registerLanguage() call, which, as we've seen, only stores metadata and sets up lazy factories. The total upfront cost of importing every registration file is negligible: no grammar modules are loaded and no tokenizers are created.

flowchart TD
    BARREL[register.all.ts] --> |import| TS[typescript/register.ts]
    BARREL --> |import| PY[python/register.ts]
    BARREL --> |import| RS[rust/register.ts]
    BARREL --> |import| DOT[... remaining languages]
    
    TS --> |"registerLanguage()"| META1["id: 'typescript'<br/>extensions: ['.ts', '.tsx']<br/>loader: () => import('./typescript')"]
    PY --> |"registerLanguage()"| META2["id: 'python'<br/>extensions: ['.py']<br/>loader: () => import('./python')"]
    
    META1 -.-> |on demand| GRAMMAR1[typescript.ts<br/>~300 lines of Monarch grammar]
    META2 -.-> |on demand| GRAMMAR2[python.ts<br/>~300 lines of Monarch grammar]

This pattern also enables selective imports. A consumer could bypass register.all.ts entirely and import only the specific language registrations they need:

import 'monaco-editor/esm/vs/languages/definitions/typescript/register';
import 'monaco-editor/esm/vs/languages/definitions/python/register';
// Only TypeScript and Python grammars will ever load

Tokenization Testing Framework

Every language grammar is backed by tokenization tests. The test runner in testRunner.ts provides the infrastructure:

src/languages/definitions/test/testRunner.ts#L12-L42

export interface IRelaxedToken {
    startIndex: number;
    type: string;
}

export interface ITestItem {
    line: string;
    tokens: IRelaxedToken[];
}

export function testTokenization(_language: string | string[], tests: ITestItem[][]): void {
    let languages: string[];
    if (typeof _language === 'string') {
        languages = [_language];
    } else {
        languages = _language;
    }
    let mainLanguage = languages[0];

    test(mainLanguage + ' tokenization', async () => {
        await Promise.all(languages.map((l) => loadLanguage(l)));
        await timeout(0);
        runTests(mainLanguage, tests);
    });
}

The test format pairs input source lines with expected token classifications. The loadLanguage() helper (also defined in _.contribution.ts) explicitly triggers the lazy loader and creates a temporary model to force tokenizer creation:

src/languages/definitions/_.contribution.ts#L55-L61

export async function loadLanguage(languageId: string): Promise<void> {
    await LazyLanguageLoader.getOrCreate(languageId).load();
    const model = editor.createModel('', languageId);
    model.dispose();
}

flowchart LR
    TEST[test file] --> |testTokenization| RUNNER[testRunner.ts]
    RUNNER --> |loadLanguage| LOADER[LazyLanguageLoader]
    LOADER --> |import| GRAMMAR[Grammar Module]
    RUNNER --> |editor.tokenize| CORE[Monaco Core]
    CORE --> |compare| EXPECTED[Expected Tokens]

The tests are run via Node.js's built-in test runner, as configured in package.json:

node --import tsx --import ./test/test-setup.mjs --test "src/languages/definitions/*/*.test.ts"

Each language has its own *.test.ts file alongside its grammar, making it easy to add tests when modifying a grammar.
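As a rough illustration of the test data format, the sketch below declares local copies of the two interfaces and one expected line. The token type names in the sample are assumptions chosen for illustration rather than values copied from the repository's tests, and tokenTexts is a hypothetical helper that shows how each startIndex delimits a token's text.

```typescript
// Local copies of the shapes shown above, so the sketch stands alone.
interface IRelaxedToken {
    startIndex: number;
    type: string;
}

interface ITestItem {
    line: string;
    tokens: IRelaxedToken[];
}

// Illustrative expectation; the token type names are assumptions.
const sample: ITestItem = {
    line: "var x = 1;",
    tokens: [
        { startIndex: 0, type: "keyword.ts" },
        { startIndex: 3, type: "" },
        { startIndex: 4, type: "identifier.ts" },
        { startIndex: 5, type: "" },
        { startIndex: 6, type: "delimiter.ts" },
        { startIndex: 7, type: "" },
        { startIndex: 8, type: "number.ts" },
        { startIndex: 9, type: "delimiter.ts" },
    ],
};

// Each token runs from its startIndex up to the next token's startIndex;
// the last token runs to the end of the line.
function tokenTexts(item: ITestItem): string[] {
    return item.tokens.map((token, i) => {
        const next = item.tokens[i + 1];
        const end = next ? next.startIndex : item.line.length;
        return item.line.slice(token.startIndex, end);
    });
}

const sampleTexts = tokenTexts(sample);
```

Tokens carry only a start position, so the expected list has to account for every character on the line, including whitespace runs.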

What's Next

Language definitions give Monaco syntax highlighting, but for four languages (TypeScript, CSS, HTML, and JSON) Monaco goes much further. In the next article, we'll explore the web worker architecture that powers full IntelliSense: completions, diagnostics, hover information, and formatting, all running off the main thread.