Language Definitions: Monarch Grammars and the Lazy Loading System
Prerequisites
- Article 1: Architecture and Project Overview
- Understanding of tokenization and syntax highlighting concepts
- Familiarity with dynamic import() and code splitting
Monaco Editor supports syntax highlighting for roughly 85 programming languages out of the box. Loading all 85 grammars upfront would be wasteful — most users only need a handful. So the repository implements a lazy loading system that registers language metadata eagerly (a few bytes per language) but defers the actual grammar loading until a language is first used. This article traces the full lifecycle from language registration to tokenizer creation.
The Language Registration Framework
The heart of the lazy loading system lives in a single file: _.contribution.ts. Yes, the filename starts with an underscore — it's a convention signaling this is framework-level infrastructure, not a specific language.
src/languages/definitions/_.contribution.ts#L6-L18
import { languages, editor } from '../../editor';
interface ILang extends languages.ILanguageExtensionPoint {
loader: () => Promise<ILangImpl>;
}
interface ILangImpl {
conf: languages.LanguageConfiguration;
language: languages.IMonarchLanguage;
}
const languageDefinitions: { [languageId: string]: ILang } = {};
const lazyLanguageLoaders: { [languageId: string]: LazyLanguageLoader } = {};
Two dictionaries — languageDefinitions and lazyLanguageLoaders — form the backbone. Each language registration stores its metadata in the first and gets a singleton loader in the second.
The ILang interface extends Monaco's ILanguageExtensionPoint (which carries the language ID, file extensions, aliases, and MIME types) with a loader function that returns a promise resolving to the actual grammar module. The ILangImpl interface specifies what that module must export: a conf (language configuration — brackets, comments, auto-closing pairs) and a language (the Monarch grammar itself).
The LazyLanguageLoader Class
The LazyLanguageLoader is a deceptively simple cache that holds one loader instance per language ID:
src/languages/definitions/_.contribution.ts#L20-L53
class LazyLanguageLoader {
public static getOrCreate(languageId: string): LazyLanguageLoader {
if (!lazyLanguageLoaders[languageId]) {
lazyLanguageLoaders[languageId] = new LazyLanguageLoader(languageId);
}
return lazyLanguageLoaders[languageId];
}
private readonly _languageId: string;
private _loadingTriggered: boolean;
private _lazyLoadPromise: Promise<ILangImpl>;
private _lazyLoadPromiseResolve!: (value: ILangImpl) => void;
private _lazyLoadPromiseReject!: (err: any) => void;
constructor(languageId: string) {
this._languageId = languageId;
this._loadingTriggered = false;
this._lazyLoadPromise = new Promise((resolve, reject) => {
this._lazyLoadPromiseResolve = resolve;
this._lazyLoadPromiseReject = reject;
});
}
public load(): Promise<ILangImpl> {
if (!this._loadingTriggered) {
this._loadingTriggered = true;
languageDefinitions[this._languageId].loader().then(
(mod) => this._lazyLoadPromiseResolve(mod),
(err) => this._lazyLoadPromiseReject(err)
);
}
return this._lazyLoadPromise;
}
}
The pattern here is clever: the promise is created in the constructor, but loading is only triggered by the first load() call. Every later call returns that same promise, so even if several consumers request a language at the same time, the grammar module is loaded only once. The flip side is that a failed load is cached too: the rejected promise is handed to every later caller, and nothing retries.
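To make the single-load guarantee concrete, here is a minimal, generic sketch of the same deferred-promise pattern, independent of Monaco. The names LazyLoader, GrammarModule, and demoLoader are illustrative assumptions, not identifiers from the repository.

```typescript
// Simplified stand-in for the grammar module shape (ILangImpl in the article).
interface GrammarModule {
  conf: object;
  language: object;
}

// Generic version of the pattern: the promise exists from construction,
// but the loader function only runs on the first load() call.
class LazyLoader<T> {
  private triggered = false;
  private readonly loader: () => Promise<T>;
  private readonly promise: Promise<T>;
  private resolve!: (value: T) => void;
  private reject!: (err: unknown) => void;

  constructor(loader: () => Promise<T>) {
    this.loader = loader;
    this.promise = new Promise<T>((resolve, reject) => {
      this.resolve = resolve;
      this.reject = reject;
    });
  }

  load(): Promise<T> {
    if (!this.triggered) {
      this.triggered = true;
      // Forward the loader's outcome (success or failure) to the shared promise.
      this.loader().then(this.resolve, this.reject);
    }
    return this.promise;
  }
}

// Hypothetical loader that counts how often it is invoked.
let loaderCalls = 0;
const demoLoader = new LazyLoader<GrammarModule>(() => {
  loaderCalls++;
  return Promise.resolve({ conf: {}, language: {} });
});

const first = demoLoader.load();
const second = demoLoader.load();
console.log(first === second, loaderCalls); // prints: true 1
```

Both calls receive the identical promise object, and the counter shows the loader ran once, which is the property the article's class relies on.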
The registerLanguage Function
The registerLanguage() function ties everything together:
src/languages/definitions/_.contribution.ts#L63-L80
export function registerLanguage(def: ILang): void {
const languageId = def.id;
languageDefinitions[languageId] = def;
languages.register(def);
const lazyLanguageLoader = LazyLanguageLoader.getOrCreate(languageId);
languages.registerTokensProviderFactory(languageId, {
create: async (): Promise<languages.IMonarchLanguage> => {
const mod = await lazyLanguageLoader.load();
return mod.language;
}
});
languages.onLanguageEncountered(languageId, async () => {
const mod = await lazyLanguageLoader.load();
languages.setLanguageConfiguration(languageId, mod.conf);
});
}
Three things happen at registration time:
- languages.register(def): tells Monaco core that this language exists (metadata only)
- languages.registerTokensProviderFactory(): registers a factory that will create the Monarch tokenizer on demand
- languages.onLanguageEncountered(): sets up a callback to install the language configuration (brackets, comments, folding markers) when the language is first encountered
None of these trigger the actual grammar load. That only happens when Monaco core calls the factory's create() method or fires the onLanguageEncountered event.
sequenceDiagram
participant App as Application
participant Reg as registerLanguage()
participant Core as Monaco Core
participant Loader as LazyLanguageLoader
participant Mod as Grammar Module
App->>Reg: registerLanguage({ id: 'typescript', loader: ... })
Reg->>Core: languages.register({ id: 'typescript', ... })
Reg->>Core: registerTokensProviderFactory('typescript', factory)
Reg->>Core: onLanguageEncountered('typescript', callback)
Note over Core: No grammar loaded yet
App->>Core: editor.createModel('...', 'typescript')
Core->>Loader: factory.create()
Loader->>Mod: import('./typescript')
Mod-->>Loader: { conf, language }
Loader-->>Core: language (Monarch grammar)
Core->>Core: Create Monarch tokenizer
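The lifecycle in the diagram can be approximated with a small simulation. This is only a sketch: StubCore and registerLanguageSketch are hypothetical stand-ins, not Monaco's real core or API, and the onLanguageEncountered path is left out for brevity. It shows that registration alone never calls the loader, while the first model creation does.

```typescript
// Minimal stand-ins; these are NOT Monaco's real types or APIs.
type Grammar = { tokenizer: Record<string, unknown[]> };
type GrammarModule = { conf: object; language: Grammar };
type LanguageDef = { id: string; extensions: string[]; loader: () => Promise<GrammarModule> };
type TokensProviderFactory = { create: () => Promise<Grammar> };

// Hypothetical stub for the editor core: it stores metadata and factories,
// and only calls a factory when a model in that language is created.
class StubCore {
  readonly known = new Map<string, LanguageDef>();
  private readonly factories = new Map<string, TokensProviderFactory>();

  register(def: LanguageDef): void {
    this.known.set(def.id, def);
  }
  registerTokensProviderFactory(id: string, factory: TokensProviderFactory): void {
    this.factories.set(id, factory);
  }
  createModel(languageId: string): Promise<Grammar> | undefined {
    return this.factories.get(languageId)?.create();
  }
}

function registerLanguageSketch(core: StubCore, def: LanguageDef): void {
  let cached: Promise<GrammarModule> | undefined; // memoized load, one per language
  const load = (): Promise<GrammarModule> => {
    if (!cached) {
      cached = def.loader();
    }
    return cached;
  };
  core.register(def); // eager: metadata only
  core.registerTokensProviderFactory(def.id, {
    create: async () => (await load()).language // lazy: runs on first use
  });
}

let grammarLoads = 0;
const core = new StubCore();
registerLanguageSketch(core, {
  id: 'demo',
  extensions: ['.demo'],
  loader: async () => {
    grammarLoads++;
    return { conf: {}, language: { tokenizer: { root: [] } } };
  }
});

const loadsAfterRegistration = grammarLoads; // 0: nothing loaded yet
core.createModel('demo');
core.createModel('demo');
const loadsAfterFirstUse = grammarLoads; // 1: the second model reuses the cached promise
console.log(loadsAfterRegistration, loadsAfterFirstUse);
```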
Anatomy of a Language Definition
Let's trace a concrete example. Here's the TypeScript language registration:
src/languages/definitions/typescript/register.ts#L6-L16
import { registerLanguage } from '../_.contribution';
registerLanguage({
id: 'typescript',
extensions: ['.ts', '.tsx', '.cts', '.mts'],
aliases: ['TypeScript', 'ts', 'typescript'],
mimetypes: ['text/typescript'],
loader: (): Promise<any> => {
return import('./typescript');
}
});
The loader function uses a dynamic import() — this is the code-splitting boundary. When bundled with Rollup or webpack, the ./typescript module (which contains the full Monarch grammar) becomes a separate chunk that's only fetched when TypeScript is first needed.
The grammar module itself exports two things. The conf provides language configuration:
src/languages/definitions/typescript/typescript.ts#L8-L75
export const conf: languages.LanguageConfiguration = {
wordPattern: /(-?\d*\.\d\w*)|([^\`\~\!\@\#\%\^\&\*\(\)\-\=\+\[\{\]\}\\\|\;\:\'\"\,\.\<\>\/\?\s]+)/g,
comments: {
lineComment: '//',
blockComment: ['/*', '*/']
},
brackets: [
['{', '}'],
['[', ']'],
['(', ')']
],
// ... auto-closing pairs, folding markers, etc.
};
And the language provides the Monarch grammar — a state machine for tokenization:
src/languages/definitions/typescript/typescript.ts#L77-L80
export const language = {
defaultToken: 'invalid',
tokenPostfix: '.ts',
keywords: ['abstract', 'any', 'as', /* ... */],
The Monarch Grammar Format
Monarch is Monaco's built-in tokenizer format. It's a declarative state machine where each state contains an array of rules, and each rule is a regex pattern paired with an action (a token type and optionally a state transition).
src/languages/definitions/typescript/typescript.ts#L224-L255
tokenizer: {
root: [[/[{}]/, 'delimiter.bracket'], { include: 'common' }],
common: [
[/#?[a-z_$][\w$]*/, {
cases: {
'@keywords': 'keyword',
'@default': 'identifier'
}
}],
[/[A-Z][\w\$]*/, 'type.identifier'],
{ include: '@whitespace' },
[/\/(?=([^\\\\/]|\\.)+\/([dgimsuy]*)(\s*)(\.|;|,|\)|\]|\}|$))/,
{ token: 'regexp', bracket: '@open', next: '@regexp' }
],
// ...
],
stateDiagram-v2
[*] --> root
root --> common: include
common --> whitespace: include @whitespace
common --> regexp: /regex_start/
regexp --> common: /regex_end/
common --> string_double: opening "
string_double --> common: closing "
common --> string_single: opening '
string_single --> common: closing '
common --> string_backtick: opening backtick
string_backtick --> common: closing backtick
Key Monarch concepts:
- States: Named groups of rules (root, common, whitespace, string, regexp)
- Includes: { include: 'common' } pulls in rules from another state
- Cases: @keywords references the keywords array declared earlier, enabling compact rules
- State transitions: next: '@regexp' switches the tokenizer to the regexp state
- Token types: Strings like 'keyword', 'identifier', 'type.identifier' that map to CSS classes for coloring
Tip: The defaultToken: 'invalid' setting is a development aid. Any text that doesn't match any rule gets the invalid token type, making untokenized content visually obvious. Production grammars should eventually cover all cases, but this default helps you spot gaps during development.
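To see how states, transitions, keyword lookup, and the default token fit together, here is a deliberately tiny interpreter for a Monarch-like grammar. It is a teaching sketch, not Monarch's implementation: the Rule and MiniGrammar shapes, the tokenizeLine function, and the 'white' token name are all assumptions made for this example, and real Monarch supports far more (includes, cases, bracket matching, token merging, embedded languages).

```typescript
interface Rule {
  regex: RegExp; // tried at the current position
  token: string; // token type emitted on a match
  next?: string; // optional state to switch to after the match
}

interface MiniGrammar {
  defaultToken: string;
  keywords: string[];
  tokenizer: Record<string, Rule[]>;
}

interface Token {
  startIndex: number;
  type: string;
}

const miniGrammar: MiniGrammar = {
  defaultToken: 'invalid',
  keywords: ['let', 'const'],
  tokenizer: {
    root: [
      { regex: /[a-z_$][\w$]*/, token: 'identifier' }, // may be upgraded to 'keyword'
      { regex: /\d+/, token: 'number' },
      { regex: /\s+/, token: 'white' },
      { regex: /"/, token: 'string', next: 'string' }
    ],
    string: [
      { regex: /[^"]+/, token: 'string' },
      { regex: /"/, token: 'string', next: 'root' }
    ]
  }
};

function tokenizeLine(grammar: MiniGrammar, line: string): Token[] {
  const tokens: Token[] = [];
  let state = 'root';
  let pos = 0;
  while (pos < line.length) {
    let matched = false;
    for (const rule of grammar.tokenizer[state]) {
      // Sticky copy of the rule's regex so it can only match exactly at `pos`.
      const sticky = new RegExp(rule.regex.source, 'y');
      sticky.lastIndex = pos;
      const m = sticky.exec(line);
      if (!m || m[0].length === 0) continue;
      // Rough equivalent of a `cases` lookup against the keywords array.
      const isKeyword = rule.token === 'identifier' && grammar.keywords.indexOf(m[0]) !== -1;
      tokens.push({ startIndex: pos, type: isKeyword ? 'keyword' : rule.token });
      pos += m[0].length;
      if (rule.next) state = rule.next;
      matched = true;
      break;
    }
    if (!matched) {
      // No rule matched: emit the default token for a single character.
      tokens.push({ startIndex: pos, type: grammar.defaultToken });
      pos += 1;
    }
  }
  return tokens;
}

const demoTokens = tokenizeLine(miniGrammar, 'let x = "hi"');
console.log(demoTokens.map((t) => `${t.startIndex}:${t.type}`).join(' '));
// prints: 0:keyword 3:white 4:identifier 5:white 6:invalid 7:white 8:string 9:string 11:string
```

The = at index 6 has no matching rule in this toy grammar, so it surfaces as invalid, which is the gap-spotting behavior the tip describes.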
The Barrel Registration Pattern
All ~85 language registrations are aggregated through a barrel import:
src/languages/definitions/register.all.ts#L6-L87
import './abap/register';
import './apex/register';
import './azcli/register';
// ... 79 more
import './yaml/register';
Each import executes the registerLanguage() call, which, as we've seen, only stores metadata and sets up lazy factories. The total upfront cost of importing all 85 registration files is negligible: no grammar modules are loaded and no tokenizers are created.
flowchart TD
BARREL[register.all.ts] --> |import| TS[typescript/register.ts]
BARREL --> |import| PY[python/register.ts]
BARREL --> |import| RS[rust/register.ts]
BARREL --> |import| DOT[... 82 more]
TS --> |"registerLanguage()"| META1["id: 'typescript'<br/>extensions: ['.ts', '.tsx']<br/>loader: () => import('./typescript')"]
PY --> |"registerLanguage()"| META2["id: 'python'<br/>extensions: ['.py']<br/>loader: () => import('./python')"]
META1 -.-> |on demand| GRAMMAR1[typescript.ts<br/>~300 lines of Monarch grammar]
META2 -.-> |on demand| GRAMMAR2[python.ts<br/>~300 lines of Monarch grammar]
This pattern also enables selective imports. A consumer could bypass register.all.ts entirely and import only the specific language registrations they need:
import 'monaco-editor/esm/vs/languages/definitions/typescript/register';
import 'monaco-editor/esm/vs/languages/definitions/python/register';
// Only TypeScript and Python grammars will ever load
Tokenization Testing Framework
Every language grammar is backed by tokenization tests. The test runner in testRunner.ts provides the infrastructure:
src/languages/definitions/test/testRunner.ts#L12-L42
export interface IRelaxedToken {
startIndex: number;
type: string;
}
export interface ITestItem {
line: string;
tokens: IRelaxedToken[];
}
export function testTokenization(_language: string | string[], tests: ITestItem[][]): void {
let languages: string[];
if (typeof _language === 'string') {
languages = [_language];
} else {
languages = _language;
}
let mainLanguage = languages[0];
test(mainLanguage + ' tokenization', async () => {
await Promise.all(languages.map((l) => loadLanguage(l)));
await timeout(0);
runTests(mainLanguage, tests);
});
}
The test format pairs input source lines with expected token classifications. The loadLanguage() helper (also defined in _.contribution.ts) explicitly triggers the lazy loader and creates a temporary model to force tokenizer creation:
src/languages/definitions/_.contribution.ts#L55-L61
export async function loadLanguage(languageId: string): Promise<void> {
await LazyLanguageLoader.getOrCreate(languageId).load();
const model = editor.createModel('', languageId);
model.dispose();
}
flowchart LR
TEST[test file] --> |testTokenization| RUNNER[testRunner.ts]
RUNNER --> |loadLanguage| LOADER[LazyLanguageLoader]
LOADER --> |import| GRAMMAR[Grammar Module]
RUNNER --> |editor.tokenize| CORE[Monaco Core]
CORE --> |compare| EXPECTED[Expected Tokens]
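As a rough illustration of the test data format, the sketch below builds one test case and compares it against tokens. The two interfaces are copied from the excerpt above. The diffTokens helper is hypothetical (the real runner's comparison logic is not shown in this article), and the 'comment.ts' type assumes that the grammar's tokenPostfix of '.ts' is appended to each token name.

```typescript
// Interfaces as shown in the testRunner.ts excerpt.
interface IRelaxedToken {
  startIndex: number;
  type: string;
}
interface ITestItem {
  line: string;
  tokens: IRelaxedToken[];
}

// One test case is an array of consecutive source lines with expected tokens.
const commentCase: ITestItem[] = [
  {
    line: '// a comment',
    tokens: [{ startIndex: 0, type: 'comment.ts' }]
  }
];

// Hypothetical helper: lists mismatches between actual and expected tokens.
function diffTokens(actual: IRelaxedToken[], expected: IRelaxedToken[]): string[] {
  const problems: string[] = [];
  const count = Math.max(actual.length, expected.length);
  for (let i = 0; i < count; i++) {
    const a = actual[i];
    const e = expected[i];
    if (!a || !e || a.startIndex !== e.startIndex || a.type !== e.type) {
      problems.push(`token ${i}: expected ${JSON.stringify(e)}, got ${JSON.stringify(a)}`);
    }
  }
  return problems;
}

const matching = diffTokens([{ startIndex: 0, type: 'comment.ts' }], commentCase[0].tokens);
const mismatched = diffTokens([{ startIndex: 0, type: 'identifier.ts' }], commentCase[0].tokens);
console.log(matching.length, mismatched.length); // prints: 0 1
```

In an actual test file, an array of such cases would be passed to testTokenization() together with the language ID.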
The tests are run via Node.js's built-in test runner, as configured in package.json:
node --import tsx --import ./test/test-setup.mjs --test "src/languages/definitions/*/*.test.ts"
Each language has its own *.test.ts file alongside its grammar, making it easy to add tests when modifying a grammar.
What's Next
Language definitions give Monaco syntax highlighting, but for the four premium languages — TypeScript, CSS, HTML, and JSON — Monaco goes much further. In the next article, we'll explore the web worker architecture that powers full IntelliSense: completions, diagnostics, hover information, and formatting, all running off the main thread.