Go Runtime Internals: Process Startup and the G-M-P Scheduler

Advanced

Prerequisites

  • Articles 1-3: Repository through Compiler Pipeline
  • Operating system concepts (threads, thread-local storage, virtual memory)
  • Basic understanding of Go's Plan 9-style assembly notation

Every Go binary carries a runtime with it — a sophisticated piece of systems software that manages goroutine scheduling, memory allocation, garbage collection, and OS interaction. When the OS loader starts a Go binary, execution begins not in your main() function but in platform-specific assembly that bootstraps the entire runtime. This article traces that bootstrap sequence, then dives deep into the G-M-P scheduler — the engine that makes goroutines work.

Assembly Entry Points and rt0_go

On Linux/amd64, the first instruction executed is in rt0_linux_amd64.s:

src/runtime/rt0_linux_amd64.s#L7-L8

TEXT _rt0_amd64_linux(SB),NOSPLIT,$-8
    JMP _rt0_amd64(SB)

That's the entire file — a single jump to the architecture-generic _rt0_amd64, which itself jumps to runtime·rt0_go. This layered dispatch separates OS-specific entry from architecture-specific initialization.

rt0_go is where the real work happens:

src/runtime/asm_amd64.s#L142-L269

sequenceDiagram
    participant OS as OS Loader
    participant rt0 as rt0_linux_amd64
    participant rt0go as rt0_go
    participant sched as schedinit
    participant main as runtime.main

    OS->>rt0: Load binary, jump to entry
    rt0->>rt0go: JMP _rt0_amd64 → rt0_go
    rt0go->>rt0go: Set up g0 stack bounds
    rt0go->>rt0go: CPUID: detect CPU features
    rt0go->>rt0go: Initialize TLS
    rt0go->>rt0go: Link g0 ↔ m0
    rt0go->>sched: CALL schedinit
    rt0go->>rt0go: Create goroutine for runtime.main
    rt0go->>rt0go: CALL runtime.mstart (enters scheduler)

The function performs these steps in order:

  1. g0 stack setup (lines 159-166): Creates the initial goroutine's stack from the OS-provided stack, setting stack.lo to 64KB below the current SP.

  2. CPU feature detection (lines 168-186): Runs CPUID to detect Intel vs. AMD and processor capabilities. This information drives runtime optimizations like faster memory operations.

  3. TLS initialization (lines 249-258): Sets up thread-local storage so getg() — which retrieves the current goroutine — works correctly. On most platforms this uses OS-provided TLS; on Linux it calls settls.

  4. g0/m0 linking (lines 260-269): Wires up the two foundational runtime objects: g0 (the system goroutine for scheduling work) and m0 (the initial OS thread).

// save m->g0 = g0
MOVQ    CX, m_g0(AX)
// save m0 to g0->m
MOVQ    AX, g_m(CX)

This bidirectional link is critical — every M knows its g0, and every g0 knows its M.

schedinit and runtime.main

After rt0_go sets up the hardware foundations, it calls schedinit:

src/runtime/proc.go#L835-L884

This function initializes every major runtime subsystem in careful order: lock ranks, the stack allocator, the random number generator, the memory allocator (mallocinit), CPU algorithm selection, GOMAXPROCS configuration, and more. The ordering matters — mallocinit needs randinit to have run, and both need the stack system initialized.

After schedinit, rt0_go creates the first real goroutine running runtime.main:

src/runtime/proc.go#L152-L294

runtime.main does the following:

  1. Starts the sysmon thread — a background monitor that handles preemption, network polling, and GC pacing
  2. Locks the main goroutine to the main OS thread (required by some C libraries)
  3. Runs all runtime init functions via doInit(runtime_inittasks)
  4. Enables the garbage collector with gcenable()
  5. Runs all package init functions in dependency order
  6. Finally calls main_main() — your main.main
fn := main_main // indirect call; linker resolves the address
fn()

The //go:linkname main_main main.main directive at line 139 connects the runtime's reference to whatever main() the linker finds.

Tip: Set GODEBUG=inittrace=1 to see timing information for every init() function. This is useful for diagnosing slow startup times.

The G-M-P Model

The scheduler documentation at the top of proc.go lays out the three core abstractions:

src/runtime/proc.go#L24-L34

  • G (goroutine): The unit of work. Contains a stack, scheduling state (gobuf), and GC metadata.
  • M (machine): An OS thread. Has a g0 for system stack operations and a curg pointing to the currently running goroutine.
  • P (processor): A logical processor with a local run queue, memory cache (mcache), and timer heap. There are exactly GOMAXPROCS Ps.

graph TD
    subgraph P0["P0 (Processor)"]
        LRQ0["Local Run Queue<br/>[G3, G4, G5]"]
        MC0["mcache"]
        TH0["Timer Heap"]
    end
    subgraph P1["P1 (Processor)"]
        LRQ1["Local Run Queue<br/>[G6, G7]"]
        MC1["mcache"]
        TH1["Timer Heap"]
    end

    M0["M0 (OS Thread)<br/>running G1"] --> P0
    M1["M1 (OS Thread)<br/>running G2"] --> P1
    M2["M2 (OS Thread)<br/>in syscall, no P"]

    GRQ["Global Run Queue<br/>[G8, G9, G10]"]

    style M2 fill:#f99

The P abstraction was introduced in the Go 1.1 scheduler redesign. Before P existed, all scheduling state was either per-M or global. The problem: when an M entered a syscall, its scheduling resources were locked up. P solves this by making scheduling resources detachable — when an M enters a syscall, its P can be handed off to another M that's ready to run Go code.

The global variables m0 and g0 are the primordial instances:

src/runtime/proc.go#L118-L124

Goroutine State Machine

Every goroutine has an atomicstatus field that tracks its state. The states are defined in runtime2.go:

src/runtime/runtime2.go#L17-L99

stateDiagram-v2
    [*] --> _Gidle: newproc allocates G
    _Gidle --> _Grunnable: initialized
    _Grunnable --> _Grunning: execute()
    _Grunning --> _Grunnable: preempted / yield
    _Grunning --> _Gsyscall: entering syscall
    _Gsyscall --> _Grunnable: syscall returns
    _Grunning --> _Gwaiting: gopark()
    _Gwaiting --> _Grunnable: goready()
    _Grunning --> _Gdead: goexit()
    _Gdead --> _Gidle: reused from free list
    _Grunning --> _Gpreempted: async preemption
    _Gpreempted --> _Gwaiting: suspendG

The status also acts as a lock on the goroutine's stack. The _Gscan bit (0x1000) can be OR'd with any state to indicate the GC is scanning the stack. Transitions between states must use atomic CAS operations because the GC may be concurrently setting the scan bit.

The g struct itself is substantial:

src/runtime/runtime2.go#L471-L596

Key fields include stack (bounds), stackguard0 (used for preemption — setting it to stackPreempt triggers a preemption check), sched (the gobuf containing saved registers for context switching), and gcAssistBytes (the goroutine's debt to the GC).

Work Stealing and Thread Management

The core scheduling loop is the schedule() function:

src/runtime/proc.go#L4141

It calls findRunnable(), which implements the work-stealing algorithm:

src/runtime/proc.go#L3395

The search order in findRunnable is carefully designed:

  1. Check the local run queue
  2. Check the global run queue (every 61st scheduling round, to prevent starvation)
  3. Poll the network poller for ready goroutines
  4. Try to steal work from other Ps' run queues
  5. If nothing found, park the M

The "spinning thread" optimization avoids excessive thread parking/unparking:

src/runtime/proc.go#L58-L83

The key insight: if at least one thread is spinning (looking for work), don't wake additional threads when new work arrives. Only when the last spinning thread finds work and stops spinning does it wake a replacement spinner. This smooths out thread creation bursts while guaranteeing eventual full CPU utilization.

flowchart TD
    A["schedule()"] --> B["findRunnable()"]
    B --> C{"Local run queue?"}
    C -->|Yes| H["execute(gp)"]
    C -->|No| D{"Global run queue?<br/>(every 61st check)"}
    D -->|Yes| H
    D -->|No| E{"Network poller?"}
    E -->|Yes| H
    E -->|No| F{"Steal from other P?"}
    F -->|Yes| H
    F -->|No| G["Park M<br/>(stopm)"]
    H --> I["Run goroutine"]
    I --> A

Preemption: Cooperative and Asynchronous

Go supports two preemption mechanisms, documented in preempt.go:

src/runtime/preempt.go#L1-L40

Cooperative preemption works by poisoning the goroutine's stackguard0 to stackPreempt. Every function prologue contains a stack bound check; when the poisoned value triggers the check, the function enters the stack growth path, which detects it's actually a preemption request and yields.

Asynchronous preemption (added in Go 1.14) handles tight loops without function calls — code that would never hit a cooperative preemption point. The runtime sends a signal (SIGURG on Unix) to the thread, the signal handler inspects the goroutine's state, and if it's at a safe point, pauses the goroutine.

Tip: If you have CPU-bound goroutines that seem to block the scheduler, they might be running tight loops without function calls. Async preemption handles most such cases, but cgo calls remain one place where the runtime still can't preempt.

Into the Memory System

The scheduler is tightly coupled with the memory allocator — every P has its own mcache for lock-free allocation, and the GC uses the scheduler to coordinate stop-the-world pauses and mark assist. In the next article, we'll explore Go's memory management: the tcmalloc-inspired allocator hierarchy and the concurrent tri-color garbage collector.