Go Runtime Internals: Process Startup and the G-M-P Scheduler
Prerequisites
- Articles 1-3: Repository through Compiler Pipeline
- Operating system concepts (threads, thread-local storage, virtual memory)
- Basic understanding of Go's Plan 9-style assembly notation
Every Go binary carries a runtime with it — a sophisticated piece of systems software that manages goroutine scheduling, memory allocation, garbage collection, and OS interaction. When the OS loader starts a Go binary, execution begins not in your main() function but in platform-specific assembly that bootstraps the entire runtime. This article traces that bootstrap sequence, then dives deep into the G-M-P scheduler — the engine that makes goroutines work.
Assembly Entry Points and rt0_go
On Linux/amd64, the first instruction executed is in rt0_linux_amd64.s:
src/runtime/rt0_linux_amd64.s#L7-L8
```asm
TEXT _rt0_amd64_linux(SB),NOSPLIT,$-8
	JMP	_rt0_amd64(SB)
```
That's the entire file — a single jump to the architecture-generic _rt0_amd64, which itself jumps to runtime·rt0_go. This layered dispatch separates OS-specific entry from architecture-specific initialization.
rt0_go is where the real work happens:
src/runtime/asm_amd64.s#L142-L269
```mermaid
sequenceDiagram
    participant OS as OS Loader
    participant rt0 as rt0_linux_amd64
    participant rt0go as rt0_go
    participant sched as schedinit
    participant main as runtime.main
    OS->>rt0: Load binary, jump to entry
    rt0->>rt0go: JMP _rt0_amd64 → rt0_go
    rt0go->>rt0go: Set up g0 stack bounds
    rt0go->>rt0go: CPUID: detect CPU features
    rt0go->>rt0go: Initialize TLS
    rt0go->>rt0go: Link g0 ↔ m0
    rt0go->>sched: CALL schedinit
    rt0go->>rt0go: Create goroutine for runtime.main
    rt0go->>rt0go: CALL runtime.mstart (enters scheduler)
```
The function performs these steps in order:
- g0 stack setup (lines 159-166): Creates the initial goroutine's stack from the OS-provided stack, setting `stack.lo` to 64KB below the current SP.
- CPU feature detection (lines 168-186): Runs `CPUID` to detect Intel vs. AMD and processor capabilities. This information drives runtime optimizations like faster memory operations.
- TLS initialization (lines 249-258): Sets up thread-local storage so `getg()` — which retrieves the current goroutine — works correctly. On most platforms this uses OS-provided TLS; on Linux it calls `settls`.
- g0/m0 linking (lines 260-269): Wires up the two foundational runtime objects: `g0` (the system goroutine for scheduling work) and `m0` (the initial OS thread).
```asm
// save m->g0 = g0
MOVQ	CX, m_g0(AX)
// save m0 to g0->m
MOVQ	AX, g_m(CX)
```
This bidirectional link is critical — every M knows its g0, and every g0 knows its M.
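In Go terms, the link is just two pointer assignments. A minimal sketch (the struct names below are illustrative stand-ins; the real `g` and `m` structs live in src/runtime/runtime2.go and carry many more fields):

```go
package main

import "fmt"

// Reduced stand-ins for runtime.g and runtime.m.
type gg struct{ m *mm }
type mm struct{ g0 *gg }

func main() {
	g0 := &gg{}
	m0 := &mm{}
	m0.g0 = g0 // MOVQ CX, m_g0(AX): every M knows its g0
	g0.m = m0  // MOVQ AX, g_m(CX): every g0 knows its M
	fmt.Println(m0.g0 == g0 && g0.m == m0)
}
```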
schedinit and runtime.main
After rt0_go sets up the hardware foundations, it calls schedinit:
This function initializes every major runtime subsystem in careful order: lock ranks, the stack allocator, the random number generator, the memory allocator (mallocinit), CPU algorithm selection, GOMAXPROCS configuration, and more. The ordering matters — mallocinit needs randinit to have run, and both need the stack system initialized.
After schedinit, rt0_go creates the first real goroutine running runtime.main:
runtime.main does the following:
- Starts the sysmon thread — a background monitor that handles preemption, network polling, and GC pacing
- Locks the main goroutine to the main OS thread (required by some C libraries)
- Runs all runtime init functions via `doInit(runtime_inittasks)`
- Enables the garbage collector with `gcenable()`
- Runs all package init functions in dependency order
- Finally calls `main_main()` — your `main.main`
```go
fn := main_main // indirect call; linker resolves the address
fn()
```
The //go:linkname main_main main.main directive at line 139 connects the runtime's reference to whatever main() the linker finds.
Tip: Set `GODEBUG=inittrace=1` to see timing information for every `init()` function. This is useful for diagnosing slow startup times.
The G-M-P Model
The scheduler documentation at the top of proc.go lays out the three core abstractions:
- G (goroutine): The unit of work. Contains a stack, scheduling state (`gobuf`), and GC metadata.
- M (machine): An OS thread. Has a `g0` for system stack operations and a `curg` pointing to the currently running goroutine.
- P (processor): A logical processor with a local run queue, memory cache (`mcache`), and timer heap. There are exactly GOMAXPROCS Ps.
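The relationships can be summarized with heavily abbreviated versions of the structs — a sketch only; the real definitions in runtime2.go have dozens more fields, and the local run queue is a fixed-size ring rather than a slice:

```go
package main

import "fmt"

type gobuf struct{ sp, pc uintptr } // saved registers for context switches

type G struct {
	stackLo, stackHi uintptr // stack bounds
	sched            gobuf
}

type M struct {
	g0   *G // goroutine with the scheduling/system stack
	curg *G // user goroutine currently running on this thread
	p    *P // attached processor (nil while blocked in a syscall)
}

type P struct {
	runq []*G // local run queue
}

func main() {
	p := &P{}
	m := &M{g0: &G{}, curg: &G{}, p: p}
	p.runq = append(p.runq, &G{}, &G{})
	fmt.Println(m.p == p, len(p.runq))
}
```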
```mermaid
graph TD
    subgraph P0["P0 (Processor)"]
        LRQ0["Local Run Queue<br/>[G3, G4, G5]"]
        MC0["mcache"]
        TH0["Timer Heap"]
    end
    subgraph P1["P1 (Processor)"]
        LRQ1["Local Run Queue<br/>[G6, G7]"]
        MC1["mcache"]
        TH1["Timer Heap"]
    end
    M0["M0 (OS Thread)<br/>running G1"] --> P0
    M1["M1 (OS Thread)<br/>running G2"] --> P1
    M2["M2 (OS Thread)<br/>in syscall, no P"]
    GRQ["Global Run Queue<br/>[G8, G9, G10]"]
    style M2 fill:#f99
```
The P abstraction was introduced in the Go 1.1 scheduler redesign. Before P existed, all scheduling state was either per-M or global. The problem: when an M entered a syscall, its scheduling resources were locked up. P solves this by making scheduling resources detachable — when an M enters a syscall, its P can be handed off to another M that's ready to run Go code.
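The handoff can be modeled as moving a P to and from an idle pool — a toy model, not the runtime's code (in the real runtime, `entersyscall` marks the P as releasable and sysmon retakes it only if the syscall runs long):

```go
package main

import "fmt"

// Illustrative types, not the runtime's real structs.
type P struct{ id int }
type M struct{ p *P }

// entersyscall: the M keeps running the syscall, but its P becomes
// available to other Ms.
func entersyscall(m *M, idle *[]*P) {
	*idle = append(*idle, m.p)
	m.p = nil
}

// exitsyscall: try to reacquire some P; failing that, the goroutine
// would go on the global run queue and the M would park.
func exitsyscall(m *M, idle *[]*P) bool {
	if n := len(*idle); n > 0 {
		m.p = (*idle)[n-1]
		*idle = (*idle)[:n-1]
		return true
	}
	return false
}

func main() {
	idle := []*P{}
	m1 := &M{p: &P{id: 0}}
	entersyscall(m1, &idle) // P 0 is now free for another M
	m2 := &M{}
	fmt.Println(exitsyscall(m2, &idle), m2.p.id)
}
```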
The global variables m0 and g0 are the primordial instances of these types.
Goroutine State Machine
Every goroutine has an atomicstatus field that tracks its state. The states are defined in runtime2.go:
src/runtime/runtime2.go#L17-L99
```mermaid
stateDiagram-v2
    [*] --> _Gidle: newproc allocates G
    _Gidle --> _Grunnable: initialized
    _Grunnable --> _Grunning: execute()
    _Grunning --> _Grunnable: preempted / yield
    _Grunning --> _Gsyscall: entering syscall
    _Gsyscall --> _Grunnable: syscall returns
    _Grunning --> _Gwaiting: gopark()
    _Gwaiting --> _Grunnable: goready()
    _Grunning --> _Gdead: goexit()
    _Gdead --> _Gidle: reused from free list
    _Grunning --> _Gpreempted: async preemption
    _Gpreempted --> _Gwaiting: suspendG
```
The status also acts as a lock on the goroutine's stack. The _Gscan bit (0x1000) can be OR'd with any state to indicate the GC is scanning the stack. Transitions between states must use atomic CAS operations because the GC may be concurrently setting the scan bit.
The g struct itself is substantial:
src/runtime/runtime2.go#L471-L596
Key fields include stack (bounds), stackguard0 (used for preemption — setting it to stackPreempt triggers a preemption check), sched (the gobuf containing saved registers for context switching), and gcAssistBytes (the goroutine's debt to the GC).
Work Stealing and Thread Management
The core scheduling loop is the schedule() function:
It calls findRunnable(), which implements the work-stealing algorithm:
The search order in findRunnable is carefully designed:
- Check the local run queue
- Check the global run queue (every 61 schedules, to prevent starvation)
- Poll the network poller for ready goroutines
- Try to steal work from other Ps' run queues
- If nothing found, park the M
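The five steps above can be modeled with a toy function — slices stand in for the runtime's ring buffers, and the network poller step is stubbed out, so this is a sketch of the search order, not the runtime's code:

```go
package main

import "fmt"

type G int

func findRunnable(local, global *[]G, victims [][]G, schedtick int) (G, bool) {
	pop := func(q *[]G) G { g := (*q)[0]; *q = (*q)[1:]; return g }
	// Every 61st tick, look at the global queue first so it never starves.
	if schedtick%61 == 0 && len(*global) > 0 {
		return pop(global), true
	}
	if len(*local) > 0 { // 1. local run queue
		return pop(local), true
	}
	if len(*global) > 0 { // 2. global run queue
		return pop(global), true
	}
	// 3. the network poller would be checked here
	for i, q := range victims { // 4. steal half of another P's queue
		if n := len(q); n > 0 {
			take := (n + 1) / 2
			*local = append(*local, q[1:take]...)
			victims[i] = q[take:]
			return q[0], true
		}
	}
	return 0, false // 5. nothing found: the M parks (stopm)
}

func main() {
	local, global := []G{}, []G{}
	victims := [][]G{{10, 11, 12, 13}}
	g, ok := findRunnable(&local, &global, victims, 1)
	fmt.Println(ok, g, local, victims[0])
}
```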
The "spinning thread" optimization avoids excessive thread parking/unparking:
The key insight: if at least one thread is spinning (looking for work), don't wake additional threads when new work arrives. Only when the last spinning thread finds work and stops spinning does it wake a replacement spinner. This smooths out thread creation bursts while guaranteeing eventual full CPU utilization.
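The guard can be sketched as a CAS on a spinner count (modeled on the check in runtime.wakep; the real version also consults the idle-P count, and `startm` here is a stand-in):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var (
	nmspinning atomic.Int32 // Ms currently searching for work
	started    int          // demo counter for woken spinners
)

func startm() { started++ } // stand-in for starting/unparking an M

// wakep wakes a new spinning M only if nothing is already spinning.
func wakep() {
	if nmspinning.Load() != 0 {
		return // an existing spinner will find the new work
	}
	if !nmspinning.CompareAndSwap(0, 1) {
		return // lost the race to another waker
	}
	startm()
}

func main() {
	wakep() // wakes one spinner
	wakep() // no-op: a spinner already exists
	fmt.Println(started)
}
```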
```mermaid
flowchart TD
    A["schedule()"] --> B["findRunnable()"]
    B --> C{"Local run queue?"}
    C -->|Yes| H["execute(gp)"]
    C -->|No| D{"Global run queue?<br/>(every 61st check)"}
    D -->|Yes| H
    D -->|No| E{"Network poller?"}
    E -->|Yes| H
    E -->|No| F{"Steal from other P?"}
    F -->|Yes| H
    F -->|No| G["Park M<br/>(stopm)"]
    H --> I["Run goroutine"]
    I --> A
```
Preemption: Cooperative and Asynchronous
Go supports two preemption mechanisms, documented in preempt.go:
Cooperative preemption works by poisoning the goroutine's stackguard0 to stackPreempt. Every function prologue contains a stack bound check; when the poisoned value triggers the check, the function enters the stack growth path, which detects it's actually a preemption request and yields.
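Why the same check serves both purposes: the sentinel is a near-maximal address, so the SP comparison that every prologue already performs is guaranteed to fire. A sketch of the logic (the constant here mirrors the runtime's value; `prologueCheck` is an illustrative model, not a real function):

```go
package main

import "fmt"

// stackPreempt is larger than any real stack address, so SP < guard
// always holds and the morestack path is forced.
const stackPreempt = ^uintptr(0) - 1313

// prologueCheck models the compiler-inserted stack bound test at the
// top of every non-nosplit function, plus newstack's disambiguation.
func prologueCheck(sp, stackguard0 uintptr) string {
	if sp < stackguard0 { // the only check emitted in the hot path
		if stackguard0 == stackPreempt {
			return "preempt" // newstack yields instead of growing
		}
		return "grow stack"
	}
	return "run body"
}

func main() {
	fmt.Println(prologueCheck(0x3000, 0x2000))       // above guard
	fmt.Println(prologueCheck(0x1000, 0x2000))       // genuinely low
	fmt.Println(prologueCheck(0x1000, stackPreempt)) // poisoned guard
}
```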
Asynchronous preemption (added in Go 1.14) handles tight loops without function calls — code that would never hit a cooperative preemption point. The runtime sends a signal (SIGURG on Unix) to the thread, the signal handler inspects the goroutine's state, and if it's at a safe point, pauses the goroutine.
Tip: If you have CPU-bound goroutines that seem to block the scheduler, they might be running tight loops without function calls. While async preemption handles most cases, CGo calls are one area where preemption still can't intervene.
Into the Memory System
The scheduler is tightly coupled with the memory allocator — every P has its own mcache for lock-free allocation, and the GC uses the scheduler to coordinate stop-the-world pauses and mark assist. In the next article, we'll explore Go's memory management: the tcmalloc-inspired allocator hierarchy and the concurrent tri-color garbage collector.