Memory Management in Go: The Allocator Hierarchy and Concurrent Garbage Collector
Prerequisites
- Articles 1-4: Repository through Scheduler
- Garbage collection theory (tri-color marking, write barriers, concurrent collection)
- Understanding of virtual memory and page-level allocation
Go's memory management sits at the intersection of two forces: allocation speed (every make, new, and composite literal) and collection efficiency (the GC must reclaim memory without long pauses). The solution is a layered allocator that avoids locks in the common case paired with a concurrent garbage collector that runs alongside your program. This article explores both systems.
Allocator Design Overview
The allocator's design is documented in a masterful comment at the top of malloc.go.
The hierarchy is inspired by Google's tcmalloc but has diverged significantly. Small objects (≤32KB) are allocated from size-class-specific caches. Large objects bypass the cache entirely and go straight to the page allocator.
```mermaid
graph TD
    subgraph "Per-P (lock-free)"
        MC["mcache<br/>~70 size classes<br/>+ tiny allocator"]
    end
    subgraph "Per-size-class (locked)"
        MCE["mcentral<br/>spans with free slots"]
    end
    subgraph "Global (locked)"
        MH["mheap<br/>page-level allocation"]
    end
    subgraph "Operating System"
        OS["mmap / VirtualAlloc"]
    end
    MC -->|"empty span"| MCE
    MCE -->|"no spans"| MH
    MH -->|"no pages"| OS
    OS -->|"≥1MB chunks"| MH
    MH -->|"span of pages"| MCE
    MCE -->|"span with free slots"| MC
```
The allocation algorithm for small objects:
1. Round the requested size up to one of ~70 size classes
2. Check the corresponding mspan in the current P's mcache
3. Scan the mspan's free bitmap for a free slot
4. If there is no free slot, get a new mspan from mcentral
5. If mcentral is empty, get pages from mheap
6. If mheap has no pages, request more from the OS (at least 1MB)
Steps 1-3 happen without any lock — this is the critical fast path that makes Go allocation competitive with stack allocation in many cases.
mcache: Per-P Lock-Free Allocation
Each P (as we saw in the scheduler article) has its own mcache:
```go
type mcache struct {
	nextSample int64
	scanAlloc  uintptr

	tiny       uintptr
	tinyoffset uintptr
	tinyAllocs uintptr

	alloc [numSpanClasses]*mspan
	// ...
}
```
The alloc array is indexed by span class (a combination of size class and noscan/scan). Each entry points to an mspan with free slots. When allocating, the runtime finds the appropriate span and scans its free bitmap — a simple bitwise operation, no locks needed.
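That bitmap scan can be sketched in ordinary Go. The model below is a simplified version of the runtime's fast-path slot search (nextFreeFast); the struct and field names are pared down for illustration, but the core trick is the same: cache a 64-bit chunk of the span's free bitmap and find the next free slot with a single count-trailing-zeros instruction.

```go
package main

import (
	"fmt"
	"math/bits"
)

// span is a pared-down stand-in for the runtime's mspan.
type span struct {
	allocCache uint64  // cached bitmap chunk: a 1 bit means "slot free"
	freeindex  uintptr // slot index corresponding to allocCache bit 0
	elemsize   uintptr // object size for this span's size class
	base       uintptr // address of the span's first slot
}

// nextFreeFast returns the address of the next free slot, or 0 when the
// cached chunk is exhausted (the real slow path would refill the cache).
func nextFreeFast(s *span) uintptr {
	tz := bits.TrailingZeros64(s.allocCache)
	if tz == 64 {
		return 0 // no free slot left in this chunk
	}
	slot := s.freeindex + uintptr(tz)
	s.allocCache >>= uint(tz + 1) // consume bits up to and including this slot
	s.freeindex = slot + 1
	return s.base + slot*s.elemsize
}

func main() {
	// Bitmap 0b1011: slots 0, 1, and 3 free; slot 2 already allocated.
	s := &span{allocCache: 0b1011, freeindex: 0, elemsize: 16, base: 0x1000}
	fmt.Printf("%#x %#x %#x\n", nextFreeFast(s), nextFreeFast(s), nextFreeFast(s))
}
```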
The tiny allocator is a special optimization for objects smaller than 16 bytes that contain no pointers. It packs multiple tiny allocations into a single 16-byte block, dramatically reducing overhead for things like small strings and single-byte values.
Tip: Run `go test -bench=. -benchmem` to see allocation counts. Each "alloc/op" represents a trip through this allocator hierarchy. Zero-allocation hot paths are the gold standard for performance-critical Go code.
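You can observe the same counter from ordinary code with testing.AllocsPerRun, which reports the average heap allocations per call — the number -benchmem prints as allocs/op. Storing into a global forces the slice to escape to the heap; the local array never leaves the stack:

```go
package main

import (
	"fmt"
	"testing"
)

// sink forces the slice below to escape, so it must go through the allocator.
var sink []byte

func main() {
	heap := testing.AllocsPerRun(1000, func() {
		sink = make([]byte, 64) // escapes: one trip through the mcache
	})
	stack := testing.AllocsPerRun(1000, func() {
		var buf [64]byte // does not escape: stack-allocated, zero allocs
		_ = buf
	})
	fmt.Println("heap allocs/op:", heap, "stack allocs/op:", stack) // 1 and 0
}
```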
mcentral and mheap: Central and Global Allocation
When an mcache runs out of spans for a size class, it requests a new one from the mcentral:
src/runtime/mcentral.go#L22-L40
Each mcentral manages spans for a single size class, maintaining two sets that swap roles each GC cycle: swept spans and unswept spans. The sweepgen counter (incremented by 2 each cycle) determines which set is which. This design means allocation can trigger sweeping — if you need a span, you might sweep one first to find free objects.
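The sweepgen comparison can be written out directly. The state table below is paraphrased from the sweepgen comment in the runtime's mheap.go; the helper itself is illustrative, not runtime code:

```go
package main

import "fmt"

// spanState decodes a span's sweep state relative to the heap's sweepgen,
// which advances by 2 each GC cycle. States paraphrased from mheap.go.
func spanState(spanGen, heapGen uint32) string {
	switch spanGen {
	case heapGen - 2:
		return "unswept: allocation must sweep it first"
	case heapGen - 1:
		return "being swept right now"
	case heapGen:
		return "swept and ready for allocation"
	case heapGen + 1:
		return "cached before sweeping began; still needs sweeping"
	case heapGen + 3:
		return "swept, then cached in an mcache"
	}
	return "invalid"
}

func main() {
	const heapGen = 6 // the heap's counter after three completed cycles
	for _, sg := range []uint32{4, 5, 6} {
		fmt.Printf("span sweepgen=%d: %s\n", sg, spanState(sg, heapGen))
	}
}
```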
When the mcentral has no spans, the mheap allocates pages:
The heap manages memory at page granularity (8KB pages) using arenas — 64MB chunks on 64-bit systems. Each arena has associated metadata: heap bitmaps for pointer scanning and span maps that record which span owns each page.
```mermaid
graph LR
    subgraph "Arena (64MB)"
        P1["Page 0<br/>mspan A"]
        P2["Page 1<br/>mspan A"]
        P3["Page 2<br/>mspan B"]
        P4["Page 3<br/>free"]
        P5["..."]
    end
    subgraph "Metadata"
        HB["Heap Bitmap<br/>(pointer/scalar per word)"]
        SM["Span Map<br/>(page → mspan)"]
    end
    P1 --- HB
    P1 --- SM
```
The virtual memory layout from malloc.go explains this arena system: the address space is viewed as a series of arena frames, indexed by a two-level arena map (mheap_.arenas). This allows the Go heap to use any part of the address space while keeping metadata lookups O(1).
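A sketch of those O(1) lookups, using the 64-bit constants (64MB arenas, 8KB pages). This deliberately ignores the runtime's arenaBaseOffset adjustment and collapses the two-level arena map into flat division, but the arithmetic shape is the same:

```go
package main

import "fmt"

const (
	heapArenaBytes = 64 << 20 // 64MB arena frames on 64-bit systems
	pageSize       = 8 << 10  // 8KB pages
)

// arenaIndex says which arena frame an address falls in; pageIndex says
// which page within that arena. The span map then turns the page index
// into the owning mspan — each lookup is arithmetic plus an array read.
func arenaIndex(addr uintptr) uintptr { return addr / heapArenaBytes }
func pageIndex(addr uintptr) uintptr  { return (addr % heapArenaBytes) / pageSize }

func main() {
	addr := uintptr(0xc000123456) // a typical Go heap address
	fmt.Println("arena:", arenaIndex(addr), "page:", pageIndex(addr))
}
```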
GC Algorithm and Four Phases
The garbage collector's algorithm is documented in mgc.go:
It's a concurrent mark-and-sweep collector that uses a hybrid write barrier combining Dijkstra's insertion barrier with Yuasa's deletion barrier:
src/runtime/mbarrier.go#L24-L35
```
writePointer(slot, ptr):
    shade(*slot)               // Yuasa: shade old referent
    if current stack is grey:
        shade(ptr)             // Dijkstra: shade new referent
    *slot = ptr
```
```mermaid
stateDiagram-v2
    [*] --> SweepTerm: GC triggered
    SweepTerm --> Mark: STW - enable write barrier
    note right of SweepTerm
        Phase 1 - Stop the world,
        sweep remaining spans
    end note
    Mark --> MarkTerm: all grey objects drained
    note right of Mark
        Phase 2 - Concurrent marking
        with write barriers active
    end note
    MarkTerm --> Sweep: STW - disable write barrier
    note right of MarkTerm
        Phase 3 - Stop the world,
        flush mcaches
    end note
    Sweep --> [*]: all spans swept
    note right of Sweep
        Phase 4 - Concurrent sweeping
        in background and on allocation
    end note
```
Phase 1 (Sweep Termination): Stop the world. Sweep any remaining unswept spans from the previous cycle.
Phase 2 (Concurrent Mark): Start the world with write barriers enabled. Mark workers scan roots (stacks, globals, runtime structures) and drain the grey object queue. This runs concurrently with the program.
Phase 3 (Mark Termination): Stop the world again. Verify that all marking is complete, flush per-P caches.
Phase 4 (Concurrent Sweep): Start the world with write barriers disabled. Sweep spans in the background and lazily during allocation. Newly allocated objects are white.
The two STW pauses (phases 1 and 3) are typically sub-millisecond for most programs. The heavy lifting — scanning the heap — happens concurrently in phase 2.
GC Pacer and Tuning
The GC pacer determines when to start the next collection cycle. It lives in mgcpacer.go:
src/runtime/mgcpacer.go#L16-L39
The target is 25% CPU utilization for GC marking (gcBackgroundUtilization = 0.25). The pacer triggers GC before the heap grows to a target size, aiming to finish marking just as the heap reaches the target.
Two knobs control GC behavior:
- GOGC (default 100): The heap growth ratio. GOGC=100 means the GC triggers when the heap doubles since the last collection. GOGC=50 means trigger at 50% growth. GOGC=off disables GC.
- GOMEMLIMIT: A soft memory limit. When set, the GC works harder to stay below this limit, even if it means running more frequently than GOGC would suggest.
GC Assist is the mechanism that prevents fast-allocating goroutines from outrunning the GC. Each goroutine tracks a gcAssistBytes credit. When this goes negative (the goroutine has allocated more than its share), it must perform marking work before it can allocate again. This creates natural backpressure — the faster you allocate, the more you help with GC.
```mermaid
flowchart TD
    A["Goroutine calls malloc"] --> B{"gcAssistBytes > 0?"}
    B -->|Yes| C["Allocate, decrement credit"]
    B -->|No| D["Perform mark work<br/>(scan grey objects)"]
    D --> E["Earn assist credit"]
    E --> C
    C --> F["Return allocated memory"]
```
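The credit accounting reduces to a few lines. This is a deliberately simplified model: the real assist code converts allocation debt into scan work through the pacer's assistWorkPerByte ratio, not the 1:1 repayment assumed here.

```go
package main

import "fmt"

// g is a stand-in for the goroutine descriptor's assist field.
type g struct {
	gcAssistBytes int64 // allocation credit; negative means debt
}

// allocate spends credit for a size-byte allocation. When the balance goes
// negative, the goroutine owes that many bytes of mark work before it may
// proceed (repayment assumed 1:1 in this sketch).
func (gp *g) allocate(size int64) (markDebt int64) {
	gp.gcAssistBytes -= size
	if gp.gcAssistBytes < 0 {
		markDebt = -gp.gcAssistBytes // scan this much before allocating
		gp.gcAssistBytes = 0         // credit earned by the mark work
	}
	return markDebt
}

func main() {
	gp := &g{gcAssistBytes: 100}
	fmt.Println(gp.allocate(64)) // within credit: no mark work owed
	fmt.Println(gp.allocate(64)) // 28 bytes over: must mark before allocating
}
```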
Tip: Use `GODEBUG=gctrace=1` to see GC timing for every cycle. The output shows heap sizes, pause times, and CPU utilization. For production tuning, `GOMEMLIMIT` is often more useful than `GOGC` because it directly maps to your container's memory budget.
Connecting Allocation to the Scheduler
Memory management and scheduling are deeply intertwined. The per-P mcache (from the scheduler's P struct) enables lock-free allocation. The GC uses the scheduler's STW mechanism for its two pause phases. GC mark workers are goroutines scheduled by the same G-M-P machinery we covered in the previous article. And GC assist — goroutines helping with marking — is integrated into the allocation fast path.
In the final article, we'll explore Go's concurrency primitives and I/O infrastructure: how channels are implemented as lock-protected circular buffers, how the network poller integrates with the scheduler, and how runtime-only compiler directives tie these systems together.