Memory Management in Go: The Allocator Hierarchy and Concurrent Garbage Collector

Advanced

Prerequisites

  • Articles 1-4: Repository through Scheduler
  • Garbage collection theory (tri-color marking, write barriers, concurrent collection)
  • Understanding of virtual memory and page-level allocation

Go's memory management sits at the intersection of two forces: allocation speed (every make, new, and composite literal) and collection efficiency (the GC must reclaim memory without long pauses). The solution is a layered allocator that avoids locks in the common case paired with a concurrent garbage collector that runs alongside your program. This article explores both systems.

Allocator Design Overview

The allocator's design is documented in a masterful comment at the top of malloc.go:

src/runtime/malloc.go#L5-L76

The hierarchy is inspired by Google's tcmalloc but has diverged significantly. Small objects (≤32KB) are allocated from size-class-specific caches. Large objects bypass the cache entirely and go straight to the page allocator.

graph TD
    subgraph "Per-P (lock-free)"
        MC["mcache<br/>~70 size classes<br/>+ tiny allocator"]
    end
    subgraph "Per-size-class (locked)"
        MCE["mcentral<br/>spans with free slots"]
    end
    subgraph "Global (locked)"
        MH["mheap<br/>page-level allocation"]
    end
    subgraph "Operating System"
        OS["mmap / VirtualAlloc"]
    end

    MC -->|"empty span"| MCE
    MCE -->|"no spans"| MH
    MH -->|"no pages"| OS
    OS -->|"≥1MB chunks"| MH
    MH -->|"span of pages"| MCE
    MCE -->|"span with free slots"| MC

The allocation algorithm for small objects:

  1. Round the requested size to one of ~70 size classes
  2. Check the corresponding mspan in the current P's mcache
  3. Scan the mspan's free bitmap for a free slot
  4. If no free slot, get a new mspan from mcentral
  5. If mcentral is empty, get pages from mheap
  6. If mheap has no pages, request from the OS (at least 1MB)

Steps 1-3 happen without any lock — this is the critical fast path that makes Go allocation competitive with stack allocation in many cases.
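Step 1 can be sketched in a few lines. The real table is generated into sizeclasses.go; the toy table and helper below are illustrative, not runtime code:

```go
package main

import "fmt"

// A toy subset of the runtime's size-class table (the real one,
// generated into sizeclasses.go, has roughly 70 entries).
var sizeClasses = []uintptr{8, 16, 24, 32, 48, 64, 80, 96, 112, 128}

// roundToSizeClass is a hypothetical helper mirroring step 1: round
// the requested size up to the smallest class that fits.
func roundToSizeClass(size uintptr) uintptr {
	for _, c := range sizeClasses {
		if size <= c {
			return c
		}
	}
	return size // large object: handled by the page allocator directly
}

func main() {
	fmt.Println(roundToSizeClass(9))  // 16
	fmt.Println(roundToSizeClass(33)) // 48
}
```

Rounding wastes some memory (a 9-byte request occupies a 16-byte slot) but lets every span hold identically sized slots, which is what makes the bitmap scan in step 3 trivial.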

mcache: Per-P Lock-Free Allocation

Each P (as we saw in the scheduler article) has its own mcache:

src/runtime/mcache.go#L14-L59

type mcache struct {
    nextSample  int64
    scanAlloc   uintptr
    tiny        uintptr
    tinyoffset  uintptr
    tinyAllocs  uintptr
    alloc       [numSpanClasses]*mspan
    // ...
}

The alloc array is indexed by span class (a combination of size class and noscan/scan). Each entry points to an mspan with free slots. When allocating, the runtime finds the appropriate span and scans its free bitmap — a simple bitwise operation, no locks needed.
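The bitmap scan reduces to a count-trailing-zeros instruction. A simplified sketch of the idea, assuming a single 64-bit cache word where a 1 bit marks a free slot (the real mspan refills its allocCache from allocBits as it empties):

```go
package main

import (
	"fmt"
	"math/bits"
)

// nextFreeSlot sketches the fast-path scan: find the lowest set bit
// in the free-slot cache. No locks, no loops over memory.
func nextFreeSlot(allocCache uint64) (slot int, ok bool) {
	if allocCache == 0 {
		return 0, false // span is full: fall back to mcentral
	}
	return bits.TrailingZeros64(allocCache), true
}

func main() {
	cache := uint64(0b10110000) // slots 4, 5, and 7 free
	slot, _ := nextFreeSlot(cache)
	fmt.Println("next free slot:", slot) // 4
	cache &^= 1 << slot                  // mark it allocated
	slot, _ = nextFreeSlot(cache)
	fmt.Println("next free slot:", slot) // 5
}
```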

The tiny allocator is a special optimization for objects smaller than 16 bytes that contain no pointers. It packs multiple tiny allocations into a single 16-byte block, dramatically reducing overhead for things like small strings and single-byte values.

Tip: Run go test -bench=. -benchmem to see allocation counts. Each "alloc/op" represents a trip through this allocator hierarchy. Zero-allocation hot paths are the gold standard for performance-critical Go code.
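The same measurement is available programmatically via testing.Benchmark, which is handy for quick experiments outside a test file. A small sketch (the exact allocation count depends on the compiler's escape analysis, so only the presence of allocations is asserted here):

```go
package main

import (
	"fmt"
	"testing"
)

// buildString allocates on every call: the backing bytes escape into
// the returned string, so each op takes a trip through the allocator.
func buildString(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = 'x'
	}
	return string(b)
}

func main() {
	// testing.Benchmark runs a benchmark function and reports
	// per-operation stats, including allocations.
	r := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = buildString(64)
		}
	})
	fmt.Println("allocs/op:", r.AllocsPerOp())
}
```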

mcentral and mheap: Central and Global Allocation

When an mcache runs out of spans for a size class, it requests a new one from the mcentral:

src/runtime/mcentral.go#L22-L40

Each mcentral manages spans for a single size class, maintaining two sets that swap roles each GC cycle: swept spans and unswept spans. The sweepgen counter (incremented by 2 each cycle) determines which set is which. This design means allocation can trigger sweeping — if you need a span, you might sweep one first to find free objects.
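The sweepgen arithmetic is documented on mspan in mheap.go; a sketch of the comparison (the state strings are paraphrased from that comment, and this is not runtime code):

```go
package main

import "fmt"

// spanSweepState mirrors the encoding documented on mspan: the heap's
// sweepgen advances by 2 each GC cycle, and a span's own sweepgen
// relative to it tells the allocator what to do with the span.
func spanSweepState(spanGen, heapGen uint32) string {
	switch spanGen {
	case heapGen - 2:
		return "needs sweeping"
	case heapGen - 1:
		return "being swept"
	case heapGen:
		return "swept, ready to use"
	default:
		return "cached in an mcache"
	}
}

func main() {
	heapGen := uint32(10) // bumped by 2 at the start of each cycle
	fmt.Println(spanSweepState(8, heapGen))  // left over from last cycle
	fmt.Println(spanSweepState(10, heapGen)) // already swept this cycle
}
```

Because the generation counter advances rather than any per-span flag being cleared, starting a new cycle implicitly marks every span unswept at once, with no walk over the heap.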

When the mcentral has no spans, the mheap allocates pages:

src/runtime/mheap.go#L1-L20

The heap manages memory at page granularity (8KB pages) using arenas — 64MB chunks on 64-bit systems. Each arena has associated metadata: heap bitmaps for pointer scanning and span maps that record which span owns each page.

graph LR
    subgraph "Arena (64MB)"
        P1["Page 0<br/>mspan A"]
        P2["Page 1<br/>mspan A"]
        P3["Page 2<br/>mspan B"]
        P4["Page 3<br/>free"]
        P5["..."]
    end
    subgraph "Metadata"
        HB["Heap Bitmap<br/>(pointer/scalar per word)"]
        SM["Span Map<br/>(page → mspan)"]
    end
    P1 --- HB
    P1 --- SM

The virtual memory layout from malloc.go explains this arena system: the address space is viewed as a series of arena frames, indexed by a two-level arena map (mheap_.arenas). This allows the Go heap to use any part of the address space while keeping metadata lookups O(1).
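The metadata lookup can be sketched as simple index arithmetic. Assuming 64MB arenas and ignoring the platform-specific base offset and the two-level split, this illustrates why finding a pointer's arena (and from there its span and bitmap) is O(1):

```go
package main

import "fmt"

const heapArenaBytes = 64 << 20 // 64MB arenas on 64-bit systems

// arenaIndexOf sketches the runtime's arenaIndex: divide the address
// by the arena size (the real version first subtracts a platform
// base offset, assumed 0 here, and splits the result across the
// two-level mheap_.arenas map).
func arenaIndexOf(p uintptr) uintptr {
	return p / heapArenaBytes
}

func main() {
	p := uintptr(3*heapArenaBytes + 12345) // an address inside arena 3
	fmt.Println(arenaIndexOf(p))           // 3
}
```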

GC Algorithm and Four Phases

The garbage collector's algorithm is documented in mgc.go:

src/runtime/mgc.go#L1-L83

It's a concurrent mark-and-sweep collector that uses a hybrid write barrier combining Dijkstra's insertion barrier with Yuasa's deletion barrier:

src/runtime/mbarrier.go#L24-L35

writePointer(slot, ptr):
    shade(*slot)          // Yuasa: shade old referent
    if current stack is grey:
        shade(ptr)        // Dijkstra: shade new referent
    *slot = ptr

The collection cycle moves through four phases:

stateDiagram-v2
    [*] --> SweepTerm: GC triggered
    SweepTerm --> Mark: STW - enable write barrier
    note right of SweepTerm
        Phase 1 - Stop the world,
        sweep remaining spans
    end note
    Mark --> MarkTerm: all grey objects drained
    note right of Mark
        Phase 2 - Concurrent marking
        with write barriers active
    end note
    MarkTerm --> Sweep: STW - disable write barrier
    note right of MarkTerm
        Phase 3 - Stop the world,
        flush mcaches
    end note
    Sweep --> [*]: all spans swept
    note right of Sweep
        Phase 4 - Concurrent sweeping
        in background and on allocation
    end note

Phase 1 (Sweep Termination): Stop the world. Sweep any remaining unswept spans from the previous cycle.

Phase 2 (Concurrent Mark): Start the world with write barriers enabled. Mark workers scan roots (stacks, globals, runtime structures) and drain the grey object queue. This runs concurrently with the program.

Phase 3 (Mark Termination): Stop the world again. Verify that all marking is complete, flush per-P caches.

Phase 4 (Concurrent Sweep): Start the world with write barriers disabled. Sweep spans in the background and lazily during allocation. Newly allocated objects are white.

The two STW pauses (phases 1 and 3) are typically sub-millisecond for most programs. The heavy lifting — scanning the heap — happens concurrently in phase 2.

GC Pacer and Tuning

The GC pacer determines when to start the next collection cycle. It lives in mgcpacer.go:

src/runtime/mgcpacer.go#L16-L39

The target is 25% CPU utilization for GC marking (gcBackgroundUtilization = 0.25). The pacer triggers GC before the heap grows to a target size, aiming to finish marking just as the heap reaches the target.

Two knobs control GC behavior:

  • GOGC (default 100): The heap growth ratio. GOGC=100 means the GC triggers when the heap doubles since the last collection. GOGC=50 means trigger at 50% growth. GOGC=off disables GC.
  • GOMEMLIMIT: A soft memory limit. When set, the GC works harder to stay below this limit, even if it means running more frequently than GOGC would suggest.
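The GOGC arithmetic can be written out directly. A sketch of the classic pacing formula (the real pacer in mgcpacer.go also folds in stack and global scan work, and the trigger fires somewhat before the goal so marking finishes in time):

```go
package main

import "fmt"

// heapGoal sketches the pacing target: the live heap after the last
// collection, grown by GOGC percent.
func heapGoal(liveBytes uint64, gogc int) uint64 {
	return liveBytes + liveBytes*uint64(gogc)/100
}

func main() {
	// 100MB live after the last GC with GOGC=100: aim to finish the
	// next mark phase as the heap reaches about 200MB.
	fmt.Println(heapGoal(100<<20, 100) >> 20) // 200
	// GOGC=50 trades CPU for memory: trigger at about 150MB instead.
	fmt.Println(heapGoal(100<<20, 50) >> 20) // 150
}
```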

GC Assist is the mechanism that prevents fast-allocating goroutines from outrunning the GC. Each goroutine tracks a gcAssistBytes credit. When this goes negative (the goroutine has allocated more than its share), it must perform marking work before it can allocate again. This creates natural backpressure — the faster you allocate, the more you help with GC.

flowchart TD
    A["Goroutine calls malloc"] --> B{"gcAssistBytes > 0?"}
    B -->|Yes| C["Allocate, decrement credit"]
    B -->|No| D["Perform mark work<br/>(scan grey objects)"]
    D --> E["Earn assist credit"]
    E --> C
    C --> F["Return allocated memory"]
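The flow above can be sketched as a credit loop. Field and function names here are illustrative (the real bookkeeping lives in mallocgc and gcAssistAlloc), but the shape is the same: allocation spends credit, mark work earns it back:

```go
package main

import "fmt"

// goroutine is a stand-in for the runtime's g; gcAssistBytes is the
// real field name, the rest is illustrative.
type goroutine struct {
	gcAssistBytes int64
}

// assistAlloc charges an allocation against the credit and, if the
// balance goes negative, performs mark work until it is repaid.
func (g *goroutine) assistAlloc(size int64, markWork func() int64) {
	g.gcAssistBytes -= size
	for g.gcAssistBytes < 0 {
		g.gcAssistBytes += markWork() // scan grey objects, earn credit
	}
}

func main() {
	g := &goroutine{gcAssistBytes: 100}
	scans := 0
	mark := func() int64 { scans++; return 64 } // each round scans 64 bytes

	g.assistAlloc(80, mark)  // within credit: no mark work needed
	g.assistAlloc(200, mark) // over budget: must assist before proceeding
	fmt.Println("assist rounds:", scans) // 3
}
```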

Tip: Use GODEBUG=gctrace=1 to see GC timing for every cycle. The output shows heap sizes, pause times, and CPU utilization. For production tuning, GOMEMLIMIT is often more useful than GOGC because it directly maps to your container's memory budget.

Connecting Allocation to the Scheduler

Memory management and scheduling are deeply intertwined. The per-P mcache (from the scheduler's P struct) enables lock-free allocation. The GC uses the scheduler's STW mechanism for its two pause phases. GC mark workers are goroutines scheduled by the same G-M-P machinery we covered in the previous article. And GC assist — goroutines helping with marking — is integrated into the allocation fast path.

In the final article, we'll explore Go's concurrency primitives and I/O infrastructure: how channels are implemented as lock-protected circular buffers, how the network poller integrates with the scheduler, and how runtime-only compiler directives tie these systems together.