
GGML: The Tensor Engine Under llama.cpp

Advanced

Prerequisites

  • Articles 1-2
  • Basic understanding of GPU computing (host vs device memory, compute kernels)


Every matrix multiply, every attention computation, every quantized weight lookup in llama.cpp ultimately flows through GGML—a standalone C tensor library that lives in the ggml/ directory. GGML is not just "the math backend." It defines the execution model, memory layout, data types, and hardware abstraction that make portable, quantized inference possible.

This article dissects GGML from the inside out. We'll cover its lazy evaluation pattern, the core data structures, the three-level backend vtable system that enables pluggable hardware support, compile-time backend registration, the scheduler that splits computation across devices, quantization types, and the GGUF file format.

Lazy Evaluation and Core Abstractions

GGML uses a two-phase execution model that will be familiar if you've worked with TensorFlow 1.x or JAX's tracing: build the computation graph, then execute it. Tensor operations like ggml_mul_mat() don't perform any computation—they create graph nodes that describe what to compute. Only ggml_graph_compute() actually runs the math.

The header comment in ggml.h explains this with a worked example: you define f(x) = a*x² + b by calling tensor operations, then compute it by setting input values and calling ggml_graph_compute.

Three core types power this model:

ggml_tensor (defined around line 658 of ggml.h) is the fundamental unit. It stores:

  • type — the data type (F32, F16, Q4_K, etc.)
  • ne[4] — number of elements in each dimension (up to 4D)
  • nb[4] — stride in bytes per dimension
  • op — which operation produced this tensor
  • src[GGML_MAX_SRC] — pointers to source tensors (the graph edges)
  • data — pointer to the actual data (may be in CPU or GPU memory)
  • buffer — which backend buffer owns the data

ggml_context is a memory arena used to allocate tensors. It provides bump-pointer allocation from a pre-sized buffer, avoiding per-tensor malloc overhead. When you build a computation graph, all intermediate tensor descriptors (not data) are allocated from a context.
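
A minimal sketch of the bump-pointer idea (hypothetical names; the real ggml_context also records tensor objects and supports a no_alloc mode where only descriptors, not data, are placed in the arena):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal bump-pointer arena in the spirit of ggml_context. */
struct arena {
    uint8_t *base;   /* start of the pre-sized buffer */
    size_t   size;   /* total capacity in bytes       */
    size_t   used;   /* bytes handed out so far       */
};

/* Allocate n bytes, 16-byte aligned; returns NULL when the arena is full.
 * No per-allocation free: the whole arena is released at once. */
void *arena_alloc(struct arena *a, size_t n) {
    size_t offs = (a->used + 15) & ~(size_t)15;  /* round up to alignment */
    if (offs + n > a->size) return NULL;
    a->used = offs + n;
    return a->base + offs;
}
```

Because allocation is a pointer bump, creating thousands of intermediate tensor descriptors per graph costs almost nothing compared to individual malloc calls.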

ggml_cgraph is the computation DAG—a list of nodes (tensors with operations) and leaf tensors (inputs). ggml_build_forward_expand(gf, tensor) walks the tensor's source links to discover all dependent operations and adds them to the graph.

flowchart LR
    subgraph "Build Phase"
        A["ggml_new_tensor(ctx)"] --> B["ggml_mul_mat(ctx, W, x)"]
        B --> C["ggml_add(ctx, result, bias)"]
        C --> D["ggml_build_forward_expand(gf, output)"]
    end
    subgraph "Execute Phase"
        D --> E["ggml_backend_sched_alloc_graph(sched, gf)"]
        E --> F["set tensor data"]
        F --> G["ggml_backend_sched_graph_compute(sched, gf)"]
    end

Tip: The src[] array in ggml_tensor is how the graph is implicitly encoded. Each tensor knows its inputs. ggml_build_forward_expand simply walks this tree recursively to discover the full DAG. There's no separate "graph builder" API—the graph is the tensor connectivity.
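
That walk can be sketched in a few lines: a post-order traversal over src[] that visits each node exactly once, so every operation lands in the list after its inputs (hypothetical names, not the real API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SRC   4
#define MAX_NODES 64

/* A node knows only its inputs; the DAG is implicit in src[]. */
struct node {
    const char  *name;
    struct node *src[MAX_SRC];
    bool         visited;
};

/* Post-order walk: append inputs before the op that consumes them,
 * visiting each node once — the essence of ggml_build_forward_expand. */
void expand(struct node *n, struct node **order, int *count) {
    if (n == NULL || n->visited) return;
    n->visited = true;
    for (int i = 0; i < MAX_SRC; i++) expand(n->src[i], order, count);
    order[(*count)++] = n;
}
```

A tensor reused by two ops (say, a residual input) is visited only once thanks to the visited flag, which is why shared subexpressions are computed a single time.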

The Backend Vtable System

GGML supports CPUs, NVIDIA GPUs (CUDA), Apple GPUs (Metal), AMD GPUs (ROCm/HIP or Vulkan), Intel GPUs (SYCL), Huawei NPUs (CANN), and more. It achieves this through a classic C vtable pattern with three levels of abstraction, defined in ggml/src/ggml-backend-impl.h:

Level 1: ggml_backend_buffer_type_i — the memory allocation strategy. Each buffer type knows how to allocate buffers for a specific device, what alignment is required, and whether the memory is host-accessible. This is where you ask "can I allocate 2GB on this GPU?"

Level 2: ggml_backend_buffer_i — data transfer operations on an allocated buffer. It provides set_tensor(), get_tensor(), memset_tensor(), and cpy_tensor() for moving data between host and device memory.

Level 3: ggml_backend_i — the compute interface. It provides graph_compute() (execute a computation graph), async tensor operations, and synchronization.

classDiagram
    class ggml_backend_buffer_type_i {
        +get_name()
        +alloc_buffer(size)
        +get_alignment()
        +get_max_size()
        +is_host()
    }
    class ggml_backend_buffer_i {
        +free_buffer()
        +get_base()
        +set_tensor(tensor, data, offset, size)
        +get_tensor(tensor, data, offset, size)
        +clear(value)
    }
    class ggml_backend_i {
        +get_name()
        +free()
        +graph_compute(graph)
        +synchronize()
    }
    class ggml_backend_buffer_type {
        iface: buffer_type_i
        device: backend_dev_t
        context: void*
    }
    class ggml_backend_buffer {
        iface: buffer_i
        buft: buffer_type_t
        context: void*
        size: size_t
    }
    ggml_backend_buffer_type --> ggml_backend_buffer_type_i : "contains"
    ggml_backend_buffer --> ggml_backend_buffer_i : "contains"
    ggml_backend_buffer_type "1" --> "*" ggml_backend_buffer : "creates"

Each hardware backend provides its own implementations of these interfaces. For example, the CUDA backend's buffer type allocates GPU memory via cudaMalloc, its buffer interface uses cudaMemcpy for data transfers, and its compute backend dispatches CUDA kernels for each GGML operation.
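
The pattern itself is ordinary C: a struct of function pointers that each backend fills in differently. Here is a minimal sketch of a Level-1 style interface with a malloc-backed "CPU" implementation (hypothetical names, far simpler than the real ggml_backend_buffer_type_i):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* A struct of function pointers: the C equivalent of a vtable. */
struct buffer_type_iface {
    const char *(*get_name)(void);
    void       *(*alloc_buffer)(size_t size);
    size_t      (*get_alignment)(void);
    bool        (*is_host)(void);
};

/* A "CPU backend" implementation backed by plain malloc.
 * A CUDA backend would point alloc_buffer at a cudaMalloc wrapper. */
static const char *cpu_name(void)         { return "CPU"; }
static void       *cpu_alloc(size_t size) { return malloc(size); }
static size_t      cpu_alignment(void)    { return 32; }
static bool        cpu_is_host(void)      { return true; }

static const struct buffer_type_iface cpu_buffer_type = {
    cpu_name, cpu_alloc, cpu_alignment, cpu_is_host,
};
```

Callers only ever go through the interface struct, so code that allocates and fills buffers never needs to know which device it is talking to.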

Backend Registration and Discovery

How does GGML know which backends are available at runtime? Through compile-time registration in ggml/src/ggml-backend-reg.cpp. The ggml_backend_registry constructor uses #ifdef guards to register each backend:

ggml_backend_registry() {
#ifdef GGML_USE_CUDA
    register_backend(ggml_backend_cuda_reg());
#endif
#ifdef GGML_USE_METAL
    register_backend(ggml_backend_metal_reg());
#endif
#ifdef GGML_USE_VULKAN
    register_backend(ggml_backend_vk_reg());
#endif
// ...
#ifdef GGML_USE_CPU
    register_backend(ggml_backend_cpu_reg());
#endif
}

The ordering matters: GPU backends are registered first, CPU last. The backend scheduler (covered next) assigns operations to backends in registration order by default, preferring GPU over CPU.

flowchart TD
    CMAKE["CMake -DGGML_CUDA=ON"] --> DEFINE["#define GGML_USE_CUDA"]
    DEFINE --> INCLUDE["#include ggml-cuda.h"]
    INCLUDE --> REG["register_backend(ggml_backend_cuda_reg())"]
    REG --> DEV["Enumerate CUDA devices"]
    DEV --> READY["Backend ready for graph compute"]

Each register_backend() call discovers all devices exposed by that backend (e.g., two NVIDIA GPUs) and adds them to the global device list. GGML also supports dynamic backend loading from shared libraries, enabling out-of-tree backends without recompiling the core library.

Tip: CPU is always registered last (#ifdef GGML_USE_CPU). This ensures that GPU backends get priority in the scheduler. If no GPU backend is compiled in, CPU handles everything.

The Backend Scheduler

A single computation graph may span multiple backends—for instance, some layers on GPU and embedding/output on CPU. The backend scheduler (ggml_backend_sched) handles this automatically.

The scheduler's job is threefold:

  1. Graph partitioning — For each operation in the graph, decide which backend should execute it. The default strategy assigns ops to the backend that owns the operation's primary input tensor, with fallback to the first registered backend that supports the operation.

  2. Buffer allocation — Allocate the right backend buffer for each tensor based on which backend will operate on it.

  3. Data transfer insertion — When an operation's inputs live on different backends (e.g., a GPU op needs a CPU tensor), automatically insert copy operations to move data to the correct device.

sequenceDiagram
    participant Ctx as llama_context
    participant Sched as Backend Scheduler
    participant GPU as CUDA Backend
    participant CPU as CPU Backend

    Ctx->>Sched: alloc_graph(gf)
    Sched->>Sched: Assign ops to backends
    Sched->>Sched: Insert cross-device copies
    Sched->>GPU: Allocate GPU tensors
    Sched->>CPU: Allocate CPU tensors
    Ctx->>Sched: graph_compute(gf)
    Sched->>GPU: Execute GPU subgraph
    GPU->>CPU: Transfer intermediate results
    Sched->>CPU: Execute CPU subgraph
    Sched-->>Ctx: Done

The n_gpu_layers parameter that users set controls this partitioning at a higher level: libllama's model loading code assigns layer tensors to GPU or CPU buffer types based on this setting, and the scheduler respects those assignments.
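
The default assignment rule described above can be sketched as a greedy lookup: keep the op where its primary input lives if that backend supports it, otherwise fall back to the first registered backend that does (hypothetical names and a deliberately tiny supports_op, not the scheduler's real heuristics):

```c
#include <assert.h>
#include <stdbool.h>

#define N_BACKENDS 2

struct backend {
    const char *name;
    bool (*supports_op)(int op);
};

enum { OP_MUL_MAT, OP_CUSTOM };

static bool gpu_supports(int op) { return op == OP_MUL_MAT; }
static bool cpu_supports(int op) { (void)op; return true; }

/* Registration order matters: GPU first, CPU last (CPU supports everything). */
static const struct backend backends[N_BACKENDS] = {
    { "GPU", gpu_supports },
    { "CPU", cpu_supports },
};

/* Pick a backend index for op whose primary input lives on input_backend. */
int assign_backend(int op, int input_backend) {
    if (backends[input_backend].supports_op(op)) return input_backend;
    for (int i = 0; i < N_BACKENDS; i++)
        if (backends[i].supports_op(op)) return i;
    return -1;
}
```

When the chosen backend differs from where an input currently lives, the real scheduler inserts the copy operations described in step 3.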

Quantization Types

GGML's quantization type system is one of its defining features. Rather than storing weights as 32-bit floats, GGML supports block-quantized formats that drastically reduce memory usage while maintaining acceptable precision.

The core types fall into several families:

| Family               | Types                          | Block Size | Bits/Weight             |
|----------------------|--------------------------------|------------|-------------------------|
| IEEE standard        | F32, F16, BF16                 | 1          | 32, 16, 16              |
| Uniform quantization | Q8_0, Q4_0, Q4_1, Q5_0, Q5_1   | 32         | 8.5, 4.5, 5.0, 5.5, 6.0 |
| K-quants             | Q2_K, Q3_K, Q4_K, Q5_K, Q6_K   | 256        | 2.6–6.6                 |
| I-quants             | IQ1_S, IQ2_XXS, IQ3_S, IQ4_NL  | varies     | 1.6–4.5                 |
| Ternary              | TQ1_0, TQ2_0                   | 256        | 1.7, 2.1                |

K-quants use super-blocks of 256 elements with per-block scales and mins, achieving better precision-to-size ratios than uniform quantization. I-quants use importance matrices and lookup tables for even more aggressive compression.

Each quantization type is registered with a type traits structure in ggml.h that specifies block size, type size, and pointers to quantize/dequantize functions. The actual quantization/dequantization kernels are implemented in ggml/src/ggml-quants.c for CPU and in backend-specific files for GPUs.
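
The uniform family is simple enough to sketch end to end. Below is a Q8_0-style block: 32 weights share one scale d = max|x|/127 and are stored as signed bytes. This is a simplified sketch — the real Q8_0 stores d as fp16, and the real kernels are heavily vectorized:

```c
#include <assert.h>
#include <stdint.h>

#define QK 32  /* block size, as in Q8_0 */

struct block_q8 {
    float  d;        /* shared scale: d = max|x| / 127    */
    int8_t qs[QK];   /* 32 signed 8-bit quantized weights */
};

/* Quantize QK floats into one block. */
void quantize_block(const float *x, struct block_q8 *b) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    b->d = amax / 127.0f;
    float id = b->d != 0.0f ? 1.0f / b->d : 0.0f;
    for (int i = 0; i < QK; i++) {
        float v = x[i] * id;                                  /* in [-127, 127] */
        b->qs[i] = (int8_t)(v < 0.0f ? v - 0.5f : v + 0.5f);  /* round to nearest */
    }
}

/* Dequantize back to floats; error is at most half a quantum (d/2). */
void dequantize_block(const struct block_q8 *b, float *y) {
    for (int i = 0; i < QK; i++)
        y[i] = b->d * (float)b->qs[i];
}
```

Storage works out to 32 bytes of weights plus the scale per 32 elements — the 8.5 bits/weight in the table above once the scale is fp16. K-quants push this further by nesting sub-block scales inside a 256-element super-block.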

flowchart LR
    F16["F16 Model\n14 GB"] --> Q8["Q8_0\n7.2 GB"]
    Q8 --> Q4K["Q4_K_M\n4.1 GB"]
    Q4K --> Q2K["Q2_K\n2.7 GB"]
    Q2K --> IQ2["IQ2_XXS\n2.1 GB"]
    
    style F16 fill:#e1f5fe
    style Q8 fill:#b3e5fc
    style Q4K fill:#81d4fa
    style Q2K fill:#4fc3f7
    style IQ2 fill:#29b6f6

The GGUF File Format

GGUF (GGML Universal File format) is the self-describing binary format that stores both model metadata and tensor data. Its structure is documented in the header comment of ggml/include/gguf.h:

flowchart TD
    subgraph "GGUF File Layout"
        MAGIC["Magic: 'GGUF' (4 bytes)"]
        VERSION["Version: 3 (uint32)"]
        NT["Tensor count (int64)"]
        NKV["KV pair count (int64)"]
        KV["KV Metadata Pairs\n- architecture, n_layer, n_embd...\n- tokenizer data\n- chat template"]
        TD["Tensor Descriptors\n- name, dimensions, type, offset"]
        PAD["Alignment padding"]
        DATA["Tensor Data Blob\n(aligned to GGUF_DEFAULT_ALIGNMENT)"]
    end
    MAGIC --> VERSION --> NT --> NKV --> KV --> TD --> PAD --> DATA

Key design decisions in GGUF:

  1. Self-describing: Every piece of metadata needed to load and use a model is in the file. The general.architecture key identifies the model type. Hyperparameters like llama.block_count and llama.embedding_length provide shape information. The tokenizer vocabulary and merge rules are embedded as KV arrays.

  2. Aligned data: Tensor data is aligned to 32 bytes by default (configurable via the general.alignment key), enabling efficient mmap-based loading where tensors can be accessed directly from the memory-mapped file without copying.

  3. Type system: KV values can be uint8, int8, uint16, int16, uint32, int32, float32, bool, string, arrays of any of these, uint64, int64, or float64. The gguf_type enum defines all 13 types.

  4. Version 3: The current version supports 64-bit tensor counts and offsets, enabling files larger than 4GB. Magic bytes are "GGUF" (0x46554747 in little-endian).
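
The fixed-size preamble is straightforward to parse directly. Here is a sketch that reads the magic, version, and the two 64-bit counts from a byte buffer (hypothetical helper, not the real gguf.h API; assumes a little-endian host, matching GGUF's default byte order):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* The fixed-size GGUF preamble: magic, version, then two 64-bit counts. */
struct gguf_header {
    uint32_t magic;      /* 'GGUF' = 0x46554747 little-endian */
    uint32_t version;    /* currently 3                        */
    int64_t  n_tensors;
    int64_t  n_kv;
};

/* Parse the first 24 bytes of a GGUF file; rejects short buffers,
 * bad magic, and unexpected versions. */
bool gguf_read_header(const uint8_t *buf, size_t len, struct gguf_header *h) {
    if (len < sizeof(*h)) return false;
    memcpy(h, buf, sizeof(*h));
    return h->magic == 0x46554747u && h->version == 3;
}
```

Everything after these 24 bytes is the KV metadata, then the tensor descriptors, then padding up to the alignment boundary before the data blob.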

Tip: The gguf-py Python library in the repository provides a convenient way to inspect GGUF files: python -m gguf.scripts.gguf_dump model.gguf prints all metadata and tensor shapes.

What's Next

We've now covered GGML from tensor operations to hardware backends to file formats. But the story of a decode call is incomplete. In the next article, we'll trace the complete inference pipeline: how llama_decode() manages batching and micro-batching, how the polymorphic memory system handles KV caches for transformers and state buffers for recurrent models, and how the system recovers from failures without corrupting state.