GGML: The Tensor Engine Under llama.cpp
Prerequisites
- Articles 1-2
- Basic understanding of GPU computing (host vs device memory, compute kernels)
Every matrix multiply, every attention computation, every quantized weight lookup in llama.cpp ultimately flows through GGML—a standalone C tensor library that lives in the ggml/ directory. GGML is not just "the math backend." It defines the execution model, memory layout, data types, and hardware abstraction that make portable, quantized inference possible.
This article dissects GGML from the inside out. We'll cover its lazy evaluation pattern, the core data structures, the three-level backend vtable system that enables pluggable hardware support, compile-time backend registration, the scheduler that splits computation across devices, quantization types, and the GGUF file format.
Lazy Evaluation and Core Abstractions
GGML uses a two-phase execution model that will be familiar if you've worked with TensorFlow 1.x or JAX's tracing: build the computation graph, then execute it. Tensor operations like ggml_mul_mat() don't perform any computation—they create graph nodes that describe what to compute. Only ggml_graph_compute() actually runs the math.
The header comment in ggml.h explains this with a worked example: you define f(x) = a*x² + b by calling tensor operations, then compute it by setting input values and calling ggml_graph_compute.
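Here is a condensed version of that example to make the two phases concrete. This is a minimal sketch based on the `ggml.h` header comment; in recent trees the CPU helpers (`ggml_set_f32`, `ggml_graph_compute_with_ctx`, `ggml_get_f32_1d`) are declared in `ggml-cpu.h`, so include that header as needed.

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    // Phase 1: build the graph. No math happens here.
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);

    struct ggml_tensor * x2 = ggml_mul(ctx, x, x);                    // x^2
    struct ggml_tensor * f  = ggml_add(ctx, ggml_mul(ctx, a, x2), b); // a*x^2 + b

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, f);

    // Phase 2: set the inputs, then execute the graph.
    ggml_set_f32(x, 2.0f);
    ggml_set_f32(a, 3.0f);
    ggml_set_f32(b, 4.0f);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    printf("f = %f\n", ggml_get_f32_1d(f, 0));   // 3*2^2 + 4 = 16
    ggml_free(ctx);
    return 0;
}
```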
Three core types power this model:
ggml_tensor (defined at line 658) is the fundamental unit. It stores:
- `type` — the data type (F32, F16, Q4_K, etc.)
- `ne[4]` — number of elements in each dimension (up to 4D)
- `nb[4]` — stride in bytes per dimension
- `op` — which operation produced this tensor
- `src[GGML_MAX_SRC]` — pointers to source tensors (the graph edges)
- `data` — pointer to the actual data (may be in CPU or GPU memory)
- `buffer` — which backend buffer owns the data
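For reference, a heavily abridged view of the struct itself; the field names follow `ggml.h`, but members such as `op_params`, `flags`, view bookkeeping, and the tensor name are omitted here:

```c
// Abridged sketch of struct ggml_tensor (see ggml.h for the full definition).
struct ggml_tensor_abridged {
    enum ggml_type               type;               // F32, F16, Q4_K, ...
    struct ggml_backend_buffer * buffer;             // backend buffer that owns `data`
    int64_t                      ne[GGML_MAX_DIMS];  // elements per dimension
    size_t                       nb[GGML_MAX_DIMS];  // stride in bytes per dimension
    enum ggml_op                 op;                 // the op that produced this tensor
    struct ggml_tensor *         src[GGML_MAX_SRC];  // inputs: the graph edges
    void *                       data;               // payload, possibly in device memory
    // ... op_params, flags, view_src, name, etc.
};
```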
ggml_context is a memory arena used to allocate tensors. It provides bump-pointer allocation from a pre-sized buffer, avoiding per-tensor malloc overhead. When you build a computation graph, all intermediate tensor descriptors (not data) are allocated from a context.
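In llama.cpp the graph-building contexts are typically created with `no_alloc = true`, so the arena holds only tensor descriptors while the data lives in backend buffers. A small sketch of sizing such a context (the overhead helpers are real GGML API; the `1024` node budget is an arbitrary example):

```c
#include "ggml.h"

// Sketch: a context that bump-allocates tensor *descriptors* only.
// Tensor data is allocated elsewhere (e.g. in backend buffers).
struct ggml_context * new_graph_ctx(void) {
    struct ggml_init_params params = {
        .mem_size   = ggml_tensor_overhead()*1024 + ggml_graph_overhead(),
        .mem_buffer = NULL,   // let GGML allocate the arena itself
        .no_alloc   = true,   // descriptors only, no data
    };
    return ggml_init(params);
}
```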
ggml_cgraph is the computation DAG—a list of nodes (tensors with operations) and leaf tensors (inputs). ggml_build_forward_expand(gf, tensor) walks the tensor's source links to discover all dependent operations and adds them to the graph.
```mermaid
flowchart LR
    subgraph "Build Phase"
        A["ggml_new_tensor(ctx)"] --> B["ggml_mul_mat(ctx, W, x)"]
        B --> C["ggml_add(ctx, result, bias)"]
        C --> D["ggml_build_forward_expand(gf, output)"]
    end
    subgraph "Execute Phase"
        D --> E["ggml_backend_sched_alloc_graph(sched, gf)"]
        E --> F["set tensor data"]
        F --> G["ggml_backend_sched_graph_compute(sched, gf)"]
    end
```
Tip: The `src[]` array in `ggml_tensor` is how the graph is implicitly encoded. Each tensor knows its inputs. `ggml_build_forward_expand` simply walks this tree recursively to discover the full DAG. There's no separate "graph builder" API—the graph is the tensor connectivity.
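A simplified sketch of what that recursive walk looks like; this is not the actual implementation in GGML (which deduplicates visited tensors with a hash table and distinguishes leafs from nodes), and the two helpers are hypothetical:

```c
// Simplified sketch: visit a tensor's sources first, then append the tensor,
// so the graph ends up in topological order. `already_in_graph` and
// `append_node` are hypothetical helpers, not GGML API.
static void visit(struct ggml_cgraph * gf, struct ggml_tensor * t) {
    if (t == NULL || already_in_graph(gf, t)) {
        return;
    }
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        visit(gf, t->src[i]);   // recurse into inputs
    }
    append_node(gf, t);         // operations become graph nodes
}
```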
The Backend Vtable System
GGML supports CPUs, NVIDIA GPUs (CUDA), Apple GPUs (Metal), AMD GPUs (ROCm/HIP), Intel GPUs (SYCL), GPUs from many vendors via Vulkan, Huawei NPUs (CANN), and more. It achieves this through a classic C vtable pattern with three levels of abstraction, defined in ggml/src/ggml-backend-impl.h:
Level 1: ggml_backend_buffer_type_i — the memory allocation strategy. Each buffer type knows how to allocate buffers for a specific device, what alignment is required, and whether the memory is host-accessible. This is where you ask "can I allocate 2GB on this GPU?"
Level 2: ggml_backend_buffer_i — data transfer operations on an allocated buffer. It provides set_tensor(), get_tensor(), memset_tensor(), and cpy_tensor() for moving data between host and device memory.
Level 3: ggml_backend_i — the compute interface. It provides graph_compute() (execute a computation graph), async tensor operations, and synchronization.
```mermaid
classDiagram
    class ggml_backend_buffer_type_i {
        +get_name()
        +alloc_buffer(size)
        +get_alignment()
        +get_max_size()
        +is_host()
    }
    class ggml_backend_buffer_i {
        +free_buffer()
        +get_base()
        +set_tensor(tensor, data, offset, size)
        +get_tensor(tensor, data, offset, size)
        +clear(value)
    }
    class ggml_backend_i {
        +get_name()
        +free()
        +graph_compute(graph)
        +synchronize()
    }
    class ggml_backend_buffer_type {
        iface: buffer_type_i
        device: backend_dev_t
        context: void*
    }
    class ggml_backend_buffer {
        iface: buffer_i
        buft: buffer_type_t
        context: void*
        size: size_t
    }
    ggml_backend_buffer_type --> ggml_backend_buffer_type_i : "contains"
    ggml_backend_buffer --> ggml_backend_buffer_i : "contains"
    ggml_backend_buffer_type "1" --> "*" ggml_backend_buffer : "creates"
```
Each hardware backend provides its own implementations of these interfaces. For example, the CUDA backend's buffer type allocates GPU memory via cudaMalloc, its buffer interface uses cudaMemcpy for data transfers, and its compute backend dispatches CUDA kernels for each GGML operation.
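From application code these interfaces are usually reached through the public wrappers in `ggml-backend.h` rather than the vtables directly. A rough sketch of the Level 1 and Level 2 path, assuming a recent GGML (function names can shift between versions):

```c
#include "ggml-backend.h"

// Sketch: reaching the three levels through the public wrappers.
static void probe_gpu_buffers(void) {
    // Level 1: pick a device and ask its buffer type about allocation.
    ggml_backend_dev_t dev = ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_GPU);
    if (dev == NULL) {
        return; // no GPU backend compiled in / no device found
    }
    ggml_backend_buffer_type_t buft = ggml_backend_dev_buffer_type(dev);
    size_t align  = ggml_backend_buft_get_alignment(buft); // required alignment
    size_t max_sz = ggml_backend_buft_get_max_size(buft);  // "can I allocate 2 GB here?"
    (void) align; (void) max_sz;

    // Level 2: allocate a buffer; tensor data moves via ggml_backend_tensor_set/get.
    ggml_backend_buffer_t buf = ggml_backend_buft_alloc_buffer(buft, 64u*1024*1024);

    // Level 3 would be ggml_backend_graph_compute(backend, graph).
    ggml_backend_buffer_free(buf);
}
```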
Backend Registration and Discovery
How does GGML know which backends are available at runtime? Through compile-time registration in ggml/src/ggml-backend-reg.cpp. The ggml_backend_registry constructor uses #ifdef guards to register each backend:
```cpp
ggml_backend_registry() {
#ifdef GGML_USE_CUDA
    register_backend(ggml_backend_cuda_reg());
#endif
#ifdef GGML_USE_METAL
    register_backend(ggml_backend_metal_reg());
#endif
#ifdef GGML_USE_VULKAN
    register_backend(ggml_backend_vk_reg());
#endif
    // ...
#ifdef GGML_USE_CPU
    register_backend(ggml_backend_cpu_reg());
#endif
}
```
The ordering matters: GPU backends are registered first, CPU last. The backend scheduler (covered next) assigns operations to backends in registration order by default, preferring GPU over CPU.
```mermaid
flowchart TD
    CMAKE["CMake -DGGML_CUDA=ON"] --> DEFINE["#define GGML_USE_CUDA"]
    DEFINE --> INCLUDE["#include ggml-cuda.h"]
    INCLUDE --> REG["register_backend(ggml_backend_cuda_reg())"]
    REG --> DEV["Enumerate CUDA devices"]
    DEV --> READY["Backend ready for graph compute"]
```
Each register_backend() call discovers all devices exposed by that backend (e.g., two NVIDIA GPUs) and adds them to the global device list. GGML also supports dynamic backend loading from shared libraries, enabling out-of-tree backends without recompiling the core library.
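You can observe the result of registration at runtime by walking the global device list. A small sketch using the device-registry API from `ggml-backend.h`:

```c
#include <stdio.h>
#include "ggml-backend.h"

// Print every device discovered by the registered backends,
// in registration order (GPUs first, CPU last).
int main(void) {
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        printf("%zu: %s (%s)\n", i,
               ggml_backend_dev_name(dev),
               ggml_backend_dev_description(dev));
    }
    return 0;
}
```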
Tip: The CPU backend is always registered last (its `#ifdef GGML_USE_CPU` block comes at the end of the constructor). This ensures that GPU backends get priority in the scheduler. If no GPU backend is compiled in, the CPU handles everything.
The Backend Scheduler
A single computation graph may span multiple backends—for instance, some layers on GPU and embedding/output on CPU. The backend scheduler (ggml_backend_sched) handles this automatically.
The scheduler's job is threefold:
1. Graph partitioning — For each operation in the graph, decide which backend should execute it. The default strategy assigns ops to the backend that owns the operation's primary input tensor, with fallback to the first registered backend that supports the operation.
2. Buffer allocation — Allocate the right backend buffer for each tensor based on which backend will operate on it.
3. Data transfer insertion — When an operation's inputs live on different backends (e.g., a GPU op needs a CPU tensor), automatically insert copy operations to move data to the correct device.
```mermaid
sequenceDiagram
    participant Ctx as llama_context
    participant Sched as Backend Scheduler
    participant GPU as CUDA Backend
    participant CPU as CPU Backend
    Ctx->>Sched: alloc_graph(gf)
    Sched->>Sched: Assign ops to backends
    Sched->>Sched: Insert cross-device copies
    Sched->>GPU: Allocate GPU tensors
    Sched->>CPU: Allocate CPU tensors
    Ctx->>Sched: graph_compute(gf)
    Sched->>GPU: Execute GPU subgraph
    GPU->>CPU: Transfer intermediate results
    Sched->>CPU: Execute CPU subgraph
    Sched-->>Ctx: Done
```
The n_gpu_layers parameter that users set controls this partitioning at a higher level: libllama's model loading code assigns layer tensors to GPU or CPU buffer types based on this setting, and the scheduler respects those assignments.
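Putting the pieces together, minimal scheduler usage looks roughly like this. This is a sketch: `ggml_backend_sched_new` has gained parameters over time (e.g. an op-offload flag), so check `ggml-backend.h` for the exact signature in your tree, and `gf` is assumed to be a graph built as in the earlier example.

```c
#include "ggml-backend.h"

// Sketch: one GPU backend plus the CPU fallback, managed by the scheduler.
void run_split_graph(ggml_backend_t gpu, ggml_backend_t cpu, struct ggml_cgraph * gf) {
    ggml_backend_t backends[2] = { gpu, cpu };   // order = priority: GPU first

    ggml_backend_sched_t sched = ggml_backend_sched_new(
        backends, /*bufts=*/NULL, /*n_backends=*/2,
        /*graph_size=*/GGML_DEFAULT_GRAPH_SIZE, /*parallel=*/false);

    ggml_backend_sched_alloc_graph(sched, gf);    // partition graph + allocate buffers
    // ... fill input tensors with ggml_backend_tensor_set(...) ...
    ggml_backend_sched_graph_compute(sched, gf);  // execute, inserting copies as needed

    ggml_backend_sched_free(sched);
}
```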
Quantization Types
GGML's quantization type system is one of its defining features. Rather than storing weights as 32-bit floats, GGML supports block-quantized formats that drastically reduce memory usage while maintaining acceptable precision.
The core types fall into several families:
| Family | Types | Block Size | Bits/Weight |
|---|---|---|---|
| IEEE standard | F32, F16, BF16 | 1 | 32, 16, 16 |
| Uniform quantization | Q8_0, Q4_0, Q4_1, Q5_0, Q5_1 | 32 | 8.5, 4.5, 5.0, 5.5, 6.0 |
| K-quants | Q2_K, Q3_K, Q4_K, Q5_K, Q6_K | 256 | 2.6–6.6 |
| I-quants | IQ1_S, IQ2_XXS, IQ3_S, IQ4_NL | varies | 1.6–4.5 |
| Ternary | TQ1_0, TQ2_0 | 256 | 1.7, 2.1 |
K-quants use super-blocks of 256 elements with per-block scales and mins, achieving better precision-to-size ratios than uniform quantization. I-quants use importance matrices and lookup tables for even more aggressive compression.
Each quantization type is registered with a type traits structure in ggml.h that specifies block size, type size, and pointers to quantize/dequantize functions. The actual quantization/dequantization kernels are implemented in ggml/src/ggml-quants.c for CPU and in backend-specific files for GPUs.
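To connect the table to the type-traits machinery: bits per weight follows directly from the block layout. Q4_0, for example, packs 32 weights into one fp16 scale plus 16 bytes of 4-bit values, 18 bytes in total, which is 18 * 8 / 32 = 4.5 bits per weight. A small sketch using the public trait queries:

```c
#include <stdio.h>
#include "ggml.h"

// Derive the "bits per weight" column from GGML's type-trait queries.
// For Q4_0: 18 bytes per 32-element block -> 18*8/32 = 4.5 bits/weight.
int main(void) {
    const enum ggml_type t = GGML_TYPE_Q4_0;
    size_t  bytes_per_block = ggml_type_size(t); // size of one quantized block
    int64_t elems_per_block = ggml_blck_size(t); // elements packed in a block
    printf("%s: %.2f bits/weight\n",
           ggml_type_name(t),
           8.0 * (double) bytes_per_block / (double) elems_per_block);
    return 0;
}
```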
```mermaid
flowchart LR
    F16["F16 Model\n14 GB"] --> Q8["Q8_0\n7.2 GB"]
    Q8 --> Q4K["Q4_K_M\n4.1 GB"]
    Q4K --> Q2K["Q2_K\n2.7 GB"]
    Q2K --> IQ2["IQ2_XXS\n2.1 GB"]
    style F16 fill:#e1f5fe
    style Q8 fill:#b3e5fc
    style Q4K fill:#81d4fa
    style Q2K fill:#4fc3f7
    style IQ2 fill:#29b6f6
```
The GGUF File Format
GGUF (GGML Universal File format) is the self-describing binary format that stores both model metadata and tensor data. Its structure is documented in the header comment of ggml/include/gguf.h:
```mermaid
flowchart TD
    subgraph "GGUF File Layout"
        MAGIC["Magic: 'GGUF' (4 bytes)"]
        VERSION["Version: 3 (uint32)"]
        NT["Tensor count (int64)"]
        NKV["KV pair count (int64)"]
        KV["KV Metadata Pairs\n- architecture, n_layer, n_embd...\n- tokenizer data\n- chat template"]
        TD["Tensor Descriptors\n- name, dimensions, type, offset"]
        PAD["Alignment padding"]
        DATA["Tensor Data Blob\n(aligned to GGUF_DEFAULT_ALIGNMENT)"]
    end
    MAGIC --> VERSION --> NT --> NKV --> KV --> TD --> PAD --> DATA
```
Key design decisions in GGUF:
- Self-describing: Every piece of metadata needed to load and use a model is in the file. The `general.architecture` key identifies the model type. Hyperparameters like `llama.block_count` and `llama.embedding_length` provide shape information. The tokenizer vocabulary and merge rules are embedded as KV arrays.
- Aligned data: Tensor data is aligned to 32 bytes by default (configurable via the `general.alignment` key), enabling efficient mmap-based loading where tensors can be accessed directly from the memory-mapped file without copying.
- Type system: KV values can be uint8, int8, uint16, int16, uint32, int32, float32, bool, string, arrays of any of these, uint64, int64, or float64. The `gguf_type` enum defines all 13 types.
- Version 3: The current version supports 64-bit tensor counts and offsets, enabling files larger than 4GB. Magic bytes are `"GGUF"` (0x46554747 in little-endian).
Tip: The `gguf-py` Python library in the repository provides a convenient way to inspect GGUF files: `python -m gguf.scripts.gguf_dump model.gguf` prints all metadata and tensor shapes.
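For C programs, the same metadata is reachable through the `gguf` API that ships with GGML. A minimal sketch (error handling and most keys omitted):

```c
#include <stdio.h>
#include "gguf.h"

// Open a GGUF file, read one metadata key, and count its tensors.
int main(void) {
    struct gguf_init_params params = { .no_alloc = true, .ctx = NULL };
    struct gguf_context * ctx = gguf_init_from_file("model.gguf", params);
    if (ctx == NULL) {
        return 1;
    }

    int64_t key = gguf_find_key(ctx, "general.architecture");
    if (key >= 0) {
        printf("architecture: %s\n", gguf_get_val_str(ctx, key));
    }
    printf("tensors: %lld, kv pairs: %lld\n",
           (long long) gguf_get_n_tensors(ctx),
           (long long) gguf_get_n_kv(ctx));

    gguf_free(ctx);
    return 0;
}
```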
What's Next
We've now covered GGML from tensor operations to hardware backends to file formats. But the story of a decode call is incomplete. In the next article, we'll trace the complete inference pipeline: how llama_decode() manages batching and micro-batching, how the polymorphic memory system handles KV caches for transformers and state buffers for recurrent models, and how the system recovers from failures without corrupting state.