llama.cpp Architecture: A Map of the Codebase
Prerequisites
- Basic C/C++ knowledge (pointers, classes, virtual dispatch)
- General understanding of what Large Language Models are and how they generate text
llama.cpp is one of the most consequential open-source projects in the LLM ecosystem. It lets you run large language models—quantized down to 2–4 bits—on consumer hardware, from a MacBook to a Raspberry Pi. But at 55,000+ lines of core library code supporting 120+ model architectures, diving into the source can feel like exploring an unfamiliar city without a map.
This article is your map. We'll establish the mental model you need to navigate the codebase efficiently: the two-library stack, the directory layout, the C API facade pattern, the three critical types that govern the entire inference lifecycle, and a step-by-step trace of what happens when you ask the library to generate a token.
The Two-Library Stack
llama.cpp is not a monolithic codebase. It's structured as two distinct libraries, plus a utilities layer on top:
```mermaid
graph TD
    subgraph "User-facing tools"
        CLI["tools/cli"]
        SRV["tools/server"]
        QUANT["tools/quantize"]
    end
    subgraph "Shared utilities"
        COMMON["common/"]
    end
    subgraph "Core libraries"
        LLAMA["libllama (src/)"]
        GGML["GGML (ggml/)"]
    end
    CLI --> COMMON
    SRV --> COMMON
    QUANT --> COMMON
    COMMON --> LLAMA
    LLAMA --> GGML
```
GGML (ggml/) is a generic tensor computation library. It knows nothing about language models—it provides tensor types, a lazy computation graph, quantized data formats, and a pluggable backend system for CPUs, GPUs, and accelerators. Think of it as a minimal, inference-optimized alternative to PyTorch's tensor layer.
libllama (src/) is the LLM-specific library. It understands model architectures, tokenization, KV caches, and sampling strategies. It uses GGML to build and execute computation graphs for 120+ model architectures, from LLaMA to Mamba to RWKV.
common/ is a utilities layer consumed only by the command-line tools (server, CLI, quantize, etc.). It handles argument parsing, chat template application, and high-level sampling wrappers. It is not part of the public library API.
This separation matters. GGML is independently usable—whisper.cpp and other projects share it. And the clean boundary between libllama and common/ means that if you're embedding llama.cpp in your own application, you only link against libllama and ggml, ignoring common/ entirely.
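To make the embedding scenario concrete, here is a minimal consumer build sketch. This is an assumption-laden illustration: it presumes llama.cpp was built and installed so `find_package` can locate its CMake package, and target names can differ across llama.cpp versions, so check the version you install.

```cmake
# Hypothetical CMakeLists.txt for an application embedding llama.cpp.
cmake_minimum_required(VERSION 3.14)
project(my_app C CXX)

# Assumes an installed llama.cpp providing a CMake package config.
find_package(llama REQUIRED)

add_executable(my_app main.cpp)

# Linking the llama target pulls in ggml transitively.
# Note what is absent: nothing from common/ is ever linked.
target_link_libraries(my_app PRIVATE llama)
```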
Tip: The dependency direction is strictly one-way: tools → common → libllama → GGML. If you see an #include that violates this, it's a bug.
Directory Structure Tour
Here's a map of every top-level directory and its purpose:
| Directory | Purpose |
|---|---|
| include/ | Public C API headers (llama.h, llama-cpp.h) |
| src/ | Core libllama implementation (C++ internals) |
| src/models/ | One .cpp file per model architecture (~110 files) |
| ggml/ | GGML tensor library (include/, src/, backends/) |
| common/ | Shared utilities for CLI tools |
| tools/ | User-facing binaries (server, cli, quantize, bench, etc.) |
| gguf-py/ | Python library for reading/writing GGUF files |
| tests/ | Unit and integration tests |
| convert_hf_to_gguf.py | Python model conversion from HuggingFace format |
The src/CMakeLists.txt file serves as the definitive file index for libllama. Every source file in the library is listed there—from the 20+ core llama-*.cpp modules to all 110+ model architecture files under models/.
```mermaid
flowchart LR
    subgraph "src/CMakeLists.txt lists all files"
        CORE["Core modules\nllama-context.cpp\nllama-model.cpp\nllama-graph.cpp\n...20+ files"]
        MODELS["Model architectures\nmodels/llama.cpp\nmodels/qwen2.cpp\nmodels/mamba.cpp\n...110+ files"]
    end
    CORE -.-> MODELS
```
The naming convention in src/ is methodical: every core module follows the pattern llama-{module}.h / llama-{module}.cpp. For example, llama-batch.h defines batch handling, llama-kv-cache.h defines the KV cache, and llama-sampler.h defines the sampling chain. This makes grepping for functionality straightforward.
The C API Facade Pattern
llama.cpp exposes a pure C API through include/llama.h. The header declares opaque struct types and free functions following a consistent llama_* naming convention:
```c
struct llama_model;   // opaque
struct llama_context; // opaque
struct llama_sampler; // opaque
```
The implementation lives in src/llama.cpp, which acts as a thin delegation layer. Each public llama_* function simply forwards to the corresponding method on the internal C++ class. For instance, llama_decode() delegates to llama_context::decode().
```mermaid
classDiagram
    class llama_h["llama.h (C API)"] {
        <<header>>
        llama_model_load_from_file()
        llama_init_from_model()
        llama_decode()
        llama_sampler_sample()
    }
    class llama_cpp["llama.cpp (Facade)"] {
        <<delegation>>
        Forwards to C++ classes
    }
    class llama_model["llama_model (C++)"] {
        load_hparams()
        load_tensors()
        build_graph()
        create_memory()
    }
    class llama_context_impl["llama_context (C++)"] {
        decode()
        encode()
        process_ubatch()
    }
    llama_h --> llama_cpp : "declares"
    llama_cpp --> llama_model : "delegates"
    llama_cpp --> llama_context_impl : "delegates"
```
Why a C API? Three reasons. First, C has a stable ABI—you can upgrade libllama without recompiling consumers, and you can bind to it from any language (Python, Rust, Go, C#). Second, it enforces a clear boundary: all the C++ complexity (templates, virtual dispatch, RAII) stays internal. Third, it prevents header leakage—users only need llama.h and the GGML headers, not the 20+ internal llama-*.h files.
Tip: If you're looking at a public function like llama_decode(), find its implementation in src/llama.cpp first—it will point you to the actual C++ class and method that does the work.
The Three Critical Types
Everything in llama.cpp revolves around three types, declared in include/llama.h:
llama_model represents a loaded model: its architecture metadata (hyperparameters), vocabulary, and weight tensors allocated in backend buffers. A model is immutable after loading and can be shared across multiple contexts. Its internal structure is defined in src/llama-model.h, storing the architecture enum, hyperparameters, per-layer tensor pointers, and the list of devices used for offloading.
llama_context represents an inference session: it owns the KV cache (or other memory types), compute buffers, the backend scheduler, and runtime parameters like batch size and thread count. You create a context from a model, and it's where all mutable state lives. Its definition is in src/llama-context.h.
llama_sampler represents a token selection pipeline. Samplers are composable—you chain together temperature scaling, top-k, top-p, repetition penalty, and other strategies. The chain is defined in src/llama-sampler.h.
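As a sketch of how such a chain is assembled through the public API — the llama_sampler_init_* constructors below exist in recent versions of llama.h, but exact signatures and defaults vary across versions, so verify against your header:

```c
// Hedged sketch: a top-k -> top-p -> temperature -> distribution chain.
struct llama_sampler * chain =
    llama_sampler_chain_init(llama_sampler_chain_default_params());

llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, 1));
llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

// Later, after a decode: llama_token t = llama_sampler_sample(chain, ctx, -1);
llama_sampler_free(chain);
```

Each stage filters or reweights the candidate token list before the final distribution sampler picks a token, which is why order in the chain matters.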
```mermaid
classDiagram
    class llama_model {
        +arch: llm_arch
        +hparams: llama_hparams
        +vocab: llama_vocab
        +layers: vector~llama_layer~
        +tok_embd: ggml_tensor*
        +output: ggml_tensor*
        +build_graph()
        +create_memory()
    }
    class llama_context {
        +model: const llama_model&
        +memory: llama_memory_i*
        +sched: ggml_backend_sched_t
        +decode()
        +process_ubatch()
    }
    class llama_sampler_chain {
        +samplers: vector~info~
        +cur: vector~llama_token_data~
    }
    llama_model "1" <-- "*" llama_context : "references (immutable)"
    llama_context "1" --> "1" llama_sampler_chain : "uses"
```
The ownership relationship is crucial: a model outlives all its contexts. Multiple contexts can share one model (useful for parallel inference with separate KV caches). The context borrows a const reference to the model—it never modifies weights.
Inference Lifecycle Walkthrough
Here's the complete sequence for generating a single token, from loading to output:
```mermaid
sequenceDiagram
    participant App as Application
    participant API as llama.h
    participant Model as llama_model
    participant Ctx as llama_context
    participant GGML as GGML Backend
    App->>API: llama_model_load_from_file()
    API->>Model: load_hparams(), load_tensors()
    Model-->>App: llama_model*
    App->>API: llama_init_from_model()
    API->>Ctx: construct(model, params)
    Ctx->>Model: create_memory() → KV cache
    Ctx-->>App: llama_context*
    App->>API: llama_tokenize("Hello")
    API-->>App: [token_ids]
    App->>API: llama_decode(batch)
    API->>Ctx: decode(batch)
    Ctx->>Ctx: balloc->init(batch)
    Ctx->>Ctx: memory->init_batch()
    loop For each ubatch
        Ctx->>Model: build_graph(params)
        Model->>GGML: ggml_backend_sched_alloc_graph()
        Ctx->>GGML: set_inputs() + graph_compute()
        GGML-->>Ctx: logits
    end
    Ctx-->>App: logits ready
    App->>API: llama_sampler_sample()
    API-->>App: next_token
    App->>API: llama_detokenize()
    API-->>App: "world"
```
Step 1: Load the model. llama_model_load_from_file() reads a GGUF file, parses the architecture from metadata, loads hyperparameters, maps tensor names to the internal llama_layer structure, and allocates weight data into backend buffers (CPU, GPU, or both).
Step 2: Create a context. llama_init_from_model() constructs a llama_context. This calls create_memory() on the model to select the right memory implementation—KV cache for transformers, recurrent state for Mamba/RWKV, or a hybrid. It also reserves compute buffers using a worst-case graph estimate.
Step 3: Tokenize. llama_tokenize() converts text to token IDs using the vocabulary embedded in the model. The vocab type (BPE, SentencePiece, WordPiece, etc.) determines the algorithm.
Step 4: Decode. llama_decode() is where the real work happens. The batch is sanitized, split into micro-batches ("ubatches"), and each ubatch is processed through process_ubatch(): build the computation graph → allocate backend memory → set input tensors → execute → extract logits.
Step 5: Sample. llama_sampler_sample() takes the logits from the last decode call and runs them through the sampler chain to select a token.
Step 6: Detokenize. Convert the selected token back to text.
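The six steps above can be sketched end to end through the public C API. Treat this as a map rather than a drop-in program: the function names come from a recent llama.h, but signatures, defaults, and error handling (omitted here) change between versions.

```cpp
// Hedged sketch of steps 1–6; verify names against your llama.h.
#include "llama.h"
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    // 1. Load the model: GGUF file -> immutable llama_model
    llama_model * model =
        llama_model_load_from_file("model.gguf", llama_model_default_params());
    const llama_vocab * vocab = llama_model_get_vocab(model);

    // 2. Create an inference session (owns the KV cache and compute buffers)
    llama_context * ctx = llama_init_from_model(model, llama_context_default_params());

    // 3. Tokenize the prompt using the model's embedded vocabulary
    const char * prompt = "Hello";
    std::vector<llama_token> toks(64);
    int n = llama_tokenize(vocab, prompt, (int) strlen(prompt),
                           toks.data(), (int) toks.size(),
                           /*add_special=*/true, /*parse_special=*/false);
    toks.resize(n);

    // 4. Decode: builds and executes the GGML graph, producing logits
    llama_decode(ctx, llama_batch_get_one(toks.data(), (int) toks.size()));

    // 5. Sample the next token from the logits at the last position (-1)
    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());
    llama_token next = llama_sampler_sample(smpl, ctx, -1);

    // 6. Detokenize the sampled token back to text
    char piece[128];
    int len = llama_token_to_piece(vocab, next, piece, sizeof(piece), 0, false);
    printf("%.*s\n", len, piece);

    llama_sampler_free(smpl);
    llama_free(ctx);
    llama_model_free(model);
}
```

To generate more than one token, steps 4–6 repeat in a loop, feeding each sampled token back through llama_decode as a single-token batch.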
Navigation Guide
When you need to find something in the codebase, these strategies will save you time:
Start with src/CMakeLists.txt. It's a complete index of every file in the library. Searching this file is often faster than recursive grep.
Use the llama-*.h naming pattern. Want to understand the batch system? Open llama-batch.h. KV cache? llama-kv-cache.h. The pattern is consistent across all 20+ core modules.
Finding a model architecture. Every model has a file under src/models/. The filename matches the GGUF architecture name: LLaMA → models/llama.cpp, Qwen2 → models/qwen2.cpp, Mamba → models/mamba.cpp. If you don't know the filename, search src/llama-arch.cpp for the LLM_ARCH_NAMES map, which pairs each architecture enum value with its GGUF string identifier.
Tracing a public API call. Start in src/llama.cpp, find the function, see which C++ class it delegates to, then follow the trail into the specific module file.
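These strategies translate into a few one-liners from the repository root — illustrative commands, assuming a checkout of the current source tree:

```shell
# Naming pattern: the batch system lives in exactly the file you'd guess
ls src/llama-batch.*

# Architecture lookup: map a GGUF string to its enum and model file
grep -n "qwen2" src/llama-arch.cpp
ls src/models/qwen2.cpp

# API tracing: find the declaration and the delegation stub
grep -n "llama_decode" include/llama.h src/llama.cpp
```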
| What you're looking for | Where to look |
|---|---|
| Public API surface | include/llama.h |
| API delegation layer | src/llama.cpp |
| Model loading / architecture dispatch | src/llama-model.cpp |
| Context construction / decode loop | src/llama-context.cpp |
| Graph building toolkit | src/llama-graph.h |
| Specific model implementation | src/models/{name}.cpp |
| Architecture enum / GGUF key names | src/llama-arch.h, src/llama-arch.cpp |
| KV cache internals | src/llama-kv-cache.h, src/llama-kv-cache.cpp |
| GGML tensor operations | ggml/include/ggml.h |
| Backend system | ggml/src/ggml-backend-impl.h |
Tip: The models/llama.cpp file is the reference implementation that most other model architectures follow. If you're trying to understand how any model builder works, read this file first.
What's Next
Now that you have a mental model of the codebase structure, the next article dives into the heart of llama.cpp's design: how model architectures are translated into GGML computation graphs. We'll walk through the llm_graph_context toolkit, trace the LLaMA graph builder line by line, and see how 120+ architectures—including non-transformer models like Mamba and RWKV—fit into one unified framework.