How llama.cpp Turns Model Weights into Computation
Prerequisites
- Article 1: architecture-overview-and-navigation
- Understanding of the transformer forward pass (self-attention, FFN, layer normalization, KV caching, RoPE)
In Article 1, we saw that model.build_graph() sits at the center of every decode call. But what does "building a graph" actually look like for 120+ model architectures? The answer lies in one of llama.cpp's most elegant design patterns: a toolkit of reusable building blocks that model authors compose into forward passes, much like assembling LEGO from a shared set of bricks.
This article explores that system in depth. We'll examine the llm_graph_context base class and its toolkit methods, walk through a concrete model builder (LLaMA), understand the dispatch mechanism that routes architectures to builders, see how graph reuse avoids redundant work, and discover how non-transformer architectures like Mamba and RWKV fit into the same framework.
The Graph Context Toolkit
Every model builder in llama.cpp inherits from llm_graph_context, defined in src/llama-graph.h. This base class provides two things: a rich set of pre-initialized member variables (from hyperparameters, context params, and the current ubatch), and a toolkit of builder methods that handle the repetitive mechanics of constructing computation graph fragments.
The key builder methods are:
| Method | Purpose |
|---|---|
| `build_inp_embd()` | Create token embedding lookup input |
| `build_inp_pos()` | Create position input tensor |
| `build_norm()` | Apply LayerNorm or RMSNorm |
| `build_ffn()` | Build a feed-forward network (SiLU/GELU/ReLU with gating) |
| `build_moe_ffn()` | Build a Mixture-of-Experts FFN |
| `build_attn()` | Build attention with KV cache read/write (5+ overloads) |
| `build_lora_mm()` | Matrix multiply with optional LoRA and per-tensor scaling |
| `build_cvec()` | Apply control vectors for steering |
| `build_inp_out_ids()` | Output filtering (only compute logits for requested tokens) |
These methods are defined in src/llama-graph.h. Notice that build_attn() alone has five overloads—one for each attention style: no-cache (BERT), KV cache, K-only cache, interleaved sliding window (iSWA), and cross-attention.
```mermaid
classDiagram
    class llm_graph_context {
        +arch: llm_arch
        +hparams: llama_hparams&
        +cparams: llama_cparams&
        +ubatch: llama_ubatch&
        +n_embd, n_layer, n_head...
        +ctx0: ggml_context*
        +gf: ggml_cgraph*
        +res: llm_graph_result*
        +build_inp_embd()
        +build_inp_pos()
        +build_norm()
        +build_ffn()
        +build_moe_ffn()
        +build_attn() ×5
        +build_lora_mm()
        +build_cvec()
    }
    class llm_build_llama {
        Constructor builds full graph
    }
    class llm_build_qwen2 {
        Constructor builds full graph
    }
    class llm_build_bert {
        Constructor builds full graph
    }
    llm_graph_context <|-- llm_build_llama
    llm_graph_context <|-- llm_build_qwen2
    llm_graph_context <|-- llm_build_bert
```
The design philosophy is "convention over configuration." Model builders don't need to manually manage KV cache slot indices, attention mask construction, or LoRA application—the toolkit handles all of that internally. A model author focuses on the topology of the forward pass: which norms, which projections, which activation functions, in what order.
A Concrete Model: LLaMA Graph Builder
Let's trace the reference implementation in src/models/llama.cpp. The entire forward pass is built in the constructor of llm_build_llama:
```mermaid
flowchart TD
    EMB["build_inp_embd(tok_embd)"] --> POS["build_inp_pos()"]
    POS --> ATTN_INP["build_attn_inp_kv()"]
    ATTN_INP --> LOOP_START["For each layer il = 0..n_layer"]
    LOOP_START --> NORM1["build_norm(attn_norm, RMS)"]
    NORM1 --> QKV["Q/K/V projections via build_lora_mm"]
    QKV --> ROPE["ggml_rope_ext (Q and K)"]
    ROPE --> ATTN["build_attn(inp_attn, wo, Q, K, V)"]
    ATTN --> ADD1["residual add"]
    ADD1 --> NORM2["build_norm(ffn_norm, RMS)"]
    NORM2 --> FFN_CHECK{MoE layer?}
    FFN_CHECK -->|No| FFN["build_ffn(SiLU, PAR)"]
    FFN_CHECK -->|Yes| MOE["build_moe_ffn(...)"]
    FFN --> ADD2["residual add + build_cvec"]
    MOE --> ADD2
    ADD2 --> NEXT["Next layer"]
    NEXT --> FINAL_NORM["build_norm(output_norm, RMS)"]
    FINAL_NORM --> LM_HEAD["build_lora_mm(output, cur)"]
    LM_HEAD --> DONE["ggml_build_forward_expand(gf, cur)"]
```
The builder starts by creating the embedding input and position tensor (lines 13–16). Then it sets up the attention input, which constructs the KV cache indices and mask tensors. The template parameter `<embed>` lets the same code serve both generative and embedding modes—a clean use of C++ templates.
Inside the layer loop (lines 31–153), each iteration follows the standard transformer pattern:
- RMSNorm the input
- Q/K/V projections via `build_lora_mm()` (handles LoRA transparently)
- RoPE rotation on Q and K
- Attention through `build_attn()`, which handles the KV cache write, mask application, and flash attention
- Residual connection
- FFN or MoE FFN depending on whether `ffn_gate_inp` exists (lines 107–143)
- Second residual connection and control vector application
After all layers, the output is normalized and projected to vocabulary size through the language model head (lines 156–171).
Tip: The `cb(cur, "attn_norm", il)` calls throughout are not just debug labels—they trigger the graph callback that the backend scheduler uses for op placement decisions and offloading.
Architecture Dispatch
When llama_context::process_ubatch() calls model.build_graph(), it reaches a massive switch statement in src/llama-model.cpp:
```cpp
ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
    std::unique_ptr<llm_graph_context> llm;

    switch (arch) {
        case LLM_ARCH_LLAMA:
            llm = std::make_unique<llm_build_llama<false>>(*this, params);
            break;
        case LLM_ARCH_QWEN2:
            llm = std::make_unique<llm_build_qwen2>(*this, params);
            break;
        // ... 120+ cases
    }

    // ... the finished graph is then extracted from llm and returned
}
```
The arch field is an llm_arch enum value, set during model loading by reading the general.architecture key from the GGUF file. The mapping from GGUF string names to enum values lives in src/llama-arch.cpp as LLM_ARCH_NAMES:
```cpp
static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
    { LLM_ARCH_LLAMA, "llama" },
    { LLM_ARCH_QWEN2, "qwen2" },
    { LLM_ARCH_MAMBA, "mamba" },
    // ...
};
```
The llm_arch enum currently defines 125 architecture identifiers. This is one of the most actively growing parts of the codebase—new model architectures are added regularly.
```mermaid
flowchart LR
    GGUF["GGUF file\ngeneral.architecture = 'llama'"] --> PARSE["llm_arch_from_string()"]
    PARSE --> ENUM["LLM_ARCH_LLAMA"]
    ENUM --> SWITCH["build_graph() switch"]
    SWITCH --> BUILDER["llm_build_llama<false>"]
    BUILDER --> GRAPH["ggml_cgraph"]
```
Graph Reuse Optimization
Building a computation graph is not free—it involves allocating GGML tensors, wiring up operations, and running the backend scheduler's allocation pass. For autoregressive generation, consecutive decode calls often produce identical graph topologies (same batch size, same sequence structure, same model configuration). llama.cpp exploits this with a graph reuse optimization.
In process_ubatch(), the system checks if the previous graph can be reused:
```cpp
if (!graph_reuse_disable && res->can_reuse(gparams)) {
    n_reused++;
} else {
    res->reset();
    gf = model.build_graph(gparams);
    ggml_backend_sched_alloc_graph(sched.get(), gf);
}

res->set_inputs(&ubatch);
graph_compute(res->get_gf(), ...);
```
The can_reuse() method on llm_graph_result checks whether the new llm_graph_params would produce the same graph topology as before. The reuse conditions are defined in allow_reuse(): the ubatch dimensions must match (n_tokens, n_seqs, n_seq_tokens), the same input mode (tokens vs embeddings), the same output count, and the same configuration flags.
When reuse succeeds, only set_inputs() is called to update the tensor data—the graph structure, backend allocations, and tensor memory all remain unchanged. Each llm_graph_input_i subclass also implements its own can_reuse() check to verify that input-specific state (like KV cache indices) hasn't changed in a way that would invalidate the graph.
Tip: Set the environment variable `LLAMA_GRAPH_INPUT_DEBUG=1` to get debug output showing when graph reuse succeeds or fails, helping identify performance bottlenecks.
The Input System
Computation graphs need data—token IDs, positions, attention masks, KV cache indices. These are managed through the llm_graph_input_i hierarchy, defined in src/llama-graph.h:
```cpp
class llm_graph_input_i {
public:
    virtual void set_input(const llama_ubatch * ubatch) = 0;
    virtual bool can_reuse(const llm_graph_params & params) { return false; }
};
```
Each concrete input class creates one or more GGML tensors and knows how to populate them from a ubatch. The major input types include:
- `llm_graph_input_embd` — token IDs or raw embeddings
- `llm_graph_input_pos` — position indices (supports M-RoPE with multiple position dimensions)
- `llm_graph_input_attn_kv` — KV cache slot indices and attention mask for standard transformers
- `llm_graph_input_attn_kv_iswa` — same, but for interleaved sliding window attention
- `llm_graph_input_attn_no_cache` — full attention mask for BERT-style models
- `llm_graph_input_rs` — recurrent state copy indices for Mamba/RWKV
- `llm_graph_input_mem_hybrid` — combined attention + recurrent inputs for hybrid models like Jamba
```mermaid
classDiagram
    class llm_graph_input_i {
        <<interface>>
        +set_input(ubatch)*
        +can_reuse(params)*
    }
    class llm_graph_input_embd {
        tokens: ggml_tensor*
        embd: ggml_tensor*
    }
    class llm_graph_input_attn_kv {
        self_k_idxs: ggml_tensor*
        self_v_idxs: ggml_tensor*
        self_kq_mask: ggml_tensor*
    }
    class llm_graph_input_rs {
        s_copy: ggml_tensor*
    }
    class llm_graph_input_mem_hybrid {
        inp_attn: llm_graph_input_attn_kv
        inp_rs: llm_graph_input_rs
    }
    llm_graph_input_i <|-- llm_graph_input_embd
    llm_graph_input_i <|-- llm_graph_input_attn_kv
    llm_graph_input_i <|-- llm_graph_input_rs
    llm_graph_input_i <|-- llm_graph_input_mem_hybrid
```
This design decouples the graph topology from the input data lifecycle. A model builder calls build_attn_inp_kv() to create the attention input object, which both creates the placeholder tensors and registers itself to be filled later during set_inputs().
Non-Transformer Architectures
One of llama.cpp's architectural achievements is fitting non-transformer models into the same framework. The key is specialized base classes declared in src/models/models.h:
llm_build_mamba_base provides build_mamba_layer() and build_mamba2_layer() methods that construct the SSM (state-space model) computation: input projection, convolution, selective scan, and output projection. The Mamba builder looks remarkably similar to a transformer builder—loop over layers, normalize, apply the SSM block, add residual—except it uses build_rs_inp() instead of build_attn_inp_kv() and build_mamba_layer() instead of build_attn().
llm_build_rwkv7_base provides build_rwkv7_time_mix() and build_rwkv7_channel_mix() for the RWKV architecture's linear attention variant.
llm_build_delta_net_base provides both chunked and autoregressive implementations of the Delta Net linear attention mechanism.
Hybrid architectures like Jamba combine both patterns. They use llm_graph_input_mem_hybrid, which wraps both an attention KV input and a recurrent state input, and their per-layer loop checks whether each layer is an attention layer or a recurrent layer.
```mermaid
classDiagram
    class llm_graph_context {
        <<base>>
        build_norm()
        build_ffn()
        build_attn()
    }
    class llm_build_mamba_base {
        <<base>>
        build_mamba_layer()
        build_mamba2_layer()
    }
    class llm_build_rwkv7_base {
        <<base>>
        build_rwkv7_time_mix()
        build_rwkv7_channel_mix()
    }
    class llm_build_delta_net_base {
        <<base>>
        build_delta_net()
    }
    class llm_build_mamba {
        Constructor builds SSM graph
    }
    class llm_build_rwkv7 {
        Constructor builds RWKV graph
    }
    class llm_build_jamba {
        Constructor builds hybrid graph
    }
    llm_graph_context <|-- llm_build_mamba_base
    llm_graph_context <|-- llm_build_rwkv7_base
    llm_graph_context <|-- llm_build_delta_net_base
    llm_build_mamba_base <|-- llm_build_mamba
    llm_build_rwkv7_base <|-- llm_build_rwkv7
    llm_graph_context <|-- llm_build_jamba
```
The memory system (covered in Article 4) mirrors this hierarchy—llama_memory_recurrent handles SSM state, llama_memory_hybrid combines KV cache and recurrent state, and the create_memory() factory selects the right one based on the architecture.
What's Next
We've seen how llama.cpp translates model architectures into GGML computation graphs using composable building blocks. But what exactly is GGML? How does a lazy evaluation tensor library work, and how does it execute the same graph across CPUs, CUDA GPUs, Metal, and Vulkan? The next article dives into GGML's core abstractions, backend vtable system, and the GGUF file format.