Contributing a New Model Architecture to llama.cpp
Prerequisites
- Articles 1-5
- Familiarity with at least one transformer architecture's forward pass
- Basic Python knowledge for the converter step
Over the course of this series, we've built a thorough understanding of llama.cpp's internals: the two-library stack, the graph context toolkit, GGML's backend system, the decode pipeline, and the application layer. Now it's time to put that knowledge to work. This article is a practical guide to adding a new model architecture to llama.cpp, covering every file you need to modify and in what order.
Adding a model architecture requires touching five to six files in a specific sequence. We'll use the existing LLaMA model as the reference implementation throughout, since it's the canonical example that most architectures follow.
The End-to-End Model Addition Path
Here's the complete path, from first touch to merged PR:
flowchart TD
S1["Step 1: Architecture Registration\nllama-arch.h + llama-arch.cpp"] --> S2["Step 2: Python Converter\nconvert_hf_to_gguf.py"]
S2 --> S3["Step 3: Tensor Loading\nllama-model.cpp\n(load_hparams + load_tensors)"]
S3 --> S4["Step 4: Graph Builder\nsrc/models/mymodel.cpp"]
S4 --> S5["Step 5: Registration\nllama-model.cpp\n(build_graph + create_memory)"]
S5 --> S6["Step 6: Build & Test\nCMakeLists.txt + validation"]
| Step | Files Modified | Purpose |
|---|---|---|
| 1 | src/llama-arch.h, src/llama-arch.cpp | Register architecture identifier and tensor names |
| 2 | convert_hf_to_gguf.py | Write Python converter from HuggingFace format |
| 3 | src/llama-model.cpp | Add hyperparameter and tensor loading |
| 4 | src/models/{name}.cpp (new file) | Implement the graph builder |
| 5 | src/llama-model.cpp | Add dispatch entries for build_graph() and create_memory() |
| 6 | src/CMakeLists.txt | Add new source file to build |
Step 1: Architecture Registration
Start by adding your architecture to the llm_arch enum in src/llama-arch.h:
enum llm_arch {
// ... existing entries ...
LLM_ARCH_MY_MODEL,
LLM_ARCH_UNKNOWN, // keep this last
};
Then add the string mapping in src/llama-arch.cpp:
static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
// ... existing entries ...
{ LLM_ARCH_MY_MODEL, "my-model" },
};
The string "my-model" must match the general.architecture value that your Python converter will write into the GGUF file.
You also need to register tensor name mappings in llama-arch.cpp. This is a large map (LLM_TENSOR_NAMES) that tells the loader how to find tensors by GGUF name. For example:
{ LLM_ARCH_MY_MODEL, {
{ LLM_TENSOR_TOKEN_EMBD, "token_embd" },
{ LLM_TENSOR_OUTPUT_NORM, "output_norm" },
{ LLM_TENSOR_OUTPUT, "output" },
{ LLM_TENSOR_ATTN_NORM, "blk.%d.attn_norm" },
{ LLM_TENSOR_ATTN_Q, "blk.%d.attn_q" },
// ... all weight tensors
}},
Tip: Study the tensor names used by the LLaMA architecture entry as a template. Most transformer models use the same set of tensors—the main variation is in naming and which optional tensors (biases, gate projections, MoE weights) are present.
Step 2: Python Converter
The converter convert_hf_to_gguf.py uses a class hierarchy where each model architecture has its own converter class. Add a new class:
@ModelBase.register("MyModelForCausalLM")
class MyModelModel(TextModel):
model_arch = gguf.MODEL_ARCH.MY_MODEL
def set_gguf_parameters(self):
super().set_gguf_parameters()
# Write architecture-specific hyperparameters
self.gguf_writer.add_block_count(self.hparams["num_hidden_layers"])
self.gguf_writer.add_embedding_length(self.hparams["hidden_size"])
self.gguf_writer.add_head_count(self.hparams["num_attention_heads"])
# ... etc
def modify_tensors(self, data_torch, name, bid):
# Map HuggingFace tensor names to GGUF names
# Return list of (new_name, tensor) pairs
return [(self.map_tensor_name(name), data_torch)]
The @ModelBase.register("MyModelForCausalLM") decorator registers the converter based on the architectures field in the HuggingFace config.json. The model_arch attribute must match the enum value you'll add to gguf-py/gguf/constants.py.
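gguf-py ships a tensor-name map that covers common checkpoint layouts, so map_tensor_name(name) is usually enough. If your checkpoint uses novel names, you can map them by hand inside modify_tensors(). The following regex table is a hypothetical example for a LLaMA-like layout; the HF-side names are typical of transformers checkpoints but should be verified against yours, and the GGUF base names match the LLM_TENSOR_NAMES entries from Step 1:

```python
import re

# Hypothetical HF -> GGUF name mapping for a LLaMA-like checkpoint layout.
# Left: HF names as found in model.safetensors; right: GGUF names.
HF_TO_GGUF = [
    (r"model\.embed_tokens\.weight",                     "token_embd.weight"),
    (r"model\.norm\.weight",                             "output_norm.weight"),
    (r"lm_head\.weight",                                 "output.weight"),
    (r"model\.layers\.(\d+)\.input_layernorm\.weight",   r"blk.\1.attn_norm.weight"),
    (r"model\.layers\.(\d+)\.self_attn\.q_proj\.weight", r"blk.\1.attn_q.weight"),
    # ... one entry per remaining weight ...
]

def map_name(hf_name: str) -> str:
    for pattern, replacement in HF_TO_GGUF:
        new_name, n = re.subn(rf"^{pattern}$", replacement, hf_name)
        if n:
            return new_name
    raise ValueError(f"unmapped tensor: {hf_name}")

print(map_name("model.layers.3.self_attn.q_proj.weight"))  # blk.3.attn_q.weight
```

Raising on unmapped names is deliberate: a silently dropped tensor is one of the hardest conversion bugs to track down later.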
You'll also need to add the architecture to the gguf-py constants:
# In gguf-py/gguf/constants.py
class MODEL_ARCH(IntEnum):
# ...
MY_MODEL = auto()
And its tensor name mappings in the same file.
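The gguf-py side mirrors the C++ registration: constants.py keeps a MODEL_TENSORS map from each architecture to the tensors it may contain. A sketch of the entry, trimmed to a plain decoder-only transformer (include only the tensors your model actually has):

```python
# In gguf-py/gguf/constants.py — sketch of the MODEL_TENSORS entry.
MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
    # ... existing entries ...
    MODEL_ARCH.MY_MODEL: [
        MODEL_TENSOR.TOKEN_EMBD,
        MODEL_TENSOR.OUTPUT_NORM,
        MODEL_TENSOR.OUTPUT,
        MODEL_TENSOR.ATTN_NORM,
        MODEL_TENSOR.ATTN_Q,
        MODEL_TENSOR.ATTN_K,
        MODEL_TENSOR.ATTN_V,
        MODEL_TENSOR.ATTN_OUT,
        MODEL_TENSOR.FFN_NORM,
        MODEL_TENSOR.FFN_UP,
        MODEL_TENSOR.FFN_DOWN,
    ],
}
```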
flowchart LR
HF["HuggingFace Checkpoint\nconfig.json + model.safetensors"] --> CONV["convert_hf_to_gguf.py"]
CONV --> ARCH["MyModelModel class"]
ARCH --> PARAMS["set_gguf_parameters()\nWrite hyperparameters"]
ARCH --> TENSORS["modify_tensors()\nRename + transform weights"]
PARAMS --> GGUF["Output: model.gguf"]
TENSORS --> GGUF
Step 3: Hyperparameter and Tensor Loading
Back in C++, you need to teach llama.cpp how to read the GGUF metadata and load the tensors. This happens in src/llama-model.cpp.
Loading hyperparameters — Find the load_hparams() method and add a case for your architecture in its switch statement. Here you read GGUF metadata keys and populate the hparams struct:
case LLM_ARCH_MY_MODEL:
ml.get_key(LLM_KV_BLOCK_COUNT, hparams.n_layer);
ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd);
    ml.get_key_or_arr(LLM_KV_ATTENTION_HEAD_COUNT, hparams.n_head_arr, hparams.n_layer); // head count may vary per layer; n_head() reads from this array
// ... model-specific hyperparameters
break;
Loading tensors — Find the load_tensors() method and add a case that creates ggml_tensor pointers for each weight. The pattern mirrors llama_layer:
case LLM_ARCH_MY_MODEL:
model.tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD), {n_embd, n_vocab});
for (int i = 0; i < n_layer; ++i) {
auto & layer = model.layers[i];
layer.attn_norm = create_tensor(tn(LLM_TENSOR_ATTN_NORM, i), {n_embd});
layer.wq = create_tensor(tn(LLM_TENSOR_ATTN_Q, i), {n_embd, n_embd});
// ... all per-layer tensors
}
break;
Tip: The create_tensor() calls don't allocate data; they register tensor shapes and names for the model loader to fill from the GGUF file. The actual data allocation and device assignment (CPU vs GPU) is handled by the loader based on n_gpu_layers.
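Before wiring up the C++ side, it helps to write down the shapes you expect. Remember that GGML lists dimensions fastest-varying first: a weight that maps an n_embd input to an n_out output has ne = {n_embd, n_out}, which is why tok_embd above is {n_embd, n_vocab}. A small, hypothetical Python checklist along those lines (names follow the GGUF conventions from Step 1; K/V shapes assume no grouped-query attention):

```python
# Hypothetical helper: expected GGML shapes (ne[0], ne[1]) for the core
# tensors of a plain MHA transformer, mirroring the create_tensor() calls.
# GGML convention: ne[0] is the input (fastest-varying) dimension.
def expected_shapes(n_embd: int, n_vocab: int, n_ff: int) -> dict[str, tuple]:
    return {
        "token_embd.weight":       (n_embd, n_vocab),
        "blk.0.attn_q.weight":     (n_embd, n_embd),
        "blk.0.attn_k.weight":     (n_embd, n_embd),   # n_embd_gqa if GQA
        "blk.0.attn_v.weight":     (n_embd, n_embd),   # n_embd_gqa if GQA
        "blk.0.attn_output.weight": (n_embd, n_embd),
        "blk.0.ffn_up.weight":     (n_embd, n_ff),
        "blk.0.ffn_down.weight":   (n_ff, n_embd),
        "output.weight":           (n_embd, n_vocab),
    }

print(expected_shapes(4096, 32000, 11008)["blk.0.ffn_down.weight"])
```

Comparing this table against a gguf-py dump of your converted file catches transposed or mis-mapped weights before you ever touch the debugger.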
Step 4: Writing the Graph Builder
Create a new file src/models/mymodel.cpp. This is where you implement the model's forward pass using the llm_graph_context toolkit methods we covered in Article 2.
Start from the LLaMA reference implementation src/models/llama.cpp and modify it to match your architecture's topology. The general pattern is:
#include "models.h"
struct llm_build_my_model : public llm_graph_context {
llm_build_my_model(const llama_model & model,
const llm_graph_params & params)
: llm_graph_context(params) {
ggml_tensor * cur;
ggml_tensor * inpL = build_inp_embd(model.tok_embd);
ggml_tensor * inp_pos = build_inp_pos();
auto * inp_attn = build_attn_inp_kv();
ggml_tensor * inp_out_ids = build_inp_out_ids();
for (int il = 0; il < n_layer; ++il) {
// 1. Pre-attention norm
cur = build_norm(inpL, model.layers[il].attn_norm,
NULL, LLM_NORM_RMS, il);
// 2. Q/K/V projections + RoPE
// ... (specific to your architecture)
// 3. Attention
cur = build_attn(inp_attn, model.layers[il].wo, NULL,
Qcur, Kcur, Vcur, NULL, NULL, NULL,
kq_scale, il);
// 4. Residual + FFN
cur = ggml_add(ctx0, cur, inpL);
// ... FFN via build_ffn() or build_moe_ffn()
inpL = cur;
}
// Final norm + LM head
cur = build_norm(inpL, model.output_norm, NULL,
LLM_NORM_RMS, -1);
res->t_embd = cur;
cur = build_lora_mm(model.output, cur);
res->t_logits = cur;
ggml_build_forward_expand(gf, cur);
}
};
You must also declare the builder in src/models/models.h:
struct llm_build_my_model : public llm_graph_context {
llm_build_my_model(const llama_model & model,
const llm_graph_params & params);
};
flowchart TD
subgraph "Toolkit methods you compose"
A["build_inp_embd()"]
B["build_inp_pos()"]
C["build_norm(RMS/LayerNorm)"]
D["build_lora_mm(Q/K/V projections)"]
E["ggml_rope_ext(RoPE)"]
F["build_attn(attention + KV cache)"]
G["build_ffn(SiLU/GELU/ReLU)"]
H["build_moe_ffn(MoE)"]
end
A --> C
B --> E
C --> D
D --> E
E --> F
F --> C
C --> G
C --> H
Step 5: Registration and Memory Selection
Two more entries in src/llama-model.cpp:
Add to build_graph() — In the architecture switch statement (around line 8288 at the time of writing):
case LLM_ARCH_MY_MODEL:
llm = std::make_unique<llm_build_my_model>(*this, params);
break;
Add to create_memory() — If your model is a standard transformer, the default case in create_memory() already handles it—it will create a llama_kv_cache. Only add an explicit case if your model needs:
- No cache (BERT-style) → return nullptr
- Recurrent state (Mamba/RWKV) → llama_memory_recurrent
- Hybrid → llama_memory_hybrid
- Sliding window → handled automatically if hparams.swa_type != NONE
Finally, add the new source file to src/CMakeLists.txt:
add_library(llama
# ... existing files ...
models/mymodel.cpp
)
Testing and Contribution Workflow
After building, validate your implementation:
1. Convert a model:
python convert_hf_to_gguf.py /path/to/hf-model --outfile model.gguf
2. Verify tensor shapes: Use gguf-py to dump the file and check that all tensors have correct shapes:
python -m gguf.scripts.gguf_dump model.gguf
3. Run inference:
./build/bin/llama-cli -m model.gguf -p "Hello, world!" -n 32
4. Compare outputs: Run the same prompt through the HuggingFace reference implementation and compare logits/outputs. Small differences (< 0.01) are expected due to floating-point ordering differences; large differences indicate a bug in tensor mapping or graph topology.
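A minimal sketch of that comparison, assuming you have dumped logits from both runs into NumPy arrays (how you dump them, e.g. a HF forward hook on one side and llama.cpp's eval-callback example on the other, is up to you; the function name here is illustrative):

```python
import numpy as np

def compare_logits(ref: np.ndarray, test: np.ndarray, atol: float = 1e-2):
    """Return (max abs diff, whether top-1 tokens agree, pass/fail vs atol)."""
    diff = float(np.max(np.abs(ref - test)))
    same_top1 = bool(np.all(np.argmax(ref, axis=-1) == np.argmax(test, axis=-1)))
    return diff, same_top1, diff < atol

# Synthetic demo: identical logits with tiny float noise should pass.
rng = np.random.default_rng(0)
ref = rng.standard_normal((4, 32000)).astype(np.float32)
test = ref + rng.normal(0, 1e-4, ref.shape).astype(np.float32)
print(compare_logits(ref, test))
```

Top-1 agreement is a useful second signal: a transposed weight often produces logits with plausible magnitudes but completely different argmax tokens.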
5. Run perplexity: For a quantitative validation, compute perplexity on a standard benchmark:
./build/bin/llama-perplexity -m model.gguf -f wiki.test.raw
Compare with the HuggingFace model's perplexity—they should be within 0.1–0.5 points for F16, more for quantized formats.
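Perplexity is just the exponentiated mean negative log-likelihood of the evaluated tokens, so sanity-checking a reported number by hand is straightforward:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns every token probability 1/8 has perplexity exactly 8.
lp = [math.log(1 / 8)] * 100
print(perplexity(lp))
```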
flowchart LR
CONVERT["Convert model\nHF → GGUF"] --> VERIFY["Verify tensors\ngguf_dump"]
VERIFY --> RUN["Run inference\nllama-cli"]
RUN --> COMPARE["Compare outputs\nvs. HF reference"]
COMPARE --> PPL["Perplexity test\nllama-perplexity"]
PPL --> PR["Submit PR"]
Contribution guidelines: The project has a CONTRIBUTING.md that covers coding standards, PR requirements, and the review process. Key points:
- PRs adding new model architectures should include a link to the model's paper or documentation
- The converter should handle tokenizer conversion correctly
- New models should pass basic inference testing before submission
- Follow the existing code style; the .clang-format and .clang-tidy configs enforce this
Tip: The fastest way to add a new model is to find the most similar existing architecture and copy its converter class, tensor loading block, and graph builder. For most decoder-only transformers, the LLaMA implementation covers 80% of the work—you'll mainly be adjusting normalization placement, activation functions, and attention variants.
Series Conclusion
Over six articles, we've traced the complete path of inference through llama.cpp—from the GGUF file format and GGML tensor operations, through the graph context toolkit and architecture dispatch, the KV cache and decode pipeline, up to the HTTP server and CLI application layer. We've seen how a relatively small set of well-designed abstractions (the graph context toolkit, the memory interface hierarchy, the backend vtable system) enables support for 120+ model architectures across a dozen hardware backends.
The llama.cpp codebase rewards careful reading. Its design prioritizes performance and portability over abstraction, which means the code is sometimes more verbose than you'd expect—but every design choice has a reason rooted in the constraints of efficient inference on diverse hardware. Understanding those choices is the key to contributing effectively.