Contributing a New Model Architecture to llama.cpp

Advanced

Prerequisites

  • Articles 1-5
  • Familiarity with at least one transformer architecture's forward pass
  • Basic Python knowledge for the converter step

Over the course of this series, we've built a thorough understanding of llama.cpp's internals: the two-library stack, the graph context toolkit, GGML's backend system, the decode pipeline, and the application layer. Now it's time to put that knowledge to work. This article is a practical guide to adding a new model architecture to llama.cpp, covering every file you need to modify and in what order.

Adding a model architecture means touching about half a dozen files across the C++ core and the Python tooling, in a specific sequence. We'll use the existing LLaMA model as the reference implementation throughout, since it's the canonical example that most architectures follow.

The End-to-End Model Addition Path

Here's the complete path, from first touch to merged PR:

flowchart TD
    S1["Step 1: Architecture Registration\nllama-arch.h + llama-arch.cpp"] --> S2["Step 2: Python Converter\nconvert_hf_to_gguf.py"]
    S2 --> S3["Step 3: Tensor Loading\nllama-model.cpp\n(load_hparams + load_tensors)"]
    S3 --> S4["Step 4: Graph Builder\nsrc/models/mymodel.cpp"]
    S4 --> S5["Step 5: Registration\nllama-model.cpp\n(build_graph + create_memory)"]
    S5 --> S6["Step 6: Build & Test\nCMakeLists.txt + validation"]
Step  Files modified                          Purpose
1     src/llama-arch.h, src/llama-arch.cpp    Register the architecture identifier and tensor names
2     convert_hf_to_gguf.py                   Write the Python converter from HuggingFace format
3     src/llama-model.cpp                     Add hyperparameter and tensor loading
4     src/models/{name}.cpp (new file)        Implement the graph builder
5     src/llama-model.cpp                     Add dispatch entries for build_graph() and create_memory()
6     src/CMakeLists.txt                      Add the new source file to the build

Step 1: Architecture Registration

Start by adding your architecture to the llm_arch enum in src/llama-arch.h:

enum llm_arch {
    // ... existing entries ...
    LLM_ARCH_MY_MODEL,
    LLM_ARCH_UNKNOWN, // keep this last
};

Then add the string mapping in src/llama-arch.cpp:

static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
    // ... existing entries ...
    { LLM_ARCH_MY_MODEL, "my-model" },
};

The string "my-model" must match the general.architecture value that your Python converter will write into the GGUF file.

You also need to register tensor name mappings in llama-arch.cpp. This is a large map (LLM_TENSOR_NAMES) that tells the loader how to find tensors by GGUF name. For example:

{ LLM_ARCH_MY_MODEL, {
    { LLM_TENSOR_TOKEN_EMBD,  "token_embd" },
    { LLM_TENSOR_OUTPUT_NORM, "output_norm" },
    { LLM_TENSOR_OUTPUT,      "output" },
    { LLM_TENSOR_ATTN_NORM,   "blk.%d.attn_norm" },
    { LLM_TENSOR_ATTN_Q,      "blk.%d.attn_q" },
    // ... all weight tensors
}},

Tip: Study the tensor names used by the LLaMA architecture entry as a template. Most transformer models use the same set of tensors—the main variation is in naming and which optional tensors (biases, gate projections, MoE weights) are present.
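To see what those "%d" templates turn into at load time, here is a small standalone Python sketch of the expansion (the dict and helper are illustrative, not llama.cpp code; on the C++ side this is done by the tn() helper you'll meet in Step 3):

```python
# Illustrative sketch: how "%d" templates in LLM_TENSOR_NAMES expand
# into concrete per-layer GGUF tensor names.
TEMPLATES = {
    "TOKEN_EMBD": "token_embd",
    "ATTN_NORM":  "blk.%d.attn_norm",
    "ATTN_Q":     "blk.%d.attn_q",
}

def tensor_name(key: str, layer: int = -1) -> str:
    template = TEMPLATES[key]
    # Per-layer tensors carry a "%d" placeholder for the block index
    return template % layer if "%d" in template else template

print(tensor_name("ATTN_Q", 3))   # → blk.3.attn_q
print(tensor_name("TOKEN_EMBD"))  # → token_embd
```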

Step 2: Python Converter

The converter convert_hf_to_gguf.py uses a class hierarchy where each model architecture has its own converter class. Add a new class:

@ModelBase.register("MyModelForCausalLM")
class MyModelModel(TextModel):
    model_arch = gguf.MODEL_ARCH.MY_MODEL
    
    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        # Write architecture-specific hyperparameters
        self.gguf_writer.add_block_count(self.hparams["num_hidden_layers"])
        self.gguf_writer.add_embedding_length(self.hparams["hidden_size"])
        self.gguf_writer.add_head_count(self.hparams["num_attention_heads"])
        # ... etc
    
    def modify_tensors(self, data_torch, name, bid):
        # Map HuggingFace tensor names to GGUF names
        # Return list of (new_name, tensor) pairs
        return [(self.map_tensor_name(name), data_torch)]

The @ModelBase.register("MyModelForCausalLM") decorator registers the converter based on the architectures field in the HuggingFace config.json. The model_arch attribute must match the enum value you'll add to gguf-py/gguf/constants.py.
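As a concrete, hypothetical illustration of the renaming that modify_tensors() performs (the real converter delegates most of this to gguf-py's TensorNameMap; the rules below are stand-ins), the mapping looks roughly like this:

```python
import re

# Hypothetical illustration of the HF -> GGUF renaming done by
# modify_tensors(); the real converter uses gguf-py's TensorNameMap.
DIRECT = {
    "model.embed_tokens.weight": "token_embd.weight",
    "model.norm.weight":         "output_norm.weight",
    "lm_head.weight":            "output.weight",
}
PER_LAYER = [
    (re.compile(r"model\.layers\.(\d+)\.input_layernorm\.weight"),   r"blk.\1.attn_norm.weight"),
    (re.compile(r"model\.layers\.(\d+)\.self_attn\.q_proj\.weight"), r"blk.\1.attn_q.weight"),
    (re.compile(r"model\.layers\.(\d+)\.self_attn\.k_proj\.weight"), r"blk.\1.attn_k.weight"),
]

def map_tensor_name(name: str) -> str:
    if name in DIRECT:
        return DIRECT[name]
    for pattern, replacement in PER_LAYER:
        new_name, n_subs = pattern.subn(replacement, name)
        if n_subs:
            return new_name
    raise KeyError(f"unmapped tensor: {name}")

print(map_tensor_name("model.layers.3.self_attn.q_proj.weight"))  # → blk.3.attn_q.weight
```

An unmapped tensor should fail loudly, as above; silently dropping a weight is the hardest class of conversion bug to track down later.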

You'll also need to add the architecture to the gguf-py constants:

# In gguf-py/gguf/constants.py
class MODEL_ARCH(IntEnum):
    # ...
    MY_MODEL = auto()

And its tensor name mappings in the same file.
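A minimal sketch of the shape of those additions, with the round-trip invariant called out (entries abbreviated; in the real constants.py you extend the existing enums and maps rather than defining new ones):

```python
from enum import IntEnum, auto

# Sketch of the gguf-py side; the real MODEL_ARCH enum already
# contains every supported architecture.
class MODEL_ARCH(IntEnum):
    LLAMA    = auto()
    MY_MODEL = auto()  # new entry

# This string is what gets written as general.architecture, so it
# must match the C++ LLM_ARCH_NAMES entry ("my-model") from Step 1.
MODEL_ARCH_NAMES = {
    MODEL_ARCH.LLAMA:    "llama",
    MODEL_ARCH.MY_MODEL: "my-model",
}

print(MODEL_ARCH_NAMES[MODEL_ARCH.MY_MODEL])  # → my-model
```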

flowchart LR
    HF["HuggingFace Checkpoint\nconfig.json + model.safetensors"] --> CONV["convert_hf_to_gguf.py"]
    CONV --> ARCH["MyModelModel class"]
    ARCH --> PARAMS["set_gguf_parameters()\nWrite hyperparameters"]
    ARCH --> TENSORS["modify_tensors()\nRename + transform weights"]
    PARAMS --> GGUF["Output: model.gguf"]
    TENSORS --> GGUF

Step 3: Hyperparameter and Tensor Loading

Back in C++, you need to teach llama.cpp how to read the GGUF metadata and load the tensors. This happens in src/llama-model.cpp.

Loading hyperparameters — Find the load_hparams() method and add a case for your architecture in its switch statement. Here you read GGUF metadata keys and populate the hparams struct:

case LLM_ARCH_MY_MODEL:
    ml.get_key(LLM_KV_BLOCK_COUNT,      hparams.n_layer);
    ml.get_key(LLM_KV_EMBEDDING_LENGTH, hparams.n_embd);
    // head counts are per-layer arrays; use the _or_arr variant
    ml.get_key_or_arr(LLM_KV_ATTENTION_HEAD_COUNT, hparams.n_head_arr, hparams.n_layer);
    // ... model-specific hyperparameters
    break;

Loading tensors — Find the load_tensors() method and add a case that creates ggml_tensor pointers for each weight; per-layer weights are stored in the model's array of llama_layer structs:

case LLM_ARCH_MY_MODEL:
    model.tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD), {n_embd, n_vocab});
    for (int i = 0; i < n_layer; ++i) {
        auto & layer = model.layers[i];
        layer.attn_norm = create_tensor(tn(LLM_TENSOR_ATTN_NORM, i), {n_embd});
        layer.wq = create_tensor(tn(LLM_TENSOR_ATTN_Q, i), {n_embd, n_embd});
        // ... all per-layer tensors
    }
    break;

Tip: The create_tensor() calls don't allocate data—they register tensor shapes and names for the model loader to fill from the GGUF file. The actual data allocation and device assignment (CPU vs GPU) is handled by the loader based on n_gpu_layers.
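A frequent failure mode at this step is a shape mismatch between what create_tensor() expects and what the converter wrote. The expected attention projection shapes can be derived from the hyperparameters; here is a hedged standalone sketch for grouped-query attention (function and key names are illustrative, not llama.cpp code):

```python
# Illustrative helper: expected 2-D weight shapes for grouped-query
# attention, given assumed hyperparameter names.
def attn_weight_shapes(n_embd: int, n_head: int, n_head_kv: int) -> dict:
    head_dim = n_embd // n_head
    return {
        "attn_q":      (n_embd, n_head * head_dim),     # full set of query heads
        "attn_k":      (n_embd, n_head_kv * head_dim),  # fewer KV heads under GQA
        "attn_v":      (n_embd, n_head_kv * head_dim),
        "attn_output": (n_head * head_dim, n_embd),
    }

# e.g. a 4096-embd, 32-head model with 8 KV heads:
print(attn_weight_shapes(4096, 32, 8)["attn_k"])  # → (4096, 1024)
```

If the K/V projections come out square when the config says n_head_kv < n_head, the converter is probably mapping the wrong HF tensor.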

Step 4: Writing the Graph Builder

Create a new file src/models/mymodel.cpp. This is where you implement the model's forward pass using the llm_graph_context toolkit methods we covered in Article 2.

Start from the LLaMA reference implementation src/models/llama.cpp and modify it to match your architecture's topology. The general pattern is:

#include "models.h"

// The struct itself is declared in models.h (see below); the .cpp
// file defines only the constructor, which builds the graph.
llm_build_my_model::llm_build_my_model(const llama_model & model,
                                       const llm_graph_params & params)
    : llm_graph_context(params) {

    ggml_tensor * cur;
    ggml_tensor * inpL    = build_inp_embd(model.tok_embd);
    ggml_tensor * inp_pos = build_inp_pos();      // feeds RoPE in step 2
    auto        * inp_attn = build_attn_inp_kv();
    ggml_tensor * inp_out_ids = build_inp_out_ids();

    for (int il = 0; il < n_layer; ++il) {
        // 1. Pre-attention norm
        cur = build_norm(inpL, model.layers[il].attn_norm,
                         NULL, LLM_NORM_RMS, il);

        // 2. Q/K/V projections + RoPE
        // ... (specific to your architecture; produces Qcur/Kcur/Vcur)

        // 3. Attention
        cur = build_attn(inp_attn, model.layers[il].wo, NULL,
                         Qcur, Kcur, Vcur, NULL, NULL, NULL,
                         kq_scale, il);

        // On the last layer, inp_out_ids selects only the rows that
        // actually need logits.

        // 4. Residual + FFN
        cur = ggml_add(ctx0, cur, inpL);
        // ... FFN via build_ffn() or build_moe_ffn()

        inpL = cur;
    }

    // Final norm + LM head
    cur = build_norm(inpL, model.output_norm, NULL,
                     LLM_NORM_RMS, -1);
    res->t_embd = cur;

    cur = build_lora_mm(model.output, cur);
    res->t_logits = cur;

    ggml_build_forward_expand(gf, cur);
}

The builder must be declared in src/models/models.h so that the dispatch code in Step 5 can construct it:

struct llm_build_my_model : public llm_graph_context {
    llm_build_my_model(const llama_model & model, 
                       const llm_graph_params & params);
};

flowchart TD
    subgraph "Toolkit methods you compose"
        A["build_inp_embd()"]
        B["build_inp_pos()"]
        C["build_norm(RMS/LayerNorm)"]
        D["build_lora_mm(Q/K/V projections)"]
        E["ggml_rope_ext(RoPE)"]
        F["build_attn(attention + KV cache)"]
        G["build_ffn(SiLU/GELU/ReLU)"]
        H["build_moe_ffn(MoE)"]
    end
    
    A --> C
    B --> E
    C --> D
    D --> E
    E --> F
    F --> C
    C --> G
    C --> H

Step 5: Registration and Memory Selection

Two more entries in src/llama-model.cpp:

Add to build_graph() — add a case to its architecture switch statement:

case LLM_ARCH_MY_MODEL:
    llm = std::make_unique<llm_build_my_model>(*this, params);
    break;

Add to create_memory() — If your model is a standard transformer, the default case in create_memory() already handles it—it will create a llama_kv_cache. Only add an explicit case if your model needs:

  • No cache (BERT-style) → return nullptr
  • Recurrent state (Mamba/RWKV) → llama_memory_recurrent
  • Hybrid → llama_memory_hybrid
  • Sliding window → handled automatically if hparams.swa_type != NONE

Finally, add the new source file to src/CMakeLists.txt:

add_library(llama
    # ... existing files ...
    models/mymodel.cpp
)

Testing and Contribution Workflow

After building, validate your implementation:

1. Convert a model:

python convert_hf_to_gguf.py /path/to/hf-model --outfile model.gguf

2. Verify tensor shapes: Use gguf-py to dump the file and check that all tensors have correct shapes:

python -m gguf.scripts.gguf_dump model.gguf

3. Run inference:

./build/bin/llama-cli -m model.gguf -p "Hello, world!" -n 32

4. Compare outputs: Run the same prompt through the HuggingFace reference implementation and compare logits/outputs. Small differences (< 0.01) are expected due to floating-point ordering differences; large differences indicate a bug in tensor mapping or graph topology.
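One lightweight way to do that comparison, assuming you have dumped the final-position logits from each implementation into arrays yourself (the dump mechanism and the sample values below are stand-ins):

```python
import numpy as np

# Sketch: compare two logits vectors; the tolerance follows the rule
# of thumb above (differences < 0.01 are expected from FP reordering).
def compare_logits(a: np.ndarray, b: np.ndarray, atol: float = 1e-2):
    diff = float(np.max(np.abs(a.astype(np.float64) - b.astype(np.float64))))
    return diff, diff <= atol

hf_logits = np.array([1.000, -2.500, 0.303])   # stand-in for the HF dump
gg_logits = np.array([1.001, -2.499, 0.300])   # stand-in for the GGUF dump
max_diff, ok = compare_logits(hf_logits, gg_logits)
print(max_diff, ok)  # max_diff ≈ 0.003, ok is True
```

Comparing raw logits rather than sampled text is deliberate: sampling can mask large per-token errors, while a max-absolute-difference check cannot.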

5. Run perplexity: For a quantitative validation, compute perplexity on a standard benchmark:

./build/bin/llama-perplexity -m model.gguf -f wiki.test.raw

Compare with the HuggingFace model's perplexity—they should be within 0.1–0.5 points for F16, more for quantized formats.
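Perplexity is just the exponential of the mean per-token negative log-likelihood, which makes the comparison easy to sanity-check in isolation:

```python
import math

# Perplexity = exp(mean negative log-likelihood), NLLs in nats.
def perplexity(token_nlls: list[float]) -> float:
    return math.exp(sum(token_nlls) / len(token_nlls))

# If the model assigns every token probability 1/2, perplexity is 2.
print(perplexity([math.log(2.0)] * 4))  # ≈ 2.0
```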

flowchart LR
    CONVERT["Convert model\nHF → GGUF"] --> VERIFY["Verify tensors\ngguf_dump"]
    VERIFY --> RUN["Run inference\nllama-cli"]
    RUN --> COMPARE["Compare outputs\nvs. HF reference"]
    COMPARE --> PPL["Perplexity test\nllama-perplexity"]
    PPL --> PR["Submit PR"]

Contribution guidelines: The project has a CONTRIBUTING.md that covers coding standards, PR requirements, and the review process. Key points:

  • PRs adding new model architectures should include a link to the model's paper or documentation
  • The converter should handle tokenizer conversion correctly
  • New models should pass basic inference testing before submission
  • Follow the existing code style—the .clang-format and .clang-tidy configs enforce this

Tip: The fastest way to add a new model is to find the most similar existing architecture and copy its converter class, tensor loading block, and graph builder. For most decoder-only transformers, the LLaMA implementation covers 80% of the work—you'll mainly be adjusting normalization placement, activation functions, and attention variants.

Series Conclusion

Over six articles, we've traced the complete path of inference through llama.cpp—from the GGUF file format and GGML tensor operations, through the graph context toolkit and architecture dispatch, the KV cache and decode pipeline, up to the HTTP server and CLI application layer. We've seen how a relatively small set of well-designed abstractions (the graph context toolkit, the memory interface hierarchy, the backend vtable system) enables support for 120+ model architectures across a dozen hardware backends.

The llama.cpp codebase rewards careful reading. Its design prioritizes performance and portability over abstraction, which means the code is sometimes more verbose than you'd expect—but every design choice has a reason rooted in the constraints of efficient inference on diverse hardware. Understanding those choices is the key to contributing effectively.