ggml-org/llama.cpp
6 articles
Prerequisites
- Basic C/C++ knowledge (pointers, classes, virtual dispatch)
- General understanding of what Large Language Models are and how they generate text
llama.cpp Architecture: A Map of the Codebase
A comprehensive orientation to the llama.cpp codebase covering its two-library stack, directory structure, C API facade, core types, and inference lifecycle
How llama.cpp Turns Model Weights into Computation
Deep dive into llama.cpp's computation graph system: the graph context toolkit, model builders, architecture dispatch, and how non-transformer models fit in
GGML: The Tensor Engine Under llama.cpp
Deep dive into GGML's lazy evaluation model, backend vtable system, compile-time registration, backend scheduler, quantization types, and the GGUF file format
The Decode Loop: Batching, KV Cache, and Memory Management
Tracing the inference pipeline from llama_decode() through batch splitting, the core process_ubatch() loop, KV cache management, and error recovery
From HTTP Request to Token: The Server and CLI Tools
How llama.cpp's server handles concurrent requests through slots, its OpenAI-compatible API, and why the CLI reuses server internals
Contributing a New Model Architecture to llama.cpp
A practical end-to-end guide to adding a new model architecture, from GGUF conversion through graph building to testing