Read OSS

ggml-org/llama.cpp

6 articles

01

llama.cpp Architecture: A Map of the Codebase

A comprehensive orientation to the llama.cpp codebase, covering its two-library stack, directory structure, C API facade, core types, and inference lifecycle

02

How llama.cpp Turns Model Weights into Computation

Deep dive into llama.cpp's computation graph system: the graph context toolkit, model builders, architecture dispatch, and how non-transformer models fit in

03

GGML: The Tensor Engine Under llama.cpp

Deep dive into GGML's lazy evaluation model, backend vtable system, compile-time registration, backend scheduler, quantization types, and the GGUF file format

04

The Decode Loop: Batching, KV Cache, and Memory Management

A trace of the inference pipeline from llama_decode() through batch splitting, the core process_ubatch() loop, KV cache management, and error recovery

05

From HTTP Request to Token: The Server and CLI Tools

How llama.cpp's server handles concurrent requests through slots, its OpenAI-compatible API, and why the CLI reuses server internals

06

Contributing a New Model Architecture to llama.cpp

A practical end-to-end guide to adding a new model architecture, from GGUF conversion through graph building to testing