From HTTP Request to Token: The Server and CLI Tools
Prerequisites
- Articles 1-4
- Basic understanding of HTTP APIs and REST conventions
Everything we've covered so far—GGML tensors, computation graphs, KV caches, the decode loop—lives in libllama, a C/C++ library with no opinion about how you interact with it. The application layer in tools/ is where llama.cpp meets real users: an HTTP server with OpenAI-compatible endpoints, an interactive CLI, quantization tools, and more.
This article covers the server's slot-based concurrency model, the task queue architecture, the API surface, and one of llama.cpp's most interesting design decisions: the CLI doesn't implement its own inference loop—it reuses the server's entire infrastructure.
Server Architecture Overview
The server's architecture centers on three components, split across several header files in tools/server/:
server_context is the central orchestrator. Defined in tools/server/server-context.h, it uses the pimpl (pointer-to-implementation) pattern to hide its internals behind server_context_impl. It owns the model, context, slots, and the main event loop.
server_queue manages task submission and routing. Defined in tools/server/server-queue.h, it provides a thread-safe task queue with deferred tasks (queued when all slots are busy), priority support, and an optional sleeping mode that unloads the model during idle periods.
server_task represents a unit of work. Defined in tools/server/server-task.h, tasks carry a type (completion, embedding, cancel, etc.), input parameters, and can have child tasks for multi-step operations.
flowchart TD
HTTP["HTTP Request\n/v1/chat/completions"] --> ROUTES["Route Handler"]
ROUTES --> TASK["Create server_task"]
TASK --> QUEUE["server_queue.post()"]
QUEUE --> LOOP["Main Event Loop"]
LOOP --> SLOT["Assign to Slot"]
SLOT --> DECODE["llama_decode()"]
DECODE --> SAMPLE["Sample token"]
SAMPLE --> RESULT["server_task_result"]
RESULT --> SSE["SSE Stream / JSON Response"]
The main event loop runs in a single thread (the "main loop" thread), processing one task at a time from the queue. This is deliberate—it avoids the complexity of concurrent llama_decode() calls, which would require multiple contexts or careful locking. Concurrency comes from the slot system instead.
The Slot System
The server can handle multiple concurrent inference requests against a single loaded model through a "slot" abstraction. Each slot represents an independent inference stream with its own:
- Prompt state and generated tokens
- Sampling parameters and sampler chain
- Position in the KV cache
- Streaming state (for SSE responses)
When a new request arrives, the server assigns it to an idle slot. The main loop iterates through all active slots, advancing each one by calling llama_decode() with the slot's batch. This is a form of cooperative multitasking—all slots share the same model weights and llama_context, but they each get their own sequence ID in the KV cache.
sequenceDiagram
participant R1 as Request 1
participant R2 as Request 2
participant Q as server_queue
participant L as Main Loop
participant S1 as Slot 0
participant S2 as Slot 1
participant CTX as llama_context
R1->>Q: Completion task
R2->>Q: Completion task
Q->>L: Dequeue tasks
L->>S1: Assign Request 1
L->>S2: Assign Request 2
loop Update cycle
L->>S1: Prepare batch
L->>S2: Prepare batch
L->>CTX: llama_decode(combined batch)
CTX-->>S1: Logits for seq 0
CTX-->>S2: Logits for seq 1
S1->>R1: Stream token (SSE)
S2->>R2: Stream token (SSE)
end
The number of slots is configurable at startup via --parallel N. Each additional slot claims a share of the context window—with 4 slots and n_ctx=8192, each slot effectively has ~2048 positions (though the implementation is more nuanced with KV cache partitioning).
Tip: If you're running the server for a single user, --parallel 1 is optimal. Each additional slot reduces per-request context length and increases memory usage. For multiple concurrent users, set --parallel to the expected concurrency level and scale --ctx-size accordingly.
API Surface and Route Registration
The server exposes three API families, set up in tools/server/server.cpp:
OpenAI-compatible endpoints:
- POST /v1/chat/completions — chat completions with streaming
- POST /v1/completions — text completions
- POST /v1/embeddings — embedding generation
- GET /v1/models — list available models
Native endpoints:
- POST /completion — raw completion with full parameter control
- POST /tokenize — tokenize text
- POST /detokenize — detokenize tokens
- GET /health — server health check
- GET /props — server properties
Anthropic-compatible endpoint:
- POST /v1/messages — Anthropic Messages API format
Each route handler follows the same pattern: parse the JSON request, create a server_task with the appropriate parameters, post it to the queue, and wait for results using a server_response_reader. For streaming, results are sent as Server-Sent Events (SSE).
The route handlers are wrapped in ex_wrapper(), which catches exceptions and converts them to appropriate HTTP error responses—400 for invalid arguments, 500 for internal errors.
CLI Reusing Server Internals
Perhaps the most surprising architectural decision in llama.cpp is visible in tools/cli/cli.cpp:
#include "server-context.h"
#include "server-task.h"
struct cli_context {
server_context ctx_server; // <-- the same server_context!
json messages = json::array();
// ...
};
The interactive CLI doesn't implement its own inference loop. Instead, cli_context wraps a server_context and interacts with it through the same task/queue interface that HTTP handlers use. When you type a message in the CLI, it creates a server_task, posts it to the server's queue, and reads results through a server_response_reader.
classDiagram
class server_context {
+load_model()
+start_loop()
+get_response_reader()
}
class server_http_context {
Routes: /v1/chat/completions
Routes: /v1/completions
Creates server_tasks
}
class cli_context {
+ctx_server: server_context
+messages: json
Creates server_tasks
}
server_context <-- server_http_context : "routes to"
server_context <-- cli_context : "wraps"
This design has several advantages:
- Feature parity — Every feature available in the server (tool calling, multimodal input, grammar constraints, speculative decoding) is automatically available in the CLI.
- Single code path — Bug fixes in the inference pipeline fix both the server and CLI simultaneously. No risk of the CLI diverging.
- Simpler testing — The CLI effectively integration-tests the server's core logic without needing an HTTP stack.
The tradeoff is that the CLI carries the weight of server dependencies (task queue, slot management, etc.) even though it only uses one slot.
The Common Utilities Layer
The common/ directory provides shared infrastructure used by all tools:
Argument parsing — common/arg.h provides common_params_parse(), which handles the hundreds of CLI flags that tools accept (model path, context size, GPU layers, sampling parameters, etc.). All tools share the same common_params struct.
Sampling wrappers — common/sampling.h wraps libllama's low-level llama_sampler chain with a higher-level interface that handles configuration, reset, and parameter management.
Chat template engine — The common/jinja/ directory contains a Jinja2-compatible template engine for applying chat templates. When a model provides a chat template in its GGUF metadata (via the tokenizer.chat_template key), this engine renders user/assistant/system messages into the model's expected format.
flowchart TD
subgraph "common/ utilities"
ARG["arg.h / arg.cpp\nArgument parsing"]
SAMP["sampling.h\nSampling wrappers"]
CHAT["chat.h\nChat template application"]
JINJA["jinja/\nJinja2 template engine"]
LOG["log.h\nLogging utilities"]
end
subgraph "Tools"
CLI["tools/cli"]
SRV["tools/server"]
QNT["tools/quantize"]
BENCH["tools/bench"]
end
CLI --> ARG
CLI --> CHAT
SRV --> ARG
SRV --> SAMP
SRV --> JINJA
QNT --> ARG
BENCH --> ARG
Tool Ecosystem and Model Conversion
Beyond the server and CLI, llama.cpp includes several specialized tools:
tools/quantize/ — Converts models between quantization formats (e.g., F16 → Q4_K_M). It reads a GGUF file, requantizes each tensor according to the target format, and writes a new GGUF file. Supports importance matrix (--imatrix) for data-dependent quantization.
tools/bench/ — Performance benchmarking for prompt processing and token generation throughput.
tools/perplexity/ — Evaluates model quality by computing perplexity on a test corpus.
tools/imatrix/ — Computes importance matrices from calibration data, used to improve quantization quality.
tools/tts/ — Text-to-speech using speech synthesis models.
tools/mtmd/ — Multimodal tool for models that process images and audio alongside text.
Model conversion — The convert_hf_to_gguf.py script is the primary way to bring new models into llama.cpp. It reads HuggingFace model checkpoints (PyTorch .bin or SafeTensors .safetensors), maps tensor names to GGUF conventions, writes hyperparameters as GGUF metadata, and outputs a .gguf file. The gguf-py/ library provides the Python GGUF reader/writer that the converter uses.
Tip: When quantizing models, Q4_K_M is generally the best balance of quality and size for most use cases. Use --imatrix with a calibration dataset for the best results at aggressive quantization levels (Q2_K and below).
What's Next
We've now covered the full stack from HTTP request to token output. The final article in this series takes a practical turn: a step-by-step guide to contributing a new model architecture to llama.cpp, covering every file you need to touch from GGUF conversion to graph building to testing.