Kong as an AI Gateway: The LLM Driver Architecture
Prerequisites
- Article 1: Architecture and Nginx Integration
- Article 4: Plugin System and Iterator (plugin handler patterns)
- Basic understanding of LLM APIs (chat completions, streaming, tokens)
Kong's most recent major addition is its AI gateway capability — a subsystem for proxying, transforming, and observing requests to Large Language Model providers. Instead of building a separate AI proxy, Kong embedded LLM support directly into its plugin architecture, reusing the same phase pipeline, configuration system, and observability infrastructure we've explored throughout this series.
The design centers on a driver pattern: each LLM provider (OpenAI, Anthropic, Azure, AWS Bedrock, Google Gemini, Cohere, Hugging Face) is implemented as a driver module with a standard interface. A shared utilities module handles common HTTP transformations, and cloud-specific authentication is decoupled into adapter modules.
LLM Module Architecture and Format Detection
The LLM subsystem lives in kong/llm/, with the entry point at kong/llm/init.lua. The module's first responsibility is format detection — determining whether an incoming request is a chat completion or a text completion:
```lua
local function identify_request(request)
  local formats = {}

  if type(request.messages) == "table" and #request.messages > 0 then
    table.insert(formats, "llm/v1/chat")
  end

  if type(request.prompt) == "string" then
    table.insert(formats, "llm/v1/completions")
  end

  -- ...
end
```
Kong's canonical format is OpenAI-compatible: llm/v1/chat for message-array-based requests and llm/v1/completions for single-prompt requests. The is_compatible function at lines 67–82 checks whether a request matches the expected route type, with a special preserve mode that passes requests through without format validation.
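The detection logic amounts to the following sketch (illustrative Python mirroring the Lua above, not Kong source):

```python
def identify_request(request: dict) -> list[str]:
    """Classify a request body into Kong's canonical LLM formats.

    Illustrative paraphrase of identify_request in kong/llm/init.lua;
    the function name mirrors the Lua original, but this is a sketch.
    """
    formats = []
    # A non-empty "messages" array marks an OpenAI-style chat request.
    if isinstance(request.get("messages"), list) and request["messages"]:
        formats.append("llm/v1/chat")
    # A string "prompt" marks a text-completion request.
    if isinstance(request.get("prompt"), str):
        formats.append("llm/v1/completions")
    return formats

print(identify_request({"messages": [{"role": "user", "content": "hi"}]}))
# → ['llm/v1/chat']
```

Note that a body could in principle match both formats, which is why the function collects a list rather than returning the first hit.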
The available driver modules span the major LLM providers:
| Driver | File | Provider |
|---|---|---|
| `openai` | `kong/llm/drivers/openai.lua` | OpenAI API |
| `anthropic` | `kong/llm/drivers/anthropic.lua` | Anthropic Claude |
| `azure` | `kong/llm/drivers/azure.lua` | Azure OpenAI |
| `bedrock` | `kong/llm/drivers/bedrock.lua` | AWS Bedrock |
| `gemini` | `kong/llm/drivers/gemini.lua` | Google Gemini |
| `cohere` | `kong/llm/drivers/cohere.lua` | Cohere |
| `huggingface` | `kong/llm/drivers/huggingface.lua` | Hugging Face |
| `mistral` | `kong/llm/drivers/mistral.lua` | Mistral AI |
| `llama2` | `kong/llm/drivers/llama2.lua` | Llama 2 (self-hosted) |
```mermaid
flowchart TD
    A[Incoming Request] --> B{Format Detection}
    B --> C["llm/v1/chat<br>(messages array)"]
    B --> D["llm/v1/completions<br>(prompt string)"]
    C --> E{Route Type Match?}
    D --> E
    E -->|Compatible| F[Select Driver]
    E -->|Incompatible| G[400 Error]
    F --> H[Transform to Provider Format]
    H --> I[Send to LLM Provider]
    I --> J[Transform Response to Kong Format]
```
The Driver Pattern: Provider Abstraction
Each driver module implements a standard interface with to_format and from_format transformer functions. The kong/llm/drivers/openai.lua driver is the simplest because Kong's canonical format is the OpenAI format:
```lua
local transformers_to = {
  ["llm/v1/chat"] = function(request_table, model_info, route_type)
    request_table.model = model_info.name or request_table.model
    request_table.stream = request_table.stream or false
    request_table.top_k = nil  -- unsupported by OpenAI
    return request_table, "application/json", nil
  end,
}
```
The kong/llm/drivers/anthropic.lua driver, by contrast, must translate between formats. For older Claude models, it converts Kong's message array into Anthropic's Human:/Assistant: prompt format:
```lua
local buffer = require("string.buffer")  -- LuaJIT string buffer

local function kong_messages_to_claude_prompt(messages)
  local buf = buffer.new()

  for _, v in ipairs(messages) do
    if v.role == "assistant" then
      buf:put("Assistant: ")
    elseif v.role == "user" then
      buf:put("Human: ")
    end
    buf:put(v.content)
    buf:put("\n\n")
  end

  -- Trailing "Assistant:" prompts the model to continue as the assistant.
  buf:put("Assistant:")
  return buf:get()
end
```
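The same concatenation can be expressed as a short Python sketch (a paraphrase of the Lua above, useful for checking what the resulting prompt string looks like):

```python
def messages_to_claude_prompt(messages: list[dict]) -> str:
    # Paraphrase of kong_messages_to_claude_prompt: each message becomes
    # a "Human:"/"Assistant:" turn, separated by blank lines, and the
    # prompt ends with "Assistant:" so the model answers as the assistant.
    parts = []
    for m in messages:
        prefix = {"user": "Human: ", "assistant": "Assistant: "}.get(m["role"], "")
        parts.append(prefix + m["content"] + "\n\n")
    parts.append("Assistant:")
    return "".join(parts)

messages_to_claude_prompt([{"role": "user", "content": "Hello"}])
# → 'Human: Hello\n\nAssistant:'
```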
The shared driver utilities in kong/llm/drivers/shared.lua provide common functionality used by all drivers: HTTP client management, streaming content type detection, SSE parsing, and log entry key constants for observability. The module defines standard keys for tracking usage:
```lua
local log_entry_keys = {
  USAGE_CONTAINER = "usage",
  PROMPT_TOKENS = "prompt_tokens",
  COMPLETION_TOKENS = "completion_tokens",
  TOTAL_TOKENS = "total_tokens",
  TIME_PER_TOKEN = "time_per_token",
  COST = "cost",
}
```
```mermaid
classDiagram
    class SharedDriver {
        +_CONST: SSE_TERMINATOR, etc.
        +_SUPPORTED_STREAMING_CONTENT_TYPES
        +log_entry_keys
        +HTTP utilities
    }
    class OpenAIDriver {
        +to_format(request, model_info)
        +from_format(response, model_info)
        +DRIVER_NAME: "openai"
    }
    class AnthropicDriver {
        +to_format(request, model_info)
        +from_format(response, model_info)
        +kong_messages_to_claude_prompt()
        +DRIVER_NAME: "anthropic"
    }
    class BedrockDriver {
        +to_format(request, model_info)
        +from_format(response, model_info)
        +DRIVER_NAME: "bedrock"
    }
    SharedDriver <|-- OpenAIDriver
    SharedDriver <|-- AnthropicDriver
    SharedDriver <|-- BedrockDriver
```
Cloud Adapters and Authentication
Authentication to cloud-hosted LLM services is decoupled from the driver logic. Cloud-specific adapters handle credential management without cluttering the format transformation code.
The adapter modules live in kong/llm/adapters/:
- `bedrock.lua` — AWS SigV4 request signing using `resty.aws`
- `gemini.lua` — Google Cloud service account authentication via `resty.gcp`
The shared driver module at kong/llm/drivers/shared.lua initializes the cloud SDKs at load time:
```lua
local GCP = require("resty.gcp.request.credentials.accesstoken")
local aws_config = require "resty.aws.config"
local AWS = require("resty.aws")
local AWS_REGION = os.getenv("AWS_REGION") or os.getenv("AWS_DEFAULT_REGION")
```
The authentication schema in kong/llm/schemas/init.lua provides a flexible auth configuration that supports header-based auth (API keys), query parameter auth, and cloud-native authentication:
```lua
local auth_schema = {
  type = "record",
  fields = {
    { header_name = { type = "string", referenceable = true }},
    { header_value = { type = "string", encrypted = true, referenceable = true }},
    { param_name = { type = "string", referenceable = true }},
    { param_value = { type = "string", encrypted = true, referenceable = true }},
  },
}
```
Note the encrypted = true and referenceable = true annotations. The encrypted flag marks fields for at-rest encryption in the database (an Enterprise feature). The referenceable flag means the value can be a Kong Vault reference like {vault://env/OPENAI_API_KEY} — integrating with Kong's secrets management system.
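A declarative plugin configuration using such a reference might look like the following sketch (field names and the model name are illustrative; verify them against the ai-proxy schema for your Kong version):

```yaml
plugins:
  - name: ai-proxy
    config:
      route_type: llm/v1/chat
      auth:
        header_name: Authorization
        # Resolved through Kong's env vault at request time; the secret
        # never appears in the declarative config or the database.
        header_value: "{vault://env/OPENAI_API_KEY}"
      model:
        provider: openai
        name: gpt-4o   # illustrative model name
```

Because `header_value` is both `encrypted` and `referenceable`, the same field works with a literal key, an encrypted stored value, or a vault reference.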
Tip: For cloud providers like AWS Bedrock, you don't need to set explicit API keys. The adapter uses the standard AWS SDK credential chain — environment variables, IAM roles, or instance profiles. Just set `AWS_REGION` and ensure your Kong instance has the appropriate IAM permissions.
The AI Plugin Family
The ai-proxy plugin at kong/plugins/ai-proxy/handler.lua is remarkably compact — just 19 lines. That's because it delegates to a filter-based architecture built on the ai_plugin_base module:
```lua
local ai_plugin_base = require("kong.llm.plugin.base")

local NAME = "ai-proxy"
local PRIORITY = 770

local AIPlugin = ai_plugin_base.define(NAME, PRIORITY)

local SHARED_FILTERS = {
  "parse-request", "normalize-request", "enable-buffering",
  "normalize-response-header", "parse-sse-chunk", "normalize-sse-chunk",
  "parse-json-response", "normalize-json-response",
  "serialize-analytics",
}

for _, filter in ipairs(SHARED_FILTERS) do
  AIPlugin:enable(AIPlugin.register_filter(require("kong.llm.plugin.shared-filters." .. filter)))
end

return AIPlugin:as_kong_plugin()
```
The kong/llm/plugin/base.lua module provides a meta-plugin framework with its own "stages" system that maps onto Kong's phases:
```lua
local STAGES = {
  SETUP = 0,
  REQ_INTROSPECTION = 1,
  REQ_TRANSFORMATION = 2,
  REQ_POST_PROCESSING = 3,
  RES_INTROSPECTION = 4,
  RES_TRANSFORMATION = 5,
  STREAMING = 6,
  RES_PRE_PROCESSING = 7,
  RES_POST_PROCESSING = 8,
}
```
Each shared filter registers for specific stages. The parse-request filter runs during REQ_INTROSPECTION to decode the incoming request body. The normalize-request filter runs during REQ_TRANSFORMATION to translate from Kong's canonical format to the provider's format. The serialize-analytics filter runs during RES_POST_PROCESSING to emit usage metrics.
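In spirit, the base module behaves like a small stage-keyed registry. Here is an illustrative Python sketch (the method names loosely mirror the Lua, but the API is heavily simplified):

```python
from collections import defaultdict

class AIPluginBase:
    """Toy model of the stage/filter framework in kong/llm/plugin/base.lua.

    Filters register for a stage; running a stage invokes its filters in
    registration order against a shared per-request context.
    """
    def __init__(self, name: str):
        self.name = name
        self._filters = defaultdict(list)

    def register_filter(self, stage: str, fn) -> None:
        self._filters[stage].append(fn)

    def run_stage(self, stage: str, ctx: dict) -> dict:
        for fn in self._filters[stage]:
            fn(ctx)
        return ctx

plugin = AIPluginBase("ai-proxy")
plugin.register_filter("REQ_INTROSPECTION",
                       lambda ctx: ctx.setdefault("format", "llm/v1/chat"))
plugin.register_filter("REQ_TRANSFORMATION",
                       lambda ctx: ctx.update(provider="anthropic"))

ctx = {}
plugin.run_stage("REQ_INTROSPECTION", ctx)
plugin.run_stage("REQ_TRANSFORMATION", ctx)
```

The point of the indirection is composability: a different plugin can reuse the same registry and enable only the filters it needs.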
This composable filter architecture allows different AI plugins (ai-proxy, ai-request-transformer, ai-response-transformer) to share common logic while implementing different high-level behaviors. The ai-proxy enables all the standard filters; the ai-request-transformer might enable only the request-side filters plus an LLM introspection filter.
```mermaid
sequenceDiagram
    participant Client
    participant AIProxy as ai-proxy (access)
    participant ParseReq as parse-request filter
    participant NormReq as normalize-request filter
    participant LLM as LLM Provider
    participant ParseRes as parse-json-response filter
    participant NormRes as normalize-json-response filter
    participant Analytics as serialize-analytics filter
    Client->>AIProxy: POST /llm/v1/chat
    AIProxy->>ParseReq: STAGE: REQ_INTROSPECTION
    ParseReq->>ParseReq: Decode JSON body
    ParseReq->>NormReq: STAGE: REQ_TRANSFORMATION
    NormReq->>NormReq: Transform to provider format
    NormReq->>LLM: Forward transformed request
    LLM-->>ParseRes: Provider response
    ParseRes->>ParseRes: Decode provider JSON
    ParseRes->>NormRes: STAGE: RES_TRANSFORMATION
    NormRes->>NormRes: Normalize to Kong format
    NormRes->>Analytics: STAGE: RES_POST_PROCESSING
    Analytics->>Analytics: Record token usage
    Analytics-->>Client: Normalized response
```
The observability module at kong/llm/plugin/observability.lua integrates with Kong's existing metrics infrastructure. Token counts, latencies, and costs are tracked per-request and exposed through Kong's standard logging plugins — so you can use http-log, datadog, or prometheus to monitor AI gateway traffic without any additional configuration.
The context module at kong/llm/plugin/ctx.lua provides namespaced per-request state management. Each AI plugin gets its own context namespace, preventing conflicts when multiple AI plugins run on the same request (e.g., ai-prompt-guard followed by ai-proxy).
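The namespacing idea can be sketched in a few lines (illustrative Python; Kong's actual ctx API in Lua differs):

```python
class NamespacedCtx:
    """Sketch of per-request namespaced state, in the spirit of
    kong/llm/plugin/ctx.lua. Each plugin touches only its own namespace,
    so two AI plugins on one request cannot clobber each other's keys.
    Method names here are illustrative, not Kong's API.
    """
    def __init__(self):
        self._store = {}

    def set(self, namespace: str, key: str, value) -> None:
        self._store.setdefault(namespace, {})[key] = value

    def get(self, namespace: str, key: str):
        return self._store.get(namespace, {}).get(key)

# e.g. ai-prompt-guard and ai-proxy sharing one request context:
ctx = NamespacedCtx()
ctx.set("ai-prompt-guard", "verdict", "allow")
ctx.set("ai-proxy", "model", "claude-3")
```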
The Full AI Request Flow
Here's how a complete AI gateway request flows through Kong:
- Client sends `POST /ai/chat` with an OpenAI-compatible request body
- Kong's router matches this to a Route configured with the `ai-proxy` plugin
- The ai-proxy's `access` handler runs:
  - `parse-request`: Decodes JSON, identifies format as `llm/v1/chat`
  - `normalize-request`: Selects the configured driver (e.g., `anthropic`), transforms the request
  - `enable-buffering`: Enables response buffering for non-streaming requests
- Kong's balancer sends the transformed request to the LLM provider endpoint
- The ai-proxy's response handlers run:
  - `normalize-response-header`: Adjusts content-type headers
  - `parse-json-response` or `parse-sse-chunk`: Parses the provider's response
  - `normalize-json-response` or `normalize-sse-chunk`: Translates back to Kong format
  - `serialize-analytics`: Records token usage, latency, cost
- The normalized response is sent to the client
For streaming responses, the STREAMING stage filters run on every SSE chunk in the body_filter phase — this is why REPEATED_PHASES marks the streaming stage as repeatable.
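The per-chunk work can be sketched as a minimal SSE frame splitter (illustrative Python; simplified relative to parse-sse-chunk, which must also buffer frames that arrive split across chunks):

```python
def parse_sse_events(chunk: bytes) -> list[dict]:
    """Split one body_filter chunk into SSE events.

    Simplifying assumption: each chunk contains only whole events
    (frames separated by a blank line). A production parser keeps a
    buffer for partial frames between invocations.
    """
    events = []
    for frame in chunk.decode("utf-8").split("\n\n"):
        for line in frame.splitlines():
            if line.startswith("data: "):
                data = line[len("data: "):]
                # OpenAI-style streams end with a literal [DONE] sentinel.
                events.append({"done": data == "[DONE]", "data": data})
    return events
```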
Tip: The LLM schemas at `kong/llm/schemas/init.lua` define provider-specific options like `bedrock_options_schema` (AWS region override) and `gemini_options_schema` (Vertex AI project/location). These are optional — the adapter modules fall back to environment variables when schema options aren't set.
Series Conclusion
Over these seven articles, we've traced Kong from its Nginx foundation through initialization, request processing, plugin execution, schema-driven data management, distributed clustering, and AI gateway capabilities. The common thread throughout is Kong's commitment to a few powerful abstractions: phases for lifecycle management, schemas for data modeling, iterators for plugin execution, and drivers for provider abstraction.
Whether you're extending Kong with a custom plugin, debugging a production issue, or evaluating Kong for your API infrastructure, understanding these internals transforms Kong from a black box into a comprehensible system. The codebase is large — kong/init.lua alone is nearly 2,000 lines — but the patterns are consistent, the naming is clear, and the architecture rewards close reading.