
Kong as an AI Gateway: The LLM Driver Architecture

Intermediate

Prerequisites

  • Article 1: Architecture and Nginx Integration
  • Article 4: Plugin System and Iterator (plugin handler patterns)
  • Basic understanding of LLM APIs (chat completions, streaming, tokens)

Kong's most recent major addition is its AI gateway capability — a subsystem for proxying, transforming, and observing requests to Large Language Model providers. Instead of building a separate AI proxy, Kong embedded LLM support directly into its plugin architecture, reusing the same phase pipeline, configuration system, and observability infrastructure we've explored throughout this series.

The design centers on a driver pattern: each LLM provider (OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Google Gemini, Cohere, Hugging Face, Mistral, self-hosted Llama 2) is implemented as a driver module with a standard interface. A shared utilities module handles common HTTP transformations, and cloud-specific authentication is decoupled into adapter modules.

LLM Module Architecture and Format Detection

The LLM subsystem lives in kong/llm/, with the entry point at kong/llm/init.lua. The module's first responsibility is format detection — determining whether an incoming request is a chat completion or a text completion:

local function identify_request(request)
  local formats = {}
  if type(request.messages) == "table" and #request.messages > 0 then
    table.insert(formats, "llm/v1/chat")
  end
  if type(request.prompt) == "string" then
    table.insert(formats, "llm/v1/completions")
  end
  -- ...
end

Kong's canonical format is OpenAI-compatible: llm/v1/chat for message-array-based requests and llm/v1/completions for single-prompt requests. The is_compatible function at lines 67–82 checks whether a request matches the expected route type, with a special preserve mode that passes requests through without format validation.
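Though the real function also cross-checks the detected formats, the core of the compatibility check can be sketched as follows (a simplified illustration, not the verbatim source):

```lua
-- Simplified sketch of route-type compatibility checking:
local function is_compatible(request_format, route_type)
  if route_type == "preserve" then
    return true  -- preserve mode skips format validation entirely
  end
  if request_format == route_type then
    return true
  end
  return false, "request format does not match route type"
end

print(is_compatible("llm/v1/chat", "preserve"))            -- true
print(is_compatible("llm/v1/chat", "llm/v1/chat"))         -- true
print(is_compatible("llm/v1/completions", "llm/v1/chat"))  -- false, plus an error string
```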

The available driver modules span the major LLM providers:

Driver        File                               Provider
-----------   --------------------------------   ---------------------
openai        kong/llm/drivers/openai.lua        OpenAI API
anthropic     kong/llm/drivers/anthropic.lua     Anthropic Claude
azure         kong/llm/drivers/azure.lua         Azure OpenAI
bedrock       kong/llm/drivers/bedrock.lua       AWS Bedrock
gemini        kong/llm/drivers/gemini.lua        Google Gemini
cohere        kong/llm/drivers/cohere.lua        Cohere
huggingface   kong/llm/drivers/huggingface.lua   Hugging Face
mistral       kong/llm/drivers/mistral.lua       Mistral AI
llama2        kong/llm/drivers/llama2.lua        Llama 2 (self-hosted)

flowchart TD
    A[Incoming Request] --> B{Format Detection}
    B --> C["llm/v1/chat<br>(messages array)"]
    B --> D["llm/v1/completions<br>(prompt string)"]
    C --> E{Route Type Match?}
    D --> E
    E -->|Compatible| F[Select Driver]
    E -->|Incompatible| G[400 Error]
    F --> H[Transform to Provider Format]
    H --> I[Send to LLM Provider]
    I --> J[Transform Response to Kong Format]

The Driver Pattern: Provider Abstraction

Each driver module implements a standard interface with to_format and from_format transformer functions. The kong/llm/drivers/openai.lua driver is the simplest because Kong's canonical format is the OpenAI format:

local transformers_to = {
  ["llm/v1/chat"] = function(request_table, model_info, route_type)
    request_table.model = model_info.name or request_table.model
    request_table.stream = request_table.stream or false
    request_table.top_k = nil  -- unsupported by OpenAI
    return request_table, "application/json", nil
  end,
}
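Because this transformer is a pure function, its effect can be reproduced standalone. The following is a self-contained copy for experimentation; the model name is illustrative:

```lua
-- Standalone copy of the chat transformer shown above:
local function to_openai_chat(request_table, model_info)
  request_table.model = model_info.name or request_table.model
  request_table.stream = request_table.stream or false
  request_table.top_k = nil  -- unsupported by OpenAI
  return request_table, "application/json", nil
end

local req = { messages = { { role = "user", content = "Hi" } }, top_k = 40 }
local out, content_type = to_openai_chat(req, { name = "gpt-4o" })
-- out.model == "gpt-4o", out.stream == false, out.top_k == nil,
-- content_type == "application/json"
```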

The kong/llm/drivers/anthropic.lua driver, by contrast, must translate between formats. For older Claude models, it converts Kong's message array into Anthropic's Human:/Assistant: prompt format:

local buffer = require("string.buffer")  -- LuaJIT string buffer library

local function kong_messages_to_claude_prompt(messages)
  local buf = buffer.new()
  for _, v in ipairs(messages) do
    if v.role == "assistant" then
      buf:put("Assistant: ")
    elseif v.role == "user" then
      buf:put("Human: ")
    end
    buf:put(v.content)
    buf:put("\n\n")
  end
  buf:put("Assistant:")
  return buf:get()
end

The shared driver utilities in kong/llm/drivers/shared.lua provide common functionality used by all drivers: HTTP client management, streaming content type detection, SSE parsing, and log entry key constants for observability. The module defines standard keys for tracking usage:

local log_entry_keys = {
  USAGE_CONTAINER = "usage",
  PROMPT_TOKENS = "prompt_tokens",
  COMPLETION_TOKENS = "completion_tokens",
  TOTAL_TOKENS = "total_tokens",
  TIME_PER_TOKEN = "time_per_token",
  COST = "cost",
}
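Drivers use these keys when assembling the per-request analytics entry. A hypothetical entry (values invented for illustration) has this shape:

```lua
-- Illustrative analytics entry built from the standard keys:
local log_entry_keys = {
  USAGE_CONTAINER = "usage",
  PROMPT_TOKENS = "prompt_tokens",
  COMPLETION_TOKENS = "completion_tokens",
  TOTAL_TOKENS = "total_tokens",
}

local entry = {
  [log_entry_keys.USAGE_CONTAINER] = {
    [log_entry_keys.PROMPT_TOKENS] = 125,
    [log_entry_keys.COMPLETION_TOKENS] = 80,
    [log_entry_keys.TOTAL_TOKENS] = 205,
  },
}
print(entry.usage.total_tokens)  -- 205
```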

classDiagram
    class SharedDriver {
        +_CONST: SSE_TERMINATOR, etc.
        +_SUPPORTED_STREAMING_CONTENT_TYPES
        +log_entry_keys
        +HTTP utilities
    }
    class OpenAIDriver {
        +to_format(request, model_info)
        +from_format(response, model_info)
        +DRIVER_NAME: "openai"
    }
    class AnthropicDriver {
        +to_format(request, model_info)
        +from_format(response, model_info)
        +kong_messages_to_claude_prompt()
        +DRIVER_NAME: "anthropic"
    }
    class BedrockDriver {
        +to_format(request, model_info)
        +from_format(response, model_info)
        +DRIVER_NAME: "bedrock"
    }
    SharedDriver <|-- OpenAIDriver
    SharedDriver <|-- AnthropicDriver
    SharedDriver <|-- BedrockDriver

Cloud Adapters and Authentication

Authentication to cloud-hosted LLM services is decoupled from the driver logic. Cloud-specific adapters handle credential management without cluttering the format transformation code.

The adapter modules live in kong/llm/adapters/:

  • bedrock.lua — AWS SigV4 request signing using resty.aws
  • gemini.lua — Google Cloud service account authentication via resty.gcp

The shared driver module at kong/llm/drivers/shared.lua initializes the cloud SDKs at load time:

local GCP = require("resty.gcp.request.credentials.accesstoken")
local aws_config = require "resty.aws.config"
local AWS = require("resty.aws")
local AWS_REGION = os.getenv("AWS_REGION") or os.getenv("AWS_DEFAULT_REGION")

The authentication schema in kong/llm/schemas/init.lua provides a flexible auth configuration that supports header-based auth (API keys), query parameter auth, and cloud-native authentication:

local auth_schema = {
  type = "record",
  fields = {
    { header_name = { type = "string", referenceable = true }},
    { header_value = { type = "string", encrypted = true, referenceable = true }},
    { param_name = { type = "string", referenceable = true }},
    { param_value = { type = "string", encrypted = true, referenceable = true }},
  },
}

Note the encrypted = true and referenceable = true annotations. The encrypted flag marks fields for at-rest encryption in the database (an Enterprise feature). The referenceable flag means the value can be a Kong Vault reference like {vault://env/OPENAI_API_KEY} — integrating with Kong's secrets management system.
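As a sketch, a declarative-config fragment wiring ai-proxy to OpenAI with a vault-referenced key might look like the following (field layout follows the ai-proxy schema; the model name is illustrative, so verify against your Kong version):

```yaml
# Hypothetical ai-proxy configuration; the vault reference is resolved
# through Kong's secrets management at request time.
plugins:
  - name: ai-proxy
    config:
      route_type: llm/v1/chat
      auth:
        header_name: Authorization
        header_value: "Bearer {vault://env/OPENAI_API_KEY}"
      model:
        provider: openai
        name: gpt-4o
```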

Tip: For cloud providers like AWS Bedrock, you don't need to set explicit API keys. The adapter uses the standard AWS SDK credential chain — environment variables, IAM roles, or instance profiles. Just set AWS_REGION and ensure your Kong instance has the appropriate IAM permissions.

The AI Plugin Family

The ai-proxy plugin at kong/plugins/ai-proxy/handler.lua is remarkably compact — just 19 lines. That's because it delegates to a filter-based architecture built on the ai_plugin_base module:

local ai_plugin_base = require("kong.llm.plugin.base")

local NAME = "ai-proxy"
local PRIORITY = 770

local AIPlugin = ai_plugin_base.define(NAME, PRIORITY)

local SHARED_FILTERS = {
  "parse-request", "normalize-request", "enable-buffering",
  "normalize-response-header", "parse-sse-chunk", "normalize-sse-chunk",
  "parse-json-response", "normalize-json-response",
  "serialize-analytics",
}

for _, filter in ipairs(SHARED_FILTERS) do
  AIPlugin:enable(AIPlugin.register_filter(require("kong.llm.plugin.shared-filters." .. filter)))
end

return AIPlugin:as_kong_plugin()

The kong/llm/plugin/base.lua module provides a meta-plugin framework with its own "stages" system that maps onto Kong's phases:

local STAGES = {
  SETUP = 0,
  REQ_INTROSPECTION = 1,
  REQ_TRANSFORMATION = 2,
  REQ_POST_PROCESSING = 3,
  RES_INTROSPECTION = 4,
  RES_TRANSFORMATION = 5,
  STREAMING = 6,
  RES_PRE_PROCESSING = 7,
  RES_POST_PROCESSING = 8,
}

Each shared filter registers for specific stages. The parse-request filter runs during REQ_INTROSPECTION to decode the incoming request body. The normalize-request filter runs during REQ_TRANSFORMATION to translate from Kong's canonical format to the provider's format. The serialize-analytics filter runs during RES_POST_PROCESSING to emit usage metrics.
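While the real modules live in kong/llm/plugin/shared-filters/, the general shape of a filter can be sketched as follows (field names illustrative, not the verbatim interface):

```lua
-- Hypothetical shape of a shared filter module; the filter name is invented
-- and the real field names may differ:
local _M = {
  NAME = "log-model-name",
  STAGE = "REQ_INTROSPECTION",
}

function _M:run(conf)
  -- introspect or annotate the parsed request here;
  -- returning true lets the filter chain continue
  return true
end

-- (a real filter module would end with: return _M)
```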

This composable filter architecture allows different AI plugins (ai-proxy, ai-request-transformer, ai-response-transformer) to share common logic while implementing different high-level behaviors. The ai-proxy enables all the standard filters; the ai-request-transformer might enable only the request-side filters plus an LLM introspection filter.

sequenceDiagram
    participant Client
    participant AIProxy as ai-proxy (access)
    participant ParseReq as parse-request filter
    participant NormReq as normalize-request filter
    participant LLM as LLM Provider
    participant ParseRes as parse-json-response filter
    participant NormRes as normalize-json-response filter
    participant Analytics as serialize-analytics filter

    Client->>AIProxy: POST /llm/v1/chat
    AIProxy->>ParseReq: STAGE: REQ_INTROSPECTION
    ParseReq->>ParseReq: Decode JSON body
    ParseReq->>NormReq: STAGE: REQ_TRANSFORMATION
    NormReq->>NormReq: Transform to provider format
    NormReq->>LLM: Forward transformed request
    LLM-->>ParseRes: Provider response
    ParseRes->>ParseRes: Decode provider JSON
    ParseRes->>NormRes: STAGE: RES_TRANSFORMATION
    NormRes->>NormRes: Normalize to Kong format
    NormRes->>Analytics: STAGE: RES_POST_PROCESSING
    Analytics->>Analytics: Record token usage
    Analytics-->>Client: Normalized response

The observability module at kong/llm/plugin/observability.lua integrates with Kong's existing metrics infrastructure. Token counts, latencies, and costs are tracked per-request and exposed through Kong's standard logging plugins — so you can use http-log, datadog, or prometheus to monitor AI gateway traffic without any additional configuration.

The context module at kong/llm/plugin/ctx.lua provides namespaced per-request state management. Each AI plugin gets its own context namespace, preventing conflicts when multiple AI plugins run on the same request (e.g., ai-prompt-guard followed by ai-proxy).
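The namespacing idea itself is simple; a minimal sketch (illustrative only, since Kong's real implementation is tied to ngx.ctx) looks like this:

```lua
-- Minimal sketch of namespaced per-request state:
local store = {}

local function ctx_set(namespace, key, value)
  store[namespace] = store[namespace] or {}
  store[namespace][key] = value
end

local function ctx_get(namespace, key)
  return store[namespace] and store[namespace][key]
end

-- Two plugins can use the same key without clobbering each other:
ctx_set("ai-proxy", "model", "claude-3")
ctx_set("ai-prompt-guard", "model", "guard-internal")
print(ctx_get("ai-proxy", "model"))  -- claude-3
```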

The Full AI Request Flow

Here's how a complete AI gateway request flows through Kong:

  1. Client sends POST /ai/chat with an OpenAI-compatible request body
  2. Kong's router matches this to a Route configured with the ai-proxy plugin
  3. The ai-proxy's access handler runs:
    • parse-request: Decodes JSON, identifies format as llm/v1/chat
    • normalize-request: Selects the configured driver (e.g., anthropic), transforms the request
    • enable-buffering: Enables response buffering for non-streaming requests
  4. Kong's balancer sends the transformed request to the LLM provider endpoint
  5. The ai-proxy's response handlers run:
    • normalize-response-header: Adjusts content-type headers
    • parse-json-response or parse-sse-chunk: Parses the provider's response
    • normalize-json-response or normalize-sse-chunk: Translates back to Kong format
    • serialize-analytics: Records token usage, latency, cost
  6. The normalized response is sent to the client

For streaming responses, the STREAMING stage filters run on every SSE chunk in the body_filter phase; this is why the base module's REPEATED_PHASES table marks the streaming stage as repeatable.
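The per-chunk parsing idea can be sketched minimally as follows (illustrative only; the real parser in kong/llm/drivers/shared.lua handles many more cases):

```lua
-- Minimal SSE data-line parser sketch: extract "data:" payloads from a chunk,
-- skipping the stream terminator.
local function parse_sse_chunk(chunk)
  local events = {}
  for line in chunk:gmatch("[^\r\n]+") do
    local data = line:match("^data:%s*(.*)")
    if data and data ~= "[DONE]" then
      events[#events + 1] = data
    end
  end
  return events
end

local got = parse_sse_chunk('data: {"a":1}\n\ndata: [DONE]\n')
-- got == { '{"a":1}' }: one payload, terminator skipped
```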

Tip: The LLM schemas at kong/llm/schemas/init.lua define provider-specific options like bedrock_options_schema (AWS region override) and gemini_options_schema (Vertex AI project/location). These are optional — the adapter modules fall back to environment variables when schema options aren't set.

Series Conclusion

Over these seven articles, we've traced Kong from its Nginx foundation through initialization, request processing, plugin execution, schema-driven data management, distributed clustering, and AI gateway capabilities. The common thread throughout is Kong's commitment to a few powerful abstractions: phases for lifecycle management, schemas for data modeling, iterators for plugin execution, and drivers for provider abstraction.

Whether you're extending Kong with a custom plugin, debugging a production issue, or evaluating Kong for your API infrastructure, understanding these internals transforms Kong from a black box into a comprehensible system. The codebase is large — kong/init.lua alone is nearly 2,000 lines — but the patterns are consistent, the naming is clear, and the architecture rewards close reading.