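Pipelines, Tokenizers, and the Transformers Extension Mechanisms

In this series we started from the lazy-loading import system and worked through model resolution, weight loading, text generation, and training. This final installment looks at the layers above and below the model: the high-level pipeline() API, which turns inference into a single line of code; the tokenizer hierarchy, which handles text preprocessing across three backends; ProcessorMixin for multimodal models; and the extension points you need in order to contribute a new model to the library.

The pipeline() factory function

pipeline() is the most concise API in Transformers. A single call resolves a task string to the matching Pipeline subclass, selects a model and tokenizer automatically, and returns an object you can call directly: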

from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
output = generator("Once upon a time")  # a list of dicts, each with a "generated_text" key

flowchart TD
    A["pipeline('text-generation',<br/>model='...')"] --> B["Resolve task → Pipeline class"]
    B --> C["PipelineRegistry lookup"]
    C --> D["TextGenerationPipeline"]
    D --> E["Auto-select model<br/>(AutoModelForCausalLM)"]
    E --> F["Auto-select tokenizer<br/>(AutoTokenizer)"]
    F --> G["Load model<br/>from_pretrained()"]
    G --> H["Return configured<br/>Pipeline instance"]

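The function signature accepts task, model, tokenizer, config, processor, device, device_map, dtype, and trust_remote_code, among others. If you pass only task, the library falls back to that task's default model; if you pass only a model ID, the task is inferred from the model's metadata on the Hub (its pipeline tag).

Resolution relies on a PipelineRegistry, which maps task strings such as "text-generation", "text-classification", and "image-to-text" to the corresponding Pipeline subclass and its default model. Task aliases are supported as well: "sentiment-analysis" resolves to TextClassificationPipeline.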
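The lookup described above can be sketched with a small dictionary-backed registry. This is an illustrative stand-in, not the library's PipelineRegistry: the class name ToyPipelineRegistry, its methods, and the placeholder model IDs are all assumptions made for the example.

```python
class Pipeline:
    """Hypothetical stand-in for the library's Pipeline base class."""


class TextClassificationPipeline(Pipeline):
    pass


class TextGenerationPipeline(Pipeline):
    pass


class ToyPipelineRegistry:
    """Maps task strings (and their aliases) to a pipeline class and a default model."""

    def __init__(self):
        self._tasks = {}    # canonical task name -> entry
        self._aliases = {}  # alias -> canonical task name

    def register(self, task, pipeline_class, default_model, aliases=()):
        self._tasks[task] = {"impl": pipeline_class, "default": default_model}
        for alias in aliases:
            self._aliases[alias] = task

    def check_task(self, task):
        # Normalize aliases first, then look up the canonical name.
        canonical = self._aliases.get(task, task)
        if canonical not in self._tasks:
            raise KeyError(f"Unknown task {task!r}; known tasks: {sorted(self._tasks)}")
        return canonical, self._tasks[canonical]


registry = ToyPipelineRegistry()
registry.register(
    "text-classification",
    TextClassificationPipeline,
    default_model="placeholder/classifier",  # hypothetical model ID
    aliases=("sentiment-analysis",),
)
registry.register(
    "text-generation",
    TextGenerationPipeline,
    default_model="placeholder/generator",  # hypothetical model ID
)

task, entry = registry.check_task("sentiment-analysis")
```

Keeping aliases in a separate table means every alias resolves to exactly one canonical entry, so defaults only have to be declared once per task.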

The Pipeline base class: preprocess → _forward → postprocess

Every pipeline inherits from Pipeline, which is built around a three-step processing pattern:

sequenceDiagram
    participant User
    participant Pipe as Pipeline.__call__
    participant Pre as preprocess()
    participant Fwd as _forward()
    participant Post as postprocess()

    User->>Pipe: pipe("Hello world")
    Pipe->>Pre: preprocess("Hello world")
    Pre-->>Pipe: model_inputs (dict of tensors)
    Pipe->>Fwd: _forward(model_inputs)
    Note over Fwd: model(**inputs)<br/>with torch.no_grad()
    Fwd-->>Pipe: model_outputs
    Pipe->>Post: postprocess(model_outputs)
    Post-->>User: [{"generated_text": "..."}]

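The base class's Pipeline.__call__ handles batching, chunking of long inputs, and the no-grad context. Subclasses override the following three methods (plus _sanitize_parameters, which routes keyword arguments to each step):

preprocess(): converts raw input (strings, images, audio) into the tensors the model expects
_forward(): runs the model's forward pass (typically self.model(**inputs))
postprocess(): converts the model outputs into a human-readable result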
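The three-step contract can be illustrated without any model at all. The sketch below is a toy under stated assumptions: ToyPipeline, WordCountPipeline, and toy_model are hypothetical names, real batching and the no-grad context are omitted, and the "model" is a plain function.

```python
class ToyPipeline:
    """Minimal sketch of the preprocess -> _forward -> postprocess pattern."""

    def __init__(self, model):
        self.model = model

    def preprocess(self, raw):
        raise NotImplementedError

    def _forward(self, model_inputs):
        # Toy default: call the model with the preprocessed inputs as keyword arguments.
        return self.model(**model_inputs)

    def postprocess(self, model_outputs):
        raise NotImplementedError

    def run_single(self, raw):
        model_inputs = self.preprocess(raw)
        model_outputs = self._forward(model_inputs)
        return self.postprocess(model_outputs)

    def __call__(self, inputs):
        # A list in gives a list out; a single item in gives a single result out.
        if isinstance(inputs, list):
            return [self.run_single(item) for item in inputs]
        return self.run_single(inputs)


def toy_model(tokens):
    """Stand-in for a neural network: 'scores' a token list by its length."""
    return {"score": len(tokens)}


class WordCountPipeline(ToyPipeline):
    def preprocess(self, raw):
        return {"tokens": raw.lower().split()}

    def postprocess(self, model_outputs):
        return {"num_tokens": model_outputs["score"]}


pipe = WordCountPipeline(toy_model)
single = pipe("Hello world")
many = pipe(["a b c", "d"])
```

Because the orchestration lives in the base class, the subclass contains only task-specific logic, which is the property that makes pipelines easy to add.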

Class-level flags control which preprocessing components get loaded:

class TextGenerationPipeline(Pipeline):
    _load_tokenizer = True
    _load_image_processor = False
    _load_feature_extractor = False
    _pipeline_calls_generate = True  # Uses model.generate() instead of model()

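The flag _pipeline_calls_generate = True is the important one: the text-generation pipeline calls model.generate() (Part 5) rather than running a bare forward pass, so it inherits all of the decoding, caching, and streaming capabilities.

Tip: For batched inference in production, call pipe(texts, batch_size=16). The pipeline handles padding, collation, and batch iteration internally. For streaming output, pass a TextStreamer through the pipeline's generation kwargs.

The tokenizer hierarchy

The tokenization stack in Transformers has gone through several architectural revisions. The current design exposes one unified public interface backed by three different backends: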

classDiagram
    class PreTrainedTokenizerBase {
        +__call__(text, ...)
        +encode(text)
        +decode(ids)
        +from_pretrained()
        +save_pretrained()
        +padding_side: str
        +model_max_length: int
    }
    class PreTrainedTokenizerPython {
        «Pure Python BPE/WordPiece»
        +_tokenize(text) → list[str]
        +_convert_token_to_id(token) → int
    }
    class PreTrainedTokenizerTokenizers {
        «Rust tokenizers backend»
        +_tokenizer: tokenizers.Tokenizer
        +_batch_encode_plus()
    }
    class PreTrainedTokenizerSentencePiece {
        «SentencePiece backend»
        +sp_model: SentencePieceProcessor
    }
    PreTrainedTokenizerBase <|-- PreTrainedTokenizerPython
    PreTrainedTokenizerBase <|-- PreTrainedTokenizerTokenizers
    PreTrainedTokenizerBase <|-- PreTrainedTokenizerSentencePiece

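PreTrainedTokenizerBase defines the shared __call__ interface and handles padding, truncation, attention-mask generation, and the choice of return type (pt / np / tf). The class runs to roughly 2,000 lines and covers every combination that batch encoding has to support.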
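To make the padding, truncation, and attention-mask logic concrete, here is a minimal pure-Python sketch. The function pad_batch and its parameter names are assumptions for illustration; the real class supports many more strategies and returns tensors rather than lists.

```python
def pad_batch(batch_ids, pad_id=0, max_length=None, padding_side="right"):
    """Truncate each sequence to max_length, then pad to the longest one in the batch."""
    if padding_side not in ("left", "right"):
        raise ValueError(f"padding_side must be 'left' or 'right', got {padding_side!r}")
    if max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]

    target = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        padding = [pad_id] * (target - len(ids))
        real = [1] * len(ids)          # 1 marks a real token
        ignored = [0] * len(padding)   # 0 marks padding the model should ignore
        if padding_side == "right":
            input_ids.append(ids + padding)
            attention_mask.append(real + ignored)
        else:
            input_ids.append(padding + ids)
            attention_mask.append(ignored + real)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


right_padded = pad_batch([[5, 6, 7, 8], [9, 10]], max_length=3)
left_padded = pad_batch([[1], [2, 3]], padding_side="left")
```

Left padding matters for decoder-only generation, where new tokens are appended on the right and padding there would sit between the prompt and the continuation.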

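The three backends are:

tokenization_python.py: pure-Python implementations of BPE, WordPiece, and Unigram. The slowest option, but with no external dependencies.
tokenization_utils_tokenizers.py: wraps the Rust tokenizers library, runs roughly 10–100x faster, and is the default "fast" tokenizer.
tokenization_utils_sentencepiece.py: wraps Google's SentencePiece library, used by LLaMA, T5, and many other models.

As described in Part 1, the module alias system maps tokenization_utils_fast to tokenization_utils_tokenizers for backward compatibility. AutoTokenizer picks the fastest backend available: when the tokenizers library is installed, it prefers the Rust backend.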
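The "fastest available backend" rule can be sketched as a preference list checked against installed packages. This is a simplified illustration, not AutoTokenizer's actual logic: the function names are hypothetical, and placing SentencePiece second in the order is an assumption (the text above only states that the Rust backend comes first).

```python
import importlib.util

# (package to probe, backend label), in order of preference.
# Only "Rust first" comes from the article; the rest of the order is assumed.
BACKEND_PREFERENCE = [
    ("tokenizers", "rust"),
    ("sentencepiece", "sentencepiece"),
]


def is_installed(package):
    """True if the package can be imported, without actually importing it."""
    return importlib.util.find_spec(package) is not None


def pick_backend(available=is_installed):
    """Return the first backend whose package is available, else the pure-Python one."""
    for package, backend in BACKEND_PREFERENCE:
        if available(package):
            return backend
    return "python"


# Depends on the local environment; one of "rust", "sentencepiece", "python".
backend_here = pick_backend()
```

Passing the availability check in as a parameter keeps the selection rule testable without installing or uninstalling anything.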

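Tip: Always use AutoTokenizer.from_pretrained(model_name) rather than instantiating a model-specific tokenizer class directly. The Auto class takes care of backend selection and ensures compatibility with the model's vocabulary.

ProcessorMixin for multimodal models

Modern models such as LLaVA, Whisper, and CLIP take more than text as input. ProcessorMixin provides a unified __call__ interface that combines a tokenizer with one or more modality-specific processors: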

classDiagram
    class ProcessorMixin {
        +tokenizer: PreTrainedTokenizerBase
        +image_processor: BaseImageProcessor
        +feature_extractor: FeatureExtractionMixin
        +__call__(*args, **kwargs)
        +from_pretrained()
        +save_pretrained()
    }
    class LlavaProcessor {
        +tokenizer: LlamaTokenizer
        +image_processor: CLIPImageProcessor
        +__call__(text, images)
    }
    class WhisperProcessor {
        +tokenizer: WhisperTokenizer
        +feature_extractor: WhisperFeatureExtractor
        +__call__(audio, text)
    }
    ProcessorMixin <|-- LlavaProcessor
    ProcessorMixin <|-- WhisperProcessor

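A processor's __call__ method typically does the following:

Runs the images or audio through the matching modality processor
Tokenizes the text
Merges both into a single BatchFeature dictionary with the correct shapes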
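The steps above can be sketched with toy components. Everything here is a hypothetical stand-in: ToyTokenizer encodes each word as its length, ToyImageProcessor only rescales pixel values, and a plain dict plays the role of BatchFeature.

```python
class ToyTokenizer:
    """Stand-in tokenizer: encodes each word as its character count."""

    def __call__(self, text):
        ids = [len(word) for word in text.split()]
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}


class ToyImageProcessor:
    """Stand-in image processor: rescales 0-255 pixel values into [0, 1]."""

    def __call__(self, image):
        return {"pixel_values": [[pixel / 255 for pixel in row] for row in image]}


class ToyProcessor:
    """Combines one tokenizer and one image processor behind a single __call__."""

    def __init__(self, tokenizer, image_processor):
        self.tokenizer = tokenizer
        self.image_processor = image_processor

    def __call__(self, text=None, images=None):
        if text is None and images is None:
            raise ValueError("Provide at least one of text or images.")
        features = {}
        if images is not None:
            features.update(self.image_processor(images))  # step 1: modality processor
        if text is not None:
            features.update(self.tokenizer(text))          # step 2: tokenization
        return features                                    # step 3: merged dictionary


processor = ToyProcessor(ToyTokenizer(), ToyImageProcessor())
batch = processor(text="describe this image", images=[[0, 255], [51, 102]])
```

The caller gets one dictionary whose keys line up with the model's forward arguments, regardless of how many modalities were involved.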

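ProcessorMixin also handles serialization: save_pretrained() writes out the configuration of the tokenizer and of each modality processor together, and from_pretrained() loads them back as a unit.

Extension points: adding a new model

Pulling together what this series has covered, here is the full checklist for contributing a new model to Transformers: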

flowchart TD
    A["1. Create config<br/>with model_type"] --> B["2. Create model<br/>inheriting PreTrainedModel"]
    B --> C["3. Create tokenizer<br/>(or reuse existing)"]
    C --> D["4. Register in Auto mappings<br/>CONFIG_MAPPING_NAMES<br/>MODEL_FOR_*_MAPPING_NAMES"]
    D --> E["5. Add __all__ exports<br/>to each .py file"]
    E --> F["6. Add TYPE_CHECKING block<br/>in model __init__.py"]
    F --> G["7. Write tests<br/>and documentation"]

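Step 1: Configuration. Create configuration_<model>.py and define a class that sets model_type and declares every hyperparameter as a typed field with a default value. Use @strict for validation. If the model supports parallelism, also declare base_model_tp_plan and base_model_pp_plan.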
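The shape of such a config can be approximated with a standard-library dataclass. This sketch deliberately does not use the library's base class or @strict; the hand-written __post_init__ check is a stand-in for that validation, and ToyModelConfig with its field names is hypothetical.

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class ToyModelConfig:
    # No annotation, so this is a plain class attribute, not a dataclass field.
    model_type = "toy_model"

    # Hyperparameters: typed fields with defaults.
    vocab_size: int = 32000
    hidden_size: int = 512
    num_attention_heads: int = 8
    rms_norm_eps: float = 1e-6

    def __post_init__(self):
        # Stand-in for decorator-based validation: check each field against its annotation.
        for name, expected in get_type_hints(type(self)).items():
            value = getattr(self, name)
            if not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )
        # A cross-field constraint of the kind a config validator might enforce.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")


default_config = ToyModelConfig()
small_config = ToyModelConfig(hidden_size=256, num_attention_heads=4)
```

Declaring fields with types and defaults lets the validation be generic: it reads the annotations instead of repeating each field name in a hand-written checker.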

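Step 2: Model. Create modeling_<model>.py, inherit from PreTrainedModel, set the class flags (_supports_sdpa and so on), and implement attention dispatch through ALL_ATTENTION_FUNCTIONS.get_interface(). Decoder layers use GradientCheckpointingLayer. Task heads can reuse the generic classes (GenericForSequenceClassification and others) through multiple inheritance.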
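Attention dispatch boils down to looking up a callable by name at forward time. The sketch below shows that idea in pure Python under loud assumptions: TOY_ATTENTION_FUNCTIONS, get_attention, and ToyAttentionLayer are hypothetical, the fallback-to-default behavior is assumed rather than taken from the library, and the attention function handles a single query vector instead of batched tensors.

```python
import math


def eager_attention(query, keys, values):
    """Scaled dot-product attention for one query vector over lists of key/value vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    weights = [weight / total for weight in weights]
    dim = len(values[0])
    return [sum(w * value[d] for w, value in zip(weights, values)) for d in range(dim)]


# Name -> implementation. Other backends would be registered under their own names.
TOY_ATTENTION_FUNCTIONS = {"eager": eager_attention}


def get_attention(name, default="eager"):
    """Look up an implementation by name, falling back to the default (assumed behavior)."""
    return TOY_ATTENTION_FUNCTIONS.get(name, TOY_ATTENTION_FUNCTIONS[default])


class ToyAttentionLayer:
    def __init__(self, attn_implementation="eager"):
        self.attn_implementation = attn_implementation

    def forward(self, query, keys, values):
        # The layer never hard-codes a backend; it resolves one on every call.
        attention_fn = get_attention(self.attn_implementation)
        return attention_fn(query, keys, values)


layer = ToyAttentionLayer()
# Identical keys give uniform weights, so the output is the mean of the values.
output = layer.forward([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]])
```

Because the model code only names an implementation, a new backend can be added by registering a function rather than editing every modeling file.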

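Step 3: Tokenizer. Create a new tokenizer or reuse an existing implementation. Most modern LLMs simply reuse LlamaTokenizer or another SentencePiece-based tokenizer.

Step 4: Auto registration. Under models/auto/, add the new model to CONFIG_MAPPING_NAMES and to the relevant MODEL_FOR_*_MAPPING_NAMES dictionaries. This is the only place that needs manual registration: the lazy import system (Part 1) discovers everything else automatically through the __all__ exports.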
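The Auto registration step relies on mappings that store names as strings and import the class only when it is first requested. Here is a minimal sketch of that idea; ToyLazyMapping and the mapping contents are hypothetical, and standard-library classes stand in for config classes so the example runs anywhere.

```python
import importlib
from collections import OrderedDict

# model_type -> (module path, class name). In this toy, stdlib classes play the
# role of config classes; the keys are made-up model types.
TOY_CONFIG_MAPPING_NAMES = OrderedDict(
    [
        ("fraction_model", ("fractions", "Fraction")),
        ("decimal_model", ("decimal", "Decimal")),
    ]
)


class ToyLazyMapping:
    """Resolves string entries to classes on first access and caches the result."""

    def __init__(self, names):
        self._names = names
        self._cache = {}

    def __contains__(self, key):
        return key in self._names or key in self._cache

    def __getitem__(self, key):
        if key not in self._cache:
            module_name, class_name = self._names[key]
            module = importlib.import_module(module_name)  # deferred import
            self._cache[key] = getattr(module, class_name)
        return self._cache[key]

    def register(self, key, cls):
        """Add an already-imported class, e.g. for a model defined outside the library."""
        self._cache[key] = cls


config_mapping = ToyLazyMapping(TOY_CONFIG_MAPPING_NAMES)
resolved = config_mapping["fraction_model"]
```

Storing strings instead of classes is what keeps importing the mapping itself cheap: no model module is touched until a lookup actually needs it.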

trust_remote_code: dynamic module loading

For models that have not yet been merged into Transformers, the dynamic_module_utils.py module can load arbitrary Python code from a Hub repository:

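from transformers import AutoModelForCausalLM

# trust_remote_code=True allows the repository's own Python files to be executed
model = AutoModelForCausalLM.from_pretrained(
    "org/new-model",
    trust_remote_code=True,
)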

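This mechanism downloads the model's Python files from the Hub, imports them under a dedicated dynamic-module namespace, and registers the resulting classes with the Auto system. The namespace isolates names, not privileges: the downloaded code runs with the full permissions of your process, which is why the flag must be set explicitly. The auto_map field in config.json tells Transformers which remote class each Auto class should use: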

{
    "model_type": "new_model",
    "auto_map": {
        "AutoConfig": "configuration_new_model.NewModelConfig",
        "AutoModelForCausalLM": "modeling_new_model.NewModelForCausalLM"
    }
}
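Resolving an auto_map entry amounts to splitting the "module.ClassName" reference and importing that module from a file. The sketch below imitates this with a temporary directory standing in for a downloaded repository. The function load_class_from_auto_map and the toy_dynamic_modules namespace are hypothetical; the library's own loader additionally handles downloading, caching, and relative imports.

```python
import importlib.util
import json
import sys
import tempfile
from pathlib import Path


def load_class_from_auto_map(repo_dir, auto_class_name):
    """Read config.json, follow its auto_map entry, and import the referenced class."""
    config = json.loads((Path(repo_dir) / "config.json").read_text())
    reference = config["auto_map"][auto_class_name]  # e.g. "modeling_x.XForCausalLM"
    module_name, class_name = reference.rsplit(".", 1)

    # Import the file under a dedicated namespace so it cannot shadow real packages.
    qualified_name = f"toy_dynamic_modules.{module_name}"
    spec = importlib.util.spec_from_file_location(
        qualified_name, Path(repo_dir) / f"{module_name}.py"
    )
    module = importlib.util.module_from_spec(spec)
    sys.modules[qualified_name] = module
    spec.loader.exec_module(module)  # runs the file's code with full privileges
    return getattr(module, class_name)


with tempfile.TemporaryDirectory() as repo:
    # Fake "Hub repository": a config plus one modeling file.
    Path(repo, "config.json").write_text(
        json.dumps(
            {
                "model_type": "new_model",
                "auto_map": {
                    "AutoModelForCausalLM": "modeling_new_model.NewModelForCausalLM"
                },
            }
        )
    )
    Path(repo, "modeling_new_model.py").write_text(
        "class NewModelForCausalLM:\n    family = 'new_model'\n"
    )
    model_class = load_class_from_auto_map(repo, "AutoModelForCausalLM")
```

Note that exec_module simply executes the file, which is the reason remote code should only be loaded from repositories you trust.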

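CLI scaffolding tool

The cli/add_new_model_like.py script generates the boilerplate for a new model by copying and adapting an existing one. It takes care of the tedious, repetitive work: creating the directory structure, renaming classes, updating the Auto mappings, and generating test files.

Declaring parallel plans in the config

As described in Part 3, LLaMA declares its tensor-parallel plan directly in the config: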

base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}

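This means enabling tensor parallelism requires no code changes: you call model.tensor_parallel(device_mesh), and the library shards each layer column-wise or row-wise according to the plan. Similarly, base_model_pp_plan declares which layers belong to which pipeline-parallel stage and spells out the input and output tensors explicitly.

The complete architecture

To close out the series, let's look at how all the pieces fit together: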
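How a declarative plan turns into sharding decisions can be sketched in two small functions: one matches a module name against the wildcard patterns, the other slices a weight. Both are illustrative assumptions rather than the library's code. The sketch assumes "*" stands for a numeric layer index and that weights use the nn.Linear layout [out_features][in_features], under which column-wise sharding splits the output dimension and row-wise sharding splits the input dimension.

```python
import re

TOY_TP_PLAN = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.down_proj": "rowwise",
}


def resolve_style(module_name, plan):
    """Return the sharding style for a module name, or None if no pattern matches."""
    for pattern, style in plan.items():
        # Assumption: "*" matches a numeric layer index such as the 3 in "layers.3".
        regex = re.escape(pattern).replace(r"\*", r"\d+")
        if re.fullmatch(regex, module_name):
            return style
    return None


def shard(weight, style, rank, world_size):
    """Slice a weight stored as a list of rows: [out_features][in_features]."""
    if style == "colwise":
        # Split the output dimension: each rank keeps a contiguous block of rows.
        step = len(weight) // world_size
        return weight[rank * step:(rank + 1) * step]
    if style == "rowwise":
        # Split the input dimension: each rank keeps a slice of every row.
        step = len(weight[0]) // world_size
        return [row[rank * step:(rank + 1) * step] for row in weight]
    return weight  # no plan entry: the weight stays replicated


style = resolve_style("layers.3.self_attn.q_proj", TOY_TP_PLAN)
q_proj_shard = shard([[1, 2], [3, 4]], style, rank=0, world_size=2)
```

Pairing a column-wise projection with a row-wise one, as the plan does for q_proj and o_proj, lets the intermediate activation stay sharded so that only the second layer's output needs to be combined across ranks.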

flowchart TD
    subgraph "User API"
        A["pipeline()"] 
        B["AutoModelForCausalLM"]
        C["Trainer"]
    end

    subgraph "Resolution Layer"
        D["_LazyModule"]
        E["Auto Mappings"]
        F["CONFIG_MAPPING"]
    end

    subgraph "Model Layer"
        G["PreTrainedModel"]
        H["AttentionInterface"]
        I["Generic Head Classes"]
    end

    subgraph "Weight Layer"
        J["from_pretrained()"]
        K["WeightConverter"]
        L["HfQuantizer"]
    end

    subgraph "Generation"
        M["GenerationMixin"]
        N["Cache System"]
        O["LogitsProcessors"]
    end

    subgraph "Training"
        P["Trainer Loop"]
        Q["Callbacks"]
        R["Distributed Backends"]
    end

    subgraph "Preprocessing"
        S["Tokenizers"]
        T["ProcessorMixin"]
    end

    A --> E
    B --> E
    E --> D
    D --> G
    G --> H
    G --> I
    J --> K
    J --> L
    G --> M
    M --> N
    M --> O
    C --> P
    P --> Q
    P --> R
    A --> S
    A --> T

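Every box in the diagram corresponds to code we analyzed in detail during this series. The import system (Part 1) laid the foundation; the Auto classes (Part 2) provided dispatch; the model structure (Part 3) implemented the computation; weight loading (Part 4) handled initialization; text generation (Part 5) powered inference; the Trainer (Part 6) supported training; and this part covered the user-facing APIs and extension points that tie everything together.

Conclusion

Transformers is a library that tames enormous complexity through consistent patterns: pervasive lazy loading, declarative configuration, composable pipelines, and progressive delegation from high-level APIs down to low-level implementations. Reading the codebase is rewarding in its own right: nearly every design decision has a clear rationale rooted in the practical constraints of supporting 450+ architectures, more than ten backends, and millions of users.

The patterns we traced throughout the series (_LazyModule for deferred imports, _LazyAutoMapping for deferred class resolution, AttentionInterface for pluggable backends, GeneralInterface for two-level dispatch, the generic head classes for code reuse, and WeightConverter for checkpoint compatibility) all transfer well beyond this library. Understanding them makes you not only a better Transformers user but also a better systems programmer.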
