From Hub to GPU: The Weight Loading Pipeline
Prerequisites
- Article 3: Model anatomy and class hierarchy
- PyTorch state_dict, meta device, and memory management
- safetensors file format basics
- Awareness of quantization concepts (INT8, INT4, GPTQ, AWQ)
You've seen how Transformers resolves "meta-llama/Llama-2-7b-hf" to LlamaForCausalLM and what that class looks like inside. But how do 14 GB of weights actually end up on your GPU in the right format? The from_pretrained() method on PreTrainedModel is a 60+ parameter orchestrator that handles safetensors downloading, meta device initialization, checkpoint format conversion, quantization, and multi-GPU dispatch. This article traces the full pipeline.
The from_pretrained() Orchestration
The from_pretrained() method signature reveals the scope of what it handles:
```python
def from_pretrained(
    cls,
    pretrained_model_name_or_path,
    *model_args,
    config=None,
    cache_dir=None,
    ignore_mismatched_sizes=False,
    force_download=False,
    local_files_only=False,
    token=None,
    revision="main",
    use_safetensors=None,
    weights_only=True,
    **kwargs,  # ~50 more via kwargs
):
```
The **kwargs hides a lot: device_map, torch_dtype, quantization_config, attn_implementation, low_cpu_mem_usage, and many more. Here's the high-level flow:
```mermaid
sequenceDiagram
    participant User
    participant FP as from_pretrained()
    participant Hub as HuggingFace Hub
    participant QA as Quantizer Auto-Select
    participant DM as Device Map
    participant Init as Meta Device Init
    participant WL as Weight Loading
    participant Post as Post-Load Hooks
    User->>FP: from_pretrained("meta-llama/Llama-2-7b-hf")
    FP->>Hub: Download config.json
    Hub-->>FP: Config
    FP->>FP: Resolve config class
    FP->>QA: Select quantizer (if any)
    QA-->>FP: HfQuantizer or None
    FP->>DM: Compute device_map
    DM-->>FP: {"model.layers.0": 0, ...}
    FP->>Init: Initialize model on meta device
    Init-->>FP: Empty model skeleton
    FP->>WL: Load weights shard by shard
    WL->>WL: Apply WeightConverter/WeightRenaming
    WL->>WL: Quantize if needed
    WL->>WL: Place on target device
    FP->>Post: Post-weight-loading hooks
    Post-->>User: Ready model
```
The method first resolves the config (via AutoConfig if needed), then enters the weight loading path. The LoadStateDictConfig dataclass bundles all the loading parameters into a single immutable object:
```python
@dataclass(frozen=True)
class LoadStateDictConfig:
    pretrained_model_name_or_path: str | None = None
    use_safetensors: bool | None = None
    ignore_mismatched_sizes: bool = False
    device_map: dict | None = None
    dtype: torch.dtype | None = None
    hf_quantizer: HfQuantizer | None = None
    weight_mapping: list[WeightConverter | WeightRenaming] | None = None
    # ...
```
This config is passed through the entire weight loading pipeline, avoiding the anti-pattern of threading individual parameters through a dozen functions.
core_model_loading.py: The Weight Conversion Engine
The core_model_loading.py module handles checkpoint format differences. Different versions of the same model may store weights under different names or in different tensor layouts. The WeightConverter and WeightRenaming dataclasses express these transformations declaratively.
WeightRenaming handles simple key renames:
```python
@dataclass(slots=True)
class WeightRenaming(WeightTransform):
    # Maps source_patterns → target_patterns
    # e.g., "model.layers.*.attn.qkv.weight" → "model.layers.*.self_attn.q_proj.weight"
    ...
```
WeightConverter handles actual tensor transformations via a pipeline of ConversionOps:
```mermaid
flowchart LR
    A["Checkpoint tensor<br/>qkv_proj.weight"] --> B["Chunk(dim=0)"]
    B --> C["q_proj.weight"]
    B --> D["k_proj.weight"]
    B --> E["v_proj.weight"]
```
The ConversionOps hierarchy includes Chunk (split tensors), Concatenate (merge tensors), and specialized ops for quantized formats. Each operation has a reverse_op property, enabling bidirectional conversion — important for both loading and saving.
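A toy version of the Chunk/Concatenate pairing makes the `reverse_op` contract concrete. The class names mirror the real ops, but this implementation is purely illustrative, with plain Python lists standing in for tensors:

```python
# Illustrative sketch of a ConversionOps-style pair, not the real classes.
class Concatenate:
    """Merge split pieces back into one tensor: the reverse of Chunk."""
    def convert(self, tensors):
        out = []
        for t in tensors:
            out.extend(t)
        return out

class Chunk:
    """Split a fused tensor into n equal pieces along dim 0."""
    def __init__(self, n):
        self.n = n

    def convert(self, tensor):
        size = len(tensor) // self.n
        return [tensor[i * size:(i + 1) * size] for i in range(self.n)]

    @property
    def reverse_op(self):
        return Concatenate()

qkv = list(range(12))            # stand-in for a fused qkv_proj.weight
q, k, v = Chunk(3).convert(qkv)  # three "tensors" for q/k/v projections
roundtrip = Chunk(3).reverse_op.convert([q, k, v])  # used when saving
```

The round-trip property is the point: the same declarative rule that splits a fused checkpoint on load can re-fuse it on save.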
The conversion system uses build_glob_alternation() to compile all weight patterns into a single regex, making pattern matching efficient even with hundreds of conversion rules.
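The idea behind the alternation trick can be sketched as follows (the helper below is illustrative, not the real `build_glob_alternation()`). Each glob pattern becomes a named group in one alternation, so a single regex match also identifies which rule fired:

```python
import re

# Illustrative sketch of compiling glob-style weight patterns into one regex.
def compile_patterns(patterns):
    parts = []
    for i, pat in enumerate(patterns):
        # "*" in a weight pattern stands for one dotted segment (e.g. a layer index)
        escaped = re.escape(pat).replace(r"\*", r"[^.]+")
        parts.append(f"(?P<p{i}>{escaped})")
    return re.compile("|".join(parts))

rx = compile_patterns([
    "model.layers.*.attn.qkv.weight",
    "lm_head.weight",
])
match = rx.fullmatch("model.layers.7.attn.qkv.weight")
# match.lastgroup tells us which conversion rule applies, in one pass
```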
Tip: Weight converters are defined per-model in the model's module (or in the config). If you're loading a checkpoint in a format different from the model's native format, check the model class for `_checkpoint_conversion_mapping` or `weight_mapping` attributes.
Quantizer Auto-Selection
When loading with quantization_config (or when the model was pre-quantized), Transformers selects the appropriate quantizer from 20+ backends. The AUTO_QUANTIZER_MAPPING in quantizers/auto.py maps method names to quantizer classes:
```python
AUTO_QUANTIZER_MAPPING = {
    "awq": AwqQuantizer,
    "bitsandbytes_4bit": Bnb4BitHfQuantizer,
    "bitsandbytes_8bit": Bnb8BitHfQuantizer,
    "gptq": GptqHfQuantizer,
    "quanto": QuantoHfQuantizer,
    # ... 20+ more
}
```
```mermaid
flowchart TD
    A["quantization_config provided?"] -->|Yes| B["Extract quant_method"]
    A -->|No| C["Check config.json<br/>for quantization_config"]
    C -->|Found| B
    C -->|Not found| D["No quantization"]
    B --> E["AUTO_QUANTIZER_MAPPING[method]"]
    E --> F["HfQuantizer subclass"]
    F --> G["validate_environment()"]
    G --> H["preprocess_model()"]
    H --> I["Weight loading with<br/>quantization hooks"]
    I --> J["postprocess_model()"]
```
The HfQuantizer base class defines lifecycle hooks that wrap the weight loading process:
- `validate_environment()` — check that required packages are installed
- `update_device_map()` — some quantizers force `device_map="auto"`
- `_process_model_before_weight_loading()` — replace modules with quantized variants on meta device
- `_process_model_after_weight_loading()` — finalize quantized weights
- `param_needs_quantization()` — per-parameter decision on what to quantize
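The lifecycle is easiest to see as a wrapper around weight loading. A minimal sketch, where `NullQuantizer` and `load_with_quantizer` are illustrative stand-ins rather than real Transformers code:

```python
# Illustrative stand-in for an HfQuantizer subclass; it only records
# the order in which its lifecycle hooks fire.
class NullQuantizer:
    def __init__(self):
        self.calls = []

    def validate_environment(self):
        self.calls.append("validate_environment")

    def _process_model_before_weight_loading(self, model):
        self.calls.append("before_weight_loading")
        return model

    def _process_model_after_weight_loading(self, model):
        self.calls.append("after_weight_loading")
        return model

def load_with_quantizer(model, quantizer):
    quantizer.validate_environment()
    model = quantizer._process_model_before_weight_loading(model)
    # ... weights load here, consulting param_needs_quantization() per tensor ...
    return quantizer._process_model_after_weight_loading(model)

q = NullQuantizer()
load_with_quantizer(object(), q)
```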
The get_keys_to_not_convert() function automatically excludes the LM head and tied weights from quantization — keeping the output layer in full precision for numerical stability.
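A minimal sketch of that exclusion logic, where `keys_to_skip` is a hypothetical helper, not the real `get_keys_to_not_convert()`:

```python
# Illustrative: keep the output head, and anything tied to it, in full precision.
def keys_to_skip(param_names, tied_groups):
    skip = {n for n in param_names if "lm_head" in n}
    for group in tied_groups:  # e.g. [["lm_head.weight", "model.embed_tokens.weight"]]
        if skip.intersection(group):
            skip.update(group)  # a weight tied to the head must also stay unquantized
    return skip

params = [
    "model.embed_tokens.weight",
    "model.layers.0.mlp.up_proj.weight",
    "lm_head.weight",
]
tied = [["lm_head.weight", "model.embed_tokens.weight"]]
skipped = keys_to_skip(params, tied)
```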
Device Maps and Multi-GPU Loading
For models too large to fit on a single GPU, Transformers integrates with Accelerate's device map system. The integrations/accelerate.py module provides the bridge.
When you pass device_map="auto", the library:
- Estimates model size from the config (without loading weights)
- Queries available GPU memory
- Runs `infer_auto_device_map()` to assign each module to a device
- Respects `_no_split_modules` — decoder layers stay whole on one device
```mermaid
flowchart TD
    A["device_map='auto'"] --> B["Estimate model size<br/>from config"]
    B --> C["Query GPU memory"]
    C --> D["infer_auto_device_map()"]
    D --> E{"Model fits on<br/>single GPU?"}
    E -->|Yes| F["All on GPU 0"]
    E -->|No| G["Split across GPUs"]
    G --> H{"Still too large?"}
    H -->|Yes| I["Offload to CPU"]
    I --> J{"CPU OOM?"}
    J -->|Yes| K["Offload to disk"]
    J -->|No| L["device_map ready"]
    H -->|No| L
    F --> L
```
The device map is a dict mapping parameter names to devices: {"model.embed_tokens": 0, "model.layers.0": 0, "model.layers.16": 1, "lm_head": "cpu"}. During weight loading, each tensor is placed directly on its target device, avoiding the need to ever hold the full model in CPU memory.
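The assignment itself is a greedy fill. The sketch below is illustrative only; the real logic in Accelerate's `infer_auto_device_map()` additionally handles module granularity, `_no_split_modules`, and offload buffers:

```python
# Illustrative greedy device assignment (not Accelerate's implementation).
def assign_devices(module_sizes, max_memory):
    """module_sizes: ordered {module: bytes}; max_memory: {device: bytes}."""
    devices = list(max_memory)  # e.g. [0, 1, "cpu", "disk"], in priority order
    free = dict(max_memory)
    device_map, i = {}, 0
    for name, size in module_sizes.items():
        # When the current device is full, spill to the next one in line.
        while i < len(devices) - 1 and size > free[devices[i]]:
            i += 1
        device_map[name] = devices[i]
        free[devices[i]] -= size
    return device_map

sizes = {"model.embed_tokens": 4, "model.layers.0": 6, "model.layers.1": 6, "lm_head": 4}
dm = assign_devices(sizes, {0: 10, 1: 10, "cpu": 100})
```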
safetensors and Shard-by-Shard Loading
Modern checkpoints use the safetensors format, which supports memory-mapped access and is immune to pickle-based attacks. Weights are typically sharded across multiple files (e.g., model-00001-of-00003.safetensors).
The loading process reads the shard index (model.safetensors.index.json) to know which parameters live in which shard, then loads shards one at a time. This is critical for large models — rather than loading all 14 GB into CPU memory and then dispatching, each shard's tensors go directly to their target device:
```mermaid
sequenceDiagram
    participant FP as from_pretrained
    participant Idx as Shard Index
    participant S1 as Shard 1
    participant S2 as Shard 2
    participant GPU0 as GPU 0
    participant GPU1 as GPU 1
    FP->>Idx: Read index.json
    Idx-->>FP: param → shard mapping
    FP->>S1: Load shard 1 (memory-mapped)
    S1-->>FP: layers 0-15 weights
    FP->>GPU0: Place layers 0-15
    Note over FP: Shard 1 released from memory
    FP->>S2: Load shard 2 (memory-mapped)
    S2-->>FP: layers 16-31 weights
    FP->>GPU1: Place layers 16-31
```
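The grouping step can be sketched in a few lines. Only the index file layout (a `weight_map` from parameter name to shard file) is real; the loader code is illustrative:

```python
import json

# A tiny stand-in for model.safetensors.index.json; weight_map is the real key.
index = json.loads("""
{
  "weight_map": {
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00002.safetensors"
  }
}
""")

# Group parameters by the shard that contains them ...
shards = {}
for param, shard in index["weight_map"].items():
    shards.setdefault(shard, []).append(param)

# ... then load one shard at a time, so only that shard is ever resident.
load_order = []
for shard in sorted(shards):
    # e.g. open the shard memory-mapped and place each tensor per device_map
    load_order.extend(shards[shard])
```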
The use_safetensors parameter defaults to None, which means auto-detect: prefer safetensors if available, fall back to PyTorch .bin files. The safetensors_conversion.py module handles automatic conversion from .bin to safetensors format when saving.
Tip: Set `SAFETENSORS_FAST_GPU=1` as an environment variable to enable direct GPU loading of safetensors files, bypassing CPU entirely. This requires the `safetensors` library version ≥ 0.4.0 and can significantly speed up loading on systems with fast NVMe storage.
The Loading Report
After all weights are loaded, from_pretrained() logs a detailed report of what happened: which keys were loaded, which were missing (expected for tied weights), which were unexpected, and which had size mismatches. The LoadStateDictInfo dataclass collects this information throughout the loading process, and the final report helps diagnose issues when fine-tuning checkpoints diverge from the base model structure.
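The report boils down to set arithmetic over parameter names. A sketch, where `loading_report` is a hypothetical helper rather than the real bookkeeping in `LoadStateDictInfo`:

```python
# Illustrative: derive missing/unexpected keys by comparing the model's
# expected parameter names against what the checkpoint actually contains.
def loading_report(expected_keys, checkpoint_keys):
    expected, found = set(expected_keys), set(checkpoint_keys)
    return {
        "missing": sorted(expected - found),     # in the model, not the checkpoint
        "unexpected": sorted(found - expected),  # in the checkpoint, not the model
    }

report = loading_report(
    expected_keys=["model.embed_tokens.weight", "lm_head.weight"],
    checkpoint_keys=["model.embed_tokens.weight", "classifier.bias"],
)
# A missing lm_head.weight is normal when it is tied to the embeddings.
```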
The method ends with a call to model.eval() — a deliberate default that prevents accidental training behavior (like dropout) during inference. If you're fine-tuning, you'll need to call model.train() explicitly.
Directory Map
| File | Purpose |
|---|---|
| `src/transformers/modeling_utils.py` | `from_pretrained()`, `LoadStateDictConfig` |
| `src/transformers/core_model_loading.py` | `WeightConverter`, `WeightRenaming`, conversion ops |
| `src/transformers/quantizers/auto.py` | `AUTO_QUANTIZER_MAPPING`, quantizer selection |
| `src/transformers/quantizers/base.py` | `HfQuantizer` base class with lifecycle hooks |
| `src/transformers/integrations/accelerate.py` | Device map computation, Accelerate bridge |
| `src/transformers/safetensors_conversion.py` | Format conversion utilities |
Now that we have a fully loaded model, it's time to generate text. In the next article, we'll dive into the generate() method — a 1700-line orchestrator that manages decoding strategies, KV-caches, logits processing, speculative decoding, and token streaming.