From Userspace to Kernel: Syscall Entry and the VFS Layer

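When a userspace program calls read(), a chain of events follows: the CPU crosses a privilege boundary, the entry code runs through Spectre mitigations, the call is dispatched through generated dispatch code, and it finally reaches a filesystem-specific implementation by way of the VFS, the kernel's most central abstraction layer. This article traces that whole path, from the x86-64 SYSCALL instruction to the moment data is read from disk.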

In Part 3 we saw how the scheduler decides which task runs. Now we look at what happens when that task asks the kernel for something.

The x86-64 Syscall Assembly Entry Point

When userspace executes the SYSCALL instruction, the CPU does all of the following as a single operation:

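Saves RIP (the address of the next instruction) to RCX and RFLAGS to R11
Loads RIP from the LSTAR MSR (Model-Specific Register)
Masks RFLAGS with the FMASK MSR (every bit set in FMASK is cleared in RFLAGS)
Switches to ring 0 (kernel mode)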
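The four steps above can be modeled in a few lines of C. This is a minimal sketch, not kernel code: `struct toy_cpu` and `toy_syscall_insn()` are hypothetical names, and the model covers only the registers named in the list (the segment selector loads that the hardware also performs are left out).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the state SYSCALL touches; not a kernel type. */
struct toy_cpu {
    uint64_t rip, rflags, rcx, r11;
    uint64_t lstar;  /* LSTAR MSR: address of the kernel entry point */
    uint64_t fmask;  /* FMASK MSR: RFLAGS bits to clear on entry */
    int cpl;         /* current privilege level: 3 = user, 0 = kernel */
};

/* next_rip is the address of the instruction that follows SYSCALL. */
void toy_syscall_insn(struct toy_cpu *cpu, uint64_t next_rip)
{
    cpu->rcx = next_rip;         /* 1. return address -> RCX ...      */
    cpu->r11 = cpu->rflags;      /*    ... and user flags -> R11      */
    cpu->rip = cpu->lstar;       /* 2. jump to the entry point        */
    cpu->rflags &= ~cpu->fmask;  /* 3. clear every bit set in FMASK   */
    cpu->cpl = 0;                /* 4. now running in ring 0          */
    /* RSP is deliberately absent: SYSCALL does not switch stacks. */
}
```

Because the hardware leaves RSP alone, the very first job of the entry code is to find a kernel stack, which is what the assembly below does.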

The LSTAR MSR points to entry_SYSCALL_64:

arch/x86/entry/entry_64.S#L87-L170

SYM_CODE_START(entry_SYSCALL_64)
    UNWIND_HINT_ENTRY
    ENDBR

    swapgs
    movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
    SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp

sequenceDiagram
    participant U as Userspace
    participant CPU as CPU Hardware
    participant ASM as entry_SYSCALL_64
    participant C as do_syscall_64()

    U->>CPU: SYSCALL instruction
    CPU->>CPU: Save RIP→RCX, RFLAGS→R11
    CPU->>CPU: Load LSTAR→RIP, ring 0
    CPU->>ASM: Jump to entry_SYSCALL_64
    ASM->>ASM: swapgs (load kernel GS base)
    ASM->>ASM: Switch to kernel stack
    ASM->>ASM: SWITCH_TO_KERNEL_CR3
    ASM->>ASM: Construct pt_regs on stack
    ASM->>ASM: IBRS_ENTER, UNTRAIN_RET
    ASM->>C: call do_syscall_64
    C-->>ASM: return (true=SYSRET, false=IRET)
    ASM->>U: SYSRET or IRET to userspace

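The first instruction that does real work is swapgs, which exchanges the CPU's GS base between its userspace value and its kernel value so that the kernel can reach its per-CPU data. SYSCALL itself leaves RSP untouched, so the entry code then saves the user stack pointer and switches to the kernel stack.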

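SWITCH_TO_KERNEL_CR3 is the Meltdown/KPTI mitigation: it switches to the set of page tables that maps kernel memory. With KPTI enabled, the page tables in use while userspace runs contain only the minimal kernel mappings needed to enter the kernel, such as this entry code.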

Next, the assembly builds a struct pt_regs on the stack by pushing every user register in exactly the layout the C code expects:

    pushq   $__USER_DS              /* pt_regs->ss */
    pushq   PER_CPU_VAR(...)        /* pt_regs->sp */
    pushq   %r11                    /* pt_regs->flags */
    pushq   $__USER_CS              /* pt_regs->cs */
    pushq   %rcx                    /* pt_regs->ip */
    pushq   %rax                    /* pt_regs->orig_ax */
    PUSH_AND_CLEAR_REGS rax=$-ENOSYS

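Note rax=$-ENOSYS: the ax slot saved in pt_regs is preset to -ENOSYS, the default "not implemented" return value, while the syscall number is kept in orig_ax. Once saved, the general-purpose registers the entry code no longer needs are zeroed, a defense-in-depth measure against speculative-execution attacks.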
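To see why the push order matters, here is a small sketch that replays the pushes on a toy stack and then reads the result back as a struct. The field order follows x86-64 struct pt_regs; everything prefixed `toy_` is hypothetical, the selector values 0x2b and 0x33 are the usual x86-64 user segment selectors, and the callee-saved registers are stored as 0 because the toy snapshot does not track them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field order as x86-64 struct pt_regs, lowest address first. */
struct toy_pt_regs {
    uint64_t r15, r14, r13, r12, bp, bx;
    uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;
    uint64_t orig_ax;
    uint64_t ip, cs, flags, sp, ss;
};

/* Hypothetical snapshot of user registers right after SYSCALL. */
struct toy_user_regs {
    uint64_t rax, rdi, rsi, rdx, r10, r8, r9; /* nr + six arguments */
    uint64_t rcx, r11;                        /* user RIP and RFLAGS */
    uint64_t rsp;
};

enum { TOY_USER_DS = 0x2b, TOY_USER_CS = 0x33, TOY_STACK_WORDS = 32 };

struct toy_pt_regs toy_build_pt_regs(const struct toy_user_regs *u)
{
    uint64_t stack[TOY_STACK_WORDS];
    size_t top = TOY_STACK_WORDS;   /* the stack grows toward index 0 */
    struct toy_pt_regs regs;

#define PUSH(v) (stack[--top] = (uint64_t)(v))
    PUSH(TOY_USER_DS);              /* ss */
    PUSH(u->rsp);                   /* sp */
    PUSH(u->r11);                   /* flags */
    PUSH(TOY_USER_CS);              /* cs */
    PUSH(u->rcx);                   /* ip */
    PUSH(u->rax);                   /* orig_ax: the syscall number */
    PUSH(u->rdi); PUSH(u->rsi); PUSH(u->rdx); PUSH(u->rcx);
    PUSH((int64_t)-ENOSYS);         /* ax slot: default return value */
    PUSH(u->r8); PUSH(u->r9); PUSH(u->r10); PUSH(u->r11);
    PUSH(0); PUSH(0);               /* bx, bp (not tracked in this toy) */
    PUSH(0); PUSH(0); PUSH(0); PUSH(0); /* r12..r15 (not tracked) */
#undef PUSH

    /* The last value pushed sits at the lowest address: the first field. */
    memcpy(&regs, &stack[top], sizeof(regs));
    return regs;
}
```

The first push (ss) ends up as the last field and the last push (r15) as the first, which is why the assembly pushes in the reverse of the struct's declaration order.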

Then come the Spectre mitigations:

    IBRS_ENTER
    UNTRAIN_RET
    CLEAR_BRANCH_HISTORY
    call    do_syscall_64

IBRS_ENTER enables Indirect Branch Restricted Speculation. UNTRAIN_RET mitigates Retbleed. CLEAR_BRANCH_HISTORY defends against Branch History Injection attacks. These three lines distill years of work on hardware vulnerabilities.

do_syscall_64() and Syscall Dispatch

The C-level dispatch function is short and well commented:

arch/x86/entry/syscall_64.c#L87-L141

__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
{
    add_random_kstack_offset();
    nr = syscall_enter_from_user_mode(regs, nr);

    instrumentation_begin();

    if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
        regs->ax = __x64_sys_ni_syscall(regs);
    }

    instrumentation_end();
    syscall_exit_to_user_mode(regs);
    ...

add_random_kstack_offset() randomizes the kernel stack offset, one more exploit mitigation. syscall_enter_from_user_mode() handles tracing, seccomp filters, and audit.

The actual dispatch is done by a generated switch statement:

arch/x86/entry/syscall_64.c#L34-L41

#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
{
    switch (nr) {
    #include <asm/syscalls_64.h>
    default: return __x64_sys_ni_syscall(regs);
    }
}

flowchart TD
    A["do_syscall_64(regs, nr)"] --> B["add_random_kstack_offset()"]
    B --> C["syscall_enter_from_user_mode()<br/>(seccomp, audit, tracing)"]
    C --> D["do_syscall_x64(regs, nr)"]
    D --> E["x64_sys_call: switch(nr)"]
    E --> F["__x64_sys_read(regs)<br/>or any syscall handler"]
    F --> G["syscall_exit_to_user_mode()"]
    G --> H{"SYSRET safe?"}
    H -->|Yes| I["Fast: SYSRET to userspace"]
    H -->|No| J["Slow: IRET to userspace"]

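The line #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs); combined with #include <asm/syscalls_64.h> is an X-macro: the generated header holds one __SYSCALL(nr, sym) entry per syscall, so the complete switch statement is produced at compile time. This replaces the older dispatch through an array of function pointers. The syscall number is still clamped with array_index_nospec() before dispatch, and turning the indirect call into direct calls gives better Spectre protection.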
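The X-macro pattern is easier to see in isolation. The sketch below is a hypothetical miniature: `TOY_SYSCALL_TABLE` stands in for the generated header, the `toy_` handlers are invented for illustration, and only the numbers 0, 1, and 39 (read, write, and getpid on x86-64) are wired up.

```c
#include <assert.h>
#include <errno.h>

struct toy_regs { long di, si, dx; };  /* hypothetical stand-in for pt_regs */

long toy_sys_read(const struct toy_regs *regs)   { return regs->dx; }
long toy_sys_write(const struct toy_regs *regs)  { return regs->dx; }
long toy_sys_getpid(const struct toy_regs *regs) { (void)regs; return 1234; }
long toy_sys_ni_syscall(const struct toy_regs *regs)
{
    (void)regs;
    return -ENOSYS;
}

/* Stand-in for the generated header: one entry per syscall, no logic. */
#define TOY_SYSCALL_TABLE          \
    TOY_SYSCALL(0, sys_read)       \
    TOY_SYSCALL(1, sys_write)      \
    TOY_SYSCALL(39, sys_getpid)

/* Give each entry a meaning just before the list is expanded. */
#define TOY_SYSCALL(nr, sym) case nr: return toy_##sym(regs);
long toy_sys_call(const struct toy_regs *regs, unsigned int nr)
{
    switch (nr) {
    TOY_SYSCALL_TABLE
    default: return toy_sys_ni_syscall(regs);
    }
}
#undef TOY_SYSCALL
```

The same list could be expanded a second time with a different definition of `TOY_SYSCALL`, for example to build a table of names, which is the main appeal of the technique.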

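do_syscall_64 returns a boolean: true means "use SYSRET" (the fast path) and false means "use IRET" (slower, but safe in every case). The conditions for SYSRET are checked explicitly: the saved RCX must equal the saved RIP, the saved R11 must equal the saved RFLAGS, and the instruction pointer must lie in the canonical user address range. The last check exists because of a hardware bug in Intel CPUs: executing SYSRET with a non-canonical RCX raises the fault in kernel mode while RSP already holds the user's stack pointer, which gives the user control over the stack the kernel's fault handler runs on.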
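The three conditions named above fit in a short predicate. This is a sketch under stated assumptions, not the kernel's function: `toy_sysret_is_safe()` and `TOY_TASK_SIZE_MAX` are hypothetical names, the limit assumes 4-level paging with 4 KiB pages, and any further checks the real code performs are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_regs { uint64_t ip, cx, flags, r11; };  /* saved user state */

/* Assumed user address limit: 2^47 minus one 4 KiB page. */
#define TOY_TASK_SIZE_MAX ((UINT64_C(1) << 47) - 4096)

bool toy_sysret_is_safe(const struct toy_regs *regs)
{
    /* SYSRET reloads RIP from RCX, so the two saved values must agree. */
    if (regs->cx != regs->ip)
        return false;
    /* Likewise, RFLAGS is reloaded from R11. */
    if (regs->r11 != regs->flags)
        return false;
    /* Refuse any return address outside the user range. */
    if (regs->ip >= TOY_TASK_SIZE_MAX)
        return false;
    return true;
}
```

The first two checks usually pass, because SYSCALL itself put RIP in RCX and RFLAGS in R11. They fail when something rewrote the saved registers during the syscall, and then the kernel falls back to IRET.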

Tip: The SYSCALL_DEFINE macros in include/linux/syscalls.h#L217-L230 are the standard way to declare a syscall in C. SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count) generates the actual function together with its tracing and audit metadata.
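A stripped-down version of the idea shows what such a macro has to do: emit a register-taking entry point that unpacks typed arguments and forwards them to the body. The macro below is a hypothetical sketch, not the kernel's definition. It handles exactly three arguments, takes them from di, si, and dx (the first three argument registers in the x86-64 syscall convention), and generates no metadata.

```c
#include <assert.h>
#include <stdint.h>

struct toy_regs { uint64_t di, si, dx; };  /* hypothetical stand-in for pt_regs */

/*
 * Expands to three things: a forward declaration of the typed body,
 * a wrapper that casts the raw registers to the declared types, and
 * the header of the typed body, which the user completes with { ... }.
 */
#define TOY_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3)          \
    static long toy_do_sys_##name(t1 a1, t2 a2, t3 a3);            \
    long toy_x64_sys_##name(const struct toy_regs *regs)           \
    {                                                              \
        return toy_do_sys_##name((t1)regs->di, (t2)regs->si,       \
                                 (t3)regs->dx);                    \
    }                                                              \
    static long toy_do_sys_##name(t1 a1, t2 a2, t3 a3)

/* An invented syscall that adds its three arguments. */
TOY_SYSCALL_DEFINE3(sum3, unsigned int, a, long, b, unsigned long, c)
{
    return (long)a + b + (long)c;
}
```

The dispatcher only ever sees `toy_x64_sys_sum3(regs)`, so every handler has the same signature no matter how many arguments it takes or what their types are.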

VFS: The Kernel's Core Abstraction

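The syscall dispatcher hands control to each subsystem's implementation. For file operations (read, write, open, close, mmap), that means entering the Virtual Filesystem Switch (VFS).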

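VFS is the most important use of the C vtable pattern in the kernel. It provides one set of syscall handlers that behave the same way whether the backing store is ext4, XFS, NFS, procfs, or a FUSE filesystem. The abstraction rests on three core operations structures.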

flowchart TD
    subgraph "Userspace"
        app["Application: read(fd, buf, count)"]
    end
    subgraph "Syscall Layer"
        sys["sys_read()"]
    end
    subgraph "VFS Layer"
        vfs["vfs_read() → file->f_op->read_iter()"]
    end
    subgraph "Filesystem Implementations"
        ext4["ext4_file_read_iter()"]
        xfs["xfs_file_read_iter()"]
        nfs["nfs_file_read()"]
        proc["proc_reg_read_iter()"]
    end
    app --> sys --> vfs
    vfs --> ext4
    vfs --> xfs
    vfs --> nfs
    vfs --> proc

Inside the Operations Structures

Three core VFS operations structures define the contract between the generic VFS layer and each filesystem implementation.

struct file_operations

include/linux/fs.h#L1926-L1970

struct file_operations {
    struct module *owner;
    fop_flags_t fop_flags;
    loff_t (*llseek)(struct file *, loff_t, int);
    ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);
    __poll_t (*poll)(struct file *, struct poll_table_struct *);
    int (*mmap)(struct file *, struct vm_area_struct *);
    int (*open)(struct inode *, struct file *);
    int (*release)(struct inode *, struct file *);
    int (*fsync)(struct file *, loff_t, loff_t, int datasync);
    int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
    ...
} __randomize_layout;

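Every open file in the system is represented by a struct file, which holds a pointer to that filesystem's file_operations. When read() is called, the kernel ends up calling file->f_op->read_iter(). Note the uring_cmd function pointer: it is the integration point for io_uring, which we cover in detail in Part 5.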
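The vtable pattern itself needs very little code. In the sketch below, two invented "filesystems" (`memfs` and `zerofs`) each supply a read function, and a generic `toy_vfs_read()` calls whichever one the file points at. All names are hypothetical, and the single `read` hook is a simplification of the real structure shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>  /* ssize_t */

struct toy_file;

struct toy_file_operations {
    ssize_t (*read)(struct toy_file *f, char *buf, size_t count);
};

struct toy_file {
    const struct toy_file_operations *f_op;
    const char *data;   /* backing bytes, used by memfs only */
    size_t size, pos;
};

/* "Filesystem" A: serves bytes from memory and tracks a position. */
ssize_t memfs_read(struct toy_file *f, char *buf, size_t count)
{
    size_t left = f->size - f->pos;
    if (count > left)
        count = left;
    memcpy(buf, f->data + f->pos, count);
    f->pos += count;
    return (ssize_t)count;
}

/* "Filesystem" B: an endless stream of zero bytes. */
ssize_t zerofs_read(struct toy_file *f, char *buf, size_t count)
{
    (void)f;
    memset(buf, 0, count);
    return (ssize_t)count;
}

const struct toy_file_operations memfs_fops  = { .read = memfs_read };
const struct toy_file_operations zerofs_fops = { .read = zerofs_read };

/* Generic layer: it knows nothing about either backend. */
ssize_t toy_vfs_read(struct toy_file *f, char *buf, size_t count)
{
    if (!f->f_op || !f->f_op->read)
        return -1;
    return f->f_op->read(f, buf, count);
}
```

Adding a third backend means writing one more function and one more `toy_file_operations` instance; `toy_vfs_read()` does not change.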

classDiagram
    class file_operations {
        +llseek()
        +read()
        +write()
        +read_iter()
        +write_iter()
        +poll()
        +mmap()
        +open()
        +release()
        +fsync()
        +uring_cmd()
    }
    class inode_operations {
        +lookup()
        +create()
        +link()
        +unlink()
        +mkdir()
        +rmdir()
        +rename()
        +setattr()
        +getattr()
    }
    class super_operations {
        +alloc_inode()
        +destroy_inode()
        +dirty_inode()
        +write_inode()
        +drop_inode()
        +put_super()
        +sync_fs()
        +statfs()
    }
    file_operations <-- inode_operations : "per-inode"
    inode_operations <-- super_operations : "per-filesystem"

struct inode_operations

include/linux/fs.h#L2001-L2025

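Inode operations handle namespace-level work: looking up a file in a directory, creating files, creating directories, renaming, and managing extended attributes. They act on the directory tree rather than on file contents.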

struct super_operations

include/linux/fs/super_types.h#L83-L112

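Superblock operations manage the filesystem as a whole: allocating inodes, writing back dirty inodes, syncing the filesystem, and reporting disk space through statfs.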
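The statfs hook is the easiest of these to poke at from userspace, because the statfs(2) system call asks the filesystem that backs a path to report its numbers. A minimal sketch for Linux follows; `free_bytes()` is a hypothetical helper name.

```c
#include <assert.h>
#include <sys/vfs.h>  /* statfs(2) on Linux */

/*
 * Returns the space, in bytes, available to unprivileged users on the
 * filesystem that contains path, or -1 if the call fails.
 */
long long free_bytes(const char *path)
{
    struct statfs st;

    if (statfs(path, &st) != 0)
        return -1;
    return (long long)st.f_bavail * (long long)st.f_bsize;
}
```

Calling `free_bytes("/")` and `free_bytes("/proc")` goes through the same system call, but the answers come from two different filesystems.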

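Every filesystem fills in these three structures (and sometimes struct address_space_operations for page cache integration). VFS never calls filesystem code directly; it always goes through these function pointers. That is how Linux supports more than 50 filesystem types with a single set of syscalls.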
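From userspace, the payoff is that one small function can read from any of them. The helper below is a sketch with a hypothetical name, `read_some()`; it uses only open(2), read(2), and close(2).

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Open path, read up to count bytes into buf, close. Returns bytes read or -1. */
ssize_t read_some(const char *path, void *buf, size_t count)
{
    int fd = open(path, O_RDONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = read(fd, buf, count);
    close(fd);
    return n;
}
```

On a typical Linux system, `read_some("/dev/zero", buf, 8)` is served by a character device, `read_some("/proc/self/stat", buf, 64)` by procfs, and a path under your home directory by whatever disk filesystem is mounted there. The calling code is identical in all three cases.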

Tracing open() → read() → write() Through the VFS

Let's trace one read() call through the VFS, step by step:

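Userspace calls read(fd, buf, count)
The syscall entry path described above dispatches it to sys_read()
sys_read() calls fdget_pos() to look up the struct file in the file descriptor table, then calls vfs_read()
vfs_read() checks permissions, then calls file->f_op->read_iter() (the modern path) or file->f_op->read() (the legacy interface)
The filesystem's implementation, for example ext4_file_read_iter(), handles the actual I/O
Data flows through the page cache, possibly triggering block I/O to disk
The result propagates back up through the VFS and returns to userspace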
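The first four steps can be sketched as a toy model: look up the file by descriptor, check that it is readable, and call whichever read hook the filesystem provides. Everything here is hypothetical (the `toy_` names, the flat descriptor table, the shared hook signature); the order in which the two hooks are tried follows my reading of vfs_read() and should be treated as an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>  /* ssize_t, off_t */

struct toy_file;

struct toy_fops {
    ssize_t (*read)(struct toy_file *, char *, size_t, off_t *);      /* legacy */
    ssize_t (*read_iter)(struct toy_file *, char *, size_t, off_t *); /* modern, simplified */
};

struct toy_file {
    const struct toy_fops *f_op;
    const char *data;
    size_t size;
    off_t pos;
    int readable;
};

#define TOY_MAX_FDS 4
struct toy_file *toy_fd_table[TOY_MAX_FDS];

/* One backend routine, usable through either hook in this toy. */
ssize_t toy_mem_read(struct toy_file *f, char *buf, size_t count, off_t *pos)
{
    size_t off = (size_t)*pos;

    if (off >= f->size)
        return 0;
    if (count > f->size - off)
        count = f->size - off;
    memcpy(buf, f->data + off, count);
    *pos += (off_t)count;
    return (ssize_t)count;
}

ssize_t toy_vfs_read(struct toy_file *f, char *buf, size_t count, off_t *pos)
{
    if (!f->readable)
        return -EBADF;                       /* permission check */
    if (f->f_op->read)
        return f->f_op->read(f, buf, count, pos);
    if (f->f_op->read_iter)
        return f->f_op->read_iter(f, buf, count, pos);
    return -EINVAL;                          /* no way to read this file */
}

ssize_t toy_sys_read(int fd, char *buf, size_t count)
{
    struct toy_file *f;
    off_t pos;
    ssize_t n;

    if (fd < 0 || fd >= TOY_MAX_FDS || !toy_fd_table[fd])
        return -EBADF;                       /* the fdget_pos() step */
    f = toy_fd_table[fd];
    pos = f->pos;
    n = toy_vfs_read(f, buf, count, &pos);
    if (n >= 0)
        f->pos = pos;                        /* commit the new file offset */
    return n;
}
```

The file offset lives in the open file, not in the filesystem routine, which is why two consecutive reads on one descriptor return consecutive data.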

sequenceDiagram
    participant U as Userspace
    participant S as sys_read()
    participant V as vfs_read()
    participant F as file->f_op->read_iter()
    participant PC as Page Cache
    participant BIO as Block I/O

    U->>S: read(fd, buf, count)
    S->>S: fdget_pos(fd) → struct file
    S->>V: vfs_read(file, buf, count, &pos)
    V->>V: Permission checks
    V->>F: f_op->read_iter(kiocb, iov_iter)
    F->>PC: Look up page cache
    alt Cache hit
        PC-->>F: Return cached data
    else Cache miss
        PC->>BIO: Submit block I/O
        BIO-->>PC: Data from disk
        PC-->>F: Return data
    end
    F-->>V: Bytes read
    V-->>S: Bytes read
    S-->>U: Return to userspace

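The open() path matters most, because this is where the file_operations pointer is assigned. During open(), VFS resolves the path through inode->i_op->lookup(), determines the filesystem from the mount point, allocates a struct file, and sets file->f_op to the filesystem's file_operations, taken from the inode's i_fop field. From then on, every read(), write(), and mmap() on that file descriptor is dispatched through that filesystem's vtable.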
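A toy version of that binding step makes the timing clear: the operations pointer is copied from the inode once, at open, and later calls never look it up again. The names below are all hypothetical, and the flat two-entry directory stands in for real path resolution, which walks one component at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file_operations { const char *fs_name; };

struct toy_inode {
    const char *name;
    const struct toy_file_operations *i_fop;  /* set by the filesystem */
};

struct toy_file {
    const struct toy_inode *inode;
    const struct toy_file_operations *f_op;   /* copied at open time */
};

const struct toy_file_operations toy_ext4_fops = { "ext4" };
const struct toy_file_operations toy_proc_fops = { "proc" };

/* A flat "directory" whose entries belong to two different filesystems. */
const struct toy_inode toy_dir[] = {
    { "notes.txt", &toy_ext4_fops },
    { "stat",      &toy_proc_fops },
};

const struct toy_inode *toy_lookup(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(toy_dir) / sizeof(toy_dir[0]); i++)
        if (strcmp(toy_dir[i].name, name) == 0)
            return &toy_dir[i];
    return NULL;
}

/* Returns 0 on success, -1 if the name does not exist. */
int toy_open(const char *name, struct toy_file *file)
{
    const struct toy_inode *inode = toy_lookup(name);

    if (!inode)
        return -1;
    file->inode = inode;
    file->f_op = inode->i_fop;  /* bound once; later calls just use f_op */
    return 0;
}
```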

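Tip: The most direct way to learn how a particular filesystem works is to find its file_operations definition. For ext4: grep "struct file_operations" fs/ext4/*.c. The function pointers show exactly which function handles each operation.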

Next Steps

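We have now traced the complete path from userspace to the filesystem. In the next article we look at how io_uring sidesteps this per-call syscall sequence: I/O is submitted and completed through memory shared between the kernel and userspace, in the best case without executing a single SYSCALL instruction.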
