
From Userspace to Kernel: Syscall Entry and the VFS Layer

Advanced

Prerequisites

  • Article 1: architecture-and-directory-map
  • Article 3: process-scheduler-internals
  • Basic x86 assembly (registers, calling conventions)
  • Understanding of C function pointer structs

When a userspace program calls read(), a chain of events crosses the CPU privilege boundary, navigates Spectre mitigations, dispatches through a generated switch table, and eventually reaches a filesystem-specific function via the VFS — the kernel's single most important abstraction layer. This article traces that complete path, from the x86-64 SYSCALL instruction to the point where bytes come off a disk.

We've seen the scheduler (Article 3) decide which task runs. Now we see what happens when that task asks the kernel to do something.

The x86-64 Syscall Assembly Entry

When userspace executes the SYSCALL instruction, the CPU atomically:

  1. Saves RIP to RCX and RFLAGS to R11
  2. Loads RIP from the LSTAR MSR (Model-Specific Register)
  3. Masks RFLAGS with the FMASK MSR
  4. Switches to ring 0 (kernel mode)

The LSTAR MSR points to entry_SYSCALL_64:

arch/x86/entry/entry_64.S#L87-L170

SYM_CODE_START(entry_SYSCALL_64)
    UNWIND_HINT_ENTRY
    ENDBR

    swapgs
    movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
    SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp

sequenceDiagram
    participant U as Userspace
    participant CPU as CPU Hardware
    participant ASM as entry_SYSCALL_64
    participant C as do_syscall_64()

    U->>CPU: SYSCALL instruction
    CPU->>CPU: Save RIP→RCX, RFLAGS→R11
    CPU->>CPU: Load LSTAR→RIP, ring 0
    CPU->>ASM: Jump to entry_SYSCALL_64
    ASM->>ASM: swapgs (load kernel GS base)
    ASM->>ASM: Switch to kernel stack
    ASM->>ASM: SWITCH_TO_KERNEL_CR3
    ASM->>ASM: Construct pt_regs on stack
    ASM->>ASM: IBRS_ENTER, UNTRAIN_RET
    ASM->>C: call do_syscall_64
    C-->>ASM: return (true=SYSRET, false=IRET)
    ASM->>U: SYSRET or IRET to userspace

The first real instruction is swapgs, which exchanges the CPU's GS base register between userspace and kernel values. This gives the kernel access to per-CPU data. Next, the current user stack pointer is saved and replaced with the kernel stack.

SWITCH_TO_KERNEL_CR3 is a Meltdown/KPTI mitigation: it swaps to a page table that maps kernel memory. With KPTI active, userspace runs with a page table in which almost all kernel memory is unmapped — only the minimal trampoline needed for entry and exit remains visible.

The assembly then constructs a struct pt_regs on the stack by pushing all user registers in the exact layout expected by C code:

    pushq   $__USER_DS              /* pt_regs->ss */
    pushq   PER_CPU_VAR(...)        /* pt_regs->sp */
    pushq   %r11                    /* pt_regs->flags */
    pushq   $__USER_CS              /* pt_regs->cs */
    pushq   %rcx                    /* pt_regs->ip */
    pushq   %rax                    /* pt_regs->orig_ax */
    PUSH_AND_CLEAR_REGS rax=$-ENOSYS

Notice rax=$-ENOSYS — all general-purpose registers are cleared (defense in depth against speculative execution), and %rax is preset to -ENOSYS as the default "not implemented" return.

Then come the Spectre mitigations:

    IBRS_ENTER
    UNTRAIN_RET
    CLEAR_BRANCH_HISTORY
    call    do_syscall_64

IBRS_ENTER enables Indirect Branch Restricted Speculation. UNTRAIN_RET mitigates Retbleed. CLEAR_BRANCH_HISTORY protects against Branch History Injection. These three lines represent years of hardware vulnerability response.

do_syscall_64() and Dispatch

The C dispatch function is clean and well-documented:

arch/x86/entry/syscall_64.c#L87-L141

__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
{
    add_random_kstack_offset();
    nr = syscall_enter_from_user_mode(regs, nr);

    instrumentation_begin();

    if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
        regs->ax = __x64_sys_ni_syscall(regs);
    }

    instrumentation_end();
    syscall_exit_to_user_mode(regs);
    ...

add_random_kstack_offset() randomizes the kernel stack position — another exploit mitigation. syscall_enter_from_user_mode() handles tracing, seccomp filters, and audit.

The actual dispatch uses a generated switch statement:

arch/x86/entry/syscall_64.c#L34-L41

#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
{
    switch (nr) {
    #include <asm/syscalls_64.h>
    default: return __x64_sys_ni_syscall(regs);
    }
}

flowchart TD
    A["do_syscall_64(regs, nr)"] --> B["add_random_kstack_offset()"]
    B --> C["syscall_enter_from_user_mode()<br/>(seccomp, audit, tracing)"]
    C --> D["do_syscall_x64(regs, nr)"]
    D --> E["x64_sys_call: switch(nr)"]
    E --> F["__x64_sys_read(regs)<br/>or any syscall handler"]
    F --> G["syscall_exit_to_user_mode()"]
    G --> H{"SYSRET safe?"}
    H -->|Yes| I["Fast: SYSRET to userspace"]
    H -->|No| J["Slow: IRET to userspace"]

The X-macro trick #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs); combined with #include <asm/syscalls_64.h> generates a complete switch statement at compile time. This replaced the older array-based dispatch for better Spectre safety via array_index_nospec().

The return value of do_syscall_64 is a boolean: true means "use SYSRET" (the fast path), false means "use IRET" (the slow but safe path). The conditions for SYSRET safety are checked explicitly — RCX must equal RIP, R11 must equal RFLAGS, and the return address must be canonical. Intel CPUs have a hardware quirk where SYSRET with a non-canonical RCX faults in ring 0, and because SYSRET does not switch stacks, the resulting #GP handler would run on a stack pointer still under user control.

Tip: The SYSCALL_DEFINE macros in include/linux/syscalls.h#L217-L230 are how syscalls are declared in C. SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count) generates the actual function with proper metadata for tracing and auditing.

VFS: The Kernel's Central Abstraction

The system call dispatcher calls into subsystem-specific code. For file operations — read, write, open, close, mmap — that means the Virtual Filesystem Switch (VFS).

The VFS is the most important application of the C vtable pattern in the kernel. It provides a single set of syscall handlers that work identically regardless of whether the underlying storage is ext4, XFS, NFS, procfs, or a FUSE filesystem. The abstraction is achieved through three primary operations structures.

flowchart TD
    subgraph "Userspace"
        app["Application: read(fd, buf, count)"]
    end
    subgraph "Syscall Layer"
        sys["sys_read()"]
    end
    subgraph "VFS Layer"
        vfs["vfs_read() → file->f_op->read_iter()"]
    end
    subgraph "Filesystem Implementations"
        ext4["ext4_file_read_iter()"]
        xfs["xfs_file_read_iter()"]
        nfs["nfs_file_read()"]
        proc["proc_reg_read_iter()"]
    end
    app --> sys --> vfs
    vfs --> ext4
    vfs --> xfs
    vfs --> nfs
    vfs --> proc

Operations Structures Deep Dive

The three core VFS operations structures define the contract between the generic VFS layer and individual filesystem implementations.

struct file_operations

include/linux/fs.h#L1926-L1970

struct file_operations {
    struct module *owner;
    fop_flags_t fop_flags;
    loff_t (*llseek)(struct file *, loff_t, int);
    ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);
    __poll_t (*poll)(struct file *, struct poll_table_struct *);
    int (*mmap)(struct file *, struct vm_area_struct *);
    int (*open)(struct inode *, struct file *);
    int (*release)(struct inode *, struct file *);
    int (*fsync)(struct file *, loff_t, loff_t, int datasync);
    int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
    ...
} __randomize_layout;

Every open file in the system has a struct file that points to the file_operations for its filesystem. When you call read(), the kernel ultimately calls file->f_op->read_iter(). Note the uring_cmd function pointer — this is the io_uring integration point we'll explore in Article 5.

classDiagram
    class file_operations {
        +llseek()
        +read()
        +write()
        +read_iter()
        +write_iter()
        +poll()
        +mmap()
        +open()
        +release()
        +fsync()
        +uring_cmd()
    }
    class inode_operations {
        +lookup()
        +create()
        +link()
        +unlink()
        +mkdir()
        +rmdir()
        +rename()
        +setattr()
        +getattr()
    }
    class super_operations {
        +alloc_inode()
        +destroy_inode()
        +dirty_inode()
        +write_inode()
        +drop_inode()
        +put_super()
        +sync_fs()
        +statfs()
    }
    file_operations <-- inode_operations : "per-inode"
    inode_operations <-- super_operations : "per-filesystem"

struct inode_operations

include/linux/fs.h#L2001-L2025

Inode operations handle namespace operations: looking up files in directories, creating files, making directories, renaming, and managing extended attributes. These operate on the directory tree structure rather than file content.

struct super_operations

include/linux/fs/super_types.h#L83-L112

Superblock operations manage the filesystem as a whole: allocating inodes, writing back dirty inodes, syncing the filesystem, and reporting disk space with statfs.

Each filesystem fills in these three structures (and sometimes struct address_space_operations for page cache integration). The VFS never calls filesystem code directly — it always goes through these function pointers. This is how Linux supports 50+ filesystem types with a single set of syscalls.

Tracing open() → read() → write() Through VFS

Let's trace a read() call through the VFS:

  1. Userspace calls read(fd, buf, count)
  2. Syscall entry (as described above) dispatches to sys_read()
  3. sys_read() calls fdget_pos() to look up the struct file from the file descriptor table
  4. vfs_read() checks permissions and calls file->f_op->read_iter() (the modern path) or file->f_op->read() (legacy path)
  5. The filesystem's implementation — say ext4_file_read_iter() — handles the actual I/O
  6. Data flows through the page cache, possibly triggering block I/O to disk
  7. The result propagates back up through the VFS to userspace

sequenceDiagram
    participant U as Userspace
    participant S as sys_read()
    participant V as vfs_read()
    participant F as file->f_op->read_iter()
    participant PC as Page Cache
    participant BIO as Block I/O

    U->>S: read(fd, buf, count)
    S->>S: fdget_pos(fd) → struct file
    S->>V: vfs_read(file, buf, count, &pos)
    V->>V: Permission checks
    V->>F: f_op->read_iter(kiocb, iov_iter)
    F->>PC: Look up page cache
    alt Cache hit
        PC-->>F: Return cached data
    else Cache miss
        PC->>BIO: Submit block I/O
        BIO-->>PC: Data from disk
        PC-->>F: Return data
    end
    F-->>V: Bytes read
    V-->>S: Bytes read
    S-->>U: Return to userspace

The open() path is particularly interesting because it's where the file_operations pointer is assigned. During open(), the VFS calls inode->i_op->lookup() to resolve the path, determines the filesystem type from the mount point, allocates a struct file, and sets file->f_op to the filesystem's file_operations. From that point on, every read(), write(), and mmap() on that file descriptor dispatches through the filesystem's vtable.

Tip: To understand how a specific filesystem works, find its file_operations definition. For ext4: grep "struct file_operations" fs/ext4/*.c. The function pointers tell you exactly which functions handle each operation.

What's Next

We've now traced the full path from userspace to a filesystem. In the next article, we'll see how io_uring bypasses this entire syscall sequence — sharing memory between kernel and userspace to submit and complete I/O operations without any SYSCALL instruction at all.