From Userspace to Kernel: Syscall Entry and the VFS Layer
Prerequisites
- Article 1: architecture-and-directory-map
- Article 3: process-scheduler-internals
- Basic x86 assembly (registers, calling conventions)
- Understanding of C function pointer structs
When a userspace program calls read(), a chain of events crosses the CPU privilege boundary, navigates Spectre mitigations, dispatches through a generated switch table, and eventually reaches a filesystem-specific function via the VFS — the kernel's single most important abstraction layer. This article traces that complete path, from the x86-64 SYSCALL instruction to the point where bytes come off a disk.
We've seen the scheduler (Article 3) decide which task runs. Now we see what happens when that task asks the kernel to do something.
The x86-64 Syscall Assembly Entry
When userspace executes the SYSCALL instruction, the CPU atomically:
- Saves RIP to RCX and RFLAGS to R11
- Loads RIP from the LSTAR MSR (Model-Specific Register)
- Masks RFLAGS with the FMASK MSR
- Switches to ring 0 (kernel mode)
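The userspace half of this contract is easy to observe directly. The sketch below (x86-64 Linux only, with a portable libc fallback; `raw_syscall0` is an illustrative name, not a kernel API) issues a raw SYSCALL with inline assembly. Note that RCX and R11 must be declared as clobbers precisely because the CPU overwrites them with the return RIP and RFLAGS, as listed above:

```c
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke a zero-argument system call via the raw SYSCALL instruction.
 * The syscall number goes in RAX; RCX and R11 are clobbered by the
 * CPU itself (they receive the return RIP and saved RFLAGS). */
static long raw_syscall0(long nr)
{
#if defined(__x86_64__)
    long ret;
    __asm__ volatile ("syscall"
                      : "=a" (ret)   /* return value comes back in RAX */
                      : "a" (nr)     /* syscall number passed in RAX */
                      : "rcx", "r11", "memory");
    return ret;
#else
    return syscall(nr);              /* portable fallback via libc */
#endif
}
```

Since getpid(2) cannot fail, the raw path must agree with the libc wrapper: `raw_syscall0(SYS_getpid) == getpid()`.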
The LSTAR MSR points to entry_SYSCALL_64:
arch/x86/entry/entry_64.S#L87-L170
SYM_CODE_START(entry_SYSCALL_64)
UNWIND_HINT_ENTRY
ENDBR
swapgs
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
sequenceDiagram
participant U as Userspace
participant CPU as CPU Hardware
participant ASM as entry_SYSCALL_64
participant C as do_syscall_64()
U->>CPU: SYSCALL instruction
CPU->>CPU: Save RIP→RCX, RFLAGS→R11
CPU->>CPU: Load LSTAR→RIP, ring 0
CPU->>ASM: Jump to entry_SYSCALL_64
ASM->>ASM: swapgs (load kernel GS base)
ASM->>ASM: Switch to kernel stack
ASM->>ASM: SWITCH_TO_KERNEL_CR3
ASM->>ASM: Construct pt_regs on stack
ASM->>ASM: IBRS_ENTER, UNTRAIN_RET
ASM->>C: call do_syscall_64
C-->>ASM: return (true=SYSRET, false=IRET)
ASM->>U: SYSRET or IRET to userspace
The first real instruction is swapgs, which exchanges the CPU's GS base register between userspace and kernel values. This gives the kernel access to per-CPU data. Next, the current user stack pointer is saved and replaced with the kernel stack.
SWITCH_TO_KERNEL_CR3 is a Meltdown/KPTI mitigation: it swaps to a page table that maps kernel memory. With KPTI active, userspace runs with a page table that has the kernel completely unmapped.
The assembly then constructs a struct pt_regs on the stack by pushing all user registers in the exact layout expected by C code:
pushq $__USER_DS /* pt_regs->ss */
pushq PER_CPU_VAR(...) /* pt_regs->sp */
pushq %r11 /* pt_regs->flags */
pushq $__USER_CS /* pt_regs->cs */
pushq %rcx /* pt_regs->ip */
pushq %rax /* pt_regs->orig_ax */
PUSH_AND_CLEAR_REGS rax=$-ENOSYS
Notice rax=$-ENOSYS — all general-purpose registers are cleared (defense in depth against speculative execution), and %rax is preset to -ENOSYS as the default "not implemented" return.
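The correspondence between push order and struct layout can be made concrete. This is a simplified replica of just the tail end of struct pt_regs (the real definition is in arch/x86/include/asm/ptrace.h); because the stack grows downward, the six pushq instructions above fill these fields in reverse field order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy replica of the last six pt_regs fields. The first pushq lands at
 * the highest address (ss); each later push lands 8 bytes lower, so the
 * final push (%rax -> orig_ax) sits at the lowest offset. */
struct toy_pt_regs_tail {
    uint64_t orig_ax;  /* pushq %rax           (pushed 6th) */
    uint64_t ip;       /* pushq %rcx           (pushed 5th) */
    uint64_t cs;       /* pushq $__USER_CS     (pushed 4th) */
    uint64_t flags;    /* pushq %r11           (pushed 3rd) */
    uint64_t sp;       /* pushq PER_CPU_VAR()  (pushed 2nd) */
    uint64_t ss;       /* pushq $__USER_DS     (pushed 1st) */
};
```

This is why the assembly can simply fall through to `call do_syscall_64` with `%rsp` as the `struct pt_regs *` argument: the pushes have already laid out the struct in memory.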
Then come the Spectre mitigations:
IBRS_ENTER
UNTRAIN_RET
CLEAR_BRANCH_HISTORY
call do_syscall_64
IBRS_ENTER enables Indirect Branch Restricted Speculation. UNTRAIN_RET mitigates Retbleed. CLEAR_BRANCH_HISTORY protects against Branch History Injection. These three lines represent years of hardware vulnerability response.
do_syscall_64() and Dispatch
The C dispatch function is clean and well-documented:
arch/x86/entry/syscall_64.c#L87-L141
__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
{
add_random_kstack_offset();
nr = syscall_enter_from_user_mode(regs, nr);
instrumentation_begin();
if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
regs->ax = __x64_sys_ni_syscall(regs);
}
instrumentation_end();
syscall_exit_to_user_mode(regs);
...
add_random_kstack_offset() randomizes the kernel stack position — another exploit mitigation. syscall_enter_from_user_mode() handles tracing, seccomp filters, and audit.
The actual dispatch uses a generated switch statement:
arch/x86/entry/syscall_64.c#L34-L41
#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
{
switch (nr) {
#include <asm/syscalls_64.h>
default: return __x64_sys_ni_syscall(regs);
}
}
flowchart TD
A["do_syscall_64(regs, nr)"] --> B["add_random_kstack_offset()"]
B --> C["syscall_enter_from_user_mode()<br/>(seccomp, audit, tracing)"]
C --> D["do_syscall_x64(regs, nr)"]
D --> E["x64_sys_call: switch(nr)"]
E --> F["__x64_sys_read(regs)<br/>or any syscall handler"]
F --> G["syscall_exit_to_user_mode()"]
G --> H{"SYSRET safe?"}
H -->|Yes| I["Fast: SYSRET to userspace"]
H -->|No| J["Slow: IRET to userspace"]
The X-macro trick #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs); combined with #include <asm/syscalls_64.h> generates a complete switch statement at compile time. This replaced the older array-based dispatch for better Spectre safety via array_index_nospec().
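The mechanism is worth seeing end to end. In the real tree, <asm/syscalls_64.h> is generated from syscall_64.tbl at build time; this toy version inlines a tiny table of hypothetical handlers to show how one macro definition turns a list into a complete switch:

```c
#include <assert.h>

#define TOY_ENOSYS 38  /* mirrors Linux's ENOSYS value on x86 */

/* Two hypothetical syscall handlers. */
static long sys_getanswer(void) { return 42; }
static long sys_gettwice(void)  { return 84; }

/* Stand-in for the generated <asm/syscalls_64.h>: one __SYSCALL()
 * invocation per (number, symbol) pair. */
#define TOY_SYSCALL_TABLE   \
    __SYSCALL(0, getanswer) \
    __SYSCALL(1, gettwice)

static long toy_sys_call(unsigned int nr)
{
    switch (nr) {
/* The X-macro: each table entry expands to a case label. */
#define __SYSCALL(nr, sym) case nr: return sys_##sym();
    TOY_SYSCALL_TABLE
#undef __SYSCALL
    default:
        return -TOY_ENOSYS;
    }
}
```

Because the dispatch is a compile-time switch rather than an indexed load from a function-pointer array, there is no attacker-steerable table index to speculate on.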
The return value of do_syscall_64 is a boolean: true means "use SYSRET" (the fast path), false means "use IRET" (the slow but safe path). The conditions for SYSRET safety are checked explicitly — RCX must equal RIP, R11 must equal RFLAGS, and the instruction pointer must be in canonical user address space. Intel CPUs have a hardware bug where SYSRET with non-canonical RCX faults in kernel mode, giving the user control of the kernel stack.
Tip: The SYSCALL_DEFINE macros in include/linux/syscalls.h#L217-L230 are how syscalls are declared in C. SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count) generates the actual function with proper metadata for tracing and auditing.
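A toy model of what such a macro expands to: a `__x64_sys_*`-style wrapper that takes the pt_regs pointer and unpacks the x86-64 argument registers (rdi, rsi, rdx) into a typed inner function. Every name here is an illustrative stand-in, not the kernel's exact expansion:

```c
#include <assert.h>
#include <stdint.h>

/* Just the three argument-carrying fields, in toy form. */
struct toy_pt_regs { uint64_t di, si, dx; };

/* Toy three-argument syscall definer: emits a regs-taking wrapper plus
 * the typed body that follows the macro invocation. */
#define TOY_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3)            \
    static long toy_do_sys_##name(t1 a1, t2 a2, t3 a3);              \
    long toy_x64_sys_##name(const struct toy_pt_regs *regs)          \
    {                                                                \
        return toy_do_sys_##name((t1)regs->di, (t2)regs->si,         \
                                 (t3)regs->dx);                      \
    }                                                                \
    static long toy_do_sys_##name(t1 a1, t2 a2, t3 a3)

/* Usage looks just like a function definition: */
TOY_SYSCALL_DEFINE3(addmul, long, a, long, b, long, c)
{
    return a + b * c;
}
```

The dispatcher only ever sees the uniform `(const struct pt_regs *)` signature; the macro hides the per-syscall argument marshalling.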
VFS: The Kernel's Central Abstraction
The system call dispatcher calls into subsystem-specific code. For file operations — read, write, open, close, mmap — that means the Virtual Filesystem Switch (VFS).
The VFS is the most important application of the C vtable pattern in the kernel. It provides a single set of syscall handlers that work identically regardless of whether the underlying storage is ext4, XFS, NFS, procfs, or a FUSE filesystem. The abstraction is achieved through three primary operations structures.
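The vtable pattern itself fits in a few lines of C. This is a minimal sketch, not kernel code: two fake "filesystems" supply their own read handlers, and the generic layer dispatches purely through the function pointer, never naming filesystem code directly:

```c
#include <assert.h>
#include <string.h>

struct toy_file;

/* Toy analogue of file_operations: one slot per operation. */
struct toy_file_operations {
    long (*read)(struct toy_file *file, char *buf, unsigned long count);
};

struct toy_file {
    const struct toy_file_operations *f_op;
    const char *data;                 /* stands in for the backing store */
};

/* Filesystem #1: /dev/zero-like semantics. */
static long zerofs_read(struct toy_file *file, char *buf, unsigned long count)
{
    (void)file;
    memset(buf, 0, count);
    return (long)count;
}

/* Filesystem #2: reads from an in-memory string. */
static long ramfs_read(struct toy_file *file, char *buf, unsigned long count)
{
    unsigned long n = strlen(file->data);
    if (count < n)
        n = count;
    memcpy(buf, file->data, n);
    return (long)n;
}

static const struct toy_file_operations zerofs_fops = { .read = zerofs_read };
static const struct toy_file_operations ramfs_fops  = { .read = ramfs_read  };

/* The "VFS layer": one code path for every filesystem. */
static long toy_vfs_read(struct toy_file *file, char *buf, unsigned long count)
{
    if (!file->f_op->read)
        return -22;                   /* mirrors -EINVAL */
    return file->f_op->read(file, buf, count);
}
```

Swapping the `f_op` pointer swaps the entire behavior of the file — which is exactly the property the real VFS exploits across 50+ filesystems.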
flowchart TD
subgraph "Userspace"
app["Application: read(fd, buf, count)"]
end
subgraph "Syscall Layer"
sys["sys_read()"]
end
subgraph "VFS Layer"
vfs["vfs_read() → file->f_op->read_iter()"]
end
subgraph "Filesystem Implementations"
ext4["ext4_file_read_iter()"]
xfs["xfs_file_read_iter()"]
nfs["nfs_file_read()"]
proc["proc_reg_read_iter()"]
end
app --> sys --> vfs
vfs --> ext4
vfs --> xfs
vfs --> nfs
vfs --> proc
Operations Structures Deep Dive
The three core VFS operations structures define the contract between the generic VFS layer and individual filesystem implementations.
struct file_operations
include/linux/fs.h#L1926-L1970
struct file_operations {
struct module *owner;
fop_flags_t fop_flags;
loff_t (*llseek)(struct file *, loff_t, int);
ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);
ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);
__poll_t (*poll)(struct file *, struct poll_table_struct *);
int (*mmap)(struct file *, struct vm_area_struct *);
int (*open)(struct inode *, struct file *);
int (*release)(struct inode *, struct file *);
int (*fsync)(struct file *, loff_t, loff_t, int datasync);
int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
...
} __randomize_layout;
Every open file in the system has a struct file that points to the file_operations for its filesystem. When you call read(), the kernel ultimately calls file->f_op->read_iter(). Note the uring_cmd function pointer — this is the io_uring integration point we'll explore in Article 5.
classDiagram
class file_operations {
+llseek()
+read()
+write()
+read_iter()
+write_iter()
+poll()
+mmap()
+open()
+release()
+fsync()
+uring_cmd()
}
class inode_operations {
+lookup()
+create()
+link()
+unlink()
+mkdir()
+rmdir()
+rename()
+setattr()
+getattr()
}
class super_operations {
+alloc_inode()
+destroy_inode()
+dirty_inode()
+write_inode()
+drop_inode()
+put_super()
+sync_fs()
+statfs()
}
file_operations <-- inode_operations : "per-inode"
inode_operations <-- super_operations : "per-filesystem"
struct inode_operations
include/linux/fs.h#L2001-L2025
Inode operations handle namespace operations: looking up files in directories, creating files, making directories, renaming, and managing extended attributes. These operate on the directory tree structure rather than file content.
struct super_operations
include/linux/fs/super_types.h#L83-L112
Superblock operations manage the filesystem as a whole: allocating inodes, writing back dirty inodes, syncing the filesystem, and reporting disk space with statfs.
Each filesystem fills in these three structures (and sometimes struct address_space_operations for page cache integration). The VFS never calls filesystem code directly — it always goes through these function pointers. This is how Linux supports 50+ filesystem types with a single set of syscalls.
Tracing open() → read() → write() Through VFS
Let's trace a read() call through the VFS:
- Userspace calls read(fd, buf, count)
- Syscall entry (as described above) dispatches to sys_read()
- sys_read() calls fdget_pos() to look up the struct file from the file descriptor table
- vfs_read() checks permissions and calls file->f_op->read_iter() (the modern path) or file->f_op->read() (legacy path)
- The filesystem's implementation — say ext4_file_read_iter() — handles the actual I/O
- Data flows through the page cache, possibly triggering block I/O to disk
- The result propagates back up through the VFS to userspace
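The fdget_pos() step deserves a closer look: a file descriptor is nothing more than an index into a per-process table of struct file pointers. Here is a toy model with illustrative names (the real table hangs off current->files and is looked up locklessly under RCU):

```c
#include <assert.h>
#include <stddef.h>

struct toy_file { const char *name; };

#define TOY_MAX_FDS 16

/* Stand-in for one process's file descriptor table. */
static struct toy_file *toy_fd_table[TOY_MAX_FDS];

/* Validate the index, then return the file it names. */
static struct toy_file *toy_fdget(int fd)
{
    if (fd < 0 || fd >= TOY_MAX_FDS || !toy_fd_table[fd])
        return NULL;          /* the real syscall would return -EBADF */
    return toy_fd_table[fd];
}
```

This is why passing a stale or out-of-range descriptor fails fast with EBADF before any filesystem code runs.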
sequenceDiagram
participant U as Userspace
participant S as sys_read()
participant V as vfs_read()
participant F as file->f_op->read_iter()
participant PC as Page Cache
participant BIO as Block I/O
U->>S: read(fd, buf, count)
S->>S: fdget_pos(fd) → struct file
S->>V: vfs_read(file, buf, count, &pos)
V->>V: Permission checks
V->>F: f_op->read_iter(kiocb, iov_iter)
F->>PC: Look up page cache
alt Cache hit
PC-->>F: Return cached data
else Cache miss
PC->>BIO: Submit block I/O
BIO-->>PC: Data from disk
PC-->>F: Return data
end
F-->>V: Bytes read
V-->>S: Bytes read
S-->>U: Return to userspace
The open() path is particularly interesting because it's where the file_operations pointer is assigned. During open(), the VFS calls inode->i_op->lookup() to resolve the path, determines the filesystem type from the mount point, allocates a struct file, and sets file->f_op to the filesystem's file_operations. From that point on, every read(), write(), and mmap() on that file descriptor dispatches through the filesystem's vtable.
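That assignment is a single pointer copy — in the real do_dentry_open() it is essentially `f->f_op = fops_get(inode->i_fop);`. A toy sketch of the handoff, with all names as illustrative stand-ins:

```c
#include <assert.h>

struct toy_fops { long (*read)(void); };

/* A hypothetical filesystem's read handler and operations table. */
static long myfs_read(void) { return 7; }
static const struct toy_fops myfs_fops = { .read = myfs_read };

struct toy_inode { const struct toy_fops *i_fop; };
struct toy_file  { const struct toy_fops *f_op; };

/* open() time: copy the filesystem's ops pointer from the inode into
 * the freshly allocated file. After this, reads consult only f_op. */
static void toy_open(const struct toy_inode *inode, struct toy_file *file)
{
    file->f_op = inode->i_fop;
}
```

Once the pointer is in place, the inode's role in read-path dispatch is over — every subsequent operation on the descriptor goes straight through file->f_op.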
Tip: To understand how a specific filesystem works, find its file_operations definition. For ext4: grep "struct file_operations" fs/ext4/*.c. The function pointers tell you exactly which functions handle each operation.
What's Next
We've now traced the full path from userspace to a filesystem. In the next article, we'll see how io_uring bypasses this entire syscall sequence — sharing memory between kernel and userspace to submit and complete I/O operations without any SYSCALL instruction at all.