io_uring: Inside the Kernel's Async I/O Engine

Advanced

Prerequisites

  • Article 1: architecture-and-directory-map
  • Article 4: syscall-path-and-vfs
  • Understanding of memory ordering and atomic operations
  • Familiarity with async I/O concepts

In Article 4, we traced the full syscall path: privilege transition, register save/restore, Spectre mitigations, dispatch, VFS traversal, and return. That sequence is fast — but when an application does millions of I/O operations per second, the overhead of entering and exiting the kernel for each one becomes the bottleneck. io_uring eliminates this overhead by sharing ring buffers between kernel and userspace, enabling I/O submission and completion without any system call in the hot path.

io_uring was important enough to earn its own top-level directory — one of the very few subsystems promoted out of fs/ into the root of the kernel source tree. As we saw in Article 1's directory map, it's conditionally compiled via obj-$(CONFIG_IO_URING) += io_uring/ in the top-level Kbuild.

Why io_uring Exists

Traditional Linux I/O has two unsatisfying options:

  1. Synchronous syscalls (read/write) — simple but blocking. Each call incurs the full syscall entry/exit path.
  2. aio (Linux AIO) — truly asynchronous but limited to direct I/O on files, with a cumbersome API and high per-operation overhead.

High-performance servers (databases, web servers, storage engines) need to handle hundreds of thousands of I/O operations per second. At that scale, the syscall overhead from Article 4 — swapgs, CR3 switch, pt_regs construction, Spectre mitigations, SYSRET validation — multiplied by every operation becomes significant.

io_uring solves this with a shared-memory design: userspace writes submission queue entries (SQEs) directly into kernel-visible memory, and the kernel writes completion queue entries (CQEs) back. In SQ poll mode, even the notification that new work is available happens without a syscall.

Ring Buffer Architecture

The core data structure is a pair of ring buffers shared between userspace and the kernel through mmap():

flowchart LR
    subgraph "Userspace Process"
        SQT["SQ Tail (written by app)"]
        CQH["CQ Head (written by app)"]
    end
    subgraph "Shared Memory (mmap'd)"
        subgraph "Submission Queue"
            SQE1["SQE 0"]
            SQE2["SQE 1"]
            SQE3["SQE ..."]
            SQE4["SQE N"]
        end
        subgraph "Completion Queue"
            CQE1["CQE 0"]
            CQE2["CQE 1"]
            CQE3["CQE ..."]
            CQE4["CQE N"]
        end
    end
    subgraph "Kernel"
        SQH["SQ Head (written by kernel)"]
        CQT["CQ Tail (written by kernel)"]
    end
    SQT -->|"smp_store_release"| SQE1
    SQH -->|"smp_load_acquire"| SQE1
    CQT -->|"smp_store_release"| CQE1
    CQH -->|"smp_load_acquire"| CQE1

The header comment in io_uring.c documents the memory ordering contract:

io_uring/io_uring.c#L1-L41

A note on the read/write ordering memory barriers that are matched between
the application and kernel side.

After the application reads the CQ ring tail, it must use an
appropriate smp_rmb() to pair with the smp_wmb() the kernel uses
before writing the tail (using smp_load_acquire to read the tail will
do). It also needs a smp_mb() before updating CQ head (ordering the
entry load(s) with the head store), pairing with an implicit barrier
through a control-dependency in io_get_cqe.

This is a lock-free single-producer/single-consumer protocol: the application produces SQEs and consumes CQEs, while the kernel consumes SQEs and produces CQEs. The only synchronization is memory barriers, which are far cheaper than locks or syscalls.

The main context structure that holds all this state is struct io_ring_ctx:

include/linux/io_uring_types.h#L271-L289

struct io_ring_ctx {
    /* const or read-mostly hot data */
    struct {
        unsigned int        flags;
        unsigned int        drain_next: 1;
        unsigned int        task_complete: 1;
        unsigned int        lockless_cq: 1;
        unsigned int        syscall_iopoll: 1;
        ...

Like struct rq from the scheduler (Article 3), the fields are carefully grouped — const or read-mostly hot data is separated from frequently written fields to minimize cache contention.

Tip: The UAPI header include/uapi/linux/io_uring.h defines the structures that userspace sees — struct io_uring_sqe and struct io_uring_cqe. This is the stable ABI boundary discussed in Article 1.

Operation Definition Pattern

io_uring supports dozens of operation types (read, write, send, recv, accept, connect, poll, timeout, etc.), each defined by a struct io_issue_def:

io_uring/opdef.h#L7-L44

struct io_issue_def {
    unsigned        needs_file : 1;
    unsigned        plug : 1;
    unsigned        ioprio : 1;
    unsigned        iopoll : 1;
    unsigned        buffer_select : 1;
    unsigned        hash_reg_file : 1;
    unsigned        unbound_nonreg_file : 1;
    unsigned        pollin : 1;
    unsigned        pollout : 1;
    ...
    unsigned short  async_size;

    int (*issue)(struct io_kiocb *, unsigned int);
    int (*prep)(struct io_kiocb *, const struct io_uring_sqe *);
};

This is another instance of the C vtable pattern, but more fine-grained than the VFS. Each operation declares its capabilities via bitfields (does it need a file? does it support iopoll? does it support buffer selection?) and provides two function pointers: prep to validate and prepare the SQE, and issue to execute the operation.

The operations are collected in a dispatch table indexed by opcode:

io_uring/opdef.c#L54-L60

const struct io_issue_def io_issue_defs[] = {
    [IORING_OP_NOP] = {
        .audit_skip     = 1,
        .iopoll         = 1,
        .prep           = io_nop_prep,
        .issue          = io_nop,
    },
    [IORING_OP_READV] = {
        .needs_file     = 1,
        .unbound_nonreg_file = 1,
        .pollin         = 1,
        .buffer_select  = 1,
        ...
        .prep           = io_prep_readv,
        .issue          = io_read,
    },
    ...

The organization is one-file-per-operation-family: rw.c for read/write, net.c for networking operations, poll.c for polling, timeout.c for timeouts, etc.

Source file            Operations
io_uring/rw.c          READV, WRITEV, READ_FIXED, WRITE_FIXED
io_uring/net.c         SENDMSG, RECVMSG, SEND, RECV, ACCEPT, CONNECT
io_uring/poll.c        POLL_ADD, POLL_REMOVE
io_uring/timeout.c     TIMEOUT, TIMEOUT_REMOVE, LINK_TIMEOUT
io_uring/openclose.c   OPENAT, CLOSE
io_uring/sqpoll.c      SQ poll thread management
io_uring/io-wq.c       Worker thread pool

Operation Lifecycle: prep → issue → completion

When the kernel processes a submission queue entry, it follows a clear lifecycle:

flowchart TD
    A["Userspace writes SQE<br/>to submission queue"] --> B["Kernel reads SQE<br/>(smp_load_acquire)"]
    B --> C["io_issue_defs[opcode].prep(req, sqe)<br/>Validate and prepare"]
    C --> D{"prep result?"}
    D -->|Success| E["io_issue_defs[opcode].issue(req, flags)<br/>Execute operation"]
    D -->|Error| G["Post CQE with error"]
    E --> F{"Result?"}
    F -->|Complete| H["Post CQE to completion queue<br/>(smp_store_release tail)"]
    F -->|Would block| I["Delegate to io-wq<br/>worker thread"]
    I --> H

The prep phase runs synchronously when the kernel drains the SQ. It validates the SQE fields, extracts parameters, and sets up the internal struct io_kiocb request. If preparation fails (bad file descriptor, invalid flags), an error CQE is posted immediately.

The issue phase attempts to complete the operation. For many operations — especially those that hit the page cache — this succeeds immediately. If the operation would block (e.g., reading data not in cache), it returns -EAGAIN and the request is handed off to the io-wq worker pool for asynchronous completion.

Completion writes a CQE with the result to the completion ring. The application sees it on the next CQ drain.

SQ Poll Mode and io-wq Worker Pool

SQ Poll Mode (IORING_SETUP_SQPOLL)

In standard mode, the application still needs one syscall — io_uring_enter() — to tell the kernel "there are new SQEs." SQ poll mode eliminates even this: a dedicated kernel thread continuously polls the submission queue for new entries.

The poll thread runs in io_sq_thread() in io_uring/sqpoll.c. It spins for a configurable period looking for new SQEs, and only goes to sleep (saving CPU) when the queue has been idle. When it sleeps, the application can see IORING_SQ_NEED_WAKEUP in the SQ flags and send a single wakeup syscall.

This achieves true zero-syscall I/O in the steady state: the application writes SQEs and reads CQEs from shared memory, while the kernel's poll thread handles submission. For high-throughput workloads (NVMe storage, high-speed networking), this eliminates the syscall overhead entirely.

flowchart LR
    subgraph "Standard Mode"
        A1["App writes SQE"] --> A2["App calls io_uring_enter()"]
        A2 --> A3["Kernel processes SQEs"]
    end
    subgraph "SQ Poll Mode"
        B1["App writes SQE"] --> B2["Kernel poll thread<br/>sees new SQE"]
        B2 --> B3["Kernel processes SQEs"]
    end

io-wq Worker Pool

Not all operations can complete without blocking. When issue() returns -EAGAIN, the request is queued to the io-wq worker pool implemented in io_uring/io-wq.c. This pool maintains:

  • Bounded workers — for operations whose execution time is bounded, such as regular file and block I/O (their count is capped to avoid resource exhaustion)
  • Unbounded workers — for operations that can block indefinitely, such as reads on sockets and pipes, which may need many concurrent threads

The io-wq subsystem manages thread creation, sleep/wake, and distribution of queued work across these pools. It's essentially a specialized kernel thread pool tuned for io_uring's needs.

Tip: When benchmarking io_uring, watch for io-wq thread creation — if you see many worker threads, your operations are blocking. Switching to direct I/O (O_DIRECT) or ensuring data is in page cache can keep work on the fast inline path.

What's Next

We've now seen the two paths into the kernel: the traditional syscall path (Article 4) and the shared-memory io_uring path. In the final article, we'll explore the newest language addition to the kernel — Rust. We'll see how the kernel crate wraps these same C interfaces (VFS operations, driver registration, the initcall mechanism) in safe Rust abstractions, and walk through a real Rust GPU driver.