io_uring: Inside the Kernel's Async I/O Engine
Prerequisites
- Article 1: architecture-and-directory-map
- Article 4: syscall-path-and-vfs
- Understanding of memory ordering and atomic operations
- Familiarity with async I/O concepts
In Article 4, we traced the full syscall path: privilege transition, register save/restore, Spectre mitigations, dispatch, VFS traversal, and return. That sequence is fast — but when an application does millions of I/O operations per second, the overhead of entering and exiting the kernel for each one becomes the bottleneck. io_uring eliminates this overhead by sharing ring buffers between kernel and userspace, enabling I/O submission and completion without any system call in the hot path.
io_uring was important enough to earn its own top-level directory — one of the very few subsystems promoted out of fs/ into the root of the kernel source tree. As we saw in Article 1's directory map, it's conditionally compiled via obj-$(CONFIG_IO_URING) += io_uring/ in the top-level Kbuild.
Why io_uring Exists
Traditional Linux I/O has two unsatisfying options:
- Synchronous syscalls (`read`/`write`) — simple but blocking. Each call incurs the full syscall entry/exit path.
- aio (Linux AIO) — truly asynchronous but limited to direct I/O on files, with a cumbersome API and high per-operation overhead.
High-performance servers (databases, web servers, storage engines) need to handle hundreds of thousands of I/O operations per second. At that scale, the syscall overhead from Article 4 — swapgs, CR3 switch, pt_regs construction, Spectre mitigations, SYSRET validation — multiplied by every operation becomes significant.
io_uring solves this with a shared-memory design: userspace writes submission queue entries (SQEs) directly into kernel-visible memory, and the kernel writes completion queue entries (CQEs) back. In SQ poll mode, even the notification that new work is available happens without a syscall.
Ring Buffer Architecture
The core data structure is a pair of ring buffers shared between userspace and the kernel through mmap():
flowchart LR
subgraph "Userspace Process"
SQT["SQ Tail (written by app)"]
CQH["CQ Head (written by app)"]
end
subgraph "Shared Memory (mmap'd)"
subgraph "Submission Queue"
SQE1["SQE 0"]
SQE2["SQE 1"]
SQE3["SQE ..."]
SQE4["SQE N"]
end
subgraph "Completion Queue"
CQE1["CQE 0"]
CQE2["CQE 1"]
CQE3["CQE ..."]
CQE4["CQE N"]
end
end
subgraph "Kernel"
SQH["SQ Head (written by kernel)"]
CQT["CQ Tail (written by kernel)"]
end
SQT -->|"smp_store_release"| SQE1
SQH -->|"smp_load_acquire"| SQE1
CQT -->|"smp_store_release"| CQE1
CQH -->|"smp_load_acquire"| CQE1
The header comment in io_uring.c documents the memory ordering contract:
A note on the read/write ordering memory barriers that are matched between
the application and kernel side.
After the application reads the CQ ring tail, it must use an
appropriate smp_rmb() to pair with the smp_wmb() the kernel uses
before writing the tail (using smp_load_acquire to read the tail will
do). It also needs a smp_mb() before updating CQ head (ordering the
entry load(s) with the head store), pairing with an implicit barrier
through a control-dependency in io_get_cqe.
This is a lock-free single-producer/single-consumer protocol: the application produces SQEs and consumes CQEs, while the kernel consumes SQEs and produces CQEs. The only synchronization is memory barriers, which are far cheaper than locks or syscalls.
The main context structure that holds all this state is struct io_ring_ctx:
include/linux/io_uring_types.h#L271-L289
struct io_ring_ctx {
/* const or read-mostly hot data */
struct {
unsigned int flags;
unsigned int drain_next: 1;
unsigned int task_complete: 1;
unsigned int lockless_cq: 1;
unsigned int syscall_iopoll: 1;
...
Like struct rq from the scheduler (Article 3), the fields are carefully grouped — const or read-mostly hot data is separated from frequently written fields to minimize cache contention.
Tip: The UAPI header `include/uapi/linux/io_uring.h` defines the structures that userspace sees — `struct io_uring_sqe` and `struct io_uring_cqe`. This is the stable ABI boundary discussed in Article 1.
Operation Definition Pattern
io_uring supports dozens of operation types (read, write, send, recv, accept, connect, poll, timeout, etc.), each defined by a struct io_issue_def:
struct io_issue_def {
unsigned needs_file : 1;
unsigned plug : 1;
unsigned ioprio : 1;
unsigned iopoll : 1;
unsigned buffer_select : 1;
unsigned hash_reg_file : 1;
unsigned unbound_nonreg_file : 1;
unsigned pollin : 1;
unsigned pollout : 1;
...
unsigned short async_size;
int (*issue)(struct io_kiocb *, unsigned int);
int (*prep)(struct io_kiocb *, const struct io_uring_sqe *);
};
This is another instance of the C vtable pattern, but more fine-grained than the VFS. Each operation declares its capabilities via bitfields (does it need a file? does it support iopoll? does it support buffer selection?) and provides two function pointers: prep to validate and prepare the SQE, and issue to execute the operation.
The operations are collected in a dispatch table indexed by opcode:
const struct io_issue_def io_issue_defs[] = {
[IORING_OP_NOP] = {
.audit_skip = 1,
.iopoll = 1,
.prep = io_nop_prep,
.issue = io_nop,
},
[IORING_OP_READV] = {
.needs_file = 1,
.unbound_nonreg_file = 1,
.pollin = 1,
.buffer_select = 1,
...
.prep = io_prep_readv,
.issue = io_read,
},
...
The organization is one-file-per-operation-family: rw.c for read/write, net.c for networking operations, poll.c for polling, timeout.c for timeouts, etc.
| Source File | Operations |
|---|---|
| `io_uring/rw.c` | READV, WRITEV, READ_FIXED, WRITE_FIXED |
| `io_uring/net.c` | SENDMSG, RECVMSG, SEND, RECV, ACCEPT, CONNECT |
| `io_uring/poll.c` | POLL_ADD, POLL_REMOVE |
| `io_uring/timeout.c` | TIMEOUT, TIMEOUT_REMOVE, LINK_TIMEOUT |
| `io_uring/openclose.c` | OPENAT, CLOSE |
| `io_uring/sqpoll.c` | SQ poll thread management |
| `io_uring/io-wq.c` | Worker thread pool |
Operation Lifecycle: prep → issue → completion
When the kernel processes a submission queue entry, it follows a clear lifecycle:
flowchart TD
A["Userspace writes SQE<br/>to submission queue"] --> B["Kernel reads SQE<br/>(smp_load_acquire)"]
B --> C["io_issue_defs[opcode].prep(req, sqe)<br/>Validate and prepare"]
C --> D{"prep result?"}
D -->|Success| E["io_issue_defs[opcode].issue(req, flags)<br/>Execute operation"]
D -->|Error| G["Post CQE with error"]
E --> F{"Result?"}
F -->|Complete| H["Post CQE to completion queue<br/>(smp_store_release tail)"]
F -->|Would block| I["Delegate to io-wq<br/>worker thread"]
I --> H
The prep phase runs synchronously when the kernel drains the SQ. It validates the SQE fields, extracts parameters, and sets up the internal struct io_kiocb request. If preparation fails (bad file descriptor, invalid flags), an error CQE is posted immediately.
The issue phase attempts to complete the operation. For many operations — especially those that hit the page cache — this succeeds immediately. If the operation would block (e.g., reading data not in cache), it returns -EAGAIN and the request is handed off to the io-wq worker pool for asynchronous completion.
Completion writes a CQE with the result to the completion ring. The application sees it on the next CQ drain.
SQ Poll Mode and io-wq Worker Pool
SQ Poll Mode (IORING_SETUP_SQPOLL)
In standard mode, the application still needs one syscall — io_uring_enter() — to tell the kernel "there are new SQEs." SQ poll mode eliminates even this: a dedicated kernel thread continuously polls the submission queue for new entries.
The poll thread runs in io_sq_thread() in io_uring/sqpoll.c. It spins for a configurable period looking for new SQEs, and only goes to sleep (saving CPU) when the queue has been idle. When it sleeps, the application can see IORING_SQ_NEED_WAKEUP in the SQ flags and send a single wakeup syscall.
This achieves true zero-syscall I/O in the steady state: the application writes SQEs and reads CQEs from shared memory, while the kernel's poll thread handles submission. For high-throughput workloads (NVMe storage, high-speed networking), this eliminates the syscall overhead entirely.
flowchart LR
subgraph "Standard Mode"
A1["App writes SQE"] --> A2["App calls io_uring_enter()"]
A2 --> A3["Kernel processes SQEs"]
end
subgraph "SQ Poll Mode"
B1["App writes SQE"] --> B2["Kernel poll thread<br/>sees new SQE"]
B2 --> B3["Kernel processes SQEs"]
end
io-wq Worker Pool
Not all operations can complete without blocking. When issue() returns -EAGAIN, the request is queued to the io-wq worker pool implemented in io_uring/io-wq.c. This pool maintains:
- Bounded workers — for work with bounded execution time, such as regular file and block device I/O (the worker count is capped to avoid resource exhaustion)
- Unbounded workers — for work that can block indefinitely, such as operations on sockets and pipes, which may need many concurrent threads
The io-wq subsystem manages thread creation, sleep/wake, and work stealing across these pools. It's essentially a specialized kernel thread pool tuned for io_uring's needs.
Tip: When benchmarking io_uring, watch for io-wq thread creation — if you see many worker threads, your operations are blocking. Switching to direct I/O (`O_DIRECT`) or ensuring data is in the page cache can keep work on the fast inline path.
What's Next
We've now seen the two paths into the kernel: the traditional syscall path (Article 4) and the shared-memory io_uring path. In the final article, we'll explore the newest language addition to the kernel — Rust. We'll see how the kernel crate wraps these same C interfaces (VFS operations, driver registration, the initcall mechanism) in safe Rust abstractions, and walk through a real Rust GPU driver.