Day 238: io_uring - Shared Rings for Modern Linux Async IO
The previous lesson explained why async I/O matters for workloads dominated by waiting.
io_uring is one modern Linux answer to the next question: how do we submit and complete lots of I/O efficiently, without paying so much syscall and coordination overhead on every tiny step?
Today's "Aha!" Moment
In classic event-driven I/O, the application often does a dance like this:
- ask the kernel which file descriptors are ready
- wake up in userspace
- issue read or write
- maybe go back into the kernel again
- repeat thousands of times
That model works, but it can spend a lot of time bouncing between user space and kernel space.
io_uring changes the shape of that interaction.
The aha is:
io_uring is not just "another async API" - it is a shared submission/completion model that reduces the per-operation crossing cost between your process and the kernel.
Instead of repeatedly asking the kernel one tiny question at a time, the program writes requests into a submission ring and later reads results from a completion ring.
That changes the cost profile:
- fewer syscalls in the hot path
- better batching
- less per-request overhead
- a more explicit completion model
So io_uring is best understood as a mechanism for making large amounts of async I/O cheaper to drive, not as a magic accelerator for every workload.
Why This Matters
Imagine a proxy server handling many client sockets while also reading and writing files for caching.
With a traditional readiness model, the server may:
- wait in epoll
- wake up
- issue many recv and send calls
- go back to sleep
That already scales better than one blocked thread per socket, but it still involves many trips across the user/kernel boundary and lots of small coordination steps.
io_uring tries to reduce that overhead by letting the process and the kernel communicate through shared rings:
- one side submits work
- the other side posts completions
This matters when:
- I/O volume is high
- operations are small and frequent
- batching helps
- per-operation overhead starts competing with useful work
It also matters because it gives us a more completion-oriented mental model:
- not just "which fd is ready?"
- but "which specific operations completed, with what result?"
That is why io_uring belongs after the async fundamentals lesson. It is not the introduction to event-driven concurrency. It is a more concrete Linux mechanism for pushing that style harder.
Learning Objectives
By the end of this session, you will be able to:
- Explain why io_uring exists - Describe the overheads in older async styles that shared submission/completion rings are trying to reduce.
- Trace the mechanism - Show how SQEs and CQEs move through the submission and completion rings.
- Evaluate the trade-off - Recognize when io_uring is a good fit and when its kernel dependence, complexity, or workload shape make simpler models more appropriate.
Core Concepts Explained
Concept 1: io_uring Exists to Reduce Per-I/O Coordination Overhead
In older Linux async patterns, we often separate two phases:
- discover readiness
- then issue the actual operation
That means a lot of interaction can look like:
wait for readiness
enter userspace
issue read/write
return to kernel
wait again
This is already better than one blocked thread per connection, but the hot path can still be dominated by:
- syscall overhead
- repeated wakeups
- fragmented submission of many tiny operations
io_uring exists because the kernel can often do better if the application describes work in bulk and consumes completions in bulk.
So the real target is not "make async possible." Async was already possible.
The target is:
- make high-frequency async submission and completion cheaper
That is why io_uring is especially compelling in high-throughput servers, proxies, storage engines, and runtimes that drive many concurrent I/O operations.
Concept 2: The Core Mechanism Is a Submission Queue Ring and a Completion Queue Ring
The name tells the story:
- there is a submission queue where user space places requests
- there is a completion queue where the kernel places results
Conceptually:
userspace prepares SQEs ---> kernel consumes them
kernel posts CQEs ---> userspace consumes them
ASCII sketch:
userspace                            kernel
---------                            ------
fill SQE: read fd=7, buf=X   ---->
fill SQE: write fd=9, buf=Y  ---->
submit tail update           ---->
                                     process requests
                                     complete read
                                     complete write
        <---- read CQE: res=128
        <---- write CQE: res=64
The important idea is that the process and kernel are coordinating through shared memory-backed ring structures rather than treating each operation as a fully separate conversational round trip.
That opens the door to:
- batching many requests together
- amortizing syscall cost
- making completions explicit and structured
And because completions refer to specific operations, the program can reason at the operation level rather than only at the file-descriptor readiness level.
Concept 3: io_uring Improves the Right Workloads, but It Also Exposes More Kernel-Specific Complexity
io_uring is powerful, but it is not automatically the best option.
Benefits often include:
- lower per-I/O overhead
- better batching
- support for a wide range of operations
- a clean completion-based model
But costs include:
- Linux-specific design and APIs
- more moving parts than simpler blocking code
- kernel-version sensitivity for advanced features
- the need to understand registration, polling modes, worker behavior, and buffer/file management if you want the real gains
This is the central trade-off:
- you can buy performance and flexibility by moving closer to the kernel's async machinery
- but you also accept more operational and implementation complexity
And, as the previous lesson stressed, it still does not fix CPU-heavy handlers.
If your bottleneck is:
- JSON parsing
- compression
- encryption
- expensive business logic
then io_uring cannot save you from the fact that the program is CPU-bound.
It shines when the bottleneck really is high-rate I/O coordination.
Troubleshooting
Issue: "io_uring is just epoll with a new name."
Why it happens / is confusing: Both are associated with scalable Linux I/O.
Clarification / Fix: epoll is primarily a readiness notification mechanism. io_uring is a broader submission/completion system that aims to reduce per-operation overhead and support richer async workflows.
Issue: "If we adopt io_uring, every I/O workload gets faster."
Why it happens / is confusing: The API is often presented in performance-focused discussions.
Clarification / Fix: The win depends on workload shape. It helps most when many I/O operations are in flight and coordination overhead matters. It does not erase CPU bottlenecks or bad application structure.
Issue: "io_uring makes kernel details irrelevant."
Why it happens / is confusing: Higher-level libraries hide some of the surface area.
Clarification / Fix: The interface is still closely tied to Linux kernel capabilities and versions. If you want the real gains, you still need to understand how the ring, workers, registration, and completion behavior actually work.
Advanced Connections
Connection 1: io_uring <-> Async IO Fundamentals
The parallel: The previous lesson explained the why of event-driven waiting. io_uring is one concrete Linux mechanism that makes submission and completion of many async operations cheaper and more structured.
Connection 2: io_uring <-> Memory Models & Ordering
The parallel: Both rely on explicit reasoning about shared state and visibility. With io_uring, the shared rings themselves are coordination structures whose producer/consumer semantics must be respected carefully.
Resources
- [DOC] io_uring_setup(2)
- [DOC] io_uring_enter(2)
- [DOC] io_uring_register(2)
- [DOC] liburing Project
- [DOC] io_uring zero copy Rx - Linux kernel docs
Key Insights
- io_uring is about cheaper async I/O coordination - Its main goal is reducing per-operation submission/completion overhead, not merely making async possible.
- The mechanism is explicit shared rings - User space submits SQEs, the kernel posts CQEs, and both sides coordinate through ring structures designed for batching and throughput.
- It is powerful but not free - io_uring can improve the right Linux workloads substantially, but it also introduces kernel-specific complexity and does not solve CPU-bound work.
Knowledge Check
1. What is the main problem io_uring is trying to reduce?
- A) The existence of caches in the CPU
- B) The coordination and syscall overhead of driving large volumes of async I/O
- C) The need for file descriptors
2. What do SQEs and CQEs represent?
- A) Submission queue entries describing requests and completion queue entries describing results
- B) Two kinds of mutexes
- C) Compiler optimization passes
3. When is io_uring a poor fit by itself?
- A) When the main bottleneck is CPU-heavy work rather than high-rate I/O coordination
- B) When the program uses Linux
- C) When operations have completion results
Answers
1. B: io_uring is aimed at reducing the cost of submitting and completing large numbers of async I/O operations.
2. A: Submission queue entries describe work to do; completion queue entries describe the result of that work.
3. A: io_uring helps with I/O coordination overhead, not with CPU-heavy handlers or business logic.