Linux I/O

Preface

Last time I introduced how computer data is stored. This time I want to dig into and summarize Linux I/O transfer. First, let’s quickly review a few concepts from last time.

Disk: the user’s persistent storage medium

PageCache - disk cache - kernel buffer: a cache of disk contents; its storage is RAM, so it’s faster

Virtual memory: memory exposed to applications; the basic unit is a page, mapped to physical memory by the MMU

Next, I’ll introduce a few more concepts to help you better understand the final part. At the end, I’ll cover Linux’s “traditional copy” and the more mainstream “advanced copy” (yeah, I made up that name… just to distinguish it from the traditional one; essentially it’s using zero-copy techniques).

User Mode and Kernel Mode

The core of an operating system is the kernel. It’s a program independent of normal applications: it can access protected memory regions and has permission to access low-level hardware devices. To prevent user processes from operating on the kernel and to ensure kernel security, the OS splits virtual memory into two parts: kernel space and user space.

Summary (I’ll mention this repeatedly below):

Kernel modules run in kernel space, and the corresponding process is in kernel mode;

User programs run in user space, and the corresponding process is in user mode.

Introducing DMA

The CPU is the brain of the computer. If it had to watch over every little thing all the time, its efficiency would be terrible. Modern CPUs run at very high frequencies, so computing is much faster than moving data. Traditionally, if the CPU needs to compute on some data, that data has to be read into memory first. The computation might take nanoseconds, while a disk read typically takes microseconds to milliseconds, and during that time the CPU is basically waiting. If something else could do that part, the CPU could do little or even no copying and just compute. That leads to DMA (Direct Memory Access). Today, most computers have DMA hardware: on older Intel platforms the DMA controller was integrated into the southbridge, and modern devices such as disk controllers and NICs usually carry their own DMA engines. The CPU only sets up the transfer and handles the completion interrupt; the data itself is moved by the DMA controller, which saves CPU resources for other work.

Traditional Copy

Direct I/O in User Space

Direct user-space I/O means data moves between the device and the buffer of the application process (or of a library function running in user mode) without being staged in the kernel. “Bypassing the kernel” here means bypassing the PageCache in kernel space: the request still enters the kernel, but apart from necessary setup such as virtual memory configuration, the kernel does not buffer the data. This avoids the extra copy through the kernel buffer, and these applications usually maintain their own data cache in process space (user space).

The downside is obvious: with no cache in between, every access goes to the disk, and because of the speed gap between CPU and disk I/O the process can waste a lot of time waiting. The usual solution is to pair it with asynchronous I/O. Note that this approach is usually suitable for very large file I/O (typically GB-level).
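As a rough illustration of direct I/O from user space, here is a minimal Python sketch that opens a file with O_DIRECT on Linux. The name read_direct and the 4096-byte BLOCK constant are my own assumptions for the example; the real alignment requirement depends on the device and filesystem, and some filesystems (tmpfs on older kernels, for instance) reject O_DIRECT entirely.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real value depends on device and filesystem


def read_direct(path, length=BLOCK):
    """Read up to `length` bytes from the start of `path`, bypassing the PageCache."""
    if length <= 0 or length % BLOCK != 0:
        raise ValueError("length must be a positive multiple of BLOCK")
    # O_DIRECT is Linux-specific; open() raises OSError where it is unsupported.
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        # An anonymous mmap is page-aligned, which satisfies the buffer
        # alignment that O_DIRECT transfers require.
        buf = mmap.mmap(-1, length)
        try:
            n = os.readv(fd, [buf])  # data goes from the device into our own buffer
            return buf[:n]
        finally:
            buf.close()
    finally:
        os.close(fd)
```

Because nothing is cached by the kernel on this path, a real application would keep its own cache on top of a function like this, which is exactly the trade-off described above.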

Using PageCache (Kernel Buffer)

I previously introduced the kernel buffer (i.e., PageCache). When reading small files, each read pulls a bit of the file into the kernel buffer, and if the next read hits that cache instead of going to the disk, efficiency improves a lot.

This is also the most common I/O path for traditional copying. Let’s focus on analyzing the overhead of this read path (“switch” means context switch, “copy” means a copy operation):

The user process calls read() to initiate a system call to the kernel; context switches from user mode to kernel mode 【switch-1】
In kernel mode, the CPU uses the DMA controller to copy data from disk to the kernel buffer in kernel space 【copy-1】
The CPU copies data from the kernel buffer to the user buffer 【copy-2】
read() returns; context switches from kernel mode back to user mode 【switch-2】
In user mode, the user process calls write() to initiate a system call to the kernel; context switches from user mode to kernel mode 【switch-3】
The CPU copies data from the user buffer to the network buffer in kernel space (socket buffer) 【copy-3】
The CPU uses the DMA controller to copy data from the network buffer to the NIC for transmission 【copy-4】
write() returns; context switches from kernel mode back to user mode 【switch-4】

As you can see, with the traditional kernel-buffer-based copy, you go through four copies (two by CPU + two by DMA) and four context switches. In high-concurrency scenarios, frequent I/O calls and frequent context switches hurt performance a lot.
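The read()/write() path above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name copy_with_read_write and the 64 KB chunk size are made up, and the comments map each call back to the switch/copy labels in the list.

```python
import os


def copy_with_read_write(path, sock, chunk_size=64 * 1024):
    """Send a file over a connected socket using the traditional read + write loop."""
    sent = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            # read(): switch-1, copy-1 (DMA: disk -> kernel buffer),
            # copy-2 (CPU: kernel buffer -> user buffer), switch-2
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break  # end of file
            # send(): switch-3, copy-3 (CPU: user buffer -> socket buffer),
            # copy-4 (DMA: socket buffer -> NIC), switch-4
            sock.sendall(chunk)
            sent += len(chunk)
    finally:
        os.close(fd)
    return sent
```

Note that the whole cycle repeats for every chunk, so a large file pays the four switches many times over.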

Advanced Copy - Zero Copy

The concept of zero-copy isn’t super strictly defined. Or rather, zero-copy doesn’t mean there are literally zero copies. But its core goal is very clear: improve I/O performance.

The mainstream ideas can roughly be grouped into three categories:

Reduce or even avoid copying between user space and kernel space: In some scenarios, the user process doesn’t need to access or process the data during transfer. Then the transfer between Linux Page Cache and the user process buffer can be avoided entirely—keeping copies fully within the kernel, or even avoiding kernel copies via more clever approaches. This category is usually implemented by adding new system calls, such as mmap() and sendfile() in Linux.
Direct I/O bypassing the kernel: Allow a user-mode process to bypass the kernel and transfer data directly with hardware. During transfer, the kernel only handles some management and auxiliary work. This is somewhat similar to the first approach in that it tries to avoid user/kernel data transfer, but the first approach completes the transfer in kernel mode, while this one bypasses the kernel and talks to hardware directly—similar effect, totally different principle.
Optimizing transfers between kernel buffers and user buffers: This focuses on optimizing CPU copies between the user process buffer and the OS page cache. It continues the traditional communication approach, but in a more flexible way.

Image from GitHub user Andy Pan (reference 3)

`mmap()` Replaces `read()`

The idea of mmap() is to map the user buffer to the kernel buffer, which reduces one CPU copy into the user buffer. Everything else stays the same, but you pay extra mapping overhead. The flow:

The user process calls mmap() to initiate a system call to the kernel; context switches from user mode to kernel mode 【switch-1】
In kernel mode, the CPU uses the DMA controller to copy data from disk to the kernel buffer in kernel space (the user buffer is already mapped to it at this point; in practice Linux usually loads the pages lazily, on first access) 【copy-1】
mmap() returns; context switches from kernel mode back to user mode 【switch-2】
In user mode, the user process calls write() to initiate a system call to the kernel; context switches from user mode to kernel mode 【switch-3】
The CPU copies data from the kernel buffer (which the user process sees through the mapping) to the network buffer in kernel space (socket buffer) 【copy-2】
The CPU uses the DMA controller to copy data from the network buffer to the NIC for transmission 【copy-3】
write() returns; context switches from kernel mode back to user mode 【switch-4】

This has two benefits. First, it saves memory: the region in the user process is virtual and doesn’t occupy physical memory of its own, because it’s mapped to the kernel buffer where the file resides, so it can save about half the memory footprint. Second, it removes one CPU copy: compared to traditional Linux I/O read/write, the data no longer has to be copied into a separate user buffer, and the one remaining CPU copy is completed directly in the kernel. So after using mmap(), the copy count becomes 2 DMA copies + 1 CPU copy = 3 copy operations in total. However, since mmap() is still a system call, there are still 4 user/kernel mode switches.
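Here is a minimal Python sketch of the mmap + write idea, assuming a connected socket is passed in. The function name send_with_mmap is made up for the example. Wrapping the mapping in a memoryview lets sendall() read straight from the mapped pages instead of first building a separate bytes object in user space.

```python
import mmap
import os


def send_with_mmap(path, sock):
    """Send a file over a connected socket by mapping it instead of read()ing it."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size == 0:
            return 0  # mmap cannot map an empty file
        # mmap(): the process now sees the file's kernel buffer pages directly
        with mmap.mmap(fd, size, access=mmap.ACCESS_READ) as mapped:
            # The memoryview must be released before the mapping is closed,
            # which the nested `with` takes care of.
            with memoryview(mapped) as view:
                # write()/send(): the CPU copies from the mapped pages
                # into the socket buffer
                sock.sendall(view)
        return size
    finally:
        os.close(fd)
```

The process could still inspect or slice `view` before sending, which is the main thing mmap keeps compared with approaches that hide the data from user space.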

`sendfile()` Replaces `read()` and `write()`

sendfile() is a system call that replaces the read() and write() pair. It is often dated to the Linux 2.1 kernel, and the man page lists it as a new feature in Linux 2.2. sendfile completely hides the I/O data from user space. It’s suitable when user space doesn’t need to process the data, allowing data to be transferred entirely within kernel space. This avoids copying between user space and kernel space, saving one copy and two context switches. The flow:

The user process calls sendfile() to initiate a system call to the kernel; context switches from user mode to kernel mode 【switch-1】
In kernel mode, the CPU uses the DMA controller to copy data from disk to the kernel buffer in kernel space 【copy-1】
Next, the CPU directly copies data from the kernel buffer to the network buffer (socket buffer) 【copy-2】
The CPU uses the DMA controller to copy data from the network buffer to the NIC for transmission 【copy-3】
sendfile() returns; context switches from kernel mode back to user mode 【switch-2】

As you can see, sendfile reduces one copy and two context switches. Unlike the mmap memory-mapping approach, sendfile makes the I/O data completely invisible to user space, so it’s limited to cases where user space doesn’t need to modify the data.
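From user space the sendfile path looks like the sketch below, written against Python’s os.sendfile() wrapper on Linux. The function name send_with_sendfile is my own, and the loop is there because a single call may transfer fewer bytes than requested.

```python
import os


def send_with_sendfile(path, sock):
    """Send a whole file over a connected socket without reading it into user space."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            # One system call per iteration: the kernel moves the bytes from
            # the file's kernel buffer to the socket; user space only sees
            # the number of bytes transferred.
            sent = os.sendfile(sock.fileno(), fd, offset, size - offset)
            if sent == 0:
                break  # the file shrank while we were sending
            offset += sent
        return offset
    finally:
        os.close(fd)
```

Notice that no file data ever appears in a Python variable here, which is the invisibility described above. Python also offers a higher-level socket.sendfile() method that uses os.sendfile() where it is available.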

`sendfile` - Linux 2.4 Kernel Enhancement

From the flow, you can see data is copied from disk into the kernel buffer, then again into the network buffer. Is that extra transfer necessary? Linux 2.4 modified the sendfile system call and introduced a gather operation for DMA copying. It records the corresponding data description information (memory address, offset) from the kernel buffer in kernel space into the network buffer. Then DMA uses the memory address and offset to batch-copy data from the read kernel buffer to the NIC device, eliminating the remaining 1 CPU copy in kernel space. The flow:

The user process calls sendfile() to invoke the OS; context switches from user mode to kernel mode 【switch-1】
In kernel mode, the DMA controller reads data from disk into the read buffer 【copy-1】
The CPU writes only the data description information (memory address and offset) into the socket buffer; since no file data is moved, this is not considered a copy
The DMA controller uses scatter and gather to copy data from the read buffer to the NIC 【copy-2】
sendfile() returns; context switches from kernel mode back to user mode 【switch-2】
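To put the four paths side by side, here is a small Python tally of the copy and context-switch counts given in this article. The class and variable names are made up for illustration, and the numbers are simply the ones from the flows above, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoPath:
    name: str
    cpu_copies: int
    dma_copies: int
    switches: int

    @property
    def copies(self):
        # Total copies = CPU copies + DMA copies
        return self.cpu_copies + self.dma_copies


# Counts taken from the flows described in this article
PATHS = [
    IoPath("read + write", cpu_copies=2, dma_copies=2, switches=4),
    IoPath("mmap + write", cpu_copies=1, dma_copies=2, switches=4),
    IoPath("sendfile", cpu_copies=1, dma_copies=2, switches=2),
    IoPath("sendfile + DMA gather", cpu_copies=0, dma_copies=2, switches=2),
]


def summarize(paths=PATHS):
    """Return one human-readable line per I/O path."""
    return [
        f"{p.name}: {p.copies} copies "
        f"({p.cpu_copies} CPU + {p.dma_copies} DMA), "
        f"{p.switches} context switches"
        for p in paths
    ]
```

Printing summarize() line by line makes the progression clear: each step removes CPU copies or context switches, while the two DMA copies stay in every path.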

Copy Techniques Used by Some Middleware

RocketMQ chooses the mmap + write zero-copy approach, suitable for persistence and transfer of small chunks like business-level messages;

Kafka uses the sendfile zero-copy approach, suitable for persistence and transfer of large chunks with high throughput, like system log messages. But one thing worth noting: Kafka’s index files use mmap + write, while data files use sendfile.

References

https://www.bilibili.com/video/BV1cJ411K7HW
https://www.bilibili.com/video/BV16J411p7f1
https://github.com/panjf2000?tab=repositories

All articles in this blog, unless otherwise stated, are licensed under @Oreoft . Please indicate the source when reprinting!