Trends in eBPF for System Acceleration
I've recently taken some time to review research papers and explore emerging trends in eBPF, reflecting on its evolving landscape. In this article, I focus on the growing trend of using eBPF for system acceleration. Specifically, I will discuss three key areas:
- Network performance optimization,
- System I/O acceleration, and
- Observability with minimal overhead.
Introduction
Paper list
Toward eBPF-Accelerated Pub-Sub Systems
What:
Presents BPF-Broker, a broker whose message dissemination path is implemented fully in eBPF. The design decouples the control and data paths: topic registration and subscriber management occur in a user-space process, while message dissemination in response to publish requests is performed entirely in-kernel in eBPF hooks. The user-space control logic stores the subscriber list for each topic in a map-of-maps accessed by eBPF programs at the traffic control (TC) layer. BPF-Broker intercepts incoming packets at the TC ingress hook. For publish requests that fit in a single UDP packet, it uses the map-of-maps to retrieve the subscriber list for the corresponding topic, then uses bpf_clone_redirect() to replicate messages and send them to subscribers directly via the NIC transmit queues, bypassing the traditional protocol processing, sockets, queues, and user-space crossings altogether. To further accelerate dissemination, BPF-Broker identifies, in the XDP hook, when a publish message targets a topic with only one subscriber and processes it right there, avoiding handing the packet to the kernel stack at all. A hedged sketch of the TC data path follows.
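This is an illustration of my reading of the design, not BPF-Broker's actual code: the map names, the inner-map layout (subscriber slot to egress ifindex), and the topic parsing are assumptions, and the per-subscriber header rewrite is elided.

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define MAX_SUBS 64

/* Inner map template: subscriber slot -> egress ifindex (hypothetical layout). */
struct subs_inner {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, MAX_SUBS);
    __type(key, __u32);
    __type(value, __u32);
};

/* Outer map-of-maps: topic ID -> that topic's subscriber map. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
    __uint(max_entries, 1024);
    __type(key, __u32);
    __array(values, struct subs_inner);
} topic_subs SEC(".maps");

SEC("tc")
int broker_ingress(struct __sk_buff *skb)
{
    __u32 topic = 0; /* parsed from the UDP payload in a real program */
    void *subs = bpf_map_lookup_elem(&topic_subs, &topic);
    __u32 i;

    if (!subs)
        return TC_ACT_OK; /* not a publish for a known topic */

    for (i = 0; i < MAX_SUBS; i++) {
        __u32 *ifindex = bpf_map_lookup_elem(subs, &i);
        if (!ifindex || !*ifindex)
            break;
        /* Clone the packet and queue the copy on the subscriber's
         * device; real code would rewrite headers per subscriber. */
        bpf_clone_redirect(skb, *ifindex, 0);
    }
    return TC_ACT_SHOT; /* the original publish is consumed in-kernel */
}

char _license[] SEC("license") = "GPL";
```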
Points:
- However, XDP lacks support for packet cloning, which is essential for topics with multiple subscribers. BPF-Broker can therefore use XDP only when a topic has a single subscriber. Although TC introduces slightly more overhead than XDP, the paper's measurements show that the additional per-packet latency is modest, on the order of 1 µs. (A sketch of the single-subscriber XDP fast path is below.)
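For completeness, the single-subscriber fast path might look roughly like this. Again a sketch under assumptions: the single_sub map, the sub_info layout, and the topic parsing are hypothetical, and the header rewrite is elided.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct sub_info {
    __u32 daddr;  /* subscriber IPv4 address */
    __u16 dport;  /* subscriber UDP port */
};

/* Topics known to have exactly one subscriber. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);
    __type(value, struct sub_info);
} single_sub SEC(".maps");

SEC("xdp")
int broker_fastpath(struct xdp_md *ctx)
{
    __u32 topic = 0; /* parsed from the UDP payload in a real program */
    struct sub_info *sub = bpf_map_lookup_elem(&single_sub, &topic);

    if (!sub)
        return XDP_PASS; /* multi-subscriber topics go to the TC path */

    /* Rewrite Ethernet/IP/UDP headers toward *sub here, then bounce
     * the frame straight back out of the same NIC: no sk_buff is ever
     * allocated, which is where the CPU savings come from. */
    return XDP_TX;
}

char _license[] SEC("license") = "GPL";
```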
Related work on using eBPF to accelerate networked applications:
- BMC accelerates Memcached by serving UDP GET requests in XDP from an in-kernel cache, while using the TC egress hook to monitor responses and maintain cache coherence.
- Electrode accelerates distributed protocols (e.g., MultiPaxos) by offloading performance-critical operations like broadcasting and quorum handling to XDP and TC egress.
- DINT targets distributed transaction processing by offloading key-value access, locking, and logging into the kernel via eBPF. It uses XDP for parsing and execution, and TC egress for response finalization and state synchronization.
- BOAD optimizes broadcast and aggregation using XDP for early processing and TC egress for coordinated replication.
- XAgg implements in-kernel gradient aggregation using XDP to improve distributed machine learning performance.
- BPF-Broker targets generic pub-sub workloads and performs in-kernel message replication. It opportunistically uses XDP to handle single-subscriber topics, avoiding kernel buffer allocation and reducing CPU overhead: XDP handles lightweight fast paths, while TC ingress manages cloning for multi-subscriber delivery.
What other use cases exist for eBPF-based system acceleration?
A Memory Pool Allocator for eBPF Applications
What:
Presents Kerby, a memory pool allocator for eBPF. The idea is to pool memory for eBPF applications and manage it dynamically: Kerby divides a pre-allocated memory pool into multiple fixed-size blocks, and these blocks are dynamically combined to represent variable-length data. For example, a 1024-B object uses eight 128-B blocks. The system implements an in-kernel memory management mechanism using eBPF maps, built from three core components (a sketch of the allocation path follows the list):
- Memory Allocation Map – Translates each opaque object ID (e.g., key or sequence number) to one or more block indices in the memory pool. Depending on the identifier size and access pattern, this map can be a BPF hash map, array map, or per-CPU array map.
- Memory Pool – A preallocated memory region divided into fixed-size blocks, implemented as a BPF hash map. Each block index maps to a unique block, and the hash map’s internal collision handling guarantees consistent access to these blocks.
- Index Allocator – A single-entry BPF array map that maintains a monotonically increasing block index counter. New block indices are allocated atomically using __atomic_fetch_add, ensuring thread-safe, concurrent allocation.
Together, these components provide a lightweight, lock-free memory management abstraction entirely within the eBPF map framework, allowing dynamic data storage without relying on traditional kernel memory allocation.
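Putting the three components together, the allocation path could look like the minimal sketch below. This is my reading of the design, not Kerby's code: the map names, sizes, and the object-metadata layout are assumptions.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define BLOCK_SIZE 128
#define POOL_BLOCKS 4096
#define MAX_BLOCKS_PER_OBJ 8   /* 8 x 128 B = 1024-B objects at most */

struct block {
    __u8 data[BLOCK_SIZE];
};

/* Object metadata: which pool blocks back this object. */
struct obj_meta {
    __u32 nblocks;
    __u32 blocks[MAX_BLOCKS_PER_OBJ];
};

/* Memory Allocation Map: opaque object ID -> block indices. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u64);
    __type(value, struct obj_meta);
} alloc_map SEC(".maps");

/* Memory Pool: block index -> fixed-size block. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, POOL_BLOCKS);
    __type(key, __u32);
    __type(value, struct block);
} pool SEC(".maps");

/* Index Allocator: single-entry counter bumped atomically. */
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} next_idx SEC(".maps");

static __always_inline int alloc_blocks(__u64 obj_id, __u32 nblocks)
{
    __u32 zero = 0;
    __u64 *ctr = bpf_map_lookup_elem(&next_idx, &zero);
    struct obj_meta meta = {};
    __u32 i;

    if (!ctr || nblocks > MAX_BLOCKS_PER_OBJ)
        return -1;

    meta.nblocks = nblocks;
    for (i = 0; i < MAX_BLOCKS_PER_OBJ && i < nblocks; i++)
        /* Lock-free, thread-safe index allocation across CPUs. */
        meta.blocks[i] = __atomic_fetch_add(ctr, 1, __ATOMIC_RELAXED)
                         % POOL_BLOCKS;

    /* Record the object -> blocks translation; writes into the pool
     * then go through bpf_map_update_elem() on each block index. */
    return bpf_map_update_elem(&alloc_map, &obj_id, &meta, BPF_ANY);
}
```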
Points:
- Does Kerby allocate memory via BPF_MAP_TYPE_HASH, and is pre-allocation required?
- The BPF_F_NO_PREALLOC flag disables the default pre-allocation of entries in BPF hash maps. By default, BPF hash maps pre-allocate all entries to improve performance and simplify memory management, but this can lead to high memory consumption. With BPF_F_NO_PREALLOC, memory is allocated dynamically, only when entries are actually inserted, resulting in more efficient memory usage, especially for sparse maps. For example:
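In map-definition terms, the flag is set like this (a trivial but concrete example):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Entries are allocated lazily on insert rather than all up front,
 * trading some insert-time latency for a much smaller footprint
 * when the map stays sparse. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __uint(map_flags, BPF_F_NO_PREALLOC);
    __type(key, __u64);
    __type(value, __u64);
} sparse_map SEC(".maps");
```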
Related work on dynamic memory allocation in eBPF:
- bpf_obj_new enables the creation of objects at run time, but only for fixed-size objects defined at compile time, making it hard to handle variable-length data.
- BPF Arena allows memory sharing between BPF programs and user-space applications, with memory allocated on demand. However, it currently cannot allocate or deallocate memory at the XDP hook, since allocation requires a sleepable context while XDP runs in an atomic (i.e., non-sleepable) context. It may also cause internal fragmentation, as it allocates memory in 4 KB pages, and it supports at most 4 GB of memory.
- KFlex is a kernel extension framework that includes extension heaps for dynamic allocation, but its memory feature is hard to integrate into the mainline kernel because its safety-check principles differ from the verifier's.
Can we design a memory allocator for XDP that supports packet cloning and broadcast redirect?
ChainIO: Bridging Disk and Network Domains with eBPF
What:
Introduces ChainIO, a unified syscall-chaining framework that bridges the network stack and storage I/O. The system rewrites POSIX I/O calls into batched submissions, unifies memory management across domains through shared regions, coordinates cross-domain operations while preserving correctness, and adaptively optimizes for tail latency. ChainIO requires no application modifications and achieves significant performance gains by eliminating redundant context switches and memory copies. The design extends easily to other POSIX services with mixed storage and network workloads.
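The paper's own mechanism is eBPF-based and its API is not reproduced here. As a rough illustration of what chaining a disk read into a network send buys, the same idea can be expressed with io_uring linked SQEs: both operations go into one submission, and the kernel runs the send only after the read completes, with no user-space round trip in between.

```c
#include <liburing.h>

/* Illustration only: not ChainIO's API. Chains "read file_fd into buf"
 * and "send buf on sock_fd" as one linked submission. */
int chain_read_send(int file_fd, int sock_fd, void *buf, unsigned len)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    int ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0)
        return ret;

    /* First link: read file data into buf. */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, file_fd, buf, len, 0);
    sqe->flags |= IOSQE_IO_LINK;   /* next SQE runs only if this succeeds */

    /* Second link: send the same buffer out on the socket. */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_send(sqe, sock_fd, buf, len, 0);

    ret = io_uring_submit(&ring);  /* one syscall covers both operations */
    io_uring_queue_exit(&ring);
    return ret;
}
```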
Use case: ClickHouse
Points:
- Cross-domain ring bridge.
Is it possible to build a new control/data-decoupled architecture for turbo-charging kernel I/O and transport (as Kmesh does for Kubernetes)?
eTran: Extensible Kernel Transport with eBPF
What:
This paper takes the kernel approach and tackles the following challenging problem: how can the kernel transport be made extensible to enable agile customization, while achieving kernel safety, strong protection, and high performance? This is challenging because 1) customizability might expose new attack surfaces in the kernel, creating safety challenges; and 2) strong protection requires keeping the entire transport state inside the kernel (e.g., the kernel TCP stack), which usually runs contrary to high performance due to kernel overhead (e.g., user-kernel crossings) and makes customization hard due to the fixed kernel implementation.
It introduces eTran (extensible kernel Transport), a system for agilely customizing kernel transports. eTran achieves agile customizability and kernel safety by 1) leveraging existing eBPF infrastructure such as built-in data structures (eBPF maps), BPF timers, and XDP for fast packet I/O, and 2) extending it with new eBPF hooks and maps to support complex transport functionality while conforming to the strict eBPF verifier for safety.
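As one example of the building blocks mentioned above, here is a hedged sketch of arming a per-flow retransmission timer with the BPF timer facility. This is not eTran's code: the flow parsing, the retransmit logic, and the map layout are assumptions, and real code would avoid re-initializing an already-armed timer.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define CLOCK_MONOTONIC 1

struct flow_state {
    struct bpf_timer rtx_timer;
    __u32 unacked;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u64);               /* flow ID */
    __type(value, struct flow_state);
} flows SEC(".maps");

/* Fires in-kernel when the timeout expires; retransmission logic
 * would run here without any user-space involvement. */
static int rtx_fire(void *map, __u64 *flow_id, struct flow_state *fs)
{
    return 0;
}

SEC("xdp")
int on_packet(struct xdp_md *ctx)
{
    __u64 flow_id = 0; /* parsed from packet headers in a real program */
    struct flow_state *fs = bpf_map_lookup_elem(&flows, &flow_id);

    if (!fs)
        return XDP_PASS;

    /* Init returns -EBUSY if already initialized; a real transport
     * would track that instead of calling init per packet. */
    bpf_timer_init(&fs->rtx_timer, &flows, CLOCK_MONOTONIC);
    bpf_timer_set_callback(&fs->rtx_timer, rtx_fire);
    bpf_timer_start(&fs->rtx_timer, 200ULL * 1000 * 1000 /* 200 ms */, 0);
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```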
Points:
- Scope is limited to the transport layer.
bpfCP: Efficient and Extensible Process Checkpointing via eBPF
What:
Proposes bpfCP, a process checkpointing scheme based on eBPF. An eBPF program running in the kernel can naturally read the in-kernel state of the process without relying on the /proc file system or special system call interfaces. Since eBPF programs are supplied externally to the kernel and remain compatible across kernel versions through BPF CO-RE, the approach offers good extensibility. eBPF also supports ring buffers shared between user and kernel space, which avoids extensive context switches and memory copies during checkpointing (a sketch of this mechanism is below). The authors presented this work at the Linux Plumbers Conference 2024; this paper provides a written record.
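A hedged sketch of the ring-buffer mechanism, not bpfCP's code: the record layout and the task-iterator attach point are assumptions.

```c
#include "vmlinux.h"          /* CO-RE: kernel types for the target */
#include <bpf/bpf_helpers.h>

struct ckpt_record {
    __u32 pid;
    __u64 state;              /* placeholder for checkpointed state */
};

/* Ring buffer mmap'd by the user-space checkpointer: records flow
 * out without per-record syscalls or extra copies. */
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20);     /* 1 MiB */
} ckpt_rb SEC(".maps");

SEC("iter/task")
int dump_task(struct bpf_iter__task *ctx)
{
    struct task_struct *task = ctx->task;
    struct ckpt_record *rec;

    if (!task)
        return 0;

    rec = bpf_ringbuf_reserve(&ckpt_rb, sizeof(*rec), 0);
    if (!rec)
        return 0;

    rec->pid = task->pid;     /* real code reads much more task state */
    rec->state = 0;
    bpf_ringbuf_submit(rec, 0);
    return 0;
}

char _license[] SEC("license") = "GPL";
```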
Points:
- The main aim is checkpointing; restore is hard to achieve from eBPF (see https://lwn.net/Articles/984313/).
References
- eTran: Extensible Kernel Transport with eBPF - NSDI’25
- XDP2 by Tom Herbert: link
- Toward eBPF-Accelerated Pub-Sub Systems - eBPF’25
- A Memory Pool Allocator for eBPF Applications - eBPF’25
- Empowering machine-learning assisted kernel decisions with eBPFML - eBPF’25
- eInfer: Unlocking Fine-Grained Tracing for Distributed LLM Inference with eBPF - eBPF’25
- bpfCP: Efficient and Extensible Process Checkpointing via eBPF - eBPF’25
- ChainIO: Bridging Disk and Network Domains with eBPF - eBPF’25
- DINT: Fast In-Kernel Distributed Transactions with eBPF - NSDI’24
- Electrode: Accelerating Distributed Protocols with eBPF - NSDI’23