BPF for Storage: An Exokernel-Inspired Approach
https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s07-zhong.pdf
Trend: faster storage devices
HDD, SSD, Intel Optane Gen 1 / 2
Software overhead of I/O requests becomes more significant as devices get faster
Existing approaches to reduce kernel overhead
Kernel bypass: let applications talk directly to the device, without going through the kernel storage stack
Problems: lacks fine-grained isolation and wastes CPU cycles on polling
Near-storage processing
Offload application logic onto the storage device, but this requires specialized hardware
Goal: a standard OS-supported mechanism in Linux that can reduce this software overhead
The exokernel file system (XN) supports user-defined kernel extensions, which give the kernel an application-specific understanding of file system metadata
By downloading application logic into the Linux kernel, we can potentially eliminate most of the software overhead
Question: how can we run untrusted application logic inside the Linux kernel safely?
Networking community has been using Linux eBPF for efficient packet processing, filtering, security, and tracing
eBPF: extended Berkeley Packet Filter (used interchangeably with BPF here)
BPF allows users to run untrusted functions in the kernel safely: programs are checked by an in-kernel verifier and then JIT-compiled to native code
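To make the networking use concrete, here is a minimal, illustrative XDP sketch (not from the paper, assuming libbpf-style headers) that drops IPv4/UDP packets; the verifier's bounds checks are what let code like this run safely in the kernel:

```c
// Minimal XDP sketch (illustrative, not from the paper): drop IPv4/UDP
// packets, pass everything else. The in-kernel verifier enforces the
// explicit bounds checks below before the program is JIT-compiled.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int drop_udp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)        // bounds check required by the verifier
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)         // and another one
        return XDP_PASS;

    return ip->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```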
BPF could be used to chain dependent I/Os, eliminating traversals of the kernel storage stack and transitions to/from user space
e.g., a B-tree index lookup
On-disk B-tree lookup with read()
After each read request completes, the device generates an interrupt and the NVMe driver's interrupt handler is invoked to handle the completion
At this point the kernel already has the node's data, so the binary search over the node (and the choice of which child to fetch next) can happen there instead of in user space
Only 1 read() request is issued from user space; the remaining (d-1) lookups are handled by BPF resubmissions that skip most of the kernel storage stack and all user-space crossings, where d is the depth of the tree
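For reference, the user-space baseline looks roughly like the sketch below, assuming a toy on-disk node layout (the struct is an assumption, not the paper's format): each level of the tree costs one pread() and therefore one full trip through the syscall, file system, block, and NVMe layers.

```c
// Baseline sketch: user-space lookup over an on-disk B-tree index.
// The node layout is a toy assumption (not from the paper); the point is
// that a d-level tree costs d pread() calls, each traversing the whole
// kernel storage stack and crossing into/out of user space.
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define FANOUT 15

struct node {
    uint32_t nkeys;
    uint8_t  is_leaf;
    uint64_t keys[FANOUT];
    uint64_t children[FANOUT + 1];   // child offsets, or values in a leaf
};

static int64_t btree_lookup(int fd, uint64_t root_off, uint64_t key)
{
    struct node n;
    uint64_t off = root_off;

    for (;;) {
        // One syscall + full storage-stack traversal per level.
        if (pread(fd, &n, sizeof(n), (off_t)off) != (ssize_t)sizeof(n))
            return -1;

        uint32_t i = 0;
        while (i < n.nkeys && key > n.keys[i])   // linear scan for brevity
            i++;

        if (n.is_leaf)
            return (i < n.nkeys && n.keys[i] == key) ? (int64_t)n.children[i] : -1;

        off = n.children[i];                     // descend one level
    }
}
```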
Dispatch paths for an I/O Chain
The NVMe driver lacks file system and block layer semantics
Potential benefit of BPF
Breakdown of the average 512B read() latency on an Intel Optane SSD gen 2
Simulate B-tree lookups with different numbers of levels
Dispatch from the syscall layer: max speedup 1.25x
Dispatch from the NVMe driver: max speedup 2.5x
Dispatching from different layers: syscall layer (13%), NVMe driver (49%)
Calling BPF as early as possible in the storage I/O completion path maximizes the performance gain
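A rough way to see why (a simplified latency-only model, not a formula from the paper): if a plain read() costs T_total end to end and a lookup resubmitted from some kernel layer only has to traverse the layers at or below that hook, at cost T_hook, then a depth-d traversal costs d * T_total when driven from user space versus T_total + (d-1) * T_hook when resubmitted in the kernel. As d grows, the speedup approaches T_total / T_hook, so hooking lower in the stack (smaller T_hook) raises the ceiling. This ignores the fact that the CPU time saved per lookup can also be spent issuing additional independent requests, which is why measured IOPS gains can exceed this latency-only estimate.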
Can BPF help io_uring?
io_uring is a newer Linux interface that allows applications to submit batches of async I/O requests and collect batches of completions through shared submission/completion rings, avoiding per-request copies
However, requests submitted with io_uring still pass through all the kernel storage layers (see the sketch below)
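For context, a batched asynchronous read with liburing looks roughly like the sketch below (assuming liburing is available and fd was opened elsewhere); note that every request in the batch still walks the full storage stack on its way to the device.

```c
// Sketch: batched async 512B reads via io_uring/liburing (assumes liburing
// is installed and fd refers to an already-open file or block device).
// Submission and completion use shared rings, but every request below
// still traverses the full kernel storage stack.
#include <liburing.h>
#include <sys/types.h>

#define BATCH 8
#define BLK   512

int read_batch(int fd, const off_t *offsets, char bufs[BATCH][BLK])
{
    struct io_uring ring;
    if (io_uring_queue_init(BATCH, &ring, 0) < 0)
        return -1;

    for (int i = 0; i < BATCH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], BLK, offsets[i]);
        io_uring_sqe_set_data(sqe, (void *)(long)i);   // tag for matching completions
    }
    io_uring_submit(&ring);                 // one syscall submits the whole batch

    for (int i = 0; i < BATCH; i++) {       // collect the completions
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0)
            break;
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return 0;
}
```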
Dispatch from the NVMe driver (varying batch size)
Max speedup: 2.7x
3-level lookup: 1.3-1.9x
BPF is complementary to io_uring
Future work: design for storage BPF
Build a library that provides a higher-level interface than raw BPF, along with new BPF hooks in the NVMe driver completion path
BPF functions would be triggered in the NVMe driver interrupt handler on each I/O completion
These BPF functions could extract file offsets from blocks fetched from storage and immediately reissue an I/O to those offsets, or
filter, project, and aggregate block data by building up buffers that are later returned to the application (a hypothetical sketch of the reissue case follows below)
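A very rough sketch of the reissue case, reusing the toy node layout from the baseline sketch above. Everything kernel-facing here is hypothetical: the program type, section name, context struct, and return codes do not exist in Linux and only illustrate the proposed design.

```c
// Hypothetical sketch of a storage BPF hook (none of these types, section
// names, or return codes exist in Linux today). The NVMe driver would call
// this from its completion path; the program inspects the fetched block and
// either asks the driver to reissue a read or signals that the chain is done.
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

struct node {                          // toy on-disk node layout (same as the baseline sketch)
    __u32 nkeys;
    __u8  is_leaf;
    __u64 keys[15];
    __u64 children[16];
};

struct storage_bpf_ctx {               // hypothetical context struct
    __u8  *data;                       // block just fetched from the device
    __u32  len;
    __u64  key;                        // lookup key passed down by the application
    __u64  next_off;                   // out: offset of the next block to read
};

enum { STORAGE_DONE = 0, STORAGE_RESUBMIT = 1 };   // hypothetical return codes

SEC("storage/nvme_completion")         // hypothetical hook point
int btree_step(struct storage_bpf_ctx *ctx)
{
    struct node *n = (struct node *)ctx->data;

    __u32 i = 0;                       // bounds checks against ctx->len omitted for brevity
    while (i < n->nkeys && ctx->key > n->keys[i])
        i++;

    if (n->is_leaf)
        return STORAGE_DONE;           // hand the buffer back to the application

    ctx->next_off = n->children[i];    // driver reissues a read to this offset
    return STORAGE_RESUBMIT;
}
```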
Challenges
Translation & Security: the NVMe driver lacks file system semantics such as extent translation and access control
I/O Granularity Mismatch: a single I/O request might require multiple NVMe commands to complete
Caching: I/Os reissued from the NVMe driver bypass the page cache, so the application has to implement its own caching
Concurrency: synchronizing reads with concurrent writes might require locking, which is prohibitively expensive in an interrupt handler
Fairness: the NVMe driver does not provide fairness or QoS guarantees
Summary
The Linux kernel storage stack accounts for ~50% of the access latency on fast storage devices
We can bypass most of the software layers by pushing application logic into the Linux kernel to issue chains of dependent I/O requests
Linux eBPF allows users to safely download simple untrusted functions into the kernel
By dispatching the next request from the NVMe driver, we can improve the IOPS of the read() syscall by up to 2.5x and the IOPS of io_uring by up to 2.7x
Challenges arise in terms of translation, security, caching, concurrency, and fairness