BPF for Storage: An Exokernel-Inspired Approach
https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s07-zhong.pdf
Trend: faster storage devices
HDD, SSD, Intel Optane Gen 1 / 2
Software overhead of I/O requests becomes more significant as devices get faster
Existing approaches to reduce kernel overhead
Kernel bypass: let applications talk directly to the device, without going through the kernel storage stack
Problems: lacks fine-grained isolation and wastes CPU cycles on polling
Near-storage processing
Offload application logic onto the storage device, but this requires specialized hardware
Goal: a standard OS-supported mechanism in Linux that can reduce this software overhead
The exokernel file system (XN) supports user-defined kernel extensions, which give the kernel an application-specific understanding of file system metadata
By downloading application logic into the Linux kernel, we can potentially eliminate most of the software overhead
Question: how can we run untrusted application logic inside the Linux kernel safely?
Networking community has been using Linux eBPF for efficient packet processing, filtering, security, and tracing
eBPF: extended Berkeley Packet Filter (used interchangeably with BPF here)
BPF allows users to run untrusted functions in the kernel safely: programs are checked by an in-kernel verifier and then JIT-compiled to native code
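To make the networking use concrete, here is a minimal, illustrative XDP sketch (not from the paper, assuming libbpf-style headers) that drops IPv4/UDP packets; the verifier's bounds checks are what let code like this run safely in the kernel:

```c
// Minimal XDP sketch (illustrative, not from the paper): drop IPv4/UDP
// packets, pass everything else. The in-kernel verifier enforces the
// explicit bounds checks below before the program is JIT-compiled.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int drop_udp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)        // bounds check required by the verifier
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)         // and another one
        return XDP_PASS;

    return ip->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```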
BPF could be used to chain dependent I/Os, eliminating traversals of the kernel storage stack and transitions to/from user space
e.g., a B-tree index lookup
On-disk B-tree lookup with read()
After each read request completes, the device generates an interrupt and the NVMe driver's interrupt handler is invoked to handle the completion
At this point the kernel already has the node's data, so the binary search over the node (and the choice of which child to fetch next) can happen there instead of in user space
Only 1 read() request is issued from user space; the remaining (d-1) lookups are handled by BPF resubmissions that skip most of the kernel storage stack and all user-space crossings, where d is the depth of the tree
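For reference, the user-space baseline looks roughly like the sketch below, assuming a toy on-disk node layout (the struct is an assumption, not the paper's format): each level of the tree costs one pread() and therefore one full trip through the syscall, file system, block, and NVMe layers.

```c
// Baseline sketch: user-space lookup over an on-disk B-tree index.
// The node layout is a toy assumption (not from the paper); the point is
// that a d-level tree costs d pread() calls, each traversing the whole
// kernel storage stack and crossing into/out of user space.
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define FANOUT 15

struct node {
    uint32_t nkeys;
    uint8_t  is_leaf;
    uint64_t keys[FANOUT];
    uint64_t children[FANOUT + 1];   // child offsets, or values in a leaf
};

static int64_t btree_lookup(int fd, uint64_t root_off, uint64_t key)
{
    struct node n;
    uint64_t off = root_off;

    for (;;) {
        // One syscall + full storage-stack traversal per level.
        if (pread(fd, &n, sizeof(n), (off_t)off) != (ssize_t)sizeof(n))
            return -1;

        uint32_t i = 0;
        while (i < n.nkeys && key > n.keys[i])   // linear scan for brevity
            i++;

        if (n.is_leaf)
            return (i < n.nkeys && n.keys[i] == key) ? (int64_t)n.children[i] : -1;

        off = n.children[i];                     // descend one level
    }
}
```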
Dispatch paths for an I/O Chain
The NVMe driver lacks file system and block layer semantics
Potential benefit of BPF
Breakdown of the average 512B read() latency on an Intel Optane SSD gen 2
Simulate B-tree lookups with different numbers of levels
Dispatch from the syscall layer: max speedup 1.25x
Dispatch from the NVMe driver: max speedup 2.5x
Dispatching from different layers: syscall layer (13%), NVMe driver (49%)
Calling BPF as early as possible in the storage I/O completion path maximizes the performance gain
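A rough way to see why (a simplified latency-only model, not a formula from the paper): if a plain read() costs T_total end to end and a lookup resubmitted from some kernel layer only has to traverse the layers at or below that hook, at cost T_hook, then a depth-d traversal costs d * T_total when driven from user space versus T_total + (d-1) * T_hook when resubmitted in the kernel. As d grows, the speedup approaches T_total / T_hook, so hooking lower in the stack (smaller T_hook) raises the ceiling. This ignores the fact that the CPU time saved per lookup can also be spent issuing additional independent requests, which is why measured IOPS gains can exceed this latency-only estimate.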
Can BPF help io_uring?
io_uring is a newer Linux interface that allows applications to submit batches of async I/O requests and collect batches of completions through shared submission/completion rings, avoiding per-request copies
However, requests submitted with io_uring still pass through all the kernel storage layers (see the sketch below)
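For context, a batched asynchronous read with liburing looks roughly like the sketch below (assuming liburing is available and fd was opened elsewhere); note that every request in the batch still walks the full storage stack on its way to the device.

```c
// Sketch: batched async 512B reads via io_uring/liburing (assumes liburing
// is installed and fd refers to an already-open file or block device).
// Submission and completion use shared rings, but every request below
// still traverses the full kernel storage stack.
#include <liburing.h>
#include <sys/types.h>

#define BATCH 8
#define BLK   512

int read_batch(int fd, const off_t *offsets, char bufs[BATCH][BLK])
{
    struct io_uring ring;
    if (io_uring_queue_init(BATCH, &ring, 0) < 0)
        return -1;

    for (int i = 0; i < BATCH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], BLK, offsets[i]);
        io_uring_sqe_set_data(sqe, (void *)(long)i);   // tag for matching completions
    }
    io_uring_submit(&ring);                 // one syscall submits the whole batch

    for (int i = 0; i < BATCH; i++) {       // collect the completions
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0)
            break;
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return 0;
}
```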
Dispatch from the NVMe driver (varying batch size)
Max speedup: 2.7x
3-level lookup: 1.3-1.9x
BPF is complementary to io_uring
Future work: design for storage BPF
Build a library that provides a higher-level interface than raw BPF, along with new BPF hooks in the NVMe driver completion path
BPF functions would be triggered in the NVMe driver interrupt handler on each I/O completion
These BPF functions could extract file offsets from blocks fetched from storage and immediately reissue an I/O to those offsets, or
filter, project, and aggregate block data by building up buffers that are later returned to the application (a hypothetical sketch of the reissue case follows below)
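A very rough sketch of the reissue case, reusing the toy node layout from the baseline sketch above. Everything kernel-facing here is hypothetical: the program type, section name, context struct, and return codes do not exist in Linux and only illustrate the proposed design.

```c
// Hypothetical sketch of a storage BPF hook (none of these types, section
// names, or return codes exist in Linux today). The NVMe driver would call
// this from its completion path; the program inspects the fetched block and
// either asks the driver to reissue a read or signals that the chain is done.
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

struct node {                          // toy on-disk node layout (same as the baseline sketch)
    __u32 nkeys;
    __u8  is_leaf;
    __u64 keys[15];
    __u64 children[16];
};

struct storage_bpf_ctx {               // hypothetical context struct
    __u8  *data;                       // block just fetched from the device
    __u32  len;
    __u64  key;                        // lookup key passed down by the application
    __u64  next_off;                   // out: offset of the next block to read
};

enum { STORAGE_DONE = 0, STORAGE_RESUBMIT = 1 };   // hypothetical return codes

SEC("storage/nvme_completion")         // hypothetical hook point
int btree_step(struct storage_bpf_ctx *ctx)
{
    struct node *n = (struct node *)ctx->data;

    __u32 i = 0;                       // bounds checks against ctx->len omitted for brevity
    while (i < n->nkeys && ctx->key > n->keys[i])
        i++;

    if (n->is_leaf)
        return STORAGE_DONE;           // hand the buffer back to the application

    ctx->next_off = n->children[i];    // driver reissues a read to this offset
    return STORAGE_RESUBMIT;
}
```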
Challenges
Translation & Security: the NVMe driver lacks file system semantics such as extent translation and access control
I/O Granularity Mismatch: a single I/O request might require multiple NVMe commands to complete
Caching: I/Os reissued from the NVMe driver bypass the page cache, so the application has to implement its own caching
Concurrency: synchronizing reads with concurrent writes might require locking, which is prohibitively expensive in an interrupt handler
Fairness: the NVMe driver does not provide fairness or QoS guarantees
Summary
The Linux kernel storage stack accounts for ~50% of the access latency on fast storage devices
We can bypass most of the software layers by pushing application logic into the Linux kernel to issue chains of dependent I/O requests
Linux eBPF allows users to safely download simple untrusted functions into the kernel
By dispatching the next request from the NVMe driver, we can improve the IOPS of the read() syscall by up to 2.5x and the IOPS of io_uring by up to 2.7x
Challenges arise in terms of translation, security, caching, concurrency, and fairness