BPF for Storage: An Exokernel-Inspired Approach

https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s07-zhong.pdf

  • Trend: faster storage devices

    • HDD, SSD, Intel Optane Gen 1 / 2

    • Software overhead of I/O requests becomes more significant

  • Reduce kernel overhead

    • Kernel bypass: allow applications to directly talk with the device, without going through the kernel storage stack

      • Problems: lacks fine-grained isolation and wastes CPU cycles due to polling

    • Near-storage processing

      • Download application logic into the storage device, but it requires specialized hardware

  • Reduce kernel overhead with BPF

    • Goal: a standard OS-supported mechanism in Linux that can reduce the software overhead

    • Exokernel file systems support user-defined kernel extensions, which give the kernel a user-defined understanding of file system metadata

    • By downloading application logic into the Linux kernel, we can potentially eliminate most of the software overhead

    • Question: how to run untrusted application logic within the Linux kernel?

      • Networking community has been using Linux eBPF for efficient packet processing, filtering, security, and tracing

        • eBPF: extended Berkeley Packet Filter (interchangeable with BPF)

      • BPF allows users to run untrusted functions safely in the kernel: programs are checked by the in-kernel verifier and then JIT-compiled (a minimal networking example is sketched after this list)

      • BPF could be used to chain dependent I/Os, eliminating traversals of the kernel storage stack and transitions to/from user space

        • e.g., a B-tree index lookup
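For context, this is the kind of eBPF program the networking community already loads into the kernel: a minimal XDP filter that drops UDP packets. The program name and drop-UDP policy are illustrative only, not from the paper; the point is the workflow — compile with clang targeting BPF, let the verifier check it, then the kernel JIT-compiles and runs it on every packet.

```c
/* xdp_drop_udp.c - illustrative XDP program (not from the paper). */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int drop_udp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)          /* bounds checks required by the verifier */
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    /* Drop UDP, pass everything else. */
    return ip->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```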

  • On-Disk B-Tree Lookup with reads

      • After the request is completed, an interrupt will be generated, and the interrupt handler of the NVMe driver will be invoked to handle the completion

      • At this point the fetched node data is already in the kernel, so the binary search for the next child does not need to happen in user space (a baseline user-space version is sketched after this list)

  • Only generates 1 read() syscall plus (d-1) BPF-issued lookups that bypass most of the kernel storage stack, where d is the depth of the tree
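To see where the (d-1) extra traversals come from, here is a baseline user-space lookup over a hypothetical on-disk B-tree layout (the struct fields, fanout, and block size are assumptions for illustration, not the paper's format). Each level costs one pread() that crosses into the kernel, traverses the storage stack, and returns to user space only to compute the next child offset.

```c
/* Baseline without BPF: a d-level lookup issues d pread() syscalls. */
#define _XOPEN_SOURCE 700
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 512
#define FANOUT     15

struct btree_node {                  /* hypothetical on-disk node layout */
    uint32_t nkeys;
    uint32_t is_leaf;
    uint64_t keys[FANOUT];           /* sorted keys */
    uint64_t children[FANOUT + 1];   /* child block offsets, or values in a leaf */
};

/* Returns the value stored for `key`, or 0 if it is not present. */
static uint64_t btree_lookup(int fd, uint64_t root_off, uint64_t key)
{
    union {
        uint8_t raw[BLOCK_SIZE];
        struct btree_node node;
    } block;
    uint64_t off = root_off;

    for (;;) {
        /* One full user->kernel->device->user round trip per level. */
        if (pread(fd, block.raw, BLOCK_SIZE, (off_t)off) != BLOCK_SIZE)
            return 0;
        const struct btree_node *node = &block.node;

        /* Binary search: lo = number of keys <= key. */
        uint32_t lo = 0, hi = node->nkeys;
        while (lo < hi) {
            uint32_t mid = (lo + hi) / 2;
            if (node->keys[mid] <= key)
                lo = mid + 1;
            else
                hi = mid;
        }

        if (node->is_leaf)           /* children[i] holds the value for keys[i] */
            return (lo > 0 && node->keys[lo - 1] == key) ? node->children[lo - 1] : 0;

        off = node->children[lo];    /* descend: triggers the next pread() */
    }
}
```

With the proposed BPF approach, only the first read would be issued from user space; the per-level binary search and reissue would instead run at I/O completion time inside the kernel.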

  • Dispatch paths for an I/O Chain

    • The NVMe driver lacks file system and block layer semantics

  • Potential benefit of BPF

    • Breakdown of the average 512B read() latency on an Intel Optane SSD Gen 2

  • Simulated B-tree lookups with different numbers of levels

    • Dispatch from the syscall layer: max speedup 1.25x

    • Dispatch from the NVMe driver: max speedup 2.5x

    • Latency savings by dispatch layer: syscall layer (13%), NVMe driver (49%)

      • Takeaway: call BPF as early in the storage I/O completion path as possible to maximize the performance gain

  • Can BPF help io_uring?

    • io_uring is a newer Linux I/O interface that lets applications submit batches of asynchronous I/O requests and collect batches of completions through submission/completion rings shared with the kernel (a minimal liburing sketch follows this list)

    • However, requests sent with io_uring still pass through all the kernel storage layers

    • Dispatch from the NVMe driver (varying the submission batch size)

      • Maximum speedup: 2.7x

      • 3-level tree: 1.3-1.9x

      • BPF is complementary to io_uring
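As a concrete illustration of batched submission (not code from the paper), here is a minimal liburing sketch that queues several reads and hands them to the kernel with a single io_uring_submit(); the file path and offsets are arbitrary. Note these reads are independent of each other, whereas a B-tree traversal's reads are dependent, which is why in-kernel BPF resubmission remains complementary to io_uring.

```c
/* batch_read.c - submit a batch of reads with io_uring via liburing. */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

#define QD         8
#define BLOCK_SIZE 512

int main(int argc, char **argv)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    static char bufs[QD][BLOCK_SIZE];

    if (argc < 2)
        return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0 || io_uring_queue_init(QD, &ring, 0) < 0)
        return 1;

    /* Queue a batch of reads at different (arbitrary) offsets. */
    for (int i = 0; i < QD; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], BLOCK_SIZE, (off_t)i * BLOCK_SIZE);
        io_uring_sqe_set_data(sqe, (void *)(long)i);
    }
    io_uring_submit(&ring);                /* one syscall submits the whole batch */

    /* Collect the batch of completions; each request still traversed the full stack. */
    for (int i = 0; i < QD; i++) {
        if (io_uring_wait_cqe(&ring, &cqe) < 0)
            break;
        printf("read %ld completed: %d bytes\n",
               (long)io_uring_cqe_get_data(cqe), cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return 0;
}
```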

  • Future work: design for storage BPF

    • Build a library that provides a higher-level interface than raw BPF, plus new BPF hooks in the NVMe driver's completion path

    • BPF functions would be triggered in the NVMe driver interrupt handler on each I/O completion

      • These BPF functions could extract file offsets from blocks fetched from storage and immediately reissue an I/O to those offsets, or

      • Filter, project, and aggregate block data by building up buffers that are later returned to the application (a hypothetical sketch of such a completion-path function follows this list)
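The sketch below illustrates this design, but everything kernel-facing in it is hypothetical: the section/hook name, the context struct, the return codes, and the idea that the driver reissues a read at ctx->next_offset are placeholders for the interface the authors propose to build, not an existing Linux API. Only the per-node B-tree step logic is concrete.

```c
/* HYPOTHETICAL: a storage BPF function attached to an NVMe-completion hook.
 * The hook name, context layout, return codes, and resubmission protocol do
 * not exist in Linux today; they stand in for the hooks the paper proposes.
 * (A real BPF program would also need verifier-friendly bounds checks and a
 * bounded loop.) */
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

#define FANOUT 15

struct btree_node {                        /* assumed on-disk node layout */
    __u32 nkeys;
    __u32 is_leaf;
    __u64 keys[FANOUT];
    __u64 children[FANOUT + 1];
};

struct storage_bpf_ctx {                   /* assumed hook context */
    struct btree_node *node;               /* completed block, already in kernel memory */
    __u64 target_key;                      /* lookup key passed down by the application */
    __u64 next_offset;                     /* out: offset for the driver to reissue */
};

enum {                                     /* assumed return codes */
    STORAGE_BPF_DONE     = 0,              /* hand the buffer back to the application */
    STORAGE_BPF_RESUBMIT = 1,              /* reissue a read at ctx->next_offset */
};

SEC("storage/nvme_completion")             /* hypothetical hook name */
int btree_lookup_step(struct storage_bpf_ctx *ctx)
{
    struct btree_node *n = ctx->node;

    /* Binary search for the child subtree covering target_key. */
    __u32 lo = 0, hi = n->nkeys;
    if (hi > FANOUT)
        hi = FANOUT;
    while (lo < hi) {
        __u32 mid = (lo + hi) / 2;
        if (n->keys[mid] <= ctx->target_key)
            lo = mid + 1;
        else
            hi = mid;
    }

    if (n->is_leaf)
        return STORAGE_BPF_DONE;           /* leaf reached: stop the I/O chain */

    ctx->next_offset = n->children[lo];    /* chain the next dependent read in-kernel */
    return STORAGE_BPF_RESUBMIT;
}

char LICENSE[] SEC("license") = "GPL";
```

Triggered once per completed block, such a function would replace the (d-1) user-space round trips in the baseline sketch earlier in these notes.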

  • Challenges

    • Translation & Security: NVMe driver lacks file-system semantics such as extent translation and access control

    • I/O Granularity Mismatch: A single I/O request might require multiple NVMe commands to complete

    • Caching: NVMe driver bypasses page cache, which requires the application to implement its own caching

    • Concurrency: synchronizing read and write might require locking, which is prohibitive in the interrupt handler

    • Fairness: NVMe driver does not provide fairness or QoS guarantees

  • Summary

    • Linux kernel storage stack accounts for ~50% of the access latency for fast storage devices

    • We can bypass most of the software layers by pushing application logic into the Linux kernel to issue a chain of dependent I/O requests

    • Linux eBPF allows users to safely download simple untrusted functions into the kernel

    • By dispatching the next request from the NVMe driver, we can improve the IOPS of the read() syscall by up to 2.5x and the IOPS of io_uring by up to 2.7x

    • Challenges arise in terms of translation, security, caching, concurrency, and fairness
