# BPF for Storage: An Exokernel-Inspired Approach

* Trend: faster storage devices&#x20;
  * HDD, SSD, Intel Optane Gen 1 / 2&#x20;
  * Software overhead of I/O requests become more significant&#x20;
* Reduce kernel overhead&#x20;
  * Kernel bypass: allow applications to directly talk with the device, without going through the kernel storage stack
    * Problem: lack of fine-grained isolation, wastes CPU cycles due to polling&#x20;
    * ![](/files/QZXe6huSlHMr8oKaS3LX)
  * Near-storage processing&#x20;
    * Download application logic into the storage device, but it requires specialized hardware
    * ![](/files/OGxGcqlCIie5ux3VSgX3)
* &#x20;Reduce kernel overhead&#x20;
  * Goal: a standard **OS-supported** mechanism in Linux that can reduce the software overhead&#x20;
  * Exokernel file system supports user-defined kernel extensions, which give the kernel user-defined understanding of file system metadata&#x20;
  * By downloading application logic into the Linux kernel, we can potentially eliminate most of the software overhead&#x20;
  * Question: how to run untrusted application logic with the Linux kernel?&#x20;
    * Networking community has been using Linux eBPF for efficient packet processing, filtering, security, and tracing&#x20;
      * eBPF: extended Berkeley Packet Filter (interchangeable with BPF)&#x20;
    * BPF allows user to run untrusted function in the kernel safely using JIT compiler&#x20;
    * BPF could be used to chain dependent I/Os, eliminating traversals of the kernel storage stack and transitions to/from user space&#x20;
      * e.g., B tree index lookup&#x20;
* On-Disk B-Tree Lookup with reads&#x20;
  * ![](/files/X5PsUjITmo3Rox2ZqsRe)
  * ![](/files/ZsqQPl92LQ4mIjNcMPsj)
    * After the request is completed, an interrupt will be generated, and the interrupt handler of the NVMe driver will be invoked to handle the completion&#x20;
    * Here, already have the data. Binary search --> not in the user space&#x20;
      * ![](/files/XyQ6vuYEJuVhlDWDtjar)
* Only generate 1 read() request and (d-1) BPF lookups that bypass the kernel, where d is the depth of the tree&#x20;
* Dispatch paths for an I/O Chain&#x20;
  * ![](/files/ImSdERUptcVvNhYxeAD7)
  * NVMe driver is lack of file system and block layer semantics&#x20;
* Potential benefit of BPF
  * Breakdown average 512B read() latency using Intel Optane SSD gen 2&#x20;
  * ![](/files/tdsh2xPnimM5KFc14lYc)
* Simulate B-tree lookups with different levels&#x20;
  * Max speedup: 1.25x&#x20;
  * Dispatch from NVMe Driver: max speedup - 2.5x&#x20;
  * Dispatch from different layers: syscall layer (13%), NVMe driver (49%)&#x20;
    * Calling BPF as early in the storage I/O completion path as possible to maximize performance gain&#x20;
* Can BPF help io\_uring?
  * io\_uring is a new syscall that allows applications to submit batches of async I/O requests in a zero-copy way, and to collect batches of I/O completions&#x20;
  * However, requests sent with io\_uring still passes through all the kernel layers&#x20;
  * Dispatch from NVMe Driver (batch size)&#x20;
    * Maximal speedup: 2.7x&#x20;
    * 3-level: 1.3-1.9x&#x20;
    * BPF is complementary to io\_uring&#x20;
* Future work: design for storage BPF
  * Build a library that provides a higher level interface than BPF, and new BPF hooks in the NVMe driver completion path&#x20;
  * BPF functions would be triggered in the NVMe driver interrupt handler on each I/O completion&#x20;
    * These BPF functions could extract file offsets from blocks fetched from storage and immediately reissue an I/O to those offsets, or&#x20;
    * Filter, project, and aggregate block data by building up buffers that are later return to the application&#x20;
* Challenges&#x20;
  * Translation & Security: NVMe driver lacks file-system semantics such as extent translation and access control
  * I/O Granularity Mismatch: A single I/O request might requires multiple NVMe commands to finish
  * Caching: NVMe driver bypasses page cache, which requires the application to implement its own caching
  * Concurrency: synchronizing read and write might require locking, which is prohibitive in the interrupt handler&#x20;
  * Fairness: NVMe driver does not provide fas or QoS guarantees&#x20;
* Summary
  * Linux kernel storage stack accounts for \~50% of the access latency for fast storage devices&#x20;
  * We can bypass most of the software layers by pushing application logics into Linux kernel to issue a chain of dependent I/O requests
  * Linux eBPF allows user to download simple untrusted function into the kernel safely&#x20;
  * By dispatching the next request in the NVMe driver, we can improve the IOPS of read() syscall by 2.5x, and improve the IOPS of io\_uring by 2.7x
  * Challenges arise in terms of translation, security, caching, concurrency, and fairness&#x20;


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://sliu583.gitbook.io/blog/specific-work/seminar-and-talk/fall-21-reading-list/bpf-for-storage-an-exokernel-inspired-approach.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
