
# Rearchitecting Linux Storage Stack for µs Latency and High Throughput

### Presentation

* Widespread belief: Linux cannot achieve microsecond-scale latency and high throughput
  * Adoption of high-performance H/W, but stagnant single-core capacity
    * T-app: throughput-bound app
    * Static data path --> hard to utilize all cores
  * Co-location of apps with different performance goals
    * L-app: latency-sensitive app
    * High latency due to head-of-line (HoL) blocking
* Performance of existing storage stack
  * Applications accessing in-memory data in remote servers (single-core case)
    * ![](/files/jA1xtzPvEB0IC7pAGrRu)
    * Low latency or high throughput, but not both
* blk-switch summary
  * Linux can achieve micro-second scale latency while achieving near H/W capacity throughput!
    * Without changes in applications, kernel CPU scheduler, kernel TCP/IP stack, and network hardware
  * For example, blk-switch achieves µs-scale latency and near-H/W-capacity throughput
    * Even with tens of applications (6 L-apps + 6 T-apps on 6 cores)
    * Despite complex interference across the compute, storage, and network stacks (remote storage access over 100 Gbps)
* Key insight
  * Observation: today's Linux storage stack is conceptually similar to network switches
  * ![](/files/qPqs8wcTR9bfL9IStHIQ)
  * blk-switch: switched Linux storage stack architecture
    * Enables decoupling request processing from application cores
    * Multi-egress queues, prioritization, and load balancing
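
To make the head-of-line blocking point concrete, here is a toy calculation rather than a measurement from the talk. The service times (100 µs for a T-app request, 5 µs for an L-app request) and the helper `completion_times` are hypothetical, chosen only to show how one small request fares in a shared FIFO queue versus a prioritized one.

```python
def completion_times(queue):
    """Return (name, finish_time_us) pairs for requests served in order.

    Each request is (name, service_time_us); all arrive at time 0.
    """
    now = 0
    finished = []
    for name, service_us in queue:
        now += service_us
        finished.append((name, now))
    return finished


# Hypothetical service times, not numbers from the talk.
T_REQ = ("T", 100)  # large throughput-bound request
L_REQ = ("L", 5)    # small latency-sensitive request

# Single shared queue: the L-app request waits behind four T-app requests.
shared = [T_REQ] * 4 + [L_REQ]
shared_latency = dict(completion_times(shared))["L"]

# Separate queues with strict priority: the L-app request is served first.
prioritized = [L_REQ] + [T_REQ] * 4
prio_latency = dict(completion_times(prioritized))["L"]

print(shared_latency, prio_latency)  # 405 5
```

In this sketch the total work finishes at 405 µs either way, so serving the L-app request first costs the T-app almost nothing while cutting L-app latency from 405 µs to 5 µs.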

#### Architecture

![](/files/6P8Pd1eZleg042sQGYy8)

1. Egress queue per-(core, app-class)

![](/files/xrnnQpovzOMQpkmChtgw)

2\. Flexible mapping from ingress to egress queues

\--> decoupling request processing from application cores: "static --> flexible"
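
The two architectural elements above can be sketched as a small data structure. This is an illustrative user-space model, not the kernel implementation: the names `AppClass`, `SwitchedStack`, `submit`, and `next_request` are hypothetical, and the strict-priority dequeue is one simple policy assumed here for illustration.

```python
from collections import deque
from enum import IntEnum


class AppClass(IntEnum):
    L_APP = 0  # latency-sensitive; lower value means higher priority
    T_APP = 1  # throughput-bound


class SwitchedStack:
    """Toy model: one egress queue per (core, app-class) pair."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.egress = {
            (core, cls): deque()
            for core in range(num_cores)
            for cls in AppClass
        }

    def submit(self, request, app_class, ingress_core, target_core=None):
        """Enqueue a request; the egress core may differ from the ingress core."""
        core = ingress_core if target_core is None else target_core
        self.egress[(core, app_class)].append(request)
        return core

    def next_request(self, core):
        """Drain the core's L-app queue before its T-app queue."""
        for cls in sorted(AppClass):
            queue = self.egress[(core, cls)]
            if queue:
                return queue.popleft()
        return None


stack = SwitchedStack(num_cores=2)
stack.submit("t-read-0", AppClass.T_APP, ingress_core=0)
stack.submit("l-read-0", AppClass.L_APP, ingress_core=0)
# Flexible mapping: a request issued on core 0 is processed on core 1.
stack.submit("t-read-1", AppClass.T_APP, ingress_core=0, target_core=1)

order_core0 = [stack.next_request(0) for _ in range(3)]
first_core1 = stack.next_request(1)
```

On core 0 the L-app request is dequeued ahead of the earlier T-app request, and the steered request shows up only on core 1, which is the "static --> flexible" mapping in miniature.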

Three techniques:

* Blk-switch prioritization
  * Prioritize L-app request processing
  * Multi-egress queues + prioritization: near optimal latency for L-apps
* Blk-switch request steering for transient loads
  * Challenge: prioritization of L-apps can lead to transient starvation of T-apps
  * Steer requests to underutilized cores at per-request granularity
    * Select target cores using known techniques
    * Capture only T-app load
  * Request steering allows blk-switch to maintain high throughput, even under transient loads
* Blk-switch application steering for persistent loads
  * Challenge: persistent loads lead to high system overheads
  * Steer apps to cores with low average utilization
    * Long-term time scales (e.g., every 10ms)
    * Both L-app and T-app load
  * High throughput for T-apps even under persistent loads
  * Even lower latency for L-apps due to fewer context switches

#### Evaluation

* Implemented entirely in the Linux kernel with minimal changes (LOC: \~928)
* To stress test blk-switch
  * Complex interaction among the compute, storage, and network stack
  * Evaluate "remote storage access"
* To push the bottleneck to the storage stack processing
  * Two 32-core servers connected directly over 100 Gbps
* To access data on remote servers
  * Linux / blk-switch use i10
  * SPDK uses userspace NVMe-over-TCP
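
The request-steering and application-steering techniques described above can be sketched in a few lines. This is a minimal model under stated assumptions, not the kernel code: the notes only say target cores are chosen "using known techniques", so power-of-two choices is used here as one such well-known technique. The function names, the load and utilization lists, and the `threshold` parameter are all hypothetical.

```python
import random


def steer_request(t_load, local_core, rng=random):
    """Pick a core for one T-app request using power-of-two choices.

    t_load[i] is the outstanding T-app load on core i; L-app load is
    ignored, matching "capture only T-app load". The request stays on
    its local core unless a sampled core is strictly less loaded.
    """
    candidates = rng.sample(range(len(t_load)), k=min(2, len(t_load)))
    best = min(candidates, key=lambda core: t_load[core])
    return best if t_load[best] < t_load[local_core] else local_core


def steer_application(avg_util, current_core, threshold=0.0):
    """Pick a core for an app from long-term average utilization.

    avg_util[i] covers both L-app and T-app load on core i, averaged
    over a long window (e.g., 10 ms). The app moves only if the least
    utilized core is better by more than `threshold` (an assumed knob).
    """
    best = min(range(len(avg_util)), key=lambda core: avg_util[core])
    if avg_util[current_core] - avg_util[best] > threshold:
        return best
    return current_core


# Transient load: core 0 is backed up with T-app requests, others are idle.
rng = random.Random(0)
request_target = steer_request([8, 0, 0, 0], local_core=0, rng=rng)

# Persistent load: move the app off the busy core, but not for a tiny gain.
app_target = steer_application([0.9, 0.2, 0.5], current_core=0, threshold=0.1)
app_stays = steer_application([0.5, 0.45], current_core=0, threshold=0.1)
```

Request steering runs per request and reacts to short bursts, while application steering runs on a slower clock and moves the whole app. The threshold in this sketch stands in for whatever hysteresis a real system would need to avoid bouncing apps between cores.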

#### Summary

* It is possible to achieve microsecond-scale latency and high throughput with Linux
* blk-switch insight: modern storage stack is conceptually similar to network switches
  * Decoupling request processing from application cores
  * Multi-egress queue architecture, prioritization, request steering, and application steering
* blk-switch achieves
  * Tens of µs average latency and < 190 µs tail latency with in-memory storage
  * Near-hardware capacity throughput
