# Rearchitecting Linux Storage Stack for µs Latency and High Throughput

### Presentation

* Widespread belief: Linux cannot achieve microsecond-scale latency and high throughput
  * Adoption of high-performance hardware, but stagnant single-core capacity
    * T-app: throughput-bound app
    * Static data path --> hard to utilize all cores
  * Co-location of apps with different performance goals
    * L-app: latency-sensitive app
    * High latency due to head-of-line (HoL) blocking
* Performance of the existing storage stack
  * Applications accessing in-memory data on remote servers (single-core case)
    * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FYnAchsV6ExV0AL3AkJpc%2Fimage.png?alt=media\&token=5b765adf-c2a4-481d-a4bf-936166974cc5)
    * Low latency or high throughput, but not both
* blk-switch summary
  * Linux can achieve microsecond-scale latency while achieving near-hardware-capacity throughput!
    * Without changes in applications, the kernel CPU scheduler, the kernel TCP/IP stack, or network hardware
  * For example, blk-switch achieves this
    * Even with tens of applications (6 L-apps + 6 T-apps on 6 cores)
    * Despite complex interference at the compute, storage, and network stacks (remote storage access over 100 Gbps)
* Key insight
  * Observation: today's Linux storage stack is conceptually similar to a network switch
  * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FT03wVja51Y6t3xr39etL%2Fimage.png?alt=media\&token=51531756-3e3b-4c42-8714-d261bd91ee60)
  * blk-switch: a switched Linux storage stack architecture
    * Enables decoupling request processing from application cores
    * Multi-egress queues, prioritization, and load balancing

#### Architecture

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FOwJchfDEnvkXS2HuPdb8%2Fimage.png?alt=media\&token=e9fcb5e3-0750-4f1c-a521-d31db8e0e734)

1. Egress queue per (core, app-class) pair

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FSVAOYJm34zNJNDacNmMb%2Fimage.png?alt=media\&token=de1b89b4-9dc3-4536-8a4b-49b8c728695b)

2. Flexible mapping from ingress to egress queues

\--> decoupling request processing from application cores: "static --> flexible"
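
The two architectural ideas above can be sketched as a toy Python model (hypothetical names such as `BlkSwitch` and `submit` — the real implementation lives inside the kernel block layer):

```python
from collections import deque
from enum import Enum

class AppClass(Enum):
    L = 0  # latency-sensitive app (L-app)
    T = 1  # throughput-bound app (T-app)

class CoreQueues:
    """Idea 1: each core owns one egress queue per application class."""
    def __init__(self):
        self.egress = {cls: deque() for cls in AppClass}

class BlkSwitch:
    """Idea 2: flexible ingress->egress mapping. A request entering on one
    core's ingress queue may be placed on ANY core's egress queue, decoupling
    request processing from the application's core ("static --> flexible")."""
    def __init__(self, num_cores):
        self.cores = [CoreQueues() for _ in range(num_cores)]

    def submit(self, ingress_core, app_class, request, egress_core=None):
        # A static data path would force egress_core == ingress_core; the
        # switched design makes the processing core a per-request choice.
        target = ingress_core if egress_core is None else egress_core
        self.cores[target].egress[app_class].append(request)
        return target
```

This is only the queue layout; how blk-switch actually chooses the egress core is the subject of the three techniques below.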

Three techniques:

* blk-switch prioritization
  * Prioritize L-app request processing
  * Multi-egress queues + prioritization: near-optimal latency for L-apps
* blk-switch request steering for transient loads
  * Challenge: prioritization of L-apps can lead to transient starvation of T-apps
  * Steer requests to underutilized cores at per-request granularity
    * Select target cores using known techniques (e.g., power-of-two choices)
    * Capture only T-app load
  * Request steering lets blk-switch maintain high throughput even under transient loads
* blk-switch application steering for persistent loads
  * Challenge: persistent loads lead to high system overheads
  * Steer apps to cores with low average utilization
    * On longer time scales (e.g., every 10 ms)
    * Accounts for both L-app and T-app load
  * High throughput for T-apps even under persistent loads
  * Even lower latency for L-apps due to fewer context switches

#### Evaluation

* Implemented entirely in the Linux kernel with minimal changes (\~928 LOC)
* To stress-test blk-switch
  * Complex interaction among the compute, storage, and network stacks
  * Evaluate "remote storage access"
* To push the bottleneck to storage stack processing
  * Two 32-core servers connected directly over 100 Gbps
* To access data on remote servers
  * Linux / blk-switch use i10
  * SPDK uses userspace NVMe-over-TCP
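
The three techniques can be illustrated together in a toy model (hypothetical names; power-of-two choices stands in for the "known techniques" of target-core selection, and loads are simple counters rather than real byte counts):

```python
import random
from collections import deque

L_APP, T_APP = "L", "T"  # latency-sensitive vs throughput-bound classes

class Core:
    """One core's egress queues plus the load statistic blk-switch tracks."""
    def __init__(self):
        self.egress = {L_APP: deque(), T_APP: deque()}
        self.t_load = 0  # queued T-app work only ("capture only T-app load")

    def dispatch(self):
        # Technique 1 -- prioritization: drain L-app requests before T-app ones.
        for cls in (L_APP, T_APP):
            if self.egress[cls]:
                req = self.egress[cls].popleft()
                if cls == T_APP:
                    self.t_load -= 1
                return req
        return None

class Switch:
    def __init__(self, num_cores, seed=0):
        self.cores = [Core() for _ in range(num_cores)]
        self.rng = random.Random(seed)

    def submit(self, home_core, cls, req):
        # Technique 2 -- request steering: a T-app request may be moved, per
        # request, to a less T-loaded core (power-of-two choices), so that
        # L-app prioritization on the home core cannot starve it.
        target = home_core
        if cls == T_APP:
            a, b = self.rng.sample(range(len(self.cores)), 2)
            best = min(a, b, key=lambda i: self.cores[i].t_load)
            if self.cores[best].t_load < self.cores[home_core].t_load:
                target = best
        core = self.cores[target]
        core.egress[cls].append(req)
        if cls == T_APP:
            core.t_load += 1
        return target

    def rebalance_apps(self, app_cores, app_load):
        # Technique 3 -- application steering: on a longer time scale (e.g.,
        # every 10 ms), move one app from the most-loaded core to the
        # least-loaded one, counting both L-app and T-app load.
        load = [0] * len(self.cores)
        for app, c in app_cores.items():
            load[c] += app_load[app]
        hot, cold = load.index(max(load)), load.index(min(load))
        for app, c in app_cores.items():
            if c == hot:
                app_cores[app] = cold
                break
        return app_cores
```

For example, if one L-app and one T-app request are both queued on the same core, `dispatch()` serves the L-app request first; `rebalance_apps` then evens out persistent imbalance across cores.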

#### Summary

* It is possible to achieve microsecond-scale latency and high throughput with Linux
* blk-switch insight: modern storage stack is conceptually similar to network switches
  * Decoupling request processing from application cores
  * Multi-egress queue architecture, prioritization, request steering, and application steering
* blk-switch achieves
  * Tens-of-microseconds average latency and < 190 µs tail latency with in-memory storage
  * Near-hardware-capacity throughput
