Rearchitecting Linux Storage Stack for µs Latency and High Throughput
Widespread belief: Linux cannot achieve micro-second scale latency & high throughput
Adoption of high-performance H/W, but stagnant single-core capacity
T-app: throughput-bound app
Static data path --> hard to utilize all cores
Co-location of apps with different performance goals
L-app: latency-sensitive app
High latency due to HoL blocking
Performance of existing storage stack
Applications accessing in-memory data in remote servers (single-core case)
Low latency or high throughput, but not both
blk-switch summary
Linux can achieve micro-second scale latency while achieving near H/W capacity throughput!
Without changes to applications, the kernel CPU scheduler, the kernel TCP/IP stack, or network hardware
For example, blk-switch achieves micro-second scale latency and near-H/W capacity throughput
Even with tens of applications (6 L-apps + 6 T-apps on 6 cores)
Complex interference at compute, storage, and network stacks (remote storage access over 100 Gbps)
Key insight
Observation: today's Linux storage stack is conceptually similar to network switches
blk-switch: a switched Linux storage stack architecture
Enables decoupling request processing from application cores
1. Multi-egress queues, prioritization, and load balancing
Egress queue per-(core, app-class)
2. Flexible mapping from ingress to egress queues
--> decoupling request processing from application cores: "static --> flexible"
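A minimal userspace sketch of this queue architecture, assuming illustrative names (struct io_req, struct egress_queue, eq[][], MAX_CORES) rather than the actual blk-switch kernel symbols: each core owns one egress queue per app class, and the ingress-to-egress mapping is a parameter of enqueue() instead of a fixed identity.

```c
/* Sketch: per-(core, app-class) egress queues with a flexible
 * ingress->egress mapping. Illustrative only; not blk-switch's code. */
#include <stddef.h>

#define MAX_CORES 64

enum app_class { APP_L = 0, APP_T = 1, NR_CLASSES };

struct io_req {
    enum app_class cls;      /* class of the issuing application */
    int ingress_core;        /* core where the app submitted the request */
    unsigned int len;        /* request size in bytes */
    struct io_req *next;
};

struct egress_queue {
    struct io_req *head, *tail;     /* FIFO of pending requests */
    unsigned long pending_bytes;    /* rough per-queue load estimate */
};

/* One egress queue per (core, app-class): any core can process requests
 * of either class, regardless of which core they were submitted on. */
static struct egress_queue eq[MAX_CORES][NR_CLASSES];

/* Flexible mapping: the egress core is a parameter, so the default
 * "process on the ingress core" mapping can be overridden per request. */
static void enqueue(struct io_req *r, int egress_core)
{
    struct egress_queue *q = &eq[egress_core][r->cls];

    r->next = NULL;
    if (q->tail)
        q->tail->next = r;
    else
        q->head = r;
    q->tail = r;
    q->pending_bytes += r->len;
}
```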
Three techniques:
blk-switch prioritization
Prioritize L-app request processing
Multi-egress queues + prioritization: near optimal latency for L-apps
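Continuing the sketch above, the prioritization step on one core could look like the following; submit_to_device() is an illustrative stand-in for handing a request to the driver, not a real kernel API.

```c
/* Sketch of blk-switch-style prioritization on one core, reusing the
 * eq[][] queues above: drain L-app requests before T-app requests. */
static struct io_req *dequeue(struct egress_queue *q)
{
    struct io_req *r = q->head;

    if (r) {
        q->head = r->next;
        if (!q->head)
            q->tail = NULL;
        q->pending_bytes -= r->len;
    }
    return r;
}

static void submit_to_device(struct io_req *r) { (void)r; /* stand-in */ }

static void run_egress(int core)
{
    struct io_req *r;

    /* Strict priority: all pending L-app requests are processed first... */
    while ((r = dequeue(&eq[core][APP_L])) != NULL)
        submit_to_device(r);

    /* ...then T-app requests; the steering mechanisms described next keep
     * T-apps from starving when L-app traffic keeps a core busy. */
    while ((r = dequeue(&eq[core][APP_T])) != NULL)
        submit_to_device(r);
}
```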
blk-switch request steering for transient loads
Challenge: prioritization of L-apps can lead to transient starvation of T-apps
Steer requests to underutilized cores at per-request granularity
Select target cores using known techniques
Capture only T-app load
Request steering allows blk-switch to maintain high throughput, even under transient loads
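The notes only say target cores are selected "using known techniques"; the sketch below (continuing the one above) uses randomized power-of-two choices purely as an assumption, and compares only T-app load, i.e. pending bytes in the T-app egress queues.

```c
/* Sketch of per-request steering for T-app requests. Power-of-two choices
 * is an assumed selection policy, not necessarily what blk-switch uses. */
#include <stdlib.h>

static unsigned long tapp_load(int core)
{
    /* Only T-app load is captured; L-app queues are deliberately ignored. */
    return eq[core][APP_T].pending_bytes;
}

static int pick_egress_core(int ingress_core, int nr_cores)
{
    /* Sample two candidate cores and keep the less loaded one. */
    int a = rand() % nr_cores;
    int b = rand() % nr_cores;
    int best = (tapp_load(a) <= tapp_load(b)) ? a : b;

    /* Steer only if the candidate is genuinely less loaded than the
     * ingress core; otherwise keep the default mapping. */
    return (tapp_load(best) < tapp_load(ingress_core)) ? best : ingress_core;
}

/* Usage, at per-request granularity for a T-app request:
 *   enqueue(r, pick_egress_core(r->ingress_core, nr_cores)); */
```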
blk-switch application steering for persistent loads
Challenge: persistent loads lead to high system overheads
Steer apps to cores with low average utilization
Long-term time scales (e.g., every 10ms)
Both L-app and T-app load
High throughput for T-apps even under persistent loads
Even lower latency for L-apps due to fewer context switches
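A sketch of the longer-time-scale application steering, continuing the sketch above; struct app, rebind_app(), core_util(), and STEER_THRESHOLD are illustrative assumptions, with core_util() standing in for a per-core average-utilization estimate. Unlike request steering, this estimate counts both L-app and T-app load.

```c
/* Sketch of application steering: runs on a long time scale (the notes say
 * roughly every 10ms) and rebinds whole applications, not individual
 * requests, to lightly loaded cores. */
#define STEER_THRESHOLD 4096UL    /* illustrative hysteresis, in bytes */

struct app {
    int core;                     /* core the app is currently bound to */
};

/* A real version would remap the app's ingress queue and thread affinity;
 * fewer per-request migrations also means fewer context switches, which is
 * where the extra latency win for L-apps comes from. */
static void rebind_app(struct app *a, int target) { a->core = target; }

static unsigned long core_util(int core)
{
    /* Both L-app and T-app load count here (contrast with tapp_load()). */
    return eq[core][APP_L].pending_bytes + eq[core][APP_T].pending_bytes;
}

static int least_loaded_core(int nr_cores)
{
    int best = 0;

    for (int c = 1; c < nr_cores; c++)
        if (core_util(c) < core_util(best))
            best = c;
    return best;
}

/* Called every ~10ms: move apps off persistently overloaded cores. */
static void app_steering_tick(struct app *apps, int nr_apps, int nr_cores)
{
    for (int i = 0; i < nr_apps; i++) {
        int target = least_loaded_core(nr_cores);

        if (core_util(apps[i].core) > core_util(target) + STEER_THRESHOLD)
            rebind_app(&apps[i], target);
    }
}
```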
Evaluation
Implemented entirely in the Linux kernel with minimal changes (LOC: ~928)
To stress test blk-switch
Complex interaction among the compute, storage, and network stack
Evaluate "remote storage access"
To push the bottleneck to the storage stack processing
Two 32-core servers connected directly over 100 Gbps
To access data on remote servers
Linux / blk-switch use i10
SPDK uses userspace NVMe-over-TCP
It is possible to achieve micro-second scale latency and high throughput with Linux
blk-switch insight: modern storage stack is conceptually similar to network switches
Decoupling request processing from application cores
Multi-egress queue architecture, prioritization, request steering, and application steering
blk-switch achieves
10s of micro-second scale avg latency and < 190 micro-second tail latency with in-memory storage
Near-hardware capacity throughput