Rearchitecting Linux Storage Stack for µs Latency and High Throughput

Widespread belief: Linux cannot achieve microsecond-scale latency and high throughput
- Adoption of high-performance H/W, but stagnant single-core capacity
  - T-app: throughput-bound app
  - Static data path --> hard to utilize all cores
- Co-location of apps with different performance goals
  - L-app: latency-sensitive app
  - High latency due to head-of-line (HoL) blocking
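
To make the HoL blocking point concrete, here is a small arithmetic sketch. The service times (100 µs per T-app request, 5 µs for the L-app request) and the function name are hypothetical, chosen only for illustration; they are not measurements from the talk.

```python
def completion_times(service_times_us):
    """Completion time of each request when one core serves them in FIFO order."""
    done, clock = [], 0.0
    for service in service_times_us:
        clock += service
        done.append(clock)
    return done

# Hypothetical workload: four large T-app requests queued ahead of one small L-app request.
fifo = completion_times([100, 100, 100, 100, 5])
l_app_latency_shared = fifo[-1]              # L-app request waits behind all T-app requests
l_app_latency_alone = completion_times([5])[0]

print(l_app_latency_shared, l_app_latency_alone)
```

Under these assumed numbers the L-app request finishes after 405 µs in the shared queue versus 5 µs on its own, which is the effect the bullet above refers to.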
Performance of existing storage stack
- Applications accessing in-memory data in remote servers (single-core case)
  - Low latency or high throughput, but not both
blk-switch summary
- Linux can achieve microsecond-scale latency while achieving near H/W capacity throughput!
  - Without changes in applications, kernel CPU scheduler, kernel TCP/IP stack, and network hardware
- For example, blk-switch achieves
  - Even with tens of applications (6 L-apps + 6 T-apps on 6 cores)
  - Complex interference at compute, storage, and network stacks (remote storage access over 100 Gbps)
Key insight
- Observation: today's Linux storage stack is conceptually similar to network switches
- blk-switch: switched Linux storage stack architecture
  - Enables decoupling request processing from application cores
  - Multi-egress queues, prioritization, and load balancing

2. Flexible mapping from ingress to egress queues

--> decoupling request processing from application cores: "static --> flexible"

Three techniques:

blk-switch prioritization
- Prioritize L-app request processing
- Multi-egress queues + prioritization: near optimal latency for L-apps
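
A minimal sketch of the multi-egress-queue idea, assuming one queue per application class on each core and strict priority between them. The class name `CoreEgress`, the "L"/"T" labels as dictionary keys, and the strict-priority policy are illustrative assumptions; the notes say only that L-app request processing is prioritized.

```python
from collections import deque

class CoreEgress:
    """One core's egress side: a separate queue per application class (hypothetical model)."""

    def __init__(self):
        self.queues = {"L": deque(), "T": deque()}

    def enqueue(self, app_class, request):
        self.queues[app_class].append(request)

    def dequeue(self):
        # Assumed policy: always serve latency-sensitive requests before throughput-bound ones.
        for app_class in ("L", "T"):
            if self.queues[app_class]:
                return self.queues[app_class].popleft()
        return None  # nothing queued on this core

core = CoreEgress()
core.enqueue("T", "t1")
core.enqueue("T", "t2")
core.enqueue("L", "l1")   # arrives last, but is served first
order = [core.dequeue() for _ in range(3)]
print(order)
```

Because the L-app request no longer sits behind T-app requests in a shared queue, its latency does not depend on how much T-app work is outstanding on the same core.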
blk-switch request steering for transient loads
- Challenge: prioritization of L-apps can lead to transient starvation of T-apps
- Steer requests to underutilized cores at per-request granularity
  - Select target cores using known techniques
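
One way per-request steering could look, as a sketch only. The notes do not name the selection technique here, so this example assumes power-of-two-choices sampling; the function name `pick_target_core` and the `queued_bytes` load metric are hypothetical.

```python
import random

def pick_target_core(local_core, queued_bytes, rng=random):
    """Pick an egress core for a single request.

    Assumed policy: sample two cores other than the local one, then return the
    least-loaded core among the samples and the local core. Ties favor the
    local core, so a request is steered away only when that actually helps.
    """
    others = [c for c in range(len(queued_bytes)) if c != local_core]
    candidates = rng.sample(others, 2) if len(others) >= 2 else others
    return min(candidates + [local_core],
               key=lambda c: (queued_bytes[c], c != local_core))

# Core 0 is backed up with T-app work; every other core is idle.
target = pick_target_core(0, [900, 0, 0, 0])
print(target)
```

Sampling only two candidates keeps the per-request decision cheap, which matters when the decision is made for every request rather than once per application.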
blk-switch application steering for persistent loads
- Challenge: persistent loads lead to high system overheads
- Steer apps to cores with low average utilization
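
A sketch of application steering under stated assumptions: each core's utilization is tracked as an exponentially weighted moving average, and an app moves only when its own core stays above a threshold. The smoothing factor `alpha`, the `high_water` threshold, and both function names are hypothetical, not taken from the talk.

```python
def update_average(avg, sample, alpha=0.25):
    """Exponentially weighted moving average of a core's utilization (0.0 to 1.0)."""
    return (1 - alpha) * avg + alpha * sample

def choose_core_for_app(current_core, avg_util, high_water=0.8):
    """Return the core the app should run on.

    The app stays put unless its core is persistently busy (average above
    high_water); in that case it moves to the core with the lowest average.
    """
    if avg_util[current_core] <= high_water:
        return current_core
    return min(range(len(avg_util)), key=lambda c: avg_util[c])

# The app's core (0) has been busy for a while; core 2 has the lowest average.
new_core = choose_core_for_app(0, [0.95, 0.40, 0.10])
print(new_core)
```

Using an average rather than an instantaneous reading keeps short bursts from triggering migrations; those are left to per-request steering, while app moves are reserved for load that persists.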
Evaluation
- Implemented entirely in the Linux kernel with minimal changes (LOC: ~928)
- To stress test blk-switch

It is possible to achieve microsecond-scale latency and high throughput with Linux
blk-switch insight: modern storage stack is conceptually similar to network switches
- Decoupling request processing from application cores
- Multi-egress queue architecture, prioritization, request steering, and application steering
blk-switch achieves
- Average latency in the tens of microseconds and tail latency < 190 microseconds with in-memory storage
- Near-hardware capacity throughput
