Rearchitecting Linux Storage Stack for µs Latency and High Throughput


  • Widespread belief: Linux cannot achieve micro-second scale latency & high throughput

    • Adoption of high-performance H/W, but stagnant single-core capacity

      • T-app: throughput-bound app

      • Static data path --> hard to utilize all cores

    • Co-location of apps with different performance goals

      • L-app: latency-sensitive app

      • High latency due to HoL (head-of-line) blocking

  • Performance of existing storage stack

    • Applications accessing in-memory data in remote servers (single-core case)

      • Low latency or high throughput, but not both

  • blk-switch summary

    • Linux can achieve micro-second scale latency while achieving near H/W capacity throughput!

      • Without changes in applications, kernel CPU scheduler, kernel TCP/IP stack, and network hardware

    • For example, blk-switch achieves both goals

      • Even with tens of applications (6 L-apps + 6 T-apps on 6 cores)

      • Under complex interference at the compute, storage, and network stacks (remote storage access over 100 Gbps)

  • Key insight

    • Observation: today's Linux storage stack is conceptually similar to network switches

    • blk-switch: switched linux storage stack architecture

      • Enables decoupling request processing from application cores

      • Multi-egress queues, prioritization, and load balancing


  1. Egress queue per (core, app-class)

  2. Flexible mapping from ingress to egress queues

  --> decoupling request processing from application cores: "static --> flexible"
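The switched queue model above can be sketched as follows. This is an illustrative sketch, not the kernel implementation: the class and method names (`BlkSwitch`, `submit`, `poll`) are assumptions chosen for clarity. Each core holds one egress queue per app class, and an ingress request may be mapped to any core's egress queue rather than being pinned to the submitting core.

```python
# Hypothetical sketch of the switched storage-stack model (names are
# illustrative, not the kernel's). Each core has one egress queue per
# app class; ingress-to-egress mapping is flexible, so a request may
# be processed on a core other than the one that submitted it.
from collections import deque
from dataclasses import dataclass, field

L_APP, T_APP = "L", "T"  # latency-sensitive vs throughput-bound

@dataclass
class Core:
    cid: int
    # one egress queue per (core, app-class)
    egress: dict = field(
        default_factory=lambda: {L_APP: deque(), T_APP: deque()}
    )

class BlkSwitch:
    def __init__(self, n_cores):
        self.cores = [Core(c) for c in range(n_cores)]

    def submit(self, req, app_class, ingress_core, target_core=None):
        # Flexible mapping: the target core may differ from the
        # ingress (application) core, decoupling request processing
        # from application cores.
        cid = target_core if target_core is not None else ingress_core
        self.cores[cid].egress[app_class].append(req)

    def poll(self, cid):
        # Prioritization: drain the L-app queue before the T-app queue.
        q = self.cores[cid].egress
        if q[L_APP]:
            return q[L_APP].popleft()
        if q[T_APP]:
            return q[T_APP].popleft()
        return None
```

Polling the L-app queue first is what gives L-apps near-optimal latency even when T-app requests are queued on the same core.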

Three techniques:

  • Blk-switch prioritization

    • Prioritize L-app request processing

    • Multi-egress queues + prioritization: near optimal latency for L-apps

  • Blk-switch request steering for transient loads

    • Challenge: prioritization of L-apps can lead to transient starvation of T-apps

    • Steer requests to underutilized cores at per-request granularity

      • Select target cores using known techniques

      • Capture only T-app load

    • Request steering allows blk-switch to maintain high throughput, even under transient loads

  • Blk-switch application steering for persistent loads

    • Challenge: persistent loads lead to high system overheads

    • Steer apps to cores with low average utilization

      • Long-term time scales (e.g., every 10ms)

      • Both L-app and T-app load

    • High throughput for T-apps even under persistent loads

    • Even lower latency for L-apps due to fewer context switches
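The two steering policies above can be sketched as below. This is a sketch under stated assumptions: the function names and load representation are hypothetical, and power-of-two-choices is used as an example of the "known techniques" the notes mention for target-core selection.

```python
# Illustrative sketch of the two steering policies (helper names and
# load metrics are assumptions, not the paper's code).
import random

def pick_request_target(t_app_load):
    # Per-request steering for transient loads: sample two cores at
    # random (power-of-two-choices) and steer the request to the one
    # with less load, capturing only T-app load.
    a, b = random.sample(range(len(t_app_load)), 2)
    return a if t_app_load[a] <= t_app_load[b] else b

def pick_app_target(total_load):
    # Application steering for persistent loads: on a longer time
    # scale (e.g., every 10 ms), move an app to the core with the
    # lowest average utilization, counting both L-app and T-app load.
    return min(range(len(total_load)), key=total_load.__getitem__)
```

Note the asymmetry: request steering looks only at T-app load (so it never piles T-app work onto a core busy with L-app requests it cannot see), while application steering considers total load because it rebalances whole apps.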

  • Evaluation

    • Implemented entirely in the Linux kernel with minimal changes (LOC: ~928)

    • To stress test blk-switch

      • Complex interaction among the compute, storage, and network stack

      • Evaluate "remote storage access"

    • To push the bottleneck to the storage stack processing

      • Two 32-core servers connected directly over 100 Gbps

    • To access data on remote servers

      • Linux / blk-switch use i10

      • SPDK uses userspace NVMe-over-TCP


  • It is possible to achieve micro-second scale latency and high throughput with Linux

  • blk-switch insight: modern storage stack is conceptually similar to network switches

    • Decoupling request processing from application cores

    • Multi-egress queue architecture, prioritization, request steering, and application steering

  • blk-switch achieves

    • 10s of micro-second scale avg latency and < 190 micro-second tail latency with in-memory storage

    • Near-hardware capacity throughput
