High Performance Data Center Operating Systems
https://homes.cs.washington.edu/~tom/talks/os.pdf
Today's I/O devices are fast and getting faster
Can't we just use Linux?
Kernel mediation is too heavyweight
Arrakis (OSDI 14): separate the OS control and data plane
OS architecture that separates the control and data plane, for both networking and storage
How to skip the kernel?
Design goals
Streamline network and storage I/O
Eliminate OS mediation in the common case (see the sketch after this list)
Application-specific customization vs. the kernel's one-size-fits-all policies
Keep OS functionality
Process (container) isolation and protection
Resource arbitration, enforceable resource limits
Global naming, sharing semantics
POSIX compatibility at the application level
Additional performance gains from rewriting the API
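To make "eliminate OS mediation in the common case" concrete, here is a minimal sketch of what data-plane I/O looks like from the application's side: the kernel maps a per-application hardware queue and doorbell register at setup time (control plane), after which the application posts descriptors directly to the device with no system calls (data plane). The structures and names below are illustrative assumptions, not the actual Arrakis interface.

```c
/* Hypothetical kernel-bypass send path. The descriptor and queue layout
 * are assumptions for illustration, not the real Arrakis API. */
#include <stdint.h>
#include <stdatomic.h>

struct tx_desc {
    uint64_t buf_addr;   /* IOVA of the payload buffer */
    uint32_t len;
    uint32_t flags;
};

struct hw_queue {                     /* mapped into the app by the kernel */
    struct tx_desc    *ring;
    uint32_t           size;
    uint32_t           tail;          /* producer index, owned by the app  */
    volatile uint32_t *doorbell;      /* device register, also mapped      */
};

/* Data-plane send: no kernel mediation in the common case. */
static void queue_send(struct hw_queue *q, uint64_t iova, uint32_t len)
{
    struct tx_desc *d = &q->ring[q->tail % q->size];
    d->buf_addr = iova;
    d->len      = len;
    d->flags    = 1;                              /* e.g. "descriptor valid" */
    atomic_thread_fence(memory_order_release);    /* publish before doorbell */
    q->tail++;
    *q->doorbell = q->tail;                       /* tell the device to fetch it */
}
```

The kernel still sets up the mapping, enforces protection, and handles exceptions; only the per-operation fast path bypasses it.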
Strata (SOSP 17)
File system design for low latency persistence (NVM) and multi-tier storage (NVM, SSD, HDD)
Storage diversification
NVDIMM: byte-addressable (cache-line-granularity I/O), direct access with load/store instructions (sketch below)
SSD: large erase blocks (hardware GC overhead); random writes can cause a 5-6x slowdown due to GC
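As a concrete illustration of byte-addressable access: a store to a DAX-mapped NVDIMM can be made durable at cache-line granularity with a cache-line write-back and a fence. This is a minimal sketch assuming a DAX mapping and a CPU with the clwb instruction (compile with -mclwb); it is not code from the talk.

```c
/* Persist a single 64-bit update to NVM at cache-line granularity.
 * Assumes nvm_slot points into a DAX-mapped NVDIMM region. */
#include <immintrin.h>
#include <stdint.h>

static void persist_u64(uint64_t *nvm_slot, uint64_t value)
{
    *nvm_slot = value;            /* ordinary load/store access to NVM       */
    _mm_clwb((void *)nvm_slot);   /* write back the dirty cache line         */
    _mm_sfence();                 /* order the write-back before later stores */
}
```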
Let's build a fast server
Key-value store, database, file server, mail server, ...
Requirements
Small updates dominate
NVM is too fast, kernel is the bottleneck
Dataset scales up to many terabytes
To save cost, need a way to use multiple device types: NVM, SSD, HDD
Using only NVM is too expensive
For low-cost capacity with high performance, must leverage multiple device types
Block-level caching manages data in blocks, but NVM is byte-addressable!
Updates must be crash consistent
Applications struggle to achieve crash consistency
Today's file systems: limited by old design assumptions
Kernel mediates every operation
NVM is too fast, kernel is the bottleneck
Tied to single type of device
For low-cost capacity with high performance, must leverage multiple device types
Aggressive caching in DRAM, only write to device when you must (fsync)
Struggles with crash consistency (see the fsync sketch below)
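To see why crash consistency is left to applications today, here is the dance a careful program has to perform under a conventional POSIX file system: write a temporary file, fsync it, rename it over the old file, then fsync the directory so the rename itself is durable. Strata's in-order synchronous log is meant to make this unnecessary. The function below is a generic POSIX sketch, not code from the talk.

```c
/* Crash-consistent replacement of a file's contents on a traditional FS. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int atomic_update(const char *dir, const char *tmp, const char *dst,
                  const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(fd);

    if (rename(tmp, dst) != 0) return -1;     /* atomic replace of dst      */

    int dfd = open(dir, O_DIRECTORY | O_RDONLY);
    if (dfd < 0) return -1;
    int rc = fsync(dfd);                      /* persist the directory entry */
    close(dfd);
    return rc;
}
```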
Strata: a cross media file system
Performance: especially small, random IO
Fast user-level device access
Capacity: leverage NVM, SSD & HDD for low cost
Transparent data migration across different media
Efficiently handle device IO properties
Simplicity: intuitive crash consistency model
In-order, sync IO
No fsync() required
Main design principle
LibFS: log operations to NVM at user level (see the sketch after this list)
Fast user-level access
In-order, sync IO
Kernel FS: digest and migrate data in kernel
Async digest
Transparent data migration
Shared file access
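The LibFS half of this split can be pictured as an append-only operation log in NVM: each update is appended and persisted synchronously, so it is durable and in order without fsync, while the kernel FS digests the log into shared, device-appropriate layouts asynchronously. The record layout and names below are assumptions for illustration, not the actual Strata format.

```c
/* Sketch of a user-level operation log in DAX-mapped NVM (LibFS side). */
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

struct log_record {
    uint32_t op;          /* e.g. write, create, unlink */
    uint32_t inode;
    uint64_t offset;
    uint32_t len;
    char     data[];      /* inline payload for small writes */
};

struct nvm_log {
    uint8_t *base;        /* DAX-mapped NVM region                      */
    uint64_t head;        /* digest pointer, advanced by the kernel FS  */
    uint64_t tail;        /* append pointer, owned by LibFS             */
};

static void persist(const void *p, size_t n)
{
    for (uintptr_t a = (uintptr_t)p & ~63ul; a < (uintptr_t)p + n; a += 64)
        _mm_clwb((void *)a);               /* write back each dirty line */
    _mm_sfence();
}

/* Synchronous, in-order append: once this returns, the write is durable. */
static void libfs_log_write(struct nvm_log *log, uint32_t inode,
                            uint64_t off, const void *buf, uint32_t len)
{
    struct log_record *r = (struct log_record *)(log->base + log->tail);
    r->op = 1;  r->inode = inode;  r->offset = off;  r->len = len;
    memcpy(r->data, buf, len);
    persist(r, sizeof(*r) + len);
    log->tail += sizeof(*r) + len;             /* commit by advancing tail */
    persist(&log->tail, sizeof(log->tail));
}
```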
TCP as a Service / FlexNIC / Floem (ASPLOS 15, OSDI 18)
OS, NIC, and app library support for fast, agile, secure protocol processing
Let's build a fast server
Small RPCs dominate
Enforceable resource sharing
Agile protocol development
Cost-efficient hardware
RDMA: read/write to (limited) region of remote server memory, no CPU involvement on the remote node (fast if app can use programming model)
Limitations: what if you need computation at the remote application (RPC)? The lossless network model is performance-fragile (see the RDMA write sketch after this list)
Smart NICs: NIC with array of low-end CPU cores
If we can compute on the NIC, maybe we don't need to involve the host CPU?
Applications in high-speed trading
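For reference, this is roughly what the RDMA programming model looks like with libibverbs: a one-sided write to a registered region of remote memory, with no CPU involvement on the remote node. The snippet assumes the queue pair, local memory registration, and the peer's remote address/rkey were already exchanged during connection setup (not shown); it is a sketch of the model, not part of the talk.

```c
/* Post a one-sided RDMA write using libibverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
               void *local_buf, size_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* no remote CPU involved   */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion     */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The limitation in the note above is visible here: the write lands in remote memory, but any computation over that data (an RPC handler, an index update) still needs the remote CPU or some other processing element.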
Step 1: Build a faster kernel TCP in software
Q: why is RPC over Linux TCP so slow?
OS: hardware interface, highly optimized code path, buffer descriptor queues (no interrupts in common case, maximize concurrency)
Below: the OS transmit packet-processing path
TCP layer: move from socket buffer to IP queue
Lock socket, check congestion / flow control limits, fill in TCP header, calculate checksum, copy data, arm retransmission timeout
IP layer: firewall, routing, ARP, traffic shaping
Driver: move from IP queue to NIC queue
Allocate and free packet buffers
The result: multiple synchronous kernel transitions per operation
Parameter checks and copies
Cache pollution, pipeline stalls
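A quick way to get a feel for the per-operation kernel cost listed above is to time many tiny I/O system calls; each one pays for a kernel transition, parameter checks, and a copy. This is a measurement sketch using a pipe as a stand-in for a loopback socket; actual numbers vary by machine and say nothing about any specific result from the talk.

```c
/* Measure the round-trip cost of small write()/read() system calls. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0) return 1;

    char byte = 'x', sink[4096];
    const long iters = 1000000;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        write(fds[1], &byte, 1);            /* one kernel transition per call */
        read(fds[0], sink, sizeof(sink));   /* and another to drain it        */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.0f ns per write+read round trip\n", ns / iters);
    return 0;
}
```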
TCP Acceleration as a Service (TaS)
TCP as a user-level OS service
SR-IOV to dedicated cores
Scale number of cores up/down to match demand
Optimized data plane for common case operations
Application uses its own dedicated cores
Avoid polluting application level cache
To the application: per-socket tx/rx queues with doorbells (see the sketch after this list)
Streamline common-case data path
Remove unneeded computation from data path
Congestion control and timeouts handled per RTT (off the per-packet path)
Minimize per-flow TCP state
Linearized code
Enforce IP-level access control on the control plane at connection setup
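The per-socket queue interface can be pictured as follows: the application "sends" by appending an entry to a shared tx queue and ringing a doorbell that the TaS service cores poll, with no system call on the data path. Queue layout and field names are assumptions for illustration, not the actual TaS ABI.

```c
/* Sketch of an application-side send on a per-socket tx queue. */
#include <stdint.h>
#include <stdatomic.h>
#include <string.h>

struct sock_tx_entry {
    uint32_t len;
    uint8_t  payload[2048];
};

struct sock_queue {
    struct sock_tx_entry *entries;   /* shared with the TaS fast-path cores */
    uint32_t              size;
    _Atomic uint32_t      tail;      /* producer index (application)        */
    _Atomic uint32_t     *doorbell;  /* per-socket doorbell polled by TaS   */
};

static int tas_send(struct sock_queue *q, const void *buf, uint32_t len)
{
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    struct sock_tx_entry *e = &q->entries[t % q->size];
    if (len > sizeof(e->payload)) return -1;

    memcpy(e->payload, buf, len);
    e->len = len;
    /* Publish the entry before making it visible to the service cores. */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    atomic_store_explicit(q->doorbell, t + 1, memory_order_release);
    return (int)len;
}
```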
Step 2: TaS data plane can be efficiently built in hardware
FlexNIC design principles
RPCs are the common case: kernel bypass to application logic
Enforceable per-flow resource sharing: data plane in hardware, policy in kernel
Agile protocol development: protocol agnostic, offload both kernel and app packet handling
Cost-efficient: minimal instruction set for packet processing
FlexTCP: H/W accelerated TCP
The fast path is simple enough for the FlexNIC model (sketched below)
Applications directly access NIC for RX/TX
Software slow path manages NIC per-flow state
Streamlines NIC processing
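To illustrate why the fast path fits a match-plus-action NIC model, here is a minimal receive-side sketch: an in-order segment for a known flow is steered straight to the application's queue and the per-flow state is advanced, while anything unusual is diverted to the software slow path. The flow-state fields and names are assumptions, not the actual FlexTCP design.

```c
/* Sketch of a per-packet receive fast path in the FlexNIC style. */
#include <stdint.h>
#include <stdbool.h>

struct flow_state {                /* minimized per-flow TCP state */
    uint32_t expected_seq;
    uint32_t ack_to_send;
    uint32_t app_rxq_id;           /* which application rx queue gets the payload */
};

struct segment {
    uint32_t seq;
    uint32_t payload_len;
    bool     exceptional;          /* SYN/FIN/unusual options -> slow path */
};

enum verdict { TO_APP_QUEUE, TO_SLOW_PATH };

static enum verdict fast_path(struct flow_state *fs, const struct segment *s)
{
    if (s->exceptional || s->seq != fs->expected_seq)
        return TO_SLOW_PATH;              /* out of order or needs full TCP logic */
    fs->expected_seq += s->payload_len;   /* advance receive state                */
    fs->ack_to_send   = fs->expected_seq;
    return TO_APP_QUEUE;                  /* DMA payload directly to the app      */
}
```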