High Performance Data Center Operating Systems

https://homes.cs.washington.edu/~tom/talks/os.pdf

  • Today's I/O devices are fast and getting faster

  • Can't we just use Linux?

    • Kernel mediation is too heavyweight

  • Arrakis (OSDI 14): separate the OS control and data plane

    • OS architecture that separates the control and data plane, for both networking and storage

    • How do we skip the kernel? (see the sketch after this list)

    • Design goals

      • Streamline network and storage I/O

        • Eliminate OS mediation in the common case

        • Application-specific customization vs. the kernel's one-size-fits-all policies

      • Keep OS functionality

        • Process (container) isolation and protection

        • Resource arbitration, enforceable resource limits

        • Global naming, sharing semantics

      • POSIX compatibility at the application level

        • Additional performance gains from rewriting the API
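
A minimal C sketch of the resulting data path, assuming the kernel control plane has already mapped a hardware descriptor ring and doorbell register into the application at setup time. The names (tx_desc, tx_queue) and layout are hypothetical, not the Arrakis ABI, and the doorbell is simulated with an ordinary variable so the sketch runs standalone; the point is that sending involves no system call, only a descriptor write and one MMIO store.

```c
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 256

struct tx_desc {                    /* one descriptor per packet buffer */
    uint64_t buf_addr;              /* DMA address of the payload */
    uint32_t len;
    uint32_t flags;
};

struct tx_queue {
    struct tx_desc ring[RING_SIZE];
    uint32_t tail;                  /* next free slot, owned by the app */
    volatile uint32_t *doorbell;    /* device register mapped at setup */
};

/* Post a packet without entering the kernel: fill a descriptor, then
 * ring the doorbell so the NIC starts DMA.  The kernel is involved only
 * at queue setup and for policy (filters, rate limits), not per packet. */
static void tx_post(struct tx_queue *q, uint64_t buf, uint32_t len)
{
    struct tx_desc *d = &q->ring[q->tail % RING_SIZE];
    d->buf_addr = buf;
    d->len = len;
    d->flags = 1;                   /* e.g. "descriptor valid" */
    __sync_synchronize();           /* descriptor visible before doorbell */
    q->tail++;
    *q->doorbell = q->tail;         /* single MMIO write = "go" */
}

int main(void)
{
    uint32_t fake_doorbell = 0;     /* stand-in for the MMIO doorbell */
    struct tx_queue q = { .tail = 0, .doorbell = &fake_doorbell };

    char payload[] = "hello";
    tx_post(&q, (uint64_t)(uintptr_t)payload, sizeof payload);
    printf("posted 1 descriptor, doorbell now %u\n", (unsigned)fake_doorbell);
    return 0;
}
```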

  • Strata (SOSP 17)

    • File system design for low latency persistence (NVM) and multi-tier storage (NVM, SSD, HDD)

    • Storage diversification

      • NVDIMM: byte-addressable (cache-line granularity I/O), direct access with load/store instructions (see the sketch after this list)

      • SSD: large erase blocks (hardware GC overhead); random writes cause a 5-6x slowdown due to GC
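
A minimal sketch of what byte-addressability buys, assuming an NVM-backed file on a DAX filesystem at the placeholder path /mnt/pmem/log: a small update is a few ordinary stores followed by a cache-line flush and a store fence, with no block-sized write through the kernel. The persist() helper is a simplified x86-only stand-in (real code would prefer clwb/clflushopt).

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void persist(const void *addr, size_t len)
{
#if defined(__x86_64__)
    /* Flush every cache line covering [addr, addr+len), then fence. */
    for (uintptr_t p = (uintptr_t)addr & ~63ul; p < (uintptr_t)addr + len; p += 64)
        __asm__ __volatile__("clflush (%0)" :: "r"(p) : "memory");
    __asm__ __volatile__("sfence" ::: "memory");
#else
    (void)addr; (void)len;          /* placeholder on non-x86 targets */
#endif
}

int main(void)
{
    int fd = open("/mnt/pmem/log", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    /* With DAX, these loads/stores go straight to the NVDIMM. */
    char *nvm = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (nvm == MAP_FAILED) { perror("mmap"); return 1; }

    /* A small update: ordinary stores, then flush just that cache line. */
    memcpy(nvm, "small update\n", 14);
    persist(nvm, 14);

    munmap(nvm, 4096);
    close(fd);
    return 0;
}
```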

    • Let's build a fast server

      • Key value store, database, file server, mail server ...

      • Requirements

        • Small updates dominate

          • NVM is too fast, kernel is the bottleneck

        • Dataset scales up to many terabytes

          • To save cost, need a way to use multiple device types: NVM, SSD, HDD

            • Using only NVM is too expensive

          • For low-cost capacity with high performance, must leverage multiple device types

            • Block-level caching manages data in blocks, but NVM is byte-addressable!

        • Updates must be crash consistent

          • Applications struggle to get crash consistency right

      • Today's file systems: limited by old design assumptions

        • Kernel mediates every operation

          • NVM is too fast, kernel is the bottleneck

        • Tied to single type of device

          • For low-cost capacity with high performance, must leverage multiple device types

        • Aggressive caching in DRAM, only write to device when you must (fsync)

          • Struggles with crash consistency

      • Strata: a cross media file system

        • Performance: especially small, random IO

          • Fast user-level device access

        • Capacity: leverage NVM, SSD & HDD for low cost

          • Transparent data migration across different media

          • Efficiently handle device IO properties

        • Simplicity: intuitive crash consistency model

          • In-order, sync IO

          • No fsync() required

      • Main design principle: split the file system into a user-level LibFS and a KernelFS (see the sketch after this list)

        • LibFS: log operations to NVM at user-level

          • Fast user-level access

          • In-order, sync IO

        • Kernel FS: digest and migrate data in kernel

          • Async digest

          • Transparent data migration

          • Shared file access
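
A rough sketch of this split, under the assumptions above: LibFS appends every operation to a per-application log, synchronously and in order, so once the append returns the update is durable and no fsync() is needed; KernelFS later digests the log into the shared on-media layout and migrates data across tiers. The record format and function names are illustrative, not Strata's, and an in-memory buffer stands in for the NVM log region.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LOG_BYTES (1 << 20)

struct log_rec {                    /* one logged file-system operation */
    uint32_t type;                  /* e.g. 1 = write */
    uint32_t inode;
    uint64_t offset;
    uint32_t len;
    char     data[];                /* payload follows the header */
};

static _Alignas(8) char log_area[LOG_BYTES];   /* stand-in for the NVM log */
static size_t log_tail;

static size_t rec_size(uint32_t len)
{
    /* Round up so every record header stays 8-byte aligned. */
    return (sizeof(struct log_rec) + len + 7) & ~(size_t)7;
}

/* LibFS side: in-order, synchronous append.  On real NVM the memcpy
 * would be followed by cache-line flushes and a fence, after which the
 * operation is durable -- no separate fsync() required. */
static int libfs_log_write(uint32_t inode, uint64_t off,
                           const void *buf, uint32_t len)
{
    size_t need = rec_size(len);
    if (log_tail + need > LOG_BYTES)
        return -1;                  /* log full: wait for a digest */
    struct log_rec *r = (struct log_rec *)(log_area + log_tail);
    r->type = 1; r->inode = inode; r->offset = off; r->len = len;
    memcpy(r->data, buf, len);
    log_tail += need;               /* publish the record */
    return 0;
}

/* KernelFS side (normally asynchronous): walk the log, apply each
 * record to the shared on-media layout, then reclaim the log space. */
static void kernelfs_digest(void)
{
    size_t pos = 0;
    while (pos < log_tail) {
        struct log_rec *r = (struct log_rec *)(log_area + pos);
        printf("digest: inode %u, %u bytes at offset %llu\n",
               r->inode, r->len, (unsigned long long)r->offset);
        pos += rec_size(r->len);
    }
    log_tail = 0;
}

int main(void)
{
    libfs_log_write(42, 0, "hello", 5);
    libfs_log_write(42, 5, " world", 6);
    kernelfs_digest();
    return 0;
}
```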

  • TCP as a Service / FlexNIC / Floem (ASPLOS 15, OSDI 18)

    • OS, NIC, and app library support for fast, agile, secure protocol processing

    • Let's build a fast server

      • Small RPCs dominate

      • Enforceable resource sharing

      • Agile protocol development

      • Cost-efficient hardware

    • RDMA: read/write to a (limited) region of remote server memory with no CPU involvement on the remote node (fast if the app fits the programming model)

      • Limitations: what if you need remote application computation (RPC)? The lossless network model is performance-fragile

    • Smart NICs: NIC with array of low-end CPU cores

      • If we can compute on the NIC, maybe we don't need to involve the host CPU?

        • Applications in high-speed trading

      • Step 1: Build a faster kernel TCP in software

        • Q: why is RPC over Linux TCP so slow?

          • OS: hardware interface, highly optimized code path, buffer descriptor queues (no interrupts in common case, maximize concurrency)

          • What follows is the OS transmit packet-processing path:

          • TCP layer: move from socket buffer to IP queue

            • Lock the socket, check congestion/flow control limits, fill in the TCP header, calculate the checksum, copy data, arm the retransmission timeout

          • IP layer: firewall, routing, ARP, traffic shaping

          • Driver: move from IP queue to NIC queue

          • Allocate and free packet buffers

          • Net effect: multiple synchronous kernel transitions per RPC (see the sketch after this list)

            • Parameter checks and copies

            • Cache pollution, pipeline stalls
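
For contrast, a minimal sketch of what one small RPC costs over a standard kernel socket: each send() and recv() below is a synchronous kernel transition that walks the transmit/receive path listed above (socket locking, TCP/IP processing, driver queueing, buffer management), plus parameter checks and data copies at the syscall boundary. The server address and port are placeholders.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_port = htons(7000) };   /* example port */
    inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);         /* example addr */
    if (connect(fd, (struct sockaddr *)&srv, sizeof srv) < 0) {
        perror("connect");                   /* 3-way handshake in kernel */
        return 1;
    }

    char req[] = "GET key42", resp[128];
    /* One small RPC = at least two kernel crossings and two copies. */
    send(fd, req, sizeof req, 0);            /* syscall: TCP transmit path */
    ssize_t n = recv(fd, resp, sizeof resp, 0);  /* syscall: receive path */
    printf("got %zd bytes\n", n);

    close(fd);
    return 0;
}
```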

      • TCP Acceleration as a Service (TaS)

        • TCP as a user-level OS service

          • SR-IOV to dedicated cores

          • Scale number of cores up/down to match demand

          • Optimized data plane for common case operations

        • Application uses its own dedicated cores

          • Avoid polluting the application's caches

        • To the application, the interface is per-socket tx/rx queues with doorbells (sketched below)
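
A minimal sketch of that application-side interface, assuming a per-socket transmit ring in shared memory plus a doorbell word that the dedicated TaS cores poll; the names and layout are illustrative, not the actual TaS ABI. The application only copies payload bytes and rings the doorbell; segmentation, congestion control, and retransmission happen on the TaS cores.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TX_RING_BYTES 4096

struct sock_txq {
    char buf[TX_RING_BYTES];           /* payload ring shared with TaS */
    _Atomic uint32_t head;             /* consumed by the TaS core */
    _Atomic uint32_t tail;             /* produced by the application */
    _Atomic uint32_t doorbell;         /* "work available" hint */
};

/* Application send: copy payload into the ring and ring the doorbell.
 * No system call; the TCP processing happens on the TaS cores. */
static int tas_send(struct sock_txq *q, const void *data, uint32_t len)
{
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (TX_RING_BYTES - (tail - head) < len)
        return -1;                     /* ring full: back off */
    for (uint32_t i = 0; i < len; i++)
        q->buf[(tail + i) % TX_RING_BYTES] = ((const char *)data)[i];
    atomic_store_explicit(&q->tail, tail + len, memory_order_release);
    atomic_store_explicit(&q->doorbell, 1, memory_order_release);
    return 0;
}

int main(void)
{
    static struct sock_txq q;          /* stand-in for shared memory */
    const char *req = "GET /key42\r\n";
    if (tas_send(&q, req, (uint32_t)strlen(req)) == 0)
        printf("queued %zu bytes for the TaS fast path\n", strlen(req));
    return 0;
}
```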

      • Streamline common-case data path

        • Remove unneeded computation from data path

          • Congestion control and timeouts handled per RTT, not per packet

        • Minimize per-flow TCP state

        • Linearized code

        • Enforce IP-level access control on the control plane at connection setup

      • Step 2: TaS data plane can be efficiently built in hardware

        • FlexNIC design principles

          • RPCs are the common case: kernel bypass to application logic

          • Enforceable per-flow resource sharing: data plane in hardware, policy in kernel

          • Agile protocol development: protocol agnostic, offload both kernel and app packet handling

          • Cost-efficient: minimal instruction set for packet processing

        • FlexTCP: H/W accelerated TCP

          • Fast path is simple enough for the FlexNIC model (sketched below)

          • Applications directly access NIC for RX/TX

          • Software slow path manages NIC flow state via messages

          • Streamlines NIC processing
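
A minimal sketch of that fast-path/slow-path split, assuming per-flow state small enough to keep on the NIC: in-sequence segments are delivered directly to the application's receive queue, while everything else (out-of-order data, connection setup, timeouts) is punted to the software slow path, which updates the NIC's flow state. Names and fields are illustrative, not the FlexTCP design verbatim.

```c
#include <stdint.h>
#include <stdio.h>

struct flow_state {                 /* per-connection state kept on the NIC */
    uint32_t expected_seq;          /* next in-order sequence number */
    uint32_t app_queue_id;          /* where to DMA in-order payload */
};

struct segment {
    uint32_t seq;
    uint32_t len;
    const void *payload;
};

static void dma_to_app_queue(uint32_t qid, const void *p, uint32_t len)
{
    printf("fast path: %u bytes -> app queue %u\n", (unsigned)len, (unsigned)qid);
    (void)p;
}

static void to_slow_path(const struct segment *s)
{
    printf("slow path: out-of-order segment seq=%u\n", (unsigned)s->seq);
}

/* Per-packet fast path: a flow lookup, one comparison, a DMA, and a
 * state update.  Anything that doesn't match falls through to software. */
static void flexnic_rx(struct flow_state *f, const struct segment *s)
{
    if (s->seq == f->expected_seq) {
        dma_to_app_queue(f->app_queue_id, s->payload, s->len);
        f->expected_seq += s->len;
    } else {
        to_slow_path(s);
    }
}

int main(void)
{
    struct flow_state f = { .expected_seq = 1000, .app_queue_id = 3 };
    struct segment a = { 1000, 100, "..." };    /* in order  */
    struct segment b = { 1300, 100, "..." };    /* gap -> software */
    flexnic_rx(&f, &a);
    flexnic_rx(&f, &b);
    return 0;
}
```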
