High Performance Data Center Operating Systems

https://homes.cs.washington.edu/~tom/talks/os.pdf

  • Today's I/O devices are fast and getting faster

  • Can't we just use Linux?

    • Kernel mediation is too heavyweight

  • Arrakis (OSDI 14): separate the OS control and data plane

    • OS architecture that separates the control and data plane, for both networking and storage

    • How do we skip the kernel? (see the sketch after this list)

    • Design goals

      • Streamline network and storage I/O

        • Eliminate OS mediation in the common case

        • Application-specific customization vs. the kernel's one-size-fits-all policies

      • Keep OS functionality

        • Process (container) isolation and protection

        • Resource arbitration, enforceable resource limits

        • Global naming, sharing semantics

      • POSIX compatibility at the application level

        • Additional performance gains from rewriting the API
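
A minimal C sketch of the resulting data path, assuming the kernel control plane has already mapped a hardware descriptor ring and doorbell register into the application at setup time. The names (tx_desc, tx_queue) and layout are hypothetical, not the Arrakis ABI, and the doorbell is simulated with an ordinary variable so the sketch runs standalone; the point is that sending involves no system call, only a descriptor write and one MMIO store.

```c
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 256

struct tx_desc {                    /* one descriptor per packet buffer */
    uint64_t buf_addr;              /* DMA address of the payload */
    uint32_t len;
    uint32_t flags;
};

struct tx_queue {
    struct tx_desc ring[RING_SIZE];
    uint32_t tail;                  /* next free slot, owned by the app */
    volatile uint32_t *doorbell;    /* device register mapped at setup */
};

/* Post a packet without entering the kernel: fill a descriptor, then
 * ring the doorbell so the NIC starts DMA.  The kernel is involved only
 * at queue setup and for policy (filters, rate limits), not per packet. */
static void tx_post(struct tx_queue *q, uint64_t buf, uint32_t len)
{
    struct tx_desc *d = &q->ring[q->tail % RING_SIZE];
    d->buf_addr = buf;
    d->len = len;
    d->flags = 1;                   /* e.g. "descriptor valid" */
    __sync_synchronize();           /* descriptor visible before doorbell */
    q->tail++;
    *q->doorbell = q->tail;         /* single MMIO write = "go" */
}

int main(void)
{
    uint32_t fake_doorbell = 0;     /* stand-in for the MMIO doorbell */
    struct tx_queue q = { .tail = 0, .doorbell = &fake_doorbell };

    char payload[] = "hello";
    tx_post(&q, (uint64_t)(uintptr_t)payload, sizeof payload);
    printf("posted 1 descriptor, doorbell now %u\n", (unsigned)fake_doorbell);
    return 0;
}
```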

  • Strata (SOSP 17)

    • File system design for low latency persistence (NVM) and multi-tier storage (NVM, SSD, HDD)

    • Storage diversification

      • NVDIMM: byte-addressable (cache-line granularity I/O), direct access with load/store instructions (see the sketch after this list)

      • SSD: large erase blocks (hardware GC overhead); random writes cause a 5-6x slowdown due to GC
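
A minimal sketch of what byte-addressability buys, assuming an NVM-backed file on a DAX filesystem at the placeholder path /mnt/pmem/log: a small update is a few ordinary stores followed by a cache-line flush and a store fence, with no block-sized write through the kernel. The persist() helper is a simplified x86-only stand-in (real code would prefer clwb/clflushopt).

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void persist(const void *addr, size_t len)
{
#if defined(__x86_64__)
    /* Flush every cache line covering [addr, addr+len), then fence. */
    for (uintptr_t p = (uintptr_t)addr & ~63ul; p < (uintptr_t)addr + len; p += 64)
        __asm__ __volatile__("clflush (%0)" :: "r"(p) : "memory");
    __asm__ __volatile__("sfence" ::: "memory");
#else
    (void)addr; (void)len;          /* placeholder on non-x86 targets */
#endif
}

int main(void)
{
    int fd = open("/mnt/pmem/log", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    /* With DAX, these loads/stores go straight to the NVDIMM. */
    char *nvm = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (nvm == MAP_FAILED) { perror("mmap"); return 1; }

    /* A small update: ordinary stores, then flush just that cache line. */
    memcpy(nvm, "small update\n", 14);
    persist(nvm, 14);

    munmap(nvm, 4096);
    close(fd);
    return 0;
}
```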

    • Let's build a fast server

      • Key value store, database, file server, mail server ...

      • Requirements

        • Small updates dominate

          • NVM is too fast, kernel is the bottleneck

        • Dataset scales up to many terabytes

          • To save cost, need a way to use multiple device types: NVM, SSD, HDD

            • Using only NVM is too expensive

          • For low-cost capacity with high performance, must leverage multiple device types

            • Block-level caching manages data in blocks, but NVM is byte-addressable!

        • Updates must be crash consistent

          • Applications struggle to get crash consistency right

      • Today's file systems: limited by old design assumptions

        • Kernel mediates every operation

          • NVM is too fast, kernel is the bottleneck

        • Tied to single type of device

          • For low-cost capacity with high performance, must leverage multiple device types

        • Aggressive caching in DRAM, only write to device when you must (fsync)

          • Struggles with crash consistency

      • Strata: a cross media file system

        • Performance: especially small, random IO

          • Fast user-level device access

        • Capacity: leverage NVM, SSD & HDD for low cost

          • Transparent data migration across different media

          • Efficiently handle device IO properties

        • Simplicity: intuitive crash consistency model

          • In-order, sync IO

          • No fsync() required

      • Main design principle: split the file system into a user-level LibFS and a KernelFS (see the sketch after this list)

        • LibFS: log operations to NVM at user-level

          • Fast user-level access

          • In-order, sync IO

        • Kernel FS: digest and migrate data in kernel

          • Async digest

          • Transparent data migration

          • Shared file access
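
A rough sketch of this split, under the assumptions above: LibFS appends every operation to a per-application log, synchronously and in order, so once the append returns the update is durable and no fsync() is needed; KernelFS later digests the log into the shared on-media layout and migrates data across tiers. The record format and function names are illustrative, not Strata's, and an in-memory buffer stands in for the NVM log region.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LOG_BYTES (1 << 20)

struct log_rec {                    /* one logged file-system operation */
    uint32_t type;                  /* e.g. 1 = write */
    uint32_t inode;
    uint64_t offset;
    uint32_t len;
    char     data[];                /* payload follows the header */
};

static _Alignas(8) char log_area[LOG_BYTES];   /* stand-in for the NVM log */
static size_t log_tail;

static size_t rec_size(uint32_t len)
{
    /* Round up so every record header stays 8-byte aligned. */
    return (sizeof(struct log_rec) + len + 7) & ~(size_t)7;
}

/* LibFS side: in-order, synchronous append.  On real NVM the memcpy
 * would be followed by cache-line flushes and a fence, after which the
 * operation is durable -- no separate fsync() required. */
static int libfs_log_write(uint32_t inode, uint64_t off,
                           const void *buf, uint32_t len)
{
    size_t need = rec_size(len);
    if (log_tail + need > LOG_BYTES)
        return -1;                  /* log full: wait for a digest */
    struct log_rec *r = (struct log_rec *)(log_area + log_tail);
    r->type = 1; r->inode = inode; r->offset = off; r->len = len;
    memcpy(r->data, buf, len);
    log_tail += need;               /* publish the record */
    return 0;
}

/* KernelFS side (normally asynchronous): walk the log, apply each
 * record to the shared on-media layout, then reclaim the log space. */
static void kernelfs_digest(void)
{
    size_t pos = 0;
    while (pos < log_tail) {
        struct log_rec *r = (struct log_rec *)(log_area + pos);
        printf("digest: inode %u, %u bytes at offset %llu\n",
               r->inode, r->len, (unsigned long long)r->offset);
        pos += rec_size(r->len);
    }
    log_tail = 0;
}

int main(void)
{
    libfs_log_write(42, 0, "hello", 5);
    libfs_log_write(42, 5, " world", 6);
    kernelfs_digest();
    return 0;
}
```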

  • TCP as a Service / FlexNIC / Floem (ASPLOS 15, OSDI 18)

    • OS, NIC, and app library support for fast, agile, secure protocol processing

    • Let's build a fast server

      • Small RPCs dominate

      • Enforceable resource sharing

      • Agile protocol development

      • Cost-efficient hardware

    • RDMA: read/write to a (limited) region of remote server memory with no CPU involvement on the remote node (fast if the app fits the programming model)

      • Limitations: what if you need remote application computation (RPC)? The lossless network model is performance-fragile

    • Smart NICs: NIC with array of low-end CPU cores

      • If we can compute on the NIC, maybe we don't need to involve the host CPU?

        • Applications in high-speed trading

      • Step 1: Build a faster kernel TCP in software

        • Q: why is RPC over Linux TCP so slow?

          • OS: hardware interface, highly optimized code path, buffer descriptor queues (no interrupts in common case, maximize concurrency)

          • What follows is the OS transmit packet-processing path:

          • TCP layer: move from socket buffer to IP queue

            • Lock the socket, check congestion/flow control limits, fill in the TCP header, calculate the checksum, copy data, arm the retransmission timeout

          • IP layer: firewall, routing, ARP, traffic shaping

          • Driver: move from IP queue to NIC queue

          • Allocate and free packet buffers

          • Net effect: multiple synchronous kernel transitions per RPC (see the sketch after this list)

            • Parameter checks and copies

            • Cache pollution, pipeline stalls
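
For contrast, a minimal sketch of what one small RPC costs over a standard kernel socket: each send() and recv() below is a synchronous kernel transition that walks the transmit/receive path listed above (socket locking, TCP/IP processing, driver queueing, buffer management), plus parameter checks and data copies at the syscall boundary. The server address and port are placeholders.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_port = htons(7000) };   /* example port */
    inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);         /* example addr */
    if (connect(fd, (struct sockaddr *)&srv, sizeof srv) < 0) {
        perror("connect");                   /* 3-way handshake in kernel */
        return 1;
    }

    char req[] = "GET key42", resp[128];
    /* One small RPC = at least two kernel crossings and two copies. */
    send(fd, req, sizeof req, 0);            /* syscall: TCP transmit path */
    ssize_t n = recv(fd, resp, sizeof resp, 0);  /* syscall: receive path */
    printf("got %zd bytes\n", n);

    close(fd);
    return 0;
}
```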

      • TCP Acceleration as a Service (TaS)

        • TCP as a user-level OS service

          • SR-IOV to dedicated cores

          • Scale number of cores up/down to match demand

          • Optimized data plane for common case operations

        • Application uses its own dedicated cores

          • Avoid polluting the application's caches

        • To the application, the interface is per-socket tx/rx queues with doorbells (sketched below)
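
A minimal sketch of that application-side interface, assuming a per-socket transmit ring in shared memory plus a doorbell word that the dedicated TaS cores poll; the names and layout are illustrative, not the actual TaS ABI. The application only copies payload bytes and rings the doorbell; segmentation, congestion control, and retransmission happen on the TaS cores.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TX_RING_BYTES 4096

struct sock_txq {
    char buf[TX_RING_BYTES];           /* payload ring shared with TaS */
    _Atomic uint32_t head;             /* consumed by the TaS core */
    _Atomic uint32_t tail;             /* produced by the application */
    _Atomic uint32_t doorbell;         /* "work available" hint */
};

/* Application send: copy payload into the ring and ring the doorbell.
 * No system call; the TCP processing happens on the TaS cores. */
static int tas_send(struct sock_txq *q, const void *data, uint32_t len)
{
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (TX_RING_BYTES - (tail - head) < len)
        return -1;                     /* ring full: back off */
    for (uint32_t i = 0; i < len; i++)
        q->buf[(tail + i) % TX_RING_BYTES] = ((const char *)data)[i];
    atomic_store_explicit(&q->tail, tail + len, memory_order_release);
    atomic_store_explicit(&q->doorbell, 1, memory_order_release);
    return 0;
}

int main(void)
{
    static struct sock_txq q;          /* stand-in for shared memory */
    const char *req = "GET /key42\r\n";
    if (tas_send(&q, req, (uint32_t)strlen(req)) == 0)
        printf("queued %zu bytes for the TaS fast path\n", strlen(req));
    return 0;
}
```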

      • Streamline common-case data path

        • Remove unneeded computation from data path

          • Congestion control and timeouts handled per RTT, not per packet

        • Minimize per-flow TCP state

        • Linearized code

        • Enforce IP-level access control on the control plane at connection setup

      • Step 2: TaS data plane can be efficiently built in hardware

        • FlexNIC design principles

          • RPCs are the common case: kernel bypass to application logic

          • Enforceable per-flow resource sharing: data plane in hardware, policy in kernel

          • Agile protocol development: protocol agnostic, offload both kernel and app packet handling

          • Cost-efficient: minimal instruction set for packet processing

        • FlexTCP: H/W accelerated TCP

          • Fast path is simple enough for the FlexNIC model (sketched below)

          • Applications directly access NIC for RX/TX

          • Software slow path manages NIC flow state via messages

          • Streamlines NIC processing
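
A minimal sketch of that fast-path/slow-path split, assuming per-flow state small enough to keep on the NIC: in-sequence segments are delivered directly to the application's receive queue, while everything else (out-of-order data, connection setup, timeouts) is punted to the software slow path, which updates the NIC's flow state. Names and fields are illustrative, not the FlexTCP design verbatim.

```c
#include <stdint.h>
#include <stdio.h>

struct flow_state {                 /* per-connection state kept on the NIC */
    uint32_t expected_seq;          /* next in-order sequence number */
    uint32_t app_queue_id;          /* where to DMA in-order payload */
};

struct segment {
    uint32_t seq;
    uint32_t len;
    const void *payload;
};

static void dma_to_app_queue(uint32_t qid, const void *p, uint32_t len)
{
    printf("fast path: %u bytes -> app queue %u\n", (unsigned)len, (unsigned)qid);
    (void)p;
}

static void to_slow_path(const struct segment *s)
{
    printf("slow path: out-of-order segment seq=%u\n", (unsigned)s->seq);
}

/* Per-packet fast path: a flow lookup, one comparison, a DMA, and a
 * state update.  Anything that doesn't match falls through to software. */
static void flexnic_rx(struct flow_state *f, const struct segment *s)
{
    if (s->seq == f->expected_seq) {
        dma_to_app_queue(f->app_queue_id, s->payload, s->len);
        f->expected_seq += s->len;
    } else {
        to_slow_path(s);
    }
}

int main(void)
{
    struct flow_state f = { .expected_seq = 1000, .app_queue_id = 3 };
    struct segment a = { 1000, 100, "..." };    /* in order  */
    struct segment b = { 1300, 100, "..." };    /* gap -> software */
    flexnic_rx(&f, &a);
    flexnic_rx(&f, &b);
    return 0;
}
```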
