# High Performance Data Center Operating Systems

* Today's I/O devices are fast and getting faster
* Can't we just use Linux?
  * Kernel mediation is too heavyweight
* Arrakis (OSDI 14): separate the OS control and data plane
  * OS architecture that separates the control and data plane, for both networking and storage
  * How to skip the kernel?
  * ![](/files/yLFbMsGpdpfq6nhYtffG)
  * Design goals
    * Streamline network and storage I/O
      * Eliminate OS mediation in the common case
      * Application-specific customization vs. kernel one-size-fits-all
    * Keep OS functionality
      * Process (container) isolation and protection
      * Resource arbitration, enforceable resource limits
      * Global naming, sharing semantics
    * POSIX compatibility at the application level
      * Additional performance gains from rewriting the API
* Strata (SOSP 17)
  * File system design for low latency persistence (NVM) and multi-tier storage (NVM, SSD, HDD)
  * Storage diversification
    * NVDIMM: byte-addressable (cache-line granularity I/O), direct access with load/store instructions
    * SSD: large erase blocks (hardware GC overhead); random writes cause a 5-6x slowdown due to GC
  * Let's build a fast server
    * Key-value store, database, file server, mail server ...
    * Requirements
      * Small updates dominate
        * NVM is too fast, kernel is the bottleneck
      * Dataset scales up to many terabytes
        * To save cost, need a way to use multiple device types: NVM, SSD, HDD
          * Using only NVM is too expensive
        * For low-cost capacity with high performance, must leverage multiple device types
          * Block-level caching manages data in blocks, but NVM is byte-addressable!
      * Updates must be crash consistent
        * Applications struggle to achieve crash consistency
    * Today's file systems: limited by old design assumptions
      * Kernel mediates every operation
        * NVM is too fast, kernel is the bottleneck
      * Tied to a single type of device
        * For low-cost capacity with high performance, must leverage multiple device types
      * Aggressive caching in DRAM, only write to device when you must (fsync)
        * Applications struggle to achieve crash consistency
    * Strata: a cross-media file system
      * Performance: especially small, random I/O
        * Fast user-level device access
      * Capacity: leverage NVM, SSD & HDD for low cost
        * Transparent data migration across different media
        * Efficiently handle device I/O properties
      * Simplicity: intuitive crash consistency model
        * In-order, synchronous I/O
        * No fsync() required
    * Main design principle
      * LibFS: log operations to NVM at user level
        * Fast user-level access
        * In-order, synchronous I/O
      * Kernel FS: digest and migrate data in kernel
        * Asynchronous digest
        * Transparent data migration
        * Shared file access
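
The LibFS / Kernel FS split above can be illustrated with a toy model. This is a minimal sketch, not Strata's actual code: the names `SharedArea`, `LibFS`, and `digest` are hypothetical, a Python list stands in for the NVM operation log, and a dictionary stands in for the kernel-managed shared storage. It only shows the idea that a write is complete once it is appended to the log in order, and that a later digest step moves logged updates into the shared area.

```python
from dataclasses import dataclass, field


@dataclass
class SharedArea:
    """Stand-in for kernel-managed, shared file data (hypothetical)."""
    files: dict = field(default_factory=dict)  # path -> bytearray


@dataclass
class LibFS:
    """Per-application library FS that appends updates to a private log."""
    shared: SharedArea
    log: list = field(default_factory=list)  # stand-in for an NVM operation log

    def write(self, path: str, offset: int, data: bytes) -> None:
        # Appending is the commit point in this model: the update is ordered
        # and treated as durable as soon as the record is in the log.
        self.log.append((path, offset, bytes(data)))

    def read(self, path: str) -> bytes:
        # Start from the shared area, then replay this process's log entries
        # in order so reads observe writes that are not yet digested.
        buf = bytearray(self.shared.files.get(path, b""))
        for p, off, data in self.log:
            if p == path:
                _apply(buf, off, data)
        return bytes(buf)


def _apply(buf: bytearray, offset: int, data: bytes) -> None:
    # Grow the buffer with zero bytes if the write lands past the current end.
    end = offset + len(data)
    if len(buf) < end:
        buf.extend(b"\0" * (end - len(buf)))
    buf[offset:end] = data


def digest(libfs: LibFS) -> int:
    """Kernel-side step: apply the log to the shared area, then truncate it."""
    applied = len(libfs.log)
    for path, off, data in libfs.log:
        buf = libfs.shared.files.setdefault(path, bytearray())
        _apply(buf, off, data)
    libfs.log.clear()
    return applied
```

In this model, `read` returns the same bytes before and after `digest` runs, which is the property that lets the digest happen asynchronously. Migration across SSD and HDD tiers and cross-process sharing are left out of the sketch.
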
* TCP as a Service / FlexNIC / Floem (ASPLOS 15, OSDI 18)
  * OS, NIC, and app library support for fast, agile, secure protocol processing
  * Let's build a fast server
    * Small RPCs dominate
    * Enforceable resource sharing
    * Agile protocol development
    * Cost-efficient hardware
  * RDMA: read/write to a (limited) region of remote server memory, with no CPU involvement on the remote node (fast if the app can use the programming model)
    * Limitations: what if you need remote application computation (RPC)? The lossless model is performance-fragile
  * Smart NICs: NIC with an array of low-end CPU cores
    * If we compute on the NIC, maybe we don't need to go to the CPU?
      * Applications in high-speed trading
    * **Step 1: Build a faster kernel TCP in software**
      * Q: Why is RPC over Linux TCP so slow?
        * OS-hardware interface: highly optimized code path, buffer descriptor queues (no interrupts in common case, maximize concurrency)
        * What follows: OS transmit packet processing
        * TCP layer: move from socket buffer to IP queue
          * Lock socket, congestion / flow control limit, fill in TCP header, calculate checksum, copy data, arm retransmission timeout
        * IP layer: firewall, routing, ARP, traffic shaping
        * Driver: move from IP queue to NIC queue
        * Allocate and free packet buffers
        * Now: multiple synchronous kernel transitions
          * Parameter checks and copies
          * Cache pollution, pipeline stalls
    * TCP Acceleration as a Service (TaS)
      * TCP as a user-level OS service
        * SR-IOV to dedicated cores
        * Scale number of cores up/down to match demand
        * Optimized data plane for common-case operations
      * Application uses its own dedicated cores
        * Avoid polluting application-level cache
      * To the application, per-socket tx/rx queues with doorbells
    * Streamline common-case data path
      * Remove unneeded computation from data path
        * Congestion control, timeouts per RTT
      * Minimize per-flow TCP state
      * Linearized code
      * Enforce IP-level access control on control plane at connection setup
    * **Step 2: TaS data plane can be efficiently built in hardware**
      * FlexNIC design principles
        * RPCs are the common case: kernel bypass to application logic
        * Enforceable per-flow resource sharing: data plane in hardware, policy in kernel
        * Agile protocol development: protocol agnostic, offload both kernel and app packet handling
        * Cost-efficient: minimal instruction set for packet processing
      * FlexTCP: H/W accelerated TCP
        * Fast path is simple enough for FlexNIC model
        * Applications directly access NIC for RX/TX
        * Software slow path manages NIC state
        * Streamlines NIC processing
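
The fast-path / slow-path split that TaS and FlexTCP rely on can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the TaS or FlexNIC implementation: `FlowState`, `Segment`, and `DataPlane` are hypothetical names, the "doorbell" is just a counter, and only the receive side is modeled. The point is that the common case (an in-order data segment on an already-installed flow) touches very little per-flow state and runs straight-line code, while everything else is handed to a slow path.

```python
from dataclasses import dataclass, field


@dataclass
class FlowState:
    """Deliberately small per-flow state kept by the fast path (hypothetical)."""
    next_seq: int = 0                             # next in-order byte expected
    rx_queue: list = field(default_factory=list)  # per-socket receive queue
    doorbell: int = 0                             # bumped to signal the application


@dataclass
class Segment:
    flow_id: int
    seq: int
    payload: bytes
    flags: frozenset = frozenset()                # e.g. frozenset({"SYN"})


class DataPlane:
    def __init__(self):
        self.flows = {}       # flow_id -> FlowState, installed by the control plane
        self.slow_path = []   # exceptional segments handed to the slow path

    def install_flow(self, flow_id: int, initial_seq: int) -> None:
        # Control-plane action at connection setup. In this model, access
        # control would be checked here once instead of on every packet.
        self.flows[flow_id] = FlowState(next_seq=initial_seq)

    def receive(self, seg: Segment) -> str:
        flow = self.flows.get(seg.flow_id)
        # Common case: known flow, no control flags, exactly in order.
        if flow is not None and not seg.flags and seg.seq == flow.next_seq:
            flow.rx_queue.append(seg.payload)
            flow.next_seq += len(seg.payload)
            flow.doorbell += 1
            return "fast"
        # Everything else (unknown flow, SYN/FIN, reordering) is an exception.
        self.slow_path.append(seg)
        return "slow"
```

A real data plane would also handle acknowledgements, transmit queues, and window updates. The sketch leaves those out to keep the common-case branch visible.
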

