LegoOS: A Disseminated Distributed OS for Hardware Resource Disaggregation

Hardware Resource Disaggregation

  • Resource packing is difficult

  • Hard to add heterogeneous hardware: new devices must fit into a server (not planned beforehand)

  • Poor elasticity: add / remove / reconfigure

  • Fault tolerance

    • e.g., if a CPU's memory controller fails, the whole server goes down

  • Why possible now?

    • Network is faster

    • More processing power at device: smartNIC, smartSSD, PIM

    • Network interface closer to device

Kernel Architectures for Resource Disaggregation

  • Can existing kernels fit?

  • Existing kernels don't fit

    • Remote resources

    • Distributed resource management (resources are partitioned across the network)

    • Fine-grained failure handling (each component can fail independently)

  • Key idea: when hardware is disaggregated, the OS should be also!

LegoOS: the first disaggregated OS

  • Outline

    • Abstraction

    • Design Principles

    • Implementations and Emulations

  • How should LegoOS appear to users?

    • as a set of virtual nodes (vNodes)

      • Similar semantics to virtual machines

      • Can run on multiple processor, memory, and storage components

  • 1.3x to 1.7x slowdown when disaggregating devices with LegoOS

    • To gain better resource packing, elasticity, and fault tolerance!
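The vNode abstraction above can be pictured as a small sketch. This is purely illustrative (the VNode class and field names are my own, not LegoOS's): the point is that one vNode looks like a single machine to the user but is backed by multiple processor, memory, and storage components.

```python
# Hypothetical sketch of a vNode: VM-like semantics to the user, but its
# resources may come from several pComponents, mComponents, and sComponents.
# All names here are illustrative, not from the LegoOS implementation.
from dataclasses import dataclass, field

@dataclass
class VNode:
    vid: int
    vcpus: int
    memory_gb: int
    pComponents: list = field(default_factory=list)  # processor components backing it
    mComponents: list = field(default_factory=list)  # memory components backing it
    sComponents: list = field(default_factory=list)  # storage components backing it

# One vNode spanning one processor component and two memory components:
vnode = VNode(vid=1, vcpus=4, memory_gb=8,
              pComponents=["pComp-0"],
              mComponents=["mComp-0", "mComp-1"],
              sComponents=["sComp-0"])
```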

Paper: LegoOS

  • Key idea: disaggregated, network-attached hardware components can improve resource utilization, elasticity, heterogeneity, and failure handling in data centers; however, no existing OS or software system can properly manage them

  • New OS model: splitkernel

    • Key: when hardware is disaggregated, the OS should be also

      • Breaks traditional OS functionalities into monitors, each monitor manages a hardware component, virtualizes and protects its physical resources

        • Loosely coupled; monitors communicate with each other

      • Run monitors at hardware components

      • Message passing across non-coherent components

      • Global resource management and failure handling
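The splitkernel structure can be sketched in a few lines. This is a toy model, not LegoOS code: each hardware component runs its own loosely coupled monitor, and monitors coordinate only by message passing over a non-coherent interconnect (no shared memory). The class and method names are assumptions for illustration.

```python
# Toy splitkernel sketch: one monitor per hardware component; monitors only
# interact via messages over a non-coherent "network". Names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Message:
    src: str                                # sending monitor id
    op: str                                 # e.g., "read", "write"
    payload: dict = field(default_factory=dict)

class Network:
    """Non-coherent interconnect: monitors only exchange messages."""
    def __init__(self):
        self.monitors = {}
    def register(self, m):
        self.monitors[m.name] = m
    def deliver(self, dst, msg):
        return self.monitors[dst].handle(msg)

class Monitor:
    """Manages and virtualizes one hardware component's resources."""
    def __init__(self, name, network):
        self.name = name
        self.network = network
        network.register(self)
    def send(self, dst, op, **payload):
        return self.network.deliver(dst, Message(self.name, op, payload))
    def handle(self, msg):                  # each monitor has its own policy
        raise NotImplementedError

class MemoryMonitor(Monitor):
    def __init__(self, name, network):
        super().__init__(name, network)
        self.pages = {}                     # virtual page -> data
    def handle(self, msg):
        if msg.op == "write":
            self.pages[msg.payload["page"]] = msg.payload["data"]
        elif msg.op == "read":
            return self.pages.get(msg.payload["page"])

class ProcessMonitor(Monitor):
    def handle(self, msg):
        pass                                # compute-side logic omitted

net = Network()
pmon = ProcessMonitor("pComponent-0", net)
mmon = MemoryMonitor("mComponent-0", net)
pmon.send("mComponent-0", "write", page=0x1000, data=b"hello")
```

Because each monitor is self-contained and stateless with respect to the others, a monitor can fail, be added, or be removed without taking down the rest of the system, which is the fine-grained failure handling the notes mention.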

  • LegoOS:

    • appear to users as a set of distributed servers (vNodes)

    • separate OS functionalities: process monitor, memory monitor, storage monitor

    • Challenges

      • How to deliver good performance when application execution involves accessing network-partitioned, disaggregated HW, and current networks are slower than local buses?

      • How to locally manage individual HW components with limited HW resources?

      • How to manage distributed HW resources?

      • How to handle a component failure without affecting others?

      • What abstraction to expose to users? How to support existing datacenter applications?

    • Solution: hardware + software

      • separate process, memory, and storage functionalities

        • Moves all hardware memory functionalities (e.g., page tables, TLBs) to mComponents and leaves only caches at the pComponent side --> each mComponent can choose its own memory allocation technique and virtual-to-physical memory address mapping

        • pComponent: virtual caches

        • Separating memory for performance and for capacity: utilize locality and leave a small amount of memory (e.g., 4GB) at each pComponent

          • ExCache
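The ExCache idea above can be sketched as a small software cache. This is an illustration, not LegoOS's implementation: a few gigabytes of DRAM at the pComponent act as a cache over remote memory, and on a miss a line is fetched from the mComponent over the network. LRU eviction and a page-sized line are assumptions for this sketch.

```python
# Illustrative ExCache sketch: small local cache at the pComponent; misses
# fetch from the remote mComponent. LRU eviction is assumed for simplicity.
from collections import OrderedDict

LINE = 4096  # cache in page-size units (an assumption for this sketch)

class ExCache:
    def __init__(self, nlines, fetch_remote):
        self.nlines = nlines
        self.fetch_remote = fetch_remote   # callback standing in for a network fetch
        self.lines = OrderedDict()         # line tag -> data, kept in LRU order
        self.hits = self.misses = 0

    def read(self, vaddr):
        tag = vaddr // LINE
        if tag in self.lines:
            self.hits += 1
            self.lines.move_to_end(tag)    # mark most-recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.nlines:
                self.lines.popitem(last=False)        # evict the LRU line
            self.lines[tag] = self.fetch_remote(tag * LINE)
        return self.lines[tag]

remote_memory = {0x0: b"A" * LINE, 0x1000: b"B" * LINE}
cache = ExCache(nlines=1, fetch_remote=lambda va: remote_memory[va])
cache.read(0x10); cache.read(0x20)   # same line: one miss, then one hit
cache.read(0x1000)                   # different line: evicts, misses again
```

This is why "separating memory for performance and for capacity" works: most accesses hit locally thanks to locality, and only misses pay the network cost to the mComponent.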

      • monitors run at HW components and fit device constraints

      • comparable performance to monolithic Linux servers

      • Efficient resource management and memory failure handling (space + performance)

      • Easy-to-use, backward compatible user interface

      • Support common Linux system call interfaces

Process, memory, storage management

  • Process monitor: runs in the kernel space of a pComponent and manages pComponent's CPU cores and ExCache

  • Memory monitor:

    • Data managed by mComponents: anonymous memory (i.e., heaps, stacks), memory-mapped files, and storage buffer caches

    • Manages both virtual and physical addr spaces, their allocation, deallocation, and memory address mappings; performs actual memory reads and writes

      • GMM (global memory resource manager): assigns a home mComponent to each new process at its creation time

    • Two-level approach

      • home mComponent: coarse-grained, high-level virtual memory allocation decisions

      • other mComponent: fine-grained virtual memory allocation

    • Optimization: delay physical memory allocation until first write
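The two-level allocation and the delayed-allocation optimization can be sketched together. All class names here are assumptions: the home mComponent only makes the coarse decision of which mComponent owns a virtual region, the owning mComponent handles fine-grained allocation within it, and no physical page exists until the first write.

```python
# Sketch (assumed names) of two-level virtual memory allocation with
# physical backing delayed until first write. Not LegoOS code.
REGION = 1 << 30   # coarse region granularity, assumed for the sketch

class MComponent:
    """Fine-grained: backs pages inside its regions; physical memory is
    only assigned when a page is first written (delayed allocation)."""
    def __init__(self, name):
        self.name = name
        self.physical = {}                 # page-aligned vaddr -> bytes

    def write(self, vaddr, data):
        self.physical[vaddr & ~0xFFF] = data   # backing appears only on write

    def allocated_pages(self):
        return len(self.physical)

class HomeMComponent:
    """Coarse-grained: decides which mComponent owns each virtual region."""
    def __init__(self, mcomponents):
        self.mcomponents = mcomponents
        self.next_region = 0

    def alloc_region(self):
        owner = self.mcomponents[self.next_region % len(self.mcomponents)]
        base = self.next_region * REGION
        self.next_region += 1
        return base, owner                 # round-robin placement, assumed policy

m0, m1 = MComponent("mComp-0"), MComponent("mComp-1")
home = HomeMComponent([m0, m1])
base, owner = home.alloc_region()          # coarse decision at the home node
owner.write(base + 0x2000, b"x")           # physical page appears only now
```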

  • Storage management

    • Hierarchical file interface

    • Stateless storage server design: each I/O request to the storage server contains all the information needed to fulfill it
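The stateless design above can be made concrete with a sketch (field and class names are my own): every request carries the full file path, offset, and length, so the sComponent keeps no per-client session state such as open-file handles, and any replica could serve the request.

```python
# Sketch of a self-describing, stateless storage request. Illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class StorageRequest:
    op: str          # "read" or "write"
    path: str        # full hierarchical file path -- no open-file handle state
    offset: int
    length: int = 0  # bytes to read
    data: bytes = b""

class StatelessStorageServer:
    def __init__(self):
        self.files = {}                    # path -> bytes (stand-in for disk)

    def serve(self, req):
        buf = self.files.setdefault(req.path, bytearray())
        if req.op == "write":
            end = req.offset + len(req.data)
            if len(buf) < end:
                buf.extend(b"\0" * (end - len(buf)))   # grow sparse file
            buf[req.offset:end] = req.data
            return len(req.data)
        elif req.op == "read":
            return bytes(buf[req.offset:req.offset + req.length])

server = StatelessStorageServer()
server.serve(StorageRequest("write", "/home/app/log", 0, data=b"hello"))
```

Statelessness simplifies failure handling: if the storage server restarts, no client session state is lost, because the next request again carries everything needed to fulfill it.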

Global Resource Management

  • Two-level resource management mechanism

    • Three global resource managers: process, memory and storage --> coarse-grained global resource allocation and load balancing

    • At the low level: each monitor can employ its own policies and mechanisms to manage its local resources
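The two-level split can be sketched as follows (all names are assumptions): a global manager only makes coarse placement and load-balancing decisions across components, while each monitor applies its own local policy, here a trivial pick-the-first-idle-core scheduler.

```python
# Toy two-level resource management: a global manager places processes
# across pComponents; each local monitor schedules within its component.
class LocalProcessMonitor:
    def __init__(self, name, cores):
        self.name = name
        self.cores = [None] * cores        # None = idle core
    def load(self):
        return sum(c is not None for c in self.cores) / len(self.cores)
    def schedule(self, task):              # local fine-grained policy
        i = self.cores.index(None)
        self.cores[i] = task
        return i

class GlobalProcessManager:
    """Coarse-grained: only chooses which pComponent runs a new process."""
    def __init__(self, monitors):
        self.monitors = monitors
    def place(self, task):
        target = min(self.monitors, key=lambda m: m.load())   # load balance
        return target.name, target.schedule(task)

p0 = LocalProcessMonitor("pComp-0", cores=2)
p1 = LocalProcessMonitor("pComp-1", cores=2)
gpm = GlobalProcessManager([p0, p1])
gpm.place("proc-A")    # goes to the least-loaded component
```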

Reliability and Failure Handling

  • Memory reliability (focus)

    • One primary mComponent, one secondary mComponent, and a backup file in sComponent for each vma

    • Maintains a small append-only log at the secondary mComponent and replicates the vma tree

    • Flushing backup to sComponent
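The replication scheme above can be sketched for a single vma (class names are assumptions): a write is applied at the primary mComponent and appended to a small log at the secondary; a background flush then persists the logged data to the backup file on the sComponent, after which the log space can be reclaimed.

```python
# Sketch of primary/secondary memory replication with a storage backup.
# Illustrative names only; not the LegoOS implementation.
class SecondaryMComponent:
    def __init__(self):
        self.log = []                       # small append-only log
    def append(self, vaddr, data):
        self.log.append((vaddr, data))

class SComponent:
    def __init__(self):
        self.backup = {}                    # backup file contents, by vaddr
    def flush(self, entries):
        for vaddr, data in entries:
            self.backup[vaddr] = data

class PrimaryMComponent:
    def __init__(self, secondary, storage):
        self.memory = {}
        self.secondary = secondary
        self.storage = storage
    def write(self, vaddr, data):
        self.memory[vaddr] = data           # apply at the primary
        self.secondary.append(vaddr, data)  # replicate into secondary's log
    def flush_backup(self):
        self.storage.flush(self.secondary.log)   # persist to the sComponent
        self.secondary.log.clear()               # reclaim log space

sec, sto = SecondaryMComponent(), SComponent()
pri = PrimaryMComponent(sec, sto)
pri.write(0x1000, b"data")
pri.flush_backup()
```

With this layout, losing the primary still leaves the data recoverable from the secondary's log plus the sComponent backup, so a single mComponent failure does not lose the vma's contents.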
