LegoOS: A Disseminated Distributed OS for Hardware Resource Disaggregation
https://www.usenix.org/conference/osdi18/presentation/shan
Resource packing is difficult
More heterogeneous hardware has to go into a server (not planned beforehand)
Poor elasticity: hard to add / remove / reconfigure resources
Fault tolerance
If the CPU or memory controller fails, the whole server is down
Why possible now?
Network is faster
More processing power at the device: SmartNIC, SmartSSD, PIM (processing in memory)
Network interface closer to device
Can existing kernels fit?
Existing kernels don't fit
Accessing remote resources
Distributed resource management (components are network-partitioned)
Fine-grained failure handling (each component can fail independently)
Key idea: when hardware is disaggregated, the OS should be also!
Outline
Abstraction
Design Principles
Implementations and Emulations
How should LegoOS appear to users?
as a set of virtual nodes (vNodes)
Similar semantics to virtual machines
Can run on multiple processor, memory, and storage components
1.3x to 1.7x slowdown when disaggregating devices with LegoOS
To gain better resource packing, elasticity, and fault tolerance!
Key idea: disaggregated, network-attached hardware components can improve resource utilization, elasticity, heterogeneity, and failure handling in data centers; however, no existing OS or software system can properly manage them
New OS model: splitkernel
Key: when hardware is disaggregated, the OS should be also
Breaks traditional OS functionalities into monitors; each monitor manages one hardware component, virtualizing and protecting its physical resources
Monitors are loosely coupled and communicate with each other via messages
Run monitors at hardware components
Message passing across non-coherent components
Global resource management and failure handling
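To make the message-passing requirement concrete, here is a minimal C sketch of what a self-describing inter-monitor message could look like; the opcodes and field layout are hypothetical illustrations, not LegoOS's actual wire format (LegoOS uses an RDMA-based network stack).

```c
#include <stdint.h>

/* Hypothetical opcodes for inter-monitor requests; LegoOS's real protocol
 * differs in detail. */
enum msg_op {
    MSG_ALLOC_VMEM,   /* process monitor -> memory monitor  */
    MSG_READ_PAGE,    /* fetch a page on an ExCache miss    */
    MSG_WRITE_PAGE,   /* write back a dirty ExCache line    */
    MSG_STORAGE_IO,   /* memory monitor -> storage monitor  */
};

/* Components share no memory and are not cache-coherent, so every request
 * must be self-describing: all the state it needs crosses the network. */
struct monitor_msg {
    uint16_t op;        /* enum msg_op */
    uint16_t src_id;    /* sending component id   */
    uint16_t dst_id;    /* receiving component id */
    uint32_t pid;       /* global process id, if relevant */
    uint64_t vaddr;     /* virtual address the request concerns */
    uint32_t len;       /* payload length in bytes */
    uint8_t  payload[]; /* data for writes; empty for reads */
};
```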
LegoOS:
appear to users as a set of distributed servers (vNodes)
separate OS functionalities: process monitor, memory monitor, storage monitor
Challenges
How to deliver good performance when application execution involves accessing network-partitioned, disaggregated hardware, and today's network is still slower than local buses?
How to locally manage individual HW components with limited HW resources?
How to manage distributed HW resources?
How to handle a component failure without affecting others?
What abstraction to expose to users? How to support existing datacenter applications?
Solution: hardware + software
separate process, memory, and storage functionalities
Moves all hardware memory functionalities (e.g., page tables, TLBs) to mComponents and leaves only caches at the pComponent side --> each mComponent can choose its own memory allocation technique and virtual-to-physical address mapping
pComponent: virtual caches (virtually indexed, since address translation now lives at the mComponent)
Separating memory for performance and for capacity: utilize locality and leave a small amount of memory (e.g., 4GB) at each pComponent
ExCache: the pComponent's local DRAM managed as an extended cache of remote memory (hits handled in hardware, misses in software)
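A rough C sketch of the miss-handling logic such a cache needs; the set-associative layout, page-sized lines, and the fetch_from_mcomponent stub are assumptions for illustration (in LegoOS the hit path runs in hardware and eviction uses a real replacement policy).

```c
#include <stdbool.h>
#include <stdint.h>

#define EXCACHE_SETS  1024
#define EXCACHE_WAYS  4       /* hypothetical associativity */
#define LINE_SIZE     4096    /* cache remote memory at page granularity */

struct excache_line {
    uint64_t vtag;            /* virtually tagged: derived from the vaddr */
    bool     valid, dirty;
    uint8_t  data[LINE_SIZE];
};

static struct excache_line excache[EXCACHE_SETS][EXCACHE_WAYS];

/* Stand-in for the network round trip to the owning mComponent. */
extern void fetch_from_mcomponent(uint64_t vaddr, uint8_t *buf);

/* Return the cached line holding vaddr, fetching it from remote memory on
 * a miss (dirty-line writeback and a real eviction policy are elided). */
uint8_t *excache_access(uint64_t vaddr)
{
    uint64_t line = vaddr / LINE_SIZE;
    uint64_t set  = line % EXCACHE_SETS;
    uint64_t tag  = line / EXCACHE_SETS;

    for (int w = 0; w < EXCACHE_WAYS; w++) {       /* hit path */
        struct excache_line *l = &excache[set][w];
        if (l->valid && l->vtag == tag)
            return l->data;
    }

    /* Miss path: evict way 0 for simplicity and fetch the page. */
    struct excache_line *victim = &excache[set][0];
    fetch_from_mcomponent(line * LINE_SIZE, victim->data);
    victim->vtag  = tag;
    victim->valid = true;
    victim->dirty = false;
    return victim->data;
}
```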
monitors run at HW components and fit device constraints
comparable performance to monolithic Linux servers
Efficient resource management and memory failure handling (space + performance)
Easy-to-use, backward compatible user interface
Support common Linux system call interfaces
Process monitor: runs in the kernel space of a pComponent and manages pComponent's CPU cores and ExCache
Memory monitor:
Data stored at mComponents: anonymous memory (i.e., heaps, stacks), memory-mapped files, and storage buffer caches
Manages both virtual and physical address spaces (allocation, deallocation, and memory address mappings) and performs the actual memory reads and writes
GMM (global memory manager): assigns a home mComponent to each new process at its creation time
Two-level approach:
home mComponent: coarse-grained, high-level virtual memory allocation decisions
other mComponents: fine-grained virtual memory allocation
Optimization: delay physical memory allocation until first write
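A sketch of how the two-level split could be routed in code, assuming coarse-grained vRegions; the table layout and the gmm_pick_least_loaded_mcomponent helper are hypothetical.

```c
#include <stdint.h>

#define VREGION_SIZE  (1UL << 30)  /* coarse-grained vRegion, e.g. 1 GB */
#define NUM_VREGIONS  512          /* covers a 512 GB virtual address space */

/* Home mComponent's per-process view: which mComponent owns each vRegion. */
struct vregion_entry {
    int owner;                     /* owning mComponent id, -1 if unassigned */
};

struct proc_vm {
    struct vregion_entry vregions[NUM_VREGIONS];
};

/* Hypothetical query to the global memory manager (GMM). */
extern int gmm_pick_least_loaded_mcomponent(void);

/* Level 1 (home mComponent): coarse decision of which mComponent owns the
 * vRegion containing vaddr. Level 2 (the owner) then does fine-grained
 * allocation and builds the address mappings inside that region. */
int route_vm_alloc(struct proc_vm *vm, uint64_t vaddr)
{
    uint64_t idx = vaddr / VREGION_SIZE;
    if (idx >= NUM_VREGIONS)
        return -1;                 /* real code would validate earlier */

    struct vregion_entry *vr = &vm->vregions[idx];
    if (vr->owner < 0)
        vr->owner = gmm_pick_least_loaded_mcomponent();
    /* Forward the allocation to the owner; physical memory is not
     * allocated until first write (the optimization noted above). */
    return vr->owner;
}
```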
Storage management
Hierarchical file interface
Stateless storage server design: each I/O request to the storage server carries all the information needed to fulfill it
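To see what statelessness implies, here is a hypothetical request layout: since the sComponent keeps no open-file table or session state, every request names the file by its full hierarchical pathname and carries its own offset, size, and credentials.

```c
#include <stdint.h>

#define MAX_PATH 256

/* Hypothetical self-contained I/O request to the storage monitor. */
struct storage_req {
    uint32_t uid;             /* permissions are checked on every request */
    uint8_t  is_write;
    char     path[MAX_PATH];  /* full pathname; no server-side fd table */
    uint64_t offset;
    uint64_t len;
    /* for writes, the payload follows the header */
};
```

One payoff of this design: because no session state lives at the server, a restarted sComponent can serve requests immediately, and clients can simply resend.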
Two-level resource management mechanism
Three global resource managers: process, memory, and storage --> coarse-grained global resource allocation and load balancing
At the low level: each monitor can employ its own policies and mechanisms to manage its local resources
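A sketch of the coarse-grained level, using the global process manager (GPM) as the example; the load metrics and placement policy below are illustrative assumptions, not the paper's exact heuristics.

```c
#include <stdint.h>

#define MAX_PCOMPONENTS 64

/* Coarse per-component stats, assumed to be refreshed periodically over
 * the network; the exact metrics are illustrative. */
struct pcomponent_stat {
    int      id;
    uint32_t free_cores;
    uint64_t free_excache_bytes;
};

static struct pcomponent_stat pstats[MAX_PCOMPONENTS];
static int num_pcomponents;

/* Global process manager: place a new process using only coarse load info.
 * The chosen pComponent's local process monitor then schedules threads on
 * its own cores with its own policy (the fine-grained level). */
int gpm_place_process(void)
{
    int best = -1;
    for (int i = 0; i < num_pcomponents; i++) {
        if (best < 0 || pstats[i].free_cores > pstats[best].free_cores)
            best = i;
    }
    return best < 0 ? -1 : pstats[best].id;
}
```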
Memory reliability (focus)
One primary mComponent, one secondary mComponent, and a backup file on an sComponent for each vma (virtual memory area)
Maintains a small append-only log at the secondary mComponent and replicates the vma tree there
The backup log is flushed to the sComponent in the background
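Putting the pieces together, a sketch of the replicated write path; the RPC stub names and error handling are assumptions, and the real log format is not specified here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical RPC stubs for the three components in the replication scheme. */
extern int primary_write(uint64_t vaddr, const void *buf, size_t len);
extern int secondary_log_append(uint64_t vaddr, const void *buf, size_t len);
extern void scomponent_flush_log_in_background(void);

/* A write survives a single mComponent failure once it reaches both the
 * primary's memory and the secondary's append-only log; the log is later
 * flushed to the backup file on the sComponent in the background. */
int replicated_mem_write(uint64_t vaddr, const void *buf, size_t len)
{
    if (primary_write(vaddr, buf, len) != 0)
        return -1;
    if (secondary_log_append(vaddr, buf, len) != 0)
        return -1;                          /* caller may retry or fail over */
    scomponent_flush_log_in_background();   /* asynchronous, off the critical path */
    return 0;
}
```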