LegoOS: A Disseminated Distributed OS for Hardware Resource Disaggregation
https://www.usenix.org/conference/osdi18/presentation/shan
Resource packing is difficult
More heterogeneous hardware has to go into a server (not planned beforehand)
Poor elasticity: hard to add / remove / reconfigure resources
Fault tolerance
If the CPU or memory controller fails, the whole server is down
Why possible now?
Network is faster
More processing power at the device: SmartNIC, SmartSSD, PIM (processing in memory)
Network interface closer to device
Can existing kernels fit?
Existing kernels don't fit
Accessing remote resources
Distributed resource management (components are network-partitioned)
Fine-grained failure handling (each component can fail independently)
Key idea: when hardware is disaggregated, the OS should be also!
Outline
Abstraction
Design Principles
Implementations and Emulations
How should LegoOS appear to users?
as a set of virtual nodes (vNodes)
Similar semantics to virtual machines
Can run on multiple processor, memory, and storage components
1.3x to 1.7x slowdown when disaggregating devices with LegoOS
To gain better resource packing, elasticity, and fault tolerance!
Key idea: disaggregated, network-attached hardware components can improve resource utilization, elasticity, heterogeneity, and failure handling in data centers; however, no existing OS or software system can properly manage them
New OS model: splitkernel
Key: when hardware is disaggregated, the OS should be also
Breaks traditional OS functionalities into monitors; each monitor manages one hardware component, virtualizing and protecting its physical resources
Monitors are loosely coupled and communicate with each other via messages
Run monitors at hardware components
Message passing across non-coherent components
Global resource management and failure handling
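To make the message-passing requirement concrete, here is a minimal C sketch of what a self-describing inter-monitor message could look like; the opcodes and field layout are hypothetical illustrations, not LegoOS's actual wire format (LegoOS uses an RDMA-based network stack).

```c
#include <stdint.h>

/* Hypothetical opcodes for inter-monitor requests; LegoOS's real protocol
 * differs in detail. */
enum msg_op {
    MSG_ALLOC_VMEM,   /* process monitor -> memory monitor  */
    MSG_READ_PAGE,    /* fetch a page on an ExCache miss    */
    MSG_WRITE_PAGE,   /* write back a dirty ExCache line    */
    MSG_STORAGE_IO,   /* memory monitor -> storage monitor  */
};

/* Components share no memory and are not cache-coherent, so every request
 * must be self-describing: all the state it needs crosses the network. */
struct monitor_msg {
    uint16_t op;        /* enum msg_op */
    uint16_t src_id;    /* sending component id   */
    uint16_t dst_id;    /* receiving component id */
    uint32_t pid;       /* global process id, if relevant */
    uint64_t vaddr;     /* virtual address the request concerns */
    uint32_t len;       /* payload length in bytes */
    uint8_t  payload[]; /* data for writes; empty for reads */
};
```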
LegoOS:
appear to users as a set of distributed servers (vNodes)
separate OS functionalities: process monitor, memory monitor, storage monitor
Challenges
How to deliver good performance when application execution involves accessing network-partitioned, disaggregated hardware, and today's network is still slower than local buses?
How to locally manage individual HW components with limited HW resources?
How to manage distributed HW resources?
How to handle a component failure without affecting others?
What abstraction to expose to users? How to support existing datacenter applications?
Solution: hardware + software
separate process, memory, and storage functionalities
Moves all hardware memory functionalities (e.g., page tables, TLBs) to mComponents and leaves only caches at the pComponent side --> each mComponent can choose its own memory allocation technique and virtual-to-physical address mapping
pComponent: virtual caches (virtually indexed, since address translation now lives at the mComponent)
Separating memory for performance and for capacity: utilize locality and leave a small amount of memory (e.g., 4GB) at each pComponent
ExCache: the pComponent's local DRAM managed as an extended cache of remote memory (hits handled in hardware, misses in software)
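A rough C sketch of the miss-handling logic such a cache needs; the set-associative layout, page-sized lines, and the fetch_from_mcomponent stub are assumptions for illustration (in LegoOS the hit path runs in hardware and eviction uses a real replacement policy).

```c
#include <stdbool.h>
#include <stdint.h>

#define EXCACHE_SETS  1024
#define EXCACHE_WAYS  4       /* hypothetical associativity */
#define LINE_SIZE     4096    /* cache remote memory at page granularity */

struct excache_line {
    uint64_t vtag;            /* virtually tagged: derived from the vaddr */
    bool     valid, dirty;
    uint8_t  data[LINE_SIZE];
};

static struct excache_line excache[EXCACHE_SETS][EXCACHE_WAYS];

/* Stand-in for the network round trip to the owning mComponent. */
extern void fetch_from_mcomponent(uint64_t vaddr, uint8_t *buf);

/* Return the cached line holding vaddr, fetching it from remote memory on
 * a miss (dirty-line writeback and a real eviction policy are elided). */
uint8_t *excache_access(uint64_t vaddr)
{
    uint64_t line = vaddr / LINE_SIZE;
    uint64_t set  = line % EXCACHE_SETS;
    uint64_t tag  = line / EXCACHE_SETS;

    for (int w = 0; w < EXCACHE_WAYS; w++) {       /* hit path */
        struct excache_line *l = &excache[set][w];
        if (l->valid && l->vtag == tag)
            return l->data;
    }

    /* Miss path: evict way 0 for simplicity and fetch the page. */
    struct excache_line *victim = &excache[set][0];
    fetch_from_mcomponent(line * LINE_SIZE, victim->data);
    victim->vtag  = tag;
    victim->valid = true;
    victim->dirty = false;
    return victim->data;
}
```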
monitors run at HW components and fit device constraints
comparable performance to monolithic Linux servers
Efficient resource management and memory failure handling (space + performance)
Easy-to-use, backward compatible user interface
Support common Linux system call interfaces
Process monitor: runs in the kernel space of a pComponent and manages pComponent's CPU cores and ExCache
Memory monitor:
Data stored at mComponents: anonymous memory (i.e., heaps, stacks), memory-mapped files, and storage buffer caches
Manages both virtual and physical address spaces (allocation, deallocation, and memory address mappings) and performs the actual memory reads and writes
GMM (global memory manager): assigns a home mComponent to each new process at its creation time
Two-level approach:
home mComponent: coarse-grained, high-level virtual memory allocation decisions
other mComponents: fine-grained virtual memory allocation
Optimization: delay physical memory allocation until first write
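A sketch of how the two-level split could be routed in code, assuming coarse-grained vRegions; the table layout and the gmm_pick_least_loaded_mcomponent helper are hypothetical.

```c
#include <stdint.h>

#define VREGION_SIZE  (1UL << 30)  /* coarse-grained vRegion, e.g. 1 GB */
#define NUM_VREGIONS  512          /* covers a 512 GB virtual address space */

/* Home mComponent's per-process view: which mComponent owns each vRegion. */
struct vregion_entry {
    int owner;                     /* owning mComponent id, -1 if unassigned */
};

struct proc_vm {
    struct vregion_entry vregions[NUM_VREGIONS];
};

/* Hypothetical query to the global memory manager (GMM). */
extern int gmm_pick_least_loaded_mcomponent(void);

/* Level 1 (home mComponent): coarse decision of which mComponent owns the
 * vRegion containing vaddr. Level 2 (the owner) then does fine-grained
 * allocation and builds the address mappings inside that region. */
int route_vm_alloc(struct proc_vm *vm, uint64_t vaddr)
{
    uint64_t idx = vaddr / VREGION_SIZE;
    if (idx >= NUM_VREGIONS)
        return -1;                 /* real code would validate earlier */

    struct vregion_entry *vr = &vm->vregions[idx];
    if (vr->owner < 0)
        vr->owner = gmm_pick_least_loaded_mcomponent();
    /* Forward the allocation to the owner; physical memory is not
     * allocated until first write (the optimization noted above). */
    return vr->owner;
}
```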
Storage management
Hierarchical file interface
Stateless storage server design: each I/O request to the storage server carries all the information needed to fulfill it
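To see what statelessness implies, here is a hypothetical request layout: since the sComponent keeps no open-file table or session state, every request names the file by its full hierarchical pathname and carries its own offset, size, and credentials.

```c
#include <stdint.h>

#define MAX_PATH 256

/* Hypothetical self-contained I/O request to the storage monitor. */
struct storage_req {
    uint32_t uid;             /* permissions are checked on every request */
    uint8_t  is_write;
    char     path[MAX_PATH];  /* full pathname; no server-side fd table */
    uint64_t offset;
    uint64_t len;
    /* for writes, the payload follows the header */
};
```

One payoff of this design: because no session state lives at the server, a restarted sComponent can serve requests immediately, and clients can simply resend.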
Two-level resource management mechanism
Three global resource managers: process, memory, and storage --> coarse-grained global resource allocation and load balancing
At the low level: each monitor can employ its own policies and mechanisms to manage its local resources
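A sketch of the coarse-grained level, using the global process manager (GPM) as the example; the load metrics and placement policy below are illustrative assumptions, not the paper's exact heuristics.

```c
#include <stdint.h>

#define MAX_PCOMPONENTS 64

/* Coarse per-component stats, assumed to be refreshed periodically over
 * the network; the exact metrics are illustrative. */
struct pcomponent_stat {
    int      id;
    uint32_t free_cores;
    uint64_t free_excache_bytes;
};

static struct pcomponent_stat pstats[MAX_PCOMPONENTS];
static int num_pcomponents;

/* Global process manager: place a new process using only coarse load info.
 * The chosen pComponent's local process monitor then schedules threads on
 * its own cores with its own policy (the fine-grained level). */
int gpm_place_process(void)
{
    int best = -1;
    for (int i = 0; i < num_pcomponents; i++) {
        if (best < 0 || pstats[i].free_cores > pstats[best].free_cores)
            best = i;
    }
    return best < 0 ? -1 : pstats[best].id;
}
```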
Memory reliability (focus)
One primary mComponent, one secondary mComponent, and a backup file on an sComponent for each vma (virtual memory area)
Maintains a small append-only log at the secondary mComponent and replicates the vma tree there
The backup log is flushed to the sComponent in the background
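Putting the pieces together, a sketch of the replicated write path; the RPC stub names and error handling are assumptions, and the real log format is not specified here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical RPC stubs for the three components in the replication scheme. */
extern int primary_write(uint64_t vaddr, const void *buf, size_t len);
extern int secondary_log_append(uint64_t vaddr, const void *buf, size_t len);
extern void scomponent_flush_log_in_background(void);

/* A write survives a single mComponent failure once it reaches both the
 * primary's memory and the secondary's append-only log; the log is later
 * flushed to the backup file on the sComponent in the background. */
int replicated_mem_write(uint64_t vaddr, const void *buf, size_t len)
{
    if (primary_write(vaddr, buf, len) != 0)
        return -1;
    if (secondary_log_append(vaddr, buf, len) != 0)
        return -1;                          /* caller may retry or fail over */
    scomponent_flush_log_in_background();   /* asynchronous, off the critical path */
    return 0;
}
```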