# LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation

#### Hardware Resource Disaggregation

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2F118Va9eFFUv9y6bCVqq6%2Fimage.png?alt=media\&token=b3f67226-f8b7-4d33-956d-d742b4c5c9d3)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FV5xvaqdOCu6K11zOp6md%2Fimage.png?alt=media\&token=2a87ad75-844f-4485-a5c7-029aa7eea683)

* Resource packing is difficult
* Heterogeneous hardware is hard to adopt: each new device has to go into a server that was not planned for it beforehand
* Poor elasticity: hard to add, remove, or reconfigure hardware
* Poor fault tolerance
  * If the CPU's memory controller fails, the whole server goes down


![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2F2rc8YSsNCT1yuckIN7mr%2Fimage.png?alt=media\&token=e1b9164a-324f-45d3-bd08-d32b4a7f6b40)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FtL6MgnjPdQrgi1cdWyQE%2Fimage.png?alt=media\&token=39586924-269b-4c72-b527-e44d0c4da1f3)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FGL5CyG6FVLKdCsMjrOO0%2Fimage.png?alt=media\&token=4d84b0f4-7707-4276-8c5e-1930c9891ebd)

* Why is disaggregation possible now?
  * Networks are faster
  * More processing power at the device: SmartNICs, smart SSDs, processing-in-memory (PIM)
  * Network interfaces are moving closer to the device

#### Kernel Architectures for Resource Disaggregation

* Can existing kernels fit?

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2F6BPgY05OinT24UnK0PS2%2Fimage.png?alt=media\&token=d5701d88-0b49-4906-870d-04d13a9eecac)

* Existing kernels don't fit; they lack support for:
  * Remote resources
  * Distributed resource management (resources are partitioned across the network)
  * Fine-grained failure handling (each component can fail independently)
* Key idea: when hardware is disaggregated, the OS should be disaggregated too!

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FeAwUGITrgABlw6lvb8u0%2Fimage.png?alt=media\&token=cd336357-7749-477d-bd57-d8f6b4bc6a82)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FwUj3PW0ZOuQb0YAYlW7O%2Fimage.png?alt=media\&token=af6832ae-59d2-4e2c-bc30-8524e1159443)

#### LegoOS: the first disaggregated OS

* Outline
  * Abstraction
  * Design Principles
  * Implementations and Emulations
* How should LegoOS appear to users?
  * As a set of virtual nodes (vNodes)
    * Similar semantics to virtual machines
    * Can run on multiple processor, memory, and storage components

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FP5PTEJIxtxW5iSB8NPjt%2Fimage.png?alt=media\&token=a2bf2ef8-dd64-4de7-aff3-ecf8d293b602)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FAu0jFmp1MEYhsv5Kzwlb%2Fimage.png?alt=media\&token=d1c0d592-9540-4680-badf-ecc1d0b7a35a)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2Fzro7VzmYZ1EVkIPN03GI%2Fimage.png?alt=media\&token=18b54dc4-4501-4871-baa5-7d6ec247355a)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FKi7EByQgcOazVfs1JtNQ%2Fimage.png?alt=media\&token=83adc9ca-add5-4012-9024-c438e7daab7f)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FYtOHVbXT8m1IVFADQlft%2Fimage.png?alt=media\&token=15165357-14f1-4c06-8d4a-83db47fe8b0c)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FHuGY97J7PdRGUJGBAIJr%2Fimage.png?alt=media\&token=80d01ba6-dcbc-4c04-a402-8fb181fbec44)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2Ftd7c8PJWWkM3M2rvDSN5%2Fimage.png?alt=media\&token=53bbe451-c62d-4589-b78c-af418d74ebab)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FPVRB2MzIfYSVQl9moQtv%2Fimage.png?alt=media\&token=f09d3d3e-c71c-40f2-bed5-ba9513cd8c9a)

* 1.3x to 1.7x slowdown when disaggregating devices with LegoOS
  * This is the price paid for better resource packing, elasticity, and fault tolerance

### Paper: LegoOS

* Key idea: disaggregated, network-attached hardware components can improve resource utilization, elasticity, heterogeneity, and failure handling in data centers. However, no existing OS or software system can properly manage them
* New OS model: splitkernel
  * Key: when hardware is disaggregated, the OS should be disaggregated too
    * Breaks traditional OS functionality into monitors; each monitor manages one hardware component and virtualizes and protects its physical resources
      * Monitors are loosely coupled and communicate with each other
    * Run monitors at hardware components
    * Message passing across non-coherent components
    * Global resource management and failure handling
* LegoOS:
  * Appears to users as a set of distributed servers (vNodes)
  * Separates OS functionality into process, memory, and storage monitors
  * Challenges
    * How to deliver good performance when application execution accesses network-partitioned, disaggregated hardware and the network is still slower than local buses?
    * How to locally manage individual hardware components with limited hardware resources?
    * How to manage distributed hardware resources?
    * How to handle a component failure without affecting other components?
    * What abstraction to expose to users, and how to support existing datacenter applications?
  * Solution: hardware + software
    * Separate process, memory, and storage functionalities
      * Moves all hardware memory functionality (e.g., page tables, TLBs) to mComponents and leaves only caches on the pComponent side, so each mComponent can choose its own memory allocation technique and virtual-to-physical address mapping
      * pComponent: virtual caches
      * Separates memory for performance from memory for capacity: exploits locality by leaving a small amount of memory (e.g., 4 GB) at each pComponent
        * This local memory is the extended cache (ExCache)
    * Monitors run at hardware components and fit device constraints
    * Comparable performance to monolithic Linux servers
    * Efficient resource management and memory failure handling (in both space and performance)
    * Easy-to-use, backward-compatible user interface
    * Supports common Linux system call interfaces
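The split between a small pComponent-side cache and remote mComponent memory can be sketched as follows. This is only a minimal illustration, not LegoOS code: the class names `MComponent` and `ExCache`, the page-sized cache line, and the LRU eviction policy are all assumptions made for the sketch. The point it shows is that the pComponent works purely with virtual addresses and only contacts the mComponent on a miss or when evicting a dirty line.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed cache-line granularity for this sketch


class MComponent:
    """Hypothetical memory component: the only place where data is mapped and stored."""

    def __init__(self):
        self.pages = {}    # virtual page number -> bytearray
        self.fetches = 0   # counts simulated network round trips

    def fetch(self, vpn):
        self.fetches += 1
        if vpn not in self.pages:
            self.pages[vpn] = bytearray(PAGE_SIZE)
        return self.pages[vpn]

    def write_back(self, vpn, data):
        self.pages[vpn] = bytearray(data)


class ExCache:
    """Virtually addressed cache on the pComponent (LRU eviction is an assumption)."""

    def __init__(self, mcomponent, capacity_pages):
        self.mc = mcomponent
        self.capacity = capacity_pages
        self.lines = OrderedDict()  # vpn -> (data, dirty)

    def _load(self, vpn):
        if vpn in self.lines:
            self.lines.move_to_end(vpn)  # hit: mark as most recently used
        else:
            if len(self.lines) >= self.capacity:
                old_vpn, (old_data, dirty) = self.lines.popitem(last=False)
                if dirty:
                    self.mc.write_back(old_vpn, old_data)
            # miss: copy the page from the remote mComponent
            self.lines[vpn] = (bytearray(self.mc.fetch(vpn)), False)
        return self.lines[vpn][0]

    def read(self, vaddr):
        return self._load(vaddr // PAGE_SIZE)[vaddr % PAGE_SIZE]

    def write(self, vaddr, value):
        vpn = vaddr // PAGE_SIZE
        data = self._load(vpn)
        data[vaddr % PAGE_SIZE] = value
        self.lines[vpn] = (data, True)  # mark the line dirty
```

With a two-page cache, repeated accesses to the same page cost one fetch, and a dirty page reaches the mComponent only when it is evicted.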

![Splitkernel architecture ](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2Fy1mt4COcfS55HXhdjmlH%2Fimage.png?alt=media\&token=da754cc0-07c4-4b58-b5e8-5048896383a5)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2F6kQpT04ImcoZ4G3lmIYL%2Fimage.png?alt=media\&token=a13d94a9-6acc-4cb2-9568-6ad0e03ce470)

#### Process, memory, storage management

* Process monitor: runs in the kernel space of a pComponent and manages the pComponent's CPU cores and ExCache
* Memory monitor:
  * mComponents store anonymous memory (i.e., heaps, stacks), memory-mapped files, and storage buffer caches
  * Manages both virtual and physical address spaces: their allocation, deallocation, and address mappings; also performs the actual memory reads and writes
    * The global memory resource manager (GMM) assigns a home mComponent to each new process at creation time
  * Two-level approach to virtual memory allocation
    * Home mComponent: coarse-grained, high-level virtual memory allocation decisions
    * Other mComponents: fine-grained virtual memory allocation
  * Optimization: delays physical memory allocation until write time
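A rough sketch of the two-level allocation and the delayed physical allocation, under stated assumptions: the names `HomeMComponent` and `MemoryMonitor`, the 1 GiB region size, the round-robin placement, and the one-region-per-`mmap` simplification are all illustrative choices, not details taken from these notes. The home mComponent only decides which mComponent owns a coarse region; the owner hands out addresses inside it and allocates a physical frame on the first write.

```python
VREGION_SIZE = 1 << 30  # assumed size of a coarse-grained virtual region
PAGE_SIZE = 4096


class MemoryMonitor:
    """Fine-grained level: allocates inside regions it owns, backs pages lazily."""

    def __init__(self, name):
        self.name = name
        self.next_free = {}  # region index -> next free offset in that region
        self.frames = {}     # virtual page number -> bytearray (physical frame)

    def alloc_virtual(self, region, size):
        offset = self.next_free.get(region, 0)
        if offset + size > VREGION_SIZE:
            raise MemoryError("region full")
        self.next_free[region] = offset + size
        return region * VREGION_SIZE + offset  # no physical memory used yet

    def write(self, vaddr, value):
        vpn = vaddr // PAGE_SIZE
        if vpn not in self.frames:
            # physical allocation is delayed until the first write
            self.frames[vpn] = bytearray(PAGE_SIZE)
        self.frames[vpn][vaddr % PAGE_SIZE] = value

    def read(self, vaddr):
        frame = self.frames.get(vaddr // PAGE_SIZE)
        return 0 if frame is None else frame[vaddr % PAGE_SIZE]


class HomeMComponent:
    """Coarse-grained level: decides which mComponent owns each region."""

    def __init__(self, monitors):
        self.monitors = monitors
        self.owner = {}  # region index -> MemoryMonitor

    def mmap(self, size):
        # Simplification: every call opens a new region, assigned round-robin.
        region = len(self.owner)
        monitor = self.monitors[region % len(self.monitors)]
        self.owner[region] = monitor
        return monitor, monitor.alloc_virtual(region, size)
```

After `mmap`, the owning monitor has reserved virtual addresses but holds no frames; `len(monitor.frames)` grows only as pages are written.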

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2Fxtszq45MXFHW1ArUnzh8%2Fimage.png?alt=media\&token=7b2c3dbe-8cbf-40cd-b54e-171ef4fc2d1b)

* Storage management
  * Hierarchical file interface
  * Stateless storage server design: each I/O request to the storage server contains all the information needed to fulfill it
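To make the stateless design concrete, here is a small sketch in which every request names the file, the offset, and the length, so the server keeps no open-file table or per-client cursor. The `ReadRequest` type and the `handle_read` function are hypothetical names for illustration; the actual LegoOS request format is not described in these notes.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadRequest:
    """Self-contained request: everything the server needs is in the message."""
    path: str    # path relative to the storage root
    offset: int  # byte offset to start reading from
    length: int  # number of bytes to read


def handle_read(root, req):
    """Serve one request without consulting any state from earlier requests."""
    with open(os.path.join(root, req.path), "rb") as f:
        f.seek(req.offset)
        return f.read(req.length)


# Usage: requests can arrive in any order, and a restarted server would
# answer them identically because nothing is remembered between calls.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a.txt"), "wb") as f:
        f.write(b"hello world")
    tail = handle_read(root, ReadRequest("a.txt", 6, 5))
    head = handle_read(root, ReadRequest("a.txt", 0, 5))
```

The trade-off is that each request carries more data, in exchange for a server that needs no session recovery after a failure.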

#### Global Resource Management

* Two-level resource management mechanism
  * High level: three global resource managers (process, memory, and storage) perform coarse-grained global resource allocation and load balancing
  * Low level: each monitor can employ its own policies and mechanisms to manage its local resources
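The two levels can be sketched as a global manager that only picks a component, plus per-component monitors that each apply their own local policy. The names `GlobalManager` and `Monitor`, the "most free slots" balancing rule, and the slot abstraction are assumptions for illustration only.

```python
class Monitor:
    """Low level: manages one component's slots with its own local policy."""

    def __init__(self, name, slots, pick):
        self.name = name
        self.free = list(range(slots))
        self.pick = pick  # local policy: chooses one slot from the free list

    def allocate(self):
        slot = self.pick(self.free)
        self.free.remove(slot)
        return slot


class GlobalManager:
    """High level: coarse-grained placement and load balancing only."""

    def __init__(self, monitors):
        self.monitors = monitors

    def place(self):
        # Assumed policy: send the request to the component with most free slots.
        target = max(self.monitors, key=lambda m: len(m.free))
        if not target.free:
            raise RuntimeError("no capacity left")
        return target.name, target.allocate()


# Two monitors with different local policies (lowest slot vs. highest slot).
manager = GlobalManager([Monitor("m0", 2, min), Monitor("m1", 3, max)])
placements = [manager.place() for _ in range(3)]
```

The global manager never looks inside a component; it sees only coarse load, which keeps the global state small.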

#### Reliability and Failure Handling

* Memory reliability (the focus)
  * Each vma has one primary mComponent, one secondary mComponent, and a backup file in an sComponent
  * Maintains a small append-only log at the secondary mComponent and replicates the vma tree there
  * Flushes the backup to the sComponent
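A simplified sketch of this replication scheme, assuming that writes go to the primary and are appended to the secondary's log, and that the log is flushed to the backup file once it reaches a size limit. The class `ReplicatedVma`, the `log_limit` parameter, and the recovery procedure (backup contents plus log replay) are assumptions for illustration, not the documented LegoOS protocol.

```python
class ReplicatedVma:
    """One vma with a primary copy, a secondary append-only log, and a backup."""

    def __init__(self, log_limit=4):
        self.primary = {}        # page number -> data on the primary mComponent
        self.secondary_log = []  # append-only (page number, data) records
        self.backup = {}         # stands in for the backup file on the sComponent
        self.log_limit = log_limit

    def write(self, vpn, data):
        self.primary[vpn] = data
        self.secondary_log.append((vpn, data))  # never updated in place
        if len(self.secondary_log) >= self.log_limit:
            self.flush()

    def flush(self):
        """Apply the log to the backup in order, then truncate the log."""
        for vpn, data in self.secondary_log:
            self.backup[vpn] = data
        self.secondary_log.clear()

    def recover_primary(self):
        """Assumed recovery: start from the backup, then replay the log."""
        pages = dict(self.backup)
        for vpn, data in self.secondary_log:
            pages[vpn] = data
        return pages
```

Keeping the log small bounds the memory spent on replication at the secondary, while the full copy lives on cheaper storage.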
