# The Demikernel and the future of kernel-bypass systems

* I/O devices are getting faster, but CPUs are not
  * The CPU is increasingly a bottleneck in datacenters
* OS kernels consume a large share of CPU cycles and can no longer keep up with datacenter applications or I/O
* Solution: kernel-bypass I/O devices
  * ![](/files/5ziJvyjgjdhKRQ784wWa)
  * ![](/files/kqBjUNZm7zUXbADX1gBn)
  * Kernel-bypass gives applications direct access to I/O devices, bypassing the OS kernel on every I/O
* Pros and cons of kernel-bypass
  * Pros: widely available (GCE, AWS, and Azure all support kernel-bypass, as do most I/O devices) and effective (e.g., a 128x improvement using RDMA)
  * Cons: hard to use (porting an application is complex and expensive) and limited in adoption (used today only by specialized applications such as scientific computing, not at scale in datacenters)
* Outline
  * Introduction
  * Background
  * Demikernel overview, API, libOSes, evaluation

### Intro

* Kernel-bypass works similarly to hardware virtualization
* Hardware virtualization: the I/O device provides hardware support so that the VM can issue I/O directly
  * The IOMMU translates guest addresses to machine (physical) addresses
  * Technologies: SR-IOV
* Kernel-bypass
  * The I/O device bypasses the OS kernel
  * The application can issue I/O directly
  * The IOMMU translates application-level (user-level) virtual addresses to machine addresses
  * Widely deployed: DPDK, RDMA

![](/files/kimaCxLznwmGegZD23dE)

* Unlike a hypervisor in hardware virtualization, the OS kernel does more than multiplex hardware resources

![](/files/FVSP8SfaoLaopuP9JTMC)

* Take networking as an example: widely used kernel-bypass technologies

![](/files/Bc3VNedrs8MrJBtJxtwL)

* Above: the architecture of a modern system
* Kernel-bypass moves some kernel features, such as address translation and device multiplexing, onto the I/O device, but the device cannot support everything
  * No high-level abstractions: no TCP, no sockets, and many other missing features
  * ![](/files/ke4SqrxXgsqhYdGKqghq)
  * How to fill the gap?
    * Option 1: build a custom messaging layer (networking stack)
      * Every application would then need to build its own
    * Option 2: reuse the OS networking stack and move it up to user space
      * Not fast enough for kernel-bypass devices; the traditional OS stack was built to work with ms-level NICs
    * Option 3: mTCP
      * Built explicitly for kernel-bypass
      * Implements TCP in user space and offers a socket interface that matches POSIX
      * But it is oriented only toward throughput, and is even slower than going through the Linux kernel
* Different devices implement different interfaces and OS services based on hardware capabilities
  * ![](/files/8UsxakWm56PrPIC2v7n5)
  * RDMA NIC: highly capable and provides many features in hardware
    * But it depends heavily on the network for congestion control, which is hard to control
  * DPDK: exposes virtual NICs and nothing else
    * Spends many CPU cycles running a full networking stack at user level
  * Programmable devices
    * What interface should they expose?
  * **There is no standard kernel-bypass API or architecture**

### **Demikernel**

* Demikernel project
  * What is it? A new kernel-bypass OS \[architecture]
  * Design goals
    * Standardized kernel-bypass API
    * Flexible architecture for heterogeneous devices (different devices offload different OS features, but application programmers still get a uniform experience)
    * Single-microsecond I/O processing: I/O is very fast, so the OS cannot afford to add overhead

![](/files/vqBjRP8bs1NEGHAce5Qe)

![](/files/g3ajZBAMo6Stv9lRtlVP)

* Demikernel supplies a different libOS for each device
  * ![](/files/eUujApIyRBXcLTKUdis4)
  * Single unified interface and architecture
  * New hardware --> build a new Demikernel libOS to support it
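
To make the "one libOS per device, one interface for the application" idea concrete, here is a minimal sketch. Everything in it is hypothetical: the class names, the `send` method, and the step descriptions are illustrative assumptions, not Demikernel's actual code. It only shows that the application calls the same interface while each libOS does a different amount of work in software.

```python
from abc import ABC, abstractmethod


class LibOS(ABC):
    """Hypothetical common interface implemented by every device-specific libOS."""

    @abstractmethod
    def send(self, payload: bytes) -> list:
        """Return the software steps this libOS performs to send the payload."""


class RdmaLibOS(LibOS):
    # Assumption: the RDMA NIC handles transport in hardware,
    # so the libOS does little work in software.
    def send(self, payload: bytes) -> list:
        return ["register memory", "post work request"]


class DpdkLibOS(LibOS):
    # Assumption: DPDK exposes only a virtual NIC,
    # so the libOS runs a full user-level networking stack.
    def send(self, payload: bytes) -> list:
        return ["segment into TCP packets", "add IP/Ethernet headers", "enqueue on TX ring"]


def app_send(libos: LibOS, payload: bytes) -> list:
    # Application code is identical regardless of the device underneath.
    return libos.send(payload)
```

Supporting a new device in this sketch would mean adding another `LibOS` subclass and leaving `app_send` untouched.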

#### API

* Key features
  * I/O queue API with scatter-gather arrays (SGAs) to minimize latency
    * Queues replace UNIX pipes and sockets
    * In-memory queues are similar to Go channels
    * Each push and pop should be a complete I/O, so the Demikernel libOS can issue the I/O immediately if possible
  * qtoken and wait: block on I/O operations for finer-grained scheduling
    * Push and pop are asynchronous and return a qtoken for blocking on I/O completion
    * Wait blocks on one or more I/O operations and returns the result
  * Native zero-copy from the application heap with use-after-free protection
    * Critical for latency
    * Pushed SGA buffers are libOS-owned until the qtoken returns; however, the app can free the buffers at any time
    * The libOS allocates buffers for incoming I/O and transfers ownership of the SGA buffers to the app on pop; the app is responsible for freeing them
* Design principles
  * Shared execution contexts to minimize latency
    * Demikernel libOSes perform OS tasks (e.g., network processing) on shared application threads
    * Minimizes latency compared to separate threads (mTCP, NSDI '14) or processes (Snap, SOSP '19)
    * Requires cooperation: the application must regularly enter the libOS (e.g., by calling wait or performing I/O)
  * Co-routines for lightweight multiplexing of OS tasks
    * Lightweight scheduling abstraction
    * Multiplex OS tasks (e.g., packet processing, sending acks, allocating receive buffers) with application execution
    * Implemented with built-in C++ co-routines
    * Cooperatively scheduled by a Demikernel scheduler that separates runnable and blocked co-routines
  * Integrated memory allocator for transparent memory registration and use-after-free protection
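
The push/pop/qtoken/wait pattern from the key features can be modeled with a toy in-memory queue. This is only a sketch under assumptions: `ToyLibOS` and its method signatures are invented for illustration and are not the real Demikernel API, and there is no device, so "completion" simply means a pushed SGA has been matched with an outstanding pop.

```python
from collections import deque
from itertools import count


class ToyLibOS:
    """Toy model of the queue API: push and pop return qtokens, wait redeems them."""

    def __init__(self):
        self._queue = deque()         # pushed SGAs not yet handed to a pop
        self._pending_pops = deque()  # qtokens of pops still waiting for data
        self._results = {}            # qtoken -> result of a completed operation
        self._tokens = count(1)

    def push(self, sga):
        """Asynchronously push one scatter-gather array (a list of bytes segments)."""
        qt = next(self._tokens)
        self._queue.append(list(sga))  # one push is one complete I/O
        self._results[qt] = ("push", None)
        self._poll()
        return qt

    def pop(self):
        """Asynchronously request the next complete I/O from the queue."""
        qt = next(self._tokens)
        self._pending_pops.append(qt)
        self._poll()
        return qt

    def _poll(self):
        # Match queued SGAs with outstanding pops, oldest first.
        while self._queue and self._pending_pops:
            qt = self._pending_pops.popleft()
            self._results[qt] = ("pop", self._queue.popleft())

    def wait(self, qt):
        """Return the result for a qtoken. A real libOS would poll the device here."""
        self._poll()
        if qt not in self._results:
            raise RuntimeError("operation has not completed in this toy model")
        return self._results.pop(qt)
```

In this model, calling `pop()` first and then `push([b"GET ", b"key"])` lets `wait` on the pop's qtoken return the whole two-segment SGA as a single unit, which mirrors the "each push and pop is a complete I/O" rule.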

#### Challenges

* Kernel-bypass scheduling
  * When to do OS work vs. running the application
  * How to prioritize OS work based on the kernel-bypass device
  * How to scale to hundreds of co-routines
  * How to make fine-grained scheduling decisions in a few nanoseconds
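
The cooperative scheduling described above, with runnable and blocked co-routines kept apart, can be sketched with Python generators standing in for co-routines. The `Scheduler` class, the event name `"packet_arrived"`, and the two example tasks are all hypothetical; the talk's implementation uses C++ and has to make these decisions in nanoseconds, which this sketch does not attempt.

```python
from collections import deque


class Scheduler:
    """Toy cooperative scheduler that separates runnable and blocked co-routines."""

    def __init__(self):
        self.runnable = deque()  # co-routines ready to make progress
        self.blocked = {}        # event name -> co-routines waiting on that event

    def spawn(self, coroutine):
        self.runnable.append(coroutine)

    def signal(self, event):
        # An I/O completion moves its waiters back onto the runnable queue.
        self.runnable.extend(self.blocked.pop(event, []))

    def run_once(self):
        """Run one runnable co-routine until its next yield; False if none were runnable."""
        if not self.runnable:
            return False
        coroutine = self.runnable.popleft()
        try:
            event = next(coroutine)  # yields None (still runnable) or an event name
        except StopIteration:
            return True              # the co-routine finished
        if event is None:
            self.runnable.append(coroutine)
        else:
            self.blocked.setdefault(event, []).append(coroutine)
        return True


def ack_sender(log):
    # Hypothetical OS task: block until a packet arrives, then send an ack.
    yield "packet_arrived"
    log.append("ack sent")


def app_work(log):
    # Application work that yields regularly so that OS tasks can run.
    for step in range(2):
        log.append(f"app step {step}")
        yield None
```

Because blocked co-routines sit in a separate table, `run_once` never spends time on a task that cannot progress; only `signal` brings it back.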

#### Summary

* Demikernel is a low-latency kernel-bypass OS
* Demikernel provides a portable API and flexible architecture for heterogeneous kernel-bypass devices
* There are still many interesting open problems in kernel-bypass OS design

#### Questions

* How can Demikernel interact with resources such as sockets without transitioning into the underlying OS?
  * RDMA and DPDK are libraries that give user-level code direct access to the hardware NIC; the hardware lets us access the network device safely
* Is it feasible to implement the POSIX API on top of Demikernel's queue API? Should one?
  * The POSIX API is not hard to implement, but it requires design decisions: when to push and when to issue the I/O
  * It is a great way to move applications onto kernel-bypass gradually
* Congestion control?
  * Not yet; it is being worked on. One idea under consideration: port an extensible congestion-control module written in Rust (from MIT)
* The example reported round-trip numbers for Redis. Which other applications were evaluated?
  * A ported version of Redis was used, as a sub-module
  * Redis has not yet been run on the most recent version
* Would kernel-bypass not work with containers?
  * The container needs a user-level driver to talk to the device
  * The versions must match
  * It is the same problem the OS has, but it is now visible to your applications
* How do you see these primitives extending to the network in a distributed context, e.g., between two Demikernels?
  * The libOS is customized
  * Framing for TCP, UDP: messages are received as packets
  * TCP: a byte stream does not indicate where a message starts and ends
* What happens as the payloads increase in size?
  * Latency goes up a little; throughput goes up
  * Nothing unexpected
  * RDMA: up to 1G, with segmentation offloaded onto the NIC
