# The Demikernel and the future of kernel-bypass systems

* I/O devices are getting faster, but CPUs are not
  * The CPU is increasingly the bottleneck in datacenters
* OS kernels consume a large share of CPU cycles and can no longer keep up with datacenter applications or I/O
* Solution: kernel-bypass I/O devices
  * ![](/files/5ziJvyjgjdhKRQ784wWa)
  * ![](/files/kqBjUNZm7zUXbADX1gBn)
  * Kernel-bypass gives applications direct access to I/O devices, bypassing the OS kernel on every I/O
* Pros and cons of kernel-bypass
  * Pros: widely available (GCE, AWS, and Azure all support kernel-bypass, as do most I/O devices), effective (e.g., 128x improvement using RDMA)
  * Cons: hard to use (porting an application is complex and expensive), limited (used only by specialized applications today, such as scientific computing, and not yet at scale in datacenters)
* Outline
  * Introduction
  * Background
  * Demikernel overview, API, libOSes, evaluation

### Intro

* Kernel-bypass works similarly to hardware virtualization
* Hardware virtualization: the I/O device provides hardware support so that the VM can issue I/O directly
  * The IOMMU translates guest addresses to machine addresses
  * Technologies: SR-IOV
* Kernel-bypass
  * The I/O device bypasses the OS kernel
  * The application can issue I/O directly
  * The IOMMU translates application-level (user-level) virtual addresses to machine addresses
  * Widely deployed: DPDK, RDMA

![](/files/kimaCxLznwmGegZD23dE)

* Unlike H/W virtualization, the OS kernel does more than multiplex hardware resources

![](/files/FVSP8SfaoLaopuP9JTMC)

* Take networking as an example: widely used kernel-bypass technologies

![](/files/Bc3VNedrs8MrJBtJxtwL)

* Above: architecture of a modern system
* Kernel bypass moves some of those features, such as address translation and device multiplexing, onto the I/O device, but the device cannot support everything
  * No high-level abstractions: no TCP, no sockets, and many other things are missing
  * ![](/files/ke4SqrxXgsqhYdGKqghq)
  * How to fill the gap?
    * One option: build a custom messaging layer (networking stack)
      * But then every application has to build its own
    * Re-use the OS networking stack and move it up to user space?
      * Not fast enough for kernel-bypass devices; the traditional OS stack was built to work with ms-level NICs
    * mTCP
      * Built explicitly for kernel bypass
      * Implements TCP in user space and offers a socket interface that matches POSIX
      * But it is oriented only toward throughput, and is even slower than going through the Linux kernel
* Different devices implement different interfaces and OS services based on hardware capabilities
  * ![](/files/8UsxakWm56PrPIC2v7n5)
  * RDMA NIC: capable, provides a lot of functionality
    * But depends heavily on the network for congestion control, which is hard to control
  * DPDK: virtual NICs and nothing else
    * Spends a lot of CPU cycles running a full networking stack at user level
  * Programmable devices
    * Interface?
  * **There is no standard kernel-bypass API or architecture**

### **Demikernel**

* Demikernel project
  * What is it? A new kernel-bypass OS \[architecture]
  * Design goals
    * Standardized kernel-bypass API
    * Flexible architecture for heterogeneous devices (devices offer different OS features, but the application programmer still gets a uniform experience)
    * Single-microsecond I/O processing: I/O is very fast, so the OS cannot afford to add any more overhead

![](/files/vqBjRP8bs1NEGHAce5Qe)

![](/files/g3ajZBAMo6Stv9lRtlVP)

* Demikernel supplies a different libOS for each device
  * ![](/files/eUujApIyRBXcLTKUdis4)
  * Single unified interface and architecture
  * New hardware --> build a new Demikernel libOS to support it
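
The "one interface, one libOS per device" idea can be pictured with a small Python sketch. Everything here is hypothetical and for illustration only: `LibOS`, `OffloadLibOS`, `SoftwareStackLibOS`, and `echo` are made-up names, not Demikernel's real types, and the "devices" are just in-memory queues. The point is that application code is written once against the interface, while each libOS supplies whatever its device is missing.

```python
from abc import ABC, abstractmethod
from collections import deque


class LibOS(ABC):
    """Hypothetical uniform interface that every per-device libOS implements."""

    @abstractmethod
    def push(self, payload: bytes) -> None:
        """Send one complete message."""

    @abstractmethod
    def pop(self) -> bytes:
        """Receive one complete message."""


class OffloadLibOS(LibOS):
    """Toy stand-in for a capable device (such as an RDMA NIC): the device
    does the transport work, so the libOS is a thin pass-through."""

    def __init__(self) -> None:
        self._wire = deque()

    def push(self, payload: bytes) -> None:
        self._wire.append(payload)

    def pop(self) -> bytes:
        return self._wire.popleft()


class SoftwareStackLibOS(LibOS):
    """Toy stand-in for a bare device (such as DPDK): the libOS itself must
    supply the missing features, modeled here as MTU-sized segmentation."""

    MTU = 4  # deliberately tiny so the segmentation is visible

    def __init__(self) -> None:
        self._wire = deque()

    def push(self, payload: bytes) -> None:
        # Split the message into segments no larger than the MTU.
        segments = [payload[i:i + self.MTU] for i in range(0, len(payload), self.MTU)]
        self._wire.append(segments)

    def pop(self) -> bytes:
        # Reassemble the segments of the oldest message.
        return b"".join(self._wire.popleft())


def echo(libos: LibOS, message: bytes) -> bytes:
    """Application code: written once against LibOS, unaware of the device."""
    libos.push(message)
    return libos.pop()
```

Calling `echo(OffloadLibOS(), b"hello world")` and `echo(SoftwareStackLibOS(), b"hello world")` returns the same bytes, even though the second libOS does extra work in software.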

#### API

* Key features
  * I/O queue API: with scatter-gather arrays to minimize latency
    * Queues replace UNIX pipes and sockets
    * In-memory queues are similar to Go channels
    * Each push and pop should be a complete I/O, so the Demikernel libOS can issue the I/O immediately if possible
  * qtoken and wait: to block on I/O operations for finer-grained scheduling
    * Push and pop are asynchronous and return a qtoken for blocking on I/O completion
    * Wait blocks on one or more I/O operations and returns the result
  * Native zero-copy from the application heap with use-after-free protection
    * Critical for latency
    * Pushed SGA buffers are libOS-owned until the qtoken returns; however, the app can free the buffers at any time
    * The libOS allocates buffers for incoming I/O and transfers ownership of the SGA buffers on pop. The app is responsible for freeing the buffers
* Design principles
  * Shared execution contexts for minimizing latency
    * Demikernel libOSes perform OS tasks (e.g., network processing) on shared application threads
    * Minimizes latency compared to threads (mTCP, NSDI '14) or processes (Snap, SOSP '19)
    * Requires cooperation: the application must regularly enter the libOS (e.g., by calling wait or performing I/O)
  * Co-routines for lightweight multiplexing of OS tasks
    * Lightweight scheduling abstractions
    * Multiplex OS tasks (e.g., packet processing, sending acks, allocating receive buffers) with application execution
    * Implemented with C++'s built-in co-routines
    * Cooperatively scheduled by a Demikernel scheduler that separates runnable and blocked co-routines
  * Integrated memory allocator for transparent memory registration and use-after-free protection
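
To make the push / pop / qtoken / wait flow concrete, here is a minimal single-threaded sketch in Python. It is a toy under stated assumptions, not the Demikernel API: `ToyLibOS` and its method signatures are invented for illustration, the "device" is an in-memory queue, and buffer ownership and zero-copy are not modeled. It does show two points from the notes: push and pop return immediately with a token, and the libOS only does its work when the application enters it through `wait`.

```python
from collections import deque
from itertools import count


class ToyLibOS:
    """In-memory stand-in for a queue-based libOS (hypothetical, not the real API)."""

    def __init__(self) -> None:
        self._queue = deque()     # one in-memory I/O queue
        self._pending = {}        # qtoken -> (operation, scatter-gather array)
        self._done = {}           # qtoken -> result of a completed operation
        self._tokens = count(1)

    def push(self, sga: list) -> int:
        """Start pushing one complete message (a list of segments); return a qtoken."""
        qt = next(self._tokens)
        self._pending[qt] = ("push", sga)
        return qt

    def pop(self) -> int:
        """Start popping one complete message; return a qtoken."""
        qt = next(self._tokens)
        self._pending[qt] = ("pop", None)
        return qt

    def _poll(self) -> int:
        """Do the 'OS work': complete every pending operation that can progress."""
        completed = 0
        for qt, (op, sga) in list(self._pending.items()):
            if op == "push":
                self._queue.append(list(sga))  # segments are passed on, not joined
                result = None
            elif self._queue:
                result = self._queue.popleft()
            else:
                continue  # a pop on an empty queue stays pending
            self._done[qt] = result
            del self._pending[qt]
            completed += 1
        return completed

    def wait(self, qt: int):
        """Enter the libOS until the operation behind qt completes; return its result."""
        while qt not in self._done:
            if self._poll() == 0:
                raise RuntimeError("nothing can make progress in this single-threaded toy")
        return self._done.pop(qt)


libos = ToyLibOS()
pop_token = libos.pop()                         # issued first, cannot complete yet
push_token = libos.push([b"hello ", b"world"])  # scatter-gather array of two segments
libos.wait(push_token)
message = libos.wait(pop_token)                 # the two segments, still separate
```

A real `wait` would also accept several qtokens and return whichever completes first; that variant is left out here to keep the sketch short.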

#### Challenges

* Kernel-bypass scheduling
  * When to do OS work vs running the application
  * How to prioritize OS work based on the kernel-bypass device
  * How to scale to hundreds of co-routines
  * How to make fine-grained scheduling decisions in a few nanoseconds
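
One way to picture these scheduling questions, together with the runnable/blocked split mentioned under the design principles, is a tiny cooperative scheduler built on Python generators. This is only a sketch under assumptions: `Scheduler`, `ack_sender`, `application`, and the `"packet_arrived"` event are hypothetical names, and a real libOS scheduler would be far more careful about cost per decision. The idea it illustrates is that blocked co-routines are parked and cost nothing until their event fires, so each decision only touches the runnable queue.

```python
from collections import deque


class Scheduler:
    """Toy cooperative scheduler: runnable co-routines sit in a FIFO,
    blocked ones are parked under the event they are waiting for."""

    def __init__(self) -> None:
        self.runnable = deque()
        self.blocked = {}  # event name -> list of parked co-routines

    def spawn(self, coroutine) -> None:
        self.runnable.append(coroutine)

    def signal(self, event: str) -> None:
        """Move every co-routine parked on `event` back to the runnable queue."""
        self.runnable.extend(self.blocked.pop(event, []))

    def run_once(self) -> bool:
        """Resume one runnable co-routine until its next yield. False means idle."""
        if not self.runnable:
            return False
        coroutine = self.runnable.popleft()
        try:
            wanted = next(coroutine)  # None: plain yield; str: block on that event
        except StopIteration:
            return True               # the co-routine finished
        if wanted is None:
            self.runnable.append(coroutine)
        else:
            self.blocked.setdefault(wanted, []).append(coroutine)
        return True


log = []


def ack_sender():
    """OS task: blocks until a packet arrives, then 'sends' an ack."""
    yield "packet_arrived"
    log.append("ack sent")


def application():
    """Application work that yields to the scheduler between steps."""
    log.append("app step 1")
    yield
    log.append("app step 2")


sched = Scheduler()
sched.spawn(ack_sender())
sched.spawn(application())
while sched.run_once():          # the OS task parks itself, the app runs to the end
    pass
sched.signal("packet_arrived")   # the device reports a packet
while sched.run_once():          # only now does the parked OS task run
    pass
```

After this runs, `log` holds the two application steps followed by the ack, and nothing is left in `sched.blocked`.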

#### Summary

* Demikernel is a low-latency kernel-bypass OS
* Demikernel provides a portable API and flexible architecture for heterogeneous kernel-bypass devices
* There are still many interesting open problems in kernel-bypass OS design

#### Questions

* How can Demikernel interact with resources such as sockets without transitioning into the underlying OS?
  * RDMA and DPDK are libraries that give user-level direct access to the hardware NIC. The hardware devices allow us to access the network device safely
* Is it feasible to implement the POSIX API on top of Demikernel's qAPI? Do you think one should?
  * It is not hard to implement the POSIX API. It requires design decisions: when to push, when to issue the I/O
  * It is a great way to move applications onto kernel-bypass gradually
* Congestion control?
  * Not yet, working on it. One idea under consideration: port the extensible CC module written in Rust (from MIT)?
* The round-trip numbers used Redis as the example application. Were other applications evaluated?
  * A ported version of Redis, as a sub-module
  * Redis has not been run on the most recent version
* Kernel bypass wouldn't work with containers?
  * Need a user-level driver to talk to the device
  * The versions have to match
  * But it is the same OS problem, now visible to your applications
* How do you see these primitives extending to the network, in a distributed context? Two Demikernels?
  * LibOS: customized
  * Framing for TCP, UDP: receive them as a packet
  * TCP: don't know where a message starts and ends
* What happens as the payloads increase in size?
  * Latency goes up a little bit? Throughput goes up
  * Nothing unexpected
  * RDMA: up to 1G, segmentation offloaded onto the NIC
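
Returning to the framing point in the distributed-context question: TCP delivers a byte stream with no message boundaries, while a queue push or pop is meant to be one complete I/O. A common way to recover boundaries over a stream is a length prefix. The sketch below shows that general technique in Python; the 4-byte big-endian header and the names `frame` and `Deframer` are assumptions made for illustration, not a format taken from the talk or from Demikernel.

```python
import struct

HEADER = struct.Struct("!I")  # assumed format: 4-byte big-endian length prefix


def frame(message: bytes) -> bytes:
    """Prefix a message with its length so the receiver can find its end."""
    return HEADER.pack(len(message)) + message


class Deframer:
    """Rebuilds whole messages from arbitrarily split chunks of a byte stream."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add bytes from the stream; return every message that is now complete."""
        self._buffer += chunk
        messages = []
        while len(self._buffer) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buffer)
            end = HEADER.size + length
            if len(self._buffer) < end:
                break  # the rest of this message has not arrived yet
            messages.append(bytes(self._buffer[HEADER.size:end]))
            del self._buffer[:end]
        return messages


# Two messages sent back to back, then received in chunks that ignore boundaries.
stream = frame(b"hello") + frame(b"demikernel")
deframer = Deframer()
first = deframer.feed(stream[:3])     # not even a full header yet
second = deframer.feed(stream[3:12])  # completes the first message
third = deframer.feed(stream[12:])    # completes the second message
```

With this in place, each pop on the receiving side can hand the application one whole message regardless of how the stream was split in transit.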
