The Demikernel and the future of kernel-bypass systems

https://www.youtube.com/watch?v=4LFL0_12cK4

  • I/O devices are getting faster, but CPUs are not

    • The CPU is increasingly a bottleneck in datacenters

  • OS kernels consume a large share of CPU cycles and can no longer keep up with datacenter applications or I/O

  • Solution: kernel-bypass I/O devices

    • Kernel-bypass gives applications direct access to I/O devices, bypassing the OS kernel on every I/O

  • Pros and Cons of kernel-bypass

    • Pros: widely available (GCE, AWS, and Azure all support kernel-bypass, and most I/O devices support it) and effective (e.g., a 128x improvement using RDMA)

    • Cons: hard to use (porting an application is complex and expensive) and limited (used today only by specialized applications like scientific computing, not at datacenter scale)

  • Outline

    • Introduction

    • Background

    • Demikernel overview, API, libOSes, evaluation

Intro

  • Kernel-bypass works similarly to hardware virtualization

  • The I/O device provides H/W support so that the VM can issue I/O directly

    • The IOMMU translates guest addresses to actual physical (machine) addresses

    • Technologies: SR-IOV

  • Kernel-bypass

    • I/O bypasses the OS kernel

    • The application can issue I/O directly

    • The IOMMU translates application-level (user-level) virtual addresses to machine addresses (see the sketch below)

    • Widely deployed: DPDK, RDMA
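
    • Below, a minimal sketch of what this direct access looks like with the real RDMA verbs API (libibverbs): the application registers a buffer once through a kernel control-path call; afterwards the NIC, via the IOMMU, translates the buffer's user-level addresses on every I/O without kernel involvement. Error handling is mostly elided.

      // Register application memory with an RDMA NIC so the device can DMA
      // directly to/from user-level virtual addresses (translated by the IOMMU).
      #include <infiniband/verbs.h>
      #include <cstdlib>

      int main() {
          ibv_device **devs = ibv_get_device_list(nullptr);
          if (devs == nullptr || devs[0] == nullptr) return 1;
          ibv_context *ctx = ibv_open_device(devs[0]);   // open the NIC
          ibv_pd *pd = ibv_alloc_pd(ctx);                // protection domain

          void *buf = std::malloc(4096);
          // One-time control-path call: pin the buffer and install its
          // translation in the NIC; returns keys used on the data path.
          ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                  IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
          // ... data-path sends/receives reference mr->lkey directly; the
          // kernel is not involved in those per-I/O operations ...
          ibv_dereg_mr(mr);
          ibv_dealloc_pd(pd);
          ibv_close_device(ctx);
          ibv_free_device_list(devs);
          std::free(buf);
          return 0;
      }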

  • Unlike in H/W virtualization, the OS kernel does more than multiplex hardware resources

  • Consider networking, where kernel-bypass technologies are widely used

  • (Slide: architecture of a modern system)

  • Kernel bypass moves features like address translation and device multiplexing into the I/O device, but the device is not capable of supporting everything

    • No high-level abstractions: no TCP, no sockets, and lots of other missing pieces

      • How to fill the gap?

        • One option: build your own custom messaging layer (networking stack)

          • Costly if every application needs to build its own

      • Another option: re-use the OS networking stack by moving it up to user space?

        • Not fast enough for kernel-bypass devices; the traditional OS stack was built for millisecond-scale NICs

      • mTCP

        • Explicitly built for kernel bypass

        • Implements TCP in user space and offers a socket interface matching POSIX

        • But oriented only toward throughput; latency can be even worse than going through the Linux kernel

  • Different devices implement different interfaces and OS services based on hardware capabilities

    • RDMA NICs: very capable, provide a lot of functionality in hardware

      • But depend heavily on the network for congestion control, which is hard to control

    • DPDK: virtual NICs and nothing else

      • Spends a lot of CPU cycles running a full networking stack at user level

    • Programmable devices

      • What should their interface be?

    • There is no standard kernel-bypass API or architecture

Demikernel

  • Demikernel project

    • What is it? A new kernel-bypass OS architecture

    • Design goals

      • Standardized kernel-bypass API

      • Flexible architecture for heterogeneous devices (device-specific OS features, while still providing a uniform experience for the application programmer)

      • Single-microsecond I/O processing: I/O is so fast that the OS cannot afford to add more overhead

  • Demikernel supplies a different libOS for each device

    • Single unified interface and architecture

    • New hardware --> build a new Demikernel libOS to support it

API

  • Key features

    • I/O queue API with scatter-gather arrays to minimize latency

      • Queues replace UNIX pipes and sockets

      • In-memory queue similar to Go channels

      • Each push and pop should be a complete I/O, so the Demikernel libOS can issue the I/O immediately if possible

    • qtoken and wait: to block on I/O operations for finer-grained scheduling

      • Push and pop are asynchronous and return a qtoken for blocking on the I/O's completion

      • Wait blocks on one or more I/O operations and returns the result

    • Native zero-copy from the application heap with use-after-free protection

      • Critical for latency

      • Pushed SGA buffers are libOS-owned until the qtoken completes; however, thanks to use-after-free protection, the app can free the buffers at any time

      • The libOS allocates buffers for incoming I/O, transferring ownership of the SGA buffers on pop; the app is responsible for freeing them (see the sketch below)
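
    • A sketch of how these pieces fit together: an echo loop written against the queue API. The function names (demi_push, demi_pop, demi_wait, demi_sgafree) follow the API described in the talk, but the type and signature details here are stand-ins, not the exact Demikernel headers.

      // Echo loop over a Demikernel I/O queue. Stand-in declarations mimic
      // the talk's API; in practice you would link against the real library.
      #include <cstdint>

      using qtoken = uint64_t;                           // handle for a pending I/O
      struct demi_sgarray { void *buf; uint32_t len; };  // scatter-gather array
      struct demi_qresult { demi_sgarray sga; };         // completed-I/O result

      extern "C" int demi_pop(qtoken *qt, int qd);                         // async receive
      extern "C" int demi_push(qtoken *qt, int qd, const demi_sgarray *sga); // async send
      extern "C" int demi_wait(demi_qresult *qr, qtoken qt);               // block until done
      extern "C" int demi_sgafree(demi_sgarray *sga);

      void echo_loop(int qd) {
          for (;;) {
              qtoken qt;
              demi_qresult in, out;
              demi_pop(&qt, qd);            // returns immediately with a qtoken
              demi_wait(&in, qt);           // block; the libOS runs on this thread here
              // pop transferred ownership of the incoming buffer to the app
              demi_push(&qt, qd, &in.sga);  // zero-copy send of the same buffer
              demi_wait(&out, qt);          // buffer is libOS-owned until this returns
              demi_sgafree(&in.sga);        // app frees buffers it received from pop
          }
      }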

  • Design principles

    • Shared execution contexts for minimizing latency

      • Demikernel libOSes perform OS tasks (e.g., network processing) on shared application threads

      • Minimizes latency compared to threads (mTCP, NSDI '14) or processes (SNAP, SOSP '19)

      • Requires cooperation: the application must regularly enter the libOS (e.g., by calling wait or performing I/O)

    • Co-routines for lightweight multiplexing of OS tasks

      • Light-weight scheduling abstractions

      • Multiplex OS tasks (e.g., packet processing, sending acks, allocating receive buffers) with application execution

      • Implemented with C++'s built-in co-routines

      • Cooperatively scheduled by a Demikernel scheduler that separates runnable and blocked co-routines (see the sketch after this list)

    • Integrated memory allocator for transparent memory registration and use-after-free protection
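
    • A toy sketch of the co-routine mechanism, assuming C++20 co-routines: two tasks (one standing in for libOS packet processing, one for application work) share one thread, and a scheduler resumes only handles in the runnable queue. This illustrates the idea; it is not Demikernel's actual scheduler.

      // Cooperative multiplexing of OS tasks and app work with C++20 co-routines.
      #include <coroutine>
      #include <cstdio>
      #include <deque>

      struct Task {
          struct promise_type {
              Task get_return_object() {
                  return {std::coroutine_handle<promise_type>::from_promise(*this)};
              }
              std::suspend_always initial_suspend() noexcept { return {}; }
              std::suspend_always final_suspend() noexcept { return {}; }
              void return_void() {}
              void unhandled_exception() {}
          };
          std::coroutine_handle<promise_type> handle;
      };

      // Runnable co-routines are kept apart from blocked ones, so a scheduling
      // pass never touches work that cannot make progress.
      std::deque<std::coroutine_handle<>> runnable;

      Task packet_processing() {                // stand-in for a libOS task
          for (int i = 0; i < 3; i++) {
              std::puts("libOS: process incoming packets");
              co_await std::suspend_always{};   // yield back to the scheduler
          }
      }

      Task application_work() {                 // stand-in for app request handling
          for (int i = 0; i < 3; i++) {
              std::puts("app: handle a request");
              co_await std::suspend_always{};
          }
      }

      int main() {
          runnable.push_back(packet_processing().handle);
          runnable.push_back(application_work().handle);
          while (!runnable.empty()) {           // cooperative round-robin
              auto h = runnable.front();
              runnable.pop_front();
              h.resume();                       // run until the next yield
              if (h.done()) h.destroy(); else runnable.push_back(h);
          }
      }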

Challenges

  • Kernel-bypass scheduling

    • When to do OS work vs running the application

    • How to prioritize OS work based on the kernel-bypass device

    • How to scale to hundreds of co-routines

    • How to make fine-grained scheduling decisions in a few nanoseconds

Summary

  • Demikernel is a low-latency kernel-bypass OS

  • Demikernel provides a portable API and a flexible architecture for heterogeneous kernel-bypass devices

  • There are still many interesting open problems in kernel-bypass OS design

Questions

  • How can Demikernel interact with resources such as sockets without transitioning into the underlying OS?

    • RDMA and DPDK are user-level libraries giving direct access to the hardware NIC; the hardware lets us access the network device safely

  • Is it feasible to implement the POSIX API on top of Demikernel's queue API? Should one?

    • Not hard to do; it requires design decisions such as when to push and when to issue the I/O (sketch below)

    • Great way to slowly move applications onto kernel-bypass
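
    • As a thought experiment (not shown in the talk), a blocking POSIX-style read() could wrap pop + wait; the demi_* declarations below are the same hypothetical stand-ins as in the earlier echo sketch:

      // Hypothetical POSIX-style read() built on pop + wait (stand-in API).
      #include <sys/types.h>
      #include <algorithm>
      #include <cstdint>
      #include <cstring>

      using qtoken = uint64_t;
      struct demi_sgarray { void *buf; uint32_t len; };
      struct demi_qresult { demi_sgarray sga; };
      extern "C" int demi_pop(qtoken *qt, int qd);
      extern "C" int demi_wait(demi_qresult *qr, qtoken qt);
      extern "C" int demi_sgafree(demi_sgarray *sga);

      ssize_t posix_like_read(int qd, void *buf, size_t len) {
          qtoken qt;
          demi_qresult qr;
          if (demi_pop(&qt, qd) != 0) return -1;   // issue the receive
          if (demi_wait(&qr, qt) != 0) return -1;  // block, like read()
          size_t n = std::min(len, static_cast<size_t>(qr.sga.len));
          std::memcpy(buf, qr.sga.buf, n);         // copy out: forfeits zero-copy
          demi_sgafree(&qr.sga);                   // app frees popped buffers
          // Design decision: a real wrapper would have to buffer any bytes
          // beyond len instead of dropping them, as this sketch does.
          return static_cast<ssize_t>(n);
      }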

  • Congestion control?

    • Not yet, but working on it. Have thought about an extensible congestion-control module in Rust (from MIT) --> porting that?

  • The evaluation shows round-trip numbers and Redis as applications; are there other applications in the evaluation?

    • A ported version of Redis (as a submodule)

    • Redis has not been run on the most recent version

  • Kernel bypass wouldn't work with containers?

    • You need a user-level driver to talk to the device

    • Driver and device versions must match

    • It's the same OS problem, just now visible to your applications

  • How do you see these primitives extending over the network, in a distributed context, e.g., two Demikernels talking to each other?

    • The libOS can be customized

    • Framing: with UDP you receive data as discrete packets

    • With TCP you don't know where messages start and end

  • What happens as the payloads increase in size?

    • Latency goes up a little; throughput goes up

    • Nothing unexpected

    • RDMA: messages up to 1 GB, with segmentation offloaded onto the NIC
