The Demikernel and the future of kernel-bypass systems

https://www.youtube.com/watch?v=4LFL0_12cK4

I/O devices are getting faster, but CPUs are not
- The CPU is increasingly a bottleneck in datacenters
OS kernels consume a large share of CPU cycles and can no longer keep up with datacenter applications or I/O
Solution: kernel-bypass I/O devices
- Kernel-bypass gives applications direct access to I/O devices, bypassing the OS kernel on every I/O
Pros and Cons of kernel-bypass
- Pros: widely available (GCE, AWS, and Azure all support kernel-bypass, as do most I/O devices), effective (e.g., a 128x improvement using RDMA)
- Cons: hard to use (porting an application is complex and expensive), limited (used today only by specialized applications such as scientific computing, not at scale in datacenters)
Outline
- Introduction
- Background
- Demikernel overview, API, liboses, evaluation

Intro

Kernel-bypass works similarly to hardware virtualization
I/O device provides H/W support so that the VM can directly issue I/O
- IOMMU translation from guest to actual physical devices
- Technologies: SR-IOV
Kernel-bypass
- I/O devices bypass the OS kernel
- Application can directly issue I/O
- IOMMU translates from application-level (user-level) virtual addresses to machine addresses
- Widely deployed: DPDK, RDMA

Unlike H/W virtualization, the OS kernel does more than multiplex hardware resources

Take a look at networking: widely used kernel bypass technologies

Above: architecture of a modern system
Kernel bypass: move features such as address translation and device multiplexing to the I/O device, but I/O devices are not capable of supporting everything
- No high-level abstractions, no TCP, no socket, and lots of other things
- Gap?
  - One option: own custom messaging layer (networking stack)
    Then every application needs to build its own
  - Re-use OS networking stack and move it up to user space?
    Not fast enough for kernel-bypass devices; the traditional OS stack is built to work with ms-level NICs
  - mTCP
    Explicitly built for kernel bypass
    Implements TCP in user space and offers a socket interface that matches POSIX
    But oriented only toward throughput; even slower than going through the Linux kernel
Different devices implement different interfaces and OS services based on hardware capabilities
- RDMA NICs: highly capable, provide many features in hardware
  - But depend heavily on the network for congestion control, which is hard to control
- DPDK: virtual NICs and nothing else
  - Spends many CPU cycles running a full networking stack at user level
- Programmable devices
  - Interface?
- There is no standard kernel-bypass API or architecture

Demikernel

Demikernel project
- What is it? A new kernel-bypass OS [architecture]
- Design goals
  - Standardized kernel-bypass API
  - Flexible architecture for heterogeneous devices (devices differ in the OS features they provide, but the application programmer still gets a uniform experience)
  - Single-microsecond I/O processing: I/O is very fast, so the OS cannot add any more overhead

Demikernel supplies a different libos for each device
- Single unified interface and architecture
- New hardware --> build a new Demikernel libOS to support it
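
To make the "one interface, one libOS per device" idea concrete, here is a minimal sketch. The class names (`LibOS`, `RdmaLibOS`, `DpdkLibOS`) and the single `push` method are hypothetical and only illustrate the shape of the architecture; they are not the Demikernel API.

```python
from abc import ABC, abstractmethod


class LibOS(ABC):
    """Hypothetical uniform interface that every device-specific libOS implements."""

    @abstractmethod
    def push(self, data: bytes) -> str:
        """Send data and report which layer did the protocol work."""


class RdmaLibOS(LibOS):
    # Assumption: the RDMA NIC handles transport in hardware, so this libOS stays thin.
    def push(self, data: bytes) -> str:
        return f"rdma: NIC transport sends {len(data)} bytes"


class DpdkLibOS(LibOS):
    # Assumption: DPDK exposes only raw packets, so this libOS runs the stack in software.
    def push(self, data: bytes) -> str:
        return f"dpdk: user-level stack sends {len(data)} bytes"


def send_hello(libos: LibOS) -> str:
    # Application code is written once against the LibOS interface.
    return libos.push(b"hello")
```

The application function `send_hello` does not change when the device changes; only the libOS object passed in does.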

API

Key features
- I/O queue API: with scatter-gather arrays to minimize latency
  - Queues replace UNIX pipes and sockets
  - In-memory queue similar to Go channels
  - Each push and pop should be a complete I/O, so the Demikernel libOS can immediately issue the I/O if possible
- qtoken and wait: to block on I/O operations for finer-grained scheduling
  - Push and pop are async and return a qtoken for blocking on I/O completion
  - Wait blocks on one or more I/O operations and returns the result
- Native zero-copy from the application heap with use-after-free protection
  - Critical for latency
  - Pushed SGA buffers are libOS-owned until qtoken returns; however, the app can free the buffers at any time
  - LibOS allocates buffers for incoming I/O, transferring ownership of the SGA buffers on pop. The app is responsible for freeing the buffers
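
The push/pop/wait pattern above can be sketched with a toy in-memory queue. This is an illustration only: `MemoryQueue`, `wait_any`, and the integer qtokens are assumed names and representations, not the real Demikernel interface, and a push completes immediately here because no device is involved.

```python
from collections import deque
from itertools import count


class MemoryQueue:
    """Toy in-memory I/O queue; not the Demikernel implementation."""

    def __init__(self):
        self._items = deque()         # pushed scatter-gather arrays not yet popped
        self._pending_pops = deque()  # qtokens of pops still waiting for data
        self._results = {}            # qtoken -> result of a completed operation
        self._tokens = count(1)

    def push(self, sga):
        """Submit one complete I/O (a list of buffers) and return its qtoken."""
        qt = next(self._tokens)
        self._items.append(list(sga))
        self._results[qt] = ("push", None)  # in this toy, a push completes at once
        self._match()
        return qt

    def pop(self):
        """Request one incoming I/O and return a qtoken without blocking."""
        qt = next(self._tokens)
        self._pending_pops.append(qt)
        self._match()
        return qt

    def _match(self):
        # Complete waiting pops in order while pushed data is available.
        while self._items and self._pending_pops:
            qt = self._pending_pops.popleft()
            self._results[qt] = ("pop", self._items.popleft())

    def wait_any(self, qtokens):
        """Return (qtoken, result) for the first listed operation that has completed."""
        for qt in qtokens:
            if qt in self._results:
                return qt, self._results.pop(qt)
        raise RuntimeError("nothing completed; a real libOS would poll the device here")


q = MemoryQueue()
pop_qt = q.pop()                     # issued before any data exists
push_qt = q.push([b"GET ", b"key"])  # scatter-gather array with two buffers
done_qt, (kind, sga) = q.wait_any([pop_qt, push_qt])
```

Because both calls return qtokens right away, the application decides when to block, which is the hook for finer-grained scheduling.
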
Design principles
- Shared execution contexts for minimizing latency
  - Demikernel libOSes perform OS tasks (e.g., networking processing) on shared application threads
  - Minimizes latency compared to threads (mTCP, NSDI '14) or processes (SNAP, SOSP '19)
  - Requires cooperation: application must regularly enter the libOS (e.g., by calling wait or performing I/O)
- Co-routines for lightweight multiplexing of OS tasks
  - Light-weight scheduling abstractions
  - Multiplex OS tasks (e.g., packet processing, sending acks, allocating receive buffers) with application execution
  - Implemented with built-in C++ co-routines
  - Cooperatively scheduled by a Demikernel scheduler that separates runnable and blocked co-routines
- Integrated memory allocator for transparent memory registration and use-after-free protection
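
As a rough illustration of a cooperative scheduler that keeps runnable and blocked co-routines apart, the sketch below uses Python generators in place of C++ co-routines. The `Scheduler` class, the event name `"packet_arrived"`, and the two tasks are all hypothetical.

```python
from collections import deque


class Scheduler:
    """Toy cooperative scheduler; runnable and blocked co-routines are kept apart."""

    def __init__(self):
        self.runnable = deque()
        self.blocked = {}  # event name -> co-routines waiting on that event

    def spawn(self, coroutine):
        self.runnable.append(coroutine)

    def signal(self, event):
        # Move every co-routine blocked on this event back to the runnable queue.
        self.runnable.extend(self.blocked.pop(event, []))

    def run_once(self):
        """Run one runnable co-routine until it yields; return False if none is runnable."""
        if not self.runnable:
            return False
        coroutine = self.runnable.popleft()
        try:
            event = next(coroutine)
        except StopIteration:
            return True  # the co-routine finished
        if event is None:
            self.runnable.append(coroutine)  # plain yield: still runnable
        else:
            self.blocked.setdefault(event, []).append(coroutine)
        return True


log = []


def ack_sender():
    # OS task: blocks until a packet arrives, then sends an ack.
    log.append("ack: waiting for packet")
    yield "packet_arrived"
    log.append("ack: sending ack")


def app_work():
    # Application task: yields once to let other work run.
    log.append("app: step 1")
    yield
    log.append("app: step 2")


sched = Scheduler()
sched.spawn(ack_sender())
sched.spawn(app_work())
while sched.run_once():
    pass  # stops once only blocked co-routines remain
sched.signal("packet_arrived")
while sched.run_once():
    pass
```

The scheduler never scans blocked co-routines when picking the next task, so the cost of a scheduling decision does not grow with the number of tasks that are waiting.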

Challenges

Kernel-bypass scheduling
- When to do OS work vs running the application
- How to prioritize OS work based on the kernel-bypass device
- How to scale to hundreds of co-routines
- How to make fine-grained scheduling decisions in a few nanoseconds

Summary

Demikernel is a low-latency kernel-bypass OS
Demikernel provides a portable API and flexible architecture for heterogeneous kernel-bypass devices
There are still many interesting open problems in kernel-bypass OS design

Questions

How can Demikernel interact with resources such as sockets without transitioning into the underlying OS?
- RDMA, DPDK: libraries, user-level direct access to the hardware NIC. Hardware devices allow us to safely access the network device
Is it feasible to implement the POSIX API on top of Demikernel's qAPI? Do you think one should?
- Implementing the POSIX API is not hard, but it requires design decisions: when to push, when to issue the I/O
- Great way to slowly move applications onto kernel-bypass
Congestion control?
- No, still working on it. Considering porting an extensible CC module written in Rust (MIT)
Example of the # of running for round-trip, Redis (applications). Which other applications were evaluated?
- A ported version of Redis, as a sub-module
- Have not run Redis on the most recent version
Kernel bypass wouldn't work with containers?
- Need user-level driver to talk with the device
- Match the versions
- But same OS problem, now visible to your applications
How do you see these primitives extending to the network, in a distributed context? Two Demikernels?
- LibOS: customized
- Framing for TCP, UDP: receive them as a packet
- TCP: the receiver does not know where a message starts and ends
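
One common way to recover message boundaries on a byte stream is a length prefix. The sketch below assumes a 4-byte big-endian header; that format, and the names `frame` and `Deframer`, are chosen for illustration and are not taken from the talk.

```python
import struct

HEADER = struct.Struct("!I")  # assumed format: 4-byte big-endian message length


def frame(message: bytes) -> bytes:
    """Prefix a message with its length so the receiver can find its end."""
    return HEADER.pack(len(message)) + message


class Deframer:
    """Reassembles whole messages from arbitrarily split stream chunks."""

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add received bytes and return every message that is now complete."""
        self._buffer += chunk
        messages = []
        while len(self._buffer) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buffer)
            end = HEADER.size + length
            if len(self._buffer) < end:
                break  # the rest of this message has not arrived yet
            messages.append(bytes(self._buffer[HEADER.size:end]))
            del self._buffer[:end]
        return messages
```

With framing done below the queue interface, each pop can hand the application one complete message even though the underlying stream delivers bytes in arbitrary pieces.
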
What happens as the payloads increase in size?
- Latency goes up a little bit? Throughput goes up
- Nothing unexpected
- RDMA: up to 1G, segmentation offloaded onto the NIC
