The Demikernel and the future of kernel-bypass systems
https://www.youtube.com/watch?v=4LFL0_12cK4
I/O devices are getting faster, but CPUs are not
The CPU is increasingly a bottleneck in datacenters
OS kernels consume a large share of CPU cycles and can no longer keep up with datacenter applications or I/O devices
Solution: kernel-bypass I/O devices
Kernel-bypass gives applications direct access to I/O devices, bypassing the OS kernel on every I/O
Pros and Cons of kernel-bypass
Pros: widely available (GCE, AWS, and Azure all support kernel-bypass, and most I/O devices support it) and effective (e.g., 128x improvement using RDMA)
Cons: hard to use (porting an application is complex and expensive) and limited (used today only by specialized applications, such as scientific computing, not at datacenter scale)
Outline
Introduction
Background
Demikernel overview, API, libOSes, evaluation
Kernel-bypass works similarly to hardware virtualization
The I/O device provides hardware support so that the VM can issue I/O directly
IOMMU translation from guest to actual physical devices
Technologies: SR-IOV
Kernel-bypass
The I/O device lets applications bypass the OS kernel
Applications can directly issue I/O
The IOMMU translates application-level (user-level) virtual addresses to machine addresses
Widely deployed: DPDK, RDMA
Unlike H/W virtualization, the OS kernel does more than multiplex hardware resources
Take networking as an example: kernel-bypass technologies are widely used there
Above: the architecture of a modern system
Kernel bypass moves features like address translation and device multiplexing into the I/O device, but the device cannot support everything
There are no high-level abstractions: no TCP, no sockets, and much else is missing
Gap?
One option: each application builds its own custom messaging layer (networking stack)
But if every application has to build its own, the effort is duplicated and expensive
Re-use OS networking stack and move it up to user space?
Not fast enough for kernel-bypass devices; the traditional OS stack was built for millisecond-scale NICs
mTCP
Explicitly built for kernel bypass
Implements TCP in user space and offers a socket interface similar to POSIX
But it is optimized only for throughput; for latency it can be even slower than going through the Linux kernel
Different devices implement different interfaces and OS services based on hardware capabilities
RDMA NIC: very capable, provides a lot of functionality in hardware
But it depends heavily on the network for congestion control, which is hard to control
DPDK: exposes virtual NICs and nothing else
So a lot of CPU cycles are spent running a full networking stack at user level
Programmable devices
What interface should they expose?
There is no standard kernel-bypass API or architecture
Demikernel project
What is it? A new kernel-bypass OS [architecture]
Design goals
Standardized kernel-bypass API
Flexible architecture for heterogeneous devices (devices implement different OS features, yet applications see a uniform interface)
Single-microsecond I/O processing: I/O is very fast, so the OS cannot add much more overhead
Demikernel supplies a different libos for each device
Single unified interface and architecture
New hardware --> build a new Demikernel libOS to support it
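To make this architecture concrete, here is a minimal sketch (not Demikernel code; the class names NetLibOS, DpdkLibOS, and RdmaLibOS are hypothetical) of the idea that each device gets its own libOS behind one uniform interface, so the same application code keeps working when new hardware only requires a new libOS:

```cpp
// Illustrative sketch: one uniform interface, one libOS per kernel-bypass device.
// These names are hypothetical and do not correspond to Demikernel's real classes.
#include <iostream>
#include <memory>
#include <string>

// The single interface every application programs against.
struct NetLibOS {
    virtual ~NetLibOS() = default;
    virtual void push(const std::string& msg) = 0;   // send a message
};

// One libOS per device; each hides device-specific details behind the interface.
struct DpdkLibOS : NetLibOS {
    void push(const std::string& msg) override {
        std::cout << "[dpdk libos] user-level TCP stack sends: " << msg << "\n";
    }
};
struct RdmaLibOS : NetLibOS {
    void push(const std::string& msg) override {
        std::cout << "[rdma libos] NIC-offloaded transport sends: " << msg << "\n";
    }
};

int main() {
    // Application code is identical regardless of which libOS is linked in.
    std::unique_ptr<NetLibOS> libos = std::make_unique<DpdkLibOS>();
    libos->push("hello");
}
```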
Key features
I/O queue API: with scatter-gather arrays to minimize latency
Queues replace UNIX pipes and sockets
In-memory queue similar to Go channels
Each push and pop should be a complete I/O, so the Demikernel libOS can immediately issue the I/O if possible (see the push/pop/wait sketch below)
qtoken and wait: to block on I/O operations for finer-grained scheduling
Push and pop are asynchronous and return a qtoken for blocking on I/O completion
Wait blocks on one or more I/O operations and returns the result
Native zero-copy from the application heap with use-after-free protection
Critical for latency
Pushed SGA buffers are libOS-owned until the qtoken returns; however, the app can call free at any time because use-after-free protection defers the actual free until the I/O completes
The libOS allocates buffers for incoming I/O and transfers ownership of the SGA buffers to the app on pop; the app is responsible for freeing them
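Below is a minimal, self-contained sketch of the asynchronous push/pop/qtoken/wait pattern described above, using an in-memory queue (which the notes compare to Go channels). All names here (MemQueue, QToken, push, pop, wait) only model the semantics and are not Demikernel's actual API, and payloads are plain strings rather than scatter-gather arrays for brevity:

```cpp
// Toy model of the qtoken/wait completion semantics; not Demikernel's real API.
#include <cstdint>
#include <deque>
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

using QToken = uint64_t;

class MemQueue {
    std::deque<std::string> items_;                   // queued messages
    std::deque<QToken> pending_pops_;                 // pops still waiting for data
    std::unordered_map<QToken, std::string> done_;    // completed operations
    QToken next_ = 1;
public:
    // push is asynchronous: it returns a token immediately; for an in-memory
    // queue the operation also completes immediately.
    QToken push(std::string msg) {
        QToken qt = next_++;
        if (!pending_pops_.empty()) {                 // hand off to a waiting pop
            done_[pending_pops_.front()] = std::move(msg);
            pending_pops_.pop_front();
        } else {
            items_.push_back(std::move(msg));
        }
        done_[qt] = "";                               // push completion carries no data
        return qt;
    }
    // pop is asynchronous: the result only becomes visible via wait().
    QToken pop() {
        QToken qt = next_++;
        if (!items_.empty()) {
            done_[qt] = std::move(items_.front());
            items_.pop_front();
        } else {
            pending_pops_.push_back(qt);
        }
        return qt;
    }
    // wait blocks on a single token and returns its result once complete
    // (here completion is immediate, so "blocking" is just a lookup).
    std::optional<std::string> wait(QToken qt) {
        auto it = done_.find(qt);
        if (it == done_.end()) return std::nullopt;   // not yet complete
        std::string r = std::move(it->second);
        done_.erase(it);
        return r;
    }
};

int main() {
    MemQueue q;
    QToken pt = q.push("hello");      // async push, returns a qtoken
    QToken gt = q.pop();              // async pop, returns a qtoken
    q.wait(pt);                       // block until the push completes
    if (auto msg = q.wait(gt))        // block until the pop completes, get data
        std::cout << "popped: " << *msg << "\n";
}
```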
Design principles
Shared execution contexts for minimizing latency
Demikernel libOSes perform OS tasks (e.g., network processing) on shared application threads
Minimizes latency compared to threads (mTCP, NSDI '14) or processes (SNAP, SOSP '19)
Requires cooperation: the application must regularly enter the libOS (e.g., by calling wait or performing I/O)
Co-routines for lightweight multiplexing of OS tasks
Light-weight scheduling abstractions
Multiplex OS tasks (e.g., packet processing, sending acks, allocating receive buffers) with application execution
Implemented with C++'s built-in coroutines (see the sketch below)
Cooperatively scheduled by a Demikernel scheduler that separates runnable and blocked co-routines
Integrated memory allocator for transparent memory registration and use-after-free protection
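The sketch below illustrates the cooperative co-routine model described above, using C++20's built-in coroutines: an OS task (here a pretend packet processor) runs on the application thread and yields back to a toy scheduler at each step. The Task and Scheduler types are illustrative assumptions, not Demikernel's scheduler.

```cpp
// Compile with -std=c++20. Toy cooperative scheduling of an "OS task" coroutine
// on the application thread; illustrative only.
#include <coroutine>
#include <deque>
#include <exception>
#include <iostream>

// Minimal coroutine handle wrapper: the coroutine starts suspended and is
// resumed explicitly by the scheduler.
struct Task {
    struct promise_type {
        Task get_return_object() {
            return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() { std::terminate(); }
    };
    std::coroutine_handle<promise_type> handle;
};

// Toy cooperative scheduler: keeps runnable coroutines in a queue and resumes
// one at each scheduling point (e.g., whenever the app gives the libOS a turn).
struct Scheduler {
    std::deque<std::coroutine_handle<>> runnable;
    void spawn(Task t) { runnable.push_back(t.handle); }
    void poll() {                                      // run one OS task step
        if (runnable.empty()) return;
        auto h = runnable.front(); runnable.pop_front();
        h.resume();
        if (!h.done()) runnable.push_back(h);          // still has work: requeue
        else h.destroy();
    }
};

// Example OS task: pretend to process packet batches, yielding between batches.
Task packet_processor(int batches) {
    for (int i = 0; i < batches; i++) {
        std::cout << "processed packet batch " << i << "\n";
        co_await std::suspend_always{};                // yield back to the scheduler
    }
}

int main() {
    Scheduler sched;
    sched.spawn(packet_processor(3));
    // Application loop: do app work, then give the libOS a chance to run.
    for (int i = 0; i < 5; i++) {
        std::cout << "application work " << i << "\n";
        sched.poll();   // shared execution context: OS work runs on the app thread
    }
}
```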
Kernel-bypass scheduling
When to do OS work vs running the application
How to prioritize OS work based on the kernel-bypass device
How to scale to hundreds of co-routines
How to make fine-grained scheduling decisions in a few nanoseconds
Demikernel is a low-latency kernel-bypass OS
Demikernel provides a portable API and flexible architecture for heterogenous kernel-bypass devices
There are still many interesting open problems in kernel-bypass OS design
How can the Demikernel interact with resources such as sockets without transitioning into the underlying OS?
RDMA and DPDK are libraries that give user-level code direct access to the hardware NIC; the hardware itself allows safe access to the network device
Is it feasible to implement the POSIX API on top of the Demikernel's queue API? Should one?
It is not hard to implement the POSIX API; the main design decisions are when to push and when to issue the I/O
Great way to slowly move applications onto kernel-bypass
Congestion control?
Not yet; working on it. One idea: porting an extensible congestion-control module written in Rust (from MIT)
The evaluation shows round-trip numbers for Redis; what other applications was it evaluated on?
A ported version of Redis (as a sub-module)
They have not run it against the most recent Redis version
Kernel bypass wouldn't work with containers?
You need a user-level driver to talk to the device
Match the versions
But it is the same OS problem, now visible to your applications
How do you see these primitives extending across the network, in a distributed context, e.g., two Demikernels talking to each other?
The libOS can be customized
Framing: with UDP you receive each message as a packet
With TCP you get a byte stream, so you don't know where a message starts and ends
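A small, self-contained sketch of that framing point: over UDP each message arrives as one packet, but over TCP the application (or libOS) must impose message boundaries itself, for example with a length prefix. This is a generic illustration (the frame/unframe names are hypothetical), not Demikernel code:

```cpp
// Length-prefix framing over a byte stream, as an application would do on TCP.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Prepend a 4-byte big-endian length to the payload.
std::vector<uint8_t> frame(const std::string& msg) {
    std::vector<uint8_t> out(4 + msg.size());
    uint32_t len = static_cast<uint32_t>(msg.size());
    out[0] = len >> 24; out[1] = len >> 16; out[2] = len >> 8; out[3] = len;
    std::memcpy(out.data() + 4, msg.data(), msg.size());
    return out;
}

// Try to extract one complete message from the front of a receive buffer that
// may hold a partial message or several concatenated ones (as TCP delivers them).
std::optional<std::string> unframe(std::vector<uint8_t>& buf) {
    if (buf.size() < 4) return std::nullopt;                      // length not yet received
    uint32_t len = (uint32_t(buf[0]) << 24) | (uint32_t(buf[1]) << 16) |
                   (uint32_t(buf[2]) << 8)  |  uint32_t(buf[3]);
    if (buf.size() < 4 + len) return std::nullopt;                // body not yet complete
    std::string msg(buf.begin() + 4, buf.begin() + 4 + len);
    buf.erase(buf.begin(), buf.begin() + 4 + len);
    return msg;
}

int main() {
    std::vector<uint8_t> stream;
    auto a = frame("hello"), b = frame("world");
    stream.insert(stream.end(), a.begin(), a.end());
    stream.insert(stream.end(), b.begin(), b.end());              // two messages, one stream
    while (auto m = unframe(stream)) std::cout << *m << "\n";     // hello, world
}
```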
What happens as the payloads increase in size?
Latency goes up a little bit? Throughput goes up
Nothing unexpected
RDMA: messages up to 1 GB, with segmentation offloaded onto the NIC