Floem: A programming system for NIC-accelerated network applications
https://www.usenix.org/conference/osdi18/presentation/phothilimthana
Challenge: accelerating server applications by offloading computation to a programmable network card (NIC)
Solution: a language, compiler, and runtime that make it easier for developers to explore different NIC offloading designs
Recently: CPU can no longer keep up with the network
Offload computation to a programmable network card that sits between the CPU and the network
Traditionally: NICs could not be programmed
Now: programmable NICs come in several forms
Wimpy multi-core processor
Cavium LiquidIO
Netronome Agilio
Mellanox BlueField
Field-programmable gate array (FPGA)
Microsoft Catapult
NetFPGA
Reconfigurable match table (RMT)
NIC offload
Offload computation
Fast path processing: filtering, classifying, caching, etc.
Transformation: encryption / decryption, compression, etc.
Steering
Congestion control
Offloading computation to a NIC requires a large amount of effort
Why? The developer has to deal with a distributed, heterogeneous system
CPU and NIC have no cache coherence
NIC can access CPU memory via PCIe
Hard to manage and optimize for
Heterogeneity in the system
NIC: slower, lower-power cores; CPU: faster, higher-power cores; different instruction sets and memory hierarchies
Space of offload designs
Example: key-value store
Each CPU core handles a distinct set of keys
NIC to steer the packet to the right CPU core: key-based steering
Can increase throughput
Using NIC as cache
3x power efficiency
Requires enough memory on the NIC
No one-size-fits-all offload; it is non-trivial to predict which offload is best
Depends on the system and on performance objectives
Challenge: packet marshaling (defining which fields to send, then copying those fields)
Tedious & error-prone; hinders exploration of different offload designs
Challenge: communication strategies
No steering, key-based steering, separate GET & SET
Exploring different offload designs requires a huge amount of effort!
Floem
DSL makes it easy to explore alternative offloads
Compiler minimizes communication and generates efficient code
Runtime manages data transfer over PCIe
Language overview
Data-flow programming model
Extended to support heterogeneity and parallelism
Contributions
Goal: explore offload designs
Inferred data transfer
Logical-to-physical queue mapping
Caching construct
Goal: Integration with existing app
Interface to external programs
Compiler & Runtime
Example
Element class: allows users to embed C code that defines the computation of that element (see the sketch after this list)
One thread processes one packet at a time
Data parallelism
Annotate the segment with multiple cores
Pipeline parallelism
Multiple segments, insert queues
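A minimal sketch of an element, following the style of the paper's examples (Floem is a DSL embedded in Python; `Element`, the `Input`/`Output` ports, `run_c`, and the `output switch` C extension appear in the paper, while `kvs_message` and the opcode names are illustrative assumptions):

```python
from floem import *  # Floem programs are Python; elements embed C

# An element: ports declared in configure(), per-packet computation
# given as embedded C in impl(). kvs_message / CMD_* are assumed
# application-level definitions, not part of Floem itself.
class Classify(Element):
    def configure(self):
        self.inp = Input(Pointer(kvs_message))   # input port
        self.get = Output(Pointer(kvs_message))  # GET requests
        self.set = Output(Pointer(kvs_message))  # SET requests

    def impl(self):
        self.run_c(r'''
            kvs_message *m = inp();
            output switch {
                case (m->opcode == CMD_GET): get(m);
                case (m->opcode == CMD_SET): set(m);
            };
        ''')
```

Annotating a segment of such elements with several cores replicates the whole segment per core (data parallelism); splitting the program into multiple segments connected by queues pipelines it (pipeline parallelism).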
Offload: assign each segment to a device (CPU or NIC)
First: inferred data transfer
Solution: infer fields to send
Per-packet state: a packet and its metadata can be accessed anywhere in the program
Compiler infers which fields of packet and metadata to send
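A sketch of per-packet state and what the compiler infers from it, assuming the paper's `State`/`Field` declaration style (the specific fields are illustrative):

```python
# Per-packet state: declared once, then accessed as `state.<field>`
# from the embedded C of any element, on either device.
class MyState(State):
    hash = Field(Uint(32))          # computed on the NIC
    key  = Field(Pointer(Uint(8)))  # points into the packet

# If a NIC-side element writes state.hash and the CPU-side elements
# read only state.hash, the compiler infers that only that field must
# be sent over PCIe -- no hand-written marshaling code.
```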
Second: logical-to-physical queue mapping
Observation: different communication strategies can be expressed by mapping logical queues to physical queues
Degrees of resource sharing
Dynamic packet steering
Packet ordering
Solution: queue construct with explicit logical-to-physical queue mapping
E.g., Queue(channels=2, instances=3)
Logical queues, physical queues
Example
No steering: queue construct with one physical instance
Key-based steering: multiple physical queue instances; specify which CPU core pulls data from which instance, and steer by assigning a queue id to the per-packet state
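The two strategies side by side, as a sketch (the `Queue` construct with `channels`/`instances` is from the talk; the `state.qid` field name is an assumption):

```python
# No steering: one physical queue instance shared by all CPU cores.
q_shared = Queue(channels=2, instances=1)

# Key-based steering: one physical instance per core. A NIC-side
# element chooses the instance by writing the queue id into the
# per-packet state, e.g. in its embedded C:
#     state.qid = state.hash % 3;
# Each core dequeues only from its own instance, so all requests for
# a given key land on the same core.
q_steered = Queue(channels=2, instances=3)
```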
Third: caching construct
Difficult to implement a complete cache protocol
Maintain consistency of data on NIC and CPU
High performance
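A hypothetical sketch of the construct (the paper's compiler generates the NIC-CPU cache protocol from such a declaration; the parameter name below is an assumption, not the exact Floem API):

```python
# Declaring a cache: the compiler generates the protocol that keeps
# the NIC-side copy consistent with the CPU-side store. GET hits are
# answered on the NIC; misses and SETs flow through to the CPU.
# `write_policy` is a hypothetical knob; the case study later switches
# from write-through to write-back with a similarly small change.
cache = Cache(write_policy='write-through')  # or 'write-back'
```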
Runtime & Communication
Runtime: responsible for managing communication over PCIe
Hard part: managing data synchronization between the NIC and CPU
Many optimizations are needed to achieve high-throughput data transfer
I/O batching
Overlapping DMA operations with useful computation
Decouple the queue logic (in queue library) from the data sync and the DMA optimizations
Less compact, but more modular
Can swap out the queue library while reusing the queue sync layer (see the sketch below)
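A rough interface sketch of that decoupling (all names are illustrative assumptions; the actual runtime is C code on the NIC and host):

```python
class QueueSyncLayer:
    """Owns PCIe data movement: batches several queue entries per DMA
    (I/O batching) and issues DMAs asynchronously so transfers overlap
    with element computation."""
    def fetch(self, queue_id):
        ...  # schedule a batched DMA read of remote queue entries
    def flush(self, queue_id):
        ...  # schedule a batched DMA write of local queue entries

class QueueLibrary:
    """Enqueue/dequeue logic and buffer layout only; all NIC-CPU
    synchronization is delegated to the sync layer, so either side
    can be swapped out independently."""
    def __init__(self, sync):
        self.sync = sync
```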
Evaluation
Does Floem help programmers explore different offload designs?
Setup: servers with and without a smart NIC
Case study: key-value store
Code relevant to communication
Floem program: 15 lines --> C program: 240 lines
Add cache in Floem (write-through cache --> write-back cache)
Other: distributed real-time data analytics
First offload: worse than CPU-only
Second offload: 96% improvement with 23 lines of code
Take-away: high-level programming abstractions
Control implementation strategies
Avoid low-level details
Result: minimal changes to explore different designs
Questions
Storage limitation of the NIC?
Cavium has fairly large memory (4 GB), so the workloads fit nicely
If an offload does not fit, it might not be the right strategy
Floem still lets you try different offloading designs
Can data go back and forth between CPU and NIC multiple times?
The language lets you use as many queues as you want
Whether that is the best strategy is another question; it is usually best to minimize communication
Congestion control beyond single-packet
TCP stack offload
One option: implement the entire stack on the NIC, but that consumes a lot of resources, leaving little for offloading other applications
Better: offload only some parts (fast-path processing), handling the flow state carefully (shared between NIC and CPU)