Floem: A programming system for NIC-accelerated network applications

https://www.usenix.org/conference/osdi18/presentation/phothilimthana

  • Challenge: accelerating server applications by offloading computation to a programmable network card

  • Solution: a language, compiler, and runtime that make it easier for developers to explore different NIC offloading designs

  • Recently: CPU can no longer keep up with the network

    • Offload the computation to a programmable network card that sits between the CPU and the network

    • Traditionally: NICs could not be programmed

    • Now: programmable NICs

      • Wimpy multi-core processor

        • Cavium LiquidIO

        • Netronome Agilio

        • Mellanox BlueField

      • Field-programmable gate array (FPGA)

        • Microsoft Catapult

        • NetFPGA

      • Reconfigurable match table (RMT)

    • NIC offload

      • Offload computation

        • Fast path processing: filtering, classifying, caching, etc.

        • Transformation: encryption/decryption, compression, etc.

        • Steering

        • Congestion control

    • Offloading computation to a NIC requires a large amount of effort

      • Why? Have to deal with a distributed, heterogeneous system

        • CPU and NIC have no cache coherence

        • NIC can access CPU memory via PCIe

          • Hard to manage and optimize for

        • Heterogeneity in the system

          • NIC: slower cores, lower power; CPU: faster cores, higher power; different instruction sets and memory hierarchies

      • Space of offload designs

        • Example: key-value store

        • Each CPU core handles one distinct set of keys

          • Use the NIC to steer each packet to the right CPU core: key-based steering

            • Can increase throughput

          • Using the NIC as a cache

            • 3x power efficiency

            • Requires enough memory on the NIC

      • No one-size-fits-all offload; non-trivial to predict which offload is best

        • Depends on the system and performance objectives

        • Challenge: packet marshaling (define what fields to send, copy those fields)

          • Tedious & error-prone; hinders exploration of different offload designs

        • Challenge: communication strategies

          • No steering, key-based steering, separate GET & SET

      • Exploring different offload designs requires a huge amount of effort!

    • Floem

      • DSL makes it easy to explore alternative offloads

      • Compiler minimizes communication and generates efficient code

      • Runtime manages data transfer over PCIe

    • Language overview

      • Data-flow programming model

        • Extended to support heterogeneity and parallelism

        • Contributions

          • Goal: explore offload designs

            • Inferred data transfer

            • Logical-to-physical queue mapping

            • Caching construct

          • Goal: Integration with existing app

            • Interface to external programs

      • Compiler & Runtime

      • Example (see the sketch after this section)

          • Element class: allows users to embed C code defining the computation of that element

          • One thread processes one packet at a time

        • Data parallelism

          • Annotate the segment with multiple cores

        • Pipeline parallelism

          • Multiple segments, insert queues

        • Offload: annotate a segment with the device that runs it
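
To make the example above concrete, here is a minimal sketch of the element/segment dataflow style the talk describes. Floem itself is a DSL embedded in Python with C code inside elements; this standalone sketch substitutes Python callables for the embedded C, and the class names (Element, Segment) and parameters (device, cores) are illustrative stand-ins rather than the actual Floem API.

```python
# Hypothetical sketch of the dataflow style described above -- NOT the real Floem API.
# Python callables stand in for the embedded C code so the example runs on its own.

class Element:
    """One node in the dataflow graph: takes a packet, returns a (possibly new) packet."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn  # stands in for the C code a real element would embed

    def process(self, pkt):
        return self.fn(pkt)

class Segment:
    """A chain of elements; one thread pushes one packet through the whole chain."""
    def __init__(self, elements, device="CPU", cores=1):
        self.elements = elements
        self.device = device  # offload: which device runs this segment (CPU or NIC)
        self.cores = cores    # data parallelism: replicate the segment across N cores

    def run(self, pkt):
        for e in self.elements:
            pkt = e.process(pkt)
        return pkt

# Pipeline parallelism: two segments (NIC-side classification, CPU-side lookup)
# that would be connected by a queue in the real system.
store = {"abcd": 42}
classify = Segment([Element("parse", lambda p: {**p, "key": p["payload"][:4]})],
                   device="NIC", cores=2)
serve = Segment([Element("lookup", lambda p: {**p, "value": store.get(p["key"])})],
                device="CPU", cores=4)

pkt = {"payload": "abcd0123"}
print(serve.run(classify.run(pkt)))  # {'payload': 'abcd0123', 'key': 'abcd', 'value': 42}
```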

    • First: inferred data transfer

      • Solution: infer fields to send

      • Per-packet state: a packet and its metadata can be accessed anywhere in the program

      • Compiler infers which fields of the packet and metadata to send (sketch below)
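
A rough illustration of the inference, assuming each CPU-side element's field accesses are known (Floem derives them from the embedded C code): the fields that must cross the NIC-to-CPU queue are simply those referenced downstream of it. The field and element names below are invented for the example.

```python
# Hedged sketch of inferred data transfer: only per-packet-state fields referenced by
# elements downstream of the queue are marshaled across PCIe.

PER_PACKET_STATE = {"pkt", "hash", "key", "keylen", "qid", "payload"}

# CPU-side elements and the state fields each one touches (illustrative).
cpu_side_elements = [
    {"name": "hash_lookup", "uses": {"hash", "key", "keylen"}},
    {"name": "build_reply", "uses": {"key", "payload"}},
]

def fields_to_send(elements):
    """Fields that must be copied over the NIC->CPU queue: everything used after it."""
    needed = set()
    for e in elements:
        needed |= e["uses"]
    return needed & PER_PACKET_STATE

print(sorted(fields_to_send(cpu_side_elements)))
# ['hash', 'key', 'keylen', 'payload'] -- 'pkt' and 'qid' never cross PCIe
```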

    • Second: logical-to-physical queue mapping

      • Observation: different communication strategies can be expressed by mapping logical queues to physical queues

        • Degrees of resource sharing

        • Dynamic packet steering

        • Packet ordering

      • Solution: queue construct with explicit logical-to-physical queue mapping

        • E.g., Queue(channels=2, instances=3)

          • channels: logical queues; instances: physical queues

      • Example

        • No steering: queue construct with one physical instance

        • Key-based steering: multiple physical queue instances; specify which CPU cores read from which queue instance, and steer by assigning a queue ID in the per-packet state (sketch below)
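
The sketch below imitates the queue construct named above to show how the logical-to-physical mapping changes the communication strategy: one shared physical instance (no steering) versus one instance per CPU core with the queue ID derived from the key (key-based steering). The Queue class is a toy model, not Floem's implementation.

```python
# Toy model of the queue construct: `channels` ~ logical queues, `instances` ~ physical
# queues. Purely to show how the mapping changes packet steering.
from collections import defaultdict

class Queue:
    def __init__(self, channels=1, instances=1):
        self.channels, self.instances = channels, instances
        self.buf = defaultdict(list)  # (channel, instance) -> list of packets

    def enqueue(self, pkt, channel=0, qid=0):
        # qid comes from the per-packet state; it selects the physical instance
        self.buf[(channel, qid % self.instances)].append(pkt)

# No steering: a single physical instance that all CPU cores pull from.
shared = Queue(channels=1, instances=1)
for key in ["apple", "banana", "cherry"]:
    shared.enqueue({"key": key})
print(len(shared.buf[(0, 0)]))  # 3 -- everything lands in the one shared queue

# Key-based steering: one physical instance per CPU core; the NIC computes qid
# from the key so that each core sees a distinct key partition.
steered = Queue(channels=1, instances=3)
for key in ["apple", "banana", "cherry"]:
    steered.enqueue({"key": key}, qid=hash(key))
for core in range(3):
    print("core", core, "->", [p["key"] for p in steered.buf[(0, core)]])
```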

    • Third: caching construct

      • Difficult to implement a complete cache protocol by hand (sketch below)

        • Maintain consistency of data on NIC and CPU

        • High performance
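
To show what the caching construct takes off the programmer's hands, here is a simplified stand-in: a small NIC-side cache kept consistent with the CPU-side store under a write-through or write-back policy. Eviction and consistency here are deliberately naive; Floem's actual cache protocol over PCIe is far more involved.

```python
# Hedged sketch of a NIC-side cache in front of a CPU-side store. Policy names mirror
# the write-through -> write-back change mentioned in the evaluation; everything else
# (eviction, sizing) is simplified and not Floem's protocol.

class NICCache:
    def __init__(self, backing_store, write_back=False, capacity=2):
        self.cache, self.store = {}, backing_store
        self.write_back, self.capacity = write_back, capacity
        self.dirty = set()

    def get(self, key):
        # Hit is served on the NIC; a miss fetches from the CPU-side store.
        if key not in self.cache:
            self._insert(key, self.store.get(key))
        return self.cache[key]

    def set(self, key, value):
        self._insert(key, value)
        if self.write_back:
            self.dirty.add(key)          # flushed to the CPU store only on eviction
        else:
            self.store[key] = value      # write-through: CPU store updated immediately

    def _insert(self, key, value):
        if key not in self.cache and len(self.cache) >= self.capacity:
            victim = next(iter(self.cache))          # evict oldest-inserted entry
            if victim in self.dirty:                 # write-back: flush dirty data
                self.store[victim] = self.cache[victim]
                self.dirty.discard(victim)
            del self.cache[victim]
        self.cache[key] = value

store = {"a": 1}
cache = NICCache(store, write_back=True)
cache.set("b", 2)
print(cache.get("a"), store)  # 1 {'a': 1} -- 'b' stays dirty on the NIC until evicted
```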

  • Runtime & Communication

    • Runtime: responsible for managing the communication over PCIe

    • Hard thing: manage data synchronization between the NIC and CPU

      • For performance, need a lot of optimizations in order to achieve high-throughput data transfer

        • I/O batching

        • Overlapping DMA operations with useful computation

    • Decouple the queue logic (in the queue library) from data synchronization and DMA optimizations (sketch below)

      • Less compact and more modular

      • Can swap out the queue library but reuse the queue sync layer
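
A sketch of that decoupling, assuming a simple ring-buffer queue library and a separate sync layer that batches transfers; the "DMA" is simulated with plain copies, and the overlap of DMA with computation is not modeled. All names are hypothetical.

```python
# Illustrative split between queue logic and data synchronization: the ring buffer
# knows nothing about PCIe, while the sync layer owns batching and (simulated) DMA.

class RingQueue:
    """Queue library: plain ring-buffer enqueue/dequeue."""
    def __init__(self, size=8):
        self.slots, self.head, self.tail = [None] * size, 0, 0

    def enqueue(self, item):
        self.slots[self.tail % len(self.slots)] = item
        self.tail += 1

    def dequeue(self):
        if self.head == self.tail:
            return None
        item = self.slots[self.head % len(self.slots)]
        self.head += 1
        return item

class SyncLayer:
    """Owns data movement: drains up to `batch` entries from the NIC-side queue and
    pushes them to the CPU-side queue in one simulated DMA, amortizing transfer cost."""
    def __init__(self, nic_q, cpu_q, batch=4):
        self.nic_q, self.cpu_q, self.batch = nic_q, cpu_q, batch

    def flush(self):
        staged = []
        while len(staged) < self.batch:
            item = self.nic_q.dequeue()
            if item is None:
                break
            staged.append(item)
        for item in staged:       # one batched "DMA" instead of a copy per element
            self.cpu_q.enqueue(item)
        return len(staged)

nic_q, cpu_q = RingQueue(), RingQueue()
for i in range(5):
    nic_q.enqueue(i)
print(SyncLayer(nic_q, cpu_q).flush())      # 4 -- one batch moved
print([cpu_q.dequeue() for _ in range(4)])  # [0, 1, 2, 3]
```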

  • Evaluation

    • Does Floem help programmers explore different offload designs?

    • Server: with smart NIC, without smart NIC

    • Case study: key-value store

    • Code relevant to communication

      • Floem program: 15 lines --> C program: 240 lines

      • Add cache in Floem (write-through cache --> write-back cache)

    • Other: distributed real-time data analytics

      • First offload: worse than CPU-only

      • Second offload: 96% improvement with 23 lines of code

    • Take-away: high-level programming abstractions

      • Control implementation strategies

      • Avoid low-level details

    • Result: minimal changes to explore different designs

  • Questions

    • Storage limitation of the NIC

      • Cavium: pretty big memory (4 GB); the workload fits nicely in it

      • Might not be the right strategy

      • Allow you to try different offloading technologies

    • Multiple rounds of communication? Going back and forth multiple times

      • Can we have multiple communications?

        • The language allows you to use as many queues as you want

      • Whether that is the best strategy to use is a separate question

        • Best to minimize the communication

    • Congestion control beyond single-packet

      • TCP stack offload

      • One option: implement the entire stack on the NIC, but that uses a lot of resources, leaving only a small amount left to offload other applications

      • Better approach: offload some parts (fast-path processing), but handle the flow state carefully (shared between the NIC and CPU)
