Floem: A programming system for NIC-accelerated network applications

https://www.usenix.org/conference/osdi18/presentation/phothilimthana

  • Challenge: accelerating server applications by offloading computation to a programmable network card

  • Solution: a language, compiler, and runtime that make it easier for developers to explore different NIC offloading designs

  • Recently: CPU can no longer keep up with the network

    • Offload the computation to a programmable network card that sits between the CPU and the network

    • Traditionally: NICs could not be programmed

    • Now: programmable NICs

      • Wimpy multi-core processor

        • Cavium LiquidIO

        • Netronome Agilio

        • Mellanox BlueField

      • Field-programmable gate array (FPGA)

        • Microsoft Catapult

        • NetFPGA

      • Reconfigurable match table (RMT)

    • NIC offload

      • Offload computation

        • Fast path processing: filtering, classifying, caching, etc.

        • Transformation: encryption/decryption, compression, etc.

        • Steering

        • Congestion control

    • Offloading computation to a NIC requires a large amount of effort

      • Why? Have to deal with a distributed, heterogeneous system

        • CPU and NIC have no cache coherence

        • NIC can access CPU memory via PCIe

          • Hard to manage and optimize for

        • Heterogeneity in the system

          • NIC: slower cores, lower power; CPU: faster cores, higher power; different instruction sets and memory hierarchies

      • Space of offload designs

        • Example: key-value store

        • Each CPU core handles one distinct set of keys

          • Use the NIC to steer each packet to the right CPU core: key-based steering

            • Can increase throughput

          • Using the NIC as a cache

            • 3x power efficiency

            • Requires enough memory on the NIC

      • No one-size-fits-all offload; non-trivial to predict which offload is best

        • Depends on the system and performance objectives

        • Challenge: packet marshaling (define what fields to send, copy those fields)

          • Tedious & error-prone; hinders exploration of different offload designs

        • Challenge: communication strategies

          • No steering, key-based steering, separate GET & SET

      • Exploring different offload designs requires a huge amount of effort!

    • Floem

      • DSL makes it easy to explore alternative offloads

      • Compiler minimizes communication and generates efficient code

      • Runtime manages data transfer over PCIe

    • Language overview

      • Data-flow programming model

        • Extended to support heterogeneity and parallelism

        • Contributions

          • Goal: explore offload designs

            • Inferred data transfer

            • Logical-to-physical queue mapping

            • Caching construct

          • Goal: Integration with existing app

            • Interface to external programs

      • Compiler & Runtime

      • Example (see the sketch after this section)

          • Element class: allows users to embed C code defining the computation of that element

          • One thread processes one packet at a time

        • Data parallelism

          • Annotate the segment with multiple cores

        • Pipeline parallelism

          • Multiple segments, insert queues

        • Offload: annotate a segment with the device that runs it
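
To make the example above concrete, here is a minimal sketch of the element/segment dataflow style the talk describes. Floem itself is a DSL embedded in Python with C code inside elements; this standalone sketch substitutes Python callables for the embedded C, and the class names (Element, Segment) and parameters (device, cores) are illustrative stand-ins rather than the actual Floem API.

```python
# Hypothetical sketch of the dataflow style described above -- NOT the real Floem API.
# Python callables stand in for the embedded C code so the example runs on its own.

class Element:
    """One node in the dataflow graph: takes a packet, returns a (possibly new) packet."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn  # stands in for the C code a real element would embed

    def process(self, pkt):
        return self.fn(pkt)

class Segment:
    """A chain of elements; one thread pushes one packet through the whole chain."""
    def __init__(self, elements, device="CPU", cores=1):
        self.elements = elements
        self.device = device  # offload: which device runs this segment (CPU or NIC)
        self.cores = cores    # data parallelism: replicate the segment across N cores

    def run(self, pkt):
        for e in self.elements:
            pkt = e.process(pkt)
        return pkt

# Pipeline parallelism: two segments (NIC-side classification, CPU-side lookup)
# that would be connected by a queue in the real system.
store = {"abcd": 42}
classify = Segment([Element("parse", lambda p: {**p, "key": p["payload"][:4]})],
                   device="NIC", cores=2)
serve = Segment([Element("lookup", lambda p: {**p, "value": store.get(p["key"])})],
                device="CPU", cores=4)

pkt = {"payload": "abcd0123"}
print(serve.run(classify.run(pkt)))  # {'payload': 'abcd0123', 'key': 'abcd', 'value': 42}
```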

    • First: inferred data transfer

      • Solution: infer fields to send

      • Per-packet state: a packet and its metadata can be accessed anywhere in the program

      • Compiler infers which fields of the packet and metadata to send (sketch below)
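
A rough illustration of the inference, assuming each CPU-side element's field accesses are known (Floem derives them from the embedded C code): the fields that must cross the NIC-to-CPU queue are simply those referenced downstream of it. The field and element names below are invented for the example.

```python
# Hedged sketch of inferred data transfer: only per-packet-state fields referenced by
# elements downstream of the queue are marshaled across PCIe.

PER_PACKET_STATE = {"pkt", "hash", "key", "keylen", "qid", "payload"}

# CPU-side elements and the state fields each one touches (illustrative).
cpu_side_elements = [
    {"name": "hash_lookup", "uses": {"hash", "key", "keylen"}},
    {"name": "build_reply", "uses": {"key", "payload"}},
]

def fields_to_send(elements):
    """Fields that must be copied over the NIC->CPU queue: everything used after it."""
    needed = set()
    for e in elements:
        needed |= e["uses"]
    return needed & PER_PACKET_STATE

print(sorted(fields_to_send(cpu_side_elements)))
# ['hash', 'key', 'keylen', 'payload'] -- 'pkt' and 'qid' never cross PCIe
```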

    • Second: logical-to-physical queue mapping

      • Observation: different communication strategies can be expressed by mapping logical queues to physical queues

        • Degrees of resource sharing

        • Dynamic packet steering

        • Packet ordering

      • Solution: queue construct with explicit logical-to-physical queue mapping

        • E.g., Queue(channels=2, instances=3)

          • channels: logical queues; instances: physical queues

      • Example

        • No steering: queue construct with one physical instance

        • Key-based steering: multiple physical queue instances; specify which CPU cores read from which queue instance, and steer by assigning a queue ID in the per-packet state (sketch below)
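
The sketch below imitates the queue construct named above to show how the logical-to-physical mapping changes the communication strategy: one shared physical instance (no steering) versus one instance per CPU core with the queue ID derived from the key (key-based steering). The Queue class is a toy model, not Floem's implementation.

```python
# Toy model of the queue construct: `channels` ~ logical queues, `instances` ~ physical
# queues. Purely to show how the mapping changes packet steering.
from collections import defaultdict

class Queue:
    def __init__(self, channels=1, instances=1):
        self.channels, self.instances = channels, instances
        self.buf = defaultdict(list)  # (channel, instance) -> list of packets

    def enqueue(self, pkt, channel=0, qid=0):
        # qid comes from the per-packet state; it selects the physical instance
        self.buf[(channel, qid % self.instances)].append(pkt)

# No steering: a single physical instance that all CPU cores pull from.
shared = Queue(channels=1, instances=1)
for key in ["apple", "banana", "cherry"]:
    shared.enqueue({"key": key})
print(len(shared.buf[(0, 0)]))  # 3 -- everything lands in the one shared queue

# Key-based steering: one physical instance per CPU core; the NIC computes qid
# from the key so that each core sees a distinct key partition.
steered = Queue(channels=1, instances=3)
for key in ["apple", "banana", "cherry"]:
    steered.enqueue({"key": key}, qid=hash(key))
for core in range(3):
    print("core", core, "->", [p["key"] for p in steered.buf[(0, core)]])
```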

    • Third: caching construct

      • Difficult to implement a complete cache protocol by hand (sketch below)

        • Maintain consistency of data on NIC and CPU

        • High performance
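
To show what the caching construct takes off the programmer's hands, here is a simplified stand-in: a small NIC-side cache kept consistent with the CPU-side store under a write-through or write-back policy. Eviction and consistency here are deliberately naive; Floem's actual cache protocol over PCIe is far more involved.

```python
# Hedged sketch of a NIC-side cache in front of a CPU-side store. Policy names mirror
# the write-through -> write-back change mentioned in the evaluation; everything else
# (eviction, sizing) is simplified and not Floem's protocol.

class NICCache:
    def __init__(self, backing_store, write_back=False, capacity=2):
        self.cache, self.store = {}, backing_store
        self.write_back, self.capacity = write_back, capacity
        self.dirty = set()

    def get(self, key):
        # Hit is served on the NIC; a miss fetches from the CPU-side store.
        if key not in self.cache:
            self._insert(key, self.store.get(key))
        return self.cache[key]

    def set(self, key, value):
        self._insert(key, value)
        if self.write_back:
            self.dirty.add(key)          # flushed to the CPU store only on eviction
        else:
            self.store[key] = value      # write-through: CPU store updated immediately

    def _insert(self, key, value):
        if key not in self.cache and len(self.cache) >= self.capacity:
            victim = next(iter(self.cache))          # evict oldest-inserted entry
            if victim in self.dirty:                 # write-back: flush dirty data
                self.store[victim] = self.cache[victim]
                self.dirty.discard(victim)
            del self.cache[victim]
        self.cache[key] = value

store = {"a": 1}
cache = NICCache(store, write_back=True)
cache.set("b", 2)
print(cache.get("a"), store)  # 1 {'a': 1} -- 'b' stays dirty on the NIC until evicted
```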

  • Runtime & Communication

    • Runtime: responsible for managing the communication over PCIe

    • Hard thing: manage data synchronization between the NIC and CPU

      • For performance, need a lot of optimizations in order to achieve high-throughput data transfer

        • I/O batching

        • Overlapping DMA operations with useful computation

    • Decouple the queue logic (in the queue library) from data synchronization and DMA optimizations (sketch below)

      • Less compact and more modular

      • Can swap out the queue library but reuse the queue sync layer
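
A sketch of that decoupling, assuming a simple ring-buffer queue library and a separate sync layer that batches transfers; the "DMA" is simulated with plain copies, and the overlap of DMA with computation is not modeled. All names are hypothetical.

```python
# Illustrative split between queue logic and data synchronization: the ring buffer
# knows nothing about PCIe, while the sync layer owns batching and (simulated) DMA.

class RingQueue:
    """Queue library: plain ring-buffer enqueue/dequeue."""
    def __init__(self, size=8):
        self.slots, self.head, self.tail = [None] * size, 0, 0

    def enqueue(self, item):
        self.slots[self.tail % len(self.slots)] = item
        self.tail += 1

    def dequeue(self):
        if self.head == self.tail:
            return None
        item = self.slots[self.head % len(self.slots)]
        self.head += 1
        return item

class SyncLayer:
    """Owns data movement: drains up to `batch` entries from the NIC-side queue and
    pushes them to the CPU-side queue in one simulated DMA, amortizing transfer cost."""
    def __init__(self, nic_q, cpu_q, batch=4):
        self.nic_q, self.cpu_q, self.batch = nic_q, cpu_q, batch

    def flush(self):
        staged = []
        while len(staged) < self.batch:
            item = self.nic_q.dequeue()
            if item is None:
                break
            staged.append(item)
        for item in staged:       # one batched "DMA" instead of a copy per element
            self.cpu_q.enqueue(item)
        return len(staged)

nic_q, cpu_q = RingQueue(), RingQueue()
for i in range(5):
    nic_q.enqueue(i)
print(SyncLayer(nic_q, cpu_q).flush())      # 4 -- one batch moved
print([cpu_q.dequeue() for _ in range(4)])  # [0, 1, 2, 3]
```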

  • Evaluation

    • Does Floem help programmers explore different offload designs?

    • Server: with smart NIC, without smart NIC

    • Case study: key-value store

    • Code relevant to communication

      • Floem program: 15 lines --> C program: 240 lines

      • Add cache in Floem (write-through cache --> write-back cache)

    • Other: distributed real-time data analytics

      • First offload: worse than CPU-only

      • Second offload: 96% improvement with 23 lines of code

    • Take-away: high-level programming abstractions

      • Control implementation strategies

      • Avoid low-level details

    • Result: minimal changes to explore different designs

  • Questions

    • Storage limitation of the NIC

      • Cavium: pretty big memory (4 GB); the workload fits nicely in it

      • Might not be the right strategy

      • Allow you to try different offloading technologies

    • Multiple rounds of communication? Going back and forth multiple times

      • Can we have multiple communications?

        • The language allows you to use as many queues as you want

      • Whether that is the best strategy to use is a separate question

        • Best to minimize the communication

    • Congestion control beyond single-packet

      • TCP stack offload

      • One option: implement the entire stack on the NIC, but that uses a lot of resources, leaving only a small amount left to offload other applications

      • Better approach: offload some parts (fast-path processing), but handle the flow state carefully (shared between the NIC and CPU)
