# Floem: A programming system for NIC-accelerated network applications

* Challenge: accelerating server applications by offloading computation to a programmable network card (NIC)
* Solution: a language, compiler, and runtime that make it easier for developers to explore different NIC offloading designs
* Recently: CPUs can no longer keep up with the network
  * Offload computation to a programmable network card that sits between the CPU and the network
  * Traditionally: NICs could not be programmed
  * Now: programmable NICs come in several forms
    * Wimpy multi-core processor
      * Cavium LiquidIO
      * Netronome Agilio
      * Mellanox BlueField
    * Field-programmable gate array (FPGA)
      * Microsoft Catapult
      * NetFPGA
    * Reconfigurable match table (RMT)
  * NIC offload
    * Offload computation
      * Fast-path processing: filtering, classifying, caching, etc.
      * Transformation: encryption/decryption, compression, etc.
      * Steering
      * Congestion control
  * Offloading computation to a NIC requires a large amount of effort
    * Why? Developers have to deal with a distributed, heterogeneous system
      * CPU and NIC have no cache coherence
      * NIC can access CPU memory via PCIe
        * Hard to manage and optimize for
      * Heterogeneity in the system
        * NIC: slower cores, lower power; CPU: faster cores, higher power; the two also differ in instruction set and memory hierarchy
    * Space of offload designs
      * Example: key-value store
      * CPU cores, each handling one distinct set of keys
        * Use the NIC to steer each packet to the right CPU core: **key-based steering**
          * Can increase throughput
        * **Using the NIC as a cache**
          * 3x power efficiency
          * Requires enough memory on the NIC
    * No one-size-fits-all offload; it is non-trivial to predict which offload is best
      * Depends on the system and the performance objectives
      * Challenge: packet marshaling (define which fields to send, then copy those fields)
        * Tedious and error-prone; hinders exploration of different offload designs
      * Challenge: communication strategies
        * No steering, key-based steering, separate GET & SET
    * Exploring different offload designs requires a huge amount of effort!
  * Floem
    * **DSL** makes it easy to explore alternative offloads
    * **Compiler** minimizes communication and generates efficient code
    * **Runtime** manages data transfer over PCIe
  * Language overview
    * Data-flow programming model
      * Extended to support heterogeneity and parallelism
      * Contributions
        * **Goal: explore offload designs**
          * **Inferred data transfer**
          * **Logical-to-physical queue mapping**
          * **Caching construct**
        * Goal: integration with existing applications
          * Interface to external programs
    * Compiler & Runtime
      * ![](/files/UAG8l5ioWJwaVBL6g4iZ)
    * Example
      * ![](/files/U2XgrIrURdWBA310xFzK)
        * Element class: lets users embed C code to define the computation of that element
      * ![](/files/MHWnPVkaCpn5meBTvgsD)
        * One thread processes one packet at a time
      * Data parallelism
        * Annotate the segment with multiple cores
      * Pipeline parallelism
        * Multiple segments, with queues inserted between them
      * Offload: annotate the segment with the device it runs on
  * First: inferred data transfer
    * Solution: infer fields to send
    * **Per-packet** state: a packet and its metadata can be accessed anywhere in the program
    * **Compiler** infers which fields of the packet and metadata to send
  * Second: logical-to-physical queue mapping
    * Observation: different communication strategies can be expressed by mapping logical queues to physical queues
      * Degrees of resource sharing
      * Dynamic packet steering
      * Packet ordering
    * Solution: **queue construct** with explicit logical-to-physical queue mapping
      * E.g., Queue(channels=2, instances=3)
        * channels = logical queues, instances = physical queues
    * Example
      * No steering: queue construct with one physical instance
      * Key-based steering: multiple physical queue instances; specify which CPU cores get data from which queue, and specify how to steer by assigning a queue ID to the per-packet state
  * Third: caching construct
    * Difficult to implement a complete cache protocol
      * Maintain consistency of data on NIC and CPU
      * High performance
    * ![](/files/AvtoRU1FOz8z8jOPiwYV)
    * ![](/files/Mafu0zYvqkybNRV4XGcz)
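
The logical-to-physical queue mapping and key-based steering described above can be illustrated with a small plain-Python model. This is only a sketch under stated assumptions, not Floem's actual API: `SteeringQueue`, `key_based_qid`, and the CRC32 hash are hypothetical names and choices made for illustration. Only the `channels`/`instances` parameter names come from the notes.

```python
from collections import deque
from zlib import crc32


class SteeringQueue:
    """Toy model (hypothetical, not Floem's API): `channels` logical queues
    multiplexed onto `instances` physical queues."""

    def __init__(self, channels, instances):
        self.channels = channels
        self.instances = [deque() for _ in range(instances)]

    def enq(self, channel, packet, qid):
        # `qid` plays the role of the queue ID stored in the per-packet state.
        if not 0 <= channel < self.channels:
            raise ValueError("unknown logical channel")
        if not 0 <= qid < len(self.instances):
            raise ValueError("unknown physical instance")
        self.instances[qid].append((channel, packet))

    def deq(self, qid):
        # Each CPU core would poll the physical instance assigned to it.
        q = self.instances[qid]
        return q.popleft() if q else None


def key_based_qid(key, instances):
    """Assumed steering rule: hash the key so equal keys share a queue."""
    return crc32(key) % instances


GET, SET = 0, 1  # two logical channels sharing the same physical queues
q = SteeringQueue(channels=2, instances=3)
for op, key in [(SET, b"a"), (GET, b"a"), (GET, b"b")]:
    q.enq(op, {"key": key}, key_based_qid(key, 3))
```

In this model, switching from key-based steering to no steering only means constructing the queue with `instances=1` and always using `qid=0`; the producer and consumer code stays the same, which is the kind of small change the queue construct is meant to enable.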
* Runtime & Communication
  * Runtime: responsible for managing communication over PCIe
  * Hard part: managing data synchronization between the NIC and CPU
    * High-throughput data transfer requires many optimizations
      * I/O batching
      * Overlapping DMA operations with useful computation
  * Decouple the queue logic (in the queue library) from the data synchronization and DMA optimizations
    * Less compact but more modular
    * Can swap out the queue library while reusing the queue synchronization layer
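
The separation between queue logic and a batching transfer layer noted above can be sketched in plain Python. This is a minimal illustration under assumptions, not Floem's runtime: `BatchedTransfer` and `FifoQueue` are hypothetical names, and a Python list stands in for host memory reached by DMA.

```python
class BatchedTransfer:
    """Hypothetical sync layer: buffers entries and ships them in batches."""

    def __init__(self, send_batch, batch_size):
        self.send_batch = send_batch  # stand-in for one DMA write
        self.batch_size = batch_size
        self.pending = []
        self.transfers = 0            # number of simulated DMA operations

    def write(self, entry):
        self.pending.append(entry)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.send_batch(list(self.pending))
            self.pending.clear()
            self.transfers += 1


class FifoQueue:
    """Hypothetical queue logic: knows nothing about batching or DMA."""

    def __init__(self, sync):
        self.sync = sync

    def enq(self, item):
        self.sync.write(item)


host_memory = []  # stand-in for CPU memory on the other side of PCIe
sync = BatchedTransfer(host_memory.extend, batch_size=4)
queue = FifoQueue(sync)
for i in range(10):
    queue.enq(i)
sync.flush()  # push out the final partial batch
```

Ten entries cross the simulated link in three transfers instead of ten. Because `FifoQueue` only calls `sync.write`, a different queue implementation could reuse the same transfer layer unchanged.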
* Evaluation
  * Does Floem help programmers explore different offload designs?
  * Servers: with a smart NIC and without a smart NIC
  * Case study: key-value store
  * Code relevant to communication
    * Floem program: 15 lines --> C program: 240 lines
    * Add a cache in Floem (write-through cache --> write-back cache)
  * Other: distributed real-time data analytics
    * First offload: worse than CPU-only
    * Second offload: 96% improvement with 23 lines of code
  * Takeaway: high-level programming abstractions
    * Control implementation strategies
    * Avoid low-level details
  * Result: minimal changes to explore different designs
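
The write-through versus write-back distinction mentioned in the evaluation can be made concrete with a toy cache. This is a sketch under assumptions, not Floem's caching construct: `NicCache` is a hypothetical name, a dict stands in for the CPU-side store, and eviction and capacity limits are deliberately left out.

```python
class NicCache:
    """Toy NIC-side cache in front of a CPU-side store (hypothetical)."""

    def __init__(self, backing, write_back=False):
        self.backing = backing      # stand-in for the key-value store on the CPU
        self.write_back = write_back
        self.lines = {}             # cached key -> value
        self.dirty = set()          # keys modified but not yet written to the CPU

    def get(self, key):
        if key not in self.lines:
            if key not in self.backing:
                return None
            self.lines[key] = self.backing[key]  # fill on miss
        return self.lines[key]

    def set(self, key, value):
        self.lines[key] = value
        if self.write_back:
            self.dirty.add(key)        # defer the update to the CPU
        else:
            self.backing[key] = value  # write-through: update the CPU now

    def flush(self):
        for key in self.dirty:
            self.backing[key] = self.lines[key]
        self.dirty.clear()


cpu_store_wt = {}
write_through = NicCache(cpu_store_wt)
write_through.set("k", 1)

cpu_store_wb = {}
write_back = NicCache(cpu_store_wb, write_back=True)
write_back.set("k", 1)
```

After these calls `cpu_store_wt` already holds the new value, while `cpu_store_wb` stays empty until `write_back.flush()` runs. The policy is a single constructor flag here, which mirrors the idea that changing cache policy should be a small edit.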
* Questions
  * Storage limitation of the NIC
    * Cavium: fairly large memory (4 GB), so it fits nicely
    * Might not be the right strategy
    * Floem allows you to try different offloading strategies
  * Multiple communications? Going back and forth multiple times
    * Can we have multiple communications?
      * The language allows you to use as many queues as you want
    * Is that the best strategy to use?
      * Best to minimize communication
  * Congestion control beyond a single packet
    * TCP stack offload
    * One option: implement the entire stack on the NIC, but this uses a lot of resources and leaves only a small amount for offloading other applications
    * Better option: offload some parts (fast-path processing), but handle the flow state carefully (shared between NIC and CPU)
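
The inferred data transfer idea from the notes (send only the per-packet fields that code on the other side of a queue actually uses) can be sketched as follows. This is a simplified illustration, not how the Floem compiler works internally: the field-usage table is written by hand here, whereas the notes say the compiler infers it, and `infer_fields`, `marshal`, and the element names are hypothetical.

```python
def infer_fields(downstream_uses):
    """Union of per-packet state fields read by elements after the queue."""
    needed = set()
    for fields in downstream_uses.values():
        needed.update(fields)
    return sorted(needed)


def marshal(state, fields):
    """Copy only the selected fields into the message sent across PCIe."""
    return {name: state[name] for name in fields}


# Per-packet state as seen on the NIC side (illustrative values).
state = {
    "key": b"user:42",
    "keylen": 7,
    "hash": 0x5A17,
    "raw_packet": b"\x00" * 64,
    "qid": 1,
}

# Hand-written stand-in for what a compiler would infer from element code.
downstream_uses = {
    "hash_lookup": ["key", "keylen", "hash"],
    "make_response": ["key", "keylen"],
}

fields = infer_fields(downstream_uses)
message = marshal(state, fields)
```

Here `message` carries only `hash`, `key`, and `keylen`; the raw packet and the queue ID stay on the sending side. Moving an element across the queue changes `downstream_uses` and therefore the message layout, with no hand-written marshaling code to update.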

