# IncBricks: Toward In-Network Computation with an In-Network Cache

### Abstract

* Emergence of programmable network devices + increasing data traffic of data centers --> in-network computation
* Offload compute operations to intermediate network devices
  * Serve network requests with low latency
  * Reduce datacenter traffic + reduce congestion
  * Save energy
* Challenges:
  * Network devices have no general compute capabilities
  * Commodity datacenter networks are complex
* Key idea: an in-network caching fabric with basic computing primitives

### Intro

* Goal: reduce traffic, lower communication latency, reduce communication overheads
* SDN
  * Programmable switches: application-specific header parsing, customized match-action rules, light-weight programmable forwarding plane
  * Network accelerators: low-power multicore processors and fast traffic managers
* INC: offload a set of compute operations from end-servers onto programmable network devices (switches, network accelerators)
* Challenges
  * Limited compute power and little storage for general datacenter computation
  * Keeping computation and state coherent across networking elements is complex
  * INC requires a simple and general computing abstraction that can be integrated with application logic
* Proposal: an in-network caching fabric with basic computing primitives, built on programmable network devices
  * IncBox: hybrid switch/network-accelerator hardware unit that offloads application-level operations
  * IncCache: in-network cache for a KV store

### System Architecture

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FE2RvIZ16BuOHiMjs6Cwi%2Fimage.png?alt=media\&token=cd44226b-67be-4677-875e-2c582328b0a6)

* Hierarchical topology
  * ToR switches: 10 Gbps
  * Aggregation switches: 10-40 Gbps
  * Core switches: 100 Gbps
* Multiple paths in the core of the network, created by adding redundant switches
* Traditional Ethernet switches
  * Packets are forwarded based on a forwarding database (FDB)
    * **Data plane**: processes network packets at line rate
      * Ingress / egress controllers: translate packets between their wire-level representation and a unified, structured internal format
      * Packet memory: buffers in-flight packets across all ingress ports
      * Switching module: makes packet forwarding decisions based on the FDB
    * **Control plane**: configures forwarding policies
      * Low-power processor for adding and removing forwarding rules
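The split above — a control plane that installs rules and a data plane that consults the FDB per packet — can be sketched as a toy learning switch (all names here are illustrative, not an API from the paper):

```python
# Minimal sketch of the switching module's decision, assuming the FDB maps
# destination MAC addresses to egress ports.
fdb = {}  # dst MAC -> egress port, populated by the control plane

def learn(src_mac, in_port):
    """Control-plane side: remember which port a source MAC was seen on."""
    fdb[src_mac] = in_port

def forward(dst_mac, in_port, num_ports):
    """Data-plane side: return the egress port(s) for an incoming frame."""
    port = fdb.get(dst_mac)
    if port is not None:
        return [port]                     # known destination: unicast
    # unknown destination: flood on every port except the ingress one
    return [p for p in range(num_ports) if p != in_port]
```

The key property is that the per-packet path does only a table lookup; rule changes happen off the fast path in the control plane.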

Programmable switch and network accelerator

* Programmable switches: reconfigurability in the forwarding plane
  * Programmable parser, match memory, action engine
    * Customizable packet formats
    * Simple operations based on the headers of incoming packets
* Network accelerators
  * Traffic manager: fast DMA between TX/RX ports and internal memory
  * Packet scheduler: maintains incoming packet order and distributes packets to cores
  * Low-power multicore processor: payload modifications
  * Con: only a few interface ports, limiting processing bandwidth

Combining the two hardware devices

* IncBox: hardware unit consisting of a network accelerator co-located with an Ethernet switch
  * When a packet is an INC packet, the switch forwards it to the network accelerator for computation
* IncCache: distributed, coherent KV store with computing capabilities --> packet parsing, hash-table lookup, command execution, packet encapsulation
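The four IncCache stages just listed can be sketched as a toy per-packet pipeline; the `CMD|key|value` textual layout and the function names are assumptions for illustration, not the real wire format:

```python
# Toy sketch of the IncCache pipeline: packet parsing -> hash-table
# lookup -> command execution -> packet (re)encapsulation.
cache = {}

def parse(pkt):                       # packet parsing
    cmd, key, value = pkt.split("|", 2)
    return cmd, key, value

def encapsulate(cmd, key, value):     # packet encapsulation
    return "|".join((cmd, key, value))

def handle(pkt):
    cmd, key, value = parse(pkt)
    if cmd == "SET":                  # command execution: update the cache
        cache[key] = value
        return encapsulate("SET_REPLY", key, "OK")
    if cmd == "GET":
        if key not in cache:          # hash-table lookup: miss
            return None               # miss: forward on toward the server
        return encapsulate("GET_REPLY", key, cache[key])
```

A `None` return stands in for "stop processing and forward the packet unchanged", which is what the accelerator does on a miss.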

### IncBox

#### Design Decisions

* Must support three functions
  * F1: Parse in-transit network packets and extract some fields for the IncBricks logic
  * F2: Modify both header and payload, and forward the packet based on the hash of the key
  * F3: Cache key/value data and potentially execute basic operations on cached values
  * Should also provide: P1 high throughput and P2 low latency
* Programmable switches:
  * Can only support simple operations (read, write, add, subtract, shift on counters)
  * Packet buffer is on the order of a few tens of MB, most of it needed for buffering incoming traffic, leaving little space for caching
  * Can meet F1 and F2, but hard to satisfy F3 as well as P1 and P2 for payload-related operations
* Network accelerators satisfy the rest of the requirements
  * Traffic manager can serve packet data faster than kernel-bypass techniques
    * Kernel bypass: eliminates the overheads of in-kernel network stacks by moving protocol processing to user space
      * E.g., dedicate the NIC to an application, or keep the NIC kernel-managed while letting applications map NIC queues into their address space
  * Multicore processors can saturate 40-100 Gbps bandwidth easily
  * Multiple GB of memory, which can be used for caching

#### Design

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FXss5iwe49QinaXJ6vE0S%2Fimage.png?alt=media\&token=c4f5e37b-3432-4976-8edf-2c0f5b37949e)

* Switch:
  * Packet checking: filters in-network caching packets based on the application header
    * Match: forward to the network accelerator
    * Otherwise: process in the original pipeline
  * Hit checking: determines whether the network accelerator has cached the key
  * Packet steering: forwards the packet to a specific port based on the hash of the key
* Network accelerator:
  * Performs application-layer computations and runs the IncCache system
  * Extracts KV pairs and the command from the packet payload
  * Conducts memory-related operations
    * Write
    * Read
      * Cache lookup: on a miss, stops and forwards the request; on a hit, executes the command
    * After execution, rebuilds the packet and sends it back to the switch
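The packet-steering step boils down to mapping the key hash in the application header onto an egress port. A one-line sketch, where CRC32 and the port count are assumptions (the real switch does this with its match-action tables):

```python
import zlib

# Hypothetical steering function: choose the egress port toward the
# responsible IncBox/server from a hash of the key.
def steer(key: bytes, num_ports: int) -> int:
    return zlib.crc32(key) % num_ports
```

Because the mapping is deterministic, all requests for the same key take the same path and hit the same cache.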

### IncCache

* Able to
  * Cache data on both IncBox units and end-servers
  * Keep caches coherent using a directory-based cache coherence protocol
  * Handle multipath routing and failures
  * Provide basic compute primitives
* Packet format: ID, magic field, command, hash, application payload
* Hash-table-based data cache
  * On both network accelerators and end-host servers
    * Network accelerator: fixed-size, lock-free hash table
    * End-host servers: extensible, lock-free hash table
  * Cache coherence protocol: keeps data consistent without incurring high overhead
    * Hierarchical directory-based cache coherence protocol
      * Takes advantage of the structured network topology via a hierarchical distributed directory mechanism
      * Decouples the system interface from the program interface to provide flexible programmability
      * Supports sequential consistency for high-performance SET/GET/DEL requests
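The packet format listed above (ID, magic field, command, hash, payload) can be sketched as a fixed header plus a variable payload; all field widths and the magic value below are assumptions for illustration, not the paper's actual wire format:

```python
import struct

# Hypothetical encoding of the application-level header:
# id (u32), magic (u16), command (u16), key hash (u32), then the payload.
HDR = struct.Struct("!IHHI")
MAGIC = 0xABCD  # assumed marker letting the switch recognize INC packets

def encode(pkt_id: int, command: int, key_hash: int, payload: bytes) -> bytes:
    return HDR.pack(pkt_id, MAGIC, command, key_hash) + payload

def decode(buf: bytes):
    pkt_id, magic, command, key_hash = HDR.unpack_from(buf)
    if magic != MAGIC:
        raise ValueError("not an in-network caching packet")
    return pkt_id, command, key_hash, buf[HDR.size:]
```

Keeping the hash in a fixed header position is what lets the switch do packet checking and steering without touching the payload.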
