IncBricks: Toward In-Network Computation with an In-Network Cache

Abstract

  • Emergence of programmable network devices + increasing data traffic in data centers --> in-network computation

  • Offload compute operations to intermediate network devices

    • Serve network requests with low latency

    • Reduce datacenter traffic + reduce congestion

    • Save energy

  • Challenge:

    • No general compute capabilities

    • Commodity datacenter networks are complex

  • Key: in-network caching fabric with basic computing primitives

Intro

  • Goal: reduce traffic, lower communication latency, reduce communication overheads

  • SDN

    • programmable switches (application-specific header parsing, customized match-action rules, light-weight programmable forwarding plane)

    • network accelerators: low-power multicore processors and fast traffic managers

  • INC: offload a set of compute operations from end-servers onto programmable network devices (switches, network accelerators)

  • Challenges

    • Limited compute power and little storage for DC computation

    • Keeping computation and state coherent across networking elements is complex

    • INC requires a simple and general computing abstraction that can be integrated with application logic

  • Propose: in-network caching fabric with basic computing primitives based on programmable network devices

    • IncBox: hybrid switch/network accelerator architecture that offloads application-level operations

    • IncCache: in-network cache for KV store

System Architecture

  • Hierarchical topology

    • ToR: 10 Gbps

    • aggregation: 10-40 Gbps

    • core switches: 100 Gbps

  • Multiple paths in the core of the network by adding redundant switches

  • Traditional Ethernet switches

    • Packets: forwarded based on the forwarding database (FDB)

      • Data plane: process network packets at line rate

        • Ingress / Egress controller: matches transmitted and received packets between their wire-level representation and a unified, structured internal format

        • Packet memory: buffers in-flight packets across all ingress ports

        • Switching module: makes packet forwarding decisions based on the forwarding database (a minimal FDB sketch follows this list)

      • Control plane: configure forwarding policies

        • low-power processor for adding and removing forwarding rules
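
A minimal sketch of the switching module's FDB lookup, assuming a simple MAC-to-port table; the table layout, size, and flooding behavior are illustrative only, not a real switch's implementation:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FDB_SIZE   1024
#define FLOOD_PORT 0xFF   /* pseudo-port meaning "flood to all ports" */

struct fdb_entry {
    uint8_t mac[6];   /* destination MAC address */
    uint8_t port;     /* egress port for this MAC */
    int     valid;
};

static struct fdb_entry fdb[FDB_SIZE];

/* Control plane: install a forwarding rule (simplified linear table). */
static void fdb_add(const uint8_t mac[6], uint8_t port) {
    for (int i = 0; i < FDB_SIZE; i++) {
        if (!fdb[i].valid) {
            memcpy(fdb[i].mac, mac, 6);
            fdb[i].port = port;
            fdb[i].valid = 1;
            return;
        }
    }
}

/* Data plane: look up the egress port; unknown destinations are flooded. */
static uint8_t fdb_lookup(const uint8_t mac[6]) {
    for (int i = 0; i < FDB_SIZE; i++) {
        if (fdb[i].valid && memcmp(fdb[i].mac, mac, 6) == 0)
            return fdb[i].port;
    }
    return FLOOD_PORT;
}

int main(void) {
    uint8_t host_a[6] = {0x02, 0x00, 0x00, 0x00, 0x00, 0x01};
    fdb_add(host_a, 3);
    printf("egress port: %u\n", fdb_lookup(host_a));   /* prints 3 */
    return 0;
}
```

A real switch performs this lookup in hardware at line rate; the control-plane processor only installs and removes entries, as fdb_add does here.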

Programmable switch and network accelerator

  • Programmable switches: reconfigurability in forwarding plane

    • Programmable parser, match memory, action engine

      • Packet formats customizable

      • Simple operations based on headers of incoming packets

  • Network accelerators

    • Traffic manager: fast DMA between TX/RX ports and internal memory

    • Packet scheduler: maintains incoming packet order and distributes packets to cores

    • Low-power multicore processor: payload modifications

    • Con: only a few interface ports, limiting processing bandwidth
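
To make the programmable switch's match-action model concrete, a hypothetical sketch of matching parsed header fields against customizable rules; the field names, the rule table, and the port-8890 rule are assumptions for illustration, not an actual switch API:

```c
#include <stdint.h>
#include <stdio.h>

/* Fields a (hypothetical) programmable parser extracts from a packet. */
struct parsed_hdr {
    uint16_t ether_type;
    uint16_t dst_port;    /* e.g., UDP destination port */
};

enum action { ACT_FORWARD, ACT_SEND_TO_ACCEL, ACT_DROP };

/* One match-action rule: wildcardable exact match on two header fields. */
struct rule {
    uint16_t ether_type;  int match_ether_type;
    uint16_t dst_port;    int match_dst_port;
    enum action act;
};

static enum action apply_rules(const struct rule *rules, int n,
                               const struct parsed_hdr *h) {
    for (int i = 0; i < n; i++) {
        if (rules[i].match_ether_type && rules[i].ether_type != h->ether_type)
            continue;
        if (rules[i].match_dst_port && rules[i].dst_port != h->dst_port)
            continue;
        return rules[i].act;   /* first matching rule wins */
    }
    return ACT_FORWARD;        /* default action */
}

int main(void) {
    /* Hypothetical rule: steer packets on application port 8890 to the accelerator. */
    struct rule rules[] = {
        { .ether_type = 0x0800, .match_ether_type = 1,
          .dst_port = 8890,     .match_dst_port = 1,
          .act = ACT_SEND_TO_ACCEL },
    };
    struct parsed_hdr h = { .ether_type = 0x0800, .dst_port = 8890 };
    printf("action = %d\n", apply_rules(rules, 1, &h));  /* ACT_SEND_TO_ACCEL */
    return 0;
}
```

A real programmable switch evaluates such rules in dedicated match memory and applies the action engine at line rate; this sketch only mirrors the logical flow.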

Combine two hardware devices

  • IncBox: hardware unit of a network accelerator co-located with Ethernet switch

    • When the switch identifies an INC packet, it forwards it to the network accelerator for computation

  • IncCache: distributed, coherent KV store with computing capabilities --> packet parsing, hashtable lookup, command execution, packet encapsulation

IncBox

Design Decisions

  • Support three things

    • F1: Parse in-transit network packets and extract some fields for the IncBrick logic

    • F2: Modify both header and payload and forward the packet based on the hash of the key

    • F3: Cache key / value data and potentially execute basic operations on cached values

    • Should provide: high throughput (P1) and low latency (P2)

  • Programmable switches:

    • can only support simple operations (read, write, add, subtract, shift on counters)

    • size of the packet buffer is on the order of a few tens of MB, most of which is used for buffering incoming packet traffic, leaving little space for caching

    • Can meet F1 and F2, but hard to satisfy F3, as well as P1/P2 for payload-related operations

  • Network accelerators satisfy the rest of the requirements

    • Traffic manager can serve packet data faster than kernel bypass techniques

      • Kernel bypass: eliminates the overheads of in-kernel network stacks by moving protocol processing to user space

        • E.g., dedicate the NIC to the application, or keep the NIC kernel-managed while allowing applications to map NIC queues into their address space

    • Multi-core processors can saturate 40-100 Gbps bandwidth easily

    • Provide multiple GBs of memory, which can be used for caching
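
Rough arithmetic behind the claim that a low-power multicore accelerator can keep up with 40-100 Gbps; the packet size and core count below are assumptions, not numbers from the paper:

```c
#include <stdio.h>

int main(void) {
    double link_gbps[] = { 40.0, 100.0 };
    double pkt_bytes   = 1024.0;   /* assumed average packet size */
    int    num_cores   = 16;       /* assumed accelerator core count */

    for (int i = 0; i < 2; i++) {
        double bits_per_pkt = pkt_bytes * 8.0;
        double pkts_per_sec = link_gbps[i] * 1e9 / bits_per_pkt;
        double ns_per_pkt_per_core = 1e9 * num_cores / pkts_per_sec;
        printf("%5.0f Gbps: %.1f Mpps total, ~%.0f ns per packet per core\n",
               link_gbps[i], pkts_per_sec / 1e6, ns_per_pkt_per_core);
    }
    return 0;
}
```

Under these assumptions, even at 100 Gbps each core has on the order of a microsecond per packet, which is ample time for a hash-table lookup and simple payload edits.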

Design

  • Switch:

    • Packet checking to filter in-network caching packets based on the application header

      • Match: forward to network accelerator

      • O/W: processed in the original processing pipeline

    • Hit check: determines whether the network accelerator has cached the key or not

    • Packet steering: forwards the packet to a specific port based on the hash of the key (see the sketch after this list)

  • Network accelerator:

    • Performs application-layer computations and runs the IncCache system

    • Extracts KV pairs and the command from the packet payload

    • Conducts memory-related operations

      • Write

      • Read

        • Cache lookup: on a miss, stops and forwards the packet; on a hit, executes the command

      • After execution, rebuilds the packet and sends it back to the switch
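
A minimal sketch of the key-hash packet steering step referenced above; the FNV-1a hash and the port numbering are arbitrary choices for illustration, not the paper's design:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* FNV-1a hash over the key bytes (an arbitrary choice for illustration). */
static uint32_t hash_key(const char *key, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (uint8_t)key[i];
        h *= 16777619u;
    }
    return h;
}

/* Steer an INC packet to one of the accelerator-facing ports based on the
 * hash of its key, so requests for the same key always take the same path. */
static int steer_port(const char *key, int first_accel_port, int num_accel_ports) {
    return first_accel_port + (int)(hash_key(key, strlen(key)) % (uint32_t)num_accel_ports);
}

int main(void) {
    /* Hypothetical setup: accelerator reachable via ports 48-51. */
    printf("GET foo -> port %d\n", steer_port("foo", 48, 4));
    printf("SET bar -> port %d\n", steer_port("bar", 48, 4));
    return 0;
}
```

Steering by key hash means all requests for a given key take the same path, so the same IncBox cache sees them, which is what makes the hit check and caching effective.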

IncCache

  • Able to

    • Cache data on both IncBox units and end-servers

    • Keep the cache coherent using a directory-based cache coherence protocol

    • Handle scenarios related to multipath routing and failures

    • Provide basic compute primitives

  • Packet format: ID, magic field, command, hash, application payload (see the struct sketch at the end of this section)

  • Hash table based data cache

    • On both network accelerators and endhost servers

      • network accelerator: fixed-size, lock-free hash table

      • endhost servers: extensible, lock-free hash table

      • Cache coherence protocol: keep data consistent without incurring high overhead

        • Hierarchical directory-based cache coherence protocol

          • Take advantage of the structured network topology by using a hierarchical distributed directory mechanism

          • Decouple system interface and program interface to provide flexible programmability

          • Support sequential consistency for SET/GET/DEL requests with high performance
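
Putting the packet format and the fixed-size accelerator cache together, a simplified single-threaded sketch; the field widths, table size, and open-addressing scheme are assumptions, and the real IncCache table is lock-free rather than this simplified version:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative application header carried in the packet: request ID, magic
 * field marking INC packets, command, key hash, then the payload bytes.
 * Field widths here are assumptions, not the paper's wire format. */
struct inc_hdr {
    uint32_t id;
    uint16_t magic;       /* identifies the packet as an IncCache request */
    uint8_t  cmd;         /* e.g., CMD_GET / CMD_SET / CMD_DEL */
    uint32_t key_hash;
    uint16_t payload_len;
};

enum { CMD_GET = 1, CMD_SET = 2, CMD_DEL = 3 };

/* Fixed-size cache with linear probing (simplified, not lock-free). */
#define CACHE_SLOTS 4096
#define KEY_MAX 32
#define VAL_MAX 64

struct slot { char key[KEY_MAX]; char val[VAL_MAX]; int used; };
static struct slot cache[CACHE_SLOTS];

static struct slot *find_slot(uint32_t h, const char *key) {
    for (int probe = 0; probe < CACHE_SLOTS; probe++) {
        struct slot *s = &cache[(h + probe) % CACHE_SLOTS];
        if (!s->used || strcmp(s->key, key) == 0)
            return s;
    }
    return NULL;   /* table full */
}

static void cache_set(uint32_t h, const char *key, const char *val) {
    struct slot *s = find_slot(h, key);
    if (!s) return;
    strncpy(s->key, key, KEY_MAX - 1);
    strncpy(s->val, val, VAL_MAX - 1);
    s->used = 1;
}

static const char *cache_get(uint32_t h, const char *key) {
    struct slot *s = find_slot(h, key);
    return (s && s->used) ? s->val : NULL;   /* NULL = miss, forward upstream */
}

int main(void) {
    uint32_t h = 42;   /* normally taken from inc_hdr.key_hash */
    cache_set(h, "counter", "7");
    const char *v = cache_get(h, "counter");
    printf("GET counter -> %s\n", v ? v : "(miss)");
    return 0;
}
```

On a GET miss the accelerator would stop here and forward the request toward the home server, matching the miss path described in the Design section.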
