ONIX: A Distributed Control Platform for Large-scale Production Networks

https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Koponen.pdf

Contribution

  • Present a network-wide control platform

    • Control platform handles state distribution and coordinates the state among various platform servers (SDN setup)

    • Provides a unified framework for understanding the scaling and performance properties of the system

    • Provides a higher-level API

    • Handles the difficulty of implementing distribution mechanisms and deploying them on switches

  • Essence of SDN: a single control platform implementing a range of control functions (e.g. routing, TE, AC, VM migration) over a spectrum of granularities (e.g. individual flows to large traffic aggregates) in a variety of contexts (e.g. enterprises, DCs, WANs)

    • Basic primitives for state distribution should be implemented once in the control platform rather than separately for individual control tasks

    • Should use well-known and general-purpose techniques from the distributed systems literature rather than the more specialized algorithms found in routing protocols and other network control mechanisms

  • Challenges: generality, scalability, reliability, simplicity, control plane performance (adequate, not optimal)

  • Key contributions

    • General API

    • Flexible distribution primitives (e.g. DHT storage, group membership)

Design

Components

  • Physical infrastructure: switches, routers, other network elements

    • Only run the software required to support the switch control interface (e.g. OpenFlow) and achieve basic connectivity

  • Connectivity infrastructure: carries the communication between the physical networking gear and Onix (the "control traffic")

    • Need to support bi-directional communication between Onix instances and switches

  • Onix: distributed system running on a cluster of physical servers, each of which might run multiple Onix instances

    • 1) gives the control logic programmatic access to the network

    • 2) provides the requisite resilience for production deployments

  • Control logic: implemented on top of Onix's API

    • Determines the desired network behavior

API: useful and general

  • Consists of a data model that represents the network infrastructure, with each network element corresponding to one or more data objects

  • Control logic functionality (see the sketch after this list)

    • Read the current state of these objects

    • Alter the network state by operating on these objects

    • Register for notifications of state changes to these objects

    • Customize the data model

    • Have control over placement and consistency of each component of the network state
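
A minimal sketch of the read/write/notify pattern above, as Python stand-ins. The names (Nib, Port, query, register_listener) are invented for illustration and are not the actual Onix API:

```python
class Port:
    """Stand-in for a typed NIB entity representing a switch port."""
    def __init__(self, uid, attrs=None):
        self.uid = uid                 # flat, 128-bit global identifier
        self.attrs = attrs or {}       # key-value pairs held by the entity
        self._listeners = []

    def set(self, key, value):
        self.attrs[key] = value        # alter network state via the object
        for cb in self._listeners:
            cb(self, key, value)       # notify registered control logic

    def register_listener(self, cb):
        self._listeners.append(cb)     # subscribe to state changes


class Nib:
    """Stand-in for the NIB: a graph of entities keyed by identifier."""
    def __init__(self):
        self.entities = {}

    def add(self, entity):
        self.entities[entity.uid] = entity

    def query(self, uid):
        return self.entities.get(uid)  # read the current state of an object


nib = Nib()
port = Port(uid=0x1A2B, attrs={"admin_state": "up"})
nib.add(port)

# Control logic: register for notifications, read state, then write state.
port.register_listener(lambda e, k, v: print(f"entity {e.uid:#x}: {k} -> {v}"))
if nib.query(0x1A2B).attrs["admin_state"] == "up":
    port.set("forwarding", "enabled")  # fires the notification above
```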

  • Network Information Base (NIB)

    • Graph of entities within a network topology

    • Control applications are implemented by reading and writing to the NIB

    • Onix: provides scalability and resilience by replicating and distributing the NIB between multiple running instances (as directed by application-specific logic)

    • Details

      • Network entities: each holds a set of key-value pairs and is identified by a flat, 128-bit global identifier

      • Typed entities: contain a predefined set of attributes and methods to perform operations over those attributes

      • No fine-grained or distributed locking mechanisms; just a mechanism to request and release exclusive access to the NIB data structure of the local instance

        • Applications that need distributed coordination must use an external mechanism

      • All operations are asynchronous, but a synchronization primitive is also provided (see the sketch below)
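
A toy illustration of the asynchronous-operations-plus-barrier model noted above; the names (AsyncNib, async_write, barrier) are assumptions for this sketch, not the real Onix primitives:

```python
import threading

class AsyncNib:
    """Sketch of asynchronous writes with a synchronization barrier."""
    def __init__(self):
        self._pending = []

    def async_write(self, store, key, value):
        # Queue the write and return immediately; completion is signalled
        # through an event the caller may later wait on.
        done = threading.Event()
        def apply():
            store[key] = value
            done.set()
        threading.Thread(target=apply).start()
        self._pending.append(done)
        return done

    def barrier(self):
        # Synchronization primitive: block until every operation issued
        # so far by this instance has completed.
        for done in self._pending:
            done.wait()
        self._pending.clear()

switch_state = {}
nib = AsyncNib()
nib.async_write(switch_state, "flow_table_version", 7)
nib.barrier()                                # wait for outstanding writes
assert switch_state["flow_table_version"] == 7
```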

Scaling and reliability

  • Needs to scale as the number of network elements increases

    • What's the normal scale?

  • NIB distribution framework

Scaling

  • Three strategies

    • Allow control applications to partition the workload so that adding instances reduces work without merely replicating it (sketched after this list)

      • Each controller can use

        • the NIB’s representation of the complete physical topology (from the replicated database)

        • coupled with link utilization data (from the DHT)

        • to configure the forwarding state necessary to ensure paths meeting the deployment’s requirements throughout the network

    • Allow for aggregation, in which the network managed by a cluster of Onix nodes appears as a single node in a separate cluster's NIB (see the aggregation sketch below)

      • Federated, hierarchical structuring of Onix clusters

    • Provide applications with control over the consistency and durability of the network state (see the two-store sketch below)

      • Two data stores

        • State for which applications favor durability and stronger consistency: a replicated transactional database

        • Volatile state that is more tolerant of inconsistency: a memory-based one-hop DHT
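
Sketch of the first strategy (partitioning): switches hashed across instances so each Onix node manages a disjoint subset, and adding an instance shrinks every subset. The hashing policy is an illustrative assumption, not what the paper prescribes:

```python
import hashlib

def owner(switch_id, instances):
    """Map each switch to exactly one controller instance."""
    h = int(hashlib.sha1(switch_id.encode()).hexdigest(), 16)
    return instances[h % len(instances)]

instances = ["onix-1", "onix-2", "onix-3"]
assignment = {}
for sw in (f"sw-{i}" for i in range(10)):
    assignment.setdefault(owner(sw, instances), []).append(sw)

for inst in sorted(assignment):
    print(inst, "manages", assignment[inst])  # disjoint slices of the work
```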
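
Sketch of the second strategy (aggregation): a child cluster exports its whole managed network to a parent cluster's NIB as one logical node, hiding internal topology. Class names (ChildCluster, LogicalNode) are invented for this example:

```python
class LogicalNode:
    """What a parent cluster's NIB stores: one node per child cluster."""
    def __init__(self, name, edge_ports, capacity_gbps):
        self.name = name
        self.edge_ports = edge_ports        # only the boundary is exposed
        self.capacity_gbps = capacity_gbps  # internal detail is summarized

class ChildCluster:
    def __init__(self, name, switches):
        self.name = name
        self.switches = switches            # {switch_id: capacity_gbps}

    def aggregate(self):
        # Collapse the managed topology into a single-node summary.
        return LogicalNode(
            name=self.name,
            edge_ports=[f"{self.name}:edge0", f"{self.name}:edge1"],
            capacity_gbps=sum(self.switches.values()),
        )

child = ChildCluster("pod-7", {"sw-a": 40, "sw-b": 40, "sw-c": 10})
node = child.aggregate()
print(node.name, node.edge_ports, node.capacity_gbps)
# The parent NIB stores only `node`, not the three switches behind it.
```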
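
Sketch of the third strategy (state-appropriate storage): durable, consistency-sensitive state goes to a stand-in transactional store, while volatile, inconsistency-tolerant state goes to a stand-in one-hop DHT. Both classes are simplified assumptions, not the actual Onix storage layer:

```python
class TransactionalStore:
    """Stand-in for the replicated transactional database (durable)."""
    def __init__(self):
        self._data, self._log = {}, []

    def commit(self, key, value):
        self._log.append((key, value))  # durable write-ahead record
        self._data[key] = value

class OneHopDht:
    """Stand-in for the memory-only one-hop DHT (fast, not durable)."""
    def __init__(self, nodes):
        self.buckets = [dict() for _ in range(nodes)]

    def put(self, key, value):
        self.buckets[hash(key) % len(self.buckets)][key] = value

    def get(self, key):
        return self.buckets[hash(key) % len(self.buckets)].get(key)

durable = TransactionalStore()
volatile = OneHopDht(nodes=3)
durable.commit("topology/sw-1", {"ports": 48})  # needs durability
volatile.put("util/link-9", 0.73)               # tolerates staleness
print(volatile.get("util/link-9"))
```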

Notions

  • In-band vs. out-of-band control

    • In-band: control traffic shares the same forwarding elements as the data traffic in the network

    • Out-of-band: a separate physical network carries the control traffic
