# ONIX: A Distributed Control Platform for Large-scale Production Networks

### Contribution

* Presents a network-wide control platform
  * The control platform handles state distribution and coordinates state among the platform's servers (SDN setup)
  * Provides a unified framework for understanding the scaling and performance properties of the system
  * Provides a higher-level API to the control logic
  * Relieves the control logic of implementing distribution mechanisms and deploying them on switches
* Essence of SDN: a **single control platform** implements **a range of control functions** (e.g. routing, traffic engineering, access control, VM migration) over **a spectrum of granularities** (e.g. individual flows to large traffic aggregates) in a **variety of contexts** (e.g. enterprises, datacenters, WANs)
  * Basic primitives for state distribution should be implemented once in the control platform rather than separately for individual control tasks
  * The platform should use well-known, general-purpose techniques from the distributed systems literature rather than the more specialized algorithms found in routing protocols and other network control mechanisms
* **Challenges**: generality, scalability, reliability, simplicity, control plane performance (adequate, not optimal)
* **Key contribution**
  * General API
  * Flexible distribution primitives (e.g. DHT storage, group membership)

### Design

#### Components

* Physical infrastructure: switches, routers, and other network elements
  * Only need to run the software required to support the interface through which Onix reads and writes their state, and to achieve basic connectivity
* Connectivity infrastructure: communication between the physical networking gear and Onix (the "control traffic") transits the connectivity infrastructure
  * Needs to support bi-directional communication between Onix instances and switches
* Onix: a distributed system running on a cluster of physical servers, each of which might run multiple Onix instances
  * 1\) Gives the control logic programmatic access to the network
  * 2\) Provides the requisite resilience for production deployments
* Control logic: implemented on top of Onix's API
  * Determines the desired network behavior

#### API: useful and general

* Consists of a data model that represents the network infrastructure, with each network element corresponding to one or more **data objects**
* **Control logic functionality**
  * Read the current state of an object
  * Alter the network state by operating on these objects
  * Register for notifications of state changes to these objects
  * Customize the data model
  * Control the placement and consistency of each component of the network state
* **Network Information Base (NIB)**
  * Graph of all entities within a network topology
  * Control applications are implemented by reading from and writing to the NIB
  * Onix provides scalability and resilience by replicating and distributing the NIB across multiple running instances, with the application deciding how
  * Details
    * Network entities: each holds a set of key-value (KV) pairs and is identified by a flat, 64-bit global identifier
    * Typed entities: each type can contain a predefined set of attributes and methods to perform operations over these attributes
    * No fine-grained or distributed locking mechanisms, only a mechanism to request and release exclusive access to the NIB data structure of the local instance
      * Coordination across instances must use mechanisms external to the NIB
    * All NIB operations are asynchronous, but a synchronization primitive is also provided
    * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2F1uOaSFJyQvCm8CujY0IU%2Fimage.png?alt=media\&token=c33ae98c-632b-4024-ad49-82c338b16b7d)
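The read / write / register-for-notification pattern described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: `ToyNIB`, its method names, the callback signature, and the identifier value are all hypothetical and are not Onix's actual API. It also applies writes synchronously, whereas the real NIB operations are asynchronous.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Hypothetical callback signature: (entity_id, key, old_value, new_value)
Listener = Callable[[int, str, Any, Any], None]


class ToyNIB:
    """Illustrative in-memory stand-in for a NIB; not Onix's real API."""

    def __init__(self) -> None:
        # Each entity is a flat identifier mapped to a set of key-value pairs.
        self._entities: Dict[int, Dict[str, Any]] = {}
        self._listeners: Dict[int, List[Listener]] = defaultdict(list)

    def create(self, entity_id: int, **attrs: Any) -> None:
        self._entities[entity_id] = dict(attrs)

    def read(self, entity_id: int, key: str) -> Any:
        return self._entities[entity_id].get(key)

    def register(self, entity_id: int, listener: Listener) -> None:
        self._listeners[entity_id].append(listener)

    def write(self, entity_id: int, key: str, value: Any) -> None:
        # Update the attribute, then notify everyone registered on the entity.
        old = self._entities[entity_id].get(key)
        self._entities[entity_id][key] = value
        for listener in self._listeners[entity_id]:
            listener(entity_id, key, old, value)


nib = ToyNIB()
PORT_ID = 0x2A  # made-up identifier for a port entity
nib.create(PORT_ID, kind="port", admin_up=True)

events = []
nib.register(PORT_ID, lambda eid, k, old, new: events.append((eid, k, old, new)))
nib.write(PORT_ID, "admin_up", False)
print(events)  # [(42, 'admin_up', True, False)]
```

Control logic written in this style never talks to switches directly: it reacts to notifications and writes desired state back into the shared data structure.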

### Scaling and reliability

* Need to scale (i.e. as the # of elements in the network increases)
  * What's the normal scale?
* NIB distribution framework

#### Scaling

* Three strategies
  * Allow control applications to **partition** the workload so that adding instances reduces work without merely replicating it
    * Each controller can use
      * the NIB’s representation of the complete physical topology (from the replicated database)
      * coupled with link utilization data (from the DHT)
      * to configure the forwarding state necessary to ensure paths meeting the deployment’s requirements throughout the network
  * Allow for **aggregation**, in which the network managed by a cluster of Onix nodes appears as a single node in a separate cluster's NIB
    * Federated, hierarchical structuring of Onix clusters
  * Provide applications with control over the **consistency and durability** of the network state
    * Two data stores

      * State for which applications favor durability and stronger consistency: replicated transactional database
      * Volatile state that is more tolerant of inconsistency: memory-based one-hop DHT

### Notion

* In-band vs. out-of-band control
  * In-band: control traffic shares the same forwarding elements as the data traffic in the network
  * Out-of-band: a separate physical network is used to handle the control traffic
