# B4: Experience with a Globally-Deployed Software Defined WAN

### Problem

* WAN links are typically provisioned to 30-40% average utilization
  * WAN links are expensive, and packet loss is typically considered unacceptable
  * High-end, specialized equipment that places a premium on high availability
  * All bits are treated the same
* Google WAN
  * Control over everything (apps, servers, LANs, edge)&#x20;
  * Bandwidth-intensive apps perform large-scale data copies from one site to another
  * Anticipate no more than a few dozen data center deployments, making central control of bandwidth feasible
* Design centers around&#x20;
  * Accepting failures as inevitable and common events whose effects are exposed to applications (which can adapt), rather than masking them with overprovisioning
  * Switch hardware that exports a simple interface to program forwarding table entries under central control&#x20;
* Use cases: routing protocols and centralized traffic engineering&#x20;

### Background

* Two types of WANs
  * User-facing network: peers / exchanges traffic with other Internet domains
    * Requirements: support a wide range of protocols; the physical topology is denser; content delivery demands the highest level of availability
  * B4: connectivity between data centers&#x20;
    * Workload: user data copies for availability, remote storage access for computation over inherently distributed data sources, large-scale data push synchronizing state across multiple DCs
      * Ordered in increasing volume, decreasing latency sensitivity, and decreasing overall priority&#x20;
    * Design
      * Elastic bandwidth demand&#x20;
      * Moderate number of sites&#x20;
      * End application control&#x20;
      * Cost sensitivity&#x20;

<figure><img src="https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2F3nWI7mioV6NjnIOrBjLF%2Fimage.png?alt=media&#x26;token=08df6a47-57ec-42b0-90ad-16f2868ee073" alt=""><figcaption></figcaption></figure>

### Architecture&#x20;

<figure><img src="https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FplouaxsNL6Hc1eC6sKIY%2Fimage.png?alt=media&#x26;token=ce8ac5be-2afd-4c3a-9ec8-6d79f1cda29e" alt=""><figcaption></figcaption></figure>

* Switch hardware: forward traffic
* Site controller layer: NCS hosting both OpenFlow controllers (OFC) and Network Control Applications (NCA)&#x20;
* Global layer: logically centralized applications (e.g., SDN gateway, TE servers)

#### Switch design&#x20;

* Built their own hardware
* Insight: don't need deep buffers, very large forwarding tables, or hardware support for high availability (which come with cost and complexity)
* Motivation: careful endpoint management, a small number of DCs, <mark style="color:red;">switch failures typically result from software rather than hardware failure</mark>, and no existing platform could support an SDN deployment

#### Network control&#x20;

* Functionality runs on NCS in the site controller layer collocated with the switch hardware
* Paxos: handles leader selection for all control functionality&#x20;
  * At each site, perform application-layer failure detection&#x20;
  * When a majority of the Paxos servers detect a failure, they elect a new leader among the remaining set of available servers
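
The majority-based failover described above can be modeled in a few lines. This is an illustrative sketch, not Google's implementation: the function names and data layout are my own, and the deterministic "lowest id wins" rule stands in for the actual Paxos election round.

```python
# Minimal sketch (assumed structure) of quorum-based failure detection and
# leader selection among the NCS replicas at one site.

def elect_leader(replicas, failure_reports):
    """replicas: list of server ids; failure_reports: dict mapping
    observer id -> set of ids that observer believes have failed."""
    majority = len(replicas) // 2 + 1
    # A server is declared failed only when a majority of observers agree,
    # mirroring the "majority of the Paxos servers detect a failure" step.
    failed = {
        s for s in replicas
        if sum(1 for obs in failure_reports.values() if s in obs) >= majority
    }
    alive = [s for s in replicas if s not in failed]
    # Deterministic choice among survivors stands in for the Paxos round.
    return min(alive) if alive else None

# Example: 5 replicas; a majority (3 of them) reports server 1 down.
leader = elect_leader([1, 2, 3, 4, 5], {2: {1}, 3: {1}, 4: {1}, 5: set()})
assert leader == 2
```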

#### Routing&#x20;

* Routing application proxy (RAP)&#x20;

  * RAP translates RIB entries, which form a network-level view of global connectivity, into the low-level hardware tables used by the OpenFlow data plane
    * Each RIB entry becomes two OpenFlow tables: a Flow table that maps prefixes to entries in an ECMP Group table
  <figure><img src="https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FeInU6f8pbTLw4xNxFAKU%2Fimage.png?alt=media&#x26;token=29f7c6cc-311d-433f-9427-f54a88e28360" alt=""><figcaption></figcaption></figure>
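
The two-table split can be sketched as follows. This is a toy model under my own assumptions (table layout, hashing scheme), not the switch's actual pipeline: a Flow table maps a prefix to an ECMP group id, and the group table holds the candidate egress ports, one of which is picked by hashing the flow key.

```python
# Illustrative two-stage lookup: Flow table -> ECMP Group table -> port.
import hashlib

flow_table = {}        # prefix -> ECMP group id
ecmp_group_table = {}  # group id -> list of egress ports

def install_rib_entry(prefix, next_hop_ports, group_id):
    """Translate one RIB entry into the two OpenFlow tables."""
    ecmp_group_table[group_id] = next_hop_ports
    flow_table[prefix] = group_id

def forward(prefix, flow_key):
    """Look up the prefix, then hash the flow key over the group's ports."""
    group = ecmp_group_table[flow_table[prefix]]
    h = int(hashlib.md5(flow_key.encode()).hexdigest(), 16)
    return group[h % len(group)]

install_rib_entry("10.0.0.0/8", [1, 2, 3, 4], group_id=7)
port = forward("10.0.0.0/8", "srcip:dstip:sport:dport")
assert port in [1, 2, 3, 4]
```

Flows with the same key always hash to the same port, which keeps per-flow packet ordering while spreading load across the group.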

### Traffic Engineering&#x20;

<figure><img src="https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2Fou6NiC8BaK3GPARWagY0%2Fimage.png?alt=media&#x26;token=ff482b70-fac2-4a14-9c93-d08e30f16808" alt=""><figcaption></figcaption></figure>

* Goal: share bandwidth among competing applications possibly using multiple paths&#x20;
* Objective function: deliver **max-min fair allocation** to applications
  * maximizes utilization as long as further gains in utilization are not achieved by penalizing the fair share of applications
* Notion
  * Flow Group (FG): TE cannot operate at the granularity of individual applications; aggregate applications into a Flow Group, defined as <mark style="background-color:yellow;">{src site, dest site, QoS}</mark>
* **Bandwidth functions**&#x20;
  * specifies the bandwidth allocation to an application given the flow’s relative priority on an arbitrary, dimensionless scale, which we call its *fair share*&#x20;
  * derived from administrator-specified static weights
    * Q: how are flows detected? What about dynamic weights?
  * Bandwidth functions are configured, measured and provided to TE via *Bandwidth Enforcer*
    * an FG’s bandwidth function is a piecewise linear additive composition of per-application bandwidth functions
      * Each FG multiplexes multiple application demands from one site to another&#x20;
    * Max-min objective of TE is on this per-FG fair share dimension&#x20;
  * Bandwidth enforcer also aggregates bandwidth functions across multiple applications&#x20;
* **Optimization algorithm:** achieves fairness similar to the LP optimum and at least 99% of its bandwidth utilization, while running 25x faster than the LP
  * (1) Tunnel Group Generation
    * Allocate bandwidth to FGs, using bandwidth functions to prioritize at bottleneck edges
  * (2) Tunnel Group Quantization
    * Adjust split ratios in each TG to match the granularity supported by the switch hardware tables
* Example&#x20;
  * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2Fxba6jdvCZlmItA40Rb7p%2Fimage.png?alt=media\&token=24448ec1-60f8-4f0f-8081-4cde42f52191)
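
The max-min idea over bandwidth functions can be sketched on a single bottleneck link. This is a toy progressive-filling model with hypothetical numbers, not the paper's algorithm (which runs over the full topology and tunnel sets): each FG's bandwidth function maps a common *fair share* to an allocation, and the fair share is raised until the edge capacity is exhausted.

```python
# Sketch of max-min fair allocation at one 10 Gbps bottleneck edge.
# Each FG has a piecewise-linear bandwidth function: a weight-determined
# slope up to a demand cap (both values are illustrative assumptions).

def bw(slope, cap, fair_share):
    """Piecewise-linear bandwidth function: linear up to a demand cap."""
    return min(fair_share * slope, cap)

def max_min_allocate(fgs, capacity, step=0.01, max_share=100.0):
    """fgs: list of (slope, cap). Progressively raise the common fair
    share until the next step would exceed the edge capacity."""
    share = 0.0
    while share < max_share:
        total = sum(bw(slope, cap, share + step) for slope, cap in fgs)
        if total > capacity:
            break
        share += step
    return share, [bw(slope, cap, share) for slope, cap in fgs]

# Two FGs competing for a 10 Gbps edge: FG1 has twice FG2's weight.
share, alloc = max_min_allocate([(2.0, 20.0), (1.0, 20.0)], capacity=10.0)
assert abs(sum(alloc) - 10.0) < 0.1   # the edge is (nearly) fully used
assert alloc[0] > alloc[1]            # higher weight -> larger allocation
```

Both FGs end at the same fair share, but the higher-weight FG receives more bandwidth at that share, which is exactly the prioritization the bandwidth functions encode.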

### TE protocol, OpenFlow, how it's implemented

* ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FstXR1F6esggsEtnZLWHd%2Fimage.png?alt=media\&token=e7cec6fd-0362-4fa1-ac12-5b566d5dabba)
<figure><img src="https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FXLrHj5HPGd1qFXphPv7Q%2Fimage.png?alt=media&#x26;token=6e3b61fc-9d0e-4b92-bb94-432067597071" alt=""><figcaption></figcaption></figure>

* ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FsFxSEg2SaCLudhdTTLXT%2Fimage.png?alt=media\&token=4e40fe72-9537-488d-8529-250f3f9bbaf6)
* Some additional discussions on dependencies and failures&#x20;
  * Dependencies among ops
  * Synchronizing TED between TE and OFC: compute difference, Session ID&#x20;
  * Ordering issues: sequence ID&#x20;
  * TE op failures: (Dirty/Clean) bit for each TED entry&#x20;
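
The bookkeeping in these bullets can be modeled compactly. This is an illustrative sketch with assumed field names, not the real TE/OFC protocol: a session ID guards against a stale controller, a per-session sequence ID enforces op ordering, and a dirty bit marks TED entries whose last op has not yet been acknowledged.

```python
# Toy model of TED synchronization: session id, sequence id, dirty bit.

class TedEntry:
    def __init__(self, value):
        self.value = value
        self.dirty = True   # set on every mutation, cleared on switch ACK

class OfcSession:
    def __init__(self, session_id):
        self.session_id = session_id
        self.next_seq = 0
        self.ted = {}

    def apply_op(self, session_id, seq, key, value):
        # Reject ops from a stale session or delivered out of order.
        if session_id != self.session_id or seq != self.next_seq:
            return False
        self.next_seq += 1
        self.ted[key] = TedEntry(value)
        return True

    def ack(self, key):
        self.ted[key].dirty = False  # clean: op committed in the data plane

ofc = OfcSession(session_id=42)
assert ofc.apply_op(42, 0, "tunnel-1", "A->B->C")       # in order: accepted
assert not ofc.apply_op(42, 2, "tunnel-2", "A->D->C")   # out of order
assert not ofc.apply_op(41, 1, "tunnel-2", "A->D->C")   # stale session
assert ofc.ted["tunnel-1"].dirty                        # not yet ACKed
ofc.ack("tunnel-1")
assert not ofc.ted["tunnel-1"].dirty
```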

### Deployments and Evals

* Deployment Take&#x20;
  * i) topology aggregation significantly reduces path churn and system load
  * ii) even with topology aggregation, edge removals happen multiple times a day
  * iii) WAN links are susceptible to frequent port flaps and benefit from dynamic centralized management
* TE Ops Performance: monthly distribution of ops issued, failure rate, latency distribution for two main TE operations (Tunnel addition and Tunnel Group mutation)&#x20;
* Impact of Failures&#x20;
  * A single link failure
  * An encap switch failure and separately the failure of its neighboring transit router (much longer convergence time)&#x20;
  * An OFC failover
  * A TE server failover&#x20;
  * Disabling/enabling TE&#x20;
* TE Algorithm Evaluation
  * 14% throughput increase, main benefits come during periods of failure or high demand&#x20;
* Link utilization and hashing&#x20;
  * Most WANs: 30-40% utilization&#x20;
  * But B4: \~100%

#### Lessons learned from an outage

* Planned maintenance operation --> one of the new switches was inadvertently manually configured with the same ID as an existing switch --> link flaps, switches declare interfaces down, breaking BGP adjacencies with remote sites
* Lessons&#x20;
  * Scalability and latency of the packet IO path between OFC and OFA is critical&#x20;
  * OFA should be async and multi-threaded&#x20;
  * Need additional performance profiling and reporting&#x20;
  * With TE, they "fail open" --> the data plane retains its last state on a control failure, so it is not possible to distinguish physical failures from failures of the control software
    * The compromise rests on the assumption that hardware is more reliable than control software
    * Requires application-level signals of broken connectivity to disambiguate between WAN hardware and software failures
  * TE server must be adaptive to failed / unresponsive OFCs when modifying TGs that depend on creating new Tunnels&#x20;
  * Most failures involve the inevitable human error that occurs in managing large, complex systems
  * Critical to measure system performance at its breaking point with published envelopes regarding system scale&#x20;

#### Some takes&#x20;

* Is the paper / problem / insight efficient?&#x20;
  * Efficiency argument&#x20;
    * Make a general observation about the traffic elasticity&#x20;
    * "Shift" in mindset: networking at efficiency (this is done at compute, etc..)&#x20;
    * Test of time award of Sigcomm&#x20;
    * AT\&T had large WANs but they were never aware of this --> be able to exploit this is important&#x20;
  * Use cases
    * Enterprise internal DC networks (only \~5% utilization)
    * Cloud tenants&#x20;
      * Cloud / hot potato routing, tiers of paid bandwidth
  * Traffic Engineering (TE)&#x20;
    * Improve utilization&#x20;
    * "Scalability" for more # of DCs is a question&#x20;
      * AT\&T has done the centralized controller, but twick routing&#x20;
* Industry paper?&#x20;
  * Some student answers
    * 1\) FS paper feels like something other papers can build upon and extend
    * 2\) Take insights from industry paper&#x20;
    * 3\) "Access" to things are own by the industries&#x20;
    * 4\) No comparisons to other solutions&#x20;
  * Sylvia's take:&#x20;
    * When reviewing a paper @ SIGCOMM, we need a good problem statement, insight, and eval insights
    * In industry, it's often not realistic to compare against other solutions
    * Industry is an important participant; SIGCOMM reviewers have said we need more industry papers
    * "practical", "at-scale", "vendor-support"&#x20;
