B4: Experience with a Globally-Deployed Software Defined WAN

https://cseweb.ucsd.edu/~vahdat/papers/b4-sigcomm13.pdf

Problem

WAN links are typically provisioned at 30-40% avg utilization
- WAN links are expensive, packet loss is typically thought unacceptable
- High-end, specialized equipment that places a premium on high availability
- Treat all bits the same
Google WAN
- Control over everything (apps, servers, LANs, edge)
- Bandwidth-intensive apps perform large-scale data copies from one site to another
- Anticipate no more than a few dozen data center deployments, making central control of bandwidth feasible
Design centers around
- Accepting failures as inevitable and common events whose effects are exposed to the application
  - ??
- Switch hardware that exports a simple interface to program forwarding table entries under central control
Use cases: routing protocols and centralized traffic engineering

Background

Two types of WANs
- User-facing network: peers / exchange traffic with the other Internet domains
  - Requirement: support a wide range of protocols, physical topology will be more dense, in content delivery must support highest level of availability
- B4: connectivity between data centers
  - Workload: user data copies for availability, remote storage access for computation over inherently distributed data sources, large-scale data push synchronizing state across multiple DCs
    Ordered in increasing volume, decreasing latency sensitivity, and decreasing overall priority
  - Design
    Elastic bandwidth demand
    Moderate number of sites
    End application control
    Cost sensitivity

Architecture

Switch hardware: forward traffic
Site controller layer: NCS hosting both OpenFlow controllers (OFC) and Network Control Applications (NCA)
Global layer: logically centralized applications (e.g. SDN gateway, TE servers)

Switch design

Build their own hardware
Insight: don't need deep buffers, very large forwarding tables, or hardware support for high availability (all of which add cost and complexity)
Motivation: careful endpoint management, a small set of DCs, switch failures typically stem from software rather than hardware issues, no existing platform could support an SDN deployment

Network control

Functionality runs on NCS in the site controller layer collocated with the switch hardware
Paxos: handles leader election for all control functionality
- At each site, perform application-layer failure detection
- When a majority of the Paxos servers detect a failure, they elect a new leader among the remaining set of available servers
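The majority rule above can be sketched as follows. This is only the gate that triggers an election, not Paxos itself: the function name, the vote dictionary, and the lowest-name tie-break are hypothetical choices for illustration.

```python
from typing import Dict, List, Optional


def elect_leader(current_leader: str,
                 failure_votes: Dict[str, bool],
                 available: List[str]) -> Optional[str]:
    """Return the leader after one round of application-layer failure detection.

    failure_votes maps each Paxos server to True if it believes the
    current leader has failed (hypothetical representation).
    """
    suspecting = sum(1 for suspects in failure_votes.values() if suspects)
    if suspecting * 2 <= len(failure_votes):
        return current_leader  # no strict majority: keep the current leader
    # Assumption: pick the lowest-named remaining server. A real Paxos
    # deployment would run a proposal round among the available servers.
    candidates = sorted(s for s in available if s != current_leader)
    return candidates[0] if candidates else None
```

With three servers, one suspicious vote leaves the leader in place, while two votes trigger a change.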

Routing

Routing application proxy (RAP)
- RAP translates from RIB entries forming a network-level view of global connectivity to the low-level hardware tables used by the OpenFlow data plane
  - RAP translates each RIB entry into two OpenFlow tables: a Flow table that maps prefixes to entries in an ECMP Group table.
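A minimal sketch of the RIB-to-OpenFlow translation, assuming a RIB represented as a dict from prefix to next-hop ports. The data layout, the function name `rib_to_openflow`, and the choice to share one group between prefixes with identical next hops are assumptions, not details from the paper.

```python
from typing import Dict, List, Tuple


def rib_to_openflow(rib: Dict[str, List[str]]):
    """Split a RIB into a Flow table and an ECMP Group table.

    rib maps prefix -> list of next-hop ports (hypothetical representation).
    Returns (flow_table, group_table):
      flow_table:  prefix   -> group id
      group_table: group id -> tuple of next-hop ports
    """
    flow_table: Dict[str, int] = {}
    group_table: Dict[int, Tuple[str, ...]] = {}
    group_ids: Dict[Tuple[str, ...], int] = {}
    for prefix, next_hops in rib.items():
        key = tuple(sorted(next_hops))
        if key not in group_ids:
            # Assumption: prefixes with the same next-hop set share a group.
            gid = len(group_ids) + 1
            group_ids[key] = gid
            group_table[gid] = key
        flow_table[prefix] = group_ids[key]
    return flow_table, group_table
```

Two prefixes with the same next-hop set end up pointing at the same group id, so the Group table stays smaller than the Flow table.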

Traffic Engineering

Goal: share bandwidth among competing applications possibly using multiple paths
Objective function: deliver max-min fair allocation to applications
- maximizes utilization as long as further gain in utilization is not achieved by penalizing fair share of applications
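To make the max-min objective concrete, here is a sketch of classic max-min fair sharing on a single link with equal weights. This is the textbook water-filling procedure, not B4's TE algorithm, which works across multiple paths and uses bandwidth functions; the names and numbers are illustrative.

```python
def max_min_fair(capacity, demands):
    """Max-min fair split of one link's capacity among demands (equal weights).

    demands maps an application name to its requested bandwidth.
    Small demands are satisfied in full; the rest split what remains evenly.
    """
    alloc = {}
    remaining = float(capacity)
    pending = sorted(demands.items(), key=lambda kv: kv[1])
    for i, (name, demand) in enumerate(pending):
        # Equal share of what is left among the applications not yet served.
        fair_share = remaining / (len(pending) - i)
        alloc[name] = min(demand, fair_share)
        remaining -= alloc[name]
    return alloc
```

For a 10-unit link and demands of 2, 8 and 10, the small demand is met in full and the other two get 4 each. No application can gain without taking from one that has an equal or smaller allocation.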
Notion
- Flow Group (FG): TE cannot operate at the granularity of individual applications; applications are aggregated into a Flow Group defined as {src site, dest site, QoS}
Bandwidth functions
- specifies the bandwidth allocation to an application given the flow’s relative priority on an arbitrary, dimensionless scale, which we call its fair share
- derived from administrator-specified static weights
  - q: flow detection? what about dynamic
- Bandwidth functions are configured, measured and provided to TE via Bandwidth Enforcer
  - an FG’s bandwidth function is a piecewise linear additive composition of per-application bandwidth functions
    Each FG multiplexes multiple application demands from one site to another
  - Max-min objective of TE is on this per-FG fair share dimension
- Bandwidth enforcer also aggregates bandwidth functions across multiple applications
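A sketch of bandwidth functions and their additive composition into an FG function. It assumes the simplest shape, a line with slope equal to the application's weight that flattens at its demand; the paper's functions can have more segments, and the helper names here are hypothetical.

```python
def make_bandwidth_function(weight, demand):
    """Per-application bandwidth function (assumed shape).

    bandwidth(fair_share) = min(weight * fair_share, demand)
    """
    def bw(fair_share):
        return min(weight * fair_share, demand)
    return bw


def compose(functions):
    """FG bandwidth function: the sum of its applications' functions."""
    def fg_bw(fair_share):
        return sum(f(fair_share) for f in functions)
    return fg_bw


# Illustrative numbers only: a high-weight app capped at 15 units and a
# low-weight app capped at 5 units, multiplexed into one FG.
app1 = make_bandwidth_function(weight=10, demand=15)
app2 = make_bandwidth_function(weight=1, demand=5)
fg = compose([app1, app2])
```

At fair share 1.0 the FG gets 11 units (10 + 1). At 2.0 it gets 17, because app1 has already flattened at 15. The sum of piecewise linear functions is again piecewise linear, which is why TE can work on the per-FG fair share dimension.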
Optimization algorithm: achieves fairness similar to the LP optimum and at least 99% of its bandwidth utilization, while running 25x faster than LP
- (1) Tunnel Group Generation
  - Allocate bw to FGs using bandwidth functions to prioritize at bottleneck edges
- (2) Tunnel Group Quantization
  - Adjust split ratios in each TG to match the granularity supported by the switch HW tables
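The quantization step can be illustrated with a largest-remainder rounding scheme. This is a swapped-in technique used only to show what matching hardware granularity means; it is not claimed to be the paper's quantization algorithm, and `slots` is a hypothetical parameter for the number of table entries available per TG.

```python
import math


def quantize_splits(ratios, slots):
    """Round split ratios to multiples of 1/slots that still sum to 1.

    ratios: ideal split of an FG's traffic across the tunnels in its TG.
    slots:  number of equal-weight entries the hardware offers (assumed).
    """
    exact = [r * slots for r in ratios]
    counts = [math.floor(x) for x in exact]
    leftover = slots - sum(counts)
    # Hand leftover slots to the tunnels with the largest fractional remainder.
    order = sorted(range(len(ratios)),
                   key=lambda i: exact[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return [c / slots for c in counts]
```

With only two slots, an ideal 0.7/0.3 split becomes 0.5/0.5 and a 0.9/0.1 split becomes 1.0/0.0. Coarse quantization therefore costs some of the ideal allocation, and more slots shrink the gap.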
Example

TE protocol, OpenFlow, how it's implemented

Some additional discussions on dependencies and failures
- Dependencies among ops
- Synchronizing TED between TE and OFC: compute difference, Session ID
- Ordering issues: sequence ID
- TE op failures: (Dirty/Clean) bit for each TED entry
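A sketch of how session IDs, sequence IDs, and Dirty/Clean bits could fit together. The classes `Ofc` and `TeServer` and their methods are hypothetical and only model the three checks listed above, under the assumption that the OFC rejects ops from another session or with a non-increasing sequence ID.

```python
class Ofc:
    """Hypothetical OFC side: rejects ops from old sessions or stale sequence IDs."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.highest_seq = -1
        self.ted = {}

    def apply(self, session_id, seq, key, value):
        if session_id != self.session_id or seq <= self.highest_seq:
            return False  # wrong session or out-of-order op
        self.highest_seq = seq
        self.ted[key] = value
        return True


class TeServer:
    """Hypothetical TE side: tracks a Dirty/Clean bit per TED entry."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.next_seq = 0
        self.ted = {}  # key -> (value, dirty)

    def issue(self, ofc, key, value):
        self.ted[key] = (value, True)        # mark Dirty before sending
        seq = self.next_seq
        self.next_seq += 1
        if ofc.apply(self.session_id, seq, key, value):
            self.ted[key] = (value, False)   # acknowledged: mark Clean
        return not self.ted[key][1]

    def dirty_keys(self):
        """Entries whose last op was not acknowledged and needs a retry."""
        return sorted(k for k, (_, dirty) in self.ted.items() if dirty)
```

An op that the OFC rejects leaves its entry Dirty, so the TE side knows which state still has to be re-sent or reconciled.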

Deployments and Evals

Deployment takeaways
- i) topology aggregation significantly reduces path churn and system load
- ii) even with topology aggregation, edge removals happen multiple times a day
- iii) WAN links are susceptible to frequent port flaps and benefit from dynamic centralized management
TE Ops Performance: monthly distribution of ops issued, failure rate, latency distribution for two main TE operations (Tunnel addition and Tunnel Group mutation)
Impact of Failures
- A single link failure
- An encap switch failure and separately the failure of its neighboring transit router (much longer convergence time)
- An OFC failover
- A TE server failover
- Disabling/enabling TE
TE Algorithm Evaluation
- 14% throughput increase, main benefits come during periods of failure or high demand
Link utilization and hashing
- Most WANs: 30-40% utilization
- But B4: ~100%

Lessons learned from an outage

Planned maintenance operation --> one of the new switches was inadvertently manually configured with the same ID as an existing switch --> link flaps, switches declare interfaces down, breaking BGP adjacencies with remote sites
Lessons
- Scalability and latency of the packet IO path between OFC and OFA is critical
- OFA should be async and multi-threaded
- Need additional performance profiling and reporting
- With TE, they "fail open" --> it is not possible to distinguish between physical failures of the switch hardware and failures of the associated control plane
  - A reasonable compromise, since hardware is more reliable than control software
  - Require application-level signals of broken connectivity to disambiguate between WAN hardware and software failures
- TE server must be adaptive to failed / unresponsive OFCs when modifying TGs that depend on creating new Tunnels
- Most failures involve the inevitable human error that occurs in managing large, complex systems
- Critical to measure system performance at its breaking point with published envelopes regarding system scale

Some takes

Is the paper / problem / insight efficient?
- Efficiency argument
  - Make a general observation about the traffic elasticity
  - "Shift" in mindset: run the network for efficiency (as is already done for compute, etc.)
  - SIGCOMM Test of Time Award
  - AT&T had large WANs but were never aware of this --> being able to exploit it is important
- Use cases
  - Enterprise internally in DC (only has 5% utilization)
  - Cloud tenants
    Cloud / hot potato routing, tiers of BW paid
- Traffic Engineering (TE)
  - Improve utilization
  - "Scalability" for more # of DCs is a question
    AT&T has done the centralized controller, but to tweak routing
Industry paper?
- Some student answers
  - 1) FS paper feels like something other papers can build upon and extend
  - 2) Take insights from industry papers
  - 3) "Access" to things that are owned by industry
  - 4) No comparisons to other solutions
- Sylvia's take:
  - When reviewing the paper @ SIGCOMM, we need a good problem statement, insight, and eval insights
  - In industry, it's often not realistic to compare against other solutions
  - Industry is an important participant; SIGCOMM reviewers said we need more industry papers
  - "practical", "at-scale", "vendor-support"
