B4: Experience with a Globally-Deployed Software Defined WAN
https://cseweb.ucsd.edu/~vahdat/papers/b4-sigcomm13.pdf
Problem
WAN links are typically provisioned at 30-40% avg utilization
WAN links are expensive, and packet loss is typically considered unacceptable
High-end, specialized equipment that places a premium on high availability
Treat all bits the same
Google WAN
Control over everything (apps, servers, LANs, edge)
Bandwidth-intensive apps perform large-scale data copies from one site to another
Anticipate no more than a few dozen data center deployments, making central control of bandwidth feasible
Design centers around
Accepting failures as inevitable and common events whose effects are exposed to the application
??
Switch hardware that exports a simple interface to program forwarding table entries under central control
Use cases: routing protocols and centralized traffic engineering
Background
Two types of WANs
User-facing network: peers / exchanges traffic with other Internet domains
Requirements: support a wide range of protocols, a denser physical topology, and (for content delivery) the highest levels of availability
B4: connectivity between data centers
Workload: user data copies for availability, remote storage access for computation over inherently distributed data sources, and large-scale data pushes synchronizing state across multiple DCs
Ordered in increasing volume, decreasing latency sensitivity, and decreasing overall priority
Design
Elastic bandwidth demand
Moderate number of sites
End application control
Cost sensitivity
Architecture
Switch hardware: forward traffic
Site controller layer: NCS hosting both OpenFlow controllers (OFC) and Network Control Applications (NCA)
Global layer: logically centralized applications (e.g. SDN gateway, TE servers)
Switch design
Build their own hardware
Insight: don't need deep buffers, very large forwarding tables, or hardware support for high availability [which come with cost and complexity]
Motivation: careful endpoint management, a small set of DCs, switch failures typically resulting from software rather than hardware, and no existing platform could support an SDN deployment
Network control
Functionality runs on NCS in the site controller layer collocated with the switch hardware
Paxos: handles leader selection for all control functionality
At each site, perform application-layer failure detection
When a majority of the Paxos servers detect a failure, they elect a new leader among the remaining set of available servers
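A minimal sketch (not Google's implementation) of the flow described above: application-level health reports feed a quorum check, and once a majority of servers consider the current leader failed, a new leader is deterministically picked from the servers a majority still sees as healthy. All names (NCS_SERVERS, elect_leader, the report format) are hypothetical; in the real system Paxos provides the agreement, the point here is only the majority-based detection and re-election.

```python
# Hypothetical per-site leader election driven by application-level
# health checks; a majority must agree before the leader is replaced.
NCS_SERVERS = ["ncs-0", "ncs-1", "ncs-2", "ncs-3", "ncs-4"]

def elect_leader(health_reports: dict[str, dict[str, bool]], current_leader: str) -> str:
    """health_reports[observer][target] = True if observer thinks target is up."""
    quorum = len(NCS_SERVERS) // 2 + 1
    # Count how many observers consider the current leader failed.
    votes_down = sum(1 for obs in health_reports.values() if not obs.get(current_leader, False))
    if votes_down < quorum:
        return current_leader  # no majority-detected failure; keep the leader
    # Majority detected a failure: pick a new leader among servers a
    # majority still considers healthy (deterministic choice so all agree).
    for candidate in NCS_SERVERS:
        votes_up = sum(1 for obs in health_reports.values() if obs.get(candidate, False))
        if candidate != current_leader and votes_up >= quorum:
            return candidate
    return current_leader  # no quorum-approved replacement available

# Example: ncs-0 is seen as down by 3 of 5 observers -> ncs-1 becomes leader.
reports = {s: {"ncs-0": False, "ncs-1": True, "ncs-2": True, "ncs-3": True, "ncs-4": True}
           for s in ["ncs-1", "ncs-2", "ncs-3"]}
reports["ncs-0"] = {t: t == "ncs-0" for t in NCS_SERVERS}
reports["ncs-4"] = {t: True for t in NCS_SERVERS}
print(elect_leader(reports, "ncs-0"))  # -> "ncs-1"
```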
Routing
Routing application proxy (RAP)
RAP translates RIB entries, which form a network-level view of global connectivity, into the low-level hardware tables used by the OpenFlow data plane
RAP translates each RIB entry into two OpenFlow tables: a Flow table, which maps prefixes to entries in an ECMP Group table (sketch below)
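A hedged sketch of that two-table split: each RIB entry (prefix -> set of next hops) becomes a Flow-table rule pointing at an ECMP group ID, and each distinct next-hop set becomes one ECMP Group-table entry listing the egress ports to hash across. The table layouts and the dedup-by-next-hop-set detail are illustrative assumptions, not the exact OpenFlow structures B4 programs.

```python
# Illustrative RIB -> (Flow table, ECMP Group table) translation.
def translate_rib(rib: dict[str, list[str]], port_of: dict[str, int]):
    flow_table = {}        # prefix -> ECMP group id
    ecmp_group_table = {}  # group id -> list of output ports
    groups = {}            # dedupe: identical next-hop sets share one group
    for prefix, next_hops in rib.items():
        ports = tuple(sorted(port_of[nh] for nh in next_hops))
        if ports not in groups:
            groups[ports] = len(groups)                # allocate a new group id
            ecmp_group_table[groups[ports]] = list(ports)
        flow_table[prefix] = groups[ports]
    return flow_table, ecmp_group_table

# Example: two prefixes sharing the same pair of next hops reuse one ECMP group.
rib = {"10.0.0.0/24": ["siteB-sw1", "siteB-sw2"],
       "10.0.1.0/24": ["siteB-sw1", "siteB-sw2"],
       "10.0.2.0/24": ["siteC-sw1"]}
ports = {"siteB-sw1": 3, "siteB-sw2": 4, "siteC-sw1": 7}
print(translate_rib(rib, ports))
```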
Traffic Engineering
Goal: share bandwidth among competing applications possibly using multiple paths
Objective function: deliver max-min fair allocation to applications
maximizes utilization as long as further gains in utilization are not achieved by penalizing the fair share of applications
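A toy illustration of the max-min property itself (not B4's algorithm, which works on per-FG fair share via bandwidth functions): repeatedly satisfy the smallest remaining demand and split the leftover capacity equally, so no flow can gain bandwidth except by taking it from an equal-or-smaller flow. Capacity and demand numbers are made up.

```python
# Water-filling max-min allocation on a single shared link (toy example).
def max_min_allocate(capacity: float, demands: dict[str, float]) -> dict[str, float]:
    alloc = {}
    remaining = dict(demands)
    while remaining and capacity > 0:
        share = capacity / len(remaining)               # equal split of what is left
        flow, demand = min(remaining.items(), key=lambda kv: kv[1])
        if demand <= share:
            alloc[flow] = demand                        # fully satisfied flow
            capacity -= demand
            del remaining[flow]
        else:
            for f in remaining:                         # all remaining demands exceed
                alloc[f] = share                        # the equal share: cap them
            capacity, remaining = 0, {}
    return alloc

print(max_min_allocate(10, {"copy": 8, "storage": 3, "sync": 2}))
# -> {'sync': 2, 'storage': 3, 'copy': 5}
```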
Notion
Flow Group (FG): TE cannot operate at the granularity of individual applications; aggregate applications into a Flow Group, defined as {src site, dest site, QoS}
Bandwidth functions
specifies the bandwidth allocation to an application given the flow’s relative priority on an arbitrary, dimensionless scale, which we call its fair share
derived from administrator-specified static weights
q: flow detection? what about dynamic
Bandwidth functions are configured, measured and provided to TE via Bandwidth Enforcer
an FG’s bandwidth function is a piecewise linear additive composition of per-application bandwidth functions
Each FG multiplexes multiple application demands from one site to another
Max-min objective of TE is on this per-FG fair share dimension
Bandwidth enforcer also aggregates bandwidth functions across multiple applications
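A sketch of that composition, under the simplifying assumption that each application's bandwidth function has the shape min(weight * fair_share, demand); the FG's bandwidth function is then the additive (piecewise-linear) composition of its applications' functions. Weights and demands below are invented for illustration; in B4 the functions arrive already configured via Bandwidth Enforcer.

```python
# Per-application bandwidth functions and their additive composition into
# a Flow Group's bandwidth function (shapes and numbers are assumptions).
def app_bw_fn(weight: float, demand: float):
    """Bandwidth granted to one application at a given fair share."""
    return lambda fair_share: min(weight * fair_share, demand)

def fg_bw_fn(app_fns):
    """FG bandwidth function = additive composition of its apps' functions."""
    return lambda fair_share: sum(fn(fair_share) for fn in app_fns)

# One FG (src site, dst site, QoS) multiplexing two applications:
apps = [app_bw_fn(weight=10.0, demand=15.0),   # e.g. user-data copies
        app_bw_fn(weight=1.0,  demand=5.0)]    # e.g. background sync
fg = fg_bw_fn(apps)

for s in (0.5, 1.5, 3.0, 10.0):
    print(f"fair share {s:4}: FG gets {fg(s):5.1f} units of bandwidth")
```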
Optimization algorithm: achieves fairness similar to the LP optimum and at least 99% of its bandwidth utilization, with ~25x faster performance relative to the LP
(1) Tunnel Group Generation
Allocate bw to FGs using bandwidth functions to prioritize at bottleneck edges
(2) Group Quantization
Split ratios in each TG are quantized to match the granularity supported by the switch hardware tables (sketch below)
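An illustrative sketch of that quantization step, assuming the hardware can only express split ratios in quarters and that leftover quanta go greedily to the tunnels with the largest rounding error; both the 1/4 granularity and the redistribution rule are assumptions for illustration, not necessarily B4's exact procedure.

```python
# Quantize ideal per-tunnel split ratios to a fixed hardware granularity.
def quantize_splits(ideal: list[float], granularity: float = 0.25) -> list[float]:
    quanta = [int(r / granularity) for r in ideal]          # floor to whole quanta
    leftover = round(1 / granularity) - sum(quanta)
    # Hand the remaining quanta to the tunnels with the largest rounding error.
    errors = sorted(range(len(ideal)),
                    key=lambda i: ideal[i] - quanta[i] * granularity,
                    reverse=True)
    for i in errors[:leftover]:
        quanta[i] += 1
    return [q * granularity for q in quanta]

print(quantize_splits([0.47, 0.33, 0.20]))   # -> [0.5, 0.25, 0.25]
```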
Example
TE protocol and OpenFlow: how it's implemented
Some additional discussions on dependencies and failures
Dependencies among ops
Synchronizing TED between TE and OFC: compute difference, Session ID
Ordering issues: sequence ID
TE op failures: (Dirty/Clean) bit for each TED entry
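A sketch (assumed data structures, not the real TE/OFC protocol) tying the three mechanisms above together: a session ID that invalidates ops from a stale TE<->OFC session, a per-session sequence ID that keeps ops ordered, and a dirty/clean bit on each TED entry so an unacknowledged op stays visible and can be retried.

```python
from dataclasses import dataclass, field

@dataclass
class TedEntry:
    value: object
    dirty: bool = False          # True while an op on this entry is in flight

@dataclass
class TeSession:
    session_id: int              # regenerated whenever TE and an OFC reconnect
    next_seq: int = 0
    ted: dict = field(default_factory=dict)

    def issue_op(self, key, value):
        """TE side: record intent (dirty) and emit an ordered, tagged op."""
        self.ted[key] = TedEntry(value, dirty=True)
        op = {"session": self.session_id, "seq": self.next_seq,
              "key": key, "value": value}
        self.next_seq += 1
        return op

    def ack(self, key):
        """TE side: the OFC acknowledged the op; the entry is clean again."""
        self.ted[key].dirty = False

class Ofc:
    """OFC side: reject ops from stale sessions or arriving out of order."""
    def __init__(self, session_id):
        self.session_id = session_id
        self.expected_seq = 0
        self.state = {}

    def apply(self, op) -> bool:
        if op["session"] != self.session_id or op["seq"] != self.expected_seq:
            return False         # stale session or reordered op: drop, let TE retry
        self.state[op["key"]] = op["value"]
        self.expected_seq += 1
        return True

te, ofc = TeSession(session_id=7), Ofc(session_id=7)
op = te.issue_op("tunnel-group:A->B", {"splits": [0.5, 0.25, 0.25]})
if ofc.apply(op):
    te.ack("tunnel-group:A->B")
print(te.ted["tunnel-group:A->B"].dirty)   # -> False
```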
Deployments and Evals
Deployment takeaways
i) topology aggregation significantly reduces path churn and system load
ii) even with topology aggregation, edge removals happen multiple times a day
iii) WAN links are susceptible to frequent port flaps and benefit from dynamic centralized management
TE Ops Performance: monthly distribution of ops issued, failure rate, latency distribution for two main TE operations (Tunnel addition and Tunnel Group mutation)
Impact of Failures
A single link failure
An encap switch failure and separately the failure of its neighboring transit router (much longer convergence time)
An OFC failover
A TE server failover
Disabling/enabling TE
TE Algorithm Evaluation
14% throughput increase, main benefits come during periods of failure or high demand
Link utilization and hashing
Most WANs: 30-40% utilization
But B4: ~100%
Lessons learned from an outage
Planned maintenance operation --> one of the new switches was inadvertently manually configured with the same ID as an existing switch --> link flaps, switches declared interfaces down, breaking BGP adjacencies with remote sites
Lessons
Scalability and latency of the packet IO path between OFC and OFA is critical
OFA should be async and multi-threaded
Need additional performance profiling and reporting
With TE, they "fail open" (the data plane keeps forwarding on its last-known state when the control plane fails) --> it is not possible to distinguish between physical failures and failures of the associated control software
But the compromise is acceptable because hardware is more reliable than control software
Require application-level signals of broken connectivity to disambiguate between WAN hardware and software failures
TE server must be adaptive to failed / unresponsive OFCs when modifying TGs that depend on creating new Tunnels
Most failures involve the inevitable human error that occurs in managing large, complex systems
Critical to measure system performance at its breaking point with published envelopes regarding system scale
Some takes
Is the paper / problem / insight about efficiency?
Efficiency argument
Makes a general observation about traffic elasticity
"Shift" in mindset: running the network for efficiency (as is already done for compute, etc.)
Won the SIGCOMM Test of Time award
AT&T had large WANs but was never aware of this traffic elasticity --> being able to exploit it is important
Use cases
Enterprise networks internally in the DC (only ~5% utilization)
Cloud tenants
Cloud / hot-potato routing, paid tiers of bandwidth
Traffic Engineering (TE)
Improve utilization
"Scalability" for more # of DCs is a question
AT&T has done the centralized controller, but to tweak routing
Industry paper?
Some student answers
1) The FS paper feels like something other papers can build upon and extend
2) Take insights from industry paper
3) "Access" to things are own by the industries
4) No comparisons to other solutions
Sylvia's take:
When reviewing a paper at SIGCOMM, we need a good problem statement, insight, and eval insights
In industry, it's often not realistic to compare against other solutions
Industry players are important participants; SIGCOMM reviewers have said we need more industry papers
"practical", "at-scale", "vendor-support"