B4: Experience with a Globally-Deployed Software Defined WAN

https://cseweb.ucsd.edu/~vahdat/papers/b4-sigcomm13.pdf

Problem

  • WAN links are typically provisioned at 30-40% avg utilization

    • WAN links are expensive, and packet loss is typically considered unacceptable

    • High-end, specialized equipment that places a premium on high availability

    • Treat all bits the same

  • Google WAN

    • Control over everything (apps, servers, LANs, edge)

    • Bandwidth-intensive applications perform large-scale data copies from one site to another

    • Anticipate no more than a few dozen data center deployments, making centralized control of bandwidth feasible

  • Design centers around

    • Accepting failures as inevitable and common events whose effects are exposed to applications

      • ??

    • Switch hardware that exports a simple interface to program forwarding table entries under central control

  • Use cases: routing protocols and centralized traffic engineering

Background

  • Two types of WANs

    • User-facing network: peers / exchanges traffic with other Internet domains

      • Requirements: must support a wide range of protocols, has a denser physical topology, and (as it serves content delivery) must support the highest levels of availability

    • B4: connectivity between data centers

      • Workload: user data copies for availability, remote storage access for computation over inherently distributed data sources, and large-scale data pushes synchronizing state across multiple DCs

        • Ordered in increasing volume, decreasing latency sensitivity, and decreasing overall priority

      • Design

        • Elastic bandwidth demand

        • Moderate number of sites

        • End application control

        • Cost sensitivity

Architecture

  • Switch hardware: forward traffic

  • Site controller layer: network control servers (NCS) hosting both OpenFlow controllers (OFCs) and Network Control Applications (NCAs)

  • Global layer: logically centralized applications (e.g. SDN gateway, central TE server)
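
  • A rough sketch of the three-layer split, just to make the decomposition concrete. This is only an illustration; the class and field names (Switch, SiteControllerLayer, GlobalLayer, etc.) are my own, not the paper's actual software structure.

```python
# Illustrative sketch of B4's layering, not the actual implementation.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Switch:
    """Switch hardware layer: forwards traffic, programmed via OpenFlow."""
    switch_id: str
    flow_table: Dict[str, int] = field(default_factory=dict)

@dataclass
class SiteControllerLayer:
    """Per-site network control servers (NCS) hosting OFCs and NCAs."""
    ofcs: List[str]          # OpenFlow controllers, e.g. ["ofc-1", "ofc-2"]
    ncas: List[str]          # network control apps, e.g. ["RAP", "TE agent"]

@dataclass
class Site:
    name: str
    switches: List[Switch]
    controllers: SiteControllerLayer

@dataclass
class GlobalLayer:
    """Logically centralized applications spanning all sites."""
    sdn_gateway: str         # abstracts OpenFlow/site details from TE
    te_server: str           # central traffic engineering server
    sites: List[Site]
```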

Switch design

  • Built their own switch hardware

  • Insight: B4 does not need deep buffers, very large forwarding tables, or hardware high-availability support (all of which add cost and complexity)

  • Motivation: careful endpoint management, a small set of DCs, switch failures typically due to software rather than hardware faults, and no existing platform could support an SDN deployment

Network control

  • Functionality runs on NCS in the site controller layer, co-located with the switch hardware

  • Paxos: handles leader selection for all control functionality

    • Paxos instances at each site perform application-level failure detection

    • When a majority of the Paxos servers detect a failure, they elect a new leader among the remaining set of available servers
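
  • A toy sketch of the failover behavior described above: each NCS reports, via application-level health checks, whether it believes the current leader has failed, and once a majority agrees, a new leader is chosen from the remaining available servers. The real system uses Paxos for the election itself; the helper names here are assumptions for illustration.

```python
# Toy illustration only; B4 uses Paxos for the actual election.

def majority_detects_failure(health_reports: dict) -> bool:
    """health_reports: NCS server -> True if it believes the leader failed."""
    suspecting = sum(1 for failed in health_reports.values() if failed)
    return suspecting > len(health_reports) // 2

def elect_new_leader(available: list, old_leader: str) -> str:
    # Stand-in for the Paxos-based election: pick any available server
    # other than the one declared failed.
    candidates = sorted(s for s in available if s != old_leader)
    return candidates[0]

# Example: ncs-1 was leader; ncs-2 and ncs-3 both fail their checks on it.
reports = {"ncs-2": True, "ncs-3": True, "ncs-4": False}
if majority_detects_failure(reports):
    print(elect_new_leader(list(reports), old_leader="ncs-1"))  # -> ncs-2
```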

Routing

  • Routing application proxy (RAP)

    • RAP translates RIB entries, which form a network-level view of global connectivity, into the low-level hardware tables used by the OpenFlow data plane

      • RAP translates each RIB entry into two OpenFlow tables: a Flow table, which maps prefixes to entries in an ECMP Group table, and the ECMP Group table itself, whose entries identify the next-hop interfaces
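
  • A simplified sketch of this RIB-to-OpenFlow translation. The table layouts and field names are my own guesses, and the lookup is exact-match for brevity (real hardware does longest-prefix match).

```python
# Simplified sketch of RAP's translation: a RIB entry (prefix -> set of
# next hops) becomes (1) an ECMP Group table entry listing output ports
# and (2) a Flow table entry mapping the prefix to that group.

flow_table = {}        # prefix -> ecmp_group_id
ecmp_group_table = {}  # ecmp_group_id -> list of output ports

def install_rib_entry(prefix, next_hop_ports):
    group_id = len(ecmp_group_table)       # naive group-id allocation
    ecmp_group_table[group_id] = next_hop_ports
    flow_table[prefix] = group_id

def lookup(prefix, flow_hash):
    """Pick an output port for a packet by hashing across the ECMP group."""
    ports = ecmp_group_table[flow_table[prefix]]
    return ports[flow_hash % len(ports)]

install_rib_entry("10.0.0.0/8", next_hop_ports=[1, 2, 4])
print(lookup("10.0.0.0/8", flow_hash=7))   # -> 2 (ports[7 % 3])
```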

Traffic Engineering

  • Goal: share bandwidth among competing applications possibly using multiple paths

  • Objective function: deliver max-min fair allocation to applications

    • Maximizes utilization as long as further gains in utilization are not achieved by penalizing the fair share of applications

  • Notion

    • Flow Group (FG): TE cannot operate at the granularity of individual applications; applications are aggregated into a Flow Group, defined as {src site, dest site, QoS}

  • Bandwidth functions

    • Specifies the bandwidth allocation to an application given the flow's relative priority on an arbitrary, dimensionless scale, called its fair share

    • Derived from administrator-specified static weights

      • q: flow detection? what about dynamic

    • Bandwidth functions are configured, measured and provided to TE via Bandwidth Enforcer

      • an FG’s bandwidth function is a piecewise linear additive composition of per-application bandwidth functions

        • Each FG multiplexes multiple application demands from one site to another

      • The max-min objective of TE is defined over this per-FG fair share dimension

    • Bandwidth enforcer also aggregates bandwidth functions across multiple applications

  • Optimization algorithm: achieves fairness similar to the LP-optimal solution and at least 99% of its bandwidth utilization, while running ~25x faster than the LP

    • (1) Tunnel Group Generation

      • Allocates bandwidth to FGs, using their bandwidth functions to prioritize at bottleneck edges

    • (2) Group Quantization

      • Splits ratios in each TG to match the granularity supported by the switch hardware tables

  • Example: see the simplified allocation sketch below
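
  • A small sketch of the ideas above: piecewise-linear bandwidth functions composed per FG, then a search for the common fair share at which the FGs fill a single bottleneck link. The real algorithm iterates this over bottleneck edges and then quantizes tunnel-group splits; the function names and numbers here are made up for illustration.

```python
# Illustrative sketch, not B4's TE implementation.

def make_bw_function(points):
    """points: sorted (fair_share, bandwidth) pairs; piecewise-linear
    through (0, 0) and flat (saturated) after the last point."""
    def f(share):
        prev_s, prev_b = 0.0, 0.0
        for s, b in points:
            if share <= s:
                frac = (share - prev_s) / (s - prev_s)
                return prev_b + frac * (b - prev_b)
            prev_s, prev_b = s, b
        return points[-1][1]               # demand fully satisfied
    return f

def compose(fns):
    """An FG's bandwidth function is the sum of its apps' functions."""
    return lambda share: sum(f(share) for f in fns)

def max_min_share_on_link(fg_fns, capacity, hi=100.0, iters=50):
    """Binary-search the common fair share that fills one bottleneck link."""
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        used = sum(f(mid) for f in fg_fns)
        lo, hi = (mid, hi) if used < capacity else (lo, mid)
    return lo

# FG1 multiplexes a high-priority copy and a low-priority sync; FG2 is a
# single app.  All weights and capacities are invented for the example.
app_a = make_bw_function([(1.0, 1000.0)])    # 1 Gbps reached at fair share 1
app_b = make_bw_function([(10.0, 1000.0)])   # needs fair share 10 for 1 Gbps
fg1 = compose([app_a, app_b])
fg2 = make_bw_function([(5.0, 2000.0)])

share = max_min_share_on_link([fg1, fg2], capacity=2500.0)
print(round(share, 2), round(fg1(share)), round(fg2(share)))  # ~3.0 1300 1200
```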

TE protocol, OpenFlow, and how it's implemented

  • Some additional discussions on dependencies and failures

    • Dependencies among ops

    • Synchronizing the TED between the TE server and the OFC: compute the difference; identified by a session ID

    • Ordering issues: handled with sequence IDs

    • TE op failures: a dirty/clean bit for each TED entry
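
  • A toy illustration of the bookkeeping above: a session ID that invalidates stale state after restarts, per-op sequence IDs for ordering, and a dirty bit on each TED entry until its op is acknowledged. The structure and names are my guesses, not the paper's code.

```python
# Toy illustration of the TE-protocol bookkeeping, not B4's implementation.
from dataclasses import dataclass

@dataclass
class TedEntry:
    value: dict          # e.g. a tunnel or tunnel-group description
    dirty: bool = True   # dirty until the op is acknowledged by the OFC

class TeSession:
    def __init__(self, session_id):
        self.session_id = session_id   # bumped when TE server/OFC restarts
        self.next_seq = 0              # sequence ID used to order ops
        self.ted = {}                  # key -> TedEntry
        self.pending = {}              # seq -> TED key awaiting an ack

    def issue_op(self, key, value):
        """Install/modify a TED entry; it stays dirty until acked."""
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        self.ted[key] = TedEntry(value=value)
        self.pending[seq] = key
        return self.session_id, seq    # sent to the OFC along with the op

    def handle_ack(self, session_id, seq):
        if session_id != self.session_id:
            return                     # stale ack from an old session
        key = self.pending.pop(seq, None)
        if key is not None:
            self.ted[key].dirty = False  # op applied; entry is now clean

sess = TeSession(session_id=7)
sid, seq = sess.issue_op("tunnel:A->B:1", {"path": ["A", "C", "B"]})
sess.handle_ack(sid, seq)
print(sess.ted["tunnel:A->B:1"].dirty)   # -> False
```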

Deployments and Evals

  • Deployment takeaways

    • i) topology aggregation significantly reduces path churn and system load

    • ii) even with topology aggregation, edge removals happen multiple times a day

    • iii) WAN links are susceptible to frequent port flaps and benefit from dynamic centralized management

  • TE Ops Performance: monthly distribution of ops issued, failure rate, latency distribution for two main TE operations (Tunnel addition and Tunnel Group mutation)

  • Impact of Failures

    • A single link failure

    • An encap switch failure and separately the failure of its neighboring transit router (much longer convergence time)

    • An OFC failover

    • A TE server failover

    • Disabling/enabling TE

  • TE Algorithm Evaluation

    • 14% throughput increase, main benefits come during periods of failure or high demand

  • Link utilization and hashing

    • Most WANs: 30-40% utilization

    • But B4: ~100%

Lessons learned from an outage

  • Planned maintenance operation --> one of the new switches was inadvertently manually configured with the same ID as an existing switch --> link flaps, switches declared interfaces down, breaking BGP adjacencies with remote sites

  • Lessons

    • Scalability and latency of the packet IO path between OFC and OFA is critical

    • OFA should be asynchronous and multi-threaded

    • Need additional performance profiling and reporting

    • With TE, they "fail open" --> it is not possible to distinguish between physical failures of the data plane and loss of control-plane connectivity

      • But the compromise is acceptable because hardware is more reliable than control software

      • Require application-level signals of broken connectivity to disambiguate between WAN hardware and software failures

    • TE server must be adaptive to failed / unresponsive OFCs when modifying TGs that depend on creating new Tunnels

    • Most failures involve the inevitable human error that occurs in managing large, complex systems

    • Critical to measure system performance at its breaking point with published envelopes regarding system scale

Some takes

  • Is the paper / problem / insight about efficiency?

    • Efficiency argument

      • Make a general observation about the traffic elasticity

      • "Shift" in mindset: networking at efficiency (this is done at compute, etc..)

      • Received the SIGCOMM Test of Time award

      • AT&T had large WANs but they were never aware of this --> be able to exploit this is important

    • Use cases

      • Enterprise DC-internal networks (only ~5% utilization)

      • Cloud tenants

        • Cloud / hot-potato routing, tiers of paid bandwidth

    • Traffic Engineering (TE)

      • Improve utilization

      • "Scalability" for more # of DCs is a question

        • AT&T has done the centralized controller, but twick routing

  • Industry paper?

    • Some student answers

      • 1) The FS paper feels like one that other papers can build upon and extend

      • 2) Take insights from industry papers

      • 3) "Access" to things that are owned by industry

      • 4) No comparisons to other solutions

    • Sylvia's take:

      • When reviewing papers at SIGCOMM, we need a good problem statement, insight, and evaluation insights

      • For industry, it's often not realistic to compare against other solutions

      • Industry players are important participants; SIGCOMM reviewers have said we need more industry papers

      • "practical", "at-scale", "vendor-support"
