Fat tree

A Scalable, Commodity Data Center Network Architecture

Some notions

  • In computer networking, if a network is bisected into two partitions, the bisection bandwidth of the topology is the bandwidth available between the two partitions.

  • Bisection bandwidth gives a measure of the bandwidth available across the entire system.

  • Bisection bandwidth accounts for the bottleneck of the entire network, so it characterizes the network's bandwidth better than most other single metrics.

  • Another explanation

    • A bisection of a network is a partition into two equally-sized sets of nodes. The sum of the capacities of links between the two partitions is called the bandwidth of the bisection.

    • The bisection bandwidth of a network is the minimum such bandwidth over all possible bisections.

    • Oversubscription: the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth of a particular communication topology

      • 1:1: all hosts may potentially communicate with arbitrary other hosts at full bandwidth of their network interface (e.g., 1 Gbps for commodity Ethernet designs)

      • Oversubscription is typically introduced to lower the total cost

  • Network

    • A network with a larger bisection width can carry more traffic

    • Fully-connected network: bisection width = (n/2)^2 links for n nodes

    • Linear: bisection width = 1 link

    • 2D mesh

      • Bisection bandwidth: product of the bisection width with the link bandwidth (if equal bw across links)

      • If links have unequal speeds, find the cut that splits the network in two with the minimum total bandwidth across it
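The definitions above can be checked by brute force on tiny topologies: enumerate every way to split the nodes into two equal halves and take the minimum total capacity crossing the cut. A minimal sketch (the `bisection_bandwidth` helper and the example topologies are my own, not from the paper):

```python
from itertools import combinations

def bisection_bandwidth(nodes, links):
    """Brute-force minimum bisection bandwidth.

    links: dict mapping frozenset({u, v}) -> link bandwidth.
    Exponential in len(nodes), so only usable on toy examples.
    """
    n = len(nodes)
    assert n % 2 == 0, "need an even number of nodes to bisect"
    best = float("inf")
    for half in combinations(nodes, n // 2):
        part = set(half)
        # Sum the capacities of links with exactly one endpoint in `part`,
        # i.e. the links crossing this particular bisection.
        cut = sum(bw for link, bw in links.items() if len(link & part) == 1)
        best = min(best, cut)
    return best

# Linear array 0-1-2-3 with unit-bandwidth links: bisection width = 1 link.
linear = {frozenset({i, i + 1}): 1 for i in range(3)}
print(bisection_bandwidth([0, 1, 2, 3], linear))  # 1

# 2x2 mesh: two links cross the best cut.
mesh = {frozenset(e): 1 for e in [(0, 1), (2, 3), (0, 2), (1, 3)]}
print(bisection_bandwidth([0, 1, 2, 3], mesh))  # 2
```

With unequal link speeds, the same search applies unchanged: the minimum is taken over total crossing bandwidth, not link count.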


The principal bottleneck of large-scale data center networks is inter-node communication bandwidth. Existing approaches to building communication fabrics that address this challenge are either expensive, non-scalable, or incompatible with TCP/IP applications. For example, one of the most commonly used data center topologies is built on top of a hierarchy of switches, with expensive, non-commodity switches placed at the top of the hierarchy. The port density of these high-end switches limits cluster size while incurring high cost.


Is full bisection bandwidth important in a datacenter network?

  • The bisection bandwidth of a datacenter network is the smallest aggregate capacity of the links crossing the worst-case cut, among all cuts that divide the topology graph into two halves; it serves as a measure of the network's capacity.

  • Full bisection bandwidth means that all hosts may potentially communicate with arbitrary other hosts at full bandwidth of their network interface.

  • Theoretically, a network that achieves full bisection bandwidth is extremely efficient and performant (i.e. no bandwidth in the network is wasted) and robust to failures.

  • However, in practice, building such a network involves a trade-off between performance and cost. Delivering full bisection bandwidth between arbitrary hosts is very expensive, and cost increases non-linearly with cluster size. As the paper notes, oversubscription is often introduced to lower the total cost.

  • Datacenter network topologies should be designed to deliver large bisection bandwidth, which is very important for distributed applications to achieve good performance. But given cost constraints and the huge scale of clusters, it is often impractical to build an interconnect that delivers full bisection bandwidth. Achieving it can also be difficult because of the need to prevent packet reordering for TCP flows, and operators need to reserve additional bandwidth for network management.
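The cost/performance trade-off shows up directly in the oversubscription ratio: the worst-case per-host bandwidth is the NIC bandwidth divided by the ratio. A tiny sketch of that arithmetic (the helper is illustrative, not from the paper; the 1 Gbps NIC figure is the paper's commodity baseline):

```python
def worst_case_host_bw(nic_gbps, oversubscription):
    """Worst-case achievable per-host bandwidth (in Gbps) under a
    given oversubscription ratio (e.g. 5 for 5:1)."""
    return nic_gbps / oversubscription

print(worst_case_host_bw(1.0, 1))  # 1.0 -> 1:1, full bisection bandwidth
print(worst_case_host_bw(1.0, 5))  # 0.2 -> 5:1, only 20% of NIC bandwidth
```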

Main insight

The goal of this paper is to design a data center communication architecture that leverages cheap commodity switches to deliver scalable bandwidth for large-scale clusters.

They propose a special instance of a Clos topology called a fat-tree. A k-ary fat-tree has k pods, each containing two layers of k/2 k-port switches; each switch in the lower (edge) layer is directly connected to k/2 hosts, and each of its remaining k/2 ports is connected to k/2 of the switches in the aggregation layer. There are (k/2)^2 k-port core switches; each core switch has one port connected to each of the k pods, with its ith port connected to pod i. This architecture is highly scalable (a fat-tree built with k-port switches supports k^3/4 hosts) at low cost (all switching elements are identical, cheap commodity switches).
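The construction above can be summarized as a small sizing calculator (the function is my own sketch of the counts implied by the topology description):

```python
def fat_tree_sizes(k):
    """Element counts for a k-ary fat-tree built from k-port switches.

    k must be even: each pod has k/2 edge and k/2 aggregation switches.
    """
    assert k % 2 == 0, "k must be even"
    edge = k * (k // 2)              # k pods x k/2 edge switches
    agg = k * (k // 2)               # k pods x k/2 aggregation switches
    core = (k // 2) ** 2             # one port per pod on each core switch
    hosts = k * (k // 2) * (k // 2)  # == k**3 / 4
    return {"hosts": hosts, "edge": edge, "agg": agg, "core": core,
            "switches": edge + agg + core}

print(fat_tree_sizes(4))   # 16 hosts, 20 switches (the paper's test topology)
print(fat_tree_sizes(48))  # 27,648 hosts from identical 48-port switches
```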

However, there are two principal deployment issues: 1) IP/Ethernet networks typically build a single routing path between each source and destination, which can lead to bottlenecks up and down the fat-tree; 2) fat-tree topologies can impose significant wiring complexity in large networks.

For 1), the paper proposes a special IP addressing scheme for hosts in the cluster and introduces two-level route lookups to enable multi-path routing (i.e. distributing traffic) across the fat-tree. In addition to the two-level routing technique, the paper presents two optional dynamic routing optimizations for traffic diffusion:

  • a) flow classification: pod switches recognize packets of the same flow and forward them on the same outgoing port, and periodically reassign a minimal number of flow output ports

  • b) flow scheduling: central scheduler with global knowledge schedules large flows to minimize overlap with one another
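The two-level lookup mentioned above can be sketched as a first-level prefix match that either terminates (intra-pod traffic) or falls through to a suffix match on the host-ID octet, which spreads traffic across the upward ports. The table contents and helper below are illustrative assumptions, not taken verbatim from the paper:

```python
def two_level_lookup(dst_ip, prefixes, suffixes):
    """Sketch of a two-level route lookup.

    dst_ip:   dotted quad, e.g. '10.2.0.3' (paper-style 10.pod.switch.id)
    prefixes: ordered list of (prefix_string, port_or_None);
              None means fall through to the suffix table.
    suffixes: list of (host_id, port) matched on the last octet.
    """
    for prefix, port in prefixes:
        if dst_ip.startswith(prefix):
            if port is not None:
                return port              # terminating prefix: stay in pod
            host_id = int(dst_ip.split(".")[3])
            for hid, sport in suffixes:
                if host_id == hid:
                    return sport         # suffix match: pick upward port
    raise LookupError(dst_ip)

# Hypothetical table for an aggregation switch in pod 0 of a k=4 fat-tree.
prefixes = [("10.0.0.", 0), ("10.0.1.", 1),  # subnets inside this pod
            ("10.", None)]                   # everything else: suffix match
suffixes = [(2, 2), (3, 3)]                  # host ID -> core-facing port
print(two_level_lookup("10.0.1.2", prefixes, suffixes))  # 1 (intra-pod)
print(two_level_lookup("10.2.0.3", prefixes, suffixes))  # 3 (up to a core)
```

The point of the suffix match is that different destination hosts hash to different core switches, diffusing traffic over the multiple equal-cost upward paths.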

For 2), the paper presents packaging and placement techniques to ameliorate this overhead.


Generally, the solution is a good one:

1) Scalability and performance: in theory, the proposed mechanism can deliver scalable interconnection bandwidth

2) Cost-efficiency: it leverages economies of scale to make cheap, off-the-shelf commodity Ethernet switches the basis for large-scale datacenter networks

3) Backward compatibility: the solution requires no modification to end hosts and is fully TCP/IP compatible


That said, there are some limitations:

1) Deploying such an architecture in practice requires some modification and redesign of the switches.

2) Some of the optimization techniques proposed by the paper (especially flow classification and scheduling) seem complicated and expensive to deploy.

3) The evaluation section is good, but the authors only show results on a k = 4 fat-tree, which doesn't demonstrate the system's scalability that well (the authors say they focused on designs up to k = 48).

4) Though a failure broadcast protocol is presented, there is no evaluation of fault tolerance at scale.

5) It'd be interesting to see performance evaluations of distributed applications (e.g. MapReduce) running on top of this architecture.

Comments / thoughts

  • Is this proposal deployed in DC now? Why or why not?

  • What are the types of switches commonly used in DC now and their capacity / throughput?

  • What is the typical number of hosts targeted by a DC today? The paper targets around 25,000; I'd expect the number to be much larger now.

  • Do servers or hosts take any role in packet forwarding in DCs now? In this setup they seem to be only endpoints.

  • The paper states that it does not consider cabling costs in their cost calculation, why is that?

  • "Routing large flows plays a significant role in determining the achievable bisection bandwidth of a network and therefore merits special handling" --> why is this true? Is it still true today (when short flows dominate DC workloads)?


Yes, it is a fun read! I particularly enjoyed the background section and learning about the architectural solutions of the time.
