NetHint: White-Box Networking for Multi-Tenant Data Centers


Key insights

  • Cloud providers expose the network to tenants as a black box: cloud tenants have little visibility into their expected network performance and the underlying network characteristics. However, as data-intensive applications move to the cloud, they can and want to adapt their traffic based on these characteristics to improve performance.

  • There is a mismatch between the black-box nature of existing network abstractions and the ability and incentives of these applications to adapt their transfer schedules.

  • The paper thus proposes a "white-box" approach that allows a cloud tenant and the cloud provider to interact to resolve the mismatch and jointly improve performance. The provider supplies a hint (i.e., an indirect indication of the bandwidth allocated to a tenant) that is both secure and useful; tenant applications then adapt their transfer schedules based on the hints.


  • Valid motivation that identifies the mismatch between an application's incentive to achieve performance and its unawareness of the underlying network characteristics.

  • Use-case illustrations: the paper walks through 4-5 scenarios where the hints can be used in a simple way to improve performance, giving readers a sense of how useful the hints are.


  • The baseline comparisons are too simple: NetHint is compared against an approach with large overheads (i.e., user probing) and an approach with zero knowledge (i.e., no info), but many cloud services provide some form of QoS guarantee, and there are specialized frameworks for running big-data applications. How does NetHint compare to those in terms of pricing and performance?

  • The 100ms update period is not clearly motivated; the motivation figure doesn't show traffic fluctuations at this fine a time granularity, only per 4-hour periods. If the fluctuation is low, user probing is fine.

  • Do the hints really eliminate the security concerns? In scenarios where there is only one other co-located flow, a tenant can still infer the other tenant's information.

  • The information-collection method doesn't work for all traffic; what about RDMA? Also, do all host machines allow NetHint to collect traffic like this?

  • Some applications need the information to adapt, but most cloud workloads (e.g., web search) might not need NetHint; collecting information on those machines would be extra overhead.

  • Under this mechanism, tenants still have to deal with some complexity compared to a managed cloud-provider service. I also doubt cloud providers would want to expose hints that can still lead to security concerns.

  • Some scalability challenges might remain: how does it scale with the number of tenants, for example, with a single-threaded answering agent in each rack? How frequent are the queries?

  • Ideally the evaluation would also include a mix of workloads where tenants jointly act on the information they see.


  • I'm also curious about the duration of each job. If a job is very long and the fluctuation is fairly stable, maybe user probing is fine.


  • Data-intensive applications are moving to the cloud

    • Cloud providers have put tremendous effort into improving the applications

  • Today's Cloud offers a "black-box" abstraction

    • Simple

    • Tenants have minimal knowledge of network performance

      • No link-layer topology

      • Unaware of instantaneous available bandwidth

  • Data-intensive application can adapt traffic

    • Broadcast [RL, ensemble]

  • Different overlays to build the multicast

  • They also have the incentive to adapt traffic

  • Mismatch

    • Black-box networking abstraction does not provide network characteristics

      • Q: why is that? what are the pros of not exposing network characteristics to the application ends?

      • User Probing

        • Tenants do traffic probing to profile the network performance

        • Con: costly (every app probes for itself), slow (delays the start)

          • Q: why not periodically?

    • Data-intensive applications have both the incentive and ability to adapt their transfer schedule based on network characteristics

      • Q: is it for all sorts of applications?

  • Strawman White-Box Solution

    • Cloud provider exposes some useful information to tenants

    • Q: what is the API exposed to the end users?

    • Q: alternative to provide isolations?

    • Con

      • Security concerns: might leak information of other tenants

      • Communication patterns can change frequently, and re-calculation is expensive


  • For NetHint to work, the service first collects network characteristics; then the application queries the hints --> adapts --> the new transfer schedule can in turn change the network characteristics

  • Key questions

    • What hints to provide?

      • Hierarchical virtual topology T for a cloud tenant [tree structure]

        • Oversubscription at the rack layer

        • Can reflect the over-subscription and locality

        • Q: need to understand this

      • Network utilization on each link

        • Can reflect the total bandwidth

          • All flows statistics? [security leaking]

          • Residual bandwidth B on link l [not accurate on its own; doesn't capture how a new flow will share the link with existing flows]

          • Above + number of competing flows sharing the same link
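A tenant can combine the last hint variant into a simple bandwidth estimate. The sketch below is my own interpretation of the design, not the paper's code: assuming each link hint is a pair (residual bandwidth B_l, number of competing flows n_l), a new flow's share on link l under a max-min fairness assumption is roughly B_l / (n_l + 1), and the path estimate is the bottleneck minimum.

```python
# Hypothetical sketch (my reading, not the paper's API): estimate the
# bandwidth a new flow would get on a path, given per-link hints of the
# form (residual_bandwidth_gbps, num_competing_flows).

def expected_flow_bw(path_hints):
    """Bottleneck estimate: on each link the new flow shares the residual
    bandwidth with the existing competing flows (max-min assumption),
    so it gets about B_l / (n_l + 1); the path rate is the minimum."""
    return min(b / (n + 1) for b, n in path_hints)

# Example: a path crossing a host uplink (10 Gbps free, 1 competing flow)
# and an oversubscribed rack uplink (4 Gbps free, 3 competing flows).
hints = [(10.0, 1), (4.0, 3)]
print(expected_flow_bw(hints))  # bottleneck is the rack uplink: 1.0
```

This also shows why residual bandwidth alone is insufficient: the two links have very different residual bandwidths, but the flow count is what determines the achievable share.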

    • How to provide hints with low cost?

      • Collects network metrics periodically

      • In each period, collect once for all tenants

      • Hierarchical all-gather; all-to-all only among racks

      • We set the information update period to 100ms

      • Overhead of NetHint monitoring plane

        • Q: is the emulation accurate

    • How should applications adapt their traffic?

      • Collective communication: data-parallel deep learning, RL, serving ensemble models

      • Task placements: data-analytics frameworks, task-based distributed systems

      • Some other questions to answer

    • Evaluations

      • Q: very small-scale testing; also, what duration should the probing baseline use?

      • What causes the variances across different applications
