NetHint: White-Box Networking for Multi-Tenant Data Centers
Questions
Key insights
Cloud providers expose the network to tenants as a black box: tenants have little visibility into their expected network performance and the underlying network characteristics. However, as data-intensive applications move to the cloud, they can and want to adapt their traffic based on these characteristics to improve performance.
There is a mismatch between the black-box nature of existing network abstractions and the ability and incentive of these applications to adapt their transfer schedules.
The paper thus proposes a "white-box" approach that allows a cloud tenant and the cloud provider to interact to resolve the mismatch and jointly improve performance. The provider supplies a hint (i.e., an indirect indication of the bandwidth allocated to a tenant) that is secure and useful; tenant applications then adapt their transfer schedules based on the hints.
Strengths
Valid motivation: the paper identifies the mismatch between applications' incentive to optimize performance and their lack of visibility into the underlying network characteristics.
Use-case illustrations: the paper walks through 4-5 scenarios in which the hints can be used in a simple way to improve performance. This gives readers a sense of how useful the hints are.
Weaknesses
The baseline comparisons are too simple: NetHint is compared against an approach with large overheads (user probing) and an approach with zero knowledge (no information). But many cloud services provide some form of QoS guarantee, and there are specialized frameworks for running big-data applications. How does NetHint compare to those in terms of pricing and performance?
The 100 ms update period is not clearly motivated: the motivation figure does not show traffic fluctuations at this fine a time granularity; it only shows them per 4-hour period. If fluctuation is low, user probing is fine.
Do the hints really address the security concerns? In scenarios where there is only one other co-located flow, a tenant can still infer information about the other tenant.
The information-collection method does not work for all traffic. What about RDMA? Also, do all host machines allow NetHint to collect traffic statistics like this?
Some applications need the information to adapt, but most cloud workloads (e.g., web search) might not need NetHint; collecting information on those machines would be extra overhead.
Under this mechanism, tenants still have to deal with some complexity compared to a provider-managed service. I also doubt that cloud providers would want to expose hints that could still raise security concerns.
Some scalability challenges might remain. For example, how does it scale with the number of tenants, given a single-threaded answering agent in each rack? How frequently do queries arrive?
Ideally, the evaluation would also include a mix of workloads that jointly act on the information they see.
Comments
I'm also curious about job durations. If a job is very long and the network conditions are fairly stable, maybe user probing is fine.
Presentation
Data-intensive applications are moving to the cloud
Cloud providers have put tremendous effort into improving the applications
Today's Cloud offers a "black-box" abstraction
Simple
Tenants have minimal knowledge about the network performance
No link-layer topology
Unaware of instantaneous available bandwidth
Data-intensive applications can adapt traffic
Broadcast [RL, ensemble]
Different overlays to build the multicast
They also have the incentive to adapt traffic
Mismatch
Black-box networking abstraction does not provide network characteristics
Q: why is that? what are the pros of not exposing network characteristics to the application ends?
User Probing
Tenants do traffic probing to profile the network performance
Con: costly (every app probes for itself), slow (delays the start)
Q: why not periodically?
Data-intensive applications have both the incentive and ability to adapt their transfer schedule based on network characteristics
Q: is it for all sorts of applications?
Strawman White-box Solution
Cloud provider exposes some useful information to tenants
Q: what is the API exposed to the end users?
Q: are there alternatives that provide isolation?
Con
Security concerns: might leak information of other tenants
Communication patterns can change frequently, and re-calculation is expensive
NetHint
For NetHint to work, the service first collects network characteristics, then the application queries the hints --> adapts --> the new transfer schedule can in turn change the network characteristics
Key questions
What hints to provide?
Hierarchical virtual topology T for a cloud tenant [tree structure]
Oversubscription at the rack layer
Can reflect the over-subscription and locality
Q: need to understand this
Network utilization on each link
Can reflect the total bandwidth
Statistics for all flows? [leaks information about other tenants]
Residual bandwidth B on link l [not accurate: does not say how a new flow will share the link with other flows]
Above + number of competing flows sharing the same link
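As a rough illustration of why the flow count helps, here is a minimal sketch of how a tenant might combine residual bandwidth with the number of competing flows. It assumes flows on a link share it roughly equally; the function name and the formula are my own illustration, not something taken from the paper.

```python
def estimate_flow_bandwidth(capacity, residual, competing_flows):
    """Rough estimate of the bandwidth one new flow could get on a link.

    Assumption (not from the paper): flows share the link about equally,
    so a new flow gets at least capacity / (n + 1), and more than that
    when the currently unused (residual) bandwidth is larger.
    """
    if capacity <= 0 or residual < 0 or competing_flows < 0:
        raise ValueError("capacity must be > 0; residual and flow count >= 0")
    fair_share = capacity / (competing_flows + 1)
    # Never report more than the link can carry.
    return min(capacity, max(residual, fair_share))


# Busy link: little residual bandwidth, so the fair share dominates.
busy = estimate_flow_bandwidth(capacity=10.0, residual=1.0, competing_flows=3)
# Lightly used link: the residual bandwidth dominates.
idle = estimate_flow_bandwidth(capacity=10.0, residual=6.0, competing_flows=1)
```

With residual bandwidth alone, the busy link above would look like it offers only 1.0, even though a new flow could plausibly claim about 2.5 once it starts competing.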
How to provide hints with low cost?
Collects network metrics periodically
In each period, collect once for all tenants
Hierarchical all-gather; all-to-all only among racks
We set the information update period to 100ms
Overhead of NetHint monitoring plane
Q: is the emulation accurate?
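To get a feel for why the hierarchical all-gather is cheaper than a flat all-to-all among hosts, a back-of-the-envelope message count may help. This sketch assumes equal-sized racks, one aggregator per rack that every host reports to, and that hosts query their rack's aggregator rather than receiving a broadcast. The counting model and function names are illustrative assumptions, not the paper's analysis.

```python
def flat_all_to_all_messages(num_racks, hosts_per_rack):
    """Messages per period if every host sends its metrics to every other host."""
    n = num_racks * hosts_per_rack
    return n * (n - 1)


def hierarchical_messages(num_racks, hosts_per_rack):
    """Messages per period with one (hypothetical) aggregator per rack."""
    # Step 1: every host sends one report to its rack's aggregator.
    gather = num_racks * hosts_per_rack
    # Step 2: all-to-all only among the rack aggregators.
    exchange = num_racks * (num_racks - 1)
    return gather + exchange


flat = flat_all_to_all_messages(num_racks=10, hosts_per_rack=20)
hier = hierarchical_messages(num_racks=10, hosts_per_rack=20)
```

With a 100 ms update period, each count is paid ten times per second, so the gap between the two schemes matters more as the period shrinks.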
How should applications adapt their traffic?
Collective communication: data-parallel deep learning, RL, serving ensemble models
Task placements: data-analytics frameworks, task-based distributed systems
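As one concrete example of adaptation, a broadcast could use the topology hint to cross each oversubscribed rack uplink only once. The sketch below is a simple rack-aware overlay of my own, not the paper's algorithm; it uses only locality (a hypothetical host-to-rack mapping) and ignores the bandwidth hints.

```python
from collections import defaultdict


def rack_aware_broadcast(source, host_to_rack):
    """Return (sender, receiver) edges for a rack-aware broadcast.

    The source sends one copy to a relay in each remote rack, and each
    relay forwards to the remaining hosts in its own rack.
    """
    by_rack = defaultdict(list)
    for host, rack in sorted(host_to_rack.items()):
        if host != source:
            by_rack[rack].append(host)

    edges = []
    source_rack = host_to_rack[source]
    for rack, members in sorted(by_rack.items()):
        if rack == source_rack:
            # Same rack as the source: it serves these hosts directly.
            relay, rest = source, members
        else:
            # Remote rack: the first host acts as relay for the others.
            relay, rest = members[0], members[1:]
            edges.append((source, relay))
        edges.extend((relay, host) for host in rest)
    return edges


# Hypothetical placement: two hosts in each of two racks.
placement = {"a": "r1", "b": "r1", "c": "r2", "d": "r2"}
edges = rack_aware_broadcast("a", placement)
```

A fuller version would presumably also weigh residual bandwidth and flow counts when picking relays; this one only shows how locality alone already reduces cross-rack transfers.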
Some other questions to answer
Evaluations
Q: very small-scale testing; also, how long should the probing run?
What causes the variance across different applications?