Fast Distributed Deep Learning over RDMA
Problem
RPC is suboptimal for distributed deep learning, especially on an RDMA-capable network. Using RPC for tensor data transfer offers no clear advantage in programmability or efficiency, and it typically requires copying data to and from RPC-managed communication buffers, whereas RDMA enables zero-copy cross-machine tensor transfer.
Insight
The main insight is that the one-sided memory read/write semantics of RDMA allow zero-copy communication across servers as long as the remote address is known. For DL computation, dataflow-graph analysis can arrange the in-memory placement of tensors and pass that information to the underlying communication layer.
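A minimal sketch of the one-sided write this relies on, using the standard ibverbs API; the connected queue pair, the registered memory region, and the exchange of the remote address/rkey are assumed to have happened already, and the function name is illustrative rather than taken from the paper.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a one-sided RDMA WRITE: copies a locally registered tensor buffer
 * directly into remote memory at a known (addr, rkey), with no receiver
 * involvement and no intermediate RPC buffer.
 * Assumes qp is already connected and mr covers local_buf. */
static int rdma_write_tensor(struct ibv_qp *qp, struct ibv_mr *mr,
                             void *local_buf, size_t len,
                             uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;         /* known ahead of time */
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
}
```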
The paper therefore proposes a simple memory-copy ("device"-like) interface, combined with static analysis and dynamic tracing, to enable cross-stack optimizations for general DL training.
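To make the "device"-like interface concrete, here is a hypothetical memcpy-style wrapper of my own (not the paper's API): the graph analysis resolves the destination tensor to a (remote address, rkey) pair ahead of time, so a cross-machine tensor transfer reduces to the one-sided write sketched above.

```c
/* Hypothetical remote "device" handle: everything needed for zero-copy
 * writes into one peer's pre-registered tensor memory. Illustrative only. */
struct remote_device {
    struct ibv_qp *qp;     /* connected queue pair to the peer */
    struct ibv_mr *mr;     /* registration covering local tensor memory */
};

/* Memcpy-like transfer: dst is a remote (addr, rkey) pair decided by the
 * dataflow-graph analysis, src is a local tensor buffer inside the
 * registered region; no serialization, no staging buffer. */
static int remote_memcpy(struct remote_device *dev,
                         uint64_t dst_addr, uint32_t dst_rkey,
                         void *src, size_t len)
{
    return rdma_write_tensor(dev->qp, dev->mr, src, len, dst_addr, dst_rkey);
}
```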
Key Strength
Identifies an important abstraction-layer problem in applying an emerging technology (RDMA) to improve the performance of distributed ML / DL
The architecture and transfer-graph illustrations are clear and helpful
The proposed solution is neat; the new abstraction avoids the intermediate buffer copies and serialization overhead of an RPC framework
Key Weakness
The evaluation graphs are black-and-white, which makes them hard to read
Does this technique scale and/or generalize to even larger and more diverse DL models, communication patterns, and architectures?
Question
How does this compare to the PyTorch / TensorFlow RPC implementations that are optimized for RDMA?
It would be helpful to see a bar plot of communication and computation time for each benchmarked workload; that would also make the evaluation figures easier to interpret
It would also be good to report explicit numbers for the average tensor size in each workload
Can some of these abstraction ideas be used to solve similar problems in other distributed workloads? Do other workloads face similar problems over RDMA?