Fast Distributed Deep Learning over RDMA


  • RPC is suboptimal for distributed deep-learning computation, especially on an RDMA-capable network. Using RPC for tensor data transfer offers no real advantage in programmability or efficiency, and it typically involves copying data to and from RPC-managed communication buffers, whereas RDMA enables zero-copy cross-machine tensor transfer.


  • The main insight is that RDMA's one-sided memory read/write semantics enable zero-copy communication across servers as long as the remote address is known. For DL computation, data-flow graph analysis can help arrange the in-memory placement of tensors and supply that information to the underlying communication layer.

  • The paper therefore proposes a simple memory-copy ("device"-like) interface, combined with static analysis and dynamic tracing, to enable cross-stack optimizations for training general DL networks.
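To make the copy-count contrast concrete, here is a minimal Python sketch (all names are hypothetical, not the paper's actual API): an RPC-style path copies tensor bytes into and out of framework-managed message buffers, while an RDMA-style one-sided write places bytes directly into a pre-registered remote region whose address is known in advance.

```python
def rpc_send(tensor: bytes) -> bytes:
    """RPC-style path: two extra copies through framework buffers."""
    message_buffer = bytes(tensor)    # copy 1: serialize into RPC-managed buffer
    received = bytes(message_buffer)  # copy 2: receiver deserializes into its own memory
    return received


class RegisteredRegion:
    """Stand-in for an RDMA-registered memory region on the remote host."""

    def __init__(self, size: int):
        self.memory = bytearray(size)

    def write(self, offset: int, payload: bytes) -> None:
        # One-sided write: bytes land directly at a known remote address;
        # the receiver's CPU performs no copy or deserialization step.
        self.memory[offset:offset + len(payload)] = payload


tensor = b"\x01\x02\x03\x04"
region = RegisteredRegion(8)
region.write(0, tensor)  # direct placement at the analyzed tensor address
assert bytes(region.memory[:4]) == tensor
```

In a real system the remote offset would come from the data-flow graph analysis described above, and the write would be issued by the NIC (e.g., an ibverbs RDMA write) rather than by Python.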

Key Strengths

  • Identifies an important abstraction-layer problem in using an emerging technology (RDMA) to improve the performance of distributed ML / DL

  • The architecture and transfer-graph illustrations are clear and helpful

  • The proposed solution is neat; the new abstraction avoids the intermediate buffer copies and serialization overhead of an RPC framework

Key Weaknesses

  • The evaluation graphs are black-and-white only, which makes the data series hard to distinguish

  • Does this technique scale and / or generalize to even larger and more diverse DL models, communication patterns, and architectures?


  • How does the proposed approach differ from PyTorch / TensorFlow RPC implementations optimized for RDMA?

  • It would be really good to see a bar plot of communication and computation time for each benchmarked workload; it would also help in interpreting the evaluation figures

  • It would also be good to report explicit numbers for the average tensor size in each workload

  • Can some of the abstraction ideas be used to solve similar problems for other distributed workloads? Do other workloads face similar problems over RDMA?
