# Fast Distributed Deep Learning over RDMA

### Problem

* RPC is suboptimal for distributed deep learning, especially on an RDMA-capable network. Using RPC for tensor data transfer offers no real advantage in programmability or efficiency: it typically involves copying data to and from RPC-managed communication buffers (plus serialization/deserialization), whereas RDMA enables zero-copy cross-machine tensor transfer. The sketch below illustrates the extra copies on the RPC path.
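
A minimal sketch (not the paper's code; the function names and NumPy-based staging are illustrative assumptions) of the copy and serialization steps a typical RPC-based tensor transfer pays, which one-sided RDMA avoids:

```python
import numpy as np

def rpc_send(tensor: np.ndarray) -> bytes:
    payload = tensor.tobytes()        # copy 1: serialize the tensor into a byte string
    send_buffer = bytearray(payload)  # copy 2: stage into an RPC-managed send buffer
    return bytes(send_buffer)         # the actual wire transfer would happen here

def rpc_recv(wire_bytes: bytes, shape, dtype) -> np.ndarray:
    recv_buffer = bytearray(wire_bytes)                    # copy 3: RPC-managed receive buffer
    flat = np.frombuffer(bytes(recv_buffer), dtype=dtype)  # view over the received bytes
    return flat.reshape(shape).copy()                      # copy 4: deserialize into the destination tensor

t = np.random.rand(256, 256).astype(np.float32)
out = rpc_recv(rpc_send(t), t.shape, t.dtype)
assert np.array_equal(t, out)
```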

### Insight

* The main insight is that RDMA's one-sided memory read/write semantics allow zero-copy communication across servers as long as the remote address is known. For DL computation, data-flow graph analysis can determine the in-memory placement of tensors and pass that information to the underlying communication layer.
* The paper therefore proposes a simple memory-copy ("device"-like) interface, combined with static analysis and dynamic tracing, to enable cross-stack optimizations for general DL training; see the sketch after this list.
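
A minimal sketch of the device-like, zero-copy interface described above, under the assumption that the destination tensor is pre-allocated and its address published ahead of time (as graph analysis would arrange). The `RdmaLikeTransport` class and its method names are hypothetical; a real implementation would issue RDMA WRITE verbs rather than the local copy used to stand in for them here:

```python
import numpy as np

class RdmaLikeTransport:
    """Models 'remote' registered memory as a map from address -> destination tensor."""
    def __init__(self):
        self._registered = {}

    def register_tensor(self, tensor: np.ndarray) -> int:
        # Graph analysis fixes tensor placement ahead of time; registering the
        # buffer and publishing its address replaces per-transfer RPC negotiation.
        addr = tensor.ctypes.data
        self._registered[addr] = tensor
        return addr

    def one_sided_write(self, src: np.ndarray, remote_addr: int) -> None:
        # Stand-in for a one-sided RDMA WRITE: bytes land directly in the
        # destination tensor's memory, with no serialization and no
        # intermediate RPC-managed buffer.
        dst = self._registered[remote_addr]
        np.copyto(dst, src)

# Receiver side: destination tensor is pre-allocated and its address published.
transport = RdmaLikeTransport()
recv_tensor = np.empty((256, 256), dtype=np.float32)
remote_addr = transport.register_tensor(recv_tensor)

# Sender side: writes straight into the published remote address.
send_tensor = np.random.rand(256, 256).astype(np.float32)
transport.one_sided_write(send_tensor, remote_addr)
assert np.array_equal(recv_tensor, send_tensor)
```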

### Key Strength

* Identifies an important abstraction-layer problem in using emerging technologies (RDMA) to improve the performance of distributed ML / DL
* The architecture and transfer-graph illustrations are really nice
* The proposed solution is neat: the intermediate buffer copies and serialization overhead of an RPC framework can be avoided through the new abstraction

### Key Weakness

* The evaluation graphs are black-and-white, which makes them hard to read
* Does this technique scale and / or generalize to even larger and more diverse DL models, communication patterns, and architectures?

### Question

* How does the proposed design differ from PyTorch / TensorFlow RPC implementations that are optimized for RDMA?
* It would be really good to see a bar plot of communication and computation time for each benchmarked workload; it would also help in understanding the evaluation figures
* It would also be good to show explicit numbers for the average tensor size in each workload
* Can some of the abstraction ideas be used to solve similar problems for other distributed workloads? Do other workloads share similar problems over RDMA?
