# Garaph: Efficient GPU-accelerated GraphProcessing on a Single Machine with Balanced Replication

## Presentation&#x20;

* Large-scale graph processing&#x20;
  * 10^10 pages, 10^12 tokens: page rank&#x20;
  * 10^9 nodes, 10^12 edges: social network analysis&#x20;
* Powerful storage & computation technologies&#x20;
* Goal:&#x20;
  * Large memory + fast secondary storages&#x20;
  * CPU + GPUs&#x20;
    * CPU: sequential&#x20;
    * GPU: SIMD mode processing&#x20;
  * How to efficiently integrate heterogeneity under a unified abstraction&#x20;

![](/files/-McNDKY70BbQo1o78K3w)

![](/files/-McNDVdb-oLg-3z7RTxO)

![](/files/-McNE-dTOihateXQ0WBr)

* Non-distributed platform&#x20;

![](/files/-McNEFgbfOg7tKMUHxjL)

![](/files/-McNEW0Q44sf1V6A8OF_)

* Most time-consuming: gather phase&#x20;

![](/files/-McNEe2adb0D8lvnTiXh)

![](/files/-McNEit7zu0NuE0gUT71)

![](/files/-McNEx6b_LiRy6-cGNjN)

![](/files/-McNF-OVvvfttQFsZDWN)

![](/files/-McNF1PuNq7zx8l_-wQH)

![](/files/-McNF4VRPaE_aNROLAkd)

![](/files/-McNF6N5Vn88sqdHUv_G)

![](/files/-McNFALS4vsVFYOwY5iH)

![](/files/-McNFC09e7o_gO_M80jX)

![](/files/-McNFDloPlSzd5UZM2Yi)

![](/files/-McNFFeq1Du4-NLbCKXk)

## Paper&#x20;

### Abstract&#x20;

* Garaph: GPU-accelerated graph processing system on a single machine with secondary storage as memory extension&#x20;
* Contributions&#x20;
  * **Vertex replication degree customization scheme**
    * maximize GPU utilization given vertices' degrees and space constraints&#x20;
  * **Balanced edge-based partition** and a **hybrid of notify-pull and pull computation models**
    * ensure work balance over CPU threads
    * optimized for fast graph processing on CPU&#x20;
  * **Dynamic workload assignment** schemes&#x20;
    * Takes into account of the characteristics of processing elements and graph algorithms&#x20;

### Intro&#x20;

* Distributed graph systems: need fast network and effective partitioning to minimize communication&#x20;
* Alternative: non-distributed&#x20;
  * Benefit: users need not to be skilled at managing & tuning&#x20;
  * Problem: pressure on **memory** and **computing** power. But it's  affordable.&#x20;
    * RAM is large&#x20;
    * Advances of secondary storage: high access bandwidth close to memory&#x20;
    * GPU: massive parallelism to offer high-performance graph processing&#x20;
* Setting: GPU-accelerated, secondary-storage based graph processing&#x20;
* Challenge:&#x20;
  * highly skewed degree distribution of natural graphs&#x20;
    * Small fraction of vertices adjacent to large fraction of edges --> heavy write contention among GPU threads due to atomic updates of the same vertices&#x20;
    * Work imbalance of CPU threads&#x20;
  * heterogeneous parallelism of CPU & GPU&#x20;
    * CPU: sequential processing&#x20;
    * GPU: bulk parallel processing&#x20;
* Propose: Garaph&#x20;
  * GPU: vertex replication degree customization&#x20;
  * CPU: balanced edge-based partition&#x20;
  * Heterogeneity of computation units&#x20;
    * Pull computation model: matches the SIMD processing model of GPU
    * Hybrid of notify-pull and pull computation model: optimizes for fast sequential processing on CPU&#x20;
  * Dynamic workload assignment&#x20;

### System overview&#x20;

#### 2.1 Graph Representation&#x20;

For organizing incoming and outgoing edges:&#x20;

* Compressed Sparse Column (CSC)
* Compressed Sparse Row (CSR)&#x20;
* Shard:&#x20;
  * split vertices V into disjoint sets of vertices and each set is represented by a shard that stores all incoming edges whose destination is in that set.&#x20;
  * Edges in a shard are listed based on increasing order of their indexes of destination vertices.&#x20;
  * Allow each shard to be fit into the shared memory for high bandwidth&#x20;
  * Maximum offset is 12K, can use 16-bit integer to represent the index of destination vertices&#x20;
* Transfer shards from host memory to GPU memory in batch&#x20;
  * Call this as a **page**&#x20;
* Leverages multi-stream feature of GPUs for the overlap of memory copy and kernel execution&#x20;
* Two vertex-centric computation models&#x20;
  * Pull model&#x20;
    * Every vertex updates its state by pulling the new states of neighboring vertices through incoming edges&#x20;
  * Notify-pull
    * Only active vertices notify their outgoing neighbors to update, who in turn perform local computation by pulling states of their incoming neighbors&#x20;
    * More effective in case of few active vertices&#x20;

#### System Architecture&#x20;

* **Dispatcher**&#x20;
  * loading graph from secondary storages, distributing the computation over CPU and GPU
  * Partitions each graph page into equal-size data blocks, which are uniformly distributed over multiple secondary storages with a hash function&#x20;
  * Steps&#x20;
    * Load data blocks from secondary storage to host memory
    * Construct pages&#x20;
    * After one page is constructed, dispatch to either CPU or GPU&#x20;
* **GPU/CPU computation kernel**&#x20;
  * GPU
    * Process the shards of page in a parallel manner&#x20;
    * Only the pull model is enabled on GPU side&#x20;
      * Notify-pull can lead to high frequency of non-coalesced memory accesses because of poor locality and warp divergence caused by distinguishing active/inactive vertices&#x20;
  * CPU
    * Enables both pull and notify-pull&#x20;
    * Each thread processes one edge set (divide edges of a page into sets of equal size)&#x20;
  * Either of the two kernels has processed on page, there will be a synchronization between GPU and CPU&#x20;
  * Execution can be done both synchronously and asynchronously&#x20;
    * Iter: complete process over all the pages for one time
* **Fault Tolerance**&#x20;
  * Write vertex data to secondary storages periodically&#x20;

#### Programming API&#x20;

* Modified Gather-Apply-Scatter (GAS) abstraction used in the PowerGraph&#x20;
  * Modify scatter function to activate function which sets value if the vertex satisfies the active condition&#x20;
* Atomic (user-provided sum function) + non-atomic operations for GPU and CPU respectively&#x20;

### GPU-Based Graph Processing&#x20;

![](/files/-McNUp9U15n3hw_P8NY4)

* Global Memory
  * Up to 24GB in size.&#x20;
    * Size of the vertices is usually 4 bytes. Can store up to 6B (or 12B) vertices in global memory. &#x20;
  * Global Vertices: allows quick access to values of vertices&#x20;
  * Each shard in a page is processed by one GPU block in three phases: initialization, gather, and apply
    * **Initialization**:&#x20;
      * LocalVertices to store accumulate value of each vertex in a shard.&#x20;
      * Consecutive threads of a block initialize this array with default vertex values defined by users&#x20;
    * **Gather** &#x20;
      * Threads of one GPU block process edges of an individual shard. For each edge, one thread fetches vertex & edge data from global memory and increase accumulate value&#x20;
      * To have coalesced global memory accesses: consecutive threads of the block read consecutive edges' data in global memory&#x20;
    * **Apply**&#x20;
      * Each thread of block updates vertex value in shared memory&#x20;
      * Async: commit new vertex data to GlobalVertices array, which are immediately visible to the subsequent computation&#x20;
      * Sync: values are written to temporary array in global memory, which would be visible in the next iteration
  * When page has been processed, new vertex values are synchronized between GPU global memory and host memory&#x20;
    * Async: transmits updated values of GlobalVertices in the GPU global memory to array storing the most updated values of vertices in the host memory&#x20;
    * Sync: stored in temporary space of GPU global memory are transmitted to temporary array in host memory, commit after the iteration ends&#x20;
    * Can be overlapped with processing&#x20;

#### Replication-Based Gather&#x20;

* Problem: gather phase with write contention (multi-threads simultaneously modifying the same shared memory address) --> **position conflict**&#x20;
  * Frequent for natural graphs (power-law degree distribution)&#x20;
* Strategy: **replication**&#x20;
  * which consists of placing R adjoining copies of the partial accumulated value in the shared memory to spread these accesses over more shared memory addresses.
  * Then these R partial accumulated values are aggregated to calculate the final accumulated value au for a vertex u.&#x20;
    * Two-way merge
  * R: replication factor&#x20;
* Replication factor customization&#x20;
  * Too large? GPU underutilization since fewer vertices can be fit in the shared memory&#x20;
  * Maximizes the expected performance under given conflict degree and space constraints&#x20;

### CPU-Based Graph Processing&#x20;

![](/files/-McNgfZTVT6n7BIdNcvM)

* Main points&#x20;
  * How it works&#x20;
  * Balanced edge-based partition to exploit parallelism&#x20;
  * Dual-mode processing model: switches between pull/notify-pull modes according to the density of active vertices in the page&#x20;
* Existing approach: assign each thread of a subset of vertices&#x20;
  * Computation imbalance&#x20;
  * Random  memory access of edge data if adjacent vertices are assigned to different threads&#x20;
* **Edge-centric partition**&#x20;
  * Enhance sequential access of edge data and improves work balance over threads&#x20;
    * Why mention it in CPU-based??&#x20;
      * Why is it not an issue on the GPU side?&#x20;
        * Cuda blocks taking 32 threads&#x20;
        * Done with the block, then push another block&#x20;
* CPU engine
  * GlobalVertices in host memory for quick access to values of vertices&#x20;
  * LocalVertices to store accumulate values of destination vertices in the corresponding partition&#x20;
  * Each page: initialization, gather, apply&#x20;
    * If a page is processed at GPU side, system also synchronizes new vertex values between GPU memory and host memory&#x20;
  * Processing is done&#x20;
    * Graph state converges (i.e. no active vertices)&#x20;
      * Active vertices: vertices with significant state change (use a bitmap to indicate)&#x20;
    * Or: a given number of iterations complete
* Each page&#x20;
  * **Initialization**
    * Edges of the page is divided into partitions, and each thread processes one partition.&#x20;
    * Number of replicas is at most n\_t - 1&#x20;
    * Initialize LocalVertices with vertices' default values&#x20;
  * **Gather**&#x20;
    * Each partition --> one thread&#x20;
    * Edges are processed in a sequential order&#x20;
    * For each edge, CPU thread performs gather and updates the accumulate value in LocalVertices with sum function&#x20;
    * After each thread, aggregation phase aggregates values of vertices replicated at the partition boundaries&#x20;
  * **Apply**&#x20;
    * After gather phase of each page is finished, every thread updates vertices' values in the LocalVertices array
    * For each vertex in the partition, corresponding thread calls activate() to examine if the vertex is active or not and updates the bitmap&#x20;
  * **Sync**
    * after the GPU has processed a page, it sends the corresponding vertex values to the host memory
    * Then, system calls Activate() of these updated vertices&#x20;
    * Async:&#x20;
      * enables the updates received from the GPU immediately visible through writing them into the GlobalVertices array of host memory
      * Then, sends back new vertices updated on CPU side to the GPU and overwrites the corresponding part of the GlobalVertices array in the GPU global memory&#x20;
    * Sync
      * stores updated values in a temporary array, and commits these new values at the end of each iteration
      * the CPU transmits the new GlobalVertices array to that in the GPU memory at the end of each iteration

#### Dual-Mode Processing Engine&#x20;

* **pull mode**
  * more beneficial when most vertices are activated (dense active vertex set), which avoids the extra costs of modification&#x20;
* **notify/pull mode**: a vertex needs to be updated only when one of its source vertices is active in the previous iteration&#x20;
  * more efficient when few vertices are active in the last iteration (sparse active vertex set)
* At a given time during the graph processing, the active vertex set **may be dense or sparse**&#x20;
  * E.x. starts from sparse, then becoming denser as more vertices being activated, and sparse again when algorithm approaches convergence&#x20;
* Problem of combining two modes where only part of graph can be loaded into the host memory&#x20;
  * System entails I/O cost due to sequential and random accesses of outgoing/incoming edges on secondary storage for pull & notify-pull modes respectively (use a different formula to consider the rate of speeds between sequential read and random read of secondary storage)&#x20;

### Dispatcher&#x20;

* Adaptive scheduling mechanism to exploit the overlap of two engines&#x20;

#### CPU-GPU Scheduling

* if T(CPU) \<T(GPU), the system adopts CPU kernel only due to sparse active vertex set. Otherwise, Garaph adopts both GPU and CPU kernels to reduce the overall time of the processing
* At the beginning of each iteration, the scheduler calculates the following ratio of T(CPU) to T(GPU)

![](/files/-McNnI4SDSdU1O7rI-e5)

* T(pull) / T(GPU) is initialized by the speed ratio of CPU and GPU hardwares, and is updated once both kernels have begun to process pages&#x20;
* Alpha < 1: only CPU kernel is used for graph processing in this iteration as most vertices are very inactive (e.g. very small f)&#x20;
  * f is the fraction of |V\_A| / |V|&#x20;
* Otherwise: process on both CPU and GPU kernel&#x20;
  * Reactively assigns a page to a kernel once the kernel becomes free&#x20;

#### GPU Multi-Stream Scheduling&#x20;

* To trigger the graph processing on the GPU side: two threads running on the host&#x20;
  * **Transmission thread**: continuously transmits each page from the host memory to GPU’s global memory
  * **Computation thread**: launches a new GPU kernel to process the page that has already been transmitted.
* NVIDIA's Hyper-Q feature
  * pipelining CPU-GPU memory copy and kernel execution&#x20;
  * so that the processing tasks of pages can be dispatched onto multiple streams and handled concurrently

### Evaluation&#x20;

## Group Discussion&#x20;

* warp: unit of 32 threads on most gpu
  * warp divergence: thread takes a difference step, they diverge&#x20;
* vectorized instruction: increment all elements in the array by one,&#x20;
  * Check the value of the array (>5, <5), don't get 32x speed up


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://sliu583.gitbook.io/blog/specific-work/shivarams-group/group-papers/garaph-efficient-gpu-accelerated-graphprocessing-on-a-single-machine-with-balanced-replication.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
