Garaph: Efficient GPU-accelerated GraphProcessing on a Single Machine with Balanced Replication

https://www.usenix.org/system/files/conference/atc17/atc17-ma.pdf

Presentation

Large-scale graph processing
- 10^10 pages, 10^12 tokens: page rank
- 10^9 nodes, 10^12 edges: social network analysis
Powerful storage & computation technologies
Goal:
- Large memory + fast secondary storages
- CPU + GPUs
  - CPU: sequential
  - GPU: SIMD mode processing
- How to efficiently integrate heterogeneity under a unified abstraction

Non-distributed platform

Most time-consuming: gather phase

Paper

Abstract

Garaph: GPU-accelerated graph processing system on a single machine with secondary storage as memory extension
Contributions
- Vertex replication degree customization scheme
  - maximize GPU utilization given vertices' degrees and space constraints
- Balanced edge-based partition and a hybrid of notify-pull and pull computation models
  - ensure work balance over CPU threads
  - optimized for fast graph processing on CPU
- Dynamic workload assignment schemes
  - Takes into account of the characteristics of processing elements and graph algorithms

Intro

Distributed graph systems: need fast network and effective partitioning to minimize communication
Alternative: non-distributed
- Benefit: users need not to be skilled at managing & tuning
- Problem: pressure on memory and computing power. But it's affordable.
  - RAM is large
  - Advances of secondary storage: high access bandwidth close to memory
  - GPU: massive parallelism to offer high-performance graph processing
Setting: GPU-accelerated, secondary-storage based graph processing
Challenge:
- highly skewed degree distribution of natural graphs
  - Small fraction of vertices adjacent to large fraction of edges --> heavy write contention among GPU threads due to atomic updates of the same vertices
  - Work imbalance of CPU threads
- heterogeneous parallelism of CPU & GPU
  - CPU: sequential processing
  - GPU: bulk parallel processing
Propose: Garaph
- GPU: vertex replication degree customization
- CPU: balanced edge-based partition
- Heterogeneity of computation units
  - Pull computation model: matches the SIMD processing model of GPU
  - Hybrid of notify-pull and pull computation model: optimizes for fast sequential processing on CPU
- Dynamic workload assignment

System overview

2.1 Graph Representation

For organizing incoming and outgoing edges:

Compressed Sparse Column (CSC)
Compressed Sparse Row (CSR)
Shard:
- split vertices V into disjoint sets of vertices and each set is represented by a shard that stores all incoming edges whose destination is in that set.
- Edges in a shard are listed based on increasing order of their indexes of destination vertices.
- Allow each shard to be fit into the shared memory for high bandwidth
- Maximum offset is 12K, can use 16-bit integer to represent the index of destination vertices
Transfer shards from host memory to GPU memory in batch
- Call this as a page
Leverages multi-stream feature of GPUs for the overlap of memory copy and kernel execution
Two vertex-centric computation models
- Pull model
  - Every vertex updates its state by pulling the new states of neighboring vertices through incoming edges
- Notify-pull
  - Only active vertices notify their outgoing neighbors to update, who in turn perform local computation by pulling states of their incoming neighbors
  - More effective in case of few active vertices

System Architecture

Dispatcher
- loading graph from secondary storages, distributing the computation over CPU and GPU
- Partitions each graph page into equal-size data blocks, which are uniformly distributed over multiple secondary storages with a hash function
- Steps
  - Load data blocks from secondary storage to host memory
  - Construct pages
  - After one page is constructed, dispatch to either CPU or GPU
GPU/CPU computation kernel
- GPU
  - Process the shards of page in a parallel manner
  - Only the pull model is enabled on GPU side
    Notify-pull can lead to high frequency of non-coalesced memory accesses because of poor locality and warp divergence caused by distinguishing active/inactive vertices
- CPU
  - Enables both pull and notify-pull
  - Each thread processes one edge set (divide edges of a page into sets of equal size)
- Either of the two kernels has processed on page, there will be a synchronization between GPU and CPU
- Execution can be done both synchronously and asynchronously
  - Iter: complete process over all the pages for one time
Fault Tolerance
- Write vertex data to secondary storages periodically

Programming API

Modified Gather-Apply-Scatter (GAS) abstraction used in the PowerGraph
- Modify scatter function to activate function which sets value if the vertex satisfies the active condition
Atomic (user-provided sum function) + non-atomic operations for GPU and CPU respectively

GPU-Based Graph Processing

Global Memory
- Up to 24GB in size.
  - Size of the vertices is usually 4 bytes. Can store up to 6B (or 12B) vertices in global memory.
- Global Vertices: allows quick access to values of vertices
- Each shard in a page is processed by one GPU block in three phases: initialization, gather, and apply
  - Initialization:
    LocalVertices to store accumulate value of each vertex in a shard.
    Consecutive threads of a block initialize this array with default vertex values defined by users
  - Gather
    Threads of one GPU block process edges of an individual shard. For each edge, one thread fetches vertex & edge data from global memory and increase accumulate value
    To have coalesced global memory accesses: consecutive threads of the block read consecutive edges' data in global memory
  - Apply
    Each thread of block updates vertex value in shared memory
    Async: commit new vertex data to GlobalVertices array, which are immediately visible to the subsequent computation
    Sync: values are written to temporary array in global memory, which would be visible in the next iteration
- When page has been processed, new vertex values are synchronized between GPU global memory and host memory
  - Async: transmits updated values of GlobalVertices in the GPU global memory to array storing the most updated values of vertices in the host memory
  - Sync: stored in temporary space of GPU global memory are transmitted to temporary array in host memory, commit after the iteration ends
  - Can be overlapped with processing

Replication-Based Gather

Problem: gather phase with write contention (multi-threads simultaneously modifying the same shared memory address) --> position conflict
- Frequent for natural graphs (power-law degree distribution)
Strategy: replication
- which consists of placing R adjoining copies of the partial accumulated value in the shared memory to spread these accesses over more shared memory addresses.
- Then these R partial accumulated values are aggregated to calculate the final accumulated value au for a vertex u.
  - Two-way merge
- R: replication factor
Replication factor customization
- Too large? GPU underutilization since fewer vertices can be fit in the shared memory
- Maximizes the expected performance under given conflict degree and space constraints

CPU-Based Graph Processing

Main points
- How it works
- Balanced edge-based partition to exploit parallelism
- Dual-mode processing model: switches between pull/notify-pull modes according to the density of active vertices in the page
Existing approach: assign each thread of a subset of vertices
- Computation imbalance
- Random memory access of edge data if adjacent vertices are assigned to different threads
Edge-centric partition
- Enhance sequential access of edge data and improves work balance over threads
  - Why mention it in CPU-based??
    Why is it not an issue on the GPU side?
    Cuda blocks taking 32 threads
    Done with the block, then push another block
CPU engine
- GlobalVertices in host memory for quick access to values of vertices
- LocalVertices to store accumulate values of destination vertices in the corresponding partition
- Each page: initialization, gather, apply
  - If a page is processed at GPU side, system also synchronizes new vertex values between GPU memory and host memory
- Processing is done
  - Graph state converges (i.e. no active vertices)
    Active vertices: vertices with significant state change (use a bitmap to indicate)
  - Or: a given number of iterations complete
Each page
- Initialization
  - Edges of the page is divided into partitions, and each thread processes one partition.
  - Number of replicas is at most n_t - 1
  - Initialize LocalVertices with vertices' default values
- Gather
  - Each partition --> one thread
  - Edges are processed in a sequential order
  - For each edge, CPU thread performs gather and updates the accumulate value in LocalVertices with sum function
  - After each thread, aggregation phase aggregates values of vertices replicated at the partition boundaries
- Apply
  - After gather phase of each page is finished, every thread updates vertices' values in the LocalVertices array
  - For each vertex in the partition, corresponding thread calls activate() to examine if the vertex is active or not and updates the bitmap
- Sync
  - after the GPU has processed a page, it sends the corresponding vertex values to the host memory
  - Then, system calls Activate() of these updated vertices
  - Async:
    enables the updates received from the GPU immediately visible through writing them into the GlobalVertices array of host memory
    Then, sends back new vertices updated on CPU side to the GPU and overwrites the corresponding part of the GlobalVertices array in the GPU global memory
  - Sync
    stores updated values in a temporary array, and commits these new values at the end of each iteration
    the CPU transmits the new GlobalVertices array to that in the GPU memory at the end of each iteration

Dual-Mode Processing Engine

pull mode
- more beneficial when most vertices are activated (dense active vertex set), which avoids the extra costs of modification
notify/pull mode: a vertex needs to be updated only when one of its source vertices is active in the previous iteration
- more efficient when few vertices are active in the last iteration (sparse active vertex set)
At a given time during the graph processing, the active vertex set may be dense or sparse
- E.x. starts from sparse, then becoming denser as more vertices being activated, and sparse again when algorithm approaches convergence
Problem of combining two modes where only part of graph can be loaded into the host memory
- System entails I/O cost due to sequential and random accesses of outgoing/incoming edges on secondary storage for pull & notify-pull modes respectively (use a different formula to consider the rate of speeds between sequential read and random read of secondary storage)

Dispatcher

Adaptive scheduling mechanism to exploit the overlap of two engines

CPU-GPU Scheduling

if T(CPU) <T(GPU), the system adopts CPU kernel only due to sparse active vertex set. Otherwise, Garaph adopts both GPU and CPU kernels to reduce the overall time of the processing
At the beginning of each iteration, the scheduler calculates the following ratio of T(CPU) to T(GPU)

T(pull) / T(GPU) is initialized by the speed ratio of CPU and GPU hardwares, and is updated once both kernels have begun to process pages
Alpha < 1: only CPU kernel is used for graph processing in this iteration as most vertices are very inactive (e.g. very small f)
- f is the fraction of |V_A| / |V|
Otherwise: process on both CPU and GPU kernel
- Reactively assigns a page to a kernel once the kernel becomes free

GPU Multi-Stream Scheduling

To trigger the graph processing on the GPU side: two threads running on the host
- Transmission thread: continuously transmits each page from the host memory to GPU’s global memory
- Computation thread: launches a new GPU kernel to process the page that has already been transmitted.
NVIDIA's Hyper-Q feature
- pipelining CPU-GPU memory copy and kernel execution
- so that the processing tasks of pages can be dispatched onto multiple streams and handled concurrently

Evaluation

Group Discussion

warp: unit of 32 threads on most gpu
- warp divergence: thread takes a difference step, they diverge
vectorized instruction: increment all elements in the array by one,
- Check the value of the array (>5, <5), don't get 32x speed up