Garaph: Efficient GPU-accelerated Graph Processing on a Single Machine with Balanced Replication
https://www.usenix.org/system/files/conference/atc17/atc17-ma.pdf
Large-scale graph processing
10^10 pages, 10^12 links: PageRank
10^9 nodes, 10^12 edges: social network analysis
Powerful storage & computation technologies
Goal:
Large memory + fast secondary storages
CPU + GPUs
CPU: sequential
GPU: SIMD mode processing
How to efficiently integrate heterogeneity under a unified abstraction
Non-distributed platform
Most time-consuming: gather phase
Garaph: GPU-accelerated graph processing system on a single machine with secondary storage as memory extension
Contributions
Vertex replication degree customization scheme
maximize GPU utilization given vertices' degrees and space constraints
Balanced edge-based partition and a hybrid of notify-pull and pull computation models
ensure work balance over CPU threads
optimized for fast graph processing on CPU
Dynamic workload assignment schemes
Takes into account the characteristics of processing elements and graph algorithms
Distributed graph systems: need fast network and effective partitioning to minimize communication
Alternative: non-distributed
Benefit: users need not be skilled at managing & tuning
Problem: puts pressure on a single machine's memory and computing power, but this is now affordable:
RAM is large
Advances in secondary storage: access bandwidth close to that of memory
GPU: massive parallelism to offer high-performance graph processing
Setting: GPU-accelerated, secondary-storage based graph processing
Challenge:
highly skewed degree distribution of natural graphs
Small fraction of vertices adjacent to large fraction of edges --> heavy write contention among GPU threads due to atomic updates of the same vertices
Work imbalance of CPU threads
heterogeneous parallelism of CPU & GPU
CPU: sequential processing
GPU: bulk parallel processing
Propose: Garaph
GPU: vertex replication degree customization
CPU: balanced edge-based partition
Heterogeneity of computation units
Pull computation model: matches the SIMD processing model of GPU
Hybrid of notify-pull and pull computation model: optimizes for fast sequential processing on CPU
Dynamic workload assignment
For organizing incoming and outgoing edges:
Compressed Sparse Column (CSC)
Compressed Sparse Row (CSR)
Shard:
Split the vertex set V into disjoint subsets; each subset is represented by a shard that stores all incoming edges whose destinations are in that subset.
Edges in a shard are listed in increasing order of their destination vertex indexes.
Each shard is sized to fit into GPU shared memory for high bandwidth
The maximum offset is 12K, so a 16-bit integer suffices to represent the destination vertex index
Shards are transferred from host memory to GPU memory in batches
Such a batch is called a page
Leverages multi-stream feature of GPUs for the overlap of memory copy and kernel execution
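A minimal sketch of what a shard and a page might look like in memory, based on the description above; the field names and types are assumptions, not the paper's actual layout.
```cuda
// Hypothetical shard/page layout: a shard holds the incoming edges of one small
// vertex interval (<= 12K vertices), sorted by destination, with destinations
// stored as 16-bit offsets from the start of the interval.
#include <cstdint>
#include <vector>

struct Shard {
    uint32_t first_dst;                  // first destination vertex covered by this shard
    std::vector<uint32_t> src;           // source vertex of each incoming edge
    std::vector<uint16_t> dst_offset;    // destination, as offset from first_dst (fits in 16 bits)
    std::vector<float>    edge_data;     // optional per-edge data (e.g. weights)
};

// A page is a batch of shards copied from host memory to GPU global memory
// in one transfer and processed by one kernel launch.
struct Page {
    std::vector<Shard> shards;
};
```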
Two vertex-centric computation models
Pull model
Every vertex updates its state by pulling the new states of neighboring vertices through incoming edges
Notify-pull
Only active vertices notify their outgoing neighbors to update, who in turn perform local computation by pulling states of their incoming neighbors
More effective in case of few active vertices
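To make the two models concrete, here is a small self-contained sketch; the adjacency-list representation and the plain sum standing in for the user-defined gather/apply are assumptions, not the paper's code.
```cuda
#include <vector>
#include <cstddef>

struct TinyGraph {
    std::vector<std::vector<int>> in_nbrs;   // in_nbrs[v]: sources of v's incoming edges
    std::vector<std::vector<int>> out_nbrs;  // out_nbrs[u]: destinations of u's outgoing edges
    std::vector<float> value;
};

// Pull: every vertex recomputes from all of its incoming neighbors.
void pull_iteration(TinyGraph& g) {
    std::vector<float> next(g.value);
    for (std::size_t v = 0; v < g.value.size(); ++v) {
        float acc = 0.0f;
        for (int u : g.in_nbrs[v]) acc += g.value[u];    // user-defined gather + sum
        next[v] = acc;                                   // user-defined apply would run here
    }
    g.value.swap(next);
}

// Notify-pull: only vertices notified by an active in-neighbor recompute.
void notify_pull_iteration(TinyGraph& g, const std::vector<bool>& active) {
    std::vector<bool> marked(g.value.size(), false);
    for (std::size_t u = 0; u < g.value.size(); ++u)     // notify phase
        if (active[u])
            for (int v : g.out_nbrs[u]) marked[v] = true;
    for (std::size_t v = 0; v < g.value.size(); ++v) {   // pull phase, marked vertices only
        if (!marked[v]) continue;
        float acc = 0.0f;
        for (int u : g.in_nbrs[v]) acc += g.value[u];
        g.value[v] = acc;
    }
}
```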
Dispatcher
loading graph from secondary storages, distributing the computation over CPU and GPU
Partitions each graph page into equal-size data blocks, which are uniformly distributed over multiple secondary storages with a hash function
Steps
Load data blocks from secondary storage to host memory
Construct pages
After one page is constructed, dispatch to either CPU or GPU
GPU/CPU computation kernel
GPU
Processes the shards of a page in parallel
Only the pull model is enabled on the GPU side
Notify-pull would cause frequent non-coalesced memory accesses (poor locality) and warp divergence from distinguishing active vs. inactive vertices
CPU
Enables both pull and notify-pull
Each thread processes one edge set (divide edges of a page into sets of equal size)
After either kernel has processed a page, the GPU and CPU synchronize
Execution can be done both synchronously and asynchronously
Iteration: one complete pass over all the pages
Fault Tolerance
Write vertex data to secondary storages periodically
Modified Gather-Apply-Scatter (GAS) abstraction, as used in PowerGraph
The scatter function is replaced by an activate function, which marks a vertex active if it satisfies the activation condition
The user-provided sum function is applied atomically on the GPU and non-atomically on the CPU
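For example, PageRank under this modified abstraction could look roughly like the following; only the function roles (gather / sum / apply / activate) come from the paper, the signatures are illustrative assumptions.
```cuda
#include <math.h>

struct PageRank {
    static constexpr float kDamping = 0.85f;

    // gather: contribution pulled along one incoming edge
    __host__ __device__ static float gather(float src_value, int src_out_degree) {
        return src_value / src_out_degree;
    }
    // sum: commutative combiner; applied atomically on the GPU,
    // non-atomically within a CPU thread's private edge partition
    __host__ __device__ static float sum(float a, float b) { return a + b; }

    // apply: turn the accumulated value into the new vertex value
    __host__ __device__ static float apply(float acc, int num_vertices) {
        return (1.0f - kDamping) / num_vertices + kDamping * acc;
    }
    // activate: replaces PowerGraph's scatter; keeps the vertex active while
    // its value is still changing noticeably
    __host__ __device__ static bool activate(float old_value, float new_value) {
        return fabsf(new_value - old_value) > 1e-6f;
    }
};
```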
Global Memory
Up to 24GB in size.
A vertex value is usually 4 bytes, so global memory can store up to 6B (or 12B) vertex values.
Global Vertices: allows quick access to values of vertices
Each shard in a page is processed by one GPU block in three phases: initialization, gather, and apply
Initialization:
A LocalVertices array in shared memory stores the accumulated value of each vertex in the shard.
Consecutive threads of a block initialize this array with default vertex values defined by users
Gather
Threads of one GPU block process the edges of an individual shard. For each edge, one thread fetches vertex & edge data from global memory and updates the accumulated value
To have coalesced global memory accesses: consecutive threads of the block read consecutive edges' data in global memory
Apply
Each thread of the block computes and commits vertex values from the accumulated values in shared memory
Async: commit new vertex data to GlobalVertices array, which are immediately visible to the subsequent computation
Sync: values are written to temporary array in global memory, which would be visible in the next iteration
When a page has been processed, new vertex values are synchronized between GPU global memory and host memory
Async: transmits updated values of GlobalVertices in GPU global memory to the array storing the most up-to-date vertex values in host memory
Sync: values stored in the temporary space of GPU global memory are transmitted to a temporary array in host memory and committed after the iteration ends
Can be overlapped with processing
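Putting the three phases together, a much-simplified per-shard GPU kernel might look like this; the flattened page layout, equal-size shard intervals, and the plain sum standing in for the user functions are all assumptions.
```cuda
// One block per shard: initialize shared-memory accumulators, gather over the
// shard's edges with coalesced reads and atomic sums, then apply.
__global__ void process_shards(const int* edge_src,
                               const unsigned short* edge_dst_off,
                               const int* shard_edge_begin,     // size: num_shards + 1
                               const int* shard_vertex_begin,   // first vertex of each shard
                               const float* global_vertices,
                               float* out_vertices,
                               int vertices_per_shard)          // assumed equal for all shards
{
    extern __shared__ float local_vertices[];

    int shard   = blockIdx.x;
    int e_begin = shard_edge_begin[shard];
    int e_end   = shard_edge_begin[shard + 1];
    int v_base  = shard_vertex_begin[shard];

    // Initialization: consecutive threads fill the shared-memory accumulators.
    for (int i = threadIdx.x; i < vertices_per_shard; i += blockDim.x)
        local_vertices[i] = 0.0f;                      // user-defined default value
    __syncthreads();

    // Gather: consecutive threads read consecutive edges (coalesced) and
    // combine contributions with an atomic sum in shared memory.
    for (int e = e_begin + threadIdx.x; e < e_end; e += blockDim.x)
        atomicAdd(&local_vertices[edge_dst_off[e]],
                  global_vertices[edge_src[e]]);       // user-defined gather()/sum()
    __syncthreads();

    // Apply: async mode would commit straight to GlobalVertices, sync mode to a
    // temporary array that becomes visible only in the next iteration.
    for (int i = threadIdx.x; i < vertices_per_shard; i += blockDim.x)
        out_vertices[v_base + i] = local_vertices[i];  // user-defined apply()
}
// launch sketch: process_shards<<<num_shards, 256, vertices_per_shard * sizeof(float)>>>(...)
```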
Problem: the gather phase suffers write contention (multiple threads simultaneously modifying the same shared memory address) --> position conflicts
Frequent for natural graphs (power-law degree distribution)
Strategy: replication
which consists of placing R adjoining copies of the partial accumulated value in the shared memory to spread these accesses over more shared memory addresses.
Then these R partial accumulated values are aggregated to calculate the final accumulated value a_u for a vertex u.
Two-way merge
R: replication factor
Replication factor customization
Too large? GPU underutilization, since fewer vertices can fit in shared memory
Maximizes the expected performance under given conflict degree and space constraints
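A sketch of how the replicated accumulators and the two-way merge could look inside the gather, assuming for simplicity one power-of-two replication factor R for the whole shard; the paper customizes R per vertex from its degree and the shared-memory budget.
```cuda
__global__ void gather_with_replication(const int* edge_src,
                                        const unsigned short* edge_dst_off,
                                        int num_edges,
                                        int num_local_vertices,
                                        int R,                   // power-of-two replication factor
                                        const float* global_vertices,
                                        float* shard_out)
{
    extern __shared__ float acc[];                     // num_local_vertices * R floats

    for (int i = threadIdx.x; i < num_local_vertices * R; i += blockDim.x)
        acc[i] = 0.0f;
    __syncthreads();

    // Gather: contention on a hot destination is spread over its R adjoining copies.
    for (int e = threadIdx.x; e < num_edges; e += blockDim.x) {
        int slot = edge_dst_off[e] * R + (threadIdx.x % R);
        atomicAdd(&acc[slot], global_vertices[edge_src[e]]);
    }
    __syncthreads();

    // Two-way merge: halve the number of live copies per vertex at every step.
    for (int stride = R / 2; stride >= 1; stride /= 2) {
        for (int i = threadIdx.x; i < num_local_vertices * stride; i += blockDim.x) {
            int v = i / stride, c = i % stride;
            acc[v * R + c] += acc[v * R + c + stride];
        }
        __syncthreads();
    }
    // The final accumulated value a_u sits in copy 0 of each vertex.
    for (int v = threadIdx.x; v < num_local_vertices; v += blockDim.x)
        shard_out[v] = acc[v * R];
}
```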
Main points
How it works
Balanced edge-based partition to exploit parallelism
Dual-mode processing model: switches between pull/notify-pull modes according to the density of active vertices in the page
Existing approach: assign each thread a subset of vertices
Computation imbalance
Random memory access of edge data if adjacent vertices are assigned to different threads
Edge-centric partition
Enhances sequential access of edge data and improves work balance over threads
Why mention it in CPU-based??
Why is it not an issue on the GPU side?
CUDA blocks execute as warps of 32 threads
Once a block is done, another block is pushed (scheduled), which evens out imbalance across blocks
CPU engine
GlobalVertices in host memory for quick access to values of vertices
LocalVertices stores the accumulated values of destination vertices in the corresponding partition
Each page: initialization, gather, apply (see the code sketch after the sync notes below)
If a page is processed at GPU side, system also synchronizes new vertex values between GPU memory and host memory
Processing is done when:
Graph state converges (i.e. no active vertices)
Active vertices: vertices with significant state change (use a bitmap to indicate)
Or: a given number of iterations complete
Each page
Initialization
Edges of the page are divided into equal-size partitions, and each thread processes one partition.
The number of replicas is at most n_t - 1 (one per partition boundary, with n_t CPU threads)
Initialize LocalVertices with vertices' default values
Gather
Each partition --> one thread
Edges are processed in a sequential order
For each edge, the CPU thread performs gather and updates the accumulated value in LocalVertices with the sum function
After the threads finish, an aggregation phase merges the values of vertices replicated at the partition boundaries
Apply
After the gather phase of a page is finished, every thread updates vertex values in the LocalVertices array
For each vertex in its partition, the corresponding thread calls Activate() to check whether the vertex is active and updates the bitmap
Sync
after the GPU has processed a page, it sends the corresponding vertex values to the host memory
Then, system calls Activate() of these updated vertices
Async:
makes the updates received from the GPU immediately visible by writing them into the GlobalVertices array in host memory
Then, sends the vertices newly updated on the CPU side back to the GPU, overwriting the corresponding part of the GlobalVertices array in GPU global memory
Sync
stores updated values in a temporary array, and commits these new values at the end of each iteration
the CPU transmits the new GlobalVertices array to that in the GPU memory at the end of each iteration
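As referenced above, a condensed sketch of the CPU engine's per-page steps (equal-size edge partitions, per-thread gather, aggregation of boundary replicas, apply). The flat edge arrays, OpenMP threading, and the plain sum/activation check are illustrative assumptions; only the structure follows the notes.
```cuda
#include <omp.h>
#include <vector>
#include <algorithm>

// edge_dst holds page-local destination indices (edges sorted by destination);
// edge_src holds global source indices into GlobalVertices.
void process_page_cpu(const std::vector<int>& edge_src,
                      const std::vector<int>& edge_dst,
                      const std::vector<float>& global_vertices,
                      std::vector<float>& local_vertices,    // accumulated values per page vertex
                      std::vector<char>& active_bitmap)
{
    const int  n_t   = omp_get_max_threads();
    const long m     = static_cast<long>(edge_src.size());
    const long chunk = (m + n_t - 1) / n_t;

    // Per-thread replicas for the (at most two) destinations at partition boundaries.
    std::vector<int>   bound_vtx(2 * n_t, -1);
    std::vector<float> bound_acc(2 * n_t, 0.0f);

    #pragma omp parallel num_threads(n_t)
    {
        const int  t  = omp_get_thread_num();
        const long lo = std::min(static_cast<long>(t) * chunk, m);
        const long hi = std::min(lo + chunk, m);
        if (lo < hi) {
            const int first_dst = edge_dst[lo], last_dst = edge_dst[hi - 1];
            bound_vtx[2 * t] = first_dst;
            bound_vtx[2 * t + 1] = last_dst;
            for (long e = lo; e < hi; ++e) {              // gather: sequential edge scan
                const float contrib = global_vertices[edge_src[e]];
                const int   dst     = edge_dst[e];
                if (dst == first_dst)     bound_acc[2 * t]     += contrib; // boundary replica
                else if (dst == last_dst) bound_acc[2 * t + 1] += contrib; // boundary replica
                else                      local_vertices[dst]  += contrib; // owned exclusively
            }
        }
    }
    // Aggregation: merge boundary replicas once per page.
    for (int i = 0; i < 2 * n_t; ++i)
        if (bound_vtx[i] >= 0) local_vertices[bound_vtx[i]] += bound_acc[i];

    // Apply: the user-defined apply()/Activate() would run here; a stand-in check.
    #pragma omp parallel for
    for (long v = 0; v < static_cast<long>(local_vertices.size()); ++v)
        active_bitmap[v] = (local_vertices[v] != 0.0f);
}
```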
pull mode
more beneficial when most vertices are active (dense active vertex set), since it avoids the extra cost of notifications
notify/pull mode: a vertex needs to be updated only when one of its source vertices is active in the previous iteration
more efficient when few vertices are active in the last iteration (sparse active vertex set)
At a given time during the graph processing, the active vertex set may be dense or sparse
E.g. the active set starts sparse, becomes denser as more vertices are activated, and turns sparse again as the algorithm approaches convergence
Problem when combining the two modes and only part of the graph can be loaded into host memory:
The system incurs I/O cost from sequential and random accesses of outgoing/incoming edges on secondary storage for the pull and notify-pull modes respectively, so the mode-selection formula also accounts for the ratio between the sequential- and random-read speeds of the storage
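A hedged sketch of that decision: the cost expressions below are placeholders (the paper derives its own formula); the only point carried over is that a pull pass pays a sequential scan over all edges while a notify-pull pass pays random reads proportional to the active set, and the two read speeds differ.
```cuda
enum class Mode { Pull, NotifyPull };

Mode choose_mode(double num_edges,        // |E| in the pages that must be read
                 double active_fraction,  // f = |V_A| / |V|
                 double seq_read_bw,      // sequential read bandwidth of the storage
                 double rand_read_bw)     // (much lower) random read bandwidth
{
    double cost_pull        = num_edges / seq_read_bw;                     // scan everything
    double cost_notify_pull = active_fraction * num_edges / rand_read_bw;  // work near active vertices
    return (cost_notify_pull < cost_pull) ? Mode::NotifyPull : Mode::Pull;
}
```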
Adaptive scheduling mechanism to exploit the overlap of two engines
If T(CPU) < T(GPU), the system uses the CPU kernel only (sparse active vertex set); otherwise, Garaph uses both the GPU and CPU kernels to reduce the overall processing time
At the beginning of each iteration, the scheduler calculates the ratio alpha of T(CPU) to T(GPU)
T(pull) / T(GPU) is initialized from the speed ratio of the CPU and GPU hardware, and is updated once both kernels have begun to process pages
Alpha < 1: only the CPU kernel is used for graph processing in this iteration, as most vertices are inactive (very small f)
f = |V_A| / |V|, the fraction of active vertices
Otherwise: process on both CPU and GPU kernel
Reactively assigns a page to a kernel once the kernel becomes free
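A rough sketch of the per-iteration decision; how alpha depends on f is an assumption here (the paper uses its own cost model and measured per-page times), but the rule that alpha < 1 means CPU-only and otherwise both kernels follows the notes.
```cuda
// pull_to_gpu_ratio approximates T(pull on CPU) / T(GPU); it is seeded from the
// hardware speed ratio and refined from measured page-processing times.
struct KernelScheduler {
    double pull_to_gpu_ratio;

    // Called at the start of each iteration with f = |V_A| / |V|.
    bool use_gpu(double f) const {
        // With a sparse active set the CPU runs notify-pull and touches only ~f of
        // the work, so its estimated time is assumed to shrink roughly with f.
        double alpha = f * pull_to_gpu_ratio;   // assumed form of T(CPU) / T(GPU)
        return alpha >= 1.0;                    // alpha < 1: CPU kernel only
    }

    void observe(double cpu_page_time, double gpu_page_time) {
        // Refine the ratio once both kernels have processed pages.
        pull_to_gpu_ratio = cpu_page_time / gpu_page_time;
    }
};
```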
To trigger the graph processing on the GPU side: two threads running on the host
Transmission thread: continuously transmits each page from the host memory to GPU’s global memory
Computation thread: launches a new GPU kernel to process the page that has already been transmitted.
NVIDIA's Hyper-Q feature
pipelining CPU-GPU memory copy and kernel execution
so that the processing tasks of pages can be dispatched onto multiple streams and handled concurrently
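A minimal sketch of the multi-stream page pipeline, with a dummy per-page kernel standing in for the real one; pinned host buffers (cudaMallocHost) are assumed so the async copies actually overlap with kernel execution.
```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void process_page(const char* page, size_t bytes) {
    // ... per-page shard processing would go here ...
}

void run_pages_on_gpu(const std::vector<char*>& host_pages,    // pinned host buffers
                      const std::vector<char*>& device_pages,
                      const std::vector<size_t>& page_bytes,
                      int num_streams)
{
    std::vector<cudaStream_t> streams(num_streams);
    for (auto& s : streams) cudaStreamCreate(&s);

    for (size_t p = 0; p < host_pages.size(); ++p) {
        cudaStream_t s = streams[p % num_streams];
        // Transmission: copy page p while other streams may still be running kernels.
        cudaMemcpyAsync(device_pages[p], host_pages[p], page_bytes[p],
                        cudaMemcpyHostToDevice, s);
        // Computation: queued behind the copy on the same stream, so it starts as
        // soon as its own page has arrived; Hyper-Q lets the streams interleave.
        process_page<<<64, 256, 0, s>>>(device_pages[p], page_bytes[p]);
    }
    for (auto& s : streams) { cudaStreamSynchronize(s); cudaStreamDestroy(s); }
}
```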
Warp: a unit of 32 threads on most GPUs
Warp divergence: when threads of a warp take different branches, they diverge and the branch paths execute one after the other
Vectorized (SIMD) execution: e.g. incrementing all elements of an array by one maps well onto a warp
But if each thread branches on the element's value (> 5 vs. < 5), the warp diverges and you don't get the 32x speedup
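A tiny pair of kernels illustrating that last point: the unconditional increment keeps each warp converged, while the value-dependent branch splits it and the two paths run serially.
```cuda
__global__ void increment_all(float* a, int n) {      // fully vectorized: no divergence
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

__global__ void branch_on_value(float* a, int n) {    // diverges on the data-dependent branch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (a[i] > 5.0f)       // threads taking this path run first...
        a[i] += 1.0f;
    else                   // ...then the rest run this path, so no 32x speedup here
        a[i] -= 1.0f;
}
```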