Garaph: Efficient GPU-accelerated Graph Processing on a Single Machine with Balanced Replication

https://www.usenix.org/system/files/conference/atc17/atc17-ma.pdf

Presentation

  • Large-scale graph processing

    • 10^10 pages, 10^12 tokens: PageRank

    • 10^9 nodes, 10^12 edges: social network analysis

  • Powerful storage & computation technologies

  • Goal:

    • Large memory + fast secondary storages

    • CPU + GPUs

      • CPU: sequential

      • GPU: SIMD mode processing

    • How to efficiently integrate heterogeneity under a unified abstraction

  • Non-distributed platform

  • Most time-consuming: gather phase

Paper

Abstract

  • Garaph: GPU-accelerated graph processing system on a single machine with secondary storage as memory extension

  • Contributions

    • Vertex replication degree customization scheme

      • maximize GPU utilization given vertices' degrees and space constraints

    • Balanced edge-based partition and a hybrid of notify-pull and pull computation models

      • ensure work balance over CPU threads

      • optimized for fast graph processing on CPU

    • Dynamic workload assignment schemes

      • Take into account the characteristics of processing elements and graph algorithms

Intro

  • Distributed graph systems: need fast network and effective partitioning to minimize communication

  • Alternative: non-distributed

    • Benefit: users need not be skilled at managing & tuning a distributed system

    • Problem: pressure on memory and computing power. But it's affordable.

      • RAM is large

      • Advances of secondary storage: high access bandwidth close to memory

      • GPU: massive parallelism to offer high-performance graph processing

  • Setting: GPU-accelerated, secondary-storage based graph processing

  • Challenge:

    • highly skewed degree distribution of natural graphs

      • Small fraction of vertices adjacent to large fraction of edges --> heavy write contention among GPU threads due to atomic updates of the same vertices

      • Work imbalance of CPU threads

    • heterogeneous parallelism of CPU & GPU

      • CPU: sequential processing

      • GPU: bulk parallel processing

  • Propose: Garaph

    • GPU: vertex replication degree customization

    • CPU: balanced edge-based partition

    • Heterogeneity of computation units

      • Pull computation model: matches the SIMD processing model of GPU

      • Hybrid of notify-pull and pull computation model: optimizes for fast sequential processing on CPU

    • Dynamic workload assignment

System overview

2.1 Graph Representation

For organizing incoming and outgoing edges:

  • Compressed Sparse Column (CSC)

  • Compressed Sparse Row (CSR)

  • Shard:

    • Split the vertex set V into disjoint subsets; each subset is represented by a shard that stores all incoming edges whose destination is in that subset.

    • Edges in a shard are sorted by increasing index of their destination vertex.

    • Each shard is sized to fit into the shared memory for high bandwidth

    • The maximum offset within a shard is 12K, so a 16-bit integer suffices to represent the index of a destination vertex

  • Transfer shards from host memory to GPU memory in batch

    • Each batch is called a page

  • Leverages multi-stream feature of GPUs for the overlap of memory copy and kernel execution

  • Two vertex-centric computation models

    • Pull model

      • Every vertex updates its state by pulling the new states of neighboring vertices through incoming edges

    • Notify-pull

      • Only active vertices notify their outgoing neighbors to update, who in turn perform local computation by pulling states of their incoming neighbors

      • More effective in case of few active vertices
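The shard layout described above can be sketched in a few lines. This is only an illustration, not Garaph's actual data structure: the function name build_shards and the shard_size parameter are hypothetical, and the tiny shard size is chosen for readability (the notes mention a maximum offset of 12K, which is why a 16-bit local index is enough).

```python
def build_shards(edges, num_vertices, shard_size):
    """Group incoming edges by destination interval (hypothetical helper).

    Each shard covers `shard_size` consecutive destination vertices and stores
    (src, local_dst) pairs, where local_dst = dst - shard_start. Because
    local_dst < shard_size <= 2**16, it fits in a 16-bit integer.
    """
    assert shard_size <= 1 << 16, "local destination index must fit in 16 bits"
    num_shards = (num_vertices + shard_size - 1) // shard_size
    shards = [[] for _ in range(num_shards)]
    for src, dst in edges:
        shard_id = dst // shard_size
        shards[shard_id].append((src, dst - shard_id * shard_size))
    # Edges inside a shard are ordered by increasing destination index.
    for shard in shards:
        shard.sort(key=lambda edge: edge[1])
    return shards


edges = [(0, 1), (2, 1), (3, 0), (1, 4), (0, 5), (4, 2)]
shards = build_shards(edges, num_vertices=6, shard_size=3)
for shard_id, shard in enumerate(shards):
    print(shard_id, shard)
```

Sorting by destination keeps all edges that update the same vertex adjacent, which is what lets a block accumulate into a small local array.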

System Architecture

  • Dispatcher

    • loading graph from secondary storages, distributing the computation over CPU and GPU

    • Partitions each graph page into equal-size data blocks, which are uniformly distributed over multiple secondary storages with a hash function

    • Steps

      • Load data blocks from secondary storage to host memory

      • Construct pages

      • After one page is constructed, dispatch to either CPU or GPU

  • GPU/CPU computation kernel

    • GPU

      • Process the shards of a page in parallel

      • Only the pull model is enabled on GPU side

        • Notify-pull can lead to high frequency of non-coalesced memory accesses because of poor locality and warp divergence caused by distinguishing active/inactive vertices

    • CPU

      • Enables both pull and notify-pull

      • Each thread processes one edge set (divide edges of a page into sets of equal size)

    • After either of the two kernels has processed a page, there is a synchronization between GPU and CPU

    • Execution can be done both synchronously and asynchronously

      • Iteration: one complete pass over all the pages

  • Fault Tolerance

    • Write vertex data to secondary storages periodically

Programming API

  • Modified Gather-Apply-Scatter (GAS) abstraction used in the PowerGraph

    • The scatter function is replaced by an activate function, which sets a value if the vertex satisfies the active condition

  • Atomic (user-provided sum function) + non-atomic operations for GPU and CPU respectively
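To make the modified GAS abstraction concrete, here is a minimal sketch of a PageRank-style vertex program with gather, sum, apply, and activate hooks. The Python class, the method signatures, the damping and tolerance constants, and the run_iteration driver are all assumptions made for illustration; they are not Garaph's actual API.

```python
class PageRank:
    """Hypothetical vertex program in a gather / sum / apply / activate style."""
    damping = 0.85
    tolerance = 1e-3

    def gather(self, src_value, src_out_degree):
        # Contribution pulled from one incoming neighbor.
        return src_value / src_out_degree

    def sum(self, a, b):
        # Commutative, associative combiner for partial results.
        return a + b

    def apply(self, old_value, acc):
        return (1 - self.damping) + self.damping * acc

    def activate(self, old_value, new_value):
        # A vertex is active if its value changed significantly.
        return abs(new_value - old_value) > self.tolerance


def run_iteration(program, in_edges, out_degree, values):
    """One synchronous pull iteration over all vertices (illustrative driver)."""
    new_values, active = [], []
    for v, sources in enumerate(in_edges):
        acc = 0.0
        for u in sources:
            acc = program.sum(acc, program.gather(values[u], out_degree[u]))
        new_value = program.apply(values[v], acc)
        new_values.append(new_value)
        active.append(program.activate(values[v], new_value))
    return new_values, active


# Directed 3-cycle 0 -> 1 -> 2 -> 0, already at its fixed point.
in_edges = [[2], [0], [1]]
out_degree = [1, 1, 1]
ranks, active = run_iteration(PageRank(), in_edges, out_degree, [1.0, 1.0, 1.0])
```

On the 3-cycle every vertex keeps a rank of about 1.0, so activate reports no active vertices and the computation would stop.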

GPU-Based Graph Processing

  • Global Memory

    • Up to 24GB in size.

      • The size of a vertex is usually 4 bytes, so global memory can store up to 6 billion (or 12 billion) vertices.

    • Global Vertices: allows quick access to values of vertices

    • Each shard in a page is processed by one GPU block in three phases: initialization, gather, and apply

      • Initialization:

        • A LocalVertices array stores the accumulated value of each vertex in a shard.

        • Consecutive threads of a block initialize this array with default vertex values defined by users

      • Gather

        • Threads of one GPU block process the edges of an individual shard. For each edge, one thread fetches vertex & edge data from global memory and updates the accumulated value

        • To have coalesced global memory accesses: consecutive threads of the block read consecutive edges' data in global memory

      • Apply

        • Each thread of block updates vertex value in shared memory

        • Async: commit new vertex data to GlobalVertices array, which are immediately visible to the subsequent computation

        • Sync: values are written to temporary array in global memory, which would be visible in the next iteration

    • When page has been processed, new vertex values are synchronized between GPU global memory and host memory

      • Async: transmits updated values of GlobalVertices in the GPU global memory to array storing the most updated values of vertices in the host memory

      • Sync: values stored in the temporary space of GPU global memory are transmitted to a temporary array in host memory and committed after the iteration ends

      • Can be overlapped with processing
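The three phases and the difference between synchronous and asynchronous commits can be simulated sequentially on the CPU. The sketch below is an assumption-laden stand-in for a GPU block: the inner loops play the role of threads, the gather simply sums source values, and the names process_shard and run_page are hypothetical.

```python
def process_shard(shard_start, shard_len, edges, global_vertices,
                  temp_vertices, synchronous, default=0.0):
    """Process one shard the way a single GPU block would (sequential stand-in)."""
    # Initialization: LocalVertices lives in shared memory on a real GPU.
    local_vertices = [default] * shard_len
    # Gather: one "thread" per edge reads the source value from GlobalVertices.
    for src, local_dst in edges:
        local_vertices[local_dst] += global_vertices[src]
    # Apply: async commits to GlobalVertices, sync writes to a temporary array.
    target = temp_vertices if synchronous else global_vertices
    for offset, acc in enumerate(local_vertices):
        target[shard_start + offset] = acc


def run_page(synchronous):
    values = [1.0, 2.0, 3.0, 4.0]
    temp = list(values)
    # Two shards of two vertices each: (start, length, [(src, local_dst), ...]).
    page = [(0, 2, [(2, 0), (3, 1)]), (2, 2, [(0, 0), (1, 1)])]
    for start, length, edges in page:
        process_shard(start, length, edges, values, temp, synchronous)
    return temp if synchronous else values


print(run_page(synchronous=True))
print(run_page(synchronous=False))
```

In the asynchronous run the second shard already sees the values written by the first shard, which is why the two results differ.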

Replication-Based Gather

  • Problem: the gather phase suffers from write contention (multiple threads simultaneously modifying the same shared memory address) --> position conflict

    • Frequent for natural graphs (power-law degree distribution)

  • Strategy: replication

    • Place R adjoining copies of the partial accumulated value in the shared memory to spread these accesses over more shared memory addresses.

    • Then these R partial accumulated values are aggregated to calculate the final accumulated value a_u for a vertex u.

      • Two-way merge

    • R: replication factor

  • Replication factor customization

    • Too large? GPU underutilization, since fewer vertices fit in the shared memory

    • Maximizes the expected performance under given conflict degree and space constraints
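A small simulation shows how replication spreads writes to one hot vertex over R slots and how a two-way merge then combines them. The mapping of thread i to replica i % R is an assumption made for this sketch, not necessarily the placement Garaph uses, and both function names are hypothetical.

```python
def replicated_gather(contributions, R):
    """Accumulate contributions to one vertex over R replicas.

    Returns the replica array and the worst-case number of writes that hit
    the same slot, a rough proxy for write contention.
    """
    replicas = [0.0] * R
    writes_per_slot = [0] * R
    for thread_id, value in enumerate(contributions):
        slot = thread_id % R  # assumed thread-to-replica mapping
        replicas[slot] += value
        writes_per_slot[slot] += 1
    return replicas, max(writes_per_slot)


def two_way_merge(replicas):
    """Pairwise (tree) reduction of the R partial values into one."""
    values = list(replicas)
    stride = 1
    while stride < len(values):
        for i in range(0, len(values) - stride, 2 * stride):
            values[i] += values[i + stride]
        stride *= 2
    return values[0]


contributions = [1.0] * 8
replicas, worst = replicated_gather(contributions, R=4)
total = two_way_merge(replicas)
_, worst_without_replication = replicated_gather(contributions, R=1)
```

With R = 4 the busiest slot receives 2 writes instead of 8, at the cost of four times the shared-memory space for that vertex, which is the trade-off the customization scheme balances.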

CPU-Based Graph Processing

  • Main points

    • How it works

    • Balanced edge-based partition to exploit parallelism

    • Dual-mode processing model: switches between pull/notify-pull modes according to the density of active vertices in the page

  • Existing approach: assign each thread a subset of vertices

    • Computation imbalance

    • Random memory access of edge data if adjacent vertices are assigned to different threads

  • Edge-centric partition

    • Enhance sequential access of edge data and improves work balance over threads

      • Why mention it in CPU-based??

        • Why is it not an issue on the GPU side?

          • CUDA blocks taking 32 threads

          • Done with the block, then push another block

  • CPU engine

    • GlobalVertices in host memory for quick access to values of vertices

    • LocalVertices to store the accumulated values of destination vertices in the corresponding partition

    • Each page: initialization, gather, apply

      • If a page is processed on the GPU side, the system also synchronizes new vertex values between GPU memory and host memory

    • Processing is done

      • Graph state converges (i.e. no active vertices)

        • Active vertices: vertices with significant state change (use a bitmap to indicate)

      • Or: a given number of iterations complete

  • Each page

    • Initialization

      • Edges of the page are divided into partitions, and each thread processes one partition.

      • Number of replicas is at most n_t - 1 (n_t: number of CPU threads)

      • Initialize LocalVertices with vertices' default values

    • Gather

      • Each partition --> one thread

      • Edges are processed in a sequential order

      • For each edge, the CPU thread performs gather and updates the accumulated value in LocalVertices with the sum function

      • After all threads finish, an aggregation phase combines the values of vertices replicated at the partition boundaries

    • Apply

      • After gather phase of each page is finished, every thread updates vertices' values in the LocalVertices array

      • For each vertex in the partition, corresponding thread calls activate() to examine if the vertex is active or not and updates the bitmap

    • Sync

      • after the GPU has processed a page, it sends the corresponding vertex values to the host memory

      • Then, system calls Activate() of these updated vertices

      • Async:

        • makes the updates received from the GPU immediately visible by writing them into the GlobalVertices array of host memory

        • Then, sends back new vertices updated on CPU side to the GPU and overwrites the corresponding part of the GlobalVertices array in the GPU global memory

      • Sync

        • stores updated values in a temporary array, and commits these new values at the end of each iteration

        • the CPU transmits the new GlobalVertices array to that in the GPU memory at the end of each iteration
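The balanced edge-based partition and the boundary aggregation can be sketched as follows. Threads are simulated by a plain loop, the gather is a simple sum of source values, and the function name and return shape are assumptions for this illustration.

```python
def edge_balanced_gather(edges, values, num_vertices, n_threads):
    """Split destination-sorted edges into equal-size partitions, one per thread.

    Returns the accumulated value per vertex and the number of replicas, i.e.
    extra partial accumulators created for vertices whose edges span more than
    one partition.
    """
    chunk = (len(edges) + n_threads - 1) // n_threads
    partials = []
    for t in range(n_threads):  # each loop body stands in for one CPU thread
        local = {}
        for src, dst in edges[t * chunk:(t + 1) * chunk]:
            local[dst] = local.get(dst, 0.0) + values[src]
        partials.append(local)
    # Aggregation phase: combine partial values of boundary vertices.
    acc = [0.0] * num_vertices
    for local in partials:
        for dst, partial in local.items():
            acc[dst] += partial
    distinct = {dst for local in partials for dst in local}
    replicas = sum(len(local) for local in partials) - len(distinct)
    return acc, replicas


# Vertex 0 has five incoming edges, vertex 1 has one, vertex 2 has two.
edges = [(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (0, 1), (0, 2), (1, 2)]
acc, replicas = edge_balanced_gather(edges, [1.0] * 6, num_vertices=6, n_threads=4)
```

Every simulated thread handles exactly two edges even though vertex 0 is much heavier than the others, and only 2 replicas are needed, within the n_t - 1 = 3 bound.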

Dual-Mode Processing Engine

  • pull mode

    • more beneficial when most vertices are activated (dense active vertex set), which avoids the extra costs of notification

  • notify-pull mode: a vertex needs to be updated only when one of its source vertices was active in the previous iteration

    • more efficient when few vertices are active in the last iteration (sparse active vertex set)

  • At a given time during the graph processing, the active vertex set may be dense or sparse

    • E.g. it starts sparse, becomes denser as more vertices are activated, and turns sparse again as the algorithm approaches convergence

  • Problem of combining two modes where only part of graph can be loaded into the host memory

    • The system incurs I/O cost due to sequential and random accesses of outgoing/incoming edges on secondary storage for pull & notify-pull modes respectively, so a different formula is used that accounts for the ratio between sequential-read and random-read speeds of the secondary storage
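The switch between the two modes can be illustrated with min-label propagation, where both modes must produce the same values. The density threshold of 0.5 is a placeholder assumption; the notes do not give the paper's actual switching formula, and all function names here are hypothetical.

```python
def update(v, in_edges, values):
    # Pull: take the minimum label among v and its incoming neighbors.
    return min([values[v]] + [values[u] for u in in_edges[v]])


def pull_step(in_edges, values):
    # Every vertex recomputes its state.
    return [update(v, in_edges, values) for v in range(len(values))]


def notify_pull_step(in_edges, out_edges, values, active):
    # Only out-neighbors of active vertices are notified and recompute.
    notified = {w for u in active for w in out_edges[u]}
    new_values = list(values)
    for v in notified:
        new_values[v] = update(v, in_edges, values)
    return new_values


def run(in_edges, out_edges, values, dense_threshold=0.5):
    active = set(range(len(values)))
    modes = []
    while active:
        f = len(active) / len(values)  # fraction of active vertices
        if f >= dense_threshold:       # placeholder switching rule
            new_values = pull_step(in_edges, values)
            modes.append("pull")
        else:
            new_values = notify_pull_step(in_edges, out_edges, values, active)
            modes.append("notify-pull")
        active = {v for v in range(len(values)) if new_values[v] != values[v]}
        values = new_values
    return values, modes


# Directed chain 0 -> 1 -> 2 -> 3.
in_edges = [[], [0], [1], [2]]
out_edges = [[1], [2], [3], []]
labels, modes = run(in_edges, out_edges, [0, 1, 2, 3])
```

On this chain the run starts in pull mode while most vertices are changing and falls back to notify-pull once only one vertex is still active.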

Dispatcher

  • Adaptive scheduling mechanism to exploit the overlap of two engines

CPU-GPU Scheduling

  • If T(CPU) < T(GPU), the system adopts the CPU kernel only, due to a sparse active vertex set. Otherwise, Garaph adopts both GPU and CPU kernels to reduce the overall processing time

  • At the beginning of each iteration, the scheduler calculates the ratio Alpha = T(CPU) / T(GPU)

  • T(pull) / T(GPU) is initialized by the speed ratio of the CPU and GPU hardware, and is updated once both kernels have begun to process pages

  • Alpha < 1: only the CPU kernel is used for graph processing in this iteration, as most vertices are inactive (e.g. very small f)

    • f = |V_A| / |V| is the fraction of active vertices

  • Otherwise: process on both CPU and GPU kernel

    • Reactively assigns a page to a kernel once the kernel becomes free
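The scheduling rule above can be sketched as a tiny simulation: with Alpha < 1 all pages go to the CPU, otherwise each page goes to whichever kernel becomes free first. The per-page times, the function name assign_pages, and the tie-breaking order are assumptions for illustration only.

```python
import heapq


def assign_pages(num_pages, alpha, cpu_page_time, gpu_page_time):
    """Return which pages each kernel processes in one iteration (sketch).

    alpha is the estimated ratio T(CPU) / T(GPU); the page times are
    hypothetical per-page processing costs used to simulate kernel availability.
    """
    assignment = {"cpu": [], "gpu": []}
    if alpha < 1:
        # Sparse active set: the CPU kernel alone handles the iteration.
        assignment["cpu"] = list(range(num_pages))
        return assignment
    page_time = {"cpu": cpu_page_time, "gpu": gpu_page_time}
    free_at = [(0.0, "cpu"), (0.0, "gpu")]  # (time the kernel becomes free, name)
    heapq.heapify(free_at)
    for page in range(num_pages):
        now, kernel = heapq.heappop(free_at)  # earliest free kernel takes the page
        assignment[kernel].append(page)
        heapq.heappush(free_at, (now + page_time[kernel], kernel))
    return assignment


cpu_only = assign_pages(6, alpha=0.5, cpu_page_time=3.0, gpu_page_time=1.0)
both = assign_pages(6, alpha=3.0, cpu_page_time=3.0, gpu_page_time=1.0)
```

Because assignment is reactive, the faster kernel naturally ends up with more pages without any static split being computed up front.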

GPU Multi-Stream Scheduling

  • To trigger the graph processing on the GPU side: two threads running on the host

    • Transmission thread: continuously transmits each page from the host memory to GPU’s global memory

    • Computation thread: launches a new GPU kernel to process the page that has already been transmitted.

  • NVIDIA's Hyper-Q feature

    • pipelining CPU-GPU memory copy and kernel execution

    • so that the processing tasks of pages can be dispatched onto multiple streams and handled concurrently
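The two host threads can be mimicked with Python's standard threading and queue modules. This is only an analogy under stated assumptions: copying a list stands in for the host-to-GPU transfer, summing a page stands in for a kernel launch, and the bounded queue (num_streams, a hypothetical parameter) models the limited number of pages in flight. No real CUDA streams are involved.

```python
import queue
import threading


def run_pipeline(pages, num_streams=2):
    """Overlap page 'transfers' with page 'computation' using two host threads."""
    transferred = queue.Queue(maxsize=num_streams)
    results = []

    def transmission_thread():
        for page in pages:
            device_copy = list(page)      # stands in for a host-to-device copy
            transferred.put(device_copy)  # blocks while too many pages are in flight
        transferred.put(None)             # sentinel: no more pages

    def computation_thread():
        while True:
            page = transferred.get()
            if page is None:
                break
            results.append(sum(page))     # stands in for launching a kernel

    threads = [threading.Thread(target=transmission_thread),
               threading.Thread(target=computation_thread)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


page_sums = run_pipeline([[1, 2], [3, 4], [5]])
```

While the computation thread works on one page, the transmission thread is already preparing the next, which is the overlap the multi-stream design aims for.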

Evaluation

Group Discussion

  • warp: unit of 32 threads on most GPUs

    • warp divergence: when threads in a warp take different branches, they diverge

  • vectorized instruction: increment all elements in the array by one

    • If the code branches on each element's value (>5 vs. <5), you don't get the full 32x speedup
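The divergence point from the discussion can be captured with a deliberately simplified cost model: a warp needs one pass if all 32 threads take the same branch and two passes if both branches occur. Real GPUs are more nuanced, so treat the function below (warp_passes, a hypothetical name) as a rough illustration only.

```python
WARP_SIZE = 32


def warp_passes(values, threshold=5):
    """Count execution passes for a kernel that branches on `x > threshold`.

    Simplified model: within a warp, each distinct branch outcome is executed
    in its own serialized pass.
    """
    passes = 0
    for start in range(0, len(values), WARP_SIZE):
        warp = values[start:start + WARP_SIZE]
        outcomes = {x > threshold for x in warp}  # branches taken in this warp
        passes += len(outcomes)
    return passes


uniform = [10] * 64     # every thread takes the same branch
mixed = [0, 10] * 32    # neighboring threads take different branches
uniform_passes = warp_passes(uniform)
mixed_passes = warp_passes(mixed)
```

Under this model the mixed input needs twice as many passes as the uniform one, which is why data-dependent branching erodes the ideal 32x speedup.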
