Accelerating Graph Sampling for Graph Machine Learning using GPUs
https://dl.acm.org/doi/10.1145/3447786.3456244
Requirements for GPU performance
thread: fundamental unit of computation in a GPU
thread block: threads are statically grouped into thread blocks, and each thread is assigned a unique ID within its block
streaming multiprocessors (SMs): the GPU contains multiple SMs, each of which executes one or more thread blocks
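For concreteness, a minimal CUDA sketch of this hierarchy (not from the paper; the kernel name and launch configuration are illustrative): each thread derives a unique global index from its block ID and its thread ID within the block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one array element; its global index comes from its
// block ID (blockIdx) and its thread ID within the block (threadIdx).
__global__ void scale(float *out, const float *in, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = factor * in[i];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    // 256 threads per block; enough blocks to cover all n elements.
    // The hardware distributes these blocks across the available SMs.
    int threads = 256, blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(out, in, 2.0f, n);
    cudaDeviceSynchronize();

    printf("out[0] = %.1f\n", out[0]);  // expect 2.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```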
Types of memory
shared memory: each SM's private memory, which is only available to the thread blocks assigned to that SM
global memory: the GPU has global memory, which is accessible to all SMs
Access latency of global memory >> access latency of shared memory
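A sketch of the standard pattern this latency gap motivates: stage data in shared memory once, then serve repeated reads from the fast on-chip tile. The 1D stencil below is a hypothetical example, not code from the paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define RADIUS 3
#define THREADS 256

// Each block copies its tile of the input (plus a halo of RADIUS elements
// on each side) from slow global memory into fast shared memory once, then
// every thread reads its 2*RADIUS+1 neighbours from the shared tile.
__global__ void stencil1d(const float *in, float *out, int n) {
    __shared__ float tile[THREADS + 2 * RADIUS];
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int s = threadIdx.x + RADIUS;                   // index in the tile

    tile[s] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {  // first RADIUS threads load both halos
        int left = g - RADIUS, right = g + blockDim.x;
        tile[s - RADIUS]     = (left >= 0) ? in[left]  : 0.0f;
        tile[s + blockDim.x] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();  // whole tile must be loaded before anyone reads it

    if (g < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k) sum += tile[s + k];
        out[g] = sum;
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    stencil1d<<<(n + THREADS - 1) / THREADS, THREADS>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[100] = %.0f\n", out[100]);  // window of 7 ones -> 7
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Each input element is fetched from global memory roughly once per block but reused 2*RADIUS+1 times from shared memory, which is what makes the latency gap worth designing around.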
To run a thread block, an SM schedules a subset of threads from the thread block, known as a warp
A warp typically consists of 32 threads with consecutive thread IDs
The GPU employs the Single Instruction Multiple Threads (SIMT) execution model
All threads in a warp run the same instruction in lock-step
Consequence
Two threads in the same warp cannot execute the two sides of a branch concurrently
Warp divergence: when the threads in a warp encounter a branch, the subset of threads that do not take the branch must wait for the other threads to complete it
Goal: minimize warp divergence
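The classic illustration is a shared-memory tree reduction, sketched below with assumed kernel names (sumDivergent, sumPacked): both compute the same block sums, but the first scatters the active threads across all warps at every step, while the second keeps them packed at low thread IDs so most warps are entirely active or entirely idle.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 256

// Divergent block sum: the active threads at each step are scattered
// (every (2*stride)-th thread ID), so warps mix active and idle lanes
// and pay for both sides of the branch.
__global__ void sumDivergent(const float *in, float *partial, int n) {
    __shared__ float buf[THREADS];
    int tid = threadIdx.x, i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = buf[0];
}

// Low-divergence block sum: active threads stay packed at the low thread
// IDs, so warps are fully active or fully idle until stride drops below
// the warp size of 32.
__global__ void sumPacked(const float *in, float *partial, int n) {
    __shared__ float buf[THREADS];
    int tid = threadIdx.x, i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = buf[0];
}

int main() {
    const int n = 1 << 20, blocks = n / THREADS;
    float *in, *partial;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    sumPacked<<<blocks, THREADS>>>(in, partial, n);
    cudaDeviceSynchronize();
    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += partial[b];
    printf("sum = %.0f\n", total);  // expect 1048576; sumDivergent matches
    cudaFree(in); cudaFree(partial);
    return 0;
}
```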
Another goal: balance resource usage across thread blocks
The GPU can provide high-bandwidth access to global memory by coalescing several memory accesses from the same warp
Coalescing is only possible when the concurrent memory accesses from threads in the same warp fall in consecutive memory segments
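A sketch of the difference (kernel names and the index remapping are illustrative, not from the paper): both kernels copy the same n elements, but in the second the 32 threads of a warp touch elements n/32 apart, so their accesses land in different memory segments and cannot be coalesced; the strided copy is typically markedly slower.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: consecutive threads of a warp read/write consecutive floats,
// so the warp's 32 accesses combine into a few wide memory transactions.
__global__ void copyCoalesced(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: the index is remapped so the 32 threads of a warp touch
// elements n/32 apart; each access lands in a different memory segment.
// The remap is a bijection, so the total data moved is identical to the
// coalesced version; only the access pattern differs.
__global__ void copyStrided(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int warp = i / 32, lane = i % 32;
    int j = lane * (n / 32) + warp;
    out[j] = in[j];
}

int main() {
    const int n = 1 << 24, threads = 256, blocks = n / threads;
    float *in, *out, ms;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    copyCoalesced<<<blocks, threads>>>(out, in, n);  // warm-up launch
    cudaDeviceSynchronize();

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    cudaEventRecord(t0);
    copyCoalesced<<<blocks, threads>>>(out, in, n);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("coalesced: %.2f ms\n", ms);

    cudaEventRecord(t0);
    copyStrided<<<blocks, threads>>>(out, in, n);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("strided:   %.2f ms\n", ms);

    cudaFree(in); cudaFree(out);
    return 0;
}
```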