Accelerating Graph Sampling for Graph Machine Learning using GPUs

https://dl.acm.org/doi/10.1145/3447786.3456244

Requirements for GPU performance

  • thread: fundamental unit of computation in a GPU

  • thread block: threads are statically grouped into thread blocks, and each thread is assigned a unique ID within its block

  • streaming multiprocessors (SMs): each of which executes one or more thread blocks
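A rough illustration of how this hierarchy gives every thread a unique identity (plain Python, not GPU code; the formula mirrors CUDA's 1-D `blockIdx.x * blockDim.x + threadIdx.x` convention):

```python
def global_thread_id(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Unique ID of a thread across the whole grid: the block's offset
    (block index times threads per block) plus the thread's ID in its block."""
    return block_idx * block_dim + thread_idx

# With blocks of 256 threads, thread 3 of block 2 gets global ID 515.
assert global_thread_id(2, 256, 3) == 515
```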

  • Types of memory

    • shared memory: each SM's private memory, which is only available to the thread blocks assigned to that SM

    • global memory: memory accessible to all SMs

    • Access latency of global memory >> shared memory

  • To run a thread block, an SM schedules a subset of threads from the thread block, known as a warp

    • A warp typically consists of 32 threads with consecutive thread IDs

    • The GPU employs the Single Instruction, Multiple Threads (SIMT) execution model

      • All threads in a warp run the same instruction in lock-step

      • Consequence

        • Two threads in a warp cannot execute the two sides of a branch concurrently

        • Warp divergence: when the threads in a warp encounter a branch, the subset of threads that do not take the branch must sit idle until the other threads complete it

      • Goal: minimize warp divergence
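A minimal sketch of why divergence hurts, as a plain-Python cost model (the 10/50-cycle branch costs are made-up numbers for illustration, not measurements):

```python
def warp_execution_cycles(takes_branch, then_cost, else_cost):
    """Cycles a warp spends on an if/else under SIMT: each side of the
    branch executes if at least one thread takes it, and the two sides
    run one after the other, never concurrently."""
    cycles = 0
    if any(takes_branch):
        cycles += then_cost       # threads on the else-side sit idle here
    if not all(takes_branch):
        cycles += else_cost       # and the then-side threads sit idle here
    return cycles

uniform  = warp_execution_cycles([True] * 32, 10, 50)              # all take it: 10
diverged = warp_execution_cycles([True] * 16 + [False] * 16, 10, 50)  # both sides: 60
```

A split warp pays for both sides of the branch, which is why sampling kernels try to give all 32 threads of a warp the same control-flow path.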

  • Another goal: balance resource usage across thread blocks

  • The GPU can provide high-bandwidth access to global memory by coalescing several memory accesses from the same warp into fewer transactions

    • only possible when concurrent memory accesses from threads in the same warp fall into consecutive memory segments
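Coalescing can be sketched by counting how many memory segments a warp's concurrent accesses touch (plain Python; the 128-byte segment size is an assumption, typical for NVIDIA GPUs, not a value from the paper):

```python
def num_transactions(addresses, segment_bytes=128):
    """Number of distinct memory segments touched by a warp's concurrent
    accesses; accesses within one segment coalesce into one transaction."""
    return len({addr // segment_bytes for addr in addresses})

# 32 threads reading consecutive 4-byte words: one 128-byte segment, 1 transaction.
coalesced = num_transactions([tid * 4 for tid in range(32)])
# The same threads striding 128 bytes apart: one transaction per thread.
scattered = num_transactions([tid * 128 for tid in range(32)])
```

The scattered pattern issues 32x the transactions for the same amount of useful data, which is why graph-sampling kernels try to lay out neighbor lists so that adjacent threads read adjacent addresses.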
