Accelerating Graph Sampling for Graph Machine Learning using GPUs

https://dl.acm.org/doi/10.1145/3447786.3456244

Requirements for GPU performance

  • thread: fundamental unit of computation in a GPU

  • thread block: threads are statically grouped into thread blocks, and each thread is assigned a unique ID within its block

  • streaming multiprocessors (SMs): each of which executes one or more thread blocks
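A rough illustration of how this hierarchy gives every thread a unique identity (plain Python, not GPU code; the formula mirrors CUDA's 1-D `blockIdx.x * blockDim.x + threadIdx.x` convention):

```python
def global_thread_id(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Unique ID of a thread across the whole grid: the block's offset
    (block index times threads per block) plus the thread's ID in its block."""
    return block_idx * block_dim + thread_idx

# With blocks of 256 threads, thread 3 of block 2 gets global ID 515.
assert global_thread_id(2, 256, 3) == 515
```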

  • Types of memory

    • shared memory: each SM's private memory, which is only available to the thread blocks assigned to that SM

    • global memory: memory accessible to all SMs

    • Access latency of global memory >> shared memory

  • To run a thread block, an SM schedules a subset of threads from the thread block, known as a warp

    • A warp typically consists of 32 threads with consecutive thread IDs

    • The GPU employs the Single Instruction, Multiple Threads (SIMT) execution model

      • All threads in a warp run the same instruction in lock-step

      • Consequence

        • Two threads in a warp cannot execute the two sides of a branch concurrently

        • Warp divergence: when the threads in a warp encounter a branch, the subset of threads that do not take the branch must sit idle until the other threads complete it

      • Goal: minimize warp divergence
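A minimal sketch of why divergence hurts, as a plain-Python cost model (the 10/50-cycle branch costs are made-up numbers for illustration, not measurements):

```python
def warp_execution_cycles(takes_branch, then_cost, else_cost):
    """Cycles a warp spends on an if/else under SIMT: each side of the
    branch executes if at least one thread takes it, and the two sides
    run one after the other, never concurrently."""
    cycles = 0
    if any(takes_branch):
        cycles += then_cost       # threads on the else-side sit idle here
    if not all(takes_branch):
        cycles += else_cost       # and the then-side threads sit idle here
    return cycles

uniform  = warp_execution_cycles([True] * 32, 10, 50)              # all take it: 10
diverged = warp_execution_cycles([True] * 16 + [False] * 16, 10, 50)  # both sides: 60
```

A split warp pays for both sides of the branch, which is why sampling kernels try to give all 32 threads of a warp the same control-flow path.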

  • Another goal: balance resource usage across thread blocks

  • The GPU can provide high-bandwidth access to global memory by coalescing several memory accesses from the same warp into fewer transactions

    • only possible when concurrent memory accesses from threads in the same warp fall into consecutive memory segments
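Coalescing can be sketched by counting how many memory segments a warp's concurrent accesses touch (plain Python; the 128-byte segment size is an assumption, typical for NVIDIA GPUs, not a value from the paper):

```python
def num_transactions(addresses, segment_bytes=128):
    """Number of distinct memory segments touched by a warp's concurrent
    accesses; accesses within one segment coalesce into one transaction."""
    return len({addr // segment_bytes for addr in addresses})

# 32 threads reading consecutive 4-byte words: one 128-byte segment, 1 transaction.
coalesced = num_transactions([tid * 4 for tid in range(32)])
# The same threads striding 128 bytes apart: one transaction per thread.
scattered = num_transactions([tid * 128 for tid in range(32)])
```

The scattered pattern issues 32x the transactions for the same amount of useful data, which is why graph-sampling kernels try to lay out neighbor lists so that adjacent threads read adjacent addresses.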
