Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless

https://arxiv.org/pdf/2105.11118.pdf

Paper

Abstract

  • Two major GNN training obstacles

    • it relies on high-end servers with many GPUs, which are expensive to purchase and maintain

    • limited memory on GPUs cannot scale to today's billion-edge graphs

  • Dorylus: distributed system for training GNNs which takes advantage of serverless computing to increase scalability at a low cost

  • Key: computation separation

    • Construct a deep, bounded-asynchronous pipeline where graph and tensor parallel tasks can fully overlap

    • Effectively hiding the network latency incurred by Lambdas

Introduction

  • Model in GNN family: GCN, GRN, GAT

  • Wide range of applications

  • GPUs

    • Benefits: highly-parallel computations

    • Drawbacks:

      • expensive

        • Even in the public cloud (the paper shows the pricing)

      • limited memory --> hinders scalability

  • Existing approaches to solve the two drawbacks

    • CPU

      • Looser memory restrictions, and operate at lower costs

      • But: far inferior efficiency (speed)

    • Graph sampling techniques

      • Improve scalability by considering less graph data

      • Limitations

        • must be done repeatedly per epoch, incurring time overheads

        • reduce accuracy of trained GNNs

        • no guarantee on convergence

  • Affordable, Scalable, and Accurate GNN Training

    • High efficiency: close to GPUs

    • High accuracy: e.g., higher than sampling

    • Low cost for training billion-edge graphs

  • Turn to: Serverless computing paradigm

    • Large number of parallel threads at an extremely low price

    • Challenges

      • Limited compute resources (e.g. 2 weak vCPUs)

      • Restricted network resources (e.g., 200 Mbps between Lambda servers and standard EC2 servers)

  • Challenges

    • how to make computation fit into Lambda's weak compute profile?

      • Not all operations in GNN training need Lambda's parallelism

      • Divide training pipeline into a set of fine-grained tasks, based on the type of data they process

        • graph-parallel path: CPU instances

        • tensor-parallel path: Lambdas

          • Lightweight linear algebra operations on small data chunks

          • cheaper than regular CPU instances

    • how to minimize the negative impact of Lambda's network latency?

      • Without mitigation, Lambdas can spend one-third of their time on communication

      • To keep communication from becoming the bottleneck: bounded pipeline asynchronous computation (BPAC)

        • Make full use of pipelining where different fine-grained tasks overlap with each other

          • Graph-parallel & tensor-parallel tasks run simultaneously

        • Also, to reduce the wait time between tasks, Dorylus incorporates asynchrony into the pipeline

          • Challenge: two computation paths, where exactly should asynchrony be introduced?

          • Use it where staleness can be tolerated

            • parameter updates (tensor-parallel path)

            • data gathering from neighbor vertices (graph-parallel path)

          • To prevent asynchrony from slowing down convergence, Dorylus bounds the degree of asynchrony at each location using

            • weight stashing at parameter updates

            • bounded staleness at data gathering

Background

  • GNN

    • Input: graph-structured data, each vertex is associated with a feature vector

    • Output: a feature vector for each individual vertex or the whole graph

    • Usage: output feature vectors can be used for graph or vertex classification

    • Goal: learn the patterns and relationships among the data, rather than relying solely on the features of a single data point

    • Combine

      • Graph propagation (Gather and Scatter)

      • NN computations

  • Graph-parallel interface

    • E.g., the Deep Graph Library (DGL), which unifies a variety of GNN models with a common GAS-like interface

Example: GCN

Forward Pass

  • A: adjacency matrix of the input graph; A_bar = A + I_N: adjacency matrix with self-loops, constructed by adding the identity matrix to A

  • D_bar: diagonal degree matrix of A_bar, i.e., D_bar_ii = sum_j A_bar_ij

  • Normalized adjacency matrix: D_bar^(-1/2) A_bar D_bar^(-1/2)

  • W_L: layer-specific trainable weight matrix

  • sigma: non-linear activation function, such as ReLU

  • H_L: activations matrix of the L-th layer

  • H_0 = X is the input feature matrix for all vertices
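
Written out from the definitions above, rule R1 for the L-th layer is the standard GCN propagation rule:

```latex
H^{(L+1)} = \sigma\!\left(\bar{D}^{-\frac{1}{2}}\,\bar{A}\,\bar{D}^{-\frac{1}{2}}\,H^{(L)}\,W^{(L)}\right),
\qquad \bar{A} = A + I_N,
\qquad \bar{D}_{ii} = \textstyle\sum_j \bar{A}_{ij},
\qquad H^{(0)} = X
```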

Mapping R1 to vertex-centric computation model:

  • Gather (GA)

  • Apply Vertex (AV)

  • Scatter (SC)

  • Apply Edge (AE)

  • One can see layer L's activations matrix H_L as a group of activation vectors, each associated with a vertex

  • Goal: compute a new activations vector for each vertex based on the vertex's previous activations vector (which, initially is its feature vector) and the information received from its in-neighbors

  • Different (!) from traditional graph processing

    • The computation of new activations matrix H_(L+1) is based on computationally intensive NN operations rather than a numerical function

  • Step

    • GA stage retrieves a (activation) vector from each in-edge of a vertex and aggregates these vectors into a new vector v

      • Applying GA on all vertices is just a matrix multiplication between the normalized adjacency matrix and the input activations matrix

    • The result from the first stage is fed to AV, which performs NN operations to obtain the new activations matrix H_(L+1)

      • For GCN, AV multiplies the result from stage one with a trainable weight matrix W_L and applies non-linear activation function

    • The output of AV goes to SC, which propagates the new activations vector of each vertex among all out-edges of the vertex

        • (Question: is the same vector propagated along every out-edge, or a different one per edge?)

    • AE: Lastly, new activations vector of each vertex goes into an edge-level NN architecture to compute an activations vector for each edge

      • GCN: edge-level NN is not needed

  • Repeating the process k times (k layers) allows each vertex to consider features of vertices k hops away
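
A minimal sketch (not Dorylus code) of how one GCN layer maps onto the GA and AV stages above, assuming a precomputed normalized adjacency matrix `A_hat` and using numpy/scipy:

```python
import numpy as np
from scipy import sparse

def gcn_layer(A_hat, H, W):
    """One GCN layer expressed as the GA + AV stages described above.

    A_hat: normalized adjacency matrix (sparse, |V| x |V|)
    H:     activations matrix H_L of the current layer (|V| x d_in)
    W:     trainable weight matrix W_L of this layer (d_in x d_out)
    """
    gathered = A_hat @ H                  # GA: aggregate in-neighbor activations (sparse matmul)
    H_next = np.maximum(gathered @ W, 0)  # AV: per-vertex NN op (dense matmul + ReLU)
    # SC would then propagate H_next along out-edges; GCN needs no AE stage
    return H_next

# toy usage with hypothetical sizes: 4 vertices, 3 input features, 2 outputs
A_hat = sparse.identity(4, format="csr")
H1 = gcn_layer(A_hat, np.random.rand(4, 3), np.random.rand(3, 2))
```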

Backward Pass

  • computes the gradients for all trainable weights in the vertex- and edge-level NN architectures (i.e., AV and AE).

Training epochs

  • Forward pass

  • Backward pass

  • Weights update

    • Uses the gradients computed in the backward pass to update the trainable weights in the vertex- and edge-level NN architecture in a GNN

Run repeatedly until reaching acceptable accuracy

Design Overview

Three components:

  • EC2 graph servers (GS)

    • Input graph is partitioned using an edge-cut algorithm that takes care of load balancing across partitions

    • Each partition is hosted by a graph server

      • Vertex data: represented as a 2D array

      • Edge data: represented as a single array

      • Edges are stored in the compressed sparse row (CSR) format (see the sketch at the end of this overview)

    • Ghost buffer

      • Stores the data that are scattered in from remote servers

    • Communicate

      • With each other: to execute graph computations by sending/receiving data along cross-partition edges

        • Only during the scatter phase, in both passes:

          • Forward pass: activation values are propagated along cross-partition edges

          • Backward pass: gradients are propagated along the same edges in the reverse direction

      • With lambda threads: execute tensor computations

  • Lambda threads for tensor computation

    • Tensor operations such as AV and AE interleave with graph operations

    • Employs a high-performance linear algebra kernel for tensor computation

    • Used in both forward and backward pass

    • Communicates frequently with PS

      • Forward: retrieve weights from PSes to compute layer output

      • Backward: compute updated weights

  • EC2 parameter servers

Graph computation is done in a conventional way, breaking a vertex program into stages of

  • vertex parallel (gather)

  • edge parallel (scatter)
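
A toy sketch of the graph-server data layout described above (CSR for edges, a 2D array for vertex data, a ghost buffer for remote activations); the concrete numbers are hypothetical:

```python
import numpy as np

# Hypothetical 4-vertex partition with out-edges 0->1, 0->2, 1->2, 3->0.
vertex_data = np.zeros((4, 16), dtype=np.float32)  # vertex data: 2D array (|V| x feature dim)
edge_data = np.ones(4, dtype=np.float32)           # edge data: a single flat array (one entry per edge)

# CSR layout: row_ptr[v] .. row_ptr[v+1] indexes col_idx for vertex v's out-edges.
row_ptr = np.array([0, 2, 3, 3, 4])
col_idx = np.array([1, 2, 2, 0])

# Ghost buffer: activations scattered in from remote partitions, keyed by remote vertex id.
ghost_buffer = {}

def out_neighbors(v):
    """Destinations of v's out-edges, read directly from the CSR arrays."""
    return col_idx[row_ptr[v]:row_ptr[v + 1]]

print(out_neighbors(0))  # -> [1 2]
```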

Tasks and pipelining

Fine-grained Tasks

  • Challenge: decompose the process into a set of fine-grained tasks that can

    • Overlap with each other

    • Be processed by Lambda's weak compute

  • Decomposition is done based on the data type and computation type

    • Graph operations performed on GSes: computations that involve adjacency matrix of the input graph (i.e. A)

    • Lambda operations: computations that involve only tensor data

Specific tasks over each training epoch

Forward pass

  • computes the output using current weights

  • graph-parallel tasks on GSes

    • Gather (GA)

    • Scatter (SC)

  • tensor-parallel tasks on Lambdas: multiply matrices involving only features and weights and apply activation functions

    • Apply Vertex (AV)

      • Lambda threads retrieve vertex data (H_(L)) from GSes and weight data (W_(L)) from PSes, compute their products, apply ReLU, and send the result back to GSes as the input for scatter

      • AV returns --> SC sends data, across cross-partition edges, to the machines that host their destination vertices

    • Apply Edge (AE)

      • Each Lambda thread retrieves

        • Vertex data from the source and destination vertices of the edge (i.e. activation vectors)

        • Edge data (such as edge weights) from GSes

      • Computes per-edge update by performing model-specific tensor operations

      • Updates are streamed back to GSes and become the inputs of the next layer's GA task

Backward pass

  • uses a loss function to compute weight updates

  • Involves: GSes, Lambdas, and PSes that coordinate to run a graph-augmented SGD algorithm

  • Each task in the forward pass has a corresponding backward-pass task

    • Either propagates information in the reverse direction of edges on the graph

    • Or computes the gradients of its trainable weights with respect to a given loss function

  • Weight updates (WU): aggregate gradients across PSes

  • Del(GA), Del(SC): propagate information in the reverse direction

  • AE & AV: apply weights to compute the output of the edge and vertex NN

  • Conversely, Del(AE) and Del(AV) compute weight updates for the NNs, which are inputs to WU

    • Tensor-only computation and are executed by Lambdas

Pipelining

  • GSes kick off training by running parallel graph tasks

  • Dorylus divides vertices in each partition into intervals (i.e. minibatches)

    • For each interval, the amount of tensor computations depends on both number of vertices (i.e. AV) and number of edges (i.e. AE) in the interval

    • Amount of graph computation (on a GS) depends primarily on the number of edges (i.e. GA and SC)

  • Balance work across intervals: the division uses a simple algorithm to ensure that different intervals have the same number of vertices and that vertices in each interval have similar numbers of inter-interval edges

    • Edges incur cross-minibatch dependencies (which the async pipeline needs to handle)

  • Each interval is processed by a task

    • When the pipeline is saturated, different tasks will be executed on distinct intervals of vertices

  • Each GS maintains a task queue and enqueues a task once it is ready to execute (i.e., its input is available); see the task-queue sketch after this section

    • Thread pool: # of threads = # of vCPUs

  • Output of GS task is fed to a Lambda for tensor computation

  • Figure 4

    • Dorylus enqueues a set of GA tasks, each processing a vertex interval

      • Since the # of threads on each GS << # of tasks, some tasks finish earlier than others and their results are pushed immediately to Lambda threads for AV

    • Once they are done, outputs are sent back to GS for Scatter

    • Backward pass

      • Del(AE) and Del(AV) compute gradients and send them to PSes for weight updates

Overall, through pipelining, Dorylus overlaps the graph-parallel and tensor-parallel computations so as to hide Lambda's communication latency

  • Contribution: enable pipelining in GNN training

    • Fine-grained tasks

    • Insight of computation separation
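
A minimal sketch (not Dorylus code) of the GS-side task queue and thread pool mentioned above: one GA task per vertex interval, with as many worker threads as vCPUs; the interval count is hypothetical:

```python
import os
import queue
import threading

task_queue = queue.Queue()

def ga_task(interval_id):
    # placeholder for Gather over one vertex interval; the real system would
    # hand the result to a Lambda for AV once this finishes
    return f"GA done for interval {interval_id}"

def worker():
    while True:
        task = task_queue.get()
        if task is None:              # sentinel: shut this worker down
            task_queue.task_done()
            return
        print(task())
        task_queue.task_done()

num_threads = os.cpu_count() or 1     # "# of threads = # of vCPUs"
threads = [threading.Thread(target=worker) for _ in range(num_threads)]
for t in threads:
    t.start()

for interval_id in range(8):          # enqueue one GA task per vertex interval
    task_queue.put(lambda i=interval_id: ga_task(i))
for _ in threads:
    task_queue.put(None)
for t in threads:
    t.join()
```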

Bounded Asynchrony

  • Goal: workers do not need to wait for updates to proceed in most cases

    • Lambdas run in an extremely dynamic environment and stragglers almost always exist

  • Challenge: asynchrony slows down convergence (fast-progressing minibatches may use out-of-date weights, prolonging the training time)

  • Existing solutions: bounded staleness (lightweight sync)

    • But a challenge unique to Dorylus is that there are two sync points

      • weight sync at each WU task

      • sync of (vertex) activations data from neighbors at each GA

Bounded Asynchrony at Weight Updates

  • Full asynchrony:

    • different vertex intervals are at their own training pace; some intervals may use a particular version v0 of weights during a forward pass to compute gradients while applying these gradients on another version v1 of weights on their way back in the backward pass

    • Statistical inefficiency

  • Method: weight stashing (proposed in PipeDream)

    • allows any interval to use the latest version of weights available in a forward pass and stashes (i.e., caches) this version for use during the corresponding backward pass.

  • Challenge:

    • Occurs at PSes (which host the weight matrices and perform updates)

    • Need to balance loads and bandwidth usage

    • Solution:

      • Multiple PSes

      • Directs a Lambda to a PS that has the lightest load

        • Each Lambda can be at a different stage of an epoch (e.g., forward vs. backward pass, different layers)

        • Each PS hosts a replica of the weight matrices of all layers --> any Lambda can use any PS at any stage

          • GNNs have few layers, so replicating the weights is feasible (not much memory overhead)

        • But if we allow any Lambda to use any PS, each PS has to maintain not only the latest weight matrices but also stashed versions for all intervals, which is practically impossible

          • Solution:

            • each PS holds a replica of all the latest weights (PSes periodically broadcast their latest weights), but keeps weight stashes only for a subset of vertex intervals

            • for each interval in a given epoch, the interval's weight stashes are only maintained on the first PS it interacts with in the epoch
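
A minimal sketch of weight stashing on a PS, in the spirit of the PipeDream technique cited above (not the actual Dorylus implementation; the names and the update rule are illustrative):

```python
import copy

class ParameterServer:
    def __init__(self, init_weights):
        self.latest = init_weights       # latest weight matrices, all layers
        self.stash = {}                  # interval_id -> stashed weight version

    def read_for_forward(self, interval_id):
        # an interval uses the latest weights in its forward pass, and that
        # exact version is stashed for its corresponding backward pass
        self.stash[interval_id] = copy.deepcopy(self.latest)
        return self.stash[interval_id]

    def read_for_backward(self, interval_id):
        # the backward pass computes gradients against the stashed version
        return self.stash.pop(interval_id)

    def apply_update(self, grads, lr=0.01):
        # WU: gradients are applied (asynchronously) to the latest weights
        for name, g in grads.items():
            self.latest[name] = self.latest[name] - lr * g
```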

Bounded Asynchrony at Gather

  • Async gather allows vertex intervals to progress independently using stale vertex values (i.e., activation vectors) from their neighbors, without waiting for their updates

  • Number of layers in a GNN is determined statically and n-layer GNN aims to propagate the impact of a vertex's n-hop neighborhood to the vertex.

  • Question: can asynchrony change the semantics of the GNN?

1) Can vertices eventually receive the effect of their n-hop neighborhood? Yes

2) Is it possible for any vertex to receive the effect of its m-hop neighbor where m > n after many epochs? No

  • Use bounded staleness at Gather (a fast-moving vertex interval is allowed to be at most S epochs away from the slowest-moving interval)
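
A minimal sketch of the bounded-staleness check at Gather: a vertex interval may advance to its next epoch only if that keeps it within S epochs of the slowest interval (S and the epoch numbers below are hypothetical):

```python
S = 1                                   # staleness bound
interval_epoch = {0: 3, 1: 2, 2: 2}     # current epoch of each vertex interval

def may_advance(interval_id):
    slowest = min(interval_epoch.values())
    return (interval_epoch[interval_id] + 1) - slowest <= S

print(may_advance(0))  # False: interval 0 would get 2 epochs ahead of the slowest
print(may_advance(1))  # True
```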

Lambda Management

  • Each GS runs a Lambda controller, which launches Lambdas, batches data to be sent to each Lambda, monitors each Lambda's health, and routes its result back to the GS

  • Lambdas for a task t are launched by the controller when t's previous task starts executing

    • E.g. n Lambda threads are launched (prepare for AV) when n GA tasks start to run

  • Setting

    • Each Lambda runs with the OpenBLAS library

    • Communicates with GSes and PSes using ZeroMQ

    • Deployed inside virtual private cloud (VPC)

    • AWS reuses a container that already has the code deployed instead of cold-starting a new container

    • The controller times each Lambda execution and relaunches it after a timeout

Lambda Optimizations

  • Challenge: limited network bandwidth

    • Per-Lambda bandwidth goes down as the number of Lambdas increases (suspected cause: Lambdas created by the same user get scheduled on the same machine and share a network link)

  • Three optimizations

    • Task fusion

      • Since AV of the last layer in a forward pass is connected directly to del(AV) of the last layer in the next backward pass, merge them into a single Lambda-based task

        • Reduce invocations of thousands of Lambdas for each epoch

        • Save a round-trip communication between Lambdas and GSes

    • Tensor rematerialization

      • Normally: cache intermediate results during the forward pass to be used in the backward pass

      • But: because network communication is a bottleneck, it is more profitable to rematerialize the intermediate tensors by launching more Lambdas rather than retrieving them from GSes

    • Lambda-internal streaming

      • For a data chunk: retrieve the first half of the data and start computing on it while simultaneously retrieving the second half

      • Can: reduce Lambda response time
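
A minimal sketch of Lambda-internal streaming: start computing on the first half of a chunk while the second half is still being fetched (`fetch_half` and `compute` are hypothetical stand-ins):

```python
import threading

def fetch_half(which):
    return [float(which)] * 1000        # stand-in for a network read from the GS

def compute(part):
    return sum(part)                    # stand-in for the tensor work

result = {}
fetcher = threading.Thread(target=lambda: result.update(second=fetch_half(2)))
fetcher.start()                          # fetch the second half in the background
first = compute(fetch_half(1))           # ...while computing on the first half
fetcher.join()
second = compute(result["second"])
print(first + second)
```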

Autotuning Numbers of Lambdas

  • Performance: effectiveness of Lambdas depends on whether the pipeline can be saturated

    • Too few Lambdas would not generate enough task instances for the graph computations to saturate CPU cores

  • Cost side: we want to avoid over-saturating the pipeline which can generate too many CPU tasks for the GS to handle

  • The optimal number of Lambdas is related to

    • The pace of the graph computation

      • Which depends on the graph structure (i.e., density) and the partitioning, both of which are hard to predict before execution

  • Start with min(#intervals, 100) Lambdas

    • number of intervals = number of vertex intervals on each GS

    • Auto-tuner auto-adjusts the number by periodically checking the size of the CPU's task queue

      • If the size of the queue constantly grows: CPU cores have too many tasks to process, scale down

      • If queue quickly shrinks, scale up the number of Lambdas

  • Goal: stabilize the size of the queue s.t. the number of Lambdas matches the pace of graph tasks
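
A minimal sketch of the auto-tuner described above: start from min(#intervals, 100) Lambdas, then periodically nudge the count based on whether the GS task queue is growing or shrinking (the step size and example numbers are hypothetical):

```python
def initial_num_lambdas(num_intervals):
    return min(num_intervals, 100)

def adjust_num_lambdas(num_lambdas, queue_len_before, queue_len_now, step=4):
    if queue_len_now > queue_len_before:   # queue keeps growing: CPUs overloaded, scale down
        return max(1, num_lambdas - step)
    if queue_len_now < queue_len_before:   # queue drains quickly: CPUs starved, scale up
        return num_lambdas + step
    return num_lambdas                     # queue size stable: Lambdas match the graph tasks' pace

n = initial_num_lambdas(num_intervals=64)                          # -> 64
n = adjust_num_lambdas(n, queue_len_before=10, queue_len_now=25)   # -> 60 (scale down)
```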

Evaluation

Instance Selection

  • c5: compute-optimized instances

  • c5n: compute and network optimized instances

  • Lambda: a container with 0.11 vCPUs and 192MB memory

    • $0.2 per 1M requests

    • Compute cost: $0.01125/h (billed per 100 ms)

    • Able to handle short bursts of massive parallelism much better than CPU instances

Asynchrony

  • Asynchrony provides overall performance benefits in general, but too large a staleness value leads to slow convergence and poor overall performance, even though the per-epoch time decreases

  • Choose s = 0 as the default for the Lambda variant in the following experiments: a vertex may use a stale value from a neighbor as long as the neighbor is in the same epoch (e.g., it can be at the previous layer)

Effects of Lambdas

  • Compare Dorylus with variants that use CPU-only servers or GPU-only servers for tensor computation (both without Lambdas)

    • Both variants still use computation separation for scalability

  • Value metric: the reciprocal of the total time (i.e., the performance, or rate of completion) divided by the cost (see the formula after this list)

    • Dorylus with Lambdas provides a 2.75x higher value than CPU-only

  • But for small, dense graphs, the GPU-only variant is better

    • This suggests that GPUs may be better suited to processing small, dense graphs rather than large, sparse graphs
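
The value metric above in symbols, with total training time T and dollar cost C:

```latex
\text{value} = \frac{1/T}{C} = \frac{1}{T \cdot C}
```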

Scaling out

  • Dorylus can gain even more value by scaling out to more servers, due to the burst parallelism provided by Lambdas and deep pipelining.

  • Scales well in terms of both performance and value

  • Dorylus can roughly provide the same value as the CPU-only variant with only half of the number of servers.

Other observations

  1. The sparser the graph, the more useful Dorylus is

    1. E.g. for Amazon and Friendster, Dorylus even outperforms the GPU-only version for two reasons

      1. The fraction of time on scatter is significantly larger in those two examples

        1. Why? The time depends on a combination of the number of ghost vertices and inter-partition edges. They have many inter-partition edges, but very few ghost vertices

      2. Scatter takes much longer time in GPU clusters

        1. Moving ghost data between GPU memories on different nodes is much slower than transferring data between CPU memories

  2. Lambda threads are more effective in boosting performance for GAT than GCN

    1. GAT includes an additional AE task, which performs per-edge tensor computation and thus benefits significantly from a high degree of parallelism

  3. Dorylus achieves comparable performance to a CPU-only variant that uses twice as many servers

Comparison with existing systems

Breakdown of Performance and Costs


Figure captions:

  • Forward propagation rule for the L-th layer

  • Mapping between SAGA-NN programming model and the rule R1

  • How to compute the gradients in the first layer for a 2-layer GCN