Distributed Snapshots: Determining Global States of Distributed Systems

Setup: the distributed snapshot algorithm; used by Apache Flink, Apache Spark (Structured Streaming), and Ray


Reference review:

  • Some examples worth revisiting:

Main idea

  • Problem: detecting global states

    • Which in turn helps solve the "stable property detection" problem, e.g. "computation has terminated", "system is deadlocked", "all tokens in a token ring have disappeared"

    • Can also be used for checkpointing

  • Propose an algorithm by which a process in a distributed system determines a global state of the system during a computation

Problem setup

  • A process can record its own state and the messages it sends and receives, nothing else

    • No shared clocks or memory

  • To determine a global system state, a process p must enlist the cooperation of other processes that must record their own local states and send the recorded local states to p

  • Problem: algorithms by which processes record their own states and the states of communication channels so that the set of recorded process and channel states forms a global system state

    • MUST run concurrently with, but not alter the underlying computation

      • Don't stop sending the messages, don't stop the application

  • Example: taking photographs of migrating birds

  • Define a class of problem

    • y(S) = true/false for a global state S of D (distributed system)

      • y is a stable property of D if y(S) implies y(S') for all global states S' of D reachable from global state S of D

      • i.e. if y is a stable property and y is true at a point in a computation of D, then y is true at all later points in that computation

  • The problem of initiating the next phase of computation is not considered here
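As a toy illustration of the stability definition (our own hypothetical model, not from the paper), a brute-force check that a property, once true, stays true in every reachable global state:

```python
# Toy stability check: y is stable iff y(S) implies y(S') for all S'
# reachable from S. The "system" here is hypothetical: its global state
# is the number of tokens left, and tokens can only disappear.
def reachable(state, step):
    """All global states reachable from `state` via the `step` relation."""
    seen, frontier = {state}, [state]
    while frontier:
        s = frontier.pop()
        for nxt in step(s):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def is_stable(y, states, step):
    """y is stable iff, whenever y holds in S, it holds everywhere reachable from S."""
    return all(y(s2) for s in states if y(s)
               for s2 in reachable(s, step))

step = lambda n: [n - 1] if n > 0 else []          # tokens only vanish
print(is_stable(lambda n: n == 0, range(4), step))  # True: "terminated" is stable
print(is_stable(lambda n: n > 0,  range(4), step))  # False: tokens can disappear
```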

Model of a distributed system

  • Labeled, directed graphs

    • Vertices: processes

      • Defined by a set of states, an initial state, and a set of events

      • An event e in a process p is an atomic action that may change the state of p itself and the state of at most one channel c incident on p. An event e is defined by:

          • 1) The process p in which the event occurs

          • 2) The state s of p immediately before the event

          • 3) The state s' of p immediately after the event

          • 4) The channel c (if any) whose state is altered by the event

          • 5) The message M, if any, sent along c (if c is a channel directed away from p) or received along c (if c is directed towards p)

        • M and c are the special symbol null if the occurrence of e does not change the state of any channel

        • An event e can occur in global state S if and only if

          • 1) The state of process p in global state S is s

          • and 2) if c is a channel directed towards p, then the state of c in global state S is a sequence of messages with M at its head

    • Edges: channels

      • State = sequence of messages sent along the channel, excluding the messages received along the channel

  • Assumption

      • Channel: infinite buffers, error-free, delivers messages in the order sent

  • A global state is a set of component process and channel states

    • Initial global state: state of each process is its initial state and the state of each channel is the empty sequence

    • next(S, e) = the global state immediately after the occurrence of event e in global state S

      • A computation of the system = a sequence of events (e_i) in component processes; the sequence is a computation iff each e_i can occur in global state S_i, where S_0 is the initial global state and S_{i+1} = next(S_i, e_i)

  • Example 1

    • State transition illustration (four states: in-p, in-c, in-q, in-c')
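The model above (FIFO channels whose state is "sent minus received") can be sketched for Example 1's single-token system; the `Channel` class and state names are ours, not the paper's:

```python
from collections import deque

# Hypothetical sketch of Example 1: processes p and q, channel c (p -> q)
# and c' (q -> p). A channel's state is the sequence of messages sent
# along it but not yet received.
class Channel:
    def __init__(self):
        self._buf = deque()         # infinite FIFO buffer, error-free
    def send(self, msg):
        self._buf.append(msg)
    def recv(self):
        return self._buf.popleft()  # delivered in the order sent
    @property
    def state(self):
        return list(self._buf)      # sent minus received

c, c_prime = Channel(), Channel()
p_state, q_state = "has-token", "empty"

# Initial global state "in-p": token at p, both channels empty.
assert (p_state, c.state, q_state, c_prime.state) == ("has-token", [], "empty", [])

# Event at p: p sends the token along c  -> global state "in-c"
c.send("token"); p_state = "empty"
assert c.state == ["token"]

# Event at q: q receives the token       -> global state "in-q"
c.recv(); q_state = "has-token"
assert c.state == [] and q_state == "has-token"
```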

Algorithm

  • Motivation, outline, termination

Motivation

  • How do global state recording algorithms work?

    • Each process: record its state

    • Two processes that a channel is incident on cooperate in recording the channel state

    • Require the recorded process and channel states form a "meaningful" global system state

  • Example of transition

    • One scenario: inconsistency arises because the state of p is recorded before p sent a message along c and the state of c is recorded after p sent the message

      • n = number of msgs sent along c before p's state is recorded

      • n' = number of msgs sent along c before c's state is recorded

      • the recorded global state may be inconsistent if n < n'

        • The two-token case: p's state is recorded in global state in-p, the other states are recorded in in-c, so the recorded global state shows two tokens

        • n = 0, n' = 1?

    • Another scenario: state of c is recorded in global state in-p, the system then transits to global state in-c, and the states of c', p, and q are recorded in global state in-c

      • The recorded global state shows no token in the system

      • Example suggests that the recorded global state may be inconsistent if the state of c is recorded before p sends a message along c and the state of p is recorded after p sends a message along c, that is, n > n'

    • Thus, a consistent global state requires (1) n = n'

  • Definition

    • Let m be the number of messages received along c before q's state is recorded.

    • Let m' be the number of messages received along c before c's state is recorded

    • (2) m = m': by the same analysis as before

    • (3) n' ≥ m'

      • I.e. # of msgs received along a channel cannot exceed the # of msgs sent along that channel in every state

    • (4) n ≥ m

    • Understanding

      • The state to record for a channel = the list of msgs sent along the channel before the sender's state is recorded, minus the list of msgs received along the channel before the receiver's state is recorded

        • Need an example to understand this
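A tiny worked example of that channel-state rule (hypothetical message names, assuming FIFO delivery so received messages are a prefix of sent messages):

```python
# Messages m1..m3 were sent along channel c before the sender's state
# was recorded; only m1 was received before the receiver's state was recorded.
sent_before_sender_recorded = ["m1", "m2", "m3"]   # n = 3
recv_before_receiver_recorded = ["m1"]             # m = 1

# Recorded state of c = sent-but-not-yet-received at the snapshot cut.
channel_state = sent_before_sender_recorded[len(recv_before_receiver_recorded):]
print(channel_state)  # ['m2', 'm3']
```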

  • Conditions (1)-(4) suggest a simple algorithm by which q can record the state of channel c

    • Process p sends a special message, called a marker, after the nth message it sends along c

    • The state of c is the sequence of messages received by q after q records its own state and before q receives the marker along c

    • To ensure (4), q must record its state, if it has not done so already, after receiving a marker along c and before q receives further messages along c

Algorithm

  • Initiation

    • can be done by one or more processes, each of which records its state spontaneously, without receiving markers from other processes

  • Marker-sending rule for a process p: for each channel c incident on and directed away from p

    • p sends one marker along c after p records its state and before p sends further messages along c

    • Or

      • p records its state and prepares a special marker message (distinct from application messages)

      • p sends the marker message to all other processes (using N-1 outbound channels), in this case, only q

      • Start recording all incoming messages on channels C_ji for j ≠ i

    • Marker-receiving rule for a process q: on receiving a marker along a channel c

      • If this is the first marker q receives: q records its own state, records the state of c as the empty sequence, and follows the marker-sending rule

      • Otherwise: q records the state of c as all messages received along c since q began recording (i.e., after q recorded its state and before the marker arrived on c)

  • Termination of the algorithm

    • Need to ensure that

      • (L1) no marker remains forever in an incident input channel

      • (L2) it records its states within finite time of initiation of the algorithm

    • Rules

      • All processes have received a marker (and recorded their own state)

      • All processes have received a marker on the N - 1 incoming channels (and recorded their states)

      • Later, a central server can gather the partial state to build a global snapshot
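The marker rules and termination condition above can be combined into a minimal single-threaded simulation. This is a sketch under the paper's assumptions (reliable, FIFO, unbounded channels); the `Process` class, method names, and the token-counting driver are ours:

```python
from collections import deque

class Process:
    """One node in a simulated Chandy-Lamport snapshot (sketch, not the
    paper's exact pseudocode). Channels are FIFO deques keyed by (src, dst)."""
    def __init__(self, pid, peers, channels):
        self.pid, self.peers, self.channels = pid, peers, channels
        self.state = 0                 # application state: local token count
        self.recorded_state = None     # local snapshot, once taken
        self.channel_state = {}        # peer -> msgs recorded for that channel
        self.recording = set()         # peers whose channel we still record

    def snapshot(self):
        """Record own state, then send a marker on every outgoing channel."""
        self.recorded_state = self.state
        self.recording = set(self.peers)
        self.channel_state = {peer: [] for peer in self.peers}
        for peer in self.peers:        # marker-sending rule
            self.channels[(self.pid, peer)].append("MARKER")

    def on_message(self, sender, msg):
        if msg == "MARKER":            # marker-receiving rule
            if self.recorded_state is None:
                self.snapshot()        # first marker: record state now,
                self.channel_state[sender] = []  # channel recorded as empty
            self.recording.discard(sender)
        else:
            self.state += msg          # normal application message
            if sender in self.recording:          # in flight across the cut
                self.channel_state[sender].append(msg)

    def done(self):                    # markers received on all N-1 channels
        return self.recorded_state is not None and not self.recording

# Driver: 11 tokens total -- 1 at p, 10 in flight on channel (p, q).
channels = {("p", "q"): deque(), ("q", "p"): deque()}
p = Process("p", ["q"], channels)
q = Process("q", ["p"], channels)
p.state = 1
channels[("p", "q")].append(10)        # app message already in flight
p.snapshot()                           # marker queued *behind* it (FIFO)
while channels[("p", "q")]:
    q.on_message("p", channels[("p", "q")].popleft())
while channels[("q", "p")]:
    p.on_message("q", channels[("q", "p")].popleft())

snap = (p.recorded_state + q.recorded_state
        + sum(p.channel_state["q"]) + sum(q.channel_state["p"]))
print(snap)  # 11 -- the snapshot conserves the token count
```

The driver plays the role of the central server from the last bullet: once every process reports `done()`, summing the recorded process and channel states yields a consistent global snapshot.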

Proof and properties: see the paper

  • Causal consistency

    • Related to the Lamport clock partial ordering

    • An event is presnapshot if it occurs before the local snapshot on a process

    • Postsnapshot if afterwards

    • If event A happens causally before event B, and B is presnapshot, then A is too
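That cut property can be stated as a small hypothetical check (event names and the helper are ours, not the paper's):

```python
# A cut is causally consistent if, for every causal pair A -> B,
# B being presnapshot implies A is presnapshot too.
def consistent_cut(happens_before, presnapshot):
    return all(presnapshot[a] for (a, b) in happens_before if presnapshot[b])

hb = [("send_m", "recv_m")]  # a send causally precedes its receive
print(consistent_cut(hb, {"send_m": True,  "recv_m": True}))   # True
print(consistent_cut(hb, {"send_m": False, "recv_m": True}))   # False: the
# receive was recorded inside the snapshot but its send was not
```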

Some questions

  • Does it also assume no failures on the channels? All messages arrive, with no duplicates, in order, and so on?

    • Could such strong assumptions reduce the deployability of the algorithm?

  • What if the markers are lost? Retransmission? Incorrect ordering?

  • What if the topology is dynamically changing?

  • How long does it take to form a global state? (with all the mechanisms of snapshotting, collecting, and putting together states from different components)

  • A centralized controller seems easier to implement and reason about


References:

  • https://matt33.com/2019/10/27/paper-chandy-lamport/

  • https://www.cs.princeton.edu/courses/archive/fall16/cos418/docs/P8-chandy-lamport.pdf