Distributed Snapshots: Determining Global States of Distributed Systems

Setup: the distributed snapshot algorithm; used by Apache Flink, Apache Spark (Structured Streaming), and Ray


Last updated 2 years ago


Reference review:

  • Some examples worth revisiting:

Main idea

  • Problem: detecting global states

    • Which in turn helps solve the "stable property detection" problem, e.g. "computation has terminated", "system is deadlocked", "all tokens in a token ring have disappeared"

    • Can also be used for checkpointing

  • Propose an algorithm by which a process in a distributed system determines a global state of the system during a computation

Problem setup

  • A process can record its own state and the messages it sends and receives, nothing else

    • No shared clocks or memory

  • To determine a global system state, a process p must enlist the cooperation of other processes that must record their own local states and send the recorded local states to p

  • Problem: devise algorithms by which processes record their own states and the states of communication channels so that the set of recorded process and channel states forms a global system state

    • MUST run concurrently with, but not alter the underlying computation

      • Don't stop sending the messages, don't stop the application

  • Example: taking photographs of migrating birds

  • Define a class of problem

    • y(S) = true/false for a global state S of D (the distributed system)

      • y is a stable property of D if y(S) implies y(S') for all global states S' of D reachable from global state S of D

      • i.e. if y is a stable property and y is true at a point in a computation of D, then y is true at all later points in that computation

  • The problem of initiating the next phase of computation is not considered here
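To make the stability definition concrete, here is a small sketch (a toy model I made up; `is_stable`, `reachable`, and the token-count encoding are not from the paper) that checks a predicate over all reachable states:

```python
def is_stable(y, initial_state, reachable):
    """y is a stable property if, once y(S) is true, y stays true
    in every state S' reachable from S."""
    # Enumerate every state reachable from the initial state.
    seen, frontier = set(), [initial_state]
    while frontier:
        s = frontier.pop()
        if s not in seen:
            seen.add(s)
            frontier.extend(reachable(s))
    # Wherever y holds, it must also hold in all downstream states.
    for s in seen:
        if y(s):
            stack, visited = [s], set()
            while stack:
                t = stack.pop()
                if t in visited:
                    continue
                visited.add(t)
                if not y(t):
                    return False
                stack.extend(reachable(t))
    return True

# Toy system: a state is the number of tokens left; tokens only vanish.
reachable = lambda n: [n - 1] if n > 0 else []
print(is_stable(lambda n: n == 0, 3, reachable))  # True: "all tokens gone" is stable
print(is_stable(lambda n: n == 1, 3, reachable))  # False: one token can still vanish
```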

Model of a distributed system

  • Labeled, directed graphs

    • Vertices: processes

      • Defined by a set of states, an initial state, and a set of events

        • An event e is defined by:

          • 1) The process p in which the event occurs

          • 2) The state s of p immediately before the event

          • 3) The state s' of p immediately after the event

          • 4) The channel c (if any) whose state is altered by the event

          • 5) The message M, if any, sent along c (if c is a channel directed away from p) or received along c (if c is directed towards p)

        • An event e can occur in global state S if and only if:

          • 1) The state of process p in global state S is s

          • 2) If c is a channel directed towards p, then the state of c in global state S is a sequence of messages with M at its head

    • Edges: channels

      • State = sequence of messages sent along the channel, excluding the messages received along the channel

  • Assumption

    • Channels: infinite buffers, error-free, deliver messages in the order sent

  • A global state is a set of component process and channel states

    • Initial global state: state of each process is its initial state and the state of each channel is the empty sequence

  • Example 1

    • State transition illustration (four states: in-p, in-c, in-q, in-c')
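The labeled-graph model above can be sketched in a few lines of code (class and field names are my own; the paper's single-token system with processes p, q and channel c is reused):

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Channel:
    src: str
    dst: str
    state: deque = field(default_factory=deque)  # messages sent but not yet received

@dataclass
class Process:
    name: str
    state: str  # e.g. "has-token" / "no-token"

def send(proc, chan, msg):
    assert chan.src == proc.name
    chan.state.append(msg)        # message enters the channel

def receive(proc, chan):
    assert chan.dst == proc.name
    return chan.state.popleft()   # FIFO, error-free delivery

# Single-token example: p holds the token and sends it to q over c.
p, q = Process("p", "has-token"), Process("q", "no-token")
c = Channel("p", "q")
send(p, c, "token"); p.state = "no-token"    # global state: in-c
print(list(c.state))                         # ['token']
msg = receive(q, c); q.state = "has-token"   # global state: in-q
print(list(c.state))                         # []
```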

Algorithm

  • Motivation, outline, termination

Motivation

  • How do global state recording algorithms work?

    • Each process: record its state

    • Two processes that a channel is incident on cooperate in recording the channel state

    • Require the recorded process and channel states form a "meaningful" global system state

  • Example of transition

    • One scenario: inconsistency arises because the state of p is recorded before p sent a message along c and the state of c is recorded after p sent the message

      • n = number of msgs sent along c before p's state is recorded

        • The number of msgs p sent along channel c before p's state is recorded

      • n' = number of msgs sent along c before c's state is recorded

        • The number of msgs p sent along channel c before c's state is recorded

      • the recorded global state may be inconsistent if n < n'

        • Two-token case: p's state is recorded in in-p, the other states are recorded in in-c

        • n = 0, n' = 1?

    • Another scenario: state of c is recorded in global state in-p, the system then transits to global state in-c, and the states of c', p, and q are recorded in global state in-c

      • The recorded global state shows no token in the system

      • Example suggests that the recorded global state may be inconsistent if the state of c is recorded before p sends a message along c and the state of p is recorded after p sends a message along c, that is, n > n'

  • Definition

    • Let m be the number of messages received along c before q's state is recorded.

    • Let m' be the number of messages received along c before c's state is recorded

      • Note: in every state, the # of msgs received along a channel cannot exceed the # of msgs sent along that channel, so m' ≤ n'

    • Intuition

      • The channel state to record = (list of msgs sent along the channel before the sender's state is recorded) minus (list of msgs received along the channel before the receiver's state is recorded)

        • An example helps to internalize this

  • Conditions (1)-(4) suggest a simple algorithm by which q can record the state of channel c

    • Process p sends a special message, called a marker, after the nth message it sends along c

    • The state of c is the sequence of messages received by q after q records its own state and before q receives the marker along c

    • To ensure (4), q must record its state, if it has not done so already, after receiving a marker along c and before q receives further messages along c
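The first inconsistency scenario (n < n') can be reproduced directly; in this toy sketch (state encoding is my own), p's state is recorded before its send while c's state is recorded after it, so the snapshot double-counts the token:

```python
from collections import deque

# Single-token system: p starts with the token, channel c: p -> q.
p_state, q_state, c_state = "has-token", "no-token", deque()

# Inconsistent recording order (n = 0 < n' = 1):
snap_p = p_state             # record p BEFORE p sends along c
p_state = "no-token"         # p sends the token...
c_state.append("token")
snap_c = list(c_state)       # ...but record c AFTER the send
snap_q = q_state

tokens = (snap_p == "has-token") + ("token" in snap_c) + (snap_q == "has-token")
print(tokens)  # 2 -- the recorded global state shows a duplicated token
```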

Algorithm

  • Initiation

    • can be done by one or more processes, each of which records its state spontaneously, without receiving markers from other processes

  • Marker-sending rule for a process p: for each channel c incident on, and directed away from, p

    • p sends one marker along c after p records its state and before p sends further messages along c

    • Or

      • p records its state and prepares a special marker message (distinct from application messages)

      • p sends the marker message to all other processes (over its N - 1 outbound channels); in this example, only to q

    • Marker-receiving rule for a process q: on receiving a marker along a channel c

      • If q has not yet recorded its state

        • q records its state, records the state of c as the empty sequence, and then follows the marker-sending rule

      • Otherwise

        • q records the state of c as the sequence of messages received along c after q recorded its state and before q received the marker along c

  • Termination of the algorithm

    • Need to ensure that

      • (L1) no marker remains forever in an incident input channel

      • (L2) it records its states within finite time of initiation of the algorithm

    • Rules

      • Every process records its state within finite time (spontaneously or on receiving its first marker)

      • The algorithm terminates once every process has received a marker on each of its N - 1 incoming channels (and recorded its own state and the states of its incoming channels)

      • Later, a central server can gather the partial state to build a global snapshot
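Putting the marker-sending and marker-receiving rules together, here is a compact single-threaded simulation (a sketch under the paper's assumptions — FIFO, error-free channels; class and field names are my own, not the paper's):

```python
from collections import deque

MARKER = object()  # special marker message, distinct from app messages

class Node:
    def __init__(self, name, state):
        self.name, self.state = name, state
        self.inbound = {}        # src name -> deque acting as the channel
        self.snap_state = None   # recorded process state
        self.snap_chans = {}     # src name -> recorded channel messages
        self.recording = {}      # src name -> still recording that channel?

    def record_state(self, net):
        """Record own state, then send a marker on every outbound channel
        before any further application messages (marker-sending rule)."""
        self.snap_state = self.state
        self.recording = {src: True for src in self.inbound}
        for dst in net:
            if dst is not self:
                dst.inbound[self.name].append(MARKER)

    def deliver(self, src, msg, net):
        """Marker-receiving rule for a message arriving from src."""
        if msg is MARKER:
            if self.snap_state is None:            # first marker seen
                self.record_state(net)
                self.snap_chans[src] = []          # channel recorded as empty
            else:                                  # already recording
                self.snap_chans[src] = self.snap_chans.get(src, [])
            self.recording[src] = False
        elif self.snap_state is not None and self.recording.get(src):
            # App message in flight before the marker: part of channel state.
            self.snap_chans.setdefault(src, []).append(msg)

# Two processes p, q; p holds the token and initiates the snapshot.
p, q = Node("p", "has-token"), Node("q", "no-token")
p.inbound["q"], q.inbound["p"] = deque(), deque()
net = [p, q]

p.record_state(net)             # p initiates: records itself, sends marker
p.state = "no-token"            # then p sends the token (after the marker)
q.inbound["p"].append("token")

while q.inbound["p"]:           # q drains: marker first, then the token
    q.deliver("p", q.inbound["p"].popleft(), net)
while p.inbound["q"]:           # p receives q's marker back
    p.deliver("q", p.inbound["q"].popleft(), net)

print(p.snap_state, q.snap_state)  # has-token no-token
print(q.snap_chans["p"])           # [] -- token was sent after the marker
```

The snapshot shows exactly one token (at p), a consistent global state, even though by the end of the run the token is in flight to q.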

Proof and properties: see the paper

  • Causal consistency

    • Related to the Lamport clock partial ordering

    • An event is pre-snapshot if it occurs before the local snapshot on its process

    • Post-snapshot if afterwards

    • If event A happens causally before event B, and B is pre-snapshot, then A is too
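The causal-consistency condition can be stated as a check on the cut induced by the snapshot (the events and happens-before edges here are a made-up example):

```python
def cut_is_consistent(pre_snapshot, happens_before):
    """If A -> B (A happens causally before B) and B is pre-snapshot,
    then A must be pre-snapshot too; otherwise the cut is inconsistent."""
    return all(a in pre_snapshot
               for (a, b) in happens_before if b in pre_snapshot)

# Made-up events: p sends a message, q receives it (send -> recv).
hb = {("send_p", "recv_q")}
print(cut_is_consistent({"send_p", "recv_q"}, hb))  # True
print(cut_is_consistent({"send_p"}, hb))            # True: cause without effect is fine
print(cut_is_consistent({"recv_q"}, hb))            # False: effect without its cause
```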

Some questions

  • Does the algorithm also assume no failures on the channels, i.e. all messages arrive, without duplicates, and in order?

    • Could such strong assumptions reduce the deployability of the algorithm?

  • What if a marker is lost? Retransmission? Incorrect ordering?

  • What if the topology is dynamically changing?

  • How long does it take to form a global state? (with all the mechanisms of snapshotting, collecting, and putting together states from different components)

  • A centralized controller seems easier to implement and reason about

Notes (inline definitions collected from the paper):

  • An event e in a process p is an atomic action that may change the state of p itself and the state of at most one channel c incident on p

  • An event e is defined by the five components listed above: <p, s, s', M, c>

  • M and c take a special symbol, null, if the occurrence of e does not change the state of any channel

  • An event e can occur in global state S if and only if (1) the state of process p in S is s, and (2) if c is a channel directed towards p, the state of c in S is a sequence of messages with M at its head

  • next(S, e) = the global state immediately after the occurrence of event e in global state S

  • A computation of the system = a sequence of events in component processes; a sequence (e_i, 0 <= i < n) is a computation of the system iff each e_i can occur in the global state reached after e_0, ..., e_{i-1}

  • Thus, a consistent recorded global state requires:

    • (1) n = n'

    • (2) m = m' (by the same analysis as before)

    • (3) n' >= m'

    • (4) n >= m

  • Marker-receiving: start recording all incoming messages from channels C_ji for j != i

https://matt33.com/2019/10/27/paper-chandy-lamport/
https://www.cs.princeton.edu/courses/archive/fall16/cos418/docs/P8-chandy-lamport.pdf