Sundial: Fault-tolerant Clock Synchronization for Datacenters
https://www.usenix.org/conference/osdi20/presentation/li-yuliang
Need for synchronized clocks in datacenters
Simplify or improve existing applications
Distributed databases, consistent snapshots
Enable new applications
Network telemetry
One-way delay measurement for congestion-control
Distributed logging and debugging
Today, no synchronized clocks with a tight bound are available in datacenters
Need for time-uncertainty bound
Wait: a common op for ordering & consistency
The time-uncertainty bound decides how long to wait to guarantee correctness (sketch below)
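To make the "wait" operation concrete, here is a minimal sketch of TrueTime-style commit wait in Python. The now_with_bound() API and the 100 µs bound are placeholder assumptions, not Sundial's interface; the point is that the wait length is proportional to the uncertainty bound.

```python
import time

# Hypothetical clock API (assumption): returns an interval [earliest, latest]
# guaranteed to contain the true time; its width is the time-uncertainty bound.
def now_with_bound():
    t = time.time()
    eps = 0.0001  # placeholder 100 us bound; Sundial targets ~100 ns
    return t - eps, t + eps

def commit_wait(commit_timestamp):
    """Wait until the true time is certainly past commit_timestamp, so any
    later operation is guaranteed to see a strictly larger timestamp."""
    while True:
        earliest, _ = now_with_bound()
        if earliest > commit_timestamp:
            return
        time.sleep(0)  # the total wait is bounded by the uncertainty width

# Usage: pick a timestamp no earlier than "now", then wait out the uncertainty.
_, latest = now_with_bound()
commit_wait(latest)
```

Since every such wait costs roughly one bound's worth of latency, shrinking the bound from microseconds/milliseconds to ~100 ns directly speeds up these operations.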
Need for tighter time-uncertainty bound
Sundial: ~100 ns time-uncertainty bound even under failures; 2-3 orders of magnitude better than existing designs
State-of-the-art clock synchronization
Calculate the offset between two clocks by exchanging messages and measuring RTT (sketch after this block)
Path of messages
Variable and asymmetric delay (forward vs. reverse paths, queuing delay)
Best practice: sync between neighboring devices
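A sketch of the classic two-way exchange used to estimate the offset (NTP/PTP style; this is the generic technique, not Sundial's exact wire format). It makes the symmetric-delay assumption explicit, which is why variable and asymmetric queuing delay hurts, and why syncing only between neighboring devices helps.

```python
def estimate_offset(t1, t2, t3, t4):
    """Two-way timestamp exchange:
      t1: requester's send timestamp     (requester clock)
      t2: responder's receive timestamp  (responder clock)
      t3: responder's send timestamp     (responder clock)
      t4: requester's receive timestamp  (requester clock)
    Assumes forward and reverse one-way delays are equal; any asymmetry
    (e.g., queuing in one direction) shows up directly as offset error."""
    offset = ((t2 - t1) + (t3 - t4)) / 2   # responder clock minus requester clock
    rtt = (t4 - t1) - (t3 - t2)            # round trip excluding responder processing
    return offset, rtt

# Example with made-up nanosecond timestamps:
offset, rtt = estimate_offset(t1=1_000, t2=1_600, t3=1_650, t4=2_050)
print(offset, rtt)   # 100 ns offset, 1000 ns RTT under the symmetric-delay assumption
```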
Network-wide synchronization
Spanning tree: clock values distributed along tree edges
Periodic sync: clocks can drift apart over time, so periodic synchronization is needed
Calculation of time-uncertainty bound
How long it has been since the last synchronization (bound sketch after this block)
Root's direct children: large bound when affected by failure
Nodes in the sub-tree: large bound all the time to prepare for unnoticed failures
Need fast recovery from connectivity failures
How fast the clock can drift away
Clocks drift as oscillator frequencies vary with temperature, voltage, and so on
Max drift rate is set conservatively in production (200 ppm in Google TrueTime)
Reason: must guarantee correctness
What if we set it more aggressively? A large number of clock-related errors (application consistency etc.) during cooling failures!
Need very frequent synchronization
Both factors can be very large because of failures:
Frequency-related failures: cooling, voltage fluctuations
Connectivity failures: link/device failure that break the spanning tree
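A sketch of how the bound grows between synchronizations, combining the two factors above. The exact expression is an assumption (base bound plus linear drift), not copied from the paper, but it shows why both a broken spanning tree (longer time since sync) and a conservative drift rate inflate the bound.

```python
def uncertainty_bound(now_ns, last_sync_ns, base_bound_ns, max_drift_ppm):
    """Assumed form: inherit a base bound at the last sync, then grow
    linearly at the assumed worst-case drift rate until the next sync."""
    elapsed_ns = now_ns - last_sync_ns
    return base_bound_ns + elapsed_ns * max_drift_ppm * 1e-6

# With TrueTime's conservative 200 ppm drift rate, even a 1 ms gap since the
# last sync adds 200 ns of uncertainty; a multi-second gap caused by a broken
# spanning tree adds hundreds of microseconds.
print(uncertainty_bound(now_ns=1_000_000, last_sync_ns=0,
                        base_bound_ns=50, max_drift_ppm=200))  # 250.0 ns
```

This is why Sundial needs both very frequent synchronization (keep elapsed time small) and fast recovery from connectivity failures (keep it small even when the tree breaks).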
HW-SW codesign
HW: message sending & processing, failure detection
Frequent sync messages (every ~100 microseconds)
Fast failure detection with small timeout
Remote failure detection: synchronous messaging
SW: enable the backup plan (re-configure the HW)
Pre-compute the backup plan (by centralized controller)
1 backup parent per device: there are multiple candidate backup parents, and a device cannot distinguish which failure occurred, so the backup plan must be generic to different failures:
Any single link failure
Any single device failure
Root device failure
Any fault-domain (e.g., rack, pod, power) failure: multiple devices / links go down
Backup plan
1 backup parent per device
1 backup root: elects itself as the new root when the root fails
How to distinguish root failure from other failures?
Key: get independent observation from other nodes
2nd timeout: if the independent observation also indicates failure, the backup root elects itself as the new root
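A sketch of the backup root's two-timeout logic. The structure and timeout values are illustrative assumptions (the real logic is split across device HW and SW); the key idea from the notes above is that a second, longer timeout backed by an independent observation distinguishes "my parent/link failed" from "the root failed".

```python
# Illustrative timeouts, not Sundial's actual values.
FIRST_TIMEOUT_NS  = 500_000     # no message from parent -> switch to backup parent
SECOND_TIMEOUT_NS = 2_000_000   # no message via any path -> the root itself failed

class BackupRoot:
    def __init__(self):
        self.last_msg_from_parent_ns = 0
        self.last_msg_from_any_node_ns = 0   # the "independent observation"

    def on_sync_message(self, now_ns, from_parent):
        self.last_msg_from_any_node_ns = now_ns
        if from_parent:
            self.last_msg_from_parent_ns = now_ns

    def on_tick(self, now_ns):
        if now_ns - self.last_msg_from_parent_ns > FIRST_TIMEOUT_NS:
            self.switch_to_backup_parent()   # local failure: parent or link down
        if now_ns - self.last_msg_from_any_node_ns > SECOND_TIMEOUT_NS:
            self.become_root()               # nobody is receiving time: root failed

    def switch_to_backup_parent(self): ...
    def become_root(self): ...
```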
Backup plan that handles fault-domain failures
A single domain failure can break connectivity and take down the backup parent at the same time
Avoid this case when computing the backup plan (sketch below)
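A sketch of the controller-side backup-parent selection. The constraints are derived from the failure list above and are an assumed simplification, not the paper's exact algorithm; in particular, the real plan must also rule out loops when multiple devices fail over simultaneously.

```python
def choose_backup_parents(primary_parent, neighbors, fault_domain, descendants):
    """primary_parent: device -> its parent in the primary spanning tree
       neighbors:      device -> set of directly connected devices
       fault_domain:   device -> fault-domain id (rack/pod/power)
       descendants:    device -> devices in its primary subtree (loop avoidance)"""
    backup = {}
    for dev, parent in primary_parent.items():
        candidates = [
            n for n in neighbors[dev]
            if n != parent                              # survives the primary parent/link failing
            and n not in descendants[dev]               # switching to it does not form a loop
            and fault_domain[n] != fault_domain[parent] # survives a whole fault-domain failure
        ]
        if not candidates:
            raise ValueError(f"no valid backup parent for {dev}")
        backup[dev] = candidates[0]  # a real controller would also balance depth and load
    return backup
```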
Two salient features
Frequent synchronization
Fast recovery from connectivity failures
Evaluation
Testbed
Compare with state-of-the-art
Metrics
Scenarios: normal operation (no failure); injected failures: link, device, domain
Summary
Time-uncertainty bound is the key metric
Existing sub-microsecond solutions fall short because of failures
Sundial: HW-SW codesign
Device HW: frequent messages, synchronous messaging, fast failure detection
Device SW: fast local recovery based on the backup plan
Controller: pre-compute the backup plan generic to different failures
First system to achieve a ~100 ns time-uncertainty bound under failures, with improvements on real applications