Sundial: Fault-tolerant Clock Synchronization for Datacenters
https://www.usenix.org/conference/osdi20/presentation/li-yuliang
- Need for synchronized clocks in datacenters
  - Simplify or improve existing applications
    - Distributed databases, consistent snapshots
 
- Enable new applications
  - Network telemetry
  - One-way delay measurement for congestion control
  - Distributed logging and debugging
 
- Yet today, no synchronized clocks with a tight bound are available
 
- Need for a time-uncertainty bound
  - Wait: a common operation for ordering & consistency
  - The time-uncertainty bound decides how long to wait to guarantee correctness (see the sketch below)
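A minimal sketch of how a wait step can use the bound, assuming a hypothetical `now_with_bound()` API that returns an (earliest, latest) interval around true time; this is an illustration, not Sundial's interface:

```python
import time

def now_with_bound():
    """Hypothetical clock API: returns (earliest, latest) bracketing true time.
    The interval half-width is the time-uncertainty bound."""
    t = time.time()
    eps = 100e-9  # placeholder bound; Sundial targets ~100 ns
    return t - eps, t + eps

def wait_until_past(timestamp):
    """Block until true time is certainly past `timestamp` (commit-wait style).
    A tighter uncertainty bound directly shortens this wait."""
    while now_with_bound()[0] <= timestamp:
        pass  # spin; the wait duration is proportional to the uncertainty bound
```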
 
 
- Need for a tighter time-uncertainty bound
  - Sundial: ~100 ns time-uncertainty bound even under failures; 2-3 orders of magnitude better than existing designs
 
- State-of-the-art clock synchronization
  - Calculate the offset between 2 clocks by exchanging messages and measuring the RTT (see the sketch after this list)
  - Path of messages: variable and asymmetric delay (forward vs. reverse paths, queuing delay)
  - Best practice: sync between neighboring devices
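A sketch of the standard two-way exchange used to estimate the offset between two clocks (generic NTP/PTP-style timestamp math, not Sundial-specific):

```python
def estimate_offset(t1, t2, t3, t4):
    """t1: request sent (local clock)   t2: request received (remote clock)
    t3: reply sent (remote clock)       t4: reply received (local clock)

    The formula assumes forward and reverse delays are equal; any asymmetry
    (e.g., queuing in one direction) shows up directly as offset error, which
    is why syncing only between neighboring devices over short, fixed paths
    is the best practice.
    """
    rtt = (t4 - t1) - (t3 - t2)           # round-trip network delay
    offset = ((t2 - t1) + (t3 - t4)) / 2  # estimate of remote clock minus local clock
    return offset, rtt
```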
 
- Network-wide synchronization
  - Spanning tree: clock values are distributed along tree edges
  - Periodic sync: clocks can drift apart over time, so periodic synchronization is needed (toy sketch below)
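A toy model, for illustration only, of how clock values propagate along tree edges: each node periodically corrects itself against its parent's already-corrected clock, so the root's time flows down to the leaves. Real designs do this in hardware with careful timestamping.

```python
class Node:
    """Toy tree node; the names and structure here are assumptions for illustration."""
    def __init__(self, parent=None):
        self.parent = parent   # None for the root
        self.offset = 0.0      # correction applied on top of the local oscillator

    def corrected_time(self, raw_local_time):
        return raw_local_time + self.offset

    def sync_from_parent(self, parent_corrected_time, raw_local_time):
        # One tree edge: snap this node's corrected clock onto the parent's.
        self.offset = parent_corrected_time - raw_local_time
```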
 
- Calculation of time-uncertainty bound (the two factors below are combined in the sketch that follows)
  - Factor 1: how long it has been since the last synchronization
    - Root's direct children: large bound when affected by a failure
    - Nodes deeper in the sub-tree: large bound all the time, to prepare for unnoticed failures
    - Need fast recovery from connectivity failures
 
  - Factor 2: how fast the clock can drift away
    - Clocks drift as oscillator frequencies vary with temperature, voltage, and so on
    - Max drift rate is set conservatively in production (200 ppm in Google TrueTime)
      - Reason: must guarantee correctness
      - What if we set it more aggressively? A large number of clock-related errors (application consistency, etc.) during cooling failures!
    - Need very frequent synchronization
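A sketch combining the two factors, assuming the usual form bound = base error + max drift rate × time since last sync; the parameter values are illustrative, not Sundial's:

```python
def uncertainty_bound_ns(elapsed_since_sync_us, base_error_ns, max_drift_ppm):
    """Bound grows linearly with the time since the last successful sync."""
    drift_ns = max_drift_ppm * 1e-6 * (elapsed_since_sync_us * 1e3)  # accumulated drift
    return base_error_ns + drift_ns

# Example: with a conservative 200 ppm drift assumption, going 500 us without
# a sync already adds 200e-6 * 500 us = 100 ns of uncertainty -- hence the
# need for very frequent synchronization and fast recovery from failures.
print(uncertainty_bound_ns(500, base_error_ns=5, max_drift_ppm=200))  # ~105 ns
```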
 
 
- Both factors can become very large because of failures:
  - Frequency-related failures: cooling failures, voltage fluctuations
  - Connectivity failures: link/device failures that break the spanning tree
 
- HW-SW codesign
  - HW: message sending & processing, failure detection (see the timeout sketch after this list)
    - Frequent messages, one every ~100 microseconds
    - Fast failure detection with a small timeout
    - Remote failure detection: synchronous messaging
 
  - SW: enable the backup plan (re-configure the HW)
    - Pre-compute the backup plan (by a centralized controller)
      - 1 backup parent per device: there are multiple options for the backup parent, and a device cannot distinguish different failures --> the backup plan must be generic to different failures:
        - Any single link failure
        - Any single device failure
        - Root device failure
        - Any fault-domain (e.g., rack, pod, power) failure: multiple devices/links go down
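A software sketch of the HW timeout idea mentioned above (in Sundial this check lives in device hardware, which is what makes a very small timeout practical; the constants and function names are assumptions for illustration):

```python
SYNC_INTERVAL_US = 100                    # assumed message period (~100 us)
TIMEOUT_US = 3 * SYNC_INTERVAL_US         # assumed: a few missed messages => failure

def check_upstream(now_us, last_msg_us, switch_to_backup_plan):
    """Declare the upstream path failed if sync messages stop arriving.
    Because messages are forwarded synchronously down the tree, a failure
    higher up also starves the descendants, so they detect it quickly too."""
    if now_us - last_msg_us > TIMEOUT_US:
        switch_to_backup_plan()           # apply the precomputed backup configuration
```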
 
- Backup plan
  - 1 backup parent per device
  - 1 backup root: elects itself as the new root when the root fails (see the failover sketch below)
    - How to distinguish a root failure from other failures?
      - Key: get an independent observation from other nodes
      - 2nd timeout: elect itself as the new root
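A hedged sketch of the backup-root logic described above: after the first timeout the backup root fails over to its backup parent like any other node; if that independent path also stays silent until a second timeout, the only consistent explanation is that the root itself is down, so it promotes itself. Class and method names are illustrative, not the paper's implementation.

```python
class BackupRoot:
    def __init__(self, now_us, timeout_us):
        self.timeout_us = timeout_us
        self.last_msg_us = now_us
        self.on_backup_parent = False   # set after the 1st timeout: local failover
        self.is_root = False            # set after the 2nd timeout: self-elected root

    def on_sync_message(self, now_us):
        self.last_msg_us = now_us       # still hearing a clock from upstream

    def on_tick(self, now_us):
        if now_us - self.last_msg_us <= self.timeout_us:
            return
        if not self.on_backup_parent:
            self.on_backup_parent = True   # 1st timeout: switch to the backup parent
            self.last_msg_us = now_us      # restart the timer on the new path
        elif not self.is_root:
            self.is_root = True            # 2nd timeout: the independent path is
                                           # also silent, so the root must be down
```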
 
- Backup plan that handles fault-domain failures
  - A single fault-domain failure can simultaneously break connectivity and take down the backup parent
  - Avoid this case when computing the backup plan (see the controller sketch below)
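A controller-side sketch of that constraint (a simplified model, not the paper's actual algorithm): the backup parent chosen for a device should not share any fault domain with its primary parent, so no single domain failure removes both.

```python
def pick_backup_parent(candidates, fault_domains, primary_parent):
    """fault_domains maps each device to the set of domains (rack, pod, power
    feed, ...) it belongs to; candidates are otherwise-valid alternative parents."""
    for c in candidates:
        if fault_domains[c].isdisjoint(fault_domains[primary_parent]):
            return c                  # survives any single-domain failure
    return None                       # no safe choice; the controller must rework the tree
```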
 
 
 
 
- Two salient features
  - Frequent synchronization
  - Fast recovery from connectivity failures
 
- Evaluation
  - Testbed
  - Comparison with the state-of-the-art
  - Metrics
  - Scenarios: normal operation (no failure) and injected failures: link, device, domain
 
- Summary
  - Time-uncertainty bound is the key metric
  - Existing sub-microsecond solutions fall short because of failures
 
- Sundial: HW-SW codesign
  - Device HW: frequent messages, synchronous messaging, fast failure detection
  - Device SW: fast local recovery based on the backup plan
  - Controller: pre-computes a backup plan that is generic to different failures
  - First system to achieve a ~100 ns time-uncertainty bound, with improvements on real applications
 
 
 
 