Sundial: Fault-tolerant Clock Synchronization for Datacenters
https://www.usenix.org/conference/osdi20/presentation/li-yuliang
Need for synchronized clocks in datacenters
Simplify or improve existing applications
Distributed databases, consistent snapshots
Enable new applications
Network telemetry
One-way delay measurement for congestion-control
Distributed logging and debugging
Today, no synchronized clocks with a tight bound are available in datacenters
Need for time-uncertainty bound
Wait: a common op for ordering & consistency
The time-uncertainty bound decides how long to wait to guarantee correctness (sketch below)
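To make the "wait" operation concrete, here is a minimal sketch of TrueTime-style commit wait in Python. The now_with_bound() API and the 100 µs bound are placeholder assumptions, not Sundial's interface; the point is that the wait length is proportional to the uncertainty bound.

```python
import time

# Hypothetical clock API (assumption): returns an interval [earliest, latest]
# guaranteed to contain the true time; its width is the time-uncertainty bound.
def now_with_bound():
    t = time.time()
    eps = 0.0001  # placeholder 100 us bound; Sundial targets ~100 ns
    return t - eps, t + eps

def commit_wait(commit_timestamp):
    """Wait until the true time is certainly past commit_timestamp, so any
    later operation is guaranteed to see a strictly larger timestamp."""
    while True:
        earliest, _ = now_with_bound()
        if earliest > commit_timestamp:
            return
        time.sleep(0)  # the total wait is bounded by the uncertainty width

# Usage: pick a timestamp no earlier than "now", then wait out the uncertainty.
_, latest = now_with_bound()
commit_wait(latest)
```

Since every such wait costs roughly one bound's worth of latency, shrinking the bound from microseconds/milliseconds to ~100 ns directly speeds up these operations.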
Need for tighter time-uncertainty bound
Sundial: ~100 ns time-uncertainty bound even under failures; 2-3 orders of magnitude better than existing designs
State-of-the-art clock synchronization
Calculate the offset between two clocks by exchanging messages and measuring RTT (sketch after this block)
Path of messages
Variable and asymmetric delay (forward vs. reverse paths, queuing delay)
Best practice: sync between neighboring devices
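A sketch of the classic two-way exchange used to estimate the offset (NTP/PTP style; this is the generic technique, not Sundial's exact wire format). It makes the symmetric-delay assumption explicit, which is why variable and asymmetric queuing delay hurts, and why syncing only between neighboring devices helps.

```python
def estimate_offset(t1, t2, t3, t4):
    """Two-way timestamp exchange:
      t1: requester's send timestamp     (requester clock)
      t2: responder's receive timestamp  (responder clock)
      t3: responder's send timestamp     (responder clock)
      t4: requester's receive timestamp  (requester clock)
    Assumes forward and reverse one-way delays are equal; any asymmetry
    (e.g., queuing in one direction) shows up directly as offset error."""
    offset = ((t2 - t1) + (t3 - t4)) / 2   # responder clock minus requester clock
    rtt = (t4 - t1) - (t3 - t2)            # round trip excluding responder processing
    return offset, rtt

# Example with made-up nanosecond timestamps:
offset, rtt = estimate_offset(t1=1_000, t2=1_600, t3=1_650, t4=2_050)
print(offset, rtt)   # 100 ns offset, 1000 ns RTT under the symmetric-delay assumption
```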
Network-wide synchronization
Spanning tree: clock values distributed along tree edges
Periodic sync: clocks can drift apart over time, so periodic synchronization is needed
Calculation of time-uncertainty bound
How long it has been since the last synchronization (bound sketch after this block)
Root's direct children: large bound when affected by failure
Nodes in the sub-tree: large bound all the time to prepare for unnoticed failures
Need fast recovery from connectivity failures
How fast the clock can drift away
Clocks drift as oscillator frequencies vary with temperature, voltage, and so on
Max drift rate is set conservatively in production (200 ppm in Google TrueTime)
Reason: must guarantee correctness
What if we set it more aggressively? A large number of clock-related errors (application consistency etc.) during cooling failures!
Need very frequent synchronization
Both factors can be very large because of failures:
Frequency-related failures: cooling, voltage fluctuations
Connectivity failures: link/device failure that break the spanning tree
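A sketch of how the bound grows between synchronizations, combining the two factors above. The exact expression is an assumption (base bound plus linear drift), not copied from the paper, but it shows why both a broken spanning tree (longer time since sync) and a conservative drift rate inflate the bound.

```python
def uncertainty_bound(now_ns, last_sync_ns, base_bound_ns, max_drift_ppm):
    """Assumed form: inherit a base bound at the last sync, then grow
    linearly at the assumed worst-case drift rate until the next sync."""
    elapsed_ns = now_ns - last_sync_ns
    return base_bound_ns + elapsed_ns * max_drift_ppm * 1e-6

# With TrueTime's conservative 200 ppm drift rate, even a 1 ms gap since the
# last sync adds 200 ns of uncertainty; a multi-second gap caused by a broken
# spanning tree adds hundreds of microseconds.
print(uncertainty_bound(now_ns=1_000_000, last_sync_ns=0,
                        base_bound_ns=50, max_drift_ppm=200))  # 250.0 ns
```

This is why Sundial needs both very frequent synchronization (keep elapsed time small) and fast recovery from connectivity failures (keep it small even when the tree breaks).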
HW-SW codesign
HW: message sending & processing, failure detection
Frequent sync messages (every ~100 microseconds)
Fast failure detection with small timeout
Remote failure detection: synchronous messaging
SW: enable the backup plan (re-configure the HW)
Pre-compute the backup plan (by centralized controller)
1 backup parent per device: there are multiple candidate backup parents, and a device cannot distinguish which failure occurred, so the backup plan must be generic to different failures:
Any single link failure
Any single device failure
Root device failure
Any fault-domain (e.g., rack, pod, power) failure: multiple devices / links go down
Backup plan
1 backup parent per device
1 backup root: elects itself as the new root when the root fails
How to distinguish root failure from other failures?
Key: get independent observation from other nodes
2nd timeout: if the independent observation also indicates failure, the backup root elects itself as the new root
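A sketch of the backup root's two-timeout logic. The structure and timeout values are illustrative assumptions (the real logic is split across device HW and SW); the key idea from the notes above is that a second, longer timeout backed by an independent observation distinguishes "my parent/link failed" from "the root failed".

```python
# Illustrative timeouts, not Sundial's actual values.
FIRST_TIMEOUT_NS  = 500_000     # no message from parent -> switch to backup parent
SECOND_TIMEOUT_NS = 2_000_000   # no message via any path -> the root itself failed

class BackupRoot:
    def __init__(self):
        self.last_msg_from_parent_ns = 0
        self.last_msg_from_any_node_ns = 0   # the "independent observation"

    def on_sync_message(self, now_ns, from_parent):
        self.last_msg_from_any_node_ns = now_ns
        if from_parent:
            self.last_msg_from_parent_ns = now_ns

    def on_tick(self, now_ns):
        if now_ns - self.last_msg_from_parent_ns > FIRST_TIMEOUT_NS:
            self.switch_to_backup_parent()   # local failure: parent or link down
        if now_ns - self.last_msg_from_any_node_ns > SECOND_TIMEOUT_NS:
            self.become_root()               # nobody is receiving time: root failed

    def switch_to_backup_parent(self): ...
    def become_root(self): ...
```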
Backup plan that handles fault-domain failures
A single domain failure can break connectivity and take down the backup parent at the same time
Avoid this case when computing the backup plan (sketch below)
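A sketch of the controller-side backup-parent selection. The constraints are derived from the failure list above and are an assumed simplification, not the paper's exact algorithm; in particular, the real plan must also rule out loops when multiple devices fail over simultaneously.

```python
def choose_backup_parents(primary_parent, neighbors, fault_domain, descendants):
    """primary_parent: device -> its parent in the primary spanning tree
       neighbors:      device -> set of directly connected devices
       fault_domain:   device -> fault-domain id (rack/pod/power)
       descendants:    device -> devices in its primary subtree (loop avoidance)"""
    backup = {}
    for dev, parent in primary_parent.items():
        candidates = [
            n for n in neighbors[dev]
            if n != parent                              # survives the primary parent/link failing
            and n not in descendants[dev]               # switching to it does not form a loop
            and fault_domain[n] != fault_domain[parent] # survives a whole fault-domain failure
        ]
        if not candidates:
            raise ValueError(f"no valid backup parent for {dev}")
        backup[dev] = candidates[0]  # a real controller would also balance depth and load
    return backup
```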
Two salient features
Frequent synchronization
Fast recovery from connectivity failures
Evaluation
Testbed
Compare with state-of-the-art
Metrics
Scenarios: normal operation (no failure); injected failures: link, device, domain
Summary
Time-uncertainty bound is the key metric
Existing sub-microsecond solutions fall short because of failures
Sundial: HW-SW codesign
Device HW: frequent messages, synchronous messaging, fast failure detection
Device SW: fast local recovery based on the backup plan
Controller: pre-compute the backup plan generic to different failures
First system to achieve a ~100 ns time-uncertainty bound under failures, with improvements on real applications