Sundial: Fault-tolerant Clock Synchronization for Datacenters

https://www.usenix.org/conference/osdi20/presentation/li-yuliang

  • Need for synchronized clocks in datacenters

    • Simplify or improve existing applications

      • Distributed databases, consistent snapshots

    • Enable new applications

      • Network telemetry

      • One-way delay measurement for congestion-control

      • Distributed logging and debugging

    • Yet no synchronized clocks with a tight bound are available today

  • Need for time-uncertainty bound

    • Wait: a common op for ordering & consistency

      • The time-uncertainty bound decides how long to wait to guarantee correctness
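
A minimal sketch of such a wait, in the style of TrueTime's commit-wait; the function name and the 100 ns value are illustrative assumptions, not code from the paper:

```python
import time

EPSILON_NS = 100  # assumed time-uncertainty bound reported by the clock, in ns

def commit_wait(commit_timestamp_ns: int) -> None:
    """Block until commit_timestamp_ns is guaranteed to be in the past on every
    clock in the system; only then is it safe to expose the commit."""
    while time.time_ns() < commit_timestamp_ns + EPSILON_NS:
        pass  # a tighter bound means a shorter wait before answering clients
```

A tighter bound directly shortens this wait, which is why the ~100 ns bound matters to applications.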

  • Need for tighter time-uncertainty bound

    • Sundial: ~100 ns time-uncertainty bound even under failures; 2-3 orders of magnitude better than existing designs

  • State-of-the-art clock synchronization

    • Calculate the offset between two clocks by exchanging timestamped messages and measuring the RTT (see the sketch after this list)

    • Path of messages

      • Variable and asymmetric delay (forward vs. reverse paths, queuing delay)

      • Best practice: sync between neighboring devices

    • Network-wide synchronization

      • Spanning tree: clock values distributed along tree edges

    • Periodic sync: clocks can drift apart over time, so periodic synchronization is needed
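
For reference, the standard two-way exchange (NTP/PTP style) estimates the offset and its error as below; this is a generic sketch, not Sundial's exact message format:

```python
def estimate_offset_ns(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    """t1 = request sent (local clock), t2 = request received (peer clock),
    t3 = reply sent (peer clock), t4 = reply received (local clock), all in ns."""
    offset = ((t2 - t1) + (t3 - t4)) / 2   # assumes symmetric one-way delays
    rtt = (t4 - t1) - (t3 - t2)            # total time spent on the wire
    error_bound = rtt / 2                  # worst case under fully asymmetric delay
    return offset, error_bound
```

The error term is dominated by queuing delay and path asymmetry, which is why the best practice is to synchronize only between neighboring devices.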

  • Calculation of the time-uncertainty bound: two factors (a sketch follows this list)

    • Factor 1: how long it has been since the clock last synchronized

      • Root's direct children: the bound grows large when a failure disconnects them from the root

      • Nodes deeper in the sub-tree: must keep a large bound all the time to prepare for failures they cannot observe

      • Need fast recovery from connectivity failures

    • Factor 2: how fast the clock can drift

      • Clocks drift as oscillator frequencies vary with temperature, voltage, and so on

      • Max drift rate is set conservatively in production (200 ppm in Google TrueTime)

      • Reason: correctness must be guaranteed even in the worst case

        • What if the rate is set more aggressively? A burst of clock-related errors (application consistency, etc.) during cooling failures!

        • Since the drift rate must stay conservative, keeping the bound tight requires very frequent synchronization

    • Both factors can be very large because of failures:

      • Frequency-related failures: cooling, voltage fluctuations

      • Connectivity failures: link/device failures that break the spanning tree
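
Putting the two factors together, the bound is typically modeled as the error at the last sync plus the worst-case drift accumulated since then; a minimal sketch (not the exact form Sundial's hardware maintains):

```python
MAX_DRIFT_PPM = 200  # conservative worst-case drift rate, as in Google TrueTime

def time_uncertainty_ns(now_ns: int, last_sync_ns: int, base_error_ns: float) -> float:
    """Bound = error at the last successful sync + worst-case drift since then."""
    elapsed_ns = now_ns - last_sync_ns
    return base_error_ns + elapsed_ns * MAX_DRIFT_PPM / 1e6
```

At 200 ppm, synchronizing every ~100 microseconds adds at most about 20 ns of drift between syncs, which is how a ~100 ns bound becomes feasible; the same formula shows why any failure that delays synchronization quickly inflates the bound.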

  • Sundial: HW-SW codesign

    • HW: message sending & processing, failure detection

      • Frequent sync messages, sent every ~100 microseconds

      • Fast failure detection with a small timeout

      • Remote failure detection via synchronous messaging: a device sends its sync message only after hearing from its parent, so an upstream failure quickly becomes visible downstream
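
A toy model of the timeout-based detection done in hardware; the sync interval comes from the notes above, while the timeout multiple is an assumed value:

```python
SYNC_INTERVAL_US = 100   # sync messages are expected roughly every ~100 microseconds
TIMEOUT_INTERVALS = 3    # assumption: declare failure after a few missed messages

def parent_failed(now_us: float, last_sync_from_parent_us: float) -> bool:
    """If no sync message has arrived from the parent within a small multiple of
    the interval, flag a failure (a local link/device failure, or an upstream one
    surfaced by synchronous messaging)."""
    return now_us - last_sync_from_parent_us > TIMEOUT_INTERVALS * SYNC_INTERVAL_US
```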

    • SW: enable the backup plan (re-configure the HW)

      • Pre-compute the backup plan (by a centralized controller)

        • 1 backup parent per device: there are multiple options for the backup parent, and a device cannot tell which failure it is seeing, so the backup plan must be generic across different failures:

          • Any single link failure

          • Any single device failure

          • Root device failure

          • Any fault-domain failure (e.g., rack, pod, power): multiple devices / links go down at once

        • The backup plan

          • 1 backup parent per device

          • 1 backup root: elects itself as the new root when the root fails

            • How to distinguish a root failure from other failures?

            • Key: get independent observations from other nodes

            • On a 2nd timeout, the backup root elects itself as the new root

          • Handling fault-domain failures

            • A single domain failure can both break connectivity and take down the backup parent

            • The controller avoids this case when computing the backup plan
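
A minimal sketch of the local recovery logic described above; the data structures and field names are assumptions, and the controller's real plan computation involves more than the single fault-domain check noted here:

```python
from dataclasses import dataclass

@dataclass
class BackupPlan:                 # pre-computed by the centralized controller
    backup_parent: str            # exactly one backup parent per device
    is_backup_root: bool          # one device is designated as the backup root

@dataclass
class SyncState:
    parent: str
    is_root: bool = False
    timeouts: int = 0             # consecutive sync timeouts reported by the HW

def on_sync_timeout(state: SyncState, plan: BackupPlan) -> SyncState:
    """Software-driven local recovery when the HW reports a missing sync message."""
    state.timeouts += 1
    if state.timeouts == 1:
        # The device cannot tell which failure occurred, so it simply switches to
        # the pre-computed backup parent (the plan is generic to failure types).
        state.parent = plan.backup_parent
    elif state.timeouts >= 2 and plan.is_backup_root:
        # Still silent even after failing over: combined with what other nodes
        # observe, this points at the root itself, so the backup root takes over.
        state.is_root = True
    return state
```

On the controller side, the fault-domain case above suggests at least one constraint: a device's backup parent should not be taken down by the same domain failure that breaks its primary connectivity.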

    • Two salient features

      • Frequent synchronization

      • Fast recovery from connectivity failures

  • Evaluation

    • Testbed

    • Comparison with state-of-the-art designs

    • Metric: time-uncertainty bound

    • Scenarios: normal operation (no failure) and injected failures: link, device, fault domain

  • Summary

    • Time-uncertainty bound is the key metric

      • Existing sub-microsecond solutions fall short under failures

    • Sundial: HW-SW codesign

      • Device HW: frequent messages, synchronous messaging, fast failure detection

      • Device SW: fast local recovery based on the pre-computed backup plan

      • Controller: pre-computes a backup plan that is generic to different failures

      • First system to achieve a ~100 ns time-uncertainty bound, with improvements on real applications
