# Sundial: Fault-tolerant Clock Synchronization for Datacenters

* Need for synchronized clocks in datacenters
  * Simplify or improve existing applications
    * Distributed databases, consistent snapshots
  * Enable new applications
    * Network telemetry
    * One-way delay measurement for congestion control
    * Distributed logging and debugging
  * Today, no synchronized clocks with a tight bound are available in datacenters
* Need for a time-uncertainty bound
  * Wait: a common operation for ordering & consistency (see the commit-wait sketch below)
    * The time-uncertainty bound decides how long to wait to guarantee correctness
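A minimal sketch of the commit-wait pattern, in the style of Spanner/TrueTime; the clock API `now_with_bound()` and the constant below are hypothetical, not Sundial's actual interface:

```python
import time

EPSILON = 100e-9  # assumed time-uncertainty bound, ~100 ns as in Sundial

def now_with_bound():
    """Hypothetical clock API: the true time is guaranteed to lie in
    the returned [earliest, latest] interval."""
    t = time.time()
    return t - EPSILON, t + EPSILON

def commit_wait(commit_ts):
    """Block until commit_ts is guaranteed to be in the past on every clock.

    After the wait, any observer with a correct clock reads a time larger
    than commit_ts, which is what ordering/consistency protocols rely on.
    The wait is bounded by roughly 2 * EPSILON, so a tighter uncertainty
    bound directly shortens this stall.
    """
    earliest, _ = now_with_bound()
    while earliest <= commit_ts:
        time.sleep(max(commit_ts - earliest, 0.0))
        earliest, _ = now_with_bound()
```

With a \~100 ns bound the wait is negligible; with a millisecond-scale bound it can dominate commit latency, which motivates the next point.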
* Need for a tighter time-uncertainty bound
  * Sundial: \~100 ns time-uncertainty bound even under failures; 2-3 orders of magnitude better than existing designs
* State-of-the-art clock synchronization
  * Calculate the offset between 2 clocks by exchanging timestamped messages and measuring the RTT (see the sketch below)
  * Path of messages
    * Variable and asymmetric delay (forward vs. reverse paths, queuing delay)
    * Best practice: sync between neighboring devices
  * Network-wide synchronization
    * Spanning tree: clock values distributed along tree edges
  * Periodic sync: clocks drift apart over time, so periodic synchronization is needed
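The classic two-way exchange (as in NTP/PTP) estimates the offset from four timestamps; a small sketch, with the symmetric-delay assumption that makes neighbor-to-neighbor sync attractive:

```python
def estimate_offset(t1, t2, t3, t4):
    """NTP-style offset estimate from a two-way message exchange.

    t1: request send time    (client clock)
    t2: request receive time (server clock)
    t3: reply send time      (server clock)
    t4: reply receive time   (client clock)

    Assumes forward and reverse path delays are equal; any asymmetry
    (e.g., queuing in one direction only) becomes an undetectable error,
    which is why syncing between directly connected neighbors works best.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2  # server clock minus client clock
    rtt = (t4 - t1) - (t3 - t2)           # round-trip network delay
    return offset, rtt
```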
* Calculation of the time-uncertainty bound (see the sketch below)
  * How long it has been since the last synchronization
    * Root's direct children: large bound when affected by a failure
    * Nodes in the sub-tree below: large bound all the time, to prepare for unnoticed failures
    * Need **fast recovery** from connectivity failures
  * How fast the clock can drift away
    * Clocks drift as oscillator frequencies vary with temperature, voltage, and so on
    * Max drift rate is set conservatively in production (200 ppm in Google TrueTime)
    * Reason: must guarantee correctness
      * What if we set it more aggressively? A large number of clock-related errors (application inconsistency etc.) during cooling failures!
      * Need **very frequent synchronization**
  * Both factors can become very large because of failures:
    * Frequency-related failures: cooling, voltage fluctuations
    * Connectivity failures: link/device failures that break the spanning tree
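Putting the two factors together, the bound grows linearly with the time since the last sync; a sketch (the base term and function name are illustrative, not Sundial's):

```python
MAX_DRIFT_PPM = 200  # conservative max drift rate; Google TrueTime uses 200 ppm

def uncertainty_bound_ns(now_ns, last_sync_ns, base_bound_ns):
    """Time-uncertainty bound = residual error at sync + drift since sync.

    base_bound_ns: uncertainty left right after a successful sync
    (timestamping error, path-delay asymmetry). The drift term is why
    both frequent sync (small elapsed time) and fast failure recovery
    (elapsed time never balloons) are needed for a tight bound.
    """
    elapsed_ns = now_ns - last_sync_ns
    return base_bound_ns + elapsed_ns * MAX_DRIFT_PPM / 1e6

# With a ~100 us sync interval, the drift term alone is
# 100_000 ns * 200 / 1e6 = 20 ns -- consistent with a ~100 ns total bound.
```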
* Sundial: HW-SW codesign
  * HW: message sending & processing, failure detection
    * Frequent messages (\~100 microseconds apart)
    * Fast failure detection with a small timeout
    * Remote failure detection via synchronous messaging: a device forwards sync messages as soon as they arrive, so a failure anywhere upstream surfaces as a local timeout
  * SW: enable the backup plan (re-configure the HW) -- see the failover sketch below
    * Pre-compute the backup plan (by a centralized controller; see the controller sketch below)
      * 1 backup parent per device: there are multiple options for the backup parent, and a device cannot distinguish between different failures --> the backup plan must be designed to be generic across failures
        * Any single link failure
        * Any single device failure
        * Root device failure
        * Any fault-domain (e.g., rack, pod, power) failure: multiple devices / links go down together
      * Backup plan
        * 1 backup parent per device
        * 1 backup root: elects itself as the new root when the root fails
          * How to distinguish a root failure from other failures?
          * Key: get independent observations from other nodes
          * 2nd timeout: if synchronization still has not resumed, elect itself as the new root
        * Backup plan that handles fault-domain failures
          * A single domain failure can both break a device's connectivity and take down its backup parent
          * Avoid this case when computing the backup plan
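A deliberately simplified sketch of what the controller's backup-plan computation must guarantee; the data structures and the greedy choice are illustrative (the real algorithm must also keep the post-failover graph a loop-free tree):

```python
def compute_backup_parents(links, primary_parent, fault_domain):
    """Pick one backup parent per device (illustrative, not Sundial's algorithm).

    links: dict node -> set of neighboring nodes
    primary_parent: dict node -> its parent in the primary spanning tree
    fault_domain: dict node -> fault-domain id (rack / pod / power domain)

    The backup parent must survive any failure that takes out the primary
    parent, including a whole-domain failure -- so it must live in a
    different fault domain than the primary parent.
    """
    backup = {}
    for node, primary in primary_parent.items():
        for neighbor in links[node]:
            if neighbor == primary:
                continue  # same link/device as the primary: fails together
            if fault_domain[neighbor] == fault_domain[primary]:
                continue  # same domain as the primary: fails together
            backup[node] = neighbor
            break  # a real algorithm also checks loop-freedom network-wide
    return backup
```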
  * Two salient features
    * Frequent synchronization
    * Fast recovery from connectivity failures
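How fast recovery might look on the device side, combining both timeouts from the backup plan above; the constants, fields, and methods here are all hypothetical (Sundial implements the detection in hardware):

```python
# All names and constants below are hypothetical, for illustration only.
SYNC_INTERVAL = 100e-6               # ~100 us between sync messages (HW)
FIRST_TIMEOUT = 4 * SYNC_INTERVAL    # a few missed messages: something upstream failed
SECOND_TIMEOUT = 16 * SYNC_INTERVAL  # long silence: only a root failure explains it

def on_timer_tick(dev, now):
    """Device-side recovery on each hardware timer tick.

    Any device that stops hearing sync messages switches to its
    pre-computed backup parent after the first timeout -- it cannot tell
    which upstream link or device failed, so the plan must work for all
    of them. If messages still have not resumed by the second timeout,
    the continued silence of the other nodes is the independent
    observation that the root itself is gone, and the designated backup
    root takes over.
    """
    silent_for = now - dev.last_sync_rx
    if dev.is_backup_root and silent_for > SECOND_TIMEOUT:
        dev.become_root()               # elect itself as the new root
    elif silent_for > FIRST_TIMEOUT:
        dev.parent = dev.backup_parent  # enable the backup plan locally
```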
* Evaluation
  * Testbed
  * Comparison with the state-of-the-art
  * Metrics
  * Scenarios: normal operation (no failure) and injected failures: link, device, domain
* Summary
  * Time-uncertainty bound is the key metric
    * Existing sub-microsecond solutions fall short under failures
  * Sundial: HW-SW codesign
    * Device HW: frequent messages, synchronous messaging, fast failure detection
    * Device SW: fast local recovery based on the backup plan
    * Controller: pre-computes a backup plan generic to different failures
    * First system to achieve a \~100 ns time-uncertainty bound, with improvements on real applications

