# Sundial: Fault-tolerant Clock Synchronization for Datacenters

* Need for synchronized clocks in datacenters
  * Simplify or improve existing applications
    * Distributed databases, consistent snapshots
  * Enable new applications
    * Network telemetry
    * One-way delay measurement for congestion control
    * Distributed logging and debugging
  * Today, no synchronized clocks with a tight time-uncertainty bound are available
* Need for time-uncertainty bound
  * Wait: a common operation for ordering and consistency
    * The time-uncertainty bound decides how long to wait to guarantee correctness
* Need for tighter time-uncertainty bound
  * Sundial: \~100 ns time-uncertainty bound even under failures; 2-3 orders of magnitude better than existing designs
* State-of-the-art clock synchronization
  * Calculate the offset between two clocks by exchanging messages and measuring the RTT
  * Path of messages
    * Variable and asymmetric delay (forward vs. reverse paths, queuing delay)
    * Best practice: sync between neighboring devices
  * Network-wide synchronization
    * Spanning tree: clock values distributed along tree edges
  * Periodic sync: clocks drift apart over time, so periodic synchronization is needed
* Calculation of time-uncertainty bound
  * How long it has been since the clock last synchronized
    * Root's direct children: large bound when affected by failure
    * Nodes in the sub-tree: large bound all the time to prepare for unnoticed failures
    * Need **fast recovery** from connectivity failures
  * How fast the clock can drift away
    * Clocks drift as oscillator frequencies vary with temperature, voltage, and so on
    * Max drift rate is set conservatively in production (200 ppm in Google TrueTime)
    * Reason: must guarantee correctness
      * What if we set it more aggressively? A large number of clock-related errors (application consistency etc.) during cooling failures!
      * Need **very frequent synchronization**
  * Both factors can be very large because of failures:
    * Frequency-related failures: cooling, voltage fluctuations
    * Connectivity failures: link/device failures that break the spanning tree
  * HW-SW codesign
    * HW: message sending & processing, failure detection
      * Frequent messages: every \~100 microseconds
      * Fast failure detection with small timeout
      * Remote failure detection: synchronous messaging
    * SW: enable the backup plan (re-configure the HW)
      * Pre-compute the backup plan (by centralized controller)
        * 1 backup parent per device: there are multiple options for the backup parent, and a device cannot distinguish different failures, so the backup plan must be generic to different failures
          * Any single link failure
          * Any single device failure
          * Root device failure
          * Any fault-domain (e.g., rack, pod, power) failure: multiple devices / links go down
        * Backup plan
          * 1 backup parent per device
          * 1 backup root: elects itself as the new root when the root fails
            * How to distinguish root failure from other failures?
            * Key: get independent observation from other nodes
            * 2nd timeout: the backup root elects itself as the new root
          * Backup plan that handles fault-domain failures
            * A single domain failure can both break connectivity and take down the backup parent
            * Avoid this case when computing the backup plan
    * Two salient features
      * Frequent synchronization
      * Fast recovery from connectivity failures
    * Evaluation
      * Testbed
      * Compare with state-of-the-art
      * Metrics
      * Scenarios: normal time (no failure), inject failure: link, device, domain
    * Summary
      * Time-uncertainty bound is the key metric
        * Existing sub-microsecond solutions fall short because of failures
      * Sundial: HW-SW codesign
        * Device HW: frequent messages, synchronous messaging, fast failure detection
        * Device SW: fast local recovery based on the backup plan
        * Controller: pre-compute the backup plan generic to different failures
        * First system with a \~100 ns time-uncertainty bound, with improvements on real applications
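
The offset calculation noted under state-of-the-art clock synchronization can be illustrated with the standard two-way message exchange used by NTP- and PTP-style protocols. This is a minimal sketch, not Sundial's implementation: the function name and the four timestamp parameters are hypothetical, and the formula assumes the forward and reverse delays are equal, which is exactly the assumption that variable and asymmetric paths violate.

```python
def estimate_offset_and_rtt(t1, t2, t3, t4):
    """Two-way exchange between a client and a server (hypothetical helper).

    t1: request sent      (client clock)
    t2: request received  (server clock)
    t3: reply sent        (server clock)
    t4: reply received    (client clock)
    Returns (estimated offset of the server clock relative to the client, RTT).
    """
    rtt = (t4 - t1) - (t3 - t2)            # time on the wire, excluding server processing
    offset = ((t2 - t1) + (t3 - t4)) / 2   # exact only if both directions take equally long
    return offset, rtt


# Server clock is truly 50 ns ahead; all times in ns.
# Symmetric path: 500 ns each way, 100 ns of server processing.
symmetric = estimate_offset_and_rtt(t1=0, t2=550, t3=650, t4=1100)

# Asymmetric path: 700 ns forward, 300 ns reverse, same true offset.
asymmetric = estimate_offset_and_rtt(t1=0, t2=750, t3=850, t4=1100)

print(symmetric)   # (50.0, 1000)
print(asymmetric)  # (250.0, 1000)
```

In the asymmetric case the RTT is unchanged, but the estimated offset is off by half the difference between the forward and reverse delays (200 ns here). That is why the notes list syncing between neighboring devices as best practice.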

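The two factors behind the time-uncertainty bound (time since the last synchronization and maximum drift rate) can be combined in a small worked example. This is a simplified sketch under stated assumptions: the bound is modeled as a fixed base error plus the maximum drift rate multiplied by the time since the last sync. The function names, the `base_error_ns` parameter, and the one-second outage are illustrative and do not come from the talk. Only the 200 ppm drift rate and the \~100 microsecond sync interval are taken from the notes above.

```python
def uncertainty_bound_ns(since_last_sync_us, max_drift_ppm=200.0, base_error_ns=0.0):
    """Simplified time-uncertainty bound in nanoseconds (illustrative model).

    1 ppm of drift accumulates 0.001 ns of error per microsecond,
    so the drift term is ppm * microseconds / 1000.
    """
    drift_ns = max_drift_ppm * since_last_sync_us / 1000.0
    return base_error_ns + drift_ns


def definitely_before(ts_a_ns, eps_a_ns, ts_b_ns, eps_b_ns):
    """True only if event A precedes event B even in the worst case,
    i.e. the two uncertainty intervals do not overlap."""
    return ts_a_ns + eps_a_ns < ts_b_ns - eps_b_ns


# Sync every ~100 microseconds at a conservative 200 ppm.
frequent = uncertainty_bound_ns(since_last_sync_us=100)

# A device that silently misses syncs for one second (assumed outage length).
stale = uncertainty_bound_ns(since_last_sync_us=1_000_000)

print(frequent)  # 20.0 ns
print(stale)     # 200000.0 ns, i.e. 200 microseconds

# Two events 1 microsecond apart can be ordered with the tight bound only.
print(definitely_before(0, frequent, 1_000, frequent))  # True
print(definitely_before(0, stale, 1_000, stale))        # False
```

Under this model the drift term stays at 20 ns when messages arrive every 100 microseconds, but grows to 200 microseconds after a one-second gap. This matches the two requirements in the notes: very frequent synchronization and fast recovery from connectivity failures.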