# Replicating Data Consistency Explained Through Baseball

## Main Idea

* Replication protocols often involve complex trade-offs between consistency, performance, and availability&#x20;
* Shed lights on the consistency models offered by cloud providers, what might be offered in the future, and why vendors are offering consistency choices&#x20;

## Intro

### Consistency provided by cloud storage today&#x20;

* Azure: strongly consistent storage
  * Always see the latest value that was written for a data object
  * Reasonable to provide within a DC&#x20;
  * But raises concern for geo-replicated storage&#x20;
* AWS S3: designed with weak consistency&#x20;
  * Better performance and availability: read might be with stale data&#x20;
  * Eventually consistent: read operation is directed to a replica that has not yet received all the writes that were accepted by some other replica&#x20;
* AWS DynamoDB: both eventually consistent reads and strongly consistent reads&#x20;
* AWS SimpleDB: same choices for clients that read data&#x20;
* Google App Engine Datastore: eventually consistent reads + default strong consistency&#x20;
* Yahoo's PNUTS: read-any, read-critical, read-latest&#x20;
* Quorum-based storage systems: eventual + strong consistency \[by selecting different read and write quorums)&#x20;

### Consistency models proposed from research community&#x20;

* Trade-offs between consistency, performance, and availability&#x20;
  * Availability is the likelihood of a read operation successfully returning suitably consistent data in the presence of server failures
  * Performance refers to the time it takes to complete a read operation, that is, the read latency
* Two ends: strong and eventual&#x20;
  * Stronger: lower performance but reduced availability for reads or writes or both

### Questions to cope with

* Are different consistencies useful in practice?&#x20;
* Can application developers cope with eventual consistency?&#x20;
* Should cloud storage systems offer an even greater choice of consistency than the consistent and eventually consistent reads offered by some of today’s services?

## Read Consistency Guarantees&#x20;

### Model

* Multiple clients perform read and write operations to a data store, concurrently access shared information&#x20;
* Data is replicated among a set of servers, but the details of the replication protocol are hidden from clients&#x20;
* <mark style="color:orange;">Write: any operation that updates one or more data objects</mark>&#x20;
  * Eventually received at all servers and performed in the same order&#x20;
  * Order is consistent with order in which write operations are submitted by clients&#x20;
  * In practice
    * the order could be enforced, even for concurrent writers
    * by performing all writes at a master server or by having servers run a consensus protocol to reach agreement on the global order
* <mark style="color:orange;">Read: return values of one or more data objects that were previously written</mark>&#x20;
  * Not necessarily the latest values&#x20;
  * Can request a consistency guarantee, which dictates the set of allowable return values&#x20;
  * Each guarantee is defined by the set of previous writes whose results are visible to a read operation

<figure><img src="https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FEF7OvsTzOEEkPSgM1ZfD%2Fimage.png?alt=media&#x26;token=efa9cbac-5b62-4ae4-966f-523c2edee460" alt=""><figcaption></figcaption></figure>

#### Strong Consistency

* Guarantees that a read operation returns the value that was last written for a given object
* Append? applying all writes to that object&#x20;

#### Eventual Consistency

* The weakest of the guarantees
* Allows the greatest set of possible return values
* The term derives from:&#x20;
  * Each replica eventually receives each write operation
  * If clients stopped performing writes then read operations would eventually return an object’s latest value

#### Consistent Prefix or Similar to Snapshot Isolation&#x20;

* A reader is guaranteed to observe an ordered sequence of writes starting with the first write to a data store
* E.x. the read may be answered by a replica that receives writes in order from a master replica but has not yet received some recent writes
  * Reader sees a version of the data store that existed at the master at some time in the past
  * Similar to "snapshot isolation" consistency offered by many DBMS
* For reads to a single data object in a system where write operations completely overwrite previous values of an object
  * Eventual consistency reads observe a consistent prefix
* Main benefit&#x20;
  * When reading multiple data objects \[Q: why?]&#x20;
  * Or when write operations incrementally update the objects&#x20;

#### Bounded Staleness

* Ensures that read results are not too out-of-date
* Staleness defined by a period T, e.x. 5 mins or # of missing writes or the amount of inaccuracy in a data value&#x20;
* Guarantees that a read operation will return any values written more than T minutes ago or more recently written values
* "Most natural concept for application developers"&#x20;

#### Monotonic Reads / "Session Guarantee"&#x20;

* Applies to a sequence of read operations that are performed by a given storage system client
* A client can read arbitrarily stale data, as with eventual consistency, but is guaranteed to observe a data store that is increasingly up-to-date over time
* E.x. client issues a read operation and then later issues another read to the same object(s)
  * the second read will return the same value(s) or a more recently written value

#### Read My Writes&#x20;

* Also applies to a sequence of operations performed by a single client
* Guarantees that the effects of all writes that were performed by the client are visible to the client’s subsequent reads
* E.x. If a client writes a new value for a data object and then reads this object
  * the read will return the value that was last written by the client (or some other value that was later written by a different client)
* For clients that issue no writes, the guarantees = eventual consistency&#x20;

#### Summary

* Last four read guarantees are all a form of eventual consistency, but stronger than what is typically provided in cloud storage system&#x20;
  * None is stronger than any of the others&#x20;
  * Applications can request multiple of these guarantees&#x20;
  * E.x. a client could request both monotonic reads and read my writes so that it observes a data store that is consistent with its own actions
* Strong consistency: desirable from consistency view point, but worst performance and availability
  * <mark style="color:green;">Generally requires reading from a designated primary site or from a majority of replicas</mark> &#x20;
* Eventual consistency: weakest consistency, but better performance and availability&#x20;
  * Allows the clients to read from any replicas&#x20;
* Others&#x20;
  * The “strength” of a consistency guarantee does not depend on when and how writes propagate between servers, but rather is **defined by the size of the set of allowable results for a read operation**&#x20;
  * **Smaller sets of possible read results indicate stronger consistency**&#x20;

<figure><img src="https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FGINezxIne4qqCERzypie%2Fimage.png?alt=media&#x26;token=b3f27e1a-1261-4230-9bd1-09c65622e996" alt=""><figcaption></figcaption></figure>

### Baseball as a Sample Application&#x20;

* Two teams' scores&#x20;
* ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FUgfigu3urcWIXgmJi3ZJ%2Fimage.png?alt=media\&token=23b8cc9c-f3e6-4974-bd78-2760e0005b59)
* ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FVWvC62HMtXscJeT8saeZ%2Fimage.png?alt=media\&token=eb3c1e16-270d-4613-bc3a-23ad934956b8)
  * Many of the scores returned by eventually consistent reads were never the actual score&#x20;
  * Consistent prefix limits the result to scores the actually existed at some time&#x20;
  * Bounded staleness depend on the desired bound&#x20;
    * For bounded of 7 innings or more, result is the same for eventual consistency&#x20;
  * Monotonic reads: possible values depends on what has been read in the past&#x20;
  * Read my writes: depend on who is writing to the key-value store&#x20;

### Read Requirements for Participants

#### Official Scorekeeper

* Task: responsible for maintaining the score of the game by writing it to the persistent key- value store
* ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FfOMD7GCJyWElFl57yog2%2Fimage.png?alt=media\&token=5fe088c8-7da5-4a86-b2c9-f52f8102a702)&#x20;
* What consistency?&#x20;
  * Read the most up-to-date previous score before adding one to produce the new score&#x20;
    * Otherwise could write an incorrect score and undermining the game
  * He needs strongly consistent data, but he does not need to perform strong consistency reads&#x20;
    * If there were multiple people playing the role of scorekeeper and taking turns updating the score, then they would need to perform reads that request strong consistency
    * The scorekeeper is the only person who updates the score
      * can request the *read my writes* guarantee and receive the same effect as a strong read
    * If scorekeeper changes in the middle of the game&#x20;
      * the new scorekeeper could perform a strong read when he first updates the score
      * but can use read my writes for subsequent reads
    * <mark style="color:red;">Uses application-specific knowledge</mark> to <mark style="color:green;">obtain the benefits of a weaker consistency read without actually giving up any consistency</mark>&#x20;
* In real life&#x20;
  * Strong consistency read
    * pessimistically assume that some client, anywhere in the world, may have just updated the data&#x20;
    * system therefore must access a majority of servers (or a fixed set of servers) in order to ensure that the most recently written data is accessed by the submitted read operation
  * Read my writes&#x20;
    * Simply needs to record the set of writes that were previously performed by the client and find some server that has seen all of these writes&#x20;
    * In baseball games where the previous write was performed by the scorekeeper may have happened many minutes or even hours ago&#x20;
      * Almost any server will have received the previous write and be able to answer the next read that requests the read my writes guarantee&#x20;
      * \[Q: essentially eventual consistency?]&#x20;

#### Umpire&#x20;

* Task: officiates a baseball game from behind home plate
* For the most part, does not really care  about the current score of the game
* Only care during the top half of the 9th inning (i.e. after the visiting team has batted and the home team is about to bat)&#x20;
  * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2Frff95yaxkUtsWCeqOUxx%2Fimage.png?alt=media\&token=98c8308c-a46e-45fa-9130-2e1815963a3c)
* Never writes, only reads the values written by official scorekeeper
* In order to receive up-to-date info, the umpire must perform strong consistency reads&#x20;
  * Otherwise might end the game early&#x20;

#### Radio Reporter&#x20;

* Task: periodically announce the scores of games that are in progress or have completed
  * E.g. every 30 mins&#x20;
  * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FKoV9fumioN5kGBfgVZMM%2Fimage.png?alt=media\&token=3f2438af-899f-47d6-b335-8f371b689968)
* Broadcasts scores that are not completely up-to-date is okay
  * But don't want to report wrong scores&#x20;
  * The reporter wants both his reads to be performed on a snapshot that hold a consistent prefix of the writes that were performed by the scorekeeper
  * But this is still not sufficient!&#x20;
    * Could read a score of 2-5, the current score, then 30 mins later, read a score of 1-3&#x20;
      * If reporter reads from a primary server and later reads from another server&#x20;
    * \[Q: need to understand this]&#x20;
  * This can be avoided if the reporter requests the *monotonic reads* guarantee in addition to requesting a *consistent prefix*&#x20;
    * \[Q: why do I still need consistent prefix in this case?]&#x20;
  * Or obtain the same effect as a *monotonic read* by requesting *bounded staleness* with a bound of less than 30 minutes
    * Since the reporter only reads data every 30 minutes, he must receive scores that are increasingly up-to-date

#### Sportswriter

* Task: watches the game and later writes an article that appears in the morning paper or that is posted on some web site
* ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2Fp2W9AKUvLjO0fGqJUDDX%2Fimage.png?alt=media\&token=ee63505d-c107-4774-88e5-df2897260023)
* Want the effect of strong consistency read (to report the final correct score), but he does not need to pay the cost&#x20;
  * a *bounded stalene*ss read with a bound of one hour is sufficient to ensure that the sportswriter reads the final score
  * In practice, any server should be able to answer such a read&#x20;
    * Eventual consistency: likely to be correct after an hour&#x20;
    * But request bounded staleness is the only way to be 100% certain about it&#x20;

#### Statistician&#x20;

* Task: keeping track of the season-long statistics for the team and for individual players
* ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2Fqheb1az9vMUDoMHhNKTG%2Fimage.png?alt=media\&token=d034bed9-0f50-4c46-97ab-df8b3b1d7726)
  * Sometime after each game has ended, adds the runs scored to the previous season total and writes this new value back into the data store
* Requirements&#x20;
  * Obtain the final score --> strong consistency read, or waits with bounded staleness&#x20;
  * For reading the statistics for the reason, also wants strong consistency&#x20;
    * Since she is the only person who writes statistics to the data store, she could use "read my writes" guarantee to get the latest value&#x20;

#### Stat Watcher&#x20;

* Task: periodically check on the team’s season statistics&#x20;
* Requirements: are usually content with eventual consistency
  * Updated once per day, and numbers that are slightly out-of-date is okay&#x20;
  * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FvbgjRa9PYfRlJMrZWTA1%2Fimage.png?alt=media\&token=8e69a018-4011-4a15-b04e-fd8851896c63)

### Conclusion&#x20;

<figure><img src="https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2FQek7tyRNZEhiOI8p5f4h%2Fimage.png?alt=media&#x26;token=e7d85818-bd3b-4987-a237-28fcaf9fd83a" alt=""><figcaption></figcaption></figure>

#### Take

* All presented consistency guarantees are useful&#x20;
* Different clients may want different consistencies even when accessing the same data
* Even simple databases may have diverse users with different consistency needs
* Clients should be able to choose their desired consistency
  * The preferred consistency often depends on *how* the data is being used
  * knowledge of *who* writes data or *when* data was last written can sometimes allow clients to perform a relaxed consistency read
* Others&#x20;
  * Cloud providers should offer a wider range of read consistencies for better resource utilization and cost savings&#x20;
