Replicating Data Consistency Explained Through Baseball
Main Idea
Replication protocols often involve complex trade-offs between consistency, performance, and availability
Shed lights on the consistency models offered by cloud providers, what might be offered in the future, and why vendors are offering consistency choices
Intro
Consistency provided by cloud storage today
Azure: strongly consistent storage
Always see the latest value that was written for a data object
Reasonable to provide within a DC
But raises concern for geo-replicated storage
AWS S3: designed with weak consistency
Better performance and availability: read might be with stale data
Eventually consistent: read operation is directed to a replica that has not yet received all the writes that were accepted by some other replica
AWS DynamoDB: both eventually consistent reads and strongly consistent reads
AWS SimpleDB: same choices for clients that read data
Google App Engine Datastore: eventually consistent reads + default strong consistency
Yahoo's PNUTS: read-any, read-critical, read-latest
Quorum-based storage systems: eventual + strong consistency [by selecting different read and write quorums)
Consistency models proposed from research community
Trade-offs between consistency, performance, and availability
Availability is the likelihood of a read operation successfully returning suitably consistent data in the presence of server failures
Performance refers to the time it takes to complete a read operation, that is, the read latency
Two ends: strong and eventual
Stronger: lower performance but reduced availability for reads or writes or both
Questions to cope with
Are different consistencies useful in practice?
Can application developers cope with eventual consistency?
Should cloud storage systems offer an even greater choice of consistency than the consistent and eventually consistent reads offered by some of today’s services?
Read Consistency Guarantees
Model
Multiple clients perform read and write operations to a data store, concurrently access shared information
Data is replicated among a set of servers, but the details of the replication protocol are hidden from clients
Write: any operation that updates one or more data objects
Eventually received at all servers and performed in the same order
Order is consistent with order in which write operations are submitted by clients
In practice
the order could be enforced, even for concurrent writers
by performing all writes at a master server or by having servers run a consensus protocol to reach agreement on the global order
Read: return values of one or more data objects that were previously written
Not necessarily the latest values
Can request a consistency guarantee, which dictates the set of allowable return values
Each guarantee is defined by the set of previous writes whose results are visible to a read operation
Strong Consistency
Guarantees that a read operation returns the value that was last written for a given object
Append? applying all writes to that object
Eventual Consistency
The weakest of the guarantees
Allows the greatest set of possible return values
The term derives from:
Each replica eventually receives each write operation
If clients stopped performing writes then read operations would eventually return an object’s latest value
Consistent Prefix or Similar to Snapshot Isolation
A reader is guaranteed to observe an ordered sequence of writes starting with the first write to a data store
E.x. the read may be answered by a replica that receives writes in order from a master replica but has not yet received some recent writes
Reader sees a version of the data store that existed at the master at some time in the past
Similar to "snapshot isolation" consistency offered by many DBMS
For reads to a single data object in a system where write operations completely overwrite previous values of an object
Eventual consistency reads observe a consistent prefix
Main benefit
When reading multiple data objects [Q: why?]
Or when write operations incrementally update the objects
Bounded Staleness
Ensures that read results are not too out-of-date
Staleness defined by a period T, e.x. 5 mins or # of missing writes or the amount of inaccuracy in a data value
Guarantees that a read operation will return any values written more than T minutes ago or more recently written values
"Most natural concept for application developers"
Monotonic Reads / "Session Guarantee"
Applies to a sequence of read operations that are performed by a given storage system client
A client can read arbitrarily stale data, as with eventual consistency, but is guaranteed to observe a data store that is increasingly up-to-date over time
E.x. client issues a read operation and then later issues another read to the same object(s)
the second read will return the same value(s) or a more recently written value
Read My Writes
Also applies to a sequence of operations performed by a single client
Guarantees that the effects of all writes that were performed by the client are visible to the client’s subsequent reads
E.x. If a client writes a new value for a data object and then reads this object
the read will return the value that was last written by the client (or some other value that was later written by a different client)
For clients that issue no writes, the guarantees = eventual consistency
Summary
Last four read guarantees are all a form of eventual consistency, but stronger than what is typically provided in cloud storage system
None is stronger than any of the others
Applications can request multiple of these guarantees
E.x. a client could request both monotonic reads and read my writes so that it observes a data store that is consistent with its own actions
Strong consistency: desirable from consistency view point, but worst performance and availability
Generally requires reading from a designated primary site or from a majority of replicas
Eventual consistency: weakest consistency, but better performance and availability
Allows the clients to read from any replicas
Others
The “strength” of a consistency guarantee does not depend on when and how writes propagate between servers, but rather is defined by the size of the set of allowable results for a read operation
Smaller sets of possible read results indicate stronger consistency
Baseball as a Sample Application
Two teams' scores
Many of the scores returned by eventually consistent reads were never the actual score
Consistent prefix limits the result to scores the actually existed at some time
Bounded staleness depend on the desired bound
For bounded of 7 innings or more, result is the same for eventual consistency
Monotonic reads: possible values depends on what has been read in the past
Read my writes: depend on who is writing to the key-value store
Read Requirements for Participants
Official Scorekeeper
Task: responsible for maintaining the score of the game by writing it to the persistent key- value store
What consistency?
Read the most up-to-date previous score before adding one to produce the new score
Otherwise could write an incorrect score and undermining the game
He needs strongly consistent data, but he does not need to perform strong consistency reads
If there were multiple people playing the role of scorekeeper and taking turns updating the score, then they would need to perform reads that request strong consistency
The scorekeeper is the only person who updates the score
can request the read my writes guarantee and receive the same effect as a strong read
If scorekeeper changes in the middle of the game
the new scorekeeper could perform a strong read when he first updates the score
but can use read my writes for subsequent reads
Uses application-specific knowledge to obtain the benefits of a weaker consistency read without actually giving up any consistency
In real life
Strong consistency read
pessimistically assume that some client, anywhere in the world, may have just updated the data
system therefore must access a majority of servers (or a fixed set of servers) in order to ensure that the most recently written data is accessed by the submitted read operation
Read my writes
Simply needs to record the set of writes that were previously performed by the client and find some server that has seen all of these writes
In baseball games where the previous write was performed by the scorekeeper may have happened many minutes or even hours ago
Almost any server will have received the previous write and be able to answer the next read that requests the read my writes guarantee
[Q: essentially eventual consistency?]
Umpire
Task: officiates a baseball game from behind home plate
For the most part, does not really care about the current score of the game
Only care during the top half of the 9th inning (i.e. after the visiting team has batted and the home team is about to bat)
Never writes, only reads the values written by official scorekeeper
In order to receive up-to-date info, the umpire must perform strong consistency reads
Otherwise might end the game early
Radio Reporter
Task: periodically announce the scores of games that are in progress or have completed
E.g. every 30 mins
Broadcasts scores that are not completely up-to-date is okay
But don't want to report wrong scores
The reporter wants both his reads to be performed on a snapshot that hold a consistent prefix of the writes that were performed by the scorekeeper
But this is still not sufficient!
Could read a score of 2-5, the current score, then 30 mins later, read a score of 1-3
If reporter reads from a primary server and later reads from another server
[Q: need to understand this]
This can be avoided if the reporter requests the monotonic reads guarantee in addition to requesting a consistent prefix
[Q: why do I still need consistent prefix in this case?]
Or obtain the same effect as a monotonic read by requesting bounded staleness with a bound of less than 30 minutes
Since the reporter only reads data every 30 minutes, he must receive scores that are increasingly up-to-date
Sportswriter
Task: watches the game and later writes an article that appears in the morning paper or that is posted on some web site
Want the effect of strong consistency read (to report the final correct score), but he does not need to pay the cost
a bounded staleness read with a bound of one hour is sufficient to ensure that the sportswriter reads the final score
In practice, any server should be able to answer such a read
Eventual consistency: likely to be correct after an hour
But request bounded staleness is the only way to be 100% certain about it
Statistician
Task: keeping track of the season-long statistics for the team and for individual players
Sometime after each game has ended, adds the runs scored to the previous season total and writes this new value back into the data store
Requirements
Obtain the final score --> strong consistency read, or waits with bounded staleness
For reading the statistics for the reason, also wants strong consistency
Since she is the only person who writes statistics to the data store, she could use "read my writes" guarantee to get the latest value
Stat Watcher
Task: periodically check on the team’s season statistics
Requirements: are usually content with eventual consistency
Updated once per day, and numbers that are slightly out-of-date is okay
Conclusion
Take
All presented consistency guarantees are useful
Different clients may want different consistencies even when accessing the same data
Even simple databases may have diverse users with different consistency needs
Clients should be able to choose their desired consistency
The preferred consistency often depends on how the data is being used
knowledge of who writes data or when data was last written can sometimes allow clients to perform a relaxed consistency read
Others
Cloud providers should offer a wider range of read consistencies for better resource utilization and cost savings
Last updated
Was this helpful?