Dynamo: Amazon's Highly Available Key-value Store

Main Idea

  • Dynamo: a highly available and scalable KV storage system to provide an "always-on" experience

  • Sacrifices consistency under certain failure scenarios to provide high availability

  • Makes use of object versioning and application-assisted conflict resolution

Intro

  • Main goal: reliability at massive scale

    • Depends on how its application state is managed

    • AWS: highly decentralized, loosely coupled, service oriented architecture of hundreds of services

      • A small but significant number of server and network components are failing at any given time

    • E.g. shopping cart: customers must always be able to write to and read from its data store, and the data needs to be available across multiple data centers

  • Dynamo

    • manages the state of services that have very high reliability requirements

    • provides simple primary-key only interface to a data store

    • to achieve scalability and availability

      • data is partitioned and replicated using consistent hashing

      • consistency is facilitated by object versioning

        • consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol

      • gossip-based distributed failure detection and membership protocol

    • completely decentralized

      • storage nodes can be added and removed without requiring manual partitioning or redistribution

    • demonstrates that an eventually-consistent storage system can be used in production with demanding applications

Background

  • Relational DB: far from ideal

    • Most services don't need an RDBMS's complex querying and management functionality

    • Available replication options are limited and typically choose consistency over availability

    • Not easy to scale or use smart partitioning schemes for load balancing

System Requirements

  • Query model: simple read and write operations

  • ACID properties: guarantee that DB transactions are processed reliably

    • Dynamo targets weaker consistency (the "C" in ACID) in exchange for higher availability

  • Efficiency: services must be able to configure Dynamo such that they consistently achieve their latency and throughput requirements

    • Tradeoffs: performance, cost efficiency, availability, durability guarantees

  • Other assumptions: no security requirements within AWS, initial design targets a scale up to hundreds of storage hosts

SLA

  • A page request to one of the e-commerce sites typically requires the rendering engine to construct its response by sending requests to over 150 services

  • SLAs are expressed and measured at the 99.9th percentile of the latency distribution

  • Design constraints

    • Eventual consistency: when to resolve conflicts and who resolves them

    • Target: "always-writable" data store for AWS

      • vs. some traditional data stores that execute conflict resolution during writes and keep the read complexity simple --> they "reject writes" if the data store cannot reach all replicas at a given point in time

      • No updates are rejected due to failures or concurrent writes

    • Next design choice: who performs the process of conflict resolution

      • Either the data store ("last write wins") or the application, which can merge versions using its own semantics (see the merge sketch at the end of this section)

    • Other key design principles

      • Incremental scalability

      • Symmetry

      • Decentralization

      • Heterogeneity

    • Other

      • Nodes are trusted in a single domain

      • No need for hierarchical namespaces or complex relational schemas

      • Built for latency-sensitive applications
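
  • A minimal Python sketch (not from the paper) contrasting the two conflict-resolution choices above, using the shopping cart example; the data shapes and function names are assumptions for illustration

    def last_write_wins(versions):
        # Data-store-side resolution: keep only the version with the latest
        # timestamp; concurrent updates can be silently dropped.
        return max(versions, key=lambda v: v["timestamp"])

    def merge_carts(versions):
        # Application-side resolution: union the items across versions so an
        # "add to cart" is never lost (though a deleted item may resurface).
        merged = set()
        for v in versions:
            merged |= set(v["items"])
        return {"items": sorted(merged)}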

System Architecture

What is required

  • The system needs to have scalable and robust solutions for load balancing, membership and failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency and job scheduling, request marshaling, request routing, system monitoring and alarming, and configuration management

Dynamo paper focuses on

  • Partitioning, replication, versioning, membership, failure handling, and scaling

Interface

  • get(key)

    • locates the object replicas associated with the key in the storage system and returns a single object or a list of objects with conflicting versions along with a context

  • put(key, context, object)

    • determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk

  • context

    • encodes system metadata about the object that is opaque to the caller and includes information such as the version of the object

    • stored along with the object s.t. the system can verify the validity of the context object supplied in the put request
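
  • A minimal sketch of this two-operation interface; the DynamoClient class and the reconcile() helper are hypothetical, not Dynamo's actual API

    class DynamoClient:
        """Hypothetical client wrapper around the get/put interface."""

        def get(self, key):
            # Returns (versions, context): a single object or a list of
            # causally unrelated versions, plus an opaque context that
            # encodes metadata such as the object's version information.
            ...

        def put(self, key, context, obj):
            # Writes obj; the context must come from a preceding get() so the
            # system can verify which version(s) this write supersedes.
            ...

    # Typical read-modify-write cycle:
    #   versions, ctx = client.get("cart:alice")
    #   new_cart = reconcile(versions)        # application-specific merge
    #   client.put("cart:alice", ctx, new_cart)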

Partitioning Algorithm

  • Goal: scale incrementally --> dynamically partition the data over the set of nodes (i.e. storage hosts) in the system

  • Partition scheme: consistent hashing to distribute the load across multiple storage hosts

    • The output range of a hash function is treated as a fixed circular space or “ring” (i.e. the largest hash value wraps around to the smallest hash value)

    • Each node in the system is assigned a random value within this space which represents its "position" on the ring

    • Each data item identified by a key is assigned to a node by hashing the data item's key to yield its position on the ring, then walking the ring clockwise to find the first node with a position larger than the item's position

    • Each node becomes responsible for the region in the ring between it and its predecessor node on the ring

  • Principle: the departure or arrival of a node only affects its immediate neighbors; other nodes remain unaffected

  • Challenges with basic consistent hashing that Dynamo addresses

    • Random position assignment leads to non-uniform data and load distribution

    • The basic algorithm is oblivious to heterogeneity in the performance of nodes

    • Dynamo

      • Uses the concept of "virtual nodes"

      • Each node can be responsible for one or more virtual nodes

      • When a new node is added to the system, it is assigned multiple positions (i.e. "tokens") in the ring
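
  • A minimal Python sketch of consistent hashing with virtual nodes; the token count, node names, and use of MD5 here are illustrative assumptions

    import bisect
    import hashlib

    def ring_position(value: str) -> int:
        # Hash to a position on the fixed circular space (MD5 gives a
        # 128-bit space here; the exact hash function is an implementation detail).
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, tokens_per_node: int = 4):
            self.tokens = []   # sorted virtual-node positions on the ring
            self.owner = {}    # token -> physical node
            self.tokens_per_node = tokens_per_node

        def add_node(self, node: str):
            # A joining node is assigned multiple positions ("tokens"),
            # one per virtual node it becomes responsible for.
            for i in range(self.tokens_per_node):
                token = ring_position(f"{node}#vnode{i}")
                bisect.insort(self.tokens, token)
                self.owner[token] = node

        def successor_index(self, key: str) -> int:
            # Walk clockwise from the key's position to the first token with
            # a larger position, wrapping around at the end of the ring.
            i = bisect.bisect_right(self.tokens, ring_position(key))
            return i % len(self.tokens)

        def coordinator(self, key: str) -> str:
            return self.owner[self.tokens[self.successor_index(key)]]

    ring = Ring()
    for n in ("node-A", "node-B", "node-C"):
        ring.add_node(n)
    print(ring.coordinator("cart:alice"))   # node owning this key's range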

Replication

  • Each data item is replicated at N hosts

  • Coordinator: in charge of the replication of the data items that fall within its range

    • Each key, k, is assigned to a coordinator node (the node that k maps to on the ring)

    • The coordinator also replicates the keys at the N-1 clockwise successor nodes in the ring

  • The list of nodes that is responsible for storing a particular key is called the preference list

    • Every node in the system can determine which nodes should be in this list for any particular key

  • To account for node failures

    • Preference list contains more than N nodes
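
  • A sketch of building a preference list on top of the Ring sketch above, skipping ring positions so the list contains only distinct physical nodes; preference_list() is a hypothetical helper

    def preference_list(ring: Ring, key: str, n: int = 3) -> list:
        # Walk clockwise from the key's position, collecting the first N
        # distinct physical nodes; skipping repeats avoids placing two
        # replicas on the same host just because it owns several virtual nodes.
        distinct_hosts = len(set(ring.owner.values()))
        nodes = []
        i = ring.successor_index(key)
        while len(nodes) < min(n, distinct_hosts):
            node = ring.owner[ring.tokens[i]]
            if node not in nodes:
                nodes.append(node)
            i = (i + 1) % len(ring.tokens)
        return nodes

    # The first entry is the key's coordinator; the remaining entries are the
    # clockwise successors that hold the replicas.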

Data Versioning

  • A put() call may return to its caller before the update has been applied at all the replicas, which can result in scenarios where a subsequent get() operation may return a stale read

  • Idea: treat the result of each modification as a new and immutable version of the data

    • allows for multiple versions of an object to be present in the system at the same time

  • Conflicting versions

    • The client performs a reconciliation in order to collapse multiple branches of data evolution back into one

  • Vector clocks are used to capture causality between different versions of the same object
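
  • A sketch of the vector-clock comparison used to decide whether two versions are causally unrelated; node names and counters are illustrative

    # A vector clock is a list of (node, counter) pairs, modeled here as a dict.
    def descends_from(a: dict, b: dict) -> bool:
        # Version a descends from b if every counter in b is <= the
        # corresponding counter in a; then b can be forgotten.
        return all(a.get(node, 0) >= count for node, count in b.items())

    def in_conflict(a: dict, b: dict) -> bool:
        # Neither version descends from the other: they are concurrent
        # (causally unrelated) and must be reconciled by the client.
        return not descends_from(a, b) and not descends_from(b, a)

    v1 = {"Sx": 2}               # written twice through coordinator Sx
    v2 = {"Sx": 1, "Sy": 1}      # concurrent write through Sy, based on an older read
    print(in_conflict(v1, v2))   # True -> get() returns both versions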

Execution of get() and put() requests

  • Invoked using AWS's request processing framework over HTTP

  • Two approaches

    • Route its request through a generic load balancer that will select a node based on load information

      • Pro: client does not have to link any code specific to Dynamo in its application

    • Use a partition-aware client library that routes requests directly to the appropriate coordinator node

      • Pro: lower latency, since it skips a potential forwarding step

  • Coordinator: node handling read / write operation

    • Typically the first among the top N nodes in the preference list

  • Consistency protocol: similar to those used in quorum systems

    • For put() request, the coordinator generates the vector clock for the new version and writes the new version locally

      • Then it sends the new version (along with the new vector clock) to the N highest-ranked reachable nodes

      • If at least W-1 nodes respond then the write is considered successful

    • For get() request, the coordinator requests all existing versions of data for that key from the N highest-ranked reachable nodes in the preference list for that key, and then waits for R responses before returning the result to the client

      • If there are multiple versions of the data, it returns all the versions it deems to be causally unrelated

      • The divergent versions are then reconciled and the reconciled version superseding the current versions is written back
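
  • A sketch of the quorum rule, assuming N=3, R=2, W=2 (so R + W > N); the replica objects and their store()/load() methods are hypothetical stand-ins for the real message exchange

    N, R, W = 3, 2, 2

    def coordinate_put(coordinator, others, key, value, clock):
        # The coordinator writes the new version (with its vector clock)
        # locally, then sends it to the N highest-ranked reachable nodes;
        # the write succeeds once at least W-1 of them acknowledge.
        coordinator.store(key, value, clock)
        acks = sum(1 for node in others[:N - 1] if node.store(key, value, clock))
        return acks >= W - 1

    def coordinate_get(coordinator, others, key):
        # The coordinator asks the N highest-ranked reachable nodes and
        # returns once R responses arrive; all causally unrelated versions
        # are handed back to the client for reconciliation.
        responses = [coordinator.load(key)]
        for node in others[:N - 1]:
            if len(responses) >= R:
                break
            version = node.load(key)
            if version is not None:
                responses.append(version)
        return responses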
