Context: I am a guy who is childishly obsessed with exploring Distributed Systems. Recently I ended up exploring Amazon's "Dynamo", so here's a quick view. For a detailed read, you can refer to the Amazon whitepaper "Dynamo: Amazon's Highly Available Key-value Store".

Dynamo Design

Queries:

Dynamo targets applications that need to store objects that are relatively small (< 1 MB).

ACID properties (Atomicity, Consistency, Isolation, Durability):

Systems that provide strong consistency cannot stay available during network partitions. Dynamo targets applications that can operate with weaker consistency in exchange for higher availability.

Efficiency:

Since mean, median & variance aren't good enough criteria for measuring a wide userbase, the SLAs are expressed at the 99.9th percentile of the distribution.

Incremental Scalability:

The system should be able to scale out one storage host (node) at a time, with minimal impact on both the operators of the system and the system itself.

Symmetry:

Every node in Dynamo should have the same set of responsibilities as its peers.

Decentralisation:

The design should favour decentralised peer-to-peer techniques over centralised control.

Heterogeneity:

The work distribution must be proportional to the capabilities of the individual servers.

Merge Conflicts:

Traditional systems used synchronous replica coordination, but the problem with that was that data stayed unavailable until it was certain to be correct. Dynamo is an eventually consistent data store, i.e. all updates reach all replicas eventually, asynchronously. The price of asynchronous replication is merge conflicts; here's how they are solved.

When? Dynamo targets an 'always writeable' datastore (i.e. whatever may happen, the user should always be able to update their shopping cart), so conflict resolution is pushed to reads rather than writes.

Who? The data store or the application. At the data-store level only simple policies like "last write wins" are possible, whereas at the application level, since the application knows its data schema, it can pick a conflict-resolution method that fits the data.
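Here's a minimal Python sketch contrasting the two approaches (the shopping-cart layout and function names are mine, purely illustrative):

```python
import time

def last_write_wins(versions):
    """Data-store-level resolution: keep the version with the latest timestamp."""
    return max(versions, key=lambda v: v["timestamp"])

def merge_carts(versions):
    """Application-level resolution: the app knows its schema, so it can merge
    shopping carts by taking the union of their items (nothing gets dropped)."""
    items = set()
    for v in versions:
        items |= set(v["items"])
    return {"items": items, "timestamp": time.time()}

v1 = {"items": {"book"}, "timestamp": 1}
v2 = {"items": {"pen"}, "timestamp": 2}
print(last_write_wins([v1, v2])["items"])  # {'pen'} -- the 'book' update is lost
print(merge_carts([v1, v2])["items"])      # {'book', 'pen'} -- no update lost
```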

Dynamo Partition Algorithm (Consistent Hashing) :

Dynamo's partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts. In consistent hashing, the output range of a hash function is treated as a fixed circular space or "ring" (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its "position" on the ring. Each data item identified by a key is assigned to a node by hashing the data item's key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item's position. Thus, each node becomes responsible for the region of the ring between it and its predecessor node. The principal advantage of consistent hashing is that the departure or arrival of a node only affects its immediate neighbours; other nodes remain unaffected.

To prevent logical partitions, some Dynamo nodes play the role of seeds. Seeds are nodes that are discovered via an external mechanism and are known to all nodes. Because all nodes eventually reconcile their membership with a seed, logical partitions are highly unlikely.

But this algorithm has a few problems: the random position assignment of each node on the ring leads to non-uniform data and load distribution, and it is oblivious to the heterogeneity in the performance of nodes. To solve this, instead of mapping a node to a single point in the circle, each node gets assigned to multiple points in the ring. To this end, Dynamo uses the concept of "virtual nodes". A virtual node looks like a single node in the system, but each physical node can be responsible for more than one virtual node. Effectively, when a new node is added to the system, it is assigned multiple positions (its virtual nodes) in the ring.
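Here's a toy Python sketch of such a ring with virtual nodes (the vnode count and naming scheme are made up for illustration; Dynamo's real implementation differs):

```python
import bisect
import hashlib

def ring_position(key: str) -> int:
    """Hash to a 128-bit position on the ring (Dynamo uses MD5 for this)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node is placed on the ring at several virtual positions.
        self.ring = sorted(
            (ring_position(f"{node}#vn{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )

    def node_for(self, key: str) -> str:
        """Walk clockwise to the first virtual node at or past the key's position."""
        idx = bisect.bisect_right(self.ring, (ring_position(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-A", "node-B", "node-C"])
print(ring.node_for("cart:alice"))  # the physical node that owns this key
```

Adding a node now just drops a handful of new virtual positions onto the ring, and a beefier machine can simply be given more virtual nodes, which addresses the heterogeneity concern.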

Dynamo Uniform Load Distribution:

Popular keys can be spread across the nodes uniformly through partitioning. A node is considered to be "in balance" if the node's request load deviates from the average load by less than a certain threshold. The imbalance ratio is higher during low loads: under high loads, a large number of popular keys are accessed and, thanks to the uniform distribution of keys, the load spreads evenly; during low loads, fewer popular keys are accessed, resulting in a higher load imbalance.

Here’s the evolution of Dynamo’s partitioning scheme:

1> T random tokens per node and partition by token value: The fundamental issue with this strategy is that the schemes for data partitioning and data placement are intertwined. It is not possible to add nodes without affecting data partitioning.

2> T random tokens per node and equal-sized partitions: The hash space is divided into Q equally sized partitions, and each node is still assigned T random tokens; a partition is placed on the first N unique nodes encountered while walking the ring clockwise from the end of that partition. This decouples partitioning from placement.

3> Q/S tokens per node, equal-sized partitions: S being the number of nodes in the system. When a node leaves, its tokens are distributed among the remaining nodes; when a node joins, it "steals" tokens from the others. Of the three, this strategy achieved the best load-balancing efficiency.
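A tiny sketch of strategy 3's bookkeeping (Q, S and the round-robin placement here are invented just for illustration):

```python
# Divide the hash space into Q equal-sized partitions and give each of the
# S nodes Q/S of them; partitioning is now fully decoupled from placement.
Q = 12
nodes = ["A", "B", "C"]          # S = 3, so each node owns Q/S = 4 partitions
assignment = {p: nodes[p % len(nodes)] for p in range(Q)}
print(assignment)                # partition index -> owning node
```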

Dynamo Data Versioning (Vector Clocks):

Dynamo uses vector clocks in order to capture causality between different versions of the same object. A vector clock is effectively a list of (node, counter) pairs. One vector clock is associated with every version of every object. One can determine whether two versions of an object are on parallel branches or have a causal ordering by examining their vector clocks. If the counters on the first object's clock are less than or equal to all of the corresponding counters in the second clock, then the first is an ancestor of the second and can be forgotten. Otherwise, the two changes are considered to be in conflict and require reconciliation.
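The comparison rule fits in a few lines of Python; a minimal sketch, with clocks modelled as plain dicts of node → counter:

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` descends from `b`: every counter in `b` is <= its
    counterpart in `a` (missing entries count as 0)."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def compare(a: dict, b: dict) -> str:
    if descends(b, a):
        return "a is an ancestor of b -- a can be forgotten"
    if descends(a, b):
        return "b is an ancestor of a -- b can be forgotten"
    return "concurrent -- conflict, needs reconciliation"

d1 = {"Sx": 1}              # written through node Sx
d2 = {"Sx": 1, "Sy": 1}     # later update through Sy, building on d1
d3 = {"Sx": 1, "Sz": 1}     # parallel update through Sz, also building on d1
print(compare(d1, d2))  # a is an ancestor of b -- a can be forgotten
print(compare(d2, d3))  # concurrent -- conflict, needs reconciliation
```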

Dynamo Operations: get() and put()

The get(key) operation locates the object replicas associated with the key in the storage system and returns a single object or a list of objects with conflicting versions along with a context.

The put(key, context, object) operation determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk. The context encodes system metadata about the object and includes information such as the version of the object. The context information is stored along with the object so that the system can verify the validity of the context object supplied in the put request.

Dynamo applies an MD5 hash to the key to generate a 128-bit identifier, which is used to determine the storage nodes that are responsible for serving the key.

There are two strategies that a client can use to select a node: (1) route its request through a generic load balancer that will select a node based on load information, or (2) use a partition-aware client library that routes requests directly to the appropriate coordinator nodes. The advantage of the first approach is that the client does not have to link any code specific to Dynamo in its application, whereas the second strategy can achieve lower latency because it skips a potential forwarding step.

A node handling a read or write operation is known as the coordinator. Typically, this is the first among the top N nodes in the preference list. If the requests are received through a load balancer, requests to access a key may be routed to any random node in the ring. In this scenario, the node that receives the request will not coordinate it if the node is not in the top N of the requested key’s preference list. Instead, that node will forward the request to the first among the top N nodes in the preference list.

Dynamo uses a consistency protocol with two key configurable values: R and W. R is the minimum number of nodes that must participate in a successful read operation. W is the minimum number of nodes that must participate in a successful write operation. Setting R and W such that R + W > N yields a quorum-like system. In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency.

On receiving a put() request for a key, the coordinator generates the vector clock for the new version and writes the new version locally. The coordinator then sends the new version (along with the new vector clock) to the N highest-ranked reachable nodes. If at least W-1 of them respond, the write is considered successful.

For a get() request, the coordinator requests all existing versions of data for that key from the N highest-ranked reachable nodes in the preference list for that key, and then waits for R responses before returning the result to the client. If the coordinator ends up gathering multiple versions of the data, it returns all the versions it deems to be causally unrelated. The divergent versions are then reconciled and the reconciled version superseding the current versions is written back.
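Putting the R/W mechanics together, here's a stripped-down sketch of the coordinator's quorum logic (the in-memory Replica class and the synchronous calls are my simplifications; real Dynamo does this over the network, asynchronously):

```python
class Replica:
    """Toy in-memory replica standing in for a remote storage host."""
    def __init__(self):
        self.data = {}
    def store(self, key, value, clock):
        self.data[key] = (value, clock)
        return True                       # ack the write
    def read(self, key):
        return self.data.get(key)

N, R, W = 3, 2, 2                         # R + W > N: read and write sets overlap

def put(replicas, key, value, clock):
    """Coordinator writes locally, then needs W-1 more acks to succeed."""
    coordinator, others = replicas[0], replicas[1:N]
    coordinator.store(key, value, clock)
    acks = sum(1 for r in others if r.store(key, value, clock))
    return acks >= W - 1

def get(replicas, key):
    """Coordinator gathers R responses; causally unrelated versions would all
    be handed back to the client for reconciliation."""
    versions = [v for v in (r.read(key) for r in replicas[:N]) if v is not None]
    return versions[:R]

replicas = [Replica() for _ in range(N)]
print(put(replicas, "cart:alice", {"book"}, {"Sx": 1}))  # True: write succeeded
print(get(replicas, "cart:alice"))                       # the stored versions
```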

Dynamo Failure Recovery (Merkle Trees):

Traditional systems used a strict quorum approach, resulting in unavailability during server failures and network partitions. Dynamo does not enforce strict quorum membership; instead it uses a "sloppy quorum": all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring.
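The node-selection rule behind the sloppy quorum is worth spelling out; a hypothetical helper (names are mine):

```python
def first_n_healthy(preference_list, healthy, n):
    """Take the first N *healthy* nodes from the preference list, skipping
    failed ones -- so the write set can extend past the top N ring positions."""
    return [node for node in preference_list if node in healthy][:n]

preference = ["A", "B", "C", "D", "E"]
print(first_n_healthy(preference, healthy={"A", "C", "D", "E"}, n=3))
# ['A', 'C', 'D'] -- D stands in for the failed B, holding a "hinted" replica
# that it hands back to B once B recovers
```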

To detect inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle trees. A Merkle tree is a hash tree where leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children. The principal advantage of a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree or the entire data set. If the hash values of the roots of two trees are equal, then the values of the leaf nodes in the trees are equal and the nodes require no synchronization. Each node maintains a separate Merkle tree for each key range (the set of keys covered by a virtual node) it hosts.
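A compact Python sketch of building and diffing such trees (MD5 and the level-by-level layout are my simplifications):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.md5(data).digest()

def build_merkle(values):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    level = [h(v) for v in values]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                # duplicate the last hash if odd
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

t1 = build_merkle([b"v1", b"v2", b"v3", b"v4"])
t2 = build_merkle([b"v1", b"v2", b"vX", b"v4"])  # one key diverged
print(t1[-1] == t2[-1])                          # False: roots differ, sync needed
# Comparing level by level narrows the difference down to a single leaf,
# without ever shipping the full data set across the network:
print([i for i, (x, y) in enumerate(zip(t1[0], t2[0])) if x != y])  # [2]
```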

Dynamo Membership and Failure Detection (Gossip Based protocol):

When a node is added to or removed from the system, the node that serves the request writes the membership change and its time of issue to persistent store. The membership changes form a history because nodes can be removed and added back multiple times. A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership: each node contacts a peer chosen at random every second, and the two nodes efficiently reconcile their persisted membership change histories.

When a node starts for the first time, it chooses its set of tokens (virtual nodes in the consistent hash space) and maps nodes to their respective token sets. The mappings stored at different Dynamo nodes are reconciled during the same communication exchange, so partitioning and placement information also propagates via the gossip-based protocol and each storage node is aware of the token ranges handled by its peers. This allows each node to forward a key's read/write operations to the right set of nodes directly. Temporary node failures are detected by individual nodes when they fail to communicate with others (while forwarding requests).
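A toy gossip round might look like this in Python (the `{node: (status, timestamp)}` view and the merge rule are illustrative stand-ins for the persisted membership histories):

```python
import random

def reconcile(view_a, view_b):
    """Merge two membership views, keeping the newest entry per node."""
    merged = dict(view_a)
    for node, (status, ts) in view_b.items():
        if node not in merged or merged[node][1] < ts:
            merged[node] = (status, ts)
    return merged

def gossip_round(views):
    """Each node contacts one random peer; both adopt the reconciled view."""
    for node in list(views):
        peer = random.choice([n for n in views if n != node])
        merged = reconcile(views[node], views[peer])
        views[node] = dict(merged)
        views[peer] = dict(merged)

views = {
    "A": {"A": ("up", 1), "B": ("up", 1)},
    "B": {"A": ("up", 1), "B": ("up", 1), "C": ("up", 2)},  # B saw C join
    "C": {"C": ("up", 2)},
}
gossip_round(views)
print(all(v == views["A"] for v in views.values()))  # True: views have converged
```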

Dynamo Balancing Background and Foreground Tasks:

Background tasks (such as replica synchronization and data handoff) are integrated with an admission control mechanism. The admission controller constantly monitors the behaviour of resource accesses while executing "foreground" put/get operations and decides how many time slices will be made available to background tasks.
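The feedback loop can be sketched in a few lines (the percentile target and slice budget are invented numbers, just to show the shape of the idea):

```python
def background_slices(fg_latencies_ms, target_p99_ms=10.0, max_slices=10):
    """Grant fewer time slices to background tasks as foreground (put/get)
    latency approaches its SLA target."""
    p99 = sorted(fg_latencies_ms)[int(0.99 * len(fg_latencies_ms))]
    headroom = max(0.0, 1.0 - p99 / target_p99_ms)   # 0 when at/over target
    return int(max_slices * headroom)

print(background_slices([1.0] * 99 + [3.0]))    # 7: plenty of headroom
print(background_slices([9.0] * 99 + [12.0]))   # 0: foreground is near its SLA
```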

Conclusion

Dynamo can be characterized as a zero-hop Distributed Hash Table, where each node maintains enough routing information locally to route a request to the appropriate node directly. No system is perfect; Dynamo has its own caveats:

1> Dynamo does not focus on the problem of data integrity and security and is built for a trusted environment.

2> Google's Bigtable is a distributed storage system for managing structured data. In contrast, Dynamo targets applications that require only key/value access.

But still I’m excited to see how this thing evolves further.

See you in the next post (╯°□°)╯︵ ┻━┻