Consistency

Consistency is a key challenge in distributed software design. It is the property that all nodes in a distributed software system agree on the state of the overall system.

As soon as you [replicate] data across multiple nodes – which you do to improve resilience and availability – you introduce the possibility that the data on different nodes will be out of sync. You have to deal with the problem of how to maintain consistency.

A key idea is eventual consistency. The aim is for data across the whole system to eventually be consistent, but it is not guaranteed to be consistent at any particular point in time.

For most systems, eventual consistency is good enough. The time that replicated data may be out of sync is often measured in milliseconds – the time is takes for messages to be relayed across a local network. But it is much harder when you have to deal with things like financial transactions, where consistency is critical for correct business operations. You have to factor latency (the time for network calls to be made) into your design, and think about potential issues like race conditions.

There are several possible strategies to achieving eventual consistency, but an underlying principle is for all components of a system to agree on what is the source of truth for each type of data. The source of truth is the authoritative source for the data, and all other copies are eventually consistent with this source. This design constraint eliminates the need to resolve potentially conflicts between different copies of the same data, something that is incredibly difficult to do, especially in distributed software where we can’t rely on timestamps to represent the actual order of mutations.

Supporting techniques include chaos engineering (testing how a system behaves under failure conditions) and using idempotent operations (allowing requests to be repeated multiple times without changing the resulting state).

Conflict resolution

In order to ensure replica convergence, a system must reconcile differences between multiple copies of distributed data. Two points to this:

Anti-entropy: exchanging versions or updates of data between services.
Reconciliation: choosing an appropriate final state when concurrent updates are made to the same data.

How to do reconciliation?

This requires business rules. Most common approach is the easiest to implement: the last writer wins.

Reconciliation can be scheduled at different intervals:

Read repair: the correction is done when a read finds an inconsistency. Easy-ish to implement, but slows down the read operation.
Write repair: Correction done on write, when an inconsistency is spotted.
Asynchronous repair: Requires extra service to consolidate data, but does not interfere with read or write operations.