Distributed system
A distributed system is a general term for any network of computers working as one coherent system. Distributed systems include [cloud computing] services and [peer-to-peer networks].
A computer program that is partitioned across multiple nodes in a distributed system is called a distributed program or distributed software. Components of a distributed program are located on multiple networked computers, and the components coordinate their actions by passing messages between themselves over the network.
Examples of distributed software include applications like Hadoop (for [big data] processing), Kubernetes (for [container orchestration]), and Cassandra (a [distributed database]). Most [web applications] are also distributed programs. At the very least, web applications partition their processing between a [client and server]. The server-side application may be further partitioned into a [service-oriented architecture] such as [microservices].
Distributed programming is the process of developing and maintaining distributed programs. Distributed computing is the field of [computer science] that studies distributed systems.
Every distributed system shares three fundamental properties:
-
Its components operate concurrently.
-
Its components fail independently.
-
Its components do not share a global clock.
These properties distinguish distributed systems from single-node systems, and they are collectively the root cause of most of the challenges described in this document.
Designing distributed systems
Most large-scale systems are naturally distributed. That’s because monoliths, as they grow in scope and load, become ever harder to maintain, develop, operate, and [scale]. So the natural tendency is toward breaking down monoliths into lots of smaller and simpler components that can each be developed, deployed, operated, scaled, and optimized independently – a process known as [decomposition].
Distribution is not always a deliberate architectural choice, however. When an application must communicate with external third-party services – such as payment processors, fraud detection APIs, or identity providers – those external machines are outside the developer’s control. Yet the application is still subject to all the hazards of distributed communication: network failures, [latency], and data consistency issues across system boundaries. Whether or not you manage those extra machines yourself, you are responsible for planning around the failures that can occur in interactions with them.
But distribution of processing and data has significant [trade-offs] in [system design].
Although individual units within a distributed system may be simpler and more [resilient] due to increased [isolation], overall a distributed system will always have more [points of failure] and a higher degree of [accidental complexity] than single-node systems with comparable functionality. Distributed software is inherently [brittle] because you’re introducing internal network calls, which are subject to network failures and greater operational constraints like [latency] and [bandwidth].
Reasoning about concurrency, managing data consistency, and designing monitoring solutions, are other concerns that are particularly challenging in distributed software design.
Distributed programming is hard. The main challenges in designing distributed software are:
-
Fault tolerance: [Fault tolerance] is a key quality attribute of distributed systems, and it needs to be treated as a first-class concern in the system design. Distributed software is subject to a wide range of potential failure modes, including network failures, delayed messages, and node crashes. The principle of [designing for failure] means designing each service in a way that it can tolerate failures in dependent services. For example, if an email service goes offline, emails should be queued for delivery and sent later when the email service comes back online. [Asynchronous communication patterns] improve the fault tolerance of distributed systems by reducing temporal coupling (where the availability of one service is dependent on the immediate availability of another). Design patterns such as [circuit breakers] and [bulkheads] also help to improve the fault tolerance of distributed systems by isolating failures, so preventing failures in one component from cascading into other components. [Disaster recovery] planning is also important. It is the process of designing services to recover quickly, in an automated way, from failures.
-
Latency and bandwidth: [Latency] and [bandwidth] are two key performance metrics in distributed systems. Latency is the time it takes for a message to travel from one node to another, while bandwidth is the amount of data that can be transmitted over a network in a given time period. In distributed systems, latency and bandwidth are affected by network congestion, distance between nodes, and the number of hops a message must take to reach its destination. These are not considerations in single-node systems, where all components run in the same process.
-
Concurrency and consistency: Because distributed systems generally use asynchronous communication between services, managing [concurrency] and ensuring eventual consistency become key design concerns. CAP theorem is relevant in distributed systems that [replicate] data between nodes.
-
Observability: [Monitoring], [logging], and [debugging] all become much more complex in distributed systems. [Observability] and especially [distributed tracing] tools are specifically designed to address these challenges.
-
[Testing], too, is much more challenging in distributed systems. It is harder, for example, to do proper integration tests to validate a whole system, compared to monolithic systems where all the components run in the same process. There tends to be increased reliance upon service mocks, which reduces confidence in correctness of the tests.
-
Operational complexity: Large distributed systems cannot be fully comprehended by any individual. Teams own only a fraction of the overall architecture, and different components are deployed independently by different teams on their own schedules. It is often impossible to run the full system locally, which increases reliance on staging and production environments for integration testing and debugging. This requires a level of abstract thinking that is not needed in monolithic development, and it makes it harder to reason about systemic failures.
-
Infrastructure cost: In single-process applications, the cost of new functionality is measured primarily in lines of code. In distributed systems, every component added to the architecture has a direct and ongoing monetary cost. Over-engineering in a single service leaves behind unused code; over-engineering a distributed system leaves behind running infrastructure that must be paid for. The cost of this waste can be significant in large systems. Financial constraints are therefore a practical design concern alongside the technical ones.
For these reasons, distributed computing should be avoided until it is required to solve specific problems – usually something to do with scalability. Better to start with a modular monolith, with scalability planned into the design from the start, and extract independent services incrementally.
The eight fallacies of distributed software
Distributed software design requires a particular set of [design principles] and [systems thinking].
The eight fallacies of distributed computing are a set of assertions that describe false assumptions that are often made by computer programmers who are new to distributed computing. These fallacies were observed by Peter Deutsch et al at Sun Microsystems, and published online in 1994. The list extended four fallacies previously identified by Bill Joy and Dave Lyon, while James Gosling – another Sun Fellow and the inventor of [Java] – added the eight fallacy in 1997.
The eight fallacies are:
-
The network is reliable: The effect of making this assumption is that distributed programs are written with little [error handling] for network errors. Network errors can occur for a variety of reasons, including power failure, hardware failures, network congestion (including from [denial-of-service] attacks), and software errors in the downstream services. A network error is any category of error that prevents a component from communicating with another component running on another node. If applications are not designed to handle network errors, they may not be able to recover from those errors, with the result being the application "hanging" or crashing, losing data, or entering an inconsistent or incorrect state. Indirect [asynchronous] [inter-process communication], hardware and software [redundancy], and logic such as [timeouts] and [retries], can mitigate the effects of network errors.
-
Latency is zero: Latency is typically pretty good in LANs, but it deteriorates quickly when you move to WANs or public networks like the internet. We typically develop and test our distributed applications in local networks, and some dependencies may be mocked in those environments. Production infrastructure is often more geographically dispersed, and therefore has higher latency. It is easy, therefore, for developers to make the assumption that network calls are instantaneous (zero latency). Modern programming constructs that abstract network calls behind in-memory objects also contribute to this false assumption. If the design of a distributed program allows for unbounded network traffic, the application is at risk of performing badly in production environments.
-
Bandwidth is infinite: Ignoring bandwidth limitations can lead to network congestion ([bottlenecks]). Parkinson’s Law states that "work expands so as to fill the time available for its completion". A similar pattern is seen with consumption of available bandwidth in computer networks. Network bandwidth has improved thousands-of-fold over the decades, but the effect has been that we have used up much of that bandwidth with larger payloads. Rich web application interfaces, audio and video streaming, and VoIP are examples of modern applications that consume large amounts of bandwidth. As with network latency, in production environments there may be constraints on bandwidth that are beyond our control. Therefore, we should use bandwidth wisely, designing our payloads to maximize network transfer efficiency, and use [compression] and [caching] to reduce the amount of data that needs to be transferred over the network.
-
The network is secure: Assuming that internal networks are secure can open up vulnerabilities. A related assumption is that the third-parties you communicate with are trustworthy. In many distributed system designs, a "moat and castle" approach is taken to network security: once the perimeter security is breached, the internal networks are left vulnerable. Best practice is to enable end-to-end encryption throughout the internal network, and for all services to distrust input from other services (including other services in the same internal network).
-
Topology doesn’t change: Changes in network topology can have effects on both bandwidth and latency issues. When you deploy an application in the wild, the network topology is usually outside of your control. The operations team may add and remove servers, and make other changes to the network configuration, over time. And server and network faults can cause routing changes. Distributed applications should be designed to be resilient to changes in network topology. This means not depending on services being available in specific locations (IP addresses), instead dynamically discovering routes to target endpoints. [Service discovery] is a common pattern here. In addition, DNS names should be used instead of IP addresses, with a central DNS server managing routing tables (the mapping of DNS names to IP addresses).
-
There is only one administrator: A common assumption in distributed software design is that there will always be one single administrator who is responsible for the operation and maintenance of the entire system. In reality, different people are more likely to be responsible for different parts. This is particularly true where components of a distribute program are operated by third parties external to the main organization. Without a single authority, there is a risk of conflicting policies and procedures being implemented. For example, upgrades to database management systems may require corresponding changes to the configuration of object-relational mapping code. And system administrators may change disk quotas or set more limited privileges for users, which may inadvertently impact some operations of the application logic. Multiple strategies are needed to mitigate all these risks. [Monitoring] and [alerting] are essential. [Staging] environments can be used to test changes before they are deployed to production. And [canary releases] can be used to test new versions of software in production, with a small subset of users, before rolling out the changes to all users. All these things help, but there is no silver bullet here.
-
Transport cost is zero: The "hidden" costs of building and maintaining a network or subnet are not trivial. There are costs associated with buying and configuring routers, securing the network, leasing bandwidth for internet connections, and operating and maintaining the network. Even moving transport from the application level to the transport level is not free, as there are costs associated with [marshaling] (serializing information into bits, to get data onto the wire), which consumes compute resources and adds latency. Transport costs should be factored into system designs for any high-volume distributed system.
-
The network is homogeneous: This fallacy encompasses the first three fallacies. A common assumption made by programmers new to distributed programming is that every part of a distributed system will have consistent performance. But components within distributed systems do not behave uniformly, with identical performance, because networks vary in latency, bandwidth, and reliability. This variance can change over time, also. In addition, different nodes within the network may have different hardware specifications, and they may run different operating systems, leading to inconsistencies in processing and performance.
Related links
-
Fallacies of distributed computing explained, Arnon Rotem-Gal-Oz, 2008
-
Deutsch’s Fallacies, 10 Years After, Ingrid Van Den Hoogen, 2007 (archived)
-
Eventually consistent - revisited, Werner Vogels, 2008
-
MIT: Distributed Systems — Lecture series.