Fault tolerance (a.k.a. resilience)

Designing for failure is a key principle in IT system design, and especially in distributed software, because distributed programs are inherently unreliable. They are subject to a wide range of failures, including network partitions, delayed messages, and node crashes.

Fault tolerance is closely related to availability. The goal of fault tolerance is to ensure that a system can continue to operate in the presence of failures. The goal of availability is to ensure that a system can be reached and used whenever users need it. Fault tolerance helps to achieve availability.

A failure is any event that stops a service from working as expected. A failure is not necessarily a complete outage of a service. Failures include intermittent availability, higher-than-expected latency, and errors returned from dependencies. Handling partial failures is just as important as handling complete failures. For example, clients should not wait indefinitely for a response from a service.
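As a minimal sketch of a client that refuses to wait indefinitely, the following Python runs a call in a worker thread and gives up after a deadline. The names call_with_timeout, slow_dependency and fast_dependency are hypothetical and exist only for illustration.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Run func in a worker thread and stop waiting after timeout_seconds.

    Returns ("ok", result) or ("timeout", None). The worker thread is not
    killed on timeout; the caller simply stops waiting for it.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return ("ok", future.result(timeout=timeout_seconds))
    except concurrent.futures.TimeoutError:
        return ("timeout", None)
    finally:
        # Do not block here waiting for a slow call to finish.
        executor.shutdown(wait=False)


def slow_dependency():
    """Hypothetical dependency that responds far too slowly."""
    time.sleep(0.5)
    return "late response"


def fast_dependency():
    """Hypothetical dependency that responds immediately."""
    return "response"
```

Calling call_with_timeout(slow_dependency, 0.05) returns a timeout result after about 50 milliseconds instead of blocking for the full half second. Where a network client library accepts a timeout setting of its own, using that is usually simpler than wrapping the call in a thread.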

In the context of distributed software, the principle of designing for failure means designing each service so that it can tolerate failures in the services it depends on.

Bad stuff happens. And bad things can happen from seemingly innocuous events. The trick to creating reliable distributed programs is to assume they will fail and to design for that. The aim is for a failure in one part of a system to have minimal impact on the rest of the system, so that full outages are avoided. The job is not to stop anything from failing. The job is to stop failures from causing massive problems. A common analogy is the electrical grid, where a failure in one part of the system does not cause the whole grid to fail.

This is the opposite of trying to prevent all failures and make complex systems run perfectly all the time. Such an approach produces inherently brittle systems. Rather than designing for perfection, we should design distributed software for resilience.

The concept of graceful degradation provides a useful way to think about designing for failure. It refers to any design strategy that allows a system to continue operating, albeit with reduced functionality, when some components fail. The idea is for a system to degrade gracefully, rather than failing catastrophically, when something goes wrong. This is the opposite of a brittle system, which fails completely when one part fails.

An analogy for graceful degradation is an escalator that just becomes a regular set of stairs when it stops working. By comparison, a lift is a brittle system. When it stops working, it is not usable at all.
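Graceful degradation can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not a prescribed implementation: get_recommendations, fetch_live, the dict-based cache and the default list are all hypothetical names. The function prefers live data, falls back to the last good answer, and finally falls back to a static default.

```python
def get_recommendations(user_id, fetch_live, cache, default):
    """Return (source, items), degrading from live data to cache to default."""
    try:
        items = fetch_live(user_id)
        cache[user_id] = items  # remember the last good answer
        return ("live", items)
    except Exception:
        # The dependency failed; degrade instead of propagating the error.
        if user_id in cache:
            return ("cached", cache[user_id])
        return ("default", default)
```

The caller always gets something usable, and the returned source label lets it tell the user, or a monitoring system, that the answer may be stale.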

Architectural patterns and visibility strategies to mitigate failures include, but are not limited to:

Redundancy: Running multiple instances of a service, so that if one fails, another can take over. Failover and load balancing are two common strategies for making use of redundancy. Failover mechanisms automatically switch to a backup system when a primary system fails. Load balancers distribute requests across multiple instances of a service, and will redirect requests away from failing instances. A related pattern is the circuit breaker, which stops sending requests to a dependency that keeps failing, so that callers fail fast instead of waiting on a service that is unlikely to respond.
Replication and caching: These strategies involve storing copies of data in multiple locations, and closer to the services that need the data. These strategies trade off data consistency for improved resilience and reliability.
Fallback logic and default values: Providing a default value or cached data when a request fails.
Retries and timeouts: Automatically retrying requests that fail (in case the failure was transient), and automatically failing requests that take too long (i.e. using timeouts so that network failures do not block the main thread of execution). A related strategy is to limit the number of pending requests: impose an upper bound on the number of outstanding requests that a client can have with a particular service, and don't make additional requests once that limit is reached.
Reduced dependencies: Minimizing the number of dependencies between services, so that if one fails, others are not affected.
Reduced distance: Minimizing the physical distance between services, and making their connections as reliable as possible. A wide area network is more likely to fail than a local area network, and a local area network is still riskier than keeping services and their data co-located.
Monitoring: Keeping an eye on the health of the system, so that operations teams can respond to failures quickly.
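Two of the patterns above, retries and circuit breakers, can be sketched in Python as follows. This is a simplified, single-threaded illustration; the names retry, CircuitBreaker and CircuitOpenError, and the default thresholds and delays, are assumptions chosen for the example rather than any particular library's API.

```python
import time


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func until it succeeds, waiting longer after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # exponential backoff


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Fail fast once a dependency has failed too many times in a row.

    Not thread-safe; a real implementation would need locking.
    """

    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            # Reset period elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open or re-open the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The two combine naturally: retries absorb short, transient failures, while the breaker prevents a client from hammering a dependency that is clearly down. The sleep and clock parameters are injectable so that the behaviour can be tested without real waiting.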

A failure event can also be a software bug that slips into production and causes an outage or some other kind of incident. In fact, this is a common cause of failure in modern cloud-based, distributed systems, in which the resilience of the hardware and network is often very high. To mitigate software failures, deployment strategies such as canary releases and blue-green deployments with automated rollbacks may be used. Another strategy for mitigating software failures is to deploy a stand-in system, rather than a copy of the primary system, on the failover infrastructure, so that a bug in the primary system is not simply reproduced in its backup.
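The decision behind an automated rollback of a canary release can be as simple as comparing error rates. The following Python sketch assumes a hypothetical should_roll_back function and illustrative thresholds; real deployment tooling will typically weigh more signals than this.

```python
def should_roll_back(canary_errors, canary_requests, baseline_error_rate,
                     tolerance=0.01, min_requests=100):
    """Decide whether a canary release should be rolled back.

    Roll back when the canary's error rate exceeds the baseline's by more
    than `tolerance`. Too little traffic gives no verdict yet (returns False).
    """
    if canary_requests < min_requests:
        return False  # not enough data to judge
    canary_error_rate = canary_errors / canary_requests
    return canary_error_rate > baseline_error_rate + tolerance
```

For example, a canary with 50 errors in 1000 requests against a baseline error rate of 1% would be rolled back, because 5% exceeds the 2% limit implied by the default tolerance.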

Assuming failure also puts emphasis on designing for changeability. A key aspect of designing for resilience is being able to quickly change software, so we can fix issues when they do occur.

Testing strategies like chaos engineering can also support the building of highly resilient systems.