Single point of failure (SPOF)

Planning to mitigate single points of failure is a critical aspect of system design. A single point of failure (SPOF) is any part of a system that, if it fails, stops the entire system from functioning correctly.

An analogy for a single point of failure is a light switch that controls all the lights in a house. If the switch breaks, none of the lights will turn on. Similarly, a single point of failure in an IT system is something that can lead to system-wide outages or data loss.

Single points of failure can have the following impacts:

Reduced [reliability] and availability. Increased downtime.
Data loss.
Financial losses (eg. due to lost revenue).
Reputation damage.

Identifying and mitigating single points of failure increases the fault tolerance of a system.

Common single points of failure include:

Databases: A single database instance with no replicas.
Load balancers: One load balancer managing all traffic.
Application servers: No redundancy in app servers.
Network devices: A single router or firewall.
Storage systems: A single disk or storage device.

Common strategies to mitigate single points of failure include:

Redundancy: Adding backups and replicas for critical components, for example using primary-replica databases or multi-region clusters.
Geographic redundancy: Deploying systems across multiple regions to handle regional outages.
Load balancing: Distributing traffic across multiple servers using tools like Nginx or AWS ELB. This prevents bottlenecks by ensuring no single server handles all requests.
High availability (HA): Designing systems to be highly available using active-active or active-passive failover setups.
Fault tolerance: Using circuit breakers and [retry mechanisms] to prevent failures from cascading.
Disaster recovery: Backing up data, and having recovery plans for system-wide failures.
Monitoring and alerting: Using tools like Prometheus, Datadog, or New Relic to identify failures before they escalate.
Chaos engineering/testing: Running experiments to test system resilience and identify single points of failure. For example, Netflix developed a tool called Chaos Monkey which would randomly turn off production instances to test system resilience!