Load balancer

A load balancer is a network device or software application that distributes incoming network traffic across multiple backend servers or other resources. Load balancing is a technique for horizontal scaling, and is widely used in distributed software to improve capacity, availability, and fault tolerance.

Load balancers work particularly well in front of stateless services, where any instance can handle any request without needing to consider session state. Stateful components are harder: if a database also needs to be replicated to handle the load, keeping the replicas consistent is an additional challenge.

Load balancers also enable rolling deployments: instances can be added and removed one at a time while deploying changes, making it possible to do thousands of releases to hundreds of microservices without any downtime.

Network layers

Load balancers can operate at different layers of the OSI model:

Layer 4 (transport layer): Makes routing decisions based on network information only – source and destination IP addresses and TCP/UDP port numbers. The load balancer forwards packets without inspecting their content. This is faster and more efficient than Layer 7 balancing, but less flexible.
Layer 7 (application layer): Makes routing decisions based on the content of the request – HTTP methods, URLs, headers, cookies, or query strings. This allows for intelligent routing, such as sending requests for /api to one server group and /static to another.
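
The path-based routing described for Layer 7 can be sketched as a prefix lookup. This is a minimal illustration, not the behaviour of any particular product: the pool addresses, the DEFAULT_POOL fallback, and the select_pool function are all hypothetical names chosen for the example.

```python
# Hypothetical server groups; addresses are placeholders for illustration.
POOLS = {
    "/api": ["10.0.1.10:8080", "10.0.1.11:8080"],
    "/static": ["10.0.2.10:8080"],
}
DEFAULT_POOL = ["10.0.3.10:8080"]


def select_pool(path):
    """Return the server group whose path prefix matches the request path."""
    for prefix, pool in POOLS.items():
        # Match "/api" and "/api/..." but not "/apiary".
        if path == prefix or path.startswith(prefix + "/"):
            return pool
    return DEFAULT_POOL
```

A Layer 4 balancer could not make this decision, because the URL path is only visible after the request content has been parsed.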

Key features

Health checks

Load balancers continuously monitor the health of backend servers. If a server becomes unresponsive, slow, or returns errors, the load balancer removes it from rotation and stops sending it traffic. This is how load balancers provide automatic failover: when a server fails, new requests are routed to the remaining healthy instances, usually with little or no visible impact on users.
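
One way to picture the bookkeeping behind health checks is a tracker that counts consecutive probe results. The sketch below is an assumption-laden illustration: the HealthTracker class and its fail_threshold and rise_threshold parameters are hypothetical, and the probe itself (for example an HTTP request to a status endpoint) is left out so that only the rotation logic is shown.

```python
class HealthTracker:
    """Tracks probe results and decides which servers stay in rotation."""

    def __init__(self, servers, fail_threshold=3, rise_threshold=2):
        # Thresholds are illustrative defaults, not values from any product.
        self.servers = list(servers)
        self.fail_threshold = fail_threshold
        self.rise_threshold = rise_threshold
        self.healthy = {s: True for s in self.servers}
        self._fails = {s: 0 for s in self.servers}
        self._passes = {s: 0 for s in self.servers}

    def record(self, server, ok):
        """Record one probe result (ok=True means the probe succeeded)."""
        if ok:
            self._fails[server] = 0
            self._passes[server] += 1
            if not self.healthy[server] and self._passes[server] >= self.rise_threshold:
                self.healthy[server] = True   # put the server back in rotation
        else:
            self._passes[server] = 0
            self._fails[server] += 1
            if self.healthy[server] and self._fails[server] >= self.fail_threshold:
                self.healthy[server] = False  # remove the server from rotation

    def in_rotation(self):
        """Return the servers that should currently receive traffic."""
        return [s for s in self.servers if self.healthy[s]]
```

Requiring several consecutive failures before removal, and several successes before reinstatement, is a common way to avoid flapping on a single lost probe.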

Session persistence

Some applications require all requests from a particular user to be routed to the same backend server for the duration of a session – for example, applications that store session state in local memory rather than a shared store. Load balancers can achieve this using IP-based affinity (routing based on client IP address) or cookie-based tracking.
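
Cookie-based tracking can be sketched as follows. The cookie name lb_server and the route function are hypothetical, and the random first pick stands in for whatever balancing algorithm is actually configured.

```python
import random

COOKIE_NAME = "lb_server"  # hypothetical cookie name, for illustration only


def route(cookies, servers):
    """Return (server, cookies_to_set) for one request.

    cookies: dict of cookie name -> value sent by the client.
    servers: list of servers currently in rotation.
    """
    pinned = cookies.get(COOKIE_NAME)
    if pinned in servers:
        # The client already has a session on a server that is still available.
        return pinned, {}
    # First request, or the pinned server is gone: pick a server and pin it.
    server = random.choice(servers)
    return server, {COOKIE_NAME: server}
```

Note that when the pinned server disappears, the client is moved to another server and any state held only in the old server's memory is lost.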

SSL termination

Handling SSL/TLS encryption and decryption is computationally expensive. A load balancer can offload this by decrypting incoming HTTPS traffic and forwarding unencrypted requests to the backend servers, reducing the burden on application code.

Single point of failure

A load balancer itself introduces a potential single point of failure. To mitigate this, load balancers are typically deployed in redundant pairs or clusters – with standby instances ready to take over – often spread across multiple availability zones.

Load balancing algorithms

The choice of algorithm significantly affects performance, particularly when server capabilities and request costs vary.

Round robin

The simplest algorithm. Requests are sent to each server in turn, cycling through the list. This works well when all servers are equally powerful and all requests are equally expensive. However, because it ignores the current state of each server, round robin can send requests to servers that are already overloaded, leading to poor tail latency (high 95th/99th percentile response times) even when the median is acceptable.

Round robin is still the default HTTP load balancing algorithm for Nginx.
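
The cycling behaviour can be sketched in a few lines. The RoundRobin class is a hypothetical illustration and does not reflect how any specific load balancer implements it.

```python
import itertools


class RoundRobin:
    """Hands out servers in a fixed, repeating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        # Ignores server load entirely: the next server in the list is chosen.
        return next(self._cycle)
```

For example, with servers a, b, and c, seven successive calls to pick() return a, b, c, a, b, c, a.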

Weighted round robin

An extension of round robin where each server is assigned a weight proportional to its capacity. More powerful servers receive more requests. Weights can be configured statically (by a human, based on known server specifications) or computed dynamically using a proxy metric such as observed response latency – if one server serves requests three times faster than another, it is assigned three times the weight. Dynamic weighting adapts to changes in server performance over time without manual configuration.
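
A minimal sketch of both ideas follows. The WeightedRoundRobin class uses a "smooth" interleaving scheme, which is one of several possible ways to spread picks, and weights_from_latency shows the latency-to-weight conversion described above. Both names are hypothetical, and the approach is an assumption rather than the method of any particular product.

```python
class WeightedRoundRobin:
    """Picks servers in proportion to their weights, interleaving the picks."""

    def __init__(self, weights):
        self.weights = dict(weights)            # server -> weight
        self._current = {s: 0 for s in self.weights}

    def pick(self):
        total = sum(self.weights.values())
        # Every server earns credit equal to its weight on each round.
        for server, weight in self.weights.items():
            self._current[server] += weight
        # The server with the most credit is chosen and pays back the total.
        chosen = max(self._current, key=self._current.get)
        self._current[chosen] -= total
        return chosen


def weights_from_latency(latency_ms):
    """Derive weights inversely proportional to observed latency."""
    slowest = max(latency_ms.values())
    return {server: slowest / ms for server, ms in latency_ms.items()}
```

With weights of 3 and 1, the first server receives three of every four requests. A server observed at 100 ms next to one at 300 ms would be given three times the weight.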

Least connections

Rather than cycling through servers blindly, the load balancer tracks the number of active connections on each server and always sends the next request to the server with the fewest. Because the load balancer sits between clients and servers, it can see in real time how many requests each server is handling. This cuts through variance in both server power and request cost, and performs very well under overload – it only drops requests when there is no more queue space available. It is a strong default choice for most workloads.
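
The connection tracking can be sketched as a counter per server. The LeastConnections class and its acquire and release methods are hypothetical names; a real balancer would update the counts as connections open and close.

```python
class LeastConnections:
    """Sends each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def acquire(self):
        # Choose the least-loaded server (ties go to the earliest listed).
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the request finishes or the connection closes.
        self.active[server] -= 1
```

A slow server finishes requests less often, so its count stays high and it naturally receives fewer new requests.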

Least response time

A variant that routes each request to the server with the lowest current response time. Similar in intent to least connections, but uses latency rather than connection count as the metric.

IP hash

Hashes the client’s IP address to select a server, ensuring the same client is always directed to the same server. Useful for session persistence and stateful operations, though it can produce uneven distribution if client IP addresses are not uniformly distributed.
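
A minimal sketch of the idea, assuming a simple hash-then-modulo scheme. The pick_by_ip function is a hypothetical name. A cryptographic digest from hashlib is used here only because Python's built-in hash() for strings varies between processes, which would break the "same client, same server" property across restarts.

```python
import hashlib


def pick_by_ip(client_ip, servers):
    """Map a client IP address to a server deterministically."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a list index.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

One limitation of the modulo step is that adding or removing a server changes the mapping for many clients at once.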

Peak Exponentially Weighted Moving Average (PEWMA)

A more sophisticated algorithm that combines ideas from dynamic weighted round robin and least connections. For each server, it maintains a running weighted average of recent request latencies – where older measurements contribute exponentially less to the score – and multiplies this by the number of open connections. Lower scores are preferred. PEWMA can identify and stop sending requests to chronically slow servers entirely. It can achieve better latency than least connections at every percentile, but at the cost of additional complexity and tuning parameters.
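
The scoring can be sketched as below. Several details are assumptions made for illustration rather than a definitive implementation: the PeakEwma class and pick function are hypothetical names, the decay_seconds parameter is an arbitrary tuning value, the average jumps straight to any sample that exceeds it (one interpretation of "peak"), and the score multiplies by open connections plus one so that idle servers with different latencies can still be told apart.

```python
import math


class PeakEwma:
    """Per-server latency estimate that decays over time but reacts to spikes."""

    def __init__(self, decay_seconds=10.0):
        self.decay_seconds = decay_seconds  # assumed tuning parameter
        self.latency = 0.0                  # current latency estimate (seconds)
        self.last_update = 0.0              # timestamp of the last sample
        self.open_connections = 0

    def observe(self, latency, now):
        """Fold one measured request latency into the estimate."""
        elapsed = max(now - self.last_update, 0.0)
        self.last_update = now
        if latency > self.latency:
            # Peak sensitivity: a slow sample raises the estimate immediately.
            self.latency = latency
        else:
            # Older history counts exponentially less as time passes.
            keep = math.exp(-elapsed / self.decay_seconds)
            self.latency = self.latency * keep + latency * (1.0 - keep)

    def score(self):
        # Lower is better. The +1 is an assumption (see the note above).
        return self.latency * (self.open_connections + 1)


def pick(servers):
    """Return the name of the server with the lowest score."""
    return min(servers, key=lambda name: servers[name].score())
```

In this sketch a server with no samples yet has a score of zero, so new servers are tried first; whether that is desirable is a design choice.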