Retry

A retry mechanism is a way to handle transient failures in distributed software. When a request to a service fails, the mechanism automatically reissues the request until it either succeeds or a retry limit, such as a maximum number of attempts, is reached.

Retry mechanisms are useful in situations where you expect transient failures to occur. Transient failures occur sporadically and do not indicate a permanent problem with the system: for example, network timeouts, 503 Service Unavailable responses, or 429 Too Many Requests responses. Permanent failures, such as 400 Bad Request or 404 Not Found, will fail in the same way when the identical request is repeated, so they should not be retried. Retry mechanisms should be used in conjunction with other error handling mechanisms, such as logging and monitoring, so that failures hidden by successful retries remain visible.
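The distinction between transient and permanent failures can be expressed as a small classification step that runs before any retry is attempted. The following is a minimal sketch in Python, not a complete policy: the function name is_transient and the set TRANSIENT_STATUS_CODES are hypothetical, and the set contains only the status codes named above.

```python
from typing import Optional

# Assumption: only the transient HTTP statuses named in the text are listed.
# A real policy would likely be tuned to the service being called.
TRANSIENT_STATUS_CODES = {429, 503}


def is_transient(status_code: Optional[int] = None,
                 error: Optional[BaseException] = None) -> bool:
    """Return True if the failure looks transient and is worth retrying."""
    # A timeout gives no answer at all, so it is treated as transient.
    if isinstance(error, TimeoutError):
        return True
    # Anything not explicitly listed (400, 404, ...) is treated as permanent.
    return status_code in TRANSIENT_STATUS_CODES


print(is_transient(status_code=503))         # transient: retry
print(is_transient(status_code=404))         # permanent: do not retry
print(is_transient(error=TimeoutError()))    # transient: retry
```

Treating unknown failures as permanent is a conservative choice made here for illustration: it avoids retrying requests that cannot succeed, at the cost of giving up on some failures that might have been recoverable.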

There are two main ways to implement a retry mechanism:

  • Simple retry: The operation is retried a fixed number of times, with a fixed delay (which may be zero) between each retry.

  • Exponential backoff: With exponential backoff, the system will retry the operation with increasing delays between each retry. The aim is to avoid overwhelming the target service with retries, and to give the target service more time to recover from its failure mode. Adding random jitter (a small random offset) to each delay staggers retries from multiple clients, preventing the "thundering herd" problem where all clients retry simultaneously and re-overwhelm a recovering service.
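Both approaches can be sketched with a single retry loop that differs only in how the delay is computed. The Python example below is illustrative rather than definitive: TransientError, backoff_delay, fixed_delay, and retry are hypothetical names, and the default values for base, cap, and jitter are arbitrary assumptions. Jitter is implemented here as a small random offset added to each exponential delay; other jitter schemes exist.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def fixed_delay(attempt: int, delay: float = 0.5) -> float:
    """Simple retry: the same delay before every retry."""
    return delay


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                  jitter: float = 0.05) -> float:
    """Exponential backoff: base * 2^attempt, capped, plus a random offset."""
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter)


def retry(operation: Callable[[], T],
          max_attempts: int = 5,
          delay_for: Callable[[int], float] = backoff_delay,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            # Out of attempts: let the caller see the last failure.
            if attempt == max_attempts - 1:
                raise
            sleep(delay_for(attempt))
    raise ValueError("max_attempts must be at least 1")
```

A caller would wrap the request in a function that raises TransientError for retryable failures, then call retry(send_request) for exponential backoff or retry(send_request, delay_for=fixed_delay) for simple retry. Errors of any other type propagate immediately, which keeps permanent failures from being retried.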

A critical consideration for retry mechanisms is that in distributed systems, retries mean the same request may be received and processed more than once — known as at-least-once delivery. A retry may fire after a timeout without knowing whether the original request already succeeded, potentially double-charging a customer, sending duplicate notifications, or creating duplicate records. Only idempotent operations are naturally safe to retry. For non-idempotent operations, an idempotency key pattern can be used: the client attaches a unique key to each logical request, and the server uses it to detect and deduplicate retried requests, returning the cached result of the original rather than reprocessing it.
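The idempotency key pattern can be illustrated with a small in-memory example. This is a sketch under simplifying assumptions: PaymentServer and its charge method are hypothetical, results are cached in a plain dictionary with no expiry, and there is no protection against two requests with the same key arriving concurrently, which a real server would need.

```python
import uuid
from typing import Dict


class PaymentServer:
    """Hypothetical server that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}  # idempotency key -> cached result
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A key seen before means this is a retry: return the original
        # result instead of charging the customer again.
        cached = self._results.get(idempotency_key)
        if cached is not None:
            return cached
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


server = PaymentServer()

# The client generates one key per logical request and reuses it on retry.
key = str(uuid.uuid4())
first = server.charge(key, 1999)
retried = server.charge(key, 1999)  # e.g. sent again after a timeout

print(first == retried)       # the retry received the cached result
print(server.charges_made)    # the customer was charged only once
```

The key identifies the logical request, not the attempt, so the client must generate it once before the first attempt and send the same value with every retry.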

Retry mechanisms are often compared with circuit breakers, a related fault tolerance pattern. A circuit breaker monitors requests to a service and, if the number of failures within a configured time period exceeds a configured threshold, the circuit breaker "trips" (opens), preventing any further requests from being sent to the failing service for a period of time. This gives the failing service time to recover from its failure mode, without being overwhelmed by retries in the meantime. Retry mechanisms and circuit breakers both improve fault tolerance, and they are complementary rather than mutually exclusive: retries handle brief, isolated failures, while a circuit breaker stops retries from piling onto a service that is failing persistently.
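The circuit breaker behavior described above can be sketched as follows. This Python example is a deliberately simplified illustration: the class and parameter names are hypothetical, every exception is counted as a failure, and the breaker simply starts accepting requests again once the cooldown has elapsed. Many implementations instead add a "half-open" state that lets a single trial request through first; that refinement is omitted here.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, window_seconds: float = 10.0,
                 cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._window = window_seconds
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures: Deque[float] = deque()  # timestamps of recent failures
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        now = self._clock()
        if self._opened_at is not None:
            if now - self._opened_at < self._cooldown:
                # Open: reject without sending anything to the service.
                raise CircuitOpenError("circuit is open")
            # Cooldown elapsed: close the breaker and start counting afresh.
            self._opened_at = None
            self._failures.clear()
        try:
            return operation()
        except Exception:
            self._record_failure(now)
            raise

    def _record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Drop failures that fall outside the monitoring window.
        while self._failures and now - self._failures[0] > self._window:
            self._failures.popleft()
        # Trip once the failure count exceeds the threshold.
        if len(self._failures) > self._threshold:
            self._opened_at = now
```

To combine the two patterns, a retry loop would typically call the service through breaker.call(...) and treat CircuitOpenError as a signal to stop retrying, so that retries do not continue against a service the breaker has already judged to be failing.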