Grokking Microservices Design Patterns
The Retry Pattern: A Solution to Unreliable External Resources

The Retry Pattern addresses transient failures by attempting the operation again, giving the system a second (or third, etc.) chance to succeed. The core idea is simple: if an operation fails due to a temporary issue, wait a short period and try it again, on the assumption that the issue may have been resolved in the interim. Many distributed systems fail in partial ways – a subset of requests fail while others succeed – or suffer short-lived outages. In fact, “often, trying the same request again causes the request to succeed.” By re-issuing the request, a client can ride through brief disruptions. For example, if a service didn’t respond because of a momentary spike, a retry a few moments later might hit a less-busy instance and succeed. In effect, retries mask sporadic failures, increasing the apparent reliability as seen by the user.
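
The core loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function name `retry`, its parameters, and the fixed delay between attempts are assumptions made for the example, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time


def retry(operation, max_attempts=3, delay_seconds=0.5, sleep=time.sleep):
    """Call operation(); if it raises, wait briefly and try again.

    Hypothetical helper for illustration. It retries on any exception,
    which is simpler than what a real service would usually want.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last failure to the caller.
            if attempt == max_attempts:
                raise
            # Otherwise pause, assuming the issue may clear in the interim.
            sleep(delay_seconds)
```

A caller would wrap the unreliable call in a function or lambda, for example `retry(lambda: client.fetch(order_id))`, where `client.fetch` stands in for whatever remote call the service makes.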

How Retries Mitigate Failures

Retries work under the assumption that the failure was transient or non-deterministic. Network blips clear up, threads get freed, and services restart, so a subsequent attempt has a good chance of succeeding where the first failed. This can substantially reduce the error rate seen by higher-level services or end-users. Instead of immediately giving up when a downstream call fails, a microservice can retry and often complete the operation without needing any manual fix. The result is fewer errors returned to users and more robust inter-service communication. The Retry Pattern thus improves overall availability by leveraging the fact that many failures resolve themselves quickly.
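
To make the effect on error rate concrete, here is a small back-of-the-envelope calculation. It rests on a simplifying assumption that is not stated in the text: each attempt fails independently with the same probability. Real failures are often correlated (an overloaded service tends to stay overloaded for a while), so treat the result as an optimistic bound rather than a prediction. The function name is made up for the example.

```python
def failure_after_retries(per_attempt_failure_rate, attempts):
    """Probability that every one of `attempts` tries fails.

    Assumes attempts fail independently with the same probability,
    which is an idealization of real transient faults.
    """
    if not 0.0 <= per_attempt_failure_rate <= 1.0:
        raise ValueError("failure rate must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return per_attempt_failure_rate ** attempts
```

Under that assumption, a call that fails 10% of the time would fail on all of three attempts only about 0.1% of the time (`failure_after_retries(0.1, 3)`).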

Trade-offs and Other Strategies

While retrying is powerful, it’s not a universal solution and comes with trade-offs. One alternative strategy is the fail-fast approach, often managed via a Circuit Breaker (discussed later). Where the Retry Pattern keeps trying in hopes of eventual success, a Circuit Breaker stops trying after detecting a pattern of failures. The retry approach favors eventual success at the cost of extra wait time and work, whereas a fail-fast strategy favors quickly aborting to conserve resources when a fault is likely persistent. For transient faults, retries are ideal – there’s a high chance the next attempt will succeed. But for long-lasting or permanent faults (e.g. a down service that won’t be back for hours), blindly retrying wastes time and resources. In such cases, other mechanisms like circuit breakers (to cut off calls that are likely to fail) or graceful degradation (serving default responses or cached data) might be more appropriate.
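
Graceful degradation can be combined with a bounded number of retries, as in the sketch below. It is only one possible shape for the idea: the name `call_with_fallback`, its parameters, and the choice to pass the fallback as a plain value (for example, a cached response) are assumptions for illustration. A Circuit Breaker is a separate mechanism and is not shown here.

```python
import time


def call_with_fallback(operation, fallback_value, max_attempts=2,
                       delay_seconds=0.2, sleep=time.sleep):
    """Try operation() a few times, then degrade to fallback_value.

    Hypothetical helper: it gives up after max_attempts instead of
    retrying indefinitely, so a persistent fault costs bounded time.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            # Wait between attempts, but not after the final one.
            if attempt < max_attempts:
                sleep(delay_seconds)
    # Every attempt failed: serve the default or cached data instead.
    return fallback_value
```

The trade-off is visible in the parameters: a larger `max_attempts` favors eventual success, while a smaller one returns the degraded response sooner.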

It’s important to differentiate transient errors from permanent ones. The Retry Pattern should typically only kick in for errors that are likely to be temporary. For example, network timeouts, 5xx server errors (such as 503 Service Unavailable), or database deadlocks might merit a retry. In contrast, errors like input validation failures (HTTP 400) or authentication errors (HTTP 401) are permanent for that request: no amount of retrying will fix a bad request or invalid credentials. Retrying on those would just repeat the failure and add unnecessary load to the system. Therefore, a well-designed retry mechanism checks the error type (or exception type) and retries only for transient conditions, while letting permanent errors fail fast.
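
The error-type check described above might look like the following sketch. The `HttpError` class, the `is_transient` rule, and the function names are all hypothetical; a real client library will have its own exception types, and which status codes count as transient is a policy decision for each service. Here, following the examples in the text, timeouts, connection errors, and 5xx responses are retried, and everything else fails fast.

```python
import time


class HttpError(Exception):
    """Hypothetical exception carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_transient(error):
    """Return True for errors this sketch treats as worth retrying."""
    if isinstance(error, (TimeoutError, ConnectionError)):
        return True
    if isinstance(error, HttpError):
        # 5xx: server-side trouble that may clear; 4xx: the request is wrong.
        return 500 <= error.status <= 599
    return False


def retry_if_transient(operation, max_attempts=3, delay_seconds=0.5,
                       sleep=time.sleep):
    """Retry operation() only when the failure looks transient."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:
            # Permanent errors, or the final attempt, propagate immediately.
            if not is_transient(error) or attempt == max_attempts:
                raise
            sleep(delay_seconds)
```

With this policy, a 400 or 401 response is raised after a single call, while a 503 is retried until it succeeds or the attempt budget runs out.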

In summary, the Retry Pattern solves the problem of intermittent failures by automatically re-invoking operations that may succeed on a subsequent attempt, thereby increasing reliability. However, it must be used judiciously. It works best in tandem with other failure-handling patterns: use retries for hiccups, fall back or fail fast for lasting errors. In the next sections, we’ll explore how to implement retries carefully to maximize their benefits while managing the trade-offs.
