Performance Implications

While the Retry Pattern can greatly enhance reliability, it also introduces performance considerations and risks that architects must carefully manage:

  • Increased Load on Dependencies: Every retry is an additional request. When failures are truly rare, this extra load is negligible (and well worth the improved reliability). However, if a downstream service is failing because it is overloaded, retries can make the problem worse. For instance, imagine Service A is hammering Service B, which is slow to respond. If A starts retrying aggressively, B receives even more requests. In the worst case, many clients retrying a struggling service can create a retry storm that multiplies the load and may overwhelm the service completely. This is why backoff is crucial – to give breathing room between retries – and why limiting the number of attempts is important (the retry sketch after this list shows both safeguards). It’s also a reason to incorporate circuit breakers or throttling for protection.

  • Increased Latency for End Users: When a service retries an operation, the overall time to get a response becomes longer. A single failed attempt might add a few seconds of delay while we wait and retry. This means that users might experience higher latency. In some cases, this is acceptable – a slightly slower successful response is better than an immediate error. But it can also degrade the user experience if overdone. If you chain multiple retries (or if multiple services in a call chain each retry), the cumulative delay can add up. One must balance success rate vs. response time. Often there is a cutoff where it’s better to fail fast than to keep the user waiting too long. That’s why setting an overall timeout for an operation is important in conjunction with retries. For instance, you might decide that an API call should either succeed within 5 seconds (including any retries) or not at all, to maintain a responsive SLA.

  • Wasted Work and Resource Consumption: Every retry that ultimately fails was essentially wasted effort – CPU cycles, network bandwidth, and memory that didn’t produce a successful result. If the underlying issue is not transient, all retries will fail and all that work is for naught (plus possibly straining the system). Therefore, it’s critical to detect non-transient failures quickly and avoid retrying them. For example, if a response comes back “permission denied” or “invalid request,” a retry won’t help – the client should not waste resources trying again. This often means your retry logic should inspect error codes/exceptions and have a blacklist of non-retryable errors. Similarly, if the first attempt already took a long time (nearly hitting a client timeout), it might be counterproductive to start another attempt that will likely also time out – better to propagate an error up than to tie up threads on hopeless retries.

  • Cascading Effects in Complex Systems: In a microservice ecosystem, consider the scenario where Service A calls B, and B calls C. If both A and B have their own retry policies, a failure in C could cause B to issue multiple attempts, and A to also issue multiple attempts of its call to B. This multiplies the load on B and C: with three attempts allowed at each layer, one logical request can turn into up to nine calls to C. It’s wise to avoid uncoordinated retries at every layer. Instead, decide which layer should retry and let others fail fast. A common practice is letting the edge service or client-facing service handle retries, while intermediate services either propagate errors or use other policies. This prevents cascading retries from blowing up traffic. Also, monitoring systems should be in place to detect when excessive retries are happening (as it could indicate a downstream incident).

  • Retry Storms & Thundering Herd: If a popular service experiences a glitch, many instances of many services might all start retrying around the same time. This sudden surge (a retry storm) can be dangerous. Backoff and jitter help avoid synchronization of retries. Jitter specifically tackles the thundering herd by de-syncing retry timing, which is a best practice to include. Another mitigation is to implement budgets or limits on retries – for example, allowing a certain number of retries per second, or gradually increasing the interval (which exponential backoff inherently does). Some systems employ adaptive algorithms that lengthen backoff if errors continue (to avoid constant pressure).

  • Idempotency and Side Effects (Revisited): A performance or correctness pitfall of retries is the risk of performing an action multiple times. We must ensure either the action is idempotent or use measures to prevent side-effect duplication. Without this, retries could corrupt data (e.g. applying a transaction twice). In terms of performance, duplicate side effects might also double-consume resources (e.g. sending the same email twice uses twice the email-sending resource). Always consider the nature of the operation: Do not retry non-idempotent operations blindly unless you have a way to guard against repeated side effects (such as deduplication keys or check-pointing).
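
To make these safeguards concrete, here is a minimal retry helper that combines capped attempts, exponential backoff with full jitter, an overall deadline, and immediate failure on non-transient errors. It is an illustrative sketch rather than production code – the function name, the exception types treated as transient, and the specific limits are assumptions to adapt to your own clients and error model.

```python
import random
import time

# Exceptions treated as transient in this sketch; anything else fails immediately.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)

def retry_with_backoff(operation, max_attempts=3, base_delay=0.2,
                       max_delay=2.0, overall_deadline=5.0):
    """Call `operation`, retrying transient failures with exponential backoff,
    full jitter, a capped number of attempts, and an overall deadline."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            elapsed = time.monotonic() - start
            if attempt == max_attempts or elapsed >= overall_deadline:
                raise  # out of attempts or out of time: give up rather than wait longer
            # Exponential backoff with full jitter: sleep a random amount between
            # zero and the capped exponential delay, but never past the deadline.
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(min(random.uniform(0, delay), overall_deadline - elapsed))
        # Non-transient errors (e.g. validation or authorization failures) are not
        # caught, so they propagate immediately instead of wasting retries.

# Hypothetical usage:
# order = retry_with_backoff(lambda: order_client.get_order("42"))
```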

Strategies to Mitigate Issues

To address the downsides above, use a combination of techniques:

  • Implement exponential backoff (don’t hammer the service rapidly).
  • Add jitter to avoid synchronization of retries.
  • Limit retry attempts to a reasonable number (and possibly limit the total time spent retrying).
  • Use circuit breakers or fail-fast switches for scenarios where the failure is likely persistent. For instance, after N failures in a row, you might stop retrying for a short window (let the circuit open) to give the system time to recover.
  • Distinguish error types – retry only on errors that are transient, such as network timeouts, connection-refused errors, or 502 Bad Gateway responses from a load balancer. Do not retry on business logic errors or client errors such as validation or authorization failures.
  • Monitor and tune – use metrics to detect when retries are happening frequently. If a particular service is causing lots of retries, that might indicate it’s struggling or that your retry policy needs adjustment.
  • Ensure idempotency – as stressed, make sure the operations can handle being repeated safely, or implement request deduplication mechanisms on the server side, especially for critical actions (a minimal deduplication sketch follows this list).
  • Test under failure conditions – it’s important to simulate scenarios (like a dependency going down, or high latency) and see how your retry logic behaves. Make sure it actually improves resilience and doesn’t unintentionally overwhelm the system or time out too late.
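
Idempotency is often enforced on the server with a client-supplied idempotency key. The sketch below keeps results in an in-memory dictionary purely for illustration – a real service would use a shared store such as Redis or a database with an expiry, and the handler name and payload fields here are hypothetical.

```python
# In-memory deduplication keyed by a client-supplied idempotency key.
# A real service would use a shared store (e.g. Redis or a database) with a TTL.
_processed_results = {}

def handle_payment(idempotency_key: str, payload: dict) -> dict:
    """Process a payment at most once per idempotency key.

    If a retried request arrives with a key we have already seen, replay the
    stored result instead of charging the customer a second time.
    """
    if idempotency_key in _processed_results:
        return _processed_results[idempotency_key]  # duplicate retry: no new side effect
    result = {"status": "charged", "amount": payload["amount"]}  # the real side effect goes here
    _processed_results[idempotency_key] = result
    return result
```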

When Not to Use the Retry Pattern

There are scenarios where retries might not be the best approach. If an error is clearly non-transient (e.g. a configuration error or a fatal exception), retrying just delays the inevitable. Likewise, if the downstream service is known to be down for an extended period (say a planned outage), a retry loop will only burn resources — a circuit breaker or a fallback response is preferable. Real-time systems with strict latency requirements might opt to fail fast rather than retry and violate the latency SLA. Also, if an operation is extremely expensive or has side effects that can’t be repeated, you should avoid automated retries. In such cases, alternative patterns like manual compensation, eventual reconciliation (for asynchronous processes), or simply alerting a human might be better. In essence, use retries where they make sense (transient, recoverable errors) and avoid them where they don’t (permanent failures or scenarios where retries could cause harm).
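
When retrying is the wrong tool, failing fast with a fallback is often the better trade. The sketch below shows one illustrative way to do this: non-transient errors surface immediately, while an unavailable dependency yields a safe default instead of a retry loop. The callable parameter and the fallback values are assumptions for the example, not part of any specific framework.

```python
from typing import Callable, List

def get_recommendations(fetch: Callable[[str], List[str]], user_id: str) -> List[str]:
    """Fail fast to a safe fallback instead of retrying when retries are unlikely to help."""
    try:
        return fetch(user_id)
    except PermissionError:
        # Non-transient (e.g. bad credentials or configuration): retrying won't fix it.
        raise
    except (ConnectionError, TimeoutError):
        # Dependency is down or too slow for the SLA: serve a safe default rather
        # than keeping the user waiting through a retry loop.
        return ["top-sellers", "editors-picks"]
```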
