In this blog article, Nintex senior developer Kok Jun Lye shares engineering advice about building resilient cloud applications that handle transient errors gracefully.
A common challenge of designing and developing modern cloud applications is resiliency. An application needs to handle transient errors gracefully and recover from them in order to minimize any impact on users.
By definition, a transient error is a temporary error that is likely to disappear soon. It may be gone in a flash, but developers can’t risk transient errors disrupting the flow of business.
There are different strategies for handling transient errors, and this article covers two commonly used ones: retry and circuit breaker.
What causes transient errors?
Transient errors can have many different causes, such as:
- Application-level throttling – an external REST API may monitor usage rates and temporarily limit clients that exceed their allotted consumption, so that shared system resources remain available to all users and customers (for Nintex, that’s thousands of customers!)
- Network-level interruptions – high connection latency and intermittent service interruption in the network infrastructure
- Temporary service unavailability – a dependent service could be unavailable temporarily but recovers shortly
How to identify transient errors
It is important to know the types of transient errors that will be generated before planning how to handle them. Typical transient errors for HTTP connections are 503 (Service Unavailable), 504 (Gateway Timeout), and 408 (Request Timeout). We should also be aware of non-transient errors, for example 404 (Not Found) and 410 (Gone), so that we do not retry on errors that will never recover. Note that these examples are only a guideline, because the transient error responses of your dependent services can vary.
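The distinction above can be captured in a small helper. This is a minimal sketch (the class name `TransientErrorClassifier` is hypothetical, not from the article); the set of transient codes is illustrative and should be adjusted to match the documented behavior of the services you depend on:

```csharp
using System.Collections.Generic;
using System.Net;

// Hypothetical helper that classifies an HTTP status code as transient or not.
public static class TransientErrorClassifier
{
    private static readonly HashSet<HttpStatusCode> TransientCodes = new()
    {
        HttpStatusCode.ServiceUnavailable, // 503
        HttpStatusCode.GatewayTimeout,     // 504
        HttpStatusCode.RequestTimeout      // 408
    };

    // Returns true for codes we treat as transient (safe to retry).
    public static bool IsTransient(HttpStatusCode code) => TransientCodes.Contains(code);
}
```

For example, `IsTransient(HttpStatusCode.ServiceUnavailable)` returns `true`, while `IsTransient(HttpStatusCode.NotFound)` returns `false`, so a 404 would never be retried.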
Using Polly to handle transient errors
Many design patterns – for example, the retry pattern and the circuit breaker pattern – have been devised as solutions to this common challenge. Although we could implement our own bespoke solution based on these patterns, using a library saves us from reinventing the wheel.
Polly is a .NET transient-fault-handling library that we can use to handle transient errors. It offers numerous ready-made resilience policies, and we can specify different fault-handling behaviors for each policy.
Many transient errors self-correct after a short period of time, and retrying often helps in this circumstance.
A transient HTTP error policy is added to handle network failure exceptions and transient HTTP status codes such as 503. The policy applies a constant delay of 2 seconds and a maximum retry count of 3, and when retries happen, the status code and retry count are logged.
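A sketch of such a retry policy, assuming the `Polly` and `Polly.Extensions.Http` NuGet packages (the `Console.WriteLine` stands in for whatever logger the application uses):

```csharp
using System;
using Polly;
using Polly.Extensions.Http;

// Handle HttpRequestException, 5xx responses, and 408 Request Timeout,
// retrying up to 3 times with a constant 2-second delay between attempts.
var retryPolicy = HttpPolicyExtensions
    .HandleTransientHttpError()
    .WaitAndRetryAsync(
        retryCount: 3,                                        // max retry count of 3
        sleepDurationProvider: _ => TimeSpan.FromSeconds(2),  // constant 2-second delay
        onRetry: (outcome, delay, retryAttempt, context) =>
        {
            // Log the status code and retry count on each retry.
            Console.WriteLine(
                $"Retry {retryAttempt} after {delay.TotalSeconds}s, " +
                $"status code: {outcome.Result?.StatusCode}");
        });

// Usage: execute an HTTP call through the policy, e.g.
// var response = await retryPolicy.ExecuteAsync(() => httpClient.GetAsync(url));
```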
Logging is useful when retries are performed because it gives us more insight into why the retries happened and how many attempts were made.
The constant delay used above might not always be a good solution for a retry implementation. Retries from many clients could hit the dependent service at the same time within a short duration, straining its resources and making it harder for the service to recover if there is connection congestion. Instead, we can use jitter to randomize the delay while increasing it exponentially.
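One way to sketch this is with the `Backoff.DecorrelatedJitterBackoffV2` helper from the `Polly.Contrib.WaitAndRetry` package, which produces exponentially growing delays with randomized jitter so that retries from many clients do not arrive in synchronized bursts (the 1-second median first delay here is an assumed value, not from the article):

```csharp
using System;
using Polly;
using Polly.Contrib.WaitAndRetry;
using Polly.Extensions.Http;

// Generate 3 retry delays that grow roughly exponentially, with jitter.
var delays = Backoff.DecorrelatedJitterBackoffV2(
    medianFirstRetryDelay: TimeSpan.FromSeconds(1),
    retryCount: 3);

// Retry transient HTTP errors using the jittered, exponentially growing delays.
var jitteredRetryPolicy = HttpPolicyExtensions
    .HandleTransientHttpError()
    .WaitAndRetryAsync(delays);
```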
When retries are performed, the delay grows exponentially with each attempt. This prevents too many requests from being sent out in the same time interval while the dependent service is temporarily unavailable.
In some scenarios, we would like operations to fail fast instead of repeatedly retrying an error that we do not expect to recover in a very short duration. This is where the circuit breaker policy comes in.
The circuit breaker stays in the Closed state and tolerates the first 5 exceptions; after that, it breaks and enters the Open state for 30 seconds. Within those 30 seconds, any call made through the policy throws a BrokenCircuitException.
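A sketch of a circuit breaker configured this way with Polly (again assuming the `Polly` and `Polly.Extensions.Http` packages; the `Console.WriteLine` calls are placeholder logging):

```csharp
using System;
using Polly;
using Polly.Extensions.Http;

// Tolerate 5 consecutive transient failures while Closed, then break
// (enter the Open state) for 30 seconds. Calls made while the circuit
// is Open throw BrokenCircuitException without reaching the service.
var circuitBreakerPolicy = HttpPolicyExtensions
    .HandleTransientHttpError()
    .CircuitBreakerAsync(
        handledEventsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (outcome, breakDelay) =>
            Console.WriteLine($"Circuit opened for {breakDelay.TotalSeconds}s"),
        onReset: () =>
            Console.WriteLine("Circuit reset"));
```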
The circuit breaker remains in the Closed state and allows calls through for the first 5 failed attempts; after that, it breaks and enters the Open state. If the service recovers after the 30 seconds, a subsequent call succeeds and the circuit breaker resets.
Hence, the circuit breaker is very useful when we expect the dependent service will not recover within a very short duration, while the application can still tolerate a specific number of transient failures.
Modern cloud applications need to be resilient so that the quality and reliability of the service remain consistently high and unnecessary service failures are avoided. Transient errors should always be identified and handled for better service reliability in every modern application.
We’re hiring engineering roles in Malaysia and around the globe. Please join our team! Learn more about all available career opportunities at Nintex here.
Want to discuss how the Nintex Platform can help your organization? Get in touch with the team at Nintex today.