Building resilient applications

In this blog article, Nintex senior developer Kok Jun Lye shares engineering advice on building resilient cloud applications that keep data processing running smoothly even when transient failures occur.

A common challenge in designing and developing modern cloud applications is resiliency. An application needs to handle transient errors gracefully and recover from them in order to minimize the impact on users.

By definition, a transient error is a temporary error that is likely to disappear soon. It may be gone in a flash, but we cannot risk transient errors disrupting the flow of business.

There are different strategies for handling transient errors. This article covers two commonly used ones – retry and circuit breaker.

What causes transient errors?

Transient errors can have many different causes, such as:

  1. Application-level throttling – an external REST API may monitor usage rates and temporarily throttle callers that exceed the allowed limits, since the system's resources are shared across many users and customers (for Nintex, thousands of customers!)
  2. Network-level interruptions – high connection latency and intermittent service interruption in the network infrastructure
  3. Temporary service unavailability – a dependent service could be unavailable temporarily but recovers shortly

How to identify transient errors

Before planning how to handle transient errors, it is important to know which types of transient errors will be generated. Typical transient errors for HTTP connections are 503 (Service Unavailable), 504 (Gateway Timeout), and 408 (Request Timeout). We should also be aware of non-transient errors, for example 404 (Not Found) and 410 (Gone), so that we do not retry them. Note that these examples are indicative only – the transient error responses of your dependent services could vary.
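For illustration, a small helper along the following lines could centralize the decision of whether a status code should be treated as transient. This is only a sketch and not part of the original implementation – the class and method names are hypothetical:

using System.Net;

// Sketch only: a hypothetical helper that classifies HTTP status codes as transient.
// The codes listed mirror the examples above; adjust them to match the actual
// behavior of your dependent services.
public static class TransientErrorClassifier
{
    public static bool IsTransient(HttpStatusCode statusCode) =>
        statusCode == HttpStatusCode.ServiceUnavailable   // 503
        || statusCode == HttpStatusCode.GatewayTimeout    // 504
        || statusCode == HttpStatusCode.RequestTimeout;   // 408

    // Non-transient errors such as 404 (Not Found) and 410 (Gone) are deliberately
    // excluded – retrying them will not change the outcome.
}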

Using Polly to handle transient errors

Many design patterns – for example, the retry pattern and the circuit breaker pattern – have been devised as solutions to this common challenge. Although we could implement our own bespoke solution based on these patterns, using a library saves us from reinventing the wheel.

Polly is a .NET transient-fault-handling library that we can use to handle transient errors. It offers numerous resilience policies out of the box, and we can specify different fault-handling behaviors when configuring each policy.

Retry

Many transient errors self-correct after a short period of time, so retrying often helps in this situation. The code snippet below shows how the retry policy is configured:

private const int MaxRetryAttempt = 3;
private static readonly TimeSpan RetryDelay = TimeSpan.FromSeconds(2);

public override void Configure(IFunctionsHostBuilder builder)
{
    var logger = GetLogger();

    builder.Services.AddHttpClient("resilientClient")
        .AddTransientHttpErrorPolicy(policyBuilder =>
            // Retry up to 3 times, waiting a constant 2 seconds between attempts.
            policyBuilder.WaitAndRetryAsync(MaxRetryAttempt, _ => RetryDelay,
                onRetry: (response, delay, retryCount, context) =>
                {
                    logger.Warning($"Status Code: {response.Result.StatusCode}. Retry Attempt: {retryCount}");
                }));
}

 
A transient HTTP error policy is added to handle network failure exceptions and transient HTTP status codes such as 503. The policy applies a constant delay of 2 seconds and a maximum retry count of 3, and each time a retry happens, the status code and retry count are logged.

Logging is useful when a retry is performed because it gives us more insight into why the retries happened and how many retry attempts have been made.

// Shared Random instance used for the jitter (assumed to be declared alongside the other fields).
private static readonly Random Jitterer = new Random();

builder.Services.AddHttpClient("resilientClient")
    .AddTransientHttpErrorPolicy(policyBuilder =>
        // Exponential back-off: 2, 4, then 8 seconds, plus up to 200 ms of random jitter.
        policyBuilder.WaitAndRetryAsync(MaxRetryAttempt, retryAttempt =>
            TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)) + TimeSpan.FromMilliseconds(Jitterer.Next(0, 200)),
            onRetry: (response, delay, retryCount, context) =>
            {
                logger.Warning($"Status Code: {response.Result.StatusCode}. Retry Attempt: {retryCount}. Delayed: {delay.TotalMilliseconds} ms.");
            }));

 
The constant delay used in the first example might not always be a good retry strategy. If many clients retry at the same time within a short window, the dependent service's resources are strained and it becomes harder for the service to recover from the congestion. The code snippet above adds a jitter to randomize the delay and increases the delay exponentially with each attempt.

When retries are performed, the delay grows exponentially as the retry attempt increases – with a 2-second base, the waits are roughly 2, 4, and 8 seconds, plus up to 200 ms of jitter. This prevents too many requests from being sent in the same time interval while the dependent service is temporarily unavailable.
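Once registered, the named client is consumed through IHttpClientFactory and the retry policy applies transparently to every request it sends. The consumer below is a minimal sketch – the class name and endpoint are hypothetical and not part of the original code:

using System.Net.Http;
using System.Threading.Tasks;

// Sketch only: a hypothetical consumer of the "resilientClient" registered above.
public class OrderService
{
    private readonly IHttpClientFactory _httpClientFactory;

    public OrderService(IHttpClientFactory httpClientFactory)
    {
        _httpClientFactory = httpClientFactory;
    }

    public async Task<string> GetOrdersAsync()
    {
        var client = _httpClientFactory.CreateClient("resilientClient");

        // Transient failures (e.g. 503, 504, 408, or HttpRequestException) are retried
        // by the policy before this call either succeeds or surfaces the failure.
        var response = await client.GetAsync("https://example.com/api/orders");
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}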

Circuit Breaker

private const int AllowedExceptionCount = 5;
private static readonly TimeSpan BreakDuration = TimeSpan.FromSeconds(30);

public override void Configure(IFunctionsHostBuilder builder)
{
    var logger = GetLogger();

    builder.Services.AddHttpClient("resilientClient")
        .AddTransientHttpErrorPolicy(policyBuilder =>
            // Break the circuit after 5 consecutive handled failures and keep it open for 30 seconds.
            policyBuilder.CircuitBreakerAsync(AllowedExceptionCount,
                BreakDuration,
                onBreak: (response, timeSpan) =>
                {
                    logger.Warning($"Break. Status code: {response.Result.StatusCode}.");
                },
                onReset: () =>
                {
                    logger.Information("Circuit breaker is reset.");
                }));
}

 
In some scenarios, we would like operations to fail fast instead of repeatedly retrying an error that we do not expect to recover within a short duration. The code snippet above shows how the circuit breaker policy is configured.

The circuit breaker is in the Closed state and tolerates the first 5 exceptions; after that, it breaks and enters the Open state for 30 seconds. During those 30 seconds, any call through the policy throws a BrokenCircuitException.

In practice, the circuit breaker stays in the Closed state and allows calls for the first 5 attempts; after that it breaks and enters the Open state. If the service has recovered after the 30-second break, a subsequent call succeeds and the circuit breaker resets.

Hence, the circuit breaker is very useful when we expect the dependent service will not recover within a short duration, while the application can still tolerate a specific number of transient failures.
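A caller can take advantage of this fail-fast behavior by catching BrokenCircuitException and returning a fallback immediately instead of waiting on a service we expect to be down. The snippet below is a sketch under that assumption – the class name, endpoint, and fallback value are hypothetical:

using System.Net.Http;
using System.Threading.Tasks;
using Polly.CircuitBreaker;

// Sketch only: a hypothetical caller that fails fast while the circuit is Open.
public class StatusClient
{
    private readonly IHttpClientFactory _httpClientFactory;

    public StatusClient(IHttpClientFactory httpClientFactory)
    {
        _httpClientFactory = httpClientFactory;
    }

    public async Task<string> GetStatusAsync()
    {
        var client = _httpClientFactory.CreateClient("resilientClient");
        try
        {
            var response = await client.GetAsync("https://example.com/api/status");
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
        catch (BrokenCircuitException)
        {
            // The circuit is Open: the policy throws immediately without calling the
            // service, so we return a fallback value instead of waiting.
            return "status-unavailable";
        }
    }
}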

Conclusion

Modern cloud applications need to be resilient so that the quality and reliability of the service remain consistently good and unnecessary service failures are avoided. Transient errors should always be identified and handled to improve service reliability in every modern application.

We’re hiring engineering roles in Malaysia and around the globe. Please join our team! Learn more about all available career opportunities at Nintex here.

 

 

Want to discuss how the Nintex Platform can help your organization? Get in touch with the team at Nintex today.

 

 

Kok Jun Lye

Kok Jun is a Nintex Senior Developer based in Kuala Lumpur. At Nintex, he works on products like Nintex for SharePoint and Nintex Workflow Cloud, and he enjoys building scalable cloud applications.
