How To Implement Chaos Engineering For Microservices Using Istio

“Embrace failures. Chaos and failures are your friends, not enemies.” A microservice ecosystem is going to fail at some point. The question is not whether you will fail, but whether you will notice when you do. The difference is between a failure that takes down all of your services and affects every user, and one that affects only a few users and can be fixed at your own pace.

Chaos Engineering is the practice of intentionally introducing faults and failures into your microservice architecture to test the resilience and stability of your system. Istio is a great tool for this. Let's look at how Istio makes it easy.

For more information on how to set up Istio, and on what virtual services and gateways are, see the following blog: how to set up Istio on GKE.

Fault Injection With Istio

Fault injection is a testing method that introduces errors into your microservice architecture to ensure it can withstand error conditions. Istio lets you inject errors at the HTTP layer, rather than killing pods or delaying packets at the network layer. This way, you can generate various HTTP error codes and test how your services react under those conditions.

Generating HTTP 503 Error

Here, two pods are running two different versions of the recommendation service, deployed by following the recommended tutorial when installing the sample application.

Currently, traffic to the recommendation service is automatically load balanced between those two pods.

CODE: https://gist.github.com/velotiotech/c255f6c52bf7cec88a693f8d52c8c9d4.js

Now let's apply fault injection using a virtual service that returns HTTP 503 errors for 30% of the traffic going to the above pods.

CODE: https://gist.github.com/velotiotech/165e3bbf3030eb267e7d2769ccd30328.
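
As a rough illustration of what such a manifest can look like, here is a minimal sketch of a virtual service that aborts 30% of requests with HTTP 503. It is not a copy of the gist above: the resource name, the recommendation host, and the networking.istio.io/v1alpha3 API version are assumptions, so adjust them to your cluster and Istio release.

```yaml
# Sketch only: name, host, and apiVersion are assumptions, not taken from the gist.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation     # hypothetical resource name
spec:
  hosts:
  - recommendation         # assumed Kubernetes service name
  http:
  - fault:
      abort:
        httpStatus: 503    # status returned by the sidecar instead of calling the service
        percentage:
          value: 30        # percent of requests to abort
    route:
    - destination:
        host: recommendation
```

Since the route names no subset, the remaining 70% of requests should still be load balanced across both versions of the service.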

To test whether it is working, curl the customer service microservice endpoint and check the output.

You will see a 503 error on approximately 30% of the requests reaching the recommendation service.

To restore normal operation, please delete the above virtual service using:

CODE: https://gist.github.com/velotiotech/b2d39ff496247b3fb2c7437208882049.js

Delay

The most common failure we see in production is not a service that is down but a service that is slow. Sometimes your application doesn’t respond in time, and that creates chaos across the whole ecosystem. To simulate this behavior and inject network latency as a chaos experiment, you can create another virtual service:

CODE: https://gist.github.com/velotiotech/deb94b26a9cfe34194ee0bc8d8d8a58c.
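
For illustration, a delay fault can be sketched as follows. The 7-second delay and the 50% share are made-up values chosen for the example, and the resource name, host, and API version are assumptions as well; none of them come from the gist above.

```yaml
# Sketch only: delay, percentage, names, and apiVersion are illustrative assumptions.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation     # hypothetical resource name
spec:
  hosts:
  - recommendation         # assumed Kubernetes service name
  http:
  - fault:
      delay:
        fixedDelay: 7s     # assumed latency added before the request is forwarded
        percentage:
          value: 50        # assumed percent of requests to delay
    route:
    - destination:
        host: recommendation
```

The delay is added by the sidecar proxy before the request is forwarded, so the service itself does not need to be modified.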

Now, if you hit the endpoint URL of the above service in a loop, you will see delays in some of the requests.

Retry

In some production services, we expect a request to be retried N times to get the desired output instead of failing instantly. Only if none of those retries succeeds should the request be considered failed.

For that mechanism, you can configure retries on those services as follows:

CODE: https://gist.github.com/velotiotech/9608a8fd1c711937c23a7703f3184b58.
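
A retry policy along these lines might look like the sketch below. The 3 attempts match the text; the 2-second per-try timeout, the retryOn condition, the resource name, the host, and the API version are assumptions added for the example.

```yaml
# Sketch only: perTryTimeout, retryOn, names, and apiVersion are assumptions.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation     # hypothetical resource name
spec:
  hosts:
  - recommendation         # assumed Kubernetes service name
  http:
  - route:
    - destination:
        host: recommendation
    retries:
      attempts: 3          # number of retries allowed for a request
      perTryTimeout: 2s    # assumed time limit for each individual try
      retryOn: 5xx         # assumed condition: retry on 5xx responses
```

Retries are a natural companion to the 503 fault injection shown earlier: with both in place, many injected errors should be absorbed by the proxy before the caller ever sees them.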

Now any failing request to the recommendation service will be retried up to 3 times before it is considered failed.

Timeout

In the real world, many of the failures an application faces are due to timeouts. They can be caused by heavy load on the application or by any other latency in serving the request. Your application should have proper timeouts defined before declaring any request as "Failed". You can use Istio to simulate the timeout mechanism and give your application a limited amount of time to respond before giving up.

The following virtual service waits only N seconds before failing and giving up:

CODE: https://gist.github.com/velotiotech/2740a385c6a044918b3c9f31e14bb07b.js
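
As a minimal sketch of a request timeout, assuming N is 1 second (the source leaves N unspecified) and using the same assumed resource name, host, and API version as a typical recommendation virtual service:

```yaml
# Sketch only: the 1s value, names, and apiVersion are assumptions.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation     # hypothetical resource name
spec:
  hosts:
  - recommendation         # assumed Kubernetes service name
  http:
  - route:
    - destination:
        host: recommendation
    timeout: 1s            # assumed N: give up if no response arrives within 1 second
```

Combining this with a delay fault longer than the timeout is one way to check that callers handle the resulting timeout error gracefully.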

Conclusion

Istio lets you inject faults at the HTTP layer for your application, which helps you improve its resilience and stability. However, the application itself must handle the failures and take an appropriate course of action. Chaos Engineering is only effective when you believe your application can withstand failures; there is no point in testing for chaos if you already know your application is broken.