In a typical DevOps team, new versions of server-side software may be deployed to production several times a week. Even if the source code passes through an automated testing pipeline before being rolled out, test coverage is never exhaustive. How can we honour our Service Level Agreement (SLA) when a bug slips through? I present a solution in this article.

If you are not sure what the expressions "SLA", "SLO" and "SLI" mean, I recommend reading this page.

The scenario

We will imagine that the service we offer to our clients is a REST API with one endpoint, available under /rate, that provides the currency exchange rate between euros and Swiss francs.

This API is queried 100 times per hour and, to absorb the load, we deploy 10 instances of our pod. We have defined the following Service Level Objective (SLO): 99% of requests to the /rate endpoint must return a 200 OK HTTP status code over a rolling 24-hour window. The associated Service Level Indicator (SLI) comes naturally as the proportion of requests to /rate that return a 200 OK status code. To keep some buffer, we configured our monitoring to raise an alarm if our SLI drops below 99% over the last rolling hour.

Initially we have version 1.0.0 in production, which works just fine. Then our development team improves the precision of the returned value and packages it as version 1.1.0. At h0, this new version is deployed to production, but unfortunately a regression is packaged with it that makes our endpoint always return a 500 status code. What happens now?
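To make the alerting rule concrete, here is a minimal Python sketch of the SLI computation and the alert condition described above; the names (sli, should_page, counts) are hypothetical and not tied to any particular monitoring tool:

    # Sketch of the SLI and the alert condition described above (hypothetical names).
    SLO_TARGET = 0.99  # 99% of requests to /rate must return 200 OK

    def sli(counts: dict) -> float:
        """Proportion of requests that returned 200 OK, given counts per status code."""
        total = sum(counts.values())
        return counts.get(200, 0) / total if total else 1.0

    def should_page(counts_last_hour: dict) -> bool:
        """Raise an alarm when the SLI over the last rolling hour drops below the target."""
        return sli(counts_last_hour) < SLO_TARGET

    # Healthy steady state: 100 requests per hour, all succeeding.
    print(sli({200: 100}), should_page({200: 100}))  # 1.0 False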

Without any specific precaution

One hour after the release (at h1), our monitoring system has measured the following response codes:
Status code | Occurrences
200         | 0
500         | 100
This gives a success rate of 0% and triggers paging (the period that just elapsed is called the "time to detection"). The operational team rolls the deployment back to version 1.0.0. We consider that one more hour elapses between the moment the pager rings and the moment the application is effectively rolled back (the "time to recovery"). At h2, we therefore have the following statistics:
Status code | Occurrences
200         | 0
500         | 2 x 100 = 200
24 hours after the rollout of version 1.1.0, we end up with the following:
Status code | Occurrences
200         | 22 x 100 = 2 200
500         | 2 x 100 = 200
This gives a success rate of about 91.7% (2 200 successful requests out of 2 400).
Even though our operational team reacted correctly, our SLI is below the target value of 99% and our SLO is not honoured.
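For reference, the arithmetic behind this timeline can be reproduced in a few lines of Python (a sketch, assuming the constant rate of 100 requests per hour used throughout this article):

    # Full rollout of the broken version: 2 hours of failures (time to detection
    # plus time to recovery), then 22 hours of normal traffic, at 100 requests per hour.
    REQUESTS_PER_HOUR = 100

    failures = 2 * REQUESTS_PER_HOUR    # 200 requests returning 500
    successes = 22 * REQUESTS_PER_HOUR  # 2 200 requests returning 200 OK

    sli_24h = successes / (successes + failures)
    print(f"{sli_24h:.1%}")  # 91.7%, below the 99% objective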

How to mitigate the risk?

Instead of deploying 10 replicas of version 1.1.0 and completely removing version 1.0.0 at h0, we deploy 1 instance out of 10 with version 1.1.0 and the other 9 with version 1.0.0 (this is the canary pattern). Our load balancer distributes requests evenly across all pods. After one hour, here are our statistics:
Status code | Occurrences
200         | 90
500         | 10
This is a success rate of 90%. As in the previous situation, paging is triggered and our operational team rolls back after an hour. Here is our count after the first two hours:
Status code | Occurrences
200         | 2 x 90 = 180
500         | 2 x 10 = 20
After the rollback, our application returns 200 OK for every request during the next 22 hours:
Status code | Occurrences
200         | 2 x 90 + 22 x 100 = 2 380
500         | 2 x 10 = 20
This makes a success rate of about 99.2% and our SLO is respected!
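The same calculation for the canary rollout, assuming the load balancer sends exactly 1 request in 10 to the canary pod during the two bad hours, confirms the figures above:

    # Canary rollout: only 1 pod out of 10 runs the broken version for 2 hours.
    REQUESTS_PER_HOUR = 100
    CANARY_SHARE = 1 / 10

    failures = round(2 * REQUESTS_PER_HOUR * CANARY_SHARE)  # 20 requests returning 500
    successes = 24 * REQUESTS_PER_HOUR - failures           # 2 380 requests returning 200 OK

    sli_24h = successes / (successes + failures)
    print(f"{sli_24h:.1%}")  # 99.2%, above the 99% objective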

Conclusion

Thanks to the canary pattern, we were able to fit the risk of deploying a regression to production into our error budget. But wait… the canary pattern is nice, but what if our back-end infrastructure suffers a network outage? That is a fair objection. In fact, we have to carry out a full risk analysis in order to define our strategy for honouring the SLO. This is the work of a site reliability engineer; feel free to contact me at info@etienne-delmotte.tech if you are interested. I hope you enjoyed reading this article about site reliability engineering.