One hour after the release (at h1), our monitoring system has measure the following response codes :
Which gives a success rate of = 0% and triggers paging.
The operational team rolled the deployment back to version 1.0.0. We consider between the moment the pager rang and the application was effectively rolled back, one more hour elapsed.
At h2, we therefore have the following statistics :
24 hours after the rollout of version 1.1.0, we end up with the following :
Which gives 92.7% of success.
Here is a visual representation :
Even if our operational team reacted correcly our SLI is below the target value of 99 % and our SLO is not honoured