Azure wasn’t having one of its best days yesterday. The mess started around 07:30 local time, and lasted until almost 16:00. In other words, our sites were down for the full duration of the business day!
The following table contains the ‘explanations’ Microsoft shared with the public during the downtime:
Time (UTC) | Severity | Status |
---|---|---|
06:35 | Advisory | An alert for Web Sites in West Europe is being investigated. It has not been determined if this is customer impacting. More information will be provided as it is known. |
08:15 | Advisory | We are experiencing an issue with Web Sites in the West Europe sub-region. A very small number of customers may see intermittent timeouts or slow response when attempting to access sites. We are actively investigating. Further updates will be published to keep you apprised of the situation. |
08:15 | Partial Performance Degradation | We are experiencing an issue with Web Sites in the West Europe sub-region. A very small number of customers may see intermittent timeouts or slow response when attempting to access sites hosted in this sub-region. We are actively investigating. Further updates will be published to keep you apprised of the situation. |
09:15 | Partial Performance Degradation | Engineers are actively investigating the issue. A very small number of customers may still see intermittent slowdown or timeouts. Further updates will be published within 2 hours. |
11:15 | Partial Service Interruption | Engineers are currently evaluating repair options. A subset of customers may experience interruption to web sites hosted in West Europe. Further updates will be published within 2 hours. |
13:15 | Partial Service Interruption | Engineers are continuing to evaluate and implement mitigation options. A subset of customers may experience interruption to web sites hosted in West Europe. Further updates will be published within 3 hours. |
14:53 | Advisory | The repair steps have been successfully executed and validated. Full Web Sites functionality has been restored in the West Europe sub-region. |
Notice what is missing from these updates:
- No mention of the cause. If they’d told us what the problem was, we could have estimated the duration of the downtime ourselves and temporarily moved our sites elsewhere.
- No mention of how many sites/customers were affected. ‘A subset’ and ‘a very small number’ are as vague as one can get.
And as if all of this wasn’t bad enough, around 18:00 excrement again came into contact with ventilation devices! This time both the West Europe & North Europe regions were affected. In other words: all the hosting Azure has in Europe.
Again, little to no explanation was shared:
Time (UTC) | Severity | Status |
---|---|---|
16:50 | Advisory | An alert for Web Sites in West Europe is being investigated. It has not been determined if this is customer impacting. More information will be provided as it is known. |
18:10 | Partial Performance Degradation | Some customers may experience intermittent time out errors or slow response when accessing their web sites. Engineers are actively gathering monitoring and alerting details to troubleshoot Web Sites in the West Europe sub-region. Further updates will be published to keep you apprised of the situation. |
19:07 | Partial Performance Degradation | The repair steps have been successfully executed and validated. Full Web Sites functionality has been restored in the North Europe sub-region. |
This one lasted ‘only’ 2 hours and was resolved around 20:00 local time.
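For what it’s worth, the arithmetic on the status timestamps can be checked with a few lines of Python. This is only a rough sketch: the times are copied from the two tables above, and measuring from the first advisory to the ‘restored’ notice is my own assumption about where each incident starts and ends, not something Microsoft confirmed. The `duration` helper is a hypothetical name of mine.

```python
from datetime import datetime, timedelta


def duration(first_post: str, last_post: str) -> timedelta:
    """Time between two same-day HH:MM status posts (UTC)."""
    fmt = "%H:%M"
    return datetime.strptime(last_post, fmt) - datetime.strptime(first_post, fmt)


# Assumption: each incident runs from the first advisory to the 'restored' notice.
first_outage = duration("06:35", "14:53")
second_outage = duration("16:50", "19:07")
total = first_outage + second_outage

for label, span in [("first", first_outage), ("second", second_outage), ("total", total)]:
    hours, remainder = divmod(int(span.total_seconds()), 3600)
    print(f"{label}: {hours}h {remainder // 60:02d}m")
```

Measured this way, the first incident spans a bit over eight hours and the second a bit over two, which matches how the day felt from our side.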
This morning I did some Googling and came across even scarier rumours: while repairing whatever caused yesterday’s downtime, Microsoft apparently restored old(er) versions of files without informing the customers involved!
I cannot confirm whether or not this actually happened, but the fact is that several of our site indexes were corrupt and needed rebuilding…
At the moment of writing, 24+ hours after the incidents, there is still no explanation whatsoever (believe me, I have Googled!) of what caused this monster outage. And when I don’t know the cause, I wonder… What are the chances this will happen again, and when?