Azure web sites down – February 27th

Azure wasn’t having one of its best days yesterday. The mess started around 07:30 local time and lasted until almost 16:00. In other words, our sites were down for the full duration of the business day!

The following table contains the ‘explanations’ Microsoft shared with the public during the downtime:

time (UTC) | seriousness | status
06:35 | Advisory | An alert for Web Sites in West Europe is being investigated. It has not been determined if this is customer impacting. More information will be provided as it is known.
08:15 | Advisory | We are experiencing an issue with Web Sites in the West Europe sub-region. A very small number of customers may see intermittent timeouts or slow response when attempting to access sites. We are actively investigating. Further updates will be published to keep you apprised of the situation.
08:15 | Partial Performance Degradation | We are experiencing an issue with Web Sites in the West Europe sub-region. A very small number of customers may see intermittent timeouts or slow response when attempting to access sites hosted in this sub-region. We are actively investigating. Further updates will be published to keep you apprised of the situation.
09:15 | Partial Performance Degradation | Engineers are actively investigating the issue. A very small number of customers may still see intermittent slowdown or timeouts. Further updates will be published within 2 hours.
11:15 | Partial Service Interruption | Engineers are currently evaluating repair options. A subset of customers may experience interruption to web sites hosted in West Europe. Further updates will be published within 2 hours.
13:15 | Partial Service Interruption | Engineers are continuing to evaluate and implement mitigation options. A subset of customers may experience interruption to web sites hosted in West Europe. Further updates will be published within 3 hours.
14:53 | Advisory | The repair steps have been successfully executed and validated. Full Web Sites functionality has been restored in the West Europe sub-region.

  • No mention of the cause. If they’d told us what the problem was, we could have estimated the duration of the downtime ourselves and temporarily moved sites elsewhere.
  • No mention of how many sites/customers were affected by this. ‘A subset’ & ‘a very small number’ are as vague as one can get.

And if all of this wasn’t bad enough, around 18:00 excrement again came in touch with ventilation devices! And this time both the West Europe & North Europe regions were affected. In other words: all hosting Azure has in Europe.
Again, little to no explanation was shared:

time (UTC) | seriousness | status
16:50 | Advisory | An alert for Web Sites in West Europe is being investigated. It has not been determined if this is customer impacting. More information will be provided as it is known.
18:10 | Partial Performance Degradation | Some customers may experience intermittent time out errors or slow response when accessing their web sites. Engineers are actively gathering monitoring and alerting details to troubleshoot Web Sites in the West Europe sub-region. Further updates will be published to keep you apprised of the situation.
19:07 | Partial Performance Degradation | The repair steps have been successfully executed and validated. Full Web Sites functionality has been restored in the North Europe sub-region.

This one lasted ‘only’ 2 hours and was resolved around 20:00.
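
For what it’s worth, the time between the first and the last status entry of each incident can be worked out from the two tables above. Below is a rough sketch in Python, assuming both incidents stayed within a single UTC day. The helper name `span` is made up for this example, and the numbers only say how long the status page showed an open incident, not how long any particular site was actually unreachable.

```python
from datetime import datetime, timedelta


def span(first: str, last: str) -> timedelta:
    """Time between two HH:MM stamps, assuming both fall on the same day."""
    fmt = "%H:%M"
    return datetime.strptime(last, fmt) - datetime.strptime(first, fmt)


# First and last status entries (UTC) of each incident, copied from the tables.
first_incident = span("06:35", "14:53")
second_incident = span("16:50", "19:07")

print(first_incident)                    # 8:18:00
print(second_incident)                   # 2:17:00
print(first_incident + second_incident)  # 10:35:00
```

So going by Microsoft’s own timestamps, an incident was open for more than ten hours of that single day.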

This morning I did some Googling and came across even scarier rumours: while repairing whatever caused the downtime yesterday, Microsoft apparently restored old(er) versions of files without informing the customers involved!
I cannot confirm whether or not this actually happened, but the fact is that several of our site indexes were corrupt and needed rebuilding…

At the moment of writing, 24+ hours after the incidents, there is still no explanation whatsoever (believe me, I have Googled!) of what caused this monster outage. And when I don’t know the cause, I wonder… What are the chances this will happen again, and when?
