Microsoft had a rough week: on two separate occasions, critical services went down worldwide for several hours.
The latest outage came early Thursday morning, when Outlook was down for several hours because of what Microsoft says was a “configuration update.” It came just days after a host of Microsoft services went down for several hours on Monday evening.
Microsoft originally thought the latest outage for Exchange Online accounts via Outlook on the Web was only in India, but soon realized it was worldwide.
Once again an update was to blame, this time a configuration update to components that route user requests. The outage began around 2 a.m. Eastern time and lasted until after 6 a.m.
Microsoft reverted the update to fix the issue.
We’ve reverted an update identified to be causing the problem and impact should be mitigated for the majority of users. We’re restarting components that are still performing below expected thresholds to help with service stability. Please refer to EX223208 in the admin center.
— Microsoft 365 Status (@MSFT365Status) October 1, 2020
It was the second global outage in less than a week for the Redmond technology giant. On Monday night, a coding issue related to an update caused a massive outage of Microsoft 365 applications that depend on Azure Active Directory for authentication. Those services included Outlook, Teams, Exchange, SharePoint, OneDrive, Dynamics 365 and other applications.
The company has published a detailed report of that outage on its Azure status history page. According to the report, a latent code defect caused the issue.
On September 28 at 21:25 UTC, a service update targeting an internal validation test ring was deployed, causing a crash upon startup in the Azure AD backend services. A latent code defect in the Azure AD backend service Safe Deployment Process (SDP) system caused this to deploy directly into our production environment, bypassing our normal validation process.
Azure AD is designed to be a geo-distributed service deployed in an active-active configuration with multiple partitions across multiple data centers around the world, built with isolation boundaries.
Normally, changes initially target a validation ring that contains no customer data, followed by an inner ring that contains Microsoft only users, and lastly our production environment. These changes are deployed in phases across five rings over several days.
In this case, the SDP system failed to correctly target the validation test ring due to a latent defect that impacted the system’s ability to interpret deployment metadata. Consequently, all rings were targeted concurrently. The incorrect deployment caused service availability to degrade.
Within minutes of impact, we took steps to revert the change using automated rollback systems which would normally have limited the duration and severity of impact.
However, the latent defect in our SDP system had corrupted the deployment metadata, and we had to resort to manual rollback processes. This significantly extended the time to mitigate the issue.
By midnight, service was restored for all users.
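To make the ring model in Microsoft’s report concrete, here is a minimal Python sketch of a staged rollout that fails closed when deployment metadata cannot be interpreted. It is an illustration under stated assumptions, not Microsoft’s SDP implementation: the ring names beyond validation, inner and production, the `target_ring` metadata key, and the `deploy`, `is_healthy` and `rollback` callbacks are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical ring names, ordered from least to most customer exposure.
# The report mentions a validation ring, an inner (Microsoft-only) ring and
# production, five rings in total; the "prod-N" names are placeholders.
RINGS: List[str] = ["validation", "inner", "prod-1", "prod-2", "prod-3"]


class TargetingError(Exception):
    """Raised when deployment metadata cannot be interpreted safely."""


def resolve_target_ring(metadata: Dict[str, str]) -> str:
    """Return the target ring, refusing to guess if it is missing or unknown."""
    ring = metadata.get("target_ring")  # assumed metadata key
    if ring not in RINGS:
        # Fail closed: an unreadable target must never widen the blast radius.
        raise TargetingError(f"unrecognized target ring: {ring!r}")
    return ring


def staged_rollout(
    metadata: Dict[str, str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy ring by ring up to the target; undo everything on a failed check.

    Returns the rings that still carry the change when the rollout ends.
    """
    target = resolve_target_ring(metadata)
    completed: List[str] = []
    for ring in RINGS[: RINGS.index(target) + 1]:
        deploy(ring)
        if not is_healthy(ring):
            # Roll back the failing ring first, then earlier rings in reverse.
            for done in reversed(completed + [ring]):
                rollback(done)
            return []
        completed.append(ring)
    return completed
```

Under these assumptions, a change whose metadata lacks a recognizable target raises `TargetingError` before any ring is touched, rather than fanning out to every ring at once.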
The timing of these two global outages is unfortunate given that Microsoft’s Ignite event was held virtually last week. Dozens of new products and features were announced for collaboration, security, cloud services and more.