Authentication Cloud Outage
Root cause analysis
To understand the root cause of this incident, we would like to explain how our cloud deployment works. Our deployment pipeline is based on standard tools such as Make, Terraform and Kubectl, and is usually triggered automatically and run on GitHub Actions. Terraform takes a declarative approach to defining infrastructure: you provide Terraform with a description of the target infrastructure, and Terraform calculates a plan for changing your current infrastructure to match it.
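As a toy illustration of this declarative model (a deliberate simplification, not Terraform's actual algorithm), the plan can be thought of as the difference between the current and the desired resource lists: anything only in the current list is slated for destruction, anything only in the desired list for creation.

```shell
#!/bin/sh
# Toy model of a declarative plan: compare the resources that exist now
# with the resources that should exist, and emit the actions needed to
# reconcile the two. This is only an illustration of the idea.
plan() {
    current=$1 desired=$2
    sort "$current" > /tmp/plan_cur.$$
    sort "$desired" > /tmp/plan_des.$$
    # Present now but absent from the desired state -> destroy.
    comm -23 /tmp/plan_cur.$$ /tmp/plan_des.$$ | sed 's/^/destroy /'
    # Absent now but present in the desired state -> create.
    comm -13 /tmp/plan_cur.$$ /tmp/plan_des.$$ | sed 's/^/create /'
    rm -f /tmp/plan_cur.$$ /tmp/plan_des.$$
}
```

Under this model, a desired state that accidentally omits existing instances yields `destroy` actions for them, which is exactly the failure mode this incident hinged on.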
Nevis uses Make to orchestrate jobs that execute Terraform or Kubectl, render deployment templates, and run integrity checks. These additional checks ensure that Terraform or Kubectl operates on the right resources and has up-to-date information on all instances. Provisioning new instances follows the same approach: new instances are applied via Make, which calls Terraform to create the new infrastructure resources and secrets.
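Such an integrity check can be pictured as a gate of roughly the following shape; the file names and the exact comparison are hypothetical, not Nevis's actual pipeline, but the idea is that the deployment aborts if any known instance is missing from the input handed to Terraform.

```shell
#!/bin/sh
# Hypothetical integrity gate: before Terraform runs, verify that every
# instance in the inventory also appears in the rendered Terraform
# input. If anything is missing, abort instead of letting Terraform
# treat the missing instances as resources to clean up.
check_inventory() {
    rendered=$1 inventory=$2
    sort "$rendered" > /tmp/chk_r.$$
    sort "$inventory" > /tmp/chk_i.$$
    missing=$(comm -13 /tmp/chk_r.$$ /tmp/chk_i.$$)
    rm -f /tmp/chk_r.$$ /tmp/chk_i.$$
    if [ -n "$missing" ]; then
        echo "ABORT: instances missing from Terraform input:" >&2
        echo "$missing" >&2
        return 1
    fi
    return 0
}
```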
Unfortunately, due to an undetected failure in this chain of activities, our DevOps team ran a specific Terraform script directly, bypassing the integrity checks. As a result, the Terraform script did not contain all necessary resources, and because of Terraform's declarative approach, it treated the existing instances as resources to be cleaned up.
Terraform started to delete Azure Key Vaults, Kubernetes ingresses, Kubernetes secrets and other resources for existing instances. However, no persistent data was lost, thanks to existing safeguards such as role-based access controls and locks on critical resources like databases.
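The resource locks mentioned above can be sketched as follows. This local model is hypothetical (Azure implements it natively, e.g. with `CanNotDelete` management locks), but it shows why locked databases survived while unlocked resources did not.

```shell
#!/bin/sh
# Hypothetical local model of a delete lock: a destroy request is
# refused for any resource named in the lock file, mirroring how an
# Azure CanNotDelete lock blocks deletion regardless of who asks.
safe_destroy() {
    resource=$1 lockfile=$2
    if grep -qx "$resource" "$lockfile"; then
        echo "refused: $resource is locked" >&2
        return 1
    fi
    echo "destroyed $resource"
}
```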
Another unexpected and unfortunate side effect was that Terraform deleted the alerting functionality for each instance in our monitoring infrastructure. These alerts are one of our primary sources of information on the health of the Authentication Cloud, so we were not able to detect the warning signs at the earliest stage of the outage.
While we generally believe heavily in automation to reduce problems and human error in deployments, automation can also lead to major infrastructure damage. In this case, two safeguards were missing: limits on the impact of a single deployment operation, and additional protections against the deletion of existing resources.
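One way to add the missing impact limit, sketched here against the toy `destroy <name>` plan format (the threshold and plan format are hypothetical; real pipelines can obtain the destroy count from `terraform show -json` on a saved plan):

```shell
#!/bin/sh
# Hypothetical blast-radius guard: refuse to apply any plan that would
# destroy more resources than an explicitly configured limit.
guard_apply() {
    plan_file=$1 max_destroy=${2:-0}
    destroys=$(grep -c '^destroy ' "$plan_file")
    if [ "$destroys" -gt "$max_destroy" ]; then
        echo "refused: plan destroys $destroys resources (limit $max_destroy)" >&2
        return 1
    fi
    echo "plan accepted ($destroys destroys)"
}
```

A deployment that would have deleted every instance's Key Vault and ingress would fail this gate long before Terraform touched anything.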
Timeline
The following updates are listed most recent first.
All remaining problems on instances with push authentication were resolved.
Complete restoration of service in the Switzerland region.
New reports by customers that service is available, but problems with push authentication persist.
All customer instances were successfully recovered, but some continued to experience degraded service due to remaining issues with push notification functionality. The cause was that internal secrets required to communicate with Google Firebase Cloud Messaging had not yet been fully restored during the initial recovery.
More customer instances were recovered.
First customer instances fully recovered with confirmation.
The root cause was identified: several crucial Kubernetes resources were missing in the Switzerland region. As a result, the team started work on restoring the missing resources.
Internal monitoring systems show increased error rates for the Switzerland region. Loss of all availability checks and alerts.