Major incident: All Instances are down in Zürich[CH] region

The Authentication Cloud recently experienced two outages affecting the Zurich region:

February 18, 2025: Outage occurred at 15:14 CET and was resolved by 16:10 CET.
February 25, 2025: Outage occurred at 16:08 CET and was resolved by 16:48 CET.

Outage Details:

Root Cause

The root cause of both outages was an Azure-managed automated ingress controller that failed to allocate IP addresses to the public entry point. Without assigned IPs, Cloudflare could not reach the origin and returned HTTP 522 errors (connection timed out), rendering the cluster inaccessible and disrupting business services.
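As an illustration of how this failure mode could be detected, the following is a minimal sketch, not the tooling Nevis actually uses. It assumes the cluster is Kubernetes and that the input is the JSON produced by listing Ingress objects, where assigned addresses appear under `status.loadBalancer.ingress`. The function names are hypothetical, and treating any non-globally-routable address as invalid is an assumption.

```python
import ipaddress


def has_valid_public_ip(ingress: dict) -> bool:
    """Return True if the Ingress reports at least one globally routable IP."""
    entries = ingress.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    for entry in entries:
        try:
            # Assumption: only globally routable addresses count as valid.
            if ipaddress.ip_address(entry.get("ip") or "").is_global:
                return True
        except ValueError:
            # Missing or malformed address: keep looking at other entries.
            continue
    return False


def ingresses_without_public_ip(ingress_list: dict) -> list[str]:
    """List 'namespace/name' for every Ingress lacking a valid public IP."""
    names = []
    for item in ingress_list.get("items", []):
        if not has_valid_public_ip(item):
            meta = item.get("metadata", {})
            names.append(f"{meta.get('namespace', 'default')}/{meta.get('name', '<unnamed>')}")
    return names
```

A check of this kind could run periodically and raise an alert as soon as the returned list is non-empty, before the outage shows up as service unavailability.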

Impact

Regions Affected: Switzerland (zrh)
Instances Affected: 100%
Users impacted: All

Timeline

Incident 1: February 18, 2025

All times are in CET.


Feb 18, 2025, 15:14	Outage Detected: Critical alerts triggered due to service unavailability.
Feb 18, 2025, 15:53	Issue Identified: Team noted that the cluster’s public networking entry point was not available.
Feb 18, 2025, 15:55	Issue Impact: Team noted that ingress objects and the Application Load Balancer had invalid public IP addresses.
Feb 18, 2025, 16:00	Response: Redeployed the ingress controller to force refreshing the public frontend IP assignment
Feb 18, 2025, 16:05	Resolution: Business services are noted as coming back online
Feb 18, 2025, 16:30	Full availability: Business services are completely recovered

Incident 2: February 25, 2025

All times are in CET.


Feb 25, 2025, 16:08	Outage Detected: Critical alerts triggered due to service unavailability.
Feb 25, 2025, 16:18	Issue Identified: Team noted that ingress objects and the Application Load Balancer had invalid public IP addresses again
Feb 25, 2025, 16:20	Response: Redeployed the ingress controller to force refreshing the public frontend IP assignment
Feb 25, 2025, 16:23	Resolution: Business services are noted as coming back online
Feb 25, 2025, 16:48	Full availability: Business services are completely recovered

Remediation and follow-up steps

We are committed to maintaining the highest levels of service availability. A thorough post-mortem analysis has been conducted, and we have identified the following short-term and long-term mitigation strategies:

Short-Term:

Azure Remediation: Because the Azure-managed ingress controller is not directly managed by Nevis, we are collaborating with Microsoft Azure to analyze, define, and implement a permanent solution. We are awaiting their detailed remediation plan and are escalating the issue to ensure prompt resolution.

Long-Term:

Multiple Active Public Entry Points: To enhance redundancy and load balancing, we plan to implement multiple active public entry points. This will ensure that even if one entry point experiences IP allocation issues, traffic will continue to be served by the remaining active entry points.
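The idea behind multiple active entry points can be sketched as follows. This is a simplified model under stated assumptions, not the planned implementation: the `EntryPoint` type, its fields, and the round-robin distribution are all hypothetical, and in practice the routing would be handled by DNS or the CDN layer rather than application code.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import List, Optional


@dataclass(frozen=True)
class EntryPoint:
    name: str
    public_ip: Optional[str]  # None models a failed IP allocation
    healthy: bool


def active_entry_points(entry_points: List[EntryPoint]) -> List[EntryPoint]:
    """Keep only entry points that have an assigned IP and pass health checks."""
    return [ep for ep in entry_points if ep.public_ip and ep.healthy]


def route_requests(entry_points: List[EntryPoint], n_requests: int) -> List[str]:
    """Distribute requests round-robin across the active entry points."""
    active = active_entry_points(entry_points)
    if not active:
        raise RuntimeError("no active public entry point available")
    rotation = cycle(active)
    return [next(rotation).name for _ in range(n_requests)]
```

With a single entry point, a failed IP allocation leaves the active list empty and every request fails, which matches what happened in both incidents. With two or more, the remaining entry points absorb the traffic.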

Conclusion

The recent outages in Zurich, caused by an Azure ingress controller issue, impacted all users. We are addressing the immediate issue with our cloud provider and will be implementing long-term solutions, including redundant entry points, to ensure improved service availability.

We apologize for the disruption caused by these incidents. While our team responded swiftly to each of them, we understand that any downtime has an impact on our customers and their users.

March 7, 2025 · 09:29 CET

Post-mortem

The Authentication Cloud experienced two recent outages affecting the Zurich region:

February 18, 2025: Outage occurred at 15:30 CET and was resolved by 17:30 CET.
February 25, 2025: Outage occurred at 15:50 CET and was resolved by 16:50 CET.

Preliminary Findings

Our initial investigations indicate that instances were being assigned incorrect IP addresses/hostnames by an unmanaged automated node within our cloud service provider’s infrastructure. We have escalated this issue with the provider and are working diligently to determine the root cause.

Next Steps

Thorough Investigation: We are conducting a comprehensive investigation to fully understand the cause of these outages and identify potential contributing factors.
Postmortem Analysis: A postmortem analysis will be conducted to determine actions that prevent similar incidents in the future.
Cloud Provider Collaboration: We are actively collaborating with our cloud service provider to address the underlying issue with their automated node.

Conclusion

We sincerely apologize for the disruption and inconvenience these outages caused. We understand that any downtime can be disruptive to your operations, and we are committed to providing a highly available and reliable service. We will continue to provide updates as our investigation progresses and implement necessary measures to prevent future occurrences.

February 27, 2025 · 15:37 CET

Resolved

After closely monitoring the service, we are pleased to announce that our operations have returned to normal. The issue has been successfully resolved, and our product is functioning as expected.

February 25, 2025 · 16:48 CET

Monitoring

We have successfully identified the cause of the service disruption and have deployed a solution. Our team is actively monitoring the situation to ensure the issue is fully resolved.

February 25, 2025 · 16:31 CET

Investigating

Our team is diligently analyzing the service disruption impacting our product. We apologize for the inconvenience caused and appreciate your understanding. We are committed to resolving the issue and will provide updates as soon as possible.

February 25, 2025 · 16:26 CET

Issue

All Instances are down in Zürich[CH] region

Severity: Full outage

Location: Switzerland

Affected Services: All Authentication Cloud instances in Zürich[CH] region

Summary: All instances in the Zürich[CH] region are currently down, affecting our Authentication Cloud services. Our team is actively investigating the issue to restore service as quickly as possible. We apologize for the inconvenience caused and appreciate your patience.

Impact: During this incident, the affected service is inaccessible, resulting in disruption of functionality for our users. We understand the impact this may have on your workflow and sincerely apologize for any inconvenience caused.

February 25, 2025 · 16:10 CET
