All Instances are down in Zürich[CH] region
Updates
The Authentication Cloud recently experienced two outages affecting the Zurich region:
- February 18, 2025: Outage occurred at 15:14 CET and was resolved by 16:10 CET.
- February 25, 2025: Outage occurred at 16:08 CET and was resolved by 16:48 CET.
Outage Details:
Root Cause
The root cause of both outages was an Azure-managed, automated ingress controller that failed to allocate IP addresses to the public entry point. Without assigned IPs, Cloudflare could not connect to the origin and returned HTTP 522 errors (connection timed out), rendering the cluster inaccessible and disrupting business services.
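The failure mode described here, an entry point left without a usable public IP, lends itself to a simple automated check. The following Python sketch is illustrative only and does not represent the tooling used by Nevis or Azure; the function name `is_usable_public_ip` and the sample addresses are assumptions. It relies solely on the standard `ipaddress` module and treats a missing, malformed, or non-globally-routable address as unusable.

```python
import ipaddress
from typing import Optional


def is_usable_public_ip(value: Optional[str]) -> bool:
    """Return True only if `value` parses as a globally routable IP address.

    Hypothetical helper: a missing value models an entry point that was
    never assigned an address.
    """
    if not value:
        # None or empty string: no address was allocated at all.
        return False
    try:
        address = ipaddress.ip_address(value.strip())
    except ValueError:
        # Not an IP address (for example a placeholder string).
        return False
    # Private, loopback, unspecified and similar ranges are not reachable
    # from the public internet, so they count as invalid for an entry point.
    return address.is_global


if __name__ == "__main__":
    for candidate in (None, "", "0.0.0.0", "10.0.0.4", "8.8.8.8"):
        print(repr(candidate), is_usable_public_ip(candidate))
```

A check of this kind could run against the address reported for each public entry point and raise an alert before an edge proxy starts returning errors to users.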
Impact
- Regions Affected: Switzerland (zrh)
- Instances Affected: 100%
- Users impacted: All
Timeline
Incident 1: February 18, 2025
All times are in CET.
Feb 18, 2025, 15:14 | Outage Detected: Critical alerts triggered due to service unavailability. |
Feb 18, 2025, 15:53 | Issue Identified: Team noted that the cluster’s public networking entry point was not available |
Feb 18, 2025, 15:55 | Issue Impact: Team noted that ingress objects and the Application Load Balancer had invalid public IP addresses |
Feb 18, 2025, 16:00 | Response: Redeployed the ingress controller to force refreshing the public frontend IP assignment |
Feb 18, 2025, 16:05 | Resolution: Business services are noted as coming back online |
Feb 18, 2025, 16:30 | Full availability: Business services are completely recovered |
Incident 2: February 25, 2025
All times are in CET.
Feb 25, 2025, 16:08 | Outage Detected: Critical alerts triggered due to service unavailability. |
Feb 25, 2025, 16:18 | Issue Identified: Team noted that ingress objects and the Application Load Balancer had invalid public IP addresses again |
Feb 25, 2025, 16:20 | Response: Redeployed the ingress controller to force refreshing the public frontend IP assignment |
Feb 25, 2025, 16:23 | Resolution: Business services are noted as coming back online |
Feb 25, 2025, 16:48 | Full availability: Business services are completely recovered |
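For readers who want to compare the two incidents, the elapsed time from detection to each later milestone can be derived from the times of day listed in the timelines. The sketch below is a worked calculation only; the dictionary layout, the milestone labels, and the function names are assumptions made for illustration, and the 15:55 "issue impact" entry of Incident 1 is omitted because Incident 2 has no counterpart.

```python
from datetime import datetime

# Times of day (CET) copied from the two timelines.
TIMELINES = {
    "Incident 1": {
        "detected": "15:14",
        "identified": "15:53",
        "response": "16:00",
        "resolution": "16:05",
        "full availability": "16:30",
    },
    "Incident 2": {
        "detected": "16:08",
        "identified": "16:18",
        "response": "16:20",
        "resolution": "16:23",
        "full availability": "16:48",
    },
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_since_detection(timeline: dict) -> dict:
    """Minutes from the 'detected' entry to every later milestone."""
    start = timeline["detected"]
    return {
        milestone: minutes_between(start, time_of_day)
        for milestone, time_of_day in timeline.items()
        if milestone != "detected"
    }


if __name__ == "__main__":
    for name, timeline in TIMELINES.items():
        print(name, elapsed_since_detection(timeline))
```

With these inputs, full availability was restored 76 minutes after detection in Incident 1 and 40 minutes after detection in Incident 2, and the issue was identified after 39 and 10 minutes respectively.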
Remediation and follow-up steps
We are committed to maintaining the highest levels of service availability. A thorough post-mortem analysis has been conducted, and we have identified the following short-term and long-term mitigation strategies:
Short-Term:
- Azure Remediation: Because the Azure-managed ingress controller is not directly managed by Nevis, we are collaborating with Microsoft Azure to analyze, define, and implement a permanent solution. We are awaiting their detailed remediation plan and are escalating the issue to ensure prompt resolution.
Long-Term:
- Multiple Active Public Entry Points: To enhance redundancy and load balancing, we plan to implement multiple active public entry points. This will ensure that even if one entry point experiences IP allocation issues, traffic will continue to be served by the remaining active entry points.
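The idea behind multiple active entry points can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the planned implementation: the `EntryPoint` class, the entry point names, the round-robin policy, and the sample address (taken from a range reserved for documentation) are all hypothetical. An entry point whose `public_ip` is `None` stands for one that was not allocated an address.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import List, Optional


@dataclass(frozen=True)
class EntryPoint:
    name: str                  # hypothetical label for a public entry point
    public_ip: Optional[str]   # None models a failed IP allocation


def healthy(entry_points: List[EntryPoint]) -> List[EntryPoint]:
    """Keep only the entry points that currently have a public IP."""
    return [ep for ep in entry_points if ep.public_ip]


def route(entry_points: List[EntryPoint], request_count: int) -> List[str]:
    """Spread requests round-robin over the healthy entry points.

    Raises RuntimeError when no entry point has an address, which is the
    situation a single-entry-point setup reaches after one failure.
    """
    candidates = healthy(entry_points)
    if not candidates:
        raise RuntimeError("no entry point has a public IP; region unreachable")
    rotation = cycle(candidates)
    return [next(rotation).name for _ in range(request_count)]


if __name__ == "__main__":
    entry_points = [
        EntryPoint("entry-a", None),            # allocation failed
        EntryPoint("entry-b", "198.51.100.7"),  # still serving
    ]
    print(route(entry_points, 3))
```

In this model the loss of `entry-a` shifts all requests to `entry-b` instead of making the region unreachable; only the simultaneous loss of every entry point raises an error.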
Conclusion
The recent outages in Zurich, caused by an Azure ingress controller issue, impacted all users. We are addressing the immediate issue with our cloud provider and will be implementing long-term solutions, including redundant entry points, to ensure improved service availability.
We apologize for the disruption caused by these incidents. While our team responded swiftly to address each of them, we understand that any downtime has an impact on our customers and their users.
The Authentication Cloud experienced two recent outages affecting the Zurich region:
- February 18, 2025: Outage occurred at 15:30 CET and was resolved by 17:30 CET.
- February 25, 2025: Outage occurred at 15:50 CET and was resolved by 16:50 CET.
Preliminary Findings
Our initial investigations indicate that instances were being assigned incorrect IP addresses/hostnames by an unmanaged automated node within our cloud service provider’s infrastructure. We have escalated this issue to the provider and are working diligently to determine the root cause.
Next Steps
- Thorough Investigation: We are conducting a comprehensive investigation to fully understand the cause of these outages and identify potential contributing factors.
- Postmortem Analysis: We will conduct a postmortem analysis to identify actions that prevent similar incidents in the future.
- Cloud Provider Collaboration: We are actively collaborating with our cloud service provider to address the underlying issue with their automated node.
Conclusion
We sincerely apologize for the disruption and inconvenience these outages caused. We understand that any downtime can be disruptive to your operations, and we are committed to providing a highly available and reliable service. We will continue to provide updates as our investigation progresses and implement necessary measures to prevent future occurrences.
After closely monitoring the service, we are pleased to announce that our operations have returned to normal. The issue has been successfully resolved, and our product is functioning as expected.
We have successfully identified the cause of the service disruption and have deployed a solution. Our team is actively monitoring the situation to ensure the issue is fully resolved.
Our team is diligently analyzing the service disruption impacting our product. We apologize for the inconvenience caused and appreciate your understanding. We are committed to resolving the issue and will provide updates as soon as possible.
All Instances are down in Zürich[CH] region
Severity: Full outage
Location: Switzerland
Affected Services: All Authentication Cloud instances in Zürich[CH] region
Summary: All Authentication Cloud instances in the Zürich[CH] region are currently down. Our team is actively investigating the issue to restore service as quickly as possible. We apologize for the inconvenience caused and appreciate your patience.
Impact: During this incident, the affected service is inaccessible, resulting in disruption of functionality for our users. We understand the impact this may have on your workflow and sincerely apologize for any inconvenience caused.