Historical record of incidents for SecureAuth Service
Report: "IAM SaaS US Region - Something has gone wrong"
Last update: Our team of engineers are investigating. Firing - The component IAM SaaS US Region is down and we're working on it.
Report: "IAM SaaS US Region - Something has gone wrong"
Last update: The problem has been resolved. Resolved - The component IAM SaaS US Region is down and we're working on it.
Our team of engineers are investigating. Firing - The component IAM SaaS US Region is down and we're working on it.
Report: "IAM SaaS US Region - Something has gone wrong"
Last update: Our team of engineers are investigating. Firing - The component IAM SaaS US Region is down and we're working on it.
Report: "IAM SaaS US Region - Something has gone wrong"
Last update: The problem has been resolved. Resolved - The component IAM SaaS US Region is down and we're working on it.
Our team of engineers are investigating. Firing - The component IAM SaaS US Region is down and we're working on it.
Report: "IAM SaaS US Region - Something has gone wrong"
Last update: The problem has been resolved. Resolved - The component IAM SaaS US Region is down and we're working on it.
Our team of engineers are investigating. Firing - The component IAM SaaS US Region is down and we're working on it.
Report: "IAM SaaS US Region - Something has gone wrong"
Last update: Our team of engineers are investigating. Firing - The component IAM SaaS US Region is down and we're working on it.
Report: "IAM SaaS US Region - Something has gone wrong"
Last update: Our team of engineers are investigating. Firing - The component IAM SaaS US Region is down and we're working on it.
Report: "Authentication Issues for Hybrid/ On-Prem Customers using New Experience Realms"
Last update: We have identified an issue affecting a subset of Hybrid / On-Prem customers. Authentication attempts using the new experience may fail and display an 'Invalid User' error to end users due to expired certificates, even after ACRU has been completed. To avoid being impacted, refer to the steps in the following knowledge base article to deploy the fix. (https://support.secureauth.com/hc/en-us/articles/37586282092820-Authentication-attempts-against-New-Experience-Datastores-fail-with-Invalid-User-after-SecureAuth-G3-Intermediate-Certificates-have-expired) If you have any questions regarding this issue or require assistance with applying the fix, please log a ticket at https://support.secureauth.com. Our teams are on standby and are ready to assist.
Report: "Authentication Issues for Hybrid/ On-Prem Customers using New Experience Realms"
Last update: We have identified an issue affecting a subset of Hybrid / On-Prem customers. Authentication attempts using the new experience may fail and display an 'Invalid User' error to end users due to expired certificates, even after ACRU has been completed. To avoid being impacted, refer to the steps in the following knowledge base article to deploy the fix. (https://support.secureauth.com/hc/en-us/articles/37586282092820-Authentication-attempts-against-New-Experience-Datastores-fail-with-Invalid-User-after-SecureAuth-G3-Intermediate-Certificates-have-expired) If you have any questions regarding this issue or require assistance with applying the fix, please log a ticket at https://support.secureauth.com. Our teams are on standby and are ready to assist.
Report: "IAM SaaS EU Region - Something has gone wrong"
Last update: The problem has been resolved. Resolved - The component IAM SaaS EU Region is down and we're working on it.
Our team of engineers are investigating. Firing - The component IAM SaaS EU Region is down and we're working on it.
Report: "IAM SaaS EU Region - Something has gone wrong"
Last update: The problem has been resolved. Resolved - The component IAM SaaS EU Region is down and we're working on it.
Our team of engineers are investigating. Firing - The component IAM SaaS EU Region is down and we're working on it.
Report: "IAM SaaS EU Region - Something has gone wrong"
Last update: The problem has been resolved. Resolved - The component IAM SaaS EU Region is down and we're working on it.
Our team of engineers are investigating. Firing - The component IAM SaaS EU Region is down and we're working on it.
Report: "CIAM - Email Delivery Issues"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are investigating degraded performance for email delivery by CIAM. Please report any issues with receiving emails to support@secureauth.com.
Report: "Quality Degradation - Fraud Service - Number Insights"
Last update: This incident has been resolved. Please reach out to support@secureauth.com with any questions.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are investigating an increase in error logs around the Number Insight functionality of our Fraud Service. You may experience quality degradation around this service during this time. We will provide an update once the new version is available. Please reach out to support@secureauth.com with any questions.
Report: "Authenticate Android Issues"
Last update: This incident has been resolved.
Version 25.1.21 of SecureAuth Authenticate for Android has been released to the Google Play Store. Impacted users who have not deleted the app will have their preexisting enrollments restored after installing the update. If you have any questions or are still experiencing issues, please reach out to Support at https://support.secureauth.com.
We are continuing to monitor for any further issues.
A small percentage of Android users are experiencing missing enrolled accounts. A fix has been submitted to the Google Play Store and will be available shortly. We strongly suggest you do not delete the app if you are experiencing this issue. We will provide an update once the new version is available. Please reach out to support@secureauth.com with any questions.
Report: "Mobile Carrier Widespread Outage - Verizon Wireless - United States (US)"
Last update: Verizon has reported that they have fully restored their services.
We are aware of a widespread outage affecting Verizon Wireless in the United States. This may impact voice and SMS delivery to Verizon users.
Report: "New Mobile App Version Issues"
Last update:
**RCA – SecureAuth Authenticate iOS Release**

**Problem Description:**
On September 19, 2024 at 10:00 AM PDT, SecureAuth released SecureAuth Authenticate version 25.0.18 to the iOS App Store. The release was targeted to 1% of the user base with the previous version installed; however, users could still download the new version manually if desired. At approximately 11:00 AM PDT, we received internal and external reports that version 25.0.18 of SecureAuth Authenticate had broken preexisting enrollments, and that users were unable to re-enroll to resolve the issue.

**Cause:**
Within the update process for SecureAuth Authenticate version 25.0.18, a format conversion of account data to a new arc-6 architecture is performed. The format conversion process failed due to missing group entitlements within the project configuration. This resulted in the SecureAuth Authenticate App’s inability to properly load account information, thus breaking functionality for end users. While no account data was lost, previous enrollments were inaccessible, and users were prompted to enroll, but could not.

The missing entitlements were not discovered during QA because configurations were cached inside Apple’s TestFlight system, where the application was being tested. The only way to clear this cache is a factory reset of the device. Unknown to developers, these cached configurations held onto the group entitlements that were not present in the GA version of the application that was released.

**Recovery:**
The engineering team initiated resolution efforts on two fronts:

Front 1: Revert to the previous version as an initial interim mitigation to restore functionality to impacted users as quickly as possible, while a separate team focused on providing a permanent fix for the following release. However, upon revert efforts, it was determined that this option was not viable due to technical complications brought on by the update and Apple compliance standards related to the Apple Watch application. Focus of efforts quickly shifted to Front 2.

Front 2: Identify the cause of the format conversion failure and implement the fix. Upon successful completion of QA validation, submit to the iOS App Store for urgent review and publication.

**Resolution:**
The format conversion process failed due to missing group entitlements within the project configuration. The fix reapplied the necessary group entitlements, restoring the app’s ability to access the account data.

**Timeline:**
Sep 19, 2024
• 10:00 AM PDT – SecureAuth Authenticate version 25.0.18 released in iOS App Store
• 11:00 AM PDT – Internal teams discover the issue with the App release and Engineering teams are notified
• 11:09 AM PDT – Incident bridge started and Engineering teams begin investigating the issue
• 11:15 AM PDT – Engineering teams begin efforts to revert the App to the previous version in the iOS App Store
• 11:20 AM PDT – Status Page updated to inform customers to hold off on application updates until further notice
• 11:20 AM PDT – Engineering teams continue to investigate the root cause of the issue while also working on reverting to the previous version on the iOS App Store
• 12:30 PM PDT – Confirmed that the issue was not isolated to iOS 18
• 1:00 PM PDT – Discarded rollback option due to complications with Apple compliance standards for the Wearable App
• 1:00 PM PDT – Engineering teams refocus on providing a patch for the Authenticate app
• 2:05 PM PDT – Cause of the issue identified as a failure of the format conversion process due to missing group entitlements within the project configuration
• 2:25 PM PDT – Fix is implemented and QA validation is initiated
• 3:00 PM PDT – New build submitted to the App Store for Urgent Review
• 4:12 PM – New version is deployed

Impacted users were notified to download version 25.1.18 with preexisting enrollments intact. Support and Engineering teams continue to monitor the situation closely with customers.

**Corrective Actions:**
• Work with Apple to review their TestFlight requirements and determine why configurations were being cached, discover the standard duration of the cache period, and identify the steps needed to ensure the cache is cleared and updated configurations are being used during TestFlight QA processes.
• Improve the current Pull Request and Code Review process in Mobile Development in order to mitigate the impacts of missing configurations and improve code release standards.
• Add test cases to our QA suite to cover fresh devices, as it was determined that if wiped or “new” devices had been used for testing, the cached configurations would have been discovered.
We are now moving this incident to resolved. Support and Engineering teams will continue to monitor the situation closely. If you have any further issues, please contact support at https://support.secureauth.com. An RCA will be provided shortly.
Version 25.1.18 of the Authenticate App has been released. Impacted users must update to this version to restore their accounts. If this does not resolve your issue or if you have additional questions, please report the issue at https://support.secureauth.com. An RCA for this incident will be provided once our internal investigation is complete.
Status Update: 1. We attempted to revert to the previous version as an initial interim mitigation to restore functionality to impacted users as quickly as possible. Upon revert efforts, it was determined that this option is not viable due to technical complications brought on by the update. 2. The team has now shifted focus to providing a patch for the Authenticate app. Our development teams are actively working to release the hotfix as soon as possible. We will continue to provide updates to you as we receive more information.
We are continuing to work on a fix for this issue.
We've submitted a rolled back version of the mobile application to Apple. We are awaiting Apple's review and approval. We will update the status once available in the App Store.
We've been alerted to an issue with the new release of our Authenticate application. We believe this issue is for iOS only. Android devices do not seem to be impacted at this time. We are rolling back our versioning. Until further notice, do not update to the new version of the Authenticate application, 25.0.18. If you are affected by this, other MFA methods (e.g., SMS, Voice) outside of the mobile app remain operational.
Report: "Cloud Service Issue"
Last update:
**Polaris Twilight Outage RCA - September 12, 2024**

**Problem Description**
On September 11, 2024 at 7:16PM, the SecureAuth Cloud Infrastructure encountered widespread connection issues with database systems, which resulted in authentication failures for impacted customers.

**Cause**
The SecureAuth Cloud Operations team was alerted to connection issues with the Twilight service (an integral service on which other microservices rely). Upon investigation, we identified that the service was experiencing database latency due to CPU utilization spikes on the database. The CPU spikes triggered mass restarts of the Twilight service, which in turn caused extended CPU spikes on the database.

The root cause was legacy dependencies on the database that were negatively affected during a redistribution exercise related to the Vault migration performed on August 29, 2024. Those legacy dependencies were originally determined to be benign, and were therefore assumed to have no impact on the customer base after the Vault migration. It was determined that the CPU spikes were caused by the interface between the service and the database, in the form of health checks that created a snowball effect, resulting in the aforementioned issues with the Twilight service.

Due to the nature of this issue, not all customers were immediately impacted; however, the recovery and resolution of this issue impacted all customer cloud services as a result of the scaling operations.

**Recovery**
To mitigate this issue, the cloud services were scaled down to alleviate database pressure. Once the database stabilized, the services were scaled back up in a controlled manner until all services were fully restored.

**Timeline:**
Sep 11, 2024
* 7:16PM PST – Twilight connection issues begin and alerts are triggered
* 7:17PM PST – Cloud Operations team joins bridge to investigate alerts
* 7:27PM PST – Issue is understood and mitigation efforts begin
* 7:27PM PST – Scale down of cloud services to alleviate database pressure begins
* 7:40PM PST – Scale down complete and database CPU utilization stabilizes
* 7:41PM PST – Controlled (staggered) scale up of cloud services begins
* 8:30PM PST – Controlled scale up of cloud services is completed
* 8:40PM PST – All services in running state
* 9:00PM PST – Validation testing complete and incident resolved
* Post-9:00PM PST – Continued to monitor closely while working with some customers as needed to resolve intermittent issues caused by the incident

**Corrective Actions**
* Engineering to review and improve the Twilight to Cockroach Database interface and determine a more elegant solution to the health check actions that would diminish the effect of mass restarts of the service during periods of high usage.
* Leadership review of database alternatives to the solution architecture.
* Improve decision-making accuracy by increasing team knowledge around legacy systems to ensure end-to-end awareness of the potential impacts of configuration changes that are assumed to be benign.
* Introduce additional gates into the existing CAB (Change Advisory Board) process, including additional Engineering leadership and cross-functional Subject Matter Experts.
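The controlled (staggered) scale up described in the timeline can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tooling used during the incident: the function names, the step size, and the `set_replicas` and `database_is_healthy` callbacks are all hypothetical.

```python
from typing import Callable, List


def staggered_scale_up_plan(current: int, target: int, step: int) -> List[int]:
    """Return the replica counts to apply in order, raising by at most `step` each time."""
    if step <= 0:
        raise ValueError("step must be positive")
    plan = []
    replicas = current
    while replicas < target:
        replicas = min(replicas + step, target)
        plan.append(replicas)
    return plan


def apply_plan(
    plan: List[int],
    set_replicas: Callable[[int], None],      # hypothetical: applies a replica count
    database_is_healthy: Callable[[], bool],  # hypothetical: e.g. a CPU utilization check
) -> bool:
    """Apply each step only while the database stays healthy; stop early otherwise."""
    for replicas in plan:
        if not database_is_healthy():
            return False
        set_replicas(replicas)
    return True
```

For example, `staggered_scale_up_plan(0, 10, 4)` yields the steps 4, 8, 10, so the database sees the load return in increments instead of all at once.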
The incident has been resolved. For any remaining issues or questions please reach out to support@secureauth.com.
A fix has been implemented and all issues have been resolved. We are continuing to monitor.
We have identified the issue and are in the process of implementing a fix.
We are investigating an issue with our Cloud Services and will post updates as we gain a better understanding of the issue.
Report: "Service Interruption - Passwordless Platform (Arculix) - 12:01pm-12:13pm PDT"
Last update: There was an interruption of service for the Passwordless platform, previously known as Arculix, between 12:01pm PDT and 12:13pm PDT. Service was restored and no further interruptions are expected. If you have any further questions or inquiries, please reach out to support at https://support.secureauth.com.
Report: "SMS Quality degradation - United States (US)"
Last update: This incident has been resolved.
Our SMS provider has informed us that they are currently experiencing quality degradation with SMS delivery within the US. They are working directly with carrier partners and will provide updates as soon as a solution is in place. We will provide an update as we learn more. For more information directly from our provider: https://status.telesign.com/incidents/cbv5188zqwvm
Report: "SMS Quality degradation - United States (US) - Verizon Wireless"
Last update: We received notification from our SMS OTP providers that the network issues previously reported for Verizon Wireless in the United States (US) have been resolved. Verizon Wireless end users in the US should no longer experience delays receiving SMS messages, as all services are functioning properly again. If you have additional questions, please feel free to contact support at https://support.secureauth.com.
Our SMS provider has informed us that they are currently experiencing quality degradation with SMS delivery for Verizon within the US. They are working directly with Verizon and will provide updates as soon as a solution is in place. We will provide an update as we learn more. For more information directly from our provider: https://status.telesign.com/incidents/7tw76kkhc7tj
Report: "T-Mobile SMS Performance Degradation"
Last update: Our SMS provider has confirmed that the carrier issue with T-Mobile has been resolved. If you continue to experience delivery degradation with T-Mobile, please let us know by opening a support ticket at https://support.secureauth.com
Our SMS provider has informed us that they are currently experiencing quality degradation with SMS delivery for T-Mobile within the US. They are working directly with T-Mobile and will provide updates as soon as a solution is in place. We will provide an update as we learn more. For more information directly from our provider: https://status.telesign.com/incidents/60tx8c6qtdqb
Report: "Investigating Increase in Alerts"
Last update: This incident has been resolved.
We are currently investigating an increase in alerts for the SaaS/Full Cloud Identity Platform environment. If you have any questions, please reach out to our Global Support Team at support@secureauth.com or open a case through our portal at https://support.secureauth.com
Report: "Intermittent DNS Issues"
Last update:
General Statement
During the procurement of a new production region (APAC), an incorrect DNS change was applied to one of our DNS providers (AWS Route53). This led to intermittent DNS resolution issues, affecting customer reachability of IdP services. Due to the nature of DNS propagation and the intermittent nature of the issue, our monitors were delayed in detecting the problem. Once the problem was identified, the DNS change was reverted and DNS propagation of the fix began, completing at approximately 10:30AM PST.

Timeline of Events
Mar 26 7:30 PST Incorrect DNS change applied to Route53
Mar 26 8:30 PST DevOps team began diagnosing the issue
Mar 26 8:55 PST DevOps team removed duplicate record
Mar 26 10:30 PST DNS propagation appears complete

Corrective Actions
Additional approvals required for any DNS changes.
Review opportunity to expand endpoint monitoring from multiple regions and DNS name servers.
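One way to picture the "multiple DNS name servers" monitoring idea is to compare the answers collected from each name server and flag any that disagree with the majority. The sketch below is only an illustration: the resolver labels and addresses are made-up placeholders (documentation IP ranges), and how the answers are gathered is left out entirely.

```python
from collections import Counter
from typing import Dict, FrozenSet, List


def find_inconsistent_resolvers(answers: Dict[str, FrozenSet[str]]) -> List[str]:
    """Return the resolvers whose record set differs from the most common record set."""
    if not answers:
        return []
    majority, _count = Counter(answers.values()).most_common(1)[0]
    return sorted(name for name, records in answers.items() if records != majority)


# Hypothetical observations: one name server still serves an extra (duplicate) record.
observed = {
    "ns-a": frozenset({"192.0.2.10"}),
    "ns-b": frozenset({"192.0.2.10"}),
    "ns-c": frozenset({"192.0.2.10", "198.51.100.7"}),
}
print(find_inconsistent_resolvers(observed))
```

A check like this, run from several regions, would surface an intermittent mismatch even when most lookups still succeed.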
We are aware of ongoing intermittent DNS issues affecting various services. We have found the root of the issue and implemented a fix. Please allow for DNS propagation of the fix. We will continue to monitor.
Report: "Issues with Polaris services"
Last update: This incident has been resolved.
The fix has been implemented and services are back online. We will continue to monitor.
We have isolated the issue and a fix is being implemented.
We are continuing to investigate these issues. A potential root cause has been found.
We are receiving widespread alerts around our cloud services. We are investigating.
We are currently investigating this issue.
Report: "SecureAuth Cloud Services - 3rd Party Mobile Carrier"
Last update: This incident has been resolved. End users should no longer experience delays receiving SMS messages.
The issue has been identified and the provider has implemented a fix. There is no longer any delay in receiving SMS messages.
We are continuing to investigate the issue. We have reached out to the provider to get an ETA for addressing it.
We are currently investigating an issue with the provider. You may experience some delay in receiving codes via the SMS service.
Report: "Multiple US Operators Maintenance on Voice and SMS"
Last update: This incident has been resolved.
Several US operators have maintenance in progress, which is expected to end at 9am Pacific time today. Because this is operator maintenance and not our providers' maintenance, we have to wait for it to finish.
Report: "SMS Outages with US Mobile Providers"
Last update: The SA IdP service encountered issues with SMS One Time Password (OTP) deliveries to US mobile carriers. Our SMS vendor informed us of potential delivery delays. We have since received notification from our SMS vendor that the previously reported issue is resolved. A final Root Cause Analysis (RCA) document will be provided once we receive detailed insights from our SMS vendor regarding the root cause from both their end and that of the mobile carriers.
The issue has been resolved - a final RCA will follow as we work with our vendor on the overall root cause.
Our vendor has reported the issue has been resolved. We will continue to monitor the situation. Please report any issues to support@secureauth.com
We have reports from our SMS/Voice vendor that they are seeing some US Mobile Providers impacted by this.
We are receiving reports of SMS issues with US providers. Delivery of SMS MFA/OTP codes may be impacted.
Report: "PUSH service Errors"
Last update:
**Problem Description:**
On October 28, 2023 at approximately 3:00pm Pacific Time, the certificate utilized between the SecureAuth Push Service and the Apple Push Service expired and SecureAuth could no longer send push requests to iOS devices. Android devices were not impacted.

**Resolution:**
SecureAuth regenerated the certificate and uploaded it to the Push Service, and iOS devices were once again receiving Push Notifications for authentication.

**Corrective Actions:**
* Implement fix for certificate renewal notification to necessary stakeholders
* Implement additional certificate monitors and system alerts
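A certificate monitor of the kind mentioned in the corrective actions could, in its simplest form, compare a certificate's expiry date against a warning window. The sketch below is an assumption-laden illustration rather than the monitor that was implemented: the function names, the 30-day default, and the example date string are hypothetical. It relies only on Python's standard `ssl.cert_time_to_seconds`, which parses the `notAfter` text format used in certificates.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate notAfter string such as 'Oct 28 22:00:00 2023 GMT'."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires within the warning window (hypothetical default: 30 days)."""
    return days_until_expiry(not_after, now) <= warn_days


# Illustrative values only: a check performed ten days before the expiry time.
example_not_after = "Oct 28 22:00:00 2023 GMT"
check_time = datetime(2023, 10, 18, 22, 0, 0, tzinfo=timezone.utc)
print(needs_renewal(example_not_after, check_time))
```

Running such a check on a schedule and alerting the stakeholders when it returns true gives advance notice before the expiry date, instead of relying on a renewal notification alone.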
The issue affecting iOS devices is now resolved. An RCA will follow.
We identified the issue. We are implementing the fix.
We have identified the issue and are working on implementing a fix.
We are currently investigating a Push Service issue that is impacting iOS devices. Other devices are working without any issues. We will continue to provide updates.
Report: "Intermittent Issues with the SMS Service"
Last update: The issue has been resolved - RCA to follow.
We are currently investigating an SMS Service issue that is impacting multiple customers - other OTP methods appear to be working without issue.
Report: "Potential Issues with Telephony service in Australia"
Last update: This issue has been resolved and verified by users in Australia.
The vendor is reporting the issue is resolved - for those customers impacted, please test the Voice OTP.
We are continuing to investigate this issue.
We are currently working with our telephony vendor on reported issues with the telephony service (Voice OTP) affecting some users in Australia.
Report: "Issues with Polaris services"
Last update:
**RCA – EKS Outage - 09022023**

**Leadership Response:**
We apologize for the inconvenience and the difficulty your teams faced as you gave up your time with friends and family to communicate and resolve this incident with your internal users and customers. This is not an event we take lightly, and we had all hands on deck to resolve the issue as quickly and efficiently as possible. Your experience with SecureAuth is very important to us. We value our partnership and the trust you continually put into our solution to protect your teams and your customers. We will continue to strive for excellence and make any changes necessary to deliver the stability and security that you require to be successful.

**Incident Summary:**
During a planned maintenance window between 06:00 and 12:00 UTC on September 2, 2023, a majority of SA IdP tenants on the EKS cluster failed, resulting in an outage. All customers, including cloud deployments and hybrid deployments that use cloud services, were down or degraded during the incident.

As a routine update of backend services was being performed, a networking component plugin failed its update, which caused the Vault service, which stores application keys, to fail. Since most production pods rely on the Vault service to obtain secrets, none of the production pods could come online. To resolve this, we reinstalled the networking component successfully, which allowed Vault to communicate with the rest of the system.

Once this issue was resolved, there was an influx of backlogged communication to the backend database as all the production pods came back online. This overloaded the database connection pool, causing additional bandwidth issues that impacted response times. To accelerate recovery of the entire environment, we temporarily reduced the number of active pods, which allowed the system to process the backlog. At approximately 15:50 UTC, all services were restored, and a postmortem of the incident began.

**Root Cause:**
* HashiCorp Vault failed to start after the EKS cluster update. This required manual intervention for the VPC-CNI and CoreDNS add-ons. Vault is a critical dependency that many other cloud services need in order to start.
* Once Vault was operational, thousands of pods attempted to come online at once, and many of them need to connect to one or more databases. The database servers became overwhelmed, preventing all services from coming back online.
* A bug was discovered in the AMI used for production EKS worker nodes that causes auto-scaling of deployment replicas to grow to maximum capacity. This bug over-reports how much CPU is being used by each pod. This, in turn, generated about three times the normal pod count, further complicating recovery.

**Resolution:**
* Updated the VPC-CNI
* Restarted CoreDNS and Vault
* Throttled replication events to prevent overload on the databases
* Deployed a workaround to the auto-scale bug to prevent overruns on connections

**Corrective Actions:**
* Instrumenting the upgrade process to detect CNI and Vault failures
* Deploying future upgrades in isolated pod clusters to reduce impact to customers
* Deploying a full resolution for the auto-scale bug
* Completing any outstanding EKS upgrade tasks
* Implementing a new communication protocol to inform customers of future incidents in a more timely and comprehensive manner
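To see why over-reported CPU inflates the pod count, consider the proportional scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The RCA does not state which autoscaler or settings were in use, so the sketch below is an assumption for illustration only, and all numbers in it are hypothetical. It also omits details such as the autoscaler's tolerance band.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,  # observed average CPU utilization per pod, in percent
    target_metric: float,   # utilization the autoscaler tries to maintain, in percent
    max_replicas: int,
) -> int:
    """Simplified proportional scaling rule, clamped between 1 and max_replicas."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(1, min(desired, max_replicas))


# Hypothetical numbers: true utilization sits at the 50% target, so no scaling is needed.
print(desired_replicas(10, 50.0, 50.0, max_replicas=100))
# If the same load is over-reported threefold, the rule asks for three times the pods.
print(desired_replicas(10, 150.0, 50.0, max_replicas=100))
```

Under this rule a threefold over-report translates directly into a threefold replica count, up to the configured maximum, which is consistent with the roughly tripled pod count described above.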
We will be continuing to watch the platform; however, we have had no further indications of system issues. Please contact support@secureauth.com if you have any issues. A full Root Cause Analysis (RCA) will be forthcoming.
The outage you’ve been experiencing has been resolved, and all affected parties have been restored. We are monitoring the status closely. We appreciate your patience with our teams as we worked diligently to bring you back online. A formal root cause analysis (RCA) will be forthcoming, but we wanted to provide as much information as possible upfront while we work on the official RCA, which will be provided through our Customer Experience team. During a routine update of backend services, a networking component plugin failed its update, which caused the Vault service (storage of application keys) to fail. We reinstalled the networking component successfully, which allowed Vault to communicate with the rest of the system. This issue caused an influx of communication to the backend database that overloaded the system and also resulted in an excessive number of pods being spun up, causing additional bandwidth issues that further aggravated the response times and connection issues. To resolve this, we manually reduced the number of active pods, which allowed the system to slowly recover from the outage.
The infrastructure has been recovered and loads are returning to normal. We will continue to monitor.
Some services have been restored at this time, but we are continuing to troubleshoot the issue.
We have identified the issue and are working to restore service. We will update when we have more information.
We have discovered an infrastructure issue, and AWS support is involved in helping to resolve it.
We believe we have a focal point for the investigation and will post an update as soon as we have more info.
We are continuing to work on a fix for this issue.
We are currently investigating intermittent issues with Polaris services. Current services that may be impacted are full cloud and hybrid SA IdP installations. We will provide updates as we receive more information. While our maintenance window has not yet finished (maintenance window ends 8 AM Eastern Time), we are notifying customers of a potential issue. We will keep you updated as we progress.
Report: "NGE certificate issue"
Last update: This incident has been resolved.
We've restarted an application service and we believe that addressed the issue.
We are tracking an issue affecting NGE certificates.
Report: "OTP delivery issues with SMS and Push services."
Last update: This incident has been resolved.
Customers report services are working.
Only customers on older versions of SA IdP (9.3 and earlier) are potentially impacted. Those versions use a separate intermediate infrastructure that is being investigated.
We are continuing to investigate the issue.
We are currently investigating OTP delivery issues with SMS and Push services.
Report: "SMS/Telephony Issues in Hong Kong"
Last updateThis incident has been resolved.
The provider has implemented a fix and SMS should be delivering successfully to Hong Kong.
Our SMS/Telephony provider is having issues sending SMS/Telephony messages to users in Hong Kong. The provider is working on the issue.
Report: "SMS and Telephony outage"
Last update# Incident Description Starting around 1:12 PM on June 8, our monitors detected an issue with our primary, third-party SMS (short message service) provider used for sending MFA (multi-factor authentication) codes. This issue caused intermittent delays when sending messages. While we have a secondary provider, the automatic failover did not occur as expected due to the nature of the primary provider’s SMS issue: the primary acknowledged our system’s call to send a message and queued it but ultimately failed to deliver it promptly. Once our team detected the issue via our monitoring system and performed an investigation, we initiated a manual failover to our secondary SMS provider around 2:34 PM to reduce service disruptions for our customers. Our primary provider resolved their issue at 5:06 PM, and we manually switched back around 8:43 PM. # Root Cause One of our third-party service providers experienced an intermittent service disruption which impacted MFA SMS delivery. # Follow Up Actions We apologize for any disruption in services during this incident. We will review our SMS provider failover thresholds to make them more sensitive to prevent this type of challenge from affecting our customers again. # Reference Telesign Status Page Incident: [https://status.telesign.com/incidents/t201ymq6kl1f](https://status.telesign.com/incidents/t201ymq6kl1f) All dates/times are UTC.
We have switched back to our primary provider and all services are operational
We are continuing to monitor for any further issues.
We are continuing to monitor for any further issues.
Switching to the secondary provider for SMS appears to have resolved the issue for now. We will continue to monitor.
The issue has been identified as a fault on the telephony provider's side, so we are switching to a backup provider.
Our Telephony provider is having an issue with SMS and Telephony.
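The RCA for this incident notes that the primary provider acknowledged send requests but did not deliver them promptly, so a failover keyed to API errors never triggered. The following is a minimal sketch of a failover check keyed to delivery receipts instead. The `SmsAttempt` type, the `should_fail_over` function, and both thresholds are hypothetical illustrations, not SecureAuth's implementation or any provider's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmsAttempt:
    accepted_at: float             # seconds; when the provider acknowledged the request
    delivered_at: Optional[float]  # seconds; None if no delivery receipt has arrived


def should_fail_over(attempts, now, max_delay_s=30.0, max_late_ratio=0.2):
    """Return True when too many accepted messages are late or undelivered.

    An attempt counts as late if its delivery receipt took longer than
    max_delay_s, or if no receipt has arrived and max_delay_s has already
    elapsed since acceptance. Both thresholds are illustrative assumptions.
    """
    judged = 0
    late = 0
    for attempt in attempts:
        if attempt.delivered_at is not None:
            judged += 1
            if attempt.delivered_at - attempt.accepted_at > max_delay_s:
                late += 1
        elif now - attempt.accepted_at > max_delay_s:
            judged += 1
            late += 1
        # Attempts still inside the delay window are not judged yet.
    return judged > 0 and late / judged > max_late_ratio
```

A check of this shape treats "accepted but never delivered" as a failure signal, which is the case the automatic failover reportedly missed.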
Report: "SA IdP Dashboard Data is not Updating"
Last updateWe apologize for any inconvenience experienced during this incident, and we treat such disruptions seriously. Below is a summary of the issue, a root cause, and what we are doing to improve service going forward. Incident Description A TLS certificate that is used on an endpoint for SA IdP dashboard metric data collection expired on May 9. This meant HTTPS connections to this endpoint failed and caused SA IdP dashboard metric data to become outdated, degrading service for part of the administrative console. After some customers reported an issue with outdated metrics, the SecureAuth team investigated and fixed the issue by renewing the TLS certificate. The new certificate was in place by 6:10 PM on May 12. Data was refreshed around 1:30 AM on May 13, allowing the normal metric refresh period to run its course. Root Cause The DevOps team recently replaced several expiring certificates, and this one was missed due to a monitoring gap. Corrective Actions The DevOps team is following up by ensuring the TLS certificate monitoring gap is addressed and by reviewing additional monitoring requirements for the endpoint in question. Note: All times/dates are UTC.
Dashboards are again populating with data - if you continue to have issues, please contact support@secureauth.com
A fix has been implemented and we are monitoring the results.
The issue was an expired certificate within the backend services. The certificate has been updated, and the job that refreshes the dashboards will run later this afternoon. Dashboards should be updated at roughly 1900 PST.
The Dashboard data for the SA IdP is not updating for login information. This is not impacting any SA IdP services or logins, only the reporting mechanism.
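The RCA for this incident attributes the expired certificate to a monitoring gap. As a rough sketch of what an expiry check can look like, the functions below compute the days remaining from a certificate's `notAfter` string using Python's standard `ssl` module. The function names and the 30-day warning window are assumptions for illustration, not the DevOps team's actual tooling.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's notAfter timestamp.

    not_after uses the format found in ssl.SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400.0


def needs_renewal(not_after, warn_days=30, now=None):
    """True when the certificate is expired or inside the warning window."""
    return days_until_expiry(not_after, now) <= warn_days
```

In practice the `notAfter` value could be read from `getpeercert()["notAfter"]` on a live TLS connection to each monitored endpoint.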
Report: "SecureAuth Authentication Services Down"
Last update# Incident Description A routine database automatic scaling operation failed around 9:30 AM ET on 4/19/2023. This caused the database cluster to become unresponsive and fail to accept connections. While data integrity was unaffected, data availability was compromised. The SecureAuth team worked to restore service and forced the scaling operation to finish. Once the scaling operation finished, the database cluster once again was accepting connections and service was restored around 12:30 PM ET on 4/19/2023. # Root Cause It appears that the automatic scaling operation failed to complete due to an issue with the database platform in general. One cluster node was marked unhealthy by the scaling algorithm incorrectly, and this prevented the rest of the cluster nodes from scaling properly. The database cluster began rejecting connections soon thereafter. # Corrective Actions Typically such automatic scaling operations are seamless and without incident. However, given the conditions for auto-scale failure on 4/19/2023, we acknowledge there is still risk with such activity. The SecureAuth team is following up with a database platform vendor for more information regarding the scaling operation failure. In the meantime, any auto-scale operations for this database cluster have been suspended, and the SecureAuth team will monitor capacity and do any scaling activity for this database cluster during planned maintenance windows. Finally, the SecureAuth team is confirming vendor best practices are being followed for any/all connections to this database cluster and will make adjustments as needed. Note: times are Eastern time zone.
All customers should have returned to normal. Please contact SecureAuth Support (support@secureauth.com) if you are still having issues.
Services have been restored and we will continue to monitor the situation to ensure no further issues.
Updated estimated resolution time 1:00pm Eastern Time. Continuing to recover the underlying database infrastructure components.
Identified an issue with the underlying database infrastructure which should be resolved by 12:15pm Eastern Time (9:15am Pacific/16:15 UTC).
The SecureAuth Authentication services have been impacted by an issue within the backend services. We are working with the appropriate teams and third parties to isolate and resolve it.
Report: "SaaS-IDP Connection Issue."
Last update# Incident Description At 4/28/2023 3:15 AM UTC, the DevOps team did a planned, minor update to traffic management appliances during a maintenance window. While no downtime was expected and all configurations were tested in lower environments, the new update in production did not respect the TLS cipher suite configuration as expected. This created a TLS negotiation issue between the SA IdP appliances and the traffic management appliances, blocking some logins from occurring. Unfortunately, the post-change test suite did not detect the issue. After receiving customer reports of difficulties and making first-hand observations, the DevOps team identified the issue and rolled back the change to restore service at 4:15 AM UTC (4/28/23). # Root Cause An update to the traffic management appliances did not obey a declarative configuration for TLS cipher suites. This caused a TLS negotiation failure between SA IdP appliances and the traffic management appliances, blocking some logins. # Corrective Actions We apologize for any inconvenience caused during this disruption in service. The SecureAuth team will pause any further changes to the traffic management appliances until we have sorted out the associated configuration issue (which we plan to solve in the next few weeks). In addition, any future updates to these appliances will be done during a planned maintenance window where we notify customers that there may be downtime expected (regardless of such risk). Finally, we will review our post-change test plan to evaluate adding tests that would discover such trouble points in the future.
We have found the issue and have taken the proper steps to restore service for affected SaaS-IdP tenants. Further details will be provided in an RCA once the operations team concludes the issue investigation.
We are experiencing technical difficulties during our maintenance window with all SaaS-IdP connections.
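The RCA for this incident mentions adding post-change tests that would catch a TLS negotiation failure. One possible shape for such a test is sketched below: attempt a handshake while offering only the cipher suites a given client would offer, and report which client profiles fail. The profile names, function names, and the choice to cap the handshake at TLS 1.2 are assumptions for illustration; the actual appliance and client configurations are not described in the report.

```python
import socket
import ssl


def negotiated_cipher(host, port, ciphers, timeout=5.0):
    """Handshake with host:port offering only `ciphers` (an OpenSSL cipher
    string) and return the negotiated cipher name.

    Raises ssl.SSLError when the two sides share no cipher suite.
    """
    ctx = ssl.create_default_context()
    # Assumption: cap at TLS 1.2, because set_ciphers() does not
    # control the TLS 1.3 suites.
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    ctx.set_ciphers(ciphers)
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.cipher()[0]


def handshake_failures(client_profiles, handshake):
    """Call handshake(ciphers) once per named client profile.

    Returns {profile_name: error_text} for every profile whose handshake
    raised; an empty dict means every profile negotiated successfully.
    """
    failures = {}
    for name, ciphers in client_profiles.items():
        try:
            handshake(ciphers)
        except OSError as exc:  # ssl.SSLError is a subclass of OSError
            failures[name] = str(exc)
    return failures
```

A post-change run could pass `lambda c: negotiated_cipher(host, 443, c)` as `handshake`, with `host` set to the appliance under test, and fail the change if the returned dict is not empty.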
Report: "Certificate creation issue"
Last updateIncident Description * On November 14, multiple customers reported issues with the renewal of PFX personal certificates. Root Cause * SecureAuth utilizes Cloud-based Hardware Security Modules (HSM) through Thales. * Thales was performing maintenance on the Cloud HSM infrastructure over the weekend, which caused the SecureAuth Certificate Authority (CA) systems to be unable to connect to the Cloud HSM for key validation. * The SecureAuth CAs regularly renew the Certificate Revocation Lists (CRLs) for multiple CAs. The delta CRLs expire after approximately 48 hours, which is why the Thales maintenance had no impact until Sunday evening, with customers not being impacted until Monday morning. * To further exacerbate the problem, the alerts generated by the monitoring systems were not going to the location the L1 team monitors. Corrective Actions * Restarting the CAs on all of the “NGE” servers corrected the issue. * Documentation was not completely up-to-date on the configuration of the multiple-region deployment of the NGE certificates. The DevOps team will be reviewing the documentation and updating as necessary. * A review of all DevOps alerts has been conducted to ensure all alerts are going to the location that is actively monitored, rather than only to the Slack channel that also receives the alerts but is not routinely monitored. * The DevOps Team has enrolled in status updates through the Thales Status Page and will review any changes or maintenance posted to that site so that internal testing can be performed to validate that SecureAuth operations are not impacted.
RCA to be posted soon
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are currently investigating reports of our PFX/Certificate creation not working for some customers.
Report: "SMS Delivery Issues Impacting Limited Cell Providers"
Last updateWe have moved to our secondary provider and all carriers appear to be working. Please note, for a few days, the numbers will be coming over as "short codes" - this is normal and we are working with the secondary carrier to get to full numbers soon.
SMS messages are being received from our test devices, continuing to monitor the situation
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
There is an issue with the 3rd party vendor we are using to send SMS messages to the carrier. The issue appears to be with the vendor delivering messages to AT&T
We are investigating an issue with one of the cell providers (AT&T) where users are not able to utilize SMS as an MFA factor.
Report: "SMS Delivery Issues with some networks"
Last updateNo further issues have been reported and internal checks continue to be successful
The 3rd party carriers appear to have resolved the issue. We are monitoring to ensure there are no further issues. If you continue to see any issues, please contact support@secureauth.com
We are currently investigating SMS delivery issues for some cell networks. Our service provider is currently investigating the issue, and we will provide an update as soon as we receive more information.
Report: "Push Service Failures"
Last update# SecureAuth Production Incident Root Cause Analysis (RCA) ## Incident Description The Push to Accept (P2A) service was sending push requests to users; however, approvals of those requests were not getting back to the Identity Platform (IdP) to finish the login process for a user with P2A multi-factor authentication (MFA). All other MFA methods were still available to users. ## Root Cause During normal, automated auto-scaling of the Amazon Elastic Kubernetes Service (EKS), there was an issue with the networking components of the EKS nodes, which caused the Redis databases used within the environment to lose connectivity to other services. It is suspected that the auto-scaling of the EKS environment impacted the core networking and DNS services of the cluster due to a single auto-scaling policy for the entire cluster. ## Resolution * Alerts were generated and investigated by the DevOps Team within minutes of the networking issues; however, these alerts were generated from multiple services, and the cause was eventually traced to the Redis services only after multiple other services had also been cycled. * The Redis Pods were cycled, and the Push Service was restored to normal operation. ## Corrective Actions * The DevOps Team is working to move from Redis pods on EKS to the AWS ElastiCache service to allow for redundancy and additional stability for the service. ETA is the end of 2022. * Additional alerts have been created, and updated procedures have been added to runbooks to minimize a potential future failure. * The DevOps team is modifying the auto-scaling policies to separate the core services into one auto-scaling group (ASG) and the application itself into a separate ASG. This is expected to be completed by the end of October 2022.
This incident has been resolved.
The issue has been resolved, now monitoring
Other MFA methods (TOTP, SMS, etc) are still working
The issue has been identified and a fix is being implemented.
The push service is not accepting approval for push to accept
Report: "SMS/Voice MFA Options for Limited Customers"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have reports of a single customer having SMS/Voice MFA issues. This appears to be an issue with our third-party SMS/Voice Provider.
Report: "Certificate Revocation List Issue"
Last updateThis incident has been resolved, Root Cause Analysis pending
A resolution has been placed, we will provide an RCA here shortly after.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "SMS For Verizon Users"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
There have been reports of intermittent SMS delivery issues impacting US Verizon users.
Report: "SMS Delivery Issues"
Last updateThis incident has been resolved.
We are waiting on our Telephony provider for an RCA and will continue to monitor the service.
The issue has been identified as a problem with our Telephony Provider and was resolved at 13:39 UTC
We are currently investigating this issue.
Report: "Intermittent DNS Provider Issues"
Last updateThe alerts look stable for our DNS resolution, as well as on the provider side.
We are continuing to work on a fix for this issue.
Our DNS provider that had issues has reported the issue is resolved, but we are still seeing alerts bounce back and forth. Due to the inconsistency, we are leaving the issue as identified for now until DNS resolution has stabilized.
One of our DNS providers has had a DNS resolution degradation. This could result in intermittent issues resolving our endpoints.
Report: "Increased API Errors"
Last update**Incident Description**: Customers attempting to log into the Cloud IdP platform were experiencing 504s. **Customer Impact**: A subset of Cloud IdP Customers **Root Cause**: Between 4/28/2022 21:00 UTC – 4/29/2022 01:00 UTC, as part of a routing update to backend services, a number of Broker services used by the Cloud IdP platform failed to update to the necessary version, and the version they remained on did not support other changes being applied. **Resolution**: All failing Broker services were identified and the appropriate version was applied shortly after. **Corrective Actions**: During the upgrade process, the script did not detect a missing flag and therefore failed to upgrade all backend services to the correct version. An audit of the update scripts is being performed to ensure that important missing flags or settings fail the upgrade and allow the operator to take the appropriate corrective actions prior to the upgrade commencing.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are observing an increased number of API Errors with our Cloud IdP Customers. Users may experience issues logging in. We are investigating the issue and will post an update once available.
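The corrective action for this incident is that a missing flag or setting should fail the upgrade before it commences. A minimal sketch of that idea follows: validate every service's settings first, and apply upgrades only if all of them pass. The setting names (`service_name`, `target_version`) and the function names are hypothetical, since the report does not name the actual flags or scripts.

```python
REQUIRED_SETTINGS = ("service_name", "target_version")  # hypothetical names


class PreflightError(Exception):
    """Raised when a required upgrade setting is missing or empty."""


def preflight(settings, required=REQUIRED_SETTINGS):
    """Raise PreflightError if any required setting is absent or empty."""
    missing = [key for key in required if settings.get(key) in (None, "")]
    if missing:
        raise PreflightError("missing required settings: " + ", ".join(missing))


def upgrade_all(services, apply_upgrade):
    """Validate every service first; only then apply any upgrade.

    Running all checks before the first upgrade means a bad entry stops
    the run before any service has been changed.
    """
    for settings in services:
        preflight(settings)
    for settings in services:
        apply_upgrade(settings)
```

Separating the validation pass from the apply pass gives the operator a chance to correct the input before any service is touched.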
Report: "DynDNS Maintenance Affects Cloud Service Availability"
Last updateIncident Description: Any service resolving [us-services.secureauth.com](http://us-services.secureauth.com) failed to resolve the URL when attempting to contact cloud services, including, but not limited to, datastore connection and multifactor services. Root Cause: One of our main DNS providers changed its nameserver IPs. This required an update to our glue records; the outdated records directly caused our loss of name resolution. Resolution/Corrective Actions: Our glue records have been updated to match our DNS provider's new nameserver IPs. We have also subscribed to our DNS provider's notifications of major infrastructure changes on their end to prevent a recurrence.
DynDNS is the DNS provider for SecureAuth. During maintenance that was performed on Saturday around 9:30 AM PDT, DynDNS changed the IP addresses associated with SecureAuth services. This caused connections to SecureAuth services to fail until the client resolved to the new IP addresses. Services were restored around 10:45 AM PDT.
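The root cause above is a mismatch between the registered glue records and the nameserver IPs the DNS provider actually serves. A simple drift check along those lines is sketched here. It assumes both sets of addresses have already been collected into dictionaries keyed by nameserver hostname; how they are collected (registrar data, DNS queries) is left out, and the function name is hypothetical.

```python
def glue_drift(registered_glue, live_addresses):
    """Compare registered glue records with the addresses currently served.

    Both arguments map a nameserver hostname to an iterable of IP strings.
    Returns {nameserver: (stale_ips, new_ips)} for every nameserver whose
    registered glue differs from its live addresses.
    """
    drift = {}
    for ns in sorted(set(registered_glue) | set(live_addresses)):
        glue = set(registered_glue.get(ns, ()))
        live = set(live_addresses.get(ns, ()))
        if glue != live:
            drift[ns] = (sorted(glue - live), sorted(live - glue))
    return drift
```

Running a comparison like this on a schedule would surface a nameserver IP change even if a provider notification were missed.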
Report: "Intermittent Failure For Cloud Only IdP Connection"
Last updateIncident Description: Any service resolving [us-services.secureauth.com](http://us-services.secureauth.com) failed to resolve the URL when attempting to contact cloud services, including, but not limited to, datastore connection and multifactor services. Root Cause: One of our main DNS providers changed its nameserver IPs. This required an update to our glue records; the outdated records directly caused our loss of name resolution. Resolution/Corrective Actions: Our glue records have been updated to match our DNS provider's new nameserver IPs. We have also subscribed to our DNS provider's notifications of major infrastructure changes on their end to prevent a recurrence.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Intermittent Failure For Cloud Only IdP Connection"
Last updateThis incident has been resolved, and an RCA will be provided upon request.
A fix has been implemented and we are monitoring the results.
We are aware of an intermittent cloud connection issue with full cloud instances. We are investigating at this time and will post an update as soon as we know more.
Report: "SMS Delivery Issues with US T Mobile subscribers"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating SMS delivery issues to US T Mobile users. Our service provider is currently investigating the issue, and we will provide an update as soon as we receive more information.
Report: "Android Mobile App (Authenticate) Update Issue"
Last updateThis incident has been resolved.
SecureAuth Authenticate uses Firebase Cloud Messaging (FCM) to safely deliver push notifications to users’ devices. To uniquely identify each device, FCM generates and assigns a token to each device. While that token might remain unchanged for long periods of time, it might change unexpectedly under a few conditions: • When an app is restored on a new device. • When the user uninstalls/reinstalls the app (we consider updates as new installations as well). • When the user clears app data. Given the current capabilities of our solution, we can’t dynamically update that token, so users will need to re-enroll their accounts in order to use push notifications. They’ll continue to receive notifications for a short period of time, but they won’t be able to reply to them, and at some point they’ll stop receiving them. The remediation measure includes showing an error to end users whenever the token changes so that they know they’ll need to re-enroll in order to keep using push notification capabilities. This remediation measure is included in the latest version of our Android app, available on the Google Play Store.
Investigating: Users using Android mobile devices or tablets may encounter Push notification failures. User Impact: Users using Android mobile devices or tablets may be unable to log on using Push2Accept or Symbol2Accept due to push notification failures. The issue may manifest itself in a number of ways: 1. In some cases, the push notification may not be received 2. In some cases, pressing “accept” or pressing the symbol, the user will not be logged in, no error will be displayed on the mobile device 3. In some cases, pressing “accept” or pressing the symbol, the user will not be logged in, an error may be displayed on the mobile device Preliminary RCA: This issue is caused by the Android OS app update process. For some devices, when the app is updated, the unique identifier used to validate the device to our cloud is modified by the Android OS, resulting in push notifications failing. The SecureAuth development team is currently investigating this issue with the Google Android team. Remediation: Users must reenroll the impacted devices
Report: "We are investigating increased error rates in SecureAuth Identity Platform."
Last updateOn Dec 7th around 7:42 am PST, SecureAuth detected that several services were encountering issues and customers were impacted. Upon further investigation, the root cause of the service interruption was a widespread outage in the AWS US-East-1 region. At 8:39 am PST, SecureAuth switched the MFA Cloud Services to the backup cluster hosted in AWS’s US-West-2 region. However, IdPs that run from the AWS US-East-1 region could not be moved, and certain customers continued to be impacted. AWS started recovery around 2 PM PST and SecureAuth’s services started to recover as well. The AWS US-East-1 region eventually became fully functional around Dec 8th 1:32 PM PST. SecureAuth will engage with AWS to discuss the root cause of the outage and devise plans for future reliability improvements.
AWS service has been fully restored and all SecureAuth services returned back to normal.
We are continuing to monitor for any further issues.
AWS has restored some of the impacted services but has no ETA on full resolution. We will continue to monitor.
We are continuing to investigate this issue.
AWS is encountering networking issues in the US-East-1 region, which have impacted several services. There is currently no ETA from AWS on the resolution timeline.
We are continuing to investigate this issue.
We are investigating increased error rates in SecureAuth Identity Platform and selected Cloud Services.
Report: "Network error detected impacting SACloud services."
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We have detected a network related issue impacting SACloud services and are actively working on a resolution.
Report: "SMS Service Degradation due to AT&T Network Issue"
Last updateThis incident has been resolved.
AT&T is having network issues delivering SMS messages. We have activated the failover SMS service provider and are monitoring the situation.
Report: "Intermittent Issues with Cloud Services for US-West SecureAuth Cloud Data Center"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
Investigating - We are currently investigating intermittent issues with cloud services impacting SMS, phone OTP, Push, Geo, Threat, Phone Fraud, and certificate issuance. We will provide updates as we receive more information. Customer impact: End-users may experience failures when requesting OTP via phone, SMS, or push notifications. Geo and threat services may not respond but should fail open on the IdP side, resulting in no end-user impact. Phone Fraud may not respond and could impact the user logon experience for interactive and API-based logons that leverage this service. Platforms Impacted: v9.3 and below
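The update above says Geo and threat services should fail open on the IdP side when they do not respond. As an illustration of that pattern, the sketch below runs a risk check with a time limit and falls back to a default verdict if the check errors or times out. The function name, the timeout value, and the boolean "allow" verdict are assumptions; the IdP's actual behavior is not documented here.

```python
from concurrent.futures import ThreadPoolExecutor


def call_fail_open(check, timeout_s=2.0, default_allow=True):
    """Run check() with a time limit and return its boolean verdict.

    If check() raises or does not finish within timeout_s, return
    default_allow instead. With default_allow=True the caller fails open:
    an unavailable risk service does not block the logon.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        return bool(future.result(timeout=timeout_s))
    except Exception:  # covers both errors from check() and the timeout
        return default_allow
    finally:
        # Do not wait for a hung check; its worker thread finishes on its own.
        pool.shutdown(wait=False)
```

Passing `default_allow=False` turns the same wrapper into a fail-closed check, which may suit services where an unanswered check should block the logon.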