Historical record of incidents for Opsgenie
Report: "Customers may experience delays or failures receiving emails"
Last update: We were experiencing cases of degraded performance for outgoing emails from Confluence, Jira Work Management, Jira Service Management, Jira, Opsgenie, Trello, Atlassian Bitbucket, Guard, Jira Align, Jira Product Discovery, Atlas, Compass, and Loom Cloud customers. The system is recovering and mail is being processed normally as of 16:45 UTC. We will continue to monitor system performance and will provide more details within the next hour.
Report: "Delays observed at JSM and Opsgenie alert search functionality"
Last update: All search functionality is operational without any latency. Thank you for your patience.
The problem is mitigated, and we are now monitoring closely.
We identified degraded performance in alert search functionality for some Jira Service Management and Opsgenie Cloud customers due to an infrastructure issue at our cloud provider. No impact has been observed on critical alert flows such as notifications. The team has taken action to mitigate the issue and minimize the impact on search functionality.
Report: "Schedule API are getting timed out"
Last update: This incident has been resolved.
We are investigating cases of degraded performance for Alert Schedules experiencing timeouts and slowness for Opsgenie Cloud customers. Requests have been taking more than 30s and some have been timing out. We will provide more details within the next hour.
Report: "EU OpsGenie API calls having intermittent routing issues"
Last update: Issues where some API calls configured to use api.opsgenie.com/eu were intermittently returning a 404 error with the message 'no Route matched with those values' should now be resolved. API calls to api.eu.opsgenie.com and api.opsgenie.com (without /eu) were not affected at this time.
We are aware of an issue where some API calls configured to use api.opsgenie.com/eu are intermittently returning a 404 error with the message 'no Route matched with those values'. API calls to api.eu.opsgenie.com and api.opsgenie.com (without /eu) are not affected. Our team is investigating this issue.
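For clients that were hitting the path-routed endpoint, one possible workaround is to rewrite requests to the regional hostname that the update reports as unaffected. The sketch below is only an illustration: the function name `to_regional_eu_url` is hypothetical, the `/v2/alerts` path is used purely as an example, and the assumption that the path after `/eu` maps one-to-one onto `api.eu.opsgenie.com` is not confirmed by this update.

```python
from urllib.parse import urlsplit, urlunsplit

# Both hostnames are taken from the incident update above.
PATH_ROUTED_HOST = "api.opsgenie.com"
REGIONAL_EU_HOST = "api.eu.opsgenie.com"
EU_PREFIX = "/eu"


def to_regional_eu_url(url: str) -> str:
    """Rewrite api.opsgenie.com/eu/... to api.eu.opsgenie.com/... (assumed mapping).

    URLs that do not use the path-routed EU form are returned unchanged.
    """
    parts = urlsplit(url)
    path = parts.path
    is_eu_path = path == EU_PREFIX or path.startswith(EU_PREFIX + "/")
    if parts.hostname != PATH_ROUTED_HOST or not is_eu_path:
        return url
    # Drop the "/eu" prefix; fall back to "/" if nothing remains.
    new_path = path[len(EU_PREFIX):] or "/"
    return urlunsplit(
        (parts.scheme, REGIONAL_EU_HOST, new_path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(to_regional_eu_url("https://api.opsgenie.com/eu/v2/alerts?limit=5"))
```

Non-EU URLs pass through untouched, so the helper can be applied to every outgoing request without a separate region check.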
Report: "Services menu in Opsgenie is not responding"
Last update: Our engineers have been closely monitoring the platform and are declaring this incident resolved. Thank you for your patience.
Our team has identified the issue with the Services page in Opsgenie and is working to fix it.
Our engineering team is actively investigating this incident and working to bring the Opsgenie service back up as quickly as possible. Users affected by this incident may notice that Services functionality is slow or completely unavailable in the web application. We will update this page as we have additional information.
Report: "Elevated 5XX errors in Schedule API at Opsgenie USA region"
Last update: Our team identified an issue in the Schedule API between 15:30 UTC and 17:00 UTC that caused performance degradation and 5XX error responses. The faulty deployment was quickly reverted in the USA region, and rapid recovery followed. We are monitoring the system for full recovery. The Schedule API is up and running again without any data loss.
Report: "Opsgenie Web UI is slow or unavailable in US region"
Last update: Our engineers have been closely monitoring the platform and are declaring this incident resolved. Thank you for your patience.
Our engineering team has implemented fixes. We will continue to monitor all systems. Thank you for your patience.
We've noticed that Opsgenie Web UI is responding slowly or unavailable in US region. Our engineering team is actively investigating this incident and working to bring Opsgenie back up to speed as quickly as possible. We'll keep you posted with further updates on this page.
Report: "Reported issues with OEC functionality"
Last update: We've verified that OEC endpoints are back online.
We reverted the faulty routing configuration change, and OEC endpoints have started receiving traffic again.
We've identified a recent change that broke some endpoint routing configurations and caused OEC endpoint requests to be directed to the wrong service. We're reverting that change in production at the moment.
We are continuing to investigate this issue.
We've been notified that a number of OEC clients have been failing to create Jira tickets. We're currently investigating the issue.
Report: "Users are experiencing reCaptcha errors while signing up"
Last update: This issue has been resolved.
We have identified the root cause and the issue appears to be resolved.
Users attempting to sign up are encountering reCaptcha errors that are preventing a successful signup.
Report: "Unable to edit Opsgenie rotations in US, EU and Sydney regions in Web Application"
Last update: Our engineers have been closely monitoring the platform and are declaring this incident resolved. Thank you for your patience.
We are continuing to work on a fix for this issue.
Our team has identified the issue with Opsgenie Web Application / Edit Rotation feature in US, EU and Sydney regions and is working to fix it. Check back soon for another update! Our team is working hard to get the feature up and running again.
Report: "Some products are hard down"
Last update: Between 03-07-2024 20:08 UTC and 03-07-2024 20:31 UTC, we experienced downtime for Opsgenie. The issue has been resolved and the service is operating normally.
We have mitigated the problem and continue looking into the root cause. The outage lasted from 8:08pm UTC to 8:31pm UTC on 03/07. We are now monitoring closely.
We are investigating an issue with <FUNCTIONALITY IMPACTED> that is impacting <SOME/ALL> Atlassian, Atlassian Partners, Atlassian Support, Confluence, Jira Work Management, Jira Service Management, Jira, Opsgenie, Atlassian Developer, Atlassian (deprecated), Trello, Atlassian Bitbucket, Guard, Jira Align, Jira Product Discovery, Atlas, Atlassian Analytics, and Rovo Cloud customers. We will provide more details within the next hour.
Report: "Intermittent error accessing content"
Last update: Between 2024-06-20 22:04 UTC and 2024-06-20 22:28 UTC, some Atlassian Cloud customers experienced intermittent issues accessing services. The issue has been resolved and the service is operating normally.
We have identified the root cause of the intermittent errors and have mitigated the problem. We are now monitoring closely.
We are investigating an intermittent issue with accessing Atlassian Cloud services that is impacting some Atlassian Cloud customers. We will provide more details once we identify the root cause.
Report: "Error responses across multiple Cloud products"
Last update:

### Summary

On June 3rd, between 09:43pm and 10:58pm UTC, Atlassian customers using multiple products were unable to access their services. The event was triggered by a change to the infrastructure API Gateway, which is responsible for routing the traffic to the correct application backends. The incident was detected by the automated monitoring system within five minutes and mitigated by correcting a faulty release feature flag, which put Atlassian systems into a known good state. The first communications were published on the Statuspage at 11:11pm UTC. The total time to resolution was about 75 minutes.

### **IMPACT**

The overall impact was between 09:43pm and 10:17pm UTC, with the system initially in a degraded state, followed by a total outage between 10:17pm and 10:58pm UTC.

_The Incident caused service disruption to customers in all regions and affected the following products:_

* Jira Software
* Jira Service Management
* Jira Work Management
* Jira Product Discovery
* Jira Align
* Confluence
* Trello
* Bitbucket
* Opsgenie
* Compass

### **ROOT CAUSE**

A policy used in the infrastructure API gateway was being updated in production via a feature flag. The combination of an erroneous value entered in a feature flag, and a bug in the code resulted in the API Gateway not processing any traffic. This created a total outage, where all users started receiving 5XX errors for most Atlassian products. Once the problem was identified and the feature flag updated to the correct values, all services started seeing recovery immediately.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We know that outages impact your productivity. While we have several testing and preventative processes in place, this specific issue wasn’t identified because the change did not go through our regular release process and instead was incorrectly applied through a feature flag.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

* Prevent high-risk feature flags from being used in production
* Improve the policy changes testing
* Enforcing longer soak time for policy changes
* Any feature flags should go through progressive rollouts to minimize broad impact
* Review the infrastructure feature flags to ensure they all have appropriate defaults
* Improve our processes and internal tooling to provide faster communications to our customers

We apologize to customers whose services were affected by this incident and are taking immediate steps to address the above gaps.

Thanks,
Atlassian Customer Support
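Two of the remedial actions above, progressive rollouts and appropriate defaults, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Atlassian's implementation: `bucket`, `resolve_flag`, and the `is_valid` validator are hypothetical names, and hashing a flag name plus a unit identifier into 100 buckets is just one common way to get a stable percentage rollout.

```python
import hashlib
from typing import Callable, TypeVar

T = TypeVar("T")


def bucket(flag_name: str, unit_id: str) -> int:
    """Map a (flag, unit) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{unit_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


def resolve_flag(
    flag_name: str,
    unit_id: str,
    rollout_percent: int,
    new_value: T,
    default: T,
    is_valid: Callable[[T], bool],
) -> T:
    """Return new_value only for units inside the rollout, else the default.

    An invalid new_value never reaches any unit: the known-good default
    is served instead, whatever the rollout percentage says.
    """
    if not is_valid(new_value):
        return default
    if bucket(flag_name, unit_id) < rollout_percent:
        return new_value
    return default


if __name__ == "__main__":
    # Hypothetical flag: a routing timeout that must be a positive integer.
    def positive(v: int) -> bool:
        return isinstance(v, int) and v > 0

    served = [
        resolve_flag("gateway-timeout", f"shard-{i}", 10, 45, 30, positive)
        for i in range(1000)
    ]
    print("units on new value:", served.count(45))
```

Because the bucket depends only on the flag name and unit identifier, raising `rollout_percent` from 10 to 50 keeps the original 10% on the new value and only adds units, which makes a soak period between steps meaningful.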
Between 22:18 UTC and 22:56 UTC, we experienced errors for multiple Cloud products. The issue has been resolved and the service is operating normally.
We are investigating an issue with error responses for some Cloud customers across multiple products. We have identified the root cause and expect recovery shortly.
Report: "US - Increased delays on alert flow"
Last update: This incident has been resolved.
Our engineering team has implemented fixes. We will continue to monitor all systems. Thank you for your patience.
We are continuing to work on a fix for this issue.
We are observing delays for our alert flow. No alert has been lost and our team is actively working on it to mitigate the delays. We'll keep you posted with further updates.
Report: "Admin Portal Feature Access Issue"
Last update: Between 6:30 AM UTC and 9:50 AM UTC, we experienced failures in accessing some features from the Admin Portal. The issue has been resolved and the service is operating normally.
We are investigating an issue causing failures in accessing some features from the Admin Portal, which is impacting some of our Cloud customers. We have identified the root cause and anticipate recovery shortly.
Report: "Investigating new product purchasing"
Last update: Between 28th Feb 2024 23:15 UTC and 29th Feb 2024 00:05 UTC, we experienced an issue with new product purchasing for all products. All newly signed-up products have been successfully provisioned, and we have confirmed that the issue has been resolved and the service is operating normally.
We are investigating an issue with new product purchasing that is impacting all products. Customers adding new cloud products may have experienced a long waiting page or an error page after attempting to add a product. We have mitigated the root cause and are working to resolve impact for customers who attempted to add a product during the impact period. We will provide more details within the next hour.
Report: "Opsgenie SAML login at eu region is not working"
Last update: We fixed the problem and verified that there are no more login issues.
The team has identified the issue causing the signature validation error in EU SAML login. Deployment of the fix has started, and the team is currently monitoring SAML login activity.
Opsgenie SAML login is not working in the EU region only, due to a signature verification error on login certificates. Customers who were already logged in have not been affected by the error.
Report: "Service Disruptions Affecting Atlassian Products"
Last update:

### **Summary**

On February 14, 2024, between 20:05 UTC and 23:03 UTC, Atlassian customers on the following cloud products encountered a service disruption: Access, Atlas, Atlassian Analytics, Bitbucket, Compass, Confluence, Ecosystem apps, Jira Service Management, Jira Software, Jira Work Management, Jira Product Discovery, Opsgenie, StatusPage, and Trello. As part of a security and compliance uplift, we had scheduled the deletion of unused and legacy domain names used for internal service-to-service connections. Active domain names were incorrectly deleted during this event. This impacted all cloud customers across all regions. The issue was identified and resolved through the rollback of the faulty deployment to restore the domain names and Atlassian systems to a stable state. The time to resolution was two hours and 58 minutes.

### **IMPACT**

External customers started reporting issues with Atlassian cloud products at 20:52 UTC. The impact of the failed change led to performance degradation or in some cases, complete service disruption. Symptoms experienced by end-users were unsuccessful page loads and/or failed interactions with our cloud products.

### **ROOT CAUSE**

As part of a security and compliance uplift, we had scheduled the deletion of unused and legacy domain names that were being used for internal service-to-service connections. Active domain names were incorrectly deleted during this operation.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We know that outages impact your productivity. The detection was delayed because existing testing & monitoring focused on service health rather than the entire system’s availability.

To prevent a recurrence of this type of incident, we are implementing the following improvement measures:

* Canary checks to monitor the entire system availability.
* Faster rollback procedures for this type of service impact.
* Stricter change control procedures for infrastructure modifications.
* Migration of all DNS records to centralised management and stricter access controls on modification to DNS records.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,
Atlassian Customer Support
We experienced increased errors on Confluence, Jira Work Management, Jira Service Management, Jira Software, Opsgenie, Trello, Atlassian Bitbucket, Atlassian Access, Jira Align, Jira Product Discovery, Atlas, Compass, and Atlassian Analytics. The issue has been resolved and the services are operating normally.
We have identified the root cause of the Service Disruptions affecting all Atlassian products and have mitigated the problem. We are now monitoring this closely.
We have identified the root cause of the increased errors and have mitigated the problem. We continue to work on resolving the issue and monitoring this closely.
We are investigating reports of intermittent errors for all Cloud Customers across all Atlassian products. We will provide more details once we identify the root cause.
Report: "Major outages in heartbeat services pinging via email and email integration in EU region"
Last update: This incident has been resolved.
The team has reverted the changes and confirmed that the affected heartbeat and email integrations are working again.
Due to faulty domain configurations, heartbeat updates via email and incoming email services have been rejected starting from 13:45 UTC. The team has been working on the fix. Except for email updates and email integrations, the heartbeat feature and integrations are still fully functional.
Report: "IOS based Alert Notifications delivered as non-critical"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
As of 12.12.2023 14:46 UTC, we have observed that some alert notifications are not being delivered to iOS-based devices as critical. We are currently investigating the issue.
Report: "We observe increased error rates due to the cloud provider"
Last update: Our engineers have been closely monitoring the platform and are declaring this incident resolved. Thank you for your patience.
We are observing increased error rates caused by an issue at our cloud provider. This caused notification delays and elevated API errors for our customers. All systems are now recovering, and we are monitoring.
Report: "Egress connectivity timing out"
Last update: The systems have remained stable since the fix was applied, throughout the monitoring period.
The issue was identified and a fix implemented. We are monitoring currently.
We are currently investigating an incident that results in outbound connections from Atlassian Cloud in us-east-1 intermittently timing out. This affects Jira, Trello, Confluence, and Ecosystem products. The features affected for these products are those that require opening a connection from Atlassian Cloud to public endpoints on the Internet.
Including Atlassian Developer
We are currently investigating an incident that results in connection timeouts on the service egress proxy. This affects Jira, JSM, Confluence, Bitbucket, Trello, and Ecosystem products. The features affected for these products are those that require a connection to service egress.
Report: "Scheduled report functionality is disabled"
Last update: We're excited to inform you that we've shipped upgrades to our production environment, enabling scheduled reports once again.

What's Changed: To continuously improve and ensure the security of our services, we've implemented additional controls including domain restrictions and a limitation on the number of recipients per email. This is specifically for mitigation purposes. From now on, users will start receiving emails for the reports they've scheduled for themselves, and they will also have the ability to create new tasks.

Impact: Please note that any existing scheduled jobs with external recipients will no longer be editable. However, users can delete these and create new jobs using their email IDs.

Thank you for your patience during these changes. We want to assure you that future updates and communications will be shared promptly to keep you informed. We appreciate your understanding and continued support.
The changes have been shipped to production, and scheduled reports are enabled now.
We have implemented additional controls and are introducing domain restrictions as well as limitations to the number of recipients for mitigation purposes. Scheduled reports will be enabled for all customers on November 24th, PST. Users will begin receiving emails for the reports they have scheduled for themselves and they will also have the ability to create new tasks. We will share more updates when the changes are fully implemented.
The root cause has been identified and the spam activity has been mitigated. The team is working on adding more controls to prevent further spam activity. The scheduled report feature will remain disabled until the further controls are implemented. However, the reporting service is fully available and reports can be downloaded manually via the reporting page.
The cause of the issue is identified, and the team is working on the fix.
Scheduled report functionality is disabled as we suspect possible spam activity. Only reports with custom schedules are disabled; periodic emails are not impacted. The team is investigating the issue and will provide further updates.
Report: "Atlassian Account login issues"
Last update:

### **SUMMARY**

On Sep 13, 2023, between 12:00 PM UTC and 03:30 PM UTC, some Atlassian users were unable to sign in to their accounts and use multiple Atlassian cloud products. The event was triggered by a misconfiguration of rate limits in an internal service which caused a cascading failure in sign-in and signup-related APIs. The incident was quickly detected by multiple automated monitoring systems. The incident was mitigated on Sep 13, 2023, 03:30 PM UTC by the rollback of a feature and additional scaling of services which put Atlassian systems into a known good state. The total time to resolution was about 3 hours & 30 minutes.

### **IMPACT**

The overall impact was between Sep 13, 2023, 12:00 PM UTC and Sep 13, 2023, 03:30 PM UTC on multiple products. The Incident caused intermittent service disruption across all regions. Some users were unable to sign in for sessions. Other scenarios that temporarily failed were new user signups, profile retrieval, and password reset. During the incident we had a peak of 90% of requests failing across authentication, user profile retrieval, and password reset use cases.

### **ROOT CAUSE**

The issue was caused by a misconfiguration of a rate limit in an internal core service. As a result, some sign-in requests over the limit received HTTP 429 errors. However, retry behavior for requests multiplied the load, which led to further service degradation. As many internal services depend on each other, the call graph complexity led to a longer time to detect the actual faulty service.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We are continuously improving our system's resiliency. We are prioritizing the following improvement actions to avoid repeating this type of incident:

* Audit and improve service rate limits and client retry and backoff behavior.
* Improve scale and load test automation for complex service interactions.
* Audit cross-service dependencies related to sign-in flows and minimize them where possible.

Due to the unavailability of sign-in, some customers were unable to create support tickets. We are making additional process improvements to:

* Enable our unauthenticated support contact form and notify users that it should be used when standard channels are not available.
* Create status page notifications more quickly and ensure that for severe incidents, notifications to all subscribers are enabled.

We apologize to users who were impacted during this incident; we are taking immediate steps to improve the platform’s reliability and availability.

Thanks,
Atlassian Customer Support
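The root cause above describes retries on HTTP 429 multiplying load. A common client-side mitigation is capped exponential backoff with jitter and a bounded number of attempts. The sketch below is a generic illustration, not Atlassian's client code: `backoff_delay`, `call_with_backoff`, and their default parameters are hypothetical, and the "full jitter" strategy is an assumption about what suitable backoff behavior could look like.

```python
import random
import time
from typing import Callable


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full-jitter delay: a random fraction of min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> int:
    """Call send() until it returns something other than 429.

    At most max_attempts calls are made in total, so a throttled
    dependency sees a bounded, spread-out retry load instead of an
    immediate retry storm. The last status code is returned.
    """
    status = send()
    for attempt in range(max_attempts - 1):
        if status != 429:
            return status
        sleep(backoff_delay(attempt, rng=rng))
        status = send()
    return status


if __name__ == "__main__":
    # Simulated dependency: throttles twice, then succeeds.
    responses = iter([429, 429, 200])
    print(call_with_backoff(lambda: next(responses), sleep=lambda s: None))
```

Passing `sleep` and `rng` as parameters keeps the retry policy testable without real waiting. The jitter matters as much as the exponential growth: without it, clients throttled at the same moment retry at the same moment.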
Between 12:45 UTC and 15:30 UTC, we experienced login and signup issues for Atlassian Accounts. The issue has been resolved and the service is operating normally. We will publish a post-incident review with the details of the incident and the actions we are taking to prevent similar problems in the future.
We are no longer seeing occurrences of the Atlassian Accounts login errors, all clients should be able to successfully login now. We will continue to monitor.
We can see a reduction in the Atlassian Accounts login issues after the mitigation actions were taken. We are still monitoring closely and will continue to provide updates.
We have identified the root cause of the Atlassian Accounts login issues impacting Cloud Customers and have mitigated the problem. We are now monitoring this closely.
We are investigating an issue with Atlassian Accounts login that is impacting some Cloud customers. We will provide more details within the next hour.
Report: "Multiple product logins"
Last update:

### **SUMMARY**

On August 30, 2023, between 4:07 and 5:30 UTC, some customers were unable to login to Atlassian's Cloud products using [id.atlassian.com](http://id.atlassian.com). Logged-in users were also unable to switch accounts, change passwords, or log out. Users with existing sessions were not impacted. Between 5:32 and 6:00 UTC, traffic was incrementally restored to a previous build, mitigating the impact for users. The total time to resolution was one hour and 53 minutes.

### **IMPACT**

Users were not able to login using Atlassian's shared account management system ([id.atlassian.com](http://id.atlassian.com)). This affected users who were trying to login to the following products: Jira, Confluence, Trello, Opsgenie, mobile apps and ecosystem apps. Aside from the inability to login, there was no impact on other Atlassian products or features.

### **ROOT CAUSE**

Multiple Set-Cookie headers were unintentionally modified so that only the last Set-Cookie header remained in the response to user's browsers. The issue was caused by a change to Network Extensions within the Edge Network. As a result, users that needed a new session could not login. Upon login, the users were redirected to login again and no session was created for them.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue was not detected in Atlassian's staging environment. End-to-end tests did not cover the use case of multiple Set-Cookie headers in the single response and therefore this bug went unnoticed.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

* Automated tests to be put in place to validate that cookies are not being removed from responses.
* Configuration of networking extensions will be guaranteed to be identical in staging and production to ensure errors are picked up earlier.

Furthermore, we typically deploy our changes progressively by cloud region to avoid broad impact, but in this case, the change was not deemed risky and was deployed to all regions. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures:

* Changes to network extensions in the future will use progressive rollouts.
* With staging being properly utilized, errors similar to this one will not be deployed to any production environments.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,
Atlassian Customer Support
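The failure mode in the root cause, several Set-Cookie headers collapsing into the last one, and the kind of automated check proposed above can be sketched as follows. This is a toy model rather than the actual edge code: headers are represented as a list of name/value pairs, and `collapse_into_dict` and `cookies_preserved` are hypothetical names. Storing headers in a mapping keyed by name is one plausible way such a bug arises, not a confirmed description of the real change.

```python
from typing import List, Tuple

Headers = List[Tuple[str, str]]


def set_cookie_values(headers: Headers) -> List[str]:
    """Return every Set-Cookie value, in order (header names are case-insensitive)."""
    return [value for name, value in headers if name.lower() == "set-cookie"]


def collapse_into_dict(headers: Headers) -> Headers:
    """Buggy transform: keying by header name keeps only the last duplicate."""
    return list({name.lower(): value for name, value in headers}.items())


def cookies_preserved(before: Headers, after: Headers) -> bool:
    """Check that a transform left all Set-Cookie headers intact."""
    return set_cookie_values(before) == set_cookie_values(after)


if __name__ == "__main__":
    origin: Headers = [
        ("Content-Type", "text/html"),
        ("Set-Cookie", "session=abc; Path=/"),
        ("Set-Cookie", "csrf=xyz; Path=/"),
    ]
    print(cookies_preserved(origin, list(origin)))                # pass-through
    print(cookies_preserved(origin, collapse_into_dict(origin)))  # buggy transform
    print(set_cookie_values(collapse_into_dict(origin)))
```

An end-to-end test built on the same idea would send a response carrying at least two Set-Cookie headers through the edge layer and compare what arrives. A fixture with a single cookie cannot catch this class of bug.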
Between 4:30AM UTC and 6:00AM UTC, we experienced issues for users attempting to log in to Atlassian Support, Confluence, Jira Work Management, Jira Service Management, Jira Software, Opsgenie, Trello, Jira Product Discovery, Compass, and Atlassian Analytics. The issue has been resolved and the service is operating normally.
We are investigating reports of intermittent errors for login to Atlassian Support, Confluence, Jira Work Management, Jira Service Management, Jira Software, Opsgenie, Trello, Jira Product Discovery, Compass, and Atlassian Analytics Cloud customers. We will provide more details once we identify the root cause.
Report: "Opsgenie has experienced delay on Android Notifications"
Last update: This incident has been resolved.
Android notifications are no longer delayed. All Android notification delivery has returned to normal.
We are seeing delays with Android notifications. We have identified the cause and are currently working on mitigating this issue.
Report: "We observe degraded performance on incident timeline functionality"
Last update: The issue has been fully resolved and all functionality is working. Customers experienced some latency for entries added to the incident timeline during the incident. However, there was no data loss and all messages were processed successfully.
The rollback completed successfully and we observed the remediation. The functionality is fully working now, and we are closely monitoring the system.
A misconfiguration caused the incident. We have identified the root cause and are reverting the change.
We observe degraded performance on incident timeline functionality. We are investigating the issue and we will provide more details within the next hour.
Report: "Sign-ups, Product Activation, and Billing not working"
Last update: We mitigated the issue with Sign-ups, Product Activation, and Billing, and the systems are back to BAU, and all functionality is restored.
We have identified the root cause of the Sign-ups, Product Activation, and Billing not working and have mitigated the problem. We are now monitoring closely.
We are investigating an issue with Sign-ups, Product Activation, and Billing that is impacting all of our Cloud Customers. We will provide more details within the next hour.
Report: "Performance issues and outages with Cloud products"
Last update:

### **SUMMARY**

We understand the importance of providing reliable and consistent service to our valued customers. On July 6, 2023, from 03:52 to 15:11 UTC, we experienced an issue with an upgraded version of a third-party tool that functions as our internal artifact management system. Despite our monitoring system identifying the incident within two minutes, this issue led to the degradation of the scaling capabilities of our internal hosting platform, resulting in service degradation or outages for customers of Atlassian cloud. In response to this situation, we are taking immediate measures to enhance the stability of our system and prevent similar issues from re-occurring.

### **IMPACT**

This incident affected multiple regions and products due to the diminished scaling capabilities of our internal hosting platform. In most products and offerings, customers faced reduced functionality, slower response times, and limited access to specific features.

### **ROOT CAUSE**

The root cause of the incident was the introduction of new functionality in a third-party tool that functions as our internal artifact management system. It led to an unexpected increase in the load on the primary database of the artifact system. Upon identifying and localizing the problem, we promptly adjusted the system configuration to regain stability.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

Over the next months, we will enact a temporary freeze on non-critical upgrades of the artifact management system, and we will focus our efforts on three high-priority initiatives:

1. **Enhancing system scaling:** We prioritized work ensuring that downtime in a critical infrastructure component does not affect the scaling of other components. We expect to complete this initiative within the next two months.
2. **Reducing interdependencies:** We are working to mitigate the risk of potential cascading failures by ensuring that significant system components are able to operate independently in the case of issues. Initiatives 1 and 2 are already in progress but have been given priority to be completed as soon as possible.
3. **Strengthening testing procedures:** Alongside these initiatives, we are addressing the need for even more stringent testing procedures than we already have in place to prevent potential issues in future updates.

We are committed to collaborating closely with our technology partners to ensure the most optimal experience for our customers. We apologize for any inconvenience caused by this incident and appreciate your understanding. Our team is dedicated to continually improving our systems and processes to provide you with the exceptional service you deserve.

Thank you for your continued support and trust in us.

Sincerely,
Atlassian Customer Support
We experienced performance issues and outages for several Atlassian Cloud Products. The issue has been resolved and the service is operating normally.
We have identified the root cause of an issue with an internal infrastructure component that has been impacting multiple Cloud products - including Jira Software, Jira Service Management and Confluence - and customers. This issue has led to a performance impact and, in some cases, outages. We have implemented a fix to resolve the issue and recovery is in progress.
We are investigating an issue with an internal infrastructure component that is impacting multiple Cloud products, including Jira Software, Bitbucket, Jira Service Management and Confluence, and customers. These issues include performance impact and, in some cases, outages. Users may experience slow loading and uploading of attachments, login issues or inability for new customers to sign up. We have identified the root cause and are actively working on the service recovery.
Report: "Intermittent errors during login for some customers"
Last updateBetween 07:31 UTC and 12:32 UTC, we experienced errors during login for Atlassian Support, Confluence, Jira Work Management, Jira Service Management, Jira Software, Opsgenie, Trello, Atlassian Bitbucket, Atlassian Access, Jira Product Discovery, Compass, and Atlassian Analytics. The issue has been resolved and the service is operating normally.
We have identified the root cause of the errors during login and have mitigated the problem. We are now monitoring closely.
We are investigating reports of errors during login that are impacting some Atlassian Support, Confluence, Jira Work Management, Jira Service Management, Jira Software, Opsgenie, Trello, Atlassian Bitbucket, Atlassian Access, Jira Product Discovery, Compass, and Atlassian Analytics customers. We have identified the root cause and expect recovery shortly.
We are investigating reports of errors during login that are impacting some Atlassian Support, Confluence, Jira Work Management, Jira Service Management, Jira Software, Opsgenie, Trello, Atlassian Bitbucket, Atlassian Access, Jira Product Discovery, and Atlassian Analytics Cloud customers. We will provide more details within the next hour.
We are investigating reports of errors during login that are impacting some Confluence, Jira Work Management, Jira Service Management, Jira Software, Opsgenie, Trello, Atlassian Bitbucket, Atlassian Access, Jira Product Discovery, and Atlassian Analytics Cloud customers. We will provide more details within the next hour.
Report: "Increased delays on all flows in EU region"
Last updateOur engineers have been closely monitoring the platform and are declaring this incident resolved. Thank you for your patience.
Our engineering team has implemented fixes. We are monitoring all systems. Thank you for your patience.
Due to extreme load, we are experiencing a problem in the cache layer. Our team is actively addressing the problem and working on implementing a fix as quickly as possible. We appreciate your patience.
We detected delays in the login flow. We have identified the problem and are working on a fix.
We are continuing to work on a fix for the issue.
The issue has been identified and a fix is being implemented.
The team has identified the cause of the problem and is actively working to mitigate the delays.
Our platform is experiencing some delays for all systems in the EU region. The team is actively working to mitigate the delays. We'll keep you posted with further updates.
Report: "Partial Outage in Schedule API"
Last updateOur engineering team has fixed the problem and we are closely monitoring the platform. Thank you for your patience.
We are investigating an issue with Schedule API that is impacting some Opsgenie US Cloud customers. We will provide more details within the next hour.
Report: "Delays in Android push notifications"
Last update
### **SUMMARY**
On April 4, 2023, between 13:32 and 14:50 UTC, Atlassian customers using Opsgenie faced significant delays while receiving Android push notifications. This was caused by an incident in a third-party messaging service, which is responsible for Android push notification delivery. This in turn affected our systems. The incident was immediately detected by our monitoring tools, our on-call engineers were paged, and at 14:50 UTC our systems recovered successfully. The total time to resolution was about 80 minutes.

### **IMPACT**
The overall impact was on April 4, 2023, between 13:32 and 14:50 UTC in Opsgenie. The incident resulted only in delays to Android push notifications. These notifications were delivered successfully after the FCM service was restored, and no data loss occurred.

### **ROOT CAUSE**
The issue was caused by an incident in a third-party messaging service, which is responsible for delivering push notifications to Android devices.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**
We know that outages impact your productivity. The impact was immediately caught by our monitoring tools, and the responsible team immediately started analyzing the incident. We value transparency with our customers and will continue to notify you and take any necessary actions promptly during an incident.

In order to handle degradation or outage of messaging channels, Opsgenie recommends that users configure multiple channels of message delivery, including push notifications, mobile SMS, phone calls, and email.

In order to improve our response for the future, we will also be analyzing whether we can employ autoscaling solutions for our systems in case of an outage or high load related to one notification channel.

We apologize to customers whose services were impacted during this incident.

Thanks,
Atlassian Customer Support
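The recommendation to configure multiple delivery channels can be pictured with a minimal sketch. This is only an illustration under stated assumptions: the function name `notify_with_fallback` and the channel callables are hypothetical, and the code does not represent how Opsgenie notification rules are implemented. It tries each configured channel in order and stops at the first one that reports success, so a degraded push channel does not block delivery.

```python
from typing import Callable, Iterable, Optional, Tuple

# A sender takes the message and returns True when delivery was accepted.
# These callables are hypothetical stand-ins for push, SMS, voice, or email.
Sender = Callable[[str], bool]


def notify_with_fallback(
    message: str, channels: Iterable[Tuple[str, Sender]]
) -> Optional[str]:
    """Try each channel in order; return the name of the first that succeeds."""
    for name, send in channels:
        try:
            if send(message):
                return name
        except Exception:
            # Treat a raised error like a failed delivery and move on.
            continue
    return None  # every configured channel failed


def failing_push(message: str) -> bool:
    # Simulates a push provider outage.
    raise ConnectionError("push provider unavailable")


def working_sms(message: str) -> bool:
    # Simulates a healthy secondary channel.
    return True


used = notify_with_fallback(
    "Alert: disk usage high", [("push", failing_push), ("sms", working_sms)]
)
```

In this sketch the order of the list expresses channel preference, so the same message reaches the user through SMS when push is unavailable.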
This incident on Firebase has been resolved. Android push notifications are operational.
We are continuing to monitor for any further issues.
Android push notification delivery is fully operational for now, but we are still monitoring the Firebase outage (https://status.firebase.google.com/incidents/9ZPv9faHLen8bzLVSaft).
The issue has been identified as caused by an error on Firebase (https://status.firebase.google.com/incidents/9ZPv9faHLen8bzLVSaft). We continue to monitor the situation and will send an update within the next hour.
We are investigating an issue with our Android push notifications that is impacting some of our notifications for Android. We will provide more details within the next hour.
Report: "Increased delays on Jira Cloud and Jira Service Management Cloud integrations while creating/updating Opsgenie alerts in US region"
Last update
### **SUMMARY**
On April 3, 2023, from 1:15 pm UTC to 5:20 pm UTC, Atlassian customers using the Opsgenie product to integrate with a separate Jira Service Management Cloud instance faced significant delays while creating and updating alerts from Jira Cloud and Jira Service Management Cloud integrations in the US region. The issue was reported by our customers and also detected via internal monitoring tools. The reason for the incident was that one of the Opsgenie integration components could not scale to the high volume of requests from Jira. This caused delays of up to 30 minutes in creating alerts or Jira issues. The incident was mitigated by scaling the integration component, which put Atlassian systems into a known good state. The total time to resolution was about four hours and 30 minutes.

### **IMPACT**
The overall impact was on April 3, 2023, from 1:15 pm UTC to 5:20 pm UTC. The incident caused degradation to customers hosted in the US region only. It caused delays of up to 30 minutes in creating Opsgenie alerts from Jira issues for customers who have the Jira to Opsgenie integration enabled.

### **ROOT CAUSE**
The issue was caused by a sudden spike in the volume of messages due to bulk actions. Handling such a spike requires scaling up the instances manually. Our proactive monitoring is meant to prevent delays by alerting early enough to allow manual scaling. A misconfiguration of the alert threshold and escalation policy in our monitoring system prevented us from scaling up instances in time.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**
We know that outages impact your productivity. We are prioritizing the following improvement actions to avoid repeating this type of incident:

* Improving auto-scaling for integration components to take care of sudden spikes in the volume of incoming messages for creating alerts via integration
* Adding additional monitoring mechanisms to raise an alarm when volume thresholds are breached

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the product’s performance and availability.

Thanks,
Atlassian Customer Support
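The threshold-based alarming described above can be sketched roughly as follows. This is an illustration only, not Opsgenie's monitoring configuration: the threshold values, the `VolumeThresholds` class, and the function names are all hypothetical. The sketch classifies a message backlog into alarm levels and includes a sanity check that rejects a threshold configuration in which the warning level could never fire before the critical one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeThresholds:
    # Hypothetical values; real thresholds depend on component capacity.
    warning: int = 5_000
    critical: int = 20_000


def validate(thresholds: VolumeThresholds) -> None:
    """Reject configurations where the warning level cannot fire first."""
    if not 0 < thresholds.warning < thresholds.critical:
        raise ValueError("expected 0 < warning < critical")


def classify_backlog(queued_messages: int, thresholds: VolumeThresholds) -> str:
    """Return the alarm level for the current number of queued messages."""
    if queued_messages >= thresholds.critical:
        return "critical"
    if queued_messages >= thresholds.warning:
        return "warning"
    return "ok"


defaults = VolumeThresholds()
validate(defaults)
level = classify_backlog(7_500, defaults)
```

Validating the thresholds when the configuration is loaded is one way to surface a misconfiguration before a traffic spike does.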
We observed some delays while creating/updating alerts from Jira Cloud and Jira Service Management Cloud integrations in the US region. The problem is now resolved.
Report: "Partial outage in ICC sessions due to network problem"
Last updateOur engineers have been closely monitoring the platform and are declaring this incident resolved. Thank you for your patience.
The problem is related to a misconfiguration on our network causing ICC sessions to fail. We are working on a fix and adjusting our network. Incident flows continue to work without a problem and there is no data loss. We appreciate your patience as our teams continue their investigation into the service interruption.
Report: "Opsgenie On Call analytics dashboards are not showing data"
Last updateFor the 3 impacted dashboards, we have made data accessible for up to 1 year. For date ranges longer than 1 year, please reach out to customer support. We will continue to monitor the system.
The team is still working to increase the data accessibility for the 3 impacted dashboards. We will keep you posted on further updates.
While we continue to increase the data accessibility for the 3 impacted dashboards, customers can continue to use them for the last 6 months of data. The rest of the reporting dashboards are working as expected. We will keep you posted on further updates.
While we continue to fix the data, we have observed latency in data shown for a few of the other analytics dashboards. We will continue to work to fix the rest of the data and will update you once we have more information.
Data is fixed for the last six months, and the impacted dashboards can be generated with the last six months of data. We will continue to work on fixing the rest of the data and will update once we have more information.
The job triggered to fix the rest of the data is taking more time than expected. We are actively monitoring it and will update once we have more information.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We have identified the root cause and rolled out a fix for the last month of data. All data generated in the last month should be accessible. We have triggered the fix for the rest of the data and will update once the data is fully accessible.
Opsgenie "On Call Reports", "On Call Time Analytics" & "Total On Call Time per User" Analytics dashboards are not showing data. We are working to identify the root cause and we'll keep you posted with further updates.
Report: ""_incomingData" and "_actionSource" fields are missing in Opsgenie Debug logs"
Last updateThis incident has been resolved.
"_incomingData" and "_actionSource" fields are visible in debug logs generated after January 13, 2023 1:55:39 PM UTC.
We have identified the problem and are working on a solution to fix the debug logs.
We have identified that "_incomingData" and "_actionSource" fields are missing in Opsgenie Debug Logs due to internal issues.
Report: "Observing delays in incoming webhook integration processing in EU region"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Opsgenie analytics dashboards was not accessible"
Last updateOpsgenie analytics dashboards were not available between 2022-11-29 18:15 (UTC) and 2022-11-29 19:30 (UTC). We observed a spike in traffic patterns which caused degradation in one of our services. After detecting the problem, our engineering team worked towards getting the service back up as quickly as possible. The analytics dashboards started working partially from 2022-11-29 19:30 (UTC) and were fully functional by 2022-11-29 20:10 (UTC).
Report: "Android users of Jira, Confluence and Opsgenie app with Compromised device check feature turned on is getting locked out of their app"
Last updateBetween 2022-11-06 and 2022-11-11, 18:45 EST, we experienced an issue where Android users of the Jira, Confluence, and Opsgenie apps with the Compromised device check feature turned on were getting locked out of their apps. The issue has been resolved and the service is operating normally. If a user is locked out post-login and there is no retry option, we request that the user either clear the app data or reinstall the app.
We are investigating an issue where Android users of the Jira, Confluence and Opsgenie apps with the Compromised device check feature turned on are getting locked out of their app. Note that this is only affecting the Android mobile app for customers who have turned on the Compromised device check feature via admin.atlassian.com.
We are investigating reports of intermittent errors for <SOME/ALL> Confluence, Jira Software, and Opsgenie Cloud customers. We will provide more details once we identify the root cause.
Report: "Logs are delayed on search and download API in EU region"
Last updateThe incident has been resolved. Logs are available without any delay.
A fix has been implemented and the log delay is decreasing rapidly. We are monitoring the progress. Alert logs and all other functionality are working as expected.
We have identified the root cause of the problem. Our engineers are working on fixing the problem now.
We are continuing to investigate this issue.
Logs are delayed on the log search page and the download API in EU Region. We are currently investigating the issue.
Report: "Elevated error rate in all Opsgenie services"
Last updateBetween 2022-10-21 09:01 UTC and 2022-10-21 09:12 UTC, in the US region, we saw an elevated error rate in our infrastructure due to a faulty deployment. We have deployed a fix to mitigate the issue and have verified that all services have recovered without data loss. Thanks to the quick reaction of our engineering team, the issue has been resolved and the service is operating normally.
Report: "Schedule overrides were not editable/deletable"
Last updateBetween 2022-10-20 14:23 (UTC) and 2022-10-20 17:39 (UTC), due to latency in the Opsgenie system, schedule overrides could not be edited or deleted, and some of our users received errors. After detecting the problem, our engineering team worked towards getting the Opsgenie service back up as quickly as possible.
Report: "Opsgenie reporting & Analytics are not accessible"
Last updateReporting & analytics is completely operational now.
We are continuing to monitor for any further issues.
Customers were not able to access Opsgenie reporting & Analytics between 9:36 AM UTC & 11:40 AM UTC. We identified the root cause to be one of the recent deployments and we have reverted the change to fix the issue. We are now monitoring the system and validating the fix with support ticket owners.
Customers have not been able to access Opsgenie reporting & Analytics since 9:36 AM UTC. We have identified the root cause and are rolling out the fix.
Report: "Delays in notification service"
Last update
### **SUMMARY**
On Sep 14, 2022, between 03:36 PM and 04:26 PM UTC, Atlassian customers using the Opsgenie product received notifications delayed by up to 50 minutes. The event was triggered by a code change that upgraded a common framework. The changes included in this framework update impacted customers in both the US and EU regions. The incident was detected by the on-call developer and mitigated by reverting the latest changes, which put Opsgenie systems into a known good state. The total time to resolution was around 50 minutes.

### **IMPACT**
The overall impact was between Sep 14, 2022, 03:36 PM UTC, and Sep 14, 2022, 04:26 PM UTC on Opsgenie products. The service disruption was limited to US and EU region customers who did not receive their notifications immediately, but instead experienced notification delays of up to 50 minutes. In total, ~132K notifications in the US region and ~23.6K notifications in the EU region were sent with delays. Less than 0.6% of active customers were affected.

### **ROOT CAUSE**
The issue was caused by an Atlassian-initiated change to upgrade a common framework. While the majority of the intended changes had been tested successfully, some accompanying changes in the framework upgrade caused the notification service to stop processing new notification requests. Instead, these notifications remained in the queues until the deployment was reverted, resulting in notification delays for customers of up to 50 minutes.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**
We know that outages impact your productivity. We are prioritizing the following improvement actions to avoid repeating this type of incident:

* We are improving the testing and deployment processes we follow after framework updates.
* We are implementing new monitoring to reduce the detection and response time even further.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,
Atlassian Customer Support
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified the problem and are working on it. We expect the notification service to return to a normal state shortly.
We are seeing delays with outbound notifications. We have identified the cause and are currently working on mitigation of this issue.
Report: "Incident Page cannot be reachable by some of accounts"
Last updateThe Incident Page is now completely operational.
We reverted the recent deployment. We also validated that the error response codes disappeared immediately after the fix. We are now monitoring the system and validating the fix with some of the support ticket owners.
In a recent deployment, we made a change to how user permissions are handled in a specific component of the Incident page. That change causes a problem while loading the page. We are working on a fix and expect to close the incident very soon.
We are currently investigating an issue with the Incident page. When you try to open the Incident page, the Opsgenie web application drops the user to the login page. We are currently investigating the root cause.
Report: "Opsgenie Reporting service down"
Last updateThis incident has been resolved.
We have released a fix for the report display problem. Affected report displays and download should have returned to a functioning state. We are still actively working with engineering team on fixing the root cause of this issue.
We are continuing to investigate this issue.
We are continuing to investigate.
We are currently investigating it.
Report: "Errors navigating products, logging in, and logging out"
Last updateBetween 03/Aug/22 15:20 UTC and 03/Aug/22 17:47 UTC, some customers experienced errors using Atlassian products, including errors while logging in or being forcibly logged out. The root cause was a DNS service deployment in the US East region, which caused widespread DNS lookup errors for a variety of Atlassian services, including authentication services. We have rolled back the change to mitigate the issue and have verified that the authentication services have recovered. The issue has been resolved and Atlassian services are operating normally.
Report: "Intermittent errors across multiple products in eu-central"
Last update
### **SUMMARY**
On July 19, 2022, between 05:40 and 07:10 UTC, Atlassian customers in the EU region using Jira, Confluence and Opsgenie experienced problems loading pages through the web UI. The incident was automatically detected at 05:14 by one of Atlassian’s automated monitoring systems. The main disruption was resolved within 16 minutes, with the full recovery taking an additional 74 minutes.

### **IMPACT**
Between July 19, 2022, 05:40 UTC and July 19, 2022, 07:10 UTC, Jira, Confluence and Opsgenie users saw some web pages fail to load. During the 16-minute period from 06:40 UTC to 06:56 UTC, customers were unable to access the Jira, Confluence and Opsgenie web UI because the Atlassian Proxy (the ingress point for service requests) was unable to service most requests.

### **ROOT CAUSE**
The issue was caused by an AWS-initiated change that impacted Elastic Block Store (EBS) volume performance to such an extent that new instance creation, and therefore auto scaling, was blocked. As a result, the products above, as well as essential internal Atlassian services, could not auto scale to meet the increasing incoming service requests as the EU region came online. Once the AWS change had been rolled back, most Atlassian services recovered. Some internal services required manual scaling because unhealthy nodes prevented scaling from being initiated, which prolonged complete recovery.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**
We know that outages impact your productivity and we apologize to customers whose services were impacted during this incident. We see two main avenues to increase our resiliency during an incident where AWS auto scaling is blocked:

* Implement step scaling: Simple scaling works well in most cases. In this case, because nodes became unhealthy, simple scaling stopped responding to scaling alarms, so the service could become “stuck” and would not recover once scaling was possible again. We are exploring the use of step scaling, as this will allow scaling even in the case of instances becoming unhealthy.
* Implement improved alarming to identify “stuck” scaling in order to reduce the TTR when scaling is available again.

We are taking these immediate steps to improve the platform’s resiliency.

Thanks,
Atlassian
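As a rough illustration of the step scaling idea mentioned above, the sketch below maps the size of an alarm breach to a capacity adjustment, so larger breaches add more instances. It is a simplified model under stated assumptions, not Atlassian's or AWS's configuration: the step table, the CPU-style metric, and the function name `step_adjustment` are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical step table: (lower bound of the breach above the alarm
# threshold, number of instances to add for breaches in that step).
STEPS = [(0, 1), (10, 2), (30, 4)]


def step_adjustment(metric: float, threshold: float) -> int:
    """Return how many instances to add for the observed metric value."""
    breach = metric - threshold
    if breach < 0:
        return 0  # the alarm threshold is not breached, so no scaling
    lower_bounds = [bound for bound, _ in STEPS]
    # Pick the last step whose lower bound is <= the breach size.
    index = bisect_right(lower_bounds, breach) - 1
    return STEPS[index][1]


# Example: alarm threshold at 70% utilization.
small = step_adjustment(75.0, 70.0)   # breach of 5 falls in the first step
large = step_adjustment(100.0, 70.0)  # breach of 30 falls in the last step
```

Because each evaluation depends only on the current breach size, a policy shaped like this can keep reacting to alarms while earlier scaling activity is still in progress.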
Between 07:00 UTC and 07:45 UTC, we experienced degraded functionality for some features in Confluence, Jira Work Management, Jira Service Management, Jira Software, Opsgenie, and Atlassian Developer. The issue has been resolved and the service is operating normally.
Multiple Atlassian Cloud products and addons were unavailable to customers in some EU regions. The issue has been resolved and we are monitoring for further impact.
Report: "Opsgenie Analytics is slow or unavailable"
Last updateThe incident has been resolved.
We have identified the issue and are working on resolving it.
We've noticed that Reporting and Analytics is responding slowly. The issue is related to emailing of reports and csv downloads. Our web interface for reporting continues to work as expected. Our engineering team is actively investigating this incident and working to bring Opsgenie back up to speed as quickly as possible. We'll keep you posted with further updates on this page.
Report: "Web and Mobile Application are slow or unavailable"
Last updateThe problem has been resolved and the services are operating normally. Opsgenie faced partial outages due to a minor update by the cloud provider, and the team worked with the cloud provider to resolve the incident. Only 15% of total requests and 4.1% of customers were affected by the incident. We will take the necessary actions to prevent a similar incident.
The fix has been deployed and the system is recovering rapidly. We are monitoring the system for a full recovery.
Our team has identified that the Web and Mobile Application are responding slowly or are unavailable only in the Frankfurt region.
We are continuing to work on a fix for this issue.
Our team has identified the issue and is working on a fix. The next update will be in 1 hour or when the incident is resolved.
We've noticed that the Web and Mobile Application are responding slowly or are unavailable in the Frankfurt and North Virginia regions. This is not affecting our APIs. Our engineering team is actively investigating this incident and working to bring Opsgenie back up to speed as quickly as possible. We'll keep you posted with further updates on this page.