Statuspage.io

Is Statuspage.io Down Right Now? Check if there is a current outage ongoing.

Statuspage.io is currently Operational

Last checked from Statuspage.io's official status page

Historical record of incidents for Statuspage.io

Report: "statuspage.io domain not resolving in Brazil region for some users"

Last update
resolved

After a thorough investigation, we have determined that the recent connectivity issues for statuspage.io domains in Brazil were isolated to specific Internet Service Providers (ISPs) within Brazil, and that there was no impact on or issue with Statuspage’s own infrastructure or DNS configuration. Our team is actively monitoring the situation and reaching out to the relevant ISPs to ensure that access issues are resolved for all affected users. If connectivity issues persist and you are located in Brazil, we kindly ask that you open a support request with your internet service provider or follow our community post to resolve the issue. https://community.atlassian.com/forums/Statuspage-articles/Mitigation-steps-for-statuspage-io-domain-ISP-Issues-in-Brazil/ba-p/2951025#M274

identified

We are actively working with ISPs to resolve the current issues and to better understand the recent changes that have contributed to these problems. We also encourage our customers to contact their ISPs, as this appears to be an ISP-specific issue. Our investigation shows that statuspage.io domains can be resolved using the Cloudflare or Google public DNS resolvers, 1.1.1.1 and 8.8.8.8 respectively. Please refer to the steps outlined in our community post to access all Statuspage-based domains effectively. https://community.atlassian.com/t5/Statuspage-articles/Mitigation-steps-for-statuspage-io-domain-ISP-Issues-in-Brazil/ba-p/2951025#M274
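
The update above says that statuspage.io domains resolve through 1.1.1.1 and 8.8.8.8. As a rough illustration only (it is not part of Atlassian's mitigation steps), the sketch below sends a plain DNS A-record query straight to a chosen resolver over UDP, bypassing the ISP's resolver. The function names and the fixed query ID are assumptions made for this example.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID chosen for this sketch


def build_dns_query(hostname: str, query_id: int = QUERY_ID) -> bytes:
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed; a zero byte ends the name
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (Internet)
    return header + qname + struct.pack("!HH", 1, 1)


def resolver_answers(hostname: str, resolver_ip: str, timeout: float = 3.0) -> bool:
    """Return True if the resolver replies with no error and at least one answer."""
    query = build_dns_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (resolver_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable networks
            return False
    if len(reply) < 12:
        return False
    reply_id, flags, _, answer_count, _, _ = struct.unpack("!HHHHHH", reply[:12])
    # The low four bits of the flags field hold the response code (0 = no error)
    return reply_id == QUERY_ID and (flags & 0x000F) == 0 and answer_count > 0
```

Calling `resolver_answers("yourpage.statuspage.io", "1.1.1.1")` with your own page's hostname (the name here is a placeholder) and comparing the result against your ISP's resolver address would show whether the failure is specific to that resolver.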

identified

Please follow the steps in the link below to mitigate the DNS issues in Brazil for statuspage.io domains: https://community.atlassian.com/t5/Statuspage-articles/Mitigation-steps-for-statuspage-io-domain-ISP-Issues-in-Brazil/ba-p/2951025#M274

identified

We have determined that the connectivity issue is due to a DNS name resolution problem affecting certain ISPs in Brazil, which are unable to resolve *.statuspage.io domains. This issue does not impact all Brazilian customers and subscribers; it varies depending on the customer's or subscriber's ISP. Customers using custom domains for their status pages are not affected. While your Statuspage remains available, ISP configurations are preventing traffic from resolving the Statuspage domain. We are exploring mitigation strategies, but we advise affected customers to contact their ISP to understand why their ISP cannot resolve *.statuspage.io domains.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Elevated Server Errors across Statuspage Services"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are investigating server errors. We have isolated the problem and applied the fix.

Report: "Errors on the Statuspage JSM plugin"

Last update
resolved

This issue has been resolved.

identified

The issue has been identified, and we will be fixing it shortly.

investigating

We are aware of error messages displayed by the Statuspage JSM plugin. The team is currently investigating the issue.

Report: "Error responses across multiple Cloud products"

Last update
resolved

This incident has been resolved.

monitoring

We are investigating an issue with error responses for some Cloud customers across multiple products. We have identified the root cause and expect recovery shortly. In the meantime, we have enabled the alternative login option for Statuspage, so that our customers can still log in to their Statuspages.

Report: "Statuspage API is facing intermittent issues"

Last update
resolved

This issue has been resolved.

monitoring

A fix has been implemented, and we are monitoring the results. The API has fully recovered.

identified

The issue has been identified, and a fix is in progress.

investigating

We are currently investigating this issue.

Report: "Statuspage product provisioning failing intermittently"

Last update
resolved

We have identified and fixed the issue and now Statuspage provisioning is working for our customers.

investigating

We are investigating intermittent failures while provisioning Statuspage.

Report: "Intermittent issues in incidents shown in Status embed frame"

Last update
resolved

There was an intermittent error with the status embed page where cached responses were incorrectly being shown to a few customers. This has now been resolved.

investigating

We are investigating an intermittent issue, reported by a few customers, in which wrong data is shown in the status embed frame on their sites. We will share updates soon.

Report: "Intermittent errors while accessing public Statuspages"

Last update
postmortem

### **SUMMARY**

From 06:00 UTC to 07:45 UTC on October 28, 2023, Atlassian customers using Statuspage had intermittent issues with all Statuspage functionality. The event occurred due to a database performance issue during a [scheduled database maintenance](https://metastatuspage.com/incidents/s21b66328h9j). This impacted customers in all regions. The incident was detected within one minute by monitoring the upgrade process and mitigated by rolling back to a known good snapshot which put Statuspage systems into a known good state. The total time to resolution was about one hour and 45 minutes.

### **IMPACT**

The overall impact was between 06:00 UTC and 07:45 UTC October 28, 2023. This incident affected Statuspage customers from all regions and caused intermittent backend errors on all Statuspage activity including viewing pages, adding subscribers, and creating/updating events. We performed a rollback operation during recovery to return to a known good state.

### **ROOT CAUSE**

The issue was caused by database performance issues after a routine database maintenance and upgrade. As a result, our backends returned intermittent errors to several user requests.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We take the utmost care to provide a highly reliable service. We will pursue several preventive measures to ensure that this situation does not occur in the future, including:

* Fixing the cause of the performance issues before future upgrades; and
* Improving our testing process for database upgrades to catch potential performance issues.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,
Atlassian Customer Support

resolved

The issue is now resolved and everything is back to a normal working state.

monitoring

Update: We have fixed the issue and are actively monitoring.

investigating

We are currently seeing intermittent errors in viewing public Statuspages. We are investigating this problem and will provide updates shortly.

Report: "Atlassian Account login issues"

Last update
postmortem

### **SUMMARY**

On Sep 13, 2023, between 12:00 PM UTC and 03:30 PM UTC, some Atlassian users were unable to sign in to their accounts and use multiple Atlassian cloud products. The event was triggered by a misconfiguration of rate limits in an internal service which caused a cascading failure in sign-in and signup-related APIs. The incident was quickly detected by multiple automated monitoring systems. The incident was mitigated on Sep 13, 2023, 03:30 PM UTC by the rollback of a feature and additional scaling of services which put Atlassian systems into a known good state. The total time to resolution was about 3 hours and 30 minutes.

### **IMPACT**

The overall impact was between Sep 13, 2023, 12:00 PM UTC and Sep 13, 2023, 03:30 PM UTC on multiple products. The incident caused intermittent service disruption across all regions. Some users were unable to sign in for sessions. Other scenarios that temporarily failed were new user signups, profile retrieval, and password reset. During the incident we had a peak of 90% of requests failing across authentication, user profile retrieval, and password reset use cases.

### **ROOT CAUSE**

The issue was caused by a misconfiguration of a rate limit in an internal core service. As a result, some sign-in requests over the limit received HTTP 429 errors. However, retry behavior for requests caused a multiplication of load which led to higher service degradation. As many internal services depend on each other, the call graph complexity led to a longer time to detect the actual faulty service.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We are continuously improving our system's resiliency. We are prioritizing the following improvement actions to avoid repeating this type of incident:

* Audit and improve service rate limits and client retry and backoff behavior.
* Improve scale and load test automation for complex service interactions.
* Audit cross-service dependencies related to sign-in flows and minimize them where possible.

Due to the unavailability of sign-in, some customers were unable to create support tickets. We are making additional process improvements to:

* Enable our unauthenticated support contact form and notify users that it should be used when standard channels are not available.
* Create status page notifications more quickly and ensure that for severe incidents, notifications to all subscribers are enabled.

We apologize to users who were impacted during this incident; we are taking immediate steps to improve the platform’s reliability and availability.

Thanks,
Atlassian Customer Support
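
The root cause above describes retries multiplying load once requests began receiving HTTP 429 errors, and the remedial actions mention client retry and backoff behavior. The sketch below shows one common way a client can retry without amplifying load: exponential backoff with random jitter. It is a generic illustration, not Atlassian's implementation; `call_with_backoff`, `send_request`, and all default values are hypothetical.

```python
import random
import time
from typing import Callable


def call_with_backoff(
    send_request: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send_request until it returns a status other than 429
    or max_attempts calls have been made; return the last status."""
    status = send_request()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # "Full jitter": wait a random time between 0 and an exponentially
        # growing ceiling, so that many clients do not retry in lockstep.
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, ceiling))
        status = send_request()
    return status
```

Capping both the number of attempts and the delay bounds the extra load a single client can add; a server-provided `Retry-After` header, when present, would be a better guide than the computed delay.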

resolved

Between 12:45 UTC and 15:30 UTC, we experienced login and signup issues for Atlassian Accounts. The issue has been resolved and the service is operating normally. We will publish a post-incident review with the details of the incident and the actions we are taking to prevent similar problems in the future.

monitoring

We are no longer seeing occurrences of the Atlassian Accounts login errors; all clients should now be able to log in successfully. We will continue to monitor.

monitoring

We can see a reduction in the Atlassian Accounts login issues after the mitigation actions were taken. We are still monitoring closely and will continue to provide updates.

monitoring

We have identified the root cause of the Atlassian Accounts login issues impacting Cloud Customers and have mitigated the problem. We are now monitoring this closely.

investigating

We are investigating an issue with Atlassian Accounts login that is impacting some Cloud customers. We will provide more details within the next hour.

Report: "Elevated Server Errors on Public Pages"

Last update
resolved

This incident has been resolved.

monitoring

We have identified the issue and a fix has been implemented. We have scaled our services to mitigate the issue and are monitoring the results.

investigating

We are investigating cases of degraded performance for public pages. Pages may be failing to load or loading more slowly than normal.

Report: "Intermittent errors during login for some customers"

Last update
resolved

This incident has been resolved.

identified

We have identified issues with our login system. To unblock customer login we have temporarily enabled alternate login flows for the manage portal. We will continue to monitor the situation and provide further details as our investigation progresses.

Report: "Intermittent errors during login for some customers"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented. We are seeing recovery and continuing to monitor the incident.

identified

We have identified issues with our login system. To unblock customer login we have temporarily enabled alternate login flows for the manage portal. We will continue to monitor the situation and provide further details as our investigation progresses.

investigating

We are investigating reports of intermittent errors during login for some customers using Statuspage. We will provide more details once we identify the root cause.

Report: "Partial outage while accessing pages over http protocol"

Last update
resolved

This incident has been resolved.

identified

We are in the process of rolling out a fix for impacted domains.

identified

We have identified an unintended issue with redirecting http to https for a tiny cohort of customers on SSL-enabled custom domains. This does not affect the availability of any Statuspage on a custom domain with SSL enabled - they are available via https:// .
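
The update above concerns http requests to SSL-enabled custom domains not being redirected to https as intended. As a small illustration of what a correct redirect looks like from the client side, the sketch below computes the https URL that an http request should be redirected to and checks a response's status code and `Location` header against it. The function names and the example hostname are hypothetical, and the set of accepted status codes is an assumption for this sketch.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def expected_https_url(url: str) -> str:
    """Return the same URL with its scheme replaced by https."""
    parts = urlsplit(url)
    return urlunsplit(("https", parts.netloc, parts.path, parts.query, parts.fragment))


def is_correct_https_redirect(request_url: str, status: int, location: Optional[str]) -> bool:
    """True if the response is a redirect pointing at the https form of request_url."""
    return status in (301, 302, 307, 308) and location == expected_https_url(request_url)
```

For example, a request to `http://status.example.com/` that returns status 301 with `Location: https://status.example.com/` passes the check, while a 200 response or a redirect back to an http URL does not.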

Report: "Some Statuspage users are experiencing difficulties in creating/editing incidents"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "Increased error rate while SMS Subscription signups"

Last update
resolved

This incident is resolved.

investigating

The issue appears to be intermittent, and we are continuing to investigate.

investigating

We are currently investigating the issue.

Report: "Statuspage was unable to accept new signups"

Last update
resolved

Between 9:09 PM PST on March 22nd and 9:46 PM PST on March 23rd, there was an issue preventing new signups for Statuspage. Customers attempting to sign up during that time may have encountered difficulties. We have since identified and resolved the issue, which only affected new signups and did not impact any existing Statuspages or their components. We appreciate your patience and understanding while we addressed this matter.

Report: "We are noticing some slowness/intermittent errors while loading some public pages."

Last update
resolved

This has been resolved.

monitoring

Systems seem stable and we are monitoring now.

investigating

We are noticing some slowness/intermittent errors while loading some public pages. We are investigating the errors.

Report: "Not able to upload images to the Manage Portal - Statuspage"

Last update
resolved

This incident has been resolved.

investigating

We are currently experiencing an issue with uploading images on the manage portal.

Report: "Delayed delivery of Incident notifications via email"

Last update
resolved

We experienced a 15-minute delay in the delivery of incident notifications, which has been resolved. We have identified the cause as an ongoing incident with one of our messaging vendors, and we have migrated all possible notification traffic to our alternate messaging vendors.

Report: "SSO enabled private pages are facing authentication issues"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating the issue.

Report: "Manage Portal Outage"

Last update
resolved

The manage portal was out of service between 11:21-11:25 AM PST. The problem has resolved itself and we have returned to normal operations.

Report: "Abnormal API Timeouts"

Last update
resolved

Our API service experienced an abnormal number of timeouts affecting 0.6% of total traffic due to DNS errors between 12:30-1:10 pm PST.

Report: "Statuspage notifications are not being sent out"

Last update
resolved

This incident has been resolved. All notifications have been processed. No notifications were lost.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "Elevated Server Errors on Public Pages"

Last update
resolved

Between 5:09 and 5:13 PM PST, public status pages experienced elevated errors due to increased traffic.

Report: "Billing page inaccessible for some customers"

Last update
resolved

This incident has been resolved.

monitoring

Billing page access should be restored for all customers. We are continuing to monitor the issue.

identified

The issue has been identified and a fix is being implemented.

investigating

A subset of Statuspage customers are unable to access the Billing page within Statuspage. We are currently investigating the issue.

Report: "Elevated Errors for New Email Subscriptions"

Last update
resolved

Between 4:55pm and 10:10pm PST, users were not able to subscribe via email to statuspages. We have identified the root cause and have resolved the issue.

Report: "Errors accessing manage portal"

Last update
resolved

A recent deploy was found to contain errors. Our infrastructure has been successfully rolled back to a previous version of the code, and traffic is being served as normal again.

Report: "Elevated Errors for New SMS Subscriptions"

Last update
resolved

We have resolved the issue.

investigating

We are currently investigating this issue.

Report: "Intermittent Errors Accessing Public Pages Due to Elevated Traffic"

Last update
resolved

Due to elevated traffic, we experienced intermittent timeouts and errors in serving public pages between 4:24 and 4:25 AM PST. In response, we have implemented updates to our services to mitigate the same issue in the future.

Report: "Login failures for manage portal"

Last update
resolved

This incident has been resolved.

investigating

We are still investigating the root cause of the issue. Users who have attempted to log in to manage their Statuspages should receive an email with a login link that they can use temporarily instead of the normal login flow.

investigating

We are currently investigating failed logins when users are accessing Statuspage's manage portal.

Report: "Incorrect AWS Component Emails"

Last update
resolved

This incident has been resolved.

identified

We are continuing to work on a fix for the incorrect emails. If any other emails are received about AWS components, please disregard them.

identified

We have identified the cause of the emails and are working on a resolution now.

investigating

A number of incorrect emails were sent out saying AWS third party components will no longer receive updates. These components have not been impacted and we are investigating the root cause of the emails now.

Report: "Login failures"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We have identified the issue and are working on a resolution now.

investigating

We are currently investigating failed logins when users are accessing Statuspage.

Report: "Multiple sites showing down/under maintenance"

Last update
postmortem

Earlier this month, several hundred Atlassian customers were impacted by a site outage. We have published a Post-Incident Review which includes a technical deep dive on what happened, details on how we restored customers sites, and the immediate actions we’ve taken to improve our operations and approach to incident management. [https://www.atlassian.com/engineering/post-incident-review-april-2022-outage](https://www.atlassian.com/engineering/post-incident-review-april-2022-outage)

resolved

We have restored impacted Statuspage customer sites and the service is operating normally. If you need assistance, please reply to your support ticket so that our engineers can work with you. If you have any trouble accessing your support ticket, contact us at https://support.atlassian.com/contact/#/ (choose the Billing, Payments, & Pricing options from the drop down menu). Our teams will be working on a detailed Post Incident Report to share publicly by the end of April.

monitoring

We have restored impacted Statuspage customer sites and the service is operating normally. If you need assistance, please reply to your support ticket so that our engineers can work with you. If you have any trouble accessing your support ticket, contact us at https://support.atlassian.com/contact/#/ (choose the Billing, Payments, & Pricing options from the drop down menu).

identified

We have now restored 99% of users impacted by the outage and have reached out to all affected customers. Our teams are available to help customers with any concerns. If you need assistance, please reply to your support ticket so that our engineers can work with you. If you have any trouble accessing your support ticket, contact us at https://support.atlassian.com/contact/#/ (choose the Billing, Payments, & Pricing options from the drop down menu).

identified

We have now restored 85% of users impacted by the outage and will continue to get sites back to customers for validation, over the weekend. As we hand your restored site over to you for validation, please reach out to our teams should you find any issues so that our support engineers can work to get you fully operational. You can contact us at https://support.atlassian.com/contact/#/ (choose the Billing, Payments, & Pricing options from the drop down menu).

identified

We have now restored 78% of users impacted by the outage as we continue to move with more speed and accuracy. Our teams will continue to restore sites through the weekend, and we expect to have all sites restored no later than end of day Tuesday, April 19th PT. As we restore your site and hand it over to you for validation, please reach out to our teams should you find any issues so that our support engineers can work to get you fully restored. You can contact us at https://support.atlassian.com/contact/#/ (choose the Billing, Payments, & Pricing options from the drop down menu).

identified

We have made significant progress over the last 24 hours and have now restored functionality for 62% of users impacted by the outage. We have also doubled the size of the batches we are pushing through the restoration process, which was a result of optimizing automated processes as well as accelerating our restoration speed. Our global engineering teams continue to work 24/7, and we expect to progress quickly through technical restoration of remaining customer sites over the weekend. If you do not have access to your open ticket, please contact us at https://support.atlassian.com/contact/#/ (choose the Billing, Payments, & Pricing options from the drop down menu).

identified

We have now restored functionality for 55% of users impacted by the outage. With automation in full effect, we have significantly increased the pace at which we are conducting technical restoration of affected customer sites, and we have reduced the time required for the validation of restored sites by half. If you are still experiencing an outage and do not have access to your open ticket, please contact us at https://support.atlassian.com/contact/#/ (choose the Billing, Payments, & Pricing options from the drop down menu).

identified

We have now restored functionality for 53% of users impacted by the outage. As outlined in yesterday’s update, we are restoring affected customers using a three step process:

1. Technical restoration of affected sites
2. Internal validation of restored sites
3. Validating with affected customers before enabling their users

By automating some of our validation steps, we have now reduced time for internal validation of restored sites by half, which allows our support engineers to more quickly engage restored customers for validation and full site handover. If you are still experiencing an outage and do not have access to your open ticket, please contact us at https://support.atlassian.com/contact/#/ (choose the Billing, Payments, & Pricing options from the drop down menu).

identified

We have restored functionality for 49% of users impacted by the outage. We are taking a batch-based approach to restoring customers, and to date, this process has been semi-automated. We are beginning to shift towards a more automated process to restore sites. That said, there are still a number of steps required before we hand a site to customers for review and acceptance. We are restoring affected customers in groups of up to 60 at a time, identified by a mix of multiple variables including site size, complexity, edition, tenure, and several other factors.

The full restoration process involves our engineering teams, our customer support teams, and our customer, and has three steps:

1. Technical restoration involving meta-data recovery, data restores across a number of services, and ensuring the data across the different systems is working correctly for product and ecosystem apps
2. Verification of site functionality to ensure the technical restoration has worked as expected
3. Lastly, working directly with the affected customer to enable them to verify their data and functionality before enabling for their users

We have also contacted all customers who are *up next* for step 3 in the site restoration process described above. These customers are aware that they are next in queue through their support ticket and/or via a support engineer. We have proactively reached out to technical contacts and system admins at all impacted customers, and opened support tickets for each of them. However, we learned that some customers have not yet heard from us or engaged with our support team. If you are experiencing an outage and do not have access to your open ticket, please contact us at https://support.atlassian.com/contact/#/ (choose the Billing, Payments, & Pricing options from the drop down menu).

For more information from our engineering team, please read our update from our CTO, Sri Viswanath: https://www.atlassian.com/engineering/april-2022-outage-update
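
The update above mentions restoring customers in groups of up to 60 at a time. Purely as an illustration of that batching idea, the sketch below splits an already prioritized list into fixed-size groups. The `batches` helper is hypothetical, and how Atlassian actually ordered or processed sites is not described beyond the factors listed in the update.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], batch_size: int = 60) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each holding at most batch_size entries."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

With 130 items and the default size, this yields groups of 60, 60, and 10, so the final batch simply holds whatever remains.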

identified

The team is moving through the restoration process this week and is accelerating toward recovery. Functionality for 40% of impacted users has been restored.

identified

A small number of Atlassian customers continue to experience service outages and are unable to access their sites. Our global engineering teams are working 24/7 to make progress on this incident. At this time, we have rebuilt functionality for over 35% of the users who are impacted by the service outage, with no reported data loss. The rebuild stage is particularly complex due to several steps that are required to validate sites and verify data. These steps require extra time, but are critical to ensuring the integrity of rebuilt sites. We apologize for the length and severity of this incident and have taken steps to avoid a recurrence in the future.

identified

A small number of Atlassian customers continue to experience service outages and are unable to access their sites. Our global engineering teams are working 24/7 to make progress on this incident. At this time, we have rebuilt functionality for over 35% of the users who are impacted by the service outage, with no reported data loss. The rebuild stage is particularly complex due to several steps that are required to validate sites and verify data. These steps require extra time, but are critical to ensuring the integrity of rebuilt sites. We apologize for the length and severity of this incident and have taken steps to avoid a recurrence in the future.

identified

A small number of Atlassian customers continue to experience service outages and are unable to access their sites. Our global engineering teams are working 24/7 to make progress on this incident. At this time, we have rebuilt functionality for over 35% of the users who are impacted by the service outage, with no reported data loss. The rebuild stage is particularly complex due to several steps that are required to validate sites and verify data. These steps require extra time, but are critical to ensuring the integrity of rebuilt sites. We apologize for the length and severity of this incident and have taken steps to avoid a recurrence in the future.

identified

A dedicated team continues to work 24/7 to expedite service recovery. Restoration of all customers remains our top priority. We hear and appreciate all the feedback from our valued customers and are taking every necessary step to both restore full service and ensure site integrity as soon as possible.

identified

We are still working 24/7 to restore service to affected customers. We have restored partial access for some customers and will be continuing to restore access into next week.

identified

We continue to work 24/7 to restore service to affected customers. We have restored partial access for some customers and will be continuing to restore access into next week.

identified

Our teams are committed to restoring each customer’s service as soon as possible and are working through the weekend toward recovery.

identified

Our teams are committed to restoring each customer’s service as soon as possible and are working through the weekend toward recovery.

identified

The restoration process is underway. At this time we have no new significant updates, but the team continues to work around the clock to bring our customers back online.

identified

The restoration process is underway. At this time we have no new significant updates, but the team continues to work around the clock to bring our customers back online.

identified

Our team is working 24/7 to progress through site restoration work. Core functionality has been restored across a number of sites. We are continuously improving the process with the aim of accelerating the restoration process from here.

identified

Our team is working 24/7 to progress through site restoration work. Core functionality has been restored across a number of sites. We are continuously improving the process with the aim of accelerating the restoration process from here.

identified

The team is continuing the restoration process through the weekend and working toward recovery. We are continuously improving the process based on customer feedback and applying those learnings as we bring more customers online.

identified

Work to restore sites is underway and will continue into the weekend. We are taking a controlled and hands-on approach as we gather feedback from customers to ensure the integrity of these site restorations.

identified

Work to restore sites is underway and will continue into the weekend. We are taking a controlled and hands-on approach as we gather feedback from customers to ensure the integrity of these site restorations.

identified

We have started successfully restoring sites and continue to work on restoration to a wider cohort of customers. We are taking a controlled and hands-on approach as we gather feedback from customers to ensure the integrity of these site restorations.

identified

We have started successfully restoring sites and continue to work on restoration to a wider cohort of customers. Our controlled and hands-on approach, in which we gather feedback from customers to ensure the integrity of this first round of restorations, remains the same as in our last update.

identified

We have started successfully restoring sites and continue to work on restoration to a wider cohort of customers. Our controlled and hands-on approach, in which we gather feedback from customers to ensure the integrity of this first round of restorations, remains the same as in our last update.

identified

We continue to work on partial restoration to a cohort of customers. The plan to take a controlled and hands-on approach as we gather feedback from customers to ensure the integrity of this first round of restorations remains the same as in our last update.

identified

We continue to work on partial restoration to the first cohort of customers. The plan to take a controlled and hands-on approach as we gather feedback from customers to ensure the integrity of this first round of restorations remains the same from our last update.

identified

We are beginning partial restoration to a cohort of customers. The early stages of this process will be controlled and hands-on, as we work with customers live to get feedback and ensure that restoration is working correctly before we accelerate the process for the next cohort. We will continue to post updates here as we move the process along.

identified

We are continuing work in the verification stage on a subset of instances. Once reenabled, support will update accounts via opened incident tickets. Restoration of customer sites remains our first priority and we are coordinating with teams globally to ensure that work continues 24/7 until all instances are restored.

identified

We are continuing work in the verification stage on a subset of instances. Once reenabled, support will update accounts via opened incident tickets. Restoration of customer sites remains our first priority and we are coordinating with teams globally to ensure that work continues 24/7 until all instances are restored.

identified

We are continuing work in the verification stage on a subset of instances. Once reenabled, support will update accounts via opened incident tickets. Restoration of customer sites remains our first priority and we are coordinating with teams globally to ensure that work continues 24/7 until all instances are restored.

identified

We are continuing to work on the resolution of the incidents for some Statuspage, Jira Work Management, Jira Service Management, Confluence, Jira Software, Atlassian Access, Jira Product Discovery, and Opsgenie Cloud customers.

identified

We have partially reactivated the Statuspages of affected customers. The hosted pages should be up, and the API capabilities have been restored so affected customers can use the API to manage their pages while work is done to restore access to the manage portal. We have defined two processes for resolving the issues impacting some customers. Each process involves multiple stages of work. We are currently working on one of the processes and will provide more detail as we progress through resolution.

identified

The issue has been identified and a fix is being implemented.

Report: "Missing Metrics for a Subset of Status Pages"

Last update
resolved

This incident has been resolved.

monitoring

We have released a fix for the metrics display problem. Affected metrics displays should have returned to a functioning state. We are still actively engaged with our vendor and are working with their engineering team on fixing the root cause of this issue.

identified

We are still actively engaged with our vendor while their engineering teams work on the issue. Unfortunately, alternative workarounds did not produce acceptable results. Once again, this is purely a display problem and we are not experiencing any data loss.

identified

We have identified a metrics display problem for a majority of status pages due to a vendor issue. This is purely a display problem and we are not experiencing any data loss. We are working to return the service to normal operations.

investigating

We are currently investigating an issue with missing system metrics for a subset of status pages.

Report: "Delay in System Metrics"

Last update
resolved

This incident has been resolved.

monitoring

We have identified and resolved all issues with third-parties and we are now monitoring and continuing to process any delayed metrics.

identified

Based on additional reporting and investigation, we have found that System Metrics is experiencing delays of between 5 and 30 minutes. We are continuing to work with our vendor to return the service to normal operations.

identified

We have identified the issue and we’re working with our vendor to return the service to normal operations.

investigating

System metrics is experiencing a delay of 5 to 10 minutes. We are currently investigating this issue.

Report: "Issues With Login"

Last update
postmortem

### **SUMMARY**

On March 14, 2022, between 01:05 PM and 01:47 PM UTC, some Atlassian customers were unable to log in to our products, including Trello and Statuspage, and could not access some services, including the ability to create support tickets. The underlying cause was a newly introduced configuration data store that did not scale up properly due to a misconfiguration of autoscaling. The incident was detected by Atlassian's automated monitoring system and mitigated by disabling the use of the new configuration data store, which put our systems into a known good state. The total time to resolution was approximately 42 minutes.

### **IMPACT**

The overall impact was between March 14, 2022, 01:05 PM UTC and March 14, 2022, 01:47 PM UTC across seven products and services. The bug impacted several of the key dependent services, which resulted in an outage for end users, leading to failed logins across the following products and services:

* [**getsupport.atlassian.com**](http://getsupport.atlassian.com)
* [**confluence.atlassian.com**](http://confluence.atlassian.com)
* [**jira.atlassian.com**](http://jira.atlassian.com)
* [**partners-jira.atlassian.com**](http://partners-jira.atlassian.com)
* [**community.atlassian.com**](http://community.atlassian.com)
* [**manage.statuspage.io**](http://manage.statuspage.io)
* [**trello.com**](http://trello.com)
* [**university.atlassian.com**](http://university.atlassian.com)

### **ROOT CAUSE**

The issue was caused by an underlying configuration data store based on AWS DynamoDB failing to scale up. During post-setup fine-tuning, it was identified that the initial values for the read capacity units (RCUs) and write capacity units (WCUs) were over-provisioned. As a result, a decision was made to decrease them; however, the resulting values proved to be insufficient to handle the increased traffic in our system.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We're prioritizing the following improvement actions to avoid repeating this type of incident:

* Fix the configuration so that the new configuration data store dynamically scales up regardless of the size of the incoming traffic.
* Conduct more thorough capacity planning and load testing.
* Improve the resilience of the system by adding fallbacks to our secondary data store.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,
Atlassian Customer Support
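The root cause in the postmortem above comes down to provisioned read and write capacity being lower than the traffic required. As a rough illustration of the capacity planning it mentions, the sketch below estimates DynamoDB capacity units from a request rate and item size, using the documented unit sizes (one RCU covers one strongly consistent read per second of an item up to 4 KB, or two eventually consistent reads; one WCU covers one write per second of an item up to 1 KB). The function names, the traffic figures, and the 70% target utilization are assumptions made for this example; the postmortem does not publish the actual values.

```python
import math

RCU_ITEM_BYTES = 4096  # one RCU covers a read of up to 4 KB
WCU_ITEM_BYTES = 1024  # one WCU covers a write of up to 1 KB


def required_rcus(reads_per_sec, item_size_bytes, strongly_consistent=True):
    """Estimate read capacity units for a steady read rate."""
    units_per_read = math.ceil(item_size_bytes / RCU_ITEM_BYTES)
    total = reads_per_sec * units_per_read
    if not strongly_consistent:
        # Eventually consistent reads cost half as much.
        total = total / 2
    return math.ceil(total)


def required_wcus(writes_per_sec, item_size_bytes):
    """Estimate write capacity units for a steady write rate."""
    units_per_write = math.ceil(item_size_bytes / WCU_ITEM_BYTES)
    return math.ceil(writes_per_sec * units_per_write)


def is_under_provisioned(provisioned, required, target_utilization=0.7):
    """Flag a table whose required capacity exceeds the chosen share of
    its provisioned capacity (the 0.7 default is a hypothetical target)."""
    return required > provisioned * target_utilization


if __name__ == "__main__":
    # Hypothetical traffic: 100 reads/s and 50 writes/s of small items.
    rcus = required_rcus(100, 6000)
    wcus = required_wcus(50, 1500)
    print("RCUs needed:", rcus, "under-provisioned:", is_under_provisioned(250, rcus))
    print("WCUs needed:", wcus, "under-provisioned:", is_under_provisioned(250, wcus))
```

A check like this only covers steady-state load. The first remedial action above is about autoscaling, which has to react to traffic that this kind of static estimate did not anticipate.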

resolved

This incident has been resolved.

monitoring

We have identified and mitigated an issue with users logging in to Statuspage starting at 6:05am PST and ending at 6:46am PST and are monitoring the results.

Report: "Statuspage Cache Invalidation Delayed"

Last update
resolved

This incident has been resolved.

identified

We are continuing to work on a fix for this issue.

identified

Content changes to public status pages (e.g. incident creation and updates) were delayed by up to 15 minutes, for changes that were made beginning Mar. 4, 1pm PST. Users may have briefly seen out-of-date content when viewing public status pages during this time. A bug was identified in our cache invalidation layer and a fix is currently being deployed.
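One common way a missed cache invalidation produces a bounded delay like the one described above is a cache that combines explicit invalidation with a time-to-live (TTL): if the invalidation never arrives, readers see the old content until the entry expires. The sketch below illustrates that general pattern only. The `TtlCache` class, the 900-second TTL, and the injected clock are assumptions for this example, not a description of Statuspage's actual caching layer.

```python
class TtlCache:
    """Minimal cache with explicit invalidation and a TTL fallback."""

    def __init__(self, ttl_seconds, clock):
        self._ttl = ttl_seconds
        self._clock = clock  # callable returning the current time in seconds
        self._entries = {}   # key -> (value, stored_at)

    def get(self, key, load):
        """Return the cached value, reloading it once the TTL has elapsed."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]
        value = load()
        self._entries[key] = (value, now)
        return value

    def invalidate(self, key):
        """Drop an entry so the next read reloads fresh content."""
        self._entries.pop(key, None)


if __name__ == "__main__":
    now = [0.0]
    cache = TtlCache(ttl_seconds=900, clock=lambda: now[0])

    cache.get("page", lambda: "old content")
    # The page changes, but the invalidation call is missed.
    now[0] = 600.0
    print(cache.get("page", lambda: "new content"))  # still served from cache
    now[0] = 900.0
    print(cache.get("page", lambda: "new content"))  # TTL elapsed, reloaded
```

With this design the TTL acts as a safety net: a failed invalidation degrades freshness for at most one TTL period instead of indefinitely.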

Report: "Public API experiencing increased errors"

Last update
resolved

Increased load caused API service to experience errors from 10:03 - 10:15 PST.

Report: "Intermittent errors accessing public pages due to elevated traffic"

Last update
resolved

Due to elevated traffic, we experienced intermittent timeouts and errors in serving public pages between 5:57 and 5:58 AM PST. We have made updates to our services to prevent similar problems from happening.

Report: "Server errors from the Manage Portal"

Last update
resolved

Between 5:20 and 5:24 AM PST, a slow-performing API query resulted in an elevated number of 500 errors on the manage portal. The root cause of the degraded performance has been identified and a fix has been deployed. The manage portal is now operating normally.

Report: "Decrease in site availability due to errant database migration"

Last update
resolved

This incident has been resolved.

identified

A data migration erroneously dropped indexes that were still in use, causing a decrease in availability for a subset of all inbound requests.

Report: "Elevated server errors"

Last update
resolved

Increased load caused API service to experience errors from 8:15 to 8:45 AM PST.

Report: "Infrastructure Issues - Billing and signup impacted"

Last update
postmortem

### **SUMMARY**

Between December 7, 2021, at 15:54 UTC and December 8, 2021, at 01:55 UTC, Atlassian Cloud services using AWS services in the US-EAST-1 region experienced a failure. This affected customers using Atlassian Access, Bitbucket Cloud, Compass, Confluence Cloud, the Jira family of products, and Trello. Products were unable to operate as expected, resulting in partial or complete degradation of services. The event was triggered by an AWS networking outage in US-EAST-1 affecting multiple AWS services and led to the inability to access AWS APIs and the AWS management console. The incident was first reported by Atlassian Access, whose monitoring detected faults accessing DynamoDB services in the region. Recovery of affected Atlassian services occurred on a service-by-service basis from 2021-12-07 21:50 UTC, when the underlying AWS services also began to recover. Full recovery of Atlassian Cloud services was announced at 2021-12-08 01:55 UTC.

### **IMPACT**

The overall impact occurred between December 7, 2021, at 15:54 UTC and December 8, 2021, at 01:55 UTC. The incident caused partial to complete service disruption of Atlassian Cloud services in the US-EAST-1 region. Product-specific impacts are listed below.

The primary impact for customers of Jira Software, Jira Service Management and Jira Work Management hosted in the US-EAST-1 region was being unable to scale up, which caused slow response times for web requests and delays in background job processing, including webhooks in the AP region. There was significant latency for customers accessing Jira. Some customers experienced service unavailability while the incident took place.

Jira Align experienced an email outage for US customers due to the AWS service outage, which affected many AWS services including Simple Email Service. A small percentage of Jira Align emails were not sent due to the AWS incident.

Bitbucket Pipelines was unavailable and steps failed to be executed.

For Jira Automation, tenants' rule executions were delayed since CloudWatch was affected.

Confluence experienced minor impact due to upstream services impacting user management, search, notifications, and media. At the same time, Confluence was impacted by error rates related to the inability to scale up, and GraphQL had higher latencies.

Trello email-to-board and dashcards features experienced degraded performance.

Atlassian Access reported that product transfers from one organization failed intermittently. Admins were not able to update features like IP Allowlist, Audit Logs, Data Residency, Custom Domain Email Notification and Mobile Application Management, although users were still able to access and view these features. During the incident, emails to admins experienced a delay. There was a degraded experience when creating and deleting API tokens.

Statuspage was largely unaffected. However, notification workers could not scale up and communications to customers were delayed, though they could be replayed later. The incident also impacted users trying to sign in to manage portals and private pages.

Compass experienced a minor impact on its ability to write to its primary database store. No core features were affected.

Atlassian's customers could have experienced stale data issues in production in US-EAST-1 for ~30s, against an expected 5s at p99, because of delayed token resolution.

The provisioning of new cloud tenants was also impacted until the recovery of the services.

### **ROOT CAUSE**

The issue was caused by a problem with several network devices within AWS’s internal network. These devices were receiving more traffic than they were able to process, which led to elevated latency and packet loss. As a result, it affected multiple AWS services which Atlassian's platform relies on, causing service degradation and disruption to the products mentioned above. For more information regarding the root cause, see [Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region](https://aws.amazon.com/message/12721). No relevant Atlassian-driven events in the lead-up have been identified as causing or contributing to this incident.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We know that outages impact your productivity. We are taking immediate steps to improve the Atlassian platform's resiliency and availability to reduce the impact of such an event in the future. While Atlassian's Cloud services do run in several regions (US EAST and WEST, AP, EU CENTRAL and WEST, among others) and data is replicated across several regions to increase the resilience against outages of this magnitude, we have identified and are taking actions that include improvements to our region failover process. This will minimize the impact of future outages on Atlassian’s Cloud services and provide better support for our customers.

We are prioritizing the following actions to avoid repeating this type of incident:

* Enhance and strengthen our cross-region resiliency and disaster recovery plans, including continuing to practice region failover in production and investigating and implementing better resilience strategies for services (Active/Active or Active/Passive).
* Improve and adopt multi-region architecture for services that require it.
* Exercise wargaming scenarios that will simulate this outage to assess the customer view of the incident. This will allow us to create further action items to improve our region failover process.

We apologize to customers whose services were impacted during this incident.

Thanks,
Atlassian Customer Support

resolved

This incident has been resolved.

monitoring

Sign-in to the manage portal and certain private pages will resume usual authentication through Atlassian Access.

identified

Sign-in to the manage portal and certain private pages will take place through a link sent via email until the authentication issues have been resolved.

identified

Notification services have recovered and are operational.

identified

We're investigating issues affecting notifications. More information will be made available as soon as we can determine the cause and work toward a fix.

identified

We are continuing to work on a fix for this issue.

identified

We are continuing to work on a fix for this issue.

identified

We're investigating issues affecting authentication and sign-in.

identified

We're investigating issues affecting billing and signup, which may impact signing into the manage portal and private pages. More information will be made available as soon as we can determine the cause and work toward a fix.

Report: "Infrastructure Issues - Manage Portal"

Last update
resolved

Between 7:25 am and 7:47 am (PST), users may have experienced issues accessing the manage portal. These issues should be resolved now.

Report: "Errant deploy. Successfully rolled back to previous version."

Last update
resolved

A recent deploy was found to contain errors or significant performance degradations. Our infrastructure has been successfully rolled back to a previous version of the code, and traffic is being served as normal again.

Report: "Delays in SSL certificate provisioning"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Elevated server errors"

Last update
resolved

Increased load caused the manage service to experience errors.

Report: "Elevated Server Errors for Jira Software Integration."

Last update
resolved

The elevated server errors were due to a faulty certificate deployment in an upstream service. We have fixed the issue and returned the service to normal operations.

investigating

We are currently investigating this issue.

Report: "Elevated server errors in us east region"

Last update
resolved

The site experienced a higher-than-normal amount of load, which may have caused pages to be slow or unresponsive.

Report: "Timeouts accessing manage portal"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating a small percentage of timeouts when accessing the Manage portal.