Historical record of incidents for Pagely
Report: "Node.js v24 Upgrades"
Last updateThe scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
The Pagely team will be performing upgrades on all servers to use Node.js v24 LTS as the default system package. This change does not directly impact any web services on the server; however, it may impact any post-deployment steps, custom theme or plugin build steps, or similar operations which utilize the /usr/bin/node binary. If any code requires a specific version of Node.js, Pagely recommends managing local versions in client directories by following the steps in our documentation: https://support.pagely.com/hc/en-us/articles/360019779191-Update-and-Manage-Node-js-Versions-with-n
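If a build or post-deployment step depends on a particular Node.js major version, a small guard script can surface a mismatch early instead of failing mid-build. The sketch below is purely illustrative and is not part of Pagely's tooling; the expected major version and file name are assumptions you would adapt to your own project.

```typescript
// check-node-version.ts - hypothetical pre-build guard; not an official Pagely script.
// Fails fast if the Node.js binary running the build is not the major version the
// project expects, which helps catch changes to /usr/bin/node after system upgrades.

const EXPECTED_MAJOR = 24; // assumption: set this to the version your build requires

// process.version looks like "v24.1.0"; extract the major component.
const actualMajor = Number(process.version.slice(1).split(".")[0]);

if (actualMajor !== EXPECTED_MAJOR) {
  console.error(
    `Expected Node.js v${EXPECTED_MAJOR}.x but found ${process.version}. ` +
    `Consider pinning a local version (for example with the "n" version manager) as described in the Pagely docs.`
  );
  process.exit(1);
}

console.log(`Node.js ${process.version} matches the expected major version.`);
```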
Report: "Atomic - Issue Promoting Domains"
Last updateThis incident has been resolved.
A fix has been deployed and promoting, adding, and deleting domains should now be functioning as intended. We are continuing to monitor for any further issues.
The root cause has been identified and a fix is being deployed.
We are continuing to investigate this issue. This issue may also impact adding/removing domains from an App in Atomic.
We are currently investigating an issue in the Atomic control panel where domains are unable to be promoted to an App's Primary Domain. We will provide additional details shortly.
Report: "Atomic API Issues"
Last update: As we mentioned in our last update, we reverted the changes to our gateway which affected the Atomic API. We have not seen any further issues and will mark this incident as resolved.
We have reverted the changes and our API at this time is working as intended. We will continue to monitor this issue before marking this incident as resolved.
Additional changes are being deployed to the gateway and we expect a resolution to this issue shortly.
We have identified an issue with an endpoint related to a gateway configuration change made earlier today. We expect the fix to be rolled out shortly.
We are continuing to investigate this issue.
We are currently investigating issues with the Atomic API and will have additional updates shortly.
Report: "PressCDN Purge Issue"
Last update: This incident is now resolved. PressCDN purges should now complete successfully. Our team identified issues resulting from the earlier Atomic Control Panel Maintenance and we have taken the appropriate corrective action to restore functionality to API calls.
Our team has identified the root cause of the issue. A fix has been implemented and PressCDN purges should now complete successfully. We are continuing to monitor for any further issues.
We are currently investigating an issue with PressCDN purges returning API failures. We will provide additional details shortly.
Report: "App Provisioning Issue"
Last update: This incident has been resolved. Our team identified a newly provisioned internal server that was unintentionally collecting new app provisioning information. We have corrected the issue and app provisioning is now directed to the proper internal server.
Our team has identified the root of the issue. We have tested new provisioning and each app provision has been successful.
Unfortunately the app provisioning issue persists after an initial fix. This is a new incident to track further progress on the issue.
Report: "App Provisioning Issue"
Last update: This incident is resolved.
We have identified the source of the issue and new app provisioning is now successful. We will monitor the incident and mark it as resolved shortly.
We are continuing to investigate the issue and have identified some potential causes of the app provisioning failures. We will post additional updates as we find out more regarding this issue.
We are currently investigating an issue with app provisioning not completing successfully. We will provide additional details shortly.
Report: "Payment Processor Issues"
Last update: The incident with the payment processor has been resolved! We can confirm that payment processing is operating as expected now.
Our payment processor has reported that the service has been restored. We are monitoring how the service operates on Pagely's end. We will provide an update as we confirm that it operates as expected.
Our payment processor is testing the service before restoring its functionality. We will continue to follow the status of the service and report any updates accordingly.
Our payment processor is continuing to work on the issue. We will continue to monitor and update when we have more information.
Our payment processor has identified the issue and is currently working on restoring services. We will update once we have further information.
Our payment processor is still investigating the issue. We will update once we have further information.
Our payment processor is still investigating the issues.
Our payment processor is still investigating the issue.
Our payment processor is still investigating further. We will update once we have further information.
Our payment processor is investigating this issue further at this time.
We are currently investigating an issue with our payment processor preventing payments and stopping sign-ups.
Report: "Shared database outage for vps-virginia-aurora-20"
Last update: The new database environment is working as expected and no further action is needed at this time. This incident has been resolved.
Our engineering team has restored vps-virginia-aurora-20 to a new database environment and sites are now able to connect to this new environment properly. We are still monitoring this new environment and site availability has been restored.
Our engineering team is actively working with AWS regarding issues with the database vps-virginia-aurora-20.
Our team is continuing to work with AWS to implement a fix at this time.
The issue has been identified with the database and a fix is being implemented.
We were alerted to an issue with the shared database vps-virginia-aurora-20 and are currently investigating the issues at this time.
Report: "Availability issues with Web128/Web129"
Last update: This incident has been resolved.
Our team has resolved the issue with high load on web128 and web129. We will continue to monitor these instances and provide another update shortly.
We are currently investigating availability issues with our shared instances web128 and web129. We will provide further updates as we continue investigation into this incident.
Report: "Intermittent issues with Atomic"
Last update: This incident has been resolved.
Pagely has fixed the Atomic control panel issue and is continuing to monitor performance.
We have identified the root cause for this issue, and are actively working to implement a fix.
We are aware of intermittent issues with Pagely's Atomic control panel. We are investigating.
Report: "Service degradation with Zendesk support"
Last update: This incident has been resolved.
We are investigating an issue with Zendesk. Customers can still create support tickets within Atomic, but live chat and ticket response times may be slower than normal.
Report: "Atomic Control Panel is down"
Last update: This incident has been resolved.
The issue was related to a third-party payment provider having issues with their service.
We are currently investigating issues with Atomic Control Panel being down.
Report: "P20 service degradation"
Last update: The shared-hosting service degradation should now be resolved. If your site is still having issues, please contact Pagely support.
We are continuing to monitor our shared-hosting infrastructure for ongoing issues.
At this time, we believe the issue should be resolved for the majority of shared-hosting customers.
The cause of the shared-hosting issue has been identified, and we are exploring potential fixes.
We are investigating an issue impacting a subset of our shared-hosting customers. Details will be added as they become available.
Report: "Atomic Control Panel - API Errors"
Last update: This incident has been resolved.
We're experiencing an elevated level of API errors within Atomic and are currently looking into the issue.
Report: "Support Ticket Creation Issues for Collaborators"
Last update: This issue has been identified and resolved.
We are currently investigating an issue where a small subset of collaborators cannot submit a ticket through our ticketing system. We will provide further updates as we continue to resolve this issue.
Report: "CI/CD Deployment Issues"
Last update: The incident has been resolved.
We are continuing to deploy the fix, although completion has been delayed. We will post any further updates we have.
We expect the issue to be resolved shortly as we are still finishing up the rollout of the needed updates.
We are continuing to deploy the fix to correct the issue with CI/CD and will post any further updates we have.
The issue with CI/CD deployments has been identified and the team is currently working on deploying the update to our environment.
Pagely engineers are currently investigating issues with CI/CD deployments and we will have further information via this status page once we have identified the issue.
Report: "Service degradation with Zendesk support"
Last update: Zendesk reports that this issue has been resolved.
We are investigating an issue with Zendesk. Customers can still create support tickets within Atomic, but live chat and ticket response times may be slower than normal.
Report: "RDS us-east-2 (Ohio) Availability Issues"
Last update: The issue has now been resolved.
The RDS service affected by the outage is responsive again and sites should no longer have issues connecting to the database. We will continue to monitor and provide a further update once we are comfortable closing this incident.
We're currently investigating an issue with the AWS us-east-2 (Ohio) region causing an outage for the RDS service within that region.
Report: "Shared hosting infrastructure outage"
Last update: No further issues have come up after our fix was implemented. The issue has been resolved.
A fix has been applied and the shared hosting infrastructure is stable at this point. We'll continue to monitor the infrastructure.
We were alerted to issues with our shared hosting infrastructure and are currently investigating the shared hosting platform at this time.
Report: "Images missing from support.pagely.com"
Last update: All images appear to be loading properly at this time with no further issues.
A fix has been made and images are appearing online at this time. We'll continue monitoring to ensure all images are working as intended.
Around 14:50 UTC Pagely engineers identified an issue with our support documentation in which all images fail to load. A fix is being worked on to bring the missing images back online. We will update as progress is made.
Report: "Availability issues within the us-east-1 (Virginia) region"
Last update: We have not observed any new issues over the past couple of hours so we'll go ahead and mark this incident as resolved.
We have not received official word from AWS about the issue so far, but we're no longer seeing any issues affecting our customers' servers, and the number of reported issues on external monitoring sites has since dropped substantially. We're continuing to monitor the status of our hosting infrastructure for now.
We're currently investigating a potential AWS issue within the us-east-1 (Virginia) region. We're seeing server connectivity issues that auto-resolve within a few minutes. A small number of sites may have experienced a brief outage while the network connection to their server was degraded, but we haven't seen these outages last more than a couple of minutes.
Report: "Shared Hosting Maintenance - web124/web125"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
Pagely engineers will be performing maintenance on the following shared web nodes:
- web124
- web125
During maintenance, dynamic/uncached requests to these web nodes will not respond for 1-2 minutes. Within Atomic, you can determine if your apps are affected by this maintenance by referring to the "Server Name" under "Primary Server Info" within your app overview. If the server name listed is not in the list above, then your app is not affected by this maintenance.
Report: "VPS User Provisioning"
Last update: A fix has been implemented and tested successfully. No further issues are noted at this time.
We are currently aware of and reviewing issues with creating, editing, or adding new users or SSH keys. This does not affect any existing users or keys, only new keys and new users at this time. We will post additional updates as we have more information towards a solution.
Report: "Security Alert for customers using CircleCI with Pagely."
Last update: This security alert is now resolved; however, we still urge customers to take the precautions outlined in the steps within this alert. We've also provided the full incident report from CircleCI, which was posted on January 13th and can be read on their blog: https://circleci.com/blog/jan-4-2023-incident-report/
CircleCI recently disclosed a security event on their blog: https://circleci.com/blog/january-4-2023-security-alert/ The nature of the disclosure relates to potential compromise of all secrets stored within a repository on the CircleCI platform. While CircleCI has taken steps since the initial disclosure to automatically rotate what they can for you, there are certain things that rely on you to fully resolve the matter. We urge customers using CircleCI to take the following steps as soon as possible:
- Immediately rotate any and all secrets stored in CircleCI. There's a tool available to fetch all of your secrets from CircleCI. (https://github.com/CircleCI-Public/CircleCI-Env-Inspector)
- Delete and re-create any CI/CD Integrations or Webhook configurations in Atomic if they were used with CircleCI. Full documentation can be found here: https://support.pagely.com/hc/en-us/articles/360050828232-Automatically-Deploying-Your-WordPress-Site-with-CircleCI - after recreating your integrations, you will need to update the integration ID and secret within your pipeline configuration.
- If you are using SSH keys to perform any deployments, please regenerate those as well.
If you have any questions or concerns regarding this event, please do not hesitate to Contact Pagely Support: https://support.pagely.com/hc/en-us/articles/114094215332-Contacting-Support
Report: "Redis 6.2.10 Security Release"
Last update: The scheduled maintenance has been completed.
Pagely will be performing a network-wide upgrade of Redis to v6.2.10 beginning Jan 18 02:00 UTC to address security flaws in previous versions. Anticipated impact involves a flush of Redis cache which may cause a temporary increase in server load and response times until object caches are gradually and automatically repopulated. If you have any questions or concerns regarding this event, please do not hesitate to Contact Pagely Support: https://support.pagely.com/hc/en-us/articles/114094215332-Contacting-Support
Report: "Increased Load/Response Times for New Relic users"
Last update: Impacted hosts have had fixes applied to resolve this issue.
A fix has been implemented and is being deployed to all affected hosts. We will continue to monitor the situation.
We have identified an issue involving higher than normal server CPU usage and response times for customers using New Relic, and we are currently implementing a fix. We will post additional updates here as we have more information.
Report: "us-east-2 (Ohio) Availability Issues"
Last update: After further monitoring, no further issues appear to be happening within the us-east-2 Ohio region. Everything has been resolved at this time.
We've gone ahead and confirmed that all servers appear to be operational from within the us-east-2 Ohio region at this time. We will continue to monitor the region to ensure no other issues remain.
At this time, most of the AWS issues that appeared to be the root cause have been cleared. We are continuing to work on servers within the region that may still be unavailable to ensure they recover fully.
Latest update from AWS: Instance Impairments 11:25 AM PDT We continue to make progress in recovering the remaining EC2 instances and EBS volumes affected by the loss of power in a single Availability Zone in the US-EAST-2 Region. The vast majority of EC2 instances are now healthy, but we continue to work on recovering the remaining EBS volumes affected by the issue. EC2 API error rates and latencies have returned to normal levels. Elastic Load Balancing remains weighted away from the affected Availability Zone. Error rates and latencies for Lambda function invocations have now returned to normal levels. Power has been restored to all affected resources and remains stable. We expect the recovery of EC2 instances and EBS volumes to continue to improve over the next 30 minutes. For customers that need immediate recovery, we recommend failing away from the affected Availability Zone as other Availability Zones are not affected by this issue.
Further update from AWS: 10:49 AM PDT We continue to see recovery of EC2 instances that were affected by the loss of power in a single Availability Zone in the US-EAST-2 Region. At this stage, the vast majority of affected EC2 instances and EBS volumes have returned to a healthy state and we continue to work on the remaining EC2 instances and EBS volumes. Elastic Load Balancing has shifted traffic away from the affected Availability Zone. Single-AZ RDS databases were also affected and will recover as the underlying EC2 instance recovers. Multi-AZ RDS databases would have mitigated impact by failing away from the affected Availability Zone. While the vast majority of Lambda functions continue operating normally, some functions are experiencing invocation failures and latencies, but we expect this to improve over the next 30 minutes. Power has been restored to all affected resources and remains stable. We expect the recovery of EC2 instances and EBS volumes to continue to improve over the next 45 minutes. For customers that need immediate recovery, we recommend failing away from the affected Availability Zone as other Availability Zones are not affected by this issue.
Our team is working on failover options while AWS works to resolve this issue. Latest update from AWS: Instance Impairments 10:25 AM PDT We can confirm that some instances within a single Availability Zone (USE2-AZ1) in the US-EAST-2 Region have experienced a loss of power. The loss of power is affecting part of a single data center within the affected Availability Zone. Power has been restored to the affected facility and at this stage the majority of the affected EC2 instances have recovered. We expect to recover the vast majority of EC2 instances within the next hour. For customers that need immediate recovery, we recommend failing away from the affected Availability Zone as other Availability Zones are not affected by this issue.
We're currently investigating an issue with the AWS us-east-2 (Ohio) region causing an outage for servers within that region.
Report: "Tickets and Chats Outage"
Last update: As of 12:15 PM MST, the issues and errors around tickets and chats from our provider have been resolved.
Latest update: July 28, 2022 11:36 AM: We are beginning to see some improvement in the error rates affecting the US-East region. Our team is monitoring and we will post another update in the next 30 minutes.
Update from provider: July 28, 2022 11:00 AM: Our team continues to investigate elevated error rates and access issues in the US-East region. We will post another update within the next 30 minutes.
Latest updates from the provider: We have confirmed access issues and high error rates in the US-East region. Further updates to come shortly.
We're currently investigating an issue with our Live Chat and Ticketing system in which the systems are intermittently available with our platform provider and working on a resolution in order to bring them fully back online.
Report: "Cloudflare Service Issues"
Last update: This incident has been resolved.
A fix has been implemented by Cloudflare and we are monitoring the results.
Cloudflare is currently investigating issues across their entire network. Pagely customers using Cloudflare may be affected and unable to access their sites.
Report: "Ares Configuration Management Issue in Atomic Dashboard"
Last update: We have tested and verified the issue is now resolved.
A fix has been implemented and we are monitoring the results.
We've identified the issue and are implementing a fix.
The issue also affects the provisioning of new apps - we ask that you please hold off on creating new apps in Atomic until the issue is resolved. If you urgently need to create a new app then please contact our Support.
We are continuing to investigate this issue.
Pagely Engineers are investigating an issue impacting the ability to add HTTP redirects, custom access rules and other ARES configuration operations within the Atomic Dashboard. This does not affect the availability or performance of any existing websites. If you are planning to perform any ARES rule management operations in Atomic, we sincerely apologize for this inconvenience and ask that you hold off on making changes within Atomic until we have the problem resolved.
Report: "Site Management Problems in Atomic Dashboard"
Last update: We have tested and verified the issue is now resolved.
The fix has been deployed and we are currently verifying that it has resolved the issue. This incident will be resolved if all looks good.
We have identified the cause of the reported issues and a fix is on the way out.
Pagely Engineers are investigating an issue impacting the ability to add new sites, manage PHP versions, and perform other site management operations within the Atomic Dashboard. This does not affect the availability or performance of any existing websites. If you are planning to perform any site management operations in Atomic, we sincerely apologize for this inconvenience and ask that you hold off on making changes within Atomic until we have the problem resolved.
Report: "Slack Outage"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
Due to the widespread Slack outage, customers will be unable to access their private Slack support channels to interact with Pagely support. Live chat and Atomic support tickets are still functioning as normal.
Report: "Cloudflare Possible Network Performance Issues in West Coast (USA)"
Last update: This incident has been resolved.
Similar to issues experienced earlier today, Cloudflare is now reporting: "Cloudflare is Observing Possible Network Performance Issues in West Coast (USA)" If you are routing your Pagely traffic through Cloudflare, you may be experiencing Cloudflare error response codes like 520 or 524. Cloudflare has been updating the status of this incident on their end at https://www.cloudflarestatus.com/ Pagely hosting itself is not directly affected.
Report: "Cloudflare network congestion"
Last update: Cloudflare is now reporting "All Systems Operational."
Since approximately 17:29 UTC or before, Cloudflare has been experiencing network congestion. If you are routing your Pagely traffic through Cloudflare, you may be experiencing Cloudflare error response codes like 520 or 524. Cloudflare has been updating the status of this incident on their end at https://www.cloudflarestatus.com/ Pagely hosting itself is not directly affected.
Report: "WordPress 5.8.3 Security Release"
Last update: Upgrades are now complete.
The Pagely team has already begun rolling out this patch for all customers. If you have a version hold request on file, we will patch your site while keeping it on the same major branch version.
Report: "AWS incident causing availability issues for multiple VPS's"
Last update: Update from Pagely: While we will continue monitoring for any issues, this issue is now resolved. --- Update from AWS: [4:22 PM PST] Starting at 4:11 AM PST some EC2 instances and EBS volumes experienced a loss of power in a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. Instances in other data centers within the affected Availability Zone, and other Availability Zones within the US-EAST-1 Region were not affected by this event. At 4:55 AM PDT, power was restored to EC2 instances and EBS volumes in the affected data center, which allowed the majority of EC2 instances and EBS volumes to recover. However, due to the nature of the power event, some of the underlying hardware experienced failures, which needed to be resolved by engineers within the facility. Engineers worked to recover the remaining EC2 instances and EBS volumes affected by the issue. By 2:30 PM PST, we recovered the vast majority of EC2 instances and EBS volumes. However, some of the affected EC2 instances and EBS volumes were running on hardware that has been affected by the loss of power and is not recoverable. For customers still waiting for recovery of a specific EC2 instance or EBS volume, we recommend that you relaunch the instance or recreate the volume from a snapshot for full recovery.
Update from Pagely: At this time our engineering team is not observing any issues with our customer's servers, databases, or sites. Our team will continue monitoring for any issues and providing relevant updates as they become available. --- Latest Updates from AWS: [12:03 PM PST] Over the last hour, after addressing many of the underlying hardware failures, we have seen an accelerated rate of recovery for the affected EC2 instances and EBS volumes. We continue to work on addressing the underlying hardware failures that are preventing the remaining EC2 instances and EBS volumes. For customers that continue to have EC2 instance or EBS volume impairments, relaunching affected EC2 instances or recreating affecting EBS volumes within the affected Availability Zone, continues to be a faster path to full recovery. [1:39 PM PST] We continue to make progress in addressing the hardware failures that are delaying recovery of the remaining EC2 instances and EBS volumes. At this stage, if you are still waiting for an EC2 instance or EBS volume to fully recover, we would strongly recommend that you consider relaunching the EC2 instance or recreating the EBS volume from a snapshot. As is often the case with a loss of power, there may be some hardware that is not recoverable, which will prevent us from fully recovering the affected EC2 instances and EBS volumes. We are not quite at that point yet in terms of recovery, but it is unlikely that we will recover all of the small number of remaining EC2 instances and EBS volumes. If you need help in launching new EC2 instances or recreating EBS volumes, please reach out to AWS Support. [3:13 PM PST] Since the last update, we have more than halved the number of affected EC2 instances and EBS volumes and continue to work on the remaining EC2 instances and EBS volumes. The remaining EC2 instances and EBS volumes have all experienced underlying hardware failures due to the nature of the initial power event, which we are working to resolve. We expect to make further progress on this list within the next hour, but some of the remaining EC2 instances and EBS volumes may not be recoverable due to hardware failures. If you have the ability to relaunch an affected EC2 instance or recreate an affected EBS volume from snapshot, we continue to strongly recommend that you take that path.
Latest update from AWS: [11:08 AM PST] We continue to make progress in restoring power and connectivity to the remaining EC2 instances and EBS volumes, although recovery of the remaining instances and volumes is taking longer than expected. We believe this is related to the way in which the data center lost power, which has led to failures in the underlying hardware that we are working to recover. While EC2 instances and EBS volumes that have recovered continue to operate normally within the affected data center, we are working to replace hardware components for the recovery of the remaining EC2 instances and EBS volumes. We have multiple engineers working on the underlying hardware failures and expect to see recovery over the next few hours. As is often the case with a loss of power, there may be some hardware that is not recoverable, and so we continue to recommend that you relaunch your EC2 instance, or recreate you EBS volume from a snapshot, if you are able to do so.
Latest update from AWS: [9:28 AM PST] We continue to make progress in restoring connectivity to the remaining EC2 instances and EBS volumes. In the last hour, we have restored underlying connectivity to the majority of the remaining EC2 instance and EBS volumes, but are now working through full recovery at the host level. The majority of affected AWS services remain in recovery and we have seen recovery for the majority of single-AZ RDS databases that were affected by the event. If you are able to relaunch affected EC2 instances within the affected Availability Zone, that may help to speed up recovery. Note that restarting an instance at this stage will not help as a restart does not change the underlying hardware. We continue to work towards full recovery.
Almost all customer servers should be fully recovered at this time. Some high availability clusters are still running in impaired mode but that doesn't affect the availability of the sites at the moment.
Latest update from AWS: [6:51 AM PST] We have now restored power to all instances and network devices within the affected data center and are seeing recovery for the majority of EC2 instances and EBS volumes within the affected Availability Zone. For the remaining EC2 instances, we are experiencing some network connectivity issues, which is slowing down full recovery. We believe we understand why this is the case and are working on a resolution. Once resolved, we expect to see faster recovery for the remaining EC2 instances and EBS volumes. If you are able to relaunch affected EC2 instances within the affected Availability Zone, that may help to speed up recovery. Note that restarting an instance at this stage will not help as a restart does not change the underlying hardware. We have a small number of affected EBS volumes that are still experiencing degraded IO performance that we are working to recover. The majority of AWS services have also recovered, but services which host endpoints within the customer’s VPCs - such as single-AZ RDS databases, ElasticCache, Redshift, etc. - continue to see some impact as we work towards full recovery.
We are continuing to work on a fix for this issue.
Update from Pagely: We are starting to see signs of recovery and have restored a portion of the affected servers. Some servers and RDS instances are still unavailable so we're working towards restoring those. Latest update from AWS: [5:39 AM PST] We have now restored power to all instances and network devices within the affected data center and are seeing recovery for the majority of EC2 instances and EBS volumes within the affected Availability Zone. Network connectivity within the affected Availability Zone has also returned to normal levels. While all services are starting to see meaningful recovery, services which were hosting endpoints within the affected data center - such as single-AZ RDS databases, ElastiCache, etc. - would have seen impact during the event, but are starting to see recovery now. Given the level of recovery, if you have not yet failed away from the affected Availability Zone, you should be starting to see recovery at this stage.
The issue is also affecting our Atomic dashboard, which is currently unavailable or returning errors intermittently.
Latest update from AWS: [5:18 AM PST] We continue to make progress in restoring power to the affected data center within the affected Availability Zone (USE1-AZ4) in the US-EAST-1 Region. We have now restored power to the majority of instances and networking devices within the affected data center and are starting to see some early signs of recovery. Customers experiencing connectivity or instance availability issues within the affected Availability Zone, should start to see some recovery as power is restored to the affected data center. RunInstances API error rates are returning to normal levels and we are working to recover affected EC2 instances and EBS volumes. While we would expect continued improvement over the coming hour, we would still recommend failing away from the Availability Zone if you are able to do so to mitigate this issue.
Latest update from AWS: [5:01 AM PST] We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.
Update from AWS: [4:35 AM PST] We are investigating increased EC2 launch failures and networking connectivity issues for some instances in a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. Other Availability Zones within the US-EAST-1 Region are not affected by this issue.
An issue with AWS in US-EAST-1 (Virginia) region is currently causing an outage for a large number of servers in the region.
Report: "Atomic Control Panel - API Errors"
Last update: This incident has been resolved.
We're experiencing an elevated level of API errors within Atomic and are currently looking into the issue.
Report: "AWS incident causing availability issues with Atomic"
Last update: This incident has been resolved.
Latest updates from AWS: [2:04 PM PST] We have executed a mitigation which is showing significant recovery in the US-EAST-1 Region. [...] We still do not have an ETA for full recovery at this time. [2:43 PM PST] We have mitigated the underlying issue that caused some network devices in the US-EAST-1 Region to be impaired. [...] We continue to work toward full recovery for all impacted AWS Services and API operations. [...] [3:03 PM PST] Many services have already recovered, however we are working towards full recovery across services. [...] https://status.aws.amazon.com/ --- Update from Pagely: At this time Atomic is mostly stable, however some actions from within the dashboard may not function fully at this time. Pagely engineers are continuing to reinstate full functionality of Atomic still. We will continue our efforts and appreciate all of your patience with this continuing issue.
Latest update from AWS: [11:26 AM PST] We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. [...] The root cause of this issue is an impairment of several network devices in the US-EAST-1 Region. We are pursuing multiple mitigation paths in parallel, and have seen some signs of recovery, but we do not have an ETA for full recovery at this time. [...] https://status.aws.amazon.com/ --- Update from Pagely: Pagely Engineers have been working diligently to reinstate full functionality of Atomic. Unfortunately, the nature of the issue occurring at AWS limits our options for viable workarounds. We will continue our efforts to remediate the issue. The impact of this issue as it relates to Pagely is limited to the Atomic Dashboard. Our data does not indicate any problems with your actual websites or host servers. With that said, we will continue to monitor things using systems residing outside of AWS and remediate any issues that may occur. Thank you for your continued patience while this incident is ongoing.
Customer API and CI/CD should be working but may intermittently return errors.
The outage further affects the Atomic API and CI/CD Integrations.
An issue with AWS in us-east-1 (Virginia) region is currently causing an outage of the Atomic control panel.
Report: "Atomic Login Issues"
Last update: The issue preventing users from logging into the Atomic control panel has been resolved. Feel free to contact us if you are still experiencing any issues.
A fix has been applied to allow users to log in again. We will continue to monitor for further issues. If you continue to experience any issues logging in, don't hesitate to contact our support team.
Pagely Engineers are currently investigating issues with logging into the Atomic control panel. Some users may not be able to log in to their accounts at this time due to these issues. We will update as progress is made to fix the issues.
Report: "Migration to new payment gateway"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
In an ongoing effort to improve our tools and our users' experience, we are currently migrating to a new billing system. During this migration, our support team will be available to assist with any billing information updates. Please submit a ticket (https://support.pagely.com/hc/en-us/articles/114094215332-Contacting-Support) for additional details. We appreciate your patience and apologize for any inconvenience this may cause.
Report: "Facebook Services Down -"
Last update: Facebook is back online and functioning normally at this time. Our monitoring doesn't show any further performance issues in relation to Facebook's outage.
Facebook and services that are part of Facebook have all appeared to come back online. However, some services may still be recovering at this time. You can view the status of some Facebook services at the following: https://status.fb.com/ Customers utilizing some of their services may still experience some issues at this time.
We're currently aware that Facebook services appear to be fully down at this time. As a result of this, sites that may utilize Facebook in any way may be running into timeouts or "Service Unavailable" errors.
Report: "WordPress 5.8.1 Security and Maintenance Release"
Last update: Upgrades are now complete.
The Pagely team has already begun rolling out this patch for all customers. If you have a version hold request on file, we will patch your site while keeping it on the same major branch version.
Report: "Intermittent connectivity issues"
Last update: The issue has now been resolved and our servers are no longer experiencing connectivity issues.
We're continuing to investigate this issue. We're being alerted about intermittent 5-10 minute outages. We're not seeing errors or signs of resource contention on the servers; however, some requests do not reach the servers due to networking issues. A small number of servers in the Ohio hosting region may also be affected.
We are currently investigating what appear to be connectivity/networking issues on AWS's side. This is causing intermittent timeout errors over a wide range of sites and has so far only been affecting servers in the Virginia hosting region.
Report: "Chat System Outage"
Last update: This issue is now resolved and live chat is operational.
We're currently investigating an issue with our Live Chat system with our platform provider and working on a resolution in order to bring it back online.
Report: "Fastly is having an outage"
Last update: Fastly status is reporting the incident as resolved: https://status.fastly.com/ Our monitoring does not show any sites still down due to the outage. As such we are marking this incident as resolved.
Fastly status page now shows "The issue has been identified and a fix is being implemented". Customers that use Fastly can monitor https://status.fastly.com/ for further updates. Our monitoring is not reporting any sites as being down at this time.
Fastly appears to be having an outage. We are unable to load the main page from several locations and some client sites using Fastly are reporting down. Fastly has posted an update on their status page saying that they're currently investigating the issue: https://status.fastly.com/ Customers using Fastly may experience their sites being down at this time. Pagely hosting is unaffected.
Report: "WordPress Core Updates to 5.7.1"
Last update: Upgrades are now complete.
We have approved the release quality of WordPress version 5.7.1 and will be deploying it for all customer websites which don't have a version hold configured. Every customer is always up to date with their respective latest minor security release, and a more consistent core version will ensure better compatibility as well as improve logistics. Completion is expected by the 28th of April.
Report: "Atomic Control Panel - Elevated Response Times"
Last update: This incident has been resolved.
A fix has been applied and we are currently monitoring performance. All services should be operating normally at this time, however please do not hesitate to contact our Support Team if you are still experiencing any problems with Atomic. A resolution will be posted once we conclude things are back to normal.
Pagely Engineers are currently investigating elevated response times for interactions within the Atomic Control Panel. This does not relate to the performance of your actual hosting service. Certain interactions within Atomic, such as loading different sections or performing searches, may be performing inconsistently or returning error messages for exceeding timeouts. We will update this page as progress is made.
Report: "International traffic into the Sydney datacenter network connectivity issues"
Last update: The external internet provider to the Sydney region network has been shifted to recover from this event. This incident has been resolved.
The issue has been identified and a fix is being implemented by our underlying provider for the Sydney region. The network connectivity monitoring alerts have resolved, but we're continuing to keep an eye on things.
Internal monitoring has alerted us to network connectivity issue in the Sydney region, that our Operations team is working to resolve. Some international visitor requests outside of Australia into the Sydney region may be impacted while this is ongoing. Visitor requests within Australia should not be impacted at this time. Apologies on the disruption and a further update will be posted shortly.
Report: "Slack Outage"
Last update: Slack is reporting that this issue is resolved, so customers with private Slack support channels should be able to use them as normal at this time.
At this time, Slack is connecting but may still be slow to respond.
Due to the widespread Slack outage, customers will be unable to access their private Slack support channels to interact with Pagely support. Live chat and Atomic support tickets are still functioning as normal.
Report: "RDS connectivity issue on p20-aurora-2 shared RDS"
Last update: Everything has looked stable since yesterday's database restoration. This incident is now resolved.
At this time your sites are operational again. Pagely Engineers have reinstated your databases to a very near point-in-time recovery. The first signs of problems occurred at 18:18 UTC, and we have restored your databases to a point-in-time backup at 18:00 UTC.
The issue has been identified and a fix is being implemented.
Internal monitoring has alerted to an issue on a shared RDS (p20-aurora-2) that our Operations team is working to resolve. This appears to be related to a bug on the Amazon RDS service. We are currently working to bring up a new RDS cluster from a recent point in time to send traffic to it. Some customer sites, particularly uncached traffic requests, may be impacted while this is ongoing. Apologies on the disruption and a further update will be posted shortly.
Report: "RDS connectivity issue on a virginia-aurora shared RDS."
Last update: After more discussion with the database team at Amazon Web Services, we have a better understanding of the failure that occurred. The root cause was determined to be a rare bug within the Aurora database engine. The bug causes the mysql process to be unable to start up properly when an ALTER statement is interrupted by a DB instance reboot. This is why we were unable to launch new instances or new DB clusters from point-in-time recovery targets after the ALTER was issued, which contributed to the extended downtime you experienced. Amazon was able to correct the problem for the affected system and they have confirmed that this bug will be fixed in an upcoming RDS update. Pagely will apply the patch to all of our Aurora Database Clusters as soon as it becomes available. In the interim, our DevOps team knows about this issue and we feel confident that it will not recur. The conditions for triggering this bug are very specific, which we can account for and avoid rather easily. Our team has adopted new Standard Operating Procedures which take this condition into account when interacting with our database clusters. We appreciate your patience and understanding both during and following this event.
Summary
On Friday December 4, 2020, one of our managed Amazon Aurora Database Clusters, vps-virginia-aurora-3, experienced an extended outage lasting approximately three and a half hours. A relatively small portion of the overall sites hosted in this region reside on this DB cluster; that is to say, there was plenty of spare capacity at the time of the event. Affected sites experienced database connection errors throughout the duration of the event. The recovery point of your data when services were restored is approximately 15-70 minutes prior to the onset of the service interruption.
- Database services for the affected sites were unavailable between 9:15AM PST and 12:30PM PST.
- By 12:30PM PST, service availability was restored to a backup DB cluster with a 9:00AM PST point-in-time.
- By approximately 7:00PM PST, all restored sites were fully migrated to a brand new Aurora RDS DB Cluster and away from the problem system.
More Details
Our investigation is ongoing and we are working closely with the team at Amazon to fully understand the nature of this issue. Although database issues have happened in the past, they are usually resolved within a few minutes, not hours. An event of this nature had not occurred for us before. We need more time to investigate the matter with Amazon before we can say definitively what the cause was. Rest assured, both Pagely and Amazon are interested in finding a root cause so that a similar event cannot occur in the future. We have already had some great discussions on mitigating this type of impact in the future, and we continue to work on determining a root cause. We can tell you that the behavior we observed of this Aurora Cluster was not typical and it also got the attention of the database team at AWS who, independent of Pagely's investigation, noticed the DB cluster was behaving erratically and connected with us to let us know they're applying an emergency fix. Typical actions such as adding a reader instance, performing a failover, or restarting a DB instance were not working for this cluster until steps were taken by AWS to address a problem they were seeing. While the issue was ongoing with the original DB cluster, Pagely Engineers were also launching new DB clusters with varying point-in-time recovery targets. This is a proactive step we will take if we feel the time it could take for a system to recover exceeds the time it may take to launch a new DB cluster with slightly older data. Our goal during these moments is to get sites running again as quickly as possible and with the last-known good set of data. At a certain point in the incident, because things were taking so long, we told you we'd restore from older (less than 24hrs) SQL backups, but we actually were able to get a DB cluster launched with a fairly recent point-in-time recovery target (15-70 minutes old). After this recovery was performed, and with the assistance of AWS, the originally affected system was also brought back to an operational state. This system is currently under evaluation and is not currently powering any of your live sites. Migrations were performed to get all affected sites relocated to a completely different and newly built Aurora DB cluster. With that said, if you think you are missing any data please let us know and we can provide you with a separate SQL dump from the affected system for manual examination. We want to assure you that every step was taken to restore service availability as soon as possible and with the most current possible data set.
Some of these operations take time to complete, even when everything is working correctly. When things are not working correctly, recovery timelines can be impacted further. We have a playbook we follow in these situations and we always try to think a few steps ahead. This typically leads to no or very little noticeable impact to your services, but then there are days like today. We always work to keep events of this severity a rarity, if not a faint memory, most of the time, and we thank you for your understanding as we worked to get things back to normal.
At this time your sites are operational again. Pagely Engineers have reinstated your databases to a very near point-in-time recovery. We did not need to resort to the older SQL backups. The first signs of problems occurred between 16:15 and 17:00 UTC, and we have restored your databases to a point-in-time backup at 16:00 UTC. Further efforts are underway to migrate your databases to a final placement on one of our very newest DB clusters. We will follow up shortly with additional information, including a root cause analysis and issuance of service credits. Thank you.
Pagely Engineers continue to wait for the most recent data sets in our restoration efforts to complete provisioning. At this time, we will begin restoring affected sites from your regular SQL backups. This data is slightly older, but no more than 24 hours old. This is only being done because of the extended time it is taking to recover sites with more current data; we'd like to reinstate site availability as soon as possible. Our team will happily assist in providing more recent data after that is made available by the system.
Pagely Engineers are still working to restore availability of the databases for all affected sites. We sincerely apologize for the extended delay; a full post-mortem will be provided after services are restored. Our team is currently working through a novel failure case that is not fixable by the typical remediation steps we take - such as adding a new replica instance to a DB cluster - and attempts to launch a new cluster based on the latest point in time are also taking an extended period of time to complete. While we are waiting for these contingency measures to finish provisioning, the originally affected database cluster is beginning to show signs of self-recovery. So our team will continue to assess the situation and make a decision soon based on the earliest available resource to restore your applications. Depending on the outcome of that, the data may be very current or slightly (15-30 minutes) behind. We will continue to report on progress as it is made.
The vps-virginia-aurora-3 database cluster has experienced a critical failure. Although data integrity is still okay, we are having trouble getting the DB cluster to start. Pagely Engineers have already initiated the process to launch a new DB cluster with the latest available data set as a point-in-time recovery. Once this resource has finished creating, we will update your application to use the new endpoint.
We are continuing to work on a fix for this issue.
Internal monitoring has alerted to an issue on a shared RDS that our Operations team is working to resolve. Some customer sites, particularly uncached traffic requests, may be impacted while this is ongoing. Apologies on the disruption and a further update will be posted shortly.