Historical record of incidents for Aptible
Report: "Increased error rate"
Last update: We are investigating an increased error rate in our API, which may be causing failed operations.
Report: "Aptible Documentation Site Unavailable"
Last update: This incident has been resolved.
Our online documentation at aptible.com/docs is temporarily unavailable. We are working with our upstream provider to resolve the issue and will update this incident when it is resolved.
Report: "Aptible Documentation Site Unavailable"
Last update: This incident has been resolved.
Our online documentation at aptible.com/docs is temporarily unavailable. We are working with our upstream provider to resolve the issue and will update this incident when it is resolved.
Report: "Route53 increased propagation delays"
Last update: Route 53 record propagation appears to have returned to normal.
We've noticed that some Operations are failing due to Route53 record changes not propagating within the 10 minute time limit allowed by our platform. Running Apps and Databases are not impacted, but creation or deletion of Databases or Endpoints, as well as scaling services to/from zero containers, may be impacted. We'll continue to monitor the situation and provide updates as we have any additional information to share.
Report: "Route53 increased propagation delays"
Last update: Route 53 record propagation appears to have returned to normal.
We've noticed that some Operations are failing due to Route53 record changes not propagating within the 10 minute time limit allowed by our platform. Running Apps and Databases are not impacted, but creation or deletion of Databases or Endpoints, as well as scaling services to/from zero containers, may be impacted. We'll continue to monitor the situation and provide updates as we have any additional information to share.
Report: "Aptible Documentation Site Unavailable"
Last update: This incident has been resolved.
Our online documentation at aptible.com/docs is temporarily unavailable. We are working with our upstream provider to resolve the issue and will update this incident when it is resolved.
Report: "Aptible Documentation Site Unavailable"
Last update: This incident has been resolved.
Our online documentation at aptible.com/docs is temporarily unavailable. We are working with our upstream provider to resolve the issue and will update this incident when it is resolved.
Report: "Delayed Operations in eu-central-1"
Last update: This incident has been resolved.
We are currently experiencing issues with operations being delayed for stacks hosted in eu-central-1. Our Engineering team is currently working to restore normal functionality.
Report: "Delayed Operations"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently experiencing issues with operations being delayed. Our Engineering team is currently investigating.
Report: "Delayed Operations"
Last update: This incident has been resolved.
A fix has been implemented and operations are running smoothly again. We are monitoring.
We are currently experiencing issues with operations being delayed. Our Engineering team is currently investigating.
Report: "App and Database operation failures"
Last update: This incident has been resolved.
We are experiencing intermittent failures in App and Database operations due to issues with an upstream provider. This issue only affects Apps and Databases with endpoints. Retrying the operation may resolve the issue. We are actively monitoring the situation and will provide updates once the problem is fully resolved.
Report: "Operations blocked - Route 53 propagation delays"
Last update: This incident has been resolved.
We are noticing Route 53 record requests succeeding in a normal time frame, and are lifting the operation block at this time. We'll continue to observe running operations to ensure stability.
We've noticed that some Operations are failing due to Route53 record changes not propagating within the 10 minute time limit allowed by our platform. In order to prevent App and Database DNS records from reaching an inconsistent state, we are temporarily blocking Operations. Performance and reachability of existing Apps and Databases are not impacted.
Report: "Database provision errors"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
We've identified an error blocking the creation of new Databases on the platform, and our team is applying a fix. Reachability of your existing databases, and the ability to scale or restart them is not impacted.
Report: "Delayed Operations"
Last update: This incident has been resolved.
We are currently experiencing issues with operations being delayed. Our Engineering team is currently investigating—more updates to follow.
Report: "Long load balancer registration times"
Last update: AWS has indicated that the underlying issue has been resolved, and our monitoring indicates it is safe to run operations again. All inconsistencies impacting customer apps or databases (there were only 4 impacted resources) have been resolved.
We are experiencing longer than usual Route53 change times, and some operations are unable to roll back gracefully. In order to prevent resources from reaching a failed state where the DNS is not properly configured, we are blocking creation of new operations on the platform. We will update soon with additional information.
Report: "Limited Availability Incident in shared-us-west-1"
Last update: On 2024-10-16, between 00:20 and 02:38 UTC, some customer apps and databases in a single shared stack, shared-us-west-1, experienced an availability incident as a result of a problem encountered with planned maintenance. Service has been restored to those affected apps and databases, and this incident is considered resolved at this time.
Report: "Impacted platform operation in us-east-2"
Last update: AWS has resolved the underlying issue.
We are no longer observing error responses for S3, and have re-allowed operations in us-east-2. We will continue to monitor the situation.
AWS confirmed multiple services are impacted in us-east-2. We are blocking operations in that region until availability stabilizes.
We are investigating an S3 outage in the us-east-2 region, which is impacting new operations on resources in that region. All apps and databases are running normally, though if your code relies on S3 directly, or 3rd party services that rely on S3, you may see application-level impact.
Report: "Long load balancer registration times"
Last update: AWS has marked this issue RESOLVED as of 19:19 UTC, and we have not observed any issues in the last hour. The issue has been resolved and all services are operating normally.
The latest update from AWS indicates that operations created around 17:10 through 17:20 UTC were impacted, which matches our internal metrics. AWS has promised another update by 18:00 UTC, and we will continue to monitor the situation until we are satisfied that it is resolved.
We're again seeing degradation and failure to register new load balancer targets in about 10% of running operations.
Load balancer registration appears to be working as expected at this time. We will continue to monitor operations until AWS resolves their service degradation notice.
AWS has acknowledged the impact we are seeing and opened an incident (September 26, 2024 at 16:21:43 UTC):
> We are investigating increased load balancer back-end instance registration times in the us-east-1 Region.
Since 16:05 UTC, Aptible has been observing some recovery; about half of endpoint target registrations are succeeding at this time.
This service impact only applies to resources hosted in the `us-east-1` region. Customers may notice operations reaching timeout, but at this point all operations are rolling back successfully to the previous state.
We are investigating abnormally long registration times for new targets with AWS Load Balancers. This may be causing extended operation times for releases (Deploy, Scale, Restart) for services that have Endpoints.
Report: "Dockerfile based `git-push` deployments issue"
Last update: This incident has been resolved.
After the recent git server maintenance, a follow-on issue was identified that affected `git push` based deployments to existing apps. A fix has been put in place, so we expect further deployments will not be affected. Please contact support if you encounter further issues.
Report: "Git-based Deploy Log Streaming Disruption on Aptible CLI for Dedicated Stacks"
Last update: For dedicated stacks only, git-based deployments (https://www.aptible.com/docs/core-concepts/apps/deploying-apps/image/deploying-with-git/overview) were not streaming logs about the deployment operation activity as they normally do. The deploy operations were running normally in the background but not streaming live logs to the CLI. This incident impacted git-based deploys from the CLI between June 14th, 5:44 AM UTC, and June 14th, 1:45 PM UTC. Our team has applied a fix, which has resolved the issue. Please contact our Support Team (https://contact.aptible.com/) if you have additional questions.
Report: "Temporary Metrics Unavailability in Aptible Dashboard"
Last update: We are notifying our users of an issue where some metrics are not available on the Aptible Dashboard (app.aptible.com) for the period between May 5, 2024, 18:54 UTC and May 6, 2024, 22:50 UTC. We want to assure you that this does not affect the functionality of Aptible Metric Drains (https://www.aptible.com/docs/metric-drains). If you have any concerns or require further assistance, please do not hesitate to reach out to our support team (https://contact.aptible.com/).
Report: "Update on CVE-2024-3094: XZ Utils Vulnerability"
Last update: Aptible is aware of CVE-2024-3094 (https://nvd.nist.gov/vuln/detail/CVE-2024-3094), a critical vulnerability in XZ Utils, specifically affecting versions 5.6.0 and 5.6.1, with a CVSS score of 10, indicating a severe level of risk. This vulnerability results from a supply chain compromise and is present in data compression software widely used across major Linux distributions. The malicious code discovered in the affected versions allows for unauthorized system access, posing a significant security threat. The Aptible platform and services do not utilize the affected software versions and are not impacted. Aptible customers are urged to evaluate dependencies in their Docker Images and other systems, and to patch urgently as needed to mitigate the risk associated with this vulnerability. Given the scope and severity of the CVE, our security team continues to monitor the situation actively. If you have any concerns or questions, please contact the Aptible Support team (https://www.aptible.com/docs/support).
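As a quick reference, one way to check whether an app's image ships an affected XZ version is to query it directly, following the same `aptible ssh` pattern used elsewhere on this page. This is a minimal sketch, not an official Aptible check: $APP_HANDLE is a placeholder for your app's handle, and it assumes the xz binary is present in the image at all.
$ aptible ssh --app $APP_HANDLE xz --version
# Vulnerable if this reports xz/liblzma 5.6.0 or 5.6.1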
Report: "Response to Leaky Vessels: Docker and runc container breakout vulnerabilities"
Last update: We have proactively addressed a recent security vulnerability identified as "Leaky Vessels," a container breakout issue affecting runc versions up to 1.1.11. This vulnerability had the potential to allow unauthorized access to the host OS from containers. Our team has promptly updated our systems, including all instances of runc to the secure version, to ensure the highest level of security for our platform and your services. This update mitigates the risks associated with this vulnerability. The following CVEs have been addressed on our platform:
- CVE-2024-21626: runc process.cwd & leaked fds container breakout (https://snyk.io/blog/cve-2024-21626-runc-process-cwd-container-breakout/)
- CVE-2024-23651: Buildkit Mount Cache Race (https://snyk.io/blog/cve-2024-23651-docker-buildkit-mount-cache-race/)
- CVE-2024-23653: Buildkit GRPC SecurityMode Privilege Check (https://snyk.io/blog/cve-2024-23653-buildkit-grpc-securitymode-privilege-check/)
- CVE-2024-23652: Buildkit Build-time Container Teardown Arbitrary Delete (https://snyk.io/blog/cve-2024-23652-buildkit-build-time-container-teardown-arbitrary-delete/)
We assure you that our swift actions have kept our systems, and consequently your services, secure and unaffected by this vulnerability. We remain committed to maintaining the highest security standards and will continue to monitor and update our systems to safeguard your data and services. For more detailed information about this topic, you can refer to the Snyk blog post: https://snyk.io/blog/leaky-vessels-docker-runc-container-breakout-vulnerabilities/
Report: "Missing Dashboard Metrics for Small Number of Apps and Databases"
Last update: This incident has been resolved.
For a small number of apps and databases deployed, restarted, or scaled since Friday, Jan 19th 16:00 UTC, metrics were missing from the Aptible Dashboard metrics view. There is no other impact; a fix is rolling out for metrics for those apps and databases, and this incident will be resolved once the fix has been completed.
Report: "Operations Blocked for Shared Stack shared-eu-central-1"
Last update: This incident has been resolved.
Aptible operations have been temporarily blocked in shared stack shared-eu-central-1 in order to address a stack-specific error. Our team will provide an updated status once operations are unblocked.
Report: "EC2 Host Failure"
Last update: This incident has been resolved.
We are investigating an EC2 dedicated host failure affecting a small number of dedicated stacks.
Report: "Aptible API Degraded Performance"
Last update: This incident has been resolved.
The Aptible team is aware of intermittent degraded performance in the Aptible API, which led to some users seeing API-related Operation timeouts. Performance has returned to normal levels, and the team continues to monitor to ensure stability.
Report: "Quay.io Registry Issues"
Last update: Quay is reporting that this incident has been resolved.
We have failed over to our secondary registry provider and are monitoring ongoing status.
We have identified an issue with our primary upstream registry provider which is impacting some Aptible Deploy operations. Our team is in the process of failing over to our backup provider and will update this incident when this has been completed.
Report: "CVE-2023-44487 "HTTP/2 Rapid Reset" Response"
Last update: We are aware of the recently disclosed vulnerability CVE-2023-44487, also known as the "HTTP/2 Rapid Reset Attack," which poses a potential risk of Denial of Service (DoS) attacks on HTTP/2-capable web servers. We are actively monitoring the situation and have conducted in-house tests on our HTTPS Endpoints that utilize AWS Application Load Balancers (ALBs). Currently, there is no evidence suggesting Aptible is vulnerable to this particular security concern. AWS has put in place extra measures to mitigate this vulnerability, ensuring that our services stay secure and fully functional. More information here:
- AWS: CVE-2023-44487 - HTTP/2 Rapid Reset Attack: https://aws.amazon.com/security/security-bulletins/AWS-2023-011/
On Endpoint Types at Aptible:
- HTTP(S) Endpoints: these use Application Load Balancers (ALBs) and have mitigations in place to address the vulnerability. Some legacy endpoints created before 2018 use legacy Elastic Load Balancers (ELBs), which do not support HTTP/2 and are not vulnerable.
- TLS / TCP Endpoints: if customers are exposing custom HTTP/2-capable web servers behind these Endpoints, we recommend verifying with your web server vendor to determine if you are affected and, if so, promptly installing the latest patches to mitigate this issue.
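For the TLS / TCP Endpoint case above, a rough first check is whether your server negotiates HTTP/2 at all; a server that never speaks HTTP/2 is not exposed to this vector. This is a hedged sketch using curl, not an Aptible-provided tool: $DOMAIN is a placeholder for your Endpoint's hostname.
$ curl -sI --http2 -o /dev/null -w '%{http_version}\n' https://$DOMAIN
# Prints "2" if HTTP/2 was negotiated; in that case, check with your web server vendor for the Rapid Reset patch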
Report: "Host Provisioning Delays in us-east-1"
Last update: This incident has been resolved.
We are again seeing successful deployment of new hosts in the affected single availability zone in the us-east-1 region. We will continue to monitor for an additional period before resolving the incident.
AWS continues to work on recovering this issue in a single availability zone in the us-east-1 region. Running apps and databases continue not to be impacted by this failure.
AWS is experiencing an issue preventing the timely deployment of new hosts in a single availability zone in the us-east-1 region. As a result, some app and database restart, scale, and deployment operations that result in a new host being provisioned may fail and roll back. Running apps and databases are not impacted by this failure.
Report: "EC2 Host Failure - us-east-1"
Last update: This incident has been resolved.
We are investigating several EC2 dedicated host failures affecting some customers with apps and databases in us-east-1, related to an AWS incident. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).
Report: "Delayed Operations"
Last update: This incident has been resolved.
Our team has mitigated this issue, and newly created operations should now succeed. Customers may see long-delayed operations begin to fail. These failed operations will need to be restarted.
Our team has determined the root cause as an internal dependency causing operations to hang. We're currently beginning steps to remediate this issue—more updates to follow.
We are currently experiencing issues with operations being delayed. Our Engineering team is currently investigating—more updates to follow.
Report: "Metric Drains Interrupted for Some Dedicated Stacks"
Last update:
# Incident Postmortem: Metric Drains Interrupted for Some Dedicated Stacks

## Executive Summary
On June 20, 2023, our platform experienced a service degradation incident for Metric Drains while rolling out a new feature for Metric Drains. This was due to unexpected side effects of a new internal utility used to deploy the feature. Some of our customers experienced interruptions in their metric drains during this incident. All issues were subsequently addressed, and service has been fully restored.

## Detailed Incident Description
Configuration Change Initiation: The rollout of the change relied on a two-step configuration process to update the software for the metric drain emitter and aggregator components within each dedicated stack. This process was initiated using a new utility that had been successfully deployed in the past, but not at the scale required for this rollout.

Utility Timeouts and Delays: During the rollout, the configuration utility started experiencing cascading timeouts as operations queued with increasing delays in executing the configuration changes. While configuration had not yet been uniformly updated for the rollout, some customer stacks were left only partially configured for the updated metric drain software.

Customer Impact: A small number of customers who were deploying or scaling services during this period had their metric drains interrupted due to the aforementioned configuration issues.

Resolution: Our team immediately worked on fixing the configuration issues. By 16:24 EDT, we successfully restored the configuration state for the affected customers, and service resumed its regular state.

Follow-up Audit: On the following morning of June 21, a follow-up audit revealed that two additional customers still needed configuration updates for their metric drains. We immediately addressed these issues.

## Root Cause Analysis
The root cause of this issue was a combination of the increased scale of the rollout and the relative novelty of the utility used for the configuration changes. Although this utility had performed successfully under previous workloads, it did not sufficiently scale to handle the increased demand of this particular rollout.

## Lessons Learned and Preventative Measures
Testing Deployment Tools at Scale: Testing new deployment tools and utilities under maximum practical loads is crucial to ensure they can handle expected full-scope workloads without disruption.

Audit Processes: Though our follow-up audit process effectively identified additional affected customers, we will make such audits more timely to catch any lingering issues sooner.

We sincerely apologize for any inconvenience caused to our customers during this incident. We take this issue seriously and are committed to ensuring that such incidents do not occur in the future.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Related to the 6/20 incident, some customers on Dedicated Stacks are seeing a continued disruption to their Metric Drains. We are implementing a fix for those customers/stacks and will follow up via support tickets to affected customers.
Report: "Metric Drains Interrupted for Some Dedicated Stacks"
Last update: This incident has been resolved.
During a rollout of an updated version of Metric Drains in the past hour, some customers on dedicated stacks experienced an interruption in service for Metric Drains. The issue has been identified and is being addressed, and we expect it to be resolved for all affected customers within the next 10 minutes.
Report: "AWS Availability (3rd party)"
Last update: This incident has been resolved.
Backup copying has been re-enabled, and we are continuing to monitor for any other impact.
We are aware of ongoing issues with AWS related to a failure of the Lambda service. Our assessment is that there is no direct impact to the availability or performance of any Aptible hosted services. One impact we are monitoring is that copying Database backups to a second region is not working at this time. When the incident resolves, the copies will be made automatically. If you are utilizing any AWS services directly in your own AWS account, you may monitor the status of those services at https://health.aws.amazon.com/health/status
Report: "Aptible transactional emails are delayed"
Last update: This incident has been resolved.
Our email provider has reported delays in processing, resulting in delays in sending queued transactional emails sent by our platform. In particular, the following workflows might be delayed:
- Email verifications
- Password resets
- Role invitations
This does not affect any customer apps (unless you happen to be using the same email provider, of course): only transactional emails sent by Aptible itself are affected.
Report: "Long queueing times for Aptible operations"
Last update: This incident has been resolved.
At this time, all issues related to long queueing times should be resolved. We are leaving this incident in "monitoring" status until the underlying AWS issue is either resolved or acknowledged by AWS with an explicit resolution ETA.
Operations for resources in us-east-1 are queueing for up to 5 minutes before beginning to execute, as a result of an AWS issue causing certain operations (especially `CopySnapshot` operations) to be extremely delayed. (These operations ordinarily begin to execute within 1 or 2 seconds.) Currently, only resources in us-east-1 are affected. We are making changes to disable some of the affected AWS operations until the underlying AWS issue is resolved. In the meantime, operations may take longer than usual to start, but will eventually begin to execute if no action is taken (and if they are not cancelled).
Report: "Operations paused in ap-southeast-1 region"
Last update: Calls to the EC2 API are responding normally, and Operations in ap-southeast-1 have been resumed.
The issue has been identified and a fix is being implemented.
Due to the unavailability of the AWS EC2 API in the ap-southeast-1 region, Aptible Operations have been blocked in that region.
Report: "AWS Outage"
Last update: AWS has identified the root cause of the Endpoint unavailability:
> Between 2:26 PM and 3:04 PM PDT (9:26 PM ~ 10:04 PM UTC) we experienced increased packet loss for traffic destined to public endpoints in the US-EAST-1 Region, which affected Internet and public Direct Connect connectivity for endpoints in the US-EAST-1 Region.
This is, unfortunately, essentially the same impact we've seen in two previous incidents, although AWS's description of the cause is slightly different:
October 15th, 2022 (https://status.aptible.com/incidents/grf6gdrrszf9):
> Between 12:20 AM and 11:28 AM PDT, we experienced intermittent failures in Route53 Health Checks impacting Target Health evaluation in US-EAST-1. The issue has been resolved and the service is operating normally.
September 27th, 2021 (only a couple of Endpoints were impacted, so no incident was created):
> On September 27, 2021, between 8:45 AM and 2:09 PM PDT, Route53 experienced increased change propagation times for Health Check edits where unexpected failover to their secondary application load balancer (ALB) occurred despite their primary ALB targets being healthy. The issue has been resolved and the service is operating normally.
While AWS describes these incidents as "increased change propagation times", "intermittent failures", and "increased packet loss", and they apparently do not qualify as incidents to be posted to https://status.aws.amazon.com, the observed impact to our customers is very clear: the impacted Endpoints are totally unreachable for a period. As such, we will permanently implement the "temporary" change we made on October 15th: we will be disabling the Route53 health checks (and the associated custom error page) for all Endpoints, as this has been the root cause of these availability incidents. As we indicated to customers during the Oct 15th and Nov 3rd incidents, you may restart any App in order to immediately disable the Route53 health check. Any App which has been deployed, restarted, or scaled since October 15th will already have it disabled, and we will make another announcement when we intend to disable it globally on all Apps for which it remains enabled.
We no longer see any impact, and will continue investigating for an RCA.
Based on random sampling and reports of affected Endpoints, we are no longer seeing any impact. We will continue to monitor the situation.
We're seeing many Endpoints recover without action being taken, so we're looking into ways to identify Endpoints that remain impacted so that we can efficiently fix them. Restarting known impacted Apps remains the quickest solution that we know of.
We've observed that running `aptible restart --app $handle` can resolve the underlying issue with the ELB, and recommend restarting any of your impacted Apps at this time.
We are currently investigating a large number of unreachable ELBs in AWS's us-east-1 region, and are waiting for acknowledgement from AWS while trying to narrow the scope of the failures in order to provide failover/workarounds if possible.
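If several of your Apps were impacted, the `aptible restart` workaround described above can be scripted. A minimal sketch, assuming you keep the affected handles in a plain-text file, one per line (the file name impacted-apps.txt is illustrative):
# Restart each impacted app listed in impacted-apps.txt
while read -r handle; do
  aptible restart --app "$handle"
done < impacted-apps.txt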
Report: "High vulnerabilities in OpenSSL (CVE-2022-3602 & CVE-2022-3786)"
Last update: This incident has been resolved.
OpenSSL's pre-announcements described CVE-2022-3602 as CRITICAL, but it has since been downgraded to HIGH [0]. Aptible remains unaffected by this vulnerability. We still recommend every Aptible customer check the OpenSSL versions used in their apps to confirm they're unaffected. Please follow the aforementioned steps to check the version and update OpenSSL accordingly.
Additional Context & Guidance from OpenSSL: https://www.openssl.org/blog/blog/2022/11/01/email-address-overflows/
[0] https://www.openssl.org/news/secadv/20221101.txt
OpenSSL has announced a critical vulnerability [0] for which a patch will be released tomorrow, November 1, 2022, between 13:00 and 17:00 UTC. The nature of the vulnerability has not been disclosed, but based on how it's being handled, Aptible expects it could be a serious vulnerability affecting data confidentiality for those running affected OpenSSL versions (>= 3.0.0, < 3.0.7). Aptible has reviewed all infrastructure components that we manage and has confirmed that all are unaffected by this vulnerability. These components include:
- Our Managed TLS endpoints
- The TLS endpoints for our REST API services (Auth and Deploy APIs)
- All versions of our managed databases
- Our log forwarding infrastructure
- Our metrics collection infrastructure
- Our SSH and Git server infrastructure
Still, every Aptible customer should check the OpenSSL versions used in their apps to confirm they're unaffected. To do so, run:
$ aptible ssh --app $APP_HANDLE openssl version
If the version is >= 3.0.0, you should plan to upgrade your apps' Docker image(s) tomorrow as soon as OpenSSL 3.0.7 is released. We will continue to update this incident page as more information is revealed about the vulnerability. If the vulnerability is only exploitable for *server-side* OpenSSL functionality, the impact to Aptible customers would be significantly reduced. Only those customers who use plain TCP endpoints [1] with their own OpenSSL for TLS termination would be affected in this scenario.
[0] https://mta.openssl.org/pipermail/openssl-announce/2022-October/000238.html
[1] https://deploy-docs.aptible.com/docs/tcp-endpoints
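To run the version check above across several apps at once, it can be wrapped in a small loop. A minimal sketch, assuming a plain-text list of app handles (the file name app-handles.txt is illustrative); it prints each handle next to its OpenSSL version so any 3.0.x release older than 3.0.7 stands out:
# Print the OpenSSL version reported by each app in app-handles.txt
while read -r handle; do
  printf '%s: ' "$handle"
  aptible ssh --app "$handle" openssl version
done < app-handles.txt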
Report: "Continuation of AWS ELB incident https://status.aptible.com/incidents/b3xvn9tmzfjz"
Last update: With an acknowledgement from AWS (copied below) we are marking this incident as resolved.
> Between 12:20 AM and 11:28 AM PDT, [on October 15th] we experienced intermittent failures in Route53 Health Checks impacting Target Health evaluation in US-EAST-1. The issue has been resolved and the service is operating normally.
The issue should be resolved for all endpoints that have been affected. We've also updated the behavior of the platform so that if you see this issue, running `aptible restart` on the affected app will update the configuration of all the app's endpoints to ignore health checks, and should thereby resolve the issue.
All impacted endpoints that we have identified have been fixed by disabling their health checks. If you have an Endpoint that seems to be impacted, please email support@aptible.com and include the domain name of the Endpoint in question.
This is a continuation of https://status.aptible.com/incidents/b3xvn9tmzfjz It has been noticed that we can only detect the issue when HTTP requests are made to the Endpoint, and as the incident started in the middle of a weekend night, we were not able to identify Endpoints which were not in use. We're reviewing and remediating additional Endpoints now, and will make an update to the platform to remove health checks entirely if AWS cannot fix the Route53 issue.
Report: "AWS ALB issue"
Last update: This incident has been resolved.
We are disabling health checks on the impacted Endpoints in order to bring those applications back online.
We have identified an AWS issue with Route 53 health checks where some records are failing over to Brickwall, our error server, despite the containers passing health checks. As far as we can tell, this is affecting a small number of Endpoints, unfortunately including dashboard.aptible.com, as well.
Report: "EC2 Host Failure"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating an EC2 dedicated host failure affecting a small number of apps and databases. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).
Report: "AWS us-east-2 power failure"
Last update: Everything appears to be operational at this time. Please open a ticket with support if you continue to experience issues in us-east-2.
It looks like AWS has recovered most services, and we are continuing to monitor operations in us-east-2 to ensure everything is working properly.
AWS experienced a power failure in a single availability zone in the us-east-2 Region. This affected networking in that AZ, and has also impacted load balancer registration times. We are working to ensure any resources we can identify as impacted are moved to another AZ, but the impact to load balancer registration, and an issue creating/updating/deleting Route 53 records, are both impacting our ability to mitigate availability issues.
Report: "EC2 Host Failure"
Last update: This incident has been resolved.
We are investigating an EC2 dedicated host failure affecting a small number of apps and databases. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).
Report: "EC2 Host Failure"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are investigating an EC2 dedicated host failure affecting a small number of apps and databases. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).
Report: "Host failure in ca-central-1"
Last update: This incident has been resolved.
We are investigating an EC2 dedicated host failure affecting a small number of apps and databases in the ca-central-1 region. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).
Report: "EC2 Host Failure"
Last update: This incident has been resolved.
We are investigating an EC2 dedicated host failure affecting a small number of apps and databases. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).
Report: "Host provisioning failures"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating an issue that is blocking new hosts from being provisioned. As a result, some app and database restart, scale, and deployment operations that result in a new host being provisioned may fail. Running apps and databases are not impacted by this failure.
Report: "CVE-2022-22965 "Spring4Shell" Response"
Last update: This incident has been resolved.
Recently a series of vulnerabilities in the popular Java framework Spring were found, notably CVE-2022-22965 [0] (dubbed "Spring4Shell") and CVE-2022-22963 [1]. Aptible does not use the Spring framework in any of our internal applications, and has verified that none of our offered services that use Java are vulnerable either. We will continue monitoring the situation.
[0] https://tanzu.vmware.com/security/cve-2022-22965
[1] https://tanzu.vmware.com/security/cve-2022-22963
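Customers who want to double-check their own images can look for the affected Spring jars directly, using the same `aptible ssh` pattern as in the OpenSSL advisory above. This is a hedged sketch, not an Aptible-provided audit: $APP_HANDLE is a placeholder, and exact shell quoting may vary with your image's shell.
$ aptible ssh --app $APP_HANDLE find / -name 'spring-beans*.jar' 2>/dev/null
# For CVE-2022-22965, any matches on Spring Framework versions below 5.2.20 / 5.3.18 should be upgraded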
Report: "Operations blocked - Route 53 propagation delays"
Last update: This incident has been resolved.
Operations have been restored. Route 53 response times are still slow, but within acceptable limits.
We've noticed that some Operations are failing due to Route53 record changes not propagating within the 10 minute time limit allowed by our platform. In order to prevent App and Database DNS records from reaching an inconsistent state, we are temporarily blocking Operations.
Report: "EC2 Host Failure"
Last update: This incident has been resolved.
We are investigating an EC2 dedicated host failure affecting a small number of apps and databases. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).
Report: "Route53 increased propagation delays"
Last update: This incident has been resolved.
At this time, we have only observed an impact when creating or destroying DNS records; updates to existing records have not been impacted, so we are resuming normal operations. Our team will continue to monitor closely and work to resolve the initially failed operations.
We've noticed that some Operations are failing due to Route53 record changes not propagating within the 10 minute time limit allowed by our platform. In order to prevent App and Database DNS records from reaching an inconsistent state, we are temporarily blocking Operations that require updating DNS records.