Aptible

Is Aptible Down Right Now? Check whether an outage is currently ongoing.

Aptible is currently Operational

Last checked from Aptible's official status page

Historical record of incidents for Aptible

Report: "Increased error rate"

Last update
investigating

We are investigating an increased error rate in our API, which may be causing failed operations.

Report: "Aptible Documentation Site Unavailable"

Last update
resolved

This incident has been resolved.

investigating

Our online documentation at aptible.com/docs is temporarily unavailable. We are working with our upstream provider to resolve the issue and will update this incident when it is resolved.

Report: "Aptible Documentation Site Unavailable"

Last update
Resolved

This incident has been resolved.

Investigating

Our online documentation at aptible.com/docs is temporarily unavailable. We are working with our upstream provider to resolve the issue and will update this incident when it is resolved.

Report: "Route53 increased propagation delays"

Last update
resolved

Route 53 record propagation appears to have returned to normal.

monitoring

We've noticed that some Operations are failing due to Route53 record changes not propagating within the 10 minute time limit allowed by our platform. Running Apps and Databases are not impacted, but creation or deletion of Databases or Endpoints, as well as scaling services to/from zero containers, may be impacted. We'll continue to monitor the situation and provide updates as we have any additional information to share.
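
For reference, a plain DNS lookup shows whether a record change is visible from your resolver yet; a minimal sketch, assuming a hypothetical Aptible default Endpoint hostname:

$ dig +short app-12345.on-aptible.com
# Returns nothing while the record is not yet resolvable from your resolver;
# returns the record's target(s) once the change has propagated.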

Report: "Route53 increased propagation delays"

Last update
Resolved

Route 53 record propagation appears to have returned to normal.

Monitoring

We've noticed that some Operations are failing due to Route53 record changes not propagating within the 10 minute time limit allowed by our platform. Running App and Databases are not impacted, but creation or deletion of Databases or Endpoints, as well as scaling services to/from zero containers may be impacted.We'll continue to monitor the situation and provide updates as we have any additional information to shre.

Report: "Aptible Documentation Site Unavailable"

Last update
resolved

This incident has been resolved.

investigating

Our online documentation at aptible.com/docs is temporarily unavailable. We are working with our upstream provider to resolve the issue and will update this incident when it is resolved.

Report: "Aptible Documentation Site Unavailable"

Last update
resolved

This incident has been resolved.

investigating

Our online documentation at aptible.com/docs is temporarily unavailable. We are working with our upstream provider to resolve the issue and will update this incident when it is resolved.

Report: "Delayed Operations in eu-central-1"

Last update
resolved

This incident has been resolved.

identified

We are currently experiencing issues with operations being delayed for stacks hosted in eu-central-1. Our Engineering team is currently working to restore normal functionality.

Report: "Delayed Operations"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently experiencing issues with operations being delayed. Our Engineering team is currently investigating.

Report: "Delayed Operations"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and operations are running smoothly again. We are monitoring.

investigating

We are currently experiencing issues with operations being delayed. Our Engineering team is currently investigating.

Report: "App and Database operation failures"

Last update
resolved

This incident has been resolved.

monitoring

We are experiencing intermittent failures in App and Database operations due to issues with an upstream provider. This issue only affects Apps and Databases with endpoints. Retrying the operation may resolve the issue. We are actively monitoring the situation and will provide updates once the problem is fully resolved.

Report: "Operations blocked - Route 53 propagation delays"

Last update
resolved

This incident has been resolved.

monitoring

We are noticing Route 53 record requests succeeding in a normal time frame, and are lifting the operation block at this time. We'll continue to observe running operations to ensure stability.

identified

We've noticed that some Operations are failing due to Route53 record changes not propagating within the 10 minute time limit allowed by our platform. In order to prevent Apps and Databases DNS records from reaching an inconsistent state, we are temporarily blocking Operations. Performance and reachability of existing Apps and Databases are not impacted.

Report: "Database provision errors"

Last update
resolved

This incident has been resolved.

identified

We are continuing to work on a fix for this issue.

identified

We've identified an error blocking the creation of new Databases on the platform, and our team is applying a fix. Reachability of your existing databases, and the ability to scale or restart them is not impacted.

Report: "Delayed Operations"

Last update
resolved

This incident has been resolved.

investigating

We are currently experiencing issues with operations being delayed. Our Engineering team is currently investigating—more updates to follow.

Report: "Long load balancer registration times"

Last update
resolved

AWS has indicated that the underlying issue has been resolved, and our monitoring indicates it is safe to run operations again. All inconsistencies impacting customer apps or databases (there were only 4 impacted resources) have been resolved.

identified

We are experiencing longer than usual Route53 change times, and some operations are unable to roll back gracefully. In order to prevent resources from reaching a failed state where the DNS is not properly configured, we are blocking creation of new operations on the platform. We will update soon with additional information.

Report: "Limited Availability Incident in shared-us-west-1"

Last update
resolved

On 2024-10-16, between 00:20 and 02:38 UTC, some customer apps and databases in a single shared stack, shared-us-west-1, experienced an availability incident as a result of a problem encountered with planned maintenance. Service has been restored to those affected apps and databases, and this incident is considered resolved at this time.

Report: "Impacted platform operation in us-east-2"

Last update
resolved

AWS has resolved the underlying issue.

monitoring

We are no longer observing error responses for S3, and have re-allowed operations in us-east-2. We will continue to monitor the situation.

identified

AWS confirmed multiple services are impacted in us-east-2. We are blocking operations in that region until availability stabilizes.

investigating

We are investigating an S3 outage in the us-east-2 region, which is impacting new operations on resources in that region. All apps and databases are running normally, though if your code relies on S3 directly, or 3rd party services that rely on S3, you may see application-level impact.

Report: "Long load balancer registration times"

Last update
resolved

AWS has marked this issue RESOLVED as of 19:19 UTC, and we have not observed any issues in the last hour. The issue has been resolved and all services are operating normally.

identified

The latest update from AWS indicates that operations created around 17:10 through 17:20 UTC were impacted, which matches our internal metrics. AWS has promised another update by 18:00 UTC, and we will continue to monitor the situation until we are satisfied that it is resolved.

identified

We're again seeing degradation and failure to register new load balancer targets in about 10% of running operations.

monitoring

Load balancer registration appears to be working as expected at this time. We will continue to monitor operations until AWS resolves their service degradation notice.

identified

AWS has acknowledged the impact we are seeing and opened an incident:

> We are investigating increased load balancer back-end instance registration times in the us-east-1 Region. (September 26, 2024 at 16:21:43 UTC)

Since 16:05 UTC, Aptible is observing some recovery; about half of endpoint target registrations are succeeding at this time.

identified

This service impact only applies to resources hosted in the `us-east-1` region. Customers may notice operations reaching timeout, but at this point all operations are rolling back successfully to the previous state.

investigating

We are investigating abnormally long registration times for new targets with AWS Load Balancers. This may be causing extended operation times for releases (Deploy, Scale, Restart) for services that have Endpoints.

Report: "Dockerfile based `git-push` deployments issue"

Last update
resolved

This incident has been resolved.

identified

After the recent git server maintenance, a fallout issue was identified that affected `git push` based deployments to existing apps. A fix has been put in place, so we expect further deployments will not be affected. Please contact support if you encounter further issues.

Report: "Git-based Deploy Log Streaming Disruption on Aptible CLI for Dedicated Stacks"

Last update
resolved

For dedicated stacks only, git-based deployments (https://www.aptible.com/docs/core-concepts/apps/deploying-apps/image/deploying-with-git/overview) were not streaming logs about the deployment operation activity as they normally do. The deploy operations were running normally in the background but not streaming live logs to the CLI. This incident impacted git-based deploys from the CLI between June 14th, 5:44 AM UTC, and June 14th, 1:45 PM UTC. Our team has applied a fix, which has resolved the issue. Please contact our Support Team (https://contact.aptible.com/) if you have additional questions.

Report: "Temporary Metrics Unavailability in Aptible Dashboard"

Last update
resolved

We are notifying our users of an issue where some metrics are not available on the Aptible Dashboard (app.aptible.com) for the period between May 5, 2024, 18:54 UTC and May 6, 2024, 22:50 UTC. We want to assure you that this does not affect the functionality of Aptible Metric Drains (https://www.aptible.com/docs/metric-drains). If you have any concerns or require further assistance, please do not hesitate to reach out to our support team (https://contact.aptible.com/).

Report: "Update on CVE-2024-3094: XZ Utils Vulnerability"

Last update
resolved

Aptible is aware of CVE-2024-3094 (https://nvd.nist.gov/vuln/detail/CVE-2024-3094), a critical vulnerability in XZ Utils, specifically affecting versions 5.6.0 and 5.6.1, with a CVSS score of 10, indicating a severe level of risk. This vulnerability results from a supply chain compromise and is present in data compression software widely used across major Linux distributions. The malicious code discovered in the affected versions allows for unauthorized system access, posing a significant security threat. The Aptible platform and services do not utilize the affected software versions and are not impacted. Aptible customers are urged to evaluate dependencies in their Docker Images and other systems and patch as needed urgently to mitigate the risk associated with this vulnerability. Given the scope and severity of the CVE, our security team continues to monitor the situation actively. If you have any concerns or questions, please contact the Aptible Support team (https://www.aptible.com/docs/support).
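
For a quick check of the XZ Utils version inside a running App, a lookup along these lines can help; a minimal sketch, assuming a hypothetical app handle of my-app and that xz is installed in the image:

$ aptible ssh --app my-app xz --version
# 5.6.0 and 5.6.1 are the compromised releases named in CVE-2024-3094;
# other versions are not affected by this particular advisory.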

Report: "Response to Leaky Vessels: Docker and runc container breakout vulnerabilities"

Last update
resolved

We have proactively addressed a recent security vulnerability identified as "Leaky Vessels," a container breakout issue affecting runc versions up to 1.1.11. This vulnerability had the potential to allow unauthorized access to the host OS from containers. Our team has promptly updated our systems, including all instances of runc to the secure version, to ensure the highest level of security for our platform and your services. This update mitigates the risks associated with this vulnerability.

The following CVEs have been addressed on our platform:

- CVE-2024-21626: runc process.cwd & leaked fds container breakout (https://snyk.io/blog/cve-2024-21626-runc-process-cwd-container-breakout/)
- CVE-2024-23651: Buildkit Mount Cache Race (https://snyk.io/blog/cve-2024-23651-docker-buildkit-mount-cache-race/)
- CVE-2024-23653: Buildkit GRPC SecurityMode Privilege Check (https://snyk.io/blog/cve-2024-23653-buildkit-grpc-securitymode-privilege-check/)
- CVE-2024-23652: Buildkit Build-time Container Teardown Arbitrary Delete (https://snyk.io/blog/cve-2024-23652-buildkit-build-time-container-teardown-arbitrary-delete/)

We assure you that our swift actions have kept our systems, and consequently your services, secure and unaffected by this vulnerability. We remain committed to maintaining the highest security standards and will continue to monitor and update our systems to safeguard your data and services. For more detailed information about this topic, you can refer to the Snyk blog post: https://snyk.io/blog/leaky-vessels-docker-runc-container-breakout-vulnerabilities/

Report: "Missing Dashboard Metrics for Small Number of Apps and Databases"

Last update
resolved

This incident has been resolved.

monitoring

For a small number of apps and databases deployed, restarted, or scaled since Friday, Jan 19th 16:00 UTC, metrics were missing from the Aptible Dashboard metrics view. There is no other impact; a fix is rolling out for metrics for those apps and databases, and this incident will be resolved once the fix has been completed.

Report: "Operations Blocked for Shared Stack shared-eu-central-1"

Last update
resolved

This incident has been resolved.

identified

Aptible operations have been temporarily blocked in shared stack shared-eu-central-1 in order to address a stack-specific error. Our team will provide an updated status once operations are unblocked.

Report: "EC2 Host Failure"

Last update
resolved

This incident has been resolved.

investigating

We are investigating an EC2 dedicated host failure affecting a small number of dedicated stacks.

Report: "Aptible API Degraded Performance"

Last update
resolved

This incident has been resolved.

monitoring

The Aptible team is aware of intermittent degraded performance in the Aptible API, which led to some users seeing API-related Operation timeouts. Performance has returned to normal levels, and the team continues to monitor to ensure stability.

Report: "Quay.io Registry Issues"

Last update
resolved

Quay is reporting that this incident has been resolved.

monitoring

We have failed over to our secondary registry provider and are monitoring ongoing status.

identified

We have identified an issue with our primary upstream registry provider which is impacting some Aptible Deploy operations. Our team is in the process of failing over to our backup provider and will update this incident when this has been completed.

Report: "CVE-2023-44487 "HTTP/2 Rapid Reset" Response"

Last update
resolved

We are aware of the recently disclosed vulnerability CVE-2023-44487, also known as the "HTTP/2 Rapid Reset Attack," which poses a potential risk of Denial of Service (DoS) attacks on HTTP/2-capable web servers. We are actively monitoring the situation and have conducted in-house tests on our HTTPS Endpoints that utilize AWS Application Load Balancers (ALBs). Currently, there is no evidence suggesting Aptible is vulnerable to this particular security concern. AWS has put in place extra measures to mitigate this vulnerability, ensuring that our services stay secure and fully functional.

More information here:

- AWS: CVE-2023-44487 - HTTP/2 Rapid Reset Attack: https://aws.amazon.com/security/security-bulletins/AWS-2023-011/

On Endpoint Types at Aptible:

- HTTP(S) Endpoints: these use Application Load Balancers (ALBs) and have mitigations in place to address the vulnerability. Some legacy endpoints created before 2018 use legacy Elastic Load Balancers (ELBs), which do not support HTTP/2 and are not vulnerable.
- TLS / TCP Endpoints: if customers are exposing custom HTTP/2-capable web servers behind these Endpoints, we recommend verifying with your web server vendor to determine if you are affected and, if so, promptly installing the latest patches to mitigate this issue.
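
If you run a custom HTTP/2-capable server behind a TLS or TCP Endpoint, one way to confirm whether HTTP/2 is actually negotiated is a curl probe; a minimal sketch, assuming a curl build with HTTP/2 support and a placeholder hostname:

$ curl -sI --http2 -o /dev/null -w '%{http_version}\n' https://app.example.com/
# Prints "2" when HTTP/2 was negotiated, "1.1" otherwise.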

Report: "Host Provisioning Delays in us-east-1"

Last update
resolved

This incident has been resolved.

monitoring

We are again seeing successful deployment of new hosts in the affected single availability zone in the us-east-1 region. We will continue to monitor for an additional period before resolving the incident.

identified

AWS continues to work on recovering from this issue in a single availability zone in the us-east-1 region. Running apps and databases remain unaffected by this failure.

investigating

AWS is experiencing an issue preventing the timely deployment of new hosts in a single availability zone in the us-east-1 region. As a result, some app and database restart, scale, and deployment operations that result in a new host being provisioned may fail and roll back. Running apps and databases are not impacted by this failure.

Report: "EC2 Host Failure - us-east-1"

Last update
resolved

This incident has been resolved.

investigating

We are investigating several EC2 dedicated host failures affecting some customers with apps and databases in us-east-1, related to an AWS incident. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).
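
As a reference for the automatic failover behavior described above, scaling a service to two or more containers distributes them across availability zones; a hedged sketch with the Aptible CLI, assuming a hypothetical service named web and app handle my-app (flags may vary by CLI version):

$ aptible apps:scale web --container-count 2 --app my-app
# With 2+ containers, a single host failure affects only one container while the
# others continue serving from a different availability zone.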

Report: "Delayed Operations"

Last update
resolved

This incident has been resolved.

monitoring

Our team has mitigated this issue, and newly created operations should now succeed. Customers may see long-delayed operations begin to fail. These failed operations will need to be restarted.

identified

Our team has determined the root cause as an internal dependency causing operations to hang. We're currently beginning steps to remediate this issue—more updates to follow.

investigating

We are currently experiencing issues with operations being delayed. Our Engineering team is currently investigating—more updates to follow.

Report: "Metric Drains Interrupted for Some Dedicated Stacks"

Last update
postmortem

# Incident Postmortem: Metric Drains Interrupted for Some Dedicated Stacks

## Executive Summary

On June 20, 2023, our platform experienced a service degradation incident for Metric Drains while rolling out a new feature for Metric Drains. This was due to unexpected side effects of a new internal utility used to deploy the feature. Some of our customers experienced interruptions in their metric drains during this incident. All issues were subsequently addressed, and service has been fully restored.

## Detailed Incident Description

Configuration Change Initiation: The rollout of the change relied on a two-step configuration process to update the software for the metric drain emitter and aggregator components within each dedicated stack. This process was initiated using a new utility that had been successfully deployed in the past but not at the scale required for this rollout.

Utility Timeouts and Delays: During the rollout, the configuration utility started experiencing cascading timeouts as operations queued with increasing delays in executing the configuration changes. While configuration was not yet uniformly updated across the rollout, some customer stacks were left only partially configured for the updated metric drain software.

Customer Impact: A small number of customers who were deploying or scaling services during this period had their metric drains interrupted due to the aforementioned configuration issues.

Resolution: Our team immediately worked on fixing the configuration issues. By 16:24 EDT, we successfully restored the configuration state for the affected customers, and the service was resumed to its regular state.

Follow-up Audit: On the following morning of June 21, a follow-up audit revealed that two additional customers still needed configuration updates for their metric drains. We immediately addressed these issues.

## Root Cause Analysis

The root cause of this issue was a combination of the increased scale of the rollout and the relative novelty of the utility used for the configuration changes. Although this utility had performed successfully under previous workloads, it did not sufficiently scale to handle the increased demand of this particular rollout.

## Lessons Learned and Preventative Measures

Testing Deployment Tools at Scale: Testing new deployment tools and utilities under maximum practical loads is crucial to ensure they can handle expected full-scope workloads without disruption.

Audit Processes: Though our follow-up audit process effectively identified additional affected customers, we will make such audits more timely to catch any lingering issues sooner.

We sincerely apologize for any inconvenience caused to our customers during this incident. We take this issue seriously and are committed to ensuring that such incidents do not occur in the future.

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Related to the 6/20 incident, some customers on Dedicated Stacks are seeing a continued disruption to their Metric Drains. We are implementing a fix for those customers/stacks and will follow up via support tickets to affected customers.

Report: "Metric Drains Interrupted for Some Dedicated Stacks"

Last update
resolved

This incident has been resolved.

identified

During a rollout of an updated version of Metric Drains in the past hour, some customers on dedicated stacks experienced an interruption in service for Metric Drains. The issue has been identified and is being addressed, with the issue expected to be resolved for all affected customers within the following 10 minutes.

Report: "AWS Availability (3rd party)"

Last update
resolved

This incident has been resolved.

monitoring

Backup copying has been re-enabled, and we are continuing to monitor for any other impact.

monitoring

We are aware of ongoing issues with AWS related to a failure of the Lambda service. Our assessment is that there is no direct impact to the availability or performance of any Aptible hosted services. One impact we are monitoring is that copying Database backups to a second region is not working at this time. When the incident resolves, the copies will be made automatically. If you are utilizing any AWS services directly in your own AWS account, you may monitor the status of those services at https://health.aws.amazon.com/health/status

Report: "Aptible transactional emails are delayed"

Last update
resolved

This incident has been resolved.

identified

Our email provider has reported delays in processing, resulting in delays in sending queued transactional emails sent by our platform. In particular, this means the following workflows might be delayed:

- Email verifications
- Password resets
- Role invitations

This does not affect any customer apps (unless you happen to be using the same email provider, of course): only transactional emails sent by Aptible itself are affected.

Report: "Long queueing times for Aptible operations"

Last update
resolved

This incident has been resolved.

monitoring

At this time, all issues related to long queueing times should be resolved. We are leaving this incident in "monitoring" status until the underlying AWS issue is either resolved or acknowledged by AWS with an explicit resolution ETA.

monitoring

Operations for resources in us-east-1 are queueing for up to 5 minutes before beginning to execute, as a result of an AWS issue causing certain operations (especially `CopySnapshot` operations) to be extremely delayed. (These operations ordinarily begin to execute within 1 or 2 seconds.) Currently, only resources in us-east-1 are affected. We are making changes to disable some of the affected AWS operations until the underlying AWS issue is resolved. In the meantime, operations may take longer than usual to start, but will eventually begin to execute if no action is taken (and if they are not cancelled).

Report: "Operations paused in ap-southeast-1 region"

Last update
resolved

Calls to the EC2 API are responding normally, and Operations in ap-southeast-1 have been resumed.

identified

The issue has been identified and a fix is being implemented.

investigating

Due to the unavailability of the AWS EC2 API in the ap-southeast-1 region, Aptible Operations have been blocked in that region.

Report: "AWS Outage"

Last update
postmortem

AWS has identified the root cause of the Endpoint unavailability:

> Between 2:26 PM and 3:04 PM PDT (9:26 PM ~ 10:04 PM UTC) we experienced increased packet loss for traffic destined to public endpoints in the US-EAST-1 Region, which affected Internet and public Direct Connect connectivity for endpoints in the US-EAST-1 Region.

This is, unfortunately, essentially the same impact we've seen in two previous incidents, although AWS's description of the cause is slightly different:

October 15th, 2022 (https://status.aptible.com/incidents/grf6gdrrszf9):

> Between 12:20 AM and 11:28 AM PDT, we experienced intermittent failures in Route53 Health Checks impacting Target Health evaluation in US-EAST-1. The issue has been resolved and the service is operating normally.

September 27th, 2021 (only a couple of Endpoints were impacted, so no incident was created):

> On September 27, 2021, between 8:45 AM and 2:09 PM PDT, Route53 experienced increased change propagation times for Health Check edits where unexpected failover to their secondary application load balancer (ALB) occurred despite their primary ALB targets being healthy. The issue has been resolved and the service is operating normally.

While AWS describes these incidents as "increased change propagation times", "intermittent failures", and "increased packet loss", and they apparently do not qualify as incidents to be posted to https://status.aws.amazon.com, the observed impact to our customers is very clear: the impacted Endpoints are totally unreachable for a period. As such, we will permanently implement the "temporary" change we made on October 15th: we will be disabling the Route53 health checks (and the associated custom error page) for all Endpoints, as this has been the root cause of these availability incidents.

As we indicated to customers during the Oct 15th and Nov 3rd incidents, you may restart any App in order to immediately disable the Route53 health check. Any App which has been deployed, restarted, or scaled since October 15th will already have it disabled, and we will make another announcement when we intend to disable it globally on all Apps for which it remains enabled.

resolved

We no longer see any impact, and will continue investigating for an RCA.

monitoring

Based on a random sampling and on reports of affected Endpoints, we are no longer seeing any impact. We will continue to monitor the situation.

identified

We're seeing many Endpoints recover without action being taken, so we're looking into ways to identify Endpoints that remain impacted so that we can efficiently fix them. Restarting known impacted Apps remains the quickest solution that we know of.

identified

We've observed that running `aptible restart --app $handle` can resolve the underlying issue with the ELB, and recommend restarting any of your impacted Apps at this time.
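
For example, restarting an impacted App from the CLI looks like the following; my-app is a placeholder for your own App handle:

$ aptible restart --app my-app
# Per the update above, restarting the App can resolve the underlying ELB issue
# for its Endpoints.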

investigating

We are currently investigating a large number of unreachable ELBs in AWS's us-east-1 region, and are waiting for acknowledgement from AWS while trying to narrow the scope of the failures in order to provide failover/workarounds if possible.

Report: "High vulnerabilities in OpenSSL (CVE-2022-3602 & CVE-2022-3786)"

Last update
resolved

This incident has been resolved.

monitoring

OpenSSL's pre-announcement of CVE-2022-3602 described this issue as CRITICAL, but it has since been downgraded to HIGH [0]. Aptible remains unaffected by this vulnerability. We still recommend every Aptible customer check the OpenSSL versions used in their apps to confirm they're unaffected. Please follow the aforementioned steps to check the version and update OpenSSL accordingly.

Additional context and guidance from OpenSSL: https://www.openssl.org/blog/blog/2022/11/01/email-address-overflows/

[0] https://www.openssl.org/news/secadv/20221101.txt

monitoring

OpenSSL has announced a critical vulnerability [0] for which a patch will be released tomorrow, November 1, 2022, between 13:00 and 17:00 UTC. The nature of the vulnerability has not been disclosed, but based on how it's being handled, Aptible expects it could be a serious vulnerability affecting data confidentiality for those running affected OpenSSL versions (>= 3.0.0, < 3.0.7).

Aptible has reviewed all infrastructure components that we manage and has confirmed that all are unaffected by this vulnerability. These components include:

- Our Managed TLS endpoints
- The TLS endpoints for our REST API services (Auth and Deploy APIs)
- All versions of our managed databases
- Our log forwarding infrastructure
- Our metrics collection infrastructure
- Our SSH and Git server infrastructure

Still, every Aptible customer should check the OpenSSL versions used in their apps to confirm they're unaffected. To do so, run:

$ aptible ssh --app $APP_HANDLE openssl version

If the version is >= 3.0.0, you should plan to upgrade your apps' Docker image(s) tomorrow as soon as OpenSSL 3.0.7 is released. We will continue to update this incident page as more information is revealed about the vulnerability. If the vulnerability is only exploitable for *server-side* OpenSSL functionality, the impact to Aptible customers would be significantly reduced. Only those customers who use plain TCP endpoints [1] with their own OpenSSL for TLS termination would be affected in this scenario.

[0] https://mta.openssl.org/pipermail/openssl-announce/2022-October/000238.html
[1] https://deploy-docs.aptible.com/docs/tcp-endpoints
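
For teams with many Apps, the same check can be looped over a list of handles; a minimal sketch, assuming a hypothetical file app-handles.txt with one handle per line:

$ for APP in $(cat app-handles.txt); do echo "== $APP =="; aptible ssh --app "$APP" openssl version; done
# Any App reporting a version >= 3.0.0 and < 3.0.7 should plan to rebuild its
# image as soon as OpenSSL 3.0.7 is released.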

Report: "Continuation of AWS ELB incident https://status.aptible.com/incidents/b3xvn9tmzfjz"

Last update
resolved

With an acknowledgement from AWS (copied below), we are marking this incident as resolved.

> Between 12:20 AM and 11:28 AM PDT [on October 15th], we experienced intermittent failures in Route53 Health Checks impacting Target Health evaluation in US-EAST-1. The issue has been resolved and the service is operating normally.

monitoring

The issue should be resolved for all endpoints that have been affected. We've also updated the behavior of the platform so that if you see this issue, running `aptible restart` on the affected app will update the configuration of all the app's endpoints to ignore health checks, and should thereby resolve the issue.

monitoring

All impacted endpoints that we have identified have been fixed by disabling their health checks. If you have an Endpoint that seems to be impacted, please email support@aptible.com and include the domain name of the Endpoint in question.

identified

This is a continuation of https://status.aptible.com/incidents/b3xvn9tmzfjz. We have noticed that we can only detect the issue when HTTP requests are made to the Endpoint, and as the incident started in the middle of a weekend night, we were not able to identify Endpoints which were not in use. We're reviewing and remediating additional Endpoints now, and will make an update to the platform to remove health checks entirely if AWS cannot fix the Route53 issue.

Report: "AWS ALB issue"

Last update
resolved

This incident has been resolved.

monitoring

We are disabling health checks on the impacted Endpoints in order to bring those applications back online.

identified

We have identified an AWS issue with Route 53 health checks where some records are failing over to Brickwall, our error server, despite the containers passing health checks. As far as we can tell, this is affecting a small number of Endpoints, unfortunately including dashboard.aptible.com as well.

Report: "EC2 Host Failure"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are investigating an EC2 dedicated host failure affecting a small number of apps and databases. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).

Report: "AWS us-east-2 power failure"

Last update
resolved

Everything appears to be operational at this time. Please open a ticket with support if you continue to experience issues in us-east-2.

monitoring

It looks like AWS has recovered most services, and we are continuing to monitor operations in us-east-2 to ensure everything is working properly.

identified

AWS experienced a power failure in a single availability zone in the us-east-2 Region. This affected networking in that AZ, and has also impacted load balancer registration times. We are working to ensure any resources we can identify as impacted are moved to another AZ, but the impact to load balancer registration and an issue creating/updating/deleting Route 53 records are both hampering our ability to mitigate availability issues.

Report: "EC2 Host Failure"

Last update
resolved

This incident has been resolved.

investigating

We are investigating an EC2 dedicated host failure affecting a small number of apps and databases. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).

Report: "EC2 Host Failure"

Last update
resolved

This incident has been resolved.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate this issue.

investigating

We are investigating an EC2 dedicated host failure affecting a small number of apps and databases. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).

Report: "Host failure in ca-central-1"

Last update
resolved

This incident has been resolved.

identified

We are investigating an EC2 dedicated host failure affecting a small number of apps and databases in the ca-central-1 region. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).

Report: "EC2 Host Failure"

Last update
resolved

This incident has been resolved.

investigating

We are investigating an EC2 dedicated host failure affecting a small number of apps and databases. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).

Report: "Host provisioning failures"

Last update
resolved

This incident has been resolved.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating an issue that is blocking new hosts from being provisioned. As a result, some app and database restart, scale, and deployment operations that result in a new host being provisioned may fail. Running apps and databases are not impacted by this failure.

Report: "CVE-2022-22965 "Spring4Shell" Response"

Last update
resolved

This incident has been resolved.

monitoring

Recently a series of vulnerabilities in the popular Java framework Spring were found, notably CVE-2022-22965 [0] (dubbed "Spring4Shell") and CVE-2022-22963 [1]. Aptible does not use the Spring framework in any of our internal applications, and has verified that none of our offered services that use Java are vulnerable either. We will continue monitoring the situation. [0] https://tanzu.vmware.com/security/cve-2022-22965 [1] https://tanzu.vmware.com/security/cve-2022-22963

Report: "Operations blocked - Route 53 propagation delays"

Last update
resolved

This incident has been resolved.

monitoring

Operations have been restored. Route 53 response times are still slow, but within acceptable limits.

investigating

We've noticed that some Operations are failing due to Route53 record changes not propagating within the 10 minute time limit allowed by our platform. In order to prevent Apps and Databases DNS records from reaching an inconsistent state, we are temporarily blocking Operations.

Report: "EC2 Host Failure"

Last update
resolved

This incident has been resolved.

identified

We are investigating an EC2 dedicated host failure affecting a small number of apps and databases. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically failover).

Report: "Route53 increased propagation delays"

Last update
resolved

This incident has been resolved.

monitoring

At this time we have only observed an impact when creating or destroying DNS records; changes to existing DNS records have not been impacted, so we are resuming normal operations. Our team will continue to monitor closely, and work to resolve the initially failed operations.

identified

We've noticed that some Operations are failing due to Route53 record changes not propagating within the 10 minute time limit allowed by our platform. In order to prevent Apps and Databases DNS records from reaching an inconsistent state, we are temporarily blocking Operations which will require updating DNS records.