Castle

Is Castle Down Right Now? Check whether there is an ongoing outage.

Castle is currently Operational

Last checked from Castle's official status page

Historical record of incidents for Castle

Report: "Partial service degradation"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We’re currently experiencing an elevated number of errors returned by our API. Our team has identified the root cause and is actively working on a fix. We apologize for any inconvenience and will provide updates as soon as more information becomes available.

Report: "Partial API Outage"

Last update
resolved

This incident has been resolved.

Report: "Slightly elevated number of 5xx responses"

Last update
resolved

A fix has been implemented and all issues are resolved.

identified

We are currently experiencing a slightly elevated number of 5xx response errors. Our team has identified the core issue and is actively working on a fix. Please rest assured that no data has been lost.

Report: "Partial service degradation"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Dashboard downtime"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are experiencing Dashboard downtime due to a failure introduced by underlying infrastructure changes. Our team is working to resolve the issue and restore full functionality. We apologize for any inconvenience caused and appreciate your patience.

investigating

We are currently investigating this issue.

Report: "Service degradation"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We have identified and resolved the root cause of the recent system issues. All affected systems have been restored to normal operation.

identified

We experienced a disruption in our API services due to a failure introduced by underlying infrastructure changes. Our team is currently working to resolve the issue and restore full functionality as soon as possible. We apologize for any inconvenience caused and appreciate your patience.

Report: "Partial unavailability dashboard"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We're experiencing temporary issues with our dashboard's functionality. Our team is currently investigating to determine the root cause. Your data remains secure and intact - this issue affects only accessibility, not data integrity. We're working diligently towards a quick resolution and will provide updates accordingly. Apologies for any inconvenience. For immediate concerns, please reach out to our support team.

Report: "Intermittent 5xx API errors"

Last update
resolved

Our team has thoroughly investigated the issue and determined that the root cause was likely an underlying networking issue within AWS.

investigating

We are currently investigating intermittent 5xx API errors and longer than usual response times.

Report: "Service degradation"

Last update
resolved

The APIs are back to full operation and we're assessing any impact on the data during the period of the incident. We will follow up with an analysis.

monitoring

We've identified what seems to be the issue and have deployed a fix. The service appears to be back to normal, but we'll continue monitoring it and follow up with a confirmation.

investigating

We are investigating an issue where 401 and 503 responses are returned for a subset of requests. We're still assessing the scope of the issue and will keep you posted.

Report: "Partial database outage"

Last update
resolved

This incident has been resolved.

identified

We are investigating an elevated error rate on our end. We've identified the issue and are working towards fixing it. All of the APIs are operational.

Report: "API downtime"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

API responses are coming through for some requests.

investigating

We’re experiencing timeouts in API endpoints. Investigating.

Report: "Data ingestion issue in the Explore tab"

Last update
resolved

This incident has been resolved.

monitoring

We have resolved the core issue. All data should be available in the dashboard. We will continue to actively monitor data ingestion.

identified

We are currently observing data ingestion problems caused by an issue in one of our databases. We are actively working on fixing the core issue. There has been no data loss in any of our systems.

Report: "API performance degradation"

Last update
resolved

This incident has been resolved.

monitoring

We see improved response times on our API endpoints. We will monitor the situation.

identified

While we work on a permanent fix, we are observing response times improving.

identified

We have identified the issue. We are working to remediate the root cause.

investigating

We are investigating general API slowdown.

Report: "Internal services issue"

Last update
resolved

Internal systems are stable.

monitoring

We have implemented a fix and are monitoring systems.

identified

We have identified the issue and are working to address its root cause.

Report: "Amazon Web Services disruption"

Last update
resolved

Amazon Web Services has mitigated the majority of the issues around EC2. We will continue to monitor the AWS issues and update the status if needed.

monitoring

AWS has updated its status to indicate that the USE1-AZ4 availability zone is the only one affected. We do not see any indication that Castle services are affected. We will continue monitoring the AWS issue and act to remedy any impact if needed.

investigating

We are currently investigating whether the Amazon Web Services us-east-1 service disruption affects Castle services.

Report: "Potential service disruption"

Last update
resolved

Most AWS systems are already working, and the few that are not are recovering quickly.

monitoring

We are keeping this incident open while we wait for AWS systems to recover.

investigating

Our hosting provider, AWS, is currently experiencing issues; however, they are not affecting Castle services at the moment. All Castle services are fully functional, and we're actively monitoring the situation. Please see AWS' status page for more information: https://status.aws.amazon.com

Report: "API slowdown"

Last update
resolved

Our APIs have slowed down due to a database issue.

Report: "API Downtime"

Last update
resolved

At 2021-09-06 20:04 UTC we experienced an AWS hardware failure with one of our main databases, which led to 7 minutes of downtime impacting our APIs. During this time, the APIs were returning a 500 response code and no data was processed. The database in question is configured to be multi-node with automatic failover, but for unknown reasons the failover didn't happen as expected when the hardware fault occurred. Instead, a full backup had to be recreated, which led to the extended period of downtime. We're currently debugging this with AWS support to make sure we can trust the resiliency of our platform. While the current setup should provide good redundancy, we're simultaneously looking into alternative options to prevent this from happening again.

Report: "Lost events in Castle Dashboard"

Last update
resolved

Between 14:43 and 15:31 UTC, Castle experienced an infrastructure issue with our message queuing system that caused some customer event data to get lost. While risk scoring and inline responses were functioning normally, the requests sent during the period of the incident will not be visible or searchable in the Castle Dashboard. We're prioritizing efforts to add extra redundancy to our system to prevent this from happening again.

Report: "Service disruption"

Last update
postmortem

On Sunday, April 4th, 2021, beginning at 13:56 UTC, Castle's `/authenticate` endpoint was unavailable. Our teams promptly responded and service was restored at 14:09 UTC. We've conducted a full retrospective and root-cause analysis and determined that the original cause of the incident was the hardware failure (as confirmed by AWS Support) of an AWS host instance that contained Castle's managed cache service. This hardware failure caused an accumulation of timeouts, resulting in some app instances being marked unhealthy and automatically restarted in a loop. Although rare, we do expect occasional hardware-level failures, and our system is designed to be resilient to these failures whenever possible. In this case, the accumulated timeouts caused the system to behave in a way we have not seen before. We have re-prioritized our engineering team to implement '[circuit breaker](https://martinfowler.com/bliki/CircuitBreaker.html)'-style handling around cache look-ups, which will prevent subsequent cache layer failures from impacting synchronous endpoints like `/authenticate`.
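For readers unfamiliar with the circuit-breaker pattern referenced above, the sketch below illustrates the general idea in Python: fail fast with a fallback instead of letting timeouts accumulate. It is a minimal illustration only; the class, thresholds, and cache client are hypothetical and do not reflect Castle's actual implementation.

```python
# Illustrative sketch only: a minimal circuit breaker around a cache lookup.
# Names and thresholds are hypothetical, not Castle's actual code.
import time


class CircuitBreaker:
    """Skips a failing dependency for a cool-down period instead of waiting on timeouts."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None = circuit closed (dependency considered healthy)

    def call(self, fn, *args, fallback=None):
        if self.opened_at is not None:
            # Circuit is open: return the fallback immediately until the cool-down passes.
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback
            self.opened_at = None  # half-open: allow one trial call

        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback
        else:
            self.failures = 0  # success closes the circuit again
            return result


# Example: wrap a cache lookup so a cache outage degrades to a cache miss
# rather than an accumulation of timeouts on a synchronous endpoint.
breaker = CircuitBreaker()
cache = {}  # stand-in for a remote cache client

def cached_lookup(key):
    return breaker.call(cache.get, key, fallback=None)
```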

resolved

System is back to normal. We will follow up with more details about this incident.

investigating

API endpoints are responding normally again. Queued requests are catching up. Monitoring.

investigating

We’re experiencing timeouts in API endpoints. Investigating.

Report: "Service disruption"

Last update
postmortem

On March 30, 2021, Castle’s API became degraded during three distinct windows of time:

* 12:02 UTC - 12:45 UTC
* 12:59 UTC - 13:41 UTC
* 14:48 UTC - 15:25 UTC

During this time, some Castle API calls failed, including calls to our synchronous `authenticate` endpoint. The Castle dashboard was up but, because the API was unavailable, was not rendering data. Service was fully restored as of 15:25 UTC, and some data generated from requests to our asynchronous `track` and `batch` endpoints during the incident was recovered from queues and subsequently processed.

As we communicated to all active customers yesterday, we take these sorts of incidents very seriously, and want to share some of the factors that led to this incident. The root cause of the incident was a failure of one of our primary data clusters. This is a multi-node, fault-tolerant commercial solution, and a complete cluster failure is extremely rare. Castle’s infrastructure team responded immediately to the incident and found an unbounded memory leak that caused each node to shut down simultaneously. Over the course of the incident, we learned this memory leak was exacerbated by a specific class of background job that actually began running a day prior but did not begin leaking memory for some time.

When the incident began, we detected the issue and immediately restarted the cluster. A full 'cold start' of the entire cluster takes around 40 minutes, and this accounts for the first downtime window. After the cluster restarted, our fault-tolerant job scheduling system attempted to run the jobs again, which caused the cluster to require full cold restarts twice more as we worked to clear out the job queue and replicas. At this time, we believe the reason for the memory leak is a bug in our data cluster provider’s software; we have been able to successfully reproduce the issue in a test environment and have a high-priority case open with their support team. In the meantime, we have audited all active background job systems to ensure performance-affecting jobs are temporarily disabled or worked around.

Once again, we apologize for the impact of this interruption. Please feel free to contact us at [support@castle.io](mailto:support@castle.io) if you have any further questions.
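As a general illustration of the mitigation described above (temporarily disabling performance-affecting background jobs and bounding automatic re-runs so retries cannot re-trigger an outage), the sketch below shows one way a per-job-class kill switch and retry cap might look. The job names and scheduler interface are hypothetical and are not Castle's actual systems.

```python
# Illustrative only: a bounded-retry guard with a per-job-class kill switch.
# Job class names and the worker loop are hypothetical, not Castle's scheduler.
DISABLED_JOB_CLASSES = {"replica_rebalance"}  # job types temporarily switched off
MAX_ATTEMPTS = 3


def should_run(job_class: str, attempts_so_far: int) -> bool:
    """Return False for jobs that are disabled or have exhausted their retries."""
    if job_class in DISABLED_JOB_CLASSES:
        return False
    # Without a cap, a fault-tolerant scheduler will keep re-running a job that
    # crashes its target, forcing the same recovery (e.g. a cold restart) repeatedly.
    return attempts_so_far < MAX_ATTEMPTS


# Example usage inside a worker loop:
pending = [("replica_rebalance", 0), ("daily_report", 4), ("daily_report", 1)]
for job_class, attempts in pending:
    print(job_class, "run" if should_run(job_class, attempts) else "skip")
```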

resolved

Systems are operating normally and we have put mitigation measures in place to ensure the issue does not reoccur. We'll have a full retrospective and root cause teardown of the incident published within the next few days.

monitoring

API endpoints are responsive again and the system is stabilizing. We're monitoring the situation.

identified

We are seeing degraded performance on API endpoints once more, and are working on restoring functionality as quickly as possible.

monitoring

The database cluster is operating normally and API endpoints are responding. We're continuing to monitor the situation.

investigating

We are experiencing issues with our main database cluster, which affect all API endpoints. We're currently investigating this issue.