Historical record of incidents for Castle
Report: "Partial service degradation"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We’re currently experiencing an elevated number of errors returned by our API. Our team has identified the root cause and is actively working on a fix. We apologize for any inconvenience and will provide updates as soon as more information becomes available.
Report: "Partial API Outage"
Last update: This incident has been resolved.
Report: "Slightly elevated number of 5xx responses"
Last update: A fix has been implemented and all issues are resolved.
We are currently experiencing a slightly elevated number of 5xx response errors. Our team has identified the core issue and is actively working on a fix. Please rest assured that no data has been lost.
Report: "Partial service degradation"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Dashboard downtime"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We experienced Dashboard downtime due to a failure in underlying infrastructure changes. Our team is working to resolve the issue and restore full functionality. We apologize for any inconvenience caused and appreciate your patience.
We are currently investigating this issue.
Report: "Service degradation"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified and resolved the root cause of the recent system issues. All affected systems have been restored to normal operation.
We experienced a disruption in our API services due to a failure in underlying infrastructure changes. Our team is currently working to resolve the issue and restore full functionality as soon as possible. We apologize for any inconvenience caused and appreciate your patience.
Report: "Partial unavailability dashboard"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We're experiencing temporary issues with our dashboard's functionality. Our team is currently investigating to determine the root cause. Your data remains secure and intact - this issue affects only accessibility, not data integrity. We're working diligently towards a quick resolution and will provide updates accordingly. Apologies for any inconvenience. For immediate concerns, please reach out to our support team.
Report: "Intermittent 5xx API errors"
Last update: Our team has thoroughly investigated the issue and determined that the root cause is likely underlying networking issues within AWS.
We are currently investigating intermittent 5xx API errors and longer than usual response times.
Report: "Service degradation"
Last update: The APIs are back to full operation and we're assessing any impact on the data during the period of the incident. We will follow up with an analysis.
We've identified what seems to be the issue and have deployed a fix. The service appears to be back to normal, but we'll keep monitoring it and follow up with a confirmation.
We are investigating an issue where 401 and 503 responses are returned for a subset of requests. We're still assessing the scope of the issue and will keep you posted.
Report: "Partial database outage"
Last update: This incident has been resolved.
We are investigating an elevated error rate on our end. We've identified the issue and are working towards fixing it. All of the APIs are operational.
Report: "API downtime"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
API responses are coming through for some requests.
We’re experiencing timeouts in API endpoints. Investigating.
Report: "Data ingestion issue in the Explore tab"
Last update: This incident has been resolved.
We have resolved the core issue. All data should be available in the dashboard. We are going to actively monitor data ingestion.
We are currently observing data ingestion problems caused by an issue in one of our databases. We are actively working on fixing the core issue. There is no data loss in any of the systems.
Report: "API performance degradation"
Last update: This incident has been resolved.
We see improved response times on our API endpoints. We will monitor the situation.
While we work on a permanent fix, we are seeing response times improve.
We have identified the issue. We are working to remediate the root cause.
We are investigating general API slowdown.
Report: "Internal services issue"
Last update: Internal systems are stable.
We have implemented a fix and are monitoring systems.
We have identified the issue and are working to address its root cause.
Report: "Amazon Web Services disruption"
Last update: Amazon Web Services has mitigated the majority of the issues around EC2. We will continue to monitor the AWS issues and update this status if needed.
AWS has updated its status to indicate that the USE1-AZ4 availability zone is the only one affected. We do not see any indicators that Castle services are affected. We will continue monitoring the AWS issue and act to remediate any impact if needed.
We are currently investigating whether the Amazon Web Services us-east-1 service disruption affects Castle services.
Report: "Potential service disruption"
Last update: Most AWS systems are already working, and the few that are not are recovering quickly.
We are keeping this incident open while we wait for AWS systems to recover.
Our hosting provider AWS is currently experiencing issues; however, they are not affecting Castle services at the moment. All Castle services are fully functional, and we're actively monitoring the situation. Please see AWS' status page for more information: https://status.aws.amazon.com
Report: "API slowdown"
Last update: Our APIs have slowed down due to a database issue.
Report: "API Downtime"
Last update: At 2021-09-06 20:04 UTC we experienced an AWS hardware failure with one of our main databases, which led to 7 minutes of downtime impacting our APIs. During this time, the APIs were returning a 500 response code and no data was processed. The database in question is configured to be multi-node with automatic failover, but for unknown reasons the failover didn't happen as expected when the hardware fault occurred. Instead, a full backup had to be recreated, which led to the extended period of downtime. We're currently debugging this with AWS support to make sure we can trust the resiliency of our platform. While the current setup should provide good redundancy, we're simultaneously looking into alternative options to prevent this from happening again.
Report: "Lost events in Castle Dashboard"
Last update: Between 14:43 and 15:31 UTC Castle experienced an infrastructure issue with our message queuing system that caused some customer event data to be lost. While risk scoring and inline responses were functioning normally, the requests sent during the period of the incident will not be visible or searchable in the Castle Dashboard. We're prioritizing efforts to add extra redundancy to our system to prevent this from happening again.
Report: "Service disruption"
Last update: On Sunday, April 4th, 2021, beginning at 13:56 UTC, Castle's `/authenticate` endpoint was unavailable. Our teams promptly responded and service was restored at 14:09 UTC.

We've conducted a full retrospective and root-cause analysis and determined that the original cause of the incident was the hardware failure (as confirmed by AWS Support) of an AWS host instance that contained Castle's managed cache service. This hardware failure caused an accumulation of timeouts, resulting in some app instances being marked unhealthy and automatically restarted in a loop. Although rare, we do expect occasional hardware-level failures, and our system is designed to be resilient to these failures whenever possible. In this case, the accumulated timeouts caused the system to behave in a way we have not seen before.

We have re-prioritized our engineering team to implement '[circuit breaker](https://martinfowler.com/bliki/CircuitBreaker.html)'-style handling around cache look-ups, which will prevent subsequent cache layer failures from impacting synchronous endpoints like `/authenticate` (a rough sketch of this pattern follows this incident's updates below).
System is back to normal. We will follow up with more details about this incident.
API endpoints responding normally again. Queued requests are catching up. Monitoring.
We're experiencing timeouts in API endpoints. Investigating.
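The 'circuit breaker'-style handling described in the retrospective above can be sketched briefly. This is a minimal illustration only, not Castle's implementation; the `CircuitBreaker` class, `cache_get`, `fallback`, and the thresholds below are assumptions for the example. The idea is that after repeated cache timeouts the breaker opens and look-ups skip the cache entirely, so a failed cache host degrades to slower fallback reads instead of timeouts that cascade into health-check failures.

```python
# Minimal circuit-breaker sketch around a cache lookup (illustrative only;
# names and thresholds are assumptions, not Castle's actual code).
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors, then skips the cache
    entirely until `reset_after` seconds have passed."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: allow a trial call once the cool-down period has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()


breaker = CircuitBreaker()


def cached_lookup(cache_get, key, fallback):
    """Read from the cache unless the breaker is open; on error, record the
    failure and fall back so the synchronous request still completes."""
    if not breaker.allow():
        return fallback()
    try:
        value = cache_get(key)
        breaker.record_success()
        return value
    except Exception:  # e.g. a timeout against a failed cache host
        breaker.record_failure()
        return fallback()
```

After the cool-down period the breaker allows a single trial look-up; a success closes it again, while another failure re-opens it for a further cool-down.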
Report: "Service disruption"
Last update: On March 30, 2021, Castle's API became degraded during three distinct windows of time:

* 12:02 UTC - 12:45 UTC
* 12:59 UTC - 13:41 UTC
* 14:48 UTC - 15:25 UTC

During this time, some Castle API calls failed, including calls to our synchronous `authenticate` endpoint. The Castle dashboard was up, but because the API was unavailable it was not rendering data. Service was fully restored as of 15:25 UTC, and some data generated from requests to our asynchronous `track` and `batch` endpoints during the incident was recovered from queues and subsequently processed.

As we communicated to all active customers yesterday, we take this sort of incident very seriously, and want to share some of the factors that led to it. The root cause of the incident was a failure of one of our primary data clusters. This is a multi-node, fault-tolerant commercial solution, and a complete cluster failure is extremely rare. Castle's infrastructure team responded immediately to the incident and found an unbounded memory leak that caused each node to shut down simultaneously. Over the course of the incident, we learned this memory leak was exacerbated by a specific class of background job that had actually begun running a day prior but did not begin leaking memory for some time.

When the incident began, we detected the issue and immediately restarted the cluster. A full 'cold start' of the entire cluster takes around 40 minutes, and this accounts for the first downtime window. After the cluster restarted, our fault-tolerant job scheduling system attempted to run the jobs again, which caused the cluster to require full cold restarts twice more as we worked to clear out the job queue and replicas.

At this time, we believe the reason for the memory leak is a bug in our data cluster provider's software: we have been able to successfully reproduce the issue in a test environment and have a high-priority case open with their support team. In the meantime, we have audited all active background job systems to ensure performance-affecting jobs are temporarily disabled or worked around.

Once again, we apologize for the impact of this interruption. Please feel free to contact us at [support@castle.io](mailto:support@castle.io) if you have any further questions.
Systems are operating normally and we have put mitigation measures in place to ensure the issue does not reoccur. We'll have a full retrospective and root cause teardown of the incident published within the next few days.
API endpoints are responsive again and the system is stabilizing. We're monitoring the situation.
We are seeing degraded performance on API endpoints once more, and are working on restoring functionality as quickly as possible.
Database cluster operating normally and API endpoints are responding. We're continuing to monitor the situation.
We are experiencing issues with our main database cluster which affects all API endpoints. We're currently investigating this issue.