Historical record of incidents for Airtame
Report: "Service disruption"
Last update: We have identified the issue in our systems. The system is back online.
We are currently experiencing a service disruption. Our engineering team is working to identify the root cause and implement a solution.
Report: "Problems connecting to Teams calls"
Last update: This incident has been confirmed resolved.
We have identified the issue and deployed a fix. We are now monitoring the situation.
We are currently investigating an issue where some Airtame Hub devices fail to connect to a Teams call.
Report: "Service disruption"
Last update: The incident has been resolved.
We are currently experiencing a service disruption with our email provider. Our Cloud team is working to identify the root cause and implement a solution. Emails from Cloud will not be delivered until the issue is resolved.
Report: "Google Platform experiencing degradation in the service"
Last update: Our monitoring system is no longer reporting errors in communication with Google services. We consider this issue resolved.
We are experiencing an increased error rate in some Airtame Cloud applications. So far it has been restricted to certain Google applications, and our current diagnosis points to issues on the Google Platform.
Report: "Backend API unstable"
Last update: The issue has been fixed and the Backend API is now behaving as expected.
We are currently experiencing degraded performance in the Backend API handling the requests from our Frontend.
Report: "Elevated number of 5xx errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating an elevated number of 5xx errors. We will provide updates as necessary.
Report: "DNS resolution issue"
Last update: This incident has been resolved.
Due to an issue with domain renewal, airtame.cloud was unreachable. The issue with our domain provider has since been fixed and services are recovering. We're continuing to monitor the situation and will provide updates as necessary.
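As an illustration only (this is not part of Airtame's tooling), a failure like the one above can be confirmed with a plain standard-library DNS lookup; the sketch below checks whether airtame.cloud still resolves.

```python
# Minimal DNS resolution check using only the Python standard library.
import socket

try:
    addrs = socket.getaddrinfo("airtame.cloud", 443, proto=socket.IPPROTO_TCP)
    print("airtame.cloud resolves to:", sorted({a[4][0] for a in addrs}))
except socket.gaierror as exc:
    # A lapsed domain registration typically surfaces here as a resolution error.
    print("DNS resolution failed:", exc)
```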
Report: "Service disruption"
Last update:

# **Introduction**

On Saturday, 08.02.2020, Airtame Cloud suffered a service disruption from approximately 03:10 to 16:50 UTC, during which most users were unable to use Airtame Cloud. We apologise for the service disruption. With this postmortem we would like to explain how this service disruption was handled, and what we will do to minimise the risk of future service disruptions.

# **Timeline**

* **03:10** - We receive alerts of high CPU usage on our database instance.
* **12:14** - Engineering starts investigating the issue.
* **13:46** - A potential issue is identified with a performance test, and the performance test is stopped. Service is briefly restored. This was not due to the stopped performance test, but because we stopped backend services to prepare a failover of our database; this was not clear at the time, as metrics shipment turned out to be delayed.
* **14:03** - The issue starts to occur again even with the performance test stopped. Investigation of the root cause continues.
* **16:17** - The root cause is identified in Airtame device firmware 3.8.0-b3 and above. The issue relates to new logic in the device's Cloud component that communicates with the Cloud backend.
* **16:20** - A hotfix is deployed on the backend to stop these firmware versions from connecting to our Cloud.
* **16:47** - The issue is mitigated and the service is restored for devices running firmware 3.8.0-b2 and below.

On Monday, 10.02.2020, a backend fix is developed to also allow firmware versions 3.8.0-b3 and above to connect to our Cloud again. This fix is deployed by 15:00 UTC.

# **Explanation**

In the Cloud component of the affected firmware versions, a device UUID handler was introduced. This UUID handler would trigger a full table scan of our database each time a device connected to our backend, leading to high CPU usage on the database. On Friday, we saw a 25% increase in users with devices running firmware versions 3.8.0-b3 and above. While the absolute number of added devices was small (~200), this was enough to cause a cascading failure due to a combination of circumstances:

* The affected devices would connect to our backend.
* Each connected device would cause the backend to do a full table scan, causing high CPU usage on our database.
* This would result in an increase in query latencies, which in turn would result in WebSocket disconnections.
* The devices would try to reconnect the WebSockets, leading to an even higher database load, and thus higher latencies. The number of connections piling up then led to memory issues on our backends.
* Finally, our backends ran out of memory, causing all devices to disconnect from the Cloud entirely. After a random timeout, they would attempt to reconnect, meaning our backends were unable to recover while the database remained locked up.
* Once the affected versions were blocked from connecting to the Cloud, our database and backends were able to recover and service was restored.

# **Learnings**

We have recently added performance tests, and will continue to add further checks to these. Even though our monitoring system detected the database CPU usage increase, it did not report the increase in error rates on the WebSocket endpoints. Since the incident, we have implemented new checks that monitor error levels on our public load balancers. This is currently being validated.
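To make the failure mode concrete, here is a minimal, hypothetical sketch of the per-connection lookup pattern described in the explanation above, assuming a PostgreSQL `devices` table accessed from a Python backend via psycopg2; none of the table, column, or function names are taken from Airtame's actual codebase.

```python
# Hypothetical illustration only -- not Airtame's backend code.
# Assumes a PostgreSQL "devices" table with an indexed uuid column.
import psycopg2

conn = psycopg2.connect("dbname=cloud_example")

def lookup_device_slow(device_uuid: str):
    """Called on every device connection. Casting the indexed column to text
    prevents the planner from using the index, so each call becomes a full
    table scan -- harmless for one device, crippling when thousands reconnect."""
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM devices WHERE uuid::text = %s", (device_uuid,))
        return cur.fetchone()

def lookup_device_fast(device_uuid: str):
    """Comparing against the uuid column directly lets the same lookup resolve
    via an index scan instead of touching every row."""
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM devices WHERE uuid = %s::uuid", (device_uuid,))
        return cur.fetchone()
```

Running `EXPLAIN` on each query shows a sequential scan for the first form and an index scan for the second, which is one way this kind of regression can be caught in performance tests before it reaches production.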
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We continue to investigate the root cause of the issues.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are currently experiencing a service disruption and are investigating the issue.
Report: "Service disruption"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently experiencing a service disruption. Our Infrastructure team is working to identify the root cause and implement a solution. Some users may be affected, with devices appearing offline.
Report: "RDS storage issue"
Last update: This incident has been resolved.
We are currently investigating an elevated rate of errors.