Airtame

Is Airtame Down Right Now? Check whether there is an ongoing outage.

Airtame is currently Operational

Last checked from Airtame's official status page

Historical record of incidents for Airtame

Report: "Service disruption"

Last update
resolved

We have identified the issue in our systems, and the service is back online.

investigating

We are currently experiencing a service disruption. Our engineering team is working to identify the root cause and implement a solution.

Report: "Problems connecting to Teams calls"

Last update
resolved

This incident has been confirmed resolved.

monitoring

We have identified the issue and deployed a fix. We are now monitoring the situation.

investigating

We are currently investigating an issue where some Airtame Hub devices fail to connect to a Teams call.

Report: "Service disruption"

Last update
resolved

The incident has been resolved.

investigating

We are currently experiencing a service disruption with our email provider. Our Cloud team is working to identify the root cause and implement a solution. Emails from Cloud will not be delivered until the issue is resolved.

Report: "Google Platform experiencing degradation in the service"

Last update
resolved

Our monitoring system is no longer reporting errors in communication with Google services. We consider this issue resolved.

monitoring

We are experiencing an increased error rate in some Airtame Cloud applications. So far the impact has been restricted to some Google applications, and our current diagnosis points to issues on the Google Platform.

Report: "Backend API unstable"

Last update
resolved

The issue has been fixed and the Backend API is now behaving as expected.

investigating

We are currently experiencing degraded performance in the Backend API that handles requests from our Frontend.

Report: "Elevated number of 5xx errors"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating an elevated number of 5xx errors. We will provide updates as necessary.

Report: "DNS resolution issue"

Last update
resolved

This incident has been resolved.

monitoring

Due to an issue with domain renewal, airtame.cloud was unreachable. The issue with our domain provider has since been fixed and services are recovering. We're continuing to monitor the situation and will provide updates as necessary.

Report: "Service disruption"

Last update
postmortem

# Introduction

On Saturday, 08.02.2020, Airtame Cloud suffered a service disruption from approximately 03:10 to 16:50 UTC, during which most users were unable to use Airtame Cloud. We apologise for the service disruption. With this postmortem we would like to explain how this service disruption was handled, and what we will do to minimise the risk of future service disruptions.

# Timeline

* **03:10** - We receive alerts of high CPU usage on our database instance.
* **12:14** - Engineering starts investigating the issue.
* **13:46** - A potential culprit is identified in a running performance test, and the test is stopped. Service is briefly restored. However, the recovery was not due to the stopped performance test, but because we stopped backend services to prepare a failover of our database. This was not clear at the time, as metrics shipment turned out to be delayed.
* **14:03** - The issue starts to occur again, even with the performance test stopped. Investigation of the root cause continues.
* **16:17** - The root cause is identified in Airtame device firmware 3.8.0-b3 and above, specifically in the new logic of the device's Cloud component that communicates with the Cloud backend.
* **16:20** - A hotfix is deployed on the backend to prevent these firmware versions from connecting to our Cloud.
* **16:47** - The issue is mitigated and service is restored for devices running firmware 3.8.0-b2 and below.

On Monday, 10.02.2020, a backend fix is developed to allow firmware versions 3.8.0-b3 and above to connect to our Cloud again. This fix is deployed by 15:00 UTC.

# Explanation

In the Cloud component of the affected firmware versions, a device UUID handler was introduced. This UUID handler performed a full table scan of our database each time a device connected to our backend, leading to high CPU usage on our database. On Friday, we saw a 25% increase in users with devices running firmware versions 3.8.0-b3 and above. While the absolute number of added devices was small (~200), this was enough to cause a cascading failure due to a combination of circumstances:

* The affected devices would connect to our backend.
* Each connected device would cause the backend to do a full table scan, causing high CPU usage on our database.
* This would result in an increase in query latencies, which in turn would result in WebSocket disconnections.
* The devices would try to reconnect the WebSockets, leading to an even higher database load, and thus higher latencies. The number of connections piling up then led to memory issues on our backends.
* Finally, our backends ran out of memory, causing all devices to disconnect from the Cloud entirely. After a random timeout, they would attempt to reconnect, meaning our backends were unable to recover while the database remained locked up.
* Once the affected versions were blocked from connecting to the Cloud, our database and backends were able to recover and service was restored.

# Learnings

We have recently added performance tests, and will continue to add further checks to these. Even though our monitoring system detected the database CPU usage increase, it did not report the increase in error rates on the WebSocket endpoints. Since the incident, we have implemented new checks that monitor error levels on our public load balancers. This is currently being validated.
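The root cause above hinges on a per-connection lookup degenerating into a full table scan. The following minimal sketch, using SQLite and an entirely hypothetical `devices` table (Airtame's real database engine, schema, and handler code are not public), shows how an unindexed UUID lookup scans every row, and how an index reduces the same query to a point lookup:

```python
# Illustration of the postmortem's root cause: a connect-time UUID lookup that
# forces a full table scan, versus the same lookup backed by an index.
# Minimal sketch only; all table and column names here are hypothetical.
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (id INTEGER PRIMARY KEY, device_uuid TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO devices (device_uuid, name) VALUES (?, ?)",
    [(str(uuid.uuid4()), f"device-{i}") for i in range(10_000)],
)

some_uuid = conn.execute("SELECT device_uuid FROM devices LIMIT 1").fetchone()[0]

# Without an index on device_uuid, every connect-time lookup visits all rows.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM devices WHERE device_uuid = ?", (some_uuid,)
).fetchall()
print("no index:", plan)  # plan detail reads "SCAN devices" -- a full table scan

# Adding an index turns the same query into a cheap point lookup.
conn.execute("CREATE INDEX idx_devices_uuid ON devices (device_uuid)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM devices WHERE device_uuid = ?", (some_uuid,)
).fetchall()
print("indexed: ", plan)  # plan detail reads "SEARCH devices USING INDEX ..."
```

Run per connection across thousands of devices, the unindexed variant multiplies into exactly the kind of sustained database CPU load the timeline describes.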
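The cascade was then sustained by devices reconnecting after a random timeout, which was not enough to let the backends recover. A common general mitigation for this kind of reconnect storm is capped exponential backoff with jitter, sketched below; this is illustrative only and is not Airtame's firmware code (their actual fix was to block the affected firmware versions), and `connect()` is a stand-in:

```python
# Capped exponential backoff with full jitter for WebSocket reconnects.
# Generic sketch of a thundering-herd mitigation, not Airtame's implementation.
import random
import time

def reconnect_with_backoff(connect, base=1.0, cap=60.0, max_attempts=10):
    """Retry connect(), sleeping up to base * 2**attempt seconds (capped), with full jitter."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # Full jitter spreads reconnects out so clients do not stampede in sync.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
    raise ConnectionError(f"gave up after {max_attempts} attempts")

# Usage with a stand-in connect function that fails three times, then succeeds:
failures = iter([True, True, True, False])
def fake_connect():
    if next(failures):
        raise ConnectionError("backend unavailable")
    return "connected"

print(reconnect_with_backoff(fake_connect, base=0.01, cap=0.1))
```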
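Finally, the learnings mention new checks that watch error levels on the public load balancers. In the same spirit, here is a minimal sliding-window 5xx error-rate check; Airtame's actual monitoring stack, window sizes, and thresholds are not public, so everything below is an assumed, illustrative design:

```python
# A minimal sliding-window 5xx error-rate check, in the spirit of the
# load-balancer checks mentioned in the learnings. Purely illustrative.
from collections import deque

class ErrorRateMonitor:
    def __init__(self, window=100, threshold=0.05):
        self.statuses = deque(maxlen=window)  # most recent HTTP status codes
        self.threshold = threshold

    def record(self, status_code):
        self.statuses.append(status_code)

    def alert(self):
        """True when the share of 5xx responses in the window exceeds the threshold."""
        if not self.statuses:
            return False
        errors = sum(1 for s in self.statuses if 500 <= s < 600)
        return errors / len(self.statuses) > self.threshold

monitor = ErrorRateMonitor(window=10, threshold=0.2)
for code in [200, 200, 503, 502, 200, 500, 200, 200, 200, 200]:
    monitor.record(code)
print(monitor.alert())  # 3/10 = 30% 5xx responses -> True
```

A check like this would have surfaced the rising WebSocket endpoint error rate that, per the postmortem, the CPU-focused monitoring missed.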

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We continue to investigate the root cause of the issues.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We are currently experiencing a service disruption and are investigating the issue.

Report: "Service disruption"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently experiencing a service disruption. Our Infrastructure team is working to identify the root cause and implement a solution. Some users may be affected, with devices appearing offline.

Report: "RDS storage issue"

Last update
resolved

This incident has been resolved.

investigating

We are currently investigating an elevated rate of error responses.