Ubidots

Is Ubidots Down Right Now? Check whether an outage is currently ongoing.

Ubidots is currently Degraded

Last checked from Ubidots's official status page

Historical record of incidents for Ubidots

Report: "Generalized latency"

Last update
investigating

We're experiencing a generalized latency in the platform. Our DevOps team is already working on resolving this issue. We will provide additional information as it becomes available.

Report: "New CSS custom styles deployment"

Last update
Completed

The scheduled maintenance has been completed.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

Our development team has been working on a new feature that will allow customizing Ubidots applications even further through the enablement of custom CSS injection on a per-app basis. We will be deploying these changes, which consist of both front-end and back-end code. Both deployments start at the same time, but we can't guarantee that they finish simultaneously. While this takes place, the platform may render unusual styles.

Report: "[AU DEPLOYMENT] Outage"

Last update
resolved

The AU deployment experienced an outage because of a sudden spike in data reception, which overloaded the servers and made the web app slow to read and display data.

Start time: 10:30 PM UTC-5
End time: 11:15 PM UTC-5
Total time: 45 min

During this time:
– All data was correctly saved.
– Real-time updates in dashboards were interrupted.
– The events engine was taking up to 4 minutes to trigger events.

Report: "Apps with 404 error"

Last update
resolved

Incident Summary: Between 11:00 PM and 4:41 AM UTC-5, some users experienced 404 errors when attempting to access applications deployed in our US region.

Root Cause: The issue was traced to an unexpected behavior in our caching mechanism. As part of recent DDoS attack mitigation improvements, changes were made to the cache that inadvertently resulted in these 404 errors. Our apps' cache is set to expire every 7 days, and when the cache expired, the affected applications returned 404 errors. This behavior required manual intervention or periodic tasks to refresh the cache.

Resolution: A fix has been deployed to ensure the cache is automatically updated moving forward. We have also implemented additional monitoring checks every 5 minutes to proactively identify and address similar issues.

Next Steps: We are exploring additional improvements to further enhance reliability and reduce the risk of similar incidents in the future. We apologize for any inconvenience caused and appreciate your understanding as we continue to improve our systems.
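As an illustration of the "check every 5 minutes" idea above, here is a minimal monitoring sketch (not Ubidots' actual tooling) that polls a list of application URLs and raises an alert on 404s. The URLs and the alert hook are hypothetical placeholders.

```python
# Minimal sketch of a periodic 404 health check, similar in spirit to the
# 5-minute monitoring described above. URLs and the alert hook are
# hypothetical placeholders, not Ubidots' actual configuration.
import time
import requests

APP_URLS = [
    "https://example-app-1.iot.example.com",  # hypothetical end-user app URLs
    "https://example-app-2.iot.example.com",
]

def alert(message: str) -> None:
    # Placeholder: in practice this would page on-call or post to a chat channel.
    print("ALERT:", message)

def check_apps() -> None:
    for url in APP_URLS:
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code == 404:
                alert(f"{url} returned 404 (possible expired cache entry)")
        except requests.RequestException as exc:
            alert(f"{url} unreachable: {exc}")

if __name__ == "__main__":
    while True:
        check_apps()
        time.sleep(300)  # run every 5 minutes
```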

Report: "Minor downtime: BD servers update"

Last update
resolved

Our DevOps team ran a maintenance task consisting of updating our servers' packages. This is standard procedure to guarantee the instances are up to date with the latest security patches and package versions. The task took about 5 minutes to complete, during which reading data was slow. No data was lost during this time window.

Report: "MQTT"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our MQTT publish service and are currently looking into the issue.

Report: "HTTP"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our HTTP data ingestion service and are currently looking into the issue.

Report: "Events"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our Events service and are currently looking into the issue.

Report: "MQTT"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our MQTT publish service and are currently looking into the issue.

Report: "MQTT intermittencies"

Last update
resolved

After months of troubleshooting our infrastructure design concerning the MQTT broker configuration, the solution has been found in collaboration with the broker provider. The issue was found to be directly embedded in the broker's source code, out of our reach, so the provider had to go in, based on our error logs, and implement a solution to the root cause: in simple words, a race condition between posting data to our upstream services and the internal broker log generation service. This condition only occurred when a large amount of data was reaching the broker.

monitoring

We continue to experience instances where data, although published and acknowledged with an ACK, is not routed correctly into the database. As such, we have implemented new measures aimed at balancing connections better and making sure that customers, because of possibly bad MQTT client implementations, don't overload the broker with unused connections. Specifically, our DevOps team has:

1. Introduced an authentication rate limit: a token cannot send more than 4 auth messages per second.
2. Implemented a maximum number of MQTT connections per user. This means a particular customer cannot create more than X MQTT connections, where X has been set based on number of devices, license and current usage.
3. Deployed what are known as "sticky connections", which ensure that an established session is always routed through the same balancing server as long as the connection is alive. Along with this, instead of a single MQTT balancing server, there are now 3.
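As context for the connection limits above, here is a minimal client-side sketch (not Ubidots code) of the pattern these measures favor: one persistent MQTT session reused for all publishes, instead of reconnecting and re-authenticating per message. It uses the paho-mqtt 1.x client; the broker host, token-as-username convention and topic scheme are assumptions based on Ubidots' public documentation, and the token and device label are placeholders.

```python
# Client-side sketch of a single long-lived MQTT connection, the pattern the
# measures above favor (vs. reconnecting for every publish). Broker host,
# topic scheme and token-as-username are assumptions from Ubidots' public docs.
import json
import time
import paho.mqtt.client as mqtt

TOKEN = "BBFF-your-token"      # hypothetical placeholder
DEVICE_LABEL = "my-device"     # hypothetical placeholder

client = mqtt.Client()                       # paho-mqtt 1.x constructor
client.username_pw_set(TOKEN, password="")
client.connect("industrial.ubidots.com", 1883, keepalive=60)
client.loop_start()                          # one persistent session, reused for all publishes

try:
    while True:
        payload = json.dumps({"temperature": 25.3})
        client.publish(f"/v1.6/devices/{DEVICE_LABEL}", payload, qos=1)
        time.sleep(10)                       # publish on the same connection; no re-auth per message
finally:
    client.loop_stop()
    client.disconnect()
```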

monitoring

This week, after several deep dives into our MQTT broker logs, our DevOps team focused their efforts on troubleshooting one particular point in the MQTT data reception stack: the internal HTTP webhook that interfaces the broker with the internal data ingestion queues. We found that the webhook timeout and connection pool size configuration were playing an important role in ensuring data reception. With that:

1. We increased the internal HTTP webhook timeout.
2. We increased the maximum connection pool size for every MQTT node.

After seeing a positive impact on the alert occurrence rate, the DevOps team took the following measures, on top of the above 2, to verify the behavior:

1. Disconnected 1 of the 3 balancers. We had added one when testing whether the problem came from a connections overload.
2. Stopped routing a portion of the traffic to a separate MQTT broker. This had been deployed to minimize the load on the deployment.
3. Increased the number of HTTP-service pods receiving the webhook requests from the MQTT broker.

In summary, these actions have meant a substantial reduction in data loss. We now only see very sporadic alerts, and after detailed monitoring of client data, we are no longer seeing data gaps. We will continue to monitor the stability of the MQTT data reception service for further tuning.
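To illustrate the two knobs named in this update, here is a hedged sketch of a webhook caller with an explicit timeout and a larger connection pool, using Python's requests library. The URL, timeout and pool sizes are hypothetical and are not Ubidots' internal configuration.

```python
# Illustrative sketch of the two knobs above (webhook timeout and connection
# pool size) on the calling side of an HTTP webhook. Values are hypothetical.
import requests
from requests.adapters import HTTPAdapter

WEBHOOK_URL = "https://ingestion.internal.example/webhook"  # hypothetical

session = requests.Session()
# A larger pool lets many concurrent webhook calls reuse sockets instead of
# queueing behind a small pool or opening new connections under load.
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=100)
session.mount("https://", adapter)

def forward(payload: dict) -> bool:
    try:
        # A generous timeout avoids dropping messages when the downstream
        # queue is momentarily slow; too short a timeout shows up as data loss.
        resp = session.post(WEBHOOK_URL, json=payload, timeout=30)
        return resp.ok
    except requests.RequestException:
        return False
```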

monitoring

After the fine-tuning of the balancer servers and the activation of the MQTT flapping detection mechanism last Friday (September 6th), our internal checks still detected failures to deliver data over MQTT, and similar reports came from tests run by our support team. Nonetheless, the occurrence of alerts has been decreasing with each measure our DevOps team has taken. As our goal is to provide stability and make sure data isn't lost, we continue to implement changes to completely mitigate these MQTT intermittencies. With that, today we have:

1. Deployed an additional load balancer.
2. Increased the number of pods (containers) running the MQTT ingestion services.

identified

Our DevOps team has taken the following additional measures to lower the MQTT intermittencies, although they're still present:

1. Fine-tuned the servers running our load balancers to support more concurrent connections.
2. Enabled a feature in the MQTT broker that automatically detects client disconnections, and by extension the connection rate, within a time window. If a threshold is exceeded during the window, the particular MQTT client is banned for a configurable duration.
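For illustration only, a conceptual sketch of that flapping-detection idea: count connection attempts per client in a sliding window and ban the client for a configurable duration when a threshold is exceeded. The thresholds and function names here are made up; the real mechanism is a built-in feature of the MQTT broker.

```python
# Conceptual sketch of flapping detection: a sliding-window count of
# (re)connections per client, with a temporary ban when a threshold is hit.
# Thresholds are illustrative only.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_CONNECTS_PER_WINDOW = 15
BAN_SECONDS = 300

connects = defaultdict(deque)   # client_id -> timestamps of recent connects
banned_until = {}               # client_id -> time at which the ban lifts

def on_connect_attempt(client_id: str) -> bool:
    """Return True if the connection is allowed, False if the client is banned."""
    now = time.time()
    if banned_until.get(client_id, 0) > now:
        return False
    window = connects[client_id]
    window.append(now)
    # Drop timestamps that fell out of the sliding window.
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_CONNECTS_PER_WINDOW:
        banned_until[client_id] = now + BAN_SECONDS
        return False
    return True
```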

identified

After the implementation of the detailed log, we were able to spot that significant traffic reaching our servers came from inactive users. That traffic was not being rejected directly at our load balancers; instead, it was being allowed to connect and publish data. We have now blocked said traffic completely, ensuring only active customers are able to connect. This prevents overloading the MQTT servers with invalid traffic. So far, the internal alerts have decreased considerably, but some remain. Our team continues investigating what else is causing the remaining alerts and MQTT intermittencies.

investigating

Over the past 2 weeks, our MQTT service has been experiencing latencies and intermittencies when publishing data or creating connections to do so. This has resulted, in some cases, in data loss and in a diminished perception of quality of service. Our DevOps team is aware of the problem through user reports channeled by our support team, and our internal checks have pointed us to the issue as well. The team has been monitoring the behavior, and so far we believe there are sudden spikes of connections that cause the intermittencies. The team has:

1. Established more aggressive restrictions with respect to connections per IP.
2. Established a lower rate limit of connections per second per IP.

These 2 changes have improved the issue, but not fixed it completely. As of the time of this note (05/08/2024 16:38 UTC), we're implementing a more robust and detailed log that allows us to trace the networking and usage per client, with the aim of finding the direct cause of the spikes. This will allow us to determine paths to implement a definitive solution. We will keep updating this incident as more information becomes available.
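As a complement to the per-IP limits described above, here is a minimal client-side sketch (not Ubidots code) of reconnecting with exponential backoff and jitter, so devices behind a single IP don't trip a connections-per-second limit. The host and retry parameters are placeholders.

```python
# Client-side sketch: reconnect with exponential backoff and jitter so a
# fleet of devices behind one IP stays under a per-IP connection rate limit.
# Host, port and retry parameters are hypothetical placeholders.
import random
import time
import paho.mqtt.client as mqtt

def connect_with_backoff(client: mqtt.Client, host: str, port: int = 1883,
                         max_delay: float = 120.0) -> None:
    delay = 1.0
    while True:
        try:
            client.connect(host, port, keepalive=60)
            return
        except OSError:
            # Wait an increasing, jittered interval before retrying so bursts
            # of reconnect attempts don't look like a connection-rate spike.
            time.sleep(delay + random.uniform(0, delay / 2))
            delay = min(delay * 2, max_delay)
```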

Report: "MQTT, TCP & HTTP Ingestion [AU]"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our HTTP data ingestion service and are currently looking into the issue.

Report: "HTTP"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our HTTP data ingestion service and are currently looking into the issue.

Report: "MQTT"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our MQTT publish service and are currently looking into the issue.

Report: "API Performance Degradation"

Last update
resolved

The degradation occurred between 2:07 AM UTC and 3:23 AM UTC. This incident has been resolved.

monitoring

The cause was high data extraction via the API, which increased response latency and caused slow dashboard loading. After analyzing and identifying the issue, the concurrency level of our core was increased to handle the necessary load.

identified

Australia Deployment: An issue has been identified and a fix is being implemented.

Report: "[EVENTS] Alerts targeting external services didn't work"

Last update
resolved

Between 10/04/2024 6:10 AM UTC-5 and 10/04/2024 6:50 AM UTC-5, the events module presented an issue sending alerts to external services. This impacted the event actions: Email, SMS, Voice call, Telegram, Slack notification, Functions and Webhooks. During this period, although the events triggered, the actions were not sent as expected. The root cause was a modification in our Kubernetes cluster aimed at simplifying the management of DNS records and preventing errors in the future. Our DevOps team detected the problem in the configuration, fixed it, and redeployed so the correct configuration would take effect.

Report: "Performance Degradation in Bulk Data Processing"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are still monitoring this latency. There's a substantial number of dots that are currently queued for ingestion. To mitigate the TTS Plugin data ingestion issue, we have deployed a new version that no longer utilizes the affected endpoint.

monitoring

We are actively working on implementing changes to mitigate the issue.

identified

We are currently investigating a significant performance degradation in our bulk data processing system. This performance bottleneck is impacting the speed of the bulk data endpoint for all users, and this incident has the potential to cause significant delays in triggering conditional events and alarms. Our team is investigating the root cause and working towards a resolution to restore normal operations as quickly as possible. We have identified that the TTS Plugin is also affected, because the plugin relies on the bulk endpoint to ingest data. We apologize for any inconvenience this may cause and appreciate your patience. Further updates will be provided as we progress in resolving this issue.
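For context on the dependency described above, here is a hedged sketch of sending dots through the standard per-device HTTP endpoint (assumed here to be the documented v1.6 API) rather than the bulk endpoint. The token and device label are placeholders, and this is not the plugin's actual code.

```python
# Hedged sketch: posting a small batch of variable values through the
# per-device HTTP endpoint (assumed v1.6 API). Token and device label are
# hypothetical placeholders.
import requests

TOKEN = "BBFF-your-token"        # hypothetical placeholder
DEVICE_LABEL = "my-device"       # hypothetical placeholder
URL = f"https://industrial.api.ubidots.com/api/v1.6/devices/{DEVICE_LABEL}"

def send_dots(values: dict) -> bool:
    """Post several variable values for one device in a single request."""
    resp = requests.post(
        URL,
        headers={"X-Auth-Token": TOKEN, "Content-Type": "application/json"},
        json=values,
        timeout=10,
    )
    return resp.ok

# Example: one request carrying several variables for the same device.
send_dots({"temperature": 25.3, "humidity": 60, "battery": 87})
```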

Report: "Events"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and a fix is being implemented.

Report: "HTTP, MQTT, TCP & UDP Ingestion"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our HTTP data ingestion service and are currently looking into the issue.

Report: "High Network Traffic"

Last update
resolved

This incident has been resolved.

monitoring

Due to an unusual spike in the network traffic of our servers, we experienced an outage in all services. At this moment, the platform is fully operational. We are in communication with our infrastructure provider to better understand why this happened. We'll continue to monitor the stability of the service.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our TCP data ingestion service and are currently looking into the issue.

Report: "HTTP, MQTT, TCP & UDP Ingestion"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our HTTP data ingestion service and are currently looking into the issue.

Report: "UbiFunction outage"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We've identified an outage with our UbiFunctions module. While our DevOps team is investigating, we believe this is an issue coming from our provider, IBM Functions. We have already engaged in conversations with them to find a quick solution. We will update with new information as it becomes available.

Report: "MQTT"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our MQTT publish service and are currently looking into the issue.

Report: "HTTP"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our HTTP data ingestion service and are currently looking into the issue.

Report: "HTTP"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our HTTP data ingestion service and are currently looking into the issue.

Report: "HTTP"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our HTTP data ingestion service and are currently looking into the issue.

Report: "Platform latency"

Last update
resolved

This incident has been resolved.

monitoring

Our DevOps team deployed a mirror instance of the load balancer to manage the high volume of traffic. The platform is operational, but we remain attentive to any signs of degradation.

identified

Our DevOps team has found the root cause of the latency: the server holding the load balancer is experiencing high CPU usage, which causes requests to our internal API to be slow. Further information will be provided here once it becomes available.

investigating

We are experiencing platform latency. Our DevOps team is investigating the reason behind it.

Report: "HTTP"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our HTTP data ingestion service and are currently looking into the issue.

Report: "Database Performance Latency"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

We are currently encountering latency issues when accessing the database. This latency translates into a delay in loading the platform and querying the API for data retrieval. Our data ingestion service is functioning as expected, and no data loss has occurred during this incident.

Report: "HTTP, MQTT, TCP, UDP Ingestion and Core platform"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our MQTT publish service and are currently looking into the issue.

Report: "HTTP, MQTT, TCP & UDP Ingestion"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our HTTP data ingestion service and are currently looking into the issue.

Report: "Login Apps & API"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

Our frontend application is experiencing downtime. Our data ingestion is not affected.

Report: "Database latency"

Last update
resolved

The process of rolling back the changes applied yesterday has been completed. The latency is no longer being experienced.

monitoring

After reverting the deployment at 18/07/2023 18:00:00 UTC-5, we've seen the platform come back to normal operation. We will continue to monitor it.

identified

Our DevOps team found that the reason behind this latency was a process running on the servers, in which the team was deploying an optimization of access to the database storing time-series data. The process began at 18/07/2023 15:00:00 UTC-5 and the performance impact was identified at around 18/07/2023 17:00:00 UTC-5. The process has since been reverted.

investigating

We're currently experiencing latency accessing the DB. This is reflected in slowness while loading the platform or querying the API to read values. The data ingestion service is operational and no data is being lost.

Report: "Login Apps & API"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to work on a fix for this issue.

identified

We are continuing to work on a fix for this issue.

identified

Our frontend application is experiencing downtime because our cloud provider had to perform an urgent server migration. Our data ingestion is not affected.

Report: "HTTP, MQTT, TCP & UDP Ingestion"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our HTTP data ingestion service and are currently looking into the issue.

Report: "Functions America"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are experiencing UbiFunction latency issues with IBM, our serverless cloud function provider. All our UbiFunctions return 501 errors; we are working with the IBM support team to manage this issue. We are actively monitoring the service with our provider in order to solve the issue as soon as possible.

investigating

We are currently investigating this issue.

Report: "EVENTS"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Events engine status

Report: "Some analytics reports not sent on time"

Last update
resolved

Between May 1st 11:00 UTC and May 2nd 10:40 UTC, our reporting servers experienced downtime due to an internal SSL certificate expiring, causing reports with values tables or widget snapshots not to be sent. On May 2, our team fixed the issue and scheduled the delivery of unsent reports.
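A minimal sketch of the kind of proactive check that would catch an internal SSL certificate before it expires, as in this incident. The hostname and alert threshold are hypothetical placeholders.

```python
# Minimal sketch of a certificate-expiry check; host and threshold are
# hypothetical placeholders, not Ubidots' internal monitoring.
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(host: str, port: int = 443) -> int:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like 'Jun  1 12:00:00 2025 GMT'
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (expires.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for host in ["reports.internal.example.com"]:   # hypothetical host
        remaining = days_until_expiry(host)
        if remaining < 14:
            print(f"WARNING: certificate for {host} expires in {remaining} days")
```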

Report: "EVENTS"

Last update
resolved

Closing past events issue

monitoring

A fix has been implemented and we are monitoring the results.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Events engine status

Report: "Frontend & API"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Our web portal is currently down. Our team has identified the root cause and is working hard to fix it. Data ingestion from your devices is not affected; only the web portal.

Report: "Frontend"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Our web portal is currently down. Our team has identified the root cause and is working hard to fix it. Data ingestion from your devices is not affected; only the web portal.

Report: "Frontend & API"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "Login Apps & API"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

Our frontend application is experiencing downtime because our cloud provider had to perform an urgent server migration. Our data ingestion is not affected. We continue to be in touch with our cloud provider to resume the service as soon as possible.

Report: "HTTP, MQTT, TCP & UDP Ingestion"

Last update
resolved

This incident has been resolved.

investigating

We have identified an error with our MQTT publish service and are currently looking into the issue.

Report: "HTTP & TCP"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "Synthetic Variables"

Last update
resolved

This incident has been resolved.

identified

Our synthetic engine is experiencing latency, which may result in longer compute times for synthetic variables.

Report: "HTTP"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have identified an error with our HTTP data ingestion service and are currently looking into the issue.

Report: "Frontend"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

Our web portal is currently down. Our team has identified the root cause and is working hard to fix it. Data ingestion from your devices is not affected; only the web portal.

Report: "EVENTS"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Events engine status

Report: "Events and Dashboards Degradation"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Frontend"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "Load balancer outage"

Last update
resolved

Access to applications: All end-user applications have had their respective SSL certificates applied. Access to applications is now fully operational.

Scheduled Reports: All queued reports have already been sent.

monitoring

Access to applications: We have generated almost all SSL certificates to ensure access to end-user applications is labeled as a secure connection. This will be ready today.

CDN static files: Some users will experience the loss of some white-label assets (see the previous update). Unfortunately, we cannot recover them, so the fastest way to get them back into the platform is simply to upload them again.

Scheduled Reports: There are still some user reports held in the queue. This process is DB intensive because of the amount of data that needs to be extracted. We will continue to update as the queue clears.

identified

Access to industrial.ubidots.com has been completely restored. Some end-user applications are still missing the SSL certificates needed to ensure a secure connection, so our DevOps team is focused on making sure these are applied to all applications as soon as possible. The hard drive failure corrupted static files hosted in the CDN, including assets found in the applications' white-label configuration (Loader logo, Favicon, Header logo, Login screen logo, Background image), as well as, for example, the Profile's avatar image. Some users might have to upload these files again. We have also identified that the Scheduled Reports engine was compromised by this hard drive failure. This means that, as of this moment, there's a queue of reports to be generated and sent. Some of them might come up with errors.

identified

Access to the main admin accounts at industrial.ubidots.com has been almost completely recovered, as we have restored the CDN from which static files are loaded. Some users might still experience degraded performance. A backup of the end-user applications' static files has been mounted, and we are generating the SSL certificates for the applications. As they're generated, they are being applied, which means that some applications will start working while others will take more time. Our DevOps team estimates one more hour for all certificates to be generated.

identified

We are in direct contact with our infrastructure provider to find a solution to the disk failure. At this moment, access to the main accounts at industrial.ubidots.com is intermittent, while the end-user applications are still not accessible. We're mounting a backup of the applications' static files to speed up the process of getting the UI back into operation while we await a more definitive solution to the disk failure from our provider.

identified

We have identified the error to be coming from a hard drive failure in the load balancer and CDN server. We have moved all data traffic to a new load balancer to stop it from being lost; however, the UI is still not loading, and we're working on mounting a backup of the CDN. As of this moment, we know there will be a gap in data from 23:00 to 23:30 UTC.

investigating

We are continuing to investigate this issue.

investigating

We have experienced an error in one of our servers, specifically the one containing our load balancer for data traffic and the CDN from which static files are served. This is preventing data from being ingested through any of the supported protocols (HTTP, MQTT, TCP/UDP). It is also preventing the UI from loading, both at industrial.ubidots.com and in the end-user applications. We're investigating the root cause.

Report: "Third party email provider with SMTP outbound delay"

Last update
resolved

This incident has been resolved.

identified

Our email provider, Postmark, is experiencing API and SMTP outbound delays. This affects Ubidots's ability to send email as part of our service. Such emails include alerts from the events engine, user invitations, scheduled reports and data exports. We will continue to monitor their service and provide updates here accordingly.