Historical record of incidents for Telerivet
Report: "Emails not being sent"
Last update: Outgoing emails were delayed between 01:08 and 01:55 UTC due to a DNS change made by Telerivet's transactional email service provider (SendGrid), which was incompatible with a configuration setting on Telerivet's servers. The issue has been resolved by updating that configuration setting.
Telerivet has implemented a workaround and emails are being sent now.
Telerivet's servers are currently unable to send email due to an issue with our transactional email service provider.
Report: "Delays in updating search indexes"
Last update: Telerivet has deployed a fix to improve the performance of indexing new messages for search. New messages are now being indexed without delay.
Telerivet has implemented an update to prioritize indexing of contacts. New and updated contacts are now appearing in search without a significant delay.
Telerivet is currently experiencing delays updating search indexes for contacts and messages. After adding or updating contacts, the updated contacts may not appear in the Telerivet web app immediately. Searching for messages may not return new results.
Report: "Database connectivity issue"
Last update: Telerivet has not detected any further issues with database connectivity or data integrity, and has verified that no scheduled messages or other data were lost due to this issue.
The issue was caused by a corrupted database table storing scheduled messages and other scheduled events, which caused the database server to crash when triggering scheduled events. The corrupted database table has been repaired, and scheduled messages are being sent now. All Telerivet functionality should be working normally.
We are continuing to work on a fix for this issue.
The issue has been identified as a particular query causing Telerivet's database server to crash. Telerivet has disabled scheduled message functionality to avoid the crash and is continuing to investigate.
We are continuing to investigate this issue.
Telerivet is currently investigating an issue with database connectivity.
Report: "Database connectivity issue"
Last update: Telerivet has not encountered any issues with database connectivity since reverting the software update yesterday. All systems are operating normally.
The issue with intermittent database connectivity appears to have been triggered by a software update. Telerivet has reverted this update and is continuing to investigate the issue.
Telerivet is currently experiencing intermittent issues with database connectivity which could impact message delivery and service processing. We are currently investigating this issue.
Report: "Nexmo API outage"
Last update: Nexmo has reported that all services have recovered. For more information, see https://www.nexmostatus.com/incidents/6p4nlld7n4br .
Messages sent via Nexmo routes are currently encountering errors due to an ongoing issue with the Nexmo (Vonage) service. For more information, see https://www.nexmostatus.com/incidents/6p4nlld7n4br . Non-Nexmo routes are operating normally.
Report: "Nexmo API outage"
Last update: Nexmo has reported that all services have recovered. For more information, see https://www.nexmostatus.com/incidents/ykf63np3bq89 .
Messages sent via Nexmo routes are currently encountering errors due to an ongoing issue with the Nexmo service. For more information, see https://www.nexmostatus.com/incidents/ykf63np3bq89 . Non-Nexmo routes are operating normally.
Report: "Nexmo messages not delivered"
Last update: On October 17, one of Nexmo's US partners identified un-permitted content in outgoing SMS and advised Nexmo to stop that traffic immediately. As a result, Nexmo temporarily blocked all messages sent from that route, which included all messages sent by users with virtual numbers in the US and Canada under Telerivet's Nexmo account. Nexmo has not indicated whether any Telerivet user was responsible for sending un-permitted content. Nexmo has reported that they are planning to improve their internal processes to prevent a situation like this from happening again. Telerivet has already been updated to require manual approval of each user who requests to send messages via Telerivet's Nexmo account, to reduce the risk of Telerivet users sending un-permitted content. Telerivet has already credited affected users with the cost of messages that were not delivered due to this issue, as well as 25% of their monthly service plan fee.
SMS messages sent via Nexmo are being delivered again now. The issue was limited to clients using Telerivet's Nexmo account to send SMS messages in the United States and Canada. Clients using their own Nexmo account were unaffected. Nexmo disabled Telerivet's SMS routes in the US and Canada at 7:25 AM PDT, without notifying Telerivet. Nexmo restored these routes at 2:43 PM PDT. Telerivet is continuing to communicate with Nexmo to determine the reason why Nexmo disabled Telerivet's SMS routes in the US and Canada, and to prevent this issue from reoccurring.
Users with routes provided by Nexmo have been reporting that messages are not being delivered and are failing with the error "Unroutable". Telerivet is currently working with Nexmo to resolve this issue.
Report: "Delays in Nexmo message delivery"
Last update: This issue has been resolved by Nexmo, and messages queued during the incident have been delivered. For more information, see https://www.nexmostatus.com/incidents/jzgfpqrh4dqs .
SMS messages sent via Nexmo in the US and Canada may not be received at this time due to an issue with Nexmo. For more information, see https://www.nexmostatus.com/incidents/jzgfpqrh4dqs .
Report: "Nexmo outage"
Last update: The Nexmo API outage was resolved about 5 hours ago. For more information, see https://www.nexmostatus.com/incidents/qdl3bhsq1cbz .
Users sending SMS via Nexmo have been receiving an "HTTP 504" (Gateway Timeout) error due to an outage with Nexmo. For more information, see https://www.nexmostatus.com/ .
Report: "Network and Database Outage"
Last update: This incident report describes 3 separate service interruptions as well as the followup actions Telerivet has taken to further improve the reliability of our systems.

__Server Hardware Failure – Saturday 31 January 2015, 05:10 to 05:16 UTC__

The first partial service interruption lasted approximately six minutes from 05:10 to 05:16 UTC, causing errors whenever users attempted to send a message. This first interruption was caused by an unexpected hardware failure on one of the servers in our message queue cluster, preventing this server from responding to any network requests starting at 05:10 UTC.

Telerivet’s automated failover systems actively monitor the availability of the message queue service, as well as several other internal services. These services are designed with redundancy so that Telerivet can quickly and automatically recover from a server failure by failing over to a standby server. The interruption in connectivity was detected at 05:12 UTC and the system attempted to fail over to another message queue server.

The automated failover process for the message queue involved making an API request to Amazon Route 53, which provides DNS records for telerivet.com. Unfortunately, the Route 53 API returned a “service unavailable” error, preventing the failover process from completing at 05:12 UTC. After a short wait, the automated failover system tried again at 05:15 UTC. This time, the Amazon Route 53 API worked properly. After the automated failover process was complete, Telerivet returned to normal operation.

__Network Hardware Failure – Saturday 31 January 2015, 07:03 to 08:38 UTC__

The second service interruption lasted approximately 95 minutes from 07:03 to 08:38 UTC, causing nearly every user of the web app or API to receive the error message “Couldn't connect to the database”. This second interruption was triggered by network issues in Telerivet’s primary datacenter, caused by a malfunctioning network switch, which failed in an unusual way that prevented the datacenter from automatically failing over to a backup network switch. (For more details, [read the data center’s full incident report](http://status.linode.com/incidents/1c6qfjnl97fd ).)

The logs from our monitoring tools showed that the network switch was fixed around 08:27 UTC. After the network hardware was fixed, Telerivet’s API servers and web servers were still unable to connect to the database for an additional 11 minutes. This occurred due to a quirk in the MariaDB database, which (as we later learned) by default will block a host from making future connections after 100 consecutive aborted connections from that host. As a result of the network interruptions starting at 07:03 UTC, Telerivet’s active web and API servers quickly reached the limit of 100 aborted connections, so they became blocked from connecting to our primary MariaDB server.

Although other web and API servers were available on standby (which had not been blocked by MariaDB), and other MariaDB hosts were also available on standby (which had not blocked any web or API servers), Telerivet’s automated failover systems did not make the standby hosts active, because all of the active servers appeared to be working correctly when considered in isolation. Consequently, Telerivet’s system administrators needed to manually diagnose and resolve the problem. Shortly thereafter, one of Telerivet’s sysadmins manually restarted the MariaDB service at 08:38 UTC. This reset the list of blocked hosts and restored Telerivet to normal operation.
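For illustration only, the sketch below shows how MariaDB's `max_connect_errors` limit can be raised and the blocked-host list cleared at runtime. This is a minimal example assuming the `pymysql` client library and placeholder credentials, not Telerivet's actual tooling.

```python
# Minimal sketch, not Telerivet's actual tooling: raise MariaDB's
# max_connect_errors limit and clear the blocked-host list at runtime.
# Must be run from a host that is NOT blocked (e.g. locally on the DB server).
import pymysql  # assumed client library

conn = pymysql.connect(host="127.0.0.1", user="admin", password="***")  # placeholders
with conn.cursor() as cur:
    # The default value of 100 is what caused the web/API hosts to be blocked
    # after the repeated aborted connections during the network outage.
    cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connect_errors'")
    print(cur.fetchone())

    # Raise the limit for the running server (also persist it in my.cnf so the
    # change survives a restart).
    cur.execute("SET GLOBAL max_connect_errors = 1000000")

    # Clear the blocked-host cache immediately -- the same effect the manual
    # MariaDB restart had at 08:38 UTC.
    cur.execute("FLUSH HOSTS")
conn.close()
```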
__Intermittent Network Interruptions – various times from January 28 - February 7 and February 22__

Starting on January 28, Telerivet also experienced a handful of intermittent network interruptions typically lasting less than 30 seconds and occurring once or twice per day, which also resulted in the error “Couldn't connect to the database”. These network interruptions are likely unrelated to the server and network hardware failures on 31 January. Although these intermittent network interruptions actually started a few days earlier, we first detected that they were a recurring problem on 31 January while investigating the other service interruptions.

After investigation, it was determined that packet loss appeared to occur only between certain pairs of servers. Due to the unpredictable and infrequent nature of the network interruptions, our process for diagnosing this problem basically involved experimenting with a particular change to our server infrastructure and waiting a couple of days to see whether or not it fixed the problem. These network interruptions were mostly resolved by February 5, although there were a small number of additional network interruptions on February 7 and 22 as we continued experimenting with changes to our server infrastructure during weekends when Telerivet usage is somewhat lower.

The root cause of this packet loss has not yet been identified. However, migrating a small subset of our servers to new hardware appears to have caused these network interruptions to stop. It is possible that the network interruptions could have been caused by a bug in the hardware or virtualization software used by the affected servers.

__Followup Actions Taken__

We know that our customers rely on Telerivet to be available all the time. For nearly 3 years, Telerivet has earned an excellent record and reputation for reliability, in large part because of our significant work to build systems and processes for redundancy, monitoring, alerting, and automatic failover.

Generally, occasional hardware problems like these are expected and would not typically result in significant downtime, except that in this case they unfortunately coincided with additional unrelated problems such as the outage of the Amazon Route 53 API, and the MariaDB behavior which inadvertently blocked misbehaving hosts. It was also highly unusual that 3 unrelated hardware problems occurred at nearly the same time. Generally Telerivet's hardware has been highly reliable, and we would normally expect to see 3 problems over the course of one year, instead of 3 in one day.

However, these service interruptions highlighted several issues that we have worked to address over the past few weeks:

1. We updated several configuration values for MariaDB to fix poor default values, such as the `max_connect_errors` setting that caused MariaDB to block our own servers after 100 aborted connections. ([Learn more](http://jeremy.zawodny.com/blog/archives/011421.html))

2. Our automated failover system now checks whether each API and web server can connect to the database and message queue. This allows Telerivet to recover automatically from most failures caused by connectivity issues between certain pairs of hosts, even when each host is working correctly in isolation. (A brief sketch of this kind of check follows at the end of this update.)

3. Our automated failover system now retries failed requests to the Amazon Route 53 API after a shorter wait time, and in some cases will be able to proceed even if the Amazon Route 53 API is unavailable.

4. We added additional metrics and alerts to our internal server monitoring tools, including the average response time and error rate from our web and API servers.

5. We updated the networking settings on our servers to use a second IP address for internal communication with our other servers. The new configuration allows our servers to communicate with each other on the same Ethernet network without passing packets through an intermediate router, reducing the chance of network interruptions affecting communication between our servers.

6. We have improved our status page to make it easier for Telerivet to communicate with customers about service interruptions. The new status page makes it easy for customers to see the current and historical status of the Telerivet service, and subscribe to be notified of any issues. The status page currently contains 6 real-time public metrics to make Telerivet’s performance more transparent -- uptime, average response time, and error rate, for both the Telerivet API and web app.

To customers who were impacted, we sincerely apologize for the effect these service interruptions had on your business or organization. For customers who have reported being impacted by these outages, we have added a credit to your account equal to approximately 10% of your monthly service plan price.

We hope you enjoy our new status page – if you want to receive notifications of any future outages, go to http://status.telerivet.com and click the "Subscribe to Updates" button at the top of the page.
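Returning to item 2 above, the following is a minimal sketch of a per-host connectivity check of the kind described there. The hostnames, ports, and structure are assumptions for illustration, not Telerivet's actual failover code.

```python
# Minimal sketch (assumed hostnames/ports, not Telerivet's actual failover code):
# from each web/API server, verify that the database and message queue are both
# reachable before considering the host healthy.
import socket

DEPENDENCIES = {
    "database": ("db.internal", 3306),       # placeholder MariaDB host
    "message_queue": ("mq.internal", 5672),  # placeholder message queue host
}

def host_is_healthy(timeout: float = 2.0) -> bool:
    """Return True only if this host can open a TCP connection to every dependency."""
    for name, (host, port) in DEPENDENCIES.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            print(f"cannot reach {name} at {host}:{port}")
            return False
    return True

if __name__ == "__main__":
    # A failover controller could poll this result from each host and promote a
    # standby server when an active host fails the check, even if that host
    # looks fine in isolation.
    print("healthy" if host_is_healthy() else "unhealthy")
```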
The network issue was resolved and database connectivity has been restored.
A network issue in Telerivet's primary datacenter caused the database servers to become unreachable.
Telerivet automatically failed over to a standby message queue server, restoring normal operation.
Telerivet's primary message queue server stopped responding due to a hardware issue.
Report: "Internal network connectivity issue"
Last update: This incident has been resolved.
One of Telerivet's API servers experienced errors connecting to other Telerivet services at 00:22 UTC, for about 1 minute, causing some API requests to fail. Telerivet's services are currently operational. We are continuing to monitor for additional issues.
Report: "Message queue issue"
Last update: This incident has been resolved.
At 19:40 UTC, one of Telerivet's message queue servers failed, causing errors when sending messages via the Telerivet web app and REST API. Telerivet's automated failover systems resolved the issue within 45 seconds, and all systems are currently operational.
Report: "Networking issues"
Last update: The networking problem with the REST API load balancer has been resolved by Google Compute Engine. No further networking issues have been observed or reported for the past 3 hours.
A load balancer serving the Telerivet REST API stopped routing external traffic to Telerivet's servers at 02:29 UTC, causing all REST API requests to fail. Telerivet updated the DNS entries at 02:37 UTC to bypass this load balancer, and the REST API should now be working again. Telerivet is continuing to monitor networking issues within our primary datacenter.
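For illustration, a DNS change like the one described above might look roughly like the following, assuming Amazon Route 53 (the DNS provider mentioned in the earlier "Network and Database Outage" report) is still in use and using the `boto3` library; the hosted zone ID, TTL, and IP address are placeholders.

```python
# Rough sketch of the kind of DNS update described above: repointing the REST
# API hostname away from the failed load balancer to reachable servers.
# Assumes Route 53 via boto3; all identifiers below are placeholders.
import boto3

route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId="Z0000000000000",  # placeholder hosted zone ID
    ChangeBatch={
        "Comment": "Bypass failed REST API load balancer",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.telerivet.com.",
                    "Type": "A",
                    "TTL": 60,  # short TTL so the change takes effect quickly
                    "ResourceRecords": [{"Value": "203.0.113.10"}],  # placeholder IP
                },
            }
        ],
    },
)
```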
Report: "Amazon S3 Outage"
Last update: The Amazon S3 outage has ended, so MMS and recorded audio are now working in Telerivet. We are planning to integrate with an additional storage provider to mitigate the impact of future S3 outages on Telerivet's service.
Due to an outage in Amazon S3, Telerivet is currently unable to receive or display MMS (multimedia messages) or play recorded audio in voice calls. Telerivet's SMS functionality is not affected by the S3 outage, and voice calls can play text-to-speech or an audio file from a URL outside of Amazon S3.
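As a rough illustration of the mitigation mentioned in the resolution above (an additional storage provider), a media fetch with a fallback might look like the sketch below. The bucket name, object key, and mirror URL are placeholders, and the secondary provider is hypothetical.

```python
# Rough sketch of a storage fallback: try Amazon S3 first, then a hypothetical
# mirror at a second provider if S3 is unavailable. All names are placeholders.
import boto3
import botocore.exceptions
import requests

def fetch_media(key: str) -> bytes:
    try:
        s3 = boto3.client("s3")
        return s3.get_object(Bucket="example-media-bucket", Key=key)["Body"].read()
    except (botocore.exceptions.BotoCoreError, botocore.exceptions.ClientError):
        # Secondary copy kept with another storage provider (placeholder URL).
        resp = requests.get(f"https://media-mirror.example.com/{key}", timeout=10)
        resp.raise_for_status()
        return resp.content
```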
Report: "Intermittent errors with message queue"
Last update: It seems likely that the intermittent errors were caused by a new setting we were testing for certain message queues, including some queues that were processing a high number of messages when the errors occurred. We have disabled this setting for these queues and will continue monitoring to see if the errors recur.
The errors with the message queue appear to have been resolved; however, we are still investigating the root cause.
We are currently investigating intermittent errors with the message queue.
Report: "Issue with message queue"
Last update: This incident has been resolved.
One of Telerivet's message queue servers experienced slow response times and elevated error rates from approximately 2:12 AM PDT until we restarted it at 2:28 AM PDT. We will continue to monitor the message queue and investigate the root cause of the incident.
Report: "Errors sending messages to Nexmo"
Last update: The outage of Nexmo on Friday was related to the large distributed denial of service attack against Dyn (https://en.wikipedia.org/wiki/October_2016_Dyn_cyberattack). No additional DNS outages have been observed since Friday.
We have temporarily hardcoded the IP addresses of Nexmo's servers on Telerivet's servers, so Nexmo should be working again.
Due to a DNS outage in Nexmo's infrastructure, Telerivet is currently encountering errors when sending messages to Nexmo. We are working to resolve this issue.
Report: "Issue with message queue"
Last update: At approximately 00:10 UTC, one of our servers running RabbitMQ (the software that Telerivet uses internally to queue messages and other tasks) began experiencing very high CPU usage, very slow response times, and intermittent errors when queueing or dequeueing messages. During this time, messages were still able to be queued (with intermittent errors), and only 2% of API requests failed; however, the slow response times and intermittent errors from RabbitMQ caused the worker processes dequeuing messages to gradually fall further and further behind.

Switching to a standby server in our RabbitMQ cluster did not resolve the issue. Eventually, we restarted the RabbitMQ process, at which time the CPU usage returned to normal, the intermittent errors stopped, and the worker processes quickly caught up.

At this time, Telerivet has not yet identified a particular bug or configuration issue with RabbitMQ that caused this issue. In the next few days, we will be upgrading RabbitMQ to the latest release, as well as performing additional testing to try to reproduce the behavior in RabbitMQ outside of Telerivet's production environment.
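As a small illustration of how this kind of backlog can be watched for, the sketch below polls the RabbitMQ management HTTP API for queues whose ready-message count is growing. The host, credentials, and threshold are placeholders, it assumes the management plugin is enabled, and it is not Telerivet's internal monitoring.

```python
# Minimal monitoring sketch (placeholder host/credentials, not Telerivet's
# internal tooling): poll the RabbitMQ management HTTP API and flag queues
# where messages are piling up faster than workers dequeue them.
import requests

RABBITMQ_API = "http://mq.internal:15672/api/queues"  # management plugin endpoint
AUTH = ("monitor", "***")   # placeholder credentials
BACKLOG_THRESHOLD = 10_000  # placeholder alert threshold

for queue in requests.get(RABBITMQ_API, auth=AUTH, timeout=5).json():
    ready = queue.get("messages_ready", 0)
    if ready > BACKLOG_THRESHOLD:
        print(f"ALERT: queue {queue['name']} has {ready} messages waiting")
```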
The message queue returned to normal at approximately 00:59, and messages queued during the delay have been sent. We are continuing to investigate the root cause of the delays and intermittent errors with the message queue to prevent the issue from happening again.
Telerivet is currently observing long response times and intermittent errors with the message queue, and we are currently working to resolve the issue.
Report: "Datacenter connectivity issue"
Last update: This incident has been resolved.
At 02:09 UTC, Google Compute Engine began experiencing severe network connectivity issues worldwide ( https://status.cloud.google.com/incident/compute/16007 ). In preparation for this type of outage, Telerivet has a standby datacenter from another hosting provider (Amazon Web Services), which we activated at 02:19 UTC. Most Telerivet services were operational at that time, except for search and some SMPP shortcode connections. SMPP connections were reactivated over the next half hour. Google resolved the issue with Google Compute Engine at 02:27 UTC, and subsequently we reactivated our primary datacenter in Google Compute Engine. Search indexes for each project have gradually been coming back online. Search is now available for most active projects, and should become available in all projects over the next several hours.
Telerivet's primary datacenter in Google Compute Engine is currently experiencing network connectivity issues, so we have activated Telerivet's secondary datacenter in Amazon Web Services. Most services should now be functional, although search may be unavailable.
Report: "Datacenter connectivity issue"
Last update: This incident has been resolved.
Search and dynamic groups are now available, and all Telerivet services are operational.
Due to DDoS attacks in the secondary data center, Telerivet has returned to the primary data center, and most services are operational. Search and dynamic groups are temporarily unavailable while we rebuild the index. All SMPP connections should be working now.
Linode has published an update about the DDoS attacks at http://status.linode.com/incidents/mmdbljlglnfd
All services are currently operational in Telerivet's secondary data center. Search, dynamic groups, Android real-time connections, and SMPP connections are now working, with the exception of a small number of SMPP connections where the mobile network or aggregator requires a particular IP address in our primary data center. As the primary data center remains under DDoS attack at this time, Telerivet will remain in the secondary data center for now.
Telerivet's primary data center is again experiencing DDoS attacks ( http://status.linode.com/incidents/5ryq5w4l2mfj ), so we have relocated Telerivet to another data center. Some functionality (full-text search, dynamic groups, and SMPP connections) is currently unavailable in the new data center, and we are working to restore those services.
Report: "Datacenter connectivity issue"
Last update: This incident has been marked as resolved because no network interruptions have been observed in the past several hours. In the meantime, we're actively working on improving our systems to increase reliability and reduce downtime in case of further DDoS attacks targeted at Linode.
Network connectivity at Telerivet's primary data center has apparently returned to normal, so we have returned Telerivet to that data center, and all services are currently operational.
Telerivet's primary data center is currently experiencing DDoS attacks ( http://status.linode.com/incidents/rknrs83pgjxv ), so we have relocated Telerivet to another data center. Some functionality (full-text search, dynamic groups, and SMPP connections) is currently unavailable in the new data center, and we are working to restore those services.
Report: "Datacenter connectivity issue"
Last update: This incident has been resolved. Additional standby servers have been provisioned in a separate datacenter to reduce downtime in case Telerivet's primary data center has another extended outage in the future.
Linode appears to have mitigated the DDoS attack in the London data center, and Telerivet is currently online.
Telerivet's datacenter is working to mitigate a large DDoS attack. See http://status.linode.com/incidents/ksvn7mxm8mz2 for updates.
Telerivet's data center is currently experiencing network connectivity issues, causing Telerivet services to be inaccessible. More updates will be provided as information becomes available.
Report: "Network connectivity issue"
Last update: Between 20:15 UTC and 20:18 UTC, and between 20:36 UTC and 20:49 UTC, there were periods of intermittent loss of network connectivity to Telerivet's servers due to network maintenance in Telerivet's data center. At this time the network issues have been resolved.
We're currently investigating network connectivity issues with several of our servers.
Report: "Network connectivity issue"
Last update: This incident has been resolved.
Telerivet's servers appear to be reachable again via all networks. It appears that an internet routing issue temporarily prevented certain networks from reaching Telerivet's data center.
Some external networks are currently unable to connect to Telerivet's servers. We are currently investigating.
Report: "Web server connectivity issue"
Last update: Telerivet's data center reported a network attack on another server in the data center. They have mitigated the issue and network connectivity has returned to normal. In total, the interruption to the Telerivet web app lasted approximately 3 to 4 minutes.
One of the web servers serving the Telerivet web app and telerivet.com went offline at 9:33 PM PDT (~8 minutes ago). Telerivet is now using an alternate web server and should be fully functional. The Telerivet API and message delivery appear to be unaffected.
Report: "Web server connectivity issue"
Last update: The connectivity issue was identified as a hardware failure. The web server has been moved to new hardware to resolve the problem.
One of the web servers serving the Telerivet web app and telerivet.com went offline at 10:40 PM PDT (~10 minutes ago). Telerivet is now using an alternate web server and should be fully functional. The Telerivet API and message delivery appear to be unaffected.
Report: "Hardware issue on API server"
Last update: The hardware issue has been resolved.
A hardware issue on one of Telerivet's API servers caused api.telerivet.com to be unreachable for a couple of minutes. Telerivet's failover systems automatically redirected traffic to standby API servers, and the API is currently operational.
Report: "Network Connectivity Issues"
Last update: Telerivet's data center reports that the packet loss issue has been resolved. In total, the packet loss resulted in a partial outage of Telerivet's services between 18:39 UTC and 19:11 UTC, with intermittent errors observed until 19:29 UTC.
Telerivet's data center experienced network issues that resulted in intermittent connectivity to Telerivet's servers. Currently, network connectivity appears to be back to normal, and all Telerivet services are operational.
We are currently investigating network connectivity issues affecting multiple servers.
Report: "Database Interruption"
Last update: The hardware issue has been resolved. All Telerivet services continue to be fully operational.
Datacenter personnel have identified a hardware problem with one of Telerivet's database servers, and are working to resolve it. In the meantime, Telerivet is fully operational. The outage lasted about 40 seconds at 02:21 UTC before the automated failover system switched to the standby database.
Telerivet's primary database server is currently not responding. Telerivet has automatically failed over to a standby database server.