Historical record of incidents for Avochato
Report: "Issues with Amazon Web Services"
Last update: Periodic latency has been observed starting at 8am PST. This may lead to longer than expected load times in inboxes across different organizations. Our team has been investigating the root cause with Amazon Web Services and will report back with findings.
Report: "Issues with cloud provider"
Last update: The issue was resolved at 10:40am PT. Avochato continues to monitor and investigate the root cause of the latency.
Avochato is now monitoring improvements across the impacted inboxes with latency issues, which began at 9:30am and were resolved by 10:40am PT. Cloud provider resources are still under investigation, but all impacted customers should now see normal inbox load times.
We are currently seeing improvements to load speeds; however, the team is continuing to monitor and work with our cloud provider, as the affected servers may still be impacting certain organizations and inboxes.
Avochato is investigating an issue with AWS, our cloud provider, which may lead to latency when loading inboxes. Please stay tuned for updates as the team works to resolve this issue as soon as possible.
Report: "Issue importing contacts with custom fields."
Last update:
## What Happened
An underlying database table hit a primary key limit because it had not been migrated to 8-byte keys. Today, we hit the maximum primary key value for that table. This resulted in inboxes using the Custom Fields feature seeing intermittent failures across features that tried to update the value of custom fields on their contacts. This included creating contacts via API, uploading contacts to broadcast audience lists, editing contacts in the app, and handling answers to survey questions if the surveys stored data on custom fields.
Inboxes that did not use the Custom Fields feature were not impacted. Sending messages with $custom_field embedded in the message was not impacted. Contact uploads that did not include custom field column names were not impacted even if the contacts had custom fields, and should have finished accordingly.
## Resolution
We have fully migrated the impacted table to use the correct primary key size, and existing uploads continued as usual while we reindexed the results. We were able to do this without much disruption, though load times for inboxes with many contacts with custom fields may have been slower than usual during the day. Contact uploads that failed were retried after our database migration completed, so if you had a broadcast audience upload that failed, please double-check your audience, as it should have all the contacts (and their custom fields) accounted for. Impacted surveys will have properly continued down their execution path.
## Next Steps
All our new tables for the past few years already use 8 bytes to represent primary keys where appropriate. However, we plan on migrating all existing integer primary keys to 8 bytes to avoid this issue in the future for any other tables.
Thanks for your patience and for trusting Avochato with your contact data,
Christopher
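For illustration only: the class of fix described above amounts to widening the table's primary key column from a 4-byte integer (which tops out at 2,147,483,647) to an 8-byte integer. The sketch below uses Alembic with a hypothetical table name; it is not Avochato's actual migration tooling or schema.

```python
"""Hypothetical Alembic migration widening a 4-byte integer primary key
to an 8-byte bigint. The table name and revision ids are placeholders."""
from alembic import op
import sqlalchemy as sa

revision = "000000000001"
down_revision = None

def upgrade():
    op.alter_column(
        "contact_custom_fields",      # hypothetical table name
        "id",
        existing_type=sa.Integer(),   # 4 bytes: overflows at ~2.1 billion rows
        type_=sa.BigInteger(),        # 8 bytes
        existing_nullable=False,
    )

def downgrade():
    op.alter_column(
        "contact_custom_fields",
        "id",
        existing_type=sa.BigInteger(),
        type_=sa.Integer(),
        existing_nullable=False,
    )
```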
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are working on an issue with importing, saving, and creating new contacts in inboxes that use custom fields, as well as creating new fields in your inbox. As a temporary workaround, removing custom field columns from CSV uploads and/or API requests will resolve the issue. App load times may be slower than usual while we resolve the underlying issue.
Report: "Carrier issues with calls and texts"
Last update: Avochato continues to monitor the carrier resolution for calls across inboxes.
Carrier call issues have been intermittent with resolutions rolling out across many numbers. Avochato is monitoring the resolution.
Avochato is investigating an issue with calls and texts that may be impacting your inbox.
Report: "Issues loading embedded analytics dashboards and related CSV downloads"
Last update: We have dug into the underlying connection issue with the Google team and patched the servers. Please contact our support team if you're still having issues accessing the analytics in your Avochato account.
Avochato has had several conversations today with Google Cloud, and the issue has been escalated with the highest urgency. Google is still investigating the issue on their end. We will continue to share updates as soon as they become available.
We've identified a connectivity issue with a third-party vendor we use for visualizing inbox data. This may impact viewing graphs in the Avochato dashboard and the ability to download CSVs of those tables. Legacy analytics views are not impacted; we will share updates on this status page as a resolution is investigated by our third-party vendor.
Report: "Viewing issues on wide mode inbox view and on wide screen monitors"
Last update: This issue is now resolved.
Avochato is monitoring a fix released for select views in the inbox.
Some Avochato desktop web users have observed screen-sizing issues on wide screens and in wide-mode inbox views, leading to message readability issues. Avochato is currently investigating the issue and recommends using Classic or Compact views if you are experiencing issues reading messages. Mobile web and app views are unaffected. Rest assured that all content sent to and received from end users via SMS is being delivered over SMS, as this is specifically a front-end web issue affecting certain screen widths.
Report: "Delays in loading inbox messages"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
Avochato is monitoring the results of resumed normal load times in the inbox.
We are continuing to work on a fix for this issue.
Around 10:05am PT Avochato noticed delays in the loading of messages. This is being investigated with a resolution expected shortly.
Report: "AT&T network outage"
Last update: This incident has been resolved.
AT&T’s network went down for many of its customers across the United States Thursday morning, leaving consumers unable to place calls, text or access the internet. Although Verizon and T-Mobile customers reported some network outages, too, they appeared far less widespread. T-Mobile and Verizon said their networks were unaffected by AT&T’s service outage and customers reporting outages may have been unable to reach customers who use AT&T. https://www.cnn.com/2024/02/22/tech/att-cell-service-outage/index.html
Report: "Delays retrieving data from cloud service provider"
Last update: On Friday 12/15, Avochato's cloud provider confirmed the issue has been identified on their end. A temporary solution was provided until the investigation is completed. Avochato will update this issue as additional information is received.
Delays have been observed in retrieving data from Avochato's cloud service provider. Engineering teams on both sides are actively investigating to identify the cause and a resolution.
Report: "Carrier A2P Deliverability Issues Detected"
Last update: This incident has been resolved.
Providers have identified that some Avochato phone numbers have continued to experience deliverability issues even with a fully verified A2P application. This is due to carrier issues with the phone number itself, not the A2P application or an Avochato-specific issue. This remains escalated directly with carriers, and Avochato will share updates as carriers make them available.
Avochato has identified known issues with US carriers regarding A2P SMS/MMS traffic. Some Avochato customers that are fully approved and established on A2P may be experiencing deliverability errors due to unexpected carrier issues. Avochato is working with carriers on a solution and will post updates here.
Report: "SMS/MMS Delivery Failures to Verizon and AT&T Networks in the US"
Last update: As of 1pm PT, providers were not experiencing MMS/SMS delivery delays when sending messages to Verizon and AT&T. Avochato has been monitoring results in the meantime and will share all known findings on the incident once that information is provided. This incident is now resolved.
Providers are observing successful MMS/SMS delivery to Verizon and AT&T networks in the US. Avochato will continue to monitor to ensure full service recovery and will provide additional updates.
SMS/MMS providers are actively working on a resolution to delivery failures on Verizon and AT&T Networks in the US for a subset of Long Codes. Experts are on a bridge with a downstream telecom partner to work toward resolution. Customers may be experiencing delivery failures with a 30007 error for any traffic toward Verizon and AT&T. Avochato will continue to provide updates as they are received from providers. More details can also be found on: https://status.twilio.com.
We have confirmed that MMS/SMS short code traffic is not affected by the ongoing incident. We are experiencing MMS/SMS delivery failures to Verizon and AT&T networks in the US over a subset of long codes. We are actively investigating the issue and will provide another update in 30 minutes or as soon as more information becomes available.
We are experiencing MMS/SMS delivery failures to Verizon and AT&T networks in the US over a subset of long-codes. Our engineers are working with our carrier partner to resolve the issue. We expect to provide another update in 30 minutes or as soon as more information becomes available.
Report: "Outbound message failures"
Last update:
## What Happened
A brief outage with Twilio's REST API caused outbound SMS and WhatsApp messages to fail to send. Some other actions, including syncing Use-Cases to inboxes, would also fail intermittently and need to be re-attempted. Avochato partners with Twilio to send and receive calls and text messages for some customers. For more information on this particular incident, please see [https://status.twilio.com/](https://status.twilio.com/)
## Impact
Some customers may have seen a specific 20503 error code when sending an outbound message via Avochato. _These failed outbound messages were not automatically retried._ Avochato inboxes continued to receive incoming messages during the impact period (approximately 5:30-6:30 PST on 8/18/23), and outbound messages continued to deliver as intended as soon as the issue was resolved.
## Next Steps
Our team will continue evaluating the resiliency of the inbox in case of temporary outages, as well as mechanisms to retry sending these types of failed messages. Meanwhile, customers leveraging other carrier partners, as well as Avochato live chat, were unaffected. Thank you to the Twilio team for communicating status updates for this issue. To our customers, we apologize for any inconvenience during the impact period, and thank you for using Avochato.
This incident has been resolved.
We are observing significantly reduced errors connecting to our carrier partner but are awaiting confirmation that the incident has been resolved.
Certain Avochato phone numbers are still experiencing issues sending outbound texts due to a network outage with our carrier partners.
We are investigating an issue causing outbound message delivery failures on the carrier network.
Report: "MMS delivery issues (US long code numbers)"
Last update: Carriers have resolved the issues with MMS as of 13:41 PDT.
US carriers have observed issues with MMS delivery towards US recipients using a US long code number. These issues may have impacted your Avochato number(s) on May 5. Our providers have released a patch to resolve this issue. We are in touch and expecting to receive updates on the status of the MMS issues.
Report: "Read-only Mode"
Last update: Avochato continues to investigate the incident.
Report: "Read-only Mode"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring results. Our backlog of messages is being processed -- messages that were scheduled or sent during the impact period will deliver over the next hour. Thanks for your patience.
The issue has been identified and a fix is being implemented.
We are experiencing a platform outage and Avochato is currently in read-only mode. We are working to resolve the issue as quickly as possible.
Report: "Carrier phone lookups experiencing intermittent errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We've noticed a significant spike in errors from the third-party service we use to validate and choose new phone numbers. We're working to resolve the impact on your Avochato account.
Report: "Intermittent 504 errors when loading Avochato"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're investigating elevated errors when loading pages within Avochato.
Report: "Delays displaying latest messages"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate the cause of a delay in displaying new messages in the inbox. Note that messages continue to be delivered and received. Thank you for your patience.
We're investigating an issue with up to 15 minute delays in displaying messages and conversations in the inbox.
Report: "Issues creating new tags"
Last update:
## What Happened
Due to continued growth of activity on the platform, a fast-growing and critical database table exceeded the maximum value a 32-bit integer primary key can hold, preventing additional tags from being created. The inability to create new rows in this table impacted a number of features, including adding contacts to broadcasts, adding tags to contacts, and more. Until the column type could be upgraded from Integer to BigInteger, those features were impacted, and other business logic within the Avochato application was subsequently impacted. Unfortunately, the upgrade process took significant time due to the size of the table itself, leading to a protracted incident window for the impacted feature set.
In order to resolve the issue, the engineering team placed the Avochato platform in maintenance mode from 4:15-5pm to attempt a fix, which directly impacted the ability to read conversations in the inbox during the maintenance window. A second process was applied overnight and into the following business day to fully migrate the database and resolve the issue without losing data, until the new primary key column using the BigInteger type was populated, reindexed, and could be swapped for the old column.
## Resolution
The team performed an in-place migration to change the column type from Integer to BigInteger. This took quite a bit of time to perform on a table with over 2 billion rows, but once the migration was complete, platform functionality immediately returned to normal. Meanwhile, the design of the Avochato platform allowed failed attempts at things like tagging, broadcasting, and contact uploading to be queued for retry while the platform was impacted. Those tasks were then completed successfully in the order they were received as soon as the incident was resolved.
Moving forward, we are auditing our legacy database tables to ensure that primary and foreign keys across our data warehouse will scale decades into the future as we continue to experience growth on the platform. We know how critical Avochato is for communicating with your teams, and appreciate your patience during this period.
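As an illustrative aside (not Avochato's actual code or schema), the large-table approach described above is commonly done in three steps: add a new 8-byte column, backfill it in batches so the table stays writable, then swap it in as the primary key during a short maintenance window. A rough sketch with SQLAlchemy and a hypothetical `taggings` table:

```python
"""Rough sketch of an online Integer -> BigInteger primary key migration.
The DSN, table name, and batch size are placeholders; a real migration would
also keep id_big in sync for newly inserted rows (e.g. via a trigger) and
reattach the id sequence, both omitted here for brevity."""
import sqlalchemy as sa

engine = sa.create_engine("postgresql:///avochato_example")  # placeholder DSN
BATCH = 100_000

with engine.begin() as conn:
    conn.execute(sa.text("ALTER TABLE taggings ADD COLUMN id_big BIGINT"))
    max_id = conn.execute(sa.text("SELECT COALESCE(MAX(id), 0) FROM taggings")).scalar()

# Backfill in batches so writes to the table are never blocked for long.
low = 0
while low < max_id:
    high = low + BATCH
    with engine.begin() as conn:
        conn.execute(
            sa.text("UPDATE taggings SET id_big = id WHERE id > :low AND id <= :high"),
            {"low": low, "high": high},
        )
    low = high

# Final swap, performed inside a brief maintenance window.
with engine.begin() as conn:
    conn.execute(sa.text("ALTER TABLE taggings DROP CONSTRAINT taggings_pkey"))
    conn.execute(sa.text("ALTER TABLE taggings DROP COLUMN id"))
    conn.execute(sa.text("ALTER TABLE taggings RENAME COLUMN id_big TO id"))
    conn.execute(sa.text("ALTER TABLE taggings ADD PRIMARY KEY (id)"))
```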
This incident has been resolved.
Broadcasting, tagging, and related Avochato functionality has been restored. We are continuing to monitor our systems post-maintenance. Over the next 24-hour period, our system will work through the backlog of business logic relating to contact uploads, tags, and broadcasts.
Our maintenance has successfully resolved the primary issue affecting tags and broadcasts. Our system is continuing to work through the backlog of tasks accumulated since the start of the incident. Attempts at uploading contacts to broadcasts and manually tagging contacts since the start of the incident will slowly be re-attempted. We apologize for the inconvenience and will continue updating as we perform a rolling restart of the system.
Inbox functionality has been restored while we proceed to the next phase of maintenance. Thank you for your patience. We are continuing to work on resolving issues with Broadcasts and Tags.
In order to complete the required maintenance, inboxes will temporarily stop serving conversations. We are trying to minimize this window as much as possible and apologize for the inconvenience.
We are continuing to work on resolving issues tagging contacts and broadcasting. We appreciate your patience.
Resolution is still in progress, but some app functionality, including broadcasting and tagging, is still impacted. We appreciate your patience as we work to resolve this issue.
In order to expedite resolving issues with Broadcasts and Tag creation, we will be performing emergency database maintenance. Inboxes may not be accessible at this time while we complete maintenance. We appreciate your patience.
We're currently experiencing issues tagging objects in our database and are working on a solution. Broadcasts and campaigns are disabled and will not function in the meantime until we find a resolution.
Report: "Login issues on v2.16 and v2.17 of Avochato for iOS"
Last update:
## What Happened
An update to how we handle state did not properly handle sessions generated in older versions of the app for previously logged-in users. This caused the login screen of the iOS app to get into a state where it could not log an older or invalid session out, and also could not proceed to the login screen, for specific users upgrading from older versions of Avochato.
## Resolution
We've patched the specific iOS component that handles state when upgrading versions, and it should now do a better job of cleaning up after older versions of Avochato whenever the app is upgraded or uninstalled and reinstalled. No further action is necessary at this time.
This issue is resolved in our most recent version of Avochato for iOS, 2.19, which is live in the App Store.
Version 2.18 of Avochato for iOS is now available. We are monitoring progress for users who were unable to log in using the latest version. Older versions of the iOS app, Avochato for Android, and our desktop and mobile web experiences have been unaffected during this period.
We are investigating an issue causing some users to get stuck at the loading screen on our most recent version of the Avochato for iOS app.
Report: "Infrastructure hosting provider performing maintenance"
Last update: This incident has been resolved.
The maintenance is done, but we will continue to monitor and make sure the infrastructure is stable.
Unfortunately our infrastructure hosting provider is doing emergency maintenance on our databases. Servers should be back to normal shortly.
Report: "Avochato Dashboard Slowness"
Last update: This incident has been resolved.
We have received communication from our cloud services provider of a partial outage within their data center. Recovery is in progress. Access to the Avochato mobile and desktop app may be intermittent depending on your location.
We're seeing latency and slow requests in our environment and investigating.
Report: "Conversation list slowdown"
Last update:
## What Happened
The morning after a data migration to improve inbox search performance and add new search functionality, a series of complex queries caused our search service to enter a bad state and become throttled during the impact period. This appeared to be due to our Elasticsearch shards exceeding a critical size limit. We exceeded the size threshold because of additional columns that were added to support new indices, multiplied by the size of the production dataset. As a result, search results (including the default inbox experience, contacts list, etc.) timed out until we could completely reboot the search infrastructure.
### Impact
Customer data (including conversations, messages, tags, etc.) was not lost, and messages continued to deliver as intended during the period. Most functionality was available, including account and user management, though the experience of searching through the inbox was severely degraded. Customers who were online trying to use the app or API during the impact period would not have easily been able to look up conversations or contacts. It was difficult to create and populate broadcasts during the period. Users were still able to navigate to conversations directly from links and from notifications.
## Resolution
Engineers used our metrics to identify that the issue was specifically related to our search infrastructure. The team scaled up additional instances of our Elasticsearch infrastructure to try to solve the problem. The team then rebuilt the Elasticsearch cluster, as they were not able to "reboot" it, and began routing traffic to the new cluster. This resolved the issue for customers on a rolling basis, as some connections hit the new instance while other data routed to bad shards. The rebuild process took about 1 hour and 15 minutes to finish syncing all data and building the new Elasticsearch indices, so unfortunately some customers were impacted during this entire period. We have since moved from using 8 Elasticsearch shards to 16 and reindexed the dataset, which cut the size of each shard in half.
## Additional Notes
In a separate issue that occurred around the same time period, the [www.avochato.com](http://www.avochato.com) domain was flagged automatically by Avast Antivirus' anti-phishing browser extension. We believe this was a false positive, but it caused Avast users who had the extension installed to be unable to view Avochato. Users who whitelisted Avochato in their extension were able to continue to log in, and the team worked quickly to submit an appeal. Avast has since removed us from their phishing blacklist. Avochato poses no known phishing threat to its users, but we encourage users who suspect phishing attack vectors to submit their reports to [www.avochato.com/bugbounty](http://www.avochato.com/bugbounty)
Thanks again for your patience while we resolved this issue, and for being an Avochato customer,
Christopher, CTO & CISO
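As a hedged illustration of the remedy described above (assuming the elasticsearch-py 7.x client; the index names, alias, and endpoint are placeholders, not Avochato's actual configuration), rebuilding an index with more primary shards and reindexing into it looks roughly like this:

```python
"""Sketch: create a new index with 16 primary shards (halving per-shard size
relative to 8), reindex asynchronously, then atomically switch the read alias."""
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

OLD_INDEX = "conversations_v1"   # hypothetical index names
NEW_INDEX = "conversations_v2"

es.indices.create(
    index=NEW_INDEX,
    body={"settings": {"number_of_shards": 16, "number_of_replicas": 1}},
)

# Copy documents in the background; the full sync can take over an hour.
task = es.reindex(
    body={"source": {"index": OLD_INDEX}, "dest": {"index": NEW_INDEX}},
    wait_for_completion=False,
)
print("reindex task id:", task["task"])

# Once the task finishes, repoint the alias so readers cut over atomically.
es.indices.update_aliases(
    body={
        "actions": [
            {"remove": {"index": OLD_INDEX, "alias": "conversations"}},
            {"add": {"index": NEW_INDEX, "alias": "conversations"}},
        ]
    }
)
```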
This incident has been resolved, but we will continue to monitor during the rest of the weekend.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate long delays when loading lists of conversations and contacts in the inbox. We have deployed a patch to resolve one of the identified sources of latency and are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Elevated page load times"
Last update: This incident has been resolved.
Latency has been resolved and we're seeing 100% of normal speeds.
A fix has been implemented and we are monitoring the results.
We are investigating the root cause of this issue with our cloud services provider.
Report: "Elevated page load times"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are still actively working with our cloud services provider to resolve elevated page load times.
We are still actively working with our cloud services provider to resolve elevated page load times.
We are still working with our hosting provider to resolve an issue with elevated page load times. Thank you for your patience.
We are working with our cloud services provider to resolve an active issue. We appreciate your patience.
We are continuing to investigate slow load times across the platform.
We're continuing to investigate elevated page load times.
We are currently investigating this issue.
Report: "Platform latency"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Investigating delays in updates to conversations in the inbox"
Last update:
## What Happened
Inboxes were receiving stale data that was sometimes 10-15 minutes out of date. This meant that even when an inbox was updated to reflect an incoming call or new message, it would rapidly be replaced by stale data, which made it very hard to find new or active conversations. Messages and calls were being received in real time, but the Avochato inbox was not being properly updated to reflect that, and if you missed a notification or refreshed your inbox, you would not always get the latest list of active conversations (though you could still view those conversations and respond to them if you still had a link to them from a notification).
This was exacerbated by ever-increasing daily platform usage, such as syncing contacts and new conversations, which slowed our services in inboxes. It was particularly challenging for inboxes with many open conversations or conversations with very long message histories. This year, we analyzed and foresaw these growth challenges specifically with indexing, and have been making architectural changes to avoid these types of slowdowns. However, our growth rate outpaced our estimates for shipping the new version of our search indexing. During this time, no data was lost, though it was in some cases difficult to respond to incoming calls and handle live chats, and certain automation, like closing conversations or marking conversations as addressed, would appear to the end user with an incorrect status in the app.
## Resolution
First, the engineering team implemented a patch to the adapter handling indexing platform-wide, but that did not make a significant enough impact to eliminate the symptoms of this incident. After analyzing the results, we decided to edit a specific index to exclude certain expensive queries, namely full-text searching of incoming messages for all new records. Historical messages in conversations will still be searchable for the time being. We are actively rearchitecting how we index messages within conversations (among other things) so that they are easily and fully searchable from the inbox and our API, as well as increasing the speed and accuracy of fetching conversations in an inbox on mobile and desktop devices.
We thank you for your patience and for choosing Avochato as your business communications platform,
Christopher Neale
CTO & CISO
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are still seeing delays showing the most recent list of conversations for some inboxes, and are working on a permanent resolution.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Issues loading Contacts and Broadcasts pages"
Last update:
## What Happened
A subset of our cloud infrastructure suffered a hardware failure, which caused specific customers associated with that hardware to be unable to view broadcasts and contacts, as well as intermittent issues polling for live inbox updates. Once the hardware issue was resolved, traffic for the specific subset of impacted inboxes returned to normal. No data was lost.
## Resolution
Avochato cloud engineers escalated the issue to our cloud partner, who escalated the issue internally. The hardware failure was resolved and our team will continue working with the vendor to ensure that similar incidents properly rotate the faulty hardware out of the cluster.
This incident has been resolved.
We are seeing improvements for the impacted inboxes and are continuing to monitor.
We are continuing to monitor while we partner with our cloud provider to resolve an issue with cloud hardware. Thank you for your patience while we resolve this issue for the subset of impacted inboxes.
We are continuing to monitor. Specific inboxes may continue to be impacted.
A fix has been implemented and we are monitoring the results.
Certain features of the Avochato app are being impacted by a partial cloud outage. We are working to restore full coverage.
We are currently investigating this issue.
Report: "Slow application response times"
Last update:
_Note: This incident was unrelated to a CDN outage that may have impacted customers sending MMS images with public-facing URLs powered by Fastly ([read more here](https://www.fastly.com/blog/summary-of-june-8-outage))._
## What Happened
Avochato cloud services were temporarily unable to split traffic across our secure cloud databases, causing only one Avochato database to manage all load on the platform at one time. This resulted in slower than average speeds when processing updates to records and when serving pages throughout the app on mobile and desktop, as well as requests to our API. This slowdown compounded during the middle of the day, causing delays in fetching and writing data as well as delays syncing data to our search nodes. It is also possible that the delays impacted time-sensitive operations such as adding contacts to a broadcast before the broadcast's scheduled date.
## Resolution
The engineering team patched the issue once it was determined safe to do so. Traffic patterns returned to normal and our throughput returned to expected levels. We have also made proactive performance improvements to double the normal throughput for certain data pipelines and implemented monitoring to immediately detect this regression in traffic.
We apologize for the inconvenience to you and your team,
Christopher, CTO
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Our team is investigating slower than average response times when using the app.
Report: "Client unavailability"
Last update:
## What happened
During a routine development cycle, database provisioning for a new service within our cloud services provider inadvertently applied an authorization change to our production database cluster. This caused requests to the Avochato app and homepage to fail temporarily until the changes were reverted.
## Resolution
We immediately reverted the change once operations identified that it impacted production, and the rollback was deployed.
## Impact
While Avochato was unavailable, queued automations and uploads entered our retry queues and successfully continued once we recovered. Incoming messages received during the period were not lost. Moving forward, our team has adjusted our processes for provisioning services regardless of their proximity to production services.
This incident has been resolved.
We are investigating an issue preventing users from accessing Avochato.
Report: "Issues with Telecom Provider"
Last update:
## What Happened
One of our third-party software vendors suffered a platform-wide outage which prevented us from leveraging their APIs to deliver SMS and Voice functionality. During the impact period, we were not able to leverage their APIs for sending messages and managing Avochato Numbers, as well as routing calls. Attempts to send messages were queued and retried until we could successfully deliver the messages. Customers sending traffic through our other SMS and Voice partners were not impacted.
## Resolution
The vendor performed their procedures for triaging and resolving the incident, and successfully returned to full operational capability.
We've seen most telecom errors resolve over the past hour. We're continuing to monitor the situation and will update as we know more.
We've started seeing success sending outbound SMS and MMS through our Telecom Providers. We'll continue monitoring the situation until it fully improves.
Calls seem to be unaffected; however, sending SMS and MMS is experiencing errors. We're in direct contact with our Telecom Provider to resolve this ASAP.
We've identified an issue with our Telecom Provider where outbound messages are disabled.
Report: "Platform Latency"
Last update:
## What Happened
During routine auto-scaling in response to automated rotation of application servers, the Avochato platform suffered network failures brokering client-side websocket requests to application servers. An application-layer resolution to client-side JavaScript errors experienced by some customers inadvertently amplified the volume of retry requests, and this caused an insurmountable queue of requests to our websocket broker database.
Secure websocket connections are used to deliver real-time notifications and app updates in the live inbox, and have inherent retry mechanisms to keep clients connected even if they lose connectivity intermittently. A high volume of concurrent retry requests timed out and filled the retry queue, where they continued to time out and fail exponentially as browsers interacted with Avochato. This led to an effective denial of service as the retry mechanisms created an insurmountable volume of requests, compounding based on peak platform usage by our user base. Exponential back-off mechanisms did not slow individual clients' requests to a rate our network could process expediently. Unlike the control we have over server-side resources, the Avochato engineering team did not have effective means to stop problematic clients from reconnecting, and rushed to isolate and stem the root cause, specifically by deauthenticating certain sessions remotely.
Avochato servers remained operational and available on the open internet during the impact period, but interactions with the app became queued at the network level, causing extreme delays for end users and API requests, as well as delays tagging data and uploading contacts, and delays in attempting to make outbound calls or route incoming calls. The incident persisted while the massive queue of requests was processed, but the Avochato engineering team did not have tools available to clear the queue without risking data loss.
## Resolution
The Avochato platform auto-scaled application servers in response to the increase in traffic to handle peaks in usage. Engineers were alerted and immediately began triaging reports of latency. After evaluating the network traffic and logs, our team identified the root cause and began developing mechanisms to stem websocket retry requests. Various diagnostics by the engineering team were able to decrease, but not eliminate, the above-average in-app latency so long as problematic clients were still online. Some cohorts of users were securely logged out remotely in order to prevent their clients from overloading Avochato. Backoff mechanisms have been modified to dramatically increase the period between retry requests.
Meanwhile, upgrades to the open-source websocket broker libraries used by the platform were identified, patched, tested, and deployed to production application servers in order to prevent the root cause. Additional logging was also implemented to better identify the volume of these requests for internal triage. Functionality to securely reload or disable runaway client requests has been developed and deployed to production in order to prevent the root cause from occurring across the platform. Additional architectural points of failure were identified at the networking level, and upgrades to those parts of the system have been proposed and prioritized to prevent this type of service disruption from occurring in the future.
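For illustration only (a generic sketch, not Avochato's client code), the kind of backoff change described in the resolution typically means spacing reconnection attempts exponentially and adding random jitter, so a large fleet of disconnected clients cannot all retry at the same moment:

```python
"""Minimal reconnect-with-backoff sketch; the base delay, factor, cap, and
attempt limit are illustrative values."""
import random
import time

def reconnect_with_backoff(connect, base=1.0, factor=2.0, cap=300.0, max_attempts=20):
    """Call connect() until it succeeds, sleeping a jittered, exponentially
    growing interval between attempts."""
    delay = base
    for _ in range(max_attempts):
        try:
            return connect()
        except OSError:
            # Full jitter: sleep a random duration up to the current ceiling.
            time.sleep(random.uniform(0, min(cap, delay)))
            delay *= factor
    raise ConnectionError(f"gave up reconnecting after {max_attempts} attempts")
```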
## Final Thoughts
We know how critical real-time conversations are to your team, and how important it is to be able to service your customers throughout the business day. Our team is committed to responding as promptly as possible to incoming support requests and providing as much information as possible during incidents.
Thank you again for choosing Avochato,
Christopher Neale
CTO and Co-founder
This incident has been resolved and we are observing normal page load times, but we are continuing to investigate the root cause.
Our team has deployed a software update to address the root cause and is monitoring the results.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
We have identified the issue and engineers are working on a resolution.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Intermittent connectivity issues"
Last update: This incident has been resolved. We will keep monitoring throughout the day.
Page load times are returning to normal following updates to cloud networking. We will continue to monitor connectivity.
We've identified network connectivity issues with one of our cloud database providers and are taking steps to ameliorate in-app latency some customers are experiencing.
Connectivity issues have been resolved but page load times are still above average. We are continuing to investigate.
Some clients are currently unable to connect to avochato.com
Report: "Platform Outage"
Last update:
## What happened
Actions taken by our cloud hosting provider as part of an investigation into Avochato infrastructure unexpectedly caused applications and browsers to be temporarily unable to reach Avochato from the web or the Avochato mobile apps.
## Resolution
Our ops team worked quickly to immediately rebuild our application cluster and get up and running, but some devices with cached DNS may have experienced 500 errors and the inability to take or make phone calls while DNS rolled out. We are working with our provider to make sure this misunderstanding does not happen in the future. We apologize for any inconvenience and thank you for choosing Avochato.
The Avochato platform experienced a temporary outage in the early morning as a result of unplanned cloud hosting diagnostics. From approximately 2:40-3:00 PST, access to the Avochato mobile and desktop apps, incoming calls, and the API was temporarily unavailable. Inbound messages were not lost, and scheduled or automated messages were delivered once service resumed.
Report: "Application Latency"
Last update:
## What Happened
Platform automation handling live notifications for messages led to excessive queueing of updates to critical tables in our write database, which in turn led to longer turnaround times for live updates inside the inbox. After identifying the root cause, Engineering deployed a fix (which succeeded in stemming the root cause), but in the meantime we over-scaled our workers to adjust to their load, which caused connection issues for non-workers. This led to an escalation in page load times while our congested databases could no longer serve our applications in a timely manner.
Our cloud operations automatically scaled to handle the increased pressure from the root cause while we resolved the issue, but over-scaled improperly, leading to increased load times when accessing the app. Load times additionally spiked as we swapped the databases, then returned to normal. During this period, one specific database appeared to fail despite seeing no usage, which led to a second period of increased load times after a brief reprieve. This database was safely replaced, servers were routed to the replacement, and load times returned to normal.
## Impact
Initial delays in receiving live inbox updates, followed by high page load times when viewing the app. Delays in updating conversations and contacts. Due to delays in receiving live updates, some messages that appeared to be double-sent manually were delivered exactly once, as intended.
## Resolution
The team has already deployed updates to prevent the root cause from occurring. We have made adjustments to prevent the second issue, including adjusting the maximum connections to the write database as well as safely replacing the faulty database that appeared to fail over during this period. Our engineering team will audit the configurations that led to the lack of cloud resources when auto-scaling.
This incident has been resolved. We will continue monitoring throughout the day.
We are observing average load times returning to normalcy. We will continue monitoring cloud infrastructure performance.
We are still seeing elevated load times and are applying diagnostics to reduce the impact on application performance.
We are still seeing some higher than average load times for specific inboxes and are continuing to investigate.
Average load times have returned to normal and we are continuing to monitor cloud infrastructure performance.
Our team has identified the issue and is working to speed up the queued messages and events.
We are investigating slower than average inbox load times and inbox live updates.
Report: "Broadcasts not starting"
Last update: A software update relating to the broadcast scheduling mechanism introduced a regression in our system, which caused broadcasts to stay stuck in an initial “sending” state. This has been resolved with a subsequent software update. This issue was not caught by automated or QA testing, and we have updated our automated testing and quality assurance controls to prevent a similar issue from happening in the future. Affected broadcasts eventually retried and delivered their message to their audience after a delay.
This incident has been resolved. Broadcasts that were scheduled to begin during the affected period will be restarted.
The issue has been identified and a fix is being implemented.
We are investigating an issue that is preventing scheduled broadcasts from moving beyond the "sending" phase.
Report: "Slow page load times"
Last update:
## What Happened
A sudden escalation in CPU usage in our production cloud database led to a period of longer than normal load times. While our platform was under high load for Cyber Monday, our investigation leads us to believe this was a hardware-related failure, and our operations team responded accordingly. Additionally, the period of slow response times led to our Slack application automatically turning off, as per Slack's platform policy. Engineers manually turned the app back on numerous times, but unfortunately some messages sent via Slack threads could not be delivered during the windows when the Slack app was disabled.
## Resolution
Data was not lost, as we were able to fail over to a read replica database. We have upgraded database hardware to move away from the potentially degraded hardware and performed routine system maintenance.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Platform Latency"
Last update:
## What happened
High concurrent outbound message volume caused our production write database to run out of connections. This caused most queued processes to take an extremely long time to finish, and page loads to time out for many users who tried accessing the platform during the impact period.
## Impact
Pending messages, inbound messages, and broadcasts during this period may have remained queued but were not dropped. Inbound calls initiated to Avochato numbers during this period were often unable to connect or be forwarded properly. Upon resolution, inbound messages and queued work retried themselves and, in most identifiable cases, were received properly.
## Resolution
Our database automatically failed over to a read replica and was able to resume serving requests; however, we are investigating ways for this failover to happen sooner to prevent longer periods of inaccessibility. Our engineers have identified the root cause, relating to message callback method prioritization, and we patched our production application servers with both a fix for the root cause and new safeguards to prevent excess resource consumption during periods of extreme load. We are evaluating solutions to make our infrastructure more resilient while continuing to offer a best-in-class live inbox experience for customers of all sizes.
As a team, we have committed to aggressively monitoring our platform's health and proactively deploying updates to bottlenecks detected in our current application. We appreciate the trust you place in our platform for communicating with those that matter most to you, and thank you for your patience during this busy time of the year.
Thank you for choosing Avochato,
Christopher Neale, CTO and Co-founder
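As a hedged sketch of what a safeguard against excess resource consumption can look like (the limit, names, and deferral strategy here are illustrative, not Avochato's implementation), one common pattern is to cap how many callbacks may hold a database connection at once and defer the rest instead of letting them pile onto a saturated pool:

```python
"""Concurrency-cap sketch: at most MAX_CONCURRENT_CALLBACKS callbacks run at
once; the rest are deferred for a later retry."""
import queue
import threading

MAX_CONCURRENT_CALLBACKS = 20                       # illustrative limit
slots = threading.BoundedSemaphore(MAX_CONCURRENT_CALLBACKS)
deferred = queue.Queue()                            # callbacks to retry later

def handle_callback(process, payload):
    """Run a delivery callback only if a slot is free; otherwise defer it."""
    if not slots.acquire(blocking=False):
        deferred.put((process, payload))
        return False
    try:
        process(payload)
        return True
    finally:
        slots.release()
```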
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are working to deploy an update to resolve issues impacting clients.
We are currently investigating this issue.
Report: "East Coast platform latency"
Last update:
## What happened
Our East Coast cloud infrastructure was routing requests to West Coast databases, sometimes making multiple trips for a single request. This caused delays for customers whose DNS was automatically routed to the East Coast, as well as for network requests from API servers in the East Coast region. Messages and application load times were delayed for customers closer to the East Coast region than the West Coast.
## Resolution
We altered the threshold for sending traffic to the East Coast data center. We have rolled back the networking changes to East Coast infrastructure, and systems have returned to normal.
This incident has been resolved.
We have made an infrastructure update which appears to have resolved latency impacting East Coast customers. We will continue to monitor page load times for impacted routes.
We are aware of an infrastructure issue that is causing slower than average page times for US East Coast users. We are working on a resolution.
Report: "Client Latency and Platform Outage"
Last update:
**What Happened**
Starting in the afternoon, routine Conversation Management automation within the Avochato platform began running a disproportionately large body of background work on the default priority queue. This was ultimately due to a combination of account-specific settings, infrastructure constraints, and the timing of the load across the Avochato platform. It led to an exponentially growing number of concurrent background jobs competing for all platform resources. The Avochato platform suffered growing latency in a series of waves, a short maintenance window of hard downtime, and another wave of latency as we addressed the root cause of the issue. All Avochato services were impacted.
**Ultimately, fixing the issue required putting the platform into maintenance mode while replacing hardware used in our cloud services.** To clarify, _this was not a planned or routine maintenance window, but the user experience was the same: app users would see a maintenance page (or an error page for some users) and be unable to access the inbox. This was done in the interest of time, and the process will be revised by the engineering team in the future._
During this period it was not clear where the source of the runaway automation described above came from, but it caused the Avochato platform to attempt to queue a new type of asynchronous job designed to push data to websockets. Because jobs and websockets use the same hardware, the influx consumed essentially 100% of memory: jobs that could not find available websockets could not complete, and more and more jobs of that nature piled up waiting to publish to a websocket.
The source of this issue specifically relates to a recent platform upgrade deployed in previous weeks to reduce the turnaround time for users to send messages and receive notifications quickly. While this functionally worked for our customer base, it ultimately moved the burden to a different part of the architecture in a way that scaled disproportionately under specific circumstances, and without proper limitations on concurrent throughput. The result left our platform unable to process additional web requests (meaning high page load times) and queued a massive excess of background jobs in a short period (meaning delays in messages and a lack of real-time notifications and inbox updates, etc.). Additionally, the latency and eventual outage left our team unable to respond to many customers who reached out to us during the impacted period in the timely manner they have become accustomed to, due to the platform failure.
The engineering team prepared and deployed a migration to switch those types of new jobs from the default priority queue into a new lower-priority queue to constrain their impact. Deployment of this patch was done per our usual high-availability deployment process, which involves taking one-third of our application servers offline at a time, reducing platform capacity while we deploy. Regardless, in order to handle the overall volume of queued work and return to normalcy, Engineering applied emergency steps to replace the cloud computing instance storing the jobs with one twice its size, but this could not be done without postponing the work while we switched the infrastructure. All efforts were made to avoid dropping the background jobs, though ultimately not all jobs could be saved.
Emergency steps to resolve the situation (during which Avochato switched into maintenance mode in order to purge the system of the busy processes) led to a short period of hard downtime and the loss of queued jobs, including processing contact CSV uploads, creating broadcast audiences, sending messages, and displaying notifications. Once the necessary hardware was replaced, the root source of the resource-intensive automation continued to create excess jobs. However, it gave engineers the ability to reduce the noise, identify the source, and design a final resolution to treat the cause instead of the symptom. Another migration was prepared to make it easy for admins to turn off functionality for specific sources of automation. Once deployed, systems administrators were able to eliminate the source of resource-intensive automations once and for all, and new safeguards were installed for taking expedient, atomic actions in the future that would not require hardware or software deployments. This ultimately returned our systems to normal as of yesterday evening.
**Next Steps**
Engineering has drafted and is prioritizing a series of TODOs regarding infrastructure points of failure, is implementing in-app indicators for when the system is under similar periods of stress, and is working closely to resolve any impacted accounts that got into a bad state due to the actions taken during the period. Infrastructure planning has been prioritized to reduce the burden on specific parts of our architecture and to prevent specific architecture from bearing the multiple responsibilities that led to the failure. We are continuing to monitor platform latency and take proactive steps to prevent unforeseen combinations of Avochato automation from ever impacting the core inbox experience.
We understand the level of trust you place in the Avochato platform to communicate with those most important to you. On behalf of our team, thank you for your patience, and thank you for choosing Avochato,
Christopher Neale, CTO and co-founder
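For illustration (Avochato's job system is not named in this report, so this is a hedged sketch using Celery with hypothetical task, queue, and broker names), moving a class of background work off the default queue and onto its own lower-priority queue looks roughly like this:

```python
"""Sketch: route websocket-push jobs to a dedicated queue so a surge of them
cannot starve the default queue. All names here are placeholders."""
from celery import Celery

app = Celery("avochato_example", broker="redis://localhost:6379/0")  # placeholder broker

# Send this task to its own queue instead of the default "celery" queue.
app.conf.task_routes = {
    "tasks.push_inbox_update": {"queue": "websocket_low_priority"},
}

@app.task(name="tasks.push_inbox_update")
def push_inbox_update(inbox_id, payload):
    # Publish a live-inbox update; body kept illustrative on purpose.
    print(f"pushing update for inbox {inbox_id}: {payload}")
```

Workers for that queue would then be started separately (for example, `celery -A tasks worker -Q websocket_low_priority --concurrency=4`), so these jobs compete only for their own capped pool of workers.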
This incident has been resolved and our team is continuing to monitor the stability of the platform and process outstanding queued work.
We are monitoring the resolution of the incident and services are being rolled back online.
The Avochato Platform is entering a temporary maintenance period.
We are continuing to experience delays in serving pages and handling messages. Our ops team is deploying a patch to our infrastructure and we will monitor the result.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Unable to make outbound calls from mobile apps"
Last update:
**What Happened**
An upgrade to the client library prevented calls from being initialized inside the context of our mobile applications. This was unfortunately not detected by our QA process and resulted in a regression for app users regardless of mobile app version.
**Actions Taken**
We have patched the initialization of the library in all clients and we are re-evaluating the team’s ability to QA mobile applications against stage environments.
This incident has been resolved.
An issue impacting outbound call origination has been identified and a fix has been deployed. We are continuing to monitor outbound call capability from Avochato mobile apps.
Our team is aware of issues with making outbound calls from the Avochato for iOS and Avochato for Android mobile apps for some customers. Mobile and desktop browsers are not affected.
Report: "Client Latency"
Last update:
## What happened
A large spike in network requests, combined with a backlog of automated usage, led to the Avochato platform queueing HTTP requests for a longer than average period of time. The callbacks that resulted from the spike in usage created a large backlog of work for our servers, which caused page load times to spike and delayed message sending. Subsequently, the load balancer for our platform ran out of available connections for HTTP requests as websocket escalations piled up due to our users refreshing their browsers during the period of degraded performance. This caused a negative feedback loop of longer delays to process requests and connect to live updates, which in turn meant that live updates for inboxes and conversations continued to be intermittent and HTTP requests were dropped.
## Action items
Specific bottlenecks in our platform infrastructure's ability to broker websockets have been identified and addressed. Some additional updates to our asynchronous architecture are being planned and prioritized to prevent a similar incident in the future.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Our team has taken steps to mitigate platform latency which has improved but not resolved performance. We are continuing to monitor performance.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Higher than Average Platform Latency"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Platform Latency"
Last update:
### What Happened
At approximately 9AM PST, the Avochato cloud platform began receiving double the average volume of requests for that time period. This appeared to be due to high platform utilization at the top of the hour. Our system scaled to handle this load and no requests to our servers were dropped; however, responses were delayed as our system queued the requests and handled them in order. This began to slowly impact the end-user experience, with latency peaking at roughly 9:30am PST before beginning to decrease. Messages that needed to be delivered to customers were queued and delivered during this time. However, accessing conversations suffered from 5-7 second delays. The root cause appeared to be a bottleneck in our cloud infrastructure preventing our servers from handling the sheer load, causing requests to be queued. Latency persisted for some users until approximately 10:05am PST and has now regressed to the mean.
### Resolution
While our systems automatically adjusted to handle the load, the time it took to do so left users with a suboptimal experience when navigating Avochato inboxes. The Avochato operations team investigated the source of the volume and took action to limit the volume of requests being caused by outsized usage.
### What will be done
Tonight we will be performing maintenance to address this bottleneck and to continue raising the threshold of concurrent volume the platform can absorb before the user experience is impacted. We know how important it is to quickly respond to your customers and appreciate the trust you place in our team to deliver the best possible experience for those important to you.
Thanks again for choosing Avochato,
Christopher
This incident has been resolved -- we will be performing maintenance tonight to introduce preventative measures.
Latency due to unprecedented platform usage has been identified, and our team is monitoring platform volume.
We are currently investigating this issue.
Report: "In-app latency"
Last update: We have identified end-user latency due to high throughput across the platform. The team has addressed a deficiency in part of our cloud systems. Application responsiveness has returned to normal, but we are continuing to monitor system performance.
Report: "Delays due to High Platform Volume"
Last update: This incident has been resolved.
Our systems have stabilized and the user experience is returning to normal. We are continuing to monitor the situation.
We are currently investigating the issue.
Report: "Reports of slowness on the Avochato app"
Last update: We're receiving reports of the dashboard loading slower than normal and are investigating.
Report: "Latency in website performance"
Last update: An unexpectedly large number of API requests caused noticeable latency in page loads across www.avochato.com during the morning of June 5, 2020, PT. The engineering team released an update to production that same day, which introduced API rate limits to mitigate the chances of latency occurring in the future due to API request spikes.
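As a hedged illustration of the mitigation described above (a generic pattern, not Avochato's actual rate-limiting rules or values), a per-API-key token bucket is one common way to cap request spikes:

```python
"""Minimal token-bucket rate limiter; capacity and refill rate are illustrative."""
import time

class TokenBucket:
    def __init__(self, capacity=60, refill_per_second=1.0):
        self.capacity = capacity            # maximum burst size
        self.tokens = float(capacity)
        self.refill = refill_per_second     # sustained requests per second
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.refill)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                        # caller would respond with HTTP 429

buckets = {}

def allowed(api_key):
    """Check the caller's bucket; over-limit requests are rejected rather than
    queued, so they cannot slow page loads for everyone else."""
    return buckets.setdefault(api_key, TokenBucket()).allow()
```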
Report: "Intermittent Issues with Avochato Dashboard"
Last update: We are experiencing an abnormally high load and will continue to monitor.
Report: "Intermittent Issues with Avochato Dashboard"
Last update: Users report instances of being unable to perform certain actions on the dashboard. We have deployed a fix and will continue monitoring for further issues.
Report: "Avochato Dashboard Issues"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.