Eleos Technologies

Is Eleos Technologies Down Right Now? Check if there is a current outage ongoing.

Eleos Technologies is currently Operational

Last checked from Eleos Technologies's official status page

Historical record of incidents for Eleos Technologies

Report: "support@eleostech.com address not creating support tickets"

Last update
resolved

Between 11:54 AM EDT and 7:42 PM EDT yesterday, May 28th, emails sent to our primary support address, support@eleostech.com, were not received and did not create support tickets. If you sent an email to support@eleostech.com during this time, please resend the email. You would have received an error email with the subject "Delivery Status Notification (failure)" in reply to your message. Any tickets opened via the support portal were not impacted. We're very sorry for the disruption this issue has caused for your business.

Report: "support@eleostech.com address not creating support tickets"

Last update
Resolved

Between 11:54 AM EDT and 7:42 PM EDT yesterday, May 28th, emails sent to our primary support address, support@eleostech.com, were not received and did not create support tickets. If you sent an email to support@eleostech.com during this time, please resend the email. You would have received an error email with the subject "Delivery Status Notification (failure)" in reply to your message. Any tickets opened via the support portal were not impacted. We're very sorry for the disruption this issue has caused for your business.

Report: "support@eleostech.com address not creating support tickets"

Last update
resolved

We've resolved the issue with the support@eleostech.com address and support tickets are being created again. We determined that emails sent to this address between 11:54 AM EDT and 7:42 PM EDT were not received. If you sent an email to support@eleostech.com during this time, please resend the email. You would have received an error email with the subject "Delivery Status Notification (failure)" in reply to your message. Any tickets opened via the support portal were not impacted. We're very sorry for the disruption this issue has caused for your business.

investigating

Emails sent to our primary support address, support@eleostech.com, are not currently being received or creating support tickets. Support requests opened via the portal at https://eleostech.zendesk.com are creating tickets successfully. We're investigating the issue with support@eleostech.com. In the meantime, if you need to reach support, please email us at support@eleostech.zendesk.com or login to the portal at https://eleostech.zendesk.com to create a ticket.

Report: "Drivers not receiving routes"

Last update
postmortem

On May 5, 2025, our engineering team was paged for elevated error rates that exceeded normal operational thresholds. These errors stemmed from our routing service provider unexpectedly rate-limiting our requests. This caused Trip Planner and route requests originating from our mobile applications to fail. This also affected Trip Planner Studio’s Build a New Route requests made via the App Manager. The vast majority of route requests failed from 14:16 UTC to 14:48 UTC. Upon investigation and communication with the external provider, it was determined that the rate-limiting was the result of the deployment of a malformed configuration by the provider and was not the result of our usage patterns or volume, which remained normal. Routing functionality was restored at 14:48 UTC. As a precautionary measure, we waited to resolve the status page incident until 15:54 UTC when we confirmed that the routing provider had identified the underlying cause of the outage and did not expect a reoccurrence.

resolved

Routing success rates have remained stable and under normal operating conditions. The upstream provider has acknowledged the issue on their side and has indicated the issue has been resolved on their end.

monitoring

Routing success rates still appear to be stable and under normal operating conditions, but we are continuing to monitor at this time.

monitoring

Routing success rates appear to be stable and under normal operating conditions, but we are continuing to monitor.

monitoring

Routing functionality with an upstream provider has been restored and we're seeing routing success rate increase. We will be monitoring to make sure that drivers are able to receive new routes in the mobile apps, and the Build a Route functionality in Trip Planner Studio will continue to function.

investigating

We are currently investigating an issue with drivers not being able to receive routes at this time.

Report: "Eleos Platform Metrics Not Rendering Graphs"

Last update
resolved

The Eleos Platform Metrics appear to be fully functional at this point. Again, if you continue to experience any issues rendering these charts, please reach out to support at support@eleostech.com.

monitoring

Our vendor has indicated that the issue has been resolved, and we have observed that the Eleos Platform Metrics are now rendering and functional. If you are still experiencing issues, please reach out to our support team at support@eleostech.com.

identified

We have confirmed this issue is due to an outage from our vendor, and we are currently in direct communication with them. We will provide another update within 30 minutes or as soon as we receive more information from our vendor.

identified

We are continuing to work on a fix for this issue.

identified

We have identified an issue preventing the Eleos Platform Metric charts from rendering within App Manager. It appears, at this time, our vendor for rendering these charts is experiencing an outage.

Report: "Delayed Message Processing"

Last update
resolved

The issue is resolved and we are no longer seeing message delays. No message information was lost during this time. We are continuing to monitor our message backlog and ensure messages are no longer being delayed but it is currently stable.

identified

We are continuing to work on a fix for this issue.

identified

We are currently experiencing delays in messages being sent from the driver to customer backend services. No message information is being lost during this time. We are still successfully processing and delivering messages but are working to mitigate a growing backlog of messages that is causing delays.

Report: "Delayed Message Processing"

Last update
resolved

We investigated an incident with delayed integration of messages starting at 17:02 UTC, which was resolved at 17:20 UTC. The 99th percentile of message delay was just under 10 minutes, and most messages were delivered well under that time frame. No message information was lost and workflow actions were not impacted.
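To make the "99th percentile" figure concrete, here is a minimal sketch of a nearest-rank percentile calculation. The delay values are invented for illustration and are not data from this incident; the function name `percentile_nearest_rank` is likewise hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical delays in seconds: 98 quick deliveries and 2 slow ones.
delays_s = [30] * 98 + [590, 600]

p50 = percentile_nearest_rank(delays_s, 50)
p99 = percentile_nearest_rank(delays_s, 99)
print(f"median delay: {p50} s, p99 delay: {p99} s")
```

In this made-up sample the p99 sits just under 10 minutes (600 seconds) while the median is far lower, which is the shape the update describes: a small tail of slow messages and a large majority delivered quickly.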

Report: "Monitoring delayed message processing"

Last update
resolved

Messages integrated successfully and the message backlog remains stable.

monitoring

We remain caught up on our backlog of messages. We are assessing if all messages that were cleared from the backlog have been fully integrated. At this time it appears so, but we want to confirm using a separate source of data.

monitoring

We have completed processing all messages in our backlog. We will continue to monitor the backlog to ensure the queue is stable and messages are no longer being delayed.

monitoring

We are still monitoring messages being sent from the driver to customer backend services. No message information has been lost and workflow actions have not been impacted. We are continuing to see a decrease in the backlog and working on solutions to get through the backlog faster.

monitoring

We are continuing to monitor messages being sent from the driver to customer backend services. Still no message information has been lost during this time as we work through our backlog of messages. Workflow actions have not been impacted during this time. We are successfully processing and delivering messages and are seeing our backlog decrease steadily.

monitoring

We are continuing to monitor messages being sent from the driver to customer backend services. As we continue to work through the backlog of messages, we are also implementing mitigations to increase the system's ability to work through the backlog faster. Still no message information has been lost during this time.

monitoring

We are monitoring our sub-system for sending messages from the driver to customer backend services. We are successfully integrating messages at this time to customer backend web services but we are currently working through the backlog of messages that we do have. No message information has been lost during this time. We will continue to monitor until we have successfully integrated all messages in our queue.

monitoring

We are currently monitoring our backlog of messages. Messages are being processed and we are working through a large backlog that has been queued.

investigating

We are continuing to work on applying the mitigation and will notify once complete.

investigating

We have identified the cause of the issue and are working on a mitigation.

investigating

We are investigating a delay in the delivery time of transactions. Our on-call team is engaged and will provide an update shortly.

Report: "Investigating reduced route request volume"

Last update
resolved

We've confirmed that our normal service level monitoring and alerting is working correctly, that logs are being ingested and stored as normal, and that the issue is isolated to our logging provider's web console. We've raised the issue to them, and we have identified a workaround should logs be needed in the interim.

monitoring

We've identified this is a problem with our monitoring, and that routing is working normally across the Platform. Thank you for your understanding on the false alarm—we try to err on the side of notifying on the status page in cases where there's uncertainty as to whether there's an emerging issue, but in this case there turned out to be no user impact.

investigating

Our monitoring indicates a significant drop in route requests beginning around 16:33 UTC. We’re investigating and will provide more information on impact as soon as possible. It is not yet clear if the problem is affecting end users or if the problem is with our monitoring, but we will clarify shortly.

Report: "Pluggable Build a New Route"

Last update
resolved

This incident has been resolved.

identified

The fix for pluggable engines within Build A New Route of the Platform Dashboard is in place, and has been confirmed to be working as expected again. We are sorry for the inconvenience this may have caused. Please reach out to support@eleostech.com if you have any questions.

identified

We are currently deploying a fix that should resolve the issue within Build A New Route when attempting to use pluggable routing engines. The estimate for the fix to be in place is 25 minutes. We will provide an update when the fix is in place, and Build A New Route is functioning normally.

identified

We are continuing to work to revert the set of changes that broke pluggable route engines within Build a New Route. We will provide updates here as available or in the next 20 minutes, whichever comes first.

identified

We are continuing to work to revert the set of changes that broke pluggable route engines within Build a New Route. We will provide updates here as available or in the next 20 minutes, whichever comes first.

identified

We are continuing to work to revert the set of changes that broke pluggable route engines within Build a New Route. We will provide updates here as available or in the next 20 minutes, whichever comes first.

identified

We have determined the cause of the pluggable route engine rendering incorrectly within Build a New Route, and are working on reverting the change. We will provide updates here as available or in the next 20 minutes, whichever comes first.

identified

We have identified an issue where trying to use Build a New Route within Trip Planner Studio will fail when selecting a pluggable routing engine. We have identified the issue, and are working on putting a fix in place. We will provide updates here as available or in the next 20 minutes, whichever comes first.

Report: "Degraded Performance"

Last update
resolved

Response times for the APIs and mobile apps are back to normal.

investigating

We are investigating slow response times for API requests made to the Eleos Mobile Platform, which also affects mobile apps. Our on-call team is engaged and will provide an update shortly.

Report: "Missing reporting data in App Manager"

Last update
resolved

Due to a configuration error, between 2024-12-05 at 20:27 UTC and 2024-12-06 at 16:37 UTC, Eleos Platform Metric charts in App Manager failed to display data. We have corrected the issue and confirmed that the charts are now displaying correctly.

Report: "SSL Certificate Changes to Public Platform APIs"

Last update
postmortem

On 2024-11-11 at 8:12 UTC, as part of standard TLS certificate management procedures, our systems automatically renewed the certificate used by [platform.driveaxleapp.com](http://platform.driveaxleapp.com) ahead of its upcoming expiration. Some customer integrations did not recognize the validity of the new certificate, and were thus unable to initiate secure connections to the Eleos Platform API until they were updated to do so.

## Impact

Affected customers’ integrations were able to connect to the Eleos Platform API, but aborted the connection before making a request because they were unable to determine the validity of the TLS certificate presented by the server. Primarily, this prevented messages from being delivered to drivers via the `PUT /api/v1/messages/{handle}` API endpoint. Depending on the integration, other API calls may have also been affected, including:

* Load updates and deletes via `/api/v1/users/{username}/`... endpoints
* User refreshes via `POST /api/v1/users/{username}/updates`
* API-based app management, such as form and screens
* API-based platform data management, such as trip plan fetching

Mobile app communication with Eleos backend systems was not affected directly. The issues with the noted customer systems indirectly degraded API-driven functionality, such as backoffice-to-driver message delivery. App Manager and Document Hub were not affected. Customers with an integration using an up-to-date root store were not impacted.

## Background

HTTPS APIs such as the Eleos Platform API rely on the web’s Public Key Infrastructure (PKI) to allow client applications to verify the identity of the server and prevent machine-in-the-middle attacks. As part of PKI, clients such as web browsers, HTTP libraries, and language runtimes provide a “root store” of pre-trusted certificates controlled by audited third parties, called certificate authorities (CAs), who cryptographically certify the issuance of certificates for individual websites.

These pre-trusted certificates form the “root of trust” for verifying server identities. For a successful connection, the server must present to the client a certificate that is transitively signed by one or more of these roots, and the client must have at least one of those same roots pre-loaded in its root store. If the server presents a certificate signed by a CA that is not present in the client’s root store, the connection will fail.

From time to time, new CAs pass the necessary audits and are included in major root stores, and some CAs go out of business or fail to comply with necessary policies and are proactively removed, since they can no longer be trusted to properly verify certificates. Finally, root certificates are only valid for a finite period of time. For these reasons, root stores are not static, and must receive periodic updates to ensure continued interoperability with the greater web, including the Eleos Platform APIs.

In this instance, the previous certificate for [platform.driveaxleapp.com](http://platform.driveaxleapp.com) was rooted with the Starfield Class 2 Certification Authority certificate. Because of a new policy in Mozilla and Chromium’s root store programs that limits the lifetime of any given root certificate to 15 years, this CA will no longer be trusted by these major root stores in April 2025. To ensure no certificates remain in use past that date, our certificate issuer is transitioning away from it proactively.

Both the new and expiring certificates included two Amazon-managed root CAs as part of their trust chains, Starfield Services Root Certificate Authority - G2 and Amazon Root CA 1. These root CAs date to 2011 and 2017, respectively, and have been widely included in the root stores used by browsers and operating systems since that time. Because Amazon Root CA 1 is cross-signed by Starfield Services Root Certificate Authority - G2, trusting either root CA was sufficient to be unaffected by the change. However, affected systems relied on the older Starfield Class 2 CA to validate the [platform.driveaxleapp.com](http://platform.driveaxleapp.com) certificate and did not include either of the new CAs in their root store. As a result, those systems were unable to confirm the validity of the new certificate and rejected the attempted connections to the Eleos Platform API.

To avoid Eleos Platform outages caused by expired certificates, certificate renewal is a fully automated operation, and certificates are designed to expire relatively frequently to ensure these mechanisms are exercised regularly. These certificates can change at any time, even ahead of the end of their validity period. We use multiple, distinct monitoring systems to ensure our APIs present widely-trusted and valid certificate chains at all times, and both automation and monitoring functioned as expected during this incident.

Our commitment is to continue to use certificates issued by established, trusted CAs that are present in the major (CCADB, Chromium, Mozilla, Microsoft, Apple, Java) root stores. To ensure API clients are able to successfully validate the authenticity of Eleos APIs, we recommend root stores be kept up to date via the mechanisms available through operating system and language runtime vendors. We do not recommend “pinning” or manually trusting observed certificates by adding them to a custom root store.
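As an illustration of the root-store check described above, the following is a minimal sketch (not an Eleos-provided tool) that looks for the two root CAs named in this postmortem in the trust store used by Python's standard `ssl` module. The helper names `common_name`, `roots_present`, and `check_default_store` are hypothetical. One caveat: `SSLContext.get_ca_certs()` only reports certificates that have actually been loaded, so on systems that keep roots in a certificate directory the result may be empty even when the roots are trusted.

```python
import ssl

# Common names of the two roots the postmortem says are sufficient to trust.
WANTED_ROOTS = (
    "Amazon Root CA 1",
    "Starfield Services Root Certificate Authority - G2",
)


def common_name(cert):
    """Extract the subject commonName from a cert dict in the ssl module's format."""
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def roots_present(ca_certs, wanted=WANTED_ROOTS):
    """Map each wanted root name to whether it appears in ca_certs."""
    names = {common_name(cert) for cert in ca_certs}
    return {name: name in names for name in wanted}


def check_default_store():
    """Inspect the CA certificates loaded into Python's default TLS context."""
    context = ssl.create_default_context()
    return roots_present(context.get_ca_certs())
```

Calling `check_default_store()` returns a dictionary such as `{"Amazon Root CA 1": True, ...}`. Per the postmortem, a client that trusts either root should be able to validate the renewed certificate, so a result where both values are `False` would be a reason to update the runtime's or operating system's root store rather than to pin a certificate.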

resolved

For those still having any SSL certificate issues when connecting to the Eleos APIs, please ensure that you add Amazon's CAs to your trust store. For more information, please see Amazon's guide on this: https://aws.amazon.com/blogs/security/acm-will-no-longer-cross-sign-certificates-with-starfield-class-2-starting-august-2024/ As a follow-up, we have confirmed that the updated certificate only impacted customers' backoffice systems contacting our public API, and that supported mobile app versions were able to connect to the Eleos API and were not directly affected. Impacted customer systems may have caused issues with drivers' mobile apps. For example, if your messaging web service was affected, your drivers would not have received messages sent to them until your service could contact the Eleos API. However, messages sent by drivers were delivered successfully.

monitoring

For those still having any SSL certificate issues when connecting to the Eleos APIs, please ensure that you add Amazon's CAs to your trust store. For more information, please see Amazon's guide on this: https://aws.amazon.com/blogs/security/acm-will-no-longer-cross-sign-certificates-with-starfield-class-2-starting-august-2024/ For any questions, please reach out to support@eleostech.com.

monitoring

For anyone having any SSL certificate issues when connecting to the Eleos APIs, please ensure that you add Amazon's CAs to your trust store. For more information, please see Amazon's guide on this: https://aws.amazon.com/blogs/security/acm-will-no-longer-cross-sign-certificates-with-starfield-class-2-starting-august-2024/ Again, this issue is limited to those who rely on Starfield C2 being contained within the certificate chain. If your systems do not rely on this, or you aren't experiencing any issues connecting to the Eleos APIs, please disregard. For any questions, please reach out to support@eleostech.com. We apologize for this inconvenience.

identified

The recently issued certificate no longer contains Starfield C2 in its certificate chain. This issue will affect any services that rely on Starfield C2 being contained in the certificate chain. If your integration or systems do not rely on Starfield C2 being contained in the certificate chain, or you are otherwise not seeing any issues connecting to our API endpoints, you can ignore this message. If you are experiencing any issues connecting or making API calls to the Eleos Platform, please reach out to our support team at support@eleostech.com.

investigating

Around 11/11 8:00 UTC our SSL certificate chain for our driveaxleapp.com domain was updated. This update may have caused issues connecting to our public facing platform APIs. We are investigating the issue now.

Report: "Monitoring Degraded Performance"

Last update
resolved

We've identified the primary contributor to the increased latency to the Platform and have applied mitigations which have improved response times. This involved directing certain kinds of request processing to a dedicated subset of servers, freeing capacity on other servers. This is a temporary measure while we continue to work on a permanent solution to ensure systems run smoothly going forward. During this time, no data was lost and driver impact was isolated to slower-than-normal app functionality. We're sorry for the inconvenience this has caused and apologize for the disruption.

monitoring

Our systems are showing increased response times and latency when mobile apps and customer web services interact with our systems. Requests to our systems are succeeding despite the increase in delay. Drivers and App Manager users will experience slightly higher than normal latency but requests will still succeed. No driver data is lost; any requests that take too long to complete will be retried. We are currently monitoring.

Report: "Monitoring Indicated Problem"

Last update
resolved

Our systems have remained stable and we have not observed a recurrence of degraded performance. We'll continue to monitor our systems as usual. We're actively working to identify the root cause of the degraded performance that has affected our systems. We have confirmed that it is not related to the update we rolled back on Monday, Sept. 23. We know you and your drivers rely on our systems, and we're working to ensure they continue to work smoothly. We're sorry for the impact this has on your business.

monitoring

The system remains operational; customers and drivers should not be experiencing delays. Out of an abundance of caution, we're continuing to actively monitor system performance. If we observe future performance degradation, we'll update here.

monitoring

Around 16:30 UTC our system response times returned to normal levels. We will continue to monitor and provide updates.

investigating

Beginning at 15:45 UTC, our systems indicated increased response times and latency when mobile apps and customer web services interact with our systems. Requests to our systems are succeeding despite the increase in delay. Drivers and App Manager users will experience slowness. No driver data is lost; any requests that take too long to complete will be retried. We're currently seeing these delays recover, and we'll continue to update.

investigating

Our monitoring indicates a problem with our systems. We’re investigating and will provide more information on impact as soon as possible.

Report: "Retransmitted signature images"

Last update
resolved

As of 9/24/2024 at 20:57 UTC, we’ve completed retransmitting messages with linked signature images that were originally sent to customers’ messaging web services between 13:54 and 20:50 UTC on 9/23/2024. The retransmitted messages include corrected URLs for signature images, in addition to the rest of the message as it was originally delivered. All customers who had impacted messages with linked signature images were contacted directly prior to the retransmission. Additionally, we’ve finished compiling and sending message data to those customers who indicated a need to receive it outside of their messaging web service. Any messages that did not include signature images were unaffected by the issue, and those messages were therefore not retransmitted. We’re sorry again for the disruption this issue has caused for your business.

Report: "Issues with signature images"

Last update
resolved

At 13:54 UTC on 9/23/2024, we released a server update that resulted in an issue in how we transmit signature images through our message API. We reverted this change at 20:50 UTC and confirmed that signature images were again being transmitted successfully. Although the images were not accessible during this time period, these images were stored properly and no data was lost. We're currently working toward retransmitting these messages so that the signature images can be accessed. Customers whose forms include signature fields should expect to receive retransmissions with corrected signature image URLs. Additionally, during this time period, one customer had difficulty with driver app logins. This was isolated to their environment, and we confirmed that other customer app logins were not affected. Other image transmission mechanisms, namely document images accessed via the API, were not affected. We're very sorry for the impact this issue has had on your business, and we’ll be working toward better verification around this portion of our systems to prevent similar issues in the future.

Report: "Investigating a Problem"

Last update
resolved

We've additionally verified that document images retrieved through the Document API were unaffected and accessible as normal.

monitoring

We've rolled back a change we made this morning at 9:54 AM EDT. This earlier change caused authentication failures for one customer, and additionally prevented signature images attached to messages from being transmitted correctly. We've confirmed that the rollback has resolved these issues, and continue to monitor the systems.

investigating

We are investigating potential issues affecting the Eleos Mobile Platform. Our on-call team is engaged and will provide an update shortly.

Report: "Investigating Increased Response Times"

Last update
resolved

At this time the system has stabilized, and we are resolving this incident. Beginning at 13:54 UTC the Eleos Mobile Platform started experiencing elevated response times across the system while error rates remained nominal. A mitigation was put in place at 14:35 UTC to stabilize the response times which had an immediate impact that brought response time down to a normal rate at 14:38 UTC. During this period mobile app users would have experienced slightly longer periods when waiting for the app to refresh with new data, and API requests would've taken longer than normal to make.

monitoring

We are seeing stable response times across the system, and are continuing to monitor the situation. We will post another update in 30 minutes, or when new information is available, whichever comes first.

monitoring

We are seeing stable response times across the system, and are continuing to monitor the situation. We will post another update in 30 minutes, or when new information is available, whichever comes first.

monitoring

We have applied mitigations and our response times are back to normal. We are continuing to monitor.

investigating

We have identified the cause for the elevated response times and are applying mitigations.

investigating

We are investigating potential issues affecting the Eleos Mobile Platform. Our on-call team is engaged and will provide an update shortly. We are experiencing higher-than-normal latency. While the system appears to be stable, our message response time is slower than expected.

Report: "Degraded platform performance"

Last update
resolved

The platform continues to remain stable and responsive since 16:30 UTC. During the period of the incident, from 16:10 to 16:30 UTC, a minority of App Manager pages failed to load, but the mobile app experience was largely unaffected as failed requests (such as completing workflow actions, sending messages, and uploading documents) were retried automatically. We're sorry for the impact this had on your business.

monitoring

The platform has remained stable and responsive since 16:30 UTC. We're continuing to monitor for degraded performance.

investigating

Since 16:30 UTC, response times have returned to their normal level. We're still investigating the cause and monitoring. We'll update if system performance begins to degrade again.

investigating

We're investigating degraded performance and slow response times from platform endpoints and services, beginning at 16:08 UTC. This is affecting the platform API, mobile apps, and App Manager. Despite the degraded performance, our systems are functioning and returning successful responses.

Report: "Push Notification Failures"

Last update
postmortem

Between 5/20/2024 19:17 UTC and 5/22/2024 18:05 UTC our systems failed to deliver most push notifications to drivers using Android devices within a subset of Eleos Platform apps. We have directly reached out to each customer who fell under this subset of affected apps. If you have not heard from us directly about this issue, then you and your drivers were unaffected by this incident.

During this window, any drivers using one of these apps on an Android device would not have received push notifications for messages they received within the Eleos Platform app. Along with this, any [manual load refresh API](https://dev.eleostech.com/platform/platform.html#operation/postLoadUpdate) calls would not have triggered the driver’s loads to be refreshed automatically.

This outage was caused by a vendor we use for delivering push notifications deprecating an API that was set up on a subset of apps. This vendor had notified us of this deprecation; however, they shut off the API a month earlier than they had scheduled. Any apps set up to use this deprecated legacy API for push notifications would have been affected. To remedy the issue, we upgraded all affected applications to the vendor’s latest version of this push notification API.

We currently lack monitoring around certain push notification failures, which is why this issue lasted as long as it did and why we were not aware of it until a customer notified us. We will be working to build out more robust monitoring around push notification failures, such that any issues like this going forward will be identified and resolved in a more appropriate amount of time.

We are deeply sorry for any disruption we may have caused for you and your drivers.
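The monitoring gap described above, where failures confined to a subset of apps go unnoticed, can be illustrated with a small sketch. This is not Eleos's monitoring implementation: the function `failing_segments`, the segment labels, the thresholds, and the sample numbers are all assumptions made for illustration. The idea is to compute the failure rate per segment (for example, per app and device platform) instead of only across all traffic.

```python
from collections import defaultdict


def failing_segments(events, min_attempts=20, max_failure_rate=0.5):
    """Return {segment: failure_rate} for segments above the failure threshold.

    events is an iterable of (segment, succeeded) pairs, where segment is a
    hypothetical label such as "app-b/android" and succeeded is a bool.
    """
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for segment, succeeded in events:
        attempts[segment] += 1
        if not succeeded:
            failures[segment] += 1

    flagged = {}
    for segment, total in attempts.items():
        rate = failures[segment] / total
        # Skip tiny samples so one failed delivery does not raise an alert.
        if total >= min_attempts and rate > max_failure_rate:
            flagged[segment] = rate
    return flagged


# Invented sample: one small app fails almost entirely, a large one is healthy.
events = (
    [("app-a/android", True)] * 1000
    + [("app-b/android", False)] * 38
    + [("app-b/android", True)] * 2
)
flagged = failing_segments(events)
overall_rate = sum(1 for _, ok in events if not ok) / len(events)
print(flagged, f"overall failure rate: {overall_rate:.1%}")
```

In this invented sample the overall failure rate is under 4 percent, which a single global threshold could easily miss, while the per-segment view flags the one affected app at a 95 percent failure rate.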

resolved

We have completed rolling out our fix for these Android push notification failures. Since 18:05 UTC, our logging indicates we are no longer encountering issues processing push notifications for Android drivers. During this time period, the push notifications for iOS drivers should have been unaffected. Thank you again for your patience.

monitoring

We have implemented a solution across affected customers and are continuing to monitor the push notification issues. If you continue to receive reports from drivers not receiving push notifications, please reach out to us at support@eleostech.com

identified

We are continuing to test our potential solution to fix these push notification issues. We will follow up once we have verified this solution works and is in place.

identified

We believe we have identified the root cause of these push notification failures and we are working to implement a possible solution. Thank you again for your patience. We apologize for the disruption. We will update once we have confirmed the solution.

investigating

As we continue to dig into the root cause, our logging indicates that roughly 10 percent of Android users may not be receiving push notifications. At this point, we still believe iOS to be unaffected.

investigating

We are still investigating the root cause and will continue to provide updates as we learn more. Thank you for your continued patience.

investigating

We appreciate your patience while we dig into this issue. We are still investigating the root cause and will continue to provide updates as we learn more.

investigating

We are continuing to investigate this issue. We will provide another update within the next 30 minutes.

investigating

We have received reports of push notifications not being sent to Android devices. We are actively investigating.

Report: "Inbound workflow issues with telematics enabled"

Last update
postmortem

We experienced a partial outage on May 6th between 18:32 UTC and 22:40 UTC and on May 7th from 13:19 UTC to 14:04 UTC, for a total of 4 hours and 53 minutes. We made an update to our servers that broke error handling for certain error conditions. Once we rolled back the changes, the outage was resolved.

The outage delayed the Eleos Platform's ability to process actions and messages that were flagged to include telematics data. This affected drivers who met all of the following criteria:

1. The driver sent an inbound message, action, or workflow using a form with `enable_telematics_data` set to `true`
2. The customer environment had the Geotab telematics integration enabled
3. However, the driver did **not** have the telematics integration configured

During these outages:

* Actions and messages that included telematics data submitted by drivers were delayed until after the outage. Workflow actions that were submitted during these times fell back to an offline state if offline workflows were configured. The apps then synchronized actions and messages after the outage.
* Drivers with the `manage_shipments` flag enabled potentially failed to retrieve updated load data.

Actions and messages sent using all other forms were unaffected. The messages and actions that failed were re-tried by the mobile apps and, after the outage, they were processed and transmitted to customer web services.

Platform mobile app users who met the above criteria and were using the system during this time period were affected. If you and your users were affected by this, we have already reached out with more specific details.

Due to the small number of drivers who met the above configuration criteria, these errors did not occur in sufficient volume to trip our existing alerting mechanisms. As a result, the errors were not evident to the on-call operator for a relatively long period of time prior to being identified and rolled back.

To prevent this from happening again, we are improving the integration between our servers and our existing monitoring tools to better surface low-volume errors introduced as part of a deployment. We're sorry for the impact this had on you and your drivers.
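One way to surface low-volume errors introduced by a deployment, sketched here purely as an illustration, is to compare the set of error signatures seen before the deployment with those seen after it, and to flag any signature that is new no matter how rarely it occurs. The function name and the signature strings below are hypothetical; the postmortem does not describe how the improved monitoring is implemented.

```python
from collections import Counter
from typing import Dict, Iterable


def new_error_signatures(
    before_deploy: Iterable[str],
    after_deploy: Iterable[str],
    min_count: int = 1,
) -> Dict[str, int]:
    """Return signatures that appear only after the deployment, with their counts.

    A volume-based alert would miss these when the counts are small; a
    novelty-based check flags them because they were never seen before.
    """
    baseline = set(before_deploy)
    counts = Counter(after_deploy)
    return {
        signature: count
        for signature, count in counts.items()
        if signature not in baseline and count >= min_count
    }
```

For example, three occurrences of a previously unseen signature are reported even when they sit alongside dozens of occurrences of a familiar one.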

resolved

We've confirmed the affected functionality is now fixed, so we're marking the incident as resolved. Specifically, this issue caused an internal error when all of the following were true:

1. Sending an inbound message, action, or workflow using a form with `enable_telematics_data`
2. on an account with a telematics integration enabled, but
3. as a mobile app user who does not have the telematics integration configured

Inbound messages and actions that experience an error are retried, subject to device connectivity. Affected workflow actions will be delivered to your messaging service now that the underlying issue is fixed. Affected mobile apps would have reverted to offline workflow during the incident period from 18:30 UTC until 21:39 UTC.

monitoring

We're continuing to verify that the functionality is working as expected, although monitoring indicates it's functioning normally. We're also working to isolate the specific configurations that were affected so we can share some additional detail about the scope of the failures. At this time, we believe that actions for users with a telematics integration configured at the account level, but disabled at the user level, were affected.

monitoring

We've rolled back the associated change and have seen the errors related to retrieving telematics info for inbound workflow actions drop to expected levels. At this time, we believe workflow actions should be functioning normally, but we are doing additional checks to confirm.

identified

We are continuing to work on a fix for this issue.

identified

We've identified a correlated change to an area of code responsible for attaching telematics info to inbound messages. We are rolling back the change.

investigating

We've received some reports of issues with inbound workflow messages that have telematics info enabled starting around 18:30 UTC. Our team is investigating these errors to determine the cause. We'll post another update shortly once we have more information.

Report: "Elevated Error Rates for Drive Axle, Document Hub, and Platform Document Delivery"

Last update
postmortem

There were two Eleos Platform outages on May 2 from 18:15 UTC to 19:46 UTC and from 20:25 UTC to 20:45 UTC, for a total of 1 hour and 51 minutes. During these outages, users could not log into Drive Axle, Document Hub, or App Manager.

During these outages:

* Workflow actions and messages that included telematics data submitted by drivers were delayed until after the outage. Workflow actions that were submitted during these times fell back to an offline state if offline workflows were configured. The apps then synchronized actions and messages after the outage.
* Drivers with the `manage_shipments` flag enabled potentially failed to retrieve updated load data.
* Drivers would have experienced delays when they attempted to upload scanned documents. If drivers logged out while documents were still queued for upload, those documents were lost.
* Drivers would have experienced delays when they attempted to retrieve their previously-scanned document list.
* Users who were already logged into App Manager would have experienced difficulties with editing document types and editing forms that have document types.

Due to a simultaneous outage of a telematics partner, Platform features that relied on their provided services, such as telematics-enabled messages and workflows, would have fallen back to their offline functionality if configured.

Regarding users who could not log into Drive Axle, Document Hub, and App Manager, our system experienced these failures because certain authentication calls inadvertently depended on telematics integration logic. Because of the simultaneous outage, these authentication calls timed out, causing resource exhaustion that cascaded to other, non-authentication requests. These requests should be independent. To make them independent, we are making changes that will decouple these requests.

We are deeply sorry for the interruptions, delays, and distraction this incident caused for you and your drivers. Compounding that, we did not communicate the existence of a known incident promptly. We are reviewing and adjusting our on-call procedures and training to correct this.
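The failure mode described above, where a slow dependency ties up shared resources and takes unrelated requests down with it, is commonly contained by giving the dependency its own small worker pool and a strict time budget. The sketch below illustrates that general pattern only; the function names, pool size, and timeout are assumptions and do not describe the actual Eleos change.

```python
import concurrent.futures
import time

# Dedicated, deliberately small pool: slow telematics calls can only ever
# occupy these workers, never the threads that serve other requests.
_telematics_pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)


def call_with_budget(fn, timeout_s, fallback=None):
    """Run fn on the dedicated pool; return fallback if it exceeds the budget."""
    future = _telematics_pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only has an effect if the call has not started yet
        return fallback


def authenticate(user, fetch_telematics, timeout_s=0.2):
    """Hypothetical login flow: telematics data is optional, never blocking."""
    telematics = call_with_budget(fetch_telematics, timeout_s)
    return {"user": user, "authenticated": True, "telematics": telematics}
```

In this sketch a login still succeeds when the telematics call is slow; it simply proceeds without the telematics data.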

resolved

Error rates have returned to normal. We apologize for the interruption of service.

monitoring

We're currently monitoring the system as the error rates have gone back to normal.

investigating

Error rates have reduced dramatically during this time period. We're still currently investigating the cause.

investigating

We are actively investigating these issues. During this time period, logging into App Manager is also affected along with logging into the Document Hub. Drive Axle users are experiencing difficulties logging in.

investigating

Logging into App Manager is also affected during this time.

investigating

We are currently investigating elevated error rates for Drive Axle and the Document Hub. Scanning and retrieval of sent documents are affected for Eleos Platform customers as well. Scanned documents will not be lost during this time and will be retried.

Report: "Reduction in Errors Logged to Error console"

Last update
postmortem

We have identified the root cause of the Eleos Platform outage that occurred on May 2, 2024 and have prepared an emergency mitigation that can be applied in case of another occurrence. This outage resulted from unexpected system behavior during a partner’s separate outage. The partner has indicated a full resolution as of May 2, 2024 at 22:00 UTC, so at this time, we do not expect an imminent recurrence. However, we’re working to implement a complete fix to the root cause of the unexpected behavior. We’ll share a more detailed post-mortem in the coming week.

resolved

We are marking this incident as resolved for now, but we're continuing to monitor system performance.

monitoring

We are still monitoring, and error rates are normal. Between 18:15 UTC and 20:45 UTC, a small percentage of workflow actions that used telematics data and a percentage of loads requests for users with the `manage_shipments` flag set on their authentication requests failed. Intermittently failed requests during this time period would have been retried.

monitoring

Our third-party partner has indicated they have applied mitigations on their end, and our error rates are normal. We are continuing to monitor the situation. The Error Console is now logging at its normal rate.

monitoring

We are continuing to investigate this issue. At this time, it appears that error rates are approaching normal; however, we are continuing to monitor the situation. We believe that this issue is related to difficulties involving a third-party partner, and we're working to mitigate the effect these issues are having on Drive Axle, the Document Hub, and App Platform.

investigating

We are still investigating the cause of this issue. Drive Axle, the Document Hub, and App Manager login are not working.

investigating

We are currently not logging errors to the Error Console.

investigating

We are seeing the same issues as earlier for Drive Axle, Document Hub, App Manager, and Platform document processing.

investigating

We are reducing the number of errors logged to the Error Console to 50% of normal volume during this time period to mitigate potential performance issues.

Report: "Elevated Error Rates"

Last update
resolved

Error Console logging is currently operating normally, and our system error rates are within our normal operating range. To protect overall system stability, at this time, a percentage of web service errors that occur may not be reflected in the Error Console or the Error Console API during periods of large overall error volume.
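To illustrate how a system can shed error logging under heavy load while recording everything in normal conditions, here is a minimal sketch of a per-window cap. It is an assumption for illustration only: the class name, the fixed 60-second window, and the cap-based approach are hypothetical, and the update above does not say how the Eleos Platform selects which errors to omit.

```python
class CappedErrorRecorder:
    """Record at most max_per_window errors per time window and count the rest."""

    def __init__(self, max_per_window, window_s=60):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.recorded = []      # errors that would be written to the console
        self.dropped = 0        # errors skipped to protect the system
        self._window = None     # index of the current time window
        self._count = 0         # errors recorded in the current window

    def record(self, error, now_s):
        window = int(now_s // self.window_s)
        if window != self._window:
            # A new window has started, so the budget resets.
            self._window = window
            self._count = 0
        if self._count < self.max_per_window:
            self._count += 1
            self.recorded.append(error)
            return True
        self.dropped += 1
        return False
```

Under normal volume nothing is dropped. During a spike the recorder keeps a bounded number of errors per window, and the `dropped` counter makes the gap visible.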

monitoring

We have turned on logging to the Error Console, and are monitoring the stability of the Eleos Platform.

monitoring

We are currently in the process of turning on Error Console logging. Some logs will continue to be missing for about 10 minutes until we bring Error Console logging fully back on. We will post an update when the Error Console is fully available again.

monitoring

We are working to put in place a fix for the underlying cause of the recent incidents. As part of the deployment of the fix we are going to be temporarily disabling writing service API errors to the Error Console and the Error Console API starting at 15:00 UTC (in about 5 minutes) until about 15:20 UTC. We appreciate your understanding as we work to get this issue completely resolved.

monitoring

As of 14:23 UTC we have reenabled service errors being written to the Error Console and the Error Console API. We are continuing to monitor the system.

investigating

As of 14:08 UTC error rates on the Eleos Platform have stabilized with mitigations in place. We are continuing to monitor the system.

investigating

As of 14:06 UTC we have temporarily disabled new service errors going to the Error Console or the Error Console API.

investigating

As of 13:50 UTC the Eleos Platform started experiencing elevated error rates. This affects the platform APIs, and the mobile apps may fall back into offline mode until the issue is resolved.

Report: "Elevated Error Rates"

Last update
resolved

The Eleos Platform has returned to its normal operating status. Error rates and overall system performance have returned to normal. The elevated error rates within the Eleos Platform began at 10:07 UTC and returned to normal at 10:31 UTC, where they have remained. API service errors were not captured between 10:29 UTC and 11:04 UTC.

monitoring

As of 10:31 UTC error rates on the Eleos Platform have stabilized with mitigations in place, and as of 11:04 UTC we have turned on writing errors to the Error Console and the Error Console API. We are continuing to monitor the system. Mobile apps should begin to send messages that were captured during the incident, and the platform should be fully operational at this time.

investigating

As of 10:29 UTC we have temporarily disabled new service errors going to the Error Console or the Error Console API.

investigating

As of 10:06 UTC the Eleos Platform started experiencing elevated error rates. This affects the platform APIs, and the mobile apps may fall back into offline mode until the issue is resolved.

Report: "Elevated Error Rates"

Last update
resolved

The Eleos Platform has returned to its normal operating status. Error rates and overall system performance have returned to normal.

monitoring

Error rates are still within the normal operating range, but out of an abundance of caution we are continuing to monitor the system.

monitoring

Error rates are within the Eleos Platform's normal operating range, but we are continuing to monitor the system to ensure error rates do not elevate again.

monitoring

We have applied mitigations and are currently monitoring. Error rates have gone back to normal. Apps should now come back online and re-upload any messages, forms, and/or documents that were queued while the system was under increased error rates.

identified

We are continuing to investigate elevated error rates. We are actively applying mitigations and have seen an overall reduction in error rates, but they are not within acceptable ranges yet.

investigating

We are investigating elevated error rates for the Eleos Platform API. Apps may fall back to offline mode during this time period.

Report: "Elevated Error Rates Between 15:01 and 15:14 UTC"

Last update
resolved

We have continued to monitor the systems since the error rates declined at 15:14 UTC. After declining, the error rates stayed steady and are currently at normal levels.

monitoring

Between 12/6 15:01 UTC and 12/16 15:14 UTC our systems experienced elevated error rates. During that window, drivers may have experienced intermittent connectivity issues. The elevated errors have since been resolved, but we will continue to monitor the systems.

Report: "Elevated Error Rates"

Last update
postmortem

Between 2023-11-13 at 23:05 UTC and 2023-11-14 at 00:01 UTC, for a total of 56 minutes, the Eleos Platform failed to process approximately 45% of incoming requests. At 23:14 UTC, our on-call engineers were paged due to the elevated error rates, and we began working to stabilize the system by scaling up its processing capacity. This temporarily reduced the error rate: errors started to decline around 23:36 UTC, but then started to increase again at 23:49 UTC. Our engineers continued to scale up the system, and at around 2023-11-13 23:57 UTC the failure rate started to decline. By 2023-11-14 00:01 UTC the error rate had returned to normal.

During the incident window, drivers using the mobile app would have experienced intermittent issues with, for example, logging in, fetching their loads, and sending messages. The app would have essentially functioned as if it were in offline mode or experiencing marginal network conditions. This also impacted requests to our public-facing APIs, such as retrieving documents from the document API and sending messages to drivers.

The underlying issue was an unanticipated interaction between the subsystem responsible for recording web service integration errors and overall request processing. The subsystem responsible for dealing with client API errors got behind on processing errors, which held up other subsystems, resulting in widespread request failure. The responding engineers worked to rectify this problem and saw the error rate drop dramatically.

While the responding engineers worked to relieve the above issues, the Platform servers were using less processing power because normal request processing was reduced. This caused the automatic scaling process to scale the number of running servers down, which further exacerbated the problem. The engineers immediately intervened and forced the automatic scaling process to scale up instead of down.

To prevent this from happening again, Eleos has identified and confirmed the underlying issue through logging, application tracing, recorded metrics, and new testing methods designed to identify unexpected interactions like this one. We are working to implement and release a complete fix. Until this work has been completed, mitigations are in place to prevent a recurrence.
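The scale-down behavior described in this postmortem can be illustrated with a small sketch: a utilization-based scaling rule proposes fewer servers when failing requests leave CPUs idle, so a guard refuses to scale down while the error rate is elevated. The function, its parameters, and its thresholds are hypothetical and are not the Eleos Platform's actual scaling policy.

```python
def desired_servers(
    current,
    cpu_percent,
    error_rate,
    target_cpu_percent=60,
    max_error_rate=0.05,
    min_servers=2,
    max_servers=50,
):
    """Return the server count a simple utilization-based rule would choose.

    The guard in the middle is the point of the sketch: low CPU during an
    incident usually means requests are failing early, not that load is low.
    """
    # Ceiling division on integers avoids floating-point rounding surprises.
    proposed = -(-(current * cpu_percent) // target_cpu_percent)
    if error_rate > max_error_rate:
        # Never scale down while errors are elevated.
        proposed = max(proposed, current)
    return max(min_servers, min(max_servers, proposed))
```

With 10 servers at 30% CPU, the rule alone would halve the fleet. With a 45% error rate, as in this incident, the guard holds the fleet at 10.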

resolved

Extra capacity has been provisioned, and error rates and response times are normal.

monitoring

Error rates and response times are normal. We have provisioned extra capacity and are investigating the contributing factors to the incident.

monitoring

Error rates are below 5% and response times are approaching normal as we continue monitoring. At this time, apps will come out of offline functionality and will resend messages as they come back online.

identified

We've identified a problem with our database capacity and we're working on provisioning more connections to compensate. Error rates are beginning to trend downwards.

investigating

We are experiencing elevated error rates and are currently investigating. Apps will fall back into offline mode and will retry sending messages.

Report: "Drive Axle iOS 1.53.102 scanning"

Last update
resolved

The engineering team has investigated and determined that the issue causing certain Drive Axle iOS document scans to be blank is not a widespread concern and is limited in impact to specific users in certain conditions. We're working directly with those users to diagnose and resolve their issues. We try to err on the side of communicating early if we have any reason to think there's an emerging issue. In this case, we lacked tools or telemetry to quickly assess the impact of this particular symptom, and so we posted a status page incident out of an abundance of caution. However, in this case, we believe this issue is isolated to a few specific users.

investigating

We're currently investigating an issue affecting some drivers using version 1.53.102 of the Drive Axle iOS app. Document scans for these drivers may appear as blank. Our engineering team is working to identify and resolve the issue. Scanning on platform apps is unaffected, as is scanning in the Drive Axle Android app.

Report: "Intermittent duplicate outbound messages"

Last update
resolved

As of 22:18 UTC we completed the rollback of the infrastructure change that caused the errant behavior, and have confirmed that messages submitted to the API more than once behave as expected. The rollback of this infrastructure change should not cause any unexpected behavior or changes in functionality of the Eleos Mobile Platform.

identified

We've identified an issue that can cause outbound messages sent using the API to appear as duplicates in the mobile app. In addition, change updates (delivered via the inbound message API) for these messages may be delivered multiple times with a single client handle, but with different message UUIDs, one per duplicate. This change specifically affects messages that are submitted to the API more than once. This should normally be permitted without resulting in duplicates, but a recent infrastructure change caused this functionality to not work correctly resulting in duplicate messages seen by users within the app. We are rolling back the change.
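The expected behavior described above, where submitting the same message more than once does not create duplicates, can be sketched as an idempotent store keyed by the client handle. This is an illustration of the general idea only: `MessageStore` and its methods are hypothetical and do not reflect how the Eleos API is implemented.

```python
import uuid


class MessageStore:
    """Hypothetical in-memory store that deduplicates by client handle."""

    def __init__(self):
        self._by_handle = {}

    def submit(self, handle, body):
        """Return (message_uuid, created).

        A repeated submission with the same handle returns the UUID assigned
        the first time instead of minting a new one, so the app and any
        change updates see a single message.
        """
        existing = self._by_handle.get(handle)
        if existing is not None:
            return existing["uuid"], False
        message_uuid = str(uuid.uuid4())
        self._by_handle[handle] = {"uuid": message_uuid, "body": body}
        return message_uuid, True

    def count(self):
        return len(self._by_handle)
```

The symptom reported in this incident corresponds to the opposite case: each resubmission received its own UUID under one client handle.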

Report: "Elevated error rates for Document API"

Last update
resolved

We've applied the mitigation and the document API error rate is now zero.

identified

A performance regression as a result of the database upgrade is causing significantly elevated error rates for the document API at https://platform.driveaxleapp.com/api/v1/documents/queued/next. We're working on a mitigation to this performance regression and expect to have the mitigation in place in about 25 minutes. Documents will remain queued in the meantime.

Report: "iOS Notifications Degraded"

Last update
resolved

From Monday, June 12, at 17:12 UTC to Wednesday, June 14, at 14:34 UTC, approximately 15% of iOS push notifications were not sent to user devices due to an issue with one of several servers that are responsible for this task. The actual content of the message was still delivered, but may have lacked an audible notification. The mobile apps continued to receive messages when the app performed a sync with the server or when manually refreshed by the driver. We apologize for the interruption of service.

Report: "Partial outdated Platform Dashboard Metrics data"

Last update
resolved

The fix has been applied and the "Detailed User Activity (current and last month)" report is now showing up to date information.

identified

We have a potential fix in place and we're working with an upstream service provider to resolve a related issue.

identified

We're continuing to implement and verify a fix for this issue.

identified

The "Detailed User Activity (current and last month)" report in Platform Dashboard metrics is currently reporting on outdated data. The underlying data is being tracked accurately, but not all recent data is available to the report. We've identified the cause and are working to implement a fix. Other Platform Dashboard reports are unaffected.

Report: "Delayed Message Integration"

Last update
resolved

Our error rates for sending push notifications to devices have remained normal since 14:15 UTC, and we are delivering message notifications to users in a timely manner. Our queue of messages that need to be integrated with customers' messaging services has remained low since 13:37 UTC, and we are integrating messages within our normal time frame. This incident was caused by an increase in response times from Google's push notification service.

monitoring

Around 14:03 UTC we began to see a marked decrease in errors in delivering push notifications to devices. As of 14:15 UTC we have cleared our backlog of notifications that queued during the course of the incident, and are now delivering notifications in a timely manner. At this time we believe the root cause of this issue was an increase in error rates seen by Google's notification service.

investigating

We have put in place a mitigation at 13:28 UTC to ensure that inbound messages are delivered to customers in a timely manner. We are still seeing delays in sending push notifications to user devices at this time.

investigating

Beginning around 13:12 UTC we started experiencing delays in integrating driver messages to customer web services. We are also investigating delays in delivering push notifications to drivers. We are investigating and will provide an update in 20 minutes or when new information becomes available.

Report: "Delayed Message Integration"

Last update
resolved

As of 15:11 UTC, all queued notifications, including iOS push notifications delayed as a side effect, had either been successfully sent or were marked as permanently failed. As of 16:48 UTC, our internal monitoring indicated error rates for sending push notifications to Android devices via Google's services had stabilized at normal levels. At 17:05 UTC, Google confirmed their corresponding incident had been resolved as of 16:51 UTC. Given both observed and confirmed upstream resolution, we're marking this incident as resolved.

identified

We are continuing to experience a slight elevation in error rates when attempting to deliver push notifications to Android users due to slow responses from Google's notification service. Failed notifications are eventually sent to users because we retry notifications that fail to send. Google has indicated that their mitigation to reduce error rates has made an improvement, but they are still working to bring the notification service response times down to normal. We are continuing to monitor our services, and will post an update when new information becomes available.
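The retry behavior mentioned in this update, combined with the "permanently failed" state mentioned in the resolution update, can be sketched as retry with exponential backoff and a bounded number of attempts. The function name, attempt limit, and delays are illustrative assumptions rather than the Eleos Platform's actual values.

```python
import time


def send_with_retry(send, max_attempts=5, base_delay_s=1.0, sleep=time.sleep):
    """Call send() until it succeeds or the attempt budget is exhausted.

    Returns "sent" on success, or "permanently_failed" once max_attempts
    calls have all raised. The delay doubles after each failure.
    """
    for attempt in range(max_attempts):
        try:
            send()
            return "sent"
        except Exception:
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(base_delay_s * (2 ** attempt))
    return "permanently_failed"
```

The `sleep` parameter is injectable so that the backoff schedule can be exercised in tests without waiting.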

identified

Google has indicated that their error rates for sending push notifications have decreased, and we have observed that we are successfully delivering push notifications to users. We are going to continue to provide updates here every 30 minutes, and will not resolve this incident until Google indicates they have resolved their service outage.

identified

We are continuing to monitor errors with delivering notifications to Android users. Google has indicated that they are putting a mitigation in place on their notification services.

identified

Google continues to have an incident with delivering push notifications to Android app users. Due to a mitigation put in place, iOS users should begin to see notifications delivered to their devices.

identified

The backlog of notifications failing to be sent to app users is caused by a Google outage of their notification services. We are monitoring our system and investigating ways to mitigate the overall impact of the outage on users.

investigating

We have put a mitigation in place and have caught up on the backlog of messages that began queuing for integration around 13:04. At this time, we also believe that notifications to app users are delayed. We are investigating the cause of this in order to restore notification delivery.

investigating

We are continuing to investigate the issue around delivering messages to messaging web services. We believe we have found a root cause and are working to put a mitigation in place to get messages flowing to integrations again.

investigating

We are currently experiencing a delay integrating messages sent by drivers to message integration services. Messages sent since 12:55 UTC are affected by this delay in integration. We will post updates here every 20 minutes during the incident, or when a new update is available over the course of our investigation.

Report: "Document FTP Outage"

Last update
resolved

We have resolved the configuration issue, and the FTP service is now available and operating normally.

identified

We have a temporary resolution in place, and some documents are being delivered. We're now working to make the temporary resolution permanent, and document delivery via FTP may be delayed for a time again as we do so. Document delivery via the Document API is still operational.

investigating

We are still investigating the FTP service outage. At this time we believe this was caused by a configuration change made by one of our vendors that is causing the FTP service to fail to connect to our database to retrieve document and authentication information. We are working on making modifications to the service to enable it to make this connection and resume operation.

investigating

We are continuing to investigate the FTP service outage.

investigating

We are continuing to investigate the FTP service outage, and are bringing in additional on-call engineers to help investigate the situation.

investigating

We are continuing to investigate the outage surrounding the FTP service.

investigating

Beginning at 16:17 UTC on December 27th the Drive Axle and Eleos Platform FTP gateway became unavailable. Documents will queue for download and will be available once the FTP service is restored. Documents will continue to be available for download via the Document API.

Report: "Platform Dashboard inaccessible"

Last update
resolved

We have not seen this issue again since the fix was implemented around 50 minutes ago.

monitoring

We are continuing to monitor for any further issues.

monitoring

The Platform Dashboard is back online.

identified

The Platform Dashboard is currently inaccessible. We have a cause identified and we're working to resolve it quickly. No impact to the apps at this time.

Report: "Residual Errors From Incident"

Last update
postmortem

From 2022-11-18 22:12 UTC until 2022-11-18 22:42 UTC (30 minutes total), the Eleos Mobile Platform experienced a partial outage caused by a sudden increase in latency and error rates when performing writes to one of our primary data stores.

During the incident period, some drivers experienced intermittent failures when using the Eleos mobile app, similar to the behavior seen when the app is offline. Similarly, some Platform Dashboard users experienced slowness and failures when attempting to view or change app settings and content. API clients, such as integrations attempting to send outbound messages to drivers, would have experienced higher-than-normal error rates. Because of the nature of the underlying issue and our high-availability architecture, not all users would have experienced or noticed errors during the 30-minute period.

Although our monitoring immediately detected the issue and the on-call engineer responded quickly, it took 27 minutes before the first customer-facing update to the status page occurred. This delay undermines the value of the status page, and we’re revising our incident handling procedures accordingly to better emphasize earlier communication.

This initial incident resulted in an additional data consistency issue affecting a small number of users, which persisted over the weekend until a server fix was deployed at 2022-11-21 17:44 UTC. Drivers affected by this additional data consistency issue were unable to receive updated app data after they modified (e.g., viewed or deleted) a subset of messages that were sent during the incident on the 18th. The server fix resolved this error without additional driver or customer action.

A more detailed narrative and root cause analysis is available from your account executive upon request.

resolved

We’ve deployed a fix for this issue to production and confirmed that affected users are now seeing successful syncs. The fix resolves the issue server-side. Drivers do not need to log out and back in to see resolution.

identified

We are aware of an issue affecting a small number of Eleos Platform mobile app users. This issue prevents the app from synchronizing new or updated data, such as messages, from the device to our servers. Logging out and back in will briefly work around this issue, but the issue will manifest again shortly. We have identified the cause of this issue, and we are working to put a fix in place in production as soon as possible. We expect to have this fix deployed in the next 30 minutes and will follow up at that time.

Report: "Partial Eleos Platform Outage"

Last update
postmortem

This outage started at around 22:06 UTC and ended at around 22:39 UTC.

resolved

We have stabilized the system and error rates have fallen back within our normal range. Users should expect normal operation, and work done on the device (such as sending messages) will retry and be processed normally.

investigating

We are experiencing a partial outage across the Eleos Platform. Users will experience a mix of online and offline behavior - the app will queue any work it cannot proceed with, as if it were offline, because it can't reach the Eleos Platform. A large number of Eleos Platform API requests will also fail. We are investigating the issue now, and will post here in 30 minutes with an update.

Report: "Elevated Error Rates and Response Times"

Last update
resolved

The error rate has remained low and performance has gone back to nominal levels.

monitoring

We have begun to see our error rate recover to normal levels, and response times continue to improve across the Eleos Platform. We will continue to monitor and provide updates here with new information.

identified

We're applying mitigations to deal with the degradation of services, and we're continuing to investigate.

investigating

We are currently investigating elevated error rates and response times. Apps may fall back to offline functionality and save data to be transmitted when the platform returns to normal operation.

Report: "Document notification emails delayed"

Last update
resolved

Queued document notification emails have all been sent and new ones are no longer delayed.

monitoring

We've identified the cause and we've brought some additional capacity online to catch back up.

investigating

Notification emails for new documents in the Hub are delayed, beginning at about 12:04 UTC. We're investigating the cause. Documents retrieved directly via the Hub and via FTP or API integration are unaffected.

Report: "Inbound messages delayed"

Last update
resolved

All queued messages have been delivered and new messages are being delivered normally.

monitoring

We've identified the bottleneck and applied a mitigation. Inbound message delivery is recovering rapidly and should be back to near-real-time within the next few minutes. At that point, we'll mark the incident as resolved.

investigating

The delivery of messages and forms sent by app users to customer API endpoints is temporarily delayed. We're investigating the cause and will provide additional updates.

Report: "Geotab Drive co-driver login error"

Last update
resolved

The following information is only applicable to customers using the Geotab Drive telematics integration "Add Team Driver" feature. Earlier today at 04:00 UTC, Geotab rolled out a change to how driver login sessions, which we use for both API calls and for Geotab Drive SSO, are issued. Although our Geotab integration is compatible with these changes, the Geotab Drive app is not fully compatible. Geotab Drive will not honor still-valid sessions issued prior to this change and instead displays the error "You cannot add a driver from a different server or database." If you receive reports of this issue, please open a ticket with Geotab via your reseller requesting that the affected sessions be invalidated. Once the broken sessions are invalidated, our integration will automatically re-issue a new session the next time the co-driver attempts to log in. You may be able to work around the issue from within the Eleos Platform Dashboard by changing the password used for the integration account listed under Service Config - Geotab Configuration. This has the effect of clearing all driver Geotab sessions, whether they are valid or not. Only consider this option as a last resort, as it may lead to hitting undocumented Geotab authentication rate limits. (Disabling and re-enabling the integration is not sufficient to invalidate all sessions.)

Report: "Document delivery outage"

Last update
resolved

Document processing has returned to normal operation, and the Platform Dashboard and Document Hub are responding as normal.

monitoring

Our document processing pipeline has begun handling documents again. Our systems are working through the queue of pending work now, and documents are being delivered.

identified

The Platform Dashboard and Document Hub are experiencing degraded performance with elevated error rates. We're continuing to monitor as our upstream provider works to resolve the issue.

identified

Beginning at 10:53 EDT, document processing and delivery have been experiencing an outage due to issues with an upstream provider. Our other systems are currently not affected. Pending documents are being queued during the upstream issue, and will be processed and delivered once the system is operational again. We're monitoring our systems and we'll update as the issue changes.

Report: "Elevated error rates"

Last update
resolved

Metrics have returned to normal and the system is operating as expected. We're continuing to work toward resolving the underlying performance bottleneck with our current development efforts.

monitoring

We are continuing to monitor system stability at this time. System response times have returned to normal.

monitoring

Error rates have returned to normal, but some portions of the system are experiencing longer than normal response times. We're continuing to monitor and work toward complete resolution. Drivers may experience slight delays in receiving messages.

investigating

The mitigation we applied has improved system stability but error rates are still elevated. We're continuing to pursue further mitigations to return error rates to normal levels.

investigating

We're continuing to investigate and believe we've identified a proximate cause. We're applying a mitigation to bring the system to a more stable state.

investigating

We're currently investigating elevated error rates for the platform. Apps may fall back to offline functionality and save data for later transmission when error rates return to normal levels.

Report: "Elevated Error Rates"

Last update
resolved

We're sorry for the impact to system performance caused by this incident. We have ongoing development work that addresses the underlying performance bottleneck.

monitoring

Error rates have dropped to normal levels; however, we are still monitoring the situation.

investigating

Error rates are still elevated but have decreased since the beginning of the incident. We're continuing to investigate.

investigating

We are currently investigating elevated error rates for the platform. During this period, apps may fall back to offline functionality and save data to be transmitted when the platform returns to normal operation.

Report: "Elevated error rates"

Last update
resolved

The upstream provider has closed their own incident, indicating the outage was caused by an incorrect configuration change that was immediately rolled back. Given this explanation and the fact this outage had a well-understood cause, we don't anticipate a re-occurrence, and we're marking this as resolved.

monitoring

The affected underlying service is back online and error rates have returned to normal. Error rates varied from 6-10% of all requests from 20:42 UTC until 20:56 UTC. The outbound messaging API returned an increased number of HTTP 500s during this time, which would have caused calling applications to back off and delay outbound messages. This API is fully operational again. Inbound messages from drivers would also have been delayed during the outage, but all messages that were successfully delivered to the server from the mobile app have now been delivered to customer web service endpoints. Most or all mobile app "verify" requests, which are responsible for updating the configuration of the app menu and dashboard, failed during this incident, which would have caused updates made using App Editor or custom server-side logic to not be immediately reflected in the apps. We believe this incident is resolved, but are leaving it in monitoring until we have a better understanding of the upstream cause. We're working with the vendor of that service to do so.
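The back-off behavior mentioned above for applications calling the outbound messaging API can be sketched as follows. This is a minimal illustration only, not the Eleos API or any official client: the `send` callable, the function names, and the delay values are all hypothetical, and the sketch assumes the caller retries on any HTTP 5xx status using exponential backoff with full jitter.

```python
import random
import time
from typing import Callable, List, Tuple


def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """Upper bound for each retry wait: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(max_retries)]


def send_with_backoff(
    send: Callable[[], int],
    max_retries: int = 4,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Call `send` until it returns a non-5xx status or retries run out.

    `send` is a hypothetical callable that performs one request and returns
    the HTTP status code. Returns (last_status, attempts_made).
    """
    status = send()
    attempts = 1
    for bound in backoff_delays(max_retries, base, cap):
        if status < 500:
            break  # success or a client error: retrying will not help
        # Full jitter: wait a random time between 0 and the current bound
        # so that many clients do not retry in lockstep.
        sleep(random.uniform(0, bound))
        status = send()
        attempts += 1
    return status, attempts
```

With this approach, a short burst of HTTP 500s delays a message by a few seconds rather than dropping it, which matches the delayed-but-delivered behavior described for this incident.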

investigating

We're investigating increased error rates affecting about 10% of requests. These appear to be originating from the failure of a service we use, but we're working to confirm and will share an update with additional detail and information about user impact in 10 minutes.

Report: "Elevated error rates of Eleos Platform Apps"

Last update
resolved

Our investigation indicated memory usage was spiking in response to particular traffic patterns from the mobile apps. This traffic has stopped naturally, which allowed the system to stabilize. Error rates returned to normal at 07:35 UTC. We've applied a mitigation to prevent this specific traffic from causing a re-occurrence and will resume our normal monitoring.

monitoring

We have stabilized the system, and at this time we believe we have found the root cause. We are working to put a fix in place.

monitoring

The stability problems appear to be related to memory usage. We've pulled in more of the team to help determine the cause and apply a fix.

monitoring

With the mitigation efforts in place we are continuing to monitor the error rates of the Platform.

investigating

We have applied some mitigating measures to bring down the error rate across the Eleos Platform, but we are continuing to investigate the root cause of the errors and come to a complete resolution of the issue. We will post another update here in 30 minutes.

investigating

We are currently experiencing elevated error rates within the Eleos Platform when fetching up-to-date data, and when sending messages. Messages sent during this time will be stored on device, and Eleos Platform apps will continue to retry sending the messages until they are successfully sent. We will post an update here in 30 minutes or sooner if new information is available.
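The store-and-retry behavior described above can be pictured with a small sketch. It is an assumption-laden illustration, not the actual app implementation: the `Outbox` class, its methods, and the `send` callable are hypothetical names, and the sketch assumes messages are kept in order and retried until the platform accepts them.

```python
from collections import deque
from typing import Callable, Deque, List


class Outbox:
    """Hypothetical on-device queue that holds messages until they are sent."""

    def __init__(self, send: Callable[[str], bool]) -> None:
        # `send` attempts one delivery and returns True on success.
        self._send = send
        self._pending: Deque[str] = deque()

    def enqueue(self, message: str) -> None:
        """Store a message locally; nothing is sent yet."""
        self._pending.append(message)

    def flush(self) -> int:
        """Send pending messages in order, stopping at the first failure.

        Returns the number of messages delivered in this pass. Messages
        that fail stay queued and are retried on the next call.
        """
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent

    def pending(self) -> List[str]:
        """Messages still waiting to be delivered, oldest first."""
        return list(self._pending)
```

Calling `flush()` periodically means nothing is lost while error rates are elevated: failed messages simply remain in the queue until a later pass succeeds.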

Report: "Elevated error rates"

Last update
postmortem

Last night, our monitoring systems observed elevated error rates, and we posted this incident out of an abundance of caution as we investigated the root cause. However, after investigation, we determined that our systems remained healthy and operational throughout the entirety of this incident, and that our monitoring systems had misidentified some things as internal errors when they were not. We have made adjustments to our monitoring systems to avoid this happening again, and we apologize for the false alarm.

resolved

This incident has been resolved.

investigating

Error rates have returned to normal levels. We are continuing to monitor the situation.

investigating

We are continuing to investigate this issue.

investigating

We are currently experiencing higher-than-normal error rates. We are investigating the issue.

Report: "Elevated Failure Rate"

Last update
resolved

On Monday evening from about 4:15 PM EST to 5:20 PM EST we observed an elevated failure rate for client API calls, which was the result of a bug in our system that resulted in excessive resource contention. During this period most requests succeeded, but some drivers would have experienced some failures in the platform app doing things like fetching loads, todos, workflows, etc. We have identified a root cause, and we are identifying a solution to prevent this from happening again.