Historical record of incidents for HyperTrack
Report: "Elevated errors and performance degradation"
Last update:

# **Postmortem: System-Wide Outage Due to Database Degradation**

**Incident Date:** May 23, 2025
**Time to Resolution:** 85 minutes
**Status:** Resolved
**Severity:** Critical (P0)

### **Summary**

On May 23, 2025, our platform experienced a widespread outage due to degraded performance in our database infrastructure. Specifically, a set of read replicas were affected during the period. This degradation resulted in elevated error rates and unavailability across multiple APIs, including Orders, Workers, Places, and SDK-related services. The issue was fully resolved within 85 minutes. We understand how critical our services are to your operations and sincerely apologize for the disruption.

### **What Happened**

A query pattern in our system targeting a key Orders API table failed to use a necessary index. This led to full table scans that overloaded some of our reader instances. As a result, several core APIs failed or experienced extreme latency.

### **Impact**

* Customers experienced timeouts or errors when accessing Orders, Workers, and Places APIs
* Monitoring and dashboard functionality was temporarily unavailable

### **What We Did**

* Identified the problematic query
* Deployed a hotfix to ensure proper index usage
* Applied a secondary patch to reduce load when workers were not actively tracking
* Restarted degraded infrastructure and monitored stabilization
* Performed a full incident review across impacted components

### **Remediation and Next Steps**

We are taking the following actions to ensure this does not happen again:

* **Automated slow-query detection**: We’re enhancing our review pipeline with weekly audits and real-time alerting.
* **Improved infrastructure alarms**: CPU and query performance alarms will provide earlier visibility into degradation.

### **Final Thoughts**

We are committed to providing a stable and resilient platform. This incident has highlighted areas we must improve, and we’re taking swift action to reinforce our architecture. Thank you for your trust and patience.
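
The failure mode described in the postmortem (a query that does not use a needed index and falls back to a full table scan) can be illustrated with a minimal sketch. It is not HyperTrack's tooling: the postmortem does not name the database engine, so SQLite from the Python standard library stands in for it, and the `orders` table, the `worker_id` column, and the helper names are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' strings SQLite reports for a query's plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # columns: id, parent, notused, detail


def uses_full_scan(plan):
    """True if any plan step is a scan rather than an index search."""
    return any(step.startswith("SCAN") for step in plan)


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, worker_id TEXT, status TEXT)"
)
sql = "SELECT id FROM orders WHERE worker_id = ?"

# Without an index on worker_id, the planner has to scan the whole table.
before = query_plan(conn, sql, ("w1",))

# With the index in place, the same query becomes an index search.
conn.execute("CREATE INDEX idx_orders_worker ON orders (worker_id)")
after = query_plan(conn, sql, ("w1",))

print("full scan before index:", uses_full_scan(before))
print("full scan after index:", uses_full_scan(after))
```

A check of this kind, run against the queries a service issues, is one possible building block for the automated slow-query detection mentioned above. On a production database the equivalent would be that engine's own `EXPLAIN` output and slow-query log.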
The issues were resolved at 18:45 UTC. The team is gathering data for the postmortem and action steps to prevent future degradations.
We are continuing to investigate this issue. The issue emerged at 17:20 UTC.
We are currently investigating the issue and working to resolve it.
Report: "Platform degradation and elevated errors"
Last update: Due to underlying database conditions, we are experiencing degraded platform performance and elevated errors. The team is currently investigating and troubleshooting.
Report: "Performance degradation"
Last update: We experienced degraded performance in one of our databases. This caused elevated response times and timeouts in parts of our service from 12:22 am to 12:50 am UTC.
Report: "ETA Processing Issue"
Last update: Less than 5% of ETA calculations were delayed by up to one hour between April 1, 20:00 UTC and April 2, 00:00 UTC due to a processing issue.
Report: "ETA Processing Issue"
Last update: Less than 5% of ETA calculations were delayed by up to one hour between 22:00 UTC and 23:00 UTC due to an underlying infrastructure issue.
Report: "Orders API Incident – Small Segment of Customers Impacted"
Last update: We identified a bug in the Orders API implementation that caused an issue with the device-switching logic for workers. As a result, a segment of customers using the Orders API was affected. Starting at Sep 10, 2024, 6:16:22 AM UTC, the bug impacted certain tracked orders where device tracking was inadvertently stopped before the order was completed. The issue was resolved at Sep 10, 2024, 4:38:07 PM UTC, and our team is closely monitoring to ensure full system stability.
Report: "Webhook Degradation"
Last update: A significant increase in traffic caused delays in webhook delivery of up to 30 minutes. This issue started at 10:30 UTC and was resolved by 19:30 UTC. We’ve upgraded our infrastructure to handle similar traffic spikes in the future, ensuring more stable and timely webhook processing.
Report: "Platform Degradation"
Last update: A configuration update caused a small subset of devices to have degraded tracking data.
Report: "Critical platform services outage"
Last update: The engineering team deployed a fix. Critical services are returning to normal and are being actively monitored. No event data from mobile devices was lost for orders in the active state.
The issue has been identified. The team is coming up with a fix.
Critical infrastructure systems are experiencing errors. The engineering team is currently investigating.
Report: "Repeat webhook payloads have been resent for several hours today UTC."
Last update: Repeat webhook payloads were resent for several hours today (UTC).
Report: "Geotags ingestion issue"
Last update: An API networking issue caused geotags to be dropped.
Report: "Partial system delays"
Last update: The core pipeline experienced delays of up to an hour, affecting a small portion of devices.
Report: "Trips API webhooks outage"
Last update: Trips v1 API arrival and exit webhook payload processing failed due to a faulty code release. The problem is resolved.
Report: "Cloud Provider Outage impacted HyperTrack Service"
Last update: A cloud provider outage impacted data ingestion.