Historical record of incidents for Sardine AI
Report: "Intermittent degradation due to an issue on a third party provider"
Last update: The following services might be affected:
- API latency
- Dashboard access
The team is investigating root causes. We will post updates here as soon as we have more information.
Report: "Data provider blockchain analysis provider outage"
Last update: Our downstream data provider for blockchain analysis has been experiencing an outage since 2025-06-11 12:30:27 UTC. We are working with them on a resolution.
Report: "Degradation on customer intelligence search"
Last update:
- The search bar that allows searching certain attributes (like email, sessionKey and others) has been disabled momentarily.
- Certain attributes are not working in the filter option.
We are currently investigating this issue.
Report: "Dashboard not available for certain areas of the Analytics reports"
Last update: This incident has been resolved.
Certain users may experience issues loading some areas of the dashboard, such as parts of Analytics. We are working on restoring these services at the moment.
Report: "Dashboard not available for certain areas of the Analytics reports"
Last updateThis incident has been resolved.
Certain users may experience issues with loading some of the areas of the dashboard such as parts Analytics. We are working on restoring the services at the moment.
Report: "Analytics dashboard is not available"
Last update: The issue has been resolved.
Currently the analytics dashboard is not working (users cannot view data; the data itself is unaffected). Our engineering team is working on restoring the dashboard.
Report: "Analytics dashboard is not available"
Last updateThe issue has been resolved.
Currently the analytics dashboard is not working (user cannot view data, however the data itself is ok). Our engineering team is working on restoring the dashboard.
Report: "Elevated latency with intermittent connectivity issues on endpoints"
Last update: We encountered elevated latency on endpoints with intermittent connectivity issues. This incident is resolved, and we will provide more details on this issue through a postmortem.
We are experiencing elevated latency on endpoints with intermittent connectivity issues. Our engineering team is investigating. There was downtime from 1:44 to 1:49 PM UTC.
Report: "Elevated latency with intermittent connectivity issues on endpoints"
Last updateWe encountered elevated latency on endpoints with intermittent connectivity issues. This incident is resolved, and we will be providing more details of this issue through a postmortem.
We are experiencing elevated latency on endpoints with intermittent connectivity issues. Our engineering team is investigating.There was a downtime of 1:44 - 1:49 PM UTC.
Report: "Elevated latency in production instance (NOT EU)"
Last update: We had an increase in latency around 20:40 UTC. It is now resolved.
Report: "Bank account verification error"
Last update: From 02:16 UTC to 04:03 UTC on April 26th, bank account verification requests resulted in the BVERR reason code (Bank data provider error: the bank data provider returned an error while trying to retrieve the bank account information) due to an outage with our downstream data provider.
Report: "Bank Account Verification issue"
Last update: From 17:00 UTC to around 23:59 UTC on April 25th there was an issue with Bank Account Verification, causing an abnormal increase in the BVERR reason code (Bank data provider error: the bank data provider returned an error while trying to retrieve the bank account information).
Report: "Intermittent connectivity issue on EU region"
Last update:
### **Introduction**
* **Purpose:** This report provides an overview of the recent service disruption impacting users in the EU region.
* **Apology:** We sincerely apologize for the inconvenience this may have caused and remain committed to maintaining a high level of service reliability.
### **Incident Overview**
* **Duration:** 45 minutes, from 2025-04-22 13:45 to 14:30 UTC
* **Region Affected:** EU
* **Services Affected:** `/v1/customers` endpoint and **business-events** service
### **Root Cause Analysis**
* **Primary Issue:** A misconfigured feature flag initiated the disruption.
* **Secondary Factor:** A related configuration change caused service instability.
### **Impact**
* **Service Accessibility:** Intermittent connectivity issues were experienced throughout the incident window.
* **Service Downtime:** The **business-events** service was fully unavailable for part of the duration.
* **Summary:** Intermittent connectivity issues across the EU region during the affected window.
### **Corrective Actions and Improvements**
* **Immediate Response:** The misconfiguration was reverted and services were promptly restored.
* **Ongoing Improvements:** We are implementing additional safeguards around configuration changes and enhancing monitoring across regional environments.
### **Conclusion**
* **Commitment:** We remain focused on delivering dependable and resilient services to all partners.
* **Appreciation:** Thank you for your understanding and continued trust.
We noticed an intermittent connectivity issue with certain endpoints, /v1/customers in particular, in the EU region. Engineers were assigned to fix it immediately. Everything was back up and running smoothly by 2:30 p.m. (PT) / 9:30 p.m. (CEST).
Report: "Increased latency in the API request, rate limiting errors and issues with dashboard access"
Last update:
## All times are PST
## Overview
_On Apr 19 and Apr 23, Sardine experienced an increase in latency due to a huge increase in traffic._
The `/customers`, `/issuing/risks`, `/feedbacks` and `/devices` APIs were affected during the following times (Pacific time):
* Apr 19: 08:34-08:46, 08:51-08:53
* Apr 23: 08:13-08:16, 08:37-08:42, 08:54-08:58, 09:11-09:16
## What happened
Sardine experienced a huge increase in traffic. While we have rate limiting and auto scaling in place, our system was overloaded, which caused performance degradation.
## Impact
The `/customers`, `/issuing/risks`, `/feedbacks` and `/devices` APIs experienced increased latency.
## Timeline
| Date | Status |
| --- | --- |
| **April 19 2025, 0834hrs - 0846hrs** | Risk and Device APIs were experiencing increased latency. We started manually scaling Nginx horizontally and vertically (while the autoscaler was in place, we did it manually to make it faster). |
| **April 19 2025, 0846hrs - 0851hrs** | All APIs were back up. |
| **April 19 2025, 0851hrs - 0853hrs** | Risk and Device APIs were experiencing increased latency. |
| April 19 2025, 0853hrs onwards | All APIs were back up. No issues moving forward. |
| April 23 2025, 0813hrs - 0816hrs | Risk and Device APIs were experiencing increased latency. Sardine engineers analyzed the traffic and started enabling some rate limit rules. |
| April 23 2025, 0837hrs - 0842hrs | Risk and Device APIs were experiencing increased latency. Sardine engineers analyzed the traffic and hardened the rate limit rules. Sardine engineers also started to scale up our Nginx servers vertically. |
| April 23 2025, 0854hrs - 0858hrs | Risk and Device APIs were experiencing increased latency. Sardine engineers analyzed the traffic and hardened the rate limit rules. A ban rate limit rule was also set (which took effect at 0911hrs). |
| April 23 2025, 0911hrs - 0916hrs | Risk and Device APIs were severely rate limited due to a misconfiguration. |
| April 23 2025, 0916hrs onwards | The misconfiguration was lifted and systems went back online. |
## What we’re doing to prevent future issues
* We have enhanced our rate limiting system and updated our automated mitigation setup so that, in the future, similar traffic will be automatically blocked (a minimal rate-limiting sketch follows this report).
* We have also created a new web application framework configuration to ensure such traffic spikes are properly handled.
* We are also creating dedicated instances to better handle such spikes.
This incident has been resolved.
The issue has been identified and a fix is being implemented.
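The postmortem above mentions enhancing the rate limiting system so that similar traffic spikes are blocked automatically. As a minimal sketch only (the client header, limits, and middleware shape are hypothetical, not Sardine's actual setup), a per-client token-bucket limiter in Go could look like this:

```go
package main

import (
	"log"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// clientLimiter hands out one token-bucket limiter per API client.
// The limits below are illustrative, not real thresholds.
type clientLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit // steady-state requests per second
	burst    int        // short burst allowance
}

func newClientLimiter(rps rate.Limit, burst int) *clientLimiter {
	return &clientLimiter{limiters: map[string]*rate.Limiter{}, rps: rps, burst: burst}
}

func (c *clientLimiter) get(clientID string) *rate.Limiter {
	c.mu.Lock()
	defer c.mu.Unlock()
	l, ok := c.limiters[clientID]
	if !ok {
		l = rate.NewLimiter(c.rps, c.burst)
		c.limiters[clientID] = l
	}
	return l
}

// middleware rejects over-limit requests with 429 instead of letting
// them pile up on the backend.
func (c *clientLimiter) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		clientID := r.Header.Get("X-Client-Id") // hypothetical client identifier header
		if !c.get(clientID).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	limiter := newClientLimiter(100, 200) // 100 rps steady, bursts of 200
	mux := http.NewServeMux()
	mux.HandleFunc("/customers", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", limiter.middleware(mux)))
}
```

Requests beyond the configured burst receive a 429 immediately, so a traffic spike degrades into rejected requests rather than elevated latency for everyone.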
Report: "Increased latency in the API requests"
Last update: This incident shares the combined Apr 19 / Apr 23 postmortem published above under "Increased latency in the API request, rate limiting errors and issues with dashboard access".
This incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "Partial downtime"
Last update: This incident shares the combined Apr 19 / Apr 23 postmortem published above under "Increased latency in the API request, rate limiting errors and issues with dashboard access".
Partial downtime on the API and dashboard from 15:40 UTC to 15:56 UTC.
Report: "Sandbox connectivity issues"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are experiencing connectivity issues on the Sandbox instance and we are currently investigating the root cause.
Report: "Elevated Latency"
Last update: We experienced elevated latency on our /customers and /devices APIs from 18:15 UTC to 18:55 UTC.
Report: "SANDBOX environment is inaccessible"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Elevated latency for customers API and issuing risk API"
Last update:
## Overview
Repeated timeouts and increased query latency on a few of our read replica clusters resulted in client-visible API latency.
## What happened
Recent code changes and changes in traffic patterns resulted in slow database queries. This increased latency, which in turn caused retries from some of our clients. Because we auto-scale pods based on traffic volume, this caused spikes in database connections, which led to further performance issues.
## Impact
Our API latency degraded severely during the following times:
* March 20: 20:06-20:41
* March 22: 1:00-1:23
* March 22: 2:14-2:34
* March 25: 18:26-19:01
## What went wrong
* Internal communication took a while before we could communicate the issue to our clients
* Detecting the root cause took us a while
## Action items
| Action Item with Description | Target |
| --- | --- |
| Scale up database resources | DONE |
| Update database connection limit and other configurations (see the connection-pool sketch after this report) | March 31 |
| Provision a separate DB resource for one of our services | April 1 |
| Tighten up internal timeout config | April 1 |
| Optimize known slow query 2 | April 1 |
| Optimize known slow query 1 | DONE |
| Optimize feature computation backend | ongoing project, end of Q2 |
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
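One of the action items above is updating database connection limits so that pod auto-scaling cannot turn into a connection storm on the read replicas. A minimal sketch, assuming a Go service using database/sql (the driver, DSN, and numbers are placeholders, not Sardine's real configuration):

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres driver; any database/sql driver works the same way
)

func main() {
	// Placeholder DSN; real credentials would come from configuration.
	db, err := sql.Open("postgres", "postgres://user:pass@replica:5432/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// Cap the pool so that (pods x MaxOpenConns) stays below the database's
	// connection limit even when the autoscaler adds pods during a spike.
	db.SetMaxOpenConns(20)                  // hard ceiling per pod
	db.SetMaxIdleConns(10)                  // keep a small warm pool
	db.SetConnMaxLifetime(30 * time.Minute) // recycle connections periodically
	db.SetConnMaxIdleTime(5 * time.Minute)  // drop idle connections quickly

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("connection pool configured")
}
```

The key design point is that the per-pod ceiling is chosen with the autoscaler's maximum pod count in mind, so retries and scale-out events cannot exhaust the database's connection budget.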
Report: "Performance degradation on customers API and issuing risk APIs"
Last update: Our service experienced higher-than-usual latency from 19:03 to 19:14 UTC because one of our databases was overloaded. We have updated the infra configuration and the issue is now resolved.
Report: "Payment service has downtime"
Last update: This incident was resolved as of 4:10 PM PT.
We have been experiencing an anomalous surge in traffic since 11:40 AM PT, and crypto.sardine.ai is experiencing instability.
Report: "Degradation across all APIs due to very high traffic spike"
Last update: From 21:14 UTC to 21:23 UTC we had degradation across all APIs due to a huge spike in traffic.
Report: "Dashboard not showing transaction data"
Last update: This incident has been resolved.
We are currently experiencing an issue with displaying the transaction data for sessions in the dashboard for the Production and Sandbox instances. We are actively working on this; in the meantime the information is available in the "View Request" widget and can be viewed there.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently experiencing an issue with displaying the transaction data for sessions in the dashboard for the Production and Sandbox instances. We are actively working on this; in the meantime the information is available in the "View Request" widget and can be viewed there.
Report: "Dashboard can't be accessed"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Crypto service experiencing intermittent issues"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Increased latency on customers API"
Last update: We experienced elevated latency for the Customers API between 19:04 and 19:12, and between 20:06 and 20:13.
Report: "Increased latency"
Last update:
## Overview
_On Mar 3rd, Sardine experienced an increase in latency due to increased database usage._
The `/customers` endpoint experienced spikes in latency around the start of every hour for 15 minutes between 8 and 11 Pacific time, and the `/issuing/risks` endpoint experienced intermittent higher latency during 08:00 - 10:54 Pacific time.
`/customers` latency spikes:
* 8:00 - 8:15
* 9:00 - 9:15
* 9:55 - 10:15
* 11:00 - 11:15
`/issuing/risks` endpoint latency spikes:
* 08:02 - 08:44
* 09:42 - 10:12
* 10:30 - 10:54
## What happened
Sardine encountered a surge in traffic originating from a client. The intermittent and unpredictable nature of these spikes presented challenges in real-time detection and impact assessment, subsequently hindering our ability to implement timely mitigation strategies.
## Impact
The `/customers` and `/issuing/risks` APIs experienced intermittent higher latency during 08:11 - 10:06 Pacific time. Clients using the advanced aggregation feature were more impacted.
## Timeline (all Pacific time)
* 08:02: Incident starts - latencies for both endpoints start going up.
* 8:10: Oncall engineer paged due to the increased latency.
* 8:23: Alert was auto-resolved. The oncall engineer started digging into the root cause; latency was not yet back to normal but was in a more manageable situation.
* 9:00 - 9:35: We had an oncall handoff meeting; this latency issue was mentioned, but no root cause had been detected yet and the latency seemed under control.
* 9:42: A customer notices the latency issue and communicates it to the Sardine team.
* 10:00: p95 latency becomes a sustained issue; the new oncall engineer starts investigating.
* 10:20: Oncall discovers the queries creating the bottlenecks; engineers start checking whether there is a bypass or a quick enhancement possible to remove the bottleneck, or whether scaling the DB is our only option.
* 10:40 AM: The DB is scaled up; within a few minutes the incident ends.
## What went wrong
* Slow to detect the request spikes and queries that were causing latency issues.
* Oncall handoff happened during the incident, and the issue wasn't properly handed off.
* The old oncall engineer thought the issue was a one-off latency increase due to spikes.
* The new oncall engineer wasn't tagged on old threads about this topic.
## Action items
* Enforce stricter query timeouts for the issuing API (see the sketch after this report)
* Ongoing query optimizations on the advanced aggregations feature that will mitigate the risk of this happening again
* Enhance internal processes around alerts and escalation
* Process update to oncall handoff and incident handling - if an incident happens around oncall handoff, both the new and old oncall will use the handoff meeting as a working session
We were experiencing elevated latency for the customers and issuing APIs between 8:00 am PT and 10:00 am PT due to unusual traffic volume; the issue is resolved.
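The action items above include enforcing stricter query timeouts for the issuing API. A minimal sketch, assuming a Go service and database/sql (the table, column, and timeout values are hypothetical):

```go
// Package issuingrisk sketches a per-query deadline so a slow aggregation
// fails fast instead of holding a database connection.
package issuingrisk

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

// QueryRiskFeatures runs an aggregation with a hard deadline. The table and
// column names are illustrative only.
func QueryRiskFeatures(ctx context.Context, db *sql.DB, customerID string) (int, error) {
	// Bound the query to 500ms; tune per endpoint SLO.
	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()

	var txCount int
	err := db.QueryRowContext(ctx,
		`SELECT COUNT(*) FROM transactions WHERE customer_id = $1`,
		customerID,
	).Scan(&txCount)

	if errors.Is(err, context.DeadlineExceeded) {
		// Degrade gracefully: surface a typed error the caller can handle
		// instead of letting requests queue up behind a slow query.
		return 0, fmt.Errorf("risk feature query timed out: %w", err)
	}
	return txCount, err
}
```

With a deadline in place, a burst of slow aggregations turns into fast, visible errors rather than a growing backlog of connections that drags down every endpoint sharing the database.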
Report: "Short spike in latency for customers and issuing APIs"
Last update: We had elevated latency for the customers and issuing APIs between 5:04 am PT and 5:06 am PT (about a 2-minute window). This was caused by AlloyDB overload. The issue is resolved.
Report: "Customers API degradation"
Last update: We experienced a short degradation of our customers API between ~5:07 am PT and ~5:14 am PT. All services are restored to normal activity.
Report: "Device Features and Device Risk Signals are missing from some session in the Dashboard (UI issue, NOT data issue)"
Last update: This incident has been resolved.
It started on February 20, 2025 at 4:38:37 AM GMT-8.
What is affected specifically?
- Visualization of device features in the dashboard (API requests/responses were unaffected)
- The underlying information is NOT affected; it will be shown again once we fix the issue.
We are currently investigating this issue.
Report: "/v1/rules endpoint in PROD not responding properly"
Last update: This incident has been resolved. This is a purely informational endpoint (not transactional) that allows customers to list the rules they own.
We are having some issues with the /v1/rules endpoint; we are investigating.
Report: "Elevated API latency"
Last update:
## Overview
_On Jan 31, Sardine experienced an increase in latency due to increased database usage._
The `/customers`, `/issuing/risks`, and `/devices` APIs experienced intermittent higher latency during 9:03 - 13:16 Pacific time.
## What happened
Sardine experienced an increase in traffic due to overall organic traffic growth. Additionally, some of our clients sent us batch-based traffic that was bursty in nature. Some of our databases didn't scale well to handle the bursty traffic.
## Impact
_The `/customers`, `/issuing/risks`, and `/devices` APIs experienced intermittent higher latency during 8:03 - 13:16 Pacific time._
## Timeline (all Pacific time)
* 8:03 AM: Incident starts
* 8:04 AM: Oncall engineer paged due to the increased latency
* 8:21 AM: Alert was auto-resolved. The oncall engineer concluded it was a one-off latency spike
* Around 9:00 AM: A client reached out to Sardine due to latency concerns
* 9:00 AM - 13:00: Multiple alerts were triggered. The oncall engineer was investigating
* 13:33: Root cause identified
* 13:40: Scaling and config changes were performed (incident ends)
## What went wrong
* While an internal alert notified the oncall engineer, it wasn't escalated soon enough
* Inefficient internal logic caused a performance bottleneck
* The rate limit didn't provide sufficient control
## Action items
* Enhance internal processes around alerts and escalation
* Fix inefficient old logic to avoid similar issues in the future
This incident has been resolved. We'll be adding a PostMortem with the full description and action items here as soon as we have it.
We are currently experiencing elevated latency and are actively working to resolve the issue
Report: "Issue in storing request/responses data"
Last update: Due to a configuration issue, some of the request/response data was not stored in our system. You may encounter an error when you attempt to "view request" or "view response" from the Sardine dashboard. This issue didn't affect any of the online risk scoring.
Incident timeline:
- 6:36 AM Jan 28 to 5:41 AM Jan 30 PST for api.eu.sardine.ai
- 3:36 PM Jan 29 to 5:41 AM Jan 30 PST for api.sardine.ai
Report: "pub/sub outage"
Last update:
### Summary
One of Sardine’s cloud providers (Google Cloud) had [an outage with their pub/sub service](https://status.cloud.google.com/incidents/ghMho2Gka33Exr9UNavz) (messaging service). This resulted in data loss for the secondary data we show in the Sardine dashboard. The incident began at **2025-01-08 06:54** and ended at **2025-01-08 08:07** (all times are **US/Pacific**).
### Root cause
Sardine utilizes pub/sub across various subsystems for scalability and low latency. The Sardine system writes to our primary database within the API request path, but some logic is then performed asynchronously, triggered by a message to pub/sub. The outage therefore resulted in missing data in the Sardine dashboard. Additionally, it resulted in data loss for our API log table, which makes data backfill very difficult.
### Future enhancement
* [Q1] Remove the pub/sub dependency from API logging so we can replay APIs in cases like this (see the logging sketch after this report)
* [Q1] Enhance alerting around pub/sub issues
One of Sardine’s cloud providers (Google Cloud) had an outage with their pub/sub service (messaging service): https://status.cloud.google.com/incidents/ghMho2Gka33Exr9UNavz
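The first future enhancement above is removing the pub/sub dependency from API logging so APIs can be replayed after an outage like this. A minimal sketch of that idea, assuming a Go backend with the cloud.google.com/go/pubsub client (the table schema and function are hypothetical): persist the log synchronously first, then treat the publish as best-effort.

```go
// Package apilog sketches durable API logging that does not depend on pub/sub.
package apilog

import (
	"context"
	"database/sql"
	"log"

	"cloud.google.com/go/pubsub"
)

// LogAPICall writes the request log to the primary database first, so the
// record survives even if pub/sub is down, then publishes asynchronously for
// downstream consumers. Table and column names are illustrative.
func LogAPICall(ctx context.Context, db *sql.DB, topic *pubsub.Topic, sessionKey string, payload []byte) error {
	// 1. Durable write in the request path: this becomes the source of
	//    truth used for replay/backfill after an outage.
	if _, err := db.ExecContext(ctx,
		`INSERT INTO api_logs (session_key, payload) VALUES ($1, $2)`,
		sessionKey, payload,
	); err != nil {
		return err
	}

	// 2. Best-effort publish: a pub/sub outage only delays dashboard data,
	//    it no longer loses the log record.
	res := topic.Publish(ctx, &pubsub.Message{Data: payload})
	go func() {
		if _, err := res.Get(context.Background()); err != nil {
			log.Printf("pubsub publish failed, relying on replay from api_logs: %v", err)
		}
	}()
	return nil
}
```

The design choice is simply ordering: the synchronous database write is the durability guarantee, while pub/sub becomes an optimization for freshness rather than the only path the data takes.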
Report: "EU - Data delay in Dashboard (API NOT affected)"
Last update: EU clients experienced a data delay in the dashboard; session/transaction pages were not loading new data from Jan 20th 12:30 PM UTC to Jan 21st 17:02 UTC. It is now resolved. We've isolated the root cause and will be fixing it for good shortly.
Report: "High latency on API requests"
Last update:
## Overview
_On the morning of Jan 16 US time, Sardine experienced an increase in latency due to increased database usage._
The `/customers`, `/issuing/risks` and `/feedbacks` APIs experienced higher latency during 17:57-18:29 UTC.
## What happened
Sardine experienced an increase in traffic due to 1) an internal backfill job, 2) a large backfill triggered by a client, and 3) overall traffic growth. High resource usage on one of our database systems resulted in long latency.
## Impact
_The `/customers`, `/issuing/risks` and `/feedbacks` APIs experienced higher latency during 17:57-18:29 UTC._
## Timeline
* 17:00 UTC: Sardine started to observe higher-than-usual latency. While latency was higher than usual, it didn't match the alert condition at this point
* 17:53 UTC: Latency increased drastically
* 17:57 UTC: Page triggered
* 17:58 UTC: Oncall engineer acknowledged the page
* 18:05 UTC: Oncall engineer escalated to the DevOps oncall and other senior team members
* 18:19 UTC: Database upsize was triggered
* 18:29 UTC: Latency went back to normal
## What went wrong
* We didn't have sufficient internal controls around the internal data processing job that stressed the system
* It took a long time until the resource upsize was performed
* It took a long time before Sardine updated the status page
## Action items
* Enhance internal processes around data loads so we can better control resource usage
* Review and adjust resource configuration
* Improve internal training for oncall engineers
17:00 UTC to 17:52 UTC - moderate latency increase - p90 from a stable 500ms to spikes of 2s
17:52 UTC to 18:27 UTC - heavy latency increase (p50 >2s, up from 140ms)
The issue has been identified and a fix is being implemented.
Report: "Errors on API response"
Last update: Due to increased traffic on our DB, some requests returned errors and weren't successfully assessed. Timestamp: 19:27-19:37 UTC. The team has already resolved it and will be working on a root-cause fix.
Report: "Increased latency and errors on API"
Last update: On Jan 8, 15:17-15:32 UTC, Sardine had some intermittent issues on our pods. Affected endpoints include /v1/feedbacks, /v1/events*, /v1/customers, and /v2/devices. Some customers may have faced intermittent 504s with /v1/feedbacks. We have identified the root causes and have put fixes in place.
Report: "Very minor outage on /v2/devices and /v1/events endpoints"
Last update: On Jan 7, 07:22-07:23 UTC, Sardine had some intermittent issues on our device-events pods. About 5-10% of traffic received 502 responses. Endpoints pertaining to our devices, /v2/devices and /v1/events, were impacted. We have identified the cause and are putting in a fix to resolve this.
Report: "Increased latency in Customers and Issuing API"
Last update: Clients may have experienced higher-than-usual latency and error rates between 3:50 and 4:23 UTC on the Customers API and Issuing API due to performance degradation from higher-than-usual activity on our servers. Minor latency on the unified-comments, cases, and list-items APIs was noticed as well.
Report: "Service degradation"
Last update:
**Date**: December 13, 2024
## Summary
The production environment experienced significant performance degradation due to database connection issues, impacting service latency across multiple API endpoints. The incident lasted approximately 1.5 hours, from 8:39 AM to 10:12 AM PT (Pacific time). Immediate resolution was achieved through database scaling, though the root cause investigation is still ongoing.
## Description
The incident started with a deployment to production. During the incident period, the primary database experienced:
* A dramatic increase in connection count and database load between 8:29 AM - 10:24 AM PT
* Elevated lock contention
* Degraded query performance
* No corresponding traffic spike was detected during this period
## Timeline (Pacific time)
* 5:48 AM - Backend deployment
* 8:29 AM - Another backend deployment
* 8:29 AM - Database began showing increased load and connection count
* 8:39 AM - Incident began - high latency detected
* 8:46 AM - First alert paged the oncall engineer
* 9:14 AM - First remediation attempt: rolling back all deployments to an earlier SHA (unsuccessful)
* 9:30 AM - Started scaling the primary DB
* 10:12 AM - Service latency returned to normal
* 10:24 AM - Database metrics returned to normal levels
## Impact
* Increased latency across multiple API endpoints (v1/customers, v1/issuing/risks, and v1/feedbacks)
* Service degradation affecting database-dependent operations
## Root Cause Analysis
Investigation is still ongoing.
## Short-term resolution
The following actions were taken to resolve the incident:
1. Rolled back deployments to the previous stable version
2. Vertically scaled the primary database
3. Performed cleanup of old records in one of the database tables
4. Reverted some other recent config changes
## What Went Well
1. The team responded quickly to alerts and began investigation
2. Multiple remediation strategies were attempted
3. Service was successfully restored through database scaling
## What Needs Improvement
1. The root cause remains unclear despite multiple rollback attempts
2. Several initial remediation attempts were unsuccessful
## Action Items
* Complete a detailed root cause analysis
* Evaluate if the primary DB can be safely scaled down (planned for Monday review)
* Contact client provider
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Report: "Bulk decisioning alerts not working"
Last update: This incident has been resolved.
Bulk decisioning of alerts in alert queues is not working; if you try, you'll get an error. The team is already working on a fix. In the meantime you can decision alerts one at a time and it will work fine.
Report: "Increased latency in the API requests"
Last update: Increased latency on requests to the API from 1:56 AM UTC to 2:02 AM UTC.
Report: "Performance Degradation (high latency on requests)"
Last update: On November 21, 2024, during 8:40 PM - 9:28 PM UTC, the system suffered a significant performance degradation. This resulted in increased latency for the customers API, issuing/risks API and document-verification APIs.
Resolution: System resources were upscaled to mitigate the immediate impact and a controlled restart of affected application components was initiated. This action successfully reduced the number of active connections and stabilized the system.
Next steps:
- Enhance the internal runbook so oncall engineers can triage and handle issues like this faster
- Optimize database usage to avoid similar issues
Report: "Increased latency and rule evaluation errors"
Last update: This incident has been resolved. Timestamps of the issue: 17:00-17:21 UTC and 17:36-17:52 UTC.
A fix has been implemented and we are monitoring the results.
Report: "Increased latency and rule evaluation errors"
Last update: We experienced degradation of service on the rules-engine between 7:30 am and 7:50 am CST. The rules-engine is currently fully operational.
Report: "Latency spike"
Last update: This incident has been resolved.
We observed a latency increase starting at 23:48 Pacific time. We have adjusted infrastructure and monitoring.
Report: "SSN enrichment failing"
Last update: This incident was resolved around 8:24 pm UTC.
We have been getting errors from our SSN data provider and missing SSN signals since 2024-11-08 20:02 UTC; we are actively working with the partner on a resolution.
Report: "Latency degradation of customers API and feedback API"
Last update: We had a Customers and Issuing latency spike from 9:45 to 10:25 UTC today. The Feedback API's increased latency was mainly between 9:45 and 10:05 UTC. The root cause was found and we'll be working towards solving it for good.
Report: "Short degradation of service"
Last update: Some clients may have experienced intermittent short downtime on Sardine's endpoints on Nov 1, 7:44:30 to 7:48:45 CST. Some requests may also have had high latency. A configuration change on our Nginx servers unexpectedly spiked CPU and memory usage by a large amount. This has been resolved and the team is working on ensuring such spikes do not happen again.
Report: "Degradation on customers API and issuing risk API with increased latency"
Last update: Sardine's platform experienced degradation of the customers API and issuing risk API, with increased latency between 10:35:40 and 10:38:30 Pacific time. The issue is now resolved, and the team is looking into the root cause and future mitigation.
Report: "Sonar ACH Datapack Issue"
Last update:
# Incident Description
The Sonar API responded with a series of 500 errors to ACH datapack users between 3:20 pm and 6:50 pm EST.
## Impact
ACH Datapack users were unable to receive a proper response to their requests for 3 hours.
## Timeline (all EST)
* At 3:20 PM a manual config change on the Sardine risk service (not the SONAR service) was performed by an engineer
* At 3:25 PM the 5xx monitor started showing non-zero 5xx responses to requests using the ACH Datapack
* At 5:06 PM we received a message from an ACH datapack user reporting that they had been receiving 5xx error messages for some time
* At 5:18 PM the team got notified and started investigating
* At 5:30 PM the most recent deployment was rolled back, but this did not impact the errors
* From 5:30 PM to 6:30 PM further investigation was done and a hotfix had to be deployed
* At 6:50 PM the hotfix was in place and fixed the issue
## Root cause
We had an exceptional deployment for all risk-related services at Sardine that Tuesday evening, and some internal configurations were changed at the same time. This led to SonarAPI receiving extra data from the bank enrichment service which SonarAPI was not yet parsing correctly, causing the application to panic when doing bank enrichment.
## What went wrong
* Our monitors didn't cover 5xx responses at a low volume like this case; at the highest point we were returning 18 5xx responses per 5 minutes, while our threshold for warnings was at least double that, and for alerts 4x this volume.
* Our automated tests didn't cover this scenario as it relies on certain production configuration
## Actions
* Fixed the monitoring tool to now respond to any 5xx Sonar responses; previously it was calibrated to tolerate some 5xx requests before triggering warnings and alerts
* We'll review all the bank enrichment code in SonarAPI so that extra/unwanted/invalid values won't affect SonarAPI reliability (see the parsing sketch after this report)
* We're continuing to add more tests to guarantee the integration between Sonar and the enrichment vendors is covered
ACH Datapack users were unable to receive a proper response to their requests
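The root cause above was SonarAPI panicking on unexpected extra data from the bank enrichment service. A minimal sketch of defensive decoding in Go (the response shape and field names are invented for illustration): unknown fields are ignored and malformed payloads return an error instead of panicking.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bankEnrichment models only the fields the service actually uses; unknown
// extra fields from the vendor are ignored by encoding/json rather than
// causing a failure. Field names here are illustrative.
type bankEnrichment struct {
	AccountStatus string `json:"accountStatus"`
	RoutingValid  *bool  `json:"routingValid"` // pointer: distinguishes "absent" from "false"
}

// parseEnrichment never panics: malformed payloads surface as errors the
// caller can turn into a controlled response instead of a 500.
func parseEnrichment(raw []byte) (*bankEnrichment, error) {
	var e bankEnrichment
	if err := json.Unmarshal(raw, &e); err != nil {
		return nil, fmt.Errorf("bank enrichment response not understood: %w", err)
	}
	if e.AccountStatus == "" {
		return nil, fmt.Errorf("bank enrichment response missing accountStatus")
	}
	return &e, nil
}

func main() {
	// Extra, unexpected vendor fields are simply ignored.
	raw := []byte(`{"accountStatus":"open","routingValid":true,"newVendorField":{"x":1}}`)
	e, err := parseEnrichment(raw)
	if err != nil {
		fmt.Println("degrade gracefully:", err)
		return
	}
	fmt.Println("status:", e.AccountStatus)
}
```

Handling decode failures as ordinary errors keeps a vendor-side schema change from turning into a crash loop for the whole datapack.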
Report: "Elevated latency for customers API"
Last update:
## Overview
The latency for the `/v1/customers`, `/v1/issuing/risks` and `/v1/feedbacks` APIs was increased for the majority of customers from 21:02 to 23:51 Oct 9 Pacific Time. This was caused by a certain traffic pattern (likely a fraud attack) that stressed one of our backend DBs. While overall traffic volume was the same as usual, this traffic pattern stressed Sardine's feature computation backend and caused slow queries.
## Impact
* Customers had increased latency during these time periods
* We saw an increased number of database timeouts, meaning some features were not correctly computed
## Timeline (all Pacific Time)
* 7:43 PM Oct 9: Initial latency alert triggered. Since it self-resolved after a few minutes, no further investigation was done
* 9:16 PM Oct 9: Another latency alert triggered; it self-resolved after 10 minutes as well. The oncall engineer assumed it was a transient spike and didn't investigate further. We had a couple of other alerts but those were assumed to be noisy alerts
* 11:02 PM: Sardine received error reports from a few clients
* 11:32 PM: The issue got escalated by one of our Integration Managers
* 11:40 PM: The oncall engineer identified that database CPU usage was extremely high
* 11:48 PM: The oncall engineer performed database scaling
* 11:51 PM: Incident resolved
## What went wrong
* Self-resolved pagers were ignored at night as they tend to be pretty noisy
* Investigation took us a lot of time
* There is no auto-scaling available for this database product
* We didn't have enough safeguards against the particular traffic pattern that caused the spike in latency
## Action items
* Improve alerts and pager setup
* Enhance alerts around DB metrics
* Build an auto-scaler for the database
* Establish a better runbook so the oncall engineer can diagnose and act faster
* Improve backend code so it's more robust against this type of traffic pattern
Sardine's platform experienced elevated latency for the /customers API between 21:02 and 23:50 Pacific time.
Report: "Experiencing degradation in service"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "High Latency in SDK event for the last 3 hours"
Last update: The incident has been resolved.
We are noticing high latency in SDK events for the last 3 hours. The team is actively investigating and we will keep you updated on the progress.