Historical record of incidents for Sardine AI
Report: "Intermittent degradation due to an issue on a third party provider"
Last update: The following services might be affected:
- API latency
- Dashboard access
The team is investigating root causes. We will post updates here as soon as we have more information.
Report: "Data provider blockchain analysis provider outage"
Last update: Our downstream data provider for blockchain analysis has been experiencing an outage since 2025-06-11 12:30:27 UTC. We are working with them on a resolution.
Report: "Degradation on customer intelligence search"
Last update:
- The search bar that allows searching certain attributes (like email, sessionKey and others) has been disabled momentarily.
- Certain attributes are not working in the filter option.
We are currently investigating this issue.
Report: "Dashboard not available for certain areas of the Analytics reports"
Last update: This incident has been resolved.
Certain users may experience issues loading some areas of the dashboard, such as parts of Analytics. We are working on restoring these services at the moment.
Report: "Dashboard not available for certain areas of the Analytics reports"
Last updateThis incident has been resolved.
Certain users may experience issues with loading some of the areas of the dashboard such as parts Analytics. We are working on restoring the services at the moment.
Report: "Analytics dashboard is not available"
Last update: The issue has been resolved.
Currently the analytics dashboard is not working (users cannot view data; the data itself is unaffected). Our engineering team is working on restoring the dashboard.
Report: "Analytics dashboard is not available"
Last updateThe issue has been resolved.
Currently the analytics dashboard is not working (user cannot view data, however the data itself is ok). Our engineering team is working on restoring the dashboard.
Report: "Elevated latency with intermittent connectivity issues on endpoints"
Last update: We encountered elevated latency on endpoints with intermittent connectivity issues. This incident is resolved, and we will provide more details on this issue through a postmortem.
We are experiencing elevated latency on endpoints with intermittent connectivity issues. Our engineering team is investigating. There was downtime from 1:44 to 1:49 PM UTC.
Report: "Elevated latency with intermittent connectivity issues on endpoints"
Last updateWe encountered elevated latency on endpoints with intermittent connectivity issues. This incident is resolved, and we will be providing more details of this issue through a postmortem.
We are experiencing elevated latency on endpoints with intermittent connectivity issues. Our engineering team is investigating.There was a downtime of 1:44 - 1:49 PM UTC.
Report: "Elevated latency in production instance (NOT EU)"
Last update: We had an increase in latency around 20:40 UTC. It is now resolved.
Report: "Bank account verification error"
Last update: From 02:16 UTC to 04:03 UTC on April 26th, bank account verification requests resulted in the BVERR reason code (Bank data provider error: the bank data provider returned an error while trying to retrieve the bank account information) due to an outage with our downstream data provider.
Report: "Bank Account Verification issue"
Last update: From 17:00 UTC to around 23:59 UTC on April 25th there was an issue with Bank Account Verification, causing an abnormal increase in the BVERR reason code (Bank data provider error: the bank data provider returned an error while trying to retrieve the bank account information).
Report: "Intermittent connectivity issue on EU region"
Last update:
### **Introduction**
* **Purpose:** This report provides an overview of the recent service disruption impacting users in the EU region.
* **Apology:** We sincerely apologize for the inconvenience this may have caused and remain committed to maintaining a high level of service reliability.
### **Incident Overview**
* **Duration:** 45 minutes, from 2025-04-22 13:45 to 14:30 UTC
* **Region Affected:** EU
* **Services Affected:** `/v1/customers` endpoint and **business-events** service
### **Root Cause Analysis**
* **Primary Issue:** A misconfigured feature flag initiated the disruption.
* **Secondary Factor:** A related configuration change caused service instability.
### **Impact**
* **Service Accessibility:** Intermittent connectivity issues were experienced throughout the incident window.
* **Service Downtime:** The **business-events** service was fully unavailable for part of the duration.
* **Summary:** Intermittent connectivity issues across the EU region during the affected window.
### **Corrective Actions and Improvements**
* **Immediate Response:** The misconfiguration was reverted and services were promptly restored.
* **Ongoing Improvements:** We are implementing additional safeguards around configuration changes and enhancing monitoring across regional environments.
### **Conclusion**
* **Commitment:** We remain focused on delivering dependable and resilient services to all partners.
* **Appreciation:** Thank you for your understanding and continued trust.
We noticed an intermittent connectivity issue with certain endpoints, /v1/customers in particular, in the EU region. Engineers were assigned to fix it immediately. Everything was back up and running smoothly by 2:30 p.m. (PT) / 9:30 p.m. (CEST).
Report: "Increased latency in the API request, rate limiting errors and issues with dashboard access"
Last update:
## All times are PST
## Overview
_On Apr 19 and Apr 23, Sardine experienced an increase in latency due to a huge increase in traffic._
The `/customers`, `/issuing/risks`, `/feedbacks` and `/devices` APIs were affected during the following times (Pacific time):
* Apr 19: 08:34-08:46, 08:51-08:53
* Apr 23: 08:13-08:16, 08:37-08:42, 08:54-08:58, 09:11-09:16
## What happened
Sardine experienced a huge increase in traffic. While we have rate limiting and auto scaling in place, our system was overloaded, which caused performance degradation.
## Impact
The `/customers`, `/issuing/risks`, `/feedbacks` and `/devices` APIs experienced increased latency.
## Timeline
| Date | Status |
| --- | --- |
| **April 19 2025, 0834hrs - 0846hrs** | Risk and Device APIs were experiencing increased latency. We started manually scaling Nginx horizontally and vertically (while the autoscaler was in place, we did it manually to make it faster). |
| **April 19 2025, 0846hrs - 0851hrs** | All APIs were back up. |
| **April 19 2025, 0851hrs - 0853hrs** | Risk and Device APIs were experiencing increased latency. |
| April 19 2025, 0853hrs onwards | All APIs were back up. No issues moving forward. |
| April 23 2025, 0813hrs - 0816hrs | Risk and Device APIs were experiencing increased latency. Sardine engineers analyzed the traffic and started enabling some rate limit rules. |
| April 23 2025, 0837hrs - 0842hrs | Risk and Device APIs were experiencing increased latency. Sardine engineers analyzed the traffic and hardened the rate limit rules. Sardine engineers also started to scale up our Nginx servers vertically. |
| April 23 2025, 0854hrs - 0858hrs | Risk and Device APIs were experiencing increased latency. Sardine engineers analyzed the traffic and hardened the rate limit rules. A ban rate limit rule was also set (which took effect at 0911hrs). |
| April 23 2025, 0911hrs - 0916hrs | Risk and Device APIs were severely rate limited due to a misconfiguration. |
| April 23 2025, 0916hrs onwards | The misconfiguration was lifted and systems went back online. |
## What we’re doing to prevent future issues
* We have enhanced our rate limiting system and updated our automated mitigation setup so that, in the future, similar traffic will be automatically blocked (a minimal rate-limiting sketch follows this report).
* We have also created a new web application framework configuration to ensure such traffic spikes are properly handled.
* We are also creating dedicated instances to better handle such spikes.
This incident has been resolved.
The issue has been identified and a fix is being implemented.
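The postmortem above mentions enhancing the rate limiting system so that similar traffic spikes are blocked automatically. As a minimal sketch only (the client header, limits, and middleware shape are hypothetical, not Sardine's actual setup), a per-client token-bucket limiter in Go could look like this:

```go
package main

import (
	"log"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// clientLimiter hands out one token-bucket limiter per API client.
// The limits below are illustrative, not real thresholds.
type clientLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit // steady-state requests per second
	burst    int        // short burst allowance
}

func newClientLimiter(rps rate.Limit, burst int) *clientLimiter {
	return &clientLimiter{limiters: map[string]*rate.Limiter{}, rps: rps, burst: burst}
}

func (c *clientLimiter) get(clientID string) *rate.Limiter {
	c.mu.Lock()
	defer c.mu.Unlock()
	l, ok := c.limiters[clientID]
	if !ok {
		l = rate.NewLimiter(c.rps, c.burst)
		c.limiters[clientID] = l
	}
	return l
}

// middleware rejects over-limit requests with 429 instead of letting
// them pile up on the backend.
func (c *clientLimiter) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		clientID := r.Header.Get("X-Client-Id") // hypothetical client identifier header
		if !c.get(clientID).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	limiter := newClientLimiter(100, 200) // 100 rps steady, bursts of 200
	mux := http.NewServeMux()
	mux.HandleFunc("/customers", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", limiter.middleware(mux)))
}
```

Requests beyond the configured burst receive a 429 immediately, so a traffic spike degrades into rejected requests rather than elevated latency for everyone.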
Report: "Increased latency in the API requests"
Last update: This incident shares the combined Apr 19 / Apr 23 postmortem published above under "Increased latency in the API request, rate limiting errors and issues with dashboard access".
This incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "Partial downtime"
Last update: This incident shares the combined Apr 19 / Apr 23 postmortem published above under "Increased latency in the API request, rate limiting errors and issues with dashboard access".
Partial downtime on the API and dashboard from 15:40 UTC to 15:56 UTC.
Report: "Sandbox connectivity issues"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are experiencing connectivity issues on the Sandbox instance and we are currently investigating the root cause.
Report: "Elevated Latency"
Last update: We experienced elevated latency on our /customers and /devices APIs from 18:15 UTC to 18:55 UTC.
Report: "SANDBOX environment is inaccessible"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Elevated latency for customers API and issuing risk API"
Last update:
## Overview
Repeated timeouts and increased query latency on a few of our read replica clusters resulted in client-visible API latency.
## What happened
Recent code changes and changes in traffic patterns resulted in slow database queries. This increased latency, which in turn caused retries from some of our clients. Because we auto-scale pods based on traffic volume, this caused spikes in database connections, which led to further performance issues.
## Impact
Our API latency degraded severely during the following times:
* March 20: 20:06-20:41
* March 22: 1:00-1:23
* March 22: 2:14-2:34
* March 25: 18:26-19:01
## What went wrong
* Internal communication took a while before we could communicate the issue to our clients
* Detecting the root cause took us a while
## Action items
| Action Item with Description | Target |
| --- | --- |
| Scale up database resources | DONE |
| Update database connection limit and other configurations (see the connection-pool sketch after this report) | March 31 |
| Provision a separate DB resource for one of our services | April 1 |
| Tighten up internal timeout config | April 1 |
| Optimize known slow query 2 | April 1 |
| Optimize known slow query 1 | DONE |
| Optimize feature computation backend | ongoing project, end of Q2 |
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
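One of the action items above is updating database connection limits so that pod auto-scaling cannot turn into a connection storm on the read replicas. A minimal sketch, assuming a Go service using database/sql (the driver, DSN, and numbers are placeholders, not Sardine's real configuration):

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres driver; any database/sql driver works the same way
)

func main() {
	// Placeholder DSN; real credentials would come from configuration.
	db, err := sql.Open("postgres", "postgres://user:pass@replica:5432/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// Cap the pool so that (pods x MaxOpenConns) stays below the database's
	// connection limit even when the autoscaler adds pods during a spike.
	db.SetMaxOpenConns(20)                  // hard ceiling per pod
	db.SetMaxIdleConns(10)                  // keep a small warm pool
	db.SetConnMaxLifetime(30 * time.Minute) // recycle connections periodically
	db.SetConnMaxIdleTime(5 * time.Minute)  // drop idle connections quickly

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("connection pool configured")
}
```

The key design point is that the per-pod ceiling is chosen with the autoscaler's maximum pod count in mind, so retries and scale-out events cannot exhaust the database's connection budget.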
Report: "Performance degradation on customers API and issuing risk APIs"
Last update: Our service experienced higher-than-usual latency from 19:03 to 19:14 UTC because one of our databases was overloaded. We have updated the infra configuration and the issue is now resolved.
Report: "Payment service has downtime"
Last update: This incident was resolved as of 4:10 PM PT.
We have been experiencing an anomalous surge in traffic since 11:40 AM PT, and crypto.sardine.ai is experiencing instability.
Report: "Degradation across all APIs due to very high traffic spike"
Last update: From 21:14 UTC to 21:23 UTC we had degradation across all APIs due to a huge spike in traffic.
Report: "Dashboard not showing transaction data"
Last update: This incident has been resolved.
We are currently experiencing an issue with displaying the transaction data for sessions in the dashboard for the Production and Sandbox instances. We are actively working on this; in the meantime the information is available in the "View Request" widget and can be viewed there.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently experiencing an issue with displaying the transaction data for sessions in the dashboard for the Production and Sandbox instances. We are actively working on this; in the meantime the information is available in the "View Request" widget and can be viewed there.
Report: "Dashboard can't be accessed"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Crypto service experiencing intermittent issues"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Increased latency on customers API"
Last update: We experienced elevated latency for the Customers API between 19:04 and 19:12, and between 20:06 and 20:13.
Report: "Increased latency"
Last update:
## Overview
_On Mar 3rd, Sardine experienced an increase in latency due to increased database usage._
The `/customers` endpoint experienced spikes in latency around the start of every hour for 15 minutes between 8 and 11 Pacific time, and the `/issuing/risks` endpoint experienced intermittent higher latency during 08:00 - 10:54 Pacific time.
`/customers` latency spikes:
* 8:00 - 8:15
* 9:00 - 9:15
* 9:55 - 10:15
* 11:00 - 11:15
`/issuing/risks` endpoint latency spikes:
* 08:02 - 08:44
* 09:42 - 10:12
* 10:30 - 10:54
## What happened
Sardine encountered a surge in traffic originating from a client. The intermittent and unpredictable nature of these spikes presented challenges in real-time detection and impact assessment, subsequently hindering our ability to implement timely mitigation strategies.
## Impact
The `/customers` and `/issuing/risks` APIs experienced intermittent higher latency during 08:11 - 10:06 Pacific time. Clients using the advanced aggregation feature were more impacted.
## Timeline (all Pacific time)
* 08:02: Incident starts - latencies for both endpoints start going up.
* 8:10: Oncall engineer paged due to the increased latency.
* 8:23: Alert was auto-resolved. The oncall engineer started digging into the root cause; latency was not yet back to normal but was in a more manageable situation.
* 9:00 - 9:35: We had an oncall handoff meeting; this latency issue was mentioned, but no root cause had been detected yet and the latency seemed under control.
* 9:42: A customer notices the latency issue and communicates it to the Sardine team.
* 10:00: p95 latency becomes a sustained issue; the new oncall engineer starts investigating.
* 10:20: Oncall discovers the queries creating the bottlenecks; engineers start checking whether there is a bypass or a quick enhancement possible to remove the bottleneck, or whether scaling the DB is our only option.
* 10:40 AM: The DB is scaled up; within a few minutes the incident ends.
## What went wrong
* Slow to detect the request spikes and queries that were causing latency issues.
* Oncall handoff happened during the incident, and the issue wasn't properly handed off.
* The old oncall engineer thought the issue was a one-off latency increase due to spikes.
* The new oncall engineer wasn't tagged on old threads about this topic.
## Action items
* Enforce stricter query timeouts for the issuing API (see the sketch after this report)
* Ongoing query optimizations on the advanced aggregations feature that will mitigate the risk of this happening again
* Enhance internal processes around alerts and escalation
* Process update to oncall handoff and incident handling - if an incident happens around oncall handoff, both the new and old oncall will use the handoff meeting as a working session
We were experiencing elevated latency for the customers and issuing APIs between 8:00 am PT and 10:00 am PT due to unusual traffic volume; the issue is resolved.
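The action items above include enforcing stricter query timeouts for the issuing API. A minimal sketch, assuming a Go service and database/sql (the table, column, and timeout values are hypothetical):

```go
// Package issuingrisk sketches a per-query deadline so a slow aggregation
// fails fast instead of holding a database connection.
package issuingrisk

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

// QueryRiskFeatures runs an aggregation with a hard deadline. The table and
// column names are illustrative only.
func QueryRiskFeatures(ctx context.Context, db *sql.DB, customerID string) (int, error) {
	// Bound the query to 500ms; tune per endpoint SLO.
	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()

	var txCount int
	err := db.QueryRowContext(ctx,
		`SELECT COUNT(*) FROM transactions WHERE customer_id = $1`,
		customerID,
	).Scan(&txCount)

	if errors.Is(err, context.DeadlineExceeded) {
		// Degrade gracefully: surface a typed error the caller can handle
		// instead of letting requests queue up behind a slow query.
		return 0, fmt.Errorf("risk feature query timed out: %w", err)
	}
	return txCount, err
}
```

With a deadline in place, a burst of slow aggregations turns into fast, visible errors rather than a growing backlog of connections that drags down every endpoint sharing the database.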
Report: "Short spike in latency for customers and issuing APIs"
Last update: We had elevated latency for the customers and issuing APIs between 5:04 am PT and 5:06 am PT (about a 2-minute window). This was caused by AlloyDB overload. The issue is resolved.
Report: "Customers API degradation"
Last update: We experienced a short degradation of our customers API between ~5:07 am PT and ~5:14 am PT. All services are restored to normal activity.
Report: "Device Features and Device Risk Signals are missing from some session in the Dashboard (UI issue, NOT data issue)"
Last update: This incident has been resolved.
It started on February 20, 2025 at 4:38:37 AM GMT-8.
What is affected specifically?
- Visualization of device features in the dashboard (API requests/responses were unaffected)
- The underlying information is NOT affected; it will be shown again once we fix the issue.
We are currently investigating this issue.
Report: "/v1/rules endpoint in PROD not responding properly"
Last update: This incident has been resolved. This is a purely informational endpoint (not transactional) that allows customers to list the rules they own.
We are having some issues with the /v1/rules endpoint; we are investigating.
Report: "Elevated API latency"
Last update:
## Overview
_On Jan 31, Sardine experienced an increase in latency due to increased database usage._
The `/customers`, `/issuing/risks`, and `/devices` APIs experienced intermittent higher latency during 9:03 - 13:16 Pacific time.
## What happened
Sardine experienced an increase in traffic due to overall organic traffic growth. Additionally, some of our clients sent us batch-based traffic that was bursty in nature. Some of our databases didn't scale well to handle the bursty traffic.
## Impact
_The `/customers`, `/issuing/risks`, and `/devices` APIs experienced intermittent higher latency during 8:03 - 13:16 Pacific time._
## Timeline (all Pacific time)
* 8:03 AM: Incident starts
* 8:04 AM: Oncall engineer paged due to the increased latency
* 8:21 AM: Alert was auto-resolved. The oncall engineer concluded it was a one-off latency spike
* Around 9:00 AM: A client reached out to Sardine due to latency concerns
* 9:00 AM - 13:00: Multiple alerts were triggered. The oncall engineer was investigating
* 13:33: Root cause identified
* 13:40: Scaling and config changes were performed (incident ends)
## What went wrong
* While an internal alert notified the oncall engineer, it wasn't escalated soon enough
* Inefficient internal logic caused a performance bottleneck
* The rate limit didn't provide sufficient control
## Action items
* Enhance internal processes around alerts and escalation
* Fix inefficient old logic to avoid similar issues in the future
This incident has been resolved. We'll be adding a PostMortem with the full description and action items here as soon as we have it.
We are currently experiencing elevated latency and are actively working to resolve the issue
Report: "Issue in storing request/responses data"
Last update: Due to a configuration issue, some of the request/response data was not stored in our system. You may encounter an error when you attempt to "view request" or "view response" from the Sardine dashboard. This issue didn't affect any of the online risk scoring.
Incident timeline:
- 6:36 AM Jan 28 to 5:41 AM Jan 30 PST for api.eu.sardine.ai
- 3:36 PM Jan 29 to 5:41 AM Jan 30 PST for api.sardine.ai
Report: "pub/sub outage"
Last update:
### Summary
One of Sardine’s cloud providers (Google Cloud) had [an outage with their pub/sub service](https://status.cloud.google.com/incidents/ghMho2Gka33Exr9UNavz) (messaging service). This resulted in data loss for the secondary data we show in the Sardine dashboard. The incident began at **2025-01-08 06:54** and ended at **2025-01-08 08:07** (all times are **US/Pacific**).
### Root cause
Sardine utilizes pub/sub across various subsystems for scalability and low latency. The Sardine system writes to our primary database within the API request path, but some logic is then performed asynchronously, triggered by a message to pub/sub. The outage therefore resulted in missing data in the Sardine dashboard. Additionally, it resulted in data loss for our API log table, which makes data backfill very difficult.
### Future enhancement
* [Q1] Remove the pub/sub dependency from API logging so we can replay APIs in cases like this (see the logging sketch after this report)
* [Q1] Enhance alerting around pub/sub issues
One of Sardine’s cloud providers (Google Cloud) had an outage with their pub/sub service (messaging service): https://status.cloud.google.com/incidents/ghMho2Gka33Exr9UNavz
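The first future enhancement above is removing the pub/sub dependency from API logging so APIs can be replayed after an outage like this. A minimal sketch of that idea, assuming a Go backend with the cloud.google.com/go/pubsub client (the table schema and function are hypothetical): persist the log synchronously first, then treat the publish as best-effort.

```go
// Package apilog sketches durable API logging that does not depend on pub/sub.
package apilog

import (
	"context"
	"database/sql"
	"log"

	"cloud.google.com/go/pubsub"
)

// LogAPICall writes the request log to the primary database first, so the
// record survives even if pub/sub is down, then publishes asynchronously for
// downstream consumers. Table and column names are illustrative.
func LogAPICall(ctx context.Context, db *sql.DB, topic *pubsub.Topic, sessionKey string, payload []byte) error {
	// 1. Durable write in the request path: this becomes the source of
	//    truth used for replay/backfill after an outage.
	if _, err := db.ExecContext(ctx,
		`INSERT INTO api_logs (session_key, payload) VALUES ($1, $2)`,
		sessionKey, payload,
	); err != nil {
		return err
	}

	// 2. Best-effort publish: a pub/sub outage only delays dashboard data,
	//    it no longer loses the log record.
	res := topic.Publish(ctx, &pubsub.Message{Data: payload})
	go func() {
		if _, err := res.Get(context.Background()); err != nil {
			log.Printf("pubsub publish failed, relying on replay from api_logs: %v", err)
		}
	}()
	return nil
}
```

The design choice is simply ordering: the synchronous database write is the durability guarantee, while pub/sub becomes an optimization for freshness rather than the only path the data takes.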
Report: "EU - Data delay in Dashboard (API NOT affected)"
Last update: EU clients experienced a data delay in the dashboard; session/transaction pages were not loading new data from Jan 20th 12:30 PM UTC to Jan 21st 17:02 UTC. It is now resolved. We've isolated the root cause and will be fixing it for good shortly.
Report: "High latency on API requests"
Last update:
## Overview
_On the morning of Jan 16 US time, Sardine experienced an increase in latency due to increased database usage._
The `/customers`, `/issuing/risks` and `/feedbacks` APIs experienced higher latency during 17:57-18:29 UTC.
## What happened
Sardine experienced an increase in traffic due to 1) an internal backfill job, 2) a large backfill triggered by a client, and 3) overall traffic growth. High resource usage on one of our database systems resulted in long latency.
## Impact
_The `/customers`, `/issuing/risks` and `/feedbacks` APIs experienced higher latency during 17:57-18:29 UTC._
## Timeline
* 17:00 UTC: Sardine started to observe higher-than-usual latency. While latency was higher than usual, it didn't match the alert condition at this point
* 17:53 UTC: Latency increased drastically
* 17:57 UTC: Page triggered
* 17:58 UTC: Oncall engineer acknowledged the page
* 18:05 UTC: Oncall engineer escalated to the DevOps oncall and other senior team members
* 18:19 UTC: Database upsize was triggered
* 18:29 UTC: Latency went back to normal
## What went wrong
* We didn't have sufficient internal controls around the internal data processing job that stressed the system
* It took a long time until the resource upsize was performed
* It took a long time before Sardine updated the status page
## Action items
* Enhance internal processes around data loads so we can better control resource usage
* Review and adjust resource configuration
* Improve internal training for oncall engineers
17:00 UTC to 17:52 UTC - moderate latency increase - p90 from a stable 500ms to spikes of 2s
17:52 UTC to 18:27 UTC - heavy latency increase (p50 >2s, up from 140ms)
The issue has been identified and a fix is being implemented.
Report: "Errors on API response"
Last update: Due to increased traffic on our DB, some requests returned errors and weren't successfully assessed. Timestamp: 19:27-19:37 UTC. The team has already resolved it and will be working on a root-cause fix.
Report: "Increased latency and errors on API"
Last update: On Jan 8, 15:17-15:32 UTC, Sardine had some intermittent issues on our pods. Affected endpoints include /v1/feedbacks, /v1/events*, /v1/customers, and /v2/devices. Some customers may have faced intermittent 504s with /v1/feedbacks. We have identified the root causes and have put fixes in place.
Report: "Very minor outage on /v2/devices and /v1/events endpoints"
Last update: On Jan 7, 07:22-07:23 UTC, Sardine had some intermittent issues on our device-events pods. About 5-10% of traffic received 502 responses. Endpoints pertaining to our devices, /v2/devices and /v1/events, were impacted. We have identified the cause and are putting in a fix to resolve this.
Report: "Increased latency in Customers and Issuing API"
Last update: Clients may have experienced higher-than-usual latency and error rates between 3:50 and 4:23 UTC on the Customers API and Issuing API due to performance degradation from higher-than-usual activity on our servers. Minor latency on the unified-comments, cases, and list-items APIs was noticed as well.
Report: "Service degradation"
Last update:
**Date**: December 13, 2024
## Summary
The production environment experienced significant performance degradation due to database connection issues, impacting service latency across multiple API endpoints. The incident lasted approximately 1.5 hours, from 8:39 AM to 10:12 AM PT (Pacific time). Immediate resolution was achieved through database scaling, though the root cause investigation is still ongoing.
## Description
The incident started with a deployment to production. During the incident period, the primary database experienced:
* A dramatic increase in connection count and database load between 8:29 AM - 10:24 AM PT
* Elevated lock contention
* Degraded query performance
* No corresponding traffic spike was detected during this period
## Timeline (Pacific time)
* 5:48 AM - Backend deployment
* 8:29 AM - Another backend deployment
* 8:29 AM - Database began showing increased load and connection count
* 8:39 AM - Incident began - high latency detected
* 8:46 AM - First alert paged the oncall engineer
* 9:14 AM - First remediation attempt: rolling back all deployments to an earlier SHA (unsuccessful)
* 9:30 AM - Started scaling the primary DB
* 10:12 AM - Service latency returned to normal
* 10:24 AM - Database metrics returned to normal levels
## Impact
* Increased latency across multiple API endpoints (v1/customers, v1/issuing/risks, and v1/feedbacks)
* Service degradation affecting database-dependent operations
## Root Cause Analysis
Investigation is still ongoing.
## Short-term resolution
The following actions were taken to resolve the incident:
1. Rolled back deployments to the previous stable version
2. Vertically scaled the primary database
3. Performed cleanup of old records in one of the database tables
4. Reverted some other recent config changes
## What Went Well
1. The team responded quickly to alerts and began investigation
2. Multiple remediation strategies were attempted
3. Service was successfully restored through database scaling
## What Needs Improvement
1. The root cause remains unclear despite multiple rollback attempts
2. Several initial remediation attempts were unsuccessful
## Action Items
* Complete a detailed root cause analysis
* Evaluate if the primary DB can be safely scaled down (planned for Monday review)
* Contact client provider
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Report: "Bulk decisioning alerts not working"
Last update: This incident has been resolved.
Bulk decisioning of alerts in alert queues is not working; if you try, you'll get an error. The team is already working on a fix. In the meantime you can decision alerts one at a time and it will work fine.
Report: "Increased latency in the API requests"
Last update: Increased latency on requests to the API from 1:56 AM UTC to 2:02 AM UTC.
Report: "Performance Degradation (high latency on requests)"
Last update: On November 21, 2024, during 8:40 PM - 9:28 PM UTC, the system suffered a significant performance degradation. This resulted in increased latency for the customers API, issuing/risks API and document-verification APIs.
Resolution: System resources were upscaled to mitigate the immediate impact and a controlled restart of affected application components was initiated. This action successfully reduced the number of active connections and stabilized the system.
Next steps:
- Enhance the internal runbook so oncall engineers can triage and handle issues like this faster
- Optimize database usage to avoid similar issues
Report: "Increased latency and rule evaluation errors"
Last update: This incident has been resolved. Timestamps of the issue: 17:00-17:21 UTC and 17:36-17:52 UTC.
A fix has been implemented and we are monitoring the results.
Report: "Increased latency and rule evaluation errors"
Last update: We experienced degradation of service on the rules-engine between 7:30 am and 7:50 am CST. The rules-engine is currently fully operational.
Report: "Latency spike"
Last update: This incident has been resolved.
We observed a latency increase starting at 23:48 Pacific time. We have adjusted infrastructure and monitoring.
Report: "SSN enrichment failing"
Last update: This incident was resolved around 8:24 pm UTC.
We have been getting errors from our SSN data provider and missing SSN signals since 2024-11-08 20:02 UTC; we are actively working with the partner on a resolution.
Report: "Latency degradation of customers API and feedback API"
Last update: We had a Customers and Issuing latency spike from 9:45 to 10:25 UTC today. The Feedback API's increased latency was mainly between 9:45 and 10:05 UTC. The root cause was found and we'll be working towards solving it for good.
Report: "Short degradation of service"
Last update: Some clients may have experienced intermittent short downtime on Sardine's endpoints on Nov 1, 7:44:30 to 7:48:45 CST. Some requests may also have had high latency. A configuration change on our Nginx servers unexpectedly spiked CPU and memory usage by a large amount. This has been resolved and the team is working on ensuring such spikes do not happen again.
Report: "Degradation on customers API and issuing risk API with increased latency"
Last update: Sardine's platform experienced degradation of the customers API and issuing risk API, with increased latency between 10:35:40 and 10:38:30 Pacific time. The issue is now resolved, and the team is looking into the root cause and future mitigation.
Report: "Sonar ACH Datapack Issue"
Last update:
# Incident Description
The Sonar API responded with a series of 500 errors to ACH datapack users between 3:20 pm and 6:50 pm EST.
## Impact
ACH Datapack users were unable to receive a proper response to their requests for 3 hours.
## Timeline (all EST)
* At 3:20 PM a manual config change on the Sardine risk service (not the SONAR service) was performed by an engineer
* At 3:25 PM the 5xx monitor started showing non-zero 5xx responses to requests using the ACH Datapack
* At 5:06 PM we received a message from an ACH datapack user reporting that they had been receiving 5xx error messages for some time
* At 5:18 PM the team got notified and started investigating
* At 5:30 PM the most recent deployment was rolled back, but this did not impact the errors
* From 5:30 PM to 6:30 PM further investigation was done and a hotfix had to be deployed
* At 6:50 PM the hotfix was in place and fixed the issue
## Root cause
We had an exceptional deployment for all risk-related services at Sardine that Tuesday evening, and some internal configurations were changed at the same time. This led to SonarAPI receiving extra data from the bank enrichment service which SonarAPI was not yet parsing correctly, causing the application to panic when doing bank enrichment.
## What went wrong
* Our monitors didn't cover 5xx responses at a low volume like this case; at the highest point we were returning 18 5xx responses per 5 minutes, while our threshold for warnings was at least double that, and for alerts 4x this volume.
* Our automated tests didn't cover this scenario as it relies on certain production configuration
## Actions
* Fixed the monitoring tool to now respond to any 5xx Sonar responses; previously it was calibrated to tolerate some 5xx requests before triggering warnings and alerts
* We'll review all the bank enrichment code in SonarAPI so that extra/unwanted/invalid values won't affect SonarAPI reliability (see the parsing sketch after this report)
* We're continuing to add more tests to guarantee the integration between Sonar and the enrichment vendors is covered
ACH Datapack users were unable to receive a proper response to their requests
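The root cause above was SonarAPI panicking on unexpected extra data from the bank enrichment service. A minimal sketch of defensive decoding in Go (the response shape and field names are invented for illustration): unknown fields are ignored and malformed payloads return an error instead of panicking.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bankEnrichment models only the fields the service actually uses; unknown
// extra fields from the vendor are ignored by encoding/json rather than
// causing a failure. Field names here are illustrative.
type bankEnrichment struct {
	AccountStatus string `json:"accountStatus"`
	RoutingValid  *bool  `json:"routingValid"` // pointer: distinguishes "absent" from "false"
}

// parseEnrichment never panics: malformed payloads surface as errors the
// caller can turn into a controlled response instead of a 500.
func parseEnrichment(raw []byte) (*bankEnrichment, error) {
	var e bankEnrichment
	if err := json.Unmarshal(raw, &e); err != nil {
		return nil, fmt.Errorf("bank enrichment response not understood: %w", err)
	}
	if e.AccountStatus == "" {
		return nil, fmt.Errorf("bank enrichment response missing accountStatus")
	}
	return &e, nil
}

func main() {
	// Extra, unexpected vendor fields are simply ignored.
	raw := []byte(`{"accountStatus":"open","routingValid":true,"newVendorField":{"x":1}}`)
	e, err := parseEnrichment(raw)
	if err != nil {
		fmt.Println("degrade gracefully:", err)
		return
	}
	fmt.Println("status:", e.AccountStatus)
}
```

Handling decode failures as ordinary errors keeps a vendor-side schema change from turning into a crash loop for the whole datapack.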
Report: "Elevated latency for customers API"
Last update:
## Overview
The latency for the `/v1/customers`, `/v1/issuing/risks` and `/v1/feedbacks` APIs was increased for the majority of customers from 21:02 to 23:51 Oct 9 Pacific Time. This was caused by a certain traffic pattern (likely a fraud attack) that stressed one of our backend DBs. While overall traffic volume was the same as usual, this traffic pattern stressed Sardine's feature computation backend and caused slow queries.
## Impact
* Customers had increased latency during these time periods
* We saw an increased number of database timeouts, meaning some features were not correctly computed
## Timeline (all Pacific Time)
* 7:43 PM Oct 9: Initial latency alert triggered. Since it self-resolved after a few minutes, no further investigation was done
* 9:16 PM Oct 9: Another latency alert triggered; it self-resolved after 10 minutes as well. The oncall engineer assumed it was a transient spike and didn't investigate further. We had a couple of other alerts but those were assumed to be noisy alerts
* 11:02 PM: Sardine received error reports from a few clients
* 11:32 PM: The issue got escalated by one of our Integration Managers
* 11:40 PM: The oncall engineer identified that database CPU usage was extremely high
* 11:48 PM: The oncall engineer performed database scaling
* 11:51 PM: Incident resolved
## What went wrong
* Self-resolved pagers were ignored at night as they tend to be pretty noisy
* Investigation took us a lot of time
* There is no auto-scaling available for this database product
* We didn't have enough safeguards against the particular traffic pattern that caused the spike in latency
## Action items
* Improve alerts and pager setup
* Enhance alerts around DB metrics
* Build an auto-scaler for the database
* Establish a better runbook so the oncall engineer can diagnose and act faster
* Improve backend code so it's more robust against this type of traffic pattern
Sardine's platform experienced elevated latency for the /customers API between 21:02 and 23:50 Pacific time.
Report: "Experiencing degradation in service"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "High Latency in SDK event for the last 3 hours"
Last update: The incident has been resolved.
We are noticing high latency in SDK events for the last 3 hours. The team is actively investigating and we will keep you updated on the progress.