Onfido

Is Onfido Down Right Now? Check whether an outage is currently ongoing.

Onfido is currently Operational

Last checked from Onfido's official status page

Historical record of incidents for Onfido

Report: "Degraded performance for Identity Reports"

Last update
resolved

Identity reports were slower to be processed in the EU region between 20:30 and 21:50.

Report: "Degraded performance for Identity Reports"

Last update
Resolved

Identity report were slower to be processed in the EU region between 20:30 and 21:50.

Report: "Studio results page latency"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been deployed and we're monitoring the result.

identified

The issue has been identified; we expect to have it resolved within the next few minutes for live workflow runs.

investigating

We've identified an issue that is causing an increase in Studio results page latency. Workflow run execution is not affected by this incident; only the results page is.

Report: "Dashboard check latency (US region)"

Last update
resolved

Incident resolved

monitoring

A fix was implemented and we're monitoring the result.

identified

We observed some latency in displaying check updates in the Dashboard. Customers may experience delays in checks appearing in the Dashboard. We're currently working on a fix.

Report: "Deterioration in QES product"

Last update
resolved

Our provider has now restored full service. QES processing is back to normal.

monitoring

QES requests are being processed again now. We continue to monitor the situation.

monitoring

QES processing is starting to come back now. We are continuing to monitor the situation.

monitoring

We are monitoring our 3rd party provider as they deal with the underlying issues

identified

The issue has been identified with our 3rd party provider

investigating

We are continuing to investigate this issue.

investigating

The QES product is currently down due to a 3rd party provider issue. We are currently monitoring the situation as it develops with our provider.

Report: "Increased latency on check creation"

Last update
postmortem

### Summary
One of our components contributing to automatic processing for Document Reports had a spike of timeout errors from 9:05pm until 9:20pm in the EU cluster. All Document Reports created between 9:20pm and 9:40pm UTC were processed with a higher TaT by manual analysts.
### Root Causes
Two faulty nodes in our production cluster temporarily slowed down the execution of a CPU-intensive component.
### Timeline
* 9:21pm UTC: Elevated error rates for the relevant component trigger an on-call alert.
* 9:28pm UTC: We identified two nodes of our cluster as culprits for slow CPU-intensive executions.
* 9:33pm UTC: Restart the two nodes.
* 9:40pm UTC: The affected component recovers successfully.
* 9:41pm UTC: Backlog of reports observed. Public incident raised to inform customers of the expected time to clear.

resolved

This incident has been resolved. A small backlog of manual tasks will be cleared within the next 1-2 hours.

monitoring

The issue has been resolved and we are monitoring the results.

investigating

We are currently experiencing an issue that is negatively impacting latency on check completion.

Report: "Facial Similarity and Known Faces service degradation"

Last update
postmortem

At around 6pm UTC on 13th March 2025, we were alerted to higher turnaround times (and consequent delays) in processing Facial Similarity and Known Faces reports in the EU region. This affected all clients running reports during this 15-minute period. These reports didn't fail; they were only delayed.
### Summary
Higher turnaround times (and consequent delays) in processing Facial Similarity and Known Faces reports.
### Root Causes
* Known Faces and Facial Similarity reports took longer than expected to be processed
* because a database was struggling (heavy CPU usage)
* because an ongoing query was monopolising the database
* because the query was not optimised (and not configured to time out)
* because the dependent service is an internal operational tool for report drill-down and investigation
### Timeline
* 17:56 UTC: We are alerted to a high number of pending reports, due to higher turnaround times in processing
* 18:03 UTC: A suspected feature is turned off as a potential culprit, but nothing changes – not the root cause
* 18:07 UTC: Problem stops and ongoing reports are processed normally (upon further investigation, this was unrelated to the feature that was turned off)
* 18:10 UTC: Investigation shows high CPU usage in the database
* 18:34 UTC: A query originating in the internal operational tool is identified as the culprit
* 18:35 UTC: Pending reports appear to be dropping, which should indicate graceful recovery is in progress. A quirk in the metric misleads us, and we realise pending reports are stuck
* 18:36 UTC: Pending reports are stuck and not being recovered automatically, so we resort to manual action to re-run them
* 18:37 UTC: The search feature in the internal operational tool causing the bad query is disabled (functionality removed)
* 19:00 UTC: We retrieve all of the affected reports from our logging platform
* 19:12 UTC: We have re-run all affected reports and the incident is over
### Remedies
In order to make sure this doesn't happen again:
* We will remove the search feature from the internal operational tool for report drill-down whilst we optimise the query powering it
* We will only reinstate the search feature after the query is optimised and set to use a read replica instead of the write replica of our PostgreSQL database
* We will only reinstate the search feature after the query is optimised and an adequate query timeout is set
* We will fix the cron job for automatic and graceful recovery of pending reports
* We have fixed the operational dashboards to use the right metric for pending reports monitoring
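
The two read-path remedies above (a query timeout and routing heavy queries to a read replica) are generic PostgreSQL safeguards. A minimal sketch of what they look like, assuming a PostgreSQL replica accessed via psycopg2; the connection details, table and column names are hypothetical and this is not Onfido's actual code:

```python
# Minimal sketch only: assumes a PostgreSQL read replica accessed via psycopg2;
# connection details, table and column names are hypothetical.
import psycopg2

applicant_id = "an-applicant-uuid"  # hypothetical value

# Point heavy drill-down queries at a read replica rather than the primary
# (write) instance, so an expensive query cannot starve report processing.
conn = psycopg2.connect(host="replica.internal.example", dbname="reports",
                        user="readonly", password="secret")

with conn, conn.cursor() as cur:
    # Cap how long any single statement in this session may run; PostgreSQL
    # aborts the query instead of letting it monopolise the database.
    cur.execute("SET statement_timeout = '5s'")
    cur.execute("SELECT id, status FROM reports WHERE applicant_id = %s",
                (applicant_id,))
    rows = cur.fetchall()
```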

resolved

All reports have been recovered. We're now back to normal processing and the incident is over.

monitoring

We're monitoring the run of pending reports. Almost done now. As previously stated, ongoing processing is back to normal. We'll update again once all pending reports affected during the incident have been recovered.

identified

A bad query has been identified as the main culprit. We continue to investigate the issue.

investigating

Processing times are back to normal for ongoing reports. There are some pending reports being automatically re-run by our graceful handling of errors as we update this incident page. We are continuing to investigate the issue.

investigating

We are currently investigating higher processing times for Facial Similarity and Known Faces reports in the EU region.

Report: "Smart Capture Link - Service Outage"

Last update
resolved

The issue affecting the Smart Capture Link due to the expiration of the onfido.app domain has been resolved. While the issue has been resolved, we will continue to closely monitor the service to ensure full stability and address any potential lingering effects.

monitoring

While the domain renewal changes have propagated, some users may still experience intermittent issues.
Next Steps:
* We are actively monitoring the situation

identified

Issue Summary: The Smart Capture Link service is currently unavailable due to an expired domain (onfido.app), which caused disruptions in domain resolution.
Current Actions Taken:
* The onfido.app domain has already been renewed
* We will provide updates as soon as we have more information on restoration progress.

Report: "Report creation failure in EU and US regions"

Last update
postmortem

### Summary
Onfido experienced an outage on Feb 24th, 14:14 UTC, for 9 minutes. Client API requests to upload documents and create checks returned an error response during this period. While this also impacted our SDK traffic, our logs indicate that only a very small number of user sessions failed to complete, as they were largely recovered by retries.
### Root Causes
During preparations to roll out a planned database update, part of the change was inadvertently pushed to our production environment before it was ready. This was due to a misconfigured release pipeline.
### Timeline
* 14:14 UTC: A change to a database is inadvertently pushed to production
* 14:15 UTC: The team responsible for this upgrade is alerted to an increase in database errors
* 14:17 UTC: The problem is identified, and a fix is released
* 14:22 UTC: The fix is fully applied to impacted regions
### Remedies
The offending pipeline is being corrected, and an investigation will be done to assess whether any other pipeline suffers from the same misconfiguration. Additional controls will be established to avoid such misconfigurations in the future.

resolved

Onfido experienced a critical outage across all regions on Feb 24th, 2025, at 14:15 UTC, lasting 9 minutes (resolved on Feb 24th, 2025, 14:24 UTC). During this time, any request made to our systems failed, resulting in no applicants or reports being created and no uploads being processed. We take pride in running a robust, reliable service and are working hard to prevent this from happening again. Once we conclude our investigation, we will publish a detailed postmortem.

Report: "Studio webhooks delivery partial outage"

Last update
postmortem

### Summary
On the 10th February 2025, around 17:34 UTC, a dependency upgrade scoped to the component responsible for sourcing webhooks was deployed to production. It caused a partial outage in the delivery of webhooks for Studio, affecting ~38% of traffic in the CA and US regions for the duration of the incident. There was no impact in the EU.
### Root Causes
A third-party dependency upgrade in a component that buffers events for internal broadcasting caused a fraction of them to be dropped, which resulted in the corresponding webhooks not being sent. Investigation was unusually hard due to the difference in impact across regions and the intermittent nature of the issue.
### Timeline
* 10/2 17:38: The change with the dependency upgrade was deployed to the CA region
* 10/2 18:19: The change with the dependency upgrade was deployed to the US region
* 10/2 18:58: First instance of webhooks not delivered in the US region
* 10/2 20:38: First instance of webhooks not delivered in the CA region
* 11/2 18:27: Incident was reported and investigation started
* 12/2 02:14: The change was reverted in the CA region
* 12/2 02:48: The change was reverted in the US region
### Remedies
* Additional per-region monitoring will be employed to identify partial outages of critical services, such as webhooks, in a more timely manner, for example by setting more aggressive per-region thresholds. The fact that the EU, the region with the largest volume, was unaffected diluted the global measurement.
* New standard operating procedure: only upgrade a single dependency per deployment on this system.
* End-to-end testing of webhook delivery will be expanded to validate this additional scenario.
* Reliance on the offending third-party dependency will be phased out.

resolved

On the 10th February 2025, around 17:34 UTC, a dependency upgrade scoped to the component responsible for sourcing webhooks was deployed to production, which caused a partial outage in the delivery of webhooks for Studio, affecting ~38% of traffic in the CA and US regions for the duration of the incident. There was no impact in the EU.

Report: "Increased Flag rate for Video Reports"

Last update
resolved

Between 13:10 UTC and 14:23 UTC, a majority of Live Video reports processed in all regions were incorrectly flagged for payload integrity. This was caused by a recent code change that has been rolled back. All impacted reports were rerun within 2 hours of the incident being resolved. We apologize for the inconvenience this has caused.

Report: "Workflow errors and latency"

Last update
postmortem

### Summary
On the 11th February 2025, around 14:12 UTC, a database change on a primary key index in the EU region impacted the creation and management of workflow runs via the Studio API and the SDK.
### Root Causes
While performing a schema migration on a table to change a primary key, an index was inadvertently dropped, which affected the performance of some critical operations relying on it. The incident lasted around 15 minutes, until the index for the previous primary key was re-introduced. The rollout of this change failed to strictly follow our established internal change management procedures for database migrations.
### Timeline
* 13:35:35: Progressive rollout of the schema migration started in the US region
* 14:09:30: Unsuccessful attempt to manually abort the rollout of the migration after an increase of P50 endpoint latency was observed in that region
* 14:12:00: Progressive rollout of the schema migration started in the EU region
* 14:19:49: First related 500 API error was recorded due to database query timeouts
* 14:23:24: Monitoring alarm triggers due to a surge of 5XX HTTP errors in the Workflow API
* 14:30:58: Incident was reported
* 14:32:00: Index on the previous primary key started being re-created
* 14:35:29: API recovered
* 14:38:45: Alarm recovered
* 14:50:47: Incident closed
### Remedies
* For the foreseeable future, all Studio database migrations will require explicit review by a senior engineering leader to ensure rollout strictly follows our established internal change management procedures for database migrations;
* Additional measures to test the impact of schema changes, prior to any production rollout, will be performed via load tests, with the goal of measuring deviations in the P50/P75 of API endpoint latency;
* A longer progressive rollout interval across regions will be applied in order to provide more opportunity to spot issues before moving on to the higher-volume regions;
* Improve monitoring around abnormal API endpoint latency surges in order to automatically detect deviations without requiring active human observation.

resolved

We saw elevated workflow errors on creation and completion for Studio customers. Higher latency in general was also observed in the EU region. The incident started at 14:19 UTC and was resolved at 14:36 UTC.

Report: "Dashboard Users unable to Login"

Last update
resolved

This incident has been resolved.

monitoring

We are monitoring an issue where users were unable to log in when trying to access the Dashboard in the EU region. The issue started at 17:38 UTC and a fix was deployed at 18:51 UTC. Users that were already logged in to the Dashboard were unaffected.

Report: "Increased turn around time for Autofill on all regions."

Last update
resolved

We have identified a temporary spike in turnaround time for our Autofill product, which has already been resolved. We apologise for any inconvenience this may cause.

Report: "Device Intelligence Report Failure to run"

Last update
postmortem

### Summary
A product change released on Device Intelligence caused reports to be withdrawn from 14:48 to 15:26 UTC on January 24. Both Studio Workflows and Classic Checks were impacted. The Device Intelligence reports running inside Studio workflows caused these to transition to the Error state. Classic Checks that included these reports were still completed based on the outcome of the remaining reports. 12% of Studio workflows running during that period were affected, and 9% of Checks.
### Root Causes
The primary cause of the incident was a backwards-incompatible change in a timestamp field during a Device Intelligence report calculation update. Our alerting system detected this immediately, but due to a misconfiguration, it wasn't marked as an urgent issue. This misclassification occurred because the change only impacted a small subset of the entire report set. Despite the alert not being flagged as urgent, one team promptly began investigating. Once they escalated the issue to the team responsible for the change, a rollback was initiated, and the issue was resolved swiftly.
### Timeline
* 14:48 GMT: Gradual deployment begins
* 14:58 GMT: Deployment completes
* 14:59 GMT: Internal alert triggered with non-urgent priority
* 15:03 GMT: Investigation team begins analysis
* 15:19 GMT: Responsible team informed and starts immediate investigation
* 15:25 GMT: Rollback initiated
* 15:27 GMT: Incident resolved
### Remedies
* Completed actions:
  * **Alert Priority Adjustment**: Fixed the priority of alarms to mark similar issues as urgent and activate on-call;
  * **Alarm Sensitivity**: Tuned other alarms to be more sensitive to errors, ensuring they trigger during deployment;
* Ongoing actions:
  * **Deployment Monitoring**: Implement more granular monitoring during deployments to catch backwards-incompatible changes earlier in the process;
  * **Timestamp Standardization**: Develop and enforce strict guidelines for timestamp handling across all systems to prevent future compatibility issues;
  * **Workflow Recovery**: Implement mechanisms to recover workflows during temporary failures instead of cancelling them;
  * **Device Intelligence Report Handling**: Ensure Device Intelligence reports are not withdrawn but moved to a Dead Letter Queue (DLQ) for potential recovery and investigation.

resolved

All Device Intelligence reports from 14:48 to 15:26 UTC on January 24 did not complete. The Device Intelligence reports running in Studio workflows were withdrawn, along with any other reports being executed in parallel with them.

Report: "Known Faces service degradation"

Last update
postmortem

### Summary
Facial Similarity and Known Faces reports experienced turnaround time degradation in the US region, causing service slowdowns and ~1.2% of Known Faces reports being withdrawn.
### Root Causes
Cluster data imbalance caused sustained CPU spikes in one of the supporting nodes holding a large portion of our data, leading to performance issues felt especially in turnaround time.
### Timeline
* 21:13 GMT: We're automatically alerted by our monitoring systems to increased turnaround times in Known Faces reports, but it auto-recovers quickly
* 22:10 GMT: We're once again automatically alerted by our monitoring systems to increased turnaround times across Known Faces and Facial Similarity reports
* 23:42 GMT: Cluster health deteriorates and then returns to a healthy state. We keep seeing issues and identify a possible way to alleviate them: increasing the number of data nodes in the cluster
* 23:44 GMT: We scale up the number of data nodes in the cluster
* 01:37 GMT: Cluster is still unstable, but shows intermittent signs of improvement, related to ongoing rebalancing operations
* 03:20 GMT: Service fully stabilised, normal functioning is re-established
### Remedies
The following day, the team outlined a plan for recovering stability to guarantee this wouldn't happen again. The root cause of this issue was a suboptimal data sharding strategy. The main remedy involved a long migration operation to change our sharding strategy. This allows hot indices of faces (e.g., clients holding a large number of users relative to other clients) to be better sharded across our cluster, avoiding imbalances in the way data is spread across our running data nodes.

resolved

Known Faces reports in the US were working back to normal from 3:20am UTC.

monitoring

Cluster now more stable. We're continuing to monitor the situation.

investigating

We've identified unusual CPU activity in two of our search cluster nodes. In the meantime, we've provisioned additional nodes to mitigate the negative effects. We are continuing to investigate the root cause of this issue.

investigating

We are continuing to investigate the issue.

investigating

We are currently investigating higher processing times for Known Faces reports in the US region.

Report: "Watchlist KYC Fallback"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and is currently in testing.

investigating

Onfido's data partner for Watchlist KYC (not full) has been experiencing issues starting on Jan 14, 2025 - 11:40 UTC. Some of Onfido’s Watchlist KYC checks may not complete and may remain in the 'pending' status. The fallback provider has been enabled as a replacement. As soon as possible, Onfido will re-run any checks that are pending. We apologise for any inconvenience this may cause.

Report: "Identity Enhanced - Decrease in clear rates"

Last update
resolved

This incident has been resolved

investigating

Identity Enhanced is experiencing a major decrease in clear rates for applicants all over the world except UK

Report: "Increased check processing latency in Canada"

Last update
resolved

This incident has been resolved.

monitoring

This incident is resolved. All queued reports have been cleared. We apologize for the degraded service.

monitoring

The underlying problem has been identified and fixed. New reports are being processed as normal, and we are beginning to handle reports that were impacted and queued.

investigating

The latency increase is contained to reports going through manual processing from our analysts. We are continuing to investigate the issue.

investigating

We are facing an issue in our Canada instance where processing latency is increasing. We are investigating.

Report: "Decrease of clear rate for Identity Enhanced"

Last update
resolved

The clear rate is back to normal.

monitoring

The issue has been identified and the situation is going back to normal.

investigating

Identity Enhanced is experiencing a slight decrease in clear rates for applicants.

Report: "Studio Degradation"

Last update
postmortem

### Summary
On 19 November 2024 at 18:45 UTC, a database change on an index in the EU and US regions seriously impacted the creation and management of workflow runs via the Studio API and the SDK.
### Root Causes
The addition of a column to an existing index in a core database table, aimed at improving the performance of a specific combination of filters in the Dashboard results page, was performed by first dropping the existing index and then recreating it with the additional column. The first operation resulted in a spike in CPU overhead in all database operations involving that table, which deprioritized the second operation. The instability of the system continued until the new index was force-created.
### Timeline
The timeline below refers to 19 November 2024; all entries are in UTC:
* 18:44:25 - Operation dropping the index was started
* 18:45:37 - First request fails due to statement timeout
* 18:49:00 - Alarm triggers for a high surge of 5xx HTTP errors for the Studio API
* 19:18:02 - Incident was reported
* 19:47:00 - Index started being manually force-created in the US
* 19:57:00 - US region recovered
* 19:59:00 - Index started being manually force-created in the EU
* 20:09:00 - EU region recovered
* 20:47:09 - Incident resolved
### Remedies
* Integrate database migration acceptance rules and broaden the list of reviewers;
* Introduce a kill switch to enforce a Studio API "maintenance mode" in order to be able to prioritize recovery actions and reduce overall Mean Time To Recovery;
* Fully split the data migration pipeline from the code deployment pipeline;
* Split Dashboard read-only post-execution traffic from the critical path.
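
For context on the root cause, the usual way to replace a PostgreSQL index without losing coverage is to build the new index first, concurrently, and drop the old one only afterwards. A minimal sketch, assuming PostgreSQL via psycopg2 and hypothetical table, column and index names (not Onfido's actual migration):

```python
# Minimal sketch, not Onfido's actual migration: replace an index by building
# the new one first, CONCURRENTLY, and dropping the old one only once the
# replacement exists, so queries never lose index coverage.
import psycopg2

conn = psycopg2.connect("dbname=studio")  # hypothetical connection string
conn.autocommit = True  # CREATE/DROP INDEX CONCURRENTLY cannot run inside a transaction

with conn.cursor() as cur:
    # Build the replacement index without blocking reads or writes on the table.
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS workflow_runs_account_status_idx "
        "ON workflow_runs (account_id, status)"
    )
    # Drop the old index only after the new one is in place.
    cur.execute("DROP INDEX CONCURRENTLY IF EXISTS workflow_runs_account_idx")
```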

resolved

This incident is resolved. A postmortem with more details will be provided soon.

monitoring

A fix was deployed 15 minutes ago; clients should see all systems back to normal. We're still monitoring.

identified

Clients using the Studio feature are seeing 5xx errors. We're restoring the service.

Report: "Degraded performance for Identity Enhanced reports"

Last update
resolved

This incident has been resolved.

monitoring

The clear rates are back to normal values. We're monitoring the situation to make sure the issue is fixed.

investigating

Identity Enhanced is experiencing a slight decrease in clear rates for applicants all over the world except the UK.

Report: "anomalous number of 5xx errors"

Last update
postmortem

### Summary
A database schema migration to one of Studio's tables caused a temporary mismatch between the database state and the application-side ORM model during a deployment. For a short period, old instances of the application tried to access columns that no longer existed via the ORM. As a consequence, this caused a surge of errors in the application, which led to 5xx HTTP errors on multiple Studio endpoints - around 23% of the traffic during a 15-minute interval - contributing to an overall error rate of 0.52% for the whole day. During the incident period, Workflow Runs also could not be completed and ended up in error status - a 31.8% cancellation rate during the 15-minute interval - while the overall cancellation rate was 0.28% for the day.
### Timeline
* 08:18 - Database migration is deployed (updated application-side code not yet rolled out to all instances).
* 08:20 - Alarms triggered and Engineering starts evaluating the error.
* 08:31 - Issue identified.
* 08:33 - Issue resolved.
### Remedies
The process for reviewing and releasing database migrations will be updated to automatically detect and enforce migrations and DB schema changes that need to be done in two separate steps.
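
The remedy refers to splitting destructive schema changes into two separate steps. A minimal sketch of that expand/contract pattern, assuming PostgreSQL via psycopg2 with hypothetical table and column names (not Onfido's tooling):

```python
# Sketch of the two-step (expand/contract) pattern the remedy describes.
#
# Deployment 1: ship application code that no longer reads or writes the
# column, while the column itself stays in the database. Old and new app
# instances can then coexist during the rollout without ORM errors.
#
# Deployment 2: once every running instance is on the new code, run the
# destructive migration on its own.
import psycopg2

conn = psycopg2.connect("dbname=studio")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Safe to run only after deployment 1 has fully rolled out everywhere.
    cur.execute("ALTER TABLE workflow_runs DROP COLUMN IF EXISTS legacy_status")
```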

resolved

Studio-related features were partially down due to a deployment that happened this morning. The impact period was 8:18 to 8:33.

Report: "Increase in cancellation rate of Qualified Electronic Signature capture Studio tasks"

Last update
resolved

The issue is now solved.

monitoring

The cancellation rate is back to normal values. We're monitoring the situation to make sure the issue is fixed.

investigating

We are currently investigating an issue affecting QES capture Studio tasks, for which we see an increase in cancellations.

Report: "EU Dashboard delays to display checks list"

Last update
resolved

This incident has been resolved.

monitoring

The new compute capacity has been fully provisioned and there are no more delays in check processing.

monitoring

We have implemented a fix for this issue. We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet; we appreciate your patience during this incident.

identified

New compute capacity is still being added. There is still a delay to see new data in the Dashboard but that delay is slowly reducing.

identified

The Dashboard should now load properly. However, customers may experience a delay in viewing new data in the Dashboard. We are working with our infrastructure provider to provision more capacity to fix the Dashboard reporting performance.

identified

Due to an infrastructure issue, our EU Dashboard has delays in showing new and existing checks in the Checks list. We have identified the problem and are working on a solution right now.

Report: "Elevated error rate affecting report creation in the EU cluster"

Last update
postmortem

### Summary
On September 24th, from 10:33 UTC to 11:14 UTC, customers faced a very high error rate when creating new document Reports. Reports created before 10:33 UTC and non-document reports continued to be processed. For the duration of the incident, only 10% of new document report requests were accepted and processed.
### Root Causes
At 10:33 we started the release of an infrastructure change: a migration from a legacy deployment system to a newer standard one. The new configuration contained a typographical error that removed some permissions for an internal service. As a result, the service wasn't able to interact with other infrastructure components. That service is in charge of handling documents, making any document media upload or download fail for our customers.
### Timeline
* 10:33:03 - An infrastructure change is released
* 10:37:00 - The error rate for our internal API used to process uploaded documents reaches 80%
* 10:38:37 - On-call monitoring is triggered and an incident is created
* 10:47:00 - Log analysis identifies the error and the services impacted
* 11:03:00 - The root cause is identified and work on a fix is prepared
* 11:10:34 - The release of the fix is triggered
* 11:14:00 - The service is fully operational
### Remedies
* The release process for this class of infrastructure change will be hardened to shorten the feedback loop and expedite resolution. This will be completed before any further related changes are applied.

resolved

This issue is now resolved. We take a lot of pride in running a robust, reliable service and we're working hard to make sure this does not happen again. A detailed postmortem will follow once we've concluded our investigation.

monitoring

We have implemented a fix for this issue. We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet; we appreciate your patience during this incident.

identified

The issue has been identified and a fix is being implemented. We will provide a further update at 11:30 UTC

investigating

We're currently experiencing elevated error rates impacting report creation, affecting all clients. We will provide an update at 11:15 UTC

Report: "Disruption to check processing in EU"

Last update
postmortem

### Summary
On 12th September 2024, from approximately 16:15 UTC to 17:45 UTC, we experienced a period of instability that impacted applicant updates (68%), applicant deletions (100%) and document check completion rates (40%). Because our clients and partners use our applicant endpoints in various ways, this didn't affect them uniformly across the board. It's estimated that it impacted the creation of at most 14% of our verifications overall during that period, and slightly fewer on Studio (11%), in our European region. This was followed by an extended period during which many historic check PDFs were unavailable for retrieval via the Dashboard and API. Full correction of this issue was completed at 08:30 UTC on September 14th.
### Root Causes
The maximum available key size was exceeded in a database table used by one of our services to store historical applicant data. Due to this, the application was unable to save new data for existing applicants (new applicants were not affected). Given the volume of data stored in this table, while correcting the issue, priority was given to restoring service availability at the expense of historical data retrieval, which prevented downloading check PDFs for older applicants until it was fully restored on September 14th at 08:30 UTC.
### Timeline
(times in UTC)
September 12th
* 16:15 - Incident started
* 16:23 - Issue was identified
* 17:45 - Issue was fixed
September 14th
* 08:30 - Historical data backfill completed
### Remedies
Review key types and growth rates of other tables across our data stores. Monitor table growth and migrate to larger key types where required.
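
The root cause describes key-space exhaustion in a table, and the remedy is to monitor growth and migrate to larger key types. A minimal sketch of such monitoring, assuming a PostgreSQL table with a sequence-backed 32-bit integer primary key (the postmortem does not name the datastore, and all names are hypothetical):

```python
# Minimal sketch: watch how much primary-key headroom remains for a
# sequence-backed 32-bit integer key. Database, table and sequence names
# are hypothetical.
import psycopg2

INT4_MAX = 2_147_483_647  # upper bound of a 32-bit signed integer key

conn = psycopg2.connect("dbname=applicants")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute("SELECT last_value FROM applicant_history_id_seq")  # hypothetical sequence
    (last_value,) = cur.fetchone()
    used = last_value / INT4_MAX
    print(f"primary key space used: {used:.1%}")
    if used > 0.75:
        # Alert well before exhaustion; the long-term fix is migrating the key
        # to BIGINT, e.g. ALTER TABLE applicant_history ALTER COLUMN id TYPE bigint
        print("WARNING: plan a migration to a larger key type")
```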

resolved

This incident has been resolved. A postmortem will follow once we've concluded a full investigation.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The cause of this issue has been identified and we are applying a fix

investigating

We are currently investigating this issue.

Report: "Identity Enhanced degraded performance"

Last update
resolved

Performance is now back to normal

monitoring

The 3rd party provider issue is now resolved. We are continuing to monitor for any further issues.

identified

Identity Enhanced is experiencing a major decrease in clear rates for applicants all over the world except the UK.

Report: "Degraded performance on Identity Enhanced reports"

Last update
resolved

Performance is now back to normal

monitoring

We are continuing to monitor for any further issues.

monitoring

The 3rd party provider issue is now resolved. We are continuing to monitor for any further issues.

identified

Identity Enhanced is experiencing a slight decrease in clear rates for applicants all over the world except the UK.

investigating

We are currently investigating the issue.

Report: "Timeouts invoking Studio API & Dashboard"

Last update
postmortem

### Summary
Between **2024-09-06 18:40:00** and **2024-09-07 02:25:00** (UTC), a surge of requests performed by a single account, which retrieved older pages in the workflow runs dataset used to render the Workflow Results page of the Onfido Dashboard, caused overhead in database resource management. This escalated and eventually impacted all Studio-related traffic requiring database access: approximately 0.2% of SDK and API traffic, and 6.3% of Dashboard-related traffic. A second instance of this incident happened between **2024-09-11 09:58:00** and **2024-09-11 11:18:00** (UTC), with similar impact to that described above (0.39% of SDK and API traffic, 1.22% of Dashboard-related traffic).
### Root Causes
The affected request paginates the dataset using a `page` query parameter, which the SQL query used to retrieve the results from the database translates to LIMIT/OFFSET. The performance of these queries degrades the higher the OFFSET. The requests associated with the incident had an unusually high `page` value (2000 or higher), which the database struggled to respond to, degrading its performance in the process. Due to a misconfiguration, the database session statement timeout was ignored, which prevented the database from force-terminating the queries and freeing up resources earlier.
### Remedies
The statement timeout was correctly configured for all affected database sessions. Moreover, measures will be taken to prevent and minimize the impact of such user actions.
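
For context on the root cause, here is a minimal sketch contrasting OFFSET pagination with keyset (seek) pagination, assuming PostgreSQL via psycopg2 and hypothetical table and column names (not Onfido's actual query):

```python
# Minimal sketch, not Onfido's code: why deep OFFSET pagination is slow and
# what keyset (seek) pagination looks like instead.
import psycopg2

PAGE_SIZE = 50
conn = psycopg2.connect("dbname=studio")  # hypothetical connection string

with conn, conn.cursor() as cur:
    # OFFSET pagination: page 2000 forces the database to produce and discard
    # 100,000 rows before returning 50, so cost grows with the page number.
    cur.execute(
        "SELECT id, created_at FROM workflow_runs "
        "ORDER BY created_at DESC LIMIT %s OFFSET %s",
        (PAGE_SIZE, 2000 * PAGE_SIZE),
    )
    rows = cur.fetchall()

    # Keyset pagination: remember where the previous page ended and seek
    # directly to it via an indexed column, so every page costs about the same.
    last_created_at = rows[-1][1]
    cur.execute(
        "SELECT id, created_at FROM workflow_runs "
        "WHERE created_at < %s ORDER BY created_at DESC LIMIT %s",
        (last_created_at, PAGE_SIZE),
    )
    next_page = cur.fetchall()
```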

resolved

Customers faced a partial performance degradation between 2024-09-06 18:40:00 and 2024-09-07 02:25:00 (UTC) when invoking the public API and Dashboard for the Studio component.

Report: "Deteriorate performance for Watchlist product"

Last update
resolved

Impacted checks are now completed. Performance is now back to normal

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

The 3rd party provider issue is now resolved. Watchlist performance is now back to normal. Any impacted checks will be re-run shortly

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified with our 3rd party provider.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating the issue

Report: "Increased error rates and latency on US region"

Last update
resolved

AWS confirms the issue is resolved.

monitoring

We are still monitoring the situation; even though AWS does not report full recovery, we don't see errors on our side.

identified

We are still monitoring the situation; we are seeing low error rates. We will post when we have further updates.

identified

We are still monitoring the situation; we are seeing lower error rates and latencies across all services.

identified

We are still monitoring the situation; AWS reports they expect the issue to be resolved in 2-3 hours.

identified

We are monitoring the situation and we are in contact with AWS to get the latest details. They are still working on a fix.

identified

We're currently experiencing elevated error rates and latency impacting our US region, across all our products. Initial investigations have identified upstream issues in AWS's us-east-1 infrastructure. We are currently in contact with AWS; they have identified the problem and are actively working on a fix.

Report: "Elevated error rate affecting API requests in the US region"

Last update
postmortem

# Summary
On July 19th, our customers experienced an increased error rate in our US instance from 13:25 to 15:42. From 13:25 to 14:50 the error rate went from ~0% to 0.5%. It then peaked at 6% until the problem was solved at 15:42.
# Root Causes
A routine infrastructure change was introduced on July 19th at 13:10 UTC, progressively rolling out to all our systems over the next few hours. The change was incompatible with a legacy configuration of our logging system in the US instance, leading to sporadic application errors. Aggregate error rates were initially too low to trigger an alert, delaying the time to respond. And as error logs were interrupted, it was much more difficult to identify the source of the problem.
# Timeline (UTC)
* 13:10 - Infrastructure change is published and progressive rollout starts.
* 13:25 - Error rate starts increasing to ~0.5%.
* 13:50 - According to our monitoring, a single service is impacted and a new version of it had just been released. The initial thought is that this new version is the culprit.
* 14:20 - The service is rolled back to its original version, but the high error rate isn't fixed.
* 14:50 - More services start to be impacted and the error rate increases.
* 15:15 - The root cause is identified and a fix (rollback) is in preparation.
* 15:25 - The rollback is ready; preparing the release.
* 15:42 - The fix is fully rolled out and the error rate is back to ~0%.
# Remedies
Amend the incompatible US legacy observability configuration to restore consistency with our standard regional setup.

resolved

This issue (elevated error rate affecting API requests in the US region) is now resolved. We take pride in running a robust, reliable service and are working hard to prevent this from happening again. Once we've concluded our investigation, a detailed postmortem will follow.

monitoring

We've identified the potential issue and have implemented a fix. We will monitor over the next 30 minutes to confirm the fix is effective.

investigating

We're currently experiencing elevated error rates impacting all API requests in the US region. We will provide an update in 30 minutes.

Report: "Issue while processing proof of address uploaded documents"

Last update
postmortem

### Summary
On July 10th, our customers experienced increased latency for Proof-of-Address (POA) tasks.
### Root Causes
A routine infrastructure change was introduced on July 10th at 10:41 UTC. It was slowly rolled out to all our systems over the day. The change was incompatible with an internal service that serves Proof-of-Address document images. As a result, our analysts performing manual Proof-of-Address verification tasks would sporadically fail to load images. Due to the slow rollout, it took some time before there was a noticeable impact on our analysts, which made identifying the root cause harder.
### Timeline (UTC)
* 10:41 - Infrastructure change is published and progressive rollout started
* 18:20 - 67% of our systems were updated; a growing and significant analyst impact led to the incident being opened
* 18:22 - Investigation starts. Some analysts are still able to process some tasks; the backlog slowly increases
* 19:42 - The root cause is identified and a fix (rollback) is in preparation
* 20:07 - The rollback is ready; preparing the release
* 20:30 - The fix is fully rolled out and analysts no longer experience problems
* 20:50 - After 20 minutes of stability and no more errors reported, we close the incident. Analysts are able to process the backlog quickly.
### Remedies
* The incompatible application will be updated prior to reapplying the infrastructure upgrade
* A new monitor and alert have been added to the internal service that serves POA document images

resolved

The system is now working as expected. We are closing the incident.

monitoring

A fix has been implemented. We are monitoring our services to ensure everything is running correctly. The backlog of Proof-of-Address tasks is being processed.

investigating

We've detected an issue in our infrastructure that is preventing us from processing uploaded proof of address documents.

Report: "Studio webhook delivery latency degradation"

Last update
resolved

This incident has been resolved.

investigating

This incident has been resolved. Sorry for any inconvenience this has caused.

investigating

Customers should have seen improvement starting around 30 minutes ago; we're still monitoring.

investigating

Studio customers may experience latency on webhook delivery; the relevant teams are investigating the issue.

Report: "Degraded performance creating checks"

Last update
postmortem

### Summary
On June 5th 2024, between 15:00 UTC and 16:41 UTC, our non-Studio customers in the EU experienced an ever-increasing error rate for Check creation for 52 minutes, followed by 49 minutes of full Check creation downtime. Regions other than the EU were not affected. Subsequently, between 16:42 UTC and 18:02 UTC, customers using synchronous Check results suffered timeouts for 70 minutes (webhooks were not affected).
### Root Causes
On May 2nd 2024, we performed a routine operation to upgrade the Ruby version used in two Lambda functions that are used for processing checks. Both functions were running on Ruby 2.7, which was [deprecated by AWS on Dec 12th 2023](https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html). The change was deployed successfully and without any impact on latency or error rate. However, the version change in the dependency system ([bundler](https://bundler.io/)) also removed the identifier used for [pessimistic versioning](https://thoughtbot.com/blog/rubys-pessimistic-operator), which would allow a patch version change in the Lambda runtime. Our cloud provider (AWS) ran automated upgrades to a new patch version of the Ruby 3.3 Lambda runtimes. The critical path of our Check creation process involved that Lambda function. The AWS runtime upgrade meant that new Lambda instances being launched would not work, because they failed to start with a new (higher) patch version of the runtime. The same effect was seen on another Lambda function, which is responsible for handling the response to clients when a synchronous check is completed, causing a delay in check completion because those checks wouldn't complete within the request time of `POST /vX/checks`. The runtime upgrades by AWS were done progressively, over an hour, making it complex to pinpoint the exact root cause because the failure mode was slow rather than immediate.
### Timeline (UTC)
* 15:00 - Cloud provider AWS starts the rollout of the upgrade of the ruby3.3 Lambda runtime from v4 to v6;
* 15:12 - Alerts for a high error rate in the Lambda that orchestrates check creation;
* 15:15 - Investigation starts;
* 15:57 - Status page is updated with the ongoing incident;
* 16:00 - The AWS automated upgrade for the runtime involved in our Check creation process is completed (check creation impact reaches 100%);
* 16:02 - The root cause is confirmed and we start implementing a solution;
* 16:33 - Deployment of the solution (pinning the runtime to our specified version) is triggered;
* 16:41 - The solution to fix our Check creation process is fully applied;
* 16:42 - We receive alerts for a high error rate in Lambdas involved in synchronous Check creation (for customers using synchronous checks, but not using webhooks);
* 17:01 - We start listing all impacted Lambdas to implement the same corrections;
* 17:41 - Lambdas are fixed and progressively rolled out as soon as they are ready;
* 18:02 - All Lambdas involved in the entire Check creation workflow are fixed.
### Remedies
The following actions have resulted from our root-cause analysis:
* Review all our Lambda function provisioning configuration to ensure no unsupervised, automated update happens, so that we are in control of even patch version upgrades of AWS Lambda runtimes;
* Review on-call profile permissions for AWS Lambda resources;
* Expand our run-books with additional instructions on how to handle similar failures with Lambdas;
* Planned in our roadmap: implement short-circuit CI/CD pipelines for on-call engineering to use, allowing us to skip certain steps in order to restore service faster (reduce MTTR) in situations, such as this, of full system downtime.
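
The timeline mentions pinning the runtime to a specified version. One generic way to do that is AWS Lambda's runtime management controls; a minimal sketch assuming boto3, with a hypothetical function name and runtime version ARN (not necessarily how Onfido applied the fix):

```python
# Illustrative sketch: AWS Lambda's runtime management controls can pin a
# function to a specific runtime version so the provider does not roll it
# forward automatically. Function name and version ARN are hypothetical.
import boto3

lambda_client = boto3.client("lambda", region_name="eu-west-1")

lambda_client.put_runtime_management_config(
    FunctionName="check-creation-orchestrator",  # hypothetical function name
    UpdateRuntimeOn="Manual",                    # opt out of automatic runtime patch upgrades
    RuntimeVersionArn="arn:aws:lambda:eu-west-1::runtime:<version-hash>",  # hypothetical pin
)
```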

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "EU Dashboard partially unavailable"

Last update
resolved

The incident is now closed.

monitoring

A fix has been implemented. Customers still facing issues accessing the Dashboard in the EU will need to clear their cache to get the latest version of all components.

identified

Our EU dashboard is partially unavailable. We have identified the issue and are working on a solution.

Report: "webhook delivery degradation"

Last update
postmortem

# Summary
Between 03:00 UTC and 05:20 UTC on 28th March, a large increase in duplicate webhook notifications led to webhook queues starting to grow. Webhook events created between 05:20 and 09:37 were stalled, as the message queue had become completely congested, and were therefore not delivered; delivery of these queued webhooks resumed at 12:10 and they were fully cleared by 13:30. All new events from 09:37 were processed as normal.
# Root Causes
We ran into a quota limit imposed by our cloud provider for the maximum number of in-flight queued messages. This was caused by a combination of:
* An unusually high volume of duplicated webhook messages; and
* The configuration of our retry strategy.
All webhook messages were being processed via a single queue; to overcome the quota limit, a secondary queue was required to allow new messages to be processed. The retry process was preventing previously queued messages from being cleared; hence, retries were temporarily suspended to resume processing.
# Timeline
* 28/03/2024 03:00 UTC: Webhook events began to get congested, leading to a gradual degradation in webhook processing.
* 28/03/2024 04:08 UTC: Monitoring alerted on-call to a build-up in queued webhook events, and an engineer began investigating.
* 28/03/2024 04:20-04:55 UTC: The webhook service was scaled up, but this did not resolve the problem.
* 28/03/2024 05:20 UTC: In-flight message quota is reached; no delivery for newly created events.
* 28/03/2024 08:25 UTC: AWS were contacted to increase the in-flight message quota.
* 28/03/2024 08:30 UTC: It was determined that a secondary queue, inheriting an increased quota, would be required to unblock new webhook deliveries.
* 28/03/2024 09:37 UTC: Secondary queue deployed, unblocking new webhook deliveries; in the meantime, engineers continue to work on a solution for events generated between 05:20 and 09:37.
* 28/03/2024 12:10 UTC: Retries were temporarily suspended to resolve delivery of the queued webhook events generated between 05:20 and 09:37.
* 28/03/2024 13:30 UTC: All queued webhooks cleared.
# Remedies
* Additional monitoring to alert on-call when approaching queue quota limits (DONE).
* Move webhook retries to a dedicated secondary queue to avoid blocking new events from being processed (ETA: April 2024).
* Introduce filtering to discard duplicate messages, to avoid redundant queue expansion (ETA: April 2024).
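
Two of the remedies above (a dedicated retry queue and duplicate filtering) can be sketched generically with SQS via boto3; the queue URLs, message shape and in-memory dedupe store below are hypothetical simplifications, not Onfido's implementation:

```python
# Illustrative sketch: route retries to a separate queue and drop duplicate
# events before enqueueing, so retries and duplicates cannot exhaust the
# primary queue's in-flight capacity.
import json
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
PRIMARY_QUEUE = "https://sqs.eu-west-1.amazonaws.com/123456789012/webhooks"       # hypothetical
RETRY_QUEUE = "https://sqs.eu-west-1.amazonaws.com/123456789012/webhook-retries"  # hypothetical

seen_event_ids = set()  # in practice a shared store (e.g. Redis) rather than process memory

def enqueue(event: dict) -> None:
    # Discard duplicates so repeated notifications don't inflate the queue.
    if event["id"] in seen_event_ids:
        return
    seen_event_ids.add(event["id"])
    sqs.send_message(QueueUrl=PRIMARY_QUEUE, MessageBody=json.dumps(event))

def requeue_for_retry(event: dict) -> None:
    # Failed deliveries go to a separate queue so retries can never block
    # brand-new events on the primary queue.
    sqs.send_message(
        QueueUrl=RETRY_QUEUE,
        MessageBody=json.dumps(event),
        DelaySeconds=60,  # simple backoff before the next delivery attempt
    )
```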

resolved

This incident has been resolved.

monitoring

Delivering missing webhooks from 3:00am to 10:30am UTC

identified

We're working on a script to resend webhooks from 3:00am to 10:30am UTC. While we apply the fix, if you rely on this webhook data, we invite you to switch to API calls to get the data.

identified

We've deployed a fix. New webhooks should be delivered correctly. Older webhooks will still experience latency; we'll provide an update on them soon.

identified

We've identified the issue and are applying a fix. We'll update later.

investigating

We are still investigating the issue.

investigating

Clients may see webhook event duplication or latency. We're still investigating the root cause.

investigating

We are currently investigating the issue

Report: "Increased turn-around-time for document reports in the EU region"

Last update
resolved

We experienced a small disruption of our services related to Document checks. We are now processing at a normal rate, and expect all the backlog to be completed in the next 2 hours. We are sorry for any inconvenience.

Report: "Known Faces Degraded request times"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Unable to create checks in EU"

Last update
postmortem

# Summary
On Monday 4 Mar 2024 we ran a routine infrastructure component upgrade on the computing layer of our EU region. This upgrade is regularly performed with a Blue/Green strategy: a secondary compute cluster is created, all services are deployed, and traffic is then progressively switched from the primary to that secondary cluster. This process was validated on multiple test environments and had already been successfully applied to our other production regions.
However, the networking configuration in the EU installation had an inconsistency, which led to traffic to our API authorization service remaining on the primary cluster. As a result, when the primary cluster was scaled down, it led to an outage of our API for 15 minutes.
During the outage, many customers accumulated requests on their side and proceeded with retries when we restored the service. This means that after the API was restored, we faced a particularly high surge of traffic. Because the new cluster was without traffic for 15 minutes, automated scaling procedures started to scale it down. As a result, a key Document processing service did not handle the traffic surge. It took more time than we aim for to scale it back up, during which document reports were unavailable.
**Overall this means 15 minutes of report creation downtime, and a further 50 minutes of disruption to document report creation.**
# Root Causes
* Inconsistent network configuration of our EU environment compared with all others.
* Autoscaling up isn't fast enough for some services when faced with extreme demand.
* Downscaling of the primary cluster was too aggressive.
# Timeline
* 10:20 UTC: Traffic switch to the secondary cluster triggered.
* 10:37 UTC: Primary cluster is scaled down automatically in response to lower traffic.
* 10:37 UTC: Our authentication component fails to authorize API requests.
* 10:52 UTC: Traffic is properly routed to the secondary cluster and the API can handle authorization requests again.
* 10:58 UTC: Alert for a high error rate on the API is triggered.
* 11:06 UTC: The key document service unable to handle the load is identified.
* 11:27 UTC: Key document service scale-up is manually accelerated on both clusters.
* 11:29 UTC: The primary cluster upgrade is done. Traffic is progressively moved back to it.
* 11:42 UTC: All services are upscaled and stable. Full functionality is restored.
# Remedies
* Fix the traffic switching network configuration of our EU region.
* Change the upgrade process such that any downscaling of the primary cluster is done conservatively.
* Improve the responsiveness of the failing document service to handle surge demand.

resolved

Following subsequent monitoring, our systems continue to be stable since services were fully restored at 11:42 UTC, and this incident is now closed. There remains a very small backlog of impacted reports that will be completed in the next few hours. This incident was caused by the failure of a routine maintenance operation. The failure led to our API being unavailable, resulting in all EU customers facing downtime from 10:37 UTC to 10:52 UTC. There was a further problem in the subsequent recovery period that prevented documents from being uploaded from 10:52 UTC to 11:42 UTC. This blocked the creation of checks that depended on document capture. For Studio customers, workflows were created, advancing until the document capture step. After document upload recovered, applicants were able to resume from that workflow step. A more detailed postmortem will follow. We pride ourselves on the reliability of our service, and apologise for the disruption caused by this incident.

monitoring

The API is now stable and correctly accepting checks and documents. We are processing all reports. We will keep monitoring.

identified

A fix for the issue has been built; we are in the process of deploying it to all infrastructure. The Dashboard and API are available, except for creating checks.

identified

The issue has been identified and a fix is being implemented. The API should already be partly available.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Increased latency on check processing"

Last update
resolved

This incident has been resolved.

monitoring

We implemented a fix and processing latency is back to normal. We are monitoring closely until all the backlog is processed.

investigating

We are currently investigating this issue.

Report: "Identity report with increased TaT for USA jurisdiction"

Last update
resolved

This incident has been resolved.

monitoring

We found the root cause. TaTs are back to normal.

investigating

We are currently investigating the issue.

Report: "Delays in check processing"

Last update
postmortem

## Summary
On 12th Feb 2024, from 03:57 GMT to 04:36 GMT, a fraction of facial similarity checks and of document checks requiring manual review for some customers in the EU region experienced an increase in turnaround time. Delayed checks were then processed, and at 05:15 GMT the system was back to normal.
## Root Causes
An erroneous configuration was released on 9th Feb 2024 for a service responsible for handling checks requiring manual review, causing most replicas of the service to eventually fail.
## Timeline
* 03:57 GMT: Decrease in throughput of manual reviews in the EU.
* 04:05 GMT: The on-call team was alerted to an increase in error rate on a specific service.
* 04:36 GMT: Fix is applied and the system was back to normal operation.
* 05:15 GMT: Completion of the full backlog of checks generated during the incident window.
## Remedies
* Reverted the erroneous configuration.
* Review internal documentation to prevent this from happening again.
* Reviewed monitoring to alert Engineering more promptly.

resolved

This issue has been resolved and we've applied a fix. We expect to complete the remaining delayed checks in the next 45 minutes or so. We apologize for any inconvenience this has caused.

monitoring

We are monitoring.

investigating

We are currently investigating this issue.

Report: "Smart Capture Link service degradation"

Last update
resolved

A service release caused Smart Capture Link media upload to fail. This issue is now resolved.

Report: "Known Faces service degradation"

Last update
resolved

This issue is now resolved.

monitoring

We have implemented a fix for this issue. We are monitoring closely to make sure the issue has been resolved and everything is working as expected.

identified

We are experiencing service degradation impacting the Known Faces product. The root cause has since been identified and a fix is being applied.

Report: "Increased latency on check processing"

Last update
postmortem

### Summary
From Jan 2nd 2024, 14:28 UTC, until Jan 2nd, 16:06 UTC, there was an increase in Turn Around Time (TaT) for a subset of document checks being processed. The percentage of document reports impacted was as follows:
* EU: ~50% of reports;
* US: ~40% of reports;
* CA: not impacted.
The majority of these reports were completed immediately after the problem was resolved. The remainder, which needed manual review, were progressively cleared, with all reports completed by Jan 3rd 2024, 06:00 UTC.
### Root Causes
A functional bug was introduced upon releasing a service responsible for processing part of the document check, resulting in incomplete processing of some document reports.
### Timeline
_Times are displayed in UTC_
Jan 2nd 2024
* 14:28: Our automatic monitoring triggers an alert for an elevated timeout rate
* 14:53: Incident is declared once the impact has been assessed
* 15:00: Public status page is created
* 15:08: We rolled back some services that were released close to when the issue started, as a precaution, to remove potentially impactful changes while investigations continued
* 15:34: The above-mentioned rollback is finished, but no improvements are observed, so the investigation continues
* 15:52: The root cause is identified, and a revert of the faulty code is done
* 16:04: The revert is complete, and we start processing newly submitted live document reports normally
* 16:06: Incomplete reports submitted during the incident start to complete
Jan 3rd 2024
* 06:00: All affected reports were complete
### Remedies
* Additional E2E tests are being added to cover relevant edge cases (ETA: Q1 2024).
* Review our observability procedures to improve the recovery time for document report processing incidents (ETA: Q2 2024).

resolved

This incident is resolved. Most of the reports that were queued as a result of this issue have been cleared and turnaround times are returning to normal. Any remaining queued reports are expected to complete within the next five hours. We apologize for the degraded service.

monitoring

We are continuing to monitor for any further issues.

monitoring

We have implemented a fix for this issue in the EU region. We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet; we appreciate your patience during this incident.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Increased latency on check processing"

Last update
resolved

This incident has been resolved.

monitoring

We implemented a fix and processing latency is back to normal. We are monitoring closely until all the backlog is processed.

identified

We have identified the source of the disruption and are working on a solution

investigating

We're currently experiencing issues that are negatively impacting latency on check processing.

Report: "Increased latency on check creation"

Last update
resolved

This incident has been resolved.

monitoring

We have implemented a fix for this issue. We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet; we appreciate your patience during this incident.

investigating

We're currently experiencing issues that are negatively impacting latency on check creation. Our next update will be in 15 minutes. Thank you for your patience.

Report: "New checks missing in Dashboard"

Last update
resolved

We have cleared the backlog; all checks will now show in the Onfido Dashboard.

identified

We have identified the issue and we are processing the backlog of checks.

investigating

Checks/reports issued in the past 2 hours are missing from the Onfido Dashboard; we are working on restoring the data.

Report: "Limited delays in check processing"

Last update
resolved

This incident has been resolved.

investigating

The issue is mitigated and there is no impact. We will continue monitoring.

investigating

We are continuing to investigate this issue.

investigating

The issue is limited to EU. We are still experiencing limited degraded performance in EU.

investigating

We are continuing to investigate this issue.

investigating

We're currently experiencing issues that are negatively impacting latency on check completion. Our next update will be in 15 minutes. Thank you for your patience.