Hypatos

Is Hypatos Down Right Now? Check whether there is an ongoing outage.

Hypatos is currently Operational

Last checked from Hypatos's official status page

Historical record of incidents for Hypatos

Report: "Performance Degradation in EU Cluster"

Last update
resolved

This incident is now resolved. We did not see any more issues during the monitoring phase.

monitoring

A fix was implemented and we're monitoring its status.

investigating

We are currently experiencing poor performance during document upload and processing on our EU Cluster. Our Customer Care and Engineering teams are actively working on it. We are sorry for the inconvenience. We will provide an update at 16h00 CET.

Report: "Service Disruption - Extraction Processing Delays"

Last update
resolved

The incident is now resolved after confirmation. Please let us know if you experience any further disruption.

monitoring

The fix is now fully implemented and we're monitoring the system. Documents are flowing again.

identified

The issue was identified and we're now in the process of implementing a fix.

investigating

We are continuing to investigate this issue.

investigating

We are currently experiencing delays in our Extraction processing system. This may result in slower document processing times than usual. Our team is actively investigating the issue and working towards a resolution. We apologize for any inconvenience caused and will provide updates as soon as possible.

Report: "API Disruption - Elevated Number of Errors"

Last update
resolved

The issue is now solved and systems are back to normal. Please contact our Customer Care team if you need further assistance.

identified

The issue was identified and the team is now implementing the solution.

investigating

We have identified an elevated level of API errors and are currently looking into the issue. This means that users may be unable to access or utilise our API functionalities. Our engineering team is working on resolving the issue and we will keep you informed of any progress. We will send an additional update as soon as possible.
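For consumers of the API, a common way to ride out an elevated error rate like the one described above is client-side retrying with exponential backoff. The snippet below is a generic, hedged sketch, not an official Hypatos client; the endpoint URL and token variable are placeholders.

```python
# Generic retry-with-backoff wrapper for calling an HTTP API during elevated error rates.
# The AUTH_TOKEN value and the example URL are placeholders, not real Hypatos values.
import time
import requests

AUTH_TOKEN = "..."  # placeholder credential

def get_with_backoff(url, max_attempts=5, base_delay=1.0):
    """GET `url`, retrying on 429/5xx responses and network errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            resp = requests.get(
                url,
                headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
                timeout=30,
            )
            if resp.status_code != 429 and resp.status_code < 500:
                return resp  # success or a non-retryable client error
        except requests.RequestException:
            pass  # treat network errors like retryable server errors
        time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")

# Example call (placeholder URL):
# resp = get_with_backoff("https://api.example.com/v1/documents")
```

Capping the number of attempts and only retrying 429/5xx responses keeps clients from hammering a degraded service while it recovers.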

Report: "Service Disruption - Extraction Processing Delays"

Last update
resolved

This incident is now resolved as we no longer see any issues.

monitoring

A fix has been in place for the last 30 minutes and we're now monitoring the system. We will provide more information on this case in the near future.

identified

We've found the issue and are reporting it to the provider. A workaround is being implemented and the system should be available in the next few minutes.

investigating

We are continuing to investigate this issue.

investigating

We are currently experiencing delays in our Extraction processing system. This may result in slower document processing times than usual. Our team is actively investigating the issue and working towards a resolution. We apologize for any inconvenience caused and will provide updates as soon as possible.

Report: "Performance degradation on EU Cluster"

Last update
resolved

Our product performance is back to normal after our Engineering team identified the issue. There was a problem with CloudAMQP that was reported to their support team and is now resolved. We are closing this incident to confirm that we are not currently experiencing issues. Please contact the Hypatos Customer Care team if you have questions about this issue or any other matter.

investigating

We are currently experiencing poor performance during document upload on our EU Cluster. Our Customer Care and Engineering teams are actively working on it. We are sorry for the inconvenience. We will provide an update at 17h00 CET.

Report: "Studio Upload low performance"

Last update
resolved

Our product performance is back to normal after our Engineering teams refactored the way the OCR service is used in our products. We are closing this incident to confirm that we are not currently experiencing issues. Please contact the Hypatos Customer Care team if you have questions about this issue or any other matter.

monitoring

The issue was identified and we have applied measures to solve it. Our teams are now continuously monitoring the upload queue to ensure that the system is behaving as expected. We will follow up as soon as the issue is closed. Thank you for your patience.

investigating

The issue has been mitigated and documents are being processed again. We are still investigating and will share more details as soon as we have a clear picture of what caused this performance decrease. Thank you for your patience!

investigating

Our teams are still working on the issue. Sorry for the inconvenience and thank you very much for your patience. We will come back with an update at 19h00 CET.

investigating

We are currently experiencing poor performance (OCR related) during document upload. Our Customer Care and Engineering teams are actively working on it. We are sorry for the inconvenience. We will provide an update at 17h30 CET.

Report: "Dec 2nd 2021 - We are experiencing issues in Hypatos Studio and Studio API since 14:20 CET"

Last update
postmortem

## Executive summary

During a routine system procedure a migration job was triggered. At the time of the occurrence we were experiencing a peak of processed documents, which we relate to customers' end-of-month/year processes. This migration job had a bug that caused a processing loop, which made our database cluster run out of space before the auto-scaler had time to increase capacity. The downtime was due to our DB cluster nodes having to recover all data (~140 GB at crash time).

## Postmortem report

| **Item** | **Report** |
| --- | --- |
| **Leadup** (sequence of events that led to the incident) | The automatic migration routine job started. A bug created a processing loop that increased the data stored in the DB. The space growth rate was higher than the scale-up rate. |
| **Fault** (what didn't work as expected) | The migration was not expected to increase the DB size substantially, even during peak times, and the Atlas auto-scaler was expected to handle the growth by increasing the DB size accordingly. (Supporting chart omitted.) |
| **Impact** (how internal and external users were affected) | Studio was down and all its features were inaccessible to all customers and internal users. The Studio API was also down during the entire incident and customers were not able to use it. One support ticket was opened by a customer and later one by the support team. Some customers complained by email to the Customer Success team. |
| **Detection** (when and how the team detected the incident) | The first to detect the problem were internal users reporting files stuck on upload in Studio. The Engineering team was contacted directly and the Support team followed the Engineering analysis and investigation. Our monitoring system alerted on a large number of messages in our queueing mechanism, and when all systems went down our StatusCake and Pingdom monitors also started sending alerts. |
| **Response** (who responded and what they did) | 12h02 - The first symptoms (documents stuck on upload) were reported by internal teams, who started investigating. 14h27 - First customer request arrived. 14h42 - First response sent to customers. 14h44 - An incident team was connected (video call); Engineering and Support worked together to investigate and communicate with customers. 15h05 - API went down for the first time. 15h57 - Atlas support was contacted and started investigating further. 16h22 - First response from Atlas support, advising that the nodes were healing and we should let the process finish. |
| **Recovery** (how user impact was mitigated and when the incident was deemed resolved) | 17h44 - First secondary replica recovered. 17h54 - All DB nodes recovered. 19h10 - API/Studio back up. 19h14 - Monitoring phase started. 19h20 - Studio API back up. 20h29 - Incident resolved. |
| **Five whys root cause identification** (see [5-whys analysis](https://www.atlassian.com/team-playbook/plays/5-whys)) | Why did Studio/API go down? Because the MongoDB cluster crashed. Why did the MongoDB cluster crash? Because the 3 replicas ran out of space. Why did the replicas run out of space? Because the Atlas resource manager did not have enough time to scale up and handle the DB size growth. Why was the DB growth rate higher than the scale-up rate? Because the migration caused a processing loop with huge files. Why did the migration cause a processing loop? Because the migration had a bug that surfaced during peak time. |
| **Related records** (past incidents with the same root cause) | N/A |
| **Lessons learned** | Improve QA tests to make sure these types of bugs are resolved before going to production. Enhance vendor response time on critical issues. |

## Incident timeline

11h22 - Migration job started
12h02 - The first symptoms (documents stuck on upload) were reported
14h41 - A fix was deployed to correct the migration bug
15h57 - Atlas support was contacted and started investigating further
16h22 - First response from Atlas support telling us that the nodes were recovering
17h44 - First secondary replica recovered
17h54 - All DB nodes recovered
19h10 - Studio back up
19h14 - Monitoring phase started
19h20 - Studio API back up
20h29 - Incident resolved

## Follow-up tasks

| **Issue** | **Owner** | **Action items** | **Documentation** |
| --- | --- | --- | --- |
| Improve QA tests so this type of bug does not reach production | Engineering | Improve QA tests to make sure these types of bugs are resolved before deploying to production | |
| Enhance vendor response time on critical issues | Engineering and Product | Review support contracts and TTRs to improve vendors' response time | |
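One of the lessons learned above is to catch runaway data growth before it takes the database down. Purely as an illustrative sketch (the connection string, database name, thresholds, and `migrate_batch` callback are assumptions, not Hypatos's actual migration code), a batched migration job could re-check storage headroom via MongoDB's `dbStats` command before each batch and abort early:

```python
# Hypothetical pre-flight storage guard for a batched migration job (illustration only).
# The connection string, database name, thresholds and migrate_batch callback are assumptions.
import os
from pymongo import MongoClient

MIN_FREE_RATIO = 0.30             # abort if less than 30% of filesystem space remains free
MAX_GROWTH_BYTES = 20 * 1024**3   # abort if the migration has already added ~20 GB of data

def storage_headroom(db):
    """Return (free_ratio, data_size_bytes) using MongoDB's dbStats command."""
    stats = db.command("dbStats")
    fs_total = stats.get("fsTotalSize", 0)
    fs_used = stats.get("fsUsedSize", 0)
    free_ratio = 1 - (fs_used / fs_total) if fs_total else 0.0
    return free_ratio, stats.get("dataSize", 0)

def migrate_in_batches(db, migrate_batch):
    """Run migration batches, re-checking storage headroom before each one."""
    _, start_size = storage_headroom(db)
    while True:
        free_ratio, current_size = storage_headroom(db)
        if free_ratio < MIN_FREE_RATIO:
            raise RuntimeError(f"Aborting migration: only {free_ratio:.0%} storage free")
        if current_size - start_size > MAX_GROWTH_BYTES:
            raise RuntimeError("Aborting migration: data growth exceeded the expected bound")
        if not migrate_batch(db):  # the callback returns False once no work is left
            break

if __name__ == "__main__":
    client = MongoClient(os.environ["MONGO_URI"])
    # "studio" and the no-op callback below are placeholders for illustration.
    migrate_in_batches(client["studio"], migrate_batch=lambda db: False)
```

Under these assumptions, a bounded-growth guard turns a looping migration into a failed job rather than a full-cluster outage, which is far cheaper to recover from.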

resolved

The issue is now completely resolved. We tested our products and all pending requests were processed successfully; our solution is stable. Thank you for your patience.

monitoring

Our systems are back up again. We are monitoring all systems carefully to make sure our products are now stable. We are very sorry for this inconvenience and we are working to make sure it won't happen again. Thank you for your patience!

identified

We are still finishing bringing up all our systems. We expect to have an update in 30 minutes. Thank you for your patience.

identified

Our DB services are now up and running. We are finishing bringing up all systems. We expect to have an update in 15 minutes. Thank you for your patience.

identified

We are still bringing our database back up and expect to have the system up as soon as possible. Sorry for the inconvenience; the next update will be in 30 minutes.

identified

We are still bringing our database back up and expect to have the system up as soon as possible. Sorry for the inconvenience; the next update will be in 30 minutes.

identified

We are still bringing our database back up and expect to have the system up as soon as possible. Sorry for the inconvenience; the next update will be in 30 minutes.

identified

We are now bringing our database back up and expect to have the system up as soon as possible. We will update you in 15 minutes.

investigating

We are now bringing our database back up and expect to have the system up as soon as possible. We will update you in 15 minutes.

investigating

We have been experiencing issues in Hypatos Studio and the Studio API since 14:20 CET. Our Engineering team is investigating in order to have a solution soon. Sorry for the inconvenience; we will update you in 15 minutes.