Historical record of incidents for Apwide
Report: "Unexpected outage"
Last update: **After a thorough analysis, we identified that one of our components had reached the maximum number of database connections it was allowed to establish.** As traffic continued to grow, our health check system marked the component as unavailable and automatically restarted it in an attempt to restore normal operation. It took approximately 30 minutes for the component to come back online with a sufficient number of available connections to handle the load. **Remediation:** We have increased the maximum number of allowed connections for this particular component. We sincerely apologize for any inconvenience this outage may have caused. Please rest assured that we are committed to providing the best service possible.
Golive & Time Squad were unavailable this morning for a duration of approximately 30 minutes. We are currently conducting an investigation to identify the root cause.
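For readers who want to picture the failure mode described in this report, here is a minimal sketch of a capped connection pool. It is an illustration only, not Apwide's implementation: the class `BoundedPool`, the helper `served_requests`, and the limits used are hypothetical.

```python
import threading


class BoundedPool:
    """Toy connection pool capped at max_connections (hypothetical, for illustration)."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        # Each permit of the semaphore stands for one database connection.
        self._slots = threading.BoundedSemaphore(max_connections)

    def acquire(self, timeout: float = 0.0) -> bool:
        """Try to take a connection slot; False means the pool is exhausted."""
        if timeout <= 0:
            return self._slots.acquire(blocking=False)
        return self._slots.acquire(timeout=timeout)

    def release(self) -> None:
        """Return a connection slot to the pool."""
        self._slots.release()


def served_requests(pool: BoundedPool, concurrent_requests: int) -> int:
    """Count how many simultaneous requests manage to obtain a connection."""
    return sum(1 for _ in range(concurrent_requests) if pool.acquire())
```

With a cap of 2, only 2 of 5 simultaneous requests obtain a connection; raising the cap to 5 serves all of them, which is the kind of effect the remediation above aims for.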
Report: "Unexpected outage on managed database"
Last update: Golive & Time Squad were unavailable this morning for 1 hour and 10 minutes. The root cause was a problem accessing the application databases. When we contacted our hosting provider, they confirmed that network maintenance had been scheduled for this morning. Traffic was supposed to be rerouted, but a configuration issue caused a routing error. We apologize for the inconvenience caused.
Report: "Infrastructure being patched"
Last update: The infrastructure has been patched, and the issue referenced at https://public-cloud.status-ovhcloud.com/incidents/ffhr44srdcln is now fixed on our cluster. We apologize for the inconvenience it caused.
Here is the reference to the issue currently impacting our cluster: https://public-cloud.status-ovhcloud.com/incidents/ffhr44srdcln
Following our upgrade/outage of 2024-07-08, our hosting provider has to exceptionally patch our nodes. Golive/Time Squad won't be available for 30 minutes. Sorry for this inconvenience.
Report: "There is an ongoing issue with our hosting provider, which is causing our applications to be unavailable"
Last update: The issue has been fixed on the hosting provider's side. It was due to "CPU spikes on the ingress gateway" (the load balancers exposing the Golive & Time Squad applications on the web). We apologize for any inconvenience. Thank you for your patience.
We are currently experiencing an issue with our hosting provider, causing our applications to be temporarily unavailable. We apologize for any inconvenience and are actively working to resolve the issue as quickly as possible. Thank you for your patience.
Report: "Golive instabilities"
Last update: The issue appears to have been resolved. It stemmed from contention within our system responsible for communication with Jira instances. To remedy this, we have changed how calls to Jira are initiated so that they happen less frequently. We have also increased the pool size of the system where the contention occurred, so that it can handle more tasks.
A fix has been implemented and we are monitoring the results.
A potential root cause has been pinpointed, and a configuration workaround has been implemented in the production environment. We are actively monitoring its impact. Concurrently, efforts are underway to develop a permanent fix.
The platform has been stable for 5 hours, but the analysis to identify the root cause is still ongoing.
Apwide Golive is currently experiencing instabilities. Our teams are working to identify the causes.
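One common way to reduce the frequency of outgoing calls, as this report describes for calls to Jira, is a minimum-interval throttle. The sketch below is an assumption about how this could look in general, not a description of the actual fix: `MinIntervalThrottle` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Allow at most one call per min_interval_s seconds (hypothetical helper)."""

    def __init__(
        self,
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._min_interval_s = min_interval_s
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_call: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call may be made now, and record it if so."""
        now = self._clock()
        if self._last_call is not None and now - self._last_call < self._min_interval_s:
            return False  # too soon since the last allowed call
        self._last_call = now
        return True
```

A caller would check `throttle.allow()` before contacting the remote system and skip or defer the call when it returns `False`.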
Report: "Golive unavailable"
Last update: For a short period (3:52am UTC to 3:56am UTC), the Golive Tomcat thread pool was saturated by a high number of requests and Golive was not available.
In detail:
- 3:22am UTC: Tomcat thread pool consumption slowly increased (usually 10 threads out of 200 available, but increased to 18); the pool was still able to handle the load.
- 3:39am UTC: The increase in thread pool consumption accelerated.
- 3:52am UTC: No more threads available in the pool (200 busy threads); Tomcat stopped handling new requests (outage).
- 3:56am UTC: The busy threads completed their tasks, most of them were released, and processing returned to normal (10 threads out of 200).
We are sorry for the inconvenience this caused, and we are currently working on the app to mitigate the risk of this happening again.
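The figures in this timeline can be turned into pool utilization and outage duration with a few lines of arithmetic. This is a small sketch using only the numbers quoted in the report; the function names are illustrative and do not come from any Apwide or Tomcat tooling.

```python
from datetime import datetime

MAX_THREADS = 200  # pool size quoted in the report


def utilization(busy_threads: int, max_threads: int = MAX_THREADS) -> float:
    """Fraction of the thread pool that is busy."""
    return busy_threads / max_threads


def is_saturated(busy_threads: int, max_threads: int = MAX_THREADS) -> bool:
    """True when no thread is left to accept a new request."""
    return busy_threads >= max_threads


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day times given as HH:MM."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)
```

The usual load of 10 busy threads corresponds to 5% utilization, 18 threads to 9%, and 200 threads to full saturation; `minutes_between("03:52", "03:56")` gives the 4-minute outage window.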
Report: "Major outage"
Last update: This incident has been resolved.
A fix has been implemented by our hosting provider and the service is restored. We are monitoring the situation.
Our hosting provider is still working on a fix and we are actively following up. There is no estimated resolution time for the time being.
A fix is being implemented by our hosting provider.
The issue has been identified by our hosting provider and a fix is being implemented. More information: https://public-cloud.status-ovhcloud.com/incidents/6sd9lwym7zdt
We are currently investigating this issue.
Report: "Issue on Rest API"
Last update: A new patch applied to production has fixed the issue, and monitoring does not show any new occurrences.
The fix deployed to production has only partially solved the problem. Currently, the search endpoint cannot select the correct output format when the client calling the API does not specify an "accept" content type. A new fix is in progress.
A fix has been pushed to production and we are currently monitoring its impact.
Some calls to the Rest API endpoint GET /environments/search/paginated end with an HTTP 405. This can impact integrations built on the Rest API, including the Jenkins Shared Library. A fix is in progress.
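The updates in this report mention that the endpoint could not pick an output format when the client sent no "accept" content type. As a generic illustration of that situation, here is a minimal sketch of content negotiation with a default fallback. It is not Golive's code: the supported formats, the default, and the function name `choose_media_type` are all assumptions, and quality values (`q=`) are deliberately ignored to keep it short.

```python
from typing import Optional

# Hypothetical values, chosen only for the example.
SUPPORTED = ("application/json", "application/xml")
DEFAULT = "application/json"


def choose_media_type(accept_header: Optional[str]) -> Optional[str]:
    """Pick a response format; None means no acceptable format (HTTP 406)."""
    # A missing or empty Accept header means "anything is fine": use the default.
    if accept_header is None or not accept_header.strip():
        return DEFAULT
    for item in accept_header.split(","):
        # Drop parameters such as ";q=0.9" and normalise the media type.
        media_type = item.split(";", 1)[0].strip().lower()
        if media_type == "*/*":
            return DEFAULT
        if media_type in SUPPORTED:
            return media_type
    return None
```

On the client side, sending an explicit `Accept` header avoids depending on the server's fallback at all.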
Report: "Major outage"
Last update: This incident has been resolved.
After a configuration change, the issue seems to be definitively resolved.
A fix has been applied to production and seems to have resolved the issue. Environments are being monitored.
We are currently investigating this issue.
Report: "Some email notifications are not sent correctly"
Last update: The incident is now resolved. The issue was due to the upgrade of the Jira Cloud Rest API client to v3, which resulted in some email addresses not being correctly retrieved by the email sending job.
We are continuing to monitor for any further issues.
A fix has been pushed to production, and we're currently monitoring whether the issue is solved.
We've identified the potential root cause of this issue. A fix is in progress.
We're currently investigating possible issues impacting emails sent by the Golive automation engine and the watcher capability.
Report: "Partial outage on Golive Rest API"
Last update: Following an operational/infrastructure change (an update of our API gateway), the Golive Rest API was in partial outage.
Impact: The Golive frontend was up and running, but some generated Rest API tokens were considered revoked. This resulted in some failing API calls (HTTP 500).
Actions (UTC time):
- 3:00am: rolled back the operational change
- 6:32am: re-applied the change with a first attempted resolution + monitoring
- 8:00am: new occurrences identified on production
- 9:32am: pushed a fix as a second attempted resolution
- 10:30am: occurrences of the error still found on production
- 1:10pm: applied a new fix on the API gateway + monitoring
- 4pm: no more occurrences, problem seems fixed
We apologize for the inconvenience caused.
Report: "Major outage"
Last update: This incident has been resolved.
The service is back and we are monitoring it closely.
We are currently investigating this issue.
Report: "Network degradation"
Last update: Golive and Time Squad cloud were not available for 5 minutes due to a network issue at our hosting provider. More details here: https://network.status-ovhcloud.com/incidents/5mldyhd6v99c
Report: "Gadgets are not loading anymore (Golive & Time Squad Cloud)"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Major system outage"
Last update: This incident has been resolved.
The outage is due to a network misconfiguration from our hosting provider. Fix is currently in progress. Golive and Time Squad data are safe.
We are currently investigating this issue.
Report: "Golive Outage"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Golive Outage"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Golive Cloud - Partial outage"
Last update: This incident has been resolved.
Some functionalities are not working properly on the Golive Cloud App. We are working on a solution.
The issue has been identified and a fix is being implemented.
Report: "Major system outage"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The outage is due to a Load Balancer failure. Our hosting provider is working on a fix. More information: http://travaux.ovh.net/?do=details&id=50234
We are currently investigating this issue.
Report: "Major system outage"
Last update: This incident has been resolved. It was linked to an issue with our Load Balancer.
We are currently investigating this issue.
Report: "Major system outage"
Last update: The service is restored for both Golive Cloud and Time Buddy Cloud.
Backups have been restored successfully. Our hosting provider is now fixing an issue impacting our load balancer; once it is fixed, we will be able to fully restore the service.
We are proceeding with our Disaster Recovery procedures
We are proceeding with our Disaster Recovery procedures
The incident is due to a fire that damaged one of the Data Centers of our hosting company. More information: http://travaux.ovh.net/?do=details&id=49471 We are activating our Disaster Recovery Plan and will update you on the progress.
The incident is due to a fire that damaged one of the Data Centers of our hosting company. More information: http://travaux.ovh.net/?do=details&id=49471 We are activating our Disaster Recovery Plan and will update you on the progress.
We are currently investigating this issue.
Report: "Golive Cloud - Major system outage"
Last update: The service is back.
We are currently investigating this issue.
We are currently investigating this issue.
Report: "Partial Outage"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are investigating the issue
Report: "Major Outage"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating the issue
Report: "Golive Cloud - Major system outage"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Golive Cloud - Major system outage"
Last update: All Golive Cloud services are now back to normal.
We apologize for the inconvenience caused by this first major outage of Apwide Golive Cloud.

**Root cause**
The outage was caused by a failure of the middleware managed by our hosting provider.

**What we have done to restore the service**
We first safely transferred all customer data to an alternate data center. We then deployed our application stack to fully restore the service for our customers.

**What we have learned from this incident**
* low-level infrastructure or middleware failures happen and may happen again in the future
* monitoring of our services works well: we were instantly aware of the incident
* we are able to rebuild our production infrastructure from scratch, which means that our disaster recovery procedure (DRP) is fully operational

**What we will improve for the future**
* we will improve our DRP in order to reduce the outage duration if we have to switch from one data center to another again
* we will better integrate Status Page to improve communication with our customers about the status of our services

Thank you for reading this postmortem and for trusting Apwide Golive. We are at your disposal to answer your [questions](https://jira.apwide.com/servicedesk).

Enjoy your day,
Kind Regards,
Guillaume Vial / David Berclaz
CEOs
A fix has been implemented and the service is up, except email notifications. We are now monitoring the platform.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
The root cause has been identified, and we are working on resolving the problem.