Historical record of incidents for Cronofy
Report: "Google Sync errors"
Last update: We are seeing an elevated number of errors from Google and we are investigating.
Report: "Increased error rate communicating with Outlook.com calendars"
Last update: We've deployed a mitigation which allows users connecting Outlook.com accounts for the first time (or who had connected for the first time earlier during this incident) to begin syncing successfully. We've still not seen acknowledgement of, or action on, the underlying issue from Microsoft. Due to our mitigations, calendar sync is working for all users, and the only issue affecting users in practice as a result of this incident is as previously mentioned:
- We are unable to pick up changes to affected users' calendar lists and may have an outdated view of secondary calendars as a result.
There has been no significant change in the behaviour seen on Microsoft's API, but we are testing a change to our calendar listing mechanism which may allow us to work around Microsoft's erroring behaviour entirely and resolve this issue completely. We will update as we make progress; for now the scenario is the same as in the previous update:
- Calendar event syncing is working for Outlook.com calendars we were previously aware of.
- We are unable to pick up changes to affected users' calendar lists and may have an outdated view of secondary calendars as a result.
- Users connecting Outlook.com calendar accounts to Cronofy for the first time may have issues getting successfully connected.
Our workaround has been successful, and has allowed calendar sync to resume for the affected users. We continue to observe the elevated error rate on Microsoft's API, and will not be able to fully resolve the issue until this underlying cause is fixed. During this time:
- We are unable to pick up changes to affected users' calendar lists and may have an outdated view of secondary calendars as a result.
- Users connecting Outlook.com calendar accounts to Cronofy for the first time may have issues getting successfully connected.
While Microsoft hasn't acknowledged the issue yet, this has been raised by other integrators, for example:
- https://learn.microsoft.com/en-us/answers/questions/2279133/getting-500-response-errors-to-get-me-calendars-al
- https://github.com/microsoftgraph/microsoft-graph-explorer-v4/issues/3861
We will continue to monitor, and shall update when there are notable changes, or at least by 9AM UTC tomorrow, June 3rd.
We've now fully rolled out our workaround and are monitoring the results as we work through our backlog of earlier failures. Since we observed errors specifically when attempting to list calendars, as a temporary measure we are allowing the rest of the sync process to continue based on our latest copy of the user's calendar list. This allows calendar event data to resume syncing and avoids us having stale availability information for the affected users. This is not a full resolution since it does mean that if affected users create or edit secondary calendars, we won't pick up the newly added calendars or new calendar names.
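To illustrate the shape of this temporary measure, here is a minimal sketch, assuming a provider client and a cached copy of the calendar list exist; the names below are hypothetical and this is not Cronofy's actual sync code. It falls back to the cached calendar list when the provider's list-calendars call fails, so the rest of event sync can continue.

```python
from typing import Callable, List


class ProviderServerError(Exception):
    """Stand-in for a 5XX response from the provider's list-calendars endpoint."""


def calendars_to_sync(
    list_calendars: Callable[[], List[str]],    # live call to the provider
    cached_calendars: Callable[[], List[str]],  # our latest stored copy of the list
) -> List[str]:
    """Prefer a fresh calendar list, falling back to the cached copy on errors.

    Trade-off of the fallback: newly created or renamed secondary calendars
    are not picked up until the provider's list endpoint recovers.
    """
    try:
        return list_calendars()
    except ProviderServerError:
        return cached_calendars()  # temporary measure: use the last known list
```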
We've found that we are seeing errors only from a particular API call which we use to list calendars, and have seen success with a limited rollout of a workaround that allows the rest of the sync process to continue. We are watching its progress to build confidence before rolling it out further.
We've found that approximately one third of Outlook.com calendars are being affected. We do not believe a change on our side has caused the elevated errors, but nor has Microsoft acknowledged an incident on their side. We are continuing to investigate the affected cohort, and which API calls are affected, in case there is any workaround we can implement on our side.
We've identified that a cohort of Outlook.com profiles are seeing an increased error rate when we communicate with Microsoft's APIs, causing sync failures for those affected. We're investigating the root cause and will update this incident as we understand more.
Report: "Increased error rate communicating with Outlook.com calendars"
Last updateWe've deployed a mitigation which allows for users connecting Outlook.com accounts for the first time (or who had connected for the first time earlier during this incident) to begin syncing successfully.We've still not seen acknowledgement or action from Microsoft of the underlying issue.Due to our mitigations, calendar sync is working for all users, and the only issue affecting users as a result of this incident in practice is as previously mentioned:- We are unable to pick up changes to affected users' calendar lists and may have an outdated view of secondary calendars as a result.
There has been no significant change in behaviour seen on Microsoft's API, but we are testing a change to our calendar listing mechanism which may allow us to completely work around Microsoft's erroring behaviour, which would resolve this issue completely.We will update as we have progress, for now the scenario is the same as the previous update:- Calendar event syncing is working for Outlook.com calendars we were previously aware of.- We are unable to pick up changes to affected users' calendar lists and may have an outdated view of secondary calendars as a result.- Users connecting Outlook.com calendar accounts to Cronofy for the first time may have issues getting successfully connected.
Our workaround has been successful, and has allowed calendar sync to resume for the affected users. We continue to observe the elevated error rate on Microsoft's API, and will not be able to fully resolve the issue until this underlying cause is fixed.During this time:We are unable to pick up changes to affected users' calendar lists and may have an outdated view of secondary calendars as a result.Users connecting Outlook.com calendar accounts to Cronofy for the first time may have issues getting successfully connected.While Microsoft hasn't acknowledged the issue yet, this has been raised by other integrators, for example: https://learn.microsoft.com/en-us/answers/questions/2279133/getting-500-response-errors-to-get-me-calendars-al https://github.com/microsoftgraph/microsoft-graph-explorer-v4/issues/3861We will continue to monitor, and shall update when there are notable changes or at least by 9AM UTC tomorrow, June 3rd.
We've now fully rolled out our workaround and are monitoring the results as we work through our backlog of earlier failures.Since we observed errors specifically when attempting to list calendars, as a temporary measure we are allowing the rest of the sync process to continue based on our latest copy of the user's calendar list. This allows calendar event data to resume syncing and avoids us having stale availability information for the affected users.This is not a full resolution since it does mean that if affected users create or edit secondary calendars, we won't pick up the newly added calendars or new calendar names.
We've found that we are seeing errors only from a particular API call which we use to list calendars, and have seen success with a limited rollout of a workaround that allows the rest of the sync process to continue. We are watching its progress and increasing our confidence in order to roll this out further.
We've found that approximately one third of Outlook.com calendars are being affected. We do not believe a change our side has caused the elevated errors, but nor has Microsoft acknowledged an incident on their side.We are continuing to investigate the affected cohort, and which API calls are affected, in case there is any workaround we can implement on our side.
We've identified that a cohort of Outlook.com profiles are seeing an increased error rate when we communicate with Microsoft's APIs, causing sync failures for those affected.We're investigating the root cause and will update this incident as we understand more.
Report: "Public Link Issues"
Last update: Between 08:43 and 13:23 UTC on Thursday May 22nd 2025, attempts to book times via the Public Links feature of the Scheduler failed; visitors to Public Links would have instead seen an erroneous message that the Public Link was disabled. The root cause was the development of a new feature on top of Public Links, where normal booking flows were erroneously interpreted as making use of the new feature, and then failed a validation check that should not have been applied. As a result, 23 attempted bookings were not accepted. During the incident, we fixed the root cause with a patch. Following the incident, we contacted all affected owners of the impacted Public Links.

## Timeline

All times are on May 22nd 2025.

* 08:43 UTC - A change is merged which causes Public Link bookings to fail with a “disabled” message
* 12:50 UTC - The issue is raised internally as a result of our own testing of the product
* 12:57 UTC - The problematic code is identified, but we are unable to confidently issue a simple rollback since other changes to the area had since been introduced
* 13:18 UTC - A fix to the code is written and reviewed
* 13:23 UTC - The fix is approved and merged
* 13:30-16:00 UTC - Work is undertaken to identify all affected Public Links
* 16:29 UTC - All affected customers notified

## Retrospective

We always ask the questions:

* Could the issue have been resolved sooner?
* Could the issue have been identified sooner?
* Could the issue have been prevented?

In this case, we feel our resolution time was reasonably good. In hindsight, we may have been able to release the fix around 15 minutes faster had we acted more confidently to implement a patch concurrently with the internal debate and investigation over the ability to roll back.

We are not happy with our speed of identification, nor that this was an issue that could have been prevented quite easily.

On identification, the issue was live for around 4 hours before being noticed internally. The failure in this case manifested in users being routed to an otherwise normal “disabled” page, and bookings not being made. It’s harder to alert on things _not_ happening, given the usage rate of this feature and the natural peaks and troughs of daily activity. We did see an area for improvement in that we were using a normal “disabled” page as a catch-all for a few other error cases; by adding telemetry around these different cases, we can positively identify unusual behaviour separately from the normal “disabled” case. We also saw possibilities to improve our playbook for monitoring usage of new features. In this case, had we configured a trigger with a shorter duration and period, we would have seen unexpected activity for the unreleased new feature. This would have led us to investigate and notice the issue sooner.

Prevention is the place with the clearest room for improvement. We had focused our manual testing on the new feature being developed, and failed to test the vanilla case of Public Link bookings, which was touched by the code changes. We use both automated and manual tests during feature development. The nature of the underlying issue, and the necessary interaction of multiple steps of the booking flow to cause it, made it less likely for our automated test suite to reasonably catch it. However, we failed to catch the erroneous code at code review stage, and failed to manually test the critical path adjacent to the changes being made.
## Actions

We are going to more strictly manually test critical paths in our system when making changes adjacent to those areas, to ensure there aren’t unintended side effects from in-development features.

We are going to add increased monitoring of the error cases in the affected area, before they are sent to any fallback pages, so that anomalous activity and behaviour is positively identifiable and triggers a visible alert.

We are going to review our playbook for new feature usage telemetry to add better guidance for engineers to set up triggers that are more visible by default.
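As a rough illustration of the monitoring action above, the sketch below tags each distinct error case with its own metric before falling back to the shared “disabled” page, so anomalous cases can be alerted on separately. All names (`PublicLinkError`, `emit_metric`, `render_fallback`) are hypothetical, not Cronofy's actual code.

```python
from enum import Enum


class PublicLinkError(Enum):
    DISABLED = "disabled"                    # the genuinely disabled case
    VALIDATION_FAILED = "validation_failed"  # e.g. a check from an in-development feature
    NOT_FOUND = "not_found"


def emit_metric(name: str, tags: dict) -> None:
    """Stand-in for a real telemetry client (StatsD, CloudWatch, etc.)."""
    print(f"metric={name} tags={tags}")


def render_fallback(error: PublicLinkError) -> str:
    # Record *which* error case occurred before showing the catch-all page,
    # so unusual behaviour is positively identifiable and can trigger an alert.
    emit_metric("public_link.fallback_shown", {"reason": error.value})
    return "disabled_page"
```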
We have reverted the change and confirmed that Public Links are now creating events as expected. A postmortem of the incident will take place and be attached to this incident within the next 3 working days. If you have any queries in the interim, please contact us at support@cronofy.com
We have reverted the broken code; it is deploying at present, and we are monitoring to find out how many requests may have been affected.
We have identified that a recent change has caused Public Links to error - we are reverting this imminently.
Report: "Public Link Issues"
Last updateWe have reverted the change and confirmed that Public Links are now creating events as expected.A postmortem of the incident will take place and be attached to this incident the next 3 working days.If you have any queries in the interim, please contact us at support@cronofy.com
We have reverted the broken code, it is deploying at present and we are monitoring to find out how many requests may have been affected.
We have identified a recent change has made public links error - we are reverting this imminently.
Report: "Small number of 5XX errors in US"
Last update: Between 17:25 and 18:25 UTC on Tuesday April 16th 2025, some Availability Query requests were made that provoked an issue with the Availability calculation. This caused resource usage to balloon to the point where some servers ran out of memory. During this time a small number (< 1%) of requests to our US data center received an HTTP 500 response rather than being processed correctly.

The root cause was an issue where some requests to the Availability Engine attempted to return values several orders of magnitude larger than intended, requiring significant resources to process over a period of 20 minutes. The subset of servers handling these requests eventually exhausted their available resources and began to fail. This led all in-flight requests on those servers to receive an HTTP 500 status. Less than 1% of traffic was impacted between 17:25 and 18:25 UTC.

During the incident, we identified and patched the root cause. We have also held a retrospective and decided on further actions to take to harden these paths and their monitoring, to prevent similar issues and make them easier to spot.

## Timeline

17:25 UTC - A customer makes some Availability Queries that will go on to cause the memory exhaustion.
17:59 UTC - Several servers run out of memory and are killed.
18:03 UTC - On-call engineers are paged as the number of HTTP 5XX responses increases.
18:05 UTC - Engineers begin investigating, finding little that stands out as traffic and load patterns seem usual.
18:23 UTC - Engineers spot that a small number of servers have unusually high memory usage approaching the limit and these servers are replaced.
18:31 UTC - Memory limits are temporarily increased to see if that eases pressure. It does not.
18:47 UTC - The customer stops making the Availability Query calls that are driving the issue. While normal service resumes quickly, we decide to keep the incident open until we understand and mitigate the root cause.
19:11 UTC - After investigating several areas of the infrastructure configuration, the Availability Query calls driving the memory consumption are spotted.
19:57 UTC - A fix has been written and is reviewed.
20:03 UTC - Fix is approved and merged.
20:22 UTC - After confirming all systems are operating normally, the incident is resolved.

## Retrospective

We always ask the questions:

* Could the issue have been resolved sooner?
* Could the issue have been identified sooner?
* Could the issue have been prevented?

In this case, we feel that our times to resolve and identify were fairly good given that a symptom of the issue was that a small number of logs and metrics were missing due to the killed servers. Counter-intuitively, a larger scale issue would have been easier to spot, but this one was subtle enough that it took some time to uncover. Conversely, had the user who incidentally triggered the bug been malicious, this could have led to an effective denial of service attack on our system. We have reviewed the steps we would have taken and believe we would have been in a position to mitigate this behavior as well, but see an opportunity for clearer internal playbooks on the topic.

When considering how long it took us to identify the root cause, it took some time to uncover that servers were being killed due to running out of memory. This is not a trivial thing to spot and record, but we are going to look at how we can capture and alert from this data. This would have saved us a few minutes of waiting and watching as the next generation of servers hit the same issue.
Finally, prevention is the clearest area we can improve here. The issue was hard but not impossible to spot, and we are now much stricter on the values allowed in the impacted area of the system. We have also identified several other spots where our code can be more defensive to avoid similar incidents in the future.

## Actions

Our Site Reliability Engineers will be looking at how we can track and monitor Out Of Memory errors, while our Product Engineers will be hardening the impacted parts of the Availability Engine to limit the scale of processing without impacting existing behavior.

## Further questions?

If you have any further questions, please contact us at [support@cronofy.com](mailto:support@cronofy.com)
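As a hypothetical sketch of what "being stricter on the values allowed" and "limiting the scale of processing" could look like, the example below clamps the requested window and caps the number of slots a single availability query may generate; the limits and names are illustrative assumptions, not Cronofy's real implementation.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

MAX_QUERY_WINDOW = timedelta(days=35)  # assumed bound on the requested period
MAX_SLOTS = 5_000                      # assumed cap on generated slots per query


def generate_slots(
    window_start: datetime, window_end: datetime, slot_length: timedelta
) -> List[Tuple[datetime, datetime]]:
    """Generate candidate slots while refusing requests that would balloon in size."""
    if slot_length <= timedelta(0):
        raise ValueError("slot length must be positive")
    if window_end - window_start > MAX_QUERY_WINDOW:
        raise ValueError("availability query window too large")

    slots: List[Tuple[datetime, datetime]] = []
    cursor = window_start
    while cursor + slot_length <= window_end:
        slots.append((cursor, cursor + slot_length))
        if len(slots) > MAX_SLOTS:
            raise ValueError("availability query would generate too many slots")
        cursor += slot_length
    return slots
```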
We have deployed a mitigation to prevent the same issue from arising. Normal operations have continued and we are marking this incident as resolved. A postmortem of the incident will take place and be attached to this incident in the next 48 hours. If you have any queries in the interim, please contact us at support@cronofy.com
We have identified the root cause of the issue. Normal operations have resumed. We will continue to monitor and will be carrying out further work to prevent a repeat.
We have modified our configuration and seen a reduction in the number of errors to near zero. We are still investigating the cause of the issue.
We are seeing a small number of 5XX errors being returned from our US data center. We are investigating and will update shortly.
Report: "Degraded sync performance for Microsoft 365"
Last update: Between 20:40 and 21:40 UTC we received responses from Microsoft 365, via both EWS and the Graph API, telling us that the credentials we held were invalid. Microsoft are reporting on this as incident MO1020913. As this issue lasted longer than our quarantine period of 20 minutes, which is designed to ignore temporary and/or one-off issues, we considered the credentials we held for accounts attempting to sync during this period as truly invalid and so sent relink request emails to them. Users that received such emails will need to go through the relink process as instructed by the email to fully reinstate their calendar synchronization.
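As a rough sketch of how such a quarantine window can work (hypothetical names; not Cronofy's actual implementation), credentials are only treated as truly invalid, triggering a relink email, once authentication failures have persisted beyond the window, so short-lived provider issues are ignored:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

QUARANTINE_PERIOD = timedelta(minutes=20)  # ignore temporary and/or one-off issues


@dataclass
class Profile:
    first_auth_failure_at: Optional[datetime] = None  # None while credentials look healthy


def should_invalidate(profile: Profile, now: datetime) -> bool:
    """Return True once auth failures have outlasted the quarantine period."""
    if profile.first_auth_failure_at is None:
        profile.first_auth_failure_at = now  # start the quarantine clock
        return False
    # Only send a relink request once the failures have persisted past the window.
    return now - profile.first_auth_failure_at > QUARANTINE_PERIOD
```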
We are continuing to monitor; our telemetry shows that sync operations to Microsoft 365 returned to normal levels by 21:40 UTC. Because of this incident, some users may have erroneously had their credentials invalidated and so will need to relink their accounts. Microsoft are reporting on this as incident "MO1020913".
We are still investigating, but we have also seen sync operations begin to return to normal failure levels. The majority of syncs are now successful.
We are currently investigating sync failures to Microsoft 365. The majority of syncs are resulting in failure, and both Exchange Web Services and the Graph API are affected.
Report: "Application Impersonation retired by Microsoft 365 - Action is required if you still use Exchange Web Services"
Last update: If you have a service account connection through Microsoft 365 and have not completed the migration from Exchange Web Services (EWS) to the Graph API, then we will soon no longer be able to sync your calendars, and this may already be the case. Follow our guide to complete the migration as soon as possible: https://docs.cronofy.com/calendar-admins/faqs/ews-migration/
From 07:20 UTC today, we saw an increase in errors when performing EWS sync operations with Microsoft 365. The initial increase in errors tailed off as we stopped attempting to perform syncs which were no longer allowed. At 13:20 UTC we began sending additional emails to customers who have yet to complete the migration from EWS to the Graph API. Again, we urge any customers who have not completed the migration from EWS to the Graph API to do so as soon as possible.
We are continuing to monitor the situation. We are also sending additional emails to those who have yet to complete the migration from EWS to Graph API.
We are continuing to monitor the situation and have no further updates at this time
We strongly believe the errors we are seeing are related to ApplicationImpersonation being retired by Microsoft 365. For more information and to restore connections please follow the instructions on our docs site - https://docs.cronofy.com/calendar-admins/faqs/ews-migration/
We are continuing to monitor. The errors our telemetry shows are very gradually reducing. We will reduce the frequency of updates as we continue to monitor.
We are continuing to monitor for any further issues.
Metrics continue to trend in the right direction but we're still seeing a very small number of errors being returned from Microsoft 365. We are continuing to monitor.
The number of EWS sync operation errors with Microsoft 365 continues to decrease, but they are not quite back to usual levels. We are continuing to monitor.
Our telemetry shows that the number of errors is decreasing but has not returned to normal levels, and that Microsoft Graph API syncs are operating as normal; this is only impacting EWS sync operations. Other background processing tasks are not impacted by this incident.
We have identified an issue with calendar sync operations when connecting to Microsoft 365. We are seeing a higher error rate than usual, but most sync operations are still successful.
Report: "Increased error rate in US"
Last update: On Wednesday January 29th between 10:00 and 11:09 UTC our US data center experienced degraded performance. This was caused by multiple concurrent delete operations on a heavily used database table, which in turn resulted in slower than usual responses and some failures. The majority of API traffic was unaffected, but a small percentage of calls would have resulted in an HTTP 500 response. Further details, lessons learned, and further actions we will be taking can be found below.

## Timeline

_All times rounded for clarity and UTC_

On Wednesday January 29th at 10:00 we began processing a large number of data deletion jobs in line with GDPR compliance. Due to the much higher than usual number of processes of this type, our database system soon began to struggle with the number of deletions it was being asked to perform. In particular, the CPU usage was very high.

At 10:08 the first alarm for high CPU usage alerted our engineering team, and they began to investigate the cause. Additional alarms followed, notifying the team of failed jobs, 5xx responses and slower API response times.

Starting at 10:25, the database was under so much load that some new connections were refused. During this time our monitoring systems also show our slowest API responses. This was cleared by 10:38.

From 10:47 to 10:57 we saw response times increase again, but not as severely as during the earlier window. This time we didn’t see any refused database connections.

From 11:00 activity on our database had reduced significantly, and by 11:09 our database activity had returned to the usual levels.

## Retrospective

The questions we ask ourselves in an incident retrospective are:

* Could it have been identified sooner?
* Could it have been resolved sooner?
* Could it have been prevented?

We look for holistic improvements alongside targeted ones.

**Could it have been identified sooner?** No. We feel that during this incident we were quick to respond to the alarms we received and to identify the cause.

**Could it have been resolved sooner?** Possibly. We were a little hesitant to completely stop the large number of processes from being performed as they related to compliance. However, we have identified improvements we can make to our incident playbook that could have helped us to determine how many remaining tasks there were, as in other scenarios direct intervention would have been required.

**Could it have been prevented?** We use rate limits throughout Cronofy to provide a robust service. However, they were not applied in this area, which was also a resource-intensive task. We have already applied rate limiting to spread the load out and prevent a repeat.

## Actions

As mentioned, we have already applied rate limiting to the GDPR processes to spread out the load within sensible bounds.

We are going to update our incident playbook to highlight where statistics around the remaining number of GDPR tasks can be found. Had this already been in place we would have been able to determine how much longer the incident was going to last.

We’re also going to perform an audit of our other background jobs to determine whether there are other areas of our system that lack rate limits on the number of concurrent jobs.

## Further questions?

If you have any further questions, please contact us at [support@cronofy.com](mailto:support@cronofy.com)
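As a hypothetical sketch of the rate-limiting action described above (illustrative names and limits, not Cronofy's actual job system), a bounded worker pool ensures only a handful of deletion jobs hit the database at once, spreading the load rather than running them all concurrently:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

MAX_CONCURRENT_DELETIONS = 4  # assumed bound, tuned to what the database tolerates


def delete_subject_data(subject_id: str) -> None:
    """Stand-in for a single GDPR deletion job against the heavily used table."""
    print(f"deleting data for {subject_id}")


def process_deletion_backlog(subject_ids: List[str]) -> None:
    # The pool size acts as the rate limit: at most MAX_CONCURRENT_DELETIONS
    # deletion jobs run against the database at any one time.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_DELETIONS) as pool:
        list(pool.map(delete_subject_data, subject_ids))
```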
Normal operation has resumed and been consistent for the last 30 minutes, so this incident is resolved. About 1.75% of API calls between 10:08 and 10:38 received an HTTP 500 response due to database connections being refused. A combination of high-load tasks, both from API calls and internal DB processes, all happened at once. This caused the database to degrade in performance and refuse some connections. Attempts to manually kill some of the processes did not succeed, for reasons we will investigate. Once these jobs completed, performance recovered and normal operation resumed. Jobs which had failed and retried have since completed.
Database load is back to normal and errors have ceased. We are continuing to monitor to ensure normal operation has resumed.
We are still seeing occasional errors via the API. These are now much less common and in smaller numbers, but not back to zero yet. The underlying cause is above average load on our database, which we are working to resolve.
Error rates have returned to normal; we are continuing to monitor.
We're investigating an increase in API calls resulting in HTTP 500 errors
Report: "US data center scaling problems"
Last update: On Wednesday December 4th between 13:30 and 19:40 UTC our US data center experienced a prolonged period of degraded performance, primarily impacting background processing. This meant that operations such as synchronizing schedules, which usually commence within seconds, were delayed, at times, by a minute or more.

This was caused by a degradation in service of an AWS managed service that is vital to our ability to scale our capacity to match demand. We are awaiting an RCA from AWS around this, but don't feel it is overly material to our own postmortem as we've been told we could not have resolved the underlying problem ourselves. _We will update this postmortem as necessary once the RCA has been received from AWS._

Further details, lessons learned, and further actions we will be taking can be found below.

## Timeline

_All times rounded for clarity and UTC_

On Wednesday December 4th at 10:50 an additional permissions policy was added to the DE, US, and non-production Amazon Elastic Kubernetes Service (EKS) clusters in those environments. These three environments are older than the others, and so they lacked some permissions the newer environments had inherited by default. This was not affecting the operation of any of the data centers, but we wanted to bring their configuration in line after noticing the difference. These changes were applied successfully and everything appeared to operate as normal.

More than two hours later at 13:20 we found the first signs of there being an issue within our US data center. _Without the RCA from AWS, we are assuming the configuration change is somehow related, but the fact that it only affected one of the three altered environments casts some doubt on that._

EKS provides the control plane of Kubernetes, with the nodes from the worker pool communicating with it to coordinate the distribution of work and scaling activity. At this time, the nodes and processes running on them stopped being able to communicate with EKS as usual. This meant that processes responsible for triggering deployments to scale up could not do so, and that other processes that relied on obtaining leases from the control plane to elect a leader could not. Most crucially it meant that newly provisioned nodes could not fully join the cluster, as they could not provision their networking stack and signal themselves as ready to run other processes.

The first notification we received around the issue came at 13:40, and the first alert at 13:50. At 14:00 we attempted to increase the capacity of the background processes to provide headroom as things were not scaling dynamically, but this was unsuccessful due to the underlying issue. At 14:15 we applied the scaling change more directly to provision as much capacity as we could. We also attempted to add more compute capacity by adding more nodes to the cluster, but as they were unable to fully register themselves we were stuck with the capacity we had.

For context, on the previous Wednesday background processing fluctuated between 30 and 100 replicas during this period, with the servers in the cluster also fluctuating to provide the capacity for those replicas to run. As the issue began we were at 50 replicas; with direct intervention we were able to get to 70 replicas. The gap between necessary capacity and peak capacity is the source of the performance degradation. We were able to process all tasks successfully but did not have the throughput available to keep up with spikes in load as they arrived.
As we had made a change to multiple EKS clusters earlier in the day, we looked for signs of similar behavior in other environments, including those unchanged, but did not find any. An incident being outside of our control, without it being part of a wider outage in a given AWS service or region, is historically rare. We spent the next two hours on activities such as undoing and redoing the change from earlier that day, manually comparing the configuration of multiple environments in case of some other drift, and such like.

At 16:00 we came together to review the situation. As part of this we realized this may have become noticeable to users and that the situation was likely to become worse over the next hour, as 17:00 is usually the time of peak load for our US data center. At 16:10 we opened this incident on our status page. We decided to try and add capacity to our US data center by provisioning a new EKS cluster and working out how to scale up capacity there.

With our own diagnostic paths exhausted and our options going forward limited, we opened a ticket with AWS support at 16:55 whilst we worked on provisioning a sibling cluster. At 17:25 AWS support requested permission to review logs, which we granted. Work continued to provision a new EKS cluster into which we could successfully register new nodes. Work then switched to how and what we would need to deploy into the second cluster to get something that would function without causing more issues than it would solve.

At 18:45 we realized we had heard nothing from AWS for over an hour and so initiated a chat session. After some back and forth we had confirmation that it was being investigated at 19:09. At 19:13 we were asked to check how the cluster looked and there were signs of improvement, but still issues. At 19:26 the AWS agent joined our conference call and helped us triage lingering issues. At 19:30 we'd been able to add additional nodes to the cluster, which meant we could deal with the background processing backlog. By 19:35 all issues within the cluster had been resolved and it further scaled up through the automatic mechanisms. We continued to monitor whilst reverting changes made to provide as good a service as possible throughout the incident, before returning fully to our usual configuration around an hour later.

## Retrospective

The questions we ask ourselves in an incident retrospective are:

* Could it have been identified sooner?
* Could it have been resolved sooner?
* Could it have been prevented?

Also, we don't want to focus too heavily on the specifics of an individual incident, and instead look for holistic improvements alongside targeted ones.

**Could it have been identified sooner?** Yes. Whilst we received alerts, they were slower than we would like and pointed towards symptoms of the issue rather than the issue itself.

**Could it have been resolved sooner?** Absolutely. With AWS having to resolve it, opening a ticket with them much sooner would have helped. We also suspect opening a chat with them rather than an email ticket may have influenced the speed of response.

**Could it have been prevented?** From our actions, we don’t believe so. The bargain you make with managed services is that if you use them correctly, they’ll work. To our current knowledge, AWS failed on their side of the bargain, which we can’t prevent without extremely significant overhead day-to-day.

## Actions

This incident uncovered a flavor of infrastructure failure that is not sufficiently covered by our alerting.
We’ll be reviewing and improving alerts within this area of our stack to guide our future selves more rapidly to the root cause of similar issues in future.

On a similar note, we found our diagnostic tools to be weak in this area, and relied on ad-hoc knowledge more than we would like. In concert with the review of alerts, we’ll improve our playbooks and scripts for assessing the situation when such alerts are triggered.

Finally, we are documenting guidance around when and how support tickets should be raised with AWS to reduce the number of ad-hoc decisions we have to make on this front.

## Further questions?

If you have any further questions, please contact us at [support@cronofy.com](mailto:support@cronofy.com)
Our US data center experienced degraded performance today, primarily impacting background processing. There was an issue with the Amazon Elastic Kubernetes Service (EKS) cluster which affected our ability to scale capacity with demand. This was resolved by AWS intervening to resolve an issue with our EKS cluster. We will be receiving an RCA from AWS relating to this. Once we receive AWS's RCA, we will follow our own process to publish a postmortem of this incident. If you have any queries in the interim, please contact us at support@cronofy.com
A fix has been implemented in conjunction with AWS support and our ability to scale our resources has returned. Capacity has since increased to match demand and service levels appear to have returned to normal. We are continuing to monitor to ensure service remains stable.
We are working through an issue which is preventing our US data center from scaling as normal. Performance of background processing has been degraded somewhat through the past hour with processing taking up to 60 seconds to be started rather than being near instant. We are in contact with AWS to resolve the underlying problem.
We are working through an issue which is preventing our US data center from scaling as normal. Performance of background processing was degrading somewhat through our peak time around 17:00 UTC with processing taking up to 60 seconds to be started rather than being near instant. As we are now past that peak, we anticipate any impact on background processing to be reduced but still possible until scaling operations return to normal.
We are working through an issue which is preventing our US data center from scaling as normal. Performance may be degraded as we approach our peak time (around 17:00 UTC) until this is resolved. This will primarily affect background processing times, with synchronization potentially taking longer than usual.
Report: "Degraded Sync performance for Microsoft 365"
Last update: We have seen a significant improvement in the reliability of calls to Microsoft 365. This has now been close to, or within, the bounds of usual behavior for around an hour. This aligns with Microsoft's communication against incident MO941162. Microsoft are not saying it is fully resolved but that they have mitigations in place, and are continuing work to fully resolve the problem. As the impact of this has been minimal for a significant period, we are resolving this incident. We will keep a closer eye on this area until Microsoft call their incident resolved as well.
We have seen some improvement in the reliability of calls to Microsoft 365, but are far from normal service. We remain past the busiest point of the day for Cronofy's services which means we're in a more stable position overall. We are continuing to monitor our systems and Microsoft incident MO941162. Microsoft are continuing to work on the underlying issue and sync tasks continue to retry automatically. We will provide a further update by 22:30 UTC.
We have seen little improvement in the reliability of calls to Microsoft 365. However, we are now well past the busiest point of the day for Cronofy's services, which means we're in a more stable position overall. We are continuing to monitor our systems and Microsoft incident MO941162. Microsoft are continuing to work on the underlying issue and sync tasks continue to retry automatically. We will provide a further update by 21:30 UTC.
We have seen little further improvement in the reliability of calls to Microsoft 365. We are continuing to monitor our systems and Microsoft incident MO941162. Sync tasks continue to retry automatically. We will provide a further update by 20:30 UTC.
Many calls to Microsoft 365 are still failing. We have seen some improvement over the past hour but are still far from a full recovery. Synchronization of affected calendars will be degraded with failures retried automatically as required. Microsoft are reporting on this as incident MO941162, with fixes in progress but currently no timeline to resolution. We are continuing to monitor and mitigate where we can, and will provide another update by 19:30 UTC.
Many Graph API calls to Microsoft 365 are still timing out and we have yet to see any improvement. We are continuing to monitor the situation and mitigate any impact on operations. We will update again at 18:30 UTC.
We have been tracking a degradation in MS Graph connections since approximately 07:30 UTC today. While most operations have been succeeding in short order, this has worsened to the point where sync is starting to be noticeably degraded for some accounts. Microsoft are reporting on this as incident "MO941162". We are continuing to monitor and mitigate where we can, and will provide another update at 17:30 UTC.
Report: "US data center performance degradation"
Last update: On Wednesday, October 2nd between 00:56 and 01:04 UTC, an increasing number of requests to [app.cronofy.com](http://app.cronofy.com) and [api.cronofy.com](http://api.cronofy.com) failed entirely or timed out while being processed. The root cause was our primary database being unable to process requests in a timely fashion. The subsequent back pressure then caused dependent services to time out, resulting in an outage.

## Timeline

`00:56` - Primary database begins showing signs of congestion.
`00:57` - Timeouts start being reported by monitoring.
`00:59` - Initial alerting thresholds are breached. On-call engineer is notified.
`01:00` - Investigation begins. [app.cronofy.com](http://app.cronofy.com) and [api.cronofy.com](http://api.cronofy.com) return timeout statuses.
`01:01` - Additional alert thresholds for API response and HTTP status breached.
`01:04` - Confirmation of performance degradation.
`01:04` - Database congestion clears.
`01:04` - Last timeout statuses are returned. [app.cronofy.com](http://app.cronofy.com) and [api.cronofy.com](http://api.cronofy.com) return healthy statuses.
`01:05` - Rate limits hit for some clients as failed requests are retried in bulk.
`01:09` - Engineer confirms resumption of service.
`01:10 - 02:40` - Investigation and monitoring.

## Retrospective

We ask three primary questions in our retrospective:

* Could we have resolved it sooner?
* Could we have identified it sooner?
* Could we have prevented it?

While this issue resolved itself before engineer intervention, it could arguably have done so sooner. The initial identification of issues routing requests from our load balancers to our servers, while correct, has ultimately proven to be a symptom rather than the root cause and, though our services did effectively self-heal, we have identified areas for improvement that should enable them to avoid the need to do so in future. This incident has also highlighted some gaps in our monitoring that would have enabled us to take action before the point at which timeouts began to be returned, and would have made identifying the root cause a simpler task.

### Actions

We’re going to be spending some time re-working and improving our database monitoring to address the areas we’ve identified. They’re largely 1-in-a-million events but, when processing the number of events we do, that’s more frequent than we feel is acceptable.

We’ll be adding additional telemetry and improving our handling of database statements. This is to enable us to notice negative trends in performance well in advance of them becoming an issue, and to be even more robust in how we handle them.

We’re adding a new section to our playbook to cover additional actions for similar scenarios, to aid in speeding up our response.
US data center performance has remained normal and the incident is resolved. Around 00:56 UTC inbound traffic to api.cronofy.com and app.cronofy.com began to show signs of performance degradation. This was observed to be an issue routing traffic from our load balancers to their respective target groups and on to our servers. This resulted in an increase in processing time which, in turn, resulted in some requests timing out. By 01:04 UTC the issue with the load balancers routing traffic had been resolved and traffic flow returned to usual levels. A small backlog of requests was worked through by 01:10 UTC and normal operations resumed. A postmortem of the incident will take place and be attached to this incident in the next 48 hours. If you have any queries in the interim, please contact us at support@cronofy.com.
We're continuing to monitor traffic flow, but all indicators show that, aside from an increase in incoming traffic being retried remotely, routing had returned to normal as of 01:06 UTC.
Performance has returned to expected levels. Between 00:56 and 01:04 UTC, traffic making its way from our load balancers to our servers did not do so in a timely manner. This will have resulted in possible timeouts for requests to api.cronofy.com and app.cronofy.com and potential server errors for API integrators and Scheduler users.
We have seen some performance degradation in our US data center. Initial findings appear similar to those of 26 Sept 2024. Improved monitoring has highlighted this issue earlier and we are in the process of investigating further.
Report: "US data center performance degradation"
Last update: Between 15:16-15:18 and 15:44-15:46 UTC we experienced degraded performance in our US data center. During these times, a little under 3% of requests to api.cronofy.com and app.cronofy.com resulted in a server error that potentially affected API integrators and Scheduler users. These errors coincided with an AWS issue in the North Virginia region - https://status.aws.amazon.com/#multipleservices-us-east-1_1727378355 - where load balancer target groups experienced slower than normal registration times. We are recording this incident retrospectively as, whilst we were aware of the issue with target groups, we had a gap in our alerting that led us to believe there was no impact to customers related to it. That gap has now been filled. If you have any questions, please email support@cronofy.com.
Report: "Background processing degraded"
Last update: On Monday April 22nd between 11:00 and 13:30 UTC our background processing services had a major performance degradation, meaning background work was delayed for around 2 hours in some cases. This impacted operations such as synchronizing schedules to push events into calendars and to update people's availability.

A change in our software's dependencies led to our background processors pulling work from queues but not processing that work as expected. This led to work messages being stuck in a state where the queues believed they were being worked on, and so did not allow other background processors to perform the work instead. For a subset of the background processing during this period we had to wait for a configured timeout of 2 hours to expire, at which point the background work messages became available again and the backlog was cleared. Full service was resumed to all data centers, including processing any delayed messages, by 13:30 UTC.

Further details, lessons learned, and further actions we will be taking can be found below.

## Timeline

_All times rounded for clarity and UTC_

On Monday April 22nd at 10:55 a change was merged which incorporated some minor version changes in dependencies that we use to interact with AWS services. This was to facilitate work against an AWS service we were not previously using. This change in dependencies interacted with a dependency that had not changed, such that our calls to fetch work messages from AWS Simple Queue Service (SQS) reported as containing no messages when in fact they did. This meant that messages were being processed as far as AWS SQS was concerned (in-flight), but our application code did not see them in order to process them.

This change went live from 10:58, with the first alert as a result of the unexpected behavior being triggered at 11:12. The bad change was reverted at 11:15 and fully removed by 11:20. This meant that background work queued between 10:58 and up to 11:20 was stuck in limbo where AWS SQS thought it was being processed.

For our data centers in Australia, Canada, UK, and Singapore, regular service was resumed at this point. New messages could be received and processed, and we could only wait for AWS SQS to release the messages in limbo in order to process those.

In our German and US data centers we had hit a hard limit of SQS, with 120,000 messages being considered "in flight" for our high priority queue. This meant that we were unable to read from those queues, but were still allowed to write to them. Once we realized and understood this issue, we released a change to route all new messages to other queues and avoid this problem. This was in place at 12:00.

Whilst we were able to make changes to remove the initial problem, and avoid the effects of the secondary problem caused by hitting the hard limit, the status of the individual work messages was outside of our control. AWS SQS does not have a way to force messages back onto the queue, which is the operation we needed to resolve the issue. We looked for other alternatives but the work messages aren't accessible in any way via AWS APIs when in this state. Instead we had to wait for the configured timeout to expire, which would release the messages again. We took more direct control over capacity throughout this incident, including preparing additional capacity for the backlog of work messages being released.
Once the work messages became visible after reaching their two hour timeout, we were able to process them successfully, with full service resumed to all data centers, including processing any delayed messages, by 13:30 UTC. We then reverted the changes applied during the incident to help handle it, returning things back to their regular configuration.

## Retrospective

The questions we ask ourselves in an incident retrospective are:

* Could it have been identified sooner?
* Could it have been resolved sooner?
* Could it have been prevented?

Also, we don't want to focus too heavily on the specifics of an individual incident, and instead look for holistic improvements alongside targeted ones.

### Could it have been identified sooner?

For something with this significant an impact, it taking 12 minutes to alert us was too slow. Halving the time to alert would have significantly reduced the impact of this incident, potentially avoiding the second-order issue experienced in our German and US data centers. The false-negative nature of the behavior meant that other safeguards were not triggered. Cronofy's code was not mishandling or ignoring an error; the silent failure meant our application code was unaware of a problem.

### Could it have been resolved sooner?

The key constraint on the resolution of the incident was the "in flight" timeout we had configured for the respective queues. We don't want to rush such a change to a critical part of our infrastructure, but our initial analysis suggests a timeout of 15-30 minutes is likely reasonable and would have made a significant difference to the time to full service recovery.

### Could it have been prevented?

As the cause was a change deployed by ourselves rather than an external factor, undoubtedly. In hindsight, something touching AWS-related dependencies must always be tested in our staging environment, and this change was not. This would likely have led to the issue being noticed before being deployed at all.

## Actions

We will be creating additional alerts around metrics that went well outside of normal bounds and would have drawn our attention much sooner.

We will be reducing the timeouts configured on our AWS SQS queues, to reduce the time messages are considered "in-flight" without any other interaction and to align more closely with observed background processing execution times.

We are changing how we reference AWS-related dependencies to make them more explicit, alongside adding a warning to ensure full testing is performed in our staging environment first. We will also be adding the AWS dependencies to our quarterly patching cycle to keep them contemporary, reducing the possibility of such cross-version incompatibilities.

## Further questions?

If you have any further questions, please contact us at [support@cronofy.com](mailto:support@cronofy.com)
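For context on the "in-flight" mechanism described above, here is a minimal boto3 sketch (placeholder queue URL and handler; not Cronofy's worker code) showing how the visibility timeout governs how long SQS hides a received message before making it available again if it is never deleted:

```python
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder
sqs = boto3.client("sqs", region_name="us-east-1")


def process(body: str) -> None:
    """Hypothetical handler for a background work message."""
    print("processing", body)


# A message received here becomes "in flight": SQS hides it from other consumers
# for VisibilityTimeout seconds. If a consumer silently drops it without calling
# delete_message, it only reappears once that timeout expires, which is why a
# 2-hour timeout meant a 2-hour wait for the stuck work.
response = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,     # long polling
    VisibilityTimeout=900,  # e.g. 15 minutes, aligned to real processing times
)

for message in response.get("Messages", []):
    process(message["Body"])
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```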
Background task processing has remained normal and the incident is resolved. Around 11:00 UTC, a change was deployed to production which inadvertently broke background processing. We received our first alert at 11:12 UTC. We reverted the offending change at 11:15 UTC, but this did not restore background processing completely. Multiple changes were made to minimize the impact of the issue, which restored most functionality for newly created jobs by 11:54 UTC. This left a backlog of work queued for processing between 11:00-11:54 UTC stuck in an unprocessable state. The stuck backlog work was due to requeue at around 13:00 UTC on reaching a configured timeout of 2 hours. We attempted to find a way to process this work sooner but were unsuccessful. As anticipated, the work became available from around 13:00 UTC and was processed for about half an hour, completing by 13:30 UTC. This fully restored normal operations. A postmortem of the incident will take place and be attached to this incident in the next 48 hours. If you have any queries in the interim, please contact us at support@cronofy.com.
The backlog of jobs has been processed and we are monitoring to verify that everything remains normal.
New work continues to process as normal, and the stuck jobs are expected to be processed in the next hour.
Capacity has been increased but the stuck jobs are still not processing due to an issue with the queue. We are working to find a way to process these jobs. New jobs are avoiding the stuck queue and are processing as expected.
We have mitigated most of the impact for background processing but have jobs stuck in US & DE. We are working to further increase capacity and execute these jobs.
Processing has recovered in all data centers apart from US & DE, which have only partly recovered. We are working to mitigate the issue in these two data centers.
We have seen background processing degrade in all data centers after a recent deployment. We have reverted the change and are investigating the cause.
Report: "Apple sync degraded"
Last update: At 07:47 UTC, we saw a sharp increase in the number of Service Unavailable errors being returned from Apple’s calendar servers across all of our data centers, causing sync operations to fail. This was escalated to our engineering team, who investigated and found that no other calendar providers were affected, and so the issue was likely not within our infrastructure. However, very few operations between Cronofy and Apple were succeeding. At 08:20 UTC, we opened an incident to mark Apple sync as degraded, as customers may have seen an increased delay in calendar sync. This coincided with a sharp drop in the level of failed network calls, which returned to normal levels at 08:18 UTC. The service stabilized and Cronofy automatically retried failed sync operations to reconcile calendars. Over the next hour, we saw communications with Apple return to a mostly healthy state, though there were still occasional spikes in the number of errors. Cronofy continued to automatically retry failed operations, so the impact on users was minimal. At 09:15 UTC, these low numbers of errors decreased back to baseline levels and stayed there. As we’ve now seen more than 30 minutes of completely healthy service, we are resolving the incident.
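The automatic retry behaviour mentioned above can be pictured with a small sketch like the one below (hypothetical; Cronofy's actual retry machinery is queue-based and more involved): a failing provider call is retried with jittered exponential backoff before the failure is surfaced.

```python
import random
import time
from typing import Callable


def retry_sync(operation: Callable[[], None], max_attempts: int = 5) -> None:
    """Retry a failing sync call with jittered exponential backoff.

    `operation` is any callable that raises when the provider returns an error
    such as Service Unavailable.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            operation()
            return
        except Exception:
            if attempt == max_attempts:
                raise                          # give up and surface the failure
            delay = min(60, 2 ** attempt) + random.random()
            time.sleep(delay)                  # back off before the next attempt
```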
We are still observing small numbers of errors. These are being automatically retried so the service is largely unaffected, and we are continuing to monitor.
We are still seeing occasional, smaller numbers of errors. These are being retried automatically by Cronofy and the service is largely unaffected. We are continuing to monitor error levels.
We saw a rise in Service Unavailable errors from Apple's calendar servers between 07:46 UTC and 08:19 UTC. Normal operation has resumed. Cronofy will automatically retry any failed communications with Apple, so no further intervention is required. We are continuing to monitor the situation to be sure that the incident is over.
We are investigating unusually high numbers of errors when syncing Apple calendars.
Report: "Degraded Apple calendar sync"
Last update: Error rates from Apple's API have returned to normal levels, and calendar syncs for Apple-backed calendars are healthy again. Apple Calendar sync performance was degraded from 13:36 UTC until 15:43 UTC. During this time no other calendar provider sync operations were affected.
We continue to observe a high error rate on Apple's Calendar API. Around 22% of Apple Calendar operations are resulting in an error. We'll continue to monitor things and update accordingly.
Errors when communicating with Apple calendar increased significantly across all data centers from 13:36 UTC. This is not affecting communications with any other calendar providers. We continue to monitor the situation.
Report: "8x8 conferencing requesting moderator login"
Last update: Since launching support for provisioning video conferencing when creating a calendar event in 2020, we have used 8x8.vc links as an explicit option and as an anonymous, browser-based conferencing fallback when calendar-native conferencing providers such as Google Meet and Microsoft Teams are not available. 8x8 accounts for around 1% of the video conferencing we have provisioned in recent weeks, the other 99% being made up of calendar-integrated conferencing solutions like Google Meet and standalone providers like Zoom.

8x8 removed support for 8x8.vc links being used anonymously earlier this week, without any notification and with no available workaround. Ideally, we would have received prior notification from 8x8 of this change so that we could have managed a graceful transition. As that did not happen, we will regrettably be dropping support for 8x8 as a conferencing option with immediate effect.

When using our API, "8x8" can be selected explicitly, or chosen as a fallback when using the "default" conferencing option. We have released a change which stops the generation of the anonymous, browser-based 8x8.vc conferencing links, and so calendar events will be created, but without any conferencing details. Put another way, "8x8" will have a similar effect to providing "none", and there will no longer be a catch-all conferencing option provisioned when using "default". https://docs.cronofy.com/developers/api/conferencing-services/create/#param-conferencing.profile_id

We will continue to accept both values so as to not break any existing integrations. "8x8" has been deprecated and the documentation relating to "default" has been updated to reflect this change in behavior. If you subscribe to notifications based on conferencing being provisioned, you will be notified of the failure to provision any conferencing in cases where 8x8 would previously have been used. https://docs.cronofy.com/developers/api/conferencing-services/subscriptions/

We truly regret this situation and can only apologize for the disruption this has caused. As there is no further action we are able to take on this, we are resolving this incident. If you have any further questions, please contact us at support@cronofy.com
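For integrators reviewing their own usage, here is an abbreviated sketch of the request parameter in question; the surrounding event fields are illustrative only, and the linked documentation remains the authoritative reference for the full schema.

```python
# Abbreviated event payload illustrating the conferencing.profile_id parameter
# discussed above; field values here are examples only.
event = {
    "event_id": "booking-123",
    "summary": "Intro call",
    "start": "2025-07-01T09:00:00Z",
    "end": "2025-07-01T09:30:00Z",
    "conferencing": {
        # "default" previously fell back to an anonymous 8x8.vc link when no
        # calendar-native provider was available; it no longer does.
        # "8x8" is deprecated and now has a similar effect to "none".
        "profile_id": "default",
    },
}
```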
[Message edited for easier reading] We announced the changes detailed in the resolution message but in the future tense.
We have had a response from 8x8 confirming that this change in service is intentional, will not be reversed, and that there is no workaround for any existing 8x8.vc links. This is the worst outcome we could foresee and are sorry for any disruption this is causing. We will be finalizing and sharing our plan for handling this by 15:00 UTC.
Since launching support for provisioning video conferencing when creating a calendar event in 2020, we have used 8x8.vc links as an explicit option and as a fallback when calendar-native conferencing providers such as Google Meet and Microsoft Teams are not available. These links were previously anonymous, requiring no account to be set up for them to be used. These links no longer support anonymous access; instead, users are now presented with "Waiting for moderator" and an option to log in. There does not appear to be a way for anyone to create an account, meaning these 8x8 links are currently unusable with no known workaround. We did not receive any notice of this change; we first received reports relating to this on Tuesday October 17th, and have since confirmed the change in behavior ourselves. We have reached out to our contact at 8x8 about this change in their service and are evaluating our options.
Report: "Degraded Google calendar sync"
Last updateCalendar sync for Google-backed calendars has remained healthy since the previous message, so we are considering this as resolved. Google have updated their incident record of the underlying issue, where they likewise consider it resolved: https://www.google.com/appsstatus/dashboard/incidents/7uJZ5F1Uy4n1n74iMacQ
Error rates from Google's API have returned to normal levels, and calendar syncs for Google-backed calendars are healthy again. Google Calendar sync performance dropped starting at 13:30 UTC, and was heavily degraded until improvements were seen starting at 16:00 UTC. A full recovery of service was reached around 16:10 UTC.
We're observing a much improved success rate from Google's Calendar API since 16:00 UTC. Google have advised that a fix has been made on their side and is currently rolling out, and they expect it to be fully resolved within the hour: https://www.google.com/appsstatus/dashboard/incidents/7uJZ5F1Uy4n1n74iMacQ We'll continue to monitor the situation and update here.
Google's Calendar API is still returning a high error rate, with no notable change to the situation since the previous update. Google are tracking the incident on their status dashboard: https://www.google.com/appsstatus/dashboard/incidents/7uJZ5F1Uy4n1n74iMacQ We'll continue to monitor the situation and update here.
We continue to observe a high error rate on Google's Calendar API. This appears to affect some user cohorts more than others, rather than being spread evenly across all of our synchronized Google calendars. While this doesn't affect the error rate of Cronofy's API, it means that we're failing to pick up the latest changes to affected users' calendars, and may present stale availability information. Events written via the Cronofy API may be delayed before being pushed successfully to the external Google calendar. We queue and retry such operations automatically and do not anticipate any remedial action being necessary as a result. We'll continue to monitor things and update accordingly.
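As an illustration of the queue-and-retry behaviour described above, here is a minimal sketch; `push_event_to_google` is a hypothetical stand-in for the provider API call and this is not Cronofy's actual sync code.

```python
# Illustrative sketch of queue-and-retry behaviour for a failed outbound
# calendar write; not Cronofy's actual sync code. `push_event_to_google` is a
# hypothetical stand-in for the provider API call.
import random
import time


def push_with_retry(push_event_to_google, event, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        try:
            return push_event_to_google(event)
        except Exception:
            if attempt == max_attempts:
                # Give up for now; the job stays queued for a later sweep.
                raise
            # Exponential backoff with jitter so retries don't arrive in bursts.
            time.sleep(min(60, 2 ** attempt) + random.random())
```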
Errors when communicating with Google calendar increased significantly across all data centers from 13:31 UTC. We have taken action to reduce the impact on other calendar providers and continue to monitor.
We are seeing a significantly higher than normal number of errors when trying to interact with Google calendars.
Report: "Increased error rates"
Last updateOn Monday 22nd August between 09:09 and 09:20 UTC all API calls creating or deleting events failed. Users of the Scheduler would be unaffected, as operations were retried automatically after 09:20 UTC. This outage was caused by a bug in a change to our API request journalling, which records each API request received by Cronofy.

## Timeline

At 09:04 a deployment was triggered including an update to our API request journal.

At 09:09 the deployment began rolling out, and the change came into force. Seconds later, an alert was triggered and engineers began investigating.

At 09:11 an additional alarm triggered for our Site Reliability team informing them of an increase in the number of failing API requests.

At 09:15, with many more alerts triggering, we triggered a further deployment reversing the change.

At 09:19 all deployments reverting the change completed, and the last error was observed.

## Retrospective

We ask three primary questions in our retrospective:

* Could we have resolved it sooner?
* Could we have identified it sooner?
* Could we have prevented it?

After identification, the issue was resolved in approximately 4 minutes. We believe our automated deployment pipeline strikes a good balance between speed and robustness, so no significant improvement can be found here.

The change had been highlighted as one in a risky area and had passed code review. Due to the anticipated risk, an engineer was actively checking for errors after the deployment. It took around 6 minutes from the first error being seen to making the call to revert the change. Given the severity of the issue, this was too long and we have taken action to avoid this in future.

The change was being made in a critical area of our platform, one that has recently been under development. Manual testing was performed against our staging environment but failed to exercise the affected path. Our reviews focussed too heavily on the intended change in behavior; we missed the unintended side effects of the change which led to this issue. Our automated tests for this area were not as comprehensive as we thought and did not detect the bug either.

## Actions

Automated tests in the area have been reviewed and expanded to provide more certainty when making changes there. This will prevent such changes from passing review.

We've strengthened guard clauses in this area to produce more descriptive errors, earlier, if a similar mistake were to be made in future. This will both prevent such changes from passing review and, in the worst case, aid faster identification of issues.

We've altered our playbook for deploying high-risk code changes to recommend that at least two engineers are present and monitoring errors and telemetry. This will improve our chances of identifying issues sooner.
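To illustrate the "strengthened guard clauses" action above, here is a minimal, hypothetical sketch of the kind of fail-fast check described; the function and field names are invented for illustration and are not Cronofy's actual code.

```python
# Hypothetical guard clause: reject a malformed API request journal record
# immediately with a descriptive error, rather than letting a bad value
# propagate and overwhelm a downstream component. Field names are
# illustrative only.
def journal_api_request(journal, request_record):
    missing = [field for field in ("request_id", "received_at", "path")
               if not request_record.get(field)]
    if missing:
        raise ValueError(f"API request journal record is missing fields: {missing}")
    journal.append(request_record)
```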
A change deployed at 09:08 UTC introduced a bug that led to an internal component being overwhelmed and dropping API requests. Our alerting spotted this and we rolled back, completing at 09:19 UTC. We will be conducting a post-mortem to understand why this wasn't caught before release and to improve our QA processes.
We're investigating an increased error rate in all data centers; the error rate has returned to normal but we're investigating any side effects.
Report: "Outlook and Zendesk Scheduler Extension Loading issue"
Last updateOn Wednesday 12th July between 14:00 and 16:15 UTC users of the Cronofy Scheduler extensions for Outlook and Zendesk would be unable to access the extension. Other instances of the Scheduler, such as the Chrome extension, integrations such as with Greenhouse, and the web version of the Scheduler, continued to operate normally. The underlying cause was that the Cronofy Outlook add-in and Zendesk App were not manually validated during the release of a change to the Scheduler extension. In line with our principles, we are publishing this public post-mortem to explain why this happened, and what we will do to prevent it occurring again.

## Timeline

_Times are from Wednesday 12th July 2023, in UTC and rounded for clarity._

At 14:04 we deployed an update to our extensions. This had gone through our normal request and review process.

At 15:49 one of our customers reported that they were unable to use the Outlook add-in to create a scheduling request. The customer observed a spinning progress wheel, and the Scheduler form did not load.

At 15:54 our support engineers replicated the issue in their own Outlook add-in, and escalated the issue internally to our first responder.

At 16:08 our engineering team located the problem, and identified the original change that caused it.

At 16:10 we reverted the change, and deployed this immediately. We checked internally to verify that this deployment corrected the problem, and the Zendesk and Outlook extensions were working again.

At 16:20 the customer confirmed that the issue was resolved.

## Retrospective

We ask three primary questions in our retrospective:

* Could we have resolved it sooner?
* Could we have identified it sooner?
* Could we have prevented it?

The root cause for this issue is twofold. Firstly, this area is difficult to create automated tests around, as it requires the extension to be loaded inside of Outlook or Zendesk to trigger. Secondly, and more importantly, given that we know about the lack of automated tests, we failed to manually test this change to the loading process of the extension using the Outlook add-in or Zendesk App. There is a different build process for the Outlook and Zendesk versions of the extension, where the extension is loaded in a different way. This alternate loading method triggered a bug that did not exist in the other extensions.

Once we were made aware of this issue by our customer, we resolved it in under 30 minutes. We don't feel we can improve our response time, but we see having to be notified by a customer as a failure. From an identification perspective, we should have identified this ourselves by manually checking the Outlook or Zendesk extensions once we had deployed the change.

We favour preventing the issue over earlier identification. In the future, we could have an event that triggers in the extension if the Scheduler form fails to load, and informs a separate errors service. We feel that some small improvements to the guidance we give our engineers can prevent an issue like this from happening again.

## Actions to be taken

* We will ensure that engineers are familiar with the differences between the extension build processes, making it clear which areas require manual testing. We will also cover what to be aware of when publishing changes that affect multiple different platforms at the same time.
* We will create internal guidance listing all the extensions, and how to properly check each extension.
* We will add an additional hint to our pull request template when extension files are being changed which specifically calls out to the engineer creating the PR and the engineers reviewing it that they should examine the impact on all extensions.

We have considered adding more automated testing to this area of the solution, and we plan on discussing this in more detail within the department. Tests in this area have historically given a poor return on investment.

## Further questions?

If you have any further questions, please contact us at [support@cronofy.com](mailto:support@cronofy.com)
From 14:07 to 16:08 UTC, Scheduler extensions (such as Outlook and Zendesk) were not able to load the Scheduler, instead showing only the loading spinner. Scheduler integrations (such as Greenhouse and Workday) were unaffected. This was due to a deployed update which expected some data that is only available on the Scheduler website, not in the extensions, and which was not caught by our tests or QA before release. We apologise for any inconvenience and will be improving our processes to be more rigorous.
Report: "US data center infrastructure issue"
Last updateAWS's us-east-1 region, where our US data center is hosted, experienced an issue affecting some services Cronofy's platform relies upon. The impact to service was low, at its peak resulting in a small degradation in performance within our US data center and a handful of server errors being returned by the service. This incident has been resolved.
AWS are reporting many services as fully recovered in us-east-1 and we have observed stable service in our US data center since around 21:45 UTC. We will continue to monitor but expect to resolve this incident soon.
The issue with AWS Security Token Service in us-east-1 has become worse over roughly the past 15 minutes. This is mainly preventing scheduled tasks from triggering, as they are not able to authenticate properly within the US data center to perform their work. This will lead to degraded service in tasks such as the polling of Apple calendars. The impact on core services has been minimized as we prevented them from scaling in earlier in the incident.
The impact on service in our US data center still appears to be minimal. AWS are resolving the underlying issue in us-east-1 and we will continue to monitor until their incident is closed.
We have seen elevated errors in our US data center; these appear to be related to an open issue in AWS us-east-1: https://health.aws.amazon.com/health/home#/account/dashboard/open-issues?eventID=arn:aws:health:us-east-1::event/MULTIPLE_SERVICES/AWS_MULTIPLE_SERVICES_OPERATIONAL_ISSUE/AWS_MULTIPLE_SERVICES_OPERATIONAL_ISSUE_798F8_DDDD18AFADD&eventTab=details The impact appears low at present, but given it is at the infrastructure level it could quickly become significant, and so we have opened this incident.
Report: "Degraded Outlook.com sync performance"
Last updateThe performance of Outlook.com requests has remained stable at usual levels for over an hour. At 07:54 UTC we saw a sharp increase in the number of failing connections to Outlook.com, which lasted until 09:30 UTC. This may have caused some changes to take longer to sync. Due to recent issues with Microsoft platforms, we monitored for an extended period before resolving this issue. We will continue monitoring, but we believe the issue has been resolved.
We're seeing some improvement in Outlook.com requests, but still with enough failures that we consider it to be in a degraded state. We're continuing to monitor the situation.
We are currently experiencing degraded sync with Outlook.com. This is not having an impact on our overall system performance. We are closely monitoring the situation.
Report: "Degraded Microsoft 365 sync performance"
Last updateFrom 19:22 to 22:27 UTC calendar sync operations to Microsoft platforms were degraded. During this time around 60% of operations between Cronofy and Microsoft resulted in failure. We took steps to mitigate the effects of this on other platforms and began monitoring the issue which has now been resolved. This was a recurrence of an earlier issue, https://status.cronofy.com/incidents/5ftvqz5tp40d.
We continue to experience degradation with Microsoft 365 sync operations. We're monitoring our systems closely.
We continue to experience degradation with Microsoft 365 sync operations. We're monitoring our systems closely.
We are seeing a recurrence of the earlier issue with Microsoft 365, https://status.cronofy.com/incidents/5ftvqz5tp40d. We are taking the same steps to mitigate the effects of this on other background processing operations.
Report: "Degraded Microsoft 365 sync performance"
Last updateFrom 14:13 to 15:47 UTC calendar sync operations to Microsoft platforms were degraded. During this time around 69% of operations between Cronofy and Microsoft resulted in failure and led to a backlog when processing requests. We took steps to mitigate the effects of this on other platforms and began monitoring the issue which has now been resolved. We believe this is related to Microsoft alert: MO571683 (https://portal.office.com/adminportal/home?#/servicehealth/:/alerts/MO571683)
Calendar sync operations to Microsoft services appear to have returned to normal around 16:00 UTC. We are continuing to monitor.
Our telemetry is showing that calendar sync operations to Outlook.com and Microsoft 365 are also affected. We have mitigated the performance impact this issue was having on calendar sync operations with other providers. We are tracking open Microsoft incidents, with IDs MO571683 and EX571516, and continuing to monitor our platform closely.
We have identified an issue with calendar sync operations when connecting to the Microsoft Graph API. We are working to mitigate this impact on the rest of our background processing jobs.
We are currently investigating degraded performance with background processing in both our DE and US data centers.
Report: "Degraded Apple calendar sync"
Last updateApple calendar synchronization has returned to normal operation and credentials have been reinstated where the invalidation was related to the incident.
Apple's calendar servers appear to have started responding normally since 08:15 UTC. We will reinstate credentials where possible in the next 20 minutes so long as things continue to appear normal.
Some Apple credentials were invalidated during this incident before we stepped in to halt that process. Once Apple's calendar servers are responding normally, we will reinstate the credentials invalidated during the window of this incident where possible.
We have seen an elevated number of errors when attempting to synchronize Apple calendars across all data centers since around 07:25 UTC.
Report: "US API degradation"
Last updateOn Wednesday April 12th 2023 between 03:16 and 05:35 (2 hours 19 minutes) Cronofy experienced an issue causing 17% of the API traffic to our US environments to fail, returning an HTTP 500 error to the API caller. The underlying cause was that our system failed to heal following a database failover. An incorrectly configured alarm and a gap in our playbook resulted in the issue extending longer than it should have. In line with our [principles](https://www.cronofy.com/about#our-principles), we are publishing a public post-mortem to describe what happened, why it impacted users, and what we will do to prevent it from happening in the future.

## Timeline

_Times are from Wednesday April 12th 2023, in UTC and rounded for clarity._

At 03:16 the primary node that writes to the database cluster failed. The secondary node was promoted and the cluster failed over automatically, recovering as planned. This incident alerted us via our pager service, triggering an investigation, which started at 03:20.

After the database failover, all visible metrics were pointing towards a successful automated resolution of the issue. Our core metrics were looking healthy and, at this time, we marked the incident as recovered without manual intervention at 03:30. It is normal for some of our incidents to be resolved in this way. The on-call engineer makes a decision based on available information as to whether an incident room is necessary to handle an alarm.

When a failover occurs, the original failed node is rebooted, and then becomes the read-only partner in the cluster, whilst the existing read-only secondary node is promoted to primary. Some of the connections from the web application nodes were still connected to the read-only secondary database node, but were treating it as if they were connected to the primary writable node. This led to failures in some of the actions that were taking place in the API.

At this time, our monitoring system was not alerting: although the metric measuring HTTP 500 responses was reporting correctly, the alarm was misconfigured. When the alarm received no data, this was treated the same as 0 errors. This resulted in no further alarms to alert us to the degradation of the service, and reinforced the belief that the service was healthy again.

At 05:00 an automated notification was posted into our internal Slack monitoring channel to show that two health metrics had not fully recovered. This wasn't an alarm-level notification, so did not re-trigger the on-call pager service.

At 05:30 an engineer reviewed the notification in Slack, and inspected the health metrics. The increased level of HTTP 500 errors being returned by the Cronofy API was identified. The incident was reopened, and investigation restarted. Our site reliability team was reactivated to triage the issue.

At 05:35 Cronofy took the decision to replace all application instances in the US environment. An automated redeployment was triggered. This reset all the nodes' connections to the database cluster, flushing out the connections to the read-only node, and returned API error responses to normal levels by 05:37.

## Retrospective

We ask three primary questions in our retrospective:

* Could we have resolved it sooner?
* Could we have identified it sooner?
* Could we have prevented it?

The root cause for this issue is that our systems did not correctly self-heal when a database failover occurred.
Although the database failover was the event that triggered the system to fail, it is a rare, expected, and unavoidable event. The database cluster correctly recovered, and was back in an operating mode within a few minutes.

Another significant factor in the severity of the impact was the robustness of Cronofy's response to the issue. This was the first occurrence of a database failover happening outside of core working hours. During working hours, multiple engineers would see the error and respond to it, each checking a different set of metrics or systems. This collaborative approach was not correctly translated into the guidance available in the incident response playbook, resulting in an incomplete set of checks taking place once the database failover had completed.

This could definitely have been resolved sooner, by forcing the nodes to reconnect to the database. This should trigger automatically in cases of database failover, and there should also be enough information available to the issue response team for them to know that a wider issue is still ongoing.

The failure to identify that the impact of the issue had not been fully resolved is what prolonged the incident, and this is linked to several factors:

* Confirmation bias from errors trending downwards, and the early-hours wake-up, causing the on-call engineer to miss the elevated error rate.
* A misconfigured alarm adding to the confirmation bias by not making it clear that the issue was ongoing.
* A reliance on tacit group knowledge instead of explicit documented steps for database failovers, meaning that the on-call engineer didn't know the additional validation and checks that would have identified the issue sooner.

## Actions to be taken

We are disappointed with the length of time that it took us to resolve this incident. There were multiple smaller failures that led to this incident having a higher impact. A few different checks, technical changes, or alarms being raised could have mitigated this error and prevented the longer outage.

* We are improving the way our applications self-heal in the event of a database failover. Some error messages occurred indicating that the system was trying to write to the wrong part of the database cluster, but these messages did not cause the system to reset.
* We will update our guidance for our internal teams on the actions to take when a database failover occurs. This will include a more precise checklist, and specific metrics to review.
* We are going through all our alarms to ensure that they are looking at the right data, at the right time.

## Further questions?

If you have any further questions, please contact us at [support@cronofy.com](mailto:support@cronofy.com)
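As an illustration of the self-healing improvement described above, an application can check whether a pooled connection has ended up on a read-only node after a failover and discard it; the sketch below is illustrative only, with a placeholder connection string, and is not Cronofy's implementation.

```python
# Illustrative sketch only (placeholder DSN, not Cronofy's implementation):
# before using a pooled connection for a write, check whether the session is
# read-only, which indicates it is still pointing at a demoted node after a
# failover, and discard it so a fresh connection is made to the new primary.
import psycopg2.pool

pool = psycopg2.pool.SimpleConnectionPool(1, 10, dsn="postgresql://db.example/app")


def writable_connection():
    conn = pool.getconn()
    with conn.cursor() as cur:
        cur.execute("SHOW transaction_read_only")
        read_only = cur.fetchone()[0] == "on"
    if read_only:
        # Stale connection to a read-only node; close it rather than reusing it.
        pool.putconn(conn, close=True)
        raise RuntimeError("connected to a read-only replica; retry with a fresh connection")
    return conn
```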
Our US data center experienced a failover of the primary database at around 03:15 UTC. From this point, around 20% of API requests encountered an error resulting in a 500 response; for specific operations such as creating events the failure rate was 80%. This lasted until around 05:30 UTC (around 2h15m later), when the elevated errors were recognized and all processes were restarted. Errors then returned to normal levels a few minutes later. From an initial investigation we believe "write" connections were attempting to write to a node which had become a replica as part of the failover, and those writes then failed. Some processes restarted automatically, but the API processes appear not to have done so, nor did they restart in the face of intermittent errors. That issue was cleared when we forced all the processes to restart, and service then returned to normal. We will be conducting a full postmortem of this event and will post it against this incident by the end of the week.
Report: "Elevated 500 errors on Read Event"
Last updateFrom 16:30 UTC to 16:36 UTC, calls to the Read Events API encountered a higher level of 500 Server Error responses than usual. This was due to a bug released at this time which allowed a small number of events to cause an error while being rendered. This was immediately spotted and rolled back, which was completed at 16:36 UTC.
Report: "Apple sync degraded"
Last updateAt 15:58 UTC, communication between Cronofy and Apple began to time out for a proportion of our requests. This quickly worsened, causing Apple calendar syncs to take several attempts. A change was deployed to slow down the rate of communication with Apple to reduce the pressure on their API and to avoid service being impacted for other calendar providers. At 16:50 UTC, we started to see most communication succeed again and monitored the situation until it had remained stable for 30 minutes. At 17:20 UTC, we removed the rate-limiting and continued to monitor the situation. Normal service has continued, so we are marking this incident as resolved. If you have any queries, please contact us at support@cronofy.com
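The mitigation described above, slowing the rate of communication with a struggling provider so other providers' sync is unaffected, can be sketched as a simple throttle; this is illustrative only and not Cronofy's implementation.

```python
# Illustrative sketch only, not Cronofy's implementation: cap the rate of
# outbound requests to a struggling provider so that sync traffic for other
# providers is unaffected. The rate is an arbitrary example value.
import threading
import time


class Throttle:
    def __init__(self, max_per_second):
        self.interval = 1.0 / max_per_second
        self.lock = threading.Lock()
        self.next_allowed = time.monotonic()

    def wait(self):
        # Serializes callers so requests are spaced at least `interval` apart.
        with self.lock:
            now = time.monotonic()
            if now < self.next_allowed:
                time.sleep(self.next_allowed - now)
            self.next_allowed = max(now, self.next_allowed) + self.interval


apple_throttle = Throttle(max_per_second=5)  # example rate only
```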
Apple API traffic has recovered, and normal operation has resumed. We are continuing to monitor the service.
We are seeing an increase in connection errors to Apple's API globally. Apple sync is still operating in a degraded state. We will update at 17:00 UTC.
We are seeing increased errors from Apple's calendar API since 16:00 UTC; we are investigating and taking mitigating action.
Report: "Elevated API 500 errors"
Last updateFrom 09:59 UTC to 10:07 UTC, calls to our API encountered a higher level of 500 Server Error responses than usual. This was due to a bug released at this time which was not caught by our automated tests or during code review. Once we were notified of the issue with the release, we initiated a rollback, which took about 7 minutes to roll out. We will be improving our test suite and adding better error handling in this area.
Report: "365 synchronization errors"
Last updateThe performance of syncing Microsoft 365-backed calendars was degraded for about 90 minutes. Microsoft 365 began returning an elevated number of errors and timing out just after 07:05 UTC, and this continued until 08:30 UTC. This was part of a wider issue within Azure, where Microsoft 365 is hosted. Steps were taken to increase capacity and to be more aggressive in timing out connections to Microsoft 365 to mitigate the impact on background processing. The sudden change in the volume of background jobs did lead to a slight processing delay for much of the issue. Normal service resumed a few minutes after the Microsoft 365 APIs began to operate normally, once a surge of calendar update notifications had been processed. We will be reviewing our mechanisms for reducing the side effects of such incidents in the future.
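As an illustration of the "more aggressive in timing out connections" mitigation, here is a minimal sketch; the endpoint, token, and timeout values are examples rather than Cronofy's actual configuration.

```python
# Illustrative only: tighten the client-side timeout for calls to a degraded
# provider so slow responses fail fast instead of tying up background workers.
# The endpoint, token, and timeout values are examples, not Cronofy's real
# configuration.
import requests

GRAPH_CALENDARS_URL = "https://graph.microsoft.com/v1.0/me/calendars"

try:
    response = requests.get(
        GRAPH_CALENDARS_URL,
        headers={"Authorization": "Bearer EXAMPLE_TOKEN"},
        timeout=(3, 10),  # (connect, read) seconds, tighter than usual while degraded
    )
    response.raise_for_status()
except requests.Timeout:
    pass  # requeue the sync job for a later attempt rather than blocking a worker
```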
Error rates for Microsoft 365-backed calendars have returned to normal levels, and we have processed a surge in calendar updates. We will continue to monitor the service but believe this issue to now be resolved.
Microsoft have sent a 365 health notification (MO502273) relating to some users being unable to access multiple Microsoft 365 services. This is likely related to the issue we are seeing. Errors for 365-backed calendars continue to be elevated.
We are seeing a higher than usual number of errors affecting calendars hosted on Microsoft 365 across all our data centers for connections using Exchange Web Services (EWS) and Microsoft's Graph API. Background processing has scaled up to compensate but may be degraded as a result.
Report: "Degraded performance in our US data center"
Last updatePerformance was degraded in our US data center for around 2 hours between 16:00 and 18:00 UTC. This was down to our primary database struggling under load. Steps were taken to remove background processes to reduce the load as much as possible and help the system return to regular operation. API responses may have been slower than usual during this period, and background processing such as synchronizing calendar data will also have been slower than usual, with messages taking up to 3 minutes to be picked up at the peak of the incident. We will be bringing forward the maintenance to upgrade this database cluster from this coming Sunday 18th December to tomorrow, Friday 16th December. A notice for this maintenance change will be posted shortly.
The queued work has now been processed and we can see that performance is no longer degraded. We are continuing to monitor the situation.
Our US database is experiencing slower than usual disk performance. We have taken steps to ease the pressure, such as temporarily disabling maintenance tasks. The amount of queued work is reducing. We're continuing to work to bring performance back to its usual level. We have also added bigger database nodes to the cluster in case we need to fail over to those. However, this would require a short outage, so we are holding off on failing over just yet.
We have taken steps to ease the pressure on our US database. This has resulted in better, but still degraded, performance. We're continuing to investigate the root cause.
We are investigating degraded performance in our US data center
Report: "US data center reachability"
Last updateFrom 04:44 to 04:47 UTC our US data center may have been unreachable. This was caused by a short-lived database issue which was resolved without intervention.
Report: "DE data center reachability issues"
Last updateOur German data center was unreachable for two periods, 17:12-17:15 UTC and for around a minute at 17:37 UTC. In terms of symptoms, this was very similar to what we saw in our US data center in August https://status.cronofy.com/incidents/32fc8mjcr1zw We have applied changes developed to alleviate that issue to our German data center and it has been stable since.
In terms of symptoms, this is very similar to what we saw in our US data center in August https://status.cronofy.com/incidents/32fc8mjcr1zw We have applied changes developed to alleviate that issue to our German data center and are monitoring.
Our German data center has been unreachable for two periods, 17:12-17:15 UTC and for around a minute at 17:37 UTC. We are investigating the underlying cause and potential remediations.
Report: "DE data center reachability"
Last updateFrom 22:15 to 22:18 UTC Cronofy's German data center may have been inaccessible for API and web traffic. This was due to an underlying database issue which has since cleared.
Report: "Incorrect identification of Outlook.com calendars"
Last updateFrom 09:50 UTC on Wednesday October 19th 2022 through to 20:20 UTC on Wednesday October 26th 2022 (7.5 days) Cronofy had a bug which meant that users going through the OAuth flow for Outlook.com calendars were being incorrectly associated with accounts within Cronofy. This led to data being shared incorrectly due to the misidentification of Outlook.com accounts and the resulting API authorizations pointing towards the misidentified account rather than separate accounts.

_Microsoft 365 and on-premise Exchange calendars were unaffected. Only calendars from Outlook.com, Microsoft's more consumer-orientated offering known over the years as Hotmail and Live.com, were affected._

As a data processor we have already contacted API integrators with users impacted by this issue on Thursday October 27th 2022. In the interests of transparency, in line with [Cronofy's principles](https://www.cronofy.com/about#our-principles), we are publishing a public postmortem.

## Timeline and background

_Times are from October 2022, in UTC, and rounded for clarity_

At 09:50 on Wednesday 19th a change was deployed in support of work to move from using Microsoft's Outlook.com-specific API to using Microsoft's Graph API for Outlook.com accounts. This change inadvertently altered the shape of the response we receive from Microsoft at the end of an OAuth authorization process, which meant Cronofy was not extracting Microsoft's unique identifier for the account correctly, instead getting a null value from the process. This broke assumptions made about identity by the rest of the system, which led to the described behavior.

When receiving a result from an OAuth authorization flow, we receive several values, the key of which is a unique identifier for the account, alongside an email address and the OAuth tokens. This may relate to a calendar account already within Cronofy, so we look up first by the provider's unique identifier, then secondarily attempt a match by email. The incorrect extraction of a null value as the unique identifier for Outlook.com accounts broke an implicit contract that other parts of Cronofy's system relied upon.

This meant that the first person experiencing this bug either resolved via email address or created a new entry within Cronofy, either of which resulted in a record tied to the provider with a unique identifier of a null value. As any user passing through the flow would have a null value for this field due to the bug, every subsequent passage through the OAuth flow would resolve to this one record relating to a single Outlook.com account. Processes downstream then behaved as if the user's identity had been correctly verified, leading to authorizations pointing to accounts unexpectedly.

Access exposure was limited to a single Outlook.com account in each data center, but with multiple integrators having access to it. Updates to and from this calendar account were not successful after the second user resolved to the account, due to safeguards in place relating to the underlying calendar IDs changing completely. This minimized the inadvertent exposure of data.

With hindsight, we have identified a support ticket received at 17:50 on Wednesday 19th which likely related to this bug. At the time it looked like a common issue encountered by developers when integrating and so did not trigger further action. Aside from this, the only relevant report we have been able to identify is the support ticket which triggered our response.
The ticket that triggered our response was received a week later, at 14:15 on Wednesday 26th. After requesting and receiving some example accounts to investigate the described problem, we noticed something looked very odd and the alarm was raised internally at 19:50. By 20:20 we had prevented a null value unique identifier from ever being used for matching, preventing the growth of the issue. By 20:55 we had also reverted the change introduced the previous Wednesday, to be entirely certain the scope of the issue would not grow.

With the problem contained, we decided the best course of action would be to revoke all authorizations that had resulted from this behavior. Our view was that removing some legitimately received access was better than risking leaving any illegitimate access active, especially as users would be able to reinstate access as necessary. Work continued along this path, at first generating reports for manual verification of the intended actions, followed by taking the actions required. All identified API integrator authorizations were revoked, any potential user sessions invalidated, and the relevant Outlook.com account was deleted by 05:00 on Thursday 27th.

Work continued on Thursday 27th to identify API integrators we needed to inform, along with an idea of the number of users affected to help inform their response. Those notices were sent between 17:00 and 21:30 on Thursday 27th. Throughout the following days, we worked to produce more exhaustive reports for each customer by reconciling a number of data sources. These have already been distributed to API integrators that requested them.

## Opportunities for improvement

On Thursday November 3rd we held an internal retrospective relating to this incident.

Whilst it was disappointing for the bad change to be deployed, it was a subtle problem that was difficult to pick up in both development and review. It is in an area where it is difficult to automate tests, as it is dependent on external input: the result of a user going through an OAuth journey. It also relied on testing that included both of:

* Multiple Outlook.com calendar accounts being present; most local test environments only have a single calendar from each provider
* Multiple passes through the OAuth process with different accounts; most manual testing will happen once, and generally against the same account

This combination was not part of standard testing practices, especially for what looked like a simple change. It also relies on manual actions which, with the best will in the world, cannot be relied upon to always be performed.

Moving on from the specifics of the bad change, we looked at more holistic issues. The acceptance of null as a value for Outlook.com identity was what allowed the misidentification to happen. This was prevented from being possible at a lower level during the handling of the incident. A null value is never expected in this situation, and we are making modifications to the Cronofy platform to assert this fact at different layers, to avoid a mistake in a single location being all that is needed to bypass this assumption. This will mean that a similar regression in future will "fail fast" rather than silently continuing as happened in this case.

It is always disappointing when we find out about issues from our customers, especially one as severe as this. We looked at other signals we may have been able to alert on based upon the behavior observed during the incident.
Whilst there were no errors being raised, there are metrics, such as the number of times we believe a calendar account has completed an OAuth flow within a given period, that would have stood out here. We will be doing a further investigation into such signals to understand where we may introduce things such as alerts, soft limits, and hard limits to reduce the impact of similar problems in future.

Cronofy's event-sourced architecture made it reasonably straightforward to review the history of the system and undo what had been done as a result. However, due to the nature of the issue, it took several days to build a clear enough picture to generate a PII-containing report to share with API integrators without the risk of sharing PII we should not. We're expanding our telemetry and reporting around OAuth flows to make such reconciliation more straightforward in future.

Communicating with API integrators affected by the incident was a difficult, mostly manual process. This introduces the possibility of errors and delay, neither of which is desirable in the process of handling an incident. We are going to bring forward work planned to improve this process for service-related messages so we can send them directly from the Cronofy platform.

To summarize the actions we are taking:

* We are deepening our checks relating to identity across all providers, not just Outlook.com, including, but not limited to, manual testing playbooks and code-level assertions
* We will investigate detecting, and potentially preventing, behavioral anomalies relating to identity and authorization
* We are enhancing our telemetry and reporting around identity and authorization processes
* We will implement a new mechanism for sending service messages to customers

## Further information

If you are an affected API integrator and wish to obtain a copy of your report of impacted users, please get in touch via [support@cronofy.com](mailto:support@cronofy.com) before Thursday December 1st 2022. As these reports contain PII we can only retain them for a short period, and so will be deleting them after this date.

As ever, please contact us at [support@cronofy.com](mailto:support@cronofy.com) if you have any further questions.
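To make the matching rule and the fail-fast assertion described in this postmortem concrete, here is a hypothetical sketch; the function and field names are invented for illustration and this is not Cronofy's code.

```python
# Hypothetical sketch of the account-matching rule described above, plus the
# fail-fast assertion added afterwards. Function and field names are invented
# for illustration; this is not Cronofy's code.
def resolve_calendar_account(accounts, provider, provider_sub, email):
    # A missing or blank provider identifier must never silently match anything.
    if not provider_sub:
        raise ValueError(f"{provider} OAuth response did not include a subject identifier")

    # Primary match: the provider's unique identifier for the account.
    for account in accounts:
        if account["provider"] == provider and account["provider_sub"] == provider_sub:
            return account

    # Secondary match: fall back to the email address reported by the provider.
    for account in accounts:
        if account["provider"] == provider and account["email"] == email:
            return account

    return None  # no existing account; the caller creates a new one
```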
From 09:50 UTC on Wednesday October 19th 2022 through to 20:20 UTC on Wednesday October 26th 2022 (7.5 days) Cronofy had a bug which meant that users going through the OAuth flow for Outlook.com calendars were being incorrectly associated with accounts within Cronofy. This led to data being shared incorrectly due to the misidentification of Outlook.com accounts and the resulting API authorizations pointing towards the misidentified account rather than separate accounts. Microsoft 365 and on-premise Exchange calendars were unaffected; only calendars from Outlook.com, Microsoft's more consumer-orientated offering known over the years as Hotmail and Live.com, were affected. As a data processor we have already contacted API integrators with users impacted by this issue on Thursday October 27th 2022. We are backfilling this incident in order to publish a public postmortem.
Report: "Microsoft Defender SmartScreen reporting US OAuth URL as unsafe"
Last updateLate Thursday 29th September we received the first report of Microsoft Defender SmartScreen within Microsoft's Edge browser flagging our US OAuth flow endpoint (https://app.cronofy.com/oauth/authorize) as being an unsafe site. On Friday 30th September this was flagged to our engineering team, who were able to reproduce the issue, submitted a dispute to Microsoft as the site owner, and opened this incident.

Though we obviously believed this to be an incorrect classification, we investigated why we may have been flagged in the first place whilst we awaited a response from Microsoft. During this investigation we identified an application in development mode which may have been being used as part of a phishing scam. Our guess is that they were using Cronofy's domain as a trustworthy starting point but redirecting on to an untrustworthy redirect URI after the user has granted access to their calendar. For applications in development mode we allow any redirect URI to be used to ease development, but display a warning to users that the application is not verified. It seems that users were ignoring this warning and proceeding to go through our OAuth flow to connect their calendar before being redirected on to a site posing as a financial service. We disabled the specific application and made our warning that an application is in development mode much more prominent to discourage the use of development mode applications in this way, including ensuring the warning was translated for all the locales the page supports. We had yet to hear from Microsoft, but we updated our ticket with Microsoft to let them know our findings and the actions taken.

At this point we were waiting on Microsoft to process our case. We did not wish to make changes that could be seen as attempting to bypass this protective mechanism, as that is what a nefarious actor would do, potentially leading to the entire domain being flagged. Instead we waited, going through the proper process to get the classification corrected. We discussed potential actions to circumvent the block in case we were left with no choice, so that our integrators would have an option that did not require their users to perform a workaround involving ignoring a browser warning which, the vast majority of the time, should be heeded. After a week of waiting we submitted a second case to Microsoft in case the first was somehow lost.

Yesterday, Wednesday 12th October, we resorted to reaching out to people on social media and managed to get the attention of someone on the Microsoft Edge team who was able to get our case actioned, and the flag was removed. Our US OAuth flow endpoint has not been flagged for over 12 hours now, so we consider this incident resolved. We are in contact with Microsoft to better understand why we were flagged in the first place, to prevent similar incidents, and to learn how we might get to a faster resolution if it happens again. Finally, thank you to everyone who helped us by submitting a report that our site had been flagged incorrectly.
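The development-mode behaviour described above, where any redirect URI is allowed but a prominent unverified-application warning is shown, can be illustrated with a hypothetical sketch; the names and structure are invented and this is not Cronofy's code.

```python
# Hypothetical illustration of the behaviour described above: production
# applications may only redirect to pre-registered URIs, while development
# mode accepts any URI but forces a prominent "unverified application"
# warning in the authorization UI. Names are invented; this is not Cronofy's code.
def check_redirect_uri(app, redirect_uri):
    if app["development_mode"]:
        # Any redirect URI is accepted to ease development, but the user must
        # be shown a clear warning before granting calendar access.
        return {"allowed": True, "show_unverified_warning": True}

    allowed = redirect_uri in app["registered_redirect_uris"]
    return {"allowed": allowed, "show_unverified_warning": False}
```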
Our case has been processed by Microsoft and the OAuth authorization URL https://app.cronofy.com/oauth/authorize is no longer being flagged as unsafe.
Microsoft Defender SmartScreen continues to flag the OAuth authorization URL https://app.cronofy.com/oauth/authorize as unsafe. We have still not received a non-automated response from Microsoft, having submitted a second owner dispute since the last update. At this point we are still attempting to go through the proper channels, but are starting to consider our options for workarounds that do not involve end-users having to bypass a warning dialog they should generally be paying attention to. A workaround exists in that users appear to be able to refresh the page when they hit the warning, and the page then functions as normal. Using a browser other than Microsoft Edge also serves as a workaround to this issue. We would like to repeat our request that our customers initiate their own calendar OAuth flows in Microsoft Edge and see if they are shown a warning. If so, please click the "More information" link, then "Report that this site doesn't contain phishing threats", and fill out the form. This can only help our case get in front of the correct people at Microsoft for resolution.
Microsoft Defender SmartScreen continues to flag the OAuth authorization URL https://app.cronofy.com/oauth/authorize as unsafe. We are yet to hear back from Microsoft regarding our dispute of this classification. We do not wish to make changes that could be seen as attempting to bypass this protective mechanism, as that is what a nefarious actor would do, potentially leading to the entire domain being flagged. We are instead attempting to go through the proper process to get the classification corrected, but this does mean the timeline is out of our hands. Users appear to be able to refresh the page when they hit the warning, and the page then functions as normal. Using a browser other than Microsoft Edge also serves as a workaround to this issue. We would like to request that our customers initiate their own calendar OAuth flows in Microsoft Edge and see if they are shown a warning. If so, please click the "More information" link, then "Report that this site doesn't contain phishing threats", and fill out the form. This should help our case get in front of the correct people at Microsoft for resolution.
Microsoft Defender SmartScreen is still flagging the OAuth authorization URL https://app.cronofy.com/oauth/authorize as unsafe. We first received a report of this on Thursday evening, and it is potentially related to a recent release of Microsoft Edge: https://blogs.windows.com/msedgedev/2022/09/29/more-reliable-web-defense/ We have identified an application in development mode which may have been being used as part of a phishing scam, using Cronofy's domain as a trustworthy starting point but redirecting on to an untrustworthy redirect URI after the user has granted access to their calendar. We have disabled this application and made our warning that an application is in development mode much more prominent to discourage the use of development mode applications in this way. We have reached out to the SmartScreen team for an update and let them know our findings and actions so far.
We have been unable to find a workaround for the false positive with Microsoft Defender SmartScreen. We have been able to verify that it is only affecting Microsoft Edge users visiting the `/oauth/authorize` endpoint for the US data center, though attempts to alter the behavior in non-breaking ways have not cleared the error. Our telemetry has confirmed that the scale of the impact is very small. Customers using Microsoft Edge to authorize calendars will see the warning, though refreshing the page will clear it, as will choosing to continue to the page. We are awaiting a response from Microsoft regarding our request to verify the affected URL. Users of other web browsers continue to be unaffected.
We have had reports of Microsoft Defender SmartScreen within Microsoft's Edge browser flagging some OAuth flows as being from an unsafe site. We obviously believe this to be a false positive and have reported it to Microsoft. If users refresh the page, Edge will allow them to continue without any warning. Based on this workaround being simple, and on the domain as a whole not being deemed untrustworthy, we are investigating whether there is anything we can do to avoid this false positive from our side.
Report: "Connectivity issues to the US datacenter"
Last updateFrom 18:23 to 18:28 UTC we saw reachability problems for our US data center. Symptomatically this is extremely similar to the outage observed on Saturday 23rd July 2022, details of which can be found here: https://status.cronofy.com/incidents/32fc8mjcr1zw Steps are already underway to alleviate the believed root cause of this.
Report: "Connectivity issues in the US datacenter"
Last updateOn Saturday, 23rd July 2022, we experienced a 12-minute outage in our US data center between 17:29 and 17:41 UTC. During this time, our API at [api.cronofy.com](http://api.cronofy.com) and our web application at [app.cronofy.com](http://app.cronofy.com) were not reachable. Any requests made are likely to have failed to connect or received a 500-range status code rather than being handled successfully. Our web application hosts the developer dashboard, Scheduler, Real-Time Scheduling pages, and end-user authorization flows. Our background processing of jobs, such as calendar synchronization, was not affected.

Cronofy records all API calls into an API request table before processing. The outage was triggered when the database locked this table. Without being able to write requests to the table, all API requests began to queue up and time out, and once the queue was full, be rejected outright. This, in turn, caused our infrastructure to mark these servers as unhealthy and take them out of service.

We experienced a [very similar incident](https://status.cronofy.com/incidents/mz84qh5n29cq) in February 2021. Since that incident, we have [performed major version upgrades](https://status.cronofy.com/incidents/wzj1vnhj31zc) to our PostgreSQL clusters, and we had thought those upgrades had fixed this issue, as we had not had a recurrence for a long time. It is now clear that the major version upgrades have, unfortunately, not fixed this particular issue. To help prevent this issue from happening again, we will be making changes to how data is stored within our PostgreSQL cluster.

# Timeline

_All times UTC on Saturday, 23rd July 2022 and approximate for clarity_

**17:29** App and API requests began to fail

**17:31** The on-call engineer is alerted to the app and API being unresponsive

**17:35** Attempts to mitigate the issue are made, including launching more servers. These result in temporary improvements but do not fix the issue.

**17:37** The initial alerts clear as connectivity is temporarily restored by our attempts to resolve the issue.

**17:38** New alerts are raised for the app and API being unresponsive

**17:39** Incident channel created, and other engineers come online to help

**17:41** This incident is created. While this is being done, telemetry shows that API and app requests are being processed again.

**17:52** Incident status is changed to monitoring and we continue to investigate the root cause.

**18:47** Incident status is resolved

# Actions

The actions for this incident fall into two categories: what we can do straight away, and what we can do in the medium/long term.

## Short term

To improve the performance of database queries we use several indexes within our PostgreSQL clusters; these help to locate data in a fast and efficient manner. This locking issue always seems to occur when these indexes are being updated and the database gets into a state where it is waiting for some operations to resolve. Therefore, we are going to review which indexes are actively used and determine whether any can safely be removed or consolidated, as this will reduce the chances of the issue occurring by reducing the number of indexes which need updating.
We are also going to look at whether we can improve our alerts to help us identify the root cause of this type of issue faster, and give our on-call engineers a clearer signal that this is the root cause. While we currently don't have a way of resolving the issue directly (the database eventually resolves the locks), this will help us provide clearer messaging and faster investigations.

## Medium/long term

In the medium to long term, we will review the storage of API and app requests and determine whether PostgreSQL is the correct storage technology. This is likely to lead to re-architecting how we store some types of data to ensure our service is robust in the future.

## Further questions?

If you have any further questions, please contact us at [support@cronofy.com](mailto:support@cronofy.com)
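As an illustration of the short-term action of reviewing which indexes are actively used, a query against PostgreSQL's statistics views can list indexes that have never been scanned; the sketch below is illustrative, with placeholder connection details.

```python
# Illustrative sketch of reviewing index usage: list indexes that have never
# been scanned since statistics were last reset, as candidates for removal or
# consolidation. Connection details are placeholders.
import psycopg2

UNUSED_INDEXES_SQL = """
    SELECT schemaname, relname, indexrelname, idx_scan
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0
    ORDER BY relname, indexrelname
"""

with psycopg2.connect("postgresql://localhost/app") as conn:
    with conn.cursor() as cur:
        cur.execute(UNUSED_INDEXES_SQL)
        for schema, table, index, scans in cur.fetchall():
            print(f"{schema}.{table}: index {index} has never been scanned")
```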
The service is still healthy, and we have identified the likely root cause as a rare edge case being triggered in our database management system. This caused high levels of locking and degraded performance. This occurred at 17:29 UTC and lasted until the locks resolved at 17:41 UTC. We are investigating short and medium-term solutions to change our infrastructure to avoid a repeat incident.
Everything is continuing to perform at normal levels. We are still investigating the root cause and monitoring the service.
We have identified an unusually high number of locks in our database, causing a performance degradation due to high contention. This has now passed and we are monitoring the service while continuing to investigate the root cause.
We are seeing normal service resuming and are still investigating the source of the issue.
We are currently investigating high levels of errors when trying to communicate with the API, Scheduler, and Developer Dashboard.
Report: "Elevated errors from Google calendar"
Last updateAt approximately 22:16 UTC, we observed a much higher number of errors for Google calendar API calls than we would expect (mostly no data being received for an events page) in our German data center. The on-call engineer was alerted to this issue at 22:32 UTC. After investigating, we decided to open an incident at 22:49 UTC to report service degradation in our German data center. While opening the incident, we were alerted that the US data center was also impacted. We saw that around 10% of Google calendar API calls in our US data center were returning an error, and so the incident was updated at 22:56 UTC. Errors communicating with the Google calendar API returned to normal levels in both our German and US data centers at around 22:52 UTC. Errors have remained at normal levels since then, so we are resolving this incident. There does not appear to have been a pattern to the accounts affected by this.
Errors returned to usual levels at around 22:52 UTC, as the previous message was being sent. We continue to monitor the situation.
Initial investigations showed that this was only affecting our German data center. However, we can now see that this is also affecting our US data center, but on a much smaller scale. We are continuing to monitor the situation. Our monitoring shows that the synchronization performance of other calendar providers is not affected.
Since approximately 22:16 UTC, we have seen a higher level of errors when communicating with Google calendars than we would normally expect in our German data center. We are monitoring the situation. Synchronization performance for Google calendars will be affected by this, other calendar providers are not affected.
Report: "Degraded performance in all data centers"
Last updateOn Wednesday, 13th July 2022 we experienced up to 50 minutes of degraded performance in all of our data centers between 16:10 and 17:00 UTC. This was caused by an upgrade to our Kubernetes clusters (how the Cronofy platform is hosted) from version 1.20 to 1.21. This involves upgrading several components, of which one, CoreDNS, was the source of this incident. CoreDNS was being upgraded from version 1.8.3 to 1.8.4, as this is the AWS recommended version to use with Kubernetes 1.21 hosted on Amazon's Elastic Kubernetes Service. Upgrading these components is usually a zero-downtime operation and so was being performed during working hours. Reverting the update to components, including CoreDNS, resolved the issue.

This would have presented as interactions with the Cronofy platform and calendar synchronization operations taking longer than usual. For example, the 99th percentile of Cronofy API response times is usually around 0.5 seconds, while during the incident it increased to around 5 seconds. Calendar synchronization operations were delayed by up to 30 minutes during the incident.

Our investigations following the incident have identified that CoreDNS version 1.8.4 included a regression in behavior from 1.8.3 which caused the high level of errors within our clusters, leading to the performance degradation. We are improving our processes around such infrastructure changes to avoid such incidents in future.

# Timeline

_All times UTC on Wednesday, 13th July 2022 and approximate for clarity_

**16:10** Upgrade of components including CoreDNS started across all data centers.

**16:15** Upgrade completed.

**16:16** First alert received relating to the US data center. Manual checks show that the application was responding.

**16:18** Second alert received for degraded background worker performance in CA and DE data centers. Investigations show that CPU utilization is high on all servers, in all Kubernetes clusters. Additional servers were provisioned automatically and then more added manually.

**16:19** Multiple alerts being received from all data centers.

**16:31** This incident was opened on our status page informing customers of the issue. We decided to roll back the component upgrade.

**16:45** As the components including CoreDNS were rolled back in each data center, errors dropped to normal levels and performance improved.

**16:47** Rollback completed. The backlog of background work was being processed.

**17:00** The backlog of background work was cleared.

**17:05** Incident status changed to monitoring.

**17:49** Incident closed.

# Actions

Although there wasn't an outage, we certainly want to prevent this from happening again in the future. This led us to ask three questions:

1. Why was this not picked up in our test environment?
2. What could we have done to identify the root cause sooner?
3. How could the impact of the change be reduced?

## Why was this not picked up in our test environment?

Although this was tested in our test environment, the time between finishing the testing and deploying this to the production environments was too short. This meant that we missed the performance degradation it introduced. We are going to review the test plan for such infrastructure changes in our test environment. This will include a soaking period, which will see us wait a set amount of time between implementing new changes in our test environment and rolling them out to the production environments.

## What could we have done to identify the root cause sooner?
Previous Kubernetes upgrades had been straightforward, which led to over-confidence. Multiple infrastructure components were changed at once, so we were unable to easily identify which component was responsible. In future, we will split infrastructure component upgrades into multiple phases to help identify the cause of any problems that do occur.

## How could the impact of the change be reduced?

As mentioned above, previous Kubernetes upgrades had been straightforward, which led to over-confidence. We rolled out the component updates, including CoreDNS, to all environments in a short amount of time, and it wasn't until they had all been completed that we started to receive alerts. To prevent this from happening again for such changes, we are going to phase rollouts to our production environments. This will mean such an issue only impacts some environments rather than them all, reducing the impact and aiding a faster resolution.

# Further questions?

If you have any further questions, please contact us at [support@cronofy.com](mailto:support@cronofy.com)
This afternoon we were upgrading our Kubernetes clusters, which are all hosted using AWS Elastic Kubernetes Service. There are multiple steps to this process, all of which had been performed successfully in our testing environment, and it wasn't until the last step had been applied that we started to see issues. That last step was upgrading CoreDNS and Kube Proxy to the versions recommended by AWS for the new version of EKS. This started at approximately 16:10 UTC. Shortly after this, we received alerts informing us of degraded performance when processing messages. The CoreDNS and Kube Proxy logs didn't contain any errors, so we suspected that our worker processes may have been stuck and restarted them; however, this did not resolve the issue. At 16:31 UTC this incident was created while we continued to identify the cause. We decided the best course of action was to start rolling back the last change that was made, beginning with a single environment to see if it had the desired effect. Rolling back Kube Proxy had no effect, but when we rolled back CoreDNS we very quickly saw that messages were being processed and the backlog in our queues started to reduce. We then rolled back CoreDNS in all environments, completing this by approximately 16:46 UTC. It then took a further 15 minutes for the backlog of messages to be cleared. Normal performance resumed at 17:01 UTC. We will be conducting a postmortem of this incident and will share our findings by Monday 18th July.
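For illustration only, a rollback like the one described amounts to repointing the CoreDNS Deployment at its previous image. A minimal sketch using the official Kubernetes Python client, assuming a standard EKS `coredns` Deployment in `kube-system`; the image reference shown is illustrative, not the exact image we run:

```python
from kubernetes import client, config

# Illustrative only: inspect the CoreDNS Deployment, then roll its image back.
config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

deployment = apps.read_namespaced_deployment("coredns", "kube-system")
print("current image:", deployment.spec.template.spec.containers[0].image)

# Repoint the container at the previous (hypothetical) 1.8.3 image tag.
rollback_patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "coredns", "image": "example.ecr.aws/eks/coredns:v1.8.3"}
                ]
            }
        }
    }
}
apps.patch_namespaced_deployment("coredns", "kube-system", rollback_patch)
```

In practice a rollback would normally go through the same tooling used to perform the upgrade; the patch above only shows the effect of the change.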
The backlog of work generated by the degraded performance has now been processed. We're continuing to monitor the situation.
We had recently upgraded CoreDNS within our Kubernetes clusters. Although initial signs suggested that CoreDNS was operating normally, we decided to roll back. After rolling back, performance appears to have returned to normal; however, we will continue to monitor the situation.
We are investigating degraded performance in all data centers
Report: "Zoom API disruption"
Last updateCronofy's calls to Zoom's API experienced a heightened number of errors for roughly 40 minutes, starting at around 14:00 UTC. Normal operation has now been restored for around an hour, and our spot checks indicate that conferencing details have eventually been provisioned as expected.
Cronofy's calls to Zoom's API to provision and update conferencing details are encountering more errors than usual. This may result in events being created in people's calendars without conferencing details initially, so that people's time is still reserved. Once we are able to provision conferencing details from Zoom, any affected events will be updated accordingly.
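As an illustration of this degraded mode (not Cronofy's actual implementation), the pattern is to reserve the calendar slot immediately and attach conferencing once the provider recovers. A minimal sketch, assuming hypothetical callables standing in for the calendar and Zoom integrations:

```python
import time

class ConferencingUnavailable(Exception):
    """Raised by the hypothetical Zoom wrapper while its API is erroring."""

def schedule_with_conferencing(create_event, provision_meeting, attach_details,
                               max_attempts=5):
    """Reserve the time immediately, then attach Zoom details when available.

    create_event, provision_meeting and attach_details are hypothetical
    stand-ins for the real calendar and Zoom integrations.
    """
    event_id = create_event()                    # the event exists even if Zoom is down
    for attempt in range(max_attempts):
        try:
            join_url = provision_meeting()       # may raise during the disruption
            attach_details(event_id, join_url)   # update the event with conferencing
            return event_id
        except ConferencingUnavailable:
            time.sleep(min(2 ** attempt, 60))    # back off before the next attempt
    return event_id                              # a later retry can add conferencing
```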
Report: "UK data center reachability"
Last updateAn internal process initiated from our centralized billing system appears to be responsible for rendering our UK data center largely unreachable between 11:04 UTC and 11:06 UTC. Our internal billing-related API was invoked at such a rate that our web servers were starved of resources for handling further requests. We will be reviewing this process and others like it to avoid such things happening in future.
Our UK data center appeared to be briefly unavailable. It has recovered and we are investigating what happened.
Report: "Apple calendar sync issues"
Last updateFrom 16:24 UTC we saw our attempts to communicate with Apple calendars fail almost entirely. This was part of a larger issue with all Apple's services. Apple services started showing signs of recovery from 17:15 UTC. We gradually increased the level of service for Apple calendars from this time onwards, returning to usual levels around 18:00 UTC. The side effects of this incident were more significant than we would like in our US data center where 95% of our Apple calendar connections reside. We will be reviewing this and refining behavior to reduce such side effects for similar incidents in the future.
Communication with Apple calendars appears to have returned to normal operation, albeit with a slightly higher latency than usual. We will continue to monitor but expect the next update to be the resolution of this incident.
We are now polling Apple calendars every 5 minutes once more. Errors are close to, but slightly above, normal levels. We are continuing to monitor.
We have reduced the polling frequency for Apple calendars from every 5 minutes to every 20 minutes, to minimize impact on other services whilst monitoring whether the issue has been resolved. Signs are that we are able to communicate with Apple calendars successfully once again, with errors returning to normal levels. We are continuing to monitor for a while before increasing the polling frequency back to 5 minute intervals.
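The reduced-frequency mode described above can be thought of as a simple error-rate-driven backoff. A minimal sketch, with an illustrative threshold rather than our actual policy:

```python
def apple_poll_interval(recent_error_rate, normal=300, reduced=1200, threshold=0.5):
    """Poll every 5 minutes normally, dropping to every 20 minutes while the
    provider's recent error rate is high. The 50% threshold is illustrative."""
    return reduced if recent_error_rate > threshold else normal

# e.g. during the worst of the incident: apple_poll_interval(0.95) -> 1200 seconds
```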
We have isolated the Apple processing from the rest of the calendar providers and scaled up our infrastructure to maximize total capacity. Apple synchronization is still heavily impacted by this issue but other processing should no longer be affected.
There seem to be widespread issues for anyone trying to reach Apple's servers, not just Cronofy. We're working to minimize the side effects on the synchronization of other calendar services.
We are seeing a high rate of errors when communicating with Apple calendars.
Report: "Issues synchronizing Outlook.com Calendars"
Last updateAll Outlook.com calendars experienced a major loss of functionality for at least 40 hours. During this period we were operating purely from our cache of their schedule.

_As Microsoft's product naming can be confusing: this only affected Outlook.com, Microsoft's more consumer-orientated offering, known over the years as Hotmail and Live.com. Microsoft 365 and on-premise Exchange were unaffected._

## Timeline

On Tuesday 1st March at around 23:00 UTC it appears that Microsoft made a change to their infrastructure which meant all our requests to interact with Outlook.com calendars began to fail.

By Thursday 3rd March at around 15:00 UTC, approximately 40 hours later, we managed to restore service for roughly 90% of Outlook.com calendars.

Without any success from efforts to communicate with Microsoft, on Friday 4th March around 09:30 UTC we decided to take more drastic action to give the remaining 10% of Outlook.com calendar users a route to restoring their service, by implementing a new mechanism for authorizing Cronofy's access to their calendar. This was made available around 16:30 UTC the same day, Friday 4th March. The remaining 10% of Outlook.com calendars received a notification that they needed to reauthorize Cronofy's access to their calendar by Friday 4th March 23:00 UTC.

## Investigation to resolution

By far the most disappointing part of this incident was how long it took us to notice there was an issue. With hindsight, we had received informational severity alerts shortly after 23:00 UTC on Tuesday when the issue started, but this was missed by the team.

For background, at Cronofy we have three levels of alert:

1. Informational
2. Review soon
3. Look now

Informational alerts are delivered to a Slack channel and can cover things not needing any direct attention. This can be from an area we are interested in keeping a further eye on, or early signs of a potential issue. The next level is "review soon"; these go into PagerDuty as a low severity alert that is assigned to an on-call engineer, generally for review the next working day. The highest level is "look now", where an on-call engineer is paged regardless of the time of day to investigate. Often the idea of informational and review soon alerts is to provide more color around the impact of a "look now" alert, which may be triggered by a single metric.

It took until our support team received a couple of support tickets on Thursday morning (UK time) relating to Outlook.com calendars, and flagged them to our engineering team, for us to realize the extent of the problem. This was roughly 36 hours after the start of the issue. Once the extent of the issue had been recognized, this public facing incident was opened.

We quickly identified we were consistently receiving 503 Service Unavailable responses from Microsoft. This response code is usually indicative of a temporary issue on the service provider's side which we just have to wait out. However, as we had been seeing this for over 36 hours at this point, we worked on the assumption there was something under our control that could resolve the issue. Therefore we started experimenting with alterations to our integration that might help, whilst attempting to reach someone at Microsoft who might be able to resolve the underlying issue. Various changes were attempted without success until, reviewing Microsoft's API documentation, we found mention of an optional header we could add to our requests: `x-AnchorMailbox`.
This seemed promising: 503 statuses are often returned by the load balancers or firewalls responsible for routing requests, and headers like `x-AnchorMailbox` help such layers route a request to the correct location more easily. The addition of this header, using the account's email address, brought the sync of a large number of Outlook.com calendars back to life at around 15:00 UTC on Thursday.

We were premature in announcing this had resolved the issue for all Outlook.com calendars; instead it was closer to 90% of them. Further efforts were made to resolve the problem for the remaining 10% of Outlook.com calendars but none bore fruit. We were able to identify that a large majority of the calendars still experiencing issues, though not all, were using a custom domain for their account. Our theory was that, due to the presence of custom domains, we needed to provide the ID of the mailbox in the `x-AnchorMailbox` header, but this ID was not available through any of the endpoints at our disposal with the authentication tokens we held for these users.

At this point we were into the evening for the team and chose to pause our experimentation and regroup in the morning. We were at a crossroads, facing the need for some drastic intervention, and we did not want to take that decision lightly. Therefore, we chose to continue trying to get a resolution from Microsoft overnight before making the call. Our integration for Outlook.com calendars had been unchanged for a long period prior to Tuesday, so we were optimistic something could be reverted on their side to fix the remaining 10% without the need for drastic action on our part.

Come Friday morning, 09:00 UTC, we had not had a resolution from Microsoft and the remaining 10% of Outlook.com calendars were still unable to synchronize their schedules. Therefore we defined and began to execute a contingency plan to replace our authorization mechanism for Outlook.com calendars. This was ready to go around 15:30 UTC, at which point we made the call to move forward with the switch. The change was deployed around 16:00 UTC and enabled at 16:15 UTC. Around 15 minutes later we deployed a further change that would start sending the remaining 10% of Outlook.com calendars that were still experiencing issues through our relinking process. This would give us a fresh set of credentials via the new mechanism, which provided us with the ID of the mailbox rather than just the email address, and we expected this would resolve the issue for these remaining Outlook.com calendars. Roughly 15 minutes later we saw someone from that cohort reconnect their Outlook.com calendar and the synchronization with their calendar become healthy again, validating our theory. We continued to monitor and saw further successes, building our confidence that all people with Outlook.com calendars now had a route to a successful synchronization link, albeit requiring their intervention in some cases. The following morning, after a review of the current status, we closed the incident.

## Opportunities for improvement

By far the most significant problem within this incident was the missing high severity alerting around Outlook.com calendars. This alerting has now been put in place; it was already in place for all the other calendar services we support, but Outlook.com had unfortunately been missed.
A contributing factor to the length of time until we identified there was an incident was the timing of the informational alerts we did receive. Our engineering team is based in the UK and Europe, so by 23:00 UTC no-one is actively working, and the informational alerts posted overnight are skimmed the following morning. This timing and process led to no-one spotting that the Outlook.com informational alerts did not have a corresponding closure message. To this end, we are also looking more holistically at our alerting to avoid such things slipping through the cracks in future. Specifically we are looking at:

1. Refining informational severity alerts that have a tendency to briefly flicker, to reduce noise from alerts where no action is possible, e.g. a side effect of ephemeral network issues where the following retry succeeds.
2. Providing visibility of informational alerts that have been open for a significant period (a sketch of this kind of check follows this report).

Both of these aim to reduce the possibility of similar alerts being missed, by reducing the noise around them and increasing their signal over time. This will mean that unless alerting is entirely absent, which should never be the case, it is much less likely an issue will go unnoticed for anywhere near as long.

We are comfortable that the time from identification to resolution of this incident was reasonable given the nature of the issue. Roughly 90% of Outlook.com calendars were successfully synchronizing within 4 hours of our investigation starting, with the remaining 10% of Outlook.com calendars being given a path to successful synchronization after we quickly turned around a major change the following day. Our deployment pipeline and tooling enabled us to investigate and experiment safely and rapidly towards the eventual solution to this issue.

Whilst we communicated clearly during the incident, we did not meet our internal guidance on how frequently to provide status updates. For example, we should have provided an update by 10:00 UTC on the Friday to make it clear we were still working on the incident, but did not post an update until after 13:00 UTC, nearly 20 hours after the previous update. We will be updating our internal guidance around communication, with a focus on multi-day incidents.

If you have any further questions, please contact us at [support@cronofy.com](mailto:support@cronofy.com).
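For illustration of the second item above, a check of this kind might look like the following. This is a minimal sketch assuming a hypothetical in-memory representation of alerts; it is not Cronofy's actual tooling:

```python
from datetime import datetime, timedelta, timezone

def long_open_informational_alerts(alerts, threshold=timedelta(hours=2)):
    """Return informational alerts with no closure message after `threshold`.

    `alerts` is a hypothetical list of dicts with keys 'name', 'severity',
    'opened_at' (timezone-aware datetime) and 'closed_at' (None while open).
    """
    now = datetime.now(timezone.utc)
    return [
        alert for alert in alerts
        if alert["severity"] == "informational"
        and alert["closed_at"] is None
        and now - alert["opened_at"] > threshold
    ]

# Such a list could then be surfaced periodically, for example posted back to
# the same Slack channel, so an alert opened overnight is not lost in the backlog.
```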
The Outlook.com calendar accounts we could not resolve ourselves have now been asked to relink, and our overall success rate for syncing Outlook.com calendars has returned to normal levels. This incident is now resolved. We will be conducting a postmortem and will share our findings by the end of next week. Please contact support@cronofy.com if you have any questions.
The new version of our Outlook.com calendar authorization process does appear to be allowing users who were still affected by the incident to synchronize again. Roughly half of the cohort we expect to be asked to relink have now been asked. We are continuing to monitor but now expect any person experiencing issues who relinks their Outlook.com account to have their synchronization issues resolved. We expect to close this issue tomorrow after all relink requests have been sent and we have monitored the situation for a while longer. Please contact support@cronofy.com if you have any further questions.
We have released a new version of our Outlook.com calendar authorization process that we believe will allow users still affected by the incident to synchronize again. Unfortunately this will require users to reauthorize Cronofy's access to their Outlook.com calendar, as we have been unable to find a solution that would avoid this for the remaining calendars. The Outlook.com accounts still encountering errors will receive relink emails over the coming hours requesting they reauthorize Cronofy's access to their calendar. This should then resolve their calendar synchronization problems. Only those whose synchronization was not fixed by yesterday's change will be required to relink. We are continuing to monitor the affected users as this process takes place. Please contact support@cronofy.com if you have any further questions at this time.
We’re still working on this incident as a priority. From our observations, we see that from 15:00 UTC yesterday, service would have resumed for most Outlook.com users. We are working on restoring services for the remainder of the affected users. Please do get in touch with us at support@cronofy.com with any questions.
It appears that Microsoft have made a change to the Outlook.com API we are using. To be clear, both Microsoft 365 domains and on-premise Exchange calendars are unaffected by this issue. The earlier fix for Outlook.com involved adding a previously optional header to our requests, and resolved the issue for a little over 90% of Outlook.com accounts (not everyone, as early signs had indicated). Further attempts to work around the issue have so far been unsuccessful. We are attempting to get assistance from Microsoft on the underlying issue. Please contact support@cronofy.com if you have any further questions or have customers that are still affected.
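The header referred to here is the `x-AnchorMailbox` header described in the postmortem above. A minimal sketch of a request carrying it, using the Microsoft Graph calendars endpoint purely for illustration (the exact endpoint and HTTP client we use may differ):

```python
import requests

def list_calendars(access_token, anchor_mailbox):
    """List an account's calendars, hinting the target mailbox via x-AnchorMailbox.

    anchor_mailbox is the account's email address, or the mailbox ID where a
    custom domain means the address alone is not enough to route the request.
    """
    response = requests.get(
        "https://graph.microsoft.com/v1.0/me/calendars",   # illustrative endpoint
        headers={
            "Authorization": f"Bearer {access_token}",
            "x-AnchorMailbox": anchor_mailbox,
        },
        timeout=30,
    )
    response.raise_for_status()      # a 503 here is what this incident looked like
    return response.json()["value"]
```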
We now believe this incident to be resolved. We will continue monitoring to ensure there are no outstanding issues. Please contact support@cronofy.com if you have any further questions at this time.
Upon further investigation, we have identified that a change made by Microsoft has impacted our ability to sync Outlook.com profiles. We are exploring possible workarounds to this issue, whilst liaising with Microsoft Support for further assistance. If you have any further questions at this time please reach out to support@cronofy.com.
We have identified an issue affecting some Outlook.com calendars. Customers may observe this issue as delays or failures when synchronizing with Outlook.com calendars. We are investigating and will update you as we progress.
Report: "US data center issues"
Last updateAWS have closed their incident for the underlying issue with SQS and other services. AWS's SQS service appeared to be unavailable from around 20:47 UTC through to 20:57 UTC in our US data center, hosted in us-east-1. As SQS is our primary messaging queue between parts of the Cronofy platform, many operations will have been severely degraded during this period. We are confident that service has returned to normal as our own metrics have returned to normal levels.
AWS have created an incident relating to the SQS outage in us-east-1 available here: https://health.aws.amazon.com/health/status Our US data center appears to have fully recovered and we are continuing to monitor.
AWS's SQS service appeared to be unavailable from around 20:47 UTC through to 20:57 UTC in our US data center (us-east-1). As SQS is our primary messaging queue between parts of the Cronofy platform, many operations will have been severely degraded during this period. Since SQS has become available again our systems appear to be recovering as expected. We continue to monitor the situation.
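During a window like this, the main defence on the producing side is to retry with backoff so work is delayed rather than lost. A minimal sketch of that pattern using boto3; the queue URL and retry policy are illustrative, not Cronofy's actual configuration:

```python
import time

import boto3
from botocore.exceptions import BotoCoreError, ClientError

def send_with_retry(queue_url, body, max_attempts=6):
    """Publish to SQS, backing off and retrying so a brief outage delays
    work rather than losing it. The retry policy here is illustrative only."""
    sqs = boto3.client("sqs")
    for attempt in range(max_attempts):
        try:
            return sqs.send_message(QueueUrl=queue_url, MessageBody=body)
        except (BotoCoreError, ClientError):
            if attempt == max_attempts - 1:
                raise                            # give up and surface the failure
            time.sleep(min(2 ** attempt, 60))    # wait before the next attempt
```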
We're seeing signs of underlying issues in the US relating to SQS, our core messaging queue. This may have side effects on all operations in the US data center. We are investigating.
Report: "Degraded performance in US data center"
Last updateOn Tuesday, February 22nd 2022 our US data center experienced 95 minutes of degraded performance, between 15:45 and 17:20 UTC. This was caused by the primary PostgreSQL database hitting bandwidth limits and having its performance throttled as a result, caused or exacerbated by PostgreSQL's internal housekeeping working on two of our largest tables at the same time. To our customers this would have surfaced as interactions with the US Cronofy platform, i.e. using the website or API, being much slower than normal. For example, the 99th percentile of API response times is usually around 0.5 seconds and during this incident peaked at around 14 seconds. We have upgraded the underlying instances of this database, broadly doubling capacity and putting us far from the limit we were hitting.

## Timeline

_All times UTC on Tuesday, February 22nd 2022 and approximate for clarity._

**15:45** Our primary database in our US data center started showing signs of some performance degradation.

**16:05** First alert received by the on-call engineer for a potential performance issue. Attempts made to reduce load on the database through interventions such as temporarily disabling some of its background housekeeping processes.

**16:45** Incident opened on our status page informing customers of degraded performance in the US data center.

**17:00** Began provisioning more capacity for the primary database as a fallback plan if efforts continued to be unsuccessful.

**17:10** New capacity available.

**17:15** Failed over to fully take advantage of the new capacity by promoting the larger node to be the writer.

**17:20** Performance had returned to normal levels in the US data center.

**17:45** Decided we could close the incident.

**18:00** Decided to lock in the capacity change and provisioned an additional reader node at the new size.

**18:15** Removed the smaller nodes from the database cluster.

## Actions

Whilst there was not an outage, this felt like a close call for us. This led to three key questions:

* Why had we not foreseen this capacity issue?
* Could the capacity issue have been prevented?
* Why had we not resolved the issue sooner?

### Foreseeing the capacity issue

We had recently performed a major version upgrade on this database, and in the following weeks monitored performance pretty closely. If there was a time we should have spotted a potential issue in the near future, this was it. We believe we may have focused too heavily on CPU and memory metrics in our monitoring, when it was networking capacity that led to this degradation in performance. We will be reviewing our monitoring to set alerts that would have pointed us in the right direction sooner, and also lower priority alerts that would flag an upcoming capacity issue days or weeks in advance.

### Preventing the capacity issue

As PostgreSQL's internal housekeeping processes appeared to contribute significantly to the problem, we will be revisiting the configuration of these processes and seeing if they can be altered to reduce the likelihood of such an impact in future.

### Resolving the issue sooner

As this was a performance degradation rather than an outage, the scale of the problem was not clear. This led to the on-call engineer investigating the issue whilst performance degraded further, without additional alerts being raised. We will be adding alerts relating to performance degradation in several subsystems to make the impact of a problem clearer to an on-call engineer.

We are also updating our guidance on incident handling for the team, encouraging a switch to a more visible channel for communication sooner, and encouraging the escalation of alerts to involve other on-call engineers, particularly when the cause is not immediately clear.

## Further questions?

If you have any further questions, please contact us at [support@cronofy.com](mailto:support@cronofy.com)
Around 15:45 UTC our primary database in our US data center started showing signs of some performance degradation. We first received an alert at around 16:05 UTC as this problem grew more significant. We made attempts to reduce load on the database through interventions such as temporarily disabling some of its background housekeeping processes; often giving a database such breathing room will allow it to recover by itself. Around 16:45 UTC it appeared our efforts were not bearing fruit, and as the performance of our US data center was degraded from normal levels we opened an incident to make it clear we were aware of the situation. Around 17:00 UTC we decided to provision more capacity for the cluster in case it was necessary; this took around 10 minutes to come online. Whilst that was provisioning, we temporarily reduced the capacity of background workers to see if that would clear the problem by reducing the load. This was unsuccessful, so around 17:15 UTC we decided to fail over to the new cluster capacity; after 5 minutes this had warmed up and performance had returned to normal levels. There was a brief spike in errors from the US data center as a side effect of the failover, but otherwise the service was available throughout, albeit with degraded performance. We will be conducting a postmortem of this incident and will share our findings by the end of the week.
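The housekeeping processes referred to here are PostgreSQL's autovacuum workers. For illustration only, per-table autovacuum behaviour can be inspected and tuned along these lines; the table name and parameter values are hypothetical examples, not our actual settings. A minimal sketch assuming psycopg2:

```python
import psycopg2

# Illustrative sketch: see what autovacuum is currently working on, then throttle
# it on a hypothetical very large table so two heavy vacuums are less likely to
# saturate bandwidth at the same time.
conn = psycopg2.connect("dbname=app")
conn.autocommit = True

with conn.cursor() as cur:
    # Which tables are currently being vacuumed, and what phase they are in.
    cur.execute("SELECT relid::regclass, phase FROM pg_stat_progress_vacuum;")
    for table, phase in cur.fetchall():
        print(table, phase)

    # Per-table storage parameters: slow autovacuum I/O, and vacuum a little
    # more eagerly so each run has less work to do. Values are examples only.
    cur.execute("""
        ALTER TABLE events SET (
            autovacuum_vacuum_cost_delay = 20,
            autovacuum_vacuum_scale_factor = 0.02
        );
    """)

conn.close()
```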
Our primary database is the source of the degraded performance. We have provisioned additional capacity to the cluster and failed over to make a new, larger node the primary one. Early signs are positive and we are monitoring the service.
We are investigating degraded performance in our US data center.
Report: "Elevated errors from Google Calendar"
Last updateAt approximately 17:00 UTC we observed a much higher number of errors for Google calendar API calls than we would expect (mostly 503 Service Unavailable responses) across all of our data centers. There does not appear to have been a pattern to the accounts affected by this. We decided to open an incident about this at 17:10 UTC to inform customers of potential service degradation, as it seemed like it could be a more persistent issue. Whilst we were opening this incident, errors when communicating with the Google calendar API returned to normal levels, at around 17:12 UTC. Errors have remained at normal levels since that time and so we are resolving this incident.
Errors returned to usual levels at around 17:12 UTC, as the previous message was being sent. We are monitoring the situation.
Since approximately 17:00 UTC we have seen a higher level of errors when communicating with Google calendars than we would normally expect across all of our data centers. We are monitoring the situation and taking any actions available to us to minimize the impact. Synchronization performance for Google calendars will be affected by this.
Report: "Users unable to log in to Scheduler"
Last updateOur Engineering team has resolved the Scheduler issue, and users can now log in again. Please get in touch with support@cronofy.com if you have any further questions.
We are aware of an issue with the Scheduler, which is stopping users from logging in. Our Engineering team are investigating and aim to have a fix in place shortly.
Report: "Slow developer dashboard & Scheduler page loads"
Last updateWe have monitored the issue experienced by the third party and it is now resolved. Please contact support@cronofy.com if you have any further queries.
We detected an issue with a third party Javascript dependency’s CDN, which led to slow page loads on the Developer Dashboard and Scheduler. We have removed this dependency, which has resolved the issue. The Cronofy API was unaffected by this incident.
Report: "Status on Log4j vulnerability from Cronofy"
Last updateOn Friday, December 10th, a critical remote code execution vulnerability (CVE-2021-44228), also known as Log4Shell, was discovered, which affects Apache Log4j versions 2.0-2.14.1. Log4j is a popular logging library in Java and is used in several enterprise applications. Our platform isn't written in Java and therefore isn't vulnerable. We do run some Java-based tools, such as Jenkins, and have verified they are not vulnerable (as well as not being generally accessible from the internet), and we will update managed services as updates relating to this vulnerability are made available (for example, an update to AWS Elasticsearch was applied yesterday). Our team will continue to monitor our platforms and sub-processors, and we'll let you know if there are any developments in relation to this vulnerability. At this point in time, no action is required by Cronofy customers. Please contact support@cronofy.com if you have any further questions.
Report: "Google Calendar sync errors"
Last updateGoogle have opened an incident for this: https://www.google.com/appsstatus/dashboard/incidents/zURR7mGQjom4ktGZcR5A Google's incident states it began at 08:40 UTC and ended at 10:20 UTC. This correlates with what we have observed as errors have returned to normal levels since that time. This incident is now closed.
An outage of Google Calendar has been widely reported, but is not yet corroborated by Google's status page. Error levels appear to be reducing from their peak at around 09:20 UTC but remain higher than normal.
Error rates have increased significantly over the past 15 minutes or so. We are continuing to monitor and investigate.
We are seeing an elevated level of errors when communicating with Google calendars across all data centers.
Report: "Elevated Google connectivity errors in German data center"
Last updateThe communication with Google Calendars has remained stable and Google have resolved their incident.
Our monitoring shows errors have returned to normal levels over the past 30 minutes. We are continuing to monitor but it is likely the incident is effectively over.
Google have opened an incident on their status page relating to Google Calendar: https://www.google.com/appsstatus/dashboard/incidents/rqrJqYGT8qtu75wxRfWq
There continue to be elevated errors communicating with Google, particularly from our German data center. This leads us to believe it is likely related to calendars hosted by Google within the EU. We're seeing a failure rate in the region of 10-20%, so we believe operations will be succeeding reasonably quickly on retry. We've made some configuration changes to do what we can to reduce the impact on our systems as a whole, and continue to monitor.
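To illustrate why a 10-20% failure rate still means operations succeed reasonably quickly: assuming failures are roughly independent between attempts (an illustrative assumption), the chance an operation has still not succeeded after n attempts is the failure rate raised to the power n.

```python
# Probability an operation has still not succeeded after n attempts, assuming
# an independent per-attempt failure rate p (illustrative assumption only).
for p in (0.10, 0.20):
    for n in (1, 2, 3):
        print(f"failure rate {p:.0%}, {n} attempt(s): still failing {p ** n:.1%}")
```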
We are seeing higher than normal errors when communicating with Google. Operations appear to be succeeding eventually but performance may be reduced.
Report: "US data center performance degradation"
Last updateAWS announced their incident was mostly resolved and our platform has been stable for over an hour. This incident has been resolved.
AWS have executed a mitigation which appears to have resolved the vast majority of the problems we have been seeing. Service within our US data center has been at or near normal levels for around an hour across our platform. We continue to monitor the situation.
AWS are experiencing issues in us-east-1, where our US data center is hosted. We are seeing this affect our internal process start times and our ability to scale; these are the points at which we interact most heavily with AWS's APIs, which is where Amazon are reporting issues. In general, systems are operating well. The exception is Apple calendars and ICS feed synchronization, where we have to poll for changes; these are being impacted by the internal process start times and so their performance is degraded. We are monitoring the situation and doing what we can to work around the underlying problem.