Historical record of incidents for ServiceChannel
Report: "ServiceChannel System Performance Degradation"
Last update: We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel Regularly Scheduled Maintenance"
Last update: The scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
On Saturday, May 24th 2025, beginning at 2 AM EST, we will perform regularly scheduled maintenance on the ServiceChannel system. During this window you may experience sporadic interruptions, including to the IVR system. We expect the entire process to last less than 1 hour, but it may take up to 4 hours.
Report: "Code Release Causes US Environment Outage"
Last update: **Incident Report: Code Release Causes US Environment Outage** **Date of Incident:** 05/08/2025 **Time/Date Incident Started:** 05/08/2025, 2:29 am EDT **Time/Date Stability Restored:** 05/08/2025, 4:07 am EDT **Time/Date Incident Resolved:** 05/08/2025, 4:12 am EDT **Users Impacted:** Many **Frequency:** Continuous **Impact:** Major **Incident description:** During the scheduled US production code release on May 8, 2025, ServiceChannel encountered technical issues that impacted service availability on our platform. Users experienced login difficulties from 2:29 AM to 3:07 AM EDT, while critical dashboard functionality was unavailable from 2:29 AM to 4:12 AM EDT. **Root Cause Analysis:** **Login Module Issue:** As part of ongoing deployment process enhancements, a configuration adjustment was made that worked correctly in our testing environments but behaved differently in production. The issue was identified and resolved through our standard troubleshooting procedures. **Dashboard Issue:** A configuration setting that was correctly defined in our development environments had not been fully synchronized to the production environment. This discrepancy wasn't detected until the new code attempted to access the setting during the deployment. Full platform functionality was confirmed restored by 4:12 AM EDT. **Actions Taken:** * The SRE team immediately investigated upon receiving alerts starting at 2:29 AM EDT indicating issues with two critical systems: dashboard and login. * The CICD team successfully rolled back the login module to the prior version, restoring user access by 3:07 AM EDT. * The dashboard continued to experience issues, so investigation continued while login was restored. * Dashboard functionality was restored by ensuring all required configuration settings were properly applied to production. **Mitigation Measures:** * Reviewed existing deployment procedures to include improved configuration validation and improved rollback protocols to prevent similar configuration-related issues in the future. * Implemented process improvements for immediate communication with support teams following any service disruptions to ensure proper customer follow-up and transparency.
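The dashboard portion of this incident came down to a setting that existed in development but had not been synchronized to production, and the mitigation above calls for improved configuration validation. As a rough illustration only (the setting names, file layout, and loader below are assumptions, not ServiceChannel's actual deployment tooling), a pre-deployment check of this kind might look like:

```python
import json
import sys

# Example setting names only; the real keys are not named in the report.
REQUIRED_SETTINGS = ["Dashboard:WidgetCacheConnection", "Login:SessionStoreConnection"]

def load_environment_settings(environment: str) -> dict:
    # Placeholder source: one JSON settings file per environment,
    # e.g. production.settings.json exported from the config store.
    with open(f"{environment}.settings.json") as fh:
        return json.load(fh)

def missing_settings(environment: str) -> list:
    settings = load_environment_settings(environment)
    # Flag any required key that is absent or empty in this environment.
    return [key for key in REQUIRED_SETTINGS if not settings.get(key)]

if __name__ == "__main__":
    missing = missing_settings("production")
    if missing:
        print("Blocking release; settings missing in production:", missing)
        sys.exit(1)
    print("All required settings present; release can proceed.")
```

Run as a release gate, a check like this would surface a missing production setting before the new code tries to read it.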
During the scheduled US production code release on May 8, 2025, ServiceChannel encountered technical issues that impacted service availability on our platform. Users experienced login difficulties from 2:29 AM to 3:07 AM EDT, while critical dashboard functionality was unavailable from 2:29 AM to 4:12 AM EDT.
Report: "System Outage"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the issue; the IVR should now be working.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We're seeing that only the IVR is impacted by this Twilio outage. The workaround is to use the mobile app to check in/out. We're monitoring the situation closely.
The ServiceChannel SRE team is currently investigating an issue affecting the IVR. We will provide an ETA shortly. Thank you for your patience.
Report: "Service channel System Performance Degradation"
Last update: This incident has been resolved. All services are working as expected.
We have restored functionality to the impacted services and are monitoring the results to ensure there are no further issues.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel System Performance Degradation"
Last update: **CrowdStrike Incident Report** **Date of Incident:** 07/19/2024 **Time/Date Incident Started:** 07/19/2024, 01:10 am EDT **Time/Date Stability Restored:** 07/19/2024, 05:47 am EDT **Time/Date Incident Resolved:** 07/19/2024, 10:00 am EDT **Users Impacted:** All users **Frequency:** Continuous **Impact:** Major **Incident description:** On 7/19/2024 at 1:10 AM, the ServiceChannel Database Administration (DBA) and Site Reliability Engineering (SRE) teams received alerts from step-based test monitors that multiple ServiceChannel systems were failing their checks. Once alerted, the DBA and SRE teams immediately began investigating the issue's cause. **Root Cause Analysis:** A global outage caused by CrowdStrike, a third-party vendor providing a security Endpoint Detection and Response (EDR) platform, temporarily impacted the performance of ServiceChannel SaaS applications. There was no security impact, as this was a third-party software component that caused the degradation of our services. **Actions Taken:** 1. The DBA and SRE teams, in coordination with ServiceChannel Leadership, activated our business continuity and disaster recovery process, allowing business-critical systems to continue operating. 2. Analysis of the problem determined there was an issue with the CrowdStrike EDR platform, which ServiceChannel uses for detection of cybersecurity events. 3. Upon further investigation, the SRE team identified a mitigation strategy for each affected asset: 1. Take a snapshot of the boot drive for the affected asset. 2. Detach the impacted boot drive from the affected asset. 3. Attach each impacted boot drive to a recovery workstation. 4. Remove the corrupted CrowdStrike update file. 5. Reattach the boot drive to the asset. 6. Restart and monitor for successful return to service. 4. The ServiceChannel SRE team applied the mitigation across all affected assets. **Mitigation Measures:** 1. Work with CrowdStrike to implement any CrowdStrike EDR-related availability remediation advice. 2. Investigate additional technologies, techniques, and capabilities to improve our DR solution to reduce recovery times of secondary systems.
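Step 4 of the per-asset workaround above (removing the corrupted update file) matches CrowdStrike's publicly documented remediation of deleting the faulty channel file from the CrowdStrike driver directory. The sketch below illustrates only that step, run against a boot volume already attached to a recovery workstation; the drive letter is an assumption and the snippet is not ServiceChannel's actual tooling.

```python
import glob
import os

# The boot volume from the affected asset, attached to the recovery
# workstation as drive E: (the drive letter is an assumption).
CROWDSTRIKE_DIR = r"E:\Windows\System32\drivers\CrowdStrike"

# File pattern from CrowdStrike's public remediation guidance for 7/19/2024.
for path in glob.glob(os.path.join(CROWDSTRIKE_DIR, "C-00000291*.sys")):
    print("Removing corrupted channel file:", path)
    os.remove(path)
```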
This incident has been resolved. All services are working as expected.
We have restored functionality to the impacted services and are monitoring the results to ensure there are no further issues.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "Supply Manager - Temporary Disruption"
Last update: This incident has been resolved. All services are working as expected.
A fix has been implemented by our third-party provider. We are monitoring the results to ensure there are no further issues.
We are currently dealing with a temporary disruption in our third-party Supply Manager service. At the moment, you may notice that the Supply Manager is not functioning as it should. The issue has been reported to their team, and they are actively investigating to pinpoint the cause.
Report: "ServiceChannel System Performance Degradation"
Last update: **Dashboard Latency - Incident Report** **Date of Incident:** 5/16/2024 **Time/Date Incident Started:** 5/16/2024, 10:18 am EDT **Time/Date Stability Restored:** 5/16/2024, 11:37 am EDT **Time/Date Incident Resolved:** 5/16/2024, 12:00 pm EDT **Users Impacted:** Some Users **Frequency:** Intermittent **Impact:** Minor **Incident description:** Some ServiceChannel users who were using the dashboard experienced slow loading times. **Root Cause Analysis:** Around 10:18 AM EDT, the ServiceChannel Site Reliability Engineering (SRE) team was alerted to slow response times on the Dashboard, affecting customer experience. The team quickly looked into the matter and identified that one of the ServiceClick application pools was exhibiting unusually high response times. Initial attempts to rectify the issue by restarting individual nodes did not resolve the problem. Further investigation led to the decision to reboot the entire application pool for ServiceClick. This measure effectively reduced response times and returned our services to their standard operational state. **Actions Taken:** 1. Manually tested and reproduced the issue. 2. Researched, then restarted affected nodes that were reporting slow responses. 3. Retested to confirm the problem had been resolved and continued to monitor. **Mitigation Measures:** 1. Increased monitoring of our Dashboard and the ServiceClick application response times.
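The remediation above was an application pool recycle. Assuming the ServiceClick pools run under IIS (the report does not say which web server is used), a scripted restart might look roughly like the following; the pool name and PowerShell invocation are illustrative.

```python
import subprocess

def restart_app_pool(pool_name: str) -> None:
    # Invokes the WebAdministration module's Restart-WebAppPool cmdlet;
    # assumes the pool runs under IIS on a Windows host.
    subprocess.run(
        [
            "powershell", "-NoProfile", "-Command",
            f"Import-Module WebAdministration; Restart-WebAppPool -Name '{pool_name}'",
        ],
        check=True,
    )

if __name__ == "__main__":
    restart_app_pool("ServiceClick")  # pool name is illustrative
```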
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel - Fixxbook Provider Profile Viewing"
Last update: **Compliance Manager Outage Incident Report** **Date of Incident:** 5/8/2024 **Time/Date Incident Started:** 5/8/2024, 12:01 pm EDT **Time/Date Stability Restored:** 5/8/2024, 1:30 pm EDT **Time/Date Incident Resolved:** 5/8/2024, 1:41 pm EDT **Users Impacted:** Users utilizing the Fixxbook Compliance Manager **Frequency:** Duration of the incident event **Impact:** Major **Incident description:** At 12:00 pm EDT on May 8th, the Site Reliability Engineering (SRE) team responded to increased error rates through alerts, followed by user reports of malfunctions within the ServiceChannel Fixxbook service. This issue affected Fixxbook users, particularly those using features of the Compliance Manager, leading to a less than ideal user experience. **Root Cause Analysis:** After an in-depth analysis, the Site Reliability Engineering (SRE) team pinpointed the errors and discovered that an incorrectly set DNS record was disrupting the functionality of the Fixxbook application. The issue was promptly resolved, and the application's performance was restored to full capacity. The SRE team maintained vigilant oversight to confirm the application's sustained stability. **Actions Taken:** 1. Manually tested our services to replicate the issue. 2. Investigated system-generated alerts and identified affected platform functionality. 3. Updated the DNS entry to restore functionality. **Mitigation Measures:** 1. Introduced enhanced alerting mechanisms designed to detect issues within the Fixxbook application components more rapidly and precisely.
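Since the root cause was an incorrectly set DNS record, one of the simplest "enhanced alerting" checks is a periodic resolution probe. The sketch below is a generic example with a placeholder hostname; it is not ServiceChannel's actual monitoring.

```python
import socket
import sys

HOSTNAME = "fixxbook.example.com"  # placeholder; not the actual record name

try:
    addresses = sorted({info[4][0] for info in socket.getaddrinfo(HOSTNAME, 443)})
    print(f"{HOSTNAME} resolves to {addresses}")
except socket.gaierror as exc:
    # In a real monitor this branch would page the on-call engineer.
    print(f"ALERT: {HOSTNAME} failed to resolve: {exc}")
    sys.exit(1)
```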
This incident has been resolved. All services are working as expected.
A fix has been implemented and you should be able to view provider profiles on Fixxbook now. We will continue to monitor to ensure stability going forward.
We are actively investigating an issue with errors occurring while viewing provider profiles on Fixxbook. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel Work Order Report Downloads"
Last update: **Increased Platform Latency and Work Order Reports Unresponsive - Incident Report** **Date of Incident:** 04/01/2024 **Time/Date Incident Started:** 04/01/2024, 10:34 am EST **Time/Date Stability Restored:** 04/01/2024, 12:45 pm EST **Time/Date Incident Resolved:** 04/01/2024, 1:05 pm EST **Users Impacted:** All Users **Frequency:** Intermittent **Impact:** Major **Incident description:** Users experienced sporadic latency and timeout issues while engaging with the ServiceChannel Platform, particularly for work order report services. **Root Cause Analysis:** The automated monitoring systems of the ServiceChannel SRE and DBA teams detected elevated CPU utilization on database read replicas. A subsequent investigation into the logs identified that the incident coincided with a spike in user traffic. This surge in activity caused extended wait times for certain ServiceChannel services, notably the Excel report services, leading to slower page loads and timeouts. The SRE team swiftly acted by scaling up our infrastructure resources to accommodate the increased traffic. Following the expansion of capacity, normal system operations resumed. **Actions Taken:** 1. Manually tested our services to replicate the issue. 2. Isolated the performance degradation to report queues and related database services. 3. Enhanced the capacity of affected services to manage the load and restore full functionality. **Mitigation Measures:** 1. Expansion of database resources to more effectively manage reporting queues. 2. Implementation of refined monitoring systems for better oversight of reporting queues.
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
We are actively investigating an issue with Work Order Report Downloads. An update will be provided shortly. Thank you for your patience.
Report: "SFTP Unavailable"
Last update: **Incident Report: SFTP Service Disruption** **Date of Incident:** 12/12/2023 **Time/Date Incident Started:** 12/11/2023, 05:43 pm EST **Time/Date Stability Restored:** 12/12/2023, 01:24 pm EST **Time/Date Incident Resolved:** 12/12/2023, 01:54 pm EST **Users Impacted:** Few **Frequency:** Continuous **Impact:** Major **Incident description:** On December 11th at 5:43 pm EST, an unexpected disruption occurred in the Production ServiceChannel SFTP service. By the morning of December 12th, 2023, the ServiceChannel Support team began to receive customer reports of timeout errors when attempting to connect to the ServiceChannel SFTP server. **Root Cause Analysis:** A comprehensive investigation by the Site Reliability Engineering (SRE) team revealed no resource contention issues with the affected server instance. Nevertheless, to preemptively avoid any hardware bottleneck issues, the SRE team performed a scale-up of the server instance to the next larger instance size. Despite this effort, tests indicated ongoing issues with external connections to port 22, while all internal network tests were successful. The SRE team shifted their efforts to pinpoint potential network irregularities and found that the security policy governing the SFTP server had been altered to exclude access to port 22. Upon further investigation with the Security team, we determined that this change was part of a broad initiative to harden our platform's security posture. Regrettably, this policy update was executed without the normal change management process, and the broader engineering organization was not notified in advance. This network modification was subsequently reversed, and SFTP functionality was restored. **Actions Taken:** 1. The SRE team inspected the SFTP server and confirmed it was operating within defined parameters. The team also proactively scaled up the infrastructure to address the possibility of any system bottlenecks. 2. The SRE team identified a suspected change in the security policy, wherein port 22 access was removed for all but private network address spaces. System event logs confirmed that this change was implemented by the Security team. Upon identifying the issue, the Security team was informed, and an emergency rollback was requested. **Mitigation Measures:** In light of this incident, the following preventative measures have been put in place: 1. Improvements to internal communications, including ensuring that all network changes are announced to and approved by the wider engineering organization prior to their implementation. 2. Ensuring that, going forward, infrastructure changes to the ServiceChannel Platform will be made by the SRE team using the normal Infrastructure as Code process. 3. Additional monitoring of the SFTP infrastructure, using both network ping tests and end-to-end synthetic transaction tests, has been implemented to test from both internal and external network paths.
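Mitigation item 3 describes end-to-end checks from internal and external network paths. A minimal external reachability probe for the SFTP port could look like the sketch below; the hostname is a placeholder and this is not ServiceChannel's actual monitor.

```python
import socket
import sys

HOST, PORT, TIMEOUT_SECONDS = "sftp.example.com", 22, 5  # placeholder endpoint

try:
    # Run from an external network path to mirror what customers see.
    with socket.create_connection((HOST, PORT), timeout=TIMEOUT_SECONDS):
        print(f"{HOST}:{PORT} is reachable")
except OSError as exc:
    print(f"ALERT: {HOST}:{PORT} is unreachable: {exc}")
    sys.exit(1)
```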
This incident has been resolved.
A fix has been implemented. We are monitoring the results.
ServiceChannel is currently investigating an issue that prevents users from connecting to our SFTP servers from the internet. We are working to restore service as soon as possible. Thank you for your patience.
Report: "ServiceChannel System Performance Degradation"
Last update: **Incident Report: Infrastructure/Hardware Instability** **Date of Incident:** 09/08/2023 **Time/Date Incident Started:** 09/08/2023, 04:18 pm EDT **Time/Date Stability Restored:** 09/08/2023, 05:08 pm EDT **Time/Date Incident Resolved:** 09/08/2023, 05:15 pm EDT **Users Impacted:** All **Frequency:** Intermittent **Impact:** Major **Incident description:** On September 8th at 04:18 pm EDT, the Site Reliability Engineering (SRE) team received an alert regarding "SQL timeout errors" and subsequent reports of dashboard slowness. This slowness had a significant impact on a large number of users, resulting in a suboptimal experience. **Root Cause Analysis:** Upon conducting a thorough investigation, the Database Administration (DBA) team identified a series of database requests that were causing blocks and imposing a high CPU load on the database replica servers. This, in turn, led to an increased number of "resource waits." As a remediation measure, the DBA team initiated a restart of the SQL service on both database replica servers. Following the successful restart of the SQL service, the system's stability was closely monitored and subsequently restored. **Actions Taken:** 1. Investigated system-generated alerts and identified affected platform functionality. 2. The DBA team initiated a SQL service restart on the database replica servers. **Mitigation Measures:** In response to this incident, the following mitigation measures have been implemented: 1. Ongoing Investigation: The team is continuing to investigate the root causes of the high CPU usage and blockages on the database servers. 2. Database Query Performance Improvements: Efforts are being made to enhance the performance of database queries to ensure the overall stability of the platform.
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel System Performance Degradation"
Last update: **Infrastructure/Hardware Instability - Incident Report** **Date of Incident:** 09/05/2023 **Time/Date Incident Started:** 09/05/2023, 09:15 am EDT **Time/Date Stability Restored:** 09/05/2023, 10:19 am EDT **Time/Date Incident Resolved:** 09/05/2023, 10:25 am EDT **Users Impacted:** All **Frequency:** Intermittent **Impact:** Major **Incident description:** Third-party vendor infrastructure/hardware instability. **Root Cause Analysis:** A third-party vendor infrastructure issue affected performance and system availability for the underlying data storage layer that services platform resources. **Actions Taken:** 1. Investigated system-generated alerts and identified affected platform functionality. 2. SRE and DBA teams initiated a platform infrastructure redeployment, forcing the new infrastructure to be spun up on unaffected infrastructure/hardware. **Mitigation Measures:** 1. Continue the ongoing investigation into root causes of infrastructure issues within our cloud hosting provider. 2. Continue to implement high availability improvements to prepare the platform to respond better to unexpected hardware issues that are beyond our control.
This incident has been resolved. All services are working as expected.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel System Performance Degradation"
Last update: **Infrastructure/Hardware Instability - Incident Report** **Date of Incident:** 08/31/2023 **Time/Date Incident Started:** 08/31/2023, 02:15 pm EDT **Time/Date Stability Restored:** 08/31/2023, 02:47 pm EDT **Time/Date Incident Resolved:** 08/31/2023, 02:50 pm EDT **Users Impacted:** All **Frequency:** Intermittent **Impact:** Major **Incident description:** On August 31st at 02:15 pm EDT, the ServiceChannel Site Reliability Engineering (SRE) team received a large number of SQL timeout errors, followed by reports of dashboard slowness. **Root Cause Analysis:** The Database Administration (DBA) team discovered a growing queue of active database queries and increasing resource waits, resulting from functionality that was causing database blocks and high CPU load on the database cluster. **Actions Taken:** 1. Investigated system-generated alerts and identified affected platform functionality. 2. Recompiled the affected stored procedures and dropped all blocking connections to return the database cluster to the steady state. 3. Compiled incident findings for future remediation by the Application Engineering and SRE teams. **Mitigation Measures:** 1. Coordinate with the Application Engineering team to identify and remediate the root causes of the high database CPU and blocks. 2. Identify and implement general performance improvements for database queries to increase overall platform stability. 3. Implement infrastructural modifications to distribute database I/O across additional read replicas.
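For context on the "dropped all blocking connections" step, blocked sessions and their blockers can be listed directly from SQL Server's dynamic management views. The sketch below shows one way to run that check; the connection string is a placeholder and this is not ServiceChannel's actual tooling.

```python
import pyodbc

# Placeholder DSN; point at the database server under investigation.
CONNECTION_STRING = "DSN=ReadReplica;Trusted_Connection=yes"

# Sessions currently blocked, with their blockers, from SQL Server DMVs.
QUERY = """
SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_time, r.command
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;
"""

with pyodbc.connect(CONNECTION_STRING) as conn:
    for row in conn.cursor().execute(QUERY):
        print(f"session {row.session_id} blocked by {row.blocking_session_id} "
              f"({row.wait_type}, {row.wait_time} ms) running {row.command}")
```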
This incident has been resolved. All services are working as expected.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "US Production App Rollback Incident Report"
Last update: **US Production App Rollback Incident Report** **Date of Incident:** 08/09/2023 **Time/Date Incident Started:** 08/09/2023, 10:00 pm EDT **Time/Date Stability Restored:** 08/10/2023, 12:00 am EDT **Time/Date Incident Resolved:** 08/10/2023, 12:00 am EDT **Users Impacted:** All **Frequency:** Continuous **Impact:** Major **Incident description:** On 8/9/23, the production release of the US application code was rolled back after smoke testing and synthetic monitors detected errors on the ServiceChannel platform. **Root Cause Analysis:** Upon investigation, it was determined that the cause of the issue could be traced back to a recent update to the platform session cookie. This update resulted in a malfunction of the Component module because the module specified an incorrect Redis store for session data. **Actions Taken:** 1. In response to the incident, the team promptly executed a rollback of the application services code to the previous functional version. After the rollback, the stability of the web platform was confirmed through both smoke testing and synthetic monitors. 2. To address the underlying problem, the Redis connection strings for the component modules were updated. The US Production release was re-deployed on 8/10/23 at 10 PM EDT with the correct configuration applied. **Mitigation Measures:** To prevent similar incidents in the future, the following mitigation measures will be implemented: 1. Ensuring Environment Consistency: A concerted effort will be made to better align production and non-production configurations. 2. Governance of Production Changes: To maintain greater control over potentially disruptive production changes, any change that, due to scale considerations, can only be applied to the Production environment will require explicit approval from senior management before implementation. 3. Monitoring Production-Only Variables: We will implement automated monitoring to regularly check for the presence of "Production Only" configuration values. This practice will provide an additional layer of oversight and help prevent inadvertent changes.
The production release of the US application code was rolled back following smoke testing and synthetic monitors that detected errors on the ServiceChannel platform.
Report: "ServiceChannel Performance Degradation"
Last update: **Incident Report: Secondary Read Replica Unavailability and Application Degradation** **Date of Incident:** 08/04/2023 **Time/Date Incident Started:** 08/04/2023, 6:51 AM EDT **Time/Date Stability Restored:** 08/04/2023, 10:00 AM EDT **Time/Date Incident Resolved:** 08/04/2023, 10:45 AM EDT **Users Impacted:** All users **Frequency:** Sustained **Impact:** Major **Incident description:** On August 4th at 6:51 am EDT, the secondary read replica became unavailable. This led to an increased load on the database system, resulting in intermittent slowness that adversely affected a large number of users. The degraded application experience triggered an immediate investigation and response. **Root Cause Analysis:** The ServiceChannel SRE (Site Reliability Engineering) and DBA (Database Administration) teams responded to an automated alert triggered by an unhealthy state in the AG replication. The DevOps team reviewed all logs associated with August 4th within the AG replication timeframe and identified a configuration modification of the system firewalls that coincided with a triggered restart of the database system. The SRE team pinpointed this change within our configuration management systems, which had inadvertently pushed through a firewall policy modification. The modified database firewall settings obstructed traffic flow to the replica servers, initiating the incident. **Actions Taken:** 1. Immediate Alert Response: The DBA team reviewed and promptly acknowledged the monitoring alerts associated with the impacted segment of the application, ensuring the issue was quickly recognized and addressed. 2. Redeployment and Restart: To restore system stability, the DBA team redeployed and restarted both the primary and secondary database replicas. 3. Persistent Challenges: Despite these initial actions, the system performance and availability concerns persisted, requiring a deeper investigation into the underlying factors. 4. Configuration Management Insights: A comprehensive analysis of our configuration management system logs revealed the unexpected enablement of system firewalls, which had previously gone unnoticed. 5. Rapid Firewall Disablement: With this understanding, the team promptly disabled the system firewalls that were impeding traffic flow, allowing the system to return to its intended state and resolving the incident. **Mitigation Measures:** In light of this incident, several proactive steps have been taken to mitigate the risk of similar occurrences: 1. Enhanced Monitoring: A robust monitoring system will be implemented to track data-enabled functionality changes (functionality feature switches). This enhanced monitoring will promptly detect anomalies and potential performance issues, allowing for swift intervention.
2. Playbook Updates: The DBA and DevOps teams' troubleshooting playbook will be updated to incorporate the lessons learned from this incident. These revisions will streamline response procedures and ensure quicker, more effective resolution. 3. Code Review Process: The code review process has been revamped to include an assessment of dependencies in any configuration change, mitigating unforeseen interactions and potential disruptions. 4. Conditional Logic Refinement: The SRE team has improved the conditional logic governing firewall settings, ensuring that they are enabled only when explicitly defined (illustrated in the sketch below). This refinement adds an additional layer of control and security. 5. Continuous Enhancement: The ongoing development of tests and alerting systems will remain a top priority, further enhancing our ability to detect and respond to data and configuration changes.
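Item 4 above boils down to a default-deny rule in the configuration data: the host firewall policy is applied only when the flag is explicitly set. A tiny, purely illustrative sketch (the key name is hypothetical):

```python
def should_enable_host_firewall(node_config: dict) -> bool:
    # Apply the firewall policy only when the flag is explicitly set to True;
    # an absent or falsy value means "leave the node untouched".
    return node_config.get("enable_host_firewall") is True

print(should_enable_host_firewall({}))                               # False
print(should_enable_host_firewall({"enable_host_firewall": True}))   # True
```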
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel System Performance Degradation"
Last update: **Date of Incident:** 07/10/2023 **Time/Date Incident Started:** 07/10/2023, 1:36 PM EDT **Time/Date Stability Restored:** 07/10/2023, 2:27 PM EDT **Time/Date Incident Resolved:** 07/10/2023, 2:53 PM EDT **Users Impacted:** All users **Frequency:** Sustained **Impact:** Major **Incident description:** On July 10th at approximately 1:36 pm EDT, customers encountered significant slowness after logging into the platform. The slowness impacted a large number of users, leading to a suboptimal experience. **Root Cause Analysis:** The ServiceChannel SRE (Site Reliability Engineering) and DBA (Database Administration) teams responded to an automated alert triggered by high CPU usage on database read replicas. Upon investigation, the DBA team identified a new module and functionality that was executing excessively long queries against the read replicas. This new module had recently been enabled for internal vendor logins. **Actions Taken:** 1. The SRE and DBA teams promptly reviewed and acknowledged monitoring alerts related to the affected part of the application. 2. The DBA and engineering teams collaborated to identify the root cause of the high loads, which was traced back to the newly enabled functionality for internal vendor logins. 3. To mitigate the issue, the DBA and engineering teams disabled the problematic functionality through a functionality feature switch. **Mitigation Measures:** 1. Improved monitoring of data-enabled functionality (functionality feature switches) to quickly detect anomalies and potential performance issues. 2. Implementation of a more aggressive graceful degradation approach, selectively disabling problematic functionality when high loads are detected to prevent widespread impact. 3. Continuous improvement of stress tests in lower environments to enhance the discovery of similar performance-related issues.
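Mitigation item 2 describes gating functionality behind a feature switch and shedding it under load. A minimal sketch of that pattern is below; the flag name, threshold, and query stub are illustrative assumptions, not ServiceChannel's implementation.

```python
CPU_SHED_THRESHOLD = 85.0  # illustrative threshold, not a real setting

def run_expensive_replica_queries() -> list:
    return ["widget-data"]  # placeholder for the real read-replica queries

def vendor_login_widgets(flags: dict, replica_cpu_percent: float) -> list:
    if not flags.get("vendor_login_module", False):
        return []  # switch off: feature disabled entirely
    if replica_cpu_percent > CPU_SHED_THRESHOLD:
        return []  # graceful degradation: shed the feature under heavy load
    return run_expensive_replica_queries()

print(vendor_login_widgets({"vendor_login_module": True}, replica_cpu_percent=92.0))
```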
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "Transient Platform Downtime Due To Database Cluster Failover"
Last update: **Date of Incident:** 07/04/2023 **Time/Date Incident Started:** 07/04/2023, 10:42 am EDT **Time/Date Stability Restored:** 07/04/2023, 10:51 am EDT **Time/Date Incident Resolved:** 07/04/2023, 12:48 pm EDT **Users Impacted:** All **Frequency:** Continuous **Impact:** Critical **Incident description:** A hardware fault affecting the server in the primary database cluster caused a brief loss of availability of the Primary Database Replica, and subsequent platform downtime, while the cluster healed itself. **Root Cause Analysis:** According to our cloud hosting partner, the server acting as the listener and primary node in the production database cluster suffered a critical hardware fault and went offline. A transient network issue introduced a brief delay in the failover mechanism, but all affected services recovered within a few minutes. **Actions Taken:** 1. Restarted the affected service to bring the failed node back online. 2. Monitored the impacted platform components to ensure application recovery. **Mitigation Measures:** 1. Redeployment of the impacted virtual machine took place during the 7/8/2023 planned maintenance window. 2. Continue the investigation with our cloud service provider to improve cluster recovery even during transient network events.
A hardware fault affecting the server in the primary database cluster caused a brief loss of availability of the Primary Database Replica, and subsequent platform downtime, while the cluster healed itself.
Report: "ServiceChannel Performance Degradation Remediation"
Last update: Please see the general postmortem at [https://status.servicechannel.com/incidents/cvp26brsbwl8](https://status.servicechannel.com/incidents/cvp26brsbwl8) for a comprehensive description of work to remediate platform performance issues.
As a part of our ongoing investigation, the ServiceChannel SRE and Application Engineering teams have explored a range of scenarios that may have contributed to our current performance issues during periods of high utilization. New information has come to light that indicates the problem could be related to aspects beyond the software itself, which has prompted us to reevaluate the need for the rollback and consider more targeted solutions. We made the decision to postpone the previously-planned rollback while we conduct a thorough examination of this new information. We believe that this decision is in the best interest of maintaining the optimal performance of the system and minimizing any disruptions to your experience. As always, we appreciate your understanding and patience.
In an effort to address ongoing performance issues during periods of heavy platform utilization, the ServiceChannel SRE and Application Engineering teams are preparing to roll our platform back to its 24 May 2023 revision. We expect to complete the rollback on 20 June by 8 am EDT. Platform changes implemented after that date will be unavailable until we are able to reintegrate them. This may take several days. Work orders, invoices, and other transactional activities added to the platform after 24 May 2023 WILL NOT be impacted and remain available. We regret the inconvenience and will keep you posted as we continue to address ongoing performance issues. Thank you for your continued understanding and patience.
Report: "ServiceChannel System Performance Degradation"
Last update: Please see the general postmortem at [https://status.servicechannel.com/incidents/cvp26brsbwl8](https://status.servicechannel.com/incidents/cvp26brsbwl8) for a comprehensive description of work to remediate platform performance issues.
This incident has been resolved. All services are working as expected. Again, we apologize for any inconvenience and will continue to monitor to ensure any future issues are dispatched quickly.
Though our Cloud hosting partner continues to experience network connectivity issues, the mitigations we have put in place are working and system performance has returned to normal. We will continue to monitor while our hosting partner brings their network issue to a complete remediation. We apologize for the inconvenience and appreciate your patience through this incident.
Our cloud service provider is currently experiencing a widespread network outage, which is causing issues on our platform. We're working with them to restore services as quickly as possible.
We are aware of and addressing an issue affecting the ServiceChannel platform. We will provide an update shortly.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel System Degradation"
Last update: Please see the general postmortem at [https://status.servicechannel.com/incidents/cvp26brsbwl8](https://status.servicechannel.com/incidents/cvp26brsbwl8) for a comprehensive description of work to remediate platform performance issues.
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "Intermittent Performance Issues Under Normal Production Load"
Last update: **Intermittent Performance Issues Under Normal Production Load - Incident Report** **Date of Incident:** 06/06/2023 - 06/30/2023 **Time/Date Incident Started:** 06/06/2023, 06:37 am EDT **Time/Date Stability Restored:** 06/28/2023, 10:00 pm EDT **Time/Date Incident Resolved:** 06/30/2023, 12:30 pm EDT **Users Impacted:** Many **Frequency:** Intermittent **Impact:** Major **Incident Description:** Support highlighted a system slowdown and degraded performance, impacting the Dashboard, Work Order operations, and Invoice reports. Despite our dedicated and consistent efforts to rectify these performance challenges, they persisted over several weeks before we successfully resolved the issues. **Root Cause Analysis:** A series of interrelated issues, typically manageable individually, collectively led to significant performance degradation during periods of increased production load. We initially struggled to identify the root causes due to their occurrence around the same time as unrelated infrastructure changes. The key issues included: * High numbers of Redis cache timeout events. * SQL timeouts in the application. * Multiple app server node failures requiring manual restarts. * Overuse of API calls due to a faulty third-party integration. **Redis Cache Timeouts:** We initially suspected the Redis cache timeouts were due to an upgrade from Redis v4 to Redis v6. However, after the timeouts persisted following a reversion to Redis v4, we discarded this theory. We traced the timeouts to a combination of connection thread exhaustion and misconfigured Redis connection timeout values. The application lacked a fallback mechanism for Redis object retrieval, causing failure instead of graceful data retrieval from the persistence layer. **Application SQL Timeouts:** Unpredictable application behavior stemmed from intermittent periods of SQL timeouts on application server nodes. The distribution of these errors across all server nodes indicated a non-application code issue. Our SRE and Application Engineering teams, working with our DBA team, traced the SQL timeout errors to long-running SQL queries on the database cluster. **Application Node Failures:** During this period, an unusually high number of application nodes failed, marked by increased response duration, maximum CPU utilization, and high memory usage. The SRE team discovered the issue stemmed from the routing algorithm, set to "LeastConnections". This algorithm led heavily loaded nodes to get locked into a high-load state, requiring manual intervention. **Excessive API Calls:** A Service Provider reported an unusually large number of Work Order schedule changes in a Work Order assigned to their organization. We traced these changes, which triggered a nuisance cycle of Work Order Notes and Notifications, to a faulty Subscriber-built integration. **Actions Taken:** During the investigation, our SRE and Application Engineering teams established a protocol for daily joint monitoring conferences. Key events are available in Appendix A. Key activities included: 1. SRE team monitoring logs for performance issue symptoms. 2. SRE team restarting web instances showing elevated response time. 3. DBA team investigating database anomalies. 4. Application Engineering team reviewing Redis configurations. 5. SRE team deactivating the faulty integration, modifying throttle limits, and engaging the responsible Subscriber. **Mitigation Measures:**
1. Redis Cache Timeouts: The Application Engineering team has implemented a shorter timeout threshold and a fallback mechanism for Redis (see the sketch after this list). 2. Redis Cache Timeouts: The Application Engineering and SRE teams are separating certain Redis application caches for better future performance. 3. Redis Cache Timeouts: The SRE team has scaled up production Redis cache cluster nodes. 4. Application SQL Timeouts: The team is systematically modifying stored procedures for improved transaction isolation through improved concurrency, thereby eliminating read blocking and subsequent SQL timeouts. 5. Application SQL Timeouts: The Application Engineering team is implementing a systematic review of the transaction isolation levels implemented in stored procedures executed from code. 6. Application SQL Timeouts: When required, the DBA team will schedule Serializable transaction isolation queries during quiescent platform periods. 7. Application SQL Timeouts: The DBA team has identified several stored procedures for future optimization. 8. Application SQL Timeouts: The SRE team has implemented monitors to alert the DBA team about SQL timeout increases. 9. Application Node Failures: The SRE team adjusted application configurations for optimal load balancing by switching from "LeastConnections" to the "LeastResponseTime" algorithm, allowing nodes handling heavy tasks to finish before receiving additional work. 10. Application Node Failures: The SRE team added monitors to identify application nodes trending toward failure. 11. Application Node Failures: The Application Engineering and SRE teams are improving internal health checks for deployed applications. 12. Application Node Failures: The SRE team is developing functionality for automatic rebooting of failing application nodes. 13. Excessive API Calls: The SRE team disabled a faulty Subscriber integration, communicated the issue to the Subscriber, and tightened the API throttle limit for the impacting integration. 14. Excessive API Calls: The SRE team will monitor API usage trends more closely. 15. Excessive API Calls: The Architecture team will investigate alternative backpressure techniques for better platform scaling. 16. Excessive API Calls: Our teams are considering a formal process to evaluate and certify third-party integrations before implementation.
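As a rough sketch of mitigation items 1-3 for the Redis cache timeouts (short timeouts plus a fallback to the persistence layer), the example below shows the general pattern using the redis-py client; the host, key names, timeout values, and loader are assumptions, not ServiceChannel's code.

```python
import json

import redis  # redis-py client

cache = redis.Redis(
    host="redis.internal",        # placeholder host
    socket_connect_timeout=0.25,  # fail fast instead of exhausting threads
    socket_timeout=0.25,
)

def load_dashboard_from_database(user_id: int) -> dict:
    return {"user": user_id, "widgets": []}  # placeholder persistence-layer read

def get_dashboard(user_id: int) -> dict:
    key = f"dashboard:{user_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        pass  # treat a cache failure like a cache miss and fall through
    data = load_dashboard_from_database(user_id)
    try:
        cache.set(key, json.dumps(data), ex=300)
    except redis.RedisError:
        pass  # never let the cache take the request down
    return data

print(get_dashboard(42))
```

With short socket timeouts and the try/except fallback, a cache outage degrades to slower reads from the persistence layer instead of failed requests.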
Support highlighted a system slowdown and degraded performance, impacting the Dashboard, Work Order operations, and Invoice reports. Despite our dedicated and consistent efforts to rectify these performance challenges, they persisted over several weeks before we successfully resolved the issues.
Report: "ServiceChannel System Performance"
Last update: **Infrastructure/Hardware Instability - Incident Report** **Date of Incident:** 05/01/2023 **Time/Date Incident Started:** 05/01/2023, 5:00 pm EDT **Time/Date Stability Restored:** 05/01/2023, 11:48 pm EDT **Time/Date Incident Resolved:** 05/01/2023, 11:48 pm EDT **Users Impacted:** All **Frequency:** Intermittent **Impact:** Major **Incident description:** Third-party vendor infrastructure/hardware instability. **Root Cause Analysis:** A third-party vendor infrastructure issue affected performance and system availability for the underlying data storage layer servicing platform resources. **Actions Taken:** 1. Investigated system-generated alerts and identified affected platform functionality. 2. SRE and DBA teams initiated a platform infrastructure redeployment, forcing the new infrastructure to be spun up on unaffected infrastructure/hardware. **Mitigation Measures:** 1. Continue the ongoing investigation into root causes of the infrastructure issue within our cloud hosting provider. 2. Continue to implement high availability improvements to prepare the platform to respond better to unexpected hardware issues that are beyond our control.
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel System Performance Degradation"
Last update: This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel System Performance"
Last update: This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel Analytics Platform"
Last update: The provider has indicated that resolution steps have been implemented and we expect that all ServiceChannel Analytics services are operational. Any missed scheduled reports will be sent shortly. We consider this issue resolved.
Our Analytics Platform provider has restored services and dashboards/reports should now load as expected. Some scheduled reports that failed will be resent shortly. We will share another update once we have confirmation all missed reports were sent and services remain stable.
Our Analytics Platform provider is currently experiencing an outage that impacts ServiceChannel's Analytics. We're working with the provider to restore services as soon as possible. An update will be shared shortly.
Report: "ServiceChannel System Performance"
Last update: **Increased Network Latency and Degraded Application Performance - Incident Report** **Date of Incident:** 03/01/2023 **Time/Date Incident Started:** 03/01/2023, 09:30 am EST **Time/Date Stability Restored:** 03/01/2023, 11:54 am EST **Time/Date Incident Resolved:** 03/01/2023, 01:04 pm EST **Users Impacted:** All users **Frequency:** Intermittent **Impact:** Major **Incident description:** At 9:30 am EST on March 1, 2023, several monitors that measure ServiceChannel platform stability went into an alert state. In the course of its investigation, the ServiceChannel SRE team observed that network latency between components of the ServiceChannel platform had increased significantly. Around the same time, the ServiceChannel support team also began to receive reports of slowness from end users. The SRE team restored network latency to normal by performing a rolling restart of each affected web service. **Root Cause Analysis:** At this time, the SRE team believes the network latency was the result of an unannounced networking problem at our cloud provider's data center. The ServiceChannel SRE team has requested and is awaiting an RCA from our cloud provider. **Actions Taken:** 1. The SRE team performed a rolling restart of affected web application services, which restored network latency to normal. **Mitigation Measures:** 1. The SRE team is investigating hosting web services in multiple geographically dispersed availability zones within a data center.
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
We are continuing to investigate this issue.
We are continuing to investigate the issue. Your patience is appreciated while we're working to fully resolve this.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel Analytics"
Last update: The provider has indicated that resolution steps have been implemented and we expect that all ServiceChannel Analytics services are operational. Any missed scheduled reports will be sent shortly. We consider this issue resolved.
Our service provider performed an initial series of service recovery operations, and Analytics reports and dashboards are running now. However, the issue is not completely resolved on the provider's side. They have identified the source of the issue and are developing and implementing a fix to restore service fully. We will continue to monitor the situation; until then, there may be some intermittent disruptions.
Our Analytics Platform provider is currently experiencing an outage that impacts ServiceChannel's Analytics. We're working with the provider to restore services as soon as possible. An update will be shared shortly.
Report: "ServiceChannel Mobile App"
Last update: This incident has been resolved. All services are working as expected.
The SC Mobile App service has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
The ServiceChannel SRE team has identified the issue and is applying a fix. We will continue to provide updates until this is fully resolved. Thank you for your continued patience.
Our engineers are still working to restore SC Mobile App access. We'll continue to share updates as soon as available.
We're currently experiencing an issue with our SC Mobile App. Our engineers are actively working to restore access as soon as possible. An update will be shared shortly.
Report: "ServiceChannel System Stability"
Last update: **Cloud Provider Network Outage - Incident Report** **Date of Incident:** 01/25/2023 **Time/Date Incident Started:** 01/25/2023, 3:30 am EST **Time/Date Stability Restored:** 01/25/2023, 6:30 am EST **Time/Date Incident Resolved:** 01/25/2023, 6:30 am EST **Users Impacted:** All **Frequency:** Intermittent **Impact:** Major **Incident description:** Azure networking errors - multiple regions. **Summary of Impact:** Between 07:05 UTC and 09:45 UTC on 25 January 2023, customers experienced issues with networking connectivity, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in public Azure regions, as well as other Microsoft services including M365 and PowerBI. **Preliminary Root Cause Analysis:** Microsoft Azure, our primary cloud service provider, determined that a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between customers on the internet and Azure, connectivity between services within regions, as well as ExpressRoute connections. This was a global outage for all Microsoft Azure customers. The ServiceChannel SRE team determined that users outside North America encountered issues reaching our EU-hosted applications, while users located in the EU encountered issues connecting to the US-hosted applications. After our cloud provider rolled back the Wide Area Network (WAN) changes, network access between regions was restored for all ServiceChannel users. **Actions Taken:** 1. SRE team investigated triggered platform alerts for our European datacenter. 2. Reviewed the status page for our hosting partner. **Mitigation Measures:** Our cloud provider identified a recent change to the WAN as the underlying cause and has rolled back this change. To that end, they have offered the following mitigations to prevent recurrence: 1. Blocking highly impactful commands from getting executed on network devices (Completed). 2. Requiring all command execution on the devices to follow safe change guidelines (Estimated completion: February 2023). Cloud Provider RCA (requires Microsoft Azure account): [https://app.azure.com/h/VSG1-B90/05a585](https://app.azure.com/h/VSG1-B90/05a585)
Our hosting partner has successfully restored all WAN networking services. We consider this incident to be resolved.
Our hosting partner reports that most customers should now see full recovery as WAN networking has recovered fully. We continue to monitor to ensure full recovery for services that were impacted.
Our hosting partner has identified a recent WAN update as the likely underlying cause of this network connectivity issue, and have taken steps to roll back this update. Their latest telemetry shows signs of recovery across multiple regions and services. We are continuing to actively monitor the situation.
Our hosting partner is experiencing an ongoing issue with network connectivity, impacting the ServiceChannel platform. The ServiceChannel Site Reliability Engineering team will provide updates as they become available.
Report: "ServiceChannel Work Order Editing Errors"
Last update: **Execute or Insert Permission Denied Against DB Objects Errors - Incident Report** **Date of Incident:** 01/09/2023 **Time/Date Incident Started:** 01/09/2023, 10:19 am EDT **Time/Date Stability Restored:** 01/09/2023, 12:30 pm EDT **Time/Date Incident Resolved:** 01/09/2023, 12:30 pm EDT **Users Impacted:** Few **Frequency:** Intermittent **Impact:** Major **Incident description:** A small number of users encountered “Execute permission was denied against Database objects” or random time-out errors. **Root Cause Analysis:** The DBA (Database Administration) and SRE (Site Reliability Engineering) teams responded to reports of random errors or timeout issues reported to the ServiceChannel support teams. While conducting a deep dive into application logs, the SRE team identified a pattern of errors all being generated against a single instance of the ServiceClick pool. Furthermore, our application logs showed this instance came online exactly when the errors started registering. This new instance had been added automatically by scale-out rules that take into consideration the existing demands on the system; such instances are removed when no longer required. The SRE team pulled the logs for the unhealthy instance, opened a cloud provider support case, and shortly afterward manually removed the instance. **Actions Taken:** 1. Database team attempted to resolve the execute permission errors by providing the required permissions to the tables. 2. SRE team reviewed logs and found a specific instance that was generating all the errors. 3. SRE team pulled logs for that unhealthy node and opened a support case with our cloud provider Support to assist with the investigation. 4. SRE team stopped the unhealthy instance via the cloud provider's REST API. 5. SRE team engaged the engineering team to perform a deep dive on logging, health checks, database configuration, and credentials storage. **Mitigation Measures:** 1. Added alerts that fire on insert permission errors and name the specific instance. 2. The engineering team will add more logging timestamps to ensure proper timestamps are tied to the application. 3. Engineering team will review web application instance health checks to ensure they are working as intended. 4. Work with cloud provider support to explain why a single instance exhibited behavior that was different from all other nodes in the pool.
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
We are actively investigating an issue with errors occurring while editing work orders. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel System Performance"
Last update: **Database VM (Virtual Machine) Failures - Incident Report** **Date of Incident:** 01/06/2023 **Time/Date Incident Started:** 01/06/2023, 10:19 am EDT **Time/Date Stability Restored:** 01/06/2023, 12:30 pm EDT **Time/Date Incident Resolved:** 01/06/2023, 12:30 pm EDT **Users Impacted:** Few **Frequency:** Intermittent **Impact:** Major **Incident description:** Automated alerting for database virtual machines triggered suddenly, which led to failed health checks and VMs being marked as unhealthy. This degradation resulted in performance issues for ServiceChannel platform users. **Root Cause Analysis:** Early in the troubleshooting process, the SRE (Site Reliability Engineering) and DBA (Database Administration) teams identified that one of the VM instances had suffered a loss of network connectivity, which resulted in the instance being marked as unhealthy. The SRE team proceeded with redeploying this VM, which served as one of the replica virtual machines for the database cluster. Fifteen minutes into the redeploy, the SRE team determined that a second replica server had registered as unhealthy and decided to redeploy that virtual machine as well. The redeploy process involves migrating the virtual machines onto new host hardware. Once the redeployment was completed, the DBA team ensured the replica servers were fully in sync and that load was balanced properly between the servers. **Actions Taken:** 1. Investigated triggered alerts and identified degraded virtual machines. 2. SRE team triggered a VM redeploy on both replica database servers onto new underlying hardware. **Mitigation Measures:** 1. SRE team opened an Azure support case for additional assistance with investigating the root cause of the virtual machine failures. 2. SRE and DBA teams have started efforts to enhance high availability and disaster recovery for existing and future database server implementations.
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
The ServiceChannel SRE team has identified the issue and is applying a fix. We will continue to provide updates until this is fully resolved. Thank you for your continued patience.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel System Performance"
Last update**Emergency Release Resulting in Service Outage - Incident Report** **Date of Incident:** 11/18/2022 **Time/Date Incident Started:** 11/18/2022, 12:38 pm EDT **Time/Date Stability Restored:** 11/18/2022, 2:58 pm EDT **Time/Date Incident Resolved:** 11/18/2022, 3:36 pm EDT **Users Impacted:** Some users **Frequency:** Intermittent **Impact:** Major **Incident description:** The DevOps CICD team executed an emergency production code deployment for the Subscriber website code to remediate an issue reported by one client. The system exhibited periods of degraded performance for 3 hours during the emergency release. **Root Cause Analysis:** After the emergency code release, the SRE team determined that unhealthy nodes were present in one of the ServiceClick web application pools. The SRE team responded by increasing node/instance capacity and restarting existing nodes with a degraded status. The DevOps team later identified a miscommunication between the Release team and the SRE team: the release hand-off wasn't properly communicated. This resulted in additional delay in remediating the remaining unhealthy nodes. **Actions Taken:** 1. Release team performed an emergency code release. 2. SRE team restarted some degraded app service nodes. 3. A Zoom bridge was started with the release team and the SRE team. 4. SRE team manually scaled up capacity on one of the ServiceClick web application pools. **Mitigation Measures:** 1. Review the criteria for approving an emergency code release. 2. Update the emergency release process documentation to include the following points: 1. If an emergency release is happening in the service time range \(USA: 9 AM – 2 PM EDT, EU: 9 AM – 2 PM GMT\+1\), then the release team and the SRE team will start a bridge \(Zoom or Microsoft Teams call\) immediately when the release begins \(see the sketch below\). 2. Review the deployment strategy for high-traffic web applications and create new methods to ensure zero-downtime deployments at any time of day.
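The service-window check in the mitigation measures above is easy to automate. The sketch below decides whether an emergency release starting now requires an immediate bridge; the zone choices and window definitions are assumptions for illustration, not ServiceChannel's actual release tooling.

```python
from datetime import datetime, time
from typing import Optional
from zoneinfo import ZoneInfo  # Python 3.9+

# Service windows from the mitigation measures above.
# Note: IANA "Etc/GMT-1" means UTC+1 (the sign is inverted by convention).
SERVICE_WINDOWS = [
    ("America/New_York", time(9, 0), time(14, 0)),  # USA: 9 AM – 2 PM Eastern
    ("Etc/GMT-1", time(9, 0), time(14, 0)),         # EU: 9 AM – 2 PM GMT+1
]

def bridge_required(now_utc: Optional[datetime] = None) -> bool:
    """Return True if an emergency release starting now falls in a service window."""
    now_utc = now_utc or datetime.now(ZoneInfo("UTC"))
    for zone, start, end in SERVICE_WINDOWS:
        local = now_utc.astimezone(ZoneInfo(zone)).time()
        if start <= local < end:
            return True
    return False

if __name__ == "__main__":
    if bridge_required():
        print("Start a Zoom/Teams bridge between the Release and SRE teams before releasing.")
    else:
        print("Outside the service window; follow the standard release process.")
```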
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
The Site Reliability Engineering team has discovered the cause of the degraded performance.
We are continuing to investigate this issue.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "ServiceChannel Work Order Update Latency"
Last update**Database Replication Latency and Messaging queue bug resulting in delays with processing system messages** **Date of Incident:** 12/23/2022 **Time/Date Incident Started:** 12/23/2022, 02:27 pm EST **Time/Date Stability Restored:** 12/24/2022, 03:20 am EST **Time/Date Incident Resolved:** 12/24/2022, 10:15 am EST **Users Impacted:** Some clients **Frequency:** Intermittent **Impact:** Major **Incident description:** 1. Database replication latency caused some event transaction messages to be blocked, which led to processing delays. 2. A small number of bad messages that exceeded body size limits were unable to properly age out of the system, and the retry process created duplicates of those bad messages. **Root Cause Analysis:** The SRE team, along with the DBA team, had just completed responding to a production database issue that resulted in data replication latency. This type of event typically does not result in a production outage, as replication quickly catches up once the replication servers are in a healthy state. One side effect of replication latency is a backlog of system event messages, which triggered internal monitors for queue message thresholds. This issue will typically resolve itself once the backlog of messages is processed by the system. The SRE team observed that messages continued to remain unprocessed and restarted the worker services responsible for processing the system event messages. When this did not resolve the issue, the SRE team reviewed logs and engaged our software engineering teams. After conducting a joint deep dive into this issue, we were able to confirm that new messages arriving in the queue were being processed successfully. However, our software engineers identified a previously undiscovered bug on the emitter side of the events system: if a “WorkOrderCreated” event had a body size larger than 256KB, the message was rejected and subsequently crashed the queueing service, which also kept these specific events marked as not processed. From that point the emitter would start to create duplicate events that were also not processed, causing a loop. By early morning, the teams were able to identify and mark the affected messages for deletion, which allowed the duplicate messages to slowly age out of the system. **Actions Taken:** 1. Monitored the FIFO queue and restarted notification services. 2. Increased instance counts for the Windows services responsible for the HttpEndpointNotificationHandler. 3. Restarted application servers. 4. Monitored logs and confirmed that new system event messages were being processed. 5. Created test messages to confirm statuses were being updated properly. 6. Identified and deleted messages that exceeded body size limits. 7. Increased workers to process outstanding events. 8. Identified messages that exceeded a body size of 256KB. 9. Marked duplicate messages for deletion. 10. Disabled retry attempts for duplicate messages. **Mitigation Measures:** 1. Engineering team identified a bug with the message size limitation and will add proper validation for this size limit on the emitter side \(see the sketch below\). 2. Engineering team will improve worker agent scaling to handle increased message loads.
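A minimal sketch of the emitter-side size validation called for in the first mitigation measure is shown below. It assumes a hypothetical `publish` callable and JSON-serialized event bodies; it is not ServiceChannel's actual events code.

```python
import json

MAX_BODY_BYTES = 256 * 1024  # queue body-size limit described in the incident report

class EventTooLargeError(ValueError):
    pass

def emit_event(publish, event_type: str, payload: dict) -> None:
    """Validate the serialized body size before handing the event to the queue.

    `publish` is a hypothetical callable (e.g. a queue client's send method).
    Oversized events are rejected up front instead of crashing the queueing
    service and spawning duplicate retries, as happened in this incident.
    """
    body = json.dumps({"type": event_type, "payload": payload}).encode("utf-8")
    if len(body) > MAX_BODY_BYTES:
        # Real handling might truncate, store the payload externally and send a
        # reference, or route to a dead-letter store; here we simply refuse.
        raise EventTooLargeError(
            f"{event_type} body is {len(body)} bytes; limit is {MAX_BODY_BYTES}"
        )
    publish(body)

if __name__ == "__main__":
    sent = []
    emit_event(sent.append, "WorkOrderCreated", {"workOrderId": 12345})
    print(f"published {len(sent)} event(s)")
```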
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
The ServiceChannel SRE team has identified the issue and is applying a fix. We will continue to provide updates until this is fully resolved. Thank you for your continued patience.
We are currently investigating work order update latency on the ServiceChannel platform. We will share an update shortly. Thank you for your patience.
Report: "ServiceChannel Contact Center"
Last updateThe provider has indicated that resolution steps have been implemented and we expect that all Contact Center services are operational. We consider this issue resolved.
Our Contact Center is currently experiencing an outage with our phone provider. We're working with the provider to restore services as soon as possible. An update will be shared shortly.
Report: "Security Notification"
Last updateSecurity Notification On December 10th a repeat incident of malicious code was discovered in ServiceChannel’s informational website https://servicechannel.com. After investigating this issue, it was determined that the code was left dormant. ServiceChannel’s cybersecurity team immediately responded to this threat and believes the dormant code was eliminated before it could distribute malware to visitors to the website. The ServiceChannel website is completely segregated from the ServiceChannel software platform and we believe there was no impact to any customer or service provider data or to ServiceChannel systems. There is no action required of you at this point. We are just sharing this in the spirit of full transparency. ServiceChannel’s cybersecurity team constantly monitors all our systems to ensure the security of our customers, service providers and their data. Best, ServiceChannel’s cybersecurity team
Report: "Security Notification"
Last updateSecurity Notification On December 6th malicious code was discovered in ServiceChannel’s informational website https://servicechannel.com. After investigating this issue, it was determined that the code was left dormant. ServiceChannel’s cybersecurity team immediately responded to this threat and believes the dormant code was eliminated before it could distribute malware to visitors to the website. The ServiceChannel website is completely segregated from the ServiceChannel software platform and we believe there was no impact to any customer or service provider data or to ServiceChannel systems. There is no action required of you at this point. We are just sharing this in the spirit of full transparency. ServiceChannel’s cybersecurity team constantly monitors all our systems to ensure the security of our customers, service providers and their data. Best, ServiceChannel’s cybersecurity team
Report: "ServiceChannel Service Center"
Last updateThanks for your patience. We consider this issue to be resolved.
Our telecommunications provider has implemented a fix and inbound calls to our Service Center are flowing properly again. We will continue to monitor for the next 30 minutes.
Our Service Center staff are currently not able to receive inbound telephone calls. We are working with our telecommunications partner to restore service and will provide an ETA as soon as it becomes available.
Report: "ServiceChannel System Performance"
Last update**Primary Database Memory Dump Crash** - **Incident Report** **Date of Incident:** 10/20/2022 **Time/Date Incident Started:** 10/20/2022, 10:59am EDT **Time/Date Stability Restored:** 10/20/2022, 11:06am EDT **Time/Date Incident Resolved:** 10/20/2022, 12:08pm EDT **Users Impacted:** All **Frequency:** Continuous **Impact:** Critical **Incident description:** An issue with the underlying hardware used by the Virtual Machine \(VM\) hosting the primary database replica caused a transient memory error in the database service. The process that writes out the affected memory space to a diagnostic file caused the system to become unresponsive and the primary database to be briefly unavailable, causing systemwide downtime of the ServiceChannel platform. **Root Cause Analysis:** A general hardware issue on one of our infrastructure partner’s hypervisors resulted in a transient memory error in the VM running our primary database service. Upon detecting the malfunctioning process, the database service triggered its self-healing process. As part of the recovery, the impacted service generated a memory dump capturing crucial information about the state of the service, the content of the impacted memory, and log files. Normally this process is seamless; in this event, however, it resulted in a temporary freeze of the database service, causing systemwide downtime. The database returned to normal operation as soon as the memory dump completed, returning the platform to service. **Actions Taken:** 1. The Database Administration \(DBA\) team investigated alerts triggered by an unresponsive SQL Server service on the primary replica. 2. The DBA team verified that the database service had recovered from a system freeze which occurred while a diagnostic memory dump took place. 3. The DBA team confirmed that full functionality had returned to database services and to the synchronization process between the impacted primary and the secondary database replicas. 4. The Site Reliability Engineering \(SRE\) team confirmed that the ServiceChannel platform returned to normal operation. **Mitigation Measures:** 1. Work with our cloud infrastructure partner to obtain additional details about the nature of this failure and implement a robust strategy to survive unavoidable system instability such as this. 2. Enhance the current high availability remediation and disaster recovery mechanism within SQL Server for additional operational resiliency.
Services are working as expected; we consider this incident to be resolved. Thank you for your patience.
Our engineering team has identified the issue and services are returning to normal. We are continuing to monitor.
We are actively investigating an issue with performance on the ServiceChannel platform. We will provide an update as soon as possible. Thank you for your patience.
Report: "ServiceChannel System Performance Degradation"
Last update**Intermittent Virtual Machine Network Connectivity Issue** - **Incident Report** **Date of Incident:** 10/12/2022 **Time/Date Incident Started:** 10/12/2022, 8:35am/pm EDT **Time/Date Stability Restored:** 10/12/2022, 10:39am/pm EDT **Time/Date Incident Resolved:** 10/12/2022, 10:45am/pm EDT **Users Impacted:** Many **Frequency:** Intermittent **Impact:** Major **Incident description:** Partial loss of network connectivity to a database read replica Virtual Machine \(VM\). **Root Cause Analysis:** The ServiceChannel Site Reliability Engineering \(SRE\) and Database Administration \(DBA\) teams discovered a partial loss of connectivity affecting a VM used by the main transactional database as a read replica. As a result of the connectivity issue, certain processes, including real-time database replication from the primary database replica to the impacted read replica, were delayed. This caused noticeable latency for certain kinds of data updates and, in some cases, application performance degradation. The connectivity problem occurred in the network hardware layer managed by our cloud infrastructure partner and affected a single read replica in the main transactional database. **Actions Taken:** 1. Investigated triggered alerts and degraded network functionality. 2. The SRE and DBA teams coordinated the redeployment and restart of the impacted VM, moving it to a network segment that was functioning correctly. 3. Confirmed that full connectivity between the impacted VM and the primary database replica was restored and the database was operating normally again. **Mitigation Measures:** 1. Work with our cloud infrastructure partner’s support team to obtain additional details about the nature of this failure and design a network strategy that can survive unavoidable network instability. 2. Enhance the current high availability remediation and disaster recovery to provide additional operational resiliency.
Services are working as expected; we consider this incident to be resolved. Thank you for your patience.
Our engineers have identified and corrected the source of the issue and services have been restored. We will monitor to ensure system stability.
We are actively investigating an issue causing degraded performance on the ServiceChannel platform. We will provide an update as soon as possible. Thank you for your patience.
Report: "SSO Login Errors"
Last update**SSO Login failures for some SAML/SSO customers - Incident Report** **Date of Incident:** 10/06/2022 **Time/Date Incident Started:** 10/06/2022, 7:00 AM EDT **Time/Date Stability Restored:** 10/06/2022, 3:55 PM EDT **Time/Date Incident Resolved:** 10/06/2022, 4:05 PM EDT **Users Impacted:** Few users **Frequency:** Continuous **Impact:** Major **Incident description:** Errors when attempting to authenticate using Single Sign-On for a subset of SAML SSO-enabled customers. **Root Cause Analysis:** Upon further investigation, the team responsible for managing the SAML SSO module determined that an undetected bug was introduced in software released 10/5/2022. This bug was not caught because it appears to only affect certain SAML SSO-enabled customers. As we do not have test accounts for every SAML SSO-enabled customer, our test coverage cannot find these edge cases. **Actions Taken:** 1. SRE team reviewed logs and determined that SSO authentication issues were confined to a very small subset of SAML SSO-enabled customers. 2. SRE team determined that the incident coincided with a login component application release the previous evening. 3. CICD team performed an emergency rollback of the login application components. **Mitigation Measures:** 1. Add monitoring to alert on increased SSO errors for our SAML SSO-enabled customers. 2. Release a fix for the underlying bug in the login component application.
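The first mitigation measure above calls for alerting on an increased SSO error rate. Below is a minimal sketch of such a check over a sliding window of login outcomes; the window size and threshold are illustrative assumptions, not ServiceChannel's real telemetry or alerting configuration.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class SsoErrorRateMonitor:
    """Tracks recent SAML SSO login outcomes and flags an elevated error rate."""
    window: int = 200          # number of most recent login attempts to consider
    threshold: float = 0.05    # alert when more than 5% of attempts in the window fail
    min_samples: int = 50      # avoid alerting on a handful of early failures

    def __post_init__(self):
        self._outcomes = deque(maxlen=self.window)  # True = failed attempt

    def record(self, failed: bool) -> bool:
        """Record one login attempt; return True if an alert should fire."""
        self._outcomes.append(failed)
        if len(self._outcomes) < self.min_samples:
            return False
        error_rate = sum(self._outcomes) / len(self._outcomes)
        return error_rate > self.threshold

if __name__ == "__main__":
    monitor = SsoErrorRateMonitor()
    should_alert = False
    for i in range(100):
        should_alert = monitor.record(failed=(i % 10 == 0))  # simulated 10% failure rate
    print("alert" if should_alert else "ok")
```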
After an extended period of monitoring, we believe this issue is resolved. Thank you for your patience and understanding.
The ServiceChannel engineering team has implemented a fix. We are now monitoring to ensure continued stability.
ServiceChannel engineers have identified a cause for this issue and are preparing a fix.
The ServiceChannel Site Reliability Engineering (SRE) team is currently investigating an issue affecting Single Sign-On (SSO) logins for some users. We will provide an update shortly.
Report: "ServiceChannel System Performance Degradation"
Last update**Date of Incident:** 09/16/2022 **Time/Date Incident Started:** 09/16/2022, 6:08pm EDT **Time/Date Stability Restored:** 09/16/2022, 6:27pm EDT **Time/Date Incident Resolved:** 09/16/2022, 7:00pm EDT **Users Impacted:** All **Frequency:** Continuous **Impact:** Major **Incident description:** Unexpected failure of a primary database server. **Root Cause Analysis:** The Site Reliability Engineering \(SRE\) team identified that the production primary database server for the US datacenter was in an unresponsive state and determined that Azure had triggered an automated recovery/redeploy process due to a detected hypervisor hardware failure. The affected hypervisor was responsible for running our primary database virtual machine. The SRE and Database teams monitored the failover to new hardware and verified that the redeployed virtual machine was operating properly. **Actions Taken:** 1. SRE team investigated triggered alerts and identified a failed virtual machine for the US production master database server. 2. The Database team confirmed that the redeployed hardware was operating as expected. **Mitigation Measures:** 1. SRE team opened an Azure support case to get additional details pertaining to the nature of this failure. 2. DBA team expanded and improved the database clustering setup to eliminate single points of failure in the database infrastructure.
Services are working as expected; we consider this incident to be resolved. Thank you for your patience.
Our engineers have identified and corrected the source of the issue and services have been restored. We will monitor to ensure system stability.
We are actively investigating an issue causing degraded performance on the ServiceChannel platform. We will provide an update as soon as possible. Thank you for your patience.
Report: "Supply Manager Unavailable"
Last update**DNS Errors for Supply Manager Incident and Postmortem Report** **Date of Incident:** 08/30/2022 **Time/Date Incident Started:** 08/30/2022, 3:01 am EDT **Time/Date Stability Restored:** 08/30/2022, 9:23 am EDT **Time/Date Incident Resolved:** 08/30/2022, 10:06 am EDT **Users Impacted:** Few **Frequency:** Intermittent **Impact:** Minor **Incident description:** Customers that have Supply Manager enabled encountered an undefined error during login. **Root Cause Analysis:** The SRE team responded to internal alerts triggered against the Supply Manager component and confirmed that the problem was impacting specific virtual machines running the Ubuntu operating system. The Azure status page acknowledged an issue for Ubuntu 18.04, where the latest operating system updates resulted in DNS errors when accessing URL resources. The SRE team forwarded these details to the dedicated support team for the managed service, which was then able to successfully restore service to the Supply Manager component of the ServiceChannel platform. Reference to the Azure incident: [https://app.azure.com/h/2TWN-VT0/05a585](https://app.azure.com/h/2TWN-VT0/05a585) Virtual Machines - DNS errors when accessing resources **Actions Taken:** 1. The SRE team verified the issue was isolated to the managed service responsible for the Supply Manager application. 2. The SRE team identified a temporary workaround for the problem and forwarded the details to the vendor. **Mitigation Measures:** 1. Recommended that the vendor add synthetic checks that could aid in earlier detection of these types of issues \(see the sketch below\).
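A synthetic check of the kind recommended above can be as simple as resolving and fetching a known URL on a schedule and treating DNS failures as a distinct signal. The sketch below uses only the Python standard library; the probe URL is a placeholder, not an actual ServiceChannel endpoint.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlparse

def synthetic_check(url: str, timeout: float = 5.0) -> str:
    """Return 'ok', 'dns_failure', or 'http_failure' for one probe of `url`."""
    host = urlparse(url).hostname
    try:
        socket.getaddrinfo(host, 443)  # DNS resolution step, checked separately
    except socket.gaierror:
        return "dns_failure"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok" if resp.status < 500 else "http_failure"
    except (urllib.error.URLError, TimeoutError):
        return "http_failure"

if __name__ == "__main__":
    # Placeholder probe target for illustration only.
    print(synthetic_check("https://example.com/health"))
```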
This incident has been resolved.
We're happy to report Supply Manager is now operational. We will monitor to ensure there are no further issues.
Due to a technical issue, Supply Manager is currently unavailable. Our engineers are currently investigating and will provide an ETA as soon as possible. Thank you for your patience.
Report: "ServiceChannel System Performance Degradation"
Last update**ServiceChannel System Performance Degradation Incident Report** **Date of Incident:** 06/27/2022 **Time/Date Incident Started:** 06/27/2022, 11:12 am EDT **Time/Date Stability Restored:** 06/27/2022, 3:55 pm EDT **Time/Date Incident Resolved:** 06/27/2022, 5:57 pm EDT **Users Impacted:** Many **Frequency:** Intermittent **Impact:** Major **Incident description:** A long-running query caused a temporary database to fill, which resulted in resource exhaustion. This caused performance degradation and negatively impacted customer experience. **Root Cause Analysis:** The DBA team identified a long-running query against the database servers which caused tempdb \(a temporary database\) to fill, causing an application error and adversely affecting performance. This query consumed additional system resources, resulting in further degradation of performance on the ServiceChannel platform. **Actions Taken:** 1. Increased disk space for the temporary database on impacted servers. 2. Restarted the impacted database servers to free up resources. **Mitigation Measures:** 1. Created new internal documentation for responding to this type of scenario. 2. Improved monitoring for earlier detection of this degradation scenario \(see the sketch below\).
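One way to catch this scenario earlier is to poll tempdb free space and alert below a threshold. The sketch below assumes a SQL Server connection via the third-party pyodbc package and a hypothetical connection string; the threshold and query are illustrative, not ServiceChannel's actual monitoring.

```python
import pyodbc  # third-party: pip install pyodbc (requires an ODBC driver)

# Free space in tempdb, in MB (8 KB pages), from SQL Server's space-usage DMV.
TEMPDB_FREE_MB_QUERY = """
SELECT SUM(unallocated_extent_page_count) * 8 / 1024 AS free_mb
FROM tempdb.sys.dm_db_file_space_usage;
"""

def tempdb_free_mb(conn_str: str) -> int:
    conn = pyodbc.connect(conn_str, timeout=5)
    try:
        row = conn.cursor().execute(TEMPDB_FREE_MB_QUERY).fetchone()
        return int(row[0])
    finally:
        conn.close()

def check_tempdb(conn_str: str, min_free_mb: int = 10240) -> None:
    free = tempdb_free_mb(conn_str)
    if free < min_free_mb:
        # Hook this into the alerting pipeline of your choice.
        print(f"ALERT: tempdb free space is {free} MB (threshold {min_free_mb} MB)")

if __name__ == "__main__":
    # Hypothetical connection string for illustration only.
    check_tempdb("DRIVER={ODBC Driver 18 for SQL Server};SERVER=db-host;Trusted_Connection=yes")
```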
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
As our investigation continues, we are implementing a systemwide change that may cause a small number of users to be logged out. We apologize for the inconvenience, but we do anticipate that these changes will have a positive impact on system performance. Thanks again for your patience.
We are actively investigating an issue causing degraded performance on the ServiceChannel platform. We will provide an update as soon as possible. Thank you for your patience.
Report: "ServiceChannel Support issues with Support platform"
Last updateThis incident has been resolved.
We are starting to receive reports of the Zendesk Support emails being delivered again. Please monitor your inbox for new notifications. Our provider has made changes to allow the delivery of the emails once again. As a reminder, if you're not receiving a reply to your request, you can log in at https://servicechannel.zendesk.com/hc/en-us/requests and check on your requested tickets directly. However, we anticipate that all emails are being sent as expected.
The ServiceChannel Support platform (powered by Zendesk) is experiencing issues with emails not being received in some cases. If you're not receiving a reply to your request, you can log in at https://servicechannel.zendesk.com/hc/en-us/requests and check on your requested tickets directly. If you do not know your Zendesk Account password, your email is currently registered so you just need to reset the password. Click on the link "Forgot my Password" on the login page to reset it. Once you are logged in you can select My Activities from the drop-down in the upper right corner to review your ticket history. We will continue to work with our business partners to restore the email delivery of the Support tickets as soon as possible. You can continue to send requests/replies to our Support team by email, but please log in if you're not seeing a reply in your inbox.
Report: "Data Direct Connection Issue"
Last updateThis incident has been resolved. All services are working as expected.
The primary VPN endpoint for connectivity to Data Direct has been restored by the responsible connectivity vendor.
For our Customers using Data Direct, we are currently working with our cloud provider and investigating an issue that is impacting our Data Direct Service. Customers with direct access may experience a connection issue while Customers using database backup files are not impacted currently. We'll provide an update ASAP.
Report: "ServiceChannel Analytics"
Last updateAll systems are working normally and we consider this issue to be resolved. As always, we thank you for your patience as we worked to resolve this matter.
Our engineers have restored the Analytics module to normal operation. Everything is working as expected; we are currently monitoring the stability of the fix but do not anticipate any further issues.
The ServiceChannel engineering team has identified the issue and is actively working to restore access to the Analytics module.
We're currently investigating issues with our Analytics module and we are actively working on restoring access. During the outage, scheduled reports should be sent. An update will be provided ASAP.
Report: "System Performance Degradation"
Last update**System Performance Degradation Across Modules - Incident Report** **Date of Incident:** 05/19/2022 **Time/Date Incident Started:** 05/19/2022, 08:50 am EDT **Time/Date Stability Restored:** 05/19/2022, 09:46 am EDT **Time/Date Incident Resolved:** 05/19/2022, 09:58 am EDT **Users Impacted:** All users **Frequency:** Intermittent **Impact:** Major **Incident description:** The ServiceChannel monitoring system detected system performance degradation and increased application latencies. The ServiceChannel Site Reliability Engineering \(SRE\) team began an investigation. Impacted customers reported general slowness while using the ServiceChannel platform. After confirming that the core system components whose degradation would have a cascading adverse effect on many modules were in fact healthy, the SRE team determined that our cloud provider was likely experiencing unreported performance degradation on their backend. The SRE team engaged our cloud provider’s support engineers to establish a root cause. During a lengthy exploratory conference bridge, the cloud service provider’s support engineers were able to find evidence of a networking failure within a primary datacenter, consistent with the timeline of the incident. **Root Cause Analysis:** A transient networking failure at our cloud provider’s datacenter caused slowness and degraded performance for end users of the ServiceChannel platform. **Actions Taken:** 1. The SRE team’s monitoring tools issued alerts related to increased latency across all ServiceChannel platform modules. 2. The SRE team conducted an investigation, examining application logs, key infrastructure performance metrics, and other telemetry. 3. After determining that core system components were in fact healthy, the SRE team engaged our cloud provider’s support engineers. 4. Our cloud provider was able to provide evidence of networking failures at their datacenter during the time of the incident. **Mitigation Measures:** 1. The SRE team is exploring alternative deployment patterns to improve resiliency during transient failures within our cloud service provider.
This incident has been resolved. All services are working as expected.
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "Fixxbook Performance Degradation"
Last updateAll systems are back to normal. Thank you for your patience.
The ServiceChannel SRE team has identified the cause of the performance degradation and is implementing a fix. We expect this issue to be resolved shortly.
The ServiceChannel SRE team is currently investigating degraded performance affecting Fixxbook. We will provide an update shortly.
Report: "ServiceChannel System Performance Degradation"
Last updateThis incident has been resolved. All services are working as expected.
Our engineers have restored stability and services are functioning normally. We will monitor closely to ensure no further issues arise.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Report: "Degraded SFTP File Transfers"
Last updateFile transfers to sftp2.servicechannel.com experienced extreme slowness. Date/Time Incident Started: 3/15/2022 5:47 AM Date/Time Stability Restored: 3/15/2022 7:14 AM Date/Time Incident Resolved: 3/15/2022 7:14 AM Users Impacted: Many Frequency: Intermittent Impact: Minor Root Cause Analysis In an effort to align with corporate information security standards, the ServiceChannel SRE team recently updated sftp2.servicechannel.com to use a new Endpoint Detection and Response (EDR) sensor agent. Under a specific configuration, the new EDR sensor entered a state that caused memory swapping, resulting in CPU exhaustion. Actions Taken The EDR sensor was uninstalled. After a reboot, CPU utilization returned to normal. Mitigation Measures 1. Investigate potential EDR sensor interactions with other security tooling on sftp2.servicechannel.com 2. Re-test the EDR deployment in a non-production SFTP environment with aggressive load testing prior to redeploying EDR to sftp2.servicechannel.com
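Memory swapping that leads to CPU exhaustion, as described above, is straightforward to detect on the host. Below is a minimal sketch using the third-party psutil library; the thresholds are illustrative assumptions, not values from this incident.

```python
import psutil  # third-party: pip install psutil

CPU_PCT_THRESHOLD = 90.0   # sustained CPU utilization considered unhealthy
SWAP_PCT_THRESHOLD = 50.0  # swap usage suggesting memory pressure

def host_under_pressure(sample_seconds: float = 5.0) -> bool:
    """Return True when CPU and swap usage both exceed the thresholds."""
    cpu_pct = psutil.cpu_percent(interval=sample_seconds)  # averaged over the sample window
    swap_pct = psutil.swap_memory().percent
    return cpu_pct >= CPU_PCT_THRESHOLD and swap_pct >= SWAP_PCT_THRESHOLD

if __name__ == "__main__":
    if host_under_pressure():
        print("ALERT: high CPU with heavy swapping; investigate recently deployed agents")
    else:
        print("Host resource usage looks normal.")
```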
Report: "ServiceChannel FTP Issues"
Last updateThis incident has been resolved. All services are working as expected.
The ServiceChannel SRE team has implemented a fix for the issue that was preventing FTP file processing. All pending jobs have been re-run. We are currently monitoring the system to ensure that everything is working as expected. Thank you for your patience.
For our customers with certain custom integrations, we're currently experiencing interruptions with processing via FTP. Our engineers are actively investigating the cause, but you may be unable to view files in your respective FTP account. Once restored, the files will be available on the FTP account as appropriate. We will share updates as soon as possible.