Historical record of incidents for Swapcard
Report: "Connection Error: ECONNREFUSED on Analytics Dashboard"
Last update: The issue has been identified and a fix is being implemented.
Report: "Integration Error: "Oops, something went wrong" - Key Vault DB/Collection Name"
Last update: The issue has been identified and a fix is being implemented.
Report: "Push Notification Scheduling Issue"
Last update: The incident has been resolved. It was caused by an infrastructure update that temporarily prevented scheduled push notifications from being processed. A fix has been deployed to restore normal operation. However, some notifications may still appear as pending. If necessary, please contact customer support to manually trigger their delivery.
Report: "Stripe Invoice Delivery Issue"
Last update: This incident has been resolved.
We are currently experiencing an issue where Stripe is not sending invoices to some end-users for their ticket purchases. Our team is actively working with Stripe to resolve this issue. Stripe is aware of the problem and expects to have it resolved by Thursday, March 27th, 2025. At that time, all email addresses registered for affected customers will receive their payment receipt emails and other relevant Stripe communications. Additionally, we have requested that any missed emails be sent out, though confirmation of this action is still pending. We appreciate your understanding and will continue to monitor the situation. Please reach out if you have any further questions.
Report: "Event app is unavailable"
Last update: The incident has been resolved, and a mitigation measure has been implemented. An investigation is underway to identify the root cause and prepare a post-mortem report.
Report: "Unstable Response Time in Event App & Studio"
Last update: This incident has been resolved.
Report: "Chat & Login are unavailable"
Last update: We are providing a detailed post-mortem report regarding the unavailability of Chat & Login services that affected Swapcard customers on Sunday, December 15th, 2024. This issue was caused by an unresponsive internal gRPC service, which led to temporary login failures and Chat unavailability. ### **Incident Summary** On Sunday, December 15th, 2024, Swapcard experienced a service disruption due to a misconfiguration in one of our internal gRPC services. This issue resulted in the following impacts: * **Chat:** Users were unable to send or receive messages. * **Login:** Some users were unable to log in or reauthenticate their sessions. The disruption lasted approximately 10 minutes before the Engineering team mitigated the issue and fully restored service. ### **Timeline of Events \(UTC\)** 7:45 AM UTC | Swapcard’s monitoring systems detected Chat & Login unavailability, and users reported login failures. The on-call team was immediately alerted. 7:48 AM UTC | The team began investigating and identified that an internal gRPC service was unresponsive due to misconfiguration. 7:50 AM UTC | The Engineering team fixed the misconfiguration and restarted the affected gRPC service, restoring Chat & Login functionality. 7:55 AM UTC | The issue was fully resolved, and normal Chat & Login operations resumed. Our team continued to monitor the system to ensure stability. ### **Mitigation Deployment** Once the unresponsive gRPC service was identified, Swapcard’s team promptly corrected the misconfiguration and restarted the service. This action immediately restored Chat & Login functionalities. Throughout the incident, our monitoring tools continued to observe system performance, ensuring no data loss occurred. ### **Root Cause** A configuration error in the internal gRPC service led to connection failures between the Chat & Login components and their backend services. This resulted in a temporary inability for users to send/receive messages and for some users to authenticate successfully. We apologize for the inconvenience caused by this downtime and appreciate your patience as we worked to resolve the issue. If you have any further questions, please don’t hesitate to reach out.
This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
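The post-mortem above attributes the outage to an unresponsive, misconfigured internal gRPC service. As a purely illustrative sketch (not Swapcard's actual code; the target address is hypothetical), this is roughly how a probe using the standard gRPC health-checking protocol, with a hard deadline, can detect such a condition quickly:

```python
# Minimal sketch: probe a gRPC service via the standard health-checking
# protocol with a deadline, so an unresponsive backend is detected fast.
# Requires: pip install grpcio grpcio-health-checking
import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

TARGET = "chat-service.internal:50051"  # hypothetical address


def is_serving(target: str, timeout_s: float = 2.0) -> bool:
    channel = grpc.insecure_channel(target)
    try:
        stub = health_pb2_grpc.HealthStub(channel)
        resp = stub.Check(health_pb2.HealthCheckRequest(service=""),
                          timeout=timeout_s)
        return resp.status == health_pb2.HealthCheckResponse.SERVING
    except grpc.RpcError:
        # DEADLINE_EXCEEDED or UNAVAILABLE both mean "treat as down".
        return False
    finally:
        channel.close()


if __name__ == "__main__":
    print("serving" if is_serving(TARGET) else "not serving")
```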
Report: "Event app & Studio is unavailable"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Event app is unavailable"
Last update: We are providing a detailed post-mortem report regarding the app downtime that affected Swapcard customers on Friday, October 18th, 2024. This issue was caused by an unresponsive AWS database replica, which led to temporary login failures and app unavailability for a portion of users. The goal of this post-mortem is to share insights from our assessment and the steps taken to resolve the issue while providing transparency to our customers. ### Incident Summary On Friday, October 18th, 2024, Swapcard experienced a service disruption due to an issue with one of our AWS database replicas. Typically, when a database replica malfunctions, it can be restarted automatically to reduce downtime. However, in this case, the replica became stuck in a “Rebooting” state, displaying the message: “Recovery of the DB instance has started. Recovery time will vary with the amount of data to be recovered.” This behavior is unusual and has never been observed in our infrastructure. We are currently investigating the root cause of this with AWS support. The affected replica was critical for login operations, impacting user authentication. However, write operations were not impacted as they were being handled by another database node. This ensured no data loss occurred, even though the unresponsive replica was inaccessible during the incident. ### Timeline of Events \(UTC\) 3:15 AM UTC | Swapcard’s monitoring systems detected app downtime, and users reported login failures. The on-call team was immediately alerted. 3:16 AM UTC | The team began investigating and identified that one of the AWS database replicas responsible for handling login authentication was unresponsive and stuck in the “Rebooting” state. 3:27 AM UTC | Continued investigations confirmed that the replica was unable to recover from the malfunction, and the automatic redundancy system did not reroute traffic as expected. The on-call team began working on a manual solution. 3:33 AM UTC | A manual switch was initiated, rerouting traffic to another functioning replica. This restored login functionality. 3:39 AM UTC | The issue was fully resolved, and normal app operations resumed. Our team continued to monitor the system to ensure stability. ### Mitigation Deployment Once the unresponsive database replica was identified, Swapcard’s team manually rerouted traffic to another database node, restoring login functionality. Throughout the incident, write operations continued unaffected on another node, ensuring no data loss. After restoring service, we kept monitoring the system and worked closely with AWS to diagnose the unusual behavior of the replica. ### Technical Details * **Replica Failure**: The database replica became stuck in a “Rebooting” state, displaying the message “Recovery of the DB instance has started. Recovery time will vary with the amount of data to be recovered.” This is highly unusual, and we are investigating with AWS support to understand why it occurred and how to prevent similar incidents in the future. * **Automatic Failover:** Normally, our system automatically switches to a healthy replica when one becomes unresponsive. However, in this case, the failover did not function as expected, which caused the delay in restoring service. * **Data Integrity**: Write operations continued on another database node, ensuring that there was no data loss even though the affected replica was inaccessible.
### Forward Planning To prevent similar issues in the future: * **Component-Specific Improvements**: We are working to enhance the failover mechanisms specific to this critical component to ensure more reliable performance in the event of replica failures. * **Collaboration with AWS**: We are continuing to work closely with AWS support to understand the root cause of the database replica failure and the unusual behavior observed during the recovery process. * **System Resilience**: We are strengthening our failover systems to ensure automatic redundancy works as expected in all situations, further minimizing downtime. We apologize for the inconvenience caused by this downtime and appreciate your patience as we worked to resolve the issue. If you have any further questions, please don’t hesitate to reach out. **-- Update 10:20 AM UTC --** After an investigation by AWS, it has been confirmed that the issue is due to a rare hardware failure. The AWS team is currently replacing the hardware overnight to restore additional compute capacity. There is no impact on the platform at this time, as traffic has been rerouted to other replicas. Please be aware that while we strive to provide high availability, occasional hardware or communication issues can occur. In such cases, our monitoring system will detect the issue and initiate recovery actions to restore the cluster to a healthy and available state. However, replacing the underlying infrastructure may take some time on the AWS side. **-- Update 21st Oct. 7:45 AM UTC --** AWS has resolved the case & the impacted replica is up again.
The issue affecting the event app has been resolved. A post-mortem report will follow shortly. We are confident that this was an unusual incident and will not recur in the coming days. A more robust mitigation plan has already been defined to address any potential similar issues in the future. -- Update 10:20 AM UTC -- After an investigation by AWS, it has been confirmed that the issue is due to a rare hardware failure. The AWS team is currently replacing the hardware overnight to restore additional compute capacity. There is no impact on the platform at this time, as traffic has been rerouted to other replicas. Please be aware that while we strive to provide high availability, occasional hardware or communication issues can occur. In such cases, our monitoring system will detect the issue and initiate recovery actions to restore the cluster to a healthy and available state. However, replacing the underlying infrastructure may take some time on the AWS side. -- Update 21st Oct. 7:45 AM UTC -- AWS has resolved the case & the impacted replica is up again.
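The post-mortem above describes a read replica stuck in a "Rebooting" state and a failover that did not trigger automatically, resolved by manually rerouting traffic to a healthy node. As a hedged illustration only (not Swapcard's tooling; the instance identifiers and region are hypothetical, and it assumes the replicas are AWS RDS instances), a small sketch of how such a stuck replica can be detected and excluded from an application-side read pool:

```python
# Sketch: detect a read replica stuck in an unhealthy state (e.g. "rebooting")
# and exclude it from the application's read-endpoint pool.
# Assumes AWS RDS replicas; identifiers and region are hypothetical.
import boto3

READ_REPLICAS = ["login-replica-1", "login-replica-2"]  # hypothetical IDs


def healthy_read_endpoints(region: str = "eu-west-1") -> list[str]:
    rds = boto3.client("rds", region_name=region)
    endpoints = []
    for instance_id in READ_REPLICAS:
        desc = rds.describe_db_instances(DBInstanceIdentifier=instance_id)
        instance = desc["DBInstances"][0]
        if instance["DBInstanceStatus"] == "available":
            endpoints.append(instance["Endpoint"]["Address"])
        else:
            # e.g. "rebooting": route reads elsewhere until it recovers
            print(f"excluding {instance_id}: status={instance['DBInstanceStatus']}")
    return endpoints


if __name__ == "__main__":
    print(healthy_read_endpoints())
```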
Report: "Email Delivery Latency"
Last update: We are providing a detailed post-mortem report regarding the email delivery delays that affected Swapcard customers on Thursday, October 17th, 2024. This issue was caused by a network problem between our email service provider, Mailgun, and a third-party provider, which delayed the sending of OTP emails and other communications. The goal of this post-mortem is to share insights from our assessment and the steps taken to resolve the issue while providing transparency to our customers. ### Incident Summary On Thursday, October 17th, 2024, Swapcard experienced delays in email deliveries through Mailgun. While Mailgun reported the issue as affecting specific IP ranges \([159.135.132.XXX](http://159.135.132.XXX)\) not in use by Swapcard, we were still impacted. The root cause was a network issue on Mailgun’s side with a third-party provider, delaying email sending for several hours. Other email providers may have been affected by this network issue as well, meaning switching to a backup provider could have resulted in the same delays, along with further complications with IP reputation and email delivery volume. The delay primarily affected OTP & magic links emails required for user authentication, and users repeatedly clicked the "Send OTP" button, which added to the email queue and led to further delays. Although all emails were eventually delivered, some were delayed by 15-20 minutes. ### Timeline of Events \(UTC\) **Timeline Reported by Mailgun:** * 12:12 PM UTC | Incident identified. Due to network connectivity issues with a third-party provider, emails sent through Mailgun's EU region experienced delivery delays. The issue was identified, and a fix was in progress. * 1:34 PM UTC | Mailgun continued working with their networking partner on the fix. Only customers sending with IPs in the 159.135.132.0/24 range were reported as impacted by Mailgun. * 3:30 PM UTC | Connectivity was restored, and Mailgun began monitoring the traffic backlog as email sending resumed. * 5:26 PM UTC | Mailgun continued processing the sending backlog at a reduced rate as the system recovered. Delays for the affected network were still expected. * 5:36 PM UTC | Further mitigations were being worked on by Mailgun to resolve the networking issues. Messages continued to be delayed until a full fix was implemented. * 7:15 PM UTC | Mailgun continued working with their network provider to bring the affected range fully online. * 8:02 PM UTC | Mailgun implemented a fix, and traffic resumed through their networks. They continued monitoring to ensure the resolution was effective. * 8:41 PM UTC | The incident was fully resolved, and normal email delivery was restored. **Swapcard Actions:** * 12:12 PM UTC | Swapcard monitoring systems detected delays in email deliveries, particularly with OTP & magic links emails for user authentication. Swapcard promptly triggered the Incident Response Team to investigate. * 12:12 PM UTC | Mailgun confirmed the network issue and began working on a fix. * 12:15 PM UTC | Initial investigations identified that the issue was related to Mailgun, and Swapcard began working closely with Mailgun to ensure a timely resolution. * 12:15 PM - 8:41 PM UTC | Swapcard continuously monitored the situation, provided updates internally, and ensured that email traffic was processed once Mailgun's fix was implemented. ### Mitigation Deployment As soon as the issue was identified, Swapcard worked closely with Mailgun to resolve the network issue.
During the delay, we monitored the sending backlog and ensured that emails were processed as the system recovered. Once the fix was implemented, Swapcard continued to monitor the system to ensure all emails were delivered. ### Improvements to Login Error Messaging During the incident, many users encountered login issues due to delayed OTP & magic links emails. Previously, users received a generic error message: _"Oops! Something went wrong."_ To reduce confusion in future incidents, we have now updated the message to be more informative: _"Oops! It looks like there have been too many login attempts on your account. Please take a short break and try again in a few minutes."_ This should help guide users more effectively during login failures and minimize repeated OTP & magic links requests. ### Forward Planning To prevent similar issues in the future: * We are working with Mailgun to fully understand why Swapcard was affected by an issue linked to an IP range we don’t use. * **We will continue to improve our communication process to ensure more timely updates during incidents.** Despite having potential backup email providers, switching during the incident would likely have caused more harm than benefit due to the volume of emails we send and the importance of maintaining our IP reputation. Additionally, this network issue may have affected other providers, meaning a switch would not have fully resolved the issue. We apologize for the inconvenience caused by this delay and appreciate your patience as we worked to resolve the issue. If you have any further questions, please don’t hesitate to reach out.
On October 17, 2024, we experienced a delay in email deliveries due to an issue with our email provider, Mailgun. The delay was caused by a malfunction in a cloud function used by Mailgun, impacting email sending. While Mailgun reported the issue affecting specific IP ranges that we don’t use, we were still impacted. We are currently investigating this with Mailgun to understand why we were affected. The incident did not breach the 99.9% SLA, and all emails were delivered, though with a 15-20 minute delay in some cases. We will improve our communication process for future incidents and ensure timely updates through our status page. Additionally, since email delivery was delayed and OTPs were required for user authentication, many users repeatedly pressed the “Send OTP” button, leading to even more queued emails and further delays. To help reduce confusion in future incidents, we have updated the generic error message users encountered when login issues occurred. The old message, "Oops! Something went wrong," has been replaced with a more informative one: "Oops! It looks like there have been too many login attempts on your account. Please take a short break and try again in a few minutes." This should help guide users more effectively during such scenarios. A public post-mortem will follow.
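Both updates above note that repeated clicks on "Send OTP" added duplicate messages to an already-delayed queue. As a hedged sketch of server-side resend throttling (in-memory and per-address for illustration only; a production system would more likely use a shared store such as Redis, and these limits are assumptions, not Swapcard's values):

```python
# Sketch: throttle OTP / magic-link resend requests per email address so a
# delayed provider does not get flooded with duplicate messages.
import time

RESEND_COOLDOWN_S = 60          # at most one resend per minute per address
MAX_PENDING_PER_ADDRESS = 3     # cap queued, undelivered messages per address
_last_sent: dict[str, float] = {}
_pending: dict[str, int] = {}


def may_send_otp(email: str) -> bool:
    """Return True if a new OTP email may be enqueued for this address."""
    now = time.monotonic()
    if now - _last_sent.get(email, 0.0) < RESEND_COOLDOWN_S:
        return False
    if _pending.get(email, 0) >= MAX_PENDING_PER_ADDRESS:
        return False
    _last_sent[email] = now
    _pending[email] = _pending.get(email, 0) + 1
    return True


def mark_delivered(email: str) -> None:
    """Call when the provider confirms delivery, freeing one pending slot."""
    _pending[email] = max(0, _pending.get(email, 0) - 1)
```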
Report: "Elevated latency on event app"
Last update: We are ready to provide a detailed post-mortem report regarding the service disruption that affected Swapcard customers on Tuesday, September 10th, 2024, at 13:30 UTC. The issue arose across several of our apps \(including both web and mobile platforms\) due to an unexpected traffic surge during an automatic scale-down phase triggered by fluctuating CPU usage. The goal of this post-mortem is to share insights from our initial assessment, as published on the Swapcard status page, and to detail the corrective measures we’ve implemented to restore service to normal. ## **Incident summary** On Tuesday, September 10th, 2024, at 13:30 UTC, we encountered elevated latency across our applications \(both web and mobile\), resulting in various user errors and prolonged or incomplete application loads. This latency spike was linked to a traffic surge during an automatic scaling phase. Our system adjusts to incoming traffic by scaling up or down based on load; however, an influx of events caused a delayed scale-up following a slow scale-down, leading to fluctuations in response time as the system attempted to adapt. Swapcard's monitoring systems detected the disruption and promptly activated our Incident Response team. The team took immediate action to triage and mitigate the issue by disabling automatic scaling and forcing an aggressive scale-up. This overscaling approach helped limit further impact from the traffic surge while we implemented improvements. Concurrently, we have launched an investigation to refine our handling of such traffic patterns and optimize our scaling configurations for this type of traffic event sequence \(up, low, up, low\) in the future. ## **Mitigation deployment** At 13:35 UTC, our infrastructure team immediately addressed the issue by disabling automatic scaling and manually triggering an aggressive scale-up. This manual override of our usual scaling process took approximately five minutes. The delay was not in the scaling itself but in ensuring that the changes effectively overrode the default behavior and provided a stable foundation to handle the surge in traffic. As the update propagated through our infrastructure, the errors steadily decreased and eventually stopped. Swapcard’s engineering team continued to monitor system metrics to ensure full recovery. At 13:44 UTC, after further monitoring and detecting no additional issues, we confirmed that the incident had been fully resolved. ## **Event Outline** ### **Events of 2024 September 10th \(UTC\)** \(13:30 UTC\) | Elevated latency on our applications \(13:31 UTC\) | Disruption identified by Swapcard monitoring \(13:35 UTC\) | Manual override of our usual scaling process \(13:37 UTC\) | Errors decreasing \(13:40 UTC\) | Incident mitigated Affected customers may have been impacted by varying degrees and with a shorter duration than described above. ## **Forward Planning** Swapcard has begun enhancing the scaling algorithm to prevent frequent recurrences of this type of incident, in line with our high standards for deliverability. Some improvements were deployed during the night of September 11-12th to strengthen our system for handling such scenarios.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
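The post-mortem above describes an up/low/up/low traffic sequence in which a slow scale-down was followed by a delayed scale-up. As a purely illustrative, hedged sketch of the general idea behind an asymmetric scaling policy (scale up aggressively, scale down slowly with a cooldown); the thresholds and structure below are assumptions, not Swapcard's actual configuration:

```python
# Sketch: asymmetric autoscaling decision - react fast to load increases and
# slowly to decreases, so an "up, low, up, low" traffic pattern does not leave
# the fleet under-provisioned when traffic returns.
from dataclasses import dataclass
import time


@dataclass
class ScalePolicy:
    min_replicas: int = 4
    max_replicas: int = 64
    scale_up_cpu: float = 0.70        # double capacity above 70% CPU
    scale_down_cpu: float = 0.30      # shrink only below 30% CPU
    scale_down_cooldown_s: float = 900.0
    _last_scale_down: float = 0.0

    def target(self, current: int, cpu: float) -> int:
        now = time.monotonic()
        if cpu > self.scale_up_cpu:
            return min(self.max_replicas, current * 2)       # aggressive up
        if (cpu < self.scale_down_cpu
                and now - self._last_scale_down > self.scale_down_cooldown_s):
            self._last_scale_down = now
            return max(self.min_replicas, current - 1)        # one step down
        return current


# Usage: policy.target(current_replicas, observed_cpu) each evaluation cycle.
```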
Report: "Issue with login service prevent user to logged-in"
Last update: We are prepared to provide a detailed post-mortem report regarding a service disruption that impacted Swapcard customers on Friday, September 6th, 2024 at 11:39 UTC. During this incident, we encountered an issue with the login service preventing users from logging in. The purpose of this post-mortem is to share insights into our initial assessment of the situation, as communicated on the Swapcard status page, and to outline the corrective actions we've taken to restore normal service. ## **Incident summary** On Friday, September 6th, 2024 at 11:39 UTC, we experienced an issue with the login service preventing users from logging in after an upgrade to the application database drivers. This upgrade had been running successfully on our staging environment for multiple weeks and was approved for production. Swapcard monitoring detected the start of disruption and activated the Swapcard Incident Response team. Swapcard’s team worked to triage and mitigate the incident by reverting the deployment of the login service. In parallel, an investigation was launched to fix the issue, which revealed that an observability package interfered with the database drivers' functionality. We noticed a different configuration in this package between our staging and production environments. ## **Mitigation deployment** At 11:39 UTC, our infrastructure team immediately reverted the deployment of the login service. The switch process took around five minutes. The errors kept decreasing and stopped as the update propagated through our infrastructure. Swapcard engineering then monitored the login pages to ensure full and proper recovery. At 12:01 UTC, Swapcard confirmed that the rollback was completed and no further login issues were detected or reported. ## **Event Outline** ### **Events of 2024 September 6th \(UTC\)** \(11:39 UTC\) | Login service is disrupted, preventing users from logging in \(11:44 UTC\) | Disruption identified by Swapcard monitoring \(11:51 UTC\) | Deployment of the login service is reverted \(11:57 UTC\) | Errors decreasing \(12:01 UTC\) | Incident mitigated Affected customers may have been impacted by varying degrees and with a shorter duration than described above. ## **Forward Planning** Swapcard has deployed a permanent fix for this incident in accordance with our high standards of deliverability. Additional monitoring and processes have been set up to prevent similar issues from happening in the future.
This incident has been resolved.
Report: "Increase of error rates "Oops, something went wrong""
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Report: "Analytics is taking longer than usual to load"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Playback issues and delays in video processing"
Last update: MUX is reporting that the issue is resolved and being monitored. > We have finished the processing backlog of ingest jobs and Mux Video is operating normally. The engineering team continue to monitor the system health.
Live functionality has been restored and confirmed by MUX. MUX is currently working on resolving delays on VOD.
Mux reports having identified the issue and is working on mitigation; playback has been operating normally since 13:30 UTC. > We have identified and are actively mitigating the issue. Issues with playback are resolved as of 13:30 UTC, and we are working through the ingest backlog.
https://status.mux.com/incidents/sqqbbw7ytyth Our video provider, MUX, is currently experiencing issues with playback and delays in video processing. We are closely monitoring the situation with their team to provide an estimated resolution time to our affected customers.
Report: "Increase of 400 http code on SSO IdP endpoint"
Last update: Summary: We experienced an increase in 400 HTTP response codes on Single Sign-On (SSO) Identity Provider (IdP) authentication requests. This issue was identified and traced back to a problem in our IdP application logic. Resolution: The issue has been fixed, and normal functionality has been restored.
Report: "Blank page on studio.swapcard.com"
Last update: We are prepared to provide a detailed post-mortem report regarding a service disruption that impacted Swapcard customers on Wednesday, June 12th, 2024. During this incident, we encountered a blank page on Event Studio. The purpose of this post-mortem is to share insights into our initial assessment of the situation, as communicated on the Swapcard status page, and to outline the corrective actions we've taken to restore normal service. ## **Incident summary** On Wednesday, June 12th at 17:25 UTC, we experienced a blank page on Event Studio due to a corrupted cached file version on our CDN in front of \([studio.swapcard.com](http://studio.swapcard.com/)\). Swapcard monitoring detected the start of disruption and activated the Swapcard Incident Response team. Swapcard’s team worked to triage and mitigate the incident by checking any recent changes to the web application \([studio.swapcard.com](http://studio.swapcard.com/)\) configuration and reverting to an older version. In parallel, the CDN cache was purged and forced to retrieve the new file versions from our origin endpoint. ## **Mitigation deployment** At 17:30 UTC, our infrastructure team immediately reverted the configuration changes for [studio.swapcard.com](http://studio.swapcard.com/). The switch process took around four minutes. Error reports stopped as the update propagated through our infrastructure. Swapcard engineering then monitored [studio.swapcard.com](http://studio.swapcard.com/) to ensure full and proper recovery. As a result of the deployment of that change, customers would then see a reduction of the blank page errors. At 17:32 UTC, Swapcard confirmed that the update was completed and no further errors were detected or reported. ## **Event Outline** ### **Duration Summary** Time alerted to the outage: 1 minute Time to identify the source of disruption: ~2 minutes Time to initiate recovery: ~4 minutes Time to monitor and restore service pre-crash: ~1 minute ### **Events of 2024 June 12th \(UTC\)** \(17:24 UTC\) | Initial onset of the blank page error \(17:25 UTC\) | Disruption identified by Swapcard monitoring \(17:28 UTC\) | Configuration revert has been initiated \(17:30 UTC\) | Blank page errors decreased and recovered \(17:32 UTC\) | Incident mitigated \(17:35 UTC\) | Swapcard Engineering redeployed the corrected changes Affected customers may have been impacted by varying degrees and with a shorter duration than described above. ## **Forward Planning** Swapcard has deployed a permanent fix for this incident in accordance with our high standards of deliverability.
This incident has been resolved.
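The post-mortem above traces the blank page to a corrupted cached file version on the CDN, fixed by a configuration revert and a cache purge. As a hedged illustration of the kind of post-deploy smoke check that can catch stale or corrupted CDN entries early (the URL, regex, and asset layout are illustrative assumptions, not Swapcard's actual build):

```python
# Sketch: after a deploy or cache purge, fetch the app shell and confirm every
# referenced script actually resolves, to catch stale/corrupted CDN entries.
import re
from urllib.parse import urljoin

import requests

APP_URL = "https://studio.swapcard.com/"  # page to check


def smoke_check(app_url: str = APP_URL) -> bool:
    page = requests.get(app_url, timeout=10)
    page.raise_for_status()
    scripts = re.findall(r'<script[^>]+src="([^"]+)"', page.text)
    ok = True
    for src in scripts:
        resp = requests.head(urljoin(app_url, src), timeout=10)
        if resp.status_code != 200:
            print(f"broken asset: {src} -> {resp.status_code}")
            ok = False
    return ok


if __name__ == "__main__":
    raise SystemExit(0 if smoke_check() else 1)
```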
Report: "Delay in scheduled email delivery"
Last update: **I. Executive Summary:** We are issuing a detailed post-mortem report on a service disruption that occurred on May 13th, 2024 and delayed email deliveries for Swapcard users. This incident was caused by an upgrade to a sub-dependency used for internal service communication, which led to unexpected service crashes in very specific and rare cases. **II. Impact Analysis:** **User Impact:** * Delay in receiving scheduled emails. **Service Impact:** * The upgraded sub-dependency caused intermittent crashes and delays in the email delivery system across our platform, occurring under highly specific conditions that eluded initial detection. **III. Mitigation and Resolution:** To resolve the issue, we reverted the upgraded sub-dependency to its previous stable version. Our further analysis identified a malfunction in our health check system, which failed to restart the service automatically. We have since repaired the health check to ensure it functions correctly, enhancing the reliability of our systems to prevent similar disruptions in the future. **IV. Forward Planning:** In response to this incident, Swapcard is reinforcing our testing protocols and enhancing our monitoring systems to detect such inconsistencies more effectively, even in the most unusual scenarios, before they affect our production environment. We are committed to continually improving our processes and infrastructure to uphold the high-quality service our users expect. We sincerely apologize for any inconvenience caused by this disruption. For further assistance or additional information, please feel free to contact our support team. We value your understanding and continued support.
On Monday, May 13th, we experienced a disruption in our email delivery system, which resulted in delays in sending scheduled emails. The issue has been addressed, though the root cause is still under investigation. It appears to be related to a malfunction in the retry mechanism intended for system reliability. Our Site Reliability Engineering team will provide a detailed postmortem in the coming days. Please be assured that the incident has been resolved.
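The post-mortem above notes that a malfunctioning health check failed to restart the crashed email worker automatically. As a hedged sketch of a lag-based health check that a supervisor or orchestrator could use to restart such a worker (the queue hook and threshold are hypothetical, not Swapcard's implementation):

```python
# Sketch: health check for an email scheduler that reports unhealthy when the
# oldest pending scheduled email exceeds a lag threshold, so a supervisor or
# orchestrator can restart the worker automatically.
MAX_LAG_S = 300  # assume emails should leave the queue within 5 minutes


def get_oldest_pending_age_s() -> float:
    """Hypothetical hook: age in seconds of the oldest unsent scheduled email."""
    raise NotImplementedError


def healthcheck() -> tuple[bool, str]:
    try:
        lag = get_oldest_pending_age_s()
    except Exception as exc:
        # If the queue itself cannot be read, treat the worker as unhealthy.
        return False, f"queue check failed: {exc}"
    if lag > MAX_LAG_S:
        return False, f"oldest pending email is {lag:.0f}s old (max {MAX_LAG_S}s)"
    return True, "ok"
```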
Report: "Go live button in backstage session is not working"
Last update: **I. Executive Summary:** We are issuing a post-mortem report regarding a **backstage service disruption** that affected Swapcard customers on Thursday, April 5th, 2024. The incident was linked to an outage of our video backstage infrastructure provider 100ms [https://status.100ms.live/incidents/ww76zjqsj6j9](https://status.100ms.live/incidents/ww76zjqsj6j9). The issue affected rooms located in the Europe region. The initiation of RTMP utilized a cluster-specific, legacy endpoint. Prior to this incident, despite successful recording and streaming operations, the session and recording processes were executed in separate clusters. A recent update introduced a modification whereby the recording process would terminate if it did not detect the session within the same cluster, leading to recorder failures. **II. Impact Analysis:** * **User Impact:** Backstage streaming capabilities * **Service Impact:** Streaming through Swapcard's backstage functionalities was experiencing issues, resulting in the "Go Live" button being reset immediately after activation. This was due to failures in the recording and streaming processes. **III. Mitigation Deployment:** To address the issue, our video infrastructure provider has taken significant steps by introducing a new API endpoint that guarantees intelligent routing. This advanced endpoint has been specifically designed to enhance the stability and reliability of our streaming and recording functionalities. **IV. Forward Planning:** Consistent with our dedication to maintaining high-quality service, Swapcard is thoroughly reviewing our communication with our external video backstage infrastructure provider to avoid similar issues in the future. This incident has led us to improve our processes and establish stronger measures to prevent reoccurrences. We are grateful for your patience and ongoing support. We deeply regret any inconvenience this disruption may have caused. Should you need further assistance or additional information, please feel free to contact our support team. Your understanding and cooperation are highly appreciated.
This incident has been resolved.
We are currently investigating an issue with our backstage provider 100ms impacting our backstage functionality. The issue seems related to today's incident on 100ms infrastructure in Europe: https://status.100ms.live/incidents/ww76zjqsj6j9
Report: "Push notification not properly delivered, despite several people being targeted"
Last update: A recent update to our notification API inadvertently led to some push notifications being discarded or improperly delivered. We promptly identified the issue and deployed a fix at 10 AM (UTC) to rectify the situation. We sincerely apologise for any inconvenience this may have caused and assure you that we are continuously working to improve our services to ensure a seamless experience for all our users. Thank you for your understanding and continued support.
Report: "High response time on the Event App"
Last update: **Title: Post-Mortem Analysis - January 8, 2024 Incident** **I. Executive Summary:** We are issuing a post-mortem report regarding a service disruption that affected Swapcard customers on Monday, January 8th, 2024, from 8:37 UTC to 9:55 UTC. The incident was linked to a specific event configuration causing high load on our databases, specifically related to the meeting feature creating a large number of combinations due to events with extensive locations and slots, resulting in over 9 million combinations on some events. **II. Incident Overview:** * **Incident Description:** On Monday, January 8th, 2024, at 8:37 UTC, a service disruption impacted the Swapcard platform due to high load on our databases caused by the meeting feature's extensive combinations. The incident was resolved at 9:55 UTC. * **Timeline:** * 8:37 UTC: Initial onset of service disruption observed, with high load on databases. * 8:37 UTC: Detection of service disruption by Swapcard monitoring. * 8:45 UTC: Identification of the meeting feature causing extensive combinations. * 9:55 UTC: Successful resolution of the incident and implementation of patches. * Post-incident: Confirmation of the resolved status on the Swapcard status page. **III. Root Cause Analysis:** * **Immediate Cause:** The incident was triggered by a specific event configuration causing high load on the databases. * **Underlying Causes:** The meeting feature led to an exceptionally large number of combinations due to events with extensive locations and slots, resulting in over 9 million combinations on some events. * **Mitigation:** Optimization of SQL queries and indexes, and implementation of hard limits on locations to prevent similar incidents. **IV. Impact Analysis:** * **User Impact:** Event App, Exhibitor Center and Studio experienced downtime. Elevated latency was observed following the resolution of the incident for a few minutes. * **Service Impact:** The master database responsible for data writes on Swapcard was unaffected, meaning that data coming from integration services was unaffected. Data integrity was not compromised. Only the read-only replicas of the database were affected by this incident. **V. Mitigation Deployment:** Upon identifying the root cause, immediate actions were taken to optimize SQL queries and implement hard limits on meetings. These measures ensure the platform is better adapted to handle such use-cases in the future. **VI. Forward Planning:** In line with our commitment to service deliverability, Swapcard is undertaking a comprehensive review of the technical architecture of the meeting feature, event configuration and caching mechanisms. This incident has prompted us to enhance procedures and controls to prevent similar occurrences in the future. We appreciate your understanding and continued support. We sincerely apologize for any inconvenience caused by this disruption. If you have further concerns or require additional information, please don't hesitate to reach out to our support team. Thank you for your patience and collaboration.
This incident has been resolved.
The issue has been identified and a fix is being implemented.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
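The January 8 post-mortem above attributes the database load to meeting location and slot combinations exceeding 9 million on some events, with hard limits introduced as part of the mitigation. As a hedged, purely illustrative sketch of validating an event configuration against such a cap before any combinations are generated (the limit value is an assumption, not Swapcard's actual setting):

```python
# Sketch: reject event meeting configurations whose location x time-slot
# combination count would exceed a hard cap, before anything is generated.
MAX_MEETING_COMBINATIONS = 100_000  # illustrative cap, not the real limit


def validate_meeting_config(num_locations: int, num_slots: int) -> None:
    combinations = num_locations * num_slots
    if combinations > MAX_MEETING_COMBINATIONS:
        raise ValueError(
            f"{num_locations} locations x {num_slots} slots = "
            f"{combinations:,} combinations, which exceeds the limit of "
            f"{MAX_MEETING_COMBINATIONS:,}"
        )


# Example: an event configured with 3,000 locations and 3,000 slots would be
# rejected up front instead of producing ~9,000,000 combinations at query time.
```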
Report: "Session pages not loading in Backstage format"
Last update: The incident commenced around 6:30 PM UTC. We are now initiating an in-depth post-mortem analysis to unveil the root cause. In our preliminary examination, we identified that the root cause was associated with the build and linking of OpenSSL in one of our dependencies, which was constructed using cross-compiled images. This resulted in glibc being inadvertently statically linked, leading to complications. The problem appeared sporadically and went unnoticed for a few hours. No issues have been detected since the deployment of this version in our production, staging, and testing clusters. The problem has been rectified, and an additional layer of security has been implemented to prevent the recurrence of a similar issue. We will soon furnish a more detailed post-mortem analysis.
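The preliminary analysis above points to glibc being inadvertently statically linked when a dependency was built in cross-compiled images. As a hedged sketch of the kind of build-time guard that can catch this (the artifact path is hypothetical, and this is only one possible check, not Swapcard's pipeline):

```python
# Sketch: CI guard that fails the build if glibc does not appear to be
# dynamically linked in the produced artifact. A statically linked binary
# makes `ldd` report "not a dynamic executable" or omit libc.so entirely.
import subprocess
import sys

ARTIFACT = "dist/native-module.so"  # hypothetical build output


def glibc_dynamically_linked(path: str) -> bool:
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    output = result.stdout + result.stderr
    if "not a dynamic executable" in output:
        return False
    return "libc.so" in output


if __name__ == "__main__":
    if not glibc_dynamically_linked(ARTIFACT):
        print(f"{ARTIFACT}: glibc does not appear to be dynamically linked")
        sys.exit(1)
```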
Report: "Significant latency and impaired user experience impacting Event App"
Last update: The incident began around 2:00 PM (CEST) and the system autonomously returned to normal around 2:05 PM (CEST). We will now perform a comprehensive post-mortem analysis to uncover the underlying issue. During our initial investigation, we found a combination of both correlated and uncorrelated events, such as fluctuations in request volumes, redeployments, server activity, and cache congestion, which collectively exerted substantial strain on Swapcard systems, resulting in elevated latency and a compromised user experience.
We are ready to furnish a comprehensive post-incident analysis concerning a service disruption that affected our Event App product. In the course of this incident, Event App users encountered significant latency and an impaired user experience leading to “Stay tuned” unavailability pages and potentially causing “Oops, something went wrong” during the login phase. The purpose of this post-mortem is to share insights into our initial assessment of the situation, as communicated on the Swapcard status page, and to outline the corrective actions we've taken to restore normal service. ## Incident Overview In the first week of November, we identified a technical issue that initially went unnoticed by our internal monitoring system. Typically, incidents of this nature are promptly detected by our automated monitoring system, but this time, it wasn't the case. This delayed the reporting on our status page. We became aware of the incident after receiving reports of service unavailability during a specific timeframe. The Swapcard Response Team promptly addressed this concern by conducting a thorough investigation into the reported timeline. On October 25th, around 2:00 PM CET, Swapcard experienced a series of both related and unrelated events that exerted significant pressure on our various systems. These events can be summarised as a substantial increase in inbound traffic coinciding with a service redeployment and congestion on our infrastructure nodes. It's worth highlighting that individually, none of these events typically disrupt our services. Swapcard is accustomed to efficiently managing large trade shows with numerous attendees and a continuous influx of inbound requests without causing any disruptions. In order to avoid singling out a particular event, we regard these occurrences as a singular disruption. Our investigation has revealed that it is the culmination of various unexpected events that leads to disruptions in the user experience on our Event App. The system automatically recovered from this disruption approximately five minutes after it began, at 2:05 PM CET. This automatic recovery is the result of several safeguards implemented by our Site Reliability Engineering team to ensure fast recovery and mitigation of such incidents for our customers and end-users. ## Mitigation and Resolution As soon as the incident was reported, our team immediately initiated an investigation to ensure a thorough understanding and accurate reporting of the situation. Throughout the investigative phase, which took place from November 1st to November 2nd, the team identified several areas for improvement to enhance the management of both related and unrelated events, which are seldom encountered together. In our dedication to providing an outstanding experience for Swapcard users, we have implemented and planned these changes to enhance the management of unforeseen events. These improvements encompassed: * Enhanced monitoring to more effectively detect unforeseen events, resulting in quicker incident assessment. * The Site Reliability Engineering team will implement additional safeguards to alleviate infrastructure node congestion and minimize the impact, reducing potential disruptions. * A review of the cache mechanism during this timeframe to further diminish the potential impact of disruptions when under high pressure. * An evaluation and update of the automatic redeployment procedure to enable a more incremental rollout, thereby reducing strain on infrastructure nodes.
* Reduction of the memory footprint of one of our logging systems, which was exerting pressure during high loads. ## Future Planning This incident has underscored the potential for enhancements in our processes and controls. While we already have established procedures in place, we acknowledge the opportunity for improvement. This proactive approach ensures that we continually strengthen the resilience of our systems and minimize the potential for disruptions.
Report: "Backstage experienced issues related to video, audio, and screen-sharing"
Last update: We are ready to furnish a comprehensive post-incident analysis concerning a service disruption that affected our Backstage product. In the course of this incident, Backstage users encountered problems with video, audio, and screen-sharing, resulting in content not being displayed correctly on the main broadcast stage. The purpose of this post-mortem is to share insights into our initial assessment of the situation, as communicated on the Swapcard status page, and to outline the corrective actions we've taken to restore normal service. ## Incident Overview In the week of October 17th, a technical issue came to our attention. This problem emerged when we added or removed speakers or moderators from the main stage. It had an adverse effect on the encoder, which is responsible for monitoring these modifications and generating the final broadcast output for our end users. As soon as we identified this critical problem, we promptly reported it to our service provider. On October 18th at 10:00 AM CET, our external service provider confirmed that they had recently released a significant backend update known as Mesh SFU, designed to handle large-scale sessions. Unfortunately, this update introduced a bug specific to a rare scenario involving role changes. ## Mitigation and Resolution As soon as the incident was reported, our team promptly informed our partner about the unusual behavior observed in the video/audio and screen-sharing features, which were not functioning as expected on the main stage. We assured our partner that we would address and resolve these issues within the agreed-upon Service Level Agreement \(SLA\). In our commitment to delivering an exceptional experience for Swapcard users, we have been leveraging our provider's API to create a distinctive workflow for our partners. To proactively prevent similar issues in the future, we are actively engaged in discussions to ensure this workflow is thoroughly tested in all of their testing scenarios. Additionally, we are exploring options to incorporate this scenario into our automated testing processes. The ultimate goal is to establish effective testing procedures and early detection of such bugs. ## Future Planning This incident has underscored the potential for enhancements in our processes and controls. While we already have established procedures in place, we acknowledge the opportunity for improvement. This proactive approach ensures that we continually strengthen the resilience of our systems and minimize the potential for disruptions.
The incident was resolved on 18th October at 10 AM CET.
Report: "Increase error rate on Event App, Studio & Exhibitor Center"
Last update: We would like to provide you with a post-mortem report concerning a service delivery disruption that impacted Swapcard customers on Monday, the 23rd of October, 2023, from 09:23 UTC to 09:28 UTC. The purpose of this post-mortem is to offer insights into our initial evaluation of the incident as communicated on the Swapcard status page and to outline the remedial measures we have implemented to restore service. ## **Incident Overview** On Monday, October 23rd, at 09:23 UTC, there was a service disruption impacting the Swapcard apps. This disruption was a result of scheduled and routine maintenance on one of our caching clusters. Unfortunately, this maintenance unexpectedly led to queries failing. These query failures were associated with a recently introduced caching script. Following the script's implementation, the caching client did not correctly time out on commands as originally configured for handling system disruptions, despite the service being designed to maintain fault tolerance in the event of caching system disruptions. ## **Incident Timeline** Events of October 23rd, 2023 \(UTC\): * 09:23 UTC: The initial onset of service disruption was observed, with queries failing or timing out across the Swapcard Apps. * 09:23 UTC: Swapcard monitoring detected a service disruption. * 09:24 UTC: Swapcard Engineering identified the caching cluster zero-downtime maintenance as the root cause. * 09:27 UTC: The incident was successfully mitigated, and the system regained its pre-incident capacity. * 09:30 UTC: The status was confirmed as resolved post-incident. Please note that the events are presented in chronological order with their respective timestamps for clarity and coherence. ## **Root Cause** The disruption was traced back to a caching cluster maintenance operation, which caused an unusual interruption even though this process is common and has been performed many times without issues. The failure was introduced by a new caching script shipped in a previous release, which prevented proper failover of the system. This resulted in service delivery issues and a disruption of service for Swapcard customers. ## **Mitigation Deployment** Upon identifying the root cause, we implemented a mitigation strategy to prevent further service disruptions. The caching client has been modified to ensure that queries do not fail during routine maintenance and that requests are properly redirected to and served by the main system if the caching mechanism is unavailable. ## **Forward Planning** In accordance with our commitment to maintaining high standards in service deliverability, Swapcard has taken several measures to prevent similar incidents in the future. This includes a comprehensive review of our caching failover mechanism during zero-downtime maintenance. Procedures and controls are already in place, and this incident has underscored the importance of continuous improvement in our service delivery processes. We apologize for any inconvenience this disruption may have caused and thank you for your understanding and continued support.
This incident has been resolved.
The issue has been identified and a fix is being implemented.
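The post-mortem above explains that the caching client failed to time out during cluster maintenance, defeating the intended fallback to the main system. As a hedged sketch of a cache read with a short timeout that degrades to the primary store on any cache error (redis-py is used purely for illustration; the host, key handling, and loader are hypothetical, not Swapcard's code):

```python
# Sketch: cache read with a short timeout that falls back to the primary
# datastore whenever the cache is slow or unavailable, so routine cache
# maintenance degrades performance instead of failing queries.
import redis

cache = redis.Redis(
    host="cache.internal",        # hypothetical host
    socket_connect_timeout=0.1,   # fail fast instead of hanging on commands
    socket_timeout=0.1,
)


def load_from_primary(key: str) -> bytes:
    """Hypothetical hook: fetch the value from the main database."""
    raise NotImplementedError


def get_with_fallback(key: str) -> bytes:
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except redis.RedisError:
        pass  # cache unreachable or slow: serve from the primary system
    return load_from_primary(key)
```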
Report: "504 gateway timeout error on the Developer API"
Last update: We are prepared to provide a detailed post-mortem report regarding a service disruption that impacted Swapcard customers on Wednesday, October 18th, 2023. During this incident, we encountered intermittent 504 gateway timeout errors on the Developer API. The purpose of this post-mortem is to share insights into our initial assessment of the situation, as communicated on the Swapcard status page, and to outline the corrective actions we've taken to restore normal service. ## Incident Overview On Wednesday, October 18th, at approximately 4 PM UTC, we observed a surge in 504 gateway timeout errors on the Developer API. This issue affected various external system integrations, excluding those provided by the Studio. Please note that the impact on affected customers may have varied in duration and severity. After conducting a thorough investigation, it was determined that the problem stemmed from a connectivity issue within our primary developer gateway. This issue led to routing problems, resulting in only one-third of the HTTP requests made during that period reaching the appropriate backend Developer APIs. Our Swapcard Response Team, in collaboration with other departments, identified and resolved the connectivity issue within approximately one hour from the initial report. ## Mitigation and Resolution The service interruption was promptly addressed as the network connectivity between the developer gateway and related backends was restored. Our Swapcard Incident Response team acted swiftly to mitigate the impact on our customers. This incident highlighted areas where we can make improvements to enable faster diagnosis of connectivity issues, network congestion, or related problems. ## Future Planning This incident has underscored the potential for enhancements in our processes and controls. While we already have established procedures in place, we acknowledge the opportunity for improvement. This proactive approach ensures that we continually strengthen the resilience of our systems and minimize the potential for disruptions.
This incident has been resolved.
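The post-mortem above notes that only about one-third of requests reached the backend APIs and that faster diagnosis of partial connectivity loss is an improvement area. As a hedged, illustrative sketch of a repeated reachability probe that estimates the success ratio (the URL and threshold are assumptions, not Swapcard's monitoring):

```python
# Sketch: probe an upstream repeatedly and report the success ratio, to
# surface partial connectivity loss (e.g. only 1/3 of requests getting
# through) faster than waiting for customer reports.
import requests


def probe_success_ratio(url: str, attempts: int = 30,
                        timeout_s: float = 3.0) -> float:
    ok = 0
    for _ in range(attempts):
        try:
            if requests.get(url, timeout=timeout_s).status_code < 500:
                ok += 1
        except requests.RequestException:
            pass  # timeouts and connection errors count as failures
    return ok / attempts


if __name__ == "__main__":
    ratio = probe_success_ratio("https://example-gateway.internal/health")  # hypothetical endpoint
    print(f"success ratio: {ratio:.0%}")
    if ratio < 0.95:
        print("partial connectivity loss suspected")
```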
Report: "Elevated latency on event app"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Issue to logged-in & unusual logged-out causing various internal server error"
Last update: We would like to present you with a retrospective report regarding a service disruption that affected Swapcard customers on the 18th of September, 2023, from 4:55 AM UTC to approximately 8:00 AM UTC. The purpose of this retrospective is to provide insights into our initial assessment of the incident, as communicated on the Swapcard status page, and to outline the corrective actions we've taken to restore service. This incident pertains to issues related to login and unusual logout scenarios, which resulted in several internal server errors. ## Incident summary On Monday, September 18th, at 04:55 AM UTC, we encountered an issue with user logins and logouts, along with various internal errors in the delivery of the service. The nature of the problem was inconsistent, which led to a slightly longer resolution time than usual. This was primarily because not all requests were failing, and users could still access the product. The root cause of this problem was the automatic activation of our security protection system when it detected potentially malicious or system-flooding behavior. Specifically, one particular request triggered our Web Application Firewall \(WAF\) to flag certain IPs as a security precaution. Subsequently, our corporate rules blocked actions originating from these flagged IPs. Unfortunately, this resulted in one of Swapcard's own IP addresses being incorrectly blocked, causing a range of issues. Immediately following the incident, the Infrastructure Team at Swapcard, in collaboration with our Site Reliability Engineers \(SREs\) & Security Team, identified the root cause and implemented a mitigation strategy to prevent incidents related to the affected components, given the criticality of these components inside our architecture. Our systems & team detected a disruption in traffic, and as a result, the Swapcard Incident Response team was promptly activated. Our team worked diligently to prioritise and restore the quality of services to minimize the impact on our customers. Concurrently, we conducted a thorough investigation into the cause of the issue and implemented mitigations with the assistance of dedicated members from the Infrastructure and SRE teams. ## Mitigation deployment As part of our mitigation plan, we are enhancing our Web Application Firewall \(WAF\) system for flagged IPs by implementing a range of rules. This is aimed at preventing recurring issues where certain IPs are mistakenly flagged due to the actions of an individual who has been flooding our system. ## Event Outline ### Events of September 18th, 2023 \(UTC\): * 4:55 AM UTC: The initial onset of service disruption was observed. * 7:30 AM UTC: Swapcard Engineering identified the components as the root cause. * 7:55 AM UTC: The incident was successfully mitigated, and the system regained its pre-incident capacity. * 8:00 AM UTC: The status was confirmed as resolved post-incident. Please note that the events are presented in chronological order with their respective timestamps for clarity and coherence. ## Forward Planning In accordance with our high standards of deliverability, Swapcard has conducted and planned several improvements on the related components to prevent further incidents of the same type. We have also improved our capacity to detect these issues earlier, preventing impact on our customers. Procedures and controls are already in place, but this incident highlights the need for improvement.
The incident has been successfully resolved, and we will conduct a more detailed post-mortem to identify the root cause. In the initial investigation, we discovered a problem with the firewall that resulted in our datacenter IPs being mistakenly identified as flooding our server. As a result, our system blocked the IPs of one of our servers, leading to various issues.
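The retrospective above describes the WAF auto-blocking one of Swapcard's own datacenter IPs during a suspected flood, with new rules for flagged IPs as the mitigation. As a hedged, purely illustrative sketch of an allowlist check applied before rate-based blocking (the CIDR ranges are placeholders, not Swapcard's real ranges):

```python
# Sketch: consult an infrastructure allowlist before applying rate-based WAF
# blocking, so the platform's own datacenter IPs can never be auto-blocked.
import ipaddress

INTERNAL_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),      # placeholder range
    ipaddress.ip_network("192.0.2.0/24"),    # documentation range, placeholder
]


def may_block(ip: str) -> bool:
    """Return True only if the address is outside all internal ranges."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in INTERNAL_RANGES)


# Usage: escalate to an automatic block rule only when may_block(client_ip)
# is True; internal addresses fall through to alerting for manual review.
```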
Report: "Un-usual database load, causing latency in the ingestion of profile"
Last update: We are currently investigating this issue.
Report: "Intermittent latency on the event app & studio causing error page or latency"
Last update: We would like to provide you with a post-mortem report concerning a service delivery disruption that impacted Swapcard customers on Thursday, the 15th of June, 2023, from 12:55 UTC to 13:35 UTC. The purpose of this post-mortem is to offer insights into our initial evaluation of the incident as communicated on the Swapcard status page and to outline the remedial measures we have implemented to restore service. ## Incident summary On Thursday, June 15th, at 12:55 UTC, we experienced latency issues and encountered an unexpected error page saying "Stay tuned." This occurred due to a memory leak leading to intensive CPU usage on one of our main APIs running on NodeJS. The memory leak occurred in certain circumstances under heavy load and caused various latencies across the product, mostly because this service acts as the bridge between the interface and the databases. At 12:57 UTC on Thursday, June 15th, one of our critical services encountered a problem with excessive memory and CPU usage. This issue arose due to an undetected memory leak, which resulted in intensive garbage collection tasks. As a consequence, memory was not freed up properly, and the CPU usage became significantly high. The combined effect of these symptoms caused various services to restart, affecting the replication of those services as well. Immediately following the incident, the Infrastructure Team at Swapcard, in collaboration with our Site Reliability Engineers \(SREs\), identified the root cause and implemented a mitigation strategy to prevent incidents related to the affected components, given the criticality of these components inside our architecture. Our monitoring systems detected a disruption in traffic at 12:55 UTC, and as a result, the Swapcard Incident Response team was promptly activated. Our team worked diligently to prioritise and restore the quality of services to minimize the impact on our customers. Concurrently, we conducted a thorough investigation into the cause of the issue and implemented mitigations with the assistance of dedicated members from the Infrastructure and SRE teams. ## Mitigation deployment To resolve the issue, our initial course of action involved implementing a mitigation strategy that gradually scaled up the various services. This approach aimed to distribute the memory load among a larger number of pods, preventing any individual pod from going offline due to excessive memory consumption. As a result, the services were able to recover more quickly, granting us additional time to investigate the underlying cause and implement appropriate long-term solutions. The second strategy involved conducting in-depth analysis to identify the source of the abnormal memory consumption. We performed multiple analyses in the production environment, but the issue could not be reproduced in our staging and development environments, even under substantial simulated load. To troubleshoot this memory usage, we utilised the `--heapsnapshot-signal` command-line flag to delve into the memory allocation of our services. We conducted two analyses, one at the start of the service and another under heavy load. Due to the investigation being performed in the production environment, it took longer than usual to pinpoint the issue while prioritising the live environment's performance. After conducting further investigation, we successfully identified the cause of the memory retention and addressed it to ensure stable memory usage in such circumstances.
## Event Outline

### Events of June 15th, 2023 (UTC):

* 12:55 UTC: The initial onset of service disruption was observed.
* 12:55 UTC: Swapcard monitoring detected a service disruption.
* 13:00 UTC: Swapcard Engineering identified the components at the root cause.
* 13:28 UTC: The incident was successfully mitigated, and the system regained its pre-incident capacity.
* 13:35 UTC: The status was confirmed as resolved post-incident.

Please note that the events are presented in chronological order with their respective timestamps for clarity and coherence.

## Forward Planning

In accordance with our high standards of deliverability, Swapcard has made several improvements to the memory consumption of two of our services to prevent further incidents of the same type. We have also improved our capacity to detect such issues earlier in the process, preventing impact on our customers. Procedures and controls are already in place, but this incident highlights the need for improvement.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Delay on messaging, notifications and simulated live stream processing"
Last update: Please see our post-mortem below regarding a service delivery disruption that affected Swapcard customers on Tuesday, June 28th, 2023, from 20:39 UTC through to 23:28 UTC. Our goal in this post-mortem is to provide details on our initial assessment of the incident and to describe the remediation actions that we have taken to restore service.

## Incident summary

On Tuesday, June 28th at 20:39 UTC, we experienced an outage of an internal queue messaging system used by several Swapcard core services, such as messaging, notifications and live stream processing.

At 20:45 UTC, the automated monitoring system triggered an on-call response from the Incident Response team. At 20:47 UTC, the alarm was acknowledged by the Incident Response team. At 21:06 UTC, the Incident Response team identified the issue within the internal messaging system: a message queue was full and was degrading performance for the other queues hosted on the same system. At 21:32 UTC, the Incident Response team triggered a capacity upgrade of the affected service, the one responsible for consuming the backlog of messages, in an attempt to restore service. This change was applied in accordance with Swapcard's standard infrastructure & security change and enhancement practices. At 23:21 UTC, the Incident Response team monitored the results and confirmed that the internal messaging system had returned to nominal levels. At 23:34 UTC, the Incident Response team resolved the incident.

We are still investigating the root cause of the incident, which led to the buildup of messages inside the internal queue messaging system. Swapcard monitoring detected the disruption at 20:45 UTC and activated the Swapcard Incident Response team. Swapcard's team worked to triage and restore services to alleviate customer impact. In parallel, the cause of the issue was investigated and mitigations were put in place.

## Mitigation deployment

To ensure proper processing of all messages, the service responsible for handling them was scaled up to process more messages than nominal levels in order to compensate for the backlog. As the system recovered, customers would then see a reduction in the delay in processing of messages, notifications and live streams. At 23:34 UTC, Swapcard confirmed that the incident was resolved and that delays had returned to pre-incident levels, with processing speed back to the pre-incident rate.

## Event Outline

### Events of June 28th, 2023 (UTC)

* 20:39 UTC | Initial delays began in messaging, notifications and live streams
* 20:45 UTC | Disruption identified by Swapcard automated monitoring systems
* 20:47 UTC | Swapcard Engineering acknowledged the issue
* 21:06 UTC | Swapcard Engineering identified the cause of the disruption
* 21:32 UTC | Swapcard Engineering triggered a scale-up of the affected service in an attempt to restore service
* 23:21 UTC | Swapcard Engineering monitored the results
* 23:34 UTC | Swapcard Engineering resolved the incident

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.

## Forward Planning

In accordance with our high standards of deliverability, Swapcard will conduct an internal audit of the on-call procedure, which did not trigger a status page update during the incident.
Swapcard will also take measures to improve the monitoring of the affected internal messaging system to avoid further service disruption (an illustrative queue-depth check is sketched after this report). Procedures and controls are already in place, but today's incident highlights the need for improvement. **We consider the likelihood of a recurrence of this issue to be low and will further reduce the risk as we make future interventions and improvements to our infrastructure and procedures.**
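Swapcard has not disclosed which broker backs this internal queue, so the sketch below is illustrative only, assuming a RabbitMQ-compatible broker and the `amqplib` Node.js client; the queue names and threshold are assumptions. The idea is simply to alert before a single full queue starts degrading its neighbours.

```typescript
// queue-depth-check.ts - illustrative sketch; broker, queue names and
// thresholds are assumptions, not Swapcard's actual configuration.
import amqp from "amqplib";

const BROKER_URL = process.env.BROKER_URL ?? "amqp://localhost";
const WATCHED_QUEUES = ["messaging", "notifications", "live-stream"]; // hypothetical names
const BACKLOG_THRESHOLD = 10_000; // messages waiting before we alert

async function checkQueueDepth(): Promise<void> {
  const connection = await amqp.connect(BROKER_URL);
  const channel = await connection.createChannel();
  try {
    for (const queue of WATCHED_QUEUES) {
      // checkQueue returns the current message and consumer counts for the queue.
      const { messageCount, consumerCount } = await channel.checkQueue(queue);
      if (messageCount > BACKLOG_THRESHOLD) {
        // In a real setup this would page on-call or trigger a consumer scale-up.
        console.warn(
          `queue ${queue}: backlog of ${messageCount} messages with ${consumerCount} consumers`
        );
      }
    }
  } finally {
    await channel.close();
    await connection.close();
  }
}

checkQueueDepth().catch((err) => {
  console.error("queue depth check failed", err);
  process.exit(1);
});
```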
The internal queuing system, used by multiple systems inside the Swapcard platform, suffered an outage which added delays in processing of notifications and streaming services.
Report: "Intermitent latency on the event app causing "Stay tuned" errors"
Last update: We aim to present you with a post-mortem report regarding a service delivery disruption that affected Swapcard customers on Wednesday, June 14th, 2023. This incident resulted in reduced performance for certain aspects of the service, particularly during peak hours on Wednesday afternoon, causing the generic "Stay tuned" page to be displayed at certain moments. The purpose of this post-mortem is to offer insights into our initial evaluation of the incident as communicated on the Swapcard status page and to outline the remedial measures we have implemented to restore service.

## Incident summary

On Wednesday, June 14th, at 14:00 UTC, we experienced latency issues and an unexpected error page saying "Stay tuned." This occurred due to a sudden surge of traffic caused by a slow Distributed Denial of Service (DDoS) attack, which resulted in delays in processing requests and frequent fluctuations in the performance of our clusters.

On the afternoon of Wednesday, June 14th, our public APIs came under a major DDoS attack. This resulted in a significant influx of approximately 3 million requests within a short span of time, causing a sudden increase in latency within our system. Our services nevertheless scaled up efficiently after detecting the high incoming traffic associated with the slow DDoS attack. It is worth noting that this particular attack operated at a rate just below our typical rate limits and was distributed across various users and IPs, which made it difficult for our security tools and rate limiter to detect it immediately.

Right after the event, the Infrastructure Team and Security Team at Swapcard joined forces with our Site Reliability Engineers (SREs) to swiftly determine the underlying cause. They took prompt measures to address the IPs and users involved, and devised a mitigation plan to proactively prevent any potential future incidents related to the affected components. As part of this strategy, we have tuned our rate limiters more aggressively to safeguard the platform's stability and performance against similar slow DDoS attacks (an illustrative rate-limiting sketch follows this report).

Earlier in the afternoon, a disturbance in traffic was identified by our monitoring systems, leading to the swift activation of the Swapcard Incident Response team. The team diligently investigated the root cause of the incident, bearing in mind that slow DDoS attacks are typically challenging to detect because they blend in with legitimate traffic. The team took great care not to disrupt legitimate users by carefully adjusting the rate limiter and WAF system to strike an appropriate balance. Note that affected customers may have been impacted to varying degrees and for a shorter duration than described above.

## Mitigation deployment

The interruption of service ceased as the rate limiters were fine-tuned and the Swapcard Incident Response team swiftly intervened to mitigate the effects on our customers. This incident highlighted areas where we can improve to achieve even quicker scalability and absorption of traffic, especially considering the exceptionally high volume we experienced. It also emphasised the need to enhance our detection capabilities for slow DDoS attacks.

## Forward Planning

Today's incident has brought attention to potential enhancements we can implement.
Although our existing procedures and controls are already in place, we recognise the opportunity for improvement. This proactive approach ensures that we continue to enhance the resilience of our systems and mitigate any potential disruptions.
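As an illustration of the kind of per-user / per-IP rate limiting discussed above (not Swapcard's actual implementation, whose limits and storage are not public), here is a minimal fixed-window limiter; a production version would typically use a sliding window and shared storage such as Redis so limits hold across instances.

```typescript
// rate-limiter.ts - minimal fixed-window limiter, illustrative only.
// Keys can be a user ID, an IP, or both, so that a "slow" attack spread
// across many identities still hits a per-key budget.
interface WindowState {
  windowStart: number;
  count: number;
}

export class FixedWindowRateLimiter {
  private readonly windows = new Map<string, WindowState>();

  constructor(
    private readonly maxRequests: number, // e.g. 100 requests...
    private readonly windowMs: number // ...per 60_000 ms window
  ) {}

  /** Returns true if the request is allowed, false if it should be rejected (HTTP 429). */
  allow(key: string, now: number = Date.now()): boolean {
    const state = this.windows.get(key);
    if (!state || now - state.windowStart >= this.windowMs) {
      this.windows.set(key, { windowStart: now, count: 1 });
      return true;
    }
    state.count += 1;
    return state.count <= this.maxRequests;
  }
}

// Usage sketch: reject when either the per-IP or the per-user budget is exhausted.
const perIp = new FixedWindowRateLimiter(300, 60_000);
const perUser = new FixedWindowRateLimiter(100, 60_000);

export function shouldServe(ip: string, userId: string | undefined): boolean {
  return perIp.allow(ip) && (userId === undefined || perUser.allow(userId));
}
```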
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Event App & Studio are unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "High load on main database replicas"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "Event App & Studio are unresponsive"
Last update: We would like to provide you with a post-mortem report concerning a service delivery disruption that impacted Swapcard customers on Wednesday, May 17th, 2023, from 17:01 UTC to 17:20 UTC. The purpose of this post-mortem is to offer insights into our initial evaluation of the incident as communicated on the Swapcard status page and to outline the remedial measures we have implemented to restore service.

## Incident summary

We encountered a significant outage on Wednesday, May 17th, at 17:01 UTC, when our production pods running on Kubernetes were abruptly terminated in a cascading manner. This disruption resulted in service delivery issues for Swapcard across all regions.

On Wednesday, May 17th at 17:01 UTC, our Kubernetes cluster experienced a significant issue in which a large number of production pods were unexpectedly terminated. This was caused by a recent upgrade of the pod scheduler version, combined with a specific parameter that misbehaved in a particular scenario. It is important to note that this configuration had been running without any problems for multiple days and had also been tested successfully on our non-production clusters. Unfortunately, a combination of manual actions taken to upgrade our integration services, along with this specific configuration, triggered an unintended cascading termination of our production pods. These changes and manual actions were made in accordance with Swapcard's standard infrastructure and security change practices.

Immediately following the incident, the Infrastructure Team at Swapcard, in collaboration with our Site Reliability Engineers (SREs), identified the root cause and implemented a mitigation strategy to prevent any future incidents related to the affected components, specifically the pod scheduler mechanism. We have a high level of confidence that these components will not lead to similar mass terminations of our production pods in the future.

We want to emphasize that we handled this incident in compliance with our Disaster Recovery Plan (DRP). Our GitOps methodology and Infrastructure as Code (IaC) approach proved invaluable in minimizing the impact on our customers and reducing the resolution time. The situation we encountered on Wednesday, May 17th, can be classified as a worst-case scenario from an infrastructure standpoint.

Our monitoring systems detected a disruption in traffic at 17:01 UTC, and the Swapcard Incident Response team was promptly activated. Our team worked diligently to prioritize and restore services to minimize the impact on our customers. Concurrently, we conducted a thorough investigation into the cause of the issue and implemented mitigations with the assistance of dedicated members from the Infrastructure and SRE teams.

## Mitigation deployment

The service disruption stopped as soon as the 80 affected pods were restarted and redeployed. This incident brought to light certain improvements we can implement to achieve even faster recovery times in a worst-case scenario. Following the restoration of the Kubernetes pods, Swapcard Engineering diligently monitored all services to ensure a complete and proper recovery, which was achieved by approximately 17:20 UTC. Consequently, customers would have observed the availability of Swapcard's services.
At 17:24 UTC, Swapcard officially confirmed that services had been restored to pre-incident levels and that traffic had returned to its rate before the incident.

## Event Outline

### Duration Summary

* Time alerted to the outage: 1 minute
* Time to identify the source of disruption: 1 minute
* Time to initiate recovery: 2 minutes
* Time to monitor and restore pre-crash capacity: 14 minutes

### Events of May 17th, 2023 (UTC):

* 17:01 UTC: The initial onset of service disruption was observed.
* 17:01 UTC: Swapcard monitoring detected a global service disruption.
* 17:02 UTC: Swapcard Engineering identified the components at the root cause.
* 17:05 UTC: Recovery of Kubernetes pods started.
* 17:19 UTC: The majority of services were restored, and additional mitigation measures were implemented.
* 17:20 UTC: The incident was successfully mitigated, and the system regained its pre-incident capacity.
* 17:24 UTC: The status was confirmed as resolved post-incident.

Please note that the events are presented in chronological order with their respective timestamps for clarity and coherence.

## Forward Planning

Today's incident has brought attention to potential enhancements we can implement to further improve our recovery time in worst-case scenarios, in line with our Disaster Recovery Plan (DRP). Although our existing procedures and controls are already in place, we recognize the opportunity for improvement. We assess the probability of a similar issue recurring as extremely low. However, we remain committed to minimizing any potential risks by implementing future interventions and enhancements to our infrastructure and procedures. This proactive approach ensures that we continue to enhance the resilience of our systems and mitigate any potential disruptions. An illustrative sketch of detecting this kind of mass pod termination from outside the cluster follows this report.
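The sketch below is not Swapcard's tooling; the namespace and threshold are assumptions, and it assumes `kubectl` is installed and configured. It polls the cluster and alerts when too many pods leave the `Running` phase.

```typescript
// pod-health-check.ts - illustrative sketch only.
import { execFileSync } from "node:child_process";

interface PodList {
  items: Array<{
    metadata: { name: string };
    status: { phase: string };
  }>;
}

const NAMESPACE = "production"; // hypothetical namespace
const MAX_NOT_RUNNING = 5; // alert if more than this many pods are unhealthy

function listPods(namespace: string): PodList {
  const stdout = execFileSync(
    "kubectl",
    ["get", "pods", "-n", namespace, "-o", "json"],
    { encoding: "utf8" }
  );
  return JSON.parse(stdout) as PodList;
}

const pods = listPods(NAMESPACE);
const notRunning = pods.items.filter((p) => p.status.phase !== "Running");

if (notRunning.length > MAX_NOT_RUNNING) {
  // In a real setup this would page on-call instead of just logging.
  console.error(
    `${notRunning.length}/${pods.items.length} pods are not Running:`,
    notRunning.map((p) => `${p.metadata.name} (${p.status.phase})`).join(", ")
  );
  process.exit(1);
}
console.log(`${pods.items.length - notRunning.length}/${pods.items.length} pods are Running`);
```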
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Analytics API is unreachable and data ingestion delayed, analytics may falsely appear empty"
Last update: Please see our post-mortem below regarding a service disruption that affected the Analytics API and related services on May 1st, 2023, from 4:29 AM UTC through to 7:42 PM UTC. Affected customers may have been impacted to varying degrees and over a shorter timescale; this timeline takes into account the first events and the official resolution time after close monitoring. Our goal in this post-mortem is to provide details on our initial assessment of the incident as communicated on the Swapcard status page and to describe the remediation actions that we have taken to restore service.

## Incident summary

On May 1st, 2023 at 4:29 AM UTC, we experienced an increase in reports of issues with the "Lead Board" page in the Exhibitor Center; these reports were the first symptom of the Analytics API issues.

Earlier that day, our Analytics database triggered an automatic disk-size scale-up after reaching its usage threshold (used/free space). The auto-scale altered our database indexes, causing long-running queries that cascaded into a failure of the Analytics API, in turn affecting the "Lead Board" feature and the Developer API (Analytics endpoints only). Swapcard monitoring took time to detect the database disruption, mostly because the database was not completely unreachable; the only reports came through support and concerned the "Lead Board" page. Because of the various reports of malfunctioning analytics-related features, the Swapcard Incident Response team was activated. Swapcard's team worked to triage and restore services to mitigate customer impact. In parallel, the cause of the issue was investigated and mitigations were put in place.

## Mitigation deployment

To restore the Analytics services, our first mitigation was to attempt a hard restart of the database, as documented in our response plan, to free the stacking queries (forced query termination). We noticed that the restart did not have the expected effect and that queries were still stacking without getting resolved. Some Analytics queries were resolved properly at that time thanks to the caching system dampening the issue; the proportion of successful queries varied according to the freshness of the events performing API requests. At this point the underlying issue had not yet been discovered, and the correlation with the earlier disk auto-scale had not been made.

After a few attempts at examining incoming traffic to rule out a slow DDoS and a malfunction of the circuit breaker, the Swapcard Incident Response team discovered a gap in the Analytics database indexes: the indexes still existed but had been altered by the earlier events (an illustrative index-validity check is sketched after this report). Once the underlying issue was discovered, and in order to restore the "Lead Board" and especially the lead-export button as fast as possible, the team switched the Analytics API to an empty database while the indexes were being restored, which is why analytics may have falsely appeared empty. Given the load on the database, restoring indexes while still receiving long and costly queries was not possible and would have greatly extended the resolution of the incident. Once the indexes were restored to their proper state, the Analytics API was switched back to the normal database and the pipeline was restored to ensure analytics metrics were properly computed and delivered. No data was lost or altered during the process.

At 7:42 PM UTC, Swapcard confirmed that the restoration was complete and that the API and underlying features were restored.
## Event Outline

### Events of May 1st, 2023 (UTC)

* 4:10 AM UTC | Automatic disk scale-up on our Analytics database after reaching its usage threshold (used/free space)
* 4:29 AM UTC | Increase in reports of issues with the "Lead Board" page in the Exhibitor Center
* 3:00 PM UTC | Swapcard Engineering found the underlying issue and started to establish a plan to restore the indexes
* 6:40 PM UTC | Analytics API is back online and the "Lead Board" page is reachable, while Swapcard Engineering monitors internal systems and the recovery of the database indexes
* 7:42 PM UTC | Status post resolved

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.

## Forward Planning

In accordance with our high standards of deliverability, Swapcard will conduct an internal audit of the autoscaling and failover capabilities used by the Analytics database. Automatic capacity upgrades and a failover replica were already in place, but today's incident highlights the need for improvement. We also found a gap in the monitoring and automatic issue detection on our Analytics database; it was resolved on May 2nd, following the post-mortem audit and mitigation plan prepared on May 1st. **We consider the likelihood of a recurrence of this issue to be low and will further reduce the risk as we make future interventions and improvements to our infrastructure and procedures.**
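Swapcard has not disclosed which engine backs the Analytics database, so the following is a sketch of the general idea only, assuming PostgreSQL and the `pg` Node.js client: indexes left in an invalid state can be listed from the system catalogs and rebuilt without blocking reads.

```typescript
// check-invalid-indexes.ts - illustrative sketch assuming PostgreSQL and the
// "pg" Node.js client; the connection string is an assumption.
import { Client } from "pg";

async function main(): Promise<void> {
  const client = new Client({ connectionString: process.env.ANALYTICS_DB_URL });
  await client.connect();
  try {
    // pg_index.indisvalid is false for indexes that exist but cannot be used
    // by the planner (e.g. after a failed or interrupted build).
    const { rows } = await client.query(
      `SELECT indexrelid::regclass AS index_name,
              indrelid::regclass  AS table_name
         FROM pg_index
        WHERE NOT indisvalid`
    );
    if (rows.length === 0) {
      console.log("all indexes are valid");
      return;
    }
    for (const row of rows) {
      console.warn(`invalid index ${row.index_name} on ${row.table_name}`);
      // A rebuild that does not block reads/writes (PostgreSQL 12+):
      //   REINDEX INDEX CONCURRENTLY <index_name>;
    }
  } finally {
    await client.end();
  }
}

main().catch((err) => {
  console.error("index check failed", err);
  process.exit(1);
});
```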
This incident has been resolved. A more detailed post-mortem will be published once short-, mid- and long-term mitigations are in place and planned.
A fix has been implemented. We are working on restoring the analytics data; no data has been lost during the process.
The issue has been identified and a fix is being implemented.
Report: "Event app increase response time"
Last update: Please see our post-mortem below regarding a service disruption that affected Swapcard customers on March 29th, 2023, from 12:58 UTC through to 13:08 UTC. Our goal in this post-mortem is to provide details on our initial assessment of the incident as communicated on the Swapcard status page and to describe the remediation actions that we have taken to restore service.

## Incident summary

On March 29th, 2023 at 12:58 UTC, we experienced an increase in latency on the Event App due to an excess number of requests on our API gateway. At 13:02 UTC, the infrastructure team triggered a capacity upgrade of our main API in order to improve latency and the user experience for current and upcoming events. This change did not improve latency as we had hoped. The team continued to investigate using Swapcard's internal profiling tools and found that the APIs were not receiving the spike of traffic observed at our gateway. It was determined that our main gateway had not scaled with the network traffic, causing a network bottleneck and increased latency.

Swapcard monitoring detected the traffic disruption at 12:59 UTC and activated the Swapcard Incident Response team. Swapcard's team worked to triage and restore services to alleviate customer impact. In parallel, the cause of the issue was investigated and mitigations were put in place.

## Mitigation deployment

Latency dropped as soon as the capacity upgrade of our main gateway was triggered. Swapcard Engineering then monitored all services to ensure full and proper recovery by 13:11 UTC. At 13:11 UTC, Swapcard confirmed that the scaling was complete and that latency had returned to pre-incident levels, with traffic back to the pre-incident rate.

## Event Outline

### Duration Summary

* Time alerted to the outage: 1 minute
* Time to identify the source of disruption: 5 minutes
* Time to initiate recovery: 1 minute
* Time to monitor and restore pre-incident capacity: 7 minutes

### Events of March 29th, 2023 (UTC)

* 12:58 UTC | Event App latency increased
* 12:59 UTC | Swapcard automated monitoring alerted the infrastructure team
* 13:02 UTC | Swapcard Engineering pre-emptively scaled the main API to lower latency
* 13:05 UTC | Swapcard Engineering found a network bottleneck at the gateway using internal profiling tools
* 13:06 UTC | Swapcard Engineering triggered a capacity upgrade of the main gateway
* 13:08 UTC | Event App latency returned to pre-incident levels while Swapcard Engineering monitored internal systems
* 13:10 UTC | No internal systems were affected; incident mitigated
* 13:11 UTC | Status post resolved

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.

## Forward Planning

In accordance with our high standards of deliverability, Swapcard will conduct an internal audit of the autoscaling methods used by the main gateway (an illustrative scaling calculation is sketched after this report). Automatic capacity upgrades were already in place, but today's incident highlights the need for improvement. **We consider the likelihood of a recurrence of this issue to be low and will further reduce the risk as we make future interventions and improvements to our infrastructure and procedures.**
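As a general illustration of traffic-proportional autoscaling (Swapcard has not published how its gateway scales; the target figures below are assumptions), the standard Kubernetes HPA rule computes the desired replica count from the ratio of observed to target load:

```typescript
// gateway-scaling.ts - illustrative only; target values are assumptions.
// Proportional scaling rule used by the Kubernetes HPA:
//   desired = ceil(current * observedMetric / targetMetric)
function desiredReplicas(
  currentReplicas: number,
  observedRpsPerReplica: number,
  targetRpsPerReplica: number,
  minReplicas = 2,
  maxReplicas = 50
): number {
  const desired = Math.ceil(
    (currentReplicas * observedRpsPerReplica) / targetRpsPerReplica
  );
  return Math.min(maxReplicas, Math.max(minReplicas, desired));
}

// Example: 4 gateway replicas each seeing 2_500 req/s against a 1_000 req/s
// target would scale out to 10 replicas.
console.log(desiredReplicas(4, 2_500, 1_000)); // -> 10
```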
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
On Wednesday, March 29th at 12:58 UTC, we experienced an increase in response time on our Event App.
Report: "Event app & Studio & Team are unreachable (RBAC: access denied)"
Last update: This incident has been resolved.
Report: "Event app is unreachable in some regions"
Last update: Please see our post-mortem below regarding a service delivery disruption that affected Swapcard customers on Thursday, February 9th, 2023, from 16:20 UTC through to 16:35 UTC. Our goal in this post-mortem is to provide details on our initial assessment of the incident as communicated on the Swapcard status page and to describe the remediation actions that we have taken to restore service.

## Incident summary

On Thursday, February 9th at 16:20 UTC, we experienced an outage due to a capacity upgrade of our cache system, causing issues in the delivery of the Event App across several regions. At 16:20 UTC, the infrastructure team triggered a capacity upgrade of our caching system to improve latency and the user experience for current and upcoming events; this action is commonly executed and is part of our recurrent tasks. The same change had been applied to our production account by our infrastructure-as-code tooling in dry-run mode several days in advance, in order to prevent any impact on the production environment. The change nevertheless caused a significant traffic disruption and an outage of the Event App. It was applied in accordance with Swapcard's standard infrastructure & security change and enhancement practices. We are currently investigating our cache system configuration to find the difference between the earlier dry run and today's failed upgrade.

Swapcard monitoring detected the traffic disruption at 16:20 UTC and activated the Swapcard Incident Response team. Swapcard's team worked to triage and restore services to alleviate customer impact. In parallel, the cause of the issue was investigated and mitigations were put in place.

## Mitigation deployment

The traffic disruption stopped as soon as the cache capacity upgrade was reverted. Swapcard Engineering then monitored all services to ensure full and proper recovery by 16:35 UTC. As the cache system was restored, customers would then see a reduction in errors. At 16:35 UTC, Swapcard confirmed that the revert was complete and that capacity had returned to pre-incident levels, with traffic back to the pre-incident rate.

## Event Outline

### Duration Summary

* Time alerted to the outage: 1 minute
* Time to identify the source of disruption: 1 minute
* Time to initiate recovery: 5 minutes
* Time to monitor and restore pre-crash capacity: 8 minutes

### Events of February 9th, 2023 (UTC)

* 16:20 UTC | Initial onset of cache disruption
* 16:20 UTC | Global cache disruption identified by Swapcard monitoring
* 16:21 UTC | Swapcard Engineering identified the capacity upgrade as the cause
* 16:32 UTC | Impacted services began to recover
* 16:33 UTC | Majority of services recovered; additional mitigation measures taken
* 16:35 UTC | Incident mitigated; pre-incident capacity restored
* 16:44 UTC | Status post resolved

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.

## Forward Planning

In accordance with our high standards of deliverability, Swapcard will conduct an internal audit of the cache upgrade timeline and procedure. Procedures and controls are already in place, but today's incident highlights the need for improvement. **We consider the likelihood of a recurrence of this issue to be extremely low and will further reduce the risk as we make future interventions and improvements to our infrastructure and procedures.**
A fix has been implemented and we are monitoring the results. This incident has been resolved.
Report: "Delayed in Email & Push sending"
Last update: A fix has been implemented and we are monitoring the results. This incident has been resolved.
Report: "Event app & Studio are unreachable"
Last update: Please see our post-mortem below regarding a service outage that affected Swapcard customers on February 2nd, 2023, from 11:32 UTC through to 12:21 UTC.

Impacted services:
* Event App
* Studio App
* Exhibitor Center
* Developer API

Our goal in this post-mortem is to provide details on our initial assessment of the incident as communicated on the Swapcard status page and to describe the remediation actions that we have taken to restore service.

## Incident summary

On Thursday, February 2nd at ~11:32 UTC, we experienced a major outage of our Event App, Studio, Exhibitor Center and Developer API due to a high number of unfinished, stacking database sessions on our main core databases (master & replicas). At ~11:33 UTC, our infrastructure team was automatically alerted to a high number of database sessions on our main core database. The number of sessions kept increasing on one database replica, then started to propagate to the other replicas that are part of our Multi-AZ deployment for high availability.

Swapcard monitoring detected the start of the disruption and activated the Swapcard Incident Response team. Swapcard's team worked to triage and mitigate the incident according to our internal documentation by redirecting database requests to the other replicas, which are in place precisely for major disruptions of a single database node. Unfortunately, this action did not result in the service recovery the Incident Response team initially expected: as explained above, the issue was propagating to the Multi-AZ nodes as well, the very nodes intended to mitigate such incidents. In parallel, the cause of the issue was investigated and short-term plans were put in place. The second mitigation attempt led to a global database restart and therefore a longer resolution time; this plan was not initially considered precisely because it was known to potentially extend the resolution time.

## Mitigation deployment

At 11:50 UTC, Swapcard's Engineering team, which had been investigating since the initial onset of the incident, identified the initial cause: a difference in the minor version between our master and replica databases, combined with a specific database operation, caused a table lock that made database sessions stack up (an illustrative lock-monitoring query is sketched after this report). The lock occurred on a table with a high volume of requests per second that is used to render a major part of the content. The specific operation was an internal database operation that could not be terminated manually, forcing a restart to ensure the lock was released.

By 12:19 UTC the service outage stopped as the change propagated through the databases. Swapcard Engineering then monitored all services to ensure full and proper recovery by 12:21 UTC. At 12:21 UTC, Swapcard confirmed that the recovery was complete and that capacity had returned to pre-incident levels, with traffic back to the pre-incident rate.
## Event Outline

### Duration Summary

* Time alerted to the outage: 1 minute
* Time to identify the source of disruption: 1 minute
* Time to initiate recovery (1st attempt): 7 minutes
* Time to initiate recovery (2nd attempt): 35 minutes
* Time to monitor and restore pre-crash capacity: 5 minutes

### Events of February 2nd, 2023 (UTC)

* 11:32 UTC | Initial onset of the core database outage
* 11:32 UTC | Service outage identified by Swapcard monitoring
* 11:32 UTC | Swapcard status post activated
* 11:33 UTC | Swapcard Engineering identified a high number of database sessions
* 11:39 UTC | First mitigation attempt
* 11:50 UTC | Swapcard Engineering identified the initial cause of the issue
* 11:50 UTC | Second mitigation attempt
* 12:19 UTC | Majority of services recovered; additional mitigation measures taken
* 12:21 UTC | Incident mitigated; pre-incident capacity restored
* 12:21 UTC | Status post resolved

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.

## Forward Planning

Swapcard has deployed a permanent fix for this incident and will implement technical measures to ensure that any internal database operation of this kind is identified earlier, in addition to adding a procedure to prevent propagation across the Multi-AZ nodes. In accordance with our high standards of deliverability, Swapcard will conduct an internal audit of the database configuration, versions and internal procedures to prevent similar incidents. Procedures and controls are already in place, but today's incident highlights the need for improvement. **We consider the likelihood of a recurrence of this issue to be extremely low and will further reduce the risk as we make future interventions and improvements to our infrastructure and procedures.**
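The post-mortem does not name the database engine, so the following is only a sketch of the general idea, assuming PostgreSQL and the `pg` Node.js client: sessions waiting on locks can be surfaced from `pg_stat_activity` and fed into alerting so a growing pile-up is caught early.

```typescript
// lock-monitor.ts - illustrative sketch assuming PostgreSQL and the "pg" client;
// the connection string and behaviour are assumptions.
import { Client } from "pg";

async function main(): Promise<void> {
  const client = new Client({ connectionString: process.env.CORE_DB_URL });
  await client.connect();
  try {
    // Sessions currently waiting on a lock, with how long they have waited
    // and a truncated view of the blocked query.
    const { rows } = await client.query(
      `SELECT pid,
              now() - query_start AS waiting_for,
              left(query, 80)     AS query
         FROM pg_stat_activity
        WHERE wait_event_type = 'Lock'
        ORDER BY query_start`
    );
    if (rows.length > 0) {
      // In a real setup this would feed an alert when the count keeps growing.
      console.warn(`${rows.length} sessions are waiting on locks`);
      for (const row of rows) {
        console.warn(`pid=${row.pid} waiting ${row.waiting_for}: ${row.query}`);
      }
    }
  } finally {
    await client.end();
  }
}

main().catch((err) => {
  console.error("lock monitor failed", err);
  process.exit(1);
});
```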
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Investigating reports of degraded performance (Event App & Studio)."
Last update: We are investigating reports of issues with the following service(s): Event App and Studio. This incident has been resolved; we will provide a post-mortem regarding the issue.
Report: "Event app is unreachable"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Investigating 'Header Timeout Error' on feed / live discussion & lead report"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Intermittent playback issues for Video"
Last update: Please see our post-mortem below regarding the intermittent playback issues experienced on October 25th, 2022, from ~14:30 UTC through to ~16:30 UTC. Our goal in this post-mortem is to provide details on our initial assessment of the incident as communicated on the Swapcard status page and to describe the remediation actions that we have taken to restore service.

**Note: this incident is linked to the partial Cloudflare outage on October 25th, 2022**: 👉 [https://blog.cloudflare.com/partial-cloudflare-outage-on-october-25-2022/](https://blog.cloudflare.com/partial-cloudflare-outage-on-october-25-2022/)

## **Incident summary**

On Tuesday, October 25th at ~14:30 UTC, Cloudflare, one of the two CDN providers used by our video provider, began experiencing a service degradation. From 14:30 UTC, our video provider began seeing intermittent playback issues as the degradation propagated through Cloudflare's network. At ~16:00 UTC, our infrastructure team was automatically alerted to an unusual number of video issues reported by some users (an increase of roughly 5%). The Swapcard support team detected the start of the disruption and activated the Swapcard Incident Response team. Swapcard's team worked on triage and on a support/communication plan for our impacted customers. In parallel, the cause of the issue was investigated and short- and long-term plans were put in place.

## **Mitigation deployment**

Swapcard's video features are not yet compatible with a secondary backup video provider's technology, which prevented us from hot-switching providers to mitigate the issue. At ~15:00 UTC our video/infrastructure team immediately engaged our premium support line with our main video provider to assess the impact of the issue and gather as much information as possible to support our impacted customers. At ~16:50 UTC, Swapcard confirmed that the playback issues were fully resolved and that no further errors were detected or reported.

## **Event Outline**

### **Events of October 25th, 2022 (UTC)**

* 14:30 UTC | Initial onset of the playback issues according to Cloudflare and our video provider; **we did not receive any report of disruption at that time**
* 15:53 UTC | Cloudflare posted a public incident
* 16:00 UTC | Disruption identified by the Swapcard support team; note that at that time Cloudflare had only just started publicly reporting disruption on their network
* 16:30 UTC | Incident mitigated
* 17:04 UTC | Swapcard public status post activated
* 17:13 UTC | Status post resolved
* 18:50 UTC | Cloudflare resolved their incident

**We have worked with our video provider to assess the impact and estimate that the degradation affected about 5% of Swapcard streaming traffic.**

## **Forward Planning**

In accordance with our high standards of deliverability, Swapcard is working hard on onboarding a second, preferred backup video provider to mitigate and reduce the risk of this type of incident. This backup provider is based on a completely different underlying technology, so as to provide a strong alternative in case of a major outage of a well-known worldwide CDN provider. In addition, we will be adding monitoring to detect and alert on these sorts of CDN request failures, which will allow us to identify and respond to issues more quickly.
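To illustrate the kind of CDN-failure monitoring mentioned in the forward planning above (a generic sketch, not Swapcard's monitoring; the probe URL, window and threshold are assumptions), a synthetic check can periodically fetch a known asset through the CDN and alert when the failure rate climbs:

```typescript
// cdn-probe.ts - illustrative sketch; requires Node 18+ for the global fetch.
// Periodically requests a known asset through the CDN and tracks the failure
// rate over a sliding window, alerting when it exceeds a threshold.
const PROBE_URL = "https://cdn.example.com/health"; // hypothetical probe asset
const WINDOW_SIZE = 20; // number of recent probes to consider
const FAILURE_RATE_THRESHOLD = 0.05; // alert above 5% failures

const results: boolean[] = [];

async function probeOnce(): Promise<void> {
  let ok = false;
  try {
    const res = await fetch(PROBE_URL, { signal: AbortSignal.timeout(5_000) });
    ok = res.ok;
  } catch {
    ok = false; // network error or timeout counts as a failure
  }
  results.push(ok);
  if (results.length > WINDOW_SIZE) results.shift();

  const failures = results.filter((r) => !r).length;
  const failureRate = failures / results.length;
  if (results.length === WINDOW_SIZE && failureRate > FAILURE_RATE_THRESHOLD) {
    // In a real setup this would page on-call / open a status incident.
    console.error(
      `CDN failure rate ${(failureRate * 100).toFixed(1)}% over last ${WINDOW_SIZE} probes`
    );
  }
}

setInterval(() => void probeOnce(), 30_000);
```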
This incident has been resolved.
Our video partner is seeing performance return to normal levels. They implemented a mitigation by routing away from the affected CDN, and stream performance should be returning to normal. We are continuing to monitor the system to ensure performance holds, and will provide updates as we have them.
One of our video partners is experiencing an issue where some streams (live and on-demand) have playback problems, ranging from increased startup failures to increased stalling and rebuffering during playback. https://status.mux.com/incidents/l5ckhrj68d2d
Report: "Elevated API Errors"
Last update: This incident has been resolved.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Investigating Headers Timeout Error"
Last update: Please see our post-mortem below regarding the sporadic "Header Timeout" errors observed on September 19th, 2022, from ~19:48 UTC through to ~20:07 UTC. Our goal in this post-mortem is to provide details on our initial assessment of the incident as communicated on the Swapcard status page and to describe the remediation actions that we have taken to restore service.

## **Incident summary**

On Monday, September 19th at ~19:48 UTC, we experienced sporadic "Header Timeout" errors on our Event App & Studio due to a memory leak and abrupt periodic restarts of one of our core internal services. At ~19:50 UTC, our infrastructure team was automatically alerted to an unusual number of "Header Timeout" entries in our logs and to reports of errors displayed to some users. Swapcard monitoring detected the start of the disruption and activated the Swapcard Incident Response team. Swapcard's team worked to triage and mitigate the incident by scaling out the internal core service to reduce memory pressure and spread the load across a larger number of instances than usual, lowering the probability of restarts. In parallel, the cause of the issue was investigated and short- and mid-term plans were put in place.

## **Mitigation deployment**

At ~19:55 UTC our infrastructure team manually scaled out the internal core service to reduce memory pressure. The scaling process took around ~7 minutes. The error rate dropped as the scaling propagated through our infrastructure. The Swapcard Engineering team then monitored application endpoint logs to ensure full and proper recovery. As a result of this change, customers would see a reduction in the sporadic error messages. At ~20:02 UTC, Swapcard confirmed that the scaling was complete and that no further errors were detected or reported. Swapcard's Engineering team identified the root cause and worked on short- and mid-term mitigation plans while the incident was being mitigated by the Swapcard Incident Response team.

## **Event Outline**

### **Duration Summary**

* Time alerted to the issue: 2 minutes
* Time to identify the source of disruption: ~5 minutes
* Time to initiate recovery: ~5 minutes
* Time to monitor and restore pre-crash service: ~5 minutes

### **Events of September 19th, 2022 (UTC)**

* 19:48 UTC | Initial onset of the header timeout error rate increase
* 19:50 UTC | Disruption identified by Swapcard monitoring
* 19:50 UTC | Swapcard status post activated
* 20:02 UTC | Incident mitigated
* 20:07 UTC | Status post resolved

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.

## **Forward Planning**

Swapcard has deployed a permanent mitigation for this incident in accordance with our high standards of deliverability.
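For context on the symptom itself (a generic sketch, not Swapcard's code; we do not know which HTTP stack raised these errors): with the `undici` Node.js client, a headers timeout fires when an upstream service fails to return response headers in time, which is exactly what callers see while a service is restarting. Setting the timeout explicitly and retrying keeps the error bounded; the upstream URL and values below are assumptions.

```typescript
// headers-timeout-client.ts - illustrative sketch only (npm install undici).
import { request, errors } from "undici";

const UPSTREAM_URL = "http://core-internal-service.local/health"; // hypothetical
const MAX_ATTEMPTS = 3;

async function callWithRetry(): Promise<void> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      const res = await request(UPSTREAM_URL, {
        headersTimeout: 5_000, // fail fast instead of hanging on a restarting instance
        bodyTimeout: 10_000,
      });
      await res.body.text(); // consume the body so the connection can be reused
      console.log("upstream responded with status", res.statusCode);
      return;
    } catch (err) {
      if (err instanceof errors.HeadersTimeoutError && attempt < MAX_ATTEMPTS) {
        console.warn(`headers timeout (attempt ${attempt}), retrying...`);
        continue;
      }
      throw err;
    }
  }
}

callWithRetry().catch((err) => {
  console.error("upstream call failed", err);
  process.exit(1);
});
```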
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Investigating intermittent "Something went wrong" error reported on Login"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Event app increase response time"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Error page on event app"
Last update: This incident has been resolved.
Report: "404 error page on studio.swapcard.com"
Last update: Please see our post-mortem below regarding the 404 error page observed on May 5th, 2022, from 13:11 UTC through to 13:20 UTC. Our goal in this post-mortem is to provide details on our initial assessment of the incident as communicated on the Swapcard status page and to describe the remediation actions that we have taken to restore service.

## **Incident summary**

On May 5th at 13:12 UTC, we experienced a 404 error page on our Studio due to a misconfiguration of the web application ([studio.swapcard.com](http://studio.swapcard.com)). At 13:12 UTC, our infrastructure team was automatically alerted to an unusual 404 error page on the web application. Swapcard monitoring detected the start of the disruption and activated the Swapcard Incident Response team. Swapcard's team worked to triage and mitigate the incident by checking recent changes to the web application's configuration and reverting to an older version. In parallel, the cause of the issue was investigated and short- and mid-term plans were put in place.

## **Mitigation deployment**

At 13:13 UTC our infrastructure team immediately reverted the configuration changes for studio.swapcard.com. The switch took around ~6 minutes. The error reports stopped as the update propagated through our infrastructure. Swapcard Engineering then monitored [studio.swapcard.com](http://studio.swapcard.com) to ensure full and proper recovery. As a result of this change, customers would see a reduction in 404 error messages. At 13:20 UTC, Swapcard confirmed that the revert was complete and that no further errors were detected or reported. Swapcard's Engineering team identified the root cause and, by ~14:31 UTC, had redeployed the corrected configuration change.

## **Event Outline**

### **Duration Summary**

* Time alerted to the outage: 1 minute
* Time to identify the source of disruption: ~2 minutes
* Time to initiate recovery: ~6 minutes
* Time to monitor and restore pre-crash service: ~1 minute

### **Events of May 5th, 2022 (UTC)**

* 13:11 UTC | Initial onset of the error rate increase
* 13:12 UTC | Disruption identified by Swapcard monitoring
* 13:12 UTC | Swapcard status post activated
* 13:13 UTC | Configuration revert initiated
* 13:19 UTC | 404 error rate decreased and recovered
* 13:20 UTC | Incident mitigated
* 13:20 UTC | Status post resolved
* 14:31 UTC | Swapcard Engineering redeployed the corrected configuration change

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.

## **Forward Planning**

Swapcard has deployed a permanent fix for this incident in accordance with our high standards of deliverability. **We consider the likelihood of a recurrence of this issue to be extremely low.**
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Increase response time and timeout on Developer API"
Last update: This incident has been resolved.