Historical record of incidents for InEvent
Report: "Login delayed authentication"
Last update:
## Incident Summary
We experienced a slowdown in the API’s ability to issue new tokens. This caused delays for some users attempting to generate tokens.
## Current Status
The underlying issue has been identified and resolved. Changes have been applied, and we are actively monitoring the system to ensure stability. Users experiencing issues are advised to refresh their browsers, which should resolve the problem.
## Next Steps
We will continue to monitor performance and ensure that the service remains stable. A post-incident analysis will be conducted, and a full report will be made available. Thank you for your patience. If you experience further issues, please contact our support team.
Report: "Payment Emails not being sent properly"
Last update: We identified an issue with our email provider when sending payment emails and have deployed a fix. New emails will be sent right away, and emails from the past 72 hours that had not yet been sent have been added to our queue and are being sent now.
Report: "UK Server Region slowness"
Last update: We identified slowness when opening certain modules on the UK Server Region and have deployed a fix for the problem. A small percentage of users may have received a "Bad Gateway" response when connecting between 11:00 AM and 11:05 AM (CET) to the Virtual Lobby of the UK Server.
Report: "Servers non-responsive"
Last update: We had an issue with the AWS auto-scaling module that caused our servers to take longer than expected to reach the desired scaling point; as a result, our servers stopped responding for about 5 minutes (the time it took to reach the desired scaling point). This behavior is out of the ordinary, as we run several scaling procedures every day. Our team is investigating what caused this momentary issue and will apply a fix.
Report: "Mux service is down and Live Studio customers might be affected"
Last update: This incident has been resolved.
Mux has identified the issue and deployed a fix. We will be monitoring the results and update this report soon.
We are still waiting for Mux support to contact us.
We are investigating an issue with Mux (mux.com) Streaming Platform that is rejecting new InEvent Live Studio streams due to an outage on their end. Our recommendation for now is to use Video Conferencing or RTMP modes for customers that need to stream content. Once we have updates from Mux, we will update this incident report.
Report: "European Servers are not responsive"
Last update: This incident has been resolved. Our European Servers are fully operational.
We are continuing to monitor the servers. The API calls are now 100% responsive and without degraded performance.
The DNS servers are now responsive and our Load Balancer is now draining pending requests and flushing old requests in favor of new ones. New API calls will be handled normally and we will be monitoring the results.
We have identified a DNS issue that caused our server requests to queue and severely delayed responses to API requests. The API microservices had to wait longer than anticipated for DNS resolution, which caused timeouts returned as Gateway Timeout HTTP errors. We have deployed an additional DNS server; this may take up to 30 minutes to take effect and for requests to normalize. We will keep monitoring as the change propagates.
We are investigating an issue with the European Servers that is causing response delays and Gateway Timeouts on certain requests.
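The failure mode described in this report is API calls stalling on slow DNS resolution until they surface as Gateway Timeouts. As a rough illustration only (this is not InEvent's code; the resolver addresses, helper names, and 500 ms budget are assumptions), a service can bound DNS lookup time and fall back to a secondary resolver instead of letting slow lookups block the request:

```typescript
// Hedged sketch: bounds DNS resolution time and falls back to a
// secondary resolver. Resolver IPs and the timeout are illustrative.
import { promises as dns } from "node:dns";

const PRIMARY = new dns.Resolver();
const SECONDARY = new dns.Resolver();
SECONDARY.setServers(["1.1.1.1"]); // assumed fallback resolver

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`DNS lookup exceeded ${ms} ms`)), ms),
    ),
  ]);
}

export async function resolveHost(hostname: string): Promise<string[]> {
  try {
    // Try the primary resolver, but never wait longer than 500 ms.
    return await withTimeout(PRIMARY.resolve4(hostname), 500);
  } catch {
    // Fall back to the secondary resolver before giving up.
    return withTimeout(SECONDARY.resolve4(hostname), 500);
  }
}
```

Bounding the lookup keeps a slow resolver from tying up the request, and switching resolvers mirrors the mitigation described in this report.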
Report: "Live Studio streams affected by Mux (mux.com) AV Sync issue"
Last update: This issue has been resolved.
We have identified an issue on Mux's CDN that is affecting all Live Studio streams. Streams that last for more than an hour might have AV synchronization issues due to an internal issue on Mux's servers.
Report: "Service Issue: Outbound email sending is delayed"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
This incident happened on Postmark's end. For further information, please go to https://status.postmarkapp.com/incidents/277086
Report: "Issue with AWS Subnet Traffic Control"
Last update: There was an issue with our Subnet Traffic Control policy in our public AWS VPC that was blocking HTTP and HTTPS traffic. Once we identified the issue and applied the fixes, it took around 30 minutes for the new policy to propagate to all our subnets.
Report: "Emails are delayed"
Last update: This incident has been resolved.
Our subprocessor Postmark is having issues delivering emails. Although delayed, all emails will still be delivered. We will update this incident when new information arises.
Report: "Virtual Lobby is unresponsive"
Last update: This issue was caused by a code deployment we made to add support for custom Push Notifications with dynamic content (first-name, last-name, email, etc.). In some circumstances, this code contained a loop that would cause your browser tab to freeze and crash completely. We have reverted the code and will evaluate another way to build dynamic content for Push Notifications.
We have identified the issue and fixed it. The Virtual Lobby should now open normally on Desktop and also on Mobile devices.
The Virtual Lobby is unresponsive and we are investigating the reason. The rest of the platform is working normally.
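The tab freeze described above is the classic risk of substituting dynamic content with an open-ended loop: if a substituted value reintroduces a placeholder, the loop never terminates. Purely as an illustrative sketch (the {{placeholder}} syntax and function names are assumptions, not InEvent's implementation), a single-pass replacement avoids that failure mode:

```typescript
// Hedged sketch, not the reverted InEvent code. It contrasts a risky
// substitution loop with a single-pass replacement.
type Attendee = Record<string, string>;

// Risky pattern: if a substituted value itself contains "{{", this loop
// never terminates and freezes the browser tab.
// while (text.includes("{{")) { text = replaceFirstPlaceholder(text, attendee); }

// Safer pattern: resolve every placeholder in one pass over the template.
export function renderPush(template: string, attendee: Attendee): string {
  return template.replace(/\{\{([\w-]+)\}\}/g, (match, key: string) =>
    key in attendee ? attendee[key] : match, // leave unknown keys untouched
  );
}

// Example: renderPush("Hi {{first-name}}!", { "first-name": "Ada" }) -> "Hi Ada!"
```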
Report: "Firebase is currently outputting closed connection errors"
Last update: **Google Firebase** had a major global outage this afternoon that lasted for 6 minutes. This outage impacted our **Virtual Lobby**, but the overall functionality still worked, with limited connectivity to _Chats, Questions, Group Rooms_ and _interactions_. You can check out **Google’s** official statement here: [https://status.firebase.google.com/incidents/Xed5fGf9USDWdzsqN4GX](https://status.firebase.google.com/incidents/Xed5fGf9USDWdzsqN4GX)
The issue has been mitigated and the service has been restored. https://status.firebase.google.com/incidents/Xed5fGf9USDWdzsqN4GX
We are continuing to monitor for any further issues.
The Google Firebase issue can also be tracked on this page: https://status.firebase.google.com/
We have implemented a fix that allows you to connect to the Virtual Lobby with limited connectivity. While Google Firebase is offline, we recommend switching to InEvent native websockets under Settings > Tools.
We are continuing to investigate this issue.
We are checking with Google the status of the issue. For now, we recommend switching to the Native WebSockets option under Event > Tools.
Report: "Slow web socket connection"
Last update:
## What is a “Websocket”?
Without getting into too much detail, a **Websocket** is a method of network communication used for real-time applications. We use **Websockets** for real-time communication and interaction. These are the modules that use the **Websocket** service:

* News Feed;
* Inbox;
* Session Chat;
* Session Q&A;
* Networking;
* Creation of Group Rooms;
* Invitations;
* Push Notifications;
* Live updates (session settings changes);
* Networking Roulette.

## Issues with Native Websocket and Regular Websocket (Google Firebase)
We have two Websocket providers: Google Firebase (realtime database) and our own implementation (Native Websocket). Today a large number of users connected at the same time, which caused the Websocket servers to halt. Google Firebase could not scale fast enough, and the Native Websockets could not handle the scale either. As a result, users saw the “Connecting” popup appear and never disappear.

## Issues with Caching Server (Redis)
We had a major outage with our caching server (Redis) that caused the entire platform and backend to go offline. The Redis server clogged up and could not handle the server scaling and load, which resulted in an overall failure of the platform. The landing page and login page remained operational.

## Fixes implemented
For **Native Websockets** we have implemented manual scaling for now, and we will work on the autoscaling mechanism to support large loads in the future. If the connection fails, you will still be able to use the Virtual Lobby normally, with limited interaction (a client-side sketch of this fallback appears at the end of this report).

For **Google Firebase** we could not implement a fix. We will try sharding the entire operation into multiple microservices for different modules (Chat, Q&A, etc.), but since they don’t support replicas, it will be hard to scale on large events. If your event has more than 5,000 users, it’s better to use **Native Websockets**. If the connection fails, you will still be able to use the Virtual Lobby normally, with limited interaction.

For the **Caching Server (Redis)** we are still implementing a fix, but we have deployed a temporary workaround that should replace the caching server for now and keep the platform and the backend stable. This is an internal fix and shouldn’t affect the user experience.

## What to expect for this week
The platform backend and all its modules should be operational. In case **Native Websockets** or **Google Firebase** fails, you will still be able to access the platform and the Virtual Lobby, but users will have a limited experience without realtime interactions: chat, Q&A and the other modules listed above will not be operational. We are constantly working on improvements, and we will announce when we have both realtime **Websockets** fully functional for large events. Meanwhile, we can guarantee that the backend and the Virtual Lobby will be online, even in the case of a limited realtime experience.
This incident has been resolved.
The team has concluded the implementation of all temporary solutions on the platform. This includes adding a timeout option for slow-connecting sockets and disabling Redis as a single connection point. The web socket chat will remain with limited chat support on the Firebase instance until we add support for per-instance local connections, which should happen next week. The Redis cache team will be implementing a new permanent solution by the end of this week, Friday at the latest.
We have fixed an autoscaling Redis instance that was not able to redirect load to a separate system. InEvent uses Redis to quickly balance its write operations instead of relying only on a traditional SQL database. The following components are still affected: Firebase Web Sockets, which covers the live chat on the Virtual Lobby.
We have identified the slowness as coming from the Firebase product. We are deploying an alternative using Native websockets, under Company > Tools. We are also deploying UI improvements for the Firebase-backed console while Firebase is not responsive. Chat may be offline while the fix is being applied, but video and streaming should work normally.
We are currently seeing a slow scalability response from the InEvent Firebase socket group. We are investigating the issue with the Google Firebase team.
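To make the "limited interaction" fallback described in this report concrete, here is a hedged client-side sketch (the endpoint URL, timeout, and function names are assumptions, not InEvent's actual Virtual Lobby code): the client attempts a native WebSocket connection and, if it cannot be established in time, degrades to a limited mode instead of blocking the user.

```typescript
// Hedged sketch of a provider fallback; names and values are illustrative.
type RealtimeMode = "native-websocket" | "limited";

export function connectRealtime(url: string, timeoutMs = 5000): Promise<RealtimeMode> {
  return new Promise((resolve) => {
    const socket = new WebSocket(url);

    // If the socket does not open in time, degrade to "limited" mode:
    // the Virtual Lobby stays usable, but realtime chat/Q&A are off.
    const timer = setTimeout(() => {
      socket.close();
      resolve("limited");
    }, timeoutMs);

    socket.onopen = () => {
      clearTimeout(timer);
      resolve("native-websocket");
    };
    socket.onerror = () => {
      clearTimeout(timer);
      resolve("limited");
    };
  });
}

// Usage sketch: const mode = await connectRealtime("wss://sockets.example.com/lobby");
```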
Report: "AWS Hardware Failure & Recovery (1 minute)"
Last update: AWS had an outage in our North America region and our failover and recovery mechanism took action. Between 10:59:30 PM and 11:00:30 PM EDT, the website was slow and unresponsive. The issue was resolved within 1 minute.
Report: "Low Latency Stream Issues"
Last update:
**Affected users:** Events with _Custom Domain_ enabled that used _Control Room_ video modes with _Low Latency_ enabled.

**Workaround:** Disable _Custom Domain_ or change the video latency to _Standard_.

**Issue description:** We encountered issues when using the _Control Room_ video mode with _Low Latency_ enabled while also using a _Custom Domain_. The _Low Latency Control Room_ option uses [Amazon AWS IVS](https://aws.amazon.com/ivs/) technology for Low Latency Streaming (sub 5 seconds), and version 1.3.0 of its IVS Video.js Player had an issue with Cross-Origin Resource Sharing (CORS) that resulted in a frozen player (a pitch-black image with a blue play button that did not respond when pressed). We are still not sure whether this issue was caused by the 1.3.0 version of the player or by an underlying issue with the _playlist url_ and the CORS headers sent by their servers.

**Fix:** We updated the player plugin to its newest version (1.3.1) and implemented fixes that will also circumvent CORS issues in the future. This solution implements a fallback to InEvent's domain in case the _Custom Domain_ doesn't work; the fallback only affects the Video Player and not the actual parent window (visually indistinguishable for regular users). We also contacted AWS IVS support for further assistance. Per our thorough tests, the player now works with _Custom Domains_ (either on event level or company level) on all browsers.
We encountered issues when using Low Latency with custom domains and have fixed them. The Low Latency Control Room option uses Amazon AWS IVS technology for Low Latency Streaming (sub 5 seconds), and version 1.3.0 of its IVS Video.js Player had an issue with cross-origin resource sharing (CORS) that resulted in a frozen player (a pitch-black image with a blue play button that did not respond when pressed). We have updated the player plugin to its newest version (1.3.1) and implemented fixes that will also circumvent CORS issues in the future: a fallback to the regular iFrame using InEvent's domain in case the custom domain doesn't work (visually indistinguishable for regular users).
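The fix described in this report amounts to probing the custom domain first and falling back to InEvent's own domain when the browser rejects the cross-origin request. A minimal sketch of that idea, assuming hypothetical hostnames and helper names (this is not the actual player plugin code):

```typescript
// Hedged sketch of a CORS-aware fallback for the video player source.
const DEFAULT_PLAYER_HOST = "inevent.example.com"; // assumed fallback host

// Probe the HLS playlist on the custom domain; if the browser blocks it
// (e.g. missing CORS headers), serve the player from the default domain.
export async function resolvePlaylistUrl(customHost: string, path: string): Promise<string> {
  const customUrl = `https://${customHost}${path}`;
  try {
    const response = await fetch(customUrl, { method: "HEAD", mode: "cors" });
    if (response.ok) return customUrl;
  } catch {
    // CORS failures surface as a thrown TypeError from fetch.
  }
  return `https://${DEFAULT_PLAYER_HOST}${path}`;
}
```

Only the player's source URL changes on fallback, so the parent page (and any custom domain shown in the address bar) is unaffected, which matches the behaviour described above.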
Report: "Socket performance is downgraded"
Last update: This incident has been resolved.
Google is our socket hosting provider and their servers are offline. We are investigating this incident with Google engineers. https://status.firebase.google.com/incident/Console/21001
Report: "Issues on UK Region connectivity"
Last update: This incident has been resolved.
We have identified unhealthy server replicas that were being treated as healthy servers, resulting in poor connectivity to our platform. We have deployed a fix for this issue.
We are currently investigating this issue.
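For context on the issue above: a load balancer decides which replicas receive traffic based on a health check, so a check that does not reflect the replica's real state keeps routing users to broken instances. The sketch below is a hedged illustration (the /healthz path and checkDatabase helper are hypothetical, not InEvent's implementation) of tying the health endpoint to a real dependency:

```typescript
// Hedged sketch of a dependency-aware health check.
import { createServer } from "node:http";

async function checkDatabase(): Promise<boolean> {
  // Placeholder: ping the datastore the replica actually depends on.
  return true;
}

// A replica that always answers 200 can look "healthy" to the load
// balancer while its dependencies are down. Tie the health endpoint to a
// real dependency so broken replicas are taken out of rotation.
createServer(async (req, res) => {
  if (req.url === "/healthz") {
    const ok = await checkDatabase();
    res.writeHead(ok ? 200 : 503).end(ok ? "ok" : "dependency unavailable");
    return;
  }
  res.writeHead(404).end();
}).listen(8080);
```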
Report: "Issues on UK Region connectivity"
Last update: This incident has been resolved.
We detected an issue on our auto scaling group in our AWS eu-west-1 region. A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Increased error rate"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Minor 1 minute downtime due to environment updates"
Last update: This incident has been resolved.
Report: "Slowness EU region"
Last update: Our AWS load balancer flagged multiple connections from the same region as invalid, which prevented some users from connecting. The issue has been fixed.
Report: "Slowness on delivery"
Last update: This incident has been resolved.
All emails are taking longer than expected to deliver.
Report: "Slowness and unavailability"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We received a higher than expected number of requests in a brief period of time.
Report: "Mux Live Stream is rejecting new RTMP streams"
Last update: Mux had a per-minute request limit, and we managed to have it increased for our customers. We are working hard to avoid this in the future.
This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Mux Live Stream is unstable"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Database slowness"
Last update: Our servers use an asynchronous, event-driven approach, rather than threads, to handle requests. With this modular event-driven architecture, incoming requests land in a bucket and are dispatched as soon as they arrive. Once a request arrives, we perform some safety checks and then route it to the appropriate logical application. This can be done in multiple ways. What we realized during today's operations was that the servers were not able to pipe this many requests quickly enough to the application layer: the socket could not send the requests to the database, process the output, and return it to the user fast enough. To fix this issue, we are improving our load balancer, changing our socket configurations, and changing the database processing mechanism so that queries can be processed without additional allocation overhead (see the sketch at the end of this report).
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
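To make the pipeline described in this report's final update more concrete, here is a hedged sketch (class and parameter names are assumptions, not InEvent's server code) of the general technique: an event-driven server admits requests as they arrive but caps how many database operations are in flight, so a burst queues in the event loop instead of clogging the socket to the database.

```typescript
// Hedged sketch: bound the number of concurrent database operations in an
// async, non-threaded server, queueing the overflow in the event loop.
type Task<T> = () => Promise<T>;

class DbGate {
  private active = 0;
  private waiting: Array<() => void> = [];

  constructor(private readonly maxInFlight: number) {}

  async run<T>(task: Task<T>): Promise<T> {
    while (this.active >= this.maxInFlight) {
      // Park the request in the event loop instead of spawning a thread.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.waiting.shift()?.(); // wake the next queued request, if any
    }
  }
}

// Usage sketch: const gate = new DbGate(100);
// gate.run(() => db.query("SELECT ...")).then((rows) => sendResponse(rows));
```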
Report: "Delay in emails and push notifications"
Last update: Postmark emails were delayed for the last couple of minutes; the issue has now been fixed.
Report: "Database slowness"
Last update: We detected that one of the master servers exceeded its temporary storage capacity. Auto-scaling of temporary storage capacity has now been enabled for all master servers.
We are currently investigating a database connectivity slowness at our US East region.
Report: "Live streaming seems to be buffering more than usual"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
Mux has identified the problem and fixed it; we are monitoring until things return to normal.
It looks like the Live Streaming service is back to normal, but we are still investigating the cause of this issue. New updates will be posted shortly.
We are investigating a degraded performance on our Mux Live Streaming Service.
Report: "Cloudflare outage"
Last update: All systems should be accessible from any location now.
Users are beginning to be able to connect, and the upstream internet issues appear to be recovering.
It seems that Cloudflare had a major outage that affected millions of websites worldwide (https://techcrunch.com/2020/07/17/cloudflare-dns-goes-down-taking-a-large-piece-of-the-internet-with-it/)
Users are currently having trouble connecting to InEvent due to an upstream internet issue.
Report: "Slowness and unavailability"
Last update: We identified specific endpoints with longer loading times than usual. Due to the large number of requests during this period, availability was compromised and most users had problems accessing our platform. We have fixed these endpoints, and they will no longer be an issue when we face a large number of simultaneous requests.
We identified slowness in the U.S. East Coast region, and the service may have been unavailable for most users accessing that region. This was caused by an unusually large number of repeated requests, which we have fixed.
Report: "We identified some instabilities when trying to access the platform."
Last update: This incident resulted in slow access to the platform and failures when uploading and downloading files.
Report: "Database not available"
Last update: A high traffic peak hit our servers for 12 minutes.
Report: "Email analytics loading issues"
Last update: This incident has been resolved.
This issue has been identified and we are working to fix it.
Report: "Revoked access for dynamic pages"
Last update: We encountered an error when accessing dynamic pages and have already fixed it.