signageOS

Is signageOS Down Right Now? Check whether there is an ongoing outage.

signageOS is currently Operational

Last checked from signageOS's official status page

Historical record of incidents for signageOS

Report: "Some actions may take longer to process"

Last update
resolved

Temporary spike in traffic. Resolved automatically.

investigating

We are currently investigating this issue.

Report: "Some actions may take longer to process"

Last update
resolved

Temporary spike in traffic. Resolved automatically.

investigating

We are currently investigating this issue.

Report: "Validating availability of recent PING data in Box"

Last update
resolved

False positive alert

investigating

We are currently investigating this issue.

Report: "Box not receiving health checks"

Last update
postmortem

**Date**
2025-05-12

**Authors**
Vaclav Boch, DevOps Lead

**Summary**
On 12th May 2025, signageOS experienced a 20-minute degradation in the availability of device health check information (pings) displayed in the Box UI. During this time, users were unable to view up-to-date health data from devices, and some received false-positive offline alerts. The issue was reported by customers. No devices in the field were impacted, and no other services or functionalities were affected.

**Impact**
* Users of Box experienced missing or delayed device health (ping) updates.
* Some users received false-positive notifications about devices being offline.
* No actual devices were affected; all remained connected and operational.
* No other parts of the signageOS platform were impacted.

**Trigger**
A misconfiguration in the alerting system for one of the three RabbitMQ instances prevented proper detection of a queue processing issue related to device health data.

**Detection**
The incident was reported by customers via support channels. Internal monitoring did not detect the issue due to the misconfigured RabbitMQ alert.

**Root Causes**
* One of the three RabbitMQ instances had an alert misconfiguration, which led to a lack of visibility into the state of the message queues handling device ping updates.
* This prevented the DevOps team from being notified of the degradation.
* As a result, the health check data pipeline temporarily failed to deliver real-time updates to the UI.

**Remediation**
* The misconfigured alert was corrected immediately upon identification.
* A validation audit was conducted across all RabbitMQ instances and critical services to ensure proper alerting coverage (an illustrative sketch of such a check follows below).
* Additional safeguards are being implemented to prevent blind spots in monitoring for partial system degradations.
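The remediation above centres on alerting coverage for every RabbitMQ instance. As an illustration only, not signageOS's actual tooling, the sketch below shows one way a queue-backlog audit could be scripted against the standard RabbitMQ management HTTP API; the host names, credentials, and threshold are hypothetical placeholders.

```typescript
// Hypothetical audit: flag queues whose backlog exceeds a threshold on every
// RabbitMQ instance, using the management HTTP API (GET /api/queues).
// Hosts, credentials, and the threshold are placeholders, not real values.

interface QueueInfo {
  name: string;
  vhost: string;
  messages: number; // ready + unacknowledged messages in the queue
}

const INSTANCES = [
  "rabbitmq-0.example.internal",
  "rabbitmq-1.example.internal",
  "rabbitmq-2.example.internal",
];
const BACKLOG_THRESHOLD = 10_000;

async function auditInstance(host: string): Promise<void> {
  const auth = Buffer.from("monitor:secret").toString("base64"); // placeholder credentials
  const res = await fetch(`http://${host}:15672/api/queues`, {
    headers: { Authorization: `Basic ${auth}` },
  });
  if (!res.ok) {
    console.error(`[${host}] management API unreachable: HTTP ${res.status}`);
    return;
  }
  const queues = (await res.json()) as QueueInfo[];
  for (const q of queues) {
    if (q.messages > BACKLOG_THRESHOLD) {
      console.warn(`[${host}] queue ${q.vhost}/${q.name} backlog: ${q.messages}`);
    }
  }
}

async function main(): Promise<void> {
  // Check every instance so a misconfigured alert on one node cannot hide a backlog.
  await Promise.all(INSTANCES.map(auditInstance));
}

main().catch((err) => console.error(err));
```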

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

Based on the user report, there are no health checks/pings data available in Box. We are investigating the root cause.

investigating

We are currently investigating this issue.

Report: "Temporary system degradation caused by Redis cache"

Last update
postmortem

**Date**
2025-04-12

**Authors**
Václav Boch, DevOps Engineer

**Summary**
System degradation of multiple services due to a Redis database crash.

**Impact**
Temporary unavailability of the API, Box, and Platform services. No deployed devices were affected.

**Trigger**
A Redis instance designated as a cache ran out of memory and was rotated.

**Detection**
Our internal monitoring system detected high memory usage on one of our Redis cache servers.

**Root Causes**
Upon detection, we immediately began investigating the issue. Although the server initially had sufficient memory, an unexpected memory spike caused Redis to be terminated by the OOM (Out of Memory) Killer. Unfortunately, automatic restarts were disabled in SystemD for this scenario, requiring manual intervention from a system administrator. Additionally, no key eviction policy was configured in Redis to manage high memory consumption. This server was intended as a simple cache instance and was deployed as a single-instance replica. However, multiple services used it as a primary data source, leading to service unavailability when the instance crashed.

**Remediation**
While most of our Redis servers have replication enabled, we are in the process of deploying a more robust Redis solution. This new setup, enforced via Kyverno, will prevent single-instance Redis deployments in production. Every Redis instance will be configured with policies to prevent crashes due to OOM events. We are implementing stricter internal policies and providing additional training for developers to prevent similar single points of failure in the future.
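A key point in the root cause is that services treated a single-instance Redis cache as a primary data source. The following sketch, which is not the signageOS codebase (the function names, key layout, and the ioredis client are assumptions), illustrates the cache-aside shape the postmortem implies: reads fall back to the authoritative database when the cache is unavailable, so a Redis crash degrades latency rather than availability.

```typescript
import Redis from "ioredis"; // assumed client library

// Cache-aside sketch: Redis accelerates reads but is never the source of truth.
// loadDeviceFromDb is a placeholder for the real primary-database query.
const redis = new Redis({ host: "redis-cache.example.internal" }); // placeholder host

async function loadDeviceFromDb(deviceUid: string): Promise<string> {
  // ... query the primary database here ...
  return JSON.stringify({ uid: deviceUid });
}

export async function getDevice(deviceUid: string): Promise<string> {
  const key = `device:${deviceUid}`;
  try {
    const cached = await redis.get(key);
    if (cached !== null) return cached;
  } catch {
    // Cache unavailable (e.g. an OOM-killed instance): fall through to the database.
  }
  const fresh = await loadDeviceFromDb(deviceUid);
  try {
    await redis.set(key, fresh, "EX", 60); // short TTL; eviction is acceptable for a cache
  } catch {
    // Failing to populate the cache must never fail the request.
  }
  return fresh;
}
```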

resolved

This incident has been resolved.

investigating

We are currently investigating this issue.

Report: "[False Positive] Report on outage of the platform service"

Last update
postmortem

False-positive entry triggered by monitoring system.

resolved

Our team received an alert from the alerting system that the Platform service was down, which would mean that potentially all devices were offline. After investigating, the team concluded that it was a false positive.

Report: "API services degradation"

Last update
postmortem

**Date**
2025-04-01

**Authors**
Michal Artazov, DevOps Lead

**Summary**
System degradation of multiple services due to human error and system inefficiency.

**Impact**
Degradation of the API; temporary unavailability of API, Box, Platform and Screenshots.

**Trigger**
A signageOS employee accidentally triggered a Bulk Action to all devices under the internal signageOS company, assigning an Applet to all of them. That included tens of thousands of devices.

**Note:** None of the devices were production devices. It's a mix of real devices in the signageOS lab and virtual devices used for load testing.

**Detection**
The trigger led to a message queue growing in size due to the inability to process messages quickly enough. The alerting system alerted the team about this as soon as it happened.

**Root Causes**
A combination of factors contributed:

1. The Bulk Action assigned the Applet even to Devices that have an Applet managed by a Policy. That in turn triggered an automatic process that began reverting it for each such Device to ensure that they adhere to their Policy setting. This made the whole problem worse, doubling the amount of Applet assignments that needed to be processed.
2. The service responsible for processing Bulk Actions flooded the system with too many Device changes that couldn't be properly handled in time, overloading the database. This in turn affected other services that depend on that database as the database performance degraded.
3. An employee triggered the excessive Bulk Action without realizing that it would be triggered on that many devices. The overall magnitude of this action exceeded 3 times the volume of the last performance test for processing bulk action/policy changes, without the proper guardrails in place.

**Remediation**
We will address each part of the root cause separately; a sketch of the first two items follows below.

1. **Bulk Actions optimization no. 1** - devices that have an Applet (or another property) managed by a Policy should be skipped to reduce the system load.
2. **Bulk Actions optimization no. 2** - the Bulk Actions service should process devices in batches, always making sure that the system receives a reasonable load it can manage. That way, if there's a degradation, it will slow down the Bulk Actions service only and won't affect the rest of the system.
3. **Improve Bulk Actions UX in Box** - we will analyze how we can improve the UX to make accidental excessive Bulk Actions less likely.
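The first two remediation items describe skipping Policy-managed devices and processing Bulk Actions in bounded batches. A minimal sketch of that shape follows; the types, helper functions, and batch size are hypothetical stand-ins, not the actual Bulk Actions service.

```typescript
// Hypothetical batching sketch for a bulk Applet assignment.
// Device, assignApplet, and BATCH_SIZE are illustrative placeholders.

interface Device {
  uid: string;
  appletManagedByPolicy: boolean;
}

const BATCH_SIZE = 200; // tuned so each batch stays within tested system load

async function assignApplet(deviceUid: string, appletUid: string): Promise<void> {
  // ... call the real assignment endpoint here ...
}

export async function runBulkAssign(devices: Device[], appletUid: string): Promise<void> {
  // Optimization no. 1: skip devices whose Applet is managed by a Policy,
  // so the Policy engine does not have to revert the change afterwards.
  const targets = devices.filter((d) => !d.appletManagedByPolicy);

  // Optimization no. 2: process devices in batches so the rest of the system
  // only ever sees a bounded amount of concurrent work.
  for (let i = 0; i < targets.length; i += BATCH_SIZE) {
    const batch = targets.slice(i, i + BATCH_SIZE);
    await Promise.all(batch.map((d) => assignApplet(d.uid, appletUid)));
  }
}
```

With this shape, a degradation slows down only the Bulk Actions worker itself: each batch must finish before the next one starts, so the database never receives the full volume at once.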

resolved

System performance degradation caused by human error in combination with system inefficiency

Report: "Experiencing intermittent unavailability from several regions"

Last update
postmortem

**Date:** March 17, 2024

**Authors:** Václav Boch, DevOps Engineer

**Summary**
A critical failure in one of our MongoDB telemetry clusters resulted in a regional system-wide outage, impacting core services. Recovery time: 10 minutes.

**Impact**
* Complete regional outage of Box and the API.
* Both services returned 503 errors, preventing customer access.
* Devices in the field were **not** impacted.

**Trigger**
The incident was triggered by segmentation faults (SegFaults) occurring in rapid succession across multiple servers in the MongoDB platform cluster.

**Detection**
* The issue was initially identified through internal monitoring.
* Customer support tickets confirmed the impact shortly after detection.

**Root Cause**
* A segmentation fault (SegFault) in the MongoDB server led to instability and service disruption.

**Remediation Actions**
* Schedule an upgrade to a newer MongoDB version across all clusters to enhance stability.
* Improve microservice dependencies to ensure that future incidents result in only partial outages rather than full system failures.

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Action confirmation delayed"

Last update
postmortem

**Date**
January 17, 2024

**Authors**
Michael Zabka, CTO
Michal Artazov, DevOps Lead

**Summary**
During performance testing of a new feature, **Secrets within Policy**, we identified inefficiencies that led to service degradation. This resulted in partial outages of some functionality for **Box and the API**, lasting up to **three hours**. No device in the field was affected; all devices kept operating as expected.

**Impact**
* Box and the REST API experienced **timeouts and 50x errors**.
* The issue affected endpoints related to **device actions, logs, history, and audits**.
* Other API functions remained operational.

**Trigger**
The issue was triggered by a **performance test** of the **Secrets within Policy** feature involving **20,000 devices**.

**Detection**
The issue was first detected through **internal monitoring** and subsequently reported via **customer support tickets**.

**Root Causes**
1. **Performance Testing on Production**
   * The final-stage performance test for **Secrets within Policy** introduced unexpected system strain.
2. **Inefficiencies in the Command-Handler Service**
   * The **UpdatePolicy command** handler was inefficiently processing policy changes, emitting **excessive events**.
   * This issue was later resolved with a fix (internal resource).
3. **RabbitMQ Overload and Failure**
   * The excessive events led to **RabbitMQ overload**, exhausting memory and causing **Node 1 (of 3) to restart**.
   * While RabbitMQ uses **quorum queues**, a few remaining **classic queues** (**TimingCommandStored-related**) were affected, causing **platform-consumer service failures**.
   * The **RabbitMQ node recovery took ~20 minutes** before services could resume normal operation.
4. **Persistent Overload Post-Fix Deployment**
   * After deploying the command-handler fix, **no new overload events** were introduced, but **RabbitMQ still contained a backlog** of unprocessed events.
   * This affected **deviceActionLogs** (MongoDB & Redis), causing delays and resulting in **API timeouts (30s) and 50x errors**, along with **infinite loading states in Box**.
5. **Service Slowdowns Due to Dependencies**
   * **Device-consumer (DC)** and **Platform-consumer (PC)** services had indirect dependencies.
   * PC writes to **MongoDB**, while DC reads from it without a backoff mechanism, causing **contention** that further slowed PC processing.
   * A temporary removal of **transient queues** in DC improved performance but wasn't sufficient.
6. **MongoDB Performance Disparity**
   * **PC-mongo0** and **PC-mongo1** processed faster than **PC-mongo2**, due to PC-mongo2 being the **source for Redis notifications**, adding additional load.
   * Further strain occurred as **Mongo-Platform-Gamma** (primary for Box and the API) was under high demand.
7. **Resolution Strategy**
   * We waited for **PC-mongo0 to process its queue (the fastest node)**.
   * Once it caught up, we **migrated the primary DB source** for **Box, REST API, and Redis notifications** to **PC-mongo0**.
   * This resolved the incident.

**Remediation Actions**
✅ **Bug Fix in UpdatePolicy Command**
* The command-handler now prevents **duplicate events** when the same command is sent multiple times.

✅ **RabbitMQ Queue Migration**
* Remaining **classic queues** will be **migrated to quorum queues** to enhance fault tolerance (see the sketch below).

This incident highlighted key performance bottlenecks in event processing and database dependencies. Further optimizations are planned to **prevent recurrence and improve system resilience**.
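One of the remediation actions is migrating the remaining classic queues to quorum queues. For illustration only (the queue name and connection URL are placeholders, and this is not the signageOS deployment code), a RabbitMQ quorum queue is declared by setting the `x-queue-type` argument, for example with the amqplib client:

```typescript
import amqp from "amqplib"; // assumed client library

async function declareQuorumQueue(): Promise<void> {
  const connection = await amqp.connect("amqp://rabbitmq.example.internal"); // placeholder URL
  const channel = await connection.createChannel();

  // Quorum queues replicate via Raft across the cluster, so the restart of a
  // single node (as in this incident) does not take the queue down with it.
  await channel.assertQueue("timing-command-stored", {
    durable: true,
    arguments: { "x-queue-type": "quorum" },
  });

  await channel.close();
  await connection.close();
}

declareQuorumQueue().catch((err) => console.error(err));
```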

resolved

We’re happy to report that the issue has been fully resolved, and all systems are back to normal. We sincerely apologize for any inconvenience this may have caused and appreciate your patience while we worked through it. If you experience any lingering issues, please don’t hesitate to reach out to our support team. Thank you for your understanding and trust.

monitoring

API can falsely report status 500 while the action is performed. The patch will be deployed in production within 10 minutes.

monitoring

Some regions could still experience issues with REST API. The problem is being investigated.

monitoring

The patch has been deployed and the service is recovering. Most of the API endpoints are working as expected. ETA for complete recovery is 20 minutes.

identified

API response on several endpoints might be delayed or rejected. The issue was identified and is being mitigated.

identified

The issue has been identified and a fix is being implemented.

Report: "Partial queue processing degradation"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Partial queue processing degradation"

Last update
postmortem

**Date**
2024-10-16

**Authors**
Michael Zabka, CTO
Vaclav Boch, DevOps

**Summary**
On 2024-10-15 11:00 CEST, we deployed a new version of a component called "device-consumer" that contains an optimization for some telemetries (FirmwareVersion, FirmwareType, DeviceInfo, PIN, OfflineActions): they were moved from the standard database to a telemetry database. We identified that the caching algorithm works properly only for the standard database, not for the messages moved to the new telemetry database. It was inefficiently consuming and storing too many messages and produced too many unexpected writes to MongoDB. After identification of the problem (on the following day, 2024-10-16 at 08:00, when a large number of devices was connected), we had to roll back the deployment to the previous version.

**Impact**
Performance degradation of some telemetry events coming from devices; processing was partially delayed at random times during peaks.

**Trigger**
Deployment of an inefficient version of the "device-consumer" component, followed by the connection of a higher number of devices the following morning.

**Detection**
Correlation of the deployment with a significant anomaly in MongoDB writes and an increasing number of write operations.

**Root Causes**
An inefficient caching algorithm for telemetry data events in the MongoDB database.

**Remediation**
The team rolled the service back to the previous version. Next, we will reimplement telemetry caching in the new system and write performance tests (benchmarks) for the feature.

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Unavailability of Box and REST API"

Last update
postmortem

**Date**
2024-03-29

**Authors**
Michal Artazov, DevOps Lead

**Summary**
An inefficiently designed task in the CQRS/ES system led to a crash of one of the MongoDB replica sets.

**Impact**
Box and the API alternated between degraded performance and complete unavailability. No devices in the field were affected; they continued operating as expected.

**Trigger**
Undetected accumulation of events in the Event Sourcing database over time.

**Detection**
Monitoring detected the crash of one of the MongoDB databases and alerted the on-call engineer.

**Root Causes**
Undetected accumulation of events in the Event Sourcing database over time led to a gradual slowdown of command processing. Eventually, it reached a critical point where too much data was queried from the database at once, which caused the database to crash.

**Remediation**
The team has implemented several steps to remediate the problem.
* Improvements to the CQRS/ES task to prevent future excessive accumulation of events in the database.
* Consolidation and cleanup of the events in the database to reduce them to a manageable number.
* The team will discuss options to implement additional monitoring checks for future early detection of similar issues.
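The root cause here is unbounded accumulation of events slowing down command processing. A common mitigation in CQRS/ES systems, sketched below as a generic illustration (not signageOS's implementation; the collection names, reducer, and snapshot cadence are assumptions), is to snapshot aggregate state periodically so that rebuilding state only replays events newer than the latest snapshot instead of the full history.

```typescript
import { MongoClient } from "mongodb"; // assumed driver

// Generic event-sourcing snapshot sketch; collection names are placeholders.
interface DomainEvent { aggregateId: string; version: number; payload: unknown }
interface Snapshot { aggregateId: string; version: number; state: unknown }

const SNAPSHOT_EVERY = 500; // snapshot cadence is an assumption

function applyEvent(state: unknown, event: DomainEvent): unknown {
  // Domain-specific reducer (placeholder).
  return { ...(state as object), lastVersion: event.version };
}

export async function loadAggregate(client: MongoClient, aggregateId: string): Promise<unknown> {
  const db = client.db("eventstore");

  // Start from the most recent snapshot, if any.
  const snapshot = await db
    .collection<Snapshot>("snapshots")
    .findOne({ aggregateId }, { sort: { version: -1 } });
  const sinceVersion = snapshot?.version ?? 0;

  // Only replay events newer than the snapshot, which keeps the query size
  // bounded even for long-lived aggregates.
  const events = await db
    .collection<DomainEvent>("events")
    .find({ aggregateId, version: { $gt: sinceVersion } })
    .sort({ version: 1 })
    .toArray();

  let state: unknown = snapshot?.state ?? {};
  for (const event of events) {
    state = applyEvent(state, event);
  }

  // Persist a fresh snapshot once enough new events have accumulated.
  if (events.length >= SNAPSHOT_EVERY) {
    const version = events[events.length - 1].version;
    await db.collection<Snapshot>("snapshots").insertOne({ aggregateId, version, state });
  }
  return state;
}
```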

resolved

Temporary unavailability of Box and REST API, with degraded performance.

Report: "Telemetry degradation"

Last update
postmortem

**Summary**
* We have 3 replicas of mongo-telemetry that are used for storing Telemetry data (usually the last state of data for every device).
* 1 replica is always the primary and 2 replicas are secondaries.
* The problem started with a PagerDuty alert about oplog replication lag on the primary replica (that is unusual: the primary is the only writable replica, so its oplog should never be delayed).
* The fix required stopping the MongoDB telemetry replicas.
* The main indication for the resolution was the following log message on the MongoDB Telemetry primary replica: "Flow control is engaged and the sustainer point is not moving. Please check the health of all secondaries."

**Impact**
* The API and Box services did not show some of the telemetry data properly for 40 minutes.
* No impact on connected devices.

**Trigger**
PagerDuty - Grafana alert for MongoDB Telemetry 0, replication lag.

**Detection**
20:00 - immediately after the incident started.

**Root Causes**
* Page faults, probably on the hardware of the MongoDB Telemetry instances.

**Remediation**
* Increase IOPS on the MongoDB Telemetry data disks to 6000 IOPS.
* Change the MongoDB settings for high availability and reduce the consistency requirements on MongoDB Telemetry (see the sketch below).
* The MongoDB cluster should not depend on secondaries; it should work as a standalone instance when all secondary replicas are down or unavailable.
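One remediation item is relaxing consistency requirements so the telemetry cluster keeps serving when secondaries lag or are down. With the MongoDB Node.js driver, that trade-off is typically expressed through the write concern and read preference. The sketch below is a generic illustration with placeholder connection details, not signageOS's actual configuration.

```typescript
import { MongoClient } from "mongodb"; // assumed driver

// Placeholder connection string; the write concern and read preference are the
// relevant part: acknowledge writes from the primary alone (w: 1) and keep
// reads working even when secondaries are down or lagging.
const client = new MongoClient(
  "mongodb://mongo-telemetry.example.internal/?replicaSet=telemetry",
  {
    writeConcern: { w: 1 },             // don't block writes on lagging secondaries
    readPreference: "primaryPreferred", // don't depend on secondaries for reads
  }
);

export async function recordPing(deviceUid: string): Promise<void> {
  await client.connect(); // no-op if already connected
  await client
    .db("telemetry")
    .collection("pings")
    .updateOne({ deviceUid }, { $set: { lastSeenAt: new Date() } }, { upsert: true });
}
```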

resolved

Temporary degradation in telemetry processing.

Report: "System degradation"

Last update
postmortem

**Date**
2024-01-29

**Authors**
Michael Zabka, CTO
Michal Artazov, DevOps Lead
Vaclav Boch, DevOps

### Display connectivity issues and delayed queue processing

**Impact:**
Caused display connection issues and slow system responses to received commands.

**Trigger:**
Overloaded platform services.

**Detection:**
Internal monitoring and customer tickets.

**Root Causes:**
Our services directly communicating with displays got overloaded. This increased response times, causing displays to reconnect. After several failed attempts, the displays fell back to our backup system for communication. This produced a large number of system messages and overloaded our configuration servers, further increasing the general system latency.

**Remediation:**
* Improved platform autoscaling to have more space for additional traffic.
* Changed underlying EC2 instances to more powerful ones to cover the CPU spikes.
* Improved monitoring to be more proactive in discovering similar incidents.

resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "System queues for Pings and Provisioning is degraded"

Last update
postmortem

**Date**
2024-01-17

**Authors**
Michael Zabka, CTO
Michal Artazov, DevOps Lead

**Summary: Delayed System Queues and Offline Displays Due to CPU Credit Depletion**

**Impact:**
The incident resulted in delayed system queues for pings and provisioning, causing some displays to go offline. Once offline, these displays were unable to reconnect, leading to a disruption in service availability.

**Trigger:**
A portion of platform instances ran out of CPU credits, initiating CPU throttling and high memory usage. Sluggish responses increased reconnection attempts, amplifying traffic and ultimately contributing to the outage.

**Detection:**
The incident was detected through internal CPU throttling monitoring and customer tickets reporting display outages.

**Root Causes:**
The primary root cause was the exhaustion of CPU allowance on certain platform instances, leading to throttling and increased memory usage, causing system delays.

**Remediation:**
1. **Improved Monitoring Limits:** Enhance monitoring limits to detect similar issues sooner, enabling proactive intervention and preventing prolonged system delays (see the sketch below).
2. **Change of Underlying Instances:** Switch to larger instances with improved CPU credit allowances to address the immediate resource constraints and minimize the risk of future incidents.
3. **Implement Better Autoscaling Processes:** Enhance autoscaling processes to dynamically adjust resources based on demand, resolving the impact without manual intervention. This ensures a more adaptive and responsive system.
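Remediation item 1 is earlier detection of CPU credit depletion. On AWS, the relevant signal for burstable instances is the CloudWatch `CPUCreditBalance` metric. The sketch below shows how such a check could be polled with the AWS SDK v3; it is a generic illustration, not signageOS's monitoring stack, and the region, instance ID handling, and threshold are assumptions.

```typescript
import { CloudWatchClient, GetMetricStatisticsCommand } from "@aws-sdk/client-cloudwatch";

// Hypothetical check: warn when a burstable EC2 instance's CPU credit balance
// drops below a threshold, well before throttling starts.
const client = new CloudWatchClient({ region: "eu-central-1" }); // region is an assumption
const CREDIT_THRESHOLD = 50;

export async function checkCpuCredits(instanceId: string): Promise<void> {
  const now = new Date();
  const result = await client.send(
    new GetMetricStatisticsCommand({
      Namespace: "AWS/EC2",
      MetricName: "CPUCreditBalance",
      Dimensions: [{ Name: "InstanceId", Value: instanceId }],
      StartTime: new Date(now.getTime() - 15 * 60 * 1000), // last 15 minutes
      EndTime: now,
      Period: 300,
      Statistics: ["Average"],
    })
  );

  // Pick the most recent datapoint and compare it against the threshold.
  const latest = (result.Datapoints ?? [])
    .sort((a, b) => (a.Timestamp?.getTime() ?? 0) - (b.Timestamp?.getTime() ?? 0))
    .pop();
  if (latest?.Average !== undefined && latest.Average < CREDIT_THRESHOLD) {
    console.warn(`Instance ${instanceId} CPU credit balance is low: ${latest.Average}`);
  }
}
```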

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

Report: "Devices presented as offline in Box"

Last update
postmortem

**Incident Summary:**
During a routine deployment of platform-v2, we encountered unexpected issues with showing the current device status.

**Impact:**
The deployment resulted in a significant surge in data and connections. This was due to all devices reconnecting simultaneously, averaging 4000 socket connections per minute. The reconnection phase was completed within 30 minutes, followed by a 40-minute period to process the backlog of data.

**Partial Service Outage:**
The incident led to partial service degradation, affecting the timely delivery of screenshots, certain device information, and telemetry data.

**Remediation Steps:**
To prevent a recurrence, we are implementing an improved deployment strategy. This will involve a more controlled and gradual rollout of changes to avoid sudden spikes in device connections. The updated process will ensure a smoother and more stable update experience.

**Timeline of Events:**
The incident began: 13:18 GMT
Incident resolved: 14:12 GMT
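The surge was caused by all devices reconnecting at once after the rollout. Alongside the more gradual deployment described in the remediation, a standard way to flatten such reconnection spikes is exponential backoff with full jitter on the client's reconnect logic. The sketch below is a generic illustration of that technique, not the signageOS device agent; the delay bounds are assumptions.

```typescript
// Generic reconnect sketch: exponential backoff with full jitter so that a
// fleet of devices does not reconnect in one synchronized wave.

const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 5 * 60_000;

function reconnectDelay(attempt: number): number {
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  return Math.random() * ceiling; // full jitter spreads reconnects over time
}

export async function connectWithBackoff(connect: () => Promise<void>): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await connect();
      return;
    } catch {
      // Wait a randomized, exponentially growing delay before retrying.
      await new Promise((resolve) => setTimeout(resolve, reconnectDelay(attempt)));
    }
  }
}
```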

resolved

Incident Summary: During a routine deployment of platform-v2, we encountered unexpected issues with showing the current device status. Impact: The deployment resulted in a significant surge in data and connections. This was due to all devices reconnecting simultaneously, averaging 4000 socket connections per minute. The reconnection phase was completed within 30 minutes, followed by a 40-minute period to process the backlog of data. Partial Service Outage: The incident led to partial service degradation, affecting the timely delivery of screenshots, certain device information, and telemetry data. Remediation Steps: To prevent a recurrence, we are implementing an improved deployment strategy. This will involve a more controlled and gradual rollout of changes to avoid sudden spikes in device connections. The updated process will ensure a smoother and more stable update experience. Timeline of Events: The incident began: 13:18 GMT Incident resolved: 14:12 GMT

Report: "Temporary delay in devices processing due to system patch"

Last update
postmortem

**Date**
2023-08-14/15

**Authors**
Michael Zabka, CTO
Michal Artazov, DevOps Lead
Vaclav Boch, DevOps

**Summary**
Throughout Tuesday, 15th August 2023, signageOS experienced an incident causing delays in processing screenshots, device connections and some device telemetries. The issue was caused by a suboptimal check for the connection type at the time devices connect to signageOS for the first time, followed by a subsequent feedback loop that caused traffic to grow exponentially. This report aims to provide a detailed analysis of the impact, trigger, detection, root causes, and the steps taken for remediation.

**Impact**
The issue had a negative impact on screenshot processing, on establishing connections from newly provisioned devices, and on reporting some telemetries in a timely manner. No content playback was impacted. All devices ran as expected.

**Trigger**
A large number of new devices were provisioned at the same time.

**Detection**
The issue was detected by our internal alerting system, which continuously monitors various metrics collected from our systems and triggers an alert when a metric crosses a set threshold. Alerts are sent to PagerDuty, which notified the people on call at that time. The alert was specifically triggered by a message queue size growing over the set threshold. This queue contains pending requests to process new screenshots, connections and telemetries. Through proactive monitoring, we were able to identify the issue and initiate the investigation promptly.

**Root Causes**
The root cause was a crash of one of our MongoDB databases. Due to the sudden increase in the number of devices, it ran out of resources and crashed. In fact, two out of three replicas in the replica set crashed. This affected all services that use this database, causing an outage in the processing of screenshots and some telemetries. However, the services that handle incoming traffic from devices weren't affected, causing a large influx of pending data that couldn't be processed. This, in turn, affected other services, causing a chain reaction.

**Remediation**
The team has implemented several key optimizations that prevent this same problem from occurring in the future.

First, the path to process a new screenshot was shortened. Before, a new screenshot would first be sent to an intermediate service for validation and then to the final service that would write it to the database. The team has confirmed that the intermediate validation process was the bottleneck and that it can be safely removed because it no longer serves the purpose it was originally intended for. This optimization has increased our speed of processing new screenshots 2x.

Secondly, the team has made processing of the affected telemetries faster by applying a less conservative write strategy. By default, a more conservative write strategy is used for all incoming requests; it uses database transactions and waits for data to be replicated across the database replica set. This strategy wasn't necessary for the affected data. In the case of device telemetry, performance is the priority and, in the worst case, temporary data loss is acceptable. By applying a more relaxed write strategy to this data, the processing speed of this kind of data has increased 10x.

Thirdly, the team has addressed one of the main issues that contributed to the situation. When a device goes online, it makes a request to our server and the server responds with the configuration for that particular device. The device then uses this information to configure itself into the desired state. When the server receives a request from a brand new device, it checks whether the request came over HTTP or HTTPS and compares that to the lastUsedProtocol field in the database. If they differ, it creates a new write request to store the new protocol. The issue was that, in the beginning, there is no value in the database, so every device that connects for the first time generates a new write at least once. Due to several previously discussed factors, the processing of write requests became too slow. Devices make continuous requests to fetch configuration every minute. When the server received the next request from the same device, the previous request to write the protocol wasn't yet processed, so it compared the new protocol to the old empty value and generated another write request. This process kept repeating for that device, creating a feedback loop. To make matters worse, there were at least 5000 devices stuck in this loop at the time, generating new data at an exponential rate. The team has patched the code so that it treats an empty value of lastUsedProtocol as HTTPS. Since HTTPS is the default for all new devices, this prevented any more new data from being generated.

Last, but not least, the team has scaled up the affected MongoDB machines so that they have a larger buffer of free resources. The team has also improved the monitoring so that they're notified in advance next time the database is running out of resources.

In modern systems, most serious issues are complex. There is always some trigger that throws other components out of balance, sometimes causing a chain reaction throughout the system. When that happens, it exposes weaknesses in the system's architecture or in some of its components. It's impossible to fully prevent this, but it is possible to improve the ability to contain the issues and to fix the discovered weaknesses. Our team takes these situations very seriously. For us, it's always an opportunity to make the whole system even more stable. Next time, it can handle even more devices and more user activity, and if there is another issue, its impact will be more localised and affect a smaller number of features and services.
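The patch described above treats an empty lastUsedProtocol value as HTTPS so that a device's first request no longer looks like a protocol change. A minimal sketch of that guard follows; the types and function names are illustrative placeholders, not the actual signageOS service code.

```typescript
// Illustrative sketch of the "treat empty lastUsedProtocol as HTTPS" guard.
// Device lookup and persistence are placeholders.

type Protocol = "http" | "https";

interface DeviceRecord {
  uid: string;
  lastUsedProtocol?: Protocol; // empty for brand-new devices
}

async function persistProtocol(deviceUid: string, protocol: Protocol): Promise<void> {
  // ... enqueue the write to the database here ...
}

export async function handleConfigRequest(device: DeviceRecord, requestProtocol: Protocol): Promise<void> {
  // Default the missing value to HTTPS (the default for all new devices),
  // so a first-time request no longer looks like a protocol change.
  const known = device.lastUsedProtocol ?? "https";

  // Only write when the protocol actually changed; this breaks the feedback
  // loop where a slow write queue made every poll generate another write.
  if (requestProtocol !== known) {
    await persistProtocol(device.uid, requestProtocol);
  }
}
```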

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

The issue is being fixed. There is no effect on existing devices; all devices are playing content and performing as usual. New device provisioning is delayed in some regions, and status reporting in Box shows devices in a "yellow" pending state.

investigating

We are currently investigating this issue.

Report: "The screenshots queue is slowed down"

Last update
postmortem

This report is part of the same incident as "Temporary delay in devices processing due to system patch" (2023-08-14/15); see the full postmortem under that report above.

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "Investigating possible connectivity issues for some regions"

Last update
postmortem

This report is part of the same incident as "Temporary delay in devices processing due to system patch" (2023-08-14/15); see the full postmortem under that report above.

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Domain api.docs.signageos.io is not available (upstream problems on Postman.com)"

Last update
resolved

This incident has been resolved.

identified

See https://status.postman.com for more details.

Report: "Partial unavailability of telemetry data in Box"

Last update
postmortem

**Date**
2023-03-07

**Authors**
Lukas Danek, CPO
Michael Zabka, CTO
Michal Artazov, DevOps Lead

**Summary:**
On the 7th of March 2023, a partial unavailability of telemetry data occurred in the Box service at signageOS. This issue was caused by a partial outage of the third-party service InfluxDB, which impacted the data retrieval and storage process. The issue was detected by our internal monitoring tool and promptly addressed. This report provides a detailed analysis of the impact, trigger, detection, root causes, and the steps taken for remediation.

**Impact:**
The partial unavailability of telemetry data in Box had minimal impact on the overall functionality of our system. While the issue affected the retrieval and storage of telemetry data, it did not impact any devices connected to signageOS, and no data was lost. However, the absence of real-time telemetry data limited the ability to analyze and monitor system performance accurately, which may have affected troubleshooting and diagnostics during the incident.

**Trigger:**
The trigger for the issue was a partial outage of the third-party service, InfluxDB, which disrupted the normal flow of telemetry data processing. The service interruption hindered the seamless retrieval and storage of telemetry data, resulting in partial unavailability.

**Detection:**
Our internal monitoring tool detected the partial unavailability of telemetry data by continuously monitoring the data flow and storage in the Box service. It raised alerts when it identified a deviation from the expected behavior, indicating a disruption in the telemetry data pipeline. The tool provided real-time visibility into the issue, enabling us to respond promptly and investigate the root cause.

**Root Causes:**
After a thorough investigation, the following root cause was identified:

* Partial outage of InfluxDB: the third-party service, InfluxDB, experienced a partial outage, causing disruptions in data retrieval and storage processes. This outage impacted the seamless flow of telemetry data into the system, resulting in partial unavailability.

**Remediation:**
To address the issue and prevent its recurrence, the following steps were taken:

* Restoring the InfluxDB service: as the root cause was traced to the partial outage of InfluxDB, we worked closely with the service provider to resolve the underlying issues and restore full functionality. The InfluxDB service was reinstated, ensuring the seamless retrieval and storage of telemetry data.
* Additional caching mechanism: to mitigate the impact of future service disruptions, we implemented an additional caching mechanism within our system. This caching layer maintains temporary storage of telemetry data, allowing for limited availability even during service interruptions.
* Stability checks and communication with InfluxDB support: we engaged with the InfluxDB support team to ensure stability and reliability moving forward. We requested stability checks and collaborated on potential preventive measures to mitigate the risk of similar issues in the future.

By implementing these remediation steps, we have improved the resilience of our system, enabling continued availability of telemetry data and minimizing the impact of third-party service disruptions.

We apologize for any inconvenience caused by this issue and appreciate your patience and understanding as we worked to resolve it promptly. Our team remains committed to ensuring the highest level of service reliability and continuous improvement.

If you have any further questions or concerns, please feel free to reach out to our support team.

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

The internal dependency on the service InfluxDB has an incident: https://status.influxdata.com/

Report: "Box responses are randomly slower"

Last update
postmortem

**Date**
2023-03-31

**Authors**
Lukas Danek, CPO
Michael Zabka, CTO
Michal Artazov, DevOps Lead

**Summary:**
On the 31st of March 2023, our production cluster experienced an issue where Box responses were randomly slower than expected. The issue was caused by inefficient caching and was detected by our internal monitoring tool. This report aims to provide a detailed analysis of the impact, trigger, detection, root causes, and the steps taken for remediation.

**Impact:**
The slowdown in Box responses had a negative impact on the user experience. Users experienced delays and slower response times when interacting with Box, leading to a degraded user experience. This issue affected the productivity and satisfaction of our customers, potentially resulting in reduced engagement with our platform and a negative perception of our service quality. No content playback was impacted; all devices ran as expected.

**Trigger:**
The trigger for the issue was the presence of inefficient caching within our system. The caching mechanism in place was not optimized to handle the increased load and varied response times, leading to inconsistent performance. The inefficient caching exacerbated the response time issues, resulting in slower Box responses.

**Detection:**
The issue was detected by our internal monitoring tool, which continuously collects and analyzes performance metrics from our production cluster. The tool alerted the team when it observed an increase in response times for Box requests, exceeding the acceptable thresholds. Through proactive monitoring, we were able to identify the issue and initiate the investigation promptly.

**Root Causes:**
After a thorough investigation, the following root causes were identified:

* Inefficient caching mechanism: the caching mechanism in use was not designed to handle the current workload and request patterns. The cache was not adequately tuned, leading to frequent cache misses and subsequent delays in Box responses in the production environment.
* Insufficient automated testing on pre-production: the caching mechanism was not thoroughly tested under realistic production scenarios, and the system lacked optimization measures to address performance bottlenecks. This oversight resulted in the underperformance of the caching system during peak hours.

**Remediation:**
To address the issue and prevent its recurrence, the following steps were taken:

* Cache optimization: the caching mechanism was reevaluated and optimized to better handle the workload and improve response times. The cache sizing was adjusted, and caching algorithms were refined to minimize cache misses and improve overall performance.
* Implementation of a cache invalidation strategy: a comprehensive cache invalidation strategy was devised and implemented. This strategy ensures that outdated data is promptly removed from the cache, reducing the chances of serving stale responses (see the sketch below).
* Performance testing and tuning: rigorous performance testing was conducted to simulate realistic production scenarios and identify performance bottlenecks. Based on the findings, optimizations were applied to various components of the system, including the caching mechanism, to enhance overall performance.
* Enhanced monitoring and alerting: our monitoring tool was enhanced to provide more granular visibility into cache performance metrics. This allows us to proactively detect and address any potential issues related to caching in real time.
* Continuous improvement and review: regular reviews of the caching mechanism and its performance are now part of our operational practices. We prioritize ongoing optimization efforts and regularly evaluate the caching system to ensure it aligns with the evolving needs of our system and user requirements.

By implementing these remediation steps, we have significantly improved the performance and reliability of Box responses, ensuring a better user experience for our customers. We apologize for any inconvenience caused by this issue and appreciate your patience and understanding as we worked to resolve it promptly. Our team remains committed to continuously improving our system to provide the highest level of service to our users.

If you have any further questions or concerns, please feel free to reach out to our support team.
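The cache invalidation item in the remediation list usually takes the shape of a short TTL on cached reads plus explicit deletion when the underlying record changes. The sketch below is a generic illustration, not the Box implementation; the ioredis client, host, key layout, and TTL are assumptions.

```typescript
import Redis from "ioredis"; // assumed client library

const redis = new Redis({ host: "redis-box-cache.example.internal" }); // placeholder host
const TTL_SECONDS = 30; // short TTL caps how long a stale entry can live

// Read path: serve from the cache when possible, otherwise load and cache.
export async function getCached(key: string, load: () => Promise<string>): Promise<string> {
  const hit = await redis.get(key);
  if (hit !== null) return hit;
  const value = await load();
  await redis.set(key, value, "EX", TTL_SECONDS);
  return value;
}

// Write path: delete the cached entry so updates are visible immediately,
// instead of waiting for the TTL to expire.
export async function invalidate(key: string): Promise<void> {
  await redis.del(key);
}
```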

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are investigating random peaks and slower responses in the Box UI. The issue is NOT affecting connected devices; it only affects the user experience.

Report: "Incident - REST API (SDK) applet upload and login endpoints failing with 500 response code"

Last update
postmortem

**Date**
2022-06-29

**Authors**
Michael Zabka, CTO
Michal Artazov, DevOps Lead

**Summary**
Between 20:30 and 22:00 UTC, some REST API endpoints used by signageOS Trial and Free-tier customers experienced increased traffic. This was followed by increased response times, with some requests taking 60 seconds or more. The automated REST API monitoring system notified the DevOps team via PagerDuty and the team started analyzing the problem. Shortly after, the team discovered the cause of the problem. Since the last deployment maintenance window, the REST API's MongoDB connection for Trial and Free-tier users had been temporarily configured to use a migration database instance. That was a human error, since the instance was not meant for long-term production traffic. The DevOps team switched the database connection of the REST API back to the original production instance of MongoDB. After all REST API instances were redeployed, the issue was eliminated.

**Impact**
A small set of REST API endpoints: Applet management and Account session (login). Requests were delayed and, in some cases, did not complete within 60 seconds, which resulted in timeouts and 50x error status codes.

**Trigger**
Unexpectedly high traffic on newly deployed features related to Applet management, combined with the MongoDB connection temporarily pointing at a cluster instance meant for migration purposes.

**Detection**
Detected by the automated monitoring system at 21:01 UTC. Confirmed by tickets from customers on Trial at 21:30 UTC.

**Root Causes**
An incorrect MongoDB database configuration caused delayed responses due to insufficient CPU resources.

**Remediation**
New alerts verifying the correct configuration of the REST API service will be added to the internal monitoring.

resolved

Some REST API endpoints used by signageOS Trial and Free-tier customers experienced increased traffic between 20:30 and 22:00 UTC, which pushed response times on the affected endpoints to more than 60 seconds. The automated REST API monitoring system notified the DevOps team via PagerDuty and the team started analyzing the problem. Shortly after, the cause of the problem was discovered. Since the last deployment maintenance window, the REST API's MongoDB connection for Trial and Free-tier users had been temporarily configured to use a migration database instance, which was not meant for long-term production traffic. The DevOps team switched the database connection of the REST API back to the original production instance of MongoDB. After all REST API instances were redeployed, the issue was eliminated.

Report: "API responding 500 to some requests"

Last update
postmortem

**Authors**
Michal Artazov, Backend Team Leader

**Summary**
API requests failed due to a human error in the configuration of the API's request-limiting subsystem.

**Impact**
Some API requests started unexpectedly failing with response code 500.

**Trigger**
Human error.

**Detection**
We combine the [Pingdom](https://www.pingdom.com/) service and the [Postman Monitoring feature](https://www.postman.com/api-monitor/) to monitor the health of API endpoints. We were alerted about large amounts of 500 response codes as soon as they started happening.

**Root Causes**
We recently added new functionality to the API to limit the number of requests each organization can make on certain endpoints. The goal of this functionality is to prevent misuse of the API and excessive amounts of requests that can degrade the API's performance for other clients. At first we only applied it to two endpoints and it worked well. Today we decided to apply it to two more endpoints: [Upload Applet Version Files](https://api.docs.signageos.io/#64af3257-32ce-49b1-b4b4-e0233d634dbc) and [Upload and Update Applet Version Files](https://api.docs.signageos.io/#476dd6ca-c6dc-4435-bc7a-c92d53f7714a).

To configure a limit for an endpoint, we match the endpoint by method and a path-matching string in the format used by the express.js library. We use the npm library [path-to-regexp](https://www.npmjs.com/package/path-to-regexp) to match the request's path against the configurations.

A human error resulted in an invalid path-matching string being configured. Any time it was matched against a request's path, it threw an exception, causing the request to fail with a 500 response code.

**Remediation**
We fixed the configuration to use a correctly formatted path-matching string. We also started working on a fix that validates the configurations before using them, so that invalid configurations are discarded at runtime.
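The remediation is to validate rate-limit configurations before they are used, so an invalid path pattern cannot take requests down at match time. A minimal sketch of such a startup check using the same path-to-regexp library follows; the configuration shape is hypothetical, and the exact path-to-regexp API varies between major versions.

```typescript
import { match } from "path-to-regexp"; // the same library the limiter uses to match paths

interface RateLimitRule {
  method: string;
  path: string;  // express-style pattern, e.g. "/applet/:appletUid/version" (illustrative)
  limit: number;
}

// Compile every configured pattern once at startup. An invalid pattern throws
// here, where it can be logged and discarded, instead of throwing on every
// matching request and turning that traffic into 500 responses.
export function loadValidRules(rules: RateLimitRule[]): RateLimitRule[] {
  const valid: RateLimitRule[] = [];
  for (const rule of rules) {
    try {
      match(rule.path); // throws on a malformed pattern
      valid.push(rule);
    } catch (err) {
      console.error(`Discarding invalid rate-limit pattern "${rule.path}":`, err);
    }
  }
  return valid;
}
```

In this shape, a malformed pattern costs one error log at deployment time rather than a 500 response for every request that hits the affected endpoints.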

resolved

API requests were failing due to a human error in the configuration of the API's request-limiting subsystem. This issue has been resolved; the API is no longer returning 500 response codes.

Report: "Device status reporting devices as Offline/Pending"

Last update
resolved

Due to queue traffic generated at the time of the maintenance, signageOS experienced a traffic peak which led to the temporary disconnection of some devices. This issue has been resolved; all devices are reporting the correct connectivity status again.

monitoring

Due to queue traffic generated at the time of the maintenance, signageOS experienced a traffic peak which led to the temporary disconnection of some devices. The patch has been deployed; signageOS is monitoring the system and all devices are reconnected and shown as connected in Box.

investigating

We are continuing to investigate the issue.

investigating

We are continuing to investigate the issue.

investigating

We are continuing to investigate the issue.

investigating

As part of our scheduled maintenance we encountered an issue with device status reporting. We are currently investigating the root cause of the issue.

Report: "Slower loading speed for Box"

Last update
resolved

The DevOps and Box teams identified the cause of the intermittent Box slowness and deployed a fix to production. The issue was caused by inefficient data processing in Box.

investigating

Users may experience slower loading speeds for Box. The system is still operational and no device playback is affected. We are investigating the root cause of this slowdown.

Report: "Platform is not responsive, some devices might be disconnected"

Last update
postmortem

Kindly read the overall Postmortem here: [https://status.signageos.io/incidents/61k1kghcpwgk](https://status.signageos.io/incidents/61k1kghcpwgk)

resolved

The issue has been identified and resolved. We are continuing to monitor the situation.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Platform rejects connection of devices irregularly."

Last update
postmortem

Kindly read the overall Postmortem here: [https://status.signageos.io/incidents/61k1kghcpwgk](https://status.signageos.io/incidents/61k1kghcpwgk)

resolved

The issue has been resolved; we are continuing to monitor the situation.

monitoring

A fix has been implemented and we are monitoring the results.

Report: "Devices are reconnecting in large batches causing delayed events processing"

Last update
postmortem

**Date** 2022-01-26 **Authors** Lukas Danek, COO / Head of Product; Michael Zabka, CTO; Michal Artazov, Lead Developer **Summary** On the 26th of January at 12:00 pm, some signageOS services, namely Platform and Box, experienced degraded performance. **Impact** A large number of devices experienced connectivity issues \(offline/pending state\) with the signageOS Platform. Box connection and load speeds were degraded. No content playback was affected or interrupted. **Trigger** The partial connectivity outage was triggered by a scheduled deployment of the new signageOS Documentation with underlying nginx-ingress issues. **Detection** Internal monitoring tools reported a large number of devices as offline concurrently. Slower loading speeds for Box. **Root Causes** Normally, our production-grade nginx-ingress has rolling-upgrade functionality; however, this specific part of the system did not and required a manual rollout, resulting in all nginx instances being deleted simultaneously. This caused a spike in connection traffic. Normally, signageOS has a back-up mechanism that should catch such behaviour and recover automatically, but an underlying and unrelated bug in the Kubernetes cluster caused an unexpected networking issue. As a result, approximately 80% of our Node instances were cut off from connectivity. The solution was to remove the old Node instances and create new, healthy ones. As mentioned, this process should be automatic thanks to the autoscaling group and the health checks of individual instances. However, health checks that targeted an incorrect endpoint, combined with checking strictly CPU load, meant this issue was not detected automatically. **Remediation** Based on the knowledge gained, signageOS introduced new preventive measures and further fine-tuned the necessary health checks. The health checks now confirm that nginx is running on the Node instances \(AWS ELB\) rather than relying exclusively on their current CPU load \(AWS EC2\). Additionally, we have extended the rolling-upgrade functionality to this part of the nginx-ingress system, removing the need for manual action.
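As a generic illustration of the remediation's direction (actively probing that nginx serves traffic instead of inferring health from CPU load), the hypothetical probe below hits a local endpoint and exits non-zero when nginx does not answer. The port and the `/healthz` path are assumptions, not the actual signageOS check.

```typescript
// Hypothetical probe - the port and /healthz path are assumptions.
import http from "node:http";

function probeNginx(port = 80, path = "/healthz"): Promise<void> {
  return new Promise((resolve, reject) => {
    const req = http.get({ host: "127.0.0.1", port, path, timeout: 2000 }, (res) => {
      res.resume(); // drain the response body
      if (res.statusCode !== undefined && res.statusCode < 500) {
        resolve();
      } else {
        reject(new Error(`nginx answered with status ${res.statusCode}`));
      }
    });
    req.on("timeout", () => req.destroy(new Error("probe timed out")));
    req.on("error", reject);
  });
}

probeNginx().then(
  () => process.exit(0), // healthy: keep the node in service
  () => process.exit(1), // unhealthy: the autoscaling group can replace the node
);
```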

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Delayed processing of Pings and Screenshots"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Box might show devices as offline even though they were not"

Last update
postmortem

**Date** 2021-09-01 **Authors** Lukas Danek, COO / Head of Product; Michael Zabka, CTO **Summary** On the 1st of September at 17:00 UTC, some Box instances started to show devices as offline despite the devices being properly connected. **Impact** For some users, devices appeared offline in Box, although the devices could still process commands and respond to API requests. **Trigger** Deployment of a RabbitMQ and Platform optimization. **Detection** Reported by users. **Root Causes** We updated our RabbitMQ cluster, following up on previously reported incidents, during the maintenance window just before this incident. The main goal of the update was to reduce peaks in device-connection traffic, which could, on rare occasions, overload the RabbitMQ cluster and end in RabbitMQ network partitioning \(the cause of the previous incidents\). The fix throttles the number of simultaneous device connections and distributes the connections over a longer period. While the fix was successfully deployed and worked well, we configured the throttle too conservatively. This led to slower event processing and, consequently, to devices showing an offline connection status in Box, the REST API, and other services. **Remediation** As a temporary measure, we prepared scripts that allow us to manually increase the rate limit for throttling device connections on demand. Additionally, we configured a higher, less conservative default rate limit so that the device-connection slowdown is practically unnoticeable.
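Conceptually, the throttle and the on-demand rate increase look like the sketch below: a generic token-bucket illustration whose class and method names are assumptions, not the actual signageOS implementation.

```typescript
// Generic token-bucket illustration; names are assumptions.
export class ConnectionThrottle {
  private tokens: number;

  constructor(private ratePerSecond: number, private burst: number) {
    this.tokens = burst;
    // Refill tokens once per second up to the burst ceiling.
    setInterval(() => {
      this.tokens = Math.min(this.burst, this.tokens + this.ratePerSecond);
    }, 1000).unref();
  }

  // "Increase the rate limit on demand" without a redeploy.
  setRate(ratePerSecond: number): void {
    this.ratePerSecond = ratePerSecond;
  }

  // Accept a device connection now, or ask the device to retry later.
  tryAccept(): boolean {
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```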

resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Queue system is degraded, slow messages processing"

Last update
postmortem

**Date** 2021-08-24 **Authors** Lukas Danek, COO / Head of Product; Michael Zabka, CTO; Michal Artazov, Lead Developer **Summary** On the 24th of August at 12:00 pm UTC, the RabbitMQ cluster became unstable. Many of the replicated nodes reported network partitioning and messages in the queues were not being consumed by the rest of the system. The signageOS DevOps team had to reboot and scale up the cluster manually to avoid losing any messages. During the incident, devices were not connected to the signageOS Cloud. Devices kept playing and the content on the screens was not affected in any way. **Impact** Devices were shown as offline/pending in Box and REST API requests were queued until the incident was resolved. No content playback was affected or interrupted. **Trigger** A connection loop of 3000 devices. **Detection** Detected by an alert in AWS CloudWatch. **Root Causes** The RabbitMQ cluster responsible for communication between devices and the system \(REST API & Box\) became network-partitioned, either due to network instability or due to CPU overload during a peak in device connections. This desynchronized some queues, and such queues cannot be resynchronized automatically over the long term, so some messages could remain undelivered for an extended time and device-management actions could feel unresponsive. To resolve the unsynchronized, network-partitioned state, we had to pause all device connections to the server for a few minutes, restart all nodes of the RabbitMQ cluster, and then unpause the connections. Unfortunately, the restart caused another peak in device connections, so the network partitioning reappeared after a while. We therefore restarted the cluster a second time, this time with upscaled cluster nodes, to prevent the partitioning from recurring. **Remediation** We started working on a new architecture for the device-connection queues in RabbitMQ to handle peaks in device connections better while keeping the cost of these services as low as before. In the short term, we decided to overscale the RabbitMQ cluster to handle the incoming device-connection peaks until the new architecture is ready.

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Box login issues"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The fix is being deployed.

identified

Users might experience login issues caused by timeouts in Box.

Report: "Investigating events delivery for outbound commands"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Some Power Actions and management actions might be affected for some customers.

Report: "Connectivity issues between Platform and main cluster"

Last update
postmortem

**Date** 2021-06-29 **Authors** Lukas Danek, COO / Head of Product; Michael Zabka, CTO; Michal Artazov, Lead Developer **Summary** On the 29th of June, between 11:00 and 13:00 UTC, we encountered an SSL certificate issue which caused some Samsung SSSP 4 displays to stop operating properly. At 11:00 UTC, our primary server reached the CPU-overload threshold and consequently triggered the CPU-overload mitigation process. To avoid any service disruption when there is a risk of CPU overload, some devices are disconnected and reconnected to different load-balancing servers. When the CPU load drops back to a defined level, the primary server performs self-diagnostics and resets itself. After the server's self-correction, all devices are connected back to the original server. The whole process is semi-automated, takes minutes, and results in zero downtime. This is a standard semi-automated operational procedure, fully described in our internal policies. Unfortunately, we faced errors today due to a mistake made on our end: some of the server addresses were incorrect. This was caused by changes we had made recently while performing capacity-upgrade operations. The new addresses had not been updated, and as a result some devices were directed to a wrong \(non-existent\) address. Devices on particular operating systems, namely Tizen 2.4, could not handle the resulting error \(non-existent SSL certificate\) and stopped working. A remote restart or a manual reset was necessary to bring them back to regular operation. We identified the issue at 13:00 UTC and immediately informed all affected customers via email and the status page. We included all the steps to remedy the situation in our email and suggested scenarios for fixing the issue and bringing displays back online. While we do not yet understand why the Tizen 2.4 system responded in a way that took the entire device offline \(still under investigation\), we took steps to prevent this from happening again; mainly, we updated our internal guidance so this issue will not recur. As of 17:00 UTC, most devices are back online and the issue has been resolved. **Impact** Affected Tizen 2.4 devices showed a pop-up window and had to be rebooted manually. **Detection** The DevOps team detected the issue while monitoring devices with internal system tooling during the scale-up. **Root Causes** Misconfiguration of auto-scaling server addresses. **Remediation** * A new DevOps rule set introduced to validate server addressing * A system-wide check for any similar misconfiguration related to scaling * Investigation of the unexpected behavior of Tizen 2.4

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are investigating connectivity issues between the platform and the main cluster.

Report: "Random Power Actions API failure & Pings delay"

Last update
postmortem

**Date** 2021-07-12 **Authors** Lukas Danek, COO / Head of Product; Michal Artazov, Lead Developer **Summary** We experienced degradation in communication between our RabbitMQ "Devices" cluster and its clients. Due to a race condition, connections were being dropped rapidly. At the same time, the clients kept trying to reconnect, creating a feedback loop. Eventually the problem went away on its own. We now have to determine the best way to prevent this in the future. **Impact** Device pings were delayed and the Power Actions API was not working for some devices for 10 minutes. **Trigger** A large number of devices with unstable connections kept reconnecting to signageOS repeatedly. **Detection** Notification from AWS CloudWatch. **Root Causes** A race condition on the connection endpoint used by devices. **Remediation** Better handling of the connection-loop state management as part of the currently ongoing cluster optimization.
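A common way to break this kind of reconnect feedback loop, shown here purely as a generic illustration and not as the actual signageOS fix, is exponential backoff with jitter on the client side so dropped devices do not all retry at the same instant.

```typescript
// Generic illustration of reconnect backoff with "full jitter".
function reconnectDelayMs(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

export async function connectWithBackoff(connect: () => Promise<void>): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await connect();
      return; // connected
    } catch {
      const delay = reconnectDelayMs(attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```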

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Pings are currently delayed, so devices in Box might look offline even though they are not; all device operations are working.

investigating

We are currently investigating this issue.

Report: "Performance degrdation in device management operations"

Last update
resolved

This incident has been resolved.

investigating

We are currently investigating this issue.

Report: "Box is behind current data with status updates and screenshots"

Last update
resolved

This incident has been resolved.

investigating

We are currently investigating this issue.

Report: "Issue with Box logins and Box-triggered power actions"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "Login to Box and Deprovisioning issues"

Last update
postmortem

Related to the previous incident, result of RabbitMQ migration: [https://status.signageos.io/incidents/7kpv5g4xc8vb](https://status.signageos.io/incidents/7kpv5g4xc8vb)

resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

Provisioning is back online.

identified

Currently, device provisioning is affected as a result of data inconsistencies between master and replica databases.

identified

The issue has been identified and a fix is being implemented.

Report: "Monitoring data - pings - are delayed and showing devices as Pending"

Last update
postmortem

Related to the previous incident, result of RabbitMQ migration: [https://status.signageos.io/incidents/7kpv5g4xc8vb](https://status.signageos.io/incidents/7kpv5g4xc8vb)

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

Report: "Events processing stopped, affecting some Box and API operations"

Last update
postmortem

**Date** 2021-06-07 **Authors** Lukas Danek, COO / Head of Product; Michael Zabka, CTO; Michal Artazov, Lead Developer **Summary** At 19:00 CET there was a rapid increase in the number of unprocessed messages in the Devices RabbitMQ cluster. Other system features, unrelated to device communication, were affected too. Further investigation uncovered degradation in the Main RabbitMQ cluster: all publishers using channels with confirms enabled failed to publish any messages. Since our main CQRS/Event-sourcing logic depends on the Main RabbitMQ cluster, most writes were affected, including messages from devices, login, device provisioning and others. Throughout the incident, the team temporarily configured the affected services not to use channel confirms. Once basic operation was restored, the team deployed a new RabbitMQ cluster and migrated the traffic to it. Based on technical analysis, the team also decided that the previous use of confirms was too conservative and disabled it permanently for certain use cases to reduce the impact of a similar failure in the future. The REST API as well as Box were partially degraded, preventing users from seeing devices as online \(even though they were connected\) and from deploying Applets. **Impact** The REST API returned 504/404 for some requests, and for all Applet-related requests. Box showed devices as pending or offline. No content playback was affected; devices continued to play and behaved as usual. **Trigger** An internal error in the RabbitMQ message-confirmation logic caused unexpected NACKing of all published messages. **Detection** A notification from the monitoring systems provided by AWS CloudWatch alerted the DevOps team. **Root Causes** Catastrophic failure in the Main RabbitMQ cluster. **Remediation** The short-term solution was to redeploy the RabbitMQ cluster from scratch and migrate the traffic there. We also disabled channel confirms for some publishers where they are not necessary. The team has analyzed how similar failures can be avoided in the future. We will deploy multiple RabbitMQ clusters and separate various features among them to reduce the impact of a single cluster failure. We will also implement a faster and more automated way to deploy a new cluster and migrate the affected traffic to it in case of cluster failure.
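For context on what "channel confirms" means here: with the amqplib Node.js client, a confirm channel makes the broker acknowledge (or NACK) every publish, whereas a plain channel is fire-and-forget. The sketch below is a minimal illustration of the two modes, not the signageOS publisher code; the exchange and routing-key names are assumptions.

```typescript
// Minimal amqplib illustration; exchange and routing-key names are assumptions.
import amqp from "amqplib";

async function publishExamples(url: string): Promise<void> {
  const connection = await amqp.connect(url);

  // Fire-and-forget: no broker acknowledgement, lowest overhead.
  const plain = await connection.createChannel();
  plain.publish("events", "device.ping", Buffer.from(JSON.stringify({ ok: true })));

  // Confirm channel: waitForConfirms() rejects if the broker NACKs a publish,
  // which is the failure mode described in this incident.
  const confirmed = await connection.createConfirmChannel();
  confirmed.publish("events", "device.provisioned", Buffer.from(JSON.stringify({ ok: true })));
  await confirmed.waitForConfirms();

  await connection.close();
}
```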

resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A temporary hotfix was deployed to production.

investigating

The root cause of the incident has been traced to RabbitMQ message acknowledgment.

investigating

We are currently investigating this issue.

Report: "Delayed Pings in Box"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and a fix is being implemented.

Report: "Longer API response time"

Last update
postmortem

One replica of the MongoDB database lost some indexes due to corrupted disk storage \(or a memory failure\). Losing indexes triggers reindexing of the database data from scratch, which produced an unpredictable CPU load on the database instance. The reindexing process kept the database alive, but other requests were slowed down. The major impact was on the REST API service, because some larger requests could take more than 30 seconds, leading to HTTP request timeouts. The requests were still processed in the background even when they did not complete from the user's point of view. The problem was detected through timed-out requests reported by the automated Postman REST API tests. It was resolved by manually stepping down the affected primary replica, after which another secondary replica took over as primary. To prevent this in the future, we will automatically detect slowed-down MongoDB instances and step them down automatically when the slowdown persists for a longer time.
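For reference, stepping a primary down manually can be done with MongoDB's `replSetStepDown` admin command. The sketch below uses the official Node.js MongoDB driver against a placeholder connection string and is purely illustrative, not the script signageOS used.

```typescript
// Illustrative use of the replSetStepDown admin command; the URI is a placeholder.
import { MongoClient } from "mongodb";

async function stepDownPrimary(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    // Ask the current primary to step down and stay secondary for 60 seconds.
    await client.db("admin").command({ replSetStepDown: 60 });
  } catch (error) {
    // The primary may close connections while stepping down, so a network
    // error here can be expected rather than a failure.
    console.warn("Step-down issued; connection was closed by the server:", error);
  } finally {
    await client.close();
  }
}
```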

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The fix is being deployed to all API nodes. Deployment should be completed in the next 10 minutes.

identified

The issue has been identified and a fix is being implemented.

Report: "Screenshots not showing in Box UI and then later API requests processing became slow"

Last update
postmortem

**Date** 2021-05-05 3:30 CET **Authors** Lukas Danek, COO / Head of Product; Michael Zabka, CTO; Michal Artazov, Lead Developer **Summary** At 3:30 CET there was a spike in the number of incoming screenshots. Our services were not able to process them fast enough, so the number of screenshots waiting to be processed in RabbitMQ quickly grew to a couple of million. At 8:30 CET, in order to speed up the process, we manually scaled up the number of instances of the service that writes the incoming screenshot metadata into MongoDB. However, that caused too many writes to MongoDB, which degraded its read speed. Incoming API requests that read data from MongoDB were degraded as a result. Playback on end devices was NOT affected at any time. **Impact** Screenshots were not showing in the Box UI between 3:30 and 8:30 CET. Some API requests took up to 1 minute to complete between 8:30 and 10:00 CET. **Trigger** A spike in traffic \(the number of incoming screenshots\), followed by scaling a service up too high and causing too many writes to MongoDB. **Detection** Notifications from the monitoring systems provided by Grafana and AWS CloudWatch alerted the DevOps team. **Root Causes** There was an unexpected spike in the number of incoming screenshots at 3:30 CET. The spike resulted in a couple of million pending screenshots in RabbitMQ, waiting to be processed. In an attempt to resolve this, we allowed more screenshots to be processed in parallel, overloading MongoDB with too many writes. **Remediation** Eventually all pending screenshots were processed and new screenshots were processed in real time again. As a short-term measure to prevent this from happening again, we defined an internal guideline that sets the maximum number of instances of the service that handles screenshots. As a long-term solution, we plan to redesign our infrastructure so that monitoring data \(such as screenshots\) is stored in a separate MongoDB instance and processed independently from other data at all levels.
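One way to bound how many screenshots are processed in parallel per consumer, sketched here with the amqplib client purely as an illustration (the queue name, prefetch value, and persistence helper are assumptions), is a prefetch cap so that scaling the service out does not translate directly into unbounded MongoDB writes.

```typescript
// Illustrative amqplib consumer with a prefetch cap; names are assumptions.
import amqp from "amqplib";

async function consumeScreenshots(url: string): Promise<void> {
  const connection = await amqp.connect(url);
  const channel = await connection.createChannel();

  await channel.assertQueue("screenshots");
  await channel.prefetch(50); // at most 50 unacknowledged messages per consumer

  await channel.consume("screenshots", async (message) => {
    if (!message) return;
    try {
      const metadata = JSON.parse(message.content.toString());
      await saveScreenshotMetadata(metadata); // hypothetical MongoDB write
      channel.ack(message);
    } catch {
      channel.nack(message, false, true); // requeue on failure
    }
  });
}

// Hypothetical persistence helper; stands in for the real MongoDB write.
async function saveScreenshotMetadata(metadata: unknown): Promise<void> {
  void metadata;
}
```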

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "Devices are shown as disconnected but it's only false info in Box UI"

Last update
postmortem

**Date** 2021-04-16, 7:00 CET **Authors** * Lukas Danek, CPO * Michael Zabka, CTO **Summary** At 7:00 UTC there was a CPU spike in the primary Redis database, which made the database unresponsive. The signageOS cluster reacted to this event automatically, transferred all device connections from the primary Redis to a failover replica, and restored the primary database. The whole event took less than 15 minutes to complete. During the initial 5 minutes, devices appeared as offline in signageOS Box even though they were not. All device-management functionality worked as expected. This event only affected the Box UI; content playback at all endpoints continued seamlessly and no endpoint was affected. **Impact** Devices that were connected to signageOS were shown as offline in signageOS Box for 5 minutes. **Trigger** One replica of the Redis database failed and was restarted due to heavy traffic. **Detection** Notifications from the monitoring systems provided by Grafana and AWS CloudWatch alerted the DevOps team. **Root Causes** There was an unexpected spike of traffic coming from devices at 08:00 CEST. This spike placed a short but heavy demand on storing device connections in the Redis database where those connections are kept. Redis is used to provide the fastest and least demanding access to real-time data. The resources for this database were sized based on long-term observed averages and peaks in the history of the signageOS system. Unfortunately, this spike demanded more CPU resources than those observations suggested, which made database requests unresponsive; after a short time the database was automatically restarted. **Resolution** All traffic was correctly and automatically redirected to the backup Redis database. This triggered synchronization of the backup databases, which repopulated the data from device history. The only affected data was the device connections shown in the Box UI. The redirection to the failover database took up to 5 minutes and the synchronization of the backup device-connection data took up to 15 minutes. No other parts of the system were affected, including the real connections to the devices: devices remained responsive, accepting management requests and processing standard monitoring data. **Remediation** The prevention for the future is better detection and analysis of spikes, so that the Redis database resources are better prepared for spike traffic. The Redis database is currently scaled up to handle all expected spikes at any time.
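As a generic illustration of client-side behaviour during such a failover (not signageOS's actual setup; host names and the master group name are assumptions), an ioredis client configured against Redis Sentinel follows the newly promoted replica automatically.

```typescript
// Illustrative ioredis Sentinel configuration; hosts and names are assumptions.
import Redis from "ioredis";

const redis = new Redis({
  sentinels: [
    { host: "sentinel-1.internal", port: 26379 },
    { host: "sentinel-2.internal", port: 26379 },
  ],
  name: "device-connections", // the monitored master group
});

// Reads and writes keep working after a failover, because the client asks
// the sentinels for the current master before reconnecting.
export async function markDeviceConnected(deviceUid: string): Promise<void> {
  await redis.set(`device:${deviceUid}:connected`, "1", "EX", 60);
}
```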

resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Box login issue"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Applet builder is stopped for some server problems"

Last update
resolved

This incident has been resolved.

identified

We expect the service to be back within 15 minutes.

Report: "Failing over to mongodb replica set"

Last update
postmortem

The primary replica of MongoDB reached 90% root disk usage, which triggered a disk-size upgrade at runtime. To make the change effective, the MongoDB daemon on the server had to be restarted. The primary role was therefore handed over to a secondary replica; the MongoDB daemon was then automatically restarted and the original replica was brought back as primary.

resolved

The primary replica of MongoDB reached 90% root disk usage, which triggered a disk-size upgrade at runtime. To make the change effective, the MongoDB daemon on the server had to be restarted. The primary role was therefore handed over to a secondary replica; the MongoDB daemon was then automatically restarted and the original replica was brought back as primary.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "Box power actions processing postponed"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Regional Box degradation"

Last update
resolved

There was a network interruption inside AWS which disconnected the internal VPN. The system switched to a backup failover instance and recovered. All systems are now back on the primary instance.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.