Historical record of incidents for Rollbar
Report: "Service provider incident"
Last updateWe are currently experiencing an issue with our service provider. We are awaiting an update from them and will share more as soon as information is available.
Report: "Datastore maintenance"
Last updateThe scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
We will be undergoing scheduled maintenance during this time.
Report: "Web degraded performance"
Last updateThis incident has been resolved. We apologize for any inconvenience during this time.
The third-party provider has deployed a fix, and it is currently propagating. We are monitoring the rollout and will confirm once full recovery is observed. Impacted features may begin to recover during this time.
We are currently experiencing degraded performance due to an issue with a third-party service we rely on. This is affecting our web experience. Our team is monitoring the situation and will provide updates as we learn more. We’ll post the next update in 60 minutes, or sooner if we have new information.
We are experiencing web performance degradation. We are investigating actively and will post when we have identified the issue.
Report: "Web degraded performance"
Last updateWe are experiencing web performance degradation. We are investigating actively and will post when we have identified the issue.
Report: "Database Maintenance"
Last updateWe will be undergoing scheduled maintenance during this time.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Report: "Delayed data in Account Dashboard, Item List and Item Detail charts"
Last updateA fix has been released, and the backlog has been fully processed. All screens and API endpoints are back to real time and functioning normally.
We've identified that although pipeline processing to real-time notifications is functioning, the Account Dashboard, Item List and Item Detail charts are missing recent data since 2:23pm PT. We think we've identified the problem and are in the process of releasing a fix.
Report: "Web app slowness/unavailability"
Last updateThis incident has been resolved. It was caused by high load on one datastore. We identified a likely source of the high load and have implemented a fix to prevent it from recurring.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We're investigating slow/unavailable requests affecting the Dashboard, Item List and Item Detail in the web UI.
Report: "We are currently experiencing issues with our web front end"
Last updateWe are currently investigating this issue.
Report: "GKE updates"
Last updateWe have been advised to perform updates to our Kubernetes cluster. This shouldn't impact service, but out of caution we will schedule a maintenance window and perform the updates during EU hours to minimize any unexpected impact.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Report: "Pipeline latency impacting notification delivery."
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Pipeline latency impacting notification delivery."
Last updateWe are currently investigating this issue.
Report: "Pipeline maintenance"
Last updateWe are performing some maintenance during this time. We do not expect any significant impact on performance, but we would like to make you aware in case something unexpected occurs.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Report: "Pipeline lag"
Last updateThis incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
We are currently experiencing an issue in the pipeline. We have identified what is at fault and are actively working to resolve it. Sorry for any inconvenience. We will update this status page once more information is available.
Report: "Pipeline lag"
Last updateThis incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
We are currently experiencing an issue in the pipeline. We have identified what is at fault and are actively working to resolve it. Sorry for any inconvenience. We will update this status page once more information is available.
Report: "database maintenance"
Last updateThis has now been completed. Sorry for any inconvenience caused while this important work was carried out.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
We will be undergoing scheduled maintenance during this time.
Report: "Security maintenance"
Last updateThe scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
We will be undergoing scheduled maintenance during this time.
Report: "Database maintenance"
Last updateThe scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
We will be undergoing scheduled maintenance during this time.
Report: "Queue system maintenance"
Last updateThe scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
We will be undergoing scheduled maintenance during this time.
Report: "Incident impacted Item table visibility"
Last updateAn image update caused a temporary issue with visibility of the item table for approximately 60 minutes this morning. Once the issue was identified, the suspect image was rolled back and service resumed. We apologize for any inconvenience and will be taking action to ensure better validations and tests are in place to prevent this type of problem from occurring again.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Web app intermittent failures"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're investigating an issue with intermittent failures of requests to the Rollbar web app (rollbar.com).
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "PagerDuty integration partial outage"
Last updateFrom 2024-12-09 1:27pm PST to 2024-12-10 2:33pm PST, notifications from Rollbar to PagerDuty were partially unavailable; the error rate was elevated, with 34% excess notification failures. The root cause was a code issue. Rollbar staff identified the issue at 2024-12-10 2:29pm PST and promptly rolled back the change, resolving the incident.
Report: "Pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Pipeline latency"
Last updateThis incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently experiencing a problem in the processing performance of the pipeline causing some delays in processing and alerting. We are working on the issue actively and will update as appropriate.
Report: "Web Application Outage"
Last updateDuring a database maintenance operation, the Rollbar Web Application became unresponsive for 15 minutes.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Web outage & Occurrence processing pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
The issue has been identified and a fix is being implemented.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Increased web latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are experiencing latency on WebApp. We are investigating actively and will post when we have identified the issue.
Report: "Issues loading dashboard graphs"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
There are problems loading dashboard graphs, but the Web Application is operational. We are working on a fix to resolve the problem.
Report: "Pipeline latency"
Last update
**Summary of the Incident and Impact**

On March 25th, 2024, between 10:58 and 12:08 PDT Rollbar experienced a platform latency increase affecting the Web Application ([rollbar.com](http://rollbar.com)) and Pipeline services. The cause of these issues can be traced to a combination of 2 releases that occurred in relatively quick succession. One of the releases involved transitioning our package management for the Summarization service and the other was a code release containing a poorly optimized query that caused our database to increase load.

At 07:34 PDT on March 25th, a release of the Summarization service was completed using a new package management system. The release resulted in a change to an IP address that was used to configure a DNS that connected to this service. This resulted in requests that timed out and increased the page load latency on certain views of items in the Web Application.

At 10:08 a release was deployed to the Web Application and Pipeline services with a code change which resulted in a query that significantly increased disk IO on one of Rollbar’s main databases. Pipeline latency started to build as load increased on the server, and this further affected page load times on the Web Application.

Alerts triggered and brought attention to engineers as thresholds were breached at 10:21 PDT but since these 2 issues were compounding to affect latency, it was not immediately clear what the problem was. The application was still usable but significantly slow for some customers. A series of reverts were made that brought the system back to stability.

Timeline:

* March 25 07:34 PDT - Summarization service was deployed using a new package manager
* 10:08 PDT - Changes to Rollbar’s Web Application and Pipeline were released with a poorly optimized database query
* 10:21 PDT - Alerts internal to Rollbar started to trigger as latency spiked in various places
* 10:58 PDT - General stability of the Web Application and Pipeline are affected with some customers reporting slow loading or unreachable pages
* 11:26 PDT - The changes to the Web Application and Pipeline were reverted and deployed
* 12:08 PDT - The changes to the Summarization service were reverted and full stability was reached

**Follow-Up Actions**

To mitigate future risks and avoid similar incidents, we are undertaking the following actions:

* We are actively working on addressing how we reconcile the IP addresses with our DNS for the summarization service and looking to improve this process.
* We will be having a full internal postmortem on this event by April 5, 2024, and expect to identify further action items to improve our systems.
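The intervals implied by the timeline above can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the timeline, while the dictionary keys and the function names `minutes_between` and `timeline_intervals` are hypothetical labels introduced here, not anything from Rollbar's tooling.

```python
from datetime import datetime

# Timestamps (PDT, March 25, 2024) copied from the timeline above.
# The key names are illustrative labels, not official event names.
EVENTS = {
    "summarization_deploy": "07:34",
    "web_pipeline_release": "10:08",
    "alerts_triggered": "10:21",
    "customer_impact": "10:58",
    "web_pipeline_revert": "11:26",
    "summarization_revert": "12:08",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def timeline_intervals(events: dict) -> dict:
    """Derive a few intervals of interest from the event timestamps."""
    return {
        # Time from the second release until internal alerts fired.
        "release_to_alert": minutes_between(
            events["web_pipeline_release"], events["alerts_triggered"]
        ),
        # Time from the first alert until the first revert was deployed.
        "alert_to_first_revert": minutes_between(
            events["alerts_triggered"], events["web_pipeline_revert"]
        ),
        # Customer-facing impact window stated in the summary.
        "customer_impact_window": minutes_between(
            events["customer_impact"], events["summarization_revert"]
        ),
    }


if __name__ == "__main__":
    for name, minutes in timeline_intervals(EVENTS).items():
        print(f"{name}: {minutes} min")
```

Under these assumptions the customer-facing window comes to 70 minutes, matching the 10:58 to 12:08 PDT range given in the summary.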
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are currently experiencing an issue within the pipeline which might be causing some latency in processing. We are investigating this issue and will provide an update as more is understood. Apologies for any inconvenience.
Report: "Web Outage"
Last update
# **Summary of the Incident and Impact**

On February 3rd, 2024, between 03:37 and 06:35 PST, Rollbar experienced a platform outage affecting the Web Application ([rollbar.com](http://rollbar.com)) and Platform API ([api.rollbar.com](http://api.rollbar.com)) servers. The cause of these outages can be traced to an automated update by our Google Cloud Platform to Rollbar’s GKE (Google Kubernetes Engine) Clusters. Following this incident, the trailing-12-month uptime of the API tier as measured by our external monitoring service is 99.92%.

The upgrade removed firewall rules necessary for health checks originating from Google Cloud Application Load Balancers (ALBs), which are required for the ALBs to send traffic to application servers. Our default network firewall security posture is very strict, and removal of rules has significant consequences because we disallow all IP traffic on the relevant ports. The removal of these firewall rules left workloads on the GKE clusters unable to communicate with the ALBs, causing the load balancers to register all workloads as unhealthy.

Initially, it was unclear what had happened, as no code changes had been deployed by Rollbar nor were changes made directly to any infrastructure. Not knowing that the firewall rules had been eliminated, we attempted to restart applications and create new load balancers from roughly 03:37 to 05:08. At 05:08 a support ticket was created with Rollbar’s cloud services provider and Google to help resolve the issue. At 05:11 engineers from the cloud services provider, Google, and Rollbar joined a teleconference to discuss the issue. After 75 minutes on the support call, the cloud services provider and Google were able to determine that the firewall rules had been removed due to the GKE upgrade. Starting at 06:28, Rollbar created new firewall rules and resolved the issues with load balancer health, thus restoring service for the Platform API & Web Application. By 06:35, all services were fully restored.

**Timeline:**

* Feb 3 03:37 PST - Both the Platform API and Web Application stop responding
* 03:37-05:08 PST - Attempts to remedy through restarts and creating new load balancers fail
* 05:08 PST - Critical support ticket created with our cloud support provider
* 05:11 PST - Teleconference call initiated with cloud services provider, Google, & Rollbar engineers
* 06:28 PST - New firewall rules recommended and added for the Web Application’s ALB
* 06:30 PST - Web Application became available
* 06:32 PST - New firewall rules recommended and added for the Platform API’s ALB
* 06:35 PST - Platform API became available

# **Follow-up Actions**

To mitigate future risks and avoid similar incidents, we have undertaken the following actions:

* In order to avoid the deletion of necessary firewall rules, we have created our own firewall rules rather than relying on automatically-created rules.
* We have incorporated notifications on GKE updates into our internal application performance graphs to note when these occur to help in the future when diagnosing issues.
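The failure mode described above, where a strict default-deny firewall loses the rules that admit load balancer health checks, can be illustrated with a small check. This is a minimal sketch under stated assumptions, not Rollbar's actual tooling: the `FirewallRule` model and the `uncovered_health_check_ranges` function are hypothetical, and the two source ranges are the ones Google has documented for load balancer health check probes, which should be verified against current Google Cloud documentation before relying on them.

```python
from dataclasses import dataclass
from ipaddress import ip_network

# Assumption: source ranges documented by Google for load balancer
# health check probes. Verify against current Google Cloud docs.
HEALTH_CHECK_RANGES = ("35.191.0.0/16", "130.211.0.0/22")


@dataclass(frozen=True)
class FirewallRule:
    """Hypothetical, simplified model of a VPC firewall rule."""
    name: str
    direction: str          # "INGRESS" or "EGRESS"
    action: str             # "allow" or "deny"
    source_ranges: tuple    # CIDR strings
    ports: tuple            # allowed TCP ports


def uncovered_health_check_ranges(rules, port):
    """Return health check ranges that no single allow rule admits on port.

    Limitation: a probe range covered only by the union of several
    narrower rules is still reported as uncovered.
    """
    allowed = [
        ip_network(cidr)
        for rule in rules
        if rule.direction == "INGRESS"
        and rule.action == "allow"
        and port in rule.ports
        for cidr in rule.source_ranges
    ]
    missing = []
    for cidr in HEALTH_CHECK_RANGES:
        probe = ip_network(cidr)
        covered = any(
            net.version == probe.version and probe.subnet_of(net)
            for net in allowed
        )
        if not covered:
            missing.append(cidr)
    return missing


if __name__ == "__main__":
    # With a default-deny posture and no explicit rule, every probe is blocked.
    print(uncovered_health_check_ranges([], 443))

    # An explicitly managed rule, as in the follow-up action above.
    explicit = FirewallRule(
        name="allow-lb-health-checks",  # hypothetical rule name
        direction="INGRESS",
        action="allow",
        source_ranges=HEALTH_CHECK_RANGES,
        ports=(443,),
    )
    print(uncovered_health_check_ranges([explicit], 443))
```

A check along these lines could run after cluster upgrades so that a missing rule is flagged before load balancers mark every workload unhealthy. The port and rule name here are placeholders.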
We are continuing to monitor for any further issues.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "API Tier is down"
Last updateThis incident has been resolved.
We are continuing to monitor for any further issues.
We are continuing to monitor for any further issues.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Web application is down"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are having issues with our web application and are investigating.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Web outage"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Partial Pipeline Outage"
Last updateThis incident has been resolved.
We are continuing to monitor for any further issues.
The web latency was produced by extra resource consumption due to backlog processing and is now operational again.
We are experiencing latency across our web application. We are currently investigating the issue and will provide an update shortly.
We are continuing to monitor for any further issues.
We are continuing to monitor the fix, and we have processed most of the occurrence backlog.
We have applied a fix and are processing the occurrence backlog.
We have identified the root cause of the outage and are implementing a solution.
We are continuing to investigate this issue.
We have latency in our pipeline that is causing some users to not see their items from the past few hours.
Report: "Web app unavailable"
Last updateThis incident has been resolved.
We continue to see degraded performance, and we are monitoring.
All systems are back online, and we are continuing to monitor.
Web App is available, Login still unavailable.
We are continuing to work on a fix for this issue.
We're continuing to work to restore the login page. All other systems are functioning normally.
The web app is now available, except for the login page. We're continuing to work on a fix.
We've identified the cause of the web app outage. We're working on a fix.
We are currently investigating this issue.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Sept 26th 2023"
Last updateThis incident has been resolved.
A fix has been issued and we are monitoring the results.
We are currently having issues with CORS and certain sections of our site not loading. We are currently investigating the use of the uBlock Origin ad blocker.
Report: "Pipeline degradation"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue. Customers may experience some delays in processing while this incident is ongoing.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Pipeline Latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating an issue with degraded performance in our pipeline.
Report: "Pipeline Latency"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Pipeline Incident being investigated"
Last updateThis has now been resolved.
We are continuing to work on this issue. The pipeline is processing the backlog and we are monitoring the applied changes.
The issue has been identified and we are applying updates to help alleviate the situation and release delayed data through the pipeline.
We are continuing to investigate this issue. Sorry for any inconvenience this has caused. A further update will be posted in 1 hour.
We are continuing to investigate this issue.
We are currently investigating an incident causing delays in processing on the pipeline. Once further information is available, an update will be posted.
Report: "Pipeline latency - investigation underway"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "pipeline latency"
Last updateThis incident has been resolved.
We are currently experiencing a latency event within the pipeline. The investigation is ongoing, and updates will be provided as the problem is identified. Apologies for any inconvenience.
Report: "Increased occurrence processing pipeline latency and web app loading issues"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are experiencing a processing delay in our occurrence pipeline, and problems loading pages within our web application. We are investigating actively and will post when we have identified the issue.
Report: "Increased occurrence processing pipeline latency"
Last updateWe have recovered the backlog and service is now back and fully operational.
A fix has been implemented and we are now processing the backlog.
We're experiencing an outage on the processing pipeline due to a problem with our stream processing datastore. We're working toward a resolution.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Increased occurrence processing pipeline latency"
Last updateThis incident has been resolved.
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Report: "Processing Delay"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.