Coveralls

Is Coveralls Down Right Now? Check whether there is a current outage.

Coveralls is currently Degraded

Last checked from Coveralls's official status page

Historical record of incidents for Coveralls

Report: "Elevated Latency for some users"

Last update
monitoring

A fix has been implemented and we are monitoring the results.

Report: "Elevated Latency for some users"

Last update
identified

The issue has been identified and a fix is being implemented.

Report: "Elevated Latency for some users"

Last update
monitoring

We have identified an incident that slowed or delayed processing for a set of builds from 3-6am PDT. We have resolved the issue, are scaling servers, and are manually clearing the backlog. We will be monitoring until clear.

Report: "Delayed Coverage Calculations for Some Users"

Last update
resolved

All builds from today (Jun 3) have been processed. As background job queues cleared, build times returned to normal. We will continue monitoring.

monitoring

We took the following action to more quickly restore normal build times for all _new_ builds today:
- We moved all unfinished background jobs from yesterday (Jun 2) into holding queues in order to restore normal build times for new builds from today (Jun 3).
- We scaled resources to more quickly drain the existing queues of jobs from new builds from today (Jun 3).
We will monitor progress on all new builds and provide updates here until we're fully caught up (zero (0) background jobs in queue). Thanks for your patience in the meantime as we restore the best possible performance to the service.

monitoring

We are continuing to clear a backlog of background processing jobs for builds submitted in the past 18-24 hrs. While all systems are operational, there will continue to be latency on build times until we clear all background job queues, which are FIFO. Current estimate: 1 hour. We will post updates here until build times return to normal.

Report: "Delayed Coverage Calculations for Some Users"

Last update
resolved

This incident has been resolved, but we will continue monitoring closely. All systems are operational, but we will leave the systems category at Degraded Performance until we have fully cleared a backlog of background processing jobs.

monitoring

We have implemented another fix and are monitoring the results.

monitoring

We are continuing to monitor for any further issues.

monitoring

A partial fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and a fix is being implemented.

investigating

While monitoring we have discovered some additional planner anomalies that are slowing down queries associated with our various calculation jobs. We are investigating those again and working to identify and implement a fix. We will continue posting updates here.

monitoring

All systems operational. We are carefully scaling resources and monitoring database performance to ensure stable recovery. Some delays in build and coverage report processing may still be observed as we restore full capacity. Thank you for your continued patience — we’ll share further updates as recovery progresses.

monitoring

We have completed implementation of our fix. We are cautiously resuming background processing and will continue monitoring closely. If you notice any delays in build processing, rest assured they will be resolved shortly. Thank you for your patience — more updates will follow as we return to full capacity.

monitoring

We’re currently experiencing an outage due to unexpected query planner behavior following our recent upgrade to PostgreSQL 16. Despite extensive preparation and testing, one of our core background queries began performing full table scans under the new version, causing a rapid increase in load and job backlog.
What we're doing:
- We’ve paused background job processing to stabilize the system.
- We tried all "quick fixes" like adjustments to DB params that affect planner choices—all to no effect.
- We're now actively deploying a targeted database index to resolve the performance issue.
- We’ve identified a longer-term fix that will make the query safer and more efficient on the new version of PostgreSQL.
Why this happened: PostgreSQL 16 introduced changes to how certain types of queries are planned. A query that performed well in PostgreSQL 12 unexpectedly triggered a much more expensive plan in 16. We're correcting for that now.
Estimated recovery: Background job processing is expected to resume within 20–40 minutes, with full service restoration shortly thereafter.
We’ll continue to post updates here as we make progress. Thanks for your patience — we’re on it.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

We are continuing to work on a fix for this issue.

identified

We need to pause processing momentarily to clear a backlog of DB connections. We cut over to a new database version this weekend, and even after months of planning and preventative steps, during periods of elevated usage after such a change it's still common for planner regressions to occur. We will identify the offending SQL statements, fix their planner issues, and restart work as soon as possible. Thanks for your patience as we work through this as quickly as possible.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Delayed Coverage Calculations for Some Users"

Last update
Identified

The issue has been identified and a fix is being implemented.

Investigating

While monitoring we have discovered some additional planner anomalies that are slowing down queries associated with our various calculation jobs. We are investigating those again and working to identify and implement a fix. We will continue posting updates here.

Update

All systems operational. We are carefully scaling resources and monitoring database performance to ensure stable recovery. Some delays in build and coverage report processing may still be observed as we restore full capacity. Thank you for your continued patience — we’ll share further updates as recovery progresses.

Update

We have completed implementation of our fix. We are cautiously resuming background processing and will continue monitoring closely. If you notice any delays in build processing, rest assured they will be resolved shortly. Thank you for your patience — more updates will follow as we return to full capacity.

Update

We’re currently experiencing an outage due to unexpected query planner behavior following our recent upgrade to PostgreSQL 16. Despite extensive preparation and testing, one of our core background queries began performing full table scans under the new version, causing a rapid increase in load and job backlog.
What we're doing:
- We’ve paused background job processing to stabilize the system.
- We tried all "quick fixes" like adjustments to DB params that affect planner choices—all to no effect.
- We're now actively deploying a targeted database index to resolve the performance issue.
- We’ve identified a longer-term fix that will make the query safer and more efficient on the new version of PostgreSQL.
Why this happened: PostgreSQL 16 introduced changes to how certain types of queries are planned. A query that performed well in PostgreSQL 12 unexpectedly triggered a much more expensive plan in 16. We're correcting for that now.
Estimated recovery: Background job processing is expected to resume within 20–40 minutes, with full service restoration shortly thereafter.
We’ll continue to post updates here as we make progress. Thanks for your patience — we’re on it.

Monitoring

A fix has been implemented and we are monitoring the results.

Update

We are continuing to work on a fix for this issue.

Update

We are continuing to work on a fix for this issue.

Update

We need to pause processing momentarily to clear a backlog of DB connections. We cut over to a new database version this weekend, and even after months of planning and preventative steps, during periods of elevated usage after such a change it's still common for planner regressions to occur. We will identify the offending SQL statements, fix their planner issues, and restart work as soon as possible. Thanks for your patience as we work through this as quickly as possible.

Identified

The issue has been identified and a fix is being implemented.

Investigating

We are currently investigating this issue.

Report: "Latency"

Last update
Investigating

While monitoring we have discovered some additional planner anomalies that are slowing down queries associated with our various calculation jobs. We are investigating those again and working to identify and implement a fix. We will continue posting updates here.

Update

All systems operational. We are carefully scaling resources and monitoring database performance to ensure stable recovery. Some delays in build and coverage report processing may still be observed as we restore full capacity. Thank you for your continued patience — we’ll share further updates as recovery progresses.

Update

We have completed implementation of our fix. We are cautiously resuming background processing and will continue monitoring closely. If you notice any delays in build processing, rest assured they will be resolved shortly. Thank you for your patience — more updates will follow as we return to full capacity.

Update

We’re currently experiencing an outage due to unexpected query planner behavior following our recent upgrade to PostgreSQL 16. Despite extensive preparation and testing, one of our core background queries began performing full table scans under the new version, causing a rapid increase in load and job backlog.
What we're doing:
- We’ve paused background job processing to stabilize the system.
- We tried all "quick fixes" like adjustments to DB params that affect planner choices—all to no effect.
- We're now actively deploying a targeted database index to resolve the performance issue.
- We’ve identified a longer-term fix that will make the query safer and more efficient on the new version of PostgreSQL.
Why this happened: PostgreSQL 16 introduced changes to how certain types of queries are planned. A query that performed well in PostgreSQL 12 unexpectedly triggered a much more expensive plan in 16. We're correcting for that now.
Estimated recovery: Background job processing is expected to resume within 20–40 minutes, with full service restoration shortly thereafter.
We’ll continue to post updates here as we make progress. Thanks for your patience — we’re on it.

Monitoring

A fix has been implemented and we are monitoring the results.

Update

We are continuing to work on a fix for this issue.

Update

We are continuing to work on a fix for this issue.

Update

We need to pause processing momentarily to clear a backlog of DB connections. We cut over to a new database version this weekend, and even after months of planning and preventative steps, during periods of elevated usage after such a change it's still common for planner regressions to occur. We will identify the offending SQL statements, fix their planner issues, and restart work as soon as possible. Thanks for your patience as we work through this as quickly as possible.

Identified

The issue has been identified and a fix is being implemented.

Investigating

We are currently investigating this issue.

Report: "Increased Latency (Recovered)"

Last update
Update

All systems operational. We are carefully scaling resources and monitoring database performance to ensure stable recovery. Some delays in build and coverage report processing may still be observed as we restore full capacity. Thank you for your continued patience — we’ll share further updates as recovery progresses.

Update

We have completed implementation of our fix. We are cautiously resuming background processing and will continue monitoring closely. If you notice any delays in build processing, rest assured they will be resolved shortly. Thank you for your patience — more updates will follow as we return to full capacity.

Update

We’re currently experiencing an outage due to unexpected query planner behavior following our recent upgrade to PostgreSQL 16. Despite extensive preparation and testing, one of our core background queries began performing full table scans under the new version, causing a rapid increase in load and job backlog.
What we're doing:
- We’ve paused background job processing to stabilize the system.
- We tried all "quick fixes" like adjustments to DB params that affect planner choices—all to no effect.
- We're now actively deploying a targeted database index to resolve the performance issue.
- We’ve identified a longer-term fix that will make the query safer and more efficient on the new version of PostgreSQL.
Why this happened: PostgreSQL 16 introduced changes to how certain types of queries are planned. A query that performed well in PostgreSQL 12 unexpectedly triggered a much more expensive plan in 16. We're correcting for that now.
Estimated recovery: Background job processing is expected to resume within 20–40 minutes, with full service restoration shortly thereafter.
We’ll continue to post updates here as we make progress. Thanks for your patience — we’re on it.

Monitoring

A fix has been implemented and we are monitoring the results.

Update

We are continuing to work on a fix for this issue.

Update

We are continuing to work on a fix for this issue.

Update

We need to pause processing momentarily to clear a backlog of DB connections. We cut over to a new database version this weekend, and even after months of planning and preventative steps, during periods of elevated usage after such a change it's still common for planner regressions to occur. We will identify the offending SQL statements, fix their planner issues, and restart work as soon as possible. Thanks for your patience as we work through this as quickly as possible.

Identified

The issue has been identified and a fix is being implemented.

Investigating

We are currently investigating this issue.

Report: "Increased Latency"

Last update
Investigating

We are currently investigating this issue.

Report: "Infrastructure Cutover"

Last update
Scheduled

We will be undergoing scheduled maintenance during this time.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Report: "Infrastructure cutover"

Last update
In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

We will be undergoing scheduled maintenance during this time.

Report: "Elevated Latency"

Last update
resolved

This incident is resolved. Performance for all new builds is normal. All jobs in backed-up queues have been cleared, with the exception of secondary and tertiary queues for outlier repos. We will be clearing those manually over the next several hours as traffic allows.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Elevated Latency"

Last update
resolved

We’re now closing this incident, several hours after restoring full system stability. Over the past 4 hours, we’ve continued to monitor key requests and queries closely. During that time, we identified a number of previously long-running queries that we’ve either:
- Optimized immediately, based on new platform characteristics; or
- Added to a short-term optimization backlog for tuning over the next few days.
These efforts are part of our ongoing work to adapt all app queries to the updated infrastructure context.

monitoring

The site remains fully operational, and performance for all new builds is normal. We’re continuing to monitor request and query times closely to identify any long-running queries that may have contributed to recent job processing delays or latency spikes.

monitoring

Performance has been restored to standard and the site is fully operational, but we will continue to clear any previously blocked (or retry) jobs we discover in background job queues and monitor performance stats as they clear.

monitoring

Monitoring for further issues. Performance for new builds is normal. We are waiting for dequeue metrics to fall below 50% of normal before we lift the "degraded performance" rating.

monitoring

We believe the issue is resolved. We are scaling infrastructure to clear any delayed background jobs, and monitoring to ensure latency stays within normal range.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are investigating elevated latency in our background jobs system. Some users have also reported receiving Timeout errors while trying to load web pages.

Report: "Infrastructure maintenance"

Last update
In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

We will be undergoing scheduled maintenance at this time. We do not expect to use the full window. We will close the window / post when complete.

Report: "Staged Infrastructure Migration"

Last update
In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

We will be undergoing staged infrastructure migration described here: https://status.coveralls.io/incidents/ph6p14vg1fyr

Report: "Partial Outage - Database Connection Overload"

Last update
postmortem

## **Coveralls Incident Postmortem** **Date of Incident:** April 6–7, 2025 **Published:** April 11, 2025 ### **Introduction** This postmortem outlines a recent incident that affected performance and stability for some of our customers. In the spirit of transparency and continuous improvement, we’re sharing what happened, what we learned, and how we’re moving forward. While our team has deep experience managing production infrastructure, we aren’t full-time PostgreSQL specialists. We’re full-stack developers — and this incident pushed us deeper into the internals of our database than we’d ever had to go before. It revealed blind spots, forced urgent decisions, and ultimately gave us a deeper understanding of what it will take to make Coveralls more resilient going forward. One of the nice things about developing an application for developers is that when we share our experience with technical challenges and lessons learned, we know many of our customers will relate. ### **Incident Summary** Between Sunday, April 6 \(PDT\), and early morning Monday, April 7 \(PDT\), Coveralls experienced performance degradation and elevated error rates affecting coverage reporting and related functionality across the platform. This culminated in a database overload and service outage from approximately 10:45 PM to 1:15 AM PDT. ### **Impact** * **Duration:** Approximately 8 hours of compounding service degradation, culminating in a full outage from ~10:45 PM Sun to 1:15 AM Mon PDT * **Affected Systems:** Coverage uploads, report generation, dashboard access, notifications delivery * **Customer Impact:** Delays in CI workflows, missing or partial coverage reports, stalled PR checks ### **Root Cause: Systemic Degradation** Roughly six weeks before this incident, we began seeing signs of database degradation — slower queries, rising table bloat, and an alarming acceleration in overall database size. Upon investigation, we realized that routine PostgreSQL maintenance \(VACUUM, ANALYZE, autovacuum\) — previously effective with our parameter tuning — was no longer working. This marked the beginning of a “runaway cycle,” where dead tuples piled up faster than the system could clear them, causing performance to degrade further, which in turn made optimizations even slower. In retrospect, we now understand this shift coincided with our database exceeding what we now recognize to be its _practical operational ceiling_ — a threshold around 50–60% of our 65TB AWS RDS instance size. While well below the official storage limit, this was the point at which PostgreSQL’s internal optimization routines began falling behind our write-heavy workload. Once we crossed it, most of the practices we had relied on for years to manage data growth and performance simply stopped working. In response, we began scheduling emergency optimization efforts across successive weekend maintenance windows. These included offline cleanup routines targeting bloated tables and long-running transactions, along with backups and preparatory work to reorganize our largest table partitions. The goal was to reduce database size while preserving historical data — and to free up space for optimizations that could no longer run reliably at our current scale. Unfortunately, one after another, those operations failed to complete in the allotted time — leaving us with fewer and fewer options as our database continued creeping toward the 65TB physical limit, and with it, the risk of out-of-disk errors and potential service failure. 
### **Root Cause: Immediate Incident Trigger** The direct cause of the outage on April 6–7 was our decision to let three long-running `VACUUM FULL` operations continue on legacy partitioned tables after the end of our scheduled maintenance window. We had successfully reclaimed ~20TB of disk space earlier that day and expected the remaining operations — anticipated to reclaim an additional ~20TB — to finish within 1–2 hours based on size comparisons and previous completion times. Believing the risk was minimal — given the tables contained historical build data no longer accessed by active jobs or user queries — we reopened the application to production traffic. However, what we failed to anticipate was that: * **Background reconciliation jobs** still touched those locked tables indirectly * **PostgreSQL system processes** attempted metadata access that also triggered waits * These queries quickly **piled up behind table-level locks**, consuming all available connections This led to total connection saturation, application failures, and a full outage until the vacuum operations were forcibly terminated. ### **Outage Event: April 6–7 \(10:45 PM – 1:15 AM PDT\)** To reclaim space and restore performance, we began by decommissioning a set of legacy partitions containing historical build data no longer accessed by active workflows — a necessary step to reclaim ~20TB of space and allow critical optimization routines to proceed. We then launched nine parallel `VACUUM FULL` operations on bloat-heavy legacy tables. As the maintenance window ended, three of those operations were still running — expected to finish soon. Rather than risk losing the disk space gains by canceling mid-operation, we opted to let them finish. We added app-level protections to exclude the old data from queries and ran QA tests to ensure performance remained unaffected. Everything appeared stable: queries ran normally, monitoring showed no signs of blockage, and one engineer stayed online for two extra hours to confirm system health before calling it a night. At 10:45 PM, the site began returning 500 errors. Alerts indicated multiple servers were down, and the database was rejecting new connections with “Out of connections: Connections reserved for superuser.” RDS showed over 1,400 active connections — far beyond any level we’d previously seen in production. Most were tied up in blocked transactions: system-level activity and background reconciliation jobs that became stuck waiting on table locks held by the ongoing `VACUUM FULL` operations. We: * Entered Maintenance Mode * Killed all background jobs and blocked system queries * Canceled the remaining `VACUUM FULL` tasks This cleared the backlog and allowed the system to recover. The site was brought back online at 1:15 AM PDT and resumed normal operation. All remained well through 8:00 AM Monday PDT — the beginning of our regular Monday morning traffic spike. The system stayed stable and error-free under close watch. While a few queries ran slower than usual, autovacuum had resumed, and performance steadily improved over the following days as vacuum thresholds were tuned. 
### **Lessons Learned** * **Legacy tables still matter:** Even unused partitions can be accessed by background queries or internal PostgreSQL processes * **Manual testing isn’t enough:** Latent issues from locked tables can take hours to appear under load * `VACUUM FULL` **must never run in production again**: It will only ever be used in offline maintenance contexts, and only when it can complete fully before reopening the application to production traffic \(We feel PostgreSQL DBAs nodding slowly with eyes closed at this one\) * **We didn’t realize we’d crossed the rubicon:** We now understand that for the way we were managing our schema, the real ceiling came far earlier than expected — around 50–60% of our RDS instance capacity. That’s where PostgreSQL’s behavior began to change, and where the practices we’d long relied on started to break down. * **Our tables are too big:** Our current manual partitioning scheme—organized by time—was not granular enough to prevent tables from growing to unmanageable sizes. Going forward, we’ll need to rethink partitioning entirely, possibly by table size, data volume, or much finer time intervals. * **We need to tier old data:** Partitioning alone isn’t enough. We need to offload older data to a different database instance or long-term storage tier — one designed for low-frequency access but with preserved integrity. * **We can't store everything in one place forever:** Coveralls has been storing every coverage report for every file in every commit for every repo it tracks for over 13 years — all in a single database instance. That approach hit its limit. That number now stands as a badge of our endurance — and a reminder of the architecture we must leave behind. ### **Remediation & Recovery** * All stuck optimization jobs were terminated * Background workers and queues were paused and restarted safely * Emergency disk space was reclaimed and preserved * Monitoring was expanded to detect and alert on blocked queries, autovacuum status, and connection saturation ### **Action Items** **Short-Term** * Establish hard guardrail against closing a maintenance window with open `VACUUM FULL` operations * Restrict application access to legacy tables not actively used in production workflows * Implement dashboards and alerts for `autovacuum` lag, blocked queries, and idle-in-transaction states * Tighten timeout thresholds to prevent query pile-ups during unexpected contention **Long-Term** * Complete staged migration to our new schema and infrastructure platform * Refactor partitioning to better manage table size and maintain performance * Build system-level resilience against lock pile-ups with improved observability, timeout strategy, and alerting * Design and implement a tiering strategy for long-term historical coverage data ### **Conclusion** This incident was a turning point. It taught us that the scale we’re operating at now requires a different mindset — one grounded in constraints we hadn’t fully appreciated before. When faced with a problem of scale and an urgent need to transition our infrastructure, we thought we had a clear destination in mind and rushed to get there — only to find the tracks couldn’t support the weight we were carrying. Now we understand this is a longer journey than we expected. We may not see the final station yet, but we know how to move forward: one stop at a time — making sure each station is stable, sustainable, and ready to serve our customers well before moving to the next. 
### **Thanks** As mentioned above, no one on staff here is a PostgreSQL expert — and this incident forced us to learn more, and sooner, than we ever expected about running PostgreSQL at scale. We’d like to express our thanks to the many contributors and educators who share their hard-won PostgreSQL knowledge. Without their guidance, we couldn’t have learned what we now know—or recognize how much more there is to learn. * **Chelsea Dole**, of Citadel \(formerly of Brex\), for: * “[Postgres Table Bloat: Managing Your Tuple Graveyard](https://youtu.be/gAgbzvGT6ck?si=IY_O8vflQSToY3hn)” * “[It’s Not You, It’s Me: Breaking Up with Massive Tables via Partitioning](https://youtu.be/TafwSuLNxe8?si=Kk6L3rgZ5xTZQyMN)” * **Peter Geoghegan**, PostgreSQL contributor: * “[Bloat in PostgreSQL: a Taxonomy](https://youtu.be/JDG4bMHxCH8?si=_q-N2IsT2KiQRSPg)” * [His blog](https://pgeoghegan.blogspot.com/) * **Michael Christofides and Nikolay Samokhvalov**, of [Postgres.fm](https://postgres.fm/): * For their podcasts on Bloat, Index Maintenance, Out of Disk, and BUFFERS * Especially “[100TB and Beyond!](https://youtu.be/L6JWI296fyk?si=yQmPLykXR62YrXqb),” featuring Arka Ganguli \(Notion\), Sammy Steele \(Figma\), and Derk van Veen \(Adyen\)

resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Partial Outage - Database Connection Overload"

Last update
Resolved

This incident has been resolved.

Update

We are continuing to monitor for any further issues.

Update

We are continuing to monitor for any further issues.

Monitoring

A fix has been implemented and we are monitoring the results.

Update

We are continuing to work on a fix for this issue.

Identified

The issue has been identified and a fix is being implemented.

Update

We are continuing to investigate this issue.

Investigating

We are currently investigating this issue.

Report: "Partial Outage"

Last update
Update

We are continuing to monitor for any further issues.

Monitoring

A fix has been implemented and we are monitoring the results.

Update

We are continuing to work on a fix for this issue.

Identified

The issue has been identified and a fix is being implemented.

Update

We are continuing to investigate this issue.

Investigating

We are currently investigating this issue.

Report: "We are experiencing a partial outage"

Last update
Investigating

We are currently investigating this issue.

Report: "Partial outage"

Last update
Update

We are continuing to investigate this issue.

Investigating

We are currently investigating this issue.

Report: "Major infrastructure maintenance"

Last update
In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

We will be undergoing major infrastructure maintenance related to an infrastructure upgrade in progress for many weeks. We apologize for the length of this maintenance window, but it is absolutely crucial that we give ourselves enough time to complete this effort without new data coming into the system.

Report: "Elevated Latency"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

Report: "Elevated Latency"

Last update
Monitoring

A fix has been implemented and we are monitoring the results.

Report: "Infrastructure upgrade"

Last update
In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

We will be undergoing infrastructure upgrades and maintenance during this window.

Report: "Infrastructure Maintenance"

Last update
Completed

The scheduled maintenance has been completed.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

We will be performing what we hope is the last of several weekends of database infrastructure maintenance this weekend. Due to the complexity of the remaining tasks, we have created a longer-than-normal maintenance window of eight (8) hours running from Saturday night at 8pm PDT to Sunday morning at 4am PDT. We apologize for any inconvenience this may cause customers working on projects over the weekend. For those customers, we advise as follows:
- During this work, Coveralls will be in maintenance mode, and requests to our API, like coverage report uploads, will be rejected.
- To avoid this breaking your CI builds with step failures, users of our official integrations can enable the following "Do not fail on error" settings:
  1. `fail-on-error: false` for the Coveralls GitHub Action
  2. `fail_on_error: false` for the Coveralls Orb for CircleCI
  3. `--no-fail` flag for the Coveralls Universal Coverage Reporter
Note that in cases (1) and (2), both CI extensions are ultimately passing the `--no-fail` flag to Coverage Reporter, which is our main integration running under the hood of both CI extensions (Action & Orb).
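For reference, here is a minimal sketch of setting (1) in a GitHub Actions workflow, assuming the `coverallsapp/github-action` integration; the workflow layout, test command, and coverage output path are illustrative assumptions rather than prescribed configuration:

```yaml
# Illustrative sketch only: job layout and test step are assumptions.
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumed test step that writes ./coverage/lcov.info
      - run: npm ci && npm test
      - name: Upload coverage to Coveralls
        uses: coverallsapp/github-action@v2
        with:
          fail-on-error: false   # keep this step green even if the Coveralls API rejects the upload
```

With `fail-on-error: false`, a 4xx/5xx response from the Coveralls API during the maintenance window should not fail the step or block the PR.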

Report: "Elevated Latency"

Last update
resolved

We have cleared all backed up jobs from this incident. Latencies are back in normal range; albeit normal for our highest-traffic days of the week (Wed-Thu, US workday). For those following our efforts to address the latencies we've experienced over the last several weeks, this weekend and next represent our final planned efforts to address these issues. We expect the work over this weekend and next to resolve the occasional spikes in latency we've been experiencing, especially during periods of high traffic. If you are still waiting for any of your builds created during this period (late Wed-early Thu, US) to complete, please reach out to us at support@coveralls.io and we will investigate to ensure they don't have other causes and are completed.

monitoring

Latencies are back in normal range, but we continue to clear the remainder of backed up background jobs.

monitoring

Backed up jobs are about 50% cleared. ETA to normal: ~10-min.

monitoring

Elevated latencies system-wide. We have deployed resources to clear the backup. ETA to clear: ~20-min.

Report: "Elevated Latency"

Last update
resolved

Performance in terms of latency of background jobs has been restored to normal. The only customers who may continue experiencing latencies are "outlier" customers with builds that exceeded fair use policies earlier this morning. Latencies on the secondary queues we use to process these excess jobs remain at about 40 minutes. If you wonder whether your latencies were caused by excess usage, reach out to support@coveralls.io and we'll confirm for you either way.

monitoring

We are monitoring the results of scaled resources. ETA 15-20 min.

identified

We have detected elevated latency during a period of high traffic. We are deploying resources to address and bring those down. ETA ~20-min.

Report: "Elevated Latency in US East"

Last update
resolved

This incident has been resolved.

monitoring

Elevated latency during the US Eastern workday from ~7a-10a. We are clearing the associated backlog. ETA ~20-min.

Report: "Elevated Latency in EU"

Last update
resolved

Customers in the EU region experienced elevated latency during their core workday. We have updated our monitoring to improve our response to elevated traffic that may increase latency in this region.

Report: "Elevated Latency"

Last update
resolved

This incident has been resolved.

monitoring

About 45 minutes ago we started experiencing elevated latency for some repo types. We have scaled up and are monitoring as queues clear for those jobs. Coveralls remains fully operational. All builds should return to baseline in ~20-min.

Report: "Elevated latency"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: ""Data Tables" Error"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are actively working on a fix for this issue. Based on our current timeline, we expect to deploy the fix later today, closer to the end of the US Pacific Time (PST) workday. For most US users, this means the issue should be resolved by tomorrow morning. Customers in the EU and Asia may experience the rollout during their working hours. We appreciate your patience as we complete this process.

identified

The issue has been identified and a fix is being implemented.

investigating

A number of users with repos that have parallel builds have reported encountering the following error on their Build Pages: "DataTables warning: table id=DataTables_Table_0 - Ajax error. For more information about this error, please see http://datatables.net/tn/7" This error indicates an issue rendering elements of the Source Files table in your Build Page, usually the TREE view of that table. It sometimes appears for a short time before the page is fully rendered, but if it persists it could indicate that Coveralls had a problem successfully aggregating all of the source files for your parallel build.

Report: "Service disruption for some users"

Last update
postmortem

### **Incident Postmortem: Database Partitioning Bottleneck & Job Backlog** #### **Summary** A database partitioning limitation caused a severe backlog of background jobs, leading to degraded build processing times from **Monday, February 10, to Friday, February 14, 2025**. The backlog resulted from excessive autovacuum contention on a high-growth table, which ultimately led to cascading failures in job processing, database performance, and monitoring visibility. #### **Root Cause** Several tables in our production database grow at an accelerated rate. While we employ a partitioning strategy to prevent them from becoming unwieldy, our time-based approach failed to transition a critical table before it reached an unmanageable size. During investigation, we found: * The table was in a perpetual state of autovacuum due to an excessive number of dead tuples. * The high volume of tuples prevented autovacuum from progressing beyond the scanning phase, causing tuple locks. * These locks delayed regular transactions, leading to transaction backups that worsened over time. * By late **Tuesday, February 11**, the backlog had reached a breaking point, causing tens—eventually hundreds—of thousands of jobs to accumulate. ### **Impact** 1. **Background Job Delays** 1. A significant job queue buildup occurred between **February 10 and February 11**. 2. Clearing the backlog took an additional two days \(**February 11–13**\). 3. Failed jobs in long-tail retries prolonged the impact for another **24–36 hours**. 2. **Monitoring Gaps & Alert Failures** 1. Average job duration alerts triggered only after the queue size became a critical issue. 2. As server load increased, monitoring metrics stopped logging, preventing alerts that could have provided earlier intervention signals. 3. **Database & Infrastructure Overload** 1. Scaling up resources to clear the backlog introduced additional database contention due to high transaction volumes, exacerbating delays. 2. The increased database load led to degraded server performance, disconnecting our orchestration layer and APM monitoring. 3. This created a self-reinforcing failure loop that required **continuous manual intervention** from **February 12 to February 13**. #### **Resolution** * We transitioned the affected table, significantly relieving the bottleneck. * We scaled up resources to process the backlog, though this required careful throttling to avoid further database contention. * By **Thursday, February 13**, we placed the site into **maintenance mode for 30 minutes** to reduce load—but ultimately needed nearly **two hours** to restore stability. * To prevent immediate re-saturation, we deferred processing some older jobs to lower-traffic periods. * By **Thursday evening**, build times stabilized as overall traffic declined. * By **Friday morning, February 14**, all remaining queued jobs had processed without further intervention. #### **Next Steps** 1. **Finalizing Database Transitions** 1. To fully resolve performance degradation, we transitioned **two additional tables** closely related to the affected table. 2. This was completed during a **maintenance window on Saturday, February 15 \(8 PM – 11:59 PM PST\).** 2. **Long-Term Database Optimizations** 1. We will perform **VACUUM FULL** on legacy tables to remove ~36B dead tuples and optimize disk layout. 2. Further maintenance windows will be scheduled on late-night weekends. 3. **Partitioning Strategy Enhancements** 1. 
We are evaluating **size-based partitioning** or a refined **time-based strategy with shorter intervals** to prevent similar issues. 4. **Improved Monitoring & Alerting** 1. We will introduce **earlier warning thresholds** to detect job queue buildup before it becomes critical. 2. We will enhance **database contention monitoring** to catch autovacuum failures and lock contention earlier. #### **Conclusion** Even after **12\+ years in production**, incidents like this remind us of the importance of continually evolving our **data management and monitoring practices**. As Coveralls scales, we are committed to refining our approach to proactively address infrastructure challenges before they affect users. We sincerely apologize to all users affected by this incident. If you need assistance with historical builds or workflow adjustments, or if you'd like to share feedback, please contact us at [**support@coveralls.io**](mailto:support@coveralls.io). Your input will help us shape future improvements.

resolved

We are closing this issue but will continue to monitor as we clear the remaining queues of background jobs from yesterday. If you believe any of your recent builds are still affected (incomplete), or if you are having any issues uploading coverage reports, please reach out to us at support@coveralls.io.

monitoring

We are out of maintenance mode and monitoring live transactions.

monitoring

Our ETA for reverting maintenance mode is within the next 30 minutes.

monitoring

Use "fail on error" to keep Coveralls 4xx from failing your CI builds / holding up your PRs: While our API is in maintenance mode, new coverage report uploads (POSTs to /api/v1/jobs) will fail with a 405 or other 4xx error. To keep this from breaking your CI builds and holding up your PRs, allow coveralls steps to "fail on error." If you are using one of our Official Integrations, add: - `fail-on-error: false` if using Coveralls GitHub Action - `fail_on_error: false` if using Coveralls Orb for CircleCI - `--no-fail` flag if using Coveralls Coverage Reporter directly Documentation: - Official Integrations: https://docs.coveralls.io/integrations#official-integrations - Coveralls GitHub Action: https://github.com/marketplace/actions/coveralls-github-action - Coveralls Orb for CircleCI: https://circleci.com/developer/orbs/orb/coveralls/coveralls - Coveralls Coverage Reporter: https://github.com/coverallsapp/coverage-reporter Reach out to support@coveralls.io if you need help.

monitoring

We are in maintenance mode as we perform some database tasks to improve our performance in clearing background jobs still stuck-in-queue. ETA: 2 hrs. But we may need to update this as we monitor progress.

monitoring

We are not clearing background jobs fast enough to recover by morning US PST, so we will be putting the site into read-only mode for about 2 hrs (4:30a-6:30a US PST) in order to perform some database operations.

monitoring

We are continuing to monitor for any further issues.

monitoring

We have resolved the partial outages and are monitoring.

monitoring

We are still experiencing partial outages as we try to deploy across an extended fleet of servers. We are all hands and working to resolve asap.

monitoring

Partial outages. We are working to resolve asap.

monitoring

We are deploying extra servers to help clear backed up jobs. This will entail a rolling reboot, which may cause some users to lose their current connection to Coveralls.io. Your connection should be restored momentarily, so please try again in 30-sec to 1-min.

monitoring

While we have applied a fix and are monitoring for any further issues, we are clearing backlogged jobs for some accounts. If you are waiting on some recent builds to complete, please give them at least another 20 minutes to clear. If you are not seeing your builds clear after that, please reach out to us with your org/subscription and repo name(s) at support@coveralls.io.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating reports of service disruptions for some users, possibly related to specific subscriptions or repos.

Report: "Service Disruption due to "Invalid SSL Certificate""

Last update
postmortem

**Incident**: Service Disruption Due to Failed SSL Certificate Renewal * **Date**: January 22, 2025 * **Duration**: 17 minutes \(07:00-07:17 UTC\) * **Impact**: Service interruption due to SSL certificate issue **Summary**: Coveralls experienced a brief service disruption when our automated SSL certificate renewal process failed. While our SSL certificates auto-renew 30 days before expiration, one unreachable server prevented the renewal process from completing successfully. **Timeline**: * Prior to incident: Multiple automated renewal attempts unsuccessful * 07:00 UTC: Service disruption began * 07:17 UTC: Service restored after infrastructure adjustment **Root Cause**: The incident occurred when one server became unreachable during our SSL certificate auto-renewal process. While our certificates are configured to auto-renew, the renewal process requires successful deployment across our infrastructure. The unreachable server prevented this deployment, ultimately leading to an outage due to “certificate expiration.” **Resolution**: We identified and removed the problematic server from our infrastructure, allowing the SSL certificate renewal and deployment to complete successfully. **Preventive Measures**: 1. Enhanced monitoring for SSL renewal processes 2. Improved early warning system for similar infrastructure issues 3. Updated incident response procedures \(new SOP\) 4. Additional automated health checks We apologize for any disruption this caused and continue working to improve our infrastructure reliability.

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Outage for some users"

Last update
resolved

We are considering this resolved after monitoring for ~90 minutes. If you happen to experience any inability to log in or access your Coveralls Dashboard, please let us know at support@coveralls.io.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating reports of a partial outage for some users.

Report: "Longer than normal build times reported"

Last update
resolved

We have implemented the first of several planned changes that we hope will improve build times for the users who have been experiencing slow build times. We have planned additional changes for one evening this week and the coming weekend. If you're experiencing slower than normal build times, or have any questions, please reach out to us at support@coveralls.io.

identified

The issue has been identified and a fix is being implemented and planned for a no-downtime window this evening. This issue continues to affect some customers and not others, but our fix should address longer build times for affected customers.

investigating

Several users have reported longer than average build times, either appearing as a large number under a build's "BUILD TIME" or as builds staying in the PENDING COMPLETION state for longer than expected. We are investigating the issue.

Report: "Missing build data in builds from last week?"

Last update
resolved

Closing this for now. Please try re-running any builds you think may have been affected last week, or reach out and we'll be glad to help.

monitoring

We've had a couple of reports from customers about builds that were received by the Coveralls API prior to last week's outage (https://status.coveralls.io/incidents/mmg3wsghl3k5), which indicate they may not have been fully processed. This would most likely be due to background jobs responsible for fully processing these builds getting caught up in service failures during our outage and, as a result, potentially running out of retries and never completing. If you think any of your builds may have been affected this way, we have two suggested remedies:
1) Re-run the build via our API > ReRun Build Webhook - Documented here: https://docs.coveralls.io/api-introduction#rerun-build-webhook, the ReRun Build Webhook will re-process any existing build, including a full recalculation of all received coverage report data. Here's how to issue a rerun via the API (just substitute your "Build Number" and Repo Token):
curl --location 'http://coveralls.io/rerun_build?build_num=<CI_build_number_OR_commit_sha>&repo_token=<your_repo_token>' \
  --header 'Content-Type: application/json'
Note that `build_num` can be either the Build Number assigned to your build by your CI service, or the build's commit SHA per GitHub.
2) Reach out to us at support@coveralls.io - We'll be happy to re-run your build for you, or investigate further if rerunning the build doesn't resolve your issue.
Thanks to all customers for your patience with us last week.

Report: "500 Errors"

Last update
postmortem

**Reason for the incident**: We failed to upgrade to a new RDS CA and update complementary SSL certs on all clients by the expiration date for the previous CA. We misunderstood the potential impact of not making this change in time and planned to make the changes as housekeeping with normal to low priority. In doing so, we failed to prioritize the ticket and make the changes necessary in time to avoid this incident. **Reason for the response time**: While trying to implement the fix, we confronted very verbose documentation that made it hard to understand how to apply the fix in our context, especially while under pressure. When we did identify the correct procedure for our context, and implemented a fix, for some reason we could not get our database clients to establish a connection in production with a freshly downloaded cert that worked in tests from local machines. In the end, we manually copied the contents of the cert into an existing file before our app recognized it. We still don’t know why, but the confusion surrounding this added at least an extra hour to our response time as we cycled through other applicable certs and recovered from failed deployments. **How to avoid the incident in the future**: We will consider all notices from infrastructure providers as requiring review by multiple stakeholders at different levels, and will apply an already established procedure for handling priority infrastructure upgrades in a timely manner, as scheduled events, with review and sign-off.

resolved

This incident has been resolved.

monitoring

We are resolving a remaining issue.

monitoring

We will come out of maintenance mode as soon as we confirm the fix.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We have received reports of 500 errors received as responses from the Coveralls API upon coverage report uploads. We are investigating.

Report: "Multiple PR Comments Issue RESOLVED"

Last update
resolved

The ISSUE of Coveralls leaving MULTIPLE PR COMMENTS on PRs was RESOLVED this weekend. Details:
- Going forward, Coveralls should always update the last comment with its latest comment.
- Please let us know at support@coveralls.io if you experience multiple PR comments on any PRs that were created after SUN JUL 7 at ~7PM PDT.
- PRs created before that time may have received more than one comment.

Report: "Known Issue - Maintenance Window Required"

Last update
resolved

We have completed a key step in the resolution of the issue described here.

identified

We are working to resolve a known issue of Coveralls suddenly sending multiple PR comments to customer PRs, rather than one updated PR comment that replaces older PR comments. This issue has been active for ~2.5 weeks. Our solution involves the need to change a column type in a production database table to accommodate larger integer values. With hundreds of millions of records and constant read/write activity, we will need to make this table change during a maintenance window scheduled for this weekend, Sat, Jun 29 - Sun, Jun 30. We will attempt to perform our changes on Saturday evening US PDT, but will update this post when the exact window is known. We will keep this "incident" open until the associated maintenance window has closed.

Report: "Hanging status updates"

Last update
postmortem

**Postmortem**: We want to share a postmortem on this incident because it took us an unusually long time to identify its root cause and resolve it, and because it affected an unusually large number of users over its course.

**Summary**: The cause of this incident was a failure to allocate sufficient resources to, or put sufficient monitoring in place for, an existing background job queue after assigning a new background job to it. To avoid incidents of this type in the future, we have implemented a pre-deploy process for features entailing new background jobs (something we've done less and less frequently in recent years as our codebase and infrastructure have matured).

**Cause of incident**:
* We deployed an optimization earlier in the week last week (Mon, Apr 1) meant to address Gateway Timeout errors experienced by a small number of customers with _massively_ parallel builds (builds with _hundreds_ of parallel jobs).
* As part of this optimization, we moved a common process, "Job creation," to a new background job and, in a "this is an experiment, let's see how it goes" mindset, assigned it to a readily available (i.e., traffic-free) queue (our _default_ queue), released it to production, and watched it for a day and a half with good results. The change resolved the issue we aimed to fix, and all looked good from the standpoint of error tracking and performance.
* Unfortunately, while we considered traffic in our selection of a queue during initial implementation, we did not consider the need to create a permanent, dedicated queue for the new background job (which also represented a new _class_ of background job), nor did we, after seeing good performance on Mon-Tue, evaluate whether any configuration of our default queue needed to change. That queue turned out to be not only insufficiently resourced, but also insufficiently monitored.
* As a result, later in the week, when we entered our busiest period (Wed-Thu), the new queue backed up. We didn't know it because we didn't have visibility, and, since the new background job (Job creation) precedes a full series of subsequent jobs, it began acting as a gateway mechanism, artificially limiting traffic to downstream queues, which _were_ being monitored and where, of course, everything looked hunky-dory across all of those metrics.
* By the time we realized what was going on, we had 35K jobs stuck in the newly utilized queue.
* At that point, the issue was easy to fix: first by scaling up, then by allocating proper resources to the new queue going forward. But for most of the day we did not understand what was going on, so it caused problems for those hours and, as backed-up jobs accrued, affected a growing number of users as time ticked by.

**Actions taken to avoid future incidents of this type**: Hindsight being 20/20, we clearly could have avoided this incident with a little more process around deploys of certain types of features, in particular features entailing the creation of new background jobs (something we had not done in any significant way for over a year prior). As avoidable as the initial misstep was, its impact was magnified by the way it led us to miss the true underlying issue for most of an 18-hour period, which is not acceptable in a production environment.

**In response to this incident, we have added the following new step to our deployment process**:
* **Prior to deployment, if changes entail the creation of any new background jobs, or modification of any existing background jobs, we must evaluate the need to update our Sidekiq configuration, including the creation of any new workers or worker groups.** (See the sketch below.)

We've been operating [Coveralls.io](http://Coveralls.io) for over 13 years now, but we are, of course, far from perfect in doing so, and, clearly, we still make mistakes. While mistakes are probably unavoidable, our main goal in addressing them is to try not to make the same mistake twice. This was a new one for us (or at least new in recent years for our current team), and it has caused us to shore up our SOPs around deploys in a way that should reduce this type of incident in the future.
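As a concrete, hedged illustration of the dedicated-queue pattern described above: Coveralls runs Sidekiq (Ruby), but the same idea sketched in Python with Celery looks roughly like the following. All module, task, and queue names here are hypothetical.

```python
# Illustration only: route a new class of background job to its own dedicated
# queue (with its own worker pool and queue-depth monitoring) instead of
# piggybacking on a shared `default` queue. Names are hypothetical.
from celery import Celery

app = Celery("jobs_example", broker="redis://localhost:6379/0")

# Send the hypothetical "job creation" task to a dedicated queue, so a backlog
# there shows up on its own metrics and cannot hide behind healthy downstream queues.
app.conf.task_routes = {
    "jobs_example.create_jobs_for_build": {"queue": "job_creation"},
}


@app.task(name="jobs_example.create_jobs_for_build")
def create_jobs_for_build(build_id: int) -> None:
    """Stand-in for the 'Job creation' step that precedes downstream jobs."""
    ...

# A dedicated worker pool for that queue would then be started separately, e.g.:
#   celery -A jobs_example worker -Q job_creation --concurrency=8
```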

resolved

All queues are cleared. As a result, all previously reported delayed builds and/or status updates should now be complete/received. We are not seeing any further backups in any queues, but will continue monitoring into the morning when our usage increases. If you are still experiencing any unfinished builds or delayed status updates, please reach out and let us know at support@coveralls.io.

monitoring

All backed up queues are fully drained. There is now a flurry of activity in some associated queues, which are completing the processing and notifications of previously delayed builds, but those are processing quickly and we expect any and all builds and notifications previously reported as delayed today to be finished and complete in the next 30-45 minutes. Our fix has been fully deployed and we will be monitoring for any further backups.

monitoring

The backed up queue affecting all users has drained by 75%. Our fix is still being deployed across all servers, but should start taking effect in the next 15-20 min.

monitoring

We have scaled up processes on clogged background queues and they are draining. We have also implemented a fix we hope will avoid further backups and are monitoring for effects.

identified

We have identified the root cause of the delayed status updates for some repos (reported today): backups in several queues that process background jobs for aggregate coverage calculations on new builds. Those calculations precede the sending of notifications and are therefore delaying them. However, we have not yet identified a pattern behind these spikes, or behind the slow processing of these queues, since none of our usual performance alerts had been triggered (until recently, when a queue that affects all users triggered an alarm). We are scaling up server processes to clear that backup, but since we are not seeing degraded performance metrics from servers, we are continuing to investigate other causes for the delayed processing.

investigating

Several customers have reported long delays receiving status updates for new builds at GitHub, or status updates that have hung and never arrived. We are investigating the issue. If you are experiencing this issue, please reach out and let us know at support@coveralls.io so we can include your cases in our investigation. Note that there were some incidents receiving API requests at GitHub in the last 24 hrs, per this status update from GitHub: https://www.githubstatus.com/incidents/gqj5jrvzjb5h We are evaluating cases against this timeframe to understand whether they align with the GitHub incident period.

Report: "Some customers with active subscriptions receiving 402 ("repo paused") notice"

Last update
resolved

This incident has been resolved. No repos for any active subscriptions should be returning the `402 repo paused` error from the Coveralls API. If you are experiencing this error, please get in touch with us at support@coveralls.io and we'll sort you out asap. Thanks for your patience, everyone!

monitoring

A fix has been implemented and we are monitoring the results.

identified

If you were affected by this issue, it should be resolved for you in the next 15 minutes. Please check the status of your next CI build. You may wish to re-run your previously failing CI build in order to send your coverage report(s) to Coveralls.

Note on the FAIL ON ERROR option: If you want to ensure that this type of error, or any other, will not break your CI pipelines, set your integration's "fail on error" input option to `false`. How to enable it:
- Coveralls GitHub Action (`fail-on-error`): https://github.com/marketplace/actions/coveralls-github-action#:~:text=fail%2Don%2Derror,to%20any%20errors.
- Coveralls Orb for CircleCI (`fail_on_error`): https://circleci.com/developer/orbs/orb/coveralls/coveralls#:~:text=boolean-,fail_on_error,-Whether%20to%20fail
- Coveralls Universal Coverage Reporter (`--no-fail`): https://github.com/coverallsapp/coverage-reporter#:~:text=For%20more%20options%20see
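If your pipeline invokes a coverage upload command directly and can't use the options linked above, the same effect can be approximated by not letting a failed upload fail the step. A minimal Python sketch of that idea follows; the command shown is a placeholder, not an officially documented invocation.

```python
# Sketch only: run a coverage upload step but do not let its failure break CI,
# mirroring the effect of the fail-on-error / --no-fail options linked above.
# The command below is a placeholder for whatever upload command your pipeline runs.
import subprocess
import sys

result = subprocess.run(["coveralls"], check=False)  # placeholder upload command
if result.returncode != 0:
    # Log and continue instead of failing the pipeline on an upload error.
    print(f"Coverage upload failed (exit {result.returncode}); continuing build.",
          file=sys.stderr)
sys.exit(0)  # always succeed so an upload error cannot fail the pipeline
```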

identified

The issue has been identified and a fix is being implemented.

Report: "Intermittent 502 (Bad Gateway) errors"

Last update
resolved

We've had no further reports of 502 Bad Gateway errors (or related 504 errors) in the past 48 hours, so we will consider this incident closed/resolved for the time being. We are continuing to monitor for these errors, and for reports of them, and will open a new incident if we expect them to affect the general population of users for some time. In the meantime, we suggest using the `fail-on-error: false` workaround to ensure intermittent errors like this don't break your CI builds. If you encounter a 502 Bad Gateway error, please let us know at support@coveralls.io. It would be helpful (though not necessary) to see a segment or screenshot of your CI build log showing the error.

monitoring

We are continuing to monitor for this issue and keeping this incident open so affected customers can find the `fail-on-error: false` workaround described below. We received one additional report late Friday and one over the weekend, which we hope is the tail end of these issues, but we will keep this incident open for another 24-48 hrs to allow for any further reports.

monitoring

In case you are being affected by these intermittent 502 (Bad Gateway) errors, here is a workaround that will keep them from breaking your CI builds. Workaround: If you are using one of our official integrations, such as the Coveralls GitHub Action or the Coveralls Orb for CircleCI, you can use the `fail-on-error: false` or `fail_on_error: false` input options, which will prevent Coveralls errors from breaking your CI builds.
- Official integrations: https://docs.coveralls.io/integrations#official-integrations
- Coveralls GitHub Action (`fail-on-error: false`): https://github.com/marketplace/actions/coveralls-github-action
- Coveralls Orb for CircleCI (`fail_on_error: false`): https://circleci.com/developer/orbs/orb/coveralls/coveralls
Thanks for your patience as we figure this out.

identified

We have new reports of users receiving `502` (Bad Gateway) errors from our edge service provider (Cloudflare). We are aware of rolling maintenance at Cloudflare across its global data centers that is leading to temporary re-routing, which may be adding latency to otherwise normal requests and resulting in `502` errors via our API or `504` errors via our Web UI. We are watching and have reached out to Cloudflare.

Report: "Intermittent 502 (Bad Gateway) errors"

Last update
resolved

After another two (2) hours of monitoring without further incident, we are considering this matter resolved. Again, if you happen to experience a 502 (Bad Gateway) error response from the Coveralls API, please reach out and let us know at: support@coveralls.io.

monitoring

While we have not identified a root cause on our side, we have received no further reports of these `502` errors, and have not seen any additional production errors at the associated endpoint for 2 hrs, so we will move this to "monitoring."

investigating

We have stopped receiving new reports of these `502` errors, and still have not identified an organic cause internally, so we are hoping that this was a temporary issue with a related service. We will keep this open until we feel confident there are no further incidents. If you experience any such errors, please reach out to: support@coveralls.io.

investigating

We are still investigating, but have not found any underlying errors in production logs for the reported requests, pointing to possible latency issues with our edge service provider, Cloudflare. There are no reported outages at Cloudflare, but two US data centers (Salt Lake City and Houston) are in maintenance mode and may be re-routing requests, adding to latency: https://www.cloudflarestatus.com/ > Traffic might be re-routed from this location, hence there is a possibility of a slight increase in latency during this maintenance window for end-users in the affected region.

investigating

Some customers are receiving sporadic `502` Bad Gateway errors from the Coveralls API in response to posts to `/api/v1/jobs`. We are investigating.
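For integrations that post directly to this endpoint, a short retry with backoff is usually enough to ride out an intermittent 502/504. Below is a minimal sketch assuming the multipart `json_file` upload used by the v1 jobs API; treat the payload details as an assumption rather than a definitive client implementation.

```python
# Sketch: retry an upload to the Coveralls jobs API when the edge returns an
# intermittent 502/504. Payload details are an assumption, not a reference client.
import time
import requests


def post_job(json_payload: bytes, retries: int = 3, backoff_seconds: int = 5):
    response = None
    for attempt in range(1, retries + 1):
        response = requests.post(
            "https://coveralls.io/api/v1/jobs",
            files={"json_file": ("coverage.json", json_payload)},
            timeout=60,
        )
        if response.status_code not in (502, 504):
            return response  # success, or a non-transient error: stop retrying
        time.sleep(backoff_seconds * attempt)  # simple linear backoff
    return response
```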

Report: "Reports of slow build times"

Last update
resolved

The single backed-up queue on our troubled server is now clear, so all jobs related to slow builds are no longer pending, but are now in-progress (being processed by Coveralls). Affected builds aside, performance should be back to normal for all users. Again, if you think you were affected by, or are still affected by this incident, please reach out to support@coveralls.io with your build URLs and we'll look into them for you.

monitoring

We are continuing to clear jobs from the backed-up queue and are at about 80%. We'll continue to monitor and update, but we expect most stalled jobs to start processing again in the next 15-min.

investigating

We've received reports of slow build times from some users associated with some repos. We have identified one server stuck in an out-of-memory (OOM) hang that services one queue, so customers whose jobs route to that queue are most likely the ones affected. We will be clearing that queue as quickly as possible. If you are experiencing slow builds, feel free to reach out to support@coveralls.io and we can find out if any of your recent builds are involved.

Report: "Recent GitHub incident may affect Coveralls users"

Last update
resolved

Because of how closely tied Coveralls is to GitHub for our GitHub users, Coveralls users will have experienced some follow-on issues from the GitHub incident yesterday: https://www.githubstatus.com/incidents/zjchv3zvfg50 For instance, we received a few reports yesterday of failed logins with the error message: “Sorry, there was a GitHub server error. Please try again shortly.” In addition to this, it seems that other issues at GitHub related to the same incident may have caused other unusual behavior, including (temporarily) missing source files. If you think you may have been affected by such issues, please check the GitHub status report to help determine if these GitHub issues were the cause: https://www.githubstatus.com/incidents/zjchv3zvfg50 And of course please reach out and we’ll help you determine if that, or something else, is the cause.

Report: "Processing delays for repos with 5K+ source files"

Last update
resolved

This incident has been resolved.

monitoring

Continuing to monitor. Build times coming down for large repos. Fewer than five (5) repos now affected.

monitoring

Build times have come down for large repos. We are continuing to monitor as associated queues clear.

monitoring

We've seen a spike in build times for large repos (5K+ source files). Right now, only nine (9) repos appear to be affected. We have implemented a fix and are monitoring as build times return to normal for these repos. The root cause of this spike is an elevated number of jobs (and therefore longer dequeue times) in queues dedicated to large repos, which appears to have been triggered by two (2) repos sending an unusually high number of uploads over the past 1-3 hours. We have increased resources for these queues to accelerate processing, but may need to pause the above repos if uploads continue at the same level. If you think your repo was affected, you can reach out to us at support@coveralls.io. We will reach out to the owners of the repos with elevated jobs.

Report: "Slow builds"

Last update
resolved

We believe that all builds delayed by this incident have been processed. If you have any builds with issues or that appear to be unfinished, please reach out to us at support@coveralls.io.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We're receiving reports of slow builds due to slow dequeue times for background jobs. We are investigating and will update status here.

Report: "Intermittent 403 errors"

Last update
resolved

We have not received further reports of `403` errors so we will close this for now. If you happen to receive a `403` error please let us know at support@coveralls.io.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have had two reports of intermittent `403` errors this morning. We are investigating and will post updates here. If you are having this issue, please reach out to us at support@coveralls.io so we can add you to the ticket.

Report: "Login issues"

Last update
resolved

A fix has been implemented. We suspect this only affected a few users at once as a matter of coincidence. If this particular issue should ever affect you, here are the steps to resolve the "Infinite reauthorization loop via Github login," for cases where clicking Reauthorize (App) does not log you in:
1. Note the name of the app requesting reauthorization.
2. Go to Github > Settings > Applications > Authorized Apps and revoke authorization of the app.
3. In browser settings (Chrome), go to chrome://settings/cookies and click "See all site data and permissions."
4. Find coveralls.io and click the trash icon to clear all cookies for coveralls.io.
5. Go to coveralls.io and try logging in again.

identified

The issue has been identified and a fix is being implemented.

investigating

Some users report not being able to login to coveralls.io. The behavior is a circular redirection landing on a Reauthorization required screen with error message: > This application has made an unusually high number of requests to access your account. Please reauthorize the application to continue. If you are experiencing this issue, please let us know at support@coveralls.io.

Report: "Issue affecting coverage details for some repos (Fix being implemented)"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Internal testing today revealed an issue with some repos in which coverage calculations and data for particular builds show obviously large, incorrect drops in coverage and/or missing coverage data. We have identified the root cause of the issue and a fix is being implemented. If you are wondering whether you are affected by this issue, know that, once we've resolved it, we will re-run as many affected builds as we can identify. Feel free to check in at support@coveralls.io and we will verify and make sure your builds are recalculated.

Report: "Infrastructure outage"

Last update
resolved

We experienced a partial infrastructure outage overnight that led to failing builds (500 errors from API) for some users for up to four (4) hours. This issue is now fixed. To all those affected, we apologize. Please get in touch if CI builds for your repo are still experiencing any behavior like this.

Report: "Drop in coverage for some parallel pull_request builds"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

After a recent release, we have received reports of `pull_request` builds that have suddenly dropped in coverage with no change in CI setup. Specifically, these builds have had several parallel jobs and now, suddenly, have only one job per build. We are aware of the issue, have identified the root cause, and are working on a fix. We will update this incident when status changes.

Report: "422 Errors for users of coveralls-python with Github Actions"

Last update
resolved

This incident has been resolved.

monitoring

We are getting reports that the issue is resolved for several users using coveralls-python with Github Actions for CI. We'd like to hear from you if this is not the case for your workflows: support@coveralls.io We'll monitor for another 1-2 hrs before we mark this incident resolved.

monitoring

We have implemented a fix for this issue and are monitoring the results. If you are still experiencing this issue, please let us know at support@coveralls.io.

investigating

We are investigating and trying to resolve an incident that seems related to a previous, now-resolved incident, here: https://status.coveralls.io/incidents/pgcdqr7r4lgj We believe that some recently deployed changes, which appeared to affect customers using the Coverage Reporter and Github Action for several hours on Fri, Apr 28, are still affecting customers using the coveralls-python integration with Github Actions. This issue, submitted by user andy-meier, provides both a description and a workaround that solved his issue: https://github.com/lemurheavy/coveralls-public/issues/1710 We are hoping to identify a root cause of the issues he describes, to prevent other users from having to go to the same lengths. We will be posting updates here, and in that issue.

Report: "422 Errors for users of coveralls-python with Github Actions"

Last update
resolved

We are investigating and trying to resolve an incident that seems related to a previous, now-resolved incident, here: https://status.coveralls.io/incidents/pgcdqr7r4lgj We believe that some recently deployed changes, which appeared to affect customers using the Coverage Reporter and Github Action for several hours on Fri, Apr 28, are still affecting customers using the coveralls-python integration with Github Actions. This issue, submitted by user andy-meier, provides both a description and a workaround that solved his issue: https://github.com/lemurheavy/coveralls-public/issues/1710 We are hoping to identify a root cause of the issues he describes, to prevent other users from having to go to the same lengths. We will be posting updates here, and in that issue.

Report: "Server errors (Resolved)"

Last update
resolved

We received reports of customer CI jobs failing with Internal Server Error messages. Examples: https://github.com/lemurheavy/coveralls-public/issues/1708 We believe these errors were intermittent between 9:30a-12:30p US PST. If you are still seeing errors like this, please let us know at support@coveralls.io.

Report: "Degraded performance for some repos"

Last update
resolved

This incident has been resolved.

monitoring

We are experiencing elevated traffic from a couple of sources that has been enough to affect some other repos in queue. While we address this, performance may be slightly degraded for projects with a greater-than-average number of files or parallel jobs. This may range from slow page loads to gateway timeouts. The situation is temporary and should be resolved in the next 30 minutes to 1 hour. Let us know if that's not the case for you: support@coveralls.io.

Report: "Auth issue with two symptoms"

Last update
resolved

We became aware of an authorization-related issue on Fri, now resolved, that had been in effect for at least the previous 48 hrs, affected some customers, and surfaced as one or both of two symptoms: 1. Failed PR Comments - PR Comments that were previously working fine stopped working; OR 2. 500 Internal Server errors - POSTs to the /jobs endpoint resulted in intermittent 500 server errors. If you were affected by either of these issues over the last week, please retry your workflow, as the underlying issue has been resolved. If you are currently experiencing either of these issues, please reach out to us at support@coveralls.io, as your issue probably has a different root cause and we'd like to address it for you asap.

Report: "504 and 503 errors"

Last update
resolved

We have continued to monitor this issue, but have not seen additional 504 or 503 errors for almost 3 hours now, so we are closing this incident. Please let us know at support@coveralls.io if you continue to receive 504 Timeout errors, or 503 Heavy Load errors from the Coveralls.io Web app or from the Coveralls.io API.

identified

We are continuing to experience 504 Timeouts and 503 Heavy Load errors originating at our load balancers. We believe it is a continuation of recent issues overnight, described here: https://status.coveralls.io/incidents/0xllsk9v8tpx We are aware of the issue and are working on it.

Report: "Overnight issue with SOURCE FILES table"

Last update
postmortem

The root cause of the backed-up Web and API requests was slow reads against the source_files table in our database (our largest table), themselves caused by a long-running database maintenance task. While that task ("repacking" the source_files table) had been planned for, and started, over the previous weekend, it unexpectedly proceeded well into the week. After seeing normal site behavior Mon and Tue, we decided to let the procedure continue because of its importance to general database performance, but we believe that when we hit our weekly usage peak (Wed-Thu), even with the maintenance task nearly complete, the database was overwhelmed with read requests against the table while the maintenance activity held transaction locks against the relevant rows. In addition to the temporary action of restricting reads from our Web app, we have also curtailed all maintenance activity against the table until we can guarantee tasks will complete within normal maintenance windows (late evenings and weekends PDT). We've also identified a longer-term solution involving a different approach to partitioning tables; it will take 1-2 weeks and is planned for later this month.

resolved

We had an issue between late afternoon yesterday, WED, DEC 7 PDT, and this morning, THU, DEC 8 PDT, affecting all builds. For some time, both the Web app and the Coveralls API returned 504 Timeouts or 503 "This website is under heavy load" errors, which originated from our load balancers. The root cause of the backed-up requests was slow reads against the source_files table in our database, themselves caused by a long-running database maintenance task. To resolve this (by reducing the number of reads against that table so we could continue processing incoming Web requests and clear background jobs relying on such reads), we made the SOURCE FILES table unavailable (on all build and job pages) overnight with the message "WE'RE SORRY, THIS FUNCTIONALITY IS TEMPORARILY UNAVAILABLE." The 504 Timeouts and 503 Heavy Load errors were mostly resolved by 6p PDT, and backed-up background jobs completed early this morning. The restriction on the SOURCE FILES table was lifted around 7a PDT. We apologize for the inconvenience to any users who tried to access SOURCE FILES overnight. UPDATE: We've received reports that the Coveralls API was also affected during the early period of this incident. In parallel with Web requests, many API requests were rejected with the 503 Heavy Load error.

Report: "Slow build times for small set of repos"

Last update
resolved

This incident has been resolved.

monitoring

We are keeping this issue open and will continue to monitor as we un-pause paused repos. Barring further issues, we expect to resume normal operation by 6pm.

monitoring

We have paused several repos from the group mentioned above as we determine the nature of their impact on other system users. We will unpause all repos by EOD today. If you think you may have been paused and want to confirm, email us at support@coveralls.io.

monitoring

We experienced some general performance issues last night that continued into early morning, but, as of this morning, they should only be affecting a small group of repos. The affected group should be limited to ~20-25 repos, and the defining characteristics of these repos are that they have:
- More than 100 jobs in one day, and/or
- More than 40 jobs in one hour
Upon hitting those thresholds, these repos are routed to a job-processing queue that has lower precedence than the queues that process jobs for the general population of repos (see the sketch below). If you think you may be affected by this issue because your repo may be exceeding these thresholds, feel free to contact us at support@coveralls.io and we will confirm or advise. If you believe you are experiencing slow performance (slow build times) but don't think your repo is exceeding the above thresholds, please also contact us at support@coveralls.io; we can verify that and see if there's another issue at play for your repo.
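As a rough, hedged illustration of the routing rule described in this update (the function and queue names below are ours for the example, not Coveralls' actual code):

```python
# Illustration only: route repos that exceed the stated thresholds to a
# lower-precedence queue. Names here are hypothetical.
def choose_queue(jobs_last_hour: int, jobs_last_day: int) -> str:
    if jobs_last_day > 100 or jobs_last_hour > 40:
        return "high_volume"  # lower precedence than the general queue
    return "general"

# e.g. choose_queue(jobs_last_hour=45, jobs_last_day=60) -> "high_volume"
```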

Report: "Degraded performance. Scaling infrastructure."

Last update
resolved

This incident has been resolved.

monitoring

All background queues are nearly cleared. We are continuing to monitor for further issues.

monitoring

We are experiencing degraded performance on some of our production servers. We have scaled infrastructure to address the issue and bring down background job queues. Build times should be back to normal within the hour.

Report: "Missing builds for some repos"

Last update
resolved

This incident has been resolved and has been monitored successfully for several hours. If you think some of your builds were affected by this issue, send us your Coveralls Repo URL to support@coveralls.io and we'll investigate whether we can re-process the builds for you, or if you will need to re-send them. If you absolutely need a missing build in your build history, the solution is to go to CI and re-run the build in question. That should re-send coverage data to coveralls and place it in your history in a chronologically correct manner. If this isn't happening for you, let us know at support@coveralls.io. Apologies for the inconvenience around this issue. We appreciate your patience.

monitoring

We have deployed a fix that we expect to resolve this issue for all future builds. We are still reaching out to repo owners who let us know they may have been affected. We hope to re-enqueue all missing builds and have them appear by EOD today (Fri, Sep 16). Again, if you think you were affected, send us your Coveralls Repo URL to support@coveralls.io and we'll make sure your builds get re-processed.

identified

We believe we have identified the root cause of this issue and a fix is being implemented that we'll deploy in the next ~1-hour. At this time, we have identified five (5) affected repos. Missing jobs for those repos will be re-enqueued as soon as the fix is deployed. We will notify each repo owner once this is done. Again, if you think you may have been affected, send us your Coveralls Repo URL to support@coveralls.io, and we'll ensure your missing builds are re-processed.

investigating

We've had reports today that some repos have had builds go missing. The original symptom for these users was that they stopped receiving PR Comments, or Status Updates, at Github; but, looking further into why, we discovered that most of their coverage reports since Sep 13 started builds that never completed. Our engineers are investigating. If you think you're affected, please send the Coveralls Repo URL(s) for your affected repo(s) to support@coveralls.io.

Report: "Server outage last night may have affected build times for <1% repos."

Last update
resolved

We had a couple of servers lose comms last night for unknown reasons. Between 4pm and 6pm, two servers hit over 90% CPU utilization until they lost comms. This means the jobs on those servers would have failed due to timeouts and been retried. This morning there were about 300 jobs still in queue, which represents a fraction of a percent of the jobs processed yesterday. We think this translates to less than 1% of repos affected, by which we mean repos with longer-than-average build times. If you think you were affected, let us know at support@coveralls.io and we can verify. As of 8a this morning, all components are operating as normal, so no further effects are expected.

Report: "Migrated servers this weekend - New static IP addresses"

Last update
resolved

Closing this as it was intended to stay open for the week of Mon, Aug 1-Fri, Aug 5.

monitoring

Keeping this incident open for another 24-48 hours in case users experience any of the issues below. If you're having any issues, please see the potential remedies below.

monitoring

We migrated the servers underlying our app again this weekend out of necessity. All systems are fully operational, but for those of you who need to whitelist Coveralls IP addresses, the list is below. Our DNS service is set up to make this change invisible to you, but for those of you who may be receiving 400 Bad Request errors back from Web or API requests, a guide to clearing the various caches involved in your chain of requests is also given below. Finally, if you're experiencing these, or any other issues, we'd appreciate a heads up at support@coveralls.io.

New Coveralls IP addresses:
- 54.89.228.101/32
- 3.93.17.184/32
- 3.95.136.35/32
- 54.89.23.76/32
- 54.172.125.86/32
- 3.82.218.125/32
- 54.86.25.163/32
- 3.88.194.77/32
- 54.86.81.7/32
- 18.234.163.214/32
- 44.203.186.142/32

Guide to clearing cache in your chain of requests: If you are receiving 400 Bad Request errors, we think the root cause is caching somewhere in the chain of requests. Please try clearing your local caches with the help of these resources:
1. How to clear browser cache in:
   - Chrome: https://support.google.com/accounts/answer/32050
   - Other Browsers: https://its.uiowa.edu/support/article/719
2. How to clear machine cache on:
   - Windows: https://www.c-sharpcorner.com/article/how-to-clear-cache-in-windows--10/
   - MacOS: https://mackeeper.com/blog/how-to-clear-cache-mac/
3. How to clear cache at CI services and on CI runners:
   - Github Actions: https://github.community/t/how-to-clear-cache-in-github-actions/129038/8
   - actions/cache: https://github.com/actions/cache
   - Listing and deleting caches in your Actions workflows: https://github.blog/changelog/2022-06-27-list-and-delete-caches-in-your-actions-workflows/
   - Github Actions Cache API: https://docs.github.com/en/rest/actions/cache
   - Travis CI: https://docs.travis-ci.com/user/caching/#clearing-caches
   - CircleCI: https://circleci.com/docs/caching
   - Gitlab CI/CD: https://docs.gitlab.com/ee/ci/caching/

Report: "More reports of Gateway Timeouts"

Last update
resolved

All servers have been replaced with upgraded instances. Therefore we are considering this incident resolved. If you continue to receive a Gateway Timeout, please retry your request, or clear cache and retry, as caching may be playing a role.

monitoring

We had a report of Gateway Timeouts and observed one server that had gone offline. We proactively removed the server and replaced all current web servers with more robust instances to avoid further issues. Therefore, the issue should be resolved, but if you continue experiencing Gateway Timeouts after 10-15-min, please let us know so we can re-open and investigate further.

Report: "Reports of Gateway Timeouts"

Last update
resolved

No further reports of timeouts. New server has been in place for over 1-hr without issue. Closing the incident.

monitoring

We have replaced the problem server and are monitoring. Please remember to clear your browser cache if you get a timeout error accessing the web app. After a refresh it should stop happening.

identified

We are replacing the problem server and should be back to full service in 10-15 min.

investigating

We've received some reports of gateway timeouts while trying to access the web application. We see that one web server in our fleet was down for some time. If possible, clear your browser cache and try again. Let us know if the behavior continues at support@coveralls.io.

Report: "Intermittent outages"

Last update
resolved

We are closing this incident after monitoring and observing no further outages throughout the rest of the day.

monitoring

We think we have resolved this issue, but we will continue to monitor for the next hour before closing this incident.

monitoring

We are modifying configs and deploying new servers. Please allow 15-20 minutes for new server fleet to deploy.

monitoring

We are still seeing intermittent outages without a known root cause. We are continuing our investigation.

monitoring

We have deployed new servers and continue to investigate the cause of intermittent outages with the previous servers. Website and API are fully operational at this time.

identified

We are deploying new servers and decommissioning problematic ones. Current outages should be resolved within 5-10 minutes.

investigating

We are receiving reports of intermittent outages of both the web app and the API. We are investigating.

Report: "Degraded web performance & slower than normal builds"

Last update
resolved

Resolving this incident as we enter our maintenance window from 10p US PST to 2a US PST.

monitoring

We have a performance issue rooted in our database which will require several full maintenance activities, with downtime, to resolve. We attempted this last night without luck due to the time needed. We have scaled infrastructure to its maximum useful size to manage the current issue, but we will need to complete the full maintenance to return performance to normal. That will require an additional planned maintenance window tonight from 10p-2a US PST. This incident will remain open throughout today as we monitor the situation until the next maintenance window.