SpeedCurve

Is SpeedCurve Down Right Now? Check if there is an ongoing outage.

SpeedCurve is currently Operational

Last checked from SpeedCurve's official status page

Historical record of incidents for SpeedCurve

Report: "RUM Data Outage"

Last update
resolved

Fastly have fixed the streaming log service and we're now seeing normal log volumes and RUM page views coming through. We think it's unlikely that Fastly will be able to recover the missing logs, so there will be about 2h 30m of missing RUM page views.

identified

We use Fastly to collect RUM beacons and write logs for ingestion into SpeedCurve RUM. However, Fastly is currently having issues with its streaming log service, and we've seen a major drop in logs being sent to us for ingestion into SpeedCurve RUM. You will see a large drop in RUM page views in your dashboards from 12th Sept 03:10am UTC. You can follow the Fastly incident here: https://www.fastlystatus.com/incident/376914
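Purely as an illustration (this is not SpeedCurve's actual pipeline or monitoring; the function name, threshold, and numbers below are invented), a drop in streaming-log volume like the one described above could be flagged on the ingestion side roughly like this:

```typescript
// Hypothetical sketch: compare the log volume received in the latest interval
// against a baseline (e.g. a trailing average) and flag a large drop, which is
// how an upstream CDN logging outage would show up as missing RUM page views.
function logVolumeLooksHealthy(
  logsLastInterval: number,    // log lines received in the most recent interval
  baselinePerInterval: number  // expected volume for that interval
): boolean {
  const DROP_THRESHOLD = 0.5; // alert if volume falls below 50% of baseline
  return logsLastInterval >= baselinePerInterval * DROP_THRESHOLD;
}

// Example: 12,000 lines against a 100,000-line baseline would trigger an alert.
if (!logVolumeLooksHealthy(12_000, 100_000)) {
  console.warn("RUM log volume is well below baseline -- possible upstream (CDN) logging outage");
}
```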

Report: "RUM performance issues"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Slow performance of RUM queries and slow RUM data ingestion over the past 12 hours. Gaps in RUM data are possible in some charts.

Report: "RUM performance issues"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Still experiencing slow performance of RUM queries.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Some notes with specified site_id posted via API were not saved"

Last update
resolved

Some notes posted via the 'Add a note' API endpoint (POST https://api.speedcurve.com/v1/notes) with the site_id parameter were not saved, or were saved without the link to the specified site. The issue was introduced after a release on December 4 (UTC) and is now fully resolved.
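For reference, a minimal sketch of posting a note to that endpoint. Only the URL and the site_id parameter come from the update above; the auth scheme, the note field name, and the content type are assumptions and should be checked against the SpeedCurve API documentation.

```typescript
// Hypothetical sketch of calling POST /v1/notes with a site_id.
const API_KEY = process.env.SPEEDCURVE_API_KEY ?? "";

async function addNote(siteId: number, note: string): Promise<void> {
  const body = new URLSearchParams({
    site_id: String(siteId), // links the note to a specific site (per the incident above)
    note,                    // assumed field name for the note text -- verify in the API docs
  });

  const res = await fetch("https://api.speedcurve.com/v1/notes", {
    method: "POST",
    headers: {
      // Assumed auth scheme: HTTP Basic with the API key as the username.
      Authorization: "Basic " + Buffer.from(`${API_KEY}:x`).toString("base64"),
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body,
  });

  if (!res.ok) {
    throw new Error(`Note was not saved: HTTP ${res.status}`);
  }
}
```

Checking afterwards that the note actually appears against the intended site in the dashboard is a reasonable safeguard against regressions like the one described in this incident.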

Report: "RUM Data Processing Delays - Reporting Tools Affected"

Last update
resolved

The issue has been resolved and all RUM data is now available.

monitoring

RUM cluster is back to normal and we are monitoring its performance.

identified

Our data processing infrastructure is running behind, which is causing inaccuracies in the reporting tools. No data has been lost and the system should be caught up shortly.

Report: "RUM Data Processing Delays"

Last update
resolved

The RUM pipeline has returned to normal performance. All data has been uploaded and is available in charts.

investigating

Processing of RUM page views is currently delayed. Data is being collected, but data ingestion into our data store is currently slow.

Report: "Chrome Canary tests failing"

Last update
resolved

This incident has been resolved.

identified

Chrome Canary version 111.0 has caused an issue that is preventing our test agents from running any tests in this browser. We are in the process of reverting this browser to version 110.0 and will not be automatically updating it until the issue is resolved.

Report: "Synthetic dashboard degraded performance and timeouts"

Last update
resolved

This incident has been resolved.

identified

Some synthetic dashboards are experiencing slow performance and in some cases are not loading at all due to timeouts. We are working on a fix, but in the meantime we suggest viewing a smaller date range in your dashboard to prevent timeouts.

Report: "Synthetic Firefox tests failing"

Last update
resolved

An automated software update at approximately 14:00 on 13 October 2022 (UTC) caused Firefox tests to begin failing. This was initially fixed at 08:00 on 19 October 2022 (UTC), but a regression caused the issue to reappear. As of 21:00 on 23 October 2022 (UTC), the issue is considered resolved.

Report: "RUM dashboards delayed"

Last update
resolved

RUM page view processing is back to normal now. The query cache will be refreshed shortly, and charts will return to normal.

investigating

We are continuing to investigate this issue.

investigating

Processing of RUM page views is currently delayed. Data is being collected, but data ingestion into our data store is delayed. The team are investigating.

Report: "Email sending delayed"

Last update
resolved

SpeedCurve emails have been delayed over the last day. We discovered a backlog of queued emails that have now all been sent.

Report: "Issues with synthetic scheduled testing"

Last update
resolved

It turns out we were hitting maximum network limits in AWS for some of our services. We've transitioned to new instances with higher network limits, which should resolve any issues.

investigating

We're seeing stability issues with our synthetic testing, which is causing delays to scheduled tests. The team are investigating.

Report: "Synthetic testing paused for maintenance"

Last update
resolved

Synthetic testing was paused for just under an hour while we moved servers.

investigating

Synthetic testing and deploys are paused for maintenance. Any scheduled tests will be added when the service resumes. Estimated time is 1 hour.

Report: "Synthetic testing paused for disk replacement"

Last update
resolved

The disk has been replaced and synthetic scheduled tests are now running normally. Any scheduled tests skipped in the last few hours will now be run.

monitoring

We've had an SSD disk failure on our WPT server. The server will be down for an hour while it's replaced. You can continue to view completed tests, and any scheduled tests will be completed once the server is back.

Report: "Global CDN Disruption"

Last update
resolved

Fastly has resolved its global CDN issues. We had an approximately 75% drop in LUX page views during this incident.

monitoring

We are continuing to monitor for any further issues.

monitoring

Fastly, which we use for CDN services, is having global issues at the moment, and that is having a knock-on effect on SpeedCurve. You can follow the Fastly incident here: https://status.fastly.com/incidents/vpk0ssybt3bj

Report: "LUX Export Endpoint Errors"

Last update
resolved

We are no longer seeing errors on the LUX export endpoint.

monitoring

We have identified a release that appears to have caused the increased error rates. We have rolled back this release and are continuing to monitor the LUX export endpoint.

investigating

The LUX export endpoint (/v1/lux/export) is currently experiencing a high error rate. We are looking into the issue now.
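Since the endpoint path is given above, here is a hedged sketch of calling it with retries so that a transient spike in server errors, like the one in this incident, degrades gracefully on the client side. The query parameters and auth scheme are assumptions; consult the SpeedCurve API docs for the real ones.

```typescript
// Hypothetical sketch: GET /v1/lux/export with exponential backoff on 5xx errors.
const API_KEY = process.env.SPEEDCURVE_API_KEY ?? "";

async function fetchLuxExport(params: Record<string, string>, maxRetries = 4): Promise<unknown> {
  const url = "https://api.speedcurve.com/v1/lux/export?" + new URLSearchParams(params);

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url, {
      headers: {
        // Assumed auth scheme: HTTP Basic with the API key as the username.
        Authorization: "Basic " + Buffer.from(`${API_KEY}:x`).toString("base64"),
      },
    });

    if (res.ok) return res.json();

    // Retry only on server-side errors, backing off exponentially between attempts.
    if (res.status >= 500 && attempt < maxRetries) {
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000));
      continue;
    }
    throw new Error(`LUX export failed: HTTP ${res.status}`);
  }
  throw new Error("LUX export failed after retries");
}
```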

Report: "Degraded dashboard performance"

Last update
resolved

This incident has been resolved.

identified

We've identified the cause of the issue. We introduced a bit of latency with our last infrastructure change that was not anticipated as we continue to migrate to a new platform. Our solution is to push forward with moving the frontend to our new platform. The team is working aggressively to complete this migration. We are moving with urgency, while continuing to be cautious. We are completing our testing and tentatively plan to migrate the frontend early next week. Getting the production environment stable for our entire customer base is our highest (and only) priority at the moment. We will continue to provide updates to our status page as we have them.

identified

The SpeedCurve API is fully operational again. We are continuing to investigate degraded performance across Synthetic and LUX dashboards.

identified

We have identified an issue which may be impacting performance for some users. We are in the process of implementing a solution.

Report: "Scheduled synthetic tests currently paused"

Last update
resolved

This incident has been resolved.

monitoring

Our WebPageTest server is back online and normal scheduled tests and deploys are running again. Over the next few hours we will run any missed scheduled tests from earlier today.

identified

The RAID array on our main WebPageTest server has failed and is currently being replaced by the team at LiquidWeb. We expect the server to be back up in an hour or so. Deploys and scheduled tests will resume once the server is back online.

investigating

We are investigating issues with our WebPageTest server and have paused scheduled synthetic tests while we determine the cause. Deploys may run slowly but should still continue.

Report: "Unexpected changes in metrics"

Last update
resolved

This incident has been resolved.

monitoring

After discussing the issue with Amazon EC2 engineers, we have concluded that these changes are permanent.

investigating

You might have noticed some unexpected changes in your metrics around 9-10 December. We have not made any changes to our service. We believe the issue is due to changes in Amazon EC2. At this stage we believe the changes are permanent, however we're still in the process of investigating. If you have any questions, send them to us at support@speedcurve.com.

Report: "On demand test failures"

Last update
resolved

This incident has been resolved.

investigating

We are continuing to investigate this issue.

investigating

Tests that are run via the SpeedCurve API or 'Test Now' feature are not being triggered. Scheduled tests are working as expected. This issue is currently under investigation.

Report: "Synthetic tests delayed"

Last update
resolved

Synthetic tests are all running on schedule again.

investigating

Scheduled Synthetic tests are running up to an hour late at the moment. We're investigating and will have them back on time shortly.

Report: "Chrome 80 update affecting CPU metrics"

Last update
resolved

This incident has been resolved.

monitoring

We have identified a fix for Chrome 80 and will test it over the next few days. In the meantime our test agents will continue to run Chrome 78.

monitoring

Due to a change in Chrome 80, our test agents have been unable to accurately measure some CPU metrics including First CPU Idle and Time To Interactive. As a result, data collected over the last 48 hours may have lower FCI and TTI values than expected. We have temporarily reverted our test agents to Chrome 78 while we work on a solution.

Report: "Degraded performance for some pages"

Last update
postmortem

On 5 September 2019, we became aware of increased CPU usage on our test agents across all SpeedCurve regions. Unfortunately, the increased CPU usage affected metrics for almost all of the tests that were run between 3-11 September 2019. CPU-based metrics like TTI & scripting time were the most heavily affected, but in many cases time-based metrics like start render & speedindex were also affected.

![](https://img.speedcurve.com/blog/2019-09-04-postmortem-timeline.png?max-w=1000)

We know that dramatic changes in metrics like this can be frustrating, especially when you aren't sure what caused the change. We now know that the root cause of this incident was an update to the Linux kernel on the servers that run our test agents. The unusually long duration of the incident was due to a combination of insufficient monitoring, a complex tech stack, and a slow debugging feedback loop.

# What happened

Let's dive straight into a timeline of events. All times are in UTC.

#### 2 Sep 21:00

An update to the Linux kernel was installed on our test agents. All of our test agents run Ubuntu 18.04 LTS and are configured to run a software update when they first boot, as well as every 24 hours. For this reason, there would have been a mixture of "good" and "bad" test agents for several hours after this point.

#### 3 Sep 13:06

Our internal monitoring alerted us to an increase in CPU metrics. At this point there was still a mixture of "good" and "bad" test agents, so the data that triggered the alert appeared to be caused by some anomalies rather than a genuine issue. For this reason, the alert was ignored on the assumption that a genuine issue would trigger subsequent alerts.

![](https://img.speedcurve.com/blog/2019-09-04-postmortem-first-alert.png?max-w=1000)

#### 4 Sep 02:00

Our internal monitoring alerted us to another increase in CPU metrics, this time for a third party script (Google Analytics). This prompted a short investigation, but it was believed that the alert was caused by a change in Google Analytics rather than an issue with the test agents.

![](https://img.speedcurve.com/blog/2019-09-04-postmortem-second-alert.png?max-w=1000)

#### 5 Sep 02:28

Our internal monitoring alerted us again to an increase in CPU metrics. This time the alert was seen and taken more seriously, because it appeared to be more widespread than a single third party. Investigation into the issue began in earnest at this point.

#### 5 Sep 03:35

We received the first report from a SpeedCurve user about degraded performance.

#### 5 Sep 03:38

More members of the SpeedCurve team joined the discussion to speculate about possible causes. The tech stack for our test agents has several layers:

* The SpeedCurve application, which orchestrates the testing
* WebPageTest, which farms testing jobs out to individual test agents
* The test agent software, which controls the web browsers and extracts performance data
* The web browsers
* Linux
* Amazon EC2

Our goal at this point was to rule out as many layers as possible so that we could focus the investigation.

#### 5 Sep 07:16

More internal monitoring alerted us to the fact that this issue was much more widespread than we initially thought. We began to speculate that there could be an EC2 issue, but this was ruled out as the issue appeared to be spread across multiple regions.

#### 6 Sep 00:56

By this point we had ruled out all layers except for Linux and EC2. We believed the most likely cause was a software package upgrade, and began a binary search to identify the package.

#### 6 Sep 05:00

No further investigation was performed over the weekend.

#### 8 Sep 20:28

After some false positives identifying the software package, we switched to a much more thorough debugging method. This involved upgrading software packages one by one, rebooting the server, and creating AMI snapshots at every step of the way.

#### 9 Sep 21:20

After a flood of support tickets from SpeedCurve users, we agreed that this issue was widespread enough to justify creating an incident on our status page.

#### 10 Sep 03:06

We identified an update to the Linux kernel as the root cause. This was unexpected, and started some heavy discussions around whether automated software updates were appropriate for our test agents.

#### 10 Sep 20:30

The SpeedCurve team agreed to roll back the Linux kernel to a known-good version and disable automatic software updates.

#### 11 Sep 05:52

We began preparing patched test agent images for all of our test regions.

#### 11 Sep 09:35

All regions except for London had been switched to the patched test agents. The London region seemed to be experiencing issues and we were unable to copy images to it.

#### 11 Sep 10:25

The London region was switched to the patched test agents. The incident was marked as resolved on our status page.

# What didn't go well

This was SpeedCurve's most widespread and longest-running incident. There are many reasons for this, but the biggest are as follows:

1. While we have full control over changes to the SpeedCurve application and WebPageTest, there are several layers of our tech stack that we have less control over. Even though we exclusively use stable and LTS (long-term support) software update channels, we are still at the mercy of software vendors to ensure no breaking changes are introduced. Clearly the use of stable and LTS channels is not enough to prevent issues like this from occurring.
2. Our internal monitoring produced some unexpected results, and we ignored the first two alerts. For this reason, it took us around 48 hours to realise the severity of this incident.
3. We are familiar with breaking changes being introduced in web browsers, but this was the first incident where we had to dig all the way down to the operating system level. Our existing debugging processes were not sufficient to deal with this incident, and it took much longer than anticipated to find the root cause. We also had no way to revert to a known-good OS configuration, since our existing rollback scenarios only accounted for issues higher up the stack.

# How we intend to prevent this from happening again

The major change we're making after this incident is switching from automated software updates to periodic, curated updates. This has a few benefits for us (and for our users):

1. We can perform updates on our own test agents before rolling them out to all of our regions. This allows us to check for potential issues in a timely manner, and also gives us the opportunity to report bugs to software vendors before they impact SpeedCurve test results.
2. We can take a snapshot after each update has been approved and rolled out. Since test agents are essentially frozen after each update, we have a reliable history of agent images that we can roll back to in the case of an incident like this.
3. In the case that an update will have a noticeable impact on SpeedCurve test results, we can give our users plenty of notice.

On top of this, we will also continue to improve our internal monitoring.

# Conclusion

This was a frustrating incident for SpeedCurve users and for the SpeedCurve team. We're really sorry for the inconvenience that it caused. On the bright side, we learned a lot and we're looking forward to improving our processes so that incidents like this don't happen again. Thanks so much for helping us to improve SpeedCurve!
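The timeline above mentions a binary search over the installed package updates to find the one that introduced the regression. As an illustration only (the function names and the manual `showsCpuRegression` check are hypothetical, not SpeedCurve's actual tooling), a bisection over an ordered list of updates could look like this:

```typescript
// Illustrative bisection: `updates` is the ordered list of package updates applied to
// an agent, and `showsCpuRegression(count)` stands in for testing an agent image that
// has only the first `count` updates applied.
async function findFirstBadUpdate(
  updates: string[],
  showsCpuRegression: (count: number) => Promise<boolean>
): Promise<string | null> {
  if (updates.length === 0) return null;

  let lo = 0;              // applying the first `lo` updates is known to be good
  let hi = updates.length; // applying the first `hi` updates is known (assumed) to be bad

  while (hi - lo > 1) {
    const mid = Math.floor((lo + hi) / 2);
    if (await showsCpuRegression(mid)) {
      hi = mid; // regression already present with `mid` updates applied
    } else {
      lo = mid; // still healthy with `mid` updates applied
    }
  }
  return updates[hi - 1]; // the update that introduced the regression
}
```

As the timeline notes, noisy CPU measurements produced false positives, which breaks the clean "good then bad" assumption bisection relies on; that is why the team fell back to applying updates one by one and snapshotting an AMI at each step.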

resolved

All test regions have been patched and performance should return to expected levels. We will follow up with a full write-up of this issue soon.

monitoring

A fix has been rolled out to most testing regions and we are continuing to monitor the situation.

identified

We have identified that a change in the Linux kernel is responsible for noticeable CPU overhead on our servers. We're working to resolve this as soon as possible.

investigating

We've noticed that some pages are experiencing degraded performance metrics since 3 September 2019. We are actively investigating the cause of this issue.

Report: "First Meaningful Paint Issues"

Last update
resolved

Resolving this for now. It's now a known issue that first meaningful paint is not always correct in the Chrome trace files. First meaningful paint is still considered under development, with no "standardized definition" by the Google and Chrome teams.

monitoring

We are continuing to monitor for any further issues.

monitoring

The FMP reported via chromeUserTiming has had some funky changes in Chrome 75 and we've been working with the WebPageTest team to find out what's going on and improve the parsing of Chrome user timing events. It looks like Chrome has started reporting FMP for multiple frames, not just the main frame, and this has thrown the metric out on some pages. We've pushed an update to use the first FMP event we find rather than the last FMP event on the page, which should fix the issue with FMP events that appear late in the page load. We're also seeing other tests with no FMP reported or an FMP of 0, which is nonsensical. FMP is still regarded as a work-in-progress by the Chrome team, so we're not sure what their appetite is for fixing issues like this in the Chrome trace.

We strongly recommend using hero rendering times over FMP as we find them a much better representation of the "meaningful" content a user is actually seeing as the page renders. For example, FMP doesn't currently take into account when any images render. Google acknowledges that FMP doesn't have a "standardized definition" yet and recommends using user timing marks on hero elements instead.
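A minimal sketch of that recommendation, assuming a page with a single hero image; the element selector and mark name below are examples only.

```typescript
// Record a User Timing mark when the hero image has loaded, instead of relying on FMP.
// Tools that read the performance timeline (RUM and synthetic) can then report this mark.
const heroImage = document.querySelector<HTMLImageElement>("#hero-image");

heroImage?.addEventListener("load", () => {
  // "load" is a reasonable proxy for the hero image being available to render;
  // the mark name is arbitrary and just needs to be consistent across pages.
  performance.mark("hero-image-loaded");
});
```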

identified

We've identified an issue in Chrome 75 where First Meaningful Paint is not being reported for some URLs.

Report: "Firefox not running tests"

Last update
resolved

The WebPageTest team has now updated the test agent codebase and Firefox is working again.

investigating

There are currently issues with synthetic testing using the Firefox browser. A change in Firefox is causing WebPageTest to error when trying to start a test. We're working with the WebPageTest team to identify and resolve the issue.

Report: "This is an example incident"

Last update
resolved

Empathize with those affected and let them know everything is operating as normal.

monitoring

Let your users know once a fix is in place, and keep communication clear and precise.

identified

As you continue to work through the incident, update your customers frequently.

investigating

When your product or service isn’t functioning as expected, let your customers know by creating an incident. Communicate early, even if you don’t know exactly what’s going on.

Report: "API outage"

Last update
resolved

This incident has been resolved.

investigating

We are currently investigating this issue.