LiftIgniter

Is LiftIgniter Down Right Now? Check whether there is an ongoing outage.

LiftIgniter is currently Operational

Last checked from LiftIgniter's official status page

Historical record of incidents for LiftIgniter

Report: "Backend server overload resulting in issues with recommendation quality as well as update lags"

Last update
resolved

A one-off backend job caused an overload on some of our backend data servers, causing the associated processes to crash. We promptly restarted the affected processes and they came back online. While these backend data servers were down, our front-end servers continued to serve requests using intermediate layers of caching, and most end users continued to receive recommendation results. After bringing the data servers online, we were able to resume running the one-off job with updated settings that did not put production infrastructure at risk.

During the period that the affected processes were down, the following issues would have been observed:

* Recent updates (to the catalog and to ML data) were not being pulled into recommendations.
* Infrequently accessed data (that was not cached) could not be successfully looked up, which could cause issues such as not returning enough results (particularly for queries with restrictive rules), or not returning the best results.

We had two periods where users may have noticed issues:

* US East from 2025-02-11 10:59 PM PT (2025-02-12 06:59 UTC) to 2025-02-11 11:26 PM PT (2025-02-12 07:26 UTC), mostly affecting traffic from the Eastern United States and Canada, Europe, and South America
* US West from 2025-02-11 11:38 PM PT (2025-02-12 07:38 UTC) to 2025-02-11 11:59 PM PT (2025-02-12 07:59 UTC), mostly affecting traffic from the Western United States and Canada, Asia, and Australia

Our redundant architecture limited the visible impact on end users. We are reviewing best practices around the settings, configurations, and safeguards for one-off backend jobs to reduce the risk of similar incidents in the future.

Report: "Issues with inventory API and related background processing"

Last update
resolved

This incident has been resolved.

monitoring

An initial fix has been implemented and most of the affected processing has been restored, including the inventory API endpoints. We are working on residual fixes and monitoring the results.

investigating

We are investigating issues with the inventory API endpoints, which seem to have been mostly returning 503 error codes since 12:20 AM PT (2024-05-19T07:20:00Z). Updates to the catalog (both insertions and deletions) are affected.

Report: "Issues with email service (and another component that is not end-user-facing) likely due to Google Cloud infrastructure issues"

Last update
resolved

Google Cloud confirms having fully resolved the incident https://status.cloud.google.com/incidents/xVSEV3kVaJBmS7SZbnre on their end, and all our systems appear to be stable.

monitoring

The email service is functioning normally -- so on-open email recommendations should resume working as expected. We are still seeing some symptoms of issues with Google Cloud's backend services, and are waiting on Google Cloud to address those issues and confirm resolution of the underlying issue at https://status.cloud.google.com/incidents/xVSEV3kVaJBmS7SZbnre. We will continue to monitor the situation closely until they have done so.

identified

Google Cloud has posted details of the incident on their end: https://status.cloud.google.com/incidents/xVSEV3kVaJBmS7SZbnre. We are waiting on them to resolve the issue.

investigating

We are investigating issues with our email rendering service, due to which on-open email recommendations are expected not to render properly. This is likely due to issues with Google Cloud infrastructure. We're still awaiting confirmation from Google Cloud that there's an issue on their end. In the meantime, we are looking into the best ways to mitigate and address the issue. Another infrastructure component of ours is also affected, but this component is not end-user-facing and will not have an immediate impact on the end user experience.

Report: "Capacity issues in US East region due to Google Cloud infrastructure issues (they are unable to provision new instances)"

Last update
resolved

Google Cloud has confirmed resolution on their end, and all alerts have resolved on our end. We're marking the incident as resolved. If you're a customer who is still seeing infrastructure issues, or if you have any questions about the incident, please reach out to us at liftignitersupport@thearenagroup.net.

monitoring

Capacity has been reprovisioned in the us-east1 region, and we resumed routing traffic to US East. While our metrics seem stable as of now, Google Cloud has not yet confirmed resolution of the issue on their end, so we will continue to monitor the situation for the time being.

identified

Due to a Google Cloud issue creating new VM instances in the us-east1 region (https://status.cloud.google.com/incidents/DyCcy7iGtWqLYEyJBWrC#7Nro3snGH5YXCyUfQuKS), we were having difficulty serving traffic that was being directed to that region. We were able to minimize the damage by directing as much traffic as possible to the us-west1 region that is still operational. We are monitoring for updates from Google Cloud as well as assessing the infrastructure impact.

Report: "Issues with console and the API usage"

Last update
resolved

This incident has been resolved.

monitoring

LiftIgniter console was taken down temporarily to update API keys. This may have caused some API calls to fail. Affected customers have been notified, and all systems should be back up and running.

investigating

https://console.liftigniter.com has temporarily been taken down to investigate an issue. Inventory updates are also experiencing issues.

Report: "Issues with services in US East due to capacity issues with cloud provider"

Last update
resolved

Capacity is back to normal and all configurations have been returned to their defaults.

monitoring

Due to capacity issues being experienced by our cloud provider (Google Cloud) in US East, we experienced (and may still be experiencing) issues with some of our services. Our query endpoint (query.petametrics.com), which is used to serve recommendations, saw 503 error rates rise to about 1%. Error rates were nonzero between 18:00 and 18:04 UTC. We had already started provisioning alternate capacity prior to the increase in error rates, but still saw some errors because the provisioning took a few minutes. We also saw increased latency for successful requests from 17:51 to 18:11 UTC. We also provisioned alternate capacity for a few other affected services; these services had a few minutes of downtime while the alternate capacity was coming online. We benefited significantly from the preparation we did after the previous incident http://status.liftigniter.com/incidents/1522vrjxbmcp.

Report: "Issues with services in US East due to capacity issues with cloud provider"

Last update
resolved

Capacity is back to normal and all services are operating normally. We've identified improvements to make to our systems to make them even more robust to similar issues.

monitoring

All our services are back to working normally. We are still waiting for the underlying capacity issues to be fixed, and will be reviewing our setup to see how we can reduce the impact of such incidents in the future.

identified

As of 15:04 UTC, our email-rendering services are back online and working properly, so all our front-facing services are working properly now. We have identified that the capacity issue is affecting one of our backend services used for managing user histories, and are continuing to investigate that.

identified

Due to capacity issues being experienced by our cloud provider (Google Cloud) in US East, we experienced (and may still be experiencing) issues with some of our services. Our query endpoint (query.petametrics.com), which is used to serve recommendations, saw 503 error rates rise to over 1%, briefly peaking at 2.6%. Error rates were nonzero between 14:23 and 14:38 UTC and went down to zero after we provisioned alternate capacity. The period of increased error rates was also a period of increased latency for successful requests. We are currently investigating the impact on some of our other services, including a service used for rendering emails.

Report: "Google Cloud load balancer issues causing errors for end users; workarounds in progress for core services"

Last update
resolved

Google has confirmed full resolution at https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh and we have reverted to our normal settings. All metrics have returned to normal ranges.

monitoring

Starting around 18:10 UTC (10:10 AM Pacific Time) we started seeing the endpoints working again. We paused our mitigation steps but are continuing to monitor before reverting them. Google Cloud has posted an incident report at https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh and the incident report does not yet confirm resolution. The incident has also been reported in the media; see https://www.theverge.com/2021/11/16/22785599/google-cloud-outage-spotify-discord-snapchat-google-cloud for instance.

identified

All our domains that route through Google Cloud's global public load balancer are giving 404 errors. This appears to be due to issues with Google Cloud; as of the time of writing, they do not have an incident page but have reported degraded service on their status page (https://status.cloud.google.com/). We implemented mitigation procedures for our recommendation engine endpoints that we originally wrote after a 2018 incident (http://status.liftigniter.com/incidents/7kz9f7w1z8jg) and expect these procedures to mitigate the bulk of the impact, even if Google Cloud takes time to solve its issues. However, some of our other services, including our inventory API, user API, email recommendations, and console, do not have a similar mitigation process in place, so recovery for them must wait until Google Cloud fixes the issue.

Report: "CDN issues with MaxCDN (one of our CDN providers)"

Last update
resolved

This incident has been resolved.

investigating

For a little under an hour, from 2021-10-20 8:04 UTC to about 2021-10-20 8:46 UTC, MaxCDN, one of the CDN providers that we use to serve traffic, was slow and unresponsive for part of our Australia traffic due to scheduled maintenance in the Sydney region: https://status.maxcdn.com/incidents/0stcf5zpm766. Customers with significant web traffic volume in Australia were affected. Customers using a version of the LiftIgniter snippet that automatically tries against the Cloudfront CDN when MaxCDN is down would be partly insulated. Even so, since there is some time lag for the retry, we did fail to capture data for their end users who bounced before the retry could be triggered, and failed to show recommendations for users who scrolled to recommendation areas before the retry had kicked in. API integrations were not affected, so apps were unaffected. We are working with MaxCDN's parent company, Stackpath, to transition from MaxCDN's infrastructure to Stackpath's infrastructure, which we expect to be more robust to scheduled maintenance. We expect such incidents to stop happening once we transition.
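
For illustration, here is a minimal, hedged sketch of the kind of CDN failover described above; the backup hostname and file path are placeholders, and this is not the actual LiftIgniter snippet code:

```javascript
// Hedged sketch of a CDN-failover loader. The backup hostname and file path are
// placeholders; this is not the actual LiftIgniter snippet.
(function loadWithFallback(hosts, path) {
  if (!hosts.length) return; // every CDN failed; give up quietly
  var script = document.createElement('script');
  script.async = true;
  script.src = 'https://' + hosts[0] + path;
  // If this CDN is down or unreachable, retry against the next host. The retry only
  // fires after the first request fails, which is the "time lag" mentioned above.
  script.onerror = function () {
    loadWithFallback(hosts.slice(1), path);
  };
  document.head.appendChild(script);
})(['cdn.petametrics.com', 'backup-cdn.example.com'], '/sdk.js');
```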

Report: "CDN issues with MaxCDN (one of our CDN providers)"

Last update
resolved

For about an hour, from 2021-10-12 7:57 UTC to about 2021-10-12 8:55 UTC, MaxCDN, one of the CDN providers that we use to serve traffic, was slow and unresponsive for part of our Japan traffic due to scheduled maintenance in the Tokyo region: https://status.maxcdn.com/incidents/n6s35q0f8gf0. Customers with significant web traffic volume in Japan were affected. The majority of such customers by traffic are using a version of the LiftIgniter snippet that automatically tries against the Cloudfront CDN when MaxCDN is down, so they were partially insulated. Even so, since there is some time lag for the retry, we did fail to capture data for their end users who bounced before the retry could be triggered, and failed to show recommendations for users who scrolled to recommendation areas before the retry had kicked in. API integrations were not affected, so app customers in Japan were not affected.

Report: "Datastore issues in US East region causing degraded recommendation quality"

Last update
resolved

Data restoration is now complete, and all metrics are back to normal. The backup from which we restored is a little older than the point in time at which we lost data, so there is a possibility of some intermediate data loss; we are keeping a close eye on performance to see if there are data inconsistencies.

identified

Capacity has been restored in the US East region; latency and other performance metrics are close to normal. However, we are still faced with the problem that the data available to the recommendation engine for queries in the US East region is less complete than it should be, causing degraded recommendation quality in some cases. We are working to restore from backups.

identified

We are currently experiencing hardware issues that affected multiple nodes of our datastore in the US East. These hardware issues are causing some data to be unavailable to our recommendation engine when calculating recommendations (causing a degraded quality of recommendations), and may also result in increased latencies when returning recommendations. We are working to restore capacity and will then restore lost data from backups. Our system in US West is not affected as of now. We will share more details as they become available.

Report: "Datastore issues in US West region causing degraded recommendation quality"

Last update
resolved

Data restoration is now complete and all metrics are back to normal. We will continue to keep an eye for any data inconsistencies created by the restore process, but nothing seems off as of now.

identified

This is similar to the incident at http://status.liftigniter.com/incidents/1z9fqwpckkyk. We experienced a hardware issue affecting multiple nodes of our datastore in US West [EDIT: Our hosting provider, Google Cloud, believes that this was actually a software issue with their virtualization software, and not a real hardware issue]. Capacity has been restored (it was limited between 8:30 UTC and 9:41 UTC). During the period of limited capacity, we experienced increased latency and substantially degraded recommendation quality. Now that capacity has been restored, the ongoing challenge is that, due to the large number of node failures, the datastore system in US West is missing some of the data it should have, causing degraded recommendation quality in some cases. We are working to restore from backups.

Report: "Issues with email service that powers email recommmendations"

Last update
resolved

We believe that the outage was caused by the load balancer incorrectly marking the service as unhealthy, due to aggressive timeout and failure conditions. This caused a brief period of downtime. We are addressing this by making the load balancer's failure thresholds more generous. EDIT: We determined that the initial marking as unhealthy was likely related to a live migration that our hosting provider, Google Cloud, conducted on a number of servers at the time.
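
For illustration only, here is a rough sketch of what "more generous" failure thresholds means; the field names follow common load-balancer health-check settings and the numbers are hypothetical, not our production values:

```javascript
// Hypothetical before/after health-check settings, shown only to illustrate what
// "more generous failure thresholds" means; these are not our production values.
var healthCheckBefore = {
  checkIntervalSec: 5,    // probe every 5 seconds
  timeoutSec: 2,          // a probe fails if there is no response within 2 seconds
  unhealthyThreshold: 2   // 2 consecutive failed probes mark the backend unhealthy
};
var healthCheckAfter = {
  checkIntervalSec: 10,   // probe less often
  timeoutSec: 5,          // tolerate slower responses (e.g. during a live migration)
  unhealthyThreshold: 5   // require more consecutive failures before removing a backend
};
```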

monitoring

The email service seems to be back up, as all alerts have resolved. We believe the likely cause was in the networking and load balancing layers, as the servers themselves were operational throughout the duration of the incident. However, we're continuing to investigate what happened.

investigating

We are investigating issues with the email service that powers on-open email recommendations, based on alerts we received at 8:44 AM Pacific Time (15:44 UTC). The issues could be affecting new email opens. We will post more details once we have them.

Report: "Brief period of increased error rates in US West model-servers due to a mass killing of servers by our hosting provider"

Last update
postmortem

On 2020-08-13, between 1:11 AM and 1:25 AM Pacific Time (8:11 to 8:25 UTC), LiftIgniter's model-server system in the US West region, which serves traffic in the Western Americas and Asia-Pacific, saw degraded performance. The degraded performance was caused by Google Cloud killing a large fraction of our "preemptible" model-servers at once. Our system automatically scaled out "normal" model-servers in response, and system performance stabilized rapidly without human intervention. Our engineers were notified through the alerting system, but no human intervention was needed to stabilize the system.

## **Background**

LiftIgniter's model-server system, which serves recommendations to end users, comprises a mix of "normal" model-servers and "preemptible" model-servers. The model-server system is hosted by Google Cloud and exists in two Google Cloud regions (US East and US West). The normal model-servers cannot be arbitrarily preempted by Google; the preemptible model-servers can be arbitrarily [preempted by Google with a 30-second notice](https://cloud.google.com/compute/docs/instances/preemptible). LiftIgniter has a robust process for handling preemptible terminations with no interruption of service to end users. If Google Cloud terminates many preemptible servers close together in time, and does not provide preemptible capacity to replace the removed model-servers, LiftIgniter's normal model-server system scales out to handle the load. This scaling out can take a few minutes, and depending on the proportion of preemptible servers killed, can result in intermittent errors and connectivity issues.

## **Event timeline**

1. 1:11 AM to 1:17 AM Pacific Time: Google Cloud killed about 2/3 of the preemptible model-servers in the US West region.
2. 1:12 AM to 1:23 AM Pacific Time: We saw an increase in unresponsiveness of the status endpoints on the model-servers. Unresponsiveness was highest from 1:18 AM to 1:21 AM.
3. 1:17 AM to 1:24 AM Pacific Time: We saw errors with status codes 429, 503, and 504 on the recommendations endpoint. Status codes 429 and 503 dominated the errors. These are the expected status codes when servers are overloaded; the use of 429 and 503 suggests a relatively smart response to overload compared to 504 (the system discarded extra load rather than starting to compute on it and then timing out). The peak proportion of request traffic returning an error code (summed across error codes) was 1% in the US West region. For individual customers, the peak proportion of request traffic returning an error code was 3%; these peaks lasted only a few seconds.
4. 1:18 AM to 1:27 AM Pacific Time: The normal model-servers scaled out, making up for the lost capacity on preemptible model-servers.
5. 1:27 AM to 1:36 AM Pacific Time: Capacity recovered for the preemptible model-servers.

## **Learnings and improvements for the future**

The rapid automatic stabilization of the system was satisfactory to us, given the rarity of such mass killings of servers. However, if the frequency of such incidents increases, we will tweak our settings around the relative proportion of preemptible and normal model-servers.

We are also generally satisfied with the alerting: our engineers were alerted and monitored the situation, but the automatic response was good enough that no manual intervention was needed. Had the automatic response not been sufficient, our engineers would have been on hand to make adjustments.
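
As a hedged illustration of how a process on a preemptible GCE instance can react to the 30-second preemption notice (a sketch, not LiftIgniter's actual implementation; `drainAndExit` is a hypothetical helper), one common approach is to poll the instance metadata server's `preempted` flag:

```javascript
// Hedged sketch (Node.js, global fetch): poll the GCE metadata server's "preempted"
// flag and drain gracefully when the 30-second notice arrives. drainAndExit() is a
// hypothetical stand-in for real shutdown logic.
const PREEMPTED_URL =
  'http://metadata.google.internal/computeMetadata/v1/instance/preempted';

async function watchForPreemption() {
  const res = await fetch(PREEMPTED_URL, { headers: { 'Metadata-Flavor': 'Google' } });
  const preempted = (await res.text()).trim() === 'TRUE';
  if (preempted) {
    // Roughly 30 seconds remain: stop taking new requests, finish in-flight ones,
    // and let the load balancer shift traffic to the remaining model-servers.
    await drainAndExit();
  } else {
    setTimeout(watchForPreemption, 1000); // check again in a second
  }
}

async function drainAndExit() {
  // Hypothetical: deregister from load balancing, await in-flight requests, then exit.
  process.exit(0);
}

watchForPreemption();
```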

resolved

On August 13, 2020, between 1:11 AM and 1:25 AM Pacific Time (PDT), we saw increased error rates and unresponsiveness on our model-server system in US West due to a mass killing of servers by Google Cloud, our hosting provider. The automatic response was sufficient to stabilize the system. No manual intervention was needed, though our engineers did get alerted and were monitoring the situation.

Report: "Significant increase in 500 internal server errors in recommendations"

Last update
postmortem

On Thursday, August 6, 2020, at 2:15 PM Pacific Time (21:15 UTC) we deployed an update to one of our jobs responsible for making real-time machine learning updates. The job had a bug that caused a (reversible) corruption in machine learning data. The data is used by the servers serving LiftIgniter's recommendations (known as the model-servers). The corruption in the data resulted in internal server errors (status code 500) being returned by the model-servers on some queries. Only queries that used the corrupted data in a particular way were affected. The errors were discovered by our engineers through an alert. After determining that the bad code deployment to the job responsible for real-time machine learning updates was responsible, we reverted the deployment for that job. The errors died out within a few minutes of the revert.

## **Event timeline**

1. On Thursday, August 6, 2020, at 2:11 PM (21:11 UTC) we began a deployment to one of our jobs responsible for real-time machine learning. The deployment finished rolling out at 2:15 PM (21:15 UTC).
2. Starting 2:19 PM (21:19 UTC), corruption in the machine learning data started causing the model-servers to return internal server errors with status code 500. The initial error rate was 0.023%, and over the next few minutes (until about 2:33 PM) the error rate rose to 0.5%.
3. At 2:26 PM (21:26 UTC), after a little over 6 minutes of 500 internal server errors, an alert triggered, notifying our engineers. Since the error was on the model-servers, and the deployment had been to a different component, we were not initially sure that the deployment was the cause, but it was a leading hypothesis.
4. Starting 2:34 PM (21:34 UTC), the error rate started to increase. By the time range of 2:41 PM to 2:49 PM, the error rate had increased to the range of 3% to 3.5%. The error rate affected different customers differently; at the high end, the error rate for one customer reached 17%.
5. By 2:50 PM (21:50 UTC), the revert of the bad deploy had been completed, and the error rate started going down. By 2:54 PM, the error rate was down to zero.
6. Following the immediate mitigation, we continued investigating the mechanism of the problem and began working on robustification of the model-server. After extensive testing and peer review, the robustifications were rolled out to production on Friday, August 7, 2020, starting 6:26 PM Pacific Time.
7. On Monday, August 10, 2020, we resumed making updates to the jobs involved in building machine learning configurations. Thanks to the robustifications pushed out the previous Friday, we were able to make the changes safely with minimal risk to serving.

## **Cause and impact**

The underlying cause of the problem was that the bad deploy of the machine learning update job had a bug due to which, at each update, some parts of the record being updated were set to null. The model-server, when reading the record, would encounter the null and throw an error. The error would get caught in the model-server and returned as a 500 internal server error to the end user. The impact of the problem was therefore limited to queries where one or more of the records looked up in the query had been updated since the bad deploy. Therefore, the percentage of queries affected increased gradually after the bad deploy. There was a further lag in the increase of errors due to caching on the model-server side. Specifically, the sharp increase in error rate at 2:34 PM Pacific Time, 15 minutes after the start of errors, was partly due to the fact that the maximum cache duration is about 11 minutes. Similarly, after the bad deploy was reverted, the error rate started dropping as the affected records started cleaning themselves up. This reduction was more rapid because the corrupt data had not been cached in our model-servers (only valid data is cached).

## **Learnings and improvements for the future**

The following are some of the improvements that we have made or are planning to make based on this experience:

1. **Robustification of model-server against corrupt data** (done): As we add more algorithms and strategies to the model-server, and modify the logic of building machine learning data, we want to make sure that the model-server is robust against corrupt data. That way, a bad deploy such as this one will not cause 500 internal server errors. Instead, the bad data will be reported by the model-server through a more granular exception-handling mechanism, so that we get alerted but end users do not get errors (end users still suffer, but in a milder form, through somewhat degraded recommendation quality). A rough sketch of this idea appears below.
2. **More proactive monitoring of model-servers, with a predetermined monitoring plan, for deploys to jobs that involve real-time machine learning updates** (instituted as a process update): One of the challenges with deploying build jobs is that their impact on serving can be fully tested only after a full deployment. By thinking through the metrics that may be affected, and proactively monitoring them, we could more quickly catch and revert bad deploys. While this would not have prevented this incident, it could have reduced the time before we diagnosed and reverted the deploy.
3. **Some improvements to pre-deploy testing procedures for jobs that involve real-time machine learning updates** (still under consideration): We are experimenting with various ways of improving pre-deploy testing procedures for such jobs, so that subtle errors in them can be caught before production deployment. One idea that we plan to try out is to switch some non-customer organizations to the new job and check the impact on the model-server for those organizations before rolling out more widely. The list of non-customer organizations includes organizations with synthetic data, as well as real websites that are being tracked through LiftIgniter but aren't showing recommendations.
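
As a rough sketch of the robustification described in item 1 above (the record shape, `reportException`, and `computeScore` are illustrative, not our actual serving code), the serving path can skip and report corrupt records instead of letting a null propagate into a 500:

```javascript
// Hedged sketch of the robustification: skip and report corrupt records rather than
// letting a null crash the query. reportException() and computeScore() are illustrative.
function scoreCandidates(records, reportException) {
  var scored = [];
  for (var i = 0; i < records.length; i++) {
    var record = records[i];
    if (record == null || record.features == null) {
      // Corrupt or partially-null record: alert engineers, but keep serving the
      // remaining candidates so the end user sees (possibly degraded) recommendations.
      reportException('corrupt ML record', record && record.id);
      continue;
    }
    scored.push({ id: record.id, score: computeScore(record.features) });
  }
  return scored;
}

function computeScore(features) {
  return features.length; // placeholder; the real model logic is not shown here
}
```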

resolved

We have pushed robustifications to our serving architecture. We expect that with these robustifications, problems similar to the ones that triggered this incident would not cause internal server errors. Rather, such problems would trigger an alert in our system for exceptions, while still returning a possibly degraded response to the queries. The robustifications will be tested next week as we resume the work of pushing updates to our machine learning building configurations.

monitoring

We have successfully reverted the faulty code push and the internal server errors have stopped completely. We are working on robustifications on the serving side that will prevent it from throwing errors for similar corruptions on the building side.

identified

LiftIgniter's recommendation servers are experiencing a significant increase in internal server errors, returning status code 500. We have identified a recent code push to our machine learning building configuration as responsible, and are working on reverting it. We will update with more details as we learn more.

Report: "Issue with honoring extremely short TTLs for items in the catalog"

Last update
postmortem

In May/June 2020, LiftIgniter was notified by a customer that they had been trying to delete items from the catalog by reinserting them with a ttl of 1 second (rather than using the DELETE API); however, these items were not successfully deleted, and in fact got a ttl of 30 days. Upon investigation, LiftIgniter found and fixed two issues that were causing this problem.

## **Event timeline**

1. Between August and December 2019, LiftIgniter moved to a new system for managing catalog insertion. Prior to this, the entire catalog was stored in our Aerospike clusters, and these were the source of truth for the catalog. Aerospike is key-value data storage software that allows for rapid lookups. After the change, we switched the canonical source of truth to a SQL database, while still maintaining the Aerospike clusters for rapid serving.
2. In May and June 2020, a customer communicated to us that they had been trying to delete items by inserting them with a time to live (ttl) of 1 second. We discovered that the items were not being deleted, and were being recorded with a ttl of 30 days (a fallback default TTL value in Aerospike).
3. In June 2020, we discovered one cause of the problem: the SQL database that we were using as our canonical source of truth was not expiring items correctly after their ttl. This was because a stored procedure on the SQL database was not running at all between February and June 2020. We addressed this problem and pushed the code fix on July 6, 2020, so that items would get expired from SQL and the expiration would propagate to Aerospike. We also ran a job to backfill all the past deletions that needed to be done.
4. In July 2020, we discovered a second cause of the problem: when the ttl is short, by the time the insertion was propagated to Aerospike, the remaining ttl was sometimes close enough to 0 seconds that it was truncated to 0 seconds. Aerospike does not accept a ttl of 0 seconds; a value of 0 gets replaced by Aerospike's default ttl. That explains the ttl of 30 days that we had been seeing. We pushed the code fix on July 15, 2020. NOTE: Prior to step 1, our insertion logic had respected ttls, because insertion had been a single-step process. The bug introduced by step 1 was specifically that we added logic to adjust the ttl based on the lag between first insertion and the job that copies from the SQL database to Aerospike. It was this adjustment for the lag that created the case of a ttl of 0 seconds, even when the ttl at insertion was more than 0 seconds (a rough sketch of this appears below).

## **Learnings and improvements for the future**

We have three learnings from this experience:

* **Improving communication with customers around expected uses of ttl functionality**: Our ttl functionality is not intended as a way to immediately delete items; we encourage customers to use the DELETE API for that purpose. In particular, we had not tested extremely short ttls because that was not a typical use case for us.
* **Better monitoring of SQL permission errors**: We could have detected problems on the SQL side more quickly if SQL error logs automatically triggered alerts in our alerting system. We are working on instituting this monitoring for the future.
* **Improving our speed of diagnosis once an issue is reported to us**: Since we didn't have good logging around the exact body of the original API insertion request, it took us some time to narrow the problem down to our system's failure to respect the ttl. We could have diagnosed the problem faster by testing a wider range of alternatives once the problem was reported to us.
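
To make the lag-adjustment bug concrete, here is a hedged sketch of the failure mode and the fix; the function and field names are illustrative, not our actual insertion code:

```javascript
// Hedged sketch of the lag-adjusted TTL logic and the fix; function and field names
// are illustrative, not the actual insertion code.
function adjustedTtlSeconds(item, nowMs) {
  var elapsedSec = Math.floor((nowMs - item.insertedAtMs) / 1000);
  return item.ttlSeconds - elapsedSec; // time left by the time the copy job runs
}

function copyToAerospike(item, nowMs, aerospikeWrite, aerospikeDelete) {
  var remaining = adjustedTtlSeconds(item, nowMs);
  if (remaining <= 0) {
    // Before the fix, a remaining ttl truncated to 0 was passed straight to Aerospike,
    // which treats 0 as "use the default ttl" (30 days in this setup).
    // After the fix: the item has already expired, so delete it instead of re-inserting.
    aerospikeDelete(item.id);
    return;
  }
  aerospikeWrite(item.id, item.fields, remaining);
}
```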

resolved

This incident has been resolved.

Report: "Catalog updates backlogged for recomendation-serving in US West"

Last update
postmortem

On Monday, June 29, 2020, 2:26 PM Pacific Daylight Time (PDT) (21:26 UTC), our server that inserts catalog data (inventory) into our Aerospike-based key-value store in the US West region stopped inserting successfully. The reason for the failure was disk corruption on one of the Aerospike cluster nodes in US West. We addressed the backlog completely by the morning of Tuesday, June 30. Over the next few days, we implemented improvements to our metrics, monitoring, and alerts so that we can deal with similar situations better in the future.

Sections of this document:

* Event timeline
* Cause and impact
* Learnings and improvements for the future

## Event timeline

1. On Monday, June 29, 2020, 2:26 PM Pacific Daylight Time (PDT) (21:26 UTC), LiftIgniter's server that inserts catalog data into our Aerospike-based key-value store in the US West region stopped inserting successfully, so a backlog of insertions started building from this time onward. The reason for the failure was disk corruption on one of the Aerospike cluster nodes in US West. Insertion into the US East region was unaffected.
2. At 7:33 PM PDT and 7:43 PM PDT (five hours after the problem started), two alerts triggered for a customer website whose catalog configuration was highly sensitive to timely catalog updates, and whose end user traffic went mainly to the US West region.
3. Between 7:57 PM PDT and around 8:15 PM PDT, two of our engineers together diagnosed the problem.
4. At 8:21 PM PDT, LiftIgniter pushed an update that redirected traffic for the main affected customer to the US East region. This addressed the immediate problems for that customer and gave us breathing room to fix the problem.
5. At 8:36 PM PDT, LiftIgniter successfully resumed the process of catalog insertion into US West by removing the bad Aerospike node from the cluster, so from that point onward the system started catching up on the backlog.
6. At 8:56 PM PDT, LiftIgniter added a new Aerospike node to replace the bad one, so that the cluster could successfully rebalance without running out of disk space.
7. On Tuesday, June 30, at 8:39 AM PDT, the backlog was fully caught up in US West. The rebalancing of the Aerospike cluster had also completed by this time. We also reverted the customer configuration for the affected customer to resume sending traffic to US West.

## Cause and impact

### Root cause

LiftIgniter stores a copy of its catalog in separate Aerospike clusters in both its regions of operation (US West and US East), though the cluster is not the canonical location of the catalog data. Copying of catalog updates to the Aerospike cluster happens via a job. The copy of the catalog in the Aerospike cluster in each region is used by LiftIgniter's recommendation servers for serving recommendations, as well as for real-time machine learning model updates. The root cause, and trigger, of the problem was a disk corruption issue on one of the nodes in the Aerospike cluster in US West. This disk corruption was causing some writes to that node to fail. The job that updates the catalog in Aerospike makes updates in order, so once it encountered the error, it did not proceed further. This issue, which started at 2:26 PM PDT, led to the backlog that was discovered several hours later.

### Impact on recommendations

For the majority of customers, the impact was as follows: their end users whose requests were routed to the US West region were being served using a somewhat outdated catalog (as of the start time of the problem). This means that items published recently would not be shown in recommendations, and would not be recognized when users visited them. Also, updates to catalog metadata for items would not be reflected in the recommendations. Other than that, however, recommendations would still be served normally. The impact could thus be described as a slight degradation in the quality of recommendations, as well as slight inaccuracies in the quality of real-time learning. In most cases, end users would not notice this impact, and the overall effect on metrics was also small. [NOTE: If the problem had continued for a much longer period, the impact on recommendation quality would have been more severe.]

Two customer accounts were affected to a much greater extent. Both of them had high time-sensitivity to catalog updates, with old content either being expired from the catalog or filtered out through rules. For these accounts, LiftIgniter ended up returning empty results for a while. The problem with one of these accounts triggered the alert described in the incident timeline, which led to our discovery of the issue, and we worked around the problem for that account by directing its traffic to the unaffected US East region. The problem with the other account occurred for a few hours during the period that the backlog was being caught up on, and resolved on its own as our system worked through the backlog.

## Learnings and improvements for the future

We believe we could do better on two fronts:

1. Alerting: We should have received an alert about the problem right when it started, rather than discovering it indirectly through downstream customer impact.
2. Catching up on backlogs: Our system should have been able to catch up faster on the backlog. That would have prevented the issues with the second customer account.

We're making improvements on both fronts:

1. Alerting: We have improved our alerting so that we get alerted if the job stops working the way it stopped this time. We already had two alerts, one for the job not reporting any metrics and one for the job reporting a backlog, but the specific failure mode here did not match either of the alert definitions. To elaborate: the job was reporting one metric and not the other, and it was this unreported metric that was used to calculate the backlog. We modified one of the alert definitions to incorporate this case (see the sketch below). With the new alerting, we would have discovered the problem about 10 minutes after it started, rather than 5 hours later, and would have been able to fix it before any visible customer impact. We would have not only saved the 5 hours of discovery time, but also cut down on the investigation time (since the more specific alert would have led us to the problem more quickly) and reduced the time taken to catch up with the backlog (because there would have been less of a backlog to catch up on).
2. Catching up on backlogs: We have added latency metrics so that we can debug backlogs better. This may inspire future improvements so that we can catch up faster on backlogs.
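
As a hedged sketch of the widened alert condition described in improvement 1 above (metric names, staleness window, and backlog threshold are illustrative, not the actual alert definitions):

```javascript
// Hedged sketch of the widened alert condition; metric names, staleness window, and
// backlog threshold are illustrative, not the actual alert definitions.
function shouldAlert(metrics, nowMs) {
  var STALE_MS = 10 * 60 * 1000;      // treat a metric as missing after 10 minutes
  var MAX_BACKLOG_SECONDS = 15 * 60;  // largest acceptable insertion lag

  var heartbeatFresh = nowMs - metrics.lastHeartbeatMs < STALE_MS;
  var backlogFresh = nowMs - metrics.lastBacklogReportMs < STALE_MS;

  if (!heartbeatFresh) return 'job is not reporting any metrics';
  // The case added after this incident: the job looks alive, but the metric used to
  // compute the backlog has stopped arriving.
  if (!backlogFresh) return 'job is alive but the backlog metric is missing';
  if (metrics.backlogSeconds > MAX_BACKLOG_SECONDS) return 'insertion backlog too large';
  return null; // healthy
}
```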

resolved

This incident has been resolved.

monitoring

The backlog was fully caught up as of 2020-06-30 16:00 UTC (about one hour ago at the time of posting this). Recommendation-serving is therefore operating completely normally now. The customer website that was dramatically affected has been restored to its prior configuration of sending traffic to the closest region. Also, after removing the node with disk corruption, we have successfully rebalanced data across the remaining nodes, and our data storage is operating as expected.

identified

Our machines have started working through the backlog of catalog updates. We are working on getting an estimate of how quickly the backlog will be fully caught up.

identified

LiftIgniter maintains servers to serve recommendations in two Google Cloud regions: US West and US East. End user traffic generally goes to the closer of the two regions unless configured to go to a specific region. The LiftIgniter recommendation servers in the US West region do not have access to catalog updates made since around 2020-06-29 21:20 UTC (about 6 hours ago at the time of posting this). We have determined that this is because disk corruption on one of the nodes is blocking updates from being made. No data is lost, and we are working to replace the bad node and resume the process of catalog updates. Affected customers are those whose traffic mainly goes to the US West region, which primarily includes customers with end users in the Western United States and Asia Pacific. The main effect on these customers will be reduced freshness of the catalog information used to serve recommendations. Only one customer website was dramatically affected, and we are in touch with them. We've directed their end user traffic to the US East region for the time being.

Report: "JS SDK file not working on IE 11 since 11:30 AM PT (18:30 UTC) on Sunday June 7"

Last update
postmortem

Starting Sunday, June 7, 2020, 11:26 AM Pacific Time (PDT), LiftIgniter's JS SDK stopped working on Internet Explorer (IE) 11 and lower. The problem was fixed on Tuesday, June 9 at 7:18 AM PDT. A minor related problem was fixed on Friday, June 19, at 1:48 PM PDT. As a reminder, LiftIgniter attempts to support only IE 9 and higher, so the affected browser versions are IE 9, IE 10, and IE 11.

## Event timeline

1. On Sunday, June 7, 2020, at 18:26 UTC (11:26 AM PDT), LiftIgniter released a new version of its browser-client, the JS file that supports LiftIgniter's JavaScript SDK.
2. On Tuesday, June 9, after reports from two customers that the JS file was not working on IE 11, LiftIgniter investigated and pushed a hotfix that caused the JS file to resume working on IE 9, 10, and 11. The fix was pushed at 14:18 UTC (7:18 AM PDT). However, diagnostic and debugging functionalities (specifically, `$p("runDiagnostics")` and `$p("printDebugInfo")`) still did not work on IE 9, 10, and 11. More generally, diagnostic and debugging functionalities were unavailable whenever `window.Promise` did not exist, with IE 9, 10, and 11 as example browsers.
3. On Friday, June 19, at 20:48 UTC (1:48 PM PDT), we pushed out the fix making diagnostic and debugging functionalities (specifically, `$p("runDiagnostics")` and `$p("printDebugInfo")`) available on browsers that do not have `window.Promise` defined, including IE 9, 10, and 11. The way the fix works: if the user runs `$p("runDiagnostics")` and the global `window.Promise` isn't defined, the user is prompted to load the global promise polyfill using `$p("loadPromise")`. After this is loaded, the user can rerun the diagnostic command and it should work.

## Cause and impact

### General background of changes we were trying to make

LiftIgniter has an ongoing project to reduce the size of the browser-client JS file. As one step toward this, we moved two of our JS file's diagnostic functions, `$p("runDiagnostics")` and `$p("printDebugInfo")`, to a separate "Debug" chunk file that is lazy-loaded when somebody first runs either of the commands. The main JS file is therefore somewhat smaller. Since most end users will not run diagnostic functions, we save on the total amount of data downloaded for a typical user, while still keeping diagnostic functionality available.

### Cause of the JS file not working at all on IE 11 and lower: use of document.currentScript

The implementation of lazy loading used the `document.currentScript` construct. This is not available on IE 11 and lower. As a result, the JS file was crashing on these browsers. Because of the place where this crash was occurring, it did not even report any errors to LiftIgniter's backend.

### Cause of diagnostic functions not working after the initial fix: use of window.Promise

After we fixed the issue with `document.currentScript`, LiftIgniter's JavaScript file was working on IE 11 and lower. However, the actual loading of the Debug chunk files for diagnostic functions still failed, because this loading relied on `window.Promise` being available, and `window.Promise` is not available in IE 11 and lower. Therefore, on IE 9, 10, and 11, the diagnostic functions `$p("runDiagnostics")` and `$p("printDebugInfo")` were only available on customer sites that were already applying a global polyfill for `window.Promise` prior to the execution of the diagnostic function.

## Learnings and improvements for the future

LiftIgniter has implemented three broad categories of system improvements to reduce the risk of similar problems in the future:

1. Automated alerts around traffic volume by browser family
2. Reinstatement of IE 9 and IE 11 checks in the release process
3. More IE 9 and IE 11 compatibility testing in pull request review

### Automated alerts around traffic volume by browser family

At the time of this faulty release, LiftIgniter's automated alerting included alerts around overall traffic volume, traffic volume by customer, and traffic volume by country. Our release process also included a manual review of global traffic patterns. However, there was no automated alerting around traffic volume by browser family, and the share of traffic from IE 9, 10, and 11 is small enough (roughly 0.1%) that its impact on overall metrics can hardly be noticed. We have now added automated alerting around traffic volume by browser family. With this alert in place, if traffic from the "IE" family drops to zero, an alert will trigger in about 30 minutes. The alert does not separately check traffic levels by individual browser version (e.g., IE 9 versus IE 10 versus IE 11) because data at that level of granularity can be too noisy.

### Reinstatement of IE 9 and IE 11 checks in the release process

Historically, the release process for the LiftIgniter JS file included manual compatibility checks with IE 9 and IE 11. At some point, we removed these checks. The removal was due to a mix of factors, including the fact that the test sites we were using to test IE 9 and IE 11 compatibility either stopped supporting those browsers or stopped using LiftIgniter. We also thought the checks were probably not necessary because we had rarely found any of our incremental changes break these browsers. The recent incident highlights the importance of these checks, so we are adding them back to the release process, and also changing the sites used for testing to ones that we expect to continue functioning properly and using LiftIgniter for the foreseeable future.

### More IE 9 and IE 11 compatibility testing in pull request review

Ideally, we would like to identify browser compatibility problems even before getting to the release stage. One aspect of this is to include compatibility testing when reviewing pull requests that affect the source code of our JS file. We will be doing this more often, particularly for pull requests that use nontrivial JavaScript constructs that may not be available on older browsers.
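
For illustration, here is a hedged sketch of the two compatibility guards discussed above; the real JS SDK internals differ, and `loadDebugChunk` and `promptLoadPromise` are hypothetical names:

```javascript
// Hedged sketch of the two compatibility guards; the real JS SDK internals differ,
// and loadDebugChunk() / promptLoadPromise() are hypothetical names.

// 1. document.currentScript does not exist on IE 11 and lower, so fall back to the
//    last <script> element instead of crashing.
function currentScriptOrFallback() {
  if (document.currentScript) return document.currentScript;
  var scripts = document.getElementsByTagName('script');
  return scripts[scripts.length - 1];
}

// 2. Lazy-loading the "Debug" chunk relies on promises; if window.Promise is missing
//    (IE 9/10/11), prompt the user to load a polyfill via $p("loadPromise") first.
function runDiagnostics() {
  if (typeof window.Promise === 'undefined') {
    promptLoadPromise(); // e.g. log: run $p("loadPromise"), then rerun $p("runDiagnostics")
    return;
  }
  loadDebugChunk().then(function (debug) {
    debug.runDiagnostics();
  });
}
```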

resolved

On Friday, June 19, 2020, around 20:48 UTC, LiftIgniter released an updated version of the JS SDK file that fixes the unavailability of diagnostic and debugging commands (specifically, $p("runDiagnostics") and $p("printDebugInfo")) on IE 9, 10, and 11. With the new version, if it is found that the browser does not support window.Promise, the user is prompted to load a global promise polyfill using $p("loadPromise"), and then rerun the diagnostic command. This should allow users to access diagnostic and debugging functionality on IE 9, 10, and 11, where it was previously unavailable due to window.Promise not being present on these browsers. Customers who are loading their own global promise polyfill are unaffected; diagnostic and debugging commands would already have been available to them in IE 9, 10, and 11 prior to this fix.

monitoring

We have deployed the first fix to the JS SDK file (the fix was released at 7:18 AM PDT, or 14:18 UTC). The core functionality of the file should now work on IE 9, 10, and 11. Diagnostic and debugging commands are still not available. We have also communicated with the customers who reached out individually regarding this.

identified

On Sunday June 7, LiftIgniter released a new version of its JavaScript SDK file. The new version used a smaller file, with some diagnostic and debugging functionalities moved to separate chunk files. The logic used to handle these chunk files was not compatible with IE 11 and earlier versions of IE, so the script stopped working for these users. We have a hotfix ready and are rolling it out. The hotfix still does not provide diagnostic functionalities on IE 11 and below, but at least the core JS SDK works correctly on these browsers now.

Report: "All non-region-specific traffic being sent to US East instead of US West due to Google Cloud networking issues"

Last update
resolved

The impact of the networking issue on LiftIgniter's services has been addressed. All traffic is now being routed normally. The recovery proceeded without incident (there were no latency or timeout alerts during the transition).

identified

Due to a networking issue reported by Google Cloud at https://status.cloud.google.com/incident/cloud-networking/19020 all activity and requests traffic for the query.petametrics.com endpoint is being sent to US East rather than US West. This may mean increased latencies for users who are closer to the US West region. We had a brief period of increased error rates for a few minutes after the transition, which happened between 5:36 PM and 5:38 PM. However, our capacity in US East scaled up quickly to meet the traffic increase, so the errors stopped quickly within about 5 minutes.

Report: "Inventory API servers giving errors and experiencing downtime"

Last update
resolved

We have deployed a code update that we expect will make the system more robust and the likelihood of this kind of downtime much lower. Even prior to this update, we haven't had this issue for the past 4 weeks. We're marking the incident as resolved.

monitoring

We have just recovered from degraded performance on the servers used for our inventory API operations on the api.petametrics.com domain. We will share more details later, but the problems we noticed were:

- Elevated rates of 5XX errors for insertion operations starting around September 19, 2019, 1:20 PM Pacific Time (20:20 UTC)
- Servers intermittently unresponsive to pings on September 19, 2019, between 1:30 PM and 1:45 PM Pacific Time (20:30 to 20:45 UTC)

We have reinstated server capacity and are currently reviewing the situation.

Report: "Issues with our inventory API servers"

Last update
resolved

We have pushed some code updates that turn off the parts of the codebase that were causing trouble, and also added several more code robustness improvements, better alerting, and better playbooks for alert response. We expect that incidents with the same cause won't occur any more, and other incidents with similar symptoms will be mitigated much more quickly. We are still diagnosing the exact mechanism by which the problem occurred (so that we can reactivate the parts of the codebase we've turned off). We're also preparing an internal postmortem. We will share further details regarding timeline and impact once the investigation and internal postmortem are completed.

monitoring

The inventory API servers had issues between August 21, 2019, 10:45 PM PDT and 11:45 PM PDT (August 22, 2019, 5:45 UTC to 6:45 UTC). A large fraction of requests to inventory insertion, GET, and DELETE operations timed out or gave error codes during the period. We were able to get the servers back to normal through scaling up capacity, and the servers have been stable since 11:45 PM PDT. Customers who received failures or timeouts on inventory API operations during this period would see their requests succeed if they retried after 11:45 PM PDT. We are still reviewing what happened and will update with more details later. NOTE: This degraded performance only affects inventory insertion, GET, and DELETE operations attempted via the API during the time period. Any affected users would have either had their request time out or received an error code. Customers who do not use the inventory API, or who were not using it during the time period of the problem, are unaffected.

investigating

We are experiencing degraded performance on the servers used for our inventory API operations on the api.petametrics.com domain. The problems appear to have started a little before 11 PM PDT. We will post more details as we learn them. NOTE: This degraded performance only affects inventory insertion, GET, and DELETE operations performed via the API. It has no direct effect on the model-servers that serve queries. Therefore it only affects customers who use our inventory API, and specifically those who used it during that time period.

Report: "Google Cloud networking incident in US East: Minimal impact on LiftIgniter service other than slight latency increases"

Last update
resolved

We are marking this incident as resolved after verifying that traffic is being distributed between US West and US East in the normal manner. We believe that our system's response to the networking issues was graceful and resulted in minimal end user impact from the outage. We will keep an eye out for further details from Google about the incident at https://status.cloud.google.com/incidents/xVSEV3kVaJBmS7SZbnre to learn more.

monitoring

The Google Cloud load balancer has now resumed sending traffic to US East. Everything seems to be working as expected, but we are closely monitoring metrics and will resolve this incident once all the metrics look healthy for a while.

identified

Although Google posted in https://status.cloud.google.com/incident/cloud-networking/19015 at 9:12 AM PDT (16:12 UTC) that the problem is fully resolved, we are continuing to see all our traffic being sent to US West. We have opened a case with Google to check in on the status. We continue to believe that none of our services are affected (except possibly for slight latency increases). We will post further updates once we hear back from Google or see that the load balancer is directing traffic to US East.

identified

On Tuesday, July 2, 2019, between 08:22 and 08:24 AM Pacific Time (15:22 to 15:24 UTC), we saw a dramatic reduction in the traffic going to our US East datacenter and a corresponding increase in the traffic going to our US West datacenter. The change in traffic appears to be due to the Google Load Balancer, a global public load balancer provided by our cloud provider Google Cloud Platform, deciding to no longer direct traffic to US East. Our autoscaling was able to handle the approximate doubling of traffic to US West fairly gracefully, with capacity roughly doubling within minutes. We believe that Google Cloud's decision to redirect traffic is driven by networking issues with US East, as described at https://status.cloud.google.com/incident/cloud-networking/19015

According to Google's update at 08:50 AM PDT (15:50 UTC): "The Cloud Networking service (Standard Tier) has lost multiple independent fiber links within us-east1 zone. Vendor has been notified and are currently investigating the issue. In order to restore service, we have reduced our network usage and prioritised customer workloads. We will provide another status update by Tuesday, 2019-07-02 09:38 US/Pacific with current details." We will await further updates from Google.

As far as we can make out, there is no impact on the availability of LiftIgniter's services. Even the regional endpoint for US East (query-us-east1.petametrics.com) appears to be working correctly. However, there may be a small end-user latency impact, both for customers who have hardcoded query-us-east1.petametrics.com as the endpoint (due to the networking issues in US East) and for customers whose end users would normally go to US East but are now being redirected to the somewhat more distant US West.

Report: "Networking issues between our servers causing downtime for end users"

Last update
resolved

Google has confirmed at https://status.cloud.google.com/incident/cloud-networking/19009 that the networking issue is resolved and that they will post a detailed report. Since all our alerts have also resolved and our systems have been stable for the past few hours, we are marking the incident resolved as well.

monitoring

Our systems have been stable for 40 minutes now, but we are still waiting for Google to confirm in https://status.cloud.google.com/incident/cloud-networking/19009 that they have fixed the issue on their end before we consider this issue resolved.

identified

The services have returned again to a fully functional state in both regions. However, we are still waiting for more details from Google Cloud regarding the networking issue at https://status.cloud.google.com/incident/cloud-networking/19009

identified

We noticed a recurrence of the problem in US West (previously, the problem had been more severe in US East) and are applying the same emergency fix on US West. We expect to return to fully functional status in 10 to 15 minutes. Also, Google Cloud has clarified at https://status.cloud.google.com/incident/cloud-networking/19003 that the issue is related to a larger networking issue (which is also what we originally saw evidence for). Their status page on the networking issue is at https://status.cloud.google.com/incident/cloud-networking/19009

monitoring

All our services appear to be fully functional again. However, we are still waiting for Google to share more details of the underlying issue at https://status.cloud.google.com/incident/compute/19003 so we can evaluate how much longer to closely monitor our systems and whether there may be any other impact missed by our alerts.

identified

Google Cloud has reported the issue with Google Compute Engine at https://status.cloud.google.com/incident/compute/19003 They appear to have recovered enough that we should be able to get our services to a fully functional state soon. However, because they continue to have degraded performance, we will keep an eye on the impact on our services.

investigating

We received alerts suggesting that our services in various regions are having trouble talking to each other as well as to external services. This is affecting the volume of traffic that is being successfully processed by all our endpoints under query.petametrics.com and api.petametrics.com and is also affecting the accessibility of the LiftIgniter Console. These networking issues may be due to our cloud provider. We are still investigating to mitigate the situation and assess the impact. EDIT: After more investigation we are more confident that the issues are due to our cloud provider, Google Cloud, but are still waiting for them to report the issues on their Status page https://status.cloud.google.com It looks like others have also noticed the same issues with Google Cloud; see for instance https://twitter.com/GossiTheDog/status/1135260263316381696 https://twitter.com/phineyes/status/1135259372895031297 https://twitter.com/dripstatstatus/status/1135261993055600640

Report: "CDN issues in Japan"

Last update
postmortem

Starting Wednesday, 4 July, 4 AM Japan Time (Tuesday, 3 July, noon Pacific Time), our CDN provider, MaxCDN, [had problems](https://status.maxcdn.com/incidents/7zwpqc1f581r) with their Tokyo point-of-presence, causing Japan traffic to be routed to Hong Kong, which was unable to serve the majority of traffic. The CDN serves the JavaScript files loaded by LiftIgniter through the snippet customers put on their website (all under the domain cdn.petametrics.com). Thus, most end users in Japan of customers using LiftIgniter via the JavaScript integration were unable to load LiftIgniter's JavaScript. Our query endpoint (query.petametrics.com) and inventory API endpoint (api.petametrics.com) maintained their usual availability. We estimate this affected 95-98% of impressions in Japan until we resolved the problem on Wednesday, 4 July, 12:15 PM Japan Time (Tuesday, 3 July, 8:15 PM Pacific Time). The impact on end users in Japan of customers using our JavaScript integrations was as follows: 1. No recommendation requests were being made to LiftIgniter, so LiftIgniter-powered recommendations were not being shown to these users. Customers who are requesting recommendations via API were unaffected. 2. No activities were being sent to LiftIgniter. Customers who are sending activities to LiftIgniter via API were unaffected. 3. Inventory information for these users was not being sent. However, the impact on the overall inventory would be minimal, since it would only affect newly published or updated content in the timeframe, and that content would still get updated if LiftIgniter got any events from users outside Japan. Customers sending inventory via API were unaffected. ## Event timeline 4:30 AM Japan Time (12:30 PM Pacific Time): We noticed reduced traffic for one of our Japanese customers. We verified that the JavaScript file was loading and events were firing correctly for us locally, and also saw that the site had been under maintenance overnight, so we incorrectly diagnosed the ongoing site maintenance as the main reason for the reduced traffic. When traffic levels as seen by us failed to pick up by 7 AM Japan Time, we got in touch with the affected customer, but they felt that the scheduled maintenance was the likely reason. Neither side pinpointed CDN failure in Japan. 10:30 AM Japan Time (6:30 PM Pacific Time): We received reports from two other Japanese customers about the JavaScript file not loading, and identified CDN failure in Japan as a likely cause. 10:45 AM Japan Time (6:45 PM Pacific Time): Our engineer in California and our support representative in Japan began interactive debugging. Within 5-10 minutes, we obtained diagnostic information that made it clear that the CDN service was to blame, and opened a ticket with MaxCDN. We sent debugging and diagnostic information to the MaxCDN support representative. 12:00 PM Japan Time (8:00 PM Pacific Time): MaxCDN's network engineering team identified the problem as a failure of the Tokyo point-of-presence causing traffic to get routed to the Hong Kong point-of-presence, which was getting overloaded. The MaxCDN support representative suggested that LiftIgniter disable the use of additional points of presence, so that requests would be routed to MaxCDN's core network. LiftIgniter made the change, and traffic levels were back to normal by 12:15 PM Japan Time. Our support representative in Japan and our customers also confirmed that things were working normally.
2:00 PM Japan Time (10:00 PM Pacific Time): MaxCDN posted about the outage as a [Status Page incident](https://status.maxcdn.com/incidents/7zwpqc1f581r). (By this time, LiftIgniter's customers were no longer affected because of the setting change made at 12:15). 2:30 PM Japan Time (10:30 PM Pacific Time): MaxCDN noted that a fix had been made. 3:00 PM Japan Time (11:00 PM Pacific Time): MaxCDN reported the incident as being resolved. ## System improvements to reduce incidence and minimize impact The scale of impact of the CDN outage in Japan has led us to revisit our CDN relationship as well as our alerting and monitoring framework. ### CDN redundancy and vetting Historically, LiftIgniter has not paid close attention to monitoring the uptime of our CDN service. The CDN service we've used has generally been reliable -- we have had a couple other outages in the last four years but they were resolved within minutes. However, this incident highlights how critical CDN uptime is to LiftIgniter's customers and end users, so we are going to invest more into a more redundant set of CDN solutions. LiftIgniter is moving to a multi-CDN architecture, where we have at least two CDN providers. All CDN providers will be reviewed thoroughly for uptime, latency, quality of internal monitoring, and speed of incident resolution for serving end users around the world. We will pay particular attention to the reliability of the CDN in regions with a large number of our customers and end users, in particular Japan. ### LiftIgniter's own monitoring of CDN uptime LiftIgniter has an in-house service called upcheck that monitors uptime and latency for LiftIgniter's APIs, by sending requests to these APIs from servers in three different regions. We are working to expand upcheck in two ways: 1. upcheck will now also query the CDN JavaScript files, to make sure these files are accessible. 2. upcheck will run from a wider range of geographic locations, in particular more locations in Asia-Pacific. Metrics from the expanded upcheck will be periodically reviewed, and high latencies or error or timeout rates will trigger alerts for our 24/7 on-call rotation. Expected time of completion: We expect to have expanded upcheck to include the new metrics by Thursday, 5 July, and to have the alerts in place by Friday, 6 July. Impact if these had been present prior to the outage: If both the fixes 1. and 2. were in place, we would have been able to catch the problem within minutes of it occurring, and been in touch with our CDN provider within about 15 minutes of the problem starting (so around 4:15 AM Japan Time). ### Better traffic level monitoring LiftIgniter already had some alerting around decline in traffic levels, but the alerts in place would only catch global declines in traffic rather than declines specific to one region. In light of this incident, we have improved our monitoring: 1. Rather than use absolute traffic level thresholds, we have switched to comparing traffic levels with traffic levels at the same time a day ago. This allows us to control for the daily cycle in traffic levels. 2. In addition to an alert on global traffic level, we have added alerts for traffic decline (relative to the same time a day ago) at the level of individual customers. We already had monitoring of traffic level for individual customers through daily generation of traffic anomalies, but this kind of monitoring is too slow to catch urgent issues, hence the need for the new alerts. 3. 
We have also started sending metrics on traffic levels by country to our internal metrics tool. After getting at least 24 hours of data, we will be adding alerts for low traffic level by country (relative to the same time a day ago) and also for high load time by country. Expected time to completion: 1. and 2. are already completed; we expect to finish 3. on Thursday, 5 July Pacific Time (after getting at least 24 hours of data into our metrics tool). Impact if these had been in place prior to the outage: * We verified that the alerts we set up for 2. would have triggered as a result of the outage. We would have immediately identified a list of the affected customers, narrowed down the problem to Japan, and been in touch with our CDN provider before 5:30 AM Japan Time. * We expect that with the alerts we expect to set up for 3., we could cut down the alert and response time even further: we expect that with 3. in place, we would have been in touch with our CDN provider by 4:30 AM Japan Time.
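
For readers who want a concrete picture of the day-over-day comparison used in the new traffic alerts described above, here is a simplified sketch. It is illustrative only, not our production alerting code: the metrics source, the 15-minute comparison window, and the 40% drop threshold are placeholders chosen for the example.

```typescript
// Illustrative sketch of a day-over-day traffic alert check.
// getRequestCount is a placeholder for whatever metrics store is in use;
// the 40% drop threshold and 15-minute window are arbitrary example values.

type CountFetcher = (
  customer: string,
  windowStart: Date,
  windowEnd: Date
) => Promise<number>;

const WINDOW_MS = 15 * 60 * 1000;    // compare 15-minute windows
const DAY_MS = 24 * 60 * 60 * 1000;  // against the same window one day earlier
const DROP_THRESHOLD = 0.4;          // alert if traffic fell by more than 40%

async function checkCustomerTraffic(
  customer: string,
  getRequestCount: CountFetcher
): Promise<string | null> {
  const now = new Date();
  const windowStart = new Date(now.getTime() - WINDOW_MS);

  const current = await getRequestCount(customer, windowStart, now);
  const dayAgo = await getRequestCount(
    customer,
    new Date(windowStart.getTime() - DAY_MS),
    new Date(now.getTime() - DAY_MS)
  );

  // Comparing to the same window a day earlier controls for the daily cycle,
  // unlike a fixed absolute traffic threshold.
  if (dayAgo > 0 && current < dayAgo * (1 - DROP_THRESHOLD)) {
    return `Traffic for ${customer} dropped from ${dayAgo} to ${current} vs. 24h ago`;
  }
  return null; // no alert
}
```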

resolved

We have received confirmation from affected customers that things are working normally for them and their end users, and also verified that the traffic levels continue to be similar to what they are at this time of day. We are marking the issue resolved. We plan to publish a postmortem later providing more information on the cause of the issue and additional safeguards we are putting in place to prevent a recurrence. Our CDN provider has posted some information to their own status page: https://status.maxcdn.com/incidents/7zwpqc1f581r

monitoring

At the suggestion of our CDN provider, we have disabled the problematic edge location, and are seeing a traffic increase to levels similar to those generally seen at this time of day. We also verified with our Japanese team that they are able to load the JavaScript files and events are firing normally.

investigating

Our CDN provider has confirmed that the issue is related to the Hong Kong location being unreachable for most clients, and their networking team is working to address the issue.

investigating

Our CDN provider is having issues serving our JavaScript files in Japan. As a result, we are successfully serving traffic for less than 5% of end users in Japan. The problem has been ongoing since 4 AM Japan Time on July 4, or 19:00 UTC on July 3. We are working actively with our CDN provider to resolve the issue.

Report: "API endpoint down due to DNS routing errors"

Last update
postmortem

This is a postmortem for the [inventory API outage due to DNS errors](http://status.liftigniter.com/incidents/bn42cqb18src) (api.petametrics.com) that occurred on Tuesday, October 3, Pacific Time. The endpoint was down on both HTTP and HTTPS from 3:55 PM Pacific Time (22:55 UTC) to 5:44 PM Pacific Time (00:44 UTC on Wednesday, October 4). It remained down on HTTPS from 5:44 PM to 9:23 PM Pacific Time (04:23 UTC on Wednesday, October 4). Here is the sequence of events: 1. We were migrating our API servers from an old setup (based on Kubernetes) to a new setup that would be more robust, scalable, and stable. 2. At 3:55 PM, thinking that the new servers were ready, we changed DNS routes to point to the new servers. However, due to a firewall setting error, the new servers were not accepting traffic, so both HTTP and HTTPS were failing. 3. We discovered this and reverted the DNS to point to the old servers at 5:44 PM. However, we reverted to an IP address for the old servers that works only for HTTP and not for HTTPS (essentially, instead of pointing at a load balancer with the SSL certificates, we were pointing directly at the cluster). Therefore, the HTTP endpoint worked (as verified by us) but the HTTPS endpoint did not. 4. Subsequently we fixed the firewall issue with the new servers and were ready to switch DNS back to point to the new servers. However, we did not do the switch as it was the end of the working day. Therefore, the DNS continued to point to the old servers, and HTTPS continued to not work. 5. At 9:23 PM, after receiving reports of SSL errors, we pointed the DNS at the new servers again, and everything resumed working properly. Server migrations are fairly rare events, so we do not expect this sort of issue to happen frequently. With that said, the long downtime and the issues it created for our customers have made us revisit our monitoring. Historically, our insertion API has received less monitoring attention than our other APIs. Although we do monitor for errors, we don't have any rules based on traffic volume, partly because the volume fluctuates quite wildly, so we don't have a guaranteed minimum insertion volume. We are planning to update the monitoring in two ways: 1. We are adding continuous status endpoint pings for all services in all regions. While not strictly necessary for our highest-traffic services (because volume alerts would fire if there are connection problems), this can catch issues like this inventory API outage. Moreover, by monitoring both HTTP and HTTPS endpoints, we will be able to catch issues that only affect HTTPS. 2. We are adding volume-based alerts for the inventory API. The alert will look at volume trends over a longer time period, such as an hour (since insertion volume can fluctuate wildly), so it mainly serves as a fallback to 1.
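
As a rough illustration of the continuous status endpoint pings described in item 1 above, here is a simplified sketch. This is not our actual monitoring code; the /status path and the 5-second timeout are placeholders.

```typescript
// Illustrative sketch: ping a status endpoint over both HTTP and HTTPS and
// report any scheme that fails. Requires Node 18+ for the global fetch API.
// The "/status" path and the 5-second timeout are placeholders.

async function pingStatus(host: string): Promise<void> {
  for (const scheme of ["http", "https"] as const) {
    const url = `${scheme}://${host}/status`;
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), 5000);
    try {
      const res = await fetch(url, { signal: controller.signal });
      if (!res.ok) {
        console.error(`${url} returned HTTP ${res.status}`); // would page on-call
      }
    } catch (err) {
      // Catches connection failures, including HTTPS-only breakage like this incident.
      console.error(`${url} unreachable: ${(err as Error).message}`);
    } finally {
      clearTimeout(timer);
    }
  }
}

pingStatus("api.petametrics.com").catch(console.error);
```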

resolved

This error affected only API inventory operations (insertion and deletion) and did not affect any of our model serving, activity collection, or JavaScript-based inventory collection. Due to DNS routing errors, we experienced downtime for the api.petametrics.com endpoint. For about an hour, until 5:40 PM Pacific Time, both http://api.petametrics.com and https://api.petametrics.com were not working. Later, http://api.petametrics.com was working but https://api.petametrics.com did not work for a few hours. As of 9:23 PM Pacific Time on Tuesday, October 3 (4:23 AM UTC on Wednesday, October 4), the issues are resolved. We will include more information in a postmortem.

Report: "Interruption to Real-time Recommendations"

Last update
postmortem

This is a postmortem for the [shell script-caused server outage](http://status.liftigniter.com/incidents/nxnx08sxt36j) that occurred on Monday, September 18, 2017, starting around noon Pacific Time and continuing till about 3:48 PM Pacific Time. We saw a gradual increase in error rates in our model servers across all regions. The time from serious investigation to resolution was short (after beginning an investigation at 2:51 PM we identified the cause at 3:10 PM and got most capacity back up within ten minutes, with some residual capacity taking a little longer). There were a lot of things we did right, that helped us to mitigate the impact. 1. A robust alert system, including notification and escalation policies. 2. Our technical account manager updated the LiftIgniter status page, proactively notified customers significantly affected, and responded to customer questions about the issue. 3. Our engineering team came together quickly to diagnose and mitigate the problem. There were a few things that had scope for improvement. 1. The shell scripting change that caused the outage should have been more thoroughly tested, and its impact more closely monitored. Also, our servers should have been more robust to this issue, detecting a problem at startup and failing to launch rather than launching without being able to serve traffic. 2. Some aspects of our alerting system weren't working as expected. Specifically, alerts from two of the three regions did not get thrown immediately because the alerting servers were having transient issues. 3. Our first response (the first alert was triggered at 1:51 PM) did not correctly identify the seriousness of the problem. If it had, we would have been able to resolve the issue an hour earlier, significantly reducing customer impact. Fixes for all these issues (including updates to deployment and alert response protocols, server settings, and code fixes) have either been pushed already or are in the pipeline for the next couple of days.

resolved

Normal service levels have been fully restored. A post-mortem will be made available soon.

monitoring

Our recommendation servers experienced a brief outage today, beginning with degraded service around 12 PM Pacific. Symptoms included an increased usage of backup recommendations rather than real-time, personalized recommendations. Server-side (API-based) integrations may have been affected more severely than customers using Javascript-based integrations due to differences in backup methods. We have pushed a fix as of 3:10 PM Pacific, and expect service to be restored to normal in the next few minutes.

Report: "Elevated 503 errors for recommendation requests"

Last update
postmortem

This is a postmortem for the [database-caused server outage](http://status.liftigniter.com/incidents/whcp28rx2sg2) on Friday, August 25, 2017, starting around 1:06 PM Pacific Time (20:06 UTC) and ending 3:15 PM Pacific Time (22:15 UTC). We saw a gradual increase in error rates on our model servers. The error rate was initially quite low, but around 1:40 PM it reached a threshold sufficient to trigger our alerts. By 1:45, our on-call engineer was investigating the issue, with two other engineers assisting with the investigation. As the issue continued to grow in scope, we roped in more and more of the engineering team, so that the majority of the engineering team was working on mitigation by 2:30 PM. We discovered and fixed the issue at 3:05 PM and server capacity returned to normal by around 3:15 PM. There were a lot of things we did right that helped us mitigate the impact. 1. A robust and functioning alerting system triggered alerts so that our engineers were on the issue quickly. 2. Our technical account manager updated the LiftIgniter status page and responded to customer questions about the issue. 3. Our engineering team came together quickly, interrupting their other activities, to mitigate the problem. There were three things we did wrong: 1. Corruption in the database, introduced several months ago, that resulted in an issue when a related setting was fixed. 2. A non-robust and single-point-of-failure way of reading the database, where a problem with the database led to the entire serving architecture having problems. 3. In hindsight, our path to discovering the cause of the issue could have been faster. ## Database corruption Sometime around three months ago, it appears that an empty row with a non-existent organization was added to our organizations database. This empty row may have been added through erroneous development code or manually through an interactive user interface. The empty row had no effect at the time it was added, because the code that we use to interface with the database automatically filters out rows where some fields are missing or null. However, today, in the process of cleaning up customer status, we marked as "disabled" all organizations in our database that were not current customers or soon-to-start customers. As part of this cleanup, the empty row was also marked as disabled. As explained below, this was the cause of the problem. What we could have done better: * At least some of us had noticed the empty row in the database previously. We should have deleted it earlier rather than letting it sit. * The customer data cleanup that caused the crash should have been done in a reversible fashion, so it could be reverted with a single command rather than having to do a snapshot restoration (which takes time). ## Non-robust reading architecture It turns out that the place in our code where we read the list of disabled organizations is less robust than the rest of the code, and it ended up throwing an error. Thus, within minutes of the database update, our frontend servers that were seeking updates from the database were throwing errors. These errors led to servers shutting down and restarting, and over time, the number of available servers in each region declined, trending toward zero. As with many other companies, the database of organizations is a single point of failure. From a hardware perspective, we mitigate this by having replicas in each region, with secure replication.
From a code perspective, however, we were not as robust as we needed to be. What we could have done better: * We should have had greater robustness in all parts of the database-reading code. * We should have had a better understanding of what each error means, and a faster way of pinpointing the exact cause of the error. Precious minutes were spent reviewing the code, adding extra logging, starting up a local interactive environment, and trying more specific queries to get to the root cause. If we had had those things in place beforehand, we would have gotten to the precise point of failure quickly. ## Discovery process for the error We started with three engineers looking at the issue, and ended with six. Why did it take us 80 minutes to get to the root of the problem, or to find any other way to mitigate it? Part of the reason was that we were investigating several hypotheses that seemed to us more likely than the cause we eventually found. Part of the reason was that the mitigation strategy we had in place, namely restoring the database to an older version (which we did get started on), would have taken slightly longer than root cause detection. In hindsight, there were a few things we could have done better: * We might have spent too much time following hunches that were not supported by evidence, scarred by past experience; for instance, investigating dependency issues (which have plagued us in the past) despite there being no recent code push, and investigating master-slave replication issues (which we should have been able to rule out more quickly). * After identifying the SQL query that was crashing, we should have connected it more quickly with the engineer action of marking the empty row as disabled. * Some of our engineering time was not used efficiently during the investigation process. Specifically, we did not parallelize the investigation sufficiently. If an engineer had been assigned to review the database for oddities earlier on in the debugging process, we could have caught the error earlier.
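
To illustrate the kind of robustness we are describing, here is a simplified sketch of reading organization rows defensively, skipping and logging malformed rows instead of letting one bad row take down serving. The row shape and field names are hypothetical and do not reflect our actual schema or code.

```typescript
// Illustrative sketch: tolerate malformed rows when loading disabled organizations.
// The row shape and field names here are hypothetical.

interface OrgRow {
  id?: string | null;
  name?: string | null;
  disabled?: boolean | null;
}

function loadDisabledOrgIds(rows: OrgRow[]): Set<string> {
  const disabled = new Set<string>();
  for (const row of rows) {
    // A row with missing required fields (like the empty row in this incident)
    // is logged and skipped instead of being allowed to throw during serving.
    if (!row.id || !row.name) {
      console.warn("Skipping malformed organization row:", JSON.stringify(row));
      continue;
    }
    if (row.disabled) {
      disabled.add(row.id);
    }
  }
  return disabled;
}

// Example: the empty row is skipped, and the valid disabled org is returned.
console.log(loadDisabledOrgIds([
  { id: "org-1", name: "Example Org", disabled: true },
  { id: null, name: null, disabled: true }, // the problematic empty row
]));
```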

resolved

We are confident that this incident is now fully resolved, and the appropriate steps have been taken to prevent a recurrence. A full postmortem will be available later today.

monitoring

The source of the outage has been found and corrected and performance has now returned to normal levels. We are monitoring closely and will have a postmortem available soon.

identified

We have confirmed that this incident is impacting both requests for recommendations as well as activity data and inventory collection, starting today, 8/25, at approximately 1:41 PM Pacific (20:41 UTC). We have identified the source of the issue, and are working to resolve as soon as possible.

investigating

We are currently experiencing an elevated rate of 503 errors. We are investigating now and will provide more information on the extent of the issue as we find out more.

Report: "Increased 5XX error rates across regions"

Last update
postmortem

This is a postmortem for the bad deployment that occurred on June 30, 2017 Pacific Time (July 1, 2017 UTC). While we at LiftIgniter are disappointed by the incident, we have also found a lot of important lessons to take from it, many of which were not obvious to us as we were designing our alerting and monitoring systems. We hope this postmortem will help our customers understand what happened, and also provide interesting pointers to other people who deploy and regularly update large-scale infrastructure. Broadly, the problems we experienced were three-fold: 1. Deployment problem: The error that arose due to the deployment was infrequent enough to not show up in local testing and not immediately affect production systems upon deployment, but it was frequent enough to affect systems significantly after a few hours of deployment. This "sweet spot" is a particularly bad place to be. Identifying deployments that potentially fall in this sweet spot, and instituting additional safeguards for such deployments, could reduce the risk of what happened by a lot. 2. Monitoring problem: We have threshold-based alerts for error rates. We also have autoscaling (with autohealing) so that bad servers are automatically removed and replaced with good servers. And we have global load balancing that redirects traffic from bad regions to good ones. All of them are great ideas. But in conjunction, they mean that intermittent, sporadic problems keep getting "fixed" automatically before they trigger our monitoring thresholds. Moving beyond threshold-based alerts, or figuring out thresholds that take into account the corrective effect of autoscaling and global load balancing, would have helped trigger more alerts and solve the problem faster. 3. Timing problem: The deployment was done on a Friday (before the Fourth of July weekend), and its bad effects started to be seen only as we were closing the workday. Therefore, it took us more time to notice and resolve the issue than it otherwise would have. ## Deployment problem The deployment had the following characteristics that made it particularly hard to detect: 1. The problem arose from race conditions that would occur only under production loads, and wouldn't show up in local testing. Even under production loads, it would show up only after some time, and nondeterministically. So although we ran a few servers for a while with the new code before deploying to all, we did not catch the problem. 2. Even when the problem did arise, it did not take the server down immediately. Rather, it just increased load on the CPU, which meant that more queries were being rejected or were timing out due to server overload. This again is something that would not be detected in local testing (as it requires a production load) and could even be missed in production because it happened only on some queries. 3. The deployment was done in two stages. The first stage introduced the possibility of a race condition, but it was fairly rare. Since no problems were noticed in the first stage of deployment, our second stage of deployment made some configuration changes. These configuration changes significantly increased the probability of race conditions, but we did not even think of the second stage as a major change, so we did not monitor that deployment as closely. In other words, the deployment fell at the "sweet spot" where it was hard to detect locally or on initial deploy, but it was also frequent enough to actually hit hard after a few hours of running.
Looking at the history of bad deploys, we see that there are some parts of our serving logic that tend to dominate this sweet spot. Specifically, all logic that involves dealing with file streams and connecting to external services to push data tends to have this sort of behavior, exhibiting problems only after hours of operation. More generally, anything involving thread safety and concurrency can pose such problems, even if we think we've handled all the issues in our code and can't detect problems locally. In the future, any deployments that touch such logic will be subject to more thorough code review and longer post-deployment monitoring. ## Monitoring problem In terms of alerting, monitoring, and auto-recovery, there are three components of relevance here: 1. Threshold-based alerts: Threshold-based alerts are alerts like: "The ratio of 503 error codes to 200 response codes over a moving 3-minute window is over 5%, for a total of 15 minutes". We classify threshold-based alerts based on urgency. 2. Autoscaling: Our serving infrastructure is inside autoscaling groups. The group scales capacity up and down based on both request volume and CPU utilization. It also replaces servers marked unhealthy. 3. Global load balancing: Our global load balancer balances traffic between our autoscaling groups, which are spread across regions. When all systems are healthy, traffic is directed to the region closest to the query. However, if one of the autoscaling groups has very few healthy instances, traffic that would have gone to that group is routed to other autoscaling groups. The three components interact. In particular, 2 and 3 can sometimes suppress 1. A server that's starting to get unhealthy and is about to fire a threshold-based alert could (and in our case, did) get killed off by the autoscaler. Unfortunately, we don't always revisit the alerting logic in 1 when we make policy changes in 2 and 3. There is also a natural challenge posed to alerting systems by intermittent problems. The specific problem we had would happen suddenly on a server that was running healthily, then build up to a crescendo for a few minutes and cause the server to be killed by the autoscaling group. In many cases, the window wasn't long enough to trigger our alerting. In one case, the alert did get triggered but was immediately marked as resolved. Also, 3 in particular can (and did) cause cross-regional cascading of problems. What we noticed was that the different regions were "taking turns" exhibiting problems. When one region got in trouble, the global load balancer would redirect its traffic to another region, and the first region would recover gradually, in time to take traffic back as the second region started developing problems. Our main lesson is to take into account interactions between the different alerting, monitoring, and recovery tools so that they don't interfere with each other's proper functioning. Relatedly, whenever we make changes in one piece, we should rethink how it affects the other pieces. In particular, simulating how metrics might change as autoscaling and global load balancing work their way through the system, and setting thresholds based on that, would allow us to get a fuller set of alerts when a problem like this one happens. Another area we are exploring is alerts that go beyond threshold-based alerting, to smarter anomaly detection (something that [Netflix](https://medium.com/netflix-techblog/tracking-down-the-villains-outlier-detection-at-netflix-40360b31732) has talked about).
This is a difficult problem, and a different kind of machine learning than what we do at LiftIgniter, so we've previously thought of it more as an intellectual curiosity. Now we see it as something that might offer a real solution to our monitoring challenges. ## Timing problem We timed the push somewhat inconveniently: the problems started emerging on a Friday evening, just before the Fourth of July weekend. That timing, combined with the fact that we did not get as many alerts as we expected, meant that we didn't notice the issue until a couple of hours later. However, we did resolve the issue pretty quickly once we started looking into it, at 9 PM Pacific Time. The timing problem isn't fully separable from the deployment and monitoring problems. It's related to the deployment problem: if we had identified this deployment as one that is likely to develop problems after hours of running in production, we would have chosen a different time for it. It's related to the monitoring problem: if a larger number of alerts had fired, we'd have gotten to it quickly, regardless of the time. Nonetheless, we are using this occasion to review our on-call rotation practices and response protocols, including on-call engineer veto of deployments at times when it's inconvenient to notice and fix issues.
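
For concreteness, here is a simplified sketch of the kind of threshold-based alert quoted in the monitoring section above (a 503-to-200 ratio over a moving 3-minute window, sustained for 15 minutes). It is one possible reading of that rule, not our alerting system, and it shows how auto-recovery can reset the counter before the alert ever fires.

```typescript
// Illustrative sketch of a threshold-based alert: the 503:200 ratio over a
// moving 3-minute window must stay above 5% for 15 consecutive minutes to fire.

interface MinuteCounts {
  ok200: number;
  err503: number;
}

const WINDOW_MINUTES = 3;
const RATIO_THRESHOLD = 0.05;
const SUSTAINED_MINUTES = 15;

// `history` holds one entry per minute, oldest first.
function shouldAlert(history: MinuteCounts[]): boolean {
  let minutesOverThreshold = 0;
  for (let end = WINDOW_MINUTES; end <= history.length; end++) {
    const windowCounts = history.slice(end - WINDOW_MINUTES, end);
    const ok = windowCounts.reduce((sum, m) => sum + m.ok200, 0);
    const err = windowCounts.reduce((sum, m) => sum + m.err503, 0);
    const ratio = err / Math.max(ok, 1);
    // If autoscaling replaces a bad server mid-window, the ratio drops and this
    // counter resets to zero before it reaches 15 minutes, suppressing the alert.
    minutesOverThreshold = ratio > RATIO_THRESHOLD ? minutesOverThreshold + 1 : 0;
  }
  return minutesOverThreshold >= SUSTAINED_MINUTES;
}
```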

resolved

Our servers have been stable for two hours, and we have had an initial team discussion identifying areas of improvement with our deployment, monitoring, and resolution practices. We are marking the issues as resolved.

monitoring

We experienced increased 5XX error rates starting around 5:45 PM Pacific Time on June 30, 2017 (00:45 UTC on July 1, 2017). The issue was identified as due to a bad code push that caused elevated CPU loads on the servers and intermittent crashes. The code push was reverted as of 9:08 PM Pacific Time on June 30, 2017 (04:08 UTC on July 1, 2017). We have verified that errors have stopped and will continue to monitor the servers for the next few hours.

Report: "CDN issues in South and Southeast Asia"

Last update
resolved

The metrics have been stable after disabling edge locations, and the problem should not reappear as long as we keep edge locations disabled. We are marking the incident as resolved.

monitoring

Through our automated alerting, we were notified of CDN connectivity issues for our primary CDN from South and Southeast Asia, and a corresponding drop in pageviews seen by our backends. The problem appears to have started around 2018-07-31 02:20 UTC, though we have not been able to pinpoint the precise time due to obfuscation of metrics by DNS and content caching. We identified the likely cause as a misbehaving edge location. To address this, we temporarily disabled edge locations at around 2018-07-31 02:50 UTC. We reached out to our primary CDN provider for further investigation. After disabling edge locations, our CDN alerts and pageview volume drop alerts have resolved, so the immediate problem is mitigated. Affected countries include India, Malaysia, Taiwan, and the Republic of Korea. Other countries in South and Southeast Asia (and nearby) were likely also affected. Japan (which had experienced issues in a previous incident, see http://status.liftigniter.com/incidents/5xy6fjyc1rm4) was not affected. The connectivity issues were not experienced by all end users in these regions; in particular, our automated CDN failover health check did not detect a problem with the primary CDN. Due to client-side caching, many users in these regions would have seen no issues.
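
As an illustration of the kind of automated CDN check behind these alerts, here is a simplified probe sketch. It is not our actual upcheck or failover health check, and the file path shown is a placeholder.

```typescript
// Illustrative sketch: fetch a CDN-hosted JavaScript file and record its
// status, size, and latency, as a region-local probe might. Requires Node 18+
// for the global fetch API. The file path is a placeholder.

async function probeCdn(url: string): Promise<void> {
  const started = Date.now();
  try {
    const res = await fetch(url, { headers: { "Cache-Control": "no-cache" } });
    const body = await res.text();
    const latencyMs = Date.now() - started;
    console.log(`${url}: HTTP ${res.status}, ${body.length} bytes, ${latencyMs} ms`);
    // A real probe would push these numbers to a metrics store and alert
    // per region on non-200 status, abnormal latency, or abnormal size.
  } catch (err) {
    console.error(`${url}: fetch failed: ${(err as Error).message}`);
  }
}

probeCdn("https://cdn.petametrics.com/REPLACE_WITH_SNIPPET_PATH.js").catch(console.error);
```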

Report: "Backend datastore node having issues, cold-restarting. Affects inventory API server and recommendation quality"

Last update
resolved

All nodes have been moved to the new, safer, more minimal instance template, and data has been restored. We have also confirmed that the specific issue that caused node crashes is no longer occurring. We have also put together an improved set of best practices around both peacetime capacity changes and emergency responses, so as to minimize data loss. Our cloud provider also created a public issue tracker for their underlying issue at https://issuetracker.google.com/issues/111753610

monitoring

We are continuing to monitor for any further issues.

monitoring

We have confirmed that there is no data loss, and the services are running fine; however, we lack our usual storage capacity buffer right now. Our cloud provider has confirmed ongoing issues on their side that caused our problems, and an ongoing investigation. In the meantime, they have provided us with guidance on working around the issue to reprovision capacity. We are working on that reprovisioning, and will mark the issue resolved when the reprovisioning is completed.

identified

The problems turned out to be more serious than expected, with additional nodes affected; we are getting in touch with our cloud provider for more diagnosis and resolution around the issues. For now, the bad nodes have been removed; due to data replication we expect data loss to be minimal and expect to recover data through our standard recovery procedure.

identified

Starting 10:20 AM PDT (17:20 UTC) on Friday, July 13, one of our datastore nodes in one region started misbehaving. We received notifications for and began addressing the issue within ten minutes. The node is currently doing a cold restart and we expect it to be back up by 11:45 AM PDT (18:45 UTC). No data has been lost; however, until the node comes fully back up, some data will appear to be missing or unavailable. This has impact on the following services: - Inventory API errors: Customers using the GET, POST, and DELETE endpoints of our inventory API will see error rates and their intended actions may not complete. - Degraded recommendation quality: We'll continue to return recommendations; there is no effect on the overall error rates of our model servers. However, the quality of the recommendations will be degraded and latency will be higher due to the difficulty retrieving all the necessary data to make a great recommendation.

Report: "Services giving 502 errors and slow responses globally"

Last update
resolved

We have verified that things are working normally, and also updated our internal documentation to streamline the recovery process if a similar issue occurs in the future. We are marking the issue as resolved.

monitoring

We have verified that services are working fine now; a few users with cached DNS may continue to see issues till 22:00 UTC on July 17, but everything should be fine after that. Some complications arose because of edge cases in the fallback methods we used and the manual switching around of routes, which led to additional issues, but we have resolved everything. We'll be assembling internal documentation on the route switching and recovery process to avoid the complications and to have a speedier response in the future.

identified

After Google reported at https://status.cloud.google.com/incident/cloud-networking/18012 that they had resolved the problem on their end, we switched back to the Google Cloud public load balancer. We are seeing better performance in most regions but are continuing to see some issues in Australia. We are investigating those issues.

identified

We have noticed a resurgence of high client timeout rates in Europe, but everything seems normal elsewhere. We are continuing to investigate.

identified

Here are some more details on the problem and the fixes we are making. The problem: Google Cloud's public load balancer is having networking issues. We use the public load balancer for query.petametrics.com, spi.petametrics.com, console.liftigniter.com, and our other services. Our fixes: (1) We have updated query.petametrics.com to directly point to our Nginx servers in various regions via Route 53. However, there is a 3-hour DNS cache so users who have cached DNS may continue to see issues. We recommend that users bust their DNS cache. (2) We have also updated query1.petametrics.com to directly point to our Nginx servers in various regions via Route 53. This has a 1-minute TTL, so should be immediately effective. Thus, even for users who have query.petametrics.com with a bad cached record, JavaScript model queries will automatically retry with query1.petametrics.com. (3) We are pushing our browser-client (our JavaScript) to choose query1.petametrics.com as the primary query and activity server, to get around the 3-hour DNS caching limit (the JavaScript cache-busts at the turn of each hour). With the three fixes in place, impact on JavaScript customers should be effectively nullified. For API customers: (a) To make query.petametrics.com work, you may need to force DNS cache busting. (b) api.petametrics.com might have issues; unfortunately we don't have a good setup to directly point to the servers.
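
To illustrate the fallback behavior described in (2) and (3), here is a simplified sketch of a client that tries the primary hostname and retries against the short-TTL alternate hostname on failure. This is not the actual JavaScript snippet we ship; the request path, payload, and retry policy are placeholders.

```typescript
// Illustrative sketch of the fallback pattern: try the primary hostname,
// then retry against a short-TTL alternate hostname if the request fails.
// Not the production snippet; the path and payload are placeholders.

const PRIMARY = "https://query.petametrics.com";
const FALLBACK = "https://query1.petametrics.com"; // 1-minute DNS TTL

async function queryWithFallback(path: string, body: unknown): Promise<Response> {
  for (const base of [PRIMARY, FALLBACK]) {
    try {
      const res = await fetch(base + path, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(body),
      });
      if (res.ok) return res;
      // Non-2xx response: fall through and try the alternate host.
    } catch {
      // Network/DNS failure (e.g. stale cached DNS): try the alternate host.
    }
  }
  throw new Error("Both primary and fallback query hosts failed");
}

// Usage (placeholder path and payload):
// queryWithFallback("/REPLACE_WITH_QUERY_PATH", { example: true });
```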

identified

We are continuing to work on a fix for this issue.

identified

Our cloud provider, Google Cloud, has publicly declared itself as having networking issues: https://status.cloud.google.com/ They have an incident page at https://status.cloud.google.com/incident/cloud-networking/18012 In the meantime, we are doing some rapid rerouting to minimize the impact of this change. Unfortunately, not everything will fully return to normal in the process due to DNS caching, but the majority of users should be able to access the recommendations and send activities.

identified

We've identified that our model-server, api-fe, and email services were giving 502 errors in US West. We believe that this is a problem at the level of our cloud provider, because our dedicated regional endpoints are working. Initially, both the plain HTTP and HTTPS endpoints were down; the plain HTTP endpoint is back up, but the HTTPS endpoint continues to be down. The problem began at 19:15 UTC on July 17.

Report: "Inventory API server having intermittent issues"

Last update
resolved

A fix was pushed last night and the problem has not recurred since then. We are marking this issue resolved.

identified

We have identified intermittent blackouts of our inventory API server, lasting a few minutes each, starting about 1 hour ago, on 2018-07-27 02:54 UTC. The problem appears probabilistic and due to specific load characteristics that are not usually seen. We have identified a configuration update that should prevent the problem, and are pushing it right now.

Report: "Inventory API server having intermittent issues"

Last update
resolved

This incident has been resolved.

monitoring

The issue is no longer ongoing, but may recur as we experiment with long-term fixes. We will be closely monitoring the rollout of the long-term fixes to make sure the issue does not have a significant production impact even if it does recur.

identified

Our inventory API server is having intermittent slowness and blackout issues. We believe we have identified the root cause. We have implemented an immediate mitigation mechanism to minimize the production impact, and will work on implementing a more long-term fix within the next week. This issue only affects customers who are inserting inventory via API.

Report: "JavaScript errors on Android 4.0-4.3 after code push adding enhanced security measures to queries"

Last update
resolved

The script errors have been resolved with the push we previously mentioned. We are now marking this as resolved.

monitoring

We have pushed a fix for the affected end users, and are seeing error rates drop. We expect to see the error rates reset to zero at 2018-05-16 16:00 PDT (23:00 UTC) when the JavaScript cache busts.

identified

We were alerted about an increase in JavaScript error rates, with the error INVALID_STATE_ERR: DOM Exception 11 for recommendation requests on Android 4.0-4.3 devices. The errors were caused by a change we made to enhance request security by adding "withCredentials": "true" to the requests for recommendations, as an additional layer to prevent third parties from abusing our recommendations. This in turn resulted in a lot of timeouts in requests for recommendations, because the actual request failed to execute. As a result, some of our customers may have seen high client-side timeout rates, although error rates on our servers did not go up. We are temporarily reverting the change while we investigate a better fix.
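
For context, the change amounted to enabling credentialed cross-origin requests on the recommendation call. Below is a simplified browser-side sketch of that kind of change, with a defensive guard around the property assignment. This is illustrative only: it is not our shipped snippet, the endpoint and payload are placeholders, and the guard shown is one possible mitigation rather than the fix we will ultimately push.

```typescript
// Illustrative browser-side sketch: enable credentials on a cross-origin
// XMLHttpRequest, but guard the assignment so that an exception on older
// browsers (the incident above saw INVALID_STATE_ERR on Android 4.0-4.3)
// does not prevent the request from being sent at all.
// The endpoint URL and payload are placeholders.

function requestRecommendations(url: string, payload: object): void {
  const xhr = new XMLHttpRequest();
  xhr.open("POST", url, true);
  try {
    // The security hardening: send credentials with the request so the backend
    // can apply an extra layer of abuse prevention.
    xhr.withCredentials = true;
  } catch {
    // Some older WebKit-based browsers may throw here; proceed without
    // credentials rather than failing the recommendation request outright.
  }
  xhr.setRequestHeader("Content-Type", "application/json");
  xhr.onload = () => console.log("status", xhr.status);
  xhr.onerror = () => console.error("request failed");
  xhr.send(JSON.stringify(payload));
}

requestRecommendations("https://query.petametrics.com/REPLACE_WITH_QUERY_PATH", { example: true });
```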

Report: "Inventory API server giving 404s for about 50% of servers for 40 minutes"

Last update
resolved

For approximately 40 minutes (2018-05-10 21:55Z to 22:35Z), we were returning 404 errors for the POST api.petametrics.com/v1/inventory, POST api.petametrics.com/v1/inventory/update, and DELETE api.petametrics.com/v1/inventory endpoints. The error was due to a code push with incorrect routing configurations for endpoints, and at its peak it affected about half the servers (the ones with the updated code). We pushed a fix immediately after noticing the problem. We did not get alerted about this error because it did not affect the @status endpoint, and the error occurred at a place in our code before the part where we send error metrics to our backend error reporting. An alert for low request volume *would* have been triggered if the rollout had finished with the bad code. However, we caught the problem even before the low-volume alert triggered due to monitoring of the request rate by instance for the instances with the new code.

Report: "Spike in model-server 5XX errors in Europe West due to load balancing configuration changes"

Last update
resolved

The new load balancing system has been stable for a while and we have updated our monitoring for it.

monitoring

As a result of a rollout of changes to load balancing for model-servers, we experienced a brief spike in 5XX errors in Europe West a few hours after the rollout. We were monitoring the system in the few hours after the rollout, so we were able to catch the error spike and address the root cause quickly. Although this specific error has been addressed, we are continuing to monitor the behavior of the new configuration to be able to quickly react to other issues with the configuration. The errors began occurring from 22:27 UTC on 2018-02-14 and resolved by 01:45 UTC on 2018-02-15. The peak error rate observed was 4%.

Report: "JavaScript code compatibility issues with IE 11 and old versions of Chrome (41 and below)"

Last update
resolved

We have fixed a JavaScript code change that was causing our JavaScript to not work on IE 11 and Chrome versions 41 and below. The error on IE 11 was: SCRIPT1046: Multiple definitions of a property not allowed in strict mode. The error on old versions of Chrome was an uncaught SyntaxError. The JavaScript code change was made on Sunday, January 14, but deployed to all customers on Friday, January 19. The change was reverted on Friday, January 26, at 7:30 PM Pacific Time (Saturday, January 27, 03:30 UTC), and should roll out to all end users starting Friday, January 26, at 8 PM Pacific Time, with the hourly cache busting. The root cause of the error was a naming conflict between two functions with very similar functionality, introduced during a renaming of one of the two functions. The error was not caught because everything worked normally on modern browsers, and pushing the error to production caused no noticeable impact on total traffic. Fixing the error also had no noticeable impact on total traffic.
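
We have not published the exact conflicting code, but the error messages quoted above are characteristic of duplicate property definitions under ES5 strict-mode semantics, which ES2015-era engines no longer reject. The example below is hypothetical and only illustrates that general class of error; it is not our actual code.

```typescript
// Hypothetical illustration (not our actual code) of the class of error:
// under ES5 strict-mode semantics (IE 11, Chrome 41 and below), an object
// literal that defines the same property name twice is a syntax error,
// whereas ES2015-era engines accept it. After a renaming, keeping the two
// similar functions under distinct keys avoids the collision.

"use strict";

const handlers = {
  // If this key were also named "onActivity" (a rename gone wrong), IE 11
  // would reject the whole script with:
  //   SCRIPT1046: Multiple definitions of a property not allowed in strict mode
  // and old Chrome versions would throw an uncaught syntax error on load.
  onActivityLegacy: function (name: string): void {
    console.log("tracked (legacy path)", name);
  },
  onActivity: function (name: string): void {
    console.log("tracked", name);
  },
};

handlers.onActivity("pageview");
```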

Report: "Model server availability issue in US West"

Last update
resolved

We are marking this issue as resolved because we are no longer seeing the extreme latencies for the last several minutes, and are more confident that it was a cloud provider networking issue. If we get additional information from the cloud provider that changes our diagnosis, we will update here.

identified

We are no longer seeing the extreme ping latencies. Also, based on initial investigation, the problem appears to be in the networking of our cloud provider rather than our model server system; however, we do not yet have confirmation of this from the cloud provider. We are opening a ticket with the cloud provider and keeping an eye on the metrics.

investigating

We have received an alert suggesting significantly degraded ping latency for our model-server system in US West. We are investigating to determine the seriousness of the issue and mitigate it as soon as possible. We will keep you posted.

Report: "Degraded response times for model servers in Europe West and intermittent availability issues with inventory API server"

Last update
resolved

We are closing this issue as both systems (model servers in Europe West and the inventory API) have been stable since our fixes were pushed.

monitoring

We have pushed some more long-term fixes for the issue. We expect to mark it as resolved if no further issues occur for the next four hours.

monitoring

We also experienced a recurrence of the issues on the model servers in Europe West. We have pushed a mitigating fix and are working on a more long-term solution.

monitoring

The model servers in Europe West have been stable for a while. The inventory API endpoint, however, had another brief outage of a few minutes but automatically recovered. We will continue to investigate and tweak the settings till both issues are fully resolved.

monitoring

Services appear to be operating in the usual way now. We will monitor for the next few hours for any recurrence of the issue, and then close the incident.

identified

Recent changes in our model serving infrastructure, combined with traffic conditions, led to degraded response times for the model servers in the Europe West region. You may have experienced queries taking a long time to complete, or even status check pings taking several seconds to complete. We also experienced brief periods of 502 bad gateway errors for our inventory API, for the same reason. The first of the alerts triggered at 5:59 AM UTC on December 23, which is 9:59 PM Pacific Time on December 22; the issues appear to have reached a serious level about two minutes before that. The changes have been reverted and the reverted changes are rolling out to all servers. Model serving in other regions was unaffected because the changes only applied to parts currently being used only in Europe West.

Report: "API endpoint down likely due to firewall error"

Last update
resolved

The servers are now operating normally.

identified

Due to a firewall configuration change, our inventory API endpoint api.petametrics.com has been down since 11:30 PM Pacific Time on Friday, November 30 (07:30 UTC on Saturday, December 1). We got an alert about this error and quickly identified the cause as the firewall configuration change. We are fixing the firewall configuration and the service should be back up in a few minutes. Our query.petametrics.com endpoints for model requests and sending activities have been functioning normally and are unaffected. Thus, only customers doing inventory API operations are affected. Affected customers would receive a 502 Bad Gateway error for inventory API operations made while this issue is ongoing.

Report: "Slight degradation of model server performance in US-East"

Last update
resolved

All issues have been resolved as of approximately 5 PM Pacific Time today, with the majority of the catchup happening by 2 PM Pacific Time.

monitoring

We have removed the bad node. Most of the performance degradation has resolved itself. We are waiting for some backlogs in cross-region data syncing (that began due to this issue) to resolve and expect this will happen in about an hour.

identified

We are currently experiencing higher than normal latency within our system's back-end data retrieval operations in the US-East region only. A slight increase in error rates began at 5:20 AM Pacific. We have identified the cause of the problem as one bad node in a backend system that our servers connect to. We are working on fixing the problem with the node and/or restarting it. About 1/7 of data retrieval operations of one specific type are affected. Impact: - Slightly higher average response times for recommendation requests - Low to moderate impact on recommendation quality - because of the higher latency we may not be able to fetch some candidate item data from storage within the time allowed, and thus cannot include them in the recommendations Next Steps: We are working to resolve the underlying issue behind the latency, and will post further updates soon.

Report: "JavaScript Inventory Insertion - Processing Delays"

Last update
resolved

Starting at 12:01 AM Pacific Time on October 19, 2017, the servers dedicated to JavaScript-based inventory processing were unable to retrieve the latest version of Java, which prevented them from starting and processing items. This was resolved at 7 AM, but briefly resurfaced from 10:15 to 10:45 AM. API inventory processing was unaffected. Our model serving was also unaffected, except to the extent that we missed out on new inventories during those times. If you have any questions, please feel free to contact support@liftigniter.com. If you have strong requirements around inventory insertion reliability and immediate feedback about the success or failure of insertions, we recommend switching to the inventory API: https://liftigniter.readme.io/docs/inventory#setting-up-automated-inventory-updates.
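
For customers considering the switch, the main benefit of the API path is the immediate per-request success or failure signal. Here is a minimal sketch of checking that signal; it is illustrative only, and the request body, authentication, and field names are placeholders. Please refer to the documentation linked above for the actual request format.

```typescript
// Illustrative sketch only: the exact request body, authentication mechanism,
// and field names for the inventory API are documented at the link above and
// are not reproduced here; everything below is a placeholder. The point is the
// immediate feedback: each call returns a status you can check and retry on,
// unlike fire-and-forget JavaScript insertion.

async function insertInventoryItem(item: Record<string, unknown>): Promise<void> {
  const res = await fetch("https://api.petametrics.com/v1/inventory", {
    method: "POST",
    headers: { "Content-Type": "application/json" }, // auth omitted; see docs
    body: JSON.stringify(item), // placeholder payload shape
  });

  if (!res.ok) {
    // Immediate, per-request feedback: log, alert, or retry with backoff here.
    throw new Error(`Inventory insertion failed: HTTP ${res.status}`);
  }
}
```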

Report: "JavaScript inventory insertion processing delays"

Last update
resolved

By fixing our dependency version, we were able to resolve this issue. All inventory should be processing normally now.

identified

Since Wednesday, August 30, we have noticed that for small time intervals, the job that processes JavaScript-based inventory insertions has trouble processing, and then resumes processing (and catches up on its backlog) after a little while. While the time intervals are usually less than an hour, we have seen one case where the time interval was six hours, and another case where it was two hours. What this means: - None of the JavaScript-inserted inventory is permanently lost - However, some JavaScript-inserted inventory that was first seen during the times our job was throwing errors is delayed by a few hours for insertion. The issue appears to have been caused by upgrading a dependency to a version that is more error-prone. We have engineers actively working on resolving the issue, and will update this status with more information. If you have time-sensitive inventory insertion requirements, we recommend the inventory API instead of JavaScript insertion. If you have questions or concerns, please email support@liftigniter.com

Report: "Slight degradation of model server performance in US-West"

Last update
resolved

Performance has reverted to normal. We already have improvements in our engineering schedule that would reduce the risk of these problems, that we expect to finish in the next few weeks.

identified

All datastore nodes are operational now. However, due to some data rebalancing that needs to happen, we expect to continue to see slightly degraded performance for the next few hours. That said, error rates are already lower than they were when nodes were unavailable, and they should continue to decline.

identified

All our old nodes are now successfully restarting. We expect the restarts to finish within 20 minutes, and we expect that error rates will start going down in around 30 minutes.

identified

We are currently experiencing higher than normal latency within our system's back-end data retrieval operations in the US-West region only. A slight increase in error rates began at 11:30 AM Pacific, and an alert for model server latency was triggered at 12:30 PM Pacific. To compensate for the degraded datastore connectivity, our model servers are automatically caching inventory data for longer intervals so that datastore lookups are not required as often. Impact: - Slightly higher average response times for recommendation requests - Low to moderate impact on recommendation quality - because of the higher latency we may not be able to fetch some candidate item data from storage within the time allowed, and thus cannot include them in the recommendations - GET /inventory requests to the API may fail even though the item does exist in the inventory Next Steps: We are working to resolve the underlying issue behind the latency, and will post further updates soon.
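
For illustration, the "caching inventory data for longer intervals" behavior mentioned above is roughly a stale-on-error pattern like the sketch below. The types, TTL values, and function names are hypothetical; this is not our serving code.

```typescript
// Illustrative sketch of stale-on-error caching: when the datastore lookup
// fails or times out, serve a previously cached value past its normal TTL
// rather than failing the recommendation request. Types and TTLs are
// hypothetical placeholders.

interface CacheEntry<T> {
  value: T;
  fetchedAt: number; // epoch millis
}

const NORMAL_TTL_MS = 60 * 1000;        // placeholder TTL under normal operation
const DEGRADED_TTL_MS = 15 * 60 * 1000; // extended TTL while the datastore is degraded

async function getItemData<T>(
  key: string,
  cache: Map<string, CacheEntry<T>>,
  fetchFromDatastore: (key: string) => Promise<T>
): Promise<T | undefined> {
  const cached = cache.get(key);
  const age = cached ? Date.now() - cached.fetchedAt : Infinity;

  if (cached && age < NORMAL_TTL_MS) return cached.value;

  try {
    const fresh = await fetchFromDatastore(key);
    cache.set(key, { value: fresh, fetchedAt: Date.now() });
    return fresh;
  } catch {
    // Datastore degraded: fall back to a stale entry within the extended TTL.
    if (cached && age < DEGRADED_TTL_MS) return cached.value;
    return undefined; // caller may omit this candidate from recommendations
  }
}
```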

Report: "Elevated 502 errors for API-based Inventory requests"

Last update
resolved

We have confirmed that the Aerospike update has resolved the CPU consumption issue that was leading to intermittent 502 errors for API inventory insertions.

monitoring

502 errors for insertions to the /inventory API appear to have ceased following an update to Aerospike in our back-end systems. We are monitoring closely to ensure this is fully resolved.

investigating

We are currently seeing a slight and intermittent increase in 502 errors for API-based inventory requests. This affects only the /inventory API. Javascript-based insertions are unaffected. Recommendation requests to the /model API endpoint are also unaffected. The servers processing these requests are gradually increasing to nearly 100% of CPU usage, and as a result are occasionally unable to process new requests. We are investigating now to determine the root cause, and have added more resources to these servers to help mitigate impact.

Report: "JavaScript inventory insertion processing delays"

Last update
resolved

The system has been stable for a while so we are marking this issue as resolved.

monitoring

We have fixed the issue and JS inventory insertions should be inserting stably and in real time. We are continuing to monitor the situation.

investigating

Starting 5 PM Pacific Time on August 1, 2017 (yesterday), we had issues with JavaScript-based inventory processing. Our process for inserting inventories was running, but due to an issue with message acknowledgement, it was not processing all new messages correctly. As a result, we are lagging on the insertion of new inventory items and on metadata updates to existing items. API inventory processing was unaffected. Our model serving was also unaffected, except to the extent that we missed out on new inventories. We are working to resolve the issue and will provide an update here once the issue is resolved. If you have any further questions, please feel free to contact support@liftigniter.com. If you have strong requirements around inventory insertion reliability and immediate feedback about the success or failure of insertions, we recommend switching to the inventory API.

Report: "BigQuery Outage Preventing Data Display in RealTime Tab"

Last update
resolved

We have completed our investigation and determined that, unfortunately, the RealTime reporting cannot be backfilled with the data for July 27th. As a result, the RealTime tab will continue to show minimal, inaccurate data for the period from roughly 8 AM Pacific to 11 PM Pacific on July 27th, when analytics processing was fully restored. This issue did not affect the daily processing for our Analytics tab, however, so accurate reporting for the 27th will still be available there. There was no loss of data to our model servers during this time, and no impact to recommendation quality or model server responsiveness. If you have any questions, please reach out to support@liftigniter.com.

identified

RealTime data display has now been restored, and we are currently investigating whether we can backfill the data in the RealTime tab for July 27th. Further updates to come.

identified

Google's BigQuery service, which powers the analytics behind our RealTime and Analytics reporting tabs, became unavailable yesterday for a brief period. This appears to have prevented our RealTime tab from populating with complete data from 6 PM Pacific on July 26th onward. We are actively working to backfill the data and restore RealTime data analytics. Please note that this has no effect on recommendation quality, uptime for our various endpoints, or JavaScript processing; it is strictly limited to the display of the reported data.

Report: "JavaScript Inventory Insertions Not Processed"

Last update
resolved

We are confident that the solutions implemented for this will successfully prevent this issue from occurring again, and are marking this resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The JavaScript inventory insertion issue has been resolved, and we are monitoring closely while we introduce additional safeguards to prevent this issue from occurring again.

identified

Around 5 PM Pacific Time on Saturday, June 17, our server that processes all JavaScript-based inventory insertions went into a zombie state in which it stopped processing new inventories properly. Because it did not crash, it was not restarted automatically. This affects new inventory insertions and updates to existing inventory for all customers who insert inventories through JavaScript. We are restarting the inventory update process immediately and investigating why the job entered this zombie state.
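As general background on this failure mode (a job that is alive but no longer making progress, so crash-based auto-restart never triggers), a heartbeat-style watchdog is a common safeguard. The sketch below is a generic illustration of that pattern, not a description of the specific safeguards being put in place here; the threshold and check interval are assumptions.

```typescript
// Generic heartbeat watchdog sketch: if the worker stops reporting
// progress for too long, treat it as hung and restart it, even though
// the process never crashed. Threshold and interval are assumptions.
const STALL_THRESHOLD_MS = 5 * 60_000; // assumed: 5 minutes without progress
const CHECK_INTERVAL_MS = 30_000;      // assumed: check every 30 seconds

let lastProgressAt = Date.now();

// The worker calls this each time it successfully processes a message.
export function reportProgress(): void {
  lastProgressAt = Date.now();
}

// A supervisor checks the heartbeat and restarts the worker when stale.
export function startWatchdog(
  restartWorker: () => void,
): ReturnType<typeof setInterval> {
  return setInterval(() => {
    if (Date.now() - lastProgressAt > STALL_THRESHOLD_MS) {
      console.warn("Worker appears hung; restarting");
      restartWorker();
      lastProgressAt = Date.now();
    }
  }, CHECK_INTERVAL_MS);
}
```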

Report: "Increased 5XX error rates in US East"

Last update
resolved

Performance of model servers has been stable since our code revert. We have also identified the coding and deployment practices to review and improve in order to avoid similar situations in the future.

monitoring

We have identified the cause as a code change. We have reverted our code version and are seeing the error rates drop. We will continue to monitor for the next few hours.

identified

During a deployment, we saw an increase in 503 and 504 error rates in our US East region. Our global load balancer redirected traffic from US East to US West, reducing the customer impact of the issue. We are actively working to resolve the issue and will keep you posted with more details.

Report: "Inventory API behavior change"

Last update
resolved

We have marked this issue as resolved, since the endpoint now returns 200 responses when requesting the deletion of items that do not exist, as it did before June 7th. Please note, however, that this behavior is expected to change in the future; we will provide advance notice of the new design prior to any changes.

monitoring

The DELETE method has been reverted, and should now return a 200 response even when requesting the deletion of items that do not exist. This should restore expected functionality to match the behavior of the endpoint before June 7th. If you have any questions, please contact Support!

identified

Pending further development, we will be reverting the behavior of the inventory DELETE endpoint to return 200 responses for all customers when requesting deletion of items that do not exist. This will restore the behavior seen before June 7, 2017, and should take effect within the next few hours. We have created an internal ticket to examine this endpoint's behavior and develop more robust response and error handling for the range of scenarios possible when requesting deletion of a mix of existing and non-existent content. Further updates will be posted here.

identified

Errors: 400 (false alarms)
Start date: 2017-06-07 12:00 PM
Endpoint: inventory DELETE
Customer impact: Changed the behavior of the endpoint for customers programmatically using DELETE

On Wednesday, June 7, 2017, we pushed changes to our inventory API server's POST/DELETE behavior. Previously, our inventory API server would check a POST or DELETE request for validity, return a 200 OK if valid, and then push the operation into an operation queue, where it would be processed by a backend job (generally within a few seconds). Based on customer feedback, we changed the API behavior to execute the POST or DELETE immediately against the datastore and have the API respond with a status based on whether the operation succeeded. As a result, if there is a problem inserting or deleting an item, the customer using the API now knows immediately rather than having to wait and check after a few minutes. However, this interacted poorly with another behavior of DELETE, namely that it is considered to "fail" if the item being deleted does not exist. We therefore ended up returning 400 error codes whenever customers requested deletion of items that do not exist, perhaps because they had already deleted them. We have created an internal ticket to examine the API behavior and determine how the API should behave. In the meantime, we are in communication with the existing customers affected by the behavior change so that we can mask the error code for them and they do not need to modify their scripts in the near term. We will also be updating our documentation and communicating with all customers using our API about planned further API changes, in addition to posting them here at status.liftigniter.com. If you have further questions, please contact Support.
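For API users affected during the interim, one defensive client-side pattern is to treat a "does not exist" failure on DELETE as success, since the end state (the item is absent from the inventory) is the same either way. The sketch below illustrates that pattern; the endpoint path, auth parameter, and the assumption that a 400 here means "item not found" are based only on the description above, not on a finalized API design, and a production client should distinguish a not-found 400 from other validation errors (for example by inspecting the response body).

```typescript
// Hypothetical defensive wrapper: treat deletion of an already-absent item
// as success, so a 400 "does not exist" response (per the June 7th
// behavior described above) is not surfaced as an error.
// Endpoint URL and auth parameter are placeholders.
const API_BASE = "https://api.example.com"; // placeholder base URL
const API_KEY = "YOUR_API_KEY";             // placeholder credential

async function deleteInventoryItem(id: string): Promise<void> {
  const resp = await fetch(
    `${API_BASE}/inventory/${encodeURIComponent(id)}?apiKey=${API_KEY}`,
    { method: "DELETE" },
  );

  // 200: deleted. 400 (under the June 7th behavior): item did not exist,
  // which leaves the inventory in the desired state, so treat as success.
  // Check the response body in practice to confirm the 400 really means
  // "not found" rather than a malformed request.
  if (resp.ok || resp.status === 400) {
    return;
  }
  throw new Error(`Unexpected response deleting ${id}: HTTP ${resp.status}`);
}
```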