Historical record of incidents for Omnivore.io
Report: "Omnivore Connectivity Issue"
Last update: We are seeing issues with our third-party cloud connectivity provider. This affects most Omnivore systems to varying degrees. We will provide further updates as we have additional details.
Report: "NCR CloudConnect API - Increased error rate"
Last update: NCR has marked their ordering components as operational and Omnivore has continued to see normal error rates since recovery.
NCR posted an update to their status page at 22:13 UTC indicating they have found a possible solution and are working on stabilizing services. Omnivore calls to the NCR CloudConnect API began seeing a return to more normal error rates at 22:10 UTC and have been maintaining those rates since. We will continue to monitor error rates until NCR has confirmed they have resolved the issue.
At this time, we have no further updates from NCR regarding an estimated resolution. We will continue to monitor the situation and update as soon as the situation changes.
Beginning around 19:20 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. All API calls to fetch ticket and clock entry data are failing and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "Omnivore API - Ticket Writes Failing"
Last update: API has resumed normal functionality.
We have implemented a fix and are observing successful API writes. We will continue to monitor.
We have identified an issue where writes to the Omnivore API are failing 100% of the time. We are implementing actions to recover. During this time, API reads will still function normally. We will provide updates as they are available.
Report: "Brink API - Increased Error Rate for Ticket Calls"
Last update: We have observed Brink API calls return to their normal success levels for the past hour. Functionality of dependent systems has returned to normal.
Starting at 18:25 UTC, we began seeing an increased number of errors when calling the Brink API, impacting ticket reads. API calls to ticket reads will likely fail at an increased rate and webhooks may be delayed until service returns to normal operation. So far, this seems isolated to a single Brink host. We have reached out to our Brink contacts and will continue to monitor until the situation is resolved. There are no other technical actions we can take to resolve the issue at this time.
Report: "Agents Offline"
Last update:
## Overview
On May 29, 2024, Olo’s Omnivore Platform experienced agent degradation between 20:30 UTC and 21:55 UTC. Some API calls were failing during this time, and some agents went offline at 21:30 UTC.
## What Happened
On May 29, 2024, during a routine instance resizing operation for our Connect service cluster, our configuration management system misidentified the IP addresses for the newly deployed instances, causing them to be bootstrapped incorrectly. This resulted in elevated error rates for Omnivore API calls beginning at 20:30 UTC, with 32% of connected Omnivore agents becoming degraded. At 21:30 UTC, additional API calls began to fail, causing 6% of connected agents to go fully offline. We initiated an accelerated rollback of the change, which fully restored service to a healthy state by 21:55 UTC.
## Next Steps
* Improve the provisioning process to detect and alert on this kind of misconfiguration earlier, before new instances are put into rotation (an illustrative sketch follows this report's updates).
* Create additional alerting around agent errors to improve investigation speed.
Affected locations have returned to online status and are operating normally
A fix has been implemented and location statuses have returned to normal. We will continue to monitor at this time.
At approximately 20:30 UTC, we identified an issue causing many locations to enter either a degraded or offline state. We have identified the issue and are working to resolve it.
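The following is a minimal sketch of the kind of pre-rotation validation described in the postmortem's next steps above. The instance record fields, the expected CIDR, and the check itself are assumptions for illustration only, not Omnivore's actual provisioning code.

```python
import ipaddress

def validate_instance(instance: dict, expected_cidr: str) -> list[str]:
    """Return a list of problems that should block rotation; empty means OK.

    `instance` is a hypothetical provisioning record; the field names here
    are illustrative, not Omnivore's real schema.
    """
    problems = []
    subnet = ipaddress.ip_network(expected_cidr)

    ip = instance.get("private_ip")
    if ip is None:
        problems.append("instance has no private IP recorded")
    elif ipaddress.ip_address(ip) not in subnet:
        problems.append(f"IP {ip} is outside the expected subnet {expected_cidr}")

    if not instance.get("bootstrap_complete", False):
        problems.append("bootstrap did not report completion")

    return problems

if __name__ == "__main__":
    candidate = {"private_ip": "192.168.5.23", "bootstrap_complete": True}
    issues = validate_instance(candidate, "10.0.0.0/16")
    # In a real pipeline, any issue would alert and keep the instance out of rotation.
    print(issues or "instance passes pre-rotation checks")
```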
Report: "Lavu Partner Outage"
Last update: With help from the Lavu team, connectivity has been restored for all Lavu Locations with Apps attached.
The Lavu team has communicated that they are targeting a fix for end of Q1 or early Q2.
At this time we are not expecting a resolution until at least Tuesday Jan 16th.
At around 21:47 UTC we were asked by Lavu to take further steps to prevent calls to their services. As such, we are taking action to set all Omnivore Lavu locations offline.
Beginning around 21:26 UTC, at the request of Lavu, we disabled webhooks and background processing for Lavu locations to aid in Lavu's outage recovery. During this time, webhooks will be delayed and data may become stale.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Calls to the NCR CloudConnect API have returned to base level timings and error rates. We will continue to monitor the success of calls to the NCR CloudConnect API.
Beginning around 00:02 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: At 8:32 UTC, we saw calls to the NCR CloudConnect API return to base level timings and error rates. We will continue to monitor the success of calls to the NCR CloudConnect API.
Beginning around 2:47 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Beginning around 02:26 UTC, calls to the NCR CloudConnect API began to succeed.
Beginning around 00:00 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: The number of timeouts has returned to normal levels.
Beginning around 17:03 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "MMS Outage"
Last update: At around 19:18, as part of a larger release, Omnivore engineers removed an MMS ingress that was believed to be unused. During the release, we noticed MMS order counts drop and immediately began to roll back. We were completely rolled back by 19:37, with ordering traffic returning to normal at that time. Upon investigation, we found that a manual DNS entry was in place that referenced the removed ingress. Because this entry was not committed to our infrastructure repository, we mistakenly believed the ingress to be unused. We will audit for any other manual DNS entries in our environment before continuing with this release.
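As a hedged illustration of the audit mentioned above, the sketch below compares a list of live DNS names against the names committed to an infrastructure repository and flags anything untracked. The two input files and their format are assumptions; they stand in for whatever zone export and repo-generated list an actual audit would use.

```python
def load_names(path: str) -> set[str]:
    """Read one DNS name per line, ignoring blank lines and comments."""
    names = set()
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                names.add(line.rstrip(".").lower())
    return names

def find_unmanaged(live_zone_file: str, repo_file: str) -> set[str]:
    """Names present in the live zone but absent from the committed records."""
    return load_names(live_zone_file) - load_names(repo_file)

if __name__ == "__main__":
    # Hypothetical file names for illustration.
    for name in sorted(find_unmanaged("live_zone.txt", "repo_records.txt")):
        print(f"manual/untracked DNS entry: {name}")
```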
Report: "Brink API - Increased error rate"
Last update: Around 9:15 UTC, Brink API calls began to succeed.
Beginning around 07:40, we began seeing an increased number of errors when calling the Brink API impacting ticket reads and clock entries. API calls to ticket reads and clock entries will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our Brink contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 20:58 UTC. We will continue to monitor for any further increases.
Beginning around 8:21 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "Brink API - Increased error rate and Timeouts"
Last update: Error rates returned to normal levels around 17:20 UTC. We will continue to monitor for any further increases.
After further investigation, we have found that the timeouts are only happening when making calls to https://api22.brinkpos.net. All other Brink hosts seem to be operational.
Beginning around 16:50 UTC, we began seeing an increased number of timeouts when calling the Brink API impacting our Brink Locations. API calls to fetch Tickets will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our Brink contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
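As a rough sketch of how failures can be confirmed as isolated to a single upstream host such as api22.brinkpos.net, the example below tallies timeout rates per host from call records. The record shape and sample data are purely illustrative.

```python
from collections import Counter

# Hypothetical call records; in practice these would come from request logs.
calls = [
    {"host": "api22.brinkpos.net", "timed_out": True},
    {"host": "api22.brinkpos.net", "timed_out": True},
    {"host": "other-host.example", "timed_out": False},
    {"host": "api22.brinkpos.net", "timed_out": False},
    {"host": "other-host.example", "timed_out": False},
]

totals = Counter(call["host"] for call in calls)
timeouts = Counter(call["host"] for call in calls if call["timed_out"])

for host in totals:
    rate = timeouts[host] / totals[host]
    flag = "  <-- isolated problem host?" if rate > 0.5 else ""
    print(f"{host}: {timeouts[host]}/{totals[host]} timeouts ({rate:.0%}){flag}")
```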
Report: "Omnivore API - Degraded Performance"
Last update: All systems are confirmed stable and the Omnivore API is functioning normally. This incident is now resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and API connectivity has been restored. We are continuing to monitor the effects.
We are still investigating issues reported with the Omnivore API. Clients may experience errors and latency when accessing panel.omnivore.io. We'll provide updates as they come in.
We are currently investigating issues reported with the Omnivore API. Clients may experience errors and latency when accessing panel.omnivore.io.
Report: "Lavu API - Increased Error Rate"
Last update: This incident has been resolved.
The Lavu API is currently experiencing intermittent degradation. Please see their status page for details: https://status.lavu.com. We will continue to monitor until the Lavu API returns to normal functionality. There are no further technical actions we can take at this time.
Report: "Lavu API - Increased Error Rate"
Last update: This incident has been resolved.
The Lavu API is currently experiencing intermittent degradation. Please see their status page for details: https://status.lavu.com. We will continue to monitor until the Lavu API returns to normal functionality. There are no further technical actions we can take at this time.
Report: "Lavu API - Increased Error Rate"
Last update: This incident has been resolved.
The Lavu API is currently experiencing intermittent degradation. Please see their status page for details: https://status.lavu.com. We will continue to monitor until the Lavu API returns to normal functionality. There are no further technical actions we can take at this time.
Report: "API Outage"
Last update:
# Overview
On November 24, 2023, Olo's Omnivore API experienced a disruption between 21:17 UTC and 22:12 UTC. During this time all API operations, with the exception of Add Payment, Open Ticket, and Submit Order, were failing, and 25% of Omnivore-related webhooks experienced delayed delivery.
# What Happened
On November 24, 2023, Olo experienced a disruption to the Omnivore API and related webhook delivery, caused by a failure in the automated process for creating new Omnivore API instances. As traffic to the Omnivore API increased, its auto-scaling system was unable to add capacity to meet it. As a result, at 21:17 UTC all API operations with the exception of Add Payment, Open Ticket, and Submit Order began to fail, and 25% of Omnivore-related webhooks began to experience delayed delivery. We discovered that some of our package dependencies had been updated by their maintainers to require a newer runtime version than what was available in our deployment pipeline. This caused the bootstrapping process to fail for new instances that were needed to handle current traffic levels. With this identified, we implemented and deployed a fix to remove the failing dependencies from the API's critical path, allowing the system to resume scaling out additional API instances and restoring service at 22:12 UTC. (An illustrative sketch of this kind of bootstrap check follows this report's updates.)
# Next Steps
* We have already made improvements to our alerting to automatically detect and mitigate similar issues before they become critical.
* We will complete our in-progress migration of all Omnivore services into our newer hosting environment, which removes these dependencies as a failure point.
All systems have been functioning normally with API and Webhooks flowing normally for several hours. We will follow up with a postmortem by 12/1/2023.
We have identified the issue and implemented a fix. We are monitoring systems to ensure stability. API and webhooks traffic are flowing normally.
We are currently investigating an issue that is affecting the Omnivore API.
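The postmortem above traces the outage to dependencies requiring a newer runtime than the deployment pipeline provided, which broke bootstrapping of new API instances. Below is a minimal, hypothetical sketch of a bootstrap pre-flight check along those lines; the required version is a made-up example, and a real check would read it from the dependency lock file rather than hard-coding it.

```python
import sys

# Hypothetical minimum runtime implied by the pinned dependencies; illustrative only.
REQUIRED_PYTHON = (3, 9)

def runtime_is_compatible(required: tuple[int, int] = REQUIRED_PYTHON) -> bool:
    """True if the interpreter on this instance meets the declared minimum."""
    return sys.version_info[:2] >= required

if __name__ == "__main__":
    if not runtime_is_compatible():
        # Failing loudly at bootstrap keeps an unusable instance out of the
        # auto-scaling pool instead of letting it fail under load.
        sys.exit(f"runtime {sys.version.split()[0]} is older than required "
                 f"{'.'.join(map(str, REQUIRED_PYTHON))}")
    print("runtime check passed")
```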
Report: "Lavu API - Increased error rate"
Last update: The Lavu API outage was resolved around 19:00 UTC. All Omnivore API calls and webhooks involving Lavu Locations have returned to normal operation.
The Lavu API is currently experiencing an outage. Please see their status page for details: https://status.lavu.com. We will continue to monitor until access to the Lavu API has been restored. There are no further technical actions we can take at this time.
Beginning around 18:27 UTC, we began seeing an increased number of errors when calling the Lavu API. API calls to fetch ticket data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are currently investigating the root cause.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 17:35 UTC. We will continue to monitor for any further increases.
Beginning around 17:20 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: At 5:00 UTC, calls to the NCR CloudConnect API returned to baseline. We will continue to monitor the success of calls to the NCR CloudConnect API.
Beginning around 17:21 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. Static data populated by background tasks may become stale. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: At 17:35 UTC, we began seeing successful calls to the NCR CloudConnect API. We will continue to monitor the success of calls to the NCR CloudConnect API.
Beginning around 17:15 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. Static data populated by background tasks may become stale. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: At 7/13 21:08 UTC, error rates and timeouts for calls to the NCR CloudConnect API resumed nominal levels. We will continue to monitor the success of calls to the NCR CloudConnect API.
Beginning around 7/13 at 20:06 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: At 7/13 05:04 UTC, error rates and timeouts for calls to the NCR CloudConnect API resumed nominal levels. We will continue to monitor the success of calls to the NCR CloudConnect API.
Beginning around 7/12 at 18:55 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: At 7/12 18:31 UTC, error rates and timeouts for calls to the NCR CloudConnect API resumed nominal levels. We will continue to monitor the success of calls to the NCR CloudConnect API.
Beginning around 7/12 at 17:20 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: At 6/27 16:03 UTC, error rates and timeouts for calls to the NCR CloudConnect API resumed nominal levels. We will continue to monitor the success of calls to the NCR CloudConnect API.
Beginning around 6/26 at 9:40 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 11:04 UTC. We will continue to monitor for any further increases.
Beginning around 09:52 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 06:22 UTC. We will continue to monitor for any further increases.
Beginning around 00:52 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 00:14 UTC. We will continue to monitor for any further increases.
Beginning around 22:52 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 21:09 UTC. We will continue to monitor for any further increases.
Beginning around 19:52 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 18:57 UTC. We will continue to monitor for any further increases.
Beginning around 17:17 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: At 12:45 UTC, NCR CloudConnect API error rates and timeouts returned to nominal levels. We will continue to monitor the success of calls to the NCR CloudConnect API.
Beginning around 12:14 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: At 13:30 UTC, the NCR CloudConnect API began responding to requests successfully. Error rates have returned to nominal levels. We will continue to monitor the success of calls to the NCR CloudConnect API.
Beginning around 12:57 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: At 17:30 UTC, the NCR CloudConnect API began responding to requests successfully. Error rates have returned to nominal levels. We will continue to monitor the success of calls to the NCR CloudConnect API.
Beginning around 12:12 UTC, we began seeing an increased number of errors when calling the NCR CloudConnect API impacting Tickets, Employee, and Job reads. All API calls may present stale data. Webhooks may also be delayed. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Starting at 17:45 UTC, we observed calls to the NCR CloudConnect API begin succeeding. Cached data is successfully being updated over time in batches. We will continue to closely monitor the success of these background jobs and of calls to the NCR CloudConnect API.
Beginning around 13:35 UTC, we began seeing an increased number of timeouts when calling the Ticket routes of the NCR CloudConnect API. API calls to fetch Tickets will likely fail and Ticket webhooks will be delayed until the NCR outage resolves. NCR is continuing to update their status page (at https://status.aloha.ncr.com/incidents/cnl38krr6n6b). We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Around 20:00 UTC, the number of timeouts when calling the Ticket routes of the NCR CloudConnect API decreased to normal levels. We are continuing to see an increased error rate when calling the NCR CloudConnect API for employee and job data. API calls to fetch employees may present stale data. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Beginning around 8:03 UTC, we began seeing an increased number of timeouts when calling the Ticket routes of the NCR CloudConnect API. Based on this, we are upgrading the scope of the outage. API calls to fetch Tickets will likely fail and Ticket webhooks will be delayed until the NCR outage resolves. NCR is continuing to update their status page (at https://status.aloha.ncr.com/incidents/cnl38krr6n6b). We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
We are continuing to see an increased error rate when calling the NCR CloudConnect API for employee and job data. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time. Please refer to this NCR Incident for details: https://status.aloha.ncr.com/incidents/cnl38krr6n6b
We are continuing to see an increased error rate when calling the NCR CloudConnect API for employee and job data. API calls to fetch employees may present stale data. We have reached out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Beginning around 10:00 UTC, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting employee and job reads. API calls to fetch employees may present stale data. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
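Several updates above note that employee and job data is populated by background tasks and may be served stale while the upstream API is failing. A minimal sketch of that serve-stale-on-error pattern follows; the cache shape, TTL, and fetch callable are assumptions for illustration, not Omnivore's implementation.

```python
import time

# Hypothetical in-memory cache: key -> (value, fetched_at).
_cache: dict[str, tuple[object, float]] = {}

def get_with_stale_fallback(key: str, fetch, max_age_s: float = 300.0):
    """Return (value, is_stale): fresh data when possible, last good value on failure."""
    entry = _cache.get(key)
    if entry and time.time() - entry[1] < max_age_s:
        return entry[0], False  # cached value is still fresh

    try:
        value = fetch()
    except Exception:
        if entry is None:
            raise  # nothing cached to fall back to
        return entry[0], True  # upstream failed; serve stale data instead of an error

    _cache[key] = (value, time.time())
    return value, False

if __name__ == "__main__":
    def flaky_fetch():
        raise TimeoutError("upstream API timed out")

    _cache["employees"] = (["example-employee-a", "example-employee-b"], time.time() - 3600)
    data, is_stale = get_with_stale_fallback("employees", flaky_fetch)
    print(data, "(stale)" if is_stale else "(fresh)")
```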
Report: "CloudPOS Scheduler Queue Length"
Last update: After further investigation, we see no other impacts to address. All systems appear to be fully operational.
As of 19:41 UTC, the Scheduler Queue has returned to baseline. We have confirmed that POS data has been refreshed for all affected POS types (Brink, Toast, Cloud Connect, Lavu, and Lightspeed), including seeing current day Tickets. Webhooks have resumed as well. With the acute phase of the incident being over, we will check for any other impacts before closing the incident.
After scaling up our Scheduler Workers, the queue size has shrunk by ~75%. We will continue to monitor until the queue size is back to baseline.
Around 18:00 UTC, we noticed that our CloudPOS Scheduler queue had an elevated number of tasks waiting to be run. This would likely cause all CloudPOS data to be stale, including Tickets and Clock Entries. It would also lead to delayed webhooks. We are currently scaling up our Scheduler Workers to process the delayed tasks.
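A small sketch of the scale-up decision described above: pick a worker count large enough to drain the backlog, capped by a hard ceiling. The per-worker throughput and ceiling are hypothetical numbers, not values from Omnivore's scheduler.

```python
def workers_needed(queue_depth: int,
                   current_workers: int,
                   tasks_per_worker: int = 500,
                   max_workers: int = 64) -> int:
    """Choose a worker count that can drain the backlog, within a hard ceiling.

    tasks_per_worker and max_workers are illustrative assumptions; a real
    autoscaler would derive them from measured processing rates.
    """
    target = max(current_workers, -(-queue_depth // tasks_per_worker))  # ceiling division
    return min(target, max_workers)

if __name__ == "__main__":
    # e.g. 18,000 queued scheduler tasks with 8 workers currently running
    print(workers_needed(queue_depth=18_000, current_workers=8))  # -> 36
```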
Report: "API and Webhooks Intermittent Unavailability"
Last update:
## Executive Summary
On January 27, 2023, between 02:37 and 03:40 UTC, ECS instances could not be deployed in our environment because a GPG key was changed on a package used on these instances. This caused a cascading outage of Omnivore’s API, with a period of total downtime between 02:50 and 03:40.
## Background and Root Cause
Omnivore utilizes Amazon Web Services Elastic Container Service (ECS) for some of our services. These instances are deployed as needed and built using the Chef configuration management tool. When Chef runs on these instances, it installs the software packages the instances need. Typically, these packages come from repositories maintained by the operating system; however, a few are maintained by the software companies that develop the applications. These repositories are secured using GnuPG (GPG) keys. Software companies change their GPG keys from time to time for security reasons. When this happens, the software will not be installed and an error message is displayed. When this type of error happens with Chef, the installation of the ECS instance is not completed, and the needed extra resources are not deployed. This is what caused this outage.
## Timeline
All times are in UTC.
02:37: Omnivore infrastructure team receives an alert that ECS instances were not able to be deployed.
02:45: Omnivore infrastructure team attempts to manually raise the number of ECS instances.
02:50: Omnivore infrastructure team receives an alert that the Omnivore API is failing.
02:56: Omnivore infrastructure team pages the service team to alert them to the issue.
03:19: Omnivore infrastructure team discovers that Chef is not able to deploy ECS instances.
03:22: Omnivore infrastructure team attempts to run Chef manually to force deployment.
03:40: Omnivore infrastructure team notes that Chef is failing due to a bad GPG key.
03:40: Omnivore infrastructure team downloads and installs the new GPG key, allowing Chef to run to completion.
## Action Items
1. Change the process for Chef deployment to include a fresh download of the GPG key on every run (an illustrative sketch follows this report's updates).
2. Consider using a “Golden Image” over deploying with Chef.
This incident has been resolved.
We have identified the problem and have implemented a fix. The API and webhooks are returning to normal. We will continue to monitor.
We are continuing to investigate this issue.
We are currently investigating instability affecting our API and webhooks.
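As a hedged illustration of the first action item in the postmortem above (downloading the repository GPG key fresh on every run rather than relying on a cached copy), the sketch below fetches a key and imports it with `rpm --import` before package installation. The key URL is a placeholder, and in practice this logic would live inside the Chef run rather than a standalone script.

```python
import subprocess
import tempfile
import urllib.request

# Placeholder URL; the real key location depends on the package vendor.
KEY_URL = "https://example.com/vendor-package-signing-key.gpg"

def refresh_gpg_key(url: str = KEY_URL) -> None:
    """Download the vendor signing key and import it before installing packages."""
    with tempfile.NamedTemporaryFile(suffix=".gpg") as key_file:
        with urllib.request.urlopen(url) as response:
            key_file.write(response.read())
        key_file.flush()
        # Importing on every run means a rotated vendor key no longer blocks bootstrap.
        subprocess.run(["rpm", "--import", key_file.name], check=True)

if __name__ == "__main__":
    refresh_gpg_key()
```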
Report: "API & Webhook Activity Graph Cloud Provider Outage"
Last update: This incident has been resolved.
A third-party cloud provider outage is preventing API & Webhook activity graphs from displaying in the Omnivore Control Panel.
Report: "Toast API - Increased error rate"
Last update: At 9:30 UTC, we found that a service responsible for authenticating against the Toast API had previously lost its database connection and failed to reconnect. We restarted the service and see that Toast API connections have now returned to normal.
Beginning around 8:15 UTC, we began seeing an increased number of errors when calling the Toast API, impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
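The resolution above describes an authentication service that lost its database connection and never re-established it until the service was restarted. Below is a minimal sketch of a reconnect-with-backoff guard that avoids that failure mode; the `connect` callable stands in for whatever database driver the service actually uses, and the retry budget is illustrative.

```python
import time

def connect_with_retry(connect, attempts: int = 5, base_delay_s: float = 1.0):
    """Call `connect` until it succeeds, backing off exponentially between tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as error:  # a real client would catch driver-specific errors
            last_error = error
            time.sleep(base_delay_s * (2 ** attempt))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error

if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_connect():
        # Simulated driver that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise OSError("database unreachable")
        return "connection established"

    print(connect_with_retry(flaky_connect, base_delay_s=0.01))
```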
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 22:40 UTC. We will continue to monitor for any further increases.
We are continuing to work on a fix for this issue.
Beginning around 21:42 UTC, we began seeing an increased number of errors when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve.
Report: "CloudPOS (Brink, CloudConnect, Toast, Lavu, Lightspeed) Ticket and Ticket List Errors"
Last update: For a period of 20 minutes following a release, requests for tickets to CloudPOS locations (Brink, CloudConnect, Toast, Lavu, Lightspeed) were failing. We identified the issue and rolled back the release.
Report: "Toast API - Increased error rate"
Last update: This incident has been resolved.
Beginning around 6:28 UTC, we began seeing an increased number of errors when calling the Toast API impacting ticket reads. API calls to fetch ticket data will likely fail at an increased rate and webhooks may be delayed until service is restored. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "Order Error Rate Increase"
Last update: This incident has been resolved.
Order error rates have returned to normal levels.
We are investigating an increase in the order error rate.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 05:46 UTC on 2022-05-21. We will continue to monitor for any further increases.
Beginning around 05:08 UTC on 2022-05-21, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 08:59 UTC on 2022-05-19. We will continue to monitor for any further increases.
Beginning around 08:50 UTC on 2022-05-19, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 08:55 UTC on 2022-05-13. We will continue to monitor for any further increases.
Beginning around 08:23 UTC on 2022-05-13, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 08:41 UTC on 2022-05-10. We will continue to monitor for any further increases.
Beginning around 08:24 UTC on 2022-05-10, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 15:54 UTC on 2022-05-06. We will continue to monitor for any further increases.
Beginning around 15:53 UTC on 2022-05-06, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.
Report: "Investigating Webhooks"
Last update: This incident has been resolved. No transaction data was lost.
A fix is in place. We are actively monitoring. You may experience a higher than normal volume of webhooks while our system catches up.
Webhooks are currently not being sent. We are investigating the cause.
Report: "NCR CloudConnect API - Increased error rate"
Last update: Error rates returned to normal levels around 19:11 UTC on 2022-05-04. We will continue to monitor for any further increases.
Beginning around 19:06 UTC on 2022-05-04, we began seeing an increased number of timeouts when calling the NCR CloudConnect API impacting ticket and clock entry reads. API calls to fetch ticket and clock entry data will likely fail at an increased rate and webhooks may be delayed until service is restored. We are reaching out to our NCR contacts. We will continue to monitor for the issue to resolve. There are no further technical actions we can take to resolve the issue at this time.