Historical record of incidents for Iron.io
Report: "IronWorker degraded performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "IronCache issue"
Last update: Our engineering team is actively resolving the situation and taking the necessary steps to recover the system. We expect to post an update momentarily.
Report: "IronWorker and IronMQ Outage"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Log4j System Analysis"
Last update: Iron has completed a review of its systems and the Log4j security issue does not affect Iron's services.
Report: "IronWorker Degraded Performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "IronWorker Degraded Performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
IronWorker customers may experience delays while pushing tasks. We are currently investigating the issue.
Report: "IronWorker outage"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "IronWorker Degraded Performance"
Last update: IronWorker customers could experience issues while pushing tasks to the Public Cluster on March 14 between 12:00 am UTC and 6:30 am UTC. We've resolved the issue. Our development team is monitoring the situation.
Report: "Database Upgrade"
Last update: Amazon performed an upgrade of our PostgreSQL database version. This database stores logins, passwords, tokens, and other information about our customers. During the upgrade, Amazon shut down the database instance, performed the upgrade, and restarted the instance. As a result, you could see HTTP 401 errors in your logs between 09:28 am UTC and 09:45 am UTC. More information is available here: https://forums.aws.amazon.com/ann.jspa?annID=8176
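The 401 errors described above were transient and lasted only as long as the database restart. A generic client-side mitigation for this kind of maintenance window is to retry requests that fail with an authentication or server error, backing off between attempts. The sketch below is illustrative only and is not part of Iron.io's client libraries; the endpoint URL and retry parameters are assumptions.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// doWithRetry retries a request a few times with exponential backoff when the
// response looks transient (e.g. a 401 seen only during a provider maintenance
// window, or any 5xx). Other client errors are returned to the caller.
func doWithRetry(client *http.Client, req *http.Request, attempts int) (*http.Response, error) {
	backoff := 500 * time.Millisecond
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Do(req)
		if err == nil && resp.StatusCode != http.StatusUnauthorized && resp.StatusCode < 500 {
			return resp, nil // success or a non-retryable client error
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("transient status %d", resp.StatusCode)
			resp.Body.Close()
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return nil, lastErr
}

func main() {
	// Hypothetical endpoint, used only to demonstrate the retry wrapper.
	req, err := http.NewRequest("GET", "https://api.example.com/v2/projects", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := doWithRetry(http.DefaultClient, req, 4)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```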
Report: "IronWorker Degraded Performance"
Last update: On May 1, at 11 am UTC, we identified an issue with our autoscale functionality: there were errors while starting our IronWorker service on new instances, but it was working fine on existing ones. After investigating the errors, we found that the issue occurred while pulling our Docker images from Docker Hub. The root cause was a problem on Docker Hub's side; more information can be found on their status page: https://status.docker.com/ To resolve the issue, we manually launched our service on new instances and temporarily disabled the autoscale functionality.
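The workaround above was to launch instances manually while Docker Hub was unavailable. A related, generic pattern for this failure mode is to retry the image pull and fall back to a mirror registry when the primary keeps failing. The sketch below shells out to the standard docker CLI; the image and mirror names are hypothetical, and this is not Iron.io's provisioning code.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

// pullWithFallback tries to pull an image from the primary registry a few
// times and, if that keeps failing (e.g. during a Docker Hub outage), falls
// back to a mirror copy of the same image.
func pullWithFallback(primary, mirror string, attempts int) error {
	for i := 0; i < attempts; i++ {
		out, err := exec.Command("docker", "pull", primary).CombinedOutput()
		if err == nil {
			return nil
		}
		log.Printf("pull %s failed: %v\n%s", primary, err, out)
		time.Sleep(time.Duration(i+1) * 5 * time.Second)
	}
	// Primary registry still unavailable: try the mirror instead.
	if out, err := exec.Command("docker", "pull", mirror).CombinedOutput(); err != nil {
		return fmt.Errorf("mirror pull failed: %v\n%s", err, out)
	}
	return nil
}

func main() {
	// Image names are placeholders; substitute your own registry paths.
	if err := pullWithFallback("example/runner:latest", "registry.example.com/example/runner:latest", 3); err != nil {
		log.Fatal(err)
	}
	fmt.Println("image pulled")
}
```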
Report: "IronCache Issue"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "IronCache Issue"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "IronWorker Degraded Performance"
Last update: **Overview** On May 13th, at 03:29 UTC, we began routine database upgrades. During the upgrade process we noticed errors in our logs indicating certain queries weren't able to complete successfully. **What went wrong** After investigating the errors, we found data anomalies in our Production data set that didn't exist in our Staging data set. This difference resulted in slow queries and errors that cascaded into service interruptions for a subset of our customers. **What we're doing to prevent this from happening again** Moving forward we're taking steps to ensure our Staging data set is 100% up to date with our Production data set. If the copies of the data were exact, this would have been caught in Staging and wouldn't have caused a disruption in service (a sketch of one such parity check follows this report's updates). **Resolution time** The incident was resolved at 11:49 UTC.
The migration has completed and service has returned to normal.
Migration is still in progress. This is taking more time than expected but we're monitoring it closely.
We are continuing to work on a fix for this issue.
Due to a database upgrade issue, a portion of our IronWorker customers are experiencing issues with certain API commands. We've identified the issue and are in the process of resolving it.
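The post-mortem above commits to keeping the Staging data set in sync with Production. One simple way to detect drift is to compare per-table row counts between the two databases on a schedule. The sketch below assumes two PostgreSQL connection strings, placeholder table names, and the github.com/lib/pq driver; it is an illustration, not Iron.io's tooling.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // PostgreSQL driver
)

// rowCount returns the number of rows in a table. Table names come from the
// fixed list in main, so string interpolation is safe for this sketch.
func rowCount(db *sql.DB, table string) (int64, error) {
	var n int64
	err := db.QueryRow(fmt.Sprintf("SELECT COUNT(*) FROM %s", table)).Scan(&n)
	return n, err
}

func main() {
	// Connection strings and table names are placeholders.
	prod, err := sql.Open("postgres", "postgres://user:pass@prod-host/app?sslmode=require")
	if err != nil {
		log.Fatal(err)
	}
	staging, err := sql.Open("postgres", "postgres://user:pass@staging-host/app?sslmode=require")
	if err != nil {
		log.Fatal(err)
	}

	for _, table := range []string{"projects", "tasks", "schedules"} {
		p, err := rowCount(prod, table)
		if err != nil {
			log.Fatal(err)
		}
		s, err := rowCount(staging, table)
		if err != nil {
			log.Fatal(err)
		}
		if p != s {
			fmt.Printf("DRIFT %s: production=%d staging=%d\n", table, p, s)
		}
	}
}
```

Row counts will not catch every anomaly (checksums or sampled comparisons go further), but they are a cheap first signal that the two data sets have diverged.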
Report: "Scheduler service: maintenance work"
Last update: This incident has been resolved.
Amazon has scheduled maintenance for one of the instances where our scheduler runs. The instance will be unavailable for two hours: on November 7, from 12:00 am to 2:00 am UTC. We will run another scheduler on a different instance during that period.
Report: "IronWorker degraded performance"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "DNS issue with an upstream provider"
Last update: **Overview** On August 6th, at 15:07 UTC, we noticed connectivity issues across our network. These connectivity issues caused IronMQ to degrade into an unhealthy state which rendered the service unusable. **What went wrong** At 12:49 AM PDT, the vendor we rely on for DNS (AWS Route 53) experienced issues. In-network connectivity was broken and many components of our network were unable to communicate with each other. When the vendor issue was resolved at 1:04 AM PDT, the issue persisted within our network due to caching and TTL issues. **What we're doing to prevent this from happening again** * We identified the places within our network that could have caused this issue and reviewed their caching strategies and TTL times. Multiple cache times were too aggressive and we've increased timeouts in the necessary places. We're testing various failure scenarios within our staging network to confirm the validity of these timeout values (a sketch of a TTL-bounded resolver cache follows this report's updates). * We're currently discussing backup DNS strategies as a team and will be posting updates on our blog about our strategy moving forward and continued progress. **Resolution time** The incident was resolved at 16:04 UTC.
This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating this issue.
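The post-mortem above attributes the lingering failures to cached DNS answers that outlived the upstream fix. A general defence is to bound how long any in-process resolver cache is trusted, so stale answers are re-resolved shortly after an outage clears. The sketch below is a generic illustration; the 30-second cap and the hostname are assumptions, and this is not Iron.io's internal resolver.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"sync"
	"time"
)

// cachedResolver caches lookups for at most maxTTL, regardless of what the
// upstream records advertise, so a bad answer cannot linger indefinitely.
type cachedResolver struct {
	mu     sync.Mutex
	maxTTL time.Duration
	cache  map[string]entry
}

type entry struct {
	addrs   []string
	expires time.Time
}

func (r *cachedResolver) Lookup(host string) ([]string, error) {
	r.mu.Lock()
	e, ok := r.cache[host]
	r.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.addrs, nil // still fresh
	}

	addrs, err := net.LookupHost(host) // fall through to the system resolver
	if err != nil {
		return nil, err
	}

	r.mu.Lock()
	r.cache[host] = entry{addrs: addrs, expires: time.Now().Add(r.maxTTL)}
	r.mu.Unlock()
	return addrs, nil
}

func main() {
	r := &cachedResolver{maxTTL: 30 * time.Second, cache: map[string]entry{}}
	addrs, err := r.Lookup("mq-aws-us-east-1-2.iron.io") // hostname used for illustration
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(addrs)
}
```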
Report: "Degraded Performance"
Last update: **Overview** On April 6th, at around 12:33 pm PST, we noticed an increase in authentication failures and DNS resolution failures. **What went wrong** We had a dramatic increase in queries being sent to our authentication database, which caused an automated failover process to start. During the time it took to fail over and promote, some API endpoints and our UI were affected. **What we're doing to prevent this from happening again** We're currently adding tests to ensure we're able to handle such authentication traffic increases moving forward. We're also looking at solutions to prevent any disruption of service when database failovers are in progress. **Resolution time** The incident was resolved at approximately 12:49 pm PST.
Resolved. 16 minutes total time of disruption. Post-mortem following.
We have resolved the issue and all systems are operational. We will continue to monitor status.
We are currently investigating this issue.
Report: "IronWorker degraded performance"
Last update: The team here prides itself on providing stable services to our customers, and when things go wrong, we take it seriously. On behalf of myself and the entire team, I want to apologize for yesterday's service disruption. Some details about the incident are as follows: **Overview** On February 20th, at around 4:00 pm PST, we noticed increased CPU rates on our primary MongoDB instance. We immediately contacted our database vendor, mLab, who jumped into chat with us within minutes to help diagnose the issue. Many customers experienced a large slow-down in tasks being processed, and some customers experienced their tasks not being processed at all. **What went wrong** We had a significant increase in the number of tasks coming through our system, and our system is designed to scale up in such cases. However, one query started increasing in run-time and ended up causing CPU to rapidly rise on our primary database. This caused task processing to slow down, and in some cases, tasks from certain projects weren't being processed at all. We eventually traced this query to an account setting that sets a maximum task limit for a given account. Since some of our customers process hundreds of millions of tasks a day and have complex deployments, this setting is often set to a very high number. When this setting is set, however, it causes an extra collection count query to fire off for each task. This influx of queries was the culprit and resulted in our primary database's CPU being pegged. **What we're doing to prevent this from happening again** * A frontend cache is being implemented to prevent an N+1 query problem with this collection count query. This will prevent resource starvation and a possible thundering herd scenario (a sketch of such a cache follows this report's updates). * A higher-level account flag is being added to mitigate the need for this collection count query. This will result in fewer queries and a platform-wide performance benefit. * We're adding more permutations to our test suite to cover this case as well as other possible resource starvation scenarios. **Resolution time** The incident was resolved at approximately 11:00 pm PST.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
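The first remediation item above, a frontend cache for the per-account collection count query, can be illustrated with a short-lived in-memory count cache placed in front of the database, so that enqueueing N tasks issues at most one count query per cache window instead of N. The type names, function names, and TTL below are assumptions for the example, not Iron.io's implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// countCache memoizes an expensive count query per project for a short TTL,
// so enqueueing N tasks triggers at most one count query per window instead
// of N (the N+1 pattern described in the post-mortem).
type countCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	data  map[string]cachedCount
	query func(projectID string) (int64, error) // the real COUNT query
}

type cachedCount struct {
	n       int64
	expires time.Time
}

// TaskCount holds the lock across the query on a miss, which also keeps
// concurrent callers from stampeding the database with identical counts.
func (c *countCache) TaskCount(projectID string) (int64, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.data[projectID]; ok && time.Now().Before(e.expires) {
		return e.n, nil
	}
	n, err := c.query(projectID)
	if err != nil {
		return 0, err
	}
	c.data[projectID] = cachedCount{n: n, expires: time.Now().Add(c.ttl)}
	return n, nil
}

func main() {
	calls := 0
	cache := &countCache{
		ttl:  5 * time.Second,
		data: map[string]cachedCount{},
		query: func(projectID string) (int64, error) {
			calls++ // stand-in for a real database COUNT
			return 42, nil
		},
	}
	for i := 0; i < 1000; i++ {
		cache.TaskCount("project-123")
	}
	fmt.Println("count queries issued:", calls) // 1, not 1000
}
```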
Report: "IO domain DNS failure"
Last update: Today, 10-28-2016, AWS confirmed that Route 53 DNS services were interrupted. This primarily affected domains that end in .io. This was causing nameservers to return NXDOMAIN intermittently for domains that do exist. The issue has been resolved. For more detailed information, please see https://news.ycombinator.com/item?id=12813065
This incident has been resolved.
We are recovering after an upstream provider had a DNS issue.
Report: "IronMQ (aws-us-east) service degradation"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "IronMQ (aws-us-east) service degradation"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "IronMQ (aws-eu-west) service degradation"
Last update: This incident has been resolved.
Switched DNS to the failover. Older messages may be temporarily unavailable; they will be migrated to the failover.
The issue has been identified and a fix is being implemented.
Report: "IronMQ (aws-us-east) service degradation"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "IronWorker degraded performance"
Last update: This incident has been resolved.
We're increasing resources platform-wide while continuing to investigate the root cause.
We are currently investigating this issue.
Report: "IronMQ (aws-us-east) service degradation"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Switched DNS to the failover. Older messages may be temporarily unavailable; they will be migrated to the failover.
Report: "IronMQ (aws-us-east) service degradation"
Last update: This incident has been resolved.
Switched DNS to the failover. Older messages may be temporarily unavailable.
Report: "IronMQ (aws-eu-west) is under maintenance"
Last update: This incident has been resolved.
IronMQ (public aws-eu-west cluster) is under maintenance. Older messages may be temporarily unavailable.
Report: "IronMQ (aws-us-east) service degradation"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "IronWorker degraded performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Our team is currently investigating the root cause of the issue.
Report: "Iron service degradation"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Report: "DNS issue with an upstream provider"
Last update: This incident has been resolved.
Users may experience connection errors or delays. We have confirmed there is a DNS issue with an upstream provider and are monitoring the situation closely.
Report: "IronWorker issue with Scheduled tasks: aws network died on few instances"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "mq-aws-eu-west-1-1"
Last update: This incident has been resolved.
Investigating issues with mq-aws-eu-west-1-1
Report: "Aws network issues are causing IronWorker service degradation"
Last update: This incident has been resolved.
AWS network issues are causing IronWorker service degradation.
Report: "IronWorker Service degradation"
Last update: This incident has been resolved.
Our engineering team is actively resolving slow task processing on the shared public cluster. They are taking the necessary steps to improve the system.
Report: "AWS S3 is down now causing job processing issues... stand by please while we try and reroute around it"
Last update: The system has now almost fully caught up. We're continuing to scan for any residual jobs that may not have run, but all jobs should have run or be queued up to run shortly. Thank you for your patience as AWS recovered their core services. We will be evaluating options for running core IaaS outside of AWS.
Job processing is almost fully up to speed again. It may take a while to get through the backlog of jobs.
We are now seeing recovery of IronWorker and working through backlogs of jobs.
We see jobs going through again... none should be lost, but jobs have been queuing up since the issues started this morning.
Update from AWS. We are quickly trying to restore our services as well: Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour.
Unfortunately the issue has now cascaded to over 45 AWS services causing unrecoverable issues upstream. At this point, we have to wait on AWS and then begin a fully multi-cloud initiative.
We are considering bypassing S3, but even then, Docker Hub is down, which would block any upstream updating of code packages, as they are all built with Docker.
Reported S3 issues in US-East: https://status.aws.amazon.com/ Trying to bypass their S3 service completely. We will build around it in the future.
https://news.ycombinator.com/item?id=13755673
The issue has been identified and a fix is being implemented.
Report: "Slower than usual job processing from 7:20am PT to 7:40am PT [resolved]"
Last update: This issue was resolved at 7:40am PT; we are creating a record to log the times. A routine deployment caused an unexpected slowdown in job processing for about 20 minutes this morning. We caught the slowdown and fixed the issue, restoring processing speeds. All jobs ran, but some with a slight delay. We are improving our canary deployments and internal load testing to discover these issues before they reach production. The Iron.io Ops Team
Report: "IronWorker Service degradation"
Last update: The issue is now resolved. We're working on a limiting process for the API calls that were putting strain on our databases and should have something soon to prevent this from happening again in the future. If you have any specific questions, please contact support@iron.io or your dedicated support channel.
We've discovered the source of the database slowdown and are taking corrective actions to build a permanent fix into the system so this does not happen again. Thanks for your patience while we work through slower than usual queue times. The system did not lose jobs, but rather saw slowdowns in queue times.
The database has stabilized and we are sweeping projects for queued tasks. You may notice some tasks that ran long. Many of them probably finished but did not exit properly and were not marked as "complete". You can verify by viewing the task logs. We'll update again as we monitor the system.
The cause of the slowdowns is in a backend database under high load. We're searching for the source of the load and as soon as that's found we'll update this page. That said we've restarted the DB and it's mostly recovered so tasks should start flowing again. Stand by.
We have identified a system slow down with IronWorker. This is a high priority issue, and our Operations team is actively investigating. We will post an update as soon as information becomes available.
Report: "Public cluster push queues are delayed but still sending"
Last update: The issue has been resolved and all messages in push queues have now been flushed and delivered. Note: no messages were lost, but many were significantly delayed in their delivery to endpoints. Additionally, we've identified and patched the root causes that have been causing delays in push queues over the past few weeks. We found edge cases where users created hundreds of thousands of push queues with no subscribers, causing the push processor to spend a long time working on nothing, which delayed other customers. We patched things so that the push processor will bypass queues in those cases (a sketch of that bypass follows this report's updates). This should significantly improve push queue performance. We apologize for any application issues this may have caused. We take these issues seriously and are working around the clock to improve the service across the board for users of all paid tiers. That said, if you need guaranteed reliability and performance, and SLAs in place, please contact us about our enterprise dedicated cluster offering. To discuss dedicated clusters, or anything else about this incident, you can contact us through our in-app messenger or by emailing support@iron.io.
We are preparing to deploy fixes to improve the push queue delays, please stand by.
We’re working on resolving the issues as quickly as possible.
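The patch described above makes the push processor bypass queues that have no subscribers so they cannot starve queues with real work to deliver. The sketch below illustrates that filtering step; the types and field names are assumptions for the example, not IronMQ's actual data model.

```go
package main

import "fmt"

// PushQueue is a simplified stand-in for a push queue: a name, its
// subscriber endpoints, and the messages waiting to be delivered.
type PushQueue struct {
	Name        string
	Subscribers []string
	Pending     []string
}

// processPushQueues walks the queues and skips any queue with no subscribers,
// so empty-subscriber queues cannot monopolize the push processor.
func processPushQueues(queues []PushQueue, deliver func(endpoint, msg string)) {
	for _, q := range queues {
		if len(q.Subscribers) == 0 {
			continue // nothing to deliver to; bypass instead of spinning on it
		}
		for _, msg := range q.Pending {
			for _, endpoint := range q.Subscribers {
				deliver(endpoint, msg)
			}
		}
	}
}

func main() {
	queues := []PushQueue{
		{Name: "orders", Subscribers: []string{"https://example.com/hook"}, Pending: []string{"m1", "m2"}},
		{Name: "orphaned", Subscribers: nil, Pending: []string{"never delivered anyway"}},
	}
	processPushQueues(queues, func(endpoint, msg string) {
		fmt.Printf("push %q to %s\n", msg, endpoint)
	})
}
```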
Report: "Delay in Tasks with Priority 0"
Last update: This incident has been resolved.
We have identified the issue and implemented a solution.
There is no ETA at the moment, but we have identified a solution and are in the process of implementing a fix.
Report: "IronMQ v2 us-east Issues"
Last update: We have resolved the issue and all systems are operational. We will continue to monitor status.
Our engineering team is diligently working on the issue at hand. We will post an update as soon as information becomes available.
Report: "Iron system issue identified in mq-aws-us-east-1-2.iron.io"
Last update: This incident has been resolved.
We have identified the issue and implemented a solution. We expect to post our return to normal status momentarily.
Our engineering team is diligently working on the issue at hand. We will post an update as soon as information becomes available.
Report: "Iron system issue"
Last update: This incident has been resolved.
We have identified the issue and implemented a solution. We expect to post our return to normal status momentarily.
Our engineering team is diligently working on the issue at hand. We will post an update as soon as information becomes available.
Report: "Iron system issue"
Last update: This incident has been resolved.
We have identified the issue and implemented a solution. We expect to post our return to normal status momentarily.
Our engineering team is diligently working on the issue at hand. We will post an update as soon as information becomes available.
Report: "IronCache system issue"
Last update: We have resolved the IronCache issue and all systems are operational. We will continue to monitor status.
The Iron Operations team has updated the system, and metrics are returning to normal. Iron will continue to monitor.
Our engineering team has identified an issue with IronCache and is working on the issue. We will post an update as soon as information becomes available.
Report: "Issue affecting MQv2 in us-east-1"
Last update: This incident has been resolved.
We have identified the issue coming from an upstream vendor. Our Engineering team is working to get this resolved.
We have identified a system issue. This is a high priority issue, and our Operations team is actively investigating. We will post an update as soon as information becomes available.
Report: "IO domain DNS failure"
Last update: This incident has been resolved.
We have identified Iron.io services as being affected by the larger DNS outage affecting the entire .io top-level domain. We continue to monitor the situation.
Report: "DNS Service degradation"
Last update: This issue has been resolved and the service is operating normally.
We have identified Iron.io services as being affected by the larger DNS outage affecting the Internet this morning. The larger DNS issue is being caused by a DDoS attack affecting core Internet infrastructure. Customers should expect intermittent service as DNS services are restored. We continue to monitor the situation.
Report: "Iron system issue"
Last update: We have resolved the issue and all systems are returning to normal operation. We will continue to monitor status.
There is no ETA at the moment, but we have identified a solution and are in the process of working the issue.
Our engineering team is diligently working on the issue at hand. We will post an update as soon as information becomes available.
Report: "Higher than expected latency in IronWorker API."
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.