Historical record of incidents for Endless Group
Report: "Kvm4 Host Node Reboot/Maintenance"
Last updateWe have verified there are no hardware problems present on kvm4 and all services have been restored.
We are rebooting the kvm4 host node in order to access option menus and perform emergency system maintenance. This outage only affects systems hosted on that node, and does not affect the main shared DirectAdmin node.
Report: "Network Flapping/Outage"
Last updateThis incident has been resolved.
We are continuing to see some issues with route convergence, but all services should be accessible again.
The issue is continuing to occur. We are restarting our core router in an attempt to resolve.
We have reestablished the flapping connection and are monitoring for additional issues.
We have identified the issue as a flapping session with one of our transit providers. We have rerouted all traffic while we investigate.
Our network appears to be intermittently available. This may result in offline sites. We are working on resolving the problem now.
Report: "DNS Server Issues (Was: Router Reboot)"
Last updateThis incident has been resolved.
The router has been restored; however, DNS is still nonfunctional. We are investigating this issue urgently.
Due to a crash of our DNS server, we have to reboot our core router. We expect this reboot to occur without any major issues.
Report: "VM Host Disk Failure (Was: Continued Maintenance)"
Last updateWe have been monitoring our host systems and have not observed any further issues. We consider this incident to be resolved. If you are still experiencing problems, please contact our support.
All host systems have been successfully restored and confirmed to be operational. New drives were installed in the failing machine. Additionally, our new host system has been joined to the cluster. We are now monitoring to ensure that all components are operating normally. All customer systems should be online at this time. If you are experiencing an issue with your system, please contact us using our support channels.
We are continuing to work on restoring the failed host. Most customer systems should be back online at this time.
We will be rebooting the remaining host system in order to finalize the update.
We have identified the problem as a failing disk in one of our host machines. We are recovering the machine but as this may use a large portion of our in-network bandwidth, please expect degraded performance on your sites at this time.
We are experiencing a problem where one of our host machines is unable to successfully reboot following the upgrade. We are working on this issue as fast as possible.
The previous maintenance tasks have not yet been completed.
Report: "Networking Outage"
Last updateThis incident has been resolved.
DirectAdmin is functioning normally again. If you are having issues with anything, we can still be reached at support@theendlessweb.com or via social media.
We are having a network outage that may continue for the next few days.
DirectAdmin customers:
- Set your site's A records to 64.62.143.206
- We will make this change for you if you are using our DNS
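For customers making the A record change themselves, a quick check that a hostname now resolves to the address above could look like the following. This is a minimal sketch, not an official Endless tool: the function names and the `resolve` parameter are hypothetical, and `socket.gethostbyname_ex` asks the local system resolver, so a cached answer may lag behind a fresh DNS change.

```python
import socket

# Address customers were asked to point their A records at (from the update above).
FAILOVER_IP = "64.62.143.206"


def a_records(hostname, resolve=socket.gethostbyname_ex):
    """Return the IPv4 addresses the resolver reports for hostname.

    `resolve` is injectable (hypothetical parameter) so the check can be
    exercised without touching the network.
    """
    _name, _aliases, addresses = resolve(hostname)
    return sorted(addresses)


def points_at_failover(hostname, resolve=socket.gethostbyname_ex):
    """True if any resolved address matches the failover IP."""
    return FAILOVER_IP in a_records(hostname, resolve)
```

Calling `points_at_failover("example.com")` with a real hostname performs a live lookup; passing a stub as `resolve` keeps the check offline.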
Report: "Power Outage at Datacenter"
Last updateThis incident has been resolved.
Our main DirectAdmin and VPS host server is down due to a community power outage on both the A+B (EDIT: see below!) power feeds at our datacenter. We are awaiting power recovery. Our email is not affected as this runs through Office 365. EDIT: The power outage may only be on one of the feeds and the access switch with our transit is on the dead feed.
Report: "High Service Error Rates"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating increased service error rates and outages.
Report: "Helpdesk Upgrades"
Last updateUpgrade is complete.
We are currently upgrading our Helpdesk software. This may cause it to be inaccessible for a few minutes.
Report: "Networking flaps due to datacenter power outage"
Last updateA power outage has affected our datacenter and caused some networking issues. Although our equipment did not specifically lose power, our transit providers did, so there was a brief period of time that our services were unavailable. Additionally, a reboot of our router was required in order to fully restore service, which has been completed.
Report: "Networking outages due to partner error"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently experiencing an incoming network traffic issue caused by one of our partners, @RATELIMITEDME. A rogue VM owned by them is currently consuming all incoming traffic to our host. If you receive the message {"success":true,"message":"The File Handler is working normally.","node_info":{"version":2}} when attempting to visit your site, you are affected by this problem. We are currently working with them to resolve this issue and expect to have service restored soon.
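A site owner could test for the message above programmatically. The following is a minimal sketch that only inspects a response body already fetched by other means; the function name is hypothetical, and the check assumes the `success` and `message` fields appear exactly as quoted in the update.

```python
import json

# Message text quoted in the incident update above.
PLACEHOLDER_MESSAGE = "The File Handler is working normally."


def is_file_handler_placeholder(body: str) -> bool:
    """Return True if body is the partner's file-handler JSON reply
    rather than the customer's own site content."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all (for example, a normal HTML page).
        return False
    if not isinstance(payload, dict):
        return False
    return (
        payload.get("success") is True
        and payload.get("message") == PLACEHOLDER_MESSAGE
    )
```

Ordinary HTML or unrelated JSON returns `False`, so the check should not flag a healthy site.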
Report: "Quick Downtime to fix something."
Last updateThis incident has been resolved.
We are continuing to work on a fix for this issue.
Quick Downtime to fix something
Report: "JIRA Offline"
Last updateThanks to Atlassian support for helping us with this issue! We're up and running again!
We are experiencing an issue with our JIRA helpdesk that appears to be beyond our ability to fix. We have opened a ticket with Atlassian in order to get help on this issue. In the meantime, you can use Discord to contact us, as emailing our support email or opening a ticket in JIRA is currently not an option.
Report: "MySQL Outage"
Last updateThis incident has been resolved.
The issue has been identified as a configuration that was reset without our knowledge during a recent system update. The configuration has been fixed and MySQL is now starting again. Please expect degraded performance as many sites are attempting to reconnect to the database now that it is online again.
Our MySQL instance seems to have crashed and is not restarting as intended. We are investigating this issue and hope to be able to have it up and running again soon.
Report: "MySQL Down"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
At this time we are experiencing issues with the MySQL database on our production DirectAdmin server. Please stand by while we investigate the issue.
Report: "Network Downtime"
Last updateThis incident has been resolved.
The session with Vultr has returned to an established state and we are watching to ensure that the session does not flap again.
At this time we have been made aware of a problem affecting networking. Further investigation reveals that our BGP session with Vultr is flapping. We are currently investigating the cause of this but it appears unrelated to our end of the system.
Report: "SQL Database Connection Errors"
Last updateAfter testing that the database is operational, we consider this incident resolved. Please let us know if you have any issues after this point.
A fix has been implemented and we are now monitoring the results. Please notify us via support ticket or Discord if your application is unable to connect to or use the database.
This issue has been identified. Please expect up to two hours of downtime on database-related functions.
We are continuing to investigate this issue.
Sites that rely on database connections may not be functioning properly. We are investigating the issue.
Report: "Power Outage"
Last updateAt this time services have been restored. Thanks for sticking with us.
We are continuing to work on a fix for this issue.
At this time we are entirely offline due to inclement weather causing a power outage at our main datacenter location. We do not have an ETA on service restoration, as multiple trees have fallen across the power lines and must be dealt with.
Report: "DirectAdmin User Login Disabled"
Last updateDirectAdmin has gone ahead and fixed the issue with our license and we are back online as normal.
Due to DirectAdmin licensing changes, our DirectAdmin license is being treated as expired on the server. This prevents users from logging into the DA website, but websites will remain online. We retain console access to the server. If there are mission-critical changes that must be made to your website during this time, please contact us and we will find a way to accommodate you. We currently have no ETA as to when DA will be working again. We are at the whim of DA support to fix the licensing problem or activate our new license. Additionally, signups are currently not working as the DA API has shut down. If you create a signup, it will take longer to be processed, as we will have to wait until DA is working again.
Report: "Network Downtime due to Upstream Provider re-provisioning"
Last updateAs we are now seeing 100% visibility on RIPEstat, this incident is now resolved.
We are seeing that at least 80% of the global network is able to see our address block. We will leave this incident open as we continue to monitor the situation, but at this time the Endless services are back online.
Please continue to monitor the RIPEstat page for our IP block to watch the progress of our new announcements. https://stat.ripe.net/185.86.231.0%20-%20185.86.231.255?sourceapp=ripedb#tabId=routing
Our upstreams have enabled our new session. We are now waiting for our prefix to become visible.
Our upstream provider is switching our BGP session to a new ASN. While they do this, we will not have uptime on our IP addresses, and therefore there will be a service outage.
Report: "Degraded Network Performance"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating an issue where there may be brief periods of packet loss within our network resulting in degraded performance of customer sites.
Report: "Billing and internal Jira/Confluence instances partially unavailable"
Last updateThis incident has been resolved.
We are currently investigating issues that affect our Billing Portal (portal.theendlessweb.com) and our Jira/Confluence instances. We'll let you know once we have an update.
Report: "Intermittent Network Issues"
Last updateOn 1/26/20 at around 11:34 AM EST, we were informed that the site was becoming extremely slow and that responses were taking a while.

> Max - Last Sunday at 11:34 AM
>
> Can it be that the server has to handle a lot right now? Because responses are really slow right now (and I mean even slower than my connection normally is)

After this initial message, we noticed that our servers had gone completely offline. Throughout the next few days, the systems would come online for a short period, then go offline again for a longer period, and the cycle would repeat. We were unable to get any onsite staff to have a look at the possible hardware problem, as through the limited periods of connectivity, we were unable to diagnose any issue through our router interfaces or through Proxmox.

Throughout the next few days we attempted as many troubleshooting steps as we could remotely, including gateway resets and router reboots. None of these appeared to resolve the problem. We were eventually able to get onsite staff to look at the problem, and no immediate problem presented itself. We did reseat all networking connections and the issue seems to have resolved itself. We are unsure of the exact cause of the issue, but suspect it was a power supply problem with our HP ProCurve switch. We will update this postmortem if any new information is discovered.

We have added many new monitoring systems through Datadog to allow for any future incidents to be handled better. Some of the Datadog stats are available on the homepage of this status site, but the rest are available on our public Datadog dashboard, which is linked at the top of this status site. We will be able to better handle any future incidents and be informed quicker when they happen.

The following is some more information on the switch issue:

> DJ Electro - Today at 7:53 AM
>
> my thinking was when I connected over Wifi, I couldn't even reach local sites (192.168.1.1 timed out). Soooo it couldn't be the modem since I would still be able to reach local addresses, therefore it must be the switch because if that died then it would make sense I would drop connections to all addresses.
>
> Now our switch does this weird thing sometimes where if theres a big voltage drop which can happen if theirs a power grid problem or if the cable is yanked or pulled to the wrong direction, the switch will power cycle instead of doing what anything else does during a brownout. So my guess was that the cable was pulled wayyyy off and so every single time it would complete the boot cycle it would just power cycle again and we were sort of stuck in an infinite loop but I checked the modem and I reseated the power plug on the switch.

Thanks for sticking with us,
EH Administration
At this time we have determined that the issue is most likely resolved. Expect a postmortem at a later time.
A fix has been implemented and we are monitoring the results.
The issue appears to have returned, although network access lasted 30+ minutes. We suspect an L1 problem at this point: possibly a pulled coax line, an Ethernet problem, or a problem with our cable modem. It is also possible that we are having problems with our HP ProCurve network switch. We will likely look into replacing our cable modem after this, as it has usually been the source of similar problems. Unfortunately, there are no staff available onsite, so we can only work on the problem during the short periods when the network comes online.
We were unable to locate an exact cause of the issue, as we only have remote access to the datacenter. At this time, IPv4 access appears to be operational. IPv6 access appears to be non-functional, and we are working on restoring it now that we are able to access our router interface. As such, this issue has been reduced to a partial outage.
We have identified an issue with our network stack. At this time, the cause of the issue is unknown. The issue seems to cause problems connecting to any Endless services, and appears to be intermittent. We will update as more information is found.
Report: "Homepage Down"
Last updateAt 10:30 UTC, we received notification that our homepage was not connecting. Upon further investigation, it turned out that an automatic container reboot caused an invalid nginx config to be enabled, so nginx was unable to start. We removed the rogue configuration and the issue is now resolved. oop-
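One way to catch this class of problem is to validate the configuration before a restart. The sketch below is an assumption about how such a check might be wired up, not a description of what was actually deployed: it relies on `nginx -t`, which tests the configuration and exits non-zero on failure, and the function name and `command` parameter are hypothetical.

```python
import subprocess


def config_is_valid(command=("nginx", "-t")) -> bool:
    """Run a config test command and report whether it exited cleanly.

    Defaults to `nginx -t`; `command` is a hypothetical parameter that
    lets the check be pointed at another binary or wrapped for testing.
    """
    try:
        result = subprocess.run(list(command), capture_output=True, text=True)
    except FileNotFoundError:
        # The binary is missing, so the config cannot be validated.
        return False
    return result.returncode == 0
```

A startup script could call `config_is_valid()` and refuse to enable a new configuration when it returns `False`.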
Report: "Potential Service Degradation"
Last updateThis incident has been resolved.
Our servers are located in an area which has recently been issued tornado warnings. There is a chance we will fall over to backup power, and there is also a chance of network issues. We're posting this to be on the safe side.
Report: "Degraded Performance due to Updates"
Last updateThis incident has been resolved.
At this time we are updating a number of systems in our network in order to patch security issues and maintain the latest versions of software. During this time, performance of DirectAdmin and sites may be slow.
Report: "Network Outages"
Last updateThis incident has been resolved.
We have discovered network outages and are investigating the problem. During this time EH services may not be available.
Report: "Dead dedicated Server"
Last update[closing, see previous update]
We're back online, as of about 4 hours ago! Isn't this an efficient statuspage!!
We're running off a backup node for now so expect some degraded performance, but sites are currently back online. We do not have an ETA yet for our main node coming back online, but we will continue to update here as soon as we know more.
Edit: The staff mailserver never went offline; the client one is still down. (Exim = client email)
Our dedicated server has died completely. We are trying to come up with solutions for how to handle this if our dedi does not come back online in the datacenter. (Don't worry, we have a backup of everything.)
Report: "Control Panel Auth Issues"
Last updateApparently this was fixed ~30m ago and we forgot to update this, whoops! Thanks for choosing Endless Hosting!
We have heard multiple reports of the DirectAdmin control panel being unable to authenticate users. Users are unable to log in to DirectAdmin, FTP or SSH; however, all sites remain online. We are investigating the issue and hope to have it resolved soon. Sorry for the inconvenience!
Report: "Dedi Reboot for IP Changes"
Last updateThis incident has been resolved.
We're rebooting our dedi and reconfiguring some networking things in order to provide IPv6 for our customers and their websites. This should not take very long.
Report: "Datacentre Down"
Last updateThe datacenter is back online.
Update: Received word from our host, it's a networking issue and they're looking into it.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
Our datacentre seems to have gone down, and we have confirmed with other tenants. We are currently in contact with the datacentre's on-site staff. Migration is now on hold.
Report: "All EH Services Down"
Last updateEverything is now back online.
Due to an issue with our colocation provider, our dedicated server is down. This means all services are also down. We do not have an ETA on when services will be restored.
Report: "JIRA Update Failure"
Last updateOur JIRA instance failed to update during the maintenance, however the issue is now resolved and the service should be back online.
Report: "Signups Down"
Last updateThis incident has been resolved.
Signups are not going through to JIRA.
Report: "Cloud Router offline"
Last updateOur Cloud Router went offline for a few moments however it is back online now.
Report: "JIRA Updated"
Last updateWe updated our JIRA instance which resulted in a few minutes of downtime. Everything should be back online now.
Report: "ALL EH Services Down"
Last updateWe first noticed our host was entirely down at around 00:05 UTC. This included our VM management system, Proxmox. This signaled us to contact our provider, InterConnX, about the issue. They provided the following response (PII Redacted):

> `Hi <Redacted> thanks for the buzz. Looks like the account is fully vetted now. Our upper management had a few things come back to them about servers being used for wrongful marketing and advertisement but it seems to have been closed out by upper management and the server is back online. `

We do not know who may have submitted a report to our host, nor what the report consisted of, as we are not aware of any marketing or advertisement being done using our services. If you were the creator of this report and wish only to have a certain customer/website removed for this offense (which is also against our own guidelines), please open a support ticket with us.

Thanks,
EH Administration
Our host has identified the problem and has restarted our server. A postmortem will follow with further information on the cause of the issue.
We are investigating the issue with all services being down. The issue has been determined to be, in part, caused by our top-level host. Further details pending.
Report: "VPSs Down"
Last updateThe issue was caused by a problem with our old VPS provider, Google Cloud. As some of our VPS customers know, we had migrated most customers to a new system that we control and that did not have this issue. This only applied to customers using our old VPS system who had refused the migration. At this time, all VPS customers have been migrated to the new system, and as such, Google Cloud will no longer be a component on our statuspage. Thanks, EH Administration
We've resolved this issue and customer VPSs are online again.
The fix has been delayed. We are continuing to work on resolving this issue.
All customer VPSs are currently down. They will be back online soon.