Historical record of incidents for Simplero
Report: "AI Chat Bot on sites not working"
Last updateAI Chat Bot on sites should be working again.
Actually, it was our own boo-boo. We deleted a team member who no longer works here from our OpenAI account, but our servers were using credentials associated with their user. We are changing the credentials right now. Expect everything to be working in 20-25 minutes.
The service we use for the chat bot (OpenAI) is currently down. As a result, the AI chat bot on sites is currently not working.
Report: "Issues with email deliveries"
Last updateThis incident has been resolved.
Our email provider, SendGrid, is dealing with an incident which may delay email deliveries.
Report: "All pages listing course lessons are currently broken -- a fix is being deployed and should be out in 8 minutes..."
Last updateAll is working in the land of Simplero again.
We are currently investigating this issue.
Report: "Email Sending is Down"
Last updateOur emails have been unsuspended and they should be up and running again. Emails sent during the suspension have now been delivered.
Our emails have been unsuspended and they should be up and running again. We are working to confirm if emails sent during the suspension will still be sent or if they will need to be resent.
We're in touch with several people at Twilio, but no one is able to actually do anything because it's Christmas here in the US. It's pretty remarkable that a $17Bn market cap ~10,000 person company cannot find a single person who's able to flip a simple switch to rectify an obvious mistake. But that's where we're at. We've also switched over transactional email (login information, receipts/invoices, forgotten password, etc.) to use the channel that does let emails go through. More details in the community: https://simplero.community/forum/posts/193635-email-down
We now know the reason for the suspension: a phishing email sent to one of our members was forwarded by our systems to the same member as a notification email. We are still waiting on our email delivery system to restore our account.
Emails are not being delivered. Our email delivery system suddenly suspended our email sending without a clear reason. We have asked for urgent support from them and are waiting for a response.
Report: "Simplero is down"
Last updateOne of our webservers (out of 10) went down for ~45 minutes. We've restarted it, so the problem should be fixed. Weirdly enough, our automatic alerts didn't catch this downtime. We'll continue to monitor and figure out a way to set up automatic alerts for this case so we're alerted early on.
Some people are unable to access Simplero. We are investigating the issue.
Report: "Simplero is down"
Last updateWe've resolved the issue and everything should be back to normal.
Our engineers are working on a spam traffic attack that's bringing us down.
Report: "Simplero is down"
Last updateThis incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Background processing and API is down"
Last updateThe email stats and other stuff are still catching up and will be updated very soon.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are investigating the issue.
Report: "Email Delivery Delays"
Last updateThis incident has been resolved.
We are currently experiencing an issue impacting our email delivery system. Users may notice delays in receiving emails sent through our platform. Current Status: Our engineering team has identified the root cause as an unexpected surge in email load, leading to a bottleneck in our processing queues. We are actively working on fixing it.
Report: "Simplero is down"
Last updateThis incident has been resolved.
We are currently investigating the issue.
Report: "Errors on Site admin pages"
Last updateAll done. So sorry about this.
All admin pages for sites not on the new experience are throwing errors right now. A fix is going out. Should be all done within 15-20 minutes. (That's how long it takes to deploy and update.)
Report: "Course overview pages are currently broken"
Last updateAll fixed. So sorry about that.
A fix is going out right now. The courses themselves are fine, but the overview page is throwing a 500 server error.
Report: "Search Functionality Disruption"
Last updateThis incident has been resolved.
Search should be working as expected, we are monitoring for any issues.
We are currently experiencing an issue with our search functionality. Our team is aware of the problem and is working diligently to resolve it as soon as possible. We apologize for any inconvenience this may cause and appreciate your patience.
Report: "Simplero is down"
Last updateWe have restarted our database which got us back!
Seems to be affecting our database which is causing all Simplero admin and user facing pages to be down. The Engineering team is investigating.
Report: "FontAwesome is Down"
Last updateFontAwesome is back as well as all the fabulous icons and texts.
FontAwesome is down. This is affecting fonts and icons used in Simplero. Go here to see their status updates: https://status.fortawesome.com/ We'll do our best to update as we get more information.
Report: "Simplero is down right now.."
Last updateThis incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Email Delivery Delay"
Last updateThis incident has been resolved.
We are currently experiencing issues with our email provider, which has resulted in delays in email delivery. Outgoing emails may be affected. Our technical team is actively working on resolving this issue and is in communication with the email provider.
Report: "Investigating issues accessing the platform"
Last updateThis incident has been resolved. What happened? We created a new API endpoint, and it was used at a much higher rate than we were anticipating. This created a logjam in our backend processing, which spilled over to page loads. We are so sorry for that! We have now added rate-limiting to this endpoint and are modifying it in a way that prevents this from happening again.
We've identified an issue that may be the cause of the downtime. We are deploying a fix and will continue to monitor.
We are currently investigating this issue.
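For readers curious what the rate-limiting mentioned in the resolution of this incident can look like, here is a minimal token-bucket sketch. It is an illustration only, not Simplero's actual implementation: the class name `TokenBucket`, its parameters, and the chosen limits are all assumptions.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`.

    Hypothetical illustration; the real endpoint's limiter is not described
    in the incident report.
    """

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would keep one bucket per API client and answer with an HTTP 429 whenever `allow()` returns False, so a single busy client cannot starve background processing.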
Report: "We have a problem, we're working on it, it seems to be affecting checkout pages and video assets"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
Our "pusher", which handles the "purchase processing" screen, video encoding, and transcript generation, is extremely busy at the moment. The following are expected to be affected: 1. The "purchase processing" screen will not automatically move on: the user purchasing from your site will need to click the link to force the redirect. 2. Video encoding status will not automatically update in your dashboard, but the encoding will still process: you'll just need to refresh the page to see it updated. 3. Video transcription status will not automatically update in your dashboard, but the transcript will still be generated in the background: you'll just need to refresh the page to see it updated.
Report: "Instagram feeds are down"
Last updateInstagram feeds were working again as of Feb 16th. Did we remember to update this? No. No we did not.
Our integration with Instagram is currently being reviewed by Meta. We've submitted the information we need to submit and the Instagram feed section should start working again within 2-3 days. Please hide your Instagram sections for now.
Report: "Attachments (image/file uploads/mentions) on comments/forum posts uploaded 2 days ago not being displayed"
Last updateWe've fixed attachments and mentions posted between February 3 and 5. All attachments and mentions should be functional again.
We have fixed the issue for attachments and mentions posted before February 3 and all those posted going forward. We are working on a fix for those posted between February 3 and 5.
The issue has been identified and a fix is being implemented.
Report: "Website degraded performance"
Last updateWebsite performance was degraded for about 30 minutes. It has gone back to normal. We subsequently found the root cause and fixed it.
Report: "All sites showing error code"
Last updateFixed.
Sites will show a message like "ERROR: undefined method `google?' for nil:NilClass" or show the site without any styles at all. A fix is currently being deployed. So sorry about this.
Report: "Database upgrade has stalled Broadcast and Email sendings"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating why Broadcasts and Emails are not sending after our Database upgrade. We will update you as soon as possible.
Report: "Looks like AWS is down"
Last updateAWS should be back online
We have disabled a part of our logging service that depends on the affected AWS region. Everything on Simplero should be working again. We are continuously monitoring for other issues that may come up - none so far.
We are continuing to investigate this issue.
Looks like Amazon Web Services is having issues causing outages on Simplero. We are investigating further...
Report: "Emails may not be sending"
Last updateThis issue appears to have been resolved. A small number of emails may not have been sent between 2:28 PM and 3:34 PM EST on October 12. If you sent messages around that time, please check to see if the broadcasts are marked as 'not delivered'.
This issue appears to have been resolved, but we are monitoring to make sure no further issues occur.
We are experiencing an issue where some emails may not be delivered.
Report: "Instagram is having some problems with their API"
Last updateThe virtual hugs worked! Facebook/Instagram have announced that their API is back up. If your account's feed disconnected during this outage you will now be able to reconnect it in Settings > Integrations. Please check yours! Some disconnected and some didn't...
We are monitoring to see when Instagram resolves their issue. Let's all send them virtual hugs...
At around 1AM ET on July 24, we started experiencing problems with our Instagram integration. After much digging on our end (go Owais!), it turns out to be an issue with the Instagram API itself, and not with Simplero. As a result, your Instagram integrations might not work as expected until Instagram resolves these issues. We will monitor and you can also follow along here: Facebook's status page: https://status.fb.com/graph-api
Report: "Auto-response and Automation email delivery stats are not complete"
Last updateThis issue is now resolved and emails sent during this issue should now show correct statistics.
We have implemented a fix and new stats should now be recorded correctly. We are monitoring this fix and exploring methods to update the data for emails sent during the issue.
All emails are still being sent, don't worry! But we are currently investigating an issue where the email stats for these messages are zero or incomplete. The number of 'Delivered' emails is not correct, and thus the percentages of other things (like 'Opens') are also wonky. As far as we can tell, other number-based metrics like 'Opens' are accurate - only percentages are affected. The problem may lie with SendGrid, but we haven't fully identified the issue yet. We're on it, though! Our apologies for any inconvenience.
Report: "Saved cards not working for new purchases"
Last updateWe believe we have this all straightened out, and saved cards are again available for new purchases.
Previously-saved cards are temporarily not working for making new purchases. Complete fix expected within a few hours. In the meantime, saved cards do not show as a payment option, so the checkout process for repeat customers is somewhat worse than normal but fully functional. (Only cards processed via Stripe are affected, no other processors.)
Report: "Something is Amiss"
Last updateWe had a backlog of running jobs. Jobs are running again and we are seeing emails and media files uploading again.
We are currently investigating an issue with emails not being sent and video encoding. We'll update as soon as we have the issue resolved.
Report: "Links in email briefly broken"
Last updateFor five minutes or so, anyone clicking a link in an email got an error page. If they tried again in a few minutes, it worked correctly. No change is too small to go through the proper steps. Our head of engineering is having a stern talk about expectations and SOPs with...himself. Mea culpa. -Joshua
Report: "Certain email deliveries delayed"
Last updateSome deliveries failed last night between 9:20 PM and 2:20 AM Eastern. We corrected the problem (one of our mail-sending servers rapidly ran out of disk space due to an unrelated series of unfortunate events) and re-sent all the failed emails, except where we could tell that the account owner had already re-sent them. This really was quite the freak combination of problems, but we're taking steps to make sure similar processes can't use up the disk again.
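One common safeguard against a server quietly running out of disk is a periodic free-space check. The sketch below is an assumption about what such a check could look like, not a description of the steps Simplero actually took. The function name `low_disk_warning` and the 15% threshold are made up for illustration; `shutil.disk_usage` is part of the Python standard library.

```python
import shutil
from typing import Optional


def low_disk_warning(path: str, min_free_fraction: float = 0.15,
                     usage=shutil.disk_usage) -> Optional[str]:
    """Return a warning string when free space on `path` drops below the threshold.

    `usage` is injectable so the check can be exercised without a real disk.
    The threshold is an arbitrary example value.
    """
    total, _used, free = usage(path)
    free_fraction = free / total
    if free_fraction < min_free_fraction:
        return f"{path}: only {free_fraction:.0%} of disk free"
    return None
```

Run from a scheduler every few minutes, a non-None result would be forwarded to whatever alerting channel the team already uses.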
Report: "The case of the unreported deliveries"
Last updateYou may have noticed unusually low % delivered for mailings sent in the last day or so. For about 24 hours starting at 1:45 PM Eastern (18:45 UTC) on November 2, Simplero did not record deliveries or bounces for email addresses with capital letters in them. The mail still got delivered as it always does! Unfortunately we can't get those delivery events back, and affected mailings are going to have somewhat weird-looking reports. Opens and clicks were still tracked correctly.
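The report does not say what caused addresses with capital letters to be skipped. One plausible way this class of bug arises is an exact-match lookup between the address stored for a recipient and the address in a delivery event. The sketch below shows matching on a normalized key instead. The event field names (`email`, `event`) and function names are hypothetical.

```python
def normalize_email(address: str) -> str:
    """Key used only for matching; the original address is kept for display."""
    return address.strip().casefold()


def delivered_recipients(recipients, events):
    """Return the recipients (as originally entered) that have a 'delivered' event.

    `events` is a list of dicts with hypothetical keys 'email' and 'event'.
    """
    by_key = {normalize_email(r): r for r in recipients}
    delivered = set()
    for event in events:
        recipient = by_key.get(normalize_email(event["email"]))
        if recipient is not None and event["event"] == "delivered":
            delivered.add(recipient)
    return delivered
```

Strictly speaking, the part of an address before the @ may be case-sensitive, but mail systems almost universally treat it as case-insensitive, so matching on a folded key is the usual practical choice.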
Report: "Database upgrade"
Last updateThursday we set in motion some infrastructure upgrades, very carefully, behind the scenes. But it turns out MariaDB has a bug that caused it to "leak" memory when using a certain kind of data compression, and over the course of several hours, it consumed all available memory, slowed down, and rebooted itself. That caused the few minutes of downtime on Thursday evening. We've never had a problem like that before, but there's a first time for everything. Now we have alarms so we'll be notified of any memory issues with the database long before they cause a problem. We also decided to upgrade MariaDB to a version that fixes the memory leak bug. It's a so-called minor version upgrade, and Amazon even offers to do them for you automatically during a short regularly-scheduled maintenance window, so we expected a few minutes of downtime. Instead, as you know, there was over an hour Saturday night when the database (and hence the entire application) was inaccessible. And once the process started there was no stopping it: we were at the mercy of Amazon Web Services. Going forward, we'll announce ahead of time on [status.simplero.com](http://status.simplero.com) and in our Facebook group any time we plan even a few minutes of downtime. And we're implementing a plan to be able to upgrade the database with (for real) no more than a few minutes of downtime.
And, we're back! Sorry that took a bit longer than expected. All is safe and sound and operational.
We're currently doing a database upgrade. We expect to be back online in a few minutes. Sorry for the wait.
Report: "More database upgrade"
Last updateThis incident has been resolved.
Upgrade process is completely finished. No data was harmed in the upgrading of this database.
We've been back up for a while now, and we're fairly sure we're out of the woods. But given that we thought we were done 30 minutes ago and then we weren't, we'll leave this Status as Monitoring. To be on the safe side.
We are continuing to investigate this issue.
Apparently our database wasn't quite done updating. This is still expected downtime, it's just taking longer than we'd expected. We're very sorry this is taking so long.
Report: "Something is amiss"
Last updateAnd we're back in business! We'll post more details here about what happened after we do a full post-mortem.
As you've noticed, something is amiss in Simplero-land. We're on it and will get it fixed ASAP.
Report: "Site is down ... working on it"
Last updateAll good now. Thanks for your patience.
Most stuff is back online now. Looks like it's just our own website (simplero.com) that's still borked. Your sites and services are working fine, and you can login to your account by going to youraccount.simplero.com/admin.
We made a boo-boo. We're working hard on restoring service. So sorry, guys. We know we screwed up.
Report: "Brief interruptions caused by maintenance"
Last updateIt's all cleaned up now. We apologize for the inconvenience. A few times the site was offline and everything got paused for a minute, but it's all back to normal, and there should be no lasting effects.
We're experiencing a few brief interruptions in service this morning due to some unexpected problems during system maintenance. We're working on getting it all cleaned up.
Report: "Mail delayed by SendGrid outage"
Last updateSendGrid is reporting that systems are back online. I still wouldn't be surprised if some inbound and outbound messages are delayed.
According to https://status.sendgrid.com, SendGrid is having an outage across all capabilities. Mail sending will be delayed. Our architecture is designed so that mail will get delivered automatically as soon as SendGrid is back online.
Report: "Temporary network error caused downtime for sites"
Last updateA temporary error with domain name resolution happened to coincide with our process that checks to see that domain names are still configured to point to Simplero, which caused our system to see many domains as no longer pointing to Simplero, which caused that process to mark them inactive. This kind of problem has never happened before in all the years we have supported custom domains, but the system design was still a mistake on our part: DNS systems _can_ fail, so we shouldn't have had a system that deactivated sites based on a single check. We have improved the system so that an active domain must fail multiple checks over a couple days before it's deactivated.
Our systems suffered temporary, partial errors with domain name resolution this morning, which resulted in a number of customer websites temporarily failing to display. We're still investigating to determine exactly what went wrong and what sequence of events may have caused sites to be offline any longer than necessary.
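The fix described in the resolution of this incident (a domain must fail multiple checks over a couple of days before it is deactivated) can be sketched roughly as follows. This is an illustration under stated assumptions, not Simplero's code: the names `DomainStatus` and `record_check`, and the thresholds of three failures over two days, are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DomainStatus:
    active: bool = True
    failures: int = 0                      # consecutive failed checks
    first_failure: Optional[datetime] = None


def record_check(status: DomainStatus, points_to_us: bool, now: datetime,
                 min_failures: int = 3,
                 min_span: timedelta = timedelta(days=2)) -> DomainStatus:
    """Update a domain after one DNS check; thresholds are example values."""
    if points_to_us:
        # Any successful check clears the failure streak.
        status.failures = 0
        status.first_failure = None
        return status
    status.failures += 1
    if status.first_failure is None:
        status.first_failure = now
    # Deactivate only after enough failures spread over enough time, so a
    # short resolver outage cannot take a site down on its own.
    if status.failures >= min_failures and now - status.first_failure >= min_span:
        status.active = False
    return status
```

Requiring both a failure count and a minimum time span matters: a burst of retries during a ten-minute DNS outage would satisfy the count alone.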
Report: "Mail sending down"
Last updateOur mail sending partner decided to change all login credentials at 20 minutes past midnight US Eastern Time on a Saturday, without notice, in a way that broke all of our email sending completely. Emails just stopped going out. This is terrible on their part. We're going to reevaluate our business relationship with them, we're going to obviously do everything we can to make sure this won't happen again in the future, and we will create a system to catch a situation like this automatically, and immediately, going forward. I'm so sorry. This is absolutely horrific. Nothing like this has ever happened before in our 11+ year history, and I've never experienced a supplier behaving this irresponsibly before. We've definitely learned from this. With sincere apologies, -Calvin
Backlogged messages have been sent.
Email sending is working again, and we are delivering all mail that should have been sent earlier today. We're monitoring to make sure everything gets sent.
All mail sending from Simplero is currently failing. We have identified the problem and are working to correct it.
Report: "Brief downtime"
Last updateWe had nine minutes of downtime from 12:25 AM to 12:34 AM US Eastern time. To support a new feature, a developer made a configuration change of a kind we rarely need to make, and it didn't go well. We're improving our internal documentation, and this won't happen again.
Report: "Notification emails delayed"
Last updateNotification emails and other one-at-a-time emails across Simplero were stalled from 11:43 PM US Eastern time last night until 8:52 AM this morning. All such emails were delivered starting at 8:52 AM. The problem was caused by a configuration error which is now fixed. Broadcasts and newsletters delivered normally and were not affected.
Report: "Some purchases failed during an hour due to network issues"
Last updateFrom 1:22 PM to 2:32 PM US Eastern today, some of our servers were unable to make connections to outside services, including payment gateways. Some payments attempted during this window failed. Full connectivity has been restored. We sincerely apologize for the outage.
Report: "Intermittent connectivity"
Last updateThis morning we had about 30 minutes of intermittent failures affecting the Simplero software and customer websites, including [simplero.com](http://simplero.com): we use our own stuff! That was followed by a few minutes of all services being down completely. We're so sorry about that! Here's what happened. We deploy a new version of Simplero every time we fix or improve something, typically several times a day. We keep a few previous versions around, and the oldest version gets cleaned up as a new version gets deployed. One of our deploys this morning failed, and the old version kept running. That's as it should be, but the failure today was silent: we didn't realize anything had gone wrong. A few more deploys later, the old version was still running, but it was old enough that the application files got cleaned up right out from under the running application on one of our servers. (All your media, images, text, customer data, and any other files you've added to your Simplero were just fine. Only the application itself was affected.) Another deploy trying to fix the problem meant the old, still-running version got cleaned up on every server, and we went from intermittently down to completely down. Finally, we realized the root cause and undid the changes that were causing new deploys to fail. Going forward, we're changing our deploy process to make a silent failure like this visible so we can roll it back immediately. Sorry we let you down: we've learned from this error and we'll make sure this kind of failure can't happen again. Thank you for your patience and for your trust in Simplero.
We're back online.
We're completely down now. Deploying a fix we believe will solve it completely. Fingers crossed.
We've received reports of some sites experiencing intermittent connectivity issues. We're currently investigating the issue.
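The post-mortem for this incident describes two gaps: old releases were cleaned up even though one was still running, and a failed deploy went unnoticed. The sketch below illustrates one way to guard against both. It is a hypothetical example, not Simplero's deploy tooling; the function names, release names, and the `keep` default are assumptions.

```python
def releases_to_delete(releases, running, keep=5):
    """Pick releases that are safe to remove.

    `releases` is ordered oldest to newest; `running` holds the names of any
    releases still serving traffic. A running release is never deleted, no
    matter how old it is.
    """
    recent = releases[max(0, len(releases) - keep):]
    protected = set(recent) | set(running)
    return [r for r in releases if r not in protected]


def stale_servers(expected_release, running_by_server):
    """List servers not running the release that was just deployed.

    A non-empty result means the deploy failed somewhere, even if the
    deploy command itself reported nothing.
    """
    return sorted(server for server, release in running_by_server.items()
                  if release != expected_release)
```

Checking `stale_servers` right after each deploy turns a silent failure into a visible one, and consulting `running` during cleanup means files are never removed from under a live process.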
Report: "System Wide Outage"
Last updateHere's what happened yesterday with our longest downtime in 5 years. First, our background jobs got stuck, and we got a notification about it. It was strange, because there hadn't been a recent deploy or any other recent event that would correlate to that. Then, in an attempt to get them unstuck, a team member made a quick decision to run a full deploy. That turned out to be a mistake, because that ended up taking down EVERYTHING, including our web servers, so now the site was completely down. To be fair, though, given what turned out to be the cause, the web servers would probably have stopped responding fairly soon after, anyway. As soon as the site was down, it was all hands on deck. We spent the majority of the time just trying to figure out what the heck was going on. There was nothing in the logs, no indications of what could be causing this. We tried the logical route: It started with background jobs, it spread to the web servers when they were redeployed. We also, of course, tried the good old "turn it off and back on" method, but, predictably, it didn't do anything to fix it. Finally we got a clue. Some requests did go through, and they threw an error from our PostgreSQL database saying the connection was bad. That pointed us in the direction of the logging server running PostgreSQL. As soon as we validated that, it was an easy fix to turn off logging to PostgreSQL, which is safe to do since we only use it for internal debugging purposes. Then the site was back up. But what had gone wrong with our PostgreSQL database? We keep stuff there for a limited period of time, and then delete it. It looks like the way we deleted things wasn't very efficient, and we also never VACUUM'd our database. It's been many years since I last used PostgreSQL, and that was something I'd forgotten you should do every so often.
One thing that threw us was that our system is designed such that if logging to PostgreSQL fails for some reason, the application should be able to keep serving requests. Clearly something about that wasn't working quite right. We've now changed our process for how we delete old rows, and implemented a system to VACUUM the database more regularly, as well as split this process out from some other processes it was lumped in with. Again, I'm super sorry about this. The big factor was just how long it took us to figure out what was going on here. It was completely mystifying for the longest time, until we finally got a clue that put us on the right track. Thank you for being here with us. We're grateful every day.
Everything's operational. Our specialized logging system is still offline, but that doesn't affect operations. We're doing some maintenance and cleanup on it, before putting it back in commission. This was the longest-running downtime in five years, and I'm terribly terribly sorry we let you down like this. We are, of course, fixing all of the issues that led to this downtime.
Yup, that was it. We're back. Now on to figuring out what happened to our PostgreSQL installation. It seems like something's really screwed there.
We think we figured out what's going on. It's related to our logging infrastructure.
This is the strangest thing I've seen in my almost 40 years in software development. It's certainly the worst downtime we've had in over five years. We've got all hands on deck trying to figure this thing out, but at this point, we don't even know what's causing the processes to not respond correctly. I'm so so sorry. We take this stuff supremely seriously, and we're working as HARD as we can to bring everything back up.
We are currently experiencing a system-wide outage. We are looking into it and will update with details as soon as we can. Thanks for your patience!
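The post-mortem for this outage notes that the application was meant to keep serving requests even when logging to PostgreSQL failed. One general way to get that property, including when the log store hangs rather than errors, is to hand records to a bounded queue and let a background thread do the writing. The sketch below is an assumption about how such a design could look, not a description of Simplero's system. `BestEffortLogger` and `sink` are hypothetical names.

```python
import queue
import threading


class BestEffortLogger:
    """Never block the caller: drop records when the sink cannot keep up."""

    def __init__(self, sink, max_pending: int = 1000):
        self._sink = sink                      # callable that writes one record
        self._queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0                       # records discarded because the queue was full
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def log(self, record) -> bool:
        """Queue a record; return False (and count a drop) if the queue is full."""
        try:
            self._queue.put_nowait(record)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        while True:
            record = self._queue.get()
            try:
                self._sink(record)
            except Exception:
                # A failing sink must never take the application down.
                pass
            finally:
                self._queue.task_done()
```

With this shape, a slow or unreachable log database costs some debugging records (visible in `dropped`) instead of request capacity. The trade-off is that logs are lossy under pressure, which is acceptable only when, as the post-mortem says, the logs are used for internal debugging.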
Report: "Video Encoding issues at AWS"
Last updateHallelujah! Media files are encoding effectively now. Join me in raising a glass to our team of coders who figured out several challenging problems today. Thank you all for your patience.
We are continuing to monitor for any further issues.
There's a new twist in today's media file encoding challenge. Dev team is investigating as fast as their fingers can take them. Thanks for your continued patience.
We've figured out a solution and things are catching up. This will be a permanent improvement going forward. That's the good news. Thank you so much for your patience!
Network issues at AWS are affecting video encoding.
Report: "Site is down"
Last updateWe're back up and systems are operational again. Thanks all.
We are continuing to investigate this issue.
We implemented a change that resulted in an outage. We are working to resolve the situation and expect to be back up soon. Thank you for your patience.
Report: "Video encoding is backlogged at the moment"
Last updateEverything's humming along nicely now. Thank you for your patience.
Looks like AWS is behaving again. We still have a little bit of a backlog, but everything is moving forward as it should.
Things are progressing, but the network issues are making it slow. We'll get through it, but we need your patience here.
It looks like a network issue with Amazon Web Services is making connections between our encoding servers and S3, where the video files are stored, very slow and unreliable for downloads and uploads.
Processing is stuck for a number of videos. We're working on getting it all cleared up as soon as we can.
Report: "Switching over Content Distribution Network"
Last updateAlmost everything's switched over now, and things seem to be working well.
We're switching over our Content Distribution Network. There may be breakage in the app while we do this, but we're monitoring closely. If you notice something, please let us know, but most likely we're already on it.