The Day that Aberdeen Cloud went Bye-bye


This is Tommy, calling from the engine room! Were you affected by the Aberdeen Cloud incident that happened on the 28-06-2016? We weren't, but I'd call that partly luck and partly proactivity. We were actually prepared for this.

TL;DR

We were already prepared for a scenario like the Aberdeen Cloud breakdown, owing to our disaster recovery plan. Fortunately we didn't have to set it in motion. Every night a simple script takes off-site backups of all of our hosted sites. We've made the source code available on GitHub, so hopefully this will help others prepare for the likes of the Aberdeen Cloud implosion, and perhaps we can share ideas on how to improve each other's disaster recovery plans.

Our experience with the Aberdeen Cloud incident

We have a large number of sites hosted by various cloud services. Since autumn 2013, we'd mainly used Aberdeen Cloud, and in autumn 2015 we started to explore other options to see what else the market had to offer. Platform.sh was the one that we decided to give a serious test for new clients.

Soon after that, Aberdeen Cloud began to seem a bit flaky. Response times on support tickets grew longer, and Solr services started to fail, along with random outages of various other services. We accepted that for a while, but after having lost and regained SSH access to all of our sites (including git and rsync), we eventually decided that enough was enough and we couldn't put our trust in them anymore.

We had to migrate everything to alternative hosting platforms. Given the similar price points and the fact that they seem well funded and offer excellent support, we decided on Platform.sh. And so began Project Exodus. 

Over the next three weeks we migrated over 20 sites to Platform.sh in a staged approach. I wouldn't say that it was a straightforward process. Lots of clients had specific quirks to their setup. For example, some needed a phpBB forum, others had FTP access for uploading files, some integrated with external systems that required firewall changes, and others had custom .htaccess redirection rules that needed to be rewritten for Nginx. However, we were very lucky and had completed Project Exodus nearly a month before Aberdeen Cloud finally came tumbling down.

So what if you were not as fortunate as us, and still have sites whose assets are no longer accessible? Stuck in cyberspace, or maybe just plain deleted?

Well, I'm not sure there is a lot you can do. Maybe read Code Enigma's blog post about the different things they've tried. However, to put it bluntly, it sucks! Enough said about the matter.

Now it's time to ensure you're never caught out again.

But what can we do to prevent this from happening again?

All companies probably have their own way to deal with this kind of scenario, but I'll tell you about how Annertech deals with backups and recoveries.

We have two sets of backups. The first is a daily backup of each production/live/master environment, which is hosted by the cloud service. The second is an off-site backup (again, daily) used for disaster recovery. The latter is the important one in this scenario.

The idea is that even if Amazon (which hosts Aberdeen Cloud services) pulled the plug, we would still have access to our clients' data.

We have a server, hosted by a different company, that: 

  1. Pulls down a copy of the database, from every cloud-hosted site, every night, and saves it for four days, before it deletes it again.
  2. Runs `rsync` on every cloud-hosted site, to get an up-to-date version of the files folder, every night.

Sounds simple? It is. All it requires is that your hosting partner supports running Drush on your remote sites and you're good. If you run Drupal sites, and your hosting partner doesn't support running Drush on the remote sites, find somebody else who does. It's that important!
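
To give a rough idea of what those two nightly steps can look like, here is a minimal Python sketch. It is not the code from our repository. The function names, the site name `example`, the alias `@example.live` and the remote path are all made up for illustration, and it assumes that `drush <alias> sql-dump` prints the dump to standard output and that plain `rsync` over SSH can reach the files folder.

```python
import gzip
import subprocess
import time
from pathlib import Path


def dump_command(alias):
    """Drush command that prints a SQL dump of the site behind `alias`."""
    return ["drush", alias, "sql-dump"]


def rsync_command(remote_files, local_dir):
    """rsync command that mirrors a remote files folder into local_dir."""
    return ["rsync", "-az", "--delete", remote_files, str(local_dir)]


def dump_path(backup_dir, site, now):
    """One dated dump file per site and day (UTC date in the file name)."""
    stamp = time.strftime("%Y-%m-%d", time.gmtime(now))
    return Path(backup_dir) / f"{site}-{stamp}.sql.gz"


def prune_old_dumps(backup_dir, keep_days=4, now=None):
    """Delete dumps whose modification time is older than keep_days."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400
    removed = []
    for path in sorted(Path(backup_dir).glob("*.sql.gz")):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed


def backup_site(site, alias, remote_files, backup_dir, run=subprocess.run):
    """Nightly job for one site: dump the database, sync files, prune."""
    backup_dir = Path(backup_dir)
    files_dir = backup_dir / "files" / site
    files_dir.mkdir(parents=True, exist_ok=True)
    # Step 1: pull the database down and store it compressed.
    sql = run(dump_command(alias), check=True, capture_output=True).stdout
    target = dump_path(backup_dir, site, time.time())
    with gzip.open(target, "wb") as handle:
        handle.write(sql)
    # Step 2: bring the local copy of the files folder up to date.
    run(rsync_command(remote_files, files_dir), check=True)
    # Housekeeping: drop database dumps older than four days.
    return target, prune_old_dumps(backup_dir)
```

A cron job on the off-site server could then call something like `backup_site("example", "@example.live", "user@host:files/", "/backups")` once per site. The `run` parameter is only there so the job can be tried out without touching a real site.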

Regarding the code for the sites, we keep our source code repositories on dedicated git services. And, more than likely, we'll also have a copy or two on developers' machines.

I'd like to show you the two backup scripts that I made, one for Aberdeen Cloud and one for Platform.sh.

The code is meant to work in our setup only, and is not (yet) generic enough to just work out of the box elsewhere. The release of the scripts is meant to give you a leg up and some inspiration. This is by no means the final end point for these scripts - we are continually evaluating and improving our system, and I look forward to hearing what ideas you have on where we could take it from here too.

The entire repository of code can be "stolen" from GitHub.

When you have a disaster recovery plan you also need to make sure that it actually works. You can do this by downloading the latest backup from each of your sites once a month, installing and then testing them. I once tested a site where the DB file was corrupt, but only for that site, so make sure that you test all of them. The setup of test sites can also be automated by a script so you don't have to set up 10, 50, or 300 sites and test each manually. Scripts are your friends. Make good use of them and have them do all the hard work.
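
As a cheap first check before a full restore, a script can at least confirm that every dump decompresses cleanly and looks like SQL. The sketch below assumes gzip-compressed dumps and uses a hypothetical helper name, `dump_looks_valid`. It will catch truncated or corrupt files, but it is no substitute for actually installing the site from the backup.

```python
import gzip
import zlib


def dump_looks_valid(path, marker=b"CREATE TABLE"):
    """Return True if the gzip file decompresses fully and contains marker."""
    found = False
    tail = b""
    try:
        with gzip.open(path, "rb") as handle:
            while True:
                chunk = handle.read(1024 * 1024)
                if not chunk:
                    break
                # Include the end of the previous chunk so that a marker
                # split across two reads is still found.
                if marker in tail + chunk:
                    found = True
                tail = chunk[-len(marker):]
    except (OSError, EOFError, zlib.error):
        # Not a gzip file, truncated, or corrupt compressed data.
        return False
    return found
```

Looping over every file in the backup folder and listing the ones where this returns `False` gives a quick overview of which sites need a closer look.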

Now, if you really want to push this further, you should implement a "Smoke test" in all of your installation profiles, so that you can trigger that to see if the site is alive; or perhaps tie it in with a Jenkins server.
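
A smoke test inside an installation profile is Drupal-specific and beyond a short example, but the simplest "is the site alive" check is just an HTTP request. The following Python sketch is that simpler stand-in, not a profile-level smoke test. The names `site_is_alive` and `failing_sites` are made up for illustration, and it assumes a restored site should answer its front page with HTTP 200.

```python
import http.client
import urllib.request


def site_is_alive(url, timeout=10):
    """Return True if url answers with HTTP 200 within timeout seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts and HTTP error responses.
        return False


def failing_sites(urls, check=site_is_alive):
    """Return the URLs from `urls` that did not pass the check."""
    return [url for url in urls if not check(url)]
```

A Jenkins job could run `failing_sites` over the list of restored test sites and mark the build as failed if the returned list is not empty.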

If something is unclear, feel free to put a comment below. If you feel like this could be improved, feel free to contribute with a pull request. We are all ears.

 

Want to talk to us about hosting your Drupal site? Great - just get in touch with us on 01 524 0312 or email hello@annertech.com and we'll see what we can do to help.

Comments

Yes, I completely agree. We do the same. It's ok to use cloud provider backup, but you need another of your own for disaster recovery. Sadly we weren't really serious about using Aberdeen so we did not implement it for them. We did have backups, but not automatic. This means they weren't as frequent as we would have needed given the total meltdown on Aberdeen. But given we didn't trust them enough we only had one site with them. So it didn't take long to rebuild what we lost. What can we say, lesson learned. Be serious about backups, even if you are trying the service!


I think the "hosted by a different company" bit is something that people often miss. I've seen disaster recovery plans that insist on fully duplicated servers in multiple data centres that are a certain number of miles apart to protect against scenarios like nuclear war, but then get both locations from the same supplier. All it takes is for that supplier to go out of business, or someone at that supplier to decide to turn off your services, and that disaster recovery plan is useless.

A few years ago I worked with a company whose whole multi-million pound business was conducted through their website. They missed a £10 payment for a vanity domain name and as a consequence triggered a bad debtor process with the supplier, who then shut down their website and therefore their business. Because of the dependence on a single supplier there was no way of getting the business back online without dealing with this supplier's bureaucracy, which took a few hours.


Thanks for this post. We experienced the same dilemma with one project. Sure, it happens that services or complete businesses close. But the way they do it matters. People trust in services more and more and cloud service providers should take responsibility and not just kill the service without caring about their users (at least I didn't receive any notification, did you?). Acquia's Drupalgardens will shut down, too. But they announce it publicly, give a reason and explain how to handle the migration.
We also use Acquia, Platform.sh and Amazee.io in our projects and will give them our trust in the future.


@Manuel Pistner, no we didn't receive any notification from them until 2 days after people's sites went offline, and nearly a month after we had stopped using their service. To be honest the signs were there - no responses to support tickets, services randomly stopped working (e.g. solr), ssh and git disappeared once, the UI console started giving errors, ... the list goes on. Every hosting company has their share of issues, but it's how they react and handle them that matters. In this case, there was often no response from them at all - well maybe after much chasing you might get a response a few weeks later, they definitely didn't live up to their 24/7 support claim.

It's a shame, because they had an excellent service at the start, with a console UI and reports that surpassed the others I've used, and even had a LiveChat service. The LiveChat disappearing was the first early indication that maybe things weren't going well, but it was quite a few months after that before the real issues started happening.

