Don’t put all your eggs (servers/backups) in one basket (cloud)

Just like you diversify income sources, diversify your cloud infrastructure providers!

Intro

We are always taught that we “should not put all our eggs in one basket” (🥚🥚🥚 + 🧺 = ☠) to avoid the risk of losing everything at once. The rule applies to many areas: skills, investing, income sources, and so on. We always need to diversify.

Should we also apply the rule and diversify when it comes to cloud providers? Let’s find out.

Photo by Autumn Mott Rodeheaver on Unsplash

Story

Last December, on Christmas Day (disasters seem to happen on holidays, when everyone is off 🧐), one of our clients had their AWS account suspended. Because of the suspension, whose cause we did not know at first, their production servers, databases and storage stopped completely. Connections to the servers and the databases timed out; nothing could be reached.

Quick Background

They were using the compute service (EC2) for multiple load-balanced servers, a central caching server, the relational database service (RDS) as a central database serving all applications and the storage service (S3) as a CDN plus an object store for everything else. Luckily, the DNS was not managed by Route53, which gave us some hope of restoring backups on another cloud until the issue was resolved…

More Digging…

We wanted to dig deeper into the AWS account suspension to see why it happened and whether it could be resolved quickly to get everything up and running again. While checking the account billing (the only thing you can do with a suspended AWS account), we noticed heavy usage of very large Windows instances that had incurred tremendous charges we knew nothing about.

The server instances that we saw on the bill were the most powerful ones to date (Windows running on p3dn.24xlarge). These had actually just been unveiled by Amazon the same month:

“p3dn.24xlarge has 2.5 GHz (base) and 3.1 GHz (sustained all-core turbo) Intel Xeon P-8175M processors and supports Intel AVX-512.”

Amazon states the following use cases for these machines:

“Machine/Deep learning, high performance computing, computational fluid dynamics, computational finance, seismic analysis, speech recognition, autonomous vehicles, drug discovery.”

The instances had run for a couple of days on the client’s AWS account before the suspension. What the client knows for sure is that neither they nor anyone with authorized access to the account launched these instances. That leaves two possible scenarios:

The AWS account was hacked and someone created these server instances
It could be, though it is unlikely, a billing error where AWS mistakenly added these charges to the bill

Unfortunately, solving the problem was taking some time, so it made sense to pursue more than one course of action at the same time.

Temporary Solution / Hope

The client has always had regular file and database backups 👍 taken hourly, daily and weekly. We concluded that it was time to temporarily deploy all servers and databases on another cloud provider from the most recent backups.

It all looked good until we realized that all backups were stored on Amazon S3 😱. That was the exact moment the last hope vanished: we could not even restore the backups because S3 was suspended too. We learned the hard way that we should apply this saying:

“Don’t put all your eggs in one basket”

Lesson

Having regular backups is just not enough; you are still not safe!

You need to have regular backups stored in more than one cloud basket, since a single cloud account can simply disappear for one reason or another.

Let’s look at some of the possible options and whether they would have been sufficient for a quick recovery:

❌ Raw backups stored on a single provider (S3) only: insufficient if the account is suspended
❌ Full server snapshots/images at your cloud provider: insufficient if the account is suspended
✅ Parallel cloud (servers, database, CDN, storage) running on another provider in addition to AWS (or on stand-by): more expensive and higher overhead, but mostly sufficient
✅ Raw backups stored on multiple storage providers, say S3 and another storage (Google Cloud Storage, DigitalOcean Spaces, etc…): sufficient for restoring application files and databases in case one account is suspended
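
To make the last option concrete, here is a minimal sketch of replicating one backup file to several destinations and verifying each copy by checksum. It is only an illustration under assumptions: local directories stand in for the storage providers, and the function names and destination labels are hypothetical. A real setup would swap the copy step for each provider’s own upload tooling.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_backup(backup: Path, destinations: dict[str, Path]) -> dict[str, bool]:
    """Copy one backup file to every destination and verify each copy.

    `destinations` maps a hypothetical label to a directory. Local
    directories stand in for provider buckets in this sketch.
    Returns label -> True when the copy's checksum matches the source.
    """
    expected = sha256_of(backup)
    results = {}
    for label, directory in destinations.items():
        try:
            directory.mkdir(parents=True, exist_ok=True)
            copy = directory / backup.name
            shutil.copy2(backup, copy)
            results[label] = sha256_of(copy) == expected
        except OSError:
            # A failing destination must not block the remaining ones.
            results[label] = False
    return results
```

The point of returning a result per destination is that one unreachable basket does not stop the other copies from being written, and you can alert on any label that comes back as False.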

Tips

Expect the worst — be prepared
Enable two-factor authentication for all your accounts
Deploy your web infrastructure on more than one cloud if possible
Always have regular backups of your application files, databases and static resources (assets, user content and uploads) stored on more than one storage provider — or at least on a different provider than your cloud infrastructure
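
The last tip can be turned into a quick sanity check. The sketch below is an assumption-laden illustration, not a real tool: the function name, the provider labels and the warning texts are all made up for this example. It simply flags a backup plan where every basket belongs to the provider that hosts the infrastructure.

```python
def backup_plan_warnings(infra_provider: str, backup_providers: list[str]) -> list[str]:
    """Return warnings for a backup plan; an empty list means no concerns.

    Provider names are free-form labels (hypothetical, e.g. "aws", "gcs")
    and are compared case-insensitively.
    """
    infra = infra_provider.strip().lower()
    providers = {p.strip().lower() for p in backup_providers if p.strip()}

    if not providers:
        return ["no backup storage configured"]
    if providers == {infra}:
        # The situation from the story: servers and backups in one account.
        return ["all backups are stored with the infrastructure provider"]
    if len(providers) == 1:
        # Acceptable as a minimum, but still a single basket for backups.
        return ["backups are stored with a single provider"]
    return []
```

For the client in this story, `backup_plan_warnings("aws", ["aws"])` would have returned the second warning, which is exactly the trap we fell into.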

If setting up backups for every project is too much manual work (and indeed it can be), then give SimpleBackups a try.

SimpleBackups makes it a breeze to schedule automated backups of all your website files and databases in a simple dashboard. You will get alerts if any of your backups fail and you can store your backups on different storage providers like AWS S3, Google Cloud Storage, DigitalOcean Spaces and more.