Sorry! Whoops, and it won’t happen again

This morning Verrotech suffered a brief but catastrophic outage which affected the majority of our services and customers. From the time we became fully aware of the problem to resolution was around eleven minutes, but the actual length of service interruption (the time before we were aware, plus the time for caches to clear and customers to come back on) was probably as long as thirty or forty minutes.

First things first – Sorry. This was entirely unacceptable and, worse, was within our control and entirely our fault.

A fundamental service underpinning all others is DNS. This is the thing that turns an address like “blog.verrotech.com” into some numbers (an IP address) which actually allow the client computer (your PC) to find the server computer (our server) and get the web page. Because DNS is so important we operate more than one DNS server, in different data centres and on different server stacks (this is called redundancy), so if one fails or can’t be reached for whatever reason, everything carries on. That, at least, is the theory.
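
For the curious, here’s roughly what that lookup looks like from a program’s point of view. This is just an illustrative sketch in Python using the standard library; swap in any hostname you like, and the addresses you see will depend on which server answers you:

    import socket

    # Ask the system resolver (which in turn asks a DNS server) to turn a
    # hostname into the IP addresses a browser would actually connect to.
    hostname = "blog.verrotech.com"

    for family, _, _, _, sockaddr in socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP):
        print(hostname, "resolves to", sockaddr[0])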

One potential downside of having multiple DNS servers (Verrotech run four) is updates – instead of having to make an update in one place you would have to make it four times, increasing not only the workload but also the chances of an error. Luckily the very clever people who built DNS thought of this, and it has a concept of master and slave – you have one master, and any changes made there are passed on to the slaves automatically. Still with me? Ok!

Using this system means we have one master DNS server and three slaves in our main Verrotech.com configuration.
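
For the technically minded, here’s a rough sketch of how you can check that the slaves have picked up the master’s latest changes. Every zone carries a serial number in its SOA record which the master bumps on each change, so if all four servers report the same serial they’re in sync. The sketch uses the third-party dnspython library, and the IP addresses are documentation placeholders rather than our real servers:

    import dns.message
    import dns.query

    ZONE = "verrotech.com"

    # Placeholder addresses for the master and three slaves (not our real IPs).
    SERVERS = {
        "master":  "192.0.2.1",
        "slave-1": "192.0.2.2",
        "slave-2": "192.0.2.3",
        "slave-3": "192.0.2.4",
    }

    for name, ip in SERVERS.items():
        query = dns.message.make_query(ZONE, "SOA")
        response = dns.query.udp(query, ip, timeout=2.0)
        serial = response.answer[0][0].serial  # the zone serial from the SOA record
        print(f"{name:8} {ip:15} serial={serial}")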

This morning our master DNS server failed. This triggered an alert, but a low-priority one – with no changes outstanding it simply meant a quarter of our DNS servers was unavailable, about a quarter of lookups would have to be retried, and no real impact would be noticed by anyone. Wrong.

One of the bits you can configure when setting up masters and slaves is an expiry time – if the slaves lose touch with the master, how long should they continue to operate before assuming their information is out of date and stopping? Sensibly this is a long time, a couple of weeks maybe, so that if there’s a problem with the master you can go ahead and fix it while everything continues to operate normally.

However, in our case this expiry was set to 3600 seconds – just one hour.
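
That expiry is one of the timers published in the zone’s SOA record, so anyone can read it. A quick sketch, again using the dnspython library, of how you might check it for a zone (all values are in seconds):

    import dns.resolver

    # Fetch the zone's SOA record and print its timers.
    soa = dns.resolver.resolve("verrotech.com", "SOA")[0]

    print("refresh:", soa.refresh)  # how often slaves check the master for changes
    print("retry:  ", soa.retry)    # how soon to retry after a failed check
    print("expire: ", soa.expire)   # how long slaves keep answering if the master stays unreachable
                                    #   3600 = one hour; 2419200 = four weeks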

As a result, an hour after the master failed and triggered its low-priority alarm, the slaves stopped working as well, effectively dropping Verrotech offline. This triggered alarms from all our monitoring services, but most of those couldn’t notify anyone because it turns out they too rely on DNS to work properly.

Catastrophe.

Of course our engineers realised quite quickly that there was a major problem (we reckon five to ten minutes after the expiry) and brought the master back online. That in turn told the slaves it was back, and everything came back online, albeit a little slowly at first as backlogs of email and other requests cleared.

So a minor problem cascaded into a major one.

Lessons learnt? Many. The slave expiry is now four weeks, which should give us ample time to deal with any master failure, and we’ll look at how our monitoring can be updated to avoid any reliance on DNS at all.
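
As a flavour of what DNS-free monitoring might look like: the sketch below checks a handful of services by hard-coded IP address and port, so the check itself never needs a DNS lookup to tell us something is wrong. The addresses are placeholders rather than our real infrastructure, and a real monitor would obviously page someone instead of printing:

    import socket

    # Endpoints to check by IP address only, so this monitor keeps working
    # even when DNS itself is the thing that has fallen over.
    # (Documentation placeholder addresses, not our real infrastructure.)
    ENDPOINTS = [
        ("dns-master",  "192.0.2.1",  53),
        ("dns-slave-1", "192.0.2.2",  53),
        ("web-stavros", "192.0.2.10", 443),
        ("web-hektor",  "192.0.2.11", 443),
    ]

    for name, ip, port in ENDPOINTS:
        try:
            with socket.create_connection((ip, port), timeout=3):
                print(f"OK    {name} ({ip}:{port})")
        except OSError as exc:
            # A real monitor would raise an alert here, through a channel
            # that also doesn't depend on DNS.
            print(f"DOWN  {name} ({ip}:{port}): {exc}")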

Ultimately we learnt a valuable lesson and no mail or data was lost, but there was definitely an unacceptable loss of connectivity. Again – to any customers affected – sorry.

We will continue to build resilience into our network and work to deliver you the excellent service you’ve come to know and expect from Verrotech.

 

Why we tweet status updates

Someone with a lot of time on their hands might wonder why, when we are an Internet hosting provider, we disseminate our status updates on Twitter. Aren’t we effectively using (and advertising) a competitor?

Well, there’s a good reason we use Twitter, and no, we’re not in competition with them (which is just as well, as we’d probably be out of business if we were).

We use Twitter because it’s an off-network, independent service, i.e. its availability is unrelated to our network or services in any way. This means that, on the rare occasions when stuff gets real, our update channel for customers is unaffected by whatever has caught fire at our end. No matter how bad the outage with Verrotech services, we can keep everyone abreast of the situation, estimated repair times and so on, or even engage directly with customers via DMs.

But what if Twitter fails?

Twitter is of course outside our control, but they have pretty good availability (not Verrotech-good of course, but pretty good). And for one of their failures to be a problem, it would have to coincide with a major Verrotech outage as well, which is very unlikely.

To mitigate this anyway, for status news we use what we call the availability update tripod of power. That is, our availability news and customer updates have three strong legs.

The latest Verrotech infrastructure (VETS), which we migrated all services onto in January 2016, is built around high-availability virtualisation. The customer-facing/Internet-facing portion of this is segregated into two main stacks (aka platforms): Stavros and Hektor. These are physically separate platforms in different data centres, with separate redundant network links.

These two platforms make up two of the legs: as standard, the main website (and blog etc.) is hosted on one and the status.verrotech.com page on the other. Thanks to the wonders of our setup we can actually migrate from one to the other in less than a minute, but even if that failed, one of those sites would still be available and pumping out news.

This is combined with Twitter (the third leg), meaning that if you have the two URLs bookmarked and follow @verrotechstatus on Twitter, you should never be out of the loop.
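
If you’d like to script a check on those legs yourself, here’s a small sketch that polls the two sites we host over HTTPS (status.verrotech.com is the address mentioned above; www.verrotech.com is assumed here for illustration) and leaves the third leg, Twitter, to your client of choice:

    import urllib.request

    # The two "legs" we host ourselves; the third leg is @verrotechstatus on Twitter.
    # (www.verrotech.com is assumed here for illustration.)
    LEGS = [
        "https://www.verrotech.com/",
        "https://status.verrotech.com/",
    ]

    for url in LEGS:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                print(f"{url} -> HTTP {response.status}")
        except OSError as exc:
            print(f"{url} -> unreachable ({exc})")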