This morning Verrotech suffered a brief but catastrophic outage which affected the majority of our services and customers. Around eleven minutes passed between our becoming fully aware of the problem and resolving it, but the actual length of the service interruption (including the time before we were aware, and the time for caches to clear and customers to be back on) was probably as long as thirty or forty minutes.
First things first – Sorry. This was entirely unacceptable and, worse, was within our control and entirely our fault.
A fundamental service underpinning all others is DNS. This is the thing that turns an address like “blog.verrotech.com” into some numbers (an IP address) which actually allow the client computer (your PC) to find the server computer (our server) and get the web page. Because DNS is so important we operate more than one DNS server, in different data centres and on different server stacks (this is called redundancy), so if one fails or can’t be reached for whatever reason, everything carries on. That at least is the theory.
One potential downside of having multiple DNS servers (Verrotech run four primary servers) is updates – instead of having to make an update in one place you would have to do it four times, increasing not only workload but also the chances of an error. Luckily the very clever people who built DNS thought of this and gave it a concept of master and slave – you can have one master, and any changes made there are passed on to the slaves. Still with me? Ok!
Using this system means we have one master DNS server and three slaves in our main Verrotech.com configuration.
This morning our master DNS server failed. This triggered an alert, but only a low-priority one – with no changes outstanding it simply meant 1/4 of our DNS servers was unavailable, about 1/4 of lookups would have to be repeated and no real impact would be noticed by anyone. Wrong.
One of the bits you can configure when setting up masters and slaves is an expiry time – if the slaves lose touch with the master, how long should they continue to operate before assuming their information is out of date and stopping? Sensibly this is a long time, a couple of weeks maybe, so that if there’s a problem with the master you can go ahead and fix it while everything continues to operate normally.
However, in our case this expiry was set to 3600 seconds, which is only one hour.
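For the more technically minded, the effect of that setting can be sketched in a few lines of Python. This is a simplified model rather than real DNS server code: the `Secondary` class and its field names are made up for illustration, and it assumes the slave's last successful contact with the master was at the moment the master failed (time zero).

```python
from dataclasses import dataclass

ONE_HOUR = 3600                   # the expiry we actually had configured
TWO_WEEKS = 14 * 24 * 3600        # "a couple of weeks", in seconds


@dataclass
class Secondary:
    """Hypothetical model of a slave DNS server for one zone."""
    expire_seconds: int           # expiry time configured for the zone
    last_refresh: float = 0.0     # last successful contact with the master

    def is_serving(self, now: float) -> bool:
        # The slave keeps answering only while its copy of the zone is
        # younger than the configured expiry time.
        return (now - self.last_refresh) < self.expire_seconds


short = Secondary(expire_seconds=ONE_HOUR)
sensible = Secondary(expire_seconds=TWO_WEEKS)

# One hour after the master disappears:
print(short.is_serving(now=3600))     # slave with the one-hour expiry
print(sensible.is_serving(now=3600))  # slave with a two-week expiry
```

With the one-hour value the slave gives up exactly an hour after losing the master, while the two-week value would have left plenty of time for a repair.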
As a result, an hour after the master failed and triggered a low-priority alarm, the slaves stopped working as well, effectively dropping Verrotech offline. This triggered alarms from all our monitoring services, but most of those couldn’t notify anyone because it turns out they too rely on DNS to work properly.
Catastrophe.
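One common way to stop alerting from depending on DNS is to keep a known-good IP address to fall back on when a lookup fails. The sketch below illustrates the idea only; it is not our monitoring code, and the function name, the hostname `alerts.example.com` and the address `192.0.2.10` (a documentation-only address) are all placeholders.

```python
import socket


def resolve_with_fallback(hostname, pinned_ip, resolver=socket.gethostbyname):
    """Return an IP for hostname, or pinned_ip if the DNS lookup fails.

    `resolver` is injectable so the behaviour can be exercised without
    a real network; by default it is the standard library lookup.
    """
    try:
        return resolver(hostname)
    except OSError:
        # socket.gaierror (lookup failure) is a subclass of OSError.
        return pinned_ip


def broken_dns(hostname):
    # Stand-in for a resolver during a DNS outage.
    raise socket.gaierror("simulated DNS outage")


# During an outage the alert can still be sent to the pinned address.
target = resolve_with_fallback("alerts.example.com", "192.0.2.10",
                               resolver=broken_dns)
print(target)
```

The trade-off is that a pinned address has to be kept up to date by hand, which is exactly the chore DNS exists to remove, so it only makes sense for a small number of critical paths such as alert delivery.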
Of course our engineers realised quite quickly there was a major problem (we reckon five to ten minutes after the expiry) and promptly brought the master back online. The master in turn notified the slaves it was back and everything came back online, albeit initially a bit slowly as backlogs of emails and other requests cleared.
So a minor problem cascaded into a major one.
Lessons learnt? Many. The slave expiry is now four weeks which should allow any master failure to be dealt with, and we’ll look at how monitoring can be updated to avoid any reliance on DNS at all.
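A simple automated check could catch a too-short expiry before it bites. The sketch below assumes the standard layout of a zone's SOA record (primary name server, contact, serial, refresh, retry, expire, minimum) with every timer written in plain seconds. The record strings, the two-week threshold and the function names are illustrative assumptions, not our real zone data or tooling.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600
FOUR_WEEKS = 4 * SECONDS_PER_WEEK     # 2,419,200 seconds


def soa_expire(rdata: str) -> int:
    """Extract the expire field from SOA record data given as text."""
    fields = rdata.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 SOA fields, got {len(fields)}")
    # Field order: mname rname serial refresh retry expire minimum
    return int(fields[5])


def expire_is_safe(rdata: str, minimum: int = 2 * SECONDS_PER_WEEK) -> bool:
    """True if slaves would keep serving for at least `minimum` seconds."""
    return soa_expire(rdata) >= minimum


# Hypothetical records: only the expire values (3600 and 2419200)
# reflect the figures in this post; the rest is made up.
old = "ns1.example.com. hostmaster.example.com. 2024010101 7200 900 3600 3600"
new = "ns1.example.com. hostmaster.example.com. 2024010102 7200 900 2419200 3600"

for label, record in (("old", old), ("new", new)):
    print(label, soa_expire(record), expire_is_safe(record))
```

Run against every zone as part of routine checks, something along these lines would flag the one-hour value long before a master failure exposed it.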
Ultimately we learnt a valuable lesson and no mail or data was lost, but there was definitely an unacceptable loss of connectivity. Again – to any customers affected – sorry.
We will continue to build resilience into our network and work to deliver you the excellent service you’ve come to know and expect from Verrotech.