Verrotech

Sorry! Whoops and it won’t happen again

This morning Verrotech suffered a brief but catastrophic outage which affected the majority of our services and customers. The time from when we became fully aware of the problem until resolution was around eleven minutes, but the actual length of the service interruption (including the time before we were aware and the time for caches to clear and customers to get back on) was probably as long as thirty or forty minutes.

We are aware of a major issue affecting all services – please bear with us, further update shortly.

— Verrotech Devops (@verrotechstatus) August 25, 2017

First things first – Sorry. This was entirely unacceptable and, worse, was within our control and entirely our fault.

A fundamental service underpinning all others is DNS. This is the thing that turns an address like “blog.verrotech.com” into some numbers (an IP address) which actually allow the client computer (your PC) to find the server computer (our server) and get the web page. Because DNS is so important, we operate more than one DNS server, in different data centres and on different server stacks (this is called redundancy), so if one fails or can’t be reached for whatever reason, everything carries on. That at least is the theory.
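For the curious, the redundancy idea can be sketched in a few lines of Python. This is a toy model rather than how our servers actually work: each "server" is just a function, the names `make_server` and `resolve` are made up for illustration, and 192.0.2.10 is a placeholder address from the documentation range, not our real IP.

```python
from typing import Callable, Dict, List, Optional

# A "server" is modelled as a function from a name to an IP address.
# It returns None when it cannot answer (a stand-in for a timeout).
Resolver = Callable[[str], Optional[str]]


def make_server(records: Dict[str, str], up: bool = True) -> Resolver:
    """Build a pretend DNS server holding `records`; `up=False` simulates a failure."""
    def answer(name: str) -> Optional[str]:
        return records.get(name) if up else None
    return answer


def resolve(name: str, servers: List[Resolver]) -> Optional[str]:
    """Ask each server in turn and return the first answer received."""
    for server in servers:
        ip = server(name)
        if ip is not None:
            return ip
    return None  # every server failed to answer


# Placeholder record: the address is illustrative only.
records = {"blog.verrotech.com": "192.0.2.10"}
servers = [make_server(records, up=False), make_server(records)]
print(resolve("blog.verrotech.com", servers))
```

With the first server down, the lookup is simply answered by the second one. Only when every server in the list is silent does the lookup fail.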

One potential downside of having multiple DNS servers (Verrotech run four primary servers) is updates – instead of having to make an update in one place you would have to do it four times, increasing not only workload but also the chances of an error. Luckily DNS (built by very clever people) has thought of this and has a concept of master and slave – you can have one master, and any changes made there are passed on to the slaves. Still with me? Ok!

Using this system means we have one master DNS server and three slaves in our main Verrotech.com configuration.

This morning our master DNS server failed. This triggered an alert but of a low priority – with no changes outstanding it simply meant 1/4 of our DNS servers was unavailable, about 1/4 of lookups would have to be repeated and no real impact would be noticed by anyone. Wrong.

One of the bits you can configure when setting up masters and slaves is an expiry time – if the slaves lose touch with the master, how long should they continue to operate before assuming their information is out of date and stopping? Sensibly this is a long time, a couple of weeks maybe, so that if there’s a problem with the master you can go ahead and fix it while everything continues to operate normally.

However, in our case this expiry was set to 3600 seconds, which is only one hour.

As a result, an hour after the master failed and triggered a low-priority alarm, the slaves stopped working as well, effectively dropping Verrotech offline. This triggered alarms from all our monitoring services, but most of those couldn’t notify anyone because, it turns out, they too rely on DNS to work properly.
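To make the failure mode concrete, here is a minimal sketch of the expiry rule in Python. It is a simplified model, not real DNS server code: the `SecondaryZone` class and its field names are invented for illustration, and time is just a count of seconds since the slave last heard from the master.

```python
from dataclasses import dataclass


@dataclass
class SecondaryZone:
    """Toy model of a slave's copy of a zone (hypothetical names throughout)."""
    expire_seconds: int          # how long the copy stays valid without the master
    last_master_contact: float   # time of the last successful contact, in seconds

    def is_serving(self, now: float) -> bool:
        # Once the expiry has elapsed, the slave treats its data as stale
        # and stops answering for the zone.
        return (now - self.last_master_contact) < self.expire_seconds


# 3600 seconds is the expiry value we had configured.
zone = SecondaryZone(expire_seconds=3600, last_master_contact=0.0)
print(zone.is_serving(now=1800))   # half an hour after the master vanished
print(zone.is_serving(now=3600))   # one hour after: the expiry has been reached
```

The slave keeps answering happily at the half-hour mark and goes quiet at the hour mark, even though nothing about the slave itself has changed. A longer expiry simply moves that cutoff further out.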

Catastrophe.

Of course our engineers realised quite quickly there was a major problem (we reckon five to ten minutes after the expiry) and quickly brought the master back online. This in turn notified the slaves it was back, and everything came back online, albeit initially a bit slowly as backlogs of emails and other requests cleared.

We believe the problem is resolved. Was with DNS. Mail is now pouring in so may be slower than usual. Also… (1/2)

— Verrotech Devops (@verrotechstatus) August 25, 2017

(2/2)… your connections may still fail and say "address not found" or similar for a few minutes while caches refresh.

— Verrotech Devops (@verrotechstatus) August 25, 2017

So a minor problem cascades into a major problem.

Lessons learnt? Many. The slave expiry is now four weeks, which should allow any master failure to be dealt with, and we’ll look at how monitoring can be updated to avoid any reliance on DNS at all.
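As an illustration of what DNS-free monitoring might look like, here is a minimal Python sketch of a reachability check that only accepts literal IP addresses. It is one possible approach, not a description of our monitoring stack: the function name `check_by_ip` is hypothetical, and a plain TCP connect is assumed to be a good enough health signal.

```python
import ipaddress
import socket


def check_by_ip(address: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to address:port succeeds.

    Raises ValueError if `address` is a hostname rather than a literal
    IP address, so that a DNS lookup can never sneak into the check.
    """
    ipaddress.ip_address(address)  # raises ValueError for anything but an IP literal
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out or unreachable: treat all of these as "down".
        return False
```

The same principle would have to apply to the notification path as well: an alert that is detected without DNS but delivered through a hostname is still stuck when DNS is down.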

Ultimately we learnt a valuable lesson and no mail or data was lost, but there was definitely an unacceptable loss of connectivity. Again – to any customers affected – sorry.

We will continue to build resilience into our network and work to deliver you the excellent service you’ve come to know and expect from Verrotech.

Bleeding Edge Hosting

Verrotech are now offering Fedora 24 hosting with PHP 5.6 and other newer versions – ask to have your hosting migrated or a development/secondary account created on a Fedora 24 box.

Here at Verrotech we love Linux. We use it day in day out and for just about everything – including all of our business critical and hosting services.

Linux comes in a lot of different flavours (distributions) and over the years Verrotech (and Verrotech people) have used most of them at one time or another, including quite a few that have died out or morphed beyond recognition (the distributions, not the people). Each distribution has its own strengths and weaknesses, and truly it’s never one size fits all.

Quite a few years back Verrotech settled on using the RedHat stack for its actual servers. I don’t want to get into the near-endless distribution wars over whether this is better for X or that has feature Y; we simply had the opportunity to standardise, and our engineers had the most experience running mission critical services on RedHat. We still use other distributions in the organisation (for example, I have an Ubuntu desktop), but we standardised our server platforms.

Later still RedHat split their Linux into two streams: RedHat Enterprise Linux (RHEL, with a community release called CentOS [n.b. this is an over-simplification for brevity]) and Fedora.

RHEL/CentOS was an enterprise grade operating system: very stable, highly tested before release and subject to no major changes that might break things already running. Fedora was offered as “bleeding edge technology” (latest versions and more patches, but less testing).

Logically we went down the CentOS route: we want our servers to remain running for long periods and to not break hosting, while remaining secure. As of 2016 we run a fleet of CentOS 6 and CentOS 7 boxes, with the last CentOS 5 decommissioned. No doubt in time we’ll go through a process of migrating to CentOS 8 when it comes out, and eventually decommissioning CentOS 6 as support comes to an end.

So for a long time now our “standard” hosting platform has been CentOS and this has been fine… apart from the occasional issue with versions of software. For example the current PHP version on CentOS 7 is 5.4, which is ok for 95% of everything, but some newer apps such as Drupal 8 want to play with 5.6 or above. CentOS 7 doesn’t offer this out of the box, and although it is possible to install a later PHP on CentOS 7 it may not play well, and hence is not suitable for shared hosting environments like ours.

Having ummed and ahhed about this we have a solution… as of now we’re provisioning our shared hosting servers with a mix of CentOS 7 and Fedora 24. By default, hosting will still be on the ultra-stable CentOS 7, but you can request for all/some/test accounts to be hosted on Fedora 24.
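If you are wondering which platform an application needs, the decision boils down to a version comparison. Here is a rough Python sketch, assuming only the stock PHP versions mentioned in this post (5.4 on CentOS 7, 5.6 on Fedora 24); the table and function names are hypothetical and not part of any tool we ship.

```python
from typing import Dict, List, Tuple

# Stock PHP versions as described in this post (illustrative table only).
STOCK_PHP: Dict[str, Tuple[int, ...]] = {
    "CentOS 7": (5, 4),
    "Fedora 24": (5, 6),
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as '5.6' into a comparable tuple (5, 6)."""
    return tuple(int(part) for part in text.split("."))


def platforms_for(minimum_php: str) -> List[str]:
    """List the platforms whose stock PHP meets the application's minimum."""
    needed = parse_version(minimum_php)
    return [name for name, stock in STOCK_PHP.items() if stock >= needed]


print(platforms_for("5.6"))  # an app with Drupal 8 style requirements
```

Comparing tuples of integers rather than raw strings avoids the classic trap where "5.10" sorts before "5.6" alphabetically.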

Having tested this we don’t think it will be any less stable than CentOS but there is always the risk that faster updates will leave software behind – so we recommend you only host items which are subject to regular updates themselves to cope with new versions. We also have plans to start provisioning Fedora-based PHP 7 hosting environments before too long.

If you want to try out Fedora hosting just let your account manager know. We’re happy to provision hosting customers a test environment on one of the Fedora servers to check out the features.

Why we tweet status updates

Someone with a lot of time on their hands might wonder why, when we are an Internet hosting provider, we disseminate our status updates on Twitter. Aren’t we effectively using (and advertising) a competitor?

Well there’s a good reason we use Twitter and no, we’re not in competition with Twitter (which is good as we’d probably be out of business if we were).

We use Twitter because it’s an off-network independent service, i.e. its availability is unrelated to our network or services in any way. This means that, on the rare occasions when stuff gets real, our update channel for customers is unaffected by whatever has caught fire at our end. No matter how bad the outage with Verrotech services, we can keep everyone abreast of the situation and estimated repair times etc., or even engage directly with customers via DMs.

But what if Twitter fails?

Twitter is of course outside our control, but they have pretty good availability (not Verrotech-good of course, but pretty good). For a Twitter failure to be a problem, it would have to coincide with a major Verrotech outage as well, which is very unlikely.
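A back-of-the-envelope calculation shows why a coincidence like that is unlikely. The Python sketch below assumes the channels fail independently of each other, and the availability figures are purely illustrative guesses, not measured numbers for Twitter or for Verrotech; `prob_all_down` is a made-up helper name.

```python
from typing import Iterable


def prob_all_down(availabilities: Iterable[float]) -> float:
    """Probability that every channel is down at the same moment.

    Assumes failures are independent, so the individual downtime
    probabilities can simply be multiplied together.
    """
    result = 1.0
    for availability in availabilities:
        result *= 1.0 - availability
    return result


# Illustrative figures only: one channel up 99% of the time, the other 99.9%.
print(prob_all_down([0.99, 0.999]))
```

With those made-up figures both channels are down together roughly one moment in 100,000. The independence assumption is doing a lot of work here: anything the channels share, such as a common DNS setup, makes the real figure worse.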

To mitigate this anyway, for status news we use what we call the availability update tripod of power. That is, our availability news and customer updates have three strong legs.

The latest Verrotech infrastructure (VETS), which we migrated all services onto in January 2016, is built around high-availability virtualisation. The customer facing/Internet facing portion of this is segregated into two main stacks (aka platforms): Stavros and Hektor. These are physically separated platforms in different data centres and with different redundant network links.

These two platforms make up two of the legs. As standard, the main website (and blog etc.) will be hosted on one and the status.verrotech.com page on the other. Thanks to the wonders of our setup we can actually migrate one to the other in less than a minute, but even if that failed, one of those sites would be available and pumping out news.

This is combined with Twitter (the third leg) to mean if you have two URLs bookmarked and follow @verrotechstatus on Twitter, you should never be out of the loop.

New look, new blog, new sites

As anyone that navigated our terrible old website could tell you, here at Verrotech we don’t spend much time on branding and visuals. Rather than making lovely marketing pages, we’re all about making our servers sing and dish up your lovely content 24×7.

Enough was enough, however, and with a little free time now, thanks to our VETS infrastructure running itself, we did a little refresh.

New-look website: http://www.verrotech.com/

New-look status page: http://status.verrotech.com/

And now, hold the presses, a new-look news feed or blog!