— 20 August 2015
Perhaps ‘bad’ isn’t the right word. How about ‘untrained, inexperienced, under-exercised and ill-equipped data centre staff make mistakes’? Certainly the opposite is true – ‘good’ staff make fewer mistakes…
So what, I hear you thinking? Well… I spend a considerable part of my time advising on facility design and delivering training courses centred on the prime target of nearly all data centres – high Availability. This concentrates, almost totally, on designing in reliability, resilience, concurrent maintainability and, in the ultimate expression, fault tolerance. Or, if you prefer, on selecting the most appropriate ‘Tier Classification’ or, in EN 50600 speak, ‘Availability Class’. So risk profiles are established, budgets are set, designs are laid down, specifications are written and contractors are appointed.
However, nearly all of my work ignores the scourge of data centre Availability: ICT service interruption through ‘human error’. Many studies published over the years attribute some 70% of all data centre ‘failures’ to that problem. In fact, a couple of years ago in a data centre seminar session, someone from Microsoft let slip that IT hardware, software and human error together accounted for 97% of their service failures. That left me somewhat deflated, as I was only influencing 3% of the Availability puzzle.
That is still the case today, although it is more than possible to incorporate anti-error features in the M&E design, generally through simplicity and a dual-bus arrangement. Indeed, Ed Ansett postulated some years ago (when running EYP) that spending more on (overly) complex systems in the pursuit of higher resilience has a negative effect on reliability, and that has always stayed at the forefront of my mind when auditing system architecture after a failure. The point is that the shortcomings of a poorly designed facility can be totally mitigated by good operatives, whilst even a ‘Tier 4’ facility can be turned into an unreliable nightmare by incompetence. The money spent on the resilience of the data centre infrastructure only addresses less than 30% of your business continuity plan.
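To put rough numbers on that ceiling, here is a minimal sketch of the arithmetic. It assumes, purely for illustration, that failure causes are independent shares of a whole, and it takes the two infrastructure shares from the figures quoted above (roughly 30% and 3%). The function name `remaining_failures` and the baseline of 100 failures are hypothetical and do not come from any of the studies mentioned.

```python
def remaining_failures(total_failures: float,
                       infra_share: float,
                       infra_reduction: float) -> float:
    """Failures left after removing a fraction of infrastructure-caused ones.

    total_failures  -- baseline number of failures (hypothetical figure)
    infra_share     -- fraction of failures caused by facility infrastructure
    infra_reduction -- fraction of those failures that better design removes
    """
    if not (0.0 <= infra_share <= 1.0 and 0.0 <= infra_reduction <= 1.0):
        raise ValueError("shares must be between 0 and 1")
    return total_failures * (1.0 - infra_share * infra_reduction)


if __name__ == "__main__":
    # Assumed shares: ~30% (if human error is ~70%) and 3% (the Microsoft anecdote).
    for share in (0.30, 0.03):
        # infra_reduction=1.0 models a 'perfect' infrastructure design.
        left = remaining_failures(100, share, 1.0)
        print(f"infrastructure share {share:.0%}: {left:.0f} of 100 failures remain")
```

Even with a perfect infrastructure design, the failures attributed to people and IT are untouched, which is why the rest of the budget conversation has to be about staff.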
Anyway, the bottom line is that to achieve high Availability you need to pay most attention to data centre staff, from the most important but least senior upwards. Whilst the most senior facility manager might have the most responsibility, the most junior, humble M&E technician has the greatest opportunity to cause real damage to your business. Select them carefully, train them well on your system, allow them to simulate failure scenarios, test the system often, update their training regularly and, finally, retain them. Of course, nothing can compensate for malicious sabotage, so staff selection and screening become even more critical.
Failing that, at the end of the day, you could resort to the Russian Method of data centre personnel selection: only employ those who own a large dog and have them bring the dog to work every day; train the employee on how all the systems work, and train the dog to attack them if they try to touch anything.
This is a guest Blog by:
Prof Ian F Bitterlin
CEng PhD BSc (Hons) BA DipDesInn
FIET MCIBSE MBCS
Visiting Professor, Data Centre Engineering, University Of Leeds