Reliability Criteria
Engineers who design mission critical facilities tend to focus on the mathematical and theoretical nuances of infrastructure design, especially in regard to achieving x “9s” of reliability. In practice, truly reliable performance of mission critical facilities and IT infrastructure is more a matter of art than science.
For starters, most downtime is directly related to human error, not spontaneous component or system failure, and human error is much harder to quantify and predict. Secondly, design features that make systems easier to maintain and operate will provide far more value to the owner/operator than costly measures taken to enhance theoretical reliability.
So, what factors and criteria characterize high performing mission critical infrastructure?
Bigger and Fewer is Better
The tendency is to think that multiple smaller components will be more reliable than a bigger system with fewer components. This comes from the idea that the failure of a bigger component will have a broader effect than a smaller, localized failure. However, given a minimum amount of redundancy and the underlying assumption that the “system” can tolerate a component failure and still be functional, the actual probability of downtime is much lower with fewer components, because every additional component is one more thing that can fail.
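A rough way to see this is to compare two N+1 arrangements that carry the same load. The sketch below is illustrative only: it assumes every unit fails independently and is unavailable 1% of the time, and the unit counts (2+1 large units versus 10+1 small units) and the function name are made up for the example, not taken from any particular facility.

```python
from math import comb


def prob_system_down(needed: int, installed: int, p_unit_down: float) -> float:
    """Probability that fewer than `needed` of `installed` units are available.

    Assumes units fail independently with the same probability (a
    simplification; real failures are often correlated).
    """
    max_failures_tolerated = installed - needed
    # Sum the binomial probabilities of every failure count the system cannot ride through.
    return sum(
        comb(installed, k) * p_unit_down**k * (1 - p_unit_down) ** (installed - k)
        for k in range(max_failures_tolerated + 1, installed + 1)
    )


P_UNIT_DOWN = 0.01  # assumed per-unit unavailability, purely illustrative

# Same load, same N+1 redundancy, different unit sizes.
few_large = prob_system_down(needed=2, installed=3, p_unit_down=P_UNIT_DOWN)
many_small = prob_system_down(needed=10, installed=11, p_unit_down=P_UNIT_DOWN)

print(f"2+1 large units : {few_large:.6f}")
print(f"10+1 small units: {many_small:.6f}")
```

Under these assumptions, the arrangement with more units is the one more likely to lose two units at the same time, even though both tolerate a single failure. The numbers themselves mean little; the direction of the comparison is the point.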
Simplicity
Another tendency that I have seen is to try to protect the data center from every conceivable failure scenario and maintenance permutation. This leads to complexity, and complexity, given our human limitations, leads to reduced reliability. The simpler the schemes, the easier they are to operate and maintain, and the more likely they are to provide many years of faithful and dependable service.
Hardening
During the big dot com boom of the late nineties, we saw a lot of data centers built with DX cooling units on roofs or outside on grade. While this was a fast and cheap solution, it relegated many of these data centers to class C or worse on the open market, and they were difficult to unload at anything more than 15 cents on the dollar. The lesson is to provide a consistent design with integrity. In other words, don’t build one area to a high level of reliability only to be left vulnerable by the systems selected in another. Data centers where downtime has any significant impact need to look seriously at hardening the outer shell, separating electrical equipment within the facility, keeping critical equipment off the roof and out of the open, and planning for extended utility outages, be they water, electricity, fuel or communications.
The bottom line is that high performing facilities are based on simple and elegant topologies, ease of use, and robust construction with an integrated design philosophy.