By Steve Bradbury, VP of Consulting & Service Delivery, NextNet Partners
As a business leader, imagine the impact on your business if you lost access to company financial information or customer records, if doctors could not access patient records, or if your customers were unable to purchase products and services. Most companies today have become highly dependent on their IT systems and would not be able to conduct business with their customers or run operations if their critical IT systems were down. As companies continue moving to the Cloud and centralizing their IT systems, data centers and core infrastructure have become essential to the performance and availability of these critical systems. If your data centers are not designed, built and tested to sustain or recover from a near catastrophic or catastrophic failure, such an event could severely damage your company’s brand or possibly put you out of business. If you consider your data center to be a critical part of your business, please read on.
How do you know if your data center is a ticking time bomb?
Unfortunately, there is no easy answer to this question and no silver bullet to solve the problems you may find. Hopefully, this article will at the very least provide you with a starting point to begin asking the hard questions of your IT organization.
I’ve spent the last twenty years of my career working with enterprise data centers and infrastructure technology in mission critical environments in nearly every industry, including Healthcare, Financial Services, High-Tech, Government and Aerospace. I’ve worked with companies that have over 150 data centers distributed across the globe and some with only a single data center. In almost every environment where I’ve consulted or worked, many of the data centers do not meet the minimum Uptime Institute or TIA-942 tier classifications and have not executed real-world production scenarios to effectively test their recovery plans. I want to point out that a few of these data centers are among the most sophisticated and highly redundant Tier 3 and Tier 4 rated data centers in the world. However, even these have vulnerabilities that could lead to a critical failure. The examples I use in this article are taken from real-world situations where recovery scenarios were invoked and, in a few cases, where data centers were completely down.
As a CIO, CTO, VP or director, you are expected to ensure the availability and performance of the production IT environment. As a CEO, you expect that your data center and core infrastructure have been designed to minimize your risk of encountering a disruption to your business due to an IT systems failure.
Over the course of my career I’ve often referred to the Data Center Operations and Infrastructure teams as the “utility company”. Computers, networks and applications are expected to work as the power does; when you flip a light switch on, you expect it to work. When the lights don’t turn on or go out, all of us in IT know how that story ends.
What questions should you ask IT or the team that supports the data centers?
The following questions are intended to help start the conversation with IT or leadership to see how your data center and team would respond in the event of a key or critical system failure.
Q: Are the data center systems running critical applications using High Availability (HA) or Active/Active (AA) configurations?
A1: An Active/Active architecture provides the ability to run systems/applications in two or more locations in real-time. The A/A systems should be able to continue to run at equal or near equal production performance with little to no disruption of service in the event one location is down. An A/A architecture is the most costly solution and typically is put in place for mission critical systems and applications.
A2: In a High Availability configuration, systems are architected to recover in near real-time using a “Hot Stand-by” configuration. These systems often leverage replication between systems so that, in the event of a failure, services can be recovered within hours of a disruption or downtime event.
Q: Will the other infrastructure and application components support the production load or capacity across multiple centers in the event of a failure or service disruption?
A: In the event of a service disruption, will your networks, firewalls, Active Directory, security systems, interfaces, databases, management utilities, etc. handle the production load if services are moved to the backup or recovery site?
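One way to make this question concrete is a simple "lose one site" capacity check: add up the production load across all sites and see whether the surviving sites could carry it. The sketch below is only an illustration under stated assumptions. The `Site` class, the site names and the capacity figures are hypothetical, and a real assessment would need a separate check for each shared component (network, firewalls, databases and so on).

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    capacity: float      # maximum load the site can serve (hypothetical units)
    current_load: float  # production load the site carries today


def surviving_headroom(sites, failed_name):
    """Spare capacity left once the failed site's load moves to the other sites."""
    survivors = [s for s in sites if s.name != failed_name]
    total_load = sum(s.current_load for s in sites)
    total_capacity = sum(s.capacity for s in survivors)
    return total_capacity - total_load


def sites_that_cannot_fail(sites):
    """Names of sites whose loss would leave the remaining sites overloaded."""
    return [s.name for s in sites if surviving_headroom(sites, s.name) < 0]


# Hypothetical two-site example: the backup site is smaller than production.
sites = [
    Site("primary", capacity=1000, current_load=700),
    Site("backup", capacity=600, current_load=100),
]
print(sites_that_cannot_fail(sites))
```

In this made-up example the backup site cannot absorb the full production load, so losing the primary site would mean a degraded service or an outage even though a recovery site exists on paper.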
Q: Are there updated and clearly defined procedures on how IT and business personnel activate a recovery or Business Continuity Disaster Recovery plan to respond to an event?
A: Have these procedures been recently tested in a real-world scenario? It is extremely important that you understand the testing scenarios, see the direct results of the test or witness the switch from production to backup and vice versa.
Q: What “single points of failure” exist within the data center?
A: Very few data centers have been thoroughly analyzed to understand what single components could cause a significant failure. Over the years, I’ve heard the story from many data center managers, architects and engineers that “they have no single points that could cause a significant disruption to the data center or key infrastructure”. When I hear that, it typically tells me that they do not fully understand their environment or are not thinking holistically. It’s not always the systems and networks that can cause the problem.
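A basic version of this analysis can be sketched in a few lines: describe the environment as a list of connections, remove one component at a time, and check whether the critical service can still be reached. The component names and topology below are hypothetical, and the brute-force approach is an assumption chosen for readability rather than a description of any particular tool. The same idea applies to power feeds, cooling or any other dependency, not just network devices.

```python
from collections import deque


def reachable(links, start, removed):
    """Breadth-first search over undirected links, ignoring the removed component."""
    if start == removed:
        return set()
    neighbors = {}
    for a, b in links:
        if removed in (a, b):
            continue
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def single_points_of_failure(links, source, target):
    """Components whose removal alone cuts every path from source to target."""
    nodes = {n for link in links for n in link}
    candidates = nodes - {source, target}
    return sorted(n for n in candidates if target not in reachable(links, source, n))


# Hypothetical topology: two diverse entry points, but one core switch and one firewall.
links = [
    ("internet", "mdf-a"),
    ("internet", "mdf-b"),
    ("mdf-a", "core-switch"),
    ("mdf-b", "core-switch"),
    ("core-switch", "firewall"),
    ("firewall", "app-servers"),
]
print(single_points_of_failure(links, "internet", "app-servers"))
```

Here the two entry points protect each other, but the lone core switch and firewall are each flagged. The value of the exercise lies less in the code than in being forced to write down every dependency.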
What are some examples of single points of failure that are commonly overlooked?
This is a real-world list of uncovered single points of failure, including some that have caused a complete data center outage lasting for hours and even days:
- Power – The power should be tested in the same way that systems and applications are tested. Generators, Uninterruptible Power Supplies (UPS), utility power, Power Distribution Units (PDU), static switches and rack power are just a few of the components involved. Many data center engineers, technicians and managers rely upon electrical vendors to support these critical systems. These systems need to be tested for load, power distribution, redundancy and scale with the same rigor an IT organization would apply to testing its systems and applications.
- HVAC – Can your data center produce sufficient ventilation and air-conditioning? If a data center doesn’t have sufficient HVAC infrastructure in place, it can overheat, causing systems to shut down or crash.
- Electrical Vendors – Did an electrical engineer with data center experience design your data center, or did a commercial building electrician design it? This is a very important question since most commercial electricians do not understand how to properly engineer or architect the power and HVAC for an enterprise data center.
- Telco Rooms or MDFs – The MDF (Main Distribution Frame) is often overlooked from a power, HVAC and critical service perspective for the data center. The MDF is where most network connectivity enters a data center. If you lose access to your network, the centralized data center will lose all external access. Network access in many companies has single points of failure, and these are often overlooked since MDFs are managed by the telco or network teams. Data centers should have network access from at least two diverse locations to avoid a single point of failure scenario.
- Server and Communication Racks – Do the server and communication racks have diversified power? A single computer rack should have at least two separate end-to-end sources of power that are color coded and tested.
- Core Network – The Core Network for a data center is what provides the interconnectivity between servers, storage, network, backup systems, firewalls and much more. A single device failure or strand of fiber being cut can bring the entire data center down and have a significant impact if not properly engineered. How diverse is the Core Network?
- Digital Certificates – Many organizations overlook something that seems simple, but has the ability to cause major disruptions to the data center and systems they support. Certificates usually have dates for when they will expire. If they are not renewed and expire, access to the systems requiring the certificate authentication will likely shut down and restrict access until renewed. Accurately maintained documentation and clearly defined procedures will help avoid this sort of event. I’ve experienced an expired certificate that took out an enterprise wireless network and ones that have shut down entire sections of a data center and production applications.
- Lifecycle Management Plan – IT is expensive, and it’s easy to delay the next OS upgrade, put off replacing the core network switches because it seems too risky, or postpone replacing End of Life (EOL) servers. However, IT leaders need to explain the importance of maintaining the lifecycle refresh so that the data center does not end up with systems that are no longer compatible or can no longer be supported. Additionally, neglecting the data center lifecycle will eventually catch up with you and force a major unplanned capital expense (CapEx). I’ve seen companies that have created a billion-dollar deficit by neglecting the refresh of systems and applications.
- Production and Change Control (PCC) – Do you have a comprehensive Production and Change Control process to ensure that all changes that go into production have been properly tested and have clearly defined back-out procedures? If you do not, you are missing one of the most critical functions in an IT organization. PCC is essential for controlling change in a data center and should be built around industry best practices such as ITSM and ITIL. It must have executive sponsorship, governance shared between IT and business leadership, and well-defined policies to enforce.
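Of the items above, the digital certificate problem is the easiest to monitor automatically. The sketch below assumes you already keep an inventory that maps each system to its certificate expiry date, written in the text format Python’s `ssl` module reports for a certificate’s `notAfter` field. The system names, dates and the 30-day warning window are all hypothetical, and the "current" date is fixed so the example is repeatable.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Days left before a certificate expires; negative if it already has.

    not_after is a string such as 'Jan 11 00:00:00 2025 GMT'.
    """
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def expiring_soon(inventory, warn_days=30, now=None):
    """List (system, whole days left) for certificates inside the warning window."""
    flagged = []
    for name, not_after in inventory.items():
        remaining = days_until_expiry(not_after, now)
        if remaining <= warn_days:
            flagged.append((name, int(remaining)))
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical inventory; in practice this would come from your own records.
inventory = {
    "wireless-controller": "Jan 11 00:00:00 2025 GMT",
    "intranet-portal": "Dec 22 00:00:00 2024 GMT",
    "payments-api": "Jun 01 00:00:00 2025 GMT",
}

# Fixed reference date so the example gives the same answer every time.
reference_now = ssl.cert_time_to_seconds("Jan 01 00:00:00 2025 GMT")
print(expiring_soon(inventory, warn_days=30, now=reference_now))
```

A scheduled job that runs a check like this and alerts the owning team is a small investment compared with an outage caused by a certificate nobody remembered to renew.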
This article only touches on a few of the many complex aspects of the data center that can put your business at risk. Hopefully, it will give you, as a leader, consultant or end user, the information you need to ask IT these questions effectively. Then, make your own judgements as to whether or not your business is potentially at risk of an IT event that could have been prevented. These issues can be resolved, but you first have to ask the questions and determine your path.
Thank you for reading this article and please send questions or comments to me at firstname.lastname@example.org. Please look for future articles from me coming soon. Future topics: Managed Services and Outsourcing; Cloud Solutions; Unified Communications and Enabling Technologies; IT Service Management; IT Infrastructure and Customer Experience.