The Reason for System Downtime in IT
The recent wave of high-visibility outages and failures has drawn attention to a problem that happens all too frequently in IT but is seldom reported unless the failure affects the general public. At the root of the problem, from a technical standpoint, is the IT Skills Gap, a problem about which we frequently write. Computing economics have produced geometric growth in computing resources and made new and better applications available at lower cost. At the same time, IT budgets are showing little to no growth, and IT staffing is lagging behind deployments because of insufficient budget and a shortage of trained people.
Business is pushing for more and better applications to be deployed faster to remain competitive. But any new deployment creates risk. The Multiplication Law of Probability states that the probability that n independent events all occur is equal to the product of their individual probabilities. While the risk of an outage or failure on any one system or application is relatively minor, that risk compounds with geometric growth in new deployments and applications. As a simplistic example, if each component of a system has 99.5% average reliability and the system depends on, say, 25 components, the reliability of the system as a whole drops to 88.2% (0.995^25 ≈ 0.882), assuming the reliability of each component is independent of that of the others. In IT systems that assumption does not hold: because systems need to communicate and share data, they are, in fact, very dependent on one another. In especially complex IT organizations like those of the major airlines, with large numbers of disparate components (mainframes with distributed systems; cloud-based, virtual, and on-premises machines; multifaceted networks; and a myriad of applications), risk level and mitigation become serious considerations.
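The arithmetic behind the 25-component example can be checked with a short Python sketch. The function name series_reliability is ours, chosen for illustration, and the calculation rests on the same independence assumption as the example, which real IT systems do not satisfy.

```python
import math


def series_reliability(component_reliabilities):
    """Probability that every component works, assuming independent failures."""
    return math.prod(component_reliabilities)


# 25 components, each with 99.5% reliability (the figures used in the text).
system = series_reliability([0.995] * 25)
print(f"{system:.1%}")  # 88.2%
```

Passing a longer list, or lower per-component figures, shows how quickly the product falls as deployments multiply.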
Delta Air Lines’ Systems Outage
We don’t know for sure what caused January’s Delta outage (the company simply characterized the problem as a “systems outage” and an “automation issue”). Hardware, software, or both could have been at the root of the problem. If the core of the problem was on the hardware side, the airline certainly has some sort of high-availability recovery plan. Testing such a plan under a realistic set of conditions, however, can be problematic.
Assume for a moment, though, that the problem was rooted on the software side. And assume that, as Delta acknowledged, an automation tool was at the heart of the problem. The benefit that automation can bring is clear: manual hand-offs are problematic. But while automation is the popular answer to staffing deficits and increasing demand for applications, it is critical that the automation tool in question has features and capabilities that allow IT to recover quickly in the event of an outage or failure.
One such capability is support for changesets, which provide a way to recover more quickly in the event of a failure. A changeset is a group of changes that is treated as an indivisible unit: a central place to track the changes being made, with a mechanism to roll back parts of a change or the entire change. This improves governance over IT operations by providing an audit trail that shows, in a clear and granular fashion, what changes were deployed, when, and by whom. Changeset functionality gives IT organizations a degree of maintainability, manageability, and control.
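To make the idea concrete, here is a minimal Python sketch of a changeset applied to a plain dictionary of settings. It illustrates the concept only and does not describe any particular automation product; the Changeset class, its apply and rollback methods, and the sample settings are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

_MISSING = object()  # marks a key that did not exist before the change


@dataclass
class Changeset:
    """Hypothetical changeset: a group of updates tracked as one unit."""
    author: str
    updates: dict  # key -> new value
    previous: dict = field(default_factory=dict)
    applied_at: Optional[datetime] = None

    def apply(self, store: dict, audit_log: list) -> None:
        # Snapshot prior values first so the whole group can be undone.
        self.previous = {k: store.get(k, _MISSING) for k in self.updates}
        store.update(self.updates)
        self.applied_at = datetime.now(timezone.utc)
        audit_log.append(("apply", self.author, self.applied_at, dict(self.updates)))

    def rollback(self, store: dict, audit_log: list, keys=None) -> None:
        # Restore the selected keys, or the entire changeset when keys is None.
        targets = list(self.previous) if keys is None else list(keys)
        for k in targets:
            old = self.previous[k]
            if old is _MISSING:
                store.pop(k, None)  # the key was newly added, so remove it
            else:
                store[k] = old
        audit_log.append(("rollback", self.author, datetime.now(timezone.utc), targets))


# Example: apply a group of changes, then undo all of them.
config = {"timeout": 30, "retries": 3}
log = []
cs = Changeset(author="alice", updates={"timeout": 5, "pool_size": 10})
cs.apply(config, log)
cs.rollback(config, log)
# config is back to {"timeout": 30, "retries": 3}; log records who did what, and when.
```

Passing keys=["timeout"] to rollback would undo only that part of the change, and the log list plays the role of the audit trail.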
One other critical area is change management, that is, a reliable approach to synchronizing and managing objects across different environments, such as Test, QA, and Production. Changesets, used in conjunction with change management, provide a powerful way to easily and reliably identify, roll back, restore, and audit groups of changes.
There is no question that business is relying on technology to drive innovation and move forward faster. The question now is how IT organizations will accommodate and mitigate any risk that inherently accompanies rapid progress.