Resilience

The Principles of Infrastructure Resilience

by Scott Jackson

Resilience, as described in 2007 in documents related to the Critical Infrastructure Protection (CIP) Program, is broader than protection in that protection focuses primarily on survival. Resilience includes plans for the recovery and continued functioning of the infrastructure even if some elements of the infrastructure do not survive. Resilience takes some of the pressure off protection, and allows for recovery even when total protection is not feasible. It also considers how a building, power plant, or other component of critical infrastructure is prepared and protected, whether responders and/or the community at large can take advantage of advance warnings, and whether there are alternative plans in place for continued operation. Resilience also takes into consideration whether those whom responders depend on also are likely to survive and recover.

Resilience has become a subject of increased interest in recent months, at least partly because both the Department of Homeland Security and the Obama Administration have adopted resilience as a primary focus area. Resilient infrastructures can anticipate, survive, and recover from external disruptions, whether deliberate, such as terrorist attacks, or natural, such as hurricanes and earthquakes. Resilient infrastructures also can recover from internal disruptions, such as human and software errors. Infrastructures that are not resilient are called brittle. Fortunately, a worldwide community of scholars, researchers, and decision-makers has made significant strides in defining resilience and developing its guiding principles.

First, an infrastructure can be seen as a system. Systems are collections of parts, called elements or nodes, that together have a common purpose. The common purpose of a transportation infrastructure, to consider but one example, is to carry people and cargo across various distances. The purpose of a power infrastructure is to provide power to residences and industry throughout a fairly large area.

Similar infrastructures exist for a community's water supply, fire protection, healthcare, defense, communications, and law enforcement. Because each of these infrastructures is almost always separately owned and managed, yet dependent in varying degrees on the others, the whole is sometimes called a system of systems. The U.S. national infrastructure itself is a system of systems. (The figure accompanying this article is a notional depiction of a national infrastructure system of systems.)

Some components of the U.S. national infrastructure – railroads and airlines, for example – are privately owned and managed. Other parts are government agencies, such as fire and police departments. One of the major challenges facing U.S. decision makers and emergency managers, at all levels of government, is how to make the national infrastructure more resilient. 



The Architecture of an Infrastructure
The term architecture is used to describe how the various parts of an infrastructure are arranged and relate to one another. The creation of an architecture is sometimes called architecting. The architecting process asks, and seeks answers to, a number of questions, including the following: (a) How many modes of transportation should there be? (b) How can the nodes of a power infrastructure be arranged so that the loss of one node will not cause the entire power infrastructure to cease functioning? (c) How can and should a private or government infrastructure reorganize to survive and recover? The following review of the four principles of resilience – capacity, flexibility, tolerance, and cohesiveness – addresses those and a few other questions.


Capacity: The first and perhaps most important principle is that all infrastructures have, or should have, the capacity to withstand "known" disruptions, such as hurricanes and floods. Capacity includes not only the ability to absorb such disruptions but also a margin of additional ability to cope with disruptions larger than anticipated. Capacity also includes both physical and functional redundancy so that the infrastructure will have alternative ways to survive. Functional redundancy could mean, for example, that a coastal city would have several possible ways for the local populace to evacuate the disaster area and find shelter elsewhere. Those ways could and probably would include the use of cars, trains, boats, aircraft, and other modes of transportation. The more "ways" there are to evacuate, the more resilient the infrastructure is.
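The two ideas in the capacity principle, margin and functional redundancy, can be illustrated with a minimal sketch. The mode names, hourly rates, and demand figure below are invented for illustration and do not describe any real city; the check simply asks whether evacuation demand could still be met if any one mode of transportation were lost.

```python
# Illustrative only: all rates and the demand figure are hypothetical.

def surviving_capacity(modes, lost=()):
    """Evacuees per hour that can still be moved when the modes in `lost` are unavailable."""
    return sum(rate for name, rate in modes.items() if name not in lost)


def tolerates_single_loss(modes, demand):
    """True if demand can still be met after losing any one mode."""
    return all(surviving_capacity(modes, lost=(name,)) >= demand for name in modes)


# Hypothetical evacuation modes and their hourly capacity.
modes = {"cars": 6000, "trains": 3000, "boats": 1000, "aircraft": 500}
demand = 4000  # hypothetical evacuees per hour

# Margin: spare capacity beyond the anticipated demand, all modes available.
margin = surviving_capacity(modes) - demand

# Redundancy: several modes survive the loss of one; a single mode does not.
redundant = tolerates_single_loss(modes, demand)
single_mode = tolerates_single_loss({"cars": 6000}, demand)
```

In this sketch the four-mode city keeps meeting demand when any one mode is lost, while a city relying on cars alone does not, even though cars by themselves exceed the demand on a normal day.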

Unfortunately, there are several fairly recent examples of infrastructures that did not meet the capacity test. The 2007 collapse of the I-35W bridge in Minneapolis and the failure of several New Orleans levees during Hurricane Katrina in 2005 are perhaps the most obvious of those examples. The long-term stresses on the bridge, and the overflow heights of the levees, were well known long before disaster struck. The principal readiness shortcoming was that the required capacity was not maintained (or, preferably, augmented) after the infrastructure elements were built.

The capacity example provided in 2009 by US Airways Flight 1549 was somewhat different. There is no evidence to date that the Airbus airplane failed to meet the Federal Aviation Administration (FAA) requirements for bird ingestion; in fact, the flock of geese that caused the aircraft's engines to shut down exceeded what the FAA requirement anticipates. Nevertheless, the aircraft "system" – which included the pilot and crew as well as the passengers – survived, demonstrating that it was a resilient system.

In contrast, the 2001 attacks on the twin towers of the World Trade Center (WTC) in New York City greatly exceeded the capacity of the two buildings to survive such an attack. Both buildings presumably had the capacity to withstand all precedented – i.e., previously known – disruptions. In this case, the recovery of the city itself depended not only on capacity but also on a number of other resilience factors, such as a supply of generators sufficient to restore power in Manhattan within five hours (as pointed out by David Mendonça and William Wallace in their 2006 report on "Adaptive Capacity").

However, infrastructure resilience cannot depend on capacity alone. Resilience enables infrastructures to recover even when the disruption exceeds the capacity, as in the case of the US Airways flight.

Flexibility: Resilient infrastructures also must be flexible – which means, more specifically, that the infrastructure system should be able to reorganize itself. Once again, the New York Power Restoration case study shows that a major factor in the restoration of power so soon after the attacks on the WTC was the ability of the local power agency to reorganize and focus on deploying generators throughout the city. 



Reorganization also includes the ability of the infrastructure to elevate levels of authority during a disruption. Such elevation of authority is particularly common in the fire prevention arena. The San Francisco Fire Department, for example, employs a system in which, as the severity of an emergency increases, the authority rises first from the firefighter level to the supervisory level and then to the senior decision-making level of the government agencies involved.
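The idea of authority rising with severity can be sketched as a simple escalation ladder. The three tiers follow the description above, but the 1-to-10 severity scale and the thresholds are assumptions made purely for illustration; they are not the San Francisco Fire Department's actual protocol.

```python
# Hypothetical ladder: (highest severity handled, level of authority).
# The severity scale and thresholds are assumed for illustration only.
LADDER = [
    (3, "firefighter"),
    (7, "supervisor"),
    (10, "senior decision-maker"),
]


def authority_for(severity):
    """Return the lowest level of authority whose threshold covers `severity`."""
    for threshold, level in LADDER:
        if severity <= threshold:
            return level
    # Anything beyond the scale goes to the top of the ladder.
    return LADDER[-1][1]


minor = authority_for(2)
serious = authority_for(5)
major = authority_for(9)
```

Keeping the ladder as data rather than as branching logic reflects the flexibility principle: an organization could, in this sketch, change the thresholds during a disruption without changing how decisions are routed.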

Tolerance: Resilient infrastructures also are "tolerant" of disruptions – tolerant in the sense that the infrastructure will not immediately lose all of its capability following a disruption, but will degrade gradually. Tolerance depends, to a great extent, on localized capacity – which already exists today in certain domains. Many hospitals have their own power supplies, for example, in case the public power supply is disabled because of an earthquake or other major disruption. (The current national power grid, however, is an example of an infrastructure in which localized capacity is not strong enough.)

Another aspect of tolerance is what is called "loose coupling" – a term which implies, for example, that when one node of an infrastructure element fails, that failure will not immediately propagate to and/or affect other nodes. In the 2003 failure of the power grid in the northeastern United States, the lack of loose coupling was evident in the rapid propagation and geographic spread of the failure. A valuable lesson learned from that failure, though, is that loose coupling should be a high priority in the design and building of future generations of power infrastructures.
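The difference between tight and loose coupling can be shown with a toy cascade model. This is a sketch under stated assumptions, not a power-flow simulation: the four-node ring, the loads, the capacities, and the `transfer_fraction` parameter (the share of a failed node's load that is pushed onto its neighbors rather than shed) are all hypothetical.

```python
from collections import deque


def cascade(neighbors, load, capacity, first_failure, transfer_fraction):
    """Return the set of failed nodes after `first_failure` goes down.

    A failed node pushes `transfer_fraction` of its load, split equally,
    onto its surviving neighbors; any neighbor pushed over capacity fails too.
    """
    load = dict(load)  # work on a copy so the caller's data is untouched
    failed = {first_failure}
    queue = deque([first_failure])
    while queue:
        node = queue.popleft()
        alive = [n for n in neighbors[node] if n not in failed]
        if not alive:
            continue  # no surviving neighbor to receive the load
        share = load[node] * transfer_fraction / len(alive)
        for n in alive:
            load[n] += share
            if load[n] > capacity[n]:
                failed.add(n)
                queue.append(n)
    return failed


# Hypothetical ring of four nodes, each carrying 80 units with capacity 100.
neighbors = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
load = {n: 80 for n in neighbors}
capacity = {n: 100 for n in neighbors}

tight = cascade(neighbors, load, capacity, "A", transfer_fraction=1.0)
loose = cascade(neighbors, load, capacity, "A", transfer_fraction=0.25)
```

With tight coupling the loss of node A overloads its neighbors and the whole ring fails; with loose coupling most of A's load is shed locally and only A is lost, which is the gradual degradation that the tolerance principle calls for.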

Another capability in the tolerance area is what is called drift correction – which means, basically, that disruptions can be avoided or at least minimized by awareness of their approach in enough time that corrective or compensatory action can be taken. The use of sensors to warn of an impending train collision is one example of how drift correction can be used to prevent or reduce the harmful effects of an imminent disaster.
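Drift correction in the train example amounts to noticing that a gap is closing early enough to act. The following sketch assumes a sensor that reports the distance to the train ahead at fixed intervals; the function names, the linear extrapolation from the last two readings, and the 30-second warning threshold are illustrative assumptions, not a description of any real train-control system.

```python
def seconds_to_collision(gaps_m, interval_s=1.0):
    """Estimate time to collision from the last two gap readings (in metres).

    Uses simple linear extrapolation; returns None if there are too few
    readings or the gap is not closing.
    """
    if len(gaps_m) < 2:
        return None
    closing_speed = (gaps_m[-2] - gaps_m[-1]) / interval_s
    if closing_speed <= 0:
        return None
    return gaps_m[-1] / closing_speed


def should_brake(gaps_m, warning_s=30.0, interval_s=1.0):
    """True if the projected collision is within the (assumed) warning window."""
    eta = seconds_to_collision(gaps_m, interval_s)
    return eta is not None and eta <= warning_s


far = should_brake([1000, 980])    # closing at 20 m/s, still far away
near = should_brake([500, 480])    # same closing speed, much closer
opening = should_brake([500, 510]) # gap is growing, no drift toward danger
```

The point of the sketch is that the warning depends on the trend as well as the current reading: the same closing speed triggers action only when the remaining gap leaves too little time to correct.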



Cohesiveness: The resilience of an infrastructure also depends on how well the nodes of the infrastructure relate to one another. In a 2006 report on "essential characteristics" of infrastructures, David Woods refers to those relationships as "cross-scale interactions," and points out that they can occur on three levels. The first level is communication, which asks, specifically, whether the nodes "talk to one another." Karl Stephan addressed that question in a 2007 IEEE (Institute of Electrical and Electronics Engineers) publication, pointing out that, at the time Hurricane Katrina hit, many of the agencies in New Orleans lacked the interoperability they needed to communicate with one another.

The second level of cohesiveness is cooperation. Even with no formal ties, the nodes of an infrastructure should possess the initiative, and capabilities, needed to cooperate with one another. During Katrina, the New Orleans agencies were deficient on this level as well. In contrast, the New York Power Restoration accomplishment after the 9/11 attacks was a case in which cooperation was manifest among and throughout the power companies, fire and police departments, and U.S. military forces involved. The third and highest level of cohesiveness encompasses inter-element collaboration, which includes formal agreements between the nodes to both help and provide resources to one another.



The development, building, and implementation of infrastructure resilience is difficult for a number of reasons, but primarily because of the large and complex array of government agencies and private organizations involved – all of them representing different nodes of the infrastructure. There is general agreement, though, that no single organization, not even the federal government, should or could orchestrate the entire resilience plan. Independent nodes, such as The Infrastructure Security Partnership (TISP), also can and do play a role – by, for example, facilitating agreements between and among the other nodes.



It is important to remember that very few if any of these nodes or entities possess the financial resources needed to fully implement the more costly aspects of resilience. Installing dual railroad tracks, for example, to achieve a certain degree of physical redundancy would be very costly. Priorities need to be established, therefore, in assessing which needs should be funded. Fortunately, some aspects of resilience are practically cost-free. Signing a memorandum of agreement with another agency, for example, usually involves only a relatively small administrative cost – an acceptable price to pay when it is remembered that the main responsibility of all of the organizations participating in a resilience effort is to collaborate with one another.



To summarize briefly, there is general agreement that (a) the principles of resilience, as defined by the international community and described above, can and should be applied to the U.S. national infrastructure and to its principal elements; and (b) the most important priority in this effort is, and should be, the development of a cooperative approach to implementation.


_______________

For additional information relevant to the preceding article and/or closely related topics, see:

Critical thinking: Moving from infrastructure protection to infrastructure resilience. CIP [Critical Infrastructure Protection] Program Discussion Paper Series; George Mason University (2007). Available at: http://www.resilient.com/download/Research_GMU.pdf

Architecting Resilient Systems: Accident Avoidance and Survival and Recovery from Disruptions, by Scott Jackson (John Wiley & Sons, 2010).

“NTSB [National Transportation Safety Board] Blames Engineer for Crash,” by Robert Lopez, Dan Weikel & Rich Connell (Los Angeles Times, 22 January 2010).



“Adaptive Capacity: Electric Power Restoration in New York City Following the 11 September 2001 Attacks,” by David Mendonça & William Wallace, Proceedings of the Second Resilience Engineering Symposium, Juan-les-Pins, France, 8-10 November 2006.

“We’ve Got to Talk: Emergency Communications and Engineering Ethics,” by Karl Stephan, IEEE [Institute of Electrical and Electronics Engineers] Technology & Society Magazine, 2007.



“Essential Characteristics of Resilience,” by David Woods, in Resilience Engineering: Concepts and Precepts (Aldershot, UK: Ashgate, 2006).

_______________

Scott Jackson is a lecturer in the Systems Architecting and Engineering graduate program at the University of Southern California (USC). His book, Architecting Resilient Systems: Accident Avoidance and Survival and Recovery from Disruptions, was published by Wiley in 2010. He is a Fellow of the International Council on Systems Engineering (INCOSE) and represents USC on The Infrastructure Security Partnership (TISP).