Proposed National Academy of Reliable Infrastructure Management (longer read)

We need no further reminder than the 2021 Texas grid collapse and shutdown of the Colonial pipeline for how essential the real-time management of infrastructures is to people’s lives and livelihoods. Yet that management is the missing middle of President Biden’s mega-plan for new infrastructure construction and renovations. Nor is real-time management central to other initiatives like a National Infrastructure Bank, proposed in 2007 and resurfacing in the policy mix through 2020 legislation.

Providing the missing management means banking expertise the nation already has to operate these large, changing systems. They range from energy grids, urban water supplies and flood protection to telecommunications, vessel transportation and aviation, along with others. If anything, the legislation and initiatives will increase the need for real-time professionals to correct for inevitable shortfalls that jeopardize systemwide reliability and safety.

A National Academy of Reliable Infrastructure Management would remediate the nation’s infrastructure crisis by enhancing and advancing that high reliability management. The management challenges lie beyond the domains of engineering, economics and systems modeling, but are equal in priority and urgency to those of the National Academies of Sciences, Engineering, and Medicine.

RELIABILITY PROFESSIONALS

High reliability management is understood in real-time infrastructure operations as the continuous and safe provision of what are considered to be critical services, even during (especially during) turbulent and changing times.

The Academy would bring together an under-recognized class of experts in that real-time management from around the world. These reliability professionals include, most important, infrastructure control room operators (often with long experience and variable formal education), along with their managers and their immediate expert support staff (more likely to have higher formal degrees).

The Academy would promote their participation through projects, studies, and other advisory and convening activities. The mission is to examine, assemble and advance evidence-based findings for real-time reliability and safety management of infrastructures under 21st century conditions. In doing so, the Academy would provide the heft needed to facilitate research access to major control centers, entry to which has been restricted for proprietary or security reasons.

Why an Academy for infrastructure management? As demonstrated repeatedly, large critical infrastructures must be managed beyond their technology, formal designs, and published regulations. The Academy challenge is to ensure that the tasks and demands of the rapidly changing infrastructure technologies are matched to the people with the skills and expertise to manage them beyond inevitable glitches and contingencies.

As the Academy gained knowledge, it would foster the management expertise to better navigate the interdependencies and interconnections of critical infrastructure sectors. Doing so requires two tracks. The first is ensuring that critical national services such as water, electricity, natural gas, hazardous liquids transmission, and aviation are provided when most needed: always right now, without incident. The second is ensuring their reliable and safe interconnectivity: natural gas is used for electricity provision, which supplies the water needed by refineries that process the hazardous liquids, including Jet A-1 fuel for aviation.

VALUE-ADDED FOR INTERCONNECTED CRITICAL INFRASTRUCTURES

The challenge continues to be how to analyze and improve the interconnectivity as it is navigated in real time. No one is responsible for that high reliability management picture. An example illustrates the huge stakes in getting this right. Assume an explosion at a major natural gas reservoir has occurred. Presently, the disaster leads to root-cause analyses, a process of zooming down to determine why and what precipitated the explosion. This is the responsibility of staff in the infrastructure and its regulator of record.

Identifying causes of the explosion is obviously important to prevent further explosions from happening at this and other reservoirs. But knowing causes does not go far enough in making sure that other systems are managed reliably and safely in light of the disaster. Required at the same time is zooming up the system and across systems with which it is interconnected.

What happened to the real-time operations of the natural gas transmission as a whole during and after the explosion? What happened to infrastructures depending on natural gas for their own operations during the explosion and in their next steps ahead? To my knowledge, the regulators of record do not work together to answer the latter question, routinely or as a matter of priority.

Such questions would be of core concern to the new Academy. Was the control center for natural gas transmission able to compensate for loss of the reservoir in real time? Did the control room keep the crisis from spreading to other parts of its transmission and distribution systems, including the variety of end-use customers? How did the control room compensate, where did it stumble, and which other parts of its system proved vulnerable or not?

But more than zooming up through the system is required. In the same instant, we must know what happened because of the explosion to the critical infrastructures depending on its natural gas. Some may also have control centers: Were their operators able to maintain their respective system’s reliability and safety in the face of that explosion? Since natural gas is often interoperable with electricity, it is critical to determine if or to what extent the electricity infrastructure was affected by the explosion.

These assessments are also necessary to keep infrastructures interconnected under complex and changing conditions. It’s safe to say that zooming down for a root-cause analysis has been far more common than zooming up and across. But only the assessments that zoom up and across highlight the major vulnerabilities introduced when root-cause analyses are the basis for systemwide recommendations to ensure the disasters don’t happen again.

What is missing in root-cause analyses are the negative impacts, if any, of the recommended changes on high reliability management at the system and inter-system levels. Will the changes, when implemented, undermine the capacity of the infrastructure’s control room to prevent disruptions, such as explosions, that it had prevented in the past from cascading across the natural gas system or beyond?

No regulator of record or national body is tasked to answer that question about cascade potential, or the questions it entails. Ensuring that there are answers would be the purview of the new Academy of Reliable Infrastructure Management.

REMODELING INFRASTRUCTURE CASCADES

Infrastructure cascades are understandably of central concern, where failure in one system leads to failure in others. But system engineers and modelers often have a very different view about these than control room operators.

One objective, for example, of network of networks modeling of infrastructures has been identifying which nodes and connections, when deleted, bring the network or sets of networks most immediately to collapse. But not failing immediately is what we expect to find in managed systems. In fact, the datasets we have on really existing infrastructure disruptions show that most are managed so as not to cascade over into other infrastructures and that certain infrastructures, most notably in energy, have a greater potential for cascading.

Modelers defend their focus as one of identifying worst-case scenarios (e.g., in today’s highly charged cyber-security arena). But control room operators and staff live in a real-time world where “what-if” scenarios cannot be the only way to treat probabilities and consequences.

Real-time reliability of their systems as systems must also account for the run of cases and frequencies of past or like events and their precursors. Real-time operators wouldn’t be reliability professionals if they ignored that, in their systems, brownouts at times precede blackouts, some levees are seen to seep long before failing, and the electric grid’s real-time indicators of precursors to disruption or failure typically increase beforehand. Reliability professionals (not least those in major control centers that face thousands and thousands of daily cyber-attacks) have to be skilled in both systemwide pattern recognition and localized “what-if” scenario formulation.

Their expertise also reflects its own real-time indicators of effectiveness. These indicators are rarely if ever recognized by the regulators of record or system models of interconnectivity. The Academy would be the nation’s advocate for that expertise and early warning signals.

NEW INDICATORS FOR PREVENTING INFRASTRUCTURE COLLAPSE, NOW

It’s important to establish from the outset that the Academy would be advancing leading (not lagging) indicators of systemwide collapse. Just as important, the indicators already exist for monitoring critical infrastructures operating at, or beyond, their performance edges, e.g.:

  • The infrastructure’s control room is in prolonged just-for-now performance, which means operators find it more difficult to maneuver out of a corner in which they find themselves. (“Just keep that generator online now!” even though the generator is scheduled for outage maintenance).
  • Real-time control operators are pushed into working increasingly outside their established bandwidths for operations, in effect having to work outside upper and lower bounds of competent performance.
  • Control room operators find that a chokepoint in their infrastructure (a major bottleneck that cannot be worked around) is failing adjacent to the chokepoint of another infrastructure with which it is functionally interconnected.
  • The decision rules operators reliably followed before are now reversed: “Prove we can launch” becomes “Prove we can’t launch” (Challenger accident); “Ensure a capital cushion to protect against unexpected losses” becomes “From now on, manage for expected losses” (2008 financial crisis).
  • Measurable real-time operational redesigns (workarounds) are no longer effective. Nor can systemwide patterns be recognized or what-if scenarios formulated with the same level of granularity as in the past.
  • Instead of being driven by wide social dread of having a next major failure ahead, control room professionals are told their track record up to now is the benchmark for reliability.

No one has the institutional niche and wherewithal to direct and sustain the nation’s attention on measuring and monitoring these real-time tipping points and transitions. The Academy would find it an easier task to cut through all the noise, including typical objections about control rooms and their operators, so as to augment, update and prioritize the indicators list.

POSSIBLE OBJECTIONS

In my judgment, the principal objections to an Academy would not be its cost or clout, both of which would be very real. Rather, the real objections originate in complaints from other disciplines in infrastructure development: “Control room operators aren’t really experts, like the engineers and economists with whom they work” and “Control rooms aren’t innovative; in fact, they’re the opposite.” (The latter misconception is addressed in the next section.)

Major cultural differences have plagued engineers and control room operators and, more recently, “Ops (Operations)” and “IT (Information Technology)” staff. One engineer we interviewed called those in the control room “neanderthals.” Economists and engineers assured us: Generally speaking, having to operate in unstudied conditions is a “risk” society must take in order to benefit from major technological advances.

Yet, control room operators continue to press for the specifics—What if this piece of new marketing software fails during the phasing out of those backups?—something we heard again and again, as one “go-live” date had to give way to another in an executive initiative to replace legacy systems in a major state control room. 

There are, of course, exceptions to such behavior. But no one reading should doubt the outsized importance of engineers, economists and system modelers relative to real-time system operators and wraparound support, at the center and in the field, when it comes to major infrastructure change and reform here.

The professional orientation of control operators to prevent systemwide failure is clearly orthogonal to disciplines and professions insisting it’s all but impossible to innovate if you’re not prepared to fail.

Equally telling, calls for new technologies and software to correct for “operator error” are routinely made (1) in the absence of calculations by economists of the everyday savings of disasters averted and (2) in spite of a system model focus on two states of operation, normal and failed, when it is during the intervening state of temporary service disruption that operators demonstrate their skills and use of indicators in restoring service. These and other differences in professional orientations would be treated far more constructively by a free-standing Academy.

CONTROL ROOMS AS CENTERS OF INNOVATION

Control operators, to the extent they are acknowledged for their expertise, have been disparaged as hidebound with a “don’t fix what’s already working” mentality. The reality is that, because things are not working in real time, control operators must innovate then and there to maintain system reliability and safety.

Three domains of control room innovation are core to the Academy’s mission:

1. Control rooms as unique centers of systemwide innovation and evolution

It is not sufficiently understood by engineers, economists and system modelers that infrastructure control rooms are an historically unique organizational formation. (Here as elsewhere, I thank my research colleague, Paul Schulman, for the insight.) They have evolved over time to take hard systemwide decisions under difficult societal conditions that require a decision, now.

In fact, the evolutionary advantage of control rooms lies in the skills, expertise and team situation awareness of their operators to redesign in real time what prove to be incomplete or otherwise defective technology, design and regulation. Moreover, meeting the high reliability mandate must be done so as not to threaten the limits of the system to operate as a whole. There are no guarantees here, but the expertise is required when “fool-proof” technology and designs are found, too frequently, to be otherwise.

The Academy would treat these specifically organizational and management practices, skills and core competencies with the priority and resources the nation deserves.

2. Importance of the reliability-matters test for other major technological innovations

It’s indisputable that innovations for infrastructures proposed by outside experts and consultants are required. To ensure viability, they must pass the reliability-matters test. Would the innovation, if implemented, reduce the task volatility that real-time operators face? Does it increase their options to respond to task volatility? Does it increase their maneuverability in responding to different, often unpredictable or uncontrollable, performance conditions?

Among many control room operators interviewed, I never met one who was against any innovation that increased options, reduced task volatility and/or increased performance maneuverability across changing conditions. I have, however, met economists, engineers and others who dismiss this reliability-matters test, as they also dismiss “only workarounds,” as proof of a control room’s “resistance to change.”

The Academy will not be able to stop the premature introduction of novel software and hardware into systemwide operations, but it can monitor their real-time management impacts and interconnected knock-on effects (as in the natural gas example and indicators list).

3. Control operators and support staff as innovators in systemwide risk assessment

Talk of “trade-offs” is ubiquitous when discussing new designs and technologies. Control operators and wraparound support see the real-time demands of their high reliability mandate along different lines. 

For them—as for the infrastructure-reliant public—reliability in real time becomes “non-fungible.” That is, high reliability can’t be traded off against cost or efficiency or whatever when the safe and continuous provision of the critical service matters: again, right now, without failure. No number of economists, engineers and system modelers insisting that reliability is “actually” a probability estimate of meeting a standard will change the real-time mandate that systemwide disasters must be prevented from ever happening.

Nuclear reactors must not blow up, urban water supplies must not be contaminated by cryptosporidium or worse, electric grids must not island, jumbo jets must not drop from the sky, irreplaceable dams must not breach or overtop, and autonomous underwater vessels must not hazard the very oil rigs they are repairing. That disasters can or do happen reinforces the dread and commitment of the public and control operators to this precluded-event standard.

The better practices for high reliability management developed and modified across runs of different cases and infrastructures would be the Academy’s principal subject. The Academy’s ambit would be worldwide in this regard and well beyond published best practices of professional societies and industry associations only.

Infrastructure mandates for managing and innovating reliably and safely are, in short, not going away. Nor can they, even when systems are necessarily smaller, more decentralized, less interconnected, and more sustainable. Those systems too will be managed as if people’s lives and livelihoods depend on it, because they do.

Principal sources

High Reliability Management and Reliability and Risk (2008 and 2016 respectively, from Stanford University Press and co-authored with Paul R. Schulman). A summary can be found in E. Roe and P. Schulman (2018). “A Reliability & Risk Framework for the Assessment and Management of System Risks in Critical Infrastructures with Central Control Rooms,” Safety Science 110 (Part C): 80-88.

For a shorter version of this blog, see “A National Academy of Reliable Infrastructure Management.” Issues in Science and Technology (August 3, 2021), accessed online at https://issues.org/national-academy-reliable-infrastructure-management-roe/
