Thinking infrastructurally about 7 major policy and management issues

Critical infrastructures are usually defined as those large-scale systems and physical assets so vital to society that their failures drastically undermine society and economy, in whole or major part.

But critical infrastructures are also a very useful lens through which to rethink topics of major importance like risk/uncertainty and low probability/high consequence events, or infrastructure fragility and market failure, or healthcare and cognitive reversals, or that ever-present worry, Big System Collapse. Below are seven (7) reconsiderations prompted by thinking infrastructurally.

1. Thinking infrastructurally about whole-cycle risk and uncertainty.

Think of an infrastructure as having an entire cycle of operations: normal operations, through disrupted operations and restoration back to normal or, if not, tripping over into failed operations, followed by emergency response (including efforts at initial service recovery), then asset and full service recovery, and on to a new normal (if there is to be one).

There are, of course, other ways to characterize the cycle or lifespan—for example, shouldn’t maintenance and repair be separated out of normal (routine) and disrupted (non-routine) operations?—but this segmentation from normal through to a new normal works for our purposes here.
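The stages above can be sketched as a simple state machine. This is only an illustrative sketch: the stage names follow the text, but the transition table is my own assumption about which stage can follow which, not a validated model.

```python
# Illustrative sketch of the whole cycle of infrastructure operations as a
# state machine. Stage names follow the text; the transition table is an
# assumption for illustration only.

TRANSITIONS = {
    "normal": {"disrupted"},
    "disrupted": {"restored", "failed"},
    "restored": {"normal"},
    "failed": {"emergency_response"},
    "emergency_response": {"recovery"},
    "recovery": {"new_normal"},
    "new_normal": {"disrupted"},  # the cycle continues, if there is a new normal
}

def is_valid_path(path):
    """Check that a sequence of stages follows the assumed transitions."""
    return all(b in TRANSITIONS[a] for a, b in zip(path, path[1:]))
```

For instance, normal through disruption and restoration back to normal is a valid path, while jumping straight from normal to failed (skipping disruption) is not, under these assumed transitions.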

I want to suggest that “risk and uncertainty” vary both in type and degree across these different stages of infrastructure operations. In normal, disrupted and restoration operations, we observed infrastructure control room operators worrying about management risks due to complacency, misjudgment, or exhausting options. When infrastructures fail, the management risks and uncertainties are very different.

The cause-and-effect relationships of normal, disrupted and restored operations are rendered moot when “operating blind” in failure. What was cause-and-effect is replaced in failure by nonmeasurable uncertainties accompanied by disproportionate impacts, with no presumption that causation (let alone correlation) is any clearer in that conjuncture. Further, even when there is urgency, clarity and logic in immediate emergency response, that in no way obviates the need for impromptu improvisations and unpredicted, let alone hitherto unimagined, shifts in human and technical interconnectivities as system failure unfolds.

As for system recovery, in earlier research the control room operators we interviewed (during their normal operations) spoke of the probability of failure being even higher in recovery than during usual times. Had we interviewed them in an actual system failure, their having to energize or re-pressurize line by line might well have been described in far more demanding terms of operating in the blind, working on the fly and riding uncertainty, full of improvisation.

In short, risk and uncertainty are to be distinguished comparatively across the different stages of an infrastructure’s lifespan operations. Once we recognize that the conventional notion of infrastructures having only two states, normal and failed, is grotesquely underspecified for empirical work, the whole-cycle comparisons of different understandings of infrastructure risk and uncertainty become far more rewarding.

For example, I believe that what separates the risks and uncertainties of longer-term recovery from those found in a new normal is whether or not the infrastructures have adopted new standards for their high reliability management. Endless recovery is trying to catch up to some kind of reliability and safety standards; a new normal is managing to standards and to the risks that follow from managing to those standards.

This may or may not take the form of earlier, old-normal standards that sought to prevent specific types of failure from ever happening. We now know that major distributed internet systems, increasingly viewed as critical infrastructures, are reliable precisely because they expect components to fail and are better prepared for that eventuality, along with other contingencies. Each component must be able to fail, individually, for the system as a whole to be reliable, unlike systems whose management is geared to ensuring that some components never fail.

More can be said, but let me leave you with those who insist “the new normal” is at best endless recovery, in which coping with risk and uncertainty far outweighs proactively managing them. There are of course no guarantees in the whole cycle, but at least its format doesn’t, for example, miss Dresden-now by stopping the cycle at the highly controversial Allied bombings and devastation of 1945. In case it also needs saying, a new normal, if there is one, brings with it dependencies that are both positive and negative.

2. Thinking infrastructurally about low-probability, high-consequence events.

Return to having to operate blindly and on the fly in widespread infrastructure failure, where cause-and-effect scenarios most often found in normal operations have given way to being confronted by all manner of nonmeasurable uncertainties and disproportionate impacts, none of which seem obviously cause-and-effect.

The point is that both nonmeasurability and disproportionality still convey important information for infrastructure operations during and after the disaster. This information is especially significant when causal understanding is most obscure(d).

When experienced emergency managers find themselves in the stages of systemwide infrastructure failure and immediate emergency response, the nonmeasurability of uncertainties and disproportionality of impacts tell them to prepare for and be ready to improvise, irrespective of what formal playbooks and plans have set out beforehand.

“Coping with risk” is highly misleading when an important part of that “coping” is proactive improvisation in response to infrastructure failures that unfold in ways well beyond predicting or imagining a “low probability and high consequence event.”

3. Thinking infrastructurally about fragility of large socio-technical systems.

The last thing most people think is that infrastructures are fragile. If anything, they are massive structures, where “heavy” and “sturdy” come to mind. But the fact that they not only fail in systemwide disasters, but that they also require routine (and nonroutine) maintenance and repair as they depreciate over their lifespans, requires us to take the fragility features seriously.

Fortunately, there are those who write on infrastructure fragility from a broadly socio-cultural perspective rather than the socio-technical one with which I am familiar:

For all of their impressive heaviness, infrastructures are, at the end of the day, often remarkably light and fragile creatures—one or two missed inspections, suspect data points, or broken connectors from disaster. That spectacular failure is not continually engulfing the systems around us is a function of repair: the ongoing work by which “order and meaning in complex sociotechnical systems are maintained and transformed, human value is preserved and extended, and the complicated work of fitting to the varied circumstances of organizations, systems, and lives is accomplished” . . . .

It reminds us of the extent to which infrastructures are earned and re-earned on an ongoing, often daily, basis. It also reminds us (modernist obsessions notwithstanding) that staying power, and not just change, demands explanation. Even if we ignore this fact and the work that it indexes when we talk about infrastructure, the work nonetheless goes on. Where it does not, the ineluctable pull of decay and decline sets in and infrastructures enter the long or short spiral into entropy that—if untended—is their natural fate.

Jackson S (2015) Repair. In: Theorizing the Contemporary: The Infrastructure Toolbox. Cultural Anthropology website.

The nod to “sociotechnical systems” is welcome, as is the recognition that these systems have to be managed–a great part of which is repair and maintenance–in order to operate. Added to routine and non-routine maintenance and repair are the just-in-time or just-for-now workarounds (software and hardware) necessitated by those inevitable technology, design and regulatory glitches–inevitable because comprehensiveness is impossible to achieve in complex large-scale systems.

This better-than-expected operation (beyond what design and technology alone provide) is not only because of repair and maintenance. It is also because real-time system operators have to actively manage in order to preclude must-never-happen events like loss of nuclear containment, cryptosporidium contamination of urban water supplies, or jumbo jets dropping like flies from the sky. That these events do from time to time happen only increases the widespread affective dread that they must not happen again.

What to my knowledge has not been pursued in the socio-technical literature is that specific focus on repair:

Attending to repair can also change how we approach questions of value and valuation as it pertains to the infrastructures around us. Repair reminds us that the loop between infrastructure, value, and meaning is never fully closed at points of design, but represents an ongoing and sometimes fragile accomplishment. While artifacts surely have politics (or can), those politics are rarely frozen at the moment of design, instead unfolding across the lifespan of the infrastructure in question: completed, tweaked, and sometimes transformed through repair. Thus, if there are values in design there are also values in repair—and good ethical and political reasons to attend not only to the birth of infrastructures, but also to their care and feeding over time.

That the values expressed through repair (we would say, expressed as the practices of actual repair) need to be understood as thoroughly as the practices of actual design reflects, I believe, a major research gap in the socio-technical literature with which I am familiar.

Finally, I cannot over-stress the importance of this notion of infrastructure fragility, contrary to any sturdy-monolith imaginary one might have. One can only hope, by way of example, that wind energy infrastructure being imposed by the Morocco-Siemens occupiers of Western Sahara is so fragile as to saddle them with endless, massive and costly repair and maintenance–but I confess that is my management take from a socio-technical perspective.

4. Thinking infrastructurally about the market failure economists don’t talk about.

Economists tell us there are four principal types of market failure: public goods, externalities, asymmetric information, and market power. They do not talk about the fifth type, the one where efficient markets actually cause market failure by destroying the infrastructure underlying and stabilizing markets and their allocative activities.

Consider here the 2010 flash crash of the U.S. stock market. Subsequent investigations found that market transactions happened so quickly and were so numerous under conditions of high-frequency trading and collocated servers that a point came when no liquidity was left to meet proffered transactions. Liquidity dried up and with it, price discovery. “Liquidity in a high-speed world is not a given: market design and market structure must ensure that liquidity provision arises continuously in a highly fragmented, highly interconnected trading environment,” as a report by the Commodity Futures Trading Commission (CFTC) put it after the crash. Here, efficiencies realized through high transaction speeds worked against a market infrastructure that would have operated reliably otherwise.

The economist will counter by insisting, “Obviously the market was not efficient because the full costs of reliability were not internalized.” But my point remains: the standard accounts of market failure under normal conditions of efficiency say nothing about anything so fundamental as the infrastructure reliability that is foundational to economic efficiency.

The research challenge is to identify under what conditions the fifth market failure arises empirically. Until that is done, the better part of wisdom—the better part of government regulation—would be to assume fully efficient markets are low-performance markets when the stabilizing market infrastructure underlying them is prone to this type of market failure.

But what, then, is “prone”? Low-performing market infrastructure results from the vigorous pursuit of self-interest and efficiencies that hobbles real-time market infrastructure operators in choosing strategies for the longer-term high reliability of that infrastructure.

There is another way to put the point: High reliability management of critical infrastructures does not mean those infrastructures are to run at 100% full capacity. Quite the reverse. High reliability requires that the respective infrastructures not work full throttle: Positive redundancy or fallback assets and options—what economists misidentify as “excess capacity”—are needed in case of sudden loss of running assets and facilities, the loss of which would threaten infrastructure-wide reliability and, with it, price discovery. To accept that “every system is stretched to operate at its capacity” may well be the worst threat to an infrastructure and its economic contributions.
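The capacity arithmetic behind positive redundancy can be made concrete with a back-of-the-envelope check. The numbers below are hypothetical; the point is only that a system run below full capacity can survive the sudden loss of its largest asset, while one stretched to full throttle cannot.

```python
# Back-of-the-envelope illustration of "positive redundancy": can the
# remaining assets still meet demand after the sudden loss of the single
# largest asset? (An N-1 style check; asset figures are hypothetical.)

def survives_largest_loss(asset_capacities, demand):
    """True if the system can still meet demand after losing its
    largest single asset."""
    remaining = sum(asset_capacities) - max(asset_capacities)
    return remaining >= demand

assets = [100, 100, 100]  # three hypothetical assets, 300 units total
print(survives_largest_loss(assets, 200))  # slack held in reserve -> True
print(survives_largest_loss(assets, 300))  # run at full capacity -> False
```

On these assumed numbers, the “excess” 100 units are exactly what keeps the system reliable when one asset drops out, which is why full capacity is a misleading benchmark for performance.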

In this view, critical infrastructures are economically most reliably productive when full capacity is not the long-term operating goal. Where so, efficiency no longer serves as a benchmark for economic performance. Rather, we must expect the gap between actual capacity and full capacity in the economy to be greater under a high reliability standard, where the follow-on impacts for the allocation and distribution of services are investments in having a long term.

5. Thinking infrastructurally about healthcare.

The US Department of Homeland Security states that healthcare is one of the nation’s critical infrastructure sectors, along with others like large-scale water and energy supplies.

Infrastructures, however, vary considerably in their mandates to provide vital services safely and continuously. The energy infrastructure differs depending on whether it is for electricity or natural gas, and both differ from large-scale water supplies (I’ve studied all three). Yet the infrastructures for water and energy, with their central control rooms, are more similar to each other than to, say, education or healthcare, which lack such centralized operations centers.

What would healthcare look like if it were managed more like other infrastructures that have centralized control rooms and systems, such as those for water and energy? Might the high reliability of infrastructural elements within the healthcare sector be a major way to better ensure patient safety?

Four points are raised by way of answer:

(1) High reliability theory and practice suggest that the manufacture of vaccines and compounds, by way of example, can be made reliable and safe, at least up to the point of injection. Failure in those back-end processes is exceptionally notable—as in the fungal meningitis contamination at the New England Compounding Center—because failure is preventable.

When the perspective is on medical error, the patient is at the center of the so-called sharp-end of the healthcare system. But healthcare reliability is a set of processes that includes the capacities and performance of upstream and wraparound organizations. When dominated by considerations of the sharp-end, we overlook—at our peril—the strong-end of healthcare with its backward linkages for producing medicines and treatments reliably and safely.

(2) If healthcare were an infrastructure more like those with centralized control centers, the criticality and centrality of societal dread in driving reliable service provision would be dramatically underscored.

Yet, aside from that special and important case of public health emergencies (think the COVID-19 pandemic), civic attitudes toward health and medical safety lack the public dread we find to be the key foundation of support for the level of reliability pursued in other infrastructures, such as nuclear power and commercial aviation.

The commission of medical errors hasn’t generated the level of public dread associated with nuclear meltdowns or jumbo jetliners dropping from the air. Medical errors, along with fires in medical facilities, are often “should-never-happen events,” not “must-never-happen events.”

What would generate the widespread societal dread needed to produce “must-never-happen” behavior? Answer: Getting medical treatment kills or maims you unless managed reliably and safely.

(3) Asking how a reliable and safe healthcare system encourages a more reliable healthcare consumer is akin to asking how a reliable grid or water supply encourages the electricity or water consumer to be energy or water conscious. Presumably, the movement to bring real-time monitoring healthcare technology into the patient’s home is increasingly part of that calculus.

(4) In all this focus on the patient, it mustn’t be forgotten that there are healthcare control rooms beyond those of manufacturers of medicines mentioned above: Think most immediately of the pharmacy systems inside and outside hospitals and their pharmacists/prescriptionists as reliability professionals.

One final point from an infrastructure perspective when it comes to healthcare risks and uncertainties. Can we find systemically interconnected healthcare providers so critical in the US that they could bring the healthcare sector down (say, as when the 12 systemically interconnected banking institutions came under threat during the 2008 financial crisis)? If so, we would have a healthcare sector in need of “stress tests” for systemic risks just as post-2008 financial services institutions had to undergo.

6. Thinking infrastructurally about cognitive reversals.

What else can we do, senior executives and company boards tell themselves, when business is entirely on the line? We have to risk failure in order to succeed! But what if the business is one of the many critical infrastructures privately owned or managed?

Here, if upper management seeks to implement risk-taking changes, they rely on middle-level reliability professionals, who, when they take risks, do so in order to reduce the chances of systemwide failure. To reliability-seeking professionals, the risk-taking activities of their upper management look like a form of suicide for fear of death.

When professionals are compelled to reverse practices they know and find to be reliable, the results have been deadly:

• Famously in the Challenger accident, engineers had been required up to the day of that flight to show why the shuttle could launch; on that day, the decision rule was reversed to one showing why launch couldn’t take place.

• Once it was good bank practice to hold capital as a cushion against unexpected losses; capital security arrangements now mandate that banks hold capital against losses expected from their high-risk lending. Mortgage brokers traditionally made money on the performance and quality of the mortgages they made; in the run-up to the 2008 financial crisis, their compensation changed to one based on the volume of loans originated and passed on.

• Originally, the Deepwater Horizon rig had been drilling an exploration well; that status changed when on April 15, 2010 BP applied to the U.S. Minerals Management Service (MMS) to convert the site to a production well. The MMS approved the change. The explosion occurred five days later.

In brief, decision-rule reversals have led to system failures and more: NASA was never the same; we are still trying to get out of the 2008 financial mess and the Great Recession that followed; and the MMS disappeared from the face of the earth.

“But, that’s a strawman,” you counter. “Of course, we wouldn’t deliberately push reliability professionals into unstudied conditions, if we could avoid it.” Really?

The oft-recommended approach, Be-Prepared-for-All-Hazards, looks at first like the counsel of wisdom. It is, however, dangerous when it requires emergency and related organizations to cooperate in ways they currently cannot, using information they will not have or cannot obtain, across all manner of interconnected scenarios which, if treated with equal seriousness, produce considerable modeling and analytic uncertainties, let alone really-existing impracticalities.

7. Thinking infrastructurally about Big System Collapse.

Here are early warning signals—typically not recognized—that those major critical infrastructures upon which we survive are in fact operating at, or beyond, their performance edges:

–The infrastructure’s control room is in prolonged just-for-now performance. This means operators find it more difficult to maneuver out of a corner in which they find themselves. (“Yes, yes, I know this is risky, but just keep it online for now!”)

–The real-time control operators are working outside their official or unofficial bandwidths for performance—in effect having to work outside their unique domain of competence.

–The decision rules operators reliably followed before are turned inside out: “Prove we can do that” becomes “Prove we can’t.”

–Real-time operational redesigns (workarounds) by control room operators of inevitably defective equipment, premature software, and incomplete procedures are no longer as effective as before.

–Their control room skills as professionals in identifying systemwide patterns and undertaking what-if scenarios become attenuated or no longer hold.

–Instead of being driven by dread of the next major failure, control room professionals are told that their track record up to now is to be the benchmark for system reliability ahead.

I have yet to come across these as key indicators of infrastructure and big system collapse in the literature I’ve read.

Principal sources: Excerpted and revised from previous blog entries.
