A whole-cycle approach to infrastructure risk and uncertainty

–Taxonomies of risk and uncertainty are all but irresistible. Assuming that the probabilities and consequences of failure are independent, and that you can differentiate known from unknown values of each, is a great starting heuristic, especially for seeing what makes unknown unknowns so comparatively special. I will continue to use this taxonomy.

There are, of course, other ways to think about risk and uncertainty relevant for major policy and management. I’d like to introduce one of those here, namely, a whole-cycle approach to risk and uncertainty in society’s critical infrastructures.


–The terms risk and uncertainty are used all the time by real-time infrastructure operators without meaning or referring to “expert probability estimates,” be those estimates Bayesian, frequency-based or otherwise. But their operational usages of risk and uncertainty differ depending on where the operators are in the cycle of infrastructure operations and on the standards of effective management at those stages.

–In critical infrastructures that are managed with high reliability (i.e., safe and continuous provision of the critical service, even during–especially during–turbulent times), the types of risks to be managed follow from the standard of high reliability being managed to: Certain events have to be prevented from ever happening.

This means that the risks arising out of becoming complacent, or having too many balls in the air at once, or backing the control room into a corner, rise to the fore and must be managed in real time. The temptation is to quibble about whether the precluded-events standard of reliability is deterministic or “really” probabilistic, but the crux here is the control room knowing as much as possible about cause-and-effect, tacitly and otherwise, during these operations.


–Infrastructures, however, can and do fail systemwide, even if not as often as outsiders seem to expect. A complex socio-technical system in failure differs vastly from that system in normal operations under standards of high reliability management. This means infrastructure risks and uncertainty also differ vastly when the infrastructure is in systemwide failure. For example, in earlier research, control room operators we interviewed (during their normal operations) spoke of the probability of failure being even higher in recovery than during usual times. Had we interviewed them in an actual system failure, their having to energize or re-pressurize line by line might have been described in far more demanding terms: operating in the blind, working on the fly and riding uncertainty.

Note the phrase “more demanding”; it is not “the estimated risk of failure in recovery is now numerically higher.” It is more demanding because the cause-and-effect of normal operations is moot when “operating blind” in failure. If there is urgency, clarity and logic in immediate emergency response, that in no way obviates the need for impromptu improvisations and unpredicted, let alone hitherto unimagined, shifts in human and technical interconnectivities as system failure unfolds.

–What was cause-and-effect is now replaced by nonmeasurable uncertainties accompanied by disproportionate impacts, with no presumption that causation (let alone correlation) is any clearer in that conjuncture. What had been the high reliability standard of precluded events is replaced by a requisite-variety standard of effective emergency response: then-and-there task demands are matched by then-and-there resource capabilities, even if only temporarily. Trade-offs are everywhere in infrastructure failure, and they differ considerably from those in normal operations, where systemwide reliability and safety cannot be traded off without jeopardizing the entire system and its users.


–In short, instead of different types of risk and uncertainty being compared by virtue of an overarching taxonomy, risk and uncertainty here are distinguished comparatively in terms of an infrastructure’s different stages of operations. Once we also understand that the conventional notion that infrastructures have only two states–normal and failed–is grotesquely underspecified for empirical work, whole-cycle comparisons of different understandings of infrastructure risk and uncertainty become far more rewarding.

For example–and this may be too simple for other cases–assume a major infrastructure has witnessed operations that were normal, disrupted, restored back to normal or tripped into outright failure; immediate response when failed (e.g., saving lives); followed by restoration of backbone services (electricity, water, telecoms); then longer-term recovery of destroyed assets (involving far more stakeholders and trade-offs); and afterwards the establishment of a new normal, if there is to be one.
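The stage sequence just described can be sketched, very loosely, as a set of states with plausible transitions. The stage names and the transition set below are my own illustrative reading of the paragraph, not the author’s formalization, and real cycles are messier and need not be linear:

```python
from enum import Enum, auto

class Stage(Enum):
    # Stage labels are illustrative, drawn from the paragraph above.
    NORMAL = auto()
    DISRUPTED = auto()
    RESTORED = auto()
    FAILED = auto()
    EMERGENCY_RESPONSE = auto()
    BACKBONE_RESTORATION = auto()
    LONGER_TERM_RECOVERY = auto()
    NEW_NORMAL = auto()

# Plausible transitions implied by the text; an assumption, not a claim
# that infrastructures must move through stages in this order.
TRANSITIONS = {
    Stage.NORMAL: {Stage.DISRUPTED},
    Stage.DISRUPTED: {Stage.RESTORED, Stage.FAILED},
    Stage.RESTORED: {Stage.NORMAL},
    Stage.FAILED: {Stage.EMERGENCY_RESPONSE},
    Stage.EMERGENCY_RESPONSE: {Stage.BACKBONE_RESTORATION},
    Stage.BACKBONE_RESTORATION: {Stage.LONGER_TERM_RECOVERY},
    Stage.LONGER_TERM_RECOVERY: {Stage.NEW_NORMAL},
    Stage.NEW_NORMAL: set(),  # "if there is to be one"
}

def can_move(a: Stage, b: Stage) -> bool:
    """True if stage b is a plausible next stage after a."""
    return b in TRANSITIONS[a]
```

The point of the sketch is only that risk and uncertainty take different forms in each state, so empirical work needs more than the two states “normal” and “failed.”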

It is my belief that what truly separates the risks and uncertainties of longer-term recovery from those found in a new normal isn’t that, e.g., the politics have changed, but rather whether and when infrastructures adopt new standards for their high reliability management.

This may or may not take the form of different standards seeking to prevent specific types of failure from ever happening. We already know that major distributed internet systems, now considered critical, are reliable precisely because they expect components to fail and are better prepared for that and other contingencies. Here each component should be able to fail in order for the system to be reliable, unlike systems where management is geared to ensuring some components never fail.
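The design-for-failure contrast can be made concrete with a minimal sketch, under my own assumptions (the replica setup and failure rates are hypothetical, not drawn from any particular system): each component is allowed to fail, yet the system as a whole stays reliable because any one success is enough.

```python
import random

def make_replica(fail_rate: float, value: str):
    """A hypothetical component that fails some fraction of the time, by design."""
    def replica():
        if random.random() < fail_rate:
            raise ConnectionError("replica down")
        return value
    return replica

def reliable_read(replicas):
    """System-level reliability despite component failure:
    try replicas in turn; any single success serves the request."""
    for r in replicas:
        try:
            return r()
        except ConnectionError:
            continue  # expected: individual components may fail
    raise RuntimeError("all replicas failed")

random.seed(0)  # fixed seed so the sketch is reproducible
replicas = [make_replica(0.5, "ok") for _ in range(5)]
print(reliable_read(replicas))
```

With five independent replicas that each fail half the time, the whole read fails only when all five do; the contrast is with systems managed so that certain single components must never fail at all.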


More has to be said, but let me leave you with a concern, namely, commentators who assume “the new normal” is at best endless recovery, with far more coping than proactive managing. There are of course no guarantees in the whole cycle, but at least its format doesn’t, e.g., miss Dresden-now by stopping the cycle at the highly controversial Allied bombings and devastation of 1945.

Also, in case it needs saying, new-normal high reliability brings with it dependencies that are both positive and negative, then and thereafter in operations. If you insist that all such dependencies are vulnerabilities, then you have to explain why people are pulled, not pushed, to vulnerabilities, and what the counterfactual would be instead.

Principal sources: See also the earlier blogs, “Recasting ‘low probability, high consequence events'” and “Ongoing disasters, resilience and governance: really?”
