A whole-cycle approach to infrastructure risk and uncertainty

I

The terms risk and uncertainty are used all the time by real-time infrastructure operators without meaning or referring to “expert probability estimates,” whether Bayesian, frequency-based or otherwise. But their operational usages of risk and uncertainty differ depending on where the operators are in the cycle of infrastructure operations and on the standards of effective management at those stages.

In critical infrastructures that are managed with high reliability (i.e., safe and continuous provision of the critical service, even during–especially during–turbulent times), the types of risks to be managed follow from the standard of reliability being managed to: Certain events have to be prevented from ever happening.

This means that the risks arising from becoming complacent, from having to decide with too many balls in the air at once, or from backing the control room into a corner come to the fore and must be managed in real time. The temptation is to quibble about whether the precluded-events standard of reliability is deterministic or “really” probabilistic, but the crux here is that control room operators know as much as possible about cause-and-effect, tacitly and otherwise, during these operations.

II

Infrastructures, however, can and do fail systemwide, though not as often as outsiders seem to expect.

A complex socio-technical system in failure differs vastly from that system in normal operations under standards of high reliability management. This means infrastructure risks and uncertainties also differ vastly when the infrastructure is in systemwide failure. For example, in earlier research, control room operators we interviewed (during their normal operations) spoke of the probability of failure being even higher in recovery than during usual times. Had we interviewed them in an actual system failure, their having to energize or re-pressurize line by line might have been described in the far more demanding terms of operating in the blind, working on the fly and riding uncertainty.

Note the phrase “more demanding”; it is not “the estimated risk of failure in recovery is now numerically higher.” It is more demanding because the cause-and-effect of normal operations is moot when “operating blind” in failure. Whatever urgency, clarity and logic there are in immediate emergency response, they in no way obviate the need for impromptu improvisations and for unpredicted, let alone hitherto unimagined, shifts in human and technical interconnectivities as system failure unfolds.

This means that what had been cause-and-effect is now replaced by nonmeasurable uncertainties accompanied by disproportionate impacts, with no presumption that causation (let alone correlation) is any clearer in that conjuncture. What had been the high reliability standard of precluded events has been replaced by a requisite variety standard of effective emergency response: then-and-there task demands are matched by then-and-there resource capabilities, even if only temporarily. Trade-offs are everywhere in infrastructure failure and differ considerably from those in normal operations, where systemwide reliability and safety cannot be traded off without jeopardizing the entire system and its users.

III

In short, risk and uncertainty are to be distinguished comparatively in terms of the different stages of an infrastructure’s operations. Once we also understand that the conventional notion that infrastructures have only two states–normal and failed–is grotesquely underspecified for empirical work, the whole-cycle comparisons of different understandings of infrastructure risk and uncertainty become far more central and rewarding.

For example–this may be too simple for some cases–assume a major infrastructure has witnessed operations that were normal, disrupted, restored back to normal or tripped into outright failure, immediately responded to when failed (e.g., saving lives), followed by restoration of backbone services (electricity, water, telecoms), then into longer term recovery of destroyed assets (involving more and different stakeholders and trade-offs), and afterwards the establishment of a new normal, if there is to be one.

It is my belief that what truly separates the risks and uncertainties of longer-term recovery from those found in a new normal isn’t that, e.g., the politics and conflicts have altered, but rather whether and when infrastructures adopt new standards for their reliability management.

This may (or may not) take the form of different standards seeking to prevent specific types of failures from ever happening. We already know that major distributed internet systems, now considered critical, are reliable because they expect components to fail and are better prepared for that and other contingencies. Here each component should be able to fail in order for the system to be reliable, unlike systems where management is geared to ensuring some components never fail.
