Thinking infrastructurally about risk and uncertainty

–The terms risk and uncertainty are used all the time by real-time infrastructure operators without meaning or referring to “expert probability estimates,” be they Bayesian, based on frequencies, or recast as threats, vulnerabilities and exposure. But the operational usages of risk and uncertainty differ depending on where the operators are in the cycle of infrastructure operations and on the standards of effective management at those stages.

For example, control room operators we interviewed (during their normal operations) spoke of the probability of failure being even higher in recovery than during usual times. Had we interviewed them in an actual system failure, their having to energize or re-pressurize line by line would have been described in far more demanding terms of operating in the blind, working on the fly and riding uncertainty.

–Note the phrase, “more demanding;” it is not “the estimated risk of failure in recovery is now numerically higher.”

It is more demanding because the cause-and-effect of normal operations is moot when “operating blind” (their term) in failure. What had been cause-and-effect is now replaced by nonmeasurable uncertainties accompanied by disproportionate impacts, with no presumption that causation (let alone correlation) is any clearer in that conjuncture.

What may have been the high reliability standard of preventing certain disasters from ever happening has now been replaced by a requisite variety standard of effective emergency response, that is, then-and-there task demands are matched by then-and-there resource capabilities, even if only temporarily. It is true that there are urgency, clarity and logic in immediate response after failure, but they in no way obviate the need for impromptu improvisations and unpredicted, let alone hitherto unimagined, shifts in human and technical interconnectivities as system failure unfolds.


Once we understand that the conventional notion that infrastructures have only two states–normal and failed–is grotesquely underspecified for empirical work, the whole-cycle comparisons of different understandings of infrastructure risk and uncertainty become far more central and rewarding.

Assume a major infrastructure has witnessed systemwide operations that were normal, disrupted, restored back to normal or tripped into outright failure, immediately responded to when failed (e.g., saving lives), followed by restoration of backbone services (electricity, water, telecoms), then into longer term recovery of destroyed assets (involving more and different stakeholders and trade-offs), and afterwards the establishment of a new normal, if there is to be one.

It is my belief that what truly separates the risks and uncertainties of longer-term recovery from those found in a new normal is not that, e.g., the politics and conflicts have altered, but rather whether and when infrastructures adopt new standards for their reliability management.

This may (or may not) take the form of different standards seeking to prevent specific types of events from ever happening. We already know that major distributed internet systems, now considered critical, are reliable because they expect components to fail and are better prepared for that and other contingencies. Here each component should be able to fail in order for the system to be reliable, unlike systems where management is geared to ensuring some components never fail.


More has to be said, but let me leave you with a worry, namely, those commentators who assume “the new normal” is at best endless attempts at repair, where coping is the order of the day and managing for recovery is no longer possible (if only because of management’s unintended consequences and the economics of coping).

From a whole-cycle approach, this reductionism is premature and thus exaggerated. In the first place, how can you have “proper pricing of risk,” if you don’t know the socio-technical system to be managed across its states of operation, the reliability standard to which it is to be managed then and there, and the risks and uncertainties entailed by subscribing to that standard for those systems? In the second place, there are of course no guarantees that the whole cycle will be spanned, but at least its format doesn’t, e.g., miss Dresden-now by stopping time at its 1945 devastation.
