- When defining the risk of failure as the product of the probability of failure (Pf) and the consequences of failure (Cf), Pf and Cf are NOT independent of each other, as conventional risk analysis would have it.
a. Both are connected indirectly by the “intervening variable” of their shared failure scenario: it is Pf and Cf with respect to the same failure scenario.
b. Further, the more granular the failure scenario, the more likely that Pfs and Cfs are directly interconnected. In the case of interinfrastructural cascades, one consequence of infrastructure1 failing (Cf1) may be to increase infrastructure2’s probability of failure (Pf2). (We also know that cognitive biases are such that estimating one affects estimating the other, but we leave that “interconnection” for others to expand upon.)
c. Such is why a risk estimate must never be confused with a prediction, i.e., the claim that, if the risk is left unattended, failure is only a matter of time. Even were Pf and Cf not interconnected, the efficacy of prediction depends on how detailed the with-respect-to scenario is. The function of the failure scenario is to identify and detail (if not isolate) the conditions of cause and effect upon which prediction is or is not made possible.
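The interconnection in (b) can be made concrete with a small Monte Carlo sketch. All probabilities, the coupling factor, and the consequence values below are hypothetical, chosen only to illustrate how treating Pf and Cf as independent understates risk when one infrastructure’s failure consequence (Cf1) raises another infrastructure’s probability of failure (Pf2):

```python
import random

random.seed(0)

# Hypothetical numbers, for illustration only.
PF1 = 0.05          # baseline probability infrastructure1 fails
PF2_BASE = 0.02     # baseline probability infrastructure2 fails
PF2_CASCADE = 0.30  # Pf2 given infrastructure1 has failed (Cf1 raises Pf2)
CF1, CF2 = 10.0, 40.0  # consequence magnitudes (arbitrary units)

TRIALS = 200_000

# Conventional estimate: Pf and Cf treated as independent per infrastructure.
independent_risk = PF1 * CF1 + PF2_BASE * CF2

# Coupled estimate: simulate the cascade explicitly.
total = 0.0
for _ in range(TRIALS):
    loss = 0.0
    f1 = random.random() < PF1
    if f1:
        loss += CF1
    pf2 = PF2_CASCADE if f1 else PF2_BASE  # cascade: Cf1 shifts Pf2
    if random.random() < pf2:
        loss += CF2
    total += loss
coupled_risk = total / TRIALS

print(f"independent assumption: {independent_risk:.3f}")
print(f"with cascade coupling:  {coupled_risk:.3f}")
```

With these illustrative numbers, the expected loss under the independence assumption is 1.3, while the cascade-aware simulation converges near 1.86: the same Pfs and Cfs, but a materially different risk once their interconnection is represented.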
- Less rather than more granularity in the failure scenario may also mean fewer criteria for what qualifies as “effectiveness” in normal operations under conditions of high turbulence.
a. We have argued, for instance, that an explosion at a gas line section in a utility’s natural gas transmission system has to be analyzed in terms of its consequences at the systemwide and intersystem levels as well. (The same could be said for fires induced by a utility’s electricity transmission system.)
i. It may be that the natural gas system operated reliably at the systemwide level, while the infrastructures that depended on natural gas provision also operated reliably during the explosion/fire.
ii. The negative consequences of the explosion are, in other words, offset by the positive consequences of maintaining systemwide reliability and intersystem dependencies.
b. The point here is that a failure scenario focused exclusively at the site level within a system can miss scenarios (and related criteria) for maintaining normal operations at the systemwide and intersystem levels under disturbance conditions.
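The offsetting logic in (i) and (ii) can be sketched as a toy consequence tally. The level labels and scores below are hypothetical, meant only to show how a site-level appraisal and a multi-level appraisal of the same explosion can diverge:

```python
# Hypothetical consequence scores (negative = harm, positive = reliability
# maintained) for the same gas line explosion, assessed at three levels.
consequences = {
    "site": {"explosion and local damage": -8.0},
    "systemwide": {"gas transmission kept operating": +5.0},
    "intersystem": {"dependent infrastructures kept supplied": +4.0},
}

site_only = sum(consequences["site"].values())
all_levels = sum(v for level in consequences.values() for v in level.values())

print(f"site-level appraisal only: {site_only:+.1f}")
print(f"across all three levels:   {all_levels:+.1f}")
```

A scenario scored only at the site comes out wholly negative; scored across levels, the same event registers the offsetting positive consequences of maintained systemwide and intersystem reliability.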
- Identifying risk(s) in the absence of first defining the operational system and the reliability standard(s) being managed to leaves no stopping rule for the possible failure scenarios and types of risks/uncertainties that matter.
a. Accordingly, all manner of things end up posing risks and uncertainties, e.g.
…different assets; multiple lines of business; with respect to system capacity, controls and marketing factors; in terms of the risks’ time-dependence versus independence; in terms of the risks associated with emergency work as distinct from planned work; investment risks versus operational ones; risks with respect not only to system safety and reliability, but also organizationally in terms of financial risk and regulatorily in terms of risks of non-compliance… ad infinitum.
After a point, it must become an open question how managing all these risks and uncertainties (along with many, many more) contributes to the control room operating the system reliably in real time. (This leaves aside the very vexed issue of risk management not being the same as safety management, i.e., just because you reduce risk does not mean you have improved safety.)
b. This lack of a stopping rule for which failure scenarios to worry about is itself a hazard, or its own failure scenario, when it discourages (further) thinking through and acting on failure scenarios about which more is known and can be done. When we asked infrastructure interviewees about the “nightmares that keep them awake at night,” they identified not only measurable risks and noncalculable uncertainties with respect to specified failure scenarios but also the fact that those scenarios were part of a limitless set of possibilities for what could go dangerously wrong.
What does this all add up to?
First, the probabilities and consequences (Pf and Cf) of large system failure are often underestimated: (1) the measured estimates of Pf do not adequately address important nonmeasurable uncertainties, including those related to real-time system management and (2) there are so many more failure modes than the conventional scenarios (e.g., earthquake or flood) assume.
What is not sufficiently recognized is that the identification of which uncertainties and failure modes matter must be narrowed down to those relevant to meeting and realizing the reliability standard(s) that govern what counts as relevant risk and uncertainty.
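A toy calculation of point (2): if failure modes are roughly independent, the systemwide Pf is one minus the product of each mode’s survival probability, so every omitted mode biases the estimate downward. The mode names and annual probabilities below are hypothetical:

```python
# Hypothetical annual failure-mode probabilities; values are illustrative only.
conventional = {"earthquake": 0.01, "flood": 0.02}
additional = {"operator overload": 0.015, "vendor software fault": 0.01,
              "intersystem cascade": 0.02}

def combined_pf(modes):
    """P(at least one mode occurs), assuming the modes are independent."""
    p_none = 1.0
    for p in modes.values():
        p_none *= 1.0 - p
    return 1.0 - p_none

pf_conventional = combined_pf(conventional)
pf_all = combined_pf({**conventional, **additional})

print(f"conventional scenarios only: {pf_conventional:.4f}")
print(f"with additional modes:       {pf_all:.4f}")
```

Here the conventional-scenario estimate is roughly 0.03, while counting the additional modes more than doubles it to roughly 0.07, before any of the nonmeasurable uncertainties in (1) are even considered.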
Second, the danger in focusing on interconnected probabilities and consequences of failure that have been sundered for methodological purposes is that, stranded at your cognitive limits, you do not recognize those cases that are little more than one-off contingencies with disproportionate effects. Indeed, you have little if any idea of what the counterfactual would be. Such cases are the furthest things imaginable from a methodological Pf and Cf.
Where and when both sets of considerations hold, the infrastructure’s risk mitigation programs and controls become a priority source of indicators and metrics reflecting how seriously catastrophic failure scenarios are treated by infrastructure managers.
Indeed, the existing controls and mitigations may provide the only real evidence, outside the real-time management of the control room, of what currently works well (or not) with respect to improving not only system reliability but system safety when pegged to catastrophic system failure.
A clear priority for safety management would be to identify, assess and better validate already existing risk mitigation programs, controls and their metrics in terms of their own failure rates, given their associated catastrophic failure scenarios.
Nevertheless, formal risk management frameworks and programs are apt to narrow the complex of “reliability standard, system and possibility” to the disaggregated specifics of “risk, asset and probability,” and in the process commit a major category mistake. It’s as if, in talking about water, you were immediately asked to think “H2O,” separating out the oxygen from the hydrogen and measuring each as best you can, all the while assuming that this analysis enables you to talk about water as water per se, e.g., as having the property of “wetness.”
In other words, the fact that risk is not formally calculated must not be taken to mean risk is not formally appraised and evaluated by other means, most prominently through the skills in systemwide pattern recognition and localized scenario formulation of real-time control room operators.
Principal source: On the very important differences between risk management and safety management, see C. Danner and P. Schulman (2019), “Rethinking risk assessment for public utility safety regulation,” Risk Analysis 39(5): 1044–1059.