New benchmark metrics for major risk and uncertainty (longer read)

Starting point for policy and management with respect to risk and uncertainty. The methodological demand is always: First differentiate! Differentiation matters especially with respect to risk and uncertainty. There is no such thing as risk or uncertainty on its own; it is always risk or uncertainty with respect to something.

The language of risk and uncertainty is now so naturalized it always seems the obvious point of departure, like filing alphabetically or chronologically: “The first thing we have to do is assess the risks of flooding here…” No. The first thing you do is detail the with-respect-to scenarios of interest.

To start with, you identify the boundaries of the flood system as it is actually managed and then the standards of reliability to which it is being managed (namely, events must be precluded or avoided by way of management) and from which derive the specific risks to be managed to meet standard(s). The risks follow from the standard to be met for the system as bounded for management in real time.

Why is this important? It means that benchmarks or metrics for risk and uncertainty are all about the details in the with-respect-to scenarios.

An example. Focus on an island in the western California Delta–for example, Sherman Island–and consider criteria that engineers rely on for establishing priorities with respect to reducing levee fragility there (the island’s encircling levees are needed because its productive areas are considerably below water level):

  • Criterion 1. Levee fragility priority can be set in terms of the weakest stretch of levee around the island, i.e., the stretch of levee that has the highest probability of failure (Pf). This has obvious implications for collocated elements from different infrastructures, e.g., a very high levee Pf should counsel against plans to place, say, a huge chemical tank facility next to it. (You’d assume commonsense would commend this as well.)
  • Criterion 2. Levee fragility priority can be set in terms of the stretch with the highest loss of life (and/or other assets) arising from levee failure. If the levee breaches where most island residents live, then there is less time for evacuation. Clearly, consequences of failure (Cf) are important here, and this criterion is about the levee stretch that has the greatest risk of failure, not just probability of failure. (Risk here is the product of Pf times Cf.)

Sherman Island’s weakest levee stretch, at the time of our research, was said to be on the southwest part of the island; the stretch with the greatest loss of life appeared to be on the eastern and south-east side with more residences. Other factors constant and from the perspective of Criterion 2, it is better in fact that the weakest stretch of levee (according to Criterion 1) is on the other side of the island, so as to ensure more time for evacuation.
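The difference between Criterion 1 and Criterion 2 can be sketched in a few lines. This is a minimal illustration only: the stretch names, the Pf values and the Cf values are hypothetical placeholders, not figures from the research.

```python
# Hypothetical levee stretches. Every number here is illustrative only.
# pf = probability of failure, cf = consequences of failure (arbitrary loss units).
stretches = {
    "southwest": {"pf": 0.05, "cf": 10.0},
    "east": {"pf": 0.02, "cf": 60.0},
    "north": {"pf": 0.01, "cf": 20.0},
}


def weakest_stretch(data):
    """Criterion 1: the stretch with the highest probability of failure (Pf)."""
    return max(data, key=lambda name: data[name]["pf"])


def riskiest_stretch(data):
    """Criterion 2: the stretch with the highest risk, taken as Pf times Cf."""
    return max(data, key=lambda name: data[name]["pf"] * data[name]["cf"])


print(weakest_stretch(stretches))   # southwest
print(riskiest_stretch(stretches))  # east
```

With these made-up inputs the two criteria point to different stretches, which is the situation described for Sherman Island.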

–A third criterion, in contrast, reflects the extent to which the levee infrastructure of the island is part and parcel of a wider interconnected critical infrastructure system (ICIS):

  • Criterion 3. Levee fragility priority can be set in terms of the stretch that has the greatest risk to the entailed ICIS. ICIS risk of failure is not the same as risk of levee failure only, as stretches of Sherman Island levees are in fact not just elements in the levee system there but also elements in other critical infrastructures. With respect to Sherman Island, there is the levee stretch with Hwy 160 on top; there are also other stretches serving as the waterside banks of the deepwater shipping channels; another stretch serves to protect a large wetland berm (as fishing and bird habitat). If those stretches of levee fail, so too by definition do elements fail in the deepwater shipping channel, Hwy 160 or the Delta’s threatened habitat.

Criterion 3 asks: What is the effect on the road system or shipping system or wetlands ecosystem, when that shared ICIS element on Sherman Island fails? If a stretch of Hwy 160 fails, road traffic in the Delta would be detoured; if a stretch of the deepwater shipping channel fails, shipping traffic would be rerouted to other ports; and so on. In some cases the service cannot continue because there are no default options, e.g., the Sherman Island wetlands berm in terms of its habitat and fish can’t be “rerouted” were the protective levee to fail.

What infrastructure system that shares one or more ICIS elements on Sherman Island would be affected the most in terms of increasing the probability of its failing as a system, were such Sherman Island levee stretches to fail? The answer: A levee breach anywhere on Sherman Island would increase the probability of closing the key pumps for the State Water Project. That is, the Pf of the state and federal water projects would increase were Sherman Island to flood, because saltwater would be pulled further up from the San Francisco Bay into the freshwater Delta.
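One way to picture the Criterion 3 question is to compare each system's baseline Pf with its Pf given that a shared levee stretch has failed. The sketch below does only that. The system names, the stretch names and every probability are assumptions made up for illustration.

```python
# Baseline probability of system failure (hypothetical values).
baseline_pf = {"roads": 0.010, "shipping": 0.010, "water_project": 0.005}

# P(system fails | this levee stretch fails), again hypothetical.
# Any breach is assumed to raise the water project's Pf, per the text.
conditional_pf = {
    "hwy160_stretch": {"roads": 0.05, "water_project": 0.20},
    "channel_stretch": {"shipping": 0.08, "water_project": 0.20},
}


def pf_increase(stretch):
    """Increase in each system's Pf if the given stretch fails."""
    return {
        system: p_given_failure - baseline_pf[system]
        for system, p_given_failure in conditional_pf[stretch].items()
    }


def most_affected_system(stretch):
    """The system whose Pf rises the most when the given stretch fails."""
    increases = pf_increase(stretch)
    return max(increases, key=increases.get)


for stretch in conditional_pf:
    print(stretch, "->", most_affected_system(stretch))
```

The ranking here is over systems, not over levee stretches, which is what separates Criterion 3 from the first two criteria.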

–In sum, the three with-respect-to risk assessment criteria—others are possible—differ appreciably as to where risk analysts focus attention in terms of levee fragility: the weakest stretch (Pf) may not be the same stretch whose failure would have the greatest loss of life and property (Cf), while any stretch that failed would pose the greatest ICIS risk (namely, the probability that an ICIS element failing increases the probability of failure of one or more of the constituent systems sharing that element).

You would expect that calls for more and more “inter-organizational coordination” would have to be prioritized in light of these criteria distinctions. You’d be wrong. Criterion 3 was altogether outside conventional remit for risk assessment and management up to and at the time of the research.

Broader methodological implications of risk and uncertainty with-respect-to scenarios. Before proceeding to new metrics based in benchmarks for such risks, uncertainties and criteria, it is important to tease out what we mean and imply by “with respect to” in more general methodological terms:

  1. If you define risk of failure as the product of the probability of failure (Pf) times the consequences of failure (Cf), then Pf and Cf are NOT independent of each other, as conventional risk analysis would have it.

Both are connected indirectly by the “intervening variable” of their failure scenario. It’s Pf and Cf with-respect-to the same failure scenario. It’s the failure scenario which details the operative: (1) reliability standard (are you seeking to preclude specific events or avoid them if possible; are some events inevitable or compensable after the fact); (2) evaluative criteria (are you managing Pf [probability] or both Pf and Cf [risk]); and (3) the system being managed (are you managing, e.g., within or across different infrastructures).

Accordingly, the more granular the failure scenario (the greater the details about the above), the more likely that Pfs and Cfs are directly interconnected. In the most obvious case of interinfrastructural cascades, one consequence of infrastructure1 failing (Cf1) may be to increase infrastructure2’s probability of failure (Pf2).
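The cascade case can be written out with the law of total probability. This is a sketch under stated assumptions: the function name and the three input probabilities are hypothetical, and the only thing being shown is how Pf2 comes to depend on what happens to infrastructure1.

```python
def pf2_with_cascade(pf1, pf2_given_f1, pf2_given_ok):
    """Overall Pf of infrastructure2 when it depends on infrastructure1.

    pf1           -- probability that infrastructure1 fails
    pf2_given_f1  -- probability that infrastructure2 fails if infrastructure1 has failed
    pf2_given_ok  -- probability that infrastructure2 fails if infrastructure1 has not failed
    """
    for p in (pf1, pf2_given_f1, pf2_given_ok):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    return pf1 * pf2_given_f1 + (1.0 - pf1) * pf2_given_ok


# Hypothetical inputs: infrastructure1's failure sharply raises infrastructure2's Pf.
print(pf2_with_cascade(pf1=0.1, pf2_given_f1=0.5, pf2_given_ok=0.01))
```

If the two conditional probabilities are set equal, the result no longer depends on pf1. That is the independence case conventional risk analysis assumes.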

This is why a risk estimate must never be confused with being a prediction, i.e., “if the risk is left unattended, failure is a matter of time.” Even were Pf and Cf not interconnected, the efficacy of prediction depends on how detailed the with-respect-to scenario is. The function of the failure scenario is to identify and detail (if not isolate) conditions for cause and effect upon which prediction is or is not made possible. Without the scenario, you simply cannot assume more uncertainty means more risk; it may mean only more uncertainty over estimated risk in light of the with-respect-to scenario. You will note that many “large system failure scenarios,” a.k.a. crisis narratives, are devoid of just such detail when it comes to the operative reliability standards, evaluative criteria and (sub)systems to be managed.

  2. Identifying risk(s) without first defining the operational system and the reliability standard(s) being managed to leaves you with no stopping rule for possible failure scenarios and types of risks/uncertainties.

Without defining these initial conditions, all manner of elements and factors end up posing risks and uncertainties, e.g.

…different assets; multiple lines of business; system capacity, controls and marketing factors; in terms of the risks’ time-dependence versus independence; in terms of the risks associated with emergency work as distinct from planned work; investment risks versus operational ones; risks with respect not only to system safety and reliability, but also organizationally in terms of financial risk and in terms of risks of regulatory non-compliance….ad infinitum

At some point, it becomes an open question how managing all these and more risks and uncertainties contributes to the infrastructure’s control room operating the system reliably in real time. Conventional root cause analysis of infrastructure failure becomes highly vexed in the absence of a specified failure scenario. (For that matter, would you talk about the wetness of water by analyzing “H2O” only in terms of its oxygen and hydrogen molecules?)

In fact, the lack of a stopping rule for failure scenarios to be worried about represents a hazard or is its own failure scenario, when it discourages (further) thinking through and acting on failure scenarios about which more is already known and can be managed. When we asked infrastructure interviewees what were the “nightmares that keep them awake at night,” they identified not only measurable risks along with nonmeasurable uncertainties with respect to specific failure scenarios but also the fact that these scenarios seemed part of a limitless set of possibilities for what could go dangerously wrong.

What do these considerations add up to for the purposes of identifying new, more appropriate benchmark metrics for large system risk and uncertainty?

Most obviously, the probabilities and consequences (Pf and Cf) of large system failure can be underestimated. But this is not only because: (1) the measured estimates of Pf do not adequately address important nonmeasurable uncertainties (i.e., where either Pf or Cf cannot be measured in the time required) and (2) there are so many more failure modes than the conventional scenarios (e.g., earthquake or flood) assume.

It is also because (and importantly so, as we just saw) the failure scenarios themselves have not been specific enough with respect to the boundaries of the system being managed and the reliability standard(s) that govern what is taken to be relevant risk and uncertainty.

Second, the infrastructure’s already-existing risk mitigation programs and controls become a priority source of indicators and metrics reflecting how seriously catastrophic failure scenarios are treated by infrastructure managers. The existing controls and mitigations may provide the only real evidence, outside the real-time management of the infrastructure control room (if present), of what currently works well with respect to improving system reliability and safety when pegged to catastrophic system failure.

To put it another way, the fact that risk is not calculated through formal risk analysis and management protocols must not be taken to mean risk is not formally appraised and evaluated by other means, most prominently (1) through the skills in systemwide pattern recognition and localized scenario formulation of real-time control room operators and (2) relevant evaluation of risk mitigation programs and existing risk controls.

Against this background and in comparison to conventional risk analysis today, at least three new benchmark metrics for major risk and uncertainty can be identified by virtue of their different with-respect-to failure scenarios.

I. New risk benchmark 

When control operators and their managers in large critical infrastructures know that some events must never happen (the nuclear reactor must not lose containment, the urban water supply must not be contaminated by cryptosporidium, the electricity grid must not separate and island), and we know that they know because they behave accordingly, then better practices emerge for ensuring just that. (Again, this is why we look to evaluating existing mitigation programs and controls, and not just in the infrastructure concerned but in like infrastructures.)

Mandates to reliably preclude certain events put enormous pressure to focus on and adapt practices that are actually working to meet the mandates (including the appropriate evaluative criteria for measuring how effectively the mandates have been met). Where better practices have emerged, you know that others too face political, economic and social constraints and nonetheless have jumped a bar higher than you yourselves are currently facing under very similar constraints, including evaluative criteria and reliability standards.

Where so, then conventional risk analysis gets its questions only half right by stopping short of the other questions to be asked beforehand. The conventional questions, “What could go wrong?” “How likely is that?” and “What are the consequences if that were to happen?” should be preceded by: “What’s working?” “What’s even better?” “How can we get there?” and only then do we ask: “What could go wrong in trying to get there?” “How likely is that?” and “What are the consequences if that were to happen?”

(BTW, which would you prefer to start with in highly uncertain conditions: conventional risk analysis or high reliability management? The Maginot Line or the electricity grid enabling you to read this question?)

II. New metric for ranking crisis scenarios 

Start with a rather well-known prediction of Martin Rees, British science advisor, who assigned no better than a 50/50 chance that humanity survives the current century because of catastrophes of our making. How might we evaluate and rank his prediction in terms of risk and uncertainty?

Turn to another famous prediction, that of U.S. President, Woodrow Wilson (in his time expert in several fields), who predicted in September 1919 with “absolute certainty” that there would be another world war if the US did not join the League of Nations. Assume a unit of measurement called the Wilson. It is equal to the confidence today’s experts have that Woodrow Wilson did foresee the start of World War II.

Obviously, “the start of World War II” is inexact. Wilson did not predict the rise of Hitler, the Shoah, or carnage on the Eastern Front. But crisis scenarios for financial cascades, global cyber-attacks, and fast-spreading pandemics of as-yet unknown viruses lack comparable specificity by way of risk and uncertainty.

The question is this: How confident are experts in their crisis scenarios when that confidence is measured out in Wilsons? When it comes to nuclear terrorism, are the experts, say, 30 times more confident that such terrorism will happen than they are that Woodrow Wilson foresaw World War II? For that matter, what would be the consensus view of specialists when it comes to denominating other disaster scenarios into fractions or multiples of Wilsons?
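The arithmetic of the unit is simple enough to sketch. Everything numeric below is a hypothetical placeholder: no actual expert confidence figures are given here, and the function name is an assumption for illustration.

```python
def in_wilsons(scenario_confidence, wilson_confidence):
    """Express expert confidence in a scenario as a multiple of one Wilson.

    wilson_confidence is the baseline: the confidence today's experts have
    that Woodrow Wilson foresaw the start of World War II.
    Both arguments are treated as probabilities between 0 and 1.
    """
    if not 0.0 < wilson_confidence <= 1.0:
        raise ValueError("the Wilson baseline must be above 0 and at most 1")
    if not 0.0 <= scenario_confidence <= 1.0:
        raise ValueError("scenario confidence must lie between 0 and 1")
    return scenario_confidence / wilson_confidence


# Hypothetical: experts 2% confident in Wilson's foresight, 60% in the scenario.
print(in_wilsons(scenario_confidence=0.60, wilson_confidence=0.02))
```

Two things fall out of the sketch. Confidence cannot exceed 1, so a scenario can be “30 times more confident” only if the Wilson baseline is 1/30 or lower. And a baseline of zero leaves the unit undefined.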

The temptation is to dismiss outright that Woodrow Wilson foresaw the future. Were that dismissal scientific consensus, however, it would be quite significant for our purposes: Here at least is one scenario that is just-not-possible-at-all. No risk or uncertainty of being wrong here! To render any such conclusion means, however, the criteria used for concluding so apply to other crisis scenarios.

In short, we’re back to baseline confidence measures and the dray work of developing multiple ways of triangulating on and estimating specialist confidence, scenario by scenario, in the face of difficulties and inexperience over what and about which we know and do not know.

Several key points, though, become clearer at this point. To ask how confident specialists are about nuclear terrorism specifically quickly becomes a question of just what is meant by “an act of nuclear terrorism.” What, indeed, are the pertinent with-respect-to scenarios?

This devil-in-the-details leads to a second half of our thought experiment. Assume now we face a specific crisis scenario. It could be that act of nuclear terrorism, or that computer glitch sending global markets into free-fall or that bioengineered pathogen destroying near and far.

Assume a visualization of the widening scenario is simulated and presented so as to pressure decisionmakers to prevent that scenario from happening, once they see how catastrophe unfolds and metastasizes.

Assume also a running tally in the visualization shows the estimated monetary amount of the disaster’s costs—lives, property, whatever—burgeoning into the millions, then billions, now trillions. The tally in quick order reinforces how imperative it is to take urgent preventive action in the midst of all this interconnectivity (evaluative criterion #3 above).

But hold on. Assume the visualization and tally remain the same, but the simulation’s goal now is to estimate the cost of a catastrophe that can’t or won’t be prevented. The tally then becomes an unofficial price tag of the emergency prevention and management system put into place after this disaster, so that a like calamity “will never happen again” (the precluded event standard of reliability above). The commonplace here is that, sadly, it takes a disaster to bring about far better and more comprehensive disaster prevention and management afterward.

The temptation with this part of the thought experiment is to assert that, absent outright prevention, a world won’t be left from which to mount an effective crisis management infrastructure later on. That, though, surely depends on the specific catastrophe and the extenuations of implementing an emergency response infrastructure that its losses trigger. Again: The devil is in the details of the with-respect-to scenarios.

Note, though, just how difficult it is for anyone, subject matter experts let alone others, to come up with plausible details about the crisis response structure to be in place after the losses incurred. To do that requires deep knowledge and realism—far more, in other words, than the much-touted “imagination” on its own.

In short, we are asked to treat possible crisis scenarios seriously until proven otherwise, when those offering the scenarios are unable to specify what it takes to disprove the scenarios or prevent their recurrence. Or to put the point more positively, what deserves ranking, and where it is possible, are those crises of sufficient detail to be triangulated upon and confirmed.

III. New metric for estimating societal risk acceptance 

It is generally understood that “acceptable-risk” standards, based on past failure frequencies and commitments of “never again,” can be fleeting and ephemeral. More, the retrospective orientation to letting past (in)frequency of failures set the standard has led to complacency and the very accident to be forestalled, as in: “Well, it hasn’t happened in the past, so what’s the problem now…”

It’s worth asking, what can be offered by way of a prospective orientation—“we are no more reliable than the next failure ahead”—to identifying standards of acceptable/unacceptable societal risk. What does “societal risk acceptance” look like if instead of being based on past frequencies, it is grounded in the expectation that all manner of major system accidents and failure lie in wait unless actively managed against?

I suggest the following thought experiment, the aim of which identifies a proxy for “acceptable societal risk.” To telegraph ahead, the proxy proposed is the aggregate curve of the major real-time control room risks of society’s key critical infrastructures.

–Assume: that society has identified critical infrastructures indispensable to its survival; that the key infrastructures have central control rooms for operating the entire systems; and that the respective control room operators have a set of chief risks that they must manage in order to maintain systemwide reliability, at least in real time. (Here high reliability is defined as the safe and continuous provision of the critical service, even during periods of high risk and uncertainty.)

While huge assumptions, their virtue is trying to operationalize the far less detailed premise of current approaches—most notably ALARP (“as low as reasonably practicable”)—that somehow “society sets acceptable and unacceptable risks,” leaving the somehow utterly without specifics.

Under the precluded-event standard of reliability (i.e., the event or a set of conditions to be prevented must never happen, given the society-wide dread associated with system failure), our research found that control operators need to be able to maneuver across four performance modes so as to maintain normal operations. Each performance mode was found to have its own chief risk.

The four modes range from anticipatory exploration of options (just in case) when operations are routine and many management strategies and options are available, to a real-time improvisation of options and strategies (just in time) when task conditions are more volatile. Control room professionals and their support staff may have to operate temporarily in a high-risk mode (just for now) when system volatility is high and options few. They may also be able, in emergencies when options have dwindled, to impose onto their service users a single emergency action scenario (just this way) in order to stabilize the situation.

The chief risk in just-in-case performance is that professionals are not paying attention and become complacent—reliability professionals have let their guard down and ceased to be vigilant, e.g., to sudden changes in system volatility (think of system volatility as the degree to which the task environment is unpredictable and/or uncontrollable). As for just-in-time performance, the risk is misjudgment by the operators with so many balls in the air to think about at one time. The great risk in just-this-way performance is that not everyone who must comply does so.

Last, just-for-now performance is the most unstable performance mode of the four and the one managers want most to avoid or exit as soon as they can. Here the risk of “just keep doing that right now!” is tunneling into a course of action without escape options. What you feel compelled to do now may well increase the risks in the next step or steps ahead (in effect, options and volatility are no longer independent).

Note that the commonplace admonitions for being reliable—don’t get complacent; avoid overconfidence; once you’ve backed yourself into a corner, quick fixes work only just for now, if that; and don’t expect everyone to comply with command and control—all recognize these chief performance mode risks on time-critical jobs.
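The four performance modes and their chief risks can be laid out as a small lookup keyed on system volatility and the options available. This is a sketch, not the research instrument. The two-level “low/high” and “many/few” coding is a simplification, and placing just-this-way in the low-volatility, few-options cell is an assumption of the sketch, read off from “in order to stabilize the situation” rather than stated outright above.

```python
from typing import NamedTuple


class Mode(NamedTuple):
    name: str
    chief_risk: str


# Keys are (system volatility, options available).
MODES = {
    ("low", "many"): Mode("just-in-case", "complacency"),
    ("high", "many"): Mode("just-in-time", "misjudgment"),
    ("high", "few"): Mode("just-for-now", "closing off alternatives"),
    # Assumption: imposing a single emergency scenario is what lowers volatility.
    ("low", "few"): Mode("just-this-way", "non-compliance"),
}


def performance_mode(volatility, options):
    """Return the performance mode and chief risk for the given conditions."""
    try:
        return MODES[(volatility, options)]
    except KeyError:
        raise ValueError(
            "volatility must be 'low' or 'high'; options must be 'many' or 'few'"
        ) from None


mode = performance_mode("high", "few")
print(mode.name, "->", mode.chief_risk)  # just-for-now -> closing off alternatives
```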

–Step back now and further assume that estimates have been computed by control room operators in consultation with subject matter experts for the risks of complacency, misjudgment, non-compliance and closing off alternatives, within the infrastructure concerned. Such is then done for (a stratified sample of) society’s key infrastructures with control rooms.

There is no reason to believe the estimates of any one of the four key risks are the same for the same performance mode across all infrastructures during their respective normal operations. Different precluded events standards are operationalized very differently in terms of the thresholds under which they are not to operate. Complacency or misjudgment could empirically be more a problem in some control rooms than others.

Assume the performance-mode risk estimates (e.g., a stratified/weighted sample of them) have been rank ordered, highest to lowest, for these infrastructures operating to a precluded-event standard by their respective control rooms. A plot of points measured in terms of their respective Pf and Cf coordinates is generated in the form of a downward sloping function (e.g., logarithmic or regression). This function reflects the revealed allocation of acceptable societal risks at the time of calculation for the critical infrastructure services of interest in really-existing normal operations to preclude their respective dreadful events from happening.
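To make the thought experiment a little more concrete, here is a minimal sketch of the rank ordering and the fitted function. The (Cf, Pf) pairs are invented and chosen so that the downward slope shows up. A straight-line least-squares fit on log-log axes is only one of the functional forms that could be used.

```python
import math

# Hypothetical performance-mode risk estimates as (Cf, Pf) pairs.
# Cf is in arbitrary loss units; all values are illustrative only.
points = [(1.0, 0.1), (10.0, 0.02), (100.0, 0.004), (1000.0, 0.0005)]

# Rank order the estimates by risk (Pf times Cf), highest to lowest.
ranked = sorted(points, key=lambda p: p[0] * p[1], reverse=True)


def fit_log_log(data):
    """Ordinary least squares fit of log10(Pf) = intercept + slope * log10(Cf)."""
    xs = [math.log10(cf) for cf, _ in data]
    ys = [math.log10(pf) for _, pf in data]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_log_log(points)
print("highest-ranked estimate (Cf, Pf):", ranked[0])
print("fitted slope on log-log axes:", round(slope, 3))
```

A negative slope means that higher-consequence risks are being managed to lower probabilities, which is the revealed allocation the function is meant to capture.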

The downward sloping function would, by definition, be a prospectively oriented standard of acceptable risk for society’s (sampled) critical infrastructures operating to the precluded-event standard by their control rooms. It is prospective because the unit of analysis isn’t the risk of system failure—again, typically calculated retrospectively on the basis of the past record, if any—but rather the current risks of real-time control operators failing in systemwide management, now and in their next operational steps ahead. Note the two-dimensionality of the prospective “next steps ahead”: It refers not only to the future ahead but also the future that has to be made–prefigured–for the present.

–Even though all this is difficult to detail, let alone operationalize—but less so than the conventional ALARP!—three implications are immediate.

First, because control rooms manage latent risks (uncertainties with respect to probabilities or consequences of system failure) as well as manifest risks (with known Pf and Cf), any such downward-sloping function will necessarily have a bandwidth around it. That bandwidth, however, is not one that can be chalked up to “differences in societal values and politics.” Rather the bandwidths reflect more so the control room uncertainties (often technical and procedural, but related also to unstudied or unstudiable conditions).

It is true that some real-time uncertainties to be managed are linked directly to societal values and politics (think here of those new or revised compliance regulations that followed from the last disaster and that have their greatest impacts in real time). Even then, the challenge is to show how the application at this time and for this case of any compliance procedure follows from said societal values. That is no easy task because analysis would also drive down to the case or event level and not just up to the policy or regulatory level where societal values are (or so it is said) easier to identify.

A related implication is also noteworthy. The bandwidth around a societal risk acceptance function as defined above varies because not every critical infrastructure manages to a precluded-event standard. Other standards (and associated evaluative criteria) can be managed to. Even so, note how remote this acknowledgement is from any argument that societal values determine directly (or even primarily) the operative standards managed to.

An example is helpful. A primary reason why critical infrastructures manage to an avoided-events standard today—these events should be avoided, albeit they cannot always be in practice—is because their inter-infrastructural connectivity does not allow individual control rooms to preclude failures or disruptions in the other infrastructures upon which they depend or which depend on them. It is better to say that in these interconnected cases the shift from one (precluded-event) to another (avoided-event) reliability standard reveals societal preferences for interconnected critical infrastructures before it demonstrates any first-order derivation from more generalized or abstracted “societal values” per se.

Third, a very practical implication follows. It is likely that policy and regulatory leaders who do not understand the uniquely prospective orientation of reliability professionals are apt not only to confuse their own values and views about the future for those of control room reliability professionals, but also to make mistakes because they, the policymakers and regulators, don’t appreciate the distinctive orientation of these professionals.[1]

A last point when it comes to major risk and uncertainty in policy and management. In case it needs saying, the risk and uncertainty discussed above–so too the standards, evaluative criteria, and “systems”–are socially constructed and historicized. Their expression is very much of a time and of a place.

That said, acknowledging the historical, social, cultural, economic…basis of our knowledge about the complex we have been summarizing as “risk and uncertainty” has rarely gone far enough when it comes to policy and management discussed above.

For, there is the corollary of social construction and historicism: Humans can only know—really know—that which they create. (Such is the insight of St. Augustine for philosophy, Giambattista Vico for history, Roy Bhaskar for science….) Humans know mathematics in a way they cannot know the universe, because the former is a thoroughly human creation about which more and more can be made to know. Their uncertainties are socially constructed in a way that, for lack of a better word, “unknowledge” about the universe is not.

This corollary means that to accept that “Risk, uncertainty and allied notions are socially constructed concepts easily historicized” needs to be pushed further.

What is missing are the details and specifics of the connections among risk, uncertainty and associated terms that we make and the meanings we draw out for these connections, often under conditions of surprise.

Our creations are always surprising us and we seek to explain these occurrences by means of analogies that extend the range of what we call knowledge. That which we have created by way of risk and uncertainty—and continue to create—has become very complex. In fact: so complex as to continually provoke more complexity-as-knowledge and with it more action-as-complexity.

[1] What are specific direct relationships between political leaders and infrastructure control operators? At first pass, leaders would seem to be all about just-this-way command and control in emergencies. But we know of infrastructure’s reliability professionals who determine emergency declarations, as they are the best informed in real time, not political outsiders. Indeed, a big issue is ensuring “politics stays out of the control room” as much as possible. We found leaders to be important in the negative liberty sense of staying out of the way of control room operators working under just-in-time and just-for-now performance demands. As for just-in-case performance during times of low system volatility, leaders lead best by ensuring reliability professionals are able to build up their inventory of resources to be used in a crisis. In short, reliability professionals have more performance modes than leaders realize, we believe.

Worse, what is a “crisis” to control operators is not necessarily known to or regarded by those political leaders whose policies reduce operator options, increase their task volatility, and reduce their maneuverability to prolonged just-for-now performance only, among other real-time inflictions. To put the point more generally, when it comes to crisis management, the conventional literature on leadership is either top down (leaders direct) or bottom up (self-organizing). We add a third category: control rooms, and not just in terms of incident command centers during the emergency but already-existing infrastructure control rooms that continue to operate during the emergency. Adding the third is to insist on the preexisting nature of management in which crises that would have happened did not because of ongoing reliability operations.

Principal sources. This blog entry consolidates, edits and updates earlier blogs: “A new standard for societal risk acceptance,” “Easily-missed points on risks with respect to failure scenarios and their major implications,” “Risk criteria with respect to asset versus system scenarios,” “Half-way risk,” “With respect to what?,” and “Yes, ‘risk and uncertainty’ are socially constructed and historicized. Now what? The missing corollary and 3 examples”
