–It is understood that “acceptable-risk” standards, based on past failure frequencies and commitments of “never again,” can be fleeting. Refinery explosions, stock market flash crashes, and massive identity theft lead to calls for government action. Something is done in response, but the sense of urgency and of “never again” is rarely sustained. There is always a fresh disaster or crisis scenario to reclaim public attention. Worse, the retrospective orientation of letting past (in)frequency set the standard has led to complacency and to the very accident that was to be forestalled, as in: “Well, it hasn’t happened in the past, so what’s the problem…”
It’s worthwhile then to ask what can be offered by way of a prospective orientation—“we are no more reliable than the next failure ahead”—to identifying standards of acceptable/unacceptable societal risk. What does societal risk acceptance look like if instead of being based on past events, it is grounded in the expectation that all manner of accidents lie in wait unless actively managed against?
Note the profound risk management implication of the two different orientations: Even if the exact probability of system failure were known (a wildly improbable event in itself), the conflict between the two orientations would persist.
I suggest the following thought experiment, the aim of which is to identify a proxy for “acceptable societal risk.” To telegraph ahead, the proxy proposed is the aggregate curve of the major real-time control room risks of society’s critical infrastructures.
–Assume: that society has identified critical infrastructures indispensable to its survival; that the key infrastructures have central control rooms for operating the entire system; and that the respective control room operators have a set of chief risks that they must manage in order to maintain systemwide reliability (which includes safety), at least in real time. While these are huge assumptions, their virtue is that they try to operationalize the unworldly premise of current approaches, most notably ALARP (“as low as reasonably practicable”), that somehow “society sets acceptable and unacceptable risks,” leaving the “somehow” utterly open.
Under the precluded-event standard of reliability (i.e., the event or a set of conditions to be prevented must never happen, given the society-wide dread associated with system failure), we found that control room operators need to be able to maneuver across four performance modes so as to maintain even normal operations. Our interviews with operators also showed that each performance mode has its own chief risk.
The four modes range from anticipatory exploration of options (just in case) when operations are routine and many management strategies and options are available, to a real-time improvisation of options and strategies (just in time) when task conditions are more volatile. Reliability professionals may have to operate temporarily in a high-risk mode (just for now) when system volatility is high and options few. They may also be able, in emergencies when options have dwindled, to impose onto their users a single emergency scenario (just this way) in order to stabilize the situation.
The chief risk in just-in-case performance is that professionals are not paying attention and become complacent—reliability professionals have let their guard down and ceased to be vigilant, e.g., to sudden changes in system volatility (think of system volatility as the degree to which the task environment is unpredictable and/or uncontrollable). As for just-in-time performance, the risk is misjudgment by the operators with so many balls in the air to think about at one time. The great risk in just-this-way performance is that not everyone who must comply will comply with one-off measures to reduce system volatility.
Last, just-for-now performance is the most unstable performance mode of the four and the one managers want most to avoid or exit as soon as they can. Here the risk of “just keep doing that right now!” is tunneling into a course of action without escape alternatives. What you feel compelled to do now may well increase the risks in the next step or steps ahead (in effect, options and volatility are no longer independent dimensions).
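The four modes and their chief risks can be laid out as a small lookup. What follows is only a sketch: reading the modes as a two-by-two of volatility (low or high) against options (few or many) is an assumption drawn from the descriptions above, and placing just-this-way in the low-volatility, few-options cell is a simplification of "imposing a single scenario in order to stabilize the situation." The names `PerformanceMode`, `MODES`, and `performance_mode` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceMode:
    name: str
    chief_risk: str


# Keys are (volatility_is_high, options_are_many).
# The binary split is an illustrative assumption; in practice both
# dimensions are matters of degree judged by the operators themselves.
MODES = {
    (False, True): PerformanceMode("just-in-case", "complacency"),
    (True, True): PerformanceMode("just-in-time", "misjudgment"),
    (True, False): PerformanceMode("just-for-now", "closing off alternatives"),
    (False, False): PerformanceMode("just-this-way", "non-compliance"),
}


def performance_mode(volatility_is_high: bool, options_are_many: bool) -> PerformanceMode:
    """Look up the mode (and its chief risk) for a given task environment."""
    return MODES[(volatility_is_high, options_are_many)]


mode = performance_mode(volatility_is_high=True, options_are_many=False)
print(mode.name, "->", mode.chief_risk)
```

The lookup is deliberately static. It records which risk belongs to which mode; it says nothing about how operators move between modes, which is where the real skill lies.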
Note that the commonplace admonitions for being reliable on the job—don’t get complacent; avoid overconfidence; once you’ve backed yourself into a corner, quick fixes work only just for now, if that; and don’t expect everyone to comply with command and control—all recognize these chief performance mode risks. Note also the two-dimensionality of “the next steps”: It refers not only to the future ahead but also the future that has to be made for the present.
–Step back now and further assume that estimates have been computed by control room operators in consultation with subject matter experts for the risks of complacency, misjudgment, non-compliance and closing off alternatives, within the infrastructure concerned. Such then is done for all the society’s key infrastructures with control rooms.
There is no reason to believe that the estimates of any one of the four key risks are the same for the same performance mode across all infrastructures during their respective normal operations. Different precluded-event standards are operationalized very differently in terms of the thresholds under which the infrastructures are not to operate. Complacency or misjudgment could empirically be more of a problem in some control rooms than in others.
Assume the performance-mode risk estimates (or a stratified/weighted sample of them) have been rank ordered, highest to lowest, for these infrastructures operating to a precluded-event standard by their respective control rooms. Plotting the ranked estimates generates a set of points in the form of a downward sloping function. This function would be the revealed allocation of acceptable societal risks at the time of calculation: it covers the critical infrastructure services of interest in their really-existing normal operations, as they seek to preclude different dreadful events from happening with respect to vital societal services.
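The rank ordering can be sketched in a few lines. Everything specific here is an assumption made for illustration: the infrastructure names, the 0-to-1 scale, and the numbers themselves are invented, since the thought experiment specifies no units or values. `RiskEstimate`, `ESTIMATES`, and `rank_order` are hypothetical names.

```python
from typing import NamedTuple


class RiskEstimate(NamedTuple):
    infrastructure: str
    mode: str
    chief_risk: str
    estimate: float  # hypothetical score on a 0-1 scale, not real data


# Invented values for two invented infrastructures.
ESTIMATES = [
    RiskEstimate("electricity", "just-in-case", "complacency", 0.12),
    RiskEstimate("electricity", "just-in-time", "misjudgment", 0.31),
    RiskEstimate("electricity", "just-for-now", "closing off alternatives", 0.44),
    RiskEstimate("water", "just-in-case", "complacency", 0.27),
    RiskEstimate("water", "just-this-way", "non-compliance", 0.19),
]


def rank_order(estimates):
    """Sort estimates highest to lowest and attach ranks starting at 1."""
    ordered = sorted(estimates, key=lambda e: e.estimate, reverse=True)
    return list(enumerate(ordered, start=1))


# The (rank, estimate) pairs are the points of the downward sloping function.
for rank, e in rank_order(ESTIMATES):
    print(rank, e.infrastructure, e.mode, e.chief_risk, e.estimate)
```

Note that the same chief risk (here, complacency) lands at different ranks for different infrastructures, which is the point of the preceding paragraph.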
The downward sloping function would, by definition, be a prospectively oriented standard of acceptable risk for society’s (sampled) critical infrastructures operating to the precluded-event standard by their control rooms. It is prospective because the unit of analysis isn’t the risk of system failure—again, typically calculated retrospectively on the basis of “the past record”—but rather the current risks of real-time control operators failing in systemwide management, now and in their next steps ahead.
–Even though all this is difficult to operationalize—but less so than the traditional ALARP!—three implications are immediate.
First, because control rooms manage latent risks (uncertainties with respect to probabilities and consequences of system failure) as well as manifest risks (with known probability of failure, Pf, and consequence of failure, Cf), any such downward-sloping function will necessarily have a bandwidth around it. That bandwidth, however, is not one that can be chalked up to “differences in societal values and politics.” Rather, the bandwidth reflects more the control room uncertainties (often technical and procedural, but related also to unstudied or unstudiable conditions).
It is true that some real-time uncertainties to be managed are linked directly to societal values and politics: think here of where those new or revised compliance regulations that followed from the last disaster have their greatest real-time impacts. Even then, the challenge is to show how the application at this time and for this case of any compliance procedure follows from said societal values. That is no easy task because analysis would also drive down to the case or event level and not just up to the policy or regulatory level where societal values are (or so it is said) easier to identify.
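One way to picture the bandwidth is to give each ranked point a lower and an upper bound alongside its central estimate. This is a sketch under stated assumptions only: representing control room uncertainty as a simple interval is my simplification, the labels and numbers are invented, and `band_around_curve` is a hypothetical helper.

```python
def band_around_curve(points):
    """Rank points by central estimate and report the width of each band.

    points: list of (label, low, central, high) tuples on a hypothetical
    0-1 scale. Returns (label, low, central, high, width) tuples ordered
    from highest to lowest central estimate.
    """
    for label, low, central, high in points:
        if not (low <= central <= high):
            raise ValueError(f"bounds out of order for {label!r}")
    ordered = sorted(points, key=lambda p: p[2], reverse=True)
    return [
        (label, low, central, high, round(high - low, 3))
        for label, low, central, high in ordered
    ]


# Invented values: a wider band stands for more latent (less well
# understood) risk around the same kind of estimate.
POINTS = [
    ("electricity / misjudgment", 0.25, 0.31, 0.40),
    ("water / complacency", 0.10, 0.27, 0.50),
    ("electricity / complacency", 0.10, 0.12, 0.15),
]

for row in band_around_curve(POINTS):
    print(row)
```

In this made-up case the widest band belongs to a mid-ranked point, so the bandwidth need not shrink or grow in step with the curve itself.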
A related, second implication is noteworthy as well. The bandwidth around a societal risk acceptance function as defined above varies because not every critical infrastructure manages to a precluded-event standard. Other standards can be managed to. Even so, note how remote this acknowledgement is from any argument that societal values determine directly (or even primarily) the operative standards managed to.
An example will have to suffice. A primary reason why critical infrastructures manage to an avoided-events standard today (these events should be avoided, even if they cannot always be in practice) is that their inter-infrastructural connectivity does not allow individual control rooms to preclude failures or disruptions in the other infrastructures upon which they depend or which depend on them. It is better to say that in these cases the shift from one (precluded-event) to another (avoided-event) reliability standard reveals societal preferences for interconnected critical infrastructures before it demonstrates any first-order derivation from more generalized or abstracted “societal values” per se.
Third, a much more practical implication follows. Policy and regulatory leaders who do not understand the uniquely prospective orientation of reliability professionals are apt not only to confuse their own values and views about the future for those of the reliability professionals, but also to make mistakes precisely because they don’t appreciate the distinctive orientation of these professionals. Indeed, the former’s ignorance will muddle things up even more for the real-time operators.