Introduction
One central insight from the literature and theory of High Reliability Organizations (HROs) is that past reliable performance of large socio-technical systems does not predict, let alone ensure, reliable performance into the future. In this sense, then, there are two futures of interest in this blog entry: the future of large socio-technical systems from the perspective of HRO Studies and the future of HRO Studies.
It’s to be recognized from the get-go that “HRO Studies” now houses many reliability-seeking rooms. A major path trajectory for research and taxonomies has been the early work of Todd LaPorte, Gene Rochlin, Karlene Roberts and Paul Schulman around a set of hazardous organizations mandated to maintain reliable (safe and continuous) operations (LaPorte and Consolini 1991; Roberts 1993; Rochlin 1993; Schulman 1993). We need look no further for the durability of this trajectory than the many comparisons of HRO/HRT (high reliability theory) to Charles Perrow’s Normal Accidents Theory (more recently, Bamburger 2014 for cybersecurity; Min and Borch 2021 for financial markets).
Yet, it is also true that HRO Studies has been extended and changed in ways initially unforeseen with contributions to, inter alia, management practice (e.g., Weick and Sutcliffe 2001), safety science (e.g., Hopkins 2014; Amalberti 2013), resilience theory and practice (e.g., Hollnagel, Woods, and Leveson 2006; Boin and van Eeten 2013), networked reliability (e.g., de Bruijne 2006; Roe and Schulman 2016), and analyses of high reliability as a continuous quantifiable variable in the operations of health care, nuclear power, and other industries (Vogus and Sutcliffe 2007; Schöbel 2009; May 2013; O’Neil and Kriz 2013).
In addition, other fields, like development studies, have also drawn water from this wider trough (Scoones 2024). Most notably for our purposes here, this diversification of HRO Studies has witnessed a shift from identifying HRO properties or features at a point in time to identifying processes of high reliability organizing and managing over time (e.g., Ramanujam and Roberts 2018). Another piece of evidence for the continuing relevance of HROs, now writ large, is that the Google Ngram for the term “high reliability organizations” shows a steady climb in the literature.
That said, as there is something perverse in assuming past trends and growth are a reliable guide for the future of and in HRO Studies, how then to see better what’s ahead for the field and what’s ahead by way of large socio-technical systems?
Specific aim
My answer is to return to another, more methodological insight of LaPorte, Rochlin, Roberts and Schulman: The specific cases analyzed matter profoundly. In their set, they found reliable management at that point in time where theory would still tell you not to expect it.
But, of course, there were other reliability-seeking cases and unsurprisingly some of these did not and still do not exhibit HRO processes, let alone features. That indeed is the lesson I take away from the many published “HRO v NAT” comparisons: not that one theory is better “overall” than the other, but that there is still no substitute for attending to differences in the cases examined, in time and over time–especially if theory is to matter for practice.
Focus, rationale and roadmap
I want to suggest then that HRO Studies might benefit at this juncture by a comparative and longitudinal analysis involving similar or related cases. I am familiar with only one, that starting with the California Independent System Operator (CAISO), which operates most of the state’s electric transmission grid.
Here I focus on the aforementioned “network reliability” strand of HRO Studies and ask: What have we learned since Mark de Bruijne published his transmission grid study of CAISO in 2006? I start with this work for three reasons: His book preceded our own 2008 work on managing networked high reliability in electricity transmission; we aligned subsequent work with De Bruijne and others (Roe and Schulman 2016: 10); and I want to be very transparent upfront that I am not cherry-picking quotes from our own work to substantiate the implications drawn at the end of this blog entry.
Both Paul Schulman and I continued the CAISO research and analysis after our Dutch colleagues left, with the results appearing as updates and framework extensions in Roe and Schulman (2008, 2016). More recently, Paul Schulman and I have investigated the notion of network reliability across interconnected critical infrastructures, including but not limited to electricity (Roe and Schulman 2018, 2023).
In what follows, I first present De Bruijne’s findings. I then update this earlier notion of networked reliability in light of our research to the present on interconnected critical infrastructures, including that of electricity. The conclusion focuses on what I take to be important implications for network reliability, both as a strand in HRO Studies and in the future(s) of large socio-technical systems. To telegraph ahead and in T.S. Eliot’s words, “the end of all our exploring will be to arrive where we started and know the place for the first time.”
Summary of Networked Reliability (de Bruijne 2006)
The full text of Mark de Bruijne’s Networked Reliability is well worth reviewing and can be accessed at https://www.researchgate.net/publication/306011428_Networked_Reliability_Institutional_fragmentation_and_the_reliability_of_service_provision_in_critical_infrastructures.
For those who do not have the time, the work is summarized in a 2007 article De Bruijne co-authored with his dissertation advisor and our early CAISO research colleague, Michel van Eeten, “Systems that Should Have Failed: Critical Infrastructure Protection in an Institutionally Fragmented Environment” (accessed at https://www.researchgate.net/publication/227701135_Systems_that_Should_Have_Failed_Critical_Infrastructure_Protection_in_an_Institutionally_Fragmented_Environment).
I quote at length from De Bruijne and Van Eeten in order to establish for later purposes of comparison a separate and uninterrupted benchmark for the networked reliability then under study in the early 2000s:
A key question that arises from these developments is: how do CI [critical infrastructure] industries, consisting of networks of organizations, many with competing goals and interests, provide reliable services in the absence of conventional forms of command and control? This raises another question that, logically, precedes it: are institutionally fragmented CIs in fact still reliable?
Does Institutional Fragmentation Affect the Reliability of Service Provision?
The exact relationship between institutional restructuring and the reliability of services and networks has so far remained largely obscured. The available empirical data on reliability – measured in terms of the frequency and length of disruptions to end-users – fail to provide an unequivocal answer. We were able, however, to draw upon extensive field research on reliability-related issues in large-scale water systems (Van Eeten and Roe, 2002; Roe and Van Eeten, 2002), electricity grids (Schulman et al, 2004; Roe et al, 2005) and telecommunication networks (Van Eeten et al, 2005; De Bruijne, 2006). Together, these field studies comprise over 130 interviews, extensive control room observations and literature reviews.
Without repeating previous discussions of our findings, we can draw out a number of implications, primarily based on our studies in electricity and telecommunications. First of all, while there are no conclusive data regarding the reliability of services and networks post-restructuring [post-deregulation, privatization and liberalization], the data that is available suggests that the network operators and service providers have managed to cope with these changes. The two focal organizations that we studied – the California Independent System Operator (ISO) and Dutch mobile telephony operator KPN Mobile – succeeded in maintaining a high reliability of service provision. The organizations displayed virtually unchanged levels of service provision before and after restructuring. The ISO’s reliability performance during California’s electricity crisis in 2000 and 2001 – one of the most turbulent periods in which any restructured critical infrastructure industry ever operated – did not lead to outage rates that differed significantly from those of the utilities before restructuring. In the end, the lights stayed on for most of the time, notwithstanding the popular images in the media of sweeping blackouts across. The reported rolling blackouts occurred on eight days for 27 hours, compared to the 125 days on which just 1.5 percent of operating reserves remained and stage 3 emergencies were declared. The aggregate amount of load shed during California’s electricity blackouts was quite small, adding up to no more than one hour’s worth of electricity to all residential homes in the state. This performance fell within the margins of the average annual reliability performance of the investor-owned utilities before restructuring. However, other key reliability indicators (e.g. the number of high-voltage transmission line overloads and the number of violations in the ISO’s control area) did show that the system was operating closer to the edge of failure – demonstrating the massive pressure under which the system was operating. In other words, although negative effects of restructuring could be identified, the organizations involved, most notable the ISO, managed to cope with these effects and maintain acceptable levels of service and network reliability.
Similarly, the Dutch mobile telephone operator KPN Mobile displayed a steady reliability performance from 1996 to 2001, notwithstanding seven-fold increase in customers, the rapid expansion and innovation of its mobile network and the six-fold increase in the number of services it provided over this network. From 1996 to 2001, the company displayed steadily rising call completion rates (CCR) and call setup success rates (CSSR), which in the telecommunication industry are considered key proxies for the reliability of service provision. In addition to these steadily improving reliability indicators, KPN experienced ‘only’ a 50 percent increase in the number of ‘calamities’ – which they define as incidents with an impact on customers.
While significant, this number pales in comparison to the growth rate of customers, network and services. KPN Mobile achieved this performance under cut-throat competition in the market which forced them to undertake drastic cost reductions in their operations.
Considering the effects of institutional fragmentation on how these CIs were organized and operated, the abovementioned performance of both the ISO and KPN Mobile may be considered an astonishing feat. Despite operating under conditions with significantly reduced resources time and again the organizations managed to maintain a reliable provision of CI services. These findings are all the more puzzling since the two dominant organizational theories that are used to assess the reliability, or lack thereof, of complex, large-scale technological systems would predict a negative impact on the ability of organizations to reliably manage these CIs.
The Normal Accident Theory (NAT) (Perrow, 1999a) and High-Reliability Theory (HRT) (Roberts, 1993) both expect that institutional fragmentation caused by restructuring negatively affects the ability to reliably manage these infrastructures and that reliability of service provision accordingly should have suffered. However, the case studies did not confirm the theoretically assumed negative relationship between the effects of institutional fragmentation and ability to reliably manage these infrastructures even though infrastructure operations did become more complex to manage and behaved more volatile (De Bruijne et al., 2006). Evidence did show that the infrastructures operated ‘closer to the edge’ than before restructuring. So how can we explain the performance record of restructured CIs and the more or less continued high reliability of the provided services in the researched cases?
Coping with institutional fragmentation
Based on these findings, it could be concluded that institutional fragmentation and restructuring not only negatively affected the ability of organizations that manage CIs to provide highly reliable services, but also offered new options that enabled organizations involved in the management of these systems to maintain reliability under extremely demanding conditions. The case studies revealed a large number of hitherto unknown or unrecognized conditions that enabled these organizations to cope with the effects of institutional fragmentation (De Bruijne, 2006). Examples include the increased use of real-time, on-line experimenting; the gradual redefinition of reliability norms and criteria to fit the new conditions and the increased use of support staff and informal wheeling and dealing in real-time in control rooms. These conditions, which many at first glance would consider detrimental to the provision of reliable services, were found to contribute to the ability of the organizations to maintain a reliable provision of services.
The research found both NAT and HRT flawed in their assumptions on the main relationships between the conditions that facilitate reliability and the levels of reliability achieved. The networked environment clearly emphasized different reliability-enhancing characteristics than those identified by NAT and HRT (cf. Grabowski & Roberts, 1996; Schulman et al., 2004). The implication is that NAT and HRT, which until now have been presented as generic organizational theories of (un)reliability, need to be modified in order to be valid under conditions of networked reliability (see also Schulman et al, 2004; De Bruijne, 2006). In general terms, we have identified three shifts of emphasis in organizational processes and resource allocation.
(i) From long-term planning to real-time management
Institutional fragmentation and the introduction of competition create more volatile and technologically more complex infrastructures. Many of the procedures and routines that had been designed to reliably operate the CIs do not function anymore. Infrastructure operations used to emphasize the importance of complete information, centralized planning and command and control. Institutional fragmentation caused those in control of infrastructure operations to be confronted with less than adequate information and control, leading to more surprises and reliability-threatening events. This in turn emphasizes a need for more flexible response capability to maintain reliable services. Real-time operations – typically focused in and around control centers – increases in importance, reducing the strong reliance on long-term, detailed planning that has characterized CIs (cf. De Bruijne et al., 2006; Van Eeten et al., 2006; Roe et al., 2002).
(ii) From design and analysis to improvisation and experience
More volatility and complexity also means more unpredictability. As Demchak (1991, p. 3) has said, the chief manifestation of complexity is surprise. Operations move more often ‘outside analysis’, beyond the well-studied situations for which technology has been designed and procedures have been tested. Under these circumstances, relying on established procedures, routines and guidelines decreases rather than ensures reliability. In real-time, control room operators increasingly have to rely on their experience and improvisational abilities to deal with surprises and volatile events. Referential knowledge, improvisation, ‘instinct’ and experience gain precedence in comparison to detailed procedures and routines. It becomes more important to train operators to know when not to follow procedure and how to still maintain reliability.
(iii) From standardized and formal to real-time informal communication and coordination
The third shift moves infrastructure operations away from formal and hierarchical towards informal and ‘rich’ modes of communication and coordination. To put it differently: real-time resists formalization. Faced with surprises and threatening events, CI operations are constrained by hierarchical, unilateral, and formal modes of communication and coordination; albeit legacies from the pre-restructuring days or those installed after restructuring to ensure competition and level playing fields. Both types severely handicap operators’ abilities to improvise and provide reliable services. Especially when faced with reliability-threatening events, informal communication and coordination mechanisms take over or augment formal mechanisms. The need for real-time communication has already been identified in the literature on coordination in networks of organizations as well. In the absence of formal communication and coordination arrangements between organizations in networks, informal coordination and communication evolve and take over (cf. Chisholm, 1989). Powell (1990:304) finds information passed through networks (of organizations) must be “thicker” than information obtained through markets and “freer” than information communicated through hierarchies.
Real-time, ‘rich’ informal communication and coordination has been identified as one of the most important sources of networked reliability: “[R]eal-time values and privileges the non-routine over the routine, the informal over the formal, and the relational over the representational” (Roe et al., 2002:9-5). In other words, the ability of system operators to engage in a rich exchange of information and informal deals enhances their knowledge of system conditions, stimulates creativity and increases their options for maintaining reliability. To be sure, under ‘normal’ operating regimes, the need for ‘rich’ and varied communication and coordination is constrained by the competitive environment in which CIs nowadays operate. However, when threats occur and move towards real-time, ‘rich’ and informal communication and coordination become increasingly important. Real-time informal infrastructure operations enable types of interventions and control that are typically unacceptable at any other time or place.
(endnotes deleted for readability; citations kept in order to date the findings in the quoted text)
How this picture has changed to the present
As Paul Schulman and I were never invited back to CAISO after 2008, nothing in what follows can be interpreted as remarks about that grid transmission manager today. It should, however, be noted that we did find further evidence of CAISO moving closer to the edge of reliability performance with the introduction of new systemwide marketing software (Roe and Schulman 2016).
Rather than CAISO specifically, what is of interest here is a comparison and update of the notion of networked reliability scanned in the above quote and updated in our subsequent work (most recently, Roe and Schulman 2023).
For me, the most striking contrast is this: Some of us in this networked reliability strand of HRO Studies are having to spend considerable time on parsing out the features and processes of the interconnectivities between and among the networked critical infrastructures.
Stay with electricity as the example. The network of primary interest is no longer the one connecting the then-fragmented, deregulated units for generation, transmission and distribution of the once integrated energy utilities. Today’s electricity network of interest revolves around how it is interconnected with other “lifeline” infrastructures, not least of which are the large socio-technical systems for water, telecommunications and transportation.
Further, the configurations of these interconnections are far more varied than originally studied for maintaining the continuous provision of a critical service, even during (especially during) turbulent periods. Serial dependencies and reciprocal interdependencies are matched by pooled and mediated interconnectivities in a wide variety of permutations and combinations. Empirically, many more versions of Interconnected Critical Infrastructure Systems (ICISs) can be identified and demonstrated than even system modelers acknowledge to date (Roe and Schulman 2016).
One of the ironies of having this now-wider understanding of network interconnectivities is that the picture has become more granular and detailed for purposes of operations and management than was the case in describing the restructured utilities. The centrality of human ingenuity under urgent circumstances moves beyond the control room and into the field in periods of system disruption, failure, immediate emergency response and initial service restoration. “Rich” informal communications and coordination take place between different infrastructure staff when their respective system control variables overlap or are shared (e.g., the railroad bridge over a major shipping navigation way becomes stuck). In fact, there are cases where improvisations undertaken together by the different infrastructures, field staff and/or control rooms, are the real-time interconnectivities that matter for the respective operations.
Initial implications
Just as the extended quote of De Bruijne and Van Eeten is date-stamped by the then very live issue of energy deregulation, so my preceding update will be seen as date-stamped by what is today’s headline issue of emergency management in a world of interconnected critical infrastructure faced by all manner of crises.
But such dating is not the problem here. What remains a problem is that finding in De Bruijne and Van Eeten: “The research found both NAT and HRT flawed in their assumptions on the main relationships between the conditions that facilitate reliability and the levels of reliability achieved. The networked environment clearly emphasized different reliability-enhancing characteristics than those identified by NAT and HRT.” We–and I include myself here–are still being astonished by higher levels of reliability performance than current theories and expectations would lead us to believe. Aspirational high reliability seems to be transformed into high reliability management at least in some cases and without any guarantees for doing so in the future. To say this can’t go on forever is hardly the point; rather: How is this still happening?
How did CAISO survive the introduction of the disruptive, then-new system marketing software that we studied? How has China’s high-speed rail system been as reliable as it has been, given its massive size and scale? Does the capacity to achieve reliable normal operations in digital platforms–not by precluding or avoiding certain events but by adapting to electronic component and subsystem failure most anywhere and most all of the time–offer a very different skill-set for “reliability management” in other digitized critical infrastructures? Are there in fact more “control rooms” and “reliability professionals” out there than those of us who study them acknowledge?
Note that in asking these questions I am reproducing the same level of astonishment and question-asking that motivated the earliest HRO researchers with respect to the systems they studied. If so, studies of networked reliability–and HRO Studies as a whole?–have always had a future in search of more answers.
References [to be provided]