Wheels roll over feet, kitchen knives slice into fingers, heaters set houses on fire, software crashes and loses work, and chemical plants blow up. Each of these things is man-made, and each is doing something it was not intended to do. As Charles Perrow puts it in Normal Accidents: Living with High-Risk Technologies, these "accidents" are intrinsic to the technology--wheels, knives, etc. Why do the things we build fail and why, when they do, are we so often surprised by the ways in which they fail? Can we build systems that function perfectly in all circumstances? Can we avoid all accidents with enough time, enough information, enough practice running the system or enough process around it?
I would wager that, for many of the practicing engineers reading this, the intuitive answer will be no, we cannot. Long experience of deliberately trying to build perfectly functioning systems, only to watch them fail anyway, tends to make one dubious about the possibility of pulling it off in practice. We invent new methods of developing software in the hope of making this more likely, and new formalisms to make verification of system behavior possible. While systematic testing in development (but not TDD, which its creators say is really more about design) puts the system under test through its paces before it reaches a production environment, the paces it is put through are determined by the human engineers and limited by their imaginations or experience. Similarly, sophisticated formalisms like dependent types or full-blown proof assistants make the specification of the system extremely exact. However, they can do nothing about externally coupled systems and their behaviors, which may be unpredictable and drive our finely crafted systems right off a cliff--literally in the case of a self-driving car hit by a semi-truck. Finally, the organization of humans around the running system adds a further complication in the form of mistaken assumptions about necessary maintenance, inattentive monitoring or, rarely, sabotage.
Perrow's contention in Normal Accidents is that certain system faults are intrinsic; there is no way to build a perfect system:
There are many improvements we can make that I will not dwell on, because they are fairly obvious--such as better operator training, safer designs, more quality control and more effective regulation. ... I will dwell upon characteristics of high-risk technologies that suggest that no matter how effective conventional safety devices are, there is a form of accident that is inevitable.1
These inevitable accidents, which persist despite all effort, are what Perrow terms 'normal', giving the book its title and strongly suggesting that systems cannot be effectively designed or evaluated without considering their potential failures. The characteristics that cause these normal accidents are, briefly stated:
linear and complex interactions within the system
tight and loose coupling with internal sub-systems and external systems
The definitions of these and their subtle interactions--irony of ironies--represent the bulk of Perrow's work. (The book is not as gripping a read as Digital Apollo--which I reviewed last month--but it does make a very thorough examination of the Three Mile Island accident, petrochemical facilities, the airline industry and large geo-engineering projects like mines and dams to elucidate the central argument.) With regard to interactions in systems, Perrow points out that:
Linear interactions are those in expected and familiar production or maintenance sequence, and those that are quite visible even if unplanned. Complex interactions are those of unfamiliar sequences, or unplanned and unexpected sequences and either not visible or not immediately comprehensible.2
Having previously established that his text follows the use of 'coupling' in the engineering sense as a connection, Perrow states:
Loosely coupled systems, whether for good or ill, can incorporate shocks and failures and pressure for change without destabilization (of the system). Tightly coupled systems will respond more quickly to these perturbations ... Loose coupling, then, allows certain parts of the system to express themselves according to their own logic or interests.
Roughly, you can think of the interactions as occurring within the system and the coupling as centered on external interfaces. When we test our systems--and we should all be testing--in small components and in a full, integrated environment, what we're doing, to use Perrow's terms, is driving the system through its linear interaction space. We can address the complex interactions with extensive batteries of tests in a production-like environment and with sophisticated monitoring. Yet testing can lull us into a false sense of mastery over the system, and instrumentation, itself an integrated component of the system, is not immune to normal accidents of its own. The Three Mile Island accident, for example, was caused largely by the complex interaction of the plant's instrumentation, its human operators and a redundant failure-handling sub-system:
Since there had been problems with this relief valve before ... an indicator had been recently added to the valve to warn operators if it did not reset. ... (S)ince nothing is perfect, it just so happened that this time the indicator itself failed ... Safety systems, such as warning lights, are necessary but they have the potential for deception. If there had been no light assuring them the valve had closed, the operators would have taken other steps to check the status of the valve ... (A)ny part of the system might be interacting with other parts in unanticipated ways.3
Perrow's book is studded with various asides. Some date the book in detrimental ways--the small political rants about US President Reagan come to mind--but others are particularly chilling:
Had the accident at Three Mile Island taken place in one of the plants near Moscow, it would have exposed the operators to potentially lethal doses, and irradiated a large population.4
When the book was published, the Chernobyl disaster was still two years away.
The key contribution of Perrow's Normal Accidents is not merely more material on how to avoid or mitigate accidents, though there is a bit of that. Instead, Perrow's ultimate focus is on determining not how a system should be constructed but whether it should be constructed at all. Given that every system will eventually fail--and fail in ways we cannot predict--are the failure conditions of the system something we can live with? If not, then we might rethink building the system.
It's here that we find Perrow's primary applicability to software engineering. In our work we are often tasked with building systems critical to the health of companies or of humanity in general. Each new feature brings a possibility of new failures, each new system in the architecture more so. Would the OpenSSL Heartbleed vulnerability have happened if there had been greater concern for limiting complex interactions within the code-base and for how experimental features were integrated? Perhaps not. Was the global cost worth it? Probably not.
Failure is inevitable in any system. The next time you're engineering something new or adapting something old, ask yourself, "What are the normal accidents here? Is there a design that meets the same end without these faults? If not, can I live with the results?" Recognizing, in a systematic way, that failures are intrinsic, and not something that can simply be worked away with enough testing or documentation or instrumentation, has been incredibly influential in how I approach my work. Perhaps Normal Accidents: Living with High-Risk Technologies will be similarly influential for you.
Originally published on the Huffington Post.