I think about failure a lot. Why do things go wrong and what can we do about it? Professionally I’m a software engineer, presently the Infrastructure Lead for Goodwater Capital. How does software fail and what can we do about it? How do human processes fail and what can we do about it? I’ve been discussing failure explicitly with folks lately and realized I needed a brief summary of my foundational thoughts on the matter. This little note is that.
To start, I’m a follower of the works of Professor Nancy Leveson, Professor Jim Gray, Professor Charles Perrow and Dr. Joe Armstrong. Joe, especially, was someone whose work I have admired since I first encountered it in undergrad and someone I admired personally. I wear fancy socks and get drunk at parties and talk about distributing the world’s computing environment in smart solar panels or mining the sun in his honor to this very day. I'm great at parties. Anyhow, influenced by these folks I believe the following:
- Faults in software systems occur along interface boundaries. That is, where distinct, working software systems are integrated.
- Faults are a result of resource constraints, interface contract violations or errors in business logic.
- Faults are inevitable. Reliable software allows its failing bits to fail and be restarted into a clean, working state. Let it crash.
- There is no such thing as human error, only process error.
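To make "let it crash" a little more concrete, here is a minimal sketch in Python. It is not how Erlang/OTP supervisors actually work, only an illustration of the idea: a failing piece is thrown away and replaced by a fresh one in a clean state, and the fault is escalated if restarts keep failing. The names `supervise`, `TransientFault` and `flaky_worker_factory` are all made up for this example.

```python
from typing import Callable


class TransientFault(Exception):
    """Hypothetical fault raised by a worker that cannot continue."""


def supervise(make_worker: Callable[[], Callable[[], str]],
              max_restarts: int = 3) -> str:
    """Run a worker; if it crashes, discard it and start a fresh one.

    After more than `max_restarts` restarts the fault is re-raised so
    that whatever sits above this supervisor can deal with it.
    """
    restarts = 0
    while True:
        worker = make_worker()  # clean state on every (re)start
        try:
            return worker()
        except TransientFault:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up and escalate


def flaky_worker_factory(fail_first: int) -> Callable[[], Callable[[], str]]:
    """Build workers that crash for the first `fail_first` attempts.

    The attempt counter stands in for the outside world (a flaky network,
    a full disk); it is deliberately not part of any worker's own state.
    """
    attempts = {"count": 0}

    def make_worker() -> Callable[[], str]:
        def worker() -> str:
            attempts["count"] += 1
            if attempts["count"] <= fail_first:
                raise TransientFault(f"attempt {attempts['count']} failed")
            return f"ok after {attempts['count']} attempts"
        return worker

    return make_worker


if __name__ == "__main__":
    print(supervise(flaky_worker_factory(2)))
```

The worker makes no attempt to repair itself. All recovery lives in the supervisor, which keeps the failing bit simple and the restart policy in one place.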
That’s why things fail. Now, what can we do about it? This is the hard bit and is hugely contextual (and is why I have a career). There are, however, some things that are absolutely worth reading to get a sense of what we can do.
- "Why Do Computers Stop and What Can Be Done About It?", Jim Gray, 1985
- "Making Reliable Distributed Systems in the Presence of Software Errors", Joe Armstrong, 2003
- "The Role of Software in Spacecraft Accidents", Nancy Leveson, 2004
- "Normal Accidents: Living with High-Risk Technologies", Charles Perrow, 1984
- "The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA", Diane Vaughan, 1997
- "Digital Apollo: Human and Machine in Spaceflight", David Mindell, 2011
- "The Imperative of Responsibility: In Search of an Ethics for the Technological Age", Hans Jonas, 1984
- "Imperial", William T. Vollmann, 2010
- "A Week on the Concord and Merrimack Rivers", Henry David Thoreau, 1849
Some of these are probably obvious reads – The Role of Software in Spacecraft Accidents – and some less so. Systems don't fail in isolation. They fail because they were designed to, whether safely or not, and this design is a matter of politics in the organization that did the designing, of the limits of human knowledge at the time of design and of the perseverance of people to keep a culture of maintenance alive. Worse, a thing may be unleashed on the world with no concern for how it will fail, leaving everyone in the future to cope with the consequences.
A Week on the Concord and Merrimack Rivers lays out a week's float trip that Thoreau and his brother took in a period when the United States was young and industrialization was just beginning to reshape the land. Thoreau laments the new locks and dams they find, which turned the rivers from natural things we lived with into industrial objects to be controlled. We live now in an industrial world. We know there are huge upsides and, well, extinction-level downsides. Thoreau, being a man of his time, can't have known what was coming, but the hint of it is there. He notes with grief that once-public fishing spots were converted to sluice gates and that certain fish, once prevalent in the rivers, are now quite rarely caught. Was mild flood control worth imperiling people's food supply down-river? Maybe. The point is, no one thought to ask ahead of time, and once the fish are dead and gone and the river's reshaped it's too late to backtrack.
No system is purely technical and this is nowhere more clear than in the study of failure. Being good at this kind of work is, I think, a matter of being broad in your learning. A narrow focus on 'purely technical' things will leave you blind.