The last several years of my career have been spent in specialized teams doing specialized things. In a purely technical sense, that's been delightful and I've really learned a lot. My work at Dropbox has me on a split-experience team, and mentorship is part of my work responsibilities again. This post is inspired by a conversation I had with a team member.
So, some background. The project my team is working on is an extractive refactor of a network IO multiplexer. That is, there's an existing software system whose main responsibility is doing network IO reasonably fast, coupled in a non-trivial way to another software system, and my team's responsibility is to separate the two, creating loosely coupled interfaces as we go. Fairly typical bit of work. Since we're the Application Performance team, we're meant to be obsessive about the performance characteristics of the newly extracted system. The project is still in a demonstration phase – demonstrate that the extracted system can function in like manner to the existing setup – and so the codebase has a number of rough edges.
Here's one rough edge: none of the network IO performed by the system has retries associated with it. One of my junior colleagues rightly called this out as detrimental: on any failure of any of the bundled IO done during a request, we kill the whole request. The reasonable thing to do here, as my colleague correctly argued, is to retry up to some limit, with back-off to avoid stampeding a degraded IO target.
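What my colleague was asking for is a well-worn pattern: bounded retries with exponential backoff and jitter. A minimal sketch in Python – the function name and parameters are illustrative, not anything from our codebase:

```python
import random
import time

def retry_with_backoff(op, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call `op` until it succeeds, retrying on IOError up to a limit.

    Sleeps between attempts: exponential backoff, capped at
    `max_delay`, with full jitter so many clients retrying at once
    don't stampede a degraded target in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # limit reached; let the failure propagate
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

The "full jitter" choice – sleeping a uniform random fraction of the backoff window rather than the whole window – is what keeps a fleet of retrying clients from hitting a recovering target in synchronized waves.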
"Absolutely, absolutely," I replied, "there just hasn't been any time to write that yet. What we have is safe bad."
"Safe bad?"
"Yeah, uh, safe bad. So we know our current implementation is bad – and there are tickets to correct it – but at least it's safe. If we accidentally launch without correcting this issue then we'll, at least, never topple any of our downstream IO targets. All we do is harm ourselves."
Put another way, this system is "fail-safe". Systems said to be "fail-safe" share the property that the failure states they reach cannot harm connected mechanisms or their environment. "Fail-dangerous" systems lack that property, and "fail-deadly" systems, like a dead-man switch on a bomb, invert it intentionally. If you've ever seen TNG-era Star Trek you have a pretty good example of a "fail-dangerous" system: the warp core. Whenever the ship takes damage, Geordi calls up from Engineering to say that the warp core is about to explode, they're evacuating, and he'll eject the core at the last possible moment. What happens if the core won't eject, or Geordi misjudges the situation and ejects it with too little time to spare?
Well, not happy things. Gonna be a real tear-jerker of a time-travel episode to get everything patched back up.
TNG-era warp cores require active intervention to reach a safe failure state. Imagine how much less dramatic Star Trek episodes would be if the warp core sat on a great big spring and was ejected into space whenever its power output fell, because the core's own power ran the magnets keeping that great big spring from spronging it out into space. If the core can't generate enough power, phwoosh, out goes the core to explode harmlessly. If the core's power is temporarily interrupted and it goes phwoosh into space for no reason, well, they go collect it and plug it back in. Bad, but safe bad.
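Software has an analogue of that spring-and-magnets design: a dead-man switch that stays armed only while the monitored system keeps actively signalling health. A minimal sketch, with names of my own invention:

```python
import time

class DeadmanSwitch:
    """Armed only while heartbeats keep arriving.

    Like the magnets powered by the warp core itself: the safe
    action (disarming) is the default, and a continuous positive
    signal is the only thing holding it back.
    """
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._last_beat = time.monotonic()

    def heartbeat(self):
        # The monitored system calls this while it's healthy.
        self._last_beat = time.monotonic()

    def is_armed(self):
        # No active intervention is needed to reach safety:
        # silence alone trips the switch into the disarmed state.
        return (time.monotonic() - self._last_beat) < self.timeout_s
```

The important design property is that the safe state requires no action at all; a crash, hang, or power loss in the monitored system reaches it automatically.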
There are plenty of interesting real-world examples of fail-safe systems. Elevator brakes, for instance, are kept from braking by the tension of the cable. Cable snaps, brakes engage. Modern nuclear reactor designs self-limit when they get too hot: if the reaction starts to run away, it ruins the conditions required to sustain the reaction. Push lawnmowers only engage the blades while you're holding down the mower's lever. Space launches out of Florida aim east over the Atlantic, so if the craft suffers a breakup or is broken up by the range officer, the pieces fall harmlessly into the water. Failures, but safe failures.