
Deepwater Horizon: “Failure Modes” vs. “Fail-safe”

The Deepwater Horizon oil spill has been in the news a lot lately. Congressman Bart Stupak (D-MI) famously asked of the blowout preventer, “How can a device that has 260 failure modes be considered failsafe?”

I’ve heard the soundbite enough times that I feel like I need to blog about it.

I believe Stupak misunderstands the terms he’s using, and is thus confused. Or, quite possibly, the soundbite lacks enough context. Either way…

“Fail-safe” does not mean “has zero failure modes” or “cannot fail.” “Fail-safe” means, simply, that when a failure occurs, the design ensures that the failure happens in a safe manner.

Obviously, in the case of the blowout preventer on Deepwater Horizon, not all of its failure modes result in a safe failure. But the number of failure modes has nothing to do with whether a device is fail-safe. All else being equal, the fewer failure modes a thing has, the better, because there are fewer things that can go wrong with it. But a high failure mode count does not by itself make something a bad design. Quite likely it means the device has been analyzed and tested extensively, and is well understood. Identifying as many failure modes as possible is a good thing: you don’t want a failure mode that you don’t know about, or didn’t anticipate.
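To make the distinction concrete, here’s a toy sketch in Python. It is emphatically not how a real blowout preventer works; the class, names, and threshold are made up for illustration. The point is only that a device can have several identified failure modes (garbage sensor data, an overpressure event) and still be fail-safe, because the design drives every one of those failures to a safe state.

```python
from enum import Enum

class RamState(Enum):
    OPEN = "open"        # normal operation: well bore is open
    SHEARED = "sheared"  # safe state: pipe sheared, well sealed

class ShearRam:
    """Toy model of a fail-safe actuator.

    The design rule: if anything goes wrong while deciding what to do,
    fall back to the safe state rather than carrying on.
    """

    def __init__(self):
        self.state = RamState.OPEN

    def update(self, sensor_reading):
        try:
            pressure = float(sensor_reading)   # failure mode: garbage sensor data
            if pressure > 10_000:              # failure mode: overpressure event
                self.state = RamState.SHEARED
        except (TypeError, ValueError):
            # Unreadable input is itself a failure mode; failing safe means
            # sealing the well instead of guessing.
            self.state = RamState.SHEARED
        return self.state

bop = ShearRam()
print(bop.update(5_000))   # RamState.OPEN    -- normal operation
print(bop.update("n/a"))   # RamState.SHEARED -- failure handled safely
```

The design choice worth noticing is the catch-all: anything the controller doesn’t understand is treated as a reason to go to the safe state, not a reason to keep going.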

I don’t know anything at all about drilling for oil, but to anyone with an engineering background who understands what the terms “failure mode” and “fail-safe” mean, Stupak’s question sounds idiotic.

This sort of situation is something that people who work in technical fields encounter all the time. So naturally, the soundbite irritates me every time I hear it.

Customers and upper management usually seem to think that it’s the engineer’s job to design a solution that handles any conceivable problem, and does so gracefully. Moreover, they want solutions that are delivered on time and under budget. They want to never have to think about the tools they use, especially if that means having to understand how the tool works. They want their solutions to “just work”. Well, don’t we all.

This desire for solutions that just work and don’t have problems is indeed rational. Unfortunately, it is unrealistic. We can only do our best to build solutions that handle as many foreseeable problems as we can envision, as gracefully as we are able. That does not mean there will never be problems. Nor does it mean that an operator with some understanding of what’s going on inside the tools they rely on won’t always be more useful than one who has no clue.

I try to deliver products which are as forgiving to the clueless as possible, but a clueless operator will always be at a disadvantage. The purpose of human enterprise should not be to create a world that enables people to be clueless and takes care of them as though they are helpless. The idea should be that the tools we use remove burdens from the user so that they are empowered to do more. A great tool removes those burdens, but also lets the user understand the work being done by the tool when they need to.

It is my fervent belief that the world is improved by people who seek to understand it. Absent understanding, one can only be at the mercy of what one fails to understand. The modern world is a very complicated place indeed, so the less one needs to actively deal with in order to get by, the better. But that is not to say that ignorance is desirable, a virtue, or bliss.

I’m not saying that the Deepwater Horizon disaster couldn’t have been prevented. Very likely it could have; not all of the possible preventative measures were implemented.

But misunderstandings such as conflating failure mode count with fail-safety certainly do not help to bring about a culture of safety. If the executives and legislators who shape the policies governing the systems we must design to be safe cannot understand the terms they use, or the implications of what technical people tell them, then only badness can result.

Too often, engineers who are doing nothing more than speaking plainly are not understood when they raise potential problems, only to be scapegoated when a worst-case scenario comes to pass. But ultimately, responsibility should be owned by the people in control who set policy on what counts as an acceptable level of risk. If those people cannot understand the language used to communicate about risk, then they do not deserve the power to decide which risks to accept.