I recently had a discussion with one of the best root cause investigators and problem solvers I know, Thor McRockhammer. Thor had concerns about a case where the expected conditions were not met and there were indications that individuals had engaged in troubleshooting. As a result, they not only made the problem worse but also created a set of issues that seem rather systematic.
Our conversation (which I do not want to go into too much detail on) was a great example of troubleshooting going wrong.
Troubleshooting is defined as “Reactive problem solving based upon quick responses to immediate symptoms. Provides relief and immediate problem mitigation. But may fail to get at the real cause, which can lead to prolonged cycles of firefighting.” Troubleshooting usually goes wrong in one of a few ways:
- Not knowing when troubleshooting shouldn’t be executed
- Using troubleshooting exclusively
- Not knowing when to go to other problem solving tools (usually “Gap from standard”) or to trigger other quality systems, such as change management.
Troubleshooting is a reactive process of fixing problems by rapid response and short-term corrective actions. It covers noticing the problem, stopping the damage and preventing spread of the problem.
So if our departure from expected conditions was a leaky gasket, then troubleshooting is to try to stop the leak. If our departure is a missing cart then troubleshooting usually involves finding the cart.
Troubleshooting puts things back into the expected condition without changing anything. It addresses the symptom and not the fundamental problems and their underlying causes. It is carried out directly by the people who experience the symptoms, relying upon thorough training, expertise and procedures designed explicitly for troubleshooting.
With our leaky gasket example, our operators are trained and have procedural guidance to tighten or even replace a gasket. They also know what not to do (for example, don’t weld the pipe, don’t use a different gasket, etc.). There is also a process for documenting that the troubleshooting happened (work order, comment, etc.).
To avoid the problems listed above, troubleshooting needs a process that people can be thoroughly trained in. This process needs to cover what to do, how to communicate it, and where the boundaries are.

| Step | What we do | Things to be aware of |
| --- | --- | --- |
| Concern | · What do we know about the exact nature of the problem? | · What do your standards say about how this concern should be documented? o For example, can it be addressed as a comment, or does it require a deviation or similar non-conformance? · If the concern stems from a requirement, it must be documented. |
| Cause | · What do you know about the apparent (or root) cause of the problem? | · Troubleshooting is really good at dealing with superficial cause-and-effect relationships. As the cause deepens, fixing it requires deeper problem-solving. · The cause can be a deficiency or departure from a standard. |
| Countermeasure | · What immediate or temporary countermeasures can be taken to reduce or eliminate the problem? · Are follow-up or more permanent countermeasures required to prevent recurrence? o If so, do you need to investigate more deeply? | · Countermeasures need to be evaluated against change management. · Countermeasures cannot ignore, replace or go around standards. · Apply good knowledge management. |
| Check results | · Did the results of the action have any immediate effect on eliminating the concern or problem? · Does the problem repeat? o If so, do you need to investigate more deeply? | · Recurrence should trigger deeper problem-solving and be recorded in the quality system. · Beware of troubleshooting countermeasures becoming tribal knowledge and the new way of working. |
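
The boundaries in the table above can be read as a small set of escalation rules. The following is only a minimal sketch of that reading in Python, not a validated procedure: the `TroubleshootingRecord` class, its field names, the `next_actions` function and the action wording are all hypothetical names invented for illustration, and a real quality system would define its own triggers and records.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TroubleshootingRecord:
    """Hypothetical record of one troubleshooting event (illustrative only)."""
    concern: str
    stems_from_requirement: bool           # Concern: traces back to a requirement
    countermeasure_changes_standard: bool  # Countermeasure: would ignore, replace or go around a standard
    effect_confirmed: bool                 # Check results: the action eliminated the concern
    recurrences: int                       # Check results: times the same problem has come back


def next_actions(record: TroubleshootingRecord) -> List[str]:
    """Return the follow-up actions suggested by the boundaries in the table."""
    actions: List[str] = []
    if record.stems_from_requirement:
        actions.append("document the concern formally")
    if record.countermeasure_changes_standard:
        actions.append("stop and evaluate under change management")
    if not record.effect_confirmed or record.recurrences > 0:
        actions.append("start deeper problem-solving")
    if record.recurrences > 0:
        actions.append("record recurrence in the quality system")
    if not actions:
        # Nothing crossed a boundary: troubleshooting was enough this time.
        actions.append("record the troubleshooting and close")
    return actions


# A gasket that was tightened once and stayed tight, versus one that keeps leaking.
one_off = TroubleshootingRecord("leaky gasket", False, False, True, 0)
repeat = TroubleshootingRecord("leaky gasket", False, False, True, 2)
print(next_actions(one_off))
print(next_actions(repeat))
```

The point of the sketch is that the person troubleshooting does not decide case by case whether to escalate. The triggers are agreed in advance, which is what makes them trainable.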