In the current world scenario, which is marked by high volatility, uncertainty, complexity, and ambiguity (VUCA), threats are increasingly unforeseen. As organizations, we are striving for this concept of Resilience.
Resilience is one of those hot words, and like many hot business terms it can mean a few different things depending on who is using it, and that can lead to confusion. I tend to see the following uses, which are similar in theme.
Physics: the property of a material to absorb energy when deformed without fracturing or breaking; in other words, the material’s elasticity.
Ecology: the capacity of an ecosystem to absorb and respond to disturbances without permanent damage to the relationships between species.
Psychology: an individual’s coping mechanisms and strategies.
Organizational and management studies: the ability to maintain an acceptable level of service in the face of periodic or catastrophic, systemic and singular faults and disruptions (e.g. natural disasters, cyber or terrorist attacks, supply chain disturbances).
For our purposes, resilience can be viewed as the ability of an organization to maintain quality over time, in the face of faults and disruptions. Given we live in a time of disruption, resilience is obviously of great interest to us.
In my post “Principles behind a good system” I lay out eight principles for good system development. Resilience is not a principle; it is an outcome. It is through applying our principles that we gain resilience. However, like any outcome, we need to design for it deliberately.
I recently had a discussion with one of the best root cause investigators and problem solvers I know, Thor McRockhammer. Thor had concerns about a case where the expected conditions were not met, and there were indications that individuals engaged in troubleshooting that not only made the problem worse but led to a set of issues that appear systemic.
Our conversation (which I do not want to go into too much detail on) was a great example of troubleshooting going wrong.
Troubleshooting is defined as “Reactive problem solving based upon quick responses to immediate symptoms. Provides relief and immediate problem mitigation. But may fail to get at the real cause, which can lead to prolonged cycles of firefighting.” Troubleshooting usually goes wrong in one of a few ways:

- Not knowing when troubleshooting should not be executed
- Using troubleshooting exclusively
- Not knowing when to move to other problem-solving tools (usually “gap from standard”) or to trigger other quality systems, such as change management
Troubleshooting is a reactive process of fixing problems by rapid response and short-term corrective actions. It covers noticing the problem, stopping the damage and preventing spread of the problem.
So if our departure from expected conditions was a leaky gasket, then troubleshooting is to try to stop the leak. If our departure is a missing cart then troubleshooting usually involves finding the cart.
Troubleshooting puts things back into the expected condition without changing anything. It addresses the symptom, not the fundamental problems and their underlying causes. It is carried out directly by the people who experience the symptoms, relying upon thorough training, expertise, and procedures designed explicitly for troubleshooting.
With our leaky gasket example, our operators are trained and have procedural guidance to tighten or even replace a gasket. They also know what not to do (for example, don’t weld the pipe, don’t use a different gasket). There is also a process for documenting that the troubleshooting happened (work order, comment, etc.).
To avoid the problems listed above troubleshooting needs a process that people can be thoroughly trained in. This process needs to cover what to do, how to communicate it, and where the boundaries are.
Problem

- What do we know about the exact nature of the problem?
- What do your standards say about how this concern should be documented?
  - For example, can it be addressed as a comment, or does it require a deviation or similar non-conformance?
- If the concern stems from a requirement it must be documented.

Cause

- What do you know about the apparent (or root) cause of the problem?
- Troubleshooting is really good at dealing with superficial cause-and-effect relationships. As the cause deepens, fixing it requires deeper problem-solving.
- The cause can be a deficiency or departure from a standard.

Countermeasure

- What immediate or temporary countermeasures can be taken to reduce or eliminate the problem?
- Are follow-up or more permanent countermeasures required to prevent recurrence?
  - If so, do you need to investigate more deeply?
- Countermeasures need to be evaluated against change management.
- Countermeasures cannot ignore, replace, or go around standards.
- Apply good knowledge management.

Check results

- Did the results of the action have any immediate effect on eliminating the concern or problem?
- Does the problem repeat?
  - If so, do you need to investigate more deeply?
- Recurrence should trigger deeper problem-solving and be recorded in the quality system.
- Beware of troubleshooting countermeasures becoming tribal knowledge and the new way of working.
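The checkpoints above (problem, cause, countermeasure, check results) can be sketched as a minimal troubleshooting record. This is an illustrative sketch only; the class and field names are hypothetical and not drawn from any particular quality system.

```python
from dataclasses import dataclass

@dataclass
class TroubleshootingRecord:
    """Hypothetical sketch of a minimal troubleshooting record."""
    problem: str               # exact nature of the concern
    documented_as: str         # e.g. "comment", "work order", "deviation"
    apparent_cause: str        # the superficial cause-and-effect observed
    countermeasure: str        # immediate or temporary fix applied
    recurrence_count: int = 0  # incremented each time the problem repeats

    def needs_deeper_problem_solving(self) -> bool:
        # Recurrence should trigger deeper problem-solving
        # and be recorded in the quality system.
        return self.recurrence_count > 0

record = TroubleshootingRecord(
    problem="leaky gasket on a transfer line",
    documented_as="work order",
    apparent_cause="gasket loosened by vibration",
    countermeasure="tightened gasket per procedure",
)
record.recurrence_count += 1  # the leak reappears a week later
print(record.needs_deeper_problem_solving())  # True
```

Capturing even this much forces the troubleshooting to be documented, rather than living only as tribal knowledge.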
Troubleshooting is in a box
Think of your standards as a box. This box defines what should happen per our qualified/validated state, our procedures, and the like. We can troubleshoot as much as we want within the box. We cannot change the box in any way, nor can we leave the box without triggering our deviation/nonconformance system (reactive) or entering change management (proactive).
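The box metaphor reduces to a simple routing rule. A sketch with hypothetical names, not a real quality system:

```python
def route_action(within_standards: bool, proactive: bool = False) -> str:
    """Decide which quality process applies (illustrative rule only).

    within_standards: the action stays inside the qualified/validated
    "box" defined by procedures.
    proactive: we want to change the box itself, rather than react
    to a departure from it.
    """
    if within_standards:
        return "troubleshoot"          # free to act inside the box
    if proactive:
        return "change management"     # deliberately changing the box
    return "deviation/nonconformance"  # reacting to leaving the box

print(route_action(within_standards=True))                   # troubleshoot
print(route_action(within_standards=False, proactive=True))  # change management
print(route_action(within_standards=False))                  # deviation/nonconformance
```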
Communication is critical for troubleshooting
Troubleshooting processes need a mechanism for letting supervisors know what is happening. Troubleshooting that happens in the dark is going to cause a disaster.
Operators need to be trained in how to document troubleshooting. Sometimes this is as simple as a notation or comment; other times it triggers a corrective action process.
Engaging in troubleshooting and not documenting it starts to look like fraud and is a data integrity concern.
Change Management
The change management process should be triggered out of troubleshooting when needed, and operators should be trained to recognize when. This is often where the concepts of exact replacements and like-for-like come in.
Replacing a part with an exact replacement is troubleshooting. Anything else (including functional equivalency) is a higher-order change management activity. It is important that everyone involved knows the difference.
Like-for-like
Covers: spare parts that are identical replacements (same manufacturer, part number, material of construction, version), and existing contingency procedures (documented, verified, ideally part of qualification/validation).
Is it troubleshooting? Yes. This should be built into procedures like corrective maintenance, spare parts, operations, and even contingency procedures.

Functionally equivalent
Covers: equivalence in, for example, performance, specifications, physical characteristics, usability, maintainability, cleanability, or safety.
Is it troubleshooting? No. You need to understand the root cause and undergo the appropriate level of change management.

New
Covers: anything else.
Is it troubleshooting? No. You need to understand the root cause and undergo the appropriate level of change management.
This applies to both administrative and technical controls.
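The like-for-like rule lends itself to a simple check: a replacement is like-for-like only when every identity attribute matches exactly. A sketch, with hypothetical part attributes and values:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Part:
    manufacturer: str
    part_number: str
    material_of_construction: str
    version: str

def is_like_for_like(installed: Part, replacement: Part) -> bool:
    """True only if every identity attribute is identical.
    Anything else, including a functional equivalent, goes through
    the appropriate level of change management instead."""
    return installed == replacement  # dataclass equality compares all fields

old = Part("AcmeSeals", "GSK-100", "EPDM", "v2")
exact = Part("AcmeSeals", "GSK-100", "EPDM", "v2")
similar = Part("OtherCo", "GS-100X", "EPDM", "v1")  # functionally equivalent

print(is_like_for_like(old, exact))    # True  -> troubleshooting
print(is_like_for_like(old, similar))  # False -> change management
```

The point of making the rule this strict is that the decision requires no judgment from the person at the equipment; anything that fails the check escalates.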
ITIL Incident Management
ITIL Incident Management (or similar models) is just troubleshooting and has all the same hallmarks and concerns.
Conclusion
Troubleshooting is an integral process, and like all processes it should have a standard, be trained on, and go through continuous improvement.