Managing Events Systematically

Being good at problem-solving is critical to success in an organization. I’ve written quite a bit on problem-solving, but here I want to tackle the amount of effort we should apply.

Not all problems should be treated the same. There are also levels of problems. And these two aspects can contribute to some poor problem-solving practices.

It helps to look at problems systematically across our organization. The iceberg analogy is a pretty popular way to break this done focusing on Events, Patterns, Underlying Structure, and Mental Model.

Iceberg analogy

Events

Events start with the observation or discovery of a situation that is different in some way. What is being observed is a symptom and we want to quickly identify the problem and then determine the effort needed to address it.

This is where Art Smalley’s Four Types of Problems comes in handy to help us take a risk-based approach to determining our level of effort.

Type 1 problems, Troubleshooting, allows us to set problems with a clear understanding of the issue and a clear pathway. Have a flat tire? Fix it. Have a document error, fix it using good documentation practices.

It is valuable to work the way through common troubleshooting and ensure the appropriate linkages between the different processes, to ensure a system-wide approach to problem solving.

Corrective maintenance is a great example of troubleshooting as it involved restoring the original state of an asset. It includes documentation, a return to service and analysis of data. From that analysis of data problems are identified which require going deeper into problem-solving. It should have appropriate tie-ins to evaluate when the impact of an asset breaking leads to other problems (for example, impact to product) which can also require additional problem-solving.

It can be helpful for the organization to build decision trees that can help folks decide if a given problem stays as troubleshooting or if it it also requires going to type 2, “gap from standard.”

Type 2 problems, gap from standard, means that the actual result does not meet the expected and there is a potential of not meeting the core requirements (objectives) of the process, product, or service. This is the place we start deeper problem-solving, including root cause analysis.

Please note that often troubleshooting is done in a type 2 problem. We often call that a correction. If the bioreactor cannot maintain temperature during a run, that is a type 2 problem but I am certainly going to immediately apply troubleshooting as well. This is called a correction.

Take documentation errors. There is a practice in place, part of good documentation practices, for addressing troubleshooting around documents (how to correct, how to record a comment, etc). By working through the various ways documentation can go wrong, applying which ones are solved through troubleshooting and don’t involve type 2 problems, we can create a lot of noise in our system.

Core to the quality system is trending, looking for possible signals that require additional effort. Trending can help determine where problems lay and can also drive up the level of effort necessary.

Underlying Structure

Root Cause Analysis is about finding the underlying structure of the problem that defines the work applied to a type 2 problem.

Not all problems require the same amount of effort, and type 2 problems really have a scale based on consequences, that can help drive the level of effort. This should be based on the impact to the organization’s ability to meet the quality objectives, the requirements behind the product or service.

For example, in the pharma world there are three major criteria:

  •  safety, rights, or well-being of patients (including subjects and participants human and non-human)
  • data integrity (includes confidence in the results, outcome, or decision dependent on the data)
  • ability to meet regulatory requirements (which stem from but can be a lot broader than the first two)

These three criteria can be sliced and diced a lot of ways, but serve our example well.

To these three criteria we add a scale of possible harm to derive our criticality, an example can look like this:

ClassificationDescription
CriticalThe event has resulted in, or is clearly likely to result in, any one of the following outcomes:   significant harm to the safety, rights, or well-being of subjects or participants (human or non-human), or patients; compromised data integrity to the extent that confidence in the results, outcome, or decision dependent on the data is significantly impacted; or regulatory action against the company.
MajorThe event(s), were they to persist over time or become more serious, could potentially, though not imminently, result in any one of the following outcomes:  
harm to the safety, rights, or well-being of subjects or participants (human or non-human), or patients; compromised data integrity to the extent that confidence in the results, outcome, or decision dependent on the data is significantly impacted.
MinorAn isolated or recurring triggering event that does not otherwise meet the definitions of Critical or Major quality impacts.
Example of Classification of Events in a Pharmaceutical Quality System

This level of classification will drive the level of effort on the investigation, as well as drive if the CAPA addresses underlying structures alone or drives to addressing the mental models and thus driving culture change.

Mental Model

Here is where we address building a quality culture. In CAPA lingo this is usually more a preventive action than a corrective action. In the simplest of terms, corrective actions is address the underlying structures of the problem in the process/asset where the event happened. Preventive actions deal with underlying structures in other (usually related) process/assets or get to the Mindsets that allowed the underlying structures to exist in the first place.

Solving Problems Systematically

By applying this system perspective to our problem solving, by realizing that not everything needs a complete rebuild of the foundation, by looking holistically across our systems, we can ensure that we are driving a level of effort to truly build the house of quality.

Bystander Effect, Open Communication a​nd Quality Culture

Our research suggests that the bystander effect can be real and strong in organizations, especially when problems linger out in the open to everyone’s knowledge. 

Insiya Hussain and Subra Tangirala (January 2019) “Why Open Secrets Exist in Organizations” Harvard Business Review

The bystander effect occurs when the presence of others discourages an individual from intervening in an emergency situation. When individuals relinquish responsibility for addressing a problem, the potential negative outcomes are wide-ranging. While a great deal of the research focuses on helping victims, the overcoming the bystander effect is very relevant to building a quality culture.

The literature on this often follows after social psychologists John M. Darley and Bibb Latané who identified the concept in the late ’60s. They defined five characteristics bystanders go through:

  1. Notice that something is going on
  2. Interpret the situation as being an emergency
  3. Degree of responsibility felt
  4. Form of assistance
  5. Implement the action choice

This is very similar to the 5 Cs of trouble-shooting: Concern (Notice), Cause (Interpret), Countermeasure (Form of Assistance and Implement), Check results.

What is critical here is that degree of responsibility felt. Without it we see people looking at a problem and shrugging, and then the problem goes on and on. It is also possible for people to just be so busy that the degree of responsibility is felt to the wrong aspect, such as “get the task done” or “do not slow down operations” and it leads to the wrong form of assistance – the wrong troubleshooting.

When building a quality culture, and making sure troubleshooting is an ingrained activity, it is important to work with employees so they understand that their voices are not redundant and that they need to share their opinions even if others have the same information. As the HBR article says: “If you see something, say something (even if others see the same thing).”

Building a quality culture is all about building norms which encourage detection of potential threats or problems and norms which encouraged improvements and innovation.

When troubleshooting causes trouble

I recently had a discussion with one of the best root cause investigators and problem solvers I know, Thor McRockhammer. Thor had concerns about a case where the expected conditions were not met and there were indications that individuals engaged in troubleshooting and as a result not only made the problem worse but led to a set of issues that seem rather systematic.

Our conversation (which I do not want to go into too much detail on) was a great example of troubleshooting going wrong.

Troubleshooting is defined as “Reactive problem solving based upon quick responses to immediate symptoms. Provides relief and immediate problem mitigation. But may fail to get at the real cause, which can lead to prolonged cycles of firefighting.” Troubleshooting usually goes wrong one of a few ways:

  1. Not knowing when troubleshooting shouldn’t be executed
  2. Using troubleshooting exclusively
  3. Not knowing when to go to other problem solving tools (usually “Gap from standard”) or to trigger other quality systems, such as change management.

Troubleshooting is a reactive process of fixing problems by rapid response and short-term corrective actions. It covers noticing the problem, stopping the damage and preventing spread of the problem.

So if our departure from expected conditions was a leaky gasket, then troubleshooting is to try to stop the leak. If our departure is a missing cart then troubleshooting usually involves finding the cart.

Troubleshooting puts things back into the expected condition without changing anything. It addresses the symptom and not the fundamental problems and their underlying causes. They are carried out directly by the people who experience the symptoms, relying upon thorough training, expertise and procedures designed explicitly for troubleshooting.

With out leaky gasket example, our operators are trained and have procedural guidance to tighten or even replace a gasket. They also know what not to do (for example don’t weld the pipe, don’t use a different gasket, etc). There is also a process for documenting the troubleshooting happened (work order, comment, etc).

To avoid the problems listed above troubleshooting needs a process that people can be thoroughly trained in. This process needs to cover what to do, how to communicate it, and where the boundaries are.

4 Cs of Trouble shooting from Art Smalley’s Four Types of Problems

Step

What we do

Things to be aware of

Concern

·       What do we known about the exact nature of the problem?

·       What do your standards say about how this concern should be documented?

o   For example, can be addressed as a comment or does it require a deviation or similar non-conformance

·       If the concern stems from a requirement it must be documented.

Cause

·       What do you know about the apparent (or root) cause of the problem?

·       Troubleshooting is really good at dealing with superficial cause-and-effect relationships. As the cause deepens, fixing it requires deeper problem-solving.

·       The cause can be a deficiency or departure from a standard

 

Countermeasure

·       What immediate or temporary countermeasures can be taken to reduce or eliminate the problem?

·       Are follow-up or more permanent countermeasures required to prevent recurrence?

o   If so, do you need to investigate more deeply?

·       Countermeasures need to be evaluated against change management

·       Countermeasures cannot ignore, replace or go around standards

·       Apply good knowledge management

Check results

·       Did the results of the action have any immediate effect on eliminating the concern or problem?

·       Does the problem repeat?

o   If so, do you need to investigate more deeply?

·       Recurrence should trigger deeper problem-solving and be recorded in the quality system.

·       Beware troubleshooting countermeasures becoming tribal knowledge and the new way of working

 

Trouble shooting is in a box

Think of your standards as a box. This box defines what should happen per our qualified/validated state, our procedures, and the like. We can troubleshoot as much as we want within the box. We cannot change the box in any way, nor can we leave the box without triggering our deviation/nonconformance system (reactive) or entering change management (proactive).

Communication is critical for troubleshooting

Troubleshooting processes need a mechanism for letting supervisors happen. Troubleshooting that happens in the dark is going to cause a disaster.

Operators need to be trained how to document troubleshooting. Sometimes this is as simple as a notation or comment, other-times you trigger a corrective action process.

Engaging in troubleshooting, and not documenting it starts to look a like fraud and is a data integrity concern.

Change Management

The change management process should be triggered as a result of troubleshooting. Operators should be trained to interpret it. This is often were concept of exact replacements and like-for-like come in.

It is trouble shooting to replace a part with an exact part. Anything else (including functional equivalency) is a higher order change management activity. It is important that everyone involved knows the difference.

CoversIs it troubleshooting?
Like-for- LikeSpare parts that are identical replacements (has the same the same manufacturer, part number, material of construction, version)

Existing contingency procedures (documented, verified, ideally part of qualification/validation)
Yes

This should be built into procedures like corrective maintenance, spare parts, operations and even contingency procedures.
Functionally equivalentEquivalent, for example, performance, specifications, physical characteristics, usability, maintainability, cleanability, safety

No

Need to understand root cause.
Need to undergo appropriate level of change management
NewAnything else No

Need to understand root cause.
Need to undergo appropriate level of change management

This applies to both administrative and technical controls.

ITIL Incident Management

ITIL Incident Management (or similar models) is just troubleshooting and has all the same hallmarks and concerns.

Conclusion

Trouble shooting is an integral process. And like all processes it should have a standard, be trained on, and go through continuous improvement.