Each and every process should go through the following steps:
Define those problems that should be escalated and those that should not. Everyone working in a process should have the same definition of what a problem is. We often end up with a hierarchy: issues that are solved within the process (Level 1) and issues that go to a root cause process such as deviation/CAPA (Level 2).
Identify the ways to notice a problem. Make the work as visual as possible so it is easier to detect the problem.
Define the escalation method. There should be one clear way to surface a problem. There are many ways to create a signal, but it should be simple, timely, and very clear.
These three elements make up the request for help.
The next two steps make up the response to that request.
How does the individual respond, and most importantly when? This should be standardized so the other end of that help chain is not wondering whether, when, and in what form that help is going to arrive.
In order for this to work, it is important to identify clear ownership of the problem. There always must be one person clearly accountable, even if only responsible for bits, so they can push the problem forward.
It is easy for problem-solving to stall, so make sure progress is transparent. Knowing what is being worked on, and what is not, is critical.
Prioritization is key. Not every problem needs solving, so have a mechanism to ensure the right problems are being solved in the process.
It helps to look at problems systematically across our organization. The iceberg analogy is a popular way to break this down, focusing on Events, Patterns, Underlying Structure, and Mental Model.
Events start with the observation or discovery of a situation that is different in some way. What is being observed is a symptom and we want to quickly identify the problem and then determine the effort needed to address it.
This is where Art Smalley’s Four Types of Problems comes in handy to help us take a risk-based approach to determining our level of effort.
Type 1 problems, troubleshooting, are problems where the issue is clearly understood and the path to a fix is clear. Have a flat tire? Fix it. Have a document error? Fix it using good documentation practices.
It is valuable to work through common troubleshooting scenarios and ensure the appropriate linkages between the different processes, so that problem solving takes a system-wide approach.
Corrective maintenance is a great example of troubleshooting, as it involves restoring the original state of an asset. It includes documentation, a return to service, and analysis of data. That analysis can identify problems that require going deeper into problem-solving. It should also have appropriate tie-ins to evaluate when an asset breaking leads to other problems (for example, impact to product), which can require additional problem-solving.
It can be helpful for the organization to build decision trees that help folks decide whether a given problem stays as troubleshooting or also requires going to type 2, “gap from standard.”
Type 2 problems, gap from standard, means that the actual result does not meet the expected and there is a potential of not meeting the core requirements (objectives) of the process, product, or service. This is the place we start deeper problem-solving, including root cause analysis.
Please note that troubleshooting is often done within a type 2 problem. We call that a correction. If the bioreactor cannot maintain temperature during a run, that is a type 2 problem, but I am certainly going to apply troubleshooting immediately as well.
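The kind of decision tree described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Triage` record, the `triage` function, and its three yes/no inputs are hypothetical names I am introducing here, and a real decision tree would carry far more organization-specific branches.

```python
from dataclasses import dataclass


@dataclass
class Triage:
    problem_type: int       # 1 = troubleshooting, 2 = gap from standard
    needs_correction: bool  # an immediate fix (troubleshooting) applies
    needs_root_cause: bool  # deeper problem-solving, including root cause analysis


def triage(result_meets_expected: bool,
           requirements_at_risk: bool,
           immediate_fix_available: bool) -> Triage:
    """Hypothetical decision tree: type 2 when the actual result misses the
    expected result AND core requirements might not be met."""
    is_type_2 = (not result_meets_expected) and requirements_at_risk
    return Triage(
        problem_type=2 if is_type_2 else 1,
        # A correction can apply to either type of problem.
        needs_correction=immediate_fix_available,
        needs_root_cause=is_type_2,
    )


# A document error fixed under good documentation practices.
doc_error = triage(result_meets_expected=False,
                   requirements_at_risk=False,
                   immediate_fix_available=True)

# A bioreactor that cannot maintain temperature during a run.
bioreactor = triage(result_meets_expected=False,
                    requirements_at_risk=True,
                    immediate_fix_available=True)

print(doc_error)
print(bioreactor)
```

Note that the bioreactor case comes back as type 2 with both a correction and root cause analysis flagged, which mirrors the point that troubleshooting still happens inside a type 2 problem.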
Take documentation errors. There is a practice in place, part of good documentation practices, for troubleshooting documents (how to correct, how to record a comment, etc.). By working through the various ways documentation can go wrong and identifying which ones are solved through troubleshooting and do not involve type 2 problems, we can avoid creating a lot of noise in our system.
Core to the quality system is trending: looking for possible signals that require additional effort. Trending can help determine where problems lie and can also drive up the level of effort necessary.
Root Cause Analysis is about finding the underlying structure of the problem that defines the work applied to a type 2 problem.
Not all problems require the same amount of effort, and type 2 problems really sit on a scale of consequences that can help drive the level of effort. This scale should be based on the impact to the organization’s ability to meet the quality objectives, the requirements behind the product or service.
For example, in the pharma world there are three major criteria:
safety, rights, or well-being of patients (including subjects and participants, human and non-human)
data integrity (includes confidence in the results, outcome, or decision dependent on the data)
ability to meet regulatory requirements (which stem from but can be a lot broader than the first two)
These three criteria can be sliced and diced a lot of ways, but serve our example well.
To these three criteria we add a scale of possible harm to derive our criticality. An example can look like this:
Critical: The event has resulted in, or is clearly likely to result in, any one of the following outcomes: significant harm to the safety, rights, or well-being of subjects or participants (human or non-human), or patients; compromised data integrity to the extent that confidence in the results, outcome, or decision dependent on the data is significantly impacted; or regulatory action against the company.
Major: The event(s), were they to persist over time or become more serious, could potentially, though not imminently, result in any one of the following outcomes: harm to the safety, rights, or well-being of subjects or participants (human or non-human), or patients; compromised data integrity to the extent that confidence in the results, outcome, or decision dependent on the data is significantly impacted.
Minor: An isolated or recurring triggering event that does not otherwise meet the definitions of Critical or Major quality impacts.
Example of Classification of Events in a Pharmaceutical Quality System
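To make the classification logic concrete, here is a minimal sketch of how the three definitions could be applied in order. The `Outlook` enum, the `classify_event` function, and the label "Minor" for the lowest tier are my own assumptions for illustration; the definitions above only name Critical and Major explicitly. The sketch also reads the definitions literally, so possible future regulatory action alone does not raise an event to Major.

```python
from enum import Enum


class Outlook(Enum):
    """Hypothetical scale for how close an outcome is to happening."""
    ACTUAL_OR_CLEARLY_LIKELY = 1  # has resulted in, or is clearly likely to result in
    POSSIBLE_IF_PERSISTS = 2      # could result in, if the event persists or worsens
    NOT_EXPECTED = 3


def classify_event(patient_harm: Outlook,
                   data_integrity: Outlook,
                   regulatory_action: Outlook) -> str:
    """Apply the definitions top-down: Critical first, then Major, else Minor."""
    if Outlook.ACTUAL_OR_CLEARLY_LIKELY in (patient_harm, data_integrity, regulatory_action):
        return "Critical"
    # The Major definition lists only harm and data integrity outcomes.
    if Outlook.POSSIBLE_IF_PERSISTS in (patient_harm, data_integrity):
        return "Major"
    return "Minor"


example = classify_event(
    patient_harm=Outlook.NOT_EXPECTED,
    data_integrity=Outlook.POSSIBLE_IF_PERSISTS,
    regulatory_action=Outlook.NOT_EXPECTED,
)
print(example)
```

Checking Critical before Major matters: an event that meets both definitions should land in the higher tier.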
This classification drives the level of effort on the investigation. It also determines whether the CAPA addresses underlying structures alone or goes on to address the mental models and thus drive culture change.
Here is where we address building a quality culture. In CAPA lingo this is usually more a preventive action than a corrective action. In the simplest of terms, corrective actions address the underlying structures of the problem in the process/asset where the event happened. Preventive actions deal with underlying structures in other (usually related) processes/assets or get to the mindsets that allowed the underlying structures to exist in the first place.
By applying this system perspective to our problem solving, by realizing that not everything needs a complete rebuild of the foundation, by looking holistically across our systems, we can ensure that we are driving a level of effort to truly build the house of quality.
The Is-Is Not matrix is a great problem-solving tool that I usually recommend to help frame the problem. It is based on the 5W2H methodology and then asks what is different between the problem and what has been going right.
What (Is): What specific objects have the deviation? What is the specific deviation?
What (Is Not): What similar object(s) could reasonably have the deviation, but do not? What other deviations could reasonably be observed, but are not?
Where (Is): Where is the object when the deviation is observed (geographically)? Where is the deviation on the object?
Where (Is Not): Where else could the object be when the deviations are observed, but is not? Where else could the deviation be located on the object, but is not?
When (Is): When was the deviation observed first (in clock and calendar time)? When since that time has the deviation been observed? Any pattern? When, in the object’s history or life cycle, was the deviation observed first?
When (Is Not): When else could the deviation have been observed, but was not? When since that time could the deviation have been observed, but was not? When else, in the object’s history or life cycle, could the deviation have been observed first, but was not?
How much/how many (Is): How many objects have the deviation? What is the size of a single deviation? How many deviations are on each object? What is the trend? (…in the object?) (…in the number of occurrences of the deviation?) (…in the size of the deviation?)
How much/how many (Is Not): How many objects could have the deviation, but do not? What other size could the deviation be, but is not? How many deviations could there be on each object, but are not? What could be the trend, but is not? (…in the object?) (…in the number of occurrences of the deviation?) (…in the size of the deviation?)
Who (Is): Who is involved? (Avoid blame; stick to roles, shifts, etc.) To whom, by whom, or near whom does this occur?
Who (Is Not): Who is not involved? Is there a trend of a specific role, shift, or another distinguishing factor?
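One way to keep the matrix honest is to capture it as data, so it is obvious which rows still lack a stated difference between "is" and "is not." The sketch below is just one possible shape: `IsIsNotRow`, `open_rows`, and the example entries are hypothetical and not drawn from a real investigation.

```python
from dataclasses import dataclass


@dataclass
class IsIsNotRow:
    dimension: str         # What, Where, When, How many, Who
    is_: str               # what is observed
    is_not: str            # what could reasonably be observed, but is not
    distinction: str = ""  # what is different between the two


def open_rows(matrix: list[IsIsNotRow]) -> list[str]:
    """Return the dimensions where no distinction has been identified yet."""
    return [row.dimension for row in matrix if not row.distinction.strip()]


# Hypothetical entries for illustration only.
matrix = [
    IsIsNotRow("What", "cracked vials", "chipped or scratched vials",
               "damage is structural, not cosmetic"),
    IsIsNotRow("Where", "filling line 2", "filling line 1"),
    IsIsNotRow("When", "second shift", "first shift",
               "second shift runs after the format changeover"),
    IsIsNotRow("How many", "3 of 40 lots", "every lot"),
    IsIsNotRow("Who", "all operator roles", "a single role or shift"),
]

print(open_rows(matrix))
```

The rows returned by `open_rows` are where the framing work is unfinished, which is a useful prompt before moving on to root cause analysis.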
All three are right on the nose, and I’ve posted a bunch on the topics. Definitely go and read the post.
What I want to delve deeper into is Stephanie’s point that “Deviation systems should also be built to triage events into risk-based categories with sufficient time allocated to each category to drive risk-based investigations and focus the most time and effort on the highest risk and most complex events.”
That is an accurate breakdown, and exactly what regulators are asking for. However, I think the implementation of risk-based categories can sometimes lead to confusion, and we can spend some time unpacking the concept.
Risk is the possible effect of uncertainty. Risk is often described in terms of risk sources, potential events, their consequences, and their likelihoods (which is where likelihood × severity comes from).
But there are a lot of types of uncertainty. IEC 31010, “Risk management – Risk assessment techniques,” lists the following examples:
uncertainty as to the truth of assumptions, including presumptions about how people or systems might behave
variability in the parameters on which a decision is to be based
uncertainty in the validity or accuracy of models which have been established to make predictions about the future
events (including changes in circumstances or conditions) whose occurrence, character or consequences are uncertain
uncertainty associated with disruptive events
the uncertain outcomes of systemic issues, such as shortages of competent staff, that can have wide ranging impacts which cannot be clearly defined
lack of knowledge which arises when uncertainty is recognized but not fully understood
uncertainty arising from the limitations of the human mind, for example in understanding complex data, predicting situations with long-term consequences or making bias-free judgments.
Most of these are only, at best, obliquely relevant to risk categorizing deviations.
So it is important to first build the risk categories on consequences. At the end of the day these are the consequences that matter in the pharmaceutical/medical device world:
harm to the safety, rights, or well-being of patients, subjects or participants (human or non-human)
compromised data integrity so that confidence in the results, outcome, or decision dependent on the data is impacted
These are some pretty hefty areas and really hard for the average user to get their mind around. This is why building good requirements and understanding how systems work are so critical. Building breadcrumbs into our procedures to let folks know which deviations fall in which category is a good practice.
There is nothing wrong with recognizing that different areas have different decision trees. Harm to safety in GMP can mean different things than safety in a GLP study.
The second place I’ve seen this go wrong has to do with likelihood, and folks getting symptom confused with problem confused with cause.
All deviations start with a situation that is different in some way from expected results. Deviations start with the symptom and, through analysis, end up with a root cause. So when building your decision tree, ensure it looks at symptoms and how the symptom is observed. That is surprisingly hard to do, which is why a lot of deviation criticality scales tend to focus only on severity.
We do not have enough people to process the deviations we get
45% of deviations are recurring
You hear this sort of framing regularly. Notice that only the third is a problem; the other two are solutions. In the case of the first statement, it can lead to some negative results. The second just has you throw more resources at the problem, which may or may not be a good thing. In both cases we are biasing the problem-solving process just as we begin.
The third problem statement pushes us to think. A measurable fact raises other questions that will help us develop better solutions: Why are our deviations recurring? Why are we not solving issues when they first occur? What processes/areas are they recurring in? Are we putting the right amount of effort on important deviations? How can we eliminate these deviations?
If a problem statement has only one solution, reframe it to avoid jumping to conclusions.
By focusing on a problem statement with objective facts (45% of deviations are recurring) we can ask deeper, thoughtful questions which will lead to wisdom, and to better solutions.
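A statement like "45% of deviations are recurring" is only an objective fact if we can say how it was measured. Here is a minimal sketch of one way to compute it, assuming each deviation is recorded as a hypothetical (process, symptom) pair and that "recurring" means the same pair shows up more than once. Both the record shape and that definition are assumptions; your system may define recurrence differently.

```python
from collections import Counter


def recurrence_rate(deviations: list[tuple[str, str]]) -> float:
    """Share of deviations whose (process, symptom) pair appears more than once."""
    counts = Counter(deviations)
    recurring = sum(n for n in counts.values() if n > 1)
    return recurring / len(deviations) if deviations else 0.0


def recurring_by_process(deviations: list[tuple[str, str]]) -> Counter:
    """Count recurring deviations per process, to show where they cluster."""
    counts = Counter(deviations)
    by_process: Counter = Counter()
    for (process, _symptom), n in counts.items():
        if n > 1:
            by_process[process] += n
    return by_process


# Hypothetical records: 9 recurring deviations out of 20.
records = (
    [("filling", "missed signature")] * 5
    + [("labeling", "wrong lot number")] * 4
    + [(f"area {i}", "one-off event") for i in range(11)]
)

rate = recurrence_rate(records)
print(f"{rate:.0%} of deviations are recurring")
print(recurring_by_process(records).most_common())
```

The per-process breakdown speaks directly to one of the follow-up questions: which processes or areas the deviations are recurring in.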
To build a good problem statement:
Begin with observable facts, not opinions, judgments, or interpretations.
Describe what is happening by answering questions like “How much/How many/How long/How often.” This creates room for exploration and discovery.
Iterate on the problem statement. As you think more deeply about the situation, modify your first version. This is a sign that you understand more about the situation. This is the kind of data that will join with the facts you discover to lead towards sound decisions.
The 5W2H tool is always a good place to start.
Who are the people directly concerned with the problem? Who does this? Who should be involved but wasn’t? Was someone involved who shouldn’t be?
Roles and Departments
Action, steps, description
When did the problem occur?
Times, dates, place in process
Where did the problem occur?
Why is it important?
Why did we do this? What are the requirements? What is the expected condition?
How did we discover it? Where in the process was it?
Method, process, procedure
How Many? How Much?
How many things are involved? How often did the situation happen? How much did it impact?
Remember this can be iterative as you discover more information and the problem statement at the end might not necessarily be the problem statement at the beginning.
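Because the problem statement is iterative, it can help to treat the 5W2H answers as a record that gets revised rather than a form filled out once. The sketch below is an illustration only: `ProblemStatement`, its field names, and the example answers are hypothetical.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ProblemStatement:
    who: str = ""       # roles and departments, not names
    what: str = ""      # action, steps, description
    when: str = ""      # times, dates
    where: str = ""     # place in process
    why: str = ""       # why it matters, not why it occurred
    how: str = ""       # how it was discovered
    how_many: str = ""  # volume, count, frequency

    def open_questions(self) -> list[str]:
        """Return the 5W2H fields that have not been answered yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# First version, written from what is known at discovery (hypothetical).
draft = ProblemStatement(
    what="Batch record page missing second-person verification",
    when="Found during batch record review",
    how="Reviewer checklist",
)

# Later version, revised as more facts are discovered.
revised = replace(
    draft,
    who="Manufacturing operators, second shift",
    where="Compounding step",
    how_many="4 of 30 batch records this quarter",
)

print(draft.open_questions())
print(revised.open_questions())
```

Keeping each version immutable and creating a revised copy preserves the trail of how understanding changed, which fits the point that the final problem statement may not be the one you started with.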
Is used to…
Understand and target a problem. Provide a scope. Evaluate any risks. Make objective decisions
Answers the following… (5W2H)
What? (problem that occurred) When? (timing of what occurred) Where? (location of what occurred) Who? (persons involved/observers) Why? (why it matters, not why it occurred) How Much/Many? (volume or count) How Often? (First/only occurrence or multiple)
Object (What was affected?) Defect (What went wrong?)