Treating All Investigations the Same

Stephanie Gaulding, a colleague in the ASQ, recently wrote an excellent post for Redica on “How to Avoid Three Common Deviation Investigation Pitfalls”, a subject near and dear to my heart.

The three pitfalls Stephanie gives are:

  1. Not getting to root cause
  2. Inadequate scoping
  3. Treating all investigations the same

All three are right on the nose, and I’ve posted a bunch on these topics. Definitely go read the post.

What I want to delve deeper into is Stephanie’s point that “Deviation systems should also be built to triage events into risk-based categories with sufficient time allocated to each category to drive risk-based investigations and focus the most time and effort on the highest risk and most complex events.”

That is an accurate breakdown, and exactly what regulators are asking for. However, I think the implementation of risk-based categories can sometimes lead to confusion, so it is worth spending some time unpacking the concept.

Risk is the effect of uncertainty on objectives. It is often described in terms of risk sources, potential events, their consequences, and their likelihoods (which is where the familiar likelihood × severity formulation comes from).
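
To make that formulation concrete, here is a minimal sketch of a likelihood × severity score. The 1–5 scales and the category cut-offs are illustrative assumptions on my part, not a regulatory standard.

```python
# Minimal sketch of a likelihood x severity risk score.
# The 1-5 ordinal scales and the category cut-offs are illustrative assumptions only.

def risk_score(likelihood: int, severity: int) -> int:
    """Multiply ordinal likelihood (1-5) and severity (1-5) ratings."""
    if not (1 <= likelihood <= 5 and 1 <= severity <= 5):
        raise ValueError("ratings must be on the 1-5 ordinal scale")
    return likelihood * severity


def risk_category(score: int) -> str:
    """Map a score to a hypothetical triage category."""
    if score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


print(risk_category(risk_score(likelihood=4, severity=5)))  # -> high
```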

There are, however, many types of uncertainty. IEC 31010, “Risk management – Risk assessment techniques,” lists the following examples:

  • uncertainty as to the truth of assumptions, including presumptions about how people or systems might behave
  • variability in the parameters on which a decision is to be based
  • uncertainty in the validity or accuracy of models which have been established to make predictions about the future
  • events (including changes in circumstances or conditions) whose occurrence, character or consequences are uncertain
  • uncertainty associated with disruptive events
  • the uncertain outcomes of systemic issues, such as shortages of competent staff, that can have wide ranging impacts which cannot be clearly defined
  • lack of knowledge which arises when uncertainty is recognized but not fully understood
  • unpredictability
  • uncertainty arising from the limitations of the human mind, for example in understanding complex data, predicting situations with long-term consequences or making bias-free judgments.

Most of these are, at best, only obliquely relevant to risk-categorizing deviations.

So it is important to first build the risk categories on consequences. At the end of the day, these are the consequences that matter in the pharmaceutical/medical device world:

  • harm to the safety, rights, or well-being of patients, subjects or participants (human or non-human)
  • compromised data integrity so that confidence in the results, outcome, or decision dependent on the data is impacted

These are some pretty hefty areas, and really hard for the average user to get their mind around. This is why building good requirements and understanding how systems work are so critical. Building breadcrumbs into our procedures that let folks know which deviations fall into which category is a best practice.

There is nothing wrong with recognizing that different areas have different decision trees. Harm to safety in GMP can mean different things than safety in a GLP study.

The second place I’ve seen this go wrong has to do with likelihood, and with folks confusing symptom with problem, and problem with cause.

All deviations begin with a situation that is different in some way from the expected result. Deviations start with the symptom and, through analysis, end up at a root cause. So when building your decision tree, ensure it looks at symptoms and how the symptom is observed. That is surprisingly hard to do, which is why a lot of deviation criticality scales tend to focus only on severity.

4 major types of symptoms
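
To give a feel for what a symptom-first triage step could look like, here is a small sketch. The symptom names, categories, and routing rules are hypothetical examples, not a recommended standard.

```python
# Hypothetical sketch of a symptom-first triage step for a deviation system.
# The symptom types and categories are illustrative assumptions, not a standard.

TRIAGE_RULES = {
    # how the symptom was observed -> risk-based investigation category
    "patient_safety_signal": "critical",
    "data_integrity_question": "critical",
    "out_of_specification_result": "major",
    "procedure_not_followed": "major",
    "documentation_error": "minor",
}


def triage(symptom: str) -> str:
    """Route a deviation based on the observed symptom, failing safe by default."""
    # Anything the procedure did not anticipate goes to the highest category
    # until a person reviews it, never the lowest.
    return TRIAGE_RULES.get(symptom, "critical")


print(triage("documentation_error"))  # -> minor
print(triage("unexpected_alarm"))     # -> critical (not anticipated, fail safe)
```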

SIPOC for Data Governance

The SIPOC is a great tool for understanding data and fits nicely into a larger data process mapping initiative.

By understanding where the data comes from (Supplier), what it is used for (Customer), and what is done to the data on its trip from supplier to customer (Process), you can:

  • Understand the requirements that the customer has for the data
  • Understand the rules governing how the data is provided
  • Determine the gap between what is required and what is provided
  • Track the root cause of data failures – both of type and of quality
  • Create requirements for modifying the processes that move the data

The SIPOC can be applied at many levels of detail. At a high level, for example, batch data is used to determine supply. At a detailed level, a rule for calculating a data element can result in an unexpected number because of a condition that was not anticipated.

SIPOC for manufacturing data utilizing the MES (high level)
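
Here is a minimal sketch of how a SIPOC entry for a single data flow could be captured and checked against customer requirements. The field names and the MES-flavored example values are assumptions for illustration only.

```python
# Minimal sketch of recording a SIPOC entry for a data flow and checking it
# against customer requirements. Field names and values are illustrative only.

from dataclasses import dataclass, field


@dataclass
class DataSipoc:
    supplier: str                        # where the data comes from
    inputs: list[str]                    # data elements provided
    process: str                         # what is done to the data in transit
    outputs: list[str]                   # data elements delivered
    customer: str                        # who uses the data
    customer_requirements: list[str] = field(default_factory=list)

    def gaps(self) -> set[str]:
        """Customer requirements that the delivered outputs do not cover."""
        return set(self.customer_requirements) - set(self.outputs)


batch_flow = DataSipoc(
    supplier="MES batch record",
    inputs=["batch id", "yield", "deviation count"],
    process="aggregate and reconcile batch data",
    outputs=["released batch quantity"],
    customer="supply planning",
    customer_requirements=["released batch quantity", "release date"],
)
print(batch_flow.gaps())  # -> {'release date'}
```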

Hierarchy is not inevitable

I’m on the record as believing that Quality as a process is an inherently progressive one, and that when we stray from those progressive roots we become exactly what we strive to avoid. One only has to look at the history of Six Sigma, TQM, and even Lean to see that.

I’m a big proponent of Humanocracy for that very reason.

One cannot read much business writing without coming across the great leader (or, even worse, great man) hypothesis, which serves to naturalize power and existing forms of authority. One cannot even escape the continued hagiography of Jack Welch, even though his toxic legacy has been discredited in many ways.

We cannot drive out fear unless we unmask power by revealing its contradictions, hypocrisies, and reliance on violence and coercion. The way we work is a result of human decisions, and thus capable of being remade.

We all have a long way to go here. I, for example, catch myself all the time speaking of leadership in hierarchical ways. One of the things I am currently working on is exorcising the term ‘leadership team’ from my vocabulary. It doesn’t serve any real purpose, and it fundamentally frames leadership as a hierarchical entity.

Another thing I am working on is tackling the thorn of positional authority, the idea that the higher your rank in the organization, the more decision-making authority you have. Which is absurd. In every organization I’ve been in, people hold positions of authority that cover areas in which they do not have the education, experience, or training to make decisions. This is why we need clear decision matrices, empowered process owners, and democratic leadership throughout the organization.

Success/Failure Space, or Why We Can Sometimes Seem Pessimistic

When evaluating a system we can look at it in two ways. We can identify ways a thing can fail or the various ways it can succeed.

Success/Failure Space

These are really just two sides of the coin in many ways, with identifiable points in success space coinciding with analogous points in failure space. “Maximum anticipated success” in success space coincides with “minimum anticipated failure” in failure space.

Like everything, how we frame the question helps us find answers. Certain questions require us to think in terms of failure space, others in success. There are advantages in both, but in risk management, the failure space is incredibly valuable.

It is generally easier to attain concurrence on what constitutes failure than it is to agree on what constitutes success. We may desire a house that has great windows, high ceilings, and a nice yard. However, the one we buy can have a termite-infested foundation, bad electrical work, and a roof full of leaks. Whether the house is great is a matter of opinion, but we certainly know it is a failure based on the high repair bills we are going to accrue.

Success tends to be associated with the efficiency of a system, the amount of output, and the degree of usefulness. These characteristics are described by continuous variables, which are not easily modeled in terms of the simple discrete events, such as “water is not hot,” that characterize the failure space. Failure, in particular complete failure, is generally easy to define, whereas success may be more difficult to tie down.

Theoretically, the number of ways in which a system can fail and the number of ways in which it can succeed are both infinite. From a practical standpoint, though, the population of the failure space is smaller than the population of the success space. This is why risk management focuses on the failure space.

The failure space maps really well to nominal scales for severity, which can be helpful as you build your own scales for risk assessments.

For example, consider a morning commute.

Example of the failure space for a morning commute

Tree Analysis – Fault, Cause, Question and Success

The pictorial tree is a favorite for representing branching paths. Some common ones include Fault Tree Analysis (FTA); Cause Trees, used retrospectively to analyze events that have already occurred; Question Trees, to aid in problem-solving; and even Success Trees, to figure out why something went right.

Apple Tree illustration with long branching roots

Inductive or Deductive

Inductive Reasoning: Induction is reasoning from individual cases to a general conclusion. We start from a particular initiating condition and attempt to ascertain the effect of that fault or condition on a system.
Deductive Reasoning: Deduction is reasoning from the general to the specific. We start with the way the system has failed and we attempt to find out what modes of system behavior contribute to this failure.

The beauty of a pictorial representation is that the direction you travel on the tree mirrors the form of reasoning being used.

Inductive reasoning is the branches; tools like the Cause Tree are used to determine what system states (usually failed states) are possible. The inductive techniques provide answers to the generic question, “What happens if…?” The process consists of assuming a particular state of existence of a component or components and analyzing to determine the effect of that condition on the system.

Deductive reasoning is the roots; tools like Fault Tree Analysis take a specific system state, generally a failure state, and build up the chains of more basic faults contributing to that undesired event in a systematic way to determine how a given failure can occur.

Success/Failure Space

We operate in a success/failure space, constantly identifying the ways a thing can fail and the various ways it can succeed.

Fault Tree Analysis

Fault Tree Analysis (FTA) is a tool for identifying and analyzing factors that contribute to an undesired event (called the “top event”). The top event is analyzed by first identifying its immediate and necessary causes. The logical relationships between these causes are represented by logic gates, such as AND- and OR-gates. Each cause is then analyzed step-wise in the same way until further analysis becomes unproductive. The result is a graphical representation of a Boolean equation in a tree diagram.

The Undesired Event

Fault tree analysis is a deductive failure analysis that focuses on one particular undesired event to determine the causes of this event. The undesired event constitutes the top event in a fault tree diagram and generally consists of a complete failure.

If the top event is too general, the analysis becomes unmanageable; if it is too specific, the analysis does not provide a sufficiently broad view.

Top events are usually a failure of a critical requirement or process step.

A fault tree is not a model of all possible failures or all possible causes for failure. A fault tree is tailored to its top event which corresponds to some particular failure mode, and the fault tree includes only those faults that contribute to this top event. These faults are not exhaustive – they cover only the most credible faults as assessed by the risk team.

The Symbology of a Fault Tree

Events

  • Basic Event – The circle describes a basic initiating fault event that requires no further development. The circle signifies that the appropriate limit of resolution has been reached.
  • Undeveloped Event – The diamond describes a specific fault event that is not further developed, either because the event is of insufficient consequence or because information relevant to the event is unavailable.
  • Conditioning Event – The ellipse is used to record any conditions or restrictions that apply to any logic gate. It is used primarily with the INHIBIT and PRIORITY AND-gates.
  • External Event – The house is used to signify an event that is normally expected to occur: e.g., a phase change. The house symbol displays events that are not, of themselves, faults. (External does not mean external to the organization.)
  • Intermediate Event – An intermediate event is a fault event that occurs because of one or more antecedent causes acting through logic gates. All intermediate events are symbolized by rectangles.
Event Symbols used in a Fault Tree Analysis

Gates

There are two basic types of fault tree gates: the OR-gate and the AND-gate. All other gates are really special cases of these two basic types.

  • OR-gate – The OR-gate is used to show that the output event occurs only if one or more of the input events occur. There may be any number of input events to an OR-gate.
  • AND-gate – The AND-gate is used to show that the output fault occurs only if all the input faults occur. There may be any number of input faults to an AND-gate.
  • INHIBIT-gate – The INHIBIT-gate, represented by the hexagon, is a special case of the AND-gate. The output is caused by a single input, but some qualifying condition must be satisfied before the input can produce the output. The condition that must exist is the conditional input. A description of this conditional input is spelled out within an ellipse drawn to the right of the gate.
  • EXCLUSIVE OR-gate – The EXCLUSIVE OR-gate is a special case of the OR-gate in which the output event occurs only if exactly one of the input events occurs.
  • PRIORITY AND-gate – The PRIORITY AND-gate is a special case of the AND-gate in which the output event occurs only if all input events occur in a specified ordered sequence. The sequence is usually shown inside an ellipse drawn to the right of the gate. In practice, the necessity of having a specific sequence is not usually encountered.
Gate Symbols used in a Fault Tree Analysis

Procedure

  1. Identify the system or process that will be examined, including boundaries that will limit the analysis. FTA often stems from a previous risk assessment, such as a FMEA or Structured What-If; or, it comes from a root cause analysis.
  2. Identify the members of the Risk Team. The Risk Team is comprised of the Process Owner, the Facilitator, and Subject Matter Experts (SMEs) with expertise in the process being reviewed.
  3. Identify the Top Event, the type of failure that will be analyzed as narrowly and specifically as possible.
  4. Identify the events that may be immediate causes of the top event. Write these events at the level below the event they cause.
    1. For each event ask “Is this a basic failure? Or can it be analyzed for its immediate causes?”
      1. If the event is a basic failure, draw a circle around it.
      2. If it can be analyzed for its own causes draw a rectangle around it (NOTE: if appropriate, other event types are possible)
  5. Ask “How are these events related to the one they cause?” Use the gate symbols to show the relationships. The lower-level events are the input events. The one they cause, above the gate, is the output event.
  6. For each event that is not basic, repeat steps 4 and 5. Continue until all branches of the tree end in a basic or undeveloped event.
  7. To determine the mathematical probability of failure, assign probabilities to each of the basic events. Use Boolean algebra to calculate the probability of each higher-level event and of the top event (a minimal sketch follows below). A full discussion of the math is a topic for a different post.
  8. Analyze the tree to understand the relations between the causes and to find ways to prevent failures. Use the gate relationships to find the most efficient ways to reduce risk. Focus attention on the causes most likely to happen.
FTA example using lack of team (basic)
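
To make step 7 a little more concrete, here is a minimal sketch of rolling basic-event probabilities up through AND- and OR-gates, assuming independent basic events. The event names and the numbers are invented for illustration.

```python
# Minimal sketch of rolling basic-event probabilities up through AND/OR gates.
# Assumes independent basic events; names and numbers are invented for illustration.

from math import prod


def p_and(*probabilities: float) -> float:
    """AND-gate: the output occurs only if all inputs occur."""
    return prod(probabilities)


def p_or(*probabilities: float) -> float:
    """OR-gate: the output occurs if one or more inputs occur."""
    return 1 - prod(1 - p for p in probabilities)


# Hypothetical tree: top event "batch record review is missed"
p_no_reviewer = p_and(0.10, 0.05)    # primary reviewer out AND backup out
p_record_late = p_or(0.02, 0.01)     # late handoff OR system outage
p_top_event = p_or(p_no_reviewer, p_record_late)

print(f"P(top event) = {p_top_event:.4f}")
```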

The Question Tree

A critical task in problem-solving is determining what kinds of analysis and corresponding data would best solve the problem. The challenge is not a shortage of techniques but too many to choose from, and we often reflexively reach for the same few basic tools out of familiarity and habit. This can mislead us when the situation is complex, non-routine, and/or unfamiliar.

This is where a Question Tree comes in handy to determine what analyses and data are suited for a particular problem-solving situation. This tool is also known as a logic tree or a decision tree. Question Trees are structures for seeing the elements of a problem clearly and keeping track of the different levels of the problem, which we can liken to trunks, branches, twigs, and leaves. You can arrange them from left to right, right to left, or top to bottom: whatever makes the elements easier for you to visualize. Think of a Question Tree as a mental model of your problem. Better trees have a clearer and more complete logic of relationships linking the parts to each other, are more comprehensive, and have no overlap.

The Question Tree is very powerful when working through broad and complex problems that no single analysis or framework can solve. By developing a set of questions that are connected to one another in the form of a tree we can determine what data analysis is needed, which can help us break out of the habit of using the same analysis tool even when it is a bad one for the job.

The core question is the starting point. It is made easier to solve by decomposing it into a few, more specific sub-questions. The logic of decomposition is such that the answers to these sub-questions should together fully answer the question they emerge from. The first level of sub-questions may still be too broad to solve with specific analyses and data, so each is decomposed further.

The process of decomposition continues until a sub-question is reached that can be answered using a particular technique or framework, and the data needed is specific enough to be identified. A Question Tree is thus constructed, and the final set of questions indicates the analyses and data needed. As much as possible, the questions on the tree are also framed such that they have “yes” or “no” as potential answers.

They, too, are hypotheses to be settled with data, analysis, and evidence. They can also be used to test assumptions and beliefs, evaluate expectations, explore puzzles and oddities, and generate solution options.

In decomposing a question, ask whether the sub-questions are mutually exclusive and collectively exhaustive. This can help generate the sub-question not asked, and thus, reduce errors of omission. Building a Question Tree is an iterative and nonlinear process. If later information so dictates, previously done work on the tree should be adjusted.
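
As a small sketch of a Question Tree treated as a structure, decomposed until each leaf question can be answered with a specific analysis, here is an example. The core question, sub-questions, and analyses named are hypothetical.

```python
# Hypothetical sketch of a Question Tree: decompose the core question until each
# leaf can be answered with a specific analysis. Content is illustrative only.

from dataclasses import dataclass, field


@dataclass
class Question:
    text: str
    analysis: str = ""                       # set only on answerable leaf questions
    children: list["Question"] = field(default_factory=list)

    def leaves(self) -> list["Question"]:
        """Return the final, answerable questions that drive data collection."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


tree = Question(
    "Why are deviation closures late?",
    children=[
        Question("Is the delay in the investigation phase?",
                 analysis="cycle-time breakdown by phase"),
        Question("Is the delay in the approval phase?",
                 analysis="queue time per approver role"),
    ],
)
for leaf in tree.leaves():
    print(leaf.text, "->", leaf.analysis)
```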

Judgment plays a role in building a Question Tree, so it is unlikely that two people working independently on a complex starting question will create identical trees, but these are bound to overlap. Expertise matters and the strength of teams should be leveraged.

Success and Cause Trees

These are variations of fault tree analysis: a Success Tree, where the top event is a desired outcome, and a Cause Tree, used retrospectively to investigate a past event as part of root cause analysis.