The Theory of Active and Latent Failures was proposed by James Reason in his book, Human Error. Reason stated that accidents within most complex systems, such as health care, are caused by a breakdown or absence of safety barriers across four levels within a system. These levels are best described as Unsafe Acts, Preconditions for Unsafe Acts, Supervisory Factors, and Organizational Influences. Reason used the term “active failures” to describe factors at the Unsafe Acts level, whereas “latent failures” describes unsafe conditions higher up in the system.
This theory is represented graphically as the Swiss Cheese model, which has become very popular in root cause analysis and risk management circles and is widely applied beyond the safety world.
Swiss Cheese Model
In the Swiss Cheese model, the holes in the cheese depict the failure or absence of barriers within a system. Such occurrences represent failures that threaten the overall integrity of the system. If such failures never occurred within a system (i.e., if the system were perfect), then there would not be any holes in the cheese. We would have a nice Engelberg cheddar.
Not every hole that exists in a system will lead to an error. Sometimes holes may be inconsequential. Other times, holes in the cheese may be detected and corrected before something bad happens. This process of detecting and correcting errors occurs all the time.
The holes in the cheese are dynamic, not static. They open and close over time due to many factors, allowing the system to function appropriately without catastrophe. This is what human factors engineers call “resilience.” A resilient system is one that can adapt and adjust to changes or disturbances.
Holes in the cheese open and close at different rates. The rate at which holes pop up or disappear is determined by the type of failure the hole represents.
Holes that occur at the Unsafe Acts level, and even some at the Preconditions level, represent active failures. Active failures usually occur during the performance of work and are directly linked to the bad outcome. They change during the course of work, opening and closing over time as people make errors, catch their errors, and correct them.
Latent failures occur higher up in the system, above the Unsafe Acts level — the Organizational, Supervisory, and Preconditions levels. These failures are referred to as “latent” because when they occur or open, they often go undetected. They can lie “dormant” or “latent” in the system for an extended period of time before they are recognized. Unlike active failures, latent failures do not close or disappear quickly.
Most events (harms) are associated with multiple active and latent failures. Unlike the typical Swiss Cheese diagram above, which shows an arrow flying through one hole at each level of the system, there can be a variety of failures at each level that interact to produce an event. In other words, there can be several failures at the Organizational, Supervisory, Preconditions, and Unsafe Acts levels that all lead to harm. Holes in the cheese associated with events are more numerous at the Unsafe Acts and Preconditions levels, but (usually) become fewer as one progresses upward through the Supervisory and Organizational levels.
Given the frequency and dynamic nature of activities, there are more opportunities for holes to open up at the Unsafe Acts and Preconditions levels, and more holes are typically identified at these levels during root cause investigations and risk assessments.
The way the holes in the cheese interact across levels is important:
One-to-many mapping of causal factors: a hole at a higher level (e.g., Preconditions) may result in several holes at a lower level (e.g., Unsafe Acts).
Many-to-one mapping of causal factors: multiple holes at a higher level (e.g., Preconditions) may interact to produce a single hole at a lower level (e.g., Unsafe Acts); see the sketch below.
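To make these mappings concrete, here is a minimal sketch in Python; the failure names and causal links are entirely invented for illustration.

```python
from collections import defaultdict

# Causal links from higher-level holes to the lower-level holes they contribute to.
# All failure names here are invented for illustration.
causal_links = {
    "chronic understaffing (Preconditions)": [
        "skipped double-check (Unsafe Acts)",
        "rushed documentation (Unsafe Acts)",
    ],
    "night-shift fatigue (Preconditions)": [
        "skipped double-check (Unsafe Acts)",
    ],
    "ambiguous SOP (Preconditions)": [
        "skipped double-check (Unsafe Acts)",
    ],
}

# One-to-many: a single higher-level hole that produces several lower-level holes.
for cause, effects in causal_links.items():
    if len(effects) > 1:
        print(f"one-to-many: {cause} -> {effects}")

# Many-to-one: several higher-level holes that interact to produce one lower-level hole.
contributors = defaultdict(list)
for cause, effects in causal_links.items():
    for effect in effects:
        contributors[effect].append(cause)

for effect, causes in contributors.items():
    if len(causes) > 1:
        print(f"many-to-one: {causes} -> {effect}")
```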
By understanding the Swiss Cheese model, and Reason’s wider work on Active and Latent Failures, we can strengthen our approach to problem-solving.
Let us turn this failure-space model, and its levels of problems, to deviations in a clinical trial. This is one of those areas that regulations and tribal practice have complicated, perhaps needlessly. It is further complicated by the different players involved: clinical sites, the sponsor, and, usually these days, a number of Contract Research Organizations (CROs).
What is a Protocol Deviation?
A protocol deviation is any change, divergence, or departure from the study design or procedures defined in the approved protocol.
Protocol deviations may include unplanned instances of protocol noncompliance. For example, situations in which the clinical investigator failed to perform tests or examinations as required by the protocol or failures on the part of subjects to complete scheduled visits as required by the protocol, would be considered protocol deviations.
In the case of deviations which are planned exceptions to the protocol such deviations should be reviewed and approved by the IRB, the sponsor, and by the FDA for medical devices, prior to implementation, unless the change is necessary to eliminate apparent immediate hazards to the human subjects (21 CFR 312.66), or to protect the life or physical well-being of the subject (21 CFR 812.150(a)(4)).
FDA, Compliance Program Guidance Manual for Clinical Investigator Inspections (7348.811), July 2020.
In assessing protocol deviations/violations, the FDA instructs field staff to determine whether changes to the protocol were: (1) documented by an amendment, dated, and maintained with the protocol; (2) reported to the sponsor (when initiated by the clinical investigator); and (3) approved by the IRB and FDA (if applicable) before implementation (except when necessary to eliminate apparent immediate hazard(s) to human subjects).
A few key regulations and guidances, and what each states (not meant to be a comprehensive list):

ICH E6(R2), Sections 4.5.1–4.5.4:
4.5.1 “trial should be conducted in compliance with the protocol agreed to by the sponsor and, if required by the regulatory authorities…”
4.5.2 The investigator should not implement any deviation from, or changes of, the protocol without agreement by the sponsor and prior review and documented approval/favorable opinion from the IRB/IEC of an amendment, except where necessary to eliminate an immediate hazard(s) to trial subjects, or when the change(s) involves only logistical or administrative aspects of the trial (e.g., change in monitor(s), change of telephone number(s)).
4.5.3 The investigator, or person designated by the investigator, should document and explain any deviation from the approved protocol.
4.5.4 The investigator may implement a deviation from, or a change in, the protocol to eliminate an immediate hazard(s) to trial subjects without prior IRB/IEC approval/favorable opinion.

ICH E3, Section 9.6: The sponsor should describe the quality management approach implemented in the trial and summarize important deviations from the predefined quality tolerance limits and remedial actions taken in the clinical study report.

21 CFR 312.53(vi)(a): Investigators selected “Will conduct the study(ies) in accordance with the relevant, current protocol(s) and will only make changes in a protocol after notifying the sponsor, except when necessary to protect the safety, the rights, or welfare of subjects.”

21 CFR 56.108(a): The IRB shall…ensur[e] that changes in approved research…may not be initiated without IRB review and approval except where necessary to eliminate apparent immediate hazards to the human subjects.

21 CFR 56.108(b): The “IRB shall…follow written procedures for ensuring prompt reporting to the IRB, appropriate institutional officials, and the Food and Drug Administration of… any unanticipated problems involving risks to human subjects or others…[or] any instance of serious or continuing noncompliance with these regulations or the requirements or determinations of the IRB.”

45 CFR 46.103(b)(5): Assurances applicable to federally supported or conducted research shall at a minimum include…written procedures for ensuring prompt reporting to the IRB…[of] any unanticipated problems involving risks to subjects or others or any serious or continuing noncompliance with this policy or the requirements or determinations of the IRB.

FDA Form 1572, Section 9: Lists the commitments the investigator undertakes in signing the 1572, wherein the clinical investigator agrees “to conduct the study(ies) in accordance with the relevant, current protocol(s) and will only make changes in a protocol after notifying the sponsor, except when necessary to protect the safety, the rights, or welfare of subjects… [and] not to make any changes in the research without IRB approval, except where necessary to eliminate apparent immediate hazards to the human subjects.”
How Protocol Deviations are Implemented
Many companies tend to have a failure scale built into their process, differentiating between protocol deviations and violations based on severity; others use a minor, major, and even critical scale to denote differences in severity. The axis for severity is the degree to which the event affects the subject’s rights, safety, or welfare, and/or the integrity of the resultant data (i.e., the sponsor’s ability to use the data in support of the drug).
Companies that divide events into protocol deviations and violations typically define them as follows:
Protocol Deviation: A protocol deviation occurs when, without significant consequences, the activities on a study diverge from the IRB-approved protocol, e.g., missing a visit window because the subject is traveling. Not as serious as a protocol violation.
Protocol Violation: A divergence from the protocol that materially (a) reduces the quality or completeness of the data, (b) makes the ICF inaccurate, or (c) impacts a subject’s safety, rights or welfare. Examples of protocol violations may include: inadequate or delinquent informed consent; inclusion/exclusion criteria not met; unreported SAEs; improper breaking of the blind; use of prohibited medication; incorrect or missing tests; mishandled samples; multiple visits missed or outside permissible windows; materially inadequate record-keeping; intentional deviation from protocol, GCP or regulations by study personnel; and subject repeated noncompliance with study requirements.
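As an illustration only, a simple classification routine might look like the sketch below. The field names and decision rule are hypothetical and not drawn from any regulation; a real program would weigh materiality rather than simple booleans.

```python
from dataclasses import dataclass

@dataclass
class StudyEvent:
    description: str
    affects_subject_rights_safety_welfare: bool
    affects_data_integrity: bool

def classify(event: StudyEvent) -> str:
    """Classify a study event using the deviation/violation split described above.

    Sketch of a decision rule: anything that materially affects subject rights,
    safety, or welfare, or the integrity of the resulting data, is treated as a
    violation; everything else is a deviation.
    """
    if event.affects_subject_rights_safety_welfare or event.affects_data_integrity:
        return "protocol violation"
    return "protocol deviation"

print(classify(StudyEvent("visit outside window while subject traveling", False, False)))
print(classify(StudyEvent("enrolled subject who failed inclusion criteria", True, True)))
```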
This is probably a place where nomenclature can get in the way rather than provide benefit. The EMA says pretty much the same in “ICH guideline E3 – questions and answers (R1).”
Principles of Events in Clinical Practice
Severity of the event is based on the degree to which it affects the subject’s rights, safety, or welfare, and/or the integrity of the resultant data.
Events happen beyond the Protocol. These need to be managed appropriately as well.
The event needs to be categorized, evaluated and trended by the sponsor
Severity of the Event
Starting in the study planning stage, ICH E6(R2) GCP requires sponsors to identify risks to critical study processes and study data and to evaluate these risks based on likelihood, detectability and impact on subject safety and data integrity.
Sponsors then establish key quality indicators (KQIs) and quality tolerance thresholds. A KQI is really just a key risk indicator and should be treated similarly.
Study events that exceed the risk threshold should trigger an evaluation to determine if action is needed. In this way, sponsors can proactively manage risk and address protocol noncompliance.
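A minimal sketch of that threshold check is shown below. The KQIs, tolerance limits, and observed rates are invented for illustration; ICH E6(R2) does not prescribe specific metrics or values.

```python
# Hypothetical KQIs with quality tolerance limits, expressed as rates.
# None of these metrics or values come from ICH E6(R2); they are placeholders.
quality_tolerance_limits = {
    "missed_visit_rate": 0.05,             # fraction of scheduled visits missed
    "informed_consent_error_rate": 0.01,
    "primary_endpoint_data_missing": 0.02,
}

# Observed rates for one study, also invented.
observed = {
    "missed_visit_rate": 0.08,
    "informed_consent_error_rate": 0.004,
    "primary_endpoint_data_missing": 0.02,
}

for kqi, limit in quality_tolerance_limits.items():
    value = observed.get(kqi)
    if value is None:
        continue  # no data collected for this KQI yet
    if value > limit:
        print(f"EVALUATE: {kqi} = {value:.3f} exceeds tolerance {limit:.3f}")
    else:
        print(f"ok: {kqi} = {value:.3f} within tolerance {limit:.3f}")
```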
The best practice here is to have a living risk assessment for each study. Evaluate across studies to understand your overall organizational risk, and look for opportunities for wide-scale mitigations. Feed up into your risk register.
Event Classification for Clinical Protocols and GCPs
Where the Event Happens
Deviations in the clinical space are a great example of the management of supplier events, and at the end of the day there is little difference between supplier event management under GMP, GLP, or GCP. The individual requirements might be different, but the principles and the process are the same.
Each entity in the trial organization should have its own deviation system in which it investigates deviations, performs root cause investigations, and enacts CAPAs.
This is where it starts to get tricky. First of all, not all sites have the infrastructure to do this well. Second, the nature of reporting, usually through the Electronic Data Capture (EDC) system, can lead to balkanization at the site. Sites need strong compliance programs that compile deviation details into a single site-wide system, allowing the site to trend deviations across studies in addition to following sponsor reporting requirements.
Unfortunately, too many sites rely on the sponsor’s program. Sponsors need to evaluate the strength of this program during site selection and through auditing.
Events Happen
Consistent Event Reporting is Critical
Deviations should be raised against all processes, procedures, and plans, not just the protocol.
Categorization and Trending
Categorizing deviations is usually a pain point and an area where more consistency needs to be driven. I recommend first having a good standard set of categorizations. The industry would benefit from adopting a standard, and I think Norman Goldfarb’s proposal is still the best.
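As a sketch of what trending against a standard category set can look like once deviations are compiled site-wide, consider the toy example below; the categories and counts are illustrative placeholders, not a proposed standard.

```python
from collections import Counter

# Each record is (study_id, deviation_category); all values are invented.
deviation_log = [
    ("STUDY-001", "informed consent"),
    ("STUDY-001", "visit schedule"),
    ("STUDY-002", "visit schedule"),
    ("STUDY-002", "visit schedule"),
    ("STUDY-003", "inclusion/exclusion"),
    ("STUDY-003", "visit schedule"),
]

# Trend across all studies to spot categories that recur site-wide.
by_category = Counter(category for _, category in deviation_log)
for category, count in by_category.most_common():
    print(f"{category}: {count} deviations across studies")
```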
Once you have categories, and understand how they relate to your KQIs and other aspects, you need to make sure they are applied consistently. The key mechanisms for this are:
When evaluating a system, we can look at it in two ways: we can identify the ways a thing can fail, or the various ways it can succeed.
Success/Failure Space
These are really just two sides of the coin in many ways, with identifiable points in success space coinciding with analogous points in failure space. “Maximum anticipated success” in success space coincides with “minimum anticipated failure” in failure space.
Like everything, how we frame the question helps us find answers. Certain questions require us to think in terms of failure space, others in success. There are advantages in both, but in risk management, the failure space is incredibly valuable.
It is generally easier to attain concurrence on what constitutes failure than on what constitutes success. We may desire a house that has great windows, high ceilings, and a nice yard. However, the one we buy can have a termite-infested foundation, bad electrical work, and a roof full of leaks. Whether the house is great is a matter of opinion, but we all certainly know it is a failure based on the high repair bills we are going to accrue.
Success tends to be associated with the efficiency of a system, the amount of output, and the degree of usefulness. These characteristics are described by continuous variables that are not easily modeled in terms of the simple discrete events, such as “water is not hot,” that characterize the failure space. Failure, in particular complete failure, is generally easy to define, whereas the event of success may be more difficult to tie down.
Theoretically, the number of ways in which a system can fail and the number of ways in which a system can succeed are both infinite. From a practical standpoint, however, there are generally more ways to succeed than there are to fail; the population of the failure space is smaller than the population of the success space. This is why risk management focuses on the failure space.
The failure space maps really well to nominal scales for severity, which can be helpful as you build your own scales for risk assessments.
For example, consider the failure space of a morning commute.
Example of the failure space for a morning commute
The pictorial tree is a favorite for representing branching paths. Some common ones include Fault Tree Analysis (FTA); Cause Trees, used retrospectively to analyze events that have already occurred; Question Trees, to aid in problem-solving; and even Success Trees, to figure out why something went right.
Apple Tree illustration with long branching roots
Inductive or Deductive
Inductive Reasoning: Induction is reasoning from individual cases to a general conclusion. We start from a particular initiating condition and attempt to ascertain the effect of that fault or condition on a system.
Deductive Reasoning: Deduction is reasoning from the general to the specific. We start with the way the system has failed and attempt to find out what modes of system behavior contribute to this failure.
The beauty of a pictorial representation is that the direction you travel on the tree represents the form of reasoning being used.
Inductive reasoning is the branches: tools like the Cause Tree are used to determine what system states (usually failed states) are possible. The inductive techniques provide answers to the generic question, “What happens if…?” The process consists of assuming a particular state of existence of a component or components and analyzing to determine the effect of that condition on the system.
Deductive reasoning is the roots: tools like Fault Tree Analysis take some specific system state, generally a failure state, and build up chains of more basic faults contributing to this undesired event in a systematic way to determine how a given failure can occur.
Fault Tree Analysis
Fault Tree Analysis (FTA) is a tool for identifying and analyzing factors that contribute to an undesired event (called the “top event”). The top event is analyzed by first identifying its immediate and necessary causes. The logical relationship between these causes is represented by several gates such as AND and OR gates. Each cause is then analyzed step-wise in the same way until further analysis becomes unproductive. The result is a graphical representation of a Boolean equation in a tree diagram.
The Undesired Event
Fault tree analysis is a deductive failure analysis that focuses on one particular undesired event to determine the causes of this event. The undesired event constitutes the top event in a fault tree diagram and generally consists of a complete failure.
If the top event is too general, the analysis becomes unmanageable; if it is too specific, the analysis does not provide a sufficiently broad view.
Top events are usually a failure of a critical requirement or process step.
A fault tree is not a model of all possible failures or all possible causes for failure. A fault tree is tailored to its top event which corresponds to some particular failure mode, and the fault tree includes only those faults that contribute to this top event. These faults are not exhaustive – they cover only the most credible faults as assessed by the risk team.
The Symbology of a Fault Tree
Events
Basic Event
The circle describes a basic initiating fault event that requires no further development. The circle signifies that the appropriate limit of resolution has been reached.
Undeveloped Event
The diamond describes a specific fault event that is not further developed, either because the event is of insufficient consequence or because information relevant to the event is unavailable.
Conditioning Event
The ellipse is used to record any conditions or restrictions that apply to any logic gate. It is used primarily with the INHIBIT and PRIORITY AND-gates.
External Event
The house is used to signify an event that is normally expected to occur: e.g., a phase change. The house symbol displays events that are not, of themselves, faults.
External does not mean external to the organization.
Intermediate Event
An intermediate event is a fault event that occurs because of one or more antecedent causes acting through logic gates. All intermediate events are symbolized by rectangles.
Event Symbols used in a Fault Tree Analysis
Gates
There are two basic types of fault tree gates: the OR-gate and the AND-gate. All other gates are really special cases of these two basic types.
OR-gate
The OR-gate is used to show that the output event occurs only if one or more of the input events occur. There may be any number of input events to an OR-gate.
AND-gate
The AND-gate is used to show that the output fault occurs only if all the input faults occur. There may be any number of input faults to an AND-gate.
INHIBIT-gate
The INHIBIT-gate, represented by the hexagon, is a special case of the AND-gate. The output is caused by a single input, but some qualifying condition must be satisfied before the input can produce the output. The condition that must exist is the conditional input. A description of this conditional input is spelled out within an ellipse drawn to the right of the gate.
EXCLUSIVE OR-gate
The EXCLUSIVE OR-gate is a special case of the OR-gate in which the output event occurs only if exactly one of the input events occurs.
PRIORITY AND-gate
The PRIORITY AND-gate is a special case of the AND-gate in which the output event occurs only if all input events occur in a specified ordered sequence. The sequence is usually shown inside an ellipse drawn to the right of the gate. In practice, the necessity of having a specific sequence is not usually encountered.
Gate Symbols used in a Fault Tree Analysis
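To make the two basic gates concrete, here is a minimal sketch that evaluates a small tree of AND and OR gates over Boolean basic-event states. The event names are hypothetical.

```python
def or_gate(*inputs: bool) -> bool:
    """OR-gate: the output event occurs if one or more input events occur."""
    return any(inputs)

def and_gate(*inputs: bool) -> bool:
    """AND-gate: the output event occurs only if all input events occur."""
    return all(inputs)

# Hypothetical basic events for a "batch record review is missed" top event.
record_misfiled = False
primary_reviewer_unavailable = True
backup_reviewer_unavailable = True

# Top event: the review is missed if the record is misfiled,
# OR both the primary and backup reviewers are unavailable.
top_event = or_gate(
    record_misfiled,
    and_gate(primary_reviewer_unavailable, backup_reviewer_unavailable),
)
print("Top event occurs:", top_event)  # True
```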
Procedure
1. Identify the system or process that will be examined, including boundaries that will limit the analysis. FTA often stems from a previous risk assessment, such as a FMEA or Structured What-If; or, it comes from a root cause analysis.
2. Identify the Top Event, the type of failure that will be analyzed, as narrowly and specifically as possible.
3. Identify the events that may be immediate causes of the top event. Write these events at the level below the event they cause.
4. For each event ask, “Is this a basic failure? Or can it be analyzed for its immediate causes?”
   - If the event is a basic failure, draw a circle around it.
   - If it can be analyzed for its own causes, draw a rectangle around it (NOTE: if appropriate, other event types are possible).
5. Ask, “How are these events related to the one they cause?” Use the gate symbols to show the relationships. The lower-level events are the input events. The one they cause, above the gate, is the output event.
6. For each event that is not basic, repeat steps 4 and 5. Continue until all branches of the tree end in a basic or undeveloped event.
7. To determine the mathematical probability of failure, assign probabilities to each of the basic events. Use Boolean algebra to calculate the probability of each high-level event and the top event (a worked sketch follows this list). A full discussion of the math is a very different post.
8. Analyze the tree to understand the relations between the causes and to find ways to prevent failures. Use the gate relationships to find the most efficient ways to reduce risk. Focus attention on the causes most likely to happen.
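For step 7, here is a worked sketch of the Boolean-algebra arithmetic, assuming independent basic events and entirely made-up probabilities; the structure mirrors the OR-over-AND example shown earlier.

```python
# Probabilities of independent basic events (values are illustrative only).
p_a = 0.01   # basic event A
p_b = 0.02   # basic event B
p_c = 0.05   # basic event C

# AND-gate with independent inputs: P(B and C) = P(B) * P(C)
p_b_and_c = p_b * p_c

# OR-gate with independent inputs: P(A or X) = P(A) + P(X) - P(A) * P(X)
p_top = p_a + p_b_and_c - p_a * p_b_and_c

print(f"P(B AND C) = {p_b_and_c:.6f}")   # 0.001000
print(f"P(top event) = {p_top:.6f}")     # 0.010990
```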
FTA example using lack of team (basic)
The Question Tree
A critical task in problem-solving is determining what kinds of analysis and corresponding data would best solve the problem. There is no shortage of techniques; if anything, there are too many to choose from, and we often reflexively use the same few basic tools out of familiarity and habit. This can mislead when the situation is complex, non-routine, and/or unfamiliar.
This is where a Question Tree comes in handy to determine what analyses and data are suited for a particular problem-solving situation. This tool is also known as a logic tree or a decision tree. Question Trees are structures for seeing the elements of a problem clearly, and keeping track of different levels of the problem, which we can liken to trunks, branches, twigs, and leaves. You can arrange them from left to right, right to left, or top to bottom— whatever makes the elements easier for you to visualize. Think of a Question Tree as a mental model of your problem. Better trees have a clearer and more complete logic of relationships linking the parts to each other, are more comprehensive and have no overlap.
The Question Tree is very powerful when working through broad and complex problems that no single analysis or framework can solve. By developing a set of questions that are connected to one another in the form of a tree we can determine what data analysis is needed, which can help us break out of the habit of using the same analysis tool even when it is a bad one for the job.
The core question is the starting point. It is made easier to solve by decomposing it into a few, more specific sub-questions. The logic of decomposition is such that the answers to these sub-questions should together fully answer the question they emerge from. The first level of sub-questions may still be too broad to solve with specific analyses and data, so each is decomposed further.
The process of decomposition continues until a sub-question is reached that can be answered using a particular technique or framework, and the data needed is specific enough to be identified. A Question Tree is thus constructed, and the final set of questions indicates the analyses and data needed. As much as possible, the questions on the tree are also framed such that they have “yes” or “no” as potential answers.
They, too, are hypotheses to be settled with data, analysis, and evidence. They can also be used to test assumptions and beliefs, evaluate expectations, explore puzzles and oddities, and generate solution options.
In decomposing a question, ask whether the sub-questions are mutually exclusive and collectively exhaustive. This can help generate the sub-question not asked, and thus, reduce errors of omission. Building a Question Tree is an iterative and nonlinear process. If later information so dictates, previously done work on the tree should be adjusted.
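A Question Tree can also be captured as a simple data structure; the sketch below is one way to do it, with placeholder questions and analyses.

```python
from __future__ import annotations

from dataclasses import dataclass, field

@dataclass
class Question:
    text: str
    analysis: str | None = None            # technique/data that answers a leaf question
    sub_questions: list[Question] = field(default_factory=list)

# A core question decomposed until each leaf maps to a specific analysis.
# The questions and analyses below are placeholders for illustration.
tree = Question(
    "Why did protocol deviation rates rise this quarter?",
    sub_questions=[
        Question("Did the increase come from specific sites?",
                 analysis="trend deviation counts by site"),
        Question("Did a specific deviation category drive the increase?",
                 analysis="trend deviation counts by category"),
    ],
)

def leaves(question: Question):
    """Yield the leaf questions, i.e., the analyses and data actually needed."""
    if not question.sub_questions:
        yield question
    for sub in question.sub_questions:
        yield from leaves(sub)

for leaf in leaves(tree):
    print(f"{leaf.text} -> {leaf.analysis}")
```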
Judgment plays a role in building a Question Tree, so it is unlikely that two people working independently on a complex starting question will create identical trees, but these are bound to overlap. Expertise matters and the strength of teams should be leveraged.
Success and Cause Trees
These are variations of the fault tree analysis: a Success Tree where the top event is desired and a Cause Tree used to investigate a past event as part of root cause analysis.