Thinking of Swiss Cheese: Reason’s Theory of Active and Latent Failures

The Theory of Active and Latent Failures was proposed by James Reason in his book, Human Error. Reason stated that accidents within most complex systems, such as health care, are caused by a breakdown or absence of safety barriers across four levels within a system. These levels can best be described as Unsafe Acts, Preconditions for Unsafe Acts, Supervisory Factors, and Organizational Influences. Reason used the term “active failures” to describe factors at the Unsafe Acts level, and “latent failures” to describe unsafe conditions higher up in the system.

This theory is represented by the Swiss Cheese model, which has become very popular in root cause analysis and risk management circles and is widely applied beyond the safety world.

Swiss Cheese Model

In the Swiss Cheese model, the holes in the cheese depict the failure or absence of barriers within a system. Such occurrences represent failures that threaten the overall integrity of the system. If such failures never occurred within a system (i.e., if the system were perfect), then there would not be any holes in the cheese. We would have a nice Engelberg cheddar.

Not every hole that exists in a system will lead to an error. Sometimes holes may be inconsequential. Other times, holes in the cheese may be detected and corrected before something bad happens. This process of detecting and correcting errors occurs all the time.

The holes in the cheese are dynamic, not static. They open and close over time due to many factors, allowing the system to function appropriately without catastrophe. This is what human factors engineers call “resilience.” A resilient system is one that can adapt and adjust to changes or disturbances.

Holes in the cheese open and close at different rates. The rate at which holes pop up or disappear is determined by the type of failure the hole represents.

  1. Holes that occur at the Unsafe Acts level, and even some at the Preconditions level, represent active failures. Active failures usually occur during the activity of work and are directly linked to the bad outcome. Active failures change during the performance of work, opening and closing over time as people make errors, catch their errors, and correct them.
  2. Latent failures occur higher up in the system, above the Unsafe Acts level — the Organizational, Supervisory, and Preconditions levels. These failures are referred to as “latent” because when they occur or open, they often go undetected. They can lie “dormant” or “latent” in the system for an extended period of time before they are recognized. Unlike active failures, latent failures do not close or disappear quickly.

Most events (harms) are associated with multiple active and latent failures. Unlike the typical Swiss Cheese diagram above, which shows an arrow flying through one hole at each level of the system, there can be a variety of failures at each level that interact to produce an event. In other words, there can be several failures at the Organizational, Supervisory, Preconditions, and Unsafe Acts levels that all lead to harm. Holes associated with events are more numerous at the Unsafe Acts and Preconditions levels, but (usually) become fewer as one moves up through the Supervisory and Organizational levels.

Given the frequency and dynamic nature of frontline activities, there are more opportunities for holes to open up at the Unsafe Acts and Preconditions levels, and more holes are often identified at these levels during root cause investigations and risk assessments.

The way the holes in the cheese interact across levels is important:

  • One-to-many mapping of causal factors is when a hole at a higher level (e.g., Preconditions) results in several holes at a lower level (e.g., Unsafe Acts).
  • Many-to-one mapping of causal factors is when multiple holes at a higher level (e.g., Preconditions) interact to produce a single hole at a lower level (e.g., Unsafe Acts), as sketched below.
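
To make these two mappings concrete, here is a minimal sketch in Python. The level names follow Reason’s model, but the specific holes and the helper function are invented examples for illustration, not drawn from any real investigation.

```python
# Hypothetical sketch of causal-factor mappings across Swiss Cheese levels.
# Hole descriptions are invented examples for illustration only.

# One-to-many: a single Preconditions hole contributes to several Unsafe Acts.
one_to_many = {
    "Preconditions: chronic understaffing on night shift": [
        "Unsafe Act: skipped double-check on line clearance",
        "Unsafe Act: batch record entry made from memory",
        "Unsafe Act: alarm acknowledged without investigation",
    ],
}

# Many-to-one: several Preconditions holes interact to produce one Unsafe Act.
many_to_one = {
    "Unsafe Act: wrong buffer connected to the bioreactor": [
        "Preconditions: look-alike tubing connectors",
        "Preconditions: poor lighting in the suite",
        "Preconditions: fatigue at the end of a 12-hour shift",
    ],
}

def count_links(mapping):
    """Count the individual cause-effect links in a mapping."""
    return sum(len(related) for related in mapping.values())

if __name__ == "__main__":
    print(f"One-to-many links: {count_links(one_to_many)}")
    print(f"Many-to-one links: {count_links(many_to_one)}")
```

The point of modeling it this way is simply that a single arrow through single holes, as in the classic diagram, rarely describes a real event; the relationships are usually networks, not chains.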

By understanding the Swiss Cheese model, and Reason’s wider work on active and latent failures, we can strengthen our approach to problem-solving.

Plus cheese is cool.

Call a Band-Aid a Band-Aid: Corrections and Problem-Solving

A common mistake made in problem-solving, especially within the deviation process, is not giving enough forethought to band-aids. As I discussed in the post “Treating All Investigations the Same,” it is important to be able to determine which problems need deep root cause analysis and which ones should be more catch and release.

For catch and release you usually correct, document, and close. In these cases the problem is inherently small enough, and the experience suggesting a possible course of action – the correction – sound enough, that you can proceed without root cause analysis and a formal solution. If those problems persist, and experience- and intuition-driven solutions prove ineffective, then we might decide to engage in structured problem-solving for a more effective solution and outcome.

In the post “When troubleshooting causes trouble” I laid out the 4Cs: Concern, Cause, Countermeasure, Check Results. It is during the Countermeasure step that we determine what immediate or temporary countermeasures can be taken to reduce or eliminate the problem; this is where we apply correction and immediate action.

It helps to agree on what a correction is, especially as it relates to corrective actions. Folks often get confused here. A correction addresses the problem; it does not address the cause.

Fixing a tire, rebooting a computer, doing the dishes. These are all corrections.

As I discussed in “Design Problem Solving into the Process,” good process design involves thinking through as many of the problems that could occur as possible, identifying ways to notice those problems, and having clear escalation paths. For low-risk issues, that is often just fix, record, move on. I talk a lot more about this in the post “Managing Events Systematically.”

A good problem-solving system is built to help people decide when to apply these band-aids and when to engage in more structured problem-solving. Building this kind of situational awareness into the organization is key.

Design Problem Solving into the Process

Good processes and systems have ways designed into them to identify when a problem occurs, and ensure it gets the right rigor of problem-solving. A model like Art Smalley’s can be helpful here.

Each and every process should go through the following steps:

  1. Define those problems that should be escalated and those that should not. Everyone working in a process should have the same definition of what a problem is. Often we end up with a hierarchy of issues: those solved within the process – Level 1 – and those that go to a root cause process (deviation/CAPA) – Level 2.
  2. Identify the ways to notice a problem. Make the work as visual as possible so it is easier to detect the problem.
  3. Define the escalation method. There should be one clear way to surface a problem. There are many ways to create a signal, but it should be simple, timely, and very clear.

These three elements make up the request for help.

The next two steps make up the response to that request.

  1. Who is the right person to respond? Supervisor? Area management? Process Owner? Quality?
  2. How does the individual respond, and most importantly when? This should be standardized so the other end of that help chain is not wondering whether, when, and in what form that help is going to arrive.
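
As a rough illustration of how the request for help and the response to it might be wired together, here is a minimal sketch. The level names match the Level 1/Level 2 hierarchy above, but the escalation triggers, roles, and response times are hypothetical placeholders, not a prescription.

```python
from dataclasses import dataclass
from enum import Enum

class ProblemLevel(Enum):
    LEVEL_1 = "solved within the process"
    LEVEL_2 = "escalated to root cause process (deviation/CAPA)"

@dataclass
class Problem:
    description: str
    product_impact_possible: bool   # hypothetical escalation trigger
    outside_operating_limits: bool  # hypothetical escalation trigger

def classify(problem: Problem) -> ProblemLevel:
    """Apply the shared definition of what must be escalated (the request for help)."""
    if problem.product_impact_possible or problem.outside_operating_limits:
        return ProblemLevel.LEVEL_2
    return ProblemLevel.LEVEL_1

def route(problem: Problem) -> str:
    """Decide who responds and how quickly (the response to the request).
    Roles and response times here are placeholders."""
    if classify(problem) is ProblemLevel.LEVEL_2:
        return "Notify area management and Quality; respond within one hour."
    return "Supervisor responds at the next line walk; fix, record, move on."

if __name__ == "__main__":
    issue = Problem("Pressure reading drifting high",
                    product_impact_possible=True,
                    outside_operating_limits=False)
    print(classify(issue).value)
    print(route(issue))
```

The value of writing the rules down this explicitly, even informally, is that the person asking for help and the person responding are working from the same definitions and the same expectations about timing.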

In order for this to work, it is important to identify clear ownership of the problem. There must always be one person clearly accountable, even if they are only responsible for pieces of it, so they can push the problem forward.

It is easy for problem-solving to stall. So make sure progress is transparent. Knowing what is being worked on, and what is not, is critical.

Prioritization is key. Not every problem needs solving, so have a mechanism to ensure the right problems are being solved in the process.

Problem solving within a process

Managing Events Systematically

Being good at problem-solving is critical to success in an organization. I’ve written quite a bit on problem-solving, but here I want to tackle the amount of effort we should apply.

Not all problems should be treated the same. There are also levels of problems. And these two aspects can contribute to some poor problem-solving practices.

It helps to look at problems systematically across our organization. The iceberg analogy is a pretty popular way to break this down, focusing on Events, Patterns, Underlying Structure, and Mental Model.

Iceberg analogy

Events

Events start with the observation or discovery of a situation that is different in some way. What is being observed is a symptom and we want to quickly identify the problem and then determine the effort needed to address it.

This is where Art Smalley’s Four Types of Problems comes in handy to help us take a risk-based approach to determining our level of effort.

Type 1 problems, Troubleshooting, let us address problems where there is a clear understanding of the issue and a clear pathway. Have a flat tire? Fix it. Have a document error? Fix it using good documentation practices.

It is valuable to work your way through common troubleshooting scenarios and ensure the appropriate linkages between the different processes, to support a system-wide approach to problem solving.

Corrective maintenance is a great example of troubleshooting, as it involves restoring the original state of an asset. It includes documentation, a return to service, and analysis of data. From that analysis of data, problems are identified which require going deeper into problem-solving. It should have appropriate tie-ins to evaluate when the impact of an asset breaking leads to other problems (for example, impact to product), which can also require additional problem-solving.

It can be helpful for the organization to build decision trees that help folks decide if a given problem stays as troubleshooting or if it also requires going to type 2, “gap from standard.”
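
Here is a minimal sketch of what such a decision tree might look like in code. The questions and outcomes are hypothetical and would need to reflect your own procedures, not a validated logic.

```python
def triage(document_only_error: bool,
           correctable_per_gdp: bool,
           potential_product_impact: bool,
           outside_validated_range: bool) -> str:
    """Hypothetical decision tree: stay in troubleshooting (type 1)
    or escalate to a gap-from-standard (type 2) problem."""
    if potential_product_impact or outside_validated_range:
        return "Type 2: open a deviation and begin root cause analysis"
    if document_only_error and correctable_per_gdp:
        return "Type 1: correct per good documentation practices, record, close"
    return "Type 1: troubleshoot, restore, and document"

# Example: a flat-tire-style fix with no product impact stays as troubleshooting.
print(triage(document_only_error=False, correctable_per_gdp=False,
             potential_product_impact=False, outside_validated_range=False))
```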

Type 2 problems, gap from standard, mean that the actual result does not meet the expected result and there is a potential of not meeting the core requirements (objectives) of the process, product, or service. This is where we start deeper problem-solving, including root cause analysis.

Please note that troubleshooting is often done within a type 2 problem; we call that a correction. If the bioreactor cannot maintain temperature during a run, that is a type 2 problem, but I am certainly going to apply troubleshooting immediately as well.

Take documentation errors. There is a practice in place, part of good documentation practices, for troubleshooting around documents (how to correct, how to record a comment, etc.). By working through the various ways documentation can go wrong and determining which ones are solved through troubleshooting and do not need to become type 2 problems, we can keep a lot of noise out of our system.

Core to the quality system is trending, looking for possible signals that require additional effort. Trending can help determine where problems lie and can also drive up the level of effort necessary.

Underlying Structure

Root cause analysis is about finding the underlying structure of the problem, and it defines the work applied to a type 2 problem.

Not all problems require the same amount of effort, and type 2 problems really have a scale based on consequences that can help drive the level of effort. This should be based on the impact to the organization’s ability to meet its quality objectives, the requirements behind the product or service.

For example, in the pharma world there are three major criteria:

  • safety, rights, or well-being of patients (including subjects and participants, human and non-human)
  • data integrity (includes confidence in the results, outcome, or decision dependent on the data)
  • ability to meet regulatory requirements (which stem from but can be a lot broader than the first two)

These three criteria can be sliced and diced a lot of ways, but serve our example well.

To these three criteria we add a scale of possible harm to derive our criticality. An example can look like this:

  • Critical – The event has resulted in, or is clearly likely to result in, any one of the following outcomes: significant harm to the safety, rights, or well-being of subjects or participants (human or non-human), or patients; compromised data integrity to the extent that confidence in the results, outcome, or decision dependent on the data is significantly impacted; or regulatory action against the company.
  • Major – The event(s), were they to persist over time or become more serious, could potentially, though not imminently, result in any one of the following outcomes: harm to the safety, rights, or well-being of subjects or participants (human or non-human), or patients; or compromised data integrity to the extent that confidence in the results, outcome, or decision dependent on the data is significantly impacted.
  • Minor – An isolated or recurring triggering event that does not otherwise meet the definitions of Critical or Major quality impacts.
Example of Classification of Events in a Pharmaceutical Quality System
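
A sketch of how a scale like this might be encoded, assuming the three criteria and the Critical/Major/Minor definitions above; the specific yes/no questions an investigator would answer are illustrative assumptions, not a validated procedure.

```python
from enum import Enum

class Classification(Enum):
    CRITICAL = "Critical"
    MAJOR = "Major"
    MINOR = "Minor"

def classify_event(harm_occurred_or_clearly_likely: bool,
                   data_integrity_significantly_impacted: bool,
                   regulatory_action: bool,
                   could_escalate_if_persistent: bool) -> Classification:
    """Map an event onto the example scale above.
    The boolean arguments are hypothetical prompts an investigator might answer."""
    if (harm_occurred_or_clearly_likely
            or data_integrity_significantly_impacted
            or regulatory_action):
        return Classification.CRITICAL
    if could_escalate_if_persistent:
        return Classification.MAJOR
    return Classification.MINOR

# Example: an isolated documentation error with no wider impact.
print(classify_event(False, False, False, False).value)  # Minor
```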

This level of classification will drive the level of effort on the investigation, as well as whether the CAPA addresses underlying structures alone or also addresses the mental models, thus driving culture change.

Mental Model

Here is where we address building a quality culture. In CAPA lingo this is usually more a preventive action than a corrective action. In the simplest of terms, corrective actions address the underlying structures of the problem in the process/asset where the event happened. Preventive actions deal with underlying structures in other (usually related) processes/assets or get to the mindsets that allowed the underlying structures to exist in the first place.

Solving Problems Systematically

By applying this systems perspective to our problem solving, by realizing that not everything needs a complete rebuild of the foundation, and by looking holistically across our systems, we can ensure that we are applying the right level of effort to truly build the house of quality.

Treating All Investigations the Same

Stephanie Gaulding, a colleague in the ASQ, recently wrote an excellent post for Redica on “How to Avoid Three Common Deviation Investigation Pitfalls,” a subject near and dear to my heart.

The three pitfalls Stephanie gives are:

  1. Not getting to root cause
  2. Inadequate scoping
  3. Treating investigations the same

All three are right on the nose, and I’ve posted a bunch on the topics. Definitely go and read the post.

What I want to delve deeper into is Stephanie’s point that “Deviation systems should also be built to triage events into risk-based categories with sufficient time allocated to each category to drive risk-based investigations and focus the most time and effort on the highest risk and most complex events.”

That is an accurate breakdown, and exactly what regulators are asking for. However, I think the implementation of risk-based categories can sometimes lead to confusion, and we can spend some time unpacking the concept.

Risk is the possible effect of uncertainty. Risk is often described in terms of risk sources, potential events, their consequences, and their likelihoods (which is where we get likelihood × severity from).
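
For the likelihood × severity part, a minimal sketch; the ordinal scales and the category cut-offs here are invented for illustration and would need to come from your own risk management procedure.

```python
# Hypothetical 1-5 ordinal scales; real scales must be defined in your procedure.
LIKELIHOOD = {"rare": 1, "unlikely": 2, "possible": 3, "likely": 4, "frequent": 5}
SEVERITY = {"negligible": 1, "minor": 2, "moderate": 3, "major": 4, "critical": 5}

def risk_score(likelihood: str, severity: str) -> int:
    """Classic risk score: likelihood x severity."""
    return LIKELIHOOD[likelihood] * SEVERITY[severity]

def risk_category(score: int) -> str:
    """Illustrative cut-offs only."""
    if score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"

score = risk_score("possible", "major")   # 3 x 4 = 12
print(score, risk_category(score))        # 12 medium
```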

But there are a lot of types of uncertainty. IEC 31010, “Risk management – Risk assessment techniques,” lists the following examples:

  • uncertainty as to the truth of assumptions, including presumptions about how people or systems might behave
  • variability in the parameters on which a decision is to be based
  • uncertainty in the validity or accuracy of models which have been established to make predictions about the future
  • events (including changes in circumstances or conditions) whose occurrence, character or consequences are uncertain
  • uncertainty associated with disruptive events
  • the uncertain outcomes of systemic issues, such as shortages of competent staff, that can have wide-ranging impacts which cannot be clearly defined
  • lack of knowledge which arises when uncertainty is recognized but not fully understood
  • unpredictability
  • uncertainty arising from the limitations of the human mind, for example in understanding complex data, predicting situations with long-term consequences or making bias-free judgments.

Most of these are only, at best, obliquely relevant to risk categorizing deviations.

So it is important to first build the risk categories on consequences. At the end of the day, these are the consequences that matter in the pharmaceutical/medical device world:

  • harm to the safety, rights, or well-being of patients, subjects or participants (human or non-human)
  • compromised data integrity so that confidence in the results, outcome, or decision dependent on the data is impacted

These are some pretty hefty areas and really hard for the average user to get their minds around. This is why building good requirements, and understanding how systems work, is so critical. Building breadcrumbs into our procedures to let folks know which deviations fall into which category is a best practice.

There is nothing wrong with recognizing that different areas have different decision trees. Harm to safety in GMP can mean different things than safety in a GLP study.

The second place I’ve seen this go wrong has to do with likelihood, and with folks confusing the symptom with the problem and the problem with the cause.

All deviations start with a situation that is different in some way from expected results. Deviations start with the symptom and, through analysis, end up at a root cause. So when building your decision tree, ensure it looks at symptoms and how the symptom is observed. That is surprisingly hard to do, which is why a lot of deviation criticality scales tend to focus only on severity.

4 major types of symptoms
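
One way to keep symptom, problem, and cause from blurring together in a deviation system is to record them as separate fields that get filled in at different points in the investigation. A minimal sketch, with hypothetical field names:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeviationRecord:
    # Captured at triage time: what was observed and how it was detected.
    symptom: str
    detected_by: str            # e.g. "in-process check", "alarm", "batch record review"
    # Filled in as the investigation progresses; empty at triage.
    problem_statement: Optional[str] = None
    root_cause: Optional[str] = None

dev = DeviationRecord(symptom="Bioreactor temperature drifted 2 deg C above setpoint",
                      detected_by="alarm")
# Triage and categorization work from symptom + detection, before root cause is known.
print(dev.symptom, "/", dev.detected_by)
```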