The Golden Start to a Deviation Investigation

How you respond in the first 24 hours after discovering a deviation can make the difference between a minor quality issue and a major compliance problem. This critical window, which I call “The Golden Day,” represents your best opportunity to capture accurate information, contain potential risks, and set the stage for a successful investigation. When managed effectively, this initial day creates the foundation for identifying true root causes and implementing effective corrective actions that protect product quality and patient safety.

Why the First 24 Hours Matter: The Evidence

The initial response to a deviation is crucial for both regulatory compliance and effective problem-solving. Industry practice and regulatory expectations align on the importance of quick, systematic responses to deviations.

Regulatory expectations explicitly state that deviation investigation and root cause determination should be completed in a timely manner, and industry expectations usually align on deviations being completed within 30 days of discovery.
In the landmark U.S. v. Barr Laboratories case, “the Court declared that all failure investigations must be performed promptly, within thirty business days of the problem’s occurrence.”
Best practices recommend assembling a cross-functional team immediately after deviation discovery and conducting an initial risk assessment within 24 hours.
Initial actions taken in the first day directly impact the quality and effectiveness of the entire investigation process

When you capitalize on this golden window, you’re working with fresh memories, intact evidence, and the highest chance of observing actual conditions that contributed to the deviation.

Managing Events Systematically

Identifying the Problem: Clarity from the Start

Clear, precise problem definition forms the foundation of any effective investigation. Vague or incomplete problem statements lead to misdirected investigations and ultimately, inadequate corrective actions.

Document using specific, factual language that describes what occurred versus what was expected
Include all relevant details such as procedure and equipment numbers, product names and lot numbers
Apply the 5W2H method (What, When, Where, Who, Why if known, How much is involved, and How it was discovered)
Avoid speculation about causes in the initial description
Remember that the description should incorporate relevant records and photographs of discovered defects.

5W2H	Typical questions	Contains
Who?	Who are the people directly concerned with the problem? Who does this? Who should be involved but wasn’t? Was someone involved who shouldn’t be?	User IDs, Roles and Departments
What?	What happened?	Action, steps, description
When?	When did the problem occur?	Times, dates, place in process
Where?	Where did the problem occur?	Location
Why is it important?	Why did we do this? What are the requirements? What is the expected condition?	Justification, reason
How?	How did we discover it? Where in the process was it?	Method, process, procedure
How Many? How Much?	How many things are involved? How often did the situation happen? How much did it impact?	Number, frequency

The quality of your deviation documentation begins with this initial identification. As I’ve emphasized in previous posts, the investigation/deviation report should tell a story that can be easily understood by all parties well after the event and the investigation. This narrative begins with clear identification on day one.

Elements	Problem Statement
Is used to…	Understand and target a problem. Provide a scope. Evaluate any risks. Make objective decisions.
Answers the following… (5W2H)	What? (problem that occurred); When? (timing of what occurred); Where? (location of what occurred); Who? (persons involved/observers); Why? (why it matters, not why it occurred); How Much/Many? (volume or count); How Often? (first/only occurrence or multiple)
Contains…	Object (What was affected?); Defect (What went wrong?)
Provides direction for…	Escalation(s); Investigation
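
One way to make the 5W2H check concrete is to treat the problem statement as a small record and ask which elements are still blank at the end of day one. The sketch below is illustrative only: the class name, field names, and example values are my own assumptions, not part of any particular quality management system.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ProblemStatement:
    """Hypothetical 5W2H record; field names are illustrative only."""
    what: Optional[str] = None            # what occurred versus what was expected
    when: Optional[str] = None            # time, date, place in process
    where: Optional[str] = None           # location
    who: Optional[str] = None             # roles or user IDs involved or observing
    why_it_matters: Optional[str] = None  # requirement at stake, not the cause
    how_discovered: Optional[str] = None  # method or step that revealed it
    how_much: Optional[str] = None        # count, volume, or frequency

    def missing_elements(self) -> list[str]:
        """Return the names of the 5W2H elements that are still blank."""
        return [f.name for f in fields(self)
                if not (getattr(self, f.name) or "").strip()]


# Made-up example: four elements captured so far, three still open.
statement = ProblemStatement(
    what="Fill weight recorded below the lower limit in the batch record",
    when="During the third in-process check",
    where="Filling line (example)",
    who="Operator performing the check; QA on the floor",
)
print(statement.missing_elements())
# ['why_it_matters', 'how_discovered', 'how_much']
```

Note that the "why" field deliberately captures why the event matters, not why it happened, which keeps speculation about causes out of the initial description.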

Going to the GEMBA: Being Where the Action Is

GEMBA, the actual place where work happens, is a cornerstone concept in quality management. When a deviation occurs, there is no substitute for being physically present at the location.

Observe the actual conditions and environment firsthand
Notice details that might not be captured in written reports
Understand the workflow and context surrounding the deviation
Gather physical evidence before it’s lost or conditions change
Create the opportunity for meaningful conversations with operators

Human error occurs because we are human beings. The extent of our knowledge, training, and skill has little to do with the mistakes we make. We tire, our minds wander and lose concentration, and we must navigate complex processes while satisfying competing goals and priorities – compliance, schedule adherence, efficiency, etc.

Foremost to understanding human performance is knowing that people do what makes sense to them given the available cues, tools, and focus of their attention at the time. Simply put, people come to work to do a good job – if it made sense for them to do what they did, it will make sense to others given similar conditions. The following factors significantly shape human performance and should be the focus of any human error investigation:

Physical Environment: environment, tools, procedures, process design	Organizational Culture: just- or blame-culture, attitude towards error
Management and Supervision: management of personnel, training, procedures	Stress Factors: personal, circumstantial, organizational

We do not want to see or experience human error – but when we do, it’s imperative to view it as a valuable opportunity to improve the system or process. This mindset is the heart of effective human error prevention.

Conducting an Effective GEMBA Walk for Deviations

When conducting your GEMBA walk specifically for deviation investigation:

Arrive with a clear purpose and structured approach
Observe before asking questions
Document observations with photos when appropriate
Look for environmental factors that might not appear in reports
Pay attention to equipment configuration and conditions
Note how operators interact with the process or equipment

A deviation gemba is a cross-functional team meeting that is assembled where a potential deviation event occurred. Going to the gemba and “freezing the scene” as close as possible to the time the event occurred will yield valuable clues about the environment that existed at the time – and fresher memories will provide higher quality interviews. This gemba has specific objectives:

Obtain a common understanding of the event: what happened, when and where it happened, who observed it, who was involved – all the facts surrounding the event. Is it a deviation?
Clearly describe actions taken, or that need to be taken, to contain impact from the event: product quarantine, physical or mechanical interventions, management or regulatory notifications, etc.
Interview involved operators: ask open-ended questions, like how the event unfolded or was discovered, from their perspective, or how the event could have been prevented, in their opinion – insights from personnel experienced with the process can prove invaluable during an investigation.

Deviation GEMBA Tips

Typically there is time between when notification of a deviation gemba goes out and when the team is scheduled to assemble. It is important to come prepared to help facilitate an efficient gemba:

Assemble procedures and other relevant documents and records. This will make references easier during the gemba.
Keep your team on-track – the gemba should end with the team having a common understanding of the event, actions taken to contain impact, and the agreed-upon next steps of the investigation.

You will gain plenty of investigational leads from your observations and interviews at the gemba – which documents to review, which personnel to interview, which equipment history to inspect, and more. The gemba is such an invaluable experience that, for many minor events, root cause and CAPA can be determined fairly easily from information gathered solely at the gemba.

Informal Rubric for Conducting a Good Deviation GEMBA

Describe the timeliness of the team gathering at the gemba.
Were all required roles and experts present?
Was someone leading or facilitating the gemba?
Describe any interviews the team performed during the gemba.
Did the team get sidetracked or off-topic during the gemba?
Was the team prepared with relevant documentation or information?
Did the team determine batch impact and any reportability requirements?
Did the team satisfy the objectives of the gemba?
What did the team do well?
What could the team improve upon?

Speaking with Operators: The Power of Cognitive Interviewing

Interviewing personnel who were present when the deviation occurred requires special techniques to elicit accurate, complete information. Traditional questioning often fails to capture critical details.

Cognitive interviewing, as I outlined in my previous post on “Interviewing,” was originally created for law enforcement and later adopted during accident investigations by the National Transportation Safety Board (NTSB). This approach is based on two key principles:

Witnesses need time and encouragement to recall information
Retrieval cues enhance memory recall

How to Apply Cognitive Interviewing in Deviation Investigations

Mental Reinstatement: Encourage the interviewee to mentally recreate the environment and people involved
In-Depth Reporting: Encourage the reporting of all details, even if they seem minor or not directly related
Multiple Perspectives: Ask the interviewee to recall the event from others’ points of view
Several Orders: Ask the interviewee to recount the timeline in different ways: beginning to end, then end to beginning

Most importantly, conduct these interviews at the actual location where the deviation occurred. Retrieval cues in the environment help the interviewee access memory, which is why doing the interview on the scene (or Gemba) is so effective.

Component	What It Consists of
Mental Reinstatement	Encourage the interviewee to mentally recreate the environment and people involved.
In-Depth Reporting	Encourage the reporting of all the details.
Multiple Perspectives	Ask the interviewee to recall the event from others’ points of view.
Several Orders	Ask the interviewee to recount the timeline in different ways.

Approach the Interviewee Positively:
- Ask for the interview.
- State the purpose of the interview.
- Tell interviewee why he/she was selected.
- Avoid statements that imply blame.
- Focus on the need to capture knowledge
- Answer questions about the interview.
- Acknowledge and respond to concerns.
- Manage negative emotions.
Apply these Four Components:
- Use mental reinstatement.
- Report everything.
- Change the perspective.
- Change the order.
Apply these Two Principles:
- Witnesses need time and encouragement to recall information.
- Retrieval cues enhance memory recall.
Demonstrate these Skills:
- Recreate the original context and have them walk you through the process.
- Tell the witness to actively generate information.
- Adopt the witness’s perspective.
- Listen actively, do not interrupt, and pause before asking follow-up questions.
- Ask open-ended questions.
- Encourage the witness to use imagery.
- Perform interview at the Gemba.
- Follow sequence of the four major components.
- Bring support materials.
- Establish a connection with the witness.
- Do not tell them how they made the mistake.

Initial Impact Assessment: Understanding the Scope

Within the first 24 hours, a preliminary impact assessment is essential for determining the scope of the deviation and the appropriate response.

Apply a risk-based approach to categorize the deviation as critical, major, or minor
Evaluate all potentially affected products, materials, or batches
Consider potential effects on critical quality attributes
Assess possible regulatory implications
Determine if released products may be affected

This impact assessment is also the initial risk assessment, which will help guide the level of effort put into the deviation.

Factors to Consider in Initial Risk Assessment

Patient safety implications
Product quality impact
Compliance with registered specifications
Potential for impact on other batches or products
Regulatory reporting requirements
Level of investigation required

This initial assessment will guide subsequent decisions about quarantine, notification requirements, and the depth of investigation needed. Remember, this is a preliminary assessment that will be refined as the investigation progresses.

Immediate Actions: Containing the Issue

Once you’ve identified the deviation and assessed its potential impact, immediate actions must be taken to contain the issue and prevent further risk.

Quarantine potentially affected products or materials to prevent their release or further use
Notify key stakeholders, including quality assurance, production supervision, and relevant department heads
Implement temporary corrective or containment measures
Document the deviation in your quality management system
Secure relevant evidence and documentation
Consider whether to stop related processes

Industry best practices emphasize that you should report the deviation in real time, notify QA within 24 hours, and hold the GEMBA. Remember that “if you don’t document it, it didn’t happen” – thorough documentation of both the deviation and your immediate response is essential.

Affected vs Related Batches

Not every impact is the same, so it can be helpful to have two concepts: Affected and Related.

Affected Batch: Product directly impacted by the event at the time of discovery, for instance, the batch being manufactured or tested when the deviation occurred.
Related Batch: Product manufactured or tested under the same conditions or parameters using the process in which the deviation occurred and determined as part of the deviation investigation process to have no impact on product quality.

Setting Up for a Successful Full Investigation

The final step in the golden day is establishing the foundation for the comprehensive investigation that will follow.

Assemble a cross-functional investigation team with relevant expertise
Define clear roles and responsibilities for team members
Establish a timeline for the investigation (remembering the 30-day guideline)
Identify additional data or evidence that needs to be collected
Plan for any necessary testing or analysis
Schedule follow-up interviews or observations

In my post on handling deviations, I emphasized that you must perform a time-sensitive and thorough investigation within 30 days. The groundwork laid during the golden day will make this timeline achievable while maintaining investigation quality.
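
The timing rules in this post are simple enough to compute when the deviation is opened. The sketch below is a minimal illustration, not a validated tool: the function names are hypothetical, the start date is whatever your procedure defines (discovery or occurrence), and the business-day version ignores site holidays. I show both calendar and business days because the 30-day industry guideline and the “thirty business days” language quoted from the Barr decision give different dates.

```python
from datetime import date, datetime, timedelta


def golden_day_end(discovered: datetime) -> datetime:
    """End of the first 24 hours after discovery."""
    return discovered + timedelta(hours=24)


def calendar_due(start: date, days: int = 30) -> date:
    """Due date counting calendar days from the start date."""
    return start + timedelta(days=days)


def business_due(start: date, days: int = 30) -> date:
    """Due date counting Monday to Friday only (holidays are ignored here)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


discovered = datetime(2024, 3, 4, 14, 30)   # a Monday, chosen for illustration
print(golden_day_end(discovered))           # 2024-03-05 14:30:00
print(calendar_due(discovered.date()))      # 2024-04-03
print(business_due(discovered.date()))      # 2024-04-15
```

The twelve-day gap between the two due dates in this example is a good reason to state in your procedure which counting rule applies.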

Planning for Root Cause Analysis

During this setup phase, you should also begin planning which root cause analysis tools might be most appropriate for your investigation. Select tools based on the event complexity and the number of potential root causes. When “human error” appears to be involved, prepare to dig deeper, as this is rarely the true root cause.

Identifying Phase of your Investigation

If	Then you are at
The problem is not understood. Boundaries have not been set. There could be more than one problem	Problem Understanding
Data needs to be collected. There are questions about frequency or occurrence. You have not had interviews	Data Collection
Data has been collected but not analyzed	Data Analysis
The root cause needs to be determined from the analyzed data	Identify Root Cause
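
The table above reads naturally as a short decision sequence: you are in the earliest phase whose exit condition has not yet been met. A minimal sketch, assuming three yes/no answers that the investigation lead supplies (the function and parameter names are hypothetical):

```python
def investigation_phase(problem_understood: bool,
                        data_collected: bool,
                        data_analyzed: bool) -> str:
    """Return the earliest unfinished phase, following the table above."""
    if not problem_understood:
        return "Problem Understanding"
    if not data_collected:
        return "Data Collection"
    if not data_analyzed:
        return "Data Analysis"
    return "Identify Root Cause"


# Boundaries are set and interviews are done, but nothing has been analyzed yet.
print(investigation_phase(True, True, False))  # Data Analysis
```

Checking the conditions in order matters: if the problem is not understood, the function returns "Problem Understanding" even when some data has already been collected.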

Root Cause Analysis Tools Chart

Purpose	Tool	Description
Problem Understanding	Process Map	A picture of the separate steps of a process in sequential order, including: materials or services entering or leaving the process (inputs and outputs), decisions that must be made, people who become involved, time involved at each step, and/or process measurements.
	Critical Incident Technique (CIT)	A process used for collecting direct observations of human behavior that have critical significance, and meet methodically defined criteria.
	Comparative Analysis	A technique that focuses a problem-solving team on a problem. It compares one or more elements of a problem or process to evaluate elements that are similar or different (e.g. comparing a standard process to a failing process).
	Performance Matrix	A tool that describes the participation by various roles in completing tasks or deliverables for a project or business process. Note: It is especially useful in clarifying roles and responsibilities in cross-functional/departmental positions.
	5W2H Analysis	An approach that defines a problem and its underlying contributing factors by systematically asking questions related to who, what, when, where, why, how, and how much/often.
Data Collection	Surveys	A technique for gathering data from a targeted audience based on a standard set of criteria.
	Check Sheets	A technique to compile data or observations to detect and show trends/patterns.
	Cognitive Interview	An interview technique used by investigators to help the interviewee recall specific memories from a specific event.
	KNOT Chart	A data collection and classification tool to organize data based on what is Known, Need to know, Opinion, and Think we know.
Data Analysis	Pareto Chart	A technique that focuses efforts on problems offering the greatest potential for improvement.
	Histogram	A tool that summarizes data collected over a period of time, and graphically presents frequency distribution.
	Scatter Chart	A tool to study possible relationships between changes in two different sets of variables.
	Run Chart	A tool that captures study data for trends/patterns over time.
	Affinity Diagram	A technique for brainstorming and summarizing ideas into natural groupings to understand a problem.
Root Cause Analysis	Interrelationship Digraphs	A tool to identify, analyze, and classify cause and effect relationships among issues so that drivers become part of an effective solution.
	Why-Why	A technique that allows one to explore the cause-and-effect relationships of a particular problem by asking why; drilling down through the underlying contributing causes to identify root cause.
	Is/Is Not	A technique that guides the search for causes of a problem by isolating the who, what, when, where, and how of an event. It narrows the investigation to factors that have an impact and eliminates factors that do not have an impact. By comparing what the problem is with what the problem is not, we can see what is distinctive about a problem which leads to possible causes.
	Structured Brainstorming	A technique to identify, explore, and display the factors within each root cause category that may be affecting the problem/issue, and/or effect being studied through this structured idea-generating tool.
	Cause and Effect Diagram (Ishikawa/Fishbone)	A tool to display potential causes of an event based on root cause categories defined by structured brainstorming using this tool as a visual aid.
	Causal Factor Charting	A tool to analyze human factors and behaviors that contribute to errors, and identify behavior-influencing factors and gaps.
Other Tools	Prioritization Matrix	A tool to systematically compare choices through applying and weighting criteria.
	Control Chart	A tool to monitor process performance over time by studying its variation and source.
	Process Capability	A tool to determine whether a process is capable of meeting requirements or specifications.

Making the Most of Your Golden Day

The first 24 hours after discovering a deviation represent a unique opportunity that should not be wasted. By following the structured approach outlined in this post (identifying the problem clearly, going to the GEMBA, interviewing operators using cognitive techniques, conducting an initial impact assessment, taking immediate containment actions, and setting up for the full investigation), you maximize the value of this golden day.

Remember that excellent deviation management is directly linked to product quality, patient safety, and regulatory compliance. Each well-managed deviation is an opportunity to strengthen your quality system.

I encourage you to assess your current approach to the first 24 hours of deviation management. Are you capturing the full value of this golden day, or are you letting critical information slip away? Implement these strategies, train your team on proper deviation triage, and transform your deviation response from reactive to proactive.

Your deviation management effectiveness doesn’t begin when the investigation report is initiated; it begins the moment a deviation is discovered. Make that golden day count.

Self-Checking in Work-As-Done

Self-checking is one of the most effective tools we can teach and use. Rooted in the four aspects of risk-based thinking (anticipate, monitor, respond, and learn), it refers to the procedures and checks that employees perform as part of their routine tasks to ensure the quality and accuracy of their work. This practice is often implemented in industries where precision is critical, and errors can lead to significant consequences. For instance, in manufacturing or engineering, workers might perform self-checks to verify that their work meets the required specifications before moving on to the next production stage.

A proactive approach enhances the reliability, safety, and quality of various systems and practices by allowing for immediate detection and correction of errors, thereby preventing potential failures or flaws from escalating into more significant issues.

The memory aid STAR (stop, think, act, review) helps the user recall the thoughts and actions associated with self-checking.

Stop – Just before conducting a task, pause to:
- Eliminate distractions.
- Focus attention on the task.
Think – Understand what will happen when the action is performed.
- Verify the action is appropriate.
- Recall the critical parameters and the action’s expected result(s).
- Consider contingencies to mitigate harm if an unexpected result occurs.
- If there is any doubt, STOP and get help.
Act – Perform the task per work-as-prescribed
Review – Verify that the expected result is obtained.
- Verify the desired change in critical parameters.
- Stop work if criteria are not met.
- Perform the contingency if an unexpected result occurs.
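
For readers who think in code, STAR can be sketched as a wrapper around a single action. This is only an analogy under my own assumptions: the function name `star`, its parameters, and the temperature example are hypothetical, and the "Stop" and "Think" steps are reduced to a single readiness flag that a person would set after pausing and thinking.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def star(action: Callable[[], T],
         expected: Callable[[T], bool],
         contingency: Callable[[], None],
         ready: bool = True) -> T:
    """Run one action with a STAR-style self-check around it."""
    # Stop / Think: if there is any doubt, do not act.
    if not ready:
        raise RuntimeError("STOP: in doubt, get help before acting")
    # Act: perform the task as prescribed.
    result = action()
    # Review: verify the expected result; otherwise run the contingency and stop.
    if not expected(result):
        contingency()
        raise RuntimeError(f"Stop work: unexpected result {result!r}")
    return result


log: list[str] = []
value = star(action=lambda: 37.0,                    # e.g. read back a setpoint
             expected=lambda t: 36.5 <= t <= 37.5,   # critical parameter range
             contingency=lambda: log.append("contingency performed"))
print(value, log)  # 37.0 []
```

The point of the sketch is the ordering: the expected result and the contingency are defined before the action runs, which mirrors the "Think" step.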

Managing Events Systematically

Being good at problem-solving is critical to success in an organization. I’ve written quite a bit on problem-solving, but here I want to tackle the amount of effort we should apply.

Not all problems should be treated the same. There are also levels of problems. And these two aspects can contribute to some poor problem-solving practices.

It helps to look at problems systematically across our organization. The iceberg analogy is a popular way to break this down, focusing on Events, Patterns, Underlying Structure, and Mental Model.

Events

Events start with the observation or discovery of a situation that is different in some way. What is being observed is a symptom and we want to quickly identify the problem and then determine the effort needed to address it.

This is where Art Smalley’s Four Types of Problems comes in handy to help us take a risk-based approach to determining our level of effort.

Type 1 problems, troubleshooting, allow us to address problems where there is a clear understanding of the issue and a clear pathway. Have a flat tire? Fix it. Have a document error? Fix it using good documentation practices.

It is valuable to work through common troubleshooting scenarios and ensure the appropriate linkages between the different processes, so that problem solving takes a system-wide approach.

Corrective maintenance is a great example of troubleshooting, as it involves restoring the original state of an asset. It includes documentation, a return to service, and analysis of data. From that analysis of data, problems are identified that require going deeper into problem-solving. It should have appropriate tie-ins to evaluate when the impact of an asset breaking leads to other problems (for example, impact to product), which can also require additional problem-solving.

It can be helpful for the organization to build decision trees that help folks decide if a given problem stays as troubleshooting or if it also requires going to type 2, “gap from standard.”

Type 2 problems, gap from standard, means that the actual result does not meet the expected and there is a potential of not meeting the core requirements (objectives) of the process, product, or service. This is the place we start deeper problem-solving, including root cause analysis.

Please note that troubleshooting is often done within a type 2 problem; we call that a correction. If the bioreactor cannot maintain temperature during a run, that is a type 2 problem, but I am certainly going to apply troubleshooting immediately as well.

Take documentation errors. There is a practice in place, part of good documentation practices, for addressing troubleshooting around documents (how to correct, how to record a comment, etc). By working through the various ways documentation can go wrong and identifying which ones are solved through troubleshooting and don’t involve type 2 problems, we can avoid creating a lot of noise in our system.

Trends/Patterns

Core to the quality system is trending, looking for possible signals that require additional effort. Trending can help determine where problems lie and can also drive up the level of effort necessary.

Underlying Structure

Root Cause Analysis is about finding the underlying structure of the problem that defines the work applied to a type 2 problem.

Not all problems require the same amount of effort, and type 2 problems really have a scale based on consequences, that can help drive the level of effort. This should be based on the impact to the organization’s ability to meet the quality objectives, the requirements behind the product or service.

For example, in the pharma world there are three major criteria:

safety, rights, or well-being of patients (including subjects and participants human and non-human)
data integrity (includes confidence in the results, outcome, or decision dependent on the data)
ability to meet regulatory requirements (which stem from but can be a lot broader than the first two)

These three criteria can be sliced and diced a lot of ways, but serve our example well.

To these three criteria we add a scale of possible harm to derive our criticality, an example can look like this:

Classification	Description
Critical	The event has resulted in, or is clearly likely to result in, any one of the following outcomes: significant harm to the safety, rights, or well-being of subjects or participants (human or non-human), or patients; compromised data integrity to the extent that confidence in the results, outcome, or decision dependent on the data is significantly impacted; or regulatory action against the company.
Major	The event(s), were they to persist over time or become more serious, could potentially, though not imminently, result in any one of the following outcomes: harm to the safety, rights, or well-being of subjects or participants (human or non-human), or patients; compromised data integrity to the extent that confidence in the results, outcome, or decision dependent on the data is significantly impacted.
Minor	An isolated or recurring triggering event that does not otherwise meet the definitions of Critical or Major quality impacts.

Example of Classification of Events in a Pharmaceutical Quality System

This level of classification will drive the level of effort on the investigation, and it determines whether the CAPA addresses underlying structures alone or also addresses the mental models, thus driving culture change.
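
The example classification table can be expressed as a small rule set. The sketch below is one possible reading, with names I made up for illustration: each of the three criteria gets an assessed impact level, and the rules follow the table rows, including the fact that the Major row only mentions patient and data integrity outcomes.

```python
from enum import IntEnum


class Impact(IntEnum):
    NONE = 0              # no credible effect on this criterion
    POTENTIAL = 1         # could result if the event persists or worsens
    ACTUAL_OR_LIKELY = 2  # has resulted, or is clearly likely to result


CRITERIA = ("patient_safety", "data_integrity", "regulatory")


def classify(assessment: dict[str, Impact]) -> str:
    """Map per-criterion impact levels to Critical, Major, or Minor."""
    levels = {c: assessment.get(c, Impact.NONE) for c in CRITERIA}
    # Critical: any one criterion is actually or very likely impacted.
    if any(level == Impact.ACTUAL_OR_LIKELY for level in levels.values()):
        return "Critical"
    # Major: potential (not imminent) harm to patients or to data integrity.
    if any(levels[c] == Impact.POTENTIAL
           for c in ("patient_safety", "data_integrity")):
        return "Major"
    # Minor: everything that does not meet the two definitions above.
    return "Minor"


print(classify({"data_integrity": Impact.POTENTIAL}))     # Major
print(classify({"regulatory": Impact.ACTUAL_OR_LIKELY}))  # Critical
print(classify({}))                                       # Minor
```

In practice the hard part is the assessment itself, which stays a human judgment; the code only makes the mapping from assessment to classification explicit and repeatable.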

Mental Model

Here is where we address building a quality culture. In CAPA lingo this is usually more a preventive action than a corrective action. In the simplest of terms, corrective actions address the underlying structures of the problem in the process/asset where the event happened. Preventive actions deal with underlying structures in other (usually related) processes/assets or get to the mindsets that allowed the underlying structures to exist in the first place.

By applying this system perspective to our problem solving, by realizing that not everything needs a complete rebuild of the foundation, by looking holistically across our systems, we can ensure that we are driving a level of effort to truly build the house of quality.

Bystander Effect, Open Communication and Quality Culture

Our research suggests that the bystander effect can be real and strong in organizations, especially when problems linger out in the open to everyone’s knowledge.
Insiya Hussain and Subra Tangirala (January 2019) “Why Open Secrets Exist in Organizations” Harvard Business Review

The bystander effect occurs when the presence of others discourages an individual from intervening in an emergency situation. When individuals relinquish responsibility for addressing a problem, the potential negative outcomes are wide-ranging. While a great deal of the research focuses on helping victims, overcoming the bystander effect is very relevant to building a quality culture.

The literature on this often follows after social psychologists John M. Darley and Bibb Latané who identified the concept in the late ’60s. They defined five characteristics bystanders go through:

Notice that something is going on
Interpret the situation as being an emergency
Degree of responsibility felt
Form of assistance
Implement the action choice

This is very similar to the 4 Cs of troubleshooting: Concern (Notice), Cause (Interpret), Countermeasure (Form of Assistance and Implement), and Check results.

What is critical here is that degree of responsibility felt. Without it we see people looking at a problem and shrugging, and then the problem goes on and on. It is also possible for people to just be so busy that the degree of responsibility is felt to the wrong aspect, such as “get the task done” or “do not slow down operations” and it leads to the wrong form of assistance – the wrong troubleshooting.

When building a quality culture, and making sure troubleshooting is an ingrained activity, it is important to work with employees so they understand that their voices are not redundant and that they need to share their opinions even if others have the same information. As the HBR article says: “If you see something, say something (even if others see the same thing).”

Building a quality culture is all about building norms that encourage detection of potential threats or problems and norms that encourage improvement and innovation.

When troubleshooting causes trouble

I recently had a discussion with one of the best root cause investigators and problem solvers I know, Thor McRockhammer. Thor had concerns about a case where the expected conditions were not met and there were indications that individuals engaged in troubleshooting and as a result not only made the problem worse but led to a set of issues that seem rather systematic.

Our conversation (which I do not want to go into too much detail on) was a great example of troubleshooting going wrong.

Troubleshooting is defined as “Reactive problem solving based upon quick responses to immediate symptoms. Provides relief and immediate problem mitigation. But may fail to get at the real cause, which can lead to prolonged cycles of firefighting.” Troubleshooting usually goes wrong in one of a few ways:

Not knowing when troubleshooting shouldn’t be executed
Using troubleshooting exclusively
Not knowing when to go to other problem solving tools (usually “Gap from standard”) or to trigger other quality systems, such as change management.

Troubleshooting is a reactive process of fixing problems by rapid response and short-term corrective actions. It covers noticing the problem, stopping the damage and preventing spread of the problem.

So if our departure from expected conditions was a leaky gasket, then troubleshooting is to try to stop the leak. If our departure is a missing cart then troubleshooting usually involves finding the cart.

Troubleshooting puts things back into the expected condition without changing anything. It addresses the symptom and not the fundamental problems and their underlying causes. It is carried out directly by the people who experience the symptoms, relying upon thorough training, expertise and procedures designed explicitly for troubleshooting.

With our leaky gasket example, our operators are trained and have procedural guidance to tighten or even replace a gasket. They also know what not to do (for example, don’t weld the pipe, don’t use a different gasket, etc.). There is also a process for documenting that the troubleshooting happened (work order, comment, etc.).

To avoid the problems listed above, troubleshooting needs a process that people can be thoroughly trained in. This process needs to cover what to do, how to communicate it, and where the boundaries are.

4 Cs of Troubleshooting from Art Smalley’s Four Types of Problems

Step	What we do	Things to be aware of
Concern	· What do we know about the exact nature of the problem?	· What do your standards say about how this concern should be documented? o For example, can it be addressed as a comment or does it require a deviation or similar non-conformance? · If the concern stems from a requirement it must be documented.
Cause	· What do you know about the apparent (or root) cause of the problem?	· Troubleshooting is really good at dealing with superficial cause-and-effect relationships. As the cause deepens, fixing it requires deeper problem-solving. · The cause can be a deficiency or departure from a standard
Countermeasure	· What immediate or temporary countermeasures can be taken to reduce or eliminate the problem? · Are follow-up or more permanent countermeasures required to prevent recurrence? o If so, do you need to investigate more deeply?	· Countermeasures need to be evaluated against change management · Countermeasures cannot ignore, replace or go around standards · Apply good knowledge management
Check results	· Did the results of the action have any immediate effect on eliminating the concern or problem? · Does the problem repeat? o If so, do you need to investigate more deeply?	· Recurrence should trigger deeper problem-solving and be recorded in the quality system. · Beware troubleshooting countermeasures becoming tribal knowledge and the new way of working
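The table above can be read as a simple record with an escalation check attached. What follows is only a minimal sketch under my own assumptions: the field names, the `TroubleshootingRecord` class and the `escalation_reasons` function are hypothetical illustrations, not part of Art Smalley's method or of any quality system. The rules encode the table's prompts: a countermeasure with no immediate effect, a recurring problem, or a need for a more permanent countermeasure should each prompt the question of whether to investigate more deeply.

```python
from dataclasses import dataclass


@dataclass
class TroubleshootingRecord:
    """One pass through the 4 Cs (hypothetical structure for illustration)."""
    concern: str                 # Concern: what we know about the problem
    cause: str                   # Cause: the apparent cause
    countermeasure: str          # Countermeasure: the immediate action taken
    had_immediate_effect: bool   # Check results: did the action help?
    problem_recurred: bool       # Check results: does the problem repeat?
    follow_up_needed: bool = False  # Is a more permanent countermeasure required?


def escalation_reasons(record: TroubleshootingRecord) -> list[str]:
    """Return the reasons, if any, to move beyond troubleshooting."""
    reasons = []
    if not record.had_immediate_effect:
        reasons.append("countermeasure had no immediate effect")
    if record.problem_recurred:
        reasons.append("problem recurred")
    if record.follow_up_needed:
        reasons.append("permanent countermeasure required")
    return reasons


if __name__ == "__main__":
    record = TroubleshootingRecord(
        concern="Leaky gasket",
        cause="Unknown",
        countermeasure="Tightened gasket per procedure",
        had_immediate_effect=True,
        problem_recurred=True,
    )
    print(escalation_reasons(record))
```

An empty list means the event can stay within troubleshooting; anything else is a prompt for a person to decide on deeper problem-solving, not an automatic verdict.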

Troubleshooting is in a box

Think of your standards as a box. This box defines what should happen per our qualified/validated state, our procedures, and the like. We can troubleshoot as much as we want within the box. We cannot change the box in any way, nor can we leave the box without triggering our deviation/nonconformance system (reactive) or entering change management (proactive).
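The box rule can be sketched as a small routing function. This is a rough illustration under assumed names (`Route`, `route_action`, and the two boolean inputs are all hypothetical); in practice, deciding whether an action sits inside the standard is a judgment that training and procedures have to support.

```python
from enum import Enum


class Route(Enum):
    TROUBLESHOOT = "troubleshoot within the standard"
    DEVIATION = "deviation/nonconformance system"
    CHANGE_MANAGEMENT = "change management"


def route_action(within_standard: bool, planned_in_advance: bool) -> Route:
    """Decide which system an action belongs to.

    within_standard: the action stays inside the qualified/validated state
        and the procedures (inside the box).
    planned_in_advance: the departure from the standard is proposed before
        it happens (proactive) rather than discovered afterwards (reactive).
    """
    if within_standard:
        return Route.TROUBLESHOOT
    if planned_in_advance:
        return Route.CHANGE_MANAGEMENT
    return Route.DEVIATION


if __name__ == "__main__":
    # Tightening a gasket per procedure stays inside the box.
    print(route_action(within_standard=True, planned_in_advance=False).value)
    # An unplanned departure from the procedure leaves the box.
    print(route_action(within_standard=False, planned_in_advance=False).value)
```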

Communication is critical for troubleshooting

Troubleshooting processes need a mechanism for letting supervisors know what is happening. Troubleshooting that happens in the dark is going to cause a disaster.

Operators need to be trained how to document troubleshooting. Sometimes this is as simple as a notation or comment; other times you trigger a corrective action process.

Engaging in troubleshooting and not documenting it starts to look a lot like fraud and is a data integrity concern.

Change Management

The change management process should be triggered when troubleshooting calls for it, and operators should be trained to recognize when that is. This is often where the concepts of exact replacement and like-for-like come in.

It is troubleshooting to replace a part with an identical part. Anything else (including functional equivalency) is a higher order change management activity. It is important that everyone involved knows the difference.

	Covers	Is it troubleshooting?
Like-for-like	Spare parts that are identical replacements (same manufacturer, part number, material of construction, version). Existing contingency procedures (documented, verified, ideally part of qualification/validation)	Yes. This should be built into procedures like corrective maintenance, spare parts, operations and even contingency procedures.
Functionally equivalent	Equivalent in, for example, performance, specifications, physical characteristics, usability, maintainability, cleanability, safety	No. Need to understand root cause. Need to undergo appropriate level of change management.
New	Anything else	No. Need to understand root cause. Need to undergo appropriate level of change management.

This applies to both administrative and technical controls.
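For spare parts, the like-for-like test in the table can be sketched as a comparison of the four attributes it names. This is a minimal sketch with assumptions: the `Part` fields mirror the table, while the class names, the `functionally_equivalent` flag and the example part data are hypothetical. Judging functional equivalence is an engineering assessment, so it is taken here as an input rather than computed.

```python
from dataclasses import dataclass
from enum import Enum


class ReplacementType(Enum):
    LIKE_FOR_LIKE = "like-for-like"
    FUNCTIONALLY_EQUIVALENT = "functionally equivalent"
    NEW = "new"


@dataclass(frozen=True)
class Part:
    manufacturer: str
    part_number: str
    material: str   # material of construction
    version: str


def classify_replacement(
    installed: Part, replacement: Part, functionally_equivalent: bool = False
) -> ReplacementType:
    """Classify a replacement against the installed part."""
    # Dataclass equality compares all four fields, so any difference
    # means the replacement is not like-for-like.
    if installed == replacement:
        return ReplacementType.LIKE_FOR_LIKE
    if functionally_equivalent:
        return ReplacementType.FUNCTIONALLY_EQUIVALENT
    return ReplacementType.NEW


def is_troubleshooting(kind: ReplacementType) -> bool:
    """Only like-for-like stays in troubleshooting; the rest need change management."""
    return kind is ReplacementType.LIKE_FOR_LIKE


if __name__ == "__main__":
    installed = Part("ExampleCo", "G-100", "EPDM", "A")   # made-up part data
    other = Part("ExampleCo", "G-100", "PTFE", "A")       # different material
    kind = classify_replacement(installed, other, functionally_equivalent=True)
    print(kind.value, is_troubleshooting(kind))
```

The point of the strict equality check is that a single differing attribute, even with an equivalent function, moves the replacement out of troubleshooting.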

ITIL Incident Management

ITIL Incident Management (or similar models) is just troubleshooting and has all the same hallmarks and concerns.

Conclusion

Troubleshooting is an integral process. And like all processes, it should have a standard, be something people are trained on, and go through continuous improvement.