The Golden Start to a Deviation Investigation

How you respond in the first 24 hours after discovering a deviation can make the difference between a minor quality issue and a major compliance problem. This critical window, which I call “The Golden Day,” represents your best opportunity to capture accurate information, contain potential risks, and set the stage for a successful investigation. When managed effectively, this initial day creates the foundation for identifying true root causes and implementing effective corrective actions that protect product quality and patient safety.

Why the First 24 Hours Matter: The Evidence

The initial response to a deviation is crucial for both regulatory compliance and effective problem-solving. Industry practice and regulatory expectations align on the importance of quick, systematic responses to deviations.

Regulatory expectations explicitly state that deviation investigation and root cause determination should be completed in a timely manner, and industry expectations usually align on deviations being completed within 30 days of discovery.
In the landmark U.S. v. Barr Laboratories case, “the Court declared that all failure investigations must be performed promptly, within thirty business days of the problem’s occurrence.”
Best practices recommend assembling a cross-functional team immediately after deviation discovery and conducting an initial risk assessment within 24 hours.
Initial actions taken in the first day directly impact the quality and effectiveness of the entire investigation process.

When you capitalize on this golden window, you’re working with fresh memories, intact evidence, and the highest chance of observing actual conditions that contributed to the deviation.

Managing Events Systematically

Identifying the Problem: Clarity from the Start

Clear, precise problem definition forms the foundation of any effective investigation. Vague or incomplete problem statements lead to misdirected investigations and ultimately, inadequate corrective actions.

Document using specific, factual language that describes what occurred versus what was expected
Include all relevant details such as procedure and equipment numbers, product names and lot numbers
Apply the 5W2H method (What, When, Where, Who, Why if known, How much is involved, and How it was discovered)
Avoid speculation about causes in the initial description
Remember that the description should incorporate relevant records and photographs of discovered defects.

5W2H	Typical questions	Contains
Who?	Who are the people directly concerned with the problem? Who does this? Who should be involved but wasn’t? Was someone involved who shouldn’t be?	User IDs, Roles and Departments
What?	What happened?	Action, steps, description
When?	When did the problem occur?	Times, dates, place in process
Where?	Where did the problem occur?	Location
Why is it important?	Why did we do this? What are the requirements? What is the expected condition?	Justification, reason
How?	How did we discover it? Where in the process was it?	Method, process, procedure
How Many? How Much?	How many things are involved? How often did the situation happen? How much did it impact?	Number, frequency

The quality of your deviation documentation begins with this initial identification. As I’ve emphasized in previous posts, the investigation/deviation report should tell a story that can be easily understood by all parties well after the event and the investigation. This narrative begins with clear identification on day one.

Elements	Problem Statement
Is used to…	Understand and target a problem. Provide a scope. Evaluate any risks. Make objective decisions.
Answers the following… (5W2H)	What? (problem that occurred); When? (timing of what occurred); Where? (location of what occurred); Who? (persons involved/observers); Why? (why it matters, not why it occurred); How Much/Many? (volume or count); How Often? (first/only occurrence or multiple)
Contains…	Object (What was affected?); Defect (What went wrong?)
Provides direction for…	Escalation(s); Investigation
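
To make the 5W2H check concrete, here is a minimal sketch of a problem statement held as a small data structure that reports which elements are still blank. The class name, field names, and the `missing_elements` helper are hypothetical and not part of any particular quality system; the example values are borrowed from the bioreactor case study later in this post.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProblemStatement:
    """Hypothetical container for the 5W2H elements of a deviation description."""
    what: Optional[str] = None            # what happened versus what was expected
    when: Optional[str] = None            # date, time, and step in the process
    where: Optional[str] = None           # location of the event
    who: Optional[str] = None             # roles or user IDs of people involved or observing
    why_it_matters: Optional[str] = None  # requirement or expected condition, not the cause
    how_discovered: Optional[str] = None  # how the event was found
    how_many: Optional[str] = None        # count, volume, or frequency

    def missing_elements(self) -> List[str]:
        """Return the names of the 5W2H elements that are still blank."""
        return [
            f.name for f in fields(self)
            if not (getattr(self, f.name) or "").strip()
        ]


# A partially completed draft: only three of the seven elements are filled in.
draft = ProblemStatement(
    what="Feed rate programmed at 15 L/hr; SOP requires 1.5 L/hr",
    when="Night shift, 48 hours post-inoculation",
    where="2,000 L production bioreactor",
)
print(draft.missing_elements())
```

A check like this only confirms that each element has been addressed; it says nothing about whether the wording is specific and factual, which still needs a human reviewer.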

Going to the GEMBA: Being Where the Action Is

GEMBA, the actual place where work happens, is a cornerstone concept in quality management. When a deviation occurs, there is no substitute for being physically present at the location.

Observe the actual conditions and environment firsthand
Notice details that might not be captured in written reports
Understand the workflow and context surrounding the deviation
Gather physical evidence before it’s lost or conditions change
Create the opportunity for meaningful conversations with operators

Human error occurs because we are human beings. The extent of our knowledge, training, and skill has little to do with the mistakes we make. We tire, our minds wander and lose concentration, and we must navigate complex processes while satisfying competing goals and priorities – compliance, schedule adherence, efficiency, etc.

Foremost to understanding human performance is knowing that people do what makes sense to them given the available cues, tools, and focus of their attention at the time. Simply put, people come to work to do a good job – if it made sense for them to do what they did, it will make sense to others given similar conditions. The following factors significantly shape human performance and should be the focus of any human error investigation:

Physical Environment: environment, tools, procedures, process design	Organizational Culture: just- or blame-culture, attitude towards error
Management and Supervision: management of personnel, training, procedures	Stress Factors: personal, circumstantial, organizational

We do not want to see or experience human error – but when we do, it’s imperative to view it as a valuable opportunity to improve the system or process. This mindset is the heart of effective human error prevention.

Conducting an Effective GEMBA Walk for Deviations

When conducting your GEMBA walk specifically for deviation investigation:

Arrive with a clear purpose and structured approach
Observe before asking questions
Document observations with photos when appropriate
Look for environmental factors that might not appear in reports
Pay attention to equipment configuration and conditions
Note how operators interact with the process or equipment

A deviation gemba is a cross-functional team meeting that is assembled where a potential deviation event occurred. Going to the gemba and “freezing the scene” as close as possible to the time the event occurred will yield valuable clues about the environment that existed at the time – and fresher memories will provide higher quality interviews. This gemba has specific objectives:

Obtain a common understanding of the event: what happened, when and where it happened, who observed it, who was involved – all the facts surrounding the event. Is it a deviation?
Clearly describe actions taken, or that need to be taken, to contain impact from the event: product quarantine, physical or mechanical interventions, management or regulatory notifications, etc.
Interview involved operators: ask open-ended questions, like how the event unfolded or was discovered, from their perspective, or how the event could have been prevented, in their opinion – insights from personnel experienced with the process can prove invaluable during an investigation.

Deviation GEMBA Tips

Typically there is time between when notification of a deviation gemba goes out and when the team is scheduled to assemble. It is important to come prepared to help facilitate an efficient gemba:

Assemble procedures and other relevant documents and records. This will make references easier during the gemba.
Keep your team on-track – the gemba should end with the team having a common understanding of the event, actions taken to contain impact, and the agreed-upon next steps of the investigation.

You will gain plenty of investigational leads from your observations and interviews at the gemba – which documents to review, which personnel to interview, which equipment history to inspect, and more. The gemba is such an invaluable experience that, for many minor events, root cause and CAPA can be determined fairly easily from information gathered solely at the gemba.

Informal Rubric for Conducting a Good Deviation GEMBA

Describe the timeliness of the team gathering at the gemba.
Were all required roles and experts present?
Was someone leading or facilitating the gemba?
Describe any interviews the team performed during the gemba.
Did the team get sidetracked or off-topic during the gemba?
Was the team prepared with relevant documentation or information?
Did the team determine batch impact and any reportability requirements?
Did the team satisfy the objectives of the gemba?
What did the team do well?
What could the team improve upon?

Speaking with Operators: The Power of Cognitive Interviewing

Interviewing personnel who were present when the deviation occurred requires special techniques to elicit accurate, complete information. Traditional questioning often fails to capture critical details.

Cognitive interviewing, as I outlined in my previous post on “Interviewing,” was originally created for law enforcement and later adopted during accident investigations by the National Transportation Safety Board (NTSB). This approach is based on two key principles:

Witnesses need time and encouragement to recall information
Retrieval cues enhance memory recall

How to Apply Cognitive Interviewing in Deviation Investigations

Mental Reinstatement: Encourage the interviewee to mentally recreate the environment and people involved
In-Depth Reporting: Encourage the reporting of all details, even those that seem minor or not directly related
Multiple Perspectives: Ask the interviewee to recall the event from others’ points of view
Several Orders: Ask the interviewee to recount the timeline in different ways. Beginning to end, end to beginning

Most importantly, conduct these interviews at the actual location where the deviation occurred. Retrieval cues help unlock memory, which is why interviewing on the scene (at the Gemba) is so effective.

Component	What It Consists of
Mental Reinstatement	Encourage the interviewee to mentally recreate the environment and people involved.
In-Depth Reporting	Encourage the reporting of all the details.
Multiple Perspectives	Ask the interviewee to recall the event from others’ points of view.
Several Orders	Ask the interviewee to recount the timeline in different ways.

Approach the Interviewee Positively:
- Ask for the interview.
- State the purpose of the interview.
- Tell interviewee why he/she was selected.
- Avoid statements that imply blame.
- Focus on the need to capture knowledge.
- Answer questions about the interview.
- Acknowledge and respond to concerns.
- Manage negative emotions.
Apply these Four Components:
- Use mental reinstatement.
- Report everything.
- Change the perspective.
- Change the order.
Apply these Two Principles:
- Witnesses need time and encouragement to recall information.
- Retrieval cues enhance memory recall.
Demonstrate these Skills:
- Recreate the original context and have them walk you through the process.
- Tell the witness to actively generate information.
- Adopt the witness’s perspective.
- Listen actively, do not interrupt, and pause before asking follow-up questions.
- Ask open-ended questions.
- Encourage the witness to use imagery.
- Perform interview at the Gemba.
- Follow sequence of the four major components.
- Bring support materials.
- Establish a connection with the witness.
- Do not tell them how they made the mistake.

Initial Impact Assessment: Understanding the Scope

Within the first 24 hours, a preliminary impact assessment is essential for determining the scope of the deviation and the appropriate response.

Apply a risk-based approach to categorize the deviation as critical, major, or minor
Evaluate all potentially affected products, materials, or batches
Consider potential effects on critical quality attributes
Assess possible regulatory implications
Determine if released products may be affected

This impact assessment is also the initial risk assessment, which will help guide the level of effort put into the deviation.

Factors to Consider in Initial Risk Assessment

Patient safety implications
Product quality impact
Compliance with registered specifications
Potential for impact on other batches or products
Regulatory reporting requirements
Level of investigation required

This initial assessment will guide subsequent decisions about quarantine, notification requirements, and the depth of investigation needed. Remember, this is a preliminary assessment that will be refined as the investigation progresses.
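
As a rough illustration of how the factors above might feed a critical/major/minor categorization, here is a minimal sketch. The decision rules are assumptions made purely for the example; your own procedure defines the real criteria, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InitialAssessment:
    """Hypothetical yes/no answers gathered during the first 24 hours."""
    patient_safety_risk: bool        # could the event plausibly affect patient safety?
    outside_registered_spec: bool    # is a registered specification not met?
    released_product_possible: bool  # could already-released product be affected?
    product_quality_impact: bool     # could a critical quality attribute be affected?
    other_batches_possible: bool     # could other batches or products be involved?


def categorize(a: InitialAssessment) -> str:
    """Return a preliminary category using illustrative (assumed) rules."""
    if a.patient_safety_risk or a.outside_registered_spec or a.released_product_possible:
        return "critical"
    if a.product_quality_impact or a.other_batches_possible:
        return "major"
    return "minor"


# Example: a quality attribute may be affected, but nothing has been released.
preliminary = categorize(InitialAssessment(
    patient_safety_risk=False,
    outside_registered_spec=False,
    released_product_possible=False,
    product_quality_impact=True,
    other_batches_possible=False,
))
print(preliminary)
```

Because this is a preliminary assessment, the answers (and therefore the category) should be revisited as the investigation produces better information.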

Immediate Actions: Containing the Issue

Once you’ve identified the deviation and assessed its potential impact, immediate actions must be taken to contain the issue and prevent further risk.

Quarantine potentially affected products or materials to prevent their release or further use
Notify key stakeholders, including quality assurance, production supervision, and relevant department heads
Implement temporary corrective or containment measures
Document the deviation in your quality management system
Secure relevant evidence and documentation
Consider whether to stop related processes

Industry best practices emphasize that you should report the deviation in real time, notify QA within 24 hours, and hold the GEMBA. Remember that “if you don’t document it, it didn’t happen” – thorough documentation of both the deviation and your immediate response is essential.

Affected vs Related Batches

Not every impact is the same, so it can be helpful to have two concepts: Affected and Related.

Affected Batch: Product directly impacted by the event at the time of discovery, for instance, the batch being manufactured or tested when the deviation occurred.
Related Batch: Product manufactured or tested under the same conditions or parameters using the process in which the deviation occurred and determined as part of the deviation investigation process to have no impact on product quality.

Setting Up for a Successful Full Investigation

The final step in the golden day is establishing the foundation for the comprehensive investigation that will follow.

Assemble a cross-functional investigation team with relevant expertise
Define clear roles and responsibilities for team members
Establish a timeline for the investigation (remembering the 30-day guideline)
Identify additional data or evidence that needs to be collected
Plan for any necessary testing or analysis
Schedule follow-up interviews or observations

In my post on handling deviations, I emphasized that you must perform a time-sensitive and thorough investigation within 30 days. The groundwork laid during the golden day will make this timeline achievable while maintaining investigation quality.
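
The two clocks discussed in this post, the 24-hour golden day and the 30-day investigation window, can be sketched as simple date arithmetic. This example assumes calendar days counted from the moment of discovery; if your procedure counts business days (as in the Barr quote) or starts the clock elsewhere, the calculation changes. The function name and the sample timestamp are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict


def investigation_milestones(
    discovered_at: datetime,
    golden_day_hours: int = 24,
    closure_days: int = 30,
) -> Dict[str, datetime]:
    """Compute the end of the golden day and the investigation due date.

    Assumes calendar days counted from discovery; adjust to your own procedure.
    """
    return {
        "golden_day_ends": discovered_at + timedelta(hours=golden_day_hours),
        "investigation_due": discovered_at + timedelta(days=closure_days),
    }


# Example: a deviation discovered late on 3 March 2025.
milestones = investigation_milestones(datetime(2025, 3, 3, 22, 15))
for name, when in milestones.items():
    print(f"{name}: {when:%Y-%m-%d %H:%M}")
```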

Planning for Root Cause Analysis

During this setup phase, you should also begin planning which root cause analysis tools might be most appropriate for your investigation. Select tools based on the event complexity and the number of potential root causes. When “human error” appears to be involved, prepare to dig deeper, as this is rarely the true root cause.

Identifying Phase of your Investigation

If	Then you are at
The problem is not understood. Boundaries have not been set. There could be more than one problem	Problem Understanding
Data needs to be collected. There are questions about frequency or occurrence. You have not had interviews	Data Collection
Data has been collected but not analyzed	Data Analysis
The root cause needs to be determined from the analyzed data	Identify Root Cause
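
The table above reads naturally as an ordered series of checks. Here is a minimal sketch of that logic, assuming the phases are evaluated in the order listed and that each condition can be reduced to a yes/no answer; the function and parameter names are hypothetical.

```python
def investigation_phase(
    problem_understood: bool,
    data_collected: bool,
    data_analyzed: bool,
) -> str:
    """Return the current investigation phase, checking conditions in table order."""
    if not problem_understood:
        # Boundaries not set, or there may be more than one problem.
        return "Problem Understanding"
    if not data_collected:
        # Open questions about frequency or occurrence, or interviews not yet held.
        return "Data Collection"
    if not data_analyzed:
        return "Data Analysis"
    # Data is analyzed; the remaining work is determining the root cause.
    return "Identify Root Cause"


# Example: the problem is scoped, but interviews have not been held yet.
print(investigation_phase(problem_understood=True, data_collected=False, data_analyzed=False))
```

Knowing the phase tells you which row group of the tools chart to reach for.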

Root Cause Analysis Tools Chart

Purpose	Tool	Description
Problem Understanding	Process Map	A picture of the separate steps of a process in sequential order, including: materials or services entering or leaving the process (inputs and outputs), decisions that must be made, people who become involved, time involved at each step, and/or process measurements.
	Critical Incident Technique (CIT)	A process used for collecting direct observations of human behavior that have critical significance, and meet methodically defined criteria.
	Comparative Analysis	A technique that focuses a problem-solving team on a problem. It compares one or more elements of a problem or process to evaluate elements that are similar or different (e.g. comparing a standard process to a failing process).
	Performance Matrix	A tool that describes the participation by various roles in completing tasks or deliverables for a project or business process. Note: It is especially useful in clarifying roles and responsibilities in cross-functional/departmental positions.
	5W2H Analysis	An approach that defines a problem and its underlying contributing factors by systematically asking questions related to who, what, when, where, why, how, and how much/often.
Data Collection	Surveys	A technique for gathering data from a targeted audience based on a standard set of criteria.
	Check Sheets	A technique to compile data or observations to detect and show trends/patterns.
	Cognitive Interview	An interview technique used by investigators to help the interviewee recall specific memories from a specific event.
	KNOT Chart	A data collection and classification tool to organize data based on what is Known, Need to know, Opinion, and Think we know.
Data Analysis	Pareto Chart	A technique that focuses efforts on problems offering the greatest potential for improvement.
	Histogram	A tool that summarizes data collected over a period of time, and graphically presents frequency distribution.
	Scatter Chart	A tool to study possible relationships between changes in two different sets of variables.
	Run Chart	A tool that captures study data for trends/patterns over time.
	Affinity Diagram	A technique for brainstorming and summarizing ideas into natural groupings to understand a problem.
Root Cause Analysis	Interrelationship Digraphs	A tool to identify, analyze, and classify cause and effect relationships among issues so that drivers become part of an effective solution.
	Why-Why	A technique that allows one to explore the cause-and-effect relationships of a particular problem by asking why; drilling down through the underlying contributing causes to identify root cause.
	Is/Is Not	A technique that guides the search for causes of a problem by isolating the who, what, when, where, and how of an event. It narrows the investigation to factors that have an impact and eliminates factors that do not have an impact. By comparing what the problem is with what the problem is not, we can see what is distinctive about a problem which leads to possible causes.
	Structured Brainstorming	A technique to identify, explore, and display the factors within each root cause category that may be affecting the problem/issue, and/or effect being studied through this structured idea-generating tool.
	Cause and Effect Diagram (Ishikawa/Fishbone)	A tool to display potential causes of an event based on root cause categories defined by structured brainstorming using this tool as a visual aid.
	Causal Factor Charting	A tool to analyze human factors and behaviors that contribute to errors, and identify behavior-influencing factors and gaps.
Other Tools	Prioritization Matrix	A tool to systematically compare choices through applying and weighting criteria.
	Control Chart	A tool to monitor process performance over time by studying its variation and source.
	Process Capability	A tool to determine whether a process is capable of meeting requirements or specifications.

Making the Most of Your Golden Day

The first 24 hours after discovering a deviation represent a unique opportunity that should not be wasted. By following the structured approach outlined in this post (identifying the problem clearly, going to the GEMBA, interviewing operators using cognitive techniques, conducting an initial impact assessment, taking immediate containment actions, and setting up for the full investigation), you maximize the value of this golden day.

Remember that excellent deviation management is directly linked to product quality, patient safety, and regulatory compliance. Each well-managed deviation is an opportunity to strengthen your quality system.

I encourage you to assess your current approach to the first 24 hours of deviation management. Are you capturing the full value of this golden day, or are you letting critical information slip away? Implement these strategies, train your team on proper deviation triage, and transform your deviation response from reactive to proactive.

Your deviation management effectiveness doesn’t begin when the investigation report is initiated; it begins the moment a deviation is discovered. Make that golden day count.

You Gotta Have Heart: Combating Human Error

The persistent attribution of human error as a root cause of deviations reveals far more about systemic weaknesses than individual failings. The label often masks deeper organizational, procedural, and cultural flaws. Like cracks in a foundation, recurring human errors signal where quality management systems (QMS) fail to account for the complexities of human cognition, communication, and operational realities.

The Myth of Human Error as a Root Cause

Regulatory agencies increasingly reject “human error” as an acceptable conclusion in deviation investigations. This shift recognizes that human actions occur within a web of systemic influences. A technician’s missed documentation step or a formulation error rarely stems from carelessness alone but emerges from:

Procedural complexity: Overly complicated standard operating procedures (SOPs) that exceed working memory capacity
Cognitive overload: High-stress environments where operators juggle competing priorities
Latent system flaws: Poor equipment design, inadequate training reinforcement, or misaligned incentives

The aviation industry’s “Tower of Babel” problem—where siloed teams develop isolated communication loops—parallels pharmaceutical manufacturing. The Quality Unit may prioritize regulatory compliance, while production focuses on throughput, creating disjointed interpretations of “quality.” These disconnects manifest as errors when cross-functional risks go unaddressed.

Cognitive Architecture and Error Propagation

Human cognition operates under predictable constraints. Attentional biases, memory limitations, and heuristic decision-making—while evolutionarily advantageous—create vulnerabilities in GMP environments. For example:

Attentional tunneling: An operator hyper-focused on solving an equipment jam may overlook a temperature excursion alert.
Procedural drift: Subtle deviations from written protocols accumulate over time as workers optimize for perceived efficiency.
Complacency cycles: Over-familiarity with routine tasks reduces vigilance, particularly during night shifts or prolonged operations.

These cognitive patterns aren’t failures but features of human neurobiology. Effective QMS design anticipates them through:

Error-proofing: Automated checkpoints that detect deviations before critical process stages
Cognitive load management: Procedures (including batch records) tailored to cognitive load principles with decision-support prompts
Resilience engineering: Simulations that train teams to recognize and recover from near-misses

Strategies for Reframing Human Error Analysis

Conduct Cognitive Autopsies

Move beyond 5-Whys to adopt human factors analysis frameworks:

Human Error Assessment and Reduction Technique (HEART): Quantifies the likelihood of specific error types based on task characteristics
Critical Action and Decision (CAD) timelines: Maps decision points where system defenses failed

For example, a labeling mix-up might reveal:

Task factors: Nearly identical packaging for two products (29% contribution to error likelihood)
Environmental factors: Poor lighting in labeling area (18%)
Organizational factors: Inadequate change control when adding new SKUs (53%)

Redesign for Intuitive Use

Redesigning for intuitive use requires multilayered approaches based on an understanding of how human brains actually work. At the foundation lies procedural chunking, an evidence-based method that restructures complex standard operating procedures (SOPs) into digestible cognitive units aligned with working memory limitations. This approach segments multiphase processes like aseptic filling into discrete verification checkpoints, reducing cognitive overload while maintaining procedural integrity through sequenced validation gates. By mirroring the brain’s natural pattern recognition capabilities, chunked protocols demonstrate significantly higher compliance rates compared to traditional monolithic SOP formats.

Complementing this cognitive scaffolding, mistake-proof redesigns create inherent error detection mechanisms.

To sustain these engineered safeguards, progressive facilities implement peer-to-peer audit protocols during critical operations and transition periods.

Leverage Error Data Analytics

The integration of data analytics into organizational processes has emerged as a critical strategy for minimizing human error, enhancing accuracy, and driving informed decision-making. By leveraging advanced computational techniques, automation, and machine learning, data analytics addresses systemic vulnerabilities.
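
Analytics does not have to start with machine learning. As a minimal sketch, a Pareto-style tally of cause categories across closed deviations already shows where effort is best spent. The category labels and counts below are made-up illustrative data, and the `pareto_table` helper is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def pareto_table(categories: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Return (category, count, cumulative percent) rows, largest count first."""
    counts = Counter(categories).most_common()
    total = sum(n for _, n in counts)
    rows = []
    running = 0
    for category, n in counts:
        running += n
        rows.append((category, n, round(100 * running / total, 1)))
    return rows


# Made-up cause categories assigned to ten closed deviations (illustration only).
closed_deviations = [
    "procedure", "procedure", "equipment", "procedure", "training",
    "equipment", "procedure", "labeling", "procedure", "equipment",
]
for row in pareto_table(closed_deviations):
    print(row)
```

In this invented data set, two categories account for 80% of the events, which is the kind of signal that tells you where a system-level fix will pay off most.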

Human Error Assessment and Reduction Technique (HEART): A Systematic Framework for Error Mitigation

Benefits of the Human Error Assessment and Reduction Technique (HEART)

1. Simplicity and Speed: HEART is designed to be straightforward and does not require complex tools, software, or large datasets. This makes it accessible to organizations without extensive human factors expertise and allows for rapid assessments. The method is easy to understand and apply, even in time-constrained or resource-limited environments.

2. Flexibility and Broad Applicability: HEART can be used across a wide range of industries—including nuclear, healthcare, aviation, rail, process industries, and engineering—due to its generic task classification and adaptability to different operational contexts. It is suitable for both routine and complex tasks.

3. Systematic Identification of Error Influences: The technique systematically identifies and quantifies Error Producing Conditions (EPCs) that increase the likelihood of human error. This structured approach helps organizations recognize the specific factors—such as time pressure, distractions, or poor procedures—that most affect reliability.

4. Quantitative Error Prediction: HEART provides a numerical estimate of human error probability for specific tasks, which can be incorporated into broader risk assessments, safety cases, or design reviews. This quantification supports evidence-based decision-making and prioritization of interventions.

5. Actionable Risk Reduction: By highlighting which EPCs most contribute to error, HEART offers direct guidance on where to focus improvement efforts—whether through engineering redesign, training, procedural changes, or automation. This can lead to reduced error rates, improved safety, fewer incidents, and increased productivity.

6. Supports Accident Investigation and Design: HEART is not only a predictive tool but also valuable in investigating incidents and guiding the design of safer systems and procedures. It helps clarify how and why errors occurred, supporting root cause analysis and preventive action planning.

7. Encourages Safety and Quality Culture and Awareness: Regular use of HEART increases awareness of human error risks and the importance of control measures among staff and management, fostering a proactive culture.

When Is HEART Best Used?

Risk Assessment for Critical Tasks: When evaluating tasks where human error could have severe consequences (e.g., operating nuclear control systems, administering medication, critical maintenance), HEART helps quantify and reduce those risks.
Design and Review of Procedures: During the design or revision of operational procedures, HEART can identify steps most vulnerable to error and suggest targeted improvements.
Incident Investigation: After a failure or near-miss, HEART helps reconstruct the event, identify contributing EPCs, and recommend changes to prevent recurrence.
Training and Competence Assessment: HEART can inform training programs by highlighting the conditions and tasks where errors are most likely, allowing for focused skill development and awareness.
Resource-Limited or Fast-Paced Environments: Its simplicity and speed make HEART ideal for organizations needing quick, reliable human error assessments without extensive resources or data.

Generic Task Types (GTTs): Establishing Baselines

HEART classifies human activities into nine Generic Task Types (GTT) with predefined nominal human error probabilities (NHEPs) derived from decades of industrial incident data:

GTT Code	Task Description	Nominal HEP Range
A	Complex, novel tasks requiring problem-solving	0.55 (0.35–0.97)
B	Shifting attention between multiple systems	0.26 (0.14–0.42)
C	High-skill tasks under time constraints	0.16 (0.12–0.28)
D	Rule-based diagnostics under stress	0.09 (0.06–0.13)
E	Routine procedural tasks	0.02 (0.007–0.045)
F	Restoring system states	0.003 (0.0008–0.007)
G	Highly practiced routine operations	0.0004 (0.00008–0.009)
H	Supervised automated actions	0.00002 (0.000006–0.0009)
M	Miscellaneous/undefined tasks	0.03 (0.008–0.11)

Comprehensive Taxonomy of Error-Producing Conditions (EPCs)

HEART’s 38 Error Producing Conditions represent contextual amplifiers of error probability, categorized under the 4M Framework (Man, Machine, Media, Management):

EPC Code	Description	Max Effect	4M Category
1	Unfamiliarity with task	17×	Man
2	Time shortage	11×	Management
3	Low signal-to-noise ratio	10×	Machine
4	Override capability of safety features	9×	Machine
5	Spatial/functional incompatibility	8×	Machine
6	Model mismatch between mental and system states	8×	Man
7	Irreversible actions	8×	Machine
8	Channel overload (information density)	6×	Media
9	Technique unlearning	6×	Man
10	Inadequate knowledge transfer	5.5×	Management
11	Performance ambiguity	5×	Media
12	Misperception of risk	4×	Man
13	Poor feedback systems	4×	Machine
14	Delayed/incomplete feedback	4×	Media
15	Operator inexperience	3×	Man
16	Impoverished information quality	3×	Media
17	Inadequate checking procedures	3×	Management
18	Conflicting objectives	2.5×	Management
19	Lack of information diversity	2.5×	Media
20	Educational/training mismatch	2×	Management
21	Dangerous incentives	2×	Management
22	Lack of skill practice	1.8×	Man
23	Unreliable instrumentation	1.6×	Machine
24	Need for absolute judgments	1.6×	Man
25	Unclear functional allocation	1.6×	Management
26	No progress tracking	1.4×	Media
27	Physical capability mismatches	1.4×	Man
28	Low semantic meaning of information	1.4×	Media
29	Emotional stress	1.3×	Man
30	Ill-health	1.2×	Man
31	Low workforce morale	1.2×	Management
32	Inconsistent interface design	1.15×	Machine
33	Poor environmental conditions	1.1×	Media
34	Low mental workload	1.1×	Man
35	Circadian rhythm disruption	1.06×	Man
36	External task pacing	1.03×	Management
37	Supernumerary staffing issues	1.03×	Management
38	Age-related capability decline	1.02×	Man

HEP Calculation Methodology

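The HEART equation incorporates both multiplicative and additive effects of EPCs:

$$\mathrm{HEP} = \mathrm{NHEP} \times \prod_{i} \left[ (\mathrm{EPC}_i - 1) \times \mathrm{APOE}_i + 1 \right]$$

Where: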

NHEP: Nominal Human Error Probability from GTT
EPC_i: Maximum effect of i-th EPC
APOE_i: Assessed Proportion of Effect (0–1)
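
As a quick illustration of how a single term behaves, consider one EPC on its own. The numbers are hypothetical (a nominal HEP of 0.003, and EPC 1, unfamiliarity, assessed at 40% of its 17× maximum) and are chosen only to show the arithmetic. The sketch assumes the standard HEART form, in which each EPC contributes a factor of (EPC_i − 1) × APOE_i + 1.

```latex
\begin{align*}
\text{Assessed effect} &= (\text{EPC}_1 - 1) \times \text{APOE}_1 + 1 = (17 - 1) \times 0.4 + 1 = 7.4 \\
\text{HEP} &= \text{NHEP} \times 7.4 = 0.003 \times 7.4 = 0.0222
\end{align*}
```

An APOE of 0 gives a factor of 1 and leaves the nominal probability unchanged, while an APOE of 1 applies the full 17× maximum.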

HEART Case Study: Operator Error During Biologics Drug Substance Manufacturing

A biotech facility was producing a monoclonal antibody (mAb) drug substance using mammalian cell culture in large-scale bioreactors. The process involved upstream cell culture (expansion and production), followed by downstream purification (protein A chromatography, filtration), and final bulk drug substance filling. The manufacturing process required strict adherence to parameters such as temperature, pH, and feed rates to ensure product quality, safety, and potency.

During a late-night shift, an operator was responsible for initiating a nutrient feed into a 2,000L production bioreactor. The standard operating procedure (SOP) required the feed to be started at 48 hours post-inoculation, with a precise flow rate of 1.5 L/hr for 12 hours. The operator, under time pressure and after a recent shift change, incorrectly programmed the feed rate as 15 L/hr rather than 1.5 L/hr.

Outcome:

The rapid addition of nutrients caused a metabolic imbalance, leading to excessive cell growth, increased waste metabolite (lactate/ammonia) accumulation, and a sharp drop in product titer and purity.
The batch failed to meet quality specifications for potency and purity, resulting in the loss of an entire production lot.
Investigation revealed no system alarms for the high feed rate, and the error was only detected during routine in-process testing several hours later.

HEART Analysis

Task Definition

Task: Programming and initiating nutrient feed in a GMP biologics manufacturing bioreactor.
Criticality: Direct impact on cell culture health, product yield, and batch quality.

Generic Task Type (GTT)

GTT Code	Description	Nominal HEP
E	Routine procedural task with checking	0.02

Error-Producing Conditions (EPCs) Using the 5M Model

5M Category	EPC (HEART)	Max Effect	APOE	Example in Incident
Man	Inexperience with new feed system (EPC15)	3×	0.8	Operator recently trained on upgraded control interface
Machine	Poor feedback (no alarm for high feed rate, EPC13)	4×	0.7	System did not alert on out-of-range input
Media	Ambiguous SOP wording (EPC11)	5×	0.5	SOP listed feed rate as “1.5L/hr” in a table, not text
Management	Time pressure to meet batch deadlines (EPC2)	11×	0.6	Shift was behind schedule due to earlier equipment delay
Milieu	Distraction during shift change (EPC36)	1.03×	0.9	Handover occurred mid-setup, leading to divided attention

Human Error Probability (HEP) Calculation

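HEP = 0.02 × 2.6 × 3.1 × 3.0 × 7.0 × 1.027 ≈ 3.5 (350%)
Because a probability cannot exceed 1, this result is read as 1.0: under these conditions an error was effectively certain. This extremely high error probability highlights a systemic vulnerability, not just an individual lapse.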
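
The arithmetic behind this figure can be reproduced from the table above. The sketch below assumes the standard HEART form, in which each EPC contributes a factor of (max effect − 1) × APOE + 1. The function and variable names are illustrative, not part of any HEART tool, and limiting the result to 1.0 is an added assumption, since a probability cannot exceed 1.

```python
from math import prod


def assessed_effect(max_effect: float, apoe: float) -> float:
    """Factor contributed by one EPC: (max effect - 1) * APOE + 1."""
    return (max_effect - 1.0) * apoe + 1.0


def heart_hep(nominal_hep: float, epcs) -> float:
    """Nominal HEP multiplied by the assessed effect of every EPC."""
    return nominal_hep * prod(assessed_effect(m, a) for m, a in epcs)


# (max effect, APOE) pairs taken from the case-study table.
CASE_EPCS = {
    "EPC15 inexperience": (3.0, 0.8),
    "EPC13 poor feedback": (4.0, 0.7),
    "EPC11 ambiguous SOP": (5.0, 0.5),
    "EPC2 time pressure": (11.0, 0.6),
    "EPC36 handover distraction": (1.03, 0.9),
}

raw_hep = heart_hep(0.02, CASE_EPCS.values())  # about 3.48
capped_hep = min(raw_hep, 1.0)                 # assumption: limit to 1.0

# Largest factor first, to show which condition dominates.
for name, (m, a) in sorted(CASE_EPCS.items(),
                           key=lambda kv: -assessed_effect(*kv[1])):
    print(f"{name}: x{assessed_effect(m, a):.3f}")
print(f"raw HEP = {raw_hep:.2f}, capped HEP = {capped_hep:.2f}")
```

Listing the factors separately makes it clear that time pressure (×7.0) contributes more than any other condition, which is useful when prioritizing corrective actions.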

Root Cause and Contributing Factors

Operator: Recently trained, unfamiliar with new interface (Man)
System: No feedback or alarm for out-of-spec feed rate (Machine)
SOP: Ambiguous presentation of critical parameter (Media)
Management: High pressure to recover lost time (Management)
Environment: Shift handover mid-task, causing distraction (Milieu)

Corrective Actions

Technical Controls

Automated Range Checks: Bioreactor control software now prevents entry of feed rates outside validated ranges and requires supervisor override for exceptions.
Visual SOP Enhancements: Critical parameters are now highlighted in both text and tables, and reviewed during operator training.
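
An automated range check of this kind can be sketched in a few lines. Everything here is hypothetical: the 1.0 to 2.0 L/hr limits, the function and class names, and the way an override is represented are assumptions for illustration, not the facility's actual control software.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValidatedRange:
    """Hypothetical validated limits for a setpoint, in L/hr."""
    low: float
    high: float


class OutOfRangeError(ValueError):
    """Raised when a setpoint is outside its validated range with no override."""


def accept_feed_rate(rate_l_per_hr: float,
                     limits: ValidatedRange,
                     supervisor_id: Optional[str] = None) -> float:
    """Return the rate if it may be used; otherwise raise OutOfRangeError.

    An out-of-range value is accepted only when a supervisor ID is supplied,
    mirroring the "supervisor override for exceptions" rule.
    """
    in_range = limits.low <= rate_l_per_hr <= limits.high
    if in_range or supervisor_id:
        return rate_l_per_hr
    raise OutOfRangeError(
        f"{rate_l_per_hr} L/hr is outside {limits.low}-{limits.high} L/hr"
    )


# Assumed limits around the 1.5 L/hr SOP setpoint (illustrative only).
FEED_LIMITS = ValidatedRange(low=1.0, high=2.0)
```

With these assumed limits, the 15 L/hr entry from the incident would be rejected at the point of entry instead of being discovered hours later through in-process testing.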

Human Factors & Training

Simulation-Based Training: Operators practice feed setup in a virtual environment simulating distractions and time pressure.
Shift Handover Protocol: Critical steps cannot be performed during handover periods; tasks must be paused or completed before/after shift changes.

Management & Environmental Controls

Production Scheduling: Buffer time added to schedules to reduce time pressure during critical steps.
Alarm System Upgrade: Real-time alerts for any parameter entry outside validated ranges.

Outcomes (6-Month Review)

Metric	Pre-Intervention	Post-Intervention
Feed rate programming errors	4/year	0/year
Batch failures (due to feed)	2/year	0/year
Operator confidence (survey)	62/100	91/100

Lessons Learned

Systemic Safeguards: Reliance on operator vigilance alone is insufficient in complex biologics manufacturing; layered controls are essential.
Human Factors: Addressing EPCs across the 5M model—Man, Machine, Media, Management, Milieu—dramatically reduces error probability.
Continuous Improvement: Regular review of near-misses and operator feedback is crucial for maintaining process robustness in biologics manufacturing.

This case underscores how a HEART-based approach, tailored to biologics drug substance manufacturing, can identify and mitigate multi-factorial risks before they result in costly failures.

Causal Factor

A causal factor is a significant contributor to an incident, event, or problem that, if eliminated or addressed, would have prevented the occurrence or reduced its severity or frequency. Here are the key points to understand about causal factors:

Definition: A causal factor is a major unplanned, unintended contributor to an incident (a negative event or undesirable condition) that, if eliminated, would have either prevented the occurrence of the incident or reduced its severity or frequency.
Distinction from root cause: While a causal factor contributes to an incident, it is not necessarily the primary driver. The root cause, on the other hand, is the fundamental reason for the occurrence of a problem or event. (Pay attention to the deficiencies of the model)
Multiple contributors: An incident may have multiple causal factors, and eliminating one causal factor might not prevent the incident entirely but could reduce its likelihood or impact, as the Swiss cheese model illustrates.
Identification methods: Causal factors can be identified through various techniques, including root cause analysis (with tools such as fishbone (Ishikawa) diagrams or the Why-Why technique), Causal Learning Cycle (CLC) analysis, and causal factor charting.
Importance in problem-solving: Identifying causal factors is crucial for developing effective preventive measures and improving safety, quality, and efficiency.
Characteristics: Causal factors must be mistakes, errors, or failures that directly lead to an incident or fail to mitigate its consequences. They should not contain other causal factors within them.
Distinction from root causes: It’s important to note that root causes are not causal factors but rather lead to causal factors. Examples of root causes often mistaken for causal factors include inadequate procedures, improper training, or poor work culture.

Human Factors are not always Causal Factors, but can be!

Human factor and human error are related concepts but are not the same. A human error is always a causal factor, and the human factor explains why human errors can happen.

Human Error

Human error refers to an unintentional action or decision that fails to achieve the intended outcome. It encompasses mistakes, slips, lapses, and violations that can lead to accidents or incidents. There are two types:

Unintentional Errors include slips (attentional failures) and lapses (memory failures) caused by distractions, interruptions, fatigue, or stress.
Intentional Errors are violations in which an individual knowingly deviates from safe practices, procedures, or regulations. They are often categorized into routine, situational, or exceptional violations.

Human Factors

Human factors is a broader field that studies how humans interact with various system elements, including tools, machines, environments, and processes. It aims to optimize human well-being and overall system performance by understanding human capabilities, limitations, behaviors, and characteristics.

Physical Ergonomics focuses on human anatomical, anthropometric, physiological, and biomechanical characteristics.
Cognitive Ergonomics deals with mental processes such as perception, memory, reasoning, and motor response.
Organizational Ergonomics involves optimizing organizational structures, policies, and processes to improve overall system performance and worker well-being.

Relationship Between Human Factors and Human Error

Causal Relationship: Human factors delve into the underlying reasons why human errors occur. They consider the conditions and systems that contribute to errors, such as poor design, inadequate training, high workload, and environmental factors.
Error Prevention: By addressing human factors, organizations can design systems and processes that minimize the likelihood of human errors. This includes implementing error-proofing solutions, improving ergonomics, and enhancing training and supervision.

Key Differences

Focus:
- Human Error: Focuses on the outcome of an action or decision that fails to achieve the intended result.
- Human Factors: Focuses on the broader context and conditions that influence human performance and behavior.
Approach:
- Human Error: Often addressed through training, disciplinary actions, and procedural changes.
- Human Factors: Involves a multidisciplinary approach to design systems, environments, and processes that support optimal human performance and reduce the risk of errors.

Peer Checking

Peer checking is a technique where two individuals work together to prevent errors before and during a specific action or task. Here are the key points about peer checking:

It involves a performer (the person doing the task) and a peer checker (someone familiar with the task who observes the performer).
The purpose is to prevent errors by the performer by having a second set of eyes verify the correct action is being taken.
The performer and peer checker first agree on the intended action and component. Then, the performer performs the action while the peer observes to confirm it was done correctly.
It augments self-checking by the performer but does not replace self-checking. Both individuals self-check in parallel.
The peer checker provides a fresh perspective that is not trapped in the performer’s task mindset, allowing them to potentially identify hazards or consequences the performer may miss.
It is recommended for critical, irreversible steps or error-likely situations where an extra verification can prevent mistakes.
Peer checking should be used judiciously and not mandated for all actions, as overuse can make it become a mechanical process that loses effectiveness.
It can also be used to evaluate potential fatigue or stress in a co-worker before starting a task.

Personally, I think we overcheck, and the whole process loses effectiveness. A big part of automation and computerized systems like an MES is removing the need for peer checking. But frankly, I’m pretty sure it will never go away.

Expert Intuition and Risk Management

Saturday Morning Breakfast Cereal source http://smbc-comics.com/comic/horrible

Risk management is a crucial aspect of any organization or project. However, it is often subject to human errors in subjective risk judgments. This is because most risk assessment methods rely on subjective inputs from experts. Without certain precautions, experts can make consistent errors in judgment about uncertainty and risk.

There are methods that can correct the systemic errors that people make, but very few organizations implement them. As a result, there is often an almost universal understatement of risk. We need to keep in mind a few rules about experience and expertise.

Experience is a nonrandom, nonscientific sample of events throughout our lifetime.
Experience is memory-based, and we are very selective regarding what we choose to remember.
What we conclude from our experience can be full of logical errors.
Unless we get reliable feedback on past decisions, there is no reason to believe our experience will tell us much.

No matter how much experience we accumulate, we seem to be very inconsistent in its application.

Experts have unconscious heuristics and biases that impact their judgment. Some important ones include:

Misconceptions of chance: If you flip a coin six times, which result is more likely (H = heads, T = tails): HHHTTT or HTHTTH? Both are equally likely, but many people assume that because the first series looks “less random” than the second, it must be less likely. This is an example of representativeness bias. We appear to judge odds based on what we assume to be representative scenarios. Human beings easily confuse patterns and randomness.
The conjunction fallacy: We often see specific events as more likely than broader categories of events.
Irrational belief in small samples
Disregarding variance in small samples: Small samples have more random variance than large samples, and this is considered less than it should be.
Insensitivity to prior probabilities: People tend to ignore the past and focus on new information when making subjective estimates.
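
The coin-flip claim in the first item can be checked by enumerating every possible sequence of six flips. This is a small sketch with an illustrative function name. It counts outcomes exactly and assumes a fair coin.

```python
from fractions import Fraction
from itertools import product


def sequence_probability(seq: str) -> Fraction:
    """Exact probability of one specific H/T sequence from a fair coin."""
    outcomes = ["".join(p) for p in product("HT", repeat=len(seq))]
    return Fraction(outcomes.count(seq), len(outcomes))


p_ordered = sequence_probability("HHHTTT")  # 1/64
p_mixed = sequence_probability("HTHTTH")    # 1/64

# What feels "more likely" is the broad category, not the exact sequence:
# any order of three heads and three tails covers 20 of the 64 outcomes.
three_heads = sum(1 for p in product("HT", repeat=6) if p.count("H") == 3)
p_category = Fraction(three_heads, 2 ** 6)  # 5/16
```

Each exact sequence has probability 1/64. The intuition that HTHTTH is more likely comes from confusing it with the much larger category of mixed-looking results.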

All of these feed expert overconfidence, which leads to consistently underestimating risks.

What are some ways to overcome this? I recommend the following be built into your risk management system.

Pretend you are in the future looking back at failure. Start with the assumption that a major disaster did happen and describe how it happened.
Look to risks from others. Gather a list of related failures, for example, regulatory agency observations, and think of risks in relation to those.
Include Everyone. Your organization has numerous experts on all sorts of specific risks. Make the effort to survey representatives of just about every job level.
Do peer reviews. Check assumptions by showing them to peers who are not immersed in the assessment.
Implement metrics for performance. The Brier score is a way to evaluate the result of predictions both by how often the team was right and by the probability they estimated for getting a correct answer.
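
The Brier score from the last point is the mean squared difference between each forecast probability and what actually happened (1 if the event occurred, 0 if it did not). Lower is better, and a perfect forecaster scores 0. A minimal sketch follows; the function name and the forecasts are made up for illustration.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical risk forecasts: probability that each risk would materialize,
# and whether it did (1) or did not (0).
confident_team = brier_score([0.9, 0.8, 0.3], [1, 1, 0])
hedging_team = brier_score([0.5, 0.5, 0.5], [1, 1, 0])  # always says 50/50
```

A team that always answers 50% scores 0.25 regardless of the outcomes, so that value is a useful baseline: assessments that score worse than 0.25 are adding noise, not information.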