The Golden Start to a Deviation Investigation

How you respond in the first 24 hours after discovering a deviation can make the difference between a minor quality issue and a major compliance problem. This critical window-what I call “The Golden Day”-represents your best opportunity to capture accurate information, contain potential risks, and set the stage for a successful investigation. When managed effectively, this initial day creates the foundation for identifying true root causes and implementing effective corrective actions that protect product quality and patient safety.

Why the First 24 Hours Matter: The Evidence

The initial response to a deviation is crucial for both regulatory compliance and effective problem-solving. Industry practice and regulatory expectations align on the importance of quick, systematic responses to deviations.

  • Regulatory expectations explicitly state that deviation investigation and root cause determination should be completed in a timely manner, and industry expectations usually align on deviations being completed within 30 days of discovery.
  • In the landmark U.S. v. Barr Laboratories case, “the Court declared that all failure investigations must be performed promptly, within thirty business days of the problem’s occurrence”
  • Best practices recommend assembling a cross-functional team immediately after deviation discovery and conduct initial risk assessment within 24 hours”
  • Initial actions taken in the first day directly impact the quality and effectiveness of the entire investigation process

When you capitalize on this golden window, you’re working with fresh memories, intact evidence, and the highest chance of observing actual conditions that contributed to the deviation.

Identifying the Problem: Clarity from the Start

Clear, precise problem definition forms the foundation of any effective investigation. Vague or incomplete problem statements lead to misdirected investigations and ultimately, inadequate corrective actions.

  • Document using specific, factual language that describes what occurred versus what was expected
  • Include all relevant details such as procedure and equipment numbers, product names and lot numbers
  • Apply the 5W2H method (What, When, Where, Who, Why if known, How much is involved, and How it was discovered)
  • Avoid speculation about causes in the initial description
  • Remember that the description should incorporate relevant records and photographs of discovered defects.
5W2HTypical questionsContains
Who?Who are the people directly concerned with the problem? Who does this? Who should be involved but wasn’t? Was someone involved who shouldn’t be?User IDs, Roles and Departments
What?What happened?Action, steps, description
When?When did the problem occur?Times, dates, place In process
Where?Where did the problem occur?Location
Why is it important?Why did we do this? What are the requirements? What is the expected condition?Justification, reason
How?How did we discover. Where in the process was it?Method, process, procedure
How Many? How Much?How many things are involved? How often did the situation happen? How much did it impact?Number, frequency

The quality of your deviation documentation begins with this initial identification. As I’ve emphasized in previous posts, the investigation/deviation report should tell a story that can be easily understood by all parties well after the event and the investigation. This narrative begins with clear identification on day one.

ElementsProblem Statement
Is used to…Understand and target a problem. Providing a scope. Evaluate any risks. Make objective decisions
Answers the following… (5W2H)What? (problem that occurred);When? (timing of what occurred); Where? (location of what occurred); Who? (persons involved/observers); Why? (why it matters, not why it occurred); How Much/Many? (volume or count); How Often? (First/only occurrence or multiple)
Contains…Object (What was affected?); Defect (What went wrong?)
Provides direction for…Escalation(s); Investigation

Going to the GEMBA: Being Where the Action Is

GEMBA-the actual place where work happens-is a cornerstone concept in quality management. When a deviation occurs, there is no substitute for being physically present at the location.

  • Observe the actual conditions and environment firsthand
  • Notice details that might not be captured in written reports
  • Understand the workflow and context surrounding the deviation
  • Gather physical evidence before it’s lost or conditions change
  • Create the opportunity for meaningful conversations with operators

Human error occurs because we are human beings. The extent of our knowledge, training, and skill has little to do with the mistakes we make. We tire, our minds wander and lose concentration, and we must navigate complex processes while satisfying competing goals and priorities – compliance, schedule adherence, efficiency, etc.

Foremost to understanding human performance is knowing that people do what makes sense to them given the available cues, tools, and focus of their attention at the time. Simply put, people come to work to do a good job – if it made sense for them to do what they did, it will make sense to others given similar conditions. The following factors significantly shape human performance and should be the focus of any human error investigation:

Physical Environment
Environment, tools, procedures, process design
Organizational Culture
Just- or blame-culture, attitude towards error
Management and Supervision
Management of personnel, training, procedures
Stress Factors
Personal, circumstantial, organizational

We do not want to see or experience human error – but when we do, it’s imperative to view it as a valuable opportunity to improve the system or process. This mindset is the heart of effective human error prevention.

Conducting an Effective GEMBA Walk for Deviations

When conducting your GEMBA walk specifically for deviation investigation:

  • Arrive with a clear purpose and structured approach
  • Observe before asking questions
  • Document observations with photos when appropriate
  • Look for environmental factors that might not appear in reports
  • Pay attention to equipment configuration and conditions
  • Note how operators interact with the process or equipment

A deviation gemba is a cross-functional team meeting that is assembled where a potential deviation event occurred. Going to the gemba and “freezing the scene” as close as possible to the time the event occurred will yield valuable clues about the environment that existed at the time – and fresher memories will provide higher quality interviews. This gemba has specific objectives:

  • Obtain a common understanding of the event: what happened, when and where it happened, who observed it, who was involved – all the facts surrounding the event. Is it a deviation?
  • Clearly describe actions taken, or that need to be taken, to contain impact from the event: product quarantine, physical or mechanical interventions, management or regulatory notifications, etc.
  • Interview involved operators: ask open-ended questions, like how the event unfolded or was discovered, from their perspective, or how the event could have been prevented, in their opinion – insights from personnel experienced with the process can prove invaluable during an investigation.

Deviation GEMBA Tips

Typically there is time between when notification of a deviation gemba goes out and when the team is scheduled to assemble. It is important to come prepared to help facilitate an efficient gemba:

  • Assemble procedures and other relevant documents and records. This will make references easier during the gemba.
  • Keep your team on-track – the gemba should end with the team having a common understanding of the event, actions taken to contain impact, and the agreed-upon next steps of the investigation.

You will gain plenty of investigational leads from your observations and interviews at the gemba – which documents to review, which personnel to interview, which equipment history to inspect, and more. The gemba is such an invaluable experience that, for many minor events, root cause and CAPA can be determined fairly easily from information gathered solely at the gemba.

Informal Rubric for Conducting a Good Deviation GEMBA

  • Describe the timeliness of the team gathering at the gemba.
  • Were all required roles and experts present?
  • Was someone leading or facilitating the gemba?
  • Describe any interviews the team performed during the gemba.
  • Did the team get sidetracked or off-topic during the gemba
  • Was the team prepared with relevant documentation or information?
  • Did the team determine batch impact and any reportability requirements?
  • Did the team satisfy the objectives of the gemba?
  • What did the team do well?
  • What could the team improve upon?

Speaking with Operators: The Power of Cognitive Interviewing

Interviewing personnel who were present when the deviation occurred requires special techniques to elicit accurate, complete information. Traditional questioning often fails to capture critical details.

Cognitive interviewing, as I outlined in my previous post on “Interviewing,” was originally created for law enforcement and later adopted during accident investigations by the National Transportation Safety Board (NTSB). This approach is based on two key principles:

  • Witnesses need time and encouragement to recall information
  • Retrieval cues enhance memory recall

How to Apply Cognitive Interviewing in Deviation Investigations

  • Mental Reinstatement: Encourage the interviewee to mentally recreate the environment and people involved
  • In-Depth Reporting: Encourage the reporting of all the details, even if it is minor or not directly related
  • Multiple Perspectives: Ask the interviewee to recall the event from others’ points of view
  • Several Orders: Ask the interviewee to recount the timeline in different ways. Beginning to end, end to beginning

Most importantly, conduct these interviews at the actual location where the deviation occurred. A key part of this is that retrieval cues access memory. This is why doing the interview on the scene (or Gemba) is so effective.

ComponentWhat It Consists of
Mental ReinstatementEncourage the interviewee to mentally recreate the environment and people involved.
In-Depth ReportingEncourage the reporting of all the details.
Multiple PerspectivesAsk the interviewee to recall the event from others’ points of view.
Several OrdersAsk the interviewee to recount the timeline in different ways.
  • Approach the Interviewee Positively:
    • Ask for the interview.
    • State the purpose of the interview.
    • Tell interviewee why he/she was selected.
    • Avoid statements that imply blame.
    • Focus on the need to capture knowledge
    • Answer questions about the interview.
    • Acknowledge and respond to concerns.
    • Manage negative emotions.
  • Apply these Four Components:
    • Use mental reinstatement.
    • Report everything.
    • Change the perspective.
    • Change the order.
  • Apply these Two Principles:
    • Witnesses need time and encouragement to recall information.
    • Retrieval cues enhance memory recall.
  • Demonstrate these Skills:
    • Recreate the original context and had them walk you through process.
    • Tell the witness to actively generate information.
    • Adopt the witness’s perspective.
    • Listen actively, do not interrupt, and pause before asking follow-up questions.
    • Ask open-ended questions.
    • Encourage the witness to use imagery.
    • Perform interview at the Gemba.
    • Follow sequence of the four major components.
    • Bring support materials.
    • Establish a connection with the witness.
    • Do Not tell them how they made the mistake.

Initial Impact Assessment: Understanding the Scope

Within the first 24 hours, a preliminary impact assessment is essential for determining the scope of the deviation and the appropriate response.

  • Apply a risk-based approach to categorize the deviation as critical, major, or minor
  • Evaluate all potentially affected products, materials, or batches
  • Consider potential effects on critical quality attributes
  • Assess possible regulatory implications
  • Determine if released products may be affected

This impact assessment is also the initial risk assessment, which will help guide the level of effort put into the deviation.

Factors to Consider in Initial Risk Assessment

  • Patient safety implications
  • Product quality impact
  • Compliance with registered specifications
  • Potential for impact on other batches or products
  • Regulatory reporting requirements
  • Level of investigation required

This initial assessment will guide subsequent decisions about quarantine, notification requirements, and the depth of investigation needed. Remember, this is a preliminary assessment that will be refined as the investigation progresses.

Immediate Actions: Containing the Issue

Once you’ve identified the deviation and assessed its potential impact, immediate actions must be taken to contain the issue and prevent further risk.

  • Quarantine potentially affected products or materials to prevent their release or further use
  • Notify key stakeholders, including quality assurance, production supervision, and relevant department heads
  • Implement temporary corrective or containment measures
  • Document the deviation in your quality management system
  • Secure relevant evidence and documentation
  • Consider whether to stop related processes

Industry best practices emphasize that you should Report the deviation in real-time. Notify QA within 24 hours and hold the GEMBA. Remember that “if you don’t document it, it didn’t happen” – thorough documentation of both the deviation and your immediate response is essential.

Affected vs Related Batches

Not every Impact is the same, so it can be helpful to have two concepts: Affected and Related.

  • Affected Batch:  Product directly impacted by the event at the time of discovery, for instance, the batch being manufactured or tested when the deviation occurred.
  • Related Batch:  Product manufactured or tested under the same conditions or parameters using the process in which the deviation occurred and determined as part of the deviation investigation process to have no impact on product quality.

Setting Up for a Successful Full Investigation

The final step in the golden day is establishing the foundation for the comprehensive investigation that will follow.

  • Assemble a cross-functional investigation team with relevant expertise
  • Define clear roles and responsibilities for team members
  • Establish a timeline for the investigation (remembering the 30-day guideline)
  • Identify additional data or evidence that needs to be collected
  • Plan for any necessary testing or analysis
  • Schedule follow-up interviews or observations

In my post on handling deviations, I emphasized that you must perform a time-sensitive and thorough investigation within 30 days. The groundwork laid during the golden day will make this timeline achievable while maintaining investigation quality.

Planning for Root Cause Analysis

During this setup phase, you should also begin planning which root cause analysis tools might be most appropriate for your investigation. Select tools based on the event complexity and the number of potential root causes and when “human error” appears to be involved, prepare to dig deeper as this is rarely the true root cause

Identifying Phase of your Investigation

IfThen you are at
The problem is not understood. Boundaries have not been set. There could be more than one problemProblem Understanding
Data needs to be collected. There are questions about frequency or occurrence. You have not had interviewsData Collection
Data has been collected but not analyszedData Analysis
The root cause needs to be determined from the analyzed dataIdentify Root Cause
Root Cause Analysis Tools Chart body { font-family: Arial, sans-serif; line-height: 1.6; margin: 20px; } table { border-collapse: collapse; width: 100%; margin-bottom: 20px; } th, td { border: 1px solid ; padding: 8px 12px; vertical-align: top; } th { background-color: ; font-weight: bold; text-align: left; } tr:nth-child(even) { background-color: ; } .purpose-cell { font-weight: bold; } h1 { text-align: center; color: ; } ul { margin: 0; padding-left: 20px; }

Root Cause Analysis Tools Chart

Purpose Tool Description
Problem Understanding Process Map A picture of the separate steps of a process in sequential order, including:
  • materials or services entering or leaving the process (inputs and outputs)
  • decisions that must be made
  • people who become involved
  • time involved at each step, and/or
  • process measurements.
Critical Incident Technique (CIT) A process used for collecting direct observations of human behavior that
  • have critical significance, and
  • meet methodically defined criteria.
Comparative Analysis A technique that focuses a problem-solving team on a problem. It compares one or more elements of a problem or process to evaluate elements that are similar or different (e.g. comparing a standard process to a failing process).
Performance Matrix A tool that describes the participation by various roles in completing tasks or deliverables for a project or business process.
Note: It is especially useful in clarifying roles and responsibilities in cross-functional/departmental positions.
5W2H Analysis An approach that defines a problem and its underlying contributing factors by systematically asking questions related to who, what, when, where, why, how, and how much/often.
Data Collection Surveys A technique for gathering data from a targeted audience based on a standard set of criteria.
Check Sheets A technique to compile data or observations to detect and show trends/patterns.
Cognitive Interview An interview technique used by investigators to help the interviewee recall specific memories from a specific event.
KNOT Chart A data collection and classification tool to organize data based on what is
  • Known
  • Need to know
  • Opinion, and
  • Think we know.
Data Analysis Pareto Chart A technique that focuses efforts on problems offering the greatest potential for improvement.
Histogram A tool that
  • summarizes data collected over a period of time, and
  • graphically presents frequency distribution.
Scatter Chart A tool to study possible relationships between changes in two different sets of variables.
Run Chart A tool that captures study data for trends/patterns over time.
Affinity Diagram A technique for brainstorming and summarizing ideas into natural groupings to understand a problem.
Root Cause Analysis Interrelationship Digraphs A tool to identify, analyze, and classify cause and effect relationships among issues so that drivers become part of an effective solution.
Why-Why A technique that allows one to explore the cause-and-effect relationships of a particular problem by asking why; drilling down through the underlying contributing causes to identify root cause.
Is/Is Not A technique that guides the search for causes of a problem by isolating the who, what, when, where, and how of an event. It narrows the investigation to factors that have an impact and eliminates factors that do not have an impact. By comparing what the problem is with what the problem is not, we can see what is distinctive about a problem which leads to possible causes.
Structured Brainstorming A technique to identify, explore, and display the
  • factors within each root cause category that may be affecting the problem/issue, and/or
  • effect being studied through this structured idea-generating tool.
Cause and Effect Diagram (Ishikawa/Fishbone) A tool to display potential causes of an event based on root cause categories defined by structured brainstorming using this tool as a visual aid.
Causal Factor Charting A tool to
  • analyze human factors and behaviors that contribute to errors, and
  • identify behavior-influencing factors and gaps.
Other Tools Prioritization Matrix A tool to systematically compare choices through applying and weighting criteria.
Control Chart A tool to monitor process performance over time by studying its variation and source.
Process Capability A tool to determine whether a process is capable of meeting requirements or specifications.

Making the Most of Your Golden Day

The first 24 hours after discovering a deviation represent a unique opportunity that should not be wasted. By following the structured approach outlined in this post-identifying the problem clearly, going to the GEMBA, interviewing operators using cognitive techniques, conducting an initial impact assessment, taking immediate containment actions, and setting up for the full investigation-you maximize the value of this golden day.

Remember that excellent deviation management is directly linked to product quality, patient safety, and regulatory compliance. Each well-managed deviation is an opportunity to strengthen your quality system.

I encourage you to assess your current approach to the first 24 hours of deviation management. Are you capturing the full value of this golden day, or are you letting critical information slip away? Implement these strategies, train your team on proper deviation triage, and transform your deviation response from reactive to proactive.

Your deviation management effectiveness doesn’t begin when the investigation report is initiated-it begins the moment a deviation is discovered. Make that golden day count.

You Gotta Have Heart: Combating Human Error

The persistent attribution of human error as a root cause deviations reveals far more about systemic weaknesses than individual failings. The label often masks deeper organizational, procedural, and cultural flaws. Like cracks in a foundation, recurring human errors signal where quality management systems (QMS) fail to account for the complexities of human cognition, communication, and operational realities.

The Myth of Human Error as a Root Cause

Regulatory agencies increasingly reject “human error” as an acceptable conclusion in deviation investigations. This shift recognizes that human actions occur within a web of systemic influences. A technician’s missed documentation step or a formulation error rarely stem from carelessness alone but emerge from:

The aviation industry’s “Tower of Babel” problem—where siloed teams develop isolated communication loops—parallels pharmaceutical manufacturing. The Quality Unit may prioritize regulatory compliance, while production focuses on throughput, creating disjointed interpretations of “quality.” These disconnects manifest as errors when cross-functional risks go unaddressed.

Cognitive Architecture and Error Propagation

Human cognition operates under predictable constraints. Attentional biases, memory limitations, and heuristic decision-making—while evolutionarily advantageous—create vulnerabilities in GMP environments. For example:

  • Attentional tunneling: An operator hyper-focused on solving a equipment jam may overlook a temperature excursion alert.
  • Procedural drift: Subtle deviations from written protocols accumulate over time as workers optimize for perceived efficiency.
  • Complacency cycles: Over-familiarity with routine tasks reduces vigilance, particularly during night shifts or prolonged operations.

These cognitive patterns aren’t failures but features of human neurobiology. Effective QMS design anticipates them through:

  1. Error-proofing: Automated checkpoints that detect deviations before critical process stages
  2. Cognitive load management: Procedures (including batch records) tailored to cognitive load principles with decision-support prompts
  3. Resilience engineering: Simulations that train teams to recognize and recover from near-misses

Strategies for Reframing Human Error Analysis

Conduct Cognitive Autopsies

Move beyond 5-Whys to adopt human factors analysis frameworks:

  • Human Error Assessment and Reduction Technique (HEART): Quantifies the likelihood of specific error types based on task characteristics
  • Critical Action and Decision (CAD) timelines: Maps decision points where system defenses failed

For example, a labeling mix-up might reveal:

  • Task factors: Nearly identical packaging for two products (29% contribution to error likelihood)
  • Environmental factors: Poor lighting in labeling area (18%)
  • Organizational factors: Inadequate change control when adding new SKUs (53%)

Redesign for Intuitive Use

The redesign of for intuitive use requires multilayered approaches based on understand how human brains actually work. At the foundation lies procedural chunking, an evidence-based method that restructures complex standard operating procedures (SOPs) into digestible cognitive units aligned with working memory limitations. This approach segments multiphase processes like aseptic filling into discrete verification checkpoints, reducing cognitive overload while maintaining procedural integrity through sequenced validation gates. By mirroring the brain’s natural pattern recognition capabilities, chunked protocols demonstrate significantly higher compliance rates compared to traditional monolithic SOP formats.

Complementing this cognitive scaffolding, mistake-proof redesigns create inherent error detection mechanisms.

To sustain these engineered safeguards, progressive facilities implement peer-to-peer audit protocols during critical operations and transition periods.

Leverage Error Data Analytics

The integration of data analytics into organizational processes has emerged as a critical strategy for minimizing human error, enhancing accuracy, and driving informed decision-making. By leveraging advanced computational techniques, automation, and machine learning, data analytics addresses systemic vulnerabilities.

Human Error Assessment and Reduction Technique (HEART): A Systematic Framework for Error Mitigation

Benefits of the Human Error Assessment and Reduction Technique (HEART)

1. Simplicity and Speed: HEART is designed to be straightforward and does not require complex tools, software, or large datasets. This makes it accessible to organizations without extensive human factors expertise and allows for rapid assessments. The method is easy to understand and apply, even in time-constrained or resource-limited environments.

2. Flexibility and Broad Applicability: HEART can be used across a wide range of industries—including nuclear, healthcare, aviation, rail, process industries, and engineering—due to its generic task classification and adaptability to different operational contexts. It is suitable for both routine and complex tasks.

3. Systematic Identification of Error Influences: The technique systematically identifies and quantifies Error Producing Conditions (EPCs) that increase the likelihood of human error. This structured approach helps organizations recognize the specific factors—such as time pressure, distractions, or poor procedures—that most affect reliability.

4. Quantitative Error Prediction: HEART provides a numerical estimate of human error probability for specific tasks, which can be incorporated into broader risk assessments, safety cases, or design reviews. This quantification supports evidence-based decision-making and prioritization of interventions.

5. Actionable Risk Reduction: By highlighting which EPCs most contribute to error, HEART offers direct guidance on where to focus improvement efforts—whether through engineering redesign, training, procedural changes, or automation. This can lead to reduced error rates, improved safety, fewer incidents, and increased productivity.

6. Supports Accident Investigation and Design: HEART is not only a predictive tool but also valuable in investigating incidents and guiding the design of safer systems and procedures. It helps clarify how and why errors occurred, supporting root cause analysis and preventive action planning.

7. Encourages Safety and Quality Culture and Awareness: Regular use of HEART increases awareness of human error risks and the importance of control measures among staff and management, fostering a proactive culture.

When Is HEART Best Used?

  • Risk Assessment for Critical Tasks: When evaluating tasks where human error could have severe consequences (e.g., operating nuclear control systems, administering medication, critical maintenance), HEART helps quantify and reduce those risks.
  • Design and Review of Procedures: During the design or revision of operational procedures, HEART can identify steps most vulnerable to error and suggest targeted improvements.
  • Incident Investigation: After an failure or near-miss, HEART helps reconstruct the event, identify contributing EPCs, and recommend changes to prevent recurrence.
  • Training and Competence Assessment: HEART can inform training programs by highlighting the conditions and tasks where errors are most likely, allowing for focused skill development and awareness.
  • Resource-Limited or Fast-Paced Environments: Its simplicity and speed make HEART ideal for organizations needing quick, reliable human error assessments without extensive resources or data.

Generic Task Types (GTTs): Establishing Baselines

HEART classifies human activities into nine Generic Task Types (GTT) with predefined nominal human error probabilities (NHEPs) derived from decades of industrial incident data:

GTT CodeTask DescriptionNominal HEP Range
AComplex, novel tasks requiring problem-solving0.55 (0.35–0.97)
BShifting attention between multiple systems0.26 (0.14–0.42)
CHigh-skill tasks under time constraints0.16 (0.12–0.28)
DRule-based diagnostics under stress0.09 (0.06–0.13)
ERoutine procedural tasks0.02 (0.007–0.045)
FRestoring system states0.003 (0.0008–0.007)
GHighly practiced routine operations0.0004 (0.00008–0.009)
HSupervised automated actions0.00002 (0.000006–0.0009)
MMiscellaneous/undefined tasks0.003 (0.008–0.11)

Comprehensive Taxonomy of Error-Producing Conditions (EPCs)

HEART’s 38 Error Producing Conditionss represent contextual amplifiers of error probability, categorized under the 4M Framework (Man, Machine, Media, Management):

EPC CodeDescriptionMax Effect4M Category
1Unfamiliarity with task17×Man
2Time shortage11×Management
3Low signal-to-noise ratio10×Machine
4Override capability of safety featuresMachine
5Spatial/functional incompatibilityMachine
6Model mismatch between mental and system statesMan
7Irreversible actionsMachine
8Channel overload (information density)Media
9Technique unlearningMan
10Inadequate knowledge transfer5.5×Management
11Performance ambiguityMedia
12Misperception of riskMan
13Poor feedback systemsMachine
14Delayed/incomplete feedbackMedia
15Operator inexperienceMan
16Impoverished information qualityMedia
17Inadequate checking proceduresManagement
18Conflicting objectives2.5×Management
19Lack of information diversity2.5×Media
20Educational/training mismatchManagement
21Dangerous incentivesManagement
22Lack of skill practice1.8×Man
23Unreliable instrumentation1.6×Machine
24Need for absolute judgments1.6×Man
25Unclear functional allocation1.6×Management
26No progress tracking1.4×Media
27Physical capability mismatches1.4×Man
28Low semantic meaning of information1.4×Media
29Emotional stress1.3×Man
30Ill-health1.2×Man
31Low workforce morale1.2×Management
32Inconsistent interface design1.15×Machine
33Poor environmental conditions1.1×Media
34Low mental workload1.1×Man
35Circadian rhythm disruption1.06×Man
36External task pacing1.03×Management
37Supernumerary staffing issues1.03×Management
38Age-related capability decline1.02×Man

HEP Calculation Methodology

The HEART equation incorporates both multiplicative and additive effects of EPCs:

Where:

  • NHEP: Nominal Human Error Probability from GTT
  • EPC_i: Maximum effect of i-th EPC
  • APOE_i: Assessed Proportion of Effect (0–1)

HEART Case Study: Operator Error During Biologics Drug Substance Manufacturing

A biotech facility was producing a monoclonal antibody (mAb) drug substance using mammalian cell culture in large-scale bioreactors. The process involved upstream cell culture (expansion and production), followed by downstream purification (protein A chromatography, filtration), and final bulk drug substance filling. The manufacturing process required strict adherence to parameters such as temperature, pH, and feed rates to ensure product quality, safety, and potency.

During a late-night shift, an operator was responsible for initiating a nutrient feed into a 2,000L production bioreactor. The standard operating procedure (SOP) required the feed to be started at 48 hours post-inoculation, with a precise flow rate of 1.5 L/hr for 12 hours. The operator, under time pressure and after a recent shift change, incorrectly programmed the feed rate as 15 L/hr rather than 1.5 L/hr.

Outcome:

  • The rapid addition of nutrients caused a metabolic imbalance, leading to excessive cell growth, increased waste metabolite (lactate/ammonia) accumulation, and a sharp drop in product titer and purity.
  • The batch failed to meet quality specifications for potency and purity, resulting in the loss of an entire production lot.
  • Investigation revealed no system alarms for the high feed rate, and the error was only detected during routine in-process testing several hours later.

HEART Analysis

Task Definition

  • Task: Programming and initiating nutrient feed in a GMP biologics manufacturing bioreactor.
  • Criticality: Direct impact on cell culture health, product yield, and batch quality.

Generic Task Type (GTT)

GTT CodeDescriptionNominal HEP
ERoutine procedural task with checking0.02

Error-Producing Conditions (EPCs) Using the 5M Model

5M CategoryEPC (HEART)Max EffectAPOEExample in Incident
ManInexperience with new feed system (EPC15)0.8Operator recently trained on upgraded control interface
MachinePoor feedback (no alarm for high feed rate, EPC13)0.7System did not alert on out-of-range input
MediaAmbiguous SOP wording (EPC11)0.5SOP listed feed rate as “1.5L/hr” in a table, not text
ManagementTime pressure to meet batch deadlines (EPC2)11×0.6Shift was behind schedule due to earlier equipment delay
MilieuDistraction during shift change (EPC36)1.03×0.9Handover occurred mid-setup, leading to divided attention

Human Error Probability (HEP) Calculation

HEP ≈ 3.5 (350%)
This extremely high error probability highlights a systemic vulnerability, not just an individual lapse.

Root Cause and Contributing Factors

  • Operator: Recently trained, unfamiliar with new interface (Man)
  • System: No feedback or alarm for out-of-spec feed rate (Machine)
  • SOP: Ambiguous presentation of critical parameter (Media)
  • Management: High pressure to recover lost time (Management)
  • Environment: Shift handover mid-task, causing distraction (Milieu)

Corrective Actions

Technical Controls

  • Automated Range Checks: Bioreactor control software now prevents entry of feed rates outside validated ranges and requires supervisor override for exceptions.
  • Visual SOP Enhancements: Critical parameters are now highlighted in both text and tables, and reviewed during operator training.

Human Factors & Training

  • Simulation-Based Training: Operators practice feed setup in a virtual environment simulating distractions and time pressure.
  • Shift Handover Protocol: Critical steps cannot be performed during handover periods; tasks must be paused or completed before/after shift changes.

Management & Environmental Controls

  • Production Scheduling: Buffer time added to schedules to reduce time pressure during critical steps.
  • Alarm System Upgrade: Real-time alerts for any parameter entry outside validated ranges.

Outcomes (6-Month Review)

MetricPre-InterventionPost-Intervention
Feed rate programming errors4/year0/year
Batch failures (due to feed)2/year0/year
Operator confidence (survey)62/10091/100

Lessons Learned

  • Systemic Safeguards: Reliance on operator vigilance alone is insufficient in complex biologics manufacturing; layered controls are essential.
  • Human Factors: Addressing EPCs across the 5M model—Man, Machine, Media, Management, Milieu—dramatically reduces error probability.
  • Continuous Improvement: Regular review of near-misses and operator feedback is crucial for maintaining process robustness in biologics manufacturing.

This case underscores how a HEART-based approach, tailored to biologics drug substance manufacturing, can identify and mitigate multi-factorial risks before they result in costly failures.

Causal Factor

A causal factor is a significant contributor to an incident, event, or problem that, if eliminated or addressed, would have prevented the occurrence or reduced its severity or frequency. Here are the key points to understand about causal factors:

  1. Definition: A causal factor is a major unplanned, unintended contributor to an incident (a negative event or undesirable condition) that, if eliminated, would have either prevented the occurrence of the incident or reduced its severity or frequency.
  2. Distinction from root cause: While a causal factor contributes to an incident, it is not necessarily the primary driver. The root cause, on the other hand, is the fundamental reason for the occurrence of a problem or event. (Pay attention to the deficiencies of the model)
  3. Multiple contributors: An incident may have multiple causal factors, and eliminating one causal factor might not prevent the incident entirely but could reduce its likelihood or impact. Swiss-Cheese Model.
  4. Identification methods: Causal factors can be identified through various techniques, including: Root cause analysis (including such tools as fishbone diagrams (Ishikawa diagrams) or the Why-Why technique), Causal Learning Cycle(CLC) analysis, and Causal factor charting.
  5. Importance in problem-solving: Identifying causal factors is crucial for developing effective preventive measures and improving safety, quality, and efficiency.
  6. Characteristics: Causal factors must be mistakes, errors, or failures that directly lead to an incident or fail to mitigate its consequences. They should not contain other causal factors within them.
  7. Distinction from root causes: It’s important to note that root causes are not causal factors but rather lead to causal factors. Examples of root causes often mistaken for causal factors include inadequate procedures, improper training, or poor work culture.

Human Factors are not always Causal Factors, but can be!

Human factor and human error are related concepts but are not the same. A human error is always a causal factor, and the human factor explains why human errors can happen.

Human Error

Human error refers to an unintentional action or decision that fails to achieve the intended outcome. It encompasses mistakes, slips, lapses, and violations that can lead to accidents or incidents. There are two types:

  • Unintentional Errors include slips (attentional failures) and lapses (memory failures) caused by distractions, interruptions, fatigue, or stress.
  • Intentional Errors are violations in which an individual knowingly deviates from safe practices, procedures, or regulations. They are often categorized into routine, situational, or exceptional violations.

Human Factors

Human factors is a broader field that studies how humans interact with various system elements, including tools, machines, environments, and processes. It aims to optimize human well-being and overall system performance by understanding human capabilities, limitations, behaviors, and characteristics.

  • Physical Ergonomics focuses on human anatomical, anthropometric, physiological, and biomechanical characteristics.
  • Cognitive Ergonomics deals with mental processes such as perception, memory, reasoning, and motor response.
  • Organizational Ergonomics involves optimizing organizational structures, policies, and processes to improve overall system performance and worker well-being.

Relationship Between Human Factors and Human Error

  • Causal Relationship: Human factors delve into the underlying reasons why human errors occur. They consider the conditions and systems that contribute to errors, such as poor design, inadequate training, high workload, and environmental factors.
  • Error Prevention: By addressing human factors, organizations can design systems and processes that minimize the likelihood of human errors. This includes implementing error-proofing solutions, improving ergonomics, and enhancing training and supervision.

Key Differences

  • Focus:
    • Human Error: Focuses on the outcome of an action or decision that fails to achieve the intended result.
    • Human Factors: Focuses on the broader context and conditions that influence human performance and behavior.
  • Approach:
    • Human Error: Often addressed through training, disciplinary actions, and procedural changes.
    • Human Factors: Involves a multidisciplinary approach to design systems, environments, and processes that support optimal human performance and reduce the risk of errors.

Peer Checking

Peer checking is a technique where two individuals work together to prevent errors before and during a specific action or task. Here are the key points about peer checking:

  • It involves a performer (the person doing the task) and a peer checker (someone familiar with the task who observes the performer).
  • The purpose is to prevent errors by the performer by having a second set of eyes verify the correct action is being taken.
  • The performer and peer checker first agree on the intended action and component. Then, the performer performs the action while the peer observes to confirm it was done correctly.
  • It augments self-checking by the performer but does not replace self-checking. Both individuals self-check in parallel.
  • The peer checker provides a fresh perspective that is not trapped in the performer’s task mindset, allowing them to potentially identify hazards or consequences the performer may miss.
  • It is recommended for critical, irreversible steps or error-likely situations where an extra verification can prevent mistakes.
  • Peer checking should be used judiciously and not mandated for all actions, as overuse can make it become a mechanical process that loses effectiveness.
  • It can also be used to evaluate potential fatigue or stress in a co-worker before starting a task.

Personally, I think we overcheck, and the whole process loses effectiveness. A big part of automation and computerized systems like an MES is removing the need for peer checking. But frankly, I’m pretty sure it will never go away.

Peer-Checking is the Check/Witness

Expert Intuition and Risk Management

Saturday Morning Breakfast Cereal source http://smbc-comics.com/comic/horrible

Risk management is a crucial aspect of any organization or project. However, it is often subject to human errors in subjective risk judgments. This is because most risk assessment methods rely on subjective inputs from experts. Without certain precautions, experts can make consistent errors in judgment about uncertainty and risk.

There are methods that can correct the systemic errors that people make, but very few organizations implement them. As a result, there is often an almost universal understatement of risk. We need to keep in mind a few rules about experience and expertise.

  • Experience is a nonrandom, nonscientific sample of events throughout our lifetime.
  • Experience is memory-based and we are very selective regarding what we choose to remember,
  • What we conclude from our experience can be full of logical errors
  • Unless we get reliable feedback on past decisions, there is no reason to believe our experience will tell us much.

No matter how much experience we accumulate, we seem to be very inconsistent in its application.

Experts have unconscious heuristics and biases that impact their judgment, some important ones include:

  • Misconceptions of chance: If you flip a coin six times, which result is more likely (H= heads, T= tails): HHHTTT or HTHTTH? They are both equal, but many people assume that because the first series looks “less random” than the second, it must be less likely. This is an example of representativeness bias. We appear to judge odds based on what we assume to be representative scenarios. Human beings easily confuse patterns and randomness.
  • The conjunction fallacy: We often see specific events as more likely than broader categories of events.
  • Irrational belief in small samples
  • Disregarding variance in small samples. Small samples have more random variance that large samples is considered less than it should be.
  • Insensitivity to prior probabilities: People tend to ignore the past and focus on new information when making subjective estimates.

This is all about overconfidence as an expert, which will consistently underestimate risks.

What are some ways to overcome this? I recommend the following be built into your risk management system.

  • Pretend you are in the future looking back at failure. Start with the assumption that a major disaster did happen and describe how it happened.
  • Look to risks from others. Gather a list of related failures, for example, regulatory agency observations, and think of risks in relation to those.
  • Include Everyone. Your organization has numerous experts on all sorts of specific risks. Make the effort to survey representatives of just about every job level.
  • Do peer reviews. Check assumptions by showing them to peers who are not immersed in the assessment.
  • Implement metrics for performance. The Brier score is a way to evaluate the result of predictions both by how often the team was right and by the probability the estimated for getting a correct answer.

Further Reading

Here are some sources that discuss the topic of human errors and subjective judgments in risk management: