No theory forbids me to say "Ah!" or "Ugh!", but it forbids me the bogus theorization of my "Ah!" and "Ugh!" - the value judgments. - Theodor Julius Geiger (1960)

Systems-Theoretic Accident Model and Processes

🚀 Engineering a Safer World - Prof. Nancy Leveson - Massachusetts Institute of Technology 🚀

In today's fast-paced and complex world, it's crucial to approach safety from a different perspective. Our traditional methods of ensuring safety have limitations that hinder their effectiveness and cost-efficiency.


1️⃣ Why Our Efforts are Often Not Cost-Effective:

Current safety efforts are often superficial, isolated, or misdirected: they focus on assuring that a finished system is safe rather than on designing systems to be inherently safe. Safety measures are also often introduced too late in system development, which limits their effectiveness. We tend to apply techniques that are not suited to the complex systems being built today. And our efforts concentrate on the technical components of systems, overlooking important factors such as human error, new technologies (especially software), conflicting expectations, management, and system evolution.


2️⃣ The Limitations of the Traditional Approach:

The traditional approach treats safety as a failure problem: it aims to place barriers between events or to prevent individual component failures. As systems become more complex, however, accidents often arise from interactions among components, even when no individual component has failed. Neither designers nor operators can anticipate and account for every potential interaction. Confusing safety with reliability hides this problem: a system can be reliable and still unsafe, and treating the two as the same neglects the dynamic, non-linear nature of accidents.

Non-serious events and incidents are often overlooked, yet they are valuable learning opportunities. Labeling the cause "operator error" is an unproductive finding that fails to address the underlying causes. To truly improve safety, we must shift our focus from "who" or "what" to "why." Blame is the enemy of safety. The search for a single root cause is seductive but misleading, because it creates an illusion of control. It typically centers on operator error or technical failures and disregards systemic and management factors. Consequently, we end up playing a sophisticated game of "whack-a-mole," fixing surface-level symptoms while leaving untouched the flawed processes that gave rise to them. In this perpetual fire-fighting mode, the same accidents keep repeating.

To truly grasp the complexities of safety, we must engage in three levels of analysis. First, we must examine the events themselves - the "what" - such as an explosion. Then, we delve into the conditions surrounding the incident, considering the "who" and "how." This includes factors like flawed valve design or an operator failing to notice something. Lastly, and most importantly, we must explore the underlying systemic factors - the "why." This entails evaluating production pressures, cost concerns, flaws in design and reporting processes, and more. By understanding why the safety control structure failed, we can prevent future losses effectively.
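One way to keep the three levels apart during an investigation is to record findings per level and flag any analysis that stops before reaching the "why." The following is only a minimal sketch: the class name `AccidentAnalysis` and the method `stops_short` are hypothetical, and the sample findings are the examples mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class AccidentAnalysis:
    """Findings grouped by the three levels of analysis (hypothetical structure)."""
    events: list[str] = field(default_factory=list)            # the "what"
    conditions: list[str] = field(default_factory=list)        # the "who" and "how"
    systemic_factors: list[str] = field(default_factory=list)  # the "why"

    def stops_short(self) -> bool:
        """True if events or conditions are recorded but no systemic factor is."""
        return bool(self.events or self.conditions) and not self.systemic_factors


analysis = AccidentAnalysis(
    events=["explosion"],
    conditions=["flawed valve design", "operator failed to notice something"],
)
print(analysis.stops_short())  # True: no "why" has been recorded yet

analysis.systemic_factors += ["production pressures", "cost concerns"]
print(analysis.stops_short())  # False
```

The check cannot tell whether the systemic factors are the right ones; it only makes it visible when an analysis ends at the event or condition level.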

Hindsight bias often clouds our judgment after an incident. It becomes easy to pinpoint where individuals went wrong, what they should have done differently, or what crucial information they missed. Once we know the outcome, however, we can no longer fully take the perspective of someone who did not. Overcoming hindsight bias requires us to assume that nobody comes to work intending to perform poorly and that people were acting reasonably given the complexities, dilemmas, trade-offs, and uncertainties they faced. Simply highlighting mistakes or stating what should have been done does not explain why people acted the way they did.

3️⃣ Software-Related Accidents and Operator Error:

Software-related accidents are frequently caused by flawed requirements, incorrect assumptions, and states of the controlled system that the software does not handle. Making software more reliable does not make it safe under these conditions, because software can implement its requirements perfectly and still issue an unsafe command. Similarly, blaming operator error for incidents and accidents is a limited perspective. Human error is often a symptom, not the cause, of accidents. Understanding the role of operators in complex systems, the changing nature of their work, and the system design in which they operate is crucial. Human error can be mitigated by designing systems that minimize the likelihood of errors and provide the necessary support and tools for operators.
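The point about unhandled states can be illustrated with a deliberately small sketch. Everything in it is hypothetical: the tank, the `TankState` values, the function names, and the assumption that a closed valve is the safe action. The first function implements an imagined requirement exactly ("close when full, otherwise open") and is therefore "reliable," yet it opens the valve in a state the requirement never considered.

```python
from enum import Enum


class TankState(Enum):
    FILLING = "filling"
    FULL = "full"
    SENSOR_FAULT = "sensor_fault"  # state the hypothetical requirement never mentions


def valve_command_as_specified(state: TankState) -> str:
    """Implements the imagined requirement exactly: close when FULL, otherwise open."""
    return "close" if state is TankState.FULL else "open"


def valve_command_with_constraint(state: TankState) -> str:
    """Opens only in the one state known to be safe; closes in every other state.

    Assumption: a closed valve is the safe action for this hypothetical tank.
    """
    if state is TankState.FILLING:
        return "open"
    return "close"


for state in TankState:
    print(state.value,
          valve_command_as_specified(state),
          valve_command_with_constraint(state))
```

Neither function contains a coding bug. The difference lies in the requirement each one implements, which is why testing the first version against its own specification would not reveal the problem.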


4️⃣ Systems Thinking - STAMP Approach:

To address these limitations and create a safer world, we need to adopt a systems thinking approach. One such approach is the System-Theoretic Accident Model and Processes (STAMP). STAMP treats safety as a dynamic control problem rather than just a reliability problem. It emphasizes enforcing a set of constraints on system behavior to prevent accidents. Accidents are viewed as a result of interactions among system components that violate these constraints. By shifting our focus from preventing failures to enforcing safety constraints, we can make significant progress in enhancing system safety.


5️⃣ Applying STAMP to Safety Engineering:

STAMP provides a comprehensive framework for safety engineering. It considers accidents as complex, dynamic processes arising from interactions among humans, machines, and the environment. By identifying safety constraints and designing an effective control structure, we can eliminate or reduce adverse events. The control structure encompasses physical design, operations, management, social interactions, and culture. Clear expectations, responsibilities, authority, and accountability must be defined at all levels of the safety control structure.
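A safety control structure can be sketched as a set of controllers, the control actions they issue downward, and the feedback they receive upward. The sketch below is an assumption-laden illustration, not a STAMP tool: the names `Controller`, `ControlStructure`, `missing_feedback`, and `without_responsibilities` are hypothetical, as are the three example levels. It checks two things the paragraph above asks for: that every controller has defined responsibilities, and that every control link has a feedback path.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    name: str
    responsibilities: list[str] = field(default_factory=list)


@dataclass
class ControlStructure:
    controllers: dict[str, Controller] = field(default_factory=dict)
    # (controller, controlled process) pairs
    control_actions: set[tuple[str, str]] = field(default_factory=set)
    # (controlled process, controller) pairs
    feedback: set[tuple[str, str]] = field(default_factory=set)

    def add(self, controller: Controller) -> None:
        self.controllers[controller.name] = controller

    def missing_feedback(self) -> list[tuple[str, str]]:
        """Control links that have no feedback path back to the controller."""
        return sorted(
            (upper, lower)
            for upper, lower in self.control_actions
            if (lower, upper) not in self.feedback
        )

    def without_responsibilities(self) -> list[str]:
        """Controllers for which no responsibility has been defined."""
        return sorted(
            name for name, c in self.controllers.items() if not c.responsibilities
        )


structure = ControlStructure()
structure.add(Controller("management", ["set safety policy"]))
structure.add(Controller("operations", ["follow procedures", "report incidents"]))
structure.add(Controller("automation"))
structure.control_actions |= {("management", "operations"), ("operations", "automation")}
structure.feedback.add(("automation", "operations"))

print(structure.missing_feedback())          # [('management', 'operations')]
print(structure.without_responsibilities())  # ['automation']
```

In this example, management directs operations but receives no feedback from it, and the automation level has no stated responsibilities. A real control structure would also need to capture authority, accountability, and the content of each control action and feedback channel.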


6️⃣ STPA - A New Hazard Analysis Technique:

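System-Theoretic Process Analysis (STPA) is a powerful hazard analysis technique based on STAMP. It starts from hazards, identifies the safety constraints needed to prevent them, and identifies the scenarios that could lead to those constraints being violated. Because it does not require a detailed design, it can be applied early enough to influence design decisions.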
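One step of STPA can be sketched in a few lines. The STPA literature describes four ways in which a control action can be unsafe; that list comes from the literature, not from the summary above. The hazard and control actions below are hypothetical, as is the function name `candidate_ucas`. The sketch only enumerates candidates for an analyst to review; it does not decide which ones are actually unsafe.

```python
from itertools import product

# The four ways a control action can be unsafe, as described in the STPA literature.
UCA_TYPES = (
    "not provided when needed",
    "provided when it leads to a hazard",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
)


def candidate_ucas(control_actions, hazards):
    """Every (control action, UCA type, hazard) combination an analyst should consider."""
    return list(product(control_actions, UCA_TYPES, hazards))


# Hypothetical example system: a tank with a single inlet valve.
hazards = ["tank overflows"]
control_actions = ["open valve", "close valve"]

rows = candidate_ucas(control_actions, hazards)
for action, uca_type, hazard in rows:
    print(f"{action} | {uca_type} | {hazard}")
print(len(rows))  # 8 candidates: 2 actions x 4 types x 1 hazard
```

Each row that the analyst judges to be genuinely unsafe can then be turned into a safety constraint on the controller, and into a starting point for identifying the scenarios that could cause it.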