Era of Human Error
As advanced technologies were introduced, the complexity of systems came to exceed the capacity limits of human operators and users, and many accidents occurred due to human error.
The Three Mile Island, Unit 2 (TMI-2) accident that occurred in 1979 was a typical case in this era. The accident started with a minor malfunction in the secondary loop, but subsequent unfavorable events made the situation worse, finally leading to severe damage to the reactor core. Some of the critical events that caused the accident were operator errors. The operators, for example, misjudged that the reactor vessel was full of coolant water, and they manually tripped the Emergency Core Cooling System (ECCS), which had been initiated automatically.
The point where humans interact with human-made equipment is called a human-machine interface. Analysis of the TMI-2 accident revealed that improper human-machine interfaces lay behind the operators' errors. At the beginning of the accident, for example, more than 100 alarms were initiated at the same time, and the operators were unable to comprehend what had actually happened in the plant. In addition, the indication of the relief valve position did not reflect the actual valve position. This defect in interface design delayed the operators in correctly recognizing the internal state of the reactor vessel.
Individual human factors and the prevention of human error became key issues in this stage, and efforts were made to design working conditions and human-machine interfaces appropriate for the physical and cognitive characteristics of humans. Suppression of unimportant alarms based on alarm prioritization is one example of a function adopted in nuclear power plants after the TMI-2 accident. Since consideration of human factors is now a standard requirement in designing socio-technical systems, the probability that human error will cause a serious accident has been greatly reduced.
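The idea of suppressing unimportant alarms by priority can be illustrated with a minimal sketch. It is not a description of any actual plant system: the priority levels, the alarm tags and messages, and the function name `alarms_to_display` are all hypothetical, and the simple rule used here (hide everything below a cutoff priority and cap the number shown) is only one of many possible schemes.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical priority levels; a lower value means more urgent."""
    CRITICAL = 1
    HIGH = 2
    LOW = 3


@dataclass(frozen=True)
class Alarm:
    tag: str          # hypothetical identifier of the alarm source
    message: str      # illustrative text, not taken from any real plant
    priority: Priority


def alarms_to_display(active, max_shown=10, cutoff=Priority.HIGH):
    """Return the alarms to present to the operator, most urgent first.

    Alarms less urgent than `cutoff` are left out of the returned list,
    and at most `max_shown` alarms are returned.
    """
    important = [a for a in active if a.priority <= cutoff]
    # sorted() is stable, so arrival order is kept within a priority level.
    return sorted(important, key=lambda a: a.priority)[:max_shown]


# Illustrative input: four alarms arriving at nearly the same time.
active = [
    Alarm("A-101", "Auxiliary tank level low", Priority.LOW),
    Alarm("A-007", "Primary coolant pressure low", Priority.CRITICAL),
    Alarm("A-230", "Ventilation fan trip", Priority.LOW),
    Alarm("A-015", "Relief valve open", Priority.HIGH),
]

shown = alarms_to_display(active)
for alarm in shown:
    print(alarm.priority.name, alarm.tag, alarm.message)
```

With this input, only the two alarms at or above the cutoff are presented, and the two low-priority alarms are suppressed. A real design would also have to decide how suppressed alarms are logged and how priorities change with plant state, which this sketch does not attempt.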
Era of Socio-Technical Interactions
In the next stage, socio-technical interactions were the main sources of system failures. Many accidents occurred due to inadequate interactions among technologies, humans, management, organizations, and society. The impact of such accidents often goes beyond the boundary of the organization and causes widespread damage to society. An accident of this type is called an “organizational accident.”
The accident that occurred at Chernobyl, Unit 4, in 1986 was a typical organizational accident. At first, it was thought that the accident had been caused by the operators' violation of operating rules in order to accomplish a special test at the plant. As investigation by the international community progressed, it was revealed that organizational and social factors characteristic of the Soviet system at the time were the root causes of the violations. For example, the operators were not sufficiently trained in the background knowledge behind the operating rules, technical communication between different organizations was lacking, and workers' willingness to obey the rules was low compared with their drive to accomplish the norm.
In the same year, the Space Shuttle Challenger disintegrated after launch, killing the entire crew. The direct cause of the accident was the failure of O-ring seals in a solid rocket booster due to cold weather. It is said, however, that organizational factors at the National Aeronautics and Space Administration (NASA), such as a lack of communication and face-saving attitudes in decision making, lay behind the direct cause.
The notion of safety culture was introduced after these accidents. Safety culture is defined as an assembly of characteristics and attitudes in organizations and individuals which establishes that, as an overriding priority, safety issues receive the attention warranted by their significance. Researchers and practitioners have made efforts to assess the level of safety culture of a particular organization and then to enhance it. Though remarkable progress has been made, these efforts are still ongoing.
Era of Resilience
In this century, we have experienced more shocking events, such as the terrorist attack on the World Trade Center (WTC) in New York and the Great East Japan (Tohoku) Earthquake. These events clearly showed the vulnerability of our socio-technical systems in the face of unanticipated situations. In conventional engineering approaches, the design basis is determined beforehand from assumptions about severe conditions, and safety design is performed so that the system can meet the design basis. An event that exceeds the design basis may nevertheless happen, and the risk of such an event is characterized as residual risk. Since losses are unavoidable in such a case, we have to consider how quickly socio-technical systems can recover from the losses.
The conventional approaches have not sufficiently considered how to manage residual risks that fall outside the design basis of a complex socio-technical system. Having experienced natural disasters, accidents, economic crises, and so on, people are becoming skeptical about technological approaches to risk management. We now need a new framework for the safety of socio-technical systems that manages risks not only within but also beyond the design basis.
Against this background, the concept of “resilience” has lately attracted widespread interest among researchers and practitioners in systems safety [3, 4]. The term means the ability of a socio-technical system to adapt to disturbances from the environment and maintain its normal function. If we want to face up to unanticipated situations like the WTC attack and the Tohoku earthquake, we need to establish a new academic field, which we can call resilience engineering, to devise resilient socio-technical systems that can quickly recover their functions from damaged conditions.