What We Know from Assessing Astronaut Applicants
Psychological or psychiatric screening for an illness that might jeopardize a spaceflight mission is considered an obvious performance criterion and has been part of astronaut and cosmonaut selection since the space agencies' earliest days (Steimle & Norberg, 2013). Validation and standardization of selection practices have been discussed in Industrial-Organizational Psychology journals since that time. However, astronaut applicant psychiatric evaluations were long perceived as purely medical judgments and were conducted primarily by physicians who used varying styles and methods. In 1989, as psychologists became more regularly involved in astronaut selection, a method of providing well-grounded, standardized evaluations for psychiatrically qualifying or disqualifying astronaut applicants was initiated. Standardized selection methods lead to more consistent, higher-quality selection decisions because they require all candidates to be evaluated against the same criteria, using the same “measuring stick.” In addition, use of a standardized process helps to prevent perceived or actual discriminatory hiring practices and lays the foundation for continuous improvement of the process. During early and short-duration space missions (SDMs) (e.g., Space Shuttle), selection at NASA heavily emphasized the identification of psychopathology. Initially, it was most sensible and practical to focus on leveraging science to “select out” applicants who were likely to do harm in, or be harmed by, an extreme environment. Also, because SDMs lasted less than 3 weeks, astronauts free of psychopathology could reasonably be expected to handle the presenting stressors and challenges whether or not they had skills indicating that they would genuinely thrive during spaceflight. NASA’s focus expanded to include long-duration missions with participation in Mir (the Russian space station) and the ISS in the 1990s.
The unique demands of long-duration missions (>2 weeks and up to 3 years for future Mars missions) placed a greater emphasis on the qualities and skills that allow astronauts to safely and eagerly adapt to living and working in space for long periods of time (currently, 10-12 months on ISS at most) (Collins, 2003). The additional skills needed for successful adaptation to long-duration missions have led BHP to update the tools and procedures used for psychological screening of astronaut applicants several times since 1998 (Fiedler & Carpenter, 2005; Galarza & Holland, 1999a, 1999b; Landon, Vessey, & Barrett, 2016). In addition to screening for psychological and psychiatric indicators of illness that might jeopardize missions, BHP also collects information regarding the clinical indicators of suitability for living and working in spaceflight conditions for months at a time.
The modern process involves a battery of appropriate psychometric tests, typically chosen by a group of qualified psychiatrists and clinically focused psychologists, and administered to a small group of qualified applicants. Then, semi-structured interviews (standardized across professionals engaged in the psychological and psychiatric screening of astronaut candidates) are conducted with an even smaller group of highly qualified applicants. Additional assessment activities (e.g., teamwork reaction exercises, work samples) have also been used during the last four selection cycles. Data are compiled and analyzed after each cycle to check the effectiveness and efficiency of the selection process, as much as the accumulated data, operational constraints, and accepted statistical practices of the day allow (Landon et al., 2016). All these assessment methods have evolved somewhat since their initial inclusion to reflect changes in the science of selection and changes in the astronaut job demands and job context (cf. completion of Space Shuttle flights and the longer-duration missions of the ISS).
Continual evolution of both the assessment process and its elements is an absolute must to accommodate new spaceflight mission concepts and preparations. For example, until June of 2008, the ISS program used a one-to-one backup scheme for training ISS crews. It required that a fully dedicated backup crewmember train alongside each prime crewmember. After the launch of the prime crewmember, the backup crewmember was then inserted into a prime training flow, and the whole training sequence (i.e., from assignment as a backup to flight as a prime crewmember) typically required 4-5 years for inexperienced crewmembers. This schedule provided opportunities for crewmembers to interact with one another, build relationships, and develop effective teamwork habits together. Unfortunately, that training schedule was too long, complicated, expensive, and exhausting to be practical. In 2007, the Expedition Training Requirements Integration Panel of the International Training Control Board outlined the single-flow-to-launch (SFTL) concept that, along with other economies, helped reduce the 5 years or more typically needed to train inexperienced crewmembers to 2 years or less. However, these gains in efficiency required astronauts to prepare for training much more extensively (e.g., network with instructors beforehand, build the appropriate level of proficiency in the training language and culture, complete pre-assignment work and review training materials independent of instruction prior to re-assignment or new mission assignment, etc.). Most importantly, SFTL also reduced the opportunity for crewmembers to train together as a team, and this limited crews’ ability to establish shared mental models and interpersonal experiences prior to flight. In fact, the members of a modern six-person ISS crew may not even meet one another prior to their mission.
Although the training flow for the ISS spans 2.5 years, each astronaut or cosmonaut largely trains alone, traveling primarily between the US and Russia (Steimle & Norberg, 2013), and also to Canada (for Canadarm), Europe (for European module training), and JAXA (for the Dextre robot on the end of the Kibo lab). This extensive travel, demanding learning schedule, and inability to bond as a team create a more acute need to select astronauts who learn quickly, demonstrate great resilience to constant travel, and exhibit exceptional teamwork knowledge and behavior. Because of current training efficiencies, rarely do all six ISS crewmembers train together (usually only for a total of 8-12 hours of emergency evacuation simulations), and even more rarely have any of the six lived together prior to launch. This means that it is now more important than ever to select astronauts who live well with strangers and in multi-cultural contexts. Since training provides less opportunity to test and foster effective teamwork or group living behaviors, changes like these have raised and will continue to raise new questions regarding how to select candidates already well suited to teamwork and cross-cultural group living.
Poor selection decisions for any job can result in significant costs related to errors made by weak performers, the need to find suitable replacements, and getting new incumbents up to speed. Typically, organizations invest in assessing psychological characteristics that cannot be developed through training but instead develop over long periods of time or even a lifetime (e.g., personality traits, cognitive ability), and these are often measured through tests. One important advantage of using tests is that individual applicants are treated consistently. Using standardized tests or assessments ensures that the same information is gathered on each individual and used in a similar way, and this helps to ensure the consistency and quality of selection decisions (Zedeck, 2011). Many psychological factors have significant face validity for work in extreme environments, so it is relatively easy to argue for the application of tests and assessment in psychological selection for these environments (Bell, 2007; Ones, Dilchert, Viswesvaran, & Judge, 2007). However, standardized tests are associated with several legal and practical concerns worth reviewing here.
As with any other method of making or informing employment decisions, tests can be legally and ethically scrutinized if there is a belief that unfair discrimination has occurred. Adverse impact exists when the selection rate of a given demographic group (e.g., females vs. males, whites vs. blacks, etc.) is substantially lower than the selection rate of the majority group. Any selection procedure may show score differences that result in exclusionary effects upon a group, but some types (e.g., physical ability tests, cognitive ability tests) are more likely to do so. However, these tests often accurately predict job performance and other outcomes of interest, and they can significantly contribute to effective selection decisions. Before using a test, it is important to anticipate whether adverse impact might occur and to consider ways to minimize any exclusionary effects while preserving the ability to make valid inferences based on test scores. If an incidence of adverse impact does occur, it is important to demonstrate that the inferences made based on test scores are appropriate and that the constructs tested are bona fide job qualifications, especially when selecting for jobs as socially revered and desired as the astronaut role. For these jobs, we know that multi-method job analyses are especially critical in helping determine and document bona fide job qualifications and in identifying fair and accurate ways to assess performance (i.e., for future use in validation research). The multi-method concept of job analysis (e.g., literature reviews, interviews, surveys, observations, cognitive task analysis, sociometric methods) is especially important in building sound psychological selection systems for extreme environments.
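The comparison of selection rates described above is commonly operationalized with the EEOC "four-fifths" guideline: if one group's selection rate falls below 80% of the highest group's rate, that is treated as prima facie evidence of adverse impact. A minimal sketch, using entirely hypothetical applicant counts:

```python
def adverse_impact_ratio(focal_selected, focal_applicants,
                         reference_selected, reference_applicants):
    """Ratio of the focal group's selection rate to the reference group's.

    Under the EEOC "four-fifths" guideline, a ratio below 0.80 is
    treated as prima facie evidence of adverse impact.
    """
    focal_rate = focal_selected / focal_applicants
    reference_rate = reference_selected / reference_applicants
    return focal_rate / reference_rate


# Hypothetical numbers for illustration only:
# 6 of 120 applicants selected in one group vs. 20 of 200 in another.
ratio = adverse_impact_ratio(6, 120, 20, 200)
print(round(ratio, 2))  # 0.5 -> well below 0.80, flags potential adverse impact
```

Note that the four-fifths rule is a screening heuristic, not a verdict; a flagged ratio triggers exactly the kind of validity and job-relatedness evidence the passage above describes.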
This is because there is no other way to collect enough evidence to make reasonable decisions, given the first two of our three primary challenges: (1) small sample sizes of current and past incumbents, and (2) highly variable contexts for the exact same job titles.
Also, given our third challenge of continuously evolving performance criteria, we know that the only way to ensure effective selection systems is to conduct multi-method job analysis studies regularly. For this reason, competency modeling is another tool from the Industrial-Organizational Psychology tool box that makes good, practical sense. Effective selection processes are based upon the qualities and skills required to meet the organization’s expectation for competent performance (e.g., a competency model). Many organizations use competency frameworks to select individuals (such as IBM, GE, Verizon, Waste Management, Hanover, Shell, 3M, the United States Office of Personnel Management) (Rodriguez, Patel, Bright, Gregory, & Gowing, 2002; Schmidt, 2008a, 2008b). There is both spaceflight- and ground-based job analysis evidence suggesting that teamwork, communication, leadership, and related competencies help predict individual and team performance and safety across many jobs that involve key elements of the astronaut role.
Several efforts have been made to identify factors that are important for selecting individual crewmembers for long-duration spaceflight (Barrett, Holland, & Vessey, 2015; Caldwell, 2005; Galarza & Holland, 1999; Manzey, Schiewe, & Fassbender, 1995; McGrath, Arrow, & Berdahl, 2000; Nicholas & Foushee, 1990; Rose, Fogg, Helmreich, & McFadden, 1994; Vinograd, 1974). There have been and still are recurring attempts to use content-, construct-, and criterion-related validation approaches to link assessment tools to the competencies needed for short- and long-duration missions and ultimately to astronaut performance criteria (Musson, Sandal, & Helmreich, 2004; Rose et al., 1994; Santy et al., 1993).
Selection research within spaceflight is severely limited by the lack of job performance data available to researchers. This lack of performance data is due, in part, to the fact that there is such a limited number of astronauts actually selected (around 340 US astronauts over the life of the program) and that there is so much evolution in the job (from Mercury to ISS). Quantifying different levels of performance (optimal versus adequate versus inadequate) is unrealistic with such small sample sizes. Even when performance data are available, there is rarely much observable variance in performance (likely because incumbents have been so highly selected and trained by that point) and this makes criterion-related validation untenable. As a result, space agencies have heavily relied upon content validation methods when making changes to selection processes.
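The small-sample problem described above can be made concrete with a Fisher z confidence interval for a validity coefficient. The sample sizes below are hypothetical, chosen only to contrast an astronaut-sized cohort with a typical ground-based validation sample:

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """95% confidence interval for a Pearson correlation via Fisher's z."""
    z = math.atanh(r)            # Fisher z-transform of the observed r
    se = 1 / math.sqrt(n - 3)    # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale


# A modest observed validity of r = .30 in a cohort of 20 (hypothetical
# astronaut-sized sample) versus a conventional sample of 300.
lo_small, hi_small = fisher_ci(0.30, 20)
lo_large, hi_large = fisher_ci(0.30, 300)
print(f"n=20:  [{lo_small:.2f}, {hi_small:.2f}]")   # interval spans zero
print(f"n=300: [{lo_large:.2f}, {hi_large:.2f}]")   # interval excludes zero
```

With n = 20, the interval is so wide that it includes zero, so the same observed correlation that would be statistically defensible in a ground-based study is uninterpretable in an astronaut-sized sample; this is one reason criterion-related validation is untenable for space agencies.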
These issues are relevant for all space agencies, as they all suffer from a lack of performance data and small sample sizes. For example, the Russians have long collected personality data on cosmonauts (Kanas & Manzey, 2008), but the empirical linking of personality factors to specific performance levels necessary to provide cut-scores or norms for selection has still eluded Russian researchers. One exception to this lack of reporting on empirical selection data was a study published in relation to the European Space Agency’s 2008/2009 selection cycle (Maschke, Oubaid, & Pecena, 2011). Maschke et al. (2011) documented some criterion-related validity evidence on at least one personality test used by the European Space Agency. Considering the critical need for additional methods of ensuring the quality of selection tools for the high-stakes astronaut position, Landon and colleagues (Landon et al., 2016; Landon, Rokholt, Slack, & Pecena, 2017) urge space agencies to make increased quality and quantity of astronaut performance data a priority. This will allow for more rigorous evaluation of current selection tools and their alternatives.
The current dearth of performance data makes conducting criterion-related validation particularly troublesome for space agencies. Messick (1995, 1998) proposed that validity can only be established once a preponderance of evidence has been collected (1) indicating that the test content is relevant to the construct; (2) there is theoretical rationale behind the test scores; (3) the test is scored in a manner that corresponds to the construct’s structure; (4) the generalizability of the test is known; (5) test scores are related to similar constructs and not dissimilar constructs; and (6) the consequences of using test scores are well understood. At this point, many space agencies attempt to collect evidence in all six categories, but the scarcity of present and past job incumbents, unique contexts, and lack of performance data preclude them from establishing theoretical rationale behind specific test scores or adequately determining the consequences of using test scores. We do know that astronauts, and many who work in extreme environments, self-select and are so well screened before their initial psychological testing that they typically demonstrate little variability across most constructs. Range restriction is problematic and is only likely to remain so as we reach for the stars in completely new ways and engage in longer-duration missions (which will likely require far more international collaboration to pull off than prior explorations).
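The range restriction problem can be illustrated with a small simulation: even when a predictor genuinely correlates with performance in the full applicant pool, the correlation observed among only the top-scoring finalists is sharply attenuated. The simulation below is a sketch under assumed parameters (a true correlation of .50 and a top-10% cut), not a model of any real selection cycle:

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation computed from scratch (stdlib only)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


random.seed(42)
n, true_r = 10_000, 0.5

# Simulate a predictor score and a criterion correlated at ~.50 in the
# full (unrestricted) applicant pool.
scores = [random.gauss(0, 1) for _ in range(n)]
perf = [true_r * s + math.sqrt(1 - true_r ** 2) * random.gauss(0, 1)
        for s in scores]
r_full = pearson(scores, perf)

# Keep only the top 10% of scorers -- loosely analogous to the heavily
# pre-screened applicants who reach psychological testing.
cutoff = sorted(scores)[int(0.9 * n)]
kept = [(s, p) for s, p in zip(scores, perf) if s >= cutoff]
r_restricted = pearson([s for s, _ in kept], [p for _, p in kept])

print(round(r_full, 2), round(r_restricted, 2))  # restricted r is much smaller
```

The restricted correlation falls well below the true value, which is why a valid selection tool can appear nearly useless when validated only on incumbents who have already survived several screens.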