The Global Assessment Trends Report (Kantrowitz, 2014) is an annual online survey. The 2014 survey was conducted in early 2014 and completed by 1,406 human resources (HR) professionals from companies headquartered throughout the world. In summarizing the trends, the author concludes:
Our findings indicate heightened interest in technology-based hiring tools and technology-enabled assessment, although their use is often characterised by inconsistent or inappropriate justification or processes, or without demonstrable job relevance. (Kantrowitz, 2014, p. 48)
The respondents indicated they were increasingly turning to social media as a hiring tool, though most HR professionals were unclear about the criticality or relevance of such information for hiring and few had formal processes in place to advise hiring managers on its use. At the time of the survey, interest in administering tests on mobile devices was modest, with some interest coming from candidates. However, mobile testing is a trend that has since seen significant increases in volume.
Historically, test producers and publishers worked on a national basis and developed tests for country-specific markets. Globalization has changed that. Not only are there increasing numbers of global or multinational organizations that use assessments, but there are providers of assessments that also operate on a multinational basis. Bartram (2000), in an early look at the impact of globalization on internet recruitment and selection, put forward the following scenario:
An Italian job applicant is assessed at a test centre in France using an English language test. The test was developed in Australia by an international test developer and publisher, but is running from an ISP located in Germany. The testing is being carried out for a Dutch-based subsidiary of a US multi-national. The position the person is applying for is as a manager in the Dutch company’s Tokyo office. The report on the test results, which are held on the multinational’s Intranet server in the US, are sent to the applicant’s potential line-manager in Japan having first been interpreted by the company’s out-sourced HR consultancy in Belgium. (Bartram, 2000, p. 272)
Bartram listed a number of questions that this scenario raises, including where the legal responsibility lies, who the ‘user’ of the test is, how the various countries’ standards and regulations regarding testing apply, and so on. In particular, one issue that has come to the fore since this was written is the question of what norms should be used in comparing this Italian job applicant with other applicants.
It used to be standard practice to base test norms on a country-wide sampling approach. Aggregating data across countries was thought inappropriate despite the fact that country boundaries are often relatively arbitrary and countries contain a complex amalgam of culturally and linguistically diverse groups. In many countries, this mix is becoming increasingly diverse as cross-border employment continues to expand alongside the growth of multinational companies.
Increasingly, organizations are using assessment in an international context and need to compare the results of people who have completed an assessment using different languages. The development of online testing has made this possible as administration can be centrally controlled and then globally distributed. This testing environment raises the question of whether the results from two candidates applying for the same position who have completed different language versions of the same instrument should be compared using a common (i.e., multilingual) norm or each person's 'country-based' language norms. Bartram (2008a) has set out guidelines for making this decision. Essentially, the answer lies in establishing what level of equivalence exists between scores from the two countries. If the evidence suggests that it is reasonable to conclude that a given raw score on a test represents the same level of the same trait in both countries, then international norms should be used. However, establishing equivalence is not simple. Clearly, there is no point in developing common norms if an instrument does not measure the same characteristic across all groups.
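The practical consequence of this choice can be made concrete with a small standardization sketch. The following Python fragment uses invented, purely illustrative norm samples for two hypothetical language groups ("IT" and "FR"); it shows how the same pair of raw scores can be ranked differently depending on whether each candidate is standardized against a country-based norm or against a pooled multilingual norm.

```python
import statistics

# Hypothetical norm-group raw scores for two language versions of a test
# (invented numbers, for illustration only).
norms = {
    "IT": [22, 25, 28, 30, 24, 26, 27, 23, 29, 26],  # Italian-language norm sample
    "FR": [18, 20, 23, 21, 19, 22, 24, 20, 21, 22],  # French-language norm sample
}

def z_score(raw, sample):
    """Standardize a raw score against a norm sample (mean 0, SD 1)."""
    return (raw - statistics.mean(sample)) / statistics.stdev(sample)

candidate_it, candidate_fr = 28, 23  # same vacancy, different language versions

# Country-based norms: each candidate is compared with their own language group.
z_it_local = z_score(candidate_it, norms["IT"])
z_fr_local = z_score(candidate_fr, norms["FR"])

# Common (multilingual) norm: pool both samples. This is defensible only if
# scalar equivalence between the two language versions has been established.
pooled = norms["IT"] + norms["FR"]
z_it_common = z_score(candidate_it, pooled)
z_fr_common = z_score(candidate_fr, pooled)
```

With these illustrative numbers the two norming strategies reverse the candidates' rank order: the FR candidate is the stronger relative to their own language group, while the IT candidate is the stronger against the pooled norm. If the mean difference between the two samples reflects non-equivalence (e.g., translation bias) rather than a real trait difference, the pooled comparison is misleading, which is why the equivalence evidence discussed below matters.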
If an instrument is administered in different cultures, it is necessary to check that test scores have the same psychological meaning in those cultures (Van de Vijver & Leung, 1997). An item or test is biased if scores do not have the same meaning across these groups (Poortinga, 1989). Differences in meaning can derive from three sources: from the constructs that are being measured; from the effects of the method being used to measure the constructs; or from issues arising from the content of specific items. If the construct being measured is not universal (i.e., does not have the same meaning across geographic and cultural groups), then the scores obtained from one group may indicate something different from those of other groups. Method biases can arise from different groups' susceptibilities to bias relating to response formats. For example, individuals from East Asian cultures tend to avoid extreme points on Likert response scales, whereas individuals from Central and Latin America are more likely to use the extremes (He & Van de Vijver, 2013; Hui & Triandis, 1989). Content-related bias is the most often observed source of non-equivalence and can arise from poor translation on the one hand or unnecessarily culture-specific content in the source material on the other.
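The response-style effect described above can be illustrated with a deliberately stylized simulation. The mapping function below is an invented caricature, not an empirical model of any culture's responding: it simply forces one group toward the Likert extremes and the other toward the midpoint, while both groups are given identical latent trait levels.

```python
import statistics

def respond(latent, extreme_style):
    """Map a latent trait level (on a 1-5 continuum) to a Likert response.

    A stylized, hypothetical response model: 'extreme' responders jump to
    the scale endpoints, while midpoint-oriented responders stay in 2-4.
    """
    if extreme_style:
        return 5 if latent >= 3 else 1
    return min(4, max(2, round(latent)))

# Identical latent trait levels in both (hypothetical) groups.
latent = [2.2, 2.8, 3.1, 3.6, 4.0, 2.6, 3.3, 3.9]

midpoint_group = [respond(x, extreme_style=False) for x in latent]
extreme_group = [respond(x, extreme_style=True) for x in latent]

mean_midpoint = statistics.mean(midpoint_group)  # 3.25
mean_extreme = statistics.mean(extreme_group)    # 3.5
```

The observed means differ (3.25 versus 3.5) even though the latent inputs are identical, which is exactly the kind of spurious group difference that method bias can produce: comparing raw scale means across the two groups would suggest a trait difference that does not exist.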
Equivalence, or freedom from bias, can be assessed both qualitatively and quantitatively. The most convincing evidence is provided by quantitative analyses. In the past decade or so, consensus has emerged on the need to consider three hierarchical levels of equivalence:
1. Construct equivalence: evidence that the instrument assesses the same underlying constructs in each group.
2. Metric or measurement unit equivalence: evidence that the instrument assesses the same underlying constructs using the same metric on the response scale.
3. Scalar or score equivalence: evidence that the instrument assesses the same underlying constructs using response scales whose measurement units and origin are the same for each group.
Scalar equivalence is the most difficult to establish, but it is necessary if raw scores are to be compared across groups. Whenever we put two or more people together and use them as a reference or norm group, we are assuming scalar equivalence: that the constructs we are measuring are the same for all the people in the group, and that any differences in raw scores reflect comparable differences in levels of the construct. As discussed earlier, aggregating people with varying demographics into a norm group has typically been carried out within rather than across countries, and typically without analysing the equivalence of the demographic subgroups within the norm group. The introduction of online testing and its easy deployment across national boundaries has spurred interest in equivalence between groups and raised the issue of when it is appropriate to combine people across countries to form international norm groups.
Establishing equivalence involves accumulating evidence. No single study 'proves' equivalence at all levels. Techniques include the study of bilinguals, differential item functioning (DIF) analyses and the use of multilevel designs to identify sources of between-group differences. Bartram (2013a, 2013b), for example, has shown how multilevel analysis can be used to examine scale score variance between countries to determine how much can be accounted for by independent country variables. Personality scale scores aggregated to country level are correlated with country measures of culture, quality of life and global competitiveness. To the extent that we can account for country variance in personality in terms of these other country measures, we can support claims for scalar equivalence. In short, if we can show that the difference between two groups is a difference that predicts other independently assessed variables, then the difference is real and not bias attributable to non-equivalence.
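The starting point of such a multilevel analysis, partitioning scale score variance into between-country and within-country components, can be sketched as follows. This is a descriptive decomposition with invented scores for three hypothetical country groups, not a fitted multilevel model; the intraclass correlation it computes is the quantity that country-level variables would then be used to explain.

```python
import statistics

# Hypothetical scale scores grouped by country (invented, illustrative data).
scores = {
    "UK": [5.1, 5.4, 4.9, 5.2, 5.0],
    "JP": [4.2, 4.5, 4.1, 4.4, 4.3],
    "BR": [5.8, 6.0, 5.6, 5.9, 5.7],
}

grand_mean = statistics.mean(s for group in scores.values() for s in group)

# Between-country variance: spread of country means around the grand mean.
between = statistics.mean(
    (statistics.mean(g) - grand_mean) ** 2 for g in scores.values()
)

# Within-country variance: average spread of individuals around their country mean.
within = statistics.mean(statistics.pvariance(g) for g in scores.values())

# Intraclass correlation: the share of total variance lying between countries --
# the component a multilevel analysis tries to account for with country-level
# measures of culture, quality of life, and so on.
icc = between / (between + within)
```

In this toy dataset most of the variance lies between countries; the substantive question the multilevel approach addresses is how much of that between-country component is predicted by independent country variables (supporting a real difference) versus left unexplained (leaving open bias from non-equivalence).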