Statistical Identification of Fraudulent Interviews in Surveys: Improving Interviewer Controls
Survey data are important for establishing new insights in many disciplines, such as sociology and economics. However, various sources of error, summarized in the Total Survey Error (TSE) framework, can undermine survey data, with interviewers being one important source (Groves 2005). Interviewers can deviate from their instructions and, in the most blatant case, falsify or manipulate the data. Although empirical evidence suggests that interviewer falsification is a rare event (Blasius and Friedrichs 2012), even small amounts of undetected falsification can lead to substantial bias in multivariate analyses (Schrapler and Wagner 2005). Therefore, falsification detection strategies such as random re-contacting procedures are crucially important for optimizing data quality. Statistical identification methods, which identify suspicious patterns in the data to reveal potential falsifiers, are used less often. Yet a number of such methods have been developed for detecting a wide variety of falsification types, such as duplicates and complete or partial falsification, in a cost-effective way. This chapter contributes to the literature by providing a broad overview of statistical methods for identifying interviewer falsification and by demonstrating promising statistical identification strategies using data from a large-scale refugee survey in Germany that includes confirmed falsifications.
Interviewer Falsification – An Overview
7.2.1 Forms of Falsification
The American Association for Public Opinion Research defines interviewer falsification as "the intentional departure from the designed interviewer guidelines or instructions, unreported by the interviewer which could result in the contamination of data" (AAPOR 2003: 1). There are multiple forms of falsification. The most blatant form, complete falsification, occurs when no interview is conducted and the interviewer instead provides fictitious data (Schreiner, Pennie, and Newbrough 1988). An attenuated form is the partial falsification of interviews, where interviewers conduct "short interviews," meaning that some parts of the questionnaire contain real data provided by the respondent while other parts contain fictitious data. Particularly long or difficult parts of the questionnaire are more prone to this type of falsification (Biemer and Stokes 1989). In addition to providing fictitious data, other forms of falsification include interviewers deviating from respondent selection rules and interviewing the wrong person (AAPOR 2003; Schreiner, Pennie, and Newbrough 1988), misclassifying persons or addresses as ineligible cases, deviating from the intended interview mode (Biemer and Stokes 1989; Schreiner, Pennie, and Newbrough 1988), or incorrectly entering answers to questions that trigger skip patterns and thus shortening the interview (AAPOR 2003; Kosyakova, Shopek, and Eckman 2015; Schnell 2012; Tourangeau, Kreuter, and Eckman 2012). In addition, duplicate response patterns that are highly unlikely to be attributed to respondents represent a special form of falsification because they can be caused by either the interviewer or other survey staff (Sarracino and Mikucka 2016; Slomczynski, Powalko, and Krauze 2017).
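To make the notion of duplicate response patterns concrete, the following minimal sketch groups interviews by their exact answer vector and reports every vector that occurs more than once. The function name `find_duplicate_patterns`, the record layout, and the example data are hypothetical and chosen only for illustration; they are not taken from the survey analyzed in this chapter. Exact matching is also only one possible criterion, and it is informative mainly when the compared answer vectors are long enough that identical patterns are implausible by chance.

```python
from collections import defaultdict


def find_duplicate_patterns(records):
    """Group interviews that share an identical answer vector.

    `records` is an iterable of (interview_id, interviewer_id, answers)
    tuples; this layout is a hypothetical one used for illustration.
    Returns a dict mapping each duplicated answer pattern to the list of
    (interview_id, interviewer_id) pairs that produced it.
    """
    groups = defaultdict(list)
    for interview_id, interviewer_id, answers in records:
        # Tuples are hashable, so the full answer vector can serve as a key.
        groups[tuple(answers)].append((interview_id, interviewer_id))
    # Keep only patterns that occur in more than one interview.
    return {pattern: ids for pattern, ids in groups.items() if len(ids) > 1}


# Made-up example data: three interviews share one answer pattern.
example_records = [
    ("i1", "A", [1, 3, 2, 5, 4, 1]),
    ("i2", "A", [1, 3, 2, 5, 4, 1]),
    ("i3", "B", [2, 3, 1, 4, 4, 2]),
    ("i4", "B", [1, 3, 2, 5, 4, 1]),
]

duplicates = find_duplicate_patterns(example_records)
for pattern, interviews in duplicates.items():
    print(pattern, interviews)
```

Because the output keeps the interviewer identifier next to each interview, duplicates that span several interviewers (as in the example) remain visible, which matters here since such patterns can originate from either the interviewer or other survey staff.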
Frequency of Falsification
Overall, the proportion of falsified interviews is low and falsifiers usually represent a minority of interviewers. For example, the percentage of detected falsifications (any of the above-mentioned types) in the US Current Population Survey was 0.4% per month (Schreiner, Pennie, and Newbrough 1988); in the New York City Housing Vacancy Survey the share was comparatively high at 6.5% (Schreiner, Pennie, and Newbrough 1988); and in the German Socio-Economic Panel it was between 0.1% and 2.1% (Schrapler and Wagner 2005). Although studies with high rates of interview falsification appear occasionally (Hyman et al. 1954; Turner et al. 2002), rates between 3% and 5% can be regarded as realistic (Biemer and Stokes 1989). Higher rates have been found in studies employing only a small interviewer staff (Bredl, Storfinger, and Menold 2013). In panel surveys, falsification rates are often lower and partial rather than complete falsification is more likely to occur (Blasius and Friedrichs 2012; Schrapler and Wagner 2005).
Reasons for Falsification
The motivation to falsify data primarily arises from conditions or situations that discourage interviewers from fulfilling their roles adequately. A difficult questionnaire, administrative factors related to the interviewer's employment, and other external factors such as the survey location typically discourage interviewers and thus increase the chance that they falsify data (Biemer and Stokes 1989; Crespi 1945; Winker 2016). These interview conditions can be divided into intrinsic and extrinsic factors (Gwartney 2013). Intrinsic factors are partly under the control of the researcher and include the sampling design (e.g. difficult selection rules), the survey instrument (e.g. poorly designed questionnaires, programming errors), the survey institute (e.g. high workloads, poorly communicated standards), and the respondent (e.g. difficult interviewees) (Gwartney 2013). Extrinsic factors include the interview location, the interviewer's personal situation, how interviewers are paid, and whether they know about quality control and monitoring procedures (Gwartney 2013; Koczela et al. 2015). Many of these factors can be taken into account when designing the survey in order to minimize interviewer burden and the overall likelihood of falsification (Crespi 1945; Biemer and Stokes 1989; Blasius and Friedrichs 2012). Nevertheless, the remaining risk of falsification calls for quality control procedures.
Effects of Falsification on Data Quality
Since falsifications are systematic deviations (Gwartney 2013), even small proportions of falsified data may introduce bias, resulting in misleading inferences (Schnell 1991, 2012; Schrapler and Wagner 2005). Descriptive statistics, such as means and variances, may be only slightly distorted: the bias in a mean equals the share of falsified records multiplied by the difference between the falsified and the true mean, so it stays small when that share is small unless falsified values differ strongly from real ones. Yet larger distortions are evident for multivariate statistics since falsifiers are unlikely to reproduce complex multidimensional relationships between variables (Reuband 1990; Schnell 1991; Schrapler and Wagner 2005).
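The contrast between descriptive and multivariate statistics can be illustrated with a small simulation. The sketch below rests on deliberately simple, hypothetical assumptions that are not taken from the cited studies: real answers follow y = x + noise, while a falsifier reproduces the marginal distribution of y (mean 0, variance 2) but not its relationship with x. The function names, the 20% falsification share, and the seed are illustrative choices; the share is set far above the rates reported in the literature only to make the effect easy to see.

```python
import math
import random
import statistics


def ols_slope(xs, ys):
    """Slope of the least-squares regression line of ys on xs."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, share=0.2, seed=1):
    """Compare the mean of y and the slope of y on x with and without falsification.

    Assumed (hypothetical) setup: real data follow y = x + e with
    x, e ~ N(0, 1); falsified y values are drawn from N(0, 2), i.e. with
    the correct marginal mean and variance but independent of x.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    true_ys = [x + rng.gauss(0, 1) for x in xs]

    # Replace the first n_fake answers with fabricated ones.
    n_fake = int(n * share)
    fake_ys = [rng.gauss(0, math.sqrt(2)) for _ in range(n_fake)]
    mixed_ys = fake_ys + true_ys[n_fake:]

    return {
        "mean_clean": statistics.fmean(true_ys),
        "mean_mixed": statistics.fmean(mixed_ys),
        "slope_clean": ols_slope(xs, true_ys),
        "slope_mixed": ols_slope(xs, mixed_ys),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the mean of y is essentially unaffected by the fabricated records, whereas the estimated slope shrinks from about 1 towards 1 minus the falsification share, because the fabricated cases contribute no covariance between x and y. Real falsifications need not follow this pattern, so the sketch should be read as an illustration of the mechanism rather than as an estimate of the bias in any particular survey.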