Challenges for the Evaluation of Labor Market Policy
The evaluation of labor market issues faces the same procedural challenges as other evaluations. But the organization and institutional conditions of the labor market give rise to a number of aspects that must be considered when assessing the effects of labor market measures and programs. These aspects are methodological, but also spatial and political. In addition, challenges of information and complexity must be considered. The impact of labor market policy is often assessed with the conventional approach, i.e., by measuring the number of participants who find employment after the program or measure, or by measuring earnings effects for participants. This approach neglects the wider policy impacts on the labor market as a whole. Issues to be addressed in this context are creaming effects; deadweight effects or windfall profits; substitution effects; unintended side effects, for example through crowding out or through market distortions that arise when certain industries are supported with subsidized employees and thereby gain an advantage over companies that do not benefit from subsidies; and, last but not least, opportunity costs (Schmid 1997). Opportunity costs are often neglected because determining whether the money spent would have been more effective in a different project, i.e., whether a different measure would have had a bigger impact (Hujer et al. 2000), is often neither politically desirable nor statistically feasible: the potential impact of measures that could have been conducted but were decided against cannot be quantified.
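The gap between the conventional participant count and the wider labor market impact can be illustrated with simple arithmetic. The following is a minimal sketch, not a method taken from the cited literature: the class name, the field names and all figures are hypothetical, and it assumes that deadweight hires and substituted hires do not overlap.

```python
from dataclasses import dataclass


@dataclass
class SubsidyOutcome:
    """Hypothetical results of a wage-subsidy program (illustrative only)."""
    gross_hires: int        # participants hired with the subsidy
    deadweight_hires: int   # of these, would have been hired anyway
    substituted_hires: int  # of these, hired instead of non-participants


def net_employment_effect(outcome: SubsidyOutcome) -> int:
    """Gross hires minus deadweight and substitution.

    Assumes the two deductions are disjoint groups of hires.
    """
    return (outcome.gross_hires
            - outcome.deadweight_hires
            - outcome.substituted_hires)


# Made-up figures: the conventional approach would report 1000 successes.
example = SubsidyOutcome(gross_hires=1000,
                         deadweight_hires=300,
                         substituted_hires=450)
print(net_employment_effect(example))  # 250
```

With these assumed figures, only a quarter of the reported placements would be additional employment. In practice the deadweight and substitution shares are not observed directly and have to be estimated.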
Another methodological challenge is the analysis of causality. A typical example in the labor market discussion is the analysis of well-being in connection with unemployment: does unemployment make people unhappy? Or do unhappy people tend to become unemployed because of their unhappiness (Oesch, Lipps 2011)? The same question arises for the impact of measures. To offer reliable results to policy makers, it is necessary to determine whether a program or measure was responsible for the results measured. This is made difficult by the fact that there is no monocausal explanation for unemployment, and usually no monocausal explanation for the transition from unemployment into employment (Weinberg 1999). The counterfactual problem, i.e., the fact that one individual cannot be both participant and non-participant in a measure at the same time, can usually be addressed by using a control or comparison group (Blaschke, Plath 2000). Organizing programs and measures as experiments implies choosing participants and non-participants at random from a group of potential participants. Experiments are the simplest option for an evaluation with a control group, because the random choice of participants and control group prevents systematic differences between the two groups. The difficulty with experiments, though, is that they are often politically infeasible, because the random assignment of participation in measures can hardly be communicated to voters (Eichhorst, Zimmermann 2007). In addition, the results of single or isolated experiments are not necessarily transferable to other regions or situations. Depending on the size of the experiment, the specific situation in different regions, the point in time (e.g., the stage of the economic cycle) or the group of participants, the results may not be transferable.
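Why random assignment prevents systematic differences between the groups can be illustrated with a small simulation. This is a sketch under stated assumptions, not an evaluation design from the literature cited here: the unobserved trait "motivation", the probabilities and the assumed true effect of 0.10 are all hypothetical.

```python
import random

TRUE_EFFECT = 0.10  # assumed increase in employment probability (hypothetical)


def estimated_effect(n: int, randomized: bool, seed: int = 0) -> float:
    """Difference in employment rates between participants and non-participants."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        motivation = rng.random()  # unobserved trait in [0, 1)
        if randomized:
            participates = rng.random() < 0.5         # coin flip
        else:
            participates = rng.random() < motivation  # motivated people self-select
        # Employment depends on motivation and, for participants, on the program.
        p_employed = 0.2 + 0.5 * motivation + (TRUE_EFFECT if participates else 0.0)
        employed = 1 if rng.random() < p_employed else 0
        (treated if participates else control).append(employed)
    return sum(treated) / len(treated) - sum(control) / len(control)


experiment = estimated_effect(100_000, randomized=True)
self_selected = estimated_effect(100_000, randomized=False)
print(f"randomized:    {experiment:.3f}")
print(f"self-selected: {self_selected:.3f}")
```

Under these assumptions the randomized comparison should land close to the assumed effect, whereas the self-selected comparison overstates it, because motivated individuals are both more likely to participate and more likely to find work regardless of the program.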
An alternative to experiments is the creation of artificial control groups by means of matching methods. Choosing a control group that is as similar to the participants as possible is a more feasible option, but has the disadvantage that the artificial control group corresponds to the participants only in some observable characteristics. Potentially relevant explanatory characteristics may thus be overlooked, which leads to unreliable results. Evaluations based on either method, experiments or matching, are also often complicated by a small number of cases. This makes it harder to draw reliable conclusions as to the effectiveness of a measure (Weinkopf 2002). If neither matching nor an experimental design can be applied, then the efficacy of the measure in terms of differences in outcome between participation and non-participation cannot be evaluated (Eichhorst, Zimmermann 2007). All methods of labor market evaluation also face a further problem: much time was and is spent on finding the perfect estimators and methods of measurement, but too little time is spent on considering the quality of the data on which the estimators and measures are based. This neglects the fact that even an almost perfect econometric model cannot compensate for inadequate and unreliable data (Smith 2000). It is therefore necessary to define the data needed to measure impacts, effectiveness and efficacy, and to collect these data in the necessary quality.
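The idea of an artificial control group can be sketched as nearest-neighbor matching on observable characteristics. The example below is a simplified illustration, not the procedure used in any of the cited evaluations: the covariates (age, months unemployed), the outcome (monthly earnings) and all records are made up, and the unscaled squared distance is a simplifying assumption.

```python
# Each record: ((age, months_unemployed), monthly_earnings_after_program)
participants = [((25, 6), 1800), ((40, 12), 1500), ((55, 24), 1200)]
comparison_pool = [((26, 5), 1700), ((41, 14), 1450),
                   ((50, 20), 1000), ((30, 2), 2100)]


def squared_distance(x, y):
    """Squared Euclidean distance on the observed covariates (unscaled)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def matched_effect(treated, pool):
    """Average outcome gap between each participant and the most similar
    non-participant (matching with replacement)."""
    gaps = []
    for covariates, outcome in treated:
        _, matched_outcome = min(
            pool, key=lambda rec: squared_distance(covariates, rec[0]))
        gaps.append(outcome - matched_outcome)
    return sum(gaps) / len(gaps)


effect = matched_effect(participants, comparison_pool)
print(round(effect, 2))  # about 116.67 for the made-up records above
```

The sketch balances the groups only on the two listed covariates. Any characteristic that is not recorded, such as motivation, remains unbalanced, which is the weakness of matching described above; with only three participants it also shows how little a small number of cases can support.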
A second aspect of labor market policy that complicates the evaluation of measures is the number of stakeholders and their specific interests. Because financiers, project workers and “clients”, i.e., the intended beneficiaries of the measures, as well as supervisors and scientists all have an explicit stake in the success of a program, their information and impressions must be considered in an evaluation. But as every stakeholder's interest in a program differs from that of the other stakeholders (a financier may be interested in an optimal rate of transition into regular employment or in low costs, instructors may be interested in being considered successful in order to secure their employment by securing future students, etc.), their own interests and the possible bias arising from their priorities must be taken into account in the evaluation. The next aspect to be considered is the complexity not only of the measures, but also of the evaluator groups. In the first case, if the measures are organized as a network of measures, an evaluation is often complicated by the sheer size of the network. In the case of EQUAL, for example, a community initiative that supported innovative, transnational projects aimed at reducing discrimination and disadvantages in the labor market (European Commission EQUAL 2008), 129 partnerships were included, and each partnership had an evaluator. In addition, there was an evaluation to oversee the individual evaluators. The third assessment level in this project network was an evaluation at the European level that included the results of the national evaluators. At every level, evaluators not only evaluated the evaluators on the lower levels, but also conducted their own evaluation of the projects (Heister 2008).
This is without doubt an extreme example, but it is not rare for evaluators from different institutions to have to work together on a very complex evaluation. This was the case with the Hartz evaluations in Germany, where different research institutions worked together on a joint report, and where the outcomes depended not only on their expertise, but also on their cooperation. Evaluations in which a number of evaluators work together usually realize what can be called the joint venture between science and practice in the evaluation of labor market policy. Often, the methodological specialization of one evaluator and the experience of another complement one another and increase the reliability of the results. This is often necessary in the case of labor market policy, as the system is complex, and measures and target groups are very heterogeneous (Brinkmann, Wiebner 2002).
The third challenge that evaluators of labor market measures face is time. Many programs are developed and funded for only a certain period, and their continuation depends on the results of an evaluation. But the results of measures often become visible only some time after their completion (Bouis et al. 2012), and long-term impacts cannot be assessed until at least a few years after the conclusion of a measure. In the case of the European Social Fund (ESF), for example, where funding periods usually last six years, the allocation of funds is partly based on the results of programs in the preceding funding period. Evaluations must therefore often be conducted before the program is completed or before at least medium-term conclusions can be drawn. Summative evaluations therefore often come too late to influence decisions and are academic research rather than evaluations with an impact on decisions (Heister 2008). A similar schedule applied to the evaluation of the Hartz reforms in Germany. The substantial labor market reform, named after one of its drafters, was to be thoroughly evaluated; an evaluation was therefore commissioned in November 2002 and was expected to be completed four years later. An interim report was due in 2005 (Heyer 2006). This was a very extensive evaluation, but the time frame shows that it was impossible to draw reliable long-term conclusions from its results. The Four Laws for Modern Services in the Labor Market, as the four parts of the Hartz reform are called, were passed in December 2002 (First and Second Law) and December 2003 (Third and Fourth Law). The First and Second Law came into effect in January 2003 (BGBl. I 2002). The Third Law came into effect in January 2004, and the Fourth Law in January 2005 (BGBl. I 2003a, 2003b). The laws include an obligation to scientifically evaluate all measures of the activating labor market policy.
In compliance with this obligation, a comprehensive database was developed that also takes into account the heterogeneity of the participating groups (Caliendo, Steiner 2005). In short, the laws came into effect in 2003, 2004 and 2005, with an interim report due in 2005 and the final report in 2006. Considering that measures do not have their full impact right away, that there are usually teething problems in the implementation, and that data collection and the writing of a substantial report take time, an evaluation under such time pressure faces almost insurmountable challenges. The Hartz evaluations were also a combined evaluation project. It was recognized that the prerequisites for large evaluation projects under considerable time pressure are
- (1) institutions with the relevant know-how,
- (2) the readiness for cooperation between the institutions, as the commission for such large projects cannot be fulfilled by one institution alone,
- (3) the willingness to be part of a systematically administered research process, and
- (4) the availability of a sufficient amount of data and datasets that are as up to date as possible.
The evaluation project was therefore assigned to almost 20 research institutes, in which about 100 scientists worked on the report (Heyer 2006).
The time frames are often politically motivated. It has been discussed in subsection 3.3.4 that politicians have their own interests, and that their foremost interest is often their reelection. If the next election is close, then either positive results are needed by the incumbents, or negative information is hoped for by the opposition. In both cases, information has to be made available as quickly as possible, which in many cases prevents a long-term evaluation. In addition, the results of evaluations are not always used by politicians in their decisions, especially if they are not the clients. Scientific research that was not prompted by policy-makers is often not directly useful or usable for politics. Moreover, evidence research needs independence, time and qualified scientists. As qualified scientists are usually identified by the number of journal articles they publish, gaining a reputation as a qualified scientist takes time as well. Also, data are not always available in the quality and quantity necessary for a thorough analysis, and in some cases it is doubtful whether clear causalities can be detected at all. All of these aspects complicate political consulting, and they are reasons for a development in German labor market policy that Zimmermann (2014) considers a shift from evidence-based policy to a politics-oriented approach. Scientific evidence means proving something by using statistically sound results. Policy-oriented evidence making, i.e., research that was ordered by policy-makers, is not in itself untrustworthy or biased; it is necessary, though, to carefully examine possible influence by the sponsor(s) and the compliance of the research with the rules of good scientific practice.
-  Deadweight effects or windfall profits occur, e.g., when subsidized persons are hired who would also have been hired without the subsidies.
-  Substitution effects occur when participants are hired instead of non-participants while the total number of employees does not change, thus simply causing redistribution, not an increase in employment.
-  Crowding out can occur, for example, by using taxes that companies may have put to better use.
-  EQUAL’s main topics were increasing employability, encouraging inclusive entrepreneurship, facilitating adaptability, promoting gender equality, integrating asylum seekers (European Commission EQUAL 2008).