Data and methodology
Our empirical analysis draws on two distinct methods and data sets, which complement each other by offering different angles on the key questions of this chapter.
Statistical analysis of the LFS data set
We use the 2013-2018 Lithuanian LFS data to explore returnees' socio-demographic characteristics and employment patterns in comparison to those of non-migrants. As done in other research (e.g. Martin & Radu, 2012), we define return migrants as those born in Lithuania who resided abroad one year prior to the survey. Our sample includes 162 returnees over the age of 15 as well as 77,614 non-migrants.9 The dataset includes important socio-demographic information, including age, gender, marital status, education, employment status, and self-employment. This information helps us compare returnees with other residents in Lithuania and assess to what extent their employment patterns differ.
Using the LFS has notable limitations. The LFS lacks a question on whether the respondent has ever resided abroad, so we are limited to running our analysis only on recent returnees rather than those with any migration experience. Recent returnees might still be in the process of looking for work given that they returned within the last year, and so the analysis might underestimate the employment rates of all returnees. Furthermore, the sample of returnees is small, limiting our ability to run more complex models and to analyse the data from further angles.
We nevertheless proceed with the analysis of the LFS in order to check whether the characteristics of returnees have changed since earlier research. We also explore the limitations of using the LFS with the aim of understanding the potential of other tools, such as text mining, to address some of those shortcomings.
In order to assess how non-migrants and returnees differ in the labour market, we run a logistic regression with a dependent variable set to 1 if the respondent is working and 0 otherwise. Note that zero in this case represents both the inactive and the unemployed. The main explanatory variable in the second regression is the returnee variable, allowing us to assess whether returnees are more or less likely to be working compared to non-migrants. We also control for the socio-demographic variables mentioned earlier.
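The logic of this regression can be illustrated with a minimal sketch. It uses invented counts (not LFS data), a single returnee dummy without the socio-demographic controls, and a hand-written gradient-ascent fit; the function names `sigmoid` and `fit_logit` are hypothetical and do not describe the software actually used for the chapter.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(X, y, lr=0.5, n_iter=3000):
    """Fit a logistic regression by batch gradient ascent on the log-likelihood.

    X is a list of feature rows (without intercept), y a list of 0/1 outcomes.
    Returns the coefficients, intercept first.
    """
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    n, k = len(rows), len(rows[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        for row, yi in zip(rows, y):
            p = sigmoid(sum(b * x for b, x in zip(beta, row)))
            for j in range(k):
                grad[j] += (yi - p) * row[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Toy data (assumption, for illustration only):
# 100 non-migrants, 60 of them working; 20 returnees, 10 of them working.
X = [[0]] * 100 + [[1]] * 20
y = [1] * 60 + [0] * 40 + [1] * 10 + [0] * 10

beta = fit_logit(X, y)
# exp(coefficient) is the odds ratio of working for returnees vs non-migrants.
odds_ratio = math.exp(beta[1])
print(round(odds_ratio, 3))
```

In this toy setting the coefficient on the returnee dummy is negative, meaning the odds of working are lower for returnees; with real data, the sign and size would of course depend on the sample and on the controls included.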
We run four more logistic regressions to explore how working patterns differ among returnees and non-migrants. In these, we explore whether having a higher degree affects returnees’ employment. We then look at differences between the two groups in the use of a public employment office, in the likelihood of being a student, and in self-employment patterns.
Finally, to compare not only returnees' and non-migrants' likelihood of working, but also how they fare once they find jobs, we compare their salaries using an ordinary least-squares (OLS) regression. The dependent variable is the net salary, which indicates the monthly wage measured after taxes and coded in 15 categories, from 0 to 900+ euros. In addition to the socio-demographics outlined, we also control for occupation, which is coded in nine categories based on the ILO categorisation (ILO, 2012), where 1 indicates managers and 9 stands for elementary occupations.
Text mining of media articles on return migration
The key limitation of the LFS-based data is that it captures only the recent returnees, who lived in a foreign country a year ago. In order to reach a more diverse group of returnees, researchers in CEE countries have used surveys (relying mostly on quota-based convenience samples) and/or interviews. In this chapter we apply yet another method, which assumes that texts about the returnees in the major news portals are a good proxy for, firstly, understanding the status of the returnees in the labour market and society and, secondly, exploring to what extent the state or municipal institutions are associated with the success or otherwise of the returnees. The texts of interest include factual reports, interviews with returnees or their family members, comments by journalists, statements by public institutions, reflections by employers, etc.
We collected news media articles from the most popular Lithuanian web portals (delfi.lt, 15min.lt, vz.lt, lrt.lt, alfa.lt, lrytas.lt, balsas.lt). Data collection was carried out in two steps. First, we scraped all the articles from the portals that either had a tag related to emigrants or showed up when we submitted queries with the keyword ‘emigrant’ to the portal search engine. The result of this first step was a corpus of web portal texts related to emigrants. In the second step, we narrowed our search to focus on returnees only. We did so by keeping only the articles that contained the root forms of the verb ‘return’ (in Lithuanian: ‘grįž*’) and the noun ‘emigrant’ (in Lithuanian: ‘emigrant*’). The first step of the data collection yielded more than 3000 articles, and 1017 articles were left after the filtering in the second step.
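The second, filtering step could look roughly like the sketch below. It assumes that the two roots are matched at the start of a word and case-insensitively, which the text does not specify; the function name `mentions_returnees` and the three sample sentences are invented for illustration.

```python
import re

# Root patterns from the text: 'grįž*' (return) and 'emigrant*'.
# Matching at a word boundary and ignoring case are assumptions.
RETURN_ROOT = re.compile(r"\bgrįž\w*", re.IGNORECASE)
EMIGRANT_ROOT = re.compile(r"\bemigrant\w*", re.IGNORECASE)


def mentions_returnees(text):
    """Keep an article only if it contains both root forms."""
    return bool(RETURN_ROOT.search(text)) and bool(EMIGRANT_ROOT.search(text))


# Invented sample texts standing in for scraped articles.
articles = [
    "Emigrantai grįžta į Lietuvą.",  # both roots present: kept
    "Emigrantų skaičius auga.",      # only 'emigrant*': dropped
    "Jis grįžo namo.",               # only 'grįž*': dropped
]

kept = [a for a in articles if mentions_returnees(a)]
print(kept)
```

A root-based filter of this kind is simple and transparent, but it misses inflected or prefixed forms that do not begin with the given root.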
Afterwards, the selected texts were pre-processed for analysis. All the preprocessing procedures had the same core steps: lemmatising, lowercasing, removing numbers, punctuation, pronouns, and stop-words.
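A minimal sketch of such a pipeline is given below. The order of the steps, the tiny stop-word and pronoun lists, and the function names are assumptions made for illustration; in particular, `lemmatise` is a placeholder that returns the token unchanged, since a real Lithuanian lemmatiser would require an external tool that the text does not name.

```python
import re

# Tiny illustrative lists (assumption); real lists would be much longer.
STOP_WORDS = {"ir", "į", "bet", "kad"}
PRONOUNS = {"jis", "ji", "jie", "mes"}


def lemmatise(token):
    """Placeholder lemmatiser: returns the token unchanged."""
    return token


def preprocess(text):
    """Lowercase, strip numbers and punctuation, lemmatise, drop stop-words and pronouns."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation
    tokens = [lemmatise(t) for t in text.split()]
    return [t for t in tokens if t not in STOP_WORDS and t not in PRONOUNS]


tokens = preprocess("2018 metais jis grįžo į Lietuvą, bet darbo nerado!")
print(tokens)
```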
We then proceeded to perform quantitative and qualitative analysis of the collected texts. More specifically, we applied three methods: sentiment-coloured word-clouds, topic-modelling, and word-embeddings using the word2vec algorithm. Word clouds are a technique to visualise word frequencies in the text/corpus. A word-cloud algorithm produces a picture with the most common words in a text/corpus, where the font size of each word represents its frequency in text, with more common words being larger. For this method we performed additional pre-processing, keeping only those parts of speech which carry a sentiment value: verbs, adjectives, and adverbs. We then compiled dictionaries of positive and negative words that occur in our corpus and used them to colour the word-cloud figures.
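The data behind a sentiment-coloured word cloud can be sketched as follows. The English tokens, the two small dictionaries, the colour names, and the linear font-size rule are all assumptions standing in for the Lithuanian corpus and the actual plotting settings; the sketch computes only the word, frequency, size, and colour for each entry and does not draw the figure.

```python
from collections import Counter

# Illustrative sentiment dictionaries (assumption).
POSITIVE = {"good", "successful"}
NEGATIVE = {"hard", "bad"}


def cloud_entries(tokens, top_n=5, min_size=10, max_extra=30):
    """Return (word, frequency, font_size, colour) for the most common words."""
    counts = Counter(tokens)
    top = counts.most_common(top_n)
    if not top:
        return []
    max_freq = top[0][1]
    entries = []
    for word, freq in top:
        if word in POSITIVE:
            colour = "green"
        elif word in NEGATIVE:
            colour = "red"
        else:
            colour = "grey"
        # Font size grows linearly with frequency.
        size = round(min_size + max_extra * freq / max_freq)
        entries.append((word, freq, size, colour))
    return entries


toy_tokens = ["good"] * 3 + ["hard"] * 2 + ["return"]
entries = cloud_entries(toy_tokens)
print(entries)
```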
Topic modelling is a technique that allows us to summarise a corpus of documents and extract the most commonly occurring topics (collections of words). For this method, we added a further pre-processing step by including bi-grams (commonly occurring two-word phrases) in the analysis. For the analysis, we implemented a classical LDA topic modelling algorithm (Blei, Ng, & Jordan, 2003).
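The bi-gram step can be illustrated with a short sketch; fitting the LDA model itself is not shown. The sketch assumes that a bi-gram counts as commonly occurring when it appears at least `min_count` times across the corpus, which is an illustrative rule and not necessarily the criterion used in the chapter; the function name `add_bigrams` and the toy documents are likewise hypothetical.

```python
from collections import Counter


def add_bigrams(docs, min_count=2):
    """Append frequent adjacent word pairs to each tokenised document."""
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))  # count adjacent pairs
    frequent = {pair for pair, c in counts.items() if c >= min_count}
    augmented = []
    for doc in docs:
        merged = [f"{a}_{b}" for a, b in zip(doc, doc[1:]) if (a, b) in frequent]
        augmented.append(list(doc) + merged)
    return augmented


# Toy tokenised documents (assumption, for illustration only).
docs = [
    ["labour", "market", "return"],
    ["labour", "market", "job"],
    ["job", "return"],
]
augmented = add_bigrams(docs)
print(augmented)
```

The augmented token lists would then be converted to document-term counts and passed to a topic model, so that a phrase such as "labour_market" can appear in a topic as a single unit.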
Finally, word2vec is a word-embedding algorithm, which embeds words in a low-dimensional space. It does so by analysing text using a sliding-window approach and trying to predict each word in a window using the other words as input features. As a result, words that are used in a corpus in a similar way (interchangeably) are clustered closely together. The algorithm is most often used to detect the most similar and most different words in a text/corpus (Goldberg & Levy, 2014). All the data collection and analysis were done using the Python programming language.
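The sliding-window idea behind word2vec can be sketched as below. The sketch only builds the training examples (context words as inputs, the centre word as the prediction target) and does not train any embedding; the window size, the function name `window_examples`, and the toy sentence are assumptions made for illustration.

```python
def window_examples(tokens, window=1):
    """Build (context_words, target_word) examples with a sliding window."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # All words in the window except the target itself.
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


# Toy sentence (assumption, for illustration only).
sentence = ["emigrants", "return", "home", "find", "work"]
examples = window_examples(sentence, window=1)
for context, target in examples:
    print(context, "->", target)
```

Words that tend to appear with the same context words receive similar training signals, which is why they end up close to each other in the embedding space.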