Statistical methods - Bringing social and linguistic together

As statistical methods are constantly evolving, and there are many statistical guidebooks available for linguists, we will not delve into the finer details of statistical analysis here; rather, this section will introduce some considerations about the statistical method or methods that should be applied depending on the type of data a reader might be interested in analyzing. The three types of data discussed here include dichotomous or binary data, count data, and grouped data.

Binary data

The response or dependent variable for binary data has only two possible values, which can be coded as ‘1’ and ‘0’. An example is whether the past tense marking of a verb is present or absent in a past context. For binary data, logistic regression is a suitable statistical tool.

Table 3.2 Organization of binary data in an Excel spreadsheet (table body not recoverable; one surviving column heading: Lexical aspect)

In this study, Rbrul is used to run logistic regressions on the presence or absence of morphological marking in Colloquial Singapore English. Rbrul works in an R environment and combines the strengths of GoldVarb and R. It thereby allows users not only to connect to the wider community of quantitative linguists who use SPSS and R, but also to incorporate random effects into their statistical models. An example of a random effect is the ‘individual speaker’, as some individuals might favor or disfavor a particular linguistic form over and above what the social and linguistic predictors in the statistical model would predict. Incorporating random effects in the statistical model therefore helps to avoid overestimating the significance of predictors in a model (Johnson 2009).

There are several advantages to using Rbrul for logistic regressions. First, the interface for Rbrul is user-friendly and no knowledge of coding is required to use it. Second, Rbrul automatically selects the correct regression type depending on the type of response variable and on whether random effects are incorporated into the statistical model. Third, Rbrul is able to analyze unbalanced data, that is, data with an unequal number of observations across group combinations. For instance, a study may have more observations of plural marking from female participants than from male participants. Lastly, as with R itself, logistic regression in Rbrul allows the inclusion of many social or linguistic predictor variables, which can be either continuous or discrete, as well as the incorporation of random effects.
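
Whether a data set is unbalanced can be checked with a simple cross-tabulation before any model is fitted. The following is a minimal sketch in base R; the data frame `obs`, its column names, and all of its values are invented for illustration and do not come from the study.

```r
# Hypothetical observations, invented for illustration only:
# 7 plural tokens from female speakers and 3 from male speakers.
obs <- data.frame(
  Sex    = c(rep("F", 7), rep("M", 3)),
  Marked = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)
)

# Cross-tabulate the groups; unequal row totals mean the data are unbalanced.
cell_counts <- table(Sex = obs$Sex, Marked = obs$Marked)
print(cell_counts)
print(rowSums(cell_counts))
```

Unequal row totals such as these are not a problem for regression-based methods, which is the point made above.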

As Rbrul works in an R environment, you first need to download R and RStudio before you can use Rbrul. Rbrul can be downloaded at the following link: Moreover, guides to using Rbrul can also be found at this website. As Rbrul is easy to use, the only point that needs mention is how the data must be organized before they can be uploaded and analyzed by Rbrul. The binary data must be organized in tabular form in an Excel spreadsheet (see Table 3.2) and then saved as a .csv (comma-separated values) file.
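
Rbrul itself is menu-driven, so the following is not Rbrul code. It is a minimal base-R sketch of the kind of layout described above, with one row per token and one column per variable, saved as a .csv file and then analyzed with an ordinary fixed-effects logistic regression using `glm()`. The column names, the category labels, and every value are assumptions invented for illustration, and the model deliberately omits the random effect for speaker that Rbrul can include.

```r
# Hypothetical token-level data (all values invented for illustration).
# Each row is one verb in a past context; Marked is 1 if past tense
# marking is present and 0 if it is absent.
binary_data <- data.frame(
  Speaker       = rep(c("S1", "S2", "S3", "S4"), each = 6),
  Sex           = rep(c("F", "M"), each = 12),
  LexicalAspect = rep(c("stative", "punctual"), times = 12),
  Marked        = c(1, 0, 1, 1, 0, 1,  0, 1, 1, 0, 1, 1,
                    0, 0, 1, 0, 0, 1,  1, 0, 0, 1, 0, 0)
)

# Save as a .csv file (a temporary file here) and read it back in.
csv_path <- tempfile(fileext = ".csv")
write.csv(binary_data, csv_path, row.names = FALSE)
reloaded <- read.csv(csv_path, header = TRUE)

# Fixed-effects logistic regression in base R (no random effect for Speaker).
fit_binary <- glm(Marked ~ Sex + LexicalAspect,
                  data = reloaded, family = binomial)
print(summary(fit_binary)$coefficients)
```

The coefficients are on the log-odds scale; a negative value for a level means that level disfavors marking relative to the reference level.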

Count data

The dependent variable for count data must be a non-negative integer, that is, either zero or a positive whole number. An example of count data is the number of tokens of colloquial got that a participant produces in a single sociolinguistic interview session.
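
Count data of this kind are usually derived from a list of individual tokens. The sketch below, in base R, tallies tokens per speaker and keeps speakers who produced no tokens at all, so that zero counts are not silently lost. The speaker labels and the token list are hypothetical.

```r
# Hypothetical list of all interviewed speakers.
speakers <- c("S1", "S2", "S3", "S4")

# Hypothetical token list: one row per occurrence of the feature.
# Declaring Speaker as a factor with all levels keeps zero-count speakers.
got_tokens <- data.frame(
  Speaker = factor(c("S1", "S1", "S3", "S1", "S4", "S3"), levels = speakers)
)

# Tally tokens per speaker; S2 produced none and gets a count of 0.
per_speaker <- as.data.frame(table(Speaker = got_tokens$Speaker),
                             responseName = "Tokens")
print(per_speaker)
```

The resulting table has one row per speaker and a `Tokens` column that can serve as the dependent variable in a count model.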

Poisson regressions can be used to analyze count data such as the number of tokens of a linguistic feature a participant produces in a sociolinguistic interview. There are several advantages to using Poisson regressions for count data. First, like Rbrul, Poisson regressions can handle unbalanced data. Second, Poisson regressions allow the inclusion of a variety of social or linguistic predictor variables, which can be either continuous or discrete. Lastly, Poisson regressions allow the incorporation of random effects.

In order to run Poisson regressions in R as shown here, the ‘brms’ package has to be installed first. Additionally, just as with Rbrul, the count data must be organized in tabular form as in Table 3.2 and saved as a .csv file. Poisson regressions can be applied in R using the following code:

(2) MyData <- read.csv(file = "C:/Users/ABC/Desktop/Filename.csv", header = TRUE, sep = ",")

Example (2) is the code that instructs R to read the .csv file and create a data frame named ‘MyData’ based on the information in the file. R code or functions can then be applied to the data frame for statistical analysis.

(3) library(brms)

Example (3) is the code that instructs R to load the ‘brms’ package. This package contains R code that allows R to run Poisson regressions.

  • (4a) M1 <- brm(Tokens ~ Ethnicity + Dominance + (1 | Speaker), data = MyData, family = poisson())
  • (4b) M1 <- brm(Tokens ~ Ethnicity + Attitude * Dominance, data = MyData, family = poisson())

Examples (4a) and (4b) are code that allow the user to analyze count data using Poisson regressions. The code in Example (4a) shows an additive model named ‘M1’ that has two predictor variables, ‘ethnicity’ and ‘English language dominance’. It also has ‘speaker’ as a random effect. The code in (4b), on the other hand, shows a model, also named ‘M1’, that includes an interaction term indicated by ‘*’. An interaction term means that the interaction between ‘attitude toward English’ and ‘English language dominance’ is included in the statistical model, in addition to the separate effect of each of the two predictors.
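
What the ‘*’ operator adds to a model can be inspected without fitting anything. The following sketch uses base R's `terms()` to list the terms implied by the two formulas; the random-effect term `(1 | Speaker)` is left out here because it is mixed-model syntax interpreted by packages such as brms rather than by base R's formula expansion.

```r
# The fixed-effects parts of the formulas in (4a) and (4b).
f_additive    <- Tokens ~ Ethnicity + Dominance
f_interaction <- Tokens ~ Ethnicity + Attitude * Dominance

# List the terms each formula expands to.
additive_terms    <- attr(terms(f_additive), "term.labels")
interaction_terms <- attr(terms(f_interaction), "term.labels")

print(additive_terms)
print(interaction_terms)  # includes "Attitude:Dominance"
```

The second formula expands to the two main effects of attitude and dominance plus their product term `Attitude:Dominance`, alongside ethnicity.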

  • (5a) summary(M1, waic = TRUE)
  • (5b) plot(marginal_effects(M1, probs = c(0.05, 0.95)))

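Examples (5a) and (5b) are code that enable the user to examine the results of Poisson regressions fitted by code similar to that in (4a) and (4b). The code in Example (5a) gives the user a summary of the fitted model. This summary lists the parameter estimates, their estimation errors, and the 95% intervals for the predictors in the statistical model. Because brms fits Bayesian models, these are credible intervals rather than frequentist confidence intervals. The code in Example (5b) gives the user a graphical representation of the estimated effect of each predictor together with its uncertainty interval. Note that ‘probs = c(0.05, 0.95)’ requests the 5th and 95th percentiles, which bound a 90% interval; a 95% interval corresponds to ‘probs = c(0.025, 0.975)’. In more recent versions of brms, ‘marginal_effects’ is deprecated in favor of ‘conditional_effects’.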
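
The models in (4a) and (4b) require brms and a working Stan installation. As a rough cross-check, a fixed-effects Poisson regression can also be sketched with base R's `glm()`. This is a swapped-in frequentist technique, not the brms model described above: it has no random effect for speaker, and the data frame, the ethnicity labels ‘A’ and ‘B’, and all values are invented for illustration.

```r
# Hypothetical per-speaker counts (all values invented for illustration).
count_data <- data.frame(
  Speaker   = paste0("S", 1:8),
  Ethnicity = rep(c("A", "B"), each = 4),
  Dominance = c(1, 2, 3, 4, 1, 2, 3, 4),
  Tokens    = c(2, 3, 5, 8, 1, 1, 2, 4)
)

# Fixed-effects Poisson regression with a log link (no random effects).
fit_count <- glm(Tokens ~ Ethnicity + Dominance,
                 data = count_data, family = poisson)

# Coefficients are on the log scale; exponentiating gives rate ratios.
rate_ratios <- exp(coef(fit_count))
print(rate_ratios)

# Wald-type 95% confidence intervals, also on the rate-ratio scale.
print(exp(confint.default(fit_count)))
```

A rate ratio above 1 means the predictor is associated with more tokens, and a ratio below 1 with fewer.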

Grouped data

The last type of data a researcher might be interested in analyzing is grouped data. 2x2 Chi-squared tests are useful for researchers who want to find out whether two groups differ quantitatively in their use of certain linguistic features, for example, whether the frequency of the discourse particle lor differs between male and female speakers.

However, there are several limitations to 2x2 Chi-squared tests. First, they can only be used to analyze count data that are divided into categories. Second, the statistical analysis is limited to the categories that are examined, as there is no way to include additional social or linguistic predictor variables in the analysis. Third, there is no way to include random effects in the statistical analysis.

2x2 Chi-squared tests can be computed in R with the following code:

(6) table <- matrix(c(8, 2, 292, 122), nrow = 2, ncol = 2, byrow = TRUE)

Example (6) is the code that instructs R to create a two by two table named ‘table’ that has the values of 8 and 2 in the first row, and the values of 292 and 122 in the second row.

  • (7a) chisq.test(table)
  • (7b) chisq.test(table, correct = FALSE)

Examples (7a) and (7b) are code that allow the user to analyze grouped data like the table in (6) with the use of 2x2 Chi-squared tests. The code in (7a) allows the user to apply a Pearson’s 2x2 Chi-squared test with Yates’ continuity correction to the data, whereas the code in (7b) allows the user to apply a Pearson’s 2x2 Chi-squared test without Yates’ continuity correction to the data.
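
With counts like those in (6), it is worth inspecting the expected cell counts, because the chi-squared approximation becomes unreliable when they are small, and `chisq.test()` typically issues a warning in that case. The sketch below reuses the four values from (6) under a different object name, `counts`, to avoid masking R's built-in `table()` function. It also runs Fisher's exact test, which is a common alternative for small expected counts and is an addition here rather than part of the procedure described above.

```r
# Same values as in (6), stored under a name that does not mask table().
counts <- matrix(c(8, 2, 292, 122), nrow = 2, ncol = 2, byrow = TRUE)

# Pearson's test without continuity correction; the warning about small
# expected counts is suppressed because the counts are inspected directly.
res <- suppressWarnings(chisq.test(counts, correct = FALSE))

# Expected counts under independence; the smallest is 10 * 124 / 424,
# roughly 2.92, which is below the conventional threshold of 5.
print(res$expected)
smallest_expected <- min(res$expected)

# Fisher's exact test does not rely on the chi-squared approximation.
fisher_p <- fisher.test(counts)$p.value
print(fisher_p)
```

When the smallest expected count falls below 5, as it does here, the exact test is usually the safer choice for reporting.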


1 The final output of the speaker is also dependent on other factors like recency, for example, the interlocutor using the word shirt in a question like Which shirt do you like? in a previous conversational turn.


Gass, Susan M., and Jennifer Behney (eds.). 2018. Salience in second language acquisition. New York, NY: Routledge.

Grosjean, François. 2010. Bilingual: Life and reality. Cambridge, MA: Harvard University Press.

Hartsuiker, Robert J., Martin J. Pickering, and Eline Veltkamp. 2004. Is syntax separate or shared between languages? Cross-linguistic syntactic priming in Spanish-English bilinguals. Psychological Science 15. 409-414.

Housen, Alex, and Hannelore Simoens. 2016. Introduction: Cognitive perspectives on difficulty and complexity in L2 acquisition. Studies in Second Language Acquisition 38(2). 163-175.

Johnson, Daniel Ezra. 2009. Getting off the GoldVarb standard: Introducing Rbrul for mixed-effects variable rule analysis. Language and Linguistics Compass 3(1). 359-383.

Lasagabaster, David, and Angel Huguet. 2007. Multilingualism in European bilingual contexts: Language use and attitudes. Clevedon, England: Multilingual Matters.

Montrul, Silvina. 2016. Dominance and proficiency in second language acquisition and bilingualism. In Carmen Silva-Corvalan and Jeanine Treffers-Daller (eds.), Measuring dominance in bilingualism, 15-35. Cambridge: Cambridge University Press.

Schilling, Natalie. 2013. Sociolinguistic fieldwork. Cambridge; New York, NY: Cambridge University Press.

Silva-Corvalan, Carmen, and Jeanine Treffers-Daller (eds.). 2016. Language dominance in bilinguals: Issues in measurement and operationalization. Cambridge: Cambridge University Press.

Traugott, Elizabeth Closs, and Graeme Trousdale. 2013. Constructionalization and constructional changes. Oxford: Oxford University Press.

Travis, Catherine E., Evan Kidd, and Rena Torres Cacoullos. 2017. Cross-language priming: A view from bilingual speech. Bilingualism: Language and Cognition 20(2). 283-298.

Wasserscheidt, Philipp. 2015. Constructions do not cross languages: On cross-linguistic generalizations of constructions. Constructions and Frames 6(2). 305-337.

Weinreich, Uriel. 1953. Languages in contact: Findings and problems. New York, NY: Linguistic Circle of New York.
