Analyzing phonetic data with generalized additive mixed models

Yu-Ying Chuang, Janice Fon, Ioannis Papakyritsis, and Harald Baayen

Introduction: A clinical example

To investigate the realizations of lexical stress by Greek speakers with dysarthria, Papakyritsis and Muller (2014) recorded and measured the productions of Greek words by five native speakers of Greek. Three speakers were diagnosed with dysarthria, the remaining two were controls. Each speaker completed a word repetition task, reading a total of 27 minimal pairs. These minimal pairs consisted of bisyllabic words that differed only in stress position, e.g. /'milo/ “apple" and /mi'lo/ “talk.” To quantify stress realizations, vowel duration and intensity of the two syllables were measured for each word. It was hypothesized that speakers with dysarthria have difficulty producing stress contrasts, resulting in smaller differences between stressed and unstressed syllables compared to control speakers.

In the dataset cl in', stress position is specified in the column stress, with “SI" and “SF” indicating initial and final stress, respectively. In the column group, dysarthric and control participants are coded with “D" and “C.” Vowel durations of the first and second syllables are provided in durl and dur2. Hence, for words with initial stress (i.e. SI), durl indicates the vowel duration of the stressed syllable, whereas for words with final stress (i.e. SF), it is dur2 that refers to the vowel duration of the stressed syllable. With regards to intensity, inti and int2 are the mean intensities of the first and second vowels. Since we are interested in the degree of contrast between stressed and unstressed syllables, following Papakyritsis and Muller (2014), we calculated duration and intensity ratios. That is, for a given word, we divided the vowel dur- ation/intensity of the stressed syllable by the vowel duration/intensity of the unstressed syllable2. This gives us the response variables durR and intR, which are also available in the dataset.

presents the boxplots of duration and intensity ratios for control and dysarthric subjects when producing words with initial and final stresses. Larger ratios stand for greater

Figure 10.1 presents the boxplots of duration and intensity ratios for control and dysarthric subjects when producing words with initial and final stresses. Larger ratios stand for greater

Boxplots of duration (left) and intensity (right) ratios for the four stress by group conditions

Figure 10.1 Boxplots of duration (left) and intensity (right) ratios for the four stress by group conditions.

contrasts between stressed and unstressed syllables. For duration (left panel), control participants show a larger stress contrast than dysarthric participants, more obviously so in stress-initial than stress-final words. The same holds for intensity (right panel), where between- group differences are also more pronounced in stress-initial words than stress-final words.

To evaluate the quantitative patterns in the data statistically, we fitted linear mixed models to durR (log-transformed) and intR separately, using the lme4 package in R. For both models, we included group and stress and their interaction as fixed-effect factors, as well as by-subject random intercepts. Model summaries are presented in Tables 10.1 and 10.2. For mathematical reasons, the lme4 package does not provide p values, but absolute values of the t statistics larger than 2 are good indicators of significance.

For duration ratios (Table 10.1), only the difference in stress position is supported. On the other hand, intensity ratios (Table 10.2) differ not only between stress positions, but also between participant groups, with dysarthric participants having a smaller intensity contrast for stress than control participants. Furthermore, the interaction effect indicates that for the

Tabel 10.1 Model summary for the effects of participant group and stress position, w ith duration ratio as the response variable


Std. Error

t value

















Tube! 10.2 Model summary for the effects of participant group and stress position, with intensity ratio as the response variable


Std. Error

t value

















dysarthric participants, the effect of stress position is reduced compared to that for the control participants, in line with the pattern shown in the right panel of Figure 10.1.

Although these analyses already captured the major trends in the data, there are further details that require attention. For example, while duration and intensity ratios clearly differ in the means between stress positions and participant groups, there are also differences in the variances. For duration ratios, stress-final words tend to have larger variances than stress- initial words, and intensity ratios appear to vary more for dysarthric participants than control participants. Canonical analysis of variance or the linear mixed model as implemented in lme4 cannot take such differences in variances into account. As will become clearer later, modeling the variance along with the mean is possible with generalized additive models, henceforth GAMs, which are the topic of this chapter.

GAMs can also help us refine the current analyses if we would have further information available. For instance, it would be useful to know the recording order of the words for each participant. This is because subjects’ responses across the trials of an experiment are usually not independent of each other, which violates one of the core assumptions of the standard Gaussian regression model. This issue of inter-trial dependency can however be easily addressed with GAMs. Moreover, GAMs can model time-varying response variables. Take intensity for example. In the current dataset, we considered the mean intensity of the vowels. However, intensity actually changes over time, and instead of calculating the mean, we would have gained precision if we could have modeled how intensity changes over time. Given that dysarthric participants clearly show different mean intensities compared to control participants, it seems likely that there would also be qualitative differences between the intensity curves over time for the two participant groups. As these curves are usually nonlinear, analyses on the curves cannot be done with methods such as canonical anovas or linear models, which operate on the assumption of linearity.

As an example of a response variable that varies nonlinearly over time, consider the Mandarin pitch contour in Figure 10.2. This pitch contour realizes both lexical tones and an incredulous question intonation. The black solid line in the figure was obtained with the help of a GAM.

Pitch contour (grey dots) of the Mandarin question la shi Li- Mei ma° “Is she LiMei?”. Dotted lines indicate syllable boundaries and the black solid line is the smooth curve obtained from GAM

Figure 10.2 Pitch contour (grey dots) of the Mandarin question la1 shi4 Li- Meis ma° “Is she LiMei?”. Dotted lines indicate syllable boundaries and the black solid line is the smooth curve obtained from GAM.

predicting pitch as a function of time. Accordingly, the dataset mand consists of the response variable pitch and the corresponding time points, time.

pitch time

  • 1 329.2815 0.16
  • 2 329.7480 0.17
  • 3 334.2271 0.18
  • 4 340.9061 0.19

In R, to fit a GAM model, we use the gam function from the mgcv package (Wood, 2017). For the formula, we first specify the response variable, which is followed by the ~ symbol (which can be read as “modeled as a function of’), and that in turn is followed by the predictors. By applying the function s () to the predictor, we request the GAM to model pitch as a nonlinear smooth function of time. The simple model that we fit for this example is as follows.

mand.gam = gam(pitch ~ s(time), data = mand)

What the s () does is to build up wiggly curves (known as splines) from a set of simple wiggly curves called basis functions. By carefully assigning appropriate weights to basis functions, a wiggly trend in the data can be approximated as a weighted sum of basis functions.

The remainder of this chapter provides a tutorial on how to analyze phonetic data with GAMs. We present analyses for three example datasets, all of which address aspects of the phonetics of Taiwan Mandarin. Although these are not clinical data, the issues that arise in the analyses are similar to those that one might encounter when analyzing clinical data. For each analysis, we discuss the relevance of the methods that we illustrate for the analysis of clinical data.

The first dataset that we discuss contains response time data from an auditory priming experiment. With this dataset, we illustrate how to take into account subject-specific inter-trial dependencies, which can also be found in clinical experiments. For the second dataset, we look into the realizations of Mandarin high-level tones by different speaker groups in Taiwan. Like intensity, pitch contours vary' with time. We will show how to work with numerical covariates, and how to take into account potential interactions with factorial predictors. The third dataset addresses the ongoing merging of two sets of sibilants in Taiwan Mandarin. As will be seen, GAMs can model response variables that vary' not only with one single-dimensional predictor (e.g. time), but also with predictors of two or more dimensions (e.g. geographical coordinates of longitude and latitude, locations of sensors on the tongue in electromagnetic articulography, locations of electrodes in ERP studies, etc.).

Along with the analyses, we will introduce the basic concepts of GAMs. As the purpose of this tutorial is to illustrate how to use GAMs, we therefore focus primarily on application. It is also important to note that the approach that we take for the analyses presented here is exploratory, rather than confirmatory. One reason for this is that this allows us to introduce increasingly more intricate aspects of a GAM model step by step. The other reason is that the complexity of the data is such that it is difficult to predict in advance how exactly predictors interact and jointly shape a regression line or regression surface. Due to multiple testing inherent in exploratory data analyses, we will accept an effect as potentially replicable when its associate p value is less than 0.0001.

Example 1: Inter-trial dependencies

Individual variability is usually a major source of variance for behavioral data. This is true not only for non-clinical participants, but also for clinical ones. In the course of an experiment, subject responses are often differentially affected by factors such as familiarity, fatigue, or attention drift. Thus, in addition to effects induced by experimental manipulation, in the data there are also response patterns that are tied to individual subjects. These patterns are often nonlinear in time. By-subject random intercepts and slopes in the linear mixed model (LMM). therefore, are limited in this regard. The LMM can allow the intercept and slope of regression lines for individual subjects to vary, but it cannot handle the case in which functional relations are not linear but instead wiggly.

Sometimes the effects of central predictors of interest are also nonlinear. For example, lexical properties of the stimuli such as word frequency tend to have nonlinear effects on response times in lexicality decision tasks (Baayen et al., 2016). In clinical settings, it is also likely that some numeric predictors turn out to affect response variables in a nonlinear way. In longitudinal or cross-sectional studies where one traces the development or progress of clinical participants, age of participants, or the length of treatment received, can potentially have nonlinear effects on their performances. With GAMs, it is possible to discover whether nonlinear patterns are present in the data.

In the following example, we will be looking at response time data collected from an auditory priming experiment, conducted by Chuang (2017). The central question of this study is whether Taiwan Mandarin speakers recognize words with standard and variant pronunciations differently. Specifically, in Taiwan Mandarin, retroflex sounds (e.g. /g/) undergo partial or complete deretroflexion, merging into their dental counterparts (e.g. /s/). Thus, while the standard pronunciation of the word book is Igu], the variant pronunciation of this word is [su|. The stimuli of this experiment were 30 retroflex-initial bisyllabic words which were produced either with retroflex or with deretroflexed pronunciations, coded as “standard” and “variant” respectively in the column targetType of the dataset ret. There were three priming conditions: the same word with retroflex pronunciation, the same word with deretroflexed pronunciation and a completely different word without any retroflex or dental sounds. In the primeType column, these conditions are referred to as “standard,” “variant,” and “control.” To avoid the effect of speaker priming, primes and targets were produced by two different speakers. The voice column of ret specifies the speaker of the target word, “A” and “B.” If the speaker of the target word is A, it also means that the speaker of the prime word is В and vice versa. Participants were instructed to perform auditory lexical decision on target words, and their response times (in millisecond) were recorded. In the ret dataset, only correct responses are included.

subject primeType targetType voice RT trial word targetP

  • 1 SI standard variant A 1062 1 shibai 0.3597897
  • 2 SI standard standard A 834 2 shanliang 0.4546521
  • 3 SI control variant A 1472 10 zhekou 0.3102893

We first excluded datapoints with RTs that are shorter than 300 ms and longer than 2500 ms, and log-transformed the raw RTs to ensure that model residuals better approximate normality. We used treatment coding, with “standard” as reference level for both targetType and primeType. For the baseline model, we fitted a GAM with targetType, primeType and voice as fixed-effect predictors. To take subject or item variability into account, we also included by-subject and by-word random intercepts. In GAMs, this is done by adding bs = "re" in the smooth term s (), where "re" stands for random effect. For this to work, it is essential that the pertinent random effect column in the dataset is encoded as a factor in R.

ret$subject = as.factor(ret$subject) ret$word = as.factor(ret$word)

ret.gamO = gam(logRT ~ targetType + primeType + voice +

s(subject, bs = "re") + s(word, bs = "re"), data = ret)

A model summary is obtained as follows: summary(ret.gamO)

The summary of model ret. gamO contains two parts.

Parametric coefficients:

Estimate Std. Error t value Pr(>|t|) (Intercept) 6.822828 0.021024 324.527 <2e-16

targetTypevariant 0.075274 0.005840 12.888 <2e-16

primeTypecontrol 0.130702 0.007142 18.300 <2e-16

primeTypevariant -0.012028 0.007131 -1.687 0.0918

voiceB 0.077066 0.005830 13.219 <2e-16

Approximate significance of smooth terms: edf Ref.df F p-value

s(subject) 45.94 47 42.990 < 2e-16

s(word) 22.60 29 3.609 2.22e-14

R-sq.(adj) = 0.516 Deviance explained = 52.9%

GCV = 0.023946 Scale est. = 0.023305 n = 2746

The first part of the summary reports parametric coefficients, just as in the summary of a linear regression model. According to this part of the summary, there is good evidence that RTs are longer if target words are produced with variant pronunciation (targetTypevariant). In addition, RTs are also longer when presented with a control prime (primeTypecontrol), because little priming can be induced by a totally different word. RTs however become shorter when prime words are produced with variant pronunciation (primeTypevariant), but the difference is not well supported. Finally, responses are slower for target words produced by speaker В (voiceB). This is due to the overall slower speaking rate of speaker B.

The second part of the summary reports statistics for the smooth terms. For this model, it is useful to have by-subject and by-word random intercepts, given that the p values for both effects are very small. The GCV score (0.0239) is an indication of the goodness of model fit. The lower the score, the better the model fit.

As a next step, we improve the model by taking further details of subject variability into account. Figure 10.3 displays RTs as a function of trial number in the experiment, for each subject separately. We can see that in addition to differences in intercepts, how RTs develop in the course of the experiment also differs substantially across subjects. S48, for example, becomes faster as the experiment proceeds. By contrast, S19 has the tendency to become slower. Staying for now with the LMM. we include by-subject random slopes for trial. To avoid that trial dominates prediction accuracy because of its larger scale as compared to other predictors, we first scaled it and created a new numeric variable trialScaled. The second smooth term in the formula of this model requests the by-subject random slopes for trial.

ret$trialScaled = as.numeric(scale(ret$trial, center = TRUE, scale = TRUE))

ret.gamOa = gam(logRT ~ targetType + primeType + voice +

s(subject, bs = "re") + s(subject, trialScaled, bs = "re") + s(word, bs = "re"), data = ret)

The GCV score of this model (0.0221) is lower than that of the previous model, indicating improved model fit. However, Figure 10.3 also shows that RT development is notably nonlinear. For instance, S39 exhibits a convex pattern, whereas a concave pattern is found for S6. We therefore need factor smooths if we want our model to be more precise. A factor smooth fits a wiggly curve for each individual subject, and provides the nonlinear equivalent to the combination of random intercepts and random slopes in the linear mixed model. In model

RTs across trials for each subject

Figure 10.3 RTs across trials for each subject.

By-subject factor smooths from ret. gaml

Figure 10.4 By-subject factor smooths from ret. gaml.

ret. garni, a factor smooth was requested by specifying bs = "fs". The directive m = 1 enables shrinkage for the basis function for the linear tilted line. Thus, if there is no evidence of different linear slopes for the individual subjects, the coefficients of the linear basis function can be shrunk down all the way to zero. In this case, subject variability only involves intercepts, and the factor smooth has become functionally equivalent to a random effect with by-subject random intercepts.

ret.garni = gam(logRT ~ targetType + primeType + voice +

s(trialScaled, subject, bs = "fs", m = 1) + s(word, bs = "re"), data = ret)

Figure 10.4 presents the fitted smooths for the subjects, all of which are clearly nonlinear across trials. The lower GCV score (0.0215) of this model indicates that the inclusion of the factor smooth has improved model fit.

Next we asked whether there are interactions among the factorial predictors. Of specific interest is whether RTs to variant target words are disparately influenced by different primes and different voices. We therefore built our next model including an interaction between targetType and primeType, and in addition an interaction between targetType and voice.

ret.gam2 = gam(logRT ~ targetType * primeType +

targetType * voice +

s(trialScaled, subject, bs = "fs", m = 1) + s(word, bs="re"), data = ret)

Parametric coefficients:

Estimate Std. Error t value Pr(>|t|) (Intercept) 6.838974 0.020115 339.990 < 2e-16

targetTypevariant 0.054803 0.010727 5.109 3.49e-07

primeTypecontrol 0.127271 0.009180 13.864 < 2e-16

primeTypevariant -0.030184 0.009121 -3.309 0.000948

voiceB 0.061979 0.008382 7.394 1.92e-13


primeTypecontrol 0.005855 0.013186 0.444 0.657071


primeTypevariant 0.037840 0.013173 2.873 0.004104

targetTypevariant:voiceB 0.015086 0.010781 1.399 0.161859

Model summary suggests that there is some hint of evidence that an upward adjustment of the intercept is needed when both the target and prime words are presented with variant pronunciations. The interaction between targetType and voice, on the other hand, does not seem to be necessary. Given that the GCV score (0.02149) remains practically the same as that of the model without interactions (ret .garni), we should stay wary about the interpretation of these interaction effects.

In the next model we investigated the effect of a numeric predictor, targetP. This measure attempts to estimate the probability that a retroflex pronunciation activates the intended semantic field. When retroflexes are deretroflexed (e.g. /su/—>[su|), it is inevitable that more lexical candidates will emerge, because now in addition to words beginning with /gu/, words beginning with /su/ will be compatible with the acoustic input as well. Thus, the probability of obtaining the correct semantics is usually smaller for variant pronunciations than for standard pronunciations.

To explore whether the effect of targetP might be nonlinear, we included it in a smooth term. Since the interaction between targetType and voice was not significant, we left it out in the model shown here.

ret.gam3 = gam(logRT ~ targetType * primeType + voice +

s(targetP) +

s(trialScaled, subject, bs = "fs", m = 1) + s(word, bs = "re"), data = ret)

Approximate significance of smooth terms:

edf Ref.df F p-value

s(targetP) 5.078 5.816 2.745 0.0183

s(trialScaled,subject) 197.611 431.000 6.797 < 2e-16

s(word) 21.736 29.000 3.581 2.53e-16

According to model summary, the effect of targetP is nonlinear, because the edf, denoting effective degrees of freedom, is larger than one. Before explaining edf, let’s first visualize the effect. This can be done by using the plot function, as shown below. The directive select = 1 requests the first effect in the summary table for smooth terms to be plotted. To zoom in on

The effect of targetP from ret. gam3

Figure 10.5 The effect of targetP from ret. gam3.

the effect, we set the range of the у-axis between -0.1 and 0.1. In addition, we added shading to the area of the confidence interval, and finally a horizontal line at zero.

plot(ret.gam3, select = 1, ylim = c(-0.1, 0.1), shade = TRUE) abline(h=0)

As can be seen in Figure 10.5, the effect is indeed nonlinear, with a tendency of RTs to become shorter as the probability of obtaining the intended meaning increases. Note that the plot function visualizes the partial effect of targetP, i.e. the population intercept and the effects of other predictors are not included. This is why the curve is centered around the line у = 0.

Returning to the discussion of edf, we have mentioned previously that wiggly curves are modeled by assigning different weights to a set of basis functions. Usually approximating a more wiggly curve requires a larger number of basis functions. Unrestricted use of basis functions, however, may lead to overfitting. In fact, in such cases, models become unnecessarily complex. Under the assumption that simple models are to be preferred over complex ones, GAMs incorporate a mechanism that penalizes the weights of basis functions. Penalization may take the weight of a basis function completely to zero. The edf is the sum of the proportions of the unpenalized weights that are retained in the penalized weights. Since a larger edf typically implies more basis functions are used, its magnitude is a rough indicator for the degree of wiggliness of the smooth.

Given that the support for the effect of targetP is only minimal (p = 0.018), we need to be cautious and refrain ourselves from jumping into conclusions. We therefore went on to further examine whether targetP interacts with other factors. In the next model, we added the directive

by = voice to the smooth term of s (targetP). This requests two separate smooths, one for each voice.

ret.gam4 = gam(logRT ~ targetType * primeType + voice +

s(targetP, by = voice) + s(trialScaled, subject, bs = "fs", m = 1) + s(word, bs = "re"), data = ret)

edf Ref.df F p-value

s(targetP):voiceA 4.327 5.100 1.404 0.22028

s(targetP):voiceB 7.545 8.313 3.010 0.00212

The model summary suggests that, apparently, the effect of targetP is only present for voice B, but not for voice A. This explains why the effect of targetP is not well supported when both voices are considered together in model ret .gam3. The visualization of the two effects is presented in Figure 10.6. In the left panel, the confidence interval always includes zero, indicating that there is no significant effect anywhere. By contrast, in the right panel, there is only a small portion of the curve where the confidence interval includes zero, suggesting that the effect is better supported.

Although now the downward trend for voice В is clearer, the curve is still uninterpretably wiggly. In fact, the high degree of wiggliness is likely to be a technical artifact. The values of targetP are sparsely distributed, as can be seen from the rugs on the x-axis. Given that there are only 30 target words, and that each word comes with two pronunciations (standard v.v. variant), there are at most 60 unique values for targetP. When there are many datapoints with the same targetP value, as is the case in the present experiment in which 48 subjects responded to each target token, the GAM is presented with substantial evidence about where the mean should be for each stimulus. As a consequence, it does its best to bring the predictions as close as possible to the observed values. This then leads to overfitting. To remedy this, we brought down the

The effects of targetP for voice A (left) and voice В (right) from ret. gam4

Figure 10.6 The effects of targetP for voice A (left) and voice В (right) from ret. gam4.

number of basis functions used to fit the curves. The default number of basis functions is ten. In the following model, we set this number to four by specifying к = 4.

ret.gam5 = gam(logRT ~ targetType * primeType + voice +

s(targetP, by = voice, к = 4) + s(trialScaled, subject, bs = "fs", m = 1) + s(word, bs = "re"), data = ret)

edf Ref.df F p-value

s(targetP):voiceA 1.00 1 0.031 0.86110

s(targetP):voiceB 1.00 1 8.658 0.00328

Now the effects of targetP for both voices are practically linear, as shown in Figure 10.7. The linearity of the effects can also be read off their edfs, both of which are equal to one. Because the first two basis functions of the spline are a horizontal line and a tilted line (the counterparts of the intercept and slope of a standard linear regression model), an edf value close to or equal to one means that in addition to the intercept (the first basis function, which is already represented in the parametric part of the summary), only one additional basis function is needed, which is the basis function for the tilted line. In other words, the GAM has detected, when restrained from overfitting, that the effect is linear rather than nonlinear.

In summary, the analysis of this dataset illustrated how to use factor smooths to model the nonlinear random effect for individual subjects. In addition, we introduced how to request smooths for different levels within a given factorial factor. We also noted that we should be cautious about smooths that are excessively wiggly, as they might be artifacts resulting from sparsely represented numerical predictors in repeated measurement experimental designs. This issue can be addressed by adjusting the number of basis functions.

The effects of targetP for voice A (left) and voice В (right) from ret. gam5

Figure 10.7 The effects of targetP for voice A (left) and voice В (right) from ret. gam5.

Example 2: Time-varying response variables

In phonetics, a lot of features that we measure change over time. These include pitch, intensity and formants. Tongue positions traced by electromagnetic articulography, and signals recorded by EEG electrodes are also dynamic in nature. As discussed previously in the analysis of the introductory clinical example, when it comes to these measurements, it has been a common practice to pre-process the data, e.g. by calculating the mean, range, or maximum/minimum values, and then use these aggregated data as input for statistical analyses. By doing so, we however run the risk of missing out on potentially interesting aspects of the data. For example, participants with dysarthria might have intensity profiles for stressed and unstressed syllables that are very different from those of control speakers, which might be characterized by less abrupt changes in intensity. Fine-grained details such as these cannot be brought to light by the traditional ways of data analysis.

The dataset of the next example consists of the FO measurements of the high-level tone, or Tone 1 (henceforth Tl) of Mandarin. Speakers were sampled from two regions in Taiwan, Taipei and Taichung, which are two metropolitan cities located in northern and central Taiwan, respectively. Of interest is whether Tl realizations differ between the two speaker groups. A clinical parallel to the present between-group comparison is a comparison between a control group of healthy speakers and a group of clinical participants. With the present dataset, we will illustrate how to model wiggly curves for different speaker groups (the levels of a factorial predictor), as well as how to ascertain whether, and if so, where, the contours of two groups are significantly different. In other words, we can use GAMs to model the curves of any response variable separately for control and clinical groups. We can also set the model up in such a way that we can compare two curves directly and find out where in the curves the two speaker groups differ significantly. Finally, we will discuss how to take the variances of different speaker groups, in addition to the means, into account in GAMs. From the perspective of clinical phonetics, it is conceivable that speakers with an impairment, due to lack of control, produce data with greater variability as compared to a control group. Failure to take such differences in the variance into account may introduce inflated p values. GAMs provide a flexible way for dealing with unequal variances.

For the current dataset, Fon (2007) recorded 25 native speakers of Taiwan Mandarin. Thirteen speakers (six males, seven females) were from Taipei, and 12 (six males, six females) were from Taichung. Stimuli were 24 bisyllabic words, and the target Tl syllables occur in either the first (PI) or the second (P2) position of the bisyllabic words. The adjacent preceding or following syllable carried one of the four lexical tones (Tl, T2, ТЗ, T4). In total, the manipulation of position and adjacent tone gave rise to eight different tonal contexts (specified in the column context). By way of example, the word bing'gan1 “cookie" has the target syllable gan1 in P2, and is preceded by a T3 syllable, and hence the context coding is “P2.T3." To capture the pitch contour of Tl, the FO values at ten equally-spaced time points of each target syllable were measured.

presents Tl realizations in the eight different contexts by the four speaker groups (broken down by sex and location). In general, females have higher pitch values than

Figure 10.8 presents Tl realizations in the eight different contexts by the four speaker groups (broken down by sex and location). In general, females have higher pitch values than

TI contours of different contexts for individual male (blue) and female (red) subjects

Figure 10.8 TI contours of different contexts for individual male (blue) and female (red) subjects.

males, as expected. Interestingly, Tl is not really a “level” tone as canonically described. Instead, it is featured by a final rise, as is clearly visible especially for females and for word- final positions (P2).

Before fitting models, we centered time . Time is a numeric variable ranging from one to ten in normalized time. Centering was implemented because the model intercept, the y-coordinate where the regression line hits the у-axis, is not interpretable, since pitch at time 0 does not make sense. We centered time so that the intercept now will represent the mean pitch in the middle of the target syllables. We fitted our first model with sex, location and context as fixed-effect factors, along with a smooth for the variable of scaled time (timeScaled). In addition, we also requested a by-subject factor smooth for timeScaled (i.e. a wiggly line for each subject) and random intercepts for word. Also note that here we used bam instead of gam. This is because when working with a large dataset, bam works more efficiently at minimum cost of accuracy. In addition, with the directive discrete set to TRUE, covariates are binned in a mathematically principled way to enable faster estimation of the model’s coefficients.

tone$timeScaled = as.numeric(scale(tone$time, center =

TRUE, scale = FALSE))

tone.gamO = bam(pitch ~ sex + location + context +

s(timeScaled) +

s(timeScaled, subject, bs = "fs", m = 1) + s(word, bs="re"), data = tone, discrete = TRUE)

Parametric coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 123.505 6.337 19.489 < 2e-16

sexF 98.017 7.156 13.698 < 2e-16

locationTAICHUNG -8.739 7.156 -1.221 0.2220

contextP2.Tl -1.196 1.163 -1.028 0.3040

contextPl.T2 6.371 1.163 5.476 4.52e-08

contextP2.T2 2.295 1.163 1.973 0.0486

contextPl.T3 7.601 1.163 6.533 6.99e-ll

contextP2.T3 -1.943 1.163 -1.670 0.0950

contextPl.T4 2.108 1.163 1.812 0.0701

contextP2.T4 -9.670 1.163 -8.312 < 2e-16

Approximate significance of smooth terms:

edf Ref.df F p-value

s(timeScaled) 5.677 6.567 27.66 < 2e-16

s(timeScaled,subject) 119.356 225.000 150.77 1

s(word) 13.698 16.000 5.95 3.89e-16

R-sq.(adj) = 0.973 Deviance explained = 97.4%

fREML = 21572 Scale est. = 73.039 n = 6000

According to model summary, pitches have to be adjusted upward for females, unsurprisingly. While the effect of location is not supported, context shows up with several well-supported contrasts. With respect to the smooth terms, the overall contour of T1, as shown in Figure 10.9, indeed has a dipping curvature. The by-subject factor smooth, on the other hand, is not significant, indicating that inter-subject variability is not substantial.

There is, however, one serious problem for this model. When we apply the autocorrelation function to the residuals of the model, we obtain the plot presented in the left panel of Figure 10.10.


It suggests that there is still structure remaining in the residuals, violating the crucial modeling assumption that the residuals are independent of each other. That is, residuals at time I are correlated with residuals at preceding time steps t_lag. At shorter lags, autocorrelations are strongest. This autocorrelation is inevitable, since given physical constraints, the vibration of vocal folds at time t cannot be completely independent of that at time t—1. One strategy to deal

Partial effect of timeScaled from tone . gamO

Figure 10.9 Partial effect of timeScaled from tone . gamO.

with autocorrelated residuals is to incorporate an AR(1) process in the model. This correction process assumes that the error at time t is the sum of a proportion p of the preceding error at time t—1 and Gaussian noise. Adding an AR(1) component to our model requires that we add a further variable to the dataframe. The required variable specifies the beginning of each production (i.e. time 1) with the logical value TRUE, and all the others with the value FALSE. We now fit a new model with the AR( 1) process incorporated, setting rho to 0.8, the approximate value at lag 1 in the left panel of Figure 10.10.

Autocorrelation functions fitted to models tone.gamO (left) and tone.gaml (right). Lag indicates the number of time steps preceding a given time t

Figure 10.10 Autocorrelation functions fitted to models tone.gamO (left) and tone.gaml (right). Lag indicates the number of time steps preceding a given time t.

tone$AR.start = FALSE

tone$AR.start[tone$time==l] = TRUE

tone.gaml = bam(pitch ~ sex + location + context +

s(timeScaled) +

s(timeScaled, subject, bs = "fs", m = 1) + s(word, bs = "re"), data = tone, discrete = TRUE,

AR.start = tone$AR.start, rho = 0.8)

To inspect the residuals, this time we used the resid_gam function from the itsadug package (van Rij et al., 2017). This function discounts the part of residuals that has been taken care of by the AR( 1) process.

As shown in the right panel of Figure 10.10, autocorrelation in the errors has been successfully eliminated.

acf (resid_gam (tone . garni) )

To further examine the effect of context, we fitted a new model and requested a separate smooth for each context. In order to compare the model fit of this model with that of the previous model, we used the method of maximum likelihood estimation with the directive method = "ML". It is important that we cannot use the method of (fast) restricted maximum likelihood estimation, i.e. (f)REML, if the models to be compared have different fixed-effect structures.

tone.gam2 = bam(pitch ~ sex + location + context +

s(timeScaled, by = context) + s(timeScaled, subject, bs = "fs", m = 1) + s(word, bs = "re"),

data = tone, AR.start = tone$AR.start, rho = 0.8, method = "ML")

After also refitting the previous model with maximum likelihood estimation, we can now compare the model fits.

compareML(tone.garni, tone.gam2)

Chi-square test of ML scores

Model Score Edf Difference Df p.value Sig.

  • 1 tone.gaml 16925.54 15
  • 2 tone.gam2 16578.79 29 346.752 14.000 < 2e-16 ***

Model fit has improved substantially by allowing each combination of position and lexical tone to have its own smooth (as indicated by the lower ML score). Figure 10.11 clarifies that T1 is realized very differently when in PI compared to when in P2. Furthermore, tonal contours are also clearly depending on the tones of preceding and following syllables.

Partial effects of pitch contours for different contexts from tone . gam2

Figure 10.11 Partial effects of pitch contours for different contexts from tone . gam2

Although T1 looks very different for several of the eight contexts, it can however not be straightforwardly inferred from the model where pairs of contours are significantly different. For example, in the contexts of “P2.T2” and “P2.T3” (lower mid panels, Figure 10.11), the T1 contour in the “P2.T3” context appears to start lower than that in "P2.T2,” suggesting a clear influence from preceding tones. However, we do not know whether this difference is statistically supported. To address this issue, we can build a difference curve.

By way of example, to zoom in on this particular contrast, we first subsetted the data to include only the pertinent datapoints. We then transformed the variable context into a numeric variable, with the reference level “P2.T2” coded as 0 and the contrast level “P2.T3” coded as 1. A new model was fitted, in which we included not only a smooth for the general effect of timeScaled, but also a smooth for the difference between the reference and the contrast levels. What this second smooth term does is to provide a corrective curve for the datapoints that fall under the contrast level. In this way, we obtain a difference curve for the two levels.

toneP2 = droplevels(tone[tone$context=="P2.T2"

|tone$context=="P2.T3", ])

toneP2$contextNum = as.numeric(toneP2$context)-1 toneP2.gam = bam(pitch ~ sex + location +

s(timeScaled) +

s(timeScaled, by = contextNum) + s (timeScaled, subject, bs = "fs", m = 1) + s(word, bs = "re"), data = toneP2, discrete = TRUE,

AR.start = toneP2$AR.start, rho = 0.8)

The difference curve for the present example is shown in the left panel of Figure 10.12. T1 contours are always lower when following T3 than when following T2: the entire curve lies under the x-axis. To better understand the difference curve, we plotted the fitted values for the two contexts in the right panel of Figure 10.12. The difference is greater at the beginning and gradually attenuates toward the end, which also indicates that the difference decreases

Difference curve between “P2.T2’' and "P2.T3” contexts (left) and the predicted pitch values for the two contexts (right) obtained from toneP2 . gam

Figure 10.12 Difference curve between “P2.T2’' and "P2.T3” contexts (left) and the predicted pitch values for the two contexts (right) obtained from toneP2 . gam.

Difference curve between "P1.T2" and “P1.T3” contexts (left) and the predicted pitch values for the two contexts (right) obtained from model tonePl. gam

Figure 10.13 Difference curve between "P1.T2" and “P1.T3” contexts (left) and the predicted pitch values for the two contexts (right) obtained from model tonePl. gam.

over time. Importantly, the confidence interval of the difference curve never contains zero, suggesting that even at the endpoint in time, the small remaining difference is still significant.

With the same method, we can also examine whether the contours of “P1.T2" and “P1.T3” are significantly different. Detailed code is provided in the supplementary material. The difference curve (left panel. Figure 10.13) is a straight line above zero, since “P1.T3” has higher predicted pitch values than “P1.T2’’ for all time points (right). Notably the confidence interval of the difference curve always contains zero, a clear indicator that there is no real difference between these two contexts.

In addition to the effect of context, we were interested in variation in different subject groups. Thus, we further asked whether the effect of context interacts with sex or location. The next model now has two additional interactions: one between sex and context, and the other between location and context.

tone.gam3 = bam(pitch ~ context * (sex + location) +

s(timeScaled, by = context) + s(timeScaled, subject, bs = "fs", m = 1) + s(word, bs = "re"),

data = tone, AR.start = tone$AR.start, rho = 0.8, method = "ML")

Using compareML (), we observed a substantial improvement of model fit. Including by-sex or by-location smooths for each context, however, did not significantly improve model fit. This clarifies that in terms of tonal shape, we do not have sufficient statistical evidence supporting cross-dialect or cross-gender variation.

So far tone . gam3 is our best model. But now we have to subject tone . gam3 to model criticism, to clarify whether this model is actually appropriate for our data. We first checked whether residuals of this model follow a normal distribution using a quantile-quantile plot.

qq.gam(gam.tone 3)

Quantile-quantile plots for the residuals of models tone . gam3 and tone . gam3a

Figure 10.14 Quantile-quantile plots for the residuals of models tone . gam3 and tone . gam3a.

This plot is presented in the left panel of Figure 10.14. Ideally, the residuals should fall on the black solid line, an indication that they follow a normal distribution. However, we see that the distribution of the residuals clearly has two heavy tails, indicative of a t distribution. We therefore modeled the data with a scaled-t distribution, which transforms the residuals back to normality. In the model specification we thus added family = "scat", which gave us tone . дашЗа:

tone.gam3a = bam(pitch ~ context * (sex + location) +

s(timeScaled, by = context) + s(timeScaled, subject, bs = "fs", m = 1) + s(word, bs = "re"), data = tone, AR.start = tone$AR.start, rho = 0.8, method = "ML", family = "scat")

The quantile-quantile plot for the residuals of tone . gam3a is presented in the right panel of Figure 10.14. It is clear that now the distribution of errors is much closer to a normal distribution.

A further way to carry out model criticism is to run the function gam. check (), which provides further useful information about residuals. For example, Figure 10.15 presents one of the plots output by gam. check (). It is clear that the residuals cluster into two groups. Given that sex is the most powerful linear predictor, the two clusters could roughly be considered as belonging to males and females. This scatterplot tells us that residuals spread out more for females (cluster on the right) than for males (cluster on the left). In other words, the model is still problematic, as it violates the assumption of homogeneity of variance for residuals. Indeed, as also shown in Figure 10.8, Taipei males in particular, show much less variability than Taichung males and females in general.

GAMs can directly model the variance, in addition to the mean. This can be achieved by using the Gaussian location-scale model, with the specification of family = "gaulss". Since we are now modeling not only the mean but also the variance, we need to specify two formulae, one for each. Combined into a list, the first formula is similar to the one of model tone. датЗ. The second formula is for the variance, which includes an interaction between sex and location. We also removed the random effect of by-subject factor smooth in the first formula, because subject variability coincides with the effect of sex x location on variances to a substantial extent.

One of the output plots given by gam. check () for tone . gam3a

Figure 10.15 One of the output plots given by gam. check () for tone . gam3a.

tone.gam4 = gam(list(pitch ~ context * (sex + location) +

s(timeScaled, by = context) + s(word, bs = "re"),

~ sex * location), data = tone, family = "gaulss")

Parametric coefficients:

Estimate Std. Error z value Pr(>|z|) (Intercept) 123.68768 0.78478 157.608 < 2e-16

contextP2.Tl 1.80746 1.10982 1.629 0.103396

(Intercept).1 2.27445 0.01878 121.100 < 2e-16

sexF.l 0.59782 0.02565 23.308 < 2e-16

locationTAICHUNG.1 0.81179 0.02657 30.557 < 2e-16

sexF:locationTAICHUNG.1 -0.62281 0.03687 -16.891 < 2e-16

The summary of the parametric predictors now contains two parts. The first part presents the results for modeling the mean, which is similar to the results of model tone . gam3.The second part is the results for modeling the variance. All three predictors reach significance, confirming our observation that the variance indeed differs across speaker groups. The downside of using the Gaussian location-scale model, however, is that autocorrelation is no longer under control. Here we have to wait for further development of GAMM theory and implementation.

In summary, the analyses of pitch contours for this dataset illustrated how GAMs can be used to model time series of acoustic measurements. In particular, we addressed the issue of autocorrelation. We also introduced difference curves, which are useful to track the difference between the curves of two contrast levels. Finally, we showed how model criticism can lead to the selection of a different “family” to fine-tune the model better.

Example 3: Interacting covariates

While some response variables vary with time, as shown in the previous tone example, others vary across location. Unlike time, which is represented by one numeric predictor, location has to be represented by at least two numeric predictors, e.g. x and у coordinates. Thus, instead of wiggly curves, now we have to deal with wiggly three-dimensional surfaces. A straightforward application of this in phonetics is the study of dialectal variation. Speakers of the same language but from different locations often pronounce words differently. To investigate how pronunciation variations are distributed geographically, we can use GAMs to predict pronunciation by fitting a nonlinear surface projected from geographical coordinates, i.e. longitude and latitude.

In clinical phonetics, more and more attention is directed to the issue of language variation and its potential impact on clinical assessment and treatment (see Chapters 7 and 8 for detailed discussion). The technique introduced here provides a tool that can be used not only to study dialectal variation, but also to study the consequences of dialectal variation for the speech of geographically diversified control and clinical speaker groups. Other applications within clinical phonetics are the positions of sensors on the tongue in electromagnetic articulography. For instance, when sensor positions on the midsaggital plane are considered as a function of time, then tongue height can be modeled with a regression surface predicted from sensor location and time, while contrasting one or more clinical speaker groups with a control group (for an example of such an analysis, see Tomaschek et al., 2018).

As mentioned previously, retroflex sibilants are merging with dental sibilants in Taiwan Mandarin. However, the degree of merging is heterogeneous across Taiwan. In some areas, the retroflex-dental distinction still persists, whereas in other areas, sibilants are merged to a greater or lesser extent. In order to investigate how sibilant realizations differ geographically, a total of 323 participants from 117 different locations of Taiwan were recruited (Chuang et al., 2019). As shown in Figure 10.16, a large number of the participants are from the north, comprising the urban and suburban areas of Taipei, the capital city. From central to southern Taiwan, we only had participants from the western side of the country (between the two big cities of Taichung and Kaohsiung). This is because the eastern part is mountainous and sparsely populated. For each geographical location, we obtained its longitude and latitude coordinates.

Stimuli were 15 retroflex-initial and 15 dental-initial words. The production of these words was elicited by means of a picture-naming task. For each sibilant, we measured its centroid frequency, which was calculated based on a 30-ms spectral slice extracted from the middle of the frication part of the sibilant. Centroid frequency is generally considered a good index for sibilants’ place of articulation: the more anterior the articulation, the higher the centroid frequency. Thus, dental articulation results in higher mean centroid frequency values compared to retroflex articulation. The pertinent information can be found in the dataset mer.

Subject Word gender vowel sibilant centfreq longitude latitude minflu

  • 1 SI W1 m unrounded D 6310.199 121.5717 25.03062 6
  • 2 S2 W1 m unrounded D 5245.757 121.4871 25.06282 1
  • 3 S3 W1 m unrounded D 7276.224 121.5317 24.93039 5
The geographical distribution of population in Taiwan, and the participants in this study

Figure 10.16 The geographical distribution of population in Taiwan, and the participants in this study.

We first fitted a model with three factorial predictors as fixed effects. In addition to sibilant, we also included gender and vowel, the latter of which specifies whether the sibilant is followed by a rounded or an unrounded vowel. Males’ sibilants in general have lower centroid frequencies because vocal tract length is usually longer for males than for females. Similarly, when sibilants are followed by rounded vowels, the vocal tract is lengthened due to anticipatory coarticulation of lip protrusion. This leads to the lowering of centroid frequencies as well.

To inspect the effect of geography, since a geographic location is determined simultaneously by two predictors, longitude and latitude, we need tensor product smooths. Just as splines fit a wiggly curve for the effect of a single predictor, tensor product smooths fit a wiggly surface for the joint effect of two or more predictors. The pertinent function for tensor product smooths in GAMs is te (). For the present dataset, a wiggly surface is projected from longitude and latitude. In addition, we added by = sibilant in the specification of the tensor product smooth, to request two wiggly surfaces, one for each sibilant category. This model includes by-subject and by-word random intercepts as well.

mer.gamO = bam(centfreq ~ gender + vowel + sibilant +

te(longitude, latitude, by = sibilant) + s(Subject, bs = "re") + s(Word, bs = "re"), data = mer, discrete = TRUE)

Parametric coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 6252.8 143.9 43.446 < 2e-16

genderf 317.9 59.1 5.379 7.67e-08

vowelunrounded 759.3 112.2 6.769 1.37e-ll

sibilantR -1185.5 583.9 -2.030 0.0424

Approximate significance of smooth terms:

edf Ref.df F p-value


sibilantD 3.001 3.001 1.446 0.227


sibilantR 15.058 16.259 5.593 1.47e-12

s(Subject) 295.054 323.000 23.399 0.317

s(Word) 28.322 29.000 56.137 < 2e-16

R-sq.(adj) = 0.543 Deviance explained = 55.8%

fREML = 86336 Scale est. = 6.2616e+05 n = 10613

The table of parametric coefficients shows that centroid frequencies are higher for females and for sibilants followed by unrounded vowels, as expected. The effect of sibilant, on the other hand, is not very well supported, presumably because of merging. With regards to the effect of geography, there is also evidence supporting that retroflex (but not dental) realizations differ across regions. Before visualizing this effect, we first check the concurvity of the model. Concurvity is a concept similar to collinearity in linear regression. It occurs when predictors are highly correlated with one another. When there is high correlation among the predictors, coefficient estimates become inaccurate, and it becomes unclear what the unique contribution of a predictor to the model fit is. The concurvity of the model can be obtained with the function concurvity ():


para te(longitude,latitude):sibilantD worst 1 1

observed 1 1

estimate 1 1

te(longitude,latitude):sibilantR s(Subject) s(Word) worst 1 1.0000 1.0000

observed 1 0.0700 0.0114

estimate 1 0.0824 0.1065

Concurvity values are bounded between zero and one, with larger values indicating high concurvity. The fact that the estimates of the two tensor product smooths are both one suggests a serious concurvity problem of this model. The problem results from the high correlation between the effect of geographical location and the by-subject random effect. Given that for half of the locations we have only one participant, the effect of geographical location is inevitably closely tied to the random effect of subject. To remedy this, we left out the by-subject random intercept, an effect which also turned out to be not statistically supported in model mer. gamO.

To visualize the regression surfaces, we use the vis . gam function. This function produces a contour plot of the fitted values. The output contour plots for dentals and retroflexes, overlaid with the map of Taiwan, are presented in Figure 10.17. More coding details are colder in the supplementary material. Warmer colors represent higher centroid frequencies, whereas colder colors represent lower centroid frequencies. We can immediately see that, as expected, dentals (left) have higher centroid frequencies than retroflexes (right). In addition, geographical differentiation is more prominent for retroflexes than dentals as it is the realization for retroflex sounds that varies substantially across regions. Specifically, people in central Taiwan (the area

The effect of geography on dental (left) and retroflex (right) sibilants from mer. gamO

Figure 10.17 The effect of geography on dental (left) and retroflex (right) sibilants from mer. gamO.

Yu-Ying Chuang et al.

near Taichung) have very retroflexed productions, and there are a few places (the yellow areas) where people have almost dental-like, or deretroflexed productions, indicating a high degree of merging.

Figure 10.17 shows that centroid frequencies vary geographically in ways that differ between dental and retroflex sibilants. This leads to the question of where across the country the dental and retroflex realizations are most similar (i.e. high degree of merging). We can investigate this by setting up a model with a difference surface, which will be predicted by the geographical coordinates (longitude and latitude). Similar to the difference curve described earlier for pitch contours, we first transformed sibilant into a numeric variable. Next in the formula we specified two tensor products. The first one is for the reference sibilant category (dental), while the second one is for the difference surface (retroflex - dental).

mer$sibNum = as.numeric(ifelse(mer$sibilant=="D", 0, 1)) mer.garni = bam(centfreq ~ gender + vowel +

te(longitude, latitude) + te(longitude, latitude, by = sibNum) + s(Word, bs = "re"), data = mer, method = "ML")

Figure 10.18 presents the partial effects of the two tensor product smooths. The left panel shows the contour plot of the reference level (i.e. dental). The numbers on the contour lines are the predicted partial effect for centroid frequency. Since partial effects exclude intercepts and the effects of other predictors, we therefore observe the “pure” effect of geography. For example, Taichung is located around the contour line of zero. To the east of Taichung, the number gradually decreases (from -500 to -1,000). This indicates that the centroid frequency

Contour plots of the partial effects of the reference dental sibilants

Figure 10.18 Contour plots of the partial effects of the reference dental sibilants (left) and the difference surface (right) obtained from mer .garni. One SE (standard error) confidence regions are indicated by green dotted and red dashed lines for the upper and lower bounds of the interval respectively.

predicted for dental sibilants decreases to the east. The contour plot in the right panel presents the magnitude of the difference in centroid frequencies between the two sibilants (retroflexes - dentals). Given that retroflexes have lower centroid frequencies than dentals, more negative values thus indicate a larger difference and less merging. Notably, speakers from central Taiwan (to the east of Taichung) keep the production of the two sibilants most distinct (the predicted difference in centroid frequencies can be as large as 3,000 Hz). As this is the location of the former capital of Taiwan, speakers from that region appear to preserve retroflexion to a greater extent. In addition, we found that the degree of merging for the three major cities in Taiwan is fairly similar (all located around the -1.000 contour lines). This could be a consequence of urbanization. Since the population of these cities is all composed of speakers from different regions across the country, extensive interactions among different dialectal or ethnic groups could have led to a new variety that is specific to urban areas. However, we will need more data to verify this hypothesis.

One frequently asked question regarding sibilant merging in Taiwan Mandarin is to what extent the degree of merging is subject to the influence of Min, another major substrate language spoken in Taiwan. Like Mandarin, Min also has the three dental sibilants. However, unlike Mandarin, it lacks all of the three retroflex counterparts. Merging, essentially deretroflexion, is thus often regarded as a negative transfer from Min. The question of interest is whether Min fluency has an effect on sibilant merging. Specifically, we ask whether highly fluent Min speakers deretroflex more when speaking Mandarin, and also whether knowledge of Min interacts with the geographic effect.

In the mer dataset, the column minf lu provides our participants’ self-reported ratings of Min speaking fluency. The ratings range from one (low fluency) to seven (high fluency). In the next model, we included a tensor product for minf lu and the two geographical coordinates, and further requested separate wiggly surfaces for the two sibilant categories. We fitted the model with the maximum likelihood method to enable model comparisons.

mer.gam2 = bam(centfreq ~ gender + vowel + sibilant +

te(longitude, latitude, minflu, by = sibilant) + s(Word, bs = "re"), data = mer, method = "ML")

Compared to mer.gamO (refitted with method = "ML"), mer.gam2 clearly has

better model fit. To visualize the interaction of longitude and latitude by Min fluency, we plotted separate maps for different levels of Min fluency. Given that most of our participants have Min proficiency ratings between three and six, we therefore focused on these four fluency levels.

Figure 10.19 presents the effect of the four-way interaction, with the predicted surface for dentals in the upper row, and the predicted surface for retroflexes in the bottom row. Columns are tied to Min fluency, increasing from three (left) to six (right). At all Min fluency levels, the distinction between the two sibilants is still retained. Dental realizations are comparatively stable, whereas retroflex realizations vary across Min fluency levels to a much greater extent. Generally speaking, speakers from the center of Taiwan have more retroflexed pronunciation, but typically for speakers with mid-to-high Min fluency. On the other hand, for speakers from northern Taiwan (areas with latitude above 24.5), the realization of retroflexes becomes more dental-like with increasing Min fluency, consistent with the hypothesis of negative Min transfer. Interestingly, our analysis clarifies that this influence is region-specific.

The effect of geography on dentals (upper panels) and retroflexes (lower panels), conditioned on Min fluency levels

Figure 10.19 The effect of geography on dentals (upper panels) and retroflexes (lower panels), conditioned on Min fluency levels.

In summary, we have shown how to analyze the geographical distribution of sociophonetic variation with GAMs. Instead of using categorical variables that distinguish different geographical locations, we showed that directly working with geographical coordinates provides us with novel and powerful ways of studying variation. Specifically, tensor product smooths are available for modeling nonlinear interactions between two or more predictors, enabling dialectal variation to be studied in conjunction with other phonetic or social variables in considerable detail.


The goal of this chapter has been to introduce GAMs to the clinical phonetician. GAMs, which are rapidly gaining popularity in phonetics in general due to their versatility, are a promising tool also for the analysis of data from clinical phonetics.

As we have only provided a basic introduction to the main concepts of GAMs in this chapter, we refer the interested reader to Wood (2017) for a detailed exposition of the mathematics underlying GAMs and discussion of a wide range of examples of application to empirical data. A general, non-technical introduction to the mathematical concepts underlying GAMs is provided by Baayen et al. (2017) and Wieling (2018).

Introductions to GAMs focusing specifically on application to phonetics are Wieling et al. (2016) and Baayen and Linke (2020). Wieling et al. (2016) used GAMs to analyze tongue movement data obtained with electromagnetic articulography. investigating the difference in articulatory trajectories between two Dutch dialects. Baayen and Linke (2020) analyzed the distribution and occurrence of pronunciation variants, extracted from the Buckeye corpus, with GAMs. A GAM analysis on eye-tracking data collected for the perception of Cantonese tones can be found in Nixon et al. (2016).

Quantile GAMs (QGAMs, Fasiolo et al., 2020) provide an extension of the GAM framework that makes it possible to study how the quantiles of the distribution of a response variable depend on a set of predictors. As QGAMs are distribution-free, they are especially useful for datasets for which models that build on the assumption that the residual errors should be independently and identically distributed turn out to be inappropriate. For this reason, Tomaschek et al., (2018) turned to QGAMs for modeling tongue movements registered with electromagnetic articulography.


  • 1 The datasets and the supplementary material for this chapter are available at
  • 2 For one dysarthric participant (D2), the duration and intensity of the first and second syllables, instead of vowels, were measured, because the vowels could not be reliably identified. We do not expect this difference to skew the results, since instead of using raw duration and intensity data, we calculated the ratios and used them as our response variables. Moreover, statistical results are qualitatively similar irrespective of whether D2 is included or excluded. We therefore retained the data of D2 in the analyses.


Baayen. R. H. and Linke, M. (2020). An introduction to the generalized additive model. In Paquot. M. and Gries, S. T.. editors, A practical handbook of corpus linguistics. Springer. Berlin.

Baayen. R. H.. Milin. P., and Ramscar, M. (2016). Frequency in lexical processing. Apltasiology, 30(11), 1174-1220.

Baayen, R. H.. Vasishth, S.. Kliegl, R., and Bates, D. (2017). The cave of shadows: Addressing the human factor with generalized additive mixed models. Journal of Memory and Language, 94, 206-234.

Chuang, Y.-Y. (2017). The effect of phonetic variation on word recognition in Taiwan Mandarin. PhD thesis, National Taiwan University, Taipei.

Chuang, Y.-Y.. Sun. C.-С., Fon, J.. and Baayen. R. H. (2019). Geographical variation of the merging between dental and retroflex sibilants in Taiwan Mandarin. In Proceedings of the /9* International Congress of Phonetic Sciences, pages 472-476, Melbourne. Australasian Speech Science and Technology Association.

Fasiolo, M.. Wood, S. N.. Zaffran. M.. Nedellec. R.. and Goude. Y. (2020). Fast calibrated additive quantile regression. Journal of the American Statistical Association, 1-11.

Fon, J. (2007). The effect of region and genre on pitch range of Tone 1 in Taiwan Mandarin. Technical Report NSC95-2411-H-002-046-. National Science Council, Taipei, Taiwan.

Nixon, J. S., van Rij, J., Мок, P.. Baayen. R. H., and Chen. Y. (2016). The temporal dynamics of perceptual uncertainty: Eye movement evidence from Cantonese segment and tone perception. Journal of Memory and Language, 90, 103-125.

Papakyritsis, 1. and Muller. N. (2014). Perceptual and acoustic analysis of lexical stress in Greek speakers with dysarthria. Clinical linguistics & phonetics, 28(7-8), 555-572.

Tomaschek, F.. Tucker. В. V., Fasiolo, M., and Baayen, R. H. (2018). Practice makes perfect: The consequences of lexical proficiency for articulation. Linguistics Vanguard, 4, 1-13.

van Rij. J.. Wieling. M.. Baayen. R. H.. and van Rijn. H. (2017). itsadug: Interpreting time series and autocorrelated data using GAMMs. R package version 2.3.

Wieling. M. (2018). Analyzing dynamic phonetic data using generalized additive mixed modeling: a tutorial focusing on articulatory differences between LI and L2 speakers of English. Journal of Phonetics, 70, 86-116.

Wieling, M., Tomaschek, F, Arnold. D., Tiede, M.. Broker, F.. Thiele, S., Wood. S. N.. and Baayen, R. H. (2016). Investigating dialectal differences using articulography. Journal of Phonetics, 59, 122-143.

Wood, S. (2017). Generalized Additive Models: An Introduction with R (2nd Ed.). CRC Press. Boca Raton. FL.


< Prev   CONTENTS   Source   Next >