Assimilation, the sound pattern whereby one sound acquires a phonetic property from a neighbouring sound (e.g., [nb] ~ [mb]), is one of the most widespread phonological phenomena in the world’s languages. Intuitively, assimilation can be accounted for as a straightforward ease-of-articulation effect where the production of two properties (e.g., [coronal]-[labial]) is simplified to one (e.g., [labial]), under the assumption that production of heterorganic clusters is articulatorily harder than the production of homorganic clusters (for discussion see e.g., Kohler, 1991; and Winters, 2003 regarding nasal-stop clusters). The explanatory power of a phonetically-grounded account of assimilation increases when we further take into account the observation that the perceptual cues to the consonants that undergo assimilation are generally weaker than are those of the consonants that trigger it (Kohler, 1991; Hura, Lindblom, & Diehl, 1992; Jun, 1996, 2004, 2011; Chang, Plauche, & Ohala, 2001; Steriade, 2001; Kawahara & Garvey, 2014; see also Ohala, 1990; Kawasaki-Fukumori, 1992; Rysling, 2017) and are hence more likely to undergo processes of reduction.
The observation that production and perceptual factors play a role in predicting assimilation is consistent with the view that sound patterns are shaped by communicative biases towards accurate message transmission and low resource cost (e.g., Lindblom, 1990; Hall, Hume, Jaeger, & Wedel, 2016). From this perspective, reduction of an inherently weakly perceptible segment in an articulatorily difficult context can be viewed as maintaining low resource cost. That is, there is little communicative gain in investing resources in the clear production of the segment since it contributes little to the probability that the word will be successfully recognized (henceforth, inferrability).
In principle, communicative biases could manifest in direct or indirect ways. The direct route involves strategic—although possibly implicit and automatic—behaviour as suggested just above: Speakers modulate phonetic and phonological content to adjust the amount of information contained in the speech signal to improve the likelihood of communicative success (Lindblom, 1990; Aylett & Turk, 2004; Scarborough, 2004; Jaeger, 2013; Buz, Tanenhaus, & Jaeger, 2016). The simplest prediction of this framework is an inverse relationship between phonetic content and word predictability. This relationship arises because, for highly contextually-predictable words, the predictability of the word itself licenses a degree of phonetic reduction on the part of the talker. That is, the talker is free to reduce such words with acceptable risk of communication failure. On the other hand, less predictable words have no such contextual cues to assist their recognition, and they therefore are produced with relative phonetic clarity. The indirect route involves filtering as part of the perception-production loop, as outlined by Pierrehumbert (2001, 2002). Under this view, unrecognizable words are not stored in a listener’s exemplar clouds, and are thus not replicated in future communications. Crucially, the recognizability of a word is a function of both its phonetic clarity and its contextual predictability. Along these lines, contextually predictable but phonetically unclear word exemplars will be more likely to ‘survive’ and be used in many communications. Unpredictable and unclear exemplars will not survive, with the end result being a trade-off between phonetic clarity and contextual predictability (see also Wedel, 2006). For example, speakers are unlikely to reduce a coda /n/ in a less-predictable word because they have stored relatively few exemplars of this word without a clearly-produced /n/.
Regardless of the directness of these biases, several studies have found empirical support for the basic generalization that phonetic content and word predictability stand in an inverse relationship (e.g., for English, Jurafsky, Bell, Gregory, & Raymond, 2001; Bell, Brenier, Gregory, Girand, & Jurafsky, 2009; Aylett & Turk, 2004, 2006; for review, see Ernestus, 2014; Clopper & Turnbull, 2018; Jaeger & Buz, 2017). For example, reduction of English /t, d/ in word-medial onset position is more likely when the word that it occurs in has high predictability given the following word (Raymond, Dautricourt, & Hume, 2006). Similarly, the addition of phonetic material through variable vowel epenthesis in Dutch was found to be more likely in less predictable words (Tily & Kuperman, 2012).
Two predictions arise from a communicative framework such as message-oriented phonology (Hall et al., 2016).¹ The first prediction focuses on the relative contextual predictability of a word and the view that fewer acoustic cues are needed to identify a word with high predictability. Under the assumption that phonological patterns of assimilation have their roots in coarticulatory variability (Beddor, 2009) and that coarticulation can² involve a reduction in phonetic cues of the canonical target form, we would expect place assimilation of a segmental target to be more likely when the word it occurs in is contextually more predictable. For example, the phrase ten bucks is more likely to be produced as tem bucks when the word ten is relatively predictable from context. While the reduction in cues to the phonetic form may lead to diminished inferrability of the word ten, the word’s high contextual predictability supports effective recognition: While ten is a probable word to precede bucks, temp or other alternative word candidates that would have non-zero probability under the assimilated signal are not probable before bucks (for further discussion, see Flemming, 2010 and Hall et al., 2016; for evidence that contextual inferrability ameliorates the effects of phonetic reduction, see Ernestus, Baayen, & Schreuder, 2002).³
Crucially, however, assimilation does not only reduce redundancy in the target word; it also extends a cue from the assimilation trigger to earlier in the signal, thereby increasing redundancy in the trigger word. That is, assimilation enhances the cues to the trigger (Scarborough, 2013). Indeed, listeners seem to be able to take advantage of such coarticulatory cues. Otake, Yoneyama, Cutler, and van der Lugt (1996) investigated the role of nasal place of articulation on the processing of place in a following stop consonant by Japanese subjects. In a phoneme monitoring task, they found that Japanese listeners made use of place cues in a nasal consonant to anticipate the identity of a following homorganic stop (see also Beddor et al., 2013).
The second prediction of a communicative approach is thus that assimilation is more likely when the trigger word is less contextually predictable. This prediction, to our knowledge, has not been previously proposed in the literature. In the case of nasal place assimilation, assimilation causes the place feature of the trigger to be both on the trigger and on the target thereby increasing inferrability for the trigger word. Assimilation can therefore be viewed as enhancing cues to the trigger, as well as potentially reducing cues to the target. As an enhancement process, assimilation is therefore predicted to be more likely to occur with contextually unpredictable trigger words (to boost their likelihood of error-free transmission). This predicts the phrase ten bucks to be more likely produced as tem bucks when the word bucks is relatively unpredictable from context.
We note that our predictions deviate from the most widely referenced usage-based accounts. Many usage-based accounts see routinization and practice effects, rather than effective communication, as the driving factor behind language change (e.g., Bresnan & Spencer, 2013; Bybee, 2001, 2006; Diessel, 2007; Munson, 2001; Tomaschek, Tucker, Fasiolo, & Baayen, 2018; see also Jaeger & Hoole, 2011). These accounts predict that frequency is the main factor to influence assimilation (see also certain exemplar-based accounts, e.g., Pierrehumbert, 2001, 2002). That is, if a word is very common, it will be phonetically reduced—high frequency leads to high assimilation rates. However, the communicative approach, based on general assumptions about the role of probabilistic inference during communication (see footnote 1), predicts that the main factor influencing assimilation is the inferrability of a message—and thus its conditional probability in context. The probability of the assimilation trigger is as relevant as that of the assimilation target, and high trigger probability in fact will lead to low assimilation rates. This prediction thus stands in opposition to that of a routinization-based account, which would generally predict more assimilation for more probable words. The present study directly contrasts the predictions of these two types of accounts.
The predictions of the communicative account can be framed in terms of two distinct levels of representation: segments and words.⁴ That is, do these processes respond to uncertainty calculated at the level of individual words, or of individual segments? The enhancement of both kinds of units mitigates uncertainty about the intended meaning of an utterance, which is central to message-oriented phonology (Hall et al., 2016). Some theoretical treatments privilege the word level (e.g., Hall et al., 2016), others the segment level (e.g., Cohen Priva, 2012; Scarborough, 2013), and many others are agnostic (e.g., Aylett & Turk, 2004). While determining the level of representation at which these processes apply is an important step in this research program, the present study does not adjudicate between these possibilities. As both are consistent with our main hypothesis regarding a trade-off relationship between the target and trigger words, we present analyses in terms of both word-level predictability (i.e., conditional lexical probability in context) and segment-level predictability (i.e., conditional probability of a segment given the preceding segments).⁵
We test the predictions of both routinization and communicative approaches against a corpus of conversational speech. We first examine nasal place assimilation at the level of phonetic categories, using the close phonetic transcriptions provided with the corpus. On the understanding that English place assimilation is a phonetically gradient process (Ellis & Hardcastle, 2002), we also examine assimilation using a continuous acoustic measure based on formant transitions. Specifically, we investigate assimilation of word-final coronal nasals in the citation phonological form (henceforth /n/) which precede word onsets with labial or velar plosives /p, b, k, g/. Coronal nasals are thus the target of assimilation for the present study. Labial and velar plosives are the triggers of assimilation. Before a word-initial labial trigger, a coronal nasal /n/ may be assimilated to [m]. Before a word-initial velar trigger, a coronal nasal /n/ may be assimilated to [ŋ].
Figure 1 summarizes the predictions of place assimilation across a sequence of two words, e.g., ten bucks. Assimilation of a word-final coronal nasal to a following word-initial labial or velar stop is predicted to occur at a greater rate when the contextual predictability of the target ten is high. Further, assimilation is also predicted to occur at a greater rate when the predictability of the trigger bucks is low. Note that if place assimilation were driven by repetition or collocational strength (e.g., Bybee, 2006; Mowrey & Pagliuca 1987, 1995), greater predictability of both the target and the trigger should be associated with more assimilation. If, however, place assimilation is driven by a trade-off in the inferrability of the words in the phrase, we would expect the pattern shown in Figure 1. The communicative approach thus makes predictions that differ from those of other usage-based approaches to phonology.
2.1 Corpus data
We extracted target-trigger word sequences which license nasal place assimilation across a word boundary (i.e., bigrams such as ten bucks) from the Buckeye Corpus of spontaneous speech (Pitt, Dilley, Johnson, Kiesling, Raymond, Hume, & Fosler-Lussier, 2007). The Buckeye Corpus includes audio recordings of one-on-one interviews with forty native English speakers from Columbus, Ohio, conducted under the initial pretense that they were participating in a study investigating how people express their opinions about everyday topics. The interviews are approximately one hour each, and comprise roughly 300,000 words. The audio recordings are annotated with time-aligned close phonetic transcriptions that have been hand-corrected at the segment level, as well as canonical transcriptions for each word based on its dictionary transcription. These transcriptions have been used in a number of previous studies on other aspects of pronunciation variation (e.g., Raymond et al., 2006; Dilley & Pitt, 2007; Cole, Mo, & Baek, 2010; Cole, Mo, & Hasegawa-Johnson, 2010; Gahl, Yao, & Johnson, 2012; Fourtassi, Dunbar, & Dupoux, 2014; Seyfarth, 2014; Cohen Priva, 2017b; Turnbull, 2018).
For our analysis, we first identified potential target-trigger sequences meeting the following three basic criteria:
- The first word (the target) had a dictionary transcription ending in a coronal nasal coda /n/ which was preceded by a vowel.
- The second word (the trigger) had a dictionary transcription beginning with an oral stop with a non-coronal place of articulation (i.e., /p, b, k, g/).
- Neither word involved a speech error or restart, or was a filled pause, and neither word had an obvious transcription error, defined as (a) a close phonetic transcription that was missing or did not match the timestamped segments or (b) a negative duration.
This yielded 1334 tokens from the entire Buckeye corpus. Next, we excluded tokens for the following reasons (in this order). Table 1 summarizes our data and exclusions.
- 1. No vowel in target’s phonetic transcription: The close phonetic transcription of the target word did not include a nucleus vowel (139 excluded tokens, 10% of the total). The majority of these exclusions (101 tokens) were tokens transcribed as having a syllabic [n̩] nucleus in the final syllable, especially the word can (51 tokens) transcribed as [kn̩].
- 2. No nasal in target’s phonetic transcription: The close phonetic transcription of the target word did not include a final nasal coda (18 excluded tokens, 1.3% of the total). For these purposes, a target word was included if the final nasal was transcribed as [n], [m], or [ŋ]; for example, ten transcribed as [tɛm] was allowed. A target word was excluded if the coda nasal was missing or transcribed as any other segment. For example, in transcribed as [ɪp] (in the bigram in particular) was excluded. Most of these were speech errors or transcription errors.
- 3. No stop in trigger’s phonetic transcription: The close phonetic transcription of the trigger word did not include an initial oral stop with the same place of articulation as the initial stop in the dictionary transcription (40 excluded tokens, 3.0% of the total). These exclusions mostly fell into two categories. First, there were 12 tokens which involved progressive nasal manner assimilation: For example, sun bear transcribed as [sʌn mbɛɹ]. While we would expect this kind of assimilation to also participate in the type of trade-off predicted by communicative accounts (for example, we would predict progressive manner assimilation if the predictability of the target is low and the predictability of the trigger is high), there were not enough such cases for them to be meaningfully involved in our analysis. Second, there were 26 tokens in which the trigger word was listed as because but the speaker produced cause, which was transcribed with an initial [k].
Table 1: Summary of data and exclusions.

|Exclusion reason|Tokens after exclusion|
|---|---|
|(All Buckeye sequences with word-final /Vn/ followed by /p, b, k, g/ onset with no speech errors, pauses, or alignment errors)|1334|
|No vowel in target’s phonetic transcription|1195|
|No nasal in target’s phonetic transcription|1177|
|No stop in trigger’s phonetic transcription|1137|
|Short target vowel (excluded for partial assimilation only)|777|
|Diphthongal target vowel (excluded for partial assimilation only)|671|
In addition, because our measure of partial assimilation is based on the target vowel’s formant trajectory (cf. Dilley & Pitt, 2007), we excluded some additional tokens from the analysis of partial assimilation only:
- 4. Short target vowel: The target vowel was shorter than 50 milliseconds (360 excluded tokens, 27% of the total). Formant measurements are unreliable for short vowels, and short vowels are unlikely to have meaningful formant transitions. Fifty milliseconds is the cut-off used by the FAVE vowel measurement software (Rosenfelder, Fruehwald, Evanini, Seyfarth, Gorman, Prichard, & Yuan, 2014). A large number of these excluded tokens involved the function words can, in, even, one, when, been, on, then, which are often pronounced with a very short duration and a short or unmeasurable formant trajectory.
- 5. Diphthongal target vowel: Either the dictionary transcription or the close phonetic transcription of the target vowel was listed as a diphthong (/aɪ/, /aʊ/, /ɔɪ/, /eɪ/, /oʊ/, including nasalized diphthongs) (106 excluded tokens, 7.9% of the total). Diphthongization inherently affects formant trajectories, but the realization of phonetic diphthongization is not fully predictable in American English (Fox & Jacewicz, 2009; Lehiste & Peterson, 1961).
After excluding these tokens, there were 1137 target-trigger sequences used in the analysis of categorical assimilation, and 671 target-trigger sequences used in the analysis of partial assimilation (additional analyses confirmed that our results do not change if the analysis of categorical assimilation is also limited to the smaller data set of 671 cases).
2.2 Measuring assimilation
2.2.1 Categorical assimilation
For our analysis, we used two metrics of nasal place assimilation. The first metric was a dichotomous measure of assimilation based on the close phonetic transcriptions provided by the Buckeye Corpus annotators. If an /n/ coda was transcribed as [m] before /p, b/-initial trigger words, or as [ŋ] before /k, g/-initial trigger words, it was considered to be assimilated; otherwise it was considered to be canonical. For example, if the word ten in the sequence ten bucks was transcribed as [tɛm], the /n/ was considered to be assimilated. If the word was transcribed as [tɛn], the /n/ was considered to be canonical.
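The coding rule above can be sketched in a few lines. This is only an illustration: the function name `is_assimilated` and the representation of segments as single-character strings are assumptions made here, not part of the corpus tools.

```python
# Hypothetical helper: codes one target-trigger token as assimilated or canonical.
LABIAL_TRIGGERS = {"p", "b"}
VELAR_TRIGGERS = {"k", "g"}


def is_assimilated(coda_nasal: str, trigger_onset: str) -> bool:
    """Return True if the transcribed coda nasal matches the trigger's place.

    coda_nasal: transcribed realization of the /n/ coda ("n", "m", or "ŋ").
    trigger_onset: dictionary onset of the following word ("p", "b", "k", "g").
    """
    if trigger_onset in LABIAL_TRIGGERS:
        return coda_nasal == "m"
    if trigger_onset in VELAR_TRIGGERS:
        return coda_nasal == "ŋ"
    raise ValueError(f"not a labial or velar trigger: {trigger_onset!r}")


# "ten bucks" transcribed with [tɛm] counts as assimilated, with [tɛn] as canonical.
ten_bucks_assimilated = is_assimilated("m", "b")
ten_bucks_canonical = is_assimilated("n", "b")
```

Under this rule, a nasal transcribed with a place that does not match the trigger (for instance [m] before /k/) falls into the "otherwise" case and is coded as canonical.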
2.2.2 Partial assimilation
Although the segmental transcriptions in the Buckeye Corpus are hand-corrected, a wide variety of factors might influence the transcribers’ subjective auditory and visual impressions of nasal assimilation, including their expectations and previous experience with the particular target-trigger sequence. Therefore, our second metric was an acoustic, continuous measure of assimilation based on the vowel formant trajectories immediately preceding the target /n/ codas. The change in the second formant frequency (F2) preceding a consonant is influenced by that consonant’s place of articulation. Thus, it can be used as a gradient approximation of nasal consonant place.
We follow previous work in using F2 to identify labial or velar nasal place assimilation (Pitt & Johnson, 2003; Dilley & Pitt, 2007). If a vowel is followed by a labial consonant, F2 typically falls into the consonant onset. If a vowel is followed by a velar consonant, F2 typically rises into the consonant onset. While this pattern generally holds across vowel types, we note that (i) the magnitude of the rise or fall differs between front and back vowels; and (ii) the actual acoustics of F2 trajectories for particular vowel-stop sequences are somewhat more complicated (see e.g., Fowler, 1994), and can differ not only among vowels but also among talkers.
To manage these differences, we construct a standardized metric which corrects both for vowel context as well as talker-specific pronunciation differences. This metric has higher values when F2 at the vowel offset is unusually high or low for a particular talker and phonological context—thus suggesting place assimilation—and lower values when F2 at the vowel offset is more typical for that particular talker and phonological context.
2.2.2.1 Measuring F2
We estimated raw F2 trajectories in Hz using linear predictive coding (LPC) with the Burg algorithm, using the standard preprocessing and analysis window settings in Praat. However, it is well known that automatic formant extraction using this method can result in gross errors due to misidentification of the formant tracks (e.g., see Weenink, 2015). To avoid such errors, we used the extraction technique described in Evanini, Isard, and Liberman (2009; see also Evanini, 2009; Rosenfelder et al., 2014; Labov, Rosenfelder, & Fruehwald, 2013).
In this procedure, seven different sets of formant measurements are estimated for each vowel token, using seven different LPC orders (6 to 12). For each vowel token, the best LPC order is selected from the seven candidates by calculating the Mahalanobis distance between the estimated formant values at the vowel midpoint (the first two formant frequencies and bandwidths) and the average measurements for the vowel type. The average measurements for each vowel type are taken from the Atlas of North American English (Labov, Ash, and Boberg, 2005, as provided in Rosenfelder et al., 2014). The LPC order that results in the smallest Mahalanobis distance is selected, which indicates that the formant estimates which resulted from that LPC order were most typical for the vowel type.
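The selection step just described can be sketched as follows. This is a minimal illustration under stated assumptions: the inverse covariance matrix is taken as given, all numbers are invented for the example (they are not Atlas of North American English values), and only three of the seven candidate LPC orders are shown for brevity.

```python
from typing import Dict, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def mahalanobis_sq(x: Vector, mean: Vector, inv_cov: Matrix) -> float:
    """Squared Mahalanobis distance of x from mean, given an inverse covariance."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    n = len(diff)
    return sum(diff[i] * inv_cov[i][j] * diff[j] for i in range(n) for j in range(n))


def best_lpc_order(candidates: Dict[int, Vector], mean: Vector, inv_cov: Matrix) -> int:
    """Pick the LPC order whose midpoint estimates are most typical for the vowel."""
    return min(candidates, key=lambda order: mahalanobis_sq(candidates[order], mean, inv_cov))


# Illustrative values only: [F1, F2, bandwidth 1, bandwidth 2] in Hz.
vowel_mean = [500.0, 1500.0, 80.0, 100.0]
# Hypothetical diagonal inverse covariance (1 / variance on the diagonal).
inv_cov = [
    [1 / 100.0**2, 0.0, 0.0, 0.0],
    [0.0, 1 / 200.0**2, 0.0, 0.0],
    [0.0, 0.0, 1 / 40.0**2, 0.0],
    [0.0, 0.0, 0.0, 1 / 50.0**2],
]
candidates = {
    6: [900.0, 2500.0, 200.0, 300.0],
    9: [520.0, 1480.0, 85.0, 110.0],
    12: [300.0, 900.0, 60.0, 90.0],
}
chosen_order = best_lpc_order(candidates, vowel_mean, inv_cov)
```

Minimizing the squared distance selects the same order as minimizing the distance itself, so the square root is omitted.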
This method avoids gross measurement errors at the midpoint of each vowel (see discussion in Labov, Rosenfelder, & Fruehwald, 2013). However, to quantify nasal place assimilation, we require measurements at the vowel offset (i.e., the consonant onset), and gross formant tracking errors may also occur between the midpoint and vowel offset, even if the midpoint is measured correctly. For this reason, we began with the midpoint measurements, and then for each consecutive window from the midpoint to the vowel offset, we adjusted the LPC order if it lowered the Mahalanobis distance between the formant measurements in the current and following window. This ensures a relatively continuous trajectory, and helps avoid gross formant tracking errors.
2.2.2.2 Correcting offset F2 measurements for talker and context
We used the technique above to estimate F2 at the vowel offset for two datasets. The first dataset comprises the 671 target-trigger bigram tokens described above. In order to determine whether each of these measurements is typical (and thus likely precedes a canonical /n/) or atypical (and thus likely precedes an assimilated /n/) for each talker and vowel type, we collected a second dataset comprising /Vn/ rimes which precede vowels or /h/ in the corpus (using exclusion criteria 1–2 above). We expect that the /n/ codas in this second dataset should be realized as fully canonical, and therefore the F2 values at the vowel offsets in this dataset represent the distribution that would be expected for unassimilated /n/ given the preceding vowel and talker.
Next, we used the F2 measurements from these two datasets to calculate an F2 value for the observed target-trigger token, standardized within-talker and within-vowel.⁶ For example, in the sequence ten bucks, we took the observed F2 value at the vowel offset, and compared it to the mean F2 value at the vowel offset for all tokens of /ɛn/ produced by that talker, including both assimilatory (as in ten bucks) and non-assimilatory contexts (/Vn#V/, such as in then again). If the observed F2 in ten bucks is one standard deviation above that mean, it receives a value of 1. If the observed F2 in ten bucks is one-half standard deviation below that mean, it receives a value of –0.5.
This score is negative when the observed F2 at the vowel offset is below the F2 at the vowel offset that would be expected given the talker and vowel type. The score is positive when the observed F2 is above the expected F2. Figure 2 shows the assimilation scores for the target-trigger tokens for each vowel type, divided into three categories depending on whether the corpus annotators transcribed each token as canonical [n], assimilated [m], or assimilated [ŋ]. As expected, the assimilation scores tend to be lower when the coda /n/ was transcribed as labial [m], and higher when it was transcribed as velar [ŋ].
Here we are interested in modeling the degree of partial assimilation. For velar assimilation contexts, such as in Columbus, these assimilation scores serve as the intended metric of partial assimilation: They are higher in categorically-assimilated productions, and closer to zero in more canonical productions. Large negative values of this metric are thus unexpected for velar assimilation contexts, which is in fact what we found (range –2.12 to 3.84, quartiles at –0.14, 0.69, 1.32 for velar assimilation contexts).
For labial assimilation contexts, such as ten bucks, these assimilation scores have the opposite sign: Since F2 tends to be lower preceding a labial consonant, the standardized value is lower in assimilated productions, and closer to zero in typical, canonical productions. Thus, we inverted the assimilation scores in labial contexts, so that higher values of the metric are always associated with assimilated productions, regardless of whether the trigger was labial or velar. These adjusted scores (i.e., the transformation of the standardized F2 values) served as our metric of partial assimilation.
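The standardization and sign inversion described in this subsection can be sketched as below. This is a hedged illustration: `assimilation_score` is a hypothetical name, the reference values are invented, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev
from typing import Sequence

LABIAL_TRIGGERS = {"p", "b"}


def assimilation_score(observed_f2: float,
                       reference_f2: Sequence[float],
                       trigger_onset: str) -> float:
    """Standardize offset F2 within talker and vowel, then flip labial contexts.

    reference_f2: offset F2 values (Hz) for the same talker and vowel type.
    """
    mu = mean(reference_f2)
    sd = stdev(reference_f2)  # assumption: sample standard deviation
    z = (observed_f2 - mu) / sd
    # F2 falls into labials, so the sign is inverted to make higher values
    # indicate more assimilation for both labial and velar triggers.
    return -z if trigger_onset in LABIAL_TRIGGERS else z


# Invented reference distribution for one talker's /ɛn/ tokens (mean 1500, SD 100).
reference = [1400.0, 1500.0, 1600.0]
velar_score = assimilation_score(1600.0, reference, "k")
labial_score = assimilation_score(1450.0, reference, "b")
```

With these invented numbers, an offset F2 one standard deviation above the mean before a velar trigger and one half a standard deviation below the mean before a labial trigger both come out positive, which is the intended direction of the metric.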
2.2.2.3 Summary of partial assimilation metric
The assimilation scores capture assimilation well. Figure 3 shows the assimilation scores for the target-trigger tokens for each vowel type, divided into two categories depending on whether the corpus annotators transcribed each token as being assimilated. As expected, tokens that were transcribed as phonetically assimilated have higher assimilation scores (mean = 0.81) than those transcribed as canonical [n] (mean = 0.34), and this is also true within all vowel types in the data. Validating our measure of partial assimilation, the partial assimilation scores were significantly higher for tokens transcribed in the corpus as assimilated (t(199.1) = –4.81, p < .0001).
At the same time, Figure 3 also shows that there is considerable variation in the degree of F2 assimilation within tokens that received the same phonetic transcription, including considerable overlap in F2 values between phonetic categories. This confirms our motivation for the analysis of partial assimilation: Phonetic transcriptions are only part of the story. A stronger, and arguably more appropriate, test of the hypothesis that assimilation is modulated by the predictability of target and trigger words thus should analyze patterns of partial assimilation.
Next, we describe how we obtained estimates of word predictability, within-word predictability, and word frequency from spoken language databases.
2.2.3 Estimating word predictability
Message-oriented phonology predicts that there should be more nasal place assimilation when the target word is predictable and the trigger word is unpredictable (see Figure 1). For each target-trigger bigram token, we calculated two contextual probabilities: the bigram probability of the target word given the trigger, e.g., the backward probability p(ten|bucks) in ten bucks, and the bigram probability of the trigger word given the target, e.g., the forward probability p(bucks|ten) in ten bucks. The first measure captures how inferrable or predictable the target word is given the trigger. Our hypothesis is that if the target word is highly predictable, the target should be more likely to undergo assimilation, since there is relatively less need for the target to be produced reliably with a coronal nasal. The second measure captures how inferrable or predictable the trigger word is given the target. Our hypothesis is that if the trigger word is highly predictable, the target should be less likely to undergo assimilation, since there is relatively less need to enhance the perceptibility of the trigger onset’s place of articulation.
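To make the two conditional probabilities concrete, the sketch below computes them from raw counts in a toy token sequence. It is a simplification: it uses unsmoothed relative frequencies and ignores utterance boundaries, and `bigram_probabilities` is a hypothetical helper introduced only for this example.

```python
from collections import Counter
from typing import Sequence, Tuple


def bigram_probabilities(tokens: Sequence[str], target: str, trigger: str) -> Tuple[float, float]:
    """Return (forward, backward) relative-frequency estimates for one bigram.

    forward  = p(trigger | target), e.g. p(bucks | ten)
    backward = p(target | trigger), e.g. p(ten | bucks)
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    as_first = Counter(tokens[:-1])   # how often each word starts a bigram
    as_second = Counter(tokens[1:])   # how often each word ends a bigram
    joint = bigrams[(target, trigger)]
    forward = joint / as_first[target] if as_first[target] else 0.0
    backward = joint / as_second[trigger] if as_second[trigger] else 0.0
    return forward, backward


# Toy data, invented for illustration.
toy_tokens = "ten bucks for ten dollars and five bucks".split()
forward_p, backward_p = bigram_probabilities(toy_tokens, "ten", "bucks")
```

In the toy sequence, ten is followed by bucks in one of its two occurrences, and bucks is preceded by ten in one of its two occurrences, so both estimates are 0.5.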
Following standard practice, we can estimate the bigram predictability as experienced by a typical speaker in our data set. There are two related and competing challenges that any such estimation needs to keep in mind. The first challenge is that the distribution of words differs based on genre, register, mode, audiences, etc. Since we are analyzing the effect of bigram predictability on assimilation in (a type of) conversational speech, it is thus preferable to obtain our bigram estimates from similar, if not the same, data. In fact, from this perspective, using only the Buckeye Corpus to estimate bigram predictability would seem best. However, this would conflict with a second challenge: The quality of estimates increases with the amount of the data that they are based on. From this perspective, it is thus desirable to use larger corpora to estimate bigram predictability.
We address these two challenges by obtaining bigram probability estimates from both the Buckeye Corpus and a much larger conversational speech corpus (Fisher Part 2, Cieri, Graff, Kimball, Miller, & Walker, 2005). The Fisher Part 2 corpus represents a linguistic genre similar to that of the Buckeye Corpus (informal one-on-one conversations) but contains over 30 times as many words as Buckeye (about 11,000,000 versus about 300,000 words). The 1137 target-trigger bigram tokens used for the main analyses were held out from the Buckeye Corpus, and not used to estimate bigram probabilities.⁷
The bigram probability estimates for each corpus were obtained using SRILM (Stolcke, 2002; Stolcke, Zheng, Wang, & Abrash, 2011) with the modified Kneser-Ney smoothing procedure (Chen & Goodman, 1998), which uses three discounting parameters per model. Kneser-Ney smoothing provides a reliable way to estimate probabilities for unseen words.
The two estimates for each bigram were then combined by taking a weighted average. The weight was chosen to optimize perplexity on 10% of the Buckeye Corpus, which was held out from the estimation only for the purposes of optimizing this weight parameter. This step ensures that the resulting bigrams are good estimates of words’ predictability in the Buckeye Corpus—i.e., the broader linguistic context (topic distribution, register, etc.) for which we analyze assimilation.
The procedure (estimating smoothed Buckeye and Fisher language models, and combining them using an optimized weight parameter) was repeated twice: once using the two corpora as-is, to estimate forward trigger bigram probabilities, and once using the two corpora in reverse, to estimate backward target bigram probabilities. For the forward probabilities, the optimal weights were 0.514 for Buckeye data and 0.486 for Fisher data; for the backward probabilities, the optimal weights were 0.511 for Buckeye and 0.489 for Fisher. In other words, our estimates are based in roughly equal parts on the Buckeye and Fisher data (reflecting the trade-off between the two pressures described above).
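The weighting step can be illustrated with a small sketch that mixes two models' probabilities for the same held-out tokens and picks the weight with the lowest perplexity. The grid search used here is a stand-in chosen for clarity and is an assumption; the text does not specify which optimizer was used. The function names and the toy probabilities are likewise hypothetical.

```python
import math
from typing import Sequence, Tuple


def perplexity(probs: Sequence[float]) -> float:
    """Perplexity of a sequence of (strictly positive) token probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def best_mixture_weight(p_buckeye: Sequence[float],
                        p_fisher: Sequence[float],
                        steps: int = 100) -> Tuple[float, float]:
    """Grid-search the Buckeye weight w in [0, 1] that minimizes held-out perplexity.

    Each input lists the probability one smoothed model assigns to each
    held-out token; the mixture is w * p_buckeye + (1 - w) * p_fisher.
    """
    best_w, best_ppl = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        mixed = [w * a + (1 - w) * b for a, b in zip(p_buckeye, p_fisher)]
        ppl = perplexity(mixed)
        if ppl < best_ppl:
            best_w, best_ppl = w, ppl
    return best_w, best_ppl


# Toy held-out probabilities, invented: each model is good on a different token.
toy_buckeye = [0.5, 0.01]
toy_fisher = [0.01, 0.5]
weight, held_out_ppl = best_mixture_weight(toy_buckeye, toy_fisher)
```

In this symmetric toy case the best weight is 0.5, which mirrors the roughly equal weights reported above, although the real weights depend on the actual held-out data.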
We note that this estimation procedure is independent of the analyses of assimilation presented below: The smoothing and interpolation parameters were fit so as to optimize word prediction in held-out Buckeye data (so as to best approximate word predictability as experienced by a typical speaker in the Buckeye Corpus), not to maximize the effect of predictability on assimilation.
2.2.4 Estimating within-word predictability
Communicative accounts also predict effects of within-word segment predictability on assimilation. We thus obtained estimates of within-word predictability, defined as the probability of the target or trigger segment (i.e., word-final /n/ or word-initial /p, b, k, g/) conditioned on the preceding segments within the word (following van Son & Pols, 2003; Cohen Priva, 2008). For example, in a word like between, the final target /n/ is entirely predictable given the preceding string /bɪtwi/, because in every English word that begins with the string /#bɪtwi/, that string is followed by /n/. For a word like run, however, the final target /n/ is much less predictable, due to the existence of other words like rough, rucksack, rudder, and so on. These probability estimates were calculated based on the frequencies of strings of segments within the word tokens in the Buckeye Corpus mixed with the Fisher Corpus (see Section 2.2.5).
The within-word probability of word-initial trigger segments /p, b, k, g/ tends to be lower than for word-final target segments, because the trigger segment is conditioned only on a word boundary. In particular, the probability of a word-initial /p/ was calculated to be 5.2%, a word-initial /b/ was 4.7%, a word-initial /k/ was 9.6%, and a word-initial /g/ was 3.0%. The probability of word-initial /k/ is higher compared to the other word-initial segments because of the high number of relatively frequent words that begin with /k/ (can, can’t, come, etc.), and vice-versa for /g/, with /p, b/ in between. Thus, segment-based communicative accounts predict that /k/ is less likely to trigger assimilation—since it is relatively more predictable as an onset, and thus contributes less to the identifiability of the trigger word—whereas /g/ is relatively more likely to do so.
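As a rough illustration of the definition above, the following sketch computes the probability of a segment given the preceding segments of the word from token frequencies. The toy lexicon, its frequencies, and the function name are invented for illustration. One assumption is that the denominator counts all tokens of words beginning with the prefix; another is that the corpus mixture weights are ignored.

```python
def segment_predictability(lexicon, prefix, segment):
    """P(segment | preceding segments), weighted by token frequency.

    `lexicon` maps a tuple of segments to a token frequency. An empty
    prefix stands for the word boundary (#).
    """
    prefix = tuple(prefix)
    target = prefix + (segment,)
    prefix_count = sum(f for s, f in lexicon.items() if s[:len(prefix)] == prefix)
    target_count = sum(f for s, f in lexicon.items() if s[:len(target)] == target)
    if prefix_count == 0:
        raise ValueError("prefix does not occur in the lexicon")
    return target_count / prefix_count

# Toy lexicon with invented frequencies.
toy_lexicon = {
    ("b", "ɪ", "t", "w", "i", "n"): 50,   # between
    ("r", "ʌ", "n"): 30,                  # run
    ("r", "ʌ", "f"): 10,                  # rough
    ("r", "ʌ", "d", "ɚ"): 5,              # rudder
    ("r", "ʌ", "k", "s", "æ", "k"): 5,    # rucksack
    ("b", "ʌ", "k", "s"): 20,             # bucks
}

p_n_between = segment_predictability(toy_lexicon, ("b", "ɪ", "t", "w", "i"), "n")
p_n_run = segment_predictability(toy_lexicon, ("r", "ʌ"), "n")
p_b_initial = segment_predictability(toy_lexicon, (), "b")
```

In this toy lexicon the final /n/ of between is fully predictable, whereas the final /n/ of run competes with the continuations found in rough, rudder, and rucksack.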
2.2.5 Estimating word frequency
Estimates of unigram probability (word frequency) were obtained from a weighted mixture of the Buckeye and Fisher corpora, following the same methods for the word predictability estimates described above. Since Kneser-Ney smoothing is not applicable to unigrams, smoothed estimates for unseen words (those with zero tokens in the training data) were generated by re-dividing the total probability mass for all word types with one token in the training data equally among all types with zero or one tokens. The optimal weights were 0.967 for Buckeye and 0.033 for Fisher. Compared to the word predictability estimates obtained above, which had about equal weights for Buckeye and Fisher, our frequency estimates rely almost exclusively on the Buckeye Corpus. This is not surprising. The average number of tokens that go into a bigram predictability estimate is much smaller than the average number of tokens that go into a (unigram) frequency estimate. This means that, for word frequency estimation, the second of the two challenges described above is ameliorated. Essentially, the number of words in the Buckeye Corpus is sufficient to robustly estimate word frequency, but not quite sufficient to robustly estimate contextual predictability, such that the predictability estimation benefits from the additional data in Fisher despite the fact that Fisher has a somewhat different bigram distribution than Buckeye. We note that this is a general asymmetry between unigram frequency and bigram predictability estimates, and not specific to our study. We also note that this might create a bias against word predictability, in favor of word frequency, as the former estimates are based on less domain-specific (Buckeye) data (for relevant discussion, see also Cohen Priva & Jaeger, 2018).
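The treatment of unseen words described above can be sketched as follows. This is a minimal illustration under stated assumptions: the list of unseen word types is supplied by the caller, the counts are toy values, and the function name is hypothetical. Types with two or more tokens keep their relative frequency, while the mass of the singleton types is shared equally among all types with zero or one tokens.

```python
def smoothed_unigram(counts, unseen_words):
    """Redistribute singleton probability mass over singleton and unseen types."""
    total = sum(counts.values())
    vocabulary = set(counts) | set(unseen_words)
    # Types that share the pooled mass: those with zero or one tokens.
    sharing = [w for w in vocabulary if counts.get(w, 0) <= 1]
    n_singletons = sum(1 for w in sharing if counts.get(w, 0) == 1)
    shared_mass = n_singletons / total
    probs = {}
    for word in vocabulary:
        c = counts.get(word, 0)
        if c <= 1:
            probs[word] = shared_mass / len(sharing)
        else:
            probs[word] = c / total
    return probs

# Toy counts, invented for illustration only.
toy_counts = {"the": 5, "ten": 2, "bucks": 1, "rum": 1}
probs = smoothed_unigram(toy_counts, unseen_words=["rucksack"])
```

With these toy counts, the two singleton types hold 2/9 of the mass, which is split three ways so that bucks, rum, and the unseen rucksack each receive 2/27; the distribution still sums to one.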
2.2.6 Correlations among predictor variables
Following common practices, all frequency and predictability estimates were log-transformed prior to analysis. The predictor variables discussed here tend to exhibit some degree of collinearity, which can make it difficult to tease apart the individual contribution of any one variable. Figure 4 shows correlation matrices for all six variables. The left panel depicts the data used in our analysis of categorical assimilation; the right panel depicts only the subset of data used in our analysis of partial assimilation. For both datasets, the strongest correlations were obtained between the measures of target word frequency and target word predictability and between trigger word frequency and trigger word predictability. These relationships are not surprising, given the definition of these variables. The other correlations are relatively low and follow expected patterns.
We conducted two sets of analyses using multilevel (mixed-effects) regression. The first analysis set used logistic multilevel regression to predict transcribed categorical assimilation (assimilated or non-assimilated) as a dichotomous outcome for each target-trigger sequence. The second set of analyses used linear multilevel regressions to predict the acoustic measure of assimilation, as described in Section 2.2.2. Each set of regression analyses included three sets of parameters, as described in Section 2.2.3: (A) target and trigger word frequency (p(ten) and p(bucks)); (B) target and trigger word predictability (conditional probabilities p(ten|bucks) and p(bucks|ten)); and (C) within-word predictability (conditional segment probabilities p(/n/|/#tɛ/) and p(/b/|#)).
The logistic regressions included by-speaker and by-target intercepts. The linear regressions included by-target intercepts, but not by-speaker intercepts, since the acoustic measure of assimilation was already standardized within-speaker.
Models were constructed with all seven possible combinations of the three sets of parameters: that is, one model with all three sets (frequency, lexical predictability, and within-word predictability); three models with just a single set; and three models with two sets of parameters (e.g., within-word and lexical predictability). This step allowed us to directly compare the relative explanatory power of the various predictors and to find the most parsimonious model.
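The seven model specifications are simply the non-empty subsets of the three parameter sets. A short sketch, using the abbreviations WordFreq, WordPred, and SegPred as labels (the function name is hypothetical, and the order of the resulting list is not meant to match any table):

```python
from itertools import combinations

PARAMETER_SETS = ("WordFreq", "WordPred", "SegPred")

def model_specifications(sets=PARAMETER_SETS):
    """Enumerate every non-empty combination of parameter sets, largest first."""
    specs = []
    for size in range(len(sets), 0, -1):
        specs.extend(combinations(sets, size))
    return specs

models = model_specifications()
labels = [" + ".join(spec) for spec in models]
```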
We used Bayesian inference to estimate the full probability distributions for the effect sizes of these factors on nasal place assimilation. The posterior distribution of the regression parameters was estimated by Markov Chain Monte Carlo, implemented in Stan using the rstanarm R package for Bayesian regression modeling (R Core Team, 2016; Stan Development Team, 2016a, 2016b). We estimated the posterior density for each effect using a symmetrical, weakly-informative prior (the default in rstanarm version 2.17.4: a normal distribution with µ = 0 and σ = 2.5, scaled by the standard deviation of each predictor variable). By doing so, we are able to assess our relative confidence in a positive versus a negative effect for each of the crucial predictors without bias. Our predictions were that the target probability should have more density associated with a positive effect, while the trigger probability should have more density associated with a negative effect. That is, if ten is contextually predictable, it is more likely to be reduced via assimilation, while if bucks is contextually predictable, it is less likely to be enhanced via assimilation.
Model fit was evaluated by leave-one-out cross validation, which can be used to derive an expected logarithm-transformed pointwise predicted density (ELPD), an approximate measure of how accurately the model predicts the data. The ELPD is conceptually similar to the Akaike information criterion in that it can be used to compare models with different sets of predictors, and is the recommended practice for Bayesian model comparisons (Vehtari, Gelman, & Gabry, 2017). An ELPD closer to zero indicates a better model fit when considering model complexity.
3.1 Categorical assimilation
Table 2 presents a summary of the results of the models for the analysis of transcribed nasal place assimilation. Each column shows one of the seven models calculated. Model terms are abbreviated as follows: WordFreq is target and trigger word frequency; WordPred is target and trigger word predictability; and SegPred is target and trigger within-word predictability (or ‘segmental predictability’). The model preferred by the ELPD is shown in bold, but we note that all models perform very similarly. We thus show the outcomes of all models in Figure 5. This is also in the spirit of Bayesian analyses in that it visualizes the uncertainty in the effects across different models.
| Parameter | Measure | WordFreq + WordPred + SegPred | WordFreq + WordPred | WordPred + SegPred | WordFreq + SegPred | WordFreq | WordPred | SegPred |
|---|---|---|---|---|---|---|---|---|
| Target frequency (log-transformed) | Posterior density > 0 | 0.635 | 0.760 | | 0.905 | 0.854 | | |
| Trigger frequency (log-transformed) | Posterior density < 0 | 0.810 | 0.804 | | 0.896 | 0.733 | | |
| Target predictability (log-transformed) | Posterior density > 0 | 0.830 | 0.576 | 0.998 | | | 0.966 | |
| Trigger predictability (log-transformed) | Posterior density < 0 | 0.316 | 0.209 | 0.676 | | | 0.488 | |
| Target within-word predictability (log-transformed) | Posterior density > 0 | 0.983 | | 0.989 | 0.979 | | | 0.950 |
| Trigger within-word predictability (log-transformed) | Posterior density < 0 | 1.000 | | 0.999 | 0.999 | | | 0.998 |
The rows of Table 2 depict values relating to each of the six parameters: target word frequency, trigger word frequency, target word predictability, trigger word predictability, target within-word predictability, and trigger within-word predictability. For each parameter, there are three values of interest. The first is the proportion of the posterior density in the predicted direction. Recall that our hypothesis predicts that the posterior density should be positive for the target word parameters and negative for the trigger word parameters. This proportion value therefore depicts the extent to which the results are consistent with our hypothesis: 1.0 would mean that the entire distribution is in the predicted direction, constituting the strongest possible evidence for our hypothesis; 0.5 that the distribution is exactly split around zero; and 0.0 that the entire distribution is in the direction opposite to the one predicted, constituting the strongest possible evidence against our hypothesis.
The other two values of interest give the range of the 95% highest posterior density interval (HPDI). Broadly speaking, we can interpret this interval as having a 95% chance of including the true value of the parameter, under the assumptions made in the analysis. This interval does not have the same interpretation as a frequentist confidence interval (Morey, Hoekstra, Rouder, Lee, & Wagenmakers, 2016). Finally, the last row of Table 2 shows the ELPD value for each of the models.
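Both kinds of summary can be computed directly from posterior draws. The sketch below is illustrative only: it uses simulated draws rather than output from the fitted models, the function names are hypothetical, and the interval is obtained by the common approach of taking the narrowest window that contains the requested share of the sorted draws.

```python
import math
import random

def proportion_in_direction(draws, direction):
    """Share of draws above zero (direction=+1) or below zero (direction=-1)."""
    hits = sum(1 for d in draws if d * direction > 0)
    return hits / len(draws)

def hpdi(draws, mass=0.95):
    """Narrowest interval containing at least `mass` of the draws."""
    ordered = sorted(draws)
    n = len(ordered)
    window = max(1, math.ceil(mass * n))
    # Slide a window of fixed size over the sorted draws; keep the narrowest.
    start = min(range(n - window + 1),
                key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[start], ordered[start + window - 1]

# Simulated draws for one coefficient (not values from the fitted models).
random.seed(1)
toy_draws = [random.gauss(0.5, 0.25) for _ in range(4000)]
share_positive = proportion_in_direction(toy_draws, +1)
low, high = hpdi(toy_draws)
```

For a target parameter the relevant proportion is the share of draws above zero; for a trigger parameter the same function is called with direction=-1.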
For the target, we find strong support for the predicted effects of word and within-word predictability. The majority of the posterior densities for these parameters are positive and most of the 95% HPDI values do not overlap with zero. This result is consistent with the predictions of communicative accounts: If p(ten|bucks) (or p(n|#tɛ)) is high, then ten is contextually predictable, and thus more easily inferrable. Because of this, the phonetic cues to the coda /n/ are less important for robust communication, and nasal place assimilation (which reduces cues) is more likely.
For the trigger, however, we only observed strong evidence in the hypothesized direction for the within-word predictability parameter. That is, if p(b|#) is high, then the /b/ of bucks is contextually predictable and thus there is less motivation for assimilation to occur. For the word predictability p(bucks|ten), the results do not provide strong evidence in favor of either a negative or a positive effect.
We note that the evidence for effects of frequency is diminished in the models with parameters for word predictability. This is always the case for target frequency (see “WordFreq+WordPred” vs. “WordFreq” and “WordFreq+WordPred+SegPred” vs. “WordFreq+SegPred”), and sometimes for trigger frequency (“WordFreq+WordPred+SegPred” vs. “WordFreq+SegPred”). Likewise, the evidence for the effects of word predictability is diminished in all models which include parameters for frequency (model “WordPred+SegPred” vs. “WordFreq+WordPred+SegPred” and model “WordPred” vs. “WordFreq+WordPred”). This is expected given the correlations between word frequency and predictability (see Figure 4), and highlights the need to control for both types of variables when assessing one of them (see Cohen Priva & Jaeger, 2018). Overall, our analysis of categorical assimilation presents more evidence for effects of word predictability than for effects of word frequency. The evidence for effects of within-word predictability is strong in all models.
In summary, the analysis of categorical assimilation finds strong evidence for three of the four predictability effects hypothesized by communicative accounts, and no conflicting evidence. And, regardless of what combination of factors is considered, we find that the likelihood of assimilation seems to be governed not only by the inferrability of the target word, but also by that of the trigger word.
3.2 Partial assimilation
Table 3 and Figure 6 present a summary of the results of the models for the analysis of partial nasal assimilation. This table and figure are organized in the same way as Table 2 and Figure 5. The results partly replicate and complement the results of the analysis of categorical assimilation. We find support for effects of both frequency and predictability for both targets and triggers. Unlike for the analysis of categorical assimilation, we do not find support for within-word predictability.
| Parameter | Measure | WordFreq + WordPred + SegPred | WordFreq + WordPred | WordPred + SegPred | WordFreq + SegPred | WordFreq | WordPred | SegPred |
|---|---|---|---|---|---|---|---|---|
| Target frequency (log-transformed) | Posterior density > 0 | 0.548 | 0.516 | | 0.865 | 0.858 | | |
| Trigger frequency (log-transformed) | Posterior density < 0 | 0.496 | 0.475 | | 0.883 | 0.885 | | |
| Target predictability (log-transformed) | Posterior density > 0 | 0.699 | 0.726 | 0.89 | | | 0.888 | |
| Trigger predictability (log-transformed) | Posterior density < 0 | 0.755 | 0.788 | 0.948 | | | 0.955 | |
| Target within-word predictability (log-transformed) | Posterior density > 0 | 0.743 | | 0.756 | 0.747 | | | 0.712 |
| Trigger within-word predictability (log-transformed) | Posterior density < 0 | 0.374 | | 0.403 | 0.354 | | | 0.247 |
As with the categorical analysis, effects of word frequency and word predictability again were diminished when included in the same model, as expected given their correlation. The ELPD suggests that there is more evidence for an effect of word predictability than frequency, although the ELPD differences among the models are modest.
The analysis of partial assimilation thus provides strong support for two of the predictability effects hypothesized by communicative accounts, and little to no conflicting evidence.
4. Discussion and conclusion
Our analyses of word-final /n/ in assimilation-licensing contexts in the Buckeye Corpus suggest a strong role of predictability in mediating nasal place assimilation. Our categorical analysis of the corpus’ phonetic transcriptions demonstrates that annotators were more likely to transcribe assimilation when the predictability of the assimilation target (e.g., ten) given the trigger (e.g., bucks) was high. Our acoustic analysis of the same tokens additionally finds that the degree of place assimilation apparent in F2 was greater when the predictability of the trigger given the target was low. Our analyses also demonstrate that higher within-word target predictability led to more transcribed assimilation, and that higher within-word trigger predictability led to less transcribed assimilation (we return to the asymmetry in the results between categorical and partial assimilation below).
Our results also speak to the relative plausibility of different usage-based explanations of assimilation. Model comparison suggested that the models with word predictability, rather than frequency, were preferable. This result implies that the observed frequency effects are likely to be predictability effects. For reasons described in the method section, frequency estimates tend to be more robust than predictability estimates, creating a bias against detecting predictability effects. The fact that we nevertheless find word predictability to be a better predictor of assimilation than word frequency is thus quite informative. At first blush, this result may seem to be at odds with those of Ernestus, Lahey, Verhees, and Baayen (2006), who observed effects of frequency in Dutch voicing coarticulation. However, Ernestus and colleagues investigated effects of collocation frequency, which is more closely related to the type of bigram predictability measure we used here. We also note that Ernestus and colleagues, like the bulk of studies on reduction and other pronunciation variation, did not include controls for predictability in their model (for exceptions, see e.g., Ernestus, Hanique, & Verboom, 2015; Gahl & Garnsey, 2004; Gahl et al., 2012; Scarborough, 2010; Seyfarth, 2014). This omission is potentially critical, as recent computational work suggests: In the absence of predictability controls, effects of predictability are very likely to lead to spurious frequency effects (Cohen Priva & Jaeger, 2018). Our results thus echo Cohen Priva and Jaeger’s call for further studies that assess the effects of frequency and predictability simultaneously.
Crucially, our results are inconsistent with any account that attributes reduction or enhancement solely to routinization or repetition. Under those hypotheses, we would expect more assimilation when the target as well as the trigger are frequent. However, inasmuch as we found evidence for frequency effects, more frequent trigger words led to less assimilation (mirroring the results for word predictability), contrary to accounts that focus on routinization or repetition as the source of reduction.
These results are, however, consistent with a communicative approach where the phonetic form of messages (e.g., words) can be shaped by biases towards accurate message transmission and low resource cost (e.g., Buz et al., 2016; Lindblom, 1990; Hall et al., 2016). All else being equal, cues to a given word may be either reduced or enhanced depending on their potential contribution to successfully conveying the word. Along these lines, reducing the coronal place cue in words that are predictable from the following word promotes cost-effective communication; there would be little communicative gain in investing resources in the clear production of the nasal given that the word it occurs in is predictable. Conversely, increasing redundancy in the trigger word is advantageous when the word has low contextual predictability. Similar explanations can account for a large range of reduction/enhancement phenomena during language production (for a recent overview, see Jaeger & Buz, 2017). This includes implicit decisions during not only phonetic and phonological encoding, but also during morphological (e.g., don’t vs. do not, Frank & Jaeger, 2008; optional case-morphology, Kurumada & Jaeger, 2015), lexical (Mahowald, Fedorenko, Piantadosi, & Gibson, 2013) or syntactic encoding (e.g., optional argument or function word omission, Jaeger, 2010; Resnik, 1996; Wasow, Levy, Melnick, Zhu, & Juzek, 2015). Though many questions remain, a new generation of experimental and historical studies has now also begun to investigate in more depth the link between such implicit decisions during language processing and communicative effects on the phonology, lexicon, and grammar over historical time (e.g., Fedzechkina, Newport, & Jaeger, 2017; Kanwal, Smith, Culbertson, & Kirby, 2017; Sóskuthy & Hay, 2017; Wedel, Nelson, & Sharp, 2018).
One possible mechanism for the observed results relies on the relative prosodic weights of the target word and the trigger word. It is known that prosodic boundaries between adjacent words can block variable assimilation processes (Kuzla, Cho, & Ernestus, 2007); and that contextual predictability plays a role in the relative prosodic prominence of individual words, independently of information-structural considerations like focus (Brenier, Nenkova, Kothari, Whitton, Beaver, & Jurafsky, 2006; Turnbull, 2017). However, this explanation is difficult to reconcile with our specific pattern of results, where we observed high predictability targets having more assimilation but high predictability triggers having less assimilation. Any proposed prosodic mechanism would have to strengthen the prosodic boundary preceding a predictable word (to account for the trigger word pattern), and have to weaken the prosodic boundary following a predictable word (to account for the target word pattern). While this mechanism is not outside of the realm of possibility, it lacks the clear motivation and intuitive appeal of a communicative approach couched in terms of enhancing and reducing cues to word identity.
We found evidence for effects of both word and within-word predictability. The effect of the former was comparable for both categorical and partial assimilation, except that we did not find evidence that trigger word predictability affected categorical assimilation. The effect of within-word predictability differed between the two analyses to a greater extent. For categorical assimilation, within-word predictability of both the target and the trigger was strongly associated with effects in the hypothesized directions. For partial assimilation, there was no evidence for an effect of within-word predictability in any direction. These findings could indicate different mechanisms for full assimilation versus partial assimilation, although further work is needed to assess this hypothesis.
Future work is also required to tease apart possible explanations for the effects of within-word predictability. Communicative accounts that link phonological processes to a bias for robust message transmission, and thus inferrability, can be formulated with regard to different levels of representation. This includes the transmission of phonological segments (Cohen Priva, 2015) or the transmission of meaning-bearing units (e.g., words, Hall et al., 2016), or both (for discussion, see Hall et al., 2016). While the present study would seem to provide evidence for both formulations, the effects of within-word predictability can also be interpreted with reference to word (rather than segment) inferrability. In particular, the predictability of the last segment of the target word also describes the target word’s predictability given the segmental context.
Another area for future work concerns the sensitivity of assimilation to its potential effects on lexical ambiguity, that is, whether or not applying assimilation to the target word will make it homophonous with another word. For example, assimilation of ten bucks to [tɛm bʌks] is unambiguous, as /tɛm/ is not an English word; but in a sequence like a quick run picks you up, nasal assimilation may not necessarily enhance the likelihood of communicative success due to the presence of the word rum in the lexicon (Gaskell & Marslen-Wilson, 2001). Recent work suggests that minimal pairs may indeed play a role in determining articulation (Buz & Jaeger, 2016; Buz et al., 2016; Wedel et al., 2018). While the present data do not realistically allow for such an investigation (only 11% of all word sequence types have a /m/ or /ŋ/-final competitor for the target word, implying low statistical power), we believe that this line of investigation holds a great deal of promise.
More broadly, these results expand the set of factors shown to influence patterns of assimilation. While previous research has shown the importance of articulatory and perceptual factors, the probabilistic nature of assimilation has received comparatively little attention (but see Coetzee, 2016; Coleman, Renwick, & Temple, 2016; Dilley & Pitt, 2007; Steele, Colantoni, & Kochetov, in press). Indeed, the amount of data is rather limited, especially given that the effects we are assessing are predicted to be relatively subtle (see Hall, Hume, Jaeger, & Wedel, 2018 among others). The finding that the contextual predictability of target and trigger words plays a role in predicting place assimilation thus adds to the growing body of evidence showing that sound patterns are shaped by statistics of the lexicon (for review see Cohen Priva, 2017a, inter alia).
In this paper we explored the extent to which the contextual predictability of words conditions nasal place assimilation. While previous work has shown that reductive sound processes can be associated with high predictability, here we argued that assimilation, which enhances a trigger at the expense of a target, is conditioned by both high predictability of the target and low predictability of the trigger. This fits within a communicative, message-oriented approach to phonology: Assimilation shifts redundancy from the target to the trigger, and thus can be used to manage the relative inferrability of target and trigger.