This paper examines two plausible mechanisms supporting sound category adaptation: directional shifts towards the novel pronunciation or a general relaxation of category criteria. Focusing on asymmetries in adaptation to the voicing patterns of English coronal fricatives, we suggest that typology or synchronic experience affects adaptation. A corpus study of coronal fricative substitution patterns confirmed that North American English listeners are more likely to be exposed to devoiced /z/ than voiced /s/. Across two perceptual adaptation experiments, listeners in test conditions heard naturally produced devoiced /z/ or voiced /s/ in critical items within sentences, while control listeners were exposed to identical sentences with canonical pronunciations. Perceptual adaptation was tested via a lexical decision test, with devoiced /z/ or voiced /s/, as well as a novel alveopalatalized pronunciation, to determine whether adaptation was targeted in the direction of the exposed variant or reflected a more general relaxation. Results indicate there was directional and word-specific adaptation for /z/-devoicing with no evidence for generalization. Conversely, there was evidence of /s/-voicing generalizing and eliciting general category relaxation. These results underscore the role of perceptual experiences, and support an evaluation stage in perceptual learning, where listeners assess whether to update a representation.
Due to the enormous amount of linguistic and social information simultaneously available in the speech stream, speech perception is a challenging task. When considering the variability listeners encounter within and across talkers, the act of understanding spoken language becomes an even more complex enterprise. Both within- and between-talker variation can arise due to differences in physiology (e.g.,
How do listeners adapt to and accommodate variable pronunciations? Listeners’ rapid adaptation abilities have been studied under the umbrellas of perceptual learning or phonemic recalibration—behaviours that allow for listeners’ adjustments to phonemic categories. In the lab, adaptive behaviours have been extensively tested using an experimental paradigm termed lexically-guided perceptual learning, which generally functions as follows: Listeners are exposed to novel or phonetically ambiguous pronunciations of a particular sound in lexical contexts that provide the linguistic scaffolding to guide the interpretation of the intended word. Norris, McQueen, and Cutler (
There seem to be some limitations to perceptual learning, and further identifying and articulating (some of) those limitations is a goal of the current study. Adaptation seems to be limited, for example, if the phonetic ambiguity of the crucial phoneme is extreme, as is the case of heavily non-native-accented speech, where listeners also do not show adaptation (
Taking advantage of the natural variation in stop realization within and across languages, scholars have identified some additional limits to the generalization of what is learned. For example, native English listeners naïve to Dutch-accented English not only show adaptation to the devoiced final-stops in Dutch-accented English, but can generalize voiced stop devoicing to initial position as well, at least when not presented with counterevidence to initial devoicing patterns (
Adaptation has also been found for vowels (
The specificity of the category—for example whether what is learned is tied to a phoneme-level, an allophonic-level, or an even more surface-level of representation—appears to be sensitive to the exposure conditions (
It is important to broach the question of how degree of deviation might limit or constrain the ease with which listeners adapt—or even whether they adapt at all. There are limitations on the acoustic similarity of what kind of substitution is acceptable while still allowing for perceptual adaptation (e.g.,
Lastly, we can also ask whether all speech sounds are updated equivalently in perceptual learning paradigms. The answer to this question appears to be no. Zhang and Samuel (
This overview of the theoretical and empirical literature on adaptation leads to the objectives for the current study, which are multifaceted. We seek to (i) directly test whether asymmetries in sound patterns affect perceptual adjustments, and (ii) assess whether the adaptation mechanism is targeted in the direction of the exposed pronunciation or reflects a general relaxing of criteria or both. We investigate a potential asymmetry in perceptual learning by testing the learnability of changes in the voiced and voiceless English coronal fricatives: /z/ and /s/. Specifically, we assess whether listeners perform differently when exposed to naturally-produced devoiced /z/ (perceived as [s]) and voiced /s/ (perceived as [z]). This question has roots in questions about learnability in phonology (e.g., the presence of channel biases in phonology; e.g.,
We use a lexical decision task as a test of perceptual learning, and quantify differences in word endorsement rates across experimental and control groups for items with a voicing change (i.e., /z/-devoicing or /s/-voicing). This paradigm allows us to quantify both learning specificity and the generalization of the learned (or not) voicing change in novel words not presented during the exposure phase. As opposed to the two-alternative forced choice categorization task often used in phoneme recalibration studies, our use of a lexical decision paradigm to quantify learning has the distinct advantage of allowing us to test whether phonemic adjustments are directional, showing increased word endorsement only in the direction of the exposed change, or whether the adjustments involve general relaxation of category boundaries. To assess whether exposure to devoiced /z/ or voiced /s/ results in directional adaptation or general relaxation, we tested word endorsement rates of both [s] and [ʒ] pronunciations in canonical /z/ words. Endorsement of [ʒ] pronunciations in addition to [s] pronunciations would be evidence in support of a general category relaxation mechanism. Likewise, [ʃ] pronunciations in canonical /s/ words are used to test category relaxation in response to exposure to /s/-voicing. Both the voicing change shifts and the place of articulation shifts are pronunciation changes of one feature edit distance from the canonical pronunciation, and as a result are close to the canonical productions. Following our corpus study in Section 2, we provide much more finely honed predictions of listener behaviour.
English coronal fricatives offer a nice test case because they are clearly asymmetrical in their cross-linguistic distribution and within-English voicing patterns. English coronal fricatives also have attested alveopalatalized variants in casual speech, though these are rare (at least in the Buckeye Corpus [
While there is at least one dialect of English that engages in word-initial fricative voicing (West Country English;
Smith (
Thus, not only is there typological evidence that voiced fricatives are more likely to devoice than the reverse, but at least in lab speech, English /z/ is more likely to devoice than /s/ is to voice. There is no literature, to our knowledge, documenting the probability or magnitude of /s/-voicing. This lack of evidence, however, does not mean /s/-voicing does not exist. Thus, to assess /z/-devoicing and /s/-voicing on a categorical level and to quantify listeners’ previous experience with sibilant fricatives and voicing patterns, we consider spontaneous speech, which exhibits large amounts of reduction and pronunciation variation (
We estimate listeners’ previous experience with voiced /s/ and devoiced /z/ by comparing expected citation forms to pronounced forms of voiced and voiceless fricatives in the Buckeye Corpus (
As the goal was to assess the overall probability of categorical voicing changes for all /s/ and /z/, regardless of position in the word (or participation in morphophonological processes), all items with /s/ or /z/ in any position in the expected citation form transcriptions were identified in the Buckeye Corpus. This resulted in 38,934 instances of citation /s/ and 23,076 instances of citation /z/.
The surface behaviour of the coronal fricatives was assessed from the phonetic transcripts. An automated procedure separated items into matches or items that required human decisions. Items where the citation form (e.g., ‘dictionary’ pronunciation) and the phonetically-transcribed pronounced surface form showed perfect alignment (e.g., there were no changes between citation and pronounced forms) were automatically tagged as matches. There were 17,448 perfect matches for /s/ items (44.8%) and 8,831 perfect matches for /z/ items (38.2%). Fricatives were also automatically paired and identified as substitutions in citation and surface forms if the two strings matched in terms of both the number of characters and the identity of all characters
The remaining items were manually matched by annotators in a custom-written command-line program using the following process: The citation transcription was presented with the item’s pronounced transcription. The fricative of interest was visually flagged by a symbol (>) in the citation pronunciation and the characters in the pronounced word were numbered. Using the transcriptions alone (without audio), annotators identified the number in the citation form string associated with the fricative of interest or marked the fricative as deleted from the string.
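The automatic triage described above can be sketched roughly as follows. This is a minimal illustration, not the authors' program: the function name `triage`, the list-of-segments representation, and the reading of the same-length condition (every segment other than the target fricative must be identical) are all assumptions made for this example.

```python
def triage(citation, surface, target):
    """Classify one token as 'match', 'substitution', or 'manual'.

    citation, surface: lists of Arpabet-style segment labels (assumed format).
    target: index of the fricative of interest in the citation form.
    Returns (label, surface_segment_or_None).
    """
    # Perfect alignment: citation and pronounced forms are identical.
    if citation == surface:
        return "match", surface[target]
    # Same length and, by assumption, identical everywhere except the target:
    # the fricative can be paired automatically as a substitution.
    if len(citation) == len(surface):
        others_equal = all(
            c == s
            for i, (c, s) in enumerate(zip(citation, surface))
            if i != target
        )
        if others_equal:
            return "substitution", surface[target]
    # Anything else (deletions, insertions, multiple changes) goes to a human.
    return "manual", None
```

Under these assumptions, a token whose surface form is shorter or longer than its citation form is always routed to manual annotation, which is consistent with deletions appearing only in the manually tagged column of the tables.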
The counts for labels that were automatically and manually applied are shown for /s/ and /z/ words in
Counts of substitutions in IPA and Arpabet for underlying /s/ in the Buckeye Corpus from automatic and manual tagging.
Citation forms with /s/

IPA | Arpabet | Automatically tagged | Manually tagged | Total no. of observations |
---|---|---|---|---|
a | ah | 1 | 3 | 4 |
ɑ | aw | 0 | 1 | 1 |
ʧ | ch | 12 | 32 | 44 |
d | d | 0 | 1 | 1 |
ð | dh | 2 | 2 | 4 |
ɛ | eh | 1 | 6 | 7 |
n̩ | en | 0 | 2 | 2 |
f | f | 0 | 1 | 1 |
h | hh | 3 | 1 | 4 |
ɪ | ih | 0 | 4 | 4 |
i | iy | 0 | 1 | 1 |
ʤ | jh | 0 | 1 | 1 |
k | k | 1 | 1 | 2 |
m | m | 0 | 2 | 2 |
n | n | 0 | 3 | 3 |
p | p | 0 | 1 | 1 |
ɹ | r | 0 | 3 | 3 |
s | s | 17448 | 20060 | 37508 |
ʃ | sh | 180 | 469 | 649 |
t | t | 6 | 20 | 26 |
θ | th | 7 | 11 | 18 |
tʔ | tq | 0 | 1 | 1 |
ɐ | uh | 0 | 1 | 1 |
deletion | xx | 0 | 167 | 167 |
z | z | 143 | 320 | 463 |
ʒ | zh | 0 | 16 | 16 |
TOTALS | | 17804 | 21130 | 38934 |
Counts of substitutions in IPA and Arpabet for underlying /z/ in the Buckeye Corpus from automatic and manual tagging.
Citation forms with /z/

IPA | Arpabet | Automatically tagged | Manually tagged | Total no. of observations |
---|---|---|---|---|
æ | ae | 0 | 1 | 1 |
a | ah | 1 | 10 | 11 |
a͡ɪ | ay | 0 | 3 | 3 |
ʧ | ch | 2 | 1 | 3 |
d | d | 0 | 11 | 11 |
ð | dh | 3 | 0 | 3 |
ɾ | dx | 2 | 8 | 10 |
ɛ | eh | 0 | 2 | 2 |
n̩ | en | 0 | 4 | 4 |
ɹ̩ | er | 0 | 4 | 4 |
f | f | 0 | 1 | 1 |
h | hh | 1 | 0 | 1 |
ɪ | ih | 2 | 12 | 14 |
i | iy | 0 | 8 | 8 |
ʤ | jh | 2 | 2 | 4 |
k | k | 0 | 3 | 3 |
l | l | 0 | 5 | 5 |
n | n | 0 | 21 | 21 |
o͡ʊ | ow | 0 | 2 | 2 |
ɹ | r | 0 | 5 | 5 |
s | s | 2152 | 2118 | 4270 |
ʃ | sh | 56 | 86 | 142 |
t | t | 0 | 3 | 3 |
θ | th | 2 | 3 | 5 |
ɐ | uh | 1 | 2 | 3 |
v | v | 2 | 3 | 5 |
deletion | xx | 0 | 284 | 284 |
z | z | 8831 | 9042 | 17873 |
ʒ | zh | 156 | 227 | 383 |
TOTALS | | 11213 | 11871 | 23084 |
The voicing-matched alveopalatal English fricatives are equally likely to surface: [ʃ] and [ʒ] both occur 1.7% of the time for /s/ and /z/, respectively. This equivalence is useful in our adjudication between directional adaptation and general category relaxation, as both substitutions are attested, but equally unlikely.
The results of this corpus study confirm that, at least for this dialect of North American English, listeners are more likely to be exposed to /z/-devoicing than /s/-voicing in spontaneous speech.
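The asymmetry can be checked directly from the counts in the two tables above. The short sketch below recomputes the relevant substitution rates, using the table totals as denominators; the dictionary layout and the function name `rate` are illustrative choices only.

```python
# Counts copied from the two substitution tables above.
TOTALS = {"s": 38934, "z": 23084}  # all tokens of citation /s/ and /z/
COUNTS = {
    ("s", "z"): 463,    # /s/ realized as [z]  (voicing)
    ("s", "sh"): 649,   # /s/ realized as [ʃ]  (alveopalatalization)
    ("z", "s"): 4270,   # /z/ realized as [s]  (devoicing)
    ("z", "zh"): 383,   # /z/ realized as [ʒ]  (alveopalatalization)
}

def rate(citation, surface):
    """Percentage of citation-form tokens realized as the given surface form."""
    return 100 * COUNTS[(citation, surface)] / TOTALS[citation]

for pair in COUNTS:
    print(pair, f"{rate(*pair):.1f}%")
```

On these counts, /z/ surfaces as [s] in roughly 18.5% of tokens, whereas /s/ surfaces as [z] in roughly 1.2%, and both alveopalatal variants sit near 1.7%.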
We predicted that English listeners would be willing to identify words with /z/-devoicing as words
Conversely, we expected listeners to respond differently to /s/, as fully voiced /s/ is very rare. While we anticipated that exposure to voiced /s/ pronunciations would increase the identification of these items as words, we expected this pattern to be qualitatively different from the word endorsement rates for /z/-devoicing, because listeners have far more experience with /z/-devoicing than with /s/-voicing; the baseline word endorsement rates for /z/-devoicing are predicted to be much higher than those for /s/-voicing. Because /s/-voicing is such a rare change in listeners’ experiences, the talker may essentially be labelled as atypical and the adaptation mechanism may be different, reflecting the listener’s lower degree of certainty that the produced [z] indeed maps onto the /s/ category. Under this general relaxation scenario, listeners may show increased acceptance of [ʃ] pronunciations as well as of novel words with a [z] pronunciation. Given the rarity of [ʒ] for /z/ and [ʃ] for /s/ in spontaneous speech, we anticipated that these items would have lower word endorsement rates than items with the voicing change for listeners in the control groups, who were given no reason to adjust their phoneme boundaries or criteria in the exposure phase. The hypotheses described above are summarized in
Summary of mechanisms, rationale, and predictions for experimental groups across Experiments 2 and 3.
| | Experimental condition of Experiment 2: /z/-devoicing | Experimental condition of Experiment 3: /s/-voicing |
---|---|---|
Hypothesized mechanism | Directional adaptation | General relaxation |
Talker behaviour is … | Expected | Unexpected |
Baseline word endorsement behaviour | Very high rates for devoiced /z/ | Higher than control listeners, but lower than /z/-devoicing |
Listener behaviour for novel words (tests generalization) | Yes, listeners form representation of talker as ‘devoicer’ | Yes, as voicing change falls within general relaxation |
Listener behaviour for place change (tests general relaxation) | No, exposure to devoiced /z/ reinforces prior knowledge | Yes, as place change also falls within general relaxation |
Experiments 2 and 3 target different critical sounds—/z/ and /s/—but share a procedure for creating the stimuli. Section 3 outlines this process and the process of winnowing to a final stimuli list. Relevant acoustic characteristics of the stimuli are also described in this section. Sections 4 and 5 report on the experiments that used these stimuli.
Target lexical items with non-initial /s/ or /z/ were identified. These items contained only a single sibilant fricative. To reduce ambiguity in the lexical frame, these items were confirmed to not be able to form words if the target fricative was replaced by another fricative (see the list of stimuli in the Appendix). Further, all targets had two to four syllables, and were embedded within moderately predictable sentences containing no sibilant fricatives other than the critical /s/ or /z/. Filler sentences (n = 100) were composed with no sibilant fricatives.
An adult female monolingual English speaker produced the sentence and single-word materials. These auditory stimuli were recorded using a head-mounted microphone with a SoundDevices USB PreAMP in a sound-attenuated cubicle at a 44.1 kHz sampling rate with 16-bit depth. Recorded materials were trimmed of extraneous silence and RMS-amplitude normalized to 70 dB, ensuring no clipped samples. Three versions of each critical /s/ and /z/ sentence were recorded, corresponding to the canonical pronunciation (control), (de)voicing, and alveopalatalized item types. To address the potential for unintended mispronunciations and inconsistencies, the speaker produced several instances of each critical sentence and word. The third author selected the final recordings based on perceived clarity and consistency.
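RMS-amplitude normalization with a clipping check can be sketched as below. The paper does not state the software or the reference level behind the 70 dB figure, so the reference value `ref=2e-5` (a convention common in phonetics software) and the assumption that samples are floats in the range [-1, 1] are both illustrative, as are the function names.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def normalize_rms(samples, target_db=70.0, ref=2e-5):
    """Scale samples so their RMS level equals target_db relative to ref.

    ref is an ASSUMED reference amplitude; the source does not specify one.
    Raises ValueError if the signal is silent or the result would clip.
    """
    current = rms(samples)
    if current == 0:
        raise ValueError("cannot normalize a silent signal")
    target_rms = ref * 10 ** (target_db / 20)
    gain = target_rms / current
    scaled = [x * gain for x in samples]
    # Samples outside [-1, 1] would clip when written to a PCM file.
    if max(abs(x) for x in scaled) > 1.0:
        raise ValueError("normalization would clip")
    return scaled
```

Because every file is scaled to the same RMS target, level differences between the canonical, (de)voiced, and alveopalatalized recordings cannot serve as a cue to item type.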
Critical items from the sentences were excised and presented, along with all of the filler single word items, to two trained linguists blind to the purpose of the experiment. Each linguist independently phonetically transcribed the items, which were presented in a unique random order without word-level labels. This was done to confirm that the critical words with the (de)voicing and alveopalatal pronunciations were categorically perceived as intended. Any items where the transcribers disagreed or where the transcription did not match the intended voicing or place were eliminated from the pool of potential materials.
The items selected based on transcription accuracy were also confirmed to be appropriate via acoustic analysis. The onsets and offsets of aperiodic energy associated with frication were used to identify the fricative. Using the identified interval, three measurements were made: (i) the
Summary statistics (means, with standard deviation in parentheses) for acoustic measures related to voicing. The stimuli types accompanied by an asterisk were not used as stimuli for the experiment, but were recorded to allow comparison of shifted pronunciations to naturally produced tokens.
Underlying fricative in word | Produced Fricative | Stimuli type | Percent Unvoiced | Fricative Duration (ms) | Ratio Duration |
---|---|---|---|---|---|
s | s | sentence | 89.79 (3.42) | 115.38 (16.93) | 0.19 (0.04) |
s | s | words* | 88.48 (5.02) | 120.86 (15.27) | 0.18 (0.03) |
z | s | sentence | 90.33 (7.06) | 116.78 (8.57) | 0.20 (0.04) |
z | s | words | 89.51 (5.04) | 123.25 (11.68) | 0.19 (0.04) |
s | ʃ | words | 89.70 (4.38) | 133.85 (15.67) | 0.20 (0.03) |
z | z | sentence | 75.54 (13.53) | 80.13 (7.36) | 0.14 (0.03) |
z | z | words* | 42.63 (33.61) | 83.39 (11.245) | 0.13 (0.03) |
s | z | sentence | 62.76 (22.35) | 82.91 (12.24) | 0.13 (0.03) |
s | z | words | 34.63 (32.30) | 89.24 (14.35) | 0.13 (0.03) |
z | ʒ | words | 30.11 (32.26) | 87.55 (17.24) | 0.13 (0.02) |
As these data show, items produced as [s] in word and sentence contexts are less voiced (higher percent unvoiced), have longer raw fricative durations, and have longer ratio durations than items that were produced with [z], regardless of whether the canonical form contained /s/ or /z/. To quantify whether exposure sentence stimuli were well-matched, and to corroborate the transcription described in the previous section, the following comparisons were made.
To assess the acoustic equivalence of target words in the exposure stimuli, underlying target /s/ words produced with [s] from the control condition of Experiment 3 (/s/-voicing, n = 36) were compared with devoiced /z/ words from the experimental condition of Experiment 2 (/z/-devoicing, n = 36). Similarly, underlying target /z/ words produced as [z] in the control condition of the /z/-devoicing experiment (n = 36) were compared with voiced /s/ words (n = 36) from the experimental condition of Experiment 3 (/s/-voicing). A series of ANOVAs were run separately for [s] and [z] items (but differing in underlying representation), using percentage unvoiced, fricative duration, and ratio duration as dependent measures and the underlying fricative as the single independent variable. There were no significant effects, suggesting that [s] and [z] were produced consistently in sentence contexts regardless of the voicing of the fricative in the canonical pronunciation of the word.
To assess the acoustic equivalence of the single word test items, underlying /z/ items produced as [s] (n = 36) and underlying /s/ items produced as [z] (n = 36) were compared to the /s/ and /z/ items produced in their canonical form, respectively; again, these canonical single word items were not used as stimuli, but were recorded to make these comparisons. The fricatives from single word environments were assessed separately for [s] and [z] pronunciations through a series of ANOVAs with percentage unvoiced, fricative duration, and ratio duration as dependent measures and the underlying fricative as the independent variable. There was no significant effect of underlying fricative in any of the three ANOVAs. Together, these results suggest that the pronunciations of /z/ as [s] and /s/ as [z] in sentence and word contexts were well matched to canonical pronunciations of /s/ and /z/, respectively, by the same speaker.
The /s/ as [ʃ] and /z/ as [ʒ] test items were not compared to /ʃ/ and /ʒ/ as produced by this speaker in natural contexts because appropriate comparison items were not recorded. Nonetheless, [ʃ] and [ʒ] are well-matched to [s] and [z], respectively, in terms of acoustic presentation of voicing (i.e., measures of percent unvoiced, fricative duration, and ratio duration), which indicates these productions were similarly well-matched.
In Experiment 2 listeners were presented with devoiced /z/ using a sentence exposure task. Perceptual adaptation was assessed by listeners’ endorsements of items with devoiced /z/ as words in a lexical decision test following the exposure phase. Generalization was tested through the presentation of novel devoiced /z/ words, which comprised items not presented in the exposure phase. To determine whether adaptation was targeted in the direction of /z/-devoicing or reflected a more general relaxation of /z/ criteria, items containing [ʒ] in place of [s] were included in the second half of the test block—of these, half were heard during the exposure phase with [s], and half were novel /z/ words. The performance of listeners presented with the devoiced /z/ items in exposure was compared to those in a control group who heard the same items with canonical /z/ pronunciations during exposure. All participants completed the same lexical decision test.
We report on materials and procedures first, followed by participants, as this order allows us to contextualize participant outlier removal in a more transparent manner.
The final stimuli for the exposure phase for Experiment 2 consisted of 56 semantically coherent filler sentences, randomly sampled from the 100 possible filler sentences, and two versions of the 14 semantically predictable critical sentences, in which the pronunciation patterns and intonation of the sentence were consistent, but the voicing of the /z/ in the target word varied according to condition. The control version comprised sentence-final critical words produced in their canonical form (e.g.,
Participants completed the task up to four at a time in sound-attenuated cubicles outfitted with AKG headphones, a desktop PC, and a PST serial response box. All auditory stimuli were presented at a comfortable listening level (approximately 65 dB SPL) over the headphones. The experiment was controlled by E-Prime 2.0 software (
As noted previously, the study comprised two parts: an exposure phase with auditorily presented sentences and an auditory lexical decision test phase. Half of the participants were assigned to a control condition where /z/ words were produced with the canonical voiced [z] pronunciation, and half were assigned to the experimental /z/-devoicing condition where all /z/ exposure items were devoiced and pronounced as [s]. The lexical decision test phase was identical for the two groups of listeners.
In the exposure phase, 70 sentences (14 critical, 56 filler) were presented in a pseudo-random order such that no two critical trials were adjacent. The number of filler sentences presented before the first critical sentence varied randomly from six to eight across experimental conditions. Participants were instructed to listen carefully to the sentences and were falsely told that comprehension questions would follow the sentence list; they were otherwise not required to do anything while listening to the stimuli. Sentences were separated by a 2000 ms pause. Participants were allowed a self-timed break after the exposure phase.
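One way to build an exposure list that satisfies these constraints is sketched below. This is not the E-Prime procedure used in the study; the function `exposure_order` and its gap-filling strategy are assumptions chosen to illustrate the two stated constraints (six to eight leading fillers, and no two adjacent critical sentences).

```python
import random

def exposure_order(critical, fillers, rng, lead_range=(6, 8)):
    """Return a shuffled trial list with no two critical items adjacent.

    critical, fillers: lists of trial labels (must contain at least one
    critical item). rng: a random.Random instance.
    lead_range: inclusive range for the number of fillers before the
    first critical item.
    """
    critical, fillers = critical[:], fillers[:]
    rng.shuffle(critical)
    rng.shuffle(fillers)
    lead = rng.randint(*lead_range)
    n_gaps = len(critical) - 1           # gaps between consecutive critical items
    spare = len(fillers) - lead - n_gaps
    if spare < 0:
        raise ValueError("not enough fillers to separate critical items")
    # Fillers following each critical item: at least one in every internal
    # gap, zero or more after the final critical item.
    after = [1] * n_gaps + [0]
    for _ in range(spare):
        after[rng.randrange(len(after))] += 1
    order = [fillers.pop() for _ in range(lead)]
    for item, n in zip(critical, after):
        order.append(item)
        order.extend(fillers.pop() for _ in range(n))
    return order
```

Reserving one filler for each internal gap before distributing the remainder guarantees the adjacency constraint by construction, so no rejection sampling is needed.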
The lexical decision test phase began after the participants’ self-administered break. Participants were presented with a single item over headphones and were asked to classify the item as a ‘word’ or ‘not a word’ using the button box provided. The assignment of the buttons “1” and “5” to ‘word’ and ‘not a word’ was counterbalanced across participants. These response options were visually presented (numerically and orthographically) on a computer monitor, and participants were given up to 1500 ms to respond. All items (n = 140) were pseudo-randomized across participants as described below. There were 42 nonwords and 70 filler words, none of which contained sibilant fricatives. In the first half of the test block, listeners were presented with 14 critical items where the underlying /z/ was pronounced as [s]. Half of these test items had occurred in the exposure sentences and half were novel words. These items were fully randomized within the block. The second half of the test block contained 14 critical items that tested for general relaxation of the /z/ category, in which the fricative was pronounced as [ʒ]. Half of these items were lexical items that had occurred in the exposure sentences (with the exposure pronunciation as [s] or [z] depending on the condition) and half were novel in the context of the experiment. After completing the exposure and test phases, listeners completed a language background questionnaire. Participants who inquired about the (lack of) comprehension questions for the sentences were informed that those instructions were included to ensure they attended to the sentences.
A total of 135 adults from the Metro Vancouver community participated in this study. Participants’ data were removed prior to the analysis if they did not report English as one of their native languages (n = 55), reported a speech/hearing impairment (n = 1), or were below 90% accuracy on filler items (n = 5; following the exclusionary criteria of
Participant responses were removed if the reaction time was below 200 ms, or more than three standard deviations above the grand mean—this resulted in the removal of 0.2% of the data.
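The trimming rule can be written as a small filter. The sketch below assumes that the grand mean and standard deviation are computed over all responses before any exclusion and that the sample standard deviation is used; the source does not specify either detail, and the function name `trim_rts` is hypothetical.

```python
import statistics

def trim_rts(rts, floor_ms=200.0, n_sd=3.0):
    """Keep reaction times that are at least floor_ms and no more than
    n_sd standard deviations above the grand mean.

    ASSUMPTION: mean and SD are computed on the untrimmed data.
    """
    grand_mean = statistics.mean(rts)
    sd = statistics.stdev(rts)  # sample SD (assumed)
    ceiling = grand_mean + n_sd * sd
    return [rt for rt in rts if floor_ms <= rt <= ceiling]
```

Note that the upper criterion is one-sided: only unusually slow responses are removed relative to the mean, while fast responses are handled by the fixed 200 ms floor.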
The experimental data were analyzed using Bayesian multilevel logistic regression models implemented with the
Inference in Bayesian models is based on the
In all models, the dependent variable was Word Endorsement, where 1 corresponds to participants’ response of ‘word,’ and 0 to ‘not a word.’ Each of the models had population-level (fixed) effects for Item Type, Condition, and their interaction. Both were weighted effect coded categorical variables, in which levels are compared against the weighted mean. For example, the effect for a specific level of Item Type would indicate that the level differs from the weighted mean across all levels. Weighted effect coding accounts for unbalanced data (i.e., more filler items than critical items) and facilitates the interpretation of both main effects and interactions (
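Weighted effect coding differs from ordinary effect coding only in the value assigned to the omitted level: instead of -1, it receives the negative ratio of the coded level's count to the omitted level's count, so each column sums to zero even with unbalanced data. The sketch below illustrates the scheme in plain Python; the function name, the choice of omitted level, and the per-participant item counts in the usage note are assumptions for illustration and are not taken from the authors' model code.

```python
from collections import Counter

def weighted_effect_codes(labels, omitted):
    """Build weighted-effect-coded columns for a categorical variable.

    labels: one category label per observation.
    omitted: the level that gets no column of its own (assumed choice).
    Returns (levels, rows), where rows[i][j] is the code of observation i
    on the column for levels[j].
    """
    counts = Counter(labels)
    levels = [lv for lv in sorted(counts) if lv != omitted]
    rows = []
    for lab in labels:
        if lab == omitted:
            # Omitted level: -n_level / n_omitted on every column.
            rows.append([-counts[lv] / counts[omitted] for lv in levels])
        else:
            rows.append([1.0 if lab == lv else 0.0 for lv in levels])
    return levels, rows
```

For example, with 70 filler words, 42 nonwords, and 7 items each of two critical types, omitting the filler level gives filler rows the codes -42/70, -7/70, and -7/70, and every column sums to zero, which is what lets each coefficient be read as a deviation from the weighted mean.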
Models also shared general specifications. Priors for the intercept and all population-level effects were
This model assesses whether or not participants learned /z/-devoicing, and excludes critical test items that were not previously heard during exposure. In the model, Item Type was weighted effect coded with four levels (
Mean word endorsement rates for filler words, nonwords, and all previously heard critical items across the two conditions used in the analysis of learning /z/-devoicing.
The results of the Bayesian multilevel regression model of word endorsement for the Control and Experimental listener groups’ adaptation to /z/-devoicing are described here and depicted in
Population-level parameters for the learning /z/-devoicing model. Thin lines represent 95% CrI and thick lines represent 50% CrI. The posterior mean estimate for each parameter is indicated by the vertical tick mark.
The model intercept indicates that there is a slight and consistent bias to endorse items as words (β = 2.09, CrI = [1.69, 2.53], Pr(β > 0) = 1). There is evidence that participants in the Control condition have a slightly higher word endorsement rate overall (β = 0.30, CrI = [0.02, 0.59], Pr(β > 0) = 0.98). Filler word items were very likely to be endorsed as words (β = 3.96, CrI = [3.59, 4.37], Pr(β > 0) = 1), and the Condition × Item Type Filler interaction indicates that Control participants were slightly more likely to endorse filler words as words (β = 0.24, CrI = [0.04, 0.48], Pr(β > 0) = 0.99), providing evidence that Experimental condition listeners may have globally adjusted their criteria for word endorsement, becoming more conservative and calling fewer filler items words—we offer possible explanations for this in the interim discussion. Overall, the population-level parameter for Item Type [s] provides evidence that previously heard [s] items are more likely to be endorsed as words (β = 1.39, CrI = [0.36, 2.40], Pr(β > 0) = 0.99). The interaction of Condition and Item Type [s] provides strong evidence that Control listeners were somewhat less likely to identify [s] pronunciations as words (β = –0.76, CrI = [–1.27, –0.28], Pr(β < 0) = 1). This is clear evidence in support of an adjustment to [s] pronunciations in /z/ words for the Experimental participants. While critical items presented as [ʒ] were less likely to be endorsed as words for both experimental and control listener groups (β = –2.68, CrI = [–3.90, –1.55], Pr(β < 0) = 1), the evidence that conditions differed in how they responded to [ʒ] items was weak (β = –0.44, CrI = [–1.34, 0.48], Pr(β < 0) = 0.84). Together with the low estimate for Item Type [ʒ], this indicates that, overall, listeners in both conditions were very unlikely to endorse these items as words, though note the wide range of variability in
A second model addressed the question of whether participants generalized their learning of /z/-devoicing to novel words, that is, words not heard during the exposure phase. For this model, Filler Words and Nonwords were excluded, and all critical items included (both Heard and Novel). All aspects of the model structure were identical to that of the previous section, with the following exceptions. Item Type was weighted effect coded with four levels (
Mean word endorsement rates for previously Heard and Novel critical Item Types (both [s] and [ʒ] pronunciations) across the two conditions used in the analysis of generalizing /z/-devoicing.
The model intercept indicates that there is an overall bias towards endorsing items as words (β = 1.04, CrI = [0.47, 1.63], Pr(β > 0) = 1). There is strong evidence that novel words pronounced with [s] are endorsed as words (β = 1.16, CrI = [0.59, 1.76], Pr(β > 0) = 1), but little to no evidence that this interacts with Condition (β = 0.13, CrI = [–0.28, 0.56], Pr(β > 0) = 0.73). That is, listeners in the experimental condition did not generalize their learning of /z/-devoicing to novel /z/ words pronounced with [s]. Overall, listeners were less likely to endorse Heard (β = –1.74, CrI = [–2.35, –1.14], Pr(β < 0) = 1) and Novel (β = –2.23, CrI = [–2.82, –1.68], Pr(β < 0) = 1) items where /z/ was pronounced as [ʒ] as words. The evidence that this interacted with Condition is weak for Novel [ʒ] items (β = 0.20, CrI = [–0.18, 0.59], Pr(β > 0) = 0.85) and non-existent for Heard [ʒ] items (β = 0.06, CrI = [–0.42, 0.56], Pr(β > 0) = 0.61). These results are depicted in
Population-level parameters for the generalizing /z/-devoicing model. Thin lines represent 95% CrI and thick lines represent 50% CrI. The posterior mean estimate for each parameter is indicated by the vertical tick mark.
Regardless of condition assignment, listeners were very likely to identify /z/ words with devoiced [s] pronunciations as words. Listeners in the experimental condition who were exposed to devoiced /z/ in training were more likely to endorse heard /z/ words with devoicing as words, indicating that they had adjusted their thresholds for acceptable or identifiable realizations of /z/ in a directional manner for the items they were exposed to. However, there was no evidence for generalization for listeners in the experimental group: While listeners in the experimental condition were more likely to identify previously heard devoiced /z/ words as words, there was no evidence that novel /z/ words pronounced with [s] were more likely to be identified as words. This outcome highlights the potentially word-specific nature of perceptual adaptation.
In the generalization model, there was weak evidence that control listeners were
Note that there was moderate evidence that listeners exposed to devoiced-/z/ pronunciations in the exposure phase were less accurate on the categorization of filler words. This is a small (well below a doubling of odds) but consistent effect, and may be due to an apparent difference in the word-nonword balance across conditions. Because experimental participants have learned to perceive target items (which are part of the nonword count in the stimuli distribution) as words, they may in turn identify fewer filler words as words. An alternative possibility is that by virtue of learning that a talker makes certain pronunciation changes, listeners may anticipate additional pronunciation changes which render licit words as nonwords. Both interpretations are speculative, rest on a small effect, and warrant future consideration.
In summary, as expected, both listener groups were likely to call devoiced /z/ items words, as they are all exposed to such pronunciations in their natural input. Listeners who were exclusively presented with devoiced /z/ in the exposure phase were even more likely to identify these items as words. Their adaptations were limited to the specific items they were exposed to and in the direction of variation to which they were exposed. That is, /z/-devoicing was not extended to novel /z/ items that were not presented in the exposure sentences and alveopalatalized [ʒ] pronunciations of /z/ were not more likely to be identified as words. This lack of generalization was unexpected, and we discuss this further in the general discussion.
Experiment 2 established that listeners adjust their lexical decision thresholds to a talker’s /z/-devoicing pattern for items they were exposed to, albeit in a somewhat narrow manner. In Experiment 3 we tested whether listeners learned an /s/-voicing pattern, which is both typologically marked and very likely outside of listeners’ regular perceptual experience in English, as demonstrated by Experiment 1. The materials and procedures for Experiment 3 are equivalent to those for Experiment 2, with the exception that critical items were underlyingly /s/ words, and listeners were exposed to /s/-voicing instead of /z/-devoicing. To assess whether any adjustments were due to a directional adaptation or a general relaxation mechanism, items with [ʃ] were included in the second half of the test block.
As in Experiment 2, the stimuli for the exposure phase in Experiment 3 comprised 56 semantically coherent filler sentences, randomly sampled from a pool of 100 filler sentences, and two versions of the 14 semantically predictable critical sentences. The control conditions had sentence-final critical words produced in their canonical form (e.g.,
As for the /z/ items, lexical frequency (log frequency per million) of the final stimuli list was estimated using the SUBTLEX-US corpus (
The procedure for Experiment 3 was identical to that of Experiment 2.
There were 135 adult participants from the Metro Vancouver community in Experiment 3. Using the same criteria as in Section 4.1.3, participants were excluded if they were not self-reported native speakers of English (n = 34), had a speech or hearing impairment or did not answer the question (n = 7), or scored below 90% on filler words (n = 7). There were 87 participants retained in the analysis (Control: n = 46, Experimental: n = 41). Participants varied in gender (64 female, 16 male, 1 fluid, 1 non-binary, 4 did not report), and of the participants who reported their age (n = 84), the majority were undergraduate student-aged (M = 21.08, Median = 20,
The analysis for Experiment 3 was nearly identical to that of Experiment 2. Less than 0.2% of the data was removed due to reaction time filtering. The models for Experiment 3 have the same formula, weak priors, and specifications as described in Section 4.2. For reference, the formula was:
This model assesses whether or not participants learned /s/-voicing, and excludes critical test items that were not heard during exposure. In the model, Item Type was weighted effect coded with four levels (
Mean word endorsement rates for filler words, nonwords, and all previously heard critical items across the two conditions used in the analysis of learning /s/-voicing.
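The weighted effect coding used for Item Type in these models can be illustrated with a short sketch. Under this scheme, each non-reference level gets its own contrast column, and the reference level is coded as the negative ratio of that level's count to the reference level's count. Each column then sums to zero over the observations, so the intercept corresponds to the observation-weighted mean of the level-specific linear predictors even when levels are unbalanced. The level labels, trial counts, and choice of reference level below are hypothetical placeholders, not the values used in the experiments.

```python
def weighted_effect_codes(counts, reference):
    """Build weighted effect codes for one categorical predictor.

    counts: dict mapping level name -> number of observations.
    reference: the level that gets no column of its own.
    Returns (column_names, {level: contrast_row}).
    """
    columns = [lvl for lvl in counts if lvl != reference]
    codes = {}
    for lvl in counts:
        if lvl == reference:
            # Reference level: -n_j / n_ref in the column for level j.
            codes[lvl] = [-counts[c] / counts[reference] for c in columns]
        else:
            # Non-reference level: 1 in its own column, 0 elsewhere.
            codes[lvl] = [1.0 if c == lvl else 0.0 for c in columns]
    return columns, codes


# Hypothetical, unbalanced trial counts (assumption for illustration only).
counts = {"filler_word": 60, "nonword": 40, "heard_z": 14, "heard_sh": 14}
columns, codes = weighted_effect_codes(counts, reference="nonword")

for level, row in codes.items():
    print(level, row)
```

With balanced counts, the reference row reduces to -1 in every column, which is ordinary (unweighted) effect coding.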
In the model of learning /s/-voicing, the intercept indicates that there is an overall bias towards endorsing items as words (β = 1.79, CrI = [1.41, 2.20], Pr(β > 0) = 1), though there was little to no evidence of a meaningful difference across conditions (β = 0.09, CrI = [–0.12, 0.31], Pr(β > 0) = 0.80). There is strong evidence that filler words were consistently endorsed as words (β = 4.13, CrI = [3.77, 4.52], Pr(β > 0) = 1), and that this interacted with Condition. Control participants were slightly more likely to endorse filler words as words (β = 0.32, CrI = [0.16, 0.49], Pr(β > 0) = 1), suggesting that listeners in the Experimental group may have globally adjusted their criteria for word endorsement, becoming more conservative with respect to what constitutes a word, as in Experiment 2. There is weak evidence that Heard [z] items were less likely to be endorsed as words (β = –0.64, CrI = [–1.76, 0.47], Pr(β < 0) = 0.87), and furthermore, the Condition × Item Type [z] parameter indicates that Control participants were less likely to endorse Heard [z] items as words (β = –0.57, CrI = [–1.27, 0.10], Pr(β < 0) = 0.95). This provides some evidence that listeners in the Experimental group adjusted their /s/ criteria to accommodate [z] pronunciations. There is strong evidence that [ʃ] items were less likely to be endorsed as words overall (β = –3.38, CrI = [–4.76, –2.06], Pr(β < 0) = 1). Additionally, Control participants were less likely to identify [ʃ] items as words (β = –0.51, CrI = [–1.43, 0.40], Pr(β < 0) = 0.87), though given the probability of the effect’s direction, this should be interpreted as weak evidence that Experimental participants generally relaxed their word endorsement thresholds. The population-level parameters for the model of learning /s/-voicing are depicted in
Population-level parameters for the learning /s/-voicing model. Thin lines represent 95% CrI and thick lines represent 50% CrI. The posterior mean estimate for each parameter is indicated by the vertical tick mark.
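Because these are logistic regression models, the reported coefficients are on the log-odds scale. As an aid to reading them, the sketch below converts two of the values reported above into more familiar quantities: exp(β) gives the multiplicative change in odds per unit change in a predictor, and the inverse logit maps a log-odds value to a probability. How a one-unit change in Condition maps onto the Control versus Experimental difference depends on the contrast coding, which is not assumed here. The helper names are illustrative.

```python
import math


def odds_ratio(beta):
    """Multiplicative change in odds for a one-unit change in the predictor."""
    return math.exp(beta)


def inv_logit(log_odds):
    """Map a log-odds value onto the probability scale."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Values taken from the learning /s/-voicing model reported above.
intercept = 1.79            # overall bias towards "word" responses
filler_by_condition = 0.32  # Condition x filler word parameter

# A doubling of odds corresponds to beta = ln(2), roughly 0.69.
doubling_beta = math.log(2)

print(f"P(word) implied by the intercept alone: {inv_logit(intercept):.3f}")
print(f"Odds ratio for beta = {filler_by_condition}: {odds_ratio(filler_by_condition):.3f}")
print(f"Beta needed to double the odds: {doubling_beta:.3f}")
```

Read this way, a coefficient of 0.32 multiplies the odds by less than 1.4 per unit of the predictor, which is consistent with describing the filler-word effect as small.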
This model addresses the question of whether participants generalized their learning of /s/-voicing to novel words—those not heard during exposure. As in Section 4.2.2, Filler Words and Nonwords were excluded, while all Heard and Novel critical items were retained. The same model structure was used. Item Type was weighted effect coded with four levels (
Mean word endorsement rates for previously Heard and Novel Item Types (both [z] and [ʃ] pronunciations) across the two conditions used in the analysis of generalizing /s/-voicing.
In the model of generalizing /s/-voicing, there is a slight overall bias to call items nonwords (β = –0.87, CrI = [–1.75, –0.04], Pr(β < 0) = 0.98), unlike each of the previous three models. Overall, there was moderate evidence that novel /s/ items pronounced with [z] are more likely to be endorsed as words (β = 0.32, CrI = [–0.19, 0.84], Pr(β > 0) = 0.90), and strong evidence that [ʃ] pronunciations for /s/ items are less likely to be endorsed as words, whether Heard (β = –0.95, CrI = [–1.55, –0.41], Pr(β < 0) = 1) or Novel (β = –1.16, CrI = [–1.80, –0.56], Pr(β < 0) = 1). There is moderate evidence that Control participants are less likely to endorse items as words across the board (β = –0.57, CrI = [–1.30, 0.17], Pr(β < 0) = 0.94), suggesting that Experimental participants may have generally relaxed their criteria for /s/, which manifests here as greater acceptance of [ʃ] pronunciations. There was no evidence that Condition interacted with Item Type, as the credible interval for each interaction term substantially overlaps with zero, possibly because the pattern is well captured by the main effect of Condition. The population-level parameters described here are depicted in
Population-level parameters for the generalizing /s/-voicing model. Thin lines represent 95% CrI and thick lines represent 50% CrI. The posterior mean estimate for each parameter is indicated by the vertical tick mark.
Listeners were assigned to an experimental condition where /s/ words were produced with [z]—a voiced fricative at the same place of articulation—or to a control condition where a typical [s] was heard in the same sentence set. At test, all listeners were presented with /s/ items from exposure containing /s/-voicing, novel /s/ words with voicing, heard /s/ words pronounced with [ʃ], and novel /s/ words with [ʃ]. Bayesian multilevel logistic regression models present very weak evidence that listeners exposed to voiced /s/ adapted their /s/ category to specifically accommodate these pronunciations. That is, there is little to no evidence of a directional adjustment mechanism in play in response to exposure to /s/-voicing. However, there is weak-to-moderate evidence that, at test, listeners exposed to the voiced /s/ in exposure were more likely to call any non-canonical pronunciation of an /s/ word a word than listeners in the control condition, suggesting that any change in /s/ category structure was the result of a more general relaxation of /s/ criteria, as opposed to any directional adjustments towards voiced /s/.
Our goal was to examine asymmetries in adaptation to non-canonical pronunciations, focusing on the voicing patterns of coronal fricatives, given the typological and English-specific tendencies for these fricatives to devoice as opposed to voice. In Experiment 1, we first confirmed that North American English speakers in the Buckeye Corpus produce substantially more (categorical) /z/-devoicing than /s/-voicing in spontaneous speech. Such a confirmation reinforces an expected asymmetry that is reflected in the behavioural data of Experiments 2 and 3: We expected listeners to adjust their word endorsement behaviours in different ways to /z/-devoicing and /s/-voicing. The devoicing of /z/ is not only typologically frequent and phonetically natural, but also comparatively frequent in spontaneous speech in North American English. Categorical voicing of /s/, on the other hand, is typologically rare, phonetically unnatural, and very rare in spontaneous North American English. In Experiments 2 and 3, we presented listeners with sentences containing naturally produced words whose coronal fricatives had been produced with either devoicing (/z/ → [s]) or voicing (/s/ → [z]) relative to their canonical citation forms. Groups of listeners in control conditions were presented with the same sentences and words with their typical fricative realizations (/z/ → [z]; /s/ → [s]). We tested whether listeners in the Experimental conditions adjusted their fricative categories, identifying more items with a change in coronal fricative voicing as words in a lexical decision test. To assess whether adjustments are targeted in the direction of the exposed variant or whether such adjustments are the result of general category relaxation, the test block also presented listeners with an additional change in place of articulation. In the latter part of the test phase, listeners were presented with items where an expected /z/ was replaced by [ʒ] and /s/ by [ʃ].
The Bayesian statistical analysis provides nuance in evaluating the evidence for directional adaptation and general relaxation mechanisms in perceptual learning. While listeners in both the control and experimental conditions of Experiment 2 (/z/-devoicing) showed strong evidence of identifying devoiced /z/ items as words, those in the experimental condition were more likely to identify these items as words, demonstrating directional adaptation. These same listeners, however, did not generalize their adjustments to novel devoiced /z/ items, underscoring just how lexically-specific the adjustments were, though both listener groups did identify these items as words at high rates, which is expected given it is a pronunciation pattern listeners have substantial experience with. Listeners in the /z/-devoicing control condition (Experiment 2) were more likely than those in the experimental condition to accept [ʒ] pronunciations as words, suggesting that exposure to novel pronunciations at test, after having been presented with canonical /z/ pronunciations in exposure, may have triggered a general category relaxation mechanism at that point. It is worth reiterating that the evidence for this behaviour in the control condition was weak, and did not extend to previously heard /z/ words pronounced as [ʒ] at test.
The behaviours in Experiment 3 (/s/-voicing) were quite different. While the evidence was weak in the learning model and weak-to-moderate in the generalization model, experimental condition listeners who were exposed to voiced /s/ were more likely to identify [z]
Summary of mechanisms, rationale, and predictions and results for
| | Experimental condition of Experiment 2: /z/-devoicing | Experimental condition of Experiment 3: /s/-voicing |
|---|---|---|
| Hypothesized mechanism | Directional adaptation | General relaxation |
| Talker behaviour is … | Expected | Unexpected |
| Baseline word endorsement behaviour | Very high rates for devoiced /z/ | Higher than control listeners, but lower than /z/-devoicing |
| Listener behaviour for novel words (tests generalization) | Yes, as voicing change falls within general relaxation | |
| Listener behaviour for place change (tests general relaxation) | No, exposure to devoiced /z/ reinforces prior knowledge | Yes, as place change also falls within general relaxation |
The goal of these experiments was to take advantage of English phonology and fricative typology as a way to establish that there are predictable asymmetries in lexically-guided perceptual learning in speech, compared to asymmetries which have simply been fortuitously stumbled upon (e.g.,
The first reason supporting the learnability of devoiced /z/ stems from the voiced fricative’s typological rarity. Voiced fricatives are typologically less common than voiceless fricatives across the languages of the world. Producing a voiced fricative involves an aerodynamic configuration that maintains the subglottal and supraglottal pressure differential for voicing whilst generating high enough pressure behind the constriction to create turbulent airflow. Voiced fricatives often devoice or become approximants in languages (i.e., debuccalization). Thus, the argument can be made that voiceless fricatives are more natural than voiced fricatives, and that listeners may be more likely to learn this more natural pattern. Indeed, in the learning of phonological patterns, there is some evidence that learners appear to be biased towards learning what is phonetically natural (though disentangling naturalness is a challenge;
The observation of English listeners being more experienced with /z/-devoicing than /s/-voicing is our second reason for expecting /z/-devoicing to be learned more readily than /s/-voicing. It is well established that /z/ devoices in both lab speech (
Perceptual learning is a retuning of sublexical representations, which means that the nature of those representations is of crucial importance for these theories. The two leading models—TRACE (
Recent work by Tzeng, Theodore, and Nygaard (
A mechanism that may account for the current results exists more readily in theories about sound change, which could be viewed as
A post-perceptual evaluative stage similar to those proposed by sound change theorists may account for the directional adaptation to devoiced /z/, which did not generalize, compared to the very different category relaxation behaviour, which included generalization to novel items, for voiced /s/. Such a mechanism is congruent with our results, and the prospect of integrating theoretical mechanisms for category flexibility and stability in both synchronic and diachronic scenarios is appealing. When listening to devoiced /z/ items in sentences, listeners likely had no trouble recognizing the intended word and it was likely not considered an extreme outlier, as listeners regularly experience devoiced /z/. Thus, these devoiced /z/ items were filed away and used to update the /z/ distribution for particular lexical items
This kind of post-perceptual evaluation is increasingly viewed as necessary to account for diachronic sound change, and it offers an elegant account of our results, but how does it fit with the models and mechanisms discussed above for lexically-guided perceptual learning? A stage that involves weighting of evidence and goodness-of-fit evaluation seems to align well with a Bayesian model like Merge B (which has a more explicit decision stage built into the model framework), though an evaluation stage is not incompatible with an interactive model of spoken language processing (e.g., TRACE). Building such a stage into our models allows us to account for asymmetries in adaptation that ostensibly relate to the magnitude of experienced category deviation. While we remain agnostic between the two opposing proposals, the degree to which lexical activation feeds back or merges with a phonemic category to allow a sufficient signal-to-phoneme mapping update may be determined by the certainty or confidence in the lexical assessment and the deviant sound’s category goodness (
The total lack of generalization to novel devoiced-/z/ words warrants further discussion. We suggest that this is due to listeners learning a context-specific mapping associated with a particular lexical item. Previous studies highlighting a lack of generalization argue that the learned acoustic-auditory features are highly context dependent (
We demonstrate that there are asymmetries in what listeners learn and how they adjust in lexically-guided perceptual learning tasks. Listeners adapted to word-specific patterns in /z/-devoicing, illustrating a directional adaptation mechanism. Those exposed to voiced /s/ showed some evidence of engaging in more generic /s/ category relaxation, subtly and variably increasing their probability of endorsing any novel /s/ pronunciation. These results suggest that perceptual learning of speech is moderated by listener knowledge or experience. The different response elicited by these two patterns may be because /z/-devoicing is phonetically natural in the languages of the world and/or because it taps into pre-existing pronunciation variants and is, therefore, readily recognized and accepted as a pronunciation that merits the updating of a word’s category-associated distribution. We are, of course, unable to say which of these phonetic factors is ultimately the motivating factor for the lack of learnability of voiced /s/ and the learnability of devoiced /z/, but we clearly establish an important asymmetry in how listeners update phonetic representations. We posit that an evaluation stage, which has been invoked in the sound change literature, accounts for these results. Altogether, these results suggest that adaptation to novel pronunciation may leverage distinct mechanisms based on the nature of the phonetic variability presented.
The additional file for this article can be found as follows:
Critical sentences used in Experiment 2 and Experiment 3 and statistical model summaries. DOI:
The Buckeye corpus uses Arpabet transcriptions and those are used in our reporting from this corpus.
While we acknowledge that the population sample of the Buckeye Corpus (Midland American English in Columbus, Ohio) is not identical to that of the listener population sample (Canadian English in Vancouver, British Columbia), the fricative voicing patterns are robust to the point that we assume the basic pattern holds for speakers of Canadian English as well given that, to our knowledge, no previous literature suggests major differences in fricative voicing patterns in North American English.
Maximal here refers to the fact that the pseudowords differed from any real word with respect to multiple phones.
There was no intention of balancing or controlling for the presence or absence of filler words in the filler sentences. The overlap varied by participant, as each participant’s list was randomly sampled from the viable pool of items, and the overlap simply existed as a function of our lack of creativity in composing semantically coherent sentences that lacked the critical sounds of interest.
For additional information on weighted effect coding, see
Full model summaries are provided in the Appendix.
Special thanks to Carolyn Norton and Zoe Lawler who played pivotal roles in earlier iterations of this project. Thanks to Martin Oberg for programming assistance, Masako Kato, Sophie Bishop, Stephanie Chung, and Cassandra Savage for their contribution to the coding of the corpus data, and Tristan Bhimaraj for being the speaker. This work has benefited from discussion with many members of the Speech in Context Lab, past and present, especially Brianne Senior, Kathleen Hall, and Michael McAuliffe. Thanks to Charlotte Vaughn and audiences at Acoustics Week in Canada, September 21–23, 2016, for comments on earlier versions of this work. All errors are our own. This work has been supported by an award from Canada’s Natural Sciences and Engineering Research Council (MB).
As permitted by the participant consent forms, the data and materials from Experiments 2 and 3 are available by request made to the researchers.
The authors have no competing interests to declare.