Individual variation is key to understanding phenomena in phonetic variation and change, including the production-perception link. To test the generalizability of this relationship, this study compares community- and individual-level variation across three long-standing consonant mergers in Hong Kong Cantonese speakers: [n]→[l], [ŋ̩]→[m̩], and [ŋ]↔Ø. Concurrently, we document these understudied mergers in a community that has undergone rapid social change in recent decades. Younger (college-aged) and older (middle-aged) Hong Kongers completed a reading production task followed by a forced-choice lexical identification perception task. Group-level results suggest mismatching production and perception: While the community overall distinguished merger pairs in production, younger listeners are more perceptually categorical than older listeners. However, aggregate results obscure the fact that individuals vary substantially in the extent of merging in both perception and production, including many who exhibit complete merger, and that individual-level production-perception correlations were found for [n]→[l] and [ŋ̩]→[m̩], though not [ŋ]↔Ø. Results are discussed in the context of previous research. We find that (i) these mergers have diverged from predicted trajectories of completion, and (ii) overall, prior findings on the production-perception link are generalizable to these consonant mergers.
While variability within speech communities has long been acknowledged, phonetic variation and change has historically been studied at the level of the community or macro-social demographic groups. Recently, a more concentrated focus has been placed on the characteristics of individual speaker-listeners and their role in innovating and driving change (for overviews, see
The focus here is how perception and production systems relate in the context of sound change. As this relationship holds implications for mechanisms behind the initiation and propagation of sound change, a more complete picture is crucial for understanding both the spread of sound change from individuals to entire communities and the progression of community-level change over time. Though many have probed this question, consistent results have been hard to come by both with regard to whether production-perception systems are linked in individuals (e.g.,
In line with this perspective, the present study examines—at community and individual levels—production, perception, and production-perception alignment in an understudied case of phonetic variation and change. We conduct an apparent-time investigation of three consonant mergers in Hong Kong Cantonese with distinct profiles and trajectories, namely [n]→[l], [ŋ̩]→[m̩], and [ŋ]↔Ø. While documenting the recent state of these mergers, we test hypotheses of the production-perception relationship, seeking to extend previous findings to a novel set of sound changes of a type—consonant mergers—that has thus far lacked study (with the notable exception of
Theories of sound change have long assumed that an individual’s perception and production systems must be connected in some way for an individual to perceive a change within their speech community and implement it in their own production repertoire. This is a necessary condition if we posit that changes propagate from individual to individual, though the extent to which this is true (or, under which circumstances) had not been directly explored until recently (see
Although there is a strong theoretical appetite for perceptual and productive repertoires to map neatly onto one another, only some investigations have found individual-level evidence for this connection (e.g.,
One common suggestion put forth to account for inconsistent findings is that the experimental tasks and measures used to assess a link between production and perception were not necessarily comparable, such that production and perception tasks may have in fact been measuring different constructs (see
From a theoretical perspective, some posit that the context of variation, such as the stability of variation patterns in the community, has consequences for production-perception alignment. Beddor et al. (
Importantly, many recent studies of sound change do find evidence of individual-level links between domains, despite community-level mismatch (
Beyond the simple existence of a production-perception connection, a co-occurring question pertains to the
Kuang and Cui (
Pinget et al. (
Factors relating to the source of sound change may separately influence the manifestation of production-perception misalignment. Voeten (
Finally, the type of change may matter in the coordination of production and perception. Most prior studies investigated cue shifting (e.g.,
In the current study, we highlight mergers as an interesting case of change, unique because they represent a total loss of phonological contrast rather than simply a shifting of cues that maintain contrast. This comparatively drastic phonological change could lead to differences in behaviour surrounding production and perception. To the best of our knowledge, only one study of production and perception has been published on segmental mergers (
Hong Kong Cantonese (HKC) provides an interesting opportunity to study mergers, as a large set of consonantal (e.g.,
Another layer of complexity is that linguistic ideologies (e.g., the public media campaigns and school curriculum changes spearheaded by Professor Richard Ho Man-Wui, a prominent public scholar and professor of Chinese literature) have led to stigmatization of the innovative variants. The use of mergers is termed 懶音
In Hong Kong, these external factors could have plausibly come together to stall or reverse the progression of the mergers, leading to stable variation or preservation of the conservative form in formal registers (e.g.,
Against this background, the present investigation focuses on three consonant mergers in HKC: onset [n]→[l], syllabic [ŋ̩]→[m̩], and onset [ŋ]↔Ø.
Examples of Cantonese words involved in each merger, and the direction of change.
Merger | Character | Gloss |
[n]→[l] | 藍 | ‘blue’ |
[n]→[l] | 男 | ‘male’ |
[ŋ̩]→[m̩] | 唔 | ‘not’ |
[ŋ̩]→[m̩] | 五 | ‘five’ |
[ŋ]↔Ø | 嘔 | ‘vomit’ |
[ŋ]↔Ø | 牛 | ‘cow’ |
The [n]→[l] onset merger represents a prototypical case of contrast loss. These syllable-initial phonemes, described by Zee (
Since the 1940s or earlier (
Because the innovative [l] has become so prevalent, Chen (
It has further been specifically noted that [n]→[l] varies by speech register such that [n] is associated with formality (
The [ŋ̩]→[m̩] merger involves two historical phonemes, the syllabic velar and bilabial nasal consonants, that were largely non-contrastive. [ŋ̩] historically occurred in relatively few lexical items, limited to the three low tones. Of these, Bauer (
Though this merger was first documented in the 1980s (
Some early sources demonstrate general or public awareness of the [ŋ̩]→[m̩] merger: Newspapers mentioned homophony of the negative morpheme
Unlike the former two mergers, the [ŋ]↔Ø merger—involving the syllable-initial velar nasal and its null-initial (Ø) counterpart (also referred to as zero-initial, phonetically either a vowel or glottal stop onset;
Since the start of the 20th century, both historical [ŋ] and Ø have shown evidence of merging towards the other (note some exceptions of [ŋ] onset occurring with mid to high tones;
A further complexity arises when we compare the results of To et al.’s (
One potential explanation is that stigmatization of the null-initial as ‘lazy’ led to age-based patterning across the community, such that adult populations tend to produce the more ‘proper’ and ‘prestigious’ [ŋ] variant for both historical classes. Indeed, according to Chen (
The present study investigates the production and perception of the [n]→[l], [ŋ̩]→[m̩], and [ŋ]↔Ø mergers across two generations in Hong Kong, examining both group-level and individual-level patterns. By studying phonetic variation and change in this context, we hope to contribute new insights to unresolved questions in the sound change literature, including those on the phonetic basis of sound change. Our research objectives are two-fold.
The first goal is to test how production-perception patterns generalize across sound changes with varying attributes. In doing so, we seek to clarify the factors that modulate the existence and nature of the production-perception link, for which evidence has been inconsistent. Characteristics that may be relevant include source of change (internal versus contact-induced), stage of change (e.g., early- versus late-stage), and type of change (shifting versus merging). The present study compares three internally-driven consonant mergers within the same speaker-listeners using a consistent methodology. While recent reports suggest that [n]→[l] and [ŋ̩]→[m̩] are both late-stage changes (compared to the mid-to-late-stage [ŋ]↔Ø), these mergers are further differentiated by a host of other factors. We interpret these case studies in the context of previous results that have focused on cue-shifting and boundary-shifting vowel changes, with particular attention to the results of Pinget et al. (
The second objective is to add to documentation on the progression of these mergers specifically in the post-handover era of Hong Kong, revisiting several open threads in the HKC literature. Hong Kongers who grew up in the years since 1997 have had vastly different formative experiences from preceding generations, including increased exposure to Mandarin and reinforcement of ‘proper pronunciation’ ideologies that brand the mergers as ‘lazy’ (
As part of a larger project, data were collected from Cantonese speakers in both Hong Kong and Vancouver, Canada, but only the Hong Kong data are discussed in this paper. This study received approval from the Human Subjects Ethics Sub-committee of the Hong Kong Polytechnic University (HSEARS20160829002), and participants gave written consent before taking part in the experiment. Data collection took place between September 2016 and February 2017. To examine the merger trajectories in apparent time, participants were recruited across two age groups: older middle-aged individuals and younger college-aged individuals.
Fifty-one participants were recruited from the Hong Kong Polytechnic University. The older generation consisted of 23 speakers (M = 54.09 years,
Summary demographic information, including mean age (standard deviation), median Cantonese and English age of acquisition (AoA), and mean Cantonese-English language dominance score (standard deviation) as calculated by the BLP.
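Age group | Gender | N | Mean age (SD) | Cantonese AoA (median) | English AoA (median) | BLP dominance, mean (SD) |
Older | F | 12 | 55.25 (4.90) | 0 | 6 | –81 (39) |
Older | M | 11 | 52.82 (5.46) | 0 | 4 | –95 (39) |
Younger | F | 13 | 18.77 (1.74) | 0 | 3 | –91 (25) |
Younger | M | 15 | 20.00 (2.59) | 0 | 3 | –84 (26) |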
Mean ratings (on a scale from 0–6) of self-reported speaking and comprehension abilities for Cantonese, English, and Mandarin per demographic group.
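Age group | Gender | N | Cantonese speaking | Cantonese comprehension | English speaking | English comprehension | Mandarin speaking | Mandarin comprehension |
Older | F | 12 | 5.50 | 5.50 | 3.92 | 4.25 | 2.25 | 2.75 |
Older | M | 11 | 5.82 | 5.91 | 3.82 | 4.27 | 3.18 | 3.73 |
Younger | F | 13 | 5.46 | 5.38 | 3.77 | 3.69 | 3.62 | 4.08 |
Younger | M | 15 | 5.07 | 5.13 | 3.43 | 3.73 | 2.93 | 3.87 |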
Production stimuli consisted of 64 Cantonese words presented visually in Chinese orthography
Perception stimuli consisted of three 13-step continua generated between Cantonese real-word minimal pairs, one per merger as listed in
Minimal word pairs for each synthesized continuum.
Merger | Endpoints (IPA) | Characters | Gloss |
[n]→[l] | [lou] - [nou] | 老-腦 | ‘old’ - ‘brain’ |
[ŋ̩]→[m̩] | [m̩] - [ŋ̩] | 唔-吳 | ‘no, not’ - ‘Ng (surname)’ |
[ŋ]↔Ø | [aːk̚] - [ŋaːk̚] | 握-呃 | ‘shake [hands]’ - ‘deceive’ |
These six natural productions were then used as endpoints to create three word-pair continua using
Waveforms (top) and spectrograms (bottom) of the endpoints and midpoint of the final 13-step
Participants completed the Bilingual Language Profile (BLP;
An exit interview was administered to examine metalinguistic awareness about the mergers in the participant’s own speech and experiences. The interview was conducted in English or Mandarin, supplemented by Cantonese when necessary to ensure comprehension.
Participants were seated alone at a computer in a sound-attenuated booth where they were presented with the production task followed by the perception task. Instructions were provided visually in English on the screen. Upon completion, the experimenter returned to the room to answer any questions while participants filled out the language questionnaire, then conducted the exit interview. To reduce the chance that participants would change their speech behaviour due to realizing the purpose of the experiment (i.e., our interest in the target sounds), the production task, which included fillers, was ordered prior to the perception task, which involved only the target sounds.
In the self-paced production task, Chinese characters and the English translation were visually presented using E-Prime 2.0 software (
Perception stimuli were auditorily presented in a two-alternative forced choice lexical identification task. To reduce influences of explicit knowledge about ‘proper’ pronunciation, participants were instructed to respond as quickly as possible and not overthink their response. Using E-Prime 2.0, the synthesized Cantonese words were played over AKG K77 Perception headphones at a comfortable listening level, accompanied by a visual display of the appropriate minimal pair words in Cantonese orthography and the English translation. For each trial, participants heard the audio stimulus once and saw two word choices labelled with ‘1’ or ‘5’ (on the left or right side of the screen). They were asked to press the button on the button box (i.e., ‘1’ or ‘5’) that corresponded to the word they heard. If no response was registered within three seconds, the next trial began automatically. Each token was repeated three times throughout the experiment, randomly presented over three blocks (117 trials in total).
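The trial count can be checked with a short sketch. This is an illustration only, not the E-Prime script used in the study: the continuum labels are hypothetical, and the sketch assumes that each block contains every token exactly once and is shuffled independently, which the description above is consistent with but does not state outright.

```python
import random

# Hypothetical labels for the three continua (one per merger).
CONTINUA = ["n_l", "syllabic_ng_m", "ng_null"]
N_STEPS = 13   # steps per continuum, as described in the text
N_BLOCKS = 3   # each token is heard three times in total


def build_trials(seed=0):
    """Return a list of (block, continuum, step) trials.

    Assumption: every block holds each of the 39 tokens once,
    in an independently randomized order.
    """
    rng = random.Random(seed)
    trials = []
    for block in range(1, N_BLOCKS + 1):
        block_trials = [
            (block, continuum, step)
            for continuum in CONTINUA
            for step in range(1, N_STEPS + 1)
        ]
        rng.shuffle(block_trials)
        trials.extend(block_trials)
    return trials


trials = build_trials()
print(len(trials))  # 3 continua x 13 steps x 3 blocks = 117
```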
To assess production, two phonetically-trained Cantonese speakers who grew up in Hong Kong coded the onset of each item, or in the case of the syllabic nasals, the sole segment comprising the word. Items were blocked by talker, randomized, and presented to the coders blind, without knowledge of the intended lexical items. Coders categorized the onset from a closed set, presented orthographically as six options: ‘l,’ ‘m,’ ‘n,’ ‘ng,’ ‘a vowel,’ or ‘other.’ The 204 items identified by both transcribers as ‘other’ were coded by the first author. These items either did not include a usable production (e.g., recording errors where the word was cut off or missing) or included a mispronounced production (e.g.,
The coders agreed on 4933 trials. Inter-rater reliability for the two coders was calculated in R (
Inter-rater reliability measures for the auditory coding of critical items from the perspective of the historical phoneme.
Historical phoneme | N | Reliability coefficient | p | Agreement |
[l] | 1067 | 0.82 | <0.001 | 95.9% |
[n] | 1327 | 0.85 | <0.001 | 90.4% |
[m̩] | 801 | 0.48 | <0.001 | 92.5% |
[ŋ̩] | 801 | 0.46 | <0.001 | 79.3% |
[ŋ] | 798 | 0.60 | <0.001 | 84.5% |
Ø | 1068 | 0.59 | <0.001 | 81.0% |
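The table pairs high raw agreement with a noticeably lower reliability coefficient for some phonemes. One way this can arise is sketched below, under the assumption that the coefficient is chance-corrected in the manner of Cohen's kappa; the text above does not confirm which statistic was computed. The labels and counts are invented for illustration and are not the study's data.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Proportion of items on which the two coders chose the same label."""
    return sum(x == y for x, y in zip(coder_a, coder_b)) / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: agreement corrected for chance agreement.

    Undefined (division by zero) if chance agreement equals 1,
    i.e. if both coders used one and the same label throughout.
    """
    n = len(coder_a)
    p_observed = percent_agreement(coder_a, coder_b)
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    labels = set(counts_a) | set(counts_b)
    p_chance = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


# Invented, heavily skewed codings: both coders choose "m" 18 times out of 20.
coder_a = ["m"] * 18 + ["ng"] * 2
coder_b = ["m"] * 17 + ["ng", "m", "ng"]

print(percent_agreement(coder_a, coder_b))  # 18 of 20 items agree
print(cohens_kappa(coder_a, coder_b))       # much lower than raw agreement
```

When one variant dominates, agreement expected by chance is already high, so a chance-corrected coefficient can be modest even when the coders rarely disagree.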
Data for the [n]→[l] merger (n = 1158) were analyzed in a logistic mixed effects regression model with the probability of [l] productions ([l] = 1, [n] = 0) as the dependent measure. All categorical variables were sum coded (Historical Pattern: /l/ = 1, /n/ = –1; Talker Age: Older = 1, Younger = –1; Talker Gender: Male = 1, Female = –1) and entered as possible main effects and interactions. There were random intercepts for Subject and Item, with Historical Pattern as a by-subject random slope.
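The sum coding described above can be made concrete with a small sketch of the fixed-effects design. It shows only how the ±1 codes and their products form the main-effect and interaction columns; it does not fit the model. The dictionary and function names are hypothetical and the level labels are simplified.

```python
from itertools import product

# Sum (deviation) codes as given in the text for the [n]->[l] model.
SUM_CODES = {
    "historical": {"l": 1, "n": -1},
    "age": {"older": 1, "younger": -1},
    "gender": {"male": 1, "female": -1},
}


def design_row(historical, age, gender):
    """Fixed-effects predictors for one observation under sum coding."""
    h = SUM_CODES["historical"][historical]
    a = SUM_CODES["age"][age]
    g = SUM_CODES["gender"][gender]
    return {
        "Intercept": 1,
        "Historical Pattern": h,
        "Age Group": a,
        "Gender": g,
        "Historical Pattern : Age Group": h * a,
        "Historical Pattern : Gender": h * g,
        "Age Group : Gender": a * g,
        "Historical Pattern : Age Group : Gender": h * a * g,
    }


# One row per cell of the 2 x 2 x 2 design.
rows = [
    design_row(h, a, g)
    for h, a, g in product(("l", "n"), ("older", "younger"), ("male", "female"))
]
for row in rows:
    print(row)
```

With ±1 codes, every non-intercept column sums to zero across the eight design cells, so in a balanced design each main effect is estimated at the average of the other factors rather than at a reference level.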
The model output is summarized in
Model output for [n] and [l] merger in production. Significant factors (
Predictor | Estimate | Std. Error | z | p |
Age Group | 0.2192 | 0.4088 | 0.536 | 0.59182 |
Gender | 0.2537 | 0.4084 | 0.621 | 0.5344 |
Historical Pattern : Age Group | –0.297 | 0.2121 | –1.4 | 0.16156 |
Historical Pattern : Gender | 0.1822 | 0.2112 | 0.863 | 0.38839 |
Age Group : Gender | –0.0898 | 0.4077 | –0.22 | 0.82567 |
Historical Pattern : Age Group : Gender | 0.1713 | 0.2109 | 0.812 | 0.41668 |
Group means and individual data points for proportion [l] productions for historically /l/- and /n/-onset words, faceted by age and gender. The lines connecting data points connect values for an individual. The shading of the individual points is to allow the visualization of individual overlap.
To better visualize the individual-level data, we also present each individual’s mean proportion of the innovative variant for the two historical categories as a scatter plot in
Proportion [l] for historically /n/ items (y-axis) and historically /l/ items (x-axis) for each individual. Women are plotted with circles and men with triangles. Older speakers are in red and younger speakers in blue. The values have been jittered to minimize overlap.
Data for the [ŋ̩]→[m̩] merger (n = 872) were analyzed with a logistic mixed effects regression model in an identical fashion to those for the [n]→[l] merger, with the variant-specific variables coded analogously (i.e., dependent variable as the probability of [m̩] productions: [m̩] = 1, [ŋ̩] = 0; Historical Pattern: /m̩/ = 1, /ŋ̩/ = –1).
The model output is summarized in
Model output for [ŋ̩] and [m̩] merger in production. Significant factors (
Predictor | Estimate | Std. Error | z | p |
Historical Pattern : Age Group | –0.2848 | 0.21884 | –1.301 | 0.19312 |
Historical Pattern : Gender | 0.05902 | 0.21788 | 0.271 | 0.78649 |
Age Group : Gender | –0.03475 | 0.35812 | –0.097 | 0.92271 |
Historical Pattern : Age Group : Gender | –0.04729 | 0.21276 | –0.222 | 0.8241 |
Group means and individual data points for proportion [m̩] productions for historically syllabic /m̩/ and /ŋ̩/ words, faceted by age and gender. The lines connecting data points connect values for an individual. The shading of the individual points is to allow the visualization of individual overlap.
Proportion [m̩] for historically /ŋ̩/ items (y-axis) and historically /m̩/ items (x-axis) for each individual. Women are plotted with circles and men with triangles. Older speakers are in red and younger speakers in blue. The values have been jittered to minimize overlap.
Data for the [ŋ]↔Ø merger (n = 1023) were analyzed similarly to the two previous mergers with a logistic mixed effects regression model. The variant-specific variables were coded with null-initial assumed as the innovative variant (i.e., dependent variable as the probability of vowel-initial productions: Ø = 1, [ŋ] = 0; Historical Pattern: Ø = 1, [ŋ] = –1).
The model output is summarized in
Model output for vowel initial and initial [ŋ] merger in production. Significant factors (
Predictor | Estimate | Std. Error | z | p |
Age Group | 0.43279 | 0.59628 | 0.726 | 0.468 |
Gender | 0.60362 | 0.58385 | 1.034 | 0.3012 |
Historical Pattern : Age Group | 0.10844 | 0.16359 | 0.663 | 0.5074 |
Historical Pattern : Gender | –0.09481 | 0.16497 | –0.575 | 0.5655 |
Age Group : Gender | 0.62266 | 0.58525 | 1.064 | 0.2874 |
Historical Pattern : Age Group : Gender | –0.22533 | 0.18756 | –1.201 | 0.2296 |
Group means and individual data points for proportion vowel-initial productions for historically [ŋ]- and null-initial words, faceted by age and gender. The lines connecting data points connect values for an individual. The shading of the individual points is to allow the visualization of individual overlap.
These results can be seen more clearly in the individual scatterplot (
Proportion vowel-initial productions for historically [ŋ]-initial items (y-axis) and historically null-initial items (x-axis) for each individual. Women are plotted with circles and men with triangles. Older speakers are in red and younger speakers in blue. The values have been jittered to minimize overlap.
Group-level production results indicate that all three Hong Kong Cantonese merger pairs maintain some form of community-wide contrast, as indicated by a statistically significant effect of historical category. However, underlying these community averages was a wide variety of merging patterns at the individual level. For [n]→[l], speakers ranged from maximally merged to maximally contrastive, while for [ŋ̩]→[m̩] and [ŋ]↔Ø, the majority of speakers were either fully merged or demonstrated an intermediate contrast. There was also a subset of individuals who were ‘hypercorrective’ such that they produced mainly conservative variants for both historical categories (e.g., [n] for both /n/ and /l/). Although degree of contrastiveness varied substantially across individuals, in none of the three mergers is it predicted by demographic factors like age or gender (i.e., no significant interactions were found), which suggests that these mergers should not be characterized as changes-in-progress.
Null responses were removed, accounting for just over 1.5% of the data. The remaining data (n = 3265) were fit to a logistic mixed effects regression model predicting the likelihood of a historical /l/ word response (/l/ = 1, /n/ = 0). Continuum step was centered and scaled, and Talker Age and Gender were sum coded (Age: Older = 1, Younger = –1; Talker Gender: Male = 1, Female = –1). Subject was a random effect with Step as a by-subject random slope.
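As a brief illustration of what centering and scaling does to the 13-step predictor, the sketch below z-scores the step numbers. It assumes scaling by the sample standard deviation, which is the default behaviour of R's scale() function; the text does not specify the scaling constant, so this is an assumption.

```python
from statistics import mean, stdev

steps = list(range(1, 14))  # continuum steps 1..13

step_mean = mean(steps)     # midpoint of the continuum (step 7)
step_sd = stdev(steps)      # sample standard deviation (n - 1 denominator)

# Centered and scaled predictor: 0 at the midpoint, unit standard deviation.
scaled_steps = [(s - step_mean) / step_sd for s in steps]

for s, z in zip(steps, scaled_steps):
    print(s, round(z, 3))
```

After this transformation the model intercept refers to the continuum midpoint, and the Step coefficient is the change in log-odds per standard deviation of step.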
The model output is reported in
Model output for [n]→[l] merger in perception. Significant factors (
Predictor | Estimate | Std. Error | z | p |
Gender | 0.10879 | 0.11218 | 0.97 | 0.332137 |
Step : Gender | 0.1023 | 0.22859 | 0.448 | 0.654494 |
Age Group : Gender | 0.05584 | 0.11223 | 0.498 | 0.618813 |
Step : Age Group : Gender | –0.1741 | 0.22868 | –0.761 | 0.446471 |
Proportion /l/ word responses for the [n]→[l] merger by continuum step. Step 1 of the continuum is a canonical [l] pronunciation and Step 13 is a canonical [n] pronunciation. The dashed blue line and triangles represent the data of the younger listeners, while older listeners are shown in solid red lines and circles. The error bars show standard error.
To explore the individual differences in the discreteness of these lexical items, individual values were quantified using the by-subject contrast coefficient slope (CCS), following the methods provided by Casillas (
Histograms illustrating the distributions of category crispness values for the three continua.
Histograms illustrating the distribution of individual category crispness scores for the [n]→[l] merger continuum; only crispness scores between –0.6 and 0.3 are presented, to show the variation amongst individuals excluding the two extreme outliers. Density plots are underlaid to represent the distribution per demographic group. Women (top) and men (bottom) are plotted separately, while older speakers are colored red and younger speakers colored blue.
To analyze the [ŋ̩]→[m̩] merger, null responses were removed, accounting for just over 2% of the data. The remaining data (n = 3248) were fit to a logistic mixed effects regression model predicting the likelihood of a historical /m̩/ word response (/m̩/ = 1, /ŋ̩/ = 0). All other model specifications were the same as for [n]→[l].
The model output is reported in
Model output for the [ŋ̩]→[m̩] merger in perception. Significant factors (
Predictor | Estimate | Std. Error | z | p |
Age Group | –0.05142 | 0.13962 | –0.368 | 0.712672 |
Gender | 0.0954 | 0.13971 | 0.683 | 0.494698 |
Step : Age Group | 0.15525 | 0.10419 | 1.49 | 0.136216 |
Step : Gender | 0.13829 | 0.10425 | 1.327 | 0.184662 |
Proportion /m̩/ word responses for the [ŋ̩]→[m̩] merger by continuum step with different panels for male and female listeners. Step 1 of the continuum is a canonical [m̩] pronunciation and Step 13 is a canonical [ŋ̩] pronunciation. The dashed blue line and triangles represent the data of the younger listeners, while older listeners are shown in solid red lines and circles. The error bars show standard error.
Individual differences in recognition performance for [ŋ̩]→[m̩] were also analyzed in terms of category crispness. The middle panel in
Histograms illustrating the distribution of individual category crispness scores for the [ŋ̩]→[m̩] merger continuum. Density plots are underlaid to represent the distribution per demographic group. Women (top) and men (bottom) are plotted separately while older listeners are colored red and younger listeners colored blue.
Lastly, the responses to the [ŋ]↔Ø continuum were analyzed in a similar manner. Null responses were removed, accounting for just over 1% of the data. The remaining data (n = 3277) were fit to a logistic mixed effects regression model predicting the likelihood of a historical vowel-initial word response (Ø = 1, /ŋ/ = 0). All other model specifications were the same as above.
The model output is reported in
Model output for the [ŋ]↔Ø-initial model in perception. Significant factors (
Predictor | Estimate | Std. Error | z | p |
Age Group | –0.06238 | 0.0971 | –0.642 | 0.52062 |
Gender | –0.04629 | 0.09708 | –0.477 | 0.63344 |
Step : Age Group | –0.25023 | 0.13259 | –1.887 | 0.05912 |
Step : Gender | –0.16481 | 0.13245 | –1.244 | 0.21341 |
Age Group : Gender | 0.07286 | 0.09704 | 0.751 | 0.45271 |
Step : Age Group : Gender | 0.12243 | 0.13234 | 0.925 | 0.3549 |
Proportion vowel-initial word responses for the [ŋ]↔Ø-initial merger by continuum step. Step 1 of the continuum is a canonical vowel-initial pronunciation and Step 13 is a canonical [ŋ] onset pronunciation. The dashed blue line and triangles represent the data of the younger listeners, while older listeners are shown in solid red lines and circles. The error bars show standard error.
Individual differences for [ŋ]↔Ø were also quantified in terms of category crispness. While the mean (0.029) and median (0.009) are both positive, the range of values (–0.22 to 0.31) spans 0 to a greater degree than the ranges for the previous two continua. Individuals’ data for this continuum are shown in the right panel of
Histograms illustrating the distribution of individual category crispness scores for the [ŋ]↔Ø-initial merger continuum. Density plots are underlaid to represent the distribution per demographic group. Women (top) and men (bottom) are plotted separately while older listeners are colored red and younger listeners colored blue.
Group-level perceptual results indicate that each merger is characterized by different patterns, but unlike production, age appears to be a relevant factor. Simplifying slightly, younger listeners were generally more categorical: significantly so for [n]→[l], a trending effect for [ŋ]↔Ø, and a case where younger women alone appear more categorical for [ŋ̩]→[m̩]. In other words, younger listeners were often less merged in recognition than older participants. Individual category crispness values further demonstrate that individuals varied from no differentiation between lexical items (present for all mergers) to fully categorical (for /n/ and /l/). Across mergers, the /n/ and /l/ categories were the most discrete, followed by /m̩/ and /ŋ̩/. The [ŋ] and null-initial lexical items elicited the least discrete patterns, and the +/– signs of the crispness values further indicate that listeners categorize these items in opposing directions.
To understand the relationship between perception and production at the individual level for the three mergers under investigation, we quantified the degree of mergedness for perception and production. Mergers in perception were quantified using the absolute value of the by-subject crispness values described above. Higher values indicate more discrete perceptual categories, while lower values are taken to indicate a merger of perceptual categories.
Mergers in production were quantified as the absolute value of the difference between two proportions: the proportion of the more novel pronunciation (i.e., [l], [m̩], null) for historical /l/, /m̩/, and null-onset words, and the proportion of the more novel pronunciation for historical /n/, /ŋ̩/, and [ŋ]-onset words. Under this quantification, individuals who produce a full merger 100% of the time (e.g., 100% [l] for historical /l/ and /n/ words) receive the same merger score as those who exclusively hypercorrect (e.g., 0% [l] for historical /n/ and /l/ words; thus 100% [n] for both lexical classes) and those who exhibit a variable mix of pronunciations for both lexical sets (e.g., 50% [l] for historical /n/ and /l/ words). Crucially, however, participants of all these types lack a reliable difference in pronunciation variants between the lexical sets, which we take as an indication of a merger of these categories at the lexical level.
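The production measure just described can be written out as a minimal sketch. The function and argument names are hypothetical; the inputs are one speaker's proportions of the novel variant in each historical lexical set, and the example values are the illustrative cases named in the text rather than observed data.

```python
def production_contrast(prop_novel_in_novel_set, prop_novel_in_conservative_set):
    """Absolute difference in the proportion of the novel variant
    between the two historical lexical sets.

    1.0 = fully contrastive, 0.0 = no reliable difference (merged).
    """
    return abs(prop_novel_in_novel_set - prop_novel_in_conservative_set)


# Illustrative [n]->[l] profiles: (proportion [l] in /l/ words, proportion [l] in /n/ words)
profiles = {
    "fully contrastive": (1.0, 0.0),
    "full merger": (1.0, 1.0),
    "hypercorrection": (0.0, 0.0),
    "variable mix": (0.5, 0.5),
}

scores = {name: production_contrast(*props) for name, props in profiles.items()}
for name, score in scores.items():
    print(f"{name}: {score}")
```

Full merger, hypercorrection, and a variable mix all score 0, which matches the interpretation above: the measure tracks whether the lexical sets are kept apart, not which variant is used.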
Importantly, for the production-perception analysis, we removed the two perceptual crispness outliers (see
Degree of merger in perception (y-axis) versus production (x-axis) for the three mergers. Merger in perception is the absolute value of category crispness and the merger in production is quantified as the absolute value of the difference in the proportions of the novel pronunciation for the two lexical sets for each merger. The range of the x-axis is from 1 (fully contrastive) to 0 (fully merged) to reflect the time course of change. The solid lines represent fitted lines to the data while the dashed lines represent a hypothetical fitted line if production and perception were perfectly correlated.
These measures of merger in perception and production are moderately correlated for the [n]→[l] merger [t(47) = 4.10,
To assess the direction of misalignment within an individual, we calculate the difference between scaled production and perception values per merger (DiffPP, following
DiffPP scores, calculated as the difference of production and perception measures (y-axis) versus production (x-axis, reversed) for the three mergers. DiffPP scores above zero represent individuals whose production is more contrastive (less merged) than perception while scores below zero represent individuals whose perception is more contrastive (less merged) than production.
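The sign convention of DiffPP can be illustrated with a short sketch. The scaling step here is an assumption: min-max rescaling to the range 0 to 1 is used purely for illustration, since the exact scaling procedure is not spelled out in this passage. The function names and the three example individuals are invented.

```python
def rescale(values):
    """Min-max rescale to [0, 1] (assumed scaling, for illustration only)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def diff_pp(production, perception):
    """Scaled production contrast minus scaled perception contrast.

    > 0: production more contrastive (less merged) than perception.
    < 0: perception more contrastive (less merged) than production.
    """
    return [p - q for p, q in zip(rescale(production), rescale(perception))]


# Invented scores for three individuals within one merger.
production = [0.0, 0.5, 1.0]     # absolute difference in proportions
perception = [0.05, 0.30, 0.10]  # absolute category crispness

for score in diff_pp(production, perception):
    print(round(score, 3))
```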
The left panel in
The middle panel in
Lastly, the third panel of
Production-perception analyses, which correlate an individual’s degree of merger in production with their degree of merger in perception, reveal mixed results. We find moderate production-perception correlations for [n]→[l] and [ŋ̩]→[m̩] despite different degrees of overall mergedness, but no correlation for [ŋ]↔Ø. At the same time, all mergers show a similar overall trend of misalignment: where the two domains are not aligned, production remains more contrastive than perception. [n]→[l] and [ŋ]↔Ø show an additional pattern in which individuals show contrast in perception despite little to no contrast in production. Finally, age-based trends suggest that younger individuals are less misaligned than older individuals in the direction of production contrast but more misaligned in the direction of perception contrast.
The two aims of the study were to (1) test generalizations of the production-perception link and (2) describe apparent-time patterning of the merger variants in production and perception. We first discuss the descriptive data relative to previous documentation of these mergers (Sections 4.1 and 4.2). As this study was not designed to provide a definitive or comprehensive overview of the completion status of the mergers, we discuss various possible interpretations consistent with the data but do not diagnose the situation further. Then, we discuss the production-perception results situated in the sound change literature (Sections 4.3 and 4.4).
Summary of results for
Summary of results for
Summary of results for
In To et al. (
The [ŋ̩]→[m̩] merger was likewise reported by To et al. (
For the [ŋ]↔Ø merger, children in To et al. (
Several factors could plausibly lead to the discrepancies between current results and prior predictions about the trajectory of these consonant mergers. We consider here four possible sources: methodological differences, style-shifting, age-grading, and community-wide change.
In terms of methodology, we used several words per merger pair as opposed to the single word used in To et al. (
Another methodological difference between our study and previous work raises the possibility that speakers were style-shifting to a formal register in the current data, given our lab-based reading task, but would have produced more innovative variants in a more casual setting without the influence of orthography (e.g., the picture naming task in
A third possible scenario is that of age-grading
Lastly, production norms may have changed due to reversal of the phonological merger in younger generations. Given that older men appear to be merging [ŋ̩]→[m̩] at a similar rate as previously reported, the current results could be consistent with an incipient reversal. For [n]→[l] and [ŋ]↔Ø, however, younger speakers did not make a larger contrast relative to the older generation. Perhaps these age-undifferentiated results reflect a community-wide change of language ideologies due to increased awareness of the ‘proper’ social meaning of [n] and [ŋ] since the advent of the ‘proper pronunciation’ campaigns in the late 2000s. Notably, these events occurred after To et al. (
If changes in production were indeed driven by an ideological shift, the age-related results for [ŋ̩]→[m̩] could be seen as representing an effect of register that is particularly pronounced in younger speakers; that is, the social association of [ŋ̩] with ‘properness’ may not exist as strongly for older speakers (or, alternatively, older speakers may be less invested in abiding by newer social norms). Moreover, women were more likely to use high rates of [ŋ̩] for both categories, including historical /m̩/. Given that women have been suggested to (a) be more aware of social stigma and (b) use more prestige variants (e.g.,
It is clear that, to fully understand the patterns found in these data, speech style and attitudes must be accounted for. Thus, limitations of the current study include a lack of consideration for multiple registers in the production task and attitudes about ‘lazy pronunciation.’ Accounting for variable styles would have allowed for more conclusive interpretation of descriptive results, especially in comparison to previous literature. At the same time, while we were limited in our methodological choices due to constraints of the stimuli, the formal style elicited in our task does not invalidate our findings, including those of the production-perception analyses. Speakers have a repertoire at their disposal, and we tapped into a particular style of production, which may well have been influenced by our methodological choices; regardless, we need to account for the linguistic knowledge presented to us. Moreover, our choice of an isolated, context-free, single word production task accompanied by orthography is well matched with the perception task, which is characterized by all those features as well. This well-matched formality allows for interpretable (mis)alignment patterns across production and perception.
With respect to ‘lazy pronunciation,’ although the exit interviews confirm that all participants knew what ‘lazy pronunciation’ was, we did not assess awareness or attitudes for each particular merger. Attitudes can vary substantially, potentially influencing patterns of both production and perception. For example, although many participants characterized the mergers as ‘incorrect’ and ‘lazy,’ one young man expressed more neutral or positive sentiments, describing the pronunciations as ‘more convenient’ (see also
Unlike production, a general theme in perception is that younger participants, particularly women, responded to historical lexical classes more categorically than older participants, though the degree of categoricity was objectively low for mergers involving [ŋ].
Another possibility is that, rather than simply demonstrating perceptual flexibility, younger speaker-listeners are in fact beginning to reverse the mergers in their recognition of lexical items, prior to showing clear signs of reversal in production. That is, a phonological change across generations may truly be occurring, where perception leads production in the opposite direction to the historical change. As suggested earlier, this explanation could align with qualitative age-based trends found in [ŋ̩]→[m̩] production. Only future investigations of the continuing trajectory of these variants can shed light on these open questions.
An alternative explanation does not rely on listeners’ decontextualized phonological representations but involves social representations construed from the model speaker’s voice. That is, differences may have arisen due to younger listeners perceiving the model speaker as a peer while older listeners perceived the same speaker as younger (
Moving on to the production-perception link, our individual-level production-perception results reveal a link between production and perception for two consonantal mergers, corroborating recent positive findings from the sound change literature. In particular, these results allow us to more confidently generalize the findings from Pinget et al. (
On the contrary, no production-perception correlation was found for the [ŋ]↔Ø merger. Since the other two merger pairs demonstrated a correlation, the type of change (i.e., merger rather than cue shifting, for example) cannot be the constraining factor, contributing evidence against the hypothesis that production-perception relationships vary by type of change. Still, it is possible that the specific nature of this change is relevant. That is, the other two mergers are ‘standard’ contrast-loss mergers, unlike [ŋ]↔Ø which was originally allophonic and exhibited bidirectional merging. The nature of this change (including the changes in direction) may have contributed to the apparently unlinked representations. Given this, it may be beneficial for future research to consider more fine-grained variability among types of change and how it may influence the existence of a production-perception link. We additionally note that these findings could have resulted from the choice of perceptual stimuli and task, which involved lexical identification between items in a purported minimal pair that did not follow historical tone allophony patterns (i.e., involved an exception). With the current data, it is not possible to tease apart whether this null effect is due to the nature of this particular merger or aspects of the experiment. Future work would benefit from examining the [ŋ]↔Ø merger further in perception.
Despite correlation differences, similar patterns of misalignment were uncovered across all three merger pairs. In general, individuals making a contrast in production showed comparatively less perceptual contrast, indicating that perceptual merger is more advanced than production merger. On the other hand, many individuals with nearly or fully merged production for [n]→[l] and [ŋ]↔Ø, if not ‘realigned’ through full merger in production and perception, in fact show evidence of categoricity in perception; this suggests that at these late stages of production merger, perception is less advanced, or lagging. Given the data at hand, these results fall in line with previous reports of production-perception misalignment with regard to stage of change (
In addition, younger participants on the whole tend towards less misalignment in the direction of production contrast and/or more in the direction of perceptual contrast. This confirms that community-level perceptual and production results, where younger individuals showed more contrast in perception, can be traced to the individual level. A final observation is noteworthy. Unlike the other two mergers, few individuals showed misalignment in the direction of perceptual contrast for the [ŋ̩]→[m̩] merger, even for those merged in production. Why does this difference arise? While speculative, it seems conceivable that [ŋ̩]→[m̩] operates on a reduced perceptual scale due to the lack of acoustic-perceptual salience for nasals (
This study investigates production and perception of the [n]→[l], [ŋ̩]→[m̩], and [ŋ]↔Ø mergers in Hong Kong Cantonese, on one hand documenting the present-day status of these long-standing sound changes and on the other hand extending hypotheses about the production-perception relationship in the context of sound change. Production results demonstrate substantial amounts of individual variability but a lack of clear age- or gender-related variation, indicating that the picture for each merger is less straightforward than continued or completed merger. Perceptual results find that younger speaker-listeners are less merged than the older generation, potentially due to adaptation to high phonetic variability. While this project did not collect the data necessary to fully diagnose the current stage of the mergers, the results do suggest that [n]→[l] and [ŋ]↔Ø no longer appear to be changes-in-progress, though incipient reversal is a possibility for [ŋ̩]→[m̩]; instead, usage of the formal variants appears inflated compared to previous records, suggesting that ‘proper pronunciation’ ideologies are at play. Another possibility is that the mergers have stabilized into context-conditioned style-shifting, a scenario that Labov (
A bird’s-eye view of our results on the production-perception link reveals that, on the whole, it parallels the literature: Some level of linkage evidently exists between production and perception systems, though it is not always strong and not always found. We show here that, in the same population using the same methodology, a moderate production-perception correlation can be found for two mergers yet none at all for another. Our finding of production-perception coupling for two additional consonant mergers specifically bolsters the conclusion that convergent results in the literature are generalizable beyond certain types of sound change. However, while we aimed to use comparable measures for production and perception, we also acknowledge that our experimental choices may still have impacted results. Ultimately, we do not solve the problem of methodological inconsistency, but we urge future research to take these issues into careful consideration.
Interestingly, when correlations were detected, they were of very similar magnitude, even though these mergers are characterized by rather different profiles (e.g., functional load), trajectories, and perceptual salience. This direct comparison of magnitudes indicates that there is a modicum of consistency in the production-perception relationship across cases; notably, such a comparison is difficult to make across studies due to the variability in context, methodology, participants, and more (see
That production and perception systems constrain but do not determine each other is sensible as we expect flexibility in perception beyond variability exhibited in production (e.g., we understand and accept a much larger range of speech than we tend to produce; see
One route forward is to consider moving beyond a search for evidence on the existence of a production-perception link to a more nuanced examination of when and to what degree we are able to detect a relationship. In this vein, we encourage researchers to apply a controlled comparative approach with a larger range of case studies in order to paint a broader typology of sound change and the extent of coupling between production-perception systems. By focusing attention on theorizing why there would not be a link for any particular case, what mechanisms underlie such connections, and what regulates the strength of relationship across controlled groups, variables, or tasks, we can move closer to understanding the persistently mixed results of production-perception studies and the underlying cognitive processes.
The additional file for this article can be found as follows:
Word list for production task. DOI:
In a study on the frequencies of all Cantonese word initial sounds, Ng and Kwok (
While including orthography in our tasks departs from methodology in previous research (e.g., picture naming) and likely elicits particular styles of speech that are more formal than spontaneous speech, because target word selection was constrained by the limited options for the [ŋ̩]→[m̩] and [ŋ-]↔Ø mergers (especially with the requirement of minimal pairs for the perception task), the available target words were not easily picturable. In order to maintain comparability across domains, we chose to include orthography for both production (reading) and perception (word selection) tasks; thus, production and perception tasks were well-matched methodologically. For more discussion, see Section 4.1.1.
Interviews were not conducted fully in Cantonese mainly due to limitations of available personnel, but an added advantage is that we minimize the potential for an interviewer’s pronunciation of Cantonese (e.g., if they used merged variants) to have influenced participants’ responses to minimal pair comparisons and ‘lazy pronunciation’-related questions. In addition, we note that because participants only interacted with the experimenter outside the main task and because the post-task interview was conducted only for additional context, there is no reason to expect this choice to impact the study results in any meaningful way.
The model syntax used was: glmer(ProportionL ~ HistoricalPattern * AgeGroup * Gender + (1+HistoricalPattern|Subject) + (1|Item), family = "binomial").
Model syntax: glmer(PropNovel ~ Step_centered * AgeGroup * Gender + (1+Step_centered|Subject), family = "binomial").
As a comparison point to facilitate interpretation, Casillas and Simonet (
Because the distance was too great, attempts to apply transformations for skewed data were not successful in creating a reasonable distribution; thus, we ultimately chose to remove the outliers.
Due to the removal of the two outliers prior to scaling and visualizations, the maximum scaled value in Figures 12 and 13 represents the next highest CCS value, which was 0.632.
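To illustrate why the scaled maximum corresponds to the next highest value once outliers are dropped, here is a minimal sketch that assumes min-max scaling; the scaling method actually used is not specified in this note. Apart from 0.632, all values are hypothetical, and `drop_largest` is an illustrative stand-in for the outlier removal step.

```python
def drop_largest(values, k):
    """Return values without the k largest entries, preserving original order."""
    if k <= 0:
        return list(values)
    order = sorted(range(len(values)), key=lambda i: values[i])
    removed = set(order[len(values) - k:])
    return [v for i, v in enumerate(values) if i not in removed]

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant sequence")
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical CCS values; only 0.632 comes from the text. The two largest
# values stand in for the two extreme observations that were removed.
ccs = [0.21, 2.9, 0.05, 0.632, 3.4, 0.40]

kept = drop_largest(ccs, 2)
scaled = min_max_scale(kept)
print(max(kept), max(scaled))
```

Under this assumption, the speaker with a CCS of 0.632 sits at the top of the scaled axis, whereas scaling before removal would have compressed all remaining speakers toward the bottom of the range.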
To be conservative, the correlation provided in text includes the two extreme values. Removing those two data points from the correlation does not affect the direction or interpretation of the results [t(45) = 4.54,
It is worth noting that, regardless of their participation in a merger, syllabic nasals show a less robust auditory-acoustic contrast than /n/ and /l/ do.
While this may not be a foolproof approach due to the less perceptually robust nature of nasal place of articulation contrasts, the use of the upper boundary of [n]→[l] crispness values serves as an estimate, given that ‘maximum’ perceptual crispness values are unknown for this particular contrast.
A direct comparison could not be made for the historical null-initial word, as the word
See Wagner (
Merger reversal across contexts and context-based style-shifting cannot be distinguished in the current data, as we do not compare contexts. As well, they may not be mutually exclusive: More prestige-related style-shifting could co-occur with and/or precede incipient reversals in production outside of formal contexts, both motivated by increased awareness and attitudes towards the social meanings of the merger variants. As we propose these outcomes to have the same source, we discuss them together.
We acknowledge the possibility that an age-based effect of hearing sensitivity (or cognitive processing abilities) may have influenced perceptual results to some extent and note that we did not specifically control for this confounding factor. At the same time, because younger men often patterned with older individuals (most noticeably for [ŋ̩]→[m̩]), we take the position that the age trend cannot simply be reduced to a difference in auditory-perceptual abilities.
We thank Chang Liu for conducting data collection at the Hong Kong Polytechnic University. Thank you to all the members of the UBC Speech in Context Lab, particularly Zoe Lam, Kristy Chang, Ivan Fong, Sophie Bishop, Cassandra Savage, Shannon Briggs, and Rachel Soo, for help with stimuli creation, auditory coding, data analysis, and discussion of the project at various stages. Finally, we appreciate all the audience feedback that we received at the 4th and 5th Workshops on Sound Change, the NorthWest Phonetics & Phonology Conference, and the UBC Language Sciences Undergraduate Research Conference.
LSPC is supported in part by funding from the Social Sciences and Humanities Research Council of Canada (SSHRC). This research was partially funded by the UBC Language Sciences Initiative and a SSHRC grant awarded to MB, as well as a research grant (P0001897) from the Hong Kong Polytechnic University awarded to YY.
The authors have no competing interests to declare.