Unmerging the sibilant merger among speakers of Taiwan Mandarin

This study presents empirical evidence from read versus interactive speech to shed light on the nature of the alveolar-retroflex sibilant merger by young speakers of Taiwan Mandarin (TM). TM speakers often merge the two sibilants through deretroflexion of the retroflex category. The results of the reading task showed that the variation is on a full continuum, from a complete merger to clear contrasts, and the merger is more prevalent among male speakers, demonstrating the impact of the social stigma associated with the merger. However, the results of the interactive task demonstrated that speakers who merged the contrast produced the retroflex sounds as distinct from their alveolar counterparts, revealing hidden structures in the mental lexicon. The mismatch between the abstract phonological knowledge and actual implementation in production suggests that the exposure to phonological systems of other speakers, especially those who make clear distinctions, has led to the incorporation of discrete categories into the phonological knowledge of the merged speakers. These findings suggest that large individual variation in the early stages of sound change may provide evidence for possible categories in a given language for language learners; however, their implementation may be further modulated by social as well as other phonetic factors.


Unmerging of phonemic mergers
According to Garde's principle (Garde, 1962; Labov, 1994), a merger, once completed, cannot be reversed by linguistic means, as traces of phonetic differences have been eliminated and learners have no means of accessing the earlier forms during language acquisition. Yet sociolinguistic studies of vowel mergers have generally concluded that phonemic mergers are far from straightforward and cannot be defined in a simplistic and homogeneous fashion (Clark, Watson, & Maguire, 2013; Harris, 1985; Labov, 1994). Specifically, the vast majority of the studies of contemporary phonological systems involve cases of near-mergers or suspended contrasts, a pattern that involves a separation between production and perception (Labov, 1994, pp. 349-70). For example, the vowels in sauce versus source in r-less NYC speech were thought to be identical or indistinguishable in perception. Labov, Yaeger, and Steiner's (1972) instrumental study of the data revealed, however, that the vowel in source was consistently produced as more retracted and higher, as though the /r/ had been retained as in different varieties of American English.
Near-merger patterns offer a unique opportunity to examine the nature of phonological representations and their connection with actual phonetic implementation in production. Most notably, small but consistent acoustic differences in production strongly suggest that speakers have some knowledge of the non-overlapping phonological categories stored in their mental lexicon. In a cleverly designed study, Hay, Drager, and Thomas (2013) compared real and nonce words containing the vowels undergoing a merger across the sets /ɛl/ versus /æl/ (ellen-allan) in New Zealand English (NZE) and /ɑ/ versus /ɔ/ (lot-thought) in American English. The results of a reading task showed that the speakers who merged the contrasts did so to a greater degree for real words than for nonce words. Framed in an exemplar-based model (Johnson, 1997; Pierrehumbert, 2001), the results were taken to indicate that, in the mental lexicon of the merged speakers, the pairs of vowels are represented as belonging to separate phonemes, which were realized as distinct in the production of the nonce words. In contrast, real words are stored as whole forms with acoustic details in the word-level representations, and the implementation of those phonetically rich abstract forms leads to considerable overlap between the two categories.
Some degree of separation, if not full distinction, in the mental representation suggests that the (nearly) merged categories could potentially be unmerged in certain experimental conditions. Hay, Drager, and Warren (2009) explored this idea in a series of word list reading tasks in which NZE speakers were exposed to two different experimenters, one from New Zealand and the other from the US. NZE speakers often merge the vowel contrasts in the near-square lexical set by approximating the diphthong /eə/ toward /iə/. When interacting with the US experimenter, however, they unmerged the two vowels to a greater degree, accommodating the speech of the experimenter who retained the vowel distinction. In another study, Babel, McAuliffe, and Haber (2013) showed that NZE speakers may also merge or unmerge this contrast based on their perception of their interlocutor. Specifically, NZE speakers merge the contrast less when exposed to an Australian model talker with clear vowel distinctions who was described as holding positive views of New Zealand. When that same talker was described as having a negative view of New Zealand, however, the speakers did not imitate the vowel distinction made by the model talker.
Taken together, the results of these studies suggest that social factors may also mediate the unmerging of merged categories.
How, then, can learners acquire discrete categories when they cannot perceive the differences? To begin to answer this question, we must consider the environment in which near-mergers emerge. In a given speech community, speech varies greatly among individuals who are exposed to different phonological systems, some of which may make a clear distinction between categories that they themselves merge. Among the sub-patterns of near-mergers (Harris, 1985; Labov, 1994; Clark et al., 2013),¹ 'merger by approximation' refers to a case in which one of the categories is brought closer to the phonetic space of another category, eliminating the phonetic distinctions between the categories. Clark et al. (2013) argued that this type of merger involves approximation of one category to another which is likely to involve a gradual change, so the merger is likely to be incomplete and variable throughout the course of the linguistic change.
Even though some speakers eliminate the contrasts entirely in their own speech, they are still exposed to clear distinctions, which may be incorporated into their phonological knowledge. The impact of community-level variation is clearly demonstrated in the disassociation between perception and production patterns by the speakers who merge the contrast. That is, speakers may merge the phonemic contrasts in their own speech, but they are not necessarily incapable of distinguishing the relevant categories in perception. For instance, Hay, Warren, and Drager (2006) showed that NZE speakers of the near-square vowel merger were fairly accurate at identifying intended targets when forced to choose between minimal pair words (e.g., beer versus bear) given the stimuli produced by speakers who made the contrast. Hay et al. (2013) extended this to the perception of nonce words as well; they observed that NZE speakers were able to identify the /ɛl/ versus /æl/ contrasts in nonce word pairs (e.g., lellit versus lallit) nearly as well as those in real word pairs (e.g., Ellen versus Allan). Along these lines, Wade (2017) found that speakers in Youngstown, OH, who merged the pool, pull, pole word sets were adequately sensitive to the secondary cues to the tense-lax vowel distinctions, namely vowel duration. Although they may have collapsed the tense-lax vowel contrasts along the spectral dimension, they relied more on vowel duration cues than speakers from Burlington, VT, who maintained the contrast when identifying the tense vowels, particularly those in the pool lexical set. The weak link between perception and production has generally been interpreted to mean that speakers who merge a contrast still encounter distinct forms from other speakers in their speech community, and this experience helps them maintain separate exemplar clouds of the two categories (Hay et al., 2013; Hay et al., 2006; Nycz, 2013).
¹ Other types identified in the literature include 'merger by transfer,' which refers to a case where individual lexical items are mapped onto another item in the lexicon, and 'merger by expansion,' which refers to a sudden collapse in phonetic space between two or more categories. Each has been argued to have characteristic features and is likely to go through a different developmental path. This paper focuses on 'merger by approximation,' as the sibilant merger in TM falls under this sub-pattern.
Building upon the previous research, the present study aims to contribute another piece of evidence for variable realizations of the phonemic merger conditioned by different experimental tasks. While earlier works have focused mainly on vowel mergers (e.g., Labov, 1994) or tonal mergers (Fung & Lee, 2019; Mok, Zuo, & Wong, 2013; Yu, 2007), the present study investigates the sibilant merger in TM. The merging of sibilants in TM is known to be variable, and the social factors that condition it have been extensively documented in traditional descriptive studies.
In the first experiment, we use a reliable instrumental method to first establish the extent of merging among individual TM speakers and thereby assess how far the phonetic variation has progressed among the younger members of the speech community. In Experiment 2, we then examine whether the speakers who merged the contrast in Experiment 1 would be able to unmerge the category, using a novel application of the methodology that has been used to elicit contrastive hyperarticulation in speech production. In the following, the current status of the sibilant merger in TM is introduced in Section 1.2, and the details of the methodology and its core findings are reviewed in Section 1.3.

Variation in sibilant merging in Taiwan Mandarin
Standard Mandarin makes a phonemic distinction between alveolars /s ts tsʰ/ and retroflexes /ʂ tʂ tʂʰ/. In the Taiwanese variety of Mandarin, however, speakers often merge the two categories primarily through deretroflexion of the retroflex category (Ing, 1984; Kubler, 1985). Some scholars have pointed out that the variation is not categorical but exists on a continuum (Chung, 2006; Lin, 2007). Chung (2006, p. 200), for example, argued that Taiwan Mandarin (TM) speakers' retroflex production demonstrates considerable variation ranging "from highly retracted,  Chiu, Wei, Noguchi, and Yamane (2019), for example, investigated the production of TM sibilants by seven speakers using ultrasound imaging and identified three different groups: overlap, non-overlap, and context-dependent overlap. The tongue curves of the two overlap speakers were indistinguishable from the tongue tip to root for alveolar and retroflex sibilants. For the two non-overlap speakers, the tongue curves significantly diverged at the tongue tip with the alveolar sibilants exhibiting a significantly more concave tongue body gesture than their retroflex counterparts. For the remaining three speakers, sibilant merging was conditioned by vowel context. The articulatory differences were faithfully reflected in the acoustic properties of the frication noise operationalized by the Center of Gravity (CoG) values. Notably, even in Chiu et al.'s (2019) small sample of speakers, the varying degrees of sibilant merging were captured.
It is worth noting that even the TM speakers who maintain the contrast between sibilants diverge from the reported norms of retroflex articulation in Mainland Mandarin. focus. The differences were primarily driven by the higher spectral means of the TM retroflexes, indicating weaker retroflexion in TM. Anecdotally, the strong retroflexion in Mainland Mandarin is hardly ever observed in the speech of TM speakers. Chung (2006) argued that intermediate retroflexion has emerged as the socially neutral and acceptable norm for the retroflex category in Taiwan. The differences in sibilant production are well-known among the speakers from both regions and are considered one of the most salient dialect-distinguishing features (Chang, 2017).
The conditioning factors of the variation in TM sibilants have been subject to extensive research among Taiwanese scholars who have proposed an interaction between linguistic and social factors. The merger has long been thought to arise from language contact with Taiwanese Southern Min (TSM, hereafter, also commonly referred to as 'Taiwanese' or 'Hokkien') which lacks the retroflex category altogether (Ing, 1984;Kubler, 1985), a feature typical of Southern varieties of Chinese (e.g., Cantonese, Shanghainese, and Hakka, Chen, 1999). Though not the official language of Taiwan, TSM is a major substrate language which is spoken by 70% of the population and of which most TM speakers have at least some passive knowledge (Huang, 1993;Sandel, 2003). Given the asymmetrical sibilant inventories of the two languages in contact, the predominant pattern of the sibilant merger in TM is considered deretroflexion, the approximation of the retroflex sibilants toward the alveolar area (Kubler, 1985;Lin, 1988;Wei, 1984).
However, the direct connection between TSM and the sibilant merger seems to have weakened in recent years. In a large-scale phonetic study, Chuang, Sun, Fon, and Baayen (2019) tested 331 young ethnically-Min TM speakers. Their results showed that speakers living in Southern Taiwan were more likely to speak TSM fluently than those living in Northern Taiwan, consistent with the linguistic landscape of the island (Ang, 2010). However, TSM proficiency did not predict the degree of sibilant merging; fluent speakers of TSM did not necessarily merge the two categories more frequently. Rather, speakers who lived in rural areas were more likely to merge the sibilants than those residing in urban areas, indicating that the merger no longer originates from direct language contact with the local substratum languages and is becoming an independent feature of TM. Building upon this line of research, the present study aims to establish the patterns of the sibilant merger in terms of linguistic and social factors and to investigate the extent to which the merger has spread among the speech community in Taiwan. In particular, if TM speakers who are not in contact with TSM are found to consistently merge sibilants, one could argue that this once-substrate feature has begun to emerge as a general feature of TM.
Sibilant merging in TM is somewhat stigmatized due to its association with the substratum language, leading to linguistic stratification. Since the retreat of the Nationalist party to Taiwan in 1949, Standard Chinese was forcefully imposed on the local population, and the social prestige of TM has been deeply solidified among the speech community (Feifel, 1994;Huang, 1993;Sandel, 2003). Sociolinguistic interviews with young college students revealed that female speakers who care for refinement ('qizhi (氣質)' in Mandarin) and higher social prestige tend to avoid using TSM (Su, 2008). According to one of the male interviewees, "Women really do not speak Taiwanese as much. Maybe they find it lacking in qizhi" (Su, 2008, p. 345), reflecting the deep-seated stereotype of TSM and its negative cultural connotation. Notably, a conventional belief among TM speakers is that TM females produce retroflexes better than male speakers, which has been confirmed by researchers (Jeng, 2006;Tse, 1998). Deretroflexion, as one of the representative features of TSM, is thus found much less frequently in female speech than in male speech.
The goal of the current study is to frame the TM sibilant merger in terms of the relevant linguistic and social factors. Unlike phonemic mergers that are generally neutral in terms of social standing, the sibilant merger in TM is highly sensitive to social factors as discussed above. Merging is thus expected to be highly variable among individuals. Moreover, while some speakers may not contrast the sibilants themselves, the distinct categories produced by other speakers may have been incorporated into their phonological knowledge. It is therefore possible that the merged categories could be unmerged by such speakers in a particular experimental setting. To that end, we capitalize on an experimental paradigm that is known to elicit 'contrastive hyperarticulation,' an enhancement of phonetic cues to phonemic contrasts due to the existence of lexical competitors, as reviewed below. We investigate the conditions in which the sibilant contrast is fully realized, reflecting the underlying representations stored in the speakers' mental lexicon.

Contrastive hyperarticulation in speech production
Minimal pair competitors, a special form of phonological neighbor, are known to trigger a significant enhancement of phonetic cues associated with the relevant phonological contrasts. Wedel, Nelson, and Sharp (2018), for example, used The Buckeye Corpus of Conversational Speech (Pitt, Johnson, Hume, Kiesling, & Raymond, 2005) to investigate the ways in which phonetic cues are hyperarticulated as a function of the existence of lexical competitors. The results showed that English voiceless stops with lexical competitors tended to be produced with longer VOT (e.g., pat/bat) than those lacking competitors (e.g., pant/*bant). In contrast, English voiced stops with lexical competitors were produced with shorter VOT (e.g., bat/pat) compared to those without (e.g., badge/*padge). The opposite direction of the VOT changes of voiceless and voiced stops resulted in the enhancement of the distance between the two contrasting categories. Similar effects were found for vowel contrasts: Lax vowels tended to be centralized while tense vowels tended to be peripheralized in the vowel space (e.g., [ɪ] in ship and [i] in sheep).
In their seminal work using a cooperative interactive paradigm, Baese-Berk and Goldrick (2009) examined the implementation of the stop voicing contrast in English.² In this study, a speaker (participant) and a listener (experimenter) sat face-to-face, each with a computer screen in front of them (see Figure 4). The speaker produced a target word among three candidates out loud for the listener who would then indicate the target by clicking it on his/her screen. Crucially, each target word appeared in three different conditions: Context, No Context, and No Competitor. In the Context condition, a target word appeared along with two other words, one of which was its lexical competitor and the other a filler (e.g., cod [target], god [competitor], yell [filler]). In the No Context condition, the same target word appeared with two fillers (e.g., cod, lamp, yell). The No Competitor condition contained a target lacking a legal lexical competitor along with two fillers (e.g., cop, lamp, yell). The results showed elongated VOT for the voiceless stops in both the Context and No Context conditions compared with those in the No Competitor condition. Furthermore, the VOT was enhanced to a greater degree when the targets were presented overtly with their competitors in the Context condition.
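The three display conditions described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `build_display`, the condition labels, and the random on-screen ordering are assumptions made here, not details reported for the original experiment.

```python
import random

# Hypothetical labels for the three conditions described in the text.
CONDITIONS = ("context", "no_context", "no_competitor")


def build_display(target, fillers, condition, competitor=None, rng=None):
    """Return the three words shown on one trial, in shuffled order.

    context       : target + its minimal-pair competitor + one filler
    no_context    : target + two fillers (competitor exists but is not shown)
    no_competitor : target without a legal competitor + two fillers
    """
    rng = rng or random.Random()
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    if condition == "context":
        if competitor is None:
            raise ValueError("the Context condition needs a competitor")
        words = [target, competitor, fillers[0]]
    else:
        words = [target, fillers[0], fillers[1]]
    rng.shuffle(words)  # screen order is randomized here (an assumption)
    return words


# Examples taken from the text: cod/god/yell, cod/lamp/yell, cop/lamp/yell.
print(build_display("cod", ["yell", "lamp"], "context", competitor="god"))
print(build_display("cod", ["lamp", "yell"], "no_context"))
print(build_display("cop", ["lamp", "yell"], "no_competitor"))
```

Note that the No Context and No Competitor displays are built the same way; they differ only in whether the target itself has a lexical competitor in the language.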
Mandarin sibilants have also been examined in an interactive task with a partner, though with some methodological differences. Chang and Shih (2015) utilized prosodic focus to elicit the contrast enhancement between Mandarin alveolar and retroflex sibilants in a map task.
Notably, they pre-screened their participants and included only those who made clear contrasts, excluding speakers who merged the contrast. Their stimuli included a pair of non-word location names, with the target containing one of the two sibilants (e.g., 扎狗海岸 '/tʂakoʊ/ beach') and the corresponding control item without any sibilants (e.g., 莽狗海岸 '/maŋkoʊ/ beach'). The experimenter would ask a question about a location on the map using a control item, and the participant-speaker would correct the direction with the target containing the sibilant. Their production would thus be under contrastive focus, and the sibilant place contrast was predicted to be enhanced. However, neither the TM speakers nor the Mainland Mandarin speakers enhanced the contrast in this condition, which led the authors to conclude that the coronal sibilants were not subject to cue-enhancing hyperarticulation. This result may suggest that, unlike lexical competition, prosodic focus alone may not be sufficient to drive significant contrast enhancement for the sibilants.
² In the same study, they obtained similar results in a single-word reading task without the simultaneous presentation of minimal pair competitors (Experiment 1). The interaction with an interlocutor is, therefore, not essential for contrastive hyperarticulation, based on which the authors argued for a production-internal mechanism as the source of the observed effect. However, the present study is not concerned with whether contrastive hyperarticulation arises from a production- or perception-oriented mechanism, or whether the effect stems from competition with minimal pair competitors or broad lexical neighbors (Fricke, Baese-Berk, & Goldrick, 2016; Kirov & Wilson, 2012). Rather, the paradigm is adopted here to address the novel question of whether speakers who merge a category have stored distinct categories in their mental lexicon.
The stimuli used in this study were geographic nonce words and their lexical competitors were not legitimate Mandarin words (e.g., *tsakoʊ). Hence, it remains to be determined whether contrast enhancement could be obtained via lexical competition for the Mandarin sibilants.
To that end, the present study capitalizes on the experimental paradigm modeled after Baese-Berk and Goldrick (2009), which has been shown to elicit an enhancement of phonemic contrasts in the presence of lexical competitors. The employment of Chinese characters as stimuli is expected to enable phonological effects to be isolated while minimizing a direct orthographic interference/facilitation. Unlike previous interactive studies in which the Roman alphabet was used to present the stimuli, the study at hand uses logographic and phonologically opaque Chinese characters, which are expected to provide stronger evidence for contrastive hyperarticulation. Clear contrasts, if any, cannot be attributed to the visual cues available in the alphabetic encodings of the contrasting sounds.
Additionally, the current study differs from previous studies on contrastive hyperarticulation in one crucial regard. Previously reported cases of contrastive hyperarticulation have focused primarily on phonological contrasts that are robustly represented by the speakers of the language (e.g., stop voicing and vowel tenseness). In these cases, the presence of minimal pair competitors, explicit or implicit, consistently leads to the hyperarticulation of the phonetic cues that enhance the contrast. TM sibilant contrasts, however, are subject to large speaker variation. Some speakers merge the contrast, while others retain it. Still others fall between these two extremes.
How, then, would speakers with different degrees of merging cope with an experimental task designed to elicit contrastive hyperarticulation using lexical competitors?
Sociolinguistic studies, in fact, have shown that eliciting contrasts via minimal pairs may not necessarily improve the contrast made by speakers who typically merge it (Johnson & Nycz, 2015;Labov, Karan, & Miller, 1991;Labov et al., 1972;Nycz, 2013). For example, Nycz (2013) used a variety of tasks, including naturalistic conversation, word lists, and minimal pair reading, to test Canadian speakers of the cot-caught merger who had been exposed to the NYC dialect, which makes a robust distinction. Surprisingly, the speakers merged the vowels in the minimal pair context but carried small but consistent vowel distinctions in their conversational speech.
A similar trend was observed in the speech of individuals moving in the opposite direction, from split to merged: Advanced forms (i.e., merged contrasts) were observed more frequently in conversation than in the minimal pair context (Johnson & Nycz, 2015). The authors conjectured that different tasks may reveal the multi-faceted nature of a speaker's linguistic knowledge.
While minimal pair contexts encourage speakers to express the phonetic norms of their original dialect variety, conversation reveals their adaptation toward the characteristics of the sounds predominant in their new speech community.
Building upon these previous works, the present study examines how the variable sibilant merger in TM is realized across different experimental tasks, which would help us understand the interplay between abstract representations, phonetic implementation, and the conditioning social factors. In Experiment 1, we first establish the extent and the range of the merger between individuals, thereby assessing the extent to which this pattern has spread among young TM speakers. Experiment 2 examines whether the speakers who merged the contrast in Experiment 1 could be induced to make clear contrasts using an interactive task designed to elicit contrastive hyperarticulation.

Experiment 1: Sibilant production in read speech
The first experiment was designed to characterize the sibilant production of individual TM speakers. Because the merger is still in progress, the degree of merging is expected to vary across speakers.

Participants
Sixty native TM speakers (32 female, 28 male; aged 20-29) participated in the production study. The participants were divided into two subgroups with respect to their TSM proficiency: 31 TSM-fluent versus 29 TSM-weak speakers. Prior to data collection, participants were pre-screened for their proficiency in TSM as well as TM, Hakka, and English. In a language background questionnaire, participants rated their confidence of listening and speaking for each language on a 7-point Likert scale ('1' being not at all confident, '7' being highly confident). Those who rated their TSM proficiency as greater than 5 (TSM-fluent, hereafter) or lower than 3 (TSM-weak, hereafter) were invited to participate in the study. In addition to language proficiency, the questionnaire gathered details about birthplace, location of residency, family language background, and daily language use. As shown in Table 1, most TSM-fluent speakers were born and raised in Southern or Central Taiwan, and TSM-weak speakers were predominantly from Northern Taiwan, reflecting the general linguistic landscape in Taiwan (Ang, 1997;Ang, 2010).
Notably, all speakers identified themselves as most fluent in TM. TSM-fluent speakers rated their fluency in TSM lower than their fluency in TM, and the difference was statistically significant (listening: t(44) = 2.102, p = .041; speaking: t(41) = 3.564, p = .001).
While TSM-weak speakers indicated low fluency in TSM, the mean TSM listening of 2.07 indicates that they had at least some passive knowledge of this language, consistent with descriptions in the literature (Huang, 1993;Sandel, 2003). Both groups self-rated their level of English proficiency as intermediate; there was no significant difference between groups (listening: t(54) = .125, p = .901; speaking: t(53) = .865, p = .391). It is worth noting that TSM-fluent speakers expressed higher confidence in TSM than in English, while TSM-weak speakers ranked their confidence in TSM lower than English.
Participants' TSM fluency was verified during their lab visit through a short conversation in TSM with the experimenters who were fluent in TSM. None of the participants reported any hearing or speech disorders. Participants received small monetary compensation for their time.

Stimuli
The stimuli consisted of four disyllabic words containing the sibilants /s ʂ tsʰ tʂʰ/ in the word-initial position followed by the vowel /a/ carried by Tone 1 (X⁵⁵).³ The non-target second syllable of the stimuli was fixed with a labial initial followed by the vowel /a/. Stimuli items varied in their morphosyntactic compositions; for example, some were nouns (e.g., 沙發 'couch'), while others were phrasal (e.g., 撒滿 sprinkle-full 'fully sprinkled'). Since frication noise is sensitive to the neighboring phonological environment, especially lip rounding, the phonological conditions were balanced in the selection of the stimuli words. None of the target items had minimal pair competitors distinguished by the initial sibilants (e.g., 沙發 /ʂa.fa/, */sa.fa/). In addition to the target items, 32 filler items were included. The experimental stimuli for Experiment 1 are listed in Table 2.

Procedure
Participants were recorded individually in a sound-attenuated booth in the Experimental Phonology lab at National Yang Ming Chiao Tung University in Taiwan. They were asked to read aloud each stimulus item on a computer screen in a frame sentence (/wo²¹⁽⁴⁾ ʂuə⁵⁵ _____ tʂə⁵¹ kə⁰ tsi⁵¹/ "I say __ this word"). Because target words were carried by a frame sentence, they were under narrow focus, and participants' production was clear and formal. Participants were familiarized with the stimuli items prior to the recordings, and no prosodic disfluency or abnormalities were observed during the experiment. Participants pressed a key on a computer keyboard to proceed to the next trial at their own pace. A randomized reading list was repeated five times (36 × 5 = 180 trials). All stimuli were presented in Chinese characters, which are logographic and essentially non-alphabetical. The recording was made using an AKG C520L condenser microphone with a Zoom H4n digital recorder at a sampling rate of 44,100 Hz.
³ Some of the words initially chosen as target items were treated as fillers in the analysis stage and were not analyzed further. For one, a pair of words with the unaspirated affricates (/tsa⁵⁵ pa²¹⁴/ 紮吧 'tuck in!' versus /tʂa⁵⁵ man²¹⁴/ 扎滿 'injection') was problematic: specifically, large speaker variation was found for /tsa⁵⁵/ 紮 'tuck.' Unlike the dictionary transcription, many DISTINCT speakers produced it as /tʂa⁵⁵/. The spectral characteristics of this item patterned together with other retroflexes /ʂa tʂʰa/, indicating that this word was represented in the mental lexicon as having a retroflex. Second, the stimuli list also included corresponding sibilants followed by the vowel /u/. However, the rounded vowel sometimes leads to a bimodal spectral distribution (see Figure 4 in Lee-Kim, 2011), especially for the retroflex sibilants, which is not ideal for the spectral moment analysis (Forrest, Weismer, Milenkovic, & Dougall, 1988).
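The trial count reported above (36 items × 5 repetitions = 180 trials) can be illustrated with a short sketch. The item labels are placeholders, and the sketch assumes that the list was reshuffled on each repetition; the text does not say whether a fresh order was used each time, so that detail is hypothetical.

```python
import random


def build_reading_list(targets, fillers, repetitions=5, seed=None):
    """Return a flat trial list: every item once per repetition, shuffled.

    Reshuffling within each repetition is an assumption of this sketch.
    """
    rng = random.Random(seed)
    items = list(targets) + list(fillers)
    trials = []
    for _ in range(repetitions):
        block = items[:]      # copy so the master list stays untouched
        rng.shuffle(block)    # new random order for this repetition
        trials.extend(block)
    return trials


# Placeholder labels: 4 target items and 32 fillers, as in the text.
targets = ["T1", "T2", "T3", "T4"]
fillers = [f"F{i}" for i in range(1, 33)]
trials = build_reading_list(targets, fillers, repetitions=5, seed=1)
print(len(trials))
```

Each item occurs exactly once per repetition, so every target contributes five tokens per speaker.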
The recordings were first annotated for the frication noise in Praat (Boersma & Weenink, 2020). Since the sibilants differed in the manner of articulation (i.e., aspirated affricates, fricatives), the boundaries of the noise signal had to be carefully labelled. For the aspirated affricates /ts h tʂ h /, the burst and aspiration were segmented out, leaving only the frication noise for acoustic analysis. The frication noise was marked from the onset of high-frequency noise to the onset of aspiration or that of the following vowel. The separation between frication and aspiration was not always well defined; in some cases, the noise interval was fully fricated without aspiration, and in other cases the aerodynamic change was gradual making it difficult to place a clear-cut boundary. While generally following the criteria implemented in Chang and Shih (2015), we ensured that the noise portion with the highest energy concentration of the frication was located in the middle of the two boundaries for all cases. The middle interval of the frication noise was used for the acoustic analysis.
The segmented sibilants were then saved as individual sound files and submitted to a multitaper spectral analysis implemented in Matlab (Blacklock, 2004; Blacklock & Shadle, 2003). This particular spectral analysis ensures reliable and accurate spectral estimates and has been adopted in previous works (e.g., Koenig Among the four spectral moments (Forrest et al., 1988) drawn from the multitaper spectral analysis, the first moment (M1) representing the mean of the spectral energy distribution was used to summarize the noise property ('spectral mean,' hereafter). Following Chang and Shih (2015), we further computed 'spectral distance' for each speaker by subtracting the spectral mean of the retroflex sibilants from that of the alveolar sibilants (∆M = M(alveolar) − M(retroflex)).
A spectral distance of zero would indicate the two categories had been merged, whereas a large spectral distance would indicate a speaker made a clear distinction between the two categories. This finding confirms previous impressionistic descriptions about TM sibilants (Chung, 2006;Lin, 2007)-sibilant merging by TM speakers is on a continuum, ranging from fully merged to completely distinct categories, making it difficult to draw a boundary between speakers.

Results
More precisely, TM speakers differ in the degree of merging. The results also confirm the previous impressionistic and instrumental studies which found that TM sibilant contrasts are acoustically less salient than those found in Mainland Mandarin. The maximum spectral distance was around 4 kHz for the data based on 60 young TM speakers in this study, while those based on Mainland Mandarin have been reported to be greater than 4 kHz (Chang & Shih, 2015; Lee-Kim, 2011). In contrast, the spectra of Speaker ZXK clearly show a bimodal distribution, with the alveolar fricatives having an energy concentration at higher frequencies and the retroflex fricatives at much lower frequencies. For speakers who made a clear distinction between the two sibilants, the retroflex spectra often showed sharp peaks in the 3-5 kHz region.
The spectral means of the frication noise were analyzed further using two statistical methods.
First, we performed a Bayes factor test using the Bayesian First Aid package (Bååth, 2014) in [...] maintained statistically significant differences between the two categories, no matter how small the differences were. 5 The dotted vertical lines in Figures 1 and 2 represent the dividing line between merged and distinct speakers. Generally, the spectral distance for merged speakers was less than 1,000 Hz.
5 The categorical distinction between MERGED and DISTINCT speakers was obtained through statistical analyses.
Among the DISTINCT speakers, in particular, there was a large variation ranging from speakers barely implementing spectral differences of 1,000 Hz to speakers implementing spectral differences of 4,000 Hz. The speakers who maintained small but reliable spectral differences between the two categories were likely to be the so-called 'near-mergers' (Labov, 1994). The spectral difference of 1,000 Hz in the high-frequency region is ostensibly small, e.g., 6,000 Hz (29.7 in ERB) and 7,000 Hz (30.8 in ERB), and is highly likely to be imperceptible, but, nonetheless, those speakers maintained a consistent and reliable spectral distance in their production of the two categories. Although a perception study is warranted to determine if these were cases of near-merger or full contrasts, it is beyond the scope of the present study. Here we simply aim to establish a conservative but reasonable boundary between the two groups. These results align well with the figures based on Bayesian tests showing that merged speakers were not limited to TSM-fluent bilinguals; in particular, male merged speakers were equally represented across the two groups (6 TSM-fluent, 6 TSM-weak). Apparently, sibilant merging is no longer a consequence of direct contact with TSM. TSM-weak speakers could barely speak the language (TSM-speaking score: 1.59/7, Table 1) and seldom heard it (TSM-listening score: 2.07/7) in their everyday lives. Despite past studies positing the association between TSM and the merger, the results of the present study add to the mounting evidence being compiled for the autonomy of the sibilant merger from TSM usage (Chuang et al., 2019).
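The ERB figures quoted above can be approximated with a short calculation. The text does not say which ERB-rate formula was used; the sketch below assumes the Moore and Glasberg (1983) formula, with frequency in kHz, because it yields values close to those quoted. The function name is hypothetical.

```python
import math


def erb_rate(freq_hz):
    """ERB-rate of a frequency in Hz (assumed: Moore & Glasberg, 1983)."""
    f_khz = freq_hz / 1000.0  # the formula takes frequency in kHz
    return 11.17 * math.log((f_khz + 0.312) / (f_khz + 14.675)) + 43.0


low = erb_rate(6000)   # lower edge of the 1,000 Hz example
high = erb_rate(7000)  # upper edge of the 1,000 Hz example
step = high - low      # size of the 1,000 Hz gap on the ERB scale
```

Under this assumption, 6,000 Hz and 7,000 Hz come out at roughly 29.7 and 30.9 ERB, so a 1,000 Hz difference in this region spans only about one ERB unit. This is the sense in which the difference is small on an auditory scale despite being consistent in production.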
However, the gender-dependent asymmetry was shown to hold true as described in the literature (Jeng, 2006; Tse, 1998) such that male speakers maintained smaller spectral distances between the two categories than female speakers, as reflected in the results of the regression analysis (Table 3). This is not entirely surprising given that men have been shown to acquire socially stigmatized forms more readily than women (Eckert, 1989; Labov, 1990) [...] more frequently subject to the evaluation of their qizhi (refinement) (Su, 2008). As such, if a representative feature from TSM were to spread to non-native speakers, it would be adopted first by male speakers, who are presumably under less pressure to avoid stigmatized features in their speech. The results are thus indicative of the complex interplay between gender and social prestige.
A closer look at the results, in fact, reveals another gender-dependent difference in the absolute spectral means of the sibilants. Some female speakers with clear contrasts, in particular, demonstrated spectral means as high as 10 kHz for the alveolar category, which indicates a more anterior place of articulation, namely dentalization. This might have to do with the stigmatization of strong retroflexion, which is a well-known characteristic of Mainland Mandarin spoken in China (Chang, 2017; Chang & Shih, 2015) [...] (2012) showed that middle-aged TSM-dominant female speakers tend to produce more anterior alveolar sibilants to compensate for the lack of the retroflexes in their speech. In both cases, dentalization seems to be favored by female speakers as a means to enhance the phonological contrast while appearing sufficiently 'refined.'

Experiment 2: Sibilant production in an interactive task
Having established the spectral characteristics of individual TM speakers' sibilant production, we investigated whether the merged categories could be unmerged in a particular experimental condition. To that end, an interactive task was administered to the merged and distinct speakers who fell on the extreme ends of the continuum. The latter group was included as a reference and was expected to show contrastive hyperarticulation in the presence of lexical competitors, as shown in previous studies.

Participants
Twenty TM speakers (M(age) = 23.5) who had participated in the first experiment were invited to participate in the interactive task, which took place between a few days and 1.5 years after they had completed the reading task. The participants for this task were chosen from the two extreme ends of the continuum established in Experiment 1. Again, speakers who completely merged the sibilants are referred to as merged speakers, and those who maintained a sufficiently large spectral distance between the sibilants are referred to as distinct speakers. There was no definitive objective criterion constituting 'sufficient' spectral distance; however, a spectral distance greater than approximately 2 kHz was deemed a reasonably large value for TM speakers, given the range of the spectral distance from 0 to around 4 kHz. 6 Crucially, the two groups were established to assess similarities and differences in the way the sibilants are produced during the interactive task by speakers with varying degrees of merging. Table 4 summarizes the itemized numbers of the participants according to the known factors influencing the sibilant merger. TSM fluency was balanced between the two groups, while the gender factor was slightly skewed as there were inherently more male merged speakers.

Stimuli
Eighteen minimal pairs contrasting in the initial sibilants (山腳 /ʂan 55 tɕiaʊ 214 / 'hillside' versus 三角 /san 55 tɕiaʊ 214 / 'triangle') were compiled for Experiment 2. The items were balanced for manner of articulation: six pairs each of fricative sibilants, unaspirated affricates, and aspirated affricates. The sibilants were followed by either unrounded homorganic approximants (e.g., [sɹ̪ ʂɽ], represented as /si ʂi/ below) (Lee-Kim, 2014) or the rhymes /an/ or /aŋ/. The retroflex items were the target words that the participants produced during the interactive task, and the corresponding alveolar items served as lexical competitors. The word frequencies were obtained through the Academia Sinica Balanced Corpus of Modern Chinese (http://asbc.iis.sinica.edu.tw/) and are summarized in Appendix B. A t-test run in R v.4.0.3 (R Development Core Team, 2020) confirmed null differences in the word frequency between the retroflex-targets (M = 3.15) and the alveolar-competitors (M = 3.00) (t(33) = 0.4234, p = 0.6747).
6 While there seems to be no single, uncontroversial method for further dividing the DISTINCT speakers, Chang and Shih (2015) provided some insight for the present case. In their study comparing TM speakers with Mainland Mandarin speakers on sibilant production, perceptual judgments were first employed to screen out some TM speakers who did not convey clear distinctions between the two sibilants. Two out of ten speakers were excluded from the experiment for not contrasting the retroflexes more than 60% of the time. The remaining eight participants implemented a spectral distance of approximately 2 kHz (Figure 2 in Chang & Shih, 2015). Although this value was not established through a well-controlled perception study, it seems to be a reasonable boundary for marking relative perceptibility. Further, based on our own perceptual impressions, we confirmed that those classified as DISTINCT speakers made clear category distinctions. It is hoped that future studies explore this topic with a focus on perceptual consequences of spectral distance.

Table 4: Number of participants in Experiment 2 by group, TSM fluency, and gender (F = female, M = male).
             MERGED          DISTINCT        Total
TSM-fluent   6 (2 F/4 M)     5 (4 F/1 M)     11
TSM-weak     4 (1 F/3 M)     5 (3 F/2 M)     9
Total        10 (3 F/7 M)    10 (7 F/3 M)    20

[...] were accompanied by two filler items but no lexical competitor. The eighteen minimal pairs were randomly divided into two halves balanced for manner of articulation. Two sets of the stimuli list were constructed based on this. For one set, one half was presented along with their competitors in the Context condition, and the other half was presented without their competitors in the No Context condition. For the other set, the experimental conditions were counterbalanced across the two halves. One of the two sets was randomly assigned to each participant for the experiment.
In addition, another nine words beginning with retroflex sibilants were included as target words. Unlike the target words with minimal pair competitors in the Context condition, the potential competitors of these words were not existing Mandarin words (e.g., /ʂan 214 tuə 214 / 閃躲 'blink' versus */san 214 tuə 214 /). The target words of this type were presented in the No Competitor condition where they appeared along with two filler items and were presented to all participants.
The structure of the experimental design and relevant examples are presented in Table 5, and a full list of the target stimuli is presented in Appendix A.
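The counterbalancing of the Context and No Context conditions described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build the lists: the item identifiers are hypothetical placeholders, the function name and random seeds are invented for the example, and the nine No Competitor items (shown to all participants) are left out.

```python
import random

MANNERS = ["fricative", "unaspirated affricate", "aspirated affricate"]


def build_lists(pairs_by_manner, seed=0):
    """Split minimal pairs into two manner-balanced halves and
    counterbalance Context / No Context across two stimulus sets."""
    rng = random.Random(seed)
    half_1, half_2 = [], []
    for manner in MANNERS:
        items = list(pairs_by_manner[manner])
        rng.shuffle(items)            # random split within each manner
        mid = len(items) // 2
        half_1 += items[:mid]
        half_2 += items[mid:]
    set_a = {"Context": half_1, "NoContext": half_2}
    set_b = {"Context": half_2, "NoContext": half_1}  # conditions swapped
    return set_a, set_b


# Hypothetical placeholder IDs: six pairs per manner, eighteen in total.
pairs = {m: [f"{m}-{i}" for i in range(1, 7)] for m in MANNERS}
set_a, set_b = build_lists(pairs)

# Each participant receives one of the two sets at random.
assigned = random.Random(1).choice([set_a, set_b])
```

Splitting within each manner class before assigning conditions guarantees that every half contains three fricative, three unaspirated affricate, and three aspirated affricate pairs, so manner of articulation is not confounded with condition in either set.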
In addition to the target items, eighteen control items beginning with the alveolar sibilants were added to the stimuli list, balanced by manner of articulation, i.e., fricative sibilants, unaspirated affricates, and aspirated affricates. The sibilants were followed by either the unrounded

Procedure
During the experiment, the participant and the experimenter sat face-to-face at a table, each with a separate laptop computer. The experimenter (listener) told the participant (speaker) that they would be playing a language game together. 7 Upon seeing three words on the screen, the participant would read the highlighted word out loud (Figure 4). Both parties would see the exact same three words, but the target word would not be highlighted on the experimenter's screen. The participant's task was to read out the highlighted target words for the experimenter, who would click on the word on her screen. Instructions for the procedure were given to the participants with some examples, and five practice trials were completed prior to starting the experiment. The participants wore an AKG C520L head-mounted microphone connected to a Zoom H4n recorder, and their production was recorded throughout the experiment. The experiment was carried out in the Production and Perception lab at National Yang Ming Chiao Tung University. It took approximately twenty minutes for participants to complete each block, and they were given a five-minute break between blocks. The target words appeared in the first, middle, and last position on the screen randomly. All trials were randomized within each block. The stimuli were presented in Chinese characters to the participants using Microsoft PowerPoint. A total of 7,000 ms was allotted for each trial.
A fixation cross appeared on the screen for 1,000 ms, followed by all three words presented horizontally for 2,000 ms; a purple square box appeared around one of the three words on the participants' screen (Figure 4). Participants were given 4,000 ms to name the target. They were told that the response time would be limited so they needed to produce the target as quickly as possible. As soon as the participants produced the words, the experimenter clicked on the corresponding word on her screen. The experimenter's performance was not recorded, but the interactive nature of the task was expected to encourage the participants to produce the target words in a careful manner.
After removing 19 poorly-recorded tokens, a total of 2,816 tokens (99.3%) was collected for analysis. The sound files were first labeled automatically using Montreal Forced Aligner (McAuliffe, Socolof, Mihuc, Wagner, & Sonderegger, 2017), and the boundaries were corrected manually. As in Experiment 1, the first spectral moment (M1) of the frication noise was computed using the multitaper spectral analysis in Matlab (Blacklock, 2004;Blacklock & Shadle, 2003).
To examine the effect of phonological neighbors on the production of the target sibilants, a mixed-effects regression model was fitted using the lmer function in the lme4 package (Bates, Maechler, Bolker, & Walker, 2015). [...] Group(merged) and condition(NoCont) were set as the baseline for analyses to directly compare these conditions with the other conditions for the target items. Apart from the main effects, a two-way interaction between condition and group was also included in the model to examine whether the two groups performed differently across the experimental conditions. In addition, the model included subject as a random intercept.
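The model structure described above can be written out roughly as follows. This is a sketch under stated assumptions rather than the authors' exact specification: the set of non-baseline condition levels (Context, No Competitor, and the control alveolar items) is inferred from the comparisons reported in the Results, and all symbols are introduced here for illustration only.

```latex
\begin{align*}
M1_{ij} ={}& \beta_0
  + \sum_{c} \beta_c \,\mathrm{Cond}^{(c)}_{ij}
  + \beta_g \,\mathrm{Distinct}_{j}
  + \sum_{c} \beta_{cg} \,\mathrm{Cond}^{(c)}_{ij}\,\mathrm{Distinct}_{j}
  + u_j + \varepsilon_{ij}, \\
& u_j \sim \mathcal{N}(0, \sigma_u^2), \qquad
  \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{align*}
```

Here $M1_{ij}$ is the spectral mean of token $i$ produced by speaker $j$, $\mathrm{Cond}^{(c)}$ is a dummy variable for each non-baseline condition $c$, $\mathrm{Distinct}_j$ equals 1 for distinct speakers and 0 for merged speakers, and $u_j$ is the by-subject random intercept. With this coding, $\beta_0$ is the expected spectral mean for merged speakers in the No Context condition, and $\beta_g$ corresponds to the main effect reported as group(distinct).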

Figure 4: The experimental setting for the interactive task.

Results
Figure 5 plots the distribution of spectral means (Hz) of the sibilants as a function of different conditions in Experiment 2 for the two groups, merged (left) versus distinct (right). The results of the read speech for those participants are summarized below as well for comparison. Notably, the merged group made clear distinctions between the categories in the interactive setting, even though they had merged the two categories completely in the reading task.
Table 6 summarizes the results of the regression model fit. For the merged speakers, the results revealed that the spectral mean of the retroflex sibilants in the baseline No Context condition (M = 5,534 Hz) was significantly lower than that of the control alveolar sibilants (M = 6,996 Hz) (p = .0001). The difference between the two categories is substantial (∆M = 1,461 Hz), especially given that these speakers had completely merged them in the reading task, suggesting that the merged speakers indeed have clear mental representations of the two discrete categories which can be realized with fully distinct articulation in a particular experimental setting. Although the spectral means in this particular condition appear to be slightly higher than those produced by distinct speakers in the same condition (M = 4,947 Hz), the difference was marginal, as shown in the main effect of group(distinct) (p = .0681). This result, once again, indicates that while merged speakers may habitually deretroflex the retroflex category, the merger does not arise from articulatory limitations. That is, merged speakers can produce the retroflexes when necessary and to a similar degree as the distinct speakers.
The spectral means of the retroflexes produced in the No Context (M = 5,534 Hz) and No Competitor conditions (M = 5,649 Hz) were comparable, and the small difference between them did not reach statistical significance (p = .1684). However, the spectral means of the retroflexes [...] Context condition (p < .0005), demonstrating the impact of the minimal pair competitors during online processing. The absolute mean difference between the two conditions was ostensibly small, i.e., 286 Hz, which is unlikely to be audible; however, the strong effect of this variable in the mixed-effects modeling suggests that speakers made small but consistent differences in response to the presence of the lexical competitors during sibilant production. A three-way distinction among the experimental conditions was not obtained in the present study, unlike the findings in Baese-Berk and Goldrick (2009), an issue that will be addressed in the discussion below.
In addition to the main effects, the model also revealed a significant two-way interaction [...] As the effect of explicit phonological neighbors was ostensibly small, we also examined individual data to ensure the context effects were independently present for each participant.

Unmerging of the sibilant merger via contrastive hyperarticulation
The results of the interactive task revealed that the merged speakers increased the spectral distance considerably between the two sibilant categories. In fact, they were able to reverse the deretroflexion and produce reasonably good retroflexes, although the degree of retroflexion was slightly smaller than that of the distinct speakers. This finding provides evidence in favor of discrete representations of the two categories in the mental lexicon of the speakers who merged the sibilants. These speakers, however, seldom produced retroflexes in their speech, as shown in their performance in Experiment 1. Recall that the reading task was designed to ensure clear speech, especially given the particular structure of the frame sentence, in which the target words were produced under narrow focus. Yet these speakers completely merged the sibilants in this formal register. When the experimental task encouraged interaction with another speaker, however, the merged speakers revealed that the distinct categories were indeed stored in their mental lexicon.
The results of this study are, on one hand, in line with previous studies of contrastive hyperarticulation (Baese-Berk & Goldrick, 2009;Wedel et al., 2018). Both merged and distinct speakers responded to the experimental condition as predicted-all the TM speakers enhanced the spectral distance in the Context condition significantly more than in the No Context condition. That is, retroflexion was stronger when the target retroflexes were produced in the presence of the alveolar counterparts, compared to when no such competitors were presented.
This result adds to the growing body of literature concerning the cognitive mechanisms of speech production (Baese-Berk & Goldrick, 2009; Fricke et al., 2016; Kirov & Wilson, 2012). Based on VOT enhancement in the presence of lexical competitors, Baese-Berk and Goldrick (2009) argued that word-specific phonetic variation is driven primarily by online processing in which a target is triggered by the activation of an 'explicit' competitor (the Context condition). The results of the present study contradict an alternative account which postulates that speakers' production systems are permanently restructured to hyperarticulate words in dense neighborhoods. Such a model would predict no differences between the Context and No Context conditions, which was again not the case in the current study. Additionally, this particular experiment, with its use of logographic Chinese characters, highlights that the general competitor effects are not driven by visual cues available in alphabetic orthographies (e.g., pat versus bat); rather, their origin is rooted in more abstract phonological knowledge.
On the other hand, the results of the present study were not entirely consistent with previous studies. Baese-Berk and Goldrick (2009) [...] This discrepancy may be attributed to the overall enhancement of the phonetic space specific to the sibilant place contrasts, i.e., spectral means (Hz), during the interactive task. In particular, the retroflexes in the No Competitor condition were fully retroflexed, as in the other conditions, and the source of this retroflexion was apparently not lexical competition. Rather, it is likely that the social interaction with the experimenter encouraged speakers to avoid the deretroflexed forms, irrespective of the lexical properties of the words. Recall that deretroflexion in TM is fairly stigmatized and used to be associated with lower levels of education and economic status (Chung, 2006; Hsiau, 1997; Su, 2008). This was partially reflected in the results of the first experiment; namely, female speakers were found to merge the sibilants less than the male speakers. Given that the magnitude of lexically-driven hyperarticulation was ostensibly small, the socially-driven enhancement of phonetic space seems to have neutralized some of the word-specific properties. In previous studies of English stop VOT, no stigma is associated with voicing/devoicing, and therefore the three-way distinction in VOT between the three conditions could be driven solely by the lexical status of the words. With the introduction of social pressures to complicate matters, the effects of lexical factors may only manifest to a limited degree.
Lastly, we address the seemingly conflicting results of our work and those of previous sociophonetic studies (Johnson & Nycz, 2015;Nycz, 2013). As discussed earlier, native speakers of Canadian English merged the vowels in the minimal pair tasks but unmerged them in spontaneous conversation to approximate the phonetic characteristics of the dialect spoken in their new speech community. Conversely, the TM speakers in our study unmerged the sibilants in the task utilizing minimal pairs. There are, however, many fundamental differences between the previous studies and the current one. For one, in terms of the nature of the task, our study utilized minimal pair competitors in an interactive linguistic setting. This methodology was motivated to see if the speakers who merged the contrast would alter their production for the sake of the interlocutor. This differs from the above-mentioned studies in which the minimal pair task was meant to induce the speakers' knowledge of the 'correct' phonetic norms by explicitly asking whether they were aware of the phonetic differences between the pairs of the words. In the present case, lexical disambiguation implicitly capitalized on the speaker's desire to facilitate communication, without the intention to tap into the speaker's phonetic norms.
The present case also differs from the English vowel mergers in that TM speakers are astutely aware of the phonetic characteristics of the contrasting sounds and the lingering social stigmatization attached to the deretroflexed forms. In Nycz's (2013) judgment task, the speakers' knowledge of the vowel contrast was marginal, at best, and the merged forms did not carry a strong stigma. In an experimental task designed to specifically induce the explicit knowledge of the contrasts, TM speakers are likely to give the 'proper' pronunciations consistent with the prescriptive grammar explicitly taught in school, trying their best to unmerge the two categories.
This critically differs from the unmerging of the low back vowels by Canadian English speakers residing in NYC, which served to express their accommodation of the phonetic characteristics of the sounds in their new speech community. Bearing in mind the similarities and differences between these studies, the interactive task presented here could be utilized to inform the structure of abstract linguistic knowledge and could be a useful addition to existing tools for sociolinguistic research.

Representations and implementation of the sibilants in the merger-in-progress
The results of the interactive task provided evidence in favor of the clear mental representations of the contrastive categories being stored in the mental lexicon of the merged speakers. What is the source of the apparent mismatch between the phonological knowledge of the merged speakers and the actual implementations of the sound categories in speech production? Why did these speakers merge the sounds if they were represented separately in their minds? More fundamentally, where does this linguistic knowledge, i.e., sound contrasts, originate? Will the variable merger progress into a more stable sound change throughout the community? Experiment 1 provided some diagnostic means regarding the status of this particular merger.
Taiwanese scholars have maintained the deeply-rooted belief that the sibilant merger arose from language contact with local substratum languages that do not have the retroflex category (Ing, 1984;Kubler, 1985). However, the present study has shown that sibilant merging has become, by and large, independent of TSM fluency as TSM-weak TM speakers merged the category as frequently as TSM-fluent bilingual speakers. This indicates that the merger is widespread among the younger generation in Taiwan, and weak retroflexion is becoming a characteristic phonetic feature of TM, departing from Mainland Mandarin. This echoes recent sociolinguistic studies that have argued that the stigma associated with TSM is dramatically declining among young TM speakers and a supra-ethnic and cross-linguistic Taiwanese identity is being formed (Hsiau, 1997;Huang, 2019;Tse, 2000). In this dynamic socio-historical context, the social meaning associated with deretroflexion may be changing-no longer is it necessarily a derogatory feature of a substratum language; it is instead becoming a feature of a unique phonetic variant that enables speakers to express a positive orientation toward the Taiwanese language. A future study is warranted to directly address the connection between social attitudes and linguistic performance of TM speakers.
Yet, despite its recent rise, TSM is a non-standard language that cannot claim as much prestige as TM in Taiwan (Sandel, 2003;Tse, 2000). The lingering stigma or negative social pressure seems to have deterred female speakers from incorporating the once-substrate feature into their speech. Formal school education and conservative cultural practices in Taiwan still impose the importance of the standard norms. While sound change led by men is less common and often limited to relatively isolated patterns (Labov, 1990;Labov, 1994), with the mixed signals co-existing in the society, it is not surprising that female speakers lag behind for this particular change.
The mixed social connotations attached to the merger seem to have brought about a fully variable and gradient merger pattern, as demonstrated in the results of the reading task. An immediate consequence of this full continuum is that speakers of the linguistic community are exposed to drastically different grammars; speakers who merge sibilants encounter unmerged patterns, and speakers who contrast the category encounter merged patterns. In particular, the exposure to different sound systems may have prevented the complete merging of the categories in the mental representations, an effect that may have been amplified by some lingering stigma (Clark et al., 2013). As proposed by the exemplar-based theories of speech perception and production, phonological representations may not be fully discrete and categorical (Johnson, 1997; Pierrehumbert, 2001). Rather, they form clouds of exemplars, which are constantly updated as speakers encounter various forms of the same category from other speakers. With many distinct speakers in the speech community, the exemplar cloud of the retroflex category would expand with the addition of somewhat outlying fully retroflexed sounds, which could increase the dispersion between the two categories. Of course, the reverse process may happen to the speakers who contrast the category, creating a fuzzier boundary between the categories. Based on the results of the present study, however, it is not clear whether those who merge sibilants have less clear category boundaries than those who do not. While it is a likely scenario, we believe it could be more adequately addressed by future comprehensive empirical research covering production, perception, and psycholinguistic processing.
Apparently, the sibilant merger in contemporary TM is variable and could be best characterized as a merger-in-progress. The outlook-whether this pattern will develop into a full-grown sound change-is unclear. At the onset of sound change, considerable phonetic variation is observed among the individuals in a speech community. Sociolinguists have argued that social factors, as well as grammar-internal factors, modulate whether certain variations eventually develop into systematic sound changes. Labov (1963, 1990) famously identified social motivations in sound change such that a particular phonetic variable may gain some social meaning and trigger imitation by other speakers. Sound changes result from the regularization of the advanced forms that spread among the speech community. In light of this, we should consider another merger-in-progress of an entirely different sound category in TM. TSM lacks certain nucleus and nasal coda sequences, e.g., /iŋ/ and /əŋ/, which are often replaced with /in/ and /ən/ by TSM-dominant speakers (Kubler, 1985; Tse, 1998).
Just like the stigma associated with deretroflexion, these TSM features were once stigmatized, conditioned by age, gender, and socioeconomic status (Kubler, 1985; Tse, 1992). However, the stigmatization of the nasal merger has declined dramatically over the years, and the nasal merger has emerged, along with other nasal merger patterns, as a common feature of TM, as verified by instrumental studies (Chiu & Lu, 2020; Fon, Hung, Huang, & Hsu, 2011; Hsu & Tse, 2007).
Compared to this relatively mature sound change, the sibilant merger seems to still be in its early stages of development. Anecdotally, TM speakers often say that they cannot recover the correct Zhuyin symbols of /n/ and /ŋ/ when typing electronic documents and mostly guess one of the two, which is rarely the case for the /s/ and /ʂ/ contrast. The apparent asymmetry in the sound change with similar historical origins seems to lie in the relative saliency of the sounds involved. Compared to final nasal places, the sibilant place contrasts in the initial position are more salient acoustically and perceptually, which would have slowed down the progression of the merger among the speakers in the community. If the stigma continues to diminish and this merger gains some positive social meaning, e.g., Taiwanese identity, it would be feasible that the sibilant merger would continue to develop into a mature sound change.

Conclusion
Variation, as an essential aspect of speech sounds, provides a window into the linguistic architecture connecting abstract mental representations stored in speakers' mental lexicon and the way they are implemented in production. The formal reading task showed that the TM sibilant merger exists on a full continuum, from a complete merger to clear contrasts, and is more prevalent among male speakers, demonstrating the impact of the social stigma associated with the merger. Moreover, though rooted in historical contact with TSM, the sibilant merger is becoming independent of TSM. Regardless of whether speakers merged the sibilants or maintained the contrast in the reading task, the TM speakers all made a clear distinction between the two categories in the interactive task, indicating that they had non-overlapping discrete representations of the contrasting sounds. The apparent dichotomy between what they have stored in the mental lexicon and what they implement in production suggests the role of variation at the onset of sound change. Speakers are exposed to the phonological systems of others, especially those retaining the contrasts, which, along with the social stigma, may have prevented the complete merging of the categories in their mental representations. This case study provides some insight into the apparent paradox of near-mergers, that speakers cannot perceive certain distinctions even though they maintain small but consistent differences in their production. Variation framed in social and linguistic dynamics may provide substantial evidence for possible categories in a given language for language learners; however, their implementation may be further modulated by social factors as well as grammar-internal or phonetic factors.