Juncture prosody across languages: Similar production but dissimilar perception

How do speakers of languages with different intonation systems produce and perceive prosodic junctures in sentences with identical structural ambiguity? Native speakers of English and of Mandarin produced potentially ambiguous sentences with a prosodic juncture either earlier in the utterance (e.g., “He gave her # dog biscuits,” “他给她 # 狗饼干”), or later (e.g., “He gave her dog # biscuits,” “他给她狗 # 饼干”). These production data showed that prosodic disambiguation is realized very similarly in the two languages, despite some differences in the degree to which individual juncture cues (e.g., pausing) were favoured. In perception experiments with a new disambiguation task, requiring speeded responses to select the correct meaning for structurally ambiguous sentences, language differences in disambiguation response time appeared: Mandarin speakers correctly disambiguated sentences with earlier juncture faster than those with later juncture, while English speakers showed the reverse. Mandarin speakers also showed higher levels of accuracy in disambiguation compared to English speakers, indicating language-specific differences in the extent to which prosodic cues are used. However, Mandarin, but not English, speakers showed a decrease in accuracy when pausing cues were removed. Thus even with high similarity in both structural ambiguity and production cues, prosodic juncture perception across languages can differ.


Introduction
In any language, a vast repository of words and an infinite range of sentences are based on just a handful of phonemes and syntactic rules. Spoken language, furthermore, is never produced in discrete chunks. Instead, it often contains ambiguity; words can appear embedded within other words, and sentences can carry more than one distinct meaning (consider "He gave her son glasses" versus "He gave her sunglasses"). Yet in everyday conversations, we all produce and understand most ambiguous utterances without much effort. How do we as talkers signal our meaning, and how do we as listeners deduce it? The present study addresses these questions by comparing how speakers of English and of Mandarin Chinese use prosodic cues to resolve syntactic ambiguity.
The use of prosody to signal phrasal junctures has been argued to be a universal feature of all languages (Bolinger, 1978). Formal language theory also suggests that prosody is itself a hierarchical structure that is organized in a similar way across languages (Beckman & Pierrehumbert, 1986). Different levels of prosodic constituents can govern the prominence relations and intonational, rhythmic, and pausing patterns in the speech signal (e.g., Beckman, 1996; Ladd, 1986; Liberman & Prince, 1977; Selkirk, 2003), and from birth, language learners can attend to the prosodic cues that correspond to these levels to detect relevant boundaries (Johnson, 2016). In this respect, prosodic cues to juncture can be seen as a skeletal foundation for integrating different aspects of speech during the early stages of sentence processing (Frazier, Carlson, & Clifton, 2006).
The production of prosodic juncture has been widely researched over the past decades, with remarkable similarity appearing across an impressive number of differing languages, in both tonal and temporal domains (see Table 1 for a non-exhaustive sample). However, it is still an empirical question whether the cross-language similarities observed in production are also relevant for perception. Certainly, overall juncture cues and the way prosodic structure is organized are highly similar, even across typologically distinct languages (e.g., English and Japanese: Liberman & Pierrehumbert, 1984), but how exactly these cues are realized in phonetic effects can vary due to differences in phonological structure. For example, domain-initial articulations of voiceless aspirated stops in English, German, and Korean are more likely to be produced with longer Voice Onset Time (VOT) (Cho & Jun, 2000; Kuzla & Ernestus, 2011; Pierrehumbert & Talkin, 1992), while voiced stops in Dutch undergo VOT shortening to enhance prevoicing (Cho & McQueen, 2005). Similarly, postboundary nasals receive greater linguopalatal contact and reduced nasal airflow in French and slower lip movements and reduced nasal energy in English (Byrd & Saltzman, 1998; Cho & Keating, 2009; Fougeron & Keating, 1996), but only durational lengthening in Tamil (Byrd, Narayanan, Kaun, & Saltzman, 1997). Thus an important challenge is to examine how universal and language-specific factors interact. By adopting a cross-language approach, the present study will examine the extent to which strategies in juncture processing are shared across languages.

Table 1 (non-exhaustive sample of juncture cues across languages). Columns: Language, Reference(s). F0 cues: preboundary F0 lowering; boundary tones; postboundary F0 reset.
(see Kohler, Peters, & Scheffers, 2017, for results from the Kiel corpus). This has implications for perception, and both ERP and behavioural data show that German listeners can only detect prosodic boundaries when pitch cues and preboundary lengthening co-occur (Holzgrefe-Lang et al., 2016). German listeners show a brain signature associated with boundary detection (a so-called Closure Positive Shift) even when pause duration is made uninformative, suggesting that pausing is not a crucial cue (e.g., Steinhauer, Alter, & Friederici, 1999; Männel & Friederici, 2009; Männel, Schipke, & Friederici, 2013). In addition, there is a developmental trend whereby German-learning infants lose their sensitivity to pausing cues after eight months of age (for a similar case in English, see Seidl & Cristià, 2008).
In Mandarin, in contrast, pausing is a more frequent cue to phrase boundaries (97.2%) than preboundary lengthening (less marked; Wang, Xu, & Zhang, 2019) or boundary-related pitch rises and falls (less predictable due to the presence of contour tones; Yu & Tao, 2005).
Mandarin listeners are correspondingly better at detecting prosodic boundaries in sentences that only contain pausing cues, compared to sentences with only preboundary lengthening and postboundary F0 reset (Yang, Shen, Li, & Yang, 2014). Whether only pausing or both pausing and other boundary-related cues are present does not affect Mandarin listeners' boundary detection, suggesting that pausing is the most reliable cue in Mandarin (for a similar case in Dutch and Swedish, see Sanderman & Collier, 1997; Horne, Strangert, & Heldner, 1995). Therefore, even when all juncture cues exist across a language pair (i.e., boundary-related pausing, pitch, and lengthening cues can all be found in German and Mandarin), listeners have developed processing preferences for different cues.
Another line of evidence for language differences comes from studies that have used sentences with ambiguous complex noun phrases and relative clauses (e.g., "Someone shot the servant of the actress who was on the balcony"), where the relative clause (RC) could be construed as modifying the NP headed by either the first noun (i.e., servant) or the second (i.e., actress).
Across languages, listeners adopt different attachment biases due to variation in default prosodic phrasing (Fodor, 1998). High attachment of the RC to the NP1 is favoured in languages where speakers tend to produce a weak boundary between NP1 and NP2 and a strong boundary before the RC (e.g., French, Spanish: Cuetos & Mitchell, 1988; Zagar, Pynte, & Rativeau, 1997). Low attachment is favoured in languages where speakers tend to place a boundary after the NP1 (e.g., English, Mandarin: Kuang, 2010; Jun, 2003). Again, these findings suggest that listeners can differ due to variation in heard input. Languages vary not only in the degree to which different juncture cues are used, but also in the location of these cues.
Interestingly, however, listeners' language-specific attachment preferences can be modulated (Fernández, 2007; Teira & Igoa, 2007) or even reversed (Fromont, Soto-Faraco, & Biau, 2017) if the location of the prosodic boundary in the speech stimuli is manipulated to favour a different interpretation. Moreover, foreign language learners can adopt native-like parsing strategies in their L2 even when these are different from their native language (L1); English learners of French, for instance, have been shown to use the appropriate French strategy (i.e., high attachment) to disambiguate RC attachment ambiguities even after learning the language for just a few semesters (Dekydtspotter, Donaldson, Edmonds, Liljestrand, & Petrush, 2008). Similarly, English learners of German and German learners of English can both produce and attend to the prosodic cues in their L2 (O'Brien, Jackson, & Gardner, 2014). There is thus certainly flexibility in the processing system for prosodic juncture.

The present study: General overview
The present study addresses the processing of prosodic juncture in a systematic manner. Although, as is clear from the above brief review, data on prosodic juncture processing has been gathered from
The two sentences differ in the direct object, and as a consequence, differ in juncture location.
In (a), the juncture (#) is realized earlier on in the utterance, giving a sentence with a feminine personal pronoun as the indirect object and a compound noun as the direct object. In (b), the same (segmentally identical) sentence is produced with a later boundary, after "baby," so that in this case "her" is functioning as a possessive determiner. This ambiguity can occur in English because "her" can be either a possessive or an indirect object. It can also occur in Mandarin because speakers ignore the alienable versus inalienable distinction in everyday speech where the possessive particle -de can be omitted (Haiman, 1983, 1985; Hsu, 2009). In fact, according to a large database of informal written and spoken Mandarin, almost half (45%) of associative noun phrases in Mandarin are produced without the particle (Chappell & Thompson, 1992).
The present study comprehensively examines juncture processing in these two languages, in both production and perception. For production, we address the following questions:
1. Do English and Mandarin speakers use the same prosodic cues to signal juncture in these near-identical structures?
2. To the extent that they do, are there differences in the degree to which specific juncture cues are deployed?
For perception, we ask:
1. Do English and Mandarin listeners differ in their perceptual processing of juncture in these same structures?
If juncture processing were to be universal across languages, then both English and Mandarin speakers would presumably use prosodic cues in the same way to process the intended meaning of the ambiguous utterances. However, since prior literature has reported some cross-language differences, our study may reveal differences even in this case where the syntactic structure is closely similar.
China and had been living in Australia for an average of two years and 10 months (range: 2 months to 9 years). We excluded additional data from one English speaker who had some disfluency in oral reading and three Mandarin speakers who grew up in Chinese-speaking communities outside of Mainland China (e.g., Taiwan). All participants were naïve to the specific purpose of the experiment.

Reading Passages
Our materials were three pairs of short reading passages written in English and Simplified Chinese (see Table 2). Each passage pair contained the same target ambiguous sentence as the last sentence in the passage. The target sentences were manipulated to have different meanings by virtue of the different storylines provided by the preceding sentences. In one version, the context would elicit production of the target ambiguous sentence with an Early Juncture, where the boundary occurred earlier in the sentence (e.g., "He gave her # dog biscuits"). In another version, the same target sentence was manipulated to elicit production of Late Juncture, where the boundary occurred later in the sentence (e.g., "He gave her dog # biscuits").
The English and Chinese reading passages were highly comparable in three important ways.
First, the English and Chinese ambiguous sentences, as well as the storylines, were identical in meaning, except for one minor deviation in translation in the second passage where the ambiguous sentence in English was "he saw her duck under the chair" and the sentence in Chinese was "他看见她猫在坐凳子底下" "he saw her cat/hide under the chair" (n.b., 猫 can mean either "cat" or "hide"). Second, both the English and Chinese sentences involved the same ambiguity. The Early Juncture sentences involved a feminine personal pronoun (i.e., her/她) before the juncture, followed by a postboundary compound noun or verb and preposition (e.g., dog biscuit/狗饼干; duck under/猫在), while in the Late Juncture sentences, the compound noun or verb became a simple noun (e.g., dog/狗; duck/猫) and the personal pronoun became a possessive determiner.
Third, we selected target sentences involving pre- or postboundary consonant onsets that were, as far as possible, highly comparable in their manner of articulation (e.g., /dɔg # bɪskəts/ versus /kou # pinkan/; /baeɪbɪ # mɪlk/ versus /jiŋɚ # naifən/).

Recording procedures
Before each session, all participants spent a few minutes reading through each of the passages by themselves to prepare. To ensure successful elicitation, the experimenter asked participants to pay careful attention to how they chose to speak in each passage, and encouraged them to speak in a way that would "really flesh out the meaning of the entire passage." Participants were also told that the study aimed to examine how speakers produce speech in everyday contexts, and they were told to try to be "as normal as possible." However, the experimenter did not give any explicit instructions to produce the relevant juncture cues in the target ambiguous sentences. Furthermore, the passages were written in plain text without any markers (such as hashtags) between phrases that could signal the designated boundaries.
After reading each passage, participants were asked a series of follow-up questions to test their comprehension of the passage (see Table 3). This was done to confirm that they understood the ambiguous sentences. If participants did not know the answers or answered incorrectly, they were encouraged to read the passage by themselves again, and were then given another chance to produce the passage. In such cases, only data from the latest recordings were included in our final analyses. Every participant produced all the reading passages. No participant had to reread a passage more than twice.
For pausing, we measured the duration of each potential juncture pause, i.e., the one that would indicate the early juncture in Early Juncture sentences (P1), and the one that would indicate the late juncture in Late Juncture sentences (P2). This was done for all sentences, so both Early Juncture and Late Juncture sentences had two measures; for the sentence "He gave her dog biscuits," for example, we measured durations between "her" and "dog" and between "dog" and "biscuits." We then compared the duration at each designated juncture across the two juncture versions. If the spectrogram showed no visible pause at any designated juncture, then a rating of zero was given.
For boundary lengthening, we compared the pre- and postboundary vowel duration of the words preceding and following the two designated junctures. There were three measures of vowel duration per sentence: of the word before the designated early juncture boundary (V1), of the word before the designated late juncture boundary (V2), and of the word after the designated late juncture boundary (V3).
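As an illustration only, the two pause measures could be derived from word-level boundary annotations roughly as follows. This is a minimal sketch, not the annotation pipeline used in the study: the `Word` record, the millisecond timestamps, and the example token are all hypothetical, and a missing pause is scored as zero, as described above.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One annotated word (hypothetical format; times in milliseconds)."""
    label: str
    start_ms: float
    end_ms: float


def pause_durations(words, early_idx, late_idx):
    """Return (P1, P2): the silent gap after the word at each juncture index.

    A gap of zero is returned when the next word starts immediately,
    mirroring the 'no visible pause' case in the text.
    """
    def gap_after(i):
        return max(0.0, words[i + 1].start_ms - words[i].end_ms)

    return gap_after(early_idx), gap_after(late_idx)


# Hypothetical Early Juncture token of "He gave her # dog biscuits".
token = [
    Word("he", 0, 120),
    Word("gave", 120, 380),
    Word("her", 380, 560),
    Word("dog", 710, 960),       # 150 ms of silence precedes "dog"
    Word("biscuits", 960, 1500),  # no silence precedes "biscuits"
]

# P1 is measured after "her" (index 2), P2 after "dog" (index 3).
p1, p2 = pause_durations(token, early_idx=2, late_idx=3)
```

Measuring both positions in every sentence, rather than only the intended juncture, is what allows each position to be compared across the Early and Late versions.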

For both versions, we measured the pause duration of the juncture locations that would indicate the designated early juncture (P1) and the designated late juncture (P2). Pre- and postboundary vowel durations (V1, V2, and V3) were also measured. As revealed in the annotations, V1 indicates the preboundary vowel duration before the designated early juncture, while V2 indicates the preboundary vowel duration before the late juncture. V3 is the postboundary vowel duration after the designated late juncture. F0 measures (mean, minimum, maximum, and range) were calculated from the three pre- and postboundary vowels. Acoustic measures of domain-initial segmental strengthening (i.e., VOT, nasal, or fricative duration) were measured wherever a postboundary word began with a consonant word onset.

General overview
The original dataset for all of the experiments in the present study is available on this open access site: https://upenn.box.com/s/n5r5ww7t47dqnywakm580axvujh03amk. Acoustic results for each prosodic cue in Experiment 1 were analyzed. For each prosodic cue, we examined whether both languages showed similar patterns of production difference between the Early and Late Juncture sentences. For F0, a small proportion of the utterances (7.25% of the English data and 2.47% of the Mandarin data) had to be excluded due to octave errors arising from creaky voice production. Using the lme4 package in R (Bates, Mächler, Bolker, & Walker, 2015, version 1.1-7), Linear Mixed Effects regression (LMER) models were constructed with the maximal random effects justified by the data (Barr, Levy, Scheepers, & Tily, 2013). Prior to the analyses, data skewness of each prosodic cue was first examined based on visual inspections of quantile-quantile plots. We observed that the pause duration after the designated early juncture (i.e., P1) formed a skewed distribution. Skewed distributions were also revealed in the preboundary vowel durations of the designated early junctures (i.e., V1), as well as in the mean F0 data for the pre- and postboundary vowels of the designated late junctures. These data were therefore transformed prior to the analyses.
It is important to note that for both our production and perception experiments, we chose not to apply log or square root transformations because some of the raw data contained zero or negative values; the zero values in the production data came from instances where participants did not produce a pause, and negative values can be found in subsequent perception experiments where participants correctly disambiguated the sentence before the sentence offset, which occurred in less than 10% of the total correct responses. A common practice is to add a constant value to the datapoints before transforming the data, so that all datapoints become non-zero positive values. However, we refrain from adopting this practice; albeit common, this practice is arbitrary and problematic because it can inflate both Type I and Type II errors as a function of the added constant value (see Feng et al., 2014 for evidence from simulation studies). All of the skewed raw data from our study were therefore transformed using the Yeo-Johnson transformation procedure (Yeo & Johnson, 2000). This was done using the yeojohnson function from the recently updated bestNormalize package in R (version 1.6.1, June, 2020; Peterson & Cavanagh, 2019).
Importantly, the Yeo-Johnson procedure is an extension of the Box-Cox inverse transformation procedure in that it handles zero and negative values as well as positive ones. Compared to other methods of data transformation (e.g., log transformation), transformations based on the Box-Cox procedure have been argued to be better suited for psycholinguistic data (Lo & Andrews, 2015), and to provide a better approximation to the normality and homoscedasticity assumptions of linear models (Balota, Aschenbrenner, & Yap, 2013). For the readers' convenience, all of the means, standard deviations, fixed effects estimates (β), and standard errors reported in the main text and figures will be raw values (e.g., in milliseconds).
For each juncture cue, we used as the starting point a baseline model that included by-participant and by-item random intercepts as well as by-participant and by-item random slopes for the effect of juncture version. Juncture version, language, and language by juncture version interaction were added as fixed effects predictors in a step-wise fashion and these models were compared with the baseline model. Model fit was determined using chi-squared tests of model log-likelihood based on the p-values of the chi-squared tests and/or differences in the Akaike Information Criterion (AIC), with the latter being more useful in cases where the complexity of the model cannot be justified by the additional variance explained (see Shaw et al., 2018). Predictors that did not yield significant improvement in the model comparisons were dropped before additional predictors were added. Leave-one-out comparisons were used to ensure that each predictor yielded a significant gain in log likelihood with all other predictors in the model. Planned comparisons following significant interaction effects were carried out using the emmeans package with Tukey-adjusted p-values (Lenth, 2020). All fixed effects were coded with mean-centred contrast codes.
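The arithmetic behind one step of such a model comparison can be sketched as follows. This is an illustration under stated assumptions rather than the lme4 workflow itself: the log-likelihoods and parameter counts are hypothetical, and the p-value formula applies only to the single-degree-of-freedom case, where the chi-squared survival function equals erfc(sqrt(x / 2)).

```python
import math


def aic(log_lik, n_params):
    """Akaike Information Criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_lik


def chi2_sf_df1(stat):
    """Upper-tail probability of a chi-squared statistic with 1 df."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))


def likelihood_ratio_test_df1(log_lik_reduced, log_lik_full):
    """Compare nested models that differ by exactly one parameter."""
    stat = 2.0 * (log_lik_full - log_lik_reduced)
    return stat, chi2_sf_df1(stat)


# Hypothetical fits: a baseline model and the same model plus one predictor.
baseline_ll, baseline_k = -510.0, 6
extended_ll, extended_k = -505.0, 7

stat, p_value = likelihood_ratio_test_df1(baseline_ll, extended_ll)
delta_aic = aic(baseline_ll, baseline_k) - aic(extended_ll, extended_k)

# The predictor is retained only if it improves fit; otherwise it is
# dropped before the next predictor is considered.
keep_predictor = p_value < 0.05
```

A positive `delta_aic` favours the extended model, so the two criteria agree in this hypothetical case.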

Prosodic cues to juncture
Early Pause Duration (P1). We first analyzed the pause duration of the designated early juncture
In addition, there was a significant interaction between language and juncture version (χ²(1) = 52.98, p < .001; β = −43.24, SE = 6.91, t = −8.93). Follow-up planned comparisons for the significant interaction were again conducted using emmeans with Tukey p-value adjustments (Lenth, 2020), and they revealed cross-language differences in the degree to which pause duration production was affected by the different juncture versions (see Figure 2). In English, speakers
Summary. Both English and Mandarin speakers produced significantly longer pauses at the relevant junctures in both early and late juncture contexts. In early juncture contexts, the pause duration was even longer in Mandarin than in English. In neither language did speakers produce longer preboundary vowels in early juncture contexts, nor did they produce significantly longer postboundary vowels (as measured in the vowels after the designated juncture) in late juncture contexts. Both English and Mandarin speakers produced preboundary duration cues, but the increase in preboundary duration was longer in Mandarin. As for F0, juncture version played no role at all, though the Mandarin speakers overall produced higher F0 than the English speakers. For segmental modification, the only effect was a longer preboundary VOT in Mandarin, which was in the unpredicted direction and hence presumably a chance effect.

Discussion
Our production data suggest that English and Mandarin speakers were alike in their use of duration to mark juncture. Particularly in the late juncture sentences, longer pauses were produced at the designated juncture. Similarly, both groups of speakers produced longer preboundary vowels before the designated juncture in Late Juncture sentences. However, there were language-specific differences in the degree to which different prosodic features were produced across the different juncture versions. Thus the difference in pause duration at the juncture position in Early Juncture sentences was greater in Mandarin, and Mandarin speakers also produced a significantly greater increase in preboundary duration in Late Juncture sentences. We therefore conclude that while English and Mandarin speakers are similar in how they produce duration cues (i.e., pausing, preboundary lengthening), there can still be cross-language differences in where they produce them; in both cases of durational differences, Mandarin speakers were more likely to mark longer pauses in early juncture contexts and longer preboundary vowels in late juncture contexts.
Note also that neither language group produced all the boundary-related cues we measured. For example, neither group produced postboundary or F0 cues. At the same time, we also observed a case where juncture cues mismatched the prosodic context: Mandarin speakers produced pre-junctural VOT lengthening with late rather than, as might have been expected, with early juncture.
Note that speakers' juncture cue choices could have been influenced by our task, which involved reading passages where the storyline already provided the referential context necessary for effective disambiguation. Reading tasks may be less likely to elicit juncture production, particularly if speakers are unaware of the ambiguity (e.g., Allbritton, McKoon, & Ratcliff, 1996). It is worth noting that all the above studies and proposals have concerned native speakers of English or other Germanic languages (e.g., Dutch, German). Here, we compared languages with very different prosodic systems, but cases where we could adopt a structured approach involving identical storylines and sentences with identical syntactic ambiguity and very similar boundary-related segments. Contrary to previous findings from reading tasks (e.g., Allbritton et al., 1996), and even without explicit instructions but with contexts that effectively made the use of prosody redundant, we found that speakers did produce prosodic cues to juncture. Importantly, the choice of cue types was similar across English and Mandarin. The speaker groups varied in the degree to which each type was engaged, however. Thus there appear to be language differences in production preferences across the different early versus late juncture versions.
We now move on to explore the cross-language patterns to be found in perception. The following two perception experiments again exploit the similar ambiguous structure of the Early and Late Juncture parses across English and Mandarin. These experiments involve a novel disambiguation task, in which participants from each language group listen to the ambiguous sentences without contextual cues, and press a button to choose the correct interpretation; both their response time and their accuracy are measured. With this method we can ascertain whether the cross-language symmetries and asymmetries we have observed in production are also reflected in listening behaviour.
There are two possibilities. On one hand, the cross-language perception results across early and late juncture versions may mimic the language-specific differences in our production data, particularly with respect to duration. As already mentioned, Mandarin speakers were more likely than English speakers to mark stronger duration cues. Such cross-language duration differences may reflect processing difficulties across languages; speakers may be more likely to mark prosodic cues to disambiguate a sentence when they deem the sentence difficult to understand (see Kraljic & Brennan, 2005). For this reason, Mandarin speakers' disambiguation perception may be affected when certain cues are rendered uninformative.
On the other hand, perception strategies may be separate from production. Unlike production, where languages may vary in the degree to which different cues are used, listeners in both English and Mandarin may still use whatever cues are available in the signal. For example, in prosodic entrainment, where listeners entrain to prosodic contours to rapidly locate an upcoming prosodic focus, we know that listeners across different languages do not use any one single cue to predict the prosodic forms of upcoming words (e.g., Cutler, 1987; Cutler & Darwin, 1981; Ip & Cutler, 2020). Likewise, the realization of a sentence's prosodic structure may be a blend of different prosodic cues (e.g., duration, F0) that all listeners may exploit (Cutler & Isard, 1980), and listeners might accordingly exploit whatever cues are available for disambiguation. This could result in no significant relationship between listening behaviour and individual disambiguation cues.
We explore these possibilities in the following two perception experiments. In Experiment 2, our first perception experiment, we ask whether English and Mandarin listeners show differences in disambiguation when all prosodic cues are present. We will also analyze if there is a relationship between individual cues and listeners' disambiguation response time and accuracy rates. In our second perception experiment, we ask whether listeners' disambiguation response time and accuracy differ across languages when a primary disambiguation cue (pause duration) is rendered uninformative.

Materials
All the materials used in the present research can be accessed from the following URL: https://upenn.box.com/s/u72whjp3buhtvwhv9b6adhmdhme7vf71. Twenty-two syntactically ambiguous experimental sentences in English and Mandarin were constructed (see Appendices A and B), each having two different interpretations resulting from different juncture placement. For each language, the sentences were recorded in their two versions by a female native speaker at a natural fast-normal rate. As in the production experiment, the two juncture versions differed in the timing and location of the boundary (i.e., Early Juncture versus Late Juncture). In Early Juncture versions, a boundary occurred earlier in the utterance (e.g., "Larry accidentally gave her # rat poison"; "刘波不小心给她 # 老鼠药吃") while in segmentally identical Late Juncture versions, a boundary occurred later in the utterance (e.g., "Larry accidentally gave her rat # poison"; "刘波不小心给她老鼠 # 药吃"). For each experimental sentence, the speaker also produced a pair of interpretation sentences that corresponded to the intended meaning of the Early and Late Juncture versions (e.g., "Larry gave rat poison to Hannah" versus "Larry gave rat poison to Hannah's pet rat Rohan"; "刘波把老鼠药给珍妮" versus "刘波把老鼠药给珍妮的宠物鼠").
Unlike the production experiment on natural usage, where speakers were not explicitly told to disambiguate the sentences, the perception experiment aims to examine whether listeners across languages can use informative juncture cues to disambiguate their understanding of sentences.
To manipulate the Early and Late Juncture versions, the English and Mandarin speakers who recorded the stimuli for the perception study were made aware of the ambiguity and were asked to produce each version of the experimental sentences in a way that would match its corresponding interpretation sentence. As in the production experiment, however, they were not given any explicit instruction on how they should accomplish the disambiguation. In both languages, the Early and Late Juncture versions for each stimulus sentence pair were segmentally identical, and the two language sets were highly comparable in terms of their syntactic ambiguity (see Appendix C for interlinear morpheme-by-morpheme glosses of the Mandarin sentences).
In each language, 12 additional filler sentences and their corresponding pair of interpretation sentences were also recorded. These filler sentences involved other types of ambiguity that were either easier than the experimental sentences (e.g., homonyms) or more difficult (e.g., sentences with relative clause attachment ambiguity). There were two counterbalanced experimental conditions, each containing one juncture version of each of the 22 experimental sentences, plus the additional 12 filler sentences. The experimental and filler items were pseudo-randomized (such that the counterbalanced orders contained no more than two consecutive instances of an experimental sentence).
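One way to build lists of this kind is sketched below. It is a hypothetical construction and not the procedure actually used to prepare the E-Prime lists: the alternating assignment of juncture versions to the two counterbalanced conditions is an assumption, as are all function and label names. Experimental items are distributed into the slots around the fillers so that no run of experimental sentences exceeds two.

```python
import random


def counterbalanced_lists(n_items=22):
    """Assumed scheme: condition A alternates Early/Late; B is the complement."""
    list_a = [(i, "Early" if i % 2 == 0 else "Late") for i in range(n_items)]
    list_b = [(i, "Late" if v == "Early" else "Early") for i, v in list_a]
    return list_a, list_b


def pseudo_randomize(experimental, fillers, max_run=2, rng=None):
    """Interleave items so that at most `max_run` experimental items are adjacent."""
    rng = rng or random.Random()
    exp, fil = list(experimental), list(fillers)
    rng.shuffle(exp)
    rng.shuffle(fil)
    n_slots = len(fil) + 1  # positions before, between, and after the fillers
    if len(exp) > max_run * n_slots:
        raise ValueError("too few fillers for the run-length constraint")
    # Start with every slot full, then randomly empty the surplus capacity.
    slots = [max_run] * n_slots
    for _ in range(max_run * n_slots - len(exp)):
        open_slots = [s for s in range(n_slots) if slots[s] > 0]
        slots[rng.choice(open_slots)] -= 1
    order, pos = [], 0
    for s, size in enumerate(slots):
        order.extend(exp[pos:pos + size])
        pos += size
        if s < len(fil):
            order.append(fil[s])
    return order


def longest_experimental_run(order):
    """Length of the longest streak of items tagged 'exp' in position 0."""
    longest = current = 0
    for item in order:
        current = current + 1 if item[0] == "exp" else 0
        longest = max(longest, current)
    return longest


list_a, list_b = counterbalanced_lists(22)
exp_items = [("exp", i, version) for i, version in list_a]
filler_items = [("filler", j) for j in range(12)]
trial_order = pseudo_randomize(exp_items, filler_items, rng=random.Random(0))
```

With 12 fillers there are 13 slots of capacity two, so the 22 experimental sentences always fit; the slot-filling approach is used here instead of reshuffling until the constraint holds because very few random orders of 34 items happen to satisfy it.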

Procedures
The disambiguation task was administered using E-Prime software (Schneider, Eschman, & Zuccolotto, 2002) on a laptop computer and a Chronos® USB response device for button pressing. All instructions were given in the form of a pre-recorded voiceover script made by the same speaker who produced the stimuli. Written instructions were also displayed on the screen as the voiceover instructions were being played (see Appendices D and E). Because of the greater distinction between formal/written and spoken language in Mandarin, we added an extra line in our Mandarin instructions to explicitly inform participants that they would hear everyday normal sentences (and thus should not expect sentences in a formal/written style). All participants were given three practice trials and feedback before starting the actual experiment.
Note that we did not give instructions on how to disambiguate the sentences.
From the start of each trial, participants saw on their screen two interpretation sentences that corresponded to the left and right buttons in front of them. Participants heard the test sentences, and were asked to "pay careful attention to the meaning of each sentence" and to choose for each sentence its intended meaning by pressing the button that matched the correct interpretation sentence. Up to five seconds were available for pressing the button before the next trial commenced; the interpretation sentences remained on the screen for the full five seconds after the offset of the sentence. Nevertheless, participants were instructed to choose the correct button "as soon as they understood the sentence" and were told that they could press the button at any time during the trial while the sentence was being played. They were further informed that they would be tested on both their accuracy and on their speed of comprehension. Whether the correct button was the left or right button was counterbalanced across participants.
We recorded participants' response times and number of correct responses. Participants who made errors of disambiguation on one-third or more of the experimental sentences were excluded from the analysis. An absence of button press was also considered an 'incorrect response,' because a failure to press the button, even during the five seconds after the sentence was finished, was interpreted as indicating that the participant was still trying to process the meaning of the ambiguous sentence. No participant failed to respond on more than two occasions during the experimental trials.
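The exclusion rule just described can be stated compactly in code. The following is a minimal sketch, assuming each response is stored as "correct", "incorrect", or None for a missing button press; this representation and the function names are illustrative and not part of the original analysis scripts. Exact fractions are used so that the one-third threshold is not affected by floating-point rounding.

```python
from fractions import Fraction


def error_rate(responses):
    """Proportion of errors; a missing button press (None) counts as an error."""
    errors = sum(1 for r in responses if r != "correct")
    return Fraction(errors, len(responses))


def is_excluded(responses, threshold=Fraction(1, 3)):
    """True if errors make up one-third or more of the experimental sentences."""
    return error_rate(responses) >= threshold
```

For a hypothetical participant with 22 experimental trials, `is_excluded` compares the number of errors out of 22 against exactly one-third, counting any unanswered trial as an error.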
At the end, all participants completed a recognition test in which they were presented with a list of 22 sentences and were asked to judge whether each of these sentences was from the experiment (see Appendices F and G). Half of these test sentences were indeed from the experiment. All participants scored at least 14 out of 22 (64%) on the recognition test (in English, M = 88.64%, SD = 9.14%, range: 64-100%; in Mandarin, M = 90.68%, SD = 8.17%, range: 73-100%). English and Mandarin listeners did not differ significantly in their recognition scores.

General Overview
More than 90% of participants' correct responses in both languages were made by pressing the button after the offset of the sentences. Therefore, we measured response time (RT) as the duration between experimental sentence offset and participant button presses. Only data for correct disambiguations were included in our analyses.
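As a minimal sketch of this RT measure, the Python code below computes the interval between sentence offset and button press and keeps only correct trials. The `Trial` record and its field names are hypothetical, and both timestamps are assumed to be measured from the same trial-onset clock. A negative value would indicate a press made before the sentence had finished; how such presses were treated is not specified here, so the sketch simply returns them unchanged.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    sentence_offset_ms: float        # time at which the spoken sentence ended
    press_time_ms: Optional[float]   # None if no button was pressed in time
    correct: bool                    # whether the chosen interpretation was right


def response_time(trial: Trial) -> Optional[float]:
    """RT from sentence offset to button press, for correct responses only."""
    if trial.press_time_ms is None or not trial.correct:
        return None
    return trial.press_time_ms - trial.sentence_offset_ms


def analysable_rts(trials: List[Trial]) -> List[float]:
    """Collect the RTs that would enter the analysis (correct trials only)."""
    rts = (response_time(t) for t in trials)
    return [rt for rt in rts if rt is not None]
```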
The main aims of our statistical analyses were to investigate (1) whether RT differed across juncture version (i.e., Early versus Late Juncture) and (2) whether the pattern of this RT difference varied across languages and experimental trials. To address (2), we performed statistical tests on the RT data to examine whether there was an interaction between language groups and juncture version, and to address (1), we performed separate analyses for the English and Mandarin datasets.
We also performed acoustic analyses of all of the experimental sentences to examine whether there were differences in duration cues across the Early and Late Juncture sentences. We measured the following prosodic disambiguation cues in Praat (Boersma & Weenink, 2018): (1) pausing and (2) pre- and postboundary vowel lengthening. As in the previous production study, we measured the pause duration and boundary lengthening of both the designated early and late juncture locations in all sentences. All analyses were conducted using mixed effects models.

Response Time
As in the production experiment, we constructed LMER models to obtain the best-fitting model predicting listeners' RT. Visual inspection of quantile-quantile plots revealed that the raw RT data were skewed, so we transformed the data using the Yeo-Johnson procedure.
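For readers unfamiliar with the Yeo-Johnson transformation, a minimal Python sketch follows. The transformation itself is the standard one; the way the parameter lambda is chosen here (a grid search maximising the normal profile log-likelihood, with constant terms dropped) is an assumption for illustration, since the estimation method is not described in this section. The example values at the end are invented and are not data from the experiment.

```python
import math


def yeo_johnson(y, lam):
    """Yeo-Johnson transform of a single value; defined for all real y."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)
    return -(((1.0 - y) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


def log_likelihood(values, lam):
    """Normal profile log-likelihood of lambda (additive constants omitted)."""
    n = len(values)
    transformed = [yeo_johnson(v, lam) for v in values]
    mean = sum(transformed) / n
    variance = sum((t - mean) ** 2 for t in transformed) / n
    jacobian = (lam - 1.0) * sum(
        math.copysign(math.log1p(abs(v)), v) for v in values
    )
    return -0.5 * n * math.log(variance) + jacobian


def best_lambda(values, grid=None):
    """Pick the lambda on a coarse grid that maximises the log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]  # assumed search range: -2 to 2
    return max(grid, key=lambda lam: log_likelihood(values, lam))


# Invented, right-skewed RTs in milliseconds, for illustration only.
example_rts = [200, 250, 300, 320, 400, 450, 600, 900, 1500, 3000]
lam = best_lambda(example_rts)
transformed_rts = [yeo_johnson(rt, lam) for rt in example_rts]
```

Unlike the Box-Cox transformation, Yeo-Johnson is defined for zero and negative values. With lambda equal to 1 it leaves the data unchanged, and values below 1 compress the long right tail typical of RT distributions.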
Leave-one-out model comparisons were conducted to ensure that each predictor yielded a significant gain in log likelihood with all other predictors in the model. Predictors were added in a step-wise fashion and all fixed effects (i.e., juncture version, language) were coded with mean- As in Experiment 1, we followed up the significant interaction using emmeans with Tukey adjustments for p-values (Lenth, 2020). Juncture version influenced RT in both English and Mandarin, but there was an inverse interaction between language and juncture version (see

Accuracy
On average, Mandarin-speaking participants made 3.3 incorrect disambiguation responses (SD = 1.82) across the 22 experimental sentences, whereas English-speaking participants averaged 5.6 (SD = 2.1; see Table 4). We used the glmer function from the lme4 package (Bates et al., 2015, version 1.1-7) to build Generalized Linear Mixed-Effects Regression (GLMER) models to examine whether accuracy differed across languages and juncture versions. GLMER models were used because they enabled us to assess the influence of language background and juncture version on accuracy as a categorical dependent measure (i.e., a binary outcome) while also accounting for individual patterns across participants and sentence items (Bolker et al., 2008). The accuracy data were coded as either "1" for correct disambiguation or "2" for incorrect responses. Specifically, we were interested in whether early versus late junctures affected listeners' ability to correctly disambiguate the sentences, and whether this effect varied across the languages. As in the RT analyses, by-participant and by-item random slopes for the effect of juncture version, as well as by-participant and by-item random intercepts, were added as random effects, and juncture version, language, and the language by juncture version interaction were entered as fixed effects.
To complement these findings, we also examined the error rates of the English and Mandarin participants who were excluded on the basis of their incorrect responses. In total, we excluded seven English listeners and two Mandarin listeners who failed to correctly disambiguate at least

Stimuli: Acoustic analyses
We also conducted acoustic analyses of the stimuli sentences in Praat (

Discussion
In line with the cross-language asymmetry observed in English and Mandarin speakers' duration production, our perception experiment revealed a similar cross-language difference in RT pattern across the different juncture versions. In English, listeners were significantly faster at disambiguating Late Juncture sentences than Early Juncture sentences. In contrast, Mandarin listeners were faster at disambiguating Early Juncture sentences. The English and Mandarin listeners also differed in interpretation accuracy, with more errors made by English listeners. The results therefore indicate (1) language differences in listeners' sentence processing as a function of different juncture context and (2) language differences in the extent to which listeners use prosody at all to correctly disambiguate an ambiguous sentence.
The perception results can be interpreted in light of the production differences in duration. In our production experiment, we observed that Mandarin speakers tend to produce sentences with longer preboundary lengthening and pauses. In our analyses, only Mandarin listeners were able to use preboundary duration cues to resolve the ambiguous sentence; the preboundary vowel duration of the designated early juncture, and to a weaker extent its pause duration, significantly improved Mandarin listeners' accuracy rates. In English, although response time was related to the duration of the early juncture pause, there was no relation between listeners' disambiguation accuracy and any of the individual disambiguation cues. These findings thus suggest that listeners do not necessarily exploit all available cues for disambiguation; cues are weighted differently across languages, and listeners of different languages vary in the cues they rely on to correctly disambiguate a sentence.
In light of this, we might expect that Mandarin listeners would pay particular attention to boundary-related duration cues. In a second perception experiment, therefore, we tested whether native English and Mandarin speakers would show the same RT pattern and accuracy scores when pause duration was rendered uninformative. If Mandarin listeners assign more weight to pausing than English listeners do, then their accuracy and RT performance should be affected to a greater degree by the removal of the pausing cue. Given that pre- and postboundary lengthening cues were still preserved, a lack of change in disambiguation performance would indicate that Mandarin listeners could attend to boundary-related lengthening to disambiguate the sentences.
Likewise, the English listeners' disambiguation performance would be unaffected if they do not rely on pause duration as a cue to prosodic juncture.
for an average of 5 years and 2 months (SD = 7.32, range: 41 days to 24 years and 9 months).

Experiment 3: Perception (with pause duration removed)
We additionally excluded data from four English listeners and one Mandarin listener who failed to correctly disambiguate at least 64% of the experimental sentences.
All participants were university students at the time of the experiment and reported no hearing or reading impairment.

Materials and procedures
The procedures were identical to the first perception experiment, except that the pauses were for the Mandarin group. These recognition scores did not statistically differ from those in the first perception experiment.

Response time
As in the first perception experiment, the raw RT data were transformed using the Yeo-Johnson procedure, and LMER models were constructed to examine the role of language, juncture version, and the language by juncture version interaction. Again, we used a baseline model with by-participant and by-item random intercepts and by-participant and by-item random slopes for the effect of juncture version. In our model comparisons, there was no significant main effect of language

Accuracy
As in the first perception experiment, we constructed GLMER models to examine listeners' accuracy scores. This time, we found that adding language as a predictor did not significantly

Discussion
Mandarin listeners' disambiguation accuracy was significantly lower when the pausing cue was rendered uninformative. The English listeners, however, showed no significant increase in errors.
Thus removal of pausing cues affected the Mandarin listeners' performance accuracy, but had little effect on the English listeners. However, the RT results were non-significant, presumably as a result of the (unavoidably) lower number of participants. Nevertheless, it is noteworthy that the pattern of RT difference between the two juncture versions remained unchanged.

General discussion
The present experiments provide new findings on how native speakers of two phonologically very different languages can differ in their use of prosody in juncture processing. Even when utterances involve the very same structural ambiguity, and even when users of these two languages choose generally the same cues to signal a particular reading, the precise deployment of the disambiguating prosody may vary in several ways. In production, therefore, speakers can differ in the degree to which they enhance the various juncture features. In perception, likewise, listeners' disambiguation accuracy and RT patterns can vary across languages for each prosodic effect.
Consistent with these latter findings, our English and Mandarin speakers produced juncture cues even though the referential context we provided had rendered the use of prosody in principle unnecessary. Language-specific differences appeared in the degree to which speakers would optionally deploy the different cues to mark juncture: The temporal cues (e.g., pause duration and boundary-related vowel lengthening) were, overall, used to a greater extent by the Mandarin
Why were there language differences in RT across the Early and Late Juncture contexts? One reason may well be the frequency of these ambiguous structures across languages, in conjunction with findings from work using structural priming; from the latter it has long been known that multiple auditory presentations of sentences with a particular syntactic structure can facilitate processing of subsequent sentences with the same structure (e.g., Carey, Mehler, & Bever, 1970; Mehler & Carey, 1967). Recall that, as noted in the introduction, the Early Juncture structure is more frequent in Mandarin than the Late Juncture structure. Interpretation of the latter in Mandarin is only possible because speakers can omit the genitive particle -de. However, whether -de is omitted or not depends on a number of factors. Based on a large database of informal written and spoken Mandarin, Chappell and Thompson (1992) identified a number of reasons for the omission. First, Chappell and Thompson showed that -de omitted sentences are almost as frequent (at 45%) as -de included ones (55%), and inalienable possessions (e.g., body parts) are not always associated with -de omitted sentences. Whether speakers choose to omit or include -de depends on the conceptual closeness between the possessor and possessee in a given situation (e.g., economic motivation; see also Haiman, 1983). The use of the particle also varies along a continuum with respect to the inherent semantics of the subject and referent.
Likewise, there are also pragmatic factors, including the information structure of a conversation and whether the object attached to the optional particle is topicalized (see also Hsu, 2009). For example, -de is more likely to be omitted in the case of given referents; in everyday conversations, once an association between possessor and possessee is established, there is no need to signal it again through the use of -de.
At the same time, -de constructions are syntactically heavy in that they have various functions beyond indicating possession, so processing sentences that involve such a particle (or even the lack of it) may incur extra processing costs. The Late Juncture sentence structure in Mandarin is thus less straightforward than the Early Juncture structure. In our perception experiments, we removed contextual bias by providing listeners with the two possible interpretations before they heard the test sentence; nonetheless, listeners might still be better at accessing a given version of an ambiguous sentence if the use of its structure for a given interpretation is easier to process.
Why were there language differences in listeners' sensitivity to different juncture cues? Again, we suggest that the frequency and strength of a given cue in production is likely to have influenced whether listeners would use it in perception. English and Mandarin speakers differed in their production preferences; our production data showed Mandarin speakers producing greater increases in pause duration than English speakers. Thus again, as a result of their native language experience, Mandarin listeners would be more used to attending to pausing as a juncture cue than English listeners would. This experience is the most likely source of the better disambiguation accuracy in Mandarin. The results of our second perception experiment (Experiment 3), showing that disambiguation accuracy in Mandarin, but not in English, was significantly degraded when pausing cues were absent, support such an interpretation.
Note that our findings resemble previous data of Yang et al. (2014) in which Mandarin listeners showed better Intonational Phrase boundary detection when only pausing was preserved, compared to conditions where preboundary lengthening or F0 cues were present. Yang and colleagues focused on a more conscious form of boundary detection by adopting a judgement task where listeners had to respond "Yes" or "No" when asked if they heard a boundary. We have extended their findings by showing that Mandarin listeners relied more on pausing under conditions where prosody was the only source of disambiguation information.
Language-specific preference for a given prosodic cue to boundary placement is not the whole story, however; the details of a cue's realization are also part of the native strategy. There is extensive evidence that even when the same cues (e.g., VOT, domain-initial strengthening) are used across languages, the realization may vary (e.g., Byrd et al., 1997; Cho & McQueen, 2005; Kuzla & Ernestus, 2011; Pierrehumbert & Talkin, 1992). In English, both our perceptual findings and existing ERP data (e.g., Aasland & Baum, 2003) indicate that listeners are less reliant on pausing than on other cues. Interestingly, in language development, English-learning infants undergo a developmental change in cue weighting, from attending to all prosodic boundary cues (i.e., pause, pitch, and vowel duration) at three months, to only pitch cues at six months of age (Seidl, 2007; Seidl & Cristià, 2008; see Männel et al., 2013 for similar findings in German).
It is always possible that languages might use potential cues to juncture less if such cues compete with other functions of the same suprasegmental dimension, such as making lexical distinctions. In this respect, English and Mandarin differ. Mandarin has only 29 phonemes (seven vowels, 22 consonants), while General Australian English has 43 (24 consonants and 19 vowels: pp. xv and xvii respectively in Cox, 2012). At least 12 of the 22 Mandarin consonants involve a phonemic distinction based on duration (potentially in combination with aspiration, as for VOT); this number is double that for English. Mandarin also has lexical tones, differing in duration, F0, and amplitude. The tones alone may diminish the likelihood of suprasegmental cues being useful for non-lexical purposes (for a similar suggestion, see Pierrehumbert, 1999). Probably the strongest asymmetry between English and Mandarin, however, is lexical ambiguity, which, though common in all languages, is particularly rampant in Mandarin due to the small phoneme inventory and severe restrictions on syllable structure (just as an example: /ʂu1/ with high level tone can represent at least 40 words!). Disambiguation is thus more a feature of Mandarin speech processing than of English, and inserting a pause between words or phrases is a way of disambiguating sequences without altering either F0 or segmental durations. Consistent with this, our native Mandarin listeners showed higher rates of disambiguation accuracy in their native language compared to the English participants. In the disambiguation task used in our perception experiments, only prosody could disambiguate the heard sentences. The fact that there were more interpretation errors in English than in Mandarin indicates that English listeners may be less likely overall to rely on prosodic juncture cues for disambiguation.

Conclusion
Our findings demonstrate that identical structural ambiguity does not entail identical processing.
Cues chosen in production can be similar in type but nevertheless different in degree, and perceptual weighting of cues can also differ. All humans may use prosody to segment speech streams into meaningful units, but even when the prosodic cues and the structural options are the same, the ease and the degree to which speakers and listeners use those cues in disambiguation will still show cross-language variability.