Prosody, clause typing, and wh-in-situ: Evidence from Mandarin

This paper examines the use of prosody for marking upcoming linguistic material in speech production and for anticipating it in speech perception. More specifically, it examines whether, in the absence of any overt morphosyntactic cues at the beginning of an utterance, speakers use prosodic means to mark the clause type (declarative or wh-question) and whether listeners use these prosodic cues to anticipate the clause type. We report the results of a production experiment and an audio gating experiment. The results of the production experiment show that speakers of Mandarin differentiate declaratives from wh-questions right from the onset of the clause by means of duration, F0, and intensity. The results of the audio gating experiment demonstrate that listeners use prosody to anticipate the clause type intended by the speaker.


Introduction
In everyday activities like doing sports, dancing, or playing music, people seem to be able to predict events or the actions of others and respond accordingly. For instance, most of the time, a good goalkeeper is able to foresee the path of the ball and position him/herself so as to keep the ball away from the goal line. The mechanism of anticipation in language is not much different, except that instead of predicting the path of the ball, listeners or readers use a variety of syntactic, morphosyntactic, discourse-semantic, and prosodic cues to predict upcoming linguistic material before it is read or heard (see among others Clark, 2013; Friston, 2010). For instance, in an eye-tracking study that used the Visual World Paradigm, Altmann and Kamide (1999) showed that listeners can predict the complement of a verb on the basis of its selectional properties, while Staub and Clifton (2006) showed in an eye-tracking reading study that readers can predict an upcoming 'or-clause' if they have read an 'either-clause' before. Similarly, Kamide, Scheepers, and Altmann (2003) offer evidence for anticipation effects based on case marking, Altmann and Kamide (2007) for anticipation of tense, and Arai and Keller (2013) show that verb subcategorization properties result in anticipation of the complement of the verb. In this discussion, we focus on cases of anticipation that are triggered by prosody rather than by morphosyntactic or lexical cues. A large body of literature shows that prosody can be used for anticipation of properties of the upcoming linguistic material in various ways. Ito and Speer (2008), for instance, demonstrate that contrastive accents are used by listeners to anticipate the reference of noun phrases, and Brown, Salverda, Dilley, and Tanenhaus (2011) show that lexical segmentation can be anticipated by prosodic properties of the preceding context. In this paper, we focus on the question of whether prosody can be used in anticipating the clause type of a sentence (declarative or wh-question) in the absence of any morphosyntactic cues.
In the literature, a number of production and perception studies focus on the prosodic differences between ordinary declaratives, which are used for making a statement, and declarative questions,1 which have exactly the same word order but which are used as questions. In English, declarative questions normally have a rising intonation, while an ordinary declarative is characterized by a falling intonation (see for instance Gunlogson, 2002). Recently, it has been shown that ordinary declaratives and declarative questions differ not only with respect to their boundary tone and their nuclear pitch accent, but also with respect to the pre-nuclear area (see for instance Haan, van Heuven, Pacilly, & van Bezooijen, 1997 for Dutch; Petrone & D'Imperio, 2008 for Italian; Petrone & Niebuhr, 2014 for German; Baltazani, Kainada, Lengeris, & Nikolaidis, 2015 for Greek; and Cooper, 2015 for Welsh). Moreover, perception studies confirm that differences between declarative questions and ordinary declaratives can be perceived before hearing the boundary tone of the tune, which in many languages is considered to be the main prosodic cue for a declarative question (see for instance van Heuven & Haan, 2002 for Dutch; Petrone & Niebuhr, 2014 for Northern Standard German; and Face, 2007 for Castilian Spanish). On the other hand, Heeren, Bibyk, Gunlogson, and Tanenhaus (2015) did not find any strong evidence for the use of early prosodic cues by listeners to distinguish ordinary declaratives from declarative questions; the listeners mainly relied on the boundary tone (H% or L%) for differentiating the two types. As indicated by the authors, the pre-nuclear area was very short and the stimuli, which were of the form 'Got a + disyllabic NP,' possibly in combination with the game set-up of the experiment, resulted in a bias for a question interpretation. This bias and the short length of the pre-nuclear area may have overridden weak cues at the beginning of the sentence.
The aforementioned studies have investigated the use of prosodic cues by speakers and listeners for distinguishing ordinary declarative sentences and questions with a declarative word order. These questions differ from interrogative yes-no questions, which are syntactically marked as interrogatives, e.g., in English by means of inversion of the verb and the subject. Gunlogson (2002) argues that declarative questions are a special use of declaratives rather than a special type of interrogative. Contrary to interrogative yes-no questions, declarative questions have declarative syntax, and they convey commitment to their propositional content. The commitment to their propositional content introduces the bias that distinguishes declarative questions from neutral interrogative yes-no questions. Gunlogson claims that the rise that normally characterizes the intonation of declarative questions attributes this commitment to the addressee rather than to the speaker, while a falling intonation signals that the commitment comes from the speaker. Addressee commitment turns out to depend on strong contextual cues that make this commitment plausible. For instance, a declarative question of the form 'It is raining?' can only be used in a context in which there is a mutual understanding that the speaker has reasons to believe that it is raining and that the addressee is in a position to confirm this. To conclude, declarative questions impose strong conditions on the context in which they are uttered, and the prosodic cues that discriminate ordinary declaratives and declarative questions reflect pragmatic differences rather than differences in syntactic clause type.
In this paper, we compare declaratives with wh-in-situ questions. As in the case of declarative questions, wh-in-situ questions are identical to declarative sentences at the beginning of the sentence string (if we do not use subject question words or pre-verbal wh-adverbials), but contrary to declarative questions, they are interrogatives from a syntactic point of view and they denote questions rather than propositions. Mandarin is a genuine wh-in-situ language (Huang, 1982; Cheng, 1991) in which the wh-phrase occupies the same linear position as its non-interrogative counterpart, as shown in (1) and (2), where shénme 'what' and júzi 'oranges' appear in the post-verbal object position.
In (1) and (2), a wh-question and its declarative counterpart are string identical except for the wh-word (shénme 'what') and its non-wh counterpart (júzi 'orange' in [2]). This means that in Mandarin there are no overt morphosyntactic cues preceding the wh-word that can be used to distinguish a wh-question from a corresponding declarative. Mandarin wh-questions and their declarative counterparts thus offer unique minimal pairs for us to investigate prosodic cues for distinguishing declarative and interrogative sentences.2 In particular, in sentences such as (1) and (2), we investigate the region before the wh-word for prosodic cues. The research questions addressed in this paper are the following:
(3) Do speakers of Mandarin prosodically mark the clause type (wh-question versus declarative) in the pre-wh-word contour?
(4) If so, do listeners use this prosodic marking to anticipate the clause type before reaching the wh-word or its non-wh counterpart?
The paper is organized as follows. In Section 2, we present the results of a production experiment (Experiment 1) that we conducted to address the question in (3). Section 3 reports the results of an audio gating experiment (Experiment 2) that was run to tackle the question in (4). Section 4 discusses the results and draws conclusions about the prosodic marking of the wh- versus declarative clause types in Mandarin and about the use of prosodic cues for clause type anticipation, as well as the interaction between tones and intonation.
2 A reviewer suggests that we also compare the wh-in-situ questions with declarative yes-no questions directly, in order to exclude the possibility that the prosodic properties of wh-in-situ questions and declarative questions are in certain respects similar and as such could reflect pragmatic similarity between the two sentence types. In such a situation, the observed differences between declarative statements and wh-in-situ questions might be pragmatic in nature as well. Even though we acknowledge that this is a potential limitation of our study, we show below that the types of prosodic cues that we found for Mandarin wh-in-situ questions differ from the ones that are described in the literature for Mandarin declarative questions. Moreover, we suspect that the strong contextual conditions on the use of declarative questions (see above) can lead to confounds when these are compared with wh-questions.

The phonetics of Mandarin wh-questions and declaratives
Previous studies on the intonation of questions in Mandarin have mainly focused on various types of yes-no questions. Crucially, these studies use 'syntactically unmarked yes-no questions' (i.e., declarative questions, see above), yes-no questions with the yes-no particle ma, and in some cases the semantically neutral A-not-A question is also included.3 Most of the studies report a generally higher pitch level in declarative (yes-no) questions than in statements (De Francis, 1963; Ho, 1977; Shen, 1990), except Tsao (1967), who claimed that yes-no questions do not differ from the corresponding declaratives in terms of pitch level. However, as Liu (2009) notes, these previous studies did not control for sentence focus and some of them based their claims on impressionistic inspection of data. Here we briefly review two recent studies, both of which include wh-questions as well.
The first one is Lee (2005), who investigated three types of questions (declarative questions, ma-questions, and wh-questions) in contrast with ordinary declaratives. It is reported that with respect to the declarative questions and ma-questions, both the expansion of the pitch range and raising of the overall pitch are manifested towards utterance-final position. Concerning wh-questions, the overall pitch starts higher than that of statements. Importantly, Lee reported that in longer utterances, neither declarative questions nor ma-questions differ from statements in pitch range in the early portion of the utterances. In the case of wh-questions, the wh-phrase itself has an expanded pitch range, possibly associated with narrow focus.
Liu's (2009) study also included the three types of questions reported in Lee (2005), but the focus of the questions is controlled (using initial, medial, final, or neutral focus). Liu, similar to many other scholars in previous studies, constructs her stimuli using only one specific tone for the whole sentence. This choice definitely influences the naturalness of the utterances, as Shen (1990) also reported. Liu found that in the case of neutral focus, the difference between statements and declarative questions is mainly manifested in the final word, which confirms what is claimed in previous literature. Furthermore, pitch raising by question intonation is greater in yes-no questions (both declarative questions and ma-questions) than in wh-questions. She suggests that this may be related to a separation of incredulity and interrogation.4 Given the fact that both declarative questions and ma-questions are biased in the sense of Gunlogson (2002), this might also be a reflection of the specific pragmatic properties of biased questions. Finally, it should be noted that due to the study design of investigating focus, including initial focus, some wh-questions have initial wh-phrases.
In short, the aforementioned studies provide evidence for the claim that declaratives differ from biased yes-no questions in terms of prosody; the pitch level of these yes-no questions is raised, in particular towards the end of the utterance. However, due to the biased nature of these yes-no questions, it could be the case that the higher pitch level is associated with the pragmatic interpretation of these questions. As for wh-questions, there is some indication that wh-questions also have a higher overall pitch, with the wh-phrase getting the expanded pitch range. Nonetheless, the wh-questions investigated are either followed by the particle ne (Lee, 2005),5 or they sometimes start with the wh-phrase, making it hard to draw conclusions concerning anticipation or recognition of wh-questions. The production experiment reported in this paper is the first study that zooms in on the prosodic properties of wh-questions without ne in Mandarin. In particular, the study reported here investigates the prosodic properties of the part of the wh-questions preceding the wh-phrase, making it possible for us to evaluate for the first time whether or not there are specific prosodic properties of the pre-wh region in wh-in-situ questions that listeners could use to distinguish these questions from declaratives before the wh-expression is reached.

Stimuli
As the central question of our study concerns whether the prosodic properties in the pre-wh-word contour in a wh-question differ from the prosodic properties of a string-identical declarative (except for the wh-word slot), we constructed two clause types, namely, wh-questions and declaratives. Each stimulus consisted of 11 syllables. Furthermore, we kept the tonal composition of the stimuli constant across items and clause types for all constituents, except for the verb. For the verb (monosyllabic) we included all four tones to ensure that we have more natural stimuli. In total we constructed 56 stimuli (7 exemplars × 4 verb tones × 2 clause types). An example set of stimuli is given in (5): (5a) is a wh-question, while (5b) is the declarative counterpart, and (5c) gives the general structure of the stimuli and indicates the mapping of the first six syllables and their tones.
(5) c. Subject (S1-S2): T2 T1; adverb (S3-S4): T2 T1; verb + le (S5-S6): T1/T2/T3/T4-T0
For the subject and the indirect object, we used common Chinese disyllabic proper names. To introduce the indirect object we used the verb gěi 'give', which is a typical element for introducing an indirect object; in this case it functions like a preposition, and it has a low tone (T3). We also included disyllabic temporal adverbs. In the case of the direct object, the first syllable of the wh-word has a rising tone (T2), while the second syllable has a neutral tone (T0), and the same tonal pattern was used for the corresponding non-question counterpart (e.g., tízi 'grapes'). Target stimuli were intermingled with fillers, and a pseudo-randomized list of stimuli was prepared for every participant using Praat (Boersma & Weenink, 2017). A list of the target stimuli is given in Appendix A.

Procedure
The recordings took place in a sound-proof booth in a lab of the Department of Foreign Languages and Literatures at Tsinghua University in Beijing. Participants were seated in front of a Dell laptop screen at an approximate distance of 50 cm, wearing a head-worn vocal microphone (Shure SM10A), and the utterances were recorded with the microphone connected to an external sound card using Audacity software (sampling rate 44.1 kHz, 16 bit, mono). The stimuli were presented on-screen using Praat, and the presentation pace was controlled by the experimenter. Participants were instructed to first read the sentence on screen silently to understand its meaning and then utter it as if they were engaged in a conversation.

Participants
Forty participants (23 female and 17 male, mean age = 21 years), all students at Tsinghua University, were paid to participate in the experiment. All were native speakers of Beijing Mandarin, born and raised in Beijing, and had normal or corrected-to-normal vision.

Acoustic analysis
A total of 2240 utterances (56 target stimuli × 40 speakers) was inspected for disfluencies, unnatural pausing, or slips of the tongue, and 171 tokens were excluded from any further analysis. We then analyzed the remaining 2069 utterances with respect to duration, F0, and intensity.
Duration. In every utterance, we manually marked the onset and the offset of the first six syllables in Praat (Boersma & Weenink, 2017), and used a script to extract the duration of each syllable (see Figure 1). Note that syllables in Mandarin are words (Duanmu, 2000, 2011). We also calculated the speakers' speech rate by adding up the durations of the first six syllables and dividing the outcome by six.
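The duration and speech-rate measures described above can be sketched as follows. This is only a minimal illustration, assuming that the syllable boundaries have already been exported from Praat as (onset, offset) pairs in seconds; the function names and the boundary values are hypothetical and do not reproduce the script used in the study. Note that the measure as described is a mean syllable duration, so lower values correspond to faster speech.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (onset, offset) of one syllable, in seconds


def syllable_durations_ms(intervals: List[Interval]) -> List[float]:
    """Return the duration of each syllable in milliseconds."""
    durations = []
    for onset, offset in intervals:
        if offset < onset:
            raise ValueError("syllable offset precedes its onset")
        durations.append((offset - onset) * 1000.0)
    return durations


def speech_rate_ms(intervals: List[Interval]) -> float:
    """Sum the durations of the first six syllables and divide by six."""
    first_six = syllable_durations_ms(intervals[:6])
    if len(first_six) != 6:
        raise ValueError("need at least six syllables")
    return sum(first_six) / 6.0


# Hypothetical boundaries for S1-S6 (illustrative values, not study data).
example = [(0.00, 0.15), (0.15, 0.32), (0.32, 0.48),
           (0.48, 0.66), (0.66, 0.84), (0.84, 0.96)]
rate = speech_rate_ms(example)
```

With these made-up boundaries the six syllables span 0.96 s in total, so the measure comes out at about 160 ms per syllable.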
F0. For F0, we obtained a number of tone-specific measurements. Following, among others, Duanmu (2000), Chen and Gussenhoven (2008), and van de Weijer and Sloos (2014), we represent the four lexical tones in Mandarin (T1, T2, T3, T4) with H(igh) and L(ow): T1, which is a high level tone, is represented as H; T2, which exhibits a rise, is represented as LH; T3, which is a low tone,6 is represented as L; and T4, which exhibits a fall, is represented as HL. We operationalized these phonological representations by making the following measurements. For the static tones T1 and T3 (Xu, 1999), following Chen (2010), we measured F0 at the syllable offset (see Figure 1), measuring F0-maximum for T1 and F0-minimum for T3. For the dynamic tone T2, we measured the F0-minimum (beginning of the F0-rise) and the F0-maximum (end of the F0-rise). To obtain these measurements, we first identified the F0-maximum of the tone unit, which occurred at the syllable offset, and then inspected the F0 leftwards to identify the onset of the F0-rise (F0-minimum). Similarly, for the dynamic tone T4, we measured the F0-maximum (beginning of the fall) and the F0-minimum (end of the fall). After identifying the F0-maximum, which occurred at the syllable onset, we inspected the F0 rightwards to identify the offset of the F0-fall (see Figure 1). In syllable six (S6) the perfective marker le has neutral tone (T0). As the realization of the neutral tone (T0) is influenced by the preceding tone, following Li (2002), we chose our measurement points depending on the preceding tone. We measured first the F0-maximum and then the F0-minimum when le followed a verb that bore T1, T2, or T4, and we measured first the F0-minimum and then the F0-maximum when le followed a verb that bore T3. We also calculated the F0-range in the pre-wh-word (or corresponding non-wh-word) contour by subtracting the F0-minimum of T2 in the first syllable of the utterance (S1) from the F0-maximum of T0 in the sixth syllable of the utterance (S6).
To reduce speaker variation, F0 values in Hz were converted into semitones (ST). For female speakers we used the formula in (6), while for male speakers we used the formula in (7).
Intensity. We measured the syllable intensity range, defined as Syllable Maximum Intensity − Syllable Minimum Intensity, rather than the mean syllable intensity, as this measure is more informative and allows us to capture any differences in intensity between declaratives and wh-questions (see also Titze, 1988; Chen, 2005; Ouyang & Kaiser, 2013).
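The semitone conversion, the F0-range, and the intensity range can be illustrated with the sketch below. It uses the standard conversion ST = 12 × log2(F0 / reference). The reference frequencies in `REF_HZ` are placeholders chosen purely for illustration and are not the values of formulas (6) and (7); all function names are hypothetical as well.

```python
import math


def hz_to_semitones(f0_hz: float, ref_hz: float) -> float:
    """Convert an F0 value in Hz to semitones relative to ref_hz."""
    if f0_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(f0_hz / ref_hz)


# ASSUMPTION: placeholder reference frequencies, NOT the ones in (6) and (7).
REF_HZ = {"female": 100.0, "male": 50.0}


def f0_range_st(s1_t2_min_hz: float, s6_t0_max_hz: float, sex: str) -> float:
    """F0-range: F0-maximum of T0 in S6 minus F0-minimum of T2 in S1 (in ST)."""
    ref = REF_HZ[sex]
    return hz_to_semitones(s6_t0_max_hz, ref) - hz_to_semitones(s1_t2_min_hz, ref)


def intensity_range_db(max_db: float, min_db: float) -> float:
    """Syllable maximum intensity minus syllable minimum intensity (in dB)."""
    return max_db - min_db
```

Because a difference of two semitone values taken against the same reference equals 12 × log2(max / min), the F0-range itself does not depend on which reference frequency is chosen; the reference only shifts the absolute ST values.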

Statistical analysis
We ran a series of linear mixed-effects models using the lmer function of the lme4 package (Bates, Mächler, Bolker, & Walker, 2015) in R (R Core Team, 2017).
Duration. We first ran a null model (m0) with duration in milliseconds as the dependent variable, and speaker and item as random factors. Then, we ran a second model (m1) in which we included in addition the syllable [S1, S2, S3, S4, S5, S6] as a fixed factor. A third model (m2) also included type of the clause [declarative, wh-question] as a fixed factor, while the final model (m3) included duration as the dependent variable, type of the clause, syllable, and the interaction type of the clause × syllable as fixed factors, and by-speaker and by-item random intercepts. Models with maximal random effects failed to converge. All four duration models were compared for model fit; see Table 1. In the results section, we present the outcomes of the final model (Duration:m3) that performed best.
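As an illustration of the structure of the final duration model, m3 can be written out as below. This is a schematic rendering under our reading of the description, not the authors' own notation: i indexes speakers, j items, and k observations; C codes clause type and S syllable. Since syllable is a six-level factor, the syllable and interaction terms stand for sets of contrast coefficients rather than single slopes.

```latex
\[
\mathrm{duration}_{ijk}
  = \beta_0
  + \beta_{C}\, C_{ijk}
  + \beta_{S}\, S_{ijk}
  + \beta_{C \times S}\, (C_{ijk} \cdot S_{ijk})
  + u_i + w_j + \varepsilon_{ijk},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\;
w_j \sim \mathcal{N}(0, \sigma_w^2),\;
\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
\]
```

Here u_i and w_j are the by-speaker and by-item random intercepts. The simpler models m0 to m2 are obtained by dropping the fixed-effect terms from right to left.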
Speech rate. We started with a null model (m0) which included speech rate as a dependent variable, and speaker and item as random factors. Then, we ran a second model (m1) in which we included the type of the clause [declarative, wh-question] as a fixed factor and by-speaker and by-item random intercepts. A model with maximal random effects failed to converge, and the second model was found to perform best; see Table 1.
F0. We ran two series of analyses, one for F0min measurements and one for F0max measurements. For F0min, similar to duration, we first ran a null model (m0) with F0min in semitones as the dependent variable, and speaker and item as random factors. Then, we ran a second model (m1) in which we included in addition the syllable [S1, S3, S5, S6]7 as a fixed factor. A third model (m2) also included type of the clause [declarative, wh-question] as a fixed factor, while the fourth model (m3) also included the interaction syllable by type of the clause. The final model (m4) included F0min as the dependent variable, type of the clause, syllable, tone [T2, T3, T4], and the interaction syllable by type of the clause as fixed factors, and by-speaker and by-item random intercepts. Models with maximal random effects failed to converge. All five F0min models were subsequently compared for model fit; see Table 2. In the results section, we present the outcomes of the final model (F0min:m4) that performed best. A similar procedure was followed for F0max.
In Table 2 we report the model fit for all five F0max models. In the results section, we present the outcomes of the final model (F0max:m4) that performed best.
F0-range. We started with a null model which included F0-range as a dependent variable, and speaker and item as random factors. Then, we ran a second model in which we included in addition the type of the clause [declarative, wh-question] as a fixed factor. A third model also included the tone of the verb [T1, T2, T3, T4] as a fixed factor, while the final model included F0-range as a dependent variable, type of the clause, tone of the verb, and their interaction as fixed factors, and by-speaker and by-item random intercepts. In the results section we present the results of the last model, which was found to perform best; see Table 2.
7 Syllable two (S2) and syllable four (S4) are not contributing any data, as S2 and S4 bear T1 (high level tone).
Intensity. We first ran a null model with intensity range in dB as the dependent variable, and speaker and item as random factors. Then, we ran a second model in which we included in addition the syllable [S1, S2, S3, S4, S5Tone1, S5Tone2, S5Tone3, S5Tone4, S6Tone1, S6Tone2, S6Tone3, S6Tone4] as a fixed factor. A third model also included type of the clause [declarative, wh-question] as a fixed factor, while the final model included intensity range as the dependent variable, type of the clause, syllable, and their interaction as fixed factors, and by-speaker and by-item random intercepts. Models with maximal random effects failed to converge. All four intensity models were compared for model fit; see Table 3. In the results section, we present the outcomes of the final model (Intensity:m3), despite the fact that it is not the best-fitting model.

Results
Duration. Figure 2 displays the mean duration (in ms) of the first six syllables in wh-questions paired with their counterparts in declaratives. As shown in Figure 2, the mean syllable duration of declaratives is longer than the mean syllable duration of wh-questions. The type of the clause also had a significant effect on speech rate; the first six syllables of wh-questions were uttered faster than the corresponding declarative syllables; see Figure 3 and Table 7 in Appendix B [Estimates = 5.199, SE = 0.6554, t-value = 7.933, p < 0.001].
F0. Figure 4 presents the mean values of the F0 measurement points for syllables S1-S6 in wh-questions as compared to their counterparts in declaratives, broken down per verb tone.As shown in Figure 4, the mean F0 values of the F0 measurement points are higher in wh-questions than the corresponding F0 points in declaratives; the only exception is the H (F0-maximum) of the perfective marker le when the verb bears T1 or T4.
Our results showed a significant effect of clause type for F0min and F0max measurements. Specifically for F0min, the L of the first syllable of the adverb (S3.T2.F0min in Figure 4) was significantly lower in declaratives than in the corresponding wh-questions (see Table 9 in Appendix B). Moreover, the L of the verb, when the verb bore T2, T3, or T4 (S5.T2.F0min, S5.T3.F0min, S5.T4.F0min in Figure 4), was significantly lower in declaratives than in the corresponding wh-questions (see Tables 10-12 in Appendix B). The differences between declaratives and wh-questions were also visible on the perfective marker le; the L was significantly lower in declaratives (see Tables 13-14 in Appendix B). For F0max, the H of the second syllable (S2.T1.F0max in Figure 4) was significantly lower in declaratives than in the corresponding wh-questions (see Table 16 in Appendix B). Moreover, the H of the first and the second syllable of the adverb (S3.T2.F0max, S4.T1.F0max in Figure 4) was significantly lower in declaratives than in the corresponding wh-questions (see Tables 17-18 in Appendix B). Additionally, the H of the verb, when the verb bore T1, T2, or T4 (S5.T1.F0max, S5.T2.F0max, S5.T4.F0max in Figure 4), was significantly lower in declaratives than in the corresponding wh-questions (see Tables 19-21 in Appendix B). We also found a significant effect of clause type on F0-range; when the verb bore T1 or T4, declaratives showed a larger F0-range than wh-questions. The difference between the two clause types was not significant when the verb was T2. The opposite pattern was observed when the verb bore T3; see Figure 5 and Tables 24-27 in Appendix B.
Intensity. Our results show that the mean intensity range at S4 was significantly higher in declaratives than in the corresponding wh-questions [Estimates = 0.5345, SE = 0.2133, t = 2.506, p < 0.05]; see Table 31 in Appendix B. When examining the mean intensity range at S1, S2, and S3, there was no significant difference between the two clause types; see Figure 6 and Tables 28-30 in Appendix B. When looking at S5, the two clause types do not differ with respect to intensity range. However, when looking at S6, there is a significant difference between wh-questions and declaratives, the former having a higher intensity range, when the verb carries T4 [Estimates = −1.0576, SE = 0.4267, t = −2.478, p < 0.05]; the difference is not significant when the verb bears T1, T2, and T3; see

Interim conclusion
The results of the production experiment show that speakers mark prosodically the intended clause type already from the onset of the clause. The prosodic differences between the two clause types are summarized below.
Duration. Looking at the region before the direct object (wh-word or its non-wh counterpart), syllables S2-S5 in declaratives are significantly longer than the corresponding syllables in wh-questions. In this region, wh-questions are uttered significantly faster than their declarative counterparts.
F0. Wh-questions have higher F0 than the corresponding declaratives. Wh-questions also have a smaller F0 range in the pre-wh-word contour in comparison to the corresponding declaratives, when the verb bears T1 or T4.
Intensity. In wh-questions, S4 (second syllable of the adverb) has a significantly higher intensity range than the corresponding declaratives.
On the basis of these results, a naturally emerging question is whether listeners use these prosodic cues (duration, F0, and intensity) to anticipate the type of the clause before reaching the wh-word or noun. To address this question, we conducted an audio perception experiment, which is the focus of the following section.

Audio perception experiment: Clause type and anticipation
The findings of the production study show that Mandarin wh-questions and declaratives differ prosodically in the region before the direct object (the direct object is a wh-word or a non-wh counterpart) and thus suggest that information about the type of the clause is encoded at least prosodically prior to the wh-word. A relevant question is whether listeners use the acoustic cues produced by the speakers, and how early in time they interpret the two audio fragments in a distinct way. The audio perception experiment reported here aims at tackling this issue. For setting up this experiment we used a modification of the gating paradigm (Grosjean, 1980; Lahiri & Marslen-Wilson, 1991). The gating paradigm was initially used for spoken word recognition. The experimenter presented a word repeatedly, varying its length and pitch, and participants were asked to write down the word they had heard after each hearing and to indicate their degree of confidence. In the classic gating paradigm, word and prosodic phrase boundaries were manipulated. In later studies, modifications of the gating paradigm have been used for studying the contribution of prenuclear accents to the meaning of the utterance; in these studies, the word and prosodic phrase boundaries stay intact (see Petrone & Niebuhr, 2014).
In our audio perception experiment, participants heard audio fragments and were asked to complete each fragment by selecting a declarative or question continuation that appeared in written form on their screen. If listeners can identify the clause type from these early fragments, this offers evidence for the use of prosody in anticipating the clause type.

Participants
A total of 36 participants (20 male, 16 female, mean age = 19 years) were reimbursed to take part in the experiment. All of them were native speakers of Beijing Mandarin coming from the Beijing area, and none of them reported any hearing disorders.

Stimuli
The third author of the paper, who is a native speaker of Beijing Mandarin, inspected the production data and evaluated the speakers for naturalness and clear articulation. After this inspection, we selected a female speaker (20 years old). This speaker was considered to be one of the best of the 40 speakers who participated in the production study reported in Section 2, in the sense that her speech is clear, natural, and well paced. Subsequently, the third author inspected the total production data of this female speaker and selected 40 stimuli on the basis of their naturalness. These stimuli consisted of 20 sets; each set included a declarative and its corresponding wh-question (20 sets × 2 clause types). As discussed in Section 2, tones were kept constant across clause types and items for all constituents except for the verb; the tone of the verb varied and all four lexical tones were included. The stimuli were used as the basis for constructing the audio fragments. Specifically, each of the 40 stimuli was cut at three different points: at the offset of the subject of the sentence (gate-a), at the offset of the adverb (gate-b), and at the offset of the perfective marker le (gate-c).8 This resulted in a total of 120 audio fragments. An example of the three gates is given in Figure 8 and examples (8)-(10). The number of syllables and the corresponding tones of the three audio fragments are given in (11)-(13).
8 For reasons of naturalness we decided to respect the word boundaries and therefore cut the stimuli at the offset of words; see Petrone and Niebuhr (2014) for a similar reasoning.
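The construction of the gated fragments can be sketched as follows. The sketch assumes, in line with the stimulus structure described in Section 2, that the subject ends at S2, the adverb at S4, and the perfective marker le at S6, and that the syllable offsets are available in seconds. The names `GATE_FINAL_SYLLABLE`, `gate_cut_points`, and `count_fragments` are hypothetical, and the offsets are made-up values.

```python
from typing import Dict, List

# Gate name -> 1-based index of the syllable whose offset ends the fragment.
# ASSUMPTION: subject = S1-S2, adverb = S3-S4, verb + le = S5-S6.
GATE_FINAL_SYLLABLE = {"gate-a": 2, "gate-b": 4, "gate-c": 6}


def gate_cut_points(syllable_offsets: List[float]) -> Dict[str, float]:
    """Return the time (in s) at which each fragment is cut, per gate."""
    if len(syllable_offsets) < 6:
        raise ValueError("need the offsets of at least six syllables")
    return {gate: syllable_offsets[n - 1]
            for gate, n in GATE_FINAL_SYLLABLE.items()}


def count_fragments(n_stimuli: int) -> int:
    """Each stimulus yields one fragment per gate."""
    return n_stimuli * len(GATE_FINAL_SYLLABLE)


# Hypothetical syllable offsets for S1-S6 of one stimulus (not study data).
offsets = [0.15, 0.32, 0.48, 0.66, 0.84, 0.96]
cuts = gate_cut_points(offsets)
total = count_fragments(40)
```

Each fragment starts at the beginning of the utterance and ends at the cut point of its gate, so the gate-c fragment contains the material of the two earlier gates.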
Acoustic properties of the stimuli. As indicated above, the duration and the F0 of the stimuli uttered by the selected speaker are not identical to the mean duration and the mean F0 of the stimuli uttered by the 40 speakers. We present here the acoustic properties of the stimuli that were used in gate-c (syllables S1-S6), as these overlap with the stimuli used at the two other gates. Specifically, we present information about the duration, speech rate, F0, and intensity range of the stimuli that were used in the audio perception experiment and briefly indicate the main differences with the results of the production study.
Statistical analysis of the acoustic properties of the stimuli. We ran a series of linear mixed-effects models using the lmer function of the lme4 package (Bates et al., 2015) in R (R Core Team, 2017). Specifically, for every measurement, we first ran a null model with the relevant measurement in milliseconds as the dependent variable and items as random factors. A second model additionally included clause type as a fixed factor. Model fit was compared using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) (Agresti, 2002).
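As a reminder of how the two fit measures trade off goodness of fit against model complexity, the following minimal sketch computes AIC and BIC from a model's log-likelihood using their standard definitions. The log-likelihoods, parameter counts, and number of observations below are hypothetical values chosen for illustration; they are not taken from the models reported here.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits: a null model and a model that adds clause type.
null_fit = {"loglik": -120.0, "k": 3}
full_fit = {"loglik": -112.0, "k": 4}
n_obs = 40  # hypothetical number of observations

for name, fit in (("null", null_fit), ("clause type", full_fit)):
    print(name, aic(fit["loglik"], fit["k"]), bic(fit["loglik"], fit["k"], n_obs))
```

In this made-up case the model with clause type has the lower AIC and the lower BIC, which is the pattern that would count as an improvement in fit.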
Duration. The total duration (S1-S6) of the fragmented declarative stimuli (mean = 1003 ms) was on average significantly longer than the total duration of the fragmented wh-question stimuli (mean = 952 ms) [Estimate = 0.051, SE = 0.013, t = 3.85]. Moreover, the duration of each syllable of the fragmented declarative stimuli was longer than the duration of each syllable of the fragmented wh-question stimuli; see Figure 9. The difference between the two clause types was statistically significant for S1-S4, as shown in Table 4. There was also a difference in S5 and S6, but contrary to what was found in the production study, this difference was not significant.
The fragmented declarative stimuli were also slower in speech rate than the fragmented wh-question stimuli [Estimate = 0.008, SE = 0.002, t = 3.85]; see Figure 10.
F0. In general, the F0 points of the fragmented wh-question stimuli were higher than the F0 points of the corresponding declarative stimuli; see Figure 11.
This difference between wh-questions and declaratives was significant at the H (F0max) of syllable two (S2), which bore T1, and at the H (F0max) of syllable four (S4), which also bore T1; see Table 5. Moreover, there was a significant difference at the H of the verb when the verb bore T1 or T4. When the verb bore T2 or T3, the significant difference was at the first F0 point of the perfective marker le; see Table 5. As in the production experiment (see Figure 4), the F0 points were significantly higher in wh-questions for syllables S2-S4.
When looking at the F0 measurements for S5 and S6 in the production stimuli, we see that the differences between wh-questions and declaratives are significant at the verb (S5); there was no significant difference on the H when the verb bore T2 (S5.T2.F0max). The comparison with the data from the production study shows that, in general, the patterns were similar, even though the differences were less often significant in the data used for the perception experiment.
Intensity. The mean intensity range of syllables S1-S6 was not significantly different between declaratives and wh-questions, with the exception of S5 when the verb bore T2; see Table 6 and Figure 12. In the data of the production experiment, we only found a difference in intensity range for S6.
From the above analysis, we see that all three acoustic parameters (duration, F0, and intensity) of the stimuli used for the gating study pattern in the same direction as those of the stimuli obtained from the production experiment, although they cannot be exactly the same, as the stimuli of the gating study come from one particular speaker.
Response stimuli. We also prepared two types of response stimuli that appeared in writing on the screen after each audio fragment was heard. The response stimuli consisted of either a wh-question continuation or a declarative continuation; the two continuations differed only in the wh-word and its non-wh-counterpart. An example of the two sentence continuations for gate-a is given in (14)-(15). For gate-b, the two sentence continuations were identical to the ones for gate-a, but without the adverb (e.g., zuótiān 'yesterday'). For gate-c, the two sentence continuations were identical to the ones for gate-b, but without the verb and the perfective marker le (e.g., bāo-le 'peeled').

The experiment was run using MFC in Praat (Boersma & Weenink, 2017) and proceeded as follows. Participants were seated in front of a computer and were asked to read the instructions that appeared on the computer screen and to press the OK button once they were ready to start the experiment. The first audio fragment was played via the computer's loudspeakers 1.0 seconds after the OK button was clicked. While the audio fragment was played, the screen was empty; 0.3 seconds after the fragment's offset, two possible sentence continuations appeared on the screen. The participants' task was to select one of the two responses on the basis of the audio fragment they had heard and to confirm their selection by clicking on the OK button. The next audio stimulus was played 1.0 seconds after the OK button was clicked. Participants listened first to the audio fragments of gate-a, then to the audio fragments of gate-b, and finally to the audio fragments of gate-c. Both the order of the two sentence continuations on the screen and the order of the audio fragments within each gate were randomized to avoid any presentation bias. The experiment took place in a silent room at Tsinghua University in Beijing and lasted approximately 20 minutes.
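The presentation order described above (gates in a fixed order, fragments shuffled within each gate, and the two continuations shuffled on every trial) can be sketched as follows. This is a hypothetical Python illustration of the design logic, not the Praat MFC script that was actually used; the function name, the fragment identifiers, and the `seed` parameter are assumptions introduced here.

```python
import random

GATE_ORDER = ("gate-a", "gate-b", "gate-c")


def build_trial_list(fragments_by_gate, seed=None):
    """Blocked by gate; fragment order and continuation order are randomized."""
    rng = random.Random(seed)
    trials = []
    for gate in GATE_ORDER:
        block = list(fragments_by_gate[gate])
        rng.shuffle(block)  # random fragment order within the gate
        for fragment in block:
            options = ["declarative", "wh-question"]
            rng.shuffle(options)  # random on-screen order of the continuations
            trials.append({"gate": gate, "fragment": fragment, "options": tuple(options)})
    return trials


# Placeholder fragment identifiers: 40 fragments per gate.
demo = {gate: [f"{gate}-{i:02d}" for i in range(1, 41)] for gate in GATE_ORDER}
trials = build_trial_list(demo, seed=1)
print(len(trials), trials[0]["gate"], trials[-1]["gate"])
```

Passing a fixed `seed` makes the sketch reproducible; in an actual session each participant would presumably receive a fresh randomization.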

Statistical analysis
We used a series of mixed-effects models to analyze the likelihood of the type of continuation (wh-questions versus declaratives). All the analyses were run in R using the lme4 package (Bates et al., 2015). We first ran null models that included participants' responses as the dependent variable, and participants and items as random factors. We then ran additional models adding the clause type intended by the speaker as a predictor to see whether the model improved. For gate-c we also ran a third model adding a second predictor, namely, the verb tone, to see whether the model improved, and then a fourth model in which the interaction between the two predictors was also included. For gate-a and gate-b, the null models improved when the clause type intended by the speaker was added as a predictor to the model [for gate-a: χ²(1) = 90.68, p < 0.001; for gate-b: χ²(1) = 212.77, p < 0.001].
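The model comparisons reported above are chi-square tests with one degree of freedom, as is standard for a likelihood-ratio comparison of two nested models that differ in one parameter. The sketch below shows how such a statistic maps onto a p-value using only the Python standard library; it relies on the identity that, for one degree of freedom, the upper-tail probability equals erfc(√(x/2)). The function names are illustrative, and the code is not the authors' R analysis.

```python
import math


def likelihood_ratio_statistic(loglik_null, loglik_full):
    """Likelihood-ratio statistic for two nested models: 2 (ln L_full - ln L_null)."""
    return 2.0 * (loglik_full - loglik_null)


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Test statistics as reported in the text for gate-a and gate-b.
for gate, statistic in (("gate-a", 90.68), ("gate-b", 212.77)):
    print(gate, chi2_sf_df1(statistic) < 0.001)
```

Both reported statistics lie far above the 0.001 critical value for one degree of freedom (about 10.83), consistent with the p-values given in the text.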
[...] when the verb bore T1, participants chose a wh-question continuation for audio fragments that originated from wh-questions more often than a declarative continuation. Likewise, they chose a declarative continuation more often than a wh-question continuation for audio fragments that originated from declaratives. After listening to audio fragments of gate-c with T2 on the verb, participants responded in a similar way: audio fragments that originated from wh-questions triggered more wh-continuation responses, while fragments that originated from declaratives received more declarative continuation responses. Audio fragments with T4 on the verb elicited similar responses from the participants. The distribution of the participants' responses was different when the verb bore T3. After listening to audio fragments of gate-c with T3 on the verb, listeners had difficulty identifying the clause type that was intended by the speaker. As shown in the lower left panel in Figure 14, participants chose a declarative continuation 66.7% of the time when the intended clause type was a wh-question. [...] shows that the interaction between the clause type that was intended by the speaker and T3 differed significantly from the interaction between the clause type that was intended by the speaker and T4. Furthermore, as shown in Figure 15 (see left panel), when the intended clause type was a wh-question, the participants were more likely to choose a wh-continuation after listening to an audio fragment with T4 than after listening to an audio fragment with T3. [...] variation with respect to tones among the items, and to the global results of gate-c (S1-S6).
In the next section, we will turn to the effects of the different tones on the verb in S5.
The audio fragments that were presented to the listeners at gate-a offered two types of cues for distinguishing between wh-questions and declaratives. In the first place, the F0 maximum (H) on S2 (T1) was significantly higher for questions than for declaratives. In the second place, the duration of both syllables was significantly shorter in questions than in declaratives. There were no differences in intensity. At gate-b, the F0 maximum (H) on S4 was significantly higher for questions than for declaratives, while S3 and S4 were significantly shorter in questions. To some extent, the listeners did not have much more information than at gate-a. These descriptive differences are in line with the fact that the results for gate-a and gate-b were similar, suggesting that the cues used by the listeners were most likely already contained in the first two syllables. The extra duration information on S3 and S4 did not lead to better discrimination, suggesting that duration information in S3/S4 was either not used, or the information in the first two syllables was enough to perceive a higher speech rate. At gate-c, the results differed per verb tone. In the next section, we will consider the interaction between intonation and tones in more detail.

Interaction between tone and intonation
The interaction between tone and intonation in Mandarin has mainly been discussed for declarative questions in comparison with their string-identical declaratives. As mentioned in Section 2, the most striking F0 difference between declarative questions and ordinary declaratives lies in the final syllable. That is, the final syllable in declarative questions is normally higher in F0, showing a rising contour, as compared with that in declaratives. Since F0 is not only (one of) the most important acoustic correlates of intonation, but also the most important acoustic correlate of tones in Mandarin, many scholars have investigated the interaction between intonation and the rising tone (T2) and the falling tone (T4). In particular, the question addressed is whether these tones at the end of the sentence affect the identification of intonation/clause type, or vice versa, whether the intonation affects the identification of tones (Yuan, 2006; Liu, Chen, & Schiller, 2016; Ma, Giocca, & Whitehill, 2010). However, no literature has addressed the interaction of Tone 3 (T3) and intonation.
The consistent conclusion of these studies is that there is an interaction between tone and intonation, especially when the sentence ends in T2. For instance, Yuan (2011) found that in Mandarin, declarative questions ending with T4 (falling tone) were easier to identify than declarative questions ending with T2 (rising tone). Liu et al. (2016) found that Mandarin listeners can distinguish between declarative-question intonation and declarative intonation when the intonation is associated with a final T4, but fail to do so when the intonation is associated with a final T2.
Even though our study was not designed to study the interaction between tones and intonation, and certainly not in comparison with results from previous studies, our results from both the production experiment and the audio perception study show that the tone on the verbs interacted with other factors. In particular, the items containing a verb with T3 behaved differently from the ones with T1, T2, and T4, even though we also observed effects of T1 and T4.
In the production experiment, the most important effects were found for pitch. Whereas T1, T2, and T4 gave rise to a reduced F0 range and a higher overall pitch in the pre-wh region of the wh-in-situ question as compared to the corresponding region in the declarative (see the general pattern described in the previous section), these effects were not observed for T3. When looking at specific measurement points, the final particle le (S6) showed different patterns for T1 and T4. Whereas we found overall higher pitch for T2 and T3 in questions than in declaratives, T1 and T4 gave rise to a lower F0 maximum H and a higher F0 minimum L for this syllable. As the data in Figure 4 show, this lower pitch maximum in the wh-in-situ questions results in a clearer difference between the two clause types: In the declaratives there is an F0 rise between the verb in S5 and the fall on the verbal particle le in S6, while there is a continuous fall in the wh-questions. Besides pitch, intensity also interacted with the verb tone. T4 behaved differently from the other tones, as it exhibited a significantly higher intensity range in wh-questions, while wh-questions with a verb that bore T3 were more similar to the corresponding region in declaratives in terms of their overall pitch, their pitch range, and intensity.
Turning to the perception experiment, tone played a role at gate-c, as this gate also contained the verb (bearing tones T1-T4) and the perfective particle le. After hearing audio fragments of gate-c, a difference was observed for T3 as opposed to T1, T2, and T4. In these latter cases, the listeners identified the intended clause type of both questions and declaratives above chance level. However, after hearing fragments with a T3 verb, the listeners correctly identified declarative sentences (86.11% correct responses), but they confused wh-questions with declaratives (33.3% correct responses). This suggests that the prominence of the low-dipping tone, Tone 3, affects the perception of the other prosodic cues, and seems to overrule the prosodic information provided in S1-S4. Moreover, the clear pitch difference on the particle le did not lead to correct identification of the clause type.
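To make the notion of "chance level" in a two-alternative task concrete, the sketch below computes an exact two-sided binomial test against a success probability of 0.5. It is a simplified illustration only: it ignores the clustering of responses within participants and items that the mixed-effects models account for, and the number of responses per cell (`n_assumed = 180`) is an assumption made here for the example, not a figure stated in the text.

```python
from math import comb


def binom_two_sided_p_half(k, n):
    """Exact two-sided binomial p-value for k successes in n trials at p = 0.5."""
    lower = sum(comb(n, i) for i in range(0, k + 1))  # P(X <= k) * 2**n
    upper = sum(comb(n, i) for i in range(k, n + 1))  # P(X >= k) * 2**n
    # The null distribution is symmetric at p = 0.5, so double the smaller tail.
    return min(1.0, 2 * min(lower, upper) / 2 ** n)


n_assumed = 180  # ASSUMPTION: hypothetical number of responses per cell
correct_declarative = round(0.8611 * n_assumed)  # from the reported 86.11%
p_value = binom_two_sided_p_half(correct_declarative, n_assumed)
print(p_value < 0.05)
```

A proportion this far from 0.5 differs from chance under the simple binomial model for any sample of that size; whether the same holds once participant and item variation are modelled is a question for the mixed-effects analysis.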
An alternative interpretation of the results of the perception experiment is that the female speaker that we chose for the perception experiment realizes T3 in a way that confuses listeners with respect to the identification of the clause type; this confusion appears at the verb position. Had we used another speaker whose production data were more often significantly different between the two clause types, the results could have been different. Such hypothetical results would imply that listeners are particularly sensitive to the prosodic cues they hear in the corresponding gates. Alternatively, we could manipulate duration, F0, and intensity and examine their respective contributions to clause type identification. Another related issue is gender; in this perception experiment we chose to use a female speaker. Had we used a male speaker, the results could have been different. At this point we do not know whether gender is a relevant factor for clause type anticipation.
As for the items with T1, our data show that these were more often interpreted as questions, suggesting that the high level tone (T1) facilitated the correct identification of questions (71.7% of correct responses). T1 and T4 were the only tones in which the maximum pitch on the verb was higher in the fragments used for the audio perception study, but this only helped question identification with T1. For T4 and for T2, no significant difference was found with respect to gate-b.

Conclusion
Both of the research questions that we posed in Section 1 have positive answers: Speakers of Mandarin do prosodically mark the clause type in the pre-wh-word contour, and listeners do use this prosodic marking to determine the clause type before reaching the wh-word or its non-wh-counterpart. As the wh-in-situ questions investigated in our study are not biased questions, we suggest that the prosodic markings that we identified reflect the clause type of these questions. We have seen that, although listeners do use the prosodic marking in determining the clause type, the tonal properties of the stimuli must also be taken into consideration.

Figure 1: A waveform and a spectrogram with a superimposed F0 curve of a wh-question. The tones of the syllables S1-S6 and the corresponding F0 measurement points are indicated in TextTier 4 and PointTier 6.
Figure 7 and Tables 31-39 in Appendix B.

Figure 6: Mean intensity range in dB of syllables S1-S4 in wh-questions and declaratives.

Figure 7: Mean intensity range in dB of syllables S5 and S6 in wh-questions and declaratives.

Figure 10: Mean speech rate in wh-questions and declaratives.
Standard error is given in parentheses ( ), t-value is given in [ ], and p-value is given in { }. Bold indicates significance at p < 0.05.

Figure 12: Upper panel: Mean intensity range in dB of syllables S1-S4 in wh-questions and declaratives. Middle and lower panels: Mean intensity range in dB for syllables S5 and S6 in wh-questions and declaratives.

Figure 14: Listeners' responses in percentage (%) to stimuli of gate-c, broken down by the tone on the verb. The upper left panel presents listeners' responses when the verb bore T1, the upper right panel when the verb bore T2, the lower left panel when the verb bore T3, and the lower right panel when the verb bore T4.

Furthermore, the results of a linear mixed-effects model analysis showed that the distribution of participants' responses did not differ between gate-a and gate-b [Estimate = −0.194, SE = 0.111, p > 0.05]; see Figure 16. Between gate-b and gate-c the results differed depending on the tone on the verb. When the verb bore T1, the participants were more likely to select the question continuation [Estimate = −0.595, SE = 0.197, p < 0.05], while they were less likely to do so when the verb bore T3 [Estimate = 1.160, SE = 0.193, p < 0.001]. The distribution of participants' responses did not differ significantly between gate-b and gate-c when the verb bore T2 [Estimate = 0.210, SE = 0.187, p > 0.05] or T4 [Estimate = 0.051, SE = 0.188, p > 0.05].

Figure 15: Effects of clause type and verb tone on participants' responses.

Table 1: Model fit measures for duration and speech rate models.

Table 3: Model fit measures for intensity range models.

Table 4: Results of linear mixed-effects models on duration.

Table 5: Results of linear mixed-effects models on F0.
Note: Standard error is given in parentheses ( ), t-value is given in [ ], and p-value is given in { }. Bold indicates significance at p < 0.05.

Table 6: Results of linear mixed-effects models on intensity range.