1. Introduction

The prosodic realization of utterances is characterized by a large degree of variability, and there is generally no one-to-one relationship between prosody and meaning (Cruttenden, 1986; Grice et al., 2017; Nolan, 2020; Wells, 2006; Westera et al., 2020; Zimmermann & Onea, 2011). Nevertheless, there are certain constraints on prosodic realizations, i.e., certain tonal events or combinations thereof, which only occur in some contexts and not in others. For instance, assertions in German are typically produced with an H* pitch accent and an L-% edge tone, but other tonal realizations exist in this context. In perception, listeners make use of the prosodic realization of an utterance for its interpretation (Baumann, 2006; Bishop, 2012; Petrone et al., 2017; Petrone & D’Imperio, 2011; Petrone & Niebuhr, 2014; Snedeker & Trueswell, 2002; Weber et al., 2006). In German, for example, a sentence with declarative word order and a final fall (i.e., H* L-%) is likely interpreted as a (neutral) assertion, while a final rise may signal uncertainty or turn the utterance into a question (Cruttenden, 1994; Geluykens, 1987, 1988; G. Ward & Hirschberg, 1985).

This paper targets the role of tonal and nontonal prosodic cues to rhetorical questions (RQs) and information-seeking questions (ISQs) in German from a listener’s perspective. Linguistically, ISQs request information from an addressee (e.g., (1a)). RQs, on the other hand, often serve to make a point (Biezma & Rawlins, 2017). They commit the interlocutor to the answer presupposed in the interrogative (e.g., to the fact that nobody likes paying taxes in (1b)). RQs can be signaled by different kinds of linguistic means or by combinations of these, e.g., the use of particles, verb mood, or negative-polarity items (Dehé et al., 2022; Dehé, Braun et al., 2024), but there are questions without specific morpho-syntactic or lexical marking that may signal either an ISQ or an RQ, depending on their prosodic realization (Kharaman et al., 2019); see examples in (2). In prosodic research, the use of such string-identical questions is useful from a methodological point of view, because effects on interpretation resulting from nonprosodic cues are avoided (ignoring the influence of a potential linguistic context outside the question). Henceforth, we will refer to the contrast between wh-questions and polar questions (left-hand side vs. right-hand side in example (2)) as question type, and to the one between ISQs and RQs ((a) vs. (b) in examples (1) and (2)) as illocution type.

    (1) (a) ISQ: What time is it?
        (b) RQ: Who on earth likes paying taxes?
    (2) (a) [stylized f0 contours of string-identical wh- and polar ISQs; figure not reproduced]
        (b) [stylized f0 contours of string-identical wh- and polar RQs; figure not reproduced]

German ISQs differ from RQs in several prosodic characteristics (see Section 2.1), specifically in their intonational realization, which also differs between wh-questions and polar questions (see the stylized f0 contours in (2); stressed syllables are marked by capitals), but ISQs and RQs also exhibit nontonal prosodic differences, such as constituent duration and voice quality (not shown in (2)). We use the term prosodic realization/characteristics to refer to suprasegmental properties of utterances, including duration, voice quality, and fundamental frequency. The term intonation refers to tonal properties of the entire utterance, i.e., prenuclear accents, nuclear accents, and edge tones (these individual parts are called tonal events).

The analysis of production data allows us to identify frequent prosodic realizations for each illocution type (and question type), but may not be able to reveal which cues are important for the listener to identify the intended illocution. This paper investigates the relative impact of individual cues and cue combinations on interpretation and whether they differ across question types. The results will help us gain a better understanding of the prosody of RQs and ISQs by further specifying the prosody-pragmatics interface, but will have implications well beyond the study of these two illocution types. In fact, these issues are pertinent to some core assumptions and questions in the field of prosody research.

First, intonational meaning can be attributed to individual tonal events or to tunes (combinations of tonal events, e.g., nuclear tunes). Regarding individual tonal events, there are reports that low edge tones signal assertions (Han, 2002), while high edge tones signal inquisitiveness (Altmann, 1984; Isačenko & Schädlich, 1966; Kohler, 2004). Also, nuclear L+H* has been associated with contrastive information (Braun et al., 2018; Watson et al., 2008); see Pierrehumbert & Hirschberg (1990) for a complete compositional semantics of pitch accents, phrase accents and boundary tones. Others report intonational tunes for specific purposes, e.g., the “uncertainty contour” in English (G. Ward & Hirschberg, 1985) or the “hat pattern” to signal contrastive topic-focus structures in German (Büring, 1997). Note that the British School of Intonation (A. Fox, 1984; O’Connor & Arnold, 1973; von Essen, 1964) associates intonational meaning with nuclear tunes. In autosegmental-metrical phonology, nuclear tunes are combinations of a nuclear pitch accent and a following edge tone (Baumann et al., 2001). Roessig (2024) recently documented an inverse relationship between prenuclear and nuclear accents in German, suggesting a nonlocal (tune-based) dependency. For the current object of investigation, it is conceivable, for instance, that either certain tonal aspects of the stylized contours in (2b) trigger the rhetorical meaning (e.g., the steep rising-falling accent in rhetorical wh-questions (Zahner-Ritter et al., 2022) or the high plateau in rhetorical polar questions) or that the whole tune matters. Research question Q1 asks whether individual tonal realizations are relevant cues to illocution type or whether their combination strengthens the interpretation (tone vs. tune).

Second, prosody research has recently looked beyond tonal characteristics and included voice quality and/or duration in the signaling of meaning, e.g., breathy voice to signal irony (Fünfgeld et al., 2024; Leykum, 2021; Niebuhr, 2014; Schmiedel, 2017) or longer durations, i.e., slower speaking rates, to signal indignation (Mozziconacci, 1995). It is an open question how listeners weigh tonal cues (fundamental frequency) against nontonal cues (e.g., Gobl et al., 2002). Research question Q2 asks how strongly these nontonal prosodic cues are weighted (tonal vs. nontonal cues). The answer to this question will provide insights into the modelling of prosodic meaning: If nontonal prosodic cues are weighted as strongly as tonal cues, one may have to consider their inclusion in descriptions of question prosody, alongside tonal cues.

Third, some authors have suggested combining nontonal prosodic cues with tonal cues into prosodic constructions/configurations (Gras & Elvira-García, 2021; Neitsch & Niebuhr, 2019; Ogden, 2010), which form a direct link from prosody to meaning. One example is the “bookended narrow pitch construction” (N. G. Ward, 2019). Such prosodic constructions could be independent of the syntactic surface form of an utterance. On the other hand, previous work on a number of languages has shown that neutral polar questions and neutral wh-questions occur with different nuclear contours (Grice et al., 2005), perhaps due to the differences in syntactic structure and the corresponding ordering and positioning of the constituents making up the meaning of the utterance (see the different tonal contours in (2)). Research question Q3 hence asks whether the interpretation of tonal and nontonal cues is affected by question type, to address the broader question of whether there is a direct link between prosody and meaning or whether interpretation is mediated by question type.

Finally, the paper seeks to explore how well cue weights can be predicted from the prosodic analysis of production data. It is a useful assumption in prosody research that information on perception can be derived from behavior in production, and this practice fits well with the assumed links between production and perception (Beddor, 2015; Diehl et al., 2004; R. A. Fox, 1982; Newman, 2003). Research question Q4 asks to what extent cue weights in perception can be predicted on the basis of their frequency of occurrence in production.

To answer these questions, we interpret the results of two perception experiments. In Experiment 1, participants indicated whether they thought an auditory stimulus was intended as an ISQ or not. In Experiment 2, a different set of participants indicated whether they thought they heard an RQ or not. In short questions like those in (2), the initial word and the final noun are likely accent positions (Braun et al., 2019). The tonal options are limited by the intonational phonology of German (Grice et al., 2005), in particular the presence and type of prenuclear accent (4 different possibilities), the type of nuclear accent (6 different possibilities), and the type of edge tone (4 different possibilities), amounting to 96 tonal combinations in total. Nontonal differences may stem from voice quality or speaking rate. Such a large number of cues cannot be studied using classical psycholinguistic paradigms.

A novel contribution to the field of laboratory phonology is the application of Active Learning (AL), a machine learning algorithm that learns the weights of the cues from a limited number of items (see Section 2.4). The presentation of subsequent stimuli is based on an optimization of the classifier’s performance given the participants’ responses (Settles, 2009), which makes it highly efficient for complex designs such as the ones tested here. Given the large number of conditions, we started by modeling binary responses (‘ISQ’ vs. ‘no ISQ’, ‘RQ’ vs. ‘no RQ’).

The paper is organized as follows: Section 2 provides background information on what we already know about the prosodic realization of ISQs and RQs in production and perception (Sections 2.1 and 2.2), desiderata and hypotheses (Section 2.3), and information on how AL systems work in general (Section 2.4). Sections 3 and 4 present the methods and the results of the two experiments, respectively. Section 5 discusses the results of both experiments and Section 6 summarizes the conclusions.

2. Background

2.1. Prosodic realization of ISQs and RQs

The prosodic realizations of ISQs and RQs have recently been investigated in a series of production studies for several languages differing in their prosodic typology (see recent review in Dehé, Braun et al. 2024). For German, Braun et al. (2019) presented participants with contexts (e.g., “You are a teacher on a class trip with your students.”) that triggered either an ISQ reading (“You’d like to know which of your students want to go to the museum. You say: …”) or an RQ reading (“You know that nobody at this age likes to visit museums. You say: …”) and participants then had to produce an interrogative (e.g., “Who wants to go to the museum?”) such that it (prosodically) fit the given context. The main results of this study are summarized in Table 1 and are used to derive specific hypotheses for the current perception experiments. In short, both polar and wh-RQs were most frequently produced with rising nuclear pitch accents in which the low and high tonal targets were realized within the stressed syllable. As was shown by a combination of imitation and meaning tasks in Zahner-Ritter et al. (2022), this accentual realization must be considered distinct from the L+H* and L*+H accents in GToBI (Grice et al., 2005),1 and must therefore be seen as an accent category of its own within the German tonal inventory: (LH)*. Apart from nuclear (LH)*, RQs are characterized by more frequent use of breathy voice quality on the phrase-initial word, and longer constituent durations—all compared to ISQs. Other tonal differences depended on question type (polar vs. wh-question). For instance, for ISQs, polar questions were typically produced with an L* nuclear accent and a high-rising boundary tone (H-^H%), while wh-ISQs showed more variability in nuclear accent and in edge tone (see Table 1). For RQs, polar questions typically ended with a high plateau H-% (and less frequently in H-^H%), while wh-RQs were almost all falling. 
Certain tonal events occurred only in specific contexts (ISQs or RQs); for example, H-% was restricted to polar RQs, and L* or H+!H* only occurred in wh-ISQs (these illocution-type specific realizations are grey-shaded in Table 1).

Table 1

Most frequent tonal realizations and tonal realizations occurring more than 15% of the time, split by illocution type (ISQ, RQ) and question type (polar question, wh-question), based on Braun et al. (2019) and Zahner-Ritter et al. (2022). Shaded cells indicate realizations that differ between illocution types (within question type); empty cells indicate lack of alternative realizations with frequency > 15%.

Tonal cue                     Frequency      ISQ                            RQ
                                             polar question  wh-question    polar question  wh-question
prenuclear accent (1st word)  most frequent  no accent       no accent      no accent       no accent
                              > 15%          H*, L*+H        H*             H*, L*+H
nuclear accent on final noun  most frequent  L*              L+H*           (LH)*           (LH)*
                              > 15%                          L*, H+!H*      L*              L+H*
edge tone                     most frequent  H-^H%           L-%            H-%             L-%
                              > 15%                          L-H%, H-^H%    H-^H%

These general prosodic differences are also observed in spontaneous speech (Braun et al., 2020 for an analysis of a TV show) and semispontaneous productions (Dehé, Wochner et al., 2024). Dehé, Wochner et al. (2024), for instance, used the same contexts as Braun et al. (2019), but presented participants with sentence fragments rather than full interrogative sentences as stimuli to investigate the interplay of types of interrogatives and prosody. Sentence fragments for intended wh-questions were wh-word (who), verb, and object noun; fragments for intended polar questions were just verb and noun. Participants were asked to use these fragments to form full grammatical questions fitting the context, and they were allowed to add as much linguistic material as needed to do so. In this way, it was possible to study the choice of question type, and the occurrence of any additional lexical means in tandem with prosodic means. Participants generally produced more wh-questions in rhetorical contexts and more polar questions in information-seeking contexts. Results further showed that wh-syntax and discourse particles frequently occur in RQs, and that cues from different areas of the grammar were used in tandem, i.e., there was no reduction in or lack of prosodic means due to the presence of nonprosodic (lexical, syntactic) cues.

To derive predictions for perception (in our setting, the interpretation of question illocution) from production patterns, a good point of departure is to identify those prosodic realizations that are frequent in one illocution type (RQ or ISQ) but not in the other (compare the ISQ and RQ columns in Table 1). These realizations are contrastive in production. For wh-ISQs, this concerns prenuclear H*, nuclear L* or H+!H*, and the rising edge tones L-H% and H-^H%, which are listed for ISQs but do not show up in RQs. There were no contrastive tonal realizations for polar ISQs. For RQs, (LH)* is frequent in both question types but does not show up in ISQs and is thus contrastive; the same goes for H-% in polar RQs. Note the asymmetry between polar and wh-questions.
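The notion of a contrastive realization can be made explicit as a set difference over the Table 1 cells within each question type. The following is a minimal sketch: the cell contents are hand-coded according to our reading of Table 1, and all names ("pn", "nuc", "edge", the function name) are ours.

```python
# Table 1 realizations (most frequent plus alternatives > 15%),
# hand-coded from our reading of the table; "pn" = prenuclear accent,
# "nuc" = nuclear accent, "edge" = edge tone
table1 = {
    ("ISQ", "polar"): {"pn": {"no accent", "H*", "L*+H"},
                       "nuc": {"L*"},
                       "edge": {"H-^H%"}},
    ("ISQ", "wh"):    {"pn": {"no accent", "H*"},
                       "nuc": {"L+H*", "L*", "H+!H*"},
                       "edge": {"L-%", "L-H%", "H-^H%"}},
    ("RQ", "polar"):  {"pn": {"no accent", "H*", "L*+H"},
                       "nuc": {"(LH)*", "L*"},
                       "edge": {"H-%", "H-^H%"}},
    ("RQ", "wh"):     {"pn": {"no accent"},
                       "nuc": {"(LH)*", "L+H*"},
                       "edge": {"L-%"}},
}

def contrastive(illocution, question_type):
    """Realizations frequent for one illocution type but absent
    from the other, within the same question type."""
    other = "RQ" if illocution == "ISQ" else "ISQ"
    own = table1[(illocution, question_type)]
    them = table1[(other, question_type)]
    return {cue: own[cue] - them[cue] for cue in own}

print(contrastive("ISQ", "wh"))    # prenuclear H*, nuclear L*/H+!H*,
                                   # edge L-H%/H-^H%
print(contrastive("ISQ", "polar")) # empty: no contrastive realizations
```

Running the function on all four cells reproduces the asymmetry noted above: wh-ISQs have contrastive realizations at all three positions, polar ISQs have none, and RQs are set apart chiefly by nuclear (LH)* (plus H-% for polar questions).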

2.2. Perception data

As often noted in prosody research, prosodic cues serve multiple functions, and therefore production data only tell us part of the story. In fact, many of the cues in RQ-productions reported above are not specific to RQs, but occur in other contexts, as well. For example, H-% has been reported to signal incomplete utterances (von Essen, 1964), turn-keeping (Caspers, 1998), reluctance on the part of the speaker to give in to a demand (Niebuhr, 2013), stereotypical utterances (Grice et al., 2005), and questions with a negative bias (Kutscheid, 2024). The nuclear accent typical of RQs, (LH)*, has been shown to also mark aversion, negative attitude, and surprise (Wochner, 2022; Zahner-Ritter et al., 2022). As for the nontonal realizations, breathy voice and longer duration have been related to exasperated attitude (Schourup, 1985), and irony (Fünfgeld et al., 2024; Leykum, 2021; Niebuhr, 2014; Schmiedel, 2017). For this reason, perception data are needed to determine whether the prosodic means identified in production as cues to ISQ- or RQ-meaning can actually be interpreted as intended when all else is equal, i.e., with no linguistic or discourse context, no morphosyntactic cues, and no gestures available.

Kharaman et al. (2019) report a first attempt at testing the perceptual relevance of a subset of prosodic cues. They tested the weighting of these cues to decide whether a wh-question is intended as ISQ or RQ. Three prosodic cues were orthogonally varied: overall intonation contour, overall duration and voice quality of the object noun. In terms of intonation, two contours were presented: one with a high prenuclear accent (H*) on the wh-word, an early-peak accent (H+!H*) on the object noun, and a low edge tone (L-%) to signal an ISQ (see (2a); according to Table 1, this nuclear accent was not the most frequent one for ISQs, but it was specific in that it did not occur in RQs), and another contour without a prenuclear accent but with a rising-falling accent on the object noun, L*+H or (LH)*, followed by L-% to signal an RQ (see (2b)). Participants were given an example of an unambiguous ISQ (What time is it?) and of an unambiguous RQ (Who likes paying taxes?) and had to pick the intended illocution on a button box (or press another button for ‘other’ if the stimulus did not match either of the two). Results showed that almost all stimuli were categorized as ISQ or RQ (less than 5% of responses were assigned to ‘other’). Intonation had the strongest effect, while the effects of duration and voice quality were similar to each other, but smaller in magnitude than the effect of intonation. All effects were additive: wh-questions with the (2a)-contour (i.e., prenuclear H*, nuclear H+!H*, L-% edge tone) realized with short duration and modal voice quality were most often judged to be ISQs (> 90%). On the other hand, realizations with the (2b)-contour (i.e., rising-falling contour) produced with long duration and breathy voice quality were most often judged as RQs (> 90%). When one of the cues was changed, the proportion of the respective question interpretation (ISQ or RQ) decreased.
The most ambiguous stimuli, i.e., the stimuli for which identification rates were at chance, were those in which the intonational contour pointed towards one illocution type (e.g., ISQ for H+!H*) and the two nontonal prosodic cues (e.g., long duration and breathy voice quality) to the other illocution type (RQ). Questions in which intonation and one of the nontonal cues (duration or voice quality) pointed towards the same interpretation were recognized as intended with more than 75% probability. This relative cue weighting was also replicated in a multi-speaker version of Kharaman et al.’s (2019) study (Geib & Braun, 2022). The only differences were that all cue weights were a bit lower and there was a slight bias towards ISQ responses.

2.3. Desiderata and hypotheses

The prior production experiments identified several tonal and nontonal prosodic realizations that were frequent in one illocution type and not the other. A link between production and perception (Beddor, 2015; Diehl et al., 2004; R. A. Fox, 1982; Newman, 2003) would suggest that these are the cues that also influence perception. Yet evidence is missing: Prior perception studies (Kharaman et al., 2019; Geib & Braun, 2022) have only included a subset of the cues identified in the production studies (i.e., entire contour, voice quality, duration), mostly due to methodological limitations. Therefore, a more comprehensive paradigm is required (Q4). In particular, since only two intonational contours were presented, we do not know anything about the contributions of the individual parts of the intonational contours and it is possible that other, as yet unused tonal cues or combinations of tonal cues will have a stronger perceptual effect (Q1). The weighting between nontonal prosodic cues (duration, voice quality) with individual tonal cues (prenuclear accents, nuclear accents, edge tones) is also unclear (Q2), and so are potential interactions with question type (Q3). Answers to these questions are relevant to our understanding of how prosody contributes to meaning.

The current state of the art does not provide an answer to these research questions. To address them scientifically, a few modifications in experimental design are necessary. First, the intonational contour needs to be decomposed into individual tonal events, whose contribution to interpretation can consequently be tested independently (Braun & Asano, 2013; Chodroff & Cole, 2019; Grice et al., 2005; Pierrehumbert, 1980; Steffman et al., 2022). Such an approach demands an orthogonal manipulation of prenuclear and nuclear accent types as well as edge tones. Second, the nontonal cues duration and voice quality need to be fully crossed with the tonal cues, and third, polar questions must be added to the already tested wh-questions. This results in a large number of conditions, which are handled using Active Learning (AL), the basic procedure of which is introduced in the next section. In this paper, we consider the six binary or multilevel cues given in (3), as well as all their possible combinations, i.e., 768 experimental conditions (4 × 6 × 4 × 2 × 2 × 2).2

    (3) Binary and multilevel cues
        1. Prenuclear accent (four levels: unaccented, L*, H*, L*+H)
        2. Nuclear accent (six levels: H*, L*, L+H*, (LH)*, L*+H, H+!H*)2
        3. Edge tone (four levels: L-%, H-%, L-H%, H-^H%)
        4. Duration (two levels: short, long)
        5. Voice quality (two levels, manipulated on the first word: breathy voice, modal voice)
        6. Question type (two levels: polar question, wh-question)
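As a sanity check on the size of the design, the condition space in (3) can be enumerated directly. This is a minimal sketch; the level labels follow (3):

```python
from itertools import product

# Cue levels from (3)
prenuclear    = ["unaccented", "L*", "H*", "L*+H"]
nuclear       = ["H*", "L*", "L+H*", "(LH)*", "L*+H", "H+!H*"]
edge_tone     = ["L-%", "H-%", "L-H%", "H-^H%"]
duration      = ["short", "long"]
voice_quality = ["breathy", "modal"]
question_type = ["polar", "wh"]

# 4 x 6 x 4 tonal combinations on their own
tonal = list(product(prenuclear, nuclear, edge_tone))
print(len(tonal))       # 96

# Fully crossed with the nontonal cues and question type
conditions = list(product(prenuclear, nuclear, edge_tone,
                          duration, voice_quality, question_type))
print(len(conditions))  # 768
```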

If listeners use tunes instead of individual tonal events (Q1), we hypothesize that there are interactions between tonal events (ternary combinations of prenuclear accent, nuclear accent and edge tone, or binary combinations). If, however, individual tonal events are important for listeners, we expect strong weights for individual pitch accents or edge tones. If there is a strong link between production and perception (Q4), we predict the cues that are frequent and contrastive in production (grey-shaded in Table 1) to have large cue weights in perception. In particular, we predict nuclear (LH)* to increase RQ probability for both question types. We further hypothesize that the interpretation of other tonal events is dependent on question type (Q3: nuclear L* and H+!H* as well as L-H% and H-^H% for wh-ISQs, H-% for polar RQs). On the other hand, a more direct mapping from prosody to meaning predicts no interactions with question type (Q3). Based on prior perception work (Kharaman et al., 2019), nontonal prosodic cues are hypothesized to have a lower cue weight than tonal cues (Q2), but this hypothesis needs to be tested with a larger set of cues.

2.4. Active Learning

Active Learning (AL) is a machine learning approach that requests labels (e.g., ‘ISQ’ or ‘no ISQ’) for stimuli with different cues and derives weights for the cues of each stimulus (polar question, short duration, breathy voice, etc.; Settles, 2009). The weights are then generalized to other (unseen) stimuli with the same cues and cue combinations, and their labels are predicted. The system then selects the next stimulus based on a specific stimulus selection mechanism (e.g., uncertainty of the label).

This contributes to the optimization of the classifier’s performance (Settles, 2009), i.e., an improvement of its prediction accuracy. Across systems, the improvement of accuracy can be based on different criteria, such as error reduction (Settles, 2012), entropy (Vendrig, 2002) or classifier uncertainty (Smallest Margin; Wu et al., 2006). In the present study, we use the classifier’s uncertainty, which has been successfully pretested in simulation studies with main effects and interactions (Einfeldt et al., 2024). In short, at a given time, the system presents the condition whose classification is most uncertain (or a random choice among them if several cue combinations are equally uncertain for the model). Adding the new response to the already labeled data helps to improve the prediction accuracy. The underlying AL model can be of varying complexity, such as Association Rule Mining models, Linear Regression models, Extreme Gradient Boosting (XGBoost), or even Deep Learning models. In this paper we used XGBoost (see Appendix 3 for more details). Independent of the model’s complexity, the sampling strategies make the labelling process more efficient than working through hard-coded or randomized experimental lists and hence reduce the number of participants needed. This in turn allows us to test many more conditions than would be possible otherwise.
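The uncertainty-based selection loop described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the study used an XGBoost classifier, whereas here the model is left abstract as a function returning label probabilities, and all names are ours.

```python
import random

def smallest_margin(pool, predict_proba):
    """Return the condition for which the classifier is most uncertain,
    i.e., whose two class probabilities are closest together
    (smallest-margin sampling). Ties are broken at random."""
    def margin(x):
        p = predict_proba(x)            # P(label = positive class)
        return abs(p - (1 - p))         # 0 = maximally uncertain
    best = min(margin(x) for x in pool)
    return random.choice([x for x in pool if margin(x) == best])

def active_learning_loop(conditions, fit, get_label, n_trials):
    """Toy AL loop: retrain, select the most uncertain condition,
    present it, and record the (participant's) label."""
    labeled = []
    for _ in range(n_trials):
        predict_proba = fit(labeled)    # retrain on data labeled so far
        seen = {x for x, _ in labeled}
        pool = [c for c in conditions if c not in seen]
        stimulus = smallest_margin(pool, predict_proba)
        labeled.append((stimulus, get_label(stimulus)))
    return labeled
```

In the real experiments, `get_label` corresponds to a participant’s button press and `fit` to retraining the XGBoost classifier on all responses collected so far, across participants.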

For the current study, we used an AL model similar to that of Einfeldt et al. (2024). Given the large number of conditions, the AL model is trained across multiple participants in an online setting (web experiment). To avoid influence from inattentive participants, the experiments started with a grouping phase to determine whether participants were attentive in general and attentive to the nuclear tune in particular (the part of the contour typically associated with utterance meaning). For more information on the implementation see Appendix 3 and the code of our experiments in our GitHub repository (https://github.com/AL-perception-experiments/isq-rq).

3. Experiments

Experiment 1 tested the interpretation of questions as information-seeking or not, Experiment 2 tested the same for the interpretation of questions as rhetorical or not. The experiments were approved by the Institutional Review Board of the University of Konstanz (IRB 05/2021).

3.1. Methods

3.1.1. Materials

We created twelve target utterances (lexicalizations), each consisting of a verb, the discourse particle (PRT) denn and a noun (e.g., spielen + denn + Badminton, ‘to play + PRT + Badminton’). All nouns were trisyllabic with stress on the first syllable. The finite verb form (3rd person singular) of the respective verb was monosyllabic. The PRT denn was included for the sake of eurhythmy.3 Each target utterance was used as a polar question with the subject pronoun jemand ‘anyone’ and verb-first interrogative syntax (e.g., Spielt denn jemand Badminton?, ‘Plays PRT anyone badminton?’) and as a wh-question with wer ‘who’ as interrogative pronoun (e.g., Wer spielt denn Badminton?, ‘Who plays PRT badminton?’). The interrogatives were constructed from four different verbs (play, like, keep, eat) and twelve nouns, see Table 4 in Appendix 1. The interrogatives were pretested4 to make sure that they were suitable as either ISQ or RQ.

The stimuli were recorded by a phonetically trained, female native speaker of German. To keep prosodic realizations of the prenuclear regions consistent across items, the target sentences were split into a Part 1, consisting of the linguistic material before the noun (e.g., Spielt denn jemand, ‘Plays PRT anyone’) and a Part 2, consisting of the noun itself (e.g., Badminton), see Table 4 in Appendix 1. Each Part 1 was recorded in eight different ways: four prenuclear-accent conditions (none, L*, H*, L*+H on the first word) and two types of voice quality (breathy and modal on the first word). To avoid disruptions in the f0-contour between prenuclear and nuclear accents, each Part 1 was furthermore recorded in three different versions: (a) appropriate for the early-peak nuclear accent H+!H* (i.e., PRT ends in high pitch), (b) appropriate for the medial-peak accent H* (PRT ends in rising pitch), and (c) with a low-pitched PRT for the remaining nuclear accents starting with a low tonal target (L*, L+H*, L*+H and (LH)*). The nouns (Part 2) were produced with 24 different nuclear tunes: with six different nuclear accents (L*, H*, L*+H, L+H*, H+!H*, and (LH)*) and four different edge tones (L-%, L-H%, H-%, and H-^H%). Since the combinations (LH)* H-% and (LH)* H-^H% were difficult to produce naturally for our speaker without also changing some nontonal prosodic cues, we manipulated the natural recordings for these contours with regard to the exact alignment of the starting point and maximum of the rise for the pitch accent (LH)* according to the values in Braun et al. (2019). Parts 1 and 2 were then spliced together (at positive zero crossings in the waveform).

After splicing, each sentence was PSOLA-resynthesized to create a long and a short version of each item (cue ‘duration’), by lengthening or shortening its total duration by 10% of the original. The versions that were not manipulated for duration were not included in the experiment. In total, the whole procedure resulted in 9216 tokens (768 conditions × 12 items).

As practice items, we used three prosodic realizations of another item that was structurally identical to the experimental items (Wer mag denn Statuen?, ‘Who likes PRT statues?’). Furthermore, we used two catch items that were judged to be unambiguously information-seeking in the pretest described in Footnote 4 (Who wants spaghetti?, Who plays the trumpet?) and two that were interpreted as RQ (Who likes insects?, Who likes parades?); these were recorded naturally by the speaker with the intended illocution types.

3.1.2. Participants

For Experiment 1 (ISQ Experiment), data from 81 monolingual native speakers of German were used to train the AL system (average age = 24.3 years, SD = 3.5 years, 47 female, 34 male). In addition, 18 participants started the experiment but did not train the AL system because they were either bilingual (N = 5) or failed to correctly label at least three of the four catch items (N = 13); see Section 3.1.3 for details. For Experiment 2 (RQ Experiment), a total of 80 participants trained the AL system (54 female, 25 male, 1 not indicated, average age = 26.2 years, SD = 7.9 years). An additional 37 participants started the experiment but were not admitted to the AL phase, since they were either bilingual speakers (N = 32) or failed to label at least three catch items correctly (N = 5). Consequently, only attentive native speakers of German who had not learned any other language before the age of six trained the model.

3.1.3. Procedure

The experiments were conducted online, using a custom-made JavaScript program running on an in-house server at the University of Konstanz. At the beginning of each experiment, the participants gave informed consent and filled in a questionnaire on their language and demographic background. Next, they were given a definition of ISQs in Experiment 1 (‘the speaker is seeking an answer’) as well as an unambiguous example (see (1)). For Experiment 2, the definition of RQs was that an RQ is a question that does not demand an answer, because the answer is obvious or the speaker thinks the answer is obvious. The example given was Ist denn heut’ schon Weihnachten? ‘Is PRT today already Christmas?’, which suggests “no” as the obvious answer (this RQ was used in a frequently broadcast German commercial).

The instructions for the experiment were provided in audio-visual form (i.e., participants could read the instructions and listen to them). The audio presentation was used to allow participants to adjust their computer’s audio setting to a comfortable level of loudness. The experiment had three phases: a practice phase with three items (to familiarize participants with the task), a grouping phase with 28 trials (see below for details), and the actual experimental AL phase with 64 trials. Only the data from the AL phase were used to update the AL system. Figure 1 shows a schematic overview.

Figure 1

Flow chart of the procedure.

Each question was played only once. Participants pressed “J” for ‘information-seeking question’ and “F” for ‘non-information-seeking question’ in Experiment 1. In Experiment 2, they pressed “J” for ‘rhetorical question’ and “F” for ‘non-rhetorical question’. Each trial started with the visual presentation of a centered fixation cross and the simultaneous start of the audio. Participants had 10 seconds to respond; after that, they were reminded to respond. After a response was entered, the next trial started. There was an inter-trial interval of 500 ms after each response. Overall, the experiment took about 10 minutes. Participants could enter a raffle: one Amazon voucher worth 25€ was given away for every ten participants who entered.

The grouping phase consisted of 28 trials: one for each of the 24 nuclear tunes (with fixed values for all the other factors) plus the four catch items (two ISQs and two RQs). The order of all items was pseudo-randomized so that there was no repetition of lexicalization in consecutive trials. If a participant failed to classify at least three of the four catch items correctly, the experiment ended after the grouping phase for this participant.

Otherwise, the responses in the grouping phase determined which of two groups a participant was allocated to; the two groups trained separate AL systems. Group I included participants who were sensitive to the tonal combination in the nuclear tune; Group II included participants who were not. A participant was assigned to Group II if the proportion of one response key (“J” or “F”) exceeded 75% (i.e., little sensitivity to the tonal manipulation); otherwise, the participant was assigned to Group I. The threshold of 75% was arbitrary. In the AL phase, items were assigned to participants by the AL system (see Section 2.2); all combinations of cues could appear on any of the 12 lexicalizations.
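The two-group split can be sketched as follows (a minimal sketch; the function name and data layout are ours, not the experiment software's):

```python
def assign_group(responses, threshold=0.75):
    """Assign a participant to Group I or II from grouping-phase responses.

    responses: list of key presses ("J" or "F") from the tune trials of the
    grouping phase. If the proportion of either key exceeds the (arbitrary)
    75% threshold, the participant showed little sensitivity to the tonal
    manipulation and is assigned to Group II; otherwise Group I.
    """
    prop_j = responses.count("J") / len(responses)
    return "Group II" if max(prop_j, 1 - prop_j) > threshold else "Group I"
```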

3.2. Analyses

In Experiment 1 (ISQ Experiment), 62 of the 81 participants who entered the main AL phase (76.54%) were placed in Group I, and the remaining 19 (23.46%) in Group II. In Experiment 2 (RQ Experiment), again the majority of participants (N = 67, 83.75%) relied on the nuclear tune and were placed in Group I, while the minority (N = 13, 16.25%) formed Group II (see Table 5 in Appendix 2 for details). In our analyses, we focus on the larger Group I, which relied on the nuclear tune. There were too few participants to allow generalization of the patterns in Group II, but the analyses were still conducted and are provided in Table 7 in Appendix 6 for comparison.

The analyses were carried out in three stages (see Figure 2 for an outline): We first analyzed the responses using an extreme gradient boosting algorithm, XGBoost (Friedman, 2002), to test which cues and cue combinations improved the model’s accuracy; we then calculated a linear regression model to test whether the remaining cues and cue combinations significantly decreased or increased the likelihood of a given response. This reduced the initial set of 768 conditions to those that participants were sensitive to (see results for the number of remaining conditions). In stage 3, we tested the remaining hypotheses using F tests and t tests.

Figure 2

Flow chart of the analyses.

3.2.1. Stage 1. Boosting algorithm

The probability of a question being labelled as ISQ or not in Experiment 1 (or as RQ or not in Experiment 2) was estimated with XGBoost, a machine learning technique that has proven effective for high-dimensional classification problems like the current one.5 In general, a machine learning tool, rather than a logistic regression, was needed to predict the probabilities due to the high dimensionality of the data. For prediction we used exclusively binary factors. For instance, it was coded whether a given accent occurred (e.g., H*) or not (i.e., any accent other than H*). This resulted in 17 binary factors (see Appendix 4 for a full list). We further included three-way interactions between prenuclear accent, nuclear accent and edge tones, all possible two-way interactions between the tonal cues (to address Q1), and three-way interactions between question type, edge tone and nuclear accent (to address Q3). This procedure resulted in 246 possible explanatory variables6 for predicting ISQ or RQ probability.
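The binary coding and the interaction terms can be illustrated as follows (a sketch with made-up factor names; the paper's full list of 17 factors is in Appendix 4, and only the specific interaction sets described above entered the model):

```python
from itertools import combinations

# Hypothetical stimulus coding: each cue is a binary indicator (1 = present).
stimulus = {
    "nuclear_H*": 1,        # nuclear accent is H* (0 = any other accent)
    "prenuclear_H*": 0,
    "edge_L-H%": 1,
    "polar_question": 1,
}

def add_interactions(features, groups):
    """Add products of binary indicators for the listed interaction groups
    (e.g., two- and three-way interactions between the tonal cues)."""
    expanded = dict(features)
    for group in groups:
        for order in (2, 3):
            for combo in combinations(group, order):
                # an interaction is 1 only if all its component cues are 1
                expanded["&".join(combo)] = int(all(features[c] for c in combo))
    return expanded

X = add_interactions(
    stimulus,
    groups=[("prenuclear_H*", "nuclear_H*", "edge_L-H%"),       # tonal cues
            ("polar_question", "edge_L-H%", "nuclear_H*")],     # question type
)
```

Each expanded dictionary would form one row of the design matrix that XGBoost (and later the regression model) operates on.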

The boosting algorithm determines which cues and cue combinations improve the model’s accuracy. For a labelled item we can compare the label provided by the participant (y) with the label predicted by the model, ŷ = 1{p̂ > .5}, i.e., the model predicts the question to be an ISQ if the estimated ISQ probability p̂ is larger than .5. Accuracy measures the proportion of correct predictions; its complement counts the mistakes (ISQs labelled as non-ISQ and vice versa, and likewise for RQs). Every cue contributes to the machine learning model in a nonlinear way, and it can be calculated how much a particular cue or cue combination improved the model’s accuracy.

The boosting algorithm determines those cues that improve the prediction accuracy by more than 1% (see Figure 2 for outline of procedure).
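The prediction rule and the accuracy criterion can be written out as follows (a sketch; variable and function names are ours):

```python
import numpy as np

def predicted_label(p_hat):
    # y_hat = 1{p_hat > .5}: predict ISQ when the estimated ISQ probability
    # exceeds .5 (analogously for RQ in Experiment 2)
    return (np.asarray(p_hat) > 0.5).astype(int)

def accuracy(y, p_hat):
    # proportion of correctly reproduced labels; its complement counts the
    # mistakes (ISQs labelled non-ISQ and vice versa)
    return float(np.mean(predicted_label(p_hat) == np.asarray(y)))

def keep_cue(acc_with, acc_without):
    # a cue is retained if it improves prediction accuracy by more than 1%
    return acc_with - acc_without > 0.01
```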

3.2.2. Stage 2. Linear regression models

The linear regression model determines which of the cues and cue combinations selected by the boosting algorithm significantly improve the prediction of a question as ISQ or RQ, relative to a baseline (α = .01). The baseline (intercept) included those cues and cue combinations that occurred most often when the predicted probability was between 49% and 51% in both experiments, i.e., where participants were around chance level when labeling the respective stimulus. This concerned 23 uncertain questions (items) in Experiment 1 (ISQ Experiment) and 17 in Experiment 2 (RQ Experiment). In this set of 40 uncertain questions, the most frequent cues were wh-questions, breathy voice quality, long durations, no prenuclear accent, an (LH)* nuclear accent, and an H-% edge tone. These were chosen as baseline settings for the linear regression models. Note that these baseline settings are those we predicted to signal an RQ (Table 1). The regression models allow us to see whether there are interactions between tonal cues, to estimate the contribution of individual tonal cues relative to the baseline (Q1), to test whether there are effects of nontonal cues (Q2), and to test whether prosodic cues interact with question type (Q3).
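The baseline construction can be sketched like this (toy data; the thresholds follow the description above, the function name is ours):

```python
from collections import Counter

def baseline_cues(items, p_hat, low=0.49, high=0.51, n_cues=6):
    """Select baseline settings: the most frequent cue values among the
    'uncertain' items, i.e., those whose predicted probability lies
    between 49% and 51% (participants around chance level)."""
    uncertain = [cues for cues, p in zip(items, p_hat) if low <= p <= high]
    counts = Counter(cue for cues in uncertain for cue in cues)
    return [cue for cue, _ in counts.most_common(n_cues)]

# toy example: three items described by their cue values
items = [["wh-question", "breathy", "H-%"],
         ["wh-question", "modal", "H-%"],
         ["polar", "modal", "L-%"]]
p = [0.50, 0.49, 0.90]   # only the first two items are 'uncertain'
```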

3.2.3. Stage 3. Further hypothesis testing

To evaluate Q1, beyond statistical comparisons with the baseline, we tested effects of individual cues. To this end, we formulated appropriate null hypotheses, which were tested as linear restrictions by means of F tests. The F test compares the sum of squared residuals of a model with and without the restriction imposed by the null hypothesis. For instance, to test whether all nuclear accents contribute equally (Q1), we formulated the null hypothesis in (4):

    (4) β(H*) = β(L*+H) = β(L+H*) = β(H+!H*) = β(L*)

If this null hypothesis was rejected (suggesting that at least one of these nuclear accents has a stronger contribution than the others, α = .05), the individual inequalities were tested with one-sided Student’s t tests, which check whether a given inequality holds (e.g., whether the contribution of H* is larger than that of L*+H, using the null hypothesis β(H*) ≤ β(L*+H)).
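The restriction test in stage 3 can be sketched with NumPy (a generic textbook F test of linear restrictions R·β = r, not the authors' exact implementation):

```python
import numpy as np

def f_test_restriction(X, y, R, r):
    """F test of H0: R @ beta = r in the linear model y = X @ beta + error.

    Compares the residual sum of squares (RSS) of the unrestricted fit with
    the fit under the restriction, e.g., equal coefficients for all nuclear
    accents as in (4).
    """
    n, k = X.shape
    q = R.shape[0]                                  # number of restrictions
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # unrestricted estimate
    rss_u = float(np.sum((y - X @ beta) ** 2))
    # restricted least-squares estimate (standard formula)
    XtX_inv = np.linalg.inv(X.T @ X)
    adj = XtX_inv @ R.T @ np.linalg.solve(R @ XtX_inv @ R.T, R @ beta - r)
    beta_r = beta - adj
    rss_r = float(np.sum((y - X @ beta_r) ** 2))
    return ((rss_r - rss_u) / q) / (rss_u / (n - k))
```

If the F test rejects equality, one-sided t tests on the coefficient differences identify which of the inequalities hold.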

Regarding the relative weights of tonal and nontonal prosodic cues, the null hypothesis tested is shown in (5), where T stands for tonal cues and N for nontonal cues.

    (5) Σ_{j ∈ T, βj > 0} βj ≤ Σ_{i ∈ N, βi > 0} βi
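In code, the two sides of (5) are simply the sums of the positive estimated weights (the values below are illustrative, loosely based on Experiment 1's estimates):

```python
def positive_weight(weights):
    # sum of the positive coefficients on one side of (5)
    return sum(b for b in weights.values() if b > 0)

# Illustrative estimates (percentage points relative to the baseline):
tonal = {"nuclear H*": 23, "edge H-^H%": 15, "prenuclear H*": -11}
nontonal = {"modal voice quality": 8, "short duration": 12}
```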

3.3. Results and discussion for ISQs (Experiment 1)

Figure 3 shows the effects of those cues and cue combinations that were significant (see Table 6 in Appendix 5 for the full regression table). Positive values indicate a significant increase in ISQ probability relative to the baseline (24%); negative values, a significant decrease. The height of the bars indicates the percentage increase/decrease in responses contributed by each significant cue and cue combination. For instance, nuclear H* leads to an increase in ISQ probability of 23% relative to the baseline (first positive bar). In Figures 3 and 4, cue combinations (interactions) are marked by an “&” on the x axis. To distinguish prenuclear from nuclear accents, prenuclear accents are typed in lowercase characters, and nuclear accents in standard uppercase characters. The bars are ordered and colored by research question: Q1, interactions between tonal cues first, followed by main effects of tonal cues (white); Q2, nontonal cues (light blue); Q3, interactions with question type (grey). Other effects and interactions, which were not predicted, are ordered last (dark blue).

Figure 3

Results of AL regression estimates of Experiment 1 (ISQ Experiment) in percent. Prenuclear accents are indicated by lowercase letters (e.g., h* for prenuclear H*) to distinguish them from nuclear accents (which are in standard formatting). The numbers indicate the percent increase or decrease with respect to the baseline. Colors code the three research questions (Q1: white, Q2: light blue, Q3: grey). Further results are shown in dark blue.

Figure 4

Results of AL regression estimates of Experiment 2 (RQ Experiment) in percent. Prenuclear accents in lower-case letters, nuclear accents in upper case letters (Q1: white, Q2: light blue, Q3: grey).

We first discuss the results with respect to the hypotheses and then indicate the cues and cue combinations that result in the highest ISQ probability.

Regarding Q1 (tune vs. tones, white bars), there was one significant 3-way-interaction: an overall low contour with prenuclear and nuclear L* followed by L-% (first bar). However, it decreased the ISQ probability and thus cannot serve as a tune for ISQs. In fact, there was no combination of tonal events that increased ISQ probability, allowing us to reject the idea of an intonational tune for ISQs. Instead, individual tonal events matter more for interpretation.

What is the contribution of these individual tonal events? With regard to prenuclear accents, the regression model showed that prenuclear H* and prenuclear L*+H (2nd and 3rd bars) decreased the ISQ probability relative to “no prenuclear accent” in the baseline, but the negative effect of prenuclear H* was reversed when combined with modal voice quality, as shown by the last bar. Neither prenuclear L+H* nor prenuclear L* changed the ISQ probability relative to the baseline at α = .01. Regarding nuclear accents, the linear regression model showed that all accents increased the ISQ probability relative to (LH)* in the baseline, rendering (LH)* a truly non-information-seeking accent (Zahner-Ritter et al., 2022). Within the nuclear accents, we found no statistically significant difference between H* and H+!H* and none between H+!H* and L+H* (both p > .1) for wh-questions (in the baseline), i.e., statistically, these nuclear accents make the same positive contribution to ISQ probability in wh-questions. Furthermore, the linear regression model showed that for polar questions, H+!H* significantly decreased ISQ probability (negative grey bar in Figure 3). Finally, the effect of L+H* was larger than that of L*+H and L* (both p < .0001).

Regarding edge tones, the linear regression model showed that all edge tones increased the ISQ probability relative to H-% in the baseline. Beyond that, further hypothesis testing did not show a statistically significant difference between L-H% and H-^H% (p > .2), so both make the same positive contribution to ISQ probability. The effect of L-H% was larger than that of L-% (p = .02), so L-% is clearly less appropriate to mark an ISQ. Summing up the data relevant for Q1, there is no relevant tune for ISQs; only individual tones matter: The ISQ probability is increased by the combination of prenuclear H* and modal voice quality, by nuclear H* or L+H*, and by the edge tones H-^H% or L-H%. For wh-ISQs, but not for polar ISQs, H+!H* is as good a nuclear accent as H* and L+H*.

Regarding Q2, the relative weighting of tonal versus nontonal cues, the results of the linear regression model showed a significant increase in ISQ probability for modal voice quality and short duration (light blue bars). Further hypothesis testing showed that the tonal cues increased the ISQ probability more than the nontonal cues (p < .0001). Furthermore, the positive tonal cues (nuclear accents and edge tones) increased the ISQ probability more than the interaction between tonal and nontonal prosodic cues (i.e., between modal voice quality and prenuclear H*, p < .0001). These data suggest that nontonal cues are truly secondary in signalling ISQs.

Regarding Q3, there were a number of interactions with question type (grey bars). The linear regression model showed that L-H% increased the ISQ probability more for polar questions than for wh-questions (in the baseline), and nuclear H+!H* decreased the ISQ probability for polar questions. These interactions speak against a strong version of prosodic constructions for ISQs, because these would have to hold independent of question type.

In summary, the highest ISQ probability that can be achieved for wh-questions (in the baseline7) is 87%. The best cues and cue combinations are:

  • baseline (wh-question, breathy voice quality, long duration, no prenuclear accent, nuclear (LH)*, H-%): 24%

  • nuclear H*: +23%

  • H-^H%: +15%

  • combination of prenuclear H* (h*) and modal voice quality: +5% (–11% + 16%)

  • modal voice quality: +8%

  • short duration: +12%

The contribution of the nontonal prosodic cues is smaller than that of the tonal cues (20% for nontonal cues on their own, 25% when factoring in the interaction with prenuclear H*). Without the nontonal prosodic cues, the highest ISQ probability would hence be 25% lower, i.e., at 62%.
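The additivity of these weights on the percentage scale can be checked directly (numbers taken from the list above):

```python
baseline = 24
cue_weights = {
    "nuclear H*": 23,
    "H-^H%": 15,
    "prenuclear H* & modal voice": 5,   # net effect: -11% + 16%
    "modal voice quality": 8,
    "short duration": 12,
}
best_wh_isq = baseline + sum(cue_weights.values())   # highest wh-ISQ probability
nontonal_share = 8 + 12 + 5                          # nontonal cues plus interaction
without_nontonal = best_wh_isq - nontonal_share      # tonal cues only
```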

For polar questions, the cues are largely identical, but L-H% emerges as a stronger cue to ISQs, leading to the highest ISQ probability of 95%:

  • baseline (wh-question, breathy voice quality, long duration, no prenuclear accent, nuclear (LH)*, H-%): 24%

  • nuclear H*: +23%

  • L-H%: +13%

  • combination of polar question and L-H%: +10%

  • combination of prenuclear H* (h*) and modal voice quality: +5% (–11% + 16%)

  • modal voice quality: +8%

  • short duration: +12%

Note that there are some other options to signal an ISQ with a high probability: with no significant differences, nuclear H* can be changed to L+H* or to H+!H* (for wh-questions only). Also, H-^H% can be changed to L-H%. This optionality in signalling ISQs is another argument against assuming prosodic constructions for ISQs.

Regarding Q4, the possibilities and limitations of predicting perception from production, we see that many of the tonal predictions derived from production data were indeed confirmed by the perception data (modal voice quality, short duration, L-H% and H-^H%, in addition H+!H* for wh-questions). However, this inference from production to perception has its limitations. Our multi-cue study showed that in polar questions prenuclear H* is relevant for perception (though it was not contrastive in production, as shown by the lack of grey shading in Table 1). Nuclear H* was a relevant cue in both question types, but was neither frequent nor contrastive in production. The same is true for L-H% for polar questions. We will revisit this issue in the General Discussion (Section 4).

3.4. Results and discussion for RQs (Experiment 2)

Figure 4 shows the significant individual and combined cue weights for RQs. Since the baseline cues are largely RQ-like, most of the weights are negative.

Regarding Q1 (tune vs. tones, white bars), Figure 4 suggests more tonal interactions for RQs than for ISQs (first three bars): the combination of nuclear L* and L-% (first bar) seems to increase RQ probability (+14%), but this apparent increase is cancelled by the slightly larger negative contribution of nuclear L* (–16%), leaving no net increase. There is a further tonal interaction that decreased the RQ probability and therefore does not qualify as a tune (second bar). A net increase in RQ probability was only achieved by the combination of prenuclear L*+H and L-% (+11% for the interaction and an additional 6% for the prenuclear L*+H). This can qualify as a tune, but only for wh-RQs (for polar questions, L-% decreased the RQ probability by 16%, second bar from the right, in grey). While it may seem counter-intuitive to have a tune consisting of a prenuclear accent and an edge tone (prenuclear L*+H and L-%), this combination becomes a complete tune when the nuclear accent provided by the baseline, (LH)*, is taken into account. However, the generalizability of this tonal combination as an RQ tune is limited to wh-questions.

In terms of individual cues, the results of the linear regression model showed that any prenuclear accent increased the RQ probability. Further hypothesis testing showed no difference between the three prenuclear accents H*, L* and L*+H (p > .09). As indicated above, prenuclear L*+H in combination with an L-% edge tone increased the RQ probability even more strongly (but only for wh-questions). Any nuclear accent—except (LH)* in the baseline—decreased the RQ probability. As for edge tones, H-^H% significantly decreased the RQ probability relative to H-% in the baseline, while the other edge tones (L-%, L-H%) had a contribution similar to that of H-% for wh-questions (i.e., they did not improve the model fit beyond the H-% in the baseline). L-% and L-H% did decrease RQ probability for polar questions, though (see last two grey bars).

Regarding Q2, the nontonal prosodic cues affected the RQ interpretation. The light blue bars derived from the linear regression model show that modal voice quality and short duration significantly decreased RQ probability (–17% in combination). Further hypothesis testing showed that the tonal cues increased the RQ probability more than the nontonal cues duration and voice quality (p < .0001) and more than the interactions between tonal and nontonal cues (modal voice quality and prenuclear H*, p < .0001). This relative cue weighting between tonal and nontonal cues was the same as for ISQs, giving a lower weight to nontonal cues.

Regarding Q3, we see a number of interactions with question type (grey bars): the negative effect of nuclear L+H* was even more negative for polar questions than for wh-questions. Moreover, as hinted at above, the edge tones L-% and L-H% decreased RQ probability for polar questions, but not for wh-questions.

In sum, the highest RQ probability for wh-questions is 76%, which is achieved by the following cues:

  • baseline (wh-question, breathy voice quality, long duration, no prenuclear accent, nuclear (LH)*, H-%): 59%

  • combination of prenuclear L*+H and L-%: +11%

  • prenuclear L*+H (l*+H): +6%

For polar RQs, the highest probability is 70% only, achieved by:

  • baseline (wh-question, breathy voice quality, long duration, no prenuclear accent, nuclear (LH)*, H-%): 59%

  • prenuclear H* (h*): +11%

Without the nontonal cues, the highest RQ probability would be 17% lower and reach only 59% for wh-questions and 53% for polar questions. Numerically, the effects of the nontonal prosodic cues are smaller for RQs than for ISQs (17% for RQs compared to 25% for ISQs), but unlike for ISQs, the nontonal cues seem to make the difference between recognizing a polar question as RQ (> 50% probability) or not. The interactions with question type are more visible in RQs than in ISQs: While wh-RQs are best cued by prenuclear L*+H, nuclear (LH)* and L-%, polar RQs are best signalled by prenuclear H*, nuclear (LH)* and H-%.

Regarding the predictive power of production data for perception (Q4), the data look promising for RQs: breathy voice quality, long duration, nuclear (LH)*, as well as H-% for polar RQs were all predicted by the prosodic realizations in production (grey-shading in Table 1), i.e., they were frequent and contrastive in production. The current data show that they are also relevant in perception. Differences between production and perception are also found: for polar RQs, prenuclear L* was neither frequent nor contrastive in production but it is relevant for perception, the same is true for prenuclear L*+H for wh-RQs. Also, L-% is a relevant cue for wh-RQs in perception, although it was not contrastive in production (because it also occurred frequently in wh-ISQs in production).

4. General Discussion

This paper experimentally crossed tonal and nontonal prosodic cues and tested their roles for the interpretation of polar and wh-questions as information-seeking or rhetorical. The overarching research questions related to whether illocution type was signaled by tunes (combinations of tonal events) or individual tones (Q1), to the relative weighting between tonal cues versus nontonal cues (Q2) and to possible interactions between prosodic cues and question type (Q3). This allows us to further evaluate whether realizations that were frequent and contrastive in production are particularly relevant in perception (Q4).

The response to Q1 is at first glance different for ISQs and RQs: For ISQs, there is no evidence for tunes. Statistically, there was a three-way interaction, but the combination of tonal events decreased the ISQ probability. There were several nuclear accents (H*, L+H* for both question types, H+!H* for wh-questions only) that led to high ISQ probabilities with no significant differences. Likewise, there was no statistical difference between L-H% and H-^H% for wh-questions. The particular contribution of accents and edge tones may signal further pragmatic nuances, which participants in our experiments were not alerted to (e.g., incredulity or bias). It would be interesting to see whether an alternative instruction (e.g., which of these questions sound incredulous or interested?) might lead to a slightly different pattern that would allow us to distinguish further pragmatic nuances. For RQs, on the other hand, prenuclear L*+H, combined with nuclear (LH)* in the baseline, and L-% may form a tune, but only for wh-RQs. Polar RQs need an H-% edge tone to reach high probability. This dependence on question type (Q3) reduces the generalizability of such a tune, and we therefore do not gain much explanatory power by postulating a rhetorical question tune that is limited to wh-questions.

Regarding Q2, both ISQ and RQ probabilities were significantly affected by the nontonal cues duration and voice quality (in opposite ways, as expected: modal voice quality and short constituent durations increased ISQ probability, while breathy voice quality and long constituent durations increased RQ probability). Statistically and numerically, however, the contribution of nontonal cues was smaller than that of tonal cues (in line with Geib & Braun, 2022; Kharaman et al., 2019), see Table 2 for the numerically highest ISQ and RQ probabilities overall, and separate contributions for tonal cues (3rd row) and nontonal prosodic cues (5th row). The contribution of nontonal prosodic cues was numerically larger for ISQs than for RQs (25% vs. 17%) but a direct comparison is difficult because of the between-subject design.

Table 2

Summary of positive weights of tonal and nontonal prosodic cues for both illocution types and question types. Alternative accents and edge tones that do not differ significantly from the best ones are given in parentheses. Cells are split for question type only where necessary.

                        | ISQ (Experiment 1)                                      |                                                                        | RQ (Experiment 2)                  |
                        | Polar question (95%)                                    | Wh-question (87%)                                                      | Polar question (70%)               | Wh-question (76%)
Tonal cues              | prenuclear H*; nuclear H* (nuclear L+H*); L-H% (H-^H%)  | prenuclear H*; nuclear H* (nuclear L+H*, nuclear H+!H*); H-^H% (L-H%)  | prenuclear H*; nuclear (LH)*; H-%  | prenuclear L*+H; nuclear (LH)*; L-%
Weight tonal            | 70%                                                     | 62%                                                                    | 53%                                | 59%
Nontonal prosodic cues  | modal voice quality (in particular combined with prenuclear H*); short duration                                                 | breathy voice quality; long duration
  and interactions      |                                                                                                                                  |
Weight nontonal         | 25%                                                                                                                              | 17%

Table 2 also indicates that RQs demand different prenuclear accents and edge tones for the two question types: prenuclear H* and H-% led to the highest RQ probabilities for polar questions, prenuclear L*+H and L-% for wh-questions. For ISQs, we see an effect of question type in the alternative edge tones and nuclear accents (those that did not differ significantly from the optimal cue): L-H% was an alternative to H-^H%, but only for polar questions; nuclear H+!H* was an alternative to nuclear H* or L+H*, but only for wh-questions. These results suggest that the interface between prosody and semantics is mediated to some extent by question type (Q3). Interestingly, the asymmetry observed in more spontaneous production tasks, where speakers preferentially produced polar questions in information-seeking contexts and wh-questions in rhetorical question contexts (Dehé, Wochner et al., 2024), is replicated here: The highest ISQ probability is achieved with polar questions (95% vs. 87% for wh-ISQs), while the highest RQ probability is achieved with wh-questions (76% vs. 70% for polar RQs).

Regarding the production-perception link (Q4), we need to compare the cues that were relevant in perception with those that were frequent and contrastive in production (Table 1). To this end, we reproduced Table 1 as Table 3, but crossed out those realizations that were not relevant in perception (N = 13: six prenuclear realizations: four times no prenuclear accent, twice L*+H; four nuclear accents: three times L*, once L+H*; three edge tones: twice H-^H%, once L-%; see Table 3). These realizations would have been included in a perception procedure that only tested the frequent and contrastive realizations, even though they turn out not to be perceptually relevant. It is striking that nuclear L* and the high rising edge tone (H-^H%) were very frequent and contrastive in production, but often not relevant in perception. It is possible that the L* nuclear accent is easy to produce and therefore frequent, but not particularly salient as a perceptual cue. The rising edge tones in production may be a remnant from primary school, where pupils are often told to produce a rise when they see a “?”. Clearly, the perception experiments give less weight to the high-rising contours. Realizations that were relevant in perception but too infrequent in production (< 15%) are printed in boldface. These would likely be missed in a perception procedure that only includes frequent and contrastive cues. In our study, these unexpected perceptual cues were nuclear H*, nuclear L+H*, and L-H% in polar ISQs, nuclear H* in wh-ISQs, and prenuclear L*+H in wh-RQs. Unlike in production, two further cues were contrastive in perception: nuclear L+H* for wh-ISQs and L-% for wh-RQs. More naturalistic and implicit tasks may be necessary to resolve these production-perception mismatches. In any case, relevant cues may be overlooked when only cues that are frequent and contrastive in production are tested.

Table 3

Reproduction of Table 1 without the frequency column. Here, realizations from Table 1 that were not relevant in perception are crossed out (N = 13). Cues that were not frequent in production but relevant in perception are added in boldface (N = 5). Realizations in cells with dark grey shading strongly support the production-perception link (frequent and contrastive in production, relevant in perception, N = 7). Realizations in light-gray shaded cells were frequent in production (though not contrastive) and relevant in perception and give weaker support for the production-perception link, too (N = 4).

Tonal cue                     | ISQ polar question | ISQ wh-question      | RQ polar question    | RQ wh-question
prenuclear accent (1st word)  | no accent; H*      | no accent; L*+H, H*  | no accent; H*, L*+H  | no accent; L*+H
nuclear accent on final noun  | H*, L+H*, L*       | L+H*, H*, L*, H+!H*  | (LH)*, L*            | (LH)*, L+H*
edge tone                     | H-^H%, L-H%        | L-%, L-H%, H-^H%     | H-%                  | L-%, H-^H%

The results for ISQs show that a sequence of high tonal events (high prenuclear H* accent, high nuclear H* accent, rising H-^H% or L-H% edge tones) already leads to an ISQ probability of 62% (not adding nontonal prosodic cues, cf. Table 2). It may seem surprising, however, that this tonal combination yields the highest ISQ probability for both polar and wh-questions: After all, wh-questions are often assumed to have a low edge tone (Grice et al., 2005). Note, however, that there is considerable variation in edge tones in natural discourse (e.g., Kohler, 2004; Oppenrieder, 1988), which may explain why the high rising edge tone (H-^H%) received high ratings for wh-questions, too. More generally, our findings support the claim that questions need to have a high tonal event anywhere in the utterance (Herman, 1942; Ohala, 1983, 1984), but add that the combination of such high tonal events increases the ISQ probability for German. Analyses of Group II participants, not presented in detail in this paper, suggest that this generalization also holds for the participants that were not sensitive to nuclear tune in the grouping phase (see Table 7 in Appendix 6): nuclear H* and H-^H% increased ISQ probability (and decreased RQ probability).

The highest probabilities for RQs (70% for polar questions and 76% for wh-questions) were generally a bit lower than for ISQs (95% and 87%, respectively). The lower probability of RQs in comparison to ISQs suggests that the prosodic realization alone may not be a sufficient cue to RQ interpretation. Given the pragmatic felicity conditions for RQs (to make a point, to try to commit listeners to the proposition in the question) and the fact that RQs still function as questions (in that an answer is possible), participants may be more reluctant to interpret an utterance as rhetorical, especially when lacking discourse context.

For our specific research question, the cues to RQs and ISQs, there are a number of prosodic cues that had opposite effects in the two experiments, despite the fact that the two illocution types were not presented as alternative response options (i.e., the participants in Experiment 1 were not told about RQs and the participants in Experiment 2 were not told about ISQs). These cues, which we conclude are particularly relevant for signaling the contrast between ISQ and RQ, are:

  • Short duration: increased ISQ probability and decreased RQ probability.

  • Modal voice quality: increased ISQ probability and decreased RQ probability.

  • All accents except (LH)*: increased ISQ probability and decreased RQ probability.

  • Edge tone H-^H%: increased ISQ probability and decreased RQ probability.

The AL system proved useful for handling a large set of cue combinations and items. It updated the XGBoost model (a separate model for Group I and Group II in each experiment) after each participant completed the experimental phase. This iterative process allowed the system to efficiently learn the contributions of individual cues and their interactions to the perception of illocution type. Importantly, the AL system ensured a balanced exploration of the experimental condition space while minimizing redundancy in stimulus presentation. The implementation of this AL framework enabled the collection of sufficient data to evaluate the perceptual effects of a large number of prosodic and syntactic conditions while maintaining high predictive accuracy and minimizing participant burden. This methodology highlights the utility of AL for large-scale perceptual studies in linguistic research. To further validate the predictions of the AL system, it would be interesting to test how responses of (temporarily) inattentive participants (simulated through random labelling) may disrupt the overall cue weights of the AL system.

There are several directions for future research, most of which concern testing the generalizability of the results. First, the speaker we recorded had a comparatively high pitch register and large pitch range. In future research it will be interesting to test whether our findings generalize to speakers with different vocal characteristics and different phonetic implementations of pitch accents (with respect to scaling and pitch range), and to designs with tokens from multiple speakers. This will also allow us to further test the effect of voice quality, which may differ across speakers. The challenge is to find trained speakers who can produce the different combinations of cues naturally; resynthesis of duration and f0 works well with PSOLA, but resynthesis of voice quality is difficult (cf. Mehta & Quatieri, 2005 for isolated vowels). Second, the current experiments included two levels of constituent duration (lengthening and shortening of the entire utterance by 10%) and voice quality (breathy voice and modal voice on the initial constituent). In future research it may be useful to include more categories (or continuous acoustic measures of speaking rate or voice quality, cf. Ben Barsties v. Latoszek et al., 2020; Hillenbrand et al., 1994) for these cues to obtain a more fine-grained picture. Third, one may want to include syntactically longer and more varied utterances, e.g., by adding adverbials (e.g., Wer kauft denn in der Innenstadt Klamotten? ‘Who buys PRT in the city center clothes?’), which may help to shed more light on the role of prenuclear accents and potential interactions with information structure. Finally, to test the specificity of the RQ interpretation, it will be important to test questions with other connotations, such as biased questions or surprise questions, in a similar paradigm. One final desideratum is to use more implicit tasks that do not require explicit semantic judgements.

5. Conclusions

In two experiments, we tested participants’ weighting of tonal and nontonal prosodic cues for the interpretation of information-seeking questions (ISQs, Experiment 1) and rhetorical questions (RQs, Experiment 2). The stimuli varied in question type (polar question vs. wh-question), presence and/or type of the prenuclear accent on the first word, type of nuclear accent on the sentence-final noun, type of edge tone, voice quality on the first word (modal vs. breathy voice), and global constituent duration (short vs. long). This large number of cues was handled by an Active Learning system, which controlled stimulus order and updated cue weights. Our results showed that the interpretation of both ISQs and RQs is predicted better by individual tonal events than by tunes: For polar questions, ISQs were interpreted with highest accuracy with an H* prenuclear accent, a high nuclear accent H*, and a low-rising edge tone (L-H%), while RQs were best interpreted with an H* prenuclear accent, a steep-rising nuclear accent (LH)*, and a high plateau (H-%). For wh-questions, ISQ interpretation was highest with a prenuclear H*, a nuclear H* or L+H* (or H+!H* for wh-ISQs), and a final rise (L-H% or H-^H%), while RQ interpretation was highest with a prenuclear L*+H, a steep-rising nuclear accent (LH)*, and a low edge tone (L-%). Overall, ISQs reached higher probabilities (95% for polar ISQs, 87% for wh-ISQs) than RQs (70% for polar RQs, 76% for wh-RQs), and ISQs had more tonal options than RQs (i.e., RQs were tonally more specific than ISQs). The nuclear realizations generally differed as a function of question type. The only exception is nuclear (LH)*, which was generally interpreted as RQ, independent of question type. Regarding nontonal prosodic cues, modal voice quality and short durations increased ISQ probability and decreased RQ probability for both question types. These nontonal prosodic cues had on average lower cue weights than the tonal cues.
The data further showed that realizations that were frequent in production are not inevitably relevant in perception and, vice versa, that listeners may find cues relevant in perception even if they were not frequent in production. Given this discrepancy between the frequency of realizations in production and the relevance of cues in perception, we stress the importance of methodological diversity for studying the prosody-pragmatics interface. Apart from this methodological contribution, our research shows that the interpretation of prosodic cues is subject to question type and that nontonal prosodic cues are relevant, too. This raises further questions relating to the architecture of grammar, pertaining to additional (levels of) phonetic parameters, syntactic length and complexity, and more illocution types.

Appendix 1. Materials

Table 4

Verb-noun combinations for the creation of stimuli.

Verb: spielen ‘play’
  Part 1: Wer spielt denn ‘who plays PRT’ / Spielt denn jemand ‘plays PRT anyone’
  Part 2 (noun): Badminton ‘badminton’, Risiko ‘risk’

Verb: mögen ‘like’
  Part 1: Wer mag denn ‘who likes PRT’ / Mag denn jemand ‘likes PRT anyone’
  Part 2 (noun): Karneval ‘carnival’, Komiker ‘comedians’, Tombolas ‘tombolas’, Thymian ‘thyme’, Festivals ‘festivals’

Verb: halten ‘keep’
  Part 1: Wer hält denn ‘who keeps PRT’ / Hält denn jemand ‘keeps PRT anyone’
  Part 2 (noun): Kängurus ‘kangaroos’, Kolibris ‘hummingbirds’

Verb: essen ‘eat’
  Part 1: Wer isst denn ‘who eats PRT’ / Isst denn jemand ‘eats PRT anyone’
  Part 2 (noun): Kabeljau ‘codfish’, Kaviar ‘caviar’, Rucola ‘rocket’

Appendix 2. Metadata on participants

Table 5

Information on participants. Gender is indicated as male, female, and NA (not indicated). Grouping into Northern and Southern German participants was done according to the Benrath line, an isogloss separating Low German and High German dialects (based on the High German consonant shift), cf. division in Zahner-Ritter et al. (2022). States with fewer than four participants per group were collapsed into “other” (BW: Baden-Wuerttemberg, BY: Bavaria, RLP: Rhineland-Palatinate, NRW: North Rhine-Westphalia, Lower Sax: Lower Saxony, Hess: Hesse).

                                                Experiment 1              Experiment 2
                                                Group I      Group II     Group I      Group II
Total number of participants                    62           19           67           13
Gender (f/m/NA)                                 35/27/0      12/7/0       44/22/1      10/3/0
Average age (SD)                                24.2 (3.4)   24.7 (4.1)   26.2 (7.9)   25.5 (5.8)
Southern German participants (BW/BY/RLP/other)  25           10           52           12
                                                (18/6/0/1)   (5/5/0/0)    (39/7/4/2)   (8/1/1/2)
Participants from the north
(NRW/Lower Sax/Hess/other)                      37           9            15           1
                                                (28/5/4/0)   (6/2/0/1)    (7/2/4/2)    (1/0/0/0)
Bilinguals (not included in counts above)       5                         32
Participants that mislabeled catch items
(not included in counts above)                  13                        5

Appendix 3. Details on Active Learning

The Active Learning (AL) approach implemented in this study consists of three stages: the experimental data collection, the application of a machine learning model for prediction, and the uncertainty-based selection of stimuli for subsequent labelling. These stages are described in detail below.

The initial step involved a grouping phase to assess participants’ attentiveness and sensitivity to the prosodic manipulations. Participants were presented with a fixed set of 28 stimuli, including unambiguous examples, ambiguous examples, and 4 catch items. Based on their responses, participants were assigned to one of two groups: Group I included those who showed sensitivity to nuclear tunes, while Group II included participants whose responses indicated low sensitivity, as determined by their performance on catch items and consistency in labelling ambiguous items. The threshold for assignment to Group II was that the proportion of one response key (J or F) exceeded 75% (i.e., low sensitivity to the prosodic manipulation and therefore little attentiveness); otherwise, participants were assigned to Group I.
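The grouping rule described above can be sketched as follows; this is a minimal illustration of the 75% threshold, not the original implementation, and the function and variable names are ours.

```python
# Sketch of the grouping rule: a participant is assigned to Group II
# if one response key (J or F) accounts for more than 75% of their
# grouping-phase responses (low sensitivity), otherwise to Group I.
def assign_group(responses, threshold=0.75):
    """responses: list of response keys, e.g. ['J', 'F', 'J', ...]."""
    n = len(responses)
    prop_j = responses.count('J') / n
    prop_f = responses.count('F') / n
    # One key used on more than 75% of trials -> low sensitivity.
    return 'II' if max(prop_j, prop_f) > threshold else 'I'

print(assign_group(['J'] * 22 + ['F'] * 6))    # 22/28 ≈ 79% 'J' → 'II'
print(assign_group(['J'] * 14 + ['F'] * 14))   # balanced responses → 'I'
```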

In the experimental phase, the AL system was used to model the probability of each stimulus being classified as ISQ or RQ and to manage the presentation of the 768 experimental conditions. For estimating this probability, XGBoost was chosen due to its ability to handle high-dimensional data and its robustness in accounting for nonlinear interactions between predictors. In this study, the predictors included binary features representing the experimental cues, such as prenuclear accent, nuclear accent, edge tone, voice quality, duration, and question type. To capture interactions among these factors, the model also incorporated second-order interactions between all cues and specific three-way interactions (https://xgboost.readthedocs.io/en/stable/).
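The feature encoding described above can be illustrated with a short sketch: each stimulus becomes a vector of binary cue indicators plus explicit second-order interaction terms (products of cue pairs). The cue names below are illustrative shorthands for the cues in Appendix 4, not the original feature names.

```python
from itertools import combinations

# Encode one stimulus as binary cue features plus explicit two-way
# interaction terms (products of binary cue pairs), as in the XGBoost model.
def encode(stimulus, cue_names):
    base = {c: int(stimulus.get(c, 0)) for c in cue_names}
    feats = dict(base)
    # Second-order interactions: 1 only if both cues are present.
    for a, b in combinations(cue_names, 2):
        feats[f"{a}:{b}"] = base[a] * base[b]
    return feats

cues = ["polar", "modal", "short", "nuclear_H*", "edge_L-H%"]
x = encode({"polar": 1, "modal": 1, "nuclear_H*": 1}, cues)
print(x["polar:modal"])   # 1: both cues present in this stimulus
print(x["modal:short"])   # 0: 'short' absent
```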

XGBoost is a computationally efficient algorithm for classifying an item with a given set of cues as ISQ/non-ISQ or RQ/non-RQ. The classification model is a tree, which is split into leaves based on the cues. An example of a tree is presented in Figure 5. In this example, a polar question with an L-H% edge tone (Tree 0), a (LH)* nuclear accent (lower box in the second “column”), and modal voice quality (lower box in the third “column”) ends up in the bottom leaf; its probability of being an ISQ is determined as in (6).

Figure 5

An example of a classification tree (see text for explanation) as exported from XGBoost.

    (6) 1/(1 + e^1.125) = 0.25

Leaf cover is related to variable importance: a large value indicates an “important” leaf in terms of classification accuracy. In extreme gradient boosted tree algorithms (XGBoost), many such trees are built and the results are aggregated across trees. The split into leaves is decided based on the logistic loss function, which is also used in classical logistic regression. The advantage of XGBoost is not only its computational efficiency, but most importantly its ability to classify many experimental conditions (here: 768) based on a small number of labelled data points (24 when the experiment started), which is not possible with standard regression techniques. Therefore, XGBoost was implemented in real time in the AL system to decide which experimental item should be labelled next. During Experiment 1, we estimated the ISQ probability for all 768 conditions; the next item displayed to a participant was the one whose ISQ probability was closest to .5, or a random choice if several cue combinations were equally uncertain for the model.
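The mapping from a leaf score to a probability in (6) is the logistic function; the following minimal sketch reproduces the worked example, taking the leaf value 1.125 from (6).

```python
import math

# A leaf value is a log-odds score; equation (6) maps it to an ISQ
# probability via the logistic function 1 / (1 + e^x).
def leaf_to_probability(leaf_value):
    return 1.0 / (1.0 + math.exp(leaf_value))

p = leaf_to_probability(1.125)
print(round(p, 2))  # 0.25, matching (6)
```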

For each participant, the AL system dynamically selected the stimulus with the cue combination that was most uncertain for the model. This was implemented in JavaScript (https://github.com/AL-perception-experiments/isq-rq), where the uncertainty of all 768 experimental conditions was estimated with XGBoost (https://xgboost.readthedocs.io/en/stable/).
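The uncertainty-based selection step can be sketched as follows: among all conditions, pick the one whose predicted ISQ probability is closest to .5, breaking ties at random. The condition identifiers and probabilities below are made up for illustration.

```python
import random

# Uncertainty sampling: select the condition whose predicted probability
# is closest to .5; if several are equally uncertain, choose at random.
def select_next(probabilities, rng=random):
    """probabilities: dict mapping condition id -> predicted ISQ probability."""
    uncertainty = {c: abs(p - 0.5) for c, p in probabilities.items()}
    best = min(uncertainty.values())
    candidates = [c for c, u in uncertainty.items() if u == best]
    return rng.choice(candidates)

probs = {"cond_1": 0.92, "cond_2": 0.51, "cond_3": 0.12}
print(select_next(probs))  # cond_2: closest to .5
```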

Appendix 4. Binary factors

Lower case letters are used for prenuclear accents to avoid confusion with nuclear accents (printed in regular upper case letters), ¬ is the abbreviation for negation.

  1. Voice quality: (modal vs. breathy)

  2. Duration: (long vs. short)

  3. No prenuclear accent (vs. any)

  4. Prenuclear H* (vs. ¬H*)

  5. Prenuclear L* (vs. ¬L*)

  6. Prenuclear L*+H (vs. ¬L*+H)

  7. Nuclear H* (vs. ¬H*)

  8. Nuclear L*+H (vs. ¬L*+H)

  9. Nuclear L+H* (vs. ¬L+H*)

  10. Nuclear H+!H* (vs. ¬H+!H*)

  11. Nuclear (LH)* (vs. ¬(LH)*)

  12. Nuclear L* (vs. ¬L*)

  13. Edge tone L-% (vs. ¬L-%)

  14. Edge tone H-% (vs. ¬H-%)

  15. Edge tone L-H% (vs. ¬L-H%)

  16. Edge tone H-^H% (vs. ¬H-^H%)

  17. Question type: (polar vs. wh-question)

Appendix 5. Results for Group I participants

Table 6

Regression estimates, heteroscedasticity-robust standard errors, t and p values for the cues and cue combinations that improved the fit of the Active Learning model for Group I participants. Prenuclear accents are printed in lowercase characters to avoid confusion with nuclear accents (printed in upper case). Boldface indicates cues and cue combinations with p ≤ .01. The baseline (intercept) includes the cues: wh-question, breathy, long, no prenuclear accent, nuclear (LH)*, and H-%.

ISQ probability

                      Estimate   SE     t       p
(Intercept)           .24        .03    7.76    .00
l* & L* & L-%         –.28       .07    –3.9    .00
h*                    –.11       .03    –3.68   .00
H*                    .23        .02    9.31    .00
L*+H                  .12        .02    5.43    .00
L+H*                  .18        .02    7.81    .00
H+!H*                 .2         .03    7.57    .00
L*                    .12        .02    5.3     .00
L-%                   .08        .02    4.69    .00
L-H%                  .13        .03    5.01    .00
H-^H%                 .15        .03    5.81    .00
short                 .12        .03    4.58    .00
modal                 .08        .03    3.28    .00
polar & L-H%          .1         .04    2.85    .00
polar & H+!H*         –.15       .04    –4.16   .00
l*+h                  –.07       .02    –2.87   .00
modal & h*            .16        .03    4.73    .00
short & h*            –.03       .04    –.97    .33
short & l*+h          .04        .03    1.08    .28
l*                    –.04       .02    –2.02   .04
polar                 .04        .03    1.25    .21
polar & short         –.04       .03    –1.38   .17
polar & H-^H%         .06        .04    1.75    .08
polar & modal         .07        .03    2.38    .02

RQ probability

                      Estimate   SE     t       p
(Intercept)           .59        .03    23.57   .00
L* & L-%              .14        .04    3.69    .00
H-^H%                 .14        .02    5.78    .00
L*+H & H-^H%          .19        .05    3.76    .00
H*                    .17        .02    8.04    .00
L*+H                  .10        .02    4.13    .00
L+H*                  .09        .03    3.12    .00
H+!H*                 .14        .02    5.79    .00
L*                    .16        .03    5.15    .00
h*                    .11        .03    3.48    .00
l*                    .08        .02    4.16    .00
l*+h                  .06        .02    2.93    .00
l*+h & L-%            .11        .03    3.48    .00
modal                 –.15       .02    –5.95   .00
modal & short         .08        .03    2.86    .00
short                 .1         .02    4.39    .00
polar & L-%           .16        .03    5.32    .00
polar & L-H%          .21        .04    5.62    .00
polar & L+H*          .11        .04    2.92    .00
short & L*            –.07       .04    –1.76   .08
short & h*            –.03       .03    –.9     .37
L-%                   .01        .02    .63     .53
L-H%                  –.05       .03    –1.94   .05
l* & H+!H*            –.12       .05    –2.69   .01
polar                 .05        .03    1.81    .07
polar & short         .02        .03    .64     .52
polar & L*+H & H-%    .05        .04    1.27    .2
polar & h*            .01        .03    .18     .86
polar & modal         0          .03    .09     .93
modal & H-^H%         .04        .04    1.06    .29
modal & h*            –.07       .03    –2.21   .03

Appendix 6. Results for Group II participants

Table 7

Regression estimates, standard errors, t and p values for the cues and cue combinations that improved the fit of the Active Learning model for Group II participants. Prenuclear accents are printed in lowercase characters to avoid confusion with nuclear accents (printed in upper case). Boldface indicates cues and cue combinations with p ≤ .01. The baseline (intercept) includes the cues: wh-question, breathy, long, no prenuclear accent, nuclear (LH)*, and H-%.

Probability of ISQ

                      Estimate   SE     t       p
(Intercept)           .31        .04    6.92    .00
L* & L-%              –.21       .06    –3.37   .00
l*+h                  –.09       .03    –2.97   .00
H*                    .16        .04    4.07    .00
L*+H                  .18        .04    4.4     .00
L+H*                  .17        .04    4.34    .00
H+!H*                 .2         .04    5.29    .00
L*                    .24        .05    5.05    .00
polar                 .26        .03    7.84    .00
modal                 .15        .03    4.79    .00
modal & short         .15        .05    3.34    .00
H-^H%                 .09        .04    2.13    .03
polar & L-%           .11        .05    2.13    .03
h*                    –.07       .03    –2.04   .04
short & L*            –.11       .06    –1.87   .06
L-H%                  .06        .03    1.75    .08
short & L-%           –.08       .05    –1.66   .1
l*                    –.05       .03    –1.57   .12
L-%                   .04        .05    .82     .41
polar & H-^H%         –.01       .06    –.18    .85
short                 .01        .03    .17     .86

Probability of RQ

                      Estimate   SE     t       p
(Intercept)           .56        .05    12.31   .00
H*                    –.3        .04    –7.22   .00
L*+H                  –.34       .03    –10.7   .00
L+H*                  –.27       .04    –6.34   .00
H+!H*                 –.22       .04    –5.43   .00
L*                    –.24       .04    –5.4    .00
H-^H%                 –.18       .05    –3.43   .00
L-%                   .09        .04    2.46    .01
polar                 –.18       .03    –5.33   .00
polar & H-^H%         .13        .06    2.24    .03
modal                 –.06       .04    –1.62   .11
h* & L* & L-H%        .19        .13    1.52    .13
h*                    –.05       .04    –1.3    .19
short                 –.05       .04    –1.26   .21
modal & short         –.05       .05    –1.01   .31
l*+h                  –.03       .04    –.85    .4
L-H%                  .01        .04    .36     .72
l*                    0          .04    .01     .99

Notes

  1. Compared to the more familiar MAE-ToBI categories (Beckman et al., 2005; Beckman & Ayers, 1997), there are a few differences: in GToBI, the high rise is transcribed as H-^H% (equivalent to H-H% in MAE-ToBI) and the high plateau as H-% (equivalent to H-L% in MAE-ToBI). Furthermore, a low edge tone is L-% in GToBI (equivalent to MAE L-L%). [^]
  2. Note that H+!H* and H+L*, present in the original GToBI system (Grice et al., 2005), were collapsed into one early-peak accent (Rathcke & Harrington, 2006). Therefore only H+!H* is included here. [^]
  3. According to previous perception data, the presence of this particle does not influence participants’ interpretation of illocution type (Neitsch et al., 2018); denn is further considered the particle that is most neutral with regard to illocution type (Thurmair, 1991; Viesel & Freitag, 2019). [^]
  4. A pretest tested the suitability of the interrogatives for use as ISQs or RQs. To this end, twenty native speakers of German (13 female, 6 male, 1 not indicated, average age = 22.2 years, SD = 2.2 years) judged the predications of the objects (e.g., “liking festivals”) according to their own preferences. On average, 62% of the participants agreed with the predications (e.g., they liked festivals), 16% disagreed (e.g., they did not like festivals), 19% had a neutral attitude (e.g., they did not care about festivals, or liked some festivals but not others), and 3% reported being unfamiliar with the object (e.g., they did not know or had no experience with festivals). Hence, predications were true for some people and false for others, which qualifies them for use in both readings tested, ISQ and RQ. [^]
  5. There are many other machine learning algorithms which can deal with similar problems, but in terms of statistical stability and computational speed XGBoost has shown superior performance (Memon et al., 2019; Sahin, 2020). [^]
  6. For question type, there are 16 interactions with other cues; for voice quality, 15 interactions, as the interaction with question type has already been taken into account. There are thus 17 single cues and 16 + 15 + 14 two-way interactions involving question type, duration, and voice quality. The two-way interactions between a given prenuclear accent level and the edge tones (4 levels) and nuclear accents (6 levels) add 10 more interactions, which occurs 4 times for the 4 levels of prenuclear accent (40 in total). For a given edge tone level there are 6 possible interactions with the nuclear accents, and as there are 4 edge tone levels, these interactions add 24 more terms. Additionally, there are 6 × 4 three-way interactions between question type and all possible combinations of edge tone and nuclear accents. We then add three-way interactions between tonal cues: there are 4 levels of prenuclear accent, 4 levels of edge tone, and 6 levels of nuclear accent, resulting in 4 × 4 × 6 = 96 tonal three-way interactions, which adds up to 17 + 45 + 40 + 24 + 24 + 96 = 246. [^]
  7. In the list of cues, we list all baseline cues for the sake of completeness but cross out those that are overwritten by other cues or cue combinations. [^]
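The counting argument in Note 6 can be verified with a short script; the variable names are ours, chosen only to mirror the steps of the note.

```python
# Verifying the feature count from Note 6.
single = 17                          # single binary cues
qtype_dur_vq_two_way = 16 + 15 + 14  # two-way: question type, duration, voice quality
prenuclear_pairs = 4 * (4 + 6)       # 4 prenuclear levels × (4 edge tones + 6 nuclear accents)
edge_nuclear_pairs = 4 * 6           # edge tone × nuclear accent pairs
qtype_three_way = 6 * 4              # question type × (edge tone × nuclear accent)
tonal_three_way = 4 * 4 * 6          # prenuclear × edge tone × nuclear accent
total = (single + qtype_dur_vq_two_way + prenuclear_pairs
         + edge_nuclear_pairs + qtype_three_way + tonal_three_way)
print(total)  # 246
```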

Acknowledgements

This research was supported by the German research foundation (grants BR 3428/4–2 to Bettina Braun, DE 876/3–2 to Nicole Dehé, and KE 740/17-2 to Daniel Keim) as part of the DFG Research Unit 2111 “Questions at the Interfaces”. We particularly thank our speaker for recording the stimuli as well as Justin Hofenbitzer, Friederike Hohl, Meike Rommel, Jasmin Pöhnlein, Johanna Schnell, Elena Schweizer, Daniela Wochner and Tianyi Zhao for help with annotation and preparation of the files. Furthermore, we thank the audience of the presentation at PaPE in Nijmegen 2023, the audience of the final workshop of the Research Unit (Konstanz, 2023), Ida Toivonen, Christoph Schwarze, María Biezma, Maribel Romero and the Bielefeld and Frankfurt Phonetics groups for feedback.

Competing interests

The authors have no competing interests to declare.

Author contributions

BB and ND: Initial conception; ME, AJ, KZR: preparation of stimuli; EK: statistical analyses; RS: Programming of Active Learning system; All: writing of the paper.

References

Altmann, H. (1984). Linguistische Aspekte der Intonation am Beispiel Satzmodus [Linguistic Aspects of Intonation using the Example of Sentence Mode]. Forschungsberichte des Instituts für Phonetik und sprachliche Kommunikation der Universität München (FIPKM), 19, 132–152.

Baumann, S. (2006). The Intonation of Givenness—Evidence from German (Vol. 508). Niemeyer.  http://doi.org/10.1515/9783110921205

Baumann, S., Grice, M., & Benzmüller, R. (2001). GToBI – a phonological system for the transcription of German intonation. In S. Puppel & G. Demenko (Eds.), Prosody 2000: Speech recognition and synthesis (pp. 21–28). Adam Mickiewicz University.

Beddor, P. S. (2015). The Relation between Language Users’ Perception and Production Repertoires. International Congress of the Phonetic Sciences, Glasgow, UK.

Ben Barsties v. Latoszek, Lehnert, B., & Ben Janotte. (2020). Validation of the acoustic voice quality index version 03.01 and acoustic breathiness index in German. Journal of Voice, 34(1), 157.e17–157.e25.  http://doi.org/10.1016/j.jvoice.2018.07.026

Biezma, M., & Rawlins, K. (2017). Rhetorical questions: Severing asking from questioning. In D. Burgdorf, J. Collard, S. Maspong, & B. Stefánsdóttir (Eds.), Proceedings of SALT 27 (pp. 302–322).  http://doi.org/10.3765/salt.v27i0.4155

Bishop, J. (2012). Focus, prosody, and individual differences in “autistic” traits: Evidence from cross-modal semantic priming. UCLA Working Papers in Phonetics, 111, 1–26.

Braun, B., & Asano, Y. (2013). Double contrast is signalled by prenuclear and nuclear accent types alone, not by f0-plateaux. Proceedings of the 14th Annual Conference of the International Speech Communication Association. Lyon, France.  http://doi.org/10.21437/Interspeech.2013-80

Braun, B., Asano, Y., & Dehé, N. (2018). When (not) to look for contrastive alternatives: The role of pitch accent type and additive particles. Language and Speech, 62(4), 751–778.  http://doi.org/10.1177/0023830918814279

Braun, B., Dehé, N., Neitsch, J., Wochner, D., & Zahner, K. (2019). The prosody of rhetorical and information-seeking questions in German. Language and Speech, 62(4), 779–807.  http://doi.org/10.1177/0023830918816351

Braun, B., Einfeldt, M., Esposito, G., & Dehé, N. (2020). The prosodic realization of rhetorical and information-seeking questions in German spontaneous speech. Proceedings of the 10th International Conference on Speech Prosody.  http://doi.org/10.21437/SpeechProsody.2020-70

Büring, D. (1997). The Meaning of Topic and Focus: The 59th Street Bridge Accent. Routledge.

Caspers, J. (1998). Who’s next? The melodic marking of questions vs. continuation in Dutch. Language and Speech, 41(3–4), 375–398.  http://doi.org/10.1177/002383099804100407

Chodroff, E., & Cole, J. S. (2019). Testing the distinctiveness of intonational tunes: Evidence from imitative productions in American English. Proceedings of Interspeech 2019, 1966–1970.  http://doi.org/10.21437/Interspeech.2019-2684

Cruttenden, A. (1986). Intonation. Cambridge University Press.

Cruttenden, A. (1994). Rises in English. In Studies in General and English Phonetics. Routledge.

Dehé, N., Braun, B., Einfeldt, M., Wochner, D., & Zahner-Ritter, K. (2022). The prosody of rhetorical questions: A cross-linguistic view. Linguistische Berichte, 269, 3–42.  http://doi.org/10.46771/9783967691757_1

Dehé, N., Braun, B., Einfeldt, M., Wochner, D., & Zahner-Ritter, K. (2024). The prosody of rhetorical questions: A cross-linguistic view. Linguistische Berichte, Sonderhefte 35, 103–148.

Dehé, N., Wochner, D., & Einfeldt, M. (2024). The interaction of discourse markers and prosody in rhetorical questions in German. Journal of Linguistics, 60, 103–127.  http://doi.org/10.1017/S0022226722000299

Diehl, R. L., Lotto, A. J., & Holt, L. L. (2004). Speech Perception. Annual Review of Psychology, 55(1), 149–179.  http://doi.org/10.1146/annurev.psych.55.090902.142028

Einfeldt, M., Sevastjanova, R., Zahner-Ritter, K., Kazak, E., & Braun, B. (2024). The use of Active Learning systems for stimulus selection and response modelling in perception experiments. Computer Speech & Language, 83, 101537.  http://doi.org/10.1016/j.csl.2023.101537

Fox, A. (1984). German Intonation. Clarendon Press.

Fox, R. A. (1982). Individual Variation in the Perception of Vowels: Implications for a Perception-Production Link. Phonetica, 39(1), 1–22.  http://doi.org/10.1159/000261647

Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378.  http://doi.org/10.1016/S0167-9473(01)00065-2

Fünfgeld, S., Braun, A., & Zahner-Ritter, K. (2024). Intonational patterns of verbal irony. A cross-varietal study on two German regional accents. Proceedings of the 12th International Conference on Speech Prosody.  http://doi.org/10.21437/SpeechProsody.2024-185

Geib, L., & Braun, B. (2022). Influence of speaker characteristics on the interpretation of rhetorical questions. Proceedings Phonetik Und Phonologie Im Deutschsprachigen Raum, Bielefeld, Germany.

Geluykens, R. (1987). Intonation and speech act type: An experimental approach to rising intonation in queclaratives. Journal of Pragmatics, 11, 483–494.  http://doi.org/10.1016/0378-2166(87)90091-9

Geluykens, R. (1988). On the myth of rising intonation in polar questions. Journal of Pragmatics, 12, 467–485.  http://doi.org/10.1016/0378-2166(88)90006-9

Gobl, C., Bennett, E., & Chasaide, A. N. (2002). Expressive synthesis: How crucial is voice quality? Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002, 91–94.  http://doi.org/10.1109/WSS.2002.1224380

Gras, P., & Elvira-García, W. (2021). The role of intonation in Construction Grammar: On prosodic constructions. Journal of Pragmatics, 180, 232–247.  http://doi.org/10.1016/j.pragma.2021.05.010

Grice, M., Baumann, S., & Benzmüller, R. (2005). German intonation in autosegmental-metrical phonology. In J. Sun-Ah (Ed.), Prosodic Typology. The Phonology of Intonation and Phrasing (pp. 55–83). Oxford University Press.  http://doi.org/10.1093/acprof:oso/9780199249633.003.0003

Grice, M., Ritter, S., Niemann, H., & Roettger, T. B. (2017). Integrating the discreteness and continuity of intonational categories. Journal of Phonetics, 64, 90–107.  http://doi.org/10.1016/j.wocn.2017.03.003

Han, C.-H. (2002). Interpreting interrogatives as rhetorical questions. Lingua, 112, 201–229.  http://doi.org/10.1016/S0024-3841(01)00044-4

Herman, E. (1942). Probleme der Frage [Problems of questions]. Vandenhoeck & Ruprecht.

Hillenbrand, J., Cleveland, R. A., & Erickson, R. L. (1994). Acoustic correlates of breathy voice quality. Journal of Speech and Hearing Research, 37(as), 769–778.  http://doi.org/10.1044/jshr.3704.769

Isačenko, A., & Schädlich, H. (1966). Untersuchungen über die deutsche Satzintonation [Investigations on the German sentence intonation]. Deutsche Akademie der Wissenschaften zu Berlin.

Kharaman, M., Xu, M., Eulitz, C., & Braun, B. (2019). The processing of prosodic cues to rhetorical question interpretation: Psycholinguistic and neurolinguistics evidence. Proceedings of Interspeech.  http://doi.org/10.21437/Interspeech.2019-2528

Kohler, K. J. (2004). Pragmatic and attitudinal meanings of pitch patterns in German syntactically marked questions. In G. Fant, H. Fujisaki, J. Cao, & Y. Xu (Eds.), From traditional phonology to modern speech processing—Festschrift für Professor Wu Zongji’s 95th Birthday (pp. 205–215). Foreign Language Teaching and Research Press.

Kutscheid, S. (2024). Bouletic bias in German questions: Evidence from production and perception [PhD Thesis, University of Konstanz]. https://kops.uni-konstanz.de/entities/publication/16343bb6-c9b6-4f8e-b047-cefa8bbaad0e

Leykum, H. (2021). Voice quality in verbal irony: Electroglottographic analyses of ironic utterances in Standard Austrian German. Proceedings of Interspeech 2021, 991–995.  http://doi.org/10.21437/Interspeech.2021-452

Mehta, D., & Quatieri, T. F. (2005). Synthesis, analysis, and pitch modification of the breathy vowel. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005, 199–202.  http://doi.org/10.1109/ASPAA.2005.1540204

Memon, N., Patel, S. B., & Patel, D. P. (2019). Comparative Analysis of Artificial Neural Network and XGBoost Algorithm for PolSAR Image Classification. In B. Deka, P. Maji, S. Mitra, D. K. Bhattacharyya, P. K. Bora, & S. K. Pal (Eds.), Pattern Recognition and Machine Intelligence (Vol. 11941, pp. 452–460). Springer International Publishing.  http://doi.org/10.1007/978-3-030-34869-4_49

Mozziconacci, S. (1995). Pitch variations and emotions in speech. Proceedings of the 13th International Congress of Phonetic Sciences. Stockholm, Sweden. pp. 178–181.

Neitsch, J., Braun, B., & Dehé, N. (2018). The role of prosody for the interpretation of rhetorical questions in German. Proceedings of the 9th International Conference on Speech Prosody.  http://doi.org/10.21437/SpeechProsody.2018-39

Neitsch, J., & Niebuhr, O. (2019). Questions as prosodic configurations: How prosody and context shape the multiparametric acoustic nature of rhetorical questions in German. Proceedings of the 19th International Congress of Phonetic Sciences, 2425–2429.

Newman, R. S. (2003). Using links between speech perception and speech production to evaluate different acoustic metrics: A preliminary report. The Journal of the Acoustical Society of America, 113(5), 2850–2860.  http://doi.org/10.1121/1.1567280

Niebuhr, O. (2013). Resistance is futile—The intonation between continuation rise and calling contour in German. Proceedings of Interspeech 2013, 225–229.  http://doi.org/10.21437/Interspeech.2013-72

Niebuhr, O. (2014). “A little more ironic” Voice quality and segmental reduction differences between sarcastic and neutral utterances. Proceedings of 7th International Conference on Speech Prosody.  http://doi.org/10.21437/SpeechProsody.2014-110

Nolan, F. (2020). Intonation. In B. Aarts, A. McMahon, & L. Hinrichs (Eds.), The Handbook of English Linguistics (1st ed., pp. 385–405). Wiley.  http://doi.org/10.1002/9781119540618.ch20

O’Connor, J. D., & Arnold, G. F. (1973). Intonation of Colloquial English. Longman.

Ogden, R. (2010). Prosodic constructions in making complaints. In D. Barth-Weingarten, E. Reber, & M. Selting (Eds.), Prosody in Interaction (pp. 81–104). John Benjamins.  http://doi.org/10.1075/sidag.23.10ogd

Ohala, J. J. (1983). Cross-language use of pitch: An Ethological view. Phonetica, 40, 1–18.  http://doi.org/10.1159/000261678

Ohala, J. J. (1984). An ethological perspective on common cross-language utilization of f0 in voice. Phonetica, 41, 1–16.  http://doi.org/10.1159/000261706

Oppenrieder, W. (1988). Intonatorische Kennzeichnung von Satzmodi [Intonational Marking of Sentence Types]. In H. Altmann (Ed.), Intonationsforschungen (pp. 169–206). Niemeyer.  http://doi.org/10.1515/9783111358413.169

Petrone, C., & D’Imperio, M. (2011). From tones to tunes: Effects of the f0 prenuclear region in the perception of Neapolitan statements and questions. In S. Frota, G. Elordieta, & P. Prieto (Eds.), Prosodic Categories: Production, Perception and Comprehension (pp. 207–230). Springer.  http://doi.org/10.1007/978-94-007-0137-3_9

Petrone, C., & Niebuhr, O. (2014). On the intonation of German intonation questions: The role of the prenuclear region. Language and Speech, 57(1), 108–146.  http://doi.org/10.1177/0023830913495651

Petrone, C., Truckenbrodt, H., & Wellmann, C. (2017). Prosodic boundary cues in German: Evidence from the production and perception of bracketed lists. Journal of Phonetics, 61.  http://doi.org/10.1016/j.wocn.2017.01.002

Pierrehumbert, J. B. (1980). The Phonology and Phonetics of English Intonation [PhD Thesis, MIT]. https://dspace.mit.edu/handle/1721.1/16065

Pierrehumbert, J. B., & Hirschberg, J. (1990). The meaning of intonational contours in the interpretation of discourse. In P. R. Cohen, J. Morgan, & M. E. Pollack (Eds.), Intentions in Communication (pp. 271–311). MIT Press.  http://doi.org/10.7551/mitpress/3839.003.0016

Rathcke, T., & Harrington, J. (2006). Is there a distinction between H+!H* and H+L* in standard German? Evidence from an acoustic and auditory analysis. Proceedings of the 3rd International Conference on Speech Prosody, 783–786.  http://doi.org/10.21437/SpeechProsody.2006-164

Roessig, S. (2024). The inverse relation of pre-nuclear and nuclear prominences in German. Laboratory Phonology, 15(1).  http://doi.org/10.16995/labphon.9993

Sahin, E. K. (2020). Assessing the predictive capability of ensemble tree methods for landslide susceptibility mapping using XGBoost, gradient boosting machine, and random forest. SN Applied Sciences, 2(7), 1308.  http://doi.org/10.1007/s42452-020-3060-1

Schmiedel, A. (2017). Phonetik ironischer Sprechweise: Produktion und Perzeption sarkastisch ironischer und freundlich ironischer Äußerungen [Phonetics of ironic tone of voice: production and perception of sarcastic ironic and friendly ironic utterances]. Frank & Timme GmbH.

Schourup, L. C. (1985). Common discourse particles in English conversation. Routledge.

Settles, B. (2009). Active learning literature survey. Computer Sciences Technical Report, University of Wisconsin-Madison.

Settles, B. (2012). Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers.  http://doi.org/10.1007/978-3-031-01560-1

Snedeker, J., & Trueswell, J. (2002). Using prosody to avoid ambiguity: Effects of speaker awareness and referential context. Journal of Memory and Language, 48, 103–130.  http://doi.org/10.1016/S0749-596X(02)00519-3

Steffman, J., Shattuck-Hufnagel, S., & Cole, J. (2022). The rise and fall of American English pitch accents: Evidence from an imitation study of rising nuclear tunes. Proceedings of the 11th International Conference on Speech Prosody, 857–861.  http://doi.org/10.21437/SpeechProsody.2022-174

Thurmair, M. (1991). Zum Gebrauch der Modalpartikel “denn” in Fragesätzen. Eine korpusbasierte Untersuchung [On the usage of the modal particle “denn” in interrogative sentences. A corpus-based study]. Niemeyer.  http://doi.org/10.1515/9783111353166.377

Vendrig, D. H. V. (2002). TREC Feature Extraction by Active Learning. 11th Text Retrieval Conference (TREC).  http://doi.org/10.6028/NIST.SP.500-251.video-amsterdam_isis

Viesel, Y., & Freitag, C. (2019). Wer kann denn schon ja sagen?: Natural and experimental data on German discourse particles in rhetorical questions. Zeitschrift für Sprachwissenschaft, 38(2), 243–298.  http://doi.org/10.1515/zfs-2019-2003

von Essen, O. (1964). Grundzüge der Hochdeutschen Satzintonation [Main features of standard German sentence intonation]. Henn Verlag.

Ward, G., & Hirschberg, J. (1985). Implicating uncertainty: The pragmatics of the fall-rise intonation. Language, 61, 747–776.  http://doi.org/10.2307/414489

Ward, N. G. (2019). Prosodic Patterns in English Conversation. Cambridge University Press.  http://doi.org/10.1017/9781316848265

Watson, D., Tanenhaus, M. K., & Gunlogson, C. A. (2008). Interpreting pitch accents in online comprehension: H* vs. L+H*. Cognitive Science, 32(7), 1232–1244.  http://doi.org/10.1080/03640210802138755

Weber, A., Grice, M., & Crocker, M. W. (2006). The role of prosody in the interpretation of structural ambiguities: A study of anticipatory eye movements. Cognition, 99, B64–B72.  http://doi.org/10.1016/j.cognition.2005.07.001

Wells, J. C. (2006). English intonation: An introduction. Cambridge University Press.

Westera, M., Goodhue, D., & Gussenhoven, C. (2020). Meanings of Tones and Tunes. In C. Gussenhoven & A. Chen (Eds.), The Oxford Handbook of Language Prosody (pp. 442–453). Oxford University Press.  http://doi.org/10.1093/oxfordhb/9780198832232.013.29

Wochner, D. (2022). Prosody meets pragmatics: A comparison of rhetorical questions, information-seeking questions, exclamatives, and assertions [PhD Thesis, University of Konstanz]. http://nbn-resolving.de/urn:nbn:de:bsz:352-2-1pmcoqzp28veb8

Wu, Y., Kozintsev, I., Bouguet, J.-Y., & Dulong, C. (2006). Sampling strategies for active learning in personal photo retrieval. IEEE International Conference on Multimedia and Expo, ICME 2006.  http://doi.org/10.1109/ICME.2006.262442

Zahner-Ritter, K., Einfeldt, M., Wochner, D., James, A., Dehé, N., & Braun, B. (2022). Three kinds of rising-falling contours in German wh-questions: Evidence from form and function. Frontiers in Communication, 7.  http://doi.org/10.3389/fcomm.2022.838955

Zimmermann, M., & Onea, E. (2011). Focus marking and focus interpretation. Lingua, 121(11), 1651–1670.  http://doi.org/10.1016/j.lingua.2011.06.002