Speaking clearly improves speech segmentation by statistical learning under optimal listening conditions

This study investigated the effect of speaking style on speech segmentation by statistical learning under optimal and adverse listening conditions. Similar to the intelligibility and memory benefits found in previous studies, enhanced acoustic-phonetic cues of the listener-oriented clear speech could improve speech segmentation by statistical learning compared to conversational speech. Yet, it could not be precluded that hyper-articulated clear speech, reported to have less pervasive coarticulation, would result in worse segmentation than conversational speech. We tested these predictions using an artificial language learning paradigm. Listeners who acquired English before age six were played continuous repetitions of the ‘words’ of an artificial language, spoken either clearly or conversationally and presented either in quiet or in noise at a signal-to-noise ratio of +3 or 0 dB SPL. Next, they recognized the artificial words in a two-alternative forced-choice test. Results supported the prediction that clear speech facilitates segmentation by statistical learning more than conversational speech but only in the quiet listening condition. This suggests that listeners can use clear speech acoustic-phonetic enhancements to guide speech processing dependent on domain-general, signal-independent statistical computations. However, there was no clear speech benefit in noise at either signal-to-noise ratio. We discuss possible mechanisms that could explain these results.


Introduction
An important task in understanding spoken language is the segmentation of continuous speech into discrete words. Adult listeners are able to discover word boundaries and extract possible word forms via statistical learning, that is, by tracking transitional probabilities across syllables in the speech stream. For example, two syllables that frequently co-occur are likely to be perceived as word-internal, whereas two syllables with a low co-occurrence are interpreted as spanning a word boundary. This form of learning may occur without attention to the speech signal (Fernandes, Kolinsky, & Ventura, 2010; Saffran, Newport, Aslin, Tunick, & Barrueco, 1997) and has been observed in infants (Thiessen & Saffran, 2003). Such learning occurs in the visual modality as well (Kirkham, Slemmer, & Johnson, 2002; Turk-Browne, Jungé, & Scholl, 2005), suggesting that speech segmentation by statistical learning is rooted in a domain-general learning mechanism.
Listeners also exploit a variety of sub-lexical acoustic-phonetic cues for segmentation (for reviews, see Cutler, 2012; Davis, Marslen-Wilson, & Gaskell, 2002). One subsegmental signal-driven cue that listeners rely on is coarticulation, or the articulatory [...] also evidence that clear speech reduces lexical competition and enhances lexical access. Scarborough and Zellou (2013), for example, showed that words with many phonological neighbors are produced with greater hyper-articulation in clear speech compared to words with few phonological neighbors and this facilitated lexical decisions for high-neighborhood-density words. Van Engen (2017) similarly found that clear speech was helpful in reducing lexical competition for high-neighborhood-density words in noise for older adults. Finally, using a visual word recognition paradigm, van der Feest, Blanco, and Smiljanic (2019) showed that clear speech increased speed of word recognition for high-predictability sentences in quiet and in noise for young adult listeners. While none of these studies looked at speech segmentation, the documented processing benefits suggest that clear speech could also make it easier for the listener to track statistical dependencies and improve segmentation.
Work by Palmer and Mattys (2016) provides evidence that at least some of the acoustic-phonetic adjustments typically associated with clear speech are beneficial for segmentation by statistical learning. Using the artificial language learning paradigm, they showed that listeners' statistical segmentation of trisyllabic nonwords (e.g., pabiku) from speech streams in which they were presented continuously was facilitated when the speech rate was slower than when it was faster. As they argue, this may be because a slower rate allows more time for representations stored in working memory to be refreshed before they are displaced by incoming input. Given that clear speech is characterized by a slower speaking rate (Behrman, Ferguson, Akhund, & Moeyaert, 2019; Ferguson & Kewley-Port, 2002; Liu, Del Rio, Bradlow, & Zeng, 2004; Picheny et al., 1986; Smiljanić & Bradlow, 2005; Uchanski, Choi, Braida, Reed, & Durlach, 1996), Palmer and Mattys's finding supports the prediction that it will have an advantage over conversational speech in segmentation by statistical learning.
With regard to coarticulation and speaking style adaptations, the findings are mixed. Under the H&H framework, coarticulation is viewed as a low-cost articulatory behavior (Farnetani & Recasens, 2010). Therefore, similar to other hyper-articulated forms such as strengthening in prosodically strong positions (Cho, 2004), more exaggerated listener-oriented speech is expected to show coarticulatory resistance. Moon and Lindblom (1994) found that vowels in the /w___l/ frame produced in a hyper-articulated style by English speakers showed greater F2 displacements relative to those produced without attention to clarity, suggesting less coarticulatory influence from the neighboring consonants. Analogous findings have been obtained by studies comparing citation-form and spontaneous speech in French, Swedish, Spanish, and Catalan (Duez, 1992; Krull, 1989; Poch-Olivé et al., 1989). In contrast, Bradlow (2002) found that for English and Spanish speakers, clearly produced CV syllables exhibited neither exaggerated nor reduced coarticulation compared to those that were conversationally produced. Similar maintenance of coarticulation across speaking styles was reported by Matthies et al. (2001). Furthermore, unlike the above-referenced studies, in which speakers followed instructions for different styles and might not speak with a true communicative intent, Scarborough and Zellou (2013) investigated anticipatory nasal coarticulation in vowel-nasal sequences under several communicative contexts, including one in which speakers completed an interactive task with a real listener. They found that compared to read speech elicited with instructions to speak clearly or imagine someone hard of hearing, spontaneous speech in this real-listener context showed greater hyper-articulation as evident in vowel space expansion and greater vowel-nasal coarticulation.
Nevertheless, while the presence of the listener in Scarborough and Zellou's (2013) interactive task introduced a communicative intent, it did not imply a communicative difficulty that would explicitly motivate the speakers to adopt a more effortful, listener-oriented speaking style. Recently, Guo and Smiljanić (to appear) applied a whole-spectrum measure of coarticulation (Cychosz, Edwards, Munson, & Johnson, 2019; Gerosa et al., 2006) to analyze speech from the various communicative conditions of the LUCID corpus, which include read speech elicited with instructions to speak clearly, spontaneous speech from an interactive task where pairs of speakers communicated in good listening conditions, and spontaneous speech from the same task but with explicitly imposed communicative barriers (e.g., one speaker's voice was vocoded or mixed with talker babble). Results indicated that the speakers coarticulated less when completing the task under any condition with communicative barriers compared to the no-barrier condition. Moreover, compared to read speech obtained with instructions to speak casually, read clear speech showed a similar extent of coarticulatory resistance as spontaneous speech in most of the barrier-present conditions. Taken together, the findings on the relationship between coarticulation and speaking style, though still mixed, suggest that clear speech could be less coarticulated than conversational speech. If this is the case, the prediction is that the relatively greater coarticulation in conversational speech may facilitate word segmentation by statistical learning (cf. Fernandes et al., 2007, 2010) compared to clear speech with relatively less coarticulation. This prediction gains additional support from research investigating the effect of speech rate on statistical learning.
Unlike Palmer and Mattys (2016), Emberson, Conway, and Christiansen (2011) found that when listeners were presented with speech streams consisting of nonwords (e.g., meep) that formed statistically coherent triplets, speeding up presentation rate facilitated their discovery of which nonwords constituted a triplet. This suggests that reducing the temporal distances between elements in speech promotes auditory grouping of the elements into larger units. Since conversational speech is characterized by a faster speaking rate and hence reduced temporal distances between syllables, it could lead to better grouping of syllables into discrete units than clear speech. Given the contradictory predictions, the main goal of this study was to assess the role of speaking style on segmentation by statistical learning.
We furthermore wanted to explore whether speaking style variation affects statistical learning under adverse listening conditions, specifically in noise. It is well-documented that clear speech enhances word recognition in noise (e.g., Ferguson & Kewley-Port, 2007; Gilbert, Chandrasekaran, & Smiljanic, 2014; Liu et al., 2004; Picheny et al., 1986; Smiljanić & Bradlow, 2005) and this benefit may become larger as the signal-to-noise ratio (SNR) decreases (Bradlow et al., 2003; Bradlow & Bent, 2002; Payton et al., 1994). Work by Fernandes and colleagues showed that the contribution of coarticulation to segmentation by statistical learning is largest in quiet and becomes smaller or even absent in noise. If noise masks the beneficial coarticulatory cues, and particularly if coarticulation is more pervasive in conversational speech, segmentation may be better for clear compared to conversational speech even if a different speaking style effect is found in quiet.
However, noise may increase cognitive load and affect speech segmentation beyond covering the beneficial acoustic-phonetic cues (Francis, Love, & Boutin, 2019; McCoy et al., 2005; Peelle, 2018; Pichora-Fuller et al., 2016; Rabbitt, 1968, 1991; Rönnberg et al., 2013; Rönnberg, Rudner, Foo, & Lunner, 2008; Schneider, Bernarding, Francis, Hornsby, & Strauss, 2019; Van Engen & Peelle, 2014; Winn, Edwards, & Litovsky, 2015; Zekveld, Kramer, & Festen, 2010, 2011). Palmer and Mattys (2016) showed that the benefit of decreased speaking rate for segmentation by statistical learning was reduced in a dual-task paradigm, where listeners completed a concurrent task while performing speech segmentation. It made no difference whether the concurrent task involved phonological processing or non-linguistic visual processing. This led them to conclude that the benefit of a slower speech rate is supported by domain-general central processing resources within working memory which are taxed more when performing a dual-task than a single task. Similar to the effect of a dual task, increased cognitive load and listening effort due to signal degradation by noise may result in fewer resources available for speech segmentation (e.g., Miles, Jones, & Madden, 1991; Pichora-Fuller, Schneider, & Daneman, 1995; Rönnberg, Rudner, Lunner, & Zekveld, 2010; Zekveld, Kramer, & Festen, 2011). The increased processing load in noise would affect speech segmentation of both speaking styles equally though the adverse effect could be larger for clear speech as any benefit of its slower speaking rate would be reduced (cf. Palmer & Mattys, 2016). These findings suggest that the clear speech benefit for segmentation in quiet, if any, will be reduced or absent in noise. The second goal of the present study was thus to examine the role of speaking style adaptations in segmentation by statistical learning in noise and to gain a better understanding of the processes underlying successful segmentation.
We addressed the above questions using an artificial language learning experiment. Such an experiment uses nonsense speech material and therefore prevents listeners from relying on lexical knowledge to parse the speech signal. This was important for this investigation as we wanted to draw listeners' attention to the signal-dependent acoustic-phonetic cues and the relatively signal-independent statistical properties and away from the lexical cues (Mattys & Bortfeld, 2017; Mattys, White, & Melhorn, 2005). Similar to previous artificial language learning studies, our experiment consisted of a learning phase and a subsequent test phase. During the learning phase, listeners heard long speech streams in which syllables co-occurred with varying probability, giving rise to the artificial language 'words.' Next, they identified the words in a two-alternative (word versus partword) forced-choice test. Higher recognition accuracy in the test is assumed to reflect better segmentation performance during learning. To examine the effect of speaking style, listeners heard either a clear or a conversational version of the artificial language. To examine the effect of the listening condition, the learning-phase speech streams were either presented in quiet or masked with noise at increasing levels of difficulty: +3 dB SPL SNR and 0 dB SPL SNR.

Participants
One hundred and eighty speakers of American English (97 female, 79 male, two nonbinary, and two who declined to identify gender) participated in the study, equally and randomly assigned to the six conditions (see below). They self-reported to have started learning English before the age of six and their mean age was 20.4 years (range: 18-46). All participants signed written informed consent and filled out a detailed language background questionnaire adapted from the LEAP-Q questionnaire (Marian, Blumenfeld, & Kaushanskaya, 2007). The questionnaire indicated that 51 participants were functionally bilingual as they had also learned one or more languages other than English before the age of six and self-reported to be fluent in those languages. However, the data of the monolingual and bilingual participants were collapsed since there was no evidence that the two groups performed differently in the experiment. All participants passed a pure-tone screening test bilaterally at 25 dB hearing level for 1000, 2000, and 4000 Hz, and received course credit or a small honorarium. Four participants who failed the hearing screening were excluded and four additional participants were recruited for a total of 30 participants in each test condition.

Stimuli
Three vowels (/a, i, u/) and six consonants (/p, t, k, m, n, l/) were combined to form six trisyllabic CVCVCV sequences, which served as the 'words' of the artificial language: /pakila/, /timani/, /kutupi/, /mikamu/, /nuluta/, and /lipuna/. The three peripheral vowels were chosen to allow us to estimate the effect of speaking style on vowel space area. The consonants all occur in English. Another six 'partwords' (/pinulu/, /tamika/, /kilati/, /mutima/, /nakutu/, and /lalipu/) were created as three-syllable substrings across 'word' boundaries (e.g., /kutupi/ + /nuluta/ = /pinulu/) and used as distractors in the test phase. A 21-year-old female speaker who acquired English from birth and who had some phonetic training recorded the stimuli in a sound-attenuating room. Each target word or partword was written in the International Phonetic Alphabet (e.g., /pakila/), embedded in a carrier sentence (i.e., The word I said was ____), and presented on a PowerPoint slide. The recording consisted of two parts. In the first part, the speaker was instructed to read the target-bearing carrier sentences aloud in a conversational speaking style, "as if she was talking to her friends or someone familiar with her voice." Additionally, she was instructed not to reduce any vowels. In the second part, which followed after a 10-minute break, the speaker read the same sentences in the clear speaking style "as if she were talking to someone who cannot follow her conversationally or someone with hearing loss." Similar instructions have been used in previous studies to elicit clear and conversational speaking styles (see Pichora-Fuller, Goy, & Van Lieshout, 2010 and Smiljanić & Bradlow, 2009). The recordings were made with a MOTU UltraLite-MK3 Hybrid recorder and a Shure SM10A head-mounted microphone, digitized at a sampling rate of 44.1 kHz, and saved in WAV format.
The 24 stimuli (six words and six partwords for each style) were excised from the carrier sentences and manipulated with Praat (Boersma & Weenink, 2018). First, their F0 contours were flattened to the mean F0 for each style, calculated by averaging the F0 values across all the voiced intervals in the words and partwords. This rendered the learning-phase speech streams (described below) monotonous. The goal was to prevent listeners from using F0 contours for word segmentation (Endress & Hauser, 2010; Shukla et al., 2007), as F0 was not the focus of this study, and to prevent ceiling performance on the artificial language learning task. The flattened F0 contour was at 232 Hz for the clear stimuli and at 210 Hz for the conversational ones, consistent with the finding that clear speech has higher overall F0 (Bradlow, Kraus, & Hayes, 2003). The stimuli were equalized for the average RMS amplitude. For each speaking style, the six resynthesized words were concatenated without pauses to form six long speech streams for the learning phase. Each stream contained 20 repetitions of each word with the repetitions occurring pseudo-randomly such that the same word did not follow itself. Thus, listeners would be exposed to 120 repetitions of each artificial language word. The total duration of the six speech streams was about 7.6 minutes for the clear speaking style and 5.9 minutes for the conversational speaking style. Successful segmentation during the learning phase depends on tracking statistical regularities, typically expressed as transitional probabilities between syllables. Computing these probabilities, we found that the average between-syllable transitional probability was one for all the words and ranged between 0.55 and 0.60 (mean: 0.58) for the partwords in each speech stream.
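To make the transitional-probability computation concrete, the following Python sketch builds one pseudo-random stream of the six words (20 repetitions each, no word following itself) and estimates syllable-to-syllable transitional probabilities. It assumes the common definition TP(X→Y) = frequency(XY) / frequency(X); the function names, the random seed, and the stream-generation procedure are illustrative assumptions and not the procedure used to build the actual stimuli.

```python
import random
from collections import Counter

WORDS = ["pakila", "timani", "kutupi", "mikamu", "nuluta", "lipuna"]

def syllabify(word):
    """Split a CVCVCV string into its three CV syllables."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]

def make_stream(words, reps_per_word, seed=0):
    """Order the word tokens pseudo-randomly so that no word follows itself."""
    rng = random.Random(seed)
    while True:  # retry if only immediate repetitions remain at the end
        remaining = Counter({w: reps_per_word for w in words})
        order, prev = [], None
        while sum(remaining.values()) > 0:
            candidates = [w for w in words if remaining[w] > 0 and w != prev]
            if not candidates:
                break
            weights = [remaining[w] for w in candidates]
            prev = rng.choices(candidates, weights=weights)[0]
            remaining[prev] -= 1
            order.append(prev)
        if len(order) == reps_per_word * len(words):
            return order

def transitional_probabilities(syllables):
    """Assumed definition: TP(x -> y) = count(x followed by y) / count(x)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

stream = make_stream(WORDS, reps_per_word=20)
syllables = [s for w in stream for s in syllabify(w)]
tp = transitional_probabilities(syllables)

# Transitions inside a word versus transitions that span a word boundary.
within_pairs = {p for w in WORDS for p in zip(syllabify(w), syllabify(w)[1:])}
within = [tp[p] for p in within_pairs]
boundary = [v for p, v in tp.items() if p not in within_pairs]

print("within-word TPs:", set(within))
print("mean boundary TP:", sum(boundary) / len(boundary))
```

Because every syllable occurs in only one word, within-word transitions are fully predictable, whereas a word-final syllable can be followed by the first syllable of any of the other five words. A partword combines one transition of each kind, which is why its average transitional probability falls well below one.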
Two noise-masked versions of the six speech streams were created for each speaking style by mixing them with a Gaussian noise shaped to match the long-term average spectrum of the words. In each style, two noise conditions were created with an increasing level of difficulty: +3 and 0 dB SPL SNR. These SNRs were selected based on previous literature examining clear speech intelligibility benefit (Smiljanić, 2021; Smiljanić & Bradlow, 2009) and pilot testing. All the speech streams were faded in and out with five-second logarithmic ramps. There were six conditions in total, resulting from a 2 × 3 between-subjects design with style (clear and conversational) and listening condition (quiet, +3 dB SNR, and 0 dB SNR) as independent factors.
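The general logic of mixing speech with noise at a target SNR can be sketched as follows. This is a minimal illustration only: it assumes that SNR is defined on RMS amplitudes, it uses a sine tone and unshaped white Gaussian noise as stand-ins for the recordings and the speech-shaped noise, and the function names `rms` and `mix_at_snr` are hypothetical.

```python
import math
import random

def rms(signal):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 20*log10(rms(speech)/rms(noise)) equals snr_db, then add."""
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled

rng = random.Random(1)
sr = 16000  # assumed sampling rate for this toy example
# Stand-ins for real audio: a 220 Hz tone as "speech" and white Gaussian noise.
speech = [0.5 * math.sin(2 * math.pi * 220 * t / sr) for t in range(sr)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sr)]

mixed, scaled_noise = mix_at_snr(speech, noise, snr_db=3.0)
achieved = 20 * math.log10(rms(speech) / rms(scaled_noise))
print(round(achieved, 6))
```

Lowering `snr_db` from 3 to 0 raises the noise gain until the noise has the same RMS amplitude as the speech, which corresponds to the more difficult of the two noise conditions.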
The test phase consisted of a two-alternative forced-choice test. In each trial, two stimuli (a word of the artificial language and a partword) were presented with a 500-millisecond interstimulus interval. There were 36 trials, formed by all possible pairings of the six words with the six partwords. The order of the two stimuli in a trial (i.e., partword first or partword second) was counterbalanced. All the test-phase stimuli were presented in quiet. The test lasted approximately eight minutes and contained the same speaking style as the learning phase to prevent responses based on acoustic matching. E-prime 2.0 (Psychology Software Tools, 2012) controlled response recording and stimulus presentation.
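The structure of the test can be illustrated with a short Python sketch that crosses the six words with the six partwords and balances presentation order. The specific counterbalancing scheme (word first in exactly half of the trials, followed by a random shuffle) and the name `build_trials` are assumptions made for illustration; the source does not specify how the orders were assigned.

```python
import random
from itertools import product

WORDS = ["pakila", "timani", "kutupi", "mikamu", "nuluta", "lipuna"]
PARTWORDS = ["pinulu", "tamika", "kilati", "mutima", "nakutu", "lalipu"]

def build_trials(words, partwords, seed=0):
    """Cross every word with every partword; the word comes first in half the trials."""
    rng = random.Random(seed)
    pairs = list(product(words, partwords))
    rng.shuffle(pairs)
    trials = []
    for i, (word, partword) in enumerate(pairs):
        word_first = i % 2 == 0  # assumed counterbalancing rule
        first, second = (word, partword) if word_first else (partword, word)
        trials.append({"first": first, "second": second,
                       "correct": 1 if word_first else 2})
    rng.shuffle(trials)  # randomize presentation order
    return trials

trials = build_trials(WORDS, PARTWORDS)
print(len(trials), sum(t["correct"] == 1 for t in trials))
```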

Procedure
Tested individually in a sound-attenuated room, participants first listened to the six speech streams of the artificial language, presented via Sennheiser HD570 headphones. They were not given any information about the artificial language and were instructed to listen to the recordings carefully and pay as much attention as they could. They were made aware of a subsequent test and those assigned to the +3 dB and 0 dB SNR conditions were additionally warned about the noise in the speech streams. Each participant only heard the six speech streams from one style and one listening condition (e.g., clear speech in quiet or conversational speech in +3 dB SNR). After the learning phase, participants immediately proceeded to the test consisting of words and partwords in the same speaking style. They were told that they would hear two stimuli in each trial and asked to press the button labeled "1" on a response box if they thought the first stimulus was a word of the artificial language and the button labeled "2" if they thought the second one was. They had five seconds to respond after the second stimulus was presented and the next trial would be automatically initiated after they made a response or timed out. Accuracy and reaction times (RTs) were calculated for each response.

Acoustic analysis
Acoustic analyses were performed on the artificial language words to confirm that the speaker produced two distinct speaking styles. Segment boundaries were located by searching for acoustic landmarks such as nasal murmurs and stop closures in the spectrogram and waveform in Praat. When the onsets or offsets of these landmarks could not be pinpointed, as was often the case with the /la/ sequence, the mid-point in the formant transition was defined as the boundary between the segments.
We focused on the acoustic measures typically found to distinguish clear and conversational speech: consonant and vowel duration, speech rate, and vowel space area (Smiljanić, 2021; Smiljanić & Bradlow, 2009). Speech rate was calculated as the total number of the syllables of the six words (i.e., 18) divided by their total duration in seconds. Vowel space area was calculated as the triangular area formed by the average mid-point F1 and F2 values of the three vowel categories. We also measured syllable duration to examine whether our speaker produced any syllables longer, allowing listeners to use word-initial stress as a segmentation cue (Cutler & Butterfield, 1992; Cutler & Norris, 1988).
We examined coarticulation, which listeners may use to facilitate word segmentation and which may differ across conversational and clear speech productions (e.g., Moon & Lindblom, 1994; Scarborough & Zellou, 2013). We quantified coarticulation using a technique introduced by Gerosa, Lee, Giuliani, and Narayanan (2006) and validated by Cychosz et al. (2019). Unlike methods such as locus equations (e.g., Lindblom & Sussman, 2012; Sussman, McCaffrey, & Matthews, 1991), which measure only F2 and can thus discard coarticulatory information in other resonant frequencies, this technique takes the whole spectrum of a segment into account and can be applied to any segment type. Following Cychosz et al. (2019), we divided each segment into frames of 25.6 milliseconds, applied short-time Fourier transformation to each frame to obtain its spectrum, and convolved a Mel-frequency filter bank over the spectrum. The results were then averaged across the frames of each segment to derive the average log Mel-frequency spectral vector, in which each value could be thought of as representing the segment's intensity in a frequency band. Next, we calculated Euclidean distances between the spectral vectors of each diphone sequence in each word and obtained an overall spectral distance for that word by taking the average of the distances. Greater distances indicate more spectral difference between adjacent segments and hence less coarticulation. This analysis was done with a custom Python script using functions from the LibROSA library for audio processing (McFee et al., 2015).
Table 1 lists the mean syllable durations (ms), speech rate (syllables per second), and the vowel space areas for clear and conversational styles. As expected, overall speaking rate was slower for clear than for conversational speech. The clear syllables were on average longer than the conversational ones. Figure 1 shows the vowel space areas formed by /a, i, u/ in two speaking styles. The clear vowels (connected by dark gray solid lines) were generally more peripheral and their vowel space area was larger than that of the conversational vowels (367,632 Hz² versus 191,048 Hz²). Finally, coarticulation results showed that segments were less coarticulated in the clear speaking style: The overall spectral distances between adjacent segments in the clear artificial language words (mean: 7.25) were all greater than those of their conversational counterparts (mean: 5.71). The difference in mean overall spectral distance between the two styles (i.e., 1.54) was comparable to those found in Guo and Smiljanić's (to appear) investigation of the LUCID corpus, which ranged between 0.59 and 2.64.
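The final steps of the spectral-distance measure (averaging frame-level vectors within a segment, then averaging Euclidean distances between adjacent segments) can be sketched in plain Python. The sketch assumes that frame-level log Mel-frequency vectors have already been extracted for each segment; the three-band toy values and the function names are hypothetical and do not reproduce the custom script mentioned above.

```python
import math

def mean_vector(frames):
    """Average frame-level log Mel vectors into one vector per segment."""
    n = len(frames)
    return [sum(band) / n for band in zip(*frames)]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def overall_spectral_distance(segments):
    """Mean distance between the average spectral vectors of adjacent segments.

    `segments` is a list of segments in temporal order; each segment is a
    list of frame-level log Mel vectors (assumed to be precomputed).
    """
    means = [mean_vector(frames) for frames in segments]
    distances = [euclidean(a, b) for a, b in zip(means, means[1:])]
    return sum(distances) / len(distances)

# Toy "word" with three segments and three frequency bands (made-up values).
word = [
    [[1.0, 2.0, 3.0], [1.0, 2.0, 5.0]],  # segment 1, two frames: mean [1, 2, 4]
    [[4.0, 6.0, 4.0]],                    # segment 2: distance 5 from segment 1
    [[4.0, 6.0, 7.0]],                    # segment 3: distance 3 from segment 2
]
print(overall_spectral_distance(word))  # 4.0
```

A larger value means that adjacent segments are spectrally less alike, which under this measure is read as less coarticulation.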
In sum, the clearly produced artificial language words were acoustically distinct from the conversational ones, both temporally and spectrally, and in the expected direction. Similar style differences were found for the partwords (see Appendix), which were used only in the test. The results of acoustic analyses show that our speaker was able to implement the clear speech modifications even for nonsense words (e.g., Ferguson & Quené, 2014; Moon & Lindblom, 1994; Picheny, Durlach, & Braida, 1986; Rosen et al., 2011; Smiljanić & Bradlow, 2005).
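Among the measures summarized here, the vowel space area has a simple closed form: the area of the triangle whose vertices are the mean (F2, F1) points of /i/, /a/, and /u/. The sketch below uses the shoelace formula with hypothetical formant values; the numbers are for illustration only and are not the measurements reported in Table 1 or Figure 1.

```python
def triangle_area(p1, p2, p3):
    """Area of a triangle from three (x, y) vertices (shoelace formula)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2

# Hypothetical mean mid-point formants in Hz, given as (F2, F1).
vowels = {"i": (2500.0, 300.0), "a": (1400.0, 850.0), "u": (900.0, 350.0)}

area = triangle_area(vowels["i"], vowels["a"], vowels["u"])
print(area)  # 412500.0 (Hz squared)
```

Because both axes are in Hz, the area is expressed in Hz², and more peripheral vowels yield a larger triangle.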

Results
Responses in the test phase were analyzed. Failures to respond within the allotted time (i.e., five seconds) were recorded as time-outs and discarded. As in some artificial language learning studies (e.g., Frank, Goldwater, Griffiths, & Tenenbaum, 2010; Palmer & Mattys, 2016), participants whose accuracy rates were more than two standard deviations (SDs) below the mean of their condition were excluded as outliers. Four participants (2% of all participants: two from clear in quiet, one from conversational in quiet, and one from clear with noise at 0 dB SNR) were excluded. Figure 2 shows the accuracy rates of individual participants and the boxplots for the two speaking styles and all listening conditions.
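The exclusion rule can be illustrated with a small Python sketch. It assumes that the SD is the sample standard deviation computed within a condition with all of its participants included; the accuracy values and participant labels are invented for the example.

```python
from statistics import mean, stdev

def exclude_low_outliers(accuracy_by_participant, n_sd=2.0):
    """Drop participants whose accuracy is more than n_sd SDs below the condition mean."""
    scores = list(accuracy_by_participant.values())
    cutoff = mean(scores) - n_sd * stdev(scores)  # sample SD (assumption)
    kept = {p: a for p, a in accuracy_by_participant.items() if a >= cutoff}
    return kept, cutoff

# Invented accuracy rates for one condition.
condition = {"p01": 0.72, "p02": 0.69, "p03": 0.75, "p04": 0.70, "p05": 0.74,
             "p06": 0.71, "p07": 0.73, "p08": 0.68, "p09": 0.76, "p10": 0.30}

kept, cutoff = exclude_low_outliers(condition)
print(sorted(set(condition) - set(kept)), round(cutoff, 3))
```

Only unusually low scores are removed under this rule; high scores are retained regardless of their distance from the mean.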
A Bayesian mixed-effects logistic regression model was fitted to the responses using the brms package (Bürkner, 2017) of R (R Core Team, 2020). We opted to use the Bayesian approach as it offers several advantages over traditional frequentist methods based on null-hypothesis significance testing (Kruschke, Aguinis, & Joo, 2012;Lee, 2011;Wagenmakers et al., 2018). For example, it allows the incorporation of prior knowledge in model estimation. Its output is a probability distribution of plausible parameter values, which has a natural interpretation like "There is 95% probability that the true parameter value falls between x and y, given the model and data." As data are usually collected with the goal of obtaining insight about parameters, this is conceptually more appealing than point estimates like the p-value in frequentist analysis, which is the probability of the data given that the null hypothesis is true (see, inter alia, Kruschke, 2014, for more details about Bayesian methods and Vasishth, Nicenboim, Beckman, Li, & Kong, 2018, for a tutorial using brms). The data and script for the statistical analysis are available on the Open Science Framework (https://osf.io/9v7k4/).
The fixed effects of primary interest in our model were Style (baseline: conversational), Listening Condition (LisCond; baseline: quiet), and their interaction. In addition, following Ou and Guo (2021), we log-transformed and scaled response time (LogRT) and included it as a covariate to factor out any potential relationship between response latency and accuracy. The random effects were random intercepts for participant, word, and partword as well as by-word random slopes for all fixed factors and a by-partword random slope for Style. Since there were no previous attempts to investigate the impact of speaking style on statistical segmentation, we used weakly informative, regularizing priors that have a normal distribution with zero mean and SD of 10 on all fixed-effect coefficients. The SD parameters for the random effects had the same normal distribution except they included only positive values. As a standard option, we used the LKJ(2) prior (Lewandowski, Kurowicka, & Joe, 2009) on the correlation parameters between the random intercepts and slopes (Vasishth et al., 2018). The joint posterior distribution was sampled using four Markov chain Monte Carlo (MCMC) chains, each with 2,000 iterations and a warm-up of 1,000. The R-hat statistic (i.e., the ratio of between- to within-chain variance) was one for all parameters, indicating that the chains converged and could be representative of the underlying posterior distribution. Figure 3 shows the marginal posteriors on all the fixed-effect parameters. A criterion for deciding whether a factor had a significant effect or whether two groups differ significantly in Bayesian inference is to examine the 95% credible interval (CI) of a posterior distribution, namely, the interval between the 2.5th and 97.5th percentiles of the distribution. Exclusion of zero from the interval is taken to be evidence for a significant effect or difference.
For instance, the 95% CI was between -0.37 and -0.20 (mean: -0.29) for LogRT, meaning that given the model and data, there is a 95% probability that the parameter value of LogRT is between -0.37 and -0.20. The interval excluded zero and fell in the negative region, suggesting that there was an inverse relationship between response latency and accuracy: Responses with longer latencies were less likely to be correct. Such a relationship has been found in the artificial language learning experiments of Ou and Guo (2021). As they argue, this may reflect listeners' response certainty or confidence. Recall that the two stimuli in each test trial were separated by a pause. Thus, listeners might be ready to respond as soon as they heard the first stimulus and could determine whether it was a word of the artificial language or a partword. But when they were indecisive, they might think longer and guess, resulting in slower and less accurate responses.
It was also possible to draw inferences based on 95% CI for the other parameter values in the same way. Nevertheless, these raw parameter values were on a log-odds scale and the resulting inferences might not be immediately interpretable. Therefore, we chose to first compute the log-odds ratio for each condition at each iteration of the MCMC sample and convert it to an accuracy rate (in percentages) to obtain posterior distributions of accuracy for the six conditions. Next, we drew inferences about the effect of a particular factor by computing differences in posterior accuracy between the relevant conditions. Apparent from Figure 2 is that regardless of speaking style, performance dropped substantially when the artificial language speech streams were noise-masked. This was confirmed by the posterior distributions of accuracy drop from the quiet condition, calculated by subtracting the accuracy of the quiet condition from that of a noise condition. For conversational style, the posterior accuracy drop for +3 dB SNR had a mean of -16% and a 95% CI from -33% to -3%: That is, given the model and data, there was a 95% probability that mixing the conversational artificial language speech with speech-shaped noise at +3 dB SNR would decrease performance by 3% to 33%, and the chance that noise had no effect or enhanced performance was very low. Since the 95% CI excluded zero, it was assumed that for the conversational style, accuracy dropped significantly from the quiet to +3 dB SNR condition. Likewise, the posterior accuracy drop for the conversational style at 0 dB SNR had a 95% CI excluding zero (mean: -29%; 95% CI: [-45%, -12%]), suggesting that noise at 0 dB SNR significantly reduced performance. Similar significant reduction in accuracy in noise was shown for the clear style at +3 dB SNR (mean: -28%; 95% CI: [-52%, -9%]) and 0 dB SNR (mean: -31%; 95% CI: [-54%, -11%]).
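The conversion from log-odds draws to posterior accuracy differences can be sketched as follows. This is an illustration under strong simplifying assumptions: the posterior draws are simulated from normal distributions with invented means and SDs (they are not the estimates of the fitted model), only an intercept and a Style coefficient are used, and the covariate and random effects are ignored. The function names are hypothetical.

```python
import math
import random

def inv_logit(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def credible_interval(draws, mass=0.95):
    """Equal-tailed interval, e.g. the 2.5th to 97.5th percentiles for mass=0.95."""
    s = sorted(draws)

    def quantile(q):
        pos = q * (len(s) - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)  # linear interpolation

    return quantile((1 - mass) / 2), quantile((1 + mass) / 2)

rng = random.Random(42)
n_draws = 4000  # four chains x 1,000 post-warm-up iterations

# Simulated draws on the log-odds scale (invented values, not model output).
intercept = [rng.gauss(0.8, 0.15) for _ in range(n_draws)]  # conversational, quiet
b_style = [rng.gauss(0.4, 0.20) for _ in range(n_draws)]    # clear minus conversational

# Accuracy (%) per condition at each draw, then the per-draw difference.
acc_conv = [100 * inv_logit(a) for a in intercept]
acc_clear = [100 * inv_logit(a + b) for a, b in zip(intercept, b_style)]
difference = [c - v for c, v in zip(acc_clear, acc_conv)]

low, high = credible_interval(difference)
p_positive = sum(d > 0 for d in difference) / n_draws
print(f"95% CI: [{low:.1f}%, {high:.1f}%]; P(difference > 0) = {p_positive:.3f}")
```

Computing the difference draw by draw, rather than subtracting two summary means, preserves the joint uncertainty of the parameters. The proportion of draws above or below zero gives the kind of posterior probability statement used in the analyses.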
To further explore how the adverse effect of noise interacted with style, we computed the posterior difference between the two styles in accuracy drop from quiet to each noise condition. For +3 dB SNR, the 95% CI of the posterior difference (mean: -12%; 95% CI: [-29%, 2%]) did not exclude zero, but given the model and data, there was a high (i.e., 96%) probability that the performance decrease was larger for the clear style than for the conversational one. Yet, in the 0 dB SNR condition, there was no evidence that the accuracy drop from quiet differed across the two styles (mean: -2%; 95% CI: [-18%, 12%]): The probability that the drop was greater for clear speech was only 60%.
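A directional probability of this kind is simply the share of posterior draws in which the difference has the relevant sign. The sketch below is a hedged illustration with simulated values: the draws for the two styles are generated independently from made-up normal distributions, whereas draws from a fitted model would be paired by MCMC iteration and typically correlated. The names `prob_below_zero`, `drop_conv`, and `drop_clear` are hypothetical.

```python
import random


def prob_below_zero(draws):
    """Share of posterior draws that fall below zero."""
    return sum(d < 0 for d in draws) / len(draws)


# ASSUMPTION: simulated accuracy drops (in percent) for each style.
# These are NOT the study's posterior samples.
rng = random.Random(1)
n_draws = 4000
drop_conv = [rng.gauss(-16, 8) for _ in range(n_draws)]
drop_clear = [rng.gauss(-28, 10) for _ in range(n_draws)]

# Difference between styles in accuracy drop, computed draw by draw.
# Negative values mean the drop was larger for clear speech.
diff = [clear - conv for clear, conv in zip(drop_clear, drop_conv)]
p_larger_for_clear = prob_below_zero(diff)

print(f"P(drop larger for clear speech) = {p_larger_for_clear:.2f}")
```

Note that an equal-tailed 95% CI excludes zero on the negative side only when this probability exceeds 97.5%, which is why a 96% directional probability can coexist with an interval that includes zero.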
Our main objective was to examine whether clear speech improved segmentation by statistical learning relative to conversational speech and, if so, whether the clear speech segmentation benefit persisted in noise. These questions were addressed by calculating the posterior distribution of 'clear speech benefit' (the difference in posterior accuracy between a clear condition and a conversational one) for each level of the listening condition factor. The results are shown in Figure 4. The posterior clear speech benefit in quiet had a 95% CI excluding zero and falling in the positive region (mean: 7%; 95% CI: [1%, 18%]). That is, compared to hearing the conversational style, hearing the clear style significantly increased the probability of accurate word segmentation when the artificial language speech streams were presented in quiet. In addition, given the data and model, the probability that the clear speech benefit was zero or negative was less than 2%. These findings support the idea that clear speech improves segmentation by statistical learning relative to conversational speech (see note 4). In contrast, the posterior distributions of clear speech benefit for the +3 dB SNR (mean: -5%; 95% CI: [-22%, 11%]) and 0 dB SNR (mean: 5%; 95% CI: [-13%, 22%]) conditions were relatively flat and neither of their 95% CIs excluded zero, suggesting much uncertainty about the effect of speaking style. There is thus no evidence for a significant effect of style under either noise condition.
Note 4. A sensitivity analysis was conducted to examine how the evidence for the clear speech benefit in quiet would be impacted by the choice of priors. The model presented above was rerun using four different priors for all the fixed effects: normal distributions with a mean of zero and standard deviations of 0.1, 1, 25, and 50. The resulting 95% CIs of posterior clear speech benefit in quiet for the four priors were [-4%, 5%], [0%, 18%], [1%, 18%], and [1%, 17%], respectively. That is, the 95% CIs failed to exclude zero only when we used the priors with standard deviations of 0.1 and 1, which represented a rather strong belief that the fixed effects were likely around zero.
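Why narrow zero-centered priors pull the interval toward zero can be illustrated with a conjugate normal approximation. This is a deliberately simplified sketch and not the reported analysis, which refit a mixed-effects model by MCMC: here a single effect is summarized by a hypothetical estimate and standard error on the log-odds scale (0.9 and 0.4 are made-up values), and the function `normal_posterior` is introduced only for illustration.

```python
import math


def normal_posterior(estimate, se, prior_sd, prior_mean=0.0):
    """Posterior mean and SD for a normal likelihood with a normal prior."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / se ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * estimate)
    return post_mean, math.sqrt(post_var)


# ASSUMPTION: hypothetical effect estimate and standard error (log-odds).
estimate, se = 0.9, 0.4

# Same prior standard deviations as in the sensitivity analysis.
results = {}
for prior_sd in (0.1, 1, 25, 50):
    post_mean, post_sd = normal_posterior(estimate, se, prior_sd)
    lower = post_mean - 1.96 * post_sd
    upper = post_mean + 1.96 * post_sd
    results[prior_sd] = (post_mean, lower, upper)
    print(f"prior SD {prior_sd}: mean {post_mean:.2f}, "
          f"95% interval [{lower:.2f}, {upper:.2f}]")
```

In this approximation the posterior mean is a precision-weighted average of the prior mean and the data estimate, so a prior with a standard deviation of 0.1 dominates the data and the interval straddles zero, whereas priors with standard deviations of 25 or 50 leave the estimate essentially unchanged.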

Discussion and conclusions
The current study investigated how speaking style variation impacted speech segmentation by statistical learning in quiet and in noise. A review of clear speech and speech segmentation research suggested that both conversational and clear speech could improve statistical segmentation, leading to competing predictions. The current work aimed to assess the contradictory predictions and to provide new evidence of the processes underlying successful segmentation. We used an artificial language learning experiment, in which English-speaking listeners recognized the words of a made-up language after hearing speech streams containing uninterrupted repetitions of the language's words.
Results indicated that regardless of speaking style, recognition accuracy of artificial language words in noise was worse than that in the quiet condition, showing that noise reduces segmentation performance. This was true for both noise levels (+3 and 0 dB SNR). Crucially, when presented in quiet, the clear version of the speech streams led to higher accuracy than the conversational version, supporting the prediction that clear speech improves segmentation by statistical learning. However, in noise, recognition accuracy was equivalent for both speaking styles.
The finding from the quiet condition is consistent with the clear speech perceptual benefit well-documented in the literature. A shift from conversational to clear speaking style is associated with a number of acoustic-phonetic adjustments such as enhanced consonant and vowel contrasts and has been shown to assist listeners in phoneme identification and various linguistic and cognitive processes (e.g., Bradlow & Bent, 2002; Cooke et al., 2013; Keerstock & Smiljanić, 2019; Krause & Braida, 2002, 2004; Maniwa et al., 2009; Picheny et al., 1985). Our findings suggest that in an optimal listening situation, these adjustments are similarly beneficial for tracking statistical regularities in an unfamiliar language. Part of the clear speech benefit may be attributed to its slower speaking rate, which, as Palmer and Mattys (2016) argue, improves segmentation by statistical learning by allowing more time for syllable sequences to be stored in working memory and to be refreshed. On the other hand, conversational speech is relatively disadvantaged in terms of its support for statistical segmentation in quiet despite its more pervasive coarticulation cues.
It is important to note that conversational speech still led to rather high accuracy. Our participants reached a mean accuracy rate of 85% in the quiet conversational condition and the Bayesian mixed-effects analysis revealed a mean posterior accuracy of 86% (95% CI: [69%, 96%]) for this condition. These percentages are well above chance and higher than those reported in many artificial language learning studies (e.g., Ordin, Polyanskaya, Laka, & Nespor, 2017; Tyler & Cutler, 2009), which were mostly below 80%. Our participants' mean accuracy is close to the mean accuracy of 82-85% observed by Fernandes et al. (2007) in the condition where the artificial language speech streams were presented in quiet with congruent coarticulatory and statistical cues. Therefore, conversational speech with its greater coarticulation supports successful segmentation, leading to accuracy above 80%. However, clear speech provided an even greater benefit for segmentation accuracy. This suggests that, despite the less pervasive coarticulation, other acoustic-phonetic features associated with clear speech contributed significantly to segmentation. This result seems to be at odds with Fernandes et al.'s findings. However, since our speaker produced each artificial word as a whole, rather than as isolated syllables, word-internal segments were coarticulated to some extent even in clear speech. This is different from Fernandes et al.'s baseline condition, in which syllables of artificial language words were concatenated without natural coarticulatory transitions and against which the conditions with congruent or incongruent coarticulation were compared. Further research is needed to precisely determine the relative strength of each acoustic-phonetic cue associated with speaking style variation as it contributes to segmentation.
In contrast to the quiet listening condition, the results showed equivalent segmentation performance for conversational and clear speech in noise at both SNR levels. The overall lower performance in noise is not surprising as a number of beneficial cues to segmentation are masked. In addition, noise increases listening effort, diverting cognitive resources from the segmentation task (e.g., Rönnberg et al., 2010; Zekveld et al., 2011). Unexpected, though, is the lack of the clear speech perceptual advantage in segmentation considering the oft-reported findings that clear speech significantly improves phoneme and word recognition relative to conversational speech in noise at these or similar SNRs, sometimes even to a greater extent as the SNR becomes more challenging (e.g., Bradlow et al., 2003; Ferguson & Kewley-Port, 2007; Payton et al., 1994; Picheny et al., 1986; Smiljanić & Bradlow, 2005). It is possible that the increased cognitive load in noise depleted the processing resources in a way that the slower speaking rate in clear speech became less beneficial (cf. Palmer & Mattys, 2016), though this possibility remains speculative as the drop in accuracy from the quiet to noise conditions was not significantly greater for clear speech.
Another factor that could play a role in the observed lack of clear speech benefit relates to the relative weight of prosody as a segmentation cue in noise. Recall that our stimuli had monotone F0 contours, neutralizing pitch cues for the listeners. In addition to slower speaking rate and enhanced phonemic categories, clear speech is characterized by an increase in pitch average and range (e.g., Bradlow et al., 2003; Picheny et al., 1986; Smiljanić & Bradlow, 2005). When speech is degraded by noise, beneficial segmental cues are masked and their role in segmentation may be diminished. In contrast, prosodic cues like stress become particularly robust and useful in adverse listening conditions. English listeners are known to treat stressed syllables as word beginnings (stress-based segmentation: e.g., Cutler & Norris, 1988; Tyler & Cutler, 2009), reflecting the fact that most English words are stressed initially (Cutler & Carter, 1987). This segmentation strategy seems to play a role when the listening conditions are optimal. According to Mattys, White, and Melhorn's (2005) model, descending weights are assigned to lexical, segmental (e.g., coarticulation), and prosodic/metrical cues in speech segmentation. The hierarchy is not fixed as listeners dynamically adjust cue weights in response to the listening condition. In noise-degraded speech, listeners rely on prosodic information, if well-preserved, more than on segmental information (Mattys & Bortfeld, 2017). Since our listeners could not rely on lexical knowledge in the artificial language learning tasks, it is possible that stress could have played an important role in the noise-masked conditions (cf. Fry, 1958; Gay, 1978; Lieberman, 1960; Sluijter & van Heuven, 1996). However, in our stimuli, vowels in the initial syllables were not longer and the words were synthesized with flattened F0, thus eliminating stress-based segmentation (see note 2).
As F0 contours are more exaggerated in clear speech, F0 flattening may have affected segmentation in clear speech more than in conversational speech precisely in the listening conditions that would have favored prosodic cues. Future work should examine whether the presence of exaggerated stress cues would contribute to better segmentation performance for clear speech relative to conversational speech in noise. Another important future direction is to determine how the impact of noise on the speech signal combines with its effect on domain-general processing resources to shape speech segmentation.
The present study showed that speaking clearly relative to conversationally improves segmentation by statistical learning under optimal listening conditions. This finding suggests that as far as listening in quiet is concerned, a shift towards a hyper-articulated listener-oriented speaking style improves not only recognition of meaningful words or sentences, but also processes that depend on domain-general cognitive abilities such as tracking statistical regularities. Listeners are able to use acoustic-phonetic enhancements to track syllable co-occurrences and improve segmentation by statistical learning. It is important though to acknowledge the limitations of the current study. The clear speech benefit for segmentation found here was rather small. It is therefore crucial to replicate these findings with a different group of listeners with the same language profile. Similarly, it cannot be taken for granted that the current results would generalize to more naturalistic communicative contexts or to other 'clear speeches.' Scarborough and colleagues (Scarborough et al., 2007) found that speech to a real listener (e.g., a foreigner) was faster and had a less expanded vowel space than speech to an imagined one. In a later study (Scarborough & Zellou, 2013), they showed that relative to read clear speech, real listener-directed speech had greater vowel-nasal coarticulation. As mentioned, however, Guo and Smiljanić's (to appear) whole-spectrum analysis suggested that clear speech exhibited coarticulatory resistance when communicative barriers were explicitly imposed. Real listener-directed speaking styles also differ in F0 and vowel space or formant range depending on the interlocutor (Burnham et al., 2002; Uther et al., 2007) or the nature of the communicative barrier (e.g., vocoded versus mixed with talker babble). It is important then to see how the speaking style variation encountered in real-world communicative situations contributes to speech segmentation.
Our speculation is that the various acoustic-phonetic enhancements consistently found across hyper-articulated clear speaking styles (e.g., slower speaking rate) will improve segmentation by statistical learning relative to the casual hypo-articulated forms, though the effect size may vary across clear speaking styles elicited in different ways.
It is also of interest to examine whether the clear speech segmentation benefit observed at least in quiet extends to listeners who are learning another language later in life. Clear speech enhancements reflect language-specific properties, and the gain in intelligibility is greater for listeners who acquired the target language from birth than for language learners (Bradlow & Alexander, 2007; Bradlow & Bent, 2002). Using signal-driven cues for segmentation is in part a language-specific skill that requires proficiency or knowledge in the target language (Tremblay & Broersma, 2019; Tremblay, Coughlin, Bahler, & Gaillard, 2012). It is possible that language learners' segmentation by statistical learning is enhanced through language-independent clear speech enhancements such as slower speaking rate and exaggerated F0 contours. However, they may not be able to take advantage of other signal-driven cues such as vowel and consonant contrast enhancements, which might be available only to those with extensive experience with the sound patterns of the target language (Smiljanić & Bradlow, 2011). Finally, we need to extend this line of work to examine whether speech segmentation of real words is enhanced by clear speech in quiet and in noise differently than what was found for artificial language learning.

Additional File
The additional file for this article can be found as follows:
• Appendix. PDF file containing the acoustic analyses of the partwords. DOI: https://doi.org/10.5334/labphon.310.s1