Voice onset time (the interval between stop release and onset of vocal cord vibration, hereafter VOT) is a primary acoustic cue that differentiates voiced from voiceless stop phonemes in word- and syllable-initial positions in many languages (Beckman et al., 2011; Cho & Ladefoged, 1999; Kessinger & Blumstein, 1997; Lisker & Abramson, 1964, 1967, 1970). In English, initial voiced stop phonemes are generally said to have a VOT of 15 ms or less (short-lag VOT or prevoiced), and voiceless stop phonemes some 30 ms or longer (long-lag VOT) (Lieberman & Blumstein, 1988, p. 215). Speech segment durations are affected by articulation rate, however (e.g., Gaitenby, 1965). Phonetic-phonological research has thus long been interested in how articulation rate affects VOT, and how listeners recover the correct voicing specifications of stop phonemes despite surface variation of VOT in the input.
There are two contrasting views on this issue. The widely accepted view rests on claims that the VOT category boundary location that optimally distinguishes short-lag and long-lag categories shifts with articulation rate. On this view, languages that contrast these categories, such as English, require rate-dependent VOT category boundaries to distinguish voiced and voiceless stop phonemes effectively, with a larger VOT value for the category boundary at a slower articulation rate (e.g., Miller et al., 1986). Based on their linguistic experience, listeners shift the perceptual VOT category boundary, or “normalize” the boundary location, according to articulation rate to correctly identify the stop’s voicing specification from VOT.
The less accepted view states that short-lag VOT hardly changes with articulation rate and serves as a phonetic anchor in maintaining the voicing contrasts, so that the same VOT category boundary location remains optimal across different rates of articulation (Kessinger & Blumstein, 1997). On this view, listeners do not need to shift the VOT category boundary with a change in articulation rate in order to correctly identify the stop’s voicing specification from VOT.
Evidence from perception studies has been generally interpreted as supporting the rate normalization view. Many past studies report a shift in the perceptual VOT category boundary, with larger values for slower articulation rates, emulated by manipulating the duration of surrounding speech segments (e.g., Green et al., 1994; Green & Miller, 1985; Kidd, 1989; Miller & Dexter, 1988; Newman & Sawusch, 1996; Summerfield, 1981; Volaitis & Miller, 1992).
Even so, it has been noted that such shifts in perceptual VOT category boundary locations are often much smaller in magnitude than expected from production studies (Kessinger & Blumstein, 1998; Miller et al., 1986; Pind, 1995; Summerfield, 1975). This is most evident in Pind’s (1995) Icelandic study. In that study a mere 1.5 ms shift in the perceptual category boundary location was observed, where production data predicted a roughly 20 ms shift, although at least the observed shift was in the predicted direction and was statistically significant. The production-perception mismatch is problematic for perceptual normalization views, which assume that rate normalization processes reflect the listener’s “detailed knowledge of the temporal regularities of speech” (Nooteboom, 1979, p. 304).
From a psycho-acoustic perspective, some researchers have cast doubt on the interpretation of perceptual rate normalization studies. Diehl and Walsh (1989) found that the same nonspeech sound is perceived to be shorter before a long sound than before a short sound, and suggested that the findings of perceptual rate normalization studies may instead be attributed to general auditory contrast effects (see also Pisoni et al., 1983). Although Diehl and Walsh’s (1989) study concerned the English /b/-/w/ contrast, if we applied the principle of auditory contrast effects to typical situations in perceptual VOT boundary experiments, the same VOT would be perceived to be shorter before a long segment (in the slow articulation condition) than before a short segment (in the fast articulation condition), which would produce a shift in the VOT category boundary in the direction reported by the perceptual rate normalization studies (see Reinisch and Sjerps for similar effects induced by temporally manipulating preceding speech contexts). In other words, the observed shifts in VOT category boundary locations in the previous perception experiments could have arisen from general auditory effects rather than from speech rate normalization, which reflects listeners’ knowledge of the temporal regularities of speech.
From another perspective, Toscano and McMurray (2012) also argue against perceptual rate normalization of VOT. These authors suggest that English-speaking listeners use the duration of the vowel following a stop onset as an independent cue to the stop’s voicing specification, not as a cue to articulation rate as generally held. All else being equal, vowels following a voiced stop onset (measured from the onset of voicing) are longer than vowels following a voiceless stop onset in English (Allen & Miller, 1999). This vowel duration difference can serve as a secondary cue to the preceding stop’s voicing specification. Consequently, listeners are more biased towards the “voiced” response when the vowel following a stop onset is longer (and vice versa), which gives an appearance of rate normalization.
We suspect that the prediction of a rate-dependent shift in perceptual VOT category boundary location is an artifact of the rather unnatural elicitation methods used in production studies. For example, Miller et al. (1986), Volaitis and Miller (1992), and Pind (1995) all used a “magnitude production technique”, in which the participants were instructed to produce test syllables/words (e.g., /pi/) at several rates: at normal rate, twice normal rate, four times normal rate, as fast as possible, and so on. Such elicitation methods reveal what the speakers are capable of, but not necessarily what they produce in everyday communication. That is, the ranges of VOT values elicited in these studies are not ecologically grounded, and might not be relevant to central theoretical models of speech communication. (We do not mean that laboratory speech production studies are always or necessarily undesirable. See Xu for the advantages of well-constructed laboratory speech materials.)
We are aware that the ranges of articulation rate used in the studies employing the magnitude production technique are not entirely arbitrary. Implicitly, they are informed by Miller et al.’s (1984) study on variability in articulation rate in spontaneous speech, where articulation rate was expressed as the mean syllable duration of each pause-free stretch of speech. While we agree with Miller et al. (1984) that articulation rate may fluctuate during a conversation, the estimated variability in articulation rate in that study perhaps is inflated, because it conflates variability arising from various sources such as segments’ intrinsic durations and prosodic temporal adjustments.
In Lehiste (1972), for instance, the duration of stick differed by a factor of 1.6 when her speakers produced the word in isolation vs. in a sentence (the stick was discarded) at a subjectively constant rate (see also Frank & Jaeger, 2008; Yuan et al., 2006). Unlike stick produced in the sentence, stick produced in isolation most probably underwent accentual and utterance-final lengthening, among other things, resulting in rather different durations of stick at similar articulation rates.1 In our view, these additional sources of durational variability should be distinguished from general “articulation rate”, manipulated in a majority of rate normalization studies by instructing participants to produce speech materials (often isolated syllables/words) at different speeds, or by resynthesizing speech materials to shorten or lengthen their overall durations, a common approach for perception experiments.
More recently, Nagao and de Jong (2007) elicited target syllables (/bi/ vs. /pi/) of a much smaller durational range than Miller et al. (1986), and reported a comparable rate-dependent shift in the VOT category boundary in production and perception, except in the fast speech rates. However, participants produced test syllables in time with a metronome, which again deviates from everyday speech production. Additionally, as the authors note, spoken syllables from the production experiment were used as stimuli in the perception experiments without controlling other acoustic cues for voicing specifications such as F0, formant transitions, and the amplitude of aspiration noise (Haggard et al., 1981; Repp, 1979; Stevens & Klatt, 1974). It is thus unclear whether the participants identified stimuli with a long VOT as voiced in slow speech more often (and vice versa) because of perceptual rate normalization, or because of other cues for voicing compatible with the intended category despite atypical VOT values.
Whether or not they subscribe to rate normalization views, virtually all production studies report asymmetrical effects of articulation rate on voicing categories, with much smaller effects on short-lag than long-lag categories (Kessinger & Blumstein, 1997; Magloire & Green, 1999; Miller et al., 1986; Nagao & de Jong, 2007; Pind, 1995; Schiavetti et al., 1996; Stuart-Smith et al., 2015; Volaitis & Miller, 1992). Conceivably, for naturally occurring ranges of VOT, a rate-independent category boundary between short-lag and long-lag VOT is effective enough across different rates of articulation. Other voicing cues would still be useful, as cue redundancy makes speech perception more robust and effective (Nakai & Turk, 2011; Wright, 2004).
To see how relevant the existing literature on perceptual rate normalization of category boundary locations is for naturally occurring ranges of VOT values, we examined word-initial voiced vs. voiceless English stop phonemes (the subject of many rate normalization studies) in spontaneous speech. In Miller et al. (1986) voiced vs. voiceless stop phonemes were produced at various articulation rates, and optimal category boundaries (described in Section 2.3 below) were estimated for syllables grouped by 50-ms intervals. In that study, estimating articulation rate was relatively straightforward because the speech materials were tightly controlled phonetically and prosodically: isolated /bi/ vs. /pi/.
As we pointed out, accurately quantifying the articulation rate of spontaneous speech is not easy, because word sequences and their prosodic groupings vary from utterance to utterance, adding noise to the estimated articulation rate. Therefore, in our main analysis we applied rate-independent optimal VOT category boundaries to spontaneous speech data, and compared their classification accuracy with the overall classification accuracy achieved by Miller et al.’s (1986) rate-dependent optimal category boundary. To make our spontaneous speech data roughly comparable to Miller et al.’s (1986) well-controlled speech material, in our application of rate-independent category boundaries we took into account factors known to affect VOT other than articulation rate (place of articulation, lexical stress, following vowel, word class; see, e.g., Lisker & Abramson, 1967).2
If such rate-independent category boundaries are as effective as Miller et al.’s (1986) rate-dependent category boundary, then we can conclude that classification accuracy is unlikely to improve by additionally taking articulation rate into account. Put differently, comparable performances of rate-independent category boundaries applied to spontaneous speech and Miller et al.’s (1986) rate-dependent category boundary applied to laboratory speech would speak against the need for perceptual rate normalization of VOT category boundaries under natural circumstances.
The spontaneous speech sample used in this study comprised ten episodes of a BBC (the British Broadcasting Corporation) Radio 3 program “the Lebrecht Interview”, broadcast in 2011 and 2012. Each 45-minute episode featured a prominent artist or administrator in classical music, who talked to the interviewer (a music commentator) at a radio station about work and life in a conversational style. The interviewees whose speech was analyzed comprised four males and six females (age range: 37–78, x̅ = 62). They were all native speakers of English, from different parts of the world: United States (n = 4), United Kingdom (5), and Australia (1) (see Discussion for possible effects of dialectal differences). The episodes were streamed on iPlayer (http://www.bbc.co.uk/radio3) on a MacBook Pro and captured using Audacity, via a Soundflower input/output device at a 44.1 kHz sampling rate with 16-bit quantization, a standard used in the BBC radio studio recordings in the UK (British Broadcasting Corporation, 2010). The resulting audio recordings had a bandwidth of c. 20 kHz. No dropouts were detected.
2.2 VOT measurements
All instances of English words beginning with one of the six oral stop phonemes (/b/, /d/, /g/, /p/, /t/, and /k/) as a simplex onset were identified in the interviewees’ speech for VOT measurements. Words of a foreign origin were excluded unless they were listed in the Collins online English dictionary (http://www.collinsdictionary.com/dictionary/english) with Anglicized pronunciations and judged to have been part of the English language for some time. For example, Bach and Berlin were included, but Bayreuth and Dudamel were excluded. Altogether, 10,479 words that satisfied the criteria were identified (see Table 1).
Of those, the VOT of the initial stop of 422 words (4%) was not measured because of overlapping speech, noise, a devoiced following vowel, unclear stop release, or the stop’s realization as a glide, fricative, tap or nasal. Many (63%) of these belonged to function words, with /t/ in to accounting for 36% of all unmeasured tokens (though, as the most frequent /t/-initial word, 1,596 tokens still remained). Words spoken while laughing were also excluded, as we were uncertain to what extent the speaker had control over the duration of VOT. Words from disfluent sections of speech were included as an intrinsic part of spontaneous speech so long as the word was completed and identifiable, except one case of suspected substitution error (Boint P for Point B). Unfinished words were excluded, as many of them were just one syllable (e.g., bi- Beatles) and did not provide sufficient phonetic evidence to be absolutely sure which stop the speaker had intended.3
The VOT of the remaining 10,057 words was measured manually by the first author in Praat (Boersma & Weenink, 2012). VOT was defined as the interval from the first clear sign of stop release to the first clear sign of voicing that continued into the following vowel, as determined on the waveform in conjunction with spectrographic information (see Figure 1). This meant that no negative VOT was used; a VOT of 0 ms was assigned to prevoiced utterance-initial stops and utterance-medial stops produced with continuous voicing from before the stop release. This decision was made because the onset of voicing could not be easily determined for a majority of such cases, which were utterance-medial and had continuous voicing from segments before the stop closure (see also Lisker & Abramson, 1967; Stuart-Smith et al., 2015). As we elaborate in Section 2.3, this did not affect the locations of optimal VOT category boundaries or their classification accuracy, our main analysis tools. The portion of pseudo-regular waveform corresponding to a mixture of voicing and noisy aperiodic excitation at the release of stop closure was excluded from VOT. (VOT category boundaries estimated using this approach would be at smaller values than those estimated using an approach that includes the frication portion in VOT, regardless of concurrent voicing.)
For reliability, the second author measured the VOT of roughly 5% (500 tokens) of all measured tokens, selected randomly. The Spearman’s correlation coefficient between the two authors’ VOT measurements for each homorganic stop pair was: rs = 0.87 for /b/-/p/; rs = 0.95 (/d/-/t/); rs = 0.96 (/g/-/k/). The median difference between the repeated measurements was 1.7 ms for /b/-/p/, 2.8 ms for /d/-/t/, and 2.4 ms for /g/-/k/.
2.3 Optimal category boundary location
The optimal category boundary location between the two members of each of the three pairs of homorganic stops (/b/-/p/, /d/-/t/, and /g/-/k/) was estimated using Miller et al.’s (1986) categorization method. In this method, a candidate category boundary is placed along the VOT continuum; all items to the left of the boundary (VOT smaller than the value at the boundary) are classified as voiced, and all items to the right of the boundary are classified as voiceless. The boundary location that classifies the voicing specifications of the greatest proportion of the stop phonemes correctly (voiced and voiceless stops combined) is defined as optimal. For example, a category boundary placed at a very small VOT value (e.g., 5 ms) would classify most voiceless stop phonemes correctly but misclassify many voiced stop phonemes, resulting in a low overall classification accuracy.
In a procedural search for the optimal category boundary location, the candidate VOT boundary was moved in 1 ms steps from the smallest meaningful boundary location at 1 ms towards larger values, so that the classification accuracy improved, reached a maximum, and then started to decline. The optimal category boundary location is where the classification accuracy reaches the maximum. If maximum classification accuracy was found at multiple steps, we regarded all of them as optimal, but the midpoint of the range was used for calculations that required a single optimal VOT value.
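The search described above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis script used in the study: the function names and the VOT values are hypothetical, and treating a VOT exactly equal to the candidate boundary as voiceless is an assumption, since the text only specifies values smaller than the boundary.

```python
from typing import List, Sequence, Tuple


def correct_count(voiced: Sequence[float], voiceless: Sequence[float],
                  boundary: float) -> int:
    """Count tokens classified correctly by a single VOT boundary.

    Tokens with VOT below the boundary are labelled voiced. Treating a VOT
    exactly equal to the boundary as voiceless is an assumption of this sketch.
    """
    return (sum(v < boundary for v in voiced)
            + sum(v >= boundary for v in voiceless))


def optimal_boundary(voiced: Sequence[float],
                     voiceless: Sequence[float]) -> Tuple[float, List[int], float]:
    """Move the candidate boundary in 1 ms steps, starting at 1 ms.

    Returns the best accuracy, every step that reaches it, and the midpoint
    of that range (used when a single boundary value is needed).
    """
    total = len(voiced) + len(voiceless)
    last_step = int(max(list(voiced) + list(voiceless))) + 1
    best_correct, best_steps = -1, []
    for step in range(1, last_step + 1):
        n_correct = correct_count(voiced, voiceless, step)
        if n_correct > best_correct:
            best_correct, best_steps = n_correct, [step]
        elif n_correct == best_correct:
            best_steps.append(step)
    midpoint = (best_steps[0] + best_steps[-1]) / 2
    return best_correct / total, best_steps, midpoint


# Hypothetical VOT values in ms, for illustration only.
voiced_vot = [0, 5, 8, 12, 30]
voiceless_vot = [10, 35, 40, 55, 60]
accuracy, steps, midpoint = optimal_boundary(voiced_vot, voiceless_vot)
print(accuracy, steps, midpoint)  # 0.9 [31, 32, 33, 34, 35] 33.0
```

In this toy data set the maximum accuracy is reached at five consecutive steps, so the midpoint of the range (33 ms) stands in for the single optimal value.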
As explained earlier, we assigned a VOT value of 0 ms to prevoiced utterance-initial stops and utterance-medial stops produced with continuous voicing from before the stop release. This did not affect the estimated optimal VOT category boundary location, as an overwhelming majority of voiceless stop phonemes and many voiced stop phonemes in our data had positive VOT values (values greater than 0 ms). Therefore, the optimal VOT category boundary, located basically at the intersection of the VOT distributions of voiced and voiceless categories, always had a positive value, as expected for a category boundary between short-lag vs. long-lag VOT (see also Miller et al. , who used negative VOT values). If the optimal category boundary has a positive value, assigning 0 ms to negative VOT values makes no difference to classification accuracy, as a VOT of 0 ms would be positioned to the left of the category boundary, just like negative VOT values. Stops with a VOT of 0 ms would always be classified correctly if they are from a voiced category and wrongly if they are from a voiceless category.
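The argument that replacing negative VOT values with 0 ms leaves classification accuracy unchanged can be checked numerically. The sketch below uses hypothetical VOT values (including one hypothetical prevoiced token in the voiceless set) and hypothetical function names; it compares accuracy on raw and clamped values for every positive boundary from 1 to 100 ms.

```python
def accuracy(voiced, voiceless, boundary):
    """Proportion of tokens classified correctly (VOT < boundary means voiced)."""
    correct = (sum(v < boundary for v in voiced)
               + sum(v >= boundary for v in voiceless))
    return correct / (len(voiced) + len(voiceless))


def clamp_prevoiced(vots):
    """Replace negative (prevoiced) VOT values with 0 ms."""
    return [max(0.0, v) for v in vots]


# Hypothetical raw VOT values in ms, for illustration only.
voiced_raw = [-90.0, -45.0, 0.0, 6.0, 11.0, 19.0]
voiceless_raw = [-5.0, 22.0, 38.0, 51.0, 70.0]

# For any positive boundary, a negative VOT and a VOT of 0 ms both fall to
# the left of the boundary, so the two accuracies must be identical.
unchanged = all(
    accuracy(voiced_raw, voiceless_raw, b)
    == accuracy(clamp_prevoiced(voiced_raw), clamp_prevoiced(voiceless_raw), b)
    for b in range(1, 101)
)
print(unchanged)  # True
```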
2.4 Controlling spontaneous speech data
As laid out in the Introduction, our main goal is to compare the overall classification accuracy of the rate-dependent optimal category boundary applied to isolated /bi/ vs. /pi/ in Miller et al. (1986) against the accuracy of rate-independent optimal category boundaries applied to spontaneous speech data, controlled for known factors that affect VOT other than articulation rate. Rate-independent optimal category boundaries were estimated at four levels of data control: (a) all word-initial homorganic pairs of stop phonemes, (b) word-initial homorganic stop pairs in content words only,4 (c) word-initial homorganic stop pairs in content words with word-initial (primary and non-primary) lexical stress only,5 and (d) word-initial homorganic stop pairs in content words with word-initial lexical stress, grouped by the following vowel.
Needless to say, the controlling factors (place of articulation, word class, lexical stress, and following vowel) used here were far from exhaustive. To keep the analysis manageable in size, these factors were chosen from those reported to affect VOT durations in previous production and perception studies (e.g., Klatt, 1975; Lisker & Abramson, 1967; Yao, 2009) through inspection of items that were misclassified by the optimal category boundary at each analysis level. Among the data at the above four levels of control, the data at the final level of control (d) is the most comparable to Miller et al.’s (1986) data, which consisted of isolated /bi/ vs. /pi/ only.
3.1 Overview of results
Table 2 provides the classification accuracy of rate-independent optimal category boundaries, along with the median VOT value of each phoneme and the semi-interquartile ranges (SIQR) of the voiceless phonemes. The SIQR was not calculated for voiced phonemes, as many of them were assigned a VOT of 0 ms, which in many cases had no numerical significance (see Section 2.2). The median VOT values of the six phonemes at the first level of analysis (all words) were comparable to the mean VOT values of corresponding phonemes in sentence context in Lisker and Abramson (1967), with bilabial stops having the shortest VOT and velar stops the longest (see also Fricke, 2013; Schiavetti et al., 1996; Stuart-Smith et al., 2015). Optimal category boundary locations for the three pairs of homorganic stops were also the shortest for bilabial stops and generally the longest for velar stops, and were roughly within the range of category boundary locations for the three places of articulation reported in Summerfield’s (1975, 1981) perception studies.
|All words||/b/-/p/||/d/-/t/||/g/-/k/|
|Boundary location (ms)||16||24||27|
|Voiceless: median VOT in ms (SIQR)||35 (14)||45 (14)||49 (12)|
|Content words (all)||/b/-/p/||/d/-/t/||/g/-/k/|
|Boundary location (ms)||13||28||27|
|Voiceless: median VOT in ms (SIQR)||35 (14)||54 (14)||50 (12)|
|Content words (initial stress)||/b/-/p/||/d/-/t/||/g/-/k/|
|Boundary location (ms)||13||26||31|
|Voiceless: median VOT in ms (SIQR)||37 (14)||54 (14)||54 (13)|
|Content words (initial stress) grouped by following vowel||/b/-/p/||/d/-/t/||/g/-/k/|
|n||1,536||1,996 (see Note)||1,700 (see Note)|
|Boundary location (ms)||see Table 4 in Section 3.5|
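For readers unfamiliar with the semi-interquartile range reported for the voiceless phonemes, it is half the distance between the first and third quartiles. The following sketch shows one way to compute it together with the median. The data are hypothetical, and the quartile convention (the default "exclusive" method of Python's `statistics.quantiles`) is an assumption; other conventions can give slightly different values.

```python
import statistics


def median_and_siqr(values):
    """Return the median and the semi-interquartile range, (Q3 - Q1) / 2."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return q2, (q3 - q1) / 2


# Hypothetical voiceless VOT values in ms, for illustration only.
vot_ms = [20, 30, 35, 40, 50, 60, 70]
median, siqr = median_and_siqr(vot_ms)
print(median, siqr)  # 40.0 15.0
```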
Importantly, as the context other than articulation rate was progressively controlled, classification accuracy for the three pairs of homorganic voiced vs. voiceless stop contrasts gradually improved and became comparable to the classification accuracy of Miller et al.’s (1986) rate-dependent category boundary at one level or another. The results are consistent with our hypothesis that the VOT category boundary between voiced vs. voiceless stop phonemes need not be adjusted for articulation rate in spontaneous conversational speech to maintain a high degree of accurate phoneme classification. We detail below how rate-independent category boundaries fared against Miller et al.’s (1986) rate-dependent category boundary at each level of data control.
3.2 Word-initial homorganic stop pairs, unrestricted otherwise
VOT is affected by the stop’s place of articulation (e.g., Lisker & Abramson, 1967), which is reflected in the perceptual VOT category boundary location between voiced vs. voiceless stops (e.g., Lisker & Abramson, 1970). At the first level of data control, we therefore grouped all word-initial stop phonemes by place of articulation. Figure 2 plots the durational distributions of VOT for the three pairs of homorganic stops. VOT distributions for /b/-/p/ are reasonably well separated, while those for /d/-/t/ and /g/-/k/ appear to have non-negligible overlap. As given in Table 2 above, rate-independent optimal category boundaries correctly classified 94.8% of /b/-/p/ (at 16 ms), 89.0% of /d/-/t/ (24 ms), and 91.2% of /g/-/k/ (27 ms).
Chi-square tests6 were used to compare the number of items correctly vs. wrongly classified by the rate-independent optimal category boundaries for the three homorganic stop contrasts against the overall classification accuracy of Miller et al.’s (1986) rate-dependent optimal category boundary for /bi/-/pi/ (97.6%, n = 1,013). (Miller et al. investigated /bi/-/pi/ only.) All three rate-independent category boundaries performed significantly worse than Miller et al.’s (1986) rate-dependent category boundary (/b/-/p/: χ²(1) = 13.0; /d/-/t/: χ²(1) = 70.5; /g/-/k/: χ²(1) = 41.7; all ps < .001).
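A comparison of this kind amounts to a chi-square test on a 2 × 2 table of correct vs. wrong classifications. The sketch below is an illustration under stated assumptions, not a reproduction of the reported statistics: it uses the Pearson statistic without continuity correction (whether a correction was applied in the study is not stated here), the Miller et al. counts are back-calculated from 97.6% of 1,013 and rounded, and the spontaneous-speech counts are hypothetical.

```python
import math


def chi_square_2x2(correct_a, wrong_a, correct_b, wrong_b):
    """Pearson chi-square (1 df, no continuity correction) for a 2 x 2 table.

    Returns the statistic and its p-value. For 1 df the upper-tail
    probability equals erfc(sqrt(chi2 / 2)).
    """
    n_a = correct_a + wrong_a
    n_b = correct_b + wrong_b
    n_correct = correct_a + correct_b
    n_wrong = wrong_a + wrong_b
    if 0 in (n_a, n_b, n_correct, n_wrong):
        raise ValueError("every row and column total must be non-zero")
    total = n_a + n_b
    chi2 = (total * (correct_a * wrong_b - wrong_a * correct_b) ** 2
            / (n_a * n_b * n_correct * n_wrong))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rate-dependent boundary: about 97.6% of 1,013 items correct (rounded counts).
miller_correct, miller_wrong = 989, 24
# Rate-independent boundary: hypothetical counts (94.8% of 2,000 tokens).
spont_correct, spont_wrong = 1896, 104

chi2, p_value = chi_square_2x2(miller_correct, miller_wrong,
                               spont_correct, spont_wrong)
print(round(chi2, 2), p_value < 0.001)
```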
3.3 Word-initial homorganic stop pairs, restricted to content words
Next, we excluded function words and examined the VOT of word-initial stops in content words only. The exclusion of function words was expected to significantly improve the accuracy of rate-independent category boundaries. Common function words are frequent in occurrence and susceptible to phonetic reduction across syllable rates (Fosler-Lussier & Morgan, 1999). Moreover, function words are more often recognized after the word’s acoustic offset, that is, not immediately recognized from acoustic information alone (Bard et al., 1988), which suggests that their acoustic encoding is prone to ambiguity. As Table 3 shows, at the previous level of data control, function words indeed contributed proportionally more to the overlap in the VOT distributions of voiced and voiceless stops for all homorganic pairs than did content words with word-initial lexical stress (but not necessarily more than content words with an unstressed word-initial syllable; more on this in Section 3.4).
|Content, Stressed initial syllable||/b/-/p/||1,536||3.3%|
|Content, Unstressed initial syllable||/b/-/p/||291||13.0%|
Figure 3 shows the VOT distributions of each homorganic stop pair when function words were excluded. For all pairs (particularly /d/-/t/) voiced vs. voiceless stops were better separated than at the previous level of data control. As given in Table 2 above, rate-independent optimal category boundaries now correctly classified 96.4% of /b/-/p/ (at 13 ms), 96.2% of /d/-/t/ (28 ms), and 92.7% of /g/-/k/ (27 ms).
The improvement in the accuracy of the rate-independent optimal category boundary was statistically significant for all three stop pairs (/b/-/p/: χ²(1) = 5.7, p = .02; /d/-/t/: χ²(1) = 100.7, p < .001; /g/-/k/: χ²(1) = 4.2, p = .04). The accuracy of rate-independent category boundaries for /b/-/p/ and /d/-/t/ now only marginally differed from the accuracy of Miller et al.’s (1986) rate-dependent category boundary (/b/-/p/: χ²(1) = 2.89, p = .09; /d/-/t/: χ²(1) = 3.8, p = .05), although the rate-independent category boundary for /g/-/k/ still performed significantly worse than Miller et al.’s (1986) rate-dependent category boundary (χ²(1) = 26.0, p < .001).
As stated earlier, we excluded function words on the premise that common function words are susceptible to phonetic reduction across syllable rates and their acoustic encoding can be ambiguous. Is it possible that by excluding function words we have removed the benefits of the rate-dependent category boundary?7
To address this issue, we compared the effectiveness of rate-independent vs. rate-dependent VOT category boundaries for /d/ in /du/ (do) and /t/ in /tu/ (to, too, and two). We chose these words because to was by far the most frequent function word (n = 1,596), accounting for 43% of their occurrences, and its voiced counterpart do occurred reasonably often (n = 319, verb and auxiliary verb usage combined). Too and two were also frequent among content words (n = 33 and 93). All speakers produced multiple measurable tokens of to and do, and at least one measurable token of too or two.
For the estimation of articulation rate, segments in do, to, too, and two were not used, as their short durations (especially segments in to and auxiliary verb do) can potentially be ascribed to phonetic reduction. Instead, the mean duration of segments in the preceding word was used as a rough index of local articulation rate, assuming similar articulation rates for adjacent stretches of speech. Mean segment (rather than syllable) durations were used, as the former correlated more strongly with the duration of the target VOT: rs = .24 (p < .003) for /du/, rs = .31 (p < .001) for /tu/, according to Spearman’s rank correlation tests.
Because of the way articulation rate was estimated, the analysis here excludes utterance-initial do, to, too, and two, which had no preceding word within the same utterance. Also excluded were cases where the preceding word duration could not be measured using a supralaryngeal criterion (Turk et al., 2006), for example, where the initial segment of the preceding word was a stop phoneme following a pause. This left for analysis 161 tokens of do, 667 to, 17 too, and 63 two.
Figure 4 shows the relationship between the VOT of /du/ and /tu/, and the mean segment duration of the preceding word (hereafter “articulation rate”). Several observations can be made. First, most instances of /t/ with a short VOT (< c. 25 ms) belonged to to produced at fast-mid articulation rates (mean duration of preceding segments < c. 100 ms). Such a short VOT was seldom found for too and two, even though these words also mainly occurred at fast-mid articulation rates. Thus, the short VOT observed for many tokens of to at fast-mid articulation rates seems to have arisen from phonetic reduction rather than articulation rate per se. Phonetic reduction was, unsurprisingly, unlikely to occur at slow articulation rates (see also Frank & Jaeger, 2008). Other types of reduction, for example, vowel devoicing, found for to but excluded from the analysis (see Section 2.2), also occurred predominantly at fast-mid articulation rates.
Second, VOT for do did not strictly increase with articulation rate, although there was a weak positive correlation between the two. Third, at fast-mid articulation rates, where a majority of do and to (both 80%) occurred, their VOT distributions completely overlapped at the short VOT range. As a result, rate-dependent optimal category boundaries produced little advantage over the rate-independent optimal category boundary. The rate-independent optimal category boundary for all tokens of do, to, too, and two yielded a classification accuracy of 84.0% (at 6–7 ms). A rate-dependent category boundary yielded 84.7% classification accuracy when the optimal boundary was adjusted for each 50-ms bin of the estimated articulation rate, and 84.8% accuracy when the boundary was adjusted for each 25-ms bin. The effectiveness of the rate-dependent boundaries did not differ significantly from that of the rate-independent boundary (50-ms bin: χ²(1) = 0.10, p = .75; 25-ms bin: χ²(1) = 0.15, p = .70).
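The comparison between a single boundary and boundaries adjusted per articulation-rate bin can be sketched as follows. This is a hedged illustration only: the token tuples, the function names, and the tie handling at the boundary are assumptions, and the values are invented. By construction the rate-dependent accuracy can never fall below the rate-independent one; the question is how much it gains.

```python
from collections import defaultdict


def best_correct(tokens, last_step):
    """Largest number of correct classifications over boundaries 1..last_step ms.

    Each token is (rate_ms, vot_ms, is_voiced); VOT < boundary means voiced.
    """
    return max(
        sum((vot < boundary) == is_voiced for _, vot, is_voiced in tokens)
        for boundary in range(1, last_step + 1)
    )


def rate_independent_accuracy(tokens):
    """Accuracy of one optimal boundary applied to all tokens."""
    last_step = int(max(vot for _, vot, _ in tokens)) + 1
    return best_correct(tokens, last_step) / len(tokens)


def rate_dependent_accuracy(tokens, bin_width_ms):
    """Accuracy when the optimal boundary is re-estimated within each rate bin."""
    last_step = int(max(vot for _, vot, _ in tokens)) + 1
    bins = defaultdict(list)
    for token in tokens:
        bins[int(token[0] // bin_width_ms)].append(token)
    correct = sum(best_correct(group, last_step) for group in bins.values())
    return correct / len(tokens)


# Hypothetical tokens: (mean preceding segment duration in ms, VOT in ms, voiced?).
tokens = [
    (60, 4, True), (70, 9, True), (80, 6, False), (90, 28, False),
    (110, 14, True), (120, 18, True), (130, 35, False), (140, 50, False),
]
independent = rate_independent_accuracy(tokens)
dependent = rate_dependent_accuracy(tokens, bin_width_ms=50)
print(independent, dependent)  # 0.875 0.875
```

In this toy example the one misclassified item is a voiceless token with a very short VOT in the fast bin, and no boundary within that bin can rescue it, which mirrors the situation described for reduced tokens of to.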
As we argued in the Introduction, spontaneous speech is not readily amenable to articulation rate measurement because of numerous confounding factors that cannot be controlled easily. However, to the extent that the mean segment durations of the preceding word reflected articulation rate, we found no clear advantage of rate-dependent over rate-independent VOT category boundaries.
The failure of rate-dependent VOT category boundaries to improve the classification accuracy of /du/-/tu/ does not mean that /tu/ with a short VOT cannot be acoustically distinguished from /du/. A further inspection of the data reveals that /u/ (measured from the onset of voicing) was shorter in a majority of instances of /tu/ than /du/, especially where VOT does not distinguish the two (see Figure 5). If we classify all instances with a short /u/ (< 80 ms) as /tu/ regardless of VOT, and apply a rate-independent VOT category boundary to the rest, we obtain a classification accuracy of 95.2% (at 23 ms), a significant improvement over the 84.0% accuracy of the rate-independent category boundary (at 6–7 ms) that ignores the following vowel duration (χ²(1) = 59.7, p < .001). Dividing the following vowel durations into further groups did not significantly improve the classification accuracy. (Classification accuracy achieved here was still poorer than the 97.6% of Miller et al.’s rate-dependent category boundary [χ²(1) = 7.5, p < .007]. We return to this issue in the discussion.)
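The two-step rule described above (vowel duration first, then VOT) can be written as a small classifier. The 80 ms vowel threshold and the 23 ms VOT boundary are taken from the text; the function name, the handling of values exactly at a threshold, and the tokens are hypothetical and serve only as illustration.

```python
def classify_du_tu(vot_ms, vowel_ms, vot_boundary_ms=23.0, short_vowel_ms=80.0):
    """Classify a token as /d/ or /t/ using vowel duration, then VOT.

    A short /u/ (below short_vowel_ms) is labelled /t/ regardless of VOT;
    otherwise a VOT below vot_boundary_ms is labelled /d/.
    """
    if vowel_ms < short_vowel_ms:
        return "t"
    return "d" if vot_ms < vot_boundary_ms else "t"


# Hypothetical tokens: (intended stop, VOT in ms, /u/ duration in ms).
tokens = [
    ("d", 4, 120), ("d", 15, 150),
    ("t", 5, 40), ("t", 8, 60), ("t", 45, 130), ("t", 20, 95),
]
n_correct = sum(classify_du_tu(vot, vowel) == stop for stop, vot, vowel in tokens)
two_step_accuracy = n_correct / len(tokens)
print(n_correct, len(tokens))  # 5 6
```

The last hypothetical token shows the residual error type: a /t/ with both a short VOT and a vowel of 80 ms or longer is still labelled /d/.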
Importantly, the short duration of /u/ found in many instances of /tu/ (particularly to) does not seem to have arisen primarily from fast articulation rates. As can be seen in Figure 6, across articulation rates we find instances of /tu/ whose voiced portion is shorter than 80 ms, the cutoff used in the earlier analysis to distinguish /tu/ from /du/ where VOT was neutralized. In contrast, only a handful of instances of /du/ had such a short /u/ even at fast articulation rates.
In line with the above observations, regression models fitted to the data indicated that only 2% of variance in /u/ duration was explained by articulation rate alone, while 17% of variance was explained when the preceding stop’s voicing specification (/d/ vs. /t/) was added to the model (a significant increase in explanatory power at p < .001). As Figure 7 shows, /u/ is generally shorter in /tu/ than in /du/, and a very short /u/ suggests that the word is to. These results are consistent with Toscano and McMurray’s (2012) finding that English-speaking listeners interpret the following vowel duration as a cue to the voicing specification of the preceding stop onset rather than articulation rate.
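The logic of comparing nested models by explained variance can be sketched as follows. The text does not specify the model family, so ordinary least squares is assumed here; for two predictors, R² is computed from the pairwise correlations using the standard closed-form identity. All names and the toy values are hypothetical and are not the data analyzed here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r2_one(y, x1):
    """R-squared of a least-squares model with a single predictor."""
    return pearson(y, x1) ** 2

def r2_two(y, x1, x2):
    """R-squared of a least-squares model with two predictors."""
    r1, r2, r12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    return (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)

# Hypothetical toy data: rate index (ms), voicing (1 = voiceless), /u/ duration (ms).
rate = [50, 60, 70, 80, 90, 100]
voiceless = [0, 1, 0, 1, 0, 1]
duration = [100 + 0.1 * r - 40 * v for r, v in zip(rate, voiceless)]

r2_rate_only = r2_one(duration, rate)
r2_rate_and_voicing = r2_two(duration, rate, voiceless)
```

In this constructed example, voicing carries most of the variance in duration, so adding it to the rate-only model raises R² sharply; whether such an increase is significant would be assessed with a model-comparison test.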
3.4 Word-initial homorganic stop pairs in lexically stressed syllables of content words
We saw that the stops were more likely to be misclassified in lexically unstressed than in stressed syllables of content words at the initial level of data control, where the optimal category boundary was estimated for all measured VOT for each homorganic pair of word-initial stop phonemes (see Table 3 above). This observation is consistent with Lisker and Abramson’s (1967) report that VOT values for English voiced vs. voiceless stops were less distinct in lexically unstressed syllables. As can be seen in Figure 8, in our data too the VOT distributions for /b/-/p/ and /d/-/t/ had a greater overlap in lexically unstressed than stressed syllables. As a result, fewer voiced vs. voiceless stops in lexically unstressed syllables were correctly classified than were stops in stressed syllables, even when optimal VOT category boundaries were separately estimated for the two types of syllables (/b/-/p/: 97.7% vs. 92.1%, χ²(1) = 22.5, p < .001; /d/-/t/: 96.7% vs. 94.0%, χ²(1) = 4.6, p = .03). As for /g/-/k/, their VOT distributions completely overlapped for unstressed syllables, though this may be ascribed to the paucity of /g/ in word-initial unstressed syllables.
As one would expect, further excluding content words without word-initial lexical stress shifted classification accuracy in the right direction (see Table 2 above): 97.7% for /b/-/p/ (at 13 ms), 96.7% for /d/-/t/ (26 ms), and 94.2% for /g/-/k/ (31 ms). The classification accuracy of rate-independent category boundaries for /b/-/p/ and /d/-/t/ no longer differed significantly from the overall accuracy of Miller et al.’s (1986) rate-dependent category boundary (χ²(1) = 0, p = 1; χ²(1) = 1.8, p = .18), though the accuracy of the rate-independent category boundary was still poorer for /g/-/k/ (χ²(1) = 16.4, p < .001).
Except for /b/-/p/, however, the effectiveness of rate-independent category boundaries did not differ significantly from the previous level, where the data consisted of word-initial stops in all content words (/b/-/p/: χ²(1) = 4.1, p = .04; /d/-/t/: χ²(1) = 0.47, p = .49; /g/-/k/: χ²(1) = 1.6, p = .21). The lack of significant improvement for /d/-/t/ and /g/-/k/ can be ascribed to the relatively small number of content words with non-initial stress. As shown in Table 3 above, such words were few, accounting for only 10% of measured tokens, consistent with Cutler and Carter’s (1987) report.
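For readers who wish to check the relation between the χ² statistics and the p values reported in this section, the following sketch may be useful. With one degree of freedom, the upper-tail probability has the closed form erfc(√(χ²/2)). The 2 × 2 statistic shown is the uncorrected Pearson form; whether a continuity correction was applied in the analyses above is not stated, so that choice is an assumption, and the function names and example counts are hypothetical.

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    For example, rows could be two boundary types and columns the counts of
    correctly and incorrectly classified tokens.
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))

# Statistics reported for the three stop pairs in the preceding paragraph.
p_bp = p_value_df1(4.1)
p_dt = p_value_df1(0.47)
p_gk = p_value_df1(1.6)
```

Rounded to two decimals, these values agree with the reported p = .04, .49, and .21.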
Interestingly, the voicing specifications of stops in word-initial unstressed syllables were largely predictable from the following vowel: 93% of /b/, 97% of /d/, and both of the two tokens of /g/ were followed by /ɪ/, while 99% of /p/, and all instances of /t/ and /k/ were followed by /ə/. When each stop pair was analyzed separately depending on the following vowel, rate-independent category boundaries classified voiced vs. voiceless stops with high accuracy: 99% for /b/-/p/ and /d/-/t/, and 100% for /g/-/k/. These classification accuracies were higher than the overall accuracy of Miller et al.’s (1986) rate-dependent category boundary, though the difference was significant for /g/-/k/ only (/b/-/p/: χ²(1) = 0.8, p = .38; /d/-/t/: χ²(1) = 2.6, p = .11; /g/-/k/: χ²(1) = 8.3, p = .004).
3.5 Word-initial homorganic stop pairs in lexically stressed syllables of content words, grouped by the following vowel
The vowel following a stop onset has been reported to affect the VOT of the stop onset, and the locations of perceptual VOT category boundaries between voiced vs. voiceless stop onsets (Higgins et al., 1998; Klatt, 1975; Nearey & Rochet, 1994; Summerfield, 1975, 1981). Though there are some discrepancies in the details, the general finding is that stops tend to be accompanied by a longer VOT when they precede phonologically high vowels than non-high vowels.
At the final and most allophonically-rich level of data control, word-initial homorganic stop phonemes in lexically stressed syllables of content words were grouped by the following vowel phoneme, and separate rate-independent optimal category boundaries were estimated for each group. None of the speakers had a strong regional accent beyond that of their country of origin (England, USA, or Australia). The vowel groups used here therefore reflected broad dialectal differences reported for the three varieties, for example, /ɒ/ for the vowel in pot in Anglo English, /ɑ/ for American English, and /ɔ/ for Australian English (Harrington et al., 1997; Wells, 1996). Because only one Australian speaker was represented in our data, vowel phonemes only reported for Australian English were placed with vowels of the same phonological height in other varieties: /ɐ/ and /ɐː/ were grouped with /æ/, and /ʉ/ was grouped with /u/. As there were only a few instances of them, Anglo English /əʊ/ and Australian /əʉ/ were grouped with /ɜ/. Diphthongs were grouped based on their initial element; for example, /aɪ/ and /aʊ/ were grouped together. Table 4 gives the resulting optimal VOT category boundary locations.
| Following vowel | Phonological height | /b/-/p/ | /d/-/t/ | /g/-/k/ |
|---|---|---|---|---|
| /i/ | High | 6–9 (n = 327) | 28–30 (n = 117) | (see Note) |
| /ɪ/ | High | 16–18 (n = 200) | 32–36 (n = 246) | 42 (n = 98) |
| /u/ | High | 13–24; no overlap (n = 10) | 36–40 (n = 451) | 26–78; no overlap (n = 6) |
| /ʊ/ | High | 12–21 (n = 76) | (see Note) | 39–51; no overlap (n = 91) |
| /e/ | Mid | 13–15 (n = 137) | 26–27 (n = 296) | 31 (n = 189) |
| /ɛ/ | Mid | 15–16 (n = 63) | 19–21 (n = 95) | 37–39 (n = 101) |
| /ɜ/ | Mid | 12 (n = 115) | 17–26; no overlap (n = 77) | 36–39 (n = 144) |
| /o/ | Mid | 9–27; no overlap (n = 14) | 1–34; no overlap (n = 12) | 41–43 (n = 91) |
| /ɔ/ | Mid | 14–21 (n = 89) | 21–23; no overlap (n = 69) | 25–28; no overlap (n = 96) |
| /ʌ/ | Mid | 11–14; no overlap (n = 35) | 33–36; no overlap (n = 93) | 19–21 (n = 238) |
| /æ/ | Low | 14–15 (n = 257) | 26–33 (n = 81) | 18 (n = 122) |
| /ɒ/ | Low | 12–14; no overlap (n = 48) | 18–31; no overlap (n = 19) | 24–29 (n = 156) |
| /ɑ/ | Low | 14–16 (n = 148) | 19 (n = 116) | 24 (n = 254) |
| /a/ | Low | 9–23; no overlap (n = 17) | 21 (n = 324) | 19–22 (n = 114) |
These category boundaries produced an overall classification accuracy of 98.4% for /b/-/p/, 98.2% for /d/-/t/, and 97.8% for /g/-/k/ (see Table 2 above), excluding /tʊ/ and /ki/, whose inclusion would have led to higher classification accuracies, as their voiced counterparts (/dʊ/ and /gi/) did not occur in our data. For all three pairs of stops the classification accuracy achieved here was slightly better than, though not significantly different from, the 97.6% accuracy achieved by Miller et al.’s (1986) rate-dependent category boundary for /bi/-/pi/ (/b/-/p/: χ²(1) = 1.41, p = .24; /d/-/t/: χ²(1) = .83, p = .36; /g/-/k/: χ²(1) = .01, p = .93). Compared to the previous level of data control, the classification accuracy improved significantly for /d/-/t/ and /g/-/k/ (χ²(1) = 8.28, p = .004; χ²(1) = 27.6, p < .001) but not for /b/-/p/ (χ²(1) = 1.67, p = .20). We do not know why the following vowels affected the boundary location for /d/-/t/ and /g/-/k/ more than for /b/-/p/, but Nearey and Rochet (1994) report similar findings in perception.
Based on previous studies on perceptual VOT category boundary locations (Nearey & Rochet, 1994; Summerfield, 1975, 1981), we had expected larger VOT values at the category boundary in high vowel contexts and smaller values in low vowel contexts, particularly for alveolar and velar stops. This appeared to be true of our data to some extent, but the differences in VOT boundary location between vowel contexts seemed more directly linked to the difference in the relative frequency of occurrences of voiced vs. voiceless stop phonemes between vowel contexts than to the phonological height of the following vowel.
Figure 9 shows the relationship between the estimated optimal VOT category boundary location in various vowel contexts, and the difference in logarithmic token frequency between voiced vs. voiceless members of each homorganic stop pair in each context. As we saw in Table 4, the range of estimated boundary locations was large in some cases. To ensure some degree of reliability of the estimated boundary location, we only used (1) boundary locations that could be estimated within 1 ms and (2) the midpoint of the estimated range when the boundary location could be defined within 5 ms or less. Figure 9 suggests that the more frequently a voiceless stop onset occurred relative to its voiced counterpart before a given vowel (the larger the value on the x-axis), the smaller the estimated VOT boundary was. According to Pearson correlation tests, this correlation was significant for all three stop pairs (/b/-/p/: r = –.81; /d/-/t/: r = –.75; /g/-/k/: r = –.94; all ps < .02). At the same time, the results exhibited a tendency consistent with the observation of boundaries at a large VOT value for high vowel contexts and a small VOT value for low vowel contexts for alveolar and velar stops.
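The selection rule for boundary locations and the x-axis value used in Figure 9 can be sketched as follows. The sketch reads the rule as "use the midpoint whenever the estimated range spans 5 ms or less, otherwise exclude the context", which also covers boundaries estimated to a single value. The base of the logarithm is not stated in the text, so the natural logarithm is an assumption; the function names are hypothetical.

```python
from math import log

def usable_boundary(low_ms, high_ms, max_width_ms=5):
    """Midpoint of the estimated range, or None if the range is too wide."""
    if high_ms - low_ms > max_width_ms:
        return None
    return (low_ms + high_ms) / 2

def log_freq_difference(n_voiceless, n_voiced):
    """Positive when the voiceless member is the more frequent one."""
    return log(n_voiceless) - log(n_voiced)

# Ranges taken from Table 4.
dt_before_i = usable_boundary(28, 30)   # /d/-/t/ before /i/
gk_before_u = usable_boundary(26, 78)   # /g/-/k/ before /u/: range too wide
gk_before_I = usable_boundary(42, 42)   # /g/-/k/ before /ɪ/: single value
```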
Recall that the optimal category boundary location was determined on the basis of maximum classification accuracy for voiced and voiceless stops combined (see Section 2.3). All else being equal, the more frequent a phoneme is within the region of distributional overlap, the greater the phoneme’s contribution to the calculation of overall classification accuracy; this pushes the optimal category boundary away from that phoneme. The results above suggest that for alveolar and velar places of articulation, voiceless stops tend to occur less frequently than their voiced counterparts in high vowel contexts and more frequently in low vowel contexts in word-initial position in English, and the category boundary is pushed in different directions depending on the vowel context, towards the less frequent voicing category. Notice that this produces an effect similar to the frequency effects on the perceptual category boundary location between phonemes reported by Kataoka and Johnson (2007).
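The push of the optimal boundary away from the more frequent category can be reproduced with a small constructed example. The sketch below assumes the same accuracy-maximizing criterion with whole-millisecond candidate boundaries; the function name and VOT values are hypothetical. The two categories overlap in one token each (a voiceless token at 20 ms and a voiced token at 25 ms), and only the relative frequency of the categories is varied.

```python
def optimal_range(voiced, voiceless):
    """Return (min, max) of the integer boundaries that maximize accuracy.

    A token is classified as voiceless when its VOT exceeds the boundary.
    """
    tokens = [(v, False) for v in voiced] + [(v, True) for v in voiceless]
    lo, hi = min(voiced + voiceless) - 1, max(voiced + voiceless) + 1
    scores = {b: sum((v > b) == vl for v, vl in tokens) for b in range(lo, hi + 1)}
    top = max(scores.values())
    best = [b for b, s in scores.items() if s == top]
    return min(best), max(best)

voiced = [5, 10, 15, 25]       # hypothetical VOT values in ms
voiceless = [20, 30, 40, 50]

# Tripling one category changes only its token frequency, not its VOT range.
voiceless_heavy = optimal_range(voiced, voiceless * 3)  # (15, 19)
voiced_heavy = optimal_range(voiced * 3, voiceless)     # (25, 29)
```

When voiceless tokens dominate, the optimum sacrifices the overlapping voiced token and settles at a smaller VOT value; when voiced tokens dominate, the reverse holds.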
It is worth noting, in addition, that the total range of VOT values for voiced vs. voiceless stops differed between vowel contexts in our data, in a manner consistent with the observed difference in boundary location between vowel contexts.
First, the more frequently a voiced stop occurred before a given vowel, the longer its maximum VOT tended to be, likely contributing to a larger VOT value at the optimal category boundary. For /d/ and /g/, Pearson correlation tests indicated a significant positive correlation between each stop’s logarithmic token frequencies and maximum VOT values in different vowel contexts (/d/: r = .85; /g/: r = .83; both ps < .001). For /b/, which had a large outlier, the correlation was not significant in a Pearson test (r = .34, p = .24) but significant in a Spearman test, which is robust to the presence of outliers (rs = .64, p = .01).
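The difference between the two tests in the presence of an outlier can be illustrated with a short sketch. Spearman's coefficient is computed here as the Pearson correlation of average ranks; the function names and values are hypothetical and are not the /b/ data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values: a monotonic relation with one extreme observation.
log_frequency = [1, 2, 3, 4, 5, 6]
max_vot_ms = [1, 2, 3, 4, 5, 100]

r = pearson(log_frequency, max_vot_ms)
rs = spearman(log_frequency, max_vot_ms)
```

The single extreme value lowers r well below 1, whereas rs is unaffected because only the rank order enters the calculation.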
Conversely, the more frequently a voiceless stop occurred before a given vowel, the shorter its minimum VOT was, likely contributing to a smaller VOT value at the optimal category boundary. Pearson tests indicated a significant negative correlation between each voiceless stop’s logarithmic token frequencies and minimum VOT values in different vowel contexts (/p/: r = –.73, p = .003; /t/: r = –.63, p = .02; /k/: r = –.74, p = .002). The picture was similar for voiced vs. voiceless stops in lexically unstressed word-initial syllables discussed in Section 3.4, although the observations in infrequent vowel contexts were very small in number.
The above observation itself does not necessarily imply different underlying VOT distributions for a given stop phoneme in more vs. less frequent vowel contexts, as the likelihood of obtaining extreme values from the same underlying distribution increases with sample size.⁸ However, the results of Fricke’s (2013) recent study of voiceless stop onsets in English spontaneous speech point to the possibility that underlying VOT distributions themselves may in fact differ between more vs. less frequent contexts in which the stop occurs. At any rate, our observation does suggest that in real-life conversation we are more likely to encounter extreme VOT values for a stop in a more frequent vowel context. This can also push the perceptual category boundary location towards the less frequent phoneme.
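The dependence of extreme values on sample size follows from elementary probability: if n tokens are drawn independently from a distribution with cumulative distribution function F, the probability that at least one exceeds x is 1 − F(x)ⁿ. The sketch below assumes independent draws and a normal distribution with hypothetical parameters, purely for illustration.

```python
from statistics import NormalDist

def prob_max_exceeds(threshold_ms, n_tokens, dist):
    """P(at least one of n independent draws exceeds the threshold)."""
    return 1 - dist.cdf(threshold_ms) ** n_tokens

# Hypothetical VOT distribution for a voiced stop (mean 10 ms, SD 5 ms).
vot = NormalDist(mu=10, sigma=5)

rare_context = prob_max_exceeds(25, n_tokens=10, dist=vot)
frequent_context = prob_max_exceeds(25, n_tokens=400, dist=vot)
```

With the distribution held constant, a VOT more than three standard deviations above the mean is unlikely to be observed among 10 tokens but fairly likely among 400.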
In this study we examined the effectiveness of the rate-independent VOT category boundary for word-initial English voiced vs. voiceless stop phonemes in unscripted conversational speech. Articulation rate varied in our data in, we assume, a natural way; variation in articulation rate was certainly observable across and within speakers, at a qualitative level. Yet, our data suggested that there is no compelling need for listeners to normalize perceptual VOT category boundary locations for word-initial voiced vs. voiceless stops in accordance with articulation rate, supporting Kessinger and Blumstein’s (1997) proposal.
Rate-independent optimal VOT category boundaries classified all three pairs of homorganic, word-initial voiced vs. voiceless categories in content words at accuracy comparable to (or better than) Miller et al.’s (1986) rate-dependent category boundary, when the stops were analyzed separately depending on the presence of lexical stress and the following vowel phoneme. The inclusion of function words led to lower classification accuracy, but in our analysis of /du/ vs. /tu/ (do vs. to, too, and two), classification accuracy did not improve much when rate-dependent category boundaries were adopted (using the mean segment duration of the preceding word as an index of local articulation rate). Classification accuracy improved significantly, however, when the short duration of /u/ (measured from the onset of voicing) in /tu/ was treated as an additional cue to the /du/-/tu/ opposition. Crucially, the short duration of /u/ in /tu/ relative to /du/ was found across our measure of articulation rate and could not be ascribed to fast articulation rates of /tu/.
Thus, the lack of large shifts in perceptual VOT category boundary locations for word-initial stops in previous rate normalization studies can be seen to reflect the listeners’ experience of temporal regularities of speech they normally encounter. The small but consistent shifts in VOT category boundary locations found in these perception studies are perhaps better interpreted as arising from cue integration (Toscano & McMurray, 2012) or general auditory (proximal or distal) contrast effects (Diehl & Walsh, 1989; Holland & Lockhead, 1968; Pisoni et al., 1983).
The point we wish to make here is simple: If rate normalization reflects the temporal regularities of the ambient language, then we have little reason to expect such a process where the language does not require it. For example, the durational distributions of singleton vs. geminate sonorants in Cypriot Greek are reported to be well separated across different rates of articulation (Arvaniti, 1999). We therefore do not expect Cypriot Greek speakers to shift the perceptual category boundary for the contrasts with a change in articulation rate. On the other hand, our data suggest that English function words are less likely to be reduced under slow articulation rates. We therefore expect English-speaking listeners to less often report reduced function words in ambiguous speech stimuli when surrounding speech is slow (Baese-Berk et al., 2014; Dilley & Pitt, 2010).
In the absence of relevant information, we have little to say about rate-dependent shifts in perceptual category boundaries reported for other contrasts, for example, the /b/-/w/ distinction in English (e.g., Miller & Liberman, 1979) or consonant and vowel quantity in other languages (Icelandic: Pind, 1995; Japanese: Fujisaki et al., 1975; see also Hirata & Lambacher, 2004).
That said, we think that a lack of need for rate normalization may be found for more contrasts, as durational variation arising from different articulation rates is presumably more malleable than other aspects of speech that are thought to necessitate perceptual normalization such as formant frequencies associated with vocal tract length (but see Johnson, 1997). Even though listeners can apparently cope with such situations (e.g., Ladefoged & Broadbent, 1957; Syrdal & Gopal, 1979), perceptual normalization is not cost-free (Mullennix et al., 1989, 2002; Nakai, 2013; Sommers et al., 1994).
While the rate-dependent VOT category boundary did not seem to significantly improve the classification accuracy of word-initial voiced vs. voiceless stop phonemes, vowel-dependent VOT category boundaries did. The vowel-dependent category boundary produced an optimal category boundary with a relatively small VOT value in a vowel context where voiceless stops were more frequent than voiced stops, and a large VOT value where voiced stops were more frequent than voiceless stops. This resulted in a VOT boundary location at a generally larger VOT value for phonologically high vowel contexts and a smaller VOT value for low vowel contexts for alveolar and velar stops, as previously reported in perception studies (Nearey & Rochet, 1994; Summerfield, 1975, 1981).
If shifting a VOT category boundary depending on articulation rate were costly to the perception mechanism, would vowel-dependent category boundaries not be costly too? With the caveat that we did not conduct any perception experiments, it is plausible that listeners use categories other than phonemes as basic units in their analysis of incoming speech where it makes sense to do so, as proposed by Reinisch et al. (2014). Rather than normalizing a phoneme-based category boundary depending on the following vowel, listeners may look for units larger than a phoneme (e.g., CV) and use category boundaries specific to these units.
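The idea of unit-specific boundaries can be made concrete with a lookup keyed by stop pair and following vowel. The sketch below is a schematic illustration, not a model of perception: the boundary values are taken from Table 4 (ranges reduced to their midpoints), while the names and the decision rule (voiceless when VOT exceeds the boundary) are assumptions.

```python
# Boundaries in ms from Table 4; ranges are reduced to their midpoints.
CV_BOUNDARY_MS = {
    ("d/t", "i"): 29.0,  # reported range 28–30
    ("d/t", "a"): 21.0,
    ("g/k", "ɪ"): 42.0,
    ("g/k", "æ"): 18.0,
}

def classify_cv(pair, vowel, vot_ms):
    """Classify a stop using the boundary specific to its CV unit."""
    boundary = CV_BOUNDARY_MS[(pair, vowel)]
    return "voiceless" if vot_ms > boundary else "voiced"

# The same 25-ms VOT receives different labels in different vowel contexts.
before_i = classify_cv("d/t", "i", 25)
before_a = classify_cv("d/t", "a", 25)
```

No boundary is shifted at classification time; the vowel simply selects which stored boundary applies.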
Notice that this scenario is compatible with the observation that do and to can be largely distinguished on the basis of the duration of /u/, where VOT is neutralized. The scenario also sits well with previous findings that listeners interpret acoustic cues to the place of articulation of stop onsets (e.g., burst, formant transition) differently depending on the following vowel (Cooper et al., 1952; Dorman et al., 1977) and that listeners can largely identify the following vowel from the brief period immediately after the stop release (Blumstein & Stevens, 1980). Arguably, structural or phonological contexts such as the following vowel in the same word are different in kind from contexts such as articulation rate. Vowels, being discrete units that constitute a part of a word, can more readily serve as a part of the basic unit of analysis in speech perception, unlike articulation rate, which forms a continuum and is presumably unspecified in the lexicon.
Another point we wish to stress is that the greater overlap in the VOT distributions of voiced vs. voiceless stops in function words compared to content words with initial lexical stress was more directly linked to the difference in their overall frequencies, rather than their different lexical statuses. In our data, function words overall had a higher token frequency than did content words with initial lexical stress (calculated for lemmas), as shown in Figure 10. And, as shown in Figure 11, at the initial level of analysis, the more frequent content and function words were, the more likely they were to have a VOT that fell on the opposite side of the optimal category boundary, producing an overlap between voiced vs. voiceless VOT distributions (content words: rs = .32, p < .001; function words: rs = .78, p < .001). For example, word-initial stops in frequent content words like light verbs (e.g., do, get, give) were more likely to have a VOT that fell on the opposite side of the optimal category boundary than were stops in infrequent function words like per. Thus, the relatively high misclassification rate for /du/-/tu/ found in Section 3.3 may be at least partially ascribed to the high frequencies of /du/ and /tu/, especially do and to.
The foregoing observations of the relationships between frequency and VOT values, in relation to vowel contexts and lexical status, may be conceived in terms of predictability, which is highly correlated with frequency (e.g., Bell et al., 2009). A growing body of studies report phonetic and phonological reduction of frequent and/or predictable words and segments, which cannot simply be attributed to fast articulation rates (e.g., Aylett & Turk, 2006; Baese-Berk & Goldrick, 2009; Baran et al., 1977; Bybee, 2000; Ernestus, 2000; Fosler-Lussier & Morgan, 1999; Frank & Jaeger, 2008; Fricke, 2013; Gahl et al., 2012; Jurafsky et al., 2001; Lieberman, 1963; Munson, 2007). Where it is predictable, voicing specifications may not need to be as clearly signaled by VOT for successful communication, considering the facilitative effects of listener expectations on word recognition (e.g., Rubenstein & Pollack, 1963) and listener tolerance for acoustic mismatches in reduced speech (Brouwer et al., 2012). (Of course, other cues to the stop’s voicing specification may also be present, as was the case for to with a very short VOT.)
This is not to suggest that speakers consciously produce unpredictable speech segments more clearly and predictable speech segments less clearly. Clear enunciation of words and enhanced segmental contrasts (including those signaled by VOT) can result from listener-oriented considerations given by the speaker (Bradlow, 2002), but this is not always true (Baese-Berk & Goldrick, 2009; Bard et al., 1988; Gahl et al., 2012). That is, ambiguous renditions of predictable segments and words are not necessarily a product of speakers’ conscious production strategy.
A relatively short VOT for voiceless stops in frequent words and vowel contexts can arise from ease of lexical access on the speaker’s part as well as ease of articulation (Balota et al., 1989; Bard et al., 2000; Bell et al., 2009; Fricke, 2013; Gahl et al., 2012; Munson, 2007). This account, however, would not predict a relatively long VOT for voiced stops in frequent words and vowel contexts, for the ease of production is associated with reduced duration.
Another possibility, though not mutually exclusive with the above, is that clarity of enunciation of some segments/words reflects their phonetic representations. Since Norris et al.’s (2003) influential work, several studies have shown that perceptual category boundary locations for segmental contrasts are affected by ambiguous sounds when the ambiguous sounds are recognized by the listener as a part of a legitimate word (Clarke & Luce, 2005; Eisner et al., 2013; Kraljic & Samuel, 2005; Maye et al., 2008).
Conceivably, phonetic representations of less frequent segments and words are more likely to be shaped by their clear enunciations, as unpredictable segments/words produced with ambiguous pronunciations are less likely to lead to immediate recognition (see Pierrehumbert [2002] for a similar proposal in relation to lexical neighborhood density, and Wedel [2006] in relation to diachronic maintenance of phonemic contrasts). If so, then position and/or context-sensitive representations of phonemes (Dahan & Mead, 2010; Eisner et al., 2013; Mitterer et al., 2013) would predict more distinct phonetic representations of contrasts in positions and contexts where segments are less predictable, and word-specific phonetic representations (Bybee, 2000; Johnson, 2004; Klatt, 1979; Pierrehumbert, 2002; Wedel, 2006) would predict more distinct phonetic representations of less predictable words.
As a final note, the range of VOT values used to signal voiced vs. voiceless stop phonemes can differ between speakers of the same language, depending on factors such as gender and geographical origin (Docherty et al., 2011; Oh, 2011; Scobbie, 2006). It is currently unclear, however, to what extent such factors affect category boundary locations for voicing contrasts along acoustic cues like VOT, as past production studies focused on phonetic targets rather than category boundaries. In perception studies listener sensitivity to social-indexical acoustic variation has been shown, most notably for English vowels (e.g., Hay et al., 2006; Niedzielski, 1999). Listener sensitivity to social-indexical variation in VOT has also been shown, but shifts in perceptual VOT boundary locations for word-initial stops induced through manipulation of speaker gender (Toscano, 2011) or speaker adaptation training (Clarke & Luce, 2005; VanDam, 2007) appear very small in magnitude.⁹ We welcome further studies on the role of inter-personal and social-indexical factors in the production and perception of speech segments from various angles.