1. Introduction

In speech perception, listeners are confronted with the task of processing immense variation in both speech and listening conditions. Listeners are, however, remarkably adept at processing input from different kinds of speakers, such as children and adults, native and non-native speakers (e.g., Foulkes & Docherty, 2006; Kleinschmidt & Jaeger, 2015; Bradlow & Bent, 2008), as well as in different listening situations, such as with considerable background noise or in the case of multi-tasking (e.g., Mattys et al., 2012). This ability to extract and correctly identify the necessary linguistic information from diverse speech signals is referred to as context-dependent perception. One considerable source of variation that must be considered by listeners is speaker sex and gender. Listeners face the task of processing both physiological differences between male and female speakers, mainly in vocal tract length, and language- and community-specific gender performances (Johnson, 2006; Fuchs & Toda, 2010; Stuart-Smith, 2007).

It is well established that speaker gender is used as a cue in phoneme categorization. The effect was first demonstrated by Strand and Johnson (1996) in an experiment in which they found that the gender of the speaker affected the perception of ambiguous sounds along an /s/-/ʃ/ fricative continuum. In a phoneme categorization task, listeners were more likely to classify a sound as /s/ when it was combined with a word produced by a male speaker than when the same token of the fricative was combined with a word produced by a female speaker. It is generally argued that this effect, which we refer to as the speaker gender effect, results from listeners’ implicit knowledge of differences in acoustics between male and female speech, namely that male speakers produce fricatives with lower spectral centers of gravity (Tripp & Munson, 2021; Fuchs & Toda, 2010). To date, most studies on the speaker gender effect have been conducted with English listeners and native language (L1) stimuli. The present study set out to assess the cross-linguistic presence of this effect.

1.1 The speaker gender effect

Since the seminal study by Strand and Johnson (1996), the effect of speaker gender on phoneme categorization has been replicated with different speaker groups and under diverse listening conditions. Following Strand and Johnson’s methodology, Reese and Reinisch (2022, 2023) replicated the effect of speaker gender on the categorization of sounds along an /s/-/ʃ/ continuum in Austrian German, indicating that this is not an English-language specific phenomenon. They furthermore demonstrated the robustness of this effect even when listening in adverse conditions. Reese and Reinisch (2022) showed that the speaker gender effect was not affected by additional cognitive load in the form of dual-tasking during the phoneme categorization task, suggesting that it occurs even if attention is not solely focused on the speech signal. In a follow-up study, Reese and Reinisch (2023) demonstrated that the speaker gender effect was also not affected by the presence of speech-shaped noise masking the speech signal. These results suggest that speaker gender information is integrated during perception efficiently and with such little effort as to be unaffected by competing tasks or competing acoustic information.

The speaker gender effect is not restricted to the /s/-/ʃ/ contrast but has also been replicated with different phoneme contrasts. Johnson et al. (1999) found an effect of (imputed) speaker gender on a /ʊ/-/ʌ/ categorization task in which participants were presented with only one gender-ambiguous voice and were instructed to imagine that the speaker was either male or female. More /ʊ/ responses were given when listeners were told the speaker was female. The fact that priming speaker gender without actual acoustic differences between the stimuli still produces differences in perception highlights that the speaker gender effect is not an exclusively acoustic process. A substantial body of research on priming effects in speech processing indicates that various extra-linguistic factors, including speaker gender, age, and ethnicity, affect perception (Hurring et al., 2022, provide a thorough overview). With regard to the role of speaker gender in English fricative perception, Munson et al. (2017) investigated two different methods of priming speaker gender in /s/-/ʃ/ categorization. Using artificial, gender-neutral audio stimuli, participants completed two main tasks: a grammaticality judgment task and a phoneme identification task. There were two priming conditions: implicit and explicit. In the implicit condition, speaker gender was primed during the grammaticality judgment task with semantically biasing sentences that, due to their content, suggested a certain gender (e.g., “Hannah took my pink coat” vs. “My Harley will need a tune-up”). In the explicit condition, semantically neutral sentences were used in the grammaticality task, but speaker gender was primed by pictures of a male vs. female speaker. The phoneme categorization task followed and entailed categorizing ambiguous /s/-/ʃ/ sounds. In both conditions, a speaker gender effect was found in the expected direction (more /s/ responses when the voice was primed with sentences that corresponded to male stereotypes or with a picture of a male face), though the speaker gender effect was greater in the explicit priming condition (with pictures) than in the implicit condition (biasing sentences).

A further examination of the speaker gender effect was conducted by Munson (2011) in a study revisiting Strand and Johnson’s (1996) findings. In the Munson study, the effect of speaker gender on the categorization of /s/-/ʃ/ and /s/-/θ/ continua was tested. There were two additions to the methodology: first, the synthesis of different “apparent vocal-tract lengths” by manipulating F3 and f0 of the following vowels and second, a visual component to the experiment in which participants were primed with an image of a male or female face, or a checked box as a control. Strand and Johnson’s original finding that more /s/ responses were given for a male speaker was replicated in the audiovisual conditions: fricatives were more likely to be rated as /s/ if the continuum was presented with a male voice, a picture of a male face or the longer vocal tract condition. Additionally, participants were also more likely to classify an ambiguous sound along an /s/-/θ/ continuum as /θ/ when paired with a male voice, a picture of a man or the “longer” of the two synthesized vocal tract lengths. In fact, the effect of speaker gender on categorization was larger for the /s/-/θ/ than for the /s/-/ʃ/ continuum. Munson argues that this cannot be predicted by physiological differences between men and women and resulting acoustic differences between male and female speech. This is because acoustic analyses of English fricatives reported in Jongman et al. (2000) showed only a relatively small effect of speaker gender on the spectral peak of interdental fricatives compared to the much larger effect on sibilants (see Jongman et al., 2000, for details on spectral peak and mean slope values). Thus, it has been contended that the effect of speaker sex and imputed gender (in the form of the image of male and female speakers) on the categorization of the /s/-/θ/ continuum results from expectations of gendered speech rather than reported differences in the production of /θ/ between male and female speakers.

1.2 Motivation and theoretical background

Despite the breadth of research on the speaker gender effect in various conditions, the role of speaker gender in phoneme categorization has thus far only been tested among native listeners. The current study aims to fill this gap. Non-native listening presents a challenge on many levels, from phoneme recognition and lexical activation to understanding cultural context (Cutler, 2012). The sociophonetics attached to a particular phoneme may differ between languages (Gordon et al., 2002; Johnson, 2006; Pépiot, 2014) and hence affect phoneme categorization by non-native listeners differently from native listeners. Moreover, (late) second language learners have been shown to struggle with certain non-native phone contrasts at the phonological level (e.g., Cutler, 2000). The English language fricative contrast /s/-/θ/ lends itself well to investigating different constraints in non-native listening. As discussed above, it is subject to a speaker gender effect among native English listeners. However, cross-linguistically, the phoneme /θ/ is rare, reportedly present in approximately 4% of the world’s languages (Moran & McCloy, 2019). Thus, for most learners of English, it will be a non-native phoneme. For these learners, reduced sensitivity to phonetic detail, as well as lack of exposure to culturally specific gender expectations regarding /θ/, may affect the role of speaker gender in non-native phoneme categorization.

Additionally, fricatives are an important linguistic variable in the perception of sex, gender, gender typicality and sexuality. In English, the production of /s/ is well established as varying along the social variables of gender, class, and sexual orientation. Fronted /s/, when the sound is produced farther forward in the mouth and therefore with a higher spectral center frequency, is generally associated with femininity, male homosexuality and the stereotype of the “gay lisp” (Tripp & Munson, 2022). It represents a salient sociophonetic marker, one that is widely recognized by listeners and frequently discussed in public discourse (Calder, 2019; Campbell-Kibler, 2011; Mack & Munson, 2012). Fuchs and Toda (2010) examined articulatory and acoustic data from 24 speakers (12 men and 12 women) of German. They compared measures of palate length and width, as well as the spectral center of gravity and skewness of /s/ productions. While they found systematic differences between male and female vocal tracts (with male palates being both longer and wider on average), these anatomical differences were not sufficient to explain the observed acoustic differences in /s/. In other words, even when accounting for physical vocal tract dimensions, women still produced /s/ sounds with higher spectral centers of gravity and greater skewness. Fuchs and Toda therefore argued that the acoustic gender difference in /s/ production must result from a combination of physiological and sociophonetic factors, rather than purely anatomical ones.

Stuart-Smith (2007) demonstrated how other community-specific social contexts affect gendered speech production in a study that compared /s/ productions across class and gender in Glasgow. While working-class and middle-class men’s productions did not significantly differ, the production of /s/ by female speakers, measured in duration of the frication and the spectral mean and peak, was not consistent across class lines. Productions of /s/ by working-class women more closely resembled those of working-class men than those of the middle-class women. As these differences cannot be solely the result of vocal tract length or other physiological differences between men and women or members of different social classes, Stuart-Smith hypothesized a social, performative purpose for this type of speech. These differences in /s/ productions across gender and class were thus argued to be the result of “intentional articulatory strategies” that allow speakers to perform their identity and notion of gender through speech (Stuart-Smith, 2007). Non-native speakers may lack the experience with these social contexts that is needed to integrate this information in speech processing.

1.3 Aims of this study

The experiment we report on here was conducted to provide new insights on how information about speaker gender is employed during both native and non-native speech processing. Specifically, the current study asks how non-native listeners employ speaker gender information in their categorization of ambiguous phonetic input. If variation between men and women’s expected production of /θ/ is not purely a result of physiological differences but reflects learned performances of gender and identity (Munson, 2011; Boyd et al., 2021; Fuchs & Toda, 2010), non-native listeners may not have access to this information and therefore may not employ it in a phoneme categorization task to the same extent as native listeners. Put differently, if non-native listeners employ speaker gender cues to a lesser degree than native listeners, it would support previous findings that these cues are culturally specific and learned. To test this, we conducted an internet-based experiment in which participants categorized minimal word pairs containing a manipulated fricative along an /s/-/θ/ continuum (e.g., sick-thick, miss-myth). The words were produced by one male and one female native speaker of English, but the fricative continuum was the same for both voices. Participants were recruited from three groups: native speakers of British English, German learners of English and Polish learners of English. The native English listeners served as a baseline for a speaker gender effect as described in previous studies. In line with Munson (2011), we predict that among this group of participants, more /θ/ responses will be given for stimuli paired with the male voice.

We then asked whether the two groups of non-native listeners would exhibit similar or different levels of the speaker gender effect compared to the native English listeners. Because these learners are accustomed to categorizing fricatives in their native language and have less experience with the English-specific /θ/-/s/ contrast in general and its sociophonetic implications specifically, we predict that non-native listeners will categorize the fricatives primarily based on acoustic cues, with a less pronounced speaker gender effect. If this were indeed the case, then the non-native groups would not be expected to give significantly more /θ/ responses for the stimuli paired with the male voice. Two different learner groups were tested to ensure that any results found were not language specific. These specific learner groups were selected for several reasons. First, neither German nor Polish has /θ/ as a phoneme, and speakers of both languages are known to struggle with /θ/ in production and perception when learning English (e.g., Gonet & Pietron, 2006; Hanulíková & Weber, 2012). Second, the fricative inventories of German and Polish differ both from one another and from English (see Table 1). While the fricative inventories of English and German differ mainly in the absence of a velar fricative in English and of dental fricatives in German, the Polish fricative inventory differs considerably more from the other two languages. Polish contains more fricatives overall at more diverse places of articulation: in particular, the Polish fricative inventory contains the retroflex and alveolopalatal fricatives /ʂ, ʐ, ɕ, ʑ/, which are not present in English or German. Furthermore, the alveolar fricatives /s z/ are articulated between the alveolar and dental regions in Polish, thereby distinguishing the inventory of the Polish dental region from that of both English and German. Thus, the selection of Polish and German speakers provides two linguistically diverse participant groups. However, while differing native fricative inventories have been shown to affect cross-language perception (e.g., Wagner et al., 2006) and hence may affect the shape of the learners’ categorization functions relative to the native English listeners, the absence of /θ/ in both learner groups does not allow us to make differential predictions about the speaker gender effect between the two groups of non-native listeners. Testing two non-native listener groups serves as cross-validation within the experiment design, in that it makes it more likely that potential differences in the speaker gender effect between the native and learner groups are not L1 specific but represent general processes in L2 processing.

Table 1

Fricative inventories of the languages studied, adapted from Wagner et al. (2006). While German and Polish do not have the dental fricatives /θ ð/, which are present in the English fricative inventory, note that the alveolar fricatives /s z/ are articulated in the dental region in Polish.

Language | Labiodental | Dental | Alveolar | Postalveolar | Retroflex | Alveolopalatal | Velar | Glottal
German | f v | – | s z | ʃ ʒ | – | – | x | h
English | f v | θ ð | s z | ʃ ʒ | – | – | – | h
Polish | f v | – | s z | ʃ ʒ | ʂ ʐ | ɕ ʑ | x | –

2. Methodology

2.1 Participants

Three groups of participants were recruited via Prolific in April and May 2022: native speakers of British English (N = 55, 41 female), as well as German (N = 55, 19 female), and Polish (N = 50, 19 female) learners of English. Learners described themselves as fluent in English, but none had spent more than six months outside their home countries. Participants were aged 20-41 (mean = 29) and reported normal hearing. All participants gave informed consent and were paid based on Prolific’s compensation scheme.

2.1.2 L2 learners’ language experience with English

A questionnaire on participants’ English language experience was adapted from one developed and used in a previous study (Mitterer et al., 2020). This questionnaire was given to all Polish participants and, for technical reasons, a subset (N = 25) of the German participants.1 All participants reported having learned English at school. The complete results of the survey are reported in Table 2.

Table 2

Results of English language experience questionnaire, completed by German and Polish participants.

Category | German Participants (n = 25) | Polish Participants (n = 50)
Age of beginning English instruction | Mean: 9 (SD = 2) | Mean: 8 (SD = 2.8)
Years of English instruction: School | 7–10+ years: 24 (96%); 4–6 years: 1 (4%) | 7–10+ years: 39 (78%); 4–6 years: 9 (18%); 1–3 years: 2 (4%)
Years of English instruction: University | 1–3 years: 6 (24%) | 1–3+ years: 36 (72%)
How often do you speak English? (1–8 scale) | Mean: 5.4 (SD = 2) | Mean: 4.8 (SD = 2)
How often do you listen to English? (1–8 scale) | Mean: 7.7 (SD = 0.63) | Mean: 7.3 (SD = 1)
Self-assessment: Listening | Good/Very good: 24 (96%); Bad: 1 (4%) | Good/Very good: 41 (82%); Ok: 8 (16%); Bad: 1 (2%)
Self-assessment: Speaking | Good/Very good: 15 (60%); Ok: 9 (36%); Very bad: 1 (4%) | Good/Very good: 26 (52%); Ok: 19 (38%); Bad: 4 (8%); Very bad: 1 (2%)
Situations where English is heard | Internet: 25 (100%); TV/Streaming: 25 (100%); Friends: 12 (48%); Work/University: 11 (44%) | TV/Streaming: 49 (98%); Internet: 40 (80%); Work/University: 28 (56%); Friends: 20 (40%)

Independent two-sample t-tests were used to test for significant differences between the two learner groups’ responses to the language questionnaire. The results can be found in Table 3. The only significant difference was found in participants’ responses to “How often do you listen to English?”, with German participants reporting listening to English significantly more often than Polish participants: on a scale from 1 to 8, the mean response was 7.7 for the German participants versus 7.3 for the Polish participants.

Table 3

Results of t-tests comparing the Polish and German participants’ responses to the language learning questionnaire.

Question | df | t | p
Age of beginning English instruction | 62 | –1.48 | 0.14
How often do you speak English? | 49 | –1.19 | 0.23
How often do you listen to English? | 69 | –2.18 | 0.03
Self-rating: Speaking | 48 | –0.59 | 0.55
Self-rating: Listening | 51 | –1.37 | 0.18
Self-rating: Writing | 52 | –0.66 | 0.51
Self-rating: Reading | 62 | –1.36 | 0.18
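
As a rough check on the comparisons in Table 3, a two-sample t statistic can be recomputed from the summary statistics in Table 2. The sketch below assumes Welch's unequal-variance test; the text only says "independent two-sample t-tests", so the exact variant is an assumption, and the function name is hypothetical. Because the inputs are rounded means and SDs, the result can only approximate the reported values.

```python
import math


def welch_t_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Works from group means, standard deviations, and sample sizes only.
    """
    var_a = sd_a ** 2 / n_a  # squared standard error of group A's mean
    var_b = sd_b ** 2 / n_b  # squared standard error of group B's mean
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df


# "How often do you listen to English?": Polish (A) vs. German (B),
# using the rounded values from Table 2.
t, df = welch_t_from_summary(7.3, 1.0, 50, 7.7, 0.63, 25)
print(f"t = {t:.2f}, df = {df:.0f}")
```

With these rounded inputs the sketch gives a t of about –2.1 on about 69 degrees of freedom, which is close to the reported t = –2.18 and matches the reported df of 69. The small gap in t is plausibly due to rounding in the summary statistics.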

2.2 Design and Materials

Stimuli consisted of 13 English language minimal word pairs with an /s/-/θ/ contrast in word-initial or word-final position (e.g., sink-think, face-faith). Four native speakers of British English (two female and two male) were recorded producing the word pairs in a sound-treated booth. The recordings of one male and one female speaker that were judged to have the highest recording quality and similar speech rate were selected for further manipulation.

In order to create /s/-/θ/ continua, the fricatives were first isolated from the rest of the words. Then, sets of fricatives (i.e., one token of /s/ and one token of /θ/) of the same duration were manipulated to create multiple 15-step continua using a script in PRAAT (Boersma & Weenink, 2021). A 7 ms amplitude ramp was added to both sides of all fricative steps. This resulted in smoother perceived transitions between the fricative and the rest of the word when the continua were then spliced back into the recordings, replacing the removed fricatives. From the numerous continua created, the authors selected the one continuum judged to sound most natural when combined with all words from both speakers. This single continuum was hence used for all of the minimal word pairs produced by the selected male and female speakers. To reduce the overall number of trials and thus prevent participant fatigue, 9 of the 15 continuum steps were selected for use in the experiment: the two unmanipulated endpoints and the 7 intermediary steps (steps 5–11). The middle steps of the continuum are the most ambiguous, and it is therefore in their categorization that the strongest effects of context are likely to be present (Gabay et al., 2019). As a result, participants’ categorization of these stimuli is of most interest. The RMS amplitude of all stimuli was normalized, and all words were additionally rate normalized so that both speakers’ productions of the same word had approximately the same duration.
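
Two of the steps described above, the 7 ms amplitude ramps and the selection of 9 out of 15 continuum steps, can be illustrated with a minimal sketch. This is not the PRAAT script used in the study. The linear ramp shape, the 44.1 kHz sample rate, and the function names are all assumptions made for illustration.

```python
def apply_amplitude_ramps(samples, sample_rate_hz, ramp_ms=7.0):
    """Fade a fricative in and out with linear ramps of ramp_ms each.

    The linear ramp shape is an assumption; the text only states that a
    7 ms amplitude ramp was added to both sides of each fricative step.
    """
    n_ramp = int(round(sample_rate_hz * ramp_ms / 1000.0))
    n_ramp = min(n_ramp, len(samples) // 2)  # keep the two ramps from overlapping
    ramped = list(samples)
    for i in range(n_ramp):
        gain = i / n_ramp  # rises from 0 toward 1 across the ramp
        ramped[i] *= gain        # onset ramp
        ramped[-1 - i] *= gain   # offset ramp (mirror image)
    return ramped


def select_continuum_steps(n_steps=15, first_middle=5, last_middle=11):
    """Return both endpoints plus the intermediate steps, labeled 1..n_steps."""
    return [1] + list(range(first_middle, last_middle + 1)) + [n_steps]


# Hypothetical 1000-sample signal at an assumed 44.1 kHz sample rate.
ramped = apply_amplitude_ramps([1.0] * 1000, sample_rate_hz=44100)
print(select_continuum_steps())
```

The step labels follow the original 15-step numbering, so the 9 selected stimuli are steps 1, 5 to 11, and 15.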

All 13 pairs of words produced by the two speakers combined with the selected /s/-/θ/ continuum were piloted with British listeners (N = 25) in a two-alternative forced-choice experiment. To limit the duration of the main experiment, again with the aim of preventing participant fatigue, 10 of the 13 word pairs were selected based on the pilot study. The stimuli that were chosen elicited categorization curves as are typically expected in phonetic categorization experiments, that is, with clearly detected endpoints and intermediate steps following a linear or sigmoid-shaped change in responses. Of the three word pairs that were excluded, two had lower endpoint recognition (<90%) and one had an atypical categorization curve. The minimal pairs used were: seam – theme, sick – thick, sigh – thigh, tense – tenth, mass – math, force – fourth, gross – growth, face – faith, mouse – mouth, and miss – myth. Participants from the pilot experiment were excluded from the main experiment, and their data were not included in the statistical analyses.

2.3 Procedure

The (main) experiment consisted of a two-alternative forced-choice task in which participants were asked to identify the words along the /s/-/θ/ continuum by pressing buttons on a computer keyboard. The experiment was created using PSYCHOPY3 (Peirce et al., 2019) and hosted on Pavlovia (pavlovia.org). Each trial comprised one audio stimulus that automatically played once and could not be repeated. Participants selected via button-press which of the two words from the displayed minimal pair best matched the stimulus. The word containing /s/ was always presented on the right side of the screen and selected with the 0 key; the word containing /θ/ was presented on the left and selected with the 1 key. Trial order was randomized for each participant. The experiment contained a total of 360 trials (10 minimal pairs × 9 continuum steps × 2 speakers × 2 repetitions) and took approximately 15 minutes to complete. In order to reduce fatigue in participants, breaks were offered after every 100 trials.
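
The trial structure can be sketched as a fully crossed, shuffled list. The sketch assumes that the 360 trials arise from crossing the 10 minimal pairs, 9 continuum steps, 2 speakers, and 2 repetitions, since that is the combination that yields 360. It is not the PsychoPy code used in the study, and all names are hypothetical.

```python
import itertools
import random

# (/s/ word, /θ/ word) pairs listed in Section 2.2
MINIMAL_PAIRS = [
    ("seam", "theme"), ("sick", "thick"), ("sigh", "thigh"),
    ("tense", "tenth"), ("mass", "math"), ("force", "fourth"),
    ("gross", "growth"), ("face", "faith"), ("mouse", "mouth"),
    ("miss", "myth"),
]
CONTINUUM_STEPS = [1, 5, 6, 7, 8, 9, 10, 11, 15]  # original step labels
SPEAKERS = ["female", "male"]
N_REPETITIONS = 2
RESPONSE_KEYS = {"0": "s", "1": "th"}  # 0 = /s/ word (right), 1 = /θ/ word (left)
BREAK_EVERY = 100


def build_trial_list(seed=None):
    """Cross all design factors and shuffle the result for one participant."""
    rng = random.Random(seed)
    trials = [
        {"s_word": s_word, "th_word": th_word, "step": step,
         "speaker": speaker, "repetition": repetition}
        for (s_word, th_word), step, speaker, repetition in itertools.product(
            MINIMAL_PAIRS, CONTINUUM_STEPS, SPEAKERS,
            range(1, N_REPETITIONS + 1),
        )
    ]
    rng.shuffle(trials)
    return trials


def break_points(n_trials, every=BREAK_EVERY):
    """Trial counts after which a break is offered."""
    return list(range(every, n_trials, every))


trials = build_trial_list(seed=1)
print(len(trials), break_points(len(trials)))
```

Under these assumptions each participant receives 360 trials, with breaks offered after trials 100, 200, and 300.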

After the experiment (and, for non-native listeners, the learner background questionnaire), participants were asked about the gender typicality of each of the two voices. Again, due to technical issues, answers to this question were collected from only a subset of participants (British N = 25, German N = 25 and Polish N = 50). Participants were presented with a single audio stimulus from each of the male and female voices and were asked to rate how gender typical they considered the voice. For the stimulus produced by the male speaker, they were asked, “How typically masculine or unmasculine do you find this voice?”, and were instructed to rate the voice on a scale of 1 (extremely unmasculine) to 7 (extremely masculine). For the stimulus produced by the female speaker, they were asked, “How typically feminine or unfeminine do you find this voice?”, and again rated the voice on a scale of 1 (extremely unfeminine) to 7 (extremely feminine). Table 4 reports the mean ratings split by participant language. A one-way ANOVA revealed no significant differences between participant groups for the typicality ratings of either the male (F = 0.74, df = 2, p = .48) or the female (F = 0.90, df = 2, p = .41) voice.

Table 4

Mean response to typicality rating according to participant group. Participants rated each voice on a scale of 1 (very unmasculine/very unfeminine) to 7 (very masculine/ very feminine).

Participant group | Male Voice | Female Voice
English | 5.4 (sd = 1.6) | 5.4 (sd = 1.4)
German | 5.8 (sd = 1.2) | 5.5 (sd = 1.4)
Polish | 5.3 (sd = 1.3) | 5.4 (sd = 1.1)
All groups | 5.4 (sd = 1.2) | 5.4 (sd = 1.4)

3. Results

Categorization data is shown in Figure 1, presented as the proportion of /θ/ responses over continuum steps and according to participant language. Statistical analysis was conducted with a generalized linear mixed-effects model using the lme4 package (Bates et al., 2015) in R (R Core Team, 2023). Fixed effects were Continuum Step (centered on zero, higher values more /θ/-like), Speaker Gender (female coded as –0.5, male as 0.5), as well as Participant Language (with English mapped onto the intercept such that the two non-native participant groups were directly compared to the English baseline) and all interactions. Random intercepts for Participant and Minimal Pair were included, as well as random slopes for Continuum Step and Speaker Gender over Participant and over Minimal Pair.
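
The predictor coding described above can be sketched as follows. The sketch assumes that Continuum Step was centered using the original step labels (1, 5 to 11, 15); the text does not say whether those labels or a recoded 1 to 9 scale were used, so this is an illustration and not the authors' preprocessing. The function names are hypothetical.

```python
CONTINUUM_STEPS = [1, 5, 6, 7, 8, 9, 10, 11, 15]  # assumed: original step labels
GENDER_CODE = {"female": -0.5, "male": 0.5}       # contrast coding from the text


def center(values):
    """Subtract the mean so that the predictor is centered on zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def code_trial(step, speaker_gender, steps=CONTINUUM_STEPS):
    """Return the centered step and contrast-coded gender for one trial."""
    centered = dict(zip(steps, center(steps)))
    return {"step_c": centered[step], "gender_c": GENDER_CODE[speaker_gender]}


print(center(CONTINUUM_STEPS))
print(code_trial(8, "male"))
```

With this coding the mean of the step labels is 8, so step 8 maps to zero. The intercept then describes responses at the continuum midpoint, averaged over the two speakers.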

Figure 1

Results of /s/-/θ/ categorization task. Categorization data (proportion of /θ/ responses) over continuum steps (labeled according to the original continuum). Results are presented according to speaker gender (solid lines and light dots for female, dashed lines and dark dots for male) and by language: on the left, the results for the German listeners (blue) compared to native English listeners (black) and on the right, the results for Polish listeners (red) compared with the native English listeners (black).

Statistical results are summarized in Table 5. Participant Language English served as the reference level, with the coefficients for German and Polish representing their differences from the English baseline. For the English group, significant main effects were found for Continuum Step and Speaker Gender. The main effect of Continuum Step (p < 0.001) indicates that more /θ/ responses were given the higher the continuum step, that is, the more /θ/-like the fricative was. The main effect of Speaker Gender (p < 0.05) indicates that more /θ/ responses were given for the male than for the female speaker. This is in line with our first hypothesis, namely that native English listeners will give more /θ/ responses when stimuli are paired with the male voice. Furthermore, a significant interaction between Speaker Gender and Continuum Step (p < 0.001) captures a pattern that is evident in Figure 1: while overall more /θ/ responses were given for the male voice across the continuum, more /s/ responses were given for the male voice at the lower, more /s/-like end of the continuum. Comparing the results of the native English listeners to the learners, significant effects of Participant Language:German (p < 0.05) and Participant Language:Polish (p < 0.05) show that the average proportion of /θ/ responses for these participant groups differed from that of the English listeners. German listeners gave more /θ/ responses across continuum steps and speaker gender, as indicated by the positive sign of the estimate (b = 0.357). Polish listeners gave fewer /θ/ responses than the English listeners (b = –0.376). Critically, no significant effects were found for the interactions including the factors Participant Language and Speaker Gender; that is, none of the two-way or three-way interactions including Participant Language and Speaker Gender were significant. Only the two-way interaction between Participant Language and Continuum Step for the Polish listeners reached the p < .05 significance level, suggesting that Polish listeners showed a somewhat shallower categorization function compared to English listeners. However, in sum, the effect of speaker gender on categorization responses across continuum steps did not differ between the native and either of the two non-native participant groups. Thus, our second hypothesis, that non-native listeners use speaker gender as a cue to a lesser extent than native listeners do, was not supported by these results. Details of this statistical analysis, including the original data, can be found on the Open Science Framework (OSF; https://osf.io/5984h/).

Table 5

Results of the statistical analyses. The model structure was as follows: glmer(phoneme classification ~ continuum step * speaker gender * participant language + (1 + continuum step + speaker gender | participant) + (1 + continuum step + speaker gender | minimal pair)).

Predictor | b | SE | z | p
(Intercept) | –0.030 | 0.187 | –0.160 | 0.873
Continuum Step | 0.647 | 0.034 | 19.093 | <0.001
Speaker Gender | 0.402 | 0.198 | 2.030 | 0.042
Participant Language German | 0.357 | 0.157 | 2.269 | 0.023
Participant Language Polish | –0.376 | 0.157 | –2.398 | 0.017
Continuum Step: Speaker Gender | 0.213 | 0.019 | 11.326 | <0.001
Continuum Step: German | 0.030 | 0.040 | 0.740 | 0.460
Continuum Step: Polish | –0.078 | 0.040 | –1.984 | 0.047
Speaker Gender: German | –0.173 | 0.107 | –1.616 | 0.106
Speaker Gender: Polish | –0.118 | 0.106 | –1.113 | 0.266
Continuum Step: Speaker Gender: German | 0.054 | 0.028 | 1.946 | 0.052
Continuum Step: Speaker Gender: Polish | –0.035 | 0.025 | –1.398 | 0.162
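
To make the estimates in Table 5 easier to read, the fixed effects can be converted into predicted proportions of /θ/ responses. The sketch below uses only the fixed-effect estimates from Table 5 and ignores the random effects, so the values are approximate population-level predictions and not fitted values from the authors' model. The function names and the dictionary layout are hypothetical.

```python
import math

# Fixed-effect estimates (b) copied from Table 5; English is the reference level.
COEF = {
    "intercept": -0.030, "step": 0.647, "gender": 0.402,
    "german": 0.357, "polish": -0.376,
    "step:gender": 0.213,
    "step:german": 0.030, "step:polish": -0.078,
    "gender:german": -0.173, "gender:polish": -0.118,
    "step:gender:german": 0.054, "step:gender:polish": -0.035,
}


def logit_theta(step_c, gender_c, language="english"):
    """Log-odds of a /θ/ response; step_c is centered, gender_c is -0.5 or 0.5."""
    eta = (COEF["intercept"] + COEF["step"] * step_c
           + COEF["gender"] * gender_c
           + COEF["step:gender"] * step_c * gender_c)
    if language != "english":
        eta += (COEF[language]
                + COEF[f"step:{language}"] * step_c
                + COEF[f"gender:{language}"] * gender_c
                + COEF[f"step:gender:{language}"] * step_c * gender_c)
    return eta


def prob_theta(step_c, gender_c, language="english"):
    """Inverse-logit transform of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-logit_theta(step_c, gender_c, language)))


def gender_crossover(language="english"):
    """Centered step at which the male-minus-female difference in log-odds is zero."""
    slope = COEF["gender"]
    interaction = COEF["step:gender"]
    if language != "english":
        slope += COEF[f"gender:{language}"]
        interaction += COEF[f"step:gender:{language}"]
    return -slope / interaction


print(round(prob_theta(0, 0.5), 3), round(prob_theta(0, -0.5), 3))
print(round(gender_crossover(), 2))
```

For English listeners at the continuum midpoint, this gives a predicted /θ/ proportion of roughly 0.54 for the male voice and 0.44 for the female voice. The speaker gender effect reverses sign about 1.9 steps below the midpoint. If the midpoint corresponds to step 8 of the original continuum, which is an assumption about the centering, that places the reversal near step 6.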

To support our finding of no difference in the speaker gender effect between native and non-native listeners, that is, no significant three-way interaction between continuum step, speaker gender and either of the non-native languages, we conducted an approximate Bayes factor analysis following Wagenmakers (2007). Using the Bayesian Information Criterion (BIC), the approximate Bayes Factor in favor of the reduced model (model 0, excluding the three-way interaction) over the full model (model 1) was computed as:

$$\mathrm{BF}_{01} \approx \exp\left(\frac{\mathrm{BIC}_1 - \mathrm{BIC}_0}{2}\right)$$

where values greater than 1 indicate evidence in favor of the reduced model. Our analysis yielded a Bayes Factor of BF01 = 148, indicating strong evidence in favor of the reduced model (i.e., the model without the three-way interaction) and supporting the interpretation of a null effect. The model for this analysis can also be found on OSF (https://osf.io/5984h/).
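
The BIC approximation can be written as a one-line function. This sketch follows the formula from Wagenmakers (2007) as described above. The function name and the example BIC values are hypothetical and are not taken from the study's models.

```python
import math


def approx_bf01(bic_reduced, bic_full):
    """Approximate Bayes factor in favor of the reduced model (model 0).

    Implements BF01 = exp((BIC1 - BIC0) / 2), where BIC0 belongs to the
    reduced model and BIC1 to the full model.
    """
    return math.exp((bic_full - bic_reduced) / 2.0)


# Hypothetical BIC values: the full model's BIC is 10 points higher.
print(round(approx_bf01(bic_reduced=1000.0, bic_full=1010.0), 1))
```

Because the Bayes factor depends only on the BIC difference, a full model whose BIC is about 10 points higher than that of the reduced model corresponds to a BF01 of roughly 148, which is the order of magnitude reported here.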

4. Discussion

This study set out to examine the effect of speaker gender on the categorization of non-native phonemes. The speaker gender effect is well established among native listeners but has never been tested in language learners. While it is generally understood that the speaker gender effect is caused by listeners’ expectations about how men and women produce different phonemes, here fricatives, the source of these expectations remains subject to debate. In the present study we asked listeners of three different language backgrounds (one native and two non-native groups) to categorize English words containing a manipulated fricative along an /s/-/θ/ continuum, with words produced by one male and one female speaker. Critically, /θ/ was not part of the non-native listeners’ L1 phoneme inventories (i.e., German and Polish). Across participant groups, we replicated Munson’s (2011) finding that more /θ/ responses were given when the fricative was attached to a male voice, thus again replicating the effect of speaker gender on fricative perception. However, we found no effect of the participants’ native language on the magnitude of the speaker gender effect. Note here that the results of the gender typicality ratings did not differ between listener groups, with both voices perceived as rather gender typical by all three groups.

Our results show slight differences in the overall /s/-/θ/ categorization curves for both speaker genders between native and non-native listeners. German learners gave slightly more /θ/ responses than the native English listeners across the continuum and for both speakers, possibly due to this group’s established difficulty with this phoneme (Hanulíková & Weber, 2012). As German learners struggle in both the production and perception of /θ/, and the /s/-/θ/ contrast in particular, they may be hyper-aware of the phoneme and overcorrect in their categorization towards /θ/. On the other hand, Polish learners gave more /s/ responses relative to the native English listeners, possibly because in Polish, the fricatives /s/ and /z/ are articulated in the dental region rather than, as in German and English, the alveolar region (Wagner et al., 2006). Thus, continuum steps between /s/ and /θ/ could resemble natural Polish productions of /s/. In general, differences in categorization curves between native and non-native listeners (and between different learner groups) are to be expected, as it has been shown that native phoneme inventories affect how learners categorize non-native phoneme contrasts (see, e.g., the overview in Cutler, 2000).

One initially puzzling finding regarding the speaker gender effect was that, although we found more /θ/ responses for the male voice across the continuum overall, there was an interaction between the speaker gender effect and the effect of continuum step. At the /s/ end of the continuum, all participant groups, including the native English listeners, gave more /s/ responses for the male speaker. In the middle continuum steps (5–6), the average response for the male voice begins to shift from /s/ to /θ/. While this crossover effect has not been reported before (i.e., Munson, 2011, did not report results for individual continuum steps), it could be explained by looking at previous results from the speaker gender effect on different sets of phoneme contrasts. Notably, studies testing the effect on an /s/-/ʃ/ continuum report more /s/ responses for male versus female voices (e.g., Strand & Johnson, 1996; Munson, 2011; Reese & Reinisch, 2022). Since in the present experiment we tested a contrast that includes both /θ/ and /s/ (both phonemes previously reported to be perceived more often when paired with the male voice than the female voice), our results are consistent with previous findings.

As the magnitude of the speaker gender effect does not differ between native and non-native participant groups, the data gathered do not allow us to reach a definitive conclusion on the role of culturally specific expectations of gendered speech in fricative perception. One possible explanation for these results is that highly proficient learners also have implicit expectations of differences between male and female fricative production. Note that all our participants rated themselves as fluent in English and many reported the use of native English streaming services. They could thus have had sufficient experience with English-speaking men and women producing these fricatives differently that they began to integrate this information in speech perception. An alternative explanation is that these groups transfer their expectations about gendered fricative production from their L1 into the L2 listening task. Differences between male and female speakers in the production of /s/ are documented in languages as diverse as German, Japanese, and Chickasaw. In all studied languages, male speakers produce /s/ with lower spectral frequencies than females (Fuchs & Toda, 2010; Reese & Reinisch, 2022; Heffernan, 2004; Gordon et al., 2002). Furthermore, the effect of speaker gender on /s/-/ʃ/ perception, as first established by Strand and Johnson (1996), has been replicated in German (Reese & Reinisch, 2022). While this perceptual effect has, to our knowledge, not yet been tested in Polish speakers, across languages listeners can generally expect male speakers to produce fricatives with lower spectral frequencies, and this expectation likely holds true for Polish listeners as well.
To truly isolate the role of L1 transfer, a group of speakers in which the indexicality of fricative expectation operates in the opposite direction (that is, in which female speakers produce fricatives with spectral frequencies lower than or at least similar to those of male speakers) must be found. To the best of our knowledge, no such language or dialect group has been identified. In the absence of such a group, the present study indicates that proficient non-native listeners perform similarly to native listeners in their use of speaker gender in the categorization of ambiguous fricative stimuli.

This study represents the first investigation into the effect of speaker gender on phoneme categorization in a non-native population, with a focus on contrasts not present in participants’ native fricative inventory. It also contributes to the small but growing body of research on the role of social information in the speech processing of non-native listeners. Our findings indicate that the integration of social and acoustic information is not only present but highly efficient for non-native listeners. These results align with previous research suggesting that awareness of social cues embedded in the speech signal begins early, even with the onset of phonological acquisition (Foulkes & Docherty, 2006). Just like the input encountered during first-language acquisition, the native speech input that language learners receive is already rich with social implications. In accordance with the conclusions of Sumner et al. (2014), who argue that social and acoustic information are processed in tandem, we showed that this is true for both L1 and L2 listeners. This conclusion is further supported by previous research on the speaker gender effect in adverse conditions (Reese & Reinisch, 2022, 2023), which found no impact of dual-tasking (i.e., divided attention in the form of a visual search task) or noise masking of the speech signal on the speaker gender effect in L1 listeners. As non-native listening is less efficient and more effortful than native listening (Cutler, 2000, 2012), its effect on the use of speaker gender in phoneme categorization could be predicted to be similar to that of listening in adverse conditions. Thus, these results again confirm a robust speaker gender effect across listening conditions that require greater listener effort. This supports Reese and Reinisch’s (2022) argument that the integration of speaker gender information is highly efficient or even automatic.
Speaker or socially neutral language input does not exist; every utterance carries social meaning, and learners may naturally acquire gendered and other social aspects of speech.

Notes

  1. Due to a technical error, the questionnaire was initially not included for the first group of German participants; this issue was corrected after data had been collected from 30 individuals. Given the overall homogeneity among the remaining participants, this was not seen as justification for excluding or replacing their data from the main task.

Acknowledgements

We would like to thank Christian Kaseß and David Meijer for their help with the statistical analyses.

Competing interests

The authors have no competing interests to declare.

References

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.  http://doi.org/10.18637/jss.v067.i01

Boersma, P., & Weenink, D. (2021). Praat: Doing phonetics by computer (Version 6.1.50) [Computer software]. http://www.praat.org

Boyd, Z., Fruehwald, J., & Hall-Lew, L. (2021). Crosslinguistic perceptions of /s/ among English, French, and German listeners. Language Variation and Change, 33(2), 165–191.  http://doi.org/10.1017/S0954394521000089

Bradlow, A. R., & Bent, T. (2008). Perceptual adaptation to non-native speech. Cognition, 106(2), 707–729.  http://doi.org/10.1016/j.cognition.2007.04.005

Calder, J. (2019). The fierceness of fronted /s/: Linguistic rhematization through visual transformation. Language in Society, 48(1), 31–64.  http://doi.org/10.1017/S004740451800115X

Campbell-Kibler, K. (2011). Intersecting variables and perceived sexual orientation in men. American Speech, 86(1), 52–68.  http://doi.org/10.1215/00031283-1277510

Cutler, A. (2000). Listening to a second language through the ears of a first. Interpreting, 5(1), 1–23.  http://doi.org/10.1075/intp.5.1.02cut

Cutler, A. (2012). Native listening: Language experience and the recognition of spoken words. MIT Press.  http://doi.org/10.7551/mitpress/9012.001.0001

Foulkes, P., & Docherty, G. (2006). The social life of phonetics and phonology. Journal of Phonetics, 34(4), 409–438.  http://doi.org/10.1016/j.wocn.2005.08.002

Fuchs, S., & Toda, M. (2010). Do differences in male versus female /s/ reflect biological or sociophonetic factors? In S. Fuchs, M. Toda, & M. Zygis (Eds.), Turbulent sounds: An interdisciplinary guide (pp. 281–302). De Gruyter.  http://doi.org/10.1515/9783110226584.281

Gabay, Y., Najjar, I. J., & Reinisch, E. (2019). Another temporal processing deficit in individuals with developmental dyslexia: The case of normalization for speaking rate. Journal of Speech, Language, and Hearing Research, 62(7), 2171–2184.  http://doi.org/10.1044/2019_JSLHR-S-18-0264

Gonet, W., & Pietron, G. (2006). English interdental fricatives in the speech of Polish learners of English. Dydaktyka fonetyki języka obcego [Didactics of foreign-language phonetics]. Neofilologia, 8, 73–86.

Gordon, M., Barthmaier, P., & Sands, K. (2002). A cross-linguistic acoustic study of voiceless fricatives. Journal of the International Phonetic Association, 32(2), 141–174.  http://doi.org/10.1017/S0025100302001020

Hanulíková, A., & Weber, A. (2012). Sink positive: Linguistic experience with th substitutions influences nonnative word recognition. Attention, Perception, & Psychophysics, 74, 613–629.  http://doi.org/10.3758/s13414-011-0259-7

Heffernan, K. (2004). Evidence from HNR that /s/ is a social marker of gender. Toronto Working Papers in Linguistics, 23(2), 71–84.

Hurring, G., Hay, J., Drager, K., Podlubny, R., Manhire, L., & Ellis, A. (2022). Social priming in speech perception: Revisiting kangaroo/kiwi priming in New Zealand English. Brain Sciences, 12(6), 684.  http://doi.org/10.3390/brainsci12060684

Johnson, K. (2006). Resonance in an exemplar-based lexicon: The emergence of social identity and phonology. Journal of Phonetics, 34(4), 485–499.  http://doi.org/10.1016/j.wocn.2005.08.004

Johnson, K., Strand, E. A., & D’Imperio, M. (1999). Auditory–visual integration of talker gender in vowel perception. Journal of Phonetics, 27(4), 359–384.  http://doi.org/10.1006/jpho.1999.0100

Jongman, A., Wayland, R., & Wong, S. (2000). Acoustic characteristics of English fricatives. Journal of the Acoustical Society of America, 108(3), 1252–1263.  http://doi.org/10.1121/1.1288413

Kleinschmidt, D. F., & Jaeger, T. F. (2015). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122(2), 148–203.  http://doi.org/10.1037/a0038695

Mack, S., & Munson, B. (2012). The influence of /s/ quality on ratings of men’s sexual orientation: Explicit and implicit measures of the “gay lisp” stereotype. Journal of Phonetics, 40(1), 198–212.  http://doi.org/10.1016/j.wocn.2011.10.002

Mattys, S. L., Davis, M. H., Bradlow, A. R., & Scott, S. K. (2012). Speech recognition in adverse conditions: A review. Language and Cognitive Processes, 27, 953–978.  http://doi.org/10.1080/01690965.2012.705006

Mitterer, H., Eger, N. A., & Reinisch, E. (2020). My English sounds better than yours: Second-language learners perceive their own accent as better than that of their peers. PLoS One, 15(2), e0227643.  http://doi.org/10.1371/journal.pone.0227643

Munson, B. (2011). The influence of actual and imputed talker gender on fricative perception, revisited (L). Journal of the Acoustical Society of America, 130(5), 2631–2634.  http://doi.org/10.1121/1.3641410

Munson, B., Ryherd, K., & Kemper, S. (2017). Implicit and explicit gender priming in English lingual sibilant fricative perception. Linguistics, 55(5), 1073–1107.  http://doi.org/10.1515/ling-2017-0021

Pépiot, E. (2014). Male and female speech: A study of mean f0, f0 range, phonation type, and speech rate in Parisian French and American English speakers. Speech Prosody, 7, 305–309.  http://doi.org/10.21437/SpeechProsody.2014

Peirce, J. W., Gray, J. R., Simpson, S., MacAskill, M. R., Höchenberger, R., Sogo, H., Kastman, E., & Lindeløv, J. (2019). PsychoPy2: Experiments in behavior made easy. Behavior Research Methods, 51, 195–203.  http://doi.org/10.3758/s13428-018-01193-y

R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org

Reese, H., & Reinisch, E. (2022). Cognitive load does not increase reliance on speaker information in phonetic categorization. JASA Express Letters, 2(5).  http://doi.org/10.1121/10.0009895

Reese, H., & Reinisch, E. (2023). Noise does not affect weighing of speaker information in spoken-word recognition. In Proceedings of the 20th International Congress of Phonetic Sciences (Prague, Czech Republic).

Strand, E. A., & Johnson, K. (1996, October). Gradient and visual speaker normalization in the perception of fricatives. In Proceedings of KONVENS (pp. 14–26).  http://doi.org/10.1515/9783110821895-003

Stuart-Smith, J. (2007). Empirical evidence for gendered speech production: /s/ in Glaswegian. In J. Cole & J. I. Hualde (Eds.), Laboratory Phonology 9 (pp. 65–86). Mouton de Gruyter.

Tripp, A., & Munson, B. (2022). Perceiving gender while perceiving language: Integrating psycholinguistics and gender theory. Wiley Interdisciplinary Reviews: Cognitive Science, 13(2), e1583.  http://doi.org/10.1002/wcs.1583

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804.  http://doi.org/10.3758/BF03194105

Wagner, A., Ernestus, M., & Cutler, A. (2006). Formant transitions in fricative identification: The role of native fricative inventory. Journal of the Acoustical Society of America, 120(4), 2267–2277.  http://doi.org/10.1121/1.2335422