1. Introduction
A single speaker can maintain a repertoire of phonetic pronunciation variants of the same lexical item. That speaker can deploy these variants in different situations to achieve social goals, such as accentuating an aspect of their identity or communicating solidarity with their interlocutor (Babel, 2009; Bucholtz & Hall, 2005; Johnson, 2006). Such variation is dynamic and strategic, with speakers altering the fine-grained phonetic details of their speech based on the specific social context. To accomplish this, speakers require some way to differentially access socially distinct variants in production. While proposals exist that posit direct links between the representation of social and linguistic knowledge (Johnson, 2006; Pierrehumbert, 2001), the question remains as to how speakers are able to access and utilize these representations to accomplish their own social goals. Answering this question requires an understanding of how the cognitive processes by which speakers perform socially meaningful phonetic variation relate to, and are possibly constrained by, the mechanisms that have been proposed to influence intra-speaker phonetic variation more broadly.
In addition to socially motivated variation, phonetic variation also emerges through competition between variants driven by asymmetries in variant frequency. Unlike variation due to social factors, such phonetic variation might be expected to occur regardless of the social context in which speech is produced. The idea that, all else being equal, productions largely reflect the linguistic input a speaker has accumulated is long-standing in linguistics. This was articulated by Bloomfield as the “principle of density” (Bloomfield, 1933). This principle is central to cognitive theories of linguistic knowledge such as exemplar theory, which models the quantitative relationship between input and production (Pierrehumbert, 2001, 2003). However, it has also been acknowledged that this principle is in some way attenuated by social context. Bloomfield gives the example of a speaker preferentially modelling their own speech on that of someone they hold in high regard, while Pierrehumbert acknowledges that social factors can bias the production mechanisms of exemplar theory. The present study asks how speakers resolve a situation in which social-strategic considerations favor one variant, but the effects of variant frequency favor a competing one. Specifically, we ask how the phonetic characteristics of a socially upweighted, infrequently encountered variant are influenced by competition from a more frequent competitor.
We address these questions using an artificial language learning paradigm modelled on classroom-based L2 learning. The task, described in more detail below, has participants learn vocabulary items in a novel language based on auditory models of the new words as produced by two (virtual) teachers. The teachers differ systematically in their pronunciation of certain vowels. We manipulate two factors across experimental conditions: The relative frequency with which participants are exposed to a pronunciation, and the social weighting of the pronunciation variants. Briefly, in some conditions participants hear one pronunciation variant more often than the other – the exposure frequency manipulation. Independently, in some conditions participants are provided with biographical information about the two speakers that is meant to shift their productions towards those of one speaker over the other – the social weighting manipulation. We analyze the phonetic details of participants’ productions of the critical vowels after exposure to the model talkers for competing effects of these factors.
In using an artificial language learning task for this study, we address two difficulties in quantifying the competition between socially differentiated word variants. The first is that the relative strength of a variant depends crucially on its distributional properties, namely the frequency and recency with which it is present in a speaker’s input, and these properties are difficult to estimate for naturally acquired words. The second is that it is difficult to experimentally control or predict the social weight attached to a particular variant. Even speakers from similar backgrounds may differ in their evaluation of the same variant (Babel, 2010). Furthermore, a speaker’s positive evaluation of a variant in perception does not straightforwardly predict their own uptake of that variant in production (Babel et al., 2014).
To preview the results presented below, our findings demonstrate that speakers are sensitive to both the (exposure) frequency of a variant and social information attached to it in the artificial language. In general, social information was found to mitigate the effects of frequency such that an input form that was socially upweighted and less frequent exerted a comparable degree of influence on participants’ productions to one that was socially upweighted and more frequent. We interpret these results with regard to our current understanding of how social information and usage frequency might inform socially meaningful phonetic variation. Before introducing the experiment in more detail, the following sections review important background related to socially driven and frequency driven phonetic variation.
1.1. Social factors and phonetic shifting
Empirical evidence has shown that speakers pursue a variety of social goals with the phonetic resources available to them. Style shifting, or intracontextual stylistic variation, has provided one source of evidence for the empirical study of such strategies (Coupland, 1980). Style shifting broadly describes the ways in which speakers alter the characteristics of their speech depending on their audience, setting, or the topic at hand. Speakers often index their stance and attitude towards these topics through the performance of personae: Specific, recognizable characters or “social types” (D’Onofrio, 2018). These performances often involve associated suites of phonetic traits. An example from Beijing Mandarin is provided by Zhang (2005), who examines how professionals in state-run and private industries construct different personae through using both “local” and “cosmopolitan” phonetic traits in different proportions when discussing different topics. In other cases, speakers may adopt specific phonetic features in order to index specific, locally meaningful social messages. Nycz (2018) reports such localized practice in the speech of Canadian-born New Yorkers who index positive stance towards their home country through raising of the /o/ vowel class.
Laboratory-based studies have found that many of the social factors that have been linked to contextual style shifting can also influence phonetic convergence over smaller time scales. Phonetic convergence refers to the process of a speaker coming to produce phonetic values more similar to their interlocutors’ over the course of an interaction. This interaction can be as short as a brief laboratory-based exposure (Pardo, 2006) or as long as a college quarter spent living together as roommates (Pardo et al., 2012). Like style shifting, phonetic convergence has been found to be heavily sensitive to social factors. Specifically, the likelihood that a speaker will converge, remain consistent, or diverge from the speech of their interlocutor is strongly mitigated by social factors specific to an interactional situation (Babel, 2009). Such social factors include social “liking,” or the maintenance of a positive stance towards the interlocutor (Babel, 2012), as well as more situationally specific factors such as conversational role as a giver or receiver of information (Pardo et al., 2010). Phonetic convergence has thus been argued to serve a purpose as part of a speaker’s overall social communicative strategy, with speakers adjusting the extent of their convergence based primarily on their appraisal of the interlocutor and their situation (Pardo, 2012).
In contrast to this view of convergence as stylistic variation and unlike stylistic variation more broadly, phonetic convergence has also been argued to be primarily the result of mechanistic factors. Unconscious, spontaneous phonetic imitation has been observed even in socially and communicatively impoverished settings, including shadowing tasks (Shockley et al., 2004). This has led to explanations such as Pickering and Garrod’s interactive alignment model (Pickering & Garrod, 2004b), which attributes convergence in part to “resource-free” priming mechanisms. Such models do not necessarily exclude a role for social factors in mediating convergence, but rather propose that overriding convergence should be an effortful process (Pickering & Garrod, 2004a). The proposal that at least some aspects of convergence are attributable to factors besides a speaker’s social evaluation of their listener is partly supported by evidence that speakers can converge towards speech forms they may actively disfavor on social grounds. Importantly for this study, this includes convergence to L2 accented speech (Lewandowski & Nygaard, 2018).
Such mechanistic approaches have been criticized for failing to account for the socially conditioned nature of phonetic convergence, as discussed above (Babel, 2010), as well as on the basis of the phonetic selectivity of convergence. This is to say, phonetic convergence appears to selectively target certain variants over others (Babel, 2012; Ostrand & Chodroff, 2021). However, while a purely mechanistic theory of convergence may fall short in this regard, theories of linguistic processing (briefly reviewed below) typically allow mechanisms through which exposure to a form in perception can cause subtle changes in production. This study is designed to investigate to what extent, if any, such mechanisms constrain the selection of a variant based on its social characteristics.
1.2. Frequency-driven competition
Thus far, we have discussed how the phonetic details of a speaker’s production may vary as a result of speakers’ social situation and social goals. Speakers first select a socially preferred variant, the target, from their repertoire of competing forms. Theories of speech processing provide potential explanations for how this decision is implemented in speech production. A pertinent finding from such theories is that while much intra-word phonetic variation can be attributed to speakers’ social goals and strategies, some variation arises from the cognitive mechanisms by which words are represented, stored in memory, and retrieved for production. Unlike socially motivated variation, variation due to variant frequency is expected to occur regardless of social-communicative context. Such mechanisms could potentially limit a speaker’s ability to freely select variants from their repertoire by making one variant less cognitively accessible or by creating pathways through which nontarget variants can influence speech production.
The primary means by which this could occur is that of inter-variant competition. While accounts of the specific mechanisms involved vary between theories, most modern views on speech processing agree that speech production is a competitive process (Dell, 1986; Levelt et al., 1999; Munson & Solomon, 2004). When a speaker selects a particular form for production, the representations of formally similar competitors (i.e., those with shared phonological or morphological features) are also simultaneously activated. These competitors can then have measurable phonetic effects on speech production. In the case of inter-word competition, the outcome is usually described as phonetic dissimilarity. That is, competition from a word that is a phonological neighbor to the intended word results in productions that are more distinct from that neighbor (Wright, 2004). However, theories of inter-variant competition generally propose the opposite: Greater activation of a competitor variant results in productions that are more similar to that variant. Specifically, most previous work that has examined the link between speech processing and social variation has done so from an explicitly exemplar theoretic standpoint (Drager, 2010; Drager & Kirtley, 2016; Squires, 2013, inter alia). The following section provides an overview of the architecture of exemplar theory and how this architecture may potentially limit a speaker’s capacity for social signaling.
1.2.1. Exemplar-based mechanisms
Exemplar theoretic approaches treat phonetic knowledge as arising primarily from knowledge of specific instances, or exemplars, of previously encountered word forms (Goldinger, 1998; Johnson, 2007; Pierrehumbert, 2001; see Goldrick & Cole, 2023 for a review of exemplar models in production). These forms can be viewed as points in a high-dimensional space, with each encountered instance of a word stored as a separate point. Word form exemplars encode detailed phonetic information, such as precise formant values, as well as linguistic information, such as what word is being said, and extra-linguistic information, such as social characteristics attributed to the speaker. Perceiving a word in such a system proceeds by the perceived token activating exemplars in the neighborhood of previously encountered instances of the word. Exemplars in this neighborhood that are associated with a given word form are activated, with more frequently and recently encountered words exerting a stronger influence on perception.
Exemplar accounts are particularly popular for explaining the relative speed with which social and linguistic information can be integrated in perception. This is demonstrated in perception work showing that speakers are capable of making rapid use of nonlinguistic cues to aid in speech perception (D’Onofrio, 2015; Hay & Drager, 2010; Koops, Gentry, & Pantos, 2008). There is comparatively less work examining the time course of socially meaningful variation in speech production. In one exemplar-based approach, production “proceeds in the opposite direction” to perception (Pierrehumbert, 2001). A speaker first selects a label which activates all exemplars marked with the label. The production process is modelled as sampling from the distribution of phonetic space covered by these labels. Speakers are able to style-shift by selecting only those variants that are labelled with the social information they wish to convey (or imitate). Among the work investigating the production of socially meaningful variation from an exemplar theoretic standpoint, Clopper & Pierrehumbert (2008) have argued that semantic predictability affects regionally-indexed and “standardized” forms differently. The claim is that speakers produce forms that index local or nonstandard varieties when the time course of activation is shorter.
While Clopper and Pierrehumbert argue that nonstandard forms are more easily accessible in production due to reaching an activation threshold more quickly, the question of what sorts of social labels result in greater exemplar activation, and by which mechanisms, remains a subject of debate. Sumner and colleagues (2014) provide evidence that standard or “idealized” variants are more easily recalled from long-term memory than “nonidealized” or colloquial forms. They attribute this imbalance to the way in which forms are encoded in long-term memory, but argue that the asymmetry between idealized and nonidealized forms in long-term encoding does not translate into an advantage for idealized forms for all linguistic tasks and social circumstances. Specifically, they show that idealized and nonidealized forms behave similarly in tasks relying on short-term memory. This has implications for the current study, as the limited exemplar theoretic work on production makes it difficult to say for certain whether any advantage enjoyed by idealized forms in long-term recall translates into those forms exerting greater influence in any particular production task. Similarly, while long-term memory encoding may result in more robust long-term representations for idealized forms, the theoretical architecture of exemplar theory allows for other mechanisms that may potentially counteract such advantages. For example, through online processes such as attentional upweighting, speakers may activate only a subset of a target category’s exemplar distribution by preferentially upweighting tokens with specific, socially relevant labels.
Regardless of the precise mechanism involved, online upweighting or asymmetries in long-term encoding, exemplar theory makes clear predictions about the mechanisms that allow gradient interaction among phonetically distinct exemplars of the same word. Phonetic knowledge consists of distributions of labelled exemplars, and production involves sampling from these exemplars. This means that, when two distributions are near enough in acoustic space to compete for activation and social weighting is held constant, the label with more frequent and/or more recent exemplars should always show higher activation. The presence of exemplars with competitor labels in the neighborhood of the word that is the target of production should, all else being equal, pull the target towards the mean of all the competitors. As an example, consider the case of a speaker whose repertoire for a given word contains two forms: A “dialect” or “regional” form and a “standard” or “supraregional” form. A concrete example comes from the case of the word I as represented by speakers of Southern American English. Such speakers may represent this word as two distinct distributions: One carrying the indexical label “local” containing exemplars that are relatively monophthongal, and one labeled “standard,” whose exemplars are phonetically diphthongal. Under exemplar theory, we would predict that, all else being equal, as the language user encounters more tokens belonging to the “standard” category, their own productions should become more diphthongal. Previous work by Hay et al. (1999) has argued that frequent words, such as I in English, play a major role in the performance of style, which also suggests the possibility that absolute frequency may play a role in speakers’ use of a specific variant. To account for such possible effects, we have designed our study such that all lexical items occur the same number of times.
This has three implications for the current study. The first is that, when faced with competition from a nontarget variant, speakers are expected to upweight the entire distribution of exemplars marked with the (socially upweighted) target label, thereby producing forms that are more typical, or closer to the mean phonetic value of the target variant. The second is that, for a given value of attentional weight or strength of encoding, increased competition from a nontarget variant should always pull the phonetic value of the target towards that distractor. Finally, exemplar accounts posit that tokens are stored in memory holistically, with detailed phonetic information along all phonetic dimensions associated with that token. This predicts that speakers’ productions of a category will reflect all phonetic dimensions of tokens stored within that category. We therefore do not expect to observe phonetic values shifting towards a distractor on only one available phonetic dimension with no movement along other phonetically relevant dimensions. In the case of vowels that differ along F1 and F2, we would not expect to see movement along only one formant.
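These implications can be illustrated with a minimal sketch. It is not the model of Pierrehumbert (2001) or the analysis used in this study. It assumes, purely for illustration, that each exemplar is a labelled (F1, F2) point, that social upweighting is a single multiplicative weight per label, and that the expected production is the activation-weighted mean over all exemplars in the neighborhood. All names and formant values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    label: str  # hypothetical variant label, e.g. "tense" or "lax"
    f1: float   # first formant in Hz (illustrative value only)
    f2: float   # second formant in Hz (illustrative value only)


def expected_production(exemplars, label_weights):
    """Return the activation-weighted mean (F1, F2) over all exemplars.

    label_weights maps a label to its activation weight. Labels missing
    from the mapping receive weight 1.0, i.e. no social upweighting.
    """
    weights = [label_weights.get(e.label, 1.0) for e in exemplars]
    total = sum(weights)
    f1 = sum(w * e.f1 for w, e in zip(weights, exemplars)) / total
    f2 = sum(w * e.f2 for w, e in zip(weights, exemplars)) / total
    return f1, f2


def make_cloud(n_tense, n_lax):
    """Build a toy exemplar cloud; every token sits at its category mean."""
    tense = [Exemplar("tense", 300.0, 2300.0)] * n_tense
    lax = [Exemplar("lax", 400.0, 2000.0)] * n_lax
    return tense + lax


# Lax variant heard four times as often as the tense variant.
lax_frequent = make_cloud(n_tense=1, n_lax=4)

# No social weighting: frequency alone determines the outcome.
no_bias = expected_production(lax_frequent, {})

# Tense label upweighted (assumed weight of 4.0) against the frequency asymmetry.
tense_upweighted = expected_production(lax_frequent, {"tense": 4.0})
```

With these illustrative values, the unweighted estimate is (380, 2060) Hz, close to the frequent lax variant, while the upweighted estimate is (350, 2150) Hz, halfway between the two variants. Because each exemplar is stored as a single point, F1 and F2 shift together in this sketch, which mirrors the third implication.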
1.3. Artificial language learning
We use an artificial language-learning paradigm to test the interaction between a social factor upweighting one phonetically distinct variant of a word and the effect of frequency that can favor either of two variants. The use of artificial language paradigms to investigate the influence of social factors on language behavior is well established. Previous work has demonstrated that learners are sensitive to social information in artificial languages (Samara et al., 2017). This sensitivity includes the ability to rapidly link novel linguistic forms to experimentally specific social factors. In particular, a series of studies by Roberts and colleagues have utilized “alien” languages to answer questions concerning the acquisition and spread of socially motivated linguistic variation (Lai et al., 2019). Such studies have, for example, investigated the degree to which social salience mitigates language users’ ability to link social and linguistic variation (Li & Roberts, 2023; Wade & Roberts, 2020). Such results therefore provide important empirical validation for the methods used in this study.
This study expands upon previous artificial language learning studies in extending the use of an artificial lexicon to the study of continuous phonetic variables. While previous work has demonstrated that language learners are sensitive to statistical regularities in artificial language input (Austin et al., 2022; Finley & Badecker, 2009), the extent to which speakers reproduce the features of their input has typically been assessed through categorical tasks that rely on variation at the word or phone level. However, as the field of sociolinguistics demonstrates, many socially meaningful forms of variation rely on gradient differences in production. The investigation of phonetic variation differs from such paradigms in allowing speakers to produce forms that reflect the influence of multiple variants in a more gradient way. Similarly, the types of phonetic effects exerted by mechanistic factors are known to result in subtle intracategory variation. Investigating a continuous response therefore provides us the opportunity to address our research question concerning the interaction between these two sources of variation. In doing so, it draws on findings from distributional phonetic learning experiments (Theodore & Monto, 2019), which show that listeners are capable of tracking distributional phonetic information over the course of experimental exposure.
The use of an artificial language offers two primary advantages over using speakers’ preexisting lexicons. The first is the relative lack of preexisting social biases attached to artificial language forms. The primary manipulation of this experiment involves socially upweighting one phonetic form relative to another. Previous work has shown that it can be difficult to experimentally predict whether and how speakers will shift their speech productions for social reasons. Babel (2010), for example, found that a manipulation of whether a talker was insulting or flattering to the participant was not by itself predictive of whether or not that participant would phonetically converge to that interlocutor. Similarly, social biases have been argued to mitigate phonetic convergence. This has been demonstrated by Clopper & Dossey (2020), who found that speakers avoid producing socially stigmatized variants in shadowing tasks. The second advantage lies in the relatively high degree of control it allows the experimenter over the distributional properties of a learner’s input. To predict how the effect of exposure frequency influences participants’ productions, it is critical to have reliable estimates of exposure frequency for all variants of a word. This can be difficult to control in an experiment eliciting participants’ productions of their native language and local dialect, as even speakers from the same city may encounter local dialect forms more or less often depending on factors such as the strength of their local social ties (Dodsworth & Benton, 2017). The use of an artificial language thus allows precise control over how often a speaker encounters a variant.
1.4. The current study
To summarize this section, speakers are capable of modulating the fine phonetic details of their productions as part of a communicative strategy. This modulation occurs as speakers select between socially differentiated variants of the same words. But pathways exist by which competing pronunciation variants can influence the phonetic targets that are ultimately realized. Specifically, the mechanisms by which word forms are represented and accessed for production may activate more frequently encountered lexical forms to a greater degree than less frequent ones, privileging these more frequent forms in production compared to less frequently encountered variants of the same forms. This privileging can result in asymmetrical influences between variants in production, pulling the phonetic value of a less-frequently encountered form towards that of a more frequent one. The current study asks how these frequency-driven and social factors interact in influencing the phonetic form of a given target pronunciation.
To answer this question, we investigate how a variant’s influence on speech changes as a result of the social evaluation attached to it and as a result of its frequency relative to other variants of the same word form. Although our primary question concerns the interaction between these two factors, our first hypotheses are concerned with establishing the effects of each independently. H1 is motivated by findings from sociolinguistics reviewed above. It predicts that social factors exert some influence on which variant is targeted for a given situation and therefore the phonetic realization that is ultimately produced. This is in line with the understanding of variation as a communicative system that speakers may draw on for addressing their social needs. H2 is informed by the above-discussed work on speech production. It predicts that in the absence of social reasons to prefer one form or another, frequency-based factors will predominate and productions will be pulled towards phonetic values typical of more frequent variants.
H1 (Social Effect): On average, speakers’ productions will more closely resemble a previously encountered form that is socially upweighted than one that is not socially upweighted.
H2 (Frequency Effect): On average, speakers’ productions will more closely resemble a form that is more frequently encountered than one that is less frequently encountered.
A third hypothesis addresses our primary research question of how speech production reflects the interaction between frequency and social factors. Our primary prediction in this regard, which we will refer to as H3, is that social upweighting will mitigate the effect of frequency. The critical prediction of this hypothesis is that a form that is less frequently encountered but socially upweighted will exert more influence on a speaker’s subsequent productions than a form that is also less frequent but not upweighted. To instantiate this hypothesis, we introduce an example of a speaker whose repertoire consists of two socially differentiated variants of the same phonological category. We will label these variants “tense” and “lax.” It is possible for either form to be more frequent in the input and for social factors to either favor one form or another or to favor neither form. H3 predicts that when frequency and social information are at odds with one another, i.e., when the lax form is more frequent but the tense form is upweighted, speakers’ productions will be more tenselike than they would be if the tense form were not upweighted.
H3 (Mitigation Hypothesis): The effect of social preference will mitigate that of frequency.
Testing these three hypotheses requires precise experimental control of both the frequency of a variant and the social evaluation attached to it. Towards this end, we designed an artificial language-learning speech production experiment. This experiment takes advantage of participant familiarity with the classroom-based model of L2 learning to impose a predictable (for this context) upweighting of a socially differentiated variant, that of the “native speaker,” while otherwise drawing minimally on a speaker’s real-world social preferences. The use of an artificial language allows the relative frequency of both the target “native” and target competitor “nonnative” variants to be precisely manipulated relative to each other. These “native” and “nonnative” forms correspond to a phonetic distinction between English tense and lax vowels. This distinction, as presented in words in the artificial language, is not lexically contrastive in the language. That is, participants are never exposed to minimal pairs in the language differentiated solely by the tense/lax vowel distinction. Instead, model talkers of the language vary in whether they produce exclusively tense or exclusively lax front vowels. The critical manipulations of the experiment are whether this distinction corresponds with the model talker’s status as a “native” or “nonnative” speaker of the language, as made salient to the participant, and which variant is more frequent in the input.
Importantly, we did not assume from the outset that our selection of this specific “native-ness” manipulation would have its intended effect of shifting productions towards the native variant. Rather, this choice of manipulation represented a working hypothesis that participants will tend to upweight a form labelled as “native” within the specific context of a mock L2 classroom environment in which this distinction was made salient. There is, however, prior evidence that this would be the case. Previous work has theorized that speakers place greater social weighting on forms they perceive to be more standard (Sumner et al., 2014). Our goal in creating a mock L2 learning context was thus to identify one form as “standard” and thus upweight one variant relative to another.
It is therefore not necessary for participants to actively disprefer or downweight nonnative speech as a general rule, only that they exhibit at least a slight preference for nativelike forms in this specific experimental context. This is especially important given that previous work has found that although language learners may exhibit explicit biases against nonnative speakers, these biases do not always translate to implicit biases, and these different levels of bias may have implications for our study (Todd & Pojanapunya, 2009). To this end, we elected to identify the nonnative speaker as a “teaching assistant” in the context of the experiment so as to discourage participants from completely downweighting or simply ignoring all input from the nonnative talker.
In addition, previous work has also found that, while speakers can converge to nonnative speech, the extent to which they do so is negatively correlated with perceived degree of foreign accent (Wagner et al., 2021). Our intent in separating the nonnative and native forms of our artificial language along a dimension that was perceptually obvious to English speakers was thus to separate “degree of foreign accentedness” in the artificial language into two clearly distinct categories. Furthermore, while Wagner et al. (2021) found some evidence of convergence, participants were also placed in a communicatively impoverished context in which there was no salient alternative to which speakers could converge and no clear reason not to converge to the nonnative model talker. This is particularly important considering that the dimensions along which convergence was most strongly observed in that experiment were suprasegmental (speech rate and F0), neither of which were contrastive in the relevant L1 (Dutch) and both of which are subject to considerable intralanguage variability. It seems therefore at least possible that neither was viewed as a relevant marker of foreign accentedness and less likely to be downweighted. This would be in line with findings of Clopper & Dossey (2020), which found that speakers would converge to the phonetic features of a downweighted variety along some dimensions but not along those associated with existing social biases.
2. Methods
2.1. Study design
All participants learned the same artificial language through exposure to the same two model talkers. One talker always produced tense front vowels [i, e] in lexical contexts where the other produced the corresponding lax vowels [ɪ, ɛ]. Participants either received no social information about either talker (Bias Absent conditions) or were informed that the talker who produced exclusively tense forms was a “native” speaker and experienced teacher while the lax talker began learning the language at college and is now a teaching assistant and aspiring teacher of the language (Bias Present conditions). In addition to the social bias manipulation, participants heard either the tense variant, in word forms containing [i, e] (Tense Frequent), or the lax variant, in word forms containing [ɪ, ɛ] (Lax Frequent), four times as often as the competing variant (lax or tense, respectively). Participants were thus assigned to one of four conditions in a 2 × 2 between-participants design (Table 1).
Experimental Conditions.
| | Tense More Frequent | Lax More Frequent |
| Social Information Absent | Tense Frequent Bias Absent | Lax Frequent Bias Absent |
| Social Information Present | Tense Frequent Bias Present | Lax Frequent Bias Present |
2.2. Stimuli
Stimuli consisted of nonce words recorded by two female native speakers of American English in their early twenties. Stimuli were recorded in a soundproof booth using a Shure SM81 Condenser Handheld Microphone. The same voice was always associated with the same variant, regardless of condition, such that all participants heard the same teacher producing the critical tense vowels, and the other teacher producing the critical lax vowels. All words were of the form C1V1C2V2, with initial stress (see Note 1). The consonants were selected to minimize their effect on the formants of adjacent vowels. C1 consonants were voiceless obstruents from the set {k, p, s, t, f, t̠ʃ}. V1 was always one of the front vowels from the set {i, ɪ, e, ɛ}, with tense and lax vowels varied critically between the two teachers. C2 consonants were voiced obstruents from the set {g, v, b, d, z, d̠ʒ}. The velar /g/ was permitted only following high vowels, as velars are known to exert strong centralizing effects on preceding non-high vowels in some dialects of North American English (Freeman, 2014). Finally, V2 was a back vowel, /u/, /o/, or /a/, to ensure that listeners heard the critical front vowels only in the first syllable.
Visual stimuli were taken from the Bank of Standardized Stimuli (Brodeur et al., 2010). Images were divided into one of three categories that varied over three experiment rounds: Animals, food, and objects. Participants heard four words per round and words did not carry over between rounds. Vowel categories were asymmetrically represented in each round such that in the first round, three of the four words contained a high vowel while one contained a mid vowel. In the second, two of the words contained a high vowel, and two, a mid vowel; and in the final round, only one of four words contained a high vowel while the rest contained mid vowels. Four words and their accompanying images are shown as examples in Figure 1.
2.3. Participants
Seventy-three undergraduate students from Northwestern University, all native speakers of American English, were recruited to take part in the experiment for course credit. Participants were randomly assigned to one of the four conditions (Lax Frequent, Bias Absent N = 20; Tense Frequent, Bias Absent N = 18; Lax Frequent, Bias Present N = 18; Tense Frequent, Bias Present N = 17). All sessions took place in a soundproof booth. Participant recordings were made using a Shure SM81 Condenser Handheld Microphone. Participants listened to stimuli through Sony MDRV700 headphones. The experiment was administered using a PsychoPy script (Peirce, 2007).
2.4. Procedure
The experiment consisted of a baseline recording phase followed by three blocks, each consisting of a round of training trials followed by a round of test trials. In the baseline phase, participants read a set of words in English containing the critical vowels {i, ɪ, e, ɛ} in phonological contexts that were identical or similar to those in the artificial language. Note, however, that due to the relative infrequency of disyllabic English words with the desired phonological form, monosyllabic words were used for the baseline recordings. Following the baseline phase, participants were introduced to the model talkers either with or without accompanying biographical information. Participants in the Bias Present conditions were told that when they heard the native and experienced model talker speaking, they would see an image of an apple at the bottom of the screen, while the other talker would be represented by a notebook. We selected these icons because of their associations with teaching and learning, respectively. We elected to use these rather than depictions of human instructors because it was difficult to ensure that two humans looked distinct enough not to be confusable while also ensuring that participants would not interpret them as belonging to different preexisting social groups. In addition, we posited that using two human icons would run the risk of participants confusing which was the teacher and which the student teacher. Participants in the Bias Absent conditions were simply told that they would be learning from two speakers, and no images were introduced as representing either of the model speakers.
In the training rounds, participants were instructed to repeat after each model talker as a way of practicing the words they were learning. These repeated productions were recorded, and participants progressed through the training trials at their own pace. Trials in the training round were sequenced such that participants in the Bias Present and Bias Absent conditions alike first heard the model talker who produced the tense vowel variants [i,e] (hereafter the tense talker, identified as the native talker for participants in Bias Present condition). This speaker produced all four words in a given round once, followed by the model talker who produced the lax vowel variants [ɪ,ɛ] (hereafter the lax talker, identified as the nonnative talker for participants in the Bias Present condition), who also produced all four words in the same round. The participant would then hear either the tense or lax talker say each word again, depending on whether they were in the tense- or lax-frequent conditions. This talker repeated the words in the same order six additional times, after which participants always heard a penultimate round of repetitions from the lax talker and a final round from the tense talker (Figure 2). Any recency effects on participants’ productions in the following test round were thus always expected to favor the tense form regardless of condition.
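The sequencing just described can be sketched as a short script. This is an illustrative reconstruction rather than the experiment script itself: the function name `training_rounds` is hypothetical, and the sketch assumes that the frequent talker's middle stretch comprises six rounds in total, a reading chosen because it reproduces the 4:1 exposure ratio stated in Section 2.1.

```python
from collections import Counter

def training_rounds(frequent_talker, middle_rounds=6):
    """Return the talker order for one training block.

    Each list entry stands for one round in which that talker
    produces all four words of the block.
    """
    if frequent_talker not in ("tense", "lax"):
        raise ValueError("frequent_talker must be 'tense' or 'lax'")
    opening = ["tense", "lax"]                  # each talker says all four words once
    middle = [frequent_talker] * middle_rounds  # assumed: six rounds by the frequent talker
    closing = ["lax", "tense"]                  # penultimate lax round, final tense round
    return opening + middle + closing

for condition in ("tense", "lax"):
    counts = Counter(training_rounds(condition))
    print(condition, "frequent:", dict(counts))
```

Under this reading, each block contains eight rounds from the frequent talker and two from the other, and the final round always comes from the tense talker in both frequency conditions.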
In each block, the test phase followed the final training round and was similar to the training phase in that the same images were presented in the same order. Unlike in the training rounds, however, no audio accompanied the images. Participants were told that if they could not remember the word for an image, they should take their best guess and move on. Participants’ productions in the test phase were also recorded. There was no time limit for responses. After completing the first test round, participants were reminded of the instructions and, in the Bias Present conditions only, of the icons used to represent each teacher as the “native” or “nonnative” talker.
3. Analysis
3.1. Data Preparation
Recordings were segmented at the word and phoneme level using the Montreal Forced Aligner (McAuliffe et al., 2017). F1 and F2 formant values were automatically extracted using Praat (Boersma & Weenink, 2015). Participants’ productions of test items were manually checked, and tokens were removed that contained disfluencies or for which the speaker gave no response. Additional tokens were removed if they met any of the following criteria: the production contained consonants that did not match the target consonants in active articulator (labial or labiodental, coronal, or velar) or in voicing; the production contained a vowel other than /i/, /ɪ/, /ɛ/, or /e/ in the first syllable; or the token was a valid production of a vocabulary item that did not match the picture shown. A total of 138 tokens were removed according to these criteria. Out of a maximum of 219 possible test productions of each word across all participants, an average of 207 productions per word were retained after these exclusions. The number of retained productions per word ranged from 215 (for the words /tʃedu, peza/) to 199 (/pidʒo/).
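The token counts reported above can be cross-checked with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated outright at this point in the text: that the 138 removed tokens cover all exclusions combined, and that the vocabulary comprises the twelve lexical items listed in the Notes. The variable names are illustrative only.

```python
N_PARTICIPANTS = 73   # from Section 2.3
MAX_PER_WORD = 219    # maximum test productions of each word
N_WORDS = 12          # assumed: twelve lexical items (see Notes)
N_REMOVED = 138       # assumed: all excluded tokens combined

# Implied number of test productions of each word per participant.
productions_per_participant = MAX_PER_WORD // N_PARTICIPANTS

# Total possible tokens, tokens retained, and mean retained per word.
max_total = MAX_PER_WORD * N_WORDS
retained_total = max_total - N_REMOVED
mean_retained_per_word = retained_total / N_WORDS

print(productions_per_participant, max_total, retained_total, mean_retained_per_word)
```

On these assumptions the mean comes out at roughly 207.5 retained productions per word, which agrees with the reported average of 207.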
Formant values were normalized by speaker and by vowel category by subtracting the mean value for each speaker’s vowel class (high {i, ɪ}; mid {e, ɛ}) in the baseline condition from each observed token and dividing the resulting number by the standard deviation of that vowel class in the baseline condition. Positive measurements thus represent measurements greater than the average formant value for that speaker and vowel class, and negative values, less. More tense vowels are therefore represented by positive F2 values and negative F1 values. All of a participant’s baseline productions were used when calculating that participant’s baseline.
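The normalization described above amounts to a z-score computed against each speaker's baseline for the relevant vowel class. The following is a minimal sketch of that calculation, not the authors' analysis code. The function names, the participant label, and the formant values in the usage lines are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

# Vowel classes as defined in the text: high {i, ɪ} and mid {e, ɛ}.
VOWEL_CLASS = {"i": "high", "ɪ": "high", "e": "mid", "ɛ": "mid"}

def baseline_stats(baseline_tokens):
    """Return {(speaker, vowel class): (mean, SD)} from baseline tokens.

    Each token is a (speaker, vowel, formant value) tuple.
    """
    grouped = {}
    for speaker, vowel, value in baseline_tokens:
        grouped.setdefault((speaker, VOWEL_CLASS[vowel]), []).append(value)
    # stdev() is the sample standard deviation (an assumption of this sketch).
    return {key: (mean(vals), stdev(vals)) for key, vals in grouped.items()}

def normalize(token, stats):
    """Express one test token relative to the speaker's baseline for its vowel class."""
    speaker, vowel, value = token
    mu, sigma = stats[(speaker, VOWEL_CLASS[vowel])]
    return (value - mu) / sigma

# Hypothetical baseline F2 values (Hz) for one participant's high vowels.
baseline = [("p01", "i", 2400.0), ("p01", "ɪ", 2000.0), ("p01", "i", 2200.0)]
stats = baseline_stats(baseline)
z = normalize(("p01", "i", 2500.0), stats)
print(z)
```

Applied to F2, a positive result marks a token that is fronter, and hence more tenselike, than the speaker's baseline average for that vowel class; for F1 the tenselike direction is negative.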
3.2. Statistical Analysis
Results were analyzed using multivariate Bayesian linear mixed-effects regression. Because the distribution of test tokens was strongly bimodal in some conditions, we used mixed-effects quantile regression (see Note 2). F1 and F2 were included as response variables. Fixed effects included Social Condition (a categorical predictor with levels “Bias Present” and “Bias Absent”), Frequency Condition (a categorical predictor with levels “Tense Frequent” and “Lax Frequent”), vowel category (mid or high), and the interaction between these three. Random intercepts were included for word and participant. Slopes and intercepts were included for frequency and social condition by word. Models were implemented using the brms interface (Bürkner, 2017) to the Stan programming language (Carpenter et al., 2017). Posterior means comparison was carried out using the emmeans package (Lenth et al., 2019). Estimated means and credible intervals are reported for each fixed effect. Credible intervals are reported at 89% and 95% levels. A 95% credible interval that does not include zero is interpreted as evidence for a credible effect, while an 89% credible interval that does not include zero can be thought of as representing a somewhat credible effect. We report 89% CIs only for those results for which effects were credible at this level but not at the 95% level.
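The decision rule for credible intervals can be illustrated with a small sketch operating on posterior draws for a single coefficient. It is not the brms or emmeans implementation: the function names are hypothetical, the interval is an equal-tailed one computed by linear interpolation (an assumption, since the text does not specify the interval type), and the draws in the usage lines are an artificial grid rather than real posterior samples.

```python
def equal_tailed_interval(draws, level):
    """Equal-tailed credible interval from posterior draws.

    Quantiles are found by linear interpolation between order statistics.
    """
    ordered = sorted(draws)

    def quantile(p):
        pos = p * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    tail = (1 - level) / 2
    return quantile(tail), quantile(1 - tail)

def classify_effect(draws):
    """Apply the 95% / 89% rule described in the text to one coefficient."""
    lo95, hi95 = equal_tailed_interval(draws, 0.95)
    if lo95 > 0 or hi95 < 0:
        return "credible (95%)"
    lo89, hi89 = equal_tailed_interval(draws, 0.89)
    if lo89 > 0 or hi89 < 0:
        return "somewhat credible (89%)"
    return "not credible"

# Artificial draws: an evenly spaced grid standing in for posterior samples.
draws = [-0.04 + i / 1000 for i in range(1001)]
print(classify_effect(draws))
```

The 89% interval is consulted only when the 95% interval includes zero, which mirrors the reporting convention described above.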
H1 and H2 predict that social information and frequency, respectively, will affect phonetic realization independently of each other. They predict that Bias Present conditions and Tense Frequent conditions will be more tenselike than Bias Absent conditions and Lax Frequent conditions by exhibiting higher F2 and lower F1. In our model, these correspond to a main effect of social information and frequency, respectively. H3 predicts that when frequency favors lax vowels, but social upweighting favors tense vowels, the expected effect of frequency (i.e., vowels becoming more laxlike) will be reduced compared to when frequency favors the lax variant and there is no social weighting. In our model, this hypothesis corresponds to the existence of a credible interaction between social information and frequency. In our specific model, this should take the form of a credible positive interaction term for F1, and a negative interaction for F2. In both cases, credible interaction terms would indicate that the two Bias Present conditions (i.e., with Tense Frequent or Lax Frequent) are statistically more similar to one another than are the corresponding two Bias Absent conditions. The predictions of each hypothesis are outlined in Table 2.
Hypotheses and Predictions.
| Hypothesis | Relevant Contrast | Prediction |
| H1 | Bias Present vs. Bias Absent | Bias Present more tense-like |
| H2 | Tense Frequent vs. Lax Frequent | Tense Frequent more tense-like |
| H3 | Difference between Bias Present conditions vs. difference between Bias Absent conditions | Bias Present conditions more similar to one another than Bias Absent conditions. |
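The H3 prediction can be read as a difference of differences: the frequency effect within the Bias Present conditions minus the frequency effect within the Bias Absent conditions. The sketch below illustrates the expected signs. The cell values are invented for illustration and are not results from this study, the function names are hypothetical, and treating the interaction as this simple contrast of cell estimates is an assumption about how the model was coded.

```python
def frequency_effect(cells, bias):
    """Tense Frequent minus Lax Frequent estimate within one bias condition."""
    return cells[("tense_frequent", bias)] - cells[("lax_frequent", bias)]

def interaction(cells):
    """Difference of differences: frequency effect with bias minus without bias."""
    return frequency_effect(cells, "present") - frequency_effect(cells, "absent")

# Invented normalized cell estimates in which social bias shrinks the frequency effect.
f2_cells = {
    ("tense_frequent", "absent"): 0.6, ("lax_frequent", "absent"): 0.0,
    ("tense_frequent", "present"): 0.4, ("lax_frequent", "present"): 0.3,
}
f1_cells = {
    ("tense_frequent", "absent"): -0.4, ("lax_frequent", "absent"): 0.0,
    ("tense_frequent", "present"): -0.2, ("lax_frequent", "present"): -0.1,
}

print("F2 interaction:", interaction(f2_cells))  # negative under H3
print("F1 interaction:", interaction(f1_cells))  # positive under H3
```

Because tenser vowels have higher F2 and lower F1, a frequency effect that shrinks under social bias yields a negative contrast for F2 and a positive one for F1, matching the directions stated for H3.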
4. Results
Figure 3 shows the 95% CIs of the posterior median estimates of all four experimental conditions for F1 and F2, aggregated across vowel types. Addressing H1 (Social hypothesis) first, no credible main effect of Social Condition was found along either phonetic dimension (F1: β = –0.07, 95% CI = [–0.299, 0.161]; F2: β = 0.11, 95% CI = [–0.103, 0.327]). Addressing H2 (Frequency hypothesis), credible main effects of Frequency Condition were observed for F1 and F2 in the expected directions, suggesting that more frequent exposure to tense vowels is associated with more tenselike productions (F1: β = –0.35, 95% CI = [–0.567, –0.129]; F2: β = 0.59, 95% CI = [0.35, 0.835]). Turning to H3, a credible interaction between Frequency and Social Condition was found for both formants in the expected directions (F1: β = 0.38, 89% CI = [0.039, 0.712]; F2: β = –0.66, 95% CI = [–1.071, –0.236]). This confirms that the Bias Present conditions are credibly more similar to one another than the corresponding Bias Absent conditions are to one another.
These results are summarized and compared to predictions of our hypotheses in Table 3.
Hypotheses and Results.
| Hypothesis | Relevant Contrast | Prediction | Result |
| H1 | Bias Present vs. Bias Absent | Bias Present more tense-like | No main effect of Bias |
| H2 | Tense Frequent vs. Lax Frequent | Tense Frequent more tense-like | Vowels are more tenselike in the Tense Frequent condition |
| H3 | Difference between Bias Present conditions vs. difference between Bias Absent conditions | Bias Present conditions more similar to one another than Bias Absent conditions. | Bias Present conditions more similar to one another than Bias Absent conditions. |
We conducted a series of post hoc pairwise comparisons between the various conditions. The results of these comparisons are illustrated in Figure 3. This post hoc analysis was designed to determine the degree to which frequency and social information affected speakers’ productions. For example: Were productions in the Lax Frequent Bias Present condition as tenselike as those in the Tense Frequent Bias Present condition? Answering this question first, the analysis found no credible differences between the Bias Present conditions along either dimension (F1: β = 0.16, 95% CI = [–0.152, 0.460]; F2: β = –0.26, 95% CI = [–0.586, 0.068]). In addition, a comparison between the Lax Frequent conditions found that Lax Frequent Bias Present was credibly more tenselike than the Lax Frequent Bias Absent condition (F2: β = 0.443, 95% CI = [0.146, 0.744]; F1: β = –0.256, 89% CI = [–0.514, –0.021]). Together, these results confirm that the observed main effect of frequency was driven entirely by the Bias Absent conditions. A second question this analysis sought to answer was whether productions are more tenselike in the Tense Frequent Bias Absent or Lax Frequent Bias Present conditions, which can also be thought of as asking whether the effects of Frequency or Social Information were stronger when the two are at odds with one another. Tense Frequent Bias Absent was credibly tenser than Lax Frequent Bias Present along both dimensions (F1: β = 0.28, 89% CI = [0.010, 0.549]; F2: β = –0.48, 95% CI = [–0.815, –0.141]). Note that the differences between Tense Frequent Bias Absent and Lax Frequent Bias Present are not explicitly marked in Figure 3. Finally, we note that visual inspection of the graphs seems to suggest that Tense Frequent Bias Absent is, contrary to what might have been expected, tenser than the corresponding Tense Frequent Bias Present condition. A pairwise comparison between these two conditions, however, demonstrates that there is no credible difference along either dimension at any level of credibility, and thus the apparent paradoxical effect of social information is not statistically verified.
Although we did not hypothesize that results would differ between high and mid vowel categories, we conducted an additional post hoc analysis to determine whether observed effects were being driven primarily by one category. Figure 4 illustrates our results broken down by category and formant. The lack of a main effect of social bias was found to hold for all categories and dimensions except F2 of mid vowels. Mid vowels were found to be more tense in Bias Present conditions, as predicted, though the effect was small and credible only at the 89% level (β = –0.2, 89% CI = [–0.377, –0.01]). The effect of frequency held for all combinations of vowel and formant.
5. Discussion
This experiment was designed to assess the combined effects in production of the frequency of exposure to a vowel variant and the social information attached to that variant. While our model suggests a main effect of Frequency, the lack of a credible effect of Frequency in the Bias Present conditions (Tense Frequent vs. Lax Frequent) suggests that the frequency effect was driven primarily by the Bias Absent conditions. In other words, when speakers were given no social bias towards either variant, frequency effects appeared to predominate. A credible interaction was found between Social Bias and Frequency such that Bias Present forms were not statistically distinguishable based on Frequency. That is, the presence of social information strongly mitigated the effects of frequency, to the point of effectively cancelling them out. Our post hoc comparisons between the four conditions of our experiment demonstrate that, despite the lack of an overall effect of Social Bias across both Frequency conditions, speakers did in fact shift their productions towards an infrequent, socially preferred variant at the expense of a frequent, nonpreferred one. These results support the conclusion that speakers can overcome the effects of frequency given sufficient social motivation.
5.1. Artificial Language Learning
Methodologically, this study demonstrates the viability of using artificial language learning paradigms to study phonetic production. In terms of social information, the primary advantage offered by an artificial language paradigm is being able to privilege one variant over another for production in a way that relies minimally on speakers’ preexisting biases. This is useful because previous work has found that while social factors such as stance influence phonetic productions, the relationship between any particular social bias and a speaker’s phonetic behavior in a lab environment is difficult to predict in advance (Babel, 2010). While the involvement of social information is relatively minimal, this experiment does still rely on participants’ preexisting beliefs about the social status of “native” and “nonnative” variants, at least in a classroom environment.
Our goal in designing this experiment is not to uncritically endorse such a view as unproblematic. Furthermore, we admit that the distinction between “native” and “nonnative” speakers is somewhat unusual as a social manipulation compared to, for example, the distinction between native speakers from different regions. We argue, however, that this distinction is still fundamentally social in nature in the sense that a speaker’s knowledge of their interlocutor’s background mitigates that speaker’s uptake of the details of their interlocutor’s productions. We further acknowledge that previous work has found evidence that speakers do not necessarily resist convergence towards nonnative speech if given no explicit reason not to (Kim et al., 2011; Lewandowski & Nygaard, 2018; Wagner et al., 2021). There is also the possibility that speakers would preferentially converge towards the nonnative speaker out of either a sense of solidarity or the belief that the nonnative speaker, as an instructor, was still producing socially acceptable forms. Yet that was not the behavior of participants in our data, who appear to have shifted their productions towards the native form. Still, future work may benefit from the use of similar paradigms to explore how the assumption that speakers will preferentially converge to a native variant plays out as a function of speakers’ individual social and linguistic experiences.
Artificial language learning paradigms may also offer a promising avenue by which to study the interaction between social information and other sources of phonetic variation. One potentially interesting area for further study is that of reduction and enhancement effects. In addition to token frequency (Aylett & Turk, 2004), as tested here, various other word-level and situational factors, such as phonological neighborhood density (Gahl et al., 2012) and contextual predictability (Bell et al., 2009), have been shown to influence phonetic realizations along a spectrum from more extreme and hyperarticulated to reduced and hypoarticulated (Lindblom, 1990). The extent to which such factors interact with the signaling of social information is still a matter of active investigation (Clopper et al., 2023). Our results suggest that artificial language learning paradigms may be well suited to the study of these sorts of interactions.
5.2. Frequency Constraints on Social Signaling
We conceptualize the implementation of a speaker’s social strategy as beginning with the selection of a variant from a repertoire according to socially grounded criteria. We drew on the predictions of exemplar theory to describe how this socially motivated process was implemented using the mechanisms of speech production. The apparent ability of speakers to resist the effect of frequency could be accomplished in an exemplar theoretic framework through several potential mechanisms, such as attentional upweighting, inhibition effects (Pierrehumbert, 2001), or differences in the ways in which “standard” (in this case, “native”) variants are encoded relative to “nonstandard” (“nonnative”) ones. In exemplar theoretic terms, attentional upweighting can be modeled as a speaker activating a set of exemplars tagged with the social labels relevant to the chosen variant, and a phonetic target can be defined as the centroid of the region of multidimensional phonetic space that those exemplars occupy. Inhibition represents the inverse situation, whereby social weighting weakens the activation strength of the more frequent form.
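To make the two mechanisms concrete, the following toy sketch computes a phonetic target as an activation-weighted centroid of stored exemplars. It is an illustration under simplifying assumptions, not the model tested in this study or a full implementation of exemplar theory: the function name, the exemplar coordinates, the label names, and the weights are all hypothetical.

```python
def phonetic_target(exemplars, label_weights):
    """Activation-weighted centroid of exemplars in normalized (F1, F2) space.

    Each exemplar is (f1, f2, social_label); labels missing from
    label_weights receive a default activation weight of 1.0.
    """
    total = f1_sum = f2_sum = 0.0
    for f1, f2, label in exemplars:
        weight = label_weights.get(label, 1.0)
        total += weight
        f1_sum += weight * f1
        f2_sum += weight * f2
    return f1_sum / total, f2_sum / total

# Toy memory for a Lax Frequent learner: 2 tense and 8 lax exemplars.
memory = [(-1.0, 1.0, "native")] * 2 + [(1.0, -1.0, "nonnative")] * 8

unweighted = phonetic_target(memory, {})                  # frequency alone
upweighted = phonetic_target(memory, {"native": 4.0})     # attentional upweighting
inhibited = phonetic_target(memory, {"nonnative": 0.25})  # inhibition of the frequent form

print(unweighted, upweighted, inhibited)
```

In this toy setting, upweighting the infrequent form and inhibiting the frequent one shift the target by the same amount, which illustrates why production data of the kind reported here cannot by themselves separate the two mechanisms.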
Attentional upweighting during production is not, however, the only mechanism by which attention may mediate the relationship between frequency and social weighting. Another potential way in which the two variants examined in our study could be differentiated is by differences in their encoding in long-term memory. Sumner and colleagues (Sumner et al., 2014; Sumner & Kataoka, 2013) have argued that greater attention paid to productions during exposure can result in more robust long-term representations, and that “standard” forms attract greater attention. Under such an account, the outsized influence exerted by socially upweighted forms in production may not be the result of speaker-driven processes. That is, they may not reflect effort (conscious or otherwise) on the part of the speaker to sound more “nativelike,” but may instead result from “mechanistic” constraints on how linguistic representations are built. Ultimately, we believe our results cannot differentiate between a speaker-driven account and a purely mechanistic/representational one; they are compatible with either. The relatively small body of work on how exemplar theoretic mechanisms may impact production, as opposed to perception and recall, makes it difficult to say with certainty how the multiple potential mechanisms involved may interact in production.
Finally, it is worth considering how our results may relate to the contrast between English tense and lax vowels. It is quite possible, even probable, that there exists a three-way interaction between frequency, social information, and phonetic specification. Because our design was not balanced with regard to which variant was labeled “native,” we cannot discount the possibility that speakers would have converged less to the socially preferred form had lax been the upweighted variant. This is particularly interesting given that the native form was also “hyperarticulated,” and more peripheral forms are known to carry social meanings such as “educated” (Schilling, 2013). Tense forms may thus have enjoyed an advantage because these preexisting social connotations are congruous with the social information they represent in this study (i.e., “native,” “spoken by a teacher”). Given that several phonological and phonetic processes are known to differentially affect different phonetic categories, a similar asymmetry could exist for any other phonetic dimension, and this is a potential avenue for future research. Still, given our finding that participants did converge to the lax form in the absence of social information to persuade them to do otherwise, we can be relatively confident that the observed tendency to shift towards the native form was not purely due to that form’s phonological or phonetic properties.
5.3. Additional Observations
Finally, we speculate on two unexpected results of this study. The first is that, although the Social Bias effect did not reach credibility under the Tense Frequent conditions, socially biasing speakers towards tense forms appears to have resulted in productions that were less tense than those produced without social bias, which is somewhat puzzling. At the very least, the fact that productions in the Tense Frequent Bias Present condition were not more tenselike than those in the corresponding Tense Frequent Bias Absent condition goes against our hypotheses. Although we can only speculate on the possible reasons for this unexpected result, one potential explanation is that tracking social information along with phonetic information is a more cognitively demanding task than only tracking the latter. Speakers for whom the frequent variant was upweighted may thus have needed to split their attention between the words being produced and the social label attached to that variant, which may have left them less able to resist influence from the less frequent variant. In other words, this account theorizes that the effect of social information is to inhibit the influence exerted by the most frequent form regardless of the social weighting attached to that form.
Similarly, it is possible that the relatively small differences between speakers’ most tense vowels (Tense Frequent Bias Absent) and the vowels they produced when social information opposed frequency (Lax Frequent Bias Present) represent differences in category acquisition rather than speech production. We observe that participants’ productions of a less frequent but upweighted variant in the Lax Frequent Bias Present condition were not maximally tenselike, i.e., they did not maximally resemble the preferred variant. Nonetheless, participants in the Lax Frequent Bias Present condition were successful in producing a phonetic target that was partially assimilated to the upweighted variant. The vowels produced in that condition were more tenselike than the more laxlike vowels produced in the Lax Frequent condition with no Social Bias. Results from distributional learning suggest that speakers update their beliefs about the phonetic distributions associated with a sound category based on recent experience and their existing beliefs (Kleinschmidt & Jaeger, 2015; McMurray et al., 2009; Theodore & Monto, 2019). In cases where the socially preferred form was not frequent in the input (Lax Frequent, Bias Present), participants may not have had access to enough examples of the upweighted variant to arrive at accurate and precise hypotheses regarding its phonetic characteristics. Consequently, participants may have come to associate preferred variants with phonetic targets that were “inaccurate,” in the sense of not veridically reflecting the input, or “imprecise,” in the sense of overlapping with the distribution of dispreferred variants.
6. Conclusion
This study investigated a situation in which a speaker’s social goals favored the production of a specific sociolinguistic variant, while the effects of variant frequency favored a competitor. We hypothesized that the influence of social information would encourage speakers to produce forms that are as close as possible to those of the socially upweighted form, but that the effect of frequency would pull their productions towards the other, more frequent variant. We made use of an artificial language learning paradigm to precisely control for these and other factors known to influence speech production. Participants learned to produce words that could be realized with either a tense vowel {i,e} or the corresponding lax vowel {ɪ,ɛ}. The relative frequency with which each variant was encountered and the social information attached to that variant varied between conditions. Results of mixed-effect modelling suggest that the phonetic values speakers produced after exposure were predicted by both social information and frequency, but that the effect of social bias was, generally speaking, able to override that of frequency.
This study has implications for our understanding of the relationship between linguistic representations and social behavior. Mainly, the results offer evidence that increased exposure to a variant may not necessarily result in that variant exerting a larger effect on subsequent productions. This may help further our understanding of phonetic convergence and accommodation by suggesting mechanisms and strategies through which speakers may fail to converge towards their interlocutor. Overall, these results point to avenues by which our understanding of social decision making can be better integrated with our understanding of the representations involved in speech production.
Competing Interests
The authors have no competing interests to declare.
Notes
1. The complete list of words is given here. There are two pronunciations each for twelve lexical items, resulting in a total of twenty-four variants: {tʃedu, tʃɛdu}, {fezo, fɛzo}, {kedo, kɛdo}, {peza, pɛza}, {sevu, sɛvu}, {teva, tɛva}; {tʃigu, tʃɪgu}, {fidʒo, fɪdʒo}, {kibo, kɪbo}, {pidʒa, pɪdʒa}, {sigu, sɪgu}, {tiva, tɪva}. English words used as baselines were “bed,” “page,” “seed,” “give,” “cage,” “deed,” “head,” “bid,” “tease,” “fizz,” “save,” “shed,” “peas,” “edge,” “kid,” “daze,” “cheese,” “fade,” and “pig.”
2. Quantile regression is more robust to violations of normality (Yu & Moyeed, 2001). Unlike Gaussian linear regression, these models calculate the effect of predictor variables on the conditional median of the dependent variable, rather than the conditional mean. Aside from calculating different measures of central tendency, the two classes of model function similarly in calculating how the expected value of the dependent variable changes conditioned on the independent variable.
References
Austin, A. C., Schuler, K. D., Furlong, S., & Newport, E. L. (2022). Learning a Language from Inconsistent Input: Regularization in Child and Adult Learners. Language Learning and Development, 18(3), 249–277. http://doi.org/10.1080/15475441.2021.1954927
Aylett, M., & Turk, A. (2004). The smooth signal redundancy hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Language and Speech, 47(1), 31–56.
Babel, M. (2009). Phonetic and Social Selectivity in Speech Accommodation. University of California, Berkeley.
Babel, M. (2010). Dialect divergence and convergence in New Zealand English. Language in Society, 39(4), 437–456. http://doi.org/10.1017/S0047404510000400
Babel, M. (2012). Evidence for phonetic and social selectivity in spontaneous phonetic imitation. Journal of Phonetics, 40(1), 177–189. http://doi.org/10.1016/j.wocn.2011.09.001
Babel, M., McGuire, G., Walters, S., & Nicholls, A. (2014). Novelty and social preference in phonetic accommodation. Laboratory Phonology, 5(1), 123–150.
Bell, A., Brenier, J. M., Gregory, M., Girand, C., & Jurafsky, D. (2009). Predictability effects on durations of content and function words in conversational English. Journal of Memory and Language, 60(1), 92–111.
Bloomfield, L. (1933). Language. https://digitalcommons.rockefeller.edu/jason-brown-library/80/
Brodeur, M. B., Dionne-Dostie, E., Montreuil, T., & Lepage, M. (2010). The Bank of Standardized Stimuli (BOSS), a new set of 480 normative photos of objects to be used as visual stimuli in cognitive research. PloS One, 5(5), e10773.
Bucholtz, M., & Hall, K. (2005). Identity and interaction: A sociocultural linguistic approach. Discourse Studies, 7(4–5), 585–614.
Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80, 1–28.
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1–32.
Clopper, C. G., Burdin, R. S., & Turnbull, R. (2023). Second dialect acquisition and phonetic vowel reduction in the American Midwest. Journal of Phonetics, 99, 101243. http://doi.org/10.1016/j.wocn.2023.101243
Clopper, C. G., & Dossey, E. (2020). Phonetic convergence to Southern American English: Acoustics and perception. The Journal of the Acoustical Society of America, 147(1), 671–683. http://doi.org/10.1121/10.0000555
Clopper, C. G., & Pierrehumbert, J. B. (2008). Effects of semantic predictability and regional dialect on vowel space reduction. The Journal of the Acoustical Society of America, 124(3), 1682–1688. http://doi.org/10.1121/1.2953322
Coupland, N. (1980). Style-shifting in a Cardiff work-setting. Language in Society, 9(1), 1–12.
Dell, G. S. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93(3), 283–321. http://doi.org/10.1037/0033-295X.93.3.283
Dodsworth, R., & Benton, R. A. (2017). Social network cohesion and the retreat from Southern vowels in Raleigh. Language in Society, 46(3), 371–405. http://doi.org/10.1017/S0047404517000185
D’Onofrio, A. (2015). Persona-based information shapes linguistic perception: Valley Girls and California vowels. Journal of Sociolinguistics, 19(2), 241–256.
D’Onofrio, A. (2018). Personae and phonetic detail in sociolinguistic signs. Language in Society, 47(4), 513–539.
Drager, K. K. (2010). Sensitivity to grammatical and sociophonetic variability in perception. Laboratory Phonology, 1(1), 93–120. http://doi.org/10.1515/labphon.2010.006
Drager, K., & Kirtley, M. J. (2016). Awareness, salience, and stereotypes in exemplar-based models of speech production and perception. Awareness and Control in Sociolinguistic Research, 1–24.
Finley, S., & Badecker, W. (2009). Artificial language learning and feature-based generalization. Journal of Memory and Language, 61(3), 423–437. http://doi.org/10.1016/j.jml.2009.05.002
Gahl, S., Yao, Y., & Johnson, K. (2012). Why reduce? Phonological neighborhood density and phonetic reduction in spontaneous speech. Journal of Memory and Language, 66(4), 789–806. http://doi.org/10.1016/j.jml.2011.11.006
Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105(2), 251.
Goldrick, M., & Cole, J. (2023). Advancement of phonetics in the 21st century: Exemplar models of speech production. Journal of Phonetics, 99, 101254. http://doi.org/10.1016/j.wocn.2023.101254
Hay, J., & Drager, K. (2010). Stuffed toys and speech perception. Walter de Gruyter GmbH & Co. KG.
Hay, J., Jannedy, S., & Mendoza-Denton, N. (1999). Oprah and /ay/: Lexical frequency, referee design and style. Proceedings of the 14th International Congress of Phonetic Sciences, 1389–1392.
Johnson, K. (2006). Resonance in an exemplar-based lexicon: The emergence of social identity and phonology. Journal of Phonetics, 34(4), 485–499.
Johnson, K. (2007). Decisions and mechanisms in exemplar-based phonology. Experimental Approaches to Phonology, 25–40.
Kim, M., Horton, W. S., & Bradlow, A. R. (2011). Phonetic convergence in spontaneous conversations as a function of interlocutor language distance. Laboratory Phonology, 2(1), 125–156. http://doi.org/10.1515/labphon.2011.004
Kleinschmidt, D. F., & Jaeger, T. F. (2015). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122(2), 148.
Koops, C., Gentry, E., & Pantos, A. (2008). The effect of perceived speaker age on the perception of PIN and PEN vowels in Houston, Texas. University of Pennsylvania Working Papers in Linguistics, 14(2), 12.
Lai, W., Rácz, P., & Roberts, G. (2019). Unexpectedness makes a sociolinguistic variant easier to learn: An alien-language-learning experiment. CogSci, 604–610.
Lenth, R., Singmann, H., Love, J., Buerkner, P., & Herve, M. (2019). emmeans: Estimated marginal means, aka least-squares means (Version 1.3.4). https://CRAN.R-project.org/package=emmeans
Levelt, W. J., Roelofs, A., & Meyer, A. S. (1999). A theory of lexical access in speech production. Behavioral and Brain Sciences, 22(1), 1–38.
Lewandowski, E. M., & Nygaard, L. C. (2018). Vocal alignment to native and non-native speakers of English. The Journal of the Acoustical Society of America, 144(2), 620–633.
Li, A., & Roberts, G. (2023). Co-Occurrence, Extension, and Social Salience: The Emergence of Indexicality in an Artificial Language. Cognitive Science, 47(5), e13290. http://doi.org/10.1111/cogs.13290
Lindblom, B. (1990). Explaining phonetic variation: A sketch of the H&H theory. In Speech production and speech modelling (pp. 403–439). Springer.
McMurray, B., Aslin, R. N., & Toscano, J. C. (2009). Statistical learning of phonetic categories: Insights from a computational approach. Developmental Science, 12(3), 369–378.
Munson, B., & Solomon, N. P. (2004). The Effect of Phonological Neighborhood Density on Vowel Articulation. Journal of Speech, Language, and Hearing Research, 47(5), 1048–1058.
Nycz, J. (2018). Stylistic variation among mobile speakers: Using old and new regional variables to construct complex place identity. Language Variation and Change, 30(2), 175–202. http://doi.org/10.1017/S0954394518000108
Ostrand, R., & Chodroff, E. (2021). It’s alignment all the way down, but not all the way up: Speakers align on some features but not others within a dialogue. Journal of Phonetics, 88, 101074. http://doi.org/10.1016/j.wocn.2021.101074
Pardo, J. S. (2006). On phonetic convergence during conversational interaction. The Journal of the Acoustical Society of America, 119(4), 2382–2393. http://doi.org/10.1121/1.2178720
Pardo, J. S. (2012). Reflections on Phonetic Convergence: Speech Perception does not Mirror Speech Production. Language and Linguistics Compass, 6(12), 753–767. http://doi.org/10.1002/lnc3.367
Pardo, J. S., Gibbons, R., Suppes, A., & Krauss, R. M. (2012). Phonetic convergence in college roommates. Journal of Phonetics, 40(1), 190–197.
Pardo, J. S., Jay, I. C., & Krauss, R. M. (2010). Conversational role influences speech imitation. Attention, Perception, & Psychophysics, 72(8), 2254–2264. http://doi.org/10.3758/BF03196699
Pickering, M. J., & Garrod, S. (2004a). The interactive-alignment model: Developments and refinements. Behavioral and Brain Sciences, 27(2). http://doi.org/10.1017/S0140525X04000056
Pickering, M. J., & Garrod, S. (2004b). Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27(2), 169–190. http://doi.org/10.1017/S0140525X04000056
Pierrehumbert, J. B. (2001). Exemplar dynamics: Word frequency, lenition and contrast. In J. L. Bybee & P. J. Hopper (Eds.), Typological Studies in Language (Vol. 45, p. 137). John Benjamins Publishing Company. http://doi.org/10.1075/tsl.45.08pie
Pierrehumbert, J. B. (2003). Phonetic Diversity, Statistical Learning, and Acquisition of Phonology. Language and Speech, 46(2–3), 115–154. http://doi.org/10.1177/00238309030460020501
Samara, A., Smith, K., Brown, H., & Wonnacott, E. (2017). Acquiring variation in an artificial language: Children and adults are sensitive to socially conditioned linguistic variation. Cognitive Psychology, 94, 85–114. http://doi.org/10.1016/j.cogpsych.2017.02.004
Schilling, N. (2013). Investigating Stylistic Variation. In The Handbook of Language Variation and Change (pp. 325–349). John Wiley & Sons, Ltd. http://doi.org/10.1002/9781118335598.ch15
Shockley, K., Sabadini, L., & Fowler, C. A. (2004). Imitation in shadowing words. Perception & Psychophysics, 66(3), 422–429. http://doi.org/10.3758/BF03194890
Squires, L. (2013). It don’t go both ways: Limited bidirectionality in sociolinguistic perception. Journal of Sociolinguistics, 17(2), 200–237. http://doi.org/10.1111/josl.12025
Sumner, M., & Kataoka, R. (2013). Effects of phonetically-cued talker variation on semantic encoding. The Journal of the Acoustical Society of America, 134(6), EL485–EL491.
Sumner, M., Kim, S. K., King, E., & McGowan, K. B. (2014). The socially weighted encoding of spoken words: A dual-route approach to speech perception. Frontiers in Psychology, 4, 1015.
Theodore, R. M., & Monto, N. R. (2019). Distributional learning for speech reflects cumulative exposure to a talker’s phonetic distributions. Psychonomic Bulletin & Review, 26(3), 985–992. http://doi.org/10.3758/s13423-018-1551-5
Todd, R. W., & Pojanapunya, P. (2009). Implicit attitudes towards native and non-native speaker teachers. System, 37(1), 23–33.
Wade, L., & Roberts, G. (2020). Linguistic Convergence to Observed Versus Expected Behavior in an Alien-Language Map Task. Cognitive Science, 44(4), e12829. http://doi.org/10.1111/cogs.12829
Wagner, M. A., Broersma, M., McQueen, J. M., Dhaene, S., & Lemhöfer, K. (2021). Phonetic convergence to non-native speech: Acoustic and perceptual evidence. Journal of Phonetics, 88, 101076.
Wright, R. (2004). Factors of lexical competition in vowel articulation. Papers in Laboratory Phonology VI, 75–87.
Yu, K., & Moyeed, R. A. (2001). Bayesian quantile regression. Statistics & Probability Letters, 54(4), 437–447.
Zhang, Q. (2005). A Chinese yuppie in Beijing: Phonological variation and the construction of a new professional identity. Language in Society, 34(3), 431–466.





