1. Introduction
Systematic relationships between linguistic variables (covariation) can differentiate between groups of speakers in a community (e.g., Horvath, 1985), and sociolinguistic styles and personae are predicated on variables, linguistic or otherwise, working together to create social meaning (e.g., D’Onofrio, 2020; Eckert, 2019, pp. 752–754). In New Zealand English (NZE), covarying vowel clusters have been documented across time periods and datasets (Brand et al., 2021; Hurring et al., 2025). A speaker’s realisation of one monophthong in a cluster provides information about their realisation of the other monophthongs in that cluster, including monophthongs whose relationship is not directly attributable to tandem shifts in the vowel space (e.g., chain shifts).
Observations of covarying vowels that do not have a clear structural connection raise the question of whether NZE vowel clusters capture socio-stylistic as well as phonological relationships between vowels. Covarying variables can characterise socially distinct groups of speakers and be interpreted socially by listeners (e.g., Campbell-Kibler, 2011; Levon, 2007; Pharao & Maegaard, 2017). Initial explorations linking NZE vowel clusters to listener perception also indicate that NZE listeners differentiate between speakers who talk faster and slower, who have lower and higher fundamental frequencies, and who are at opposite ends of the leader-lagger continuum, a set of covarying vowels undergoing change in NZE across which speakers are either ‘leaders’ or ‘laggers’ (Sheard et al., 2025). Here, we build on the apparent perceptibility of the leader-lagger continuum and ask whether contrasting patterns of vowel covariation affect how NZE speakers are socially evaluated by NZE listeners.
To investigate the potentially social basis of NZE vowel clusters, we implement a preregistered perceptual task in which New Zealanders create and label groups of NZE speakers who they think sound similar to one another.1 We use the task responses to (1) reveal the social and linguistic characteristics listeners associate with perceptually distinct groups of speakers and (2) test whether these distinct speaker groups are characterised by their covarying vowel patterns, articulation rate and mean pitch. The findings suggest that speakers’ perceived and actual speech rates and pitches align. Listeners also associate leaders who speak at slower rates with lower socioeconomic status and with stronger, broader NZE accents, and associate laggers with higher socioeconomic status. Overall, the results provide evidence that at least one NZE vowel cluster carries socio-stylistic meaning.
2. Speaker production and listener perception of linguistic covariation
2.1. Delimiting speaker groups based on their production of covariation patterns
Historically, approaches to identifying distinct groups of speakers focused on single linguistic features—mapping individual isoglosses and tracking quantitative variation in individual variables across macro-social categories (e.g., Labov, 1966; Trudgill, 1972). Since the mid-twentieth century, however, dialectology and (dia)lectometry have increasingly examined how language varieties are distinguished by their aggregate differences across a range of linguistic features (see Wieling & Nerbonne, 2015). Contemporary variationist sociolinguistics has also seen a growing interest in the relationships between sociolinguistic variables (e.g., Beaman & Guy, 2022).
Sociolinguistic work situated within the third wave in particular (e.g., Eckert, 2012; Hall-Lew et al., 2021) has emphasised how the co-occurrence of specific subsets of variables contributes to the expression of personal identity and the construction of speaker styles, stances, and personae in speech production (e.g., Bucholtz, 2008; Pratt, 2019). Others have applied multivariate statistical methods to sociolinguistic data to delimit groups of speakers characterised by systematic patterns of covarying variables. For example, Horvath’s (1985) application of Principal Components Analysis (PCA) in her seminal work on ethnic variation revealed distinct Australian English sociolects in Sydney. These sociolects corresponded to age, ethnic and socioeconomic speaker groups characterised by their patterns of covarying face, fleece, price, mouth and goat variants. More recently, multivariate techniques have been applied to differentiate between speakers with more and less coherent use of a range of Swabian dialect features over time (Beaman & Sering, 2022) and to identify the Bequian English linguistic signatures of villages in St. Vincent and the Grenadines (Meyerhoff & Klaere, 2017).
Multivariate analyses of New Zealand English data have also demonstrated that speakers of New Zealand English can be characterised in terms of covarying patterns in the normalised vowel space (Brand et al., 2021; Hurring et al., 2025). While the NZE vowel system is comparable to other standard, non-rhotic, English varieties such as RP and Australian English (Bauer et al., 2007, p. 98), NZE monophthongs have undergone extensive changes over the course of the twentieth and twenty-first centuries (see Figure 1). In particular, the short front vowels kit, dress, and trap have undergone a well-documented push-chain shift where the raising and fronting of trap prompted the raising and fronting of dress and centralisation of kit (e.g., Gordon et al., 2004; Hay et al., 2015). This has also been accompanied by a retraction and lowering of fleece, raising and fronting of nurse and thought, and fronting of goose (Brand et al., 2021; Maclagan et al., 2009, 2017). The back vowels lot, start and strut have each moved lower and/or backer in the vowel space to different degrees (Brand et al., 2021, p. 10).
Change over time for NZE monophthongs in Origins of New Zealand English (ONZE) corpus data (Gordon et al., 2007). Hurring et al. (2025, p. 7) used under CC BY 4.0.
Brand et al.’s (2021) application of PCA to data from the Origins of New Zealand English (ONZE) corpus (Gordon et al., 2007), representing speakers born 1864–1982, revealed that a speaker’s realisation of a given monophthong within the set of ten NZE monophthongs in Figure 1 provides insight into their realisations of the other vowels in this set. The analysis specifically revealed three subsets of covarying vowels (vowel clusters), the first two of which are the focus of this paper. Brand et al. label the first subset ‘the restructuring back vowels’ (speakers with lower, fronter thought have backer start and strut; henceforth the back-vowel configuration). The second subset captures a group of vowel changes in progress (trap, dress, fleece, kit, nurse, and lot), across which individual speakers are consistently leaders or laggers (henceforth the leader-lagger continuum).
Hurring et al. (2025) then implemented the same methodology as Brand et al. (2021) using the QuakeBox (QB), a more recent corpus of accounts of the 2010–2011 Canterbury earthquakes (Clark et al., 2016; Walsh et al., 2013). Speakers aged 18 to 85+ years old are represented in the corpus. Hurring et al. not only found clusters of vowels in QB similar to those in ONZE but also documented the relative stability of these clusters within recordings made of the same individuals eight years apart.2 Hurring et al. therefore provide evidence for stable relationships between back vowels and between ongoing sound changes across NZE speakers and over the lifespan of individuals. Together, Hurring et al. and Brand et al. show that subsets of NZE monophthongs systematically covary in speech production.
Brand et al. (2021, p. 21) argue that the observed patterns of covariation cannot be solely explained as “the phonetic implementation of wholesale phonological features.” The leader-lagger continuum encompasses vowels involved in chain shifts in the NZE vowel space (trap, dress, and kit) as well as vowels that are not involved in chain shifts but are still undergoing change (e.g., nurse). Moreover, all NZE monophthongs have undergone change over time, but they are not all captured in a single cluster. The repeated observations of stable covarying vocalic patterns that are not clearly structurally linked bring us, then, to the question of whether the underlying explanation for these vowel clusters could be social (Brand et al., 2021, pp. 21–23).
The results of Hurring et al. (2025) further suggest that the observed clusters could be social, showing that the same vowels continue to covary even after the apparent completion of some sound changes. For example, trap was one of the leader-lagger vowels in Brand et al.’s (2021) analysis of the older ONZE corpus, in which trap was also undergoing change as part of the front-vowel chain shift. In the contemporary QB corpus examined by Hurring et al. (2025), however, realisations of trap appear to be stable and are not significantly predicted by the age of the speaker. Nonetheless, its production continues to covary with the other leader-lagger vowels. While the initial covariation may have been partly structural, its persistence in contemporary NZE is consistent with its having acquired a relatively stable social meaning.
One starting point for examining the potentially social basis of covarying NZE vowel clusters is their perceptual relevance and sociolinguistic salience: Do listeners associate different covariation patterns with different social characteristics?
2.2. The perception of speakers with different covariation patterns
How do patterns of covariation in speech translate into listener perception? The sensitivity of listeners to sociolinguistic variation in speech is well documented (see Campbell-Kibler, 2010; Drager, 2010; Thomas, 2002), but our knowledge of how listeners differentiate between speakers with distinct covariation patterns is comparatively limited. Despite the evidence for linguistic features working “synergistically in the perceptual processes” (Montgomery & Moore, 2018, p. 656), sociolinguistic perceptual research, like sociolinguistic research on production, has focused on individual variables, often in highly controlled audio stimuli. Even perceptual work on personae has tended to focus on how social information affects listener categorisation of individual variables (e.g., D’Onofrio, 2015, 2018) or on the social meanings (and, by extension, personae) associated with individual variables or with multiple variables analysed independently (e.g., Becker, 2014; MacFarlane & Stuart-Smith, 2012; Pharao et al., 2014).
The available sociolinguistic perceptual research on covariation has mainly focused on how variants of multiple variables interact to affect evaluations of male speakers’ masculinity and sexuality (e.g., Campbell-Kibler, 2011; Levon, 2007, 2014; Pharao & Maegaard, 2017). These studies demonstrate that variant combinations can affect how speakers are perceived and evaluated by listeners, although co-occurring variants do not guarantee an additive perceptual effect. For example, Levon (2014, pp. 554–555) found that combining elevated mean pitch and increased sibilance did not lead to an increase in perceived “gayness” beyond what either variable achieved independently. Separately, the literature on listener attitudes towards accents has shown that standard or conservative English accents tend to carry more prestige than non-standard ones (Giles, 1970; Hiraga, 2005), although this effect can be mediated by listener age (e.g., Levon et al., 2021). Convergence research has also revealed that listeners can associate ‘heard’ variants with the ‘unheard’ variants with which they (are expected to) typically co-occur. For example, Wade (2022) shows that speakers converge on more ‘Southern’ realisations of price after exposure to Southern features other than monophthongal price in audio stimuli.
While there is evidence for New Zealand English variables carrying social meaning for NZE listeners (e.g., Bayard & Bartlett, 1996; Szakay, 2008, 2012; Walker, 2007), there is limited evidence tying social judgements of NZE speakers to covarying sociolinguistic variables (though see Gordon, 1997). Sheard et al. (2025) took a first step towards investigating whether the covarying NZE vowel patterns identified in Hurring et al. (2025) are perceptible to listeners, taking short clips from a subset of the monologues analysed in Hurring et al. and playing them to listeners in a pairwise similarity rating task. Listeners were asked to rate how similar the speakers sounded and were given instructions that steered them toward rating the likely social characteristics of the speakers, as opposed to the content or superficial acoustic features. Sheard et al. used Multidimensional Scaling (MDS) to create a perceptual similarity space and investigated how this space was related to a subset of linguistic variables: the Principal Components from the analysis in Hurring et al., representing the speaker’s position in the leader-lagger continuum and the back-vowel configuration, and speaker articulation rate and mean pitch.
Their analysis suggested that listeners systematically differentiated between speakers based on their pitch, speed, and position in the leader-lagger continuum, but not the back-vowel configuration. Figure 2 shows the interpretation of the perceptual space based on listeners’ pairwise ratings in Sheard et al. (2025), with labels for characteristics of distinct groups of speakers based on their speech production. The results indicate higher pitch speakers (top left corner, diamonds, circles) are perceptually distinct from speakers with lower mean pitch and/or slower speech rates (bottom, squares, and down-facing triangles). Within the lower-pitch (and/or slower) speakers, leaders (down-facing triangles) are perceptually distinct from laggers (squares). Fast leaders appear to be less perceptually distinct and overlap with each of the surrounding groups.
The results of perceptual research to date therefore point to listener awareness of certain covariation patterns and to the capacity of co-occurring variants to affect listener evaluations. We also have evidence that the leader-lagger continuum is perceptible to listeners and differentiates between speakers of NZE from the perspectives of both production and perception. We do not yet know, however, whether NZE speakers with different patterns of vocalic covariation are associated with specific or different social meanings or characteristics. Is there a salient social difference between the speakers in the different clusters shown in Figure 2?
Further, Sheard et al. (2025) show that some characteristics may be more salient than others; for example, speakers sharing high pitch are rated as similar more consistently than speakers sharing low pitch. The task also shows order effects, suggesting that the characteristics of the first stimulus heard may influence the dimension on which similarity is judged, consistent with work showing that similarity is not a symmetric relation (Tversky, 1977). Would the same clusters of speakers emerge if, instead of assessing pairwise similarity, listeners freely classified speakers into groups and could listen to the speakers as many times as they liked? Or would such a task allow truly social effects to emerge more robustly?
Variationist and accent bias perceptual research tend to take a top-down methodological approach in which listeners evaluate speakers along a set of predetermined characteristics (e.g., professionalism, friendliness, employability) based on samples that researchers have pre-selected as representative of different speakers or accents. In our case, however, we have not yet established the social characteristics listeners might associate with either the leader-lagger continuum or the back-vowel configuration. It is possible that covarying vowel patterns carry associations with broader macro-social categories (e.g., rural or urban New Zealanders) or with more specific, localised personae (i.e., a specific kind of urban or rural New Zealander). Alternatively, it is possible that they carry no social meaning at all or are not salient in the context of uncontrolled audio stimuli.
Interpretation of perceptual space produced by means of MDS in Sheard et al. (2025), with speakers grouped by shared production characteristics (used under CC BY 4.0).
As such, we build here on the foundations of the production and perception research on covarying NZE vowels by taking a bottom-up approach to listener perception of covariation that allows us to directly connect the perceptual grouping of speakers to the labels NZE listeners used to describe them. We conduct a free classification task in which participants create groups of similar-sounding speakers. Crucially, we require participants to label these groups, providing insight into the types of judgements participants are making and the degree to which these judgements are social.
2.3. Research questions
We address the following two research questions through the implementation of our free classification task:
A social interpretation of the relationship between covarying vowels: Brand et al. (2021) argue that the observed vowel clusters, particularly the leader-lagger continuum, likely reflect socio-stylistic rather than structural or phonological relationships between vowels. This explanation relies on these clusters carrying associated social meaning(s) which have not yet been tested for. As such, we ask whether speakers with different covarying vowel patterns are not only perceived to sound different to one another but are associated with distinct social characteristics. Are the previously documented perceptual differences between leaders and laggers socially meaningful? In other words, do listener judgements support a social interpretation of the relationships between covarying vowels, a more structural interpretation of these clusters as phonologically driven parallel sound changes, or a combination of the two?
The social meaning of speaker pitch and articulation rate: The participants in Sheard et al. (2025) consistently used these characteristics to rate speaker pairwise similarity, and speakers were differentiated within the MDS perceptual space based on both characteristics (in interaction with each other and, in the case of pitch, with the leader-lagger vowels). Is this because listeners hear these characteristics as indexing particular social meanings? Or is it simply because these characteristics are so salient that they dominate, even when the instructions steer participants toward social ratings?
3. Methodology
3.1. Free classification task design
Free classification is a technique with comparatively limited implementation in the context of sociolinguistic perceptual research. Where it has been applied, the focus has usually been on listeners’ identification of regional accents of American English speakers (e.g., Clopper & Bradlow, 2009; Clopper & Pisoni, 2007). In free classification, participants are presented with multiple stimuli at the same time and asked to place the stimuli into groups, with the number and size of groups determined by the participant. It is a “perceptual sorting task in which listeners are asked to group stimuli according to similarity” (Lansford et al., 2014, p. 2). The “critical property” of the paradigm is that category labels and dimensions of contrast do not need to be specified in advance by the experimenter (Clopper, 2008, p. 575), making it well suited to our research questions: it allows clusters of speakers to be developed from the bottom up.
We use the same audio stimuli from the same 38 QuakeBox participants in Sheard et al. (2025) so that results of the free classification task are comparable to those of the pairwise rating task. These speakers are all Pākehā (New Zealand European) women, aged 46–55, who were analysed in Hurring et al. (2025) and consented to have audio of their QuakeBox monologue shared publicly. As detailed in Sheard et al., the clips are each up to 10 seconds in length, normalised to 70 dB for consistency, and contain at least half of the monophthongs analysed in Brand et al. (2021) and Hurring et al. Further details about the stimuli are available in Section 3 of the supplementary materials.
To enable the experiment to be conducted online, we used an adapted version of the Audio Tokens JavaScript toolbox developed by Donhauser and Klein (2023) integrated with jsPsych (de Leeuw, 2015), an open-source JavaScript framework for creating online experiments. This task specifically utilised the clustering setting from the Audio Tokens toolbox, which is effectively a free classification task where participants are presented with stimuli represented by coloured circles which are surrounded by a larger circle. Participants then drag and drop the stimuli into groups within the larger circle, with a black line appearing to connect stimuli placed in the same cluster. We also adapted the clustering setting of the toolbox so that a text box appeared on the page for each group a participant made (i.e., if a participant made four groups, there would be four text boxes, see Figure 3).3 Participants were required to write specific labels related to the groups they had created to complete the task. At the end of the task, participants completed a demographic questionnaire regarding their age, gender, ethnic background, level of education, occupation, and region they grew up in.
To familiarise participants with the interface and the drag-and-drop mechanism, we gave them a practice task that used audio stimuli from musical instruments rather than voices, to avoid priming them for the actual stimuli (e.g., using stimuli from different geographical varieties of English may have primed them to listen for regional differences in the actual task). To avoid overloading participants, the 38 stimuli were distributed randomly across three tasks (12 stimuli in the first task, and 13 in each of the second and third).
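A random split of this kind can be sketched in a few lines. This is an illustrative Python sketch, not the experiment's actual code: the helper name `split_stimuli` and the speaker IDs are hypothetical, and we assume (the text does not say) that a fresh random split is drawn for each participant.

```python
import random

def split_stimuli(stimulus_ids, sizes=(12, 13, 13), seed=None):
    """Shuffle the stimuli and cut them into task blocks of the given sizes.

    Hypothetical helper; assumes one independent shuffle per participant.
    """
    if sum(sizes) != len(stimulus_ids):
        raise ValueError("task sizes must sum to the number of stimuli")
    rng = random.Random(seed)
    shuffled = list(stimulus_ids)
    rng.shuffle(shuffled)
    tasks, start = [], 0
    for size in sizes:
        tasks.append(shuffled[start:start + size])
        start += size
    return tasks

# Hypothetical IDs standing in for the 38 QuakeBox speakers.
speakers = [f"spk{i:02d}" for i in range(1, 39)]
tasks = split_stimuli(speakers, seed=1)
print([len(t) for t in tasks])  # [12, 13, 13]
```

One consequence of this design is that any given pair of speakers shares a task only some of the time, which matters when pairwise similarities are later aggregated across participants.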
Participants were given the following instructions:
“Sometimes people speak similarly because they are friends, have similar occupations or personalities, or grew up in the same place. In this task, we want to know which speakers of New Zealand English you think sound similar to each other, and which speakers you think sound different to one another.
You will be given audio clips from different New Zealand women talking about their experiences of the Christchurch earthquakes.
We would like you to put the people who you think sound similar into groups by moving the coloured circles closer together. We are interested in who you think sound similar based on the way they talk, rather than the things they say.
Before this task, there will be a practice task using audio from musical instruments.
We estimate that both tasks will take about 20 minutes, but you may take as much or as little time as you need.”
3.2. Participants
Participants were recruited online using the same eligibility criteria as Sheard et al. (2025), with a total of 127 eligible participants completing the task.4 All participants are (a) over the age of 18, (b) speakers of New Zealand English, and (c) born in New Zealand or resident in New Zealand since the age of seven. Participants received a $10 e-voucher for completing the task. Based on the completed demographic questionnaires, 75% of participants are women, 18% are men, and the remaining 6% identify as non-binary, agender, gender diverse, or genderqueer. In terms of ethnicity, 75% of participants are Pākehā. The next largest ethnic group is made up of those who identified as both Māori and NZ European (10%), with 2% of participants identifying only as Māori. The remaining participants were of Asian background (NZ Chinese, Japanese, Indian, Nepali, and Filipino) or of unspecified mixed ethnicity. Just over half the participants (53%) are aged 18–29, with roughly a fifth each aged 30–39 (22%) and 40–49 (18%). The remaining 8% are over 50 years old. Two-thirds of participants grew up in the South Island of New Zealand, predominantly in the Canterbury region (48%), with the remaining third growing up in the North Island. While participants grew up in both the South and North Islands, all currently live in the South Island. Almost half currently live in Christchurch (48%), with 8% living in the Canterbury region outside Christchurch and the remainder living in other areas of the South Island.
4. Analysis and results
To address our research questions, we first identify and classify the strategies listeners used to label speaker groups (Section 4.1). Second, we follow Sheard et al. (2025) and apply multidimensional scaling (MDS) to the results of the free classification task to create a perceptual similarity space. Speakers grouped together more frequently in the free classification task are closer to one another within the perceptual similarity space, and speakers grouped together less frequently are further away from one another (Section 4.2). We then use the classified labels to reveal groups of perceptually distinct speakers within the MDS space (Section 4.3). Finally, we test whether these “perceptual groups” are characterised by their covarying vowel patterns, as well as their speech rate and mean pitch (Section 4.4).
4.1. Identifying listener categorisation strategies
We classified the free-text label responses from each participant at three levels of detail. The initial label classifications are listed in the Specific classification column of Table 1. Where a participant used multiple categories within the same label, we classified the label according to the first social category in the label or, if the participant did not use any social categorisation, the first listed category. We then grouped the specific labels into intermediate categories, each covering the different levels of a single dimension (e.g., ‘young,’ ‘middle age’ and ‘old’ all relate to ‘Age’). Finally, we made a broad three-way distinction between labels related to ‘Social Factors’, labels related to ‘Features of Speech’, and labels related to neither social nor speech characteristics.
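The classification rule for multi-category labels can be written as a small function. This is a minimal sketch, not the authors' coding procedure: it assumes each free-text label has already been hand-coded into one or more specific categories from Table 1, listed in the order they appear in the participant's text, and `HIERARCHY` is a hypothetical excerpt of the table rather than the full scheme.

```python
# Hypothetical excerpt of Table 1: specific category -> (intermediate, broad).
HIERARCHY = {
    "Young": ("Age", "Social Factors"),
    "Old": ("Age", "Social Factors"),
    "Low SES": ("SES", "Social Factors"),
    "Fast": ("Speech rate", "Features of speech"),
    "Slow": ("Speech rate", "Features of speech"),
    "Creaky": ("Voice quality", "Features of speech"),
    "Calm": ("Emotions", "Other (Not social and not speech)"),
}

def classify_label(coded_categories):
    """Return (specific, intermediate, broad) for one coded label.

    coded_categories: specific categories found in the label, in text order
    (assumed to be hand-coded beforehand). The first social category wins;
    if there is none, the first listed category is used.
    """
    if not coded_categories:
        raise ValueError("label has no coded categories")
    social = [c for c in coded_categories if HIERARCHY[c][1] == "Social Factors"]
    chosen = social[0] if social else coded_categories[0]
    return (chosen, *HIERARCHY[chosen])

print(classify_label(["Fast", "Old"]))     # ('Old', 'Age', 'Social Factors')
print(classify_label(["Creaky", "Slow"]))  # ('Creaky', 'Voice quality', 'Features of speech')
```

Note that "Fast, old" and "Old, fast" end up in the same class under this rule, whereas "Creaky, slow" and "Slow, creaky" would not.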
Label categories.
| Specific classification | Intermediate classification | Broad classification |
| Young, Middle Age, Old | Age | Social Factors |
| Pākehā, Māori | Ethnicity | |
| Femme, Masculine | Gender | |
| Low SES, Middle SES, High SES | SES | |
| Confident, Friendly, Timid, Resilient, Mental state | Personality | |
| Rural, Region-specific, Pronounced NZ, Average NZ, ‘Proper’ NZ | NZ Accent | |
| Australian, British | Non-NZ regional accent | |
| Nasal, Breathy, Creaky, Other | Voice quality | Features of speech |
| Lisp, Clear articulation, Slurred, Unclear | Articulation | |
| Rising, Falling, Flat, Expressive | Intonation | |
| Informal, Formal | Formality | |
| High pitch, Low pitch, Medium pitch, Quiet | Pitch/volume | |
| Fast, Moderate speed, Slow | Speech rate | |
| Rhotic, Distinctive vowels, Distinctive consonants, distinct, In-between | Vowel/consonant features | |
| Smart, Less smart | Intelligence | Other (Not social and not speech) |
| Calm, Emotional, Negative/Positive emotions | Emotions | |
| Clip Topic, Participant unsure, Circle colours | Not grouped further | |
It is clear from Table 1 that, despite our attempt to encourage participants to make groups of (perceptually) socially similar speakers, not all participants did so. Indeed, a full 50% of the labels related only to features of speech, with 40% of labels including social factors (see supplementary materials, Section 4.1). Listeners also tended to draw on multiple categorisation strategies across the three iterations of the task: the number of groups participants made in a single trial ranged from 1 to 8 (mean = 3.6), while the number of broad label categories participants used in a single trial ranged from 1 to 3 (mean = 1.7). Finally, as Figure 4 displays, while participants referred to a wide range of speech features and social factors in their labels, the most frequently used specific categories relate to speakers’ perceived age, the typicality of their accent, their articulation and their speech rate.
From the distribution of specific and broad label categories, there is clear variability between participants in how they approached the categorisation of ‘similar’ NZE speakers. There are also differences within individual participants: listeners commonly used a combination of broad approaches within the same iteration of the task rather than a single one. The comparatively high number of participants employing labels related to speed and pitch is consistent with other perceptual research documenting the salience of these features to listeners and supports their role in perceptually differentiating between the same NZE speakers in the pairwise rating task (Sheard et al., 2025). Perceived clarity of articulation was also a highly common labelling strategy, as were perceived creakiness and whether a speaker had a ‘lisp.’5
4.2. Applying multidimensional scaling to determine dimensions of perceived social similarity
We then conducted our preregistered analysis, applying multidimensional scaling (MDS) in R (R Core Team, 2024) to the filtered results of the free classification task to create a perceptual similarity space. We excluded participants who used the clip topic as a labelling strategy (n = 10) and who did not place the audio stimuli into groups (n = 1), as both behaviours clearly indicate that the task instructions were not followed. The results from the remaining participants were included (n = 116).
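The exclusion step amounts to a simple filter over participant records. In this sketch the record fields `label_categories` and `grouped_stimuli` are hypothetical, and the synthetic records are constructed only to reproduce the reported counts (10 + 1 exclusions out of 127, leaving 116); they are not the study data.

```python
def filter_participants(participants):
    """Keep participants whose responses follow the task instructions.

    Hypothetical record layout: 'label_categories' is the set of specific
    categories coded from a participant's labels, 'grouped_stimuli' says
    whether they placed any stimuli into groups.
    """
    kept = []
    for p in participants:
        used_clip_topic = "Clip Topic" in p["label_categories"]
        if p["grouped_stimuli"] and not used_clip_topic:
            kept.append(p)
    return kept

# Synthetic records built only to mirror the reported counts (127 -> 116).
records = (
    [{"label_categories": {"Fast", "Clip Topic"}, "grouped_stimuli": True}] * 10
    + [{"label_categories": set(), "grouped_stimuli": False}]
    + [{"label_categories": {"Old"}, "grouped_stimuli": True}] * 116
)
print(len(records), len(filter_participants(records)))  # 127 116
```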
Multidimensional Scaling is a data-reduction technique that is highly suited to perceptual similarity data. It has a strong history of application in analyses of perceived speaker similarity in speech science and perceptual dialectology (e.g., Casserly, 2010; Clopper & Bradlow, 2009; Clopper et al., 2006; Gelfer, 1993; Kreiman et al., 1992; Matsumoto et al., 1973; McDougall, 2013; Murry & Singh, 1980). MDS converts measures of pairwise (dis)similarity between objects into distances within a low-dimensional space, where the relative positioning of each object reflects perceptual similarity (Borg & Groenen, 2005, p. 3). The dimensions of this perceptual space are interpreted as corresponding to the underlying cues driving listeners’ judgements of similarity. Importantly, unlike PCA, where the order of principal components reflects the comparative amount of variance in the data they explain (i.e., PC1 explains more than PC2), the order of the dimensions in MDS is arbitrary (i.e., Dimension 1 is not more important than Dimension 2).
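How well a configuration's distances "reflect" the input dissimilarities is usually summarised by a stress value, which MDS algorithms seek to minimise. The sketch below computes Kruskal's stress-1, sqrt(Σ(d_ij − δ_ij)² / Σ d_ij²), where d_ij is the distance between stimuli i and j in the configuration and δ_ij their dissimilarity. It is a simplified ratio version in which the dissimilarities serve directly as target distances; it does not reproduce the spline MDS fitted with smacof, which first transforms the dissimilarities, so its values are not comparable with the stress smacof reports.

```python
import math
from itertools import combinations

def stress1(dissim, coords):
    """Kruskal's stress-1 between target dissimilarities and configuration distances.

    Ratio version: dissimilarities are used directly as disparities, i.e. no
    spline or monotone transformation is fitted.
    """
    num = den = 0.0
    for i, j in combinations(range(len(coords)), 2):
        d = math.dist(coords[i], coords[j])  # Euclidean distance in the MDS space
        num += (d - dissim[i][j]) ** 2
        den += d ** 2
    return math.sqrt(num / den)

# A 3-4-5 right triangle reproduces its own dissimilarities exactly.
coords = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
dissim = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
print(stress1(dissim, coords))  # 0.0
```

Squashing the triangle (e.g., moving the third point to (0, 3)) makes the distances disagree with the dissimilarities and yields a positive stress.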
As described in Sheard et al. (2025) and Section 4.2 of the supplementary materials, we used smacofSym from the smacof package (Mair et al., 2022) to apply a two-dimensional spline MDS analysis to a 38 × 38 pairwise dissimilarity matrix derived from the filtered results of the Free Classification task. Because participants were not exposed to all speaker pairs, each matrix value is a proportion rather than a count: the proportion of times a given pair of speakers was not placed into the same group. We chose a two-dimensional solution based on two permutation tests (Mair et al., 2022; Wilson Black & Brand, 2021), neither of which indicated that further dimensions were required.
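A sketch of how such a dissimilarity matrix could be derived from free classification responses. The denominator for each pair is assumed (our assumption, not a documented detail) to be the number of trials in which both speakers were heard together; pairs never presented together are left as `None`, since how such cells were handled in the original analysis is not stated here. The function name and data layout are hypothetical.

```python
from itertools import combinations

def dissimilarity_matrix(responses, speakers):
    """Share of co-presentations in which each speaker pair was NOT grouped together.

    responses: one entry per trial (one participant, one task); each trial is a
               list of groups, each group a list of the speaker IDs placed together.
    Returns a symmetric list-of-lists; pairs never heard together are None.
    """
    index = {s: k for k, s in enumerate(speakers)}
    n = len(speakers)
    together = [[0] * n for _ in range(n)]
    presented = [[0] * n for _ in range(n)]
    for trial in responses:
        heard = [s for group in trial for s in group]
        for a, b in combinations(heard, 2):  # every pair heard in this trial
            i, j = index[a], index[b]
            presented[i][j] += 1
            presented[j][i] += 1
        for group in trial:
            for a, b in combinations(group, 2):  # pairs placed in the same group
                i, j = index[a], index[b]
                together[i][j] += 1
                together[j][i] += 1
    dissim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dissim[i][j] = (1 - together[i][j] / presented[i][j]
                                if presented[i][j] else None)
    return dissim

# Two hypothetical trials over speakers A-C; D is never heard.
responses = [
    [["A", "B"], ["C"]],
    [["A"], ["B", "C"]],
]
d = dissimilarity_matrix(responses, ["A", "B", "C", "D"])
print(d[0][1], d[0][2], d[0][3])  # 0.5 1.0 None
```

Using proportions keeps pairs that happened to share a task often from looking more similar merely because they had more chances to be grouped together.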
One of the outputs of smacofSym is a set of coordinates, or scores, for each stimulus in the input data. These scores can be mapped along an x (‘Dimension 1’/‘D1’) and y (‘Dimension 2’/‘D2’) axis to create a perceptual similarity space. Stimuli situated closer to each other within this space were grouped together more often by listeners and are more perceptually similar. Stimuli situated further away from each other were grouped together less often by listeners and are less perceptually similar. It is the relative distance between stimuli that is important for the interpretation of an MDS space. The position of stimuli within the space is otherwise arbitrary; while the stimuli’s D1 or D2 scores may align directly with a given feature in production or with a different MDS space generated from other data, this is not guaranteed.
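Why only relative distances matter can be checked numerically: rotating or translating an MDS configuration (and likewise reflecting it) changes every stimulus's D1 and D2 scores but leaves all pairwise distances, and hence the fit to the dissimilarities, unchanged. The configuration below is invented for illustration.

```python
import math
from itertools import combinations

def pairwise_distances(coords):
    """Euclidean distances for every pair of points, in a fixed pair order."""
    return [math.dist(p, q) for p, q in combinations(coords, 2)]

def rotate_and_shift(coords, theta, shift=(0.0, 0.0)):
    """Rotate points by theta radians about the origin, then translate them."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1]) for x, y in coords]

# Invented D1/D2 scores for four stimuli.
config = [(0.2, -0.5), (1.1, 0.3), (-0.7, 0.8), (0.0, 0.0)]
moved = rotate_and_shift(config, math.radians(37), shift=(2.0, -1.0))

same = all(math.isclose(a, b) for a, b in
           zip(pairwise_distances(config), pairwise_distances(moved)))
print(same)                       # True: relative distances are preserved
print(config[1][0], moved[1][0])  # but the D1 score of the same stimulus differs
```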
As we are building here on the results of Sheard et al. (2025), we made the two MDS spaces comparable by rotating our MDS scores to maximally align with theirs (Figure 5). Specifically, we applied Procrustes rotation (Oksanen et al., 2022) to speakers’ D1 and D2 scores so that they maximally align with speakers’ D1 and D2 scores from the MDS space in Sheard et al. (2025). The MDS spaces from the two tasks are now highly similar: the rotated Free Classification D1 scores and the Pairwise Rating D1 scores are strongly correlated (r = 0.81, p < 0.05), as are the rotated Free Classification D2 scores and the Pairwise Rating D2 scores (r = 0.73, p < 0.05). As such, the rotated scores in Figure 5B are the object of the following analysis, and we explore the similarities between the two spaces further in Section 5.2.
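In two dimensions, Procrustes rotation has a closed form. The sketch below is a simplified stand-in for the vegan routine cited above, not a reproduction of it: it centres both configurations, finds the best rotation with and without a reflection, and keeps whichever fits better. `procrustes_2d` and `pearson` are hypothetical helper names. Scaling is omitted because it changes neither the optimal rotation nor Pearson’s r between matched dimensions.

```python
import math


def _centre(points):
    """Subtract the mean D1 and D2 so both configurations share an origin."""
    mx = sum(x for x, _ in points) / len(points)
    my = sum(y for _, y in points) / len(points)
    return [(x - mx, y - my) for x, y in points]


def procrustes_2d(target, source):
    """Rotate (and, if it fits better, reflect) `source` to best match `target`.

    Both are equal-length lists of (D1, D2) pairs in the same speaker order.
    Returns the centred, rotated source coordinates; no scaling is applied.
    """
    t = _centre(target)
    best_fit, best_coords = None, None
    for flip in (1, -1):  # flip = -1 reflects D2 before rotating
        s = [(x, flip * y) for x, y in _centre(source)]
        # For rotation angle theta, the summed dot product with the target is
        # a*cos(theta) + b*sin(theta), which is maximised at atan2(b, a).
        a = sum(tx * sx + ty * sy for (tx, ty), (sx, sy) in zip(t, s))
        b = sum(ty * sx - tx * sy for (tx, ty), (sx, sy) in zip(t, s))
        fit = math.hypot(a, b)
        if best_fit is None or fit > best_fit:
            theta = math.atan2(b, a)
            c, sn = math.cos(theta), math.sin(theta)
            best_fit = fit
            best_coords = [(x * c - y * sn, x * sn + y * c) for x, y in s]
    return best_coords


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

After rotation, matched dimensions can be compared directly, e.g. `pearson([p[0] for p in reference], [p[0] for p in procrustes_2d(reference, new_scores)])` for D1, where `reference` and `new_scores` are hypothetical lists of (D1, D2) pairs.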
(A) The MDS output from Sheard et al. (2025) based on pairwise ratings and (B) the Free Classification MDS output rotated to maximally align with the space in (A).
4.3. Principal Component Analysis to determine relationships between MDS dimensions and perceived speaker groups
We now have our dimensions of perceptual similarity from the MDS and have rotated the space so that D1 and D2 align with the MDS space in Sheard et al. (2025). The MDS does not, however, tell us how listeners differentiate between speakers along these dimensions. When data on listener strategies are not available, the linguistic variables that differentiate between speakers from the side of production (e.g., speakers in one area of the space are faster than speakers in the opposite area of the space) are assumed to be perceptually meaningful to listeners (as in Sheard et al., 2025). The results from the Free Classification task, however, provide a means of identifying both (a) which speakers are most perceptually distinctive, and (b) which labels listeners used to describe them.
Here, we use the categories from Section 4.1 to identify which labels increase or decrease in use as speakers’ rotated D1/D2 scores increase or decrease, by implementing a separate ‘Label’ PCA as described in Section 4.3 of the supplementary materials. We note that neither this PCA nor any other analysis of the label data was preregistered. The input data for the Label PCA were speakers’ rotated D1/D2 scores, along with the proportion of times that each specific label category was used to describe each stimulus (see first column in Table 1; categories used by <10% of total experiment participants were excluded).6
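A minimal sketch of how the Label PCA input could be assembled is given below, assuming a hypothetical record format in which each participant-made group carries the set of label categories coded from its label. A speaker’s proportion for a category is taken here to be the share of the groups containing that speaker whose label was coded with that category; this denominator is an assumption. The 10% participant threshold follows the text, and the rotated D1/D2 scores would then be added as further columns.

```python
from collections import defaultdict


def label_proportions(groups, n_participants, min_share=0.10):
    """Per-speaker label-category proportions (hypothetical input format).

    groups: iterable of (participant_id, speaker_ids, categories) tuples, one
        per group a participant created; `categories` is the set of label
        categories coded from that group's label.
    Categories used by fewer than `min_share` of all participants are dropped.
    Returns {speaker: {category: share of that speaker's groups with it}}.
    """
    users = defaultdict(set)                        # category -> participants using it
    appearances = defaultdict(int)                  # speaker -> groups containing them
    counts = defaultdict(lambda: defaultdict(int))  # speaker -> category -> count
    for participant, speakers, categories in groups:
        for category in categories:
            users[category].add(participant)
        for speaker in speakers:
            appearances[speaker] += 1
            for category in categories:
                counts[speaker][category] += 1
    kept = sorted(c for c, who in users.items()
                  if len(who) >= min_share * n_participants)
    return {speaker: {c: counts[speaker][c] / n for c in kept}
            for speaker, n in appearances.items()}
```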
Unlike MDS, PCA does not require researchers to specify the number of dimensions in advance, but we do need to decide how many Principal Components (PCs) to consider when interpreting its results. Following the approach to PCA outlined in Wilson Black et al. (2023), we used a permutation and bootstrapping procedure to compare the variance explained by each PC in bootstrapped versus randomised data. This comparison reveals that, from PC3 onwards, the PCs of the bootstrapped data do not explain more variance than would be expected by chance. As with the MDS, we therefore retain two PCs, referred to here as Label PC1 and Label PC2. The output of PCA has two relevant components. First, each stimulus has a Label PC1 score and a Label PC2 score. Second, a subset of the input variables (i.e., speakers’ rotated MDS D1 and D2 scores and the label proportions) is loaded onto each Label PC. Each variable is loaded either positively or negatively: the value of a positively loaded variable increases as a speaker’s PC score increases, while the value of a negatively loaded variable decreases as a speaker’s PC score increases.
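The logic of comparing observed variance explained with a chance baseline can be sketched in pure Python. This is a simplification of the cited procedure, not a reproduction of it: it compares the observed share of variance for each PC with the 95th percentile of shares from column-wise permuted data and omits the bootstrapping step. Eigenvalues of the correlation matrix are computed with the cyclic Jacobi method; all function names and the 95% level are assumptions for illustration.

```python
import math
import random


def correlation_matrix(cols):
    """Pearson correlation matrix of equal-length columns (none constant)."""
    n = len(cols[0])
    z = []
    for col in cols:
        m = sum(col) / n
        sd = math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        z.append([(v - m) / sd for v in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def jacobi_eigenvalues(mat, max_sweeps=50, tol=1e-20):
    """Eigenvalues of a symmetric matrix, largest first, by cyclic Jacobi rotations."""
    a = [row[:] for row in mat]
    p = len(a)
    for _ in range(max_sweeps):
        if sum(a[i][j] ** 2 for i in range(p) for j in range(p) if i != j) < tol:
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if abs(a[i][j]) < 1e-15:
                    continue
                # Rotation angle chosen so that the (i, j) entry becomes zero.
                theta = (a[j][j] - a[i][i]) / (2 * a[i][j])
                t = (1.0 if theta >= 0 else -1.0) / (abs(theta) + math.sqrt(theta * theta + 1))
                c = 1 / math.sqrt(t * t + 1)
                s = t * c
                for k in range(p):  # rotate columns i and j
                    aki, akj = a[k][i], a[k][j]
                    a[k][i], a[k][j] = c * aki - s * akj, s * aki + c * akj
                for k in range(p):  # rotate rows i and j
                    aik, ajk = a[i][k], a[j][k]
                    a[i][k], a[j][k] = c * aik - s * ajk, s * aik + c * ajk
    return sorted((a[i][i] for i in range(p)), reverse=True)


def variance_explained(cols):
    """Share of total variance explained by each PC of standardised data."""
    eig = jacobi_eigenvalues(correlation_matrix(cols))
    total = sum(eig)
    return [e / total for e in eig]


def n_pcs_to_retain(cols, n_perm=200, level=0.95, seed=1):
    """Count leading PCs whose share of variance beats the permutation null."""
    rng = random.Random(seed)
    observed = variance_explained(cols)
    # Shuffling each column independently destroys correlations between variables.
    null = [variance_explained([rng.sample(col, len(col)) for col in cols])
            for _ in range(n_perm)]
    keep = 0
    for i, share in enumerate(observed):
        cutoff = sorted(run[i] for run in null)[int(level * (n_perm - 1))]
        if share <= cutoff:
            break
        keep += 1
    return keep, observed
```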
Figure 6A and Figure 6B depict the variables loaded onto Label PC1 and Label PC2, respectively. A ‘+’ indicates that a variable is positively loaded on the PC, while a ‘–’ indicates that the variable is negatively loaded on the PC. The figure also represents the estimated randomised/permuted (blue) and bootstrapped (red) distribution of the loadings for each variable. If an index loading (+/– sign) falls above the 90% confidence band for the null distribution (i.e., above the blue line), then we consider it to meaningfully contribute to the PC.7 Rotated Dimension 1 and Dimension 2 scores are split across PC1 (Figure 6A) and PC2 (Figure 6B), respectively. D1 scores are positively loaded onto PC1, and D2 scores are negatively loaded onto PC2. This means that speakers with a positive Label PC1 score will have a positive/higher D1 score, and speakers with a positive Label PC2 score will have a negative/lower D2 score in the MDS space (and the reverse for speakers with negative Label PC scores).
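The threshold rule can be sketched as follows. For simplicity, the sketch compares absolute loadings with the 90th percentile of absolute loadings from permuted data, rather than the index loadings used in the text; taking absolute values sidesteps the arbitrary sign of a PC in each permuted data set. The input format and the name `flag_loadings` are hypothetical.

```python
def flag_loadings(observed, null_runs, level=0.90):
    """Flag variables whose absolute loading exceeds the null quantile.

    observed: {variable: loading} on one PC of the real data.
    null_runs: list of {variable: loading} dicts, one per permuted data set.
    Returns {variable: '+' or '-'} for variables above the threshold, using
    the sign of the observed loading.
    """
    flagged = {}
    for var, loading in observed.items():
        # Absolute values, because a PC's sign is arbitrary in each permutation.
        null_abs = sorted(abs(run[var]) for run in null_runs)
        cutoff = null_abs[int(level * (len(null_abs) - 1))]
        if abs(loading) > cutoff:
            flagged[var] = "+" if loading > 0 else "-"
    return flagged


null = [{"D1": x, "Slow": x, "Quiet": x}
        for x in (0.05, -0.1, 0.2, -0.15, 0.3, 0.12, -0.25, 0.08, 0.18, -0.22)]
print(flag_loadings({"D1": 0.8, "Slow": -0.5, "Quiet": 0.1}, null))
# {'D1': '+', 'Slow': '-'}
```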
Moving to listener labels, the following label classifications also meet our 90% confidence band threshold for Label PC1: NZ rural (accent); Low, Middle, and High socioeconomic status (SES); Strong NZ (accent); Distinctive vowels; Quiet; Clear articulation; and Creaky. Notably, the NZ rural, Low SES, Strong NZ, Distinctive vowels and Creaky labels are positively loaded (with a + sign in Figure 6A), while the High/Middle SES, Quiet and Clear articulation labels are negatively loaded (with a – sign in Figure 6A). This means that listeners describe speakers with positive Label PC1 scores (and higher rotated D1 scores) less frequently with labels related to higher socioeconomic status and clear speech, and more frequently with labels related to low socioeconomic status, rural and stronger NZ accents, and creak. We therefore refer to these speakers as perceptually low SES (noting that, in the NZ context, speakers from rural areas may sound distinct from urban speakers and have high incomes).
(A) Variable Loadings for Principal Component 1; (B) Variable Loadings for Principal Component 2; (C) Perceptually Low and High SES speakers whose PC1 scores are 0.5 (low alpha) or 1 (high alpha) standard deviation above (+) or below (–) the mean; and (D) Perceptually Slow/Low pitch and Fast/High pitch speakers whose PC2 scores are 0.5 (low alpha) or 1 (high alpha) standard deviation above (+) or below (–) the mean.
Correspondingly, listeners describe speakers with negative Label PC1 scores (and lower rotated D1 scores) more frequently with labels related to higher socioeconomic status and clear speech, and less frequently with labels related to broader accents and creak. We therefore refer to such speakers as perceptually high SES. While there are some labels related to speech features (namely creak and clear articulation) on Label PC1, this PC is driven by labels related to perceived accent strength and speaker socioeconomic status.
Label PC2 in Figure 6B is instead driven by labels related to perceived speed and pitch: labels for fast and slow speech rates and for high, middle and low pitch meet our 90% confidence band threshold for PC2, with labels related to Old, NZ Proper (accent) and flat tone also just meeting it. Slower speech rates and lower pitch are positively loaded onto Label PC2, meaning that speakers with positive Label PC2 scores (i.e., lower/negative rotated D2 scores) are described more frequently by listeners with labels related to slower speech, lower pitch and old age, and less frequently with labels related to fast speech and higher pitch. We therefore refer to such speakers as perceptually slow/low pitch. Correspondingly, listeners describe speakers with negative Label PC2 scores (i.e., higher/positive rotated D2 scores) more frequently with labels related to faster speech rates and high and medium pitch, and less frequently with the positively loaded labels. We therefore refer to these speakers as perceptually fast/high pitch.
Figure 6C highlights the perceptually high and low SES speaker groups within the rotated perceptual space in Figure 5B, and Figure 6D highlights the perceptually slow/low pitch and fast/high pitch speaker groups. Specifically, speakers whose Label PC1 or PC2 scores are more than 0.5 (lighter shade) or 1 (darker shade) standard deviation above or below the mean PC score are encircled. Perceptually low SES speakers (+ Label PC1 and D1 scores, low SES labels) are represented by yellow squares, and perceptually high SES speakers (– Label PC1 and D1 scores, high SES labels) by purple circles. Similarly, perceptually slow/low pitch speakers (+ Label PC2 scores, – D2 scores, slow speech rate labels) and perceptually fast/high pitch speakers (– Label PC2 scores, + D2 scores, fast labels) are represented by blue upside-down triangles and red diamonds, respectively. In other words, listeners tended to classify speakers in the yellow groups (squares) as sounding more rural and lower socioeconomic status, speakers in the purple groups (circles) as higher socioeconomic status, speakers in the red groups (diamonds) as faster and higher pitch, and speakers in the blue groups (upside-down triangles) as slower and lower pitch, with the darker shapes representing the speakers with the highest and lowest Label PC1/PC2 scores (i.e., the most perceptually low SES/high SES/slow/fast).
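The shading rule can be expressed as a small function that standardises Label PC scores and bins speakers by their distance from the mean. It assumes the sample standard deviation and uses hypothetical group and shade names; speakers within 0.5 SD of the mean are left unassigned.

```python
from statistics import mean, stdev


def perceptual_groups(pc_scores, positive_label, negative_label):
    """Bin speakers by how far their Label PC score lies from the mean.

    pc_scores: {speaker: Label PC score}.
    Returns {speaker: (group, shade)}, where shade is "darker" beyond 1 SD
    and "lighter" beyond 0.5 SD; speakers within 0.5 SD are omitted.
    """
    values = list(pc_scores.values())
    m, sd = mean(values), stdev(values)  # sample SD (an assumption)
    groups = {}
    for speaker, score in pc_scores.items():
        z = (score - m) / sd
        if abs(z) <= 0.5:
            continue
        group = positive_label if z > 0 else negative_label
        groups[speaker] = (group, "darker" if abs(z) > 1 else "lighter")
    return groups
```

For Label PC1, positive scores would be passed with the label "perceptually low SES" and negative scores with "perceptually high SES"; for Label PC2, "perceptually slow/low pitch" and "perceptually fast/high pitch".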
To summarise, we first categorised listener labels. Second, we constructed a two-dimensional perceptual similarity space by means of MDS, in which speakers whom listeners grouped together more frequently are situated closer together (and vice versa). We rotated the perceptual space so that it was maximally aligned with the space in Sheard et al. (2025). We then applied PCA to the label category proportions and rotated D1/D2 scores for each speaker. The label PCA captured a distinction between primarily socially oriented (PC1) and speech-oriented (PC2) approaches to labelling speakers, with D1 scores loaded onto the former and D2 scores onto the latter. Finally, using speaker Label PC1 and PC2 scores, we mapped four perceptual groups within the MDS space: lower SES, higher SES, slow (and low pitch), and fast (and high pitch) speakers. The results of the label PCA therefore point to a potentially socially meaningful distinction between speakers who are differentiated along D1, which we explore further in the next section.
4.4. Connecting perception to production: Mapping variables to perceptual space
Which linguistic variables characterise the perceptual groups established in the previous section? Replicating the procedure described in Sheard et al. (2025) and in Section 4.4 of the supplementary materials, we fit two regression trees with stimulus D1 and D2 scores as the dependent variables and the four independent variables below:
Stimulus articulation rate, operationalised as the total number of canonical syllables in the stimulus divided by its total phonation time (based on the CELEX dictionary and forced-alignment of the QB monologue in LaBB-CAT)8
Stimulus mean pitch, as extracted manually from Praat (Boersma, 2001)
The speaker’s position on the leader-lagger continuum (i.e., how far ahead or behind they are in the changes for fleece, dress, kit, trap, nurse, strut, goose, as represented by their QB1 Principal Component 1 score from Hurring et al. (2025) – see Figure 7A)
The speaker’s back-vowel configuration (i.e., the relative positioning of their start, thought and lot vowels in the vowel space, as represented by their QB1 Principal Component 2 score from Hurring et al. (2025) – see Figure 7B).
To be explicit, the independent variables in this analysis are the same measurements as those in Sheard et al. (2025); the analyses are identical except for the dependent variables from the new MDS space. As such, the independent variables are here also scaled across the 38 speakers, and speakers’ leader-lagger and back-vowel configuration scores from Hurring et al. (2025) are based on the monophthongs they produced across their entire QuakeBox monologue, not just within the stimuli. We also note that we preregistered a series of pairwise correlations between speakers’ D1 and D2 scores and the independent variables rather than regression trees. Section 5 of the supplementary materials reports the preregistered correlations along with the results of random forests assessing the stability of the importance of the independent variables in predicting D1 and D2 (Section 4.4).
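For concreteness, the sketch below computes articulation rate as defined in the list above (canonical syllables divided by phonation time) and then z-scales a predictor across speakers. Syllable counts and phonation times are assumed to come from an upstream alignment step, and the use of the sample standard deviation for scaling is an assumption.

```python
from statistics import mean, stdev


def articulation_rate(canonical_syllables, phonation_time_s):
    """Canonical syllables per second of phonation time (pauses excluded)."""
    return canonical_syllables / phonation_time_s


def z_scale(by_speaker):
    """Centre and scale {speaker: value} across speakers (sample SD assumed)."""
    values = list(by_speaker.values())
    m, sd = mean(values), stdev(values)
    return {speaker: (v - m) / sd for speaker, v in by_speaker.items()}


rates = {
    "s1": articulation_rate(20, 4.0),  # 5.0 syllables per second
    "s2": articulation_rate(30, 5.0),  # 6.0
    "s3": articulation_rate(21, 3.0),  # 7.0
}
print(z_scale(rates))  # {'s1': -1.0, 's2': 0.0, 's3': 1.0}
```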
(A) The Leader-Lagger continuum and (B) back-vowel configuration from Hurring et al. (2025; supplementary materials). Speakers with a high leader-lagger score are in the green areas of the distributions shown in (A) and thus are leaders in a set of ongoing sound changes. Speakers with a high back-vowel-configuration score are in the green areas of the distributions shown in (B).
Figure 8A and Figure 8B display the results of the regression trees predicting D1 and D2, respectively. For each node in a tree, we can see the estimated value of the dependent variable and the proportion and number of observations represented in the node. The if-else statements from regression trees provide cut-off values for the relevant independent variables in predicting the dependent variable, which makes them well suited to identifying distinct groups of speakers based on measures of their speech production. We can then map the identified production groups onto our perceptual groups (e.g., are speakers above a particular leader-lagger score cut-off concentrated in the perceptually low SES group?).
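Each node in a regression tree of this kind chooses the predictor and cut-off that most reduce squared error in the dependent variable. The sketch below finds that single best split for one or more predictors; it is a simplified, CART-style illustration with a hypothetical minimum leaf size, not necessarily the implementation or settings used in the analysis.

```python
def best_split(x, y, min_leaf=5):
    """Least-squares cut-off on one predictor, as in a CART-style regression tree.

    Observations with x below the returned cut-off go to the left node.
    Returns (cutoff, left_mean, right_mean, sse), or None if no split leaves
    at least `min_leaf` observations on each side.
    """
    min_leaf = max(min_leaf, 1)
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best = None
    for i in range(min_leaf, n - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot separate tied predictor values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[3]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm, sse)
    return best


def best_predictor(predictors, y, min_leaf=5):
    """Pick the predictor (name -> values) whose best split has the lowest SSE."""
    splits = {name: best_split(vals, y, min_leaf) for name, vals in predictors.items()}
    splits = {name: s for name, s in splits.items() if s is not None}
    name = min(splits, key=lambda k: splits[k][3])
    return name, splits[name]
```

A full tree repeats this search recursively within each resulting node, which is how a second predictor (such as articulation rate within the leaders) can appear below the first split.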
The most important of the four predictors of D1 is a speaker’s position on the leader-lagger continuum. Laggers are estimated to have a lower D1 score. There is a mediating effect of speed within the leaders: slow leaders, specifically, are estimated to have a high D1 score, while fast leaders are not. As in Sheard et al. (2025), articulation rate is the most important predictor of D2; slower speakers are estimated to have a lower D2 score. Pitch also plays a mediating role, in that the speakers estimated to have the highest D2 scores are both fast and higher pitched. The results of the random forest procedures provide additional evidence for the importance of pitch, speed and the leader-lagger continuum in predicting D1 and D2. The analyses therefore indicate that all three variables contribute to the structural relationships between speakers who are perceived to sound similar to and different from each other.
How do these results relate to our four perceptual groups? Figure 8A maps the cutoff values in the regression tree predicting D1 onto the rotated D1 and D2 coordinates, with the laggers in purple circles and the slow and fast leaders (with a leader-lagger score below/above –0.26) in yellow squares and orange triangles, respectively. Figure 8A also shows where the leader and lagger groups are situated relative to the perceptually low SES and perceptually high SES speaker groups, as determined by speakers’ Label PC1 scores. In general, most of the slow leaders are members of the perceptually low SES group in yellow (with some fast leaders), while most members of the perceptually high SES group are laggers. In other words, perceptually low SES speakers tend to be slow leaders, while perceptually high SES speakers tend to be laggers. There are, however, laggers not in the perceptually high SES group, and most fast leaders are not members of the perceptually low SES group.
(A) The regression tree predicting D1 and its output mapped onto the perceptual similarity space. Speakers whose label PC1 is <0.5 SD +/– the mean have a lower alpha. (B) The regression tree predicting D2 and its output mapped onto the perceptual similarity space. Speakers whose label PC2 is <0.5 SD +/– the mean have a lower alpha. (C) Overall interpretation of the perceptual space based on the label PCA and regression tree results.
Figure 8B maps the cutoff values in the regression tree predicting D2 onto the rotated D1 and D2 coordinates, with slow speakers (here, those with an articulation rate below –0.11, blue asterisks) concentrated in the bottom of the space, fast high pitch speakers (with a mean pitch above 0.15, red diamonds) concentrated in the top left of the space, and fast and low pitch speakers in between (orange triangles). Figure 8B also shows where the slow and fast speakers are situated relative to the perceptually slow/low pitch and perceptually fast/high pitch groups as determined by speakers’ Label PC2 scores. The fast and high pitch speakers are, indeed, concentrated in the perceptually fast group in red, while most of the slow speakers are concentrated in the perceptually slow group in blue. In other words, the perceptually slower and lower pitched speakers tend to be slower and lower pitched while perceptually faster and higher pitched speakers tend to be higher pitched and fast. Fast and low pitch speakers tend to occupy space in between these two perceptual areas.
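Read as if-else rules, the D2 tree described above assigns speakers to production groups as sketched below, using the reported cut-offs on the scaled predictors. The function name and example speakers are hypothetical, and how values exactly at a cut-off are handled is an assumption.

```python
def d2_production_group(rate_z, pitch_z, rate_cutoff=-0.11, pitch_cutoff=0.15):
    """Assign a speaker to a D2 production group from scaled rate and pitch.

    Cut-offs are those reported for the D2 tree; treating values exactly at
    a cut-off as 'not below' / 'not above' is an assumption.
    """
    if rate_z < rate_cutoff:
        return "slow"
    if pitch_z > pitch_cutoff:
        return "fast, high pitch"
    return "fast, low pitch"


speakers = {"A": (-0.8, 0.4), "B": (0.6, 1.2), "C": (0.3, -0.5)}  # hypothetical
for name, (rate, pitch) in speakers.items():
    print(name, d2_production_group(rate, pitch))
```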
We now build on the separate analyses of the two dimensions for an overall interpretation of the two-dimensional space in Figure 8C. Taking the cutoffs for speed, pitch, and the leader-lagger continuum from the regression trees, there are four main production groups characterised by differences in speaker production:
Fast and/or higher pitched speakers (red diamonds)
Slow and/or lower pitched speakers (blue asterisks)
Laggers (purple circles)
Slow leaders (yellow squares)
Notably, unlike the groups above, the fast leaders (orange triangles) are not consistently together; some fast leaders are grouped more often with the slow leaders, others with the fast and high pitch speakers, including fast laggers. This is consistent with Sheard et al. (2025) and provides further evidence for the fast leaders being less perceptually distinct than the other groups. We can now map the four main production groups onto our four perceptual groups characterised by differences in speaker perception:
Faster/higher pitched speakers are concentrated in the perceptually fast/high pitch group (red diamonds, top)
Slower/lower pitched speakers are concentrated in the perceptually slow/low pitch group (blue asterisks, bottom)
Laggers are concentrated in the perceptually high SES group (purple circles, left)
(Slow) leaders are concentrated in the perceptually low SES group (yellow squares, right).
Listeners appear, therefore, to perceive slower/lower pitch and faster/higher pitch speakers as such. The results also point to a socially-driven perceptual distinction between leaders with slower speech rates and laggers; the former are perceptually rural, lower SES and broader, while the latter are perceptually higher SES and with clearer articulation (despite there being both fast and slow speakers in this group).
The combined interpretation of the perceptual and production speaker groups in the MDS space also provides insight into some of the additional labels loaded onto PC1 and PC2 from the label PCA. For example, most speakers in the perceptually low SES group (with higher D1 scores) are not only slower but also lower pitched, and are included in the wider perceptually slow/low pitch group. Faster and higher pitched speakers are, in contrast, concentrated in the top left (and therefore have both higher D2 and lower D1 scores). It is, therefore, reasonable for the creaky labels to be loaded with the perceptually low SES speakers because of this contrast (although it remains possible that perceived creakiness is, in and of itself, associated with being more rural/lower SES). Similarly, most perceptually slow/low pitch speakers are concentrated in the bottom right of the space, with as many laggers as leaders within this group. It is understandable for the slower and lower pitched speakers to be associated more with labels related to older age.
To summarise, we applied regression trees to speakers’ rotated D1 and D2 scores to see if we could meaningfully interpret the perceptual groups based on their covarying vowel patterns, or their speed and pitch. The analysis indicates a general alignment between perceived and actual speaker speed and pitch. The perceptually low SES and high SES groups are also characterised by speakers with different covarying vowel patterns. Listeners perceive (slow) leaders in change as being lower SES, more rural, and having more distinctive NZ accents, while they perceive laggers as being of higher socioeconomic status. However, there is not a 1:1 relationship between a speaker’s position on the leader-lagger continuum and the associated social meanings. Some fast leaders are perceived to sound more like other slower leaders, others more like fast laggers. Laggers, whether fast or slow, are also less differentiated along the social dimension than slow and fast leaders are (i.e., all laggers are on the left of the space, along with faster and higher pitched leaders).
5. Discussion
We originally asked (1) whether NZE listener judgements support a social interpretation of the relationships between covarying vowels, and (2) whether listeners are interpreting articulation rate or pitch socially, either in isolation or in interaction with other characteristics. We now consider these questions in light of our results, before comparing our findings with previous results and considering the limitations and future directions of the study.
5.1. Acoustic and sociolinguistic salience in listener perception
5.1.1. Is there a social basis for NZE vowel clusters?
Our first research question asks whether listener judgements support a structural or social interpretation of the relationship between covarying NZE vowels, or a combination of the two. The results from this task do not support a social interpretation of the back-vowel configuration and indicate that this cluster may primarily capture structural, tandem shifts in the back vowel space. The task results do, however, provide evidence that the relationships between vowels in the leader-lagger continuum are interpreted socially.
As with Sheard et al. (2025), the leader-lagger score emerged as an important predictor of speakers’ position along the first dimension of our perceptual similarity space, albeit with a different position in the regression tree.9 In this analysis, the leader-lagger score was represented in the first node of the tree, with speaker articulation rate represented in the second node (Figure 8A). The regression tree indicates listeners labelled slow leaders more often as rural and lower socioeconomic status. Listeners labelled laggers in this cluster of vowel changes more often as higher socioeconomic status and having clearer articulation (Figure 8A). This points to speed mediating the perceived socioeconomic status of leaders and laggers. We note that the same nodes emerge even when we restrict our analysis to cases where the participant has labelled the group using social labels (see supplementary materials Section 7), so this is not merely an artefact of combining different labelling strategies in a single analysis.
As such, the analysis provides evidence that leaders are socially evaluated differently to laggers. By extension, the findings suggest the leader-lagger vowel cluster does not solely capture phonologically driven parallel sound changes but also socio-stylistic differentiation in the community. We found that leaders (especially leaders with slower speech rates) are perceived as lower socioeconomic status or rural, and laggers are perceived as having higher social status. The perceptual differentiation between leaders and laggers aligns with Labov’s (2001) suggestion that sound changes which originate below the level of conscious awareness can attain social meaning and are often led from below or from the middle of the social hierarchy (i.e., upper working-class/lower middle-class speakers).
While speculative, the contrast in ongoing directions of sound change within the back-vowel configuration and leader-lagger continuum may reflect the greater social salience of the latter cluster. Namely, sound change reversals have been recently observed for the formant measurements of NZE trap, dress, nurse, and the trajectories of fleece and goose in Auckland (Ross, 2024; Ross et al., 2022), and for formant measurements of leader-lagger vowels in QuakeBox data (Flego et al., 2025; Hurring et al., 2025, p. 7), including dress, kit, nurse, fleece and goose. The production of leader-lagger vowels has started to shift away from the variants most associated with a lack of education, accent broadness and lower socioeconomic status. Despite such reversals, the stability of the leader-lagger continuum in QuakeBox data is further indication that the cluster is not tied solely to parallel sound changes but to social differentiation between speakers within such changes. The back vowels, by contrast, continue to exhibit some changes in real and apparent time in QuakeBox data but not the sound change reversals observed for the leader-lagger vowels (and are stable in Ross, 2024). It is, therefore, possible that the manifestation of the different tandem vowel shifts captured in different vowel clusters is affected by their (lack of) associated social meanings.
It is important to note that although the results support a social interpretation of the relationships between leader-lagger vowels, being a leader or lagger does not mean listeners will automatically label a speaker with a certain socioeconomic status. The descriptive analysis of the label data revealed that listeners used a wide range of strategies when we did not specify the (perceived) social characteristics we wished them to use (Figure 4, Table 1). Listeners also used a combination of social characteristics (e.g., socioeconomic status and rurality) as well as features of speech (e.g., pitch, speed, and voice quality) to label their speaker groups. This is broadly consistent with perceptual research on listener identification of the regional origin of speakers in the United States that did not specify social characteristics for listeners (e.g., Clopper & Bradlow, 2009; Clopper & Pisoni, 2007). Listeners displayed substantial variation in the number and scope of categories used in classifying speakers, even within the constraint of regional variation.
The associations between covarying NZE vowel patterns and speakers are, therefore, not necessarily one-to-one.10 An underlying assumption of work within the variationist paradigm is that the systematic variation documented in speech production is accessible and socially meaningful to listeners (e.g., D’Onofrio, 2020; Drager, 2018; Labov et al., 2011; Thomas, 2002). While there is substantial evidence for this being the case, our understanding of the contexts in which certain (combinations of) variables become relevant to listener perceptions or associations is less developed. Here, fast and high pitch speakers are perceptually most similar to each other regardless of their covarying vowel patterns, and some fast leaders are perceived to be as similar to fast laggers (i.e., as higher SES) as they are to slow leaders (i.e., as lower SES). It would instead appear that the social relevance of being a leader or lagger emerges in context and is mediated by other perceptually salient features (e.g., pitch and speech rate).
One interpretation of this result is that listeners hear the distinctive vowels of the leaders better in slower speech: longer vowel durations may give listeners more time to process the social meanings associated with the leaders of this cluster of changes. Slow leaders may, therefore, be particularly distinct relative to the laggers, whose vowel realisations we would consider more conservative and deviating less from the standard. Standard varieties are typically constructed as unmarked or neutral in sociolinguistic terms (e.g., Lippi-Green, 1997). This perceived neutrality often reduces their salience, making them less noticeable compared to non-standard forms. Speakers of the standard (i.e., laggers) may therefore be less perceptually salient here, regardless of their speech rate, because their speech aligns with listeners’ expectations, while the slow leaders contrast perceptually with both laggers and faster speakers.
The results ultimately highlight the variability of listener perception and underscore the value of considering the perceived social meaning of (co)variation without imposing categories onto listeners.
5.1.2. Are articulation rate and pitch “social”?
Although the consideration of speech rate (and pauses) in sociophonetics remains “somewhat peripheral” (Kendall, 2023, p. 56), previous research has documented both pitch and speech rate as salient acoustic features to listeners (e.g., Harnsberger et al., 2008; Koreman, 2006; Waller et al., 2015), even when assessing characteristics of speakers from different English varieties. For example, the paralinguistic differences (pitch variation and speech rate) between a NZE and an American English speaker had a greater impact on their perceived social attractiveness and competence than differences in their regional accents (Ray & Zahn, 1999). In the analysis of D2, we saw that listeners are accurately labelling higher pitch speakers as high pitch, and faster speakers as fast. We see an overall effect of speed and, for faster speakers, an additional effect of pitch. Label PC2 from the Label PCA suggests that this categorisation is largely a speech similarity categorisation, with labels relating specifically to speed and pitch.
While ‘Old’ is weakly loaded onto Label PC2, such that slower, lower pitch speakers are heard as older, the relationship between age labels and speed and pitch is not upheld when we restrict our analysis to only the data where social labels are used (see Section 7 of the supplementary materials). Specifically, we see a split between the use of labels related to age and to socioeconomic status (i.e., listeners’ use of labels related to old and young age are more systematically related to each other than to their use of labels related to socioeconomic status). And, while leader-lagger scores still predict perceived socioeconomic status, again mediated by articulation rate, within this data subset perceived age is still not predicted by articulation rate mediated by pitch. This suggests that listener differentiation of speakers along D2 is driven by the perceived acoustic properties of the speech (i.e., speakers’ speed and pitch) and not by their overt association with other social characteristics (i.e., speaker age).
The unclear role of perceived age in the perceptual similarity space, despite labels related to older age being the most frequent social label used to describe speaker groups (Figure 4), suggests that, while participants frequently mentioned age in the names of their groups, there was not a systematic relationship between this label and the structure of the groups associated with it. In other words, different participants may have used different cues to make age-based judgements. Alternatively, the loading of ‘older’ labels onto Label PC2 may reflect articulation rate being one of the cues informing perceived age for some listeners alongside other, potentially stylistic, features of speech that are related to, but distinct from, the specific measures of articulation rate and pitch considered in this analysis (e.g., use of pauses). Explorations of the data in Section 7 of the supplementary materials indicate that speakers associated with older labels do tend to either speak slowly or have a high use of pauses, relative to speakers associated with younger labels, who tend to have lower use of pauses. But this relationship is not as strong or stable as the relationship between the leader-lagger vowels and perceived socioeconomic status. There is, therefore, only some post-hoc evidence that listeners interpret speaker articulation rate socially, alongside other cues.
It is important to note that all speakers were born in the same decade, and robust perceptual differences are less likely to have been observed than if listeners had compared them with much younger or older speakers (i.e., speed and pitch may be more social if there is a stronger social contrast between speakers). Manual inspection of the labels also reveals that in some cases, participants used ‘old age’ as the first social label (and therefore the primary classifier of the group) and then used other acoustic or social cues (e.g., ‘older, fast’ and ‘older, posher’) to differentiate the speakers. The absence of Age as a robust cue in the Label PCA, and the lack of clear associations between Age labels and our subset of independent variables in the social subset of our data, may reflect the somewhat homogeneous age sample represented by the audio stimuli.
As discussed in the previous section, articulation rate mediated the impact of the leader-lagger vowels on perceived socioeconomic status. An alternative explanation to slower leaders simply being more salient, or an additional contributing factor, is therefore that listeners socially interpret leaders differently at different speeds (i.e., listeners perceive fast leaders as sounding higher SES than slow leaders). The available evidence indicates that the differentiation between speakers based on speed and pitch along D2 reflects primarily acoustic rather than sociolinguistic perceptual salience, while articulation rate may mediate the impact of the leader-lagger continuum on speakers’ perceived socioeconomic status. Future work is needed to test these two possible explanations of the relationship between the leader-lagger vowels and the broader speech context.
Regardless of whether they are ‘social,’ speaker articulation rate and pitch are evidently salient to listeners and are used to evaluate speaker similarity across the pairwise and free classification tasks. We therefore make the methodological point that when we ask listeners to rate speaker similarity, or to respond in other ways to characteristics of voices, using stimuli that are not highly controlled, listeners may rely on non-segmental cues more than on the variables that are of interest to sociolinguists. As in Sheard et al. (2025), our findings suggest that sociolinguistic perceptual work may be strengthened by actively considering the role of paralinguistic features such as speech timing and pitch in mediating listener perception of sociolinguistic variables.
5.2. Comparison with the pairwise rating task
Our experiment replicates the result from Sheard et al. (2025) that leaders are distinct from laggers within a two-dimensional perceptual similarity space. In both the free classification and pairwise rating tasks, the main point of contrast along D2 was speed, followed by pitch: slow speakers cluster together and, among the fast speakers, higher-pitched speakers are distinct from lower-pitched speakers.
However, there are also some differences in how the leader-lagger vowel effect was mediated by other variables. In our free classification task, the leader-lagger vowels appeared in the first node of the regression tree predicting D1, followed by articulation rate in the second node. In Sheard et al. (2025), speaker pitch was the first node in the regression tree predicting D1, with the leader-lagger vowels appearing in the second node, within the lower-pitched speakers. While the correlations in Section 4.2 indicate that the overall structure of the two spaces is similar in terms of which speakers are perceptually distinct from each other (see also Section 5 in the supplementary materials), pitch emerges as more important in the pairwise comparison space.
This difference may reflect what listeners are doing in pairwise versus grouping tasks. Work outside of linguistics has shown that pairwise tasks can encourage greater use of low-level acoustic features in assessing the similarity of sounds (e.g., a doorbell, sandpaper, rain, and snoring), while grouping tasks encourage greater use of categorical features (Aldrich et al., 2009). Correspondingly, in our pairwise rating task, listeners do not know the range of variation in the dataset when they rate the similarity of each pair on a continuum. This may lead them to concentrate on gradient surface characteristics of the voices (such as pitch). In the free classification task, by contrast, participants are “instantly calibrated to the full ranges of the important stimulus dimensions” (Hout et al., 2013, p. 274), because multiple stimuli are presented simultaneously. Being able to consider the full range of variation across stimuli may lend itself to less continuous approaches, such as judging whether a speaker is ‘rural,’ ‘high SES,’ or ‘fast and high.’
The socially interpreted leader-lagger continuum may, therefore, be more readily apparent to listeners (although mediated by speech rate) in the more categorical free classification task. In the pairwise rating task, gradient surface acoustic cues are instead more salient, and pitch becomes the primary cue. Since sharing high pitch leads to high perceived pairwise similarity while sharing low pitch does not (Sheard et al., 2025), high-pitched speakers may be so salient in the pairwise task that the more social leader-lagger vowels can only emerge as relevant among the lower-pitched speakers. Nonetheless, the key point is that, although the different task mechanics likely led to differences in the relative salience of cues, the leader-lagger vowels emerge as perceptually salient in both perceptual spaces.
5.3. Limitations and future directions
Here we discuss the study’s limitations and the future directions suggested by our results. First and foremost, the experiment design lends itself to higher variability in listener responses. As all production work on covarying vowels in NZE has been based on uncontrolled speech from the ONZE and QB corpora, using uncontrolled audio stimuli allowed us to compare speakers’ actual position in the leader-lagger continuum and back-vowel configuration directly to their perception by NZE listeners. Long and uncontrolled stimuli, however, introduce potential confounds that shorter, controlled stimuli would not. This includes high variability in paralinguistic features which, as discussed previously, are perceptually salient to listeners.
The uncontrolled stimuli also meant listeners were not evenly exposed to variables from both covarying vowel clusters. The back-vowel configuration, which captures covariation between thought, start, strut (F2), and lot, has been found to be a robust and stable cluster of covarying vowels in speech production in New Zealand English (Brand et al., 2021; Hurring et al., 2025). However, the results of Sheard et al. (2025) and the analysis presented here suggest that listeners are not systematically using the back-vowel cluster in their evaluation of speakers. One possible explanation is that the back vowels are simply not salient, or at least not as (socially) salient as the leader-lagger vowels, speech rate and pitch. This explanation aligns with evidence that New Zealanders associate realisations of some vowels in the leader-lagger continuum, particularly kit, with the New Zealand accent (e.g., Bell, 1997), whereas there is no comparable evidence for any of the back vowels.
However, as Sheard et al. (2025) note, the back vowels’ apparent lack of perceptual relevance may instead be an artefact of the limited back-vowel data to which listeners were exposed. Assuming they listened to the audio stimuli in full, listeners heard more than three times as many leader-lagger vowel tokens as back-vowel tokens. Future work could explore the perceptual relevance of the back-vowel configuration using more controlled stimuli that expose listeners only to variation in the back-vowel cluster, not in the leader-lagger vowels (e.g., by repeating the experiment with stimuli that hold pitch and speech rate constant throughout).
We may also have had more success in prompting specific social associations from listeners by introducing firmer task parameters (as in the free classification work on American regional dialects). Given the lack of established hypotheses and the potential for the labels researchers use to affect listener perception of stimuli (Niedzielski, 1999; Wade et al., 2023), our primary intent was to avoid imposing our own intuitions or expectations onto listeners. However, the free classification task may have been overly free, resulting in listeners tuning in to non-segmental features as much as segmental ones. Whether imposing social categories on listeners would have helped or hindered us in achieving our aims is an empirical question, and there remains substantial scope for understanding when, and to what, listeners attend socially when the experiment design does not force them to make such associations.
Finally, while we have shown that a speaker’s position in the leader-lagger continuum is associated with judgements linked to socioeconomic status, we have not shown that all of the individual leader-lagger vowels contribute equally, or at all, to the percept of socioeconomic status. Some vowels may index particular NZE accents more strongly than others and, correspondingly, carry more social meaning for listeners. Our focus on speakers’ leader-lagger scores in the analysis could, therefore, have obscured the relative importance of different leader-lagger vowels in listener judgements.
There is also the possibility that variables other than the leader-lagger vowels (but which may nonetheless correlate with speakers’ leader-lagger scores) contribute to the perceptual differentiation between speakers. For example, speakers in the perceptually high-SES group are associated with clearer articulation more than the slower, perceptually low-SES speakers, despite there being more fast than slow laggers in the former group. This points to a potential contributing role for consonant production or suprasegmental features, which the leader-lagger scores nonetheless successfully capture. We are left, then, with questions such as: could listeners make the same ratings if they could only hear realisations of one of the leader-lagger vowels? And, if so, could they do so to an equal degree regardless of the identity of the vowel? Future work is necessary to disentangle the precise mechanism underpinning the association between the leader-lagger continuum and perceived socioeconomic status.
6. Conclusion
Speakers produce subsystems of covarying vowels in New Zealand English (see Brand et al., 2021; Hurring et al., 2025). In this paper, we have introduced a novel method for identifying and interpreting the social meaning(s) associated with covarying sociolinguistic variables. Our implementation of MDS and PCA shows that one of these vowel clusters, the leader-lagger continuum, is salient to New Zealand English listeners and likely captures socio-stylistic intervocalic relationships. Listeners not only perceive leaders and laggers in change as less similar to each other but also evaluate them differently in social terms. Laggers are heard as sounding more like each other, less like leaders, and as having a higher socioeconomic status. Listeners also evaluate leaders, especially those who also speak slowly, as sounding more like each other, less like laggers, and as sounding lower in socioeconomic status or rural.
First, these findings align with longstanding sociolinguistic theories of community sound change that are based on analyses of individual variables: changes from below, including those that systematically covary, can carry social meaning(s) for listeners. Second, this work advances research in sociolinguistic perception by linking patterns of covariation in speech production with listeners’ social judgements, providing evidence that listeners can attend to a subsystem of covarying vowels when making social evaluations of speakers. However, the study also highlights that there is no one-to-one connection between speakers’ production of covarying variables and listener perception. The social meaning of the leader-lagger continuum only becomes apparent in context; perceived socioeconomic status depends on both vowel patterns and highly salient paralinguistic cues. The salience of these cues, and the social characteristics associated with them, also vary across listeners.
Finally, the results begin to answer the question Eckert and Labov (2017, p. 490) highlight as central to sociolinguistics: “What are the limitations on the generalization of social meaning to more than one variable?” In addition to establishing a methodology and workflow for continuing work on this question, the study’s results indicate that covarying variables can, but do not inherently, carry social meaning. The back-vowel configuration, though robust in production, did not emerge as perceptually salient or relevant to listeners, indicating that the social and perceptual relevance of linguistic covariation in the leader-lagger continuum does not automatically generalise to other clusters of covarying variables in NZE, or to other varieties and languages.
Notes
- The markdown files, R packages (R Core Team, 2021) and anonymised data used for this paper are available as supplementary materials in a public GitHub repository at https://github.com/nzilbb/qb-free-classification-public. The experiment preregistration is available in the supplementary materials, and we identify which parts of the reported analysis diverge from what was preregistered. The supplementary materials provide additional context and detail but are not necessary for following the analysis. [^]
- The leader-lagger continuum in Brand et al. (2021) includes trap, dress, fleece, kit, nurse, and lot, while in Hurring et al. (2025) it includes trap, dress, fleece, kit, nurse, goose and F1 of strut, with lot more closely associated with the back-vowel configuration (along with thought, start and F2 of strut). [^]
- This adaptation is publicly available on GitHub at https://github.com/nzilbb/audio_tokens. [^]
- We note this is fewer than our preregistered aim of 180–200. We stopped data collection and started analysis once it became apparent that participant recruitment had slowed to a point that further recruitment would result in a significant delay. [^]
- None of the speakers systematically produce /s/ as /θ/. However, the three speakers whom 10 or more listeners labelled as having a ‘lisp’ (speakers 37, 25 and 21) do, impressionistically, have more fronted realisations of /s/. [^]
- The participants excluded from the MDS analysis in Section 4.1 were also excluded from the Label PCA and all subsequent analyses. [^]
- The 90% confidence interval follows the approach to PCA testing proposed in Vieira (2012) and applied in Wilson Black et al. (2023). We treat α as 0.05 elsewhere. [^]
- Phonation time includes corrections, incomplete productions, filled pauses (um and uh/ah), and inter-word pauses less than 50 milliseconds in duration. [^]
- Please refer to Section 5 of the supplementary materials for a detailed comparison of the results of the two studies. [^]
- As a reviewer noted, the relatively low percentage of variance explained by Label PC1 and PC2 echoes the results of other sociolinguistic applications of dimension-reduction techniques to speech production data. Such low levels of explained variance may reflect (covarying) variables not mapping neatly onto real or perceived socio-contextual factors. [^]
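To make the phonation-time definition in the notes above concrete, the following is a minimal Python sketch of how articulation rate could be computed from an annotated recording. It assumes (i) that articulation rate is syllables per second of phonation time, (ii) that each stretch of speech or silence has already been annotated with a syllable count, and (iii) that silent pauses are marked explicitly. The `Interval` class and the `phonation_time` and `articulation_rate` functions are hypothetical illustrations, not the study’s own R code, and treating filled pauses as contributing duration but no syllables is likewise an assumption of this sketch.

```python
import math
from dataclasses import dataclass


# Hypothetical annotation unit: one stretch of a recording.
@dataclass
class Interval:
    start: float           # onset in seconds
    end: float             # offset in seconds
    is_silent_pause: bool  # True for silent inter-word pauses
    syllables: int = 0     # syllables produced in this stretch (assumed pre-annotated)

    @property
    def duration(self) -> float:
        return self.end - self.start


# Silent pauses shorter than 50 ms still count towards phonation time.
PAUSE_THRESHOLD_S = 0.05


def phonation_time(intervals: list[Interval]) -> float:
    """Sum all speech stretches (including corrections, incomplete productions
    and filled pauses) plus silent pauses shorter than 50 ms."""
    return sum(
        iv.duration
        for iv in intervals
        if not (iv.is_silent_pause and iv.duration >= PAUSE_THRESHOLD_S)
    )


def articulation_rate(intervals: list[Interval]) -> float:
    """Syllables per second of phonation time; NaN if there is no phonation."""
    t = phonation_time(intervals)
    if t <= 0:
        return math.nan
    return sum(iv.syllables for iv in intervals) / t


# Toy example (invented values, not study data).
stretches = [
    Interval(0.00, 1.00, False, syllables=4),  # speech
    Interval(1.00, 1.04, True),                # 40 ms silent pause: kept
    Interval(1.04, 2.00, False, syllables=5),  # speech
    Interval(2.00, 2.50, True),                # 500 ms silent pause: excluded
    Interval(2.50, 3.00, False, syllables=0),  # filled pause "um": duration only
]
# 9 syllables over 2.5 s of phonation time
print(articulation_rate(stretches))
```

Keeping the 50 ms cut-off as a named constant makes it straightforward to check how sensitive a speaker’s rate is to the pause threshold.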
Ethics and consent
This research received ethics approval from the Human Research Ethics Committee at the University of Canterbury (2023/60/LR-PS).
Acknowledgements
This research was funded by a Royal Society of New Zealand Marsden Research Grant (Proposal ID 21-UOC-107/Grant Number UOC2110) awarded to Kevin Watson, Jennifer Hay and Lynn Clark. We would like to thank our colleagues at the New Zealand Institute of Language, Brain and Behaviour, with a special thanks to Gia Hurring, for their feedback as this work developed. We also sincerely thank both reviewers and the audiences at Laboratory Phonology XIX and Methods in Dialectology XVIII for their thoughts and suggestions.
Competing interests
The authors have no competing interests to declare.
Author contributions
Elena Sheard: Conceptualization, Methodology, Formal analysis, Investigation, Writing – Original Draft, Writing – Review & Editing, Visualization. Jennifer Hay: Conceptualization, Methodology, Funding Acquisition, Writing – Original Draft, Writing – Review & Editing. Robert Fromont: Methodology, Resources, Software. Joshua Wilson Black: Conceptualization, Formal analysis, Methodology, Software, Writing – Review & Editing. Lynn Clark: Methodology, Conceptualization, Funding Acquisition, Writing – Review & Editing.
References
Aldrich, K. M., Hellier, E. J., & Edworthy, J. (2009). What determines auditory similarity? The effect of stimulus group and methodology. Quarterly Journal of Experimental Psychology, 62(1), 63–83. http://doi.org/10.1080/17470210701814451
Bauer, L., Warren, P., Bardsley, D., Kennedy, M., & Major, G. (2007). New Zealand English. Journal of the International Phonetic Association, 37(1), 97–102. http://doi.org/10.1017/S0025100306002830
Bayard, D., & Bartlett, C. (1996). You must be from Gorrre: Attitudinal effects of Southland rhotic accents and speaker gender on NZE listeners and the question of NZE regional variation. Te Reo, 39, 25–45.
Beaman, K. V., & Guy, G. R. (Eds.). (2022). The coherence of linguistic communities. Routledge. http://doi.org/10.4324/9781003134558
Beaman, K. V., & Sering, K. (2022). Measuring change in lectal coherence across real- and apparent-time. In K. V. Beaman & G. R. Guy (Eds.), The coherence of linguistic communities (pp. 87–105). Routledge. http://doi.org/10.4324/9781003134558
Becker, K. (2014). The social motivations of reversal: Raised BOUGHT in New York City English. Language in Society, 43(4), 395–420. http://doi.org/10.1017/S0047404514000372
Bell, A. (1997). The phonetics of fish and chips in New Zealand: Marking national and ethnic identities. English World-Wide, 18(2), 243–270. http://doi.org/10.1075/eww.18.2.05bel
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5(9/10), 341–345. https://cir.nii.ac.jp/crid/1572261550900588928
Borg, I., & Groenen, P. J. F. (2005). Modern multidimensional scaling: Theory and applications (2nd ed.). Springer.
Brand, J., Hay, J., Clark, L., Watson, K., & Sóskuthy, M. (2021). Systematic co-variation of monophthongs across speakers of New Zealand English. Journal of Phonetics, 88, 101096. http://doi.org/10.1016/j.wocn.2021.101096
Bucholtz, M. (2008). The whiteness of nerds: Superstandard English and racial markedness. Journal of Linguistic Anthropology, 11(1), 84–100. http://doi.org/10.1525/jlin.2001.11.1.84
Campbell-Kibler, K. (2010). Sociolinguistics and perception. Language and Linguistics Compass, 4(6), 377–389. http://doi.org/10.1111/j.1749-818X.2010.00201.x
Campbell-Kibler, K. (2011). Intersecting variables and perceived sexual orientation in men. American Speech, 86(1), 52–68. http://doi.org/10.1215/00031283-1277510
Casserly, E. D. (2010). Perceptual similarity across multiple sociolinguistic variables. IULC Working Papers, 10(1), 1–15. https://scholarworks.iu.edu/journals/index.php/iulcwp/article/view/25838
Clark, L., MacGougan, H., Hay, J., & Walsh, L. (2016). “Kia Ora. This is my earthquake story.” Multiple applications of a sociolinguistic corpus. Ampersand, 3, 13–20. http://doi.org/10.1016/j.amper.2016.01.001
Clopper, C. G. (2008). Auditory free classification: Methods and analysis. Behavior Research Methods, 40(2), 575–581. http://doi.org/10.3758/BRM.40.2.575
Clopper, C. G., & Bradlow, A. R. (2009). Free classification of American English dialects by native and non-native listeners. Journal of Phonetics, 37(4), 436–451. http://doi.org/10.1016/j.wocn.2009.07.004
Clopper, C. G., Levi, S. V., & Pisoni, D. B. (2006). Perceptual similarity of regional dialects of American English. Journal of the Acoustical Society of America, 119(1), 566–574. http://doi.org/10.1121/1.2141171
Clopper, C. G., & Pisoni, D. B. (2007). Free classification of regional dialects of American English. Journal of Phonetics, 35(3), 421–438. http://doi.org/10.1016/j.wocn.2006.06.001
de Leeuw, J. (2015). JsPsych: A Javascript library for creating behavioral experiments in a web browser. Behavior Research Methods, 47(1), 1–12. http://doi.org/10.3758/s13428-014-0458-y
Donhauser, P. W., & Klein, D. (2023). Audio-tokens: A toolbox for rating, sorting and comparing audio samples in the browser. Behavior Research Methods, 55(2), 508–515. http://doi.org/10.3758/s13428-022-01803-w
D’Onofrio, A. (2015). Persona-based information shapes linguistic perception: Valley Girls and California vowels. Journal of Sociolinguistics, 19(2), 241–256. http://doi.org/10.1111/josl.12115
D’Onofrio, A. (2018). Personae and phonetic detail in sociolinguistic signs. Language in Society, 47, 513–539. http://doi.org/10.1017/S0047404518000581
D’Onofrio, A. (2020). Personae in sociolinguistic variation. Wiley Interdisciplinary Reviews: Cognitive Science, 11(6), e1543. http://doi.org/10.1002/wcs.1543
Drager, K. (2010). Sociophonetic variation in speech perception. Language and Linguistics Compass, 4(7), 473–480. http://doi.org/10.1111/j.1749-818X.2010.00210.x
Drager, K. (2018). Experimental research methods in sociolinguistics. Bloomsbury. http://doi.org/10.5040/9781474251815
Eckert, P. (2012). Three waves of variation study: The emergence of meaning in the study of sociolinguistic variation. Annual Review of Anthropology, 41(1), 87–100. http://doi.org/10.1146/annurev-anthro-092611-145828
Eckert, P. (2019). The limits of meaning: Social indexicality, variation, and the cline of interiority. Language, 95(4), 751–776. http://doi.org/10.1353/lan.2019.0072
Eckert, P., & Labov, W. (2017). Phonetics, phonology and social meaning. Journal of Sociolinguistics, 21(4), 467–496. http://doi.org/10.1111/josl.12244
Flego, S., Clark, L., Hay, J., Hurring, G., Sheard, E., Walker, A., & Wilson Black, J. (2025). Changing trajectories of New Zealand English vowels. Journal of the Acoustical Society of America, 157(4_Supplement). http://doi.org/10.1121/10.0037628
Gelfer, M. P. (1993). A multidimensional scaling study of voice quality in females. Phonetica, 50(1), 15–27. http://doi.org/10.1159/000261923
Giles, H. (1970). Evaluative reactions to accents. Educational Review, 22(3), 211–227. http://doi.org/10.1080/0013191700220301
Gordon, E. (1997). Sex, speech, and stereotypes: Why women use prestige speech forms more than men. Language in Society, 26(1), 47–63. http://doi.org/10.1017/S0047404500019400
Gordon, E., Campbell, L., Hay, J., Maclagan, M., Sudbury, A., & Trudgill, P. (2004). New Zealand English: Its origins and evolution. Cambridge University Press. http://doi.org/10.1017/CBO9780511486678
Gordon, E., Maclagan, M., & Hay, J. (2007). The ONZE corpus. In J. C. Beal, K. Corrigan, & H. Moisl (Eds.), Creating and digitizing language corpora (pp. 82–104). Palgrave Macmillan. http://doi.org/10.1057/9780230223202_4
Hall-Lew, L., Moore, E., & Podesva, R. (Eds.). (2021). Social meaning and linguistic variation: Theorizing the third wave. Cambridge University Press. http://doi.org/10.1017/9781108578684
Harnsberger, J. D., Shrivastav, R., Brown Jr., W. S., Rothman, H., & Hollien, H. (2008). Speaking rate and fundamental frequency as speech cues to perceived age. Journal of Voice, 22(1), 58–69. http://doi.org/10.1016/j.jvoice.2006.07.004
Hay, J., Pierrehumbert, J. B., Walker, A., & LaShell, P. (2015). Tracking word frequency effects through 130 years of sound change. Cognition, 139, 83–91. http://doi.org/10.1016/j.cognition.2015.02.012
Hiraga, Y. (2005). British attitudes towards six varieties of English in the USA and Britain. World Englishes, 24(3), 289–308. http://doi.org/10.1111/j.0883-2919.2005.00411.x
Horvath, B. (1985). Variation in Australian English: The sociolects of Sydney. Cambridge University Press.
Hout, M. C., Goldinger, S. D., & Ferguson, R. W. (2013). The versatility of SpAM: A fast, efficient, spatial method of data collection for multidimensional scaling. Journal of Experimental Psychology: General, 142(1), 256–281. http://doi.org/10.1037/a0028860
Hurring, G., Wilson Black, J., Hay, J., & Clark, L. (2025). How stable are patterns of covariation across time? Language Variation and Change, 37(1), 111–135. http://doi.org/10.1017/S0954394525000043
Kendall, T. (2023). Sociophonetics and speech rate and pause. In C. Strelluf (Ed.), The Routledge handbook of sociophonetics (pp. 55–75). Routledge. http://doi.org/10.4324/9781003034636-4
Koreman, J. (2006). Perceived speech rate: The effects of articulation rate and speaking style in spontaneous speech. Journal of the Acoustical Society of America, 119(1), 582–596. http://doi.org/10.1121/1.2133436
Kreiman, J., Gerratt, B. R., Precoda, K., & Berke, G. S. (1992). Individual differences in voice quality perception. Journal of Speech and Hearing Research, 35(3), 512–520. http://doi.org/10.1044/jshr.3503.512
Labov, W. (1966). The social stratification of English in New York City. Center for Applied Linguistics.
Labov, W. (2001). Principles of linguistic change, Volume 2: Social factors. Blackwell.
Labov, W., Ash, S., & Ravindranath, M. (2011). Properties of the sociolinguistic monitor. Journal of Sociolinguistics, 15(4). http://doi.org/10.1111/j.1467-9841.2011.00504.x
Lansford, K. L., Liss, J. M., & Norton, R. E. (2014). Free-classification of perceptually similar speakers with dysarthria. Journal of Speech, Language, and Hearing Research, 57(6), 2051–2064. http://doi.org/10.1044/2014_JSLHR-S-13-0177
Levon, E. (2007). Sexuality in context: Variation and the sociolinguistic perception of identity. Language in Society, 36(4), 533–554. http://doi.org/10.1017/S0047404507070431
Levon, E. (2014). Categories, stereotypes, and the linguistic perception of sexuality. Language in Society, 43(5), 539–566. http://doi.org/10.1017/S0047404514000554
Levon, E., Sharma, D., Watt, D. J., Cardoso, A., & Ye, Y. (2021). Accent bias and perceptions of professional competence in England. Journal of English Linguistics, 49(4), 355–388. http://doi.org/10.1177/00754242211046316
Lippi-Green, R. (1997). English with an accent: Language, ideology, and discrimination in the United States. Routledge.
MacFarlane, A. E., & Stuart-Smith, J. (2012). “One of them sounds sort of Glasgow Uni-Ish.” Social judgements and fine phonetic variation in Glasgow. Lingua, 122(7), 764–778. http://doi.org/10.1016/j.lingua.2012.01.007
Maclagan, M., Watson, C. I., Harlow, R., King, J., & Keegan, P. (2009). /u/ fronting and /t/ aspiration in Māori and New Zealand English. Language Variation and Change, 21(2), 175–192. http://doi.org/10.1017/S095439450999007X
Maclagan, M., Watson, C. I., Harlow, R., King, J., & Keegan, P. (2017). Investigating the sound change in the New Zealand English nurse vowel /ɜː/. Australian Journal of Linguistics, 37(4), 465–485. http://doi.org/10.1080/07268602.2017.1364126
Mair, P., Groenen, P. J. F., & de Leeuw, J. (2022). More on multidimensional scaling and unfolding in R: Smacof, version 2. Journal of Statistical Software, 102(10), 1–46. http://doi.org/10.18637/jss.v102.i10
Matsumoto, H., Hiki, S., Sone, T., & Nimura, T. (1973). Multidimensional representation of personal quality of vowels and its acoustical correlates. IEEE Transactions on Audio and Electroacoustics, 21(5), 428–436. http://doi.org/10.1109/TAU.1973.1162507
McDougall, K. (2013). Assessing perceived voice similarity using multidimensional scaling for the construction of voice parades. The International Journal of Speech, Language and the Law, 20(2), 163–172. http://doi.org/10.1558/ijsll.v20i2.163
Meyerhoff, M., & Klaere, S. (2017). A case for clustering speakers and linguistic variables. In I. Buchstaller & B. Siebenhaar (Eds.), Language variation—European perspectives VI: Selected Papers from the Eighth International Conference on Language Variation in Europe (Iclave 8), Leipzig, May 2015 (pp. 23–46). John Benjamins. http://doi.org/10.1075/silv.19.02mey
Montgomery, C., & Moore, E. (2018). Evaluating S(c)illy voices: The effects of salience, stereotypes, and co-present language variables on real-time reactions to regional speech. Language, 94(3), 629–661. https://www.jstor.org/stable/26630374
Murry, T., & Singh, S. (1980). Multidimensional analysis of male and female voices. Journal of the Acoustical Society of America, 68(5), 1294–1300. http://doi.org/10.1121/1.385122
Niedzielski, N. (1999). The effect of social information on the perception of sociolinguistic variables. Journal of Language and Social Psychology, 18(1), 62–85. http://doi.org/10.1177/0261927X99018001005
Oksanen, J., Simpson, G. L., Blanchet, F. G., Kindt, R., Legendre, P., Minchin, P. R., O’Hara, R. B., Solymos, P., Stevens, M. H. H., Szoecs, E., Wagner, H., Barbour, M., Bedward, M., Bolker, B., Borcard, D., Carvalho, G., Chirico, M., De Caceres, M., Durand, S., … Weedon, J. (2022). Vegan: Community Ecology Package (Version 2.6-4) [R package]. https://github.com/vegandevs/vegan
Pharao, N., & Maegaard, M. (2017). On the influence of coronal sibilants and stops on the perception of social meanings in Copenhagen Danish. Linguistics, 55(5), 1141–1167. http://doi.org/10.1515/ling-2017-0023
Pharao, N., Maegaard, M., Spindler Møller, J., & Kristiansen, T. (2014). Indexical meanings of [s+] among Copenhagen youth: Social perception of a phonetic variant in different prosodic contexts. Language in Society, 43(1), 1–31. http://doi.org/10.1017/S0047404513000857
Pratt, T. (2019). Embodying “Tech”: Articulatory setting, phonetic variation, and social meaning. Journal of Sociolinguistics, 24(3), 328–349. http://doi.org/10.1111/josl.12369
R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. http://www.R-project.org/
Ray, G. B., & Zahn, C. J. (1999). Language attitudes and speech behaviour: New Zealand English and Standard American English. Journal of Language and Social Psychology, 18(3), 310–319. http://doi.org/10.1177/0261927X99018003005
Ross, B. (2024). A new look at sound change in New Zealand English [Doctoral dissertation, University of Auckland]. https://hdl.handle.net/2292/70721
Ross, B., Ballard, E., & Watson, C. I. (2022). Young Aucklanders and New Zealand English vowel shifts. In R. Billington (Ed.), Proceedings of the Eighteenth Australasian International Conference on Speech Science and Technology (pp. 186–190). Australasian Speech Science and Technology Association.
Sheard, E., Hay, J., Wilson Black, J., & Clark, L. (2025). Do ‘leaders’ in change sound different from ‘laggers’? The perceptual similarity of New Zealand English voices. PLOS ONE, 20(12), e0338199. http://doi.org/10.1371/journal.pone.0338199
Szakay, A. (2008). Social networks and the perceptual relevance of rhythm: A New Zealand case study. University of Pennsylvania Working Papers in Linguistics, 14(2), 148–156. https://repository.upenn.edu/handle/20.500.14332/44697
Szakay, A. (2012). Voice quality as a marker of ethnicity in New Zealand: From acoustics to perception. Journal of Sociolinguistics, 16(3), 383–397. http://doi.org/10.1111/j.1467-9841.2012.00537.x
Thomas, E. R. (2002). Sociophonetic applications of speech perception experiments. American Speech, 77(2), 115–147. http://doi.org/10.1215/00031283-77-2-115
Trudgill, P. (1972). Sex, covert prestige and linguistic change in the urban British English of Norwich. Language in Society, 1(2), 179–195. http://doi.org/10.1017/S0047404500000488
Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327–352. http://doi.org/10.1037/0033-295X.84.4.327
Vieira, V. (2012). Permutation tests to estimate significances on principal components analysis. Computational Ecology and Software, 2(2), 103–123.
Wade, L. R. (2022). Experimental evidence for expectation-driven linguistic convergence. Language, 98(1), 63–97. http://doi.org/10.1353/lan.2021.0086
Wade, L. R., Embick, D., & Tamminga, M. (2023). Dialect experience modulates cue reliance in sociolinguistic convergence. Glossa Psycholinguistics, 2(1). http://doi.org/10.5070/G6011187
Walker, A. (2007). The effect of phonetic detail on perceived speaker age and social class. In J. Trouvain & J. Barry (Eds.), Proceedings of the 16th International Congress of Phonetic Sciences (pp. 1453–1456).
Waller, S. S., Eriksson, M., & Sörqvist, P. (2015). Can you hear my age? Influences of speech rate and speech spontaneity on estimation of speaker age. Frontiers in Psychology, 6. http://doi.org/10.3389/fpsyg.2015.00978
Walsh, L., Hay, J., Bent, D., King, J., Millar, P., Papp, V., & Watson, K. (2013). The UC Quakebox Project: Creation of a community-focused research archive. New Zealand English Journal, 27, 20–32.
Wieling, M., & Nerbonne, J. (2015). Advances in dialectometry. Annual Review of Linguistics, 1, 243–264. http://doi.org/10.1146/annurev-linguist-030514-124930
Wilson Black, J., & Brand, J. (2021). nzilbb.vowels (Version 0.2.1) [R package].
Wilson Black, J., Brand, J., Hay, J., & Clark, L. (2023). Using principal component analysis to explore co-variation of vowels. Language and Linguistics Compass, 17(1). http://doi.org/10.1111/lnc3.12479