A multi-method approach to correlate identification in acoustic data: The case of Media Lengua

This study of Media Lengua examines production differences between mid and high vowels in order to identify the major correlates that distinguish these vowel types. The Media Lengua vowel system is unusual in that it incorporates lexical items originating in Spanish’s five-vowel system into a three-vowel system inherited from Quichua, resulting in high degrees of overlap between the front versus back, mid and high vowel pairs /e, i/ and /o, u/ in F1xF2 space. As Media Lengua speakers utilize and differentiate between all five vowels despite the large degree of acoustic overlap between mid and high vowels, this raises the question of what other correlates beyond F1 and F2 might be involved. To address this, our study looks at a range of variables, both acoustic and qualitative, in a multi-method approach using both factor analysis for mixed data and linear mixed effects regression modelling. Each method provides a unique view on the correlates of vowel differentiation in Media Lengua. Taken together, our results indicate that Media Lengua speakers rely on both social and linguistic contextual cues to distinguish mid from high vowels, which overlap in acoustic space (F1 and F2).


Introduction
Contact languages provide an ideal platform for exploring and testing the atypical arrangements of linguistic elements, given that such languages do not form through ordinary processes of evolutionary change. Instead, contact language formation is frequently tumultuous, rapid and, more often than not, forceful as groups with no common language must communicate. However, in the case of mixed languages, a sub-category of contact languages, the process is systematic, metalinguistic, and expressive as the originators are already proficient bilinguals in both source languages. This allows mixed languages to form through expressive means (e.g., ethnic or cultural identification) rather than through communicative necessity (Meakins & Stewart, in press). Because of this, mixed languages are often used internally within a speech community and show clear categorical source language divisions in their lexicon and/or grammar. For instance, Michif, a mixed language spoken sparsely throughout central Canada and in the northern U.S., integrates Plains Cree verb phrases and Métis French noun phrases; while Media Lengua, a mixed language spoken in the Ecuadorian Andes, integrates Quichua morphosyntax and Spanish lexicon (see Section 2).
Nonetheless, given the unique nature of the development of Media Lengua's vowel system as a contact language derived from two languages with differently-sized inventories-and specifically developed by speakers with a smaller L1 inventory who have incorporated vocabulary from an L2 system with a wider variety of vowel phonemes-it should not be presumed which correlates serve as the primary means of distinction between Media Lengua vowels. To sufficiently allow for the determination of the most important correlates in this regard requires a suitably unbiased methodology, one which does not a priori assign certain factors an elevated rank over others. The present study seeks to accomplish this through the application of statistical methods for dimensional reduction to a dataset composed of a multiplicity of acoustic and other variables related to Media Lengua vowel production. Dimensional reduction methods construct unobserved variables by identifying patterns within the observed variables in a dataset, attributing variation in the most concise means possible through the reduction of the number of dimensions needed to describe the primary patterns of variability, and relating the original empiricallymeasured variables to the constructed dimensions so as to rank their relative influence as primary correlates of variation. Whether or not such methods indicate that the primary correlates of acoustic vowel production in Media Lengua pattern similarly to or differently from those of non-contact languages, we believe that this approach is worthwhile, as it is inherently less biased than other methods. Moreover, where dimensional reduction methods are able to determine unexpected patterns, such approaches may offer something of value to linguistic research in a more specific sense.

About the Media Lengua language
Media Lengua is described as a lexical-grammar mixed language (Meakins, 2013;Meakins & Stewart, in press;Muysken, 1997), which is spoken in the northern highlands of Ecuador in the province of Imbabura. Media Lengua combines Quichua's 1 suffixing morphological system and word order with the lexicon of the Rural Spanish dialect spoken in the northern Andes. The Spanish lexicon has replaced over 90% of Quichua's vocabulary primarily through relexification (Deibel, 2017(Deibel, , 2019(Deibel, , 2020Gómez-Rendón, 2005;Lipski, 2016;Muysken, 1981Muysken, , 1997Stewart, 2011Stewart, , 2015b. Media Lengua is spoken by an estimated 2,000 people in a handful of communities near Lago San Pablo. Muysken (1997) and Stewart (2011Stewart ( , 2015b both suggest that the language formed in the early 20th century. Data from this study come from the community of Pijal (see Figure 1).
A sample of Media Lengua is proved in example 1; the first line contains orthography with the bold elements being of Spanish origin, the second line contains a broad IPA transcription with morpheme boundaries, the third line contains the interlinear glosses, and the fourth and fifth lines provide translations in Quichua and Spanish (respectively) for cross comparison.

The Media Lengua vowel system
This section describes Media Lengua's unconventional vowel system based on acoustic and experimental data; for impressionistic descriptions of Media Lengua vowels, see Muysken (1997) for Cotopaxi Media Lengua, and Gómez-Rendón (2005) for Imbabura Media Lengua.
Media Lengua vowels are an integrated system comprised of Quichua's three-vowel system, (/i, u, a/), and Spanish's five-vowel system (/i, u, e, o, a/). The Media Lengua system is described in detail in Stewart (2014) based on F1 and F2 measurements from 2,515 vowel tokens taken from elicited phrases produced by 10 participants. Stewart's results show small, yet statistically significant, differences between Spanish origin and Quichua origin /i/, /u/, and /a/, where the Spanish origin vowels are only 13 Hz less centralized in acoustic space, on average, compared to the Quichua origin vowels (2A). In most cases, a difference of 13 Hz could simply be chalked up to a type I statistical error attributed to the quantity of tokens tested. However, the dispersion patterns of F1 in each Spanish origin corner vowel correspond to the directions predicted by adaptive dispersion models. In other words, vowel systems with more vowels, such as Spanish's five, are predicted to occupy more extreme acoustic spaces, e.g., lower F1 values for high vowels and higher F1 values for low vowels, compared to systems with fewer vowels, such as Quichua's three. However, the degree of dispersion expected for contrastive purposes is not met with an average difference of only 13 Hz (see e.g., Flege, 2007;Johnson, 2000;Liljencrants & Lindblom, 1972;Lindblom, 1986Lindblom, , 1990Livijn, 2000, for details on adaptive dispersion theories). Therefore Stewart's (2014) results suggest that Media Lengua speakers produce imperceptible, yet consistent differences in vowels of the same quality, similar to near-mergers or covert contrasts, in a complex case of stratification based solely on the language of origin of the morpheme. Stewart (2014) also describes the differences between Spanish origin mid-vowels and Quichua/Spanish origin high vowels (Figure 2A). His results suggest that Media Lengua speakers produce statistically significant F1 differences between both groups with /e/ and /o/ having lower tongue body positions, as indicated by formant differences of 41 Hz on average. When converted to 0.36 bark, this is just enough distance to surpass Kewley-Port's (2001) threshold of 0.3 bark for possible formant discrimination (for values between 200-3000 Hz). However, like the corner vowels previously described, the mid-vowel and high-vowel categories also exhibit substantial overlap. It was later confirmed by Stewart (2018a) that Media Lengua speakers take advantage of this small albeit important distance between categories to identify differences between Spanish origin mid and high vowels. Stewart's results ( Figure 2B) were based on a 10-step, two-alternative force choice (2AFC) identification task experiment using minimal pairs as stimuli, which contained modified F1, F2, F3, pitch, duration, and intensity values. Although this line of research has successfully described the arrangement of Media Lengua's different-origin vowels and identified their role in perception, it is still not determined which correlate or correlates are primarily responsible for consistently identifying contrasts in the minimal pairs.
Onosson and Stewart (in press) have recently shown that Media Lengua speakers have also successfully integrated Spanish vowel-sequences (i.e., diphthongs) into the overlapping mid and high vowel space by reducing the overall range of the formant trajectories compared to equivalent sequences produced by Spanish speakers. This was even true for vowel-sequences, which only differ by mid and high vowels in the same articulatory region. For example, e.g., /ie/ and /ei/ consistently had opposing initial and final targets even within the tightly overlapping spaces (see 2A).
The results of these studies taken together describe a dense vowel system with the capacity to operate up to eight monophthong vowels and 19 vowel-sequences which are arranged in an acoustic space that reflects the original Quichua three-vowel system. Stewart and Meakins (2021) suggest that such a system may be a result of Media Lengua's originators having acquired Spanish as late bilinguals, and who spoke a Quichua-accented Spanish. However instead of Spanish origin /e/ assimilating to Quichua /i/ and Spanish origin /o/ assimilating to Quichua /u/, enough acoustic distance was maintained to handle the newly relexified vocabulary of the emerging mixed language, which brought with it a substantial mid versus high vowel phonological functional load. Given that Media Lengua's vowel space has highly overlapping mid and high vowels in the F1 and F2 dimensions (Stewart, 2014), and that Stewart's (2018b) perception experiment incorporated multiple correlates into the stimuli (F1, F2, F3, pitch, duration, intensity), it is not fully clear if the limited differences in F1 and F2 are the primary correlates used in contrasting the pairs. Therefore, this study implements a novel, unbiased approach to better identify potential acoustic correlates involved in producing and perceiving differences in Media Lengua's unconventional vowel system; a system which opposes predicted arrangements in standard dispersion models, and yet still shows clear contrastibility even with overlapping clusters in acoustic space.

Methods
The aim of this study is to identify the acoustic and other correlates involved in differentiating highly overlapping mid and high vowel clusters in Media Lengua. To do so, we make use of the exploratory tool factor analysis of mixed data (FAMD) which is used to categorize multiple quantitative and qualitative variables. FAMDs, and particularly their quantitative component Principal Component Analysis (PCA), are a common exploratory technique used in natural sciences (e.g., biology, chemistry); however, FAMDs have not been readily applied to date in linguistics and acoustic phonetics. We also make use of a well-established statistical method used in acoustic phonetics, linear mixed effects regression modelling (MEM), as a second layer of analysis to corroborate the usefulness of the application of FAMDs to phonetic or other linguistic data.

Materials
For this study, n = 1,202 vowel tokens were analyzed. The vowel data were gathered from two sources. 2 The first source (corpus 1) comes via a 2010-2014 corpus of wordlist/ sentence list data and contains 242 (20%) tokens. The second source (corpus 2) comes via a 2015-2019 corpus of natural speech data and contains 960 (80%) tokens. Token quantities based on the corpora are summarized in Table 1.
During the wordlist and sentence list sessions (corpus 1), participants were asked to read words or sentences in Media Lengua from a computer screen. Since Media Lengua speakers are multilingual (Media Lengua, Quichua, and Spanish) Media Lengua inflectional morphology was added to each word to prime the target language; e.g., seeing the word casa 'house' instead of casamanmi 'house-dir-val' might elicit a more Spanish-like pronunciation. The participants were recorded on a TASCAM DR-1 portable digital recorder using a NEXXTECH unidirectional dynamic microphone (50-13,000 Hz response) in 16-bit WAV format with a sample rate of 44.1 kHz; see Stewart (2015b) for more details on the data collection procedures for corpus 1. Data from corpus 2 were recorded by two assistants from Pijal working in the community. Participants were asked to converse in pairs about a topic of their choosing. These data were recorded in 16-bit WAV format with a sample rate of 44.1 kHz on a Zoom H6 Handy Recorder with its internal microphone (unidirectional XYH-6 capsule), in 16-bit WAV format with a sample rate of 44.1 kHz.

Participants
The participants in this study included 26 trilingual speakers (L1 Media Lengua, L1 Quichua, and L2 Spanish). This group consisted of 15 women and 11 men. All participants were from the community of Pijal Bajo and had acquired Quichua and Media Lengua simultaneously from birth. Upon entering primary school, typically at six-seven years of age, they learned Spanish. Each language has its own niche with Media Lengua only used within the community and typically among speakers aged 40 and above. Spanish is now the dominant language in Pijal and most Media Lengua speakers use Spanish to communicate with their children and those younger than 40; in fact some of the consultants' children had no idea they even spoke Media Lengua and were in awe during the first recording session (see Stewart, 2011). Quichua, while occasionally used alongside Media Lengua, is more commonly used outside Pijal when conversing with people from other Indigenous communities. Given that Media Lengua is typically used as an internal language, and speakers are fluent in both Quichua and Spanish, many people outside of Pijal are not even aware of Media Lengua. However, those that are familiar with it will typically state that people in Pijal "can't speak Quichua well" or "it doesn't make sense," when prompted. As the language is not used with others outside the community, there is little in the way of external linguistic discrimination. Internally, it is a different story; Media Lengua is often referred to as yanga shimi 'a nothing language,' and speaker attitudes range from, "it's fun to speak Media Lengua as it has different intonation and it's expressive" to nostalgia, with some speakers remembering how their parents used to speak, to disdain with one consultant once stating, "this language is stupid and should be forgotten; we'd be better off learning English." Pijaleños in their late 20s and 30s typically have a passive knowledge of Media Lengua and may be able to carry on a basic conversation. However, Pijaleños below this age range are typically Spanish monolingual, though they may have some knowledge of Unified Quichua from classes taught at school (for more information on the social context or language ideologies see Jarrín Paredes, 2014;Stewart, 2011).

Procedures
This section discusses the procedures followed in this study, beginning with methods of vowel extraction. This is followed by a detailed description of the FAMD approach, including prior examples of the implementation of FAMD (and related methods) in phonetic research, as this method is likely to be largely unfamiliar to a number of researchers. Finally, we briefly discuss the MEM method and our rationale for employing it along with FAMD; we omit a lengthy overview of MEM methodology as it is a relatively familiar statistical tool for many phoneticians, while providing some important references for those who are less familiar with it. All statistical analysis and plotting was carried out in R (v3.5.2, R Core Team, 2020) with extensive use of the tidyverse package suite (Wickham et al., 2019).

Vowel extraction
Each target vowel token was manually segmented in Praat (v6.1.04, Boersma & Weenink, 2020), and acoustic data was extracted using a Praat script written by the authors, which takes F1, F2, F3, pitch, and intensity measurements from a boundary point placed at a steady formant state typically located near of the centre of the vowel. Duration was extracted from interval boundaries that demarcated the vowel based on several criteria (appearance of or changes in glottal pulse and formant patterns, changes in the wave form, intensity, etc.). The script also adjusted the ceiling of the formant search range to 5000 Hz for men and 5500 Hz for women as suggested in the Praat manual (Boersma & Weenink, 1996).
The vowel data were also marked for stress (levels: stressed and unstressed) as it has been known since at least the 1950s that stress can affect vowel formats and vowel duration (see e.g., Fry, 1955Fry, , 1965. Media Lengua, like Quichua, has fixed stress on the penultimate syllable, which 'shifts' to the new penultimate syllable when its suffixing agglutinating morphology is added e.g., 'cat-acc-val.' Similarly, the vowel data were also marked for syllable type (levels: open [-coda] versus closed [+coda]) as vowel production can be affected based on the presence or absence of a coda, especially duration (see e.g., Maddieson, 1985).

Factor analysis for mixed data
Factor analysis for mixed data (FAMD; Pagès, 2004) belongs to a family of multivariate statistical methods for the dimensional reduction of data. Dimensional reduction methods seek to determine internal correlations among dependent variables within a set of observations through the construction of derived, unobserved variables (variously termed factors, dimensions, or components depending on the technique), and thereby reducing the number of overall variables needed to describe relationships within the data. Certain methods, such as principal component analysis (PCA; Abdi & Williams, 2010;Michailidis, 2007) are restricted to the analysis of continuous, quantitative measurements, such as measures of formant frequency or acoustic intensity, while other methods, such as (multiple) correspondence analysis (MCA; Abdi & Valentin, 2007) are applicable to data represented by categorical, qualitative variables, such as speaker sex or elicitation style. FAMD incorporates both PCA and MCA methods together into a single analysis, making it ideally suited for the analysis of sociophonetic data which may include a mixture of quantitative acoustic variables along with qualitative social or other categorical variables.
In most dimensional reduction methods, the derived variables are arranged according to the total amount of variance in the dataset described by that variable, such that Dimension 1 is correlated with the greatest amount of data variance among all dimensions, Dimension 2 a lesser amount, and so on. 3 Each derived dimension has a relationship to the original variables in the dataset which is described via ranked loadings which indicate how much of each dimension is composed of the various original variables. In this way, the relative influence or importance of the original variables can be determined, as well as the relationships between these variables, as variables which have loadings on the same dimension are necessarily correlated with each other to some degree.
PCA, the quantitative method used within FAMD, has been previously used within acoustic phonetic research in relation to the study of vowel acoustics, beginning with a series of studies on Dutch in the 1960s and 70s (Klein, Plomp, & Pols, 1970;Plomp, Pols, & van de Geer, 1967;Pols, Tromp, & Plomp, 1973;Pols, van der Kamp, & Plomp, 1969;van Nierop, Pols, & Plomp, 1973). 4 The approach taken in these studies and subsequent work has been to measure vowel spectra via a series of band-passed filters, which measure the acoustic intensity within a small frequency band. Results from those early studies, and confirmed by later research following similar methods (Jacobi, Pols, & Stroop, 2005;Leinonen, 2009), has confirmed that PC1 and PC2 tend to correlate strongly with F1 and F2 respectively, to the extent that plots of vowel spectra-derived PC1 and PC2 strongly approximate traditional F1xF2 vowel plots. Implementation of PCA analysis with speech data has been typically restricted to the band-passed spectral filter method described above. Other dimensional reduction methods such as Factor Analysis (FA) have been successfully used (e.g., Clopper & Paolillo, 2006) to examine direct measurements of vowel formants as well as vowel duration, and to further consider cross-dialectal production differences (Clopper & Paolillo, 2006;Leinonen, 2009).
In this study we utilize PCA and MCA, implemented together within an overall FAMD model, in an innovative approach to the analysis of vowel acoustics. Rather than applying the band-passed filtered spectral measurement method typical in PCA application to acoustic vowel data, we take direct formant measurements of F1-F3 at vocalic mid-point as well as intensity and pitch (F0), along with overall vowel duration. 5 Unlike other notable studies of vowel acoustics which have utilized PCA, our study considers a broad range of diverse acoustic variables rather than solely relying on band-pass-filtered spectra. Our aim in adopting this method is to maximally reduce bias in the dimensional analysis input by incorporating a variety of acoustic measures, and thereby avoid prejudicing it too much in favour of high-intensity spectral frequency bands (i.e., acoustic formants). Although there is no question that cross-linguistically formants are a primary acoustic characteristic of vowels, in this study we thought it important to also consider as many other potentially relevant acoustic qualities as possible, due to the unusual nature of Media Lengua's vowel system, and to allow dimensional reduction techniques to determine which qualities were most strongly correlated with each other, as well as with Media Lengua's various vowels. Furthermore, the use of MCA within the larger FAMD allows us to include a range of other variables such as speaker sex, elicitation style, syllable stress, etc., covering a host of nonacoustic factors potentially involved in vowel production variation.

Linear mixed effects regression
Given that FAMD has not been readily used as an analytical technique in the field of acoustic phonetics, we also evaluate each dependent quantitative variable using linear mixed effects regression models (a.k.a. 'mixed effects models' or MEMs). This provides us with a basis for comparing and interpreting FAMD results with a well-established analytical tool common in the field. This also allows us to observe relationships in the data which might not be apparent under a single method. Linear regressions are apt for this type of data as they allow for the analysis of a continuous dependent variable (e.g., formant frequencies, duration, pitch, and intensity) along with multiple independent variables that include an entire population (e.g., Sex, Vowel Stress, Syllable Type, etc.), while testing for possible interactions. The mixed effects version of a linear regression incorporates an additional layer of analysis by including variables whose populations cannot easily be exhausted e.g., speaker and word (where it is impossible to test every word in a language and its variation each time it is uttered). These models help answer two basic questions: (1) do the independent variables in question have a significant effect on the production of the continuous variable? And (2) how large is this effect? For additional information on the MEM technique, Bates et al. (2015) discuss the computational methods behind its application and Gries (2015) discusses the use of MEM specifically in linguistics; see also Kirby and Sonderegger (2018) on the topic of experimental design.

Media Lengua vowels
Following the methods described in Section 3.3.1, per-token vowel formant values were extracted from the audio recordings. Figure 3 shows sex-differentiated bag plots of the Media Lengua vowel formants within F1xF2 space. Bag plots (Marwick, 2018;Rousseeuw, Ruts, & Tukey, 1999) are two-dimensional analogs of box-and-whiskers plots; the inner polygonal 'bag' (analogous to the 'box') encompasses 50% of tokens and the larger, outer polygonal 'loop' (analogous to the 'whiskers') covers an area three times as large. A median point is also demarcated within the central area of the bag. See Appendix A for a listing of mean vowel formants.

Factor analysis for mixed data
A single FAMD model was built for the Media Lengua dataset, comprised of six quantitative acoustic variables (F1, F2, F3, pitch, duration, and intensity) and four qualitative variables (speaker sex, vowel stress, syllable type, and elicitation style). Note that vowel is not a variable within the FAMD model, an important point to which we return later in this section. The model was built using the FAMD function provided by the FactoMineR package (Lê, Josse, & Husson, 2008; see Appendix B for full model results). Because the acoustic observations involve different unit types (Hertz, milliseconds, and decibels) and variances, all units are automatically scaled according to within-variable internal variance following the formula (Kassambara, 2017) Figure 4A shows a scree plot, which is a plot type common to several dimensional analysis methods. The scree plot displays the amount of variance within the dataset across the first six computed dimensions from the FAMD. As can be seen, there are diminishing returns to the inclusion of larger numbers of dimensions, as each additional dimension explains a progressively decreasing amount of variation. Moreover, lower dimensions tend to become dominated by single variables and so do not increase explanatory power. Figure 4B through G show the ranked contributions of the individual quantitative and qualitative variables to Dimensions 1 through 6, respectively. As can be seen, Dimension 5 contains only two variables (syllable type and intensity) with contributions greater than the expected level for that dimension (which serves as a threshold when considering the relative influence of variables) while over 60% of the variation in Dimension 6 is contributed by just a single variable (syllable type). For these reasons, we focus on just the first four FAMD dimensions-together these account for 55% of the variation in the data (see Appendix B), and each contains a multiplicity (i.e., four or more) of variables  Contribution of variables to Dim−6 G which meet or exceed the respective expected contribution level (as indicated by the red dashed line) for that dimension.
Dimension 1 is most strongly correlated with speaker sex which accounts for more than 25% of variation along that dimension, followed by pitch, F3, and F2 in decreasing order of contribution. The remaining variables fall below the expected average contribution threshold (shown by the red dashed line), i.e., they are not very strongly correlated with Dimension 1, although F1 stands out among these as being only just below the average contribution level. The largest contributors to Dimension 2 are syllable type (nearly 40%), followed by duration and stress, and F1 which again falls just below the threshold. Dimension 3 is dominated by elicitation style as the largest contributor, but there are several other large contributors including F2, stress, F3, and syllable type. Finally, Dimension 4 is dominated by intensity and elicitation style, with syllable type, F1, and duration also making large contributions.
In interpreting which external factors the FAMD dimensions might represent according to their relative contributions, we think it is sensible to take Dimension 1 as generally reflecting most substantially the overall influence of speaker sex, but also having large contributions from vowel quality. Sex differences are well-known to correlate strongly with both pitch and vocal tract length, the latter of which further impacts upper vowel formants such as F2 and, notably, F3. Based on this interpretation, F3 appears to have an unexpectedly large role to play within Media Lengua vowel variation in comparison to the lesser-contributing F1 and F2, formants which are more traditionally associated with the acoustic maintenance of vowel quality differentiation, and which form the basis of the most commonly-used interpretation of vowel production as derived from acoustic data, i.e., the standard F1xF2 vowel plot. Dimension 2 is correlated with aspects of prosodic/ syllabic structure, but importantly does not include pitch and intensity. Dimension 2 also has an important secondary contribution from F1. Taken together with the lowcontribution level of F1 to Dimension 1, we can interpret this as an indication that vowel height generally is of lesser importance within the Media Lengua vowel system relative to vowel front-back position. Dimension 3 is correlated with a large range of variables, with elicitation style standing out as the most important but not by a great amount, and Dimension 4 is even less easy to characterize in a straightforward way. We will return to these interpretations of the FAMD dimensions in the Discussion. Figure 5 shows the results of the PCA sub-component of the FAMD performed on the quantitative acoustic variables of F1, F2, F3, pitch, duration, and intensity, plotted against  Dimensions 1-2 in Figure 5A and Dimensions 3-4 in Figure 5B. The length and relative darkness of the shading of the text and arrows associated with each variable indicate the relative degree of their contributions, and the cross-correlation between the plotted dimensions is indicated via the angles of arrows between the horizontal and vertical axes. For example, in Figure 5A the length and sharply horizontal angles of the arrows for F2 and F3 indicates both the strength of their correlation with Dimension 1 (F3's longer and more darkly-shaded arrow indicating its larger contribution) and their corresponding weak-to-nonexistent correlation with Dimension 2. In contrast, the intermediary (roughly 45°) angle of F1's arrow indicates its relatively equal contribution to both dimensions, indicative of its secondary but still important role in the system. In Figure 5B we see that F1 and F2 are diametrically opposed, standing at a near 180° angle relative to each other, and correlated with both Dimensions 3 and 4. This opposition indicates that the two are negatively correlated. A negative correlation between F1 and F2 is generally not expected for most vowel systems, as there is typically not a very strong relationship between vowel height and frontedness, which makes this relationship somewhat notable-although it only occurs in two of the lower dimensions which together account for only about 20.3% of variation, and so should not be taken to indicate categorical opposition between these two factors. Figure 6 shows the results of the MCA sub-component of the FAMD performed on the qualitative variables of speaker sex (2 levels: female and male), syllable type (4 levels: open, closed, open-final, and closed-final), syllable stress (2 levels: stressed and unstressed), and elicitation style (2 levels: wordlist and speech); these variables are plotted according to their contributions to Dimensions 1-2 in Figure 5A, and according to their contributions to Dimensions 3-4 in Figure 6B. In Figure 6A, the diametrically-opposed positions of "m" (male) and "f" (female) at great distance along Dimension 1 reflects the FAMD finding that speaker sex is the single largest contributor to variance in the dataset (see Figure 4B). The various levels of syllable type are mostly arranged vertically along Dimension 2, with the exception of open-final syllables which are positioned horizontally away from the other types along Dimension 1, indicating a primary division between open-final versus other syllable types. Closed-final syllables exhibit the next-largest distance from the other types at this level of the dimensional analysis. Stress shows involvement of both Dimensions 1 and 2 in differentiating stressed from unstressed vowels. And, speech style (speech versus wordlist) is mostly differentiated along Dimension 1, but with far less distance between the two levels than among the levels of the other qualitative variables.  Here, syllable types are again arranged mostly vertically (Dimension 4) except for closed syllables which are differentiated from other types along Dimension 3, indicating a division between them versus other types at this level. Stress shows only differentiation along Dimension 3, while speech style involves both Dimension 3 and 4, indicating a degree more of variation between levels of that variable. In contrast, speaker sex is almost completely non-differentiated in these dimensions. It is especially notable that variation related to sex, while being the largest contributor to Dimension 1, is almost entirely absent from the other dimensions. This suggests that while cross-sex variation is substantial within the Media Lengua vowel system, it does not interact with many of the other aspects of variation aside from (per the PCA) pitch and vowel quality. Figure 7 plots the combined FAMD (that is, combining both the PCA and MCA) contributions of the original variables in the dataset along Dimensions 1 through 4. Note that Figure 7A and B have been scaled so that their axis scales are relatively proportional; the result is that Figure 7B, which covers a much smaller range of variance, appears much smaller visually. This is intentional and illustrates the fact that the contributions in Figure 7A are much more substantial in terms of their role within the overall variance of the dataset. In Figure 7A, the general patterning of variables exhibits a strong dichotomy between contributions to Dimensions 1 and 2, with a majority of variables being only strongly correlated with one dimension or the other, the exception being F1 which shows similar contributions to both. Some degree of clustering of variables is also evident: Sex and pitch are the largest contributors to Dimension 1, with F3 and F2 following in close succession; in Dimension 2, after the large contribution from syllable type, duration and stress are closely aligned with each other, suggesting a relationship between the two. In Figure 7B, Dimensions 3 and 4 are far less segregated in terms of contributing variables, with both elicitation style and syllable type accounting for relatively large proportions of both dimensions. It is again notable that sex, while strongly correlated with Dimension 1, is almost entirely relegated to that dimension and shows no meaningful contribution to Dimensions 2, 3, or 4.
Lastly, we relate the vowel phoneme categories to the FAMD model. Figure 8 plots the positions of the 1,202 vowel token observations in the dataset within Dimensions 1 and 2 (in panel A), and within Dimensions 3 and 4 (panel B). Each vowel phoneme is identified by the colour of the individual points and outlined by an irregular polygon hull which captures 100% of the observations associated with that category (per-group mean positions are also present as somewhat larger and more darkly-shaded points among the individual observations, but we do not focus on these). It is important to emphasize, as was mentioned at the beginning of this section, that the vowel categories were not directly included in and did not inform the FAMD analysis in any way-the association between the individual observations and vowel groupings is post hoc, being made subsequent to the analyzing the structure of the dataset, and does not in any way influence the prior output of the model.
Regarding the relative positions of the vowel groups, their arrangement in Figure 8A only loosely corresponds to a typical vowel space plot, even if it were rotated around the central meeting point of the axes. This reflects the lesser (although not absent) importance of F1 and F2 (which form typical vowel plot axes) most especially in Dimension 1, relative to other factors such as speaker sex, pitch, and F3 (see Figure 7). The vowels in Figure 8A can be observed to fall into three broad groups, with varying degrees of overlap within and between them. Falling mostly on the negative side of Dimension 2 (i.e., below the x-axis) we find the front vowels /e/ and /i/ which show a large amount of overlap with each other but very little with the other vowels. Clustered near the central 0-0 point and skewing slightly towards the negative side of Dimension 1 (to the left of the y-axis) and the positive side of Dimension 2 (above the x-axis) are the back vowels /o/ and /u/. Like the front vowels, these show large overlap with each other. Finally, the vowel /a/ is distinguished on two counts. It shows a fair degree of overlap with the back vowels, but virtually none with the front vowels. It also has a relatively large part of its distribution which is more isolated from the other vowels, mostly in terms of Dimension 2, and is the only vowel to do so. Summarizing, in terms of Dimension 1 the clearest distinction is between the back vowels /o, u/ versus the others; in terms of Dimension 2 the clearest distinction is between the front vowels /e, i/ versus the others, and secondarily between /a/ versus most substantially the front vowels, but also the back vowels.  provides another view on the distribution of vowel variation, showing the vowel tokens and hulls plotted against FAMD Dimensions 3 and 4. Unlike in Figure 8A, there is very clear separation between /o/ and /u/ here, while the other three vowels are massively overlapped. Taken as a whole, the degree of cross-vowel overlap observed in both plots suggests that /a/ is the most independent vowel, with the high-and-mid vowels less clearly distinguished in both the front and back regions of the vowel space, and that secondarily the back vowels /o/ and /u/ are more distinguished from each other than the front vowels /i/ and /e/ are from each other. Separation between front versus back vowels is also fairly clear at all levels, but also not especially revealing.

Linear mixed effects regression
Six separate MEMs were built to test each one of the acoustic correlates explored in the FAMD analysis (F1, F2, F3, pitch, duration, and intensity). Each model tests whether the correlate under analysis differs significantly across each vowel category and other independent predictors: Sex (female/male), Syllable Type (open/closed/open-final/ closed-final), Stress (stressed/unstressed syllable), and Style (wordlist/conversational speech). Given that we are primarily interested in /i/'s relationship to /e/ and /u/'s relationship to /o/, we rotate /i/ and /u/ through the model intercept. 6 We did not normalize the vowel data as we are only interested in intra-speaker comparisons. Given that each speaker receives their own intercept in a MEM (using speaker as a random effect), the normalization of unequaled variances is unnecessary (see Drager & Hay, 2012, for a comprehensive overview). MEMs were created with the lmer function of the lme4 package (Bates et al., 2015), and confidence intervals (see Appendix C) were calculated using the tab_model function from the sjPlot package (Lüdecke, 2020). P-values were estimated using the lmerTest package (Kuznetsova, Brockhoff, & Christensen, 2017). Each model includes speaker and word as random effects. All models were kept maximal (i.e., all predictors were included, significant or not) to maintain similar environments across each model. Table 2 provides a summary of significant coefficient estimate 7 results of each model (see Appendix C for the full model results); non-significant results are marked as n.s.
Results from the models presented in Table 2 reveal that formants F1, F2, and F3 play a significant role in differentiating /i/ from /e/, and /u/ from /o/ (in addition to /a/ from /i/, /e/, /u/, and /o/). The F1 model shows that /i/ differs significantly from /e/ by on average 29 Hz, and /u/ from /o/ by on average 35 Hz. These results are remarkably similar to those identified in Stewart (2014), which showed a difference of 41 Hz between /i/ and /e/, and 39 Hz between /u/ and /o/; it is worth noting that his F1 and F2 analysis of Media Lengua vowel formants was based on a different dataset containing elicited speech data from 10 speakers. The only other significant effect identified in the F1 model was sex, with men producing the overall F1 vowel frequencies on average 67 Hz less than women. Interactions between sex and vowel were also tested but nonsignificant results were revealed for the target vowel pairs. However, men produce the /a/ vowel significantly lower in Hertz frequency, equating to a higher tongue body position compared to women. Men also produce /o/ significantly lower in Hz frequency compared to women in comparison with /i/ (-38 Hz); this is likely a result of the more compact vowel space observed in the speech of men.
The F2 model shows /i/ differs significantly from /e/ by, on average 150 Hz and /u/ differs significantly from /o/ by, on average, 154 Hz. These results equate to a more centralized space for /e/ and /o/ compared to /i/ and /u/, respectively. This also corresponds to the results of Stewart's (2014) F2 analysis of /i/ and /e/, which had an average difference of 125 Hz; the F2 of /u/ and /o/ was non-significant in his study. Additionally, stressed vowels were shown to be significantly fronted by 41 Hz, while vowels in a closed final syllable in ultimate position were significantly retracted by 119 Hz. The latter may be a coarticulation effect caused by anticipating the coda targets. Interactions between sex and vowel were also tested but no significant results were revealed for the target vowel pairs. However, a number of other interactions were significant, which all point to a more retracted vowel space for men compared to women.
The F3 model shows /i/ differs significantly from /e/ by, on average 120 Hz; however, differences between /u/ and /o/ were non-significant. F3 was not analyzed in Stewart (2014). The only additional fixed factor that reached significance was sex, with men showing significantly lower Hertz values by, on average, 400 Hz. Non-significant interactions between sex and vowel were calculated with respect to F3. Contrarily, pitch, duration, and intensity were all shown to be non-significant factors in differentiating /i/ from /e/ and /u/ from /o/. As expected, however, pitch is correlated with sex, with male speakers' pitch values being an average of 131.5 Hz lower than female speakers. Pitch also plays a significant role in stress, with stressed syllables being, on average, 9 Hz higher in pitch than unstressed syllables. Additionally, vowels in the ultimate syllable are lower in pitch than vowels in non-final syllable position. This is likely because the final syllable comes directly after the stressed syllable causing some degree of deaccentuation. Evidence from Stewart (2015a) supports these findings, which describes a bitonal pitch accent ending a high target (L+H*) on the penultimate syllable. Additionally, there is a tendency for the ultimate syllable to be lower in f0 Hz frequency compared to the baseline f0 in the contours illustrated in Stewart (2015a); especially in boundary tones that end on a low (L% BT). Moreover, the wordlist factor appears to have some influence on pitch, with an overall higher f0 of 19 Hz; an effect possibly caused by a greater tendency for careful speech produced during this data elicitation method.
The duration of a vowel was shown to be longer by, on average, 24 ms when found at the end of a word (final open syllable); a common cross-linguistic tendency corresponding to utterance-final lengthening. However, in the wordlist data, this effect was reduced by 16 ms to only 8 ms, likely due to the fact that speakers were not producing the words within an utterance in that corpus. Moreover, /a/ was shown to be longer in duration by, on average, 29 ms compared to /u/, supporting long-standing evidence that, in general, higher F1 frequencies correspond to longer vowel duration (see Toivonen et al., 2015, who recently revisited the correlation between height and duration in vowel production). The model results for duration also suggest that /i/ is marginally longer than /u/ by 17 ms. Non-significant interactions between sex and vowel were revealed with respect to pitch.
Lastly, the intensity model revealed non-significant results across the board with the exception of a negligible 1 dB difference between open and closed syllables. The lack of a strong correlation between stress and intensity may suggest that the term 'pitch accent' more aptly reflects the differences in syllable quality between penultimate position and non-penultimate position than 'stress,' which is typically used in the Quechuan literature.
Given that non-significant correlations were found across the target mid and high vowel pairs (/i/ & /e/ and /u/ & /o/) for pitch and duration, it can be inferred that qualities which can affect vowels in other languages, such as tone and/or length, are likely not responsible for differentiating the overlapping mid and high vowel clusters in Media Lengua, based on these models. From the preceding analysis, then, we are left with formants, including F3 in the case of /i/ versus /e/, as the primary acoustic correlates for differentiating mid from high vowels.

Summary and comparison of results
As this study employs two distinct analytic methods, factor analysis for mixed data (FAMD) and linear mixed effects regression modelling (MEM), the first pertinent consideration is to compare the results forthcoming from each method. While both methods investigate the combined effects of several quantitative and qualitative/categorical variables on vowel production, they do so in quite different ways, each with certain strengths and weaknesses. The FAMD model incorporates all the variables within a coherent analysis; however, it does not relate the variables to the vowel categories themselves or offer direct comparisons between vowel types. In contrast, MEM does directly relate the variables to different vowel categories and compare variation between vowels-however, it does not offer a single coherent analysis of all variables across the full dataset, because separate models are built for each acoustic correlate. We proceed here with a brief summary of the results from each analysis, and then compare these to each other, within the limitations inherent to each methodology.
The FAMD analysis revealed that the 22.3% of variation was attributable to the derived Dimension 1, whose main contributing variables included speaker sex and the acoustic variables of pitch, and formants F3, F2, and F1, in relative order according to their individual degrees of contribution. A further 12.7% of variation was attributable to the derived Dimension 2, composed primarily of the variables syllable type, acoustic duration, and stress. Another 10.8% of overall variation was attributable to Dimension 3, with large contributions from style, F2, F3, stress, and syllable type, while 9.5% of variation was attributable to Dimension 4, also having strong contributions from style and syllable, as well as intensity, duration, and F1. Interpreting these various groupings of major contributing variables for each dimension, Dimension 1 appears strongly related to speaker sex (interpreted broadly and subsuming some variation in related acoustic variables), 8 F3 (discussed below), and vowel positional quality-with front-back position superseding height. Dimension 2 is associated with syllable type and factors which may be linked to it phonologically (duration and stress). While a full examination of Media Lengua stress patterning is beyond the scope of this paper, we can note that stress and syllable type are indeed strongly correlated (χ 2 = 100.54(3), p < .001), with stressed syllables being almost completely excluded from closed-final (0 tokens) and open-final (3 tokens) syllables (note too that syllable type makes an important contribution in Dimension 4, and dominates Dimensions 5 and especially 6; see Figure 4). 9 Dimensions 3 and 4 together share large contributions from style which suggests a broad, but much subordinated role for that factor. Using these interpretations, the greatest correlates of variation in Media Lengua vowel production are found to be: speaker sex, F3, vowel quality, syllable type, and speech style, roughly in that order.
Although concurring in several areas, the MEM results are not always easily compared with the FAMD findings. In terms of acoustic qualities, the MEMs show that the vowels in each mid-high pair, /e/ versus /i/ and /o/ versus /u/, are significantly different from each other and are most strongly differentiated according to vowel formants, specifically F1 and F2, rather than other acoustic factors. It is also notable that F3 serves to differentiate /e/ from /i/, but not /o/ from /u/, which seems to be a novel finding in comparison with previous studies. While these results cannot be directly compared to the FAMD output, it is notable that within the MEMs for F1 and F2 the magnitude of difference was consistently greater between /o, u/ than between /e, i/. This is congruent with the FAMD results at the level of Dimensions 3 and 4, as shown in Figure 8B, where /o/ and /u/ are more clearly differentiated from each other as compared with /i, e/; and even at the higher Dimensions 1 and 2 ( Figure 8A) it is arguable that there is a similar albeit smaller difference between the two vowel pairs. This trend is actually the opposite in Stewart (2014) where the front series showed more acoustic distance compared to the back series. One possible explanation for this is that the /o/ measurements from this study show greater fronting and an overall larger vowel space compared to Stewart's (2014) measurements (compare Figure 2 with Figure 3), which might be attributed to the increased number of participants (over double) in this study, which allowed this effect to be captured.
Qualitative variable results are even less easily compared across methods, because within MEM analysis their role is always contextualized within a given model ( Table 2). For example, none of the quantitative variables exhibit a significant effect in relation to F1 within the model structured around that variable, whereas the majority do so in relation to F2. The FAMD results are not discrete in this way, such that the quantitative and qualitative variables are integrated within the larger analysis. Comparing across the various MEMs, we can see that syllable type, speaker sex, and stress each have significant effects within a minimum of three distinct MEMs-and these are also the three largest qualitative contributors within the FAMD. However, while it is difficult to say more than this from the MEM results in terms of the relative magnitude of effect from each of these variables within the system as a whole, the FAMD analysis does provide such a ranking: sex > syllable > stress.
Taken as a whole, then, the FAMD and MEM results both support and complement each other. Some divergence between the /e, i/ versus /o, u/ pairs emerges in both analyses, as does the relative importance of the categorical variables of speaker sex, syllable type, and stress. The MEM provides direct comparison across vowel types, and greater detail regarding the significance of differences in variation therein. FAMD provides a full-system analysis of variables of both types, quantitative and qualitative/categorical. The two methods, at least within the present study, function in a compatible way where their combined results are greater in both quantity and quality than they are when taken independently of each other.

The role of F3
One of the more intriguing results emerging from the combined analysis in this study concerns the role of F3. Under the FAMD, F3 emerges as the third-largest contributor to Dimension 1 (and the second-largest quantitative contributor) after speaker sex and pitch. As pitch is a known correlate of sex, and vowel formants are also known to vary systematically between female versus male speakers, we have therefore interpreted Dimension 1 as mostly reflecting speaker sex taken broadly, with variation in the acoustic factors contributing to Dimension 1 (in relative order: pitch, F3, F2, F1) all being correlated with sex to some degree. Under the MEM model for F3, sex was the most substantial determinant, differing between male versus female speakers to a larger degree than for any other categorical variable. Although the MEMs for both F2 and pitch also indicate a significant and substantial role for sex, it is difficult to compare the different MEM results to each other as they are scaled differently-although F2, F3, and pitch are all measured in Hz, the variance across this unit differs widely between the three variables. In contrast, the FAMD explicitly ranks these variables in accordance with each dimension. As noted, F3 is the third-ranked contributor to Dimension 1-however, it shows negligible levels of contribution to Dimensions 2 and 4, and only a slightly above average contribution to Dimension 3.
A comparison between the MEMs for F2 and F3 in relation to vowel categories is also informative. While both models produce similar results in terms of the occurrence of significant differences (with the exception that F3 is non-significantly different between /u/ versus /o/), the magnitude of difference is larger for F2 in every case. That is, F2 and F3 results pertaining to vowel types are nearly always compatible, and F2 is moreover a better indicator of significant difference between types. The general, and perhaps unexpected conclusion regarding F3 variation, then, is that it is primarily related to speaker sex, and more secondarily to vowel type.
The relationship between F3 and speaker sex is not heretofore unknown, although it is typically subsumed within a broader investigation of formant-frequency differences (Skuk & Schweinberger, 2014), and/or often goes unmentioned next to F1 and F2 (Simpson, 2009). However, some research has shown that F3 provides a superior function to either F1 or F2 in discrimination tasks of speaker sex (Hillenbrand & Clark, 2009). One of the most interesting findings related to our study is Whiteside (2001), who examined production differences between sexes across a wide age-range of speakers. Whiteside found that F3 exhibited sex-based differentiation even in pre-pubescent speakers, and that the relationship between F3 and pitch values showed sex-specific developmental patterns. While the present study is not focused on sex differences in vowel production, nor does it consider production differences over the lifespan, these other studies demonstrate that our finding regarding the strong connection between F3 production and speaker sex is not entirely novel, although it may not be as widely known as e.g., the relationship between sex and vocal pitch.

Contextualizing the results
What does this mean for Media Lengua? Results from these analyses suggest that F1 and F2 may not be the most salient cues used by speakers to differentiate mid and high vowels. Given the extensive literature on vowel formants as a primary cue for vowel production, the fact that other cues may be more salient may, at first glance, seem peculiar. However, after considering that mid and high vowels are highly overlapping in their respective regions of acoustic space (Figure 2A), it becomes clear that additional cues may be an important factor for extrapolating lexical meaning of a given word. Our results suggest that speakers may be relying more on the structure of the utterance (syllable type, stress), who is producing the utterance (sex), and other factors such as context, when parsing the speech stream. If distinguishing mid and high vowels are a low priority for Media Lengua speakers, who instead rely on the overall phonological shape and context of a word, inferring meaning based solely on a mid and high vowel contrast may only be important in limited contexts.
A brief analysis of the recently published descriptive Media Lengua dictionary (Stewart, Prado Ayala, & Gonza Inlago, 2020) shows that while speakers use mid vowels in approximately 71% of the lexicon, there is a high degree of spelling variation (24%) where mid vowels are interchangeable with high vowels. Additionally, only 0.67% of the lexicon were identified as minimal pairs differing by mid and high vowels; all of which were either from different parts of speech or different semantic categories, making their meaning easily identifiable based on context. A context-based rationale is even more justifiable when approximately four times the number of lexical entries in Media Lengua are homonyms. Nevertheless, while the contrastive functional load for mid and high vowels may be low, it is not zero; there are cases where speakers will reject replacing a high vowel for a mid vowel and vice versa e.g., vos '2 nd person pronoun' is rejected as *vus and enseñana 'teach' is rejected as *inseñana or *insiñana, though ensiñana is an accepted variation (and even though the rejected variants of both words were still understandable). Additionally, vowel-sequences (i.e., diphthongs), which are found in approximately 28% of the lexicon, are rarely accepted (2%) as monophthongs in spelling variations and never found with the target vowels in the same acoustic region switched e.g., /ei/ for /ie/. These observations are in line with our results suggesting that formants still play a role in vowel production and lexical interpretation with respect to the corner vowels, even if they are not found to be the most salient cue for high and mid vowel contrasts. This may motivate the maintenance of the small, albeit significant, separation of mid and high vowel categories observed in acoustic space (Figure 2A).
In summary, in most cases, if a monophthong is produced in its expected trisect of acoustic space (front, back, low), little more is needed in a vowel slot to extrapolate meaning, as long as the rest of the pre-and post-sequences are recognizable in phonological shape and the context is unambiguous.

Conclusion
Our study has compared the role of a range of acoustic and categorical variables in relation to Media Lengua vowel phonemes using two different statistical methods, factor analysis for mixed data (FAMD) and linear mixed effects regression modelling (MEM). The aggregated results indicate that the largest influence on variation is speaker sex, and related acoustic correlates including pitch (a well-known correlate) and F3 (less wellknown but documented in the literature), with syllable type and, to some extent, stress also being major contributing factors. We interpret these results to indicate that Media Lengua speakers rely on a variety of contextual cues beyond the first two vowel formants including syllable structure and prosodic elements, or social factors such as speaker sex when differentiating mid versus high vowels. This interpretation is supported by prior research on Media Lengua perception (Stewart, 2018b) and spelling variation (Stewart, Prado Ayala, & Gonza Inlago, 2020), which show a high degree of overlap between mid and high vowels. However, an important role for F1 and F2 is still maintained amidst these other cues, also confirmed via recent work on Media Lengua diphthong production (Onosson & Stewart, in press). The implications of the results from our study point the way towards future research explicitly focused on areas such as: the role of F3 and its relation to speaker sex, in Media Lengua or indeed other languages as it is certainly underexplored relative to F1 and F2; syllable-and stress-patterning in Media Lengua; and the impact of speech style, a lesser-contributing factor in our analysis but nonetheless one with some apparent effect on production.