<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<!--<?xml-stylesheet type="text/xsl" href="article.xsl"?>-->
<article article-type="research-article" dtd-version="1.1" xml:lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id journal-id-type="issn">1868-6354</journal-id>
<journal-title-group>
<journal-title>Laboratory Phonology: Journal of the Association for Laboratory Phonology</journal-title>
</journal-title-group>
<issn pub-type="epub">1868-6354</issn>
<publisher>
<publisher-name>Ubiquity Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.5334/labphon.196</article-id>
<article-categories>
<subj-group>
<subject>Journal article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>The &#916;F method of vocal tract length normalization for vowels</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-2569-8121</contrib-id>
<name>
<surname>Johnson</surname>
<given-names>Keith</given-names>
</name>
<email>keithjohnson@berkeley.edu</email>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
</contrib-group>
<aff id="aff-1"><label>1</label>Department of Linguistics, University of California, Berkeley, CA, US</aff>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2020-07-22">
<day>22</day>
<month>07</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>11</volume>
<issue>1</issue>
<elocation-id>10</elocation-id>
<history>
<date date-type="received" iso-8601-date="2019-03-12">
<day>12</day>
<month>03</month>
<year>2019</year>
</date>
<date date-type="accepted" iso-8601-date="2020-01-20">
<day>20</day>
<month>01</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright: &#x00A9; 2020 The Author(s)</copyright-statement>
<copyright-year>2020</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See <uri xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</uri>.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.journal-labphon.org/articles/10.5334/labphon.196/"/>
<abstract>
<p>Given the acoustic consequences of physiological differences between talkers, there is a practical need for effective and theoretically motivated procedures of vowel normalization to facilitate comparison of speech produced by people who differ by dialect or language. In addition, there is a question whether listeners might utilize a normalization procedure during speech perception. This paper reports the results of two studies that explore these questions&#8212;with particular focus on vocal tract length normalization. Drawing on research in speech engineering, where accurate estimates of vocal tract length are needed in some approaches to automatic speech recognition and speaker verification, a new model of vowel normalization is introduced. The model uses a direct measure of average formant spacing (the &#916;F) which can be used to measure vocal tract length. The acoustic consequences of vocal tract length differences are removed from vowel measurements by scaling vowel formant measurements by &#916;F. Study 1 found that this method is comparable to Nearey&#8217;s (<xref ref-type="bibr" rid="B25">1978</xref>) uniform normalization method, while providing an explicit vocal tract length interpretation, and a rationalized unit of measure. Study 2 found that uniform normalization measures (which let each formant serve as a noisy estimator of &#916;F) improve vowel classification even with only a couple of randomly selected vowel tokens. This suggests that vocal tract length normalization could be involved in speech perception.</p>
</abstract>
<kwd-group>
<kwd>Speech Perception</kwd>
<kwd>Talker Normalization</kwd>
<kwd>Vowel Normalization</kwd>
<kwd>Vocal Tract Length</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec>
<title>1. Introduction</title>
<p>The acoustic properties of speech are shaped in large part by the movements of the upper vocal tract and the vibrations and turbulence set up by the flow of air through the larynx and vocal tract constrictions. In addition to the controlled movements of the upper vocal tract, speech acoustics are determined by the overall size of the vocal tract and larynx.</p>
<p>Vowels differ primarily in the two lowest resonances of the vocal tract&#8212;the first and second &#8216;formants&#8217; (F1 and F2). For instance, the vowel [i] has a low F1 and a high F2, while [a] has a high F1 and a low F2. The acoustic vowel space that is defined by a talker&#8217;s range of F1 and F2 values is shifted up or down in average F1 and F2 frequency as a function of vocal tract length. Longer vocal tracts have lower average resonant frequencies. This vocal tract length effect creates a practical challenge for automatic speech recognition, and for the phonetic comparison of speakers, dialects, and languages, as well as a &#8216;normalization&#8217; problem for listeners who must recognize vowels produced by a range of different talkers.</p>
<p>This paper will introduce the &#916;F method of vocal tract length normalization. In two studies, the paper will evaluate &#916;F vowel normalization as a practical technique for acoustic analysis of language data, and also gauge the feasibility of vocal tract length normalization as a perceptual mechanism.</p>
<sec>
<title>1.1. Vocal tract length</title>
<p>According to the acoustic theory of speech production (<xref ref-type="bibr" rid="B8">Fant, 1960</xref>) it is theoretically possible to remove vocal tract length differences from our descriptions of speech acoustics (<xref ref-type="bibr" rid="B28">Nordstr&#246;m &amp; Lindblom, 1975</xref>). Once we have measurements of the vocal tract resonant frequencies (F1-F4) we can use the talker-specific vowel formant distributions to estimate normalization factors. Several normalization methods have been proposed, with more or less reference to acoustic theory (<xref ref-type="bibr" rid="B10">Gerstman, 1968</xref>; <xref ref-type="bibr" rid="B24">Lobanov, 1971</xref>; <xref ref-type="bibr" rid="B28">Nordstr&#246;m &amp; Lindblom, 1975</xref>; <xref ref-type="bibr" rid="B25">Nearey, 1978</xref>; <xref ref-type="bibr" rid="B36">Watt &amp; Fabricius, 2002</xref>). The aim of acoustic vowel normalization is to make it possible to express the formant frequencies using a measurement scale that is independent of talker differences, for use in comparing dialects and languages with each other. Vowel normalization may also be a part of the cognitive process of speech perception.</p>
<p>Whether listeners make use of vocal tract length in speech perception is an open question (<xref ref-type="bibr" rid="B15">Johnson, 1997</xref>). It has been argued that perceptual compensation for talker differences must be more than simply a vocal tract length normalization, because acoustic differences between men and women (who tend to differ in vocal tract length) are language-specific; male/female differences depend in part on the language or dialect that they speak (<xref ref-type="bibr" rid="B16">Johnson, 2005</xref>). This language-specificity of gender differences suggests that there is more involved than just vocal tract length difference. Instead, there appears to be a performative aspect of gender that is overlaid on physical sex differences in vocal tract length. Secondly, it has been observed that perceptual talker normalization effects are conditioned to some degree on higher-level factors that are given by prior phonetic context (<xref ref-type="bibr" rid="B21">Ladefoged &amp; Broadbent, 1957</xref>; <xref ref-type="bibr" rid="B14">Johnson, 1990</xref>), visual context (<xref ref-type="bibr" rid="B32">Strand &amp; Johnson, 1996</xref>), or even experimenter suggestion (<xref ref-type="bibr" rid="B19">Johnson, Strand, &amp; D&#8217;Imperio, 1999</xref>).</p>
<p>Despite these observations about speech perception, there are a couple of reasons to suppose that vocal tract length normalization might be a component of the speech perception process. First, the perception of a conspecific individual&#8217;s body size is important for social organization in many species (e.g., <xref ref-type="bibr" rid="B11">Harrington &amp; Mech, 1979</xref>), and there is evidence that vocalizations are used to convey individual characteristics such as size (<xref ref-type="bibr" rid="B31">Reby &amp; McComb, 2003</xref>). This suggests that the perception of vocal tract length may be evolutionarily prior to linguistic communication, so language processing may be overlaid on a cognitive structure that already included vocal tract length perception. Second, there are regions of the brain involved in talker perception that do not overlap with regions involved in speech perception (e.g., <xref ref-type="bibr" rid="B34">Van Lanker, Kreiman, &amp; Cummings, 1989</xref>; <xref ref-type="bibr" rid="B18">Johnson &amp; Sjerps, to appear</xref>), suggesting that the perceptual system may include processes that compute talker information that can be mixed with phonetic information in a stream that produces &#8216;talker neutral&#8217; phonetic information.</p>
</sec>
<sec>
<title>1.2. Extrinsic normalization</title>
<p>One reason to believe that vocal tract length normalization is not a viable mechanism for speech perception is that, as usually formulated, it relies on information that is extrinsic to the vowel. That is, it is usually assumed that the estimation of vocal tract length is computed over a collection of vowel tokens&#8212;information beyond what is available intrinsically in the vowel to be classified. Common experience and controlled experimentation confirm that isolated vowels are accurately recognized (<xref ref-type="bibr" rid="B26">Nearey, 1989</xref>; <xref ref-type="bibr" rid="B33">Strange, Jenkins, &amp; Johnson, 1983</xref>). This suggests that each vowel contains the information that is needed for its own recognition, and this intrinsic information is usually sufficient for vowel recognition.</p>
<p>Lammert and Narayanan&#8217;s (<xref ref-type="bibr" rid="B22">2015</xref>) finding is relevant for this discussion. Building on earlier work in speech engineering (<xref ref-type="bibr" rid="B29">Paige &amp; Zue, 1970</xref>; <xref ref-type="bibr" rid="B20">Kirlin, 1978</xref>; <xref ref-type="bibr" rid="B35">Wakita, 1977</xref>), they devised a regression method using the frequencies of formants F1-F4 to estimate vocal tract length over short stretches of speech. Vocal tract length normalization using four formants F1-F4 should be less reliant on the specific vowel tokens that are used to calculate normalization parameters than methods like <italic>z</italic>-score normalization (<xref ref-type="bibr" rid="B24">Lobanov, 1971</xref>) that normalize F1 (for example) on the basis of information about the distribution of F1 in a set of vowel measurements. Obviously, you cannot accurately estimate the distribution of a formant&#8217;s frequencies from a single token, so the Lobanov normalization method cannot be a model of how listeners identify (and normalize) isolated vowels. On the other hand, in vocal tract length normalization, information from F1-F4 determines a normalization scale factor, so each vowel contains a formant pattern within which to evaluate (and normalize) the formants.</p>
</sec>
<sec>
<title>1.3. Descriptive normalization</title>
<p>Regardless of whether vocal tract length normalization is done in speech perception, descriptive studies of language phonetic systems rely on vowel normalization algorithms to compare speech produced by different talkers or groups of talkers (<xref ref-type="bibr" rid="B5">Disner, 1980</xref>; <xref ref-type="bibr" rid="B1">Adank, Smits, &amp; van Hout, 2004</xref>). Methods used in these studies are sometimes based on general-purpose statistical normalization techniques such as range normalization (<xref ref-type="bibr" rid="B10">Gerstman, 1968</xref>), or <italic>z</italic>-score normalization (<xref ref-type="bibr" rid="B24">Lobanov, 1971</xref>), but more specialized methods using what could be called &#8216;mean normalization&#8217; (the ratio of x to the mean of x) are also widely used (<xref ref-type="bibr" rid="B25">Nearey, 1978</xref>; <xref ref-type="bibr" rid="B36">Watt &amp; Fabricius, 2002</xref>).</p>
<p>In vocal tract length normalization (<xref ref-type="bibr" rid="B28">Nordstr&#246;m &amp; Lindblom, 1975</xref>), a single normalization scale factor is derived from an estimate of the length of the speaker&#8217;s vocal tract. An acoustic factor related to vocal tract length is &#981;, the fundamental frequency of vocal tract resonances in an unconstricted vocal tract (<xref ref-type="bibr" rid="B29">Paige &amp; Zue, 1970</xref>; <xref ref-type="bibr" rid="B20">Kirlin, 1978</xref>; <xref ref-type="bibr" rid="B22">Lammert &amp; Narayanan, 2015</xref>). The factor &#981; is equal to F1 and the other formants are on odd harmonics of this value: F2 = 3&#981;, F3 = 5&#981;, F4 = 7&#981;. &#916;F is simply 2&#981;, and is an estimate of formant spacing in an unconstricted vocal tract (<xref ref-type="bibr" rid="B31">Reby &amp; McComb, 2003</xref>), where F1 = &#189;*&#916;F; F2 = 1&#189;*&#916;F; F3 = 2&#189;*&#916;F, etc. Because a single normalization scale factor is used to scale the formants (rather than a separate factor for each formant), this method is called a &#8216;uniform&#8217; normalization method.</p>
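<p>The relations among &#981;, &#916;F, and the formants of an unconstricted vocal tract can be checked numerically. The following is a minimal Python sketch, not code from the studies reported here. It assumes a uniform tube closed at one end, so that &#981; = c/(4 &#215; length), and a speed of sound of 35000 cm/s, the value implied by the 17.5 cm, 1000 Hz example in Figure 1. The function name is hypothetical.</p>
<preformat>
```python
# Sketch: resonances of an unconstricted vocal tract modeled as a uniform tube.
C_CM_PER_S = 35000.0  # ASSUMPTION: speed of sound implied by Figure 1


def uniform_tube_formants(vtl_cm, n_formants=4, c=C_CM_PER_S):
    """Return (phi, delta_f, formants) for a uniform tube of length vtl_cm."""
    phi = c / (4.0 * vtl_cm)   # lowest resonance, equal to F1
    delta_f = 2.0 * phi        # spacing between adjacent formants
    # Formants fall on the odd multiples of phi: F1 = phi, F2 = 3*phi, ...
    formants = [(2 * i - 1) * phi for i in range(1, n_formants + 1)]
    return phi, delta_f, formants


phi, delta_f, formants = uniform_tube_formants(17.5)
print(phi, delta_f, formants)
# 500.0 1000.0 [500.0, 1500.0, 2500.0, 3500.0]
```
</preformat>
<p>Dividing each of these formants by &#916;F gives 0.5, 1.5, 2.5, and 3.5 regardless of the tube length, which is the sense in which a single scale factor removes length from the description.</p>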
<p>Despite its basis in the acoustic theory of speech production and the interpretation of the normalization scale factor in terms of a physical characteristic of the talker, vocal tract length normalization was rejected almost as soon as Nordstr&#246;m and Lindblom (<xref ref-type="bibr" rid="B28">1975</xref>) proposed it, because it did not seem to work very well. However, there has been significant progress in deriving more accurate vocal tract length estimates from the acoustic vowel spectrum, so a reconsideration of vocal tract length normalization is due.</p>
</sec>
<sec>
<title>1.4. Organization</title>
<p>This paper will introduce the &#916;F method of vocal tract length normalization in Section 2. Section 3 will describe the methods used in two studies of vowel normalization techniques. The results of the studies are presented in Section 4, and the paper concludes with recommendations in Section 5. The aims of the paper are to evaluate &#916;F vowel normalization as a practical technique for acoustic analysis of language data, and also to gauge the feasibility of vocal tract length normalization as a perceptual mechanism.</p>
</sec>
</sec>
<sec sec-type="methods">
<title>2. Methods for vocal tract length normalization</title>
<p>The earliest use of vocal tract length as a normalization factor was by Nordstr&#246;m and Lindblom (<xref ref-type="bibr" rid="B28">1975</xref>). They calculated the average third formant (F3) frequency in open vowels (where the frequency of F3 is easily distinguished from F2) and used this to estimate the talker&#8217;s vocal tract length. Vocal tract length was then used to scale a talker&#8217;s vowel formant measurements onto a &#8216;standard&#8217; (i.e., male) vocal tract. We can avoid the male-centric bias of this approach by using a talker-independent measurement scale&#8212;an estimate of formant spacing in an unconstricted vocal tract, the &#916;F.</p>
<p>Speech engineers have devised measures of vocal tract length from vowel acoustics (<xref ref-type="bibr" rid="B29">Paige &amp; Zue, 1970</xref>; <xref ref-type="bibr" rid="B35">Wakita, 1977</xref>; <xref ref-type="bibr" rid="B20">Kirlin, 1978</xref>) for use in automatic speech recognition and speaker identification (<xref ref-type="bibr" rid="B6">Eide &amp; Gish, 1996</xref>; <xref ref-type="bibr" rid="B23">Lee &amp; Rose, 1998</xref>). For example, Kirlin (<xref ref-type="bibr" rid="B20">1978</xref>) used information from the four lowest formants and estimates of formant variances to weight the contributions of formants. He treated &#8220;each formant as a noisy estimator&#8221; of vocal tract length, so in his calculation, because F2 has large variance, it contributes less to the vocal tract length estimate. Lammert and Narayanan (<xref ref-type="bibr" rid="B22">2015</xref>) published an important study in this line. They compared the vocal tract length predicted from acoustic measures to estimates of vocal tract length measured from magnetic resonance imaging (MRI) of the vocal tract, as well as computer-simulated vocal tracts of known length. They found a family of regression formulas that predict vocal tract length from the first four formant frequencies of vowels, weighting formants as Kirlin did but finding the weights by regression to known vocal tract length.</p>
<p>Following the analysis outlined by Lammert and Narayanan (<xref ref-type="bibr" rid="B22">2015</xref>) and building on the line-fitting approach of Reby and McComb (<xref ref-type="bibr" rid="B31">2003</xref>; see Figure <xref ref-type="fig" rid="F1">1</xref>), a talker&#8217;s &#916;F is directly estimated by scaling formant frequencies by their formant number (F1, F2, F3, F4, etc.) as in formula (1). Note that each formant (F1-F4) of each vowel provides an estimate of &#916;F. The sum in (1) can be taken over all of the vowels for a talker that are available in a dataset so missing values have a limited impact on the estimate. It is important to note that even though F1 and F2 vary substantially for different vowels, averaged over a representative dataset they approximate the frequencies of an unconstricted vocal tract&#8212;a uniform tube. Section 3 will compare the direct estimation of &#916;F using the average formant spacing method in (1) with the optimized empirical method presented by Lammert and Narayanan (<xref ref-type="bibr" rid="B22">2015</xref>). Kirlin&#8217;s (<xref ref-type="bibr" rid="B20">1978</xref>) a posteriori method was also implemented and results from this method of vocal tract length estimation will also be reported. Once we have calculated &#916;F, then the talker&#8217;s vocal tract length can be calculated from &#916;F by formula (2), and the normalized vowel formants are calculated using &#916;F as in (3).</p>
<fig id="F1">
<label>Figure 1</label>
<caption>
<p>An illustration of Reby and McComb&#8217;s (<xref ref-type="bibr" rid="B31">2003</xref>) direct estimation approach for finding &#916;F from formant measurements. The average spacing between vowel formants (&#916;F) is the slope of the line that relates formant number to formant frequency. The figure shows the formant frequencies of a vocal tract that is 17.5 cm long with no constrictions (a uniform tube). &#916;F is 1000 Hz, and F1 is 500 Hz, F2 is 1500 Hz, etc.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="/article/id/6268/file/75689/"/>
</fig>
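<p>One way to read the line in Figure 1 is as a least-squares fit of formant frequency on (<italic>i</italic> &#8722; 0.5) with the line forced through the origin. The Python sketch below takes that reading, which is an assumption here rather than a documented detail of the published line fit, and contrasts the resulting slope with the mean of the per-formant ratios. The function names and the second set of formant values are hypothetical.</p>
<preformat>
```python
# Sketch: two ways of turning one set of formants into a spacing estimate.
def slope_through_origin(formants):
    """Least-squares slope of frequency on (i - 0.5), intercept fixed at zero."""
    xs = [i - 0.5 for i in range(1, len(formants) + 1)]
    return sum(x * f for x, f in zip(xs, formants)) / sum(x * x for x in xs)


def mean_of_ratios(formants):
    """Average of F_i / (i - 0.5); every formant gets the same weight."""
    xs = [i - 0.5 for i in range(1, len(formants) + 1)]
    return sum(f / x for x, f in zip(xs, formants)) / len(formants)


uniform_tube = [500.0, 1500.0, 2500.0, 3500.0]  # the tube in Figure 1
front_vowel = [300.0, 2300.0, 3000.0, 3700.0]   # made-up, [i]-like values

print(slope_through_origin(uniform_tube), mean_of_ratios(uniform_tube))
# 1000.0 1000.0
print(slope_through_origin(front_vowel), mean_of_ratios(front_vowel))
```
</preformat>
<p>The two estimates coincide for the uniform tube and diverge for a single constricted vowel, because the least-squares slope gives more weight to the higher formants while the mean of ratios weights all formants equally.</p>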
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(1)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><inline-formula>
<alternatives>
<mml:math id="Eq001-mml"><mml:mrow><mml:mtext mathvariant="italic">&#x0394;F</mml:mtext><mml:mo>&#x00A0;</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext mathvariant="italic">mn</mml:mtext></mml:mrow></mml:mfrac><mml:msubsup><mml:mo>&#x2211;</mml:mo><mml:mi>j</mml:mi><mml:mi>m</mml:mi></mml:msubsup><mml:mo>&#x00A0;</mml:mo><mml:msubsup><mml:mo>&#x2211;</mml:mo><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:msubsup><mml:mo>&#x00A0;</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>0.5</mml:mn></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math>
<tex-math id="M1">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\Delta F = \frac{1}{{mn}}\mathop \sum \nolimits_j^m \sum \nolimits_i^n \left[ {\frac{{{F_{ij}}}}{{i - 0.5}}} \right],
\]
\end{document}
</tex-math>
<graphic xlink:href="/article/id/6268/file/75690/"/>
</alternatives>
</inline-formula> where <italic>i</italic> = formant number (1&#8230;4), and <italic>j</italic> is token number</p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(2)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><inline-formula>
<alternatives>
<mml:math id="Eq002-mml"><mml:mrow><mml:mtext mathvariant="italic">VTL</mml:mtext><mml:mo>=</mml:mo><mml:mi>c</mml:mi><mml:mtext>/</mml:mtext><mml:mn>2</mml:mn><mml:mtext mathvariant="italic">&#x0394;F</mml:mtext></mml:mrow></mml:math>
<tex-math id="M2">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
VTL = c{\rm{/}}2\Delta F
\]
\end{document}
</tex-math>
<graphic xlink:href="/article/id/6268/file/75691/"/>
</alternatives>
</inline-formula>, where <italic>c</italic> = 35000 cm/s, the speed of sound in warm, moist air</p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(3)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><inline-formula>
<alternatives>
<mml:math id="Eq003-mml"><mml:mrow><mml:msub><mml:msup><mml:mi>F</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup><mml:mrow><mml:mtext mathvariant="italic">ij</mml:mtext></mml:mrow></mml:msub><mml:mo>&#x00A0;</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mtext mathvariant="italic">ij</mml:mtext></mml:mrow></mml:msub><mml:mtext>/</mml:mtext><mml:mtext mathvariant="italic">&#x0394;F</mml:mtext></mml:mrow></mml:math>
<tex-math id="M3">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
{F^\prime_{ij}} = {F_{ij}}{\rm{/}}\Delta F
\]
\end{document}
</tex-math>
<graphic xlink:href="/article/id/6268/file/75692/"/>
</alternatives>
</inline-formula>, normalized formant frequencies</p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
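<p>Formulas (1) through (3) can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the studies. Missing formant measurements are represented by <monospace>None</monospace> and dropped from the average, which is one possible reading of how the sum handles missing values. The speed of sound is taken to be 35000 cm/s, the value implied by Figure 1, and the formant values are made up.</p>
<preformat>
```python
# Sketch of formulas (1)-(3) for one talker.
C_CM_PER_S = 35000.0  # ASSUMPTION: speed of sound implied by Figure 1


def delta_f(tokens):
    """Formula (1): mean of F_ij / (i - 0.5) over all available formants."""
    estimates = [f / (i - 0.5)
                 for token in tokens
                 for i, f in enumerate(token, start=1)
                 if f is not None]          # skip missing measurements
    return sum(estimates) / len(estimates)


def vocal_tract_length(df, c=C_CM_PER_S):
    """Formula (2): VTL = c / (2 * delta-F), in cm."""
    return c / (2.0 * df)


def normalize(tokens, df):
    """Formula (3): express every formant in units of delta-F."""
    return [[None if f is None else f / df for f in token] for token in tokens]


# Hypothetical talker: three vowel tokens, F1-F4 in Hz, one F4 missing.
tokens = [
    [300.0, 2300.0, 3000.0, 3700.0],
    [750.0, 1200.0, 2500.0, None],
    [350.0, 900.0, 2400.0, 3400.0],
]
df = delta_f(tokens)
print(round(df, 1), round(vocal_tract_length(df), 2))
print(normalize(tokens, df)[0])
```
</preformat>
<p>Because each formant of each token contributes its own estimate, the function also runs on a single token, which is the property that matters for the perceptual question raised in Section 1.2.</p>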
<p>This average formant spacing method of estimating vocal tract length (VTL) is essentially perfectly correlated with Lammert and Narayanan&#8217;s (<xref ref-type="bibr" rid="B22">2015</xref>; henceforth L&amp;N) estimated VTL on the Hillenbrand, Getty, Clark, and Wheeler (<xref ref-type="bibr" rid="B12">1995</xref>) dataset (<italic>r</italic><sup>2</sup> = 0.998, Figure <xref ref-type="fig" rid="F2">2</xref> below), while the Nordstr&#246;m and Lindblom (<xref ref-type="bibr" rid="B28">1975</xref>) estimate, which was based on F3 alone, is much less strongly correlated with L&amp;N (<italic>r</italic><sup>2</sup> = 0.829). Kirlin&#8217;s (<xref ref-type="bibr" rid="B20">1978</xref>) method was also closely correlated with the L&amp;N method (<italic>r</italic><sup>2</sup> = 0.963). The strong correlation between the average spacing (formula 1) and L&amp;N methods indicates that with a full sample of vowels for a talker, the average spacing method is a valid measure of apparent vocal tract length. The key difference between L&amp;N&#8217;s data-driven approach and the average spacing approach is that the average spacing approach assumes that the formant values will, on average, be equally spaced (as in Figure <xref ref-type="fig" rid="F1">1</xref>). The high correlation between the methods can be seen as a validation of this assumption for a typical phonetic dataset. As seen in Figure <xref ref-type="fig" rid="F2">2</xref>, the length estimated by the average formant spacing method is consistently about 0.8 cm longer than the length estimated by the L&amp;N method.</p>
<fig id="F2">
<label>Figure 2</label>
<caption>
<p>Vocal tract length estimates for the talkers in Hillenbrand et al. (<xref ref-type="bibr" rid="B12">1995</xref>), as calculated using the average spacing formula (4) and the two formulas calculated by Lammert and Narayanan (<xref ref-type="bibr" rid="B22">2015</xref>): formula (8) with no zero intercept, and formula (11), which does have a zero intercept. The dashed line is the identity line, y = x.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="/article/id/6268/file/75693/"/>
</fig>
<p>Formula (3) shows that &#916;F can be used as a unit of measure for vowel formant frequencies, putting vowel formants from all vocal tracts on the same measurement scale. The normalized F1&#8217; value is given as F1/&#916;F and is expected to be equal to 0.5 for a uniform tube of any length, the normalized F2&#8217; is F2/&#916;F and is expected to be equal to 1.5 for any uniform tube, and so on. This is <italic>&#916;F vocal tract length normalization</italic>, where &#916;F can be estimated in several different ways&#8212;from F3 alone (<xref ref-type="bibr" rid="B28">Nordstr&#246;m &amp; Lindblom, 1975</xref>), using the average formant spacing approach (formula 1), or using Lammert and Narayanan&#8217;s (<xref ref-type="bibr" rid="B22">2015</xref>) regression fits, or Kirlin&#8217;s (<xref ref-type="bibr" rid="B20">1978</xref>) weight by variance method. This use of &#916;F as a vowel normalization factor, which has not been proposed before, is the main practical contribution of this paper.</p>
</sec>
<sec sec-type="methods">
<title>3. Methods</title>
<sec>
<title>3.1. Data sets</title>
<p>The data analyzed in this paper are the published American English vowel production data from Peterson and Barney (<xref ref-type="bibr" rid="B30">1952</xref>) and Hillenbrand et al. (<xref ref-type="bibr" rid="B12">1995</xref>), as distributed by Santiago Barreda in his &#8216;phonTools&#8217; package for the R statistical programming language.</p>
</sec>
<sec>
<title>3.2. Normalization formulas</title>
<p>The data were analyzed in the R statistical programming language, and the normalization algorithms were implemented as illustrated in Table <xref ref-type="table" rid="T1">1</xref>. As the variable names in the last four rows of the table make clear, the normalizing factor &#916;F can be calculated by any method that provides an estimated vocal tract length from the acoustic vowel formants. The Watt &amp; Fabricius method prescribes taking the mean of formants from particular judiciously selected vowel qualities to ensure that the center of the talker&#8217;s acoustic vowel space is adequately captured, to provide a cross-linguistically consistent scale. The implementation here used the mean of the entire sample as the estimate of the center of the acoustic vowel space. This works just as well for vowel classification within a language, and avoids subjectivity or other mistakes in the selection of the exemplary vowel tokens.</p>
<table-wrap id="T1">
<label>Table 1</label>
<caption>
<p>Some details, in R, of how the normalization algorithms were implemented. In these code snippets &#8216;f1,&#8217; &#8216;f2,&#8217; &#8216;f3&#8217; are arrays of vowel formant measurements for a particular talker. The first three rows are non-uniform methods and the last five rows are uniform methods.</p>
</caption>
<table>
<tr>
<th align="left">Method</th>
<th align="left">R code</th>
</tr>
<tr>
<td colspan="2"><hr/></td>
</tr>
<tr>
<td align="left">Lobanov, <italic>z</italic>-score</td>
<td align="left">F1&#8217; = scale(f1);</td>
</tr>
<tr>
<td align="left">Watt &amp; Fabricius</td>
<td align="left">F1&#8217; = f1/mean(f1);</td>
</tr>
<tr>
<td align="left">Nearey 1, non-uniform</td>
<td align="left">F1&#8217; = exp(log(f1) &#8211; mean(log(f1)));</td>
</tr>
<tr>
<td align="left">Nearey 2, uniform</td>
<td align="left">mf = mean(c(log(f1),log(f2),log(f3)));<break/>F1&#8217; = exp(log(f1)-mf);</td>
</tr>
<tr>
<td align="left">Nordstr&#246;m &amp; Lindblom</td>
<td align="left">&#916;F = mean(f3[f1&gt;600]/2.5);<break/>F1&#8217; = f1/&#916;F;</td>
</tr>
<tr>
<td align="left">Kirlin</td>
<td align="left">x = mean(f1/146^2) + 3*mean(f2/485^2) + 5*mean(f3/322^2) + mean(f1)/40^2;<break/>&#916;F = 2*(x*1051.2);<break/>F1&#8217; = f1/&#916;F;</td>
</tr>
<tr>
<td align="left">Lammert &amp; Narayanan</td>
<td align="left">&#916;F = 2*(262 + mean(0.14*f1) + mean((0.16*f2)/3) + mean((0.25*f3)/5));<break/>F1&#8217; = f1/&#916;F;</td>
</tr>
<tr>
<td align="left">average spacing &#916;F</td>
<td align="left">&#916;F = mean(c(f1/0.5, f2/1.5, f3/2.5));<break/>F1&#8217; = f1/&#916;F;</td>
</tr>
</table>
</table-wrap>
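<p>For readers who do not work in R, several rows of Table 1 can be ported to Python using only the standard library. The sketch below is such a port under stated assumptions and is not the code used in the studies. The function names are hypothetical, the formant values are made up, and the Lobanov row uses the sample standard deviation (n &#8722; 1) to match the default behavior of R&#8217;s <monospace>scale()</monospace>.</p>
<preformat>
```python
import math
import statistics


def lobanov(f):
    """z-score one formant; statistics.stdev divides by n - 1, like scale()."""
    m, s = statistics.mean(f), statistics.stdev(f)
    return [(x - m) / s for x in f]


def nearey1(f):
    """Non-uniform: scale one formant by its own geometric mean."""
    mlog = statistics.mean([math.log(x) for x in f])
    return [math.exp(math.log(x) - mlog) for x in f]


def nearey2(f1, f2, f3):
    """Uniform: one log-mean pooled over F1-F3 scales all three formants."""
    mf = statistics.mean([math.log(x) for x in f1 + f2 + f3])
    return [[math.exp(math.log(x) - mf) for x in f] for f in (f1, f2, f3)]


def delta_f_nordstrom_lindblom(f1, f3):
    """F3 of open vowels (F1 above 600 Hz), read as 2.5 times delta-F."""
    return statistics.mean([c / 2.5 for a, c in zip(f1, f3) if a > 600])


def delta_f_lammert_narayanan(f1, f2, f3):
    """Three-formant regression with intercept, coefficients as in Table 1."""
    m = statistics.mean
    return 2 * (262 + 0.14 * m(f1) + 0.16 * m(f2) / 3 + 0.25 * m(f3) / 5)


def delta_f_average_spacing(f1, f2, f3):
    """Mean of F1/0.5, F2/1.5, and F3/2.5 pooled over all tokens."""
    pooled = ([x / 0.5 for x in f1] + [x / 1.5 for x in f2]
              + [x / 2.5 for x in f3])
    return statistics.mean(pooled)


# Made-up measurements (Hz) for one hypothetical talker, four tokens.
f1 = [300.0, 750.0, 350.0, 650.0]
f2 = [2300.0, 1200.0, 900.0, 1700.0]
f3 = [3000.0, 2500.0, 2400.0, 2600.0]

for estimate in (delta_f_nordstrom_lindblom(f1, f3),
                 delta_f_lammert_narayanan(f1, f2, f3),
                 delta_f_average_spacing(f1, f2, f3)):
    print(round(estimate, 1), [round(x / estimate, 3) for x in f1])
```
</preformat>
<p>The three &#916;F functions return a single number per talker, so the final normalization step, division by &#916;F, is identical whichever estimator is chosen.</p>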
</sec>
<sec>
<title>3.3. Vocal tract length coefficients</title>
<p>The regression coefficients used in this paper for the Lammert &amp; Narayanan (<xref ref-type="bibr" rid="B22">2015</xref>) method are different from the ones they published because the datasets used here only include the first three formants, while L&amp;N used F1&#8211;4 to estimate vocal tract lengths. Dr. Lammert was kind enough to provide two sets of coefficients for regressions fitted from F1&#8211;3 for simulated data. Without an intercept (&#946;<sub>1</sub> = 0.28, &#946;<sub>2</sub> = 0.31, &#946;<sub>3</sub> = 0.47) the RMS error in estimated vocal tract length is 1.91 cm. A regression formula that includes an intercept term (&#946;<sub>0</sub> = 262, &#946;<sub>1</sub> = 0.14, &#946;<sub>2</sub> = 0.16, &#946;<sub>3</sub> = 0.25) leads to a smaller error of estimated vocal tract length (1.22 cm).<xref ref-type="fn" rid="n1">1</xref> The classification studies in the next section used the second formula, the one with an intercept term. The &#916;F estimates of the intercept formula tend to be regulated, drawn away from extremely short or long estimates, by the &#946;<sub>0</sub> constant, as further discussed below.<xref ref-type="fn" rid="n2">2</xref></p>
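<p>The effect of the intercept term can be illustrated by applying both coefficient sets to uniform tubes of known length. The Python sketch below does this under several assumptions: the no-intercept coefficients are applied in the same form as the with-intercept row of Table 1, the speed of sound is 35000 cm/s (the value implied by Figure 1), and the three tube lengths are arbitrary illustrative choices. The function names are hypothetical.</p>
<preformat>
```python
# Sketch: three-formant L&amp;N-style estimates with and without an intercept.
C_CM_PER_S = 35000.0  # ASSUMPTION: speed of sound implied by Figure 1

NO_INTERCEPT = (0.0, 0.28, 0.31, 0.47)      # beta0 fixed at zero
WITH_INTERCEPT = (262.0, 0.14, 0.16, 0.25)  # includes beta0 = 262


def phi_estimate(f1, f2, f3, betas):
    """Weighted sum of F1, F2/3, F3/5 plus intercept; delta-F is twice this."""
    b0, b1, b2, b3 = betas
    return b0 + b1 * f1 + b2 * f2 / 3 + b3 * f3 / 5


def vtl_cm(f1, f2, f3, betas, c=C_CM_PER_S):
    """Vocal tract length from formula (2), using delta-F = 2 * phi."""
    return c / (2 * (2 * phi_estimate(f1, f2, f3, betas)))


for true_vtl in (12.5, 17.5, 21.875):        # arbitrary tube lengths in cm
    phi = C_CM_PER_S / (4 * true_vtl)        # lowest resonance of the tube
    tube = (phi, 3 * phi, 5 * phi)           # F1-F3 of a uniform tube
    print(true_vtl,
          round(vtl_cm(*tube, NO_INTERCEPT), 2),
          round(vtl_cm(*tube, WITH_INTERCEPT), 2))
```
</preformat>
<p>In this sketch the with-intercept estimates span a narrower range of lengths than the no-intercept estimates, which is consistent with the description of the &#946;<sub>0</sub> constant drawing estimates away from the extremes.</p>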
<p>The comparison of the Lammert and Narayanan (<xref ref-type="bibr" rid="B22">2015</xref>) coefficients to the average spacing coefficients, formula (1), is complicated by a slight difference in how they are presented and calculated. Formula (1) is reproduced here as (4) and can be expanded as (5) in the case where we have three formants per vowel. Simplifying (5) into an expression similar to the form used by L&amp;N, we get (7).</p>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(4)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><inline-formula>
<alternatives>
<mml:math id="Eq004-mml"><mml:mrow><mml:mtext mathvariant="italic">&#x0394;F</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext mathvariant="italic">mn</mml:mtext></mml:mrow></mml:mfrac><mml:mtext>&#x2009;</mml:mtext><mml:msubsup><mml:mo>&#x2211;</mml:mo><mml:mi>j</mml:mi><mml:mi>m</mml:mi></mml:msubsup><mml:mo>&#x00A0;</mml:mo><mml:msubsup><mml:mo>&#x2211;</mml:mo><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:msubsup><mml:mo>&#x00A0;</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>0.5</mml:mn></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x00A0;</mml:mo></mml:mrow></mml:math>
<tex-math id="M4">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\Delta F = \frac{1}{{mn}}\;\mathop \sum \nolimits_j^m \sum \nolimits_i^n \left[ {\frac{{{F_{ij}}}}{{i - 0.5}}} \right],
\]
\end{document}
</tex-math>
<graphic xlink:href="/article/id/6268/file/75694/"/>
</alternatives>
</inline-formula> where <italic>i</italic> = formant number</p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(5)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><inline-formula>
<alternatives>
<mml:math id="Eq005-mml"><mml:mrow><mml:mtext mathvariant="italic">&#x0394;F</mml:mtext><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mtext>/</mml:mtext><mml:mn>3</mml:mn><mml:mtext>&#x2009;</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mtext>/</mml:mtext><mml:mn>0.5</mml:mn><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mn>2</mml:mn><mml:mtext>/</mml:mtext><mml:mn>1.5</mml:mn><mml:mo>+</mml:mo><mml:mo>&#x00A0;</mml:mo><mml:mi>F</mml:mi><mml:mn>3</mml:mn><mml:mtext>/</mml:mtext><mml:mn>2.5</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math>
<tex-math id="M5">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\Delta F = 1{\rm{/}}3\;(F1{\rm{/}}0.5 + F2{\rm{/}}1.5 + F3{\rm{/}}2.5)
\]
\end{document}
</tex-math>
<graphic xlink:href="/article/id/6268/file/75695/"/>
</alternatives>
</inline-formula></p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(6)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><inline-formula>
<alternatives>
<mml:math id="Eq006-mml"><mml:mrow><mml:mtext mathvariant="italic">&#x0394;F</mml:mtext><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mtext>/</mml:mtext><mml:mn>3</mml:mn><mml:mtext>&#x2009;</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mn>0.666</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>2</mml:mn><mml:mo>+</mml:mo><mml:mn>0.4</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>3</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math>
<tex-math id="M6">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\Delta F = 1{\rm{/}}3\;(2*F1 + 0.666*F2 + 0.4*F3)
\]
\end{document}
</tex-math>
<graphic xlink:href="/article/id/6268/file/75696/"/>
</alternatives>
</inline-formula></p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(7)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><inline-formula>
<alternatives>
<mml:math id="Eq007-mml"><mml:mrow><mml:mtext mathvariant="italic">&#x0394;F</mml:mtext><mml:mo>=</mml:mo><mml:mn>0.6667</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mn>0.222</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>2</mml:mn><mml:mo>+</mml:mo><mml:mn>0.1333</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>3</mml:mn></mml:mrow></mml:math>
<tex-math id="M7">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\Delta F = 0.6667*F1 + 0.222*F2 + 0.1333*F3
\]
\end{document}
</tex-math>
<graphic xlink:href="/article/id/6268/file/75697/"/>
</alternatives>
</inline-formula></p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
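<p>The step from (5) to (7) can be checked numerically. The following Python sketch is illustrative and is not the code used in this paper; the function names <monospace>average_spacing_weights</monospace> and <monospace>average_spacing_delta_f</monospace> are hypothetical. It computes the per-formant weights 1/(<italic>n</italic>(<italic>i</italic> &#8722; 0.5)) implied by (4) for a single token, and evaluates (4) directly for a list of tokens that all have the same number of formants.</p>
<preformat>
```python
def average_spacing_weights(n_formants=3):
    """Weights 1 / (n * (i - 0.5)) that multiply F_i in formula (4) for one token."""
    return [1.0 / (n_formants * (i - 0.5)) for i in range(1, n_formants + 1)]


def average_spacing_delta_f(tokens):
    """Formula (4): mean of F_i / (i - 0.5) over all formants of all tokens.

    tokens is a list of vowel tokens, each a sequence (F1, F2, ..., Fn) in Hz.
    """
    scaled = [f / (i - 0.5) for token in tokens for i, f in enumerate(token, start=1)]
    return sum(scaled) / len(scaled)


if __name__ == "__main__":
    # Weights for three formants, rounded as in formula (7).
    print([round(w, 4) for w in average_spacing_weights(3)])
    # One token with the formants of a neutral vowel.
    print(average_spacing_delta_f([(500.0, 1500.0, 2500.0)]))
```
</preformat>
<p>For a single token with formants at 500, 1500, and 2500 Hz, every term F<sub>i</sub>/(<italic>i</italic> &#8722; 0.5) equals 1000, so the sketch returns &#916;F = 1000 Hz. The weights round to 0.6667, 0.2222, and 0.1333, matching (7) up to rounding.</p>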
<p>Lammert and Narayanan&#8217;s no-intercept expression for &#916;F with three vowel formant measurements is in (8). Simplifying, we get equation (10).</p>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(8)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><inline-formula>
<alternatives>
<mml:math id="Eq008-mml"><mml:mrow><mml:mtext mathvariant="italic">&#x0394;F</mml:mtext><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>(</mml:mo><mml:mn>0.28</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>0.31</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>2</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mtext>/</mml:mtext><mml:mn>3</mml:mn><mml:mo>+</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>0.47</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>3</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mtext>/</mml:mtext><mml:mn>5</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math>
<tex-math id="M8">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\Delta F = 2(0.28*F1 + (0.31*F2){\rm{/}}3 + (0.47*F3){\rm{/}}5)
\]
\end{document}
</tex-math>
<graphic xlink:href="/article/id/6268/file/75698/"/>
</alternatives>
</inline-formula></p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(9)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><inline-formula>
<alternatives>
<mml:math id="Eq009-mml"><mml:mrow><mml:mtext mathvariant="italic">&#x0394;F</mml:mtext><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>(</mml:mo><mml:mn>0.28</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mn>0.10333</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>2</mml:mn><mml:mo>+</mml:mo><mml:mn>0.094</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>3</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math>
<tex-math id="M9">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\Delta F = 2(0.28*F1 + 0.10333*F2 + 0.094*F3)
\]
\end{document}
</tex-math>
<graphic xlink:href="/article/id/6268/file/75699/"/>
</alternatives>
</inline-formula></p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(10)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><inline-formula>
<alternatives>
<mml:math id="Eq010-mml"><mml:mrow><mml:mtext mathvariant="italic">&#x0394;F</mml:mtext><mml:mo>=</mml:mo><mml:mn>0.56</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mn>0.20666</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>2</mml:mn><mml:mo>+</mml:mo><mml:mn>0.188</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>3</mml:mn></mml:mrow></mml:math>
<tex-math id="M10">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\Delta F = 0.56*F1 + 0.20666*F2 + 0.188*F3
\]
\end{document}
</tex-math>
<graphic xlink:href="/article/id/6268/file/75700/"/>
</alternatives>
</inline-formula></p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<p>Now we can compare the coefficients used in the Lammert and Narayanan (<xref ref-type="bibr" rid="B22">2015</xref>) calculation for three-formant vowels (10) with the average spacing formula (7). The L&amp;N formula has a larger coefficient for F3 (0.188 versus 0.1333) and smaller coefficients for F1 (0.56 versus 0.6667) and F2 (0.20666 versus 0.222) than the average spacing expression. This is likely to lead to better vocal tract length estimates when only one or two vowel tokens are available, because the frequency of F3 is less variable across vowels than the frequencies of F1 and F2. However, as seen in the left panel of Figure <xref ref-type="fig" rid="F2">2</xref>, with a full set of data this method (formula 8) produces vocal tract length estimates that differ from the estimates given by the average-spacing formula (4) by only a constant (probably due to the difference between the effective acoustic length of the vocal tract and its actual length; <xref ref-type="bibr" rid="B17">Johnson, 2011, p. 42</xref>).</p>
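<p>The coefficient comparison can be reproduced with a few lines of Python. This is a sketch for illustration only; <monospace>regression_weights</monospace> and <monospace>average_spacing_weights</monospace> are hypothetical helper names, and the divisors 1, 3, 5 are written as 2<italic>i</italic> &#8722; 1 on the assumption that this is the general pattern behind formulas (8) and (11).</p>
<preformat>
```python
def regression_weights(betas, intercept=0.0):
    """Expand DeltaF = 2 * (b0 + sum of b_i * F_i / (2i - 1)) into a constant
    term and one weight per formant, as in the step from (8) to (10)."""
    weights = [2.0 * b / (2 * i - 1) for i, b in enumerate(betas, start=1)]
    return 2.0 * intercept, weights


def average_spacing_weights(n_formants=3):
    """Per-formant weights of the average-spacing formula, as in (7)."""
    return [1.0 / (n_formants * (i - 0.5)) for i in range(1, n_formants + 1)]


if __name__ == "__main__":
    # No-intercept coefficients quoted in the text.
    _, ln_weights = regression_weights([0.28, 0.31, 0.47])
    pairs = zip(ln_weights, average_spacing_weights())
    for i, (ln, avg) in enumerate(pairs, start=1):
        print(f"F{i}: regression {ln:.4f}, average spacing {avg:.4f}, "
              f"difference {ln - avg:+.4f}")
```
</preformat>
<p>Calling <monospace>regression_weights([0.14, 0.16, 0.25], intercept=262)</monospace> gives the constant 524 and the weights of the intercept formula in the same way.</p>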
<p>Lammert and Narayanan (<xref ref-type="bibr" rid="B22">2015</xref>) also fit a formula for estimating vocal tract length that includes an intercept term. By regularizing the estimated vocal tract length, this provides a slightly more accurate measure of vocal tract length when only a small set of vowel tokens is available. This is illustrated in the right panel of Figure <xref ref-type="fig" rid="F2">2</xref>, which compares the Lammert and Narayanan VTL estimates (formula 2), computed with a regression formula that has an intercept term (11), to the VTL estimates produced by average spacing (4) for the Hillenbrand et al. data. The main thing to notice is that the range of VTL estimates for formula (11) is smaller (min = 13.6 cm, max = 16.75 cm) than the range given by the non-regularized formula (8).</p>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(11)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><inline-formula>
<alternatives>
<mml:math id="Eq011-mml"><mml:mrow><mml:mtext mathvariant="italic">&#x0394;F</mml:mtext><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>(</mml:mo><mml:mn>262</mml:mn><mml:mo>+</mml:mo><mml:mn>0.14</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>0.16</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>2</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mtext>/</mml:mtext><mml:mn>3</mml:mn><mml:mo>+</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>0.25</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>3</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mtext>/</mml:mtext><mml:mn>5</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math>
<tex-math id="M11">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\Delta F = 2(262 + 0.14*F1 + (0.16*F2){\rm{/}}3 + (0.25*F3){\rm{/}}5)
\]
\end{document}
</tex-math>
<graphic xlink:href="/article/id/6268/file/75701/"/>
</alternatives>
</inline-formula></p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(12)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><inline-formula>
<alternatives>
<mml:math id="Eq012-mml"><mml:mrow><mml:mtext mathvariant="italic">&#x0394;F</mml:mtext><mml:mo>=</mml:mo><mml:mn>524</mml:mn><mml:mo>+</mml:mo><mml:mn>0.28</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mn>0.10666</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>2</mml:mn><mml:mo>+</mml:mo><mml:mn>0.1</mml:mn><mml:mo>*</mml:mo><mml:mi>F</mml:mi><mml:mn>3</mml:mn></mml:mrow></mml:math>
<tex-math id="M12">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\Delta F = 524 + 0.28*F1 + 0.10666*F2 + 0.1*F3
\]
\end{document}
</tex-math>
<graphic xlink:href="/article/id/6268/file/75702/"/>
</alternatives>
</inline-formula></p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<p>The regression formula in (11) is simplified as (12). Notice that the coefficients for F1&#8211;3 are about half as large as in the no-intercept formula (10). If we enter F1 = 500, F2 = 1500, and F3 = 2500 into formula (12) we get 1074 Hz (524 + 140 + 160 + 250). So about one half of the value of &#916;F is determined by the formant frequencies and the rest is determined by the intercept term (2&#946;<sub>0</sub> = 524). This will tend to shrink the range of &#916;F, and with it the range of estimated vocal tract lengths. The formula in (11) results in better vowel classification when vocal tract length is estimated from only a few vowel tokens and was used in the analyses reported here. Unsurprisingly, using Lammert and Narayanan&#8217;s no-intercept estimate of &#916;F (formula 8) resulted in classification accuracy that was almost identical to that reported for the average-spacing &#916;F method.</p>
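<p>The shrinking effect of the intercept can be illustrated with a short Python sketch. The function names and the scale factors of 0.8, 1.0, and 1.2 (standing in for talkers with shorter or longer vocal tracts) are assumptions made for this illustration and are not taken from the datasets analyzed here.</p>
<preformat>
```python
def delta_f_no_intercept(f1, f2, f3):
    """Formula (8), which simplifies to (10)."""
    return 2.0 * (0.28 * f1 + (0.31 * f2) / 3 + (0.47 * f3) / 5)


def delta_f_intercept(f1, f2, f3):
    """Formula (11), which simplifies to (12)."""
    return 2.0 * (262 + 0.14 * f1 + (0.16 * f2) / 3 + (0.25 * f3) / 5)


def delta_f_range(formula, base_formants, scales):
    """Spread (max - min) of DeltaF when all formants are scaled uniformly."""
    values = [formula(*(s * f for f in base_formants)) for s in scales]
    return max(values) - min(values)


if __name__ == "__main__":
    base = (500.0, 1500.0, 2500.0)   # formant values used in the worked example
    scales = (0.8, 1.0, 1.2)         # hypothetical talkers, for illustration only
    print(delta_f_intercept(*base))
    print(delta_f_range(delta_f_no_intercept, base, scales))
    print(delta_f_range(delta_f_intercept, base, scales))
```
</preformat>
<p>With the formant values from the worked example, the intercept formula gives 1074 Hz, of which 524 Hz comes from the constant term. Scaling all three formants by 20% in either direction moves the no-intercept estimate by 212 Hz but the intercept estimate by only 110 Hz.</p>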
</sec>
<sec>
<title>3.4. Study 1: Classification performance</title>
<p>For each normalization method, normalization factors (&#916;F, log(mean), <italic>SD</italic>, etc.) were calculated over all vowel tokens for a talker, and then the talker&#8217;s vowel formant frequencies were normalized using the code in Table <xref ref-type="table" rid="T1">1</xref>. After all of the vowel formant frequencies in the dataset were normalized, support vector machines (SVM, <xref ref-type="bibr" rid="B4">Cortes &amp; Vapnik, 1995</xref>) were used to classify the vowels. SVM is a supervised machine learning technique that, in this case, can be used to find optimal classification boundaries between the vowel categories given the data. The results of SVM classification can be used to infer an objective estimate of the level of vowel separation in the data. The classifiers built in this study used a radial basis function with gamma = 0.5. These are the default parameters for the R function svm(). Classification performance was evaluated by using the trained model to predict the category membership of each item in the training set, and the percent correct vowel classification was computed from these predictions.</p>
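<p>The per-talker normalization step can be sketched in Python as follows. This is an illustration under stated assumptions, not the analysis code: the record layout (talker, vowel label, formant tuple), the function names, and the four example records are invented, and the SVM classification step is not reproduced.</p>
<preformat>
```python
from collections import defaultdict


def delta_f(tokens):
    """Average-spacing DeltaF, formula (4), over a list of tokens (F1, F2, F3)."""
    scaled = [f / (i - 0.5) for token in tokens for i, f in enumerate(token, start=1)]
    return sum(scaled) / len(scaled)


def normalize_by_talker(records):
    """Divide every formant by the DeltaF of the talker who produced it.

    records is a list of (talker_id, vowel_label, (F1, F2, F3)) tuples.
    The scale factor of each talker is computed over all of that talker's tokens.
    """
    by_talker = defaultdict(list)
    for talker, _vowel, formants in records:
        by_talker[talker].append(formants)
    factors = {talker: delta_f(tokens) for talker, tokens in by_talker.items()}
    return [
        (talker, vowel, tuple(f / factors[talker] for f in formants))
        for talker, vowel, formants in records
    ]


if __name__ == "__main__":
    # Invented data: talker B has every formant 20 percent higher than talker A.
    records = [
        ("A", "i", (300.0, 2300.0, 3000.0)),
        ("A", "a", (700.0, 1200.0, 2600.0)),
        ("B", "i", (360.0, 2760.0, 3600.0)),
        ("B", "a", (840.0, 1440.0, 3120.0)),
    ]
    for row in normalize_by_talker(records):
        print(row)
```
</preformat>
<p>Because the two invented talkers differ only by a uniform scale factor, their normalized formant values coincide, which is the behavior expected of a uniform normalization method.</p>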
<p>The normalized vowels were also used to build SVM classifiers to identify the talker group (male, female, child) of each token. Assuming that the goal in normalization is to remove talker differences, then a lower percent correct talker classification is an indication of better normalization performance.</p>
<p>The studies in this paper neglect two sources of vowel-intrinsic information that have been shown to be useful in both automatic speech recognition and human speech perception: the fundamental frequency of voicing (<xref ref-type="bibr" rid="B9">Fujisaki &amp; Kawashima, 1968</xref>) and vowel inherent spectral change (<xref ref-type="bibr" rid="B27">Nearey &amp; Assmann, 1986</xref>). Thus, the classification results reported here are a minimal baseline.</p>
</sec>
<sec>
<title>3.5. Study 2: Effects of sample size</title>
<p>As in study 1, the vowel formant data to be classified in the second study were the full Hillenbrand et al. (<xref ref-type="bibr" rid="B12">1995</xref>) or the Peterson and Barney (<xref ref-type="bibr" rid="B30">1952</xref>) datasets. However, where in study 1 the normalization factors were calculated over all of the tokens for a talker, in this study the vowels were normalized with scale factors that were computed from randomly selected subsets of the available vowel data.</p>
<p>Different tests were conducted with normalization based on different numbers of vowel tokens per talker. Subsets of size 1, 2, 4, 6, or 9 tokens per talker were tested in the Hillenbrand data set, and subsets of size 1, 2, 5, 10, 15, and 20 tokens per talker were tested in the Peterson and Barney data set. In each test, a random subset of vowel tokens was selected and then normalization scale factors for each of the several normalization methods were computed from the randomly selected tokens. These normalization scale factors were then used to normalize the full set of vowel tokens (all of the tokens in the corpus), and as in study 1, SVM vowel classifiers were then built and the classification accuracy over the test data set was noted. This process was repeated 50 times in each test to give an estimate of the variability of the accuracy scores for a given normalization subset size.</p>
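<p>The resampling procedure can be sketched in Python for a single talker and the average-spacing &#916;F method. The function names, the fixed random seed, and the use of sampling without replacement are assumptions of this sketch; the classification step that follows normalization is omitted.</p>
<preformat>
```python
import random


def delta_f(tokens):
    """Average-spacing DeltaF, formula (4), over a list of tokens (F1, F2, F3)."""
    scaled = [f / (i - 0.5) for token in tokens for i, f in enumerate(token, start=1)]
    return sum(scaled) / len(scaled)


def repeated_subset_estimates(tokens, subset_size, repetitions=50, seed=0):
    """DeltaF estimates from repeated random subsets of one talker's tokens.

    Each repetition draws subset_size tokens without replacement and computes
    the scale factor from that subset only.
    """
    rng = random.Random(seed)
    return [delta_f(rng.sample(tokens, subset_size)) for _ in range(repetitions)]


def normalize_all(tokens, scale):
    """Apply one talker-level scale factor to every token of that talker."""
    return [tuple(f / scale for f in token) for token in tokens]


if __name__ == "__main__":
    # Invented formant values for one talker, for illustration only.
    talker_tokens = [
        (300.0, 2300.0, 3000.0),
        (700.0, 1200.0, 2600.0),
        (350.0, 900.0, 2400.0),
        (500.0, 1500.0, 2500.0),
    ]
    estimates = repeated_subset_estimates(talker_tokens, subset_size=2)
    print(min(estimates), max(estimates))
    print(normalize_all(talker_tokens, estimates[0])[0])
```
</preformat>
<p>The spread of the 50 estimates shows how much the scale factor depends on which tokens happened to be selected. When the subset size equals the number of available tokens, every repetition returns the full-data estimate.</p>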
<p>When the subset was a single token, the formant &#8216;mean&#8217; that was used in the non-uniform normalization methods was set equal to the value of that formant in the selected token. The standard deviation in the Lobanov method was also set equal to the formant value when only a single token was the basis for normalization parameters.</p>
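<p>The single-token fallback for the Lobanov parameters can be written as a small Python sketch. The function names are hypothetical, and the use of the sample standard deviation (dividing by the number of tokens minus one) in the multi-token case is an assumption.</p>
<preformat>
```python
import statistics


def lobanov_parameters(formant_values):
    """Mean and standard deviation of one formant for one talker.

    With a single token, both parameters are set to the formant value itself,
    following the fallback described in the text.
    """
    if len(formant_values) == 1:
        value = formant_values[0]
        return value, value
    return statistics.mean(formant_values), statistics.stdev(formant_values)


def lobanov_normalize(value, mean, sd):
    """z-score of one formant measurement."""
    return (value - mean) / sd


if __name__ == "__main__":
    mean, sd = lobanov_parameters([500.0])    # one selected token, F1 = 500 Hz
    print(lobanov_normalize(750.0, mean, sd))  # another token of the same talker
```
</preformat>
<p>With this fallback the selected token itself is always mapped to zero, and every other token is expressed as a proportional deviation from it.</p>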
<p>Two additional tests were conducted. In one, a single token of the vowel [&#601;] was used to calculate normalization factors, and in the second, formants of the corner vowels [i, u, &#593;] were used to define the normalization statistics.</p>
</sec>
</sec>
<sec>
<title>4. Results</title>
<sec>
<title>4.1. Study 1: Classification accuracy</title>
<p>As shown in Table <xref ref-type="table" rid="T2">2</xref>, vowel classification with &#916;F (average-spacing or Kirlin&#8217;s VTL method) is close to state of the art, with vowel classification accuracy approaching 90% correct for the Peterson and Barney (<xref ref-type="bibr" rid="B30">1952</xref>) data set, and about 80% correct for the Hillenbrand et al. (<xref ref-type="bibr" rid="B12">1995</xref>) data set. Also, reduction of talker information in vowels was comparable to the state of the art achieved by the non-uniform normalization methods proposed by Nearey (<xref ref-type="bibr" rid="B25">1978</xref>), Lobanov (<xref ref-type="bibr" rid="B24">1971</xref>), and Watt &amp; Fabricius (<xref ref-type="bibr" rid="B36">2002</xref>). The Lammert &amp; Narayanan (<xref ref-type="bibr" rid="B22">2015</xref>) method of VTL estimation (with intercept) is less accurate. The Lammert &amp; Narayanan no-intercept method (not shown in the table) is virtually the same as the average spacing method.</p>
<table-wrap id="T2">
<label>Table 2</label>
<caption>
<p>SVM percent correct identification of vowels and talker group (man, woman, child [MWC], or man, woman, boy, girl [MWBG]) in the vowel formant data reported by Peterson and Barney (<xref ref-type="bibr" rid="B30">1952; PB52</xref>) and Hillenbrand et al. (<xref ref-type="bibr" rid="B12">1995; H95</xref>) by different vowel normalization methods. Non-uniform vowel normalization refers to methods that use a different normalization factor for each formant. Uniform normalization refers to methods that use a single uniform scaling factor (such as vocal tract length) for all of the formants produced by a person.</p>
</caption>
<table>
<tr>
<th align="left" rowspan="3" valign="top">Method</th>
<th align="left" rowspan="3" valign="top">Type</th>
<th align="center" colspan="2" valign="top">Vowels</th>
<th align="center" colspan="2" valign="top">Talker Group</th>
</tr>
<tr>
<th colspan="4"><hr/></th>
</tr>
<tr>
<th align="center" valign="top">PB52</th>
<th align="center" valign="top">H95</th>
<th align="center" valign="top">PB52</th>
<th align="center" valign="top">H95</th>
</tr>
<tr>
<td colspan="6"><hr/></td>
</tr>
<tr>
<td align="left">Chance</td>
<td align="left">&#8212;</td>
<td align="right">12.5%</td>
<td align="right">10%</td>
<td align="right">40%</td>
<td align="right">30%</td>
</tr>
<tr>
<td align="left">No normalization (F1 &amp; F2)</td>
<td align="left">NONE</td>
<td align="right">77.3</td>
<td align="right">62.9</td>
<td align="right">66.7</td>
<td align="right">53.2</td>
</tr>
<tr>
<td align="left">&#916;F (Nordstr&#246;m &amp; Lindblom)</td>
<td align="left">Uniform</td>
<td align="right">82.5</td>
<td align="right">72.7</td>
<td align="right">49.8</td>
<td align="right">41.9</td>
</tr>
<tr>
<td align="left">&#916;F (Lammert &amp; Narayanan)</td>
<td align="left">Uniform</td>
<td align="right">86.3</td>
<td align="right">73.4</td>
<td align="right">59.6</td>
<td align="right">47.0</td>
</tr>
<tr>
<td align="left">Mean log Fs (Nearey 2)</td>
<td align="left">Uniform</td>
<td align="right">87.9</td>
<td align="right">77.8</td>
<td align="right">52.0</td>
<td align="right">43.2</td>
</tr>
<tr>
<td align="left">&#916;F (Kirlin)</td>
<td align="left">Uniform</td>
<td align="right">88.0</td>
<td align="right">78.4</td>
<td align="right">52.2</td>
<td align="right">41.7</td>
</tr>
<tr>
<td align="left">&#916;F (average-spacing)</td>
<td align="left">Uniform</td>
<td align="right">88.2</td>
<td align="right">78.1</td>
<td align="right">50.9</td>
<td align="right">42.9</td>
</tr>
<tr>
<td align="left">Mean log F (Nearey 1)</td>
<td align="left">Non-uniform</td>
<td align="right">90.9</td>
<td align="right">80.1</td>
<td align="right">51.6</td>
<td align="right">42.4</td>
</tr>
<tr>
<td align="left">Mean F, ratio (Watt &amp; Fabricius)</td>
<td align="left">Non-uniform</td>
<td align="right">90.8</td>
<td align="right">80.7</td>
<td align="right">50.8</td>
<td align="right">41.4</td>
</tr>
<tr>
<td align="left">Z-score normalization (Lobanov)</td>
<td align="left">Non-uniform</td>
<td align="right">92.6</td>
<td align="right">84.4</td>
<td align="right">49.3</td>
<td align="right">39.8</td>
</tr>
</table>
</table-wrap>
<p>Vowel classification with average-spacing &#916;F normalization is much better than with Nordstr&#246;m &amp; Lindblom&#8217;s (<xref ref-type="bibr" rid="B28">1975</xref>) method of VTL estimation, but not quite as good as the best non-uniform normalization methods. The &#916;F method is also quite comparable with the other models in removing much of the talker group information (e.g., man, woman, child). It is worth noting, however, that chance &#8216;talker group&#8217; identification (calculated with 1000 random permutations of the datasets) is 40% correct in the Peterson and Barney (<xref ref-type="bibr" rid="B30">1952</xref>) corpus, and 31% correct in the Hillenbrand et al. (<xref ref-type="bibr" rid="B12">1995</xref>) corpus, so none of the methods fully &#8216;normalize out&#8217; this information. This is a neglected point in vowel normalization studies (but see the recent insightful discussion in <xref ref-type="bibr" rid="B2">Barreda &amp; Nearey, 2018</xref>, who reference a discussion in <xref ref-type="bibr" rid="B13">Hindle, 1978</xref>). We &#8216;normalize&#8217; vowel formant measurements, but gender and age differences are not fully removed. Knowing when a method &#8216;over-normalizes&#8217;, removing talker variability that &#8216;should&#8217; remain because it is sociolinguistically significant, is an important problem. This highlights the need for a principled approach grounded in acoustic theory. It also means that, regardless of the normalization method used, the residue of talker variation left behind by normalization should be statistically modeled, because we can be sure that some talker information remains.</p>
<p>Figure <xref ref-type="fig" rid="F3">3</xref> shows plots of normalized and unnormalized vowel formants for the Hillenbrand et al. (<xref ref-type="bibr" rid="B12">1995</xref>) data set. These plots show that vocal tract length normalization using the &#916;F method results in a normalized vowel space that is remarkably similar to spaces obtained with non-uniform methods that require more, and less interpretable, parameters. Note that the center of the vowel space is marked by the horizontal and vertical lines. For the Nearey, Lobanov, and Watt &amp; Fabricius methods these lines mark the mean F1 and F2; for the &#916;F methods they mark the resonances of the uniform tube.</p>
<fig id="F3">
<label>Figure 3</label>
<caption>
<p>Upper left: F1 and F2 vowel formant frequencies from Hillenbrand et al. (<xref ref-type="bibr" rid="B12">1995</xref>). Other panels: the same data normalized by several of the methods identified in the text.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="/article/id/6268/file/75703/"/>
</fig>
<p>This analysis indicates that vocal tract length normalization using a single interpretable normalization parameter (&#916;F&#8212;an estimate of formant spacing in a vocal tract with no constrictions) is comparable to other vowel normalization methods. The talker-independent dimension that is used in this normalized representation is derived from both the formants being normalized (F1 and F2), as well as higher formants (F3 in this case, and also F4; <xref ref-type="bibr" rid="B22">Lammert &amp; Narayanan, 2015</xref>), which are less likely to vary as a function of the particular inventory of vowels found in a language. Higher formants, F3 and F4, are sometimes not measured. This study suggests that they should be, and that this additional information about the talker&#8217;s vocal tract could be extremely valuable in interpreting the F1/F2 vowel space.</p>
<p>Because the normalization factor &#916;F is directly interpretable in terms of a physical property of the talker, vocal tract length normalization is valid for cross-linguistic comparison of vowel spaces. In addition, it is remarkable that the state of the art in vowel formant normalization is almost entirely reducible to normalization in terms of the talker&#8217;s vocal tract length. This has not been observed before because phoneticians have not used a measure calibrated to vocal tract length in our vowel normalization schemes. Arguably, Nearey&#8217;s (<xref ref-type="bibr" rid="B25">1978</xref>) uniform normalization technique, with mean logF*, uses a measure that reflects vocal tract length, although because it is not calibrated to vocal tract length it is a problematic measure, giving incomparable measurement scales for values normalized over mean logF* (F1..F4) versus values normalized over mean logF* (F1..F3). The &#916;F measurement scale is the same whether &#916;F is estimated from three formants or four.</p>
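<p>The difference in measurement scales can be illustrated with the resonances of a uniform tube. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes that the uniform log-mean method expresses a formant as log(F) minus the mean log formant frequency, computed here from a single token.</p>
<preformat>
```python
import math


def delta_f(token):
    """Average-spacing DeltaF, formula (4), for one token (F1, ..., Fn)."""
    return sum(f / (i - 0.5) for i, f in enumerate(token, start=1)) / len(token)


def mean_log_f(token):
    """Mean of the natural-log formant frequencies of one token."""
    return sum(math.log(f) for f in token) / len(token)


if __name__ == "__main__":
    # Resonances of a uniform tube with DeltaF = 1000 Hz: F_i = (i - 0.5) * 1000.
    three = (500.0, 1500.0, 2500.0)
    four = three + (3500.0,)
    f1 = 500.0
    # DeltaF scale: the same value whether three or four formants are used.
    print(f1 / delta_f(three), f1 / delta_f(four))
    # Log-mean scale: the value shifts when F4 is added.
    print(math.log(f1) - mean_log_f(three), math.log(f1) - mean_log_f(four))
```
</preformat>
<p>On the &#916;F scale, F1 of the uniform tube is 0.5 in both cases. On the log-mean scale the same F1 receives a different value once F4 is included, because adding a higher formant raises the mean log frequency.</p>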
</sec>
<sec>
<title>4.2. Study 2: Effects of sample size</title>
<p>Beyond the practicalities of having a workable vowel normalization scheme for comparing dialects and languages, the discovery that vocal tract length normalization is a powerful method for reducing some of the talker variability found in speech leads one to wonder whether vocal tract length normalization might play a role in speech perception. Study 2 tested this by limiting the amount of information available to the normalization algorithms in tests of vowel classification using the Hillenbrand et al. (<xref ref-type="bibr" rid="B12">1995</xref>) and Peterson and Barney (<xref ref-type="bibr" rid="B30">1952</xref>) datasets. Limiting the information available for normalization creates a situation that Barreda and Nearey (<xref ref-type="bibr" rid="B2">2018</xref>) call &#8216;type B&#8217; over-normalization, which is &#8220;due to noise in the estimated speaker parameters used for normalization.&#8221;</p>
<p>Figure <xref ref-type="fig" rid="F4">4</xref> shows the results of study 2 on the Hillenbrand et al. (<xref ref-type="bibr" rid="B12">1995</xref>) data set, with normalization based on different numbers of randomly selected vowels (1, 2, 4, 6, or 9 tokens), on the corner vowels [i], [u], and [&#593;], on schwa [&#601;], or on the entire set of observations from each talker (12 tokens). For these latter models (12 tokens, schwa, or corner vowels) there was no repeated random selection of a basis for extrinsic normalization, so only one SVM was fit for each normalization method.</p>
<fig id="F4">
<label>Figure 4</label>
<caption>
<p>Results of SVM vowel classification of the Hillenbrand et al. (<xref ref-type="bibr" rid="B12">1995</xref>) vowels when the vowel normalization statistics are calculated over different numbers of randomly selected vowel tokens, or sets designed to be uniquely informative about the vowel space (the corner vowels) or the vocal tract (schwa). The order of the bars within each panel is indicated in the legend, reading top to bottom for the bars from left to right&#8212;e.g., Lobanov normalization is the leftmost bar in each panel. Classification with no normalization results in 62.9% correct vowel identification (see Table <xref ref-type="table" rid="T2">2</xref>), which is indicated by the dashed horizontal line.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="/article/id/6268/file/75704/"/>
</fig>
<p>Figure <xref ref-type="fig" rid="F5">5</xref> shows analogous results for the Peterson and Barney (<xref ref-type="bibr" rid="B30">1952</xref>) dataset. With both the Hillenbrand et al. (<xref ref-type="bibr" rid="B12">1995</xref>) data and the Peterson and Barney (<xref ref-type="bibr" rid="B30">1952</xref>) data, the results show that uniform scaling techniques (Nearey&#8217;s uniform scaling method and the &#916;F methods: Kirlin, average-spacing, and Lammert &amp; Narayanan) improve vowel classification accuracy over unnormalized classification even with a very small random sample of speech. By contrast, non-uniform methods, i.e., those that scale each formant using information in the corpus about that formant (<italic>z</italic>-score normalization [Lobanov], mean ratio [Watt &amp; Fabricius], or non-uniform log mean difference [Nearey]), are more dependent upon the particular vowel tokens that are used to calculate the vowel normalization factors, and need either a carefully chosen sample or a large one. Random selection of vowel tokens, as done here, has a catastrophic effect on the non-uniform methods if only a few vowel tokens are taken to represent the talker. In practice, where a large corpus of vowel measurements is available, this does not matter, and in fact the non-uniform methods may reduce talker differences better than the uniform methods in corpus analysis. But these methods are brittle and are generally inappropriate as models of the perceptual process.</p>
<fig id="F5">
<label>Figure 5</label>
<caption>
<p>Results of SVM vowel classification of the Peterson and Barney (<xref ref-type="bibr" rid="B30">1952</xref>) vowels when the vowel normalization statistics are calculated over different numbers of randomly selected vowel tokens, or sets designed to be maximally informative about the vowel space (the corner vowels) or the vocal tract (schwa). Classification with no normalization results in 77.3% correct vowel identification (see Table <xref ref-type="table" rid="T2">2</xref>), which is indicated by the dashed horizontal line.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="/article/id/6268/file/75705/"/>
</fig>
<p>Testing the Watt &amp; Fabricius method with randomly selected vowel tokens, as done here, goes against the spirit of that method in particular, because its main feature is a judicious selection of vowel tokens to represent the full possible range of F1 and F2 for a talker. However, random selection of tokens is justified in this study because it is designed to evaluate the plausibility of extrinsic normalization as a component of speech perception. Listeners are not presented with a judicious selection of vowel tokens, but have to deal with whatever the talker says. It is worth noting that even without judicious selection of vowel tokens, Watt &amp; Fabricius&#8217; mean ratio representation has very good vowel classification performance when a large sample of tokens is taken; however, the normalization scale may depend on the specific vowel inventory. Regardless, the Watt &amp; Fabricius method was not designed as a model of perceptual vowel normalization and is clearly not a feasible one.</p>
</sec>
</sec>
<sec>
<title>5. Conclusion</title>
<p>This study introduced a new method of vowel normalization, the &#916;F method. This is explicitly a vocal tract length normalization method, and it represents vowels on a talker-independent measurement scale by expressing formants relative to the average formant spacing of the talker, their &#916;F. Using a metric that is closely related to the length of the talker&#8217;s vocal tract, we are able to produce a vowel space in which vocal tract length effects have been removed. The resulting vowel space is largely equivalent to Nearey&#8217;s (<xref ref-type="bibr" rid="B25">1978</xref>) log-mean uniform vowel normalization method, but is explicitly rationalized in terms of vocal tract length and has a consistent unit of measure whether three-formant or four-formant measurements are used. &#916;F normalization based on Lammert and Narayanan&#8217;s (<xref ref-type="bibr" rid="B22">2015</xref>) intercept method of vocal tract length estimation is more robust when only a few tokens are available for a talker, but provides less vowel category separation in models built over a larger corpus.</p>
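<p>As a rough sketch of the average-spacing idea, the Python code below estimates &#916;F as the mean of F<sub><italic>i</italic></sub>/(<italic>i</italic> &#8722; 0.5) over all measured formants of a talker&#8217;s tokens and then divides each formant by that value. The uniform-tube relation behind this estimate, the speed of sound of 35,000 cm/s used to convert &#916;F to a tube length, the function names, and the example formant values are assumptions made for illustration and are not taken from the datasets analyzed here.</p>
<preformat preformat-type="code"><![CDATA[
```python
from statistics import mean

SPEED_OF_SOUND_CM_S = 35000.0  # assumed value for warm, moist air


def delta_f(tokens):
    """Average formant spacing (Hz) for one talker.

    tokens: list of formant lists [F1, F2, ...] in Hz, one list per vowel token.
    Under the uniform-tube assumption F_i = (i - 0.5) * delta_f, each
    F_i / (i - 0.5) is one estimate of delta_f; the estimates are averaged.
    """
    estimates = [f / (i - 0.5) for tok in tokens for i, f in enumerate(tok, start=1)]
    return mean(estimates)


def normalize(tokens, df):
    """Express every formant as a multiple of the talker's delta_f."""
    return [[f / df for f in tok] for tok in tokens]


def vocal_tract_length_cm(df):
    """Tube length implied by delta_f: L = c / (2 * delta_f)."""
    return SPEED_OF_SOUND_CM_S / (2.0 * df)


# Hypothetical talker A, and talker B whose formants are all 20% higher,
# as a uniformly shorter vocal tract would produce.
talker_a = [[300.0, 2300.0, 3000.0], [750.0, 1200.0, 2600.0]]
talker_b = [[1.2 * f for f in tok] for tok in talker_a]

norm_a = normalize(talker_a, delta_f(talker_a))
norm_b = normalize(talker_b, delta_f(talker_b))

# Ideal neutral tube measured with three or with four formants.
tube_3 = [[500.0, 1500.0, 2500.0]]
tube_4 = [[500.0, 1500.0, 2500.0, 3500.0]]
print(delta_f(tube_3), delta_f(tube_4), vocal_tract_length_cm(delta_f(tube_3)))
```
]]></preformat>
<p>Because the scale factor is a single number per talker, the two hypothetical talkers map onto the same normalized values, and for an ideal uniform tube the estimate does not depend on whether three or four formants are measured. Real vowels depart from the ideal tube, so in practice the three-formant and four-formant estimates will agree only approximately.</p>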
<p>This study has also shown that non-uniform methods that rely on within-formant scale factors (Nearey&#8217;s log-mean non-uniform method, Lobanov&#8217;s <italic>z</italic>-score normalization, and Watt &amp; Fabricius&#8217; mean ratio method) are highly sensitive to the vowel tokens that are chosen as the basis for calculating the normalization scale factors. This leads to the conclusion that these methods are not plausible as models of the cognitive processes involved in speech perception. Also, in practical phonetic description of languages, this dependence on particular vowel tokens for calculating normalization scale factors complicates cross-linguistic or cross-dialect comparisons of vowel spaces. If languages have different vowel inventories, then the unit of measure for normalized vowels will not be comparable across languages. This leads one to try to determine formant ranges indirectly (<xref ref-type="bibr" rid="B7">Fabricius, Watt, &amp; Johnson, 2009</xref>) in order to have normalized values that can be cross-linguistically compared. Vocal tract length normalization does not have as much of a problem with this because it is less sensitive to the composition of the vowel inventory, and not very sensitive to the particular vowel tokens that represent a talker.</p>
<p>The practical conclusion is that &#916;F normalization likely has advantages over other methods for speech researchers. It produces good classification accuracy, is robust to sample size variation across talkers, is independent of the vowel inventory or phonetic vowel realizations in the language or dialect studied, puts all talkers, regardless of language or dialect, on the same measurement scale, and is rationalized in terms of the acoustic theory of speech production.</p>
<p>Finally, the results of this study also encourage us to think that listeners may be able to employ a type of vocal tract length normalization with very little extrinsic context. Non-uniform normalization schemes are untenable when faced with a single isolated vowel, but a normalization scheme in which vocal tract length is estimated from the entire spectrum of a vowel (see, e.g., <xref ref-type="bibr" rid="B35">Wakita, 1977</xref>; <xref ref-type="bibr" rid="B3">Bladon, Henton &amp; Pickering, 1984</xref>; <xref ref-type="bibr" rid="B23">Lee &amp; Rose, 1998</xref>) does seem to be a plausible perceptual mechanism even with isolated vowels. Future research may bear this out.</p>
</sec>
</body>
<back>
<fn-group>
<fn id="n1"><p>The measurement errors reported for these estimates of vocal tract length are about 10% of the length of a typical vocal tract, which may seem a bit large. It is, however, a small enough error to be able to correctly classify speakers in the Hillenbrand et al. (<xref ref-type="bibr" rid="B12">1995</xref>) data set as &#8216;male,&#8217; &#8216;female,&#8217; or &#8216;child&#8217; with 90% accuracy based on the L&amp;N estimate of vocal tract length (formula 8). This was measured using the methods for support vector machine classification that are described below for study 1.</p></fn>
<fn id="n2"><p>Though the coefficients for four formant data using Lammert &amp; Narayanan&#8217;s (<xref ref-type="bibr" rid="B22">2015</xref>) method are published in their paper, for completeness and ease of reference they are repeated here: with intercept (formula 11): &#946;<sub>0</sub> = 52, &#946;<sub>1</sub> = 0.078, &#946;<sub>2</sub> = 0.099, &#946;<sub>3</sub> = 0.101, &#946;<sub>4</sub> = 0.609, and without intercept (formula 8): &#946;<sub>1</sub> = 0.089, &#946;<sub>2</sub> = 0.102, &#946;<sub>3</sub> = 0.121, &#946;<sub>4</sub> = 0.669.</p></fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>This paper is dedicated to Mary Beckman. Thanks to Adam Lammert for sharing VTL estimation coefficients for three-formant vowel data and for extensive comments on an earlier version of the manuscript. Thanks also to Santiago Barreda and two anonymous reviewers for comments on an earlier version of the manuscript. The work here was made possible in part by Barreda&#8217;s collection of datasets and analysis/visualization tools that he has made available in the <bold>phonTools</bold> R package.</p>
</ack>
<sec>
<title>Competing Interests</title>
<p>The author has no competing interests to declare.</p>
</sec>
<ref-list>
<ref id="B1"><label>1</label><mixed-citation publication-type="journal"><string-name><surname>Adank</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Smits</surname>, <given-names>R.</given-names></string-name>, &amp; <string-name><surname>van Hout</surname>, <given-names>R.</given-names></string-name> (<year>2004</year>). <article-title>A comparison of vowel normalization procedures for language variation research</article-title>. <source>Journal of the Acoustical Society of America</source>, <volume>116</volume>, <fpage>3099</fpage>&#8211;<lpage>3107</lpage>. DOI: <pub-id pub-id-type="doi">10.1121/1.1795335</pub-id></mixed-citation></ref>
<ref id="B2"><label>2</label><mixed-citation publication-type="journal"><string-name><surname>Barreda</surname>, <given-names>S.</given-names></string-name>, &amp; <string-name><surname>Nearey</surname>, <given-names>T. M.</given-names></string-name> (<year>2018</year>). <article-title>A regression approach to vowel normalization for missing and unbalanced data</article-title>. <source>Journal of the Acoustical Society of America</source>, <volume>144</volume>, <fpage>500</fpage>&#8211;<lpage>520</lpage>. DOI: <pub-id pub-id-type="doi">10.1121/1.5047742</pub-id></mixed-citation></ref>
<ref id="B3"><label>3</label><mixed-citation publication-type="journal"><string-name><surname>Bladon</surname>, <given-names>R. A. W.</given-names></string-name>, <string-name><surname>Henton</surname>, <given-names>C. G.</given-names></string-name>, &amp; <string-name><surname>Pickering</surname>, <given-names>J. B.</given-names></string-name> (<year>1984</year>). <article-title>Towards an auditory theory of speaker normalization</article-title>. <source>Language &amp; Communication</source>, <volume>4</volume>, <fpage>59</fpage>&#8211;<lpage>69</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/0271-5309(84)90019-3</pub-id></mixed-citation></ref>
<ref id="B4"><label>4</label><mixed-citation publication-type="journal"><string-name><surname>Cortes</surname>, <given-names>C.</given-names></string-name>, &amp; <string-name><surname>Vapnik</surname>, <given-names>V. N.</given-names></string-name> (<year>1995</year>). <article-title>Support-vector networks</article-title>. <source>Machine Learning</source>, <volume>20</volume>(<issue>3</issue>), <fpage>273</fpage>&#8211;<lpage>297</lpage>. DOI: <pub-id pub-id-type="doi">10.1007/BF00994018</pub-id></mixed-citation></ref>
<ref id="B5"><label>5</label><mixed-citation publication-type="journal"><string-name><surname>Disner</surname>, <given-names>S.</given-names></string-name> (<year>1980</year>). <article-title>Evaluation of vowel normalization procedures</article-title>. <source>Journal of the Acoustical Society of America</source>, <volume>67</volume>, <fpage>253</fpage>&#8211;<lpage>261</lpage>. DOI: <pub-id pub-id-type="doi">10.1121/1.383734</pub-id></mixed-citation></ref>
<ref id="B6"><label>6</label><mixed-citation publication-type="confproc"><string-name><surname>Eide</surname>, <given-names>E.</given-names></string-name>, &amp; <string-name><surname>Gish</surname>, <given-names>H.</given-names></string-name> (<year>1996</year>). <article-title>A parametric approach to vocal tract length normalization</article-title>. <conf-name>1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings</conf-name>, <volume>1</volume>, <fpage>346</fpage>&#8211;<lpage>348</lpage>. <conf-loc>Atlanta, GA, USA</conf-loc>. DOI: <pub-id pub-id-type="doi">10.1109/ICASSP.1996.541103</pub-id></mixed-citation></ref>
<ref id="B7"><label>7</label><mixed-citation publication-type="journal"><string-name><surname>Fabricius</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Watt</surname>, <given-names>D.</given-names></string-name>, &amp; <string-name><surname>Johnson</surname>, <given-names>D. E.</given-names></string-name> (<year>2009</year>). <article-title>A comparison of three speaker-intrinsic vowel formant frequency normalization algorithms for sociophonetics</article-title>. <source>Language Variation and Change</source>, <volume>21</volume>, <fpage>413</fpage>&#8211;<lpage>435</lpage>. DOI: <pub-id pub-id-type="doi">10.1017/S0954394509990160</pub-id></mixed-citation></ref>
<ref id="B8"><label>8</label><mixed-citation publication-type="book"><string-name><surname>Fant</surname>, <given-names>G.</given-names></string-name> (<year>1960</year>). <source>Acoustic Theory of Speech Production</source>. <publisher-loc>The Hague</publisher-loc>: <publisher-name>Mouton</publisher-name>.</mixed-citation></ref>
<ref id="B9"><label>9</label><mixed-citation publication-type="journal"><string-name><surname>Fujisaki</surname>, <given-names>H.</given-names></string-name>, &amp; <string-name><surname>Kawashima</surname>, <given-names>T.</given-names></string-name> (<year>1968</year>). <article-title>The roles of pitch and higher formants in the perception of vowels</article-title>. <source>IEEE Transactions on Audio and Electroacoustics</source>, <volume>16</volume>, <fpage>73</fpage>&#8211;<lpage>77</lpage>. DOI: <pub-id pub-id-type="doi">10.1109/TAU.1968.1161952</pub-id></mixed-citation></ref>
<ref id="B10"><label>10</label><mixed-citation publication-type="journal"><string-name><surname>Gerstman</surname>, <given-names>L.</given-names></string-name> (<year>1968</year>). <article-title>Classification of self-normalized vowels</article-title>. <source>IEEE Transactions on Audio and Electroacoustics</source>, <volume>16</volume>, <fpage>78</fpage>&#8211;<lpage>80</lpage>. DOI: <pub-id pub-id-type="doi">10.1109/TAU.1968.1161953</pub-id></mixed-citation></ref>
<ref id="B11"><label>11</label><mixed-citation publication-type="journal"><string-name><surname>Harrington</surname>, <given-names>F. H.</given-names></string-name>, &amp; <string-name><surname>Mech</surname>, <given-names>L. D.</given-names></string-name> (<year>1979</year>). <article-title>Wolf howling and its role in territory maintenance</article-title>. <source>Behaviour</source>, <volume>68</volume>, <fpage>207</fpage>&#8211;<lpage>249</lpage>. DOI: <pub-id pub-id-type="doi">10.1163/156853979X00322</pub-id></mixed-citation></ref>
<ref id="B12"><label>12</label><mixed-citation publication-type="journal"><string-name><surname>Hillenbrand</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Getty</surname>, <given-names>L. A.</given-names></string-name>, <string-name><surname>Clark</surname>, <given-names>M. J.</given-names></string-name>, &amp; <string-name><surname>Wheeler</surname>, <given-names>K.</given-names></string-name> (<year>1995</year>). <article-title>Acoustic characteristics of American English vowels</article-title>. <source>Journal of the Acoustical Society of America</source>, <volume>97</volume>(<issue>5</issue>), <fpage>3099</fpage>&#8211;<lpage>3111</lpage>. DOI: <pub-id pub-id-type="doi">10.1121/1.411872</pub-id></mixed-citation></ref>
<ref id="B13"><label>13</label><mixed-citation publication-type="book"><string-name><surname>Hindle</surname>, <given-names>D.</given-names></string-name> (<year>1978</year>). <chapter-title>Approaches to vowel normalization in the study of natural speech</chapter-title>. In <string-name><given-names>D.</given-names> <surname>Sankoff</surname></string-name> (Ed.), <source>Linguistic variation: Models and methods</source> (pp. <fpage>161</fpage>&#8211;<lpage>171</lpage>). <publisher-loc>New York</publisher-loc>: <publisher-name>Academic Press</publisher-name>.</mixed-citation></ref>
<ref id="B14"><label>14</label><mixed-citation publication-type="journal"><string-name><surname>Johnson</surname>, <given-names>K.</given-names></string-name> (<year>1990</year>). <article-title>The role of perceived speaker identity in F0 normalization of vowels</article-title>. <source>Journal of the Acoustical Society of America</source>, <volume>88</volume>, <fpage>642</fpage>&#8211;<lpage>654</lpage>. DOI: <pub-id pub-id-type="doi">10.1121/1.399767</pub-id></mixed-citation></ref>
<ref id="B15"><label>15</label><mixed-citation publication-type="book"><string-name><surname>Johnson</surname>, <given-names>K.</given-names></string-name> (<year>1997</year>). <chapter-title>Speech perception without speaker normalization: An exemplar model</chapter-title>. In <string-name><given-names>K.</given-names> <surname>Johnson</surname></string-name> &amp; <string-name><given-names>J. W.</given-names> <surname>Mullennix</surname></string-name> (Eds.), <source>Talker variability in speech processing</source> (pp. <fpage>145</fpage>&#8211;<lpage>166</lpage>). <publisher-loc>San Diego, CA</publisher-loc>: <publisher-name>Academic Press</publisher-name>.</mixed-citation></ref>
<ref id="B16"><label>16</label><mixed-citation publication-type="book"><string-name><surname>Johnson</surname>, <given-names>K.</given-names></string-name> (<year>2005</year>). <chapter-title>Speaker normalization in speech perception</chapter-title>. In <string-name><given-names>D. B.</given-names> <surname>Pisoni</surname></string-name> &amp; <string-name><given-names>R.</given-names> <surname>Remez</surname></string-name> (Eds.), <source>The handbook of speech perception</source> (pp. <fpage>363</fpage>&#8211;<lpage>389</lpage>). <publisher-loc>Oxford</publisher-loc>: <publisher-name>Blackwell Publishers</publisher-name>. DOI: <pub-id pub-id-type="doi">10.1002/9780470757024.ch15</pub-id></mixed-citation></ref>
<ref id="B17"><label>17</label><mixed-citation publication-type="book"><string-name><surname>Johnson</surname>, <given-names>K.</given-names></string-name> (<year>2011</year>). <chapter-title>Acoustic and Auditory Phonetics</chapter-title>, <edition>3rd</edition> Edition. <publisher-loc>Boston</publisher-loc>: <publisher-name>Wiley-Blackwell</publisher-name>.</mixed-citation></ref>
<ref id="B18"><label>18</label><mixed-citation publication-type="journal"><string-name><surname>Johnson</surname>, <given-names>K.</given-names></string-name>, &amp; <string-name><surname>Sjerps</surname>, <given-names>M.</given-names></string-name> (to appear). <article-title>Speaker Normalization in Speech Perception</article-title>.</mixed-citation></ref>
<ref id="B19"><label>19</label><mixed-citation publication-type="journal"><string-name><surname>Johnson</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Strand</surname>, <given-names>E. A.</given-names></string-name>, &amp; <string-name><surname>D&#8217;Imperio</surname>, <given-names>M.</given-names></string-name> (<year>1999</year>). <article-title>Auditory-visual integration of talker gender in vowel perception</article-title>. <source>Journal of Phonetics</source>, <volume>27</volume>, <fpage>359</fpage>&#8211;<lpage>384</lpage>. DOI: <pub-id pub-id-type="doi">10.1006/jpho.1999.0100</pub-id></mixed-citation></ref>
<ref id="B20"><label>20</label><mixed-citation publication-type="journal"><string-name><surname>Kirlin</surname>, <given-names>R. L.</given-names></string-name> (<year>1978</year>). <article-title><italic>A posteriori</italic> estimation of vocal tract length</article-title>. <source>IEEE Transactions on Acoustics, Speech, and Signal Processing</source>, <volume>26</volume>, <fpage>571</fpage>&#8211;<lpage>574</lpage>. DOI: <pub-id pub-id-type="doi">10.1109/TASSP.1978.1163151</pub-id></mixed-citation></ref>
<ref id="B21"><label>21</label><mixed-citation publication-type="journal"><string-name><surname>Ladefoged</surname>, <given-names>P.</given-names></string-name>, &amp; <string-name><surname>Broadbent</surname>, <given-names>D. E.</given-names></string-name> (<year>1957</year>). <article-title>Information conveyed by vowels</article-title>. <source>Journal of the Acoustical Society of America</source>, <volume>29</volume>, <fpage>98</fpage>&#8211;<lpage>104</lpage>. DOI: <pub-id pub-id-type="doi">10.1121/1.1908694</pub-id></mixed-citation></ref>
<ref id="B22"><label>22</label><mixed-citation publication-type="journal"><string-name><surname>Lammert</surname>, <given-names>A. C.</given-names></string-name>, &amp; <string-name><surname>Narayanan</surname>, <given-names>S. S.</given-names></string-name> (<year>2015</year>). <article-title>On Short-Time Estimation of Vocal Tract Length from Formant Frequencies</article-title>. <source>PLoS ONE</source>, <volume>10</volume>(<issue>7</issue>): <elocation-id>e0132193</elocation-id>. DOI: <pub-id pub-id-type="doi">10.1371/journal.pone.0132193</pub-id></mixed-citation></ref>
<ref id="B23"><label>23</label><mixed-citation publication-type="journal"><string-name><surname>Lee</surname>, <given-names>L.</given-names></string-name>, &amp; <string-name><surname>Rose</surname>, <given-names>R. C.</given-names></string-name> (<year>1998</year>). <article-title>A frequency warping approach to speaker normalization</article-title>. <source>IEEE Transactions on Speech &amp; Audio Processing</source>, <volume>6</volume>(<issue>1</issue>), <fpage>49</fpage>&#8211;<lpage>60</lpage>. DOI: <pub-id pub-id-type="doi">10.1109/89.650310</pub-id></mixed-citation></ref>
<ref id="B24"><label>24</label><mixed-citation publication-type="journal"><string-name><surname>Lobanov</surname>, <given-names>B. M.</given-names></string-name> (<year>1971</year>). <article-title>Classification of Russian vowels spoken by different speakers</article-title>. <source>Journal of the Acoustical Society of America</source>, <volume>49</volume>, <fpage>606</fpage>&#8211;<lpage>608</lpage>. DOI: <pub-id pub-id-type="doi">10.1121/1.1912396</pub-id></mixed-citation></ref>
<ref id="B25"><label>25</label><mixed-citation publication-type="book"><string-name><surname>Nearey</surname>, <given-names>T. M.</given-names></string-name> (<year>1978</year>). <source>Phonetic Feature Systems for Vowels</source>. <publisher-loc>Bloomington, Indiana</publisher-loc>: <publisher-name>Indiana University Linguistics Club</publisher-name>.</mixed-citation></ref>
<ref id="B26"><label>26</label><mixed-citation publication-type="journal"><string-name><surname>Nearey</surname>, <given-names>T. M.</given-names></string-name> (<year>1989</year>). <article-title>Static, dynamic, and relational properties in vowel perception</article-title>. <source>Journal of the Acoustical Society of America</source>, <volume>85</volume>, <fpage>2088</fpage>&#8211;<lpage>2113</lpage>. DOI: <pub-id pub-id-type="doi">10.1121/1.397861</pub-id></mixed-citation></ref>
<ref id="B27"><label>27</label><mixed-citation publication-type="journal"><string-name><surname>Nearey</surname>, <given-names>T. M.</given-names></string-name>, &amp; <string-name><surname>Assmann</surname>, <given-names>P. F.</given-names></string-name> (<year>1986</year>). <article-title>Modeling the role of inherent spectral change in vowel identification</article-title>. <source>Journal of the Acoustical Society of America</source>, <volume>80</volume>, <fpage>1297</fpage>&#8211;<lpage>1308</lpage>. DOI: <pub-id pub-id-type="doi">10.1121/1.394433</pub-id></mixed-citation></ref>
<ref id="B28"><label>28</label><mixed-citation publication-type="confproc"><string-name><surname>Nordstr&#246;m</surname>, <given-names>P. E.</given-names></string-name>, &amp; <string-name><surname>Lindblom</surname>, <given-names>B.</given-names></string-name> (<year>1975</year>). <article-title>A normalization procedure for vowel formant data</article-title>. <conf-name>Proceedings of the 8th international congress of phonetic sciences</conf-name>. <conf-loc>Leeds, England</conf-loc>.</mixed-citation></ref>
<ref id="B29"><label>29</label><mixed-citation publication-type="journal"><string-name><surname>Paige</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name><surname>Zue</surname>, <given-names>V. W.</given-names></string-name> (<year>1970</year>). <article-title>Calculation of vocal tract length</article-title>. <source>IEEE Transactions on Audio and Electroacoustics</source>, <volume>18</volume>, <fpage>268</fpage>&#8211;<lpage>270</lpage>. DOI: <pub-id pub-id-type="doi">10.1109/TAU.1970.1162113</pub-id></mixed-citation></ref>
<ref id="B30"><label>30</label><mixed-citation publication-type="journal"><string-name><surname>Peterson</surname>, <given-names>G. E.</given-names></string-name>, &amp; <string-name><surname>Barney</surname>, <given-names>H. L.</given-names></string-name> (<year>1952</year>). <article-title>Control methods used in the study of vowels</article-title>. <source>Journal of the Acoustical Society of America</source>, <volume>24</volume>, <fpage>175</fpage>&#8211;<lpage>184</lpage>. DOI: <pub-id pub-id-type="doi">10.1121/1.1906875</pub-id></mixed-citation></ref>
<ref id="B31"><label>31</label><mixed-citation publication-type="journal"><string-name><surname>Reby</surname>, <given-names>D.</given-names></string-name>, &amp; <string-name><surname>McComb</surname>, <given-names>K.</given-names></string-name> (<year>2003</year>). <article-title>Anatomical constraints generate honesty: Acoustic cues to age and weight in the roars of red deer stags</article-title>. <source>Animal Behaviour</source>, <volume>65</volume>, <fpage>519</fpage>&#8211;<lpage>530</lpage>. DOI: <pub-id pub-id-type="doi">10.1006/anbe.2003.2078</pub-id></mixed-citation></ref>
<ref id="B32"><label>32</label><mixed-citation publication-type="book"><string-name><surname>Strand</surname>, <given-names>E. A.</given-names></string-name>, &amp; <string-name><surname>Johnson</surname>, <given-names>K.</given-names></string-name> (<year>1996</year>). <chapter-title>Gradient and visual speaker normalization in the perception of fricatives</chapter-title>. In <string-name><given-names>D.</given-names> <surname>Gibbon</surname></string-name> (Ed.), <source>Natural Language Processing and Speech Technology. Results of the 3rd KOVENS Conference, Bielefeld, October, 1996</source> (pp. <fpage>14</fpage>&#8211;<lpage>26</lpage>). <publisher-loc>Berlin</publisher-loc>: <publisher-name>Mouton de Gruyter</publisher-name>. DOI: <pub-id pub-id-type="doi">10.1515/9783110821895-003</pub-id></mixed-citation></ref>
<ref id="B33"><label>33</label><mixed-citation publication-type="journal"><string-name><surname>Strange</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Jenkins</surname>, <given-names>J. J.</given-names></string-name>, &amp; <string-name><surname>Johnson</surname>, <given-names>T. L.</given-names></string-name> (<year>1983</year>). <article-title>Dynamic specification of coarticulated vowels</article-title>. <source>Journal of the Acoustical Society of America</source>, <volume>74</volume>(<issue>3</issue>), <fpage>695</fpage>&#8211;<lpage>705</lpage>. DOI: <pub-id pub-id-type="doi">10.1121/1.389855</pub-id></mixed-citation></ref>
<ref id="B34"><label>34</label><mixed-citation publication-type="journal"><string-name><surname>Van Lancker</surname>, <given-names>D. R.</given-names></string-name>, <string-name><surname>Kreiman</surname>, <given-names>J.</given-names></string-name>, &amp; <string-name><surname>Cummings</surname>, <given-names>J.</given-names></string-name> (<year>1989</year>). <article-title>Voice perception deficits: Neuroanatomical correlates of phonagnosia</article-title>. <source>Journal of Clinical and Experimental Neuropsychology</source>, <volume>11</volume>(<issue>5</issue>), <fpage>665</fpage>&#8211;<lpage>674</lpage>. DOI: <pub-id pub-id-type="doi">10.1080/01688638908400923</pub-id></mixed-citation></ref>
<ref id="B35"><label>35</label><mixed-citation publication-type="journal"><string-name><surname>Wakita</surname>, <given-names>H.</given-names></string-name> (<year>1977</year>). <article-title>Normalization of vowels by vocal-tract length and its application to vowel identification</article-title>. <source>IEEE Transactions on Acoustics, Speech, and Signal Processing</source>, <volume>25</volume>, <fpage>183</fpage>&#8211;<lpage>192</lpage>. DOI: <pub-id pub-id-type="doi">10.1109/TASSP.1977.1162929</pub-id></mixed-citation></ref>
<ref id="B36"><label>36</label><mixed-citation publication-type="webpage"><string-name><surname>Watt</surname>, <given-names>D.</given-names></string-name>, &amp; <string-name><surname>Fabricius</surname>, <given-names>A.</given-names></string-name> (<year>2002</year>). <article-title>Evaluation of a technique for improving the mapping of multiple speakers&#8217; vowel spaces in the F1-F2 plane</article-title>. <source>Leeds Working Papers in Linguistics and Phonetics</source>, <volume>9</volume>, <fpage>159</fpage>&#8211;<lpage>173</lpage>. Retrieved from <uri>http://www.leeds.ac.uk/linguistics/WPL/WP2002/Watt_Fab.pdf</uri></mixed-citation></ref>
</ref-list>
</back>
</article>