1. Speech perception and linguistic experience
The perceptual similarity of any two speech sounds depends, to a great extent, on their raw acoustic similarity. However, speech perception is also mediated by the native language(s) of the hearer. The functional organization of speech sounds within a language’s phonological system strongly determines whether a given pair of segments will be well-discriminated by speakers of that language (Trubetzkoy, 1939; see Best, 1995; Best, McRoberts, & Goodell, 2001; Boomershine, Hall, Hume, & Johnson, 2008; Sebastián-Gallés, 2005 for references and recent discussion). For example, [d ð] are present in both English and in Spanish. In English these sounds are contrastive, as attested by minimal pairs like /be͡ɪd/ ‘bade’ vs. /be͡ɪð/ ‘bathe.’ In Spanish [ð] is instead a conditioned allophone of the phoneme /d/, e.g., [deðo] ‘finger’ vs. [ese ðeðo] ‘that finger’ (e.g., Harris, 1969). This difference in the function of [d ð] has consequences for speech perception: English speakers, who rely on the [d ð] contrast to distinguish word meanings, are better at discriminating these sounds than Spanish speakers, for whom [ð] is simply a predictable variant of /d/ (Boomershine et al., 2008; see too Harnsberger, 2000, 2001a, 2001b). These and related findings demonstrate that prior linguistic experience plays a significant role in conditioning the perception of speech sounds.
Even fine-grained details of linguistic experience, based on statistical properties of a hearer’s native language, may substantially impact speech perception (see Cutler, 2012 for an overview). Phoneme discrimination, for instance, appears to be sensitive to the specific acoustic parameters associated with each phoneme category in the hearer’s language (e.g., Kuhl & Iverson, 1995 and Section 5.2). More recent research has suggested that the statistical structure of the lexicon may also influence native-language speech perception (e.g., Hall & Hume, in preparation; see Hall, Hume, Jaeger, & Wedel, in preparation; Hall, Letawsky, Turner, Allen, & McMullin, 2014; Kataoka & Johnson, 2007; Vitevitch & Luce, 2016 and references there; Yao, 2011, Ch. 2 for related discussion). Among other factors, segment-level measures such as phoneme frequency and functional load (discussed in Section 5.3) seem to contribute to the relative discriminability of different phoneme pairs. The precise mechanism behind such effects is not well understood at present, a point we return to in Section 7.
It thus seems clear that statistical properties of a hearer’s native language may influence speech perception. However, we believe that the full generality of these findings has not yet been established, particularly with respect to lexical effects on speech perception. A large proportion of speech perception studies—perhaps most such studies—involve experiments with listeners who are native speakers of majority languages like English, French, Dutch, Japanese, and so on (e.g., Cutler, 2012, p. 4). This sampling bias might be unremarkable, if not for the fact that the languages most commonly used in speech perception research also share a number of structural and sociolinguistic properties. To give one example, the European languages most often used in speech perception research typically belong to the Germanic or Romance branches of the Indo-European family. The morphological structure of these languages is characteristically analytic or fusional rather than agglutinating. Since perfect minimal pairs should, intuitively, be less common in languages which tend toward longer and/or more complex words, it remains unclear whether statistical measures which refer to minimal pairs (e.g., functional load) should have the same importance in languages with relatively agglutinative morphology (see also Hall et al., in preparation; Wedel, Jackson, & Kaplan, 2013; Wedel, Kaplan, & Jackson, 2013).1 Similar questions arise with respect to neighborhood density and word-frequency effects in agglutinating languages, as longer words are likely to have fewer lexical neighbors and (possibly) low overall corpus frequencies (e.g., Vitevitch & Luce, 2016; Yao, 2011; Zipf, 1935, Ch.2).
At the sociolinguistic level, a preponderance of work in speech perception has been conducted with listeners who are highly educated and literate in their native language. Apart from general concerns about whether results obtained with such populations are really generalizable (e.g., Henrich, Heine, & Norenzayan, 2010), the bias in speech perception studies toward literate speakers of Indo-European languages is potentially relevant for understanding how phoneme-level lexical statistics interact with speech perception. It has sometimes been suggested that lexical items lack a phoneme-level encoding altogether, being stored instead with strictly gestural and/or syllabic encoding (e.g., Browman & Goldstein, 1986, 1989, 1992, etc.; Ladefoged & Disner, 2012; Lodge, 2009; Port & Leary, 2005; Silverman, 2006, 2012; Tilsen, 2016; cf. Dunbar & Idsardi, 2010; Hyman, 2015). To the extent that such models of lexical storage can account for phoneme-level statistical effects in speech perception, they would presumably attribute such effects to phonemic awareness, itself an artifact of literacy in an alphabetic writing system. Against this backdrop, further studies of speech perception among populations with non-alphabetic writing systems, or simply low literacy rates, are clearly needed. It is not our place here to adjudicate between these views, only to highlight the fact that answering such questions will require a more diverse sample of speakers and languages than currently exists in the speech perception literature.
In this article we explore how statistical measures derived from the lexicon (such as pairwise functional load) affect stop consonant discrimination in Kaqchikel, a Guatemalan Mayan language (Section 2). Kaqchikel has a number of properties, both grammatical and sociolinguistic, which differentiate it from most of the majority languages typically encountered in the speech perception literature. As discussed in Section 9, our study replicates some past results regarding the influence of segment-level distributional statistics on speech perception, and in doing so, supports the generality of such effects across different linguistic populations.
We investigate these issues using an AX discrimination study of the stop consonants of Kaqchikel (Section 3). Our emphasis in this paper is the influence of linguistic experience on speech perception in a lesser-studied language. For reasons of space we do not discuss specific patterns of pairwise consonant confusion in detail here.
Kaqchikel is a K’ichean-branch Mayan language spoken by over half a million people in southern Guatemala (Figure 1; Fischer & Brown, 1996, fn. 3; Maxwell & Hill, 2010; Richards, 2003). Like all Mayan languages, Kaqchikel has a phonemic contrast between plain voiceless plosives (/p t k q t͡s t͡ʃ/) and ‘glottalized’ plosives at corresponding places of articulation (implosive /ɓ/, ejective /tʔ kʔ qʔ t͡sʔ t͡ʃʔ/ and /ʔ/) (Table 1; Bennett, 2016; Campbell, 1977; Chacach Cutzal, 1990; Cojtí Macario & Lopez, 1990; García Matzar, Toj Cotzajay, & Coc Tuiz, 1999; Majzul, Matzar, & Serech, 2000; R. M. Brown, Maxwell, & Little, 2010, etc.).
| |Bilabial|Alveolar|Postalveolar|Velar|Uvular|Glottal|
|Stop|p ɓ|t tʔ| |k kʔ|q qʔ|ʔ|
|Affricate| |t͡s t͡sʔ|t͡ʃ t͡ʃʔ| | | |
|Fricative| |s|ʃ|x ~ χ| | |
Ejectives and implosives are reasonably uncommon cross-linguistically: Maddieson’s (2009) typological survey of 566 languages finds that 151 (27%) have either ejectives or implosives in their consonant inventories. Bennett, Tang, and Ajsivinac (in preparation) find that the ejectives of Kaqchikel closely resemble the ‘slack’ ejectives described by Lindau (1984) and Kingston (1984, 2005b): They are characteristically produced with short VOTs and weak release bursts, and cause creaky voice on adjacent voiced segments. Ejectives are sometimes realized with glottal closure following the oral release burst: This difference in release quality, along with creakiness in adjacent segments, seems to be a reliable cue to the plain~ejective distinction in Kaqchikel. Implosive /ɓ/ usually lacks a release burst entirely, being ingressive, and also conditions creaky voice on neighboring sounds (Bennett, 2016; Majzul et al., 2000). Common realizations of /ɓ/ include [ɓ̰ ɓ̥ ɓ]; less common realizations include [pʔ w̰ ʔ]. The phonetic realization of /qʔ/ is typically either [qʔ] or [ʛ̥]. These findings are all in line with past descriptions of Kaqchikel and other Eastern Mayan languages (e.g., Barrett, 1999; Bennett, 2010; DuBois, 1981; England, 1983; Kingston, 1984; Larsen, 1988; Pinkerton, 1986; Russell, 1997). Since little previous research has been done on the perception of glottalized consonants in any language, and none at all on Mayan languages, we do not have any prior expectations as to how discriminable plain~glottalized contrasts might be in Kaqchikel (on the perception of glottalized consonants outside of Mayan, see Fre Woldu, 1985; Gallagher, 2010a, 2010b, 2011, 2012, 2014; Rose & King, 2007; R. Wright, Hargus, & Davis, 2002).
The morphological system of Kaqchikel is moderately agglutinating, especially with verbs (see R. M. Brown et al., 2010; Chacach Cutzal, 1990; Coon, 2016; García Matzar et al., 1999; Kaufman, 1990). Across lexical categories, the prefixal field is mostly reserved for inflectional affixes marking aspect and person/number agreement, while the suffixal field is composed of derivational affixes (1, 2) (the adjectival root ch’u’j /t͡ʃʔuʔχ/ ‘crazy’ is in bold).
(1) ‘they went somewhere to drive me crazy’
(2) ‘our being driven crazy’
All modern Mayan writing systems are alphabetic in nature. Literacy in Kaqchikel is currently quite low, in part because written materials are not widely available (on the history of literature and literacy in Kaqchikel, see Maxwell & Hill, 2010; on literacy in Mayan languages more generally, see Brody, 2004; England, 1996, 2003, and references there). While standard orthographies exist for most Mayan languages (Bennett, Coon, & Henderson, 2016; Kaufman, 2003), there is a substantial amount of variability in writing conventions across speakers (R. M. Brown et al., 2010; England, 1996, 2003). This variation reflects, at least in part, the extensive dialect variation found for those Mayan languages which (like Kaqchikel) are spoken over a wide geographical area (e.g., R. M. Brown et al., 2010; Majzul et al., 2000; Maxwell & Hill, 2010; Richards, 2003).
3. Perception study: AX task
To investigate the role that lexical and acoustic experience play in speech perception in Kaqchikel, we carried out a simple AX discrimination task investigating perceptual similarity among stop consonants.
In this study, Kaqchikel speakers listened to pairs of [CV] or [VC] syllables over headphones. We will sometimes refer to the [CV] condition as the ‘Onset’ condition, and the [VC] condition as the ‘Coda’ condition. The vowels in a given syllable pair were always identical, but the consonants could be either identical or different. Upon listening to each pair of syllables, the participants were asked to respond Same or Different on a button box. Our underlying assumption is that incorrect Same responses to syllables containing different consonants indicate perceptual similarity between [CA]~[CB] pairs. Further details of the methodology are outlined below.
Forty-five experimental participants were recruited in Patzicía, Guatemala (Figure 1) by one of the authors (Ajsivinac), who is himself a native speaker of the Patzicía variety of Kaqchikel. These participants all had self-reported native-level fluency in Kaqchikel, a fact further confirmed by co-author Ajsivinac during conversations before and after the experimental sessions. As is typically the case in Guatemala, most of these participants were also fully bilingual in Spanish. Kaqchikel is nonetheless the primary medium of communication in Patzicía, and the language most likely spoken by our participants at home and in many public contexts. All communication before, during, and after experimental sessions was conducted in Kaqchikel (the first author, Bennett, is a second-language speaker of Kaqchikel with conversational abilities).
Participants completed a consent form and were given 200 Guatemalan quetzals (≈$27.25) for their participation. All participants were born in the department of Chimaltenango (Figure 1), where they also resided at the time of the study. Forty-one participants were born in the town of Patzicía itself, and 43 were living there at the time of the study. Ages ranged from 18 to 79 years old (Mean: 29, SD: 12.3). Thirteen male and 32 female speakers participated in the study (M:F ratio: 0.41). The skew toward female participants is typical of fieldwork in Guatemala, as women generally have greater flexibility during the workday than men. One participant was excluded from analysis for failure to complete the study.
All experimental sessions were carried out in Patzicía, Guatemala (Figure 1), in a quiet room made available to the authors for the purposes of the study. Each session took about 35 minutes to complete.
3.1.2. Stimulus design
Recording and pre-processing. The stimuli used in this study were recorded by a male native speaker of Patzicía Kaqchikel (co-author Ajsivinac). The stimuli were recorded in [pVC] and [CVp] frames. These frames were chosen for several reasons. First, the dominant shape of root morphemes in Kaqchikel (as in other Mayan languages) is /CVC/: There are few content words of the shape /CV/ or /VC/ (e.g., Bennett, 2016 and references there). Recording the materials as [pVC]/[CVp] helps minimize any phonetic artifacts which might result from recording materials that are not native-like in form. Furthermore, /VC/ roots are subject to consonant epenthesis in Kaqchikel, being realized as [ʔVC] in isolation, which makes it effectively impossible to record simple [VC] syllables. A plain consonant (/p/) was chosen as the frame consonant, rather than an ejective or implosive, to avoid any coarticulatory glottalization on the vowel (see Bennett, 2016; Bennett et al., in preparation). The frames [pVC] and [CVp] were recorded for all combinations of the vowels /a i u/ matched with each of the 22 phonemic consonants of Kaqchikel (Table 1).
The stimuli were presented for recording in random order, using an HTML platform (El Hattab, 2016). For each stimulus, the speaker was asked to produce 3 repetitions with roughly even intonation. Only the best repetition for each stimulus was selected for further processing and presentation. Each recording was first manually annotated on the segmental level using the acoustic analysis program Praat (Boersma & Weenink, 2016). A new set of stimuli was then extracted at these segmental boundaries, with the frame consonant /p/ excluded. The exclusion of /p/ was determined on the basis of the waveform, spectrogram, and listener audition (by co-author Bennett). Following Cutler, Weber, Smits, and Cooper (2004), the stimuli were then amplitude-normalized with respect to the rms amplitude of the vowel (set at 60 dB).
Embedding in noise. In order to increase the likelihood of response errors in our study, we masked the stimuli in speech-shaped noise at a signal-to-noise ratio (SNR) of 0 dB. After amplitude normalization, each stimulus was padded with 250 ms of preceding silence, and 250 ms of following silence. The padded stimuli were then embedded in speech-shaped noise at 0 dB SNR (on our choice of SNR, see Meyer, Dentel, & Meunier, 2013; Tang, 2015, Ch. 3.7).2
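As a rough illustration of the padding and masking steps just described, the following Python sketch pads a signal with 250 ms of silence on each side and adds noise scaled to a target SNR of 0 dB. It is a sketch under stated assumptions, not the procedure actually used to prepare the stimuli: all function names are hypothetical, Gaussian white noise stands in for speech-shaped noise, a synthetic sine wave stands in for a recorded syllable, and the SNR is assumed to be computed from RMS amplitudes over the unpadded speech samples.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def pad_with_silence(samples, sample_rate, pad_ms=250):
    """Add pad_ms of digital silence before and after the signal."""
    pad = [0.0] * int(sample_rate * pad_ms / 1000)
    return pad + list(samples) + pad


def snr_db(speech, noise):
    """Signal-to-noise ratio in dB, based on RMS amplitudes."""
    return 20 * math.log10(rms(speech) / rms(noise))


def embed_in_noise(speech, noise, sample_rate, target_snr_db=0.0, pad_ms=250):
    """Pad the speech, then add noise scaled to the target SNR.

    Assumption: the SNR is defined over the unpadded speech samples.
    Returns the mixture and the scaled noise that was added.
    """
    padded = pad_with_silence(speech, sample_rate, pad_ms)
    if len(noise) < len(padded):
        raise ValueError("noise must be at least as long as the padded speech")
    segment = noise[:len(padded)]
    gain = rms(speech) / (rms(segment) * 10 ** (target_snr_db / 20))
    scaled = [gain * n for n in segment]
    mixture = [s + n for s, n in zip(padded, scaled)]
    return mixture, scaled


# Toy input: a 200 ms sine wave stands in for a recorded syllable.
SAMPLE_RATE = 16000
speech = [0.1 * math.sin(2 * math.pi * 220 * t / SAMPLE_RATE)
          for t in range(SAMPLE_RATE // 5)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(SAMPLE_RATE)]  # white-noise stand-in

mixture, scaled_noise = embed_in_noise(speech, noise, SAMPLE_RATE)
```

At 0 dB SNR the scaled noise has the same RMS amplitude as the speech, so the masker is as intense as the signal it covers.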
Stimulus pairs. As noted above, participants in this study listened to [CV] or [VC] syllables presented in pairs: [CAV]~[CBV] or [VCA]~[VCB]. The perception study was designed to focus on perceptual confusion between the stops /p t k q ɓ tʔ kʔ qʔ ʔ/. Our Target Pairs were pairs of [CV] or [VC] syllables in which both consonants belonged to this set of stops. All other consonants of Kaqchikel were included as fillers in this study, so that participants also heard many filler pairs in which at least one consonant was not a stop. In each [CV]/[VC] syllable the vowel was always one of /a i u/, and vowel quality was always matched between syllables presented in a pair.
There were 270 distinct target pairs in our study, ignoring the order of presentation of the items in each pair. This included 54 Same target pairs (9 stops × 3 vowels × 2 syllable templates) and 216 Different target pairs (9 choose 2 (= 36) consonant pairs × 3 vowels × 2 syllable templates). There were 1248 additional filler pairs, which reflect all possible combinations of non-stop consonants (n = 13) with other consonants (n = 22), across 3 vowel and 2 syllable contexts. The ratio of same:different trials in the study was set at 3:4 (including both filler and target pairs).
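The pair counts above can be checked by enumerating unordered consonant pairs. The sketch below is only an arithmetic check: the stop labels come from the text, while the 13 non-stop consonants are represented by placeholder labels (`N1` to `N13`), since only their number matters here.

```python
from itertools import combinations, combinations_with_replacement

STOPS = ["p", "t", "k", "q", "ɓ", "tʔ", "kʔ", "qʔ", "ʔ"]
NON_STOPS = [f"N{i}" for i in range(1, 14)]  # placeholder labels for 13 consonants
N_CONTEXTS = 3 * 2  # 3 vowels x 2 syllable templates

# Target pairs: both consonants are stops.
same_targets = len(STOPS) * N_CONTEXTS
different_targets = len(list(combinations(STOPS, 2))) * N_CONTEXTS

# Filler pairs: unordered pairs (identical pairs included) in which
# at least one member is not a stop.
all_pairs = combinations_with_replacement(STOPS + NON_STOPS, 2)
filler_pairs = [
    (a, b) for a, b in all_pairs if a in NON_STOPS or b in NON_STOPS
]
fillers = len(filler_pairs) * N_CONTEXTS

print(same_targets, different_targets, same_targets + different_targets, fillers)
```

The filler count works out because 208 of the 253 unordered pairs over 22 consonants contain at least one non-stop, and 208 × 6 = 1248.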
In order to keep the experiment to a reasonable length, we divided our target pairs into 30 different lists. For each list, we randomly sampled 72 Different target pairs, and 54 Same target pairs. Sampling of Same trials was done with replacement, so that each list could contain multiple instances of a given Same pair (up to a maximum of 3 repetitions).
Within each list we also included 74 filler items composed of consonant pairings that included at least one non-stop consonant. These 74 filler items were sampled with the same 3:4 ratio used to balance same:different trials for the target pairs (32 Same fillers, 42 Different fillers). This resulted in 200 trials per list. The 45 participants were assigned a list in order: Since there were only 30 stimulus lists, the first 15 lists were assigned to two participants each, and the remaining 15 lists assigned to just one participant each. The order of presentation for the pairs in each list was randomized across participants.
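One way to picture the construction of a single 200-trial list is sketched below. This is an illustration under assumptions rather than the script actually used: the function names are hypothetical, the non-stop consonants are placeholder labels, Different targets and all fillers are assumed to be drawn without replacement, and the cap of 3 repetitions on Same targets is implemented by simple rejection.

```python
import random
from collections import Counter
from itertools import combinations, product

STOPS = ["p", "t", "k", "q", "ɓ", "tʔ", "kʔ", "qʔ", "ʔ"]
FILLERS = [f"N{i}" for i in range(1, 14)]  # placeholder non-stop consonants
CONTEXTS = list(product(["a", "i", "u"], ["CV", "VC"]))


def sample_capped(pool, k, cap, rng):
    """Sample k items with replacement, using each item at most cap times."""
    counts, chosen = Counter(), []
    while len(chosen) < k:
        item = rng.choice(pool)
        if counts[item] < cap:
            counts[item] += 1
            chosen.append(item)
    return chosen


def build_list(rng):
    """Return one shuffled list of 200 (C_A, C_B, vowel, template) trials."""
    same_targets = [(c, c, v, pos) for c in STOPS for v, pos in CONTEXTS]
    diff_targets = [
        (a, b, v, pos) for a, b in combinations(STOPS, 2) for v, pos in CONTEXTS
    ]
    same_fillers = [(c, c, v, pos) for c in FILLERS for v, pos in CONTEXTS]
    diff_fillers = [
        (a, b, v, pos)
        for a, b in combinations(STOPS + FILLERS, 2)
        if a in FILLERS or b in FILLERS
        for v, pos in CONTEXTS
    ]
    trials = (
        rng.sample(diff_targets, 72)              # Different targets
        + sample_capped(same_targets, 54, 3, rng)  # Same targets, max 3 repeats
        + rng.sample(same_fillers, 32)            # Same fillers (assumed no repeats)
        + rng.sample(diff_fillers, 42)            # Different fillers
    )
    rng.shuffle(trials)
    return trials


trials = build_list(random.Random(42))
```

Each call with a different seed yields a different list; generating 30 such lists would mirror the number of lists described above.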
3.1.3. Stimulus presentation
Presentation of the stimuli and logging of participant responses were carried out with a script written in PsychoPy (Version 1.82.01; Peirce, 2007) and executed on a laptop computer. As noted above, this script assigned each participant to one of 30 stimulus lists, and automatically randomized presentation of stimuli within each list.
Prior to the beginning of the experiment, participants were told that they would be listening to a series of syllable pairs, and that they would have to respond as to whether they thought the syllables in each pair were the same or different. They were also told that the stimuli would be embedded in noise of some kind, making them difficult to hear, and that they should not expect the syllables to correspond to actual words of Kaqchikel in most cases (see Section 5.3). This information was provided because pilot testing suggested that the presence of noise and the nonce-word status of the stimuli might be confusing to some participants.
On each trial, participants were first presented with a cross in the center of the screen, lasting 500 ms. The screen then changed to a display showing a green box on the left side of the screen (corresponding to Same responses) and a red box on the right (corresponding to Different responses). Simultaneous with this change in the display, the first member of the stimulus pair for that trial began to play over headphones (Shure SRH 440 over-ear headphones, connected to the computer via an external FiiO E10 USB preamp set at a fixed level across sessions). The order of presentation of the two syllables in a stimulus pair was randomized on each trial.
Upon hearing each stimulus pair, participants responded as to whether they thought the two syllables were identical or different, using a PST Serial Response Box attached to the laptop. Same responses were entered with the leftmost key, and Different responses with the rightmost key; the position of the response keys was not counter-balanced across participants.
Participants were instructed to respond as quickly and as accurately as possible. Participants could take as long as they liked to respond, but trials taking longer than 10 seconds were followed with a reminder to respond as quickly as possible (a yellow warning sign symbol). Even without significant time pressure, participants responded in under one second on most trials (mean RT = 854 ms, median RT = 664 ms). Ten practice trials were completed prior to the actual experiment; these practice trials always involved syllable pairs which were not included in the test list for that speaker’s session.
After each participant was comfortable with the practice items, they began the actual experiment. The inter-stimulus interval (ISI) between the two stimuli in each pair was set to 300 ms. The inter-trial interval was set at 1500 ms (1000 ms of blank screen followed by the 500 ms cross fixation at the beginning of each trial). Participants were permitted to take a break after every 40 trials. The stimuli were presented at a fixed volume across trials, set at a comfortable level for each listener.
To verify that participants had completed the task as requested, we computed d′ scores (Macmillan & Creelman, 2005) for perceptual confusions between each pair of stop consonants, collapsing comparisons across all participants. d′ is related to accuracy, but controls for response bias, in particular the tendency to favor one of the two responses regardless of what the stimulus is. The mean d′ score for comparisons in the Onset condition was 1.62, and the mean d′ score for comparisons in the Coda condition was 1.82.
We believe that these are reasonably good d′ values, given that our participants had limited or no prior experience as experimental participants and were not necessarily accustomed to using a computer.3 We conclude that the participants in this study completed the task as requested.4
A 9-by-9 plot summarizing the d′ scores for all target stop pairs, collapsed across vowel context and syllable position, is provided as an appendix (Appendix D).
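For readers unfamiliar with d′, the sketch below shows one common way to compute it from response counts: the difference between the z-transformed hit rate and false-alarm rate. Several details are assumptions rather than facts about the analysis reported here: a ‘hit’ is taken to be a Different response to a Different pair, a ‘false alarm’ a Different response to a Same pair, the basic yes/no formula is used (same-different designs are often analyzed with other models; see Macmillan & Creelman, 2005), and a log-linear correction is applied to avoid infinite values when a rate is 0 or 1. The function name and the counts in the example are hypothetical.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Yes/no d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (add 0.5 to each count, 1 to each total)
    keeps both rates strictly between 0 and 1.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts for one consonant pair, pooled across listeners.
example = d_prime(hits=80, misses=20, false_alarms=20, correct_rejections=80)
chance = d_prime(hits=50, misses=50, false_alarms=50, correct_rejections=50)
```

A listener who responds Different equally often to Same and Different pairs obtains d′ = 0, whatever their overall preference for one response; this is the sense in which d′ controls for response bias.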
3.1.4. ISI length and processing mode
The ISI used in this study can be estimated in at least two ways. The shortest estimate would be 300 ms, the length of the silent interval between the two stimuli in a given pair. If we also include the noise padding present at the beginning and end of each stimulus, then the ISI would instead be 800 ms in length (250 ms of noise padding before/after each syllable + 300 ms silence between items). We note these values because the duration of the ISI in AX discrimination and related tasks is known to affect the way in which listeners process auditory stimuli (Babel & Johnson, 2010; Fox, 1984; Kingston, 2005a; Kingston, Levy, Rysling, & Staub, 2016; McGuire, 2010; Pisoni, 1973, 1975; Pisoni & Tash, 1974; Werker & Logan, 1985; Werker & Tees, 1984b). Shorter ISIs, particularly those without intervening noise, tend to favor a more acoustically-oriented mode of speech processing which does not necessarily engage the phonemic and lexical levels of speech encoding (i.e., short ISIs encourage a ‘prelinguistic’ mode of listening; see Section 8). Longer ISIs, typically at 500 ms or above, seem to condition responses which are more strongly affected by the phonemic and lexical structure of the listener’s native language (i.e., a ‘linguistic’ mode of speech processing). We mention these considerations because an ISI of 800 ms may have facilitated a linguistically-oriented mode of listening, a fact which is relevant given our goal of linking perceptual confusions to statistical facts about words and segments in Kaqchikel.
4. Two corpora for Kaqchikel: Assessing the effect of statistical and acoustic factors on speech perception
The overarching goal of this study was to assess the extent to which prior linguistic experience with Kaqchikel might affect consonant discrimination in a perceptual task. To that end, we examined acoustic, segmental, and word-level factors which could play a role in conditioning consonant confusions. Doing so necessitated the development of two corpora for Kaqchikel, which are described in the following sections.
4.1. Spoken Kaqchikel: The Sololá corpus
4.1.1. Corpus collection
The Sololá corpus is a collection of audio recordings of spontaneous spoken Kaqchikel. This corpus consists of recordings made by two of the authors in Sololá, Guatemala (Figure 1) in 2013 (Bennett & Ajsivinac, in preparation). Sixteen speakers of the Sololá variety of Kaqchikel contributed to this corpus and shared short, spontaneous narratives of their own choosing for the recording.
Fifteen (out of 16) of the speakers were born in the department of Sololá. The remaining speaker was born in the department of Sacatepéquez, to the east of Sololá. As of 2013, the speakers were all living in the department of Sololá, with six living in the city of Sololá, and ten in other towns. Six of these speakers were male, and 10 female; their ages ranged from 19–84 years old (mean = 33 years, median = 28 years, SD = 15.4). The speakers all had self-reported native-level fluency in Kaqchikel, a fact further confirmed by co-author Ajsivinac during conversations before and after the recording sessions. Most speakers reported using Kaqchikel as the primary language of communication at home.
All speakers were recorded using a headset microphone (Audio-Technica ATM73a) and solid-state portable recorder (Fostex FR-2LE), at a 48 kHz sampling rate with 24 bit quantization. The recordings were subsequently downsampled to 16 kHz for forced alignment and acoustic analysis (see below).
4.1.2. Corpus processing
In total, the corpus amounts to about 4 hours of recorded speech (≈40,000 word tokens). The entire corpus was transcribed orthographically by one of the authors, a native speaker of Kaqchikel (Ajsivinac). We took a subset of this corpus, consisting of approximately 3.5 minutes of audio per speaker (about 50 minutes in total, consisting of 5218 word tokens and 2754 stop consonant tokens), and annotated it phonetically using forced alignment tools. First, the transcriptions in this subset of the corpus were double-checked by another author (Bennett, a trained phonologist and phonetician, as well as an L2 speaker of Kaqchikel with conversational-level abilities). The orthographic transcriptions for this portion of the corpus were then converted into a surface phonetic transcription with a suite of Python scripts (http://www.python.org/) implementing grapheme-to-phoneme conversion and several major allophonic rules (see DiCanio et al., 2013 for discussion).5
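The grapheme-to-phoneme step can be pictured as a greedy, longest-match lookup over the orthography. The sketch below is not the script suite mentioned above: the function name is hypothetical, the mapping table is a partial one assumed from the standard Kaqchikel orthography and the phoneme symbols used in this paper, lax vowels (ä, ë, ï, ö, ü) are deliberately left unmapped, and no allophonic rules are applied.

```python
# Partial grapheme-to-phoneme table (an assumption for illustration only).
GRAPHEME_TO_PHONEME = {
    "ch'": "t͡ʃʔ", "tz'": "t͡sʔ",
    "b'": "ɓ", "t'": "tʔ", "k'": "kʔ", "q'": "qʔ",
    "ch": "t͡ʃ", "tz": "t͡s",
    "'": "ʔ", "x": "ʃ", "j": "χ", "y": "j",
    "p": "p", "t": "t", "k": "k", "q": "q", "s": "s",
    "m": "m", "n": "n", "l": "l", "r": "r", "w": "w",
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
}
MAX_GRAPHEME_LENGTH = max(len(g) for g in GRAPHEME_TO_PHONEME)


def to_phonemes(word):
    """Convert an orthographic word to a list of phoneme symbols.

    Longer graphemes are tried first, so that "ch'" is matched before "ch".
    Raises ValueError on any grapheme missing from the table.
    """
    word = word.lower().replace("\u2019", "'")  # normalize curly apostrophes
    phonemes, i = [], 0
    while i < len(word):
        for size in range(MAX_GRAPHEME_LENGTH, 0, -1):
            chunk = word[i:i + size]
            if chunk in GRAPHEME_TO_PHONEME:
                phonemes.append(GRAPHEME_TO_PHONEME[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no mapping for {word[i]!r} in {word!r}")
    return phonemes


crazy = to_phonemes("ch’u’j")  # the root glossed as 'crazy' in Section 2
```

Trying the longest grapheme first matters because the apostrophe marks glottalization after a consonant letter but also spells the glottal stop on its own.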
These phonetic transcriptions, and their associated audio, were then submitted to segment-level forced alignment using the Prosodylab-Aligner (http://prosodylab.org/tools/aligner/; Gorman, Howell, & Wagner, 2011). Forced alignment is a computational technique for semi-automatically time-aligning audio files with a corresponding transcription. The Prosodylab-Aligner takes as its input an audio file with an associated sentence-level transcription, and produces a time-aligned Praat TextGrid with annotations at the word and segment levels. An alignment model was first trained on the 50-minute sub-corpus (3 training rounds of 1000 epochs each), then applied to that same corpus to generate the segmental annotations. A total of 2754 stops were annotated at the segmental level using this technique. Alignments were visually inspected by one of the authors (Bennett, a trained phonologist and phonetician), but not hand-corrected for the purposes of this analysis (see DiCanio et al., 2013 on the distribution of error types in forced alignment).6
4.1.3. Corpus criticism
There are several advantages to using a spoken corpus of this type for acoustic analysis and psycholinguistic research. First, the Sololá corpus is a corpus of spontaneous speech, and is therefore more naturalistic, and more representative of everyday Kaqchikel speech, than a corpus of read or elicited materials. Second, the materials in such a corpus—which include stories and folktales that are traditionally told in the Sololá region—may be of greater interest to the Kaqchikel language community than recordings of isolated wordlists or prompted sentences.
There are also potential drawbacks to using a corpus of this type. While the Sololá corpus has the advantage of being naturalistic and thus more ecologically valid than certain other types of audio corpora, the content of the recordings is not controlled in any way (see Xu, 2010 for discussion). As a consequence, data sparsity issues emerge with respect to certain phonetic and phonological structures. For example, ejective /tʔ/ is quite rare in our data (<1% of stops). (This is to be expected, as ejective /tʔ/ is known to have low type and token frequencies in Mayan languages; Bennett, 2016; England, 2001.) The paucity of /tʔ/ tokens in the corpus clearly precludes any strong conclusions about the properties of this sound. Additionally, there are relatively few glottalized stops in non-prevocalic (≈coda) position in our corpus (<5% of all stops). This owes in part to the fact that most stops in our corpus, regardless of laryngeal state, occur in pre-vocalic position (85%).
Although the Sololá corpus is a corpus of spontaneous speech, it is also a corpus of monologues rather than dialogues. As such, the speech genre of the corpus may be less than fully naturalistic, and may further show the effects of stereotyped or ritualistic speech patterns associated with storytelling in Kaqchikel. Nonetheless, we believe that the size and composition of this corpus are appropriate for drawing at least some initial conclusions about the phonetic structure of everyday Kaqchikel speech.
4.2. Written corpus
4.2.1. Corpus collection
One goal of this paper is to explore how the statistical structure of Kaqchikel—both in the lexicon (i.e., the vocabulary), and in actual spoken or written usage—might influence speech perception. To answer this question we needed a reasonably large corpus of written Kaqchikel over which segmental and word-level statistics could be calculated. To the best of our knowledge there are no structured corpora of written Kaqchikel currently available (apart from dictionaries like Macario, Cutzal, & Semeyá, 1998; Majzul, 2007), and certainly none that are in a digitized, searchable form. It was therefore necessary to construct a novel, digitized written corpus of Kaqchikel in order to assess the statistical patterning of words and segments in the language.
Our corpus is constructed from religious texts, spoken transcripts, government documents, medical handbooks, and other educational books written in Kaqchikel—essentially all the materials we could find that were already digitized or in an easily digitizable format. The corpus contains approximately 0.7 million word tokens (around 30,000 word types).
The corpus underwent further processing and cleaning before being used to calculate word- and segment-level corpus measures for Kaqchikel. Details on our processing and cleaning methods are given in Appendix A.
4.2.2. Corpus criticism
Corpus size and composition. Modern corpora of majority languages like English are quite large, on the order of hundreds of millions of word tokens in size (e.g., the Subtlex-UK corpus, 201 million words, van Heuven, Mandera, Keuleers, & Brysbaert, 2014). Spoken corpora tend to be smaller, but still typically contain several million words (e.g., the Corpus of Spontaneous Japanese, 7 million words, Maekawa, 2003). Developing corpora of this size is simply not feasible for under-resourced languages like Kaqchikel, which may lack large quantities of written text (particularly digitized text), as well as the economic infrastructure needed to support the collection and annotation of large corpora.
For this reason, in compiling our written corpus we drew on any and all written Kaqchikel texts that we could find. We purposefully excluded dictionaries and collections of neologisms from the corpus because these sources are likely to contain words which are not familiar to most Kaqchikel speakers.
In several respects, our written corpus is far from ideal. First, the corpus is relatively small, containing only ≈0.7 million word tokens. It has been argued that a corpus of 16 million word tokens or more is needed for calculating stable estimates of the statistical properties of low frequency words (Brysbaert & New, 2009).
Second, our corpus contains a mix of both spoken transcripts and written sources. Ideally we would make use of a corpus consisting exclusively of spoken transcripts, given that most Kaqchikel speakers are not literate in the language, or otherwise have limited experience reading in Kaqchikel. Even for majority languages with higher literacy rates, it has been argued that spoken corpora are more representative of speakers’ actual linguistic experience than written corpora (Brysbaert & New, 2009; Keuleers, Lacey, Rastle, & Brysbaert, 2012).
Additionally, it is important to recognize that the Kaqchikel orthography is only semi-standardized, and orthographic practices vary across dialects and speakers of the language (R. M. Brown et al., 2010, pp. 3–4). For example, our corpus contains both nb’än and nub’än as forms of ‘(s)he does it,’ reflecting the fact that some dialects omit the 3SG.ERG marker -u- in particular morphological contexts (Majzul et al., 2000, pp. 69–70).
Genre
The representativeness of a corpus refers to how closely a corpus reflects actual language use in a particular population (e.g., Atkins, Clear, & Ostler, 1992; Biber, 1993). One measure of representativeness is the extent to which the texts and genres included in a corpus correspond to the kinds of texts (or linguistic interactions) that speakers in the target population typically engage with.
The written Kaqchikel corpus described here is not balanced by genre, nor is it particularly speech-like with respect to the thematic content of the materials that it includes. To get a rough sense of how far the corpus deviates from naturalistic speech, we used the transitional probabilities between words in the corpus to create a trigram Markov-chain language model. Using this Markov-chain language model, we stochastically generated (‘babbled’) some random samples of Kaqchikel. One such sample is shown below.
A sample of Markov-chain Kaqchikel
ri taq Mechanpomal moloj: achoq pa ruwi’ yesamäj. K’ïy mul nqak’axaj nkib’ij chi ri xaqixaq nuqasaj ri k’atän jub’a’. K’o b’ey chuqa’ nq’axon nchulun o taq nsinan. K’ïy b’ey man ntane’ ta ri retal nuya’ chi ke ronojel ri qamolojri’ïl. Richin nawetamaj más República Democrática del Congo Ruanda Jun peraj chi re ri raqän ya’ Jordán. Ri Jehová rik’in ri más ütz chuqa’ man ütz ta yojch’on rik’in jun winäq…
Loose English translation of the Markov-chain Kaqchikel sample
the Mechanpomal group: on top of what do they work. Many times we listen to what they say about wormwood which lowers the heat a little. There are times too it hurts when he urinates or has sexual relations. Many times it doesn’t stop, the sign it gives to all of our organizations. In order for you to know more Democratic Republic of the Congo Rwanda A shawl for the river of Jordan. Jehova is with the best and it isn’t good that we talk to a man…
It is clear from the sample that the written corpus is not particularly speech-like, although it does contain a good range of lexical items covering the topics of religion, geography, and agriculture. Despite the fact that the genres represented by our corpus diverge somewhat from everyday speech, Tang, Bennett, and Ajsivinac (in preparation) show that word frequencies estimated from this written corpus can be used to predict the duration of words in our corpus of spontaneous spoken Kaqchikel (Section 4.1; see C. E. Wright, 1979 for the classic finding that word frequency and word duration are correlated in English). We take this result as indirect evidence that our written corpus roughly approximates the lexical structure of Kaqchikel as it is actually spoken.
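The babbling procedure used to produce the sample above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used for the study: the function names `train_trigram_model` and `babble` are hypothetical, the toy token list is assembled from words in the sample, and each word is drawn in proportion to how often it follows the two preceding words.

```python
import random
from collections import Counter, defaultdict

def train_trigram_model(tokens):
    """Count which word follows each pair of consecutive words."""
    model = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)][w3] += 1
    return model

def babble(model, length, rng):
    """Sample words one at a time, conditioning on the previous two words."""
    state = rng.choice(sorted(model))  # random starting word pair
    output = list(state)
    while len(output) < length:
        followers = model.get(state)
        if not followers:  # dead end: this pair is never followed by anything
            break
        words = list(followers)
        weights = [followers[w] for w in words]
        next_word = rng.choices(words, weights=weights, k=1)[0]
        output.append(next_word)
        state = (state[1], next_word)
    return output

# Toy token list (hypothetical); the actual corpus has roughly 0.7 million tokens.
toy = "k’ïy mul nqak’axaj nkib’ij chi ri xaqixaq nuqasaj ri k’atän jub’a’".split()
model = train_trigram_model(toy)
sample = babble(model, 8, random.Random(0))
print(" ".join(sample))
```

Because every word is conditioned only on the two words before it, the output is locally fluent but globally incoherent, which is what makes the genre mix of the source texts easy to see.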
For present purposes, the question is whether measures like functional load or phoneme frequency can be reliably estimated from this corpus. Work in progress (Tang, Bennett, & Ajsivinac, 2015) suggests that estimates of these measures are stable even over small sub-samples of this corpus (e.g., 20,000 words; see too Dockum & Campbell-Taylor, 2017; Gasser & Bowern, 2014; Macklin-Cordes & Round, 2015). As such, we believe that our corpus is indeed of sufficient size for the estimation of these measures.
5. Predictions and model design
In this section we consider how acoustic factors, word-level statistics, and segmental statistics might interact with speech perception in Kaqchikel. We also describe the basic modeling procedure we used to test whether these predictions were borne out in our results (Section 6).
5.1. Statistical model
We analyzed participant accuracy on each trial of the AX discrimination task with a mixed effects logistic regression in R (R Development Core Team, 2013), using the glmer function in the lme4 library (Bates, Maechler, & Bolker, 2011). Recall that each trial could either contain two identical stimuli (the Same condition), or two different stimuli (the Different condition). We interpreted incorrect responses on Different trials as evidence that a given pair of stimuli was perceptually similar (having been mistaken as identical). Same trials are not similarly informative regarding perceptual confusion; we therefore analyzed only the accuracy of participant responses on Different trials.7
5.2. Acoustic similarity
One of our main expectations is that greater acoustic similarity between a pair of syllables should predict greater perceptual similarity between those syllables (e.g., Dubno & Levitt, 1981). Two acoustic similarity measures were considered. The first measure is Stimulus Similarity—the raw acoustic similarity of the stimuli themselves. We expected stimulus similarity to have a substantial effect on stimulus discrimination in our study.
The second measure is Category Similarity—the similarity of two phoneme categories based on prior phonetic experience. Following a large body of work in Exemplar Theory, we assume that phonemic categories are associated with episodic memory traces (or exemplars), which are phonetically-rich representations of that category as previously encountered on specific occasions in actual speech (Gahl & Yu, 2006; Goldinger, 1996, 1998; Pierrehumbert, 2001, 2002; Wedel, 2004 and references there). On this view, the category similarity of two phonemes can be conceived of as the extent to which their exemplar clouds show overlap in phonetic space (see also Yu, 2011).
Figures 2 and 3 illustrate the importance of distinguishing stimulus similarity and category similarity. These figures show two hypothetical exemplar clouds for the phonemes /k/ and /p/ over some acoustic dimension(s) (say, VOT and burst intensity). The two clouds represent a collection of prior phonetic experiences that the listener has associated with each phoneme. In the context of our study, the two dots represent two stimuli presented to the listener for discrimination (say [ka] and [pa]).
In Figure 2, the exemplar clouds are non-overlapping (i.e., VOT values for /k/ and /p/ are typically quite distinct). As a consequence, listeners would likely conclude that the two stimuli (the dots) belong to different categories. In Figure 3, the clouds are substantially more overlapped, with the two stimuli falling in the overlapping region. This overlap between the two categories increases the level of uncertainty for the listener, making it more difficult to determine whether the two stimuli belong to different phonemic categories. To the extent that overlap along particular phonetic dimensions makes listeners less likely to rely on those dimensions for category discrimination (e.g., Holt & Lotto, 2006), category overlap may influence discrimination even for stimuli which are acoustically unambiguous (i.e., in non-overlapping regions of Figure 3), by reducing overall sensitivity to certain potential cues to consonant identity.
5.2.1. Stimulus similarity
It is intuitively clear that higher levels of acoustic similarity between stimuli should lead to higher rates of confusion between those stimuli in a discrimination task (e.g., Dubno & Levitt, 1981; Hall & Hume, in preparation; Redford & Diehl, 1999 and many others). To evaluate the importance of other factors in this study—particularly those related to prior phonetic experience (category similarity)—stimulus similarity must therefore be included as a control predictor.
To capture the acoustic similarity between the stimulus pairs in each trial, an acoustic distance metric was applied to each stimulus pair (after embedding the stimuli in noise). Such a metric should allow us to capture the raw acoustic information that could be used by the listeners to perform the AX task, even without accessing higher-level perceptual, phonemic, or lexical processing. Our acoustic distance metric was calculated using Phonological CorpusTools (Hall, Allen, Fry, Mackie, & McAuliffe, 2015). First, the waveform of each stimulus was transformed into mel-frequency cepstral coefficients (MFCCs) (Mielke, 2012), a common re-representation of the acoustic signal used widely in speech recognition research. The number of MFCCs was set to 12, as this allows the model to capture speaker-independent information about acoustic similarity (see http://corpustools.readthedocs.io/en/latest/acoustic_similarity.html). Dynamic Time Warping (DTW), another common speech processing technique, was then used to compute an explicit distance metric on the basis of the MFCC-transformed stimuli (Mielke, 2012; Sakoe & Chiba, 1971).
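To make the DTW step concrete, the following is a minimal textbook sketch of the dynamic-programming recurrence. It is not the Phonological CorpusTools implementation: the function names are hypothetical, each stimulus is assumed to have already been converted to a sequence of frames (for example, 12-dimensional MFCC vectors), and the frame-to-frame cost is assumed to be Euclidean distance.

```python
import math

def euclidean(u, v):
    """Distance between two frames of equal dimensionality."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(seq_a, seq_b):
    """Cost of the cheapest monotonic alignment between two frame sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i frames of seq_a
    # with the first j frames of seq_b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Hypothetical one-dimensional "frames": b is a time-stretched copy of a.
a = [[0.0], [1.0], [2.0]]
b = [[0.0], [1.0], [1.0], [2.0]]
c = [[0.0], [3.0], [2.0]]
print(dtw_distance(a, b), dtw_distance(a, c))
```

The warping is what allows two stimuli of different durations to be compared: a time-stretched copy of a signal incurs no cost, while a genuine spectral difference does.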
Our statistical model included a fixed effect for Stimulus Similarity, representing the acoustic distance between two stimuli according to this DTW metric. This predictor was z-score normalized using the scale() function in R.
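For readers less familiar with R, the default behavior of scale() (centering on the mean and dividing by the sample standard deviation) can be sketched in a few lines of Python. The function name `zscore` and the distance values are hypothetical.

```python
from statistics import mean, stdev

def zscore(values):
    """Center on the mean and divide by the sample SD (n - 1 denominator)."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]

distances = [40.0, 50.0, 60.0]  # hypothetical raw DTW distances
print(zscore(distances))  # [-1.0, 0.0, 1.0]
```

After this transformation every predictor is expressed in standard deviation units, which makes the regression coefficients of differently scaled predictors directly comparable.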
5.2.2. Category similarity
To calculate the acoustic similarity of two stops at the phonemic level (category similarity), we computed DTW over pairs of stops as they occur in our acoustic corpus (Section 4.1). We further limited our comparisons to stop consonants occurring in similar environments, since the confusability of any given pair of stops may depend on the phonotactic and prosodic context (e.g., Chang, Plauché, & Ohala, 2001; Cutler et al., 2004).
- Using our acoustic corpus, we identified all instances of /p t k q ɓ tʔ kʔ qʔ/. These were divided into two groups: (a) pre-vocalic ([CV]); and (b) post-vocalic, but non-prevocalic ([VC(C/#)]).
- Using the segmentation provided by forced alignment (Section 4.1), the waveforms corresponding to each target stop consonant and the vowel adjacent to it were extracted individually. This gave two sets of waveforms corresponding to /p t k q ɓ tʔ kʔ qʔ/ in [CV] and [VC] transitions.
- These waveforms were further divided into subsets on the basis of the vowel, matching [CV] and [VC] waveforms according to the quality and stress profile of the vowel. For example, ['ke] could be compared with ['te], but not with [te], ['ti], ['kɛ], or ['ek].
- Within each matched [CV] or [VC] subset, we computed an acoustic distance measure (DTW) between all pairs of waveforms within that set which contained different stop consonants. For example, if ['ke] occurred twice in the corpus, and ['te] occurred three times, we would compute six pairwise acoustic distance measures: ['ke]1~['te]1, ['ke]1~['te]2, ['ke]1~['te]3; and ['ke]2~['te]1, ['ke]2~['te]2, ['ke]2~['te]3.
- The outcome of this procedure is a set of acoustic distances between tokens of the stop categories /p t k q ɓ tʔ kʔ qʔ/, grouped according to their syllabic context (onset/coda) and the properties of the preceding/following vowel. As an aggregate measure of category similarity, we took the mean and standard deviation of these values for each pair of stops.
These measures of category similarity were then used as predictors in our analysis of perceptual similarity (Section 6).8
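The pairwise comparison within one matched subset can be sketched as follows. This is an illustration only: `category_similarity` is a hypothetical name, the tokens are stand-in numbers rather than waveforms, and the absolute difference stands in for the DTW distance over MFCCs. Aggregation across vowel and position subsets is not shown.

```python
from itertools import product
from statistics import mean, stdev

def category_similarity(tokens_a, tokens_b, distance):
    """Mean and SD of the distances over every cross-category token pair."""
    dists = [distance(a, b) for a, b in product(tokens_a, tokens_b)]
    return mean(dists), stdev(dists), len(dists)

# Hypothetical stand-ins for two tokens of ['ke] and three tokens of ['te].
ke_tokens = [1.0, 2.0]
te_tokens = [4.0, 5.0, 6.0]
m, sd, n = category_similarity(ke_tokens, te_tokens, lambda a, b: abs(a - b))
print(n)  # 6 cross-category pairs, as in the example above
print(m, sd)
```

Only cross-category pairs are compared, so the mean tracks how far apart the two categories are, while the standard deviation tracks how consistent that separation is across tokens.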
Under the assumption that each stop token in the corpus counts as an exemplar, and exemplars are clustered together in clouds according to their category membership, the mean category distance between two stops can be interpreted as the distance between the two centroids of the exemplar clouds associated with each phoneme. The standard deviation of the distances between tokens is a measure of how consistently different the two categories are, across contexts and repetitions. These measures are logically and practically independent of each other. For instance, /ɓ/~/p/ and /t͡ʃʔ/~/qʔ/ have similar mean acoustic distances (51.04 and 51.74 respectively), but the standard deviations of the acoustic distances associated with each category are rather different (8.80 and 5.39 respectively). Since the separation between category means and the variance around those means might both matter for the overall separation of two phonemic categories in the acoustic space, we treated both measures of category similarity as predictors in our analysis of the perception study described above. Ultimately, only the category means proved to be a reliable predictor of perceptual similarity in our study (Section 6).
Our statistical model includes two fixed effects for Category Similarity between the two phonemic categories being compared on a given Different trial: Mean Category Similarity and Standard Deviation of Category Similarity. Both predictors were z-score normalized using the scale() function in R.
5.3. Word- and segment-level statistical factors
The analysis of statistical effects on speech perception in Kaqchikel took into account a number of distinct segment- and word-level predictors. Only a few of these predictors made a significant contribution to predicting patterns of perceptual confusion between stops in our study (Section 6). In the following section we define only those predictors which made a reliable contribution to predicting stop discrimination in our study, and leave the definition of the other, non-significant factors which we considered to Appendix C.
5.3.1. Segment-level factors
Three segment-level predictors were considered: segmental frequency, functional load, and distributional overlap. Of these, only functional load and distributional overlap emerged as significant predictors of perceptual confusions in our study.
Functional load
Intuitively, Functional Load characterizes the importance of a given phonemic contrast for distinguishing words in a language. One way of defining the Functional Load of two phonemes in a language is to count the number of minimal pairs that are distinguished solely by the contrast between those phonemes (Hockett, 1967; Kučera, 1963; Martinet, 1952; Surendran & Niyogi, 2003, 2006). It has been argued that Functional Load and related measures condition the probability of diachronic phoneme mergers (Wedel, Jackson, & Kaplan, 2013; Wedel, Kaplan, & Jackson, 2013), as well as the production of phonemic contrasts (Baese-Berk & Goldrick, 2009; Goldrick, Vaughn, & Murphy, 2013; Nelson & Wedel, 2017).
Perhaps most relevant to this study, Functional Load may also interact with the perception of phonemic contrasts. Graff (2012), drawing on data from 60 languages and 25 language families, argues that languages tend to use perceptually robust phoneme contrasts to distinguish minimal pairs. Consequently, there should be a positive correlation between functional load and the perceptual distinctiveness of a given phonemic contrast. Hall and Hume (in preparation) and Stevenson and Zamuner (2017) show that in French, vowel pairs which have a higher functional load are also more perceptually distinct, even when other factors (such as raw acoustic similarity) are taken into account (see also Renwick, 2014 for similar claims about Romanian). Relatedly, L. Davidson, Shaw, and Adams (2007) found that listeners were more attentive to subtle phonetic details in an AX discrimination task (such as the presence vs. absence of schwa in clusters, [CəC]~[CC]) when the items constituted minimal pairs.
For this study, we focused on a metric of functional load which captures the change in entropy of the lexicon following merger of a phoneme contrast. This metric is sometimes called lexical Δ-entropy (for comparison with other metrics, see footnote 12). To compute this measure, we employed an information-theoretic method (Shannon, 1948). We first calculated the entropy of the Kaqchikel lexicon (a measure of uncertainty) using Equation 1 (Surendran & Niyogi, 2003, 2006). For our purposes, the entropy H(L) of a language measures how diverse the vocabulary is (basically the size of the lexicon), weighted by token frequency. Entropy depends on p_w, the probability of a given word w in our written corpus. Functional load (Equation 2) is measured by estimating how much the lexicon ‘shrinks’ when two phonemes are merged into one (i.e., the number of distinct words made homophonous, weighted by frequency). The entropy of a lexicon in which phonemes x, y have been merged, H(L_xy), is compared to the original, non-merged lexicon H(L) to yield the functional load of the x, y contrast (Equation 2). Phoneme pairs with a higher functional load should lead to a larger proportional decrease in entropy when they are merged.
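For reference, the formulation in Surendran and Niyogi (2003, 2006), which the description above follows, can be written as below. The base-2 logarithm and the notation for the summation over word types are assumptions made here for concreteness.

```latex
\begin{align}
H(L) &= -\sum_{w \in L} p_w \log_2 p_w \tag{1} \\
FL(x, y) &= \frac{H(L) - H(L_{xy})}{H(L)} \tag{2}
\end{align}
```

Dividing by H(L) is what makes the measure a proportional decrease in entropy: a pair of phonemes that distinguishes no words yields H(L_xy) = H(L) and therefore a functional load of zero.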
For the purpose of computing functional load, ‘words’ are defined as whole word forms, including affixes (i.e., as strings of segments separated by white space in a text; see Appendix A).
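A toy computation may help make the Δ-entropy measure concrete. The sketch below is an illustration under stated assumptions, not the code used for the study: the lexicon and its token counts are invented, words are represented as tuples of segments so that multi-character segments could be handled, and base-2 logarithms are assumed.

```python
import math
from collections import Counter

def entropy(freqs):
    """Entropy (in bits) of a lexicon given token counts per word form."""
    total = sum(freqs.values())
    return -sum((c / total) * math.log2(c / total) for c in freqs.values())

def functional_load(freqs, x, y):
    """Proportional drop in lexicon entropy when segment y is merged into x."""
    merged = Counter()
    for word, count in freqs.items():
        merged_word = tuple(x if seg == y else seg for seg in word)
        merged[merged_word] += count  # homophones pool their token counts
    h = entropy(freqs)
    return (h - entropy(merged)) / h

# Hypothetical toy lexicon: whole word forms with token counts.
lexicon = Counter({("p", "a"): 10, ("t", "a"): 10, ("k", "a"): 20})
print(functional_load(lexicon, "p", "t"))  # roughly 0.333
print(functional_load(lexicon, "p", "q"))  # 0.0: no words are made homophonous
```

In the toy lexicon, merging /p/ and /t/ collapses two word forms into one, lowering the entropy from 1.5 bits to 1 bit, so one third of the original entropy depends on that contrast.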
Wedel, Jackson, and Kaplan (2013) found that patterns of diachronic phoneme merger were better predicted by a measure of functional load which only compares words belonging to the same lexical category (e.g., two nouns distinguished by an /A B/ phoneme contrast would contribute to the functional load of /A B/, but not a noun-verb pair). As Kaqchikel is moderately agglutinating (Section 2), many words bear affixes which unambiguously indicate their part of speech (e.g., both affixes in r-utz-il ‘its goodness’ 3SG.ERG-good-NOM signal that this word is a noun). Our whole-word measure of functional load is thus probably biased toward comparing words within the same lexical category, as in Wedel, Jackson, and Kaplan (2013). Unlike Wedel, Jackson, and Kaplan (2013), we did not consider measures of functional load computed over lemmas (basically, uninflected stems) because a lemmatized corpus of Kaqchikel is not currently available.
A fixed effect of Functional Load was included in our statistical model, reflecting the measure of Δ-entropy described above.
Distributional overlap
Recent work by Hall et al. (2014) and Hall and Hume (in preparation) argues that the predictability of two phonemes across contexts contributes to their perceptual confusability. The theoretical context of this claim is one in which contrastiveness is assumed to be gradient rather than categorical: Two phonemes which occur in many of the same environments are taken to be more contrastive (i.e., less predictable) than phonemes which mostly occur in distinct environments (Hall, 2012, 2013). By hypothesis, phoneme pairs which are more contrastive (less predictable from context) are expected to be more readily discriminated.9
We took Jeffreys’ distance (also called Jeffrey divergence; henceforth JD) as our measure of the distributional overlap (=contextual predictability) of two phonemes. JD determines the contextual probability of two phonemes based on the local segmental contexts X__Y that they occur in. JD can thus be interpreted as a measure of phonotactic similarity.10
JD was implemented following the definition given in the TiMBL manual (Daelemans, Zavrel, van der Sloot, & van den Bosch, 2009, p. 26). In our study, each segment type is a class, and our features are the presence and absence of segments; more specifically, we used a trigram sliding window to generate features, with the target segment being in the first, second, or third position of the trigram window. Trigram windows are commonly used to capture phonotactics in computational linguistics (e.g., Jurafsky, Bell, Gregory, & Raymond, 2001) as well as phonology more broadly (e.g., Hayes & Wilson, 2008). In the specific case of Kaqchikel, trigram windows are necessary to capture certain co-occurrence restrictions which hold between the two consonants in a /CVC/ root (see Bennett, 2016; Bennett et al., in preparation).
A major difference between our metric and other metrics (such as the entropy-based measures in Peperkamp et al., 2006 and Hall, 2012) is that our contexts are defined over segments as opposed to phonological features. This decision owes in part to our own uncertainty about which phonological features are most appropriate for classifying segments in Kaqchikel, particularly in the case of laryngeal contrasts (see Bennett et al., in preparation for discussion). The details of the metric are shown below in Equation (3).
- S1 and S2 are two phones.
- Ci is a phonotactic environment, defined with a sliding trigram window: __XY, X__Y, XY__
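With these definitions, the Jeffrey divergence as defined in the TiMBL manual can be written as follows. This is a reconstruction from the manual's definition and the symbols listed above, offered as an assumption rather than a verbatim copy of the published equation; the logarithm base is left unspecified.

```latex
\begin{equation}
JD(S_1, S_2) = \sum_{i} \left[ P(C_i \mid S_1)\,\log\frac{P(C_i \mid S_1)}{m_i}
             + P(C_i \mid S_2)\,\log\frac{P(C_i \mid S_2)}{m_i} \right],
\qquad m_i = \frac{P(C_i \mid S_1) + P(C_i \mid S_2)}{2} \tag{3}
\end{equation}
```

Under this form, two phones with identical distributions over environments have a divergence of zero, and the value grows as their environments become more distinct.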
A fixed effect of Distributional Overlap was included in our statistical model, reflecting this measure of JD.11
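The computation of JD over trigram-window environments can be sketched as follows. Several details are assumptions made for illustration and are not taken from the study: the function names are hypothetical, word edges are padded with a boundary symbol "#", probabilities are relative frequencies pooled over the three window positions, base-2 logarithms are used, and the divergence follows the TiMBL-style definition in which each distribution is compared with the average of the two.

```python
import math
from collections import Counter

def context_counts(words, target):
    """Count trigram-window environments (__XY, X__Y, XY__) of a target segment."""
    counts = Counter()
    for word in words:
        padded = ["#", "#"] + list(word) + ["#", "#"]  # assumed boundary padding
        for i, seg in enumerate(padded):
            if seg != target:
                continue
            counts[("__XY", padded[i + 1], padded[i + 2])] += 1
            counts[("X__Y", padded[i - 1], padded[i + 1])] += 1
            counts[("XY__", padded[i - 2], padded[i - 1])] += 1
    return counts

def jeffrey_divergence(counts_1, counts_2):
    """Divergence between two phones' distributions over environments."""
    n1, n2 = sum(counts_1.values()), sum(counts_2.values())
    total = 0.0
    for ctx in set(counts_1) | set(counts_2):
        p1, p2 = counts_1[ctx] / n1, counts_2[ctx] / n2
        m = (p1 + p2) / 2
        if p1 > 0:
            total += p1 * math.log2(p1 / m)
        if p2 > 0:
            total += p2 * math.log2(p2 / m)
    return total

# Hypothetical toy word list, with words as tuples of segments.
words = [("p", "a"), ("t", "a"), ("a", "k")]
p, t, k = (context_counts(words, s) for s in ("p", "t", "k"))
print(jeffrey_divergence(p, t))  # identical environments
print(jeffrey_divergence(p, k))  # fully disjoint environments
```

In the toy data /p/ and /t/ occur in exactly the same environments and so have zero divergence, while /p/ and /k/ share no environments and reach the maximum value for this formulation.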
5.3.2. Word-level factors
In addition to the variables mentioned thus far, which were all part of the experimental design, a number of nuisance variables were included in the analysis of consonant confusions: wordhood, word frequency, neighborhood density, average neighborhood frequency, and bigram frequency. Our study was not designed to test the effect of these factors, but we included them in the analysis as control predictors in case they affected our results. Of these factors, only word frequency made an appreciable contribution to predicting consonant discrimination in our study (and even then, only marginally so). We describe wordhood and word frequency here; the remaining predictors are defined in Appendix C.
Wordhood
Ganong (1980) established the now classic result that listeners are more likely to identify a phonetically ambiguous segment as belonging to some phoneme Px if the categorization of that segment as Px results in an actual word of the listener’s native language, and categorizing the segment as a competing phoneme Py does not (see Fox, 1984; Kingston, 2005a; Kingston et al., 2016; Pitt & Samuel, 1993 and references there).
While our stimuli did contain real words, they were only included in order to achieve balanced coverage over the consonant and vowel combinations which were the focus of this study. However, since wordhood is known to play a role in speech perception, it was included as a possible predictor of perceptual confusions in our AX discrimination task.
To assess the wordhood of our stimuli, we consulted two sources: a native speaker of Kaqchikel (co-author Ajsivinac) and the headwords in two dictionaries (Macario et al., 1998; Majzul, 2007). We considered a [CV] or [VC] stimulus to be a ‘word’ of Kaqchikel if it was identical to either a function word (e.g., the particle k’a /kʔa/ ‘until, then, well’) or a content word (e.g., aq’ /aqʔ/ → [ʔaqʔ] ‘tongue’; word-initial epenthetic glottal stops were ignored for the purposes of computing wordhood). Affixes and other bound morphemes were not considered to be words in this sense (e.g., at- /at-/ 2SG.ABS or -i’ /-iʔ/ REFLEXIVE). We tailored these judgments to the Patzicía dialect: For example, uq [ʔuq] counted as a word because üq ‘skirt’ is pronounced as [ʔuq] (rather than historical [ʔʊq]) in the Patzicía dialect. Only 15 of our experimental items (including fillers) were actual words of Kaqchikel; the remainder (108) were non-words according to these criteria.
A fixed effect for Wordhood was included in our statistical model as the absolute difference of the wordhood values of two given stimuli: If both stimuli were words or both were non-words, the value was coded as 0, otherwise as 1. This predictor was z-score normalized using the scale() function in R.
Word frequency
The effect of wordhood on phonemic categorization also obtains when the categorization of an ambiguous segment as either phoneme Px or phoneme Py would result in a real word, but the two resulting words differ in token frequency (de Marneffe, Tomlinson Jr., Tice, & Sumner, 2011). This suggests that categorization judgments can be influenced not only by the categorical word~non-word distinction, but also by gradient differences in word frequency.
More generally, word frequency has been shown to contribute to both visual and auditory word recognition, with high frequency words being recognized more accurately and more quickly than low frequency words (Broadbent, 1967; C. R. Brown & Rubenstein, 1961; Felty, Buchwald, Gruenenfelder, & Pisoni, 2013; Howes, 1957; Tang, 2015, Ch. 4). Furthermore, when words are incorrectly identified in perception, the perceived word tends to have roughly the same lexical frequency as the intended word (Tang, 2015, Ch. 4; Tang & Nevins, 2014; Tang & Nevins, in preparation; Vitevitch, 2002). It follows that in our study, even if the stimuli (one or both) were incorrectly perceived on a given trial, the difference in word frequency between the two items could still bias the participants’ responses.
Word frequency was obtained using our written corpus. We only obtained word frequency information if a stimulus was determined to be a word according to the criteria described above. We used the difference in frequency between the two stimuli on a given trial as a predictor of consonant confusions; non-words were coded as having zero frequency.
A fixed effect of Word Frequency was included in our statistical model as the absolute difference of the log-transformed (base-10) word token frequencies of the stimuli in each trial, with Laplace (‘add one’) smoothing for frequencies of zero (prior to log-transformation; Brysbaert & Diependaele, 2013). This predictor was z-score normalized using the scale() function in R.
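The coding of this predictor, prior to z-scoring, amounts to the following arithmetic. The function names and token counts are hypothetical and serve only to illustrate the add-one smoothing and base-10 log transformation described above.

```python
import math

def log_freq(count):
    """Base-10 log of a token count after add-one (Laplace) smoothing."""
    return math.log10(count + 1)

def frequency_difference(count_a, count_b):
    """Absolute difference in smoothed log frequency between two stimuli."""
    return abs(log_freq(count_a) - log_freq(count_b))

# A word with 99 corpus tokens paired with a non-word (coded as zero frequency).
print(frequency_difference(99, 0))  # 2.0
# Two non-words: smoothing keeps the log defined, and the difference is zero.
print(frequency_difference(0, 0))   # 0.0
```

Smoothing matters here because most stimuli were non-words with a frequency of zero, for which the logarithm would otherwise be undefined.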
6. Analysis and results
6.1. Statistical modeling
The statistical analysis began with the construction of an initial (or ‘superset’) model which included a large number of predictors. This initial model was then simplified through a model criticism procedure described in Appendix B. The factors included in the initial model are described below.
6.1.1. Fixed effects
As mentioned above, our initial model included fixed effects for three acoustic predictors (Stimulus Similarity, Mean Category Similarity, and Standard Deviation of Category Similarity), as well as fixed effects for Functional Load, Distributional Overlap, Wordhood, and Word Frequency. Of these factors, only Stimulus Similarity, Mean Category Similarity, Functional Load, Distributional Overlap, and Word Frequency emerged as significant predictors of consonant discrimination in the final statistical model.
Along with these predictive factors, our initial model included fixed effects for Segmental Frequency, Neighborhood Density, Average Neighborhood Frequency, and Bigram Frequency (see Appendices B and C). These predictors were coded by log-transforming the values of the relevant measure, and taking the absolute difference of those log-transformed values for the two stimuli on each trial (Laplace smoothing was also used for Word Frequency). All of these predictors were z-score transformed; none of them emerged as predictive in our final statistical model.
Response time
There is a well-known trade-off between speed and accuracy in many behavioral tasks (e.g., Heitz, 2014). To account for the possibility of such a trade-off in our study, Response Time was treated as a fixed effect predictor in the analysis of accuracy (D. Davidson & Martin, 2013).
The response time on each trial was measured from the offset of the second stimulus (including the 250 ms of noise padding following the end of the syllable itself) to the time at which the response was logged. Given that each participant might have a different baseline response speed, these response times were transformed into by-participant z-scores.
6.1.2. Random effects
Item-level random effects
Unordered Stimulus Pair was treated as a random intercept, and was defined as the unordered pairing of any two Different stimuli. As each participant heard one of 30 different lists of stimulus pairs (Section 3.1.2), List was also included as a random intercept. The order of the two stimuli in a given trial (Stimulus Order) was included as another random intercept, since the order of stimulus presentation has been reported to affect same-different discrimination judgments in some tasks (Best et al., 2001; Bundgaard-Nielsen, Baker, Kroos, Harvey, & Best, 2015; Cowan & Morse, 1986; Dar, Keren-Portnoy, & Vihman, 2018; Repp & Crowder, 1990).
Finally, the position of the target stop in each stimulus (Onset vs. Coda) was treated as a random intercept. Phonotactic context is an important factor that influences consonant discrimination (see R. Wright, 2004 for an overview). A large body of research has found that place, manner, and laryngeal features are better discriminated for prevocalic [CV] consonants (particularly stops) than for non-prevocalic [VC] consonants (e.g., Benkí, 2003; Bladon, 1986; Fujimura, Macchi, & Streeter, 1978; Jun, 2004; Redford & Diehl, 1999; Steriade, 2001, 2009; Tang, 2015; Wang & Bilger, 1973, and others; cf. Cutler et al., 2004; Meyer et al., 2013 for skeptical views). Our initial analysis of d′ found no distinction in perceptibility between onset [CV] and coda [VC] contrasts (Section 2), but it still seemed prudent to include consonant position as a potential predictor of consonant confusions in this study.
Participant-level random effects
Participant was treated as a random intercept to control for inter-participant differences in overall accuracy. In addition, by-participant random slopes for all of the word- and segment-level factors mentioned above were also included in the initial model. These by-participant random slopes were motivated by the fact that vocabulary size, which may vary across individuals, has been shown to modulate the effect of lexical factors like neighborhood size and average neighborhood frequency (Yap, Sibley, Balota, Ratcliff, & Rueckl, 2015).
We began with an initial, full model incorporating all of the fixed and random effects described above. This model was then simplified by a standard step-down model-selection procedure making use of the anova() function and likelihood ratio test provided by R. This procedure, described in greater detail in Appendix B, resulted in the final, best model in (3), where (1|F) indicates a simple random effect of factor F.
Accuracy ~ Stimulus Similarity + Category Similarity (Mean) + Functional Load + Distributional Overlap + Word Frequency + (1 | Unordered Stimulus Pair) + (1 | Participant)
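Each step of a step-down procedure of this kind compares a model with and without a given term using a likelihood ratio test. The arithmetic behind one such comparison can be sketched as follows for the simplest case, in which the two models differ by a single parameter. The function name and log-likelihood values are hypothetical, and the selection criteria actually applied are those given in Appendix B.

```python
import math

def lrt_one_parameter(loglik_full, loglik_reduced):
    """Likelihood ratio statistic and p-value for dropping one parameter."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods of a full model and a model without one term.
stat, p = lrt_one_parameter(-1000.0, -1001.0)
print(stat, round(p, 4))
```

A large p-value indicates that removing the term does not reliably worsen the fit, so the simpler model is preferred at that step.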
6.2. Statistical results
6.2.1. Unimportant factors
A number of fixed and random effects were dropped during the model selection procedure. The fixed effects which fell out of the model were Category Similarity (SD), Segmental Frequency, all but one of the word-level factors (Wordhood, Neighborhood Density, Average Neighborhood Frequency, and Bigram Frequency), and Response Time. We suspect that Category Similarity (SD) emerged as insignificant because Category Similarity (mean) provides a better estimate of the distance between two phonemic categories: Category Similarity (mean) reflects the distance between category centroids—a property which clearly impacts the overall similarity between two categories—while Category Similarity (SD) reflects the variability in pairwise token comparisons across those categories—a property which could lead to either more or less overlap between categories depending on the shape of the variation. Like Bundgaard-Nielsen and Baker (2014) and Bundgaard-Nielsen et al. (2015), we did not find an effect of Segmental Frequency. Most word-level predictors were dropped from our final model, which we interpret as evidence that word-level factors had a limited effect on discrimination accuracy in our study. This perhaps reflects the fact that listeners could carry out the task (AX discrimination) without accessing lexical items of Kaqchikel (see Sections 5.3.2, 8 for more discussion). The insignificance of Response Time further suggests that there was no meaningful speed-accuracy trade-off in this study.
The dropped random intercepts were Stimulus Order, Onset vs. Coda, and List. All of the by-participant random slopes for word-level factors also fell out of the final model. Unlike Bundgaard-Nielsen et al. (2015), for example, we found no evidence that the order of presentation of the two stimuli within a pair affected participant responses. The insignificance of List suggests the stimuli were randomized successfully across participants, such that the distribution of stimuli within and across lists did not serve as an accidental confound. The failure to retain by-participant random slopes for word-level factors in the final model has at least two possible explanations: Either vocabulary size was fairly homogeneous across participants, or differences in vocabulary size do not have a material effect on the strength of the statistically significant predictors (segment-level Functional Load and Distributional Overlap, and Word Frequency).
6.2.2. Explanatory factors
The significant fixed factors in the best model are reported in Table 2.
| Predictor | β | SE | z (abs.) | p | Sig. |
|---|---|---|---|---|---|
| Category Similarity (mean) | –0.3876 | 0.1238 | 3.131 | <.005 | ** |
| Word Frequency (abs. diff.) | 0.1848 | 0.1068 | 1.731 | .084 | . |
First, both acoustic similarity measures are highly significant, particularly Stimulus Similarity. These acoustic measures have negative coefficients, meaning that the greater the acoustic similarity between two stimuli and their associated phonemic categories, the harder it is to discriminate those stimuli. Second, Functional Load has a positive coefficient, meaning that the higher the pairwise functional load of two stops, the easier it is to discriminate syllables differentiated by those stops. Third, Distributional Overlap has a negative coefficient, meaning that the more phonotactic environments shared by two stops, the harder it is to discriminate them (this was an unexpected finding, which we discuss in detail below). Fourth, the only remaining word-level factor, Word Frequency, has a positive coefficient, meaning that the bigger the difference in token frequency between two syllables which are also words of Kaqchikel, the easier it is to discriminate them. However, unlike the other predictors in this final model, the effect of Word Frequency is only marginally significant (p = .084), consistent with our overall finding that word-level factors do not have much of an effect on discrimination accuracy in our study. We believe that this effect of word frequency, though marginal, reflects the general importance of this factor in psycholinguistic processing: Word frequency is consistently the strongest word-level factor in lexical retrieval tasks (such as lexical decision tasks) in a wide range of languages (Ferrand et al., 2010; Keuleers, Diependaele, & Brysbaert, 2010; Keuleers et al., 2012; Sze, Rickard Liow, & Yap, 2014).
While all remaining fixed effects are statistically important according to our model selection procedure, differences in the size of the coefficients suggest that these predictors differ in their relative strength. Stimulus Similarity was the most important predictor (|β| = 1.0720), followed by Distributional Overlap (|β| = 0.6320), Functional Load (|β| = 0.4653), Category Similarity (mean) (|β| = 0.3876), and Word Frequency (|β| = 0.1848).
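As a rough check on the statistics reported for Table 2, the test statistic and p-value for a coefficient can be recovered from its estimate and standard error. The sketch below assumes that the reported p-values come from two-sided Wald z-tests, which the text does not state explicitly; the function name `wald_z` is introduced here for illustration only.

```python
import math

def wald_z(estimate, std_error):
    """Return (|z|, two-sided p) for a Wald z-test of a single coefficient."""
    z = abs(estimate / std_error)
    # Two-sided tail area of the standard normal: P(|Z| > z) = erfc(z / sqrt(2)).
    p = math.erfc(z / math.sqrt(2))
    return z, p

# Estimates and standard errors as reported in Table 2.
rows = {
    "Category Similarity (mean)": (-0.3876, 0.1238),
    "Word Frequency (abs. diff.)": (0.1848, 0.1068),
}

for name, (beta, se) in rows.items():
    z, p = wald_z(beta, se)
    print(f"{name}: |z| = {z:.2f}, p = {p:.3f}")
```

Under this assumption the recomputed values should agree with the tabled ones up to rounding of the published estimates and standard errors.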
7. Interim discussion
The statistical analysis in Section 6 established that both Stimulus Similarity and Category Similarity had an effect on discriminability in our perception study. The effect of Stimulus Similarity is unsurprising: stimuli that were acoustically more similar were, as expected, harder to discriminate. The effect of Category Similarity requires additional interpretation.
Recall that Category Similarity was computed on the basis of acoustic similarity between stop categories as they occur in our corpus of spontaneous spoken Kaqchikel (Sections 4.1, 5.2). We believe that this corpus provides a good approximation of the acoustic properties of Kaqchikel stops as they occur in actual, fluent speech. As such, we take the significant effect of Category Similarity as an indication that phonemic categories which are acoustically well-separated in regular Kaqchikel speech are easier to discriminate in perceptual tasks.
At the theoretical level, this finding suggests that consonant discrimination is mediated by some representation of prior phonetic experience. In particular, these results are consistent with the view that speakers possess mental representations for phonemic categories which include rich phonetic detail, including (at least) some information about the acoustic properties which are typically associated with actual productions of each phoneme category in everyday speech. This claim accords with exemplar models of lexical representation, which assume that linguistic units (words, phonemes, etc.) are represented as clouds of episodic memories, which store phonetic representations of specific instances on which those units were encountered in speech (e.g., Gahl & Yu, 2006; Goldinger, 1996, 1998; K. Johnson, 2005; Pierrehumbert, 2001 and references there). Our results are also consistent with the alternative view that phonemic categories are represented in a more abstract, parametric fashion, as vectors of values along specific dimensions (e.g., VOT, closure duration, etc.) which are specified separately for each phonemic category (see Ernestus, 2014; Pierrehumbert, 2016; Smits, Sereno, & Jongman, 2006 for discussion).
We also found that two predictors related to the lexical structure of Kaqchikel—Functional Load and Distributional Overlap—made a significant contribution to predicting consonant confusions in our study. Notably, both of these factors are segment-level factors: The word-level factors considered here had essentially no effect on stop consonant confusions. Following Hall (2012) and others, we take Functional Load and Distributional Overlap to be expressions of a gradient notion of segment-level phonemic contrast. In this sense, the relative predictability of two phonemes across contexts, and the precise number of words distinguished by those phonemes, provide a scalar characterization of how contrastive those phonemes are (i.e., how much lexical ‘work’ is done by the contrast between those phonemes). Our results suggest that contrasts which have a higher functional importance in Kaqchikel are also easier to discriminate, as indicated by the positive correlation between accuracy and functional load. This finding is consistent with the view that language-specific phonemic contrasts ‘warp’ the perceptual space in both categorical and gradient ways (e.g., Boomershine et al., 2008; Hall & Hume, in preparation; Hall et al., 2014; Harnsberger, 2000, 2001a, 2001b; Kataoka & Johnson, 2007 and references there).
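To make the notion of 'words distinguished by two phonemes' concrete, the following sketch counts minimal pairs for a segment pair in a toy lexicon. This is only one common simplification of functional load; the measure actually used in this study is defined in Section 5.3 and may differ (functional load is often defined in terms of entropy loss rather than raw counts). The lexicon entries are invented strings, not Kaqchikel words, and `minimal_pair_count` is a hypothetical helper name.

```python
def minimal_pair_count(lexicon, seg_a, seg_b):
    """Count word pairs that differ only in seg_a vs. seg_b at a single position.

    Each word is a sequence of segments (e.g., a tuple of strings).
    """
    words = {tuple(word) for word in lexicon}
    count = 0
    for word in words:
        for i, seg in enumerate(word):
            if seg == seg_a:
                # Swap seg_a for seg_b at this position and look the result up.
                swapped = word[:i] + (seg_b,) + word[i + 1:]
                if swapped in words:
                    count += 1
    return count

# Invented toy lexicon (illustrative only).
toy_lexicon = [
    ("p", "a", "t"),
    ("b", "a", "t"),
    ("t", "a", "p"),
    ("t", "a", "b"),
    ("p", "i", "n"),
]

print(minimal_pair_count(toy_lexicon, "p", "b"))
```

A count of this kind would normally be normalized (for example, by the number of words containing either segment) before being compared across segment pairs.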
This interpretation of the results is nonetheless complicated by the finding that distributional overlap is negatively correlated with accuracy in our study. Such a correlation indicates that phonemes which occur in more shared environments—that is, phonemes which are less predictable from context, and therefore more contrastive—are harder to discriminate. This effect is contrary to our finding for functional load, which suggests that segments which distinguish many word forms (and which are therefore not predictable from context) are easier to discriminate; it is also contrary to previous findings by Hall et al. (2014) and Hall and Hume (in preparation) on the effect of distributional overlap on segment discrimination.
We are uncertain as to the source of this discrepancy. We first considered whether the negative correlation could be driven by the behavior of the alveolar ejective /tʔ/ alone. This segment has very low type and token frequencies in our corpora and in Kaqchikel more generally (Section 4.1.3), meaning that it should have a low degree of distributional overlap with other phones. Ejective /tʔ/ is nonetheless highly perceptible (most d′ values for comparisons involving /tʔ/ are above 2, compared to a grand average of about 1.7 for all pairwise comparisons), and so this segment alone might be driving the negative correlation between accuracy and distributional overlap. However, this is not the case: When we re-ran our analyses with comparisons involving /tʔ/ excluded, the effect size of Distributional Overlap weakened, but the negative sign did not change (β = –0.247, p < .05).
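For readers unfamiliar with d′, the sketch below computes a sensitivity score from response counts using the basic formula d′ = z(hit rate) − z(false-alarm rate). It is an illustration only: AX (same-different) data are often analyzed with a differencing or independent-observation model instead, and this section does not specify which computation was used. The log-linear correction and the function name `d_prime` are assumptions made here.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Basic d' = z(H) - z(FA), with a log-linear correction for extreme rates."""
    # Adding 0.5 to each count keeps both rates strictly between 0 and 1,
    # so the inverse normal CDF never returns an infinite value.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Hypothetical counts for one contrast: 'different' responses on
# different trials (hits) and on same trials (false alarms).
print(round(d_prime(hits=90, misses=10, false_alarms=10, correct_rejections=90), 2))
```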
Alternatively, this divergent result may reflect the methods used to calculate distributional overlap in our study. As emphasized by Hall (2012), measures of distributional overlap and contextual predictability are highly sensitive to the definition of ‘context’ used. For example, Kaqchikel has a process which devoices syllable-final /l/ to [l̥] (e.g., loq’ob’äl /loqʔoɓəl/ → [loqʔoɓəl̥] ‘blessing’). These two sounds are distributed completely predictably, but only if ‘context’ can refer to right-hand environments and syllable structure [__X] (i.e., [l̥] is always followed by a syllable boundary, and [l] never is). If, instead, ‘context’ refers only to the left-hand segment [X__], these sounds would appear to be at least partially contrastive and unpredictable (e.g., both can be preceded by [a], wach’alal [wat͡ʃʔalal̥] ‘my family’).
Previous work on this topic has computed distributional overlap using highly-specific, pre-defined contexts which either (a) reflect the structure of experimental stimuli used in the study (e.g., [a__a] in Hall et al., 2014), or (b) reflect prior observations about the phonotactic contexts responsible for conditioning the distribution of sounds in the language under investigation (e.g., [__z]σ in French, Hall & Hume, in preparation). Here, we used an inductive method (Jeffreys divergence) defined over all possible trigram windows to compute contextual predictability (Section 5.3.1). This methodological difference alone may have contributed substantially to the difference between our results and the results of Hall and Hume (in preparation) and Hall et al. (2014). With this in mind, we explored several other methods of computing contextual predictability: using bigram windows ([__X], [X__]) instead of trigram windows; using type rather than token frequencies (as in Hall & Hume, in preparation; Hall et al., 2014); and including or excluding comparisons involving /tʔ/. No combination of these methods yielded the expected positive correlation between distributional overlap and discriminability: In each case, the correlation was either non-significant or remained negative in sign.
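The general idea of comparing two segments by the trigram contexts in which they occur can be sketched as follows. The Jeffreys divergence between two distributions P and Q is the symmetrized Kullback-Leibler divergence, J(P, Q) = Σ (p − q) ln(p / q). Everything else in this sketch is an assumption for illustration: the word-boundary padding with `#`, the additive smoothing constant, the toy corpus of invented strings, and the function names. The procedure actually used is described in Section 5.3.1 and may differ in these details, including how divergence is mapped onto 'overlap'.

```python
import math
from collections import Counter

def context_distribution(corpus, segment):
    """Relative frequency of (left, right) neighbors of `segment` in the corpus.

    Words are sequences of segments; '#' marks word edges (an assumption).
    """
    counts = Counter()
    for word in corpus:
        padded = ["#"] + list(word) + ["#"]
        for i in range(1, len(padded) - 1):
            if padded[i] == segment:
                counts[(padded[i - 1], padded[i + 1])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ctx: n / total for ctx, n in counts.items()}

def jeffreys_divergence(p, q, smoothing=1e-6):
    """Symmetrized KL divergence: sum over contexts of (p - q) * ln(p / q)."""
    contexts = set(p) | set(q)
    # Additive smoothing keeps every probability non-zero before renormalizing.
    p_s = {c: p.get(c, 0.0) + smoothing for c in contexts}
    q_s = {c: q.get(c, 0.0) + smoothing for c in contexts}
    z_p, z_q = sum(p_s.values()), sum(q_s.values())
    total = 0.0
    for c in contexts:
        pc, qc = p_s[c] / z_p, q_s[c] / z_q
        total += (pc - qc) * math.log(pc / qc)
    return total

# Invented toy corpus (not Kaqchikel data).
toy_corpus = [list("pata"), list("pati"), list("kata"), list("kiti"), list("tapa")]
p_ctx = context_distribution(toy_corpus, "p")
k_ctx = context_distribution(toy_corpus, "k")
print(jeffreys_divergence(p_ctx, k_ctx))
```

A divergence of zero means that the two segments have identical context distributions; larger values mean that each segment is more predictable from its context. Switching from trigram to bigram windows, as in the alternative analyses mentioned above, would amount to keying the counts on only the left or only the right neighbor.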
These practical considerations aside, there is at least one other way to interpret the negative correlation between Distributional Overlap (degree of contrastiveness) and discriminability in our study, which again relies on exemplar dynamics. Two sounds which tend to occur in the same contexts may have more opportunities to be confused, particularly if word misperception is sensitive to statistical properties of the lexicon, including phonotactic well-formedness (see Tang, 2015). If ‘confusing’ sound A for sound B means erroneously storing a token of A as a token of B in the exemplar space, then frequent confusions between A and B should have the effect of making the exemplar clouds for A and B more similar over time (e.g., Wedel, 2004; see also Ohala, 1993). In this way, increased distributional overlap between two sounds could indirectly lead to greater confusability between those sounds by increasing the amount of overlap between their associated exemplar clouds. Choosing between these possible interpretations of the effect of Distributional Overlap remains an open question for future research.
To reiterate, the finding that high contextual predictability (low degree of contrast) leads to greater discriminability conflicts with both our theoretical expectations (Section 5.3.1) and past results on this question (Hall & Hume, in preparation; Hall et al., 2014). We are unsure how to interpret this result, though we note that the existence of a positive correlation between contrastiveness and discriminability has not yet been conclusively established: Such a result is reported by Hall and Hume (in preparation) and Hall et al. (2014), but Hall (2009) finds no meaningful correlation at all between contrastiveness and discriminability (though Hall also discusses some potential issues which may have led to this null result).
In any case it seems clear, particularly for functional load, that gradient contrast has an effect on speech perception. However, the precise mechanism(s) behind patterns of contrast-driven perceptual warping remain somewhat obscure (see again Kataoka & Johnson, 2007). In the following section we test two hypotheses which attempt to provide an explicit link between consonant discrimination and segment-level distributional measures (Functional Load and Distributional Overlap) in Kaqchikel.
The first hypothesis is that functional load and distributional overlap are computed online in speech perception tasks such as ours, and that these computations can affect real-time speech processing. We do not think that this hypothesis is likely to be correct: Speech perception is rapid and automatic, while the computation of functional load and distributional overlap should require substantial processing time, even if computed over some subset of the lexicon (see e.g., Kingston et al., 2016; McClelland & Elman, 1986; McClelland, Mirman, & Holt, 2006; McClelland, Rumelhart, & Hinton, 1986; Norris, McQueen, & Cutler, 2000 and references there for discussion). We nonetheless believe that this hypothesis is worthy of some consideration.
Our second hypothesis is that functional load and distributional overlap condition speech perception by shaping low-level perceptual tuning during development. By ‘perceptual tuning,’ we refer to the fact that listeners selectively attend to those phonetic dimensions which are informative and reliable for the discrimination of phonemic categories in their native language (L. Davidson et al., 2007; Holt & Lotto, 2006; McGuire, 2007).
In the following section we attempt to disentangle these two hypotheses by investigating the timecourse of segment-level distributional factors (Functional Load and Distributional Overlap) in our study.
8. The timecourse of experience-based effects
Speech perception can be decomposed into at least three distinct tasks: auditory/acoustic processing; phone-level processing; and lexical retrieval (e.g., Babel & Johnson, 2010; Fox, 1984; Pisoni, 1975; Pisoni & Tash, 1974; Pitt & Samuel, 1993; Werker & Logan, 1985 and references there). The first of these tasks, auditory/acoustic processing, involves mechanisms which are basically physiological in nature. As such, this aspect of speech processing is not expected to be substantially affected by the listener’s native language. Phone-level processing (sometimes called ‘phonetic’ or ‘phonemic’ processing, e.g., Pisoni, 1973; Werker & Logan, 1985; Werker & Tees, 1984b) involves the categorization of speech sounds into appropriate phonemes and/or allophones. This type of processing differs from auditory/acoustic processing in that it is necessarily conditioned by the listener’s native language, and is therefore expected to show sensitivity to past linguistic experience. Such sensitivity is also expected for any aspect of speech processing that involves lexical access, as languages (and speakers) obviously differ in their vocabularies.
Researchers disagree as to the relative independence of each of these aspects of speech processing (see Kingston, 2005a; Kingston et al., 2016; McClelland & Elman, 1986; McClelland et al., 2006, 1986; Norris et al., 2000 for discussion and further references). There is nonetheless a broad consensus that native-language influences on speech perception emerge relatively late in the timecourse of speech processing. This is particularly true of lexical effects, which tend to influence speech perception sometime after the initiation of phone-level processing (Fox, 1984; though cf. Kingston et al., 2016 and work cited there).
Assuming that these stages of speech processing have the rough temporal sequencing suggested by prior work (acoustic/auditory ⇒ phone-level ⇒ lexical), we can at least tentatively diagnose the mechanism behind the segment-level statistical effects in our study (Functional Load and Distributional Overlap) by investigating when in the course of speech processing those effects arise. If Functional Load (FL) and Distributional Overlap (DO) are computed online during speech perception, through some process of lexical access or lexical sampling, then the effect of these predictors should emerge relatively late. We would then expect stronger effects of FL/DO at slower response times. If, on the other hand, FL/DO affect speech processing by shaping low-level perceptual tuning (e.g., cue weighting) during acquisition, then we might expect to see the influence of these predictors even at relatively fast response times.
Among the significant predictors in our final model (3), Stimulus Similarity corresponds most closely to the kind of information that would be processed during the acoustic/auditory stage of speech perception. We thus expect that Stimulus Similarity should have a robust effect on response accuracy at even the fastest response times. Category Similarity, a measure which refers to language-specific phonetic distributions associated with individual phoneme categories, should emerge no earlier than the purely acoustic measure of Stimulus Similarity. If functional load and distributional overlap are computed online, they should begin to affect responses at a later stage than either Stimulus Similarity or Category Similarity.
Apart from the relative onset of these effects, we might also find differences in how the influence of each factor changes over time. Even if all of the factors in the model begin to affect response accuracy at about the same point, some factors might still grow in strength over time, while others weaken instead. In particular, factors involving lexical access might be more evident at longer response latencies, under the assumption that the strength of lexical activation gradually increases over time, such that lexical factors influence phone-level activation more strongly at later stages of processing (e.g., Kingston et al., 2016).
8.2. Statistical modeling
To carry out a timecourse analysis of our results we fit a new regression model based on our previous best model (3). Five interaction terms were added to test whether the significant fixed effects in (3) interact with response time in predicting participant accuracy. Response Time was also added as a fixed effect, consistent with the standard practice of including simple effects for any predictor included in an interaction term. Nested model comparison shows that model fit is significantly improved when these five interaction terms are included (Table 3; χ²(5) = 23.6, p < .001).
| Predictor | β | SE | z (abs.) | p | Sig. |
|---|---|---|---|---|---|
| Category Similarity (mean) | –0.3975 | 0.1257 | 3.163 | .002 | ** |
| Word Frequency (abs. diff.) | 0.1705 | 0.1084 | 1.573 | .116 | n.s. |
| Stimulus Similarity:Response Time | 0.1877 | 0.0742 | 2.528 | .011 | * |
| Category Similarity (mean):Response Time | 0.1591 | 0.0794 | 2.005 | .045 | * |
| Functional Load:Response Time | –0.2737 | 0.1063 | 2.575 | .010 | * |
| Distributional Overlap:Response Time | 0.3389 | 0.1076 | 3.149 | .002 | ** |
| Word Frequency (abs. diff.):Response Time | 0.0132 | 0.0686 | 0.193 | .8465 | n.s. |
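The nested model comparison reported for the interaction model is a likelihood ratio test: the test statistic (twice the difference in log-likelihood between the larger and the smaller model) is referred to a chi-square distribution whose degrees of freedom equal the number of added parameters. The sketch below recomputes the p-value for the reported statistic, χ²(5) = 23.6, using a closed-form expression for integer degrees of freedom. The helper name `chi2_sf` is introduced here; in R, the same test is what `anova()` performs when given two nested models.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * sum_r x^(r - 1/2) / (1*3*...*(2r-1))
    total = 0.0
    term = math.sqrt(x)
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2 / math.pi) * math.exp(-half) * total

# Reported comparison: five added interaction terms, chi-square statistic 23.6.
p_value = chi2_sf(23.6, 5)
print(f"p = {p_value:.5f}")
```

The resulting p-value falls below .001, in line with the value reported in the text.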
We also performed a separate timecourse analysis treating Response Time as a discrete variable rather than a continuous one. Dichotomizing continuous variables is often discouraged (Baayen, 2008, p. 259), but we performed this additional analysis because it more closely resembles the treatment of timecourse effects on speech processing in some previous work (Babel & Johnson, 2010; Kingston, 2005a). First, each participant’s responses were divided into three equally-sized bins (i.e., by-participant terciles): fast responses (Early), medium-speed responses (Middle), and slow responses (Late). The mean response times for each bin (across participants) were about 400 ms, 650 ms, and 1200 ms; the first bin falls roughly in the range of response times associated with auditory processing, while the latter two fall in the range associated with phone-level and/or lexical processing (e.g., Babel & Johnson, 2010; Fox, 1984; Werker & Logan, 1985; Werker & Tees, 1984b). We Helmert-coded this discrete, three-level timecourse predictor, and re-fit the model (3) using the same structure used for our continuous response time predictor. This analysis yielded the same qualitative results as the analysis which treated timecourse as a continuous predictor. We report only the continuous model below.
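The by-participant tercile binning and Helmert coding described above can be sketched as follows. The binning is done separately for each participant, so that every participant contributes equally to each bin. The contrast matrix follows the convention of R's `contr.helmert(3)`, which is an assumption here, since the text does not state which Helmert variant was used; the response times in the example are invented.

```python
def tercile_bins(response_times):
    """Label each response of ONE participant as Early, Middle, or Late by rank."""
    n = len(response_times)
    order = sorted(range(n), key=lambda i: response_times[i])
    labels = [None] * n
    for rank, idx in enumerate(order):
        # Ranks in the fastest third map to bin 0, the next third to bin 1, etc.
        labels[idx] = ("Early", "Middle", "Late")[3 * rank // n]
    return labels

# Helmert contrasts for a three-level factor (R's contr.helmert(3) convention):
# column 1 compares Middle with Early; column 2 compares Late with the mean
# of Early and Middle.
HELMERT_3 = {
    "Early": (-1, -1),
    "Middle": (1, -1),
    "Late": (0, 2),
}

# Invented response times (ms) for a single participant.
rts = [900, 300, 650, 1200, 400, 700]
bins = tercile_bins(rts)
coded = [HELMERT_3[label] for label in bins]
print(list(zip(rts, bins, coded)))
```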
The significant interactions reported in Table 3 suggest that the strength of our acoustic and distributional predictors did vary as a function of response time. To dig deeper into the interaction between these factors and response time, we fit a separate regression model for each of the response time tercile bins described above (Table 4), using the same model structure (3) which we used in the analysis of the full data set.
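| Predictor | Early (μ ≈ 400 ms): β | Sig. | Middle (μ ≈ 650 ms): β | Sig. | Late (μ ≈ 1200 ms): β | Sig. |
|---|---|---|---|---|---|---|
| Word Frequency (abs. diff.) | 0.2671 | n.s. | 0.2314 | n.s. | 0.0607 | n.s. |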
8.3. Statistical results
Table 3 presents the regression statistics of the model with interaction terms between the five fixed effects and Response Time. We first note that Word Frequency and its interaction term with Response Time do not reach statistical significance. The other four interaction terms do reach statistical significance. Crucially, the coefficients of these four significant interaction terms indicate that the four fixed effects decrease in strength as response times increase.
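To see why an interaction coefficient of opposite sign to the main effect implies a weakening effect, note that in a model with a predictor-by-Response-Time interaction, the slope of the predictor at a given response time is β(predictor) + β(interaction) × RT. The sketch below applies this to the Category Similarity coefficients in Table 3. It assumes that Response Time was centered and scaled before model fitting (so that RT = 0 is the mean and RT = ±1 is one standard deviation away), which is not stated in this section; the function name is hypothetical.

```python
def conditional_slope(main_effect, interaction, response_time):
    """Slope of a predictor at a given (scaled) response time."""
    return main_effect + interaction * response_time

# Coefficients for Category Similarity (mean) as reported in Table 3.
CATEGORY_SIMILARITY = -0.3975
CATEGORY_SIMILARITY_BY_RT = 0.1591

for rt in (-1.0, 0.0, 1.0):
    slope = conditional_slope(CATEGORY_SIMILARITY, CATEGORY_SIMILARITY_BY_RT, rt)
    print(f"RT = {rt:+.0f} SD: slope = {slope:.4f}")
```

Because the main effect is negative and the interaction is positive, the slope moves toward zero as response time increases, which is the sense in which the effect decreases in strength at slower response times.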
Table 4 presents the regression statistics for each of the three regression models, grouped by response time bin. We first note that Word Frequency does not reach statistical significance in any response time bin. The other four predictors (Stimulus Similarity, Category Similarity, Functional Load, and Distributional Overlap) have consistent effects across all three response time bins. Each of these four predictors influences response accuracy even in the earliest bin. Additionally, all of these predictors (including the insignificant predictor Word Frequency) decrease in strength as response times increase. This decrease in strength is evident in both the magnitude of the effects (the values of the β coefficients) and the level of statistical significance reached. Together these findings suggest that all four significant factors take effect early, but decrease in strength over time.
8.4. Interpretation of timecourse analysis
This timecourse analysis shows that three experience-based factors (Category Similarity, Functional Load, and Distributional Overlap) began to affect discrimination at about the same early timepoint as the acoustic/auditory factor Stimulus Similarity.
For present purposes, the most important result is that Functional Load and Distributional Overlap influenced response accuracy even at very fast response times (those in the Early tercile bin). This result is consistent with the view that these two ‘lexical’ measures impinge on speech perception somewhat indirectly, most likely by influencing which acoustic dimensions speakers attend to more closely during phone-level processing. If Functional Load and Distributional Overlap condition speech perception through perceptual tuning, these factors are expected to show the same timecourse as Category Similarity, as all three measures reflect perceptual processes which in some way refer to the phonetic dimensions that distinguish phoneme categories in the listener’s native language.
We cannot completely rule out the possibility that Functional Load and Distributional Overlap are computed online during speech perception. For one, the inter-stimulus interval in this study (up to 800 ms) may have been sufficiently long that listeners were able to carry out some form of lexical access even for fairly quick responses. Nevertheless, we believe that this interpretation of our results is at odds with several observations. First, the AX discrimination task used in this study neither required nor encouraged lexical access, particularly because most stimuli were not actual words of Kaqchikel. Second, we think it is inherently unlikely that listeners carry out the large-scale lexical access that would be needed to accurately compute measures like functional load online. Speech processing is simply too rapid to involve lexical access at this scale during real-time listening. This argument is bolstered by the observation that the effect of these ‘lexical’ measures emerged early and weakened over time; were some form of bulk lexical access involved, we should expect to see the strength of these measures increase over time instead, as more of the lexicon is accessed and analyzed.
9.1. Theoretical contributions
The core theoretical contributions of this article are twofold. First, we have demonstrated experimentally that prior linguistic experience affects speech perception, not simply because different languages have different phonemic inventories (e.g., Werker & Logan, 1985; Werker & Tees, 1984a), but also because languages differ in the fine phonetic details associated with phonemic categories, as well as in their lexical structure and patterns of usage. These results replicate and extend past research showing that highly specific statistical patterns in a listener’s native language can have extensive effects on perceptual processing, even in experimental tasks that do not obviously require lexical access.
Importantly, our investigation has established these results in the context of a language—Kaqchikel Maya—which is sociolinguistically and structurally very different from the majority languages which are most often studied in speech perception research (Section 1). We hope that this work will encourage researchers to continue expanding the speech perception literature, and the phonetics literature more generally, to include a wider range of lesser-studied languages. Only in this way can we establish cross-linguistically valid theories of speech perception and production.
9.2. Methodological contributions
In this study, we replicated several findings of experience-based effects in speech perception which have previously been demonstrated for majority languages using much richer resources (e.g., substantially larger written corpora, Section 4.2.2). We take this result to be an indirect validation of the use of small corpora in speech perception studies. Despite their shortcomings, small, noisy corpora can make valuable contributions to speech perception research, provided they are carefully processed beforehand. Our results also supply a positive answer to the general question of whether reliable speech perception research can be conducted in the field, outside of highly-controlled laboratory settings (Whalen & McDonough, 2015; see also DiCanio, 2014 for an excellent recent example of this kind of research).
To further support this claim, we now compare our findings to a similar study conducted with a majority language (French) in a laboratory setting. Hall and Hume (in preparation) investigated segment-level statistical effects on the discriminability of French vowels. They considered many of the same predictors we investigated in our study, including Stimulus Similarity, Functional Load, Distributional Overlap, and Segmental Token Frequency; this parallelism allows us to compare the two studies rather directly.
With the exception of Category Similarity and Word Frequency (which were not examined by Hall and Hume), the significant predictors in our study (Stimulus Similarity, Functional Load, and Distributional Overlap) were also statistically significant predictors of vowel discrimination in Hall and Hume (in preparation) (though the direction of the effect for Distributional Overlap was different in the two studies). Segmental Token Frequency did not reach significance in either our study or in Hall and Hume (in preparation). In terms of the relative importance of these predictors, both studies found that Stimulus Similarity played the strongest role, followed by predictors related to phonological contrastiveness (Functional Load and Distributional Overlap), with Distributional Overlap being a better predictor of response accuracy than Functional Load (though only when /tʔ/ is included in the analysis). We find the parallelism between these results to be rather encouraging, especially given the following differences between the two studies:
- Task: Hall and Hume used a multiple forced-choice identification task, while our study used an AX discrimination task.
- Target segments: Hall and Hume examined vowels, while our study examined consonants.
- Stimulus presentation: Hall and Hume presented their stimuli without any masking noise, while we presented our stimuli in speech-shaped noise at a 0 dB SNR.
- Experimental setting: Hall and Hume tested their participants in a controlled laboratory setting in a sound-attenuated booth, while we tested our participants in a quiet room which was not sound-attenuated.
- Language: Hall and Hume examined French, a Romance language, while we examined Kaqchikel, a Mayan language.
- Culture: the participants in Hall and Hume’s study were likely to have some experience with psychological experiments, and extensive experience with computers. Our participants did not in general have such experience.
- Quality of the corpus estimates: the segment-level predictors in Hall and Hume (in preparation) were estimated using a written corpus that is very large (65.1 million words), well-balanced, and highly speech-like (compiled from books, and subtitles of films and TV shows). Our segment-level predictors were estimated using a written corpus that is small (0.7 million words), unbalanced, and not particularly speech-like (mostly governmental and religious documents).
Despite these differences, which are varied and numerous, the two studies arrive at strikingly similar conclusions about the kinds of predictors which affect speech perception, as well as their relative importance. This comparison thus reaffirms our claim that conducting speech perception research in the field can result in findings that are comparable to those obtained in a laboratory.
However, it should also be noted that our results are substantially ‘noisier’ than the results of Hall and Hume (in preparation). In particular, the fixed effects components in our statistical model (Section 6) capture 23.2% of the variance in our data; in contrast, the fixed effects components in the statistical model reported in Hall and Hume (in preparation) capture as much as 82% of the variance in their study. It is unclear to us which differences between the studies (including, but not limited to, those outlined above) could have led to such a dramatic difference in model fit. Answering this question would require several follow-up studies which reduce the methodological differences with Hall and Hume (in preparation), a project we leave for future research.
The additional files for this article can be found as follows:
- Appendix A. Processing of written corpus. DOI: https://doi.org/10.5334/labphon.100.s1
- Appendix B. Model construction and selection. DOI: https://doi.org/10.5334/labphon.100.s2
- Appendix C. Non-significant predictors in the AX discrimination study. DOI: https://doi.org/10.5334/labphon.100.s3
- Appendix D. d′ scores for all target stop contrasts. DOI: https://doi.org/10.5334/labphon.100.s4