1 Introduction

One of the major challenges in early language acquisition is to learn the phonology that defines the words of the native language. Phonological acquisition starts in the first year of life, and includes the learning of stress patterns (Jusczyk et al., 1993a; Jusczyk et al., 1999b; Mehler et al., 1988; Morgan & Saffran, 1995; Nazzi et al., 1998), co-articulation (e.g., Fowler et al., 1990; Johnson & Jusczyk, 2001), allophony (e.g., Jusczyk et al., 1999a), and phonotactics (e.g., Friederici & Wessels, 1993; Jusczyk et al., 1993b; Jusczyk et al., 1994).

Experimental studies addressing the learning of phonotactics show that both adults and infants induce constraints from exposure to items designed to meet those constraints. Onishi et al. (2002) examined the learning of constraints which confined consonants to either word-initial or word-final position (e.g., /bæp/, not */pæb/), and constraints which linked specific consonant-vowel sequences (e.g., /bæp/ or /pɪb/, but not */pæb/ or */bɪp/). Adult participants were faster at repeating novel test words which obeyed the experiment-induced constraints than test words which violated those constraints. Similar findings have been reported in studies on phonotactic learning through speech production (e.g., Dell et al., 2000; Goldrick & Larson, 2008), and in studies with infants (Chambers et al., 2003).

Various studies have aimed to determine which phonological properties drive phonotactic learning. For example, it has been shown that learners induce constraints on consonants across different vowel contexts, suggesting that consonants and vowels are processed independently (Bonatti et al., 2005; Chambers et al., 2010). In addition, several studies indicate that patterns of natural classes are easier to learn than patterns of arbitrary classes of segments (Cristià & Seidl, 2008; Finley & Badecker, 2009; Saffran & Thiessen, 2003; Seidl & Buckley, 2005).

Recent studies have started to investigate the source of input that is used in phonotactic learning. In particular, the question has been whether learning is primarily driven by exposure to word types or by exposure to word tokens. Experiments with adults and children indicate that phonotactic learning is primarily driven by phonotactic probabilities in word types (Richtsmeier, 2011), although token variability due to speaker variation also has a positive impact on learning (Richtsmeier et al., 2011). These findings are in line with computational work showing that phonotactic constraints can be modeled as abstractions over co-occurrence patterns in the lexicon (e.g., Albright, 2009; Hayes & Wilson, 2008), and that models trained on word types typically perform better than models trained on word tokens (Albright, 2009; Bailey & Hahn, 2001; Hay et al., 2004; Pierrehumbert, 2003).

There is, however, another source of input that may affect phonotactic learning. Computational work has shown that phonotactic constraints are learnable from transcriptions of continuous speech, and that such constraints have a positive effect on word segmentation (Adriaans & Kager, 2010; Blanchard et al., 2010; Brent & Cartwright, 1996; Cairns et al., 1997; Daland & Pierrehumbert, 2011). Phonotactic constraints might thus arise before the lexicon is in place, in particular from co-occurrence probabilities in continuous speech. This view is supported by experimental evidence showing that nine-month-old infants use phonotactic probabilities to segment words from speech (Mattys et al., 1999; Mattys & Jusczyk, 2001).

Additional support for the learning of phonotactics from continuous speech comes from studies that focus on the role of segment transitional probabilities (TPs) in word segmentation (Bonatti et al., 2005; Newport & Aslin, 2004; Saffran et al., 1996; Toro et al., 2008). Bonatti et al. (2005) found that learners exploit consonant probabilities (but not vowel probabilities) for segmentation. Importantly, the study showed that learners generalize the phonotactic structure of familiarization words to novel test items (i.e., consonant frames matched with a novel vowel structure). These findings were taken as evidence that learners pick up consonantal word roots from the speech stream, rather than complete word forms (which consist of consonants and vowels). This could explain generalization to novel vowels, since no vowel information is extracted from the speech stream.

However, there is no direct evidence that human learners can induce novel phonotactic constraints directly from the speech stream, without the mediation of a lexicon of word forms. The artificial language in Bonatti et al. (2005) consisted of nine different words which occurred repeatedly in the speech stream. A possible explanation of their findings is therefore that learners induced phonotactics from a statistically learned lexicon. That is, it may have been the case that phonotactic learning was preceded by word segmentation, in which participants learned the artificial lexicon (containing nine trisyllabic word forms) from the speech stream, and subsequently judged test items based on their similarity to the stored word forms. This explanation is in line with the view that statistical learning provides a basis for an initial lexicon from which further linguistic generalizations can be derived (e.g., Swingley, 2005; Thiessen & Saffran, 2003).

The idea of constructing generalizations from unsegmented speech also appears to be in conflict with studies examining the learning of syllable class dependencies (e.g., Endress & Bonatti, 2007; Gómez, 2002; Gómez & Lakusta, 2004; Gómez & Maye, 2005; Marcus et al., 1999; Pena et al., 2002). In such studies, learners are presented with a similar challenge of detecting dependencies across varying intervening material. For example, Gómez (2002) found that adults and infants can learn dependencies between artificial words (e.g., pel and jic) across different intervening words (e.g., pel-wadim-jic, pel-kicey-jic). Work by Pena et al. (2002) and Endress and Bonatti (2007) has addressed the important question of whether such dependencies might be learnable from continuous speech. They found that the learning of generalizations can only take place if the speech signal contains subliminal acoustic cues (i.e., short silences) to word boundaries. These findings cast doubt on whether generalizations can be learned directly from speech, without the mediation of word forms, and without acoustic cues that signal word boundaries to the learner.

The current study provides a direct test for the possibility that learners induce phonotactics from continuous speech. In two experiments with artificial languages, we investigate whether phonotactic learning can take place in the absence of recurring word forms in the speech stream.

1.1 The current study

Throughout this study, we will use the term ‘word’ to refer to CVCVCV sequences, and ‘frame’ to refer to C_C_C_ structures (similar to lexical roots in Semitic languages, e.g., Bonatti et al., 2005; McCarthy, 1988). Any change in vowel thus represents a new word, but not a new frame.

The artificial languages used in the current experiments are relatively complex, compared to earlier studies. First, vowel slots are filled at random, so unlike in previous studies, there are no recurring words in the speech stream. This rules out the possibility that participants learn a small artificial lexicon from which generalizations are derived. Second, the consonant frames in our study are probabilistic. That is, while most earlier studies on consonant co-occurrence probabilities use a transitional probability (TP) of 1.0 within words (e.g., Bonatti et al., 2005; Newport & Aslin, 2004), our languages exhibit relatively subtle differences between ‘within-word’ and ‘between-word’ consonant probabilities (TP_within = 0.5, TP_between = 0.33). This is in line with a large body of work that suggests that phonotactic constraints used in speech processing are gradient rather than absolute (e.g., Bailey & Hahn, 2001; Vitevitch & Luce, 1998, 1999, 2005). Our languages thus exhibit some of the probabilistic variation that is found in natural language. The only other study we are aware of that uses relatively subtle differences in consonant TP is Toro et al. (2008), who use a consonant TP of 0.7 within words, and 0.16 between words.

Our study employs different training languages, and one single set of test items. This was done to ensure that participants’ responses are not solely driven by native language preferences or properties of test items (Reber & Perruchet, 2003; Redington & Chater, 1996). If participants base their decisions on knowledge from their native language, then they would display the same preferences in the test phase, regardless of their training condition. Conversely, if there is a significant difference between training conditions, then this can be ascribed to the structure of different familiarization languages.

2 Experiment 1

The experiment focuses on consonant triplets that occur across intervening vowels in a continuous speech stream. Two artificial languages are defined by different statistical structures: An ‘ABC’ language and a ‘BCA’ language. Test items in the experiment are novel combinations of consonant frames and vowel fillers. The experiment thus tests whether knowledge that is acquired by participants generalizes to novel words that have the same consonantal structure as the sequences in the continuous familiarization language. If participants learn the phonotactic structure of the language they are exposed to, then they should show a preference for novel words that conform to the pattern of their training language.

2.1 Method

2.1.1 Participants

Forty native speakers of Dutch (33 female, 7 male) were recruited from the Utrecht Institute of Linguistics OTS subject pool (mean age: 21.4, range: 18–39). Participants received five euros for participation. Participants were assigned randomly to either the ABC or BCA condition.

2.1.2 Materials

Artificial languages were created using six consonant frames. Consonants were taken from natural classes of obstruents. Class A consisted of voiceless /p/, /t/, /s/. Class B consisted of their voiced counterparts: /b/, /d/, /z/. Class C had three dorsals with mixed voicing: /k/, /g/, and /x/. Consonants were concatenated to create one language consisting of ABC frames, and one language consisting of BCA frames. The materials are given in Table 1.

Table 1

Artificial languages. (A = voiceless obstruents, B = voiced obstruents, C = dorsal obstruents)

Consonant frames, ABC language (C1_C2_C3_)    Consonant frames, BCA language (C2_C3_C1_)
p_d_g_                                        d_g_p_
p_z_k_                                        z_k_p_
t_b_x_                                        b_x_t_
t_z_g_                                        z_g_t_
s_b_k_                                        b_k_s_
s_d_x_                                        d_x_s_

Vowel fillers (random, both languages): [_a, _e, _o, _i, _u, _y]

A continuous stream of consonants was generated for each language by concatenating 600 randomly selected frames (about 100 occurrences per frame), resulting in a stream of 600 × 3 = 1800 consonants. ‘Within-word’ probabilities were 0.5 (consonants had 2 possible successors within frames), while ‘between-word’ probabilities were 0.33 (consonants had 3 possible successors between frames). The two languages differed with respect to the locations of high and low probabilities in the speech stream:

    (1) …ABC.ABC.ABC.ABC.ABC.ABC… (ABC language)
    (2) …A.BCA.BCA.BCA.BCA.BCA.BC… (BCA language)

where ‘.’ indicates a boundary as predicted by a low transitional probability between consonants. A set of six vowels (/a/, /e/, /o/, /i/, /u/, /y/) was used to fill the vowel slots between consonants. A vowel was chosen at random for each position. As a result, vowel TPs were 0.17 within words as well as between words. Due to the random insertion of vowels, each syllable bigram on average occurred only 2.7 times. Syllable TPs were thus unreliable and took on a wide range of values. Importantly, within-word syllable probabilities were not systematically higher than between-word probabilities.
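The stream construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used to build the actual stimuli: the function names (`make_stream`, `transitional_probs`), the use of Python's `random` module, and the seed are assumptions introduced here. Only the frames, the vowel set, and the stream length are taken from the text.

```python
import random
from collections import Counter

# Consonant frames of the ABC language (Table 1) and the six vowel fillers.
ABC_FRAMES = ["pdg", "pzk", "tbx", "tzg", "sbk", "sdx"]
VOWELS = ["a", "e", "o", "i", "u", "y"]

def make_stream(frames, n_frames=600, seed=0):
    """Concatenate randomly chosen frames; fill every vowel slot at random."""
    rng = random.Random(seed)  # the seed is an arbitrary, hypothetical choice
    consonants = [c for _ in range(n_frames) for c in rng.choice(frames)]
    syllables = [c + rng.choice(VOWELS) for c in consonants]
    return consonants, syllables

def transitional_probs(seq):
    """TP(x -> y) = count(x followed by y) / count(x followed by anything)."""
    pair_counts = Counter(zip(seq, seq[1:]))
    first_counts = Counter(seq[:-1])
    return {(x, y): n / first_counts[x] for (x, y), n in pair_counts.items()}

consonants, syllables = make_stream(ABC_FRAMES)
tps = transitional_probs(consonants)

# In the ABC language, transitions leaving a class C consonant cross a
# frame boundary; all other transitions are frame-internal.
class_c = set("kgx")
within = [tp for (x, _), tp in tps.items() if x not in class_c]
between = [tp for (x, _), tp in tps.items() if x in class_c]
mean_within = sum(within) / len(within)
mean_between = sum(between) / len(between)

# Mean number of occurrences per attested syllable bigram.
bigram_counts = Counter(zip(syllables, syllables[1:]))
mean_bigram_count = (len(syllables) - 1) / len(bigram_counts)
```

Averaged over consonant pairs, the estimated TPs equal 0.5 within frames and 0.33 between frames by construction. The estimate for any individual pair fluctuates around those values in a finite stream.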

Audio streams were generated using MBROLA (Dutoit et al., 1996), using the Dutch ‘nl2’ voice. Streams were synthesized with flat intonation and had a total duration of 7 minutes (with an average syllable duration of 232 ms).

Thirty-six novel test items were created for each language by combining consonant frames with novel vowel structures. There were no vowel repetitions within items (i.e., each word had 3 different vowels). For each test trial (e.g., /tibaxo/ – /dugopa/) a counterpart was created in which vowel frames had been switched (/tuboxa/ – /digapo/). Vowel frames were thus distributed evenly between ABC and BCA items, so that any learning effect must be due to consonant structure. Test items were chosen with a minimal difference in cohort density (i.e., the number of Dutch words starting with the initial CV syllable) between two items in a trial. All items were checked to make sure that the exact combination of consonants and vowels had not occurred in either of the two familiarization languages. The complete set of trials is given in the Appendix.
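The vowel-frame switch used to build counterpart trials can be illustrated with a short sketch. The helper names are hypothetical, and the code assumes that every segment is written as a single character, which holds for the transcriptions used here.

```python
def split_word(word):
    """Split a CVCVCV string into its consonant frame and its vowel frame."""
    return word[0::2], word[1::2]

def combine(consonants, vowels):
    """Interleave a consonant frame and a vowel frame into a CVCVCV string."""
    return "".join(c + v for c, v in zip(consonants, vowels))

def swap_vowel_frames(item_a, item_b):
    """Return the counterpart trial in which the two items exchange vowel frames."""
    cons_a, vowels_a = split_word(item_a)
    cons_b, vowels_b = split_word(item_b)
    return combine(cons_a, vowels_b), combine(cons_b, vowels_a)

def has_distinct_vowels(word):
    """True if no vowel is repeated within the item."""
    _, vowels = split_word(word)
    return len(set(vowels)) == len(vowels)

# The example trial from the text: /tibaxo/ - /dugopa/.
counterpart = swap_vowel_frames("tibaxo", "dugopa")
```

Applied to the example trial, the swap yields the pair /tuboxa/ and /digapo/ given in the text.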

2.1.3 Procedure

Participants were tested in a sound-attenuated booth. Participants received written instructions which were explained to them by the experimenter. Audio was presented over a pair of headphones. Participants responded by selecting one out of two response options (indicated visually with ‘1’ and ‘2’) with a mouse on the screen. The instruction given to participants was that they would hear a novel (‘Martian’) language, and that their task was to discover the words of this language. Before starting the actual experiment, participants were given a short pre-test in which they had to indicate whether a particular syllable had occurred in first or in second position in a trial. This was done to familiarize participants with the setup of the experiment.

The 7-minute familiarization stream was presented twice, with a 2-minute silence between presentations of the stream. As a consequence, total familiarization time was 14 minutes. The stream started with a 5-second fade-in, and ended with a 5-second fade-out. There were no indications of word endings or word beginnings in the speech stream. After familiarization, participants were given a two-alternative forced-choice (2AFC) task in which they had to indicate for each trial which out of two words sounded more like the Martian language they had just heard. The order of test trials was randomized, and the order of presentation of ABC and BCA items within trials was balanced within and across participants.

2.2 Results and discussion

Figure 1 (left half) shows the mean preferences of individual participants in the ABC and BCA conditions, coded as the percent preference for BCA items (which would be expected to be lower than 50% in the ABC condition, and higher than 50% in the BCA condition). Participants’ responses were analyzed with mixed-effects logistic regression with subjects and items as random factors. We first tested responses in each training condition against chance (Table 2). Participants in the ABC condition showed no significant preference for either ABC or BCA items (p = 0.114). In contrast, participants in the BCA condition significantly preferred BCA words over ABC words (p < 0.001). We then tested whether there was a significant effect of training condition on participants’ responses (Table 3). Participants exposed to the BCA language chose BCA items significantly more often than participants exposed to the ABC language (p = 0.0168). The odds of choosing a BCA item for participants in the BCA condition were e^0.3628 = 1.44 times higher than for participants in the ABC condition. The significant difference between groups indicates that participants’ preferences were affected by the phonotactic structure of the continuous language to which they had been exposed.

Figure 1 

Results for the ABC and BCA (Experiment 1) and control (Experiment 2) training languages. Circles indicate mean preferences for BCA words for individual participants. Triangles indicate the mean score for each condition.

Table 2

Experiment 1: Summary of mixed logit model for individual conditions against chance.

Predictor                  Coefficient   SE       Pr(>|z|)
Training language = ABC    0.2136        0.1353   0.114
Training language = BCA    0.5764        0.1368   <0.001***

Table 3

Experiment 1: Summary of mixed logit model testing for difference between conditions (learning effect).

Predictor            Coefficient   SE       Pr(>|z|)
Intercept            0.2136        0.1353   0.1144
Training language    0.3628        0.1517   0.0168*
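The odds ratio and p-value reported for the training-language effect can be recovered from the coefficient and standard error in Table 3. The sketch below assumes that the reported p-values are two-sided Wald z-tests, which is the usual output of mixed logit software but is not stated explicitly in the text. The function names are illustrative.

```python
import math

def odds_ratio(coefficient):
    """Convert a logit coefficient (log-odds difference) into an odds ratio."""
    return math.exp(coefficient)

def wald_p_value(coefficient, standard_error):
    """Two-sided p-value for z = coefficient / SE under a standard normal."""
    z = coefficient / standard_error
    return math.erfc(abs(z) / math.sqrt(2))

# Training-language effect from Table 3.
training_or = odds_ratio(0.3628)           # about 1.44
training_p = wald_p_value(0.3628, 0.1517)  # about 0.017
```

Under that assumption, the computed values agree with the reported odds ratio of 1.44 and p = 0.0168 up to rounding.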

While we can conclude from the significant difference between conditions that phonotactic learning took place in the experiment, Experiment 1 leaves open the question of which condition resulted in the learning of novel constraints. That is, even though performance in the ABC condition was not significantly different from chance, this does not necessarily imply that no learning took place in the ABC condition. Several studies have shown that artificial language learning from continuous speech is affected by phonological patterns in the native language (Boll-Avetisyan & Kager, 2014; Finn & Hudson Kam, 2008; Mersad & Nazzi, 2011; Onnis et al., 2005). If participants were somehow biased towards BCA patterns, then ABC participants could have un-learned this bias (resulting in chance-level performance), and BCA participants could have developed a preference for BCA patterns while ignoring statistical cues in the artificial language. To test this possibility, we designed a control experiment aimed at testing whether Dutch participants develop a preference for BCA words in the absence of statistical cues to novel phonotactic structure.

3 Experiment 2

One way to assess the effects of native language phonotactics on artificial language learning is to create a continuous artificial language that is neutral in terms of transitional probability (Boll-Avetisyan & Kager, 2014). By removing the probabilistic cue from the artificial language, any preference for test items must be due to factors other than phonotactic learning. If exposure to a language with probabilities that are constant throughout the familiarization stream leads to a preference for BCA items, then this can be taken as evidence for a native language bias.

3.1 Method

3.1.1 Participants

Twenty native speakers of Dutch (18 female, 2 male) were recruited from the Utrecht Institute of Linguistics OTS subject pool (mean age: 24.3, range: 19–35). Participants received five euros for participation.

3.1.2 Materials

A new familiarization language was constructed using the same consonant classes as in Experiment 1. Rather than imposing phonotactic constraints (i.e., specific ABC and BCA frames) on the speech stream, a sequence of consonants was generated by selecting one of the three consonants for each class at random. That is, every consonant from Class A could be followed by any of the consonants from Class B, etc. As a result, there was no difference between ‘within-word’ and ‘between-word’ probabilities. Consonant probabilities were 0.33 throughout the speech stream, providing no cue to the structure of the language:

    (3) …ABCABCABCABCABCABC… (control language)

As in Experiment 1, the resulting stream contained 1800 consonants, and six different vowels were used to fill the vowel slots at random. The method for generating the speech stream was identical to Experiment 1. The same test trials were used as in Experiment 1.
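The control stream can be sketched in the same way as a sanity check on the flat probability structure. As before, this is an illustrative reconstruction under stated assumptions (function names, random module, and seed are chosen here), not the original generation script.

```python
import random
from collections import Counter

# Consonant classes A, B, and C from Experiment 1.
CLASSES = ["pts", "bdz", "kgx"]

def make_control_consonants(n_triplets=600, seed=0):
    """Cycle through classes A, B, C, drawing one consonant at random each time."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    return [rng.choice(cls) for _ in range(n_triplets) for cls in CLASSES]

def transitional_probs(seq):
    """TP(x -> y) = count(x followed by y) / count(x followed by anything)."""
    pair_counts = Counter(zip(seq, seq[1:]))
    first_counts = Counter(seq[:-1])
    return {(x, y): n / first_counts[x] for (x, y), n in pair_counts.items()}

consonants = make_control_consonants()
tps = transitional_probs(consonants)

# Every consonant now has three possible successors at every position,
# so no transition type stands out as a likely boundary.
mean_tp = sum(tps.values()) / len(tps)
```

With all 27 consonant pairs attested, the mean TP is 0.33 for A-to-B, B-to-C, and C-to-A transitions alike, so low-probability transitions no longer mark a consistent position in the stream.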

3.1.3 Procedure

The procedure was the same as in Experiment 1.

3.2 Results and discussion

The mean preferences of participants in the control condition are shown in Figure 1 (right half). Participants’ responses were analyzed using mixed-effects logistic regression with subjects and items as random factors. Participants in the control condition showed no significant preference for either ABC or BCA items (p = 0.3283). We then compared performance in the control condition to performance in the ABC and BCA conditions of Experiment 1. Table 4 shows that performance in the ABC condition was not significantly different from the control condition (p = 0.5400). In contrast, there was a significant difference between the BCA condition and the control condition (p = 0.0023). The odds of choosing a BCA item for participants in the BCA condition were e^0.4434 = 1.56 times higher than for participants in the control condition.

Table 4

Experiment 2: Summary of mixed logit model for the control condition against chance and against the conditions from Experiment 1 (ABC and BCA).

Predictor                                  Coefficient   SE       Pr(>|z|)
Intercept (Training language = control)    0.1191        0.1219   0.3283
Training language = ABC                    0.0883        0.1441   0.5400
Training language = BCA                    0.4434        0.1454   0.0023**
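To make the coefficients in Table 4 easier to read, they can be converted into approximate probabilities of choosing the BCA item. The sketch below rests on several assumptions that the text does not spell out: the control condition is the reference level, the response is coded as choice of the BCA item, and random effects are set to zero. The resulting values describe a typical subject and item and are not observed group means.

```python
import math

def inverse_logit(log_odds):
    """Map log-odds onto a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Fixed effects from Table 4 (control condition assumed to be the reference level).
intercept = 0.1191
offsets = {"control": 0.0, "ABC": 0.0883, "BCA": 0.4434}

# Approximate probability of choosing the BCA item in each condition.
predicted = {
    condition: inverse_logit(intercept + offset)
    for condition, offset in offsets.items()
}
```

On these assumptions, the predicted preference for BCA items is close to 0.5 in the control and ABC conditions and clearly higher in the BCA condition, which matches the pattern described in the text.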

In the absence of statistical cues in the training language, participants did not develop a preference for either ABC or BCA. This result allows us to further interpret the results of Experiment 1. The significant difference between the BCA language and the control language indicates that participants in the BCA condition induced novel phonotactic constraints from continuous speech. However, participants in the ABC condition failed to induce the structure of their training language.

3.3 Similarity to Dutch phonotactics

Why did ABC participants fail to learn the phonotactics of their training language? Earlier work by Finn and Hudson Kam (2008) found that adult learners were unable to use novel statistical information in an artificial language task when this information conflicted with their native language phonotactics. This raises the possibility that native language phonotactics does not actively guide participants’ preferences in the absence of statistical cues in the artificial language, but instead interferes with the learning of novel statistical patterns that go against a more common native language pattern. We assessed this possibility by examining ABC versus BCA patterns in Dutch words in CELEX (Baayen et al., 1995).

Although only a few words in CELEX followed the exact ABC or BCA patterns used in our experiment, we did find more words that followed the BCA pattern (6 types, 128 tokens) than the ABC pattern (2 types, 51 tokens). Since our languages were defined in terms of natural classes of obstruents, we also looked at how common voiceless-voiced (AB), voiced-dorsal (BC), and dorsal-voiceless (CA) patterns are in Dutch monomorphemic words. The likelihood of these natural class patterns was calculated using the observed/expected (O/E) ratio, a measure of co-occurrence probability commonly used to quantify gradient phonotactic patterns (e.g., Adriaans & Kager, 2010; Frisch et al., 2004; Kager & Shatzman, 2007). An O/E ratio smaller than 1 indicates that a particular pattern is underrepresented (relatively uncommon) in the language, whereas patterns with a value larger than 1 are overrepresented (relatively common).

Table 5 shows that in word-initial position, voiceless-voiced patterns are somewhat underrepresented in Dutch (O/E = 0.74). In contrast, voiced-dorsal occurrences appear to be neutral (i.e., the O/E is close to 1). In word-final position, only a few relevant sequences were found, resulting in low observed and expected frequencies. In this case voiced-dorsal sequences are somewhat more common than dorsal-voiceless sequences. To get an overall estimate of the similarity of ABC and BCA natural class patterns to Dutch phonotactics, we summed the initial and final observed and expected values for each pattern. The net result is that BCA is a relatively neutral pattern in the Dutch lexicon (O/E = 1.04), whereas ABC is somewhat underrepresented (O/E = 0.76). The fact that underrepresentation occurs in voiceless-voiced sequences might be due to the involvement of two different values of a single feature, ‘voice.’ The other patterns involve two different features, ‘voice’ and ‘place,’ which are not a priori expected to interact.

Table 5

Observed/expected ratios for natural class patterns in Dutch monomorphemic words (CELEX).

Position          Pattern                            Observed   Expected   Observed/Expected
Initial           voiceless-voiced (AB)              180        243.9      0.74
Initial           voiced-dorsal (BC)                 81         78.6       1.03
Final             voiced-dorsal (BC)                 8          4.5        1.77
Final             dorsal-voiceless (CA)              10         9.1        1.10
Initial + final   voiceless-voiced-dorsal (ABC)      188        248.5      0.76
Initial + final   voiced-dorsal-voiceless (BCA)      91         87.7       1.04
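The pooled ratios in the last two rows of Table 5 follow from summing observed and expected counts before dividing. A minimal sketch, using the rounded values from the table (the dictionary layout and function name are illustrative):

```python
# Observed and expected counts per position and pattern, copied from Table 5.
counts = {
    ("initial", "AB"): (180, 243.9),
    ("initial", "BC"): (81, 78.6),
    ("final", "BC"): (8, 4.5),
    ("final", "CA"): (10, 9.1),
}

def pooled_oe(cells):
    """Sum observed and expected counts over the given cells, then take O/E."""
    observed = sum(counts[cell][0] for cell in cells)
    expected = sum(counts[cell][1] for cell in cells)
    return observed / expected

# ABC combines initial AB with final BC; BCA combines initial BC with final CA.
abc_oe = pooled_oe([("initial", "AB"), ("final", "BC")])
bca_oe = pooled_oe([("initial", "BC"), ("final", "CA")])
```

Summing the rounded expected values for ABC gives 248.4 rather than the 248.5 shown in the table, presumably because the table entry was computed from unrounded values. The resulting ratios still round to 0.76 and 1.04.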

In sum, we found that ABC words are relatively unlikely in terms of Dutch phonotactics. This gradient phonotactic restriction may have made it more difficult for participants in the ABC condition to learn the structure of their training language. In contrast, BCA words appear to be neutral with respect to Dutch phonotactics. The neutrality of BCA patterns could have enabled participants to induce novel statistical patterns from the BCA artificial language, while at the same time keeping neutral preferences in the absence of novel statistical cues in the random control condition. This interpretation of the results is in line with earlier findings that participants have difficulty with artificial language learning when the language being learned conflicts with native language phonotactics (Finn & Hudson Kam, 2008).

4 General discussion

Human learners have been shown to be capable of learning novel phonotactic constraints from exposure to isolated words. The current study tested another possibility, which is that learners might be able to induce phonotactics from continuous speech input. Using artificial languages that had no recurring word forms, we were able to investigate whether participants induce phonotactics directly from speech. We found that novel phonotactic constraints can emerge as a result of exposure to consonant co-occurrence patterns in continuous speech. However, learning was not symmetrical between two conditions with opposite statistical patterns. The asymmetry of our results is in line with studies showing that artificial language learning is particularly difficult when the novel statistical cues conflict with native language patterns (Finn & Hudson Kam, 2008). Performance in the learning condition was nevertheless impressive given the complexity of our languages, which had random vowels, and relatively subtle statistical manipulations. Despite difficult learning conditions, participants were able to learn novel phonotactics, and generalized these structures to novel test words.

Our study is related to work on the learning of structural generalizations (e.g., Endress & Bonatti, 2007; Gómez, 2002; Gómez & Maye, 2005; Pena et al., 2002). In such studies, participants learn long-distance dependencies across intervening syllables (e.g., pel X jic, Gómez, 2002). In our study, learners face a similar task of abstracting consonant patterns across intervening vowels (e.g., pXdYgZ). The aim of our study was to see whether such generalizations can be learned from continuous speech. Earlier work by Bonatti et al. (2005) found that learners can induce C_C_C_ frames and apply them to novel vowel structures. However, their design allowed for the possibility that generalizations were derived from a small, statistically learned lexicon of CVCVCV word forms, rather than directly from the speech stream. To rule out the possibility that word learning preceded phonotactic learning in our experiment, we avoided having recurring CVCVCV word forms in the speech stream. This design makes it likely that participants derived generalizations directly from continuous speech, without the mediation of an artificial lexicon.

Our findings appear to be in contrast with studies showing that generalization can only take place if the speech signal contains acoustic cues to word boundaries. Work by Pena et al. (2002) and Endress and Bonatti (2007) has shown that the learning of long-distance syllable dependencies relies on the presence of subliminal acoustic word boundaries in the speech stream. In contrast, our results suggest that the induction of phonotactic generalizations might not depend on the presence of such cues. Perhaps the difference in findings could be explained by the different nature of the generalizations involved. That is, it is conceivable that the generalization of consonant patterns can operate directly on the speech stream, whereas the generalization of long-distance syllable patterns might require a more established segmentation. Further research will have to determine the extent to which different types of generalizations operate on different types of input.

In earlier computational work we have argued that young infants might be able to induce phonotactics from continuous speech, and use the resulting phonotactic constraints for the detection of word boundaries (Adriaans & Kager, 2010). The ability to induce phonotactics from continuous speech could serve as a way of bootstrapping into lexical acquisition. The experiments presented here provide partial support for this view by showing that under certain conditions learners are indeed able to induce novel phonotactics from continuous speech. However, more work is needed to fully understand how artificial language learning interacts with the participants’ native language.

Our study has broader implications for models of phonotactic learning, which typically operate on word types in the lexicon (e.g., Hayes & Wilson, 2008). Given our finding that learners’ preferences for novel words can be affected by co-occurrence patterns in connected speech, and given the vast amounts of connected speech that listeners hear on a daily basis, it seems that models of phonotactic learning should allow for contextual effects to occur. Phonotactic learning might be affected by multiple sources of input, including word types, word tokens, and continuous speech.

Additional File

The additional file for this article can be found as follows:

Appendix

Experimental items. DOI: https://doi.org/10.5334/labphon.20.s1