The current study examines how listeners make gradient and variable ethnolinguistic judgments in an experimental context where the speaker’s identity is well-known. It features an open-guise experiment (
Recent research in perceptual sociolinguistics has investigated a host of phonetic and phonological variables—primarily segmental—to assess the extent to which social meanings are constructed in perception, similar to the way they are constructed in ongoing production. Despite production research in sociolinguistics demonstrating how speakers use intonational variation to index various ethnic identities and social stances (
In addition, research in perceptual sociolinguistics has rarely confronted the issue of whether social meanings are
We pursued these questions about intonational variation and social meaning via a task in which listeners rated samples of President Barack Obama’s speech on the degree of ‘sounding black.’
A body of linguistic research on ethnic identification dating back nearly 70 years has found that U.S. listeners are generally rather accurate (70–100%) at distinguishing black speakers from white speakers (cf.
This study focuses on one particular type of intonational variable as a starting point for understanding how listeners may react to ethnically linked suprasegmental features, using methods based in the autosegmental-metrical (AM) intonational framework (
MAE-ToBI contains two types of pitch movements: pitch accents, which occur on some stressed syllables, and edge tones, which occur at phrase boundaries. The current study focuses only on the movement of pitch accents, though it is important to note that we also tested for the perceptual effects of edge tones. This study focuses on the difference between two types of pitch accents in MAE: a simple high tone, labeled as H*, and a fall-rise, labeled as L+H*. Though other types of pitch accents exist, H* and L+H* are by far the most common pitch accents in most varieties of U.S. English, including AAL (
Earlier studies have shown that pitch accents are perceptually salient for listeners and that naïve listeners can be trained to identify them quickly (
Recent work by Holliday (
This study’s focus on intonational and suprasegmental variation presents an opportunity to address questions about phonetic detail and social meaning. One of the most significant recent advances in sociolinguistic theory has been the advent of sociophonetics (e.g.,
Although the binary treatment of phonetic variables reveals structure in sociolinguistic variation, a sociophonetically informed approach recognizes that the distribution of these variables’ continuous acoustic correlates is not always compatible with discrete categorization. For example, Jacewicz and Fox (
At the same time as research on production in sociolinguistics has increasingly turned to phonetic detail, the role of such detail remains under-theorized and under-investigated in the study of social meaning. To that end, Podesva (
These predictions about categorial and phonetic salience have been supported by a handful of findings on the distribution and social meaning of intonational variation in production. For example, Podesva (
As far as we are aware, only a handful of perceptual studies have investigated how social meanings are affected by phonetic detail. Plichta and Preston (
The present study seeks to expand our understanding of the relationship between phonetic detail and social meaning by investigating this relationship through the lens of intonational variation. Building on Podesva (
This study was designed to address three central research questions:
How do pitch accents affect listener judgments of ethnic identity? In particular, does the L+H* pitch accent carry a social meaning of blackness in perception, as it does in production?
To what extent are the ethnicity-based social meanings of these pitch accents mediated by incremental phonetic differences?
What other aspects of voice quality affect listener judgments of ethnicity?
These questions were investigated via a perceptual task in which listeners rated 120 samples of President Barack Obama’s speech with respect to how much they thought he ‘sounded black’ in each particular sample.
This task used the ‘open-guise technique’ (OGT) (
In the present study, we assumed that listeners (all from the United States) were highly likely to recognize our stimulus speaker, President Barack Obama, necessitating an OGT rather than MGT approach. We openly informed our listeners, “This study is designed to test how people respond to different speech excerpts from the same speaker.” In so doing, we rejected the type of instrumental task framing often used in MGTs, such as evaluating prospective radio newsreaders (
The 120 stimuli were based on excerpts of President Barack Obama’s spontaneous speech from two different 2016 television interviews with Gayle King, a black broadcast journalist who co-anchors the
Sixty excerpts were selected: 20 critical excerpts and 40 filler excerpts. Ten critical excerpts were ‘H* phrases,’ which contained 1–3 H* pitch accents and no L+H* accents; the other ten were ‘L+H* phrases,’ which contained 1–3 L+H* pitch accents and 0–2 H* accents. This asymmetric definition of H* versus L+H* phrases was necessary because L+H* pitch accents are relatively rare, even in AAL (
In choosing excerpts, we intentionally sacrificed a degree of experimental control for the sake of presenting listeners with natural, spontaneously produced stimuli rather than unnatural, lab-like speech. The benefit of using spontaneous stimuli is that they more closely model real-world perception conditions, as listeners perceive spontaneous and read speech (including oratory) differently (
The critical stimuli were created by rendering each critical excerpt at four manipulation steps, with the original excerpt as Step 1. Steps 2, 3, and 4 were created by making pitch accents’ F0 minima and maxima successively more extreme. With each manipulation step, H* and L+H* maxima were increased by a semitone, and L+H* minima were decreased by a half-semitone. For example, the H* pitch accent in the top panel of Figure
Original (Step 1) and manipulated (Steps 2–4) versions of pitch accents in stimuli: H* pitch accent in
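The stepwise F0 manipulation described above can be sketched numerically. The following is a minimal illustration, not the authors' actual resynthesis procedure: it assumes the manipulation is cumulative (Step k differs from the original by k − 1 increments) and uses the standard relation that a shift of n semitones multiplies frequency by 2^(n/12). The function names and the example F0 values (200 Hz peak, 150 Hz valley) are hypothetical.

```python
def shift_semitones(f0_hz: float, semitones: float) -> float:
    """Shift a frequency by a (possibly fractional or negative) number of semitones."""
    return f0_hz * 2 ** (semitones / 12)


def manipulate_accent(f0_max_hz: float, f0_min_hz: float, step: int, accent_type: str):
    """Return (new_max, new_min) in Hz for a given manipulation step (1-4).

    Assumption: changes accumulate, so Step 1 is the original excerpt and
    Step k applies (k - 1) increments. Maxima rise 1 semitone per increment
    for both accent types; minima fall 0.5 semitone per increment for L+H* only.
    """
    increments = step - 1
    new_max = shift_semitones(f0_max_hz, 1.0 * increments)
    if accent_type == "L+H*":
        new_min = shift_semitones(f0_min_hz, -0.5 * increments)
    else:  # H*: only the maximum is manipulated
        new_min = f0_min_hz
    return new_max, new_min


# Hypothetical L+H* accent with a 200 Hz peak and a 150 Hz valley
for step in range(1, 5):
    peak, valley = manipulate_accent(200.0, 150.0, step, "L+H*")
    print(f"Step {step}: peak {peak:.1f} Hz, valley {valley:.1f} Hz")
```

Under these assumptions, the peak-to-valley excursion of an L+H* accent widens by 1.5 semitones per step, whereas an H* accent changes only in peak height.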
Filler stimuli were created by modifying the final syllable of filler excerpts to include percepts of creaky voice: low F0 and damped pulses (
The task was administered via an online survey hosted by Qualtrics. In each of 120 randomly ordered trials, listeners heard a single stimulus auto-play twice and responded to the question “How black or white does Obama sound here?” on a continuous unit-less slider bar with “very black” and “very white” on opposite poles. As the recognizability of President Obama’s voice would have likely rendered ineffective the type of instrumental task framing often used in MGTs (e.g., rating prospective radio newsreaders, as in
The survey was distributed via social network sampling in May 2017, with a raffle incentive for one randomly selected listener to win an
As mentioned above, both authors listened to all stimuli and confirmed that they sounded natural. As a further check on stimulus naturalness, we coded listeners’ responses to the final two questionnaire items: “How did the clips sound to you?” and “Do you have any other comments on the clips or on the survey?” Based on listeners’ responses to these questions, the second author developed eight true-or-false codes that described sentiments listeners expressed in their responses and coded responses accordingly (with a single response capable of being coded “true” in multiple categories). For example, 21% of listeners reported something amiss with the quality of the clips (although numerous listeners commented positively about the clips’ quality). More information about these codes, including examples, can be found in Appendix B. As we discuss below, however, none of these codes significantly improved our model of intonation results, so we did not find evidence that they impacted listeners’ perceptions of the speaker’s blackness.
Slider-bar positions were converted to real numbers between 0 (“very white”) and 100 (“very black”) and standardized by listener to control for variable usage of the continuous slider bar. All results are reported in unit-less standard deviations (i.e., z-scores); the average listener’s standard deviation was 16.6 slider points, so a difference of 1 standard deviation corresponds to roughly one-sixth of the length of the slider bar for the average listener.
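As an illustration of by-listener standardization, the sketch below z-scores each listener's ratings against that listener's own mean and standard deviation. It is a minimal example under stated assumptions: the listener IDs and ratings are invented, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def standardize_by_listener(ratings):
    """Convert raw slider positions (0-100) to z-scores within each listener.

    ratings: dict mapping a listener ID to that listener's list of raw ratings.
    Assumption: sample standard deviation (n - 1 denominator).
    A listener who gave identical ratings on every trial would have a
    standard deviation of zero and would need to be handled separately.
    """
    standardized = {}
    for listener, values in ratings.items():
        m = mean(values)
        sd = stdev(values)
        standardized[listener] = [(v - m) / sd for v in values]
    return standardized


# Hypothetical listeners: A uses a narrow band of the slider, B a wide one
raw = {"A": [40, 50, 60], "B": [10, 50, 90]}
z = standardize_by_listener(raw)
print(z)
```

In this toy case the two listeners use very different portions of the slider, yet their standardized ratings coincide, which is the sense in which standardization controls for variable usage of the scale.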
Our task was specifically designed to address the first two research questions, about the role of pitch accents and phonetic incrementality in affecting listener judgments of ethnicity; we first present the analysis of intonation features. We then describe a post hoc analysis of voice quality characteristics that addressed the third research question, about the role of other voice quality features in affecting listener judgments of ethnicity.
We compared linear mixed-effects models of standardized ratings to find the predictor structure that best modeled the data in critical trials, via the lmerTest package for R (
Table
Summary of best model of listener ratings of blackness. Degrees of freedom estimated via Satterthwaite approximations (
Term | Estimate | Std. Error | df | t | p |
---|---|---|---|---|---|
(Intercept) | –0.0409 | 0.1065 | 23.1 | –0.384 | 0.7045 |
PhrTypeL+H* | 0.0172 | 0.1507 | 23.1 | 0.114 | 0.9103 |
Step2 | 0.0041 | 0.0458 | 6056 | 0.089 | 0.929 |
Step3 | –0.0214 | 0.0459 | 6056 | –0.466 | 0.6415 |
Step4 | 0.0132 | 0.0459 | 6056 | 0.287 | 0.7737 |
PhrTypeL+H*:Step2 | 0.0297 | 0.065 | 6056 | 0.457 | 0.6475 |
PhrTypeL+H*:Step3 | 0.1314 | 0.065 | 6056 | 2.02 | 0.0434* |
PhrTypeL+H*:Step4 | 0.1047 | 0.065 | 6056 | 1.609 | 0.1076 |
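To make the coefficients in the table easier to read, the sketch below combines them into fitted ratings for each phrase type and manipulation step. It assumes treatment (dummy) coding with H* phrases and Step 1 as the reference levels, which is what the term names suggest but is not stated explicitly, and it sets all random effects to zero. The function name is hypothetical.

```python
# Fixed-effect estimates copied from the model summary table (z-score units)
COEF = {
    "(Intercept)": -0.0409,
    "PhrTypeL+H*": 0.0172,
    "Step2": 0.0041,
    "Step3": -0.0214,
    "Step4": 0.0132,
    "PhrTypeL+H*:Step2": 0.0297,
    "PhrTypeL+H*:Step3": 0.1314,
    "PhrTypeL+H*:Step4": 0.1047,
}


def fitted_rating(phrase_type: str, step: int) -> float:
    """Fitted standardized rating, assuming treatment coding
    (reference levels: H* phrase, Step 1) and random effects of zero."""
    value = COEF["(Intercept)"]
    if phrase_type == "L+H*":
        value += COEF["PhrTypeL+H*"]
    if step > 1:
        value += COEF[f"Step{step}"]
        if phrase_type == "L+H*":
            value += COEF[f"PhrTypeL+H*:Step{step}"]
    return value


for phrase_type in ("H*", "L+H*"):
    for step in (1, 2, 3, 4):
        print(f"{phrase_type:5s} Step {step}: {fitted_rating(phrase_type, step):+.4f}")
```

Under these assumptions, fitted ratings for H* phrases stay within a few hundredths of a standard deviation of the intercept at every step, while fitted ratings for L+H* phrases rise by roughly a tenth of a standard deviation between Step 1 and Steps 3 and 4. Given an average listener standard deviation of 16.6 slider points, the significant interaction estimate of 0.1314 corresponds to only about two points on the 100-point slider.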
As is evident from this model, listener ratings of blackness tended to increase with the more extreme step manipulations, though this is only statistically significant for L+H* phrases. Also notable is that the model revealed no significant listener effects for gender, race, region, education, or political affiliation, indicating that listeners were remarkably similar in their ratings regardless of a number of potentially influential demographic factors. While previous studies have generally found that in-group community members may perform better in ethnic identification tasks (cf.
These results must be interpreted with caution, however, in light of their small effect size. The sole significant term in Table
The model showed no main effect of phrase type on listener ratings of blackness, indicating that pitch accent type alone did not trigger different blackness ratings. Figure
Fitted model predictions for listener ratings of blackness by phrase type and manipulation step. Error bars represent 95% confidence intervals.
Though the main effect of phrase type failed to reach significance, the model indicated a significant interaction between phrase type and manipulation step, with more extreme L+H* phrases rated as sounding blacker than less extreme L+H* phrases, and no perceived blackness difference for H* phrases regardless of step. Figure
Fitted model predictions for listener ratings of blackness by manipulation step and phrase type. Error bars represent 95% confidence intervals.
This model also implies that listener judgments of blackness are affected by more than just pitch accent type and phonetic shape. As mentioned above, these results must be interpreted with caution, especially in light of the fact that the model’s fixed-effects predictor structure accounted for less than 1% of the variance in ratings, while random effects—the effect of individual excerpts—accounted for 11.3% of the variance. In other words, listeners were much more attuned to features varying by excerpt, such as segmental, semantic, pragmatic, or voice quality characteristics, than the type and phonetic shape of pitch accents. However, this small effect size may represent an inherent challenge to studies of prosody, since the highly nested nature of such variables causes them to be difficult to isolate from one another. Despite this challenge, the finding of a significant difference here may be a step in the direction of discovering how these variables may operate both independently and together. The small effect size of this intonation effect motivated the post hoc analysis of voice quality features.
Our perceptual experiment was specifically designed to test predictions about how listener judgments of ethnicity are influenced by the type and phonetic shape of pitch accents; however, sociophoneticians have long suspected that voice quality characteristics may also influence listener judgments of ethnicity (e.g.,
We ran a Praat script on critical stimuli to extract several measures that, according to previous studies, may pattern differently in AAL versus MAE: phrase speech rate, pitch ratio (
As with the intonation analysis, we modeled standardized ratings via linear mixed-effects models. Because the intonation analysis revealed differences in patterning of responses to H* versus L+H* stimuli, we fit separate models to H* versus L+H* critical trials. We included manipulation step in these models to determine whether the intonation analysis’s findings about the role of manipulation step—significantly affecting listener ratings of blackness in L+H* stimuli but not H* stimuli—remained after considering voice quality features. These models also included random intercepts for excerpts and random by-excerpt slopes for the manipulation step factor. Voice quality measures were normalized (z-scored) to account for widely differing measurement scales.
To account for likely collinearity of voice quality measures (e.g., jitter and pitch ratio are both measures of changes in fundamental frequency), we adopted a model-comparison strategy that iteratively added interaction terms to the models based on correlations between measures. We first ran baseline models that included all voice quality measures as main effect predictors with no interactions. (Again, these models also included random intercepts for excerpts and random by-excerpt slopes for the manipulation step factor.) We then checked these baseline models for correlations between voice quality measures; any pair of measures whose correlation coefficient exceeded 0.4 in absolute value in either model was added as an interaction term to both models. After running these models, we again added interaction terms (including three-way interactions) based on correlations between voice quality terms. The resulting models included the following interactions: phrase speech rate × peak delay × HNR, shimmer × jitter × HNR, pitch ratio × intensity average. For both the H* and L+H* models, each successive model represented a significant improvement in model fit at an α = .05 significance threshold.
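The screening step in this strategy can be illustrated with a short sketch. Note the simplification: the study checked correlations within the fitted baseline models, whereas this example substitutes plain pairwise Pearson correlations between the raw measures, simply to show how a 0.4 absolute-value threshold selects candidate interaction terms. The measure values and function names are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(measures, threshold=0.4):
    """Return (name_a, name_b, r) for every pair of measures with |r| > threshold.

    Simplification: raw pairwise correlations stand in for the
    model-based correlations checked in the study.
    """
    flagged = []
    for a, b in combinations(sorted(measures), 2):
        r = pearson(measures[a], measures[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical per-stimulus values for three voice quality measures
toy_measures = {
    "jitter": [1, 2, 3, 4, 5],
    "pitch_ratio": [2, 4, 5, 4, 5],
    "hnr": [3, 1, 2, 1, 3],
}
for a, b, r in correlated_pairs(toy_measures):
    print(f"candidate interaction: {a} x {b} (r = {r:.2f})")
```

Each flagged pair would then be entered as an interaction term and the augmented model compared against its predecessor, mirroring the iterative procedure described above.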
Summaries of fixed effects for the voice quality models are in Appendix C. The voice quality model for H* critical trials revealed that few voice quality measures significantly affected listener perceptions of blackness: phrase speech rate and the interaction of peak delay and HNR. Phrase speech rate (seconds per syllable) had a positive effect on listener perceptions of blackness, with slower phrases rated blacker. While the model returned positive estimates for the effects of peak delay and HNR, neither of these main effects reached significance. Rather, the effect of peak delay on listener perceptions of blackness was constrained by the phrase’s HNR. As Figure
H* model predictions for perceived blackness ratings by peak delay (seconds) and HNR (dB). The five facets display peak delay slopes at the minimum, first quartile, median, third quartile, and maximum values for HNR among H* stimuli.
As with the H* model, few predictors reached significance in the L+H* model, including just one voice quality measure, jitter. Among L+H* stimuli, phrases with less jitter were rated blacker, suggesting that listeners are sensitive to the interaction of F0 movement and local periodic perturbations. Notably, the measures affecting listener perceptions of blackness did not overlap for H* versus L+H* phrases; the jitter term in the H* model, and the phrase speech rate and peak delay × HNR terms in the L+H* model, did not even approach significance. This finding provides additional evidence that listeners may respond to different intonation and voice quality cues in phrases containing L+H* pitch accents than in those not containing L+H* pitch accents. As L+H* accents are far less common than H* accents, it is possible that L+H* accents cue listeners to adjust their expectations as to markers of ethnic identification.
In addition, manipulation step was significant in the L+H* voice quality model (manipulation steps 3 and 4 were rated blacker than steps 1 and 2) but not the H* voice quality model. This finding corroborates the generalization that the percept of blackness is subject to phonetic incrementality only with respect to the more socially marked L+H* pitch accent. However, this finding is tempered by the fact that the fixed effects in the L+H* model accounted for less than 3% of the variance, as compared with 10.2% for the fixed effects in the H* model (Table
R2 values (percentage of variance accounted for) for voice quality models, calculated via R package piecewiseSEM (
Model | Fixed-effects R2 | Random-effects R2 | Total R2 |
---|---|---|---|
H* model | 10.2% | 4.3% | 14.5% |
L+H* model | 2.7% | 24.7% | 27.3% |
In short, the voice quality analysis found that listeners relied on multiple acoustic cues—beyond those pertaining to pitch accents’ type or phonetic shape—in making judgments of perceived blackness; crucially, in the presence of an L+H* pitch accent listeners not only relied on different voice quality cues than in the absence of one, but they apparently relied to a much greater degree on cues other than those relating to intonation or voice quality. This finding suggests a fundamental difference in how listeners judge phrases in the presence of an L+H* pitch accent, although this is an open question for future study. More broadly, this finding further supports the claim that understanding the interrelated nature of prosodic variables is a necessary part of their description.
To summarize, this study has demonstrated that listeners are sensitive to the details of phonetic realizations of the H* and L+H* pitch accents in declaratives, and that a larger difference between the F0 maximum and minimum within L+H* pitch accents appears to cause listeners to rate a speaker (in this case, President Barack Obama) as sounding blacker. However, the difference between H* and L+H* pitch accent phrases alone is not sufficient to trigger this judgment; it is the actual realization of the pitch accents themselves that listeners seem to attune to. In addition to pitch accent type and phonetic shape, listeners also attend to voice quality cues in judging blackness, though the relevant cues are different for H* versus L+H* phrases: speech rate, peak delay, and harmonics-to-noise ratio for H* phrases, jitter for L+H* phrases. There is also some evidence that the number of L+H* and H* pitch accents in a phrase affects listener judgments of blackness. We also obtained an unexpected finding with respect to speech rate; among H* stimuli, slower phrases were perceived as blacker than faster phrases, which could possibly indicate that speakers have different expectations related to ethnolinguistic variation and speech rate (
This study’s results show that in a perception task, listeners appear to be sensitive not only to the phonological category of pitch accents but also to their phonetic realization, as listeners respond to increasingly extreme manipulations of F0 within a single pitch accent type. In the traditional AM model of intonational phonology, pitch accents and edge tones have largely been binned into discrete categories, with meaning presumed to be attached to those categories and their combinations (
Relatedly, as much of the work on prosody has focused on the meaning of intonational contours in an imagined Standard American English as opposed to in specific varieties, it is clear that much more work is needed on both variation in speaker production and listener perception of contour meaning. Though the current study did not reveal differences in perception of ‘sounding black’ conditioned by listener demographics, future work should explore how such perceptions could potentially be affected by listeners with different backgrounds and sociolinguistic experiences.
This point about the role of demographics is especially relevant because (as mentioned above) the listener sample was overwhelmingly liberal and approving of Obama’s presidency, more so than the US population at large. While this is not an issue for the present study—our aim was not to achieve political representativeness but rather to ascertain how intonational variation affected perceptions of blackness within a population of US listeners—it does contextualize the results. Theoretical frameworks that take as primary the role of experience in forming linguistic representations (e.g., Exemplar Theory,
In their 2004 study and summary of the body of research on ethnic identification of white and black speakers in the U.S., Thomas and Reaser reveal gaps in our knowledge about what triggers judgments of speakers as ‘black’ or ‘white.’ Most ethnic identification studies have focused on segmental features, at least in part because so little is known about how non-standard varieties of American English employ intonational variation, though such studies on prosodic variables have been carried out outside the U.S. (
This study also builds on the findings of Purnell et al. (
It is worth reiterating here the small effect size that we found in our intonation model, in which fixed effects accounted for less than 1% of the variance in listener ratings of blackness. Some readers may interpret this small effect size and the proximity of the sole significant intonational model term’s
These findings support the notion that listeners attend to phonetic detail in constructing social meanings of sociophonetic variation, given that listener ratings of blackness for L+H* increased stepwise as L+H* pitch accents became more phonetically extreme. In other words, there is some evidence that listeners map continuous social meanings to continuous variation, supporting our incrementality hypothesis; contra Podesva’s (
These findings expand our understanding of methods for probing language attitudes, countering the received wisdom in MGT research that these tasks only work if listeners believe they are judging different speakers (
Moreover, whereas stimulus speakers in typical MGTs are anonymous to listeners, representing blank attitudinal canvases save for small bits of contextual information provided via stimulus text and/or explicit labels, listeners in this study likely had salient prior impressions of President Obama and his racialized speech. The finding that the guise manipulation affected listener perceptions of Obama’s blackness is even
Although the OGT worked in the present study, we caution readers against the assumption that the OGT will necessarily apply to any context, feature, or trait. First, while both Soukup’s study and the present study intentionally violated the assumption that listeners should believe they are judging different speakers, in both studies listeners were not told which
Second, we argue that there remain contexts in which it is important to conceal the fact that the same speaker is behind both or all guises. While the majority of speaker evaluation tasks involve cognitive and/or affective responses, we predict that tasks involving behavioral responses (e.g., making a hiring decision) are likelier to hinge on listeners believing they are hearing different speakers. For example, if the landlords in Purnell et al. (
Third, we argue that the use of an OGT rather than MGT approach must be justified by a plausible style-shifting context. For example, this task relied on listeners’ awareness of President Obama’s style-shifting to sound more black in some contexts and less black in others (
Caveats about the OGT notwithstanding, it is clear that traditional approaches to linguistic perception do not give listeners enough credit for being aware of style-shifting; indeed, explicit public awareness of style-shifting (e.g.,
The current study examined listener ratings of phonetically manipulated speech to test whether listeners were sensitive to such manipulations in the process of making judgments about speaker ethnicity. Regression models indicated that listeners systematically judged a familiar speaker as ‘sounding blacker’ when exposed to more extreme F0 manipulations of both the peak and valley of L+H* pitch accents. This effect was mediated by incrementality, with more extreme L+H* pitch accents mapping to greater perceptions of blackness, albeit with an effect size that suggests caution in generalizing these results. Results of post hoc testing reveal that a number of voice quality features also appear to be involved in these judgments. In particular, speech rate, peak delay, HNR, and jitter appear to influence listener judgments, though the salience of voice quality features may be mediated by the presence versus absence of L+H* pitch accents.
These results have important implications for future work, both for formal studies of intonational variation and for sociophonetic studies of ethnic identification. The finding that listeners seem to attune differently to H* versus L+H* pitch accents in ethnicity judgments and that these perceptions are influenced by phonetic factors provides further motivation for studies that examine intonation from both a phonological and a phonetic perspective. Additionally, the finding that listener perceptions of ethnicity may be manipulated by alterations in F0 provides important context for studies that aim to isolate the phonetic features that may trigger listener judgments of ethnicity. This is especially important given the large body of work on linguistic profiling and discrimination and may provide additional resources for linguists who aim to describe and address racial inequality. Finally, these results indicate that listeners’ sociolinguistic perceptions are sensitive to the magnitude of the input, a finding that indicates promising directions for research in language attitudes and sociolinguistic cognition.
The additional files for this article can be found as follows:
Questionnaire. DOI:
Questionnaire qualitative codes. DOI:
Voice quality model summaries. DOI:
Listeners were intentionally not provided guidance on how to interpret this question, because earlier ethnic identification studies allowed for speakers to answer with their own conceptualizations of race and ethnicity (cf.
Portions of this data appeared in print in the University of Pennsylvania Working Papers, Selected Papers from NWAV46, as “How black does Obama sound now?: Testing listener judgments of intonation in incrementally manipulated speech.”
Though some scholars have posited that the intonational phonological inventory of AAL may differ from that of MAE, MAE-ToBI is still considered a reliable method for analyzing intonation in AAL, at least until researchers further investigate development of an AAL ToBI system (
Whereas Podesva’s phonetic salience hypothesis applies only to phonetic outliers, our incrementality hypothesis applies across the ‘axis of phonetic variation’; the latter can thus be considered a stronger form of the former.
An anonymous reviewer expresses doubt that this significant effect “would reliably reappear as an important factor” in a replication of the present study; we agree that this is an empirical question.
Phrase speech rate, peak delay, and vowel duration are prosodic features, not voice quality features, but for the sake of brevity we refer to the entire set as voice quality features.
Thanks to an editor for pointing this out.
The authors wish to express their thanks to Paul Reed for comments on the study design. We would also like to thank the audiences at New Ways of Analyzing Variation (NWAV46) and Sociolinguistics Symposium 22, as well as anonymous reviewers for their helpful feedback. Thanks also to our listeners.
The authors have no competing interests to declare.