Linguistic convergence is the phenomenon in which interlocutors’ speech characteristics become more similar to each other’s. One of the methods frequently used to measure convergence is the difference-in-difference (DID) approach, comparing change in absolute distance between a subject and an interlocutor or model talker. We show that this approach is not a reliable measure of convergence when the starting values of the subject and the interlocutor or model talker are close, which can result in the measurement of apparent divergence, while extreme starting points can result in overestimation of convergence. These biases are of particular concern in studies that look for individual differences in convergence. We propose an alternative approach, linear combination, which does not have the same biases, and demonstrate the advantages of this method using data from convergence studies of four linguistic characteristics and simulated data.

Convergence is the phenomenon in which speakers become more similar to their interlocutors, which has been observed in many characteristics, both linguistic and non-linguistic. Many studies find variation in performance across participants, which is used as evidence for individual differences or population differences (e.g.,

Using data from four convergence studies, we demonstrate that DID is not a suitable measure of convergence in the following ways: (1) DID underestimates convergence and can even produce apparent divergence when the subject’s baseline performance is close to the reference value of the interlocutor or model talker, and (2) DID interprets regression to the mean as convergence. For these reasons, DID measures of convergence for individual talkers are unreliable. Our proposed alternative, linear combination, while fully capable of measuring convergence (

Convergence to an interlocutor within a conversation or to a model talker during shadowing or other exposure is often measured by the change in difference between the two speakers. In shadowing tasks, subjects are exposed to recordings of a model talker, which they repeat after, and the comparison is made of their speech before and after exposure. Conversational interactions present additional complications in defining reference points for the speakers, because both of the interlocutors are potentially changing.

In shadowing tasks, recordings are often naturally produced (e.g.,

In interactional studies, it is difficult to reliably expose subjects to consistent extreme productions; while confederate interlocutors can be trained to produce certain behavioral patterns such as face rubbing and foot shaking (

Even when stimuli are not manipulated, the choice of model talker or the particular assigned interlocutor can influence how distant subjects are from these other speakers. Some studies use multiple model talkers (e.g.,

In shadowing tasks, subjects’ productions are measured before exposure to the model talker (_{b}_{R}_{b}_{R}_{R}_{R}

(1) $S_R = \beta_0 + \beta_b S_b + \beta_R R$

However, many studies do not model both the effects of the subject and the model talker, instead looking at the change in similarity of the subject and the model talker. Some studies measure similarity subjectively with AXB designs (e.g.,

(2) $\mathrm{DID} := |S_b - R| - |S_R - R|$
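The behavior of a distance-based measure can be illustrated with a minimal sketch. It assumes DID is the starting distance minus the ending distance, so that positive values indicate apparent convergence; the function and argument names are our own labels for illustration, not notation taken from any particular study.

```python
def did(subject_before, subject_after, reference):
    """Difference-in-difference: starting distance minus ending distance."""
    return abs(subject_before - reference) - abs(subject_after - reference)

reference = 0.0

# The subject moves halfway toward the reference.
partial = did(1.0, 0.5, reference)      # 0.5

# The subject overshoots the reference and ends up equally far on the other side.
overshoot = did(1.0, -0.5, reference)   # 0.5

# The subject starts on the reference; any later variation lowers DID.
on_target = did(0.0, 0.2, reference)    # -0.2

print(partial, overshoot, on_target)
```

Partial convergence and overshooting receive the same score, and a subject who starts at the reference value cannot receive a positive score at all.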

DID is sometimes used to quantify convergence in conversational tasks, looking at the change from the two speakers’ starting distance and ending distance (e.g.,

Some studies, rather than measuring change in distance between two interlocutors, look at correlations between partners’ productions or compare speakers’ productions in different conditions. Comparing participants’ productions under two manipulation conditions is a particularly common method in syntactic priming paradigms, e.g., comparing participants’ use of dative indirect objects and double accusative constructions when they had been exposed to descriptions using one or the other (

We argue that the DID approach has biases that make it unreliable for investigations of individual differences, except when the reference value is outside the range of normal productions. DID can still capture convergence broadly when aggregated across participants, but it introduces a degree of noise that could obscure convergence, particularly when compared across groups of participants.

The first issue is that DID does not distinguish between different trajectories producing the final distance; distance due to lack of convergence is treated the same way as distance due to speakers over-converging. For example, if _{b}_{R}_{R}

The second issue is that eliminating the term for speakers’ consistency (

Some work includes a comparison of distance between subjects and their actual interlocutors or model talkers versus distance between ‘pseudo-pairs’ of subjects and speakers or model talkers they did not interact with (e.g.,

The final issue is that difference-in-difference is susceptible to effects of

Differences in starting distances between subjects and interlocutors or model talkers and distance between subjects and the population mean may account for some of the variation in convergence across measures. In the same task, there can be differences in overall convergence in different measures (e.g.,

In this paper, we use data from four interactional studies of convergence (from

We use data from the four convergence studies used by Cohen Priva and Sanker (

Each conversation is annotated for clarity. To ensure reliable acoustic measurements (F0 median and variance), calls with high levels of background noise, echoing, or other issues were omitted. This left 464 speakers used for acoustic measures. For the other measures (

We follow Cohen Priva et al. (_{R}_{b}

We use the four measures presented by Cohen Priva and Sanker (

We compare two approaches to modelling convergence: (1) difference in difference (DID) and (2) mixed-effects linear regression using the subject’s and interlocutor’s baselines as predictors of each subject’s productions. All the measures were standardized.

While DID is generally not examined with regression models, we use these models to more closely parallel the alternative linear combination model that we propose, which can only be done with regression. The limitations of DID are based on comparing distance without reference to the raw values for each speaker, rather than being based on how those differences are modelled, so the regression structure should not change the behavior of DID results.

When multiple subjects and interlocutors are present, as in Cohen Priva et al. (

(3)

The intercept would be significantly positive if overall convergence is detected. The random intercepts for subject, interlocutor, and conversation would model individual differences in convergent behaviors by individual, by interlocutor, or by conversation.

(4)

The intercept in this case is expected to be zero if the predictors are standardized, as they are in the studies presented here. The subjects’ baseline models

The by-subject intercept is expected to explain little variance, as the subject’s baseline is explicitly provided as a fixed predictor, and is included only for completeness. The by-interlocutor intercept is somewhat more necessary, as it can capture variance not due to convergence, such as modifying one’s speech when speaking with figures of authority or older people. The by-conversation intercept could capture interactional effects that are not convergence per se, which influence both speakers’ performance in the same direction, rather than toward one another.
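As a rough illustration of the linear combination idea, the sketch below simulates productions as a weighted sum of the subject's and the interlocutor's baselines and recovers the weights by ordinary least squares. It is a simplification under stated assumptions: the random intercepts discussed above are omitted, the weights 0.7 and 0.1 and the noise level 0.5 are hypothetical values, and all names are our own.

```python
import random

def fit_two_predictors(y, x1, x2):
    """Ordinary least squares for y = b0 + b1*x1 + b2*x2 (normal equations)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * v for a, v in zip(c1, cy))
    s2y = sum(b * v for b, v in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

rng = random.Random(1)
n = 2000
subject_baseline = [rng.gauss(0, 1) for _ in range(n)]
interlocutor_baseline = [rng.gauss(0, 1) for _ in range(n)]

# Hypothetical generating process: strong self-consistency, weak convergence.
performance = [
    0.7 * s + 0.1 * i + rng.gauss(0, 0.5)
    for s, i in zip(subject_baseline, interlocutor_baseline)
]

intercept, consistency, convergence = fit_two_predictors(
    performance, subject_baseline, interlocutor_baseline
)
print(round(intercept, 2), round(consistency, 2), round(convergence, 2))
```

In this sketch the coefficient on the interlocutor's baseline plays the role of the convergence estimate, while the coefficient on the subject's own baseline absorbs self-consistency, so noise in the subject's productions is not forced into the convergence term.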

Additional fixed and random effects could be added to address particular research questions. For instance, adding a by-subject random slope for the interlocutor’s baseline could be used to model individual variation in convergence, yielding a formula of the form (5). The random slope would have non-zero variance if speakers differ in how much they rely on the interlocutor’s baseline, thereby capturing individual differences in convergence. The formula explicitly requests that the covariance between the random intercept and the random slope not be estimated, because the random intercept is expected to be close to zero and omitting the covariance helps the model converge. This is done by replacing the single pipe | with a double pipe || in the expression specifying the random effects structure per subject.

(5)

Similarly, if two convergence conditions are compared, a fixed effect per condition could be added, and the particular effect of the condition on convergence would be the coefficient for the interaction term between the condition and the interlocutor’s baseline, as in (6).

(6)

Crucially, measuring convergence would always involve a manipulation of the coefficient of the interlocutor’s baseline in the mixed effects model. The use of this method is currently rather limited, but has been established as effective for measuring convergence. Cohen Priva et al. (

Current accounts of convergence do not predict that subjects who start out with productions close to their interlocutors’ will diverge, and it is not a pattern that seems motivated by social or phonological factors. However, we show here that one of the major shortcomings of the DID approach is that it is prone to overestimate divergence when a subject’s baseline performance is close to the reference value.
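The source of this bias can be seen in a short worked case. It assumes that DID is computed as the starting distance minus the ending distance, and uses $S_b$ and $S_R$ as our labels for the subject's production before and after exposure and $R$ for the reference value.

```latex
\[
S_b = R
\quad\Longrightarrow\quad
\mathrm{DID} = |S_b - R| - |S_R - R| = -\,|S_R - R| \;\le\; 0 .
\]
```

Under these assumptions, a subject who starts exactly at the reference value can only score zero or below, so any ordinary variation in later productions is registered as divergence.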

To test the relationship between starting distance and convergence, we fit DID models that include the absolute difference between the speaker’s and interlocutor’s baseline as a predictor. We contrast these models with linear combination models, which do not create an artificial relationship between starting distance and convergence; without this artifact, no relationship is expected.

For the DID models, we extend the general model described in (3) by adding an additional predictor, the absolute difference between the subject’s and interlocutor’s baseline, yielding the formula in (7). The coefficient of this predictor will be positive if our claim holds, signifying that a smaller initial difference is more likely to result in lower DID values. We fit this model to each of the four measures.

(7) |

We also trained the linear combination equivalent model. We extend the formula in (4) by adding the interaction term between the interlocutor’s baseline (which measures convergence) and the absolute difference between the subject and the interlocutor. This yields the formula in (8), in which the interaction term on the second line is the variable of interest. In both of these models, each data point represents one conversation side.

(8) |

In the interest of deviating as little as possible from the DID formula, we specified the interaction term in (8) using : rather than *. The implication is that we do not estimate how the absolute difference in baselines affects the speakers’ performance, only how the absolute difference affects the speakers’ convergence, as in the DID models. An effect on the speakers’ performance, rather than their convergence, would be expected if, for example, speakers speak faster when their interlocutors’ performance is far from their own, regardless of whether their speech rate is fast or slow. There is no parallel measure in the DID models, and it is not expected to relate to convergence. In this model and in subsequent models in which we used the same approach, we verified that the convergence results in minimally different models that retain the term (in which * is used to introduce the interaction) do not differ in statistical significance from the convergence results in the reported models.

All models and data are available in the supplementary materials, and are summarized below.

In all of the four measures, the absolute distance between the subject’s baseline performance and the interlocutor’s baseline performance was positively correlated with higher DID values, as shown in (9). That is, the DID models were more likely to find convergence for subjects whose baselines were further from their interlocutors’ baselines. This is consistent with our predictions, and is not expected to hold in other methods of measuring convergence.

(9)

Study 1: Regression results for the absolute distance between the subject’s baseline performance and the interlocutor’s baseline performance across four measures, for DID models.

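| Measure | Estimate | SE | df | t | p |
| --- | --- | --- | --- | --- | --- |
| F0 median | 0.16 | 0.02 | 3600 | 8.2 | <0.0001 |
| F0 variance | 0.54 | 0.02 | 3368 | 29.9 | <0.0001 |
| Speech rate | 0.45 | 0.02 | 3604 | 29.1 | <0.0001 |
| uh:um ratio | 0.43 | 0.02 | 4071 | 25.8 | <0.0001 |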

The estimated intercepts were negative for all measures, as shown in (10). This indicates that for subjects whose baselines differed very little from their interlocutors’, the predicted DID value was negative, not just a smaller positive value, and would thus appear to be divergence.

(10)

Study 1: Regression results for the DID models’ intercepts, across four measures.

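| Measure | Estimate | SE | df | t | p |
| --- | --- | --- | --- | --- | --- |
| F0 median | –0.17 | 0.03 | 1430 | –6.3 | <0.0001 |
| F0 variance | –0.60 | 0.03 | 1117 | –22.7 | <0.0001 |
| Speech rate | –0.51 | 0.02 | 991 | –21.8 | <0.0001 |
| uh:um ratio | –0.48 | 0.03 | 1301 | –19.3 | <0.0001 |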

In contrast, the linear combination models found no significant effect for the interaction between absolute distance and the interlocutors’ baseline in any of the measures, as shown in (11). These results suggest that there is no special status for the initial distance between the interlocutor and the subject in influencing how much they actually converge, and the significant interactions found in the DID models were indeed only artifacts of how convergence was defined. Figure

A comparison between the coefficients of DID and linear combination models for the four measures in Study 1. Each point is the estimate for the measure in that model, the thick lines are one standard error in each direction, and the thin lines are two standard errors in each direction. The two types of models are distinguished by color and shape. A dashed line marks zero; the linear combination model coefficients are all close to zero, having no effect, while the DID model coefficients are much larger.

(11)

Study 1: Regression results for the interaction between the interlocutor’s baseline performance and the distance between the subject’s baseline and the interlocutor’s baseline across four measures, for linear combination models.

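| Measure | Estimate | SE | df | t | p |
| --- | --- | --- | --- | --- | --- |
| F0 median | –0.0061 | 0.007 | 2882 | –0.9 | 0.39 |
| F0 variance | –0.0108 | 0.013 | 1215 | –0.8 | 0.42 |
| Speech rate | 0.0018 | 0.008 | 780 | 0.2 | 0.82 |
| uh:um ratio | –0.0138 | 0.011 | 1988 | –1.2 | 0.21 |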

Another shortcoming of the DID approach is that it is likely to interpret regression to the mean as convergence, which the linear combination approach does not. As shown in this study, this effect makes speakers whose initial measurements are extreme likely to appear to converge, even though in many cases they are simply reverting to less extreme values.

The prediction that DID would be susceptible to regression to the mean relies on two components. First, regression to the mean predicts that extreme values are less likely to be repeated, meaning that if the measured baseline value for a subject is extreme, the subject’s actual performance is likely to be closer to the population mean. Second, given that interlocutors are more likely to be close to the mean in a unimodal distribution, initial extreme values for a subject are likely to be further away from the interlocutor’s baseline than the subject’s actual performance is. In a DID model, a shift to more typical productions would be interpreted as convergence. In contrast, a linear combination model measures convergence as the amount of variance in speakers’ behavior that is predicted by the interlocutors’ behavior; in these models, variance does not have to be attributed to the interlocutors’ baseline, which better captures the range of factors that cause speakers to vary.
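The two components of this prediction can be illustrated with a small simulation. It is a sketch under our own assumptions rather than an analysis from the studies: each subject has a fixed true value, each measurement adds independent noise with a standard deviation of 0.5, there is no convergence at all, and the reference is placed at the population mean of zero.

```python
import random

rng = random.Random(0)
NOISE_SD = 0.5      # assumed measurement noise
THRESHOLD = 1.5     # assumed cutoff for an "extreme" measured baseline
n = 20000

true_values = [rng.gauss(0, 1) for _ in range(n)]
first = [t + rng.gauss(0, NOISE_SD) for t in true_values]    # measured baseline
second = [t + rng.gauss(0, NOISE_SD) for t in true_values]   # later measurement

# Keep only subjects whose measured baseline is extreme.
extreme = [(f, s) for f, s in zip(first, second) if abs(f) > THRESHOLD]

mean_first = sum(abs(f) for f, _ in extreme) / len(extreme)
mean_second = sum(abs(s) for _, s in extreme) / len(extreme)

# DID against a reference at the population mean (0).
mean_did = sum(abs(f) - abs(s) for f, s in extreme) / len(extreme)

print(len(extreme), round(mean_first, 2), round(mean_second, 2), round(mean_did, 2))
```

Subjects selected for extreme measured baselines are, on average, less extreme when measured again, so their mean DID is positive even though nothing in the simulation moves them toward an interlocutor.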

The overestimation of convergence is more likely to be an issue if subjects’ measured baselines are noisy, and thus more likely to differ from their actual baselines. Measured baselines that are not representative are a particular risk when they are established based on a small set of items. If measurement of each subject’s baseline is based on a large amount of data, this bias can be reduced, but it will not be fully eliminated as long as speakers produce variation not driven by their interlocutors.

It is not clear whether a real relationship between convergence and distance between speakers should be expected. It is possible that subjects with more extreme baselines are more likely to converge than subjects with more central baselines, though this might depend on the measure or be directional; for example, fast speakers might slow down, but slow speakers might be more limited in how much they can speed up. Some studies have noted that there is more convergence between individuals who start with more distinct productions based on speaking different dialects, suggesting that this is because they have more room for convergence (e.g.,

To test this prediction, we fit DID models that include the absolute distance from the mean of the distribution as a predictor. We contrast these models with the parallel linear combination models, in which the distance from the mean is not expected to be a predictor.

For the DID models, we extend the models from Study 1 by adding an additional predictor, the absolute distance between the subject’s baseline and the mean of the distribution. Since all the variables were already standardized, the mean was zero, so the absolute distance between the subject’s baseline and the mean was equivalent to the absolute value of the subject’s baseline. This resulted in the formula in (12). The coefficient of this predictor will be positive if our claim holds, signifying that more extreme initial values are more likely to result in higher DID values. We fit this model for each of the four measures.

(12) |

We also trained the linear combination equivalent model for each measure. We extend the linear combination models in Study 1 by adding the interaction term between the interlocutor’s baseline (which measures convergence) and the absolute difference between the subject’s baseline and the mean. Since all the variables were already standardized, the mean was zero, so this difference was equivalent to the absolute value of the subject’s baseline. This resulted in the formula in (13), in which the interaction terms on the second and third lines are the variables of interest.

(13) |

In both sets of models, each data point represents one conversation side.

All models and data are available in the supplementary materials, and summarized below.

In three of the four measures (excluding F0 median), the absolute distance between the subject’s baseline performance and the mean of the distribution was positively associated with higher DID values, as shown in (14). This means that these models were more likely to find convergence for subjects whose initial values were more extreme, consistent with our predictions.

(14)

Study 2: Regression results for the absolute distance between the subject’s baseline performance and the mean of the distribution for DID models, across four measures.

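| Measure | Estimate | SE | df | t | p |
| --- | --- | --- | --- | --- | --- |
| F0 median | 0.034 | 0.04 | 3683 | 0.8 | 0.4 |
| F0 variance | 0.187 | 0.03 | 495 | 6.8 | <0.0001 |
| Speech rate | 0.147 | 0.02 | 4645 | 5.9 | <0.0001 |
| uh:um ratio | 0.119 | 0.03 | 4663 | 4.5 | <0.0001 |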

The distance between the subjects’ and interlocutors’ baselines (the focus of Study 1) was still positively correlated with higher DID values, as shown in (15), which suggests that these are two distinct effects. As before, the estimate for the intercepts was negative in all measures, signifying that DID was predicted to be negative for small differences in baselines (16).

(15)

Study 2: Regression results for the absolute distance between the subject’s baseline and the interlocutor’s baseline performance for DID models, across four measures, based on the revised formula in (12).

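| Measure | Estimate | SE | df | t | p |
| --- | --- | --- | --- | --- | --- |
| F0 median | 0.16 | 0.02 | 3562 | 7.8 | <0.0001 |
| F0 variance | 0.48 | 0.02 | 3068 | 23.9 | <0.0001 |
| Speech rate | 0.40 | 0.02 | 2844 | 22.3 | <0.0001 |
| uh:um ratio | 0.39 | 0.02 | 3584 | 21.0 | <0.0001 |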

(16)

Study 2: Regression results for the DID models’ intercepts, across four measures, based on the revised formula in (12).

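| Measure | Estimate | SE | df | t | p |
| --- | --- | --- | --- | --- | --- |
| F0 median | –0.20 | 0.04 | 2992 | –4.7 | <0.0001 |
| F0 variance | –0.68 | 0.03 | 452 | –23.7 | <0.0001 |
| Speech rate | –0.56 | 0.02 | 1557 | –22.6 | <0.0001 |
| uh:um ratio | –0.53 | 0.03 | 1996 | –19.5 | <0.0001 |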

There was no robust effect of distance from the mean as a predictor of convergence for F0 median, in contrast to the DID models for the other three measures. This lack of effect is likely a result of F0 median having two distinct modes, as shown in Figure

(17)

Study 2: Regression results for the absolute distance between the subject’s baseline performance and the nearest mode of the distribution for DID models, across four measures.

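| Measure | Estimate | SE | df | t | p |
| --- | --- | --- | --- | --- | --- |
| F0 median | 0.166 | 0.06 | 3674 | 2.6 | 0.00938 |
| F0 variance | 0.151 | 0.04 | 3660 | 3.9 | 0.00012 |
| Speech rate | 0.205 | 0.04 | 4720 | 5.1 | <0.0001 |
| uh:um ratio | 0.094 | 0.03 | 4667 | 3.2 | 0.00120 |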

A density plot of F0 values, averaged for each subject. The values are split by sex on the left panel, and collapsed on the right panel.

In contrast, the linear combination models found no significant effect for the interaction between absolute distance from the mean and the interlocutors’ baseline; that is, there was no interaction between convergence and subjects’ baseline distance from the mean, as shown in (18). This suggests that there is no actual effect of the distance between subjects’ baseline and the mean on convergence. As in Study 1, the initial distance between the subject and the interlocutor was not significant (19). Figure

A comparison between the coefficients of DID and linear combination models for the four measures in Study 2 (before the post-hoc adjustment). Each point is the estimate for the measure in that model, the thick lines are one standard error in each direction, and the thin lines are two standard errors in each direction. The two types of models are distinguished by color and shape. A dashed line marks zero; the linear combination model coefficients are all close to zero, having no effect, while the DID model coefficients are much larger.

(18)

Study 2: Regression results for the interaction between the absolute distance between the speaker’s baseline and the mean, and the interlocutor’s baseline performance, across four measures, for linear combination models.

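| Measure | Estimate | SE | df | t | p |
| --- | --- | --- | --- | --- | --- |
| F0 median | –0.0085 | 0.01 | 266 | –0.8 | 0.41 |
| F0 variance | 0.0078 | 0.02 | 371 | 0.4 | 0.72 |
| Speech rate | 0.0122 | 0.01 | 286 | 0.8 | 0.40 |
| uh:um ratio | 0.0113 | 0.02 | 375 | 0.6 | 0.52 |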

(19)

Study 2: Regression results for the interaction between the interlocutor’s baseline, and the absolute distance between the speaker’s baseline and the interlocutor’s baseline, across four measures, for linear combination models, based on the revised formula in (13).

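| Measure | Estimate | SE | df | t | p |
| --- | --- | --- | --- | --- | --- |
| F0 median | –0.0045 | 0.007 | 3312 | –0.6 | 0.54 |
| F0 variance | –0.0123 | 0.014 | 1333 | –0.9 | 0.38 |
| Speech rate | –0.0014 | 0.009 | 1118 | –0.2 | 0.88 |
| uh:um ratio | –0.0157 | 0.011 | 2093 | –1.4 | 0.17 |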

These results suggest that the DID approach is prone to two distinct types of artifacts: Extreme baseline values for a subject can result in the overestimation of convergence, and small initial difference between a subject and interlocutor can result in the overestimation of divergence. Study 3 shows how these effects can lead to the overestimation of individual differences in convergence among subjects.

The goal of this study is to show that it is likely that the effect of absolute initial differences between interlocutors (as shown in Study 1) and regression to the mean (as shown in Study 2) could inflate the appearance of individual differences in convergence, regardless of whether or not such differences exist in the underlying data. Using the same dataset we are examining here, Cohen Priva and Sanker (

Recent years have seen rising interest in identifying predictors of individual differences in convergence (e.g.,

The results of Study 1 and Study 2 suggest that the DID approach might inflate or artificially produce individual differences in convergence. Study 1 found that in every measure, subjects were more likely to appear as divergent if their baseline values were close to their interlocutors’ baseline values. Study 2 found that extreme subject baseline values relative to the mean of the distribution are likely to appear as convergent, and that these effects exist even when controlling for the findings of Study 1. These two effects can influence our estimation of individual differences in convergence as measured by DID. If the appearance of individual differences in convergence is largely an artifact of how convergence is measured, rather than reflecting actual behavior differences, this might explain why studies looking for individual tendencies in convergence across different characteristics have found no such tendencies (e.g.,

Subjects whose baseline values are close to the distribution’s mode are more likely than others to be close to their interlocutors’ baselines, if the interlocutors come from the same distribution. Therefore, they are more likely than others to have negative (divergent) DID values, as Study 1 shows. Subjects whose baseline performance is distant from the mode are more likely to appear convergent, as Study 2 shows. We therefore predict that distance from a mode of the distribution would affect the detection of individual differences. A correlation between distance from the mode and convergence is not expected when convergence is measured by other means, namely using linear combination models, and thus individual variation in distance from the mode should not produce apparent individual differences in convergence in such models.

In contrast to Study 1 and Study 2, the focus of the investigation here is the individual. For DID-based models, we estimate individual differences in convergence as the mean per-conversation DID values for each subject. For linear combination models, individual differences were measured as the per-subject random slope for the interlocutor’s baseline, as in (5). Distance from the mode was calculated as the absolute distance of the subject’s mean performance from the closest mode.

Since the model has no repeated values per subject, we used a simple linear regression to test for a possible correlation between distance from a mode of the distribution and individual differences in convergence. We repeated this analysis for each of the four measures, for both the DID models and the linear combination models.
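The DID side of this procedure can be sketched as follows. The rows, subject labels, and the single mode at zero are hypothetical values chosen for illustration, and the helper names are our own; the per-subject random slopes used for the linear combination models require a mixed-effects fit and are not reproduced here.

```python
from collections import defaultdict

def nearest_mode_distance(value, modes):
    """Absolute distance from `value` to the closest of the given modes."""
    return min(abs(value - m) for m in modes)

def ols_slope(x, y):
    """Slope of a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical rows: (subject, standardized production, DID), one per conversation side.
rows = [
    ("s1", -1.1, 0.30), ("s1", -0.9, 0.10),
    ("s2", 0.1, -0.40), ("s2", -0.1, -0.20),
    ("s3", 1.6, 0.50), ("s3", 1.4, 0.30),
]
MODES = [0.0]  # hypothetical: a unimodal, standardized measure

by_subject = defaultdict(list)
for subject, production, did_value in rows:
    by_subject[subject].append((production, did_value))

distances, mean_dids = [], []
for subject, sides in by_subject.items():
    mean_production = sum(p for p, _ in sides) / len(sides)
    distances.append(nearest_mode_distance(mean_production, MODES))
    mean_dids.append(sum(d for _, d in sides) / len(sides))

# One data point per subject, so a simple regression slope suffices.
slope = ols_slope(distances, mean_dids)
print(round(slope, 3))
```

For a bimodal measure, the same helper can be called with two modes, for example `nearest_mode_distance(0.8, [-1.0, 1.0])`.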

The results for the two models are summarized in (20). For DID models, there was a positive correlation between mean DID and mean performance for all measures; that is, the subject’s distance from the nearest mode was related to measured convergence. For the linear combination models, there was a much smaller relationship between per-individual slopes and mean performance, which did not consistently reach significance. Figure

The relationships between distance from a mode and DID values. The solid line shows the relationship between the variables using local polynomial regression (loess). The dashed line is the linear relationship. The X-axis is the absolute distance between the subject and the nearest mode. The Y-axis is the DID value. Each panel shows the relationship for a different measure. The results show a clear relationship between distance from a mode and high DID, though it is much weaker for F0 median than for other characteristics.

The relationships between distance from a mode and individual differences in convergence, as measured using a random slope for interlocutors’ baseline in a linear combination mixed effects model. The solid line shows the relationship between the variables using local polynomial regression (loess). The dashed line is the linear relationship. The X-axis is the absolute distance between the subject and the nearest mode. The Y-axis is the standardized per-subject convergence slope value. Each panel shows the relationship for a different measure. The results do not show a clear relationship between distance from a mode and convergence.

(20)

Comparison of the correlations between absolute distance from the nearest mode and the individual differences in the DID and mixed effects slope methods. In every measure, the relationship between the individual differences measure and absolute distance from the nearest mode was stronger for DID than for the linear combination method.

These results demonstrate major differences between the two methods of measuring convergence. The DID models produced a consistently significant correlation between the subject’s distance from the nearest mode and the convergence measured for that subject, across all measures. The random slopes measured in the linear combination models did not produce significant correlations between individual convergence and distance from a mode in most measures. The key difference between this method and DID is that it does not assign value to the initial distance between the subject and the interlocutor, and leaves room for the estimation of noise. This allowance for noise and decreased reliance on starting distance make the linear combination method better suited to measuring convergence for particular individuals, without inflating convergence for subjects whose baselines were far from their interlocutors’ and underestimating convergence for subjects with values close to a mode. The one measure in which the linear combination method found a statistically significant correlation between a subject’s convergence and baseline distance from the closest mode was speech rate, but as Figure

The results in Studies 1–3 demonstrate clear differences in the extent to which DID and linear combination models find individual differences in convergence. We argue that these apparent effects found by DID models are purely mathematical artifacts of how convergence is measured, rather than reflecting actual relationships between convergence and distance from the population mean and from the interlocutor. However, these measurements use existing studies of convergence (

The parameters were defined to mirror a typical design of a convergence study on a single phonetic measure, in which each participant participated in a single interaction, with testing

Data was generated with 50 interacting pairs of speakers. The true baseline of each participant was sampled from a normal distribution of the population, with a mean of 0 and a standard deviation of 1. The mean of zero parallels the normalized data used in the preceding studies.

The

We calculated DID values for each participant, and fit a linear regression model similar to the mixed effects models for DID in Studies 1–3. For this model, we tested whether the starting distance between the speaker and the mode of the distribution (assumed to be zero) or the starting distance between the speaker and interlocutor would be correlated with DID values. The formula is provided in (21), in which

(21) |

We also fit the equivalent linear combination model, following the modeling approach for the linear combination model in Studies 1–3. In this model, the convergence parameters are the interaction terms between the interlocutors’ baseline and both the absolute distance of the speaker from the mode and the absolute distance between the speaker and their interlocutor (22), in which

(22) |

We repeated this sampling procedure ten thousand times, and provide summary results for the models below.
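A reduced version of such a sampling procedure can be sketched as below. It is not the code from the supplementary materials, and several choices are our own assumptions: measurement noise with a standard deviation of 0.5, no convergence term at all, DID computed against the interlocutor's measured baseline, a plain correlation instead of a regression model, and 500 samples rather than ten thousand to keep the run short.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate_once(rng, n_pairs=50, noise_sd=0.5):
    """Correlation between starting distance and DID in one simulated study."""
    start_dist, did = [], []
    for _ in range(n_pairs):
        true_a, true_b = rng.gauss(0, 1), rng.gauss(0, 1)
        base_a = true_a + rng.gauss(0, noise_sd)
        base_b = true_b + rng.gauss(0, noise_sd)
        # Later productions: true value plus fresh noise, with no convergence.
        late_a = true_a + rng.gauss(0, noise_sd)
        late_b = true_b + rng.gauss(0, noise_sd)
        for base_self, late_self, base_other in (
            (base_a, late_a, base_b),
            (base_b, late_b, base_a),
        ):
            d0 = abs(base_self - base_other)
            start_dist.append(d0)
            did.append(d0 - abs(late_self - base_other))
    return pearson(start_dist, did)

rng = random.Random(42)
correlations = [simulate_once(rng) for _ in range(500)]
mean_r = sum(correlations) / len(correlations)
share_positive = sum(r > 0 for r in correlations) / len(correlations)
print(round(mean_r, 2), round(share_positive, 2))
```

Because no speaker in this sketch is influenced by their partner, any positive association between starting distance and DID arises from the way DID is computed.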

For the ten thousand samples, the median correlation between each subject’s

In DID models, the absolute distance of the subject from the mean resulted in statistically significant (

In the linear combination models, in contrast, the absolute distance of the subject from the mean resulted in statistically significant positive coefficients 2.4% of the time, and the coefficient for the absolute distance between the speakers and their interlocutors was statistically significant and positive 2.5% of the time. Both of the results for the linear combination models are within what would be expected by chance. Figure

Density plots of

The results clearly indicate that even in the complete absence of underlying relationships between convergence and the participant’s distance from the mode or the starting distance between the participant and the interlocutor, such effects often spuriously emerge in DID models. When studies then look for individual differences in convergence as measured this way, these effects are likely to be interpreted as evidence for differences in individual tendencies in convergence, even though the true variation across individuals is just in their measured baselines. Because distance from the mode correlates with convergence, speakers whose mean performance is exceptional will seem to converge more than other speakers do, while speakers whose mean performance is close to that of their interlocutors will seem less convergent or even divergent.
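For comparison, the linear combination idea can be sketched in a deliberately simplified form. This is an illustration, not the model used in the paper, which includes interaction terms and follows the mixed-effects approach of Studies 1–3. The sketch fits an ordinary least squares regression of each speaker’s later measurement on their own baseline and their interlocutor’s baseline, and treats the coefficient on the interlocutor’s baseline as the convergence estimate. The function names and the `convergence` parameter are hypothetical.

```python
import random
import statistics


def simulate_speakers(n_pairs, convergence=0.0, noise_sd=0.5, seed=1):
    """Rows of (own baseline, interlocutor baseline, later measurement)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_pairs):
        true = [rng.gauss(0, 1), rng.gauss(0, 1)]
        base = [t + rng.gauss(0, noise_sd) for t in true]
        for s, i in ((0, 1), (1, 0)):
            # With convergence > 0, the later target shifts toward the
            # interlocutor's true value; with 0 it stays at the speaker's own.
            target = (1 - convergence) * true[s] + convergence * true[i]
            rows.append((base[s], base[i], target + rng.gauss(0, noise_sd)))
    return rows


def linear_combination_fit(rows):
    """OLS fit of later ~ own baseline + interlocutor baseline.

    Returns (coefficient for own baseline, coefficient for interlocutor
    baseline), solved from the normal equations on centered variables.
    """
    own, other, late = zip(*rows)
    m_own, m_other, m_late = (statistics.fmean(own), statistics.fmean(other),
                              statistics.fmean(late))
    x1 = [v - m_own for v in own]
    x2 = [v - m_other for v in other]
    y = [v - m_late for v in late]
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det
```

With `convergence=0.0`, the interlocutor coefficient should stay near zero regardless of how far apart the speakers started. With a positive value, such as `convergence=0.3`, it should become clearly positive.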

We argue that the spurious effects in the DID model are due to mishandling the noise that exists between measurements of an individual. This predicts that spurious effects would be more likely in noisier measurements than in less noisy ones. We therefore repeated the simulation presented above with varying amounts of noise, with the noise standard deviation sampled from a uniform distribution between 0.1 and 2 (in the simulation above, that parameter was fixed at 0.5, as discussed above). Indeed, less noise translates to lower

The relationship between speaker self-consistency, as measured by noise SD, and the

As with the rest of the analysis used in this paper, the code for the sampling procedures is available in the supplementary materials.
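The noise manipulation described above can be approximated with a short Python sketch. It is a simplification rather than the authors’ procedure: it evaluates a fixed grid of noise standard deviations instead of sampling them from a uniform distribution, and it reports a correlation instead of fitting the regression models. It also assumes that DID is the starting distance minus the later distance from the interlocutor’s measured baseline. All names are hypothetical.

```python
import random
import statistics


def did_distance_correlation(rng, noise_sd, n_pairs=50):
    """Correlation of DID with starting distance in one no-convergence sample."""
    did, start_dist = [], []
    for _ in range(n_pairs):
        true = [rng.gauss(0, 1), rng.gauss(0, 1)]
        base = [t + rng.gauss(0, noise_sd) for t in true]
        late = [t + rng.gauss(0, noise_sd) for t in true]
        for s, i in ((0, 1), (1, 0)):
            start = abs(base[s] - base[i])
            did.append(start - abs(late[s] - base[i]))
            start_dist.append(start)
    # Pearson correlation written out to avoid external dependencies.
    mx, my = statistics.fmean(start_dist), statistics.fmean(did)
    sxy = sum((x - mx) * (y - my) for x, y in zip(start_dist, did))
    sxx = sum((x - mx) ** 2 for x in start_dist)
    syy = sum((y - my) ** 2 for y in did)
    return sxy / (sxx * syy) ** 0.5


def median_correlation_by_noise(noise_levels, n_samples=500, seed=1):
    """Median spurious correlation for each noise standard deviation."""
    rng = random.Random(seed)
    result = {}
    for noise_sd in noise_levels:
        rs = [did_distance_correlation(rng, noise_sd) for _ in range(n_samples)]
        result[noise_sd] = statistics.median(rs)
    return result
```

Calling `median_correlation_by_noise([0.1, 0.5, 2.0])` should show the spurious correlation growing with the noise level, in line with the pattern reported above.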

Our results demonstrate that DID is not a suitable measure of convergence because it interprets regression to the mean as convergence and underestimates convergence or even finds divergence when the subject’s baseline performance is close to the reference value of the interlocutor or model talker. These biases pose a particular problem in estimation of individual differences in convergence. Our proposed alternative, linear combination, is not subject to any of these issues and thus provides more reliable estimates of the convergence exhibited by each individual participant.

Measuring convergence as change in distance can create biases due to the starting distance between the speakers. The natural variability of speakers can make subjects appear to diverge from interlocutors whose baselines are close to their own, while variation at greater starting distances is more likely to appear convergent. In Study 1, we demonstrate that the subject’s distance from the interlocutor is a predictor of convergence measured within DID models; greater baseline distances from the interlocutor produce higher measurements of convergence, and small baseline distances can create the appearance of divergence. In contrast, there is no relationship within linear combination models. The effects are parallel across all four linguistic characteristics tested (F0 median, F0 variability, speech rate, and uh:um ratio). We argue that the apparent relationship within the DID models is a purely mathematical artifact of how convergence is measured, not based on any actual behavioral pattern. In Study 4, we demonstrate that this apparent relationship within DID models also arises in simulated datasets which have been defined to lack such a relationship.
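The arithmetic behind the close-baseline bias can be written out as a short worked example. The notation is introduced here only for illustration and is not taken from the paper: b_s and b_i are the measured baselines of the subject and the interlocutor, and ε is the subject’s change between the two measurements. For simplicity, ε is assumed to be symmetric around zero and independent of the baselines, with no true convergence.

```latex
\begin{align*}
\mathrm{DID} &= |b_s - b_i| - |b_s + \varepsilon - b_i| \\
\text{if } b_s = b_i:\quad \mathrm{DID} &= -|\varepsilon| \le 0 \\
\text{if } |b_s - b_i| \ge |\varepsilon|:\quad \mathrm{DID} &= -\operatorname{sgn}(b_s - b_i)\,\varepsilon
\end{align*}
```

In the first case any change at all registers as divergence, because there is no way to get closer than a distance of zero. In the second case DID is as likely to be positive as negative, so under these assumptions the downward bias arises only at small starting distances.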

Consistent with a lack of actual relationship between starting distance and convergence, previous work does not clearly predict behavioral differences based on starting distance. Some studies have found that there is more convergence between individuals whose starting distance is greater, when the distance is across dialects (e.g.,

In contrast, Kim et al. (

While some work has addressed potential effects of the prototypicality of the model talker or interlocutor in eliciting convergence, the prototypicality of each subject has been largely overlooked. Looking at convergence patterns produced by L2 speakers, Lewandowski (

Measuring convergence depends on establishing clear baselines for the subjects. When the baselines could be extreme due to noise, returning to true baselines is likely to appear convergent, as long as the subject and the interlocutor come from the same population. In Study 2, we demonstrate that greater distance of the subject from the median or the nearest mode also produces higher measured convergence in DID models. In bimodal distributions, as for F0, the distance from the median is not a significant predictor, but distance from the nearest mode is consistently a predictor across the four measures. In contrast, there was no relationship within the linear combination models. The effect of distance from the population mode seems to be distinct from the effect of distance from the interlocutor, because both are significant within the same models.
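Regression to the mean in noisy baselines can be illustrated with a small Python sketch. It assumes, as an illustration rather than a description of the study data, that true baselines are standard normal and that each measurement adds independent normal noise with a standard deviation of 0.5. Under those assumptions, the expected later measurement is the measured baseline multiplied by 1 / (1 + 0.5²) = 0.8. The function name and the cutoff for "extreme" baselines are hypothetical.

```python
import random
import statistics


def regression_to_mean_demo(n_speakers=100_000, noise_sd=0.5, cutoff=1.5, seed=1):
    """Show that extreme measured baselines are followed by less extreme values.

    Returns (slope of later on baseline, mean |baseline| among extreme
    speakers, mean |later| among the same speakers). No speaker changes
    their true value between the two measurements.
    """
    rng = random.Random(seed)
    base, late = [], []
    for _ in range(n_speakers):
        true = rng.gauss(0, 1)
        base.append(true + rng.gauss(0, noise_sd))
        late.append(true + rng.gauss(0, noise_sd))

    # Least-squares slope of the later measurement on the measured baseline.
    mb, ml = statistics.fmean(base), statistics.fmean(late)
    slope = (sum((b - mb) * (l - ml) for b, l in zip(base, late))
             / sum((b - mb) ** 2 for b in base))

    # Speakers whose measured baseline is far from the population mode (0).
    extreme = [(b, l) for b, l in zip(base, late) if abs(b) > cutoff]
    mean_abs_base = statistics.fmean([abs(b) for b, _ in extreme])
    mean_abs_late = statistics.fmean([abs(l) for _, l in extreme])
    return slope, mean_abs_base, mean_abs_late
```

The speakers selected for extreme baselines end up closer to zero on average at the second measurement. Because their interlocutors are also drawn from a population centered on zero, a distance-based measure would tend to register that shift as convergence.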

If starting distance from the interlocutor or the population mode produces biases in the measurement of convergence for each individual, this can produce apparent individual differences in convergence, even though the actual differences are simply in the baselines. In Study 3, we demonstrate that DID models consistently find significant individual differences, across all four measures. In contrast, there was no relationship in three of the linear combination models; the significance of the apparent weak relationship within speech rate disappeared when correcting for multiple comparisons. This result calls into question work that looks for individual tendencies in convergence. Within phonetic convergence, many studies of individual differences use DID, and they often do find individual differences (e.g.,

Shadowing studies have also found differences in convergence across model talkers (e.g.,

Some of the measurement biases produced by DID can be reduced by measuring distance or change in distance between subjects and their actual interlocutors or model talkers as compared to distance between subjects and speakers or model talkers they did not interact with (e.g.,

Biases due to starting distance are also likely to create different patterns based on the characteristic in which convergence is measured; different measures have different distributions by speaker, so the biases could create more issues for some measures than others. As demonstrated in Study 4, greater variability increases convergence found by DID, even in data defined to lack convergence. If apparent individual differences in convergence are due to how convergence has been measured, rather than reflecting actual individual tendencies in convergent behavior, this could explain the lack of evidence for individual tendencies in convergence across measures (e.g.,

Effects of starting distance from the interlocutor and distance to a mode on measurement of convergence could also create the appearance of different convergent behaviors across subgroups of the population, particularly if the variability of the particular measure is different within the two groups. Many studies have examined gender as a predictor of convergence; there are sometimes significant differences in convergence based on the gender of the subject or the model talker or interlocutor, or an interaction between the genders of the two speakers, either in overall convergence or in interactions with other factors (e.g.,

It is also possible that effects of the subject’s starting distance relative to the interlocutor or to the population mean could create apparent effects across words. Studies have often found more convergence in lower frequency words than in higher frequency words (e.g.,

In addition to DID measurements of convergence, many studies use AXB tasks to obtain holistic judgments of similarity. Given that AXB is based on testing change in perceived distance, it is possible that it would be subject to some of the same issues as DID. However, as long as listeners in the AXB task are making decisions based on a range of features, as is demonstrated by Pardo et al. (

Another possible issue in the measurement of convergence is normalization by speakers, which might reduce the variation in starting distances between subjects and interlocutors or model talkers. However, it is unclear whether normalization is a better or worse representation of how listeners perceive phonetic input that varies across speakers than raw acoustic measurements. Some studies of convergence measure speakers’ production as raw values, while others normalize by speaker. F0 is more often treated as a raw value (e.g.,

We demonstrate several issues with the commonly used difference-in-difference (DID) method for measuring convergence as change in absolute distance between a subject and an interlocutor or model talker. Close baselines for the subject and interlocutor produce underestimation of convergence or apparent divergence, and greater distance from the mode(s) of the population produces overestimation of convergence.

These biases in measurement of convergence can produce the appearance of individual differences in convergence, which can have consequences in motivating directions of future work in individual variation and consistency. Because individual variation in production varies by measure, these biases due to starting values can also make the measurement of convergence more sensitive to the particular characteristic examined.

The alternative method that we propose for measuring convergence, using linear combination models, does not exhibit the same biases, and thus provides a more reliable measure of convergence, particularly for comparing that convergence across individuals and across speech characteristics.

The additional file for this article can be found as follows:

A zip file containing an Rmarkdown file that presents all the code used to create the models in this paper (

Some work uses a similar model, but includes a coefficient only for the reference value of the model talker or interlocutor, and model _{b}

In lme4, the expression

See also the supplementary materials for this paper, under Study 3.

The authors would like to thank Bodo Winter and an anonymous reviewer, as well as Elinor Amit, Scott AnderBois, Roman Feiman, and Jim Morgan, for their helpful feedback on this paper.

The authors have no competing interests to declare.

Uriel Cohen Priva and Chelsea Sanker contributed equally to this manuscript.