Data

Let’s first load the main data frame, which contains information about the headwords, the speaker, the vowel environment, and the acoustic features of nasality obtained from the Nasality Automeasure Praat script created by Will Styler:

data <- readRDS("Arabana_data.Rda")

head(data)
##                               filename item speaker headword    gloss V1_env
## 1 anthunha_pidla_my_name_Maltya_awarda    1  laurie anthunha my, mine    #_N
## 2 anthunha_pidla_my_name_Maltya_awarda    1  laurie anthunha my, mine    #_N
## 3 anthunha_pidla_my_name_Maltya_awarda    1  laurie anthunha my, mine    #_N
## 4 anthunha_pidla_my_name_Maltya_awarda    1  laurie anthunha my, mine    #_N
## 5 anthunha_pidla_my_name_Maltya_awarda    1  laurie anthunha my, mine    #_N
## 6 anthunha_pidla_my_name_Maltya_awarda    1  laurie anthunha my, mine    #_N
##   timepoint point_vwlpct point_time freq_f1 freq_f2 freq_f3 amp_f1 amp_f2
## 1         1            0   0.239728     861    1584    2619  39.98  38.71
## 2         2           10   0.251218     861    1584    2619  39.98  38.71
## 3         3           20   0.262708     560    1563    1982  35.70  40.18
## 4         4           30   0.274198     861    1561    2113  36.99  39.08
## 5         5           40   0.285688     581    1472    2057  36.89  38.04
## 6         6           50   0.297178     861    1617    1620  38.41  43.59
##   amp_f3 width_f1 width_f2 width_f3 amp_p0   a1p0 a1p0_compensated p0prominence
## 1   9.96      223       58      465  40.45  -0.47          1.57176         2.72
## 2   9.96      223       58      465  40.45  -0.47          1.57176         2.72
## 3  21.13      468      141     1403  43.47  -7.77         -5.72529         4.08
## 4  27.06      462      208     1791  44.66  -7.67         -5.62947         4.54
## 5  24.00      367      254      769  46.90 -10.01         -7.95671         6.32
## 6  43.59      640       84     2726  47.77  -9.36         -7.32675         7.88
##   amp_p1  a1p1 a1p1_compensated      a3p0 h1h2
## 1  39.98  0.00          4.93580 -30.48983 2.71
## 2  39.98  0.00          4.93580 -30.48983 2.71
## 3  34.73  0.97          3.43353 -22.34065 4.09
## 4  37.55 -0.56          2.15151 -17.60132 4.55
## 5  39.53 -2.64         -0.00981 -22.89949 6.32
## 6  38.41  0.00          2.53667  -4.18175 7.89

For each item, there are 11 separate measurements (i.e. 11 equally spaced time points throughout the vowel interval). To obtain a count of the total number of utterances produced by each speaker, we therefore divide the speaker's total number of observations by 11.
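
As a quick sanity check on this assumption (a minimal sketch, not part of the original pipeline), we can confirm that every item contributes exactly 11 rows:

all(table(data$item) == 11)

Let's first look at the headwords produced by Laurie, along with their respective English glosses and total counts: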

library(dplyr)

data %>% 
  filter(speaker=="laurie") %>% 
  group_by(headword, gloss) %>% 
  summarize(count=n()/11) %>%
  print(n=Inf)
## # A tibble: 63 × 3
## # Groups:   headword [63]
##    headword     gloss              count
##    <chr>        <chr>              <dbl>
##  1 akuru        over there             2
##  2 ampurdu      central, middle        1
##  3 ananthara    1pl.Nom                2
##  4 anpa         2sg.Abs                3
##  5 antha        1sg.Abs               18
##  6 anthunha     my, mine              22
##  7 anti         soon                   1
##  8 antu         2sg.Erg                2
##  9 apira        white gum              7
## 10 apityi       father                 1
## 11 apurlu       apple                  2
## 12 arabana      language name          3
## 13 athu         1sg.Erg               17
## 14 awarda       that                   9
## 15 kadlhu       liver                  1
## 16 kadlhumpa    lungs                  1
## 17 kadliti      a personal name        1
## 18 kadnha       stone                  6
## 19 kalka        evening                1
## 20 kalta        lizard sp              2
## 21 kangi        too much               2
## 22 kanta-lhuku  pour-Purp              1
## 23 kantha       broom                  1
## 24 kantimalyu   gum, resin             1
## 25 kantunya     wallaby sp             1
## 26 kantyara     grass                  9
## 27 karla        creek                  2
## 28 madla        dog                    9
## 29 madlanti     bad                    1
## 30 madli        cold                   4
## 31 maka         fire                   3
## 32 malaru       however                1
## 33 maltya       not                    1
## 34 manga-rnda   be.ashamed-Present     3
## 35 mangu        forehead               1
## 36 mani         to get                 4
## 37 mankarra     girl                   3
## 38 manta-li     tie-Habitual           1
## 39 manthara     wattle sp              2
## 40 manungka-rda remember-Pres          2
## 41 marka        to crawl               1
## 42 marna        mouth                  3
## 43 pangki       ribs                   1
## 44 panki-rda    happy-Pres             1
## 45 panpa-rda    taste-Present          1
## 46 panti-li     fight-Habitual         2
## 47 pantya       knee                   2
## 48 parnda       big                   15
## 49 thadla       frightened             1
## 50 thanthi      grandchild             4
## 51 wabma        carpet snake           5
## 52 wadna        yamstick               3
## 53 wadni        corroboree             1
## 54 wakarla      crow                   1
## 55 wangka       word                   1
## 56 wanka        to rise                1
## 57 wanpa        to distrust            1
## 58 wanta        to fly                 1
## 59 wanti        corkwood               2
## 60 wathili      proper, real           2
## 61 yalka        plant sp               5
## 62 yangkathara  thristy                2
## 63 yanhi        to speak               1

And now we’ll do the same for Sydney, who has much more data available:

data %>% 
  filter(speaker=="syd") %>% 
  group_by(headword, gloss) %>% 
  summarize(count=n()/11) %>%
  print(n=Inf)
## `summarise()` has grouped output by 'headword'. You can override using the
## `.groups` argument.
## # A tibble: 67 × 3
## # Groups:   headword [67]
##    headword          gloss                     count
##    <chr>             <chr>                     <dbl>
##  1 alantha           1dl.Nom.SM                   66
##  2 alhakiya          1dl.Nom.DM                   65
##  3 amanyi            grandmother                  31
##  4 anari             this way                     66
##  5 angka             alive                         6
##  6 anhaku            Dont know                    77
##  7 anthunha          mine                         40
##  8 arla-thi-rnda     appear-Pres                   5
##  9 kadlara           cloud                         7
## 10 kadnanundi        scorpion                      7
## 11 kadnha            stone                        65
## 12 kadnha-pardi      grasshopper sp                7
## 13 kadnhaardi        small stones                  6
## 14 kala-rrantja      lizard sp                     5
## 15 kalka             choose                        6
## 16 kalta             lizard sp                     5
## 17 kangi             persistent                   38
## 18 kantyara          grass                        10
## 19 karla             creek                         6
## 20 karlatharra       bush turkey                   7
## 21 kata              louse                        65
## 22 katharungka       white cockatoo               74
## 23 madla             bad, dog                     68
## 24 madla-yari        Sturts desert pea             6
## 25 madlanti          bad                           6
## 26 maka              fire                          7
## 27 malka-malka       stripey                       6
## 28 malya             gypsum                        7
## 29 mama-rnda         grab-Pres                    74
## 30 manarni           perhaps, maybe, meanwhile    33
## 31 mankarra          girl                          6
## 32 manuputu          ignorant                     43
## 33 mapaa-rnda        sit.together-Present          6
## 34 marna             mouth                         7
## 35 marna-kathu-kathu whispery                      6
## 36 mathapurda        old man                       7
## 37 ngadlha           cheek                        60
## 38 paka-rnda         dig-Pres                      8
## 39 palkali           snatch                       12
## 40 palkalya          leaves                        5
## 41 palyama-rnda      carry.on.shoulder-Pres        7
## 42 pangkara          reed                          4
## 43 pinya             vengeance party               2
## 44 thadlarra-ma-rnda fear-Caus-Pres                7
## 45 thalka            bilby                         6
## 46 wabma             carpet snake                  8
## 47 wabmarri          plain                        72
## 48 wadlampu          hungry                        7
## 49 wadna             yamstick                     72
## 50 wadna-rnda        take.off-Present              6
## 51 wadnakani         carpet snake                  6
## 52 wakarla           crow                          4
## 53 wakarra           back.of.neck                  5
## 54 walpathi-rda      hunt-Present                  7
## 55 wamparla          possum                        7
## 56 wangka parda-rnda answer-Pres                   5
## 57 wangka-matha-nta  speak-together-Reflexive      5
## 58 wangka-tyimpa-rda converse-Pres                 6
## 59 wanpa-rda         carry-Pres                    6
## 60 wanta-rda         fly-Present                   7
## 61 wapa-rnda         seek-Pres                    83
## 62 yakarra           teeth                         6
## 63 yalka             plant sp                      4
## 64 yampa             stranger                      5
## 65 yangka            hollow                        6
## 66 yangkatharra      thirsty                      12
## 67 yantakarra        west                          6

Additional acoustic measures

In addition to the 18 acoustic features of nasality from the Nasality Automeasure script, we add the first four spectral moments, a measure of nasal murmur based on a spectral amplitude ratio, and 14 MFCCs (mel-frequency cepstral coefficients). We are not able to make the original audio files publicly available due to privacy concerns, but we provide here the code used to create the measurements, along with the derived data frame in the supplementary materials data repository:

# names of MFCCs 1-14
mfccs <- paste0("mfcc",1:14)

# prepare data frame
alldata <- c()

# loop through all files
for (file in unique(data$filename)) {
  
  # get the file data
  fdat <- data[data$filename==file,]
  
  # read the corresponding audio file
  audio <- tuneR::readWave(paste0('audio/',file,'.wav'))
  
  # total duration (in seconds)
  totaldur <- length(audio@left)/audio@samp.rate
  
  # extract audio and de-mean to remove DC offset
  snd <- audio@left - mean(audio@left)
  
  # audio pre-emphasis: y[n] = x[n] - 0.97 * x[n-1]
  # (vectorized so that each sample is computed from the original
  # preceding sample rather than an already-filtered one)
  snd <- c(snd[1], snd[-1] - 0.97*snd[-length(snd)])
  
  # replace the wave object audio samples
  audio@left <- snd
  
  # calculate MFCCs
  melcs <- tuneR::melfcc(audio, 
                         sr = audio@samp.rate, 
                         numcep = length(mfccs), 
                         wintime = 0.01, 
                         hoptime = 0.005)
  
  # get the actual time step (may be slightly different than "hoptime")
  timestep <- totaldur/nrow(melcs)
  
  # get the MFCCs samples nearest to the time points
  mfsamps <- round(fdat$point_time/timestep)
  
  # add the MFCCs to the file data
  fdat[,mfccs] <- melcs[mfsamps,]
  
  # create spectrogram
  spec <- signal::specgram(x = audio@left,
                           n = 1024,
                           Fs = audio@samp.rate,
                           window = 256,
                           overlap = 128
  )
  
  # get spectra
  P <- abs(spec$S)
  
  # convert to dB
  P <- 20*log10(P)
  
  # get the spectral time step
  timestep <- diff(spec$t[1:2])
  
  # get the spectral samples nearest to the time points
  specsamps <- round(fdat$point_time/timestep)
  
  # get the first four spectral moments at each selected time point
  # (indexing the spectra by the sampled frame numbers in specsamps)
  moments <- c()
  for (samp in specsamps) {
    moments <- rbind(moments, emuR::moments(P[,samp]))
  }
  colnames(moments) <- c("COG", "variance", "skew", "kurtosis")
  
  # add the moments to the file data
  fdat[,colnames(moments)] <- moments
  
  # nasal murmur (low/high ratio, 0-320 Hz : 320-5360 Hz) bands
  thresh1 <- which.min(abs(spec$f-320))
  thresh2 <- which.min(abs(spec$f-5360))
  
  # get the spectral amplitude means within the two frequency bands
  low   <- colMeans(P[1:thresh1,specsamps])
  high  <- colMeans(P[thresh1:thresh2,specsamps])
  
  # calculate the murmur ratio and add to the file data
  fdat$murmur <- low/high
  
  # add the file data to the combined data frame
  alldata <- rbind.data.frame(alldata,fdat)
}

saveRDS(alldata,"alldata.Rda")

And now we load the derived data frame, in order to continue with a replicable pipeline:

alldata <- readRDS("alldata.Rda")

Let’s take a quick look at the data to make sure everything looks correct:

summary(alldata)
##    filename              item        speaker            headword        
##  Length:17743       Min.   :   1   Length:17743       Length:17743      
##  Class :character   1st Qu.: 404   Class :character   Class :character  
##  Mode  :character   Median : 807   Mode  :character   Mode  :character  
##                     Mean   : 807                                        
##                     3rd Qu.:1210                                        
##                     Max.   :1613                                        
##                                                                         
##     gloss              V1_env            timepoint   point_vwlpct
##  Length:17743       Length:17743       Min.   : 1   Min.   :  0  
##  Class :character   Class :character   1st Qu.: 3   1st Qu.: 20  
##  Mode  :character   Mode  :character   Median : 6   Median : 50  
##                                        Mean   : 6   Mean   : 50  
##                                        3rd Qu.: 9   3rd Qu.: 80  
##                                        Max.   :11   Max.   :100  
##                                                                  
##    point_time        freq_f1          freq_f2        freq_f3    
##  Min.   : 0.040   Min.   :   0.0   Min.   : 432   Min.   :1222  
##  1st Qu.: 1.581   1st Qu.: 422.0   1st Qu.:1275   1st Qu.:2275  
##  Median : 3.606   Median : 562.0   Median :1440   Median :2407  
##  Mean   : 3.759   Mean   : 553.9   Mean   :1452   Mean   :2449  
##  3rd Qu.: 5.562   3rd Qu.: 656.0   3rd Qu.:1576   3rd Qu.:2553  
##  Max.   :16.537   Max.   :2320.0   Max.   :3830   Max.   :5231  
##                                                                 
##      amp_f1           amp_f2           amp_f3          width_f1     
##  Min.   :-48.66   Min.   :-48.17   Min.   :-51.10   Min.   :   3.0  
##  1st Qu.: 31.27   1st Qu.: 21.92   1st Qu.:  9.42   1st Qu.: 137.0  
##  Median : 36.90   Median : 28.67   Median : 16.40   Median : 249.0  
##  Mean   : 36.22   Mean   : 27.84   Mean   : 16.24   Mean   : 342.3  
##  3rd Qu.: 42.60   3rd Qu.: 35.01   3rd Qu.: 23.37   3rd Qu.: 438.0  
##  Max.   : 62.00   Max.   : 56.13   Max.   : 48.44   Max.   :6051.0  
##                                                                     
##     width_f2       width_f3          amp_p0            a1p0        
##  Min.   :   8   Min.   :   9.0   Min.   :-42.47   Min.   :-44.940  
##  1st Qu.: 118   1st Qu.: 208.0   1st Qu.: 36.57   1st Qu.: -7.700  
##  Median : 196   Median : 360.5   Median : 39.96   Median : -2.280  
##  Mean   : 303   Mean   : 496.3   Mean   : 39.68   Mean   : -3.467  
##  3rd Qu.: 343   3rd Qu.: 617.0   3rd Qu.: 43.47   3rd Qu.:  0.500  
##  Max.   :7932   Max.   :8519.0   Max.   : 59.97   Max.   : 46.730  
##                 NA's   :31                                         
##  a1p0_compensated    p0prominence         amp_p1            a1p1       
##  Min.   :-42.9355   Min.   :-19.570   Min.   :-52.93   Min.   :-38.09  
##  1st Qu.: -5.6284   1st Qu.:  3.605   1st Qu.: 15.65   1st Qu.:  7.54  
##  Median : -0.1619   Median :  7.630   Median : 22.09   Median : 13.66  
##  Mean   : -1.3417   Mean   :  8.106   Mean   : 22.38   Mean   : 13.84  
##  3rd Qu.:  2.8873   3rd Qu.: 11.980   3rd Qu.: 29.34   3rd Qu.: 19.70  
##  Max.   : 48.8675   Max.   : 43.290   Max.   : 55.58   Max.   : 53.83  
##                                                                        
##  a1p1_compensated      a3p0             h1h2            mfcc1        
##  Min.   :-35.55   Min.   :-61.97   Min.   :-44.15   Min.   :-110.10  
##  1st Qu.: 10.70   1st Qu.:-29.32   1st Qu.:  1.80   1st Qu.:  50.87  
##  Median : 16.41   Median :-23.41   Median :  6.71   Median :  56.58  
##  Mean   : 16.62   Mean   :-23.45   Mean   :  6.51   Mean   :  55.49  
##  3rd Qu.: 22.06   3rd Qu.:-17.38   3rd Qu.: 11.71   3rd Qu.:  62.20  
##  Max.   : 64.72   Max.   : 29.44   Max.   : 38.76   Max.   :  88.93  
##                                                                      
##      mfcc2             mfcc3             mfcc4              mfcc5         
##  Min.   :-13.390   Min.   :-20.677   Min.   :-24.3642   Min.   :-21.0224  
##  1st Qu.:  1.158   1st Qu.:  3.128   1st Qu.: -6.1410   1st Qu.: -7.8112  
##  Median :  3.598   Median :  6.624   Median : -2.5414   Median : -4.4083  
##  Mean   :  4.981   Mean   :  6.239   Mean   : -2.5221   Mean   : -3.8751  
##  3rd Qu.:  6.785   3rd Qu.: 10.099   3rd Qu.:  0.8234   3rd Qu.: -0.5944  
##  Max.   : 30.822   Max.   : 26.312   Max.   : 21.1995   Max.   : 27.9880  
##                                                                           
##      mfcc6              mfcc7              mfcc8             mfcc9        
##  Min.   :-27.9145   Min.   :-24.7785   Min.   :-38.379   Min.   :-33.792  
##  1st Qu.: -0.0594   1st Qu.: -3.8065   1st Qu.: -6.955   1st Qu.: -7.237  
##  Median :  6.2691   Median :  0.5021   Median : -2.263   Median : -1.572  
##  Mean   :  5.1518   Mean   :  0.5367   Mean   : -2.708   Mean   : -2.108  
##  3rd Qu.: 11.3075   3rd Qu.:  4.6862   3rd Qu.:  2.131   3rd Qu.:  3.274  
##  Max.   : 31.4070   Max.   : 31.3032   Max.   : 23.930   Max.   : 23.213  
##                                                                           
##      mfcc10            mfcc11            mfcc12             mfcc13       
##  Min.   :-19.478   Min.   :-26.742   Min.   :-29.3881   Min.   :-16.658  
##  1st Qu.: -1.336   1st Qu.: -5.843   1st Qu.: -9.1303   1st Qu.:  5.252  
##  Median :  2.700   Median : -1.673   Median : -4.0143   Median :  9.637  
##  Mean   :  2.851   Mean   : -1.838   Mean   : -4.1285   Mean   :  9.521  
##  3rd Qu.:  6.915   3rd Qu.:  2.326   3rd Qu.:  0.6071   3rd Qu.: 13.939  
##  Max.   : 29.241   Max.   : 20.220   Max.   : 26.8391   Max.   : 36.758  
##                                                                          
##      mfcc14             COG           variance          skew         
##  Min.   :-31.819   Min.   :220.0   Min.   :10517   Min.   :-1.06139  
##  1st Qu.: -3.233   1st Qu.:246.7   1st Qu.:24389   1st Qu.:-0.09699  
##  Median :  1.941   Median :255.0   Median :25565   Median :-0.02016  
##  Mean   :  1.668   Mean   :258.6   Mean   :25209   Mean   :-0.04533  
##  3rd Qu.:  6.845   3rd Qu.:262.7   3rd Qu.:26639   3rd Qu.: 0.06739  
##  Max.   : 26.120   Max.   :381.5   Max.   :32094   Max.   : 0.37271  
##                                                                      
##     kurtosis          murmur      
##  Min.   :-1.587   Min.   :-1.271  
##  1st Qu.:-1.355   1st Qu.: 1.375  
##  Median :-1.312   Median : 1.480  
##  Mean   :-1.264   Mean   : 1.485  
##  3rd Qu.:-1.254   3rd Qu.: 1.600  
##  Max.   : 1.435   Max.   : 2.786  
## 

We can see that there are 31 NA values for the width_f3 measurement; these come from time points where Praat was not able to reliably estimate the F3 bandwidth in the process of running the Nasality Automeasure script. There aren’t that many of these occurrences, and they all come from Sydney’s data:

table(alldata$speaker[is.na(alldata$width_f3)])
## 
## syd 
##  31

…which accounts for only 0.2% of this speaker’s total data set:

table(alldata$speaker[is.na(alldata$width_f3)]) / nrow(alldata[alldata$speaker=="syd",])
## 
##         syd 
## 0.002010115

…so we’ll simply set these values to the overall mean of Sydney’s F3 bandwidth measure:

alldata$width_f3[is.na(alldata$width_f3)] <- mean(alldata$width_f3[alldata$speaker=="syd"], na.rm=T)
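
To confirm that the imputation leaves no missing values in this column, a minimal check (not part of the original pipeline):

stopifnot(!anyNA(alldata$width_f3))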

Delta features

In order to capture abrupt temporal changes in the acoustic measurements, we will also calculate the delta values for all of these features (i.e. sample-wise differences). We’ll first make a list of the feature names, and then use it to calculate the deltas:

features <- c('freq_f1','freq_f2','freq_f3',
              'amp_f1','amp_f2','amp_f3',
              'width_f1','width_f2','width_f3',
              'amp_p0','a1p0','a1p0_compensated','p0prominence',
              'amp_p1','a1p1','a1p1_compensated',
              'a3p0','h1h2',paste0('mfcc',1:14),
              'COG','variance','skew','kurtosis','murmur')

for (item in unique(alldata$item)) {
  # subset the rows for this item (assumed to be ordered by timepoint)
  it.dat <- alldata[alldata$item==item,]
  
  # sample-wise differences, with 0 as the delta at the first time point
  for (feature in features) {
    alldata[[paste0(feature,".d")]][alldata$item==item] <- c(0, diff(it.dat[[feature]]) )
  }
}
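
For reference, an equivalent vectorized formulation using dplyr is sketched below (the name delta_check is ours; like the loop above, it assumes that rows are ordered by timepoint within each item):

delta_check <- alldata %>%
  group_by(item) %>%
  mutate(across(all_of(features), ~ c(0, diff(.x)), .names = "{.col}.d")) %>%
  ungroup()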

We’ll add these to the list of feature names, and we can see that we now have 74 features to work with (the 37 original features plus their 37 deltas):

features <- c(features, paste0(features,".d") )

length(features)
## [1] 74

Data re-sampling

Re-sampling of the data was crucial for creating balanced data sets across the six vowel environments, while also retaining as much of the data as possible. A number of processes from this point forward involve a random component, so we’ll use a random seed for replicability:

myseed <- 123
set.seed(myseed)

Re-sampling function

The re-sampling function is taken from the multivariate resampling procedure, which allows for over-sampling of data by creating Euclidean-distance-based weighted averages of the feature vectors of randomly selected nearest neighbors in a multidimensional space. We replicate the function here, as accessed on 21 June 2022, for the sake of transparency:

multivar_resamp <- function (inputdata, groupby, resampnum, features) {
  
  outputdata <- c()
  for (group in unique(inputdata[[groupby]])) {
    subdata <- inputdata[inputdata[[groupby]]==group,]
    
    if (nrow(subdata) > resampnum) { # more samples than the desired number?
      # undersample to the desired number
      newdata <- subdata[sample(nrow(subdata), resampnum, replace=F), ]
      
      # print a success message
      print(paste0("'",groupby,"' level named '", group, "' has been under-sampled from ", nrow(subdata), " to ", resampnum, " observations"), quote=F)
      
    } else {
      if (nrow(subdata) == resampnum) { # same number of samples as the desired number?
        # keep the original data as-is
        newdata <- subdata
        
        # print a success message
        print(paste0("'",groupby,"' level named '", group, "' has been kept at ", nrow(subdata), " observations"), quote=F)
        
      } else { # fewer samples than the desired number?
        # let's oversample!
        oversamp              <- resampnum - nrow(subdata) # number of new observations to use in oversampling
        sub.scaled            <- subdata # original data to scale
        means                 <- apply(as.matrix(sub.scaled[,features]), 2, mean) # get the original feature means
        stdevs                <- apply(as.matrix(sub.scaled[,features]), 2, sd) # get the original feature standard deviations
        sub.scaled[,features] <- scale(subdata[,features]) # scale the original features
        oversamp.data         <- c() # oversampled data to build
        
        for (samp in 1:oversamp) {
          # randomly choose an observation from the scaled feature matrix
          this.samp   <- sub.scaled[sample(nrow(sub.scaled), 1), ]
          
          # select all of the OTHER observations from the scaled feature matrix
          other.samps <- sub.scaled[which(row.names(sub.scaled)!=row.names(this.samp)), ]
          
          # calculate Euclidean distances between the selected observation and all other observations
          dists <- apply(as.matrix(other.samps[,features]), 1, function(x) sqrt(sum((x - as.matrix(this.samp[,features]))^2)))
          
          # sort by distance
          neighbors <- sort(dists)
          
          # while loop which ensures that no duplicate observations are created in the oversampling process
          n.check <- 0
          while (n.check == 0) {
            # select one of the neighbors from within a Gaussian PDF
            # possible duplicates are ignored in two steps
            n.dist    <- sample(neighbors[neighbors>0], 
                                prob = dnorm(1:length(neighbors[neighbors>0]),0,round(0.3413*length(neighbors[neighbors>0]))),
                                size = 1)
            neighbor  <- which(dists == n.dist)[1]
            
            # create a new observation by calculating a weighted average of the feature vectors 
            # associated with the selected observation and its selected neighbor
            s.dist <- (n.dist-min(dists))/diff(range(dists))
            
            new.features <- (
              s.dist * (sub.scaled[which(row.names(sub.scaled)==row.names(other.samps[neighbor,])), features]) +
                (1-s.dist) * (sub.scaled[which(row.names(sub.scaled)==row.names(this.samp)), features])
            )
            
            # convert weighted features back to their respective original scales
            # using the means and standard deviations of the original features
            new.features <- (new.features * stdevs) + means
            
            # replace old features with new features
            this.samp[,features] <- new.features
            
            # check if it is a duplicate oversampled observation
            dup.check <- duplicated(rbind(oversamp.data,this.samp))
            
            # if it is NOT a duplicate, exit the loop and add it to the data frame
            # if it IS a duplicate, run this loop again with a different neighbor
            if (length(dup.check[dup.check==TRUE])==0) {
              n.check <- 1
            }
          }
          # add new observation to data frame
          oversamp.data <- rbind(oversamp.data,this.samp)
        }
        # add oversampled data to original data frame
        newdata <- rbind(subdata,oversamp.data)
        
        # print a success message
        print(paste0("'",groupby,"' level named '", group, "' has been over-sampled from ", nrow(subdata), " to ", resampnum, " observations"), quote=F)
      } 
    }
    # add resampled data to output dataset
    outputdata <- rbind(outputdata,newdata)
  }
  return(outputdata)
}

Performing re-sampling

Before performing the data re-sampling, we first remove the measurements at the absolute boundaries of the vowel interval (time points 1 and 11) in order to avoid the most extreme effects of co-articulation on the acoustic signal:

alldata <- alldata[!(alldata$timepoint %in% c(1,11)),]
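
As a quick check (a minimal sketch), only the nine interior time points should now remain:

stopifnot(all(alldata$timepoint %in% 2:10))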

We will begin with Laurie’s data:

speakers <- c("laurie","syd")

speaker <- speakers[1]

subdata <- alldata[alldata$speaker==speaker,]
table(subdata$V1_env)
## 
## #_C #_N C_C C_N N_C N_N 
## 369 441 288 450 180 171

We can see that the distribution of samples across the six phonetic environments is unbalanced, ranging from 171 samples in N_N to 450 in C_N. Since Laurie’s data is quite limited compared to Sydney’s, and since it is not possible to obtain more data from this speaker, we will maximize the available data by over-sampling the five minority categories so that each has the same number of samples as the C_N environment, i.e. 450:

subdata1 <- multivar_resamp(subdata, "V1_env", max(table(subdata$V1_env)), features)
## [1] 'V1_env' level named '#_N' has been over-sampled from 441 to 450 observations
## [1] 'V1_env' level named '#_C' has been over-sampled from 369 to 450 observations
## [1] 'V1_env' level named 'C_C' has been over-sampled from 288 to 450 observations
## [1] 'V1_env' level named 'C_N' has been kept at 450 observations
## [1] 'V1_env' level named 'N_C' has been over-sampled from 180 to 450 observations
## [1] 'V1_env' level named 'N_N' has been over-sampled from 171 to 450 observations

Now let’s look at the effects of the over-sampling by comparing the probability densities of the original and over-sampled data for the first 20 acoustic features. We’ll plot the densities of the original data in blue and the over-sampled data in red:

# 5 x 4 grid of density plots
par(mfrow=c(5,4), mar=c(4.1,4.5,1.1,1.0))

for (feature in features[1:20]) {
  # over-sampled data in red
  plot(density(subdata1[[feature]]), col='red',
       main=NA,xlab=feature,
       ylim = range(c(
         density(subdata[[feature]])$y,
         density(subdata1[[feature]])$y)))
  
  # original data in blue
  lines(density(subdata[[feature]]), col='blue')
}