Using the standardize package

Christopher D. Eager

2021-03-04

The standardize package

The standardize package provides tools for controlling continuous variable scaling and factor contrasts. The goal of these standardizations is to keep the regression parameters on similar scales, and to ensure that the intercept (which is the predicted value of an observation when all other coefficients are multiplied by 0) represents the corrected mean (i.e. the predicted value for an observation which is average in every way, holding covariates at their mean values and averaging over group differences in factors). When the predictors are all on a similar scale, there are computational benefits for both frequentist and Bayesian approaches in mixed effects regressions, reasonable Bayesian priors are easier to specify, and regression output is easier to interpret.

Throughout this vignette, we will use the ptk dataset to demonstrate the use of the standardize package. A summary of the data can be seen below. We will first discuss scaling continuous variables with the scale function from base R, and with the scale_by function from standardize. Then we’ll discuss contrasts for unordered and ordered factors with named_sum_contr and scaled_contr_poly (respectively). Finally, we will use the standardize function to quickly use all of these tools at once.

library(standardize)
#> 
#>  *********************************************************** 
#>           Loading standardize package version 0.2.2          
#>      Call standardize.news() to see new features/changes     
#>  ***********************************************************

summary(ptk)
#>       cdur            vdur             place            stress    prevowel
#>  Min.   : 61.0   Min.   :  0.00   Bilabial:228   Post-Tonic:250   a:217   
#>  1st Qu.:117.0   1st Qu.: 49.00   Dental  :275   Tonic     :252   e:169   
#>  Median :133.0   Median : 62.00   Velar   :248   Unstressed:249   i:146   
#>  Mean   :137.5   Mean   : 60.69                                   o:151   
#>  3rd Qu.:155.0   3rd Qu.: 77.00                                   u: 68   
#>  Max.   :281.0   Max.   :190.00                                           
#>                                                                           
#>  posvowel    wordpos       wordfreq        speechrate         sex     
#>  a:212    Initial:322   Min.   :     1   Min.   : 2.000   Female:384  
#>  e:144    Medial :429   1st Qu.:  1104   1st Qu.: 5.000   Male  :367  
#>  i: 86                  Median :  4899   Median : 6.000               
#>  o:255                  Mean   : 18840   Mean   : 5.812               
#>  u: 54                  3rd Qu.: 26610   3rd Qu.: 7.000               
#>                         Max.   :158168   Max.   :10.000               
#>                                                                       
#>     speaker   
#>  s02    : 45  
#>  s03    : 45  
#>  s04    : 45  
#>  s07    : 45  
#>  s11    : 45  
#>  s16    : 44  
#>  (Other):482

Continuous variables

Continuous variables include covariates (i.e. fixed effects which take on continuous values) and the dependent variable in linear regression. The scale function in base R, with its default arguments, places continuous variables on unit scale by subtracting the mean of the variable and dividing the result by the variable’s standard deviation (also sometimes called z-scoring or simply scaling). The result is that the values in the transformed variable have the same relationship to one another as in the untransformed variable, but the transformed variable has mean 0 and standard deviation 1. As an example, consider the simple linear regression on total consonant duration cdur with speechrate as the predictor when we don’t scale either variable:

mean(ptk$cdur)
#> [1] 137.5473
sd(ptk$cdur)
#> [1] 30.074

mean(ptk$speechrate)
#> [1] 5.811869
sd(ptk$speechrate)
#> [1] 1.294624

summary(lm(cdur ~ speechrate, ptk))
#> 
#> Call:
#> lm(formula = cdur ~ speechrate, data = ptk)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -75.064 -19.688  -4.486  16.514 123.142 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 186.7623     4.7061   39.69   <2e-16 ***
#> speechrate   -8.4680     0.7904  -10.71   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 28.02 on 749 degrees of freedom
#> Multiple R-squared:  0.1329, Adjusted R-squared:  0.1317 
#> F-statistic: 114.8 on 1 and 749 DF,  p-value: < 2.2e-16

In the output for the regression on the raw variables, the intercept is 186.8 ms. This is the predicted value when all of the other coefficients are multiplied by 0. In this case, this does not describe anything interpretable, since from the data summary we can see that the minimum speech rate is 2 nuclei per second (and a value of 0 nuclei per second is not possible). The coefficient for speech rate (-8.5) means that for each increase of 1 nucleus per second in speechrate, there is an expected decrease in total consonant duration of 8.5 ms. But is an increase of 1 nucleus per second a large increase? And is a decrease in 8.5 ms a large decrease? This information is not in the output. When we scale cdur and speechrate so that they both have mean 0 and standard deviation 1, the output becomes easier to interpret:

ptk$cdur_scaled <- scale(ptk$cdur)[, 1]
ptk$sr_scaled <- scale(ptk$speechrate)[, 1]

mean(ptk$cdur_scaled)
#> [1] -1.539605e-16
sd(ptk$cdur_scaled)
#> [1] 1

mean(ptk$sr_scaled)
#> [1] -1.573508e-16
sd(ptk$sr_scaled)
#> [1] 1

summary(lm(cdur_scaled ~ sr_scaled, ptk))
#> 
#> Call:
#> lm(formula = cdur_scaled ~ sr_scaled, data = ptk)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.4960 -0.6547 -0.1492  0.5491  4.0946 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -2.164e-16  3.400e-02    0.00        1    
#> sr_scaled   -3.645e-01  3.402e-02  -10.71   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9318 on 749 degrees of freedom
#> Multiple R-squared:  0.1329, Adjusted R-squared:  0.1317 
#> F-statistic: 114.8 on 1 and 749 DF,  p-value: < 2.2e-16

Looking at the output of the regression on the scaled variables, we see that the R-squared value and F-statistic are unchanged, but now the intercept term is 0 and the estimate for the effect of scaled speech rate is -0.36. In this case, a value of 0 for scaled speech rate indicates the average value, and so the intercept represents the predicted consonant duration on unit scale at an average speech rate. The effect of scaled speech rate now represents the expected change in standard deviations of consonant duration for a 1-standard-deviation increase in speech rate; that is, for a 1-SD increase in speech rate, we expect a 0.36-SD decrease in consonant duration, which we could call an effect of moderate magnitude. When we add more covariates into the model, so long as they are all scaled prior to regression, we can directly compare the effect sizes of the different coefficients with no further math required, since they all represent the expected change in dependent variable standard deviations for a 1-SD increase in the covariate.

In addition to the scale function in base R, the standardize package has the function scale_by, which allows a continuous variable to be placed on unit scale within each level of a factor (or the interaction of several factors). For example, say that we are interested in whether or not a speaker’s relative speech rate affects their total consonant durations rather than speech rate in general. In other words, some speakers may simply speak more quickly or more slowly in general, and some may exhibit more speech rate variation than others, and we are interested in modeling speech rate relative to the speaker’s tendencies. In this case, we can use the scale_by function:

ptk$sr_scaled_by_speaker <- scale_by(speechrate ~ speaker, ptk)

mean(ptk$sr_scaled_by_speaker)
#> [1] -3.607564e-17
sd(ptk$sr_scaled_by_speaker)
#> [1] 0.9886017

with(ptk, tapply(speechrate, speaker, mean))
#>      s01      s02      s03      s04      s05      s06      s07      s08 
#> 5.540706 5.650011 5.773467 6.049278 6.005344 5.619478 6.516664 5.802228 
#>      s09      s10      s11      s12      s13      s14      s15      s16 
#> 5.855702 5.381143 5.702743 5.483161 6.290449 6.353387 5.358924 5.769699 
#>      s17      s18 
#> 5.542354 5.777147
with(ptk, tapply(speechrate, speaker, sd))
#>      s01      s02      s03      s04      s05      s06      s07      s08 
#> 1.084740 1.230526 1.136553 1.261346 1.294189 1.339326 1.242599 1.350230 
#>      s09      s10      s11      s12      s13      s14      s15      s16 
#> 1.472762 1.184024 1.146887 1.160242 1.437768 1.145203 1.432073 1.254818 
#>      s17      s18 
#> 1.357880 1.250270

with(ptk, tapply(sr_scaled, speaker, mean))
#>          s01          s02          s03          s04          s05          s06 
#> -0.209453180 -0.125023252 -0.029663172  0.183380177  0.149444392 -0.148607756 
#>          s07          s08          s09          s10          s11          s12 
#>  0.544401163 -0.007446973  0.033857642 -0.332704012 -0.084291855 -0.253902532 
#>          s13          s14          s15          s16          s17          s18 
#>  0.369666620  0.418281977 -0.349865920 -0.032573278 -0.208180769 -0.026820301
with(ptk, tapply(sr_scaled, speaker, sd))
#>       s01       s02       s03       s04       s05       s06       s07       s08 
#> 0.8378801 0.9504890 0.8779020 0.9742948 0.9996641 1.0345287 0.9598147 1.0429513 
#>       s09       s10       s11       s12       s13       s14       s15       s16 
#> 1.1375979 0.9145695 0.8858839 0.8961996 1.1105678 0.8845834 1.1061685 0.9692528 
#>       s17       s18 
#> 1.0488603 0.9657396

with(ptk, tapply(sr_scaled_by_speaker, speaker, mean))
#>           s01           s02           s03           s04           s05 
#> -2.960268e-16  3.165268e-18 -3.119346e-16  1.004790e-16  2.183358e-16 
#>           s06           s07           s08           s09           s10 
#> -7.174906e-17 -1.953684e-16 -2.200950e-16  1.468327e-16 -1.266348e-16 
#>           s11           s12           s13           s14           s15 
#> -2.923792e-16  3.850162e-16  1.980309e-16  6.325185e-17  1.427623e-16 
#>           s16           s17           s18 
#> -2.220446e-16  1.582818e-16 -2.505491e-16
with(ptk, tapply(sr_scaled_by_speaker, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1

The overall mean of sr_scaled_by_speaker is still 0, and while the standard deviation is not exactly 1, it is very close at 0.99. When we look at the raw speechrate variable by speaker, we can see that speakers differ somewhat in their mean speech rate and in the amount that the deviate from their own mean speech rate. When we look at the sr_scaled variable that we created earlier with the base R scale function, these same differences persist, but now expressed on unit scale in terms of the overall dataset. For sr_scaled_by_speaker, however, each speaker’s mean value is 0 and each speaker’s standard deviation is 1, and the variable is interpreted as how slowly or quickly the speaker was talking given their own speech rate tendencies.

If we wanted the variables resulting from scale and scale_by to have a different standard deviation than 1, this can be easily accomplished in the following way:

ptk$sr_scaled_0.5 <- scale(ptk$speechrate) * 0.5
ptk$sr_scaled_by_speaker_0.5 <- scale_by(speechrate ~ speaker, ptk, scale = 0.5)

mean(ptk$sr_scaled_0.5)
#> [1] -7.867542e-17
sd(ptk$sr_scaled_0.5)
#> [1] 0.5

with(ptk, tapply(sr_scaled_by_speaker_0.5, speaker, mean))
#>           s01           s02           s03           s04           s05 
#> -1.480134e-16  1.582634e-18 -1.559673e-16  5.023952e-17  1.091679e-16 
#>           s06           s07           s08           s09           s10 
#> -3.587453e-17 -9.768421e-17 -1.100475e-16  7.341635e-17 -6.331741e-17 
#>           s11           s12           s13           s14           s15 
#> -1.461896e-16  1.925081e-16  9.901544e-17  3.162593e-17  7.138116e-17 
#>           s16           s17           s18 
#> -1.110223e-16  7.914090e-17 -1.252746e-16
with(ptk, tapply(sr_scaled_by_speaker_0.5, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18 
#> 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5

Factors

Factors are variables which take on a defined set of categorical values called levels rather than continuous values. In regression, a factor with K levels is modeled through the use of K - 1 dummy variables. Each level of the factor is assigned a value for each dummy variable based on a contrast matrix. So, for example, a factor with four levels has a contrast matrix with four rows (one for each level) and three columns (one for each dummy variable), with the values in the cells of the matrix determining the numerical expression for the factor levels in the dummy variables. There are two general types of factors, ordered and unordered, whose contrasts are treated differently.

Unordered factors

Unordered factors take on two or more categorical values which are not intrinsically ordered (or have a somewhat ordered interpretation but there are only two categories, as is sometimes the case with factors coded as false vs. true, 0 vs. 1, or no vs. yes). For unordered factors, the default in R is to use treatment contrasts, where the first level is coded as 0 for all of the dummy variables, and the remaining levels each have a dummy variable for which they are coded +1, and are then coded as 0 for the other dummy variables, as can be seen in the following example using prevowel (the preceding vowel’s phoneme identity).

options(contrasts = c("contr.treatment", "contr.poly"))

contrasts(ptk$prevowel)
#>   e i o u
#> a 0 0 0 0
#> e 1 0 0 0
#> i 0 1 0 0
#> o 0 0 1 0
#> u 0 0 0 1

summary(lm(cdur_scaled ~ prevowel, ptk))
#> 
#> Call:
#> lm(formula = cdur_scaled ~ prevowel, data = ptk)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.3844 -0.6886 -0.1139  0.5714  4.7076 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)  0.06240    0.06765   0.922   0.3566  
#> prevowele   -0.11995    0.10224  -1.173   0.2411  
#> prevoweli    0.09704    0.10667   0.910   0.3632  
#> prevowelo   -0.22329    0.10561  -2.114   0.0348 *
#> prevowelu   -0.10358    0.13850  -0.748   0.4547  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9965 on 746 degrees of freedom
#> Multiple R-squared:  0.01219,    Adjusted R-squared:  0.006891 
#> F-statistic: 2.301 on 4 and 746 DF,  p-value: 0.05722

With treatment contrasts, the intercept loses the interpretation of the corrected mean, since when all of the dummy variables in the contrast matrix above are multiplied by 0, the resulting value corresponds to a preceding /a/. The coefficients prevowele … prevowelu then represent the difference between each of the other preceding vowel categories and /a/. To avoid this (and to ensure that the coefficients for the dummy variables stay, on average, closer to zero, but without altering the ultimate interpretation of the results), sum contrasts are used. With sum contrasts in R, the first K - 1 levels each get a dummy variable for which they are coded +1, and then are valued 0 for the other dummy variables. The last level is assigned a value of -1 for all of the dummy variables. Sum contrasts also have additional computational benefits in comparison to treatment contrasts for similar reasons as covariate scaling. Recoding the contrasts for prevowel with sum contrasts, we get:

options(contrasts = c("contr.sum", "contr.poly"))

contrasts(ptk$prevowel)
#>   [,1] [,2] [,3] [,4]
#> a    1    0    0    0
#> e    0    1    0    0
#> i    0    0    1    0
#> o    0    0    0    1
#> u   -1   -1   -1   -1

summary(lm(cdur_scaled ~ prevowel, ptk))
#> 
#> Call:
#> lm(formula = cdur_scaled ~ prevowel, data = ptk)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.3844 -0.6886 -0.1139  0.5714  4.7076 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)  
#> (Intercept) -0.007554   0.039210  -0.193   0.8473  
#> prevowel1    0.069957   0.065448   1.069   0.2855  
#> prevowel2   -0.049994   0.071157  -0.703   0.4825  
#> prevowel3    0.167001   0.074958   2.228   0.0262 *
#> prevowel4   -0.153338   0.074051  -2.071   0.0387 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9965 on 746 degrees of freedom
#> Multiple R-squared:  0.01219,    Adjusted R-squared:  0.006891 
#> F-statistic: 2.301 on 4 and 746 DF,  p-value: 0.05722

With sum contrasts, the intercept maintains the interpretation of the corrected mean, since when all of the dummy variable coefficients are multiplied by 0, it averages over their effects (note that no row in the contrast matrix above has all 0’s, and thus multiplying all of the coefficients by 0 cannot describe any one level; rather, the mean of the values in each column is 0, and so multiplying all of the dummy variable coefficients by 0 averages over their effects). However, one downside to the default implementation is that the coefficients are numbered rather than named. The named_contr_sum function from the standardize package orders levels alphabetically, applies sum contrasts to them, and names the contrasts:

contrasts(ptk$prevowel) <- named_contr_sum(ptk$prevowel)

contrasts(ptk$prevowel)
#>    a  e  i  o
#> a  1  0  0  0
#> e  0  1  0  0
#> i  0  0  1  0
#> o  0  0  0  1
#> u -1 -1 -1 -1

summary(lm(cdur_scaled ~ prevowel, ptk))
#> 
#> Call:
#> lm(formula = cdur_scaled ~ prevowel, data = ptk)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.3844 -0.6886 -0.1139  0.5714  4.7076 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)  
#> (Intercept) -0.007554   0.039210  -0.193   0.8473  
#> prevowela    0.069957   0.065448   1.069   0.2855  
#> prevowele   -0.049994   0.071157  -0.703   0.4825  
#> prevoweli    0.167001   0.074958   2.228   0.0262 *
#> prevowelo   -0.153338   0.074051  -2.071   0.0387 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9965 on 746 degrees of freedom
#> Multiple R-squared:  0.01219,    Adjusted R-squared:  0.006891 
#> F-statistic: 2.301 on 4 and 746 DF,  p-value: 0.05722

Note that the named output is identical in all other ways to the number output in this case. The intercept is the corrected mean, prevowela … prevowelo represent the difference between the named category and the corrected mean, and the /u/ category is obtained by subtracting all four the prevowel coefficients from the intercept.

The named_contr_sum function also accepts a scale argument by which the entire contrast matrix is multiplied. This allows the contrast matrix deviation magnitude to be defined. For example:

contrasts(ptk$prevowel) <- named_contr_sum(ptk$prevowel, scale = 0.5)

contrasts(ptk$prevowel)
#>      a    e    i    o
#> a  0.5  0.0  0.0  0.0
#> e  0.0  0.5  0.0  0.0
#> i  0.0  0.0  0.5  0.0
#> o  0.0  0.0  0.0  0.5
#> u -0.5 -0.5 -0.5 -0.5

One last note on named_contr_sum is that if there are only two levels and they are equal to (ignoring case) “F” and “T”, “FALSE” and “TRUE”, “N”, and “Y”, “NO”, and “YES”, or “0” and “1”, then their order is reversed. This makes it so the positive level gets the dummy coefficient rather than the negative level, yielding a more intuitive interpretation for the resulting coefficients. For example, if we were interested in differentiating only between high vowels (/i/ and /u/) and non-high vowels (/e/, /o/, and /a/), we could do the following (note also that return_contr can be set to FALSE to create a factor with the contrasts, which is useful when the original variable is not a factor):

ptk$prehigh <- ptk$prevowel %in% c("i", "u")

class(ptk$prehigh)
#> [1] "logical"

unique(ptk$prehigh)
#> [1] FALSE  TRUE

ptk$prehigh <- named_contr_sum(ptk$prehigh, return_contr = FALSE)

class(ptk$prehigh)
#> [1] "factor"

levels(ptk$prehigh)
#> [1] "TRUE"  "FALSE"

contrasts(ptk$prehigh)
#>       TRUE
#> TRUE     1
#> FALSE   -1

Ordered factors

Ordered factors are variables which take on more than two categorical values, with the categories representing a hierarchy for which we may not expect the trend to be strictly linear. For example, say we are interested in the effect of preceding vowel height, considering three levels (/a/ = Low, /e o/ = Mid, /i u/ = High). We might hypothesize that the general trend is for relatively higher vowels to have shorter durations than relatively lower vowels, but at the same time not expect that the difference between Low and Mid is the same as the difference between Mid and High. In this case we could create an ordered factor preheight with levels Low < Mid < High. In R, the default contrasts for ordered factors are orthogonal polynomial contrasts. That is, if there are K factor levels, then the contrast matrix is an orthogonal polynomial of degree K - 1, where the first contrast column is the linear component, the second the quadratic, the third the cubic, the fourth the fourth power, etc. until the K - 1’th power is reached. Our hypothesis that the trend is in general negative then would be supported by a negative linear trend. This is exemplified in the following code:

ptk$preheight <- "Mid"
ptk$preheight[ptk$prevowel == "a"] <- "Low"
ptk$preheight[ptk$prevowel %in% c("i", "u")] <- "High"
ptk$preheight <- factor(ptk$preheight, ordered = TRUE, levels = c("Low",
  "Mid", "High"))

head(ptk$preheight)
#> [1] Low  High High High Mid  High
#> Levels: Low < Mid < High

contrasts(ptk$preheight)
#>                 .L         .Q
#> [1,] -7.071068e-01  0.4082483
#> [2,] -9.073800e-17 -0.8164966
#> [3,]  7.071068e-01  0.4082483

However, the scale of the contrast matrix columns depends on the number of levels. The mean of each contrast column is always 0, and the columns of the contrast matrix are completely uncorrelated, but the standard deviations of the contrast matrix columns decreases as the number of levels increases:

contr3 <- contr.poly(3)
contr5 <- contr.poly(5)

apply(contr3, 2, mean)
#>            .L            .Q 
#> -6.723860e-17  3.700743e-17
apply(contr5, 2, mean)
#>            .L            .Q            .C            ^4 
#> -6.591949e-18  2.219362e-17  4.148700e-17  2.783689e-18

apply(contr3, 2, sd)
#>        .L        .Q 
#> 0.7071068 0.7071068
apply(contr5, 2, sd)
#>  .L  .Q  .C  ^4 
#> 0.5 0.5 0.5 0.5

For this reason, the standardize package has a function scaled_contr_poly where the standard deviation of the contrast matrix columns for ordered factors can be specified through its scale argument (with default 1):

sc_1_contr3 <- scaled_contr_poly(3)
sc_0.5_contr3 <- scaled_contr_poly(3, scale = 0.5)

sc_1_contr3
#>                .L         .Q
#> [1,] -1.00000e+00  0.5773503
#> [2,] -3.32076e-17 -1.1547005
#> [3,]  1.00000e+00  0.5773503

apply(sc_1_contr3, 2, sd)
#> .L .Q 
#>  1  1

sc_0.5_contr3
#>                .L         .Q
#> [1,] -5.00000e-01  0.2886751
#> [2,] -1.66038e-17 -0.5773503
#> [3,]  5.00000e-01  0.2886751

apply(sc_0.5_contr3, 2, sd)
#>  .L  .Q 
#> 0.5 0.5

The resulting contrasts are still orthogonal polynomial contrasts; they are simply placed on a uniform scale regardless of the number of factor levels. This affects the magnitude of the resulting coefficients, but, as with the standardization of unordered factors and continuous variables discussed above, it does not alter the significance of the variable:

contrasts(ptk$preheight)
#>                 .L         .Q
#> [1,] -7.071068e-01  0.4082483
#> [2,] -9.073800e-17 -0.8164966
#> [3,]  7.071068e-01  0.4082483

summary(lm(cdur_scaled ~ preheight, ptk))
#> 
#> Call:
#> lm(formula = cdur_scaled ~ preheight, data = ptk)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.4390 -0.7099 -0.1139  0.5536  4.7076 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)  0.01726    0.03702   0.466    0.641  
#> preheight.L  0.02354    0.06792   0.347    0.729  
#> preheight.Q  0.15135    0.06007   2.519    0.012 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.997 on 748 degrees of freedom
#> Multiple R-squared:  0.008562,   Adjusted R-squared:  0.005911 
#> F-statistic:  3.23 on 2 and 748 DF,  p-value: 0.04011

contrasts(ptk$preheight) <- scaled_contr_poly(ptk$preheight)

contrasts(ptk$preheight)
#>                .L         .Q
#> Low  -1.00000e+00  0.5773503
#> Mid  -3.32076e-17 -1.1547005
#> High  1.00000e+00  0.5773503

summary(lm(cdur_scaled ~ preheight, ptk))
#> 
#> Call:
#> lm(formula = cdur_scaled ~ preheight, data = ptk)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.4390 -0.7099 -0.1139  0.5536  4.7076 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)  0.01726    0.03702   0.466    0.641  
#> preheight.L  0.01665    0.04803   0.347    0.729  
#> preheight.Q  0.10702    0.04248   2.519    0.012 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.997 on 748 degrees of freedom
#> Multiple R-squared:  0.008562,   Adjusted R-squared:  0.005911 
#> F-statistic:  3.23 on 2 and 748 DF,  p-value: 0.04011

The standardize function

The standardize function implements scale, scale_by, named_contr_sum, and scaled_contr_poly automatically, allowing regressions to be easily fit in a standardized space. It takes the following arguments:

The function first calls model.frame. If family is gaussian (i.e. a linear model), then the response (i.e. dependent variable) is placed on unit scale (mean 0 and standard deviation 1) regardless of what the scale argument to standardize is; if the response contains a call to scale_by, then it is placed on unit scale within each level of the conditioning factor. For offsets (again, if family* is gaussian), the values are divided by the standard deviation of the response variable prior to scaling (within-factor-level if scale_by** is used on the response). Then, for all values of family, random effects groups (if any) are coerced to unordered factors, and any other predictors which are characters or which contain only two unique non-NA values (regardless of their class) are converted to unordered factors and assigned contrasts with named_contr_sum, passing along the scale argument. Ordered factors are assigned contrasts with scaled_contr_poly, passing along the scale argument. Continuous predictors which contain a call to scale_by are re-calculated passing along the scale argument. Finally, continuous predictors which do not contain a call to scale_by are scaled using the scale function, ensuring that the resulting variable has standard deviation equal to the scale argument to standardize.

The column names of the model frame are then renamed so that they are valid variable names, and the formula passed to standardize is updated with these variable names. A list with class standardized is then returned with the following elements:

To illustrate the use of the standardize function, we will fit a linear mixed effects regression with lmer in the lme4 package with cdur as the response, place, stress, preheight, the natural log of wordfreq, and speecrhate scaled by speaker as fixed effects, and random intercepts for speaker. We begin by creating preheight and then calling standardize:

ptk$preheight <- "Mid"
ptk$preheight[ptk$prevowel == "a"] <- "Low"
ptk$preheight[ptk$prevowel %in% c("i", "u")] <- "High"
ptk$preheight <- factor(ptk$preheight, ordered = TRUE, levels = c("Low",
  "Mid", "High"))

sobj <- standardize(cdur ~ place + stress + preheight + log(wordfreq) +
  scale_by(speechrate ~ speaker) + (1 | speaker), ptk)

Now lets examine the sobj object created by standardize:

is.standardized(sobj)
#> [1] TRUE

sobj
#> 
#> Call:
#> standardize(formula = cdur ~ place + stress + preheight + log(wordfreq) + 
#>     scale_by(speechrate ~ speaker) + (1 | speaker), data = ptk)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Standardized Formula:
#> cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker + 
#>     (1 | speaker)
#> 
#> Variables:
#>  Variable                       Standardized Name            Class           
#>  cdur                           cdur                         response.numeric
#>  place                          place                        factor          
#>  stress                         stress                       factor          
#>  preheight                      preheight                    ordered         
#>  log(wordfreq)                  log_wordfreq                 numeric         
#>  scale_by(speechrate ~ speaker) speechrate_scaled_by_speaker scaledby        
#>  speaker                        speaker                      group           
#> 
#> Response has mean 0 and standard deviation 1.
#> 
#> Continuous variables have mean 0 and standard deviation 1
#>   (within-factor-level if scale_by was used)
#> Unordered factors have sum contrasts with deviation 1
#> Ordered factors have orthogonal polynomial contrasts whose
#>   columns have standard deviation 1
#> Grouping factors are coded as unordered factors with default contrasts

names(sobj)
#>  [1] "call"      "scale"     "formula"   "family"    "data"      "offset"   
#>  [7] "pred"      "variables" "contrasts" "groups"

head(sobj$data)
#>          cdur  place     stress preheight log_wordfreq
#> 1  0.58032620  Velar Unstressed       Low  -1.64753617
#> 2  0.21456174  Velar Post-Tonic      High  -1.64753617
#> 3 -0.11795140 Dental Unstressed      High  -0.05570344
#> 4  0.01505386 Dental Post-Tonic      High  -1.90076049
#> 5  0.91283934  Velar      Tonic       Mid  -0.19203879
#> 6  1.51136300 Dental Post-Tonic      High  -0.22152148
#>   speechrate_scaled_by_speaker speaker
#> 1                   -0.4984662     s01
#> 2                    1.3452937     s01
#> 3                   -0.4984662     s01
#> 4                    0.6454974     s01
#> 5                    2.2671736     s01
#> 6                   -2.3422261     s01

mean(sobj$data$cdur)
#> [1] -1.539605e-16
sd(sobj$data$cdur)
#> [1] 1

mean(sobj$data$log_wordfreq)
#> [1] 1.550764e-16
sd(sobj$data$log_wordfreq)
#> [1] 1
all.equal(scale(log(ptk$wordfreq))[, 1], sobj$data$log_wordfreq[, 1])
#> [1] TRUE

with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, mean))
#>           s01           s02           s03           s04           s05 
#> -2.960268e-16  3.165268e-18 -3.119346e-16  1.004790e-16  2.183358e-16 
#>           s06           s07           s08           s09           s10 
#> -7.174906e-17 -1.953684e-16 -2.200950e-16  1.468327e-16 -1.266348e-16 
#>           s11           s12           s13           s14           s15 
#> -2.923792e-16  3.850162e-16  1.980309e-16  6.325185e-17  1.427623e-16 
#>           s16           s17           s18 
#> -2.220446e-16  1.582818e-16 -2.505491e-16
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1

sobj$contrasts
#> $place
#>          Bilabial Dental
#> Bilabial        1      0
#> Dental          0      1
#> Velar          -1     -1
#> 
#> $stress
#>            Post-Tonic Tonic
#> Post-Tonic          1     0
#> Tonic               0     1
#> Unstressed         -1    -1
#> 
#> $preheight
#>                .L         .Q
#> Low  -1.00000e+00  0.5773503
#> Mid  -3.32076e-17 -1.1547005
#> High  1.00000e+00  0.5773503

sobj$groups
#> $speaker
#>  [1] "s01" "s02" "s03" "s04" "s05" "s06" "s07" "s08" "s09" "s10" "s11" "s12"
#> [13] "s13" "s14" "s15" "s16" "s17" "s18"

We can see that all of the regression variables have been placed on a similar scale, using the default scale argument of 1. The majority of the predictors did not change name, but those which included function calls (log(wordfreq) and scale_by(speechrate ~ speaker)) have been altered so that they are valid variable names. If we were to call standardize with scale = 0.5, then cdur would still have mean 0 and standard deviation 1, but the predictors would all have scale 0.5:

sobj <- standardize(cdur ~ place + stress + preheight + log(wordfreq) +
  scale_by(speechrate ~ speaker) + (1 | speaker), ptk, scale = 0.5)

sobj
#> 
#> Call:
#> standardize(formula = cdur ~ place + stress + preheight + log(wordfreq) + 
#>     scale_by(speechrate ~ speaker) + (1 | speaker), data = ptk, 
#>     scale = 0.5)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Standardized Formula:
#> cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker + 
#>     (1 | speaker)
#> 
#> Variables:
#>  Variable                       Standardized Name            Class           
#>  cdur                           cdur                         response.numeric
#>  place                          place                        factor          
#>  stress                         stress                       factor          
#>  preheight                      preheight                    ordered         
#>  log(wordfreq)                  log_wordfreq                 numeric         
#>  scale_by(speechrate ~ speaker) speechrate_scaled_by_speaker scaledby        
#>  speaker                        speaker                      group           
#> 
#> Response has mean 0 and standard deviation 1.
#> 
#> Continuous variables have mean 0 and standard deviation 0.5
#>   (within-factor-level if scale_by was used)
#> Unordered factors have sum contrasts with deviation 0.5
#> Ordered factors have orthogonal polynomial contrasts whose
#>   columns have standard deviation 0.5
#> Grouping factors are coded as unordered factors with default contrasts

names(sobj)
#>  [1] "call"      "scale"     "formula"   "family"    "data"      "offset"   
#>  [7] "pred"      "variables" "contrasts" "groups"

head(sobj$data)
#>          cdur  place     stress preheight log_wordfreq
#> 1  0.58032620  Velar Unstressed       Low  -0.82376808
#> 2  0.21456174  Velar Post-Tonic      High  -0.82376808
#> 3 -0.11795140 Dental Unstressed      High  -0.02785172
#> 4  0.01505386 Dental Post-Tonic      High  -0.95038025
#> 5  0.91283934  Velar      Tonic       Mid  -0.09601939
#> 6  1.51136300 Dental Post-Tonic      High  -0.11076074
#>   speechrate_scaled_by_speaker speaker
#> 1                   -0.2492331     s01
#> 2                    0.6726468     s01
#> 3                   -0.2492331     s01
#> 4                    0.3227487     s01
#> 5                    1.1335868     s01
#> 6                   -1.1711131     s01

mean(sobj$data$cdur)
#> [1] -1.539605e-16
sd(sobj$data$cdur)
#> [1] 1

mean(sobj$data$log_wordfreq)
#> [1] 7.753821e-17
sd(sobj$data$log_wordfreq)
#> [1] 0.5
all.equal(0.5 * scale(log(ptk$wordfreq))[, 1], sobj$data$log_wordfreq[, 1])
#> [1] TRUE

with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, mean))
#>           s01           s02           s03           s04           s05 
#> -1.480134e-16  1.582634e-18 -1.559673e-16  5.023952e-17  1.091679e-16 
#>           s06           s07           s08           s09           s10 
#> -3.587453e-17 -9.768421e-17 -1.100475e-16  7.341635e-17 -6.331741e-17 
#>           s11           s12           s13           s14           s15 
#> -1.461896e-16  1.925081e-16  9.901544e-17  3.162593e-17  7.138116e-17 
#>           s16           s17           s18 
#> -1.110223e-16  7.914090e-17 -1.252746e-16
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18 
#> 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5

sobj$contrasts
#> $place
#>          Bilabial Dental
#> Bilabial      0.5    0.0
#> Dental        0.0    0.5
#> Velar        -0.5   -0.5
#> 
#> $stress
#>            Post-Tonic Tonic
#> Post-Tonic        0.5   0.0
#> Tonic             0.0   0.5
#> Unstressed       -0.5  -0.5
#> 
#> $preheight
#>                .L         .Q
#> Low  -5.00000e-01  0.2886751
#> Mid  -1.66038e-17 -0.5773503
#> High  5.00000e-01  0.2886751

sobj$groups
#> $speaker
#>  [1] "s01" "s02" "s03" "s04" "s05" "s06" "s07" "s08" "s09" "s10" "s11" "s12"
#> [13] "s13" "s14" "s15" "s16" "s17" "s18"

Using the standardized object

The standardized object can be used to fit regression models, and the resulting regression model can then be used with functions from other packages such as lme4, afex, and emmeans with a few caveats.

Fitting a regression

We fit the mixed effects regression using the standardized object with scale = 0.5 by simply passing its formula and data elements to the lmer function:

library(lme4)
#> Loading required package: Matrix
#> Warning: package 'Matrix' was built under R version 3.6.3

mod <- lmer(sobj$formula, sobj$data)

summary(mod)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: 
#> cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker +  
#>     (1 | speaker)
#>    Data: sobj$data
#> 
#> REML criterion at convergence: 2006.9
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -2.5479 -0.6919 -0.1014  0.5883  4.2981 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  speaker  (Intercept) 0.09375  0.3062  
#>  Residual             0.79081  0.8893  
#> Number of obs: 751, groups:  speaker, 18
#> 
#> Fixed effects:
#>                              Estimate Std. Error t value
#> (Intercept)                   0.01576    0.07946   0.198
#> placeBilabial                 0.01651    0.09553   0.173
#> placeDental                  -0.02341    0.09183  -0.255
#> stressPost-Tonic             -0.12491    0.09681  -1.290
#> stressTonic                   0.42435    0.09595   4.423
#> preheight.L                   0.03529    0.09331   0.378
#> preheight.Q                   0.11910    0.07848   1.518
#> log_wordfreq                 -0.07557    0.06763  -1.117
#> speechrate_scaled_by_speaker -0.61373    0.06643  -9.239
#> 
#> Correlation of Fixed Effects:
#>             (Intr) plcBlb plcDnt strP-T strssT prhg.L prhg.Q lg_wrd
#> placeBilabl  0.024                                                 
#> placeDental -0.039 -0.500                                          
#> strssPst-Tn  0.004  0.053 -0.051                                   
#> stressTonic -0.006  0.003 -0.003 -0.524                            
#> preheight.L  0.009  0.075 -0.106 -0.272  0.276                     
#> preheight.Q  0.083 -0.041 -0.127  0.029 -0.059  0.023              
#> log_wordfrq  0.005 -0.092  0.062 -0.131 -0.029  0.106  0.093       
#> spchrt_sc__  0.007  0.015 -0.043  0.058  0.046  0.061  0.073  0.013

Predicting new data

When predicting new data with the regression model, the new data first needs to be placed into the same standardized space as sobj$data. This can be done by calling predict with the standardied object as the first argument. The predict method for the standardized class also takes logical arguments response (whether or not the new data contains the response variable; default FALSE), fixed (whether or not the new data contains the fixed effects variables; default TRUE), and random (whether or not the new data contains the random effects variables; default TRUE). We will illustrate the use of the standardized predict method using the original data ptk as if it were new data:

newdata <- predict(sobj, ptk)
newdata_fe <- predict(sobj, ptk, random = FALSE)
newdata_re <- predict(sobj, ptk, fixed = FALSE)

head(newdata)
#>    place     stress preheight log_wordfreq speechrate_scaled_by_speaker speaker
#> 1  Velar Unstressed       Low  -0.82376808                   -0.2492331     s01
#> 2  Velar Post-Tonic      High  -0.82376808                    0.6726468     s01
#> 3 Dental Unstressed      High  -0.02785172                   -0.2492331     s01
#> 4 Dental Post-Tonic      High  -0.95038025                    0.3227487     s01
#> 5  Velar      Tonic       Mid  -0.09601939                    1.1335868     s01
#> 6 Dental Post-Tonic      High  -0.11076074                   -1.1711131     s01

head(newdata_fe)
#>    place     stress preheight log_wordfreq speechrate_scaled_by_speaker
#> 1  Velar Unstressed       Low  -0.82376808                   -0.2492331
#> 2  Velar Post-Tonic      High  -0.82376808                    0.6726468
#> 3 Dental Unstressed      High  -0.02785172                   -0.2492331
#> 4 Dental Post-Tonic      High  -0.95038025                    0.3227487
#> 5  Velar      Tonic       Mid  -0.09601939                    1.1335868
#> 6 Dental Post-Tonic      High  -0.11076074                   -1.1711131

head(newdata_re)
#>   speaker
#> 1     s01
#> 2     s01
#> 3     s01
#> 4     s01
#> 5     s01
#> 6     s01

This standardized new data can then be used as the newdata argument to predict with the regression model as the first argument. The predict method may generate warnings about contrasts being dropped, but these warnings can be ignored (i.e. the predictions are still correct; warnings shouldn’t occur with the latest version of lme4, but may occur for older versions, and will occur for the predict method for fixed-effects-only models):

# predictions using both the fixed and random effects
preds <- predict(mod, newdata = newdata)
all.equal(preds, fitted(mod))
#> [1] TRUE

# predictions using only the fixed effects
preds_fe <- predict(mod, newdata = newdata_fe, re.form = NA)

head(preds)
#>           1           2           3           4           5           6 
#>  0.36567215 -0.07756456  0.32566494  0.13159718 -0.26161147  0.98498387

head(preds_fe)
#>           1           2           3           4           5           6 
#>  0.10143783 -0.34179887  0.06143062 -0.13263714 -0.52584579  0.72074955

Obtaining p-values with mixed

When obtaining p-values for the predictors with the mixed function in the afex package, the check_contrasts argument should be set to FALSE to ensure that the correct contrasts are used (this shouldn’t matter when a regression model is passed to mixed, but if the formula and data elements of the standardized object are passed directly to mixed, it is necessary; it is best to just always specify check_contrasts = FALSE):

library(afex)
#> Registered S3 methods overwritten by 'car':
#>   method                          from
#>   influence.merMod                lme4
#>   cooks.distance.influence.merMod lme4
#>   dfbeta.influence.merMod         lme4
#>   dfbetas.influence.merMod        lme4
#> Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
#> when loading 'dplyr'
#> ************
#> Welcome to afex. For support visit: http://afex.singmann.science/
#> - Functions for ANOVAs: aov_car(), aov_ez(), and aov_4()
#> - Methods for calculating p-values with mixed(): 'KR', 'S', 'LRT', and 'PB'
#> - 'afex_aov' and 'mixed' objects can be passed to emmeans() for follow-up tests
#> - NEWS: library('emmeans') now needs to be called explicitly!
#> - Get and set global package options with: afex_options()
#> - Set orthogonal sum-to-zero contrasts globally: set_sum_contrasts()
#> - For example analyses see: browseVignettes("afex")
#> ************
#> 
#> Attaching package: 'afex'
#> The following object is masked from 'package:lme4':
#> 
#>     lmer

pvals <- mixed(mod, data = sobj$data, check_contrasts = FALSE)
#> Formula (the first argument) converted to formula.
#> Fitting one lmer() model. [DONE]
#> Calculating p-values. [DONE]

pvals
#> Mixed Model Anova Table (Type 3 tests, KR-method)
#> 
#> Model: cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker + 
#> Model:     (1 | speaker)
#> Data: $
#> Data: sobj
#> Data: data
#>                         Effect        df         F p.value
#> 1                        place 2, 726.63      0.03    .967
#> 2                       stress 2, 726.38 10.51 ***   <.001
#> 3                    preheight 2, 732.84      1.21    .299
#> 4                 log_wordfreq 1, 731.67      1.25    .264
#> 5 speechrate_scaled_by_speaker 1, 725.10 85.35 ***   <.001
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1

Obtaining estimated marginal means with emmeans

The emmeans function in the emmeans package can be called as it would normally be called, but note that if the regression model contains polynomial covariates (i.e. if the formula passed to standardize includes calls to the poly function), then the results may be misleading (as is often the case). We will illustrate the use of emmeans with the stress factor (since the p-value for the overall factor was <.0001):

library(emmeans)

stress_comparisons <- emmeans(mod, pairwise ~ stress)

stress_comparisons
#> $emmeans
#>  stress      emmean     SE   df lower.CL upper.CL
#>  Post-Tonic -0.0467 0.0932 32.5  -0.2365   0.1431
#>  Tonic       0.2279 0.0926 31.6   0.0392   0.4166
#>  Unstressed -0.1340 0.0924 31.4  -0.3223   0.0544
#> 
#> Results are averaged over the levels of: place, preheight 
#> Degrees-of-freedom method: kenward-roger 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast                  estimate     SE  df t.ratio p.value
#>  (Post-Tonic) - Tonic       -0.2746 0.0842 727 -3.263  0.0033 
#>  (Post-Tonic) - Unstressed   0.0873 0.0825 726  1.058  0.5406 
#>  Tonic - Unstressed          0.3619 0.0817 726  4.428  <.0001 
#> 
#> Results are averaged over the levels of: place, preheight 
#> Degrees-of-freedom method: kenward-roger 
#> P value adjustment: tukey method for comparing a family of 3 estimates

Conclusion

The standardize package provides several functions which aid in placing regression variables on similar scales, namely scale_by, named_contr_sum, and scaled_contr_poly. The standardize function offers a convenient way to make use of these functions (along with the scale function in base R) automatically, resulting in a standardized object which contains a standardized formula and data frame that can be passed to regression fitting functions. The most common use of the standardize package is to call the standardize function, passing the same formula, data, and family arguments as would normally be passed to a regression fitting function, leaving the scale argument at its default, and then passing the formula and data elements of the standardized object on to the regression fitting function with all other options for the regression fitting function specified as they would normally be specified.

References

If you use the standardize package in a publication, please cite:

Eager, Christopher D. (2017). standardize: Tools for Standardizing Variables for Regression in R. R package version 0.2.1. https://CRAN.R-project.org/package=standardize

If you analyze the ptk dataset in a publication, please cite:

Eager, Christopher D. (2017). Contrast preservation and constraints on individual phonetic variation. Doctoral thesis. University of Illinois at Urbana-Champaign.