Maho Takahashi

Linguistics PhD Student at UC San Diego


How to apply Helmert coding

When working with data containing a categorical variable (e.g., a factor with three levels: A, B, C), a regression model by default treats one level of the variable as the reference level (usually whichever level comes first alphabetically, so A) and compares that level with each of the other levels: A vs B and A vs C.
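For reference, this default corresponds to R's treatment (dummy) coding for unordered factors, which you can inspect with contr.treatment(); each column compares one level to the reference level:

contr.treatment(3)
##   2 3
## 1 0 0
## 2 1 0
## 3 0 1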

But what if we want to compare non-reference levels, like B vs C? Contrast coding makes this possible; here I will demonstrate one such scheme, Helmert coding.

Below is the sample data. The extraction_domain column has three levels (baseline, existential, transitive), and let’s say we want to compare the mean acceptability of the two non-baseline levels (existential and transitive).

##   subject_id extraction_domain acceptability
## 1          1          baseline    0.52157849
## 2          2          baseline   -0.04169323
## 3          3          baseline   -0.39919983
## 4          4          baseline    0.70724171
## 5          5          baseline    1.08893772
## 6          6          baseline    1.01628361
levels(fake_data$extraction_domain)
## [1] "baseline"    "existential" "transitive"

By applying Helmert coding, we get to compare each level of a categorical variable to the mean of the subsequent levels of the variable. So for our data, the comparisons go as follows: A vs the mean of B and C, and B vs C (which is what we want to do).

While R has a contr.helmert(n) function (with n the number of levels in your variable), it applies reverse Helmert coding, which compares each level of a categorical variable to the mean of the previous levels. Since that’s not what we want, I will apply (forward) Helmert coding by creating a matrix with K-1 columns (for a variable with K levels) and attaching it to the factor with the contrasts() function.
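You can see the reverse Helmert scheme by printing R’s built-in matrix; each column compares one level to the mean of the levels before it:

contr.helmert(3)
##   [,1] [,2]
## 1   -1   -1
## 2    1   -1
## 3    0    2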

# Forward Helmert contrasts: column 1 (2, -1, -1) compares baseline to the mean of
# existential and transitive; column 2 (0, 1, -1) compares existential to transitive
my_helmert_ext_domain = matrix(c(2, -1, -1, 0, 1, -1), ncol = 2)
contrasts(fake_data$extraction_domain) = my_helmert_ext_domain
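To confirm that the contrasts were attached as intended, you can print them back from the factor (the 3 x 2 matrix assigned above should come back, with rows in level order):

contrasts(fake_data$extraction_domain)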

And here’s what a linear mixed-effects model (fit with lmerTest’s lmer(), so that p-values are reported) looks like after applying Helmert coding.

library(lmerTest)   # provides lmer() with Satterthwaite df and p-values
my_model = lmer(acceptability ~ extraction_domain + (1 | subject_id), data = fake_data)
## boundary (singular) fit: see help('isSingular')
summary(my_model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: acceptability ~ extraction_domain + (1 | subject_id)
##    Data: fake_data
## 
## REML criterion at convergence: 394.5
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.74534 -0.72944 -0.00989  0.68516  2.44594 
## 
## Random effects:
##  Groups     Name        Variance  Std.Dev. 
##  subject_id (Intercept) 4.660e-18 2.159e-09
##  Residual               2.407e-01 4.906e-01
## Number of obs: 270, groups:  subject_id, 18
## 
## Fixed effects:
##                     Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)          0.02357    0.02986 267.00000   0.789    0.431    
## extraction_domain1   0.26931    0.02111 267.00000  12.756  < 2e-16 ***
## extraction_domain2   0.26699    0.03657 267.00000   7.301 3.28e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) extr_1
## extrctn_dm1 0.000        
## extrctn_dm2 0.000  0.000 
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

Under the “Fixed effects” section, you can see the results of the two comparisons: “extraction_domain1”, which compares the baseline with the mean of existential and transitive (significant), and “extraction_domain2”, which compares existential with transitive (also significant).
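As a sanity check, the estimates can be related back to the raw cell means. With the unscaled contrast matrix used above, the extraction_domain2 estimate works out to half the difference between the existential and transitive means, and extraction_domain1 to a third of the difference between the baseline mean and the mean of the other two. A quick sketch:

# Cell means per extraction domain
cell_means = tapply(fake_data$acceptability, fake_data$extraction_domain, mean)

# Should closely match the extraction_domain2 estimate (existential vs transitive)
(cell_means["existential"] - cell_means["transitive"]) / 2

# Should closely match the extraction_domain1 estimate (baseline vs mean of the others)
(cell_means["baseline"] - mean(cell_means[c("existential", "transitive")])) / 3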