Linguistics PhD Student at UC San Diego
In working with data with a categorical variable (e.g., a factor with three levels: A, B, C), a regression model treats one level of the variable as the reference level (usually what comes up first alphabetically, so A), and compares that level with all the other levels; A vs B and A vs C.
But what if we want to compare non-reference levels, like B vs C? Contrast coding makes this possible, including Helmert coding which I will demonstrate how to do here.
Below is the sample data. extraction_domain
column has three levels (baseline, existential, transitive), and let’s say we want to compare mean acceptability of two non-baseline levels (existential, transitive).
## subject_id extraction_domain acceptability
## 1 1 baseline 0.52157849
## 2 2 baseline -0.04169323
## 3 3 baseline -0.39919983
## 4 4 baseline 0.70724171
## 5 5 baseline 1.08893772
## 6 6 baseline 1.01628361
levels(fake_data$extraction_domain)
## [1] "baseline" "existential" "transitive"
By applying Helmert coding, we get to compare each level of a categorical variable to the mean of subsequent levels of the variable. So for our data, comparison goes as follows: A vs B+C, and B vs C (which is what we want to do).
While R has contr.helmert(n)
function (with n the number of levels in your variable), it applies the reverse Helmert coding, which lets us compare each level of a categorical variable to the mean of previous levels of the variable. Since that’s not what we want to do, I will apply Helmert coding by creating a matrix with K-1 columns (for a variable with K levels), and applying the matrix to the dataset with contrasts
function.
my_helmert_ext_domain = matrix(c(2, -1, -1, 0, 1, -1), ncol = 2)
contrasts(fake_data$extraction_domain) = my_helmert_ext_domain
And here’s what a linear mixed-effects model looks like after applying Helmert coding.
my_model = lmer(acceptability~extraction_domain + (1 | subject_id), data = fake_data)
## boundary (singular) fit: see help('isSingular')
summary(my_model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: acceptability ~ extraction_domain + (1 | subject_id)
## Data: fake_data
##
## REML criterion at convergence: 394.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.74534 -0.72944 -0.00989 0.68516 2.44594
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject_id (Intercept) 4.660e-18 2.159e-09
## Residual 2.407e-01 4.906e-01
## Number of obs: 270, groups: subject_id, 18
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.02357 0.02986 267.00000 0.789 0.431
## extraction_domain1 0.26931 0.02111 267.00000 12.756 < 2e-16 ***
## extraction_domain2 0.26699 0.03657 267.00000 7.301 3.28e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) extr_1
## extrctn_dm1 0.000
## extrctn_dm2 0.000 0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
Under the “Fixed effects” section, you can see the results of two comparisons: “extraction_domain1” which is comparing the baseline with the mean of existential and transitive (significant), and “extraction_domain2” which is comparing existential and transitive (also significant).