13.10 Perfect collinearity

Suppose you have a categorical variable like GEOID and numeric variables like population, median income, etc., for each census tract, so each numeric variable takes exactly one value per tract. If you include both the categorical variable and any of those numeric variables in a regression model, R cannot estimate all of the coefficients: the numeric variable is perfectly collinear with the GEOID dummies, and its coefficient comes back as NA. Why? Let’s investigate a toy example.

Code
set.seed(1)
df = data.frame(x1 = c('A', 'B', 'A', 'A', 'B', 'C'), 
                x2 = c(1, 2, 1, 1, 2, 3), 
                x3 = c(5, 4, 5, 5, 4, 6), 
                y = rnorm(6, 0, 1))
head(df)
  x1 x2 x3          y
1  A  1  5 -0.6264538
2  B  2  4  0.1836433
3  A  1  5 -0.8356286
4  A  1  5  1.5952808
5  B  2  4  0.3295078
6  C  3  6 -0.8204684
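
Notice that x2 and x3 are completely determined by x1: every A row has x2 = 1 and x3 = 5, every B row has x2 = 2 and x3 = 4, and every C row has x2 = 3 and x3 = 6. We can confirm there are only three distinct combinations:

Code
unique(df[, c('x1', 'x2', 'x3')])
  x1 x2 x3
1  A  1  5
2  B  2  4
6  C  3  6

This is the toy version of the GEOID setup: the numeric columns carry no information beyond the category labels.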

Let’s build a model with just x1.

Code
lm1 = lm(y ~ x1, data = df)
summary(lm1)

Call:
lm(formula = y ~ x1, data = df)

Residuals:
         1          2          3          4          5          6 
-6.709e-01 -7.293e-02 -8.800e-01  1.551e+00  7.293e-02 -8.327e-17 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.0444     0.6360   0.070    0.949
x1B           0.2122     1.0056   0.211    0.846
x1C          -0.8649     1.2720  -0.680    0.545

Residual standard error: 1.102 on 3 degrees of freedom
Multiple R-squared:  0.1812,    Adjusted R-squared:  -0.3646 
F-statistic: 0.332 on 2 and 3 DF,  p-value: 0.7409

All is fine. Now let’s add x2 to the model.

Code
lm2 = lm(y ~ x1 + x2, data = df)
summary(lm2)

Call:
lm(formula = y ~ x1 + x2, data = df)

Residuals:
         1          2          3          4          5          6 
-6.709e-01 -7.293e-02 -8.800e-01  1.551e+00  7.293e-02 -8.327e-17 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.0444     0.6360   0.070    0.949
x1B           0.2122     1.0056   0.211    0.846
x1C          -0.8649     1.2720  -0.680    0.545
x2                NA         NA      NA       NA

Residual standard error: 1.102 on 3 degrees of freedom
Multiple R-squared:  0.1812,    Adjusted R-squared:  -0.3646 
F-statistic: 0.332 on 2 and 3 DF,  p-value: 0.7409

The x2 row is all NAs, and the summary notes that one coefficient is “not defined because of singularities”. Everything else, from the estimates to the R-squared, is identical to lm1, so x2 added no information. How about x3?

Code
lm3 = lm(y ~ x1 + x3, data = df)
summary(lm3)

Call:
lm(formula = y ~ x1 + x3, data = df)

Residuals:
         1          2          3          4          5          6 
-6.709e-01 -7.293e-02 -8.800e-01  1.551e+00  7.293e-02 -8.327e-17 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.0444     0.6360   0.070    0.949
x1B           0.2122     1.0056   0.211    0.846
x1C          -0.8649     1.2720  -0.680    0.545
x3                NA         NA      NA       NA

Residual standard error: 1.102 on 3 degrees of freedom
Multiple R-squared:  0.1812,    Adjusted R-squared:  -0.3646 
F-statistic: 0.332 on 2 and 3 DF,  p-value: 0.7409

Same story: x3 is dropped and the fit is unchanged. Why? To see what’s going on, let’s look at the model matrix, where lm() has expanded x1 into the dummy variables x1B and x1C.

Code
mm = model.matrix(lm2) |>
  as.data.frame()
mm
  (Intercept) x1B x1C x2
1           1   0   0  1
2           1   1   0  2
3           1   0   0  1
4           1   0   0  1
5           1   1   0  2
6           1   0   1  3

If there is perfect collinearity, then x2 can be written as an exact linear combination of the other columns. Here x2 = 1*(Intercept) + (2-1)*x1B + (3-1)*x1C: the intercept supplies the value of x2 for category A, and the dummy coefficients are the offsets for B and C. Let’s verify:

Code
check = data.frame(x2 = mm$x2, 
                   linear.combo = 1 + (2-1)*mm$x1B + (3-1)*mm$x1C)
check
  x2 linear.combo
1  1            1
2  2            2
3  1            1
4  1            1
5  2            2
6  3            3

We can manually inspect these columns and see they are equal. For a large data set, we can check programmatically with identical() (or with all.equal(), which allows for floating-point rounding error).

Code
identical(check$x2, 
          check$linear.combo)
[1] TRUE
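
The same trick verifies x3. Its values for A, B, and C are 5, 4, and 6, so it is the baseline 5 plus the offsets (4-5) for B and (6-5) for C:

Code
check3 = data.frame(x3 = df$x3, 
                    linear.combo = 5 + (4-5)*mm$x1B + (6-5)*mm$x1C)
identical(check3$x3, check3$linear.combo)
[1] TRUE

Any variable that is constant within the levels of x1, like population or median income within a GEOID, can be reconstructed this way, which is exactly why lm() must drop it.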
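
For real models we don’t want to hunt for the linear combination by hand. As a sketch of two base R options: alias() reports how each dropped term depends on the retained columns, and comparing the rank of the model matrix to its column count flags the problem without naming the culprit.

Code
# alias() describes the dependency lm() found, i.e. how x2 is
# built from the intercept and the x1 dummies.
alias(lm2)

# A rank-deficient model matrix is another symptom: with perfect
# collinearity the rank is smaller than the number of columns.
M = model.matrix(lm2)
qr(M)$rank  # 3
ncol(M)     # 4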