13.10 Perfect collinearity
Suppose you have a categorical variable like GEOID and numeric variables like population, median income, etc., for each census tract. If you include both the categorical variable and any of the numeric variables in a regression model, you will run into perfect collinearity: lm() does not stop with an error, but it cannot estimate every coefficient and reports NA for the redundant ones. Why? Let’s investigate a toy example.
x1 x2 x3 y
1 A 1 5 -0.6264538
2 B 2 4 0.1836433
3 A 1 5 -0.8356286
4 A 1 5 1.5952808
5 B 2 4 0.3295078
6 C 3 6 -0.8204684
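A minimal sketch of how this toy data frame could be built, assuming the values are simply typed in from the table above (the original construction code is not shown, so this is a reconstruction):

```r
# Toy data copied from the printed table; the name df matches the lm() calls.
df <- data.frame(
  x1 = c("A", "B", "A", "A", "B", "C"),  # categorical predictor
  x2 = c(1, 2, 1, 1, 2, 3),              # numeric, constant within each level of x1
  x3 = c(5, 4, 5, 5, 4, 6),              # numeric, also constant within each level of x1
  y  = c(-0.6264538, 0.1836433, -0.8356286,
         1.5952808, 0.3295078, -0.8204684)
)

# Each level of x1 corresponds to exactly one value of x2 and one value of x3.
unique(df[, c("x1", "x2", "x3")])
```

The last line lists the distinct combinations: there are only three, one per level of x1, which is the situation that arises when numeric variables are measured once per category.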
Let’s build a model with just x1.
Call:
lm(formula = y ~ x1, data = df)
Residuals:
1 2 3 4 5 6
-6.709e-01 -7.293e-02 -8.800e-01 1.551e+00 7.293e-02 -8.327e-17
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0444 0.6360 0.070 0.949
x1B 0.2122 1.0056 0.211 0.846
x1C -0.8649 1.2720 -0.680 0.545
Residual standard error: 1.102 on 3 degrees of freedom
Multiple R-squared: 0.1812, Adjusted R-squared: -0.3646
F-statistic: 0.332 on 2 and 3 DF, p-value: 0.7409
All is fine. Now let’s add x2 to the model.
Call:
lm(formula = y ~ x1 + x2, data = df)
Residuals:
1 2 3 4 5 6
-6.709e-01 -7.293e-02 -8.800e-01 1.551e+00 7.293e-02 -8.327e-17
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0444 0.6360 0.070 0.949
x1B 0.2122 1.0056 0.211 0.846
x1C -0.8649 1.2720 -0.680 0.545
x2 NA NA NA NA
Residual standard error: 1.102 on 3 degrees of freedom
Multiple R-squared: 0.1812, Adjusted R-squared: -0.3646
F-statistic: 0.332 on 2 and 3 DF, p-value: 0.7409
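Rather than reading the summary by eye, the undefined coefficient can also be found programmatically. The following is a sketch under the assumption that df holds the toy data shown earlier (re-entered here so the block runs on its own); alias() is the base R helper for reporting linear dependencies among model terms.

```r
# Toy data re-entered from the printed table (assumed reconstruction).
df <- data.frame(
  x1 = c("A", "B", "A", "A", "B", "C"),
  x2 = c(1, 2, 1, 1, 2, 3),
  y  = c(-0.6264538, 0.1836433, -0.8356286,
         1.5952808, 0.3295078, -0.8204684)
)

fit <- lm(y ~ x1 + x2, data = df)

# Coefficients that lm() could not estimate come back as NA.
dropped <- names(which(is.na(coef(fit))))
dropped

# alias() reports how the dropped term depends on the terms that were kept.
alias(fit)
```

Checking is.na(coef(fit)) is handy in scripts, where the "not defined because of singularities" note in the printed summary is easy to miss.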
There are NAs in the x2 line. How about x3?
Call:
lm(formula = y ~ x1 + x3, data = df)
Residuals:
1 2 3 4 5 6
-6.709e-01 -7.293e-02 -8.800e-01 1.551e+00 7.293e-02 -8.327e-17
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0444 0.6360 0.070 0.949
x1B 0.2122 1.0056 0.211 0.846
x1C -0.8649 1.2720 -0.680 0.545
x3 NA NA NA NA
Residual standard error: 1.102 on 3 degrees of freedom
Multiple R-squared: 0.1812, Adjusted R-squared: -0.3646
F-statistic: 0.332 on 2 and 3 DF, p-value: 0.7409
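The three summaries report the same estimates, residuals, and fit statistics. A short sketch of how one might confirm this, assuming the toy data are re-entered as below (the object names fit1, fit2, and fit3 are illustrative, not from the original):

```r
# Toy data re-entered from the printed table (assumed reconstruction).
df <- data.frame(
  x1 = c("A", "B", "A", "A", "B", "C"),
  x2 = c(1, 2, 1, 1, 2, 3),
  x3 = c(5, 4, 5, 5, 4, 6),
  y  = c(-0.6264538, 0.1836433, -0.8356286,
         1.5952808, 0.3295078, -0.8204684)
)

fit1 <- lm(y ~ x1, data = df)
fit2 <- lm(y ~ x1 + x2, data = df)
fit3 <- lm(y ~ x1 + x3, data = df)

# Number of coefficients each model could actually estimate.
ranks <- c(fit1$rank, fit2$rank, fit3$rank)
ranks

# The fitted values agree up to numerical tolerance.
same_12 <- isTRUE(all.equal(fitted(fit1), fitted(fit2)))
same_13 <- isTRUE(all.equal(fitted(fit1), fitted(fit3)))
c(same_12, same_13)
```

All three models have the same rank, so adding x2 or x3 contributes no information beyond what x1 already carries.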
Same. Why? Let’s look at the model matrix for y ~ x1 + x2.
(Intercept) x1B x1C x2
1 1 0 0 1
2 1 1 0 2
3 1 0 0 1
4 1 0 0 1
5 1 1 0 2
6 1 0 1 3
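A matrix like the one above can be produced with model.matrix(), and its rank can be compared with its number of columns. This is a sketch assuming the toy data are re-entered as below; the variable name X is illustrative.

```r
# Toy data re-entered from the printed table (assumed reconstruction).
df <- data.frame(
  x1 = c("A", "B", "A", "A", "B", "C"),
  x2 = c(1, 2, 1, 1, 2, 3),
  y  = c(-0.6264538, 0.1836433, -0.8356286,
         1.5952808, 0.3295078, -0.8204684)
)

# Design matrix that lm() builds internally: an intercept column,
# dummy columns for levels B and C of x1, and the numeric x2.
X <- model.matrix(y ~ x1 + x2, data = df)
X

# A full-rank design matrix has rank equal to its number of columns.
n_cols <- ncol(X)
x_rank <- qr(X)$rank
c(columns = n_cols, rank = x_rank)
```

When the rank is smaller than the number of columns, at least one column is an exact linear combination of the others, and its coefficient cannot be estimated separately.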
If there is perfect collinearity, then x2 can be written as a linear combination of the other columns of the model matrix. Here, x2 = (Intercept) + x1B + 2 * x1C.
x2 linear.combo
1 1 1
2 2 2
3 1 1
4 1 1
5 2 2
6 3 3
We can manually inspect these columns and see that they are equal. For a large data set, we can use identical() instead.
[1] TRUE
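The comparison above can be sketched as follows, assuming the toy data are re-entered as below and using the weights 1, 1, and 2 on the (Intercept), x1B, and x1C columns. The regression of x2 on x1 is an added illustration of one way to recover such weights; it is not taken from the original.

```r
# Toy data re-entered from the printed table (assumed reconstruction).
df <- data.frame(
  x1 = c("A", "B", "A", "A", "B", "C"),
  x2 = c(1, 2, 1, 1, 2, 3),
  y  = c(-0.6264538, 0.1836433, -0.8356286,
         1.5952808, 0.3295078, -0.8204684)
)

X <- model.matrix(y ~ x1 + x2, data = df)

# Rebuild x2 from the other three columns of the design matrix.
linear.combo <- unname(X[, "(Intercept)"] + X[, "x1B"] + 2 * X[, "x1C"])
data.frame(x2 = df$x2, linear.combo)

# identical() demands an exact match, including storage type
# (an integer x2 would not be identical to a double linear.combo).
exact <- identical(linear.combo, df$x2)

# all.equal() tolerates tiny floating-point differences.
close <- isTRUE(all.equal(linear.combo, df$x2))
c(exact, close)

# Regressing x2 on x1 recovers the weights, because x2 is determined by x1.
weights <- unname(coef(lm(x2 ~ x1, data = df)))
weights
```

With real data, where the numeric column may have been computed rather than typed in, all.equal() is usually the safer check because it does not fail on rounding noise.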