Collinearity between predictors: what happens under the hood

In summary: when predictors are correlated, a partial regression coefficient can no longer be read as the effect of changing one predictor while the others are held fixed, because in the data the correlated predictors move together; the thread works through what this means for the estimation and interpretation of the coefficients.
  • #1
fog37
TL;DR Summary
Understanding the idea of "keeping the other predictors fixed" to interpret partial coefficients
Hello,

In the presence of NO multicollinearity, with a linear regression model like ##Y=3 X_1+2 X_2##, the predictors ##X_1, X_2## are not pairwise correlated.
  • When ##X_1## changes by 1 unit, the dependent variable ##Y## changes by ##3## units, i.e. ##\Delta Y = 3##, while the other predictors are kept fixed/constant, i.e. they are not simultaneously changing with ##X_1## and contributing to the ##\Delta Y## of 3. By analogy, it is as if the predictors were "decoupled" gears.
  • However, when multicollinearity is present (##X_1## and ##X_2## are correlated), the change ##\Delta Y## associated with a 1-unit change in ##X_1## is no longer due solely to that unit change in ##X_1## with the other variables kept fixed/constant. The observed change is due to the explicit change in ##X_1## but also to the implicit change in ##X_2## (also by one unit?) induced by ##X_1##: changing ##X_1## automatically changes ##X_2##, which is not kept constant while ##X_1## changes.
I think my understanding is correct, but I don't fully understand how all this happens mechanically within the data. Does the idea of "while keeping the other variables fixed" really mean that the calculation of the coefficients ##\beta## involves the pairwise correlation ##r_{12}##, compromising the purity of the coefficient? I just don't see how, operationally, changing ##X_1## by one unit (i.e. setting ##X_1=1##) automatically, under the hood, activates a change in ##X_2## in the equation which silently contributes part of ##\Delta Y##.

It is like ##\Delta Y## = (##\Delta Y## due to ##X_1##) + (##\Delta Y## due to ##X_2##)
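
A tiny numerical sketch of that decomposition may help frame the question (the 0.5 dependence of ##X_2## on ##X_1## below is a made-up number, purely for illustration):

```python
# Hypothetical numbers: Y = 3*X1 + 2*X2, and suppose that in the data X2 tends
# to move by 0.5 units for every 1-unit move in X1 (made-up dependence).
beta1, beta2 = 3.0, 2.0
dX1 = 1.0
dX2_induced = 0.5 * dX1           # implicit change in X2 dragged along by X1

dY_from_X1 = beta1 * dX1          # the "pure" part, other variables fixed
dY_from_X2 = beta2 * dX2_induced  # the part that sneaks in through X2
print(dY_from_X1, dY_from_X2, dY_from_X1 + dY_from_X2)  # 3.0 1.0 4.0
```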

Thank you for any clarification.
 
  • #2
In the case of correlated independent variables, ##X_1## and ##X_2##, the coefficients of the linear regression are not necessarily unique. As an extreme example, consider the case where ##X_1= X_2##. The use of a second variable is completely redundant and linear regressions with both variables are possible with a whole set of coefficient combinations.

A step-by-step process alleviates the problem and gives statistical meaning to the coefficients. Suppose ##X_1= X_2## and the linear regression model ##Y = a_1 X_1 + \epsilon## gives the minimal sum of squared errors. Then there will be no correlation between the sample values ##x_{2,i}## and the residuals ##\epsilon_i##, because the term ##a_1 X_1## has already captured all the correlation that could be obtained by adding ##X_2## to the linear model.

In a less extreme example, where ##X_1## and ##X_2## are correlated but not equal, there may be some residual error from the ##Y = a_1 X_1 + \epsilon## model that can be reduced by adding an ##X_2## term to the regression. If the reduction is statistically significant, ##X_2## can be added. The term ##a_2 X_2## can then be thought of as accounting for/explaining/predicting the residual errors left over by the ##Y = a_1 X_1+ \epsilon## model.

This process is automated in the stepwise linear regression algorithm. The results should be examined for validity and not just applied blindly. The bidirectional elimination algorithm is the most sophisticated. Suppose that variable ##X_1## gives the best single-variable model but ##X_2## and ##X_3## are added in later steps because their reduction of the residual errors was statistically significant. It can happen that the model with only ##X_2## and ##X_3## explains so much of the ##Y## values that the ##X_1## term is no longer statistically significant. The bidirectional elimination algorithm would then go back and remove ##X_1## from the final regression result.
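
A minimal numerical sketch of this residual-fitting idea on simulated data (the 0.7 and 0.5 coefficients below are arbitrary choices for illustration, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Correlated predictors: X2 is partly driven by X1 (illustrative choice).
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + 0.5 * rng.normal(size=n)
y = 3.0 * x1 + 2.0 * x2 + rng.normal(size=n)

# Step 1: fit Y on X1 alone (simple least squares via lstsq).
A1 = np.column_stack([np.ones(n), x1])
b1 = np.linalg.lstsq(A1, y, rcond=None)[0]
resid = y - A1 @ b1

# Step 2: the residuals are uncorrelated with X1 by construction,
# but still correlated with X2, so an X2 term can reduce them further.
print("corr(resid, x1):", np.corrcoef(resid, x1)[0, 1])  # ~0
print("corr(resid, x2):", np.corrcoef(resid, x2)[0, 1])  # noticeably nonzero

# Step 3: regress the residuals on X2 to see how much X2 adds.
A2 = np.column_stack([np.ones(n), x2])
b2 = np.linalg.lstsq(A2, resid, rcond=None)[0]
print("coefficient of x2 on the residuals:", b2[1])
```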
 
  • #3
Measuring collinearity is essentially the same thing as asking what the beta is when using ##X_1## to predict ##X_2## and getting a nonzero answer. In your example, suppose ##X_2=0.5X_1+\text{independent noise}##. Then how would you expect ##Y## to change if ##X_1## changes by 1 unit?

This is basically the same thing as the chain rule with multiple inputs, if that's something you're familiar with.
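
A short simulation along these lines (the coefficients 3 and 2 follow the thread's example; the noise level is an arbitrary choice): regressing ##Y## on ##X_1## alone recovers the "total" effect ##3 + 2\times 0.5 = 4##, while regressing on both predictors recovers the partial coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# X2 follows X1 with slope 0.5 plus independent noise, as in the example above.
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 3.0 * x1 + 2.0 * x2 + 0.1 * rng.normal(size=n)

# Regressing Y on X1 alone gives the "total" effect of a 1-unit change in X1:
# the direct 3 plus the indirect 2 * 0.5 that comes through X2 (chain rule).
A = np.column_stack([np.ones(n), x1])
total = np.linalg.lstsq(A, y, rcond=None)[0][1]
print("total effect of X1:", total)  # close to 3 + 2*0.5 = 4

# Regressing Y on both predictors recovers the partial coefficients 3 and 2.
B = np.column_stack([np.ones(n), x1, x2])
partial = np.linalg.lstsq(B, y, rcond=None)[0][1:]
print("partial coefficients:", partial)  # close to [3, 2]
```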
 
  • #4
Office_Shredder said:
Measuring collinearity is essentially the same thing as asking what the beta is when using ##X_1## to predict ##X_2## and getting a nonzero answer. In your example, suppose ##X_2=0.5X_1+\text{independent noise}##. Then how would you expect ##Y## to change if ##X_1## changes by 1 unit?

This is basically the same thing as the chain rule with multiple inputs, if that's something you're familiar with.
The chain rule example clears things up well. For example, in the case of perfect collinearity, if ##X_2=-2X_1##, then

$$Y=3X_1 + 2X_2=3X_1-4X_1=-X_1$$

and ##\frac{\Delta Y}{\Delta X_1} = -1## instead of ##\frac{\Delta Y}{\Delta X_1} = 3##.
 
  • #5
Yep. There is one very important difference with the chain rule. When taking derivatives, things invert in the natural way: if ##\partial X_2 /\partial X_1=3##, then ##\partial X_1 /\partial X_2=1/3##. Betas don't work that way. If ##X_2=0.5 X_1+\text{noise}##, then the only thing you can say is ##X_1=\beta X_2+\text{noise}## where ##|\beta|\leq 2## (equality only if they are perfectly correlated).
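
A quick simulation of this asymmetry (the noise levels are arbitrary; np.polyfit is used only as a convenient way to get a simple-regression slope):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

x1 = rng.normal(size=n)

for noise_sd in [0.0, 0.5, 2.0]:
    x2 = 0.5 * x1 + noise_sd * rng.normal(size=n)
    # Slope of X2 regressed on X1 is always about 0.5 ...
    b_21 = np.polyfit(x1, x2, 1)[0]
    # ... but the slope of X1 regressed on X2 is 2 only when there is no noise,
    # and shrinks toward 0 as the noise grows (it is not simply 1 / 0.5).
    b_12 = np.polyfit(x2, x1, 1)[0]
    print(f"noise_sd={noise_sd}: beta(X2~X1)={b_21:.2f}, beta(X1~X2)={b_12:.2f}")
```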
 
  • #6
Don't you consider interaction effects in your model, as in ##\beta_1 X_1 + \beta_2 X_2+ \beta_3 X_1*X_2##?

Which you would ultimately test by testing the hypothesis ##\beta_3=0##?
 
  • #7
WWGD said:
Don't you consider interaction effects in your model, as in ##\beta_1 X_1 + \beta_2 X_2+ \beta_3 X_1*X_2##?

Which you would ultimately test by testing the hypothesis ##\beta_3=0##?
You certainly can, but that can be done with the variable ##X_3 = X_1*X_2## in the usual way.
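
A minimal sketch of treating the interaction as just another column (the data and the 1.5 interaction coefficient are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulated response with a true interaction term (coefficients chosen for illustration).
y = 3.0 * x1 + 2.0 * x2 + 1.5 * x1 * x2 + rng.normal(size=n)

# Treat the product as just another column X3 = X1 * X2 in the design matrix.
x3 = x1 * x2
X = np.column_stack([np.ones(n), x1, x2, x3])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print("estimated [intercept, b1, b2, b3]:", beta)  # b3 close to 1.5
```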
 
  • #8
Just a related question about fitting a multiple linear regression model to multivariate data: what steps can we take to figure out whether multiple linear regression is an adequate model for our data at all?
For simple linear regression, we can easily inspect the scatterplot between ##Y## and the single predictor ##X## to see if the cloud of data follows a linear trend. But in the case of multiple regressors ##X_1, X_2, X_3, X_4##, would we first plot individual scatterplots between ##Y## and ##X_1##, ##Y## and ##X_2##, ##Y## and ##X_3##, and ##Y## and ##X_4##?
And if the scatterplots all show a linear trend, do we then fit the data with a multiple linear regression equation (i.e. a hyperplane)? What if the data look linear in some scatterplots and not in others?

Thank you!
 
  • #9
The best thing is if you have knowledge of the subject matter and are comfortable with the form of your model. The regression algorithm in any good statistical package will indicate the statistical significance of each term in the model. You should not include terms in the model that do not both make sense in the subject matter and pass the test of statistical significance.
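
For instance, with a package such as statsmodels the fit summary reports a t statistic and p-value for every term (a sketch on simulated data; the variable setup is invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300

x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)   # correlated with x1
x3 = rng.normal(size=n)                    # pure noise, unrelated to y
y = 3.0 * x1 + 2.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
fit = sm.OLS(y, X).fit()

# The summary reports a t statistic and p-value for each term; the noise
# column x3 should come out as not statistically significant.
print(fit.summary())
print(fit.pvalues)
```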
 
  • #10
When you evaluate a regression model, you should keep one thing in mind. Suppose that two independent variables, ##X_1## and ##X_2##, both have positive correlations with ##Y##. It can easily happen that the best linear regression model ##Y = a_1 X_1 +a_2 X_2 +\epsilon## has ##a_1## a little high, with the excess corrected by a negative ##a_2##. That may be correct even though the sign of ##a_2## appears wrong. A close examination of the regression process will allow you to determine what happened.
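
A sketch of one way this sign pattern can arise (the numbers are invented): each predictor is positively correlated with ##Y## on its own, yet the multiple-regression coefficient on ##X_2## comes out negative.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000

# Strongly correlated predictors (an illustrative construction).
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)
# Simulated response: Y loads positively on X1 and slightly negatively on X2.
y = 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

# Each predictor is positively correlated with Y on its own ...
print("corr(x1, y):", np.corrcoef(x1, y)[0, 1])
print("corr(x2, y):", np.corrcoef(x2, y)[0, 1])

# ... yet the fitted multiple-regression coefficient on X2 is negative,
# because X2 is mostly "correcting" the part of Y already carried by X1.
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print("fitted [intercept, a1, a2]:", beta)
```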
 
  • #11
You also use the distribution of the coefficients to test the null hypothesis ##H_0: \beta_i=0## against ##H_A: \beta_i \neq 0##, and check the adjusted ##R^2## to see whether it increases or decreases as you add variables. There are also stepwise regression methods: forward selection and backward elimination.
https://en.wikipedia.org/wiki/Stepwise_regression
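
A small sketch of the adjusted ##R^2## check with statsmodels (the data are simulated and ##x_2## is a deliberately useless predictor):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 200

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)          # irrelevant predictor
y = 3.0 * x1 + rng.normal(size=n)

m1 = sm.OLS(y, sm.add_constant(x1)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Plain R^2 never decreases when a variable is added, but adjusted R^2
# penalizes the extra parameter and can drop for a useless one
# (it falls whenever the added term's |t| statistic is below 1).
print("R^2:          ", m1.rsquared, "->", m2.rsquared)
print("adjusted R^2: ", m1.rsquared_adj, "->", m2.rsquared_adj)
print("p-value for x2:", m2.pvalues[2])
```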
 
  • #12
FactChecker said:
In the case of correlated independent variables, ##X_1## and ##X_2##, the coefficients of the linear regression are not necessarily unique. As an extreme example, consider the case where ##X_1= X_2##. The use of a second variable is completely redundant and linear regressions with both variables are possible with a whole set of coefficient combinations.

A step-by-step process alleviates the problem and gives statistical meaning to the coefficients. Suppose ##X_1= X_2## and the linear regression model ##Y = a_1 X_1 + \epsilon## gives the minimal sum of squared errors. Then there will be no correlation between the sample values ##x_{2,i}## and the residuals ##\epsilon_i##, because the term ##a_1 X_1## has already captured all the correlation that could be obtained by adding ##X_2## to the linear model.

In a less extreme example, where ##X_1## and ##X_2## are correlated but not equal, there may be some residual error from the ##Y = a_1 X_1 + \epsilon## model that can be reduced by adding an ##X_2## term to the regression. If the reduction is statistically significant, ##X_2## can be added. The term ##a_2 X_2## can then be thought of as accounting for/explaining/predicting the residual errors left over by the ##Y = a_1 X_1+ \epsilon## model.

This process is automated in the stepwise linear regression algorithm. The results should be examined for validity and not just applied blindly. The bidirectional elimination algorithm is the most sophisticated. Suppose that variable ##X_1## gives the best single-variable model but ##X_2## and ##X_3## are added in later steps because their reduction of the residual errors was statistically significant. It can happen that the model with only ##X_2## and ##X_3## explains so much of the ##Y## values that the ##X_1## term is no longer statistically significant. The bidirectional elimination algorithm would then go back and remove ##X_1## from the final regression result.
It's important to note that stepwise regression methods are, in general, not good choices, and even though I teach courses that discuss them I strongly urge students not to use them in practice. A few reasons:
1. The R^2 values for models that come from them tend to be higher than they should be.
2. The F statistics often reported don't really have F distributions.
3. The standard errors of the parameter estimates are too small, so the confidence intervals around the parameter estimates are not accurate.
4. Because of the multiple tests in the process, the p-values are often too low and are difficult to correct.
5. The slope estimates are biased (this is probably not the strongest argument against them, since the notion of a slope estimate being unbiased simply means it is unbiased for the model you specify, and you have no idea whether that is the correct model).
6. These methods increase the problems caused when there is collinearity in the predictors.
For a good discussion of these issues, see Frank Harrell's Regression Modeling Strategies (2001).
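
As a rough illustration of point 1, here is a sketch with a hand-rolled forward-selection loop (not a library routine; the 0.15 entry threshold is an arbitrary choice): even when the response is pure noise, selection tends to produce a model whose R^2 and p-values look better than they should.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, p = 100, 20

# Pure noise: none of the 20 candidate predictors is truly related to y.
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# A crude forward-selection loop (illustrative only):
# repeatedly add the candidate with the smallest p-value while it is < 0.15.
selected = []
while True:
    best_p, best_j = 1.0, None
    for j in range(p):
        if j in selected:
            continue
        cols = sm.add_constant(X[:, selected + [j]])
        pval = sm.OLS(y, cols).fit().pvalues[-1]
        if pval < best_p:
            best_p, best_j = pval, j
    if best_j is None or best_p > 0.15:
        break
    selected.append(best_j)

final = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
# Even though y is unrelated to every predictor, the selected model tends to
# report a nontrivial R^2 and small p-values for the chosen terms.
print("selected columns:", selected)
print("R^2 of selected model:", final.rsquared)
print("p-values:", final.pvalues[1:])
```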
 
  • #13
statdad said:
It's important to note that stepwise regression methods are, in general, not good choices, and even though I teach courses that discuss them I strongly urge students not to use them in practice. A few reasons:
1. The R^2 values for models that come from them tend to be higher than they should be.
2. The F statistics often reported don't really have F distributions.
3. The standard errors of the parameter estimates are too small, so the confidence intervals around the parameter estimates are not accurate.
4. Because of the multiple tests in the process, the p-values are often too low and are difficult to correct.
5. The slope estimates are biased (this is probably not the strongest argument against them, since the notion of a slope estimate being unbiased simply means it is unbiased for the model you specify, and you have no idea whether that is the correct model).
6. These methods increase the problems caused when there is collinearity in the predictors.
For a good discussion of these issues, see Frank Harrell's Regression Modeling Strategies (2001).
IMO, if the assumptions are met, the mathematics is correct and well established.
 
  • #14
FactChecker said:
IMO, if the assumptions are met, the mathematics is correct and well established.
I'm not sure what you mean here. The points I made (again, look at Harrell for a deeper discussion) are also mathematical points: they apply even if the assumptions are met.
Stepwise methods, by their nature, negate the benefits of the usual assumptions about least-squares regression.
 

FAQ: Collinearity between predictors: what happens under the hood

What is collinearity between predictors?

Collinearity between predictors, also known as multicollinearity, occurs when two or more predictor variables in a multiple regression model are highly correlated, meaning they contain similar information about the variance in the dependent variable. This can make it difficult to determine the individual effect of each predictor on the dependent variable.

Why is collinearity problematic in regression analysis?

Collinearity can be problematic because it makes the estimates of the regression coefficients unstable and their standard errors inflated. This means that the coefficients may not be statistically significant, even if they are theoretically important, and it can lead to incorrect conclusions about the relationships between variables.

How can you detect collinearity between predictors?

Collinearity can be detected using several methods, including examining correlation matrices, Variance Inflation Factor (VIF), and tolerance values. A high correlation coefficient (e.g., above 0.8 or 0.9) between two predictors indicates collinearity. A VIF value greater than 10 or a tolerance value less than 0.1 also suggests significant collinearity.
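
A sketch of both checks on simulated data (the near-collinear construction of x2 is invented for illustration), using the variance_inflation_factor helper from statsmodels:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)
n = 500

x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.2 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                      # independent predictor

# Correlation matrix of the predictors: the x1-x2 entry is close to 1.
X = np.column_stack([x1, x2, x3])
print(np.corrcoef(X, rowvar=False))

# VIF for each predictor (computed against the other predictors + intercept);
# values well above 10 flag problematic collinearity.
exog = sm.add_constant(X)
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, "VIF:", variance_inflation_factor(exog, i))
```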

What strategies can be used to address collinearity?

Several strategies can be employed to address collinearity, including removing one of the highly correlated predictors, combining the correlated predictors into a single predictor, using principal component analysis (PCA) to reduce dimensionality, or applying regularization techniques such as ridge regression or LASSO that can handle collinearity by adding a penalty to the regression coefficients.
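
A minimal sketch comparing ordinary least squares with ridge and LASSO on nearly collinear simulated predictors, using scikit-learn (the penalty strengths are arbitrary and would normally be chosen by cross-validation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(9)
n = 200

x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)    # nearly collinear with x1
y = 3.0 * x1 + 2.0 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

# Ordinary least squares: coefficients can be erratic under near-collinearity.
print("OLS:  ", LinearRegression().fit(X, y).coef_)

# Ridge shrinks the coefficients toward each other; LASSO may zero one out.
# (alpha values are arbitrary illustrations, not tuned.)
print("Ridge:", Ridge(alpha=10.0).fit(X, y).coef_)
print("Lasso:", Lasso(alpha=0.5).fit(X, y).coef_)
```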

What happens to the regression coefficients when collinearity is present?

When collinearity is present, the regression coefficients can become highly sensitive to changes in the model. Small changes in the data can lead to large variations in the coefficients, making them unreliable. Additionally, the coefficients may have large standard errors, leading to wide confidence intervals and making it difficult to determine the true effect of each predictor.
