Coefficient sign flip in linear regression with correlated predictors

In summary, correlated predictors can change each other's regression coefficients, and even predictors that are not included in the model, but are highly correlated with the predictors in our model, can have an impact!
  • #1
fog37
TL;DR Summary
Understand how coefficients can change sign in multiple linear regression when correlated predictors are included
Hello Forum,
I have read about an interesting example of multiple linear regression (https://online.stat.psu.edu/stat501/lesson/12/12.3). There are two highly correlated predictors, ##X_1## as territory population and ##X_2## as per capita income, with Sales as the ##Y## variable. My understanding is that if a model includes correlated predictors, the regression coefficient for one of the predictors can flip in sign compared to when the model only has that one predictor. Further, the lecture notes state that "...even predictors that are not included in the model, but are highly correlated with the predictors in our model, can have an impact!..."

One would expect that as the territory population ##X_1## increases, so would the territory sales ##Y## (positive regression coefficient). However, the regression analysis provides a negative estimated coefficient for territory population: the population of the territory increases and the territory sales decrease (because the larger the territory, the larger the competitor's market penetration keeping sales down, even though the model has no data on the market competition). How does that happen? How can the regression results be affected by a variable (competitor market penetration) that is not included in the model?

How can including a certain predictor in the linear regression model cause the regression coefficient of another predictor to flip in sign and change in magnitude when the predictors are correlated (not possible when the predictors are perfectly uncorrelated)?

I am confused about how correlated predictors affect each other's regression coefficients, making the individual effect of each predictor ambiguous...

Thank you!
 
  • #2
The basic idea is that a prediction of Y based on positively correlated variables A and B, ##Y \sim A, B##, might overdo the contribution of A (coefficient of A too large) and compensate for that by changing the sign of the coefficient of B.
 
  • #3
FactChecker said:
The basic idea is that a prediction of Y based on positively correlated variables A and B, ##Y \sim A, B##, might overdo the contribution of A (coefficient of A too large) and compensate for that by changing the sign of the coefficient of B.
Thank you FactChecker. That sheds some light on it. I get that correlated predictors jointly contribute to explaining a portion of the variance of ##Y## and we are not sure how much each contributes. But how does the statistical software make these internal decisions, i.e. giving a large coefficient to A and compensating for that by giving a negative coefficient to B? Do you have any trivial example that can clarify that?
 
  • #4
fog37 said:
Thank you FactChecker. That sheds some light on it. I get that correlated predictors jointly contribute to explaining a portion of the variance of ##Y## and we are not sure how much each contributes. But how does the statistical software make these internal decisions, i.e. giving a large coefficient to A and compensating for that by giving a negative coefficient to B?
Do not count on the statistical algorithm to make a logical cause/effect decision. It only knows corresponding tendencies, not any logic behind those tendencies, like what causes what else. That must be decided by the user from subject matter knowledge.
fog37 said:
Do you have any trivial example that can clarify that?
Consider this example.
Suppose ##Y## is the water flowing down a rain drain due to rain at locations A and B (and maybe other locations).
Suppose A flows into B, which flows into Y. The value of B includes the A water. A, B, and Y are all positively correlated. A regression of Y based on the single variable A would include the amount that Y gets from B, as indicated by A alone. Now add B to the regression model. The regression coefficient of B might take care of some or all of the water from A, so the regression coefficient of A can be much smaller. In fact, the prediction based on the B coefficient might overdo it, and the coefficient of A might need to change its sign to correct for that.
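A minimal NumPy sketch of this drain picture (the rainfall numbers and the little `ols` helper are made up purely for illustration) shows the A coefficient collapsing once B enters the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Made-up rainfall: rain that falls at A, plus extra rain that falls between A and B.
rain_a = rng.normal(10.0, 2.0, n)
extra_b = rng.normal(5.0, 2.0, n)

# B carries the A water plus its own; Y is what reaches the drain (plus measurement noise).
flow_b = rain_a + extra_b
flow_y = flow_b + rng.normal(0.0, 1.0, n)

def ols(y, *columns):
    """Least-squares fit with an intercept; returns [intercept, coef_1, coef_2, ...]."""
    X = np.column_stack([np.ones_like(y), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("Y ~ A    :", ols(flow_y, rain_a))          # A alone gets a clearly positive coefficient
print("Y ~ A, B :", ols(flow_y, rain_a, flow_b))  # B absorbs the A water; A's coefficient drops
                                                  # toward zero and can even come out slightly negative
```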
 
  • #5
fog37 said:
But how does the statistical software make these internal decisions...?
I wouldn't call it "making decisions": in regression analysis we simply find values for the coefficients that minimize the residuals.

fog37 said:
Do you have any trivial example that can clarify that?
Consider how much tax people pay and how much they donate to charity. People who donate large amounts to charity usually earn a lot of money, and this usually means they pay a lot of tax. If we have a model with the amount of money a person donates to charity as the only independent variable and the amount of tax they pay as the dependent variable, we would expect a positive correlation. Now if we introduce the amount a person earns as an independent variable, we would again expect a positive correlation between earnings and tax, but in most states donations to charity reduce the amount of tax you pay, and so we might expect to see the correlation between donations and tax become negative.
 
  • #6
pbuk said:
I wouldn't call it "making decisions": in regression analysis we simply find values for the coefficients that minimize the residuals.

Consider how much tax people pay and how much they donate to charity. People who donate large amounts to charity usually earn a lot of money, and this usually means they pay a lot of tax. If we have a model with the amount of money a person donates to charity as the only independent variable and the amount of tax they pay as the dependent variable, we would expect a positive correlation. Now if we introduce the amount a person earns as an independent variable, we would again expect a positive correlation between earnings and tax, but in most states donations to charity reduce the amount of tax you pay, and so we might expect to see the correlation between donations and tax become negative.
I like this example. So there are two independent variables, income and donations, which individually correlate positively with taxes. But for a fixed income, the correlation between donations and taxes is reversed. Therefore, if income is included in the regression the remaining variance of taxes has a negative correlation with donations and the regression coefficient of donations will (likely) be negative.
That seems very intuitive and a good way to look for more examples.
 
  • #7
pbuk said:
I wouldn't call it "making decisions": in regression analysis we simply find values for the coefficients that minimize the residuals.

Consider how much tax people pay and how much they donate to charity. People who donate large amounts to charity usually earn a lot of money, and this usually means they pay a lot of tax. If we have a model with the amount of money a person donates to charity as the only independent variable and the amount of tax they pay as the dependent variable, we would expect a positive correlation. Now if we introduce the amount a person earns as an independent variable, we would again expect a positive correlation between earnings and tax, but in most states donations to charity reduce the amount of tax you pay, and so we might expect to see the correlation between donations and tax become negative.
Thank you pbuk.

So, just to make sure I get it, the raw data is the raw data and the regression coefficients are based on that. So ##Y## = tax paid, ##X_1## = donations, ##X_2## = income.

Model01: if we considered the model ##Y = b_1 X_1 + b_0##, we would expect ##b_1## to be positive.
Model02: if we considered the model ##Y = b_2 X_2 + b_0##, we would expect ##b_2## to be positive.
Model03: if we considered the model ##Y = b_1 X_1 + b_2 X_2 + b_0##, it could be possible for ##b_1## (the donations coefficient) to turn negative depending on our data...

I am trying to make some fake data that reflects the particular situation you describe, i.e.
but in most states, donations to charity reduce the amount of tax you pay, and so we might expect to see the correlation between donations and tax to become negative.
What would the income column need to look like to match that scenario? taxPaid and Donations are positively correlated. What values could I use for Income? I am not sure...


I know that the partial regression coefficients are the regression coefficients obtained from the graph of the residualized variables, i.e. we first need to regress ##Y## on ##X_1## and take the residuals, then regress ##X_2## on ##X_1## and take the residuals, and finally find the slope of the best-fit line between those two residualized variables...
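A quick NumPy check of that residualization recipe (on made-up correlated data; the `fit` helper is just an illustrative least-squares wrapper) confirms that the residual-on-residual slope equals the multiple-regression coefficient of ##X_2##:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Made-up data: x2 is highly correlated with x1.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)
y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def fit(y, *cols):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b0, b1, b2 = fit(y, x1, x2)                  # full model Y = b0 + b1*X1 + b2*X2

# Residualize Y on X1, and X2 on X1, then regress residual on residual.
resid_y = y - np.column_stack([np.ones(n), x1]) @ fit(y, x1)
resid_x2 = x2 - np.column_stack([np.ones(n), x1]) @ fit(x2, x1)
partial_b2 = fit(resid_y, resid_x2)[1]

print(b2, partial_b2)                        # the two numbers agree up to floating-point error
```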

The model ##Y = b_1 X_1 + b_2 X_2 + b_0## is a slanted plane. We interpret each coefficient as the change in ##Y## for a unit change in one predictor while the other predictor sits at some arbitrary value and does not change from it. It is like moving on top of the slanted hyperplane in a direction that keeps one predictor's value the same while the other changes. The fact that we can pick any value for the predictor we keep constant is possible because the surface is a plane: the slope in the ##X_1## direction, i.e. ##\Delta Y / \Delta X_1##, is the same if measured at any arbitrary value of ##X_2##.
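To make that last statement concrete (this is just the algebra of the fitted plane, nothing beyond it): differentiating ##Y = b_0 + b_1 X_1 + b_2 X_2## with respect to ##X_1## gives $$\frac{\partial Y}{\partial X_1} = b_1,$$ which contains no ##X_2## at all, so the slope in the ##X_1## direction is the same at every value of ##X_2##.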

Thank you! I am getting closer...
 
  • #8
I feel like I have opened Pandora's box on the topic of interpreting regression coefficients.

For example:
  • The coefficient for the same predictor ##X## changes depending on which other predictors are included in the regression model (unless the predictors are uncorrelated or only very weakly correlated).
  • The idea that a coefficient represents the change ##\Delta Y## for a unit change of a predictor while keeping the other predictors fixed at some arbitrary value is not completely correct. I thought we could interpret a regression coefficient as the partial derivative of ##Y## w.r.t. ##X##: we would essentially "walk" on the hyperplane in a direction that changes ##X## but keeps the other predictors constant. That seems feasible. However, logically, if two predictors are correlated and one of them changes, then the other HAS to change too: we cannot keep one fixed while changing the other, even if the partial-derivative idea seems to allow that...
 
  • #9
A better example is stock prices: if you simply regress the returns of some long-only, unlevered, diversified stock portfolio against, say, the returns of each industry sector in the S&P, you might think you could recover the portfolio's average sector weighting, with a set of betas that sum to 1.0. In reality you will get an overfit with positive and negative weights, and while the betas will tend to sum to 1.0, the individual weights won't make sense. For that reason the market return is typically stripped out of the factor returns and used as a separate variable; then you get a regression that makes sense.
 
  • #10
fog37 said:
How would the income column need to look like to match that scenario? taxPaid and Donations are positive correlated. What values could I use for Income? I am not sure....
A simple model would be tax paid = 25% x (Income - Donations), just generate some pseudo-random figures for income and donations in a sensible range and calculate the tax paid (adding noise if you want).
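A sketch of that suggestion in NumPy (the income range, the 5% donation rate, and the noise level are all made-up assumptions) reproduces the sign flip:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Made-up figures: higher earners tend to donate more, so the two predictors are correlated.
income = rng.uniform(30_000, 200_000, n)
donations = 0.05 * income * rng.uniform(0.5, 1.5, n)

# pbuk's simple rule: tax = 25% of (income - donations), plus a little noise.
tax = 0.25 * (income - donations) + rng.normal(0.0, 1_000.0, n)

def fit(y, *cols):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("tax ~ donations         :", fit(tax, donations))          # donations coefficient is positive
print("tax ~ donations, income :", fit(tax, donations, income))  # donations coefficient flips to about -0.25
```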
 
  • #11
I was thinking about the classic example of correlation without causation: number of murders and ice-cream sales are two variables that have a positive correlation. $$ murders = \beta_1 icecreamSales +\beta_0$$ The two variables may be correlated but there is no cause-effect relationship between them.
The variable temperature ##T##, which is not included in the simple regression model, plays the role of a confounding variable. It is positively correlated with both ice-cream sales and the number of murders. By including the variable temperature in the model we are controlling for it: $$ murders = \beta_1 icecreamSales +\beta_2 T+ \beta_0$$ By doing that, we may be controlling for temperature ##T##, but we are also injecting collinearity into the model since ##T## and ##icecreamSales## are correlated. This will cause the coefficients ##\beta_1## and ##\beta_2## not to reflect the actual correlation between ##murders## and ##icecreamSales##... Isn't that a problem? We wanted to include the variable ##T## to show that ##murders## and ##icecreamSales## are really not correlated...
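A small simulated version of this (everything here, temperature range included, is made up) shows what including ##T## actually does to the coefficients:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 365

# Temperature drives both variables; there is no direct link between them.
temp = rng.normal(20.0, 8.0, n)
ice_cream = 50.0 + 3.0 * temp + rng.normal(0.0, 10.0, n)
murders = 2.0 + 0.1 * temp + rng.normal(0.0, 1.0, n)

def fit(y, *cols):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("murders ~ ice_cream    :", fit(murders, ice_cream))        # spurious positive slope
print("murders ~ ice_cream, T :", fit(murders, ice_cream, temp))  # ice-cream slope shrinks toward zero
                                                                  # once temperature is controlled for
```

In this toy setup the raw correlation between murders and ice-cream sales is unchanged; what shrinks is the partial (within-model) coefficient on ice-cream sales.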
 
  • #12
fog37 said:
Isn't that a problem?
Yes it's one of the key problems that a statistician must avoid by ensuring independent variables are not correlated.
fog37 said:
We wanted to include the variable ##T## to show that ##murders## and ##icecreamSale## are really not correlated...
But you can't because they ARE correlated!
 
  • #13
pbuk said:
Yes it's one of the key problems that a statistician must avoid by ensuring independent variables are not correlated.

But you can't because they ARE correlated!
I guess I understand what a confounding variable is: a variable that is not included in the model but, when considered and controlled for, explains (but maybe does not remove) the correlation between the two other variables (murders and ice-cream sales).

There is indeed a true positive correlation between murders and ice-cream sales but certainly no cause-effect. Including the confounding variable ##T## in the model does not remove the positive correlation between murders and ice-cream sales. The pattern of correlation between them probably exists even if the confounder ##T## is controlled for (either by taking murders and ice-cream sales data at the same temperature ##T##, or by including ##T## as a predictor in the model). Confounders confuse in the sense that they can lead us to believe there is cause-effect when there is only correlation. Confounders, when included, don't necessarily remove/change the ongoing correlation, correct?

But contemplating ##T##, even if the positive correlation does not disappear, logically reveals that there is certainly NO cause-effect going on.
 
  • #14
fog37 said:
Confounders confuse in the sense that they can lead us to believe there is cause-effect when there is only correlation.
They should not, because we should separate the concepts. Statistics can be a useful tool to help confirm, or refute, hypotheses, nothing more.

fog37 said:
Confounders, when included, don't necessarily remove/change the ongoing correlation, correct?
I would put it another way: one of the assumptions we make in regression analysis is that the independent variables are no more than weakly correlated. If this assumption is wrong then we cannot fix the problem by adding more independent variables, we have to remove the correlation that is causing the problem. If, as is often the case in practice, this is not possible we must be very careful when interpreting results.
 

FAQ: Coefficient sign flip in linear regression with correlated predictors

What is a coefficient sign flip in linear regression?

A coefficient sign flip in linear regression occurs when the sign of a regression coefficient changes unexpectedly, often due to the presence of multicollinearity among the predictors. This can lead to misleading interpretations of the relationship between predictors and the response variable.

How does multicollinearity cause coefficient sign flips?

Multicollinearity, the situation where predictors are highly correlated, can cause instability in the estimation of regression coefficients. When predictors are correlated, small changes in the data can lead to large changes in the estimated coefficients, including changes in their signs. This happens because the model has difficulty distinguishing the individual effect of each correlated predictor on the response variable.
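A toy illustration of this instability (made-up data, two nearly identical predictors):

```python
import numpy as np

rng = np.random.default_rng(3)

def fit(y, *cols):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

n = 50
for trial in range(3):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)        # almost an exact copy of x1
    y = x1 + rng.normal(0.0, 0.5, size=n)      # y depends only on the shared signal
    print(fit(y, x1, x2))                      # b1 and b2 bounce around from sample to sample
                                               # and can even take opposite signs
```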

What are the consequences of a coefficient sign flip in a regression model?

Coefficient sign flips can lead to incorrect conclusions about the direction and strength of relationships between predictors and the response variable. This can affect decision-making processes, as it may suggest that a predictor has a positive effect when it actually has a negative effect, or vice versa. Additionally, it can reduce the overall trustworthiness of the model.

How can I detect multicollinearity in my regression model?

Multicollinearity can be detected using several methods. The Variance Inflation Factor (VIF) is a common diagnostic tool; a VIF value greater than 10 is often considered indicative of high multicollinearity. Additionally, examining the correlation matrix of predictors can help identify pairs of highly correlated variables. Eigenvalues and condition indices from the correlation matrix can also provide insights into multicollinearity.
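As an illustration, the VIF for predictor j can be computed by hand as 1 / (1 - Rj²), where Rj² comes from regressing predictor j on the remaining predictors; a minimal sketch with made-up data:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (predictors only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, target, rcond=None)[0]
        resid = target - others @ coef
        r_squared = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return factors

# Example: x1 and x2 are nearly collinear, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))   # large VIFs for x1 and x2, roughly 1 for x3
```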

What strategies can be used to address coefficient sign flips caused by multicollinearity?

Several strategies can be employed to address coefficient sign flips caused by multicollinearity. These include removing or combining highly correlated predictors, using regularization techniques like Ridge Regression or Lasso, and applying Principal Component Analysis (PCA) to transform the predictors into a set of uncorrelated components. Additionally, domain knowledge can be used to guide the selection and interpretation of predictors in the model.
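As one illustration of the regularization route, ridge regression has a closed-form solution (X'X + λI)⁻¹X'y; the sketch below uses made-up, nearly collinear data and centers the variables so the intercept is left unpenalized:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200

# Made-up, nearly collinear predictors; the true coefficients are 1 and 1.
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
y = x1 + x2 + rng.normal(0.0, 0.5, size=n)

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)          # center so the intercept drops out of the penalty
yc = y - y.mean()

def ridge(Xc, yc, lam):
    """Closed-form ridge solution (X'X + lam*I)^(-1) X'y on centered data."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

print("OLS (lam=0) :", ridge(Xc, yc, 0.0))    # unstable: the two coefficients can land far from 1 and 1
print("ridge       :", ridge(Xc, yc, 10.0))   # shrunk toward each other, close to 1 and 1
```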
