Linear Model with independent categorical variable

In summary: a model that includes a gender factor but no interaction term gives the two genders different intercepts and the same slope; adding an ##age \times gender## interaction term also allows the slopes to differ.
  • #1
fog37
Hello,

I have been pondering on the following: we have data for blood pressure BP (response variable) and data about age and gender (categorical variable with two levels). We can build two linear regression models: $$BP=b_0+b_1 age+b_2 gender$$ $$BP=b_0+b_1 age$$

The first model does not take gender into account and fits a single best-fit line, disregarding any effect that gender may have.
The 2nd model includes ##gender##, and two scenarios are possible. Assuming no interaction term, the categorical variable ##gender## may shift the best-fit regression line up or down, depending on whether its value is ##1## or ##0## and on the sign of its coefficient. If the shift is very small, then ##gender## does not have an effect. But if the vertical shift of the best-fit line is meaningful, then ##gender## does have an effect. That means the ##BP## values for males and females form two different clusters that would require two different best-fit lines (same slope, different intercepts).
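Written out with ##gender## coded ##0##/##1##, the model with the gender term describes exactly this "same slope, different intercepts" picture of two parallel lines:
$$\text{gender}=0:\quad BP=b_0+b_1\,\text{age}$$ $$\text{gender}=1:\quad BP=(b_0+b_2)+b_1\,\text{age}$$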
The 2nd model, which includes ##gender##, takes care of that difference. Would the 2nd model be exactly equivalent to creating two separate linear regression models and best-fit lines, one for the male group and one for the female group, once we recognize that males and females form different clusters of points with respect to blood pressure ##BP##?

Thank you!
 
  • #2
No, those would not be exactly equivalent. The group-wise fitting would allow different intercepts and different slopes for the two groups. The no-interaction model (your model 2) allows different intercepts but not different slopes for the two groups. Also, the standard error for the slope in the no-interaction model will be smaller (if there is indeed no significant interaction) because it is estimated with twice the data of either of the group-wise fits.
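To make the distinction concrete, here is a minimal Python/NumPy sketch on simulated data (the sample size, coefficients, and 0/1 gender coding are all made up for illustration). The additive model returns one common slope plus a gender shift; each group-wise fit returns its own intercept and slope.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
age = rng.uniform(20, 70, n)
gender = rng.integers(0, 2, n)                            # 0/1 dummy coding
bp = 90 + 0.5 * age + 8 * gender + rng.normal(0, 5, n)    # simulated blood pressure

# No-interaction model: BP = b0 + b1*age + b2*gender (one common slope).
X_add = np.column_stack([np.ones(n), age, gender])
b_add, *_ = np.linalg.lstsq(X_add, bp, rcond=None)
print("additive fit [b0, b1, b2]:", np.round(b_add, 3))

# Group-wise fits: each gender gets its own intercept AND its own slope.
for g in (0, 1):
    m = gender == g
    Xg = np.column_stack([np.ones(m.sum()), age[m]])
    bg, *_ = np.linalg.lstsq(Xg, bp[m], rcond=None)
    print(f"gender {g} fit [intercept, slope]:", np.round(bg, 3))
```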
 
  • #3
Dale said:
No, those would not be exactly equivalent. The group-wise fitting would allow different intercepts and different slopes for the two groups. The no-interaction model (your model 2) allows different intercepts but not different slopes for the two groups. Also, the standard error for the slope in the no-interaction model will be smaller (if there is indeed no significant interaction) because it is estimated with twice the data of either of the group-wise fits.
Thanks. I see. So the no-interaction model 2 would be a better model than creating two separate models, one for each group. Thanks for confirming.
 
  • #4
fog37 said:
Thanks. I see. So the no-interaction model 2 would be a better model than creating two separate models, one for each group. Thanks for confirming.
Yes. And if you think that there may be an interaction then I would use an interaction model instead of group-wise fits. It is a lot easier to control for multiple comparisons that way.
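As a sketch of what that looks like (again on simulated data with hypothetical coefficients), the interaction model just adds an ##age \times gender## column to the design matrix. With 0/1 coding, the gender-1 line has intercept ##b_0+b_2## and slope ##b_1+b_3##, so a single test of ##b_3## asks whether the slopes differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
age = rng.uniform(20, 70, n)
gender = rng.integers(0, 2, n)
bp = 90 + 0.4 * age + 5 * gender + 0.2 * age * gender + rng.normal(0, 5, n)

# Interaction model: BP = b0 + b1*age + b2*gender + b3*(age*gender)
X = np.column_stack([np.ones(n), age, gender, age * gender])
b0, b1, b2, b3 = np.linalg.lstsq(X, bp, rcond=None)[0]
print(f"gender 0 line: intercept {b0:.2f}, slope {b1:.3f}")
print(f"gender 1 line: intercept {b0 + b2:.2f}, slope {b1 + b3:.3f}")
```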
 
  • #5
Dale said:
Yes. And if you think that there may be an interaction then I would use an interaction model instead of group-wise fits. It is a lot easier to control for multiple comparisons that way.
We can suspect an interaction between ##age## and ##gender##, but the evidence would be seeing that the fitted lines have different slopes for the two values of the ##gender## variable, which requires either group-wise fits or the interaction model itself, since the no-interaction model forces a common slope. Once we see that, we should include the interaction term ##(age)\times(gender)##.
 
  • #6
fog37 said:
We can build two linear regression models: $$BP=b_0+b_1 age+b_2 gender$$ $$BP=b_0+b_1 age$$

The first model does not take gender into account and fits a single best-fit line, disregarding any effect that gender may have.
The 2nd model includes ##gender##, and two scenarios are possible. Assuming no interaction term, the categorical variable ##gender## may shift the best-fit regression line up or down, depending on whether its value is ##1## or ##0## and on the sign of its coefficient. If the shift is very small, then ##gender## does not have an effect.
You should have your model equations and your description in the same order so there is no confusion about which model is "first" and which is "second". It looks like your model equations are in reverse order. Otherwise, I disagree with practically everything you said about those two models.
A third option is to separate the genders into two distinct data sets and do separate regressions on each one. It is not clear to me if that is what you had in mind for the model that does not include a "gender" factor. I recommend this approach if you have enough data for each gender to get adequate parameter estimates for each.
 

FAQ: Linear Model with independent categorical variable

What is a linear model with independent categorical variables?

A linear model with independent categorical variables is a statistical model that predicts a continuous outcome based on one or more independent variables, where at least one of those independent variables is categorical. This type of model allows researchers to estimate the effect of different categories on the dependent variable while controlling for other variables.

How do you include categorical variables in a linear model?

Categorical variables are typically included in a linear model using dummy coding or one-hot encoding. This involves creating binary (0/1) variables for each category of the categorical variable, except for one category which is used as a reference group. The coefficients for the dummy variables indicate the difference in the outcome variable compared to the reference group.
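As a minimal sketch of dummy coding (using pandas purely as an example tool; the column names and values are made up), drop_first=True keeps one category as the reference group:

```python
import pandas as pd

df = pd.DataFrame({
    "BP":     [120, 131, 118, 140],
    "age":    [34, 52, 29, 61],
    "gender": ["F", "M", "F", "M"],
})

# "gender_M" becomes a 0/1 dummy; "F" is the reference category that was dropped.
coded = pd.get_dummies(df, columns=["gender"], drop_first=True)
print(coded)
```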

What assumptions must be met when using a linear model with categorical variables?

When using a linear model with categorical variables, several assumptions must be met: linearity (the relationship between the independent and dependent variables is linear), independence (observations are independent), homoscedasticity (constant variance of the errors), normality (the residuals should be normally distributed), and no multicollinearity (independent variables should not be too highly correlated).

How can I interpret the coefficients of a linear model with categorical variables?

The coefficients of a linear model with categorical variables represent the expected change in the dependent variable for a one-unit change in the independent variable, holding all other variables constant. For dummy variables, the coefficient indicates the difference in the outcome between the category represented by the dummy variable and the reference category.
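For example, in the additive blood-pressure model discussed above with ##gender## coded ##0##/##1##, the dummy coefficient is exactly the expected difference between the two groups at any fixed age:
$$E[BP \mid \text{age}, \text{gender}=1]-E[BP \mid \text{age}, \text{gender}=0]=(b_0+b_1\,\text{age}+b_2)-(b_0+b_1\,\text{age})=b_2$$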

What are the limitations of using linear models with categorical variables?

Some limitations of using linear models with categorical variables include the potential for oversimplification if important interactions between variables are not included, the assumption of linearity which may not hold true, and sensitivity to outliers. Additionally, if there are too many categories, the model may become overly complex and difficult to interpret.
