Can We Use Regression Line x on y if y is the Dependent Variable?

In summary: In this case, y is the dependent variable and x is the independent variable, so the appropriate regression line is y on x, even though you are estimating x from y. This is because OLS regression assumes that all of the error is in the dependent variable, which is y in this case. So, to minimize the error, you should use the regression line that has y as the dependent variable.
  • #1
songoku
TL;DR Summary
Let's say I have 10 bivariate data points (x and y) where x is the independent variable and y is the dependent variable.

I want to estimate the value of x from a certain given value of y. Which regression line should I use, regression line y on x or regression line x on y?
I have a note that states the regression line x on y is used when we want to calculate x for a given y, but in this case y is the dependent variable. I am pretty sure I can use either line if the product moment correlation coefficient (r) is close to 1, but for a case where, say, r = 0.6, can we use the regression line x on y even though y is the dependent variable? Or should we use the regression line y on x to calculate the value of x?

Thanks
 
  • #2
The important thing is which is measured/known most precisely. That should be the independent variable. The assumption of OLS regression is that all of the error is in the dependent variable.
 
  • #3
Dale said:
The important thing is which is measured/known most precisely. That should be the independent variable. The assumption of OLS regression is that all of the error is in the dependent variable.
I see, so I should use regression line y on x even if I want to estimate x.

But I am sorry, I have another question. Can I argue that for the case y is given by the question (y is still the dependent variable), y is known more precisely so the appropriate regression line is x on y?

Thanks
 
  • #4
songoku said:
I see, so I should use regression line y on x even if I want to estimate x.
Yes, where y is the thing which has a large error and x is measured almost exactly.

songoku said:
But I am sorry, I have another question. Can I argue that for the case y is given by the question (y is still the dependent variable), y is known more precisely so the appropriate regression line is x on y?
This isn’t a matter of argument. How well can you measure the values of y? How well can you measure the values of x? The answer to those questions determines the method you should use.
 
  • #5
If you want to estimate x based on the values of y, you should do a regression of x on y (x dependent and y independent). Linear regression would minimize the sum-squared errors of the sampled ##x_i## versus the estimated ##\hat{x_i}(y_i)##. Doing the regression the other way would minimize the wrong sum-squared errors, and all the related statistics would be wrong.
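To see numerically that the two regressions give different lines (a minimal sketch using NumPy with made-up illustrative data, not data from this thread): the x-on-y line is not the algebraic inverse of the y-on-x line, and in fact the two OLS slopes always satisfy ##b_{yx}\,b_{xy} = r^2##.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 10)
y = 2.0 * x + 1.0 + rng.normal(0.0, 3.0, size=x.size)  # noisy, so |r| < 1

# Regression of y on x: minimizes sum((y_i - y_hat_i)**2)
b_yx, a_yx = np.polyfit(x, y, 1)

# Regression of x on y: minimizes sum((x_i - x_hat_i)**2)
b_xy, a_xy = np.polyfit(y, x, 1)

r = np.corrcoef(x, y)[0, 1]
print(b_xy, 1.0 / b_yx)    # not equal unless |r| = 1
print(b_yx * b_xy, r**2)   # product of the two slopes equals r^2 exactly
```

The identity ##b_{yx}\,b_{xy} = r^2## follows directly from the closed-form OLS slopes, and it shows the two lines coincide only when ##|r| = 1##.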
 
  • #6
Dale said:
Yes, where y is the thing which has a large error and x is measured almost exactly.

This isn’t a matter of argument. How well can you measure the values of y? How well can you measure the values of x? The answer to those questions determines the method you should use.
I understand

FactChecker said:
If you want to estimate x based on the values of y, you should do a regression of x on y (x dependent and y independent). Linear regression would minimize the sum-squared errors of the sampled ##x_i## versus the estimated ##\hat{x_i}(y_i)##. Doing the regression the other way would minimize the wrong sum-squared errors, and all the related statistics would be wrong.
How about if, in the data I have, x is the independent variable and y is the dependent variable, and I need to estimate x for a given y?

Thanks
 
  • #7
songoku said:
How about if, in the data I have, x is the independent variable and y is the dependent variable, and I need to estimate x for a given y?
That is what I am talking about. They are both linear regression problems. However, the coefficients you get from the two linear regressions are not the same or even easily related. The errors in the sum-squared-error that are minimized in the linear regressions are projections onto different axes. (That is, minimizing ##\sum (y_i-\hat{y_i})^2## is not the same as minimizing ##\sum (x_i-\hat{x_i})^2##.)
So you should do a linear regression with X as a linear function of Y.
The issue is not how well the X and Y values can be measured, it is how well the values fit the selected model. That is the sum-squared-error that is being minimized.
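The practical consequence can be sketched concretely (NumPy, illustrative made-up data): estimating x for a given y by algebraically inverting the y-on-x line gives a different answer than reading it off the x-on-y line, because the two fits minimize errors along different axes.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 10)
y = 2.0 * x + 1.0 + rng.normal(0.0, 3.0, size=x.size)

b_yx, a_yx = np.polyfit(x, y, 1)  # y on x
b_xy, a_xy = np.polyfit(y, x, 1)  # x on y

y_given = 12.0                            # hypothetical observed value of y
x_from_inverse = (y_given - a_yx) / b_yx  # invert the y-on-x line
x_from_direct = b_xy * y_given + a_xy     # predict directly from the x-on-y line
print(x_from_inverse, x_from_direct)      # two different estimates of x
```

The two estimates agree only when the data are perfectly correlated; for moderate ##r## (like the r = 0.6 in the original question) they can differ substantially.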
 
  • #8
FactChecker said:
That is what I am talking about. They are both linear regression problems. However, the coefficients you get from the two linear regressions are not the same or even easily related. The errors in the sum-squared-error that are minimized in the linear regressions are projections onto different axes. (That is, minimizing ##\sum (y_i-\hat{y_i})^2## is not the same as minimizing ##\sum (x_i-\hat{x_i})^2##.)
So you should do a linear regression with X as a linear function of Y.
The issue is not how well the X and Y values can be measured, it is how well the values fit the selected model. That is the sum-squared-error that is being minimized.
I understand your explanation, but why does it seem that your suggestion differs from @Dale 's? Or maybe I am misinterpreting something?

Dale said:
How well can you measure the values of y? How well can you measure the values of x? The answer to those questions determines the method you should use.
Dale said:
The important thing is which is measured/known most precisely. That should be the independent variable. The assumption of OLS regression is that all of the error is in the dependent variable.

From those replies, the variable that becomes the independent variable is the one that can be measured more precisely, which is ##x## in my case, so the regression line that should be used is ##y## on ##x##, even though I want to estimate ##x## from ##y##.

But from your reply (@FactChecker ), I should use the regression line ##x## on ##y## because I want to estimate ##x## for a given ##y##, so that the estimate suits the model (the error in ##x## is minimized), even though my independent variable is ##x## (I can't change ##y## to be the independent variable).

Am I correct to think that there are two different suggestions for my hypothetical case?

Thanks
 
  • #9
songoku said:
From those replies, the one that becomes independent variable is the one that can be measured more precisely, which is x in my case so the regression line that should be used is y on x, even though I want to estimate x from y
Yes, this is correct.

@FactChecker can confirm, but I don’t think that he is disagreeing with me. He is just showing you why the two choices are not equivalent.
 

FAQ: Can We Use Regression Line x on y if y is the Dependent Variable?

What is a regression line?

A regression line is a straight line that represents the relationship between two variables in a scatter plot. It is used to predict the value of one variable based on the value of the other variable.

How is a regression line calculated?

A regression line is calculated using a statistical method called linear regression, most commonly ordinary least squares. This involves finding the best-fitting line, the one that minimizes the sum of the squared vertical distances between the actual data points and the line.
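As a concrete illustration (toy numbers, not from the thread), the least-squares line for y on x has a familiar closed form: the slope is cov(x, y)/var(x), and the intercept is chosen so the line passes through the point of means.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares estimates for the line y = a + b*x
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()

# These match a degree-1 polynomial fit
b_check, a_check = np.polyfit(x, y, 1)
print(np.isclose(b, b_check), np.isclose(a, a_check))
```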

What is the purpose of a regression line?

The purpose of a regression line is to show the relationship between two variables and to make predictions about future data points. It is commonly used in scientific research and data analysis to identify trends and patterns in data.

What is the difference between a regression line and a trendline?

A regression line and a trendline are often used interchangeably, but there is a subtle difference between the two. A regression line is used to predict the value of one variable based on the other, while a trendline is used to show the overall trend or pattern in the data.

Can a regression line be used to make predictions outside of the data range?

Yes, a regression line can be used to make predictions outside of the data range, but it should be done with caution. The accuracy of the predictions decreases as the distance from the data range increases, so it is important to use good judgment when making predictions beyond the data range.
