Linear regression and random variables

In summary: The mean of y is a good estimate for the value of y if you have no other information about the variables.
  • #1
fog37
Hello,
I have a question about linear regression models and correlation. My understanding is that our finite set of data ##(x,y)## represents a random sample from a much larger population. Each pair is an observation in the sample.

We find, using OLS, the best fit line and its coefficients and run some statistical tests (t-test and F-test) to check the coefficients' statistical significance. The ultimate goal is to estimate with precision the population slope and intercept.

Does each pair ##(x,y)## represent the realization of a bivariate random variable ##Z=(X,Y)## with Gaussian joint distribution? In the regression analysis, are both ##X## and ##Y## random variables or only the variable ##Y## is random? A random variable has its possible values and associated probabilities. Two random variables ##X## and ##Y## are said to be jointly normal if ##aX+bY## has a normal distribution.

That said, how do we get to the linear model ## y =\beta_1 x +\beta_0## considering ##X## and ##Y## as both random variables?

Thank you!
 
  • #2
Linear regression is applied to the model ##y = \beta_1 x +\beta_0 + \epsilon##, where ##\epsilon## has a Normal distribution with mean 0. The independent ##x## values are not assumed to come from a random variable, but they can be.
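A minimal numerical sketch of that model (the coefficients, noise level and sample size below are made-up illustrative values, not from any real data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" population parameters (illustrative values only)
beta0, beta1, sigma = 2.0, 0.5, 1.0

# Fixed, non-random x values; the randomness enters only through epsilon
x = np.linspace(0.0, 10.0, 50)
epsilon = rng.normal(0.0, sigma, size=x.size)
y = beta1 * x + beta0 + epsilon

# Ordinary least squares estimates of slope and intercept
b1_hat, b0_hat = np.polyfit(x, y, deg=1)
print(f"estimated slope = {b1_hat:.3f}, estimated intercept = {b0_hat:.3f}")
```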
 
  • Like
Likes fog37
  • #3
As @FactChecker said, the usual model is ##y=\beta_1 x+\beta_0+\epsilon## where ##\epsilon \sim \mathcal{N}(0,\sigma)## but this is completely equivalent to the model ##y\sim \mathcal{N}(\beta_1 x +\beta_0,\sigma)##. So if you prefer to think in terms of random variables then you certainly can. In fact, that equivalent model is often used in Bayesian statistics.
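For what it's worth, a quick simulation sketch of that equivalence (illustrative numbers only): drawing ##\epsilon \sim \mathcal{N}(0,\sigma)## and adding it to ##\beta_1 x + \beta_0## gives the same distribution as drawing ##y## directly from ##\mathcal{N}(\beta_1 x + \beta_0, \sigma)##.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 2.0, 0.5, 1.0   # illustrative values
x = 3.0                               # one fixed predictor value

# Formulation 1: y = beta1*x + beta0 + epsilon, with epsilon ~ N(0, sigma)
y1 = beta1 * x + beta0 + rng.normal(0.0, sigma, size=100_000)

# Formulation 2: y ~ N(beta1*x + beta0, sigma)
y2 = rng.normal(beta1 * x + beta0, sigma, size=100_000)

# Up to Monte Carlo error, both samples have the same mean and spread
print(y1.mean(), y2.mean())  # both close to beta1*x + beta0 = 3.5
print(y1.std(), y2.std())    # both close to sigma = 1.0
```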
 
  • Like
Likes fog37 and FactChecker
  • #4
fog37 said:
Does each pair ##(x,y)## represent the realization of a bivariate random variable ##Z=(X,Y)## with Gaussian joint distribution? In the regression analysis, are both ##X## and ##Y## random variables or only the variable ##Y## is random?
To repeat what others have said, the assumptions behind the linear regression model and associated OLS procedure say that the X values have no random errors, so the (X,Y) data is not from the realizations of a bivariate normal random variable.
 
  • Like
Likes fog37
  • #5
Dale said:
As @FactChecker said, the usual model is ##y=\beta_1 x+\beta_0+\epsilon## where ##\epsilon \sim \mathcal{N}(0,\sigma)## but this is completely equivalent to the model ##y\sim \mathcal{N}(\beta_1 x +\beta_0,\sigma)##. So if you prefer to think in terms of random variables then you certainly can. In fact, that equivalent model is often used in Bayesian statistics.
Hello Dale,
The sample data, i.e. all the available pairs ##(x,y)##, are modelled as follows:
##y=\beta_1 x+\beta_0+\epsilon##

##Y## is a random variable and its expectation is ##E[Y|X] = \beta_1 x+ \beta_0##.

The regression model that we compute generates estimates of ##\beta_1## and ##\beta_0## which are ## \hat{\beta_1}## and ## \hat{\beta_0}##.

The regression model itself is ##\hat{\beta_1} x+\hat{\beta_0}##.

Does that mean that the regression model estimates the mean of ##Y## and not ##Y## itself?

We use the regression model ##y_{pred}= \hat{\beta_1} x+\hat{\beta_0}## for predictions of the ##y## values though...
 
  • #6
fog37 said:
Does that mean that the regression model estimates the mean of ##Y## and not ##Y## itself?

We use the regression model ##y_{pred}= \hat{\beta_1} x+\hat{\beta_0}## for predictions of the ##y## values though...
It means that the regression model estimates the mean of ##Y## given that ##X=x##.
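A small simulation sketch of that statement (illustrative numbers only): with many replicate observations at each fixed ##x##, the fitted line lands close to the empirical mean of the ##y## values at that ##x##.

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, sigma = 2.0, 0.5, 1.0          # illustrative values

# 200 replicate observations at each of five fixed x values
x = np.repeat([1.0, 2.0, 3.0, 4.0, 5.0], 200)
y = beta1 * x + beta0 + rng.normal(0.0, sigma, size=x.size)

b1_hat, b0_hat = np.polyfit(x, y, deg=1)

for x0 in (1.0, 3.0, 5.0):
    empirical_mean = y[x == x0].mean()      # sample mean of the y values at X = x0
    fitted_value = b1_hat * x0 + b0_hat     # what the regression line predicts there
    print(x0, round(empirical_mean, 3), round(fitted_value, 3))
```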
 
  • #7
Hmm, I don’t know. Does the regression model not include the error term too? I actually don’t know the right terminology here.
 
  • Like
Likes fog37
  • #8
Dale said:
Hmm, I don’t know. Does the regression model not include the error term too? I actually don’t know the right terminology here.
Yes, I am not sure. Books present the linear model as a tool for estimating the value of ##y##, not the mean of ##y##.
 
  • #9
It depends on which results of the regression you use.
If you are just looking for a simple curve fit, ##\hat{y} = \beta_0 + \beta_1 x##, through a more complicated, non-random relationship, ##y = f(x)##, that minimizes the sum-squared errors, then you are estimating the value of ##y(x)## as a deterministic function of ##x##. In that case, any regression results regarding probabilities or statistical significance are not meaningful.

On the other hand, if you are assuming the model ##Y = \beta_0 + \beta_1 x + \epsilon## where ##\epsilon \sim N(0,\sigma)##, then you are assuming that there is a random component of ##Y##. In that case, your regression result is estimating the mean of ##Y##, given the ##x## value. In that case, any regression results regarding probabilities or statistical significance are meaningful.
 
Last edited:
  • Like
Likes fog37 and Dale
  • #10
fog37 said:
Yes, I am not sure. Books present the linear model as a tool for estimating the value of ##y##, not the mean of ##y##.

The mean of y is a good estimate for the value of y if you have no other information...
 
  • #11
fog37 said:
Yes, I am not sure. Books present the linear model as a tool for estimating the value of ##y##, not the mean of ##y##.
I know that Wikipedia isn't authoritative, but Wikipedia says:

"A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data "
https://en.wikipedia.org/wiki/Statistical_model

So it seems like Wikipedia includes the random component as part of the statistical model since the random component is part of what is used to generate the sample data.

Please do not take either my comments or Wikipedia's as authoritative. If I could, I would post this under a non-mentor pseudonym to avoid exaggerating my credibility on this. But I am leaning towards considering the whole thing to be the statistical model, not just the best fit function.
 
  • Like
Likes FactChecker
  • #12
I struggled with how to distinguish between the statistical mean, ##\hat{Y}##, and the deterministic ##y(x)##. Of course, for a repeatable experiment, given a fixed variable ##x##, the question would be whether the result ##y## always gives the same value or varies randomly. But what about something that is not repeatable, such as values (e.g. daily temperature highs) versus calendar dates? For that, the ##x## values cannot be repeated. Conversely, what about something that we would consider deterministic, but whose details are so complicated that we might consider them random? So I guess the best I can do is to refer to how one decides to model the process and whether a random term ##\epsilon## is included in the model.
I don't know if there is a good study or reference on this issue. My thoughts on it seem rather amateurish.
 
  • Like
Likes Dale
  • #13
FactChecker said:
My thoughts on it seem rather amateurish.
Mine too, but your reasoning sounds good.
 
  • Like
Likes FactChecker
  • #14
fog37 said:
##Y## is a random variable and its expectation is ##E[Y|X] = \beta_1 x+ \beta_0##.
Better notation would be ##E(Y \mid X = x)##. The value of Y given a specific value of X is a random variable. But considering that X has various possible values, it isn't precise to say that "Y is a random variable". In the model, there is a set of random variables. For each value of X=x, we have a different random variable ##Y_x##.
fog37 said:
The regression model that we compute generates estimates of ##\beta_1## and ##\beta_0## which are ## \hat{\beta_1}## and ## \hat{\beta_0}##.

The regression model itself is ##\hat{\beta_1} x+\hat{\beta_0}##.

Does that mean that the regression model estimates the mean of ##Y## and not ##Y## itself?
Since Y|X=x is a random variable, how can we interpret the concept of "estimating Y itself"? Do you mean generating a set of data that follows the distribution of Y?

fog37 said:
We use the regression model ##y_{pred}= \hat{\beta_1} x+\hat{\beta_0}## for predictions of the ##y## values though...

The term "estimator" in mathematical statistics can refer to any function of the data. (Thus an "estimator" itself is a random variable when the data are from random variables.) The term "estimate" can refer to one specific value of an estimator that results from one specific set of data. Whether a particular function estimates a particular parameter of a model is a subjective question - it has to do with the intentions of the person using the model. Furthermore, how well an estimator estimates a parameter is also a subjective question because there are various ways to quantify the utility or dis-utility of estimates.

In linear regression, the dis-utility of an estimated value of Y|X=x is measured by the square of the difference between the estimated value and an observed value. This is a subjective choice. For example, another measure would be the absolute value of that difference. Yet another measure might be the percentage difference.

Since the linear regression model involves a set of random variables, we can't say what the "best" estimated values of its parameters are until we say how to condense the measures of dis-utility for each of the different random variables into a single number. If you look at how linear regression does it, it treats all the ##Y_x## variables as having equal importance and estimates their average dis-utility, effectively giving each possible value of X an equal importance. In a practical situation where some values of X can be less frequent or less important, that might not be the "best" way of doing things.
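To illustrate that the squared-error choice really is a choice, here is a rough sketch on made-up data (using scipy's general-purpose minimizer rather than any specialized routine) fitting the same line under a squared-error and an absolute-error dis-utility; an outlier pulls the two answers apart.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 60)
y = 0.5 * x + 2.0 + rng.normal(0.0, 1.0, size=x.size)  # illustrative data
y[-1] += 15.0                                           # one artificial outlier

def fit(loss):
    # Minimize the chosen dis-utility over (intercept, slope)
    objective = lambda b: loss(y - (b[0] + b[1] * x)).sum()
    return minimize(objective, x0=[0.0, 0.0], method="Nelder-Mead").x

b_squared = fit(np.square)   # squared error: ordinary least squares
b_absolute = fit(np.abs)     # absolute error: least absolute deviations

print("squared loss  (intercept, slope):", b_squared)
print("absolute loss (intercept, slope):", b_absolute)
```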



 
  • Informative
Likes jim mcnamara
  • #15
fog37 said:
Hello,
I have a question about linear regression models and correlation. My understanding is that our finite set of data ##(x,y)## represents a random sample from a much larger population. Each pair is an observation in the sample.

We find, using OLS, the best fit line and its coefficients and run some statistical tests (t-test and F-test) to check the coefficients' statistical significance. The ultimate goal is to estimate with precision the population slope and intercept.

Does each pair ##(x,y)## represent the realization of a bivariate random variable ##Z=(X,Y)## with Gaussian joint distribution? In the regression analysis, are both ##X## and ##Y## random variables or only the variable ##Y## is random? A random variable has its possible values and associated probabilities. Two random variables ##X## and ##Y## are said to be jointly normal if ##aX+bY## has a normal distribution.

That said, how do we get to the linear model ## y =\beta_1 x +\beta_0## considering ##X## and ##Y## as both random variables?

Thank you!
"Does each pair ##(x,y)## represent the realization of a bivariate random variable ##Z=(X,Y)## with Gaussian joint distribution?"
Not classically, no. First of all, the assumption of a Gaussian distribution is not among those required for regression, and when it is made it applies not to the response but to the error distribution.
If you assume both response and predictor are random, the regression model is typically viewed as specifying the conditional expected value of Y given x.

"In the regression analysis, are both ##X## and ##Y## random variables or only the variable ##Y## is random?"
As noted above, traditionally only Y is considered random.

"A random variable has its possible values and associated probabilities. Two random variables ##X## and ##Y## are said to be jointly normal if ##aX + bY## has a normal distribution."
You're missing a bit here: you need to add the statement "for all real numbers a, b".
 
  • #16
Stephen Tashi said:
To repeat what others have said, the assumptions behind the linear regression model and associated OLS procedure say that the X values have no random errors, so the (X,Y) data is not from the realizations of a bivariate normal random variable.
I think we need to distinguish between a random, but perfectly accurate, ##X## value and a random error in our estimated value of ##X##. The first case is within the scope of traditional linear regression. The second case is different. I do not know enough about that case to discuss it.
 
  • #17
fog37 said:
Hello,
I have a question about linear regression models and correlation. My understanding is that our finite set of data ##(x,y)## represents a random sample from a much larger population. Each pair is an observation in the sample.

We find, using OLS, the best fit line and its coefficients and run some statistical tests (t-test and F-test) to check the coefficients' statistical significance. The ultimate goal is to estimate with precision the population slope and intercept.

Does each pair ##(x,y)## represent the realization of a bivariate random variable ##Z=(X,Y)## with Gaussian joint distribution? In the regression analysis, are both ##X## and ##Y## random variables or only the variable ##Y## is random? A random variable has its possible values and associated probabilities. Two random variables ##X## and ##Y## are said to be jointly normal if ##aX+bY## has a normal distribution.

That said, how do we get to the linear model ## y =\beta_1 x +\beta_0## considering ##X## and ##Y## as both random variables?

Thank you!
It may be both. If you want to, e.g., measure weight at X=1,2,... years of age, then X is not random.
 
  • #18
Suppose you are using linear regression to fit the data to the model ##y=\beta_1 x + \beta_0 + \epsilon##.
If the ##x## values are known with no errors, then we know that linear regression works fine.

On the other hand, suppose that the measured ##x## values have some errors and that ##X_{measured} = \alpha_1 X_{actual} + \alpha_0 + \epsilon_X##. Then linear regression would give a result like
##Y=\beta_1 X_{measured} + \beta_0 + \epsilon##
## = \beta_1(\alpha_1 X_{actual} +\alpha_0 + \epsilon_X)+ \beta_0+ \epsilon##
## = (\beta_1\alpha_1) X_{actual} + (\beta_1\alpha_0+\beta_0) + (\beta_1\epsilon_X + \epsilon)##.
So it is still a valid process, but it is estimating ##Y## based on the measured ##X## value. That may be what you really want. But if you are trying to get the theoretical relationship between ##Y## and ##X_{actual}##, it might not be a good model to use.
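A simulation sketch of that last point (illustrative numbers only): regressing ##y## on a noisily measured ##x## still "works", but the slope it estimates is not the slope of the relationship with ##X_{actual}##.

```python
import numpy as np

rng = np.random.default_rng(4)
beta0, beta1 = 2.0, 0.5                      # illustrative "theoretical" relation

x_actual = rng.uniform(0.0, 10.0, 500)
y = beta1 * x_actual + beta0 + rng.normal(0.0, 0.5, size=x_actual.size)

# The measured x carries its own random error
x_measured = x_actual + rng.normal(0.0, 2.0, size=x_actual.size)

slope_actual = np.polyfit(x_actual, y, deg=1)[0]
slope_measured = np.polyfit(x_measured, y, deg=1)[0]

print("slope vs x_actual  :", round(slope_actual, 3))    # close to 0.5
print("slope vs x_measured:", round(slope_measured, 3))  # noticeably smaller
```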
 
  • #19
Dale said:
Hmm, I don’t know. Does the regression model not include the error term too? I actually don’t know the right terminology here.
The theoretical model, when written out, states the error term.
$$
Y = \beta_0 + \beta_1 \, x + \varepsilon
$$
Once you have collected data and estimated the intercept and slope all the quantities are known entities: no error term involved

$$
\hat{y} = \widehat{\beta_0} + \widehat{\beta_1} \, x
$$
 
  • Like
Likes fog37
  • #20
statdad said:
The theoretical model, when written out, states the error term.
$$
Y = \beta_0 + \beta_1 \, x + \varepsilon
$$
Once you have collected data and estimated the intercept and slope all the quantities are known entities: no error term involved

$$
\hat{y} = \widehat{\beta_0} + \widehat{\beta_1} \, x
$$
Going a little further: if you assume both Y and X are random it’s typical to assume the underlying model looks like this.
Assume that there are distributions F, M [I‘m also going to assume they have densities: that isn’t strictly required but it makes the exposition a little easier. Notice nothing is said here about either being a Gaussian distribution] such that
$$
h(x,y) = f\left(y - \left(\beta_0 + \beta_1 x\right)\right)m\left(x\right)
$$
so that the conditional distribution of Y given X = x depends on x. Usually it is assumed that both F and M have finite second moments with F symmetric about zero so that the conditional expectation of Y given X = x is
$$
E\left(Y \mid X=x\right) = \beta_0 + \beta_1 \, x
$$

In this sense all of the interpretations drawn from the regression equation are conditional.
This model easily generalizes to the multivariate case as well: M is assumed to be a multivariate distribution with positive definite covariance matrix, F has the same assumptions as above, and WHOOSH you have a joint distribution that’s the same form as above.
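A small sampling sketch of that joint model (my own illustrative choices: an exponential ##m## for the predictor and a Laplace ##f## for the residual, neither of them Gaussian); the conditional mean of ##Y## given ##X=x## is still ##\beta_0 + \beta_1 x##, so a least-squares fit recovers those coefficients.

```python
import numpy as np

rng = np.random.default_rng(5)
beta0, beta1 = 2.0, 0.5                         # illustrative values

# m: density of the (random) predictor, here exponential
x = rng.exponential(scale=3.0, size=50_000)

# f: residual density, symmetric about zero, here Laplace
resid = rng.laplace(loc=0.0, scale=1.0, size=x.size)

y = beta0 + beta1 * x + resid

# E(Y | X = x) = beta0 + beta1 * x, so least squares recovers the coefficients
b1_hat, b0_hat = np.polyfit(x, y, deg=1)
print("intercept, slope:", round(b0_hat, 3), round(b1_hat, 3))
```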
 
  • #21
statdad said:
The theoretical model, when written out, states the error term.

$$
Y = \beta_0 + \beta_1 \, x + \varepsilon
$$

Once you have collected data and estimated the intercept and slope all the quantities are known entities: no error term involved
$$
\hat{y} = \widehat{\beta_0} + \widehat{\beta_1} \, x
$$
So ##\hat{Y}## is an unbiased estimator?
 
  • #22
If the predicted (fitted) y-values are ever referred to as unbiased estimators I've never encountered it. We say the estimated coefficients are unbiased estimators of the population coefficients [the parameters of the model] since the estimates have sampling distributions and the expected value of those distributions equal the appropriate parameters. There isn't a parameter that corresponds to yhat, so I'm not sure what quantity you would consider it to be unbiased for.
 
  • Like
Likes fog37
  • #23
statdad said:
If the predicted (fitted) y-values are ever referred to as unbiased estimators I've never encountered it. We say the estimated coefficients are unbiased estimators of the population coefficients [the parameters of the model] since the estimates have sampling distributions and the expected value of those distributions equal the appropriate parameters. There isn't a parameter that corresponds to yhat, so I'm not sure what quantity you would consider it to be unbiased for.

I guess the ideal relation between y and x.
 
  • #24
Dale said:
Hmm, I don’t know. Does the regression model not include the error term too? I actually don’t know the right terminology here.
Late to the game here I realize, but there seems to be some confusion about the use of the word "model".
In simple linear regression the theoretical model has this functional form
$$
Y = \beta_0 + \beta_1 x + \varepsilon
$$
where the ## \varepsilon## term denotes the random error, assumed to follow a distribution that has mean zero and fixed standard deviation ## \sigma##. The x values are constants, while ##\beta_0, \beta_1## are unknown constants. This means that Y itself is a random quantity having a mean that depends on x:
$$
E[Y] = \beta_0 + \beta_1 x
$$
and variance ##\sigma^2##. Note that a Gaussian error distribution is not included in these assumptions -- if it is added, then the exact distributions of the LS estimates are known [certain weak conditions must be satisfied], but even without the assumption of Gaussian errors the LS estimates are, in general, asymptotically normal.
The other use of the word model relates to the fitted equation -- the equation with the LS coefficients estimated.
$$
\hat{y} = \widehat{\beta_0} + \widehat{\beta_1} x
$$

There is no error term used here as the fitted equation is a sample value, not a theoretical model.
 
  • #25
statdad said:
If the predicted (fitted) y-values are ever referred to as unbiased estimators I've never encountered it.
No one estimate of ##y## at a given ##x_0## could be considered "unbiased" since it is clearly biased by the data at other ##x## values.
EDIT: I am having second thoughts about this statement.
statdad said:
We say the estimated coefficients are unbiased estimators of the population coefficients [the parameters of the model]
One must be careful here. The meaning and interpretation of the model parameters depend on how the regression was performed and what each parameter means. This is especially true when there are variables included (or excluded) due to their high (or low) statistical significance. This is not an issue if all the original independent variables are uncorrelated, but that is a rare occurrence. Otherwise, one must be very careful about any interpretation of the meaning of the parameters.
 
Last edited:
  • #26
The estimates are always unbiased for some coefficients -- the coefficients of the model you specified for the fit. That unbiasedness doesn't change if some of the x-variables are correlated, unless the correlation is extremely strong or the number of predictors is near the sample size. The question of what, exactly, they are unbiased for is key: they are unbiased for the model you specified, which is most likely not the correct model.
 
  • #27
You do not have to assume that the error terms in linear regression are normally distributed. If they are not, the estimator is still consistent, but it may no longer be efficient. As long as the x values can be measured without error, it makes no difference whether one assumes the x values to be random samples from some distribution or not.
This changes if both x and y are measured with errors; one then speaks of "errors-in-variables models". They are quite nasty, as often the errors of x and y are not estimable from the data set. If the true x values are assumed to be a random sample from some distribution, one speaks of a structural model; if they are fixed, it is called a functional model. If the ratio of the error variances of y and x is known, estimators for the slope and intercept are obtainable from orthogonal or Deming regression, https://en.wikipedia.org/wiki/Deming_regression.
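For the last point, a rough sketch of Deming regression with the variance ratio assumed known (the slope formula is the standard closed form described on the linked Wikipedia page; all the data below are simulated and purely illustrative):

```python
import numpy as np

def deming_slope(x, y, delta=1.0):
    # delta = ratio of the y-error variance to the x-error variance (assumed known)
    sxx = np.var(x)
    syy = np.var(y)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    return (syy - delta * sxx
            + np.sqrt((syy - delta * sxx) ** 2 + 4.0 * delta * sxy ** 2)) / (2.0 * sxy)

rng = np.random.default_rng(6)
x_true = rng.uniform(0.0, 10.0, 1_000)
y_true = 2.0 + 0.5 * x_true                           # illustrative true relation
x = x_true + rng.normal(0.0, 1.0, size=x_true.size)   # x measured with error
y = y_true + rng.normal(0.0, 1.0, size=x_true.size)   # y measured with error

print("OLS slope   :", round(np.polyfit(x, y, deg=1)[0], 3))     # attenuated
print("Deming slope:", round(deming_slope(x, y, delta=1.0), 3))  # closer to 0.5
```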
 
  • Like
Likes WWGD and Dale
  • #28
FactChecker said:
No one estimate of ##y## at a given ##x_0## could be considered "unbiased" since it is clearly biased by the data at other ##x## values.
EDIT: I am having second thoughts about this statement.
Unbiasedness means that the expectation value of ##Y_i## is equal to the population value ##y_i##, i.e. ##E(Y_i)=y_i##. Given that ##X_i = x_i##, this is true for OLS.
Using ##Y_i=\beta_1 x_i + \beta_0 +\epsilon_i##,
$$ \hat{\beta}_1=\frac{\sum_i (x_i-\bar{x})(Y_i-\bar{Y})}{\sum_i (x_i-\bar{x})^2}.$$
This expression is linear in the ##\epsilon_i##. As long as for all i, ##E(\epsilon_i)=0##,
$$ E(\hat{\beta}_1)=\beta_1.$$
The same is true for $$\hat{\beta}_0= \bar{Y}-\hat{\beta}_1\bar{x}$$.
Finally, also $$E(\hat{Y}_i)=E(\hat{\beta}_1) x_i +E(\hat{\beta}_0)=\beta_1 x_i +\beta_0=y_i.$$
Clearly, for unbiasedness, the ##\epsilon_i## don't need to be Gaussian. They can also be different for different ##i## (heteroscedastic), as long as ##E(\epsilon_i)=0## and the errors are independent for different ##i##.
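A Monte Carlo sketch of that claim (my own illustrative setup: centred exponential errors whose scale grows with ##x##, so they are neither Gaussian nor homoscedastic, but still have ##E(\epsilon_i)=0##):

```python
import numpy as np

rng = np.random.default_rng(7)
beta0, beta1 = 2.0, 0.5                  # illustrative population values
x = np.linspace(0.0, 10.0, 30)           # fixed design points

slopes = []
for _ in range(10_000):
    scale = 0.5 + 0.2 * x                       # error scale grows with x
    eps = rng.exponential(scale) - scale        # E(eps_i) = 0 for every i
    y = beta1 * x + beta0 + eps
    slopes.append(np.polyfit(x, y, deg=1)[0])

# The average of the slope estimates is close to the true slope 0.5
print("mean of slope estimates:", round(float(np.mean(slopes)), 4))
```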
 
  • Like
Likes FactChecker
  • #29
DrDu said:
Unbiasedness means that the expectation value of ##Y_i## is equal to the population value ##y_i##, i.e. ##E(Y_i)=y_i##. Given that ##X_i = x_i##, this is true for OLS.
Thanks! I think that I see what my misconception was. I was thinking that the estimate at ##x_i## would be affected by the values of ##y_j## at ##x_j \ne x_i## and so would be biased unless that value of ##y_j## was equal to the mean of ##Y(x_j)##. In particular, any ##y_j## that was an outlier could seriously shift the entire regression line. That was a misconception. The assumption is that the regression is estimating the population mean of ##Y(x_j)## for each input value ##x_j##, so an outlier value of any specific ##y_j## is not an issue.
(I am not sure how clear my explanation is.)
 
  • #30
Of course outliers are an issue! But first one has to define what an outlier is. An outlier may violate ##E(\epsilon_i)=0##. OLS is sensitive to this; it is not a robust method. A single outlier of this kind may lead to a slope estimate arbitrarily far away from the true one.
The second kind of outlier is one which has a much broader distribution of ##\epsilon_i## than the others. While the estimator remains unbiased, the efficiency may drop dramatically. Of course the sensitivity to both kinds of outliers is related.
A robust alternative to OLS is, for example, Theil-Sen regression, i.e. estimation of the slope as the median of all pairwise slopes: https://en.wikipedia.org/wiki/Theil–Sen_estimator
Other robust methods, instead of minimizing the squared distance from the regression line, use the absolute distance or a Huber loss function, which is quadratic for small deviations but linear for larger deviations: https://en.wikipedia.org/wiki/Huber_loss
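A quick sketch of the Theil-Sen idea on made-up data (a brute-force median of all pairwise slopes, which is fine for small samples), next to OLS with a single gross outlier:

```python
import numpy as np
from itertools import combinations

def theil_sen_slope(x, y):
    # Theil-Sen estimator: median of all pairwise slopes
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[j] != x[i]]
    return float(np.median(slopes))

rng = np.random.default_rng(8)
x = np.linspace(0.0, 10.0, 30)
y = 0.5 * x + 2.0 + rng.normal(0.0, 0.5, size=x.size)   # illustrative data
y[-1] += 30.0                                            # one gross outlier

print("OLS slope      :", round(np.polyfit(x, y, deg=1)[0], 3))  # pulled by the outlier
print("Theil-Sen slope:", round(theil_sen_slope(x, y), 3))       # close to 0.5
```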
 
  • Like
Likes FactChecker
  • #31
DrDu said:
Of course outliers are an issue! But first one has to define what an outlier is. An outlier may violate ##E(\epsilon_i)=0##. OLS is sensitive to this; it is not a robust method. A single outlier of this kind may lead to a slope estimate arbitrarily far away from the true one.
Yes. My point was that I should not worry about outliers at other ##x## values as far as the expected values of the estimators are concerned, since it is the entire distribution at those ##x## values that determines those expected values.
 