All the references I can find on the net about justifying a correlation treat it as a matter of judgment and note, quite correctly, that it depends on the application.
But it seems to me that one could compare the fit to the data of a horizontal line (i.e. average y) with that of the linear regression and ask whether the improved fit is better than chance.
The best horizontal fit has error term ##Var(y)## (population variance, not sample variance), while the error term of the best linear fit is smaller than that by ##\frac{Cov(x,y)^2}{Var(x)}##.
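To make the two error terms concrete, here is a minimal numerical sketch. The sample size, the slope of 0.3, and the random seed are arbitrary choices made for illustration only; the variances and covariance use the population (1/n) convention, as above.

```python
import numpy as np

# Illustrative data only: the sample size, slope, and seed are arbitrary assumptions.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

# Population-style (divide by n) moments.
var_x = np.var(x)
var_y = np.var(y)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Least-squares line; np.polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, 1)

# Mean squared error of the horizontal fit (y = mean of y) and of the fitted line.
mse_flat = np.mean((y - y.mean()) ** 2)
mse_line = np.mean((y - (slope * x + intercept)) ** 2)

# Claimed reduction in the error term when going from the flat fit to the line.
reduction = cov_xy ** 2 / var_x

print(mse_flat, var_y)
print(mse_line, var_y - reduction)
```

Each printed pair should agree up to floating-point rounding, which is just the statement above checked on one sample.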
Now suppose that, in reality, X and Y are independent. My question is: if we make some guess about the y distribution, what would be the expected value of ##Cov(x,y)^2##?
I tried simplifying by assuming an N(0,1) distribution for Y, and that the mean and variance of the sample match the population, but I get that the expected value of ##Cov(x,y)^2## is ##Var(x)Var(y)##. This makes no sense to me because it would lead to the expected error term of the linear fit being zero.
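As a sanity check on that suspicious result, the expectation can be estimated by simulation. This is only a rough Monte Carlo sketch: the function name `mean_cov_squared`, the choice of N(0,1) for both X and Y, and the trial count are assumptions made for illustration, and all moments use the population (1/n) convention.

```python
import numpy as np

def mean_cov_squared(n, trials=20000, seed=0):
    """Estimate E[Cov(x,y)^2], E[Var(x)Var(y)] and E[Cov(x,y)^2 / Var(x)]
    for independent samples of size n (hypothetical helper, N(0,1) assumed)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(trials, n))
    y = rng.normal(size=(trials, n))  # drawn independently of x

    # Centre each sample on its own mean.
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)

    # Population-style (divide by n) covariance and variances, one per trial.
    cov = (xc * yc).mean(axis=1)
    var_x = (xc ** 2).mean(axis=1)
    var_y = (yc ** 2).mean(axis=1)

    return (cov ** 2).mean(), (var_x * var_y).mean(), (cov ** 2 / var_x).mean()

n = 20
cov2, var_prod, reduction = mean_cov_squared(n)
print("mean Cov(x,y)^2       :", cov2)
print("mean Var(x)Var(y)     :", var_prod)
print("mean Cov^2 / Var(x)   :", reduction, " compare 1/n =", 1 / n)
```

In this sketch the average of ##Cov(x,y)^2## comes out far below the average of ##Var(x)Var(y)##, and the average reduction ##\frac{Cov(x,y)^2}{Var(x)}## sits close to ##1/n## for unit-variance Y. That is consistent with a factor of ##1/n## having gone missing somewhere in the algebra, though a simulation like this cannot say where.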
Does anyone have a reference for such an analysis? If not, I'll post my algebra and ask where I am going wrong.
I am aware that there is a whole branch of stats that deals with justifying the number of tuning parameters in a mathematical model, but I was looking for something simpler to start with.