Would including the true/known y-intercept in my dataset be "overfitting"?

In summary: adding a theoretically known y-intercept to your data as if it were a measured point is questionable, and simply fixing the intercept (fitting with no free constant term) carries its own statistical problems. The options discussed in this thread are to subtract the known intercept from all Y values, fit a regression with no constant term, and add the intercept back afterward; to actually measure the intercept so it becomes a real data point; or to use restricted regression to impose the constraint. The restriction should only be imposed when the theoretical intercept is solidly established, because otherwise it biases the other parameters and changes the meaning of the usual fit statistics.
  • #1
fahraynk
Suppose I have a set of data points which I want to fit, and suppose I know what the data's true y-intercept is, for example I know at X=0, Y=7. If I include this point in my data is it overfitting? Since the model needs to find the intercept, including the point effectively hands it the answer: it has no choice but to go through x=0, y=7. Is it cheating if I happen to know that point is the true intercept and include it in my data to be fit?

If so, what should I do in this case?
 
  • #2
You can adjust your data for a (0,0) intercept by subtracting 7 from all Y values and fit a regression model with no constant term: ##Y = a\cdot X##. Then add the 7 back into your model: ##Y = a\cdot X + 7##. This is not overfitting.
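A minimal numerical sketch of this suggestion (the data below are invented purely for illustration; the true relation is taken to be Y = 2X + 7 plus noise, and only the known intercept of 7 comes from the thread):

```python
# Hedged illustration of the subtract-then-fit-through-origin idea.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 12)
y = 2 * x + 7 + rng.normal(scale=0.5, size=x.size)   # pretend measurements

# Subtract the known intercept, then fit Y2 = a*X with no constant term.
y2 = y - 7.0
a = np.linalg.lstsq(x.reshape(-1, 1), y2, rcond=None)[0][0]

# Add the intercept back into the model: Y = a*X + 7
print(f"fitted slope a = {a:.3f};  model: Y = {a:.3f}*X + 7")
```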
 
  • #3
FactChecker said:
You can adjust your data for a (0,0) intercept by subtracting 7 from all Y values and fit a regression model with no constant term: ##Y = a\cdot X##. Then add the 7 back into your model: ##Y = a\cdot X + 7##. This is not overfitting.

So to summarize: do not include this data point in the fit; instead subtract 7 from all the other Y values, and then add it back after the fit.
Why does it increase overfitting if I include the known data point in the fit? I am really curious.
Do you know something I can read or a term I can google which will explain it to me in more detail?
 
  • #4
You need to have a very firm theoretical justification for doing this. Fitting a linear model without an intercept parameter introduces serious statistical problems. If you do it, the usual statistical measures don't mean what they usually do: for example, ##R^2## is no longer the proportion of variance explained, your residuals will not have zero mean, all of your other parameter estimates can be biased, etc.

If your data is truly guaranteed to have a specific intercept then simply fit a model with an intercept parameter as always. If you are right then the parameter will not be significantly different from the theoretical value, and your statistical analysis of the remaining parameter will work better. If you are wrong then the data will show you.
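As a rough sketch of this check, one could fit the ordinary two-parameter model and look at the confidence interval of the fitted intercept. This assumes the statsmodels package; the data are invented for illustration:

```python
# Hedged sketch: fit y = b0 + b1*x with a free intercept and check whether
# the theoretical intercept (7) lies inside its confidence interval.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 12)
y = 2 * x + 7 + rng.normal(scale=0.5, size=x.size)   # invented data

X = sm.add_constant(x)                 # design matrix with columns [1, x]
fit = sm.OLS(y, X).fit()

lo, hi = fit.conf_int()[0]             # 95% CI for the intercept (first parameter)
print(f"fitted intercept = {fit.params[0]:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
print("theoretical value 7 inside CI:", lo <= 7 <= hi)
```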
 
  • #5
I guess that @Dale is right about the statistical problems. If you know that the intercept is at y=7, then it would be strange if the confidence interval of the regression constant did not include y=7. If you want a well-accepted statistical result, you may have to accept the regression intercept.
 
  • #6
fahraynk said:
for example I know at X=0, Y=7. If I include this point in my data is it overfitting?
I realized that I may have been misinterpreting what you were saying. I thought that you were saying to take your model, y=mx+b, set b to 7 and fit only m. That is a statistically problematic thing to do, although sometimes it can be done with proper precautions and care.

Upon re-reading your post, however, it seems like what you actually want to do is to take your data set of N observations of (x,y) and add 1 fake observation (0,7) for a total of N+1 data points.

If that is your approach then I would ask what makes you believe that at X=0 Y=7? If that is based on actual data then why not replicate that data, or directly sample X=0 and see. In other words, instead of making a fake data point, why not make it real?
 
  • #7
Dale said:
Upon re-reading your post, however, it seems like what you actually want to do is to take your data set of N observations of (x,y) and add 1 fake observation (0,7) for a total of N+1 data points.
Once, in a prior lifetime, I tried something like that. I was surprised at how many (0,7) points I had to add to get the result I wanted. Each one nudges it closer to going through (0,7), but it took a very large number to get reasonably close. In any case, it makes the whole process questionable. I suppose one could estimate how confident one is that (0,7) is valid and how much data it would take to change one's mind, then put a corresponding number of (0,7) points in. There may be some crude rationalization possible for that, but it seems very ad hoc to me.

The subject of linear regression given coefficient constraints (a.k.a. restricted regression) has been studied (see, for instance, section 5.6 of Searle, "Linear Models") and there are implementations (e.g. the SAS RESTRICT statement). There are some YouTube videos that may be relevant (e.g. https://www.youtube.com/watch?v=K-PsdZDaNDE). I have no experience with it, but I assume that the mathematical/statistical approach is valid.
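For a concrete feel for what such a restriction does, here is a small numpy sketch of equality-restricted least squares, using the standard textbook formula applied to invented data with the intercept constrained to 7:

```python
# Restricted least squares with the linear constraint R @ beta = q,
# here forcing the intercept to equal 7.  Data are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 12)
y = 2 * x + 7 + rng.normal(scale=0.5, size=x.size)

X = np.column_stack([np.ones_like(x), x])     # columns: intercept, slope
R = np.array([[1.0, 0.0]])                    # constraint: 1*b0 + 0*b1 = 7
q = np.array([7.0])

XtX_inv = np.linalg.inv(X.T @ X)
b_ols = XtX_inv @ X.T @ y                     # ordinary (unrestricted) estimate
correction = XtX_inv @ R.T @ np.linalg.solve(R @ XtX_inv @ R.T, R @ b_ols - q)
b_restricted = b_ols - correction             # estimate satisfying R @ beta = q

print("unrestricted [b0, b1]:", np.round(b_ols, 3))
print("restricted   [b0, b1]:", np.round(b_restricted, 3))   # b0 is exactly 7
```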
 
  • #8
Dale said:
I realized that I may have been misinterpreting what you were saying. I thought that you were saying to take your model, y=mx+b, set b to 7 and fit only m. That is a statistically problematic thing to do, although sometimes it can be done with proper precautions and care.

Upon re-reading your post, however, it seems like what you actually want to do is to take your data set of N observations of (x,y) and add 1 fake observation (0,7) for a total of N+1 data points.

If that is your approach then I would ask what makes you believe that at X=0 Y=7? If that is based on actual data then why not replicate that data, or directly sample X=0 and see. In other words, instead of making a fake data point, why not make it real?

This comes from mixing two molecules together. In the beginning I have pure molecule 1, and so I know the values of molecule 1 for the first measurement.

As soon as I mix them, I don't know how much is molecule 1, and how much is a combination of molecule 1 and molecule 2. I just get the measurements and I have to interpret them.

So I can use the first measurement, where I know the amount corresponds to 100% molecule 1 and none of molecule 2. It is like an intercept. I use y=mx+b as an example because it's easier to talk about the theory.

Should I add the first value, or subtract it from all the others like FactChecker says and fit that? If so, why? Why is knowing the intercept considered overfitting?
 
  • #9
I found this. Do you think it would work for my problem? Instead of setting the first coefficient myself, or putting the known point in the dataset, if I instead do a restricted regression analysis like this, would it avoid increasing my overfitting?
[Attached images: matrix equations for restricted (constrained) least-squares regression.]
  • #10
fahraynk said:
Why is knowing the intercept considered overfitting?
Where have you heard that? I interpret the term "overfitting" as having too many free parameters for the amount of data. In the extreme case, too many free parameters make it possible to fit a curve that passes artificially through every data point. That is meaningless. I have never heard the term "overfitting" used when the number of free parameters is reduced.
 
  • #11
FactChecker said:
I have never heard the term "overfitting" used when the number of free parameters is reduced.
Neither have I! That is why I don't get it!

But apparently it is done in the field in popular software.

It might help to see the model equation:

$$\frac{G}{Gt}\beta_1 + \frac{HG}{Gt}\beta_2 = Y^{calculated}$$
##G## and ##HG## are molecule 1 and 2, respectively.
##Gt## is the total amount of G present in both molecules. Later on, some ##G## changes into ##HG##. I only know the value of ##G## at the first point, and I get a measurement at that point which corresponds to ##\frac{G}{Gt}=1## and ##\frac{HG}{Gt}=0##.

So technically I think that means I know ##\beta_1##, which is the measurement I get at ##\frac{G}{Gt}=1##. I could ignore this and just add that measurement to the rest of the data; I could set a restriction like in my last post; or I could do what you said and subtract the first measurement from all the others and fit the difference.

But what is the difference between the three methods? I have been told adding the first measurement to the data would be overfitting, but I don't know why. And people in the field use your method of subtracting the measurement from all the others. I prefer setting the restriction if it would work, because it is a fancier-looking option.
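To make the options concrete, here is a hedged sketch of this two-component model. Only the model form comes from the post above; every number (compositions and measurements) is invented for illustration:

```python
# Model from the post:  Y = (G/Gt)*beta1 + (HG/Gt)*beta2, no separate intercept.
# The first row is the pure-G measurement (G/Gt = 1, HG/Gt = 0).
import numpy as np

g_frac  = np.array([1.0, 0.9, 0.7, 0.5, 0.3, 0.1])     # G/Gt   (invented)
hg_frac = 1.0 - g_frac                                  # HG/Gt  (invented)
y_meas  = np.array([7.0, 7.7, 9.2, 10.6, 12.1, 13.5])  # invented measurements

# Option 1: treat the pure-G point like any other data point and fit both betas.
A = np.column_stack([g_frac, hg_frac])
b1, b2 = np.linalg.lstsq(A, y_meas, rcond=None)[0]
print(f"free fit:       beta1 = {b1:.2f}, beta2 = {b2:.2f}")

# Options 2/3 (restriction or subtraction): pin beta1 to the pure-G measurement
# and fit only beta2 on what is left over.
beta1_known = y_meas[0]
resid = y_meas - g_frac * beta1_known
b2_restricted = np.linalg.lstsq(hg_frac[:, None], resid, rcond=None)[0][0]
print(f"restricted fit: beta1 = {beta1_known:.2f} (fixed), beta2 = {b2_restricted:.2f}")
```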
 
  • #12
fahraynk said:
This comes from mixing two molecules together. In the beginning I have pure molecule 1, and so I know the values of molecule 1 for the first measurement.
So just make your usual measurement on a sample of the pure molecule 1. Then it is a completely legitimate data point.

fahraynk said:
Should I add the first value, or subtract it from all the others like FactChecker says and fit that?
Once you have a legitimate data point then just treat it like all of the other data points.

fahraynk said:
Why is knowing the intercept considered overfitting?
It isn’t, as far as I know.

fahraynk said:
I have been told adding the first measurement to the data would be overfitting, but I don't know why.
I think that you need to go to the person who told you that and ask for an explanation. It doesn’t seem right to me, but maybe they are making some subtle point.
 
  • #13
Dale said:
So just make your usual measurement on a sample of the pure molecule 1. Then it is a completely legitimate data point.

Once you have a legitimate data point then just treat it like all of the other data points.

It isn’t, as far as I know.

I think that you need to go to the person who told you that and ask for an explanation. It doesn’t seem right to me, but maybe they are making some subtle point.
Thanks for the reply,

So, you think it's better to add this data point to the rest of the data and just let the solver figure it out on its own, rather than set a restriction such that ##\beta_1## is equal to the measured value at the first point?

Do you have any idea where FactChecker's idea to subtract the first value from all the other points comes from? I ask because that is what others are telling me to do. I just want to have solid information as to why before I do something weird like that!

I pretty much agree with you though, that it should just be considered a legit data point, and let the solver do whatever it wants.
 
  • #14
fahraynk said:
Do you have any idea where FactChecker's idea to subtract the first value from all the other points comes from? I ask because that is what others are telling me to do. I just want to have solid information as to why before I do something weird like that!
I never meant for you to "subtract the first value". The idea is to subtract the theoretically known intercept value of 7 from all Y data values. That changes the regression into one with a new dependent variable, ##Y_2 = Y - 7##, whose regression line should theoretically go through (0,0). That is simpler, and there are several articles and YouTube videos on how to handle a regression with no (zero) constant term like that. Once you obtain such a regression line, ##Y_2 = aX##, you have an equation for ##Y = Y_2 + 7 = aX + 7##.
 
  • #15
fahraynk said:
So, you think it's better to add this data point to the rest of the data and just let the solver figure it out on its own, rather than set a restriction such that ##\beta_1## is equal to the measured value at the first point?
Yes, I do. Restrictions can be added, but they often have unintended consequences such as introducing bias in the other regression parameters and producing residuals that are not 0 mean.
 
  • #16
FactChecker said:
I never meant for you to "subtract the first value". The idea is to subtract the theoretically known intercept value of 7 from all Y data values. That changes the regression into one with a new dependent variable, ##Y_2 = Y - 7##, whose regression line should theoretically go through (0,0). That is simpler, and there are several articles and YouTube videos on how to handle a regression with no (zero) constant term like that. Once you obtain such a regression line, ##Y_2 = aX##, you have an equation for ##Y = Y_2 + 7 = aX + 7##.
Yeah, sorry for the confusion; (0,7) is the "first value" in my data set. Interestingly, I calculated 40 different results both by subtracting the known intercept and by the matrix equations setting a restriction on the intercept that I posted above, and both produce identical results and residuals! So subtracting the known intercept must be equivalent to setting a restriction on the intercept. Thanks for pointing me to the YouTube videos.
Dale said:
Yes, I do. Restrictions can be added, but they often have unintended consequences such as introducing bias in the other regression parameters and producing residuals that are not 0 mean.
Thanks, I agree with you in not liking the restriction. Is there any way you know of that I can quantify (or understand intuitively if not quantify) the bias caused with and without restrictions?
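One way to get an intuitive handle on the trade-off is a toy simulation (all numbers invented): compare the slope estimate when the intercept is estimated freely versus forced to a fixed value, both when that fixed value is correct and when it is slightly off:

```python
# Toy bias/variance comparison for a forced versus free intercept.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 12)

def compare(true_b0, forced_b0, true_m=2.0, noise=0.5, n_sim=2000):
    X = np.column_stack([np.ones_like(x), x])
    free, forced = [], []
    for _ in range(n_sim):
        y = true_m * x + true_b0 + rng.normal(scale=noise, size=x.size)
        free.append(np.linalg.lstsq(X, y, rcond=None)[0][1])        # slope, free b0
        forced.append(np.linalg.lstsq(x[:, None], y - forced_b0,    # slope, b0 forced
                                      rcond=None)[0][0])
    return (f"free: mean={np.mean(free):.3f} sd={np.std(free):.3f} | "
            f"forced: mean={np.mean(forced):.3f} sd={np.std(forced):.3f}")

print("forced value correct (true b0=7, forced 7):  ", compare(7.0, 7.0))
print("forced value wrong   (true b0=7.5, forced 7):", compare(7.5, 7.0))
```

In this toy setup a correct restriction typically shrinks the spread of the slope estimate, while an incorrect restriction introduces a systematic bias; it is an illustration of the trade-off, not a substitute for the statistical analysis of your actual data.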
 
  • #17
fahraynk said:
Yeah, sorry for the confusion; (0,7) is the "first value" in my data set. Interestingly, I calculated 40 different results both by subtracting the known intercept and by the matrix equations setting a restriction on the intercept that I posted above, and both produce identical results and residuals! So subtracting the known intercept must be equivalent to setting a restriction on the intercept. Thanks for pointing me to the YouTube videos.
Ok. There are two possibilities:
1) You only know that the data you collected had a data point (0,7), but you have no "indisputable" theoretical reason to say that the average value at X=0 will be Y=7. In that case, you just have a typical linear regression and should do nothing unusual.
2) You have an "indisputable" theoretical reason to know that the average value at X=0 is Y=7, which is solid enough that you do not want any statistical result which disagrees with it. Then you should use a model that forces the linear regression through X=0, Y=7.

The second situation was how I interpreted your original post. All my posts above have been only for that situation.
 
  • #18
FactChecker said:
Ok. There are two possibilities:
1) You only know that the data you collected had a data point (0,7), but you have no "indisputable" theoretical reason to say that the average value at X=0 will be Y=7. In that case, you just have a typical linear regression and should do nothing unusual.
2) You have an "indisputable" theoretical reason to know that the average value at X=0 is Y=7, which is solid enough that you do not want any statistical result which disagrees with it. Then you should use a model that forces the linear regression through X=0, Y=7.

The second situation was how I interpreted your original post. All my posts above have been only for that situation.
Yeah, the second situation is correct. I know at x=0 y=7 for a fact.
So, you think that the results will be worse if I don't force the model to conform?
I want to quantify this somehow.
Most of the answers I compute are really close with or without restricting the intercept, but there is one case (out of 40) where the answer I get with a restricted model is 28, and without a restricted model the answer is 114. I am not sure which is better, or how to explain it. The residual is always better with the restricted model, but I assume that is because it is fitting one fewer data point. (There are about 12 data points in total for each case I am fitting, but the data is accurate, with very small noise variation, so I guess that fitting 11 points vs. 12 might show in the residual.)
 
  • #19
fahraynk said:
Yeah, the second situation is correct. I know at x=0 y=7 for a fact.
So, you think that the results will be worse if I don't force the model to conform?
I want to quantify this somehow.
Most of the answers I compute are really close with or without restricting the intercept, but there is one case (out of 40) where the answer I get with a restricted model is 28, and without a restricted model the answer is 114. I am not sure which is better, or how to explain it. The residual is always better with the restricted model, but I assume that is because it is fitting one fewer data point. (There are about 12 data points in total for each case I am fitting, but the data is accurate, with very small noise variation, so I guess that fitting 11 points vs. 12 might show in the residual.)
If you know for a fact that the average value of Y at X=0 must be 7, then you have no choice. All else being equal, you must give preference to a model that is correct over one that is wrong.
 
  • #20
Here is where the term "overfitting" applies to this issue. By allowing the regression model to use an intercept that is known to be wrong, it will provide a "better" fit to the data, but that fit is erroneous. That is "overfitting" because it allows a better fit due to an erroneous free constant parameter.
 
  • #21
FactChecker said:
Here is where the term "overfitting" applies to this issue. By allowing the regression model to use an intercept that is known to be wrong, it will provide a "better" fit to the data, but that fit is erroneous. That is "overfitting" because it allows a better fit due to an erroneous free constant parameter.
Ah!
Nice. Thanks!
But, it can't be that bad if the residual is lower for the case with the restriction! I guess the solutions are close because the model is trying to solve for that parameter anyway, and it has a lot of weight since there are only 12 data points. That explanation is perfect.

Dale said:
Yes, I do. Restrictions can be added, but they often have unintended consequences such as introducing bias in the other regression parameters and producing residuals that are not 0 mean.

When should this trade-off be made? Is it quantifiable? Because I have 40 data sets, and 1 out of 40 gives very different results with and without adding the restriction, while all the others produce about the same value with and without the restriction. I have to choose between the models; should I just go with the one with the lower residual?
 
  • #22
fahraynk said:
When should this trade-off be made? Is it quantifiable?
No, there isn’t a quantitative test you can apply that would justify it. Basically, you would need to have a reason so convincing that people reading it would agree it is necessary.

fahraynk said:
Because I have 40 data sets, and 1 out of 40 gives very different results with and without adding the restriction, while all the others produce about the same value with and without the restriction. I have to choose between the models; should I just go with the one with the lower residual?
Based on this, it seems like you should not use the restriction. It is difficult to justify, and it makes no difference in more than 95% of the cases. The one case where it does make a difference is very likely to be a statistical outlier, since you would expect one or more outliers among 40 data sets.
 
  • #23
fahraynk said:
Ah!
Nice. Thanks!
But, it can't be that bad if the residual is lower for the case with the restriction!
That sounds wrong. If the only thing you did was to allow a regression constant in one model and not allow it in another model, the one where the regression constant is allowed must have a smaller sum-squared residual error. It has more freedom to fit the data while ignoring the theoretical x=0,y=7 intercept value. That being said, its smaller sum-squared residual error might give the regression a worse statistical measure of fit because of its added degree of freedom.
 
  • #24
A lot of software packages change the meaning of the residuals when you fit to a model without an intercept. So you often cannot compare residuals with and without an intercept. It is important to actually read and thoroughly understand the documentation of the specific package if you choose this route.
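As a small illustration of why the fit statistics are not directly comparable (made-up data again): with no constant term, many packages report an uncentered ##R^2## computed against ##\sum y^2## rather than ##\sum (y-\bar y)^2##, and the residuals of a through-origin fit need not average to zero:

```python
# Centered vs. uncentered R^2, and the residual mean of a no-intercept fit.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 12)
y = 2 * x + 7 + rng.normal(scale=0.5, size=x.size)     # invented data

# Ordinary fit with an intercept.
X = np.column_stack([np.ones_like(x), x])
resid_with = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Shifted fit with no constant term (Y - 7 = a*X).
y2 = y - 7.0
resid_wo = y2 - x * np.linalg.lstsq(x[:, None], y2, rcond=None)[0][0]

r2_centered = 1 - np.sum(resid_with**2) / np.sum((y - y.mean())**2)
r2_uncentered = 1 - np.sum(resid_wo**2) / np.sum(y2**2)
print(f"centered R^2 (intercept model):   {r2_centered:.4f}")
print(f"uncentered R^2 (no-constant fit): {r2_uncentered:.4f}")
print(f"mean residual, no-constant fit:   {resid_wo.mean():+.4f}")   # need not be 0
```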
 
  • #25
Dale said:
A lot of software packages change the meaning of the residuals when you fit to a model without an intercept. So you often cannot compare residuals with and without an intercept. It is important to actually read and thoroughly understand the documentation of the specific package if you choose this route.
I programmed the solver in this case.
 
  • #26
FactChecker said:
That sounds wrong. If the only thing you did was to allow a regression constant in one model and not allow it in another model, the one where the regression constant is allowed must have a smaller sum-squared residual error. It has more freedom to fit the data while ignoring the theoretical x=0,y=7 intercept value. That being said, its smaller sum-squared residual error might give the regression a worse statistical measure of fit because of its added degree of freedom.
We are talking about the sum of squared error between 12 points and the prediction versus the sum of squared error between 11 points and the prediction, but the difference is small; they are all pretty close.
If the solver would choose a value near the true intercept anyway, the difference may be small, and that small difference might not be overcome by the one additional data point, no?

Thank you both for all your help, by the way. You have been really helpful and awesome, Dale and FactChecker.
 
  • #27
fahraynk said:
We are talking about the sum of squared error between 12 points and the prediction versus the sum of squared error between 11 points and the prediction, but the difference is small; they are all pretty close.
If the solver would choose a value near the true intercept anyway, the difference may be small, and that small difference might not be overcome by the one additional data point, no?
Ok. It looks to me like you are still talking about adding one point at the theoretical intercept and comparing with and without that point. Is that right? I would not recommend that at all. If you know the theoretical intercept, define a model for that (with an intercept at (0,7)) and do the statistics for that. Otherwise, allow the regression to determine a best-fit intercept as a normal part of its regression. In either case, start with the same set of sample data -- just use different models.
 
  • #28
FactChecker said:
If you know the theoretical intercept, define a model for that (with an intercept at (0,7)) and do the statistics for that.
I would not recommend that approach. It is statistically non-standard and leads to many subtle statistical problems that would need to be addressed. It is also difficult to convince readers that it must be done, and if it is done and makes a substantial difference then it indicates a flaw in either your data or your model.

I would recommend actually acquiring the X=0 data points and then doing a standard fit with the intercept.
 
  • #29
Dale said:
I would not recommend that approach. It is statistically non-standard and leads to many subtle statistical problems that would need to be addressed. It is also difficult to convince readers that it must be done, and if it is done and makes a substantial difference then it indicates a flaw in either your data or your model.

I would recommend actually acquiring the X=0 data points and then doing a standard fit with the intercept.

So I did a test: I calculated the average residual over all models, and the average difference between the true intercept and the fitted intercept.

So the answer with the subtraction method is much closer to the true intercept, and it also has a lower residual, but I don't think I can trust the lower residual.

With this information in mind, do you still agree to include the intercept in the measured data points and fit normally with no restrictions?

Thank you so much, the advice I am getting has been very VERY helpful
 
  • #30
fahraynk said:
With this information in mind, do you still agree to include the intercept in the measured data points and fit normally with no restrictions?
This is just the wrong direction to go in. If that is not a measured experimental data point, then it is a theoretical point. If it's a theoretical true value, why only include it once? Why not two or three times? Or 20 times? Pretending that the theoretical intercept is an experimental sample data point is the wrong way to go.
You should either give up the theoretical intercept and apply a regression model allowing it to freely determine the intercept, or you should use a model that forces an intercept of (0,7).

I would not recommend ignoring the theoretical known value, no matter how it changes the statistical results. I see no reason to prefer accurate statistical measures of a known invalid model. I would prefer modified statistical measures of a known valid model.
 
  • #31
Below is a simple example to illustrate the pros and cons of the two regression models.
Suppose: The true physics, without any random behavior, is the green (truth) line Y=X, and we know from theory that it goes through (0,0). Two sample data points, ##S_1## and ##S_2##, include some random behavior which puts them above the truth line and forces the typical regression (blue line, ##regr_1##) to have a nonzero Y-axis intercept at the point Int. The restricted red regression line (##regr_2##, through (0,0)) is calculated because we theoretically know that Y=0 when X=0 if there is no random behavior.

Then: The red line gives better estimates near (0,0) and worse estimates farther away. Also, its slope is worse. But it has one advantage when challenged by skeptical people -- it is correct at the Y-intercept, where a theoretical answer is known. I would prefer to avoid using the blue line as my model when it is undeniably wrong at the Y-intercept. But that depends on how the model is used. If accuracy far from the Y-intercept is more important than accuracy near the Y-intercept, then you may prefer the model that ignores the known theory of the Y-intercept value.

[Attached figure: regressionWithoutConstant.png -- the green truth line Y=X, the two sample points ##S_1## and ##S_2##, the blue unrestricted regression ##regr_1## with intercept Int, and the red restricted regression ##regr_2## through (0,0).]
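A small numeric stand-in for that figure, with two invented sample points sitting above the truth line Y = X:

```python
# Two-point illustration of a free-intercept fit vs. a fit forced through (0,0).
import numpy as np

x = np.array([1.0, 2.0])     # invented sample points S1, S2, lying above Y = X
y = np.array([1.4, 2.3])

# regr_1: free intercept (with two points this is an exact fit).
m1 = (y[1] - y[0]) / (x[1] - x[0])
b1 = y[0] - m1 * x[0]

# regr_2: forced through the origin; least-squares slope = sum(xy)/sum(x^2).
m2 = np.sum(x * y) / np.sum(x * x)

print(f"regr_1: Y = {m1:.2f}*X + {b1:.2f}   (intercept 'Int' = {b1:.2f}, not 0)")
print(f"regr_2: Y = {m2:.2f}*X              (forced through (0,0))")
```

With these particular invented numbers, regr_2 is exact at the origin but its slope (1.20) sits farther from the truth line's slope (1.00) than regr_1's slope (0.90), matching the description above.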
  • #32
fahraynk said:
With this information in mind, do you still agree to include the intercept in the measured data points and fit normally with no restrictions?
Yes, but only if you actually measure the intercept. So it should be a real data point treated just like any other real data point.
 
  • #33
FactChecker said:
I would not recommend ignoring the theoretical known value, no matter how it changes the statistical results. I see no reason to prefer accurate statistical measures of a known invalid model. I would prefer modified statistical measures of a known valid model.
That can be done, but it has to be a very explicit and convincing argument. Personally, when I see a statistical model done without an intercept I am instantly highly suspicious. The burden of proof on the scientist is much more stringent, and frankly, in some infamous papers where I have seen this done, it was done poorly and rendered the conclusion completely unbelievable.

So overall, my opinion is opposite yours. I would rather use a standard and reasonable statistical process and look at the fitted intercept as a check on the quality of the data and the model.
 
  • #34
Dale said:
That can be done, but it has to be a very explicit and convincing argument.
I agree completely with that. The theoretical reason behind the Y-intercept of 7 must be hard, solid science. If there is any doubt, the standard regression intercept should be allowed.
 
  • #35
@fahraynk FYI, both @FactChecker and I also agree about this point:
FactChecker said:
If that is not a measured experimental data point, then it is a theoretical point. If it's a theoretical true value, why only include it once? Why not two or three times? Or 20 times? Pretending that the theoretical intercept is an experimental sample data point is the wrong way to go.
So don’t make a fictitious data point.
 
