How to perform a fit with correlated variables

In summary: the original poster needs to fit a model in which two parameters, p and q, are very highly correlated (they enter the model as p+q plus a much smaller term carrying q alone) and is unsure how best to present the results. Letting both parameters vary gives a very good RMS error but a large uncertainty on p; fixing q at its fitted value keeps the RMS error while shrinking the quoted uncertainty on p tenfold. The respondents discuss stepwise regression, eliminating one of the collinear variables and comparing the candidate models with AIC or BIC, and, if both parameters must be kept for physical reasons, restricting inferences to predictions or reporting the full covariance matrix (or the uncertainties on p+q and q).
  • #1
BillKet
Hello! I need to perform a fit with several variables, and two of them are very correlated (above 0.99). The functional form involving these two variables is something like ##(p+q)x+qf(x)##, where ##f(x)## contains polynomials and some square roots of x, but the coefficients appearing in ##f(x)## are much smaller than one, for example something like ##10^{-7} x^2## (for completeness, though not very relevant to my questions: this comes from fitting the p and q lambda-doubling parameters of a ##^2\Pi_{1/2}## state in a diatomic molecule).

If I keep both p and q as free variables, I end up with values around p=0.1 and q=0.001, with the error on both of order 0.0001 and a very good RMS error for the points used in the fit. If I fix q at zero, the uncertainty on p becomes 10 times smaller, but the RMS error is about 50% bigger. I also tried fixing q at its fitted value, i.e. q=0.001, and fitting just for p. In that case the RMS error was as good as initially (even slightly better) and the uncertainty on p was 10 times smaller than initially.

I am not sure what the best way to present my results is. If I let both p and q vary, the uncertainty on p is big, but that doesn't feel like it reflects the truth, as that error is mainly driven by q, since they appear as p+q. If I fixed q=0.001, the quoted errors on q and p would differ by a factor of 10, and I am not sure that makes sense mathematically, since they do appear as p+q. Can someone advise me on the best way to proceed? Thank you!
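For concreteness, here is a minimal sketch of the situation described above. Everything in it (the exact model, the data, the noise level) is a hypothetical stand-in for the real spectroscopy fit; it only illustrates how the free fit yields large, almost perfectly anti-correlated errors on p and q, while fixing q shrinks the quoted error on p:

```python
# Hypothetical illustration of the post above: the data, noise level, and
# exact model are stand-ins, not the poster's real spectroscopy fit.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(1, 100, 50)

def model(x, p, q):
    # (p+q)*x plus the tiny f(x)-like term carrying q alone
    return (p + q) * x + q * 1e-7 * np.sqrt(x)

y = model(x, 0.1, 0.001) + rng.normal(0, 1e-4, x.size)

# Both parameters free: p and q enter almost only through p+q, so their
# individual uncertainties blow up and their correlation is essentially -1.
popt, pcov = curve_fit(model, x, y, p0=[0.1, 0.001])
perr = np.sqrt(np.diag(pcov))
print("free fit: p = %.4g +/- %.3g, q = %.4g +/- %.3g, corr = %.6f"
      % (popt[0], perr[0], popt[1], perr[1], pcov[0, 1] / (perr[0] * perr[1])))

# q fixed at its fitted value: the quoted error on p shrinks dramatically,
# but it no longer reflects the uncertainty that really lives in p+q.
popt_f, pcov_f = curve_fit(lambda x, p: model(x, p, popt[1]), x, y, p0=[0.1])
print("fixed q:  p = %.4g +/- %.3g" % (popt_f[0], np.sqrt(pcov_f[0, 0])))
```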
 
  • #2
You do not say if you are using linear regression or some other technique. Forward stepwise linear regression would build the model with the most strongly correlated variable first. Then it would remove the correlated part from the other variables and see if it is statistically reasonable to introduce the remainder into the model. There are techniques called forward selection, backward elimination, and bidirectional elimination.
See https://en.wikipedia.org/wiki/Stepwise_regression
There are critics of these methods, but that is true of all statistical methods. All statistical methods should be used wisely.
If you are using some non-linear model, I think that you could still remove the correlated part of one of your variables and see if the remainder is statistically reasonable to add to the model after the first one is included.
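As an illustration of that idea, here is a minimal sketch (with hypothetical predictor columns x3 and x4 shaped like the correlated pair discussed later in the thread) of removing the correlated part of one variable and testing what the remainder adds:

```python
# Sketch of "remove the correlated part": project the second, nearly
# duplicate regressor onto the first and test what the remainder adds.
# x3/x4 and all numbers are hypothetical stand-ins, not the real data.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 100, 50)
x3 = x                                   # first predictor
x4 = x + 1e-7 * np.sqrt(x)               # almost the same column
y = 0.1 * x3 + 0.001 * x4 + rng.normal(0, 1e-4, x.size)

# Orthogonal remainder of x4 after removing its projection onto x3.
beta = np.dot(x3, x4) / np.dot(x3, x3)
remainder = x4 - beta * x3

# Fit y on x3 alone, then see how much residual variance the remainder explains.
c3 = np.dot(x3, y) / np.dot(x3, x3)
resid = y - c3 * x3
partial_r2 = np.dot(remainder, resid) ** 2 / (
    np.dot(remainder, remainder) * np.dot(resid, resid))
print("residual variance explained by the remainder:", partial_r2)
# A value near zero means the second variable adds nothing the first lacks.
```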
 
  • #3
FactChecker said:
You do not say if you are using linear regression or some other technique. Forward stepwise linear regression would build the model with the most strongly correlated variable first. Then it would remove the correlated part from the other variables and see if it is statistically reasonable to introduce the remainder into the model. There are techniques called forward selection, backward elimination, and bidirectional elimination.
See https://en.wikipedia.org/wiki/Stepwise_regression
There are critics of these methods, but that is true of all statistical methods. All statistical methods should be used wisely.
I am not sure what you mean by linear regression. Isn't that applicable only when the dependence is linear? I am using least-squares fitting.
 
  • #4
BillKet said:
I am not sure what you mean by linear regression. Isn't that applicable only when the dependence is linear? I am using least-squares fitting.
Linear regression uses least-squares fitting and is not as restrictive as you might initially think.
Suppose you are looking for the relationship between ##X## and ##Y##, with ##Y## a function of ##X##.
The regression finds the least-squares linear model, but you can apply it to non-linear relationships. You can try linear regression on a model ##Y = aX+b##, but if the relationship looks more like ##Y = aX^2+b##, you can apply linear regression on that. Just square all the ##X## data.
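A tiny worked example of that point, with made-up numbers: the model is non-linear in X but linear in the parameters, so ordinary least squares on the transformed regressor recovers the coefficients:

```python
# Fit Y = a*X**2 + b by ordinary linear least squares on the regressor X**2.
import numpy as np

rng = np.random.default_rng(2)
X = np.linspace(0, 5, 40)
Y = 3.0 * X**2 + 1.5 + rng.normal(0, 0.5, X.size)

A = np.column_stack([X**2, np.ones_like(X)])   # design matrix [X^2, 1]
(a, b), *_ = np.linalg.lstsq(A, Y, rcond=None)
print(f"a = {a:.3f}, b = {b:.3f}")             # should land near 3 and 1.5
```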
 
  • #5
FactChecker said:
Linear regression uses least-squares fitting and is not as restrictive as you might initially think.
Suppose you are looking for the relationship between ##X## and ##Y##, with ##Y## a function of ##X##.
The regression finds the least-squares linear model, but you can apply it to non-linear relationships. You can try linear regression on a model ##Y = aX+b##, but if the relationship looks more like ##Y = aX^2+b##, you can apply linear regression on that. Just square all the ##X## data.
But my relationship is a lot more complicated than that. For example, I have something of the form:

$$Bx(x+1)+D(x(x+1))^2+(p+q)x+q\cdot 10^{-7}\sqrt{x}$$

I do know the functional form of my equation; I don't understand how I can fit a line to this.
 
  • #6
BillKet said:
Can someone advise me on the best way to proceed?
What you are running into is called multicollinearity. Or maybe, since it is just two correlated variables, just collinearity.

The easiest thing to do is to just eliminate one of the collinear variables. You can use the AIC or the BIC to choose which model is better if you don’t have a good theoretical reason for choosing one. Or use a more rigorous model-building approach like stepwise regression.

You can keep both parameters as long as you are not trying to make inferences about the parameter values. Keeping both will still give good fits to the data, but the parameter values themselves are fundamentally unstable.
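As a hedged sketch of the AIC/BIC comparison (the formulas below follow one common convention for Gaussian least-squares fits, and the residual sums of squares are placeholders, not real fit output):

```python
# One common convention for AIC/BIC in a Gaussian least-squares fit,
# up to model-independent constants; the RSS values below are placeholders.
import numpy as np

def aic_bic(rss, n, k):
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

n = 50                                    # hypothetical number of data points
rss_full, rss_reduced = 4.8e-7, 1.1e-6    # hypothetical residual sums of squares
print("full model (p and q free):", aic_bic(rss_full, n, k=2))
print("reduced model (q fixed):  ", aic_bic(rss_reduced, n, k=1))
# Whichever model has the lower AIC (or BIC) is preferred by that criterion.
```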
 
  • #7
Sorry. I jumped to conclusions before thoroughly reading your initial post. It looks to me as though the last term will be hard to determine since its contribution is so small. I assume that ##B##, ##D##, ##p##, and ##q## are the unknown parameters. In that case you might consider applying linear regression to the model ##Y=BX_1+DX_2+pX_3+qX_4##, where ##X_1=x(x+1)##, ##X_2=(x(x+1))^2##, ##X_3=x##, and ##X_4=x+10^{-7}\sqrt x##.
I see your point that ##pX_3+qX_4## is problematic, nearly redundant. I wonder what a stepwise linear regression would do with it.
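Here is a short sketch of that design matrix with hypothetical x values; the condition number makes the near-redundancy of X3 and X4 visible even after the columns are rescaled:

```python
# The design matrix from the post above, with hypothetical x values.
import numpy as np

x = np.linspace(1, 100, 50)
X = np.column_stack([
    x * (x + 1),             # X1
    (x * (x + 1)) ** 2,      # X2
    x,                       # X3
    x + 1e-7 * np.sqrt(x),   # X4, nearly identical to X3
])

# Raw condition number mixes column-scale effects with true collinearity,
# so also check it after normalizing each column to unit length.
print("condition number (raw):   ", np.linalg.cond(X))
Xn = X / np.linalg.norm(X, axis=0)
print("condition number (scaled):", np.linalg.cond(Xn))
# The scaled number stays enormous because X3 and X4 are nearly parallel:
# the least-squares solution exists, but p and q are barely identified.
```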
 
  • #8
Dale said:
What you are running into is called multicollinearity. Or maybe, since it is just two correlated variables, just collinearity.

The easiest thing to do is to just eliminate one of the collinear variables. You can use the AIC or the BIC to choose which model is better if you don’t have a good theoretical reason for choosing one. Or use a more rigorous model-building approach like stepwise regression.

You can keep both parameters as long as you are not trying to make inferences about the parameter values. Keeping both will still give good fits to the data, but the parameter values themselves are fundamentally unstable.
So by eliminating one parameter, do you mean setting it to zero? Based on the physics model upon which this equation is built, I do need both parameters. Basically, if I set q=0, p can take over the initial q value in the p+q term, but the sqrt term will vanish, and hence the model will be wrong. If I set p to zero, q would need to become 2 orders of magnitude bigger to take over the p+q part, but then the sqrt part would be too big. I am not sure how I can get rid of one of the parameters without using a wrong model.
 
  • #9
BillKet said:
So by eliminating one parameter, do you mean setting it to zero? Based on the physics model upon which this equation is built, I do need both parameters. Basically, if I set q=0, p can take over the initial q value in the p+q term, but the sqrt term will vanish, and hence the model will be wrong. If I set p to zero, q would need to become 2 orders of magnitude bigger to take over the p+q part, but then the sqrt part would be too big. I am not sure how I can get rid of one of the parameters without using a wrong model.
Then I think you should try a standard linear regression that will force both terms into the model of post #7 and see what you get. At least it would fit your theory. It would be the least-squares model.
As @Dale says, it would be a very ill-conditioned problem.
 
  • #10
FactChecker said:
Then I think you should try a standard linear regression that will force both terms into the model of post #7 and see what you get. At least it would fit your theory. It would be the least-squares model.
As @Dale says, it would be a very ill-conditioned problem.
I see what you mean by linear regression in this case, thanks! But the way I did the fit was basically like that, i.e. I forced both terms into the fit. The fit looks great, and the values of p and q are around the values I would expect from theory. My only concern is with the uncertainties on p and q. I saw in other molecular physics papers that people fix one of the parameters when it is very correlated with another, but I am not sure how to quote the errors in that case. I guess it depends on the field (and hence the readers), but I was wondering how you would quote the values and uncertainties in this situation.
 
  • #11
BillKet said:
I see what you mean by linear regression in this case, thanks! But the way I did the fit was basically like that, i.e. I forced both terms into the fit. The fit looks great, and the values of p and q are around the values I would expect from theory. My only concern is with the uncertainties on p and q. I saw in other molecular physics papers that people fix one of the parameters when it is very correlated with another, but I am not sure how to quote the errors in that case. I guess it depends on the field (and hence the readers), but I was wondering how you would quote the values and uncertainties in this situation.
I'm sorry that I don't feel qualified to answer that question. Perhaps others with knowledge of the molecular physics papers that you refer to can give you better advice. You might want to provide links to those papers and ask specific questions about them. In that case, there might be a better section of this forum to ask the question.
 
  • #12
FactChecker said:
I'm sorry that I don't feel qualified to answer that question. Perhaps others with knowledge of the molecular physics papers that you refer to can give you better advice. You might want to provide links to those papers and ask specific questions about them. In that case, there might be a better section of this forum to ask the question.
Oh, sorry for the confusion. I meant: assuming you were to publish this in your own field (not molecular spectroscopy), how would you present your results?
 
  • #13
BillKet said:
Oh, sorry for the confusion. I meant: assuming you were to publish this in your own field (not molecular spectroscopy), how would you present your results?
Sorry. This is a very extreme case, where one term is seven orders of magnitude smaller than the other. I have no experience with that, other than numerical issues on the computer.
 
  • #14
FactChecker said:
Sorry. This is a very extreme case, where one term is seven orders of magnitude smaller than the other. I have no experience with that, other than numerical issues on the computer.
That's totally OK, thanks a lot for the insights! Just for reference (and for others reading): the resolution of the experiment is good enough that the sqrt term does make a difference when performing the fit.
 
  • #15
BillKet said:
That's totally OK, thanks a lot for the insights! Just for reference (and for others reading): the resolution of the experiment is good enough that the sqrt term does make a difference when performing the fit.
Stepwise regression and analysis of variance (ANOVA) methods would calculate the coefficient of partial determination to see whether the additional term (with appropriately adjusted coefficients) is statistically justified. There is a probability associated with that ratio, but I do not know whether that is appropriate for your application.
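For concreteness, here is a minimal sketch of such a test on hypothetical data shaped like this thread's problem (the real data may well give a different verdict):

```python
# Coefficient of partial determination with its F-test, on hypothetical data
# shaped like the thread's problem (the real data may behave differently).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.linspace(1, 100, 50)
X_red = np.column_stack([x])                          # reduced model: X3 only
X_full = np.column_stack([x, x + 1e-7 * np.sqrt(x)])  # full model: X3 and X4
y = 0.101 * x + 1e-10 * np.sqrt(x) + rng.normal(0, 1e-4, x.size)

def rss(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return r @ r

rss_red, rss_full = rss(X_red, y), rss(X_full, y)
partial_r2 = (rss_red - rss_full) / rss_red
n, k_extra, k_full = len(y), 1, 2
F = ((rss_red - rss_full) / k_extra) / (rss_full / (n - k_full))
p_value = stats.f.sf(F, k_extra, n - k_full)
print(f"partial R^2 = {partial_r2:.3g}, F = {F:.3g}, p = {p_value:.3g}")
# A small p-value would say the data statistically justify the extra term.
```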
 
  • #16
BillKet said:
So by eliminating one parameter, do you mean setting it to zero?
I actually mean a model without that parameter at all. Sometimes a model without a given parameter is equivalent to a model with the parameter set to zero, sometimes set to one, sometimes some other value. It depends on the model.

BillKet said:
Based on the physics model upon which this equation is built, I do need both parameters.
OK, that is fine then. But you cannot make inferences about the values of the two parameters. You need to restrict your use of the model to making inferences about predictions. The predictions will still be valid even though the parameter estimates will not.
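A quick numerical illustration of that point, using the same hypothetical toy model as earlier: two very different splits of p+q give essentially the same predicted curve:

```python
# Two very different (p, q) splits with the same p+q give nearly identical
# predictions (hypothetical model and numbers).
import numpy as np

def model(x, p, q):
    return (p + q) * x + q * 1e-7 * np.sqrt(x)

x = np.linspace(1, 100, 50)
y1 = model(x, 0.100, 0.001)   # p+q = 0.101
y2 = model(x, 0.051, 0.050)   # same p+q, wildly different split
print("max prediction difference:", np.max(np.abs(y1 - y2)))  # ~5e-8
```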
 
  • #17
BillKet said:
If I let both p and q vary, the uncertainty on p is big, but that doesn't feel like it reflects the truth, as that error is mainly driven by q, since they appear as p+q. If I fixed q=0.001, the quoted errors on q and p would differ by a factor of 10, and I am not sure that makes sense mathematically, since they do appear as p+q.
If I understand you correctly, I believe your analysis is spot on and there isn't anything you can do about it. The data simply don't contain any more information than this: they pin down p+q with high certainty, but not p on its own. I believe no trickery will get around that fundamental issue.

If you want to test it, try comparing the covariance matrices you get when you fit the variables (p,q) and (p+q,q). In the former case you should see large off-diagonal terms, while in the latter case I believe the off-diagonal correlation term will be much smaller. If you want to present your results with no ambiguity, I would present the whole covariance matrix for (p,q). Alternatively, you could just present the errors on p+q and q (assuming their correlation is small) and put a note in the supplementary materials of your paper (if it has one). I think both of those would be very honest and upfront presentations of your result.
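A sketch of that comparison on the hypothetical toy model used earlier in this thread (not the real data):

```python
# Refit in the reparameterized variables s = p + q and q, and compare the
# correlation coefficients from the two covariance matrices (toy data).
import numpy as np
from scipy.optimize import curve_fit

def model_pq(x, p, q):
    return (p + q) * x + q * 1e-7 * np.sqrt(x)

def model_sq(x, s, q):        # same model with s = p + q
    return s * x + q * 1e-7 * np.sqrt(x)

def corr(pcov):
    err = np.sqrt(np.diag(pcov))
    return pcov[0, 1] / (err[0] * err[1])

rng = np.random.default_rng(4)
x = np.linspace(1, 100, 50)
y = model_pq(x, 0.1, 0.001) + rng.normal(0, 1e-4, x.size)

_, cov_pq = curve_fit(model_pq, x, y, p0=[0.1, 0.001])
_, cov_sq = curve_fit(model_sq, x, y, p0=[0.101, 0.001])
print("corr(p, q)   =", corr(cov_pq))  # essentially -1
print("corr(p+q, q) =", corr(cov_sq))  # farther from -1; how much depends on the data
```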

I hope that was helpful!
 

FAQ: How to perform a fit with correlated variables

1. What is a fit with correlated variables?

A fit with correlated variables is a regression or least-squares analysis in which two or more of the quantities involved (predictors or fitted parameters) are not independent of each other: the value of one carries information about the value of another. This correlation must be accounted for when estimating the fit and interpreting its results.

2. Why is it important to perform a fit with correlated variables?

Performing a fit that accounts for correlated variables is important because it allows a more accurate analysis of the data. Ignoring the correlation can bias the results and, in particular, produce misleading parameter uncertainties, leading to incorrect conclusions. By accounting for the correlation, the relationship between the variables can be properly characterized.

3. What are some common methods for performing a fit with correlated variables?

Some common methods for performing a fit with correlated variables include using a multivariate regression model, calculating a correlation coefficient, and using a generalized least squares approach. Each method has its own advantages and may be more suitable depending on the specific data and research question.
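As a minimal illustration of the last of these, here is a generalized least squares sketch using statsmodels; the covariance structure and all numbers are assumptions for the example:

```python
# Minimal generalized least squares sketch with statsmodels, assuming the
# error covariance matrix Sigma is known; all names and numbers are made up.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 30
X = sm.add_constant(np.linspace(0.0, 1.0, n))   # columns: [1, x]
# AR(1)-like covariance: nearby points share correlated errors.
idx = np.arange(n)
Sigma = 0.01 * 0.5 ** np.abs(np.subtract.outer(idx, idx))
y = X @ np.array([1.0, 2.0]) + rng.multivariate_normal(np.zeros(n), Sigma)

fit = sm.GLS(y, X, sigma=Sigma).fit()
print("estimates: ", fit.params)   # near [1.0, 2.0]
print("std errors:", fit.bse)
```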

4. How do you interpret the results of a fit with correlated variables?

The results of a fit with correlated variables can be interpreted by looking at the coefficient estimates together with their uncertainties or p-values. The sign of a coefficient indicates the direction of its association with the response, holding the other variables fixed, and the p-value indicates the statistical significance of that association. Note that when variables are strongly correlated, individual coefficients and their p-values become unstable, so inferences are safer on well-determined combinations (such as p+q above) or on the model's predictions.

5. What are some potential challenges when performing a fit with correlated variables?

Some potential challenges when performing a fit with correlated variables include identifying the correct variables to include in the analysis, dealing with multicollinearity (high correlation between predictor variables), and choosing the most appropriate method for the specific data and research question. It is important to carefully consider these challenges and choose the best approach for the analysis.
