Variable Normalization for different variable ranges

AI Thread Summary
Normalization is important in feature scaling, especially when independent variables have significantly different ranges, as it can prevent variables with larger ranges from disproportionately influencing the dependent variable. While normalization is not always necessary, it is beneficial when variables differ greatly in magnitude and variance, as this can improve the stability of calculations. High correlation between independent variables can complicate interpretations, but it does not inherently invalidate their inclusion in a model; both correlated variables can provide useful information about the dependent variable. Stepwise regression is suggested as a method to address issues with correlated independent variables by assessing their significance after accounting for each other. Overall, careful consideration of normalization and correlation is essential for effective model building.
fog37
TL;DR Summary
Understand when and when not to normalize the range of the independent variables...
Hello,

On the topic of feature scaling: I am wondering if normalization needs to be used all the time or only in some particular circumstances. Normalization means transforming/remapping the range of a variable with values ##[x_0,x_f]## to the range ##[0,1]##.
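For concreteness, here is a minimal sketch of that remapping (often called min-max scaling) in Python with NumPy; the function name is just for illustration:

```python
import numpy as np

def min_max_normalize(x):
    # Remap values from [x.min(), x.max()] to [0, 1].
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

x1 = np.array([40.0, 120.0, 850.0, 1900.0])
print(min_max_normalize(x1))  # every value now lies in [0, 1]
```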

For example, let's consider a linear regression model with 3 independent variables and one dependent variable: $$Y = a X_1 + b X_2 + c X_3$$
The independent variables ##X_1, X_2, X_3## will often have very different ranges. For example, ##X_1## may take values between 0 and 2000 while ##X_3## takes values only between 0 and 0.5. Is that an issue? Would the variable with the largest range influence the dependent variable ##Y## more strongly just because of its wider range, and not because it is truly important? I don't see normalization being applied all the time...

Is it always good practice, no matter the model we are going for, to first normalize all the independent variables so that their values all fall within the same range?

Another possible issue with independent variables is that they may be pairwise linearly correlated: too much correlation is not good. How much correlation can we accept? Is such correlation undesirable because it leads us to infer a correlation between ##Y## and, say, ##X_1## merely by proxy through another independent variable, say ##X_2##, when ##X_1## and ##X_2## are correlated? I don't see a problem with that...

Thank you!
 
fog37 said:
Is it always good practice, no matter the model we are going for, to first normalize all the independent variables so that their values all fall within the same range?
I don't think it ever hurts. Some standard algorithms will automatically normalize them every time. The times when you should really consider it are when the variables differ greatly in magnitude and variance. Those are the cases where some matrices in the calculations become "ill-conditioned": small changes due to round-off and limited accuracy can make significant changes in the solution.
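To illustrate the ill-conditioning point, here is a small NumPy sketch (with made-up data) that compares the condition number of the normal-equations matrix ##X^T X## before and after min-max scaling the predictors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Predictors with very different scales, as in the example above.
x1 = rng.uniform(0, 2000, n)
x3 = rng.uniform(0, 0.5, n)
X = np.column_stack([np.ones(n), x1, x3])  # include an intercept column

# Condition number of X'X, the matrix inverted in ordinary least squares.
print(np.linalg.cond(X.T @ X))

# The same after min-max scaling both predictors to [0, 1].
scale = lambda v: (v - v.min()) / (v.max() - v.min())
Xs = np.column_stack([np.ones(n), scale(x1), scale(x3)])
print(np.linalg.cond(Xs.T @ Xs))  # typically orders of magnitude smaller
```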
fog37 said:
Another possible issue with independent variables is that they may be pairwise linearly correlated: too much correlation is not good. How much correlation can we accept? Is such correlation undesirable because it leads us to infer a correlation between ##Y## and, say, ##X_1## merely by proxy through another independent variable, say ##X_2##, when ##X_1## and ##X_2## are correlated? I don't see a problem with that...
There is no problem with the conclusion that ##Y## and ##X_1## are correlated just because ##X_2## and ##X_1## are correlated. Both correlations exist. You should be very careful about drawing any conclusions regarding cause and effect, but either ##X_1## or ##X_2## can be used to estimate ##Y##.
There is a process called "stepwise regression" that might interest you. It treats the issue of correlated independent variables directly.
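As a rough illustration of the idea, here is a sketch of one common variant (forward selection) using statsmodels; the significance threshold and function name are arbitrary choices for illustration, not a fixed recipe:

```python
import statsmodels.api as sm

def forward_stepwise(X, y, alpha=0.05):
    # X: pandas DataFrame of candidate predictors; y: response.
    # Greedily add the predictor whose coefficient is most significant,
    # given the predictors already in the model.
    remaining = list(X.columns)
    selected = []
    while remaining:
        best_p, best_var = 1.0, None
        for var in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
            if fit.pvalues[var] < best_p:
                best_p, best_var = fit.pvalues[var], var
        if best_var is None or best_p >= alpha:
            break  # no remaining variable is significant enough
        selected.append(best_var)
        remaining.remove(best_var)
    return selected
```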
 
Thank you. I think highly correlated independent variables are called "confounding variables". An example I found online: sunburns and ice cream consumption.

Ice cream consumption is highly correlated with sunburn. That obviously does not mean ice cream consumption causes sunburn (correlation does not imply causation).

But should we eliminate one of those independent variables from our model because of the high correlation?
 
fog37 said:
Thank you. I think highly correlated independent variables are called "confounding variables". An example I found online: sunburns and ice cream consumption.

Ice cream consumption is highly correlated with sunburn. That obviously does not mean ice cream consumption causes sunburn (correlation does not imply causation).

But should we eliminate one of those independent variables from our model because of the high correlation?
Suppose you want to predict ##Y## based on correlated ##X_1## and ##X_2##. Suppose you start with a model that includes only the variable, say ##X_1##, most highly correlated with ##Y##. What to do about ##X_2##?
Suppose that you remove the correlation of ##X_2## with ##X_1## to get a residual variable, ##\hat{X_2}##. Likewise, you can remove the correlation of ##Y## with ##X_1## to get a residual variable, ##\hat{Y}##. Then the question is whether, having accounted for ##X_1##, there is a significant enough remaining correlation between the residual variables ##\hat{Y}## and ##\hat{X_2}## to justify including ##X_2##.
I suggest that you take a hard look at the process of stepwise regression if you have further questions.
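A small NumPy sketch of that residual ("partialling out") step, with made-up data; deciding what counts as "significant enough" is left to a proper statistical test:

```python
import numpy as np

def residuals(v, u):
    # Residuals of v after regressing it on u (with an intercept).
    U = np.column_stack([np.ones_like(u), u])
    beta, *_ = np.linalg.lstsq(U, v, rcond=None)
    return v - U @ beta

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=200)  # X2 correlated with X1
y = 2.0 * x1 + 1.0 * x2 + rng.normal(size=200)

x2_hat = residuals(x2, x1)  # the part of X2 not explained by X1
y_hat = residuals(y, x1)    # the part of Y not explained by X1

# Correlation of the residual variables: the partial correlation of
# Y and X2 after accounting for X1.
print(np.corrcoef(y_hat, x2_hat)[0, 1])
```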
 