Exploring the Benefits of Linearizing Data - Improve Analysis and Visualization

  • #1
fog37
TL;DR Summary
Why linearize data that already fit a certain relationship?
Hello,

There is a physical phenomenon in which the variable ##X## is related to the variable ##Y## by a cubic relationship, i.e. $$Y= k X^3$$
The data I collected, ##(X,Y)##, seem to fit this relationship well: I used Excel to fit the data to a power-law function (third power), and there is good agreement...

What would I gain by linearizing the data? That would be achieved by plotting ##Y## versus ##X^3##, and the data should then follow a linear trend. The best-fit line would be a straight line with slope ##k## and intercept ##0##. I don't think there would be any benefit in linearizing the data, since the power-law best fit already seems to do the job...
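For concreteness, here is a minimal NumPy sketch of that zero-intercept linearized fit; the data, the noise level, and the value ##k=2## are all made up for illustration:

```python
import numpy as np

# Hypothetical data roughly following Y = k X^3 with k = 2 (made-up values,
# standing in for the experimental (X, Y) pairs described above)
rng = np.random.default_rng(0)
X = np.linspace(1.0, 10.0, 20)
Y = 2.0 * X**3 + rng.normal(0.0, 5.0, X.size)

# Linearize by regressing Y on X^3: a straight line through the origin,
# whose slope is the estimate of k.
X3 = X**3
k_hat = np.sum(X3 * Y) / np.sum(X3**2)   # least-squares slope, zero intercept
print(f"estimated k = {k_hat:.3f}")      # should come out close to 2.0
```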

Thank you for any input.
 
  • #2
Data is sometimes linearized in order to apply stability theory that applies to linear systems. The aerodynamics of an airplane is linearized at every flight condition (including surface positions) to see what its stability properties are.
 
  • Like
Likes DeBangis21 and fog37
  • #3
I also think that the R-squared value Excel generates for the power-law fit is meaningful, since ##Y= k X^3## is a linear model...

So maybe we could linearize a truly nonlinear model (nonlinear in the statistical sense) by transforming the data so that the best-fit line is a straight line, and then calculate the ##R^2##, which would give us an idea of how well the nonlinear model fits the data...
 
  • #4
But if we were set on linearizing ##Y=k X^3##, we would take the log of both sides of the equation and get $$\log(Y) = 3 \log(X) + \log(k)$$
Couldn't we just linearize the power law ##Y=k X^3## by simply creating two new variables, ##Y_{new}=Y## and ##X_{new}= X^3## and plotting ##Y_{new}## versus ##X_{new}##?
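Both routes can be checked numerically. A minimal NumPy sketch, using illustrative noise-free data with an assumed ##k=2##:

```python
import numpy as np

# Illustrative noise-free data with an assumed k = 2
X = np.linspace(1.0, 10.0, 20)
Y = 2.0 * X**3

# (a) Substitution: regress Y on X_new = X^3 -> slope k, intercept 0
slope, intercept = np.polyfit(X**3, Y, 1)
print(slope, intercept)           # ~2.0, ~0.0

# (b) Log transform: regress log(Y) on log(X) -> slope 3, intercept log(k)
b1, b0 = np.polyfit(np.log(X), np.log(Y), 1)
print(b1, np.exp(b0))             # ~3.0, ~2.0
```

The substitution recovers ##k## directly as the slope, while the log-log fit returns the exponent as the slope and ##\log(k)## as the intercept.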

Thanks
 
  • #5
To me, the term "linearize" at a point ##x=x_0## means to approximate the function ##f(x)## around ##x_0## with its tangent line, ##f(x_0) + f'(x_0)(x - x_0)##.
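A tiny sketch of that sense of the word, using a hypothetical ##f(x)=x^3## and expansion point ##x_0=2##:

```python
# Tangent-line linearization of f(x) = x^3 at x0 = 2:
#   f(x) ~ f(x0) + f'(x0) * (x - x0)
def f(x):
    return x**3

def fprime(x):
    return 3 * x**2

x0 = 2.0

def tangent(x):
    return f(x0) + fprime(x0) * (x - x0)

for x in (1.9, 2.0, 2.1, 3.0):
    print(x, f(x), tangent(x))  # close near x0, increasingly off away from it
```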
 
  • #6
fog37 said:
But if we were set on linearizing ##Y=k X^3##, we would take the log of both sides of the equation and get $$\log(Y) = 3 \log(X) + \log(k)$$

This formulation assumes that ##X##, ##Y##, and ##k## are strictly positive.

Couldn't we just linearize the power law ##Y=k X^3## by simply creating two new variables, ##Y_{new}=Y## and ##X_{new}= X^3## and plotting ##Y_{new}## versus ##X_{new}##?

This formulation does not.

Also: if ##X## is not known exactly, is the uncertainty in ##X^3## (proportional to ##X^2##) or in ##\ln X## (proportional to ##X^{-1}##) going to be larger?
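To make the comparison concrete, a first-order error-propagation sketch (the uncertainty ##dX = 0.05## is a made-up value):

```python
import numpy as np

# First-order propagation of an uncertainty dX through the two transforms:
#   sigma(X^3)  = |d(X^3)/dX|  * dX = 3 X^2 * dX   (grows with X)
#   sigma(ln X) = |d(ln X)/dX| * dX = dX / X       (shrinks with X)
X = np.array([0.5, 1.0, 2.0, 5.0, 10.0])
dX = 0.05  # hypothetical measurement uncertainty in X

print("sigma(X^3): ", 3 * X**2 * dX)
print("sigma(ln X):", dX / X)
```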
 
  • Like
Likes fog37
  • #7
@fog37, you have not said whether there is any random behavior in your problem. If there is, then one good reason to transform to a linear form might be to get the random behavior into the form of an added normal random variable. Then the results of linear statistical analysis can be applied.

Suppose your original problem is of the form ##Y = r X^3##, where ##r## is a random multiplier with a mean of 1; that is, the random behavior is proportional to the size of ##X^3##. If you can transform the problem into the form ##\log(Y) = a_1 \log(X) + a_0 + \epsilon##, where ##\epsilon## is a normal random variable, then you can apply linear statistical analysis to obtain estimators of the parameters and their associated statistical properties. Those results can be applied to the original problem in the form ##Y = e^{\epsilon} e^{a_0} X^{a_1}##.
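A simulation sketch of that setup (assumed true values ##k=2## and exponent 3), showing that OLS on the logs recovers the parameters:

```python
import numpy as np

# Simulate Y = k * X^3 * e^eps with multiplicative noise (assumed k = 2)
rng = np.random.default_rng(1)
X = np.linspace(1.0, 10.0, 500)
eps = rng.normal(0.0, 0.1, X.size)
Y = 2.0 * X**3 * np.exp(eps)

# After the log transform the noise is additive and normal,
#   log(Y) = a1 * log(X) + a0 + eps,
# so ordinary least squares applies directly:
a1, a0 = np.polyfit(np.log(X), np.log(Y), 1)
print(f"a1 = {a1:.3f} (true 3),  k = e^a0 = {np.exp(a0):.3f} (true 2)")
```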
 
  • Like
Likes Dale
  • #8
pasmith said:
This formulation assumes that ##X##, ##Y##, and ##k## are strictly positive.
This formulation does not.

Also: if ##X## is not known exactly, is the uncertainty in ##X^3## (proportional to ##X^2##) or in ##\ln X## (proportional to ##X^{-1}##) going to be larger?
Ok, for simplicity, let's assume we collect some data from an experiment. For specific values of the variable ##X## we obtain certain values of the variable ##Y##. All ##X## and ##Y## values are positive, so the log transformation would not be a problem.

I guess our data points ##(X,Y)## are to be viewed as a sample from a general population. ##Y## values would slightly change from sample to sample (if we repeated the experiment and collected more than one sample).

Is the change in the collected values of ##Y##, from sample to sample, what @FactChecker refers to as random behavior?

If the data ##(X,Y)##, once plotted, seem to follow a curvilinear polynomial trend like ##Y= a X^3##, OLS can still be used, because the model is linear in the parameter ##a##. OLS can be used for polynomial regression, I believe. Confidence intervals, p-values, and R-squared still apply to a polynomial regression and would be meaningful results.

Why would we then need to change the model ##Y= a X^3## and linearize it, either by using the log transformation or a change of variables like ##Y_{new}=Y## and ##X_{new}=X^3##? I still don't get that part...
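For what it's worth, a sketch of that direct OLS fit on simulated data with additive noise (assumed ##a=2##); the model ##Y=aX^3## is linear in the single parameter ##a##, so least squares applies without any transformation:

```python
import numpy as np

# OLS on Y = a * X^3 directly: the model is linear in the parameter a,
# so no change of variables is needed to apply least squares.
rng = np.random.default_rng(2)
X = np.linspace(1.0, 10.0, 50)
Y = 2.0 * X**3 + rng.normal(0.0, 20.0, X.size)  # additive noise, assumed a = 2

A = (X**3)[:, None]                       # design matrix: one column, X^3
a_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(f"a = {a_hat[0]:.3f}")              # close to 2.0
```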
 
  • #9
fog37 said:
TL;DR Summary: Why linearize data that already fit a certain relationship?

What would I gain by linearizing the data?
This is already linear.

fog37 said:
Couldn't we just linearize the power law ##Y=kX^3## by simply creating two new variables, ##Y_{new}=Y## and ##X_{new}=X^3##, and plotting ##Y_{new}## versus ##X_{new}##?
Yes. That is why it is already linear.

The only reason you might transform the data is if you found evidence of heteroskedasticity in the residuals.
 
  • Like
Likes fog37
  • #10
Dale said:
This is already linear.

Yes. That is why it is already linear.

The only reason you might transform the data is if you found evidence of heteroskedasticity in the residuals.
Thank you. Indeed, there are two senses of "linear":

1) linear in the independent variables ##X##
2) linear in the parameters ##b##

Linear regression is linear in both senses while polynomial regression is only linear in the parameters.

What does "linear in the parameters" guarantee? All GLM models are general "linear" models because satisfy linearity in the parameters (ex: the logit in logistic regression is both linear in the variable ##X## and in the parameters ##b##)

Does linearity in the parameters directly imply that OLS can be used to estimate the parameters, or not necessarily? I don't think so... What does it guarantee, then?

Back to my data ##(X,Y)## following the cubic best-fit curve ##Y=k X^3##: does polynomial regression have the same assumptions as linear regression (homoscedasticity, Gaussian residuals, etc.)?

There are two possible data transformations: plotting ##Y## vs ##X^3##, or ##\log(Y)## vs ##\log(X)##; both produce new, transformed data that follow a straight best-fit line. But, as Dale mentions, one transformation may be better than the other because it yields data that satisfy the assumptions required for linear regression, which the other transformation may not satisfy... Is that correct?
 
  • #11
fog37 said:
Linear regression is linear in both senses while polynomial regression is only linear in the parameters.
In statistics, what you are calling "polynomial regression" is still a linear regression. If I have a model ##y=b_0 + b_1 x + b_2 x^2 + b_3 x^3##, this is a linear model because it is linear in all of the regression coefficients ##b_i##. You use the same underlying algorithm to find the ##b_i## as you would for the model ##y=b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3##, the residuals behave the same way, and all of the diagnostic techniques are the same. They are both linear in every way that counts in statistics. We are focused on what you call "linear in the parameters"; from a statistics perspective, that is "linear".

The model ##y=b_0 x^{b_1}## is a non-linear model that can be linearized.

fog37 said:
What does "linear in the parameters" guarantee? All GLM models are general "linear" models because satisfy linearity in the parameters (ex: the logit in logistic regression is both linear in the variable X and in the parameters b)
GLM models are not linear in the parameters; they are only linear in the link function of the parameters. (Unless by "parameters" you mean the link function of the parameters, which would be reasonable too.)

fog37 said:
Does linearity in the parameters directly imply that OLS can be used to estimate the parameters, or not necessarily? I don't think so... What does it guarantee, then?
Not directly, no. But together with the usual assumptions about the noise yes. The guarantee is that OLS is the best linear unbiased estimator. That is the Gauss–Markov theorem.

fog37 said:
Does polynomial regression have the same assumptions as linear regression (homoscedasticity, Gaussian residuals, etc.)?
Yes, it is the same thing. The same assumptions apply as well as the same diagnostic tools.

fog37 said:
There are two possible data transformations. Plotting ##Y## vs ##X^3## or ##log(Y)## vs ##log(X)## both produce new transformed data that follow a straight best fit line. But as Dale mentions, one transformation may be better than the other because it allows us to get data that satisfy the required assumptions for linear regression which may not be satisfied by the other transformation... Is that correct?
Yes. What I usually do is fit the first model and look at the residuals vs ##X## (or ##X^3##). If the residuals are fairly independent of ##X##, then I use that model. If the residuals increase strongly for larger ##X##, then I do the log transform.
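A sketch of that diagnostic on simulated data, using multiplicative noise so that the untransformed fit's residuals fan out at large ##X##:

```python
import numpy as np

# Residual diagnostic: fit Y on X^3, then compare residual spread at small
# vs large X. Growing spread (heteroskedasticity) suggests the log transform.
rng = np.random.default_rng(3)
X = np.linspace(1.0, 10.0, 400)
Y = 2.0 * X**3 * np.exp(rng.normal(0.0, 0.2, X.size))  # multiplicative noise

a, *_ = np.linalg.lstsq((X**3)[:, None], Y, rcond=None)
resid = Y - a[0] * X**3

half = X.size // 2
print("residual sd, small X:", resid[:half].std())
print("residual sd, large X:", resid[half:].std())   # much larger -> log-log fit
```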
 
Last edited:
  • Like
  • Informative
Likes jbergman, FactChecker, hutchphd and 1 other person
  • #12
Thank YOU!
 
  • Like
Likes Dale

FAQ: Exploring the Benefits of Linearizing Data - Improve Analysis and Visualization

What does it mean to linearize data?

Linearizing data involves transforming non-linear relationships into a linear form. This is often done using mathematical functions such as logarithms, square roots, or reciprocal transformations. The goal is to simplify the analysis and make it easier to identify trends and relationships within the data.

Why is linearizing data beneficial for analysis?

Linearizing data can make complex relationships easier to understand and interpret. Linear models are simpler to fit and analyze, allowing for more straightforward statistical inference and often better predictive performance. Linearizing also helps in identifying outliers and understanding the underlying structure of the data.

How does linearizing data improve visualization?

Linearizing data can make visualizations clearer and more informative. Non-linear data can be difficult to interpret when plotted, as patterns and trends may not be immediately apparent. Transforming the data to a linear form can help in creating plots where trends are more easily observed and understood.

What are some common techniques for linearizing data?

Common techniques for linearizing data include logarithmic transformations, square root transformations, and reciprocal transformations. Each method is suitable for different types of non-linear relationships. Choosing the right transformation depends on the specific characteristics of the data and the nature of the relationship being studied.

Are there any drawbacks to linearizing data?

While linearizing data can simplify analysis and visualization, it can also introduce complexities. For example, interpreting the results of a transformed model can be less intuitive, and the transformation itself may not perfectly linearize the data. Additionally, not all non-linear relationships can be effectively linearized, and some information might be lost in the transformation process.
