Linear Regression Gradient Descent: Feature Normalization/Scaling for Prediction

Ackbach · Jun 22, 2017

Cross-posted on SE.DS Beta.

I'm just doing a simple linear regression with gradient descent in the multivariate case. Feature normalization/scaling is a standard pre-processing step in this situation, so I take my original feature matrix $X$, organized with features in columns and samples in rows, and transform to $\tilde{X}$, where, on a column-by-column basis,
$$\tilde{X}=\frac{X-\bar{X}}{s_{X}}.$$
Here, $\bar{X}$ is the mean of a column, and $s_{X}$ is the sample standard deviation of a column. Once I've done this, I prepend a column of $1$'s to allow for a constant offset in the $\theta$ vector. So far, so good.

If I did not do feature normalization, then my prediction, once I found my $\theta$ vector, would simply be $x\cdot\theta$, where $x$ is the location at which I want to predict the outcome. But now, if I am doing feature normalization, what does the prediction look like? I suppose I could take my location $x$ and transform it according to the above equation on an element-by-element basis. But then what? The outcome of $\tilde{x}\cdot\theta$ would not be in my desired engineering units. Moreover, how do I know that the $\theta$ vector I've generated via gradient descent is correct for the un-transformed locations? I realize all of this is a moot point if I'm using the normal equation, since feature scaling is unnecessary in that case. However, as gradient descent typically works better for very large feature sets ($> 10k$ features), this would seem to be an important step. Thank you for your time!

I like Serena · Jun 22, 2017

If I'm not mistaken, you're trying to find $\theta$, such that $X\theta$ is as close as possible to some $y$.
That is, find the $\theta$ that minimizes $\|X\theta - y\|$.
Afterwards, the prediction is $\hat y = X\theta$.
Is that correct?

If so, then after normalization, we find $\tilde \theta$, such that $\tilde X\tilde\theta$ is as close as possible to the also normalized $\tilde y = \frac{y-\bar y}{s_y}$, yes?
In that case the prediction becomes:
$$\hat y = \bar y + \tilde X\tilde\theta s_y$$
which is in engineering units.

Ackbach · Jun 22, 2017

Actually, I just learned the answer (see the SE.DS Beta link): you transform the prediction location $x$ in precisely the same way you did for the columns of $X$, but component-wise. So you do this:

1. To each element of $x$, you add the mean of the corresponding column of $X$.
2. Divide the result by the standard deviation of the corresponding column of $X$.
3. Prepend a $1$ to the $x$ vector to allow for the bias. Call the result $\tilde{x}$.
4. Perform $\tilde{x}\cdot\theta$, which is the prediction value.

As it turns out, if you do feature normalization, then the $\theta$ vector DOES contain all the scaling and engineering units you need. And that actually makes sense, because you're not doing any transform to the outputs of the training data.

Linear Regression Gradient Descent: Feature Normalization/Scaling for Prediction

FAQ: Linear Regression Gradient Descent: Feature Normalization/Scaling for Prediction

What is linear regression gradient descent?

What is feature normalization/scaling?

Why is feature normalization/scaling important in linear regression gradient descent?

How do you perform feature normalization/scaling in linear regression gradient descent?

Are there any disadvantages to feature normalization/scaling in linear regression gradient descent?

Similar threads

Hot Threads

Recent Insights