Please convince me (OLS matrix derivation)

In summary, the derivative of B'X'XB with respect to B is 2X'XB.
  • #1
longrob
Hi all

In the derivation of the normal equations for Ordinary Least Squares estimates we have B (m x 1 column vector) and X (n x m matrix). Could someone please convince me that the derivative with respect to B of
B'X'XB
is
2X'XB
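
For what it's worth, the identity is easy to check numerically. Here is a rough NumPy sketch (the dimensions and random data are just for illustration; what I'm really after is the algebraic argument):

[code]
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3
X = rng.normal(size=(n, m))
B = rng.normal(size=m)

def f(b):
    return b @ X.T @ X @ b          # the scalar B'X'XB

# central-difference gradient of f at B, one component at a time
h = 1e-6
grad_fd = np.array([(f(B + h * e) - f(B - h * e)) / (2 * h) for e in np.eye(m)])

print(np.allclose(grad_fd, 2 * X.T @ X @ B, atol=1e-4))   # True
[/code]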

Thanks !
LR
 
  • #2
I'm sorry, I'm not familiar with the OLS you're referring to. The one I'm familiar with is used either to find the best approximation (using orthogonal projection) to an inconsistent system, or to find the line of best fit to a set of data points, i.e., the least-squares line: the line that minimizes the sum of squared residuals. I don't recognize your layout, nor the meaning of B' in the derivative. Would you expand on your layout?
 
  • #3
Hi and thanks for your reply.
Could you take a look here:
http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
On page 18/19 you see exactly what (I think) you are referring to in terms of the orthogonal projection. What I am referring to is on the bottom of page 19:
"Differentiating with respect to beta and setting to zero, we find that beta-hat satisfies"
It's this differentiation I am specifically talking about (of the 2nd term on the RHS).

Thanks again
LR
 
  • #4
Excuse me:
I meant "the 3rd term in the expression above" not "the 2nd term on the RHS".
 
  • #5
Sorry, longrob, I have not had a chance to look at it, and I'm going to be busy for a while.

Still, from what I saw, it seems like the chain rule should be involved, since d/dB[B'X'XB] = d/dB[(XB)'(XB)], and this is a product together with a chain rule; we have g(B) = (XB)' and f(B) = XB (where X is a constant with respect to B), so that g(B) is a composition of scaling and transposition. I'll look into it if I have a chance.
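
If it helps, here is a rough numerical sketch of that chain-rule viewpoint (NumPy, with made-up dimensions and data, not a rigorous argument): since f(B) = XB is linear with Jacobian X, the gradient of (XB)'(XB) should be 2X'(XB).

[code]
import numpy as np

rng = np.random.default_rng(1)
n, m = 10, 4
X = rng.normal(size=(n, m))
B = rng.normal(size=m)

def g(b):
    return np.dot(X @ b, X @ b)     # (XB)'(XB), i.e. ||XB||^2

# chain rule: f(B) = XB is linear with Jacobian X, and d/du (u'u) = 2u,
# so grad g(B) = 2 * X' * (XB)
grad_chain = 2 * X.T @ (X @ B)

# compare with a central-difference gradient
h = 1e-6
grad_fd = np.array([(g(B + h * e) - g(B - h * e)) / (2 * h) for e in np.eye(m)])
print(np.allclose(grad_chain, grad_fd, atol=1e-4))   # True
[/code]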
 
  • #6
I'm sorry, I have been a bit confused about this myself; the best I have been able to come up with is this; maybe/hopefully you or someone else can finish it and/or make it more precise:

In B'X'XB, X'X is a square m x m matrix, and, basically, B' is the same as B, only expressed in a different format, so what you get is a quadratic form B'MB. Another approach is that d/dB(B'MB) = MB + M'B, and somehow (despite the product not being commutative) this is equal to 2MB = 2X'XB.

Sorry, I have not been able to figure it out better.
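
For what it's worth, the quadratic-form rule d/db(b'Mb) = (M + M')b can at least be checked numerically. Here is a rough NumPy sketch (invented data, with M deliberately non-symmetric to show the general rule):

[code]
import numpy as np

rng = np.random.default_rng(2)
m = 5
M = rng.normal(size=(m, m))          # a general, non-symmetric matrix
b = rng.normal(size=m)

def q(v):
    return v @ M @ v                 # the quadratic form b'Mb

h = 1e-6
grad_fd = np.array([(q(b + h * e) - q(b - h * e)) / (2 * h) for e in np.eye(m)])

print(np.allclose(grad_fd, (M + M.T) @ b, atol=1e-4))   # True: gradient is (M + M')b
print(np.allclose(grad_fd, 2 * M @ b, atol=1e-4))       # generally False unless M is symmetric
[/code]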
 
  • #7
Hi Bacle, thanks for your messages. I'm glad I'm not the only one who is a bit confused by it.

For completeness and the benefit of others, I'll explain the setup so that it's not necessary to refer to the link I posted.

We have
y = XB + e

where y is a n x 1 column vector of responses
X is an n x (k+1) matrix with a leading column of 1s for the regression intercept, the remaining columns being the n x 1 vectors of observed values of the explanatory variables.
B is a (k+1) x 1 column vector of regression coefficients.
e is a n x 1 column vector of errors

We wish to find the values of the elements of B which minimise the sum of the squared errors.
That is, we wish to minimise e'e
Now,
e'e = (y - XB)' (y - XB)

= (y'- B'X') (y - XB)

= y'y - y'XB - B'X'y + B'X'XB

Now, since y'XB and B'X'y are 1 x 1 and each is the transpose of the other, they are equal, so:

= y'y - 2y'XB + B'X'XB

Differentiating this with respect to B and setting to zero yields

X'XB = X'y

which are the so-called "normal equations", which upon rearrangement give the least squares estimator for B (provided that X'X is invertible).

It is the last step above which I don't get - that the derivative of B'X'XB with respect to B is 2X'XB. Obviously B'X'XB is a quadratic form, so it does make some intuitive sense, but I need more than that...
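
Incidentally, the normal equations themselves are easy to verify numerically. Here is a NumPy sketch with invented data (np.linalg.lstsq minimises the sum of squared residuals directly, so it should agree with solving X'XB = X'y):

[code]
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # leading column of 1s
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# solve the normal equations X'X B = X'y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# compare with a generic least-squares solver that minimises ||y - XB||^2 directly
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))   # True
[/code]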

Thanks for reading
LR
 
  • #8
Longrob, I do know the more geometric, linear-algebraic derivation using orthogonal projection, if you're interested. You're working in a normed space (V, ||.||).

The idea is that you're given an inconsistent linear system Ax=b; while there is no actual solution x, there is a second best, which is finding the vector x_ that best approximates a solution, i.e., finding x_ such that, for any x':

||Ax'-b|| >= ||Ax_-b||

And Ax_ is actually the orthogonal projection of b onto the column space of A (if b were in the column space, Ax=b would have an exact solution).

With this layout, I know how to show that the line y = A^x + B^ (with estimated slope A^ and intercept B^) is the closest line to the collection of points in the LS sense (it is a sort-of inverse problem, in that we are looking for the subspace, the least-squares line/subspace, instead of having the space given as Ax).
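
If it helps, here is a small numerical illustration of that projection picture (a NumPy sketch with invented data; the fitted values X*beta_hat should coincide with the orthogonal projection of y onto the column space of X, computed here via the hat matrix X(X'X)^{-1}X'):

[code]
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                       # deliberately not in the column space of X

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

# orthogonal projection of y onto col(X) via the "hat" matrix X(X'X)^{-1}X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
proj = H @ y

print(np.allclose(fitted, proj))             # True: fitted values = projection
print(np.allclose(X.T @ (y - fitted), 0))    # True: residual is orthogonal to col(X)
[/code]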

Anyway, let me know if you're interested in this type of derivation. I got a bit confused with the justification for differentiating matrices as if they were standard variables, BTW; do you know if this is some sort of Borel functional calculus or something similar to that?
 
  • #9
Hi again

I need to stick with the pure linear algebraic derivation at the moment, but thanks anyway. I may come back to you later on that, as I am also interested in the geometric interpretation.

Anyway, I think I have solved it. Basically, it revolves around the "rule" that the derivative of b'Ab with respect to b is (A+A')b, or in the case at hand, where A is symmetric, 2Ab. This can be proved as follows.

A is an n x n symmetric matrix. b is an n x 1 column vector.

[itex]\mathbf{b'Ab}=\overset{n}{\underset{i=1}{\sum}}\overset{n}{\underset{j=1}{\sum}}b_{i}a_{ij}b_{j}[/itex]

[itex]=\overset{n}{\underset{i=1}{\sum}}b_{i}^{2}a_{ii}+\overset{n}{\underset{i\neq j}{\sum}}\overset{n}{\underset{j\neq i}{\sum}}b_{i}a_{ij}b_{j}[/itex]

[sigh]
I can't get LaTeX to display properly, as you can see above. I created it in LyX and have tested here http://www.codecogs.com/latex/eqneditor.php that it displays correctly. I can't post the rest of the proof in LaTeX as it just displays the LaTeX code:
[itex]\frac{\partial(\mathbf{\mathbf{b}'}\mathbf{Ab})}{\partial b_{k}}&=2b_{k}a_{kk}+2\overset{n}{\underset{i\neq k}{\sum}}a_{ki}b_{i}[/itex]
What am I doing wrong with this LaTeX?

The rest of the proof is to form the partial derivative with respect to a general b_k and to notice that this is just twice the kth element of the vector Ab, so the derivative with respect to the whole vector b is 2Ab.
 
  • #10
OK, here goes again. I think I fixed the LaTeX problems:
Let [itex]\mathbf{A=X^{\textrm{T}}X}[/itex], where [itex]\mathbf{A}[/itex] is an [itex]n\times n[/itex] square symmetric matrix with elements [itex]a_{ij}[/itex], and [itex]\boldsymbol{\beta}[/itex] is an [itex]n\times 1[/itex] column vector. Expanding out:

[itex]\begin{align}\mathbf{\boldsymbol{\beta}^{\textrm{T}}A\boldsymbol{\beta}}&=\overset{n}{\underset{i=1}{\sum}}\overset{n}{\underset {j=1}{\sum}}\beta_{i}a_{ij}\beta_{j}
\end{align}[/itex]

[itex]\begin{align}=\overset{n}{\underset{i=1}{\sum}} \beta_{i}^{2}a_{ii}+\overset{n}{\underset{i=1}{\sum}}\underset{j\neq i}{\sum}\beta_{i}a_{ij}\beta_{j}\end{align}[/itex]

Then, form the partial derivative of [itex]\mathbf{\mathbf{\mathbf{\boldsymbol{\beta}}^ {\textrm{T}}}\mathbf{A\boldsymbol{\beta}}}[/itex] with respect to [itex]\beta_{k}[/itex], an arbitrary element of [itex]\mathbf{\mathbf{\boldsymbol{\beta}}} [/itex]

[itex]\begin{align}\frac{\partial(\mathbf{\mathbf{\beta}^{\textrm{T}}}\mathbf{A\boldsymbol{\beta}})} {\partial\beta_{k}}&=2\beta_{k}a_{kk}+2 \overset{n}{\underset{i\neq k}{\sum}}a_{ki}\beta_{i}\end{align}[/itex]

[itex]\begin{align}=2\overset{n}{\underset{i=1}{\sum}}a_{ki}\beta_{i}\end{align}[/itex]

But [itex]\overset{n}{\underset{i=1}{\sum}}a_{ki}\beta_{i}[/itex] is the [itex]k[/itex]th element of the vector [itex]\mathbf{A}\boldsymbol{\beta}[/itex] so it follows that

[itex]\begin{align}\frac{d( \mathbf{\mathbf{\beta}^{\textrm{T}}} \mathbf{A\boldsymbol{\beta}})} {d\mathbf{\boldsymbol{\beta}}}&=2 \mathbf{A\boldsymbol{\beta}}\end{align}[/itex]

Thus

[itex]\begin{align}\frac{d(\mathbf{\beta^{\textrm{T}} X^{\textrm{T}}X\boldsymbol{\beta}})} {d\mathbf{\boldsymbol{\beta}}}= 2\mathbf{ {\mathbf{ X^{\textrm{T}}X\boldsymbol{\beta}}}} \end{align}[/itex]

If anyone can check/correct I would be grateful.
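
One way to check the component-wise formula numerically (a NumPy sketch; the dimensions, data, and finite-difference step size are arbitrary):

[code]
import numpy as np

rng = np.random.default_rng(5)
n, m = 12, 4
X = rng.normal(size=(n, m))
A = X.T @ X                              # symmetric, as in the derivation above
beta = rng.normal(size=m)

def f(b):
    return b @ A @ b                     # beta' A beta

# each partial derivative should equal 2 * (A beta)_k = 2 * sum_i a_ki beta_i
h = 1e-6
partials = []
for k in range(m):
    e = np.zeros(m)
    e[k] = 1.0
    partials.append((f(beta + h * e) - f(beta - h * e)) / (2 * h))

print(np.allclose(partials, 2 * A @ beta, atol=1e-4))   # True
[/code]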
 

FAQ: Please convince me (OLS matrix derivation)

1. What is OLS matrix derivation?

OLS matrix derivation is a method used in statistics to find the best-fitting line (or, more generally, hyperplane) for a set of data points. It uses matrix algebra to minimize the sum of the squared differences between the observed data points and the values predicted by the fitted line.

2. Why is OLS matrix derivation important?

OLS matrix derivation is important because it gives the line of best fit, in the least-squares sense, for a given set of data. This is useful in fields such as economics, psychology, and engineering, where understanding the relationship between variables is crucial.

3. How does OLS matrix derivation work?

OLS matrix derivation works by first assembling the observed values of the independent variables into a design matrix (X) and the values of the dependent variable into a vector (y). Then, using matrix algebra, the coefficients of the line of best fit are found by minimizing the sum of the squared differences between the actual values and the predicted values, which leads to the normal equations X'X B = X'y and hence B-hat = (X'X)^(-1) X'y when X'X is invertible.
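
A minimal sketch of this computation in NumPy (the data and variable names are purely illustrative):

[code]
import numpy as np

# made-up data: one explanatory variable plus an intercept column
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])    # design matrix with a column of 1s

# coefficients minimising the sum of squared residuals: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                              # roughly [0.14, 1.96]: intercept, slope
[/code]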

4. What are the assumptions of OLS matrix derivation?

The main assumptions of OLS matrix derivation are that the errors have a mean of 0, have constant variance, and are not correlated with each other; for the usual inference (confidence intervals and tests) the errors are additionally assumed to be normally distributed. It is also assumed that the independent variables are not perfectly (or highly) correlated with each other, so that X'X is invertible.

5. Can OLS matrix derivation be used for non-linear relationships?

Only in a limited sense. OLS requires the model to be linear in the parameters, not necessarily in the variables: a relationship that is non-linear in an explanatory variable can still be fitted by OLS after transforming the regressors (polynomial regression, for example, just adds columns for x^2, x^3, ... to the design matrix X). Models that are non-linear in the parameters, however, require other methods such as non-linear least squares.
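
For example, a quadratic trend is still an OLS problem because the model is linear in the coefficients. A small illustrative NumPy sketch with invented data:

[code]
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(-2.0, 2.0, 30)
y = 1.0 - 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.2, size=x.size)

# the model is non-linear in x but linear in the coefficients,
# so it is still an OLS problem with design-matrix columns 1, x, x^2
X = np.column_stack([np.ones_like(x), x, x**2])
coef = np.linalg.solve(X.T @ X, X.T @ y)
print(coef)   # roughly [1.0, -0.5, 2.0]
[/code]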
