Understand the delta rule increment in gradient descent

  • #1
fog37
TL;DR Summary
Understand the delta rule increment in gradient descent
Hello Everyone,

I have a question about the gradient descent algorithm. Given a multivariable function ##f(x,y)##, we can find its minima (local or global) either by setting its gradient ##\nabla f = 0## or by using the gradient descent iterative approach. The first approach (setting the gradient equal to zero) is not always feasible for complicated functions with many independent variables.

Let's focus on gradient descent and consider a 1D function ##f(x)## for simplicity. The gradient descent approach is a numerical method that involves repeatedly computing the negative gradient ##-\nabla f## to find the values of ##x## where the function has a minimum.

If we are at a location ##x_0##, we calculate the gradient ##\nabla f(x_0)## and then move in the direction of the negative gradient by a step ##\Delta x##: $$x_{new} = x_{old} + \Delta x $$ $$ x_{new} = x_{old} - \eta \nabla f $$
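In code, the iteration I have in mind looks something like this (a minimal sketch in Python; the example function and names are just illustrative):

```python
# Minimal sketch of 1D gradient descent: repeat x <- x - eta * f'(x).
def gradient_descent_1d(grad_f, x0, eta=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - eta * grad_f(x)  # step against the local slope
    return x

# Example: f(x) = (x - 3)**2 has derivative 2*(x - 3) and a minimum at x = 3.
x_min = gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # approaches 3.0
```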
Here is my dilemma: why is the increment ##\Delta x##, which can be positive or negative, equal to the product of a small, arbitrary constant ##\eta## and the negative of the gradient ##\nabla f##?

I don't see how the increment applied to ##x_{old}## turns out to be equal to $$ \Delta x = -\eta \nabla f $$

Is there an in-depth derivation that proves that last result?

Thanks!
 
  • #2
You understand that gradient descent is a method of incrementally "rolling down" the multivariate hill which is the function? I think the algorithm then makes you go back partway and do it again. When iterated, this will put you at the bottom and, for a proper choice of parameter, avoid getting stuck in an endless oscillation.
 
  • #3
hutchphd said:
You understand that gradient descent is a method of incrementally "rolling down" the multivariate hill which is the function? I think the algorithm then makes you go back partway and do it again. When iterated, this will put you at the bottom and, for a proper choice of parameter, avoid getting stuck in an endless oscillation.
Yes, I follow that. I just don't see how the increment ##\Delta x## is equal to the scaled gradient ## - \eta \nabla f##...
 
  • #4
I think they are saying to choose that increment (for an arbitrarily small ##\eta##) and then compute the gradient again. Obviously the increment can only vanish for arbitrary ##\eta## if ##\nabla f = 0##.
I don't know the exact algorithm they are using.
 
  • #5
fog37 said:
Yes, I follow that. I just don't see how the increment ##\Delta x## is equal to the scaled gradient ## - \eta \nabla f##...

You are essentially trying to solve this ODE: [tex]
\frac{dx}{dt} = - \nabla f,[/tex] which has fixed points iff [itex]\nabla f = 0[/itex] (and the fixed points are stable if [itex]f[/itex] has a minimum). If you approximate that using the forward Euler method you get [tex]
\frac{x_{n+1} - x_n}{h} = - \nabla f[/tex] which is essentially your expression for [itex]x_{new}[/itex]. A more sophisticated method would be to set [tex]
\frac{dx}{dt} = - \alpha(t) \nabla f[/tex] where [itex]\alpha(t) > 0[/itex] which leads you to a similar result.

(I don't know if this is how the method is derived, but it does explain it.)
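For concreteness, here is a minimal sketch of that forward Euler discretisation (Python; the example function is purely illustrative):

```python
# Forward Euler on dx/dt = -f'(x): each step is x_{n+1} = x_n - h * f'(x_n),
# which is exactly the gradient descent update with learning rate h.
def euler_descent(grad_f, x0, h=0.05, n_steps=200):
    x = x0
    for _ in range(n_steps):
        x = x - h * grad_f(x)
    return x

# f(x) = x**4 - 2*x**2 has minima at x = +1 and x = -1; f'(x) = 4*x**3 - 4*x.
print(euler_descent(lambda x: 4 * x**3 - 4 * x, x0=0.5))  # flows toward +1
```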
 
  • #6
pasmith said:
You are essentially trying to solve this ODE: [tex]
\frac{dx}{dt} = - \nabla f,[/tex] which has fixed points iff [itex]\nabla f = 0[/itex] (and the fixed points are stable if [itex]f[/itex] has a minimum). If you approximate that using the forward Euler method you get [tex]
\frac{x_{n+1} - x_n}{h} = - \nabla f[/tex] which is essentially your expression for [itex]x_{new}[/itex]. A more sophisticated method would be to set [tex]
\frac{dx}{dt} = - \alpha(t) \nabla f[/tex] where [itex]\alpha(t) > 0[/itex] which leads you to a similar result.

(I don't know if this is how the method is derived, but it does explain it.)
Thanks pasmith.

But the function is ##f(x)## and the independent variable is ##x##. The derivative of ##f(x)## is essentially the gradient: ##\nabla f(x) = \frac {df} {dx}##.

The ODE would be: $$\frac {df} {dx} = \nabla f(x)$$ So $$f(x_{new}) - f(x_{old}) = h \nabla f(x)$$ and not $$x_{new} - x_{old} = h \nabla f(x)$$
 
  • #7
No; you are finding a path [itex]x(t)[/itex] which will lead you to a minimum of [itex]f[/itex]. Thus you have [tex]
\frac{dx}{dt} = - \alpha\nabla f[/tex] where the gradient is evaluated at [itex]x(t)[/itex].
 
  • #8
pasmith said:
No; you are finding a path [itex]x(t)[/itex] which will lead you to a minimum of [itex]f[/itex]. Thus you have [tex]
\frac{dx}{dt} = - \alpha\nabla f[/tex] where the gradient is evaluated at [itex]x(t)[/itex].
Is that different from (or the same as) the concept of the directional derivative
$$\nabla f \cdot u$$ where ##u## is a unit vector in a particular direction?
 
  • #9
I'm a bit lost by the end of the first post. I think the definition of gradient descent is that you pick ##\Delta x = -\eta \nabla f##. So there's nothing to prove with that equation; it's just a description of how ##\Delta x## is picked. Is your question why that is a good choice?
 
  • #10
Another way to look at it is that you are making a series of guesses [itex]x_0, x_1, \dots [/itex] as to where the minimum is. If you guess wrong, then the best way to improve your guess is to head downhill. Which way is downhill? In the direction of [itex]-\nabla f[/itex]. Of course if you go too far downhill you may find yourself heading up the other side of the valley, so you need to control how far you head downhill. That's where [itex]\eta > 0[/itex] comes from.
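A tiny numerical illustration of that overshooting (the quadratic and the step sizes are just examples):

```python
# For f(x) = x**2 the update x <- x - eta * 2x is x <- (1 - 2*eta) * x:
# it converges for 0 < eta < 1 and oscillates ever more wildly for eta > 1.
def descend(eta, x0=1.0, steps=5):
    x, history = x0, [x0]
    for _ in range(steps):
        x = x - eta * 2 * x
        history.append(x)
    return history

print(descend(eta=0.1))  # [1.0, 0.8, 0.64, ...] steady progress downhill
print(descend(eta=1.1))  # [1.0, -1.2, 1.44, ...] overshoots and diverges
```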
 
  • #11
Office_Shredder said:
I'm a bit lost by the end of the first post. I think the definition of gradient descent is that you pick ##\Delta x = -\eta \nabla f##. So there's nothing to prove with that equation; it's just a description of how ##\Delta x## is picked. Is your question why that is a good choice?
Hi Office_Shredder.

Yes, I am wondering where that equation comes from and why it was set up that way, i.e. the increment ##\Delta x## being equal to the derivative (gradient) of the function ##f## multiplied by a constant.

Dimensionally, the left hand side (independent variable ##x##) does not seem to agree with the right hand side of the equation...OR does it?

Thanks
 
  • #12
pasmith said:
Another way to look at it is that you are making a series of guesses [itex]x_0, x_1, \dots [/itex] as to where the minimum is. If you guess wrong, then the best way to improve your guess is to head downhill. Which way is downhill? In the direction of [itex]-\nabla f[/itex]. Of course if you go too far downhill you may find yourself heading up the other side of the valley, so you need to control how far you head downhill. That's where [itex]\eta > 0[/itex] comes from.
Conceptually, I'm with you and understand your point. I am still confused about how we set the increment ##\Delta x## equal to the gradient, which is ##\frac {df}{dx}##. The learning parameter must have particular units.
 
  • #13
It makes more sense in higher dimensions where the gradient specifies the direction you wish to go. It is like following the fall line when snow skiing...and the ##\eta## just determines the granularity of your approximation.
 
  • Like
Likes pbuk
  • #14
fog37 said:
Yes, I am wondering where that equation comes from and why it was set up that way, i.e. the increment ##\Delta x## being equal to the derivative (gradient) of the function ##f## multiplied by a constant.
If ## \nabla f \ne 0 ## you know you are not at a minimum, so in order to find a minimum you have to travel in some direction. Which is the best direction? Directly downhill, i.e. ## -\nabla f ##. How far? Well, if you go too far you could easily overshoot, but if you don't go far enough then it will take hours to find a solution, so you need to use a value that is tuned to the particular problem by some algorithm or heuristic - and we call that ## \eta ##.
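One standard heuristic of that kind is a backtracking (Armijo) line search: start with an ambitious step and shrink it until the function actually decreases enough. A rough 1D sketch, offered as an illustration rather than as the specific algorithm meant above:

```python
# Backtracking line search: shrink eta until the Armijo condition
# f(x - eta*g) <= f(x) - c * eta * g**2 holds, then take the step.
def backtracking_step(f, grad_f, x, eta0=1.0, beta=0.5, c=1e-4):
    g = grad_f(x)
    eta = eta0
    while f(x - eta * g) > f(x) - c * eta * g * g:
        eta *= beta  # step was too ambitious; halve it and retry
    return x - eta * g

# One step on f(x) = x**2 from x = 1 (with f'(x) = 2x) lands at the minimum:
print(backtracking_step(lambda x: x * x, lambda x: 2 * x, x=1.0))  # 0.0
```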

fog37 said:
Dimensionally, the left hand side (independent variable ##x##) does not seem to agree with the right hand side of the equation...OR does it?
fog37 said:
The learning parameter must have particular units.

Dimensional analysis is only going to confuse you here, but since you brought it up then for a problem with a single independent variable ## \nabla f ## is a scalar and so ## \eta ## must have the same dimensions as ## x ##. Generally in numerical algorithms we don't think about dimensional analysis, we trust the maths to take care of this (if the dimensions of the problem are correct and our algorithm is correctly derived then it is all good): to the computer they are all just floats.
 
  • #15
pbuk said:
for a problem with a single independent variable ## \nabla f ## is a scalar and so ## \eta ## must have the same dimensions as ## x ##
Hmm on second thoughts I'm not sure this is correct, which perhaps just strengthens my assertion that dimensional analysis will only confuse things!
 
  • #16
Let ## f(x) ## have dimensions ## D_f ## and ## x ## have dimensions ## D_x ##; then ## \nabla f ## has dimensions ## D_f D^{-1}_x ## and we have ## D_{\Delta x} = D_x = D_\eta D_f D^{-1}_x \implies D_\eta = D^{-1}_f D^2_x ##.

I said it wouldn't help.
 
  • #17
pbuk said:
Let ## f(x) ## have dimensions ## D_f ## and ## x ## have dimensions ## D_x ##; then ## \nabla f ## has dimensions ## D_f D^{-1}_x ## and we have ## D_{\Delta x} = D_x = D_\eta D_f D^{-1}_x \implies D_\eta = D^{-1}_f D^2_x ##.

I said it wouldn't help.
Thanks pbuk.

I see your point.

Also, when I mentioned dimensions, I truly intended units :)

So $$x_{new} = x_{old} + \eta \nabla f(x)$$
The term ##\eta \nabla f(x)## needs to have the same units as the variable ##x##. The equation $$x_{new} - x_{old} = \eta \nabla f(x)$$ does not originate from other mathematical steps/manipulations; it is set up this way in the sense that, as a path to find the ##x## for the minimum, we choose the strategy of setting ##\Delta x = \eta \nabla f(x)##...
 
  • #18
fog37 said:
The equation $$x_{new} - x_{old} = \eta \nabla f(x)$$ does not originate from other mathematical steps/manipulations; it is set up this way in the sense that, as a path to find the ##x## for the minimum, we choose the strategy of setting ##\Delta x = \eta \nabla f(x)##...
Yes, you now have the understanding, but be careful about sign conventions, particularly when you come to implement it in code. Using your symbols I would write ## x_{new} = x_{old} - \eta \nabla f(x) ##, and note that this translates simply to a multivariate problem with x as a vector implemented in whatever way fits the language (e.g. an array in C++, a numpy.array in Python...).
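A minimal multivariate sketch along those lines (Python with numpy; the quadratic bowl is just an example):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, steps=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad_f(x)  # same update; x is now a vector
    return x

# f(x, y) = (x - 1)**2 + 2*(y + 2)**2 has gradient (2*(x - 1), 4*(y + 2))
# and its minimum at (1, -2).
grad = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 2)])
print(gradient_descent(grad, x0=[0.0, 0.0]))  # approaches [ 1. -2.]
```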
 
  • #19
fog37 said:
Summary:: Understand the delta rule increment in gradient descent

Is there an in-depth derivation that proves that last result?
I believe you are "overthinking" this badly.
For the 1D problem this technique makes no sense! So do not consider it.
For 2D (and higher), the gradient gives you the preferred direction to try (i.e. downhill). In order to be sure it is not a local minimum, it must work at any choice of scale. QED
 

FAQ: Understand the delta rule increment in gradient descent

What is the delta rule increment in gradient descent?

The delta rule increment in gradient descent is a mathematical formula used to update the weights of a neural network during the training process. It calculates the change in weight based on the error between the network's actual output and the desired (target) output.

How does the delta rule increment affect gradient descent?

The delta rule increment plays a crucial role in gradient descent by adjusting the weights of the network in the direction that minimizes the error. This helps the network to learn and improve its predictions over time.

What is the formula for calculating the delta rule increment?

The formula for calculating the delta rule increment is: Δw = α * (target - output) * input, where Δw is the change in weight, α is the learning rate, target is the desired output, output is the actual output, and input is the input to the neuron.
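As a concrete illustration of that formula, here is a minimal sketch for a single linear neuron (the training example and learning rate are made up for illustration):

```python
# Delta rule update: w_i <- w_i + alpha * (target - output) * x_i.
def delta_rule_step(w, x, target, alpha=0.1):
    output = sum(wi * xi for wi, xi in zip(w, x))  # linear neuron
    error = target - output
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
for _ in range(50):  # repeatedly present a single training example
    w = delta_rule_step(w, x=[1.0, 2.0], target=3.0)
print(w)  # weights converge so that w . x is approximately 3
```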

How is the learning rate related to the delta rule increment?

The learning rate, denoted by α, determines the size of the weight update in each iteration. A higher learning rate means a larger weight update, which can lead to faster learning but may also result in overshooting the optimal weights. On the other hand, a lower learning rate may take longer to converge but is less likely to overshoot.

Can the delta rule increment be used for any type of neural network?

Yes, the delta rule increment can be used for any type of neural network, including feedforward, recurrent, and convolutional neural networks. It is a fundamental concept in gradient descent, which is the most commonly used optimization algorithm for training neural networks.
