- #1
SilverSoldier
- 26
- 3
I'm reading the book Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, and currently reading this chapter on numerical methods--specifically, the section on constrained optimization.
The book states the following.
Suppose we wish to minimize a function ##f\left(\textbf{x}\right)## over some set $$S=\left\{\textbf{x}\,|\,\forall\,i,g^{(i)}\left(\textbf{x}\right)=0,\,\forall\,j,h^{(j)}\left(\textbf{x}\right)\leq0\right\},$$ where ##g^{(i)}## and ##h^{(j)}## are functions called the equality constraints and inequality constraints respectively; i.e., we wish to find ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##.
To do this, we start by defining a function ##L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## called the generalized Lagrangian as follows; $$L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right)+\sum_i\lambda_i\cdot g^{(i)}\left(\textbf{x}\right)+\sum_j\alpha_j\cdot h^{(j)}\left(\textbf{x}\right).$$ Then, $$\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right).$$ The way I understand this is that we start by picking some random values for ##\textbf{x}## and ##\boldsymbol{\lambda}##, and then find the best non-negative ##\boldsymbol{\alpha}## for which ##L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## is maximum.
Say the ##\textbf{x}## we picked randomly happened to lie in ##S##. Then we should end up at the result that ##\boldsymbol{\alpha}=\textbf{0}##.
It remains now to maximize ##L\left(\textbf{x}, \boldsymbol{\lambda}, \textbf{0}\right)## over ##\boldsymbol{\lambda}##, and the minimize the resulting function over ##\textbf{x}##. Again, because the ##\textbf{x}## we have picked randomly happens to lie in ##S##, ##L\left(\textbf{x}, \boldsymbol{\lambda}, \textbf{0}\right)## is already maximum for any ##\boldsymbol{\lambda}##, because ##\forall\,i, g^{(i)}\left(\textbf{x}\right)=0##.
Therefore, in such a case where the ##\textbf{x}## we pick randomly just so happens to lie in ##S##, we see that $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right).$$ What I do not understand is how minimizing ##\displaystyle\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right)## over ##\textbf{x}## now will keep ##\textbf{x}## within ##S##, and find a minimum in this constrained region.
If $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right),$$ mustn't we have $$\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=\min_{\textbf{x}}f\left(\textbf{x}\right),$$ which is different from ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##?
The book states the following.
Suppose we wish to minimize a function ##f\left(\textbf{x}\right)## over some set $$S=\left\{\textbf{x}\,|\,\forall\,i,g^{(i)}\left(\textbf{x}\right)=0,\,\forall\,j,h^{(j)}\left(\textbf{x}\right)\leq0\right\},$$ where ##g^{(i)}## and ##h^{(j)}## are functions called the equality constraints and inequality constraints respectively; i.e., we wish to find ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##.
To do this, we start by defining a function ##L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## called the generalized Lagrangian as follows; $$L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right)+\sum_i\lambda_i\cdot g^{(i)}\left(\textbf{x}\right)+\sum_j\alpha_j\cdot h^{(j)}\left(\textbf{x}\right).$$ Then, $$\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right).$$ The way I understand this is that we start by picking some random values for ##\textbf{x}## and ##\boldsymbol{\lambda}##, and then find the best non-negative ##\boldsymbol{\alpha}## for which ##L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## is maximum.
Say the ##\textbf{x}## we picked randomly happened to lie in ##S##. Then we should end up at the result that ##\boldsymbol{\alpha}=\textbf{0}##.
It remains now to maximize ##L\left(\textbf{x}, \boldsymbol{\lambda}, \textbf{0}\right)## over ##\boldsymbol{\lambda}##, and the minimize the resulting function over ##\textbf{x}##. Again, because the ##\textbf{x}## we have picked randomly happens to lie in ##S##, ##L\left(\textbf{x}, \boldsymbol{\lambda}, \textbf{0}\right)## is already maximum for any ##\boldsymbol{\lambda}##, because ##\forall\,i, g^{(i)}\left(\textbf{x}\right)=0##.
Therefore, in such a case where the ##\textbf{x}## we pick randomly just so happens to lie in ##S##, we see that $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right).$$ What I do not understand is how minimizing ##\displaystyle\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right)## over ##\textbf{x}## now will keep ##\textbf{x}## within ##S##, and find a minimum in this constrained region.
If $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right),$$ mustn't we have $$\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=\min_{\textbf{x}}f\left(\textbf{x}\right),$$ which is different from ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##?