Constrained Optimization with the KKT Approach

SilverSoldier · Nov 30, 2022

I'm reading the book Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, and currently reading this chapter on numerical methods--specifically, the section on constrained optimization.

The book states the following.

Suppose we wish to minimize a function ##f\left(\textbf{x}\right)## over some set $$S=\left\{\textbf{x}\,|\,\forall\,i,g^{(i)}\left(\textbf{x}\right)=0,\,\forall\,j,h^{(j)}\left(\textbf{x}\right)\leq0\right\},$$ where ##g^{(i)}## and ##h^{(j)}## are functions called the equality constraints and inequality constraints respectively; i.e., we wish to find ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##.

To do this, we start by defining a function ##L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## called the generalized Lagrangian as follows; $$L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right)+\sum_i\lambda_i\cdot g^{(i)}\left(\textbf{x}\right)+\sum_j\alpha_j\cdot h^{(j)}\left(\textbf{x}\right).$$ Then, $$\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right).$$ The way I understand this is that we start by picking some random values for ##\textbf{x}## and ##\boldsymbol{\lambda}##, and then find the best non-negative ##\boldsymbol{\alpha}## for which ##L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## is maximum.

Say the ##\textbf{x}## we picked randomly happened to lie in ##S##. Then we should end up at the result that ##\boldsymbol{\alpha}=\textbf{0}##.
It remains now to maximize ##L\left(\textbf{x}, \boldsymbol{\lambda}, \textbf{0}\right)## over ##\boldsymbol{\lambda}##, and the minimize the resulting function over ##\textbf{x}##. Again, because the ##\textbf{x}## we have picked randomly happens to lie in ##S##, ##L\left(\textbf{x}, \boldsymbol{\lambda}, \textbf{0}\right)## is already maximum for any ##\boldsymbol{\lambda}##, because ##\forall\,i, g^{(i)}\left(\textbf{x}\right)=0##.

Therefore, in such a case where the ##\textbf{x}## we pick randomly just so happens to lie in ##S##, we see that $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right).$$ What I do not understand is how minimizing ##\displaystyle\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right)## over ##\textbf{x}## now will keep ##\textbf{x}## within ##S##, and find a minimum in this constrained region.

If $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right),$$ mustn't we have $$\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=\min_{\textbf{x}}f\left(\textbf{x}\right),$$ which is different from ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##?

FactChecker · Nov 30, 2022

SilverSoldier said:

It remains now to ..... minimize the resulting function over ##\textbf{x}##. .....

SilverSoldier said:

mustn't we have $$\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=\min_{\textbf{x}}f\left(\textbf{x}\right),$$ which is different from ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##?

Yes. You already said that you need to minimize over ##\textbf{x}##. If ##\textbf{x}## was picked randomly, then you would have to repeat this process enough to feel confident that you have found a minimum over ##\textbf{x}##, or do a methodical search for the minimum.

SilverSoldier · Nov 30, 2022

FactChecker said:

Yes. You already said that you need to minimize over ##\textbf{x}##. If ##\textbf{x}## was picked randomly, then you would have to repeat this process enough to feel confident that you have found a minimum over ##\textbf{x}##, or do a methodical search for the minimum.

But it is necessary to find the minimum over values in ##S## only. How does an unconstrained minimization of ##f## over all values of ##\textbf{x}## result in its minimum over values in ##S##?

FactChecker · Nov 30, 2022

SilverSoldier said:

But it is necessary to find the minimum over values in ##S## only. How does an unconstrained minimization of ##f## over all values of ##\textbf{x}## result in its minimum over values in ##S##?

Oh, I am sorry. I missed the point. By maximizing over ##\lambda_x## and ##\alpha_x \ge 0## for any given ##x##, you are finding the multipliers that provide the best penalty for violating the constraints of ##g## and ##h##. So that forces the minimization over ##x## to stay within ##S##.

UPDATE: After thinking more about this, I do not feel I am knowledgeable enough to give authoritative help here. I will leave this to others.

Office_Shredder · Dec 1, 2022

I think fact checker has it. If you pick a value of x for which ##g^i(x)>0## you can pick ##\lambda_i>>>0## and ##L## is very positive. Similarly if ##g^i(x) <0## you can pick ##\lambda_i<<<<0##. So ##g^i(x)=0## when you minimize ##L##.

But for the inequality constraints you only get ##h^i(x)\leq0## since otherwise we make ##\alpha^i>>>>0##

SilverSoldier · Dec 1, 2022

I'm sorry I still don't understand it and I am quite confused. Shouldn't we have ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## or ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\max_{\boldsymbol{\lambda}}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## instead?

Say we wanted to find the minimum of ##\dfrac{x^2+y^2}{5}## subject to the constraints ##x^2+y-2\leq0## and ##10x-y-22=0##. The actual minimum, which is ##1.6## given these constraints, is achieved at ##x=2## and ##y=-2##. To find this minimum using the KKT approach, we first define the generalized Lagrangian, as follows; $$\dfrac{x^2+y^2}{5}+\lambda\left(10x-y-22\right)+\alpha\left(x^2+y-2\right).$$ Say we start by picking ##\left(1.8, -4\right)## randomly for ##\left(x,y\right)##. Notice ##\left(1.8, -4\right)## satisfies both constraints and so ##\displaystyle\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(1.8, -4\right)=3.85##. What does it now mean to minimize this? Does this mean we now vary ##x## and ##y## until we look for values that minimize ##\dfrac{x^2+y^2}{5}##?

Can you also please describe how this approach may be implemented in code?

Office_Shredder · Dec 1, 2022

I think you're thinking about the ordering wrong. The first thing is min over x. So you can imagine having some function ##K(x)## that e are trying to minimize. That function is max over ##\lambda##, max over ##\alpha##, ##L(x,\lambda, \alpha)##. So the ordering in the OP says for each ##x##, pick the worst choice of ##\lambda## and ##\alpha## to compute ##K(x)##. Then find the ##x## that makes this as small as possible.

FactChecker · Dec 1, 2022

SilverSoldier said:

I'm sorry I still don't understand it and I am quite confused. Shouldn't we have ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## or ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\max_{\boldsymbol{\lambda}}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## instead?

If the first thing you do is minimize wrt ##x## then you are already too late to try to keep ##x## within ##S##. The first thing you must do is to maximize wrt ##\lambda## and ##\alpha## to force ##x## to stay within ##S##.

Constrained Optimization with the KKT Approach

FAQ: Constrained Optimization with the KKT Approach

What is the KKT approach in constrained optimization?

What are the KKT conditions?

How does the KKT approach differ from other optimization methods?

What are the advantages of using the KKT approach?

What are some applications of the KKT approach?

Similar threads

Hot Threads

Recent Insights