Different norms and how L1 leads to the sparsest solution

In summary, there are different norms that can be calculated for vectors in a finite-dimensional space, each with its own benefits. Used as a regularization penalty, the ##L_0## "norm" is simply the number of nonzero elements in the vector. When solving an underdetermined system of linear equations, the ##L_1## norm is preferred because it promotes finding the sparsest solution; the ##L_0## norm would be ideal for this purpose but is not convex. The ##L_2## norm essentially guarantees that all coordinates are nonzero, while the ##L_1## norm sits in the middle: turning on a zero coordinate only pays off when it allows a larger decrease in the nonzero coordinates. In general, the ##L_2## norm yields the least sparse solution.
  • #1
fog37
TL;DR Summary
Understand why L0 and L1 lead to the sparsest solution of a linear system of equations
Hello,

I think I understand the different types of norms that we can calculate for a vector living in a finite dimensional space. Each norm has its own benefits.

In 2D, for simplicity, all vectors with the same ##L_2## norm are points on a circle (the isoline), all vectors with the same ##L_1## norm are points on a diamond with vertices on the axes, and all vectors with the same ##L_0## "norm" (the number of nonzero components) lie on unions of coordinate axes: for instance, ##L_0 = 1## corresponds to the points on either axis, excluding the origin.
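For concreteness, here is a tiny numerical sketch of the three quantities on a sample vector (my own toy values, nothing to do with the figure below):

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])

l0 = np.count_nonzero(v)        # "L0 norm": number of nonzero entries -> 2
l1 = np.sum(np.abs(v))          # L1 norm: sum of absolute values      -> 7.0
l2 = np.sqrt(np.sum(v**2))      # L2 norm: Euclidean length            -> 5.0

print(l0, l1, l2)
```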

Question: what is the concept of a regularization penalty that I have read about in connection with norms?

Question: in solving an underdetermined system of linear equations (more unknowns than equations), why does minimizing the ##L_1## norm promote finding the sparsest solution?
For example, in the figure below, the black straight line represents the infinitely many solutions of the underdetermined system. We need to find a solution (the point where the norm ball touches the black line) that is required to be sparse... The ##L_1## solution is the sparsest since it has only one nonzero coordinate... But doesn't it all depend on the position of the black line of solutions? For example, the black line could be positioned in such a way that it perfectly overlaps one of the sides of the diamond. Or the black line could touch the circle (##L_2##) at a sparse point, if the black line were vertical or horizontal... I am confused...

[Figure: the line of solutions of the underdetermined system touching the ##L_1## diamond and the ##L_2## circle]


The best norm, when looking for the sparsest solution of a system, is the ##L_0## norm, but it does not satisfy the property of being convex...
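If it helps, here is a minimal numerical sketch (my own toy example, not from the figure above) comparing the minimum-##L_2##-norm solution of an underdetermined system with the minimum-##L_1##-norm solution, the latter obtained by recasting the ##L_1## problem as a linear program (basis pursuit), since the ##L_0## problem itself is combinatorial:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, k = 5, 10                      # 5 equations, 10 unknowns
A = rng.standard_normal((m, k))
v_true = np.zeros(k)
v_true[[1, 6]] = [2.0, -1.5]      # a sparse "ground truth"
b = A @ v_true

# Minimum-L2-norm solution (lstsq returns it for underdetermined systems): dense.
v_l2 = np.linalg.lstsq(A, b, rcond=None)[0]

# Minimum-L1-norm solution as a linear program: write v = p - n with p, n >= 0
# and minimize sum(p) + sum(n) subject to A(p - n) = b.
c = np.ones(2 * k)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=b)   # default bounds are x >= 0
v_l1 = res.x[:k] - res.x[k:]

print("nonzeros in L2 solution:", np.sum(np.abs(v_l2) > 1e-8))   # typically 10
print("nonzeros in L1 solution:", np.sum(np.abs(v_l1) > 1e-8))   # far fewer, often 2
```

Because a vertex solution of the linear program has at most as many nonzero entries as there are equations, the ##L_1## solution is typically far sparser than the ##L_2## one, and in examples like this it often recovers the sparse vector exactly.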
 
  • #2
For the ##L_0## norm the penalty function is literally the number of nonzero elements, so it has to give the sparsest solution possible.

The ##L_2## norm has the property that its partial derivative with respect to any coordinate that equals 0 is itself 0, so if you have some coordinate that is nonzero and some coordinate that is zero, you can generally modify the zero coordinate by a lot in exchange for making the nonzero coordinate only a little smaller, and still reduce the penalty function. The ##L_2## norm basically guarantees that every coordinate is nonzero, because there is essentially no penalty for making a coordinate nonzero by just a little bit.

The only time you get a zero coordinate is when changing it doesn't let you modify the other coordinates at all, which corresponds to when the solution set happens to be a vertical or horizontal line in your diagram.

The ##L_1## norm sits in the middle. If you have a coordinate whose value is 0, you have to increase it by less than the amount by which you can decrease the other coordinates in order to make the penalty function smaller. For example, if you consider solutions to ##x-2y=1##, for every unit that I change ##x## by, I only have to change ##y## by 0.5 units. So if I have a random solution, say ##x=19## and ##y=9##, I know that if I move ##x## toward zero, then even if that ends up making ##|y|## larger, the ##L_1## norm is guaranteed to decrease. This means I might as well make ##x## zero.
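To make that ##x - 2y = 1## example concrete, here is a small numerical check (my own sketch; the grid search is only for illustration):

```python
import numpy as np

# Constraint: x - 2y = 1, i.e. a . v = 1 with a = (1, -2).
a = np.array([1.0, -2.0])

# Minimum-L2-norm point on the line (closed form: a / ||a||^2).
v_l2 = a / np.dot(a, a)                  # -> (0.2, -0.4), both coordinates nonzero

# Minimum-L1-norm point, found by scanning along the line y = (x - 1)/2.
xs = np.linspace(-3.0, 3.0, 600001)
ys = (xs - 1.0) / 2.0
i = np.argmin(np.abs(xs) + np.abs(ys))
v_l1 = (xs[i], ys[i])                    # -> approximately (0, -0.5): x is driven to zero

print(v_l2, v_l1)
```

So on this line the ##L_2## minimizer keeps both coordinates nonzero, while the ##L_1## minimizer pushes ##x## all the way to zero, as argued above.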

In general, the only time you don't get to win with this argument is when you have something like ##x+y=1##, which corresponds to the solution set being parallel to a face of the unit ball, as you observed.

Notice that even when the ##L_2## norm finds a solution with one zero in it, the ##L_1## norm finds it too. And when the ##L_1## norm doesn't guarantee a sparse solution, the ##L_2## norm yields the least sparse solution possible (both coordinates equal in magnitude), while the sparse solution with one zero coordinate is still at least one of the possible ##L_1##-minimizing solutions.
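Here is a quick numerical check of that degenerate case on the line ##x + y = 1## (again just my own sketch):

```python
import numpy as np

xs = np.linspace(-1.0, 2.0, 300001)
ys = 1.0 - xs                            # points on the line x + y = 1

l1 = np.abs(xs) + np.abs(ys)
l2 = np.sqrt(xs**2 + ys**2)

# The L2 minimizer is the least sparse point (both coordinates equal).
print(xs[np.argmin(l2)], ys[np.argmin(l2)])          # ~ (0.5, 0.5)

# The L1 norm is flat (equal to 1) along the whole segment from (1, 0) to (0, 1),
# so the sparse corners (1, 0) and (0, 1) are among its minimizers.
corner = np.array([1.0, 0.0])
middle = np.array([0.5, 0.5])
print(np.abs(corner).sum(), np.abs(middle).sum(), l1.min())   # all 1.0
```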
 

FAQ: Different norms and how L1 leads to the sparsest solution

What are different norms in the context of L1 and L2 regularization?

In machine learning and statistics, norms are mathematical measures that quantify the size or length of a vector. In the context of L1 and L2 regularization, norms refer to the penalty terms added to the cost function to prevent overfitting and encourage simpler models. L1 norm, also known as the Manhattan norm, penalizes the absolute values of the model parameters, while L2 norm, also known as the Euclidean norm, penalizes the squared values of the model parameters.
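Concretely, for a linear model with weight vector ##w##, design matrix ##X##, and targets ##y## (generic notation, not tied to any particular library), the two penalized objectives are ##\min_w \|Xw - y\|_2^2 + \lambda \|w\|_1## (L1, the lasso) and ##\min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2## (L2, ridge regression), where ##\lambda \ge 0## sets the strength of the penalty.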

How does L1 lead to the sparsest solution?

When using L1 regularization, the penalty term forces some of the model parameters to become exactly zero, resulting in a sparse solution. This is because the L1 norm has a "shrinkage" effect on the model parameters, making them smaller and potentially eliminating some of them entirely. This sparsity can be beneficial for feature selection and interpretability of the model.
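To see this shrinkage-to-exactly-zero mechanism in the simplest setting, here is a minimal one-dimensional sketch (my own toy numbers; it uses the standard closed-form solutions of the scalar penalized least-squares problems):

```python
import numpy as np

lam = 1.0
w = np.array([3.0, 0.4, -0.2, 2.5])   # unpenalized coefficient estimates

# L1 (lasso) step: argmin_v 0.5*(v - w)**2 + lam*|v| is soft-thresholding,
# which sets small coefficients to exactly zero.
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)   # -> [2. , 0. , -0. , 1.5]

# L2 (ridge) step: argmin_v 0.5*(v - w)**2 + 0.5*lam*v**2 only rescales,
# so coefficients shrink but never become exactly zero.
w_l2 = w / (1.0 + lam)                                  # -> [1.5, 0.2, -0.1, 1.25]

print(w_l1)
print(w_l2)
```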

What are the advantages of using L1 regularization?

Aside from promoting sparsity in the model, L1 regularization penalizes large coefficients less aggressively than the L2 penalty, because an absolute value grows linearly while a squared value grows quadratically. (The related observation that absolute values are less affected by extreme values than squared values is also why the L1 norm is popular as a robust loss function in the presence of outliers.) L1 regularization can also help with feature selection and with reducing the effective complexity of the model.

Are there any drawbacks to using L1 regularization?

One potential drawback of L1 regularization is its behavior with highly correlated features: it tends to select one feature from a correlated group somewhat arbitrarily and set the others to zero, rather than spreading the weight across the group the way L2 regularization does, which can make the selected features unstable. In addition, the L1 penalty is not differentiable at zero, so fitting the model requires specialized optimization methods (such as coordinate descent or proximal gradient methods), whereas ridge regression has a simple closed-form solution.

How do different norms and L1 regularization relate to each other?

Different norms, such as L1 and L2, are used as regularization penalties on the model parameters to prevent overfitting. L1 regularization in particular uses the L1 norm to promote sparsity in the model by shrinking some of the model parameters to exactly zero. The choice between norms and regularization techniques ultimately depends on the specific dataset and modeling task at hand.
