L1, L2, and L0.5 Regularization Techniques.
[TOC]
Introduction
In this article, I aim to give a short introduction to the L1, L2, and L0.5 regularization techniques, which are also known as the Lasso, Ridge, and Elastic Net regression techniques respectively. Links to referenced material are provided at the end of this article. Before we start, we need to answer two questions.
- What is regularization?
- Why is it needed?
Starting with the first question and describing it plainly, regularization is the act of shrinking coefficients towards zero, where the coefficients are the weights learned during model training. With this definition it can be said that the larger the weights (coefficient values) obtained during training, the more complex our machine learning model becomes, and this complexity introduces a problem. What that problem is, is answered by the second question.
Why is it needed? Simply put, regularization is used to solve the over-fitting problem that occurs when training a machine learning model. An example of over-fitting is getting high training accuracy but observing an incredibly low accuracy when the model is used on a “similar” but different dataset. By “similar” I mean a dataset that provides information closely related to the dataset used for training. To define over-fitting: it is a situation that occurs when a machine learning model tries to capture all the information in a dataset but, in doing so, also captures the noise in the dataset. Noise in this sense means data points that do not provide any useful information but appear in the dataset due to random occurrences.
Now that we understand that regularization is used to prevent models from over-fitting a particular dataset so that they perform well on other datasets, we can take off.
L1 Regularization (Lasso Regression):
The L1 regularization technique reduces over-fitting by shrinking coefficients (weights) all the way to zero; indirectly, this performs feature selection for the model. Feature selection, in turn, is the act of selecting the specific features in a dataset that one believes will give a machine learning model valuable information so that it performs well.
The idea of feature selection is a good one, because in large datasets some features do not contribute to the learning of the model and should be removed. L1 regularization, however, performs this feature selection purely to leave the variables that minimize the prediction error. The minor disadvantage is that, in my view, this mode of feature selection introduces bias, because few factors are considered before eliminating a variable, and from a human perspective that might not always be a valid thing to do.
The equation that governs the L1 regularization technique is: \(\omega(W) = \lambda \sum_{i=1}^{n} |W_i|\) When you combine the L1 regularization term with the error function of the residual sum of squares, we have: \(Error(y, \hat{y}) + \lambda \sum_{i=1}^{n} |W_i|\) In the first equation stated above, omega (ω) is the regularization term, the \(W_i\) are the weights, and lambda (λ) is the regularization parameter; in simple terms, the regularization parameter controls how much regularization we perform on our model. The higher the value of λ, the more coefficients are shrunk to zero and the more feature selection occurs.
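As a minimal sketch of this behaviour (assuming scikit-learn is available and using a synthetic dataset, where scikit-learn's `alpha` parameter plays the role of λ), larger values of the regularization parameter drive more Lasso coefficients exactly to zero:

```python
# A minimal sketch: Lasso (L1) drives some coefficients exactly to zero.
# Assumes scikit-learn is installed; the dataset here is synthetic.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 100 samples, 10 features, but only 3 of them actually carry information.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for alpha in (0.1, 1.0, 10.0):              # alpha plays the role of lambda
    model = Lasso(alpha=alpha).fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))  # coefficients eliminated by L1
    print(f"alpha={alpha}: {n_zero} of 10 coefficients are exactly zero")
```

Counting the zero coefficients for each value of `alpha` makes the implicit feature selection visible: the stronger the penalty, the more features are dropped.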
L2 Regularization (Ridge Regression):
In L2 regularization, we shrink the coefficients (weights) towards zero without eliminating them. This type of regularization forces the model to rely on all the features made available to it while still preventing over-fitting.
The equation that governs the L2 regularization technique is: \(\omega(W) = \lambda \sum_{i=1}^{n} W_i^2\)
When we include the error function of the residual sum of squares, we have: \(Error(y, \hat{y}) + \lambda \sum_{i=1}^{n} W_i^2\) Unlike Lasso regression (L1), in Ridge regression (L2) the higher the value of the regularization parameter, the more the coefficients are shrunk towards zero, but never exactly to zero, hence no feature selection.
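A rough illustration of the same point (again assuming scikit-learn and a synthetic dataset, with `alpha` standing in for λ): as the penalty grows, the Ridge coefficients get smaller overall, yet none of them become exactly zero.

```python
# A minimal sketch: Ridge (L2) shrinks coefficients toward zero but keeps all of them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for alpha in (0.1, 10.0, 1000.0):        # alpha plays the role of lambda
    model = Ridge(alpha=alpha).fit(X, y)
    # Coefficients shrink overall as alpha grows, yet none become exactly zero.
    print(f"alpha={alpha}: largest |coef| = {np.max(np.abs(model.coef_)):.3f}, "
          f"zero coefficients = {int(np.sum(model.coef_ == 0))}")
```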
L0.5 Regularization (Elastic Net Regression):
The L0.5 regularization technique is the combination of the L1 and L2 regularization techniques. It was created to overcome, to a degree, the minor disadvantage of the Lasso regression technique (L1). Here, Elastic Net still performs feature selection, but it simultaneously finds optimal values of the coefficients (weights) using the Ridge regression (L2) penalty.
The L0.5 regularization technique is governed by the equation below: \(\omega(W) = r\lambda \sum_{i=1}^{n} |W_i| + \frac{1-r}{2}\lambda \sum_{i=1}^{n} W_i^2\) and when we add the error function of the residual sum of squares, we have: \(Error(y,\hat{y}) + r\lambda \sum_{i=1}^{n} |W_i| + \frac{1-r}{2}\lambda \sum_{i=1}^{n} W_i^2\) Here r is defined as the mix ratio: when r = 0, Elastic Net regression becomes equivalent to Ridge regression, and when r = 1, Elastic Net regression is equivalent to Lasso regression.
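A small sketch of the mix ratio in practice, assuming scikit-learn's `ElasticNet`, where `alpha` corresponds to λ and `l1_ratio` corresponds to r (so `l1_ratio=1` recovers Lasso and `l1_ratio=0` recovers Ridge):

```python
# A minimal sketch: Elastic Net mixes the L1 and L2 penalties.
# alpha plays the role of lambda; l1_ratio plays the role of the mix ratio r.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for r in (0.2, 0.5, 0.9):                   # closer to 1 behaves more like Lasso
    model = ElasticNet(alpha=1.0, l1_ratio=r).fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))  # some feature selection still happens
    print(f"l1_ratio={r}: {n_zero} of 10 coefficients are exactly zero")
```

Because part of the penalty is still L1, some coefficients can reach exactly zero, while the L2 part keeps the remaining coefficients from growing too large.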
Now that we have reached the end of this brief introduction, it is also good to note that there are other forms of regularization, especially in deep learning; some of these techniques include dropout, early stopping, and data augmentation.
Reference
https://medium.com/analytics-vidhya/l1-l2-and-l0-5-regularization-techniques-a2e55dceb503
Document information
- Author: Kilin
- Link: https://star-twinking.github.io/2021/09/05/L1,-L2,-and-L-0.5-Regularization-Techniques/
- Copyright: free to repost, non-commercial, no derivatives, attribution required (Creative Commons 3.0 license)