# Overview

Regularization is an important method for preventing overfitting, and it is used in many algorithms. Writing it up rigorously would make for a very long article, so here I won't insist on mathematical precision. Instead, I'll try to show roughly how it works.

# Why is regularization necessary?

Regularization exists to prevent overfitting.
Overfitting is one of the problems that arise when you build a model. During training, the parameters fit the training data so closely that the model no longer works well on test data.

The following image is an example of overfitting. The points are the data, and A and B are lines drawn by two models. Line A follows every point precisely, while line B follows the points only roughly. But because line A fits the data too closely, it does not represent unknown data points well. For example, if I try to predict the value that corresponds to $d$ using line A, the predicted value is $d_A$. As you can see, that is not an appropriate prediction. (The image is actually slightly off in one respect, but that is beside the point, so don't worry about it.)

Regularization is meant to address this kind of situation.

# Regularization

## Approach

Regularization's approach to overfitting is not that difficult: it keeps the parameters from growing large.

In the image above, we saw two lines, A and B. A line like A, that is, an overfitted one, tends to have large coefficients. Regularization tries to prevent that.
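To make the link between overfitting and large coefficients concrete, here is a minimal sketch. The data, the noise level, and the polynomial degrees are my own assumptions for illustration and are not taken from the image: a degree-1 fit plays the role of line B, and a degree-9 fit, which can pass through all ten points, plays the role of line A.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy samples; the underlying relationship is assumed to be a straight line.
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + rng.normal(scale=0.1, size=x.shape)

# Degree 1 is the "rough" model (like B); degree 9 can hit every point (like A).
coef_simple = np.polyfit(x, y, deg=1)
coef_overfit = np.polyfit(x, y, deg=9)

def train_error(coef):
    """Mean squared error on the training points."""
    return float(np.mean((np.polyval(coef, x) - y) ** 2))

max_simple = float(np.max(np.abs(coef_simple)))
max_overfit = float(np.max(np.abs(coef_overfit)))

print("training error:", train_error(coef_simple), train_error(coef_overfit))
print("largest |coefficient|:", max_simple, max_overfit)
```

The high-degree fit should show a smaller training error but far larger coefficients, which is the tendency regularization pushes against.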

## Simply

Put simply, regularization is expressed as follows.

$L(w) + \lambda R(w)$

Here $L(w)$ is the loss function, $\lambda$ is the parameter that controls the strength of the regularization, and $R(w)$ is the regularization term.

Normally, training tries to decrease the loss, so the objective function is the loss function itself. With regularization, we add the regularization term to the loss function and minimize the sum instead.
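As a rough sketch of what "adding the regularization term" looks like in code, the example below assumes a squared-error loss for $L(w)$ and the sum of squared weights for $R(w)$. Both are my choices for illustration, and the function names `regularized_loss` and `fit` are hypothetical. For this particular pair the minimizer has a closed form, so we can compare the weights with and without regularization.

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """L(w) + lam * R(w), with squared-error loss and R(w) = sum of squared weights."""
    loss = np.sum((X @ w - y) ** 2)
    penalty = np.sum(w ** 2)
    return loss + lam * penalty

def fit(X, y, lam):
    """Closed-form minimizer of regularized_loss for this specific L and R."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data, assumed for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=20)

w_plain = fit(X, y, lam=0.0)   # no regularization
w_reg = fit(X, y, lam=10.0)    # regularized

print("norm without regularization:", np.linalg.norm(w_plain))
print("norm with regularization:   ", np.linalg.norm(w_reg))
```

A larger $\lambda$ gives the penalty more weight in the objective, so the fitted weights shrink.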

## What does this expression mean?

For regularization, we added the regularization term to the loss function, and we need to know what that means. But a precise treatment would be too mathematical for a blog article. So here I'll just state, roughly, one simple mathematical rule that helps with the understanding.

If you want the precise mathematical background, I recommend reading a book such as Pattern Recognition and Machine Learning.

### Example

For simplicity, assume there are just two parameters, $x$ and $y$.

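The maximum and minimum of $f(x, y)$ subject to the constraint $g(x, y)=0$ can be found with the following function (this is the method of Lagrange multipliers; note that $L$ here is not the loss function).

$L(x, y, \lambda) = f(x, y) - \lambda g(x, y)$

The points where the partial derivatives of $L$ with respect to $x$, $y$, and $\lambda$ are all zero are the candidates for the maximum and minimum of $f(x, y)$ subject to $g(x, y)=0$.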

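So, as a rough understanding, adding a constraint term to the function we optimize lets the effect of the constraint be reflected in the result. The regularized loss has the same shape: the regularization term plays the role of the constraint added to the loss function.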
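As a small worked example of the rule above (the functions $f$ and $g$ are chosen by me purely for illustration), take $f(x, y) = x + y$ and the constraint $g(x, y) = x^2 + y^2 - 1 = 0$, that is, the point must lie on the unit circle.

$L(x, y, \lambda) = x + y - \lambda (x^2 + y^2 - 1)$

Setting the partial derivatives to zero gives

$\frac{\partial L}{\partial x} = 1 - 2\lambda x = 0, \quad \frac{\partial L}{\partial y} = 1 - 2\lambda y = 0, \quad \frac{\partial L}{\partial \lambda} = -(x^2 + y^2 - 1) = 0$

From the first two equations, $x = y = \frac{1}{2\lambda}$. Substituting into the third gives $\frac{2}{4\lambda^2} = 1$, so $\lambda = \pm \frac{1}{\sqrt{2}}$ and $x = y = \pm \frac{1}{\sqrt{2}}$.

The candidates are therefore $\left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right)$, where $f = \sqrt{2}$ (the maximum), and $\left(-\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}\right)$, where $f = -\sqrt{2}$ (the minimum). Without the constraint, $x + y$ has no maximum or minimum at all, so the answer comes entirely from the added term.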

## Chart

Now we more or less know that adding the regularization term to the loss function lets us reflect the effect of a constraint in the loss function. We can check graphically what this "constraint" means.

For simplicity, consider the two-variable case. In discussions of regularization, we frequently see a chart like this.

The blue circles are contour lines of the loss function and the red shape is the constraint. As a simple way to think about it, the point where the red and blue shapes touch gives the parameters we want. There are several types of regularization, such as L1 and L2. Graphically, the difference is the shape of the red region: a circle for L2 and a diamond for L1.

In any case, this keeps the parameter values from becoming too large, and some types of regularization (L1, for example) can more or less switch off the weak parameters by driving them to zero.
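The difference between the two behaviors can be sketched in a deliberately simplified setting. Assume each weight is regularized independently, with a loss of $\frac{1}{2}(w - a)^2$ around its unregularized value $a$. This setup and the names `shrink_l2` and `shrink_l1` are assumptions for illustration. Under them, the L2 penalty scales every weight down, while the L1 penalty subtracts a fixed amount and sets anything smaller than that to exactly zero.

```python
import numpy as np

def shrink_l2(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2: every weight is scaled down."""
    return a / (1.0 + lam)

def shrink_l1(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*|w| (soft thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

a = np.array([3.0, 0.5, -0.2])  # hypothetical unregularized weights
lam = 1.0

w_l2 = shrink_l2(a, lam)
w_l1 = shrink_l1(a, lam)

print("L2:", w_l2)  # all weights smaller, none exactly zero
print("L1:", w_l1)  # the two weak weights become zero
```

In this toy case the strong weight survives under both penalties, but only L1 removes the two weak ones entirely.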