by Gagan

Introduction

Have you ever been in a situation where you memorised the answers for an examination but fumbled in the examination hall, because you had only memorised topics and couldn't recognise patterns?

Something similar happens when training a machine learning model. When the model learns the training data too well, it cannot predict values outside of the training data. This is known as overfitting.

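To see this concretely, here is a minimal sketch (assuming NumPy and scikit-learn are available; the data is made up for illustration). A degree-9 polynomial fitted to ten noisy points drives the training error close to zero, while the error on unseen test data stays much higher:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Noisy samples from a simple underlying relationship: y = x + noise
X_train = rng.uniform(0, 1, size=(10, 1))
y_train = X_train.ravel() + rng.normal(scale=0.1, size=10)
X_test = rng.uniform(0, 1, size=(50, 1))
y_test = X_test.ravel() + rng.normal(scale=0.1, size=50)

for degree in (1, 9):
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    train_err = mean_squared_error(y_train, model.predict(poly.transform(X_train)))
    test_err = mean_squared_error(y_test, model.predict(poly.transform(X_test)))
    print(f"degree={degree}: train MSE={train_err:.4f}, test MSE={test_err:.4f}")
```

The degree-9 model memorises the ten training points but generalises poorly, which is exactly the overfitting behaviour described above.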

How to solve overfitting?

The problem of overfitting can be addressed by three main methods:

  1. Reducing the weights ($w$) of the input features so that the curve fits better for a wide range of test data. For example, if the output function was $f_{w,b}(x) = 92x_1 + 102x_2 + 20x_3 - 100$, we can reduce the weights so that it generalises better (a small numerical sketch follows this list):

    The new output function becomes $f_{w,b}(x) = 2x_1 + 0.2x_2 + 3.2x_3 - 3$.

  2. Another method of reducing overfitting is to eliminate unnecessary features, i.e. neglect the features that contribute little to the prediction. For example, while estimating the price of a house we can drop the input feature "distance to the nearest coffee store".

  3. The last method that can be used to reduce overfitting is regularisation, which we discuss in the next section.
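
Here is the small sketch referred to in point 1, using the weight values from the example above (purely illustrative). It shows why large weights make the model overly sensitive: the same small change in the inputs moves the prediction far more when the weights are large.

```python
import numpy as np

# Weight vectors from the example above (assumed values, for illustration only)
w_large, b_large = np.array([92.0, 102.0, 20.0]), -100.0
w_small, b_small = np.array([2.0, 0.2, 3.2]), -3.0

def f(x, w, b):
    """Linear model f_{w,b}(x) = w . x + b."""
    return w @ x + b

x = np.array([1.0, 0.5, 2.0])
x_perturbed = x + 0.05  # a small change in every input feature

# Large weights amplify the small input change far more than small weights do
print("large weights:", f(x_perturbed, w_large, b_large) - f(x, w_large, b_large))
print("small weights:", f(x_perturbed, w_small, b_small) - f(x, w_small, b_small))
```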

What is Regularisation?

Regularisation is a technique to simplify the model and prevent it from fitting the training data too closely. It works by penalising complex models, forcing them to focus on the larger picture rather than memorising small details. This sounds too generic, right? Let's focus on how it actually works.

Types of Regularisation:

There are many kinds of regularisation, but for now let's discuss two of the most important types.

L1 Regularisation:

In this type of regularisation we eliminate the unnecessary features (in our case, the features whose weights are very small) by shrinking the values of $w$, some of them all the way to zero. This is done by adding a penalty to the loss function.

Initially the loss function was $J_{w,b}(x) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2$, but in L1 regularisation we add a penalty term defined by $\lambda \sum_j |w_j|$, where $\lambda$ is known as the regularisation strength (or penalty coefficient) and controls how much regularisation we apply to our model.

The new error becomes:

$J_{w,b}(x) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2 + \lambda \sum_{j=1}^{p} |w_j|$
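
The following is a minimal NumPy sketch of this regularised cost function; the data, weights, and $\lambda$ values are made up purely for illustration. Note how increasing $\lambda$ increases the loss for the same weights, which pushes the optimiser towards smaller (and, for L1, often exactly zero) weights.

```python
import numpy as np

def l1_regularised_loss(w, b, X, y, lam):
    """Squared-error loss plus the L1 penalty lam * sum(|w_j|),
    matching the cost function above (illustrative sketch)."""
    m = X.shape[0]
    y_hat = X @ w + b                              # model predictions
    mse_term = np.sum((y_hat - y) ** 2) / (2 * m)  # (1/2m) * sum of squared errors
    l1_penalty = lam * np.sum(np.abs(w))           # lambda * sum(|w_j|)
    return mse_term + l1_penalty

# Toy data and weights (assumed values)
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([3.0, 2.5, 4.5])
w = np.array([1.0, 0.5])
b = 0.1

for lam in (0.0, 0.1, 1.0):
    print(f"lambda={lam}: loss={l1_regularised_loss(w, b, X, y, lam):.4f}")
```

In practice you rarely write this by hand: scikit-learn's `Lasso` estimator, for instance, fits a linear model with an L1 penalty, with its `alpha` parameter playing the role of $\lambda$.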