Machine Learning: Understanding Overfitting, Underfitting, and Regularization.

Hello, and welcome to COT. This article explains overfitting, underfitting, and their solutions, with the mathematics behind them. You need a basic understanding of linear regression and of the common terms and notation used in ML. I will explain it in two steps.

1. Overfitting and Underfitting.
2. How to solve Overfitting and Underfitting.

1.  Overfitting and Underfitting (or Bias and Variance).

You see squared error (i.e. (target - prediction)^2) in many algorithms. It can be decomposed into three parts:

Error = Bias^2 + Variance + Irreducible Error
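As a quick numerical sanity check of this decomposition (a made-up simulation, not from the original article), we can draw many training sets from a known function, fit a deliberately simple model on each, and compare the average squared error at one test point against bias^2 + variance + noise:

```python
import numpy as np

rng = np.random.default_rng(0)
noise_std = 0.3  # the irreducible error has variance noise_std**2

def true_f(x):
    return np.sin(x)

# Evaluate the decomposition at a single test point x0.
x0 = 1.0
preds = []
for _ in range(2000):
    # A fresh noisy training set each round.
    x = rng.uniform(0, np.pi, 30)
    y = true_f(x) + rng.normal(0, noise_std, 30)
    # A simple (biased) model: fit a straight line to a sine curve.
    slope, intercept = np.polyfit(x, y, 1)
    preds.append(slope * x0 + intercept)

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x0)) ** 2
variance = preds.var()

# Average squared error against fresh noisy targets at x0.
targets = true_f(x0) + rng.normal(0, noise_std, preds.size)
mse = np.mean((targets - preds) ** 2)

print(f"MSE                         = {mse:.4f}")
print(f"bias^2 + variance + noise   = {bias_sq + variance + noise_std**2:.4f}")
```

The two printed numbers come out nearly equal, which is exactly what the decomposition predicts.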

How do I know that? Well, it can be derived mathematically (the derivation is linked at the end of this section).

The bias of a machine learning model is the difference between what it is expected to predict (the true value) and what it actually predicts on average. High bias means underfitting.

If a machine learning model fits the training data well but performs badly when tested on unknown data (test data), the gap between training accuracy and test accuracy reflects variance. High variance means overfitting.

Low Bias + Low variance = High performing model

Graph 1: Our model fits the training data perfectly, but this perfection is not good when test data is given to the model. It may give wrong results, because it has learned from noise and useless information. We call this overfitting.

Graph 2: Our model is not fitting well enough to give satisfying accuracy. It will perform just as badly on test data, and its predictions will be wrong. We call this underfitting.

Graph 3: Here the fit looks balanced; bias and variance are both low enough, so the model will handle test data well too. We do not get a high-performing model easily; we have to search for it. But solving overfitting and underfitting helps a great deal.
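The three graphs above can be reproduced numerically. This is a small sketch (with made-up data, using numpy's polynomial fitting) where a degree-1 polynomial underfits a sine curve, degree 15 overfits it, and degree 3 is balanced; compare the train and test errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# True function: y = sin(2*pi*x), observed with Gaussian noise.
def make_data(n):
    x = np.sort(rng.uniform(0, 1, n))
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def mse(degree):
    # Fit a polynomial of the given degree to the training data,
    # then measure squared error on both train and test sets.
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_err, test_err

for d in (1, 3, 15):
    tr, te = mse(d)
    print(f"degree {d:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

Degree 1 has high error everywhere (underfitting, high bias); degree 15 has nearly zero training error but a worse test error than degree 3 (overfitting, high variance).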

Bias and variance have a relationship between them; you can see it in the following graph:

[The vertical axis in the graph above is error/cost.]

[Model complexity: increasing the number of features increases model complexity. Complexity and bias are inversely related: as complexity grows, bias falls and variance rises.]

You can see in the graph above that when bias increases, variance decreases, and vice versa. It looks impossible to get zero bias and zero variance at the same time, but both can be reduced enough to get satisfying results. Everything should be balanced to learn an optimal solution.

To learn more about the bias-variance relationship, you can read the Wikipedia article on the bias-variance tradeoff. There you will find the derivation that decomposes squared error into the three parts (bias^2 + variance + irreducible error).

2. How to solve overfitting and underfitting?

Overfitting and underfitting are just high variance and high bias, so if we solve the high-bias and high-variance problems, overfitting and underfitting get solved automatically.

(a) Solving Underfitting

Solution 1: Looking at the bias-variance graph above, you can say that if we increase model complexity (for example, by adding features), bias will decrease. This is usually the best solution.

Solution 2: Use a non-linear model, but be aware that it may cause overfitting.

Solution 3: Increase training time; it may help.

(b) Solving Overfitting

Solution 1: According to the bias-variance graph, if you reduce the complexity of the model, variance will decrease. But this may cause information loss, and in turn it may increase bias again. If you are using useless features, like a person's name, a unique ID, or any other unimportant feature, you can remove them without any doubt.

Solution 2: Regularization, here comes my favourite topic.

Regularization is clever: it does not remove features, it just reduces their magnitude (the size of their weights).

[Note: Linear regression with L2 regularization is called ridge regression.]

In regularization, we add an extra term to the cost function, called the penalty term, and it reduces the magnitude of the weights. How is that possible? We are going to see, but first look at how the objective functions of linear regression and ridge regression compare.

[ Objective function: It is basically a function which we want to minimize or maximize. Cost function with fancy name. ]
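As a sketch in standard matrix notation (this reconstructs what the image here presumably showed; X is the feature matrix, y the targets, w the weights, and lambda the regularization strength):

```latex
% Linear regression (ordinary least squares) objective:
J_{\mathrm{OLS}}(w) = \lVert y - Xw \rVert^{2}

% Ridge regression: the same squared error plus an L2 penalty term
J_{\mathrm{ridge}}(w) = \lVert y - Xw \rVert^{2} + \lambda \lVert w \rVert^{2}
```

The only difference is the penalty term, which grows with the size of the weights, so minimizing the whole objective pushes the weights toward smaller magnitudes.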

Looking at the ridge regression expression, it feels like the regularization term is actually increasing the magnitude of the weights :) but no: the higher the lambda (also called alpha in some ML libraries), the smaller the magnitude of the weights. How? You can see the maths in the image; open it to follow along.

Don't worry, it is a good explanation (matrix mathematics), which is why I posted it. Let's also understand it in a simple way:

(i) We have our objective function. To minimize it, we take its derivative and set the derivative equal to 0 to find the minimum.

(ii) Move the independent variable (which is w) to the LHS and everything else (the constants and X) to the RHS.

(iii) The expression we get for w (the red box in the image above) minimizes the objective function.

(iv) You can see that lambda sits inside the inverted term, so it acts inversely on w (the weights). As lambda increases, the w that minimizes the objective function shrinks, and that gives us better generalization.
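The steps above can be sketched in code. This is a minimal numpy sketch (with made-up data) of the closed-form ridge solution w = (X^T X + lambda*I)^(-1) X^T y, showing that a larger lambda shrinks the weight magnitudes:

```python
import numpy as np

def ridge_weights(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lambda * I)^(-1) X^T y.
    # np.linalg.solve is used instead of explicitly inverting the matrix.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Made-up regression data for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 1.5, 0.0, 4.0])
y = X @ true_w + rng.normal(0, 0.1, size=50)

for lam in (0.0, 1.0, 100.0):
    w = ridge_weights(X, y, lam)
    print(f"lambda = {lam:6.1f}  ->  ||w|| = {np.linalg.norm(w):.3f}")
```

With lambda = 0 this reduces to ordinary least squares; as lambda grows, the norm of the weight vector steadily decreases, which is exactly the shrinkage described in step (iv).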

If you have any questions, please ask me in the comment section below. And if this helped you, share it with your friends.
