Posted in AI, Tech

The previous post: Find your best fit

This post is a continuation of the previous one. This one shall discuss a few more ways of evaluating a Machine Learning Algorithm – a Classification Model, in particular.

### Learning Curve

Say we have fit some function to ‘x’ number of training samples. The training error of this function typically keeps increasing as more and more data samples are added, while the error on the test set decreases. However, after a certain limit, say ‘n’ number of samples, the error values will plateau.

Plotting a learning curve helps in deciding whether collecting more samples will enhance the performance of the classifier on the Test set. If the learning curve has already reached the plateau, adding more samples is not going to make much difference to the performance.
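As a rough illustration of how such a curve can be computed, here is a minimal sketch. It assumes synthetic data and a simple straight-line regression instead of a classifier, purely to keep the example short; the helper names (`fit_line`, `mse`, `learning_curve`) are made up for this post.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # All x values identical: fall back to a horizontal line.
        return 0.0, mean_y
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    return a, mean_y - a * mean_x

def mse(a, b, xs, ys):
    """Mean squared error of the line y = a*x + b on the given samples."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learning_curve(train, test, sizes):
    """Return (size, train_error, test_error) for each training-set size."""
    test_x, test_y = zip(*test)
    curve = []
    for m in sizes:
        xs, ys = zip(*train[:m])          # train on the first m samples only
        a, b = fit_line(xs, ys)
        curve.append((m, mse(a, b, xs, ys), mse(a, b, test_x, test_y)))
    return curve

# Synthetic data (an assumption for illustration): y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)

def make_point():
    x = rng.uniform(0, 10)
    return x, 2 * x + 1 + rng.gauss(0, 1)

train = [make_point() for _ in range(200)]
test = [make_point() for _ in range(200)]

curve = learning_curve(train, test, sizes=[2, 5, 10, 50, 100, 200])
for m, train_err, test_err in curve:
    print(f"m={m:3d}  train error={train_err:.3f}  test error={test_err:.3f}")
```

With only two samples the line passes through both points, so the training error starts near zero and grows as samples are added. Once both columns stop changing much, collecting more samples is unlikely to help.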

### Precision and Recall

Let’s consider an Email Spam Classifier. The prediction of the classifier vs the actual result can be one of the following [here, 0 -> False, 1 -> True]:

• True Positive: the spam classifier predicts an email to be spam (predicted class is 1) and it turns out to actually be spam (actual class is also 1). The classification was accurate.
• False Positive: the spam classifier classifies the email as spam (predicted class is 1) but the email is not actually spam (actual class is 0). The classifier has erroneously classified the mail as spam.
• False Negative and True Negative can be similarly interpreted: the predicted class is 0 while the actual class is 1 and 0, respectively.

Precision and Recall are further used to calculate the F1-Score, which measures the performance of a classification algorithm. Let us see what they mean and how they are calculated.

`Precision = True Positive/(True Positive + False Positive)`

Precision: Of all the emails that were predicted to be spam, the fraction of those that were actually spam is called the Precision.

`Recall = True Positive / (True Positive + False Negative)`

Recall: Of all the emails that actually are spam, Recall is the fraction that was correctly predicted as spam.

The relationship between Precision and Recall is illustrated in this graph. They typically trade off against each other: raising the decision threshold tends to increase precision and reduce recall, and lowering it does the opposite. There is a threshold where the classifier performs the best, balancing precision against recall.
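To make the two formulas concrete, here is a small sketch that counts the four outcomes and computes precision and recall at a few thresholds. The labels and spam scores are invented for illustration, and the function names are not from any particular library.

```python
def classify(scores, threshold):
    """Predict 1 (spam) when the score reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]

def confusion_counts(actual, predicted):
    """Return (true_pos, false_pos, false_neg, true_neg)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

def precision(tp, fp):
    # Defined as 0.0 here when nothing was predicted positive (a convention, not a rule).
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Defined as 0.0 here when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical data: 1 = spam, 0 = not spam, with a made-up spam score per email.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10, 0.05, 0.02]

for threshold in (0.25, 0.5, 0.75):
    tp, fp, fn, tn = confusion_counts(actual, classify(scores, threshold))
    print(f"threshold={threshold:.2f}  "
          f"precision={precision(tp, fp):.2f}  recall={recall(tp, fn):.2f}")
```

On this toy data, raising the threshold from 0.25 to 0.75 moves precision up and recall down, which is the trade-off described above.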

### F1-Score

F1-Score is the harmonic mean of Precision and Recall. It is calculated as:

`F1-Score = 2 * (Precision * Recall) / (Precision + Recall)`

Here’s an example of three different algorithms with their precision, recall and F1-Score. The first algorithm has the highest F1-Score and hence performs the best.
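A comparison of this kind can be sketched in a few lines. The precision and recall values below are hypothetical numbers chosen for illustration, and the algorithm names are placeholders.

```python
def f1_score(precision, recall):
    """F1 = 2 * P * R / (P + R); returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs for three algorithms.
algorithms = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3": (0.02, 1.0),
}

for name, (p, r) in algorithms.items():
    simple_average = (p + r) / 2
    print(f"{name}: precision={p:.2f} recall={r:.2f} "
          f"average={simple_average:.3f} F1={f1_score(p, r):.3f}")

# Pick the algorithm with the highest F1-Score.
best = max(algorithms, key=lambda name: f1_score(*algorithms[name]))
print("Best by F1-Score:", best)
```

With these numbers, a plain average would favour the third algorithm, which flags nearly everything as spam. The F1-Score penalizes its very low precision and ranks the first algorithm highest.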

Evaluating an Algorithm at an early stage helps in making an informed decision on the further steps to be taken to improve performance. It is usually suggested to make a very quick and dirty implementation using a simple algorithm initially, and then improve it by analyzing the algorithm and taking decisions like adding or reducing features, collecting more samples etc.

Posted in AI, Tech

Let’s say we have chosen and implemented the Machine Learning Algorithm best suited to our data, but ended up finding that it is making unacceptably large errors in prediction. What do we do next? Let’s discuss some choices that can be made in such a scenario and also get an intuition of what can be expected out of them. These can be used in general, for debugging or analyzing the performance of a Machine Learning Algorithm on a specific implementation.

2. Try a smaller set of features
5. Increase / decrease the regularization parameter

The effect of these choices only makes sense after understanding some underlying concepts like Over-fit, Under-fit, Bias, Variance etc. Here is what they mean:

#### Under-fit and Over-fit:

As illustrated in the graph, if the regression model fits only a few data samples, it is said to be under-fitting. An insufficient number of features might cause under-fitting. This is also called High Bias.

Conversely, if the model fits almost all the data samples provided to the algorithm, it is said to be over-fitting. It may perform well on the training set but will not generalize well enough when new samples are encountered in the test set. This is also called High Variance.

#### Bias vs Variance:

When the training set error and the test set error are plotted against, say, the size of the data sample, we get a graph similar to that given below. The algorithm may perform excellently on the training set, with very low error, whereas on the test set it may perform poorly, leading to high error.

[Error is the difference between the prediction made by the model and the actual value. The entire data sample is usually divided into a Training set, a Test set and sometimes a Cross validation set. The predictive model is trained using the training set. The Cross validation and Test sets are used to validate and test the model, respectively.]
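One simple way to carve a data sample into these three sets is to shuffle it and slice it. The sketch below assumes a 60/20/20 split, which is a common choice but not something prescribed here; `split_dataset` is a made-up helper name.

```python
import random

def split_dataset(samples, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle the samples and split them into (train, cross_validation, test)."""
    shuffled = list(samples)                 # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_cv = round(len(shuffled) * cv_frac)
    train = shuffled[:n_train]
    cross_validation = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]         # whatever remains goes to the test set
    return train, cross_validation, test

# Stand-in data: 100 sample ids instead of real feature rows.
train_set, cv_set, test_set = split_dataset(range(100))
print(len(train_set), len(cv_set), len(test_set))
```

Shuffling before slicing matters when the original data is ordered (for example by date or by class), since otherwise the three sets may not resemble each other.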

High Bias: If both the training set error and the test set error are high, we can infer that the algorithm is suffering from high bias or is under-fitting.

High Variance: If the training set error is very low and the test set error is very high in comparison, the algorithm is suffering from high variance or is over-fitting.
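These two rules of thumb can be written down as a small check. What counts as "high" depends on the problem, so the `acceptable_error` and `gap_ratio` parameters below are assumptions that would need tuning; `diagnose` is an illustrative helper, not a standard function.

```python
def diagnose(train_error, test_error, acceptable_error, gap_ratio=2.0):
    """Rough bias/variance diagnosis from training and test errors.

    acceptable_error: the error level considered good enough (problem-specific).
    gap_ratio: how many times larger than the training error the test error
               must be before the gap is called large (an arbitrary default).
    """
    train_high = train_error > acceptable_error
    test_high = test_error > acceptable_error
    if train_high and test_high:
        # Both errors are high: the model does not even fit the training data.
        return "high bias (under-fitting)"
    if not train_high and test_high and test_error > gap_ratio * train_error:
        # Low training error but a much higher test error: poor generalization.
        return "high variance (over-fitting)"
    return "no obvious problem"

print(diagnose(train_error=0.30, test_error=0.32, acceptable_error=0.10))
print(diagnose(train_error=0.01, test_error=0.25, acceptable_error=0.10))
print(diagnose(train_error=0.04, test_error=0.05, acceptable_error=0.10))
```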

#### Back to the diagnosis:

Now that we have the required background, let’s go back to the diagnosis and analyze each of our choices.

• If the test set error is much higher than the training set error, suggesting high variance, we can fix it by collecting more data samples and training the model again.
• This might not be such a great idea if both the training and test set errors are high, since more data does not help a model that is under-fitting.
• If the algorithm is under-fitting the samples, the number of features might not be sufficient for the model to make accurate predictions. Collecting some more features may prove to be helpful.
• For example, if the model is predicting the price of a hotel, details like the number of rooms, room size, floor elevation, balconies, furniture etc. can act as additional features.
3. Try a smaller set of features:
• If the algorithm is over-fitting the data samples, in other words, it performs extremely well on the training set but is highly erroneous on the test set, using a smaller set of features might help improve the performance.