[Jump into ML] Overview of Cross-validation

Cross-validation is one of the most popular methods to evaluate model performance and tune model parameters. Like the bootstrap, it belongs to a family of Monte Carlo methods. Today, we will go over several types of CV methods.

A. Hold-out / k-Fold CV

The simplest way is to divide the whole dataset into train and test dataset and evaluate the trained model with test data set(hold-out). Thus, you get one validation score. A more complicated method is to divide the whole data into k parts. One of the k parts becomes a test set and the others train set. This process is repeated until the whole dataset is included in test set. The average of k validation scores becomes the final score. People usually choose k as 5 or 10.

B. Leave-one-out(LOOCV) / Leave-p-out(LoPC)

These are extreme cases of k-Fold CV where ‘one’ or ‘several(=p)’ instances become the test set. Great advantage of this method is that you can include more data into training set, therefore much less information is lost. However, the model has to be trained n(=number of instances, for LOOCV) times, which is computationally expensive.

C. Stratified k-Fold CV

Stratified sampling can make samples represent the data. This type of approach is useful when the data is imbalance. Stratified k-Fold works by dividing dataset to k strata that are as similar as possible, for example, parts that are containing the same proportion of target 0 and 1. For regression case, the strata target averages are approximately the same. The rest works same as ordinary k-Fold.

D. Repeated k-Fold CV

This method is the most robust method of all four. This works by randomly sampling a predefined number(p) of instances for testing, and the rest as training. Notice that this p can vary at each iteration, which is one advantage of repeated k-Fold. K becomes the number of times the model will be trained. One disadvantage is that it is not guaranteed that the whole data will be run over as test set, unless k is very large. Still, there is no selection bias as the selection process is randomized.

All CV methods can be implemented with sklearn.model_selection

A. Hold-out / k-Fold CV

B. Leave-one-out(LOOCV) / Leave-p-out(LoPC)

C. Stratified k-Fold CV

D. Repeated k-Fold CV

Share on: