Cross-Validation Techniques: k-fold Cross-Validation vs. Leave-One-Out Cross-Validation


by Shang Ding

with Greg Page

When building supervised learning models, we must always consider the risk of overfitting. Overfitting occurs when a model has been optimized to work well with the specific set of observations used to build it, but cannot replicate that performance against new, yet-unseen data.

One of the most common ways of mitigating overfitting risk is to use a train-test split. Typically, a modeler using this method randomly partitions an entire dataset, using some designated percentage (often 60 or 70%) for model building, while keeping the remaining records as a test set, or holdout set, to evaluate performance against data not used in model construction.
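To make this concrete, here is a minimal sketch of a train-test split in scikit-learn; the synthetic data generated with make_classification is a placeholder for a real dataset, not the data used in this article:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for a real dataset (illustration only)
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

# Keep 70% of the rows for training; the remaining 30% form the holdout (test) set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,
                                                    random_state=42)
```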

Especially when the number of observations in a dataset is limited, the train-test split method is prone to bias. For example, let’s say we are using a small sample dataset containing just 15 records to predict the ‘Label’ outcome using ‘Mean’ and ‘SD’ values as inputs.

With a completely random sampling approach, it is possible that all the training set records could contain ‘red’ labels, while all of the ‘green’ records go to the test set. That would make our model results invalid, as the model trained only on red labels would always predict red outcomes. Alternatively, if all of the green labels ended up in the training set, we would have a similar imbalance problem. Partitioning methods such as stratified sampling can correct for this particular problem, but they may not address every imbalance in the dataset.

To address this issue, we can use cross-validation. With cross-validation, rather than making a single “slice” that assigns all records to either the training or testing sets, we instead repeatedly sub-divide the observations into smaller groups. In k-fold cross-validation, the k-value refers to the number of groups, or “folds” that will be used for this process.

In a k=5 scenario, for example, the data will be divided into five groups, and five separate models will actually be built. For the first model, one-fifth of the data will be kept aside as a holdout set, while the other four-fifths is used to train a model. Then, a second model is built — but this time, a different group of records is kept as the holdout set, while the remaining four-fifths of the data is used to train the model. This process is repeated three more times — in each case, a distinct set of records is used as the holdout set, while the rest of the available data is used for model building.

When the k value changes, the overall concept remains the same, but the number of total folds, and their relative size, is altered. With k=10, there would be 10 model-building iterations; each time, one-tenth of the data would be used as a holdout set, with the other 90 percent being used to build the model.

Note:

The illustrations below are included to demonstrate the k-fold cross-validation concept. In this example, the folds are selected sequentially for clarity; in reality, the distinct subgroups of data are selected randomly from the original dataset.

(Five illustrations in the original post show the first through fifth iterations, with a different fold held out as the test set each time.)
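To see the fold assignments directly, here is a minimal sketch using scikit-learn’s KFold splitter on a hypothetical 15-row dataset; setting shuffle=True mirrors the random selection described in the note above:

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical 15-row dataset: row indices 0..14 stand in for actual records
rows = np.arange(15)

# k=5 folds, with rows shuffled so fold membership is random rather than sequential
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for i, (train_idx, test_idx) in enumerate(kf.split(rows), start=1):
    print(f"Iteration {i}: holdout rows {test_idx}, training rows {train_idx}")
```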

Each time that the model is fit, this process will generate an accuracy score. In this example, we will generate five accuracy scores, and we will use the mean of these five values as our expected model accuracy.

Implementation in Python

Import the modules
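A minimal set of modules that an example like this needs might look as follows (a sketch, not the authors’ exact import cell):

```python
# Modules typically needed for this example
import numpy as np
from sklearn.datasets import make_classification   # used below to create stand-in data
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
```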

The code below shows an implementation in Python, with a logistic regression model built from a dataset slightly larger than the one used in the example above.

In scikit-learn, the ‘cv’ parameter of the cross_val_score() function represents the k-value. The most commonly used k-values are 5 and 10, but there are no set rules for this — larger or smaller values can be used, too.
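Continuing from the imports above, here is a minimal sketch of the whole procedure; the data is synthetic stand-in data created with make_classification, so the scores it prints will not match the ones in the original example:

```python
# Synthetic stand-in data (the original post uses its own, slightly larger dataset)
X, y = make_classification(n_samples=60, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

log_reg = LogisticRegression()

# cv=5 -> five folds, five fitted models, five accuracy scores
scores = cross_val_score(log_reg, X, y, cv=5, scoring="accuracy")

print("Fold accuracies:", scores)
print("Expected accuracy (mean of the five folds):", scores.mean())
```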

If this process were performed repeatedly, the average accuracy could be expected to change slightly each time. Even with the same k-value, the specific composition of each holdout set would change, as would the mix of observations used for each model building iteration. As the k-value increases, the risk of bias due to random row assignments becomes smaller, but the compute time needed to run the algorithm grows.

Leave-One-Out Cross-Validation

When k equals the number of records in the entire dataset, this approach is called Leave-One-Out Cross-Validation, or LOOCV.

When using LOOCV, we train the model n times (with n representing the number of records in the dataset). Each time, only one record will be the test set, with the rest of the records used to build the model.

Implementation in Python
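Here is a minimal sketch of LOOCV using scikit-learn’s LeaveOneOut splitter on a synthetic 15-row stand-in dataset; with n = 15 records, 15 models are fit, though the scores will differ from those in the original example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic 15-row stand-in dataset (placeholder for the original 15-record example)
X, y = make_classification(n_samples=15, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

# LeaveOneOut produces n splits: each row serves exactly once as a one-record test set
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(), X, y, cv=loo, scoring="accuracy")

print("Number of scores:", len(scores))   # 15, one per record
print("Average accuracy:", scores.mean())
```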

In the original example, this process produced 15 accuracy scores, one for each of the 15 records in the dataset; the average of those 15 scores was 0.75.

LOOCV’s main advantage comes through in the way it reduces bias — when we use LOOCV, we obtain the same result every time. Since every row is used once as a unique test set, with all the other rows used for model building, there is no randomness with row assignment.

The most salient LOOCV disadvantage is its high computational cost: especially with large datasets, the time required to build n unique models can be prohibitive.

Cross-Validation: Hyperparameter Tuning

Cross-validation plays a very important role in hyperparameter tuning.

Hyperparameters are model settings that can be determined and adjusted by the modeler. For example, when we build a random forest model, we can decide on the number of decision trees to be used in the forest, the maximum number of features to be considered at each split, and the maximum allowable depth of each tree.
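In scikit-learn, for example, these settings correspond to constructor parameters of RandomForestClassifier; the specific values below are arbitrary choices for illustration:

```python
from sklearn.ensemble import RandomForestClassifier

# Each hyperparameter is set by the modeler, not learned from the data
# (the values here are arbitrary, chosen only to illustrate the parameters)
rf = RandomForestClassifier(
    n_estimators=200,   # number of decision trees in the forest
    max_features=2,     # maximum number of features considered at each split
    max_depth=5,        # maximum allowable depth of each tree
    random_state=42,
)
```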

A popular way to make those determinations is by using the grid search process — in this process, each possible unique combination of settings is tested. With cross-validation, we can assess the relative accuracy of each combination, and then select the combination that brings the highest average accuracy value.

Note that when we do this, we still keep a separate set of records aside as a final holdout set. The cross-validation process occurs entirely with the training set; once the optimal set of hyperparameters is found, a model is trained with those settings, and then that model is checked against a test set.

Implementation in Python

First, a random forest model is built in scikit-learn without any specified hyperparameters. With just the default settings in place, the original example’s accuracy score against the test set was 0.586.
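A minimal sketch of that baseline step is shown below; it uses synthetic stand-in data rather than the original dataset, so its score will not reproduce the 0.586 figure:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (the original example uses its own dataset)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,
                                                    random_state=42)

# Baseline random forest with default hyperparameters
baseline_rf = RandomForestClassifier(random_state=42)
baseline_rf.fit(X_train, y_train)
print("Baseline test accuracy:", baseline_rf.score(X_test, y_test))
```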

To implement the grid search with GridSearchCV, we build a dictionary, or set of key-value pairs, with several possible values for each of the hyperparameters shown below:
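A sketch of such a dictionary might look like this; the parameter names are genuine RandomForestClassifier arguments, but the candidate values are illustrative assumptions rather than the ones used in the original post:

```python
# Candidate values for each hyperparameter; GridSearchCV will try every combination
param_grid = {
    "n_estimators": [100, 200, 500],   # number of trees in the forest
    "max_features": [2, 3, 4],         # features considered at each split
    "max_depth": [3, 5, None],         # None lets trees grow until the leaves are pure
}
```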

We pass the dictionary defined above to the ‘param_grid’ parameter of GridSearchCV(), and specify cv=5 for five-fold cross-validation.
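Continuing from the baseline sketch and the param_grid dictionary above, that step might look like this:

```python
from sklearn.model_selection import GridSearchCV

# Evaluate every combination in param_grid with five-fold cross-validation
# (the search runs on the training set only; the test set stays untouched)
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
grid_search.fit(X_train, y_train)
```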

As noted above, in the grid search process, all the possible combinations of values are assessed, with accuracy scores determined through cross-validation. By accessing the fitted search object’s ‘best_params_’ attribute, we get the combination of hyperparameters that produced the best cross-validated score.
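In code, continuing from the fitted grid_search object above:

```python
# Best combination found by the search, and its mean cross-validated accuracy
print("Best parameters:", grid_search.best_params_)
print("Best CV accuracy:", grid_search.best_score_)
```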

Then, we input that optimal combination of settings, and use this to fit a new model:
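A sketch of that final step, still using the hypothetical objects from the sketches above:

```python
# Refit a random forest with the best settings found by the grid search,
# then evaluate it against the holdout test set
tuned_rf = RandomForestClassifier(random_state=42, **grid_search.best_params_)
tuned_rf.fit(X_train, y_train)
print("Tuned test accuracy:", tuned_rf.score(X_test, y_test))
```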

With the help of cross-validation, we have improved model accuracy.

Summary: Cross-Validation

Cross-validation is an important element in the toolkit of any data scientist or machine learning engineer.

As noted above, the choice of a k-value is up to the modeler, and can be as large as the number of records used in the entire dataset. Larger k-values reduce the risk of bias due to the randomness involved in row assignments, but they come with a cost (increased computational complexity).

Especially when we have a limited number of records on hand for model building, cross-validation can be a particularly handy way to gain a sense of likely model accuracy against records that were not used in model building. It can also be helpful in other parts of the modeling process, such as hyperparameter tuning.
