
Stephen's Blog

The Guide to K-Fold Cross Validation in Machine Learning

Stephen Cheng


Intro

In machine learning, if a model simply memorizes the labels of the training samples, it may achieve a perfect score on the training data but fail to make meaningful predictions on new, unseen data. This problem is known as overfitting. To avoid it, it is standard practice in supervised learning to set aside a portion of the data, called the test set (X_test, y_test), for evaluating the model’s performance. A single held-out split, however, can give an estimate that depends heavily on which samples happen to land in the test set. That is where K-Fold Cross-Validation comes in: it offers a sneak peek at how your model might fare in the real world. In this guide, we will unpack the basics of K-Fold Cross-Validation and compare it to simpler methods like the Train-Test Split.

Cross Validation Workflow

K-Fold Cross-Validation is a robust technique used to evaluate the performance of machine learning models. It helps ensure that the model generalizes well to unseen data by using different portions of the dataset for training and testing in multiple iterations. Here is a flowchart of a typical cross-validation workflow in model training. The best parameters can be determined by grid-search techniques.
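As a rough sketch of that workflow (assuming the iris dataset and a support vector classifier, which we also use later in this post, plus an illustrative parameter grid), a cross-validated grid search might look like this:

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the data and hold out a final test set.
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Grid search over candidate parameters, scored by 5-fold cross-validation
# on the training portion only.
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # parameters selected by cross-validation
print(search.score(X_test, y_test))  # final check on the held-out test set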

K-Fold Cross-Validation vs Train-Test Split

While K-Fold Cross-Validation partitions the dataset into multiple subsets to iteratively train and test the model, the Train-Test Split method divides the dataset into just two parts: one for training and the other for testing. The Train-Test Split method is simple and quick to implement, but the performance estimate can be highly dependent on the specific split, leading to high variance in the results.

The images below illustrate the structural differences between these two methods. The first image shows the Train-Test Split method, where the dataset is divided into 80% training and 20% testing segments.
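For reference, that 80/20 split can be reproduced with scikit-learn's train_test_split; the iris dataset here is only a stand-in:

from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
# Hold out 20% of the samples for testing; the other 80% are used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)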

The second image depicts a 5-Fold Cross-Validation, where the dataset is split into five parts, with each part serving as a test set in one of the five iterations, ensuring each segment is used for both training and testing.

We can see that K-Fold Cross-Validation provides a more robust and reliable performance estimate because it reduces the impact of data variability. By using multiple training and testing cycles, it minimizes the risk of overfitting to a particular data split. This method also ensures that every data point is used for both training and validation, which results in a more comprehensive evaluation of the model’s performance.
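To make the 5-fold picture concrete, here is a small sketch with scikit-learn's KFold (the dataset and the shuffling settings are just for illustration): each of the five iterations holds out a different fold as the test set, so every sample is tested exactly once.

from sklearn import datasets
from sklearn.model_selection import KFold

X, y = datasets.load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Every sample appears in the training set of four folds and in the test set of exactly one.
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold}: {len(train_idx)} training samples, {len(test_idx)} test samples")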

What Does ‘K’ Represent in K-Fold Cross-Validation?

In K-Fold Cross-Validation, K represents the number of groups into which the dataset is divided. This number determines how many rounds of testing the model undergoes, ensuring each segment is used as a testing set once.

Here is a rough heuristic for choosing K (a short comparison sketch follows the list):

  • K = 2 or 3: These choices can be beneficial when computational resources are limited or when a quicker evaluation is needed. They reduce the number of training cycles, thus saving time and computational power while still providing a reasonable estimate of model performance.
  • K = 5 or 10: These are popular choices because they provide a good balance between computational efficiency and the reliability of the performance estimate.
  • K = 20: Using a larger value of K can provide a more detailed performance evaluation. However, it increases the computational burden and might result in higher variance if the subsets are too small.
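As a quick, non-authoritative illustration of this trade-off, the sketch below runs cross_val_score on the iris dataset with a linear SVC for several values of K; the dataset, model, and K values are assumptions for demonstration, and the numbers will differ on other problems.

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)

# Larger K means more training rounds (more compute) and smaller test folds.
for k in (2, 3, 5, 10, 20):
    scores = cross_val_score(clf, X, y, cv=k)
    print(f"K={k:2d}: mean={scores.mean():.3f}, std={scores.std():.3f}")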

Implementing K-Fold Cross-Validation in Python

The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time).

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

# Load the iris dataset referenced above.
X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
scores

array([0.96…, 1. , 0.96…, 0.96…, 1. ])

The mean score and the standard deviation are hence given by:

print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.98 accuracy with a standard deviation of 0.02

By default, the score computed at each CV (cross-validation) iteration is the one returned by the estimator’s score method. It is possible to change this by using the scoring parameter:

from sklearn import metrics
# Use macro-averaged F1 (from sklearn.metrics) instead of the default accuracy.
scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
scores

array([0.96…, 1. …, 0.96…, 0.96…, 1. ])

When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategy by default, the latter being used if the estimator derives from ClassifierMixin. It is also possible to use other cross-validation strategies by passing a cross-validation iterator instead, for instance:

from sklearn.model_selection import ShuffleSplit
n_samples = X.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
cross_val_score(clf, X, y, cv=cv)

array([0.977…, 0.977…, 1. …, 0.955…, 1. ])
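Since StratifiedKFold was mentioned above as the default strategy for classifiers, here is a short sketch of passing it explicitly, reusing clf, X, and y from the snippets above; the shuffle and random_state settings are illustrative, not required.

from sklearn.model_selection import StratifiedKFold

# Stratified folds preserve the class proportions of y in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cross_val_score(clf, X, y, cv=cv)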

Conclusion

This guide has shown how K-Fold Cross-Validation is a powerful tool for evaluating machine learning models. It is more reliable than a simple Train-Test Split because it tests the model on different parts of your data, helping you trust that it will also work well on unseen data.

Apr 15, 2024

