K-fold cross-validation is an evaluation procedure that is especially useful for small datasets. Instead of one fixed validation set, we split the training portion into K equal pieces, called folds, and train and evaluate K times. Each time, one fold is used as the validation set and the remaining K - 1 folds are used for training.

The motivation: a single train/validate split can give misleading results on a small dataset. If we get unlucky in the random split, the validation set might not be representative, and our hyperparameter choices might be miscalibrated. Cross-validation averages over multiple splits, giving a much more robust estimate.

The picture

With K = 5 and folds labelled 1 through 5:

  • Run 1: validate on fold 1, train on folds 2-5.
  • Run 2: validate on fold 2, train on folds 1, 3, 4, 5.
  • Run 3: validate on fold 3, train on folds 1, 2, 4, 5.
  • Run 4: validate on fold 4, train on folds 1, 2, 3, 5.
  • Run 5: validate on fold 5, train on folds 1, 2, 3, 4.

By the end, every fold has been used once for validation and four times as part of training. The five validation scores together give a much more robust estimate than any single train/validate split would. We typically report the mean and standard deviation of the scores across folds — the mean is the central estimate, the standard deviation quantifies the uncertainty.
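The five runs above can be written out as an explicit loop. This is a minimal sketch, not a prescribed implementation: the synthetic dataset from make_classification, the plain LogisticRegression model, and the variable names are all assumptions chosen so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data (an assumption), used only so the sketch is self-contained.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
val_counts = np.zeros(len(X), dtype=int)  # how often each row is used for validation

for run, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Train on the four remaining folds, validate on the held-out fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    score = model.score(X[val_idx], y[val_idx])
    fold_scores.append(score)
    val_counts[val_idx] += 1
    print(f"Run {run}: train on {len(train_idx)} rows, "
          f"validate on {len(val_idx)} rows, accuracy {score:.3f}")

fold_scores = np.array(fold_scores)
# Mean as the central estimate, standard deviation as the spread across folds.
print(f"{fold_scores.mean():.3f} ± {fold_scores.std():.3f}")
```

The val_counts array is there only to check the claim in the text: after the loop, every row has been used for validation exactly once.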

After cross-validation

Once cross-validation has been used to compare configurations and pick one, the next step depends on whether a test set was held out:

  • If we have a separate test set held out, the final model is retrained on all the non-test data using the chosen configuration, and then evaluated once on the test set.
  • If there’s no separate test set (the dataset was too small to spare one), we report the cross-validation average and standard deviation as the performance estimate.

The reason for using all the remaining folds for training in each iteration: the model gets to learn from every example except those in the fold currently held out. This squeezes the most learning out of a small dataset. The cost is computational: training K times takes roughly K times as long as training once.
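The held-out-test-set workflow described above might look like the following sketch. The synthetic data, the 80/20 split, and the two candidate values of C are hypothetical choices made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption), so the sketch runs on its own.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hold out the test set first; it is not touched during cross-validation.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical candidate configurations: two regularisation strengths.
candidates = {
    C: make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    for C in (0.01, 1.0)
}

# Compare configurations by mean 5-fold CV accuracy on the non-test data.
cv_means = {
    C: cross_val_score(model, X_dev, y_dev, cv=5).mean()
    for C, model in candidates.items()
}
best_C = max(cv_means, key=cv_means.get)

# Retrain the chosen configuration on all non-test data, then test once.
final_model = candidates[best_C].fit(X_dev, y_dev)
test_accuracy = final_model.score(X_test, y_test)
print(f"chosen C={best_C}, test accuracy {test_accuracy:.3f}")
```

The test set is scored exactly once, after the configuration has been fixed. Looking at it earlier would let it influence the choice and bias the final estimate.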

In scikit-learn

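from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X (feature matrix) and y (labels) are assumed to be defined already.
# Putting the scaler inside the pipeline means it is re-fit on the
# training folds of each run, so no validation data leaks into scaling.
clf = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
# Mean ± standard deviation of the per-fold scores
print(f'{scores.mean():.3f} ± {scores.std():.3f}')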

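cross_val_score(...) performs K-fold CV (5-fold by default) and returns one score per fold. When cv is an integer and the estimator is a classifier, the folds are stratified by class. The scoring= parameter chooses the metric: a predefined scorer name such as 'accuracy', 'roc_auc' or 'f1', or a custom scorer built with sklearn.metrics.make_scorer.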
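As a small illustration of the scoring= parameter, the sketch below passes one predefined scorer name and one custom scorer. The synthetic data and the choice of an F2 score are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption), so the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000)

# A predefined scorer, selected by name.
auc_scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")

# A custom scorer: F-beta with beta=2, wrapped by make_scorer.
f2_scorer = make_scorer(fbeta_score, beta=2)
f2_scores = cross_val_score(clf, X, y, cv=5, scoring=f2_scorer)

print(f"ROC AUC: {auc_scores.mean():.3f} ± {auc_scores.std():.3f}")
print(f"F2:      {f2_scores.mean():.3f} ± {f2_scores.std():.3f}")
```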

StratifiedKFold is a variant that preserves class proportions in each fold — essential for imbalanced classification problems. GroupKFold keeps related groups (e.g., samples from the same patient) entirely in either training or validation, never splitting them — essential when group leakage is a concern.
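The two splitters can be checked on toy data. In this sketch the labels are 10% positive and the groups array stands in for hypothetical patient IDs; both are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 10 + [0] * 90)      # imbalanced labels: 10% positives
groups = np.repeat(np.arange(20), 5)   # 20 hypothetical patients, 5 samples each

# StratifiedKFold: count the positives that land in each validation fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print("positives per validation fold:", positives_per_fold)

# GroupKFold: count groups that appear on both sides of each split.
gkf = GroupKFold(n_splits=5)
overlaps = [
    len(set(groups[train_idx]) & set(groups[val_idx]))
    for train_idx, val_idx in gkf.split(X, y, groups=groups)
]
print("shared groups per split:", overlaps)
```

Either splitter can be passed as the cv= argument of cross_val_score. For GroupKFold the group labels go in through the groups= argument as well.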

For hyperparameter tuning combined with cross-validation, GridSearchCV and RandomizedSearchCV automate the full procedure: try each configuration, run cross-validation, pick the best. By default they then refit the best configuration on all the data passed to fit.
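A minimal GridSearchCV sketch follows. The synthetic data and the grid of C values are assumptions. The parameter name logisticregression__C follows from the step names that make_pipeline generates, which are the lowercased class names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption), so the sketch runs on its own.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid: four regularisation strengths for the logistic regression step.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# Each configuration is scored by 5-fold cross-validation.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)                                 # configuration with the best mean CV score
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

After fitting, search.best_estimator_ holds the pipeline refit on all of X and y with the winning configuration. If a separate test set exists, this is the model to evaluate on it.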