Hyperparameter

A hyperparameter is a configuration value that controls how a model is trained but is not itself learned by the training procedure. The learning rate is a hyperparameter, and so is the polynomial degree in polynomial regression. The number of trees in a random forest, the regularization strength in a regularized linear model, and the number of nearest neighbors in kNN are all hyperparameters.

The contrast is with parameters, the values the training procedure adjusts: the weights $w$ in linear regression or logistic regression, the entries of a neural network’s weight matrices. Parameters are learned from the data, typically by an optimizer such as gradient descent; hyperparameters are set by the practitioner before training begins.
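To make the distinction concrete, here is a minimal sketch using scikit-learn's `Ridge` regressor. The synthetic data and the choice `alpha=1.0` are assumptions made for illustration only; the point is which values are passed in before `fit()` and which ones come out of it.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (illustrative assumption): y = 3*x0 - 2*x1 + small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha is a hyperparameter: chosen by us and fixed before fit() is called.
model = Ridge(alpha=1.0)
model.fit(X, y)

# coef_ and intercept_ are parameters: computed from the data by fit().
print("hyperparameter alpha:", model.alpha)
print("learned weights:", model.coef_)
print("learned intercept:", model.intercept_)
```

Changing `alpha` and refitting gives different learned weights, but `fit()` never changes `alpha` itself.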

Why this matters

Different hyperparameter values produce different models. A learning rate that’s too small makes the optimizer so slow that it may not converge within the training budget; one that’s too large makes it oscillate or diverge. A polynomial degree that’s too low leads to underfitting; one that’s too high leads to overfitting. Picking good hyperparameters is much of the art of practical machine learning.
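The learning-rate behaviour can be seen on a toy problem. The sketch below runs plain gradient descent on $f(w) = w^2$, a function chosen here only because its gradient $2w$ is easy to follow; the function name `gradient_descent` and the specific learning rates are illustrative assumptions.

```python
def gradient_descent(lr, steps=50, w0=5.0):
    """Minimise f(w) = w**2 (gradient 2*w) and return every visited w."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - lr * 2.0 * w  # standard update: w <- w - lr * gradient
        history.append(w)
    return history

# Each step multiplies w by (1 - 2*lr), so the learning rate alone decides
# whether w creeps, converges, oscillates, or blows up.
for lr in (0.001, 0.1, 0.95, 1.1):
    final = gradient_descent(lr)[-1]
    print(f"lr={lr:<6} final w after 50 steps = {final:.4g}")
```

With the smallest rate, `w` has barely moved from its starting point after 50 steps. With `0.1` it is close to the minimum at zero. With `0.95` the sign flips on every step while the magnitude shrinks slowly, and with `1.1` the magnitude grows without bound.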

How to pick them

Hyperparameters can’t be picked by minimizing the training loss. For hyperparameters that control model capacity, such as polynomial degree or regularization strength, the most flexible setting fits the training set best, so picking by training loss reduces to picking the most overfitted model, which is the opposite of what we want.
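A small experiment illustrates the problem for polynomial degree. The sketch below fits polynomials of increasing degree to noisy samples of a sine curve with `numpy.polyfit`; the data-generating function, noise level, and degree range are all assumptions made for the example.

```python
import numpy as np

# Noisy training and validation samples from the same curve (illustrative).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=x.size)
x_val = np.linspace(-0.95, 0.95, 30)
y_val = np.sin(3.0 * x_val) + rng.normal(scale=0.2, size=x_val.size)

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial with these coefficients."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

degrees = list(range(1, 10))
train_mse, val_mse = [], []
for d in degrees:
    coeffs = np.polyfit(x, y, deg=d)  # least-squares fit on training data only
    train_mse.append(mse(coeffs, x, y))
    val_mse.append(mse(coeffs, x_val, y_val))
    print(f"degree={d}  train MSE={train_mse[-1]:.4f}  val MSE={val_mse[-1]:.4f}")
```

Because a degree-$d$ polynomial can represent every polynomial of lower degree, the training error can only stay the same or go down as the degree grows, so it always favours the largest degree tried. The validation column is the one that carries information about generalization.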

Instead, hyperparameters are picked by evaluating on a validation set (or by K-fold cross-validation) and choosing the value that maximizes generalization, that is, performance on data the model wasn’t trained on. The standard procedures:

Grid search. Pick a grid of candidate values for each hyperparameter, train a model with each combination, evaluate each on the validation set, and pick the best. Exhaustive, but slow when there are many hyperparameters, because the number of combinations grows multiplicatively.
Random search. Sample random combinations from a distribution over hyperparameter values. Often finds nearly optimal hyperparameters faster than grid search, especially when only a few of the hyperparameters matter.
Bayesian optimization. A more sophisticated approach that fits a surrogate model of the validation score as a function of the hyperparameters and uses it to choose the next combination to try, balancing promising regions against unexplored ones.
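The first two procedures are simple enough to write by hand. The sketch below tunes a kNN classifier on a held-out validation set; the synthetic dataset, the candidate values, and the helper name `validation_accuracy` are assumptions for illustration. Bayesian optimization is left out because it needs a surrogate model, which is usually taken from a dedicated library.

```python
import itertools
import random

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data split into training and validation sets (illustrative).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def validation_accuracy(n_neighbors, weights):
    """Train with the given hyperparameters, score on the validation set."""
    model = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    model.fit(X_train, y_train)
    return model.score(X_val, y_val)

k_values = [1, 3, 5, 9, 15, 25]
weight_options = ["uniform", "distance"]

# Grid search: try every combination of the candidate values.
grid = list(itertools.product(k_values, weight_options))
grid_best = max(grid, key=lambda p: validation_accuracy(*p))

# Random search: try a fixed budget of randomly drawn combinations.
rng = random.Random(0)
candidates = [(rng.randint(1, 30), rng.choice(weight_options)) for _ in range(6)]
random_best = max(candidates, key=lambda p: validation_accuracy(*p))

print("grid search best:", grid_best, validation_accuracy(*grid_best))
print("random search best:", random_best, validation_accuracy(*random_best))
```

The grid trains 12 models here and the random search trains 6. The random budget is set independently of how many hyperparameters there are, which is what makes it attractive as the search space grows.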

In scikit-learn, GridSearchCV and RandomizedSearchCV automate grid and random search with cross-validation built in:

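from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# X_train, y_train: training features and labels, assumed to be defined already.
# C is the inverse regularization strength: smaller values regularize more.
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
# 5-fold cross-validation over every value in the grid, scored by accuracy.
search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
# Best hyperparameter setting and its mean cross-validated accuracy.
print(search.best_params_, search.best_score_)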
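For comparison, a random-search version of the same tuning problem might look like the sketch below. It assumes SciPy is available for the `loguniform` distribution, and it substitutes a synthetic dataset for `X_train` and `y_train` so that it runs on its own; the range for `C` and the budget of 10 draws are illustrative choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Stand-in training data (illustrative assumption).
X_train, y_train = make_classification(
    n_samples=300, n_features=10, random_state=0
)

# Sample C log-uniformly between 0.01 and 100 instead of listing fixed values.
param_distributions = {"C": loguniform(1e-2, 1e2)}
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=10,        # number of random settings to try
    cv=5,             # 5-fold cross-validation for each setting
    scoring="accuracy",
    random_state=0,   # makes the random draws reproducible
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```

A log-uniform distribution is a common choice for regularization strengths because plausible values span several orders of magnitude.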

Common hyperparameters

The hyperparameters worth knowing for each model family:

Linear regression / Logistic regression — regularization type (L1 / L2 / none), regularization strength.
Polynomial regression — polynomial degree.
Decision trees — max depth, min samples per leaf, max features.
Random forests — number of trees, max depth, max features per split.
kNN — number of neighbors $k$, distance metric.
SVMs — kernel type, regularization $C$, kernel-specific parameters ($\gamma$ for RBF).
Neural networks — learning rate, batch size, number of layers, layer widths, activation functions, dropout rate, weight decay, learning-rate schedule.
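In scikit-learn, the hyperparameters of an estimator can be listed with `get_params()`, which is a quick way to see what is tunable for a given model family. The sketch below prints them for three of the families listed above; the choice of estimators is only an example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# get_params() returns the constructor arguments, i.e. the hyperparameters,
# together with their current (here: default) values.
for estimator in (RandomForestClassifier(), KNeighborsClassifier(), SVC()):
    params = estimator.get_params()
    print(type(estimator).__name__)
    for name in sorted(params):
        print(f"  {name} = {params[name]!r}")
```

The names printed here are the same keys that `GridSearchCV` and `RandomizedSearchCV` expect in their parameter dictionaries.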

For neural networks, the hyperparameter space is enormous and tuning is correspondingly expensive: single training runs can take days, so a full sweep over many configurations is often infeasible. Practitioners rely heavily on experience, published recipes, and techniques like learning-rate schedules that reduce the need to tune the learning rate carefully.
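As a rough illustration of what a learning-rate schedule is, the sketch below implements two common shapes, step decay and cosine decay, as plain functions of the training step. The function names, default values, and the 100-step horizon are assumptions for the example; deep-learning frameworks ship their own scheduler classes.

```python
import math

def step_decay(step, base_lr=0.1, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` steps."""
    return base_lr * drop ** (step // every)

def cosine_decay(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Anneal from base_lr down to min_lr along half a cosine wave."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 10, 25, 50, 100):
    print(
        f"step={step:<4} step_decay={step_decay(step):.5f} "
        f"cosine_decay={cosine_decay(step, total_steps=100):.5f}"
    )
```

A schedule does not remove the learning rate as a hyperparameter, and it adds a few of its own, but starting high and decaying tends to be more forgiving than a single fixed value.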

Idriss Rami — Notes
