Learning rate

The learning rate is the size of each step in Gradient descent — how far we move along the negative gradient before recomputing it. It’s conventionally written $η$ (the Greek letter eta) and is one of the most important hyperparameters in training a machine-learning model.

The update rule for gradient descent:

$w_{t + 1} = w_{t} - η \nabla J (w_{t})$

The negative gradient says which direction decreases the loss. The learning rate says how far to step in that direction.

Too small

If $η$ is too small, training is very slow — we creep down the hill in tiny steps and need many iterations to reach the minimum. With $η = 0.0001$ , an optimizer that would converge in 100 steps at $η = 0.1$ might take 100,000 steps. The loss curve is smooth but flat — barely decreasing over many iterations.

Too large

If $η$ is too large, the updates may overshoot the minimum. We compute the gradient at the current point, take a big step in that direction, and end up on the other side of the valley — possibly at a point with higher loss than where we started. The loss may oscillate, bounce around without settling, or even diverge entirely with the parameters growing without bound.

Just right

A good learning rate produces a loss curve that drops quickly at first, then levels off as it approaches the minimum. The descent should be steady — neither glacial nor spasmodic.

Picking it

Picking $η$ is partly an art. Common starting values are 0.01 or 0.001. Run training for a while, look at how the loss changes, adjust:

If the loss drops quickly and smoothly, $η$ is roughly right.
If the loss stays flat, $η$ is too small — increase by 10×.
If the loss bounces around or grows, $η$ is too large — decrease by 10×.

A worked example from the textbook: minimizing $f (x, y) = 3 x^{2} + 4 y^{2} + 2 x - 4 x y$ starting from $(1, 1)$ . With $η = 0.1$ , the iterations converge to the analytical minimum $(- 0.5, - 0.25)$ in about 25 steps. With $η = 0.3$ , the path overshoots — taking large steps past the minimum, then correcting back. With $η$ much larger, the path spirals outward and diverges. With $η = 0.001$ , the path creeps slowly inward and doesn’t reach the minimum within 200 iterations.

Beyond a constant

Modern training algorithms often use a learning-rate schedule — $η$ that varies over time. Common patterns:

Step decay: drop $η$ by a factor of 10 every $K$ epochs.
Cosine annealing: $η$ follows a cosine curve from $η_{m a x}$ to $η_{m i n}$ .
Warmup: start small, ramp up linearly to $η_{m a x}$ , then decay.

These help training converge faster early (when far from the minimum, big steps help) and more precisely later (when near the minimum, small steps prevent overshoot).

Adaptive methods like Adam, RMSProp, and AdaGrad adjust the learning rate per parameter based on the history of gradients, removing some of the need to tune $η$ by hand.

Idriss Rami — Notes

Explorer

Learning rate

Too small

Too large

Just right

Picking it

Beyond a constant

Graph View

Table of Contents

Backlinks