Mean squared error

Mean squared error (MSE) is the standard Loss function for Regression. Given a dataset of $N$ examples $\{(x_{1}, y_{1}), \dots, (x_{N}, y_{N})\}$, where each $y_{i}$ is the true output for input $x_{i}$, the MSE of a model $f$ with parameters $w$ is:

$J (w) = \frac{1}{2 N} \sum_{i = 1}^{N} (f (x_{i}, w) - y_{i})^{2}$

For each example, compute the model’s prediction $f (x_{i}, w)$, subtract the true value $y_{i}$, and square the difference. Then sum across examples and divide by $2 N$. The result is a single number whose value depends on $w$.
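The recipe above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `mse` and `linear_model`, the straight-line form of $f$, and the toy data are all assumptions made for the example.

```python
def mse(predictions, targets):
    """Halved mean squared error: (1 / 2N) * sum((prediction - target)^2)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    n = len(targets)
    # Square each prediction error, then sum and divide by 2N.
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, targets)]
    return sum(squared_errors) / (2 * n)


def linear_model(x, w):
    # Hypothetical model f(x, w) = w[0] + w[1] * x, chosen only for illustration.
    return w[0] + w[1] * x


# Toy data (assumed): the targets follow y = 2x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = (0.0, 1.5)  # a deliberately imperfect parameter vector
predictions = [linear_model(x, w) for x in xs]
loss = mse(predictions, ys)

# With the parameters that generated the data, the loss drops to zero.
perfect_loss = mse([linear_model(x, (0.0, 2.0)) for x in xs], ys)
```

Changing `w` changes `loss`, which is the sense in which the result is a function of $w$.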

A few things about this formula:

The $N$ in the denominator makes it an average. Without it, the loss would keep growing as we add more data, even if the typical error per example stayed the same. The size of the dataset shouldn’t change what a small loss means.

The factor of 2 has nothing to do with averaging; it’s a notational convenience. Differentiating the squared term pulls down a factor of 2 (the power rule, applied as the outer step of the chain rule), and that 2 cancels the 1/2, leaving a clean expression. Some sources omit the 1/2, in which case the gradient picks up a factor of 2 in front. Either is fine, since the constant doesn’t move the minimum; the Introduction to Data Science textbook uses the 1/2 version.
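Written out, the cancellation looks like this. The gradient notation $\nabla_{w}$ is an assumption added here; the formula above does not introduce it.

$\nabla_{w} J (w) = \frac{1}{2 N} \sum_{i = 1}^{N} 2 \, (f (x_{i}, w) - y_{i}) \, \nabla_{w} f (x_{i}, w) = \frac{1}{N} \sum_{i = 1}^{N} (f (x_{i}, w) - y_{i}) \, \nabla_{w} f (x_{i}, w)$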

The squaring does two things. It makes every term non-negative, so predictions above and below the true value don’t cancel each other. And it penalizes large errors more than proportionally: a prediction off by 4 units contributes 16 to the sum, off by 2 contributes 4, off by 1 contributes 1. The model is strongly motivated to avoid big mistakes, even at the cost of accepting more small ones.
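A small comparison makes the trade-off concrete. The sketch below uses two made-up error patterns with the same total absolute error: one large miss versus four small ones. The helper names are hypothetical, and the absolute-error sum is included only as a contrast.

```python
def sum_squared(errors):
    # Sum of squared errors (the quantity MSE averages, before dividing by 2N).
    return sum(e ** 2 for e in errors)


def sum_absolute(errors):
    # Sum of absolute errors, shown only for comparison.
    return sum(abs(e) for e in errors)


one_big_miss = [4.0, 0.0, 0.0, 0.0]      # one prediction off by 4
four_small_misses = [1.0, 1.0, 1.0, 1.0]  # four predictions off by 1

# Both patterns have the same total absolute error...
same_absolute = sum_absolute(one_big_miss) == sum_absolute(four_small_misses)

# ...but squaring charges the single large miss four times as much.
big_penalty = sum_squared(one_big_miss)
small_penalty = sum_squared(four_small_misses)
```

Under squared error, the second pattern is the cheaper one, which is why a model trained on MSE tends to prefer several small errors over one large one.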

Why not classification

MSE is the right choice for regression but the wrong choice for classification. Combined with the Sigmoid function in Logistic regression, MSE produces a non-convex loss surface, so gradient descent is no longer guaranteed to reach the global minimum. The gradient also becomes very small (vanishes) when the sigmoid saturates, even if the prediction is badly wrong, so training stalls. The standard classification loss, Binary cross-entropy, gives a convex surface with sigmoid outputs, and its gradient stays large when predictions are far from the labels.
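The saturation problem can be checked numerically for a single example. The sketch below compares the derivative of each loss with respect to the sigmoid's input, here called `z` (a name assumed for illustration), using the standard derivatives: $(p - y) \, p \, (1 - p)$ for the halved squared error and $p - y$ for binary cross-entropy, where $p$ is the sigmoid output.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mse_grad_wrt_logit(z, y):
    # Derivative of 0.5 * (sigmoid(z) - y)^2 with respect to z.
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)


def bce_grad_wrt_logit(z, y):
    # Derivative of -(y * log(p) + (1 - y) * log(1 - p)) with respect to z,
    # where p = sigmoid(z). The sigmoid's slope cancels, leaving p - y.
    return sigmoid(z) - y


# A confidently wrong prediction: the label is 1 but the sigmoid output is near 0.
z, y = -10.0, 1.0
mse_gradient = mse_grad_wrt_logit(z, y)
bce_gradient = bce_grad_wrt_logit(z, y)
```

For this badly wrong example, the squared-error gradient is close to zero because of the extra $p \, (1 - p)$ factor, while the cross-entropy gradient stays close to $-1$.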

Variables in the formula

$w$: the parameter vector, what we’re trying to find.
$f (x_{i}, w)$: the model’s prediction at $x_{i}$.
$y_{i}$: the ground-truth value at $x_{i}$.
$N$: the number of training examples.
$J$: the symbol the Introduction to Data Science textbook uses for a loss function. Other sources sometimes use $L$.

The training problem

There are infinitely many possible $w$ vectors. Most give terrible fits and large losses. A small subset gives reasonable fits. The particular $w$ that minimizes $J$ gives the best fit by this measure. For Linear regression this minimum has a closed-form solution; for more complex models we use Gradient descent.
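Both routes can be compared on a one-feature linear model. This is a sketch under stated assumptions: the model is $f (x, w) = b + m x$, the data are invented, and the learning rate and step count are arbitrary values that happen to converge here. The closed form used is the textbook least-squares solution for a single feature.

```python
def fit_closed_form(xs, ys):
    # Least-squares solution for one feature:
    # slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def fit_gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    # Minimizes J = (1 / 2N) * sum((b + m * x - y)^2) by repeated gradient steps.
    n = len(xs)
    b, m = 0.0, 0.0
    for _ in range(steps):
        residuals = [b + m * x - y for x, y in zip(xs, ys)]
        grad_b = sum(residuals) / n
        grad_m = sum(r * x for r, x in zip(residuals, xs)) / n
        b -= learning_rate * grad_b
        m -= learning_rate * grad_m
    return b, m


# Toy data (assumed), roughly following y = 1 + 2x with some noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

exact_b, exact_m = fit_closed_form(xs, ys)
gd_b, gd_m = fit_gradient_descent(xs, ys)
```

On this data the two methods agree to several decimal places. The closed form gets there in one step, while gradient descent needs many iterations but carries over to models with no closed-form solution.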

Idriss Rami — Notes
