Mean squared error (MSE) is the standard loss function for regression. Given a dataset of $n$ examples $(x_i, y_i)$, where each $y_i$ is the true output, the MSE is:

$$L(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \bigl(f_\theta(x_i) - y_i\bigr)^2$$

For each example, compute the model’s prediction $f_\theta(x_i)$, subtract the true value $y_i$, square the difference, sum across examples, divide by $2n$. The result is a single number whose value depends on $\theta$.
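The recipe above can be sketched in a few lines of Python. This is a minimal illustration, assuming the 1/2 convention used in this section (so the sum is divided by twice the number of examples). The function name `mse_half`, the toy data, and the one-parameter model `theta * x` are hypothetical, chosen only to show that the loss is a single number that changes with the parameter.

```python
def mse_half(predictions, targets):
    """Return the sum of squared differences divided by 2n."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    n = len(targets)
    if n == 0:
        raise ValueError("need at least one example")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, targets)]
    return sum(squared_errors) / (2 * n)


# Hypothetical toy data generated by y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]


def loss_at(theta):
    """Loss of the hypothetical one-parameter model prediction = theta * x."""
    return mse_half([theta * x for x in xs], ys)


print(loss_at(2.0))  # perfect fit, so the loss is 0.0
print(loss_at(1.0))  # errors of -1, -2, -3 give (1 + 4 + 9) / 6
```

Changing `theta` changes the predictions and therefore the loss, which is why the loss is treated as a function of the parameters rather than of the data.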

Several details deserve unpacking:

The $n$ in the denominator makes this an average. Without it, the loss would grow roughly in proportion to the dataset size, so the same quality of fit would score very differently on 100 examples than on 10,000. The size of the dataset shouldn’t change what a small loss means.
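A small sketch of why the averaging matters, using made-up numbers: the same predictions and targets are repeated ten times, so the quality of the fit is unchanged. The summed squared error grows tenfold while the average stays put. The extra factor of 1/2 is left out here because it does not affect the comparison.

```python
def squared_errors(predictions, targets):
    """Per-example squared differences."""
    return [(p - y) ** 2 for p, y in zip(predictions, targets)]


# Hypothetical predictions and true values.
preds = [2.5, 0.0, 2.0]
trues = [3.0, -0.5, 2.0]

small_sum = sum(squared_errors(preds, trues))
small_mean = small_sum / len(trues)

# Same fit quality, ten times as much data.
big_preds = preds * 10
big_trues = trues * 10
big_sum = sum(squared_errors(big_preds, big_trues))
big_mean = big_sum / len(big_trues)

print(small_sum, big_sum)    # the sum scales with the dataset size
print(small_mean, big_mean)  # the average does not
```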

The factor of 2 in the denominator has nothing to do with averaging; it’s a notational convenience. When we differentiate the squared term, the exponent comes down as a factor of 2, and that 2 cancels the 1/2, leaving a clean expression. Some sources omit the 1/2; the resulting gradient picks up a factor of 2 in front. Either is fine, because scaling the loss by a positive constant does not move its minimum; the Introduction to Data Science textbook uses the 1/2 version.
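The cancellation can be written out explicitly. The notation here is an assumption for illustration: $\theta$ for the parameters, $\theta_j$ for one component, $f_\theta(x_i)$ for the prediction on example $i$, and $y_i$ for its true value.

```latex
\frac{\partial}{\partial \theta_j}
  \left[ \frac{1}{2n} \sum_{i=1}^{n} \bigl( f_\theta(x_i) - y_i \bigr)^2 \right]
= \frac{1}{2n} \sum_{i=1}^{n} 2 \bigl( f_\theta(x_i) - y_i \bigr)
  \frac{\partial f_\theta(x_i)}{\partial \theta_j}
= \frac{1}{n} \sum_{i=1}^{n} \bigl( f_\theta(x_i) - y_i \bigr)
  \frac{\partial f_\theta(x_i)}{\partial \theta_j}
```

Without the 1/2, the same steps leave $\frac{2}{n}$ in front instead of $\frac{1}{n}$.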

The squaring serves two purposes. First, it makes every term non-negative, so overestimates and underestimates don’t cancel each other. Second, it penalizes large errors more than proportionally: a prediction off by 4 units contributes 16 to the loss, off by 2 contributes 4, off by 1 contributes 1. Minimizing the loss therefore pushes the model to avoid big mistakes, even at the cost of accepting more small ones.
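Both effects show up in a tiny numeric sketch. The error lists are made up: one has a single error of 4, the other has four errors of 1, so their total absolute error is identical. A third list shows two errors of opposite sign.

```python
def sum_squared(errors):
    return sum(e ** 2 for e in errors)


def sum_absolute(errors):
    return sum(abs(e) for e in errors)


one_big = [4, 0, 0, 0]      # one large mistake
many_small = [1, 1, 1, 1]   # several small mistakes
cancelling = [3, -3]        # one overestimate, one underestimate

print(sum_absolute(one_big), sum_absolute(many_small))  # 4 4
print(sum_squared(one_big), sum_squared(many_small))    # 16 4
print(sum(cancelling), sum_squared(cancelling))         # 0 18
```

The raw errors in `cancelling` sum to zero even though both predictions are wrong; squaring removes that blind spot. The single large error costs four times as much as the four small ones combined.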

Why not classification

MSE is the right choice for regression but the wrong choice for classification. Combined with the sigmoid function in logistic regression, MSE produces a non-convex loss surface, so gradient descent loses its guarantee of reaching the global minimum. The gradient also becomes very small (vanishes) when the sigmoid saturates, so training stalls even when the prediction is badly wrong. The standard classification loss, binary cross-entropy, gives a convex surface with sigmoid outputs, and its gradient stays well-behaved even when predictions are far from the labels.
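The saturation effect can be checked numerically. The sketch below compares the gradient of each loss with respect to the sigmoid’s input (called `z` here, a name assumed for illustration) for a single example whose true label is 1. It uses the standard derivatives: for squared error with a 1/2 factor the gradient is (p - y) * p * (1 - p), and for binary cross-entropy it simplifies to p - y, where p = sigmoid(z).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mse_grad_wrt_logit(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)^2."""
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)


def bce_grad_wrt_logit(z, y):
    """d/dz of -(y * log(p) + (1 - y) * log(1 - p)) with p = sigmoid(z)."""
    return sigmoid(z) - y


y = 1.0  # true label
for z in (-10.0, -2.0, 0.0):
    print(z, mse_grad_wrt_logit(z, y), bce_grad_wrt_logit(z, y))
```

At `z = -10` the prediction is almost exactly 0 while the label is 1, which is about as wrong as it can be. The squared-error gradient there has magnitude below 1e-4, while the cross-entropy gradient is close to -1, so only the latter gives a strong corrective signal.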

Variables in the formula

  • $\theta$: the parameter vector, what we’re trying to find.
  • $f_\theta(x_i)$: the model’s prediction at $x_i$.
  • $y_i$: the ground-truth value at $x_i$.
  • $n$: the number of training examples.
  • $L$: the symbol the Introduction to Data Science textbook uses for a loss function. Sometimes written $J$ in other sources.

The training problem

There are infinitely many possible $\theta$ vectors. Most give terrible fits and large losses. A small subset gives reasonable fits. The particular $\theta$ that minimizes $L(\theta)$ gives the best fit by this measure. For linear regression this minimum has a closed-form solution; for more complex models we use gradient descent.
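Both routes to the minimum can be compared on a toy problem. The sketch below assumes a one-feature linear model `w * x + b` (the names `w`, `b`, the data, the learning rate, and the step count are all illustrative choices, not taken from the textbook). It computes the closed-form least-squares solution for this simple case and then runs plain gradient descent on the same loss.

```python
# Hypothetical data generated exactly by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
n = len(xs)


def loss(w, b):
    """Squared-error loss with the 1/(2n) convention."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)


# Closed form for simple linear regression: slope = cov(x, y) / var(x).
x_mean = sum(xs) / n
y_mean = sum(ys) / n
w_closed = (
    sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    / sum((x - x_mean) ** 2 for x in xs)
)
b_closed = y_mean - w_closed * x_mean

# Gradient descent on the same loss, starting from zero.
w, b = 0.0, 0.0
learning_rate = 0.05  # assumed value, small enough to converge on this data
for _ in range(5000):
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
    grad_b = sum(residuals) / n
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w_closed, b_closed)  # 2.0 1.0
print(w, b, loss(w, b))    # gradient descent lands very close to the same point
```

On this problem the two answers agree to many decimal places. The closed form is a single calculation, while gradient descent needs a learning rate and many iterations but carries over to models that have no closed-form solution.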